Sunday, April 23, 2017

What is the difference between estimators vs transformers vs predictors in sklearn?

Hi All,

While working on machine learning projects with the scikit-learn library, I would like to highlight some important and fundamental concepts that every ML ninja needs to be aware of. In this post I am highlighting a few concepts that differentiate estimators, transformers, and predictors when building machine learning solutions with sklearn.

1) Estimators: Any object that can estimate some parameters based on a dataset is called an estimator. The estimation itself is performed by calling the fit() method.
This method takes one dataset as a parameter (or two in the case of supervised learning algorithms, where the second dataset contains the labels). Any other parameter needed to guide the estimation process is called a hyperparameter and must be set as an instance variable, generally through a constructor parameter.

For example: I would like to estimate the mean, median, or most frequent value of a column in my dataset.
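Here is a minimal sketch of that idea. It assumes scikit-learn 0.20 or later, where the imputer class is called SimpleImputer and lives in sklearn.impute. The toy array X is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy dataset (made up for illustration): two columns with missing values.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# The hyperparameter (strategy) is set through the constructor.
imputer = SimpleImputer(strategy="mean")

# fit() performs the estimation: it learns one mean per column,
# ignoring the missing values.
imputer.fit(X)

# Learned parameters are stored in attributes ending with an underscore.
print(imputer.statistics_)
```

Changing strategy to "median" or "most_frequent" estimates those values instead.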

This is a cheat sheet of sklearn estimators. You can find the up-to-date version here.

2) Transformers: A transformer transforms a dataset. The transformation is performed by calling the transform() method, which returns the transformed dataset. Some estimators can also transform a dataset.

For example: The Imputer class in sklearn is both an estimator and a transformer. You can call its fit_transform() method, which estimates the parameters and transforms the dataset in one step.

Python code: 

# Note: in scikit-learn 0.20 and later this class is sklearn.impute.SimpleImputer
from sklearn.preprocessing import Imputer

imputer = Imputer(strategy="mean")   # configure the estimator; strategy is a hyperparameter

imputer.fit_transform(mydataset)   # estimate the column means and fill missing values (two steps combined)
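To make the two roles easier to see, here is a small sketch that calls fit() and transform() separately. It assumes scikit-learn 0.20 or later (SimpleImputer from sklearn.impute), and the arrays train and new are made-up toy data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration).
train = np.array([[1.0, 4.0],
                  [np.nan, 6.0],
                  [3.0, np.nan]])
new = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="mean")

# Estimator role: learn the column means from the training data.
imputer.fit(train)

# Transformer role: fill missing values using the learned means.
filled_train = imputer.transform(train)
filled_new = imputer.transform(new)   # reuses the means learned from train

# fit_transform() combines both steps on the same dataset.
combined = SimpleImputer(strategy="mean").fit_transform(train)

print(filled_train)
print(filled_new)
```

Calling fit() once and transform() many times is what lets you apply the values learned on the training data to new data later.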

3) Predictors: Some estimators can make predictions given a dataset. A predictor class has a predict() method that takes a dataset of new instances and returns a dataset of corresponding predictions. It also has a score() method that measures the quality of the predictions on a given test dataset (together with the corresponding labels for supervised learning algorithms).

For example: LinearRegression, SVMs, decision trees, etc. are predictors.
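A short sketch of a predictor in action, using LinearRegression on made-up data that follows y = 2x + 1. For regressors, score() returns the R^2 coefficient. I score on the training data here only to keep the example small; normally you would use a separate test set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data (made up for illustration): y = 2x + 1.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])

# A supervised estimator: fit() takes the features and the labels.
model = LinearRegression()
model.fit(X_train, y_train)

# Predictor role: predict() returns one prediction per new instance.
X_new = np.array([[4.0]])
predictions = model.predict(X_new)

# score() measures prediction quality (R^2 for regressors).
r2 = model.score(X_train, y_train)

print(predictions, r2)
```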

You can combine estimators, transformers, and predictors as building blocks of a pipeline in sklearn. A pipeline chains a sequence of transformers followed by a final estimator or predictor, so the whole chain can be fitted and used as a single object. This concept is called composition.
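As a rough sketch of composition, the pipeline below chains an imputer (transformer) with a linear regression (predictor). It assumes scikit-learn 0.20 or later for SimpleImputer. The step names "imputer" and "regressor" and the toy data are my own choices for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy data (made up for illustration) with one missing feature value.
X_train = np.array([[0.0], [1.0], [np.nan], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])

# Every step except the last must be a transformer;
# the last step can be any estimator, here a predictor.
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("regressor", LinearRegression()),
])

# fit() runs fit_transform() on each transformer in order,
# then fit() on the final estimator.
pipeline.fit(X_train, y_train)

# predict() runs transform() on each transformer, then predict() on the last step.
X_new = np.array([[np.nan], [2.0]])
predictions = pipeline.predict(X_new)

print(predictions)
```

The pipeline exposes the same methods as its final step, so it can be used anywhere a single predictor would be.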

Hope this helps
