- Scaling the data using z-score: StandardScaler()
from sklearn.preprocessing import StandardScaler
Methods: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
* fit(self, X[, y]): Compute the mean and std to be used for later scaling.
* fit_transform(self, X[, y]): Fit to data, then transform it.
* transform(self, X[, copy]): Perform standardization by centering and scaling.
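A minimal sketch of the fit/transform split, assuming a tiny made-up array (the names X_train and X_test and their values are illustrative only). The point it tries to show is that the statistics are learned from the training data and then reused on new data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: 4 samples, 2 features.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test = np.array([[2.5, 25.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std, then scale
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(scaler.mean_)    # per-feature means learned from X_train
print(X_test_scaled)   # test row expressed in training z-scores
```

Calling fit_transform on the test set instead would leak test statistics into the scaling, which is why transform is used there.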
- GridSearchCV/RandomizedSearchCV for hyperparameter tuning:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor()
# scoring takes one metric here: e.g. 'neg_mean_squared_error' for a regressor, 'recall' for a classifier
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error', return_train_score=True)
# param_dist maps parameter names to lists or distributions to sample from (define it before this line)
random_search = RandomizedSearchCV(forest_reg, param_distributions=param_dist, n_iter=20, cv=5)  # the iid argument was removed in scikit-learn 0.24
grid_search.fit(X_train, y_train)  # fit to training data
random_search.fit(X_train, y_train)  # fit to training data
y_pred = grid_search.predict(X_test)  # predict with the best parameters found
- grid_search.best_estimator_ to see the best estimator
- grid_search.cv_results_ for the scores of all combinations
- grid_search.best_estimator_.feature_importances_ for relative importance of features
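A hedged end-to-end sketch of both search strategies. The synthetic dataset from make_regression, the small grids, and the randint ranges are assumptions chosen only to keep the run short; they are not recommended settings:

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

# Synthetic regression data (illustrative only).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grid search tries every combination: 2 x 2 = 4 candidates.
param_grid = {'n_estimators': [3, 10], 'max_features': [2, 4]}
grid_search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                           cv=3, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Randomized search samples n_iter candidates from the distributions.
param_dist = {'n_estimators': randint(3, 30), 'max_features': randint(2, 9)}
random_search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                                   param_distributions=param_dist, n_iter=5,
                                   cv=3, scoring='neg_mean_squared_error',
                                   random_state=0)
random_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(random_search.best_params_)

# The refitted best model is available for prediction and inspection.
y_pred = grid_search.best_estimator_.predict(X_test)
importances = grid_search.best_estimator_.feature_importances_
```

Randomized search is usually the cheaper option when the grid would be large, since n_iter caps the number of fits per fold.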
Chapter 3: Classification
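from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.metrics import roc_curve  # returns 3 objects: fpr, tpr, thresholds
from sklearn.metrics import precision_recall_curve  # returns 3 objects: precision, recall, thresholds
from sklearn.metrics import roc_auc_score  # returns area under the ROC curve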
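A small sketch of how these metrics are called, assuming hand-written labels and scores (y_true, y_pred, and y_score are made-up values, not results from any model). The label-based metrics take hard predictions, while the curve functions take scores or probabilities:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_recall_curve,
                             precision_score, recall_score, roc_auc_score,
                             roc_curve)

# Made-up ground truth, hard predictions, and decision scores.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2]

# Metrics computed from hard predictions.
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Metrics computed from scores: each curve function returns three arrays.
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
precisions, recalls, pr_thresholds = precision_recall_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print(acc, prec, rec, f1, auc)
```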
- If you implement your own classifier, it should inherit from BaseEstimator (from sklearn.base import BaseEstimator)
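A minimal sketch of a custom classifier. MajorityClassifier is a hypothetical baseline invented for this example; it also inherits from ClassifierMixin, which supplies an accuracy-based score method, while BaseEstimator supplies get_params and set_params:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class MajorityClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical baseline: always predicts the most frequent training label."""

    def fit(self, X, y):
        classes, counts = np.unique(y, return_counts=True)
        self.classes_ = classes
        self.majority_ = classes[np.argmax(counts)]
        return self  # returning self allows chaining, as scikit-learn estimators do

    def predict(self, X):
        return np.full(len(X), self.majority_)


# Made-up data: three samples of class 1, one of class 0.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1, 1, 1, 0])

clf = MajorityClassifier().fit(X, y)
predictions = clf.predict(X)
accuracy = clf.score(X, y)  # provided by ClassifierMixin
print(predictions, accuracy)
```

A baseline like this is a useful sanity check: a real classifier should score clearly higher than it.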
- Getting the predictions of a classifier:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(clf, X_train, y_train, cv=3)  # out-of-fold predictions for every training sample
- Confusion matrix: from sklearn.metrics import confusion_matrix
cm = confusion_matrix(true_labels, y_train_pred)  # rows = actual classes, columns = predicted classes
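The two steps above can be combined as in the following sketch. The synthetic dataset and the choice of LogisticRegression are assumptions made only so the example runs on its own:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic binary classification data (illustrative only).
X_train, y_train = make_classification(n_samples=300, n_features=10, random_state=0)

clf = LogisticRegression(max_iter=1000)

# Each prediction comes from a model that never saw that sample during training.
y_train_pred = cross_val_predict(clf, X_train, y_train, cv=3)

cm = confusion_matrix(y_train, y_train_pred)
tn, fp, fn, tp = cm.ravel()  # binary case: rows are actual, columns are predicted
print(cm)
```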
Chapter 4: Training models
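- Lasso and Ridge regression: SGDRegressor(penalty="l1") or SGDRegressor(penalty="l2"), or use the dedicated classes:
from sklearn.linear_model import Lasso, Ridge
lasso_reg = Lasso(alpha=0.1)  # then call fit and predict
- Elastic net: from sklearn.linear_model import ElasticNet; l1_ratio=0.5 is the mixing ratio between the L1 and L2 penalties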
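A short sketch comparing the regularized models on the same data. The data are randomly generated for illustration (the third feature is deliberately irrelevant), and the alpha values are arbitrary assumptions rather than tuned choices:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge, SGDRegressor

# Made-up data: y depends on the first two features only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

models = {
    'lasso': Lasso(alpha=0.1),                           # L1 penalty
    'ridge': Ridge(alpha=1.0),                           # L2 penalty
    'elastic_net': ElasticNet(alpha=0.1, l1_ratio=0.5),  # mix of L1 and L2
    'sgd_l1': SGDRegressor(penalty='l1', alpha=0.1, random_state=0),
}

for name, model in models.items():
    model.fit(X, y)
    print(name, model.coef_.round(2))

y_new = models['lasso'].predict(X[:5])  # predict works the same for all of them
```

Comparing the printed coefficients is a quick way to see how each penalty shrinks the weights.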
- Using Pipelines: from sklearn.pipeline import Pipeline  # usually combined with GridSearchCV
- Goal: Sequentially apply a list of transforms and a final estimator.
Ex. > anova_svm = Pipeline([('anova', anova_filter), ('svc', clf)])
      anova_svm.set_params(anova__k=10, svc__C=.1).fit(X, y)  # parameters are addressed as <step name>__<parameter>
Ex. > pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
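A hedged sketch of a pipeline tuned with GridSearchCV, built around the PCA plus logistic regression example. The synthetic dataset and the grid values are assumptions for illustration; the step__parameter naming is what lets the search reach inside each step:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data (illustrative only).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline(steps=[
    ('pca', PCA()),
    ('logistic', LogisticRegression(max_iter=1000)),
])

# Keys follow the pattern <step name>__<parameter name>.
param_grid = {
    'pca__n_components': [2, 5],
    'logistic__C': [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)  # PCA is refit inside every fold, so no information leaks across folds

print(search.best_params_)
y_pred = search.predict(X)  # uses the refitted best pipeline
```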
- One hot encoding: DictVectorizer takes numerical and categorical data as a list of dicts
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)  # set sparse=True if the category has many possible values
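A minimal sketch of DictVectorizer on made-up records (the city and rooms keys and their values are hypothetical). String values are expanded into one column per category, while numeric values pass through unchanged:

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up records mixing a categorical and a numerical feature.
data = [
    {'city': 'Paris', 'rooms': 2},
    {'city': 'Rome', 'rooms': 3},
    {'city': 'Paris', 'rooms': 1},
]

vec = DictVectorizer(sparse=False, dtype=int)
X = vec.fit_transform(data)

print(vec.feature_names_)  # column names, one per category plus the numeric feature
print(X)
```

With sparse=True the same call returns a SciPy sparse matrix, which saves memory when most one-hot entries are zero.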