
Machine Learning with Python

Chapter 2: General

Data cleaning and feature selection:

- scaling the data using z-score: StandardScaler()

from sklearn.preprocessing import StandardScaler


Methods (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html):
*fit(self, X[, y])           Compute the mean and std to be used for later scaling.
*fit_transform(self, X[, y]) Fit to data, then transform it.
*transform(self, X[, copy])  Perform standardization by centering and scaling.
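
A minimal usage sketch (the X_train/X_test arrays below are toy data, not from the notes):

from sklearn.preprocessing import StandardScaler
import numpy as np

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[2.5, 250.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on training data, then scale it
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on new data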

- GridSearchCV/RandomizedSearchCV for hyperparameter tuning:

from sklearn.model_selection import GridSearchCV,RandomizedSearchCV


from sklearn.ensemble import RandomForestRegressor

param_grid = [{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
              {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)
random_search = RandomizedSearchCV(forest_reg, param_distributions=param_grid[0],
                                   n_iter=10, cv=5)
grid_search.fit(X_train, y_train)    # fit to training data
random_search.fit(X_train, y_train)  # fit to training data
y_pred = grid_search.predict(X_test) # predict with the best parameters found

- grid_search.best_estimator_ to see the best estimator


- grid_search.cv_results_ for the scores of all combinations
- grid_search.best_estimator_.feature_importances_ for relative importance of features
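
A short sketch of inspecting the results (assumes grid_search was fitted as above; numpy is used to turn negated MSE into RMSE):

import numpy as np

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)  # RMSE for each tried combination

print(grid_search.best_params_)                          # best parameter combination
print(grid_search.best_estimator_.feature_importances_)  # relative feature importances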

Chapter 3: Classification

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.metrics import roc_curve, precision_recall_curve  # each returns 3 arrays: fpr, tpr, thresholds (roc_curve) or precision, recall, thresholds
from sklearn.metrics import roc_auc_score  # returns the area under the ROC curve
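
A minimal sketch of these metrics (the y_test/y_scores values are toy data; scores would normally come from decision_function or predict_proba):

from sklearn.metrics import f1_score, roc_curve, roc_auc_score

y_test = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [s >= 0.5 for s in y_scores]  # threshold the scores to get hard predictions

print(f1_score(y_test, y_pred))        # harmonic mean of precision and recall
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
print(roc_auc_score(y_test, y_scores)) # area under the ROC curve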

- if you implement a classifier, it should inherit from BaseEstimator
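
A minimal sketch of such a classifier (the class below is illustrative, not from the notes):

from sklearn.base import BaseEstimator
import numpy as np

class NeverPositiveClassifier(BaseEstimator):
    # dummy classifier that always predicts the negative class
    def fit(self, X, y=None):
        return self  # nothing to learn
    def predict(self, X):
        return np.zeros(len(X), dtype=bool)

Inheriting from BaseEstimator provides get_params()/set_params() for free, so the class works with tools like cross_val_score.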

- Getting the predictions of a classifier:


from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(clf, X_train, y_train, cv=3)  # clean out-of-fold predictions on the training set

- Confusion matrix: from sklearn.metrics import confusion_matrix


cm = confusion_matrix(y_train, y_train_pred)  # rows: actual classes, columns: predicted classes

Chapter 4: Training models

- Lasso and Ridge regression: SGDRegressor(penalty="l1") or SGDRegressor(penalty="l2"), or from sklearn.linear_model import Lasso, Ridge; e.g. Lasso(alpha=0.1), then fit and predict (see the sketch below)
- Elastic net: sklearn.linear_model -> ElasticNet; l1_ratio=0.5 is the mixing ratio between the L1 and L2 penalties
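
A minimal sketch of the three regularized regressors (the toy data is illustrative):

from sklearn.linear_model import Lasso, Ridge, ElasticNet

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.1, 1.1, 1.9, 3.2]

lasso = Lasso(alpha=0.1).fit(X, y)                    # L1 penalty, can zero out weights
ridge = Ridge(alpha=0.1).fit(X, y)                    # L2 penalty, shrinks weights
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2
print(lasso.predict([[4.0]]), ridge.predict([[4.0]]), enet.predict([[4.0]]))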

- Softmax regression: softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
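
A short usage sketch (iris is used here as an assumed example dataset):

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
softmax_reg.fit(X, y)
print(softmax_reg.predict(X[:1]))        # predicted class
print(softmax_reg.predict_proba(X[:1]))  # per-class probabilities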

- Using Pipelines: from sklearn.pipeline import Pipeline  # usually combined with GridSearchCV (see the sketch after the examples below)
- Goal: Sequentially apply a list of transforms and a final estimator.
Ex. > anova_svm = Pipeline([('anova', anova_filter), ('svc', clf)])
anova_svm.set_params(anova__k=10, svc__C=.1).fit(X, y)
Ex. > pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
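
A runnable sketch combining a Pipeline with GridSearchCV (the step names, grid values, and dataset are illustrative; note the <step>__<param> naming):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
pipe = Pipeline([('scaler', StandardScaler()), ('logistic', LogisticRegression())])
param_grid = {'logistic__C': [0.1, 1, 10]}  # pipeline parameters use <step>__<param>
search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
print(search.best_params_)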

- One-hot encoding: DictVectorizer takes numerical and categorical data as dicts (see the sketch below)


from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)  # set sparse=True if the category has many possible values
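
A minimal usage sketch (the toy records are illustrative):

from sklearn.feature_extraction import DictVectorizer

data = [{'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
        {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'}]
vec = DictVectorizer(sparse=False, dtype=int)
X = vec.fit_transform(data)  # categorical keys become one-hot columns, numeric keys pass through
print(vec.feature_names_)    # column names after encoding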
