Notes on Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow
Author
Marco Dalla Vecchia
Chapter 1: The Machine Learning Landscape
Types of models
Supervised/Unsupervised
Supervised -> feed the algorithm with the desired solutions (labels)
Unsupervised -> data is unlabeled, the algorithm tries to learn by itself
Semisupervised -> Use data that are partially labeled
Reinforcement learning -> Use rewards and penalties to train a policy
Instance-based vs Model-based learning
Instance-based -> learn the training examples (more or less) by heart and generalize to new cases using a similarity measure
Model-based -> build a model from the training examples and use that model to make predictions
Batch vs Online learning
Batch -> must be trained on all available data at once
Online -> can be trained in small batches continuously.
Main challenges of Machine Learning
Insufficient quantity of training data
Lots of data actually allow machines to learn a lot! Even with relatively simple models
Nonrepresentative training data
Make sure the training data is representative of the cases we want to model
Tradeoff between:
Small dataset -> large noise
Large dataset -> dilute effects, include non-relevant data
Sampling Bias!
Poor-quality data
Outliers
Noise
Measurement errors
Null/wrong values
Irrelevant features
Garbage in -> Garbage out
Feature engineering:
Feature selection
Feature extraction
New features creation
Come up with a good set of features to train the model on
Overfitting the Training data
Model too specific -> doesn’t generalize well
Model more complex than the data/features available
If sample is small -> noise takes up larger proportion -> model will pick up on the noise and won’t generalize well
Solutions!
Gather more training data
Reduce the noise in the data
Simplify the model: fewer parameters
Constraining the model (regularization): the amount of regularization is controlled by a /hyper-parameter/ = a parameter of the learning algorithm (not of the model) that remains constant during training.
Underfitting the Training data
Model is too simple -> bad predictions even on the training data itself
Solutions
Select a more powerful model
Feed better features to the algorithm (/feature engineering/)
Reduce the constraints on the model (reduce /regularization/)
Testing and validating
Split train and test data (usually 80%-20%)
Use the training data to train the model and the test data to evaluate it!
Error rate on test = generalization error or out-of-sample error
If training error is low but testing error is high -> the model is overfitting
Model Selection and Hyper-parameter Tuning:
If you use ALL the training data to check several models and different parameters, you will find the conditions that work best only for this specific dataset! -> won’t generalize well!
Solution: split the training data into train+validation = hold-out validation. Validation set = dev set.
Experiment with models and parameters on the reduced training set (validating on the validation set), then train the final model on the full training set (train+validation) and finally test it against the test set!
If the validation set is too small or too large it’s a problem! A possible solution is to do multiple cross-validations using many small validation sets and average the validation error.
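A minimal sketch of both ideas with scikit-learn (the toy X, y data here are made up for illustration):

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.random.rand(100, 3)                 # toy features
y = X @ np.array([1.0, 2.0, 3.0])          # toy targets

# Hold out 20% of the data as a test set (never touch it until the very end)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Cross-validation: average the validation score over several small validation sets
model = LinearRegression()
scores = cross_val_score(model, X_train, y_train, cv=5)
print(scores.mean())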
Chapter 2: End-to-End Machine Learning project
Performance Measure
RMSE (root mean squared error) -> usually preferred for regression tasks
MAE (mean absolute error) -> usually better if there are a lot of outliers
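As a quick illustration (the y_true/y_pred values are made up), both metrics are available in scikit-learn:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
mae = mean_absolute_error(y_true, y_pred)           # more robust to outliers
print(rmse, mae)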
Hands-on project
Frame the problem:
predict median housing price in any district given all other metrics. This predicted price will be fed to another system with other signals to finally decide whether it’s worth investing in an area or not.
Approach:
Typical supervised learning: I have the “ground-truth” -> median housing price
Typical regression task: predict one value
Multiple regression: predict one value from many features
Univariate regression: predict only ONE value
Single batch: data is small enough to load all at once and train directly
Performance Measure:
RMSE (root mean square error) -> measures the “difference” between predicted and true labels, with a higher weight for large errors
Start of project
Import data and look at stats
import pandas as pd
from matplotlib import pyplot as plt

plt.rcParams['figure.dpi'] = 140

df = pd.read_csv('../datasets/housing/housing.csv')
df.describe()
We don’t have an equal number of entries for each category! We have to take this into account when sampling, otherwise we will introduce a /sampling bias/ -> stratified sampling takes this into account!
from sklearn.model_selection import StratifiedShuffleSplit

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train, test in splitter.split(df, df['income_cat']):
    train_set = df.loc[train]
    test_set = df.loc[test]

print(test_set['income_cat'].value_counts() / len(test_set))
print(df['income_cat'].value_counts() / len(df))
We have indeed very similar distributions between the sampled data and the original data! Now we remove the income_cat column to restore the split data to original variables. We will never look at the test set again from this point forward!
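A hedged sketch of that clean-up step (train_set/test_set are the splits created above):

# Drop the temporary income_cat column from both splits
for set_ in (train_set, test_set):
    set_.drop('income_cat', axis=1, inplace=True)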
As we can see from the correlation matrix, the only significant correlation is the one between the median house value and the median income, so let’s take a closer look at it!
with plt.style.context('dark_background'):
    housing.plot(kind='scatter', x='median_income', y='median_house_value', alpha=0.1)
Engineer and explore new attributes
It doesn’t make so much sense to have the number of rooms or bedrooms in absolute terms, let’s make them relative so we can use them better!
Note that the output is a sparse scipy matrix, because one-hot matrices can be very large!
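A hedged sketch of the one-hot encoding step this refers to, assuming the housing DataFrame and its categorical ocean_proximity column:

from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()  # sparse output by default
housing_cat_1hot = cat_encoder.fit_transform(housing[['ocean_proximity']])
print(type(housing_cat_1hot))      # a SciPy sparse matrix
print(housing_cat_1hot.toarray())  # densify only for small data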
Create custom transformer
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

# Define which indices are which variable since we are in numpy array format
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_tr = attr_adder.transform(housing.values)
print(housing_tr.shape)
(16512, 11)
Features Scaling
Scaling your features can be very important, especially if they have vastly different value ranges! There are two main ways to scale features:
1. Min-Max scaling: values are shifted and rescaled so that they range from 0 to 1
2. Standardization: values are subtracted by their mean and divided by the standard deviation, so that they have zero mean and unit variance
Standardization does not bound values to a specific range (which might be a problem for certain models), but it is less influenced by outliers than Min-Max scaling.
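A small sketch comparing the two scalers on a made-up array with an outlier:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([[1.0], [2.0], [3.0], [100.0]])   # note the outlier

print(MinMaxScaler().fit_transform(data))    # squashed into [0, 1], the outlier dominates
print(StandardScaler().fit_transform(data))  # zero mean, unit variance, unbounded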
Create Transformation Pipeline
Since we have to apply many different transformations to our data before proceeding, it’s convenient to build a sklearn pipeline so that all the transformations can be applied automatically.
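A hedged sketch of such a pipeline for the numerical attributes, assuming housing_num (the numerical columns only) and the CombinedAttributesAdder defined above:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # fill missing values
    ('attribs_adder', CombinedAttributesAdder()),   # add the engineered ratio features
    ('std_scaler', StandardScaler()),               # standardize everything
])
housing_num_tr = num_pipeline.fit_transform(housing_num)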
This is probably underfitting! Either because the features don’t provide enough information or because the model is not complex enough. Let’s try a more complex model.
RMSE = 0??? It’s not possible! Could it be that now we are overfitting? How can we tell without using the test set?
Let’s use cross-validation. Sklearn’s cross-validation functions expect a utility function (greater is better) rather than a cost function (lower is better), so below we score with the negative of the squared error and flip the sign back (taking the square root) to report the RMSE.
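A hedged sketch of this trick, assuming the prepared training data housing_prepared/housing_labels and a decision tree as the model:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring='neg_mean_squared_error', cv=10)
rmse_scores = np.sqrt(-scores)  # flip the sign back, then take the square root
print(rmse_scores.mean(), rmse_scores.std())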
In order to find the best hyperparameters for a given model we can try several and see which model is the best!
Here we will try two approaches: 1. test 3 different values of n_estimators and 4 values of max_features; 2. test 2 values of n_estimators and 3 values of max_features with bootstrap sampling disabled (it is enabled by default).
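A hedged sketch of that grid search (the exact hyper-parameter values are illustrative), again assuming housing_prepared/housing_labels:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    # 1. 3 values of n_estimators x 4 values of max_features = 12 combinations
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # 2. 2 x 3 = 6 combinations, this time with bootstrap sampling disabled
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
print(grid_search.best_params_)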
This classifier, which basically classifies every image as non-5, still has >90% accuracy! How is that possible?
Just because only about 10% of the digits are 5s! So even if you classify every image as non-5, you will still be correct >90% of the time!
Evaluating the Classifier with a Confusion Matrix
These numbers give a lot of information, but we can summarize it into a Precision measure.
\[ precision = \frac{TP}{TP+FP} \]
Where TP is the number of True Positives and FP is the number of False Positives. Precision is often used along with Recall.
\[ recall = \frac{TP}{TP+FN} \]
Where FN is the number of False Negatives. Let’s see Precision and Recall in action.
from sklearn.metrics import precision_score, recall_score

print('Precision score for classifier', precision_score(y_train_5, y_train_pred))
print('Recall score for classifier', recall_score(y_train_5, y_train_pred))
Precision score for classifier 0.8370879772350012
Recall score for classifier 0.6511713705958311
It’s often useful to combine Precision and Recall into a single metric called the F1 score.
from sklearn.metrics import f1_score

print(f1_score(y_train_5, y_train_pred))
0.7325171197343846
Depending on the goal of the classifier, precision or recall might be more important. If you are detecting kid-safe videos on the internet, you prefer high precision even at the cost of low recall: the classifier rejects many good videos but keeps only safe ones. On the other hand, if you want to detect shoplifters from security camera footage, you might want high recall in order to catch (almost) all shoplifters, even if the lower precision means some false alarms.
The F1 score favors classifiers that have similar precision and recall. You cannot have both arbitrarily high precision and high recall: increasing one tends to reduce the other. This is the precision/recall trade-off.
The classifier computes a score for each instance and compares it to a threshold: if the score is higher than the threshold, the instance is classified as positive, otherwise as negative. You cannot set the threshold directly (for SGDClassifier it is fixed at 0), but we can extract the decision scores and compare them manually to any threshold we choose ourselves.
How do we choose a good threshold? We can plot the values of precision and recall for many different thresholds! There is a function for this!
# Get the decision scores for all the train data
y_scores = cross_val_predict(sgd, X_train, y_train_5, cv=3, method="decision_function")
y_scores
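A hedged sketch of applying a custom threshold to these scores and of the precision/recall-vs-threshold plot mentioned above (the threshold value 3000 is arbitrary):

from matplotlib import pyplot as plt
from sklearn.metrics import precision_recall_curve

threshold = 3000                       # higher threshold -> higher precision, lower recall
y_pred_custom = (y_scores > threshold)

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
plt.plot(thresholds, precisions[:-1], label='precision')
plt.plot(thresholds, recalls[:-1], label='recall')
plt.xlabel('threshold')
plt.legend()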
The receiver operating characteristic curve plots the sensitivity (recall) versus 1-specificity of the classifier.
FPR is the false positive rate (1-TNR)
TPR is the true positive rate (i.e. recall)
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

with plt.style.context("dark_background"):
    plt.plot(fpr, tpr, c='cornflowerblue')
    plt.plot([0, 1], [0, 1], 'w--')
    plt.plot(0, 1, c='orangered', marker='o')
    plt.annotate(text="A good classifier is as\nclose to here as possible",
                 xy=(0.0, 1), xytext=(0.1, 0.7),
                 arrowprops=dict(facecolor='black'),
                 bbox=dict(fc="black", ec="w", boxstyle='round,pad=0.6'))
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate (recall)")
Since we want the ROC curve to be as close to the upper-left corner as possible, a good way to compare different classifiers is to use the area under the curve (AUC). A perfect model has an AUC of 1, whereas a purely random classifier has an AUC of 0.5.
We should prefer the use of the precision/recall plot whenever the positive class is rare or when you care more about the false positives than the false negatives. Otherwise, we should use the ROC curve.
from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)
0.9604938554008616
Let’s try to compare the stochastic gradient descent classifier we already built with a new random forest classifier using the ROC curve.
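A hedged sketch of that comparison: Random Forests have no decision_function, so we use the positive-class probability as the score (fpr/tpr for the SGD classifier come from the ROC cell above):

from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve, roc_auc_score

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method='predict_proba')
y_scores_forest = y_probas_forest[:, 1]  # probability of the positive class

fpr_forest, tpr_forest, _ = roc_curve(y_train_5, y_scores_forest)
plt.plot(fpr, tpr, label='SGD')
plt.plot(fpr_forest, tpr_forest, label='Random Forest')
plt.legend()
print(roc_auc_score(y_train_5, y_scores_forest))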
Some algorithms can handle multiple classes natively, such as SGD, Random Forest and naive Bayes classifiers. Others, like Logistic Regression and Support Vector Machine classifiers, are strictly binary classifiers. But we can still use binary classifiers to solve multiclass classification problems.
One versus the rest (OvR)
To classify into multiple classes one can create as many binary classifiers as there are classes (a 0-classifier, a 1-classifier, etc.). To decide which class a given instance belongs to, take the classifier with the highest score.
For most binary classification algorithms the OvR approach is preferred.
One versus one (OvO)
Train a binary classifier for every pair of classes (distinguish 0 and 1, distinguish 0 and 2, etc.). For N classes you need \(N \times (N-1)/2\) classifiers. You have to run an image through all the combinations and see which class wins the most duels/comparisons. The main advantage of the OvO approach is that each classifier needs to be trained only on the part of the training set containing the two classes it must distinguish.
For some algorithms that don’t scale well with more and more classes (such as Support Vector Machines), OvO approaches are preferred.
Sklearn
Sklearn automatically runs OvR or OvO depending on the algorithm.
If we want Sklearn to use a specific OvO or OvR approach we use the approach below. We can for example ‘force’ a SVC model to be a multiclass classifier.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

ovr_clf = OneVsRestClassifier(SVC(gamma="auto", random_state=42))
ovr_clf.fit(X_train[:1000], y_train[:1000])
ovr_clf.predict([some_digit])
array([7], dtype=uint8)
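For completeness, a hedged sketch of the OvO counterpart with the same SVC and training subset:

from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

ovo_clf = OneVsOneClassifier(SVC(gamma="auto", random_state=42))
ovo_clf.fit(X_train[:1000], y_train[:1000])
print(ovo_clf.predict([some_digit]))
print(len(ovo_clf.estimators_))  # N*(N-1)/2 = 45 binary classifiers for 10 digit classes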
SGDClassifier already handles multiclass classification natively, so calling it directly without specifying OvO or OvR works for us!
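The confusion-matrix cell itself is not in these notes; a hedged sketch of how such a matrix is typically obtained (X_train_scaled is assumed to be the standardized training set):

import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(sgd, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)

row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums      # compare error rates instead of absolute counts
np.fill_diagonal(norm_conf_mx, 0)      # keep only the errors
plt.matshow(norm_conf_mx, cmap='gray')
plt.show()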
The clearest thing from the matrix above, is that many classes get misclassified as ‘8’!
Some of the 3s and 5s are also misclassified as other classes, although to a lesser extent. Take a look at the image below, where the left column shows images classified as ‘3’ and the right column shows images classified as ‘5’.
def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    # This is equivalent to n_rows = ceil(len(instances) / images_per_row):
    n_rows = (len(instances) - 1) // images_per_row + 1
    # Append empty images to fill the end of the grid, if needed:
    n_empty = n_rows * images_per_row - len(instances)
    padded_instances = np.concatenate([instances, np.zeros((n_empty, size * size))], axis=0)
    # Reshape the array so it's organized as a grid containing 28×28 images:
    image_grid = padded_instances.reshape((n_rows, images_per_row, size, size))
    # Combine axes 0 and 2 (vertical image grid axis, and vertical image axis),
    # and axes 1 and 3 (horizontal axes). We first need to move the axes that we
    # want to combine next to each other, using transpose(), and only then we
    # can reshape:
    big_image = image_grid.transpose(0, 2, 1, 3).reshape(n_rows * size, images_per_row * size)
    # Now that we have a big image, we just need to show it:
    plt.imshow(big_image, cmap='binary_r', **options)
    plt.axis("off")

cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]

with plt.style.context('dark_background'):
    plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5); plt.title("Classified as '3'")
    plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5); plt.title("Classified as '5'")
    plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
    plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)
Multilabel Classification
A classification problem where an instance can have multiple classification labels. Instead of each image containing a single digit, we could have multiple digits in a single image. For example, a face-detection algorithm could detect multiple faces of different people in a single picture.
Since in our case every image has a unique label, we can phrase a multilabel classification problem in another way. Let’s say that for each digit image we want to tell whether the digit is large (7 or above) and whether it is odd. In this case each digit image gets a classification (True or False) for both of these labels.
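The y_multilabel target and the KNN classifier used in the next cell are not defined in these notes; a hedged sketch of how they are typically built:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= 7)                    # is the digit 7, 8 or 9?
y_train_odd = (y_train % 2 == 1)                  # is the digit odd?
y_multilabel = np.c_[y_train_large, y_train_odd]  # two boolean labels per image

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)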
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average='macro')  # change to average='weighted' to assume labels are not equally important
0.976410265560605
Multioutput Classification
Problem in which each label can have multiple classes (i.e. more than just 2 values).
Example: remove noise from image
One label per pixel, and each label can take any value from 0 to 255.
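A hedged sketch of this denoising setup on the MNIST data (X_test is assumed to be the usual MNIST test split):

import numpy as np
from matplotlib import pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise                     # noisy inputs
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train                             # targets are the clean images

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[0]])
plt.imshow(clean_digit.reshape(28, 28), cmap='binary_r')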
The above is a perfect example of underfitting behaviour: both the validation and training curves reach a plateau in RMSE and cannot improve further, because the model is too simple.
Let’s look below at an example of overfitting!
from sklearn.pipeline import Pipeline

polynomial_regression = Pipeline([
    ('poly_features', PolynomialFeatures(degree=10, include_bias=False)),
    ('lin_reg', LinearRegression())
])
plot_learning_curves(polynomial_regression, X, y)
plt.title('Example of overfitting')
plt.ylim(0, 3)
(0.0, 3.0)
In the plot above we can see that the overall error on the training set is much lower than in the underfitting case, and that there is a big gap between train and validation set!
Regularized Linear Models
A good way to prevent overfitting is to constrain the model. For a linear model, this mostly means reducing the number of polynomial degrees.
Ridge regression (Tikhonov regularization)
Ridge regression adds a regularization term to the cost function that penalizes large weights, so the learning algorithm keeps the model weights as small as possible during training: \[ J(\theta) = MSE(\theta) + \alpha \frac{1}{2} \sum_{i=1}^n \theta_i^2 \]
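A minimal sketch of Ridge regression in scikit-learn, reusing the X, y data from the examples below; alpha controls the regularization strength:

from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=1.0, solver='cholesky')  # a closed-form solution exists for Ridge
ridge_reg.fit(X, y)
print(ridge_reg.predict([[1.5]]))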
Least Absolute Shrinkage and Selection Operator Regression (Lasso) adds another regularization factor to the cost function. \[ J(\theta) = MSE(\theta) + \alpha \sum_{i=1}^n |\theta_i| \]
Lasso regularization tends to eliminate the weights of the least important features → it performs automatic feature selection and outputs a sparse model.
from sklearn.linear_model import Lasso

plt.figure(figsize=(10, 4))
plt.subplot(121)
plot_model(Lasso, polynomial=False, alphas=(0, 0.1, 1), random_state=42)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.subplot(122)
plot_model(Lasso, polynomial=True, alphas=(0, 10**-7, 1), random_state=42)
plt.suptitle('Higher values regularize the fit too much!')
plt.show()
/home/dallavem/.conda/envs/image-analysis/lib/python3.9/site-packages/sklearn/linear_model/_coordinate_descent.py:648: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 5.277e+01, tolerance: 5.060e-02
model = cd_fast.enet_coordinate_descent(
# Lasso regression using the Lasso class
from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
print('Using Lasso class')
print(lasso_reg.predict([[1.5]]))

# Using stochastic gradient descent
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(penalty='l1')  # another way to get Lasso-style regularization
sgd_reg.fit(X, y.ravel())
print("Using SGD class")
print(sgd_reg.predict([[1.5]]))
Using Lasso class
[4.63505617]
Using SGD class
[4.64805295]
Elastic Net
Mix of regularization between Lasso and Ridge in which you can control the mix ratio \(r\).
In general it’s important to have some kind of regularization so you should avoid using pure linear regression!
Ridge regularization is a good default, but if you suspect that only a few features are actually important, then Lasso or Elastic Net are preferable, as they tend to drive the weights of the less important features to zero.
In general Elastic Net is preferred over Lasso, because Lasso may behave erratically when several features are strongly correlated or when the number of features is much higher than the number of training instances.
# Elastic Net in sklearn
from sklearn.linear_model import ElasticNet

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)  # half-half regularization
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])
array([4.63955284])
Regularization by Early Stopping
While using an iterative training process, a smart way to prevent overfitting is to stop training when the model’s validation performance is at its best, i.e. when the validation RMSE stops decreasing, just before it starts increasing again.
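A hedged sketch of early stopping with SGDRegressor, assuming prepared training/validation splits named X_train_prep, y_train, X_val_prep, y_val:

from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True, penalty=None,
                       learning_rate='constant', eta0=0.0005)

minimum_val_error = float('inf')
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train_prep, y_train.ravel())    # warm_start=True: continues where it left off
    y_val_predict = sgd_reg.predict(X_val_prep)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:             # keep a copy of the model with the
        minimum_val_error = val_error             # lowest validation error seen so far
        best_model = deepcopy(sgd_reg)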
The score \(t\) is often called logit. It comes from the fact that the logit function
\[ logit(p) = log(\frac{p}{1-p}) \]
is actually the inverse of the logistic function. If you compute the logit of the estimated probability \(p\), you will find that the result is \(t\).
import numpy as np
from matplotlib import pyplot as plt

plt.rcParams['figure.dpi'] = 140
plt.style.use('dark_background')

# Visualization of the logistic function
x = np.linspace(start=-10, stop=10, num=100)
y = 1 / (1 + np.exp(-x))

plt.figure(figsize=(6, 3))
plt.axhline(0.5, ls=':')
plt.axvline(0., ls='--')
plt.plot(x, y)
plt.xlabel('$t$')
plt.ylabel('$\sigma(t)$')
plt.show()
Cost function Logistic Regression
Cost function of a single training instance: \[ c(\theta) = \begin{cases} -\log(\hat{p}) & \text{if } y = 1 \\ -\log(1-\hat{p}) & \text{if } y = 0 \end{cases} \]
The cost function over the whole training set is the average cost over all the training instances that can be written as the log loss expression:
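\[ J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(\hat{p}^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - \hat{p}^{(i)}\right) \right] \]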
There is no known closed-form solution to the above equation for computing the value of \(\theta\), so we have to rely on Gradient Descent (or another optimization algorithm), which is guaranteed to find the global minimum because the cost function is convex.
Decision Boundaries
Let’s find the decision boundaries to detect Iris virginica flowers.
from sklearn import datasets

iris = datasets.load_iris()
print(iris.DESCR)

X = iris['data'][:, 3:]                 # load petal width as X
y = (iris['target'] == 2).astype(int)   # 1 if Iris virginica, 0 otherwise
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. topic:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
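A hedged sketch of the decision-boundary step itself: fit the model and find the petal width where the estimated probability crosses 0.5.

import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X, y)

X_new = np.linspace(0, 3, 1000).reshape(-1, 1)      # petal widths from 0 to 3 cm
y_proba = log_reg.predict_proba(X_new)
decision_boundary = X_new[y_proba[:, 1] >= 0.5][0]  # first width classified as virginica

plt.plot(X_new, y_proba[:, 1], label='Iris virginica')
plt.plot(X_new, y_proba[:, 0], label='Not Iris virginica')
plt.axvline(decision_boundary, ls='--')
plt.xlabel('Petal width (cm)')
plt.ylabel('Probability')
plt.legend()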
Logistic regression can be extended to multiple classes directly without having to train and combine multiple binary classifiers.
The Softmax function, where \(K\) is the number of classes, \(s(x)\) is the vector containing the scores of each class for the instance \(x\), and \(\sigma(s(x))_k\) is the estimated probability that instance \(x\) belongs to class \(k\):
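\[ \sigma\left(s(x)\right)_k = \frac{\exp\left(s_k(x)\right)}{\sum_{j=1}^{K} \exp\left(s_j(x)\right)} \]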
Support Vector Machines are models that are particularly useful for small- and medium-sized datasets. They can be used for linear and non-linear classification, regression and even outlier detection.
SVMs try to fit the widest possible ‘street’ between the classes of the dataset. SVMs are very sensitive to feature scaling.
Margins
A Hard Margin classifier does not allow any margin violation: all instances must be off the street and on the correct side, with the margin defined by the support vectors (the data points closest to the ‘street’ separating the classes). → this creates a very strict model that is very sensitive to outliers.
A better tactic is to use a Soft Margin classifier, where we allow some margin violations but obtain a more robust model.
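A minimal sketch of a soft-margin linear SVM on the iris data (the standard scikit-learn recipe; C controls how soft the margin is, with smaller C meaning a softer margin):

import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]                   # petal length, petal width
y = (iris['target'] == 2).astype(np.float64)  # Iris virginica or not

svm_clf = Pipeline([
    ('scaler', StandardScaler()),             # SVMs are sensitive to feature scales
    ('linear_svc', LinearSVC(C=1, loss='hinge', random_state=42)),
])
svm_clf.fit(X, y)
print(svm_clf.predict([[5.5, 1.7]]))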
Linear SVMs are very efficient and can be very valuable, however many datasets are not linearly separable. To work around this issue, we can transform our data by introducing some non-linearity and then use an SVM to separate the classes.
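A hedged sketch of this idea: add polynomial features, then fit a linear SVM, using sklearn's make_moons as an example of a non-linearly separable dataset:

from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_svm_clf = Pipeline([
    ('poly_features', PolynomialFeatures(degree=3)),  # add polynomial feature combinations
    ('scaler', StandardScaler()),
    ('svm_clf', LinearSVC(C=10, loss='hinge', max_iter=10000, random_state=42)),
])
polynomial_svm_clf.fit(X, y)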