Notes on Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow

Author

Marco Dalla Vecchia

Chapter 1: The Machine Learning Landscape

Types of models

Supervised/Unsupervised

  • Supervised -> feed the algorithm with the desired solutions (labels)
  • Unsupervised -> data is unlabeled, the algorithm tries to learn by itself
  • Semisupervised -> Use data that are partially labeled
  • Reinforcement learning -> Use rewards and penalties to train a policy

Instance-based vs Model-based learning

  • Instance-based -> learn the training examples by heart and generalize to new cases by comparing them to the stored examples (using a similarity measure)
  • Model-based -> build a model from the training data and use it to make predictions on new cases

Batch vs Online learning

  • Batch -> must be trained on all available data at once (offline)
  • Online -> can be trained incrementally, feeding it data in small batches (see the sketch below).
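
A minimal illustrative sketch of online learning (not from the book; the synthetic data and batch size are made up): Scikit-Learn estimators that support it expose partial_fit, which updates the model one mini-batch at a time.

# Sketch of online learning: the model is updated incrementally, one mini-batch at a time.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(42)
clf = SGDClassifier(random_state=42)
for _ in range(10):  # pretend each iteration receives a new batch of streaming data
    X_batch = rng.randn(32, 5)
    y_batch = (X_batch.sum(axis=1) > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=[0, 1])  # incremental update, no full retraining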

Main challenges of Machine Learning

Insufficient quantity of training data

Lots of data actually allows machines to learn a lot, even with relatively simple models!

Nonrepresentative training data

  • Make sure the training data is representative of the cases you want to generalize to
  • Tradeoff between:
    • Small dataset -> large noise
    • Large dataset -> dilute effects, include non-relevant data
  • Sampling Bias!

Poor-quality data

  • Outliers
  • Noise
  • Measurement errors
  • Null/wrong values

Irrelevant features

  • Garbage in -> Garbage out
  • Feature engineering:
    • Feature selection
    • Feature extraction
    • New features creation
  • Come up with a good set of features to train the model on (see the sketch below)
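
A rough, hedged sketch of the three flavours of feature engineering (made-up data; SelectKBest, f_regression and PCA are just example tools):

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(100, 4), columns=['rooms', 'households', 'income', 'noise'])
y = 3 * df['income'] + rng.rand(100)

# New feature creation: combine raw columns into a more informative ratio
df['rooms_per_household'] = df['rooms'] / df['households']

# Feature selection: keep the k features most related to the target
X_selected = SelectKBest(f_regression, k=2).fit_transform(df, y)

# Feature extraction: compress the features into fewer components
X_extracted = PCA(n_components=2).fit_transform(df)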

Overfitting the Training data

  • Model too specific -> doesn’t generalize well
  • Model more complex than the data/features available
  • If sample is small -> noise takes up larger proportion -> model will pick up on the noise and won’t generalize well
  • Solutions!
    • Gather more training data
    • Reduce the noise in the data
    • Simplify the model: less parameters
    • Constraining the model (regularization): set a /hyper-parameter/ = a parameter of the learning algorithm (not of the model) that is chosen before training and remains constant during it (see the sketch below).
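
As a small sketch of what a hyper-parameter looks like in practice (Ridge and its alpha value are just an example borrowed from a later chapter, with made-up data):

# alpha is a hyper-parameter of the learning algorithm: chosen before training,
# constant during it; the intercept and coefficients are the model's parameters.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(42)
X = rng.rand(100, 1)
y = 4 + 3 * X[:, 0] + rng.randn(100)

model = Ridge(alpha=10.0)  # regularization strength, set by us
model.fit(X, y)            # training learns model.intercept_ and model.coef_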

Underfitting the Training data

  • Model is too simple -> bad predictions even on the training data itself
  • Solutions
    • Select a more powerful model
    • Feed better features to the algorithm (/feature engineering/)
    • Reduce the constraints on the model (reduce /regularization/)

Testing and validating

  • Split train and test data (usually 80%-20%)
  • Training data to train model and test to validate it!
  • Error rate on test = generalization error or out-of-sample error
  • If training error is low but testing error is high -> model is overfitting

Model Selection and Hyper-parameter Tuning:

  • If you use ALL the training data to check several models and different parameters, you will find the conditions that work best only for this specific dataset! -> won’t generalize well!
  • Solution: split the training data into train+validation = hold-out validation. Validation set = dev set.
    • Experiment with models and hyper-parameters using the reduced training set and the validation set, then train the best model on the full training set (train+validation) and finally evaluate it on the test set (see the sketch after this list)!
    • If the validation set is too small or too large it’s a problem! A possible solution is to do multiple rounds of cross-validation using many small validation sets and average the validation errors.
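
A minimal sketch of hold-out validation (made-up data; the 20% split sizes are arbitrary): carve a validation set out of the training data, compare candidate models on it, then retrain the chosen model on the full training set and evaluate it once on the test set.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(500, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.randn(500)

# Hold out a test set, then carve a validation (dev) set out of the training data
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.2, random_state=42)

candidate = LinearRegression().fit(X_train, y_train)   # experiment here, score on the validation set
print('validation R^2:', candidate.score(X_val, y_val))

final_model = LinearRegression().fit(X_train_full, y_train_full)  # retrain on train+validation
print('test R^2:', final_model.score(X_test, y_test))             # touch the test set only once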

Chapter 2: End-to-End Machine Learning project

Performance Measure

  • RMSE (root mean squared error) -> usually preferred for regression tasks
  • MAE (mean absolute error) -> usually better if there are a lot of outliers (formulas below)
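
For reference, with \(m\) instances, predictions \(\hat{y}_i\) and labels \(y_i\), the two measures are:

\[ RMSE = \sqrt{\frac{1}{m} \sum_{i=1}^m (\hat{y}_i - y_i)^2} \qquad MAE = \frac{1}{m} \sum_{i=1}^m |\hat{y}_i - y_i| \]

The squaring inside the RMSE is what gives large errors a higher weight than in the MAE.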

Hands-on project

Frame the problem:

predict median housing price in any district given all other metrics. This predicted price will be fed to another system with other signals to finally decide whether it’s worth investing in an area or not.

Approach:

  • Typical supervised learning: I have the “ground-truth” -> median housing price
  • Typical regression task: predict one value
  • Multiple regression: predict one value from many features
  • Univariate regression: predict only ONE value
  • Single batch: data is small enough to load all at once and train directly

Performance Measure:

RMSE (root mean square error) -> measures the “difference” between predicted and true labels, with a higher weight for large errors.

Start of project

Import data and look at stats

import pandas as pd
from matplotlib import pyplot as plt
plt.rcParams['figure.dpi'] = 140
df = pd.read_csv('../datasets/housing/housing.csv')
df.describe()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

Create test set

Purely random split

Doesn’t take into account proper sampling of features -> There is a better way!

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

Stratified Sampling

We need a sufficient number of entries in each stratum!

from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np
df['income_cat'] = pd.cut(df['median_income'],
                          bins=[0.0, 1.5, 3.0, 4.5, 6, np.inf],
                          labels=[1,2,3,4,5])
with plt.style.context('dark_background'):
    plt.hist(df.income_cat)
    plt.xlabel('Income categories')
    plt.ylabel('Counts')                          

We don’t have an equal number of entries in each category! We have to take this into account when sampling, otherwise we will introduce a /sampling bias/ -> stratified sampling takes care of this!

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train, test in splitter.split(df, df['income_cat']):
    train_set = df.loc[train]
    test_set = df.loc[test]

print(test_set['income_cat'].value_counts() / len(test_set))
print(df['income_cat'].value_counts() / len(df))
3    0.350533
2    0.318798
4    0.176357
5    0.114341
1    0.039971
Name: income_cat, dtype: float64
3    0.350581
2    0.318847
4    0.176308
5    0.114438
1    0.039826
Name: income_cat, dtype: float64

We have indeed very similar distributions between the sampled data and the original data! Now we remove the income_cat column to restore the split data to original variables. We will never look at the test set again from this point forward!

test_set.drop('income_cat', axis=1, inplace=True)
train_set.drop('income_cat', axis=1, inplace=True)
housing = train_set.copy()
print(housing.head().T)
                      12655       15502    2908        14053      20496
longitude           -121.46     -117.23  -119.04     -117.13     -118.7
latitude              38.52       33.09    35.37       32.75      34.28
housing_median_age     29.0         7.0     44.0        24.0       27.0
total_rooms          3873.0      5320.0   1618.0      1877.0     3536.0
total_bedrooms        797.0       855.0    310.0       519.0      646.0
population           2237.0      2015.0    667.0       898.0     1837.0
households            706.0       768.0    300.0       483.0      580.0
median_income        2.1736      6.3373    2.875      2.2264     4.4964
median_house_value  72100.0    279600.0  82700.0    112500.0   238300.0
ocean_proximity      INLAND  NEAR OCEAN   INLAND  NEAR OCEAN  <1H OCEAN

Let’s take a look at the train data!

with plt.style.context('dark_background'):
    f, ax = plt.subplots(1, figsize=(7,5))
    housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.4,
                 s=housing['population']/100, label ='population',
                 c='median_house_value', cmap=plt.get_cmap('jet'), ax=ax)

Investigating Correlation

print(housing.corr()['median_house_value'].sort_values(ascending=False))
median_house_value    1.000000
median_income         0.687151
total_rooms           0.135140
housing_median_age    0.114146
households            0.064590
total_bedrooms        0.047781
population           -0.026882
longitude            -0.047466
latitude             -0.142673
Name: median_house_value, dtype: float64

Median income is highly correlated but other variables not so much! Visualize correlation in a nicer way using scatter_matrix.

from pandas.plotting import scatter_matrix
attributes = ['median_house_value', 'median_income',
              'total_rooms', 'housing_median_age']
fname = 'ch2_proj/correlation_matrix.pdf'
with plt.style.context('dark_background'):
    scatter_matrix(housing[attributes], figsize=(7,7))
    plt.tight_layout()

As we can see from the matrix, the only significant correlation is the one between median house value and median income; let’s take a closer look at it!

with plt.style.context('dark_background'):
    housing.plot(kind='scatter', x='median_income', y='median_house_value', alpha=0.1)

Engineer and explore new attributes

It doesn’t make so much sense to have the number of rooms or bedrooms in absolute terms, let’s make them relative so we can use them better!

housing['rooms_per_household'] = housing['total_rooms'] / housing['households']
housing['bedrooms_per_room'] = housing['total_bedrooms'] / housing['total_rooms']
housing['population_per_household'] = housing['population'] / housing['households']
print(housing.corr()['median_house_value'].sort_values(ascending=False))
median_house_value          1.000000
median_income               0.687151
rooms_per_household         0.146255
total_rooms                 0.135140
housing_median_age          0.114146
households                  0.064590
total_bedrooms              0.047781
population_per_household   -0.021991
population                 -0.026882
longitude                  -0.047466
latitude                   -0.142673
bedrooms_per_room          -0.259952
Name: median_house_value, dtype: float64

Bedrooms per room is negatively correlated with median house value!

Separate features from labels (ground truth)!

housing = train_set.drop('median_house_value', axis=1)
housing_labels = train_set['median_house_value'].copy()
print(housing.shape)
(16512, 9)

Clean the data!

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
housing_num = housing.drop('ocean_proximity', axis=1)
housing_cat = housing[['ocean_proximity']]
imputer.fit(housing_num)

# imputer statistics are the medians of all numerical features
print(imputer.statistics_)
print(housing_num.median().values)

X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns = housing_num.columns,
                           index=housing_num.index)
print(housing_tr.shape)
[-118.51      34.26      29.      2119.       433.      1164.
  408.         3.54155]
[-118.51      34.26      29.      2119.       433.      1164.
  408.         3.54155]
(16512, 8)

Encode the ocean_proximity categories from strings to numbers!

from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
encoder.fit_transform(housing_cat)
print(encoder.categories_)
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
      dtype=object)]

This approach is not good because it assumes all categories are equally different!

In our case <1H OCEAN and NEAR OCEAN are more similar than NEAR OCEAN and INLAND

A better solution is to use a one-hot encoder!

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
housing_cat_onehot = encoder.fit_transform(housing_cat)
print(type(housing_cat_onehot))
print(encoder.categories_)
print(housing_cat_onehot.shape)
<class 'scipy.sparse.csr.csr_matrix'>
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
      dtype=object)]
(16512, 5)

Note that the output is a sparse scipy matrix, because one-hot matrices can be very large!

Create custom transformer

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

# Define which indices correspond to which variable since we are in numpy array format
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_tr = attr_adder.transform(housing.values)
print(housing_tr.shape)
(16512, 11)

Feature Scaling

Scaling your features can be very important, especially if they have vastly different value ranges! There are two main ways to scale features: 1. Min-Max scaling: values are shifted and rescaled so that they end up ranging from 0 to 1. 2. Standardization: values are centered by subtracting the mean and then divided by the standard deviation, so the result has zero mean and unit variance.

Standardization does not bound values to a specific range (which might be a problem for some models), but it is much less influenced by outliers than Min-Max scaling (see the sketch below).
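
As a quick sketch of the two approaches (toy numbers, not the housing data): Scikit-Learn provides MinMaxScaler and StandardScaler for them.

from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np

values = np.array([[1.0], [2.0], [3.0], [100.0]])  # 100 is an outlier

print(MinMaxScaler().fit_transform(values).ravel())    # squashed into [0, 1]; the outlier dominates the range
print(StandardScaler().fit_transform(values).ravel())  # zero mean, unit variance; no fixed range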

Create Transformation Pipeline

Since we have to apply many different transformations to our data before proceeding, it’s convenient to create a sklearn pipeline to apply all of them automatically.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']

full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)

print(housing_prepared.shape)
(16512, 16)

Training and Evaluating a regression model on Training Set

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
print('Model Trained!')
Model Trained!

Let’s test the model on some data!

some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
for lbl, pred in zip(some_labels, lin_reg.predict(some_data_prepared)):
    print('Label: ',lbl, 'Prediction: ',pred)
Label:  72100.0 Prediction:  85657.9019201438
Label:  279600.0 Prediction:  305492.60737487697
Label:  82700.0 Prediction:  152056.461224557
Label:  112500.0 Prediction:  186095.70946094394
Label:  238300.0 Prediction:  244550.67966088967

Not great! The predictions are clearly not random, but we are pretty far off! Let’s evaluate the RMSE on the whole training set.

from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
print(lin_rmse)
68627.87390018745

This is probably underfitting! Either because the features don’t provide enough information or because the model is not complex enough. Let’s try a more complex model.

from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
print('Model trained')
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)
Model trained
0.0

RMSE = 0??? It’s not possible! Could it be that now we are overfitting? How can we tell without using the test set?

Let’s use cross-validation. Sklearn’s cross-validation functions expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring used is the negative MSE; below we negate it back and take the square root to get the RMSE.

from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
print(tree_rmse_scores.mean(), tree_rmse_scores.std())
71517.87747920479 2558.3382211052253
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
print(lin_rmse_scores.mean(), lin_rmse_scores.std())
69104.07998247063 2880.3282098180616

Linear regression is simpler and it’s actually performing better! The Decision Tree is overfitting so badly that it performs worse.

Let’s try a Random Forest, which is basically an ensemble of Decision Trees trained on random subsets of the data -> Ensemble learning.

from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)
print("model trained")
forest_predictions = forest_reg.predict(housing_prepared)
forest_rmse = np.sqrt(mean_squared_error(housing_labels, forest_predictions))
print('Score on training set',forest_rmse)
model trained
Score on training set 18797.456977805236
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels, scoring='neg_mean_squared_error', cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
print('Score on cross-validation', forest_rmse_scores.mean(), forest_rmse_scores.std())
Score on cross-validation 50359.093563515395 2200.5649837657197

Just for completeness let’s try a Support Vector Machine model.

from sklearn.svm import SVR

svm_reg = SVR(kernel="linear")
svm_reg.fit(housing_prepared, housing_labels)
housing_predictions = svm_reg.predict(housing_prepared)
svm_mse = mean_squared_error(housing_labels, housing_predictions)
print(np.sqrt(svm_mse))
111095.06635291966

It is good practice to save every model you experiment with, including its hyperparameters; let’s do that below.

import os
import joblib
path = "ch2_proj/models/"
if not os.path.exists(path):
    os.mkdir(path)
joblib.dump(lin_reg, path+'linear_regression.pkl')
joblib.dump(tree_reg, path+'decision_tree_regression.pkl')
joblib.dump(forest_reg, path+'random_forest_regression.pkl')
joblib.dump(svm_reg, path+'support_vector_machine_regression.pkl')

In order to find the best hyperparameters for a given model we can try several and see which model is the best!

Here we will try 2 approaches: 1. Test 3 different numbers of estimators and 4 values of max_features. 2. Test 2 numbers of estimators and 3 values of max_features with bootstrap disabled (it is enabled by default).

from sklearn.model_selection import GridSearchCV
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
print(grid_search.best_estimator_)
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
    print(np.sqrt(-mean_score), params)
joblib.dump(grid_search.best_estimator_, path+'random_forest_regression_finetuned.pkl')
# Export fine-tuned model
pd.DataFrame(cvres).to_csv(path+'random_forest_finetune_results.csv', index=False)
RandomForestRegressor(max_features=6, n_estimators=30)
64663.5205960859 {'max_features': 2, 'n_estimators': 3}
55426.18899173725 {'max_features': 2, 'n_estimators': 10}
52740.403895568226 {'max_features': 2, 'n_estimators': 30}
61147.22965442266 {'max_features': 4, 'n_estimators': 3}
52658.98063521634 {'max_features': 4, 'n_estimators': 10}
50472.51405928446 {'max_features': 4, 'n_estimators': 30}
59742.88583222966 {'max_features': 6, 'n_estimators': 3}
52094.51249054094 {'max_features': 6, 'n_estimators': 10}
50079.228571118205 {'max_features': 6, 'n_estimators': 30}
58447.692715741716 {'max_features': 8, 'n_estimators': 3}
52094.13471216696 {'max_features': 8, 'n_estimators': 10}
50113.39955761553 {'max_features': 8, 'n_estimators': 30}
62707.950441575995 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54052.6440138164 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59718.84278480412 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52398.47169649248 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
59438.018347120815 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51159.23984387675 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

The grid search method is great for relatively few parameter combinations, but if you need to scan a large search space the randomized approach below is better.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
joblib.dump(rnd_search.best_estimator_, path+'random_forest_finetune_results_randomsearch.pkl')
feature_importances = grid_search.best_estimator_.feature_importances_
print('------')
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
#cat_encoder = cat_pipeline.named_steps["cat_encoder"] # old solution
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
for feature, attr in sorted(zip(feature_importances, attributes), reverse=True):
    print('{:.3f} {}'.format(feature, attr))
49117.55344336652 {'max_features': 7, 'n_estimators': 180}
51450.63202856348 {'max_features': 5, 'n_estimators': 15}
50692.53588182537 {'max_features': 3, 'n_estimators': 72}
50783.614493515 {'max_features': 5, 'n_estimators': 21}
49162.89877456354 {'max_features': 7, 'n_estimators': 122}
50655.798471042704 {'max_features': 3, 'n_estimators': 75}
50513.856319990606 {'max_features': 3, 'n_estimators': 88}
49521.17201976928 {'max_features': 5, 'n_estimators': 100}
50302.90440763418 {'max_features': 3, 'n_estimators': 150}
65167.02018649492 {'max_features': 5, 'n_estimators': 2}
------
0.304 median_income
0.145 INLAND
0.110 pop_per_hhold
0.084 bedrooms_per_room
0.077 longitude
0.075 rooms_per_hhold
0.073 latitude
0.042 housing_median_age
0.019 population
0.018 total_rooms
0.017 total_bedrooms
0.016 households
0.013 <1H OCEAN
0.004 NEAR BAY
0.004 NEAR OCEAN
0.000 ISLAND

Evaluate the model on the Test Set

final_model = grid_search.best_estimator_

X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print(final_rmse)
47615.00383283665

Compute a 95% confidence interval for the final score

from scipy import stats

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
ci = np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))
print(ci)
[45618.00700694 49531.55172934]

Chapter 3: Classification

We will be using the MNIST dataset -> 70,000 small images of handwritten digits to classify

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)

MNIST data is already separated in data and its labels (target), so let’s use that division!

X, y = mnist['data'], mnist['target']
print(X.shape, y.shape)
(70000, 784) (70000,)

Visualize the data

Let’s visualize a few random digits from the dataset

from matplotlib import pyplot as plt
import numpy as np
plt.rcParams['figure.dpi'] = 140

random_ids = np.random.randint(0,70000, 20)

with plt.style.context('dark_background'):
    f, axes = plt.subplots(4,5, figsize=(5,5))
    for ax_id, idx in enumerate(random_ids):
        some_digit = X.iloc[idx].to_numpy()
        some_digit_image = some_digit.reshape(28,28)
        ax = axes.ravel()[ax_id]
        ax.imshow(some_digit_image, cmap='binary')
        ax.set_title(y[idx])
        ax.axis('off')
plt.tight_layout()

Random sample of MNIST digits

Training a Binary Classifier

Now let’s separate the dataset in train and test sets. Since the data is already shuffled for us we can just divide it with hard numbers.

y = y.astype(np.uint8)
X_train, X_test = X.iloc[:60000].to_numpy(), X.iloc[60000:].to_numpy()
y_train, y_test = y[:60000].to_numpy(), y[60000:].to_numpy()

To start with let’s focus on creating a simple binary classifier. Let’s say a classifier that can identify if a digit image is a 5 or not.

y_train_5 = (y_train==5)
y_test_5 = (y_test==5)

from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier(random_state=42)
sgd.fit(X_train, y_train_5)
print(sgd.predict([some_digit]))
[False]

Evaluating Classifier Performance

Measuring Accuracy using Cross-Validation

from sklearn.model_selection import cross_val_score
print(cross_val_score(sgd, X_train, y_train_5, cv=3, scoring='accuracy'))
[0.95035 0.96035 0.9604 ]

Wow! Over 95% accuracy in all 3 folds! This seems like a very good classifier... However...

from sklearn.base import BaseEstimator
class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        pass
    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)

never_5_clf = Never5Classifier()
print(cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring='accuracy'))
[0.91125 0.90855 0.90915]

This classifier, which basically classifies every image as not-5, still has >90% accuracy! How is it possible?

Just because only about 10% of the digits are 5s! So even if you classify every image as not-5, you will still be correct more than 90% of the time!

Evaluating the Classifier with a Confusion Matrix

from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd, X_train, y_train_5, cv=3)

from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(y_train_5, y_train_pred)
print('True negatives: {}'.format(conf_matrix[0,0]))
print('False positives: {}'.format(conf_matrix[0,1]))
print('False negatives: {}'.format(conf_matrix[1,0]))
print('True positives: {}'.format(conf_matrix[1,1]))
True negatives: 53892
False positives: 687
False negatives: 1891
True positives: 3530

This is how it would look in the ideal scenario where the classifier is 100% accurate.

y_train_perfect = y_train_5
conf_matrix = confusion_matrix(y_train_5, y_train_perfect)
print('True negatives: {}'.format(conf_matrix[0,0]))
print('False positives: {}'.format(conf_matrix[0,1]))
print('False negatives: {}'.format(conf_matrix[1,0]))
print('True positives: {}'.format(conf_matrix[1,1]))
True negatives: 54579
False positives: 0
False negatives: 0
True positives: 5421

Precision and Recall

These numbers give a lot of information, but we can summarize it into a Precision measure.

\[ precision = \frac{TP}{TP+FP} \]

Where TP is the number of True Positives and FP is the number of False Positives. Precision is often used along with Recall.

\[ recall = \frac{TP}{TP+FN} \]

Where FN is the number of False Negatives. Let’s see Precision and Recall in action.

from sklearn.metrics import precision_score, recall_score
print('Precision score for classifier',precision_score(y_train_5, y_train_pred))
print('Recall score for classifier',recall_score(y_train_5, y_train_pred))
Precision score for classifier 0.8370879772350012
Recall score for classifier 0.6511713705958311

It’s often useful to combine Precision and Recall into a single metric called the F1 score.

\[ F_1 = \frac{2}{1/precision + 1/recall} = 2 \cdot \frac{precision \cdot recall}{precision + recall} = \frac{TP}{TP+\frac{FN+FP}{2}} \]

Using sklearn library:

from sklearn.metrics import f1_score
print(f1_score(y_train_5, y_train_pred))
0.7325171197343846

Depending on the goal of the classifier, precision or recall might be more useful. If you are detecting kid-safe videos on the internet, you prefer low recall but high precision, so that only safe videos are kept even if many good ones are rejected. On the other hand, if you want to detect thieves from security camera footage, you might want high recall in order to catch all shoplifters, even if the lower precision triggers some false alarms.

The F1 score favors classifiers that have similar precision and recall. Unfortunately you cannot have both arbitrarily high: increasing precision reduces recall, and vice versa; this is the precision/recall trade-off.

y_scores = sgd.decision_function([some_digit]) # returns the decision scores
print(y_scores)
threshold = 0
y_scores > threshold
[-6168.66710663]
array([False])

Precision/Recall trade-off

The classifier computes a decision score for each instance and compares it to a threshold: if the score is higher than the threshold, the instance is classified as positive, otherwise as negative. You cannot set this threshold directly (it is fixed at 0), but you can extract the decision scores and compare them manually against any threshold you choose yourself.

How do we choose a good threshold? We can plot the values of precision and recall for many different thresholds! There is a function for this!

# Get the decision scores for all the train data
y_scores = cross_val_predict(sgd, X_train, y_train_5, cv=3, method="decision_function")
y_scores
array([  1200.93051237, -26883.79202424, -33072.03475406, ...,
        13272.12718981,  -7258.47203373, -16877.50840447])
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "--", label="Recall")
    plt.legend(loc="center right", fontsize=16) 
    plt.xlabel("Threshold", fontsize=16)        
    plt.grid(True)                              
    plt.axis([-50000, 50000, 0, 1])

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

recall_90_precision = recalls[np.argmax(precisions >= 0.90)]
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]

with plt.style.context('dark_background'):
    plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
    plt.plot([threshold_90_precision, threshold_90_precision], [0., 0.9], "w:")                 
    plt.plot([-50000, threshold_90_precision], [0.9, 0.9], "w:")                                
    plt.plot([-50000, threshold_90_precision], [recall_90_precision, recall_90_precision], "w:")
    plt.plot([threshold_90_precision], [0.9], "wo")                                             
    plt.plot([threshold_90_precision], [recall_90_precision], "wo")                             

Recall and precision versus the decision threshold. White dots and lines correspond to 90% precision.

def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])
    plt.grid(True)

with plt.style.context("dark_background"):
    plot_precision_vs_recall(precisions, recalls)
    plt.plot([recall_90_precision, recall_90_precision], [0., 0.9], "w:")
    plt.plot([0.0, recall_90_precision], [0.9, 0.9], "w:")
    plt.plot([recall_90_precision], [0.9], "wo")

Precision versus recall. White dot and dotted lines correspond to 90% precision.

# Let's say we want 90% precision for our classifier
threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]
y_train_pred_90 = (y_scores >= threshold_90_precision)
print("Precision score ", precision_score(y_train_5, y_train_pred_90))
print("Recall score", recall_score(y_train_5, y_train_pred_90))
Precision score  0.9000345901072293
Recall score 0.4799852425751706

The ROC curve

The receiver operating characteristic curve plots the sensitivity (recall) versus 1-specificity of the classifier.

  • FPR is the false positive rate (1-TNR)
  • TPR is the true positive rate (i.e. recall)
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

with plt.style.context("dark_background"):
    plt.plot(fpr, tpr, c='cornflowerblue')
    plt.plot([0,1], [0,1], 'w--')
    plt.plot(0,1, c='orangered', marker='o')
    plt.annotate(text="A good classifier is as\nclose to here as possible", 
        xy=(0.0, 1), xytext=(0.1,0.7), 
        arrowprops=dict(facecolor='black'),
        bbox=dict(fc="black", ec="w", boxstyle='round,pad=0.6'))
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate (recall)")

Since we want the ROC curve to be as close to the upper left corner as possible, a good way to compare different classifiers is to use the area under the curve (AUC). A perfect model has an AUC of 1, whereas a purely random classifier has an AUC of 0.5.

We should prefer the use of the precision/recall plot whenever the positive class is rare or when you care more about the false positives than the false negatives. Otherwise, we should use the ROC curve.

from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5, y_scores)
0.9604938554008616

Let’s try to compare the stochastic gradient descent classifier we already built with a new random forest classifier using the ROC curve.

from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method='predict_proba')

y_scores_forest = y_probas_forest[:, 1] # score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)

with plt.style.context("dark_background"):
    plt.plot(fpr, tpr, label='SGD')
    plt.plot(fpr_forest, tpr_forest, label='Random Forest')
    plt.plot([0,1], [0,1], 'w--')
    plt.legend(loc='lower right')
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate (recall)")
    
print(roc_auc_score(y_train_5, y_scores_forest))
0.9983436731328145

Multiclass classification

Some algorithms can handle multiple classes natively, such as SGD, Random Forest and naive Bayes classifiers. Others, like Logistic Regression and Support Vector Machines, are strictly binary classifiers. But we can still use binary classifiers to solve multiclass classification problems.

One versus the rest (OvR)

To classify into multiple classes one can train one binary classifier per class (a 0-detector, a 1-detector, etc.). To know which class a given instance belongs to, you take the classifier that outputs the highest score.

For most binary classification algorithms the OvR approach is preferred.

One versus one (OvO)

Train a binary classifier for every pair of classes (distinguish 0 and 1, distinguish 0 and 2, etc.). For N classes you need \(N \times (N-1)/2\) classifiers. You have to run an image through all of them and see which class wins the most duels/comparisons. The main advantage of the OvO approach is that each classifier needs to be trained only on the part of the training set containing the two classes it must distinguish.

For algorithms that don’t scale well with the size of the training set (such as Support Vector Machines), OvO approaches are preferred, since each classifier only trains on a small subset.

Sklearn

Sklearn automatically runs OvR or OvO depending on the algorithm.

from sklearn.svm import SVC
svm_clf = SVC()
svm_clf.fit(X_train, y_train)
pred = svm_clf.predict([some_digit])

with plt.style.context('dark_background'):
    f, ax = plt.subplots(1, figsize=(2,2))
    ax.imshow(some_digit_image, cmap='binary')
    plt.axis('off')
    ax.set_title('The prediction is {}'.format(pred[0]))

some_digit_scores = svm_clf.decision_function([some_digit])
print(some_digit_scores)

print(np.argmax(some_digit_scores))

print(svm_clf.classes_)
[[-0.30832333  9.31438876  8.28855154  4.79081091  2.73699707  0.71341561
   2.73426097  6.23793081  7.29220689  2.72390608]]
1
[0 1 2 3 4 5 6 7 8 9]

If we want Sklearn to use a specific OvO or OvR strategy, we can wrap the model as shown below. For example, we can force an SVC model to use the OvR strategy.

from sklearn.multiclass import OneVsRestClassifier
ovr_clf = OneVsRestClassifier(SVC(gamma="auto", random_state=42))
ovr_clf.fit(X_train[:1000], y_train[:1000])
ovr_clf.predict([some_digit])
array([7], dtype=uint8)
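
Similarly, we can force the OvO strategy with OneVsOneClassifier; a minimal sketch (again using an arbitrary slice of 1000 training images just to keep it fast):

from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SVC(gamma="auto", random_state=42))
ovo_clf.fit(X_train[:1000], y_train[:1000])
print(len(ovo_clf.estimators_))  # 45 = 10*9/2 pairwise classifiers
print(ovo_clf.predict([some_digit]))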

SGD natively supports multiclass classification, so calling it directly without specifying OvO or OvR works for us!

sgd.fit(X_train, y_train)
print(sgd.predict([some_digit]))
sgd.decision_function([some_digit])
[6]
array([[-45620.45961751,  -6807.11170595,  -3962.20329559,
         -2586.47315091,  -8461.21424704,  -5319.68470554,
          1709.83150757, -27045.65071565,     61.98685276,
         -9030.68374421]])
cross_val_score(sgd, X_train, y_train, cv=3, scoring="accuracy")
array([0.87365, 0.85835, 0.8689 ])
# Scaling features work even better!
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd, X_train_scaled, y_train, cv=3, scoring="accuracy")
array([0.8983, 0.891 , 0.9018])

Error Analysis

from sklearn.metrics import confusion_matrix
y_train_pred = cross_val_predict(sgd, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx
array([[5577,    0,   22,    5,    8,   43,   36,    6,  225,    1],
       [   0, 6400,   37,   24,    4,   44,    4,    7,  212,   10],
       [  27,   27, 5220,   92,   73,   27,   67,   36,  378,   11],
       [  22,   17,  117, 5227,    2,  203,   27,   40,  403,   73],
       [  12,   14,   41,    9, 5182,   12,   34,   27,  347,  164],
       [  27,   15,   30,  168,   53, 4444,   75,   14,  535,   60],
       [  30,   15,   42,    3,   44,   97, 5552,    3,  131,    1],
       [  21,   10,   51,   30,   49,   12,    3, 5684,  195,  210],
       [  17,   63,   48,   86,    3,  126,   25,   10, 5429,   44],
       [  25,   18,   30,   64,  118,   36,    1,  179,  371, 5107]])
with plt.style.context('dark_background'):
    plt.matshow(conf_mx)

Here we have real classes as rows and predicted classes as columns.

Let’s normalize by dividing by the number of images in each class, so we can compare error rates.

row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums
np.fill_diagonal(norm_conf_mx, 0)
with plt.style.context('dark_background'):
    plt.matshow(norm_conf_mx)

The clearest thing from the matrix above is that many classes get misclassified as ‘8’!

Some of the 3s and 5s are also misclassified as other classes, though to a lesser extent. Take a look at the image below: on the left are images classified as ‘3’ and on the right are images classified as ‘5’.

def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    # This is equivalent to n_rows = ceil(len(instances) / images_per_row):
    n_rows = (len(instances) - 1) // images_per_row + 1

    # Append empty images to fill the end of the grid, if needed:
    n_empty = n_rows * images_per_row - len(instances)
    padded_instances = np.concatenate([instances, np.zeros((n_empty, size * size))], axis=0)

    # Reshape the array so it's organized as a grid containing 28×28 images:
    image_grid = padded_instances.reshape((n_rows, images_per_row, size, size))

    # Combine axes 0 and 2 (vertical image grid axis, and vertical image axis),
    # and axes 1 and 3 (horizontal axes). We first need to move the axes that we
    # want to combine next to each other, using transpose(), and only then we
    # can reshape:
    big_image = image_grid.transpose(0, 2, 1, 3).reshape(n_rows * size,
                                                         images_per_row * size)
    # Now that we have a big image, we just need to show it:
    
    plt.imshow(big_image, cmap='binary_r',**options)
    plt.axis("off")

cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]

with plt.style.context('dark_background'):
    plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5); plt.title("Classified as '3'")
    plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5); plt.title("Classified as '5'")
    plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
    plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)

Multilabel Classification

A classification problem where an entry should receive multiple classification labels. Instead of each image containing a single digit, we could have multiple digits in a single image. For example, a face-detection algorithm could detect the faces of several different people in a single picture.

Since in our case every image has a unique label, we can phrase a multilabel classification problem in another way. Let’s say we want to tell, for each digit image, whether the digit is large (7 or above) and whether it is odd. In this case each digit image gets a classification (True or False) for both of these labels.

from sklearn.neighbors import KNeighborsClassifier
y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

knn_clf.predict([some_digit])
array([[False,  True]])
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average='macro') # change to average='weighted' to assume labels are not equally important
0.976410265560605

Multioutput Classification

Problem in which each label can have multiple classes (i.e. more than just 2 values).

Example: remove noise from image

One label per pixel and each label can be anything from 0 to 255 in values.

noise = np.random.randint(0,100,(len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0,100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train
y_test_mod = X_test

random_number = np.random.randint(0, len(X_train_mod))

with plt.style.context('dark_background'):
    f, ax = plt.subplots(1,2)
    ax[0].imshow(X_train_mod[random_number].reshape(28,28), cmap='binary_r')
    ax[1].imshow(y_train_mod[random_number].reshape(28,28), cmap='binary_r')
    ax[0].axis('off')
    ax[1].axis('off')

knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[random_number]])

with plt.style.context('dark_background'):
    f, ax = plt.subplots(1,2)
    ax[1].imshow(clean_digit.reshape(28,28))
    ax[0].imshow(X_test_mod[random_number].reshape(28,28))
    ax[0].axis('off'); ax[0].set_title('noisy image')
    ax[1].axis('off'); ax[1].set_title('denoised image')

Chapter 4: Training Models

Linear regression as a simple example

A linear regression problem can be solved in 2 ways: 1. Exact solution using the Normal Equation 2. Approximate solution using Gradient Descent

Normal Equation

Analytical approach to solve linear regression with a least square cost function.

\(\theta = (X^TX)^{-1}\cdot(X^Ty)\)

\(\theta\) is the parameter vector, X is the matrix of training instances and y is the vector of target values.

\(\hat{y} = X\cdot\theta\): the predicted values are simply the dot product between the training-instance matrix and the parameter vector.

# Let's generate some fake linear data
from matplotlib import pyplot as plt
import numpy as np
plt.rcParams['figure.dpi'] = 140

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

with plt.style.context('dark_background'):
    plt.plot(X, y, '.')
    plt.xlabel("$x_1$", fontsize=18)
    plt.ylabel("$y$", rotation=0, fontsize=18)
    plt.axis([0, 2, 0, 15])
    plt.text(x=0.1, y=13, s=r"Real equation: $y=4+3 X$")

X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y) # Normal equation from above

X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]  # add x0 = 1 to each instance
y_predict = X_new_b.dot(theta_best)

with plt.style.context('dark_background'):
    plt.plot(X, y, ".")
    plt.plot(X_new, y_predict, "-")
    plt.axis([0, 2, 0, 15])
    plt.xlabel("$x_1$", fontsize=18)
    plt.ylabel("$y$", rotation=0, fontsize=18)
    plt.text(x=0.1,y=14, s=r"$\theta_0$={:.3f}".format(theta_best[0,0]))
    plt.text(x=0.1,y=13, s=r"$\theta_1$={:.3f}".format(theta_best[1,0]))

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)
print(lin_reg.intercept_[0], lin_reg.coef_[0,0])
3.863205106429559 3.083062011795244

Gradient Descent

Batch Gradient Descent

eta = 0.1  # learning rate
n_iterations = 1000
m = 100

theta = np.random.randn(2,1)  # random initialization

for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients

print(theta[0,0], theta[1,0])
3.8632051064295294 3.083062011795269
theta_path_bgd = []

def plot_gradient_descent(theta, eta, theta_path=None):
    m = len(X_b)
    plt.plot(X, y, ".")
    n_iterations = 1000
    for iteration in range(n_iterations):
        if iteration < 10:
            y_predict = X_new_b.dot(theta)
            style = "w-" if iteration > 0 else "r--"
            plt.plot(X_new, y_predict, style)
        gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
        theta = theta - eta * gradients
        if theta_path is not None:
            theta_path.append(theta)
    plt.xlabel("$x_1$", fontsize=18)
    plt.axis([0, 2, 0, 15])
    plt.title(r"$\eta = {}$".format(eta), fontsize=16)

np.random.seed(42)
theta = np.random.randn(2,1)  # random initialization

with plt.style.context('dark_background'):
    plt.figure(figsize=(10,3))
    plt.subplot(131); plot_gradient_descent(theta, eta=0.02)
    plt.ylabel("$y$", rotation=0, fontsize=18)
    plt.subplot(132); plot_gradient_descent(theta, eta=0.1, theta_path=theta_path_bgd)
    plt.subplot(133); plot_gradient_descent(theta, eta=0.5)

Stochastic Gradient Descent

theta_path_sgd = []
m = len(X_b)
n_epochs = 50

t0, t1 = 5, 50  # learning schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2,1)  # random initialization

with plt.style.context('dark_background'):
    for epoch in range(n_epochs):
        for i in range(m):
            if epoch == 0 and i < 20:                    
                y_predict = X_new_b.dot(theta)
                if i>0:           
                    style = "w-"
                    alpha=i/20
                else: 
                    style = "r--"         
                    alpha=1
                plt.plot(X_new, y_predict, style, alpha=alpha, label=i)   
            random_index = np.random.randint(m)
            xi = X_b[random_index:random_index+1]
            yi = y[random_index:random_index+1]
            gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
            eta = learning_schedule(epoch * m + i)
            theta = theta - eta * gradients
            theta_path_sgd.append(theta)                 

    plt.plot(X, y, ".")                                 
    plt.xlabel("$x_1$", fontsize=18)                     
    plt.ylabel("$y$", rotation=0, fontsize=18)           
    plt.axis([0, 2, 0, 15])    
    # plt.legend(bbox_to_anchor=(1.1,1.2)) 

from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1, random_state=42)
sgd_reg.fit(X, y.ravel())
print(sgd_reg.intercept_[0], sgd_reg.coef_[0])
3.8070104457185705 3.035135308517953

Mini-batch gradient descent

theta_path_mgd = []

n_iterations = 50
minibatch_size = 20

np.random.seed(42)
theta = np.random.randn(2,1)  # random initialization

t0, t1 = 200, 1000
def learning_schedule(t):
    return t0 / (t + t1)

t = 0
for epoch in range(n_iterations):
    shuffled_indices = np.random.permutation(m)
    X_b_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    for i in range(0, m, minibatch_size):
        t += 1
        xi = X_b_shuffled[i:i+minibatch_size]
        yi = y_shuffled[i:i+minibatch_size]
        gradients = 2/minibatch_size * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(t)
        theta = theta - eta * gradients
        theta_path_mgd.append(theta)

theta_path_bgd = np.array(theta_path_bgd)
theta_path_sgd = np.array(theta_path_sgd)
theta_path_mgd = np.array(theta_path_mgd)

with plt.style.context('dark_background'):
    plt.plot(theta_path_sgd[:, 0], theta_path_sgd[:, 1], "-^", linewidth=1, markersize=3, label="Stochastic")
    plt.plot(theta_path_mgd[:, 0], theta_path_mgd[:, 1], "-s", linewidth=1, markersize=3, label="Mini-batch")
    plt.plot(theta_path_bgd[:, 0], theta_path_bgd[:, 1], "-o", linewidth=1, markersize=3, label="Batch")
    plt.legend(loc="lower right", fontsize=14)
    plt.xlabel(r"$\theta_0$", fontsize=18)
    plt.ylabel(r"$\theta_1$   ", fontsize=18, rotation=0)

Linear Regression algorithm comparison

Algorithm        Large m (instances)  Large n (features)  Hyperparams  Scaling required  Scikit-Learn
Normal Equation  Fast                 Slow                0            No                N/A
SVD              Fast                 Slow                0            No                LinearRegression
Batch GD         Slow                 Fast                >=2          Yes               SGDRegressor
Stochastic GD    Fast                 Fast                >=2          Yes               SGDRegressor
Mini-batch GD    Fast                 Fast                >=2          Yes               SGDRegressor

Polynomial Regression

If your data is more complex than a simple linear relationship, you can use polynomial regression, which adds powers of the features to a plain linear regression!

import numpy as np
from matplotlib import pyplot as plt
plt.rcParams['figure.dpi'] = 140

m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m,1)

with plt.style.context('dark_background'):
    plt.scatter(X,y)
    plt.xlabel('$X_1$', fontsize=18)
    plt.ylabel('$y$', fontsize=18)
    plt.title('Clearly not linear data')

Scikit-Learn’s PolynomialFeatures can transform our training data by adding the second-degree (or higher) polynomial terms of each feature.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
X_poly[0] # 2 values now!
array([2.60846358, 6.80408224])
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
# print(lin_reg.intercept_, lin_reg.coef_)

mock_x = np.linspace(-3, 3)
mock_y = lin_reg.coef_[0][1]*mock_x**2 + lin_reg.coef_[0][0]*mock_x + lin_reg.intercept_

with plt.style.context('dark_background'):
    plt.scatter(X,y, s=10, label='Data')
    plt.plot(mock_x, mock_y, color='red', label='Polynomial Fit')
    plt.xlabel('$X_1$', fontsize=18)
    plt.ylabel('$y$', fontsize=18)
    plt.title('Not a bad fit!\n Predicted model: ${:.2f}x^2+{:.2f}x+{:.2f}$'.format(lin_reg.coef_[0][1], lin_reg.coef_[0][0], lin_reg.intercept_[0]))
    plt.legend()

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

plt.style.use('dark_background')
for style, width, degree in (("w-", 1, 300), ("r--", 2, 2), ("g-.", 2, 1)):
    polybig_features = PolynomialFeatures(degree=degree, include_bias=False)
    std_scaler = StandardScaler()
    lin_reg = LinearRegression()
    polynomial_regression = Pipeline([
            ("poly_features", polybig_features),
            ("std_scaler", std_scaler),
            ("lin_reg", lin_reg),
        ])
    polynomial_regression.fit(X, y)
    y_newbig = polynomial_regression.predict(mock_x.reshape(-1,1))
    plt.plot(mock_x, y_newbig, style, label=str(degree), linewidth=width)

plt.scatter(X, y, s=5)
plt.legend(title='Polynomial degree',loc="upper left")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([-3, 3, 0, 10])
plt.title('Example of over/under fitting')
plt.show()

How to tell if the model is over/under fitting?

  1. Use cross-validation to evaluate performance
    • Performs great in training set but horribly in validation set? → overfitting
    • Performs bad/mediocrely on both training and validation sets? → underfitting
  2. Look at the learning curves → train model many times on different training set sizes (see below)
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X,y, test_size=0.2)
    train_errors = []
    val_errors = []

    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))

    plt.plot(np.sqrt(train_errors), '+-', markersize=4, label='train')
    plt.plot(np.sqrt(val_errors), '|-', markersize=4, label='val')
    plt.xlabel('Training set size')
    plt.ylabel('RMSE')
    plt.legend()

lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)
plt.title('Example of underfitting behaviour')
plt.ylim(0,3)
plt.show()

The above is a perfect example of underfitting behaviour: both the validation and training curves reach a plateau in RMSE and cannot do any better, since the model is too simple.

Let’s look below at an example of overfitting!

from sklearn.pipeline import Pipeline
polynomial_regression = Pipeline([
    ('poly_features', PolynomialFeatures(degree=10, include_bias=False)),
    ('lin_reg', LinearRegression())
])

plot_learning_curves(polynomial_regression, X, y)
plt.title('Example of overfitting')
plt.ylim(0,3)
(0.0, 3.0)

In the plot above we can see that the overall error on the training set is much lower than in the underfitting case, and that there is a big gap between train and validation set!

Regularized Linear Models

A good way to prevent overfitting is to constrain the model. For a polynomial model this can mean reducing the number of polynomial degrees; for a linear model, it typically means constraining the weights.

Ridge regression (Tikhonov regularization)

We modify the cost function by adding a penalty on the size of the weights, so the learning algorithm keeps them as small as possible during training.

\[ J(\theta) = MSE(\theta) + \alpha \frac{1}{2} \sum_{i=1}^n \theta_i^2 \]

Where \(\alpha\) controls the regularization strength.

from sklearn.linear_model import Ridge

def plot_model(model_class, polynomial, alphas, **model_kargs):
    for alpha, style in zip(alphas, ("-", "--", "r:")):
        model = model_class(alpha, **model_kargs) if alpha > 0 else LinearRegression()
        if polynomial:
            model = Pipeline([
                    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
                    ("std_scaler", StandardScaler()),
                    ("regul_reg", model),
                ])
        model.fit(X, y)
        y_new_regul = model.predict(mock_x.reshape(-1,1))
        lw = 2 if alpha > 0 else 1
        plt.plot(mock_x, y_new_regul, style, linewidth=lw, label=r"$\alpha = {}$".format(alpha))
    plt.scatter(X, y, s=5)
    plt.legend(loc="upper left", fontsize=15)
    plt.xlabel("$x_1$", fontsize=18)

plt.figure(figsize=(10,4))
plt.subplot(121)
plot_model(Ridge, polynomial=False, alphas=(0, 10, 100), random_state=42)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.subplot(122)
plot_model(Ridge, polynomial=True, alphas=(0, 10**-5, 1), random_state=42)
plt.suptitle("Higher alpha regularizes the fit")
plt.show()

# Ridge regression closed-form solution
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver='cholesky')
ridge_reg.fit(X,y)
print('Closed-form solution prediction')
print(ridge_reg.predict([[1.5]]))

# Using stochastic gradient descent
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(penalty='l2') # another way to specify the ridge regression
sgd_reg.fit(X,y.ravel())
print("SGD prediction")
print(sgd_reg.predict([[1.5]]))
Closed-form solution prediction
[[4.68712667]]
SGD prediction
[4.669416]

Lasso Regression

Least Absolute Shrinkage and Selection Operator Regression (Lasso) adds another regularization factor to the cost function. \[ J(\theta) = MSE(\theta) + \alpha \sum_{i=1}^n |\theta_i| \]

Lasso regularization tends to eliminate the weights of the least important features → it performs automatic feature selection and outputs a sparse model.

from sklearn.linear_model import Lasso

plt.figure(figsize=(10,4))
plt.subplot(121)
plot_model(Lasso, polynomial=False, alphas=(0, 0.1, 1), random_state=42)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.subplot(122)
plot_model(Lasso, polynomial=True, alphas=(0, 10**-7, 1), random_state=42)
plt.suptitle('Higher values regularizes the fit too much!')
plt.show()

# Lasso regression using the Lasso class (coordinate descent under the hood; there is no closed-form solution)
from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X,y)
print('Using Lasso class')
print(lasso_reg.predict([[1.5]]))

# Using stochastic gradient descent
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(penalty='l1') # another way to specify the lasso regression
sgd_reg.fit(X,y.ravel())
print("Using SGD class")
print(sgd_reg.predict([[1.5]]))
Using Lasso class
[4.63505617]
Using SGD class
[4.64805295]

Elastic Net

A mix of Ridge and Lasso regularization, in which you can control the mix ratio \(r\) (\(r=0\) is pure Ridge, \(r=1\) is pure Lasso).

\[ J(\theta) = MSE(\theta) + r\alpha \sum_{i=1}^n |\theta_i| + \frac{1 - r}{2} \alpha \sum_{i=1}^n \theta_i^2 \]

In general it is important to have at least a little regularization, so you should usually avoid plain Linear Regression!

Ridge regularization is a good default, but if you suspect only a few features are important then Lasso or Elastic Net are preferable, as they tend to drive the weights of the least important features down to zero.

In general Elastic Net is preferred over Lasso, because Lasso may behave erratically when several features are strongly correlated or when the number of features is much higher than the number of training instances.

# Elastic Net in sklearn
from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5) # half-half regularization
elastic_net.fit(X,y)
elastic_net.predict([[1.5]])
array([4.63955284])

Regularization by Early Stopping

When using an iterative training process, a smart way to prevent overfitting is to stop training as soon as the validation performance is at its best: for example, stop when the validation RMSE is no longer decreasing, just before it starts increasing again.

np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 2 + X + 0.5 * X**2 + np.random.randn(m, 1)

X_train, X_val, y_train, y_val = train_test_split(X[:50], y[:50].ravel(), test_size=0.5, random_state=10)

One way to implement early stopping with gradient descent (here using SGDRegressor with warm_start=True, so each call to fit() continues from where it left off):

from copy import deepcopy

poly_scaler = Pipeline([
        ("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
        ("std_scaler", StandardScaler())
    ])

X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)

sgd_reg = SGDRegressor(max_iter=1, tol=-np.infty, warm_start=True, # warm_start allows each fit() call to continue where it left off
                       penalty=None, learning_rate="constant", eta0=0.0005, random_state=42)

minimum_val_error = float("inf")
best_epoch = None
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train_poly_scaled, y_train)  # continues where it left off
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        best_model = deepcopy(sgd_reg)

Bonus visualization of stopping moment (best model RMSE)

sgd_reg = SGDRegressor(max_iter=1, tol=-np.infty, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005, random_state=42)

n_epochs = 500
train_errors, val_errors = [], []
for epoch in range(n_epochs):
    sgd_reg.fit(X_train_poly_scaled, y_train)
    y_train_predict = sgd_reg.predict(X_train_poly_scaled)
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    train_errors.append(mean_squared_error(y_train, y_train_predict))
    val_errors.append(mean_squared_error(y_val, y_val_predict))

best_epoch = np.argmin(val_errors)
best_val_rmse = np.sqrt(val_errors[best_epoch])

plt.annotate('Best model',
             xy=(best_epoch, best_val_rmse),
             xytext=(best_epoch, best_val_rmse + 1),
             ha="center",
             arrowprops=dict(facecolor='black', shrink=0.05),
             fontsize=16,
            )

best_val_rmse -= 0.03  # just to make the graph look better
plt.plot([0, n_epochs], [best_val_rmse, best_val_rmse], ":", color='white', linewidth=2)
plt.plot(np.sqrt(val_errors), "-", linewidth=3, label="Validation set")
plt.plot(np.sqrt(train_errors), "--", linewidth=2, label="Training set")
plt.legend(loc="upper right", fontsize=14)
plt.xlabel("Epoch", fontsize=14)
plt.ylabel("RMSE", fontsize=14)
plt.show()
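
Bonus visualization: how the \(\ell_1\) (Lasso) and \(\ell_2\) (Ridge) penalties reshape the cost contours and the path followed by batch gradient descent in parameter space.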

t1a, t1b, t2a, t2b = -1, 3, -1.5, 1.5

t1s = np.linspace(t1a, t1b, 500)
t2s = np.linspace(t2a, t2b, 500)
t1, t2 = np.meshgrid(t1s, t2s)
T = np.c_[t1.ravel(), t2.ravel()]
Xr = np.array([[1, 1], [1, -1], [1, 0.5]])
yr = 2 * Xr[:, :1] + 0.5 * Xr[:, 1:]

J = (1/len(Xr) * np.sum((T.dot(Xr.T) - yr.T)**2, axis=1)).reshape(t1.shape)

N1 = np.linalg.norm(T, ord=1, axis=1).reshape(t1.shape)
N2 = np.linalg.norm(T, ord=2, axis=1).reshape(t1.shape)

t_min_idx = np.unravel_index(np.argmin(J), J.shape)
t1_min, t2_min = t1[t_min_idx], t2[t_min_idx]

t_init = np.array([[0.25], [-1]])

def bgd_path(theta, X, y, l1, l2, core = 1, eta = 0.05, n_iterations = 200):
    path = [theta]
    for iteration in range(n_iterations):
        gradients = core * 2/len(X) * X.T.dot(X.dot(theta) - y) + l1 * np.sign(theta) + l2 * theta
        theta = theta - eta * gradients
        path.append(theta)
    return np.array(path)

fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(10.1, 8))
for i, N, l1, l2, title in ((0, N1, 2., 0, "Lasso"), (1, N2, 0,  2., "Ridge")):
    JR = J + l1 * N1 + l2 * 0.5 * N2**2
    
    tr_min_idx = np.unravel_index(np.argmin(JR), JR.shape)
    t1r_min, t2r_min = t1[tr_min_idx], t2[tr_min_idx]

    levelsJ=(np.exp(np.linspace(0, 1, 20)) - 1) * (np.max(J) - np.min(J)) + np.min(J)
    levelsJR=(np.exp(np.linspace(0, 1, 20)) - 1) * (np.max(JR) - np.min(JR)) + np.min(JR)
    levelsN=np.linspace(0, np.max(N), 10)
    
    path_J = bgd_path(t_init, Xr, yr, l1=0, l2=0)
    path_JR = bgd_path(t_init, Xr, yr, l1, l2)
    path_N = bgd_path(np.array([[2.0], [0.5]]), Xr, yr, np.sign(l1)/3, np.sign(l2), core=0)

    ax = axes[i, 0]
    ax.grid(True)
    ax.axhline(y=0, color='k')
    ax.axvline(x=0, color='k')
    ax.contourf(t1, t2, N / 2., levels=levelsN)
    ax.plot(path_N[:, 0], path_N[:, 1], "y--")
    ax.plot(0, 0, "ys")
    ax.plot(t1_min, t2_min, "ys")
    ax.set_title(r"$\ell_{}$ penalty".format(i + 1), fontsize=16)
    ax.axis([t1a, t1b, t2a, t2b])
    if i == 1:
        ax.set_xlabel(r"$\theta_1$", fontsize=16)
    ax.set_ylabel(r"$\theta_2$", fontsize=16, rotation=0)

    ax = axes[i, 1]
    ax.grid(True)
    ax.axhline(y=0, color='k')
    ax.axvline(x=0, color='k')
    ax.contourf(t1, t2, JR, levels=levelsJR, alpha=0.9)
    ax.plot(path_JR[:, 0], path_JR[:, 1], "w-o")
    ax.plot(path_N[:, 0], path_N[:, 1], "y--")
    ax.plot(0, 0, "ys")
    ax.plot(t1_min, t2_min, "ys")
    ax.plot(t1r_min, t2r_min, "rs")
    ax.set_title(title, fontsize=16)
    ax.axis([t1a, t1b, t2a, t2b])
    if i == 1:
        ax.set_xlabel(r"$\theta_1$", fontsize=16)

plt.show()

Logistic (Logit) Regression

It is a typical binary classifier. The logistic regression model estimates the probability as:

\[ \hat{p} = h_\theta(x) = \sigma(x^T\theta) \]

Where \(\sigma\) is the logistic function

\[ \sigma(t) = \frac{1}{1+\exp(-t)} \]

Once the logistic regression model has estimated \(\hat{p}\), it makes its prediction \(\hat{y}\) by:

\[ \hat{y} = \begin{cases} 0 & \text{if } \hat{p} < 0.5 \\ 1 & \text{if } \hat{p} \geq 0.5 \end{cases} \]

The score \(t\) is often called logit. It comes from the fact that the logit function

\[ \text{logit}(p) = \log\left(\frac{p}{1-p}\right) \]

is actually the inverse of the logistic function. If you compute the logit of the estimated probability \(p\), you will find that the result is \(t\).
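
As a quick sanity check, plugging the logistic function into the logit recovers the score:

\[ \text{logit}(\sigma(t)) = \log\left(\frac{\sigma(t)}{1-\sigma(t)}\right) = \log\left(\frac{1/(1+e^{-t})}{e^{-t}/(1+e^{-t})}\right) = \log\left(e^{t}\right) = t \]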

import numpy as np
from matplotlib import pyplot as plt

plt.rcParams['figure.dpi'] = 140
plt.style.use('dark_background')

# Visualization of logistic function
x = np.linspace(start=-10, stop=10, num=100)
y = 1/(1+np.exp(-x))

plt.figure(figsize=(6,3))
plt.axhline(0.5, ls=':')
plt.axvline(0., ls='--')
plt.plot(x,y)
plt.xlabel('$t$')
plt.ylabel(r'$\sigma(t)$')  # raw string avoids the invalid escape sequence warning
plt.show()

Cost function Logistic Regression

Cost function of a single training instance: \[ c(\theta) = \begin{cases} -\log(\hat{p}) & \text{if } y=1 \\ -\log(1-\hat{p}) & \text{if } y=0 \end{cases} \]

The cost function over the whole training set is the average cost over all the training instances that can be written as the log loss expression:

\[ J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log(\hat{p}^{(i)}) + (1 - y^{(i)}) \log(1-\hat{p}^{(i)}) \right] \]

There is no known closed-form solution to compute the value of \(\theta\) that minimizes this cost function, so we have to rely on Gradient Descent (or another optimization algorithm). The cost function is convex, though, so Gradient Descent is guaranteed to find the global minimum (given a small enough learning rate and enough time).
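
A minimal NumPy sketch of batch gradient descent on the log loss, just to make the idea concrete (the toy data, learning rate eta and n_iterations are made up for illustration):

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# toy binary-classification data
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 1))
y_toy = (X_toy[:, 0] > 0).astype(float)

X_b = np.c_[np.ones((len(X_toy), 1)), X_toy]   # add the bias feature x0 = 1
theta = np.zeros(X_b.shape[1])
eta, n_iterations = 0.1, 1000

for _ in range(n_iterations):
    p_hat = sigmoid(X_b @ theta)                    # current probability estimates
    gradients = X_b.T @ (p_hat - y_toy) / len(X_b)  # gradient of the log loss
    theta -= eta * gradients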

Decision Boundaries

Let’s find the decision boundary to detect Iris virginica flowers based on petal width.

from sklearn import datasets
iris = datasets.load_iris()
print(iris.DESCR)

X = iris['data'][:,3:] # load petal width as X
y = (iris['target'] == 2).astype(int) # if it's virginica or not
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X,y)

X_new = np.linspace(0,3,1000).reshape(-1,1)
y_proba = log_reg.predict_proba(X_new) # returns class probabilities; .predict would return the classes instead
decision_boundary = X_new[y_proba[:, 1] >= 0.5][0]

plt.figure(figsize=(8,3))
plt.plot(X[y==0], y[y==0], "s", color='tab:orange')
plt.plot(X[y==1], y[y==1], "^", color='tab:cyan')
plt.plot([decision_boundary, decision_boundary], [-1, 2], ":", linewidth=2)
plt.plot(X_new, y_proba[:, 1], label='Iris virginica', color='tab:cyan')
plt.plot(X_new, y_proba[:, 0], label='Not Iris virginica', color='tab:orange')
plt.text(decision_boundary+0.02, 0.15, "Decision  boundary", fontsize=14, ha="center")
plt.arrow(decision_boundary, 0.08, -0.3, 0, head_width=0.05, head_length=0.1)
plt.arrow(decision_boundary, 0.92, 0.3, 0, head_width=0.05, head_length=0.1)
plt.ylim(-0.02, 1.02)
plt.xlabel('Petal width [cm]')
plt.ylabel('Probability')
plt.legend()
plt.show()

# 2D example of logistic regression
from sklearn.linear_model import LogisticRegression

X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(int)

log_reg = LogisticRegression(solver="lbfgs", C=10**10, random_state=42)
log_reg.fit(X, y)

x0, x1 = np.meshgrid(
        np.linspace(2.9, 7, 500).reshape(-1, 1),
        np.linspace(0.8, 2.7, 200).reshape(-1, 1),
    )
X_new = np.c_[x0.ravel(), x1.ravel()]

y_proba = log_reg.predict_proba(X_new)

plt.figure(figsize=(10, 4))
plt.plot(X[y==0, 0], X[y==0, 1], "s", color='tab:orange')
plt.plot(X[y==1, 0], X[y==1, 1], "^", color='tab:cyan')

zz = y_proba[:, 1].reshape(x0.shape)
contour = plt.contour(x0, x1, zz, cmap=plt.cm.brg)


left_right = np.array([2.9, 7])
boundary = -(log_reg.coef_[0][0] * left_right + log_reg.intercept_[0]) / log_reg.coef_[0][1]

plt.clabel(contour, inline=1, fontsize=12)
plt.plot(left_right, boundary, "--", linewidth=1.5, color='white')
plt.text(3.5, 1.5, "Not Iris virginica", fontsize=14, color='tab:orange', ha="center")
plt.text(6.5, 2.3, "Iris virginica", fontsize=14, color='tab:cyan', ha="center")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.axis([2.9, 7, 0.8, 2.7])
plt.show()

Softmax (Multinomial Logistic) Regression

Logistic regression can be extended to multiple classes directly without having to train and combine multiple binary classifiers.

Softmax function, where \(K\) is the number of classes, \(s(x)\) is the vector containing the scores of each class for the instance \(x\), and \(\sigma(s(x))_k\) is the estimated probability that the instance \(x\) belongs to class \(k\):

\[ \hat{p}_k = \sigma(s(x))_k = \frac{\exp(s_k(x))}{\sum_{j=1}^K \exp(s_j(x))} \]
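
A minimal NumPy sketch of the softmax function itself (the scores array below is made up for illustration):

import numpy as np

def softmax(scores):
    # subtract the max score for numerical stability (does not change the result)
    exps = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exps / exps.sum(axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])  # made-up class scores s_k(x)
print(softmax(scores))              # approximately [0.659, 0.242, 0.099]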

Softmax regression classifier prediction

\[ \hat{y} = \operatorname{argmax}_k \sigma(s(x))_k = \operatorname{argmax}_k s_k(x) = \operatorname{argmax}_k \left((\theta^{(k)})^{\intercal}x\right) \]

Cross-entropy cost function

\[ J(\Theta) = -\frac{1}{m}\sum_{i=1}^m \sum_{k=1}^K y_k^{(i)} \log(\hat{p}_k^{(i)}) \]

Where \(y_k^{(i)}\) is the target probability that the ith instance belongs to class k. It is either 0 or 1.

Cross-entropy and entropy come from information theory (Claude Shannon); check out this video for more info.
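
A minimal check of this formula against scikit-learn's log_loss (the toy labels and probabilities below are made up):

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 2, 1])            # class index of each of 3 instances
p_hat = np.array([[0.7, 0.2, 0.1],      # made-up predicted probabilities
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])

# manual cross-entropy: average of -log(p_hat of the true class)
manual = -np.mean(np.log(p_hat[np.arange(3), y_true]))
print(manual, log_loss(y_true, p_hat))  # the two values match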


X = iris['data'][:, (2,3)] # petal length and width
y = iris['target']

softmax_reg = LogisticRegression(multi_class='multinomial', solver='lbfgs', C=10)
softmax_reg.fit(X, y)

print(softmax_reg.predict([[5,2]]))
softmax_reg.predict_proba([[5,2]])
[2]
array([[6.38014896e-07, 5.74929995e-02, 9.42506362e-01]])
x0, x1 = np.meshgrid(
        np.linspace(0, 8, 500).reshape(-1, 1),
        np.linspace(0, 3.5, 200).reshape(-1, 1),
    )
X_new = np.c_[x0.ravel(), x1.ravel()]


y_proba = softmax_reg.predict_proba(X_new)
y_predict = softmax_reg.predict(X_new)

zz1 = y_proba[:, 1].reshape(x0.shape)
zz = y_predict.reshape(x0.shape)

plt.figure(figsize=(10, 4))
plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris virginica")
plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris versicolor")
plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris setosa")

from matplotlib.colors import ListedColormap
custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])

plt.contourf(x0, x1, zz, cmap=custom_cmap)
contour = plt.contour(x0, x1, zz1, cmap=plt.cm.brg)
plt.clabel(contour, inline=1, fontsize=12)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="center left", fontsize=14)
plt.axis([0, 7, 0, 3.5])
plt.show()

Support Vector Machines

Support vector machines are models particularly useful for small- to medium-sized datasets. They can be used for linear and nonlinear classification, regression, and even outlier detection.

SVMs try to fit the widest possible 'street' between the classes of the dataset. SVMs are very sensitive to feature scaling.

Margins

  • A Hard Margin is when we don’t allow any margin violations and we force the classes to be separated entirely by the support vectors (i.e. the data points closest to the ‘street’ separating the classes) → this creates a very strict model that is very sensitive to outliers.
  • A better tactic is to use a Soft Margin classifier, where we allow some margin violations and build a more robust model (see the sketch below).
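
In scikit-learn the C hyperparameter controls this trade-off: a very large C approximates a hard margin, while a smaller C tolerates more margin violations. A minimal sketch (the variable names and C values are arbitrary):

from sklearn.svm import LinearSVC

hard_ish_margin = LinearSVC(C=1e4, loss='hinge')  # ~hard margin: barely tolerates violations
soft_margin = LinearSVC(C=1, loss='hinge')        # soft margin: wider street, more robust to outliers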

Linear SVM

from matplotlib import pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC

plt.rcParams['figure.dpi'] = 140
plt.style.use('dark_background')

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = iris["target"]

setosa_or_versicolor = (y == 0) | (y == 1)
X = X[setosa_or_versicolor]
y = y[setosa_or_versicolor]

X_outliers = np.array([[3.4, 1.3], [3.2, 0.8]])
y_outliers = np.array([0, 0])

Xo1 = np.concatenate([X, X_outliers[:1]], axis=0)
yo1 = np.concatenate([y, y_outliers[:1]], axis=0)
Xo2 = np.concatenate([X, X_outliers[1:]], axis=0)
yo2 = np.concatenate([y, y_outliers[1:]], axis=0)

svm_clf1 = SVC(kernel="linear", C=10**9)
svm_clf1.fit(Xo2, yo2)

fig, axes = plt.subplots(1,2, figsize=(12,3), sharey=True)

axes[0].plot(Xo1[:, 0][yo1==1], Xo1[:, 1][yo1==1], "s")
axes[0].plot(Xo1[:, 0][yo1==0], Xo1[:, 1][yo1==0], "o")
axes[0].set_xlabel("Petal length")
axes[0].set_ylabel("Petal width")
axes[0].annotate("Outlier",
             xy=(X_outliers[0][0], X_outliers[0][1]),
             xytext=(2.5, 1.7),
             ha="center",
             arrowprops=dict(facecolor='black', shrink=0.1),
            )

axes[1].plot(Xo2[:, 0][yo2==1], Xo2[:, 1][yo2==1], "s")
axes[1].plot(Xo2[:, 0][yo2==0], Xo2[:, 1][yo2==0], "o")

axes[1].set_xlabel("Petal length")
axes[1].annotate("Outlier",
             xy=(X_outliers[1][0], X_outliers[1][1]),
             xytext=(3.2, 0.08),
             ha="center",
             arrowprops=dict(facecolor='black', shrink=0.1),
            )

for ax in axes:
    ax.set_ylim(0,2)
    ax.set_xlim(0,5.5)

w = svm_clf1.coef_[0]
b = svm_clf1.intercept_[0]

# At the decision boundary, w0*x0 + w1*x1 + b = 0
# => x1 = -w0/w1 * x0 - b/w1
x0 = np.linspace(0, 5.5, 200)
decision_boundary = -w[0]/w[1] * x0 - b/w[1]
margin = 1/w[1]

gutter_up = decision_boundary + margin
gutter_down = decision_boundary - margin

svs = svm_clf1.support_vectors_

axes[1].scatter(svs[:, 0], svs[:, 1], s=180, facecolors='#AAAAAA')
axes[1].plot(x0, decision_boundary, "-", linewidth=2, color='tab:orange', alpha=0.8)
axes[1].plot(x0, gutter_up, "--", linewidth=2, color='tab:cyan', alpha=0.8)
axes[1].plot(x0, gutter_down, "--", linewidth=2, color='tab:cyan', alpha=0.8)

fig.suptitle('SVMs are susceptible to outliers')
plt.show()

iris = datasets.load_iris()
X = iris['data'][:, (2,3)] # petal length and width
y = (iris['target'] == 2).astype(np.float64) # select Iris virginica class

scaler = StandardScaler()

svm_clf1 = SVC(kernel = 'linear', C=1, random_state=42)
svm_clf2 = SVC(kernel = 'linear', C=100, random_state=42)

scaled_svm_clf1 = Pipeline([
        ("scaler", scaler),
        ("linear_svc", svm_clf1),
    ])
scaled_svm_clf2 = Pipeline([
        ("scaler", scaler),
        ("linear_svc", svm_clf2),
    ])

scaled_svm_clf1.fit(X, y)
scaled_svm_clf2.fit(X, y)

unscaled_svs_clf1 = scaler.inverse_transform(svm_clf1.support_vectors_)
unscaled_svs_clf2 = scaler.inverse_transform(svm_clf2.support_vectors_)

f, axes = plt.subplots(1,2, figsize=(12,4), sharey=True)

axes[0].scatter(unscaled_svs_clf1[:, 0], unscaled_svs_clf1[:, 1], facecolors='tab:purple', s=180)
axes[0].scatter(X[:,0][y==0], X[:,1][y==0])
axes[0].scatter(X[:,0][y==1], X[:,1][y==1])

axes[1].scatter(unscaled_svs_clf2[:, 0], unscaled_svs_clf2[:, 1], facecolors='tab:purple', s=180)
axes[1].scatter(X[:,0][y==0], X[:,1][y==0])
axes[1].scatter(X[:,0][y==1], X[:,1][y==1])

w1 = svm_clf1.coef_[0] / scaler.scale_
b1 = svm_clf1.decision_function([-scaler.mean_ / scaler.scale_])

# At the decision boundary, w0*x0 + w1*x1 + b = 0
# => x1 = -w0/w1 * x0 - b/w1
x0 = np.linspace(0, 10, 200)
decision_boundary1 = -w1[0]/w1[1] * x0 - b1/w1[1]
margin1 = 1/w1[1]
gutter_up1 = decision_boundary1 + margin1
gutter_down1 = decision_boundary1 - margin1

axes[0].plot(x0, decision_boundary1, "-", linewidth=2, color='tab:orange', alpha=0.8)
axes[0].plot(x0, gutter_up1, "--", linewidth=2, color='tab:cyan', alpha=0.8)
axes[0].plot(x0, gutter_down1, "--", linewidth=2, color='tab:cyan', alpha=0.8)
axes[0].set_ylabel('Petal width')

w2 = svm_clf2.coef_[0] / scaler.scale_
b2 = svm_clf2.decision_function([-scaler.mean_ / scaler.scale_])

decision_boundary2 = -w2[0]/w2[1] * x0 - b2/w2[1]
margin2 = 1/w2[1]
gutter_up2 = decision_boundary2 + margin2
gutter_down2 = decision_boundary2 - margin2

axes[1].plot(x0, decision_boundary2, "-", linewidth=2, color='tab:orange', alpha=0.8)
axes[1].plot(x0, gutter_up2, "--", linewidth=2, color='tab:cyan', alpha=0.8)
axes[1].plot(x0, gutter_down2, "--", linewidth=2, color='tab:cyan', alpha=0.8)

for ax in axes:
    ax.set_xlim(4, 6)
    ax.set_ylim(0.8, 2.8)
    ax.set_xlabel('Petal length')

axes[0].set_title('C = 1')
axes[1].set_title('C = 100')
f.suptitle('Hyperparameter C can regularize the number of margin violations')
plt.tight_layout()
plt.show()

Unlike Logistic Regression, SVM classifiers do not output a probability for each class by default; they simply return the predicted class.

svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_classifier', LinearSVC(C = 1, loss = 'hinge'))
])

svm_clf.fit(X, y)
svm_clf.predict([[5.5, 1.7]])
array([1.])
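
You can still get a confidence score from the decision function (the signed distance to the decision boundary), and SVC (unlike LinearSVC) can estimate calibrated probabilities if asked; a minimal sketch reusing the pipeline above (the prob_svm_clf name is mine):

# signed distance to the decision boundary (positive = predicted class 1)
print(svm_clf.decision_function([[5.5, 1.7]]))

# SVC with probability=True runs an extra internal cross-validation to calibrate probabilities
prob_svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_classifier', SVC(kernel='linear', C=1, probability=True))
])
prob_svm_clf.fit(X, y)
print(prob_svm_clf.predict_proba([[5.5, 1.7]]))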

Nonlinear SVM Classification

Linear SVMs are very efficient and can be very valuable; however, many datasets are not linearly separable. To work around this issue, we can transform the data by adding nonlinear features (e.g. polynomial features) and then use a linear SVM to separate the classes.

from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples = 100, noise = 0.15)

poly_svm_clf = Pipeline([
    ('poly_features', PolynomialFeatures(degree = 3)),
    ('scaler', StandardScaler()),
    ('svm_classifier', LinearSVC(C = 10, loss = 'hinge')),
])

poly_svm_clf.fit(X, y)

x0s = np.linspace(-1.5, 2.5, 100)
x1s = np.linspace(-1.5, 2.5, 100)
x0, x1 = np.meshgrid(x0s, x1s)
X2 = np.c_[x0.ravel(), x1.ravel()]

y_pred = poly_svm_clf.predict(X2).reshape(x0.shape)
y_decision = poly_svm_clf.decision_function(X2).reshape(x0.shape)

plt.contourf(x0, x1, y_pred, cmap=plt.cm.seismic, alpha=0.8)      # predicted class regions
plt.contourf(x0, x1, y_decision, cmap=plt.cm.seismic, alpha=0.2)  # lighter overlay of the decision function values

plt.scatter(X[:, 0][y == 0], X[:, 1][y == 0], marker='s', color='tab:blue')
plt.scatter(X[:, 0][y == 1], X[:, 1][y == 1], color='tab:green')
plt.ylim(-1, 1.5)
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.title('Using polynomial transformation with nonlinear SVM')
plt.show()