Optimization in Machine Learning

Table of Contents

Introduction
K-Folding
Gridsearch
Exercise

Introduction

In machine learning, optimization refers to the process of improving a model’s performance by finding the best possible configuration of its parameters and hyperparameters. The goal is to build a model that not only fits the training data well but also generalizes effectively to unseen data.

[470]:

from sklearn.datasets import load_iris
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.svm import SVC as svm

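We will again use the iris dataset. To keep the problem simple, we keep only the first two features (sepal length and sepal width) and turn the target into a binary label: 1 for class 0 (setosa) and 0 for the other two species.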

[471]:

iris = load_iris()
X = iris.data[:,:2]
y = (iris.target == 0).astype(int)

K-Folding

K-Fold Cross Validation is a model evaluation technique that gives a more robust measure of a classification model’s performance, especially when the available dataset is small. It splits the dataset into k subsets, or “folds”. The model is trained k times: each time, one fold is held out for validation and the remaining k-1 folds are used for training, so every sample is used for validation exactly once. Averaging the k scores gives a more reliable estimate of the model’s ability to generalize to new data.

A model can perform well on training data but poorly on new data (overfitting). K-Fold Cross Validation helps detect this, because every score is computed on data the model did not see during training.
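The gap between training performance and cross-validated performance can be made visible with a small experiment. The sketch below is only an illustration under assumptions that are not part of this notebook: it uses synthetic noisy data from make_classification and an unpruned DecisionTreeClassifier, chosen because such a tree can memorise its training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (an assumption made for illustration):
# the labels of about 20% of the samples are reassigned at random.
X_noisy, y_noisy = make_classification(
    n_samples=200, n_features=10, n_informative=3,
    flip_y=0.2, random_state=0,
)

# An unpruned decision tree can memorise the training set.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_noisy, y_noisy)
train_acc = tree.score(X_noisy, y_noisy)

# 5-fold cross validation scores the same kind of model
# only on rows that were held out during training.
kf_noisy = KFold(n_splits=5, shuffle=True, random_state=0)
cv_acc = cross_val_score(tree, X_noisy, y_noisy, cv=kf_noisy).mean()

print(f"Accuracy on the training data: {train_acc:.2f}")
print(f"Mean 5-fold accuracy:          {cv_acc:.2f}")
```

The training accuracy alone would suggest a perfect model; the cross-validated score is the one that reflects behaviour on unseen rows.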

To perform this task, we again use scikit-learn, specifically the KFold class. It takes the following parameters:

n_splits: Number of folds to create.
shuffle: Boolean indicating whether the data should be shuffled before splitting.
random_state: Random seed that makes the shuffling reproducible (only used when shuffle=True).

Its split method yields, for each fold, the indices of the rows to use for training and the indices of the rows to use for testing. The folds have (almost) equal size.
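To see what the splitter produces, here is a minimal sketch on a toy array of ten samples (the array and the variable names are made up for this illustration). It prints the train and test indices of each fold and checks that every sample lands in a test fold once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, numbered 0..9, so the indices are easy to read.
X_toy = np.arange(10).reshape(-1, 1)

kf_toy = KFold(n_splits=5, shuffle=True, random_state=42)

test_indices_seen = []
for fold, (train_idx, test_idx) in enumerate(kf_toy.split(X_toy), start=1):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
    test_indices_seen.extend(test_idx.tolist())

# Across the 5 folds, every sample is used for testing exactly once.
print(sorted(test_indices_seen))
```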

We can define a function that performs the following steps:

Split the data into k folds
Build the training and test sets for each fold
Train a model
Test it and evaluate its performance

The function returns the average performance, the result of each fold and the best model.

[472]:

from sklearn.base import clone

def cross_validate(model, X, y, k=5):
    # 1. Create a K-Fold object that splits the data into k parts,
    # shuffles the dataset before splitting and uses a fixed seed
    kf = KFold(n_splits=k, shuffle=True, random_state=42)

    # Variables to store results
    scores = []
    best_score = -1
    best_model = None

    # split returns the indices of the rows used for training and testing.
    for train_index, test_index in kf.split(X):
        # split data
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # train a fresh copy, so each fold keeps its own fitted model
        # (refitting the same object would overwrite the earlier folds)
        fold_model = clone(model)
        fold_model.fit(X_train, y_train)

        # make prediction
        y_pred = fold_model.predict(X_test)

        # evaluate performance
        score = accuracy_score(y_test, y_pred)
        scores.append(score)

        # store the model of the best-scoring fold
        if score > best_score:
            best_score = score
            best_model = fold_model

    # Return the mean accuracy, the accuracy of each fold and the best model
    return np.mean(scores), scores, best_model

Let’s see how this works. We create a Logistic Regression model.

[473]:

model = LogisticRegression()

And compute cross validation:

[474]:

mean_score, scores, best_model = cross_validate(model, X, y, 5)
print(f"Mean score:{mean_score}")
print(f"Scores: {scores}")

Mean score:1.0
Scores: [1.0, 1.0, 1.0, 1.0, 1.0]
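The hand-written loop is useful for understanding the mechanics, but scikit-learn also provides the cross_val_score helper, which runs the same fit-and-score cycle for each fold. A minimal sketch, assuming the same data preparation and the same KFold settings as above (the variable names ending in _iris are introduced here only to avoid overwriting the notebook’s variables):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Same data preparation as above: two features, setosa vs. the rest.
data_iris = load_iris()
X_iris = data_iris.data[:, :2]
y_iris = (data_iris.target == 0).astype(int)

kf_iris = KFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy value per fold; the estimator is cloned and refitted for each fold.
cv_scores = cross_val_score(
    LogisticRegression(), X_iris, y_iris, cv=kf_iris, scoring="accuracy"
)

print(f"Scores: {cv_scores}")
print(f"Mean score: {cv_scores.mean()}")
```

Unlike the function defined earlier, cross_val_score returns only the scores, not the fitted models.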

What can we do with the best_model variable?

For example, we can print the hyperparameters with which the model was trained.

[475]:

print(best_model.get_params())

{'C': 1.0, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1, 'l1_ratio': None, 'max_iter': 100, 'multi_class': 'deprecated', 'n_jobs': None, 'penalty': 'l2', 'random_state': None, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0, 'warm_start': False}

Gridsearch

In the training of machine learning models, a hyperparameter is any parameter that is not directly learned during the training process but must be defined before training begins. These parameters are configurations that influence the behavior of the model, as well as its ability to learn and generalize on new datasets.

Unlike parameters that are determined during the training of the model (such as weights in a neural network), hyperparameters must be set beforehand, typically based on the designer’s experience, trial and error, or through search techniques.

GridSearchCV automates the search for the best combination of hyperparameters by:

Trying all possible combinations from a predefined grid
Evaluating each using cross-validation
Selecting the configuration with the best average performance

This ensures the model is not only trained, but also systematically optimized.
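The three steps above can be written out by hand to make the idea concrete. This is a rough sketch of the principle, not a description of how GridSearchCV is implemented internally; the small grid and the variable names are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.svm import SVC

data_gs = load_iris()
X_gs = data_gs.data[:, :2]
y_gs = (data_gs.target == 0).astype(int)

# A small illustrative grid (values chosen for the example only).
demo_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

manual_best_score, manual_best_params = -1.0, None
for params in ParameterGrid(demo_grid):
    # 1. one candidate = one combination of values from the grid
    candidate = SVC(**params)
    # 2. evaluate the candidate with 5-fold cross validation
    mean_score = cross_val_score(candidate, X_gs, y_gs, cv=5).mean()
    # 3. keep the configuration with the best average score
    if mean_score > manual_best_score:
        manual_best_score, manual_best_params = mean_score, params

print(len(ParameterGrid(demo_grid)), "combinations tried")
print("Best parameters:", manual_best_params)
print("Best mean accuracy:", manual_best_score)
```

The number of model fits grows as the product of the list lengths in the grid times the number of folds, which is why large grids become expensive quickly.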

To use GridSearchCV, we need to create a dictionary with all the hyperparameter values we want to try with our model:

[476]:

param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

Then, we can create a GridSearchCV object with the following parameters:

estimator: The model you want to optimize.
param_grid: Dictionary of parameters to try.
cv: Number of cross-validation folds (here, 5).
scoring: Metric used to compare models, for example 'accuracy', 'precision', 'recall' or 'f1'.
verbose: Prints progress while running if set to 1.

[477]:

model = svm()
grid = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    verbose=1
)

Once the object is created, we call fit to run the grid search:

[478]:

grid.fit(X, y)

Fitting 5 folds for each of 12 candidates, totalling 60 fits

[478]:

GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': [0.1, 1, 10], 'gamma': ['scale', 'auto'],
                         'kernel': ['linear', 'rbf']},
             scoring='accuracy', verbose=1)

To see the results, we can read the best_params_ and best_score_ attributes:

[479]:

print("Best parameters:", grid.best_params_)
print("Best accuracy:", grid.best_score_)

Best parameters: {'C': 0.1, 'gamma': 'scale', 'kernel': 'linear'}
Best accuracy: 1.0
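Besides the single best result, a fitted GridSearchCV keeps the scores of every candidate in its cv_results_ attribute. The sketch below refits a search on the same iris setup (with its own variable names, introduced here for illustration) and shows the candidates ranked by mean score, which helps to see whether several configurations are tied.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

data_cv = load_iris()
X_cv = data_cv.data[:, :2]
y_cv = (data_cv.target == 0).astype(int)

search = GridSearchCV(
    estimator=SVC(),
    param_grid={
        "C": [0.1, 1, 10],
        "kernel": ["linear", "rbf"],
        "gamma": ["scale", "auto"],
    },
    cv=5,
    scoring="accuracy",
)
search.fit(X_cv, y_cv)

# cv_results_ is a dict of arrays with one entry per candidate.
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score").to_string(index=False))
```

When several candidates share rank 1, best_params_ reports only one of them, so a tie in this table is worth knowing about.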

We can obtain the best model through the best_estimator_ attribute:

[480]:

best_model = grid.best_estimator_

Then, we can make predictions and compute more metrics:

[481]:

predictions = best_model.predict(X)
print("\nReport:\n", classification_report(y, predictions))


Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       100
           1       1.00      1.00      1.00        50

    accuracy                           1.00       150
   macro avg       1.00      1.00      1.00       150
weighted avg       1.00      1.00      1.00       150
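The report above is computed on the same rows that were used to run the search, so it tends to be optimistic. A common alternative is to hold back part of the data before searching and to report on that part only. The sketch below shows one possible way to do this; the 30% test size, the stratified split, the seed and the reduced grid are illustrative choices, not settings taken from this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

data_ho = load_iris()
X_ho = data_ho.data[:, :2]
y_ho = (data_ho.target == 0).astype(int)

# Hold back 30% of the rows; stratify keeps the class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_ho, y_ho, test_size=0.3, stratify=y_ho, random_state=42
)

search_ho = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=5)
search_ho.fit(X_tr, y_tr)  # the search only ever sees the training rows

# predict uses the best estimator, refitted on all training rows (refit=True by default)
held_out_pred = search_ho.predict(X_te)
print(classification_report(y_te, held_out_pred))
```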

Exercise

In this exercise, we will use the fish market dataset. The fish market dataset is a collection of data related to different species of fish and their characteristics.

You can download the dataset here: Link

You can find more information about the dataset here: Link

[482]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import seaborn as sns
from sklearn.metrics import mean_absolute_error, mean_squared_error, accuracy_score, confusion_matrix, classification_report

df = pd.read_csv("Fishers maket.csv")
df.head()

[482]:

	Species	Weight	Length1	Length2	Length3	Height	Width
0	Bream	242.0	23.2	25.4	30.0	11.5200	4.0200
1	Bream	290.0	24.0	26.3	31.2	12.4800	4.3056
2	Bream	340.0	23.9	26.5	31.1	12.3778	4.6961
3	Bream	363.0	26.3	29.0	33.5	12.7300	4.4555
4	Bream	430.0	26.5	29.0	34.0	12.4440	5.1340

After reading the dataset, let’s check for NaN values and look at some summary statistics.

[483]:

df.describe()

[483]:

	Weight	Length1	Length2	Length3	Height	Width
count	159.000000	159.000000	159.000000	159.000000	159.000000	159.000000
mean	398.326415	26.247170	28.415723	31.227044	8.970994	4.417486
std	357.978317	9.996441	10.716328	11.610246	4.286208	1.685804
min	0.000000	7.500000	8.400000	8.800000	1.728400	1.047600
25%	120.000000	19.050000	21.000000	23.150000	5.944800	3.385650
50%	273.000000	25.200000	27.300000	29.400000	7.786000	4.248500
75%	650.000000	32.700000	35.500000	39.650000	12.365900	5.584500
max	1650.000000	59.000000	63.400000	68.000000	18.957000	8.142000

There are no missing values: every numeric column has the same count (159). We decided to classify Perch versus No Perch. The No Perch class contains the Bream and Roach species.

[484]:

df = df[df["Species"].isin(["Perch", "Bream", "Roach"])].copy()

Let’s create our input features variable (X) and our target variable (y). The target is 1 if the fish is a Perch and 0 otherwise:

[485]:

X = df.drop("Species", axis=1)
y = (df["Species"] == "Perch").astype(int)

We decided to use the Weight and Height variables as input features. Feel free to use other variables and see what happens!

[486]:

X_2 = df[["Weight", "Height"]]

Task 1

Split the data into train and test sets. Use X_2 as the input features variable.

[ ]:

If we take a look at our features, we can see that Weight has much larger values than Height:

[488]:

print(X_train)

     Weight   Height
15    600.0  15.4380
21    685.0  15.9936
101   218.0   7.1680
26    720.0  16.3618
127  1000.0  12.4888
..      ...      ...
86    120.0   6.1100
102   300.0   8.3230
41    110.0   6.1677
30    920.0  18.0369
76     70.0   4.5880

[77 rows x 2 columns]

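Models like SVM or Logistic Regression are sensitive to the scale of the features: a feature with large values contributes more to distances and to the regularization penalty simply because its numbers are bigger. Without scaling, Weight would dominate the process, so we need to scale the data.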

We are going to use StandardScaler, which performs the following: $z = \frac{x - \mu}{\sigma}$

[489]:

scaler = StandardScaler()
scaler.fit(X_train)

[489]:

StandardScaler()

We use fit so that the scaler learns, from the training data, the mean and the standard deviation of each feature.

[490]:

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

We then use transform to scale each value using this: $x' = \frac{x - \text{mean}}{\text{std}}$

We also transform the test data using the same scaler as the training data. Since the test data simulates real-world, unseen data, it must be transformed using the statistics learned from the training data.
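The fit and transform steps can be checked by hand. The following sketch uses a few made-up weight-like and height-like values (not rows from the dataset) and compares StandardScaler with a manual computation of (x - mean) / std, where the mean and standard deviation come from the training rows only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test values (weight-like and height-like columns).
train = np.array([[600.0, 15.4], [218.0, 7.2], [1000.0, 12.5], [120.0, 6.1]])
test = np.array([[300.0, 8.3], [920.0, 18.0]])

demo_scaler = StandardScaler().fit(train)

# fit stores the per-column mean and standard deviation of the training data.
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population standard deviation (ddof=0)

# transform applies (x - mean) / std with the training statistics,
# to the training data and to the test data alike.
manual_train = (train - mu) / sigma
manual_test = (test - mu) / sigma

print(np.allclose(demo_scaler.transform(train), manual_train))
print(np.allclose(demo_scaler.transform(test), manual_test))
print("Learned means:", demo_scaler.mean_)
print("Learned standard deviations:", demo_scaler.scale_)
```

After scaling, the training columns have mean 0 and standard deviation 1; the test columns generally do not, because they are scaled with the training statistics.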

Task 2

Create a Logistic Regression or SVM model and a param grid with the hyperparameters you want to try:

[ ]:

model =...

[ ]:

param_grid = {

}

Task 3

Perform gridsearch and print the best params and best score. Remember to use the variable X_train_scaled.

[ ]:

[502]:

# Print best_params
# Print best score

Task 4

Evaluate the performance. Remember to use X_test_scaled.

[ ]:

y_pred = ....

[ ]:

print() # Classification report

              precision    recall  f1-score   support

           0       0.93      0.76      0.84        17
           1       0.80      0.94      0.86        17

    accuracy                           0.85        34
   macro avg       0.86      0.85      0.85        34
weighted avg       0.86      0.85      0.85        34

[ ]:

cm = ... # Confusion Matrix

<Axes: >

[Figure: confusion matrix plot]