Multi-layer Perceptron in PyTorch

Introduction to PyTorch

PyTorch is a deep learning library developed by Meta for building and training neural networks flexibly and efficiently. Unlike scikit-learn, which hides the training process behind a single call, PyTorch gives full control over each step of the training loop, which makes it especially suitable for complex architectures such as RNNs, CNNs, or Transformers. The fundamental difference is that in PyTorch we write the training process explicitly: what scikit-learn does automatically with model.fit(), we must implement step by step.

In this section we’ll learn to build our own MLP to discover how the PyTorch library works.

Building an MLP

Using the nn.Sequential structure is the simplest way to build a network in PyTorch. It allows defining the architecture as an ordered sequence of layers, similar to how we specify hidden_layer_sizes in scikit-learn.

The following example defines an MLP for the problem we saw in the previous section. Here we must specify the number of units in both the input layer and the output layer. The network definition itself does not distinguish between classification and regression problems: the size of the output layer, the output activation, and the loss function determine which task the network solves.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 64),   # input layer → first hidden layer
    nn.ReLU(),
    nn.Linear(64, 32),   # first hidden layer → second hidden layer
    nn.ReLU(),
    nn.Linear(32, 3)     # second hidden layer → output layer (3 classes)
)

Each nn.Linear(in, out) defines a fully connected layer with the corresponding weights \(W\) and biases \(b\). Activation functions are added explicitly between layers, unlike scikit-learn where they are specified with the activation parameter.
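To make the role of \(W\) and \(b\) concrete, the following minimal sketch inspects a single nn.Linear layer and reproduces its output by hand. The layer sizes match the network above; the random input batch and the seed are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # illustrative seed so the run is repeatable

layer = nn.Linear(10, 64)

# PyTorch stores the weight matrix with shape (out_features, in_features)
weight_shape = tuple(layer.weight.shape)   # (64, 10)
bias_shape = tuple(layer.bias.shape)       # (64,)

# A batch of 5 samples with 10 features each (random, for illustration only)
x = torch.randn(5, 10)

# The layer computes x @ W^T + b
manual = x @ layer.weight.T + layer.bias
same = torch.allclose(layer(x), manual, atol=1e-5)

# Count the trainable parameters of the full MLP defined above
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 3),
)
n_params = sum(p.numel() for p in model.parameters())

print(weight_shape, bias_shape, same, n_params)
```

The parameter count is (10·64 + 64) + (64·32 + 32) + (32·3 + 3) = 2883; the ReLU layers contribute no parameters.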

Complete Example

Below we explain the code needed to train and evaluate an MLP. We’ll solve the same problem as in the previous section so that we can observe the similarities and differences between the two libraries.

Data generation and preparation

import torch
import torch.nn as nn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Data generation and splitting (same as with scikit-learn)
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_classes=3,
    n_informative=6,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

# Conversion of NumPy arrays to PyTorch tensors: This step is mandatory
X_train = torch.tensor(X_train, dtype=torch.float32)
X_test  = torch.tensor(X_test,  dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.long)
y_test  = torch.tensor(y_test,  dtype=torch.long)

Next, we define the model, the loss function and the optimizer.

# Model definition. Here we have to define the layers and the activation functions between them.
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 3)
)

# Definition of loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001) # The learning rate is defined here

Training loop

The training loop follows the four basic steps that we explained in the introduction: forward pass, loss calculation, backward pass, and weight update.

epochs = 200
for epoch in range(epochs):
    model.train()

    # 1. Forward pass
    y_pred = model(X_train)

    # 2. Loss calculation
    loss = criterion(y_pred, y_train)

    # 3. Backward pass (reset the old gradients first, then compute the new ones)
    optimizer.zero_grad()
    loss.backward()

    # 4. Weight update
    optimizer.step()
    # Display partial results
    if (epoch + 1) % 50 == 0:
        print(f"Epoch {epoch+1}/{epochs} - Loss: {loss.item():.4f}")
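To compare training runs, it helps to keep the loss of every epoch rather than only printing it. The sketch below follows the same four steps as the loop above and stores each value in a list. The synthetic data, the `loss_history` name, the seed, and the learning rate are assumptions for illustration, not part of the original example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # illustrative seed

# Synthetic stand-in data: 200 samples, 10 features, 3 classes.
# The label is the index of the largest of the first three features.
X = torch.randn(200, 10)
y = X[:, :3].argmax(dim=1)

model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 3),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

epochs = 50
loss_history = []  # one loss value per epoch

for epoch in range(epochs):
    model.train()
    y_pred = model(X)             # 1. forward pass
    loss = criterion(y_pred, y)   # 2. loss calculation
    optimizer.zero_grad()         # 3. backward pass
    loss.backward()
    optimizer.step()              # 4. weight update
    loss_history.append(loss.item())  # store a plain Python float

print(f"first: {loss_history[0]:.4f}  last: {loss_history[-1]:.4f}")
```

Storing loss.item() instead of the loss tensor keeps only a Python float, so the computation graph of each epoch can be freed. The resulting list can be passed directly to a plotting function such as matplotlib's plt.plot.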

Model evaluation

Finally, we use the test set to evaluate the model.

model.eval()
with torch.no_grad():
    y_pred_test = model(X_test)
    predicted   = torch.argmax(y_pred_test, dim=1)
    accuracy    = (predicted == y_test).float().mean()
    print(f"\nAccuracy: {accuracy.item():.4f}")

Aspects of this code that differ markedly from training with scikit-learn:

optimizer.zero_grad() is necessary because PyTorch accumulates gradients by default. If we don’t reset them, the gradients from the previous iteration are added to those of the current one, leading to incorrect weight updates.
model.eval() switches layers such as Dropout and BatchNorm to evaluation mode, while torch.no_grad() disables gradient tracking during evaluation, saving memory and computation.
nn.CrossEntropyLoss already applies the (log-)Softmax function internally, which is why the output layer has no activation function and the model outputs raw scores (logits).
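Two of the points above can be checked with a few lines of code. The following sketch first shows gradient accumulation on a single tensor, and then verifies that nn.CrossEntropyLoss gives the same value as log-softmax followed by the negative log-likelihood loss. All values are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# --- Gradient accumulation ---
w = torch.tensor([1.0, 2.0], requires_grad=True)
x = torch.tensor([3.0, 4.0])

(w * x).sum().backward()
first = w.grad.clone()       # gradient of sum(w * x) w.r.t. w is x

(w * x).sum().backward()     # no reset: the new gradient is added to the old one
accumulated = w.grad.clone()

w.grad.zero_()               # what optimizer.zero_grad() does for each parameter
(w * x).sum().backward()
after_reset = w.grad.clone()

print(first, accumulated, after_reset)

# --- CrossEntropyLoss works on raw logits ---
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 0.3]])
targets = torch.tensor([0, 2])

ce = nn.CrossEntropyLoss()(logits, targets)
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)
equivalent = torch.allclose(ce, manual)
print(ce.item(), manual.item(), equivalent)
```

Adding an explicit Softmax layer before nn.CrossEntropyLoss would therefore apply the normalization twice.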

Tensors

As you may have noticed, in the code we’ve converted NumPy ndarray arrays to the tensor data type. A tensor is the basic data structure of PyTorch, equivalent to a NumPy array but with the additional ability to run on a GPU and to calculate gradients automatically. A scalar, a vector, a matrix, or any n-dimensional array is represented as a tensor.

We can also create tensors manually, although this is rarely needed in practice:

import torch

# From a list
t = torch.tensor([1.0, 2.0, 3.0])

# Special tensors
torch.zeros(3, 4)      # 3x4 matrix of zeros
torch.ones(3, 4)       # 3x4 matrix of ones
torch.rand(3, 4)       # 3x4 matrix of random values between 0 and 1
torch.randn(3, 4)      # 3x4 matrix of random values with normal distribution
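The automatic gradient calculation mentioned above can be seen on a tiny example. This is a minimal sketch with made-up values: a tensor created with requires_grad=True records the operations applied to it, and backward() fills its .grad attribute.

```python
import torch

# requires_grad=True asks PyTorch to track operations on this tensor
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# y = x1^2 + x2^2 + x3^2
y = (x ** 2).sum()

# Compute dy/dx and store it in x.grad
y.backward()

print(x.grad)  # derivative of sum(x^2) is 2x
```

This is the same mechanism that loss.backward() uses in the training loop, applied to every weight and bias of the model.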

Data Types (dtype)

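The data type is important because PyTorch is strict about it. Simple element-wise operations promote their operands to a common type, but many others, such as matrix multiplication or loss functions, raise an error when the dtypes do not match what they expect.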

`dtype`	Description	Typical use
`torch.float32`	32-bit floating point	Features, network weights
`torch.float64`	64-bit floating point	High precision (rare)
`torch.long`	64-bit integer	Classification labels
`torch.bool`	Boolean	Masks

t = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)
print(t.dtype)   # torch.float32
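The sketch below illustrates what a dtype mismatch looks like and how to fix it with an explicit conversion. The tensors are made up for illustration; the exact error message depends on the PyTorch version.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)
b = torch.tensor([4.0, 5.0, 6.0], dtype=torch.float64)

# Element-wise operations promote to the wider type
promoted = (a + b).dtype

# Matrix/dot products require both operands to share a dtype
try:
    torch.matmul(a, b)
    mismatch_error = False
except RuntimeError as err:
    mismatch_error = True
    print(f"RuntimeError: {err}")

# Fix: convert one operand explicitly
fixed = torch.matmul(a, b.to(torch.float32))

# Class labels for nn.CrossEntropyLoss are expected as torch.long (int64)
labels = torch.tensor([0, 2, 1], dtype=torch.int32).long()

print(promoted, fixed.item(), labels.dtype)
```

This is why the complete example converts the features to torch.float32 and the labels to torch.long right after scaling.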

Conversion between NumPy and PyTorch

Converting a data structure between these two libraries is straightforward; you only need one operation in each direction:

import numpy as np

# NumPy → PyTorch
array = np.array([1.0, 2.0, 3.0])   # NumPy defaults to float64
t = torch.from_numpy(array)         # keeps the dtype: torch.float64

# PyTorch → NumPy
array = t.numpy()

Note: torch.from_numpy shares memory with the original NumPy array, and t.numpy() likewise shares memory with a CPU tensor. Modifying one modifies the other. If you want an independent copy, use torch.tensor(array).
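The memory sharing described in the note is easy to verify. In this minimal sketch (variable names are illustrative), the same array is converted both ways and then modified in place.

```python
import numpy as np
import torch

array = np.array([1.0, 2.0, 3.0])

shared = torch.from_numpy(array)   # shares memory with `array`
copied = torch.tensor(array)       # independent copy

# Modify the NumPy array in place
array[0] = 100.0

print(shared)   # the first element follows the change
print(copied)   # the copy keeps the original value
```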

Basic Operations

Some basic operations that are often useful, including those that report information about a tensor:

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

a + b                    # element-wise addition
a * b                    # element-wise product
torch.matmul(a, b)       # dot product (or matrix product)

# Tensor information
a.shape                  # dimensions
a.dtype                  # data type
a.device                 # cpu or cuda
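Since torch.matmul behaves differently depending on the number of dimensions, a short sketch with made-up shapes may help: two 1-D tensors give a scalar dot product, while two 2-D tensors give a matrix product whose shape follows the usual (n, k) × (k, m) → (n, m) rule.

```python
import torch

# 1-D @ 1-D: dot product, returned as a 0-dimensional tensor
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])
dot = torch.matmul(a, b)

# 2-D @ 2-D: matrix product, (4, 10) x (10, 3) -> (4, 3)
X = torch.ones(4, 10)
W = torch.ones(10, 3)
out = torch.matmul(X, W)

# The @ operator is shorthand for torch.matmul
same = torch.equal(out, X @ W)

print(dot.item(), out.shape, same)
```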

Graphics Card

GPUs are designed to perform many mathematical operations in parallel, making them much more efficient than CPUs for training neural networks. The main operations of a neural network (matrix multiplications and gradient calculation) especially benefit from this parallelization. CUDA is NVIDIA’s platform that allows taking advantage of the GPU from PyTorch. To find out if we have a GPU available:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

To use the GPU, you need to move both the model and the data to it. The model and the data must always be on the same device: mixing CPU and GPU tensors generates an error. The changes to the previous code are minimal:

# Move model to GPU
model = model.to(device)

# Move data to GPU
X_train = X_train.to(device)
X_test  = X_test.to(device)
y_train = y_train.to(device)
y_test  = y_test.to(device)

The rest of the code (training loop, evaluation) doesn’t need any changes, as PyTorch automatically manages operations on the corresponding device.
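A device-agnostic version of these steps can be sketched as follows. The code runs unchanged on a machine with or without a GPU; the small model and random data are assumptions for illustration.

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Modules are moved in place; .to() also returns the module itself
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 3)).to(device)

# Tensors are NOT moved in place: .to() returns a new tensor,
# so the result must be assigned back
X = torch.randn(8, 10)   # created on the CPU by default
X = X.to(device)

# Tensors can also be created directly on the target device
X_direct = torch.randn(8, 10, device=device)

out = model(X)
param_device = next(model.parameters()).device

print(out.device, param_device)
```

Writing X.to(device) without assigning the result is a common mistake: the original tensor stays where it was.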

Back to the CPU

Once the model has been trained on the GPU, it is sometimes necessary to move the results back to the CPU, for example to convert predictions to NumPy arrays or to use scikit-learn metrics, which do not support CUDA tensors. This can be done using the .cpu() method:

predictions = model(X_test).cpu().detach().numpy()

The .detach() call is needed to detach the tensor from the computation graph before converting it to NumPy; calling .numpy() on a tensor that still requires gradients raises an error.
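The following sketch shows the error that appears without .detach() and two equivalent ways of obtaining class predictions as a NumPy array. The single-layer model and the random inputs are assumptions for illustration.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)  # illustrative seed
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(10, 3).to(device)
X = torch.randn(6, 10, device=device)

logits = model(X)   # still attached to the computation graph

# Converting directly fails because the tensor requires gradients
try:
    logits.cpu().numpy()
    raised = False
except RuntimeError as err:
    raised = True
    print(f"RuntimeError: {err}")

# Option 1: detach explicitly, then move to the CPU and convert
preds_a = logits.detach().cpu().argmax(dim=1).numpy()

# Option 2: run the forward pass under no_grad, so nothing needs detaching
with torch.no_grad():
    preds_b = model(X).argmax(dim=1).cpu().numpy()

print(type(preds_a), preds_a.shape)
```

Both arrays can be passed to NumPy or scikit-learn functions such as accuracy_score.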

Saving and Loading Models

Once the model has been trained, it can be saved to disk so that it can be reused later without the need to retrain it. PyTorch allows saving the model weights at any point, including during training, for example, saving the best model found so far based on validation loss. Later, the saved weights can be loaded into a model with the same architecture to make predictions on new samples, which is especially useful in production environments or when sharing models with other researchers.

Once training is complete we can use the following command to save the model weights:

torch.save(model.state_dict(), 'NameOfTheModel') # Typically model.pth

At any time, we can load the model. To load the weights, you must first instantiate the model with the same architecture and then load the weights:

# Instantiate model with same architecture
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 3)
)

# Load the weights
model.load_state_dict(torch.load('NameOfTheModel')) # Typically model.pth
model.eval()

Why do we use state_dict? A state_dict is simply a Python dictionary that maps the name of each parameter (for example, the weights and biases of each layer) to its tensor. Saving only the weights (not the entire model) is more robust against code changes and PyTorch versions.
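To see what a state_dict contains and to check that saving and loading preserve the model, here is a minimal round-trip sketch. The `make_model` helper and the temporary file path are hypothetical names introduced only for this example; map_location="cpu" is an optional argument that lets weights saved on a GPU be loaded on a machine without one.

```python
import os
import tempfile

import torch
import torch.nn as nn


def make_model():
    # Hypothetical helper: rebuilds the same architecture every time
    return nn.Sequential(
        nn.Linear(10, 64),
        nn.ReLU(),
        nn.Linear(64, 32),
        nn.ReLU(),
        nn.Linear(32, 3),
    )


torch.manual_seed(0)  # illustrative seed
model = make_model()

# Keys are "<position in Sequential>.<parameter name>"; ReLU layers have none
keys = list(model.state_dict().keys())
print(keys)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pth")
    torch.save(model.state_dict(), path)

    # A fresh model has different random weights until the saved ones are loaded
    restored = make_model()
    restored.load_state_dict(torch.load(path, map_location="cpu"))

model.eval()
restored.eval()

x = torch.randn(4, 10)
with torch.no_grad():
    identical = torch.equal(model(x), restored(x))

print(identical)
```

The same torch.save call can be placed inside the training loop, for example whenever the validation loss improves, to keep the best weights seen so far.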

Exercises

In these exercises we’ll use the same dataset as in the previous block: Ocean Quality Dataset

Train an MLP with the same configuration as in scikit-learn, but now using PyTorch, for the classification task.
Repeat the training with different values of the learning rate and visualize how the evolution of the loss changes. Remember that sensible values for this parameter are small and positive (close to 0). Extra: Make the plot using the matplotlib library.
Save the best model. In another cell or Python script, load it and make predictions without retraining.