Project: MycalCOFI

The MycalCOFI dataset is a variation of the CalCOFI (California Cooperative Oceanic Fisheries Investigations) dataset, which contains oceanographic measurements collected off the coast of California from 1949 to 2016.

Each sample includes the date and geographic position (latitude and longitude) of the measurement, the depth, the water temperature, and the salinity. From these original variables, additional features have been derived with oceanographic significance: an approximation of water density, the season of the year, the approximate distance to the Californian coastline, and a coastal upwelling index.
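
As a rough illustration of how two of these derived features could be obtained, the sketch below uses a linear equation of state for density and a month-to-season mapping. This is not necessarily how the MycalCOFI features were computed: the function names, the reference values, and the coefficients are assumptions chosen as typical textbook magnitudes.

```python
import pandas as pd

# Illustrative reference state and coefficients (assumptions, not taken from the dataset).
RHO0, T0, S0 = 1027.0, 10.0, 35.0  # density (kg/m^3), temperature (deg C), salinity
ALPHA, BETA = 1.7e-4, 7.6e-4       # thermal expansion and haline contraction coefficients


def approx_density(temperature, salinity):
    """Linear equation of state: colder and saltier water is denser."""
    return RHO0 * (1.0 - ALPHA * (temperature - T0) + BETA * (salinity - S0))


def season_from_date(dates):
    """Map dates to meteorological Northern Hemisphere seasons."""
    months = pd.to_datetime(pd.Series(dates)).dt.month
    names = {
        12: "winter", 1: "winter", 2: "winter",
        3: "spring", 4: "spring", 5: "spring",
        6: "summer", 7: "summer", 8: "summer",
        9: "autumn", 10: "autumn", 11: "autumn",
    }
    return months.map(names)
```

Both functions accept scalars or pandas columns, so they can be applied directly to a loaded DataFrame.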

The target variable is the water mass, a category that classifies each sample into four types: Pacific Subtropical (warm and saline waters from the south), Pacific Subarctic (cold and less saline waters from the north), Upwelling (coastal upwelling, characterized by cold and less saline waters), and Deep Water (cold and saline waters). This classification is based on temperature and salinity values following the classical T-S diagram from physical oceanography.


Data Preparation:

Load mycalcofi_train.csv and mycalcofi_test.csv and perform an initial exploration of the training dataset.
Normalize the data.
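
The two preparation steps could be sketched as follows. The target column name `water_mass` is hypothetical and should be checked against the CSV header, and `StandardScaler` is only one possible normalization choice. The scaler is fitted on the training set alone so that no test-set statistics leak into the models.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

TARGET = "water_mass"  # hypothetical column name; check the CSV header


def explore(train, target=TARGET):
    """Print a first overview of the training set."""
    train.info()
    print(train.describe())
    print(train[target].value_counts())  # class balance of the target


def load_and_scale(train_path, test_path, target=TARGET):
    """Load both CSV files and standardize the numeric features."""
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    # Keep numeric predictors only; categorical columns (for example the
    # season) would need encoding first, which is omitted in this sketch.
    features = train.drop(columns=[target]).select_dtypes("number").columns

    # Fit on the training data only, then apply the same transform to both sets.
    scaler = StandardScaler().fit(train[features])
    X_train = pd.DataFrame(
        scaler.transform(train[features]), columns=features, index=train.index
    )
    X_test = pd.DataFrame(
        scaler.transform(test[features]), columns=features, index=test.index
    )
    return X_train, train[target], X_test, test[target], scaler
```

A typical call would be `load_and_scale("mycalcofi_train.csv", "mycalcofi_test.csv")`.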

Classifiers with scikit-learn:

Use at least three of the classifiers seen in block 1. One of them must be XGBoost.
For each classifier, find the best hyperparameter configuration automatically using GridSearchCV with k-fold cross-validation on the training set, then evaluate the best model obtained on the test set.
Check whether you can obtain better results by removing or transforming features.
Discuss the evaluation metrics.
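
One possible shape for the tuning loop is sketched below. `LogisticRegression` and `RandomForestClassifier` are stand-ins for whichever classifiers were covered in block 1, the parameter grids are illustrative rather than recommended, and macro-averaged F1 is an assumed scoring choice. XGBoost lives in the separate `xgboost` package, so its import is guarded here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import LabelEncoder

try:
    from xgboost import XGBClassifier  # separate package: pip install xgboost
except ImportError:
    XGBClassifier = None


def candidate_models():
    """Return {name: (estimator, param_grid)}; the grids are illustrative."""
    models = {
        "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
        "random_forest": (
            RandomForestClassifier(random_state=0),
            {"n_estimators": [100, 300], "max_depth": [None, 10]},
        ),
    }
    if XGBClassifier is not None:
        models["xgboost"] = (
            XGBClassifier(random_state=0),
            {"n_estimators": [100, 300], "max_depth": [3, 6]},
        )
    return models


def tune_and_evaluate(X_train, y_train, X_test, y_test, models=None, k=5):
    """Grid-search each model with k-fold CV, then score the winner on the test set."""
    models = models or candidate_models()
    # XGBoost expects integer labels 0..n-1, so encode once for every model.
    encoder = LabelEncoder().fit(y_train)
    y_tr, y_te = encoder.transform(y_train), encoder.transform(y_test)
    class_names = [str(c) for c in encoder.classes_]
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)

    results = {}
    for name, (estimator, grid) in models.items():
        search = GridSearchCV(estimator, grid, cv=cv, scoring="f1_macro")
        search.fit(X_train, y_tr)  # refits the best configuration on all training data
        y_pred = search.predict(X_test)
        results[name] = {
            "best_params": search.best_params_,
            "cv_f1_macro": search.best_score_,
            "test_report": classification_report(
                y_te, y_pred,
                labels=list(range(len(class_names))),
                target_names=class_names,
                output_dict=True, zero_division=0,
            ),
        }
    return results
```

The per-class rows of `test_report` are a natural starting point for the metric discussion, since overall accuracy can hide weak performance on a less frequent water mass.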

MLP with scikit-learn

Implement an MLPClassifier and find the best hyperparameter configuration (architecture) using Grid Search.
Evaluate the best model on the test set.
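
A minimal sketch of the architecture search follows. The candidate layer sizes, the `alpha` values, and the scoring metric are assumptions to be adapted; `hidden_layer_sizes` is the `MLPClassifier` parameter that defines the architecture.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Illustrative grid: each tuple lists the width of every hidden layer.
PARAM_GRID = {
    "hidden_layer_sizes": [(32,), (64,), (64, 32)],
    "alpha": [1e-4, 1e-3],  # L2 regularization strength
}


def tune_mlp(X_train, y_train, param_grid=PARAM_GRID, k=5, max_iter=500):
    """Grid-search an MLPClassifier with k-fold cross-validation."""
    mlp = MLPClassifier(max_iter=max_iter, random_state=0)
    search = GridSearchCV(mlp, param_grid, cv=k, scoring="f1_macro")
    search.fit(X_train, y_train)  # the best model is refitted on the full training set
    return search
```

After fitting, `search.best_params_` holds the selected architecture and `search.score(X_test, y_test)` evaluates the refitted model on the test set with the same metric.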

MLP with PyTorch

Implement an MLP equivalent to the scikit-learn model using nn.Sequential.
Train the model and save the weights.
Generate an independent block of code that loads the saved model and evaluates the test set without retraining.
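
The three steps could look roughly like the sketch below. The hidden layer sizes, the optimizer settings, the full-batch training loop, and the file name `mlp_weights.pt` are assumptions; in practice the architecture would mirror the best one found with scikit-learn. Labels are assumed to be already encoded as integers from 0 to n_classes - 1.

```python
import numpy as np
import torch
from torch import nn


def build_mlp(n_features, n_classes, hidden=(64, 32)):
    """ReLU MLP; ReLU is also the default activation of MLPClassifier."""
    layers, width = [], n_features
    for h in hidden:
        layers += [nn.Linear(width, h), nn.ReLU()]
        width = h
    layers.append(nn.Linear(width, n_classes))  # raw logits, no softmax layer
    return nn.Sequential(*layers)


def train(model, X, y, epochs=200, lr=1e-3):
    """Full-batch training with Adam and cross-entropy loss."""
    X = torch.as_tensor(np.asarray(X), dtype=torch.float32)
    y = torch.as_tensor(np.asarray(y), dtype=torch.long)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()  # applies log-softmax to the logits internally
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
    return model


def save_weights(model, path="mlp_weights.pt"):
    """Store only the parameters (state_dict), not the whole model object."""
    torch.save(model.state_dict(), path)


# --- Independent evaluation step: rebuilds the network and loads the weights ---
def load_and_evaluate(path, X_test, y_test, n_features, n_classes, hidden=(64, 32)):
    """Return (accuracy, predictions) on the test set without retraining."""
    model = build_mlp(n_features, n_classes, hidden)  # must match the trained architecture
    model.load_state_dict(torch.load(path))
    model.eval()
    with torch.no_grad():
        logits = model(torch.as_tensor(np.asarray(X_test), dtype=torch.float32))
    y_pred = logits.argmax(dim=1).numpy()
    return float((y_pred == np.asarray(y_test)).mean()), y_pred
```

In a notebook, `load_and_evaluate` and `build_mlp` would sit in their own cell together with the imports, so that the cell runs after a kernel restart with only the saved weights file and the test data available.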

Comparison and Conclusions

Build a comparative table with metrics from all models (accuracy, precision, recall, and F1-score).
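
One way to assemble such a table is sketched below, assuming each model's test-set predictions have been collected in a dictionary. Macro averaging is an assumed choice that weights the four water masses equally; weighted averaging would be a reasonable alternative if the classes are imbalanced.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def comparison_table(y_true, predictions):
    """Build one row of metrics per model.

    predictions: {model_name: predicted labels on the test set}
    """
    rows = []
    for name, y_pred in predictions.items():
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average="macro", zero_division=0
        )
        rows.append({
            "model": name,
            "accuracy": accuracy_score(y_true, y_pred),
            "precision": precision,
            "recall": recall,
            "f1": f1,
        })
    return pd.DataFrame(rows).set_index("model").round(3)
```

Calling `comparison_table(y_test, {"xgboost": pred_xgb, "mlp_sklearn": pred_mlp})`, with hypothetical prediction arrays, returns a DataFrame that renders as a table in the notebook.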

Delivery

The project must be submitted as a Jupyter Notebook (.ipynb) file containing all the code, results, and markdown cells explaining each step and the decisions made. Additionally, a PDF export of the notebook must be included so that the results and visualisations are easily accessible without needing to run the code. Both files must be submitted together.