{
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/bmalcover/AppOC/blob/main/docs/notebooks/03_Series/03_LSTM_GRU.ipynb\">\n",
    "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
    "</a>"
   ],
   "id": "c814b5b80358b1b"
  },
  {
   "metadata": {
    "collapsed": true
   },
   "cell_type": "markdown",
   "source": [
    "# LSTM and GRU\n",
    "\n",
    "\n",
    "\n",
    "## LSTM (*Long Short-Term Memory*)\n",
    "\n",
    "LSTM was proposed by Hochreiter and Schmidhuber in 1997 as a solution to the vanishing gradient problem of simple RNNs. The key idea is to introduce a **cell state** $c_t$: a long-term memory vector that flows along the sequence and that the network can learn to read, write, and erase selectively through three gates:\n",
    "\n",
    "- **Forget gate** $f_t$: decides what information from the previous cell state is forgotten.\n",
    "- **Input gate** $i_t$: decides what new information is written to the cell state.\n",
    "- **Output gate** $o_t$: decides which part of the cell state is used to generate the hidden state $h_t$.\n",
    "\n",
    "![LSTM](../../_static/03/LSTM.png \"LSTM\")\n",
    "\n",
    "$$f_t = \\sigma(W_f \\cdot [h_{t-1}, x_t] + b_f),$$\n",
    "$$i_t = \\sigma(W_i \\cdot [h_{t-1}, x_t] + b_i),$$\n",
    "$$\\tilde{c}_t = \\tanh(W_c \\cdot [h_{t-1}, x_t] + b_c),$$\n",
    "$$c_t = f_t \\odot c_{t-1} + i_t \\odot \\tilde{c}_t,$$\n",
    "$$o_t = \\sigma(W_o \\cdot [h_{t-1}, x_t] + b_o),$$\n",
    "$$h_t = o_t \\odot \\tanh(c_t).$$\n",
    "\n",
    "where $\\sigma$ is the sigmoid function and $\\odot$ is the element-wise product.\n",
    "\n",
    "\n",
    "### Parameters of `nn.LSTM`\n",
    "\n",
    "`nn.LSTM` shares the same parameters as `nn.RNN` with one addition:\n",
    "\n",
    "- `input_size`: number of input variables at each time step.\n",
    "- `hidden_size`: dimension of the hidden state vector $h_t$ and the cell state $c_t$.\n",
    "- `num_layers`: number of stacked LSTM layers.\n",
    "- `batch_first`: if `True`, the input tensor is expected with shape `[batch, seq_len, features]`.\n",
    "- `dropout`: applies dropout between layers when `num_layers > 1`.\n",
    "- `bidirectional`: if `True`, creates a bidirectional LSTM that processes the sequence in both directions.\n",
    "\n",
    "Unlike `nn.RNN`, the `forward` method of `nn.LSTM` returns `(out, (h_n, c_n))`, i.e., both the hidden state and the cell state. This must be taken into account when defining the model's `forward` method.\n",
    "\n",
    "### LSTM Model\n",
    "\n",
    "```python\n",
    "import torch.nn as nn\n",
    "\n",
    "class LSTMModel(nn.Module):\n",
    "    def __init__(self, input_size=1, hidden_size=32, num_layers=1):\n",
    "        super(LSTMModel, self).__init__()\n",
    "        self.lstm = nn.LSTM(\n",
    "            input_size=input_size,\n",
    "            hidden_size=hidden_size,\n",
    "            num_layers=num_layers,\n",
    "            batch_first=True\n",
    "        )\n",
    "        self.fc = nn.Linear(hidden_size, 1)\n",
    "\n",
    "    def forward(self, x):\n",
    "        out, (h_n, c_n) = self.lstm(x) # difference with RNN\n",
    "        out = self.fc(out[:, -1, :])  # take the last time step\n",
    "        return out.squeeze()\n",
    "```\n",
    "\n",
    "### GRU (*Gated Recurrent Unit*)\n",
    "\n",
    "The GRU was proposed by Cho et al. in 2014 as a simplification of the LSTM. It eliminates the cell state and reduces the three gates to two:\n",
    "\n",
    "- **Reset gate** $r_t$: controls how much information from the previous hidden state is discarded.\n",
    "- **Update gate** $z_t$: controls how much of the previous hidden state is preserved and how much is replaced by new information.\n",
    "\n",
    "![GRU](../../_static/03/GRU.png \"LSTM\")\n",
    "\n",
    "$$r_t = \\sigma(W_r \\cdot [h_{t-1}, x_t] + b_r)$$\n",
    "$$z_t = \\sigma(W_z \\cdot [h_{t-1}, x_t] + b_z)$$\n",
    "$$\\tilde{h}_t = \\tanh(W_h \\cdot [r_t \\odot h_{t-1}, x_t] + b_h)$$\n",
    "$$h_t = (1 - z_t) \\odot h_{t-1} + z_t \\odot \\tilde{h}_t$$\n",
    "\n",
    "### Parameters of `nn.GRU`\n",
    "\n",
    "`nn.GRU` has exactly the same parameters as `nn.RNN`:\n",
    "\n",
    "- `input_size`: number of input variables at each time step\n",
    "- `hidden_size`: dimension of the hidden state vector $h_t$\n",
    "- `num_layers`: number of stacked GRU layers\n",
    "- `batch_first`: if `True`, the input tensor is expected with shape `[batch, seq_len, features]`\n",
    "- `dropout`: applies dropout between layers when `num_layers > 1`\n",
    "- `bidirectional`: if `True`, creates a bidirectional GRU\n",
    "\n",
    "> **Note:** unlike LSTM, `nn.GRU` returns `(out, h_n)`, without a cell state, just like `nn.RNN`.\n",
    "\n",
    "### GRU Model\n",
    "\n",
    "```python\n",
    "class GRUModel(nn.Module):\n",
    "    def __init__(self, input_size=1, hidden_size=32, num_layers=1):\n",
    "        super(GRUModel, self).__init__()\n",
    "        self.gru = nn.GRU(\n",
    "            input_size=input_size,\n",
    "            hidden_size=hidden_size,\n",
    "            num_layers=num_layers,\n",
    "            batch_first=True\n",
    "        )\n",
    "        self.fc = nn.Linear(hidden_size, 1)\n",
    "\n",
    "    def forward(self, x):\n",
    "        out, h_n = self.gru(x)\n",
    "        out = self.fc(out[:, -1, :])  # take the last time step\n",
    "        return out.squeeze()\n",
    "```\n",
    "\n",
    "\n",
    "## Differences Between RNN, LSTM, and GRU\n",
    "\n",
    "Let's see the differences of these three types of recurrent modules:\n",
    "\n",
    "| | RNN | LSTM | GRU |\n",
    "|---|---|---|---|\n",
    "| Long-term memory | No | Yes (cell state) | Yes (update gate) |\n",
    "| Number of gates | 0 | 3 | 2 |\n",
    "| Parameters | Few | Many | Intermediate |\n",
    "| Training speed | Fast | Slow | Intermediate |\n",
    "| Vanishing gradient | Yes | No | No |\n",
    "\n",
    "### When to Use Each One\n",
    "\n",
    "**RNN:** short sequences where long-term memory is not needed. In practice, it is rarely the best option.\n",
    "\n",
    "**LSTM:** when long-term dependencies are important and sufficient data and computational capacity are available. It is the default option for most time series problems.\n",
    "\n",
    "**GRU:** when a good balance between performance and computational efficiency is desired. With small datasets or limited resources, it often gives results similar to LSTM with less training time.\n",
    "\n",
    "\n",
    "## Exercise: Comparison of RNN, LSTM, and GRU\n",
    "\n",
    "1. Using the surface temperature (SST) time series  dataset and the same data preparation as in the previous section, train an LSTM and GRU with the same architecture and experimental configuration  and fill in the results table:\n",
    "\n",
    "| Model | MAE | RMSE | MAPE (%) |\n",
    "|-------|-----|------|----------|\n",
    "| RNN   |     |      |          |\n",
    "| LSTM  |     |      |          |\n",
    "| GRU   |     |      |          |\n",
    "\n",
    "Visualise the predictions of the three models in the same plot to facilitate comparison."
   ],
   "id": "6b0ca140b6f865c7"
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}