{ "cells": [ { "cell_type": "markdown", "id": "cd1c9998", "metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "ecc8769e", "metadata": {}, "source": [ "# LLM Fundamentals\n", "\n", "This notebook introduces the basic ideas behind Large Language Models (LLMs), what they are good at, where they fail, and how to evaluate them in real workflows.\n", "\n", "**Suggested duration:** 1 hour" ] }, { "cell_type": "markdown", "id": "d28c22e9", "metadata": {}, "source": [ "
\n", "Learning goals\n", "\n", "By the end of this notebook, you should be able to:\n", "\n", "- explain what an LLM does in simple terms\n", "- identify strong and weak use cases\n", "- recognize common risks such as hallucinations and bias\n", "- describe the role of Hugging Face and open models\n", "- connect model size and hardware to speed and memory usage\n", "- interpret benchmarks with healthy skepticism\n", "
" ] }, { "cell_type": "markdown", "id": "0b3d3458", "metadata": {}, "source": [ "
\n", "Table of Contents\n", "\n", "1. [What is an LLM?](#what-is-an-llm)\n", "2. [What LLMs are good at](#what-llms-are-good-at)\n", "3. [What LLMs are not good at](#what-llms-are-not-good-at)\n", "4. [Key concepts and risks](#key-concepts-and-risks)\n", "5. [Useful tools](#useful-tools)\n", "6. [Models and Hugging Face](#models-and-hugging-face)\n", "7. [Performance and hardware](#performance-and-hardware)\n", "8. [Benchmarks](#benchmarks)\n", "
" ] }, { "cell_type": "markdown", "id": "c5e72967", "metadata": {}, "source": [ "## What is an LLM?\n", "\n", "A Large Language Model is a neural network trained to predict the next token in a sequence. A neural network is a parameterized function made of many simple computational units arranged in layers; during training, its parameters (weights) are updated by gradient-based optimization, with the gradients computed by backpropagation, to reduce prediction error on large datasets.\n", "\n", "Modern LLMs are usually Transformer-based neural networks that model long-range token relationships with self-attention, which is a key reason they scale well on language tasks. During training they see large amounts of text, code, and other structured data, and they learn statistical patterns that let them continue text in useful ways.\n", "\n", "In practice, this means an LLM can:\n", "\n", "- summarize documents\n", "- rewrite text in a target style\n", "- extract structured information\n", "- answer questions about provided context\n", "- generate code, explanations, and drafts\n", "\n", "An important mental model is that an LLM is not a database with guaranteed facts. It is a model that produces likely continuations based on patterns learned during training and on the prompt it receives at inference time.\n", "\n", "This is why prompts matter so much: the model's answer depends heavily on the instructions, the context we provide, the examples we include, and the format we request.\n", "\n", "LLMs are powerful pattern-completion systems, but they do not automatically know which facts are true, current, or safe." ] }, { "cell_type": "markdown", "id": "0beb2dc8", "metadata": {}, "source": [ "### How an LLM answers a prompt\n", "\n", "At inference time, the model receives a prompt, converts it into tokens, and predicts which token should come next. It appends that token to the sequence and repeats the process until the answer is complete.\n", "\n", "A simplified pipeline is:\n", "\n", "1. 
The user writes a prompt.\n", "2. The text is tokenized.\n", "3. The model computes probabilities over possible next tokens.\n", "4. A decoding strategy selects the next token.\n", "5. The process repeats until the response ends.\n", "\n", "This helps explain why LLM outputs can be fluent even when they are wrong: fluency comes from learned language patterns, not from guaranteed access to truth." ] }, { "cell_type": "markdown", "id": "04e9e4dd", "metadata": {}, "source": [ "### Example: Training a Neural Network to Generate Well-Formed Word Sequences\n", "\n", "In this hands-on example, we build a small neural language model that learns local word-order patterns and generates grammatically plausible sequences.\n", "\n", "Step descriptions:\n", "- **3-token context window**: uses the previous three tokens to predict the next one; this defines the model's short-context memory.\n", "- **trainable embeddings**: represents each token as a dense vector learned during training, instead of a fixed one-hot vector.\n", "- **mini-batch training**: trains on small batches per iteration to stabilize gradients and speed up optimization.\n", "- **validation split**: reserves part of the data to evaluate generalization beyond the training set. With a corpus this small, validation loss rises while training loss keeps falling, a clear sign of overfitting.\n", "- **perplexity tracking**: measures how \"surprised\" the model is by the data, computed as the exponential of the average cross-entropy loss; lower perplexity usually means better next-token predictions.\n", "- **top-k prediction inspection**: checks the top-k most likely next tokens to understand local model behavior.\n", "- **temperature-based generation**: controls randomness during decoding; lower temperature is more conservative, higher temperature is more diverse.\n", "\n", "Special tokens (the four entries of `special_tokens` in the code below):\n", "- **Padding token**: used to make sequences the same length within a batch.\n", "- **Start token**: \"beginning of sequence\" token that marks where a sequence starts.\n", "- **End token**: \"end of sequence\" token that signals where a sequence ends.\n", "- **Out-of-vocabulary token**: 
\"unknown\" token used for words not present in the known vocabulary." ] }, { "cell_type": "code", "execution_count": null, "id": "83370c15", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Vocabulary size: 87\n", "Training examples: 117 | Validation examples: 30\n", "Epoch 0 | train_loss=4.4202 train_acc=14.53% train_ppl=83.12 | val_loss=4.4221 val_acc=23.33% val_ppl=83.27\n", "Epoch 100 | train_loss=1.5558 train_acc=51.28% train_ppl=4.74 | val_loss=4.9246 val_acc=20.00% val_ppl=137.64\n", "Epoch 200 | train_loss=0.7792 train_acc=74.36% train_ppl=2.18 | val_loss=6.6347 val_acc=23.33% val_ppl=761.09\n", "Epoch 300 | train_loss=0.5004 train_acc=82.05% train_ppl=1.65 | val_loss=7.4939 val_acc=10.00% val_ppl=1797.04\n", "Epoch 400 | train_loss=0.4281 train_acc=82.91% train_ppl=1.53 | val_loss=7.8468 val_acc=10.00% val_ppl=2557.45\n", "Epoch 499 | train_loss=0.4085 train_acc=82.91% train_ppl=1.50 | val_loss=8.2307 val_acc=10.00% val_ppl=3754.46\n", "\n", "Top next-token predictions:\n", "Context ['', '', 'el'] -> [('perro', 0.37899802163743385), ('gato', 0.1441843925076278), ('nino', 0.11311341056122948), ('coche', 0.09503225811707484), ('asistente', 0.08856609115669524)]\n", "Context ['la', 'cientifica', 'analiza'] -> [('los', 0.9702507142023091), ('un', 0.009214923142383103), ('', 0.005744469491607177), ('parque', 0.00375609648102881), ('sobre', 0.003254359781328852)]\n", "Context ['el', 'perro', 'corre'] -> [('por', 0.9633687517233784), ('sobre', 0.01362457898019891), ('modelos', 0.009082521432186824), ('del', 0.0032136696933594677), ('casa', 0.002558788846126655)]\n", "\n", "Generated samples:\n", "seed=('', '', 'el') -> perro ladra sobre el mar tranquilo\n", "seed=('', '', 'el') -> coche gira en la esquina estrecha\n", "\n", "seed=('', '', 'la') -> musica lee un articulo cientifico cientifico de lenguaje\n", "seed=('', '', 'la') -> biblioteca con un articulo cientifico en la silla con los\n", "\n", "seed=('la', 'cientifica', 
'analiza') -> un libro interesante\n", "seed=('la', 'cientifica', 'analiza') -> los datos del oceano\n", "\n" ] } ], "source": [ "import re\n", "import numpy as np\n", "\n", "np.random.seed(7)\n", "\n", "# 1) A richer toy corpus with several semantic patterns.\n", "corpus = [\n", "    \"el gato duerme en la silla\",\n", "    \"el gato observa la ventana\",\n", "    \"el perro corre por el parque\",\n", "    \"el perro ladra en la noche\",\n", "    \"la nina lee un libro interesante\",\n", "    \"la nina escribe una historia corta\",\n", "    \"el nino dibuja una casa pequena\",\n", "    \"el nino escribe una carta amable\",\n", "    \"la profesora explica la leccion con calma\",\n", "    \"el alumno responde la pregunta correcta\",\n", "    \"la musica suena en la sala grande\",\n", "    \"la lluvia cae sobre el tejado rojo\",\n", "    \"el tren llega a la estacion central\",\n", "    \"el coche gira en la esquina estrecha\",\n", "    \"el barco navega sobre el mar tranquilo\",\n", "    \"la cientifica analiza los datos del oceano\",\n", "    \"el investigador compara modelos de lenguaje\",\n", "    \"la estudiante resume un articulo cientifico\",\n", "    \"el asistente organiza notas para la reunion\",\n", "    \"la biblioteca guarda libros de historia\",\n", "]\n", "\n", "# 2) Simple tokenization.\n", "def tok(text):\n", "    return re.findall(r\"[a-zA-Záéíóúñ]+\", text.lower())\n", "\n", "tokenized = [tok(sentence) for sentence in corpus]\n", "special_tokens = [\"\", \"\", \"\", \"\"]\n", "vocab_words = sorted({word for sent in tokenized for word in sent})\n", "vocab = special_tokens + vocab_words\n", "word2id = {word: idx for idx, word in enumerate(vocab)}\n", "id2word = {idx: word for word, idx in word2id.items()}\n", "V = len(vocab)\n", "context_size = 3\n", "\n", "\n", "def encode(word):\n", "    return word2id.get(word, word2id[\"\"])\n", "\n", "\n", "# 3) Dataset: 3-word context -> next token.\n", "X_list, y_list = [], []\n", "for sent in tokenized:\n", "    padded = [\"\"] * context_size + sent + [\"\"]\n", "    for i in 
range(context_size, len(padded)):\n", "        context = padded[i - context_size:i]\n", "        target = padded[i]\n", "        X_list.append([encode(word) for word in context])\n", "        y_list.append(encode(target))\n", "\n", "X = np.array(X_list, dtype=np.int64)\n", "y = np.array(y_list, dtype=np.int64)\n", "\n", "# Train / validation split.\n", "indices = np.random.permutation(len(X))\n", "cut = int(0.8 * len(indices))\n", "train_idx, val_idx = indices[:cut], indices[cut:]\n", "X_train, y_train = X[train_idx], y[train_idx]\n", "X_val, y_val = X[val_idx], y[val_idx]\n", "\n", "print(f\"Vocabulary size: {V}\")\n", "print(f\"Training examples: {len(X_train)} | Validation examples: {len(X_val)}\")\n", "\n", "# 4) A slightly more realistic neural language model:\n", "#    embeddings -> hidden layer -> vocabulary logits.\n", "D = 24  # embedding size\n", "H = 64  # hidden layer size\n", "# weights and biases\n", "E = 0.05 * np.random.randn(V, D)\n", "W1 = 0.05 * np.random.randn(context_size * D, H)\n", "b1 = np.zeros((1, H))\n", "W2 = 0.05 * np.random.randn(H, V)\n", "b2 = np.zeros((1, V))\n", "\n", "\n", "def softmax(logits):\n", "    logits = logits - np.max(logits, axis=1, keepdims=True)\n", "    exp_logits = np.exp(logits)\n", "    return exp_logits / np.sum(exp_logits, axis=1, keepdims=True)\n", "\n", "\n", "def forward(batch_ids):\n", "    embeddings = E[batch_ids]\n", "    x = embeddings.reshape(batch_ids.shape[0], context_size * D)\n", "    h_pre = x @ W1 + b1\n", "    h = np.tanh(h_pre)\n", "    logits = h @ W2 + b2\n", "    probs = softmax(logits)\n", "    cache = (batch_ids, x, h)\n", "    return probs, cache\n", "\n", "\n", "def cross_entropy(probs, targets):\n", "    return -np.log(probs[np.arange(len(targets)), targets] + 1e-12).mean()\n", "\n", "\n", "def accuracy(probs, targets):\n", "    return (np.argmax(probs, axis=1) == targets).mean()\n", "\n", "\n", "def evaluate(X_data, y_data):\n", "    probs, _ = forward(X_data)\n", "    loss = cross_entropy(probs, y_data)\n", "    acc = accuracy(probs, y_data)\n", "    ppl = 
np.exp(loss)\n", "    return loss, acc, ppl\n", "\n", "\n", "# 5) Mini-batch gradient descent.\n", "lr = 0.12\n", "epochs = 500\n", "batch_size = 16\n", "\n", "for epoch in range(epochs):\n", "    order = np.random.permutation(len(X_train))\n", "    X_train = X_train[order]\n", "    y_train = y_train[order]\n", "\n", "    for start in range(0, len(X_train), batch_size):\n", "        xb = X_train[start:start + batch_size]\n", "        yb = y_train[start:start + batch_size]\n", "        probs, (batch_ids, x, h) = forward(xb)\n", "        n = len(xb)\n", "\n", "        dlogits = probs.copy()\n", "        dlogits[np.arange(n), yb] -= 1.0\n", "        dlogits /= n\n", "\n", "        dW2 = h.T @ dlogits\n", "        db2 = dlogits.sum(axis=0, keepdims=True)\n", "\n", "        dh = dlogits @ W2.T\n", "        dh_pre = dh * (1.0 - h ** 2)\n", "\n", "        dW1 = x.T @ dh_pre\n", "        db1 = dh_pre.sum(axis=0, keepdims=True)\n", "\n", "        dx = (dh_pre @ W1.T).reshape(n, context_size, D)\n", "        dE = np.zeros_like(E)\n", "        for pos in range(context_size):\n", "            np.add.at(dE, batch_ids[:, pos], dx[:, pos, :])\n", "\n", "        E -= lr * dE\n", "        W1 -= lr * dW1\n", "        b1 -= lr * db1\n", "        W2 -= lr * dW2\n", "        b2 -= lr * db2\n", "\n", "    if epoch % 100 == 0 or epoch == epochs - 1:\n", "        train_loss, train_acc, train_ppl = evaluate(X_train, y_train)\n", "        val_loss, val_acc, val_ppl = evaluate(X_val, y_val)\n", "        print(\n", "            f\"Epoch {epoch:>3} | \"\n", "            f\"train_loss={train_loss:.4f} train_acc={train_acc:.2%} train_ppl={train_ppl:.2f} | \"\n", "            f\"val_loss={val_loss:.4f} val_acc={val_acc:.2%} val_ppl={val_ppl:.2f}\"\n", "        )\n", "\n", "\n", "# 6) Inspect the distribution over next tokens.\n", "def top_predictions(context_words, k=5):\n", "    context_ids = np.array([[encode(word) for word in context_words]], dtype=np.int64)\n", "    probs, _ = forward(context_ids)\n", "    best = np.argsort(probs[0])[::-1][:k]\n", "    return [(id2word[idx], float(probs[0, idx])) for idx in best]\n", "\n", "\n", "print(\"\\nTop next-token predictions:\")\n", "for context in [\n", "    [\"\", \"\", \"el\"],\n", "    [\"la\", 
\"cientifica\", \"analiza\"],\n", "    [\"el\", \"perro\", \"corre\"],\n", "]:\n", "    print(f\"Context {context} -> {top_predictions(context, k=5)}\")\n", "\n", "\n", "# 7) Autoregressive generation with temperature and top-k sampling.\n", "def sample_next(context_words, temperature=0.8, top_k=5):\n", "    context_ids = np.array([[encode(word) for word in context_words]], dtype=np.int64)\n", "    probs, _ = forward(context_ids)\n", "    logits = np.log(probs[0] + 1e-12) / max(temperature, 1e-6)\n", "    logits = logits - np.max(logits)\n", "    filtered = np.argsort(logits)[::-1][:top_k]\n", "    filtered_logits = logits[filtered]\n", "    filtered_probs = np.exp(filtered_logits)\n", "    filtered_probs /= filtered_probs.sum()\n", "    next_id = int(np.random.choice(filtered, p=filtered_probs))\n", "    return id2word[next_id]\n", "\n", "\n", "def generate(seed=(\"\", \"\", \"\"), max_len=10, temperature=0.8, top_k=5):\n", "    context = list(seed)\n", "    generated = []\n", "    for _ in range(max_len):\n", "        next_word = sample_next(context, temperature=temperature, top_k=top_k)\n", "        if next_word == \"\":\n", "            break\n", "        generated.append(next_word)\n", "        context = context[1:] + [next_word]\n", "    return \" \".join(generated)\n", "\n", "\n", "print(\"\\nGenerated samples:\")\n", "for seed in [\n", "    (\"\", \"\", \"el\"),\n", "    (\"\", \"\", \"la\"),\n", "    (\"la\", \"cientifica\", \"analiza\"),\n", "]:\n", "    for _ in range(2):\n", "        print(f\"seed={seed} -> {generate(seed=seed, temperature=0.9, top_k=4)}\")\n", "    print()\n", "\n" ] }, { "cell_type": "markdown", "id": "40134c48", "metadata": {}, "source": [ "## What LLMs are good at\n", "\n", "Good use cases usually have one or more of these characteristics:\n", "\n", "| Strong fit | Why it works well |\n", "|---|---|\n", "| Summarization | The task is mostly compression and re-expression |\n", "| Classification | The model can map text into predefined labels |\n", "| Information extraction | The output can be constrained into fields or tables |\n", "| 
Drafting | The model can propose a first version quickly |\n", "| Question answering with context | The answer can be grounded in supplied text |\n", "| Brainstorming | The value comes from many candidate ideas |\n", "\n", "As a rule, LLMs work best when we define the goal, the context, and the format clearly.\n", "\n", "Typical examples of strong classroom or professional use cases include:\n", "\n", "- turning a long paper abstract into five key points\n", "- extracting author, year, method, and limitation from a set of papers\n", "- generating a first draft of an email, abstract, or meeting summary\n", "- converting unstructured notes into a Markdown table\n", "- suggesting code comments or documentation from existing code\n", "- translating technical language into a version for a broader audience\n", "\n", "A useful heuristic is this: if a human could solve the task mostly by reading, rewriting, classifying, or structuring text, an LLM may be a strong accelerator." ] }, { "cell_type": "markdown", "id": "127d65e8", "metadata": {}, "source": [ "## What LLMs are not good at \n", "\n", "LLMs are weaker when a task requires guaranteed truth, precise arithmetic, or hidden domain knowledge that is not present in the prompt.\n", "\n", "| Weak fit | Main risk |\n", "|---|---|\n", "| Up-to-date factual search | The model may be outdated |\n", "| High-stakes decisions | Errors can be expensive or dangerous |\n", "| Exact citations | The model may invent references |\n", "| Long multi-step planning without checks | The model may drift or skip constraints |\n", "| Sensitive data handling | Privacy and compliance risks |\n", "| Numeric reliability | Small arithmetic mistakes are common |\n", "\n", "A useful slogan is: **use LLMs to accelerate thinking, not to replace verification**.\n", "\n", "Examples of weak or risky use cases include:\n", "\n", "- asking for the latest regulations without checking official sources\n", "- using the model as the final judge in legal or 
medical decisions\n", "- trusting generated references without opening the original papers\n", "- asking for exact calculations in finance or engineering without verification\n", "- sending private institutional data to a public hosted chatbot\n", "\n", "In these settings, an LLM can still be useful as an assistant, but only if a human or an external system validates the result." ] }, { "cell_type": "markdown", "id": "61556de1", "metadata": {}, "source": [ "### A quick test for use-case quality\n", "\n", "Before using an LLM for a task, ask three questions:\n", "\n", "- Is the task mainly about language, structure, or retrieval?\n", "- Can I define what a good answer looks like?\n", "- Do I have a way to verify the result?\n", "\n", "If the answer to all three is yes, the task is often a good candidate for LLM assistance." ] }, { "cell_type": "markdown", "id": "c3ff3896", "metadata": {}, "source": [ "## Key concepts and risks \n", "\n", "### Hallucinations\n", "\n", "A hallucination happens when the model generates fluent but false or unsupported content.\n", "\n", "Ways to detect and reduce hallucinations:\n", "\n", "- ask for answers grounded in a provided passage\n", "- request quoted evidence or source spans\n", "- compare the answer with the original document\n", "- break the task into extraction first, synthesis second\n", "- verify critical facts with a trusted external source\n", "\n", "Common warning signs include overly confident claims, invented citations, missing uncertainty, and answers that sound plausible but do not match the provided material.\n", "\n", "This is why evidence-aware prompting and source checking are core habits, not optional extras.\n", "\n", "### Bias\n", "\n", "LLMs can reproduce stereotypes or skewed assumptions from training data. 
Bias can appear in classification, ranking, tone, and recommendations.\n", "\n", "Examples include unequal treatment of demographic groups, culturally narrow assumptions, or systematically different tone when describing similar people or regions.\n", "\n", "- Further reading: [Comparative analysis of DeepSeek R1, ChatGPT, Gemini, Alibaba, and Llama: performance, reasoning capabilities, and political bias](https://www.authorea.com/users/889394/articles/1266881-comparative-analysis-of-deepseek-r1-chatgpt-gemini-alibaba-and-llama-performance-reasoning-capabilities-and-political-bias)\n", "\n", "### Overreliance\n", "\n", "Fast answers can create false confidence. Users may stop checking evidence, especially when the writing sounds expert.\n", "\n", "### Data privacy\n", "\n", "Never assume a public model or web tool is safe for confidential data. Before sharing private data, check the deployment mode, retention policy, and institutional rules.\n", "\n", "In professional settings, privacy review is not optional. Data governance, retention, and access control matter as much as model quality." ] }, { "cell_type": "markdown", "id": "d02c343c", "metadata": {}, "source": [ "## Useful tools\n", "\n", "Three tools that often appear in research and knowledge workflows are:\n", "\n", "- **Perplexity AI**: useful for web-grounded question answering and fast exploration.\n", "- **NotebookLM**: useful for asking questions over a user-provided collection of documents.\n", "- **Scite.ai**: useful for literature search and for checking whether papers are supported or contradicted by later citations.\n", "\n", "These tools can save time, but they still need human review. 
Always inspect the source material before trusting a claim.\n", "\n", "A practical way to think about them is:\n", "\n", "| Tool | Best for | Main caution |\n", "|---|---|---|\n", "| Perplexity AI | broad exploration, quick web search, finding recent sources | may still summarize weak or irrelevant sources |\n", "| NotebookLM | asking questions over your own uploaded reports, slides, or papers | quality depends on the documents you provide |\n", "| Scite.ai | literature discovery and citation context | citation counts and citation labels still need interpretation |\n", "\n", "These are not replacements for reading. They are better understood as navigation tools that help you find, compare, and prioritize material faster." ] }, { "cell_type": "markdown", "id": "41b35fd7", "metadata": {}, "source": [ "## Models and Hugging Face \n", "\n", "Models differ in size, architecture, context length, license, training data, and instruction tuning.\n", "\n", "[Web File](https://docs.google.com/spreadsheets/d/1kc262HZSMAWI6FVsh0zJwbB-ooYvzhCHaHcNUiA0_hY/edit?pli=1&gid=1158069878#gid=1158069878)\n", "\n", "**Hugging Face** [URL](http://huggingface.co/) is a major platform for discovering and sharing models, datasets, and tokenizers. In practice, it is useful because it provides:\n", "\n", "- model cards describing strengths, limits, and licenses\n", "- ready-to-use checkpoints for text generation and embeddings\n", "- dataset hosting and evaluation resources\n", "- an ecosystem for running models locally or in the cloud\n", "\n", "When choosing a model, look at more than raw size. 
Also consider latency, memory footprint, multilingual quality, domain fit, and whether the task needs generation, embeddings, or classification.\n", "\n", "A simple distinction that helps beginners is:\n", "\n", "- **hosted API models**: easy to use and usually strong, but they involve external infrastructure and cost\n", "- **open models**: flexible and transparent, often available through Hugging Face, and useful for local experimentation\n", "- **local models**: attractive for privacy or offline use, but more constrained by hardware and often less capable than large hosted systems\n", "\n", "For many teaching scenarios, Hugging Face is useful not only to download models, but also to compare model cards and discuss licensing, intended use, and known limitations." ] }, { "cell_type": "markdown", "id": "7d80894b", "metadata": {}, "source": [ "### Choosing a model in practice\n", "\n", "A practical model-selection checklist includes:\n", "\n", "- task type: generation, extraction, classification, embeddings, or search\n", "- language coverage: English-only or multilingual\n", "- deployment: cloud API, local machine, or institutional infrastructure\n", "- latency needs: interactive chat or offline batch processing\n", "- privacy constraints: public, internal, or confidential data\n", "- budget: both inference cost and engineering effort\n", "\n", "In many real projects, the best model is not the biggest one. It is the one that is good enough, affordable, controllable, and compatible with your data constraints." ] }, { "cell_type": "markdown", "id": "08f4f2f0", "metadata": {}, "source": [ "### Running LLMs locally with Ollama, llama.cpp, and vLLM\n", "\n", "There are three common styles: `Ollama`, an easy desktop-style workflow; `llama.cpp`, a lightweight low-level runtime; and `vLLM`, a high-throughput inference server.\n", "\n", "**Ollama** [URL](https://ollama.com/) is usually the easiest entry point for local experimentation. 
It focuses on usability: download a model, run it with a simple command, and interact with it through a local API or chat-like workflow. It is a strong choice for:\n", "\n", "- classroom demos\n", "- fast prototyping on a laptop or workstation\n", "- trying different open models without much setup\n", "- building simple local applications that call a model through an HTTP API\n", "\n", "Its main advantage is convenience. Its main limitation is that it is not primarily designed for maximum serving efficiency at scale.\n", "\n", "**llama.cpp** [URL](https://github.com/ggml-org/llama.cpp) is a lightweight C/C++ inference runtime that became especially popular for running quantized models on local hardware, including CPUs and modest devices. It is often used when people want direct control, portability, and efficient inference without a heavy serving stack. It is a strong choice for:\n", "\n", "- CPU inference\n", "- low-resource machines\n", "- edge or embedded experimentation\n", "- understanding the mechanics of local inference more directly\n", "\n", "Its main advantage is portability and efficiency with quantized models. Its main limitation is that the workflow can feel lower-level and less polished than Ollama.\n", "\n", "**vLLM** [URL](https://vllm.ai/) is a more production-oriented inference engine. It is designed for efficient serving, especially when many requests must be handled at the same time. It is known for high throughput and better GPU utilization, so it is often used when a team wants to expose a model behind an API for multiple users or applications. It is a strong choice for:\n", "\n", "- research labs serving one or more models internally\n", "- backend APIs for chat or batch generation\n", "- deployments where latency and throughput matter\n", "- larger GPU-based systems\n", "\n", "Its main advantage is serving performance. 
Its main limitation is that it is more infrastructure-oriented than beginner-oriented.\n", "\n", "A practical comparison is:\n", "\n", "| Tool | Best mental model | Best for | Main trade-off |\n", "|---|---|---|---|\n", "| Ollama | easy local model manager | quick demos, teaching, personal workflows | less focused on large-scale serving |\n", "| llama.cpp | lightweight local runtime | CPU inference, quantized models, portable setups | lower-level workflow |\n", "| vLLM | efficient inference server | multi-user APIs, GPU serving, production-like setups | more complex operationally |\n", "\n", "These tools are complementary, not mutually exclusive." ] }, { "cell_type": "markdown", "id": "9ec447e9", "metadata": {}, "source": [ "## Performance and hardware\n", "\n", "Running an LLM has a cost in memory and compute.\n", "\n", "[llmfit](https://github.com/AlexsJones/llmfit): a tool that matches LLM models to your hardware.\n", "\n", "- **GPU**: usually gives much lower latency than CPU for generation.\n", "- **VRAM / memory usage**: larger models and longer contexts require more memory.\n", "- **Precision**: lower precision such as 8-bit or 4-bit quantization reduces memory usage.\n", "- **Context length**: longer prompts increase compute and latency.\n", "- **Tokens per second**: a practical measure of interactive speed.\n", "\n", "A rough rule is that the weights alone need about (parameter count × bytes per parameter) of memory, for example 2 bytes per parameter at 16-bit precision and 0.5 bytes at 4-bit; the KV cache, activations, and runtime overhead come on top of that.\n", "\n", "Practical examples:\n", "\n", "- a small local model may be good enough for classification, extraction, or drafting\n", "- a larger model may be worth the cost for harder reasoning or multilingual tasks\n", "- very long prompts can make even a good model slow and expensive\n", "- quantization [Reference](https://developer.nvidia.com/blog/model-quantization-concepts-methods-and-why-it-matters/) can make a model feasible on smaller hardware, but sometimes with quality trade-offs" ] }, { "cell_type": "code", "execution_count": 5, 
"id": "cffbb5d7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3B model at 16-bit weights: ~5.59 GiB\n", "3B model at 4-bit weights: ~1.4 GiB\n", "--------------------------------------------------\n", "7B model at 16-bit weights: ~13.04 GiB\n", "7B model at 4-bit weights: ~3.26 GiB\n", "--------------------------------------------------\n", "13B model at 16-bit weights: ~24.21 GiB\n", "13B model at 4-bit weights: ~6.05 GiB\n", "--------------------------------------------------\n", "70B model at 16-bit weights: ~130.39 GiB\n", "70B model at 4-bit weights: ~32.6 GiB\n", "--------------------------------------------------\n", "3000B model at 16-bit weights: ~5587.94 GiB\n", "3000B model at 4-bit weights: ~1396.98 GiB\n", "--------------------------------------------------\n" ] } ], "source": [ "def estimate_weight_memory(parameters_billion, bits_per_parameter=16):\n", "    \"\"\"Estimate the memory needed for the model weights alone, in GiB.\"\"\"\n", "    parameters = parameters_billion * 1_000_000_000\n", "    total_bits = parameters * bits_per_parameter\n", "    total_bytes = total_bits / 8\n", "    gib = total_bytes / (1024 ** 3)\n", "    return round(gib, 2)\n", "\n", "# 3000 is a hypothetical, very large model size included for scale.\n", "for params in [3, 7, 13, 70, 3000]:\n", "    print(f\"{params}B model at 16-bit weights: ~{estimate_weight_memory(params, 16)} GiB\")\n", "    print(f\"{params}B model at 4-bit weights: ~{estimate_weight_memory(params, 4)} GiB\")\n", "    print('-' * 50)" ] }, { "cell_type": "markdown", "id": "84cbd159", "metadata": {}, "source": [ "## Benchmarks\n", "\n", "Benchmarks try to measure capability in a controlled way. 
Common examples include:\n", "\n", "- **MMLU** for broad academic knowledge\n", "- **GSM8K** for grade-school math word problems\n", "- **HumanEval** for coding tasks\n", "- **MT-Bench** for multi-turn chat quality\n", "\n", "[Web File](https://docs.google.com/spreadsheets/d/1kc262HZSMAWI6FVsh0zJwbB-ooYvzhCHaHcNUiA0_hY/edit?pli=1&gid=1158069878#gid=1158069878)\n", "\n", "Benchmarks are useful, but they do not fully predict real-world usefulness. A model with higher benchmark scores may still be slower, more expensive, harder to control, or worse for a domain-specific task.\n", "\n", "**Always test models on your own data and your own workflow.**\n", "\n", "A good reading habit is to ask:\n", "\n", "- what exactly is being measured?\n", "- does the benchmark match my real task?\n", "- was the model optimized specifically for this test?\n", "- what trade-offs are hidden, such as latency or cost?\n", "\n", "Benchmark results are signals, not final answers." ] }, { "cell_type": "markdown", "id": "df2913a1", "metadata": {}, "source": [ "
\n", "Activity \n", "
" ] }, { "cell_type": "markdown", "id": "7460a270", "metadata": {}, "source": [ "For each scenario below, decide whether an LLM is a strong fit, a weak fit, or only safe with verification:\n", "\n", "1. Extract sampling dates and locations from PDF cruise reports.\n", "2. Provide final medical advice to a patient.\n", "3. Draft three versions of a project abstract.\n", "4. Produce exact legal citations checking the source documents.\n", "5. Summarize ten papers and compare their methods in a table.\n", "6. NetCDF Scientific data assistant to ...\n", " - Turn the analysis into a report for a paper/project\n", " - Detect potential issues: units, calendar, incorrect variables, inverted coordinates.\n", " - Create a dataset quality checklist.\n", "\n", "Discuss your choices!\n", "\n", "Extension questions:\n", "\n", "- Which of these tasks would benefit from grounding in supplied documents?\n", "- Which ones require a human expert to approve the final answer?\n", "- Which ones are mainly about speed and productivity, and which ones are high-risk?" ] }, { "cell_type": "markdown", "id": "6ea3a958", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.13" } }, "nbformat": 4, "nbformat_minor": 5 }