LLM Fundamentals

This notebook introduces the basic ideas behind Large Language Models (LLMs), what they are good at, where they fail, and how to evaluate them in real workflows.

Suggested duration: 1 hour

Learning goals

By the end of this notebook, you should be able to:

  • explain what an LLM does in simple terms

  • identify strong and weak use cases

  • recognize common risks such as hallucinations and bias

  • describe the role of Hugging Face and open models

  • connect model size and hardware to speed and memory usage

  • interpret benchmarks with healthy skepticism

What is an LLM?

A Large Language Model is a neural network trained to predict the next token in a sequence. A neural network is a parameterized function made of many simple computational units arranged in layers; during training, its parameters (weights) are updated by gradient-based optimization, with gradients computed by backpropagation, to reduce prediction error on large datasets.

Modern LLMs are usually Transformer-based neural networks that model long-range token relationships with self-attention, which is a key reason they scale well on language tasks. During training they see large amounts of text, code, and other structured data, and they learn statistical patterns that let them continue text in useful ways.
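The training objective described above can be written compactly. The block below is the standard next-token cross-entropy loss followed by a plain gradient-descent update. The symbols are chosen here for illustration: $\theta$ for the model parameters, $x_t$ for the $t$-th token, $N$ for the number of predicted tokens, and $\eta$ for the learning rate.

```latex
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{t=1}^{N} \log p_\theta\left(x_t \mid x_{<t}\right),
\qquad
\theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}(\theta)
```

Large models are usually trained with more elaborate optimizers than this plain update, but the idea of following the gradient of the prediction error is the same.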

In practice, this means an LLM can:

  • summarize documents

  • rewrite text in a target style

  • extract structured information

  • answer questions about provided context

  • generate code, explanations, and drafts

An important mental model is that an LLM is not a database with guaranteed facts. It is a model that produces likely continuations based on patterns learned during training and on the prompt it receives at inference time.

This is why prompts matter so much: the model’s answer depends heavily on the instructions, the context we provide, the examples we include, and the format we request.

LLMs are powerful pattern completion systems, but they do not automatically know which facts are true, current, or safe.

How an LLM answers a prompt

At inference time, the model receives a prompt, converts it into tokens, and predicts which token should come next. It appends that token to the sequence and repeats the step until the answer is complete.

A simplified pipeline is:

  1. The user writes a prompt.

  2. The text is tokenized.

  3. The model computes probabilities over possible next tokens.

  4. A decoding strategy selects the next token.

  5. The process repeats until the response ends.

This helps explain why LLM outputs can be fluent even when they are wrong: fluency comes from learned language patterns, not from guaranteed access to truth.
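The five pipeline steps can be sketched with a toy stand-in for the model. Everything in this block is hypothetical: `toy_model` is a hand-written lookup table rather than a trained network, `tokenize` is a whitespace split, and the function names are illustrative only.

```python
import numpy as np

# Hypothetical next-token probabilities: for each current token,
# a distribution over the token that follows it.
toy_model = {
    "<bos>": {"the": 0.9, "cat": 0.1},
    "the": {"cat": 0.7, "dog": 0.3},
    "cat": {"sleeps": 0.8, "<eos>": 0.2},
    "dog": {"sleeps": 0.6, "<eos>": 0.4},
    "sleeps": {"<eos>": 1.0},
}


def tokenize(text):
    """Step 2: split the prompt into tokens (whitespace split as a stand-in)."""
    return ["<bos>"] + text.lower().split()


def decode(prompt, strategy="greedy", max_new_tokens=10, rng=None):
    """Steps 3-5: get probabilities, pick a token, repeat until <eos>."""
    rng = rng or np.random.default_rng(0)
    tokens = tokenize(prompt)
    for _ in range(max_new_tokens):
        # Step 3: probabilities over possible next tokens.
        # (A token missing from toy_model raises KeyError in this sketch.)
        dist = toy_model[tokens[-1]]
        candidates = list(dist)
        probs = np.array([dist[c] for c in candidates])
        # Step 4: the decoding strategy selects the next token.
        if strategy == "greedy":
            next_token = candidates[int(np.argmax(probs))]
        else:
            next_token = str(rng.choice(candidates, p=probs))
        # Step 5: stop at <eos>, otherwise append and repeat.
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return " ".join(tokens[1:])


print(decode("the"))                     # greedy decoding
print(decode("the", strategy="sample"))  # sampling from the distribution
```

Greedy decoding always follows the single most likely token, while sampling can take a less likely branch. In both cases the output is fluent by construction, whether or not it is true.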

Example: Training a Neural Network to Generate Well-Formed Word Sequences

In this hands-on example, we build a small neural language model that learns local word-order patterns and generates grammatically plausible sequences.

Step descriptions:

  • 3-token context window: uses the previous three tokens to predict the next one; this defines the model’s short-context memory.

  • trainable embeddings: represents each token as a dense vector learned during training, instead of a fixed one-hot vector.

  • mini-batch training: trains on small batches per iteration to stabilize gradients and speed up optimization.

  • validation split: reserves part of the data to evaluate generalization beyond the training set.

  • perplexity tracking: measures how "surprised" the model is by the data; lower perplexity usually means better next-token predictions.

  • top-k prediction inspection: checks the top-k most likely next tokens to understand local model behavior.

  • temperature-based generation: controls randomness during decoding; lower temperature is more conservative, higher temperature is more diverse.
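The effect of temperature is easier to see on a single distribution. The sketch below divides a set of hypothetical logits by the temperature before applying softmax. The function name and the example scores are illustrative and are not taken from the notebook's model.

```python
import numpy as np


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    print(f"temperature={t}: {np.round(softmax_with_temperature(logits, t), 3)}")
```

Lower temperatures concentrate probability on the top-scoring token, and higher temperatures flatten the distribution so that less likely tokens are sampled more often.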

Special tokens (special_tokens = ["<pad>", "<bos>", "<eos>", "<unk>"]):

  • <pad>: padding token used to make sequences the same length within a batch.

  • <bos>: "beginning of sequence" token that marks where a sequence starts.

  • <eos>: "end of sequence" token that signals where a sequence ends.

  • <unk>: "unknown" token used for words not present in the known vocabulary.
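A minimal sketch of how the four special tokens show up when sentences are encoded into a fixed-length batch. The three-word vocabulary and the helper `encode_sentence` are hypothetical. The notebook's own training code does not need <pad>, because every training example has exactly three context tokens.

```python
special_tokens = ["<pad>", "<bos>", "<eos>", "<unk>"]
known_words = ["el", "gato", "duerme"]  # hypothetical tiny vocabulary
vocab = special_tokens + known_words
word2id = {word: idx for idx, word in enumerate(vocab)}


def encode_sentence(words, max_len):
    """Wrap a sentence in <bos>/<eos>, map unknown words to <unk>, pad to max_len."""
    ids = [word2id["<bos>"]]
    ids += [word2id.get(word, word2id["<unk>"]) for word in words]
    ids.append(word2id["<eos>"])
    # Right-pad with <pad>; this sketch does not truncate longer sentences.
    ids += [word2id["<pad>"]] * (max_len - len(ids))
    return ids


batch = [encode_sentence(s.split(), max_len=6) for s in ["el gato duerme", "el perro"]]
for row in batch:
    print(row, [vocab[i] for i in row])
```

The word "perro" is not in this tiny vocabulary, so it is encoded as <unk>, and the shorter sentence is padded so that both rows have the same length.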

import re
import numpy as np

np.random.seed(7)

# 1) A richer toy corpus with several semantic patterns.
corpus = [
    "el gato duerme en la silla",
    "el gato observa la ventana",
    "el perro corre por el parque",
    "el perro ladra en la noche",
    "la nina lee un libro interesante",
    "la nina escribe una historia corta",
    "el nino dibuja una casa pequena",
    "el nino escribe una carta amable",
    "la profesora explica la leccion con calma",
    "el alumno responde la pregunta correcta",
    "la musica suena en la sala grande",
    "la lluvia cae sobre el tejado rojo",
    "el tren llega a la estacion central",
    "el coche gira en la esquina estrecha",
    "el barco navega sobre el mar tranquilo",
    "la cientifica analiza los datos del oceano",
    "el investigador compara modelos de lenguaje",
    "la estudiante resume un articulo cientifico",
    "el asistente organiza notas para la reunion",
    "la biblioteca guarda libros de historia",
]

# 2) Simple tokenization.
def tok(text):
    return re.findall(r"[a-zA-Záéíóúñ]+", text.lower())

tokenized = [tok(sentence) for sentence in corpus]
special_tokens = ["<pad>", "<bos>", "<eos>", "<unk>"]
vocab_words = sorted({word for sent in tokenized for word in sent})
vocab = special_tokens + vocab_words
word2id = {word: idx for idx, word in enumerate(vocab)}
id2word = {idx: word for word, idx in word2id.items()}
V = len(vocab)
context_size = 3


def encode(word):
    return word2id.get(word, word2id["<unk>"])


# 3) Dataset: 3-word context -> next token.
X_list, y_list = [], []
for sent in tokenized:
    padded = ["<bos>"] * context_size + sent + ["<eos>"]
    for i in range(context_size, len(padded)):
        context = padded[i - context_size:i]
        target = padded[i]
        X_list.append([encode(word) for word in context])
        y_list.append(encode(target))

X = np.array(X_list, dtype=np.int64)
y = np.array(y_list, dtype=np.int64)

# Train / validation split.
indices = np.random.permutation(len(X))
cut = int(0.8 * len(indices))
train_idx, val_idx = indices[:cut], indices[cut:]
X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

print(f"Vocabulary size: {V}")
print(f"Training examples: {len(X_train)} | Validation examples: {len(X_val)}")

# 4) A slightly more realistic neural language model:
# embeddings -> hidden layer -> vocabulary logits.
D = 24  # embedding size
H = 64  # hidden layer size
# Weights and biases, initialized with small random values.
E = 0.05 * np.random.randn(V, D)
W1 = 0.05 * np.random.randn(context_size * D, H)
b1 = np.zeros((1, H))
W2 = 0.05 * np.random.randn(H, V)
b2 = np.zeros((1, V))


def softmax(logits):
    logits = logits - np.max(logits, axis=1, keepdims=True)
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits, axis=1, keepdims=True)


def forward(batch_ids):
    embeddings = E[batch_ids]
    x = embeddings.reshape(batch_ids.shape[0], context_size * D)
    h_pre = x @ W1 + b1
    h = np.tanh(h_pre)
    logits = h @ W2 + b2
    probs = softmax(logits)
    cache = (batch_ids, x, h)
    return probs, cache


def cross_entropy(probs, targets):
    return -np.log(probs[np.arange(len(targets)), targets] + 1e-12).mean()


def accuracy(probs, targets):
    return (np.argmax(probs, axis=1) == targets).mean()


def evaluate(X_data, y_data):
    probs, _ = forward(X_data)
    loss = cross_entropy(probs, y_data)
    acc = accuracy(probs, y_data)
    ppl = np.exp(loss)
    return loss, acc, ppl


# 5) Mini-batch gradient descent.
lr = 0.12
epochs = 500
batch_size = 16

for epoch in range(epochs):
    order = np.random.permutation(len(X_train))
    X_train = X_train[order]
    y_train = y_train[order]

    for start in range(0, len(X_train), batch_size):
        xb = X_train[start:start + batch_size]
        yb = y_train[start:start + batch_size]
        probs, (batch_ids, x, h) = forward(xb)
        n = len(xb)

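        # Gradient of the mean cross-entropy w.r.t. the logits:
        # softmax probabilities minus the one-hot targets.
        dlogits = probs.copy()
        dlogits[np.arange(n), yb] -= 1.0
        dlogits /= n

        # Output layer gradients.
        dW2 = h.T @ dlogits
        db2 = dlogits.sum(axis=0, keepdims=True)

        # Backpropagate through tanh: d/dz tanh(z) = 1 - tanh(z)^2.
        dh = dlogits @ W2.T
        dh_pre = dh * (1.0 - h ** 2)

        # Hidden layer gradients.
        dW1 = x.T @ dh_pre
        db1 = dh_pre.sum(axis=0, keepdims=True)

        # Embedding gradients: np.add.at accumulates correctly
        # when the same token id appears more than once in the batch.
        dx = (dh_pre @ W1.T).reshape(n, context_size, D)
        dE = np.zeros_like(E)
        for pos in range(context_size):
            np.add.at(dE, batch_ids[:, pos], dx[:, pos, :])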

        E -= lr * dE
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

    if epoch % 100 == 0 or epoch == epochs - 1:
        train_loss, train_acc, train_ppl = evaluate(X_train, y_train)
        val_loss, val_acc, val_ppl = evaluate(X_val, y_val)
        print(
            f"Epoch {epoch:>3} | "
            f"train_loss={train_loss:.4f} train_acc={train_acc:.2%} train_ppl={train_ppl:.2f} | "
            f"val_loss={val_loss:.4f} val_acc={val_acc:.2%} val_ppl={val_ppl:.2f}"
        )


# 6) Inspect the distribution over next tokens.
def top_predictions(context_words, k=5):
    context_ids = np.array([[encode(word) for word in context_words]], dtype=np.int64)
    probs, _ = forward(context_ids)
    best = np.argsort(probs[0])[::-1][:k]
    return [(id2word[idx], float(probs[0, idx])) for idx in best]


print("\nTop next-token predictions:")
for context in [
    ["<bos>", "<bos>", "el"],
    ["la", "cientifica", "analiza"],
    ["el", "perro", "corre"],
]:
    print(f"Context {context} -> {top_predictions(context, k=5)}")


# 7) Autoregressive generation with temperature and top-k sampling.
def sample_next(context_words, temperature=0.8, top_k=5):
    context_ids = np.array([[encode(word) for word in context_words]], dtype=np.int64)
    probs, _ = forward(context_ids)
    logits = np.log(probs[0] + 1e-12) / max(temperature, 1e-6)
    logits = logits - np.max(logits)
    filtered = np.argsort(logits)[::-1][:top_k]
    filtered_logits = logits[filtered]
    filtered_probs = np.exp(filtered_logits)
    filtered_probs /= filtered_probs.sum()
    next_id = int(np.random.choice(filtered, p=filtered_probs))
    return id2word[next_id]


def generate(seed=("<bos>", "<bos>", "<bos>"), max_len=10, temperature=0.8, top_k=5):
    context = list(seed)
    generated = []
    for _ in range(max_len):
        next_word = sample_next(context, temperature=temperature, top_k=top_k)
        if next_word == "<eos>":
            break
        generated.append(next_word)
        context = context[1:] + [next_word]
    return " ".join(generated)


print("\nGenerated samples:")
for seed in [
    ("<bos>", "<bos>", "el"),
    ("<bos>", "<bos>", "la"),
    ("la", "cientifica", "analiza"),
]:
    for _ in range(2):
        print(f"seed={seed} -> {generate(seed=seed, temperature=0.9, top_k=4)}")
    print()


Vocabulary size: 87
Training examples: 117 | Validation examples: 30
Epoch   0 | train_loss=4.4202 train_acc=14.53% train_ppl=83.12 | val_loss=4.4221 val_acc=23.33% val_ppl=83.27
Epoch 100 | train_loss=1.5558 train_acc=51.28% train_ppl=4.74 | val_loss=4.9246 val_acc=20.00% val_ppl=137.64
Epoch 200 | train_loss=0.7792 train_acc=74.36% train_ppl=2.18 | val_loss=6.6347 val_acc=23.33% val_ppl=761.09
Epoch 300 | train_loss=0.5004 train_acc=82.05% train_ppl=1.65 | val_loss=7.4939 val_acc=10.00% val_ppl=1797.04
Epoch 400 | train_loss=0.4281 train_acc=82.91% train_ppl=1.53 | val_loss=7.8468 val_acc=10.00% val_ppl=2557.45
Epoch 499 | train_loss=0.4085 train_acc=82.91% train_ppl=1.50 | val_loss=8.2307 val_acc=10.00% val_ppl=3754.46

Top next-token predictions:
Context ['<bos>', '<bos>', 'el'] -> [('perro', 0.37899802163743385), ('gato', 0.1441843925076278), ('nino', 0.11311341056122948), ('coche', 0.09503225811707484), ('asistente', 0.08856609115669524)]
Context ['la', 'cientifica', 'analiza'] -> [('los', 0.9702507142023091), ('un', 0.009214923142383103), ('<eos>', 0.005744469491607177), ('parque', 0.00375609648102881), ('sobre', 0.003254359781328852)]
Context ['el', 'perro', 'corre'] -> [('por', 0.9633687517233784), ('sobre', 0.01362457898019891), ('modelos', 0.009082521432186824), ('del', 0.0032136696933594677), ('casa', 0.002558788846126655)]

Generated samples:
seed=('<bos>', '<bos>', 'el') -> perro ladra sobre el mar tranquilo
seed=('<bos>', '<bos>', 'el') -> coche gira en la esquina estrecha

seed=('<bos>', '<bos>', 'la') -> musica lee un articulo cientifico cientifico de lenguaje
seed=('<bos>', '<bos>', 'la') -> biblioteca con un articulo cientifico en la silla con los

seed=('la', 'cientifica', 'analiza') -> un libro interesante
seed=('la', 'cientifica', 'analiza') -> los datos del oceano
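In the run above, training perplexity falls to 1.50 while validation perplexity rises from 83.27 to 3754.46, which is a typical sign of overfitting on such a small corpus. The sketch below shows how perplexity is computed from the probabilities assigned to the correct tokens, and why a model that guesses uniformly over the 87-token vocabulary has a perplexity of 87, close to the value logged at epoch 0. The function name and the example probabilities are illustrative.

```python
import numpy as np


def perplexity(probs_of_correct_token):
    """exp(mean negative log-likelihood) of the probabilities given to the true tokens."""
    nll = -np.log(np.asarray(probs_of_correct_token, dtype=float))
    return float(np.exp(nll.mean()))


V = 87  # vocabulary size reported by the notebook run
uniform = perplexity([1.0 / V] * 10)    # a model that guesses uniformly
confident = perplexity([0.9, 0.8, 0.95])  # hypothetical confident, correct model

print(f"uniform guess:   {uniform:.2f}")
print(f"confident model: {confident:.2f}")
```

A perplexity far above the vocabulary size, as in the later validation rows, means the model is confidently wrong on held-out examples rather than merely uncertain.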

What LLMs are good at

Good use cases usually have one or more of these characteristics:

| Strong fit | Why it works well |
| --- | --- |
| Summarization | The task is mostly compression and re-expression |
| Classification | The model can map text into predefined labels |
| Information extraction | The output can be constrained into fields or tables |
| Drafting | The model can propose a first version quickly |
| Question answering with context | The answer can be grounded in supplied text |
| Brainstorming | The value comes from many candidate ideas |

As a rule, LLMs work best when we define the goal, the context, and the format clearly.
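One way to make the goal, the context, and the format explicit is to assemble the prompt from three labeled parts. The helper `build_prompt` and the section labels are hypothetical choices made for this sketch, not a required template.

```python
def build_prompt(goal, context, output_format):
    """Assemble a prompt with an explicit goal, supplied context, and requested format."""
    return (
        f"Goal:\n{goal.strip()}\n\n"
        f"Context:\n{context.strip()}\n\n"
        f"Output format:\n{output_format.strip()}\n"
    )


prompt = build_prompt(
    goal="Summarize the abstract below in five key points.",
    context="<paste the abstract here>",
    output_format="A numbered list with exactly five items, one sentence each.",
)
print(prompt)
```

Keeping the three parts separate also makes it easier to change one of them, for example the output format, without rewriting the rest of the prompt.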

Typical examples of strong classroom or professional use cases include:

  • turning a long paper abstract into five key points

  • extracting author, year, method, and limitation from a set of papers

  • generating a first draft of an email, abstract, or meeting summary

  • converting unstructured notes into a Markdown table

  • suggesting code comments or documentation from existing code

  • translating technical language into a version for a broader audience

A useful heuristic is this: if a human could solve the task mostly by reading, rewriting, classifying, or structuring text, an LLM may be a strong accelerator.

What LLMs are not good at

LLMs are weaker when a task requires guaranteed truth, precise arithmetic, or hidden domain knowledge that is not present in the prompt.

| Weak fit | Main risk |
| --- | --- |
| Up-to-date factual search | The model may be outdated |
| High-stakes decisions | Errors can be expensive or dangerous |
| Exact citations | The model may invent references |
| Long multi-step planning without checks | The model may drift or skip constraints |
| Sensitive data handling | Privacy and compliance risks |
| Numeric reliability | Small arithmetic mistakes are common |

A useful slogan is: use LLMs to accelerate thinking, not to replace verification.

Examples of weak or risky use cases include:

  • asking for the latest regulations without checking official sources

  • using the model as the final judge in legal or medical decisions

  • trusting generated references without opening the original papers

  • asking for exact calculations in finance or engineering without verification

  • sending private institutional data to a public hosted chatbot

In these settings, an LLM can still be useful as an assistant, but only if a human or an external system validates the result.

A quick test for use-case quality

Before using an LLM for a task, ask three questions:

  • Is the task mainly about language, structure, or retrieval?

  • Can I define what a good answer looks like?

  • Do I have a way to verify the result?

If the answer to all three is yes, the task is often a good candidate for LLM assistance.

Key concepts and risks

Hallucinations

A hallucination happens when the model generates fluent but false or unsupported content.

Ways to detect and reduce hallucinations:

  • ask for answers grounded in a provided passage

  • request quoted evidence or source spans

  • compare the answer with the original document

  • break the task into extraction first, synthesis second

  • verify critical facts with a trusted external source

Common warning signs include overly confident claims, invented citations, missing uncertainty, and answers that sound plausible but do not match the provided material.

This is why evidence-aware prompting and source checking are core habits, not optional extras.
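Requesting quoted evidence only helps if the quotes are checked. A minimal sketch of such a check follows, assuming the model was asked to return verbatim quotes from the passage. The function names and the example passage are hypothetical, and the comparison is a plain substring match after normalizing whitespace and case.

```python
import re


def normalize(text):
    """Collapse whitespace and lowercase so that trivial differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_quotes(quotes, source):
    """Return the quotes that do not appear verbatim in the source passage."""
    source_norm = normalize(source)
    return [quote for quote in quotes if normalize(quote) not in source_norm]


source = (
    "The cruise sampled 12 stations in March 2021. "
    "Salinity was measured at each station."
)
quotes = [
    "sampled 12 stations in March 2021",
    "temperature was measured at each station",  # not in the source
]
missing = unsupported_quotes(quotes, source)
print("Quotes without support in the source:", missing)
```

A check like this only catches quotes that are absent from the source. It does not confirm that a genuine quote actually supports the claim attached to it, so a human still needs to read the flagged and unflagged evidence.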

Bias

LLMs can reproduce stereotypes or skewed assumptions from training data. Bias can appear in classification, ranking, tone, and recommendations.

Examples include unequal treatment of demographic groups, culturally narrow assumptions, or systematically different tone when describing similar people or regions.

Overreliance

Fast answers can create false confidence. Users may stop checking evidence, especially when the writing sounds expert.

Data privacy

Never assume a public model or web tool is safe for confidential data. Before sharing private data, check the deployment mode, retention policy, and institutional rules.

In professional settings, privacy review is not optional. Data governance, retention, and access control matter as much as model quality.

Useful tools

Three tools that often appear in research and knowledge workflows are:

  • Perplexity AI: useful for web-grounded question answering and fast exploration.

  • NotebookLM: useful for asking questions over a user-provided collection of documents.

  • Scite.ai: useful for literature search and for checking whether papers are supported or contradicted by later citations.

These tools can save time, but they still need human review. Always inspect the source material before trusting a claim.

A practical way to think about them is:

Tool

Best for

Main caution

Perplexity AI

broad exploration, quick web search, finding recent sources

may still summarize weak or irrelevant sources

NotebookLM

asking questions over your own uploaded reports, slides, or papers

quality depends on the documents you provide

Scite.ai

literature discovery and citation context

citation counts and citation labels still need interpretation

These are not replacements for reading. They are better understood as navigation tools that help you find, compare, and prioritize material faster.

Models and Hugging Face

Models differ in size, architecture, context length, license, training data, and instruction tuning.

Hugging Face is a major platform for discovering and sharing models, datasets, and tokenizers. In practice, it is useful because it provides:

  • model cards describing strengths, limits, and licenses

  • ready-to-use checkpoints for text generation and embeddings

  • dataset hosting and evaluation resources

  • an ecosystem for running models locally or in the cloud

When choosing a model, look at more than raw size. Also consider latency, memory footprint, multilingual quality, domain fit, and whether the task needs generation, embeddings, or classification.

A simple distinction that helps beginners is:

  • hosted API models: easy to use, usually strong performance, but involve external infrastructure and cost

  • open models: flexible and transparent, often available through Hugging Face, and useful for local experimentation

  • local models: attractive for privacy or offline use, but more constrained by hardware and often less capable than large hosted systems

For many teaching scenarios, Hugging Face is useful not only to download models, but also to compare model cards and discuss licensing, intended use, and known limitations.
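As a sketch of what using a ready-to-use checkpoint can look like, the function below wraps the `pipeline` helper from the Hugging Face `transformers` library. It assumes `transformers` and a backend such as PyTorch are installed and that the machine can download the checkpoint. The wrapper name `generate_text` and the choice of `distilgpt2` as a small example model are assumptions made for illustration.

```python
def generate_text(prompt, model_name="distilgpt2", max_new_tokens=20):
    """Generate a short continuation with a text-generation checkpoint from the Hub."""
    # Imported lazily so that defining this function does not require the library.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_name)
    outputs = generator(prompt, max_new_tokens=max_new_tokens)
    # The pipeline returns a list of dicts, one per generated sequence.
    return outputs[0]["generated_text"]


# Example call (downloads the checkpoint on first use):
# print(generate_text("Large language models are"))
```

Before running a checkpoint like this in a real workflow, it is worth reading its model card for the license, intended use, and known limitations.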

Choosing a model in practice

A practical model-selection checklist includes:

  • task type: generation, extraction, classification, embeddings, or search

  • language coverage: English-only or multilingual

  • deployment: cloud API, local machine, or institutional infrastructure

  • latency needs: interactive chat or offline batch processing

  • privacy constraints: public, internal, or confidential data

  • budget: both inference cost and engineering effort

In many real projects, the best model is not the biggest one. It is the one that is good enough, affordable, controllable, and compatible with your data constraints.
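The checklist can be turned into a simple filter over candidate models. All names and numbers below are hypothetical placeholders rather than real models or measurements. The sketch keeps only candidates that satisfy the hard constraints and then prefers the smallest one.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    tasks: set
    multilingual: bool
    runs_locally: bool
    memory_gib: float


def shortlist(candidates, task, need_multilingual, need_local, max_memory_gib):
    """Keep candidates that meet every constraint, smallest memory footprint first."""
    keep = []
    for c in candidates:
        if task not in c.tasks:
            continue
        if need_multilingual and not c.multilingual:
            continue
        if need_local and not c.runs_locally:
            continue
        if c.memory_gib > max_memory_gib:
            continue
        keep.append(c)
    return sorted(keep, key=lambda c: c.memory_gib)


# Hypothetical candidates for illustration only.
candidates = [
    Candidate("small-open-model", {"extraction", "classification"}, True, True, 4.0),
    Candidate("medium-open-model", {"extraction", "generation"}, True, True, 14.0),
    Candidate("large-hosted-model", {"extraction", "generation"}, True, False, 0.0),
]

result = shortlist(candidates, task="extraction", need_multilingual=True,
                   need_local=True, max_memory_gib=8.0)
print([c.name for c in result])
```

With confidential data and an 8 GiB memory budget, the hosted option and the larger local model drop out, which illustrates why the biggest model is not automatically the right choice.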

Ollama, llama.cpp, and vLLM to run LLMs locally

These tools represent three styles of running models: Ollama offers an easy desktop-style workflow, llama.cpp is a lightweight low-level runtime, and vLLM is a high-throughput inference server.

Ollama is usually the easiest entry point for local experimentation. It focuses on usability: download a model, run it with a simple command, and interact with it through a local API or chat-like workflow. It is a strong choice for:

  • classroom demos

  • fast prototyping on a laptop or workstation

  • trying different open models without much setup

  • building simple local applications that call a model through an HTTP API

Its main advantage is convenience. Its main limitation is that it is not primarily designed for maximum serving efficiency at scale.
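A minimal sketch of calling a local Ollama server over HTTP with only the standard library. It assumes Ollama is running on its default local port and that a model has already been pulled. The model name `llama3.2` is an assumption and should be replaced with whichever model is installed, and the helper names are illustrative.

```python
import json
import urllib.request

# Default local address of the Ollama generate endpoint; adjust if your setup differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="llama3.2"):
    """Build a POST request asking for a single, non-streamed completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(prompt, model="llama3.2", timeout=60):
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]


# Example call (requires a running Ollama server with the model available):
# print(ask_ollama("Explain what a token is in one sentence."))
```

Because the request stays on the local machine, this pattern is a reasonable starting point for classroom demos that should not send data to an external service.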

llama.cpp is a lightweight C/C++ inference runtime that became especially popular for running quantized models on local hardware, including CPUs and modest devices. It is often used when people want direct control, portability, and efficient inference without a heavy serving stack. It is a strong choice for:

  • CPU inference

  • low-resource machines

  • edge or embedded experimentation

  • understanding the mechanics of local inference more directly

Its main advantage is portability and efficiency with quantized models. Its main limitation is that the workflow can feel lower-level and less polished than Ollama.

vLLM is a more production-oriented inference engine. It is designed for efficient serving, especially when many requests must be handled at the same time. It is known for high throughput and better GPU utilization, so it is often used when a team wants to expose a model behind an API for multiple users or applications. It is a strong choice for:

  • research labs serving one or more models internally

  • backend APIs for chat or batch generation

  • deployments where latency and throughput matter

  • larger GPU-based systems

Its main advantage is serving performance. Its main limitation is that it is more infrastructure-oriented than beginner-oriented.

A practical comparison is:

| Tool | Best mental model | Best for | Main trade-off |
| --- | --- | --- | --- |
| Ollama | easy local model manager | quick demos, teaching, personal workflows | less focused on large-scale serving |
| vLLM | efficient inference server | multi-user APIs, GPU serving, production-like setups | more complex operationally |
| llama.cpp | lightweight local runtime | CPU inference, quantized models, portable setups | lower-level workflow |

These tools are complementary, not mutually exclusive.

Performance and hardware

Running an LLM has a cost in memory and compute.

Matching LLMs to your hardware

  • GPU: usually gives much lower latency than CPU for generation.

  • VRAM / memory usage: larger models and longer contexts require more memory.

  • Precision: lower precision such as 8-bit or 4-bit quantization reduces memory usage.

  • Context length: longer prompts increase compute and latency.

  • Tokens per second: a practical measure of interactive speed.

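A rough rule is that weight memory is approximately the parameter count multiplied by the bytes per parameter. Deployment details such as the key/value cache for long contexts, batch size, and runtime overhead add to that figure.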

Practical examples:

  • a small local model may be good enough for classification, extraction, or drafting

  • a larger model may be worth the cost for harder reasoning or multilingual tasks

  • very long prompts can make even a good model slow and expensive

  • quantization can make a model feasible on smaller hardware, but sometimes with quality trade-offs

def estimate_weight_memory(parameters_billion, bits_per_parameter=16):
    parameters = parameters_billion * 1_000_000_000
    total_bits = parameters * bits_per_parameter
    total_bytes = total_bits / 8
    gib = total_bytes / (1024 ** 3)
    return round(gib, 2)

for params in [3, 7, 13, 70, 3000]:  # 3000 is a hypothetical very large model, not a published figure
    print(f"{params}B model at 16-bit weights: ~{estimate_weight_memory(params, 16)} GiB")
    print(f"{params}B model at 4-bit weights:  ~{estimate_weight_memory(params, 4)} GiB")
    print('-' * 50)
3B model at 16-bit weights: ~5.59 GiB
3B model at 4-bit weights:  ~1.4 GiB
--------------------------------------------------
7B model at 16-bit weights: ~13.04 GiB
7B model at 4-bit weights:  ~3.26 GiB
--------------------------------------------------
13B model at 16-bit weights: ~24.21 GiB
13B model at 4-bit weights:  ~6.05 GiB
--------------------------------------------------
70B model at 16-bit weights: ~130.39 GiB
70B model at 4-bit weights:  ~32.6 GiB
--------------------------------------------------
3000B model at 16-bit weights: ~5587.94 GiB
3000B model at 4-bit weights:  ~1396.98 GiB
--------------------------------------------------
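The estimate above covers weights only. Longer contexts also need memory for the key/value cache that a Transformer decoder keeps during generation. The sketch below uses the usual back-of-the-envelope formula, two tensors (keys and values) per layer, and a hypothetical configuration of 32 layers, 32 key/value heads, and a head dimension of 128 at 16-bit precision. Real models may use fewer key/value heads or a quantized cache, which lowers the figure.

```python
def estimate_kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens,
                          bytes_per_value=2, batch_size=1):
    """Rough key/value cache size in GiB for a Transformer decoder."""
    # Factor 2: one key tensor and one value tensor are stored per layer.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value * batch_size)
    return total_bytes / (1024 ** 3)


# Hypothetical configuration, chosen only to illustrate the scaling.
for tokens in (4096, 32768):
    gib = estimate_kv_cache_gib(n_layers=32, n_kv_heads=32, head_dim=128,
                                context_tokens=tokens)
    print(f"{tokens} context tokens: ~{gib:.1f} GiB of KV cache")
```

The cache grows linearly with context length and batch size, which is one reason very long prompts can exhaust memory even when the weights fit comfortably.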

Benchmarks

Benchmarks try to measure capability in a controlled way. Common examples include:

  • MMLU for broad academic knowledge

  • GSM8K for grade-school math word problems

  • HumanEval for coding tasks

  • MT-Bench for multi-turn chat quality

Benchmarks are useful, but they do not fully predict real-world usefulness. A model with higher benchmark scores may still be slower, more expensive, harder to control, or worse for a domain-specific task.

Always test models on your own data and your own workflow.
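Testing on your own data can start very small. The sketch below scores any prediction function against a handful of labeled examples using exact-match accuracy. The function `keyword_baseline` is a hypothetical stand-in for a real model call, and the three examples are invented for illustration.

```python
def exact_match_accuracy(predict, examples):
    """Fraction of examples where the prediction equals the expected label."""
    correct = sum(
        1 for text, expected in examples
        if predict(text).strip().lower() == expected.strip().lower()
    )
    return correct / len(examples)


def keyword_baseline(text):
    """Hypothetical stand-in for a model: a naive keyword rule."""
    return "positive" if "good" in text.lower() else "negative"


# Invented examples; replace with cases from your own workflow.
examples = [
    ("The results look good", "positive"),
    ("The sensor failed again", "negative"),
    ("Not good at all", "negative"),
]

score = exact_match_accuracy(keyword_baseline, examples)
print(f"Exact-match accuracy: {score:.2f}")
```

Swapping `keyword_baseline` for a function that calls a candidate model lets you compare several models on the same examples, including the cases a simple baseline gets wrong.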

A good reading habit is to ask:

  • what exactly is being measured?

  • does the benchmark match my real task?

  • was the model optimized specifically for this test?

  • what trade-offs are hidden, such as latency or cost?

Benchmark results are signals, not final answers.

Activity

For each scenario below, decide whether an LLM is a strong fit, a weak fit, or only safe with verification:

  1. Extract sampling dates and locations from PDF cruise reports.

  2. Provide final medical advice to a patient.

  3. Draft three versions of a project abstract.

  4. Produce exact legal citations without checking the source documents.

  5. Summarize ten papers and compare their methods in a table.

  6. NetCDF Scientific data assistant to …

    • Turn the analysis into a report for a paper/project

    • Detect potential issues: units, calendar, incorrect variables, inverted coordinates.

    • Create a dataset quality checklist.

Discuss your choices!

Extension questions:

  • Which of these tasks would benefit from grounding in supplied documents?

  • Which ones require a human expert to approve the final answer?

  • Which ones are mainly about speed and productivity, and which ones are high-risk?