LLM Fundamentals

This notebook introduces the basic ideas behind Large Language Models (LLMs), what they are good at, where they fail, and how to evaluate them in real workflows.

Suggested duration: 1 hour

Learning goals

By the end of this notebook, you should be able to:

explain what an LLM does in simple terms
identify strong and weak use cases
recognize common risks such as hallucinations and bias
describe the role of Hugging Face and open models
connect model size and hardware to speed and memory usage
interpret benchmarks with healthy skepticism

Table of Contents

What is an LLM?
What LLMs are good at
What LLMs are not good at
Key concepts and risks
Useful tools
Models and Hugging Face
Performance and hardware
Benchmarks

What is an LLM?

A Large Language Model is a neural network trained to predict the next token in a sequence. A neural network is a parameterized function made of many simple computational units arranged in layers; during training, its parameters (weights) are updated with gradient-based optimization (backpropagation) to reduce prediction error on large datasets.

Modern LLMs are usually Transformer-based neural networks that model long-range token relationships with self-attention, which is a key reason they scale well on language tasks. During training they see large amounts of text, code, and other structured data, and they learn statistical patterns that let them continue text in useful ways.

In practice, this means an LLM can:

summarize documents
rewrite text in a target style
extract structured information
answer questions about provided context
generate code, explanations, and drafts

An important mental model is that an LLM is not a database with guaranteed facts. It is a model that produces likely continuations based on patterns learned during training and on the prompt it receives at inference time.

This is why prompts matter so much: the model’s answer depends heavily on the instructions, the context we provide, the examples we include, and the format we request.

LLMs are powerful pattern completion systems, but they do not automatically know which facts are true, current, or safe.

How an LLM answers a prompt

At inference time, the model receives a prompt, converts it into tokens, and predicts which token should come next. It appends that token to the sequence and repeats the step until the answer is complete.

A simplified pipeline is:

The user writes a prompt.
The text is tokenized.
The model computes probabilities over possible next tokens.
A decoding strategy selects the next token.
The process repeats until the response ends.

This helps explain why LLM outputs can be fluent even when they are wrong: fluency comes from learned language patterns, not from guaranteed access to truth.
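
The pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not how a real LLM is implemented: the four-word vocabulary, the hand-written score table `NEXT_LOGITS`, and the helper names are all assumptions made for this example, and the "model" only looks at the last token.

```python
import numpy as np

# Toy vocabulary and a hand-written "model": both are illustrative assumptions.
VOCAB = ["<eos>", "the", "cat", "sleeps"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}

# NEXT_LOGITS[i] holds made-up scores for the token that follows token i.
NEXT_LOGITS = np.array([
    [5.0, 0.0, 0.0, 0.0],  # after <eos>: stay at <eos>
    [0.0, 0.0, 4.0, 1.0],  # after "the": mostly "cat"
    [0.5, 0.0, 0.0, 4.0],  # after "cat": mostly "sleeps"
    [4.0, 0.5, 0.0, 0.0],  # after "sleeps": mostly <eos>
])


def tokenize(text):
    """Step 2: split the prompt into known words and map them to ids."""
    return [TOKEN_TO_ID[word] for word in text.lower().split()]


def next_token_probs(token_ids):
    """Step 3: turn the scores for the last token into probabilities (softmax)."""
    logits = NEXT_LOGITS[token_ids[-1]]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def greedy_decode(prompt, max_new_tokens=5):
    """Steps 4-5: pick the most likely token, append it, and repeat."""
    ids = tokenize(prompt)
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(next_token_probs(ids)))
        if VOCAB[next_id] == "<eos>":
            break
        ids.append(next_id)
    return " ".join(VOCAB[i] for i in ids)


print(greedy_decode("the"))  # the cat sleeps
```

Greedy decoding is only one possible decoding strategy; sampling with temperature or top-k replaces the `argmax` step with a random draw.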

Example: Training a Neural Network to Generate Well-Formed Word Sequences

In this hands-on example, we build a small neural language model that learns local word-order patterns and generates grammatically plausible sequences.

Step descriptions:

3-token context window: uses the previous three tokens to predict the next one; this defines the model’s short-context memory.
trainable embeddings: represents each token as a dense vector learned during training, instead of a fixed one-hot vector.
mini-batch training: trains on small batches per iteration to stabilize gradients and speed up optimization.
validation split: reserves part of the data to evaluate generalization beyond the training set.
perplexity tracking: measures how "surprised" the model is by the data; lower perplexity usually means better next-token predictions.
top-k prediction inspection: checks the top-k most likely next tokens to understand local model behavior.
temperature-based generation: controls randomness during decoding; lower temperature is more conservative, higher temperature is more diverse.
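
Two of the ideas above, temperature and perplexity, are easy to check numerically before reading the full example. The sketch below uses made-up scores for three candidate tokens; the function names are chosen for this illustration and are not part of the notebook's model code.

```python
import numpy as np


def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature before normalizing."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()


def perplexity(target_probs):
    """exp of the mean negative log-probability given to the correct tokens."""
    target_probs = np.asarray(target_probs, dtype=float)
    return float(np.exp(-np.log(target_probs).mean()))


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    # Lower temperature concentrates probability on the top token.
    print(f"temperature={t}: {np.round(softmax_with_temperature(logits, t), 3)}")

# A model that always gives the correct token probability 0.25 is as
# "surprised" as a uniform guess among 4 options: perplexity is about 4.
print(perplexity([0.25, 0.25, 0.25]))
```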

Special tokens (special_tokens = ["<pad>", "<bos>", "<eos>", "<unk>"]):

``<pad>``: padding token used to make sequences the same length within a batch.
``<bos>``: "beginning of sequence" token that marks where a sequence starts.
``<eos>``: "end of sequence" token that signals where a sequence ends.
``<unk>``: "unknown" token used for words not present in the known vocabulary.

[ ]:

import re
import numpy as np

np.random.seed(7)

# 1) A richer toy corpus with several semantic patterns.
corpus = [
    "el gato duerme en la silla",
    "el gato observa la ventana",
    "el perro corre por el parque",
    "el perro ladra en la noche",
    "la nina lee un libro interesante",
    "la nina escribe una historia corta",
    "el nino dibuja una casa pequena",
    "el nino escribe una carta amable",
    "la profesora explica la leccion con calma",
    "el alumno responde la pregunta correcta",
    "la musica suena en la sala grande",
    "la lluvia cae sobre el tejado rojo",
    "el tren llega a la estacion central",
    "el coche gira en la esquina estrecha",
    "el barco navega sobre el mar tranquilo",
    "la cientifica analiza los datos del oceano",
    "el investigador compara modelos de lenguaje",
    "la estudiante resume un articulo cientifico",
    "el asistente organiza notas para la reunion",
    "la biblioteca guarda libros de historia",
]

# 2) Simple tokenization.
def tok(text):
    return re.findall(r"[a-zA-Záéíóúñ]+", text.lower())

tokenized = [tok(sentence) for sentence in corpus]
special_tokens = ["<pad>", "<bos>", "<eos>", "<unk>"]
vocab_words = sorted({word for sent in tokenized for word in sent})
vocab = special_tokens + vocab_words
word2id = {word: idx for idx, word in enumerate(vocab)}
id2word = {idx: word for word, idx in word2id.items()}
V = len(vocab)
context_size = 3


def encode(word):
    return word2id.get(word, word2id["<unk>"])


# 3) Dataset: 3-word context -> next token.
X_list, y_list = [], []
for sent in tokenized:
    padded = ["<bos>"] * context_size + sent + ["<eos>"]
    for i in range(context_size, len(padded)):
        context = padded[i - context_size:i]
        target = padded[i]
        X_list.append([encode(word) for word in context])
        y_list.append(encode(target))

X = np.array(X_list, dtype=np.int64)
y = np.array(y_list, dtype=np.int64)

# Train / validation split.
indices = np.random.permutation(len(X))
cut = int(0.8 * len(indices))
train_idx, val_idx = indices[:cut], indices[cut:]
X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

print(f"Vocabulary size: {V}")
print(f"Training examples: {len(X_train)} | Validation examples: {len(X_val)}")

# 4) A slightly more realistic neural language model:
# embeddings -> hidden layer -> vocabulary logits.
D = 24  # embedding size
H = 64  # hidden layer size
# Weights and biases, initialized with small random values.
E = 0.05 * np.random.randn(V, D)
W1 = 0.05 * np.random.randn(context_size * D, H)
b1 = np.zeros((1, H))
W2 = 0.05 * np.random.randn(H, V)
b2 = np.zeros((1, V))


def softmax(logits):
    logits = logits - np.max(logits, axis=1, keepdims=True)
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits, axis=1, keepdims=True)


def forward(batch_ids):
    embeddings = E[batch_ids]
    x = embeddings.reshape(batch_ids.shape[0], context_size * D)
    h_pre = x @ W1 + b1
    h = np.tanh(h_pre)
    logits = h @ W2 + b2
    probs = softmax(logits)
    cache = (batch_ids, x, h)
    return probs, cache


def cross_entropy(probs, targets):
    return -np.log(probs[np.arange(len(targets)), targets] + 1e-12).mean()


def accuracy(probs, targets):
    return (np.argmax(probs, axis=1) == targets).mean()


def evaluate(X_data, y_data):
    probs, _ = forward(X_data)
    loss = cross_entropy(probs, y_data)
    acc = accuracy(probs, y_data)
    ppl = np.exp(loss)
    return loss, acc, ppl


# 5) Mini-batch gradient descent.
lr = 0.12
epochs = 500
batch_size = 16

for epoch in range(epochs):
    order = np.random.permutation(len(X_train))
    X_train = X_train[order]
    y_train = y_train[order]

    for start in range(0, len(X_train), batch_size):
        xb = X_train[start:start + batch_size]
        yb = y_train[start:start + batch_size]
        probs, (batch_ids, x, h) = forward(xb)
        n = len(xb)

        dlogits = probs.copy()
        dlogits[np.arange(n), yb] -= 1.0
        dlogits /= n

        dW2 = h.T @ dlogits
        db2 = dlogits.sum(axis=0, keepdims=True)

        dh = dlogits @ W2.T
        dh_pre = dh * (1.0 - h ** 2)

        dW1 = x.T @ dh_pre
        db1 = dh_pre.sum(axis=0, keepdims=True)

        dx = (dh_pre @ W1.T).reshape(n, context_size, D)
        dE = np.zeros_like(E)
        for pos in range(context_size):
            np.add.at(dE, batch_ids[:, pos], dx[:, pos, :])

        E -= lr * dE
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

    if epoch % 100 == 0 or epoch == epochs - 1:
        train_loss, train_acc, train_ppl = evaluate(X_train, y_train)
        val_loss, val_acc, val_ppl = evaluate(X_val, y_val)
        print(
            f"Epoch {epoch:>3} | "
            f"train_loss={train_loss:.4f} train_acc={train_acc:.2%} train_ppl={train_ppl:.2f} | "
            f"val_loss={val_loss:.4f} val_acc={val_acc:.2%} val_ppl={val_ppl:.2f}"
        )


# 6) Inspect the distribution over next tokens.
def top_predictions(context_words, k=5):
    context_ids = np.array([[encode(word) for word in context_words]], dtype=np.int64)
    probs, _ = forward(context_ids)
    best = np.argsort(probs[0])[::-1][:k]
    return [(id2word[idx], float(probs[0, idx])) for idx in best]


print("\nTop next-token predictions:")
for context in [
    ["<bos>", "<bos>", "el"],
    ["la", "cientifica", "analiza"],
    ["el", "perro", "corre"],
]:
    print(f"Context {context} -> {top_predictions(context, k=5)}")


# 7) Autoregressive generation with temperature and top-k sampling.
def sample_next(context_words, temperature=0.8, top_k=5):
    context_ids = np.array([[encode(word) for word in context_words]], dtype=np.int64)
    probs, _ = forward(context_ids)
    logits = np.log(probs[0] + 1e-12) / max(temperature, 1e-6)
    logits = logits - np.max(logits)
    filtered = np.argsort(logits)[::-1][:top_k]
    filtered_logits = logits[filtered]
    filtered_probs = np.exp(filtered_logits)
    filtered_probs /= filtered_probs.sum()
    next_id = int(np.random.choice(filtered, p=filtered_probs))
    return id2word[next_id]


def generate(seed=("<bos>", "<bos>", "<bos>"), max_len=10, temperature=0.8, top_k=5):
    context = list(seed)
    generated = []
    for _ in range(max_len):
        next_word = sample_next(context, temperature=temperature, top_k=top_k)
        if next_word == "<eos>":
            break
        generated.append(next_word)
        context = context[1:] + [next_word]
    return " ".join(generated)


print("\nGenerated samples:")
for seed in [
    ("<bos>", "<bos>", "el"),
    ("<bos>", "<bos>", "la"),
    ("la", "cientifica", "analiza"),
]:
    for _ in range(2):
        print(f"seed={seed} -> {generate(seed=seed, temperature=0.9, top_k=4)}")
    print()

Vocabulary size: 87
Training examples: 117 | Validation examples: 30
Epoch   0 | train_loss=4.4202 train_acc=14.53% train_ppl=83.12 | val_loss=4.4221 val_acc=23.33% val_ppl=83.27
Epoch 100 | train_loss=1.5558 train_acc=51.28% train_ppl=4.74 | val_loss=4.9246 val_acc=20.00% val_ppl=137.64
Epoch 200 | train_loss=0.7792 train_acc=74.36% train_ppl=2.18 | val_loss=6.6347 val_acc=23.33% val_ppl=761.09
Epoch 300 | train_loss=0.5004 train_acc=82.05% train_ppl=1.65 | val_loss=7.4939 val_acc=10.00% val_ppl=1797.04
Epoch 400 | train_loss=0.4281 train_acc=82.91% train_ppl=1.53 | val_loss=7.8468 val_acc=10.00% val_ppl=2557.45
Epoch 499 | train_loss=0.4085 train_acc=82.91% train_ppl=1.50 | val_loss=8.2307 val_acc=10.00% val_ppl=3754.46

Top next-token predictions:
Context ['<bos>', '<bos>', 'el'] -> [('perro', 0.37899802163743385), ('gato', 0.1441843925076278), ('nino', 0.11311341056122948), ('coche', 0.09503225811707484), ('asistente', 0.08856609115669524)]
Context ['la', 'cientifica', 'analiza'] -> [('los', 0.9702507142023091), ('un', 0.009214923142383103), ('<eos>', 0.005744469491607177), ('parque', 0.00375609648102881), ('sobre', 0.003254359781328852)]
Context ['el', 'perro', 'corre'] -> [('por', 0.9633687517233784), ('sobre', 0.01362457898019891), ('modelos', 0.009082521432186824), ('del', 0.0032136696933594677), ('casa', 0.002558788846126655)]

Generated samples:
seed=('<bos>', '<bos>', 'el') -> perro ladra sobre el mar tranquilo
seed=('<bos>', '<bos>', 'el') -> coche gira en la esquina estrecha

seed=('<bos>', '<bos>', 'la') -> musica lee un articulo cientifico cientifico de lenguaje
seed=('<bos>', '<bos>', 'la') -> biblioteca con un articulo cientifico en la silla con los

seed=('la', 'cientifica', 'analiza') -> un libro interesante
seed=('la', 'cientifica', 'analiza') -> los datos del oceano

What LLMs are good at

Good use cases usually have one or more of these characteristics:

Strong fit	Why it works well
Summarization	The task is mostly compression and re-expression
Classification	The model can map text into predefined labels
Information extraction	The output can be constrained into fields or tables
Drafting	The model can propose a first version quickly
Question answering with context	The answer can be grounded in supplied text
Brainstorming	The value comes from many candidate ideas

As a rule, LLMs work best when we define the goal, the context, and the format clearly.
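
One way to make goal, context, and format explicit is to assemble the prompt from three named parts. The helper below is a hypothetical sketch; the section labels and the example wording are assumptions, and any clear layout would serve the same purpose.

```python
def build_prompt(goal, context, output_format):
    """Assemble a prompt with explicit goal, context, and format sections."""
    return (
        f"Goal:\n{goal.strip()}\n\n"
        f"Context:\n{context.strip()}\n\n"
        f"Output format:\n{output_format.strip()}\n"
    )


prompt = build_prompt(
    goal="Summarize the abstract for a non-specialist reader.",
    context="<paste the abstract here>",
    output_format="Five bullet points, one sentence each.",
)
print(prompt)
```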

Typical examples of strong classroom or professional use cases include:

turning a long paper abstract into five key points
extracting author, year, method, and limitation from a set of papers
generating a first draft of an email, abstract, or meeting summary
converting unstructured notes into a Markdown table
suggesting code comments or documentation from existing code
translating technical language into a version for a broader audience

A useful heuristic is this: if a human could solve the task mostly by reading, rewriting, classifying, or structuring text, an LLM may be a strong accelerator.
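
Structured extraction is easier to trust when the output is checked mechanically. The sketch below assumes the model was asked to answer in JSON with the fields author, year, method, and limitation; the `raw` string is written by hand for this example and is not real model output.

```python
import json

REQUIRED_FIELDS = ("author", "year", "method", "limitation")


def parse_extraction(raw_output):
    """Parse a JSON answer and list the required fields that are missing or empty."""
    try:
        record = json.loads(raw_output)
    except json.JSONDecodeError:
        return None, list(REQUIRED_FIELDS)
    if not isinstance(record, dict):
        return None, list(REQUIRED_FIELDS)
    missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
    return record, missing


# Hypothetical model output, written by hand for this example.
raw = '{"author": "Example et al.", "year": 2020, "method": "survey", "limitation": ""}'
record, missing = parse_extraction(raw)
print(missing)  # ['limitation']
```

A non-empty `missing` list is a signal to re-prompt or to send the record to a human reviewer instead of silently accepting it.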

What LLMs are not good at

LLMs are weaker when a task requires guaranteed truth, precise arithmetic, or hidden domain knowledge that is not present in the prompt.

Weak fit	Main risk
Up-to-date factual search	The model may be outdated
High-stakes decisions	Errors can be expensive or dangerous
Exact citations	The model may invent references
Long multi-step planning without checks	The model may drift or skip constraints
Sensitive data handling	Privacy and compliance risks
Numeric reliability	Small arithmetic mistakes are common

A useful slogan is: use LLMs to accelerate thinking, not to replace verification.

Examples of weak or risky use cases include:

asking for the latest regulations without checking official sources
using the model as the final judge in legal or medical decisions
trusting generated references without opening the original papers
asking for exact calculations in finance or engineering without verification
sending private institutional data to a public hosted chatbot

In these settings, an LLM can still be useful as an assistant, but only if a human or an external system validates the result.

A quick test for use-case quality

Before using an LLM for a task, ask three questions:

Is the task mainly about language, structure, or retrieval?
Can I define what a good answer looks like?
Do I have a way to verify the result?

If the answer to all three is yes, the task is often a good candidate for LLM assistance.

Key concepts and risks

Hallucinations

A hallucination happens when the model generates fluent but false or unsupported content.

Ways to detect and reduce hallucinations:

ask for answers grounded in a provided passage
request quoted evidence or source spans
compare the answer with the original document
break the task into extraction first, synthesis second
verify critical facts with a trusted external source

Common warning signs include overly confident claims, invented citations, missing uncertainty, and answers that sound plausible but do not match the provided material.

This is why evidence-aware prompting and source checking are core habits, not optional extras.
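
If the model is asked to quote its evidence, a simple first check is whether each quote really appears in the source. The sketch below is a minimal version of that idea with hand-written example text; it only catches quotes that are not verbatim, and it cannot tell whether a genuine quote actually supports the claim.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so line breaks do not cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_quotes(quotes, source_text):
    """Return the quotes that do not appear verbatim in the source."""
    source = normalize(source_text)
    return [quote for quote in quotes if normalize(quote) not in source]


# Hand-written example source and quotes.
source = "Samples were collected in March 2019\nat three coastal stations."
quotes = [
    "collected in March 2019 at three coastal stations",
    "collected at five offshore stations",
]
print(unsupported_quotes(quotes, source))  # ['collected at five offshore stations']
```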

Bias

LLMs can reproduce stereotypes or skewed assumptions from training data. Bias can appear in classification, ranking, tone, and recommendations.

Examples include unequal treatment of demographic groups, culturally narrow assumptions, or systematically different tone when describing similar people or regions.

Overreliance

Fast answers can create false confidence. Users may stop checking evidence, especially when the writing sounds expert.

Data privacy

Never assume a public model or web tool is safe for confidential data. Before sharing private data, check the deployment mode, retention policy, and institutional rules.

In professional settings, privacy review is not optional. Data governance, retention, and access control matter as much as model quality.
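
As a small illustration of reducing what leaves your machine, the sketch below masks e-mail addresses and phone-like numbers before a text is sent anywhere. The two regular expressions are rough assumptions that will miss many identifiers (names, addresses, IDs), so this complements a privacy review rather than replacing it.

```python
import re

# Rough patterns chosen for this example; they are not exhaustive.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


print(redact("Contact ana@example.org or +34 600 123 456."))
# Contact [EMAIL] or [PHONE].
```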

Useful tools

Three tools that often appear in research and knowledge workflows are:

Perplexity AI: useful for web-grounded question answering and fast exploration.
NotebookLM: useful for asking questions over a user-provided collection of documents.
Scite.ai: useful for literature search and for checking whether papers are supported or contradicted by later citations.

These tools can save time, but they still need human review. Always inspect the source material before trusting a claim.

A practical way to think about them is:

Tool	Best for	Main caution
Perplexity AI	broad exploration, quick web search, finding recent sources	may still summarize weak or irrelevant sources
NotebookLM	asking questions over your own uploaded reports, slides, or papers	quality depends on the documents you provide
Scite.ai	literature discovery and citation context	citation counts and citation labels still need interpretation

These are not replacements for reading. They are better understood as navigation tools that help you find, compare, and prioritize material faster.

Models and Hugging Face

Models differ in size, architecture, context length, license, training data, and instruction tuning.

Hugging Face is a major platform for discovering and sharing models, datasets, and tokenizers. In practice, it is useful because it provides:

model cards describing strengths, limits, and licenses
ready-to-use checkpoints for text generation and embeddings
dataset hosting and evaluation resources
an ecosystem for running models locally or in the cloud

When choosing a model, look at more than raw size. Also consider latency, memory footprint, multilingual quality, domain fit, and whether the task needs generation, embeddings, or classification.

A simple distinction that helps beginners is:

hosted API models: easy to use, usually strong performance, but involve external infrastructure and cost
open models: flexible and transparent, often available through Hugging Face, and useful for local experimentation
local models: attractive for privacy or offline use, but more constrained by hardware and often less capable than large hosted systems

For many teaching scenarios, Hugging Face is useful not only to download models, but also to compare model cards and discuss licensing, intended use, and known limitations.

Choosing a model in practice

A practical model-selection checklist includes:

task type: generation, extraction, classification, embeddings, or search
language coverage: English-only or multilingual
deployment: cloud API, local machine, or institutional infrastructure
latency needs: interactive chat or offline batch processing
privacy constraints: public, internal, or confidential data
budget: both inference cost and engineering effort

In many real projects, the best model is not the biggest one. It is the one that is good enough, affordable, controllable, and compatible with your data constraints.
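
The checklist can be applied as a simple filter over candidate models. In the sketch below, the `Candidate` fields, the three entries, and their memory figures are placeholders invented for illustration; they do not describe real models.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    tasks: set
    multilingual: bool
    runs_locally: bool
    memory_gib: float  # local memory needed; 0.0 for a hosted API


def shortlist(candidates, task, need_multilingual, need_local, max_memory_gib):
    """Keep only the candidates that satisfy every hard constraint."""
    return [
        c.name
        for c in candidates
        if task in c.tasks
        and (c.multilingual or not need_multilingual)
        and (c.runs_locally or not need_local)
        and c.memory_gib <= max_memory_gib
    ]


# Placeholder entries, not real models or measured numbers.
candidates = [
    Candidate("small-local", {"classification", "extraction"}, False, True, 4.0),
    Candidate("mid-multilingual", {"generation", "extraction"}, True, True, 14.0),
    Candidate("large-hosted", {"generation", "extraction", "search"}, True, False, 0.0),
]

print(shortlist(candidates, "extraction", need_multilingual=True,
                need_local=True, max_memory_gib=16.0))
# ['mid-multilingual']
```

Hard constraints such as privacy or memory are filtered first; softer criteria such as budget or latency can then be used to rank whatever remains.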

Running LLMs locally with Ollama, llama.cpp, and vLLM

There are three styles of tool: Ollama offers an easy desktop-style workflow, llama.cpp is a lightweight low-level runtime, and vLLM is a high-throughput inference server.

Ollama is usually the easiest entry point for local experimentation. It focuses on usability: download a model, run it with a simple command, and interact with it through a local API or chat-like workflow. It is a strong choice for:

classroom demos
fast prototyping on a laptop or workstation
trying different open models without much setup
building simple local applications that call a model through an HTTP API

Its main advantage is convenience. Its main limitation is that it is not primarily designed for maximum serving efficiency at scale.
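
To make the "local HTTP API" point concrete, here is a hedged sketch of a call to Ollama's generate endpoint using only the standard library. It assumes a default local installation listening on port 11434; the model name `llama3.2` is just an example and must already be pulled on your machine. The network call itself is left commented out so the cell runs without a server.

```python
import json
import urllib.request

# Default address of a local Ollama server (assumes a standard installation).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt):
    """Build a non-streaming generate request for the local server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ollama_generate(model, prompt, timeout=120):
    """Send the request and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


request = build_request("llama3.2", "Explain tokenization in one sentence.")
print(request.get_method(), request.full_url)

# Requires a running Ollama server with the model available:
# print(ollama_generate("llama3.2", "Explain tokenization in one sentence."))
```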

llama.cpp is a lightweight C/C++ inference runtime that became especially popular for running quantized models on local hardware, including CPUs and modest devices. It is often used when people want direct control, portability, and efficient inference without a heavy serving stack. It is a strong choice for:

CPU inference
low-resource machines
edge or embedded experimentation
understanding the mechanics of local inference more directly

Its main advantage is portability and efficiency with quantized models. Its main limitation is that the workflow can feel lower-level and less polished than Ollama.

vLLM is a more production-oriented inference engine. It is designed for efficient serving, especially when many requests must be handled at the same time. It is known for high throughput and better GPU utilization, so it is often used when a team wants to expose a model behind an API for multiple users or applications. It is a strong choice for:

research labs serving one or more models internally
backend APIs for chat or batch generation
deployments where latency and throughput matter
larger GPU-based systems

Its main advantage is serving performance. Its main limitation is that it is more infrastructure-oriented than beginner-oriented.

A practical comparison is:

Tool	Best mental model	Best for	Main trade-off
Ollama	easy local model manager	quick demos, teaching, personal workflows	less focused on large-scale serving
vLLM	efficient inference server	multi-user APIs, GPU serving, production-like setups	more complex operationally
llama.cpp	lightweight local runtime	CPU inference, quantized models, portable setups	lower-level workflow

These tools are complementary, not mutually exclusive.

Performance and hardware

Running an LLM has a cost in memory and compute.

Matching LLMs to your hardware

GPU: usually gives much lower latency than CPU for generation.
VRAM / memory usage: larger models and longer contexts require more memory.
Precision: lower precision such as 8-bit or 4-bit quantization reduces memory usage.
Context length: longer prompts increase compute and latency.
Tokens per second: a practical measure of interactive speed.

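A rough rule is that the memory needed for the weights is about the parameter count times the bytes per parameter, so parameter count strongly affects memory needs. Deployment details such as precision, context length, and batch size also matter.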
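
Context length adds memory on top of the weights because a Transformer usually caches the keys and values of every previous token during generation. The sketch below estimates that cache under simplifying assumptions: 16-bit values, a single sequence, and an illustrative shape of 32 layers with 32 key/value heads of size 128, which is not meant to describe any specific model.

```python
def estimate_kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens,
                          bytes_per_value=2, batch_size=1):
    """Estimate key/value cache size: 2 tensors (K and V) per layer and token."""
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value * batch_size)
    return round(total_bytes / (1024 ** 3), 2)


# Illustrative shape (32 layers, 32 KV heads of size 128), not a specific model.
for context_tokens in (2_048, 8_192, 32_768):
    gib = estimate_kv_cache_gib(32, 32, 128, context_tokens)
    print(f"{context_tokens:>6} tokens of context: ~{gib} GiB of KV cache")
```

With these numbers the cache grows linearly, from about 1 GiB at 2,048 tokens to about 16 GiB at 32,768 tokens. Real runtimes may use fewer key/value heads or a lower-precision cache, which reduces the figure.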

Practical examples:

a small local model may be good enough for classification, extraction, or drafting
a larger model may be worth the cost for harder reasoning or multilingual tasks
very long prompts can make even a good model slow and expensive
quantization can make a model feasible on smaller hardware, but sometimes with quality trade-offs

[5]:

def estimate_weight_memory(parameters_billion, bits_per_parameter=16):
    parameters = parameters_billion * 1_000_000_000
    total_bits = parameters * bits_per_parameter
    total_bytes = total_bits / 8
    gib = total_bytes / (1024 ** 3)
    return round(gib, 2)

for params in [3, 7, 13, 70, 3000]:  # 3000 stands in for a hypothetical very large model
    print(f"{params}B model at 16-bit weights: ~{estimate_weight_memory(params, 16)} GiB")
    print(f"{params}B model at 4-bit weights:  ~{estimate_weight_memory(params, 4)} GiB")
    print('-' * 50)

3B model at 16-bit weights: ~5.59 GiB
3B model at 4-bit weights:  ~1.4 GiB
--------------------------------------------------
7B model at 16-bit weights: ~13.04 GiB
7B model at 4-bit weights:  ~3.26 GiB
--------------------------------------------------
13B model at 16-bit weights: ~24.21 GiB
13B model at 4-bit weights:  ~6.05 GiB
--------------------------------------------------
70B model at 16-bit weights: ~130.39 GiB
70B model at 4-bit weights:  ~32.6 GiB
--------------------------------------------------
3000B model at 16-bit weights: ~5587.94 GiB
3000B model at 4-bit weights:  ~1396.98 GiB
--------------------------------------------------
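
Tokens per second, mentioned in the list above, can be measured with a small timing wrapper around whatever generation function you use. In this sketch the model is a stand-in that sleeps briefly and returns a fixed string, and tokens are counted by splitting on whitespace; both are assumptions for illustration, and a real measurement should use the model's own tokenizer.

```python
import time


def tokens_per_second(generate_fn, prompt, count_tokens):
    """Time one generation call and divide the output token count by the elapsed time."""
    start = time.perf_counter()
    output = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    return count_tokens(output) / elapsed if elapsed > 0 else float("inf")


def fake_generate(prompt):
    """Stand-in for a real model call: waits a little and returns five words."""
    time.sleep(0.05)
    return "one two three four five"


rate = tokens_per_second(fake_generate, "hello", lambda text: len(text.split()))
print(f"~{rate:.0f} tokens/s (whitespace tokens, stand-in model)")
```

For a fair comparison between models or machines, repeat the measurement several times with the same prompt and output length, and report the median.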

Benchmarks

Benchmarks try to measure capability in a controlled way. Common examples include:

MMLU for broad academic knowledge
GSM8K for grade-school math word problems
HumanEval for coding tasks
MT-Bench for multi-turn chat quality

Benchmarks are useful, but they do not fully predict real-world usefulness. A model with higher benchmark scores may still be slower, more expensive, harder to control, or worse for a domain-specific task.

Always test models on your own data and your own workflow.

A good reading habit is to ask:

what exactly is being measured?
does the benchmark match my real task?
was the model optimized specifically for this test?
what trade-offs are hidden, such as latency or cost?

Benchmark results are signals, not final answers.
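
Testing on your own data can start very small. The sketch below scores any question-answering function by exact match against a handful of expected answers; the three examples and the `stand_in_model` function are placeholders written for this illustration, so the resulting number says nothing about a real model.

```python
def exact_match_accuracy(model_fn, examples):
    """Fraction of examples where the normalized prediction equals the expected answer."""
    def norm(text):
        return " ".join(text.lower().split())

    hits = sum(norm(model_fn(question)) == norm(answer) for question, answer in examples)
    return hits / len(examples)


# Hand-written placeholder examples: (question, expected answer).
examples = [
    ("Unit of salinity in this dataset?", "PSU"),
    ("Calendar used by the time axis?", "gregorian"),
    ("Name of the temperature variable?", "thetao"),
]

# Stand-in for a real model: canned answers, one right, one wrong, one missing.
canned = {
    "Unit of salinity in this dataset?": "psu",
    "Calendar used by the time axis?": "noleap",
}


def stand_in_model(question):
    return canned.get(question, "unknown")


accuracy_on_own_data = exact_match_accuracy(stand_in_model, examples)
print(f"Exact-match accuracy: {accuracy_on_own_data:.2f}")
```

Exact match is a strict metric that suits short factual answers; summaries or free-form text usually need a rubric or human review instead.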

Activity

For each scenario below, decide whether an LLM is a strong fit, a weak fit, or only safe with verification:

Extract sampling dates and locations from PDF cruise reports.
Provide final medical advice to a patient.
Draft three versions of a project abstract.
Produce exact legal citations checking the source documents.
Summarize ten papers and compare their methods in a table.
NetCDF Scientific data assistant to …
- Turn the analysis into a report for a paper/project
- Detect potential issues: units, calendar, incorrect variables, inverted coordinates.
- Create a dataset quality checklist.

Discuss your choices!

Extension questions:

Which of these tasks would benefit from grounding in supplied documents?
Which ones require a human expert to approve the final answer?
Which ones are mainly about speed and productivity, and which ones are high-risk?