LLM Fundamentals
This notebook introduces the basic ideas behind Large Language Models (LLMs), what they are good at, where they fail, and how to evaluate them in real workflows.
Suggested duration: 1 hour
Learning goals
By the end of this notebook, you should be able to:
explain what an LLM does in simple terms
identify strong and weak use cases
recognize common risks such as hallucinations and bias
describe the role of Hugging Face and open models
connect model size and hardware to speed and memory usage
interpret benchmarks with healthy skepticism
Table of Contents
What is an LLM?
A Large Language Model is a neural network trained to predict the next token in a sequence. A neural network is a parameterized function made of many simple computational units arranged in layers; during training, its parameters (weights) are updated with gradient-based optimization (backpropagation) to reduce prediction error on large datasets .
Modern LLMs are usually Transformer-based neural networks that model long-range token relationships with self-attention, which is a key reason they scale well on language tasks . During training they see large amounts of text, code, and other structured data, and they learn statistical patterns that let them continue text in useful ways.
In practice, this means an LLM can:
summarize documents
rewrite text in a target style
extract structured information
answer questions about provided context
generate code, explanations, and drafts
An important mental model is that an LLM is not a database with guaranteed facts. It is a model that produces likely continuations based on patterns learned during training and on the prompt it receives at inference time.
This is why prompts matter so much: the model’s answer depends heavily on the instructions, the context we provide, the examples we include, and the format we request.
LLMs are powerful pattern completion systems, but they do not automatically know which facts are true, current, or safe.
How an LLM answers a prompt
At inference time, the model receives a prompt, converts it into tokens, and repeatedly predicts what token should come next. This process is repeated many times until the answer is complete.
A simplified pipeline is:
The user writes a prompt.
The text is tokenized.
The model computes probabilities over possible next tokens.
A decoding strategy selects the next token.
The process repeats until the response ends.
This helps explain why LLM outputs can be fluent even when they are wrong: fluency comes from learned language patterns, not from guaranteed access to truth.
Example: Training a Neural Network to Generate Well-Formed Word Sequences
In this hands-on example, we build a small neural language model that learns local word-order patterns and generates grammatically plausible sequences.
Step descriptions:
3-token context window: uses the previous three tokens to predict the next one; this defines the model’s short-context memory.
trainable embeddings: represents each token as a dense vector learned during training, instead of a fixed one-hot vector.
mini-batch training: trains on small batches per iteration to stabilize gradients and speed up optimization.
validation split: reserves part of the data to evaluate generalization beyond the training set.
perplexity tracking: measures how «surprised» the model is by the data; lower perplexity usually means better next-token predictions.
top-k prediction inspection: checks the top-k most likely next tokens to understand local model behavior.
temperature-based generation: controls randomness during decoding; lower temperature is more conservative, higher temperature is more diverse.
Special tokens (special_tokens = ["<pad>", "<bos>", "<eos>", "<unk>"]):
``<pad>``: padding token used to make sequences the same length within a batch.
``<bos>``: «beginning of sequence» token that marks where a sequence starts.
``<eos>``: «end of sequence» token that signals where a sequence ends.
``<unk>``: «unknown» token used for words not present in the known vocabulary.
[ ]:
import re
import numpy as np
np.random.seed(7)
# 1) A richer toy corpus with several semantic patterns.
corpus = [
"el gato duerme en la silla",
"el gato observa la ventana",
"el perro corre por el parque",
"el perro ladra en la noche",
"la nina lee un libro interesante",
"la nina escribe una historia corta",
"el nino dibuja una casa pequena",
"el nino escribe una carta amable",
"la profesora explica la leccion con calma",
"el alumno responde la pregunta correcta",
"la musica suena en la sala grande",
"la lluvia cae sobre el tejado rojo",
"el tren llega a la estacion central",
"el coche gira en la esquina estrecha",
"el barco navega sobre el mar tranquilo",
"la cientifica analiza los datos del oceano",
"el investigador compara modelos de lenguaje",
"la estudiante resume un articulo cientifico",
"el asistente organiza notas para la reunion",
"la biblioteca guarda libros de historia",
]
# 2) Simple tokenization.
def tok(text):
return re.findall(r"[a-zA-Záéíóúñ]+", text.lower())
tokenized = [tok(sentence) for sentence in corpus]
special_tokens = ["<pad>", "<bos>", "<eos>", "<unk>"]
vocab_words = sorted({word for sent in tokenized for word in sent})
vocab = special_tokens + vocab_words
word2id = {word: idx for idx, word in enumerate(vocab)}
id2word = {idx: word for word, idx in word2id.items()}
V = len(vocab)
context_size = 3
def encode(word):
return word2id.get(word, word2id["<unk>"])
# 3) Dataset: 3-word context -> next token.
X_list, y_list = [], []
for sent in tokenized:
padded = ["<bos>"] * context_size + sent + ["<eos>"]
for i in range(context_size, len(padded)):
context = padded[i - context_size:i]
target = padded[i]
X_list.append([encode(word) for word in context])
y_list.append(encode(target))
X = np.array(X_list, dtype=np.int64)
y = np.array(y_list, dtype=np.int64)
# Train / validation split.
indices = np.random.permutation(len(X))
cut = int(0.8 * len(indices))
train_idx, val_idx = indices[:cut], indices[cut:]
X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
print(f"Vocabulary size: {V}")
print(f"Training examples: {len(X_train)} | Validation examples: {len(X_val)}")
# 4) A slightly more realistic neural language model:
# embeddings -> hidden layer -> vocabulary logits.
D = 24 #embedding size
H = 64 #hidden layer size
# weights and biases
E = 0.05 * np.random.randn(V, D)
W1 = 0.05 * np.random.randn(context_size * D, H)
b1 = np.zeros((1, H))
W2 = 0.05 * np.random.randn(H, V)
b2 = np.zeros((1, V))
def softmax(logits):
logits = logits - np.max(logits, axis=1, keepdims=True)
exp_logits = np.exp(logits)
return exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
def forward(batch_ids):
embeddings = E[batch_ids]
x = embeddings.reshape(batch_ids.shape[0], context_size * D)
h_pre = x @ W1 + b1
h = np.tanh(h_pre)
logits = h @ W2 + b2
probs = softmax(logits)
cache = (batch_ids, x, h)
return probs, cache
def cross_entropy(probs, targets):
return -np.log(probs[np.arange(len(targets)), targets] + 1e-12).mean()
def accuracy(probs, targets):
return (np.argmax(probs, axis=1) == targets).mean()
def evaluate(X_data, y_data):
probs, _ = forward(X_data)
loss = cross_entropy(probs, y_data)
acc = accuracy(probs, y_data)
ppl = np.exp(loss)
return loss, acc, ppl
# 5) Mini-batch gradient descent.
lr = 0.12
epochs = 500
batch_size = 16
for epoch in range(epochs):
order = np.random.permutation(len(X_train))
X_train = X_train[order]
y_train = y_train[order]
for start in range(0, len(X_train), batch_size):
xb = X_train[start:start + batch_size]
yb = y_train[start:start + batch_size]
probs, (batch_ids, x, h) = forward(xb)
n = len(xb)
dlogits = probs.copy()
dlogits[np.arange(n), yb] -= 1.0
dlogits /= n
dW2 = h.T @ dlogits
db2 = dlogits.sum(axis=0, keepdims=True)
dh = dlogits @ W2.T
dh_pre = dh * (1.0 - h ** 2)
dW1 = x.T @ dh_pre
db1 = dh_pre.sum(axis=0, keepdims=True)
dx = (dh_pre @ W1.T).reshape(n, context_size, D)
dE = np.zeros_like(E)
for pos in range(context_size):
np.add.at(dE, batch_ids[:, pos], dx[:, pos, :])
E -= lr * dE
W1 -= lr * dW1
b1 -= lr * db1
W2 -= lr * dW2
b2 -= lr * db2
if epoch % 100 == 0 or epoch == epochs - 1:
train_loss, train_acc, train_ppl = evaluate(X_train, y_train)
val_loss, val_acc, val_ppl = evaluate(X_val, y_val)
print(
f"Epoch {epoch:>3} | "
f"train_loss={train_loss:.4f} train_acc={train_acc:.2%} train_ppl={train_ppl:.2f} | "
f"val_loss={val_loss:.4f} val_acc={val_acc:.2%} val_ppl={val_ppl:.2f}"
)
# 6) Inspect the distribution over next tokens.
def top_predictions(context_words, k=5):
context_ids = np.array([[encode(word) for word in context_words]], dtype=np.int64)
probs, _ = forward(context_ids)
best = np.argsort(probs[0])[::-1][:k]
return [(id2word[idx], float(probs[0, idx])) for idx in best]
print("\nTop next-token predictions:")
for context in [
["<bos>", "<bos>", "el"],
["la", "cientifica", "analiza"],
["el", "perro", "corre"],
]:
print(f"Context {context} -> {top_predictions(context, k=5)}")
# 7) Autoregressive generation with temperature and top-k sampling.
def sample_next(context_words, temperature=0.8, top_k=5):
context_ids = np.array([[encode(word) for word in context_words]], dtype=np.int64)
probs, _ = forward(context_ids)
logits = np.log(probs[0] + 1e-12) / max(temperature, 1e-6)
logits = logits - np.max(logits)
filtered = np.argsort(logits)[::-1][:top_k]
filtered_logits = logits[filtered]
filtered_probs = np.exp(filtered_logits)
filtered_probs /= filtered_probs.sum()
next_id = int(np.random.choice(filtered, p=filtered_probs))
return id2word[next_id]
def generate(seed=("<bos>", "<bos>", "<bos>"), max_len=10, temperature=0.8, top_k=5):
context = list(seed)
generated = []
for _ in range(max_len):
next_word = sample_next(context, temperature=temperature, top_k=top_k)
if next_word == "<eos>":
break
generated.append(next_word)
context = context[1:] + [next_word]
return " ".join(generated)
print("\nGenerated samples:")
for seed in [
("<bos>", "<bos>", "el"),
("<bos>", "<bos>", "la"),
("la", "cientifica", "analiza"),
]:
for _ in range(2):
print(f"seed={seed} -> {generate(seed=seed, temperature=0.9, top_k=4)}")
print()
Vocabulary size: 87
Training examples: 117 | Validation examples: 30
Epoch 0 | train_loss=4.4202 train_acc=14.53% train_ppl=83.12 | val_loss=4.4221 val_acc=23.33% val_ppl=83.27
Epoch 100 | train_loss=1.5558 train_acc=51.28% train_ppl=4.74 | val_loss=4.9246 val_acc=20.00% val_ppl=137.64
Epoch 200 | train_loss=0.7792 train_acc=74.36% train_ppl=2.18 | val_loss=6.6347 val_acc=23.33% val_ppl=761.09
Epoch 300 | train_loss=0.5004 train_acc=82.05% train_ppl=1.65 | val_loss=7.4939 val_acc=10.00% val_ppl=1797.04
Epoch 400 | train_loss=0.4281 train_acc=82.91% train_ppl=1.53 | val_loss=7.8468 val_acc=10.00% val_ppl=2557.45
Epoch 499 | train_loss=0.4085 train_acc=82.91% train_ppl=1.50 | val_loss=8.2307 val_acc=10.00% val_ppl=3754.46
Top next-token predictions:
Context ['<bos>', '<bos>', 'el'] -> [('perro', 0.37899802163743385), ('gato', 0.1441843925076278), ('nino', 0.11311341056122948), ('coche', 0.09503225811707484), ('asistente', 0.08856609115669524)]
Context ['la', 'cientifica', 'analiza'] -> [('los', 0.9702507142023091), ('un', 0.009214923142383103), ('<eos>', 0.005744469491607177), ('parque', 0.00375609648102881), ('sobre', 0.003254359781328852)]
Context ['el', 'perro', 'corre'] -> [('por', 0.9633687517233784), ('sobre', 0.01362457898019891), ('modelos', 0.009082521432186824), ('del', 0.0032136696933594677), ('casa', 0.002558788846126655)]
Generated samples:
seed=('<bos>', '<bos>', 'el') -> perro ladra sobre el mar tranquilo
seed=('<bos>', '<bos>', 'el') -> coche gira en la esquina estrecha
seed=('<bos>', '<bos>', 'la') -> musica lee un articulo cientifico cientifico de lenguaje
seed=('<bos>', '<bos>', 'la') -> biblioteca con un articulo cientifico en la silla con los
seed=('la', 'cientifica', 'analiza') -> un libro interesante
seed=('la', 'cientifica', 'analiza') -> los datos del oceano
What LLMs are good at
Good use cases usually have one or more of these characteristics:
Strong fit |
Why it works well |
|---|---|
Summarization |
The task is mostly compression and re-expression |
Classification |
The model can map text into predefined labels |
Information extraction |
The output can be constrained into fields or tables |
Drafting |
The model can propose a first version quickly |
Question answering with context |
The answer can be grounded in supplied text |
Brainstorming |
The value comes from many candidate ideas |
As a rule, LLMs work best when we define the goal, the context, and the format clearly.
Typical examples of strong classroom or professional use cases include:
turning a long paper abstract into five key points
extracting author, year, method, and limitation from a set of papers
generating a first draft of an email, abstract, or meeting summary
converting unstructured notes into a Markdown table
suggesting code comments or documentation from existing code
translating technical language into a version for a broader audience
A useful heuristic is this: if a human could solve the task mostly by reading, rewriting, classifying, or structuring text, an LLM may be a strong accelerator.
What LLMs are not good at
LLMs are weaker when a task requires guaranteed truth, precise arithmetic, or hidden domain knowledge that is not present in the prompt.
Weak fit |
Main risk |
|---|---|
Up-to-date factual search |
The model may be outdated |
High-stakes decisions |
Errors can be expensive or dangerous |
Exact citations |
The model may invent references |
Long multi-step planning without checks |
The model may drift or skip constraints |
Sensitive data handling |
Privacy and compliance risks |
Numeric reliability |
Small arithmetic mistakes are common |
A useful slogan is: use LLMs to accelerate thinking, not to replace verification.
Examples of weak or risky use cases include:
asking for the latest regulations without checking official sources
using the model as the final judge in legal or medical decisions
trusting generated references without opening the original papers
asking for exact calculations in finance or engineering without verification
sending private institutional data to a public hosted chatbot
In these settings, an LLM can still be useful as an assistant, but only if a human or an external system validates the result.
A quick test for use-case quality
Before using an LLM for a task, ask three questions:
Is the task mainly about language, structure, or retrieval?
Can I define what a good answer looks like?
Do I have a way to verify the result?
If the answer to all three is yes, the task is often a good candidate for LLM assistance.
Key concepts and risks
Hallucinations
A hallucination happens when the model generates fluent but false or unsupported content.
Ways to detect and reduce hallucinations:
ask for answers grounded in a provided passage
request quoted evidence or source spans
compare the answer with the original document
break the task into extraction first, synthesis second
verify critical facts with a trusted external source
Common warning signs include overly confident claims, invented citations, missing uncertainty, and answers that sound plausible but do not match the provided material.
This is why evidence-aware prompting and source checking are core habits, not optional extras.
Bias
LLMs can reproduce stereotypes or skewed assumptions from training data. Bias can appear in classification, ranking, tone, and recommendations.
Examples include unequal treatment of demographic groups, culturally narrow assumptions, or systematically different tone when describing similar people or regions.
Overreliance
Fast answers can create false confidence. Users may stop checking evidence, especially when the writing sounds expert.
Data privacy
Never assume a public model or web tool is safe for confidential data. Before sharing private data, check the deployment mode, retention policy, and institutional rules.
In professional settings, privacy review is not optional. Data governance, retention, and access control matter as much as model quality.
Useful tools
Three tools that often appear in research and knowledge workflows are:
Perplexity AI: useful for web-grounded question answering and fast exploration.
NotebookLM: useful for asking questions over a user-provided collection of documents.
Scite.ai: useful for literature search and for checking whether papers are supported or contradicted by later citations.
These tools can save time, but they still need human review. Always inspect the source material before trusting a claim.
A practical way to think about them is:
Tool |
Best for |
Main caution |
|---|---|---|
Perplexity AI |
broad exploration, quick web search, finding recent sources |
may still summarize weak or irrelevant sources |
NotebookLM |
asking questions over your own uploaded reports, slides, or papers |
quality depends on the documents you provide |
Scite.ai |
literature discovery and citation context |
citation counts and citation labels still need interpretation |
These are not replacements for reading. They are better understood as navigation tools that help you find, compare, and prioritize material faster.
Models and Hugging Face
Models differ in size, architecture, context length, license, training data, and instruction tuning.
Hugging Face URL is a major platform for discovering and sharing models, datasets, and tokenizers. In practice, it is useful because it provides:
model cards describing strengths, limits, and licenses
ready-to-use checkpoints for text generation and embeddings
dataset hosting and evaluation resources
an ecosystem for running models locally or in the cloud
When choosing a model, look at more than raw size. Also consider latency, memory footprint, multilingual quality, domain fit, and whether the task needs generation, embeddings, or classification.
A simple distinction that helps beginners is:
hosted API models: easy to use, usually strong performance, but involve external infrastructure and cost
open models: flexible and transparent, often available through Hugging Face, and useful for local experimentation
local models: attractive for privacy or offline use, but more constrained by hardware and often less capable than large hosted systems
For many teaching scenarios, Hugging Face is useful not only to download models, but also to compare model cards and discuss licensing, intended use, and known limitations.
Choosing a model in practice
A practical model-selection checklist includes:
task type: generation, extraction, classification, embeddings, or search
language coverage: English-only or multilingual
deployment: cloud API, local machine, or institutional infrastructure
latency needs: interactive chat or offline batch processing
privacy constraints: public, internal, or confidential data
budget: both inference cost and engineering effort
In many real projects, the best model is not the biggest one. It is the one that is good enough, affordable, controllable, and compatible with your data constraints.
Ollama, llama.cpp, and vLLM to run LLMs locally
There are three styles: Ollama an easy desktop-style workflow, llama.cpp a lightweight low-level runtime or vLLM a high-throughput inference server.
Ollama URL is usually the easiest entry point for local experimentation. It focuses on usability: download a model, run it with a simple command, and interact with it through a local API or chat-like workflow. It is a strong choice for:
classroom demos
fast prototyping on a laptop or workstation
trying different open models without much setup
building simple local applications that call a model through an HTTP API
Its main advantage is convenience. Its main limitation is that it is not primarily designed for maximum serving efficiency at scale.
llama.cpp URL is a lightweight C/C++ inference runtime that became especially popular for running quantized models on local hardware, including CPUs and modest devices. It is often used when people want direct control, portability, and efficient inference without a heavy serving stack. It is a strong choice for:
CPU inference
low-resource machines
edge or embedded experimentation
understanding the mechanics of local inference more directly
Its main advantage is portability and efficiency with quantized models. Its main limitation is that the workflow can feel lower-level and less polished than Ollama.
vLLM URL is a more production-oriented inference engine. It is designed for efficient serving, especially when many requests must be handled at the same time. It is known for high throughput and better GPU utilization, so it is often used when a team wants to expose a model behind an API for multiple users or applications. It is a strong choice for:
research labs serving one or more models internally
backend APIs for chat or batch generation
deployments where latency and throughput matter
larger GPU-based systems
Its main advantage is serving performance. Its main limitation is that it is more infrastructure-oriented than beginner-oriented.
A practical comparison is:
Tool |
Best mental model |
Best for |
Main trade-off |
|---|---|---|---|
Ollama |
easy local model manager |
quick demos, teaching, personal workflows |
less focused on large-scale serving |
vLLM |
efficient inference server |
multi-user APIs, GPU serving, production-like setups |
more complex operationally |
llama.cpp |
lightweight local runtime |
CPU inference, quantized models, portable setups |
lower-level workflow |
These tools are complementary, not mutually exclusive.
Performance and hardware
Running an LLM has a cost in memory and compute.
Matches LLM models to your hardware
GPU: usually gives much lower latency than CPU for generation.
VRAM / memory usage: larger models and longer contexts require more memory.
Precision: lower precision such as 8-bit or 4-bit quantization reduces memory usage.
Context length: longer prompts increase compute and latency.
Tokens per second: a practical measure of interactive speed.
A rough rule is that parameter count strongly affects memory needs, but deployment details also matter.
Practical examples:
a small local model may be good enough for classification, extraction, or drafting
a larger model may be worth the cost for harder reasoning or multilingual tasks
very long prompts can make even a good model slow and expensive
quantization Reference can make a model feasible on smaller hardware, but sometimes with quality trade-offs
[5]:
def estimate_weight_memory(parameters_billion, bits_per_parameter=16):
parameters = parameters_billion * 1_000_000_000
total_bits = parameters * bits_per_parameter
total_bytes = total_bits / 8
gib = total_bytes / (1024 ** 3)
return round(gib, 2)
for params in [3, 7, 13, 70, 3000]: #GPT5.5 ~ 3000
print(f"{params}B model at 16-bit weights: ~{estimate_weight_memory(params, 16)} GiB")
print(f"{params}B model at 4-bit weights: ~{estimate_weight_memory(params, 4)} GiB")
print('-' * 50)
3B model at 16-bit weights: ~5.59 GiB
3B model at 4-bit weights: ~1.4 GiB
--------------------------------------------------
7B model at 16-bit weights: ~13.04 GiB
7B model at 4-bit weights: ~3.26 GiB
--------------------------------------------------
13B model at 16-bit weights: ~24.21 GiB
13B model at 4-bit weights: ~6.05 GiB
--------------------------------------------------
70B model at 16-bit weights: ~130.39 GiB
70B model at 4-bit weights: ~32.6 GiB
--------------------------------------------------
3000B model at 16-bit weights: ~5587.94 GiB
3000B model at 4-bit weights: ~1396.98 GiB
--------------------------------------------------
Benchmarks
Benchmarks try to measure capability in a controlled way. Common examples include:
MMLU for broad academic knowledge
GSM8K for grade-school math word problems
HumanEval for coding tasks
MT-Bench for multi-turn chat quality
Benchmarks are useful, but they do not fully predict real-world usefulness. A model with higher benchmark scores may still be slower, more expensive, harder to control, or worse for a domain-specific task.
Always test models on your own data and your own workflow.
A good reading habit is to ask:
what exactly is being measured?
does the benchmark match my real task?
was the model optimized specifically for this test?
what trade-offs are hidden, such as latency or cost?
Benchmark results are signals, not final answers.
Activity
For each scenario below, decide whether an LLM is a strong fit, a weak fit, or only safe with verification:
Extract sampling dates and locations from PDF cruise reports.
Provide final medical advice to a patient.
Draft three versions of a project abstract.
Produce exact legal citations checking the source documents.
Summarize ten papers and compare their methods in a table.
NetCDF Scientific data assistant to …
Turn the analysis into a report for a paper/project
Detect potential issues: units, calendar, incorrect variables, inverted coordinates.
Create a dataset quality checklist.
Discuss your choices!
Extension questions:
Which of these tasks would benefit from grounding in supplied documents?
Which ones require a human expert to approve the final answer?
Which ones are mainly about speed and productivity, and which ones are high-risk?