Evaluation and Limitations of LLMs
This notebook introduces a simple way to evaluate LLM responses critically before using them in research tasks.
Suggested duration: 1 hour
Learning goals
By the end of this notebook, you should be able to:
evaluate an LLM response using a small practical rubric
distinguish factual correctness from confidence and fluency
detect subtle errors in calculations and explanations
compare prompts in terms of usefulness for a task
apply a short checklist before trusting model outputs in research
Table of Contents
A simple evaluation framework
Exercise 1: Compare two responses
Exercise 2: Detect subtle errors
Exercise 3: Evaluate prompts
Why evaluation matters
LLMs can produce answers that are fluent, persuasive, and well-structured even when they are incomplete or wrong. This is especially risky in research settings, where a small error in a definition, number, interpretation, or summary can affect later conclusions.
A useful habit is to treat the model as a helpful drafting and exploration tool, but not as an authority. Before reusing an output, we should evaluate whether it is correct, coherent, and useful for the goal we actually have.
A Simple Evaluation Framework
A compact way to evaluate an answer is to ask three questions:
Factuality: Is this factually correct?
Consistency: Is the reasoning coherent and internally consistent?
Usefulness: Is this helpful for my task?
| Criterion | Question to ask |
|---|---|
| Factuality | Is this factually correct? |
| Consistency | Is the reasoning coherent? |
| Usefulness | Is this helpful for my task? |
This framework is simple on purpose. It is easy to remember, and it works well as a first filter before using an LLM output in a report, coding workflow, or scientific interpretation.
[ ]:
from textwrap import dedent
# The three rubric criteria, each paired with the question to ask.
evaluation_questions = {
    "Factuality": "Is this factually correct?",
    "Consistency": "Is the reasoning coherent?",
    "Usefulness": "Is this helpful for my task?",
}
print("Simple evaluation rubric:\n")
for criterion, question in evaluation_questions.items():
    print(f"- {criterion}: {question}")
Exercise 1: Compare Two Responses
Prompt
Explain what a drop in sea-level pressure indicates in meteorology.
Below are two possible model answers. Read both and evaluate them using the three criteria.
Your task
evaluate each answer with factuality, consistency, and usefulness
choose the better answer
justify your decision in 2 or 3 sentences
[ ]:
prompt_1 = "Explain what a drop in sea-level pressure indicates in meteorology."
response_a = dedent(
    """
    A drop in sea-level pressure usually indicates that the atmosphere is becoming less stable.
    This is often associated with rising air, cloud formation, and a higher chance of unsettled weather
    such as wind, rain, or storms, depending on the broader situation.
    """
).strip()
response_b = dedent(
    """
    A drop in sea-level pressure means the air is getting heavier and more compressed,
    which usually causes calm skies and guarantees that temperatures will increase.
    It is basically the same thing as high pressure and often means the weather will remain unchanged.
    """
).strip()
print("Prompt:\n")
print(prompt_1)
print("\nResponse A:\n")
print(response_a)
print("\nResponse B:\n")
print(response_b)
A possible analysis
Factuality: Response A is much stronger. Response B contains incorrect statements: it equates a pressure drop with high pressure and claims that the change "guarantees" a temperature increase.
Consistency: Response A is internally coherent. Response B contradicts itself by mixing a pressure drop with calm, stable, high-pressure conditions.
Usefulness: Response A is clearly more useful for a meteorology context because it gives a realistic interpretation.
This example shows that not all errors are dramatic. Sometimes the answer sounds smooth but still mixes incompatible concepts.
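One way to make the comparison above explicit is to record a score for each criterion. The sketch below is illustrative only: the 0 to 2 scale, the `score_response` helper, and the scores assigned to each response are assumptions made for this example, not part of the rubric itself.

```python
# Hypothetical scoring scale (an assumption): 0 = poor, 1 = mixed, 2 = good.
CRITERIA = ("Factuality", "Consistency", "Usefulness")


def score_response(scores):
    """Validate a dict of rubric scores and return their total."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"Missing criteria: {missing}")
    for criterion in CRITERIA:
        if scores[criterion] not in (0, 1, 2):
            raise ValueError(f"{criterion} must be scored 0, 1, or 2")
    return sum(scores[c] for c in CRITERIA)


# Example judgments by a reader; these numbers are illustrative, not measured.
scores_a = {"Factuality": 2, "Consistency": 2, "Usefulness": 2}
scores_b = {"Factuality": 0, "Consistency": 0, "Usefulness": 0}

total_a = score_response(scores_a)
total_b = score_response(scores_b)

if total_a > total_b:
    better = "A"
elif total_b > total_a:
    better = "B"
else:
    better = "tie"

print(f"Response A: {total_a}/6, Response B: {total_b}/6, better: {better}")
```

A numeric total is only a summary. The written justification remains the important part, because two readers may reasonably assign different scores.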
Exercise 2: Detect Subtle Errors
Prompt
Compute the average temperature of: [18.4, 18.9, 19.7, 20.1, 20.6]
Model answer
The mean temperature is 20.1°C.
Your task
detect the error
explain why the result is wrong
compute the correct mean
[ ]:
temperatures = [18.4, 18.9, 19.7, 20.1, 20.6]
model_answer = 20.1
true_mean = sum(temperatures) / len(temperatures)
difference = model_answer - true_mean
print("Temperatures:", temperatures)
print(f"Model answer: {model_answer:.1f} °C")
print(f"Correct mean: {true_mean:.2f} °C")
print(f"Error: {difference:.2f} °C")
The error is subtle because 20.1 is one of the values in the list, so it can look plausible at first glance. But the arithmetic mean is:
\[ \frac{18.4 + 18.9 + 19.7 + 20.1 + 20.6}{5} = 19.54 \]
This is a good reminder that confident numerical outputs should not be trusted without checking the calculation.
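The same check can be wrapped in a small reusable helper. This is a minimal sketch: the function name `check_claimed_mean` and the default tolerance of 0.05 (half of the last reported decimal place) are assumptions chosen for this example.

```python
import math
from statistics import fmean


def check_claimed_mean(values, claimed, abs_tol=0.05):
    """Return (is_ok, recomputed_mean) for a claimed mean of `values`.

    The claim is accepted only if it lies within `abs_tol` of the
    independently recomputed mean.
    """
    recomputed = fmean(values)
    is_ok = math.isclose(recomputed, claimed, abs_tol=abs_tol)
    return is_ok, recomputed


temperatures = [18.4, 18.9, 19.7, 20.1, 20.6]
ok, recomputed = check_claimed_mean(temperatures, claimed=20.1)
print(f"Claim accepted: {ok}; recomputed mean: {recomputed:.2f} °C")
```

Recomputing the value outside the model, rather than asking the model to double-check itself, keeps the verification independent of the output being verified.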
Exercise 3: Evaluate Prompts
Sometimes the problem is not only the answer. The prompt itself may be too vague.
Consider the same objective with two prompts:
Prompt 1:
Summarize this paper.
Prompt 2:
Summarize this paper in 3 bullet points, focusing on methodology, results, and limitations.
Your task
compare the two outputs below
decide which one is more useful
explain why prompt design changes the quality of the answer
[ ]:
paper_summary_goal = "Summarize a scientific paper for a quick research review."
prompt_bad = "Summarize this paper."
prompt_good = (
    "Summarize this paper in 3 bullet points, focusing on methodology, "
    "results, and limitations."
)
output_bad = dedent(
    """
    This paper studies environmental data and presents several interesting findings.
    The authors discuss the topic in detail and show that the results are important for future work.
    Overall, the paper contributes to the field.
    """
).strip()
output_good = dedent(
    """
    - Methodology: The paper analyzes coastal environmental observations using a structured comparative approach.
    - Results: The authors report a clear relationship between the observed variables and their study outcome.
    - Limitation: The conclusions are constrained by the dataset size and the narrow study period.
    """
).strip()
print("Goal:\n")
print(paper_summary_goal)
print("\nPrompt 1:\n")
print(prompt_bad)
print("\nOutput 1:\n")
print(output_bad)
print("\n" + "=" * 80)
print("\nPrompt 2:\n")
print(prompt_good)
print("\nOutput 2:\n")
print(output_good)
In most research contexts, the second prompt is more useful because it reduces ambiguity and asks for information that matters for evaluation: method, result, and limitation.
This does not guarantee correctness, but it often improves usefulness and makes later checking easier.
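A structured prompt like the second one can also be generated from its parts, which makes it easy to reuse the same pattern with different focus areas. The helper below is a hypothetical sketch. `build_summary_prompt` and its parameters are names invented for this example.

```python
def build_summary_prompt(focus, n_bullets=3):
    """Build a structured summary prompt from a list of focus areas."""
    if not focus:
        raise ValueError("Provide at least one focus area")
    # Join the focus areas into a readable English list.
    if len(focus) == 1:
        focus_text = focus[0]
    elif len(focus) == 2:
        focus_text = f"{focus[0]} and {focus[1]}"
    else:
        focus_text = ", ".join(focus[:-1]) + f", and {focus[-1]}"
    return (
        f"Summarize this paper in {n_bullets} bullet points, "
        f"focusing on {focus_text}."
    )


prompt = build_summary_prompt(["methodology", "results", "limitations"])
print(prompt)
```

Keeping the constraints (number of bullets, focus areas) as explicit parameters makes it clearer what was actually asked when the output is reviewed later.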
Confidence Is Not Correctness
One of the most important ideas to remember is:
LLMs are optimized for plausibility, not truth.
A confident tone, a polished structure, or a detailed explanation does not guarantee correctness. In practice, fluency alone should never be treated as evidence that an answer is correct.
This is why evaluation matters: an answer can be clear, confident, and still wrong.
Best-Practice Checklist
Before using an LLM output in research, it is good practice to ask:
Have I verified critical facts?
Have I checked calculations independently?
Have I asked for sources, evidence, or intermediate steps when needed?
Have I compared multiple outputs or reformulated the prompt?
Am I using the model as a copilot rather than as an authority?
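The questions above can be tracked as a simple list of open items during a review. This is a minimal sketch under stated assumptions: the item wording is paraphrased from the checklist, and `unmet_items` and the example `done` set are hypothetical.

```python
# Checklist items paraphrased from the questions above.
CHECKLIST = (
    "Verified critical facts",
    "Checked calculations independently",
    "Asked for sources, evidence, or intermediate steps when needed",
    "Compared multiple outputs or reformulated the prompt",
    "Used the model as a copilot rather than as an authority",
)


def unmet_items(completed):
    """Return the checklist items that are not in the `completed` set."""
    return [item for item in CHECKLIST if item not in completed]


# Hypothetical state of a review in progress.
done = {"Verified critical facts", "Checked calculations independently"}
remaining = unmet_items(done)
print(f"{len(remaining)} item(s) still open:")
for item in remaining:
    print(f"- {item}")
```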
A compact version to remember is: