Evaluation and Limitations of LLMs

This notebook introduces a simple way to evaluate LLM responses critically before using them in research tasks.

Suggested duration: 1 hour

Learning goals

By the end of this notebook, you should be able to:

evaluate an LLM response using a small practical rubric
distinguish factual correctness from confidence and fluency
detect subtle errors in calculations and explanations
compare prompts in terms of usefulness for a task
apply a short checklist before trusting model outputs in research

Table of Contents

Why evaluation matters
A simple evaluation framework
Exercise 1: Compare two responses
Exercise 2: Detect subtle errors
Exercise 3: Evaluate prompts
Confidence is not correctness
Best-practice checklist

Why Evaluation Matters

LLMs can produce answers that are fluent, persuasive, and well-structured even when they are incomplete or wrong. This is especially risky in research settings, where a small error in a definition, number, interpretation, or summary can affect later conclusions.

A useful habit is to treat the model as a helpful drafting and exploration tool, but not as an authority. Before reusing an output, we should evaluate whether it is correct, coherent, and useful for the goal we actually have.

A Simple Evaluation Framework

A compact way to evaluate an answer is to ask three questions:

Factuality: Is this factually correct?
Consistency: Is the reasoning coherent and internally consistent?
Usefulness: Is this helpful for my task?

Criterion	Question to ask
Factuality	Is this factually correct?
Consistency	Is the reasoning coherent?
Usefulness	Is this helpful for my task?

This framework is simple on purpose. It is easy to remember, and it works well as a first filter before using an LLM output in a report, coding workflow, or scientific interpretation.

[ ]:

# dedent is used in later cells to strip indentation from multi-line example texts
from textwrap import dedent

evaluation_questions = {
    "Factuality": "Is this factually correct?",
    "Consistency": "Is the reasoning coherent?",
    "Usefulness": "Is this helpful for my task?",
}

print("Simple evaluation rubric:\n")
for criterion, question in evaluation_questions.items():
    print(f"- {criterion}: {question}")

Exercise 1: Compare Two Responses

Prompt

Explain what a drop in sea-level pressure indicates in meteorology.

Below are two possible model answers. Read both and evaluate them using the three criteria.

Your task

evaluate each answer with factuality, consistency, and usefulness
choose the better answer
justify your decision in 2 or 3 sentences

[ ]:

prompt_1 = "Explain what a drop in sea-level pressure indicates in meteorology."

response_a = dedent(
    """
    A drop in sea-level pressure usually indicates that the atmosphere is becoming less stable.
    This is often associated with rising air, cloud formation, and a higher chance of unsettled weather
    such as wind, rain, or storms, depending on the broader situation.
    """
).strip()

response_b = dedent(
    """
    A drop in sea-level pressure means the air is getting heavier and more compressed,
    which usually causes calm skies and guarantees that temperatures will increase.
    It is basically the same thing as high pressure and often means the weather will remain unchanged.
    """
).strip()

print("Prompt:\n")
print(prompt_1)

print("\nResponse A:\n")
print(response_a)

print("\nResponse B:\n")
print(response_b)

A possible analysis

Factuality: Response A is much stronger. Response B contains incorrect statements, such as equating a pressure drop with high pressure.
Consistency: Response A is internally coherent. Response B contradicts itself by mixing a pressure drop with calm, stable, high-pressure conditions.
Usefulness: Response A is clearly more useful for a meteorology context because it gives a realistic interpretation.

This example shows that not all errors are dramatic. Sometimes the answer sounds smooth but still mixes incompatible concepts.
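
One way to make an analysis like this explicit is to record a score per criterion and compare totals. The sketch below is only illustrative: the 0 to 2 scale, the names `CRITERIA`, `scores`, and `total_score`, and the scores themselves are assumptions chosen to mirror the written analysis above, not part of the rubric itself.

```python
# Hypothetical 0-2 scale: 0 = fails, 1 = partly satisfies, 2 = satisfies the criterion.
CRITERIA = ("Factuality", "Consistency", "Usefulness")

# Illustrative scores that mirror the written analysis; another reader may score differently.
scores = {
    "Response A": {"Factuality": 2, "Consistency": 2, "Usefulness": 2},
    "Response B": {"Factuality": 0, "Consistency": 0, "Usefulness": 0},
}


def total_score(criterion_scores):
    """Sum the scores over the three rubric criteria."""
    return sum(criterion_scores[criterion] for criterion in CRITERIA)


totals = {name: total_score(per_criterion) for name, per_criterion in scores.items()}
best = max(totals, key=totals.get)

for name, total in totals.items():
    print(f"{name}: {total}/{2 * len(CRITERIA)}")
print("Preferred:", best)
```

A numeric total is only a summary. The short written justification remains the more informative part of the evaluation, because it records why each score was given.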

Exercise 2: Detect Subtle Errors

Prompt

Compute the average temperature of: [18.4, 18.9, 19.7, 20.1, 20.6]

Model answer

The mean temperature is 20.1°C.

Your task

detect the error
explain why the result is wrong
compute the correct mean

[ ]:

temperatures = [18.4, 18.9, 19.7, 20.1, 20.6]
model_answer = 20.1

true_mean = sum(temperatures) / len(temperatures)
difference = model_answer - true_mean

print("Temperatures:", temperatures)
print(f"Model answer: {model_answer:.1f} °C")
print(f"Correct mean: {true_mean:.2f} °C")
print(f"Error: {difference:.2f} °C")

The error is subtle because 20.1 is one of the values in the list, so it can look plausible at first glance. But the arithmetic mean is:

\[ \frac{18.4 + 18.9 + 19.7 + 20.1 + 20.6}{5} = \frac{97.7}{5} = 19.54 \]

This is a good reminder that confident numerical outputs should not be trusted without checking the calculation.
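
The same check can be wrapped in a small reusable helper. This is a minimal sketch: the function name `check_claimed_mean` and the tolerance of 0.05 °C are assumptions for illustration, and a suitable tolerance depends on the precision the task requires.

```python
import math
from statistics import fmean


def check_claimed_mean(values, claimed, abs_tol=0.05):
    """Recompute the mean and report whether the claimed value is within abs_tol of it."""
    recomputed = fmean(values)
    return math.isclose(recomputed, claimed, abs_tol=abs_tol), recomputed


temperatures = [18.4, 18.9, 19.7, 20.1, 20.6]

# Compare the model's claim and the hand-computed value against the recomputed mean.
for claimed in (20.1, 19.54):
    is_ok, recomputed = check_claimed_mean(temperatures, claimed)
    status = "accepted" if is_ok else "rejected"
    print(f"Claimed {claimed:.2f} °C -> {status} (recomputed {recomputed:.2f} °C)")
```

Recomputing the value independently, rather than asking the model to confirm its own answer, is the design choice that matters here.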

Exercise 3: Evaluate Prompts

Sometimes the problem is not only the answer. The prompt itself may be too vague.

Consider the same objective with two prompts:

Prompt 1: Summarize this paper.
Prompt 2: Summarize this paper in 3 bullet points, focusing on methodology, results, and limitations.

Your task

compare the two outputs below
decide which one is more useful
explain why prompt design changes the quality of the answer

[ ]:

paper_summary_goal = "Summarize a scientific paper for a quick research review."

prompt_bad = "Summarize this paper."
prompt_good = (
    "Summarize this paper in 3 bullet points, focusing on methodology, "
    "results, and limitations."
)

output_bad = dedent(
    """
    This paper studies environmental data and presents several interesting findings.
    The authors discuss the topic in detail and show that the results are important for future work.
    Overall, the paper contributes to the field.
    """
).strip()

output_good = dedent(
    """
    - Methodology: The paper analyzes coastal environmental observations using a structured comparative approach.
    - Results: The authors report a clear relationship between the observed variables and their study outcome.
    - Limitation: The conclusions are constrained by the dataset size and the narrow study period.
    """
).strip()

print("Goal:\n")
print(paper_summary_goal)

print("\nPrompt 1:\n")
print(prompt_bad)
print("\nOutput 1:\n")
print(output_bad)

print("\n" + "=" * 80)
print("\nPrompt 2:\n")
print(prompt_good)
print("\nOutput 2:\n")
print(output_good)

In most research contexts, the second prompt is more useful because it reduces ambiguity and asks for information that matters for evaluation: method, result, and limitation.

This does not guarantee correctness, but it often improves usefulness and makes later checking easier.
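
A structured prompt also makes part of the checking mechanical. The sketch below assumes the output uses bullets of the form `- Label: text` and only tests whether the requested aspects appear as bullet labels. The helper names `bullet_labels` and `follows_requested_structure` are hypothetical, and the check says nothing about whether the content of each bullet is correct.

```python
from textwrap import dedent

example_unstructured = dedent(
    """
    This paper studies environmental data and presents several interesting findings.
    The authors discuss the topic in detail and show that the results are important for future work.
    Overall, the paper contributes to the field.
    """
).strip()

example_structured = dedent(
    """
    - Methodology: The paper analyzes coastal environmental observations using a structured comparative approach.
    - Results: The authors report a clear relationship between the observed variables and their study outcome.
    - Limitation: The conclusions are constrained by the dataset size and the narrow study period.
    """
).strip()


def bullet_labels(output):
    """Return the lowercased labels of lines shaped like '- Label: text'."""
    labels = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("- ") and ":" in line:
            labels.append(line[2:].split(":", 1)[0].strip().lower())
    return labels


def follows_requested_structure(output, required_stems=("method", "result", "limitation")):
    """Check that there is one labeled bullet per requested aspect (format only)."""
    labels = bullet_labels(output)
    return len(labels) == len(required_stems) and all(
        any(label.startswith(stem) for label in labels) for stem in required_stems
    )


print("Unstructured output follows the format:", follows_requested_structure(example_unstructured))
print("Structured output follows the format:", follows_requested_structure(example_structured))
```

Note that the unstructured output mentions "results" in passing, which is why the sketch looks at bullet labels rather than searching for keywords anywhere in the text.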

Confidence Is Not Correctness

One of the most important ideas to remember is:

LLMs are optimized for plausibility, not truth.

A confident tone, a polished structure, or a detailed explanation does not guarantee correctness. In practice, this means that fluency alone should never be treated as evidence that an answer is right.

This is why evaluation matters: an answer can be clear, confident, and still wrong.

Best-Practice Checklist

Before using an LLM output in research, it is good practice to ask:

Have I verified critical facts?
Have I checked calculations independently?
Have I asked for sources, evidence, or intermediate steps when needed?
Have I compared multiple outputs or reformulated the prompt?
Am I using the model as a copilot rather than as an authority?

A compact version to remember is:

✔ Verify important data
✔ Do not trust calculations blindly
✔ Ask for sources or steps
✔ Compare outputs
✔ Use the model as a copilot, not an authority
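
The checklist can also be kept next to the analysis code as a simple gate. This is a minimal sketch under stated assumptions: the names `CHECKLIST` and `open_items` and the example review state are hypothetical, and marking an item as completed still depends on the reviewer actually doing the check.

```python
CHECKLIST = (
    "Verified critical facts",
    "Checked calculations independently",
    "Asked for sources, evidence, or intermediate steps when needed",
    "Compared multiple outputs or reformulated the prompt",
    "Used the model as a copilot rather than as an authority",
)


def open_items(completed):
    """Return the checklist items not yet marked as completed, in checklist order."""
    return [item for item in CHECKLIST if item not in completed]


# Hypothetical review state for one LLM output.
completed = {
    "Verified critical facts",
    "Checked calculations independently",
}

remaining = open_items(completed)
if remaining:
    print("Not ready to reuse yet. Open items:")
    for item in remaining:
        print(f"- {item}")
else:
    print("All checklist items completed.")
```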