
Evaluation and Limitations of LLMs

This notebook introduces a simple way to evaluate LLM responses critically before using them in research tasks.

Suggested duration: 1 hour

Learning goals

By the end of this notebook, you should be able to:

  • evaluate an LLM response using a small practical rubric

  • distinguish factual correctness from confidence and fluency

  • detect subtle errors in calculations and explanations

  • compare prompts in terms of usefulness for a task

  • apply a short checklist before trusting model outputs in research

Table of Contents

  1. Why evaluation matters

  2. A simple evaluation framework

  3. Exercise 1: Compare two responses

  4. Exercise 2: Detect subtle errors

  5. Exercise 3: Evaluate prompts

  6. Confidence is not correctness

  7. Best-practice checklist

Why Evaluation Matters

LLMs can produce answers that are fluent, persuasive, and well-structured even when they are incomplete or wrong. This is especially risky in research settings, where a small error in a definition, number, interpretation, or summary can affect later conclusions.

A useful habit is to treat the model as a helpful drafting and exploration tool, but not as an authority. Before reusing an output, we should evaluate whether it is correct, coherent, and useful for the goal we actually have.

A Simple Evaluation Framework

A compact way to evaluate an answer is to ask three questions:

  • Factuality: Is this factually correct?

  • Consistency: Is the reasoning coherent and internally consistent?

  • Usefulness: Is this helpful for my task?

Criterion    | Question to ask
-------------|------------------------------
Factuality   | Is this factually correct?
Consistency  | Is the reasoning coherent?
Usefulness   | Is this helpful for my task?

This framework is simple on purpose. It is easy to remember, and it works well as a first filter before using an LLM output in a report, coding workflow, or scientific interpretation.

[ ]:
from textwrap import dedent

evaluation_questions = {
    "Factuality": "Is this factually correct?",
    "Consistency": "Is the reasoning coherent?",
    "Usefulness": "Is this helpful for my task?",
}

print("Simple evaluation rubric:\n")
for criterion, question in evaluation_questions.items():
    print(f"- {criterion}: {question}")

Exercise 1: Compare Two Responses

Prompt

Explain what a drop in sea-level pressure indicates in meteorology.

Below are two possible model answers. Read both and evaluate them using the three criteria.

Your task

  • evaluate each answer with factuality, consistency, and usefulness

  • choose the better answer

  • justify your decision in 2 or 3 sentences

[ ]:
prompt_1 = "Explain what a drop in sea-level pressure indicates in meteorology."

response_a = dedent(
    """
    A drop in sea-level pressure usually indicates that a low-pressure system is approaching or deepening.
    This is often associated with rising air, cloud formation, and a higher chance of unsettled weather
    such as wind, rain, or storms, depending on the broader situation.
    """
).strip()

response_b = dedent(
    """
    A drop in sea-level pressure means the air is getting heavier and more compressed,
    which usually causes calm skies and guarantees that temperatures will increase.
    It is basically the same thing as high pressure and often means the weather will remain unchanged.
    """
).strip()

print("Prompt:\n")
print(prompt_1)

print("\nResponse A:\n")
print(response_a)

print("\nResponse B:\n")
print(response_b)

A possible analysis

  • Factuality: Response A is much stronger. Response B contains incorrect statements, such as equating a pressure drop with high pressure.

  • Consistency: Response A is internally coherent. Response B contradicts itself by mixing a pressure drop with calm, stable, high-pressure conditions.

  • Usefulness: Response A is clearly more useful for a meteorology context because it gives a realistic interpretation.

This example shows that not all errors are dramatic. Sometimes the answer sounds smooth but still mixes incompatible concepts.
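
One way to make the comparison explicit is to record a score per criterion. The sketch below is only an illustration and not part of the original exercise: the 0 to 2 scale, the `RubricScore` class, and the scores themselves are assumptions chosen to mirror the analysis above.

```python
from dataclasses import dataclass


@dataclass
class RubricScore:
    """Hypothetical container for one response's scores (0 = poor, 2 = good)."""

    factuality: int
    consistency: int
    usefulness: int

    def total(self) -> int:
        # Unweighted sum; a real task might weight factuality more heavily.
        return self.factuality + self.consistency + self.usefulness


# Illustrative scores that mirror the analysis above.
# They are human judgments, not measurements.
scores = {
    "Response A": RubricScore(factuality=2, consistency=2, usefulness=2),
    "Response B": RubricScore(factuality=0, consistency=0, usefulness=0),
}

best = max(scores, key=lambda name: scores[name].total())

for name, score in scores.items():
    print(f"{name}: total {score.total()} / 6")
print(f"Preferred answer: {best}")
```

Writing the scores down does not make the judgment objective, but it forces a separate decision for each criterion instead of a single overall impression.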

Exercise 2: Detect Subtle Errors

Prompt

Compute the average temperature of: [18.4, 18.9, 19.7, 20.1, 20.6]

Model answer

The mean temperature is 20.1°C.

Your task

  • detect the error

  • explain why the result is wrong

  • compute the correct mean

[ ]:
temperatures = [18.4, 18.9, 19.7, 20.1, 20.6]
model_answer = 20.1

true_mean = sum(temperatures) / len(temperatures)
difference = model_answer - true_mean

print("Temperatures:", temperatures)
print(f"Model answer: {model_answer:.1f} °C")
print(f"Correct mean: {true_mean:.2f} °C")
print(f"Error: {difference:.2f} °C")

The error is subtle because 20.1 is one of the values in the list, so it can look plausible at first glance. But the arithmetic mean is:

\[ \frac{18.4 + 18.9 + 19.7 + 20.1 + 20.6}{5} = \frac{97.7}{5} = 19.54 \]

This is a good reminder that confident numerical outputs should not be trusted without checking the calculation.
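
The same check can be wrapped in a small reusable helper. This is a minimal sketch: the function name `check_claimed_mean` and the default tolerance of 0.05 are assumptions made for illustration, and the tolerance should match the precision your task actually needs.

```python
import math
from statistics import fmean


def check_claimed_mean(values, claimed, tolerance=0.05):
    """Recompute the mean and compare it with a claimed value.

    Returns (is_ok, recomputed_mean). `tolerance` is an assumed
    absolute tolerance, not a recommended standard.
    """
    recomputed = fmean(values)
    is_ok = math.isclose(recomputed, claimed, abs_tol=tolerance)
    return is_ok, recomputed


temperatures = [18.4, 18.9, 19.7, 20.1, 20.6]

# The model's answer from the exercise is rejected.
ok_model, recomputed = check_claimed_mean(temperatures, claimed=20.1)
print(f"Claimed 20.1 °C accepted: {ok_model} (recomputed {recomputed:.2f} °C)")

# A rounded but correct answer passes within the tolerance.
ok_rounded, recomputed = check_claimed_mean(temperatures, claimed=19.5)
print(f"Claimed 19.5 °C accepted: {ok_rounded} (recomputed {recomputed:.2f} °C)")
```

The point of the helper is that the recomputation happens outside the model: the claimed number is treated as a hypothesis to test, not as a result.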

Exercise 3: Evaluate Prompts

Sometimes the problem is not only the answer. The prompt itself may be too vague.

Consider the same objective with two prompts:

  • Prompt 1: Summarize this paper.

  • Prompt 2: Summarize this paper in 3 bullet points, focusing on methodology, results, and limitations.

Your task

  • compare the two outputs below

  • decide which one is more useful

  • explain why prompt design changes the quality of the answer

[ ]:
paper_summary_goal = "Summarize a scientific paper for a quick research review."

prompt_bad = "Summarize this paper."
prompt_good = (
    "Summarize this paper in 3 bullet points, focusing on methodology, "
    "results, and limitations."
)

output_bad = dedent(
    """
    This paper studies environmental data and presents several interesting findings.
    The authors discuss the topic in detail and show that the results are important for future work.
    Overall, the paper contributes to the field.
    """
).strip()

output_good = dedent(
    """
    - Methodology: The paper analyzes coastal environmental observations using a structured comparative approach.
    - Results: The authors report a clear relationship between the observed variables and their study outcome.
    - Limitation: The conclusions are constrained by the dataset size and the narrow study period.
    """
).strip()

print("Goal:\n")
print(paper_summary_goal)

print("\nPrompt 1:\n")
print(prompt_bad)
print("\nOutput 1:\n")
print(output_bad)

print("\n" + "=" * 80)
print("\nPrompt 2:\n")
print(prompt_good)
print("\nOutput 2:\n")
print(output_good)

In most research contexts, the second prompt is more useful because it reduces ambiguity and asks for information that matters for evaluation: method, result, and limitation.

This does not guarantee correctness, but it often improves usefulness and makes later checking easier.
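
A structured prompt also makes part of the checking mechanical. The sketch below assumes the summary uses labelled bullets such as "- Results:"; the helper `missing_sections` and the label list are hypothetical and written only for this example. It checks that each requested section is present, not that its content is correct.

```python
import re


def missing_sections(summary, required):
    """Return the required labels that have no labelled bullet in the summary."""
    missing = []
    for label in required:
        # Match a bullet line that starts with the label, e.g. "- Results:".
        # The trailing \w* also accepts plural forms such as "Limitations:".
        pattern = rf"^\s*[-*•]\s*{re.escape(label)}\w*\s*:"
        if not re.search(pattern, summary, flags=re.IGNORECASE | re.MULTILINE):
            missing.append(label)
    return missing


# Hypothetical labels derived from the wording of Prompt 2.
required_labels = ["methodology", "results", "limitation"]

vague_summary = (
    "This paper studies environmental data and presents several interesting findings.\n"
    "The authors discuss the topic in detail and show that the results are important for future work.\n"
    "Overall, the paper contributes to the field."
)
structured_summary = (
    "- Methodology: The paper analyzes coastal environmental observations using a structured comparative approach.\n"
    "- Results: The authors report a clear relationship between the observed variables and their study outcome.\n"
    "- Limitation: The conclusions are constrained by the dataset size and the narrow study period."
)

missing_vague = missing_sections(vague_summary, required_labels)
missing_structured = missing_sections(structured_summary, required_labels)

print("Output 1 is missing:", missing_vague)
print("Output 2 is missing:", missing_structured)
```

Note that Output 1 mentions the word "results" in passing, yet it is still flagged because it has no labelled section. Passing this check only means the structure is there; each bullet still has to be verified against the paper.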

Confidence Is Not Correctness

One of the most important ideas to remember is:

LLMs are optimized for plausibility, not truth.

A confident tone, a polished structure, or a detailed explanation does not guarantee correctness. In practice, this means that fluency alone should never be treated as evidence that an answer is right.

This is why evaluation matters: an answer can be clear, confident, and still wrong.

Best-Practice Checklist

Before using an LLM output in research, it is good practice to ask:

  • Have I verified critical facts?

  • Have I checked calculations independently?

  • Have I asked for sources, evidence, or intermediate steps when needed?

  • Have I compared multiple outputs or reformulated the prompt?

  • Am I using the model as a copilot rather than as an authority?

A compact version to remember is:

✔ Verify important data
✔ Do not trust calculations blindly
✔ Ask for sources or steps
✔ Compare outputs
✔ Use the model as a copilot, not an authority
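
If you want the checklist next to your analysis code, it can be kept as a small data structure. This is a minimal sketch: the item wording follows the list above, while the `True`/`False` values and the `open_items` helper are placeholders invented for illustration.

```python
# Placeholder answers for one hypothetical LLM output; fill these in honestly.
checklist = {
    "Verified critical facts": True,
    "Checked calculations independently": True,
    "Asked for sources, evidence, or intermediate steps": False,
    "Compared multiple outputs or reformulated the prompt": False,
    "Used the model as a copilot, not an authority": True,
}


def open_items(checks):
    """Return the checklist items that are not yet done."""
    return [item for item, done in checks.items() if not done]


remaining = open_items(checklist)

if remaining:
    print("Not ready to reuse this output. Open items:")
    for item in remaining:
        print(f"- {item}")
else:
    print("All checklist items are done.")
```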