{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Evaluation and Limitations of LLMs\n", "\n", "This notebook introduces a simple way to evaluate LLM responses critically before using them in research tasks.\n", "\n", "**Suggested duration:** 1 hour" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "**Learning goals**\n", "\n", "By the end of this notebook, you should be able to:\n", "\n", "- evaluate an LLM response using a small practical rubric\n", "- distinguish factual correctness from confidence and fluency\n", "- detect subtle errors in calculations and explanations\n", "- compare prompts in terms of usefulness for a task\n", "- apply a short checklist before trusting model outputs in research" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "**Table of Contents**\n", "\n", "1. [Why evaluation matters](#why-evaluation-matters)\n", "2. [A simple evaluation framework](#simple-evaluation-framework)\n", "3. [Exercise 1: Compare two responses](#exercise-1)\n", "4. [Exercise 2: Detect subtle errors](#exercise-2)\n", "5. [Exercise 3: Evaluate prompts](#exercise-3)\n", "6. [Confidence is not correctness](#confidence-is-not-correctness)\n", "7. [Best-practice checklist](#best-practice-checklist)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why Evaluation Matters <a id=\"why-evaluation-matters\"></a>\n", "\n", "LLMs can produce answers that are fluent, persuasive, and well-structured even when they are incomplete or wrong. This is especially risky in research settings, where a small error in a definition, number, interpretation, or summary can affect later conclusions.\n", "\n", "A useful habit is to treat the model as a helpful drafting and exploration tool, but not as an authority. Before reusing an output, we should evaluate whether it is correct, coherent, and useful for the goal we actually have." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Simple Evaluation Framework <a id=\"simple-evaluation-framework\"></a>\n", "\n", "A compact way to evaluate an answer is to ask three questions:\n", "\n", "- **Factuality**: Is this factually correct?\n", "- **Consistency**: Is the reasoning coherent and internally consistent?\n", "- **Usefulness**: Is this helpful for my task?\n", "\n", "| Criterion   | Question to ask              |\n", "|-------------|------------------------------|\n", "| Factuality  | Is this factually correct?   |\n", "| Consistency | Is the reasoning coherent?   |\n", "| Usefulness  | Is this helpful for my task? |\n", "\n", "This framework is simple on purpose. It is easy to remember, and it works well as a first filter before using an LLM output in a report, coding workflow, or scientific interpretation." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from textwrap import dedent\n", "\n", "# The three rubric criteria and the question each one asks\n", "evaluation_questions = {\n", "    \"Factuality\": \"Is this factually correct?\",\n", "    \"Consistency\": \"Is the reasoning coherent?\",\n", "    \"Usefulness\": \"Is this helpful for my task?\",\n", "}\n", "\n", "print(\"Simple evaluation rubric:\\n\")\n", "for criterion, question in evaluation_questions.items():\n", "    print(f\"- {criterion}: {question}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1: Compare Two Responses <a id=\"exercise-1\"></a>\n", "\n", "**Prompt**\n", "\n", "> Explain what a drop in sea-level pressure indicates in meteorology.\n", "\n", "Below are two possible model answers. Read both and evaluate them using the three criteria.\n", "\n", "**Your task**\n", "\n", "- evaluate each answer for factuality, consistency, and usefulness\n", "- choose the better answer\n", "- justify your decision in 2 or 3 sentences" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prompt_1 = \"Explain what a drop in sea-level pressure indicates in meteorology.\"\n", "\n", "response_a = dedent(\n", "    \"\"\"\n", "    A drop in sea-level pressure usually indicates that the atmosphere is becoming less stable.\n", "    This is often associated with rising air, cloud formation, and a higher chance of unsettled weather\n", "    such as wind, rain, or storms, depending on the broader situation.\n", "    \"\"\"\n", ").strip()\n", "\n", "response_b = dedent(\n", "    \"\"\"\n", "    A drop in sea-level pressure means the air is getting heavier and more compressed,\n", "    which usually causes calm skies and guarantees that temperatures will increase.\n", "    It is basically the same thing as high pressure and often means the weather will remain unchanged.\n", "    \"\"\"\n", ").strip()\n", "\n", "print(\"Prompt:\\n\")\n", "print(prompt_1)\n", "\n", "print(\"\\nResponse A:\\n\")\n", "print(response_a)\n", "\n", 
"print(\"\\nResponse B:\\n\")\n", "print(response_b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A possible analysis\n", "\n", "- **Factuality**: Response A is much stronger. Response B contains incorrect statements: falling pressure does not mean the air is getting heavier, and a pressure drop is not the same thing as high pressure.\n", "- **Consistency**: Response A is internally coherent. Response B contradicts itself by mixing a pressure drop with calm, stable, high-pressure conditions.\n", "- **Usefulness**: Response A is clearly more useful for a meteorology context because it gives a realistic interpretation.\n", "\n", "This example shows that not all errors are dramatic. Sometimes the answer sounds smooth but still mixes incompatible concepts." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2: Detect Subtle Errors <a id=\"exercise-2\"></a>\n", "\n", "**Prompt**\n", "\n", "> Compute the average temperature of: [18.4, 18.9, 19.7, 20.1, 20.6]\n", "\n", "**Model answer**\n", "\n", "> The mean temperature is 20.1°C.\n", "\n", "**Your task**\n", "\n", "- detect the error\n", "- explain why the result is wrong\n", "- compute the correct mean" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "temperatures = [18.4, 18.9, 19.7, 20.1, 20.6]\n", "model_answer = 20.1\n", "\n", "# Recompute the mean independently instead of trusting the model's number\n", "true_mean = sum(temperatures) / len(temperatures)\n", "difference = model_answer - true_mean\n", "\n", "print(\"Temperatures:\", temperatures)\n", "print(f\"Model answer: {model_answer:.1f} °C\")\n", "print(f\"Correct mean: {true_mean:.2f} °C\")\n", "print(f\"Error: {difference:.2f} °C\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The error is subtle because `20.1` is one of the values in the list, so it can look plausible at first glance. But the arithmetic mean is:\n", "\n", "$$\n", "\\frac{18.4 + 18.9 + 19.7 + 20.1 + 20.6}{5} = \\frac{97.7}{5} = 19.54\n", "$$\n", "\n", "The model answer therefore overestimates the mean by 0.56 °C. This is a good reminder that confident numerical outputs should not be trusted without checking the calculation." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 3: Evaluate Prompts <a id=\"exercise-3\"></a>\n", "\n", "Sometimes the problem lies not only in the answer: the prompt itself may be too vague.\n", "\n", "Consider the same objective with two prompts:\n", "\n", "- **Prompt 1**: `Summarize this paper.`\n", "- **Prompt 2**: `Summarize this paper in 3 bullet points, focusing on methodology, results, and limitations.`\n", "\n", "**Your task**\n", "\n", "- compare the two outputs below\n", "- decide which one is more useful\n", "- explain why prompt design changes the quality of the answer" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "paper_summary_goal = \"Summarize a scientific paper for a quick research review.\"\n", "\n", "prompt_bad = \"Summarize this paper.\"\n", "prompt_good = (\n", "    \"Summarize this paper in 3 bullet points, focusing on methodology, \"\n", "    \"results, and limitations.\"\n", ")\n", "\n", "output_bad = dedent(\n", "    \"\"\"\n", "    This paper studies environmental data and presents several interesting findings.\n", "    The authors discuss the topic in detail and show that the results are important for future work.\n", "    Overall, the paper contributes to the field.\n", "    \"\"\"\n", ").strip()\n", "\n", "output_good = dedent(\n", "    \"\"\"\n", "    - Methodology: The paper analyzes coastal environmental observations using a structured comparative approach.\n", "    - Results: The authors report a clear relationship between the observed variables and their study outcome.\n", "    - Limitation: The conclusions are constrained by the dataset size and the narrow study period.\n", "    \"\"\"\n", ").strip()\n", "\n", "print(\"Goal:\\n\")\n", "print(paper_summary_goal)\n", "\n", "print(\"\\nPrompt 1:\\n\")\n", "print(prompt_bad)\n", "print(\"\\nOutput 1:\\n\")\n", "print(output_bad)\n", "\n", "print(\"\\n\" + \"=\" * 80)\n", "print(\"\\nPrompt 2:\\n\")\n", "print(prompt_good)\n", "print(\"\\nOutput 2:\\n\")\n", 
"print(output_good)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In most research contexts, the second prompt is more useful because it reduces ambiguity and asks for the information that matters for evaluation: methodology, results, and limitations.\n", "\n", "This does not guarantee correctness, but it often improves usefulness and makes later checking easier." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Confidence Is Not Correctness <a id=\"confidence-is-not-correctness\"></a>\n", "\n", "One of the most important ideas to remember is:\n", "\n", "> **LLMs are optimized for plausibility, not truth.**\n", "\n", "A confident tone, a polished structure, or a detailed explanation does not guarantee correctness. In practice, fluency alone should never be treated as evidence that an answer is right.\n", "\n", "This is why evaluation matters: an answer can be clear, confident, and still wrong." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Best-Practice Checklist <a id=\"best-practice-checklist\"></a>\n", "\n", "Before using an LLM output in research, it is good practice to ask:\n", "\n", "- Have I verified critical facts?\n", "- Have I checked calculations independently?\n", "- Have I asked for sources, evidence, or intermediate steps when needed?\n", "- Have I compared multiple outputs or reformulated the prompt?\n", "- Am I using the model as a copilot rather than as an authority?\n", "\n", "A compact version to remember is:\n", "\n", "- ✔ Verify important data\n", "- ✔ Do not trust calculations blindly\n", "- ✔ Ask for sources or steps\n", "- ✔ Compare outputs\n", "- ✔ Use the model as a copilot, not an authority" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.13" } }, "nbformat": 4, "nbformat_minor": 5 }