{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a target=\"_blank\" href=\"https://colab.research.google.com/github/bmalcover/AppOC/blob/main/docs/notebooks/05_Text/05_Evaluation_Limitations_of_LLMs.ipynb\">\n",
        "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
        "</a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Evaluation and Limitations of LLMs\n",
        "\n",
        "This notebook introduces a simple way to evaluate LLM responses critically before using them in research tasks.\n",
        "\n",
        "**Suggested duration:** 1 hour"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-warning\">\n",
        "<b>Learning goals</b>\n",
        "\n",
        "By the end of this notebook, you should be able to:\n",
        "\n",
        "- evaluate an LLM response using a small practical rubric\n",
        "- distinguish factual correctness from confidence and fluency\n",
        "- detect subtle errors in calculations and explanations\n",
        "- compare prompts in terms of usefulness for a task\n",
        "- apply a short checklist before trusting model outputs in research\n",
        "</div>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<div class=\"alert alert-warning\">\n",
        "<b>Table of Contents</b>\n",
        "\n",
        "1. [Why evaluation matters](#why-evaluation-matters)\n",
        "2. [A simple evaluation framework](#simple-evaluation-framework)\n",
        "3. [Exercise 1: Compare two responses](#exercise-1)\n",
        "4. [Exercise 2: Detect subtle errors](#exercise-2)\n",
        "5. [Exercise 3: Evaluate prompts](#exercise-3)\n",
        "6. [Confidence is not correctness](#confidence-is-not-correctness)\n",
        "7. [Best-practice checklist](#best-practice-checklist)\n",
        "</div>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Why evaluation matters <a id=\"why-evaluation-matters\"></a>\n",
        "\n",
        "LLMs can produce answers that are fluent, persuasive, and well-structured even when they are incomplete or wrong. This is especially risky in research settings, where a small error in a definition, number, interpretation, or summary can affect later conclusions.\n",
        "\n",
        "A useful habit is to treat the model as a helpful drafting and exploration tool, but not as an authority. Before reusing an output, we should evaluate whether it is correct, coherent, and useful for the goal we actually have."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## A Simple Evaluation Framework <a id=\"simple-evaluation-framework\"></a>\n",
        "\n",
        "A compact way to evaluate an answer is to ask three questions:\n",
        "\n",
        "- **Factuality**: Is this factually correct?\n",
        "- **Consistency**: Is the reasoning coherent and internally consistent?\n",
        "- **Usefulness**: Is this helpful for my task?\n",
        "\n",
        "| Criterion    | Question to ask                    |\n",
        "|-------------|------------------------------------|\n",
        "| Factuality  | Is this factually correct?         |\n",
        "| Consistency | Is the reasoning coherent?         |\n",
        "| Usefulness  | Is this helpful for my task?       |\n",
        "\n",
        "This framework is simple on purpose. It is easy to remember, and it works well as a first filter before using an LLM output in a report, coding workflow, or scientific interpretation."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from textwrap import dedent\n",
        "\n",
        "evaluation_questions = {\n",
        "    \"Factuality\": \"Is this factually correct?\",\n",
        "    \"Consistency\": \"Is the reasoning coherent?\",\n",
        "    \"Usefulness\": \"Is this helpful for my task?\",\n",
        "}\n",
        "\n",
        "print(\"Simple evaluation rubric:\\n\")\n",
        "for criterion, question in evaluation_questions.items():\n",
        "    print(f\"- {criterion}: {question}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Exercise 1: Compare Two Responses <a id=\"exercise-1\"></a>\n",
        "\n",
        "**Prompt**\n",
        "\n",
        "> Explain what a drop in sea-level pressure indicates in meteorology.\n",
        "\n",
        "Below are two possible model answers. Read both and evaluate them using the three criteria.\n",
        "\n",
        "**Your task**\n",
        "\n",
        "- evaluate each answer with factuality, consistency, and usefulness\n",
        "- choose the better answer\n",
        "- justify your decision in 2 or 3 sentences"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "prompt_1 = \"Explain what a drop in sea-level pressure indicates in meteorology.\"\n",
        "\n",
        "response_a = dedent(\n",
        "    \"\"\"\n",
        "    A drop in sea-level pressure usually indicates that the atmosphere is becoming less stable.\n",
        "    This is often associated with rising air, cloud formation, and a higher chance of unsettled weather\n",
        "    such as wind, rain, or storms, depending on the broader situation.\n",
        "    \"\"\"\n",
        ").strip()\n",
        "\n",
        "response_b = dedent(\n",
        "    \"\"\"\n",
        "    A drop in sea-level pressure means the air is getting heavier and more compressed,\n",
        "    which usually causes calm skies and guarantees that temperatures will increase.\n",
        "    It is basically the same thing as high pressure and often means the weather will remain unchanged.\n",
        "    \"\"\"\n",
        ").strip()\n",
        "\n",
        "print(\"Prompt:\\n\")\n",
        "print(prompt_1)\n",
        "\n",
        "print(\"\\nResponse A:\\n\")\n",
        "print(response_a)\n",
        "\n",
        "print(\"\\nResponse B:\\n\")\n",
        "print(response_b)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### A possible analysis\n",
        "\n",
        "- **Factuality**: Response A is much stronger. Response B contains incorrect statements, such as equating a pressure drop with high pressure.\n",
        "- **Consistency**: Response A is internally coherent. Response B contradicts itself by mixing a pressure drop with calm, stable, high-pressure conditions.\n",
        "- **Usefulness**: Response A is clearly more useful for a meteorology context because it gives a realistic interpretation.\n",
        "\n",
        "This example shows that not all errors are dramatic. Sometimes the answer sounds smooth but still mixes incompatible concepts."
      ]
    },
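    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The comparison above can also be written down explicitly, which makes it easier to discuss disagreements between reviewers. The cell below is a minimal sketch, not part of any library: the `RubricScore` class, the 0 to 2 scale, and the scores themselves are assumptions chosen for illustration, and the scores simply encode the judgments from the analysis above."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from dataclasses import dataclass\n",
        "\n",
        "# Assumed scale: 0 = fails the criterion, 1 = partly meets it, 2 = meets it.\n",
        "CRITERIA = (\"Factuality\", \"Consistency\", \"Usefulness\")\n",
        "MAX_PER_CRITERION = 2\n",
        "\n",
        "\n",
        "@dataclass\n",
        "class RubricScore:\n",
        "    \"\"\"Illustrative container for one manually scored response.\"\"\"\n",
        "\n",
        "    label: str\n",
        "    scores: dict\n",
        "\n",
        "    def total(self) -> int:\n",
        "        # Sum the scores over the three rubric criteria.\n",
        "        return sum(self.scores[criterion] for criterion in CRITERIA)\n",
        "\n",
        "\n",
        "# Hand-assigned scores that mirror the analysis above; they are not measured data.\n",
        "score_a = RubricScore(\"Response A\", {\"Factuality\": 2, \"Consistency\": 2, \"Usefulness\": 2})\n",
        "score_b = RubricScore(\"Response B\", {\"Factuality\": 0, \"Consistency\": 0, \"Usefulness\": 0})\n",
        "\n",
        "for score in (score_a, score_b):\n",
        "    details = \", \".join(f\"{c}={score.scores[c]}\" for c in CRITERIA)\n",
        "    print(f\"{score.label}: {details} (total {score.total()}/{MAX_PER_CRITERION * len(CRITERIA)})\")\n",
        "\n",
        "# Pick the response with the highest total as the preferred answer.\n",
        "best = max((score_a, score_b), key=RubricScore.total)\n",
        "print(f\"Preferred answer: {best.label}\")"
      ]
    },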
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Exercise 2: Detect Subtle Errors <a id=\"exercise-2\"></a>\n",
        "\n",
        "**Prompt**\n",
        "\n",
        "> Compute the average temperature of: [18.4, 18.9, 19.7, 20.1, 20.6]\n",
        "\n",
        "**Model answer**\n",
        "\n",
        "> The mean temperature is 20.1°C.\n",
        "\n",
        "**Your task**\n",
        "\n",
        "- detect the error\n",
        "- explain why the result is wrong\n",
        "- compute the correct mean"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "temperatures = [18.4, 18.9, 19.7, 20.1, 20.6]\n",
        "model_answer = 20.1\n",
        "\n",
        "true_mean = sum(temperatures) / len(temperatures)\n",
        "difference = model_answer - true_mean\n",
        "\n",
        "print(\"Temperatures:\", temperatures)\n",
        "print(f\"Model answer: {model_answer:.1f} °C\")\n",
        "print(f\"Correct mean: {true_mean:.2f} °C\")\n",
        "print(f\"Error: {difference:.2f} °C\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The error is subtle because `20.1` is one of the values in the list, so it can look plausible at first glance. But the arithmetic mean is:\n",
        "\n",
        "\\[\n",
        "\\frac{18.4 + 18.9 + 19.7 + 20.1 + 20.6}{5} = 19.54\n",
        "\\]\n",
        "\n",
        "This is a good reminder that confident numerical outputs should not be trusted without checking the calculation."
      ]
    },
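    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The same check can be wrapped in a small reusable helper. This is a sketch under stated assumptions: `check_claimed_mean` is a hypothetical function written for this notebook, and the default tolerance of 0.05 °C is an arbitrary margin that allows for rounding to one decimal place. It relies only on `statistics.fmean` and `math.isclose` from the standard library."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import math\n",
        "from statistics import fmean\n",
        "\n",
        "\n",
        "def check_claimed_mean(values, claimed, tolerance=0.05):\n",
        "    \"\"\"Return (is_ok, recomputed_mean) for a mean reported by a model.\n",
        "\n",
        "    `tolerance` is an assumed absolute margin that allows for rounding.\n",
        "    \"\"\"\n",
        "    recomputed = fmean(values)\n",
        "    is_ok = math.isclose(claimed, recomputed, abs_tol=tolerance)\n",
        "    return is_ok, recomputed\n",
        "\n",
        "\n",
        "temperatures = [18.4, 18.9, 19.7, 20.1, 20.6]\n",
        "\n",
        "# The model's answer from the exercise, followed by a correctly rounded value.\n",
        "for claimed in (20.1, 19.5):\n",
        "    ok, recomputed = check_claimed_mean(temperatures, claimed)\n",
        "    print(f\"Claimed {claimed:.1f} °C -> accepted: {ok} (recomputed mean: {recomputed:.2f} °C)\")"
      ]
    },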
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Exercise 3: Evaluate Prompts <a id=\"exercise-3\"></a>\n",
        "\n",
        "Sometimes the problem is not only the answer. The prompt itself may be too vague.\n",
        "\n",
        "Consider the same objective with two prompts:\n",
        "\n",
        "- **Prompt 1**: `Summarize this paper.`\n",
        "- **Prompt 2**: `Summarize this paper in 3 bullet points, focusing on methodology, results, and limitations.`\n",
        "\n",
        "**Your task**\n",
        "\n",
        "- compare the two outputs below\n",
        "- decide which one is more useful\n",
        "- explain why prompt design changes the quality of the answer"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "paper_summary_goal = \"Summarize a scientific paper for a quick research review.\"\n",
        "\n",
        "prompt_bad = \"Summarize this paper.\"\n",
        "prompt_good = (\n",
        "    \"Summarize this paper in 3 bullet points, focusing on methodology, \"\n",
        "    \"results, and limitations.\"\n",
        ")\n",
        "\n",
        "output_bad = dedent(\n",
        "    \"\"\"\n",
        "    This paper studies environmental data and presents several interesting findings.\n",
        "    The authors discuss the topic in detail and show that the results are important for future work.\n",
        "    Overall, the paper contributes to the field.\n",
        "    \"\"\"\n",
        ").strip()\n",
        "\n",
        "output_good = dedent(\n",
        "    \"\"\"\n",
        "    - Methodology: The paper analyzes coastal environmental observations using a structured comparative approach.\n",
        "    - Results: The authors report a clear relationship between the observed variables and their study outcome.\n",
        "    - Limitation: The conclusions are constrained by the dataset size and the narrow study period.\n",
        "    \"\"\"\n",
        ").strip()\n",
        "\n",
        "print(\"Goal:\\n\")\n",
        "print(paper_summary_goal)\n",
        "\n",
        "print(\"\\nPrompt 1:\\n\")\n",
        "print(prompt_bad)\n",
        "print(\"\\nOutput 1:\\n\")\n",
        "print(output_bad)\n",
        "\n",
        "print(\"\\n\" + \"=\" * 80)\n",
        "print(\"\\nPrompt 2:\\n\")\n",
        "print(prompt_good)\n",
        "print(\"\\nOutput 2:\\n\")\n",
        "print(output_good)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "In most research contexts, the second prompt is more useful because it reduces ambiguity and asks for information that matters for evaluation: method, result, and limitation.\n",
        "\n",
        "This does not guarantee correctness, but it often improves usefulness and makes later checking easier."
      ]
    },
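    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "One way to make this habit repeatable is to build prompts from explicit constraints instead of typing them ad hoc. The cell below is a minimal sketch: `build_summary_prompt` and its parameters are hypothetical names introduced for this example, and the wording it produces simply follows the pattern of Prompt 2."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "def build_summary_prompt(n_bullets, focus_areas):\n",
        "    \"\"\"Compose a summary prompt that states the format and the focus explicitly.\"\"\"\n",
        "    if n_bullets < 1 or not focus_areas:\n",
        "        raise ValueError(\"Need at least one bullet point and one focus area.\")\n",
        "\n",
        "    # Join the focus areas into a readable English list.\n",
        "    if len(focus_areas) == 1:\n",
        "        focus = focus_areas[0]\n",
        "    elif len(focus_areas) == 2:\n",
        "        focus = \" and \".join(focus_areas)\n",
        "    else:\n",
        "        focus = \", \".join(focus_areas[:-1]) + \", and \" + focus_areas[-1]\n",
        "\n",
        "    unit = \"bullet point\" if n_bullets == 1 else \"bullet points\"\n",
        "    return f\"Summarize this paper in {n_bullets} {unit}, focusing on {focus}.\"\n",
        "\n",
        "\n",
        "# Reproduces Prompt 2 from the exercise, then a variant with a different focus.\n",
        "print(build_summary_prompt(3, [\"methodology\", \"results\", \"limitations\"]))\n",
        "print(build_summary_prompt(2, [\"data sources\", \"limitations\"]))"
      ]
    },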
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Confidence Is Not Correctness <a id=\"confidence-is-not-correctness\"></a>\n",
        "\n",
        "One of the most important ideas to remember is:\n",
        "\n",
        "> **LLMs are optimized for plausibility, not truth.**\n",
        "\n",
        "A confident tone, a polished structure, or a detailed explanation does not guarantee correctness. In practice, this means that fluency should never be used as evidence by itself.\n",
        "\n",
        "This is why evaluation matters: an answer can be clear, confident, and still wrong."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Best-Practice Checklist <a id=\"best-practice-checklist\"></a>\n",
        "\n",
        "Before using an LLM output in research, it is good practice to ask:\n",
        "\n",
        "- Have I verified critical facts?\n",
        "- Have I checked calculations independently?\n",
        "- Have I asked for sources, evidence, or intermediate steps when needed?\n",
        "- Have I compared multiple outputs or reformulated the prompt?\n",
        "- Am I using the model as a copilot rather than as an authority?\n",
        "\n",
        "A compact version to remember is:\n",
        "\n",
        "✔ Verify important data  \n",
        "✔ Do not trust calculations blindly  \n",
        "✔ Ask for sources or steps  \n",
        "✔ Compare outputs  \n",
        "✔ Use the model as a copilot, not an authority"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": ".venv",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.12.13"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}