Prompt Engineering

This notebook focuses on how to design prompts that produce useful, reproducible, and well-formatted outputs. The central idea is not to find magic words, but to reduce ambiguity and give the model a clear target.

Suggested duration: 2.5 hours

Learning goals

By the end of this notebook, you should be able to:

write a structured prompt for a real task
explain why prompt quality changes model behavior
control output format with explicit instructions
distinguish zero-shot, few-shot, and reasoning-oriented prompting
adapt prompts for smaller or local models
test prompting techniques with Hugging Face models
evaluate prompt quality using realistic case studies

Table of Contents

Why prompt structure matters
Working with external model APIs
A practical prompt template
Controlling output format
Local models
Zero-shot and few-shot
Reasoning strategies
From NetCDF data to a textual alert

Why prompt structure matters

Prompt engineering is the practice of expressing a task so that the model can solve it with fewer wrong assumptions. A vague prompt leaves many decisions open: audience, scope, tone, format, uncertainty, and missing information.

Consider the difference between these two instructions:

Summarize this paper.

and

Summarize the paper for professionals in data science in 5 bullet points. Include the research question, data source, method, main result, and one limitation. Use only the abstract provided below.

The second prompt is stronger because it defines:

the task: summarize
the audience: professionals in data science
the scope: use only the abstract
the format: 5 bullet points
the quality bar: include one limitation

A useful way to think about prompting is that we are designing the model’s working conditions. Better working conditions usually produce more stable outputs.

[ ]:

from textwrap import dedent, fill  # utilities to format displayed text
from transformers import pipeline   # transformers library contains models and pipelines
# A pipeline is a function that takes a prompt and returns a model's output
# according to the task we want to perform

weak_prompt = "Summarize this abstract."

strong_prompt = dedent(
    """
    You are helping with scientific literature review.
    Task: summarize the abstract.
    Audience: Professionals in data science.
    Constraints:
    - use only the information in the abstract
    - do not invent results
    - mention exactly one limitation
    Output format:
    - 5 bullet points
    """
).strip()

# paper: https://www.int-res.com/journals/meps/articles/meps15103
abstract = (
    "Connectivity between patchy marine habitats through larval dispersal is crucial for the persistence of local populations. "
    "Studies of various marine species suggest broad-scale gene flow across the tropical Indo-West Pacific (IWP), "
    "presumably facilitated by larval dispersal via stepping-stone habitats. "
    "However, the generational timescales and geographic paths involved in such dispersal remain unclear, owing to limited biophysical modelling studies. "
    "Here, we quantified connectivity among patchy habitats of the mangrove whelk Terebralia palustris across the IWP using habitat suitability modelling, "
    "larval dispersal modelling, and mitochondrial DNA-based population genetic analysis. Our modelling revealed a single larval dispersal network connecting all potential habitats across the IWP. "
    "At least 14 generations were required for dispersal via stepping-stone habitats to connect the outer edges of the IWP. "
    "The Maldives and Seychelles served as key stepping stones for dispersal, linking the western Indian Ocean and the western Pacific Ocean through monsoon-driven ocean currents. "
    "Major haplotypes were shared across 9 regions of the IWP, providing genetic support for a single larval dispersal network. "
    "Our findings provide fundamental insights into ecological networks formed by stepping-stone dispersal across the IWP, which maintain broad-scale connectivity of T. palustris and potentially other coastal species."
)

print("Weak prompt:\n")
print(weak_prompt)
print("\n" + "=" * 70 + "\n")
print("Strong prompt:\n")
print(strong_prompt)
print("\n" + "=" * 70 + "\n")
print("Example input abstract:\n")
print(abstract)

# Suggested model for trying this section.
# FLAN-T5 is an encoder-decoder (seq2seq) model; in transformers 4.x its pipeline task is
# "text2text-generation". "text-generation" is meant for decoder-only models such as distilgpt2.
generator = pipeline("text2text-generation", model="google/flan-t5-small")
weak_input = f"{weak_prompt}\n\nAbstract: {abstract}"
strong_input = f"{strong_prompt}\n\nAbstract: {abstract}"
weak_output = generator(weak_input, max_new_tokens=120)[0]["generated_text"]
strong_output = generator(strong_input, max_new_tokens=120)[0]["generated_text"]

[ ]:

print("\nModel outputs:\n")
print("Weak prompt output:\n")
# Decoder-only pipelines return the prompt followed by the response in one string;
# removeprefix (Python 3.9+) drops the prompt when it is present.
weak_answer = weak_output.removeprefix(weak_input).strip()
strong_answer = strong_output.removeprefix(strong_input).strip()
print(fill(weak_answer, width=100))

print("\n" + "-" * 100 + "\n")
print("Strong prompt output:\n")
print(fill(strong_answer, width=100))

# Compare the size of the two outputs (words and characters)
print("\nDifference between the two outputs (as a table):\n")
print("| Prompt | Words | Chars |")
print("|--------|-------|-------|")
print(f"| Weak | {len(weak_answer.split())} | {len(weak_answer)} |")
print(f"| Strong | {len(strong_answer.split())} | {len(strong_answer)} |")

[ ]:

# Where Hugging Face models are stored (MAC & Linux) & MODEL SIZE!
!MODEL_DIR="${HF_HOME:-$HOME/.cache/huggingface}/hub/models--google--flan-t5-small"; \
if [ -d "$MODEL_DIR" ]; then \
  du -sh "$MODEL_DIR"; \
fi

Activity

1. Give the same prompts (weak and strong) to ChatGPT and compare the answers.
2. Try different models:

google/flan-t5-base: stronger than the small version, still manageable for demonstrations.
distilgpt2: useful to show the limits of a non-instruction-tuned model.

Working with external model APIs

When we use an external model API, our notebook does not run the model locally. Instead, it sends a request to a remote service and receives a generated response.

Core concepts

API (Application Programming Interface): a structured way for one program to ask another program for a result.
Endpoint: the URL where requests are sent.
API key: a private token used for authentication. It should never be hardcoded in notebooks or pushed to Git.
Model name: the identifier of the model we want to use (for example, meta/llama-3.1-8b-instruct).
Parameters: options such as temperature, max_tokens, and top_p that control response style and length.
Response object: the JSON payload returned by the API, containing generated text and metadata.
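
To make these terms concrete, the following offline sketch shows roughly what a request body and a response payload look like for a Chat Completions style endpoint. The field names follow the OpenAI Chat Completions format; all values (IDs, text, token counts) are invented for illustration, and no request is actually sent.

```python
import json

# Request body: model name, conversation messages, and generation parameters.
request_body = {
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Summarize this abstract."}],
    "temperature": 0.3,
    "max_tokens": 180,
}

# Illustrative response payload (values are made up for this sketch).
response_json = """
{
  "id": "example-id",
  "object": "chat.completion",
  "model": "meta/llama-3.1-8b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "The study maps larval dispersal."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 12, "completion_tokens": 7, "total_tokens": 19}
}
"""

payload = json.loads(response_json)

# The generated text lives inside the first element of "choices".
generated_text = payload["choices"][0]["message"]["content"]
# "finish_reason" tells us whether the model stopped naturally or hit a limit.
finish_reason = payload["choices"][0]["finish_reason"]
# "usage" is metadata about token consumption.
total_tokens = payload["usage"]["total_tokens"]

print(json.dumps(request_body, indent=2))
print(generated_text)
print(finish_reason, total_tokens)
```

Client libraries wrap this payload in an object, which is why later cells read the same fields as attributes (for example, response.choices[0].message.content).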

What "OpenAI-compatible" means

Many providers implement an API that follows the same request/response shape as OpenAI’s Chat Completions API. Such APIs are usually described as OpenAI-compatible.

In practice, this means you can usually keep the same client code and only change:

base_url (the provider endpoint)
api_key (your provider token)
model (a model available on that provider)
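
A minimal sketch of this idea, assuming two providers that appear in this notebook (OpenRouter and a local Ollama server). The names ProviderConfig, PROVIDERS, and settings_for are hypothetical helpers introduced only for illustration; the sketch builds the three settings without importing the client or sending a request.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ProviderConfig:
    base_url: str
    model: str
    api_key_env: Optional[str]  # None means the server does not check the key


# Hypothetical registry: only these three values differ between providers.
PROVIDERS = {
    "openrouter": ProviderConfig(
        base_url="https://openrouter.ai/api/v1",
        model="nvidia/nemotron-3-super-120b-a12b:free",
        api_key_env="API_KEY_COURSE",
    ),
    "ollama": ProviderConfig(
        base_url="http://localhost:11434/v1",
        model="llama3.2:3b",
        api_key_env=None,
    ),
}


def settings_for(name: str, environ: Mapping[str, str]) -> dict:
    """Return the base_url, api_key, and model to use for one provider."""
    cfg = PROVIDERS[name]
    if cfg.api_key_env is None:
        # Placeholder: the client requires a value, a local Ollama server ignores it.
        api_key = "ollama"
    else:
        api_key = environ.get(cfg.api_key_env, "")
        if not api_key:
            raise ValueError(f"Environment variable {cfg.api_key_env} is not set.")
    return {"base_url": cfg.base_url, "api_key": api_key, "model": cfg.model}


demo_env = {"API_KEY_COURSE": "demo-key"}  # stand-in for os.environ
for name in PROVIDERS:
    s = settings_for(name, demo_env)
    print(name, "->", s["base_url"], s["model"])
```

With the openai package installed, these settings would be passed as OpenAI(base_url=s["base_url"], api_key=s["api_key"]) and client.chat.completions.create(model=s["model"], messages=...), so the rest of the client code stays unchanged.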

Let's work with external API providers

Some (free) API providers for prototyping

Groq

Strength: high-performance inference with very low latency.
Typical free limits: around 14,400 requests/day and ~30 requests/min (varies by model/account).
Common model families: Llama 3.1, Mixtral, Gemma.

Google AI Studio (Gemini API)

Strength: easy access to Gemini models for quick experiments.
Typical free limits: around 1,500 requests/day and 15 requests/min.
Note: stronger Pro-tier models (for example, Gemini 2.5 Pro) are typically paid.

OpenRouter

Strength: one API interface across many providers/models.
Free access: selected models are marked as free; limits vary by model.
Use case: excellent for fast prototyping and cross-model comparisons.

GitHub Models

Strength: simple model playground/prototyping flow in the GitHub ecosystem.
Free access: free prototyping for many models.
Typical constraint: around 8K input tokens per request (model-dependent).

Mistral AI (La Plateforme)

Strength: direct access to Mistral-hosted models.
Typical free tier: around 1 request/second plus monthly token allowances.

Cohere

Strength: clean API and strong enterprise-style NLP tooling.
Typical free limits: around 1,000 requests/month and ~20 requests/min (for example, Command R+).
Note: free keys are generally intended for non-commercial use.

Hugging Face Inference API

Strength: access to thousands of open-source models.
Free access: available, but effective rate limits can vary with demand/server load.

Always verify the official pricing and rate-limit pages before relying on these numbers; free tiers change often.

At a glance:

Groq: high-performance inference; ~14,400 requests/day (e.g., Llama 3 8B); ~30 requests/minute; models: Llama 3.1, Mixtral, Gemma.
Google AI Studio (Gemini API): free tier of ~1,500 requests/day and 15 requests/minute; Pro models (e.g., Gemini 2.5 Pro) are typically paid.
OpenRouter: access to multiple "free" models (DeepSeek R1, Llama 3, Mistral); limits vary per model; suitable for prototyping.
GitHub Models: free prototyping for 45+ models; ~8K input tokens per request; models from Microsoft (Phi), OpenAI (GPT-4o), Meta (Llama).
Mistral AI (La Plateforme): free tier with ~1 request/second; generous monthly token limits.
Cohere: free API key for non-commercial use; ~1,000 requests/month; ~20 requests/minute (e.g., Command R+).
Hugging Face (Inference API): free inference on thousands of open-source models; rate limits depend on server load.

OpenRouter is a unified API gateway for language models from multiple providers. With one API format, users can switch models quickly and compare prompt behavior across model families.

Examples of commonly used free-tier model IDs on OpenRouter include:

meta-llama/llama-3.1-8b-instruct:free: good for structured prompting and lightweight instruction-following tests.
mistralai/mistral-7b-instruct:free: useful for concise prompt/response comparisons with a smaller instruct model.
google/gemma-2-9b-it:free: useful for checking how another model family follows format constraints.

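The following cells load an API key for the session and then send the weak prompt to a model through OpenRouter: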

[ ]:

# HOW TO USE THE API KEY IN OUR NOTEBOOK OR SYSTEM!
# TAKE CARE OF THE API KEY!

# The system will ask for the API key
# In VSCode, the input-form is visible in the upper part of the window

from getpass import getpass
import os

api_key_input = getpass("Enter your API key: ").strip()
if not api_key_input:
    raise ValueError("No API key provided.")

# Store it in the current notebook session so other cells can read it.
os.environ["API_KEY_COURSE"] = api_key_input
print("API key loaded for this session.")

[ ]:

# Example: calling an NVIDIA model on OpenRouter through its OpenAI-compatible API
import os
from openai import OpenAI

api_key = os.getenv("API_KEY_COURSE")
if not api_key:
    raise ValueError("API_KEY_COURSE is not set; run the previous cell first.")

client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=api_key,
)

messages = [
    {
        "role": "user",
        "content": (
            weak_prompt +
            "\n\nAbstract: " + abstract
        ),
    },
]

response = client.chat.completions.create(
    model="nvidia/nemotron-3-super-120b-a12b:free",
    messages=messages,
    temperature=0.3,
    max_tokens=180,
)

print(fill(response.choices[0].message.content, width=100))

[ ]:

api_output = response.choices[0].message.content

# Compare the size of the three outputs (words and characters)
print("\nDifference between the outputs:\n")
print("| Prompt | Words | Chars |")
print("|--------|-------|-------|")
print(f"| Weak | {len(weak_output.split())} | {len(weak_output)} |")
print(f"| Strong | {len(strong_output.split())} | {len(strong_output)} |")
print(f"| API | {len(api_output.split())} | {len(api_output)} |")

[ ]:

response  # What kind of object is this? What does its choices attribute contain?

A practical prompt template

A strong prompt often contains five building blocks:

Task: What should the model do?
Context: What information should it use?
Constraints: What must it avoid or prioritize?
Output format: How should the answer be structured?
Quality criteria: What makes a good answer?

A reusable template is:

You are helping with [domain/task].
Goal: [what to produce].
Context: [relevant background or source text].
Constraints: [rules, exclusions, definitions].
Output format: [table, bullets, JSON, abstract, etc.].
Quality criteria: [what makes the output good].

This template works well because it reduces hidden choices. The model does not need to guess whether you want a formal answer, a table, a short summary, or a speculative answer.
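
One way to make the template reusable is to fill it from a small function. The sketch below is only an illustration: build_prompt and its parameter names are hypothetical and simply mirror the template fields above.

```python
def build_prompt(domain, goal, context, constraints, output_format, quality_criteria):
    """Fill the reusable template; constraints is a list of short rules."""
    constraint_lines = "\n".join(f"- {rule}" for rule in constraints)
    return (
        f"You are helping with {domain}.\n"
        f"Goal: {goal}.\n"
        f"Context: {context}\n"
        f"Constraints:\n{constraint_lines}\n"
        f"Output format: {output_format}.\n"
        f"Quality criteria: {quality_criteria}."
    )


example_prompt = build_prompt(
    domain="scientific literature review",
    goal="summarize the paper for an MSc student",
    context="use only the abstract provided below.",
    constraints=[
        "do not invent results that are not explicitly stated",
        "if information is missing, write 'unknown'",
    ],
    output_format="a Markdown table with columns [Question, Answer]",
    quality_criteria="concise, accurate, readable, and include one limitation",
)

print(example_prompt)
```

Keeping each building block as a separate argument makes it easy to change one block (for example, the output format) and compare model behavior while everything else stays fixed.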

[ ]:

paper_prompt = dedent(
    """
    You are helping with scientific literature review.
    Goal: summarize the paper for an MSc student.
    Context: use only the abstract provided below.
    Constraints:
    - do not invent results that are not explicitly stated
    - if information is missing, write 'unknown'
    Output format: a Markdown table with columns [Question, Answer].
    Quality criteria: concise, accurate, readable, and include one limitation.

    Abstract:
    [PASTE ABSTRACT HERE]
    """
).strip()

print(paper_prompt)

# Try with:
# generator = pipeline("text2text-generation", model="google/flan-t5-base")
# result = generator(paper_prompt.replace("[PASTE ABSTRACT HERE]", abstract), max_new_tokens=180)
# print(result[0]["generated_text"])

Controlling output format

Format control matters whenever the model output will be reused by a person, a spreadsheet, or another program.

Common output targets include:

bullet lists for short summaries
Markdown tables for comparisons
JSON for pipelines and applications
paper-style prose for academic writing

The more reusable the output must be, the more explicit the format instructions should be.

Good format instructions often include:

the exact structure to return
field names or column names
a rule for missing values
a limit on length
a reminder not to invent unsupported content

Example instruction:

Return valid JSON with keys: title, method, dataset, main_result, limitation. If a field is missing, use "unknown".

That last sentence is important because it gives the model a safe behavior when evidence is incomplete.
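
Because a model may still return extra text or omit a key, it can help to validate the output before reusing it. The following is a minimal sketch, not a full validator: parse_model_json is a hypothetical helper, and the sample response is invented for illustration.

```python
import json

REQUIRED_KEYS = ["title", "method", "dataset", "main_result", "limitation"]


def parse_model_json(raw_text, required_keys=REQUIRED_KEYS):
    """Extract the JSON object from a model response and enforce the expected keys."""
    # Models often add text around the JSON, so keep only the outermost {...} span.
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in the model response.")
    data = json.loads(raw_text[start : end + 1])
    # Apply the missing-value rule ourselves in case the model skipped a field.
    return {key: data.get(key) or "unknown" for key in required_keys}


# Invented sample response: extra text around the JSON and no "title" key.
sample_response = (
    "Here is the JSON you asked for:\n"
    '{"method": "transformer-based classifier", '
    '"dataset": "120,000 labeled plankton images", '
    '"main_result": "macro-F1 improves by 8% over a CNN baseline", '
    '"limitation": "performance drops for rare taxa and low-light images"}\n'
    "Let me know if you need anything else."
)

record = parse_model_json(sample_response)
print(json.dumps(record, indent=2))
```

If json.loads fails, the response was not valid JSON at all, which is a useful signal that the format instruction needs to be stricter or that the output was cut off by the token limit.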

[ ]:

json_prompt = dedent(
    """
    Extract information from the abstract below.
    Return valid JSON with keys:
    - title
    - method
    - dataset
    - main_result
    - limitation
    If a field is missing, use 'unknown'.

    Abstract:
    We evaluate a transformer-based classifier for plankton image recognition
    using 120,000 labeled images from Mediterranean coastal stations.
    The model improves macro-F1 by 8% over a CNN baseline, but performance drops
    for rare taxa and low-light images.
    """
).strip()

table_prompt = dedent(
    """
    Summarize the abstract below as a Markdown table with two columns:
    [Item, Value].
    Include rows for task, data, model, result, and limitation.

    Abstract:
    We evaluate a transformer-based classifier for plankton image recognition
    using 120,000 labeled images from Mediterranean coastal stations.
    The model improves macro-F1 by 8% over a CNN baseline, but performance drops
    for rare taxa and low-light images.
    """
).strip()

# print("JSON-oriented prompt:\n")
# print(json_prompt)
# print("\n" + "=" * 70 + "\n")
# print("Table-oriented prompt:\n")
# print(table_prompt)

# Using the OpenAI-compatible client (OpenRouter endpoint)

import os
from openai import OpenAI

api_key = os.getenv("API_KEY_COURSE")
if not api_key:
    raise ValueError("API_KEY_COURSE is not set; run the API key cell first.")

client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=api_key,
)

messages = [
    {
        "role": "user",
        "content": json_prompt,
    },
]

response = client.chat.completions.create(
    model="nvidia/nemotron-3-super-120b-a12b:free",
    messages=messages,
)

[ ]:

print(fill(response.choices[0].message.content, width=100))

Activity

Send table_prompt to a different free model and compare the result with the JSON output above.

Using prompts with local models (Ollama, llama.cpp, …)

Local models are attractive for privacy, cost control, and offline use, but they often require more careful prompting than larger hosted systems.

Why? Smaller or locally deployed models usually:

have weaker instruction-following behavior
are more sensitive to ambiguous prompts
may struggle with long context or complex formatting
benefit more from explicit examples

[ ]:

import requests
from openai import OpenAI

# Local Ollama server (start with: ollama serve)
OLLAMA_BASE_URL = "http://localhost:11434"

# 1) Discover locally installed models
try:
    tags_response = requests.get(f"{OLLAMA_BASE_URL}/api/tags", timeout=10)
    tags_response.raise_for_status()
    local_models = [m["name"] for m in tags_response.json().get("models", [])]
except Exception as exc:
    raise RuntimeError(
        "Could not connect to local Ollama at http://localhost:11434. "
        "Run `ollama serve` and ensure Ollama is installed."
    ) from exc

print("Local Ollama models:\n")
for model_name in local_models:
    print(f"- {model_name}")

if not local_models:
    raise ValueError("No local models found. Pull one first, e.g. `ollama pull llama3.2:3b`.")

# 2) Choose a model
# Note: on the author's machine, two local Ollama models are installed:
# - 0: gemma4:e4b
# - 1: deepseek-coder-v2:16b
# IMPORTANT: the list on your machine will differ, so check the output above.

selected_model = local_models[0]  # Let's see what happens with the first model
print(f"\nSelected model: {selected_model}")

# 3) Prompt fallback (in case json_prompt was not executed earlier)
prompt_text = globals().get(
    "json_prompt",
    "Classify this abstract as one of: observation, experiment, review. Return only one label.",
)

# 4) Call Ollama through OpenAI-compatible endpoint
client = OpenAI(base_url=f"{OLLAMA_BASE_URL}/v1", api_key="ollama")
response = client.chat.completions.create(
    model=selected_model,
    messages=[
        {"role": "system", "content": "You are a concise scientific assistant."},
        {"role": "user", "content": prompt_text},
    ],
    temperature=0.2,
    max_tokens=120,
)

# 5) Debug-friendly print
choice = response.choices[0]
content = choice.message.content

print("\nModel response:\n")
if content and content.strip():
    print(content)
else:
    print("(empty content returned)")  # The model returned no text
    print(f"finish_reason: {choice.finish_reason}")
    print("raw choice:", choice)

Notes about the previous results:

With model 1 (deepseek-coder-v2:16b), what happens to the title field?

Why can local Ollama inference feel slow?

Local inference can be slower than hosted APIs because everything runs on your own machine, which usually has less compute and memory than cloud inference clusters.

Common reasons:

Model size vs hardware: larger models (for example, 16B+) require much more VRAM/RAM and compute per token.
CPU fallback: if the model does not fully fit in GPU memory, part of the workload can run on CPU, which is much slower.
First-run overhead: loading weights and warming up kernels adds startup latency.
Token-by-token generation: generation is sequential; longer outputs (max_tokens) take more time.
Resource contention: browser, IDE, and notebook processes compete for local CPU/GPU/RAM.

Practical speed tips:

Use smaller models for class demos (for example, 3B-8B).
Keep prompts concise and lower max_tokens.
Close heavy background apps.
Prefer quantized models when available.
Run requests one at a time during exercises.
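
To see why model size and quantization matter, a back-of-the-envelope estimate of the memory needed for the weights alone is parameters × bits per weight / 8. The sketch below is a rough approximation under that assumption: it ignores the KV cache, activations, and runtime overhead, so real usage is higher. The function name is hypothetical.

```python
def estimate_weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    # params_billions * 1e9 weights, each bits_per_weight / 8 bytes, divided by 1e9.
    return params_billions * bits_per_weight / 8


for params in (3, 8, 16):
    fp16 = estimate_weight_memory_gb(params, 16)
    q4 = estimate_weight_memory_gb(params, 4)
    print(f"{params}B model: ~{fp16:.1f} GB at 16-bit, ~{q4:.1f} GB at 4-bit")
```

Under this estimate, a 4-bit quantized model needs about a quarter of the weight memory of its 16-bit version, which is often the difference between fitting in GPU memory and falling back to the CPU.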

Zero-shot and few-shot

Zero-shot

In zero-shot prompting, we describe the task without giving examples. This works well when the task is familiar, the labels are clear, and the model already knows the pattern.

Example:

Classify each abstract as observation, experiment, or review.

Few-shot

In few-shot prompting, we provide small examples that demonstrate the intended behavior. Few-shot prompting is useful when:

labels are subtle
the desired style matters
there are hidden conventions
the model tends to confuse nearby categories

Few-shot prompting teaches by demonstration instead of only by instruction.
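
When the examples live in a list, the few-shot prompt can be assembled programmatically, which makes it easy to add, remove, or reorder demonstrations. This is a small sketch; build_few_shot_prompt is a hypothetical helper name.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot classification prompt from (abstract, label) pairs."""
    parts = [instruction, ""]
    for i, (abstract_text, label) in enumerate(examples, start=1):
        parts += [f"Example {i}:", f"Abstract: {abstract_text}", f"Label: {label}", ""]
    # End with an open "Label:" so the model completes the pattern.
    parts += ["Now classify:", f"Abstract: {query}", "Label:"]
    return "\n".join(parts)


demo_examples = [
    ("We measured salinity and chlorophyll trends from coastal buoys over ten years.", "observation"),
    ("We manipulated nutrient concentration in mesocosms and compared growth rates.", "experiment"),
    ("This paper synthesizes recent studies on marine heatwaves.", "review"),
]

demo_prompt = build_few_shot_prompt(
    instruction=(
        "Classify each abstract as one of: observation, experiment, review.\n"
        "Return only the label."
    ),
    examples=demo_examples,
    query="We tested nutrient uptake in mesocosms under different temperature conditions.",
)

print(demo_prompt)
```

Passing an empty examples list yields a zero-shot prompt with the same instruction, which gives a fair comparison between the two strategies.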

[ ]:

zero_shot_prompt = dedent(
    """
    Classify the following abstract as one of: observation, experiment, review.
    Return only the label.

    Abstract:
    We tested nutrient uptake in mesocosms under different temperature conditions.
    """
).strip()

few_shot_prompt = dedent(
    """
    Classify each abstract as one of: observation, experiment, review.
    Return only the label.

    Example 1:
    Abstract: We measured salinity and chlorophyll trends from coastal buoys over ten years.
    Label: observation

    Example 2:
    Abstract: We manipulated nutrient concentration in mesocosms and compared growth rates.
    Label: experiment

    Example 3:
    Abstract: This paper synthesizes recent studies on marine heatwaves.
    Label: review

    Now classify:
    Abstract: We tested nutrient uptake in mesocosms under different temperature conditions.
    Label:
    """
).strip()

print("Zero-shot prompt:\n")
print(zero_shot_prompt)
print("\n" + "=" * 70 + "\n")
print("Few-shot prompt:\n")
print(few_shot_prompt)

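from openai import OpenAI

# Reuses OLLAMA_BASE_URL from the local-model cell above.
client = OpenAI(base_url=f"{OLLAMA_BASE_URL}/v1", api_key="ollama")
# Must be a model installed locally (see the list printed by the local-model cell).
model = "deepseek-coder-v2:16b"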

for label, prompt in [("zero-shot", zero_shot_prompt), ("few-shot", few_shot_prompt)]:
    r = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
        max_tokens=40,
    )
    print(f"\n{label}:\n{r.choices[0].message.content}")
# print(generator(few_shot_prompt, max_new_tokens=40)[0]["generated_text"])

Reasoning strategies

Chain-of-Thought

Chain-of-Thought prompting encourages the model to decompose a problem into intermediate steps. This can help when the task benefits from explicit decomposition, such as multi-step classification, planning, or error analysis.

In real applications, a safer pattern is often to ask for:

a short reasoning summary
a checklist of criteria
intermediate outputs
a final answer in a fixed format

instead of asking for a long unrestricted reasoning trace.
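
A fixed final line also makes the answer easy to check in code. The sketch below assumes the response ends with a line of the form "Final label: <label>"; extract_final_label is a hypothetical helper, and the sample responses are invented.

```python
import re

ALLOWED_LABELS = {"observation", "experiment", "review"}


def extract_final_label(response_text):
    """Return the label from the last 'Final label: ...' line, or None if absent or invalid."""
    matches = re.findall(
        r"^\s*Final label:\s*(\w+)",
        response_text,
        flags=re.IGNORECASE | re.MULTILINE,
    )
    if not matches:
        return None  # e.g. the response was cut off before the final line
    label = matches[-1].lower()
    return label if label in ALLOWED_LABELS else None


complete_response = "Cues: manipulated variables, controlled tanks\nFinal label: Experiment"
truncated_response = "Cues: manipulated variables, controlled"

print(extract_final_label(complete_response))
print(extract_final_label(truncated_response))
```

Returning None for a missing or unexpected label lets the calling code retry or flag the item instead of silently accepting free text.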

Tree-of-Thought

Tree-of-Thought generalizes this idea by exploring several candidate paths before selecting one. It is useful for open-ended tasks such as planning, hypothesis generation, or comparing alternative strategies.

It is more expensive and slower than simple prompting, so it should be used only when the task really needs multiple alternatives.
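
The control flow can be sketched as a small breadth-limited search: propose several next steps for each partial path, score the resulting paths, and keep only the best few. This is a simplified illustration, not a reference implementation. In a real setup, propose and score would each be LLM calls (which is where the extra cost comes from); here they are toy functions with made-up scores so the example runs offline.

```python
from typing import Callable, List, Tuple

Path = Tuple[str, ...]


def tree_of_thought(
    propose: Callable[[Path], List[str]],
    score: Callable[[Path], float],
    depth: int = 2,
    beam_width: int = 2,
) -> Path:
    """Expand candidate paths level by level, keeping the beam_width best at each level."""
    frontier: List[Path] = [()]
    for _ in range(depth):
        candidates = [path + (step,) for path in frontier for step in propose(path)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=score)


# Toy stand-ins for LLM calls (hypothetical options and scores).
def toy_propose(path: Path) -> List[str]:
    if len(path) == 0:
        return ["field survey", "tank experiment", "literature review"]
    return ["one season", "three seasons"]


TOY_POINTS = {
    "field survey": 2,
    "tank experiment": 3,
    "literature review": 1,
    "one season": 1,
    "three seasons": 2,
}


def toy_score(path: Path) -> float:
    return sum(TOY_POINTS[step] for step in path)


best_plan = tree_of_thought(toy_propose, toy_score)
print(best_plan)
```

The number of model calls grows with depth × beam_width × proposals per step, which is why this strategy is reserved for tasks that genuinely need alternatives.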

[ ]:

chain_of_thought_prompt = dedent(
    """
    Read the abstract and decide whether it is observation, experiment, or review.
    First list the cues that support your decision.
    Then return the final label on a final line as:
    Final label: <label>

    Abstract:
    We manipulated temperature and nutrient concentration in controlled tanks.
    """
).strip()

concise_reasoning_prompt = dedent(
    """
    Classify the abstract as observation, experiment, or review.
    Return your answer in this format:
    Cues: <short comma-separated cues>
    Final label: <label>

    Abstract:
    We manipulated temperature and nutrient concentration in controlled tanks.
    """
).strip()

print("Longer reasoning-oriented prompt:\n")
print(chain_of_thought_prompt)
print("\n" + "=" * 70 + "\n")
print("Safer concise reasoning prompt:\n")
print(concise_reasoning_prompt)

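from openai import OpenAI

# Reuses OLLAMA_BASE_URL from the local-model cell above.
client = OpenAI(base_url=f"{OLLAMA_BASE_URL}/v1", api_key="ollama")
# Must be a model installed locally (see the list printed by the local-model cell).
model = "deepseek-coder-v2:16b"

reasoning_prompts = [
    ("chain_of_thought_prompt", chain_of_thought_prompt),
    ("concise_reasoning_prompt", concise_reasoning_prompt),
]

for label, prompt in reasoning_prompts:
    r = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
        max_tokens=200,  # leave room for the cues plus the final label line
    )
    print(f"\n{label}:\n{r.choices[0].message.content}")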

Activity

From NetCDF data to a textual alert

Goal: given a NetCDF (.nc) file with mean sea level pressure (prmsl), extract a region (the Gibraltar area) in two different time windows.

Use an LLM-assisted workflow to:

Convert the selected data into JSON.
Generate a textual warning based on predefined conditions. For example:

"Moderate heat warning for July 15, 2024. The selected area may reach a maximum temperature of 36.2°C. People should avoid prolonged exposure during the hottest hours and stay hydrated."
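
The two steps can be sketched offline as follows. Everything specific here is an assumption made for illustration: the pressure values are synthetic placeholders (not read from the file), the units are assumed to be pascals, and the 100000 Pa threshold is a made-up alert condition. summarize_window is a hypothetical helper.

```python
import json
from statistics import mean


def summarize_window(name, start, end, values_pa):
    """Reduce a 2D grid (list of rows) of pressure values to a small JSON-friendly summary."""
    flat = [value for row in values_pa for value in row]
    return {
        "window": name,
        "start": start,
        "end": end,
        "units": "Pa",  # assumption: pressure stored in pascals
        "min": min(flat),
        "mean": round(mean(flat), 1),
        "max": max(flat),
    }


# Synthetic placeholder grids (NOT real data from the NetCDF file).
window_1_values = [[101900.0, 102100.0], [102000.0, 102200.0]]
window_2_values = [[99800.0, 100000.0], [99900.0, 100100.0]]

summaries = [
    summarize_window("window_1", "2011-01-10", "2011-01-12", window_1_values),
    summarize_window("window_2", "2011-01-20", "2011-01-22", window_2_values),
]
data_json = json.dumps(summaries, indent=2)

# Hypothetical alert condition, stated explicitly so the model does not have to guess it.
alert_prompt = (
    "You are helping with weather alerts.\n"
    "Task: write a short textual warning for each time window in the JSON below.\n"
    "Conditions:\n"
    "- if min is below 100000 Pa, issue a low-pressure warning\n"
    "- otherwise, state that no warning is needed\n"
    "Constraints: use only the values in the JSON; do not invent numbers.\n"
    "Output format: one sentence per window, starting with the window name.\n\n"
    f"Data:\n{data_json}"
)

print(alert_prompt)
```

Sending a compact summary instead of the raw grid keeps the prompt short and reduces the chance that the model misreads long lists of numbers.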

[ ]:

import netCDF4 as nc
import numpy as np

# Load the data
f = nc.Dataset('data/prmsl.2011.nc')
# Print the variables / features of the file
print(f.variables.keys())

# Define the area and time windows of interest
lat_min, lat_max = 35.5, 37.5
lon_min, lon_max = 353.5, 355.5
time_window_1 = ("2011-01-10", "2011-01-12")
time_window_2 = ("2011-01-20", "2011-01-22")

# We create a mask for the Gibraltar area
lat = f.variables['lat'][:]
lon = f.variables['lon'][:]
time = f.variables['time'][:]

# Mask for the Gibraltar area
lat_idx = np.where((lat >= lat_min) & (lat <= lat_max))[0]
lon_idx = np.where((lon >= lon_min) & (lon <= lon_max))[0]


# Select the time steps that fall inside each window
from datetime import datetime

time_units = f.variables['time'].units

def get_time_indices(start, end):
    # start and end are ISO dates; both bounds are inclusive at 00:00 of the given day
    start_num = nc.date2num(datetime.fromisoformat(start), time_units)
    end_num = nc.date2num(datetime.fromisoformat(end), time_units)
    return np.where((time >= start_num) & (time <= end_num))[0]

t1_idx = get_time_indices(*time_window_1)
t2_idx = get_time_indices(*time_window_2)

# Extract the data for the Gibraltar area and the time windows
prmsl = f.variables['prmsl']
data_1 = prmsl[t1_idx][:, lat_idx][:, :, lon_idx]
data_2 = prmsl[t2_idx][:, lat_idx][:, :, lon_idx]

print(data_1)
print(data_2)

[ ]:

# Your turn!
# Ask the LLM (I) to convert the extracted data to JSON and (II) to apply the alert conditions and write the warning