Prompt Engineering That Actually Works in Production
Prompt engineering has a reputation problem. Half the advice online is "add more adjectives to your prompt" or "tell the model it's an expert." That stuff works in demos. In production — where you have real users, real edge cases, and wrong answers that cost money — it falls apart fast.
Here's what I've actually learned shipping LLM features.
Structure beats cleverness
A clever prompt that sort of works is worse than a boring prompt that always works. LLMs are pattern matchers. Give them a consistent, predictable format and they return consistent, predictable output.
The format I reach for:
SYSTEM:
You are a [specific role]. Your job is to [specific task].
Rules:
1. [Hard constraint]
2. [Hard constraint]
3. If unsure, say so rather than guessing.
Output format: [Exact JSON schema or template]
Separating the role, constraints, and output format into explicit sections reduces variance dramatically versus a wall of natural language.
Force structured output
Free-form text responses are hard to parse reliably. Whenever the downstream code needs to use the output programmatically, force JSON.
response = client.chat.completions.create(
model="gpt-4o",
response_format={"type": "json_object"},
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_input},
],
)
result = json.loads(response.choices[0].message.content)
Combined with Pydantic validation on the parsed output, you catch schema violations at the boundary rather than three function calls deep.
Few-shot examples are worth more than long instructions
Telling a model "classify the sentiment as positive, negative, or neutral" produces mediocre results. Showing it three examples of each class — especially ambiguous ones — produces much better results.
The examples I include are always the hard cases: a sarcastic positive review, a politely worded complaint, a neutral statement that uses emotional language. Easy examples don't teach the model anything it didn't already know.
Failure modes beat success cases in your system prompt
Most prompts describe what the model should do when things go right. The real value is describing what to do when they go wrong.
Instead of just:
Summarize the document in 3 bullet points.
Add:
If the document is too short to generate 3 meaningful bullets, return fewer. If the document is not in English, return a single bullet saying so. Never fabricate information not present in the document.
These guardrails feel obvious but without them you'll hit every one of them in production.
Test with adversarial inputs, not happy paths
Your prompt works fine on the input you wrote it for. Test it on:
- Empty input
- Input in a different language
- Input that's 10x longer than expected
- Input that's trying to override the system prompt ("Ignore all previous instructions and...")
- Input with no clear answer
Prompt injection — where user input escapes the user role and tries to control system behavior — is a real attack vector. At minimum, rephrase user input before inserting it: User question: {question} rather than injecting it raw.
Log everything and eval continuously
You can't improve prompts without visibility into how they're failing. I log every prompt, every response, and a structured failure reason when I can detect one.
A minimal eval harness:
- 50-100 golden examples with expected outputs
- A scoring function per task (exact match, semantic similarity, or human label)
- Run on every prompt change before deploying
Intuition about prompts is almost always wrong. The eval set is the ground truth.
The uncomfortable reality: most prompt engineering is just software engineering applied to natural language. Write clear specs, test against real inputs, measure outcomes, iterate. The "magic" framing is mostly marketing.