The Prompt Engineering Patterns I Actually Use in Production (And the Ones I Don't)

There's a funny thing about prompt engineering tutorials. They present 15 techniques as if they're all equally important, then show you a toy example of each. You leave feeling like you've learned a lot, but when you sit down to write a prompt for your actual system, you still stare at a blank screen.
I've written prompts for production systems processing millions of queries. Here's what I've actually learned: three or four patterns do 90% of the work. The rest are situational — useful when you need them, but you usually don't.
So instead of a checklist of 12 techniques with identical formatting, let me tell you which patterns I reach for first, which ones I pull out when things get hard, and which ones I almost never use despite them being in every tutorial.
The Big Three: Patterns I Use Every Single Day
1. Constrained Generation (The One That Fixes Everything)
If I could only teach one prompt engineering technique, it would be this: tell the model exactly what format you want.
This sounds obvious. It isn't. I review client prompts regularly, and the single most common mistake is vague output expectations. "Analyze this data" vs. "Return a JSON object with these exact fields" is the difference between a prototype and a production system.
Here's a real example. We had a product categorization system that was "working but unreliable." The prompt was:
prompt = "Extract product details from this description."
# Output: Sometimes JSON, sometimes markdown, sometimes prose.
# Parsing broke randomly. Team was pulling their hair out.
The fix took five minutes:
prompt = """
Extract product details in this exact JSON format:
{
"name": "product name",
"price": numeric value,
"category": "category name",
"features": ["feature1", "feature2"],
"inStock": boolean
}
Rules:
- Price must be a number (no currency symbols)
- Features must be an array
- inStock must be true or false
- Use null for missing information
Product description: {description}
JSON output:
"""
Parsing errors dropped from ~20% to under 1%. Not because the model got smarter — because we stopped being vague about what we wanted.
For production systems, pair this with Pydantic validation:
from pydantic import BaseModel, Field
import json

class ProductInfo(BaseModel):
    name: str
    price: float
    category: str
    features: list[str]
    in_stock: bool = Field(alias="inStock")  # the prompt asks for "inStock"

def extract_product(description: str) -> ProductInfo:
    # .replace instead of .format: the template's literal JSON braces would trip up .format
    response = llm.generate(prompt.replace("{description}", description))
    data = json.loads(response)
    return ProductInfo(**data)  # Validates or throws
If the model returns garbage, Pydantic catches it. You retry or escalate. No silent failures.
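When validation fails, I retry before escalating. Here's a minimal sketch of that loop, reusing ProductInfo and the same hypothetical llm client as above (the retry count and error handling choices are mine):
from pydantic import ValidationError

def extract_product_with_retry(description: str, max_attempts: int = 3) -> ProductInfo:
    last_error = None
    for _ in range(max_attempts):
        response = llm.generate(prompt.replace("{description}", description))
        try:
            return ProductInfo(**json.loads(response))
        except (json.JSONDecodeError, ValidationError) as e:
            last_error = e  # remember why this attempt failed
    # No silent failures: surface the problem instead of passing garbage downstream
    raise RuntimeError(f"Extraction failed after {max_attempts} attempts: {last_error}")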
2. Few-Shot Learning (Show, Don't Tell)
The second pattern I can't live without: give the model examples instead of instructions.
I used to write elaborate prompts explaining exactly how I wanted entity extraction done. Five paragraphs of rules, edge cases, formatting requirements. The model would follow some rules, ignore others, and invent its own formatting.
Then I switched to examples:
prompt = """
Extract entities in the format [TYPE: VALUE]
Example 1:
Input: Google launched Gemini in Mountain View.
Output: [COMPANY: Google], [PRODUCT: Gemini], [LOCATION: Mountain View]
Example 2:
Input: Tesla unveiled Cybertruck at the LA Auto Show.
Output: [COMPANY: Tesla], [PRODUCT: Cybertruck], [EVENT: LA Auto Show]
Example 3:
Input: Microsoft released GPT-4 integration in Azure.
Output: [COMPANY: Microsoft], [PRODUCT: GPT-4], [PLATFORM: Azure]
Now extract entities from:
Input: Apple announced the iPhone 15 in Cupertino.
Output:
Three examples do more than three paragraphs of explanation. The model pattern-matches from examples far better than it follows complex written rules.
How many examples? Three is my default. Sometimes two works. I've never needed more than five. If five examples don't solve it, the problem isn't the number of examples — it's that the task is genuinely ambiguous.
Which examples to pick? Cover your edge cases. If you have a common case, a tricky case, and a boundary case, those three examples will outperform ten "normal" examples.
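One habit that helps: keep the examples as data so you can swap them per domain, and assemble the prompt in code. A minimal sketch (the Example structure is my own convention, not a library API):
from dataclasses import dataclass

@dataclass
class Example:
    input_text: str
    output_text: str

def few_shot_prompt(task: str, examples: list[Example], query: str) -> str:
    # Render each stored example in the same Input/Output shape the model will complete
    blocks = "\n\n".join(
        f"Example {i + 1}:\nInput: {ex.input_text}\nOutput: {ex.output_text}"
        for i, ex in enumerate(examples)
    )
    return f"{task}\n\n{blocks}\n\nNow extract entities from:\nInput: {query}\nOutput:"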

3. Chain-of-Thought (Making the Model Show Its Work)
Chain-of-thought is the most famous prompt engineering technique, and it deserves the hype — for the right tasks.
The idea: instead of asking for the answer directly, ask the model to reason step by step. This dramatically improves accuracy on math, logic, and multi-step problems.
# Without CoT — model often gets this wrong
prompt = """
Calculate the total cost:
- 3 notebooks at $4.50 each
- 2 pens at $1.25 each
- 15% discount on total
"""
# Output: "$12.75" (wrong)
# With CoT — model almost always gets it right
prompt = """
Calculate the total cost step by step:
- 3 notebooks at $4.50 each
- 2 pens at $1.25 each
- 15% discount on total
Let's solve this step by step:
1. First, calculate the cost of notebooks
2. Then, calculate the cost of pens
3. Add them together for subtotal
4. Apply the 15% discount
5. Calculate final total
"""
# Output:
# "1. Notebooks: 3 × $4.50 = $13.50
# 2. Pens: 2 × $1.25 = $2.50
# 3. Subtotal: $13.50 + $2.50 = $16.00
# 4. Discount: $16.00 × 0.15 = $2.40
# 5. Final total: $16.00 - $2.40 = $13.60"
When I use it: Any time the task involves calculation, comparison, multi-step logic, or analysis. It's essentially free (a few extra tokens) and reliably improves accuracy on reasoning tasks, with reported gains of 40-60% on benchmarks.
When I don't: Simple extraction, classification, or translation. For "what category does this product belong to?" — chain-of-thought just adds unnecessary tokens. The model doesn't need to "think" about it.
A practical pattern for building CoT into your system:
def chain_of_thought_prompt(problem: str, steps: list[str]) -> str:
    numbered = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(steps))
    return f"""
{problem}

Let's solve this step by step:
{numbered}

Now, work through each step carefully.
"""
The Power Moves: Patterns I Reach For When Things Get Hard
These patterns aren't daily drivers, but when you need them, nothing else works.
4. Role-Based Prompting
Assigning a role sounds almost too simple to be a "technique." But the difference between "explain neural networks" and "you are a senior ML engineer explaining neural networks to a developer who's never done ML" is enormous.
prompt = """
You are a senior machine learning engineer with 10 years of experience
teaching complex concepts to software developers who are new to AI.
Explain neural networks to a developer who understands programming
but hasn't worked with ML before. Use code analogies and practical examples.
"""
The role doesn't just change the style — it changes what the model includes and excludes. A "teacher" role adds analogies. A "code reviewer" role focuses on bugs. A "consultant" role gives actionable recommendations instead of textbook explanations.
I combine this with almost every other pattern. It's the seasoning, not the main dish.
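Because a role is just a prefix, it composes with everything else here; a trivial helper (my own convention):
def with_role(role: str, prompt: str) -> str:
    # Prepend a role line so the pattern stacks with any other prompt
    return f"You are {role}.\n\n{prompt}"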
5. Negative Prompting (The DO NOT List)
Here's a counterintuitive discovery: sometimes telling the model what not to do is more effective than telling it what to do.
prompt = """
Summarize this article in 3-4 sentences.
DO:
- Focus on main facts and key points
- Use neutral, objective language
DO NOT:
- Add your own opinions or interpretations
- Include minor details or examples
- Use more than 4 sentences
- Start with "This article discusses..."
"""
That last "DO NOT" is my favorite. Without it, about 40% of summaries start with "This article discusses..." which is useless padding. One negative constraint, 40% better outputs.
I keep a running list of "DO NOTs" for each production prompt. Every time the model does something annoying, I add it to the list. After a few iterations, the prompt is dialed in.
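That running list works best as literal data in code, so each fix is one appended string. A sketch of how I'd structure it (the naming is mine):
# Every time the model does something annoying, append one line here
SUMMARY_DO_NOTS = [
    "Add your own opinions or interpretations",
    "Include minor details or examples",
    "Use more than 4 sentences",
    'Start with "This article discusses..."',
]

def summary_prompt(article: str) -> str:
    rules = "\n".join(f"- {rule}" for rule in SUMMARY_DO_NOTS)
    return f"Summarize this article in 3-4 sentences.\n\nDO NOT:\n{rules}\n\nArticle:\n{article}"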
6. Prompt Chaining (Breaking It Down)
When a task is too complex for one prompt — and you can feel the model struggling — split it into steps.
# Instead of one monolithic prompt...
prompt = "Analyze this code, identify bugs, suggest fixes, and rewrite it with improvements."
# ...which produces unfocused results
# Break it into a chain:
# Step 1: Analysis
analysis = llm.generate(f"Analyze this code and list potential issues:\n{code}")
# Step 2: Prioritization
priorities = llm.generate(f"Prioritize these issues by severity:\n{analysis}")
# Step 3: Fix suggestions
fixes = llm.generate(f"Suggest fixes for the Critical and High items:\n{priorities}")
# Step 4: Rewrite
final = llm.generate(f"Rewrite the code with these fixes applied:\n{code}\n\nFixes:\n{fixes}")
Each step gets the model's full attention on one sub-task. The output is better, and if something goes wrong, you can see exactly which step failed.
The cost trade-off: chaining uses 3-4x more tokens than a single prompt. For most production systems, the quality improvement is worth it. For high-volume, low-stakes tasks (sentiment classification, simple extraction), a single well-crafted prompt is fine.
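To actually get the "see which step failed" benefit, keep every intermediate output. Here's the chain above restructured as a small harness (the dict shape and naming are mine, and llm is the same hypothetical client):
def run_review_chain(code: str) -> dict[str, str]:
    results: dict[str, str] = {}
    steps = [
        ("analysis", lambda r: f"Analyze this code and list potential issues:\n{code}"),
        ("priorities", lambda r: f"Prioritize these issues by severity:\n{r['analysis']}"),
        ("fixes", lambda r: f"Suggest fixes for the Critical and High items:\n{r['priorities']}"),
        ("rewrite", lambda r: f"Rewrite the code with these fixes applied:\n{code}\n\nFixes:\n{r['fixes']}"),
    ]
    for name, build_prompt in steps:
        results[name] = llm.generate(build_prompt(results))  # every intermediate output is kept
    return results  # inspect results["analysis"], results["priorities"], ... when a run goes wrong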
7. Contextual Priming
When the model gives generic advice, it's usually because you gave it a generic question. Adding context transforms "should we use microservices?" from a textbook answer into a relevant recommendation:
prompt = """
Context:
- Team size: 8 developers
- Current system: Django monolith (50k lines)
- Traffic: 100k requests/day
- Pain points: Slow deployments, testing bottlenecks
- Budget: Limited DevOps resources
- Timeline: 6 months
Given this context, should we migrate to microservices?
Provide a recommendation specific to our situation.
"""
# Output: Tailored advice that accounts for team size, budget, timeline
Without context, you get "microservices have pros and cons." With context, you get "with 8 developers and limited DevOps, microservices will slow you down. Consider modularizing your monolith first."
Night and day.
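In production the context usually lives in a config or a database rather than in your head, so I render it into the prompt instead of hand-writing it each time. A minimal sketch (the dict convention is mine):
def primed_prompt(question: str, context: dict[str, str]) -> str:
    lines = "\n".join(f"- {key}: {value}" for key, value in context.items())
    return (
        f"Context:\n{lines}\n\n"
        f"{question}\n"
        "Provide a recommendation specific to our situation."
    )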

The Specialist Tools: Powerful But Situational
8. Self-Consistency (Voting on the Answer)
This is one of those techniques that sounds unnecessary until you're working on a task where accuracy really matters.
The idea: generate multiple independent answers to the same question, then pick the most common one. It's like asking five doctors instead of one.
from collections import Counter

class SelfConsistency:
    def __init__(self, llm_client, num_samples: int = 5):
        self.llm = llm_client
        self.num_samples = num_samples

    def solve(self, problem: str) -> dict:
        answers = []
        for _ in range(self.num_samples):
            solution = self.llm.generate(f"{problem}\n\nSolve step by step.")
            answers.append(self._extract_answer(solution))
        answer_counts = Counter(answers)
        best, count = answer_counts.most_common(1)[0]
        return {
            "answer": best,
            "confidence": count / self.num_samples,
        }

    def _extract_answer(self, solution: str) -> str:
        # Simple heuristic: treat the last non-empty line as the final answer
        return solution.strip().splitlines()[-1]
When I use it: Math problems, classification tasks where the cost of being wrong is high, medical/legal/financial analysis. The 5x cost increase is justified when one wrong answer matters.
When I don't: 95% of the time. For a chatbot or a content generation system, one good prompt with CoT is enough.
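The confidence score is what makes this production-friendly: you can gate on it and escalate disagreement instead of silently accepting a coin flip. A usage sketch (the 0.6 threshold is an arbitrary example, and the escalation hook is a hypothetical placeholder):
result = SelfConsistency(llm, num_samples=5).solve(
    "A loan of $12,000 at 6% simple annual interest for 18 months. Total interest owed?"
)
if result["confidence"] < 0.6:  # low agreement across samples; tune per task
    flag_for_human_review(result)  # hypothetical escalation hook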
9. ReAct (Reasoning + Acting)
ReAct is the foundation of AI agents. The model alternates between thinking ("I need to look up X") and acting (calling a tool to look up X).
prompt = """
Use the ReAct framework to solve this:
Question: What is the capital of the country where the Eiffel Tower is located?
Thought: I need to find which country the Eiffel Tower is in.
Action: Search["Eiffel Tower location"]
Observation: The Eiffel Tower is in Paris, France.
Thought: France's capital is Paris. I have the answer.
Action: Finish["Paris"]
"""
This is less a "prompt engineering pattern" and more an "application architecture." If you're building agents, you'll use ReAct. If you're not, you probably won't. See my agent building guide for the full picture.
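If you want to see the shape of it anyway, here's a toy control loop that parses the Action lines. The Thought/Action/Observation format and the Search/Finish actions follow the example above; everything else is a sketch, not a production agent runtime:
import re

INSTRUCTIONS = 'Answer using Thought / Action / Observation steps. Actions: Search["query"], Finish["answer"].\n\n'

def react_loop(question: str, tools: dict, max_turns: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_turns):
        step = llm.generate(INSTRUCTIONS + transcript)  # model emits Thought + Action
        transcript += step
        match = re.search(r'Action: (\w+)\["(.*?)"\]', step)
        if match is None:
            continue  # no parseable action; let the model try again
        tool, arg = match.groups()
        if tool == "Finish":
            return arg  # model decided it has the answer
        transcript += f"\nObservation: {tools[tool](arg)}\n"  # e.g. tools = {"Search": search_fn}
    return "No answer within the turn limit"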
10. Tree of Thoughts
Tree of Thoughts explores multiple reasoning paths and picks the best one. It's like brainstorming three approaches, evaluating each, then committing to the winner.
I'll be honest: I've used this in production exactly twice. Both times for complex planning tasks where the first solution path was often suboptimal. For 99% of use cases, regular chain-of-thought is sufficient.
If you're curious, the pattern is straightforward: generate three solutions, then ask the model to evaluate and pick the best one. But the 3x cost and latency means I only reach for it when the decision quality genuinely matters.
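If you want to try it anyway, here's the generate-then-evaluate version as a sketch (a simplification, not the original paper's tree search):
def tree_of_thoughts(problem: str, branches: int = 3) -> str:
    # Branch: sample several independent solution paths
    candidates = [
        llm.generate(f"{problem}\n\nPropose one distinct approach and work it through.")
        for _ in range(branches)
    ]
    numbered = "\n\n".join(f"Approach {i + 1}:\n{c}" for i, c in enumerate(candidates))
    # Evaluate: one more call to judge the branches and commit to a winner
    return llm.generate(
        f"{problem}\n\nHere are {branches} candidate solutions:\n\n{numbered}\n\n"
        "Evaluate each for correctness and feasibility, then output the best one in full."
    )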
11. Meta-Prompting
Using the model to generate or improve prompts. This sounds recursive and weird, but it's genuinely useful when you're stuck.
meta_prompt = """
I need a prompt that extracts named entities from medical records.
The output should be structured as [TYPE: VALUE].
Generate an effective prompt that includes:
1. An appropriate role/expertise
2. 2-3 examples
3. Clear formatting rules
4. Edge case handling
Generate the prompt:
"""
I use this as a starting point, never as the final product. The model generates a reasonable first draft, and then I iterate manually based on real outputs. It's particularly useful for domains where I'm not an expert — the model often suggests edge cases I wouldn't have thought of.
12. Iterative Refinement
Multiple passes to progressively improve output. Draft → refine for clarity → refine for engagement → final polish.
Useful for content generation where quality matters. Overkill for most production tasks. I mention it for completeness, but in practice, a single well-prompted call with good constraints usually gets you 90% of the way there.
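For the record, the loop is a few lines; a sketch with the passes as data:
def refine(draft_prompt: str, passes: list[str]) -> str:
    text = llm.generate(draft_prompt)
    for goal in passes:  # e.g. ["clarity", "engagement", "final polish"]
        text = llm.generate(
            f"Rewrite the following to improve {goal}. Keep the meaning and facts intact.\n\n{text}"
        )
    return text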
What I've Learned About Combining Patterns
The real skill isn't knowing individual patterns — it's knowing which ones to stack for a given problem.
My most common stacks:
For structured extraction: Few-shot + Constrained Generation + Negative Prompting
- Show examples of the format, specify the schema, list common mistakes to avoid (assembled in the sketch after this list)
For analysis tasks: Role + Context + Chain-of-Thought
- Set the expertise, provide background, ask for step-by-step reasoning
For high-stakes decisions: Context + CoT + Self-Consistency
- Full context, step-by-step reasoning, multiple samples with voting
For agent systems: Role + ReAct + Constrained Generation (for tool calls)
- See my agent guide for details
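Here's the extraction stack assembled, as referenced above. It's deliberately minimal, and the product example and field set are made up for illustration:
# Few-shot + constrained generation + negative prompting in one prompt
extraction_prompt = (
    'Extract product details in this exact JSON format: {"name": string, "price": number}\n\n'  # constrained generation
    "Example:\n"
    "Input: The AcmePad tablet retails for $299.\n"
    'Output: {"name": "AcmePad", "price": 299.0}\n\n'  # few-shot
    "DO NOT:\n"
    "- Include currency symbols in the price\n"
    "- Wrap the JSON in markdown fences\n\n"  # negative prompting
    "Input: {description}\n"
    "Output:"
)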
Don't stack more than 3-4 patterns at once. Each one adds tokens and complexity. If your prompt is longer than your expected output, you've probably over-engineered it.
The Honest Performance Table
Every tutorial gives you percentages like "40-60% improvement." Here's my honest take: these numbers are real but context-dependent. A 50% improvement on a toy benchmark might be a 10% improvement on your specific production data. Still worth it, but manage expectations.
| Pattern | My Real-World Impact | When I Use It | Cost |
|---|---|---|---|
| Constrained Generation | Huge — fixes most "unreliable output" problems | Every structured task | Free |
| Few-Shot | High — 3 examples beat 3 paragraphs of instructions | Domain-specific extraction | Minimal |
| Chain-of-Thought | High for reasoning, useless for simple tasks | Math, logic, analysis | Minimal |
| Role-Based | Medium — seasoning, not main dish | Almost always, as a prefix | Free |
| Negative Prompting | Medium — fixes specific annoying behaviors | When model keeps doing something wrong | Free |
| Prompt Chaining | High — but 3-4x cost | Complex multi-step tasks | 3-4x |
| Contextual Priming | High for recommendations, low for extraction | Advisory/recommendation tasks | Minimal |
| Self-Consistency | High for accuracy, but 5x cost | Only when accuracy is critical | 5x |
| ReAct | Essential for agents, irrelevant otherwise | Agent systems | Variable |
| Tree of Thoughts | Rarely justified outside planning | Complex planning tasks | 3x |
| Meta-Prompting | Useful as starting point, not final product | When stuck or entering new domain | 1 extra call |
| Iterative Refinement | Moderate — usually overkill | High-quality content generation | 3-4x |
The Advice I Wish I'd Gotten Earlier
Your prompt is a living document. Version control it. Track which version produces which results. When something breaks in production, you want to git diff your prompts.
The best prompt is the shortest one that works. Longer is not better. Every extra sentence is a potential source of confusion for the model. If you can get the same results with fewer words, do it.
Test on real data, not examples you made up. I've written prompts that worked perfectly on my test cases and failed on the first real input. Real data is messier, more ambiguous, and more diverse than anything you'll construct.
When the prompt isn't working, the problem might not be the prompt. Sometimes the model genuinely can't do the task. Sometimes your data is bad. Sometimes you need RAG or fine-tuning, not a better prompt. Prompt engineering has limits.
Start with Constrained Generation and Few-Shot. Add Chain-of-Thought for reasoning tasks. Layer in Role and Negative Prompting to fine-tune behavior. That's the playbook. Everything else is for specific situations.
Need help optimizing your AI system's prompts? I've tuned prompts for systems handling millions of queries. Sometimes the fix is a single line. Let's talk.
