Prompt Engineering for Developers: Patterns That Actually Work

Ravinder · 9 min read
AI · Prompt Engineering · LLM · GPT · Claude

Prompts Are Code

The first shift in mindset that makes prompt engineering click for developers is this: treat prompts exactly like code. They have inputs and outputs. They have bugs. They break when inputs change in unexpected ways. They need tests. They need version control.

Once you adopt that mindset, the question stops being "how do I talk to an AI?" and becomes "how do I write a reliable, testable function that calls an LLM?" The answer is the same as for any function: clear inputs, defined outputs, edge case handling, and regression tests.
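
To make that concrete, here is a minimal sketch of a prompt wrapped as an ordinary function. The model name and wrapper shape are illustrative, not a prescription:

from anthropic import Anthropic

client = Anthropic()

def summarise(text: str, max_words: int = 50) -> str:
    # A prompt treated as a function: typed input, one defined output.
    # Swap in whichever client and model you actually use.
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Summarise the following in at most {max_words} words:\n\n{text}",
        }],
    )
    return response.content[0].text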

This post covers the six patterns I use consistently in production. No philosophy, no theory — just the patterns, when to use them, and what they actually look like in code.


Pattern 1: Zero-Shot with Explicit Constraints

The simplest pattern. No examples. Just a clear instruction with explicit constraints on the output.

When to use: Well-defined tasks with a clear instruction set. Translation, summarisation, classification with known labels.

The mistake: Leaving output format implicit. LLMs will invent whatever format seems reasonable to them. That format will not match your parser.

# Bad — implicit output format
prompt = "Classify this support ticket as urgent or not urgent: {ticket}"
 
# Good — explicit format constraint
prompt = """Classify the following support ticket.
 
Respond with EXACTLY one word: either "URGENT" or "NOT_URGENT". 
Nothing else. No explanation. No punctuation.
 
Ticket: {ticket}"""

The constraint must be stated in terms the model cannot misinterpret. "EXACTLY one word" leaves no room for the model to add an explanation. Test with adversarial inputs — long tickets, ambiguous tickets, empty tickets — to find where the constraint breaks.
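
When the constraint does break, fail loudly or fall back deliberately rather than parsing whatever came back. A sketch, reusing the client from earlier and the "good" prompt template above; the retry count and the NOT_URGENT fallback are arbitrary choices, not recommendations:

def classify_with_fallback(ticket: str, retries: int = 2) -> str:
    # Re-ask on constraint violations, then fall back to a safe default
    for _ in range(retries + 1):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=8,
            messages=[{"role": "user", "content": prompt.format(ticket=ticket)}],
        )
        output = response.content[0].text.strip()
        if output in {"URGENT", "NOT_URGENT"}:
            return output
    return "NOT_URGENT"  # arbitrary safe default; pick per product risk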


Pattern 2: Few-Shot Prompting

You provide examples of correct input-output pairs before the actual input. The model learns the task from the examples rather than from abstract instruction.

When to use: Tasks where it is difficult to describe the desired output format in words but easy to show it. Custom extraction formats, domain-specific classification, structured parsing.

few_shot_prompt = """Extract product mentions and sentiment from customer reviews.
 
Review: "The MacBook Pro is incredible but the dongle situation is frustrating."
Output: {{"products": ["MacBook Pro"], "sentiments": {{"MacBook Pro": "mixed"}}}}
 
Review: "I love my AirPods but the AirPods case scratches too easily."
Output: {{"products": ["AirPods"], "sentiments": {{"AirPods": "mixed"}}}}
 
Review: "The new iPad is beautiful. The keyboard cover is also excellent."
Output: {{"products": ["iPad", "keyboard cover"], "sentiments": {{"iPad": "positive", "keyboard cover": "positive"}}}}
 
Review: "{review}"
Output:"""

The examples demonstrate the JSON structure, how to handle multiple products, and how to name products consistently. A zero-shot instruction to "extract products and sentiment" would produce wildly inconsistent output across different models and temperatures.
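
Wiring the template up is one format call and one parse. Note the double braces in the template: they escape literal JSON braces for str.format. A sketch, again reusing the client from earlier:

import json

def extract_mentions(review: str) -> dict:
    # {{ }} in the template renders as literal { } after .format()
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        messages=[{"role": "user", "content": few_shot_prompt.format(review=review)}],
    )
    return json.loads(response.content[0].text)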

Choosing few-shot examples

When selecting few-shot examples:

  • Cover diverse cases (edge cases matter)
  • Demonstrate the exact output format
  • Include at least one tricky case (empty, ambiguous, multi-entity)
  • Put the most similar example last; recency bias works in your favour (see the sketch below)
  • Use 3-5 examples; more examples means more tokens and higher cost
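
For the ordering point, a naive sketch: score each candidate example by word overlap with the incoming input and sort so the closest match comes last. Word overlap is a deliberately crude stand-in for embedding similarity, used only to keep the sketch self-contained:

def order_examples(examples: list[tuple[str, str]], query: str) -> list[tuple[str, str]]:
    # Most similar example goes last so recency bias works for you
    query_words = set(query.lower().split())
    def similarity(example: tuple[str, str]) -> float:
        words = set(example[0].lower().split())
        return len(query_words & words) / max(len(query_words | words), 1)
    return sorted(examples, key=similarity)  # ascending: closest match last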

Pattern 3: Chain of Thought (CoT)

You instruct the model to reason step by step before giving the final answer. This dramatically improves accuracy on tasks requiring multi-step reasoning: code analysis, math, logic, and complex classification.

When to use: Anything requiring more than one reasoning step. Code review. SQL generation from ambiguous natural language. Risk assessment.

cot_prompt = """Analyse this Python function for potential bugs. Think through each step carefully before giving your verdict.
 
```python
def calculate_discount(price, discount_pct, user_tier):
    if user_tier == "premium":
        discount_pct += 10
    final_price = price - (price * discount_pct / 100)
    return final_price
```

Reasoning steps:

  1. What are the inputs and their types?
  2. Are there boundary conditions that could cause issues?
  3. Is there any risk of unexpected type coercion or overflow?
  4. What happens with edge case inputs?
  5. Final verdict: bugs found / no bugs found

Work through each step explicitly."""

 
The "reasoning steps" scaffold prevents the model from jumping to a conclusion. Without it, the model might glance at the function and say "looks fine" — missing the integer truncation risk in the division, or the unchecked negative price case.
 
Zero-shot CoT
 
For situations where you cannot write custom steps, the classic trigger phrase works surprisingly well:
 
# Appending this phrase to any prompt triggers zero-shot chain-of-thought
prompt += "\n\nLet's think step by step."

This phrase comes from the zero-shot CoT paper (Kojima et al., 2022, "Large Language Models are Zero-Shot Reasoners") and still holds up. It is not magic: it signals to the model that extended reasoning is expected rather than a quick answer.
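
Once the model reasons out loud, you also need to separate the reasoning from the answer. A small sketch, assuming your prompt instructs the model to end with a line of the form "Verdict: ..." (that convention is an assumption, not an API):

import re

def extract_verdict(model_output: str) -> str | None:
    # Pull the final "Verdict: ..." line out of a chain-of-thought response
    match = re.search(r"^Verdict:\s*(.+)$", model_output, re.MULTILINE)
    return match.group(1).strip() if match else None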


Pattern 4: Structured Output with Schema Enforcement

Have the model return JSON that conforms to a defined schema. Make the format a hard constraint, not a soft suggestion, and use your LLM provider's native structured output feature when available.

When to use: Any time the LLM output feeds into application code. Always prefer structured output over parsing free text.

from anthropic import Anthropic
from jsonschema import validate
import json
 
client = Anthropic()
 
EXTRACTION_SCHEMA = {
    "type": "object",
    "required": ["entities", "sentiment", "topics", "confidence"],
    "properties": {
        "entities": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Named entities mentioned in the text"
        },
        "sentiment": {
            "type": "string",
            "enum": ["positive", "negative", "neutral", "mixed"]
        },
        "topics": {
            "type": "array",
            "items": {"type": "string"},
            "maxItems": 5
        },
        "confidence": {
            "type": "number",
            "minimum": 0,
            "maximum": 1
        }
    }
}
 
def analyse_text(text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Analyse this text and respond with JSON matching the schema exactly.
 
Schema: {json.dumps(EXTRACTION_SCHEMA, indent=2)}
 
Text: {text}"""
        }]
    )
    
    # Parse, then actually validate against the schema before returning
    result = json.loads(response.content[0].text)
    validate(instance=result, schema=EXTRACTION_SCHEMA)
    return result

With a native structured output mode (OpenAI's strict JSON schema mode, for example), the model is constrained at the token generation level to only produce tokens that conform to the schema, which eliminates parsing failures. With prompt-based schema instructions like the example above, keep the explicit validation step.
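
On the Anthropic API, the usual route is tool use: define the schema as a tool's input_schema and force the model to call it, so the answer comes back as parsed JSON rather than free text. A sketch; the tool name record_analysis is an arbitrary label, not part of the API:

def analyse_text_with_tool(text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        tools=[{
            "name": "record_analysis",  # arbitrary label
            "description": "Record the structured analysis of a piece of text.",
            "input_schema": EXTRACTION_SCHEMA,
        }],
        tool_choice={"type": "tool", "name": "record_analysis"},
        messages=[{"role": "user", "content": f"Analyse this text: {text}"}],
    )
    # The tool-use block's input arrives as a parsed dict, not a string
    return response.content[0].input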


Pattern 5: Role Prompting with Domain Constraints

Assign the model a specific expert role and define the scope of that role. This narrows the response distribution toward domain-appropriate answers.

When to use: Code review, security analysis, documentation generation, anything where domain expertise shapes the expected output.

security_review_prompt = """You are a senior application security engineer specialising in web application vulnerabilities (OWASP Top 10) and secure coding practices for Java Spring Boot applications.
 
Your task: review the following code change for security vulnerabilities.
 
Constraints:
- Flag only confirmed vulnerabilities, not theoretical risks
- Categorise each finding by OWASP category
- Rate severity: CRITICAL / HIGH / MEDIUM / LOW / INFO
- Provide the specific line number(s) of the vulnerability
- Suggest the exact fix, not a general recommendation
 
Do not comment on code style, performance, or non-security concerns.
 
Code diff:
{diff}"""

The role establishes domain expertise. The constraints narrow the output to what you actually need. Without the constraints, a security-focused prompt will produce a mix of security findings, style suggestions, and performance observations.
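
If you want to act on the findings automatically, have the prompt emit them as structured JSON (per Pattern 4) and gate your CI job on severity. A hypothetical sketch; the findings shape is an assumption layered on top of the prompt above, not something it currently returns:

BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}

def gate_on_findings(findings: list[dict]) -> None:
    # Assumes each finding dict carries a "severity" field as in the prompt
    blocking = [f for f in findings if f.get("severity") in BLOCKING_SEVERITIES]
    if blocking:
        for finding in blocking:
            print(f"[{finding['severity']}] {finding.get('fix', 'no fix suggested')}")
        raise SystemExit(1)  # fail the build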

Role prompt pitfalls

  • Sycophancy: the model agrees with the human's assumptions rather than finding real issues. Fix: ask the model to challenge your assumptions explicitly.
  • Overconfidence: the model invents domain facts to fill gaps in its knowledge. Fix: include "if you are uncertain, say so".
  • Role bleed: the model applies role constraints to things outside its scope.

Always include a hedging instruction: "If you are not confident, explicitly state your uncertainty rather than guessing." This dramatically reduces fabrication of domain-specific details.


Pattern 6: Self-Critique and Iterative Refinement

Ask the model to review its own output and improve it. This is a structured way to catch the model's own errors.

When to use: High-stakes outputs where you want quality over speed. SQL queries, legal clause generation, API documentation.

def generate_with_self_critique(task: str, content: str) -> str:
    # Step 1: Generate initial response
    initial = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": f"{task}\n\n{content}"}]
    )
    
    initial_response = initial.content[0].text
    
    # Step 2: Self-critique
    critique_prompt = f"""You wrote the following response to a task:
 
<task>{task}</task>
 
<your_response>
{initial_response}
</your_response>
 
Review your response critically:
1. Is anything factually incorrect?
2. Are there any edge cases you did not handle?
3. Is anything ambiguous or potentially misleading?
4. What would you change to make this more accurate and complete?
 
After your critique, provide a revised, improved response."""
 
    revised = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=3000,
        messages=[{"role": "user", "content": critique_prompt}]
    )
    
    return revised.content[0].text

Self-critique adds latency and token cost. Use it selectively — for complex SQL generation, not for trivial lookups. The quality improvement on complex tasks is typically worth a 2× latency budget.
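
One way to make that selectivity concrete is a cheap routing heuristic in front of the expensive path. A sketch; the length threshold and keyword check are placeholder heuristics, not recommendations:

def generate(task: str, content: str) -> str:
    # Placeholder heuristic: only complex-looking inputs get self-critique
    looks_complex = len(content) > 500 or "JOIN" in content.upper()
    if looks_complex:
        return generate_with_self_critique(task, content)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[{"role": "user", "content": f"{task}\n\n{content}"}],
    )
    return response.content[0].text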


Prompts Need Tests

This is the part most teams skip and then regret. A prompt is untested code. It will behave differently on inputs you did not anticipate.

import pytest
 
# Your prompt under test
from prompts import classify_sentiment
 
@pytest.mark.parametrize("text,expected", [
    ("I love this product!", "POSITIVE"),
    ("This is terrible.", "NEGATIVE"),
    ("It's okay I guess.", "NEUTRAL"),
    ("Great features but terrible support.", "MIXED"),
    ("", "NEUTRAL"),                    # Edge: empty input
    ("a" * 10000, "NEUTRAL"),           # Edge: very long input
    ("😊🎉", "POSITIVE"),              # Edge: emoji only
    ("Good. Bad. Good. Bad.", "MIXED"), # Edge: contradictory signals
])
def test_sentiment_classification(text, expected):
    result = classify_sentiment(text)
    assert result == expected, f"For input '{text[:50]}...', expected {expected}, got {result}"

Run these tests:

  • On every prompt change
  • Before deploying a new model version
  • When your LLM provider announces a model update

The eval loop is what separates a robust prompt-based feature from one that works in demo and breaks in production.
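
One caveat: LLM outputs are not perfectly deterministic even at low temperature, so strict per-case asserts can flake. A common compromise (my framing, not a specific framework's feature) is to run each case several times and gate on an aggregate pass rate, reusing the classify_sentiment import from the test file:

def eval_pass_rate(cases: list[tuple[str, str]], runs: int = 3, threshold: float = 0.95) -> bool:
    # Require an aggregate pass rate instead of failing on one flaky completion
    total = passed = 0
    for text, expected in cases:
        for _ in range(runs):
            total += 1
            if classify_sentiment(text) == expected:
                passed += 1
    rate = passed / total
    print(f"pass rate: {rate:.1%} ({passed}/{total})")
    return rate >= threshold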


Version Control Your Prompts

Store prompts in your repository, not in config files or database tables. Treat prompt changes like code changes: PR review, changelog entry, rollback capability.

prompts/
  v1/
    classify_sentiment.txt       # First version
    extract_entities.txt
  v2/
    classify_sentiment.txt       # Improved version after eval
    extract_entities.txt
  README.md                      # Why each version exists

When you change a prompt, document what problem the previous version had and what you changed to fix it. You will thank yourself six months later when you need to diagnose a regression.
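
A minimal loader for that layout keeps the version explicit at the call site (the directory names mirror the tree above; the helper itself is a sketch):

from pathlib import Path

def load_prompt(name: str, version: str = "v2") -> str:
    # Reads e.g. prompts/v2/classify_sentiment.txt from the repository
    return (Path("prompts") / version / f"{name}.txt").read_text()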


The Production Prompt Checklist

Before any prompt reaches production:

Prompt Production Readiness
═══════════════════════════════════════════
  ☐ Output format explicitly constrained
  ☐ Edge cases tested (empty, very long, adversarial)
  ☐ Few-shot examples cover diversity of inputs
  ☐ Model uncertainty handled ("if unsure, say so")
  ☐ Structured output schema defined
  ☐ Token count within budget
  ☐ Prompt versioned in source control
  ☐ Regression test suite written
  ☐ Fallback behaviour defined for API errors
  ☐ Latency measured under realistic load
═══════════════════════════════════════════

Prompt engineering is not a soft skill. It is a software engineering discipline with the same quality bar as any other production component. Treat it accordingly.