Building a Prompt Registry: Versioning, A/B Testing, and Rollback for LLM Prompts
Three months into production, your customer-support LLM starts giving subtly wrong refund policy answers. You trace it back to a prompt edit made two weeks ago — but nobody remembers who changed it, what it used to say, or why the change was made. There is no diff. There is no rollback. There is just a Slack message from a panicked engineer asking "does anyone have the old prompt saved somewhere?"
This is what happens when prompts are treated as config strings instead of production code. The fix is a prompt registry.
What a Prompt Registry Actually Is
A prompt registry is a versioned, queryable store for prompt templates that provides the same guarantees you already expect from your application code: history, authorship, rollback, staged rollout, and observability.
It is not a fancy prompt editor. It is not a playground. It is infrastructure.
The registry has four core concerns:
- Versioning: every change creates a new immutable version
- Ownership: every prompt has an accountable team or individual
- Experiment surface: A/B between prompt versions with traffic splitting
- Rollback path: revert to any prior version in seconds, not hours
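Before reaching for SQL, the contract is worth seeing in miniature. Here is a toy in-memory sketch (all names are illustrative, not from any library) that exercises the versioning and rollback concerns: history is append-only, and the "live" version is just a movable pointer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    semver: str
    template: str
    change_summary: str


class InMemoryRegistry:
    """Toy registry: append-only history plus a movable live pointer."""

    def __init__(self):
        self._versions: dict[str, list[PromptVersion]] = {}  # full history per slug
        self._live: dict[str, str] = {}                      # slug -> live semver

    def publish(self, slug: str, template: str, semver: str,
                change_summary: str) -> PromptVersion:
        v = PromptVersion(semver, template, change_summary)
        self._versions.setdefault(slug, []).append(v)  # old versions are never removed
        self._live[slug] = semver
        return v

    def get(self, slug: str) -> PromptVersion:
        live = self._live[slug]
        return next(v for v in self._versions[slug] if v.semver == live)

    def rollback(self, slug: str, target_semver: str) -> None:
        # Rollback is a pointer move; nothing is deleted
        if not any(v.semver == target_semver for v in self._versions[slug]):
            raise ValueError(f"Unknown version {target_semver} for {slug}")
        self._live[slug] = target_semver
```

The real implementation below swaps the dicts for Postgres tables, but the shape of the contract stays the same.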
Schema Design
Start with a schema that captures everything you will eventually need to query. Migrating the schema later is painful once you have thousands of versions in production.
```sql
-- prompts table: one row per logical prompt
CREATE TABLE prompts (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    slug TEXT UNIQUE NOT NULL,              -- e.g. "support/refund-policy"
    description TEXT,
    owner_team TEXT NOT NULL,
    owner_email TEXT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now(),
    archived_at TIMESTAMPTZ
);

-- prompt_versions: immutable, append-only
CREATE TABLE prompt_versions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    prompt_id UUID REFERENCES prompts(id),
    semver TEXT NOT NULL,                   -- "1.3.2"
    template TEXT NOT NULL,                 -- the actual prompt text
    variables JSONB NOT NULL DEFAULT '[]',  -- ["customer_name", "order_id"]
    model_hint TEXT,                        -- "gpt-4o", "claude-3-5-sonnet"
    max_tokens INT,
    temperature NUMERIC(3,2),
    created_by TEXT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now(),
    change_summary TEXT NOT NULL,           -- required on purpose; see below
    UNIQUE (prompt_id, semver)
);

-- active_deployments: which version is live per environment
CREATE TABLE active_deployments (
    prompt_id UUID REFERENCES prompts(id),
    environment TEXT NOT NULL,              -- "prod", "staging", "canary"
    version_id UUID REFERENCES prompt_versions(id),
    deployed_at TIMESTAMPTZ DEFAULT now(),
    deployed_by TEXT NOT NULL,
    PRIMARY KEY (prompt_id, environment)
);
```

The `change_summary` field is `NOT NULL` deliberately. Engineers hate filling it in, but six months later they love having it.
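With these tables in place, resolving the live template for a slug is a single join across all three. A sketch assuming a DB-API style cursor; the helper name and query constant are mine, not a library API:

```python
LIVE_PROMPT_SQL = """
SELECT pv.template, pv.semver, pv.id AS version_id
FROM prompts p
JOIN active_deployments ad ON ad.prompt_id = p.id AND ad.environment = %s
JOIN prompt_versions pv ON pv.id = ad.version_id
WHERE p.slug = %s
"""


def fetch_live_prompt(cursor, slug: str, environment: str = "prod"):
    """Return (template, semver, version_id) for the deployed version of a slug."""
    cursor.execute(LIVE_PROMPT_SQL, (environment, slug))
    row = cursor.fetchone()
    if row is None:
        raise LookupError(f"No {environment} deployment for {slug}")
    return row
```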
Semver for Prompts
Semantic versioning maps cleanly to prompt changes once you define what each level means:
| Bump | Meaning | Example |
|---|---|---|
| MAJOR | Output format changes (breaks downstream parsers) | JSON → Markdown, added/removed required fields |
| MINOR | Behavior change (same format, different output) | Tone shift, new instructions, persona change |
| PATCH | Wording fix, typo, whitespace cleanup | No behavioral change expected |
This gives reviewers a signal about risk before they approve a PR. A PATCH bump needs a spot-check. A MAJOR bump needs a full eval run.
Enforce this at the API layer:
```python
import re
from enum import Enum


class BumpType(Enum):
    MAJOR = "major"
    MINOR = "minor"
    PATCH = "patch"


def next_version(current: str, bump: BumpType) -> str:
    """e.g. next_version("1.3.2", BumpType.MINOR) returns "1.4.0"."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", current)
    if not match:
        raise ValueError(f"Invalid semver: {current}")
    major, minor, patch = map(int, match.groups())
    if bump == BumpType.MAJOR:
        return f"{major + 1}.0.0"
    elif bump == BumpType.MINOR:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


def create_version(prompt_slug: str, template: str, bump: BumpType,
                   change_summary: str, created_by: str, db) -> dict:
    prompt = db.fetch_one("SELECT * FROM prompts WHERE slug = %s", [prompt_slug])
    latest = db.fetch_one(
        "SELECT semver FROM prompt_versions WHERE prompt_id = %s "
        "ORDER BY created_at DESC LIMIT 1",
        [prompt["id"]],
    )
    current_ver = latest["semver"] if latest else "0.0.0"
    new_ver = next_version(current_ver, bump)
    return db.insert("prompt_versions", {
        "prompt_id": prompt["id"],
        "semver": new_ver,
        "template": template,
        "change_summary": change_summary,
        "created_by": created_by,
    })
```

The A/B Harness
Traffic splitting happens at the experiment layer, not in the registry schema. Keep them separate — the registry stores facts, experiments are ephemeral.
```python
import hashlib
from dataclasses import dataclass


@dataclass
class Experiment:
    id: str
    prompt_id: str
    variant_a_version_id: str
    variant_b_version_id: str
    split_pct: int  # % of traffic going to variant B
    metric: str     # "thumbs_up_rate", "task_completion", "cost_per_session"
    active: bool


def resolve_prompt_version(prompt_slug: str, user_id: str,
                           db, experiment_store) -> tuple[dict, str]:
    prompt = db.fetch_one("SELECT id FROM prompts WHERE slug = %s", [prompt_slug])
    experiment = experiment_store.get_active(prompt["id"])
    if experiment:
        # Deterministic bucketing — same user always sees same variant
        bucket = int(hashlib.sha256(
            f"{experiment.id}:{user_id}".encode()
        ).hexdigest(), 16) % 100
        version_id = (
            experiment.variant_b_version_id
            if bucket < experiment.split_pct
            else experiment.variant_a_version_id
        )
    else:
        deployment = db.fetch_one(
            "SELECT version_id FROM active_deployments "
            "WHERE prompt_id = %s AND environment = 'prod'",
            [prompt["id"]],
        )
        version_id = deployment["version_id"]
    version = db.fetch_one(
        "SELECT template, variables FROM prompt_versions WHERE id = %s",
        [version_id],
    )
    return version, version_id
```

Log `version_id` on every LLM call. Without it, your eval results are useless.
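One way to make that discipline automatic is to route every model call through a wrapper that attaches the version id to a structured log line. A sketch; `call_model` here is a stand-in for whatever client function you actually use:

```python
import json
import logging
import time

logger = logging.getLogger("llm_calls")


def call_with_version(call_model, template: str, version_id: str, **variables):
    """Render the template, call the model, and log the prompt version used."""
    rendered = template.format(**variables)
    start = time.monotonic()
    response = call_model(rendered)
    # Structured log line: joinable against eval results by prompt_version_id
    logger.info(json.dumps({
        "prompt_version_id": version_id,
        "latency_ms": round((time.monotonic() - start) * 1000),
    }))
    return response
```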
Rollback Procedure
Rollback is an UPDATE to active_deployments. The old version is still in the database — you are just pointing prod back at it.
```python
def rollback(prompt_slug: str, target_semver: str,
             initiated_by: str, db, audit_log) -> None:
    prompt = db.fetch_one("SELECT id FROM prompts WHERE slug = %s", [prompt_slug])
    target = db.fetch_one(
        "SELECT id FROM prompt_versions WHERE prompt_id = %s AND semver = %s",
        [prompt["id"], target_semver],
    )
    if not target:
        raise ValueError(f"Version {target_semver} not found for {prompt_slug}")
    current = db.fetch_one(
        "SELECT version_id FROM active_deployments "
        "WHERE prompt_id = %s AND environment = 'prod'",
        [prompt["id"]],
    )
    db.execute(
        """UPDATE active_deployments
           SET version_id = %s, deployed_at = now(), deployed_by = %s
           WHERE prompt_id = %s AND environment = 'prod'""",
        [target["id"], initiated_by, prompt["id"]],
    )
    audit_log.record({
        "action": "rollback",
        "prompt_slug": prompt_slug,
        "from_version_id": current["version_id"],
        "to_semver": target_semver,
        "initiated_by": initiated_by,
    })
```

Write the audit log to a separate, append-only table or stream (Kafka, Kinesis). You will need it during incident reviews.
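The append-only property is the part worth enforcing. As a minimal stand-in for a Kafka topic or an insert-only table, here is a toy file-backed version (the class name and path are mine) whose only operation is `record`:

```python
import datetime
import json


class AppendOnlyAuditLog:
    """Toy stand-in for a Kafka topic or insert-only table:
    records are only ever appended, never updated or deleted."""

    def __init__(self, path: str):
        self._path = path

    def record(self, event: dict) -> None:
        stamped = {**event,
                   "at": datetime.datetime.now(datetime.timezone.utc).isoformat()}
        with open(self._path, "a") as f:  # append mode is the whole point
            f.write(json.dumps(stamped) + "\n")
```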
Owner Accountability
Ownership without teeth is theater. Wire the registry into your incident runbook:
- PagerDuty / OpsGenie routing: when an LLM-related alert fires, the on-call rule queries the registry for `owner_team` and routes accordingly.
- Change review gate: PRs that modify a prompt template in the registry automatically request a review from `owner_email`.
- SLO tracking per prompt: error rates and latency SLOs live on the prompt slug, not the endpoint. When `support/refund-policy` degrades, the owner gets the alert.
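The first bullet is a lookup, not a platform feature. A hedged sketch of the routing hook; `pager.page` is an assumed stand-in for your PagerDuty or OpsGenie SDK call, and the fallback team name is illustrative:

```python
def route_llm_alert(alert: dict, db, pager) -> str:
    """Look up the owning team for the prompt named in the alert and page them."""
    prompt = db.fetch_one(
        "SELECT owner_team FROM prompts WHERE slug = %s", [alert["prompt_slug"]]
    )
    # Fall back to a platform team if the slug is unknown or unowned
    team = prompt["owner_team"] if prompt else "llm-platform"
    pager.page(team=team, summary=alert["summary"])
    return team
```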
```python
# GitHub Actions step — auto-request review on prompt changes
def get_owners_for_changed_prompts(changed_files: list[str], db) -> list[str]:
    owners = set()
    for file_path in changed_files:
        # Assume prompts are stored as files mirroring slug structure,
        # e.g. prompts/support/refund-policy.yaml -> slug "support/refund-policy"
        slug = file_path.removeprefix("prompts/").removesuffix(".yaml")
        prompt = db.fetch_one(
            "SELECT owner_email FROM prompts WHERE slug = %s", [slug]
        )
        if prompt:
            owners.add(prompt["owner_email"])
    return list(owners)
```

Key Takeaways
- Prompts are production artifacts — they deserve the same versioning, ownership, and rollback guarantees as application code.
- Semver bump types (major/minor/patch) give reviewers a risk signal before approving prompt changes.
- Deterministic user bucketing ensures A/B experiments are reproducible and users see consistent experiences within a session.
- Log `prompt_version_id` on every LLM call — without it, you cannot attribute eval outcomes to specific prompt versions.
- Rollback is an O(1) database update, not a deployment — design for it from day one.
- Owner accountability requires integration with your alerting and review tooling, not just a column in a database table.