Prompts Deserve the Same Discipline as Code

If a prompt directly affects:

  • the quality of answers
  • the quality of generated code
  • the safety of agent behavior

then it is no longer a “personal trick.” It is part of the working system.

Therefore, prompts should have:

  • versions
  • change history
  • owners
  • evaluation criteria

Why Gut-Feel Assessment Is Not Enough

Many teams tweak prompts by feel:

  • “this version seems better”
  • “the responses feel smoother”
  • “the agent seems smarter this time”

The problem is that feelings are not reproducible.

A “better prompt” should mean:

  • fewer errors
  • better format compliance
  • less scope creep
  • less manual correction needed

How to Version Prompts

The simplest approach:

  • store prompts in a repository
  • every change goes through a pull request
  • document the reason for each change
  • if possible, attach before/after examples

Example changelog:

v1.2
- Clarified fallback behavior when data is missing
- Required findings to include file references
- Added length constraint to reduce rambling

Just doing this puts a team ahead of most organizations that still prompt from memory.

What Is Prompt Evaluation?

Evaluation (eval) is a small test suite that checks whether a prompt achieves its objectives.

For a review agent, an eval might include 5 cases:

  1. A diff with a null pointer bug
  2. A diff with a performance regression
  3. A diff with only formatting changes
  4. A diff with missing context
  5. A diff with a security change

The expectation is not identical output word-for-word. The expectation is:

  • Did it detect the real issue?
  • Did it follow the output contract?
  • Did it fabricate information when context was missing?

A Critical Principle: Change One Thing at a Time

When editing a prompt, do not change 5 things at once.

Change one element at a time:

  • add an output contract
  • fix the fallback behavior
  • narrow the scope

Then re-run the eval.

This is the only way to know which change actually helped.

Practical Metrics to Start With

You do not need a complex measurement system on day one. Start with simple criteria:

  • Rate of output matching the required format
  • Rate of correctly identifying critical issues
  • Rate of users needing to re-prompt
  • Rate of agent going out of scope
  • Rate of agent stating uncertainty when appropriate

Key Takeaway

Standardizing prompts without versioning and evaluation only gets you halfway.

A strong prompt is not just a well-written prompt. It is a prompt that is measurable, reviewable, and improvable in a controlled way.

In the final foundations part, we assemble everything into a minimum viable Prompt Standard kit for immediate team deployment. Continue to Part 5 — A Minimum Viable Prompt Standard Kit.