Evaluating Prompt Quality
The five dimensions of prompt quality
Not all prompts are equal — but 'this prompt is better' is too vague to be useful. Evaluate prompts on five specific dimensions:
1. Accuracy — Does the output contain correct information? Are factual claims verifiable?
2. Relevance — Does the output actually answer what was asked? Did the AI drift off-topic?
3. Format — Is the output structured as requested? Does the format match the intended use?
4. Completeness — Does the output cover everything needed, or are important elements missing?
5. Consistency — Run the same prompt 3 times. Does it produce reliably similar quality, or is the output unpredictable?
When a prompt underperforms, identify which dimension failed — that tells you exactly how to fix it.
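As a rough illustration of that diagnosis step, the sketch below takes scores from three runs of the same prompt and reports the weakest dimension plus a simple spread figure for consistency. The score values, the function names, and the choice of standard deviation as the consistency measure are all assumptions made for the example, not part of the lesson's method.

```python
from statistics import mean, pstdev

# The four dimensions that get a numeric 1-10 score; consistency is
# derived from how much the runs differ from each other.
DIMENSIONS = ["accuracy", "relevance", "format", "completeness"]


def weakest_dimension(runs):
    """Return (dimension, average) for the lowest-scoring dimension across runs."""
    averages = {d: mean(run[d] for run in runs) for d in DIMENSIONS}
    worst = min(averages, key=averages.get)
    return worst, averages[worst]


def consistency_spread(runs):
    """Standard deviation of each run's overall score; lower means more consistent."""
    totals = [mean(run[d] for d in DIMENSIONS) for run in runs]
    return pstdev(totals)


# Hypothetical scores from running the same prompt three times.
runs = [
    {"accuracy": 9, "relevance": 8, "format": 5, "completeness": 8},
    {"accuracy": 9, "relevance": 9, "format": 6, "completeness": 7},
    {"accuracy": 8, "relevance": 8, "format": 4, "completeness": 8},
]

dimension, average = weakest_dimension(runs)
print(f"Fix first: {dimension} (average {average})")
print(f"Spread across runs: {consistency_spread(runs):.2f}")
```

Here the format score is the clear outlier, which points at the format specification rather than the role or the task instruction.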
The AI judge pattern
A fast way to evaluate prompt output quality is to use AI as the judge:
```
You are a prompt output quality evaluator. I will give you a prompt and its output. Rate the output on:
- Accuracy (1-10): Is the information correct and verifiable?
- Relevance (1-10): Does it answer what was asked?
- Format (1-10): Is the structure appropriate for the use case?
- Completeness (1-10): Is anything important missing?
- Consistency: Based on the output, would you expect this prompt to produce similar quality if run 5 more times?
For any score below 8, explain specifically what's wrong and how to fix the prompt to address it.
Prompt: [paste your prompt]
Output: [paste the output to evaluate]
```
This creates a tight feedback loop: prompt → output → AI evaluation → improved prompt. Because each evaluation is automated, you can iterate much faster than with manual review.
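If you want to automate the loop, a minimal sketch might look like the following. It only builds the judge prompt and parses the reply; sending the prompt to a model is left out because that depends on the tool you use. The function names are hypothetical, and the parser assumes the judge answers with lines such as `Accuracy: 9/10`, which a real judge may not do unless you ask for that format.

```python
import re

JUDGE_TEMPLATE = """You are a prompt output quality evaluator. I will give you a prompt and its output. Rate the output on:
- Accuracy (1-10): Is the information correct and verifiable?
- Relevance (1-10): Does it answer what was asked?
- Format (1-10): Is the structure appropriate for the use case?
- Completeness (1-10): Is anything important missing?
- Consistency: Based on the output, would you expect this prompt to produce similar quality if run 5 more times?
For any score below 8, explain specifically what's wrong and how to fix the prompt to address it.
Prompt: {prompt}
Output: {output}"""

# Assumed reply format: one "<Dimension>: <score>/10" per line,
# optionally preceded by a list dash.
SCORE_LINE = re.compile(
    r"^\W*(accuracy|relevance|format|completeness)\W+(\d{1,2})\b",
    re.IGNORECASE | re.MULTILINE,
)


def build_judge_prompt(prompt, output):
    """Fill the judge template with the prompt under test and its output."""
    return JUDGE_TEMPLATE.format(prompt=prompt, output=output)


def parse_scores(reply):
    """Extract numeric scores from the judge's reply into a dict."""
    return {name.lower(): int(value) for name, value in SCORE_LINE.findall(reply)}


def dimensions_to_fix(scores, threshold=8):
    """List the dimensions that scored below the threshold."""
    return [name for name, score in scores.items() if score < threshold]


# Example with a hand-written judge reply standing in for a model call.
reply = "Accuracy: 9/10\nRelevance: 8/10\n- Format: 6/10 (no headings)\nCompleteness: 10/10"
scores = parse_scores(reply)
print(scores)
print(dimensions_to_fix(scores))
```

The threshold of 8 mirrors the template's "any score below 8" instruction, so the code flags the same dimensions the judge was asked to explain.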
A/B testing your prompts
For prompts you'll use repeatedly (in a product, automation, or daily workflow), A/B testing gives you data instead of opinions:
How to run a prompt A/B test:
1. Write two versions of the prompt (A: your current version, B: a variation)
2. Run both on the same 10 test inputs
3. Use the AI judge pattern to score each output independently
4. Compare average scores across all 5 dimensions
5. Choose the winner, but keep the loser — sometimes the 'worse' prompt works better for specific edge cases
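The comparison in steps 4 and 5 can be sketched as below, assuming you already have one score dictionary per test input for each prompt version. The function names and sample numbers are invented for illustration, and the sketch averages only the four numerically scored dimensions; consistency would need its own measure, such as the spread between repeated runs.

```python
from statistics import mean

DIMENSIONS = ("accuracy", "relevance", "format", "completeness")


def average_by_dimension(scores):
    """Average each dimension over all test inputs."""
    return {d: mean(s[d] for s in scores) for d in DIMENSIONS}


def compare(scores_a, scores_b):
    """Pick the version with the higher overall average (ties go to A).

    Also returns the indexes of test inputs where the overall loser
    scored higher, since those are the edge cases worth keeping it for.
    """
    avg_a = average_by_dimension(scores_a)
    avg_b = average_by_dimension(scores_b)
    winner = "A" if mean(avg_a.values()) >= mean(avg_b.values()) else "B"

    loser_wins = []
    for index, (a, b) in enumerate(zip(scores_a, scores_b)):
        total_a = sum(a[d] for d in DIMENSIONS)
        total_b = sum(b[d] for d in DIMENSIONS)
        if (winner == "A" and total_b > total_a) or (winner == "B" and total_a > total_b):
            loser_wins.append(index)
    return winner, avg_a, avg_b, loser_wins


# Hypothetical scores for three test inputs (a real test would use ten).
scores_a = [
    {"accuracy": 8, "relevance": 8, "format": 8, "completeness": 8},
    {"accuracy": 7, "relevance": 7, "format": 7, "completeness": 7},
    {"accuracy": 9, "relevance": 9, "format": 9, "completeness": 9},
]
scores_b = [
    {"accuracy": 9, "relevance": 9, "format": 9, "completeness": 9},
    {"accuracy": 9, "relevance": 9, "format": 9, "completeness": 9},
    {"accuracy": 7, "relevance": 7, "format": 7, "completeness": 7},
]

winner, avg_a, avg_b, loser_wins = compare(scores_a, scores_b)
print(f"Winner: {winner}; the other version did better on inputs {loser_wins}")
```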
What to vary in each test:
- The role definition (specific vs. general)
- The instruction phrasing (imperative vs. descriptive)
- The format specification (explicit schema vs. described format)
- The number of examples (0-shot vs. 1-shot vs. 3-shot)
- The output length constraint
Even a small improvement in a prompt used 1,000 times daily creates massive cumulative value.
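One way to organize those variations is to keep a baseline prompt split into parts and generate variants that each change a single part, so a score difference can be attributed to that change. Changing one element per test is an assumption of this sketch, and the prompt text, field names, and function names are made up for the example.

```python
# Baseline prompt broken into the elements listed above (hypothetical content).
BASELINE = {
    "role": "You are a support agent.",
    "instruction": "Summarize the ticket.",
    "format": "Return three bullet points.",
    "examples": "",
    "length": "Use at most 60 words.",
}

# One alternative per element to test against the baseline.
ALTERNATIVES = {
    "role": "You are a senior support agent for a billing product.",
    "instruction": "The summary should capture the customer's problem and the requested action.",
    "format": 'Return JSON: {"problem": str, "action": str, "urgency": str}.',
    "examples": "Example ticket: 'Charged twice.' Example summary: 'Duplicate charge; refund requested.'",
    "length": "Use at most 30 words.",
}


def single_change_variants(baseline, alternatives):
    """Build one variant per element, each differing from the baseline in that element only."""
    variants = {}
    for field, value in alternatives.items():
        variant = dict(baseline)
        variant[field] = value
        variants[field] = variant
    return variants


def render(parts):
    """Join the non-empty parts into a single prompt string."""
    return "\n".join(text for text in parts.values() if text)


variants = single_change_variants(BASELINE, ALTERNATIVES)
print(render(BASELINE))
print("---")
print(render(variants["format"]))
```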
Prompt quality red flags
Learn to recognize the signals that a prompt needs work:
Red flag 1: The AI apologizes or hedges excessively. 'I'm just an AI and cannot...' or 'As an AI language model...' usually means the role definition is too vague or conflicts with what you're asking.
Red flag 2: The output is longer than needed. The AI is padding because the expected length wasn't specified. Add explicit length constraints.
Red flag 3: The AI answers a different question than asked. The task instruction is ambiguous. Rewrite it as a single, unambiguous sentence.
Red flag 4: The output quality varies wildly between runs. The prompt is relying on the model's discretion too heavily. Add more constraints, a format spec, and examples.
Red flag 5: The AI says 'it depends' without helping you figure out on what. Add: 'If the answer depends on factors, list the top 3 factors and give a recommendation for each scenario.'
Red flag 6: The first sentence restates your question. The AI is warming up. Fix: add 'Skip any introduction. Start directly with [the first element].'
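Some of these red flags can be caught with crude text checks before a human or an AI judge looks at the output. The sketch below covers flags 1, 2, 5, and 6 only; flags 3 and 4 need a judgment of meaning or several runs to compare. The phrase list, the 70% word-overlap threshold, and the function name are assumptions chosen for illustration, and the checks will produce false positives.

```python
import re

# Phrases treated as excessive hedging (an assumed, incomplete list).
HEDGE_PHRASES = ("as an ai language model", "i'm just an ai", "i am just an ai")


def red_flags(question, output, max_words=None):
    """Return a list of red-flag labels found in a single output."""
    flags = []
    lowered = output.lower()

    # Red flag 1: apologetic or hedging boilerplate.
    if any(phrase in lowered for phrase in HEDGE_PHRASES):
        flags.append("hedging")

    # Red flag 2: output exceeds the word budget, if one was given.
    if max_words is not None and len(output.split()) > max_words:
        flags.append("too long")

    # Red flag 5: "it depends" appears (does not check whether factors follow).
    if "it depends" in lowered:
        flags.append("it depends")

    # Red flag 6: the first sentence reuses most of the question's words.
    first_sentence = re.split(r"(?<=[.!?])\s+", output.strip(), maxsplit=1)[0]
    question_words = set(re.findall(r"\w+", question.lower()))
    first_words = set(re.findall(r"\w+", first_sentence.lower()))
    if question_words and len(question_words & first_words) / len(question_words) >= 0.7:
        flags.append("restates question")

    return flags


print(red_flags(
    "What is the best database for a small web app?",
    "The best database for a small web app is a question many ask. It depends.",
))
```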
Key Takeaways
- Evaluate prompts on five dimensions: Accuracy, Relevance, Format, Completeness, and Consistency.
- The AI judge pattern uses AI to score its own outputs — creating a fast, scalable quality feedback loop.
- A/B test prompts you use repeatedly; even small improvements compound to massive value at scale.
- Six red flags signal a prompt needs work: excessive hedging, padded output, wrong question answered, high variance, 'it depends' without help, and intro restatement.
Run your first AI judge evaluation
Take any prompt from your library (or from a previous lesson). Run it to get an output, then run the AI judge pattern on that output to get scores on all 5 dimensions. Use the feedback to rewrite the prompt, then re-run the evaluation to check whether the scores improve.