
Evaluating Prompt Quality

The five dimensions of prompt quality

Not all prompts are equal — but 'this prompt is better' is too vague to be useful. Evaluate prompts on five specific dimensions:


1. Accuracy — Does the output contain correct information? Are factual claims verifiable?


2. Relevance — Does the output actually answer what was asked? Did the AI drift off-topic?


3. Format — Is the output structured as requested? Does the format match the intended use?


4. Completeness — Does the output cover everything needed, or are important elements missing?


5. Consistency — Run the same prompt 3 times. Does it produce reliably similar quality, or is the output unpredictable?


When a prompt underperforms, identify which dimension failed — that tells you exactly how to fix it.
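
One way to make that diagnosis concrete is to record scores for a few runs and let a short script point at the weak dimension. The sketch below is only an illustration: the 1-10 scale, the threshold of 8, the sample numbers, and the names `SCORED_DIMENSIONS`, `failing_dimensions`, and `score_spread` are all assumptions, not part of any tool. Consistency is approximated by how far scores drift between runs.

```python
from statistics import mean

# Hypothetical names; the four scored dimensions use an assumed 1-10 scale.
SCORED_DIMENSIONS = ("accuracy", "relevance", "format", "completeness")


def failing_dimensions(runs, threshold=8):
    """Return scored dimensions whose average across runs is below threshold."""
    return [
        dim for dim in SCORED_DIMENSIONS
        if mean(run[dim] for run in runs) < threshold
    ]


def score_spread(runs):
    """Consistency proxy: largest max-min gap seen on any scored dimension."""
    return max(
        max(run[dim] for run in runs) - min(run[dim] for run in runs)
        for dim in SCORED_DIMENSIONS
    )


# Three runs of the same prompt, scored by hand or by a judge (made-up numbers).
runs = [
    {"accuracy": 9, "relevance": 9, "format": 5, "completeness": 8},
    {"accuracy": 9, "relevance": 8, "format": 6, "completeness": 8},
    {"accuracy": 8, "relevance": 9, "format": 4, "completeness": 9},
]

print(failing_dimensions(runs))  # ['format']
print(score_spread(runs))        # 2
```

Here only Format falls below the threshold, so the fix belongs in the format specification rather than in the role or the task wording.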

The AI judge pattern

A fast way to evaluate prompt output quality is to use an AI model as the judge:


```

You are a prompt output quality evaluator. I will give you a prompt and its output. Rate the output on:


- Accuracy (1-10): Is the information correct and verifiable?

- Relevance (1-10): Does it answer what was asked?

- Format (1-10): Is the structure appropriate for the use case?

- Completeness (1-10): Is anything important missing?

- Consistency: Based on the output, would you expect this prompt to produce similar quality if run 5 more times?


For any score below 8, explain specifically what's wrong and how to fix the prompt to address it.


Prompt: [paste your prompt]

Output: [paste the output to evaluate]

```


This creates a tight feedback loop: prompt → output → AI evaluation → improved prompt. Because the judge responds in seconds, you can iterate far faster than with manual review alone.
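
If you run the judge from a script, the loop can be sketched roughly as follows. This is a minimal sketch under stated assumptions: `build_judge_prompt` and `parse_scores` are hypothetical helper names, the call to the model is left out because it depends on your provider, and the parser assumes the judge replies with lines shaped like `Accuracy: 7/10`, which a real judge may not do unless you ask for that format.

```python
import re

# Condensed from the judge prompt above; helper names are hypothetical.
JUDGE_TEMPLATE = (
    "You are a prompt output quality evaluator. "
    "I will give you a prompt and its output. Rate the output on:\n"
    "- Accuracy (1-10)\n"
    "- Relevance (1-10)\n"
    "- Format (1-10)\n"
    "- Completeness (1-10)\n"
    "For any score below 8, explain what's wrong and how to fix the prompt.\n\n"
    "Prompt: {prompt}\n\n"
    "Output: {output}\n"
)

# Assumed reply shape: one "Dimension: N" or "Dimension: N/10" entry per line.
SCORE_LINE = re.compile(
    r"^\s*(accuracy|relevance|format|completeness)\s*:\s*(\d{1,2})\b",
    re.IGNORECASE | re.MULTILINE,
)


def build_judge_prompt(prompt: str, output: str) -> str:
    """Fill the judge template with the prompt under test and its output."""
    return JUDGE_TEMPLATE.format(prompt=prompt, output=output)


def parse_scores(reply: str) -> dict:
    """Extract numeric scores from a judge reply in the assumed shape."""
    return {name.lower(): int(value) for name, value in SCORE_LINE.findall(reply)}


# A made-up judge reply, standing in for whatever your model returns.
reply = """Accuracy: 9/10
Relevance: 8/10
Format: 6/10 - no subject line
Completeness: 7/10 - missing call to action"""

scores = parse_scores(reply)
needs_fix = [dim for dim, score in scores.items() if score < 8]
print(needs_fix)  # ['format', 'completeness']
```

The dimensions in `needs_fix` tell you which part of the prompt to revise before the next run.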

A/B testing your prompts

For prompts you'll use repeatedly (in a product, automation, or daily workflow), A/B testing gives you data instead of opinions:


How to run a prompt A/B test:

1. Write two versions of the prompt (A: your current version, B: a variation)

2. Run both on the same 10 test inputs

3. Use the AI judge pattern to score each output independently

4. Compare average scores across all 5 dimensions

5. Choose the winner, but keep the loser — sometimes the 'worse' prompt works better for specific edge cases
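
The comparison in steps 4 and 5 can be sketched in a few lines. Everything here is illustrative: the function names are hypothetical, the scores are made up, and only two test inputs are shown instead of ten to keep the example short. The winner is picked by the sum of per-dimension differences, which is one simple rule among several you could use.

```python
from statistics import mean


def average_scores(results):
    """Average each dimension over all test inputs for one prompt version."""
    return {dim: mean(r[dim] for r in results) for dim in results[0]}


def compare(results_a, results_b):
    """Return per-dimension deltas (B minus A) and the overall winner."""
    avg_a, avg_b = average_scores(results_a), average_scores(results_b)
    deltas = {dim: avg_b[dim] - avg_a[dim] for dim in avg_a}
    total = sum(deltas.values())
    winner = "B" if total > 0 else "A" if total < 0 else "tie"
    return deltas, winner


def inputs_where_a_wins(results_a, results_b):
    """Indexes of test inputs where A's total score beats B's (edge cases)."""
    return [
        i for i, (a, b) in enumerate(zip(results_a, results_b))
        if sum(a.values()) > sum(b.values())
    ]


# Judge scores per test input (made-up numbers); same input order in both lists.
results_a = [
    {"accuracy": 8, "relevance": 7, "format": 6, "completeness": 7},
    {"accuracy": 9, "relevance": 9, "format": 8, "completeness": 9},
]
results_b = [
    {"accuracy": 8, "relevance": 8, "format": 9, "completeness": 7},
    {"accuracy": 8, "relevance": 8, "format": 9, "completeness": 9},
]

deltas, winner = compare(results_a, results_b)
print(winner)                                     # B
print(inputs_where_a_wins(results_a, results_b))  # [1]
```

Version B wins on average, yet version A still scores higher on the second input, which is the kind of edge case that justifies keeping the losing prompt around.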


What to vary in each test:

- The role definition (specific vs. general)

- The instruction phrasing (imperative vs. descriptive)

- The format specification (explicit schema vs. described format)

- The number of examples (0-shot vs. 1-shot vs. 3-shot)

- The output length constraint


Even a small improvement in a prompt used 1,000 times daily creates massive cumulative value.

Prompt quality red flags

Learn to recognize the signals that a prompt needs work:


Red flag 1: The AI apologizes or hedges excessively, with phrases like 'I'm just an AI and cannot...' or 'As an AI language model...'. The role definition is too vague or conflicts with what you're asking.


Red flag 2: The output is longer than needed. The AI is padding because the format wasn't specified. Add explicit length constraints.


Red flag 3: The AI answers a different question than asked. The task instruction is ambiguous. Rewrite it as a single, unambiguous sentence.


Red flag 4: The output quality varies wildly between runs. The prompt is relying on the model's discretion too heavily. Add more constraints, a format spec, and examples.


Red flag 5: The AI says 'it depends' without helping you figure out on what. Add: 'If the answer depends on factors, list the top 3 factors and give a recommendation for each scenario.'


Red flag 6: The first sentence restates your question. The AI is warming up. Fix: add 'Skip any introduction. Start directly with [the first element].'
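
Some of these red flags can be caught with crude text checks before a human or an AI judge looks at the output. The sketch below is a rough heuristic, not a reliable detector: the phrase list, the 200-word limit, the 0.7 overlap threshold, and the name `detect_red_flags` are all assumptions chosen for illustration. Red flags 3 and 4 are left out because they need an understanding of the task or several runs to detect.

```python
import re

# Assumed hedging phrases; extend the list for your own use case.
HEDGE_PATTERNS = [r"\bas an ai\b", r"\bi'?m just an ai\b", r"\bi cannot\b"]


def detect_red_flags(question, output, max_words=200):
    """Return heuristic red-flag labels for a single output."""
    flags = []
    lowered = output.lower()

    # Red flag 1: apologetic or hedging boilerplate.
    if any(re.search(pattern, lowered) for pattern in HEDGE_PATTERNS):
        flags.append("hedging")

    # Red flag 2: output longer than the assumed word budget.
    if len(output.split()) > max_words:
        flags.append("too_long")

    # Red flag 5: the phrase appears; whether help follows is not checked.
    if "it depends" in lowered:
        flags.append("it_depends")

    # Red flag 6: the first sentence reuses most of the question's words.
    first_sentence = re.split(r"(?<=[.!?])\s+", output.strip(), maxsplit=1)[0]
    question_words = set(re.findall(r"\w+", question.lower()))
    sentence_words = set(re.findall(r"\w+", first_sentence.lower()))
    if question_words:
        overlap = len(question_words & sentence_words) / len(question_words)
        if overlap >= 0.7:
            flags.append("restates_question")

    return flags


question = "What is the best database for a small blog?"
weak = "You asked what is the best database for a small blog. It depends."
better = "Use SQLite. It is simple to set up."

print(detect_red_flags(question, weak))    # ['it_depends', 'restates_question']
print(detect_red_flags(question, better))  # []
```

Treat a hit as a prompt to look closer rather than as a verdict, since a legitimate answer can contain any of these phrases.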


Key Takeaways

  • Evaluate prompts on five dimensions: Accuracy, Relevance, Format, Completeness, and Consistency.
  • The AI judge pattern uses AI to score its own outputs — creating a fast, scalable quality feedback loop.
  • A/B test prompts you use repeatedly; even small improvements compound to massive value at scale.
  • Six red flags signal a prompt needs work: excessive hedging, padded output, wrong question answered, high variance, 'it depends' without help, and intro restatement.

Run your first AI judge evaluation

Take any prompt from your library (or from a previous lesson). Run it to get an output. Then run the AI judge pattern on that output. Get scores on all 5 dimensions. Use the feedback to rewrite the prompt and test if the scores improve.

Prompt to evaluate: 'Write me a marketing email for my new AI course.'

AI Judge prompt: Rate this output on Accuracy (1-10), Relevance (1-10), Format (1-10), Completeness (1-10). For any score below 8, tell me exactly what's wrong and how to fix the prompt.

Expected result: The judge will likely flag the missing role, the vague task, the lack of an audience specification, and the absence of format and length constraints, giving you a precise improvement checklist.