Lesson 914 lessons

Adversarial and Safety Patterns

Why AI outputs need critical evaluation

AI models are confident. They produce fluent, authoritative-sounding text even when they're wrong. This is the hallucination problem — and it's the most dangerous failure mode for anyone building on top of AI.


Hallucinations happen because language models predict the most probable next token, not the most accurate one. A model that has seen thousands of articles about a topic will generate plausible-sounding text about it — even when the specific facts it's generating never existed.


Real examples of costly AI hallucinations:

- Lawyers citing court cases that don't exist (happened in actual legal proceedings)

- Medical information that sounds accurate but recommends wrong dosages

- Business statistics that seem credible but were fabricated

- Code that looks correct but has security vulnerabilities


The solution is not to distrust AI — it's to build verification habits and use adversarial prompting to catch errors before they cause harm.

The devil's advocate prompt

The most powerful adversarial technique: after the AI gives you an answer, immediately ask it to argue against that answer.


```

[After receiving any analysis or recommendation]


Now play devil's advocate. Assume the analysis above is wrong or incomplete. What are:

1. The 3 strongest counterarguments to this conclusion?

2. The key assumptions it makes that could be false?

3. The scenarios where this recommendation would fail badly?

4. What important information might be missing?


Be genuinely critical — not a softball critique. Find real weaknesses.

```


This pattern is especially valuable for:

- Business decisions and strategy

- Technical architecture choices

- Content claims and statistics

- Any recommendation you're about to act on


The AI arguing against its own previous answer often surfaces the most important considerations.

Fact-checking and uncertainty flagging

Build uncertainty flagging into every prompt that deals with factual claims:


```

For every factual claim in your response, rate your confidence:

[CERTAIN] — established fact you're very confident about

[LIKELY] — well-supported but could have exceptions

[UNCERTAIN] — plausible but you're not sure

[VERIFY] — you think this is right but the user should verify with a primary source


Never present uncertain information as fact. If you don't know, say so.

```


Separate fact-check prompt (use after any research output):

```

Review your previous response. For each specific claim, statistic, or fact:

1. How confident are you (high/medium/low)?

2. What's the potential source of this information?

3. What should the user verify independently before relying on this?


Flag any claim that was generated from pattern-matching rather than a specific known source.

```

Building safe AI products: guardrail prompting

When you deploy AI to users, you're responsible for what it produces. Guardrail prompting is how you prevent harmful, off-brand, or legally risky outputs:


Hard refusal guardrails (never respond to these):

```

You must never: generate medical diagnoses, give specific legal advice, provide financial investment recommendations, discuss competitor products negatively, generate any content that could be considered discriminatory.


If asked to do any of these, respond: 'That's outside what I'm able to help with here. For [topic], please consult a qualified [professional].'

```


Soft redirect guardrails (handle gracefully):

```

If a user's question is outside your knowledge domain, don't guess. Say: 'I'm not certain about that specific detail. Here's what I do know: [what you know]. For accurate information on this, I'd recommend checking [appropriate resource].'

```


Tone guardrails (maintain brand safety):

```

Always maintain a respectful, professional tone even if the user is frustrated or uses informal language. Never match aggression. If a conversation becomes hostile, respond: 'I want to help resolve this. Let me find the best solution for you.'

```

Key Takeaways

  • AI hallucinations are confident errors — building verification habits is non-negotiable for anyone relying on AI outputs.
  • The devil's advocate prompt — asking the AI to argue against its own answer — surfaces hidden weaknesses in any analysis.
  • Uncertainty flagging in prompts forces the model to distinguish between certain facts and pattern-matched guesses.
  • Guardrail prompting (hard refusals, soft redirects, tone guardrails) is how you deploy AI responsibly to real users.

Stress-test an AI output you rely on

Take any AI-generated analysis, recommendation, or factual claim you've been using or planning to use. Run the devil's advocate prompt on it, then run the fact-checking prompt. Document: what weaknesses were found, what claims need verification, and how (if at all) this changes your decision.

Scenario: You used Claude to analyze your market and it said 'The Arabic e-learning market is expected to reach $2.5 billion by 2027.' Devil's advocate prompt: Argue against this market analysis. What assumptions is it making? What could make this figure wrong? What market factors might be missing? Fact-check prompt: For the statistic about Arabic e-learning market size, rate your confidence (high/medium/low), cite the likely source, and tell me what I should verify independently.
System Prompt Engineering