mirror of
https://github.com/github/awesome-copilot.git
synced 2026-04-12 19:25:55 +00:00
# Writing the Quality Constitution (File 1: QUALITY.md)

The quality constitution defines what "quality" means for this specific project and makes the bar explicit, persistent, and inherited by every AI session.

## Template

```markdown
# Quality Constitution: [Project Name]

## Purpose

[2–3 paragraphs grounding quality in three principles:]

- **Deming** ("quality is built in, not inspected in") — Quality is built into the context files
  and the quality playbook so every AI session inherits the same bar.
- **Juran** ("fitness for use") — Define fitness specifically for this project. Not "tests pass"
  but the actual real-world requirement. Example: "generates correct output that survives
  input schema changes without silently producing wrong results."
- **Crosby** ("quality is free") — Building a quality playbook upfront costs less than
  debugging problems found after deployment.

## Coverage Targets

| Subsystem | Target | Why |
|-----------|--------|-----|
| [Most fragile module] | 90–95% | [Real edge case or past bug] |
| [Core logic module] | 85–90% | [Concrete risk] |
| [I/O or integration layer] | 80% | [Explain] |
| [Configuration/utilities] | 75–80% | [Explain] |

The rationale column is essential. It must reference specific risks or past failures.
If you can't explain why a subsystem needs high coverage with a concrete example,
the target is arbitrary.

## Coverage Theater Prevention

[Define what constitutes a fake test for this project.]

Generic examples that apply to most projects:

- Asserting a function returned *something* without checking what
- Testing with synthetic data that lacks the quirks of real data
- Asserting an import succeeded
- Asserting that a mock returns what it was configured to return
- Calling a function and only asserting that no exception was thrown

[Add project-specific examples based on what you learned during exploration.
For a data pipeline: "counting output records without checking their values."
For a web app: "checking for HTTP 200 without checking the response body."
For a compiler: "checking that the output compiles without checking its behavior."]

## Fitness-to-Purpose Scenarios

[5–10 scenarios. Every scenario must include a `[Req: tier — source]` tag linking it to its requirement source. Use the template below:]

### Scenario N: [Memorable Name]

**Requirement tag:** [Req: formal — Spec §X] *(or `user-confirmed` / `inferred` — see SKILL.md Phase 1, Step 1 for tier definitions)*

**What happened:** [The architectural vulnerability, edge case, or design decision.
Reference actual code — function names, file names, line numbers. Frame it as "this architecture permits the following failure mode."]

**The requirement:** [What the code must do to prevent this failure.
Be specific enough that an AI can verify it.]

**How to verify:** [A concrete test or query that would fail if this regressed.
Include exact commands, test names, or assertions.]

---

[Repeat for each scenario]

## AI Session Quality Discipline

1. Read QUALITY.md before starting work.
2. Run the full test suite before marking any task complete.
3. Add tests for new functionality (not just the happy path — include edge cases).
4. Update this file if new failure modes are discovered.
5. Output a Quality Compliance Checklist before ending a session.
6. Never remove a fitness-to-purpose scenario. Only add new ones.

## The Human Gate

[List things that require human judgment:]

- Output that "looks right" (requires domain knowledge)
- UX and responsiveness
- Documentation accuracy
- Security review of auth changes
- Backward compatibility decisions
```
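
To make the coverage-theater distinction concrete, here is a minimal sketch in Python. The `parse_record` function and its tests are hypothetical, not part of the template:

```python
# Hypothetical function under test: parses a "name,age" line into a dict.
def parse_record(line: str) -> dict:
    name, age = line.split(",")
    return {"name": name.strip(), "age": int(age)}

# Coverage theater: executes every line but asserts almost nothing.
def test_theater():
    result = parse_record("Ada, 36")
    assert result is not None  # passes even if every value is wrong

# Meaningful: checks the actual values, including whitespace handling.
def test_real_values():
    assert parse_record("Ada, 36") == {"name": "Ada", "age": 36}

# Meaningful: malformed real-world input must fail loudly, not silently.
def test_missing_age_fails():
    try:
        parse_record("Ada")
    except ValueError:
        return
    raise AssertionError("expected ValueError for a record with no age")
```

Note that the theater test earns full line coverage of `parse_record` while being unable to fail for any wrong value.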
## Where Scenarios Come From

Scenarios come from two sources — **code exploration** and **domain knowledge** — and the best scenarios combine both.

### Source 1: Defensive Code Patterns (Code Exploration)

Every defensive pattern is evidence of a past failure or known risk:

1. **Defensive code** — Every `if value is None: return` guard is a scenario. Why was it needed?
2. **Normalization functions** — Every function that cleans input exists because raw input once caused problems
3. **Configuration that could be hardcoded** — If a value is read from config instead of being hardcoded, someone learned that the value varies
4. **Git blame / commit messages** — "Fix crash when X is missing" → Scenario: X can be missing
5. **Comments explaining "why"** — "We use hash(id), not a sequential index, because..." → Scenario about correctness under that constraint

### Source 2: What Could Go Wrong (Domain Knowledge)
Don't limit yourself to what the code already defends against. Use your knowledge of similar systems to generate realistic failure scenarios that the code **should** handle. For every major subsystem, ask:
- "What happens if this process is killed mid-operation?" (state machines, file I/O, batch processing)
- "What happens if external input is subtly wrong?" (validation pipelines, API integrations)
- "What happens if this runs at 10x scale?" (batch processing, databases, queues)
- "What happens if two operations overlap?" (concurrency, file locks, shared state)
- "What produces correct-looking output that is actually wrong?" (randomness, statistical operations, type coercion)

These are not hypothetical — they are things that happen to every system of this type. Write them as **architectural vulnerability analyses**: "Because `save_state()` lacks an atomic rename pattern, a mid-write crash during a 10,000-record batch will leave a corrupted state file — the next run gets JSONDecodeError and cannot resume without manual intervention. At scale (9,240 records across 64 batches), this pattern risks silent loss of 1,693+ records with nothing to flag them as missing." Concrete numbers and specific consequences make scenarios authoritative and non-negotiable. An AI session reading "records can be lost" will argue the standard down. An AI session reading a specific failure mode with quantified impact will not.
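
The atomic rename pattern referenced above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `save_state` that persists a dict as JSON; `os.replace` provides the atomic rename on both POSIX and Windows:

```python
import json
import os
import tempfile

# Vulnerable pattern: a crash between open() and the end of json.dump()
# leaves the state file truncated; the next run gets JSONDecodeError.
def save_state_naive(state: dict, path: str = "state.json") -> None:
    with open(path, "w") as f:
        json.dump(state, f)  # crash mid-dump => corrupted file

# Defensive pattern: write to a temp file in the same directory, then
# atomically rename it over the target. Readers see the old file or the
# new file, never a half-written one.
def save_state_atomic(state: dict, path: str = "state.json") -> None:
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp_path, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)  # discard the partial temp file
        raise
```

A crash at any point before `os.replace` returns leaves the previous complete file in place, so the worst case is losing the latest batch, never the entire state.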
### The Narrative Voice

Each scenario's "What happened" must read like an architectural vulnerability analysis, not an abstract specification. Include:

- **Specific quantities** — "308 records across 64 batches" not "some records"
- **Cascade consequences** — "cascading through all subsequent pipeline steps, requiring reprocessing of 4,300 records instead of 308"
- **Detection difficulty** — "nothing would flag them as missing" or "only statistical verification would catch it"
- **Root cause in code** — "`random.seed(index)` creates correlated sequences because sequential integers produce related random streams"

The narrative voice serves a critical purpose: it makes standards non-negotiable. Abstract requirements ("records should not be lost") invite rationalization. Specific failure modes with quantified impact ("a mid-batch crash silently loses 1,693 records with no detection mechanism") do not. Frame these as "this architecture permits the following failure" — grounded in the actual code, not fabricated as past incidents.
### Combining Both Sources
The strongest scenarios combine a defensive pattern found in code with domain knowledge about why it matters:
1. Find the defensive code: `save_state()` writes to a temp file then renames
2. Ask what failure this prevents: a mid-write crash leaves a corrupted state file
3. Write the scenario as a vulnerability analysis: "Without the atomic rename pattern, a crash mid-write leaves state.json 50% complete. The next run gets a JSONDecodeError and cannot resume without manual intervention."
4. Ground it in code: "Read persistence.py line ~340: verify temp file + rename pattern"
### The "Why" Requirement
Every coverage target, every quality gate, every standard must have a "why" that references a specific scenario or risk. Without rationale, a future AI session will optimize for speed and argue the standard down.

Bad: "Core logic: 100% coverage"

Good: "Core logic: 100% — because `random.seed(index)` created correlated sequences that produced 77.5% bias instead of 50/50. Subtle bugs here produce plausible-but-wrong output. Only statistical verification catches them."

The "why" is not documentation — it is protection against erosion.
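
As an illustration of a check that value-level assertions cannot replace, here is a sketch of statistical verification. The `assign_group` function, the sample size, and the tolerance are hypothetical:

```python
import random

# Hypothetical assignment: each record is routed to group A or B
# with equal probability, using a per-record seeded stream.
def assign_group(record_id: int) -> str:
    rng = random.Random(record_id)  # independent stream per record
    return "A" if rng.random() < 0.5 else "B"

# Every single output is a valid "A" or "B", so a value-level test
# passes no matter how biased the split is. Only an aggregate
# assertion over many records can detect bias or seed correlation.
def test_split_is_roughly_balanced(n: int = 10_000, tolerance: float = 0.03):
    share_a = sum(assign_group(i) == "A" for i in range(n)) / n
    assert abs(share_a - 0.5) < tolerance, f"biased split: {share_a:.3f}"
```

A 77.5%/22.5% split like the one described above fails this test immediately, while every per-record assertion would still pass.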
## Calibrating Scenario Count
Aim for 2+ scenarios per core module (the modules identified as most complex or fragile). For a medium-sized project, this typically yields 8–10 scenarios. Fewer is fine for small projects; more for complex ones. If you're finding very few scenarios, it usually means the exploration was shallow rather than the project being simple — go back and read function bodies more carefully. Quality matters more than count: one scenario that precisely captures an architectural vulnerability is worth more than three generic "what if the input is bad" scenarios.
## Self-Critique Before Finishing
After drafting all scenarios, review each one and ask:
1. **"Would an AI session argue this standard down?"** If yes, the "why" isn't concrete enough. Add numbers, consequences, and detection difficulty.
2. **"Does the 'What happened' read like a vulnerability analysis or an abstract spec?"** If it reads like a spec, rewrite it with specific quantities, cascading consequences, and grounding in actual code.
3. **"Is there a scenario I'm not seeing?"** Think about what a different AI model would flag. Architecture models catch data flow problems. Edge-case models catch boundary conditions. What are you blind to?
## Critical Rule
Each scenario's "How to verify" section must map to at least one automated test in the functional test file. If a scenario can't be automated, note why (it may require the Human Gate) — but most scenarios should be testable.
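
One way to keep that mapping auditable is to name each test after its scenario and assert exactly the behavior the scenario's "How to verify" section describes. A sketch, where scenario 3 and the state-file layout are hypothetical:

```python
import json
import os
import tempfile

# Hypothetical Scenario 3: "a mid-write crash must not corrupt state.json".
# The test simulates the crash by leaving a truncated temp file behind
# and asserts that the previously committed state is still readable.
def test_scenario_3_crash_mid_write_preserves_state():
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "state.json")

    # Commit a known-good state first.
    with open(path, "w") as f:
        json.dump({"processed": 308}, f)

    # Simulate a crash: a partial write goes to a temp file that was
    # never renamed over the real state file (the atomic pattern).
    with open(path + ".tmp", "w") as f:
        f.write('{"processed": 9')  # truncated mid-write

    # The verification from the scenario: committed state must survive.
    with open(path) as f:
        assert json.load(f) == {"processed": 308}
```

The scenario number in the test name gives reviewers a direct grep target when auditing whether every scenario is covered.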