awesome-copilot/skills/quality-playbook/references/constitution.md
2026-03-26 10:09:58 +11:00

# Writing the Quality Constitution (File 1: QUALITY.md)

The quality constitution defines what "quality" means for this specific project and makes the bar explicit, persistent, and inherited by every AI session.

## Template

# Quality Constitution: [Project Name]

## Purpose

[2–3 paragraphs grounding quality in three principles:]

- **Deming** ("quality is built in, not inspected in") — Quality is built into context files
  and the quality playbook so every AI session inherits the same bar.
- **Juran** ("fitness for use") — Define fitness specifically for this project. Not "tests pass"
  but the actual real-world requirement. Example: "generates correct output that survives
  input schema changes without silently producing wrong results."
- **Crosby** ("quality is free") — Building a quality playbook upfront costs less than
  debugging problems found after deployment.

## Coverage Targets

| Subsystem | Target | Why |
|-----------|--------|-----|
| [Most fragile module] | 90–95% | [Real edge case or past bug] |
| [Core logic module] | 85–90% | [Concrete risk] |
| [I/O or integration layer] | 80% | [Explain] |
| [Configuration/utilities] | 75–80% | [Explain] |

The rationale column is essential. It must reference specific risks or past failures.
If you can't explain why a subsystem needs high coverage with a concrete example,
the target is arbitrary.
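
The table above can also be enforced mechanically rather than by review alone. A minimal sketch, assuming coverage.py's `coverage json` report format; the module names and thresholds are hypothetical placeholders for your own table:

```python
import json

# Hypothetical per-module targets mirroring the table above; replace the
# paths and thresholds with your project's actual modules.
TARGETS = {
    "src/pipeline.py": 90.0,   # most fragile module
    "src/core.py": 85.0,       # core logic
    "src/io_layer.py": 80.0,   # I/O / integration layer
}

def check_targets(report: dict, targets: dict) -> list:
    """Return human-readable failures; an empty list means all targets met.

    `report` is the parsed coverage.json produced by `coverage json`,
    whose per-file summaries expose a percent_covered field. A module
    missing from the report counts as 0% covered, so it fails loudly
    instead of being silently skipped.
    """
    failures = []
    for path, minimum in targets.items():
        summary = report.get("files", {}).get(path, {}).get("summary", {})
        covered = summary.get("percent_covered", 0.0)
        if covered < minimum:
            failures.append(f"{path}: {covered:.1f}% < {minimum:.1f}%")
    return failures
```

Run `coverage json` after the test suite, load `coverage.json`, and fail the session if `check_targets` returns anything — that turns the rationale table into a gate instead of a suggestion.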

## Coverage Theater Prevention

[Define what constitutes a fake test for this project.]

Generic examples that apply to most projects:
- Asserting a function returned *something* without checking what
- Testing with synthetic data that lacks the quirks of real data
- Asserting an import succeeded
- Asserting mock returns what the mock was configured to return
- Calling a function and only asserting no exception was thrown

[Add project-specific examples based on what you learned during exploration.
For a data pipeline: "counting output records without checking their values."
For a web app: "checking HTTP 200 without checking the response body."
For a compiler: "checking output compiles without checking behavior."]
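
The contrast between theater and a meaningful test can be made concrete. A sketch using an invented `normalize_record` helper (illustrative only, not from any real project):

```python
def normalize_record(record):
    """Toy helper invented for illustration: lowercases and trims the
    email field, dropping records that have none."""
    email = record.get("email")
    if email is None:
        return None
    return {**record, "email": email.strip().lower()}

def test_normalize_theater():
    # Coverage theater: asserts *something* came back, checks nothing real.
    # This passes even if normalization is completely broken.
    assert normalize_record({"email": " X@Y.COM "}) is not None

def test_normalize_real():
    # Meaningful: pins the actual transformation and the missing-field edge.
    assert normalize_record({"email": " X@Y.COM "}) == {"email": "x@y.com"}
    assert normalize_record({"name": "no email"}) is None
```

Both tests produce identical coverage numbers; only the second would catch a regression.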

## Fitness-to-Purpose Scenarios

[5–10 scenarios. Every scenario must include a `[Req: tier — source]` tag linking it to its requirement source. Use the template below:]

### Scenario N: [Memorable Name]

**Requirement tag:** [Req: formal — Spec §X] *(or `user-confirmed` / `inferred` — see SKILL.md Phase 1, Step 1 for tier definitions)*

**What happened:** [The architectural vulnerability, edge case, or design decision.
Reference actual code — function names, file names, line numbers. Frame as "this architecture permits the following failure mode."]

**The requirement:** [What the code must do to prevent this failure.
Be specific enough that an AI can verify it.]

**How to verify:** [Concrete test or query that would fail if this regressed.
Include exact commands, test names, or assertions.]

---

[Repeat for each scenario]

## AI Session Quality Discipline

1. Read QUALITY.md before starting work.
2. Run the full test suite before marking any task complete.
3. Add tests for new functionality (not just happy path — include edge cases).
4. Update this file if new failure modes are discovered.
5. Output a Quality Compliance Checklist before ending a session.
6. Never remove a fitness-to-purpose scenario. Only add new ones.

## The Human Gate

[List things that require human judgment:]
- Output that "looks right" (requires domain knowledge)
- UX and responsiveness
- Documentation accuracy
- Security review of auth changes
- Backward compatibility decisions

## Where Scenarios Come From

Scenarios come from two sources — code exploration and domain knowledge — and the best scenarios combine both.

### Source 1: Defensive Code Patterns (Code Exploration)

Every defensive pattern is evidence of a past failure or known risk:

  1. Defensive code — Every `if value is None: return` guard is a scenario. Why was it needed?
  2. Normalization functions — Every function that cleans input exists because raw input caused problems.
  3. Configuration that could be hardcoded — If a value is read from config instead of hardcoded, someone learned the value varies.
  4. Git blame / commit messages — "Fix crash when X is missing" → Scenario: X can be missing.
  5. Comments explaining "why" — "We use `hash(id)` not sequential index because..." → Scenario about correctness under that constraint.

### Source 2: What Could Go Wrong (Domain Knowledge)

Don't limit yourself to what the code already defends against. Use your knowledge of similar systems to generate realistic failure scenarios that the code should handle. For every major subsystem, ask:

  • "What happens if this process is killed mid-operation?" (state machines, file I/O, batch processing)
  • "What happens if external input is subtly wrong?" (validation pipelines, API integrations)
  • "What happens if this runs at 10x scale?" (batch processing, databases, queues)
  • "What happens if two operations overlap?" (concurrency, file locks, shared state)
  • "What produces correct-looking output that is actually wrong?" (randomness, statistical operations, type coercion)

These are not hypothetical — they are things that happen to every system of this type. Write them as architectural vulnerability analyses: "Because `save_state()` lacks an atomic rename pattern, a mid-write crash during a 10,000-record batch will leave a corrupted state file — the next run gets `JSONDecodeError` and cannot resume without manual intervention. At scale (9,240 records across 64 batches), this pattern risks silent loss of 1,693+ records with nothing to flag them as missing." Concrete numbers and specific consequences make scenarios authoritative and non-negotiable. An AI session reading "records can be lost" will argue the standard down. An AI session reading a specific failure mode with quantified impact will not.
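
The atomic rename pattern that this failure mode calls for might look like the following sketch; `save_state` and `state.json` are illustrative names, not a specific project's API:

```python
import json
import os
import tempfile

def save_state(state, path="state.json"):
    """Write state atomically: a crash mid-write leaves the old file intact.

    The temp file is created in the destination directory so os.replace()
    is a same-filesystem rename, which POSIX guarantees to be atomic —
    readers see either the old file or the new one, never a half-written mix.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp_path, path)  # the atomic commit point
    except BaseException:
        os.unlink(tmp_path)  # a failed write leaves no debris behind
        raise
```

Without the `fsync`, a power loss shortly after the rename can still produce an empty file on some filesystems — which is exactly the kind of detail a scenario's "How to verify" section should pin down.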

## The Narrative Voice

Each scenario's "What happened" must read like an architectural vulnerability analysis, not an abstract specification. Include:

  • Specific quantities — "308 records across 64 batches" not "some records"
  • Cascade consequences — "cascading through all subsequent pipeline steps, requiring reprocessing of 4,300 records instead of 308"
  • Detection difficulty — "nothing would flag them as missing" or "only statistical verification would catch it"
  • Root cause in code — "`random.seed(index)` creates correlated sequences because sequential integers produce related random streams"

The narrative voice serves a critical purpose: it makes standards non-negotiable. Abstract requirements ("records should not be lost") invite rationalization. Specific failure modes with quantified impact ("a mid-batch crash silently loses 1,693 records with no detection mechanism") do not. Frame these as "this architecture permits the following failure" — grounded in the actual code, not fabricated as past incidents.

## Combining Both Sources

The strongest scenarios combine a defensive pattern found in code with domain knowledge about why it matters:

  1. Find the defensive code: `save_state()` writes to a temp file, then renames.
  2. Ask what failure this prevents: a mid-write crash leaves a corrupted state file.
  3. Write the scenario as a vulnerability analysis: "Without the atomic rename pattern, a crash mid-write leaves `state.json` 50% complete. The next run gets `JSONDecodeError` and cannot resume without manual intervention."
  4. Ground it in code: "Read `persistence.py` line ~340: verify the temp file + rename pattern."
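
Step 4's verification can itself be an automated regression test. A sketch, with a minimal stand-in `atomic_save` in place of the project's real `save_state()` (all names here are illustrative):

```python
import json
import os
import tempfile

def atomic_save(state, path):
    # Minimal stand-in for the pattern under test: write-temp-then-rename.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)

def simulate_crash_mid_write(path):
    # Emulate a process killed halfway through writing JSON to `path`:
    # the file holds a truncated document that will not parse.
    with open(path, "w") as f:
        f.write('{"version": 2, "records": [')  # interrupted here

def test_old_state_survives_mid_write_crash():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "state.json")
        atomic_save({"version": 1}, path)
        # With the atomic pattern, an interrupted rewrite only ever touches
        # the temp file; the committed state must still parse afterwards.
        simulate_crash_mid_write(path + ".tmp")
        with open(path) as f:
            assert json.load(f) == {"version": 1}
```

If someone later "simplifies" the save path to write in place, this test fails — which is precisely what "How to verify" is for.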

## The "Why" Requirement

Every coverage target, every quality gate, every standard must have a "why" that references a specific scenario or risk. Without rationale, a future AI session will optimize for speed and argue the standard down.

Bad: "Core logic: 100% coverage."

Good: "Core logic: 100% — because `random.seed(index)` created correlated sequences that produced 77.5% bias instead of 50/50. Subtle bugs here produce plausible-but-wrong output. Only statistical verification catches them."

The "why" is not documentation — it is protection against erosion.
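
A statistical verification like the one in the "Good" rationale might be sketched as follows. `fair_assign` and `biased_assign` are invented stand-ins — the bug here is a deterministic 75/25 skew rather than seed correlation — but the check is the same: measure the aggregate distribution instead of spot-checking individual outputs.

```python
import math
import random

def check_balance(assign, n=10_000, p=0.5, sigmas=4.0):
    """Statistical verification: flag bias a per-record spot check would miss.

    Counts how often `assign` returns "A" over n calls and rejects if the
    count falls outside `sigmas` standard deviations of the binomial mean.
    At sigmas=4, a genuinely fair assigner fails only about 1 run in 15,000.
    """
    heads = sum(1 for i in range(n) if assign(i) == "A")
    mean = n * p
    stddev = math.sqrt(n * p * (1 - p))
    return abs(heads - mean) <= sigmas * stddev

# A correct assigner draws from one shared RNG rather than reseeding
# per record; the fixed seed here just makes the sketch reproducible.
rng = random.Random(12345)

def fair_assign(record_id):
    return "A" if rng.random() < 0.5 else "B"

def biased_assign(record_id):
    # Plausible-looking bug: every output is a valid group label, but the
    # aggregate split is 75/25 instead of 50/50.
    return "A" if record_id % 4 != 0 else "B"
```

Every individual output of `biased_assign` looks correct; only the aggregate check fails — which is why "tests pass" is not the same as fitness for use.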

## Calibrating Scenario Count

Aim for 2+ scenarios per core module (the modules identified as most complex or fragile). For a medium-sized project, this typically yields 8–10 scenarios. Fewer is fine for small projects; more for complex ones. If you're finding very few scenarios, it usually means the exploration was shallow rather than the project being simple — go back and read function bodies more carefully. Quality matters more than count: one scenario that precisely captures an architectural vulnerability is worth more than three generic "what if the input is bad" scenarios.

## Self-Critique Before Finishing

After drafting all scenarios, review each one and ask:

  1. "Would an AI session argue this standard down?" If yes, the "why" isn't concrete enough. Add numbers, consequences, and detection difficulty.
  2. "Does the 'What happened' read like a vulnerability analysis or an abstract spec?" If it reads like a spec, rewrite it with specific quantities, cascading consequences, and grounding in actual code.
  3. "Is there a scenario I'm not seeing?" Think about what a different AI model would flag. Architecture models catch data flow problems. Edge-case models catch boundary conditions. What are you blind to?

## Critical Rule

Each scenario's "How to verify" section must map to at least one automated test in the functional test file. If a scenario can't be automated, note why (it may require the Human Gate) — but most scenarios should be testable.