Mirror of https://github.com/github/awesome-copilot.git (synced 2026-04-13 19:55:56 +00:00)

Commit: Add quality-playbook skill (#1168)
File: skills/quality-playbook/references/constitution.md (new file, 160 lines)
# Writing the Quality Constitution (File 1: QUALITY.md)

The quality constitution defines what "quality" means for this specific project and makes the bar explicit, persistent, and inherited by every AI session.

## Template

```markdown

# Quality Constitution: [Project Name]

## Purpose

[2–3 paragraphs grounding quality in three principles:]

- **Deming** ("quality is built in, not inspected in") — Quality is built into context files and the quality playbook so every AI session inherits the same bar.
- **Juran** ("fitness for use") — Define fitness specifically for this project. Not "tests pass" but the actual real-world requirement. Example: "generates correct output that survives input schema changes without silently producing wrong results."
- **Crosby** ("quality is free") — Building a quality playbook upfront costs less than debugging problems found after deployment.

## Coverage Targets

| Subsystem | Target | Why |
|-----------|--------|-----|
| [Most fragile module] | 90–95% | [Real edge case or past bug] |
| [Core logic module] | 85–90% | [Concrete risk] |
| [I/O or integration layer] | 80% | [Explain] |
| [Configuration/utilities] | 75–80% | [Explain] |

The rationale column is essential. It must reference specific risks or past failures. If you can't explain why a subsystem needs high coverage with a concrete example, the target is arbitrary.

## Coverage Theater Prevention

[Define what constitutes a fake test for this project.]

Generic examples that apply to most projects:

- Asserting a function returned *something* without checking what
- Testing with synthetic data that lacks the quirks of real data
- Asserting an import succeeded
- Asserting that a mock returns what it was configured to return
- Calling a function and only asserting that no exception was thrown

[Add project-specific examples based on what you learned during exploration. For a data pipeline: "counting output records without checking their values." For a web app: "checking HTTP 200 without checking the response body." For a compiler: "checking the output compiles without checking behavior."]

## Fitness-to-Purpose Scenarios

[5–10 scenarios. Every scenario must include a `[Req: tier — source]` tag linking it to its requirement source. Use the template below:]

### Scenario N: [Memorable Name]

**Requirement tag:** [Req: formal — Spec §X] *(or `user-confirmed` / `inferred` — see SKILL.md Phase 1, Step 1 for tier definitions)*

**What happened:** [The architectural vulnerability, edge case, or design decision. Reference actual code — function names, file names, line numbers. Frame as "this architecture permits the following failure mode."]

**The requirement:** [What the code must do to prevent this failure. Be specific enough that an AI can verify it.]

**How to verify:** [Concrete test or query that would fail if this regressed. Include exact commands, test names, or assertions.]

---

[Repeat for each scenario]

## AI Session Quality Discipline

1. Read QUALITY.md before starting work.
2. Run the full test suite before marking any task complete.
3. Add tests for new functionality (not just the happy path — include edge cases).
4. Update this file if new failure modes are discovered.
5. Output a Quality Compliance Checklist before ending a session.
6. Never remove a fitness-to-purpose scenario. Only add new ones.

## The Human Gate

[List things that require human judgment:]

- Output that "looks right" (requires domain knowledge)
- UX and responsiveness
- Documentation accuracy
- Security review of auth changes
- Backward compatibility decisions
```
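The coverage-theater bullets in the template can be made concrete with a short contrast. A minimal sketch in Python, where `parse_record` is a hypothetical function invented for illustration:

```python
# Hypothetical function under test: parses "name,age" into a dict.
def parse_record(line):
    name, age = line.split(",")
    return {"name": name.strip(), "age": int(age)}

# Coverage theater: executes the code but checks almost nothing.
def test_parse_record_fake():
    result = parse_record("Ada,36")
    assert result is not None          # asserts *something* came back
    assert isinstance(result, dict)    # passes even if every value is wrong

# A real test: pins exact values and covers a realistic data quirk.
def test_parse_record_real():
    assert parse_record("Ada,36") == {"name": "Ada", "age": 36}
    # Real data has stray whitespace; clean synthetic data would hide this.
    assert parse_record("  Ada , 36")["name"] == "Ada"
```

The fake test earns coverage and passes even if `parse_record` returns the wrong values; only the second test ties the code to the requirement.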

## Where Scenarios Come From

Scenarios come from two sources — **code exploration** and **domain knowledge** — and the best scenarios combine both.

### Source 1: Defensive Code Patterns (Code Exploration)

Every defensive pattern is evidence of a past failure or known risk:

1. **Defensive code** — Every `if value is None: return` guard is a scenario. Why was it needed?
2. **Normalization functions** — Every function that cleans input exists because raw input caused problems.
3. **Configuration that could be hardcoded** — If a value is read from config instead of hardcoded, someone learned the value varies.
4. **Git blame / commit messages** — "Fix crash when X is missing" → Scenario: X can be missing.
5. **Comments explaining "why"** — "We use hash(id) not sequential index because..." → Scenario about correctness under that constraint.
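Pattern 1 in practice: a guard like the following is a scenario waiting to be written down. The function and order shape here are hypothetical, for illustration only:

```python
# A defensive guard found during exploration (hypothetical module).
def total_amount(order):
    if order.get("items") is None:   # guard: why can items be None?
        return 0.0                   # -> Scenario: orders can arrive with no items
    return sum(i["price"] * i["qty"] for i in order["items"])
```

The guard itself answers "what"; the scenario you write must answer "why": under what real conditions does an order arrive without items, and what must the system do when it does?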

### Source 2: What Could Go Wrong (Domain Knowledge)

Don't limit yourself to what the code already defends against. Use your knowledge of similar systems to generate realistic failure scenarios that the code **should** handle. For every major subsystem, ask:

- "What happens if this process is killed mid-operation?" (state machines, file I/O, batch processing)
- "What happens if external input is subtly wrong?" (validation pipelines, API integrations)
- "What happens if this runs at 10x scale?" (batch processing, databases, queues)
- "What happens if two operations overlap?" (concurrency, file locks, shared state)
- "What produces correct-looking output that is actually wrong?" (randomness, statistical operations, type coercion)

These are not hypothetical — they are things that happen to every system of this type. Write them as **architectural vulnerability analyses**: "Because `save_state()` lacks an atomic rename pattern, a mid-write crash during a 10,000-record batch will leave a corrupted state file — the next run gets JSONDecodeError and cannot resume without manual intervention. At scale (9,240 records across 64 batches), this pattern risks silent loss of 1,693+ records with nothing to flag them as missing." Concrete numbers and specific consequences make scenarios authoritative and non-negotiable. An AI session reading "records can be lost" will argue the standard down. An AI session reading a specific failure mode with quantified impact will not.
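The mid-write crash failure mode is easy to reproduce. A minimal sketch, with an invented state layout standing in for whatever the project persists:

```python
import json

# Simulate the failure mode: a crash mid-write leaves a truncated file.
state = {"processed_ids": list(range(10_000))}
full = json.dumps(state)
truncated = full[: len(full) // 2]   # process killed halfway through the write

try:
    json.loads(truncated)
    recovered = True
except json.JSONDecodeError:
    recovered = False                # next run cannot resume without manual repair

assert recovered is False
```

This is exactly the "cannot resume without manual intervention" consequence: the half-written file is not merely stale, it is unparseable.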

### The Narrative Voice

Each scenario's "What happened" must read like an architectural vulnerability analysis, not an abstract specification. Include:

- **Specific quantities** — "308 records across 64 batches" not "some records"
- **Cascade consequences** — "cascading through all subsequent pipeline steps, requiring reprocessing of 4,300 records instead of 308"
- **Detection difficulty** — "nothing would flag them as missing" or "only statistical verification would catch it"
- **Root cause in code** — "`random.seed(index)` creates correlated sequences because sequential integers produce related random streams"

The narrative voice serves a critical purpose: it makes standards non-negotiable. Abstract requirements ("records should not be lost") invite rationalization. Specific failure modes with quantified impact ("a mid-batch crash silently loses 1,693 records with no detection mechanism") do not. Frame these as "this architecture permits the following failure" — grounded in the actual code, not fabricated as past incidents.

### Combining Both Sources

The strongest scenarios combine a defensive pattern found in the code with domain knowledge about why it matters:

1. Find the defensive code: `save_state()` writes to a temp file, then renames it.
2. Ask what failure this prevents: a mid-write crash leaves a corrupted state file.
3. Write the scenario as a vulnerability analysis: "Without the atomic rename pattern, a crash mid-write leaves state.json 50% complete. The next run gets JSONDecodeError and cannot resume without manual intervention."
4. Ground it in code: "Read persistence.py line ~340: verify the temp-file + rename pattern."
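The defensive pattern in step 1 can be sketched as follows. This is a generic illustration of the atomic-rename idiom, not the project's actual `save_state()`; the function name and JSON payload are assumptions:

```python
import json
import os
import tempfile

def save_state_atomic(path, state):
    # Atomic rename pattern: write the full payload to a temp file in the
    # same directory, then replace the target in one step. Readers see
    # either the old complete file or the new complete file, never a
    # partially written one.
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(json.dumps(state))
            f.flush()
            os.fsync(f.fileno())    # ensure the bytes are on disk first
        os.replace(tmp_path, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)
        raise
```

The temp file must live in the same directory as the target: `os.replace` is only atomic within a single filesystem.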

### The "Why" Requirement

Every coverage target, every quality gate, every standard must have a "why" that references a specific scenario or risk. Without rationale, a future AI session will optimize for speed and argue the standard down.

Bad: "Core logic: 100% coverage"

Good: "Core logic: 100% — because `random.seed(index)` created correlated sequences that produced 77.5% bias instead of 50/50. Subtle bugs here produce plausible-but-wrong output. Only statistical verification catches them."

The "why" is not documentation — it is protection against erosion.
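The "only statistical verification catches them" point can be made concrete with a distribution gate. A minimal sketch; the rates and tolerance below are illustrative, not the project's actual thresholds:

```python
def bias_check(outcomes, tolerance=0.02):
    # Statistical gate for a 50/50 requirement: return the observed rate
    # and whether it sits within tolerance. A "no exception was thrown"
    # test would never catch the 77.5%-bias failure described above.
    rate = sum(1 for o in outcomes if o) / len(outcomes)
    return rate, abs(rate - 0.5) <= tolerance

# The failure mode: plausible-looking output with the wrong distribution.
rate, ok = bias_check([True] * 775 + [False] * 225)
assert (rate, ok) == (0.775, False)

# A healthy distribution passes the same gate.
rate, ok = bias_check([True] * 498 + [False] * 502)
assert ok and abs(rate - 0.5) < 0.02
```

In a real suite, the outcomes would come from running the generator under test with a fixed configuration, so the gate is reproducible.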

## Calibrating Scenario Count

Aim for 2+ scenarios per core module (the modules identified as most complex or fragile). For a medium-sized project, this typically yields 8–10 scenarios. Fewer is fine for small projects; more for complex ones. If you're finding very few scenarios, it usually means the exploration was shallow rather than the project being simple — go back and read function bodies more carefully. Quality matters more than count: one scenario that precisely captures an architectural vulnerability is worth more than three generic "what if the input is bad" scenarios.

## Self-Critique Before Finishing

After drafting all scenarios, review each one and ask:

1. **"Would an AI session argue this standard down?"** If yes, the "why" isn't concrete enough. Add numbers, consequences, and detection difficulty.
2. **"Does the 'What happened' read like a vulnerability analysis or an abstract spec?"** If it reads like a spec, rewrite it with specific quantities, cascading consequences, and grounding in actual code.
3. **"Is there a scenario I'm not seeing?"** Think about what a different AI model would flag. Architecture models catch data flow problems. Edge-case models catch boundary conditions. What are you blind to?

## Critical Rule

Each scenario's "How to verify" section must map to at least one automated test in the functional test file. If a scenario can't be automated, note why (it may require the Human Gate) — but most scenarios should be testable.