Mirror of https://github.com/github/awesome-copilot.git (synced 2026-04-13 19:55:56 +00:00)

Commit: Add quality-playbook skill (#1168)
File: skills/quality-playbook/references/constitution.md (new file, 160 lines)
# Writing the Quality Constitution (File 1: QUALITY.md)

The quality constitution defines what "quality" means for this specific project and makes the bar explicit, persistent, and inherited by every AI session.

## Template

```markdown

# Quality Constitution: [Project Name]

## Purpose

[2–3 paragraphs grounding quality in three principles:]

- **Deming** ("quality is built in, not inspected in") — Quality is built into context files and the quality playbook so every AI session inherits the same bar.
- **Juran** ("fitness for use") — Define fitness specifically for this project. Not "tests pass" but the actual real-world requirement. Example: "generates correct output that survives input schema changes without silently producing wrong results."
- **Crosby** ("quality is free") — Building a quality playbook upfront costs less than debugging problems found after deployment.

## Coverage Targets

| Subsystem | Target | Why |
|-----------|--------|-----|
| [Most fragile module] | 90–95% | [Real edge case or past bug] |
| [Core logic module] | 85–90% | [Concrete risk] |
| [I/O or integration layer] | 80% | [Explain] |
| [Configuration/utilities] | 75–80% | [Explain] |

The rationale column is essential. It must reference specific risks or past failures. If you can't explain why a subsystem needs high coverage with a concrete example, the target is arbitrary.

## Coverage Theater Prevention

[Define what constitutes a fake test for this project.]

Generic examples that apply to most projects:

- Asserting a function returned *something* without checking what
- Testing with synthetic data that lacks the quirks of real data
- Asserting an import succeeded
- Asserting that a mock returns what it was configured to return
- Calling a function and only asserting that no exception was thrown

[Add project-specific examples based on what you learned during exploration. For a data pipeline: "counting output records without checking their values." For a web app: "checking HTTP 200 without checking the response body." For a compiler: "checking the output compiles without checking behavior."]

## Fitness-to-Purpose Scenarios

[5–10 scenarios. Every scenario must include a `[Req: tier — source]` tag linking it to its requirement source. Use the template below:]

### Scenario N: [Memorable Name]

**Requirement tag:** [Req: formal — Spec §X] *(or `user-confirmed` / `inferred` — see SKILL.md Phase 1, Step 1 for tier definitions)*

**What happened:** [The architectural vulnerability, edge case, or design decision. Reference actual code — function names, file names, line numbers. Frame as "this architecture permits the following failure mode."]

**The requirement:** [What the code must do to prevent this failure. Be specific enough that an AI can verify it.]

**How to verify:** [Concrete test or query that would fail if this regressed. Include exact commands, test names, or assertions.]

---

[Repeat for each scenario]

## AI Session Quality Discipline

1. Read QUALITY.md before starting work.
2. Run the full test suite before marking any task complete.
3. Add tests for new functionality (not just the happy path — include edge cases).
4. Update this file if new failure modes are discovered.
5. Output a Quality Compliance Checklist before ending a session.
6. Never remove a fitness-to-purpose scenario. Only add new ones.

## The Human Gate

[List things that require human judgment:]

- Output that "looks right" (requires domain knowledge)
- UX and responsiveness
- Documentation accuracy
- Security review of auth changes
- Backward compatibility decisions
```
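The coverage-theater bullets in the template can be made concrete with a short contrast. A minimal sketch in Python, where `parse_record` is a hypothetical function invented for illustration:

```python
# Hypothetical function under test: parses "name,age" into a dict.
def parse_record(line):
    name, age = line.split(",")
    return {"name": name.strip(), "age": int(age)}

# Coverage theater: executes the code but checks almost nothing.
def test_parse_record_fake():
    result = parse_record("Ada,36")
    assert result is not None          # asserts *something* came back
    assert isinstance(result, dict)    # passes even if every value is wrong

# A real test: pins exact values and covers a realistic data quirk.
def test_parse_record_real():
    assert parse_record("Ada,36") == {"name": "Ada", "age": 36}
    # Real data has stray whitespace; clean synthetic data would hide this.
    assert parse_record("  Ada , 36")["name"] == "Ada"
```

The fake test earns coverage and passes even if `parse_record` returns the wrong values; only the second test ties the code to the requirement.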

## Where Scenarios Come From

Scenarios come from two sources — **code exploration** and **domain knowledge** — and the best scenarios combine both.

### Source 1: Defensive Code Patterns (Code Exploration)

Every defensive pattern is evidence of a past failure or known risk:

1. **Defensive code** — Every `if value is None: return` guard is a scenario. Why was it needed?
2. **Normalization functions** — Every function that cleans input exists because raw input caused problems.
3. **Configuration that could be hardcoded** — If a value is read from config instead of hardcoded, someone learned the value varies.
4. **Git blame / commit messages** — "Fix crash when X is missing" → Scenario: X can be missing.
5. **Comments explaining "why"** — "We use hash(id) not sequential index because..." → Scenario about correctness under that constraint.
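Pattern 1 in practice: a guard like the following is a scenario waiting to be written down. The function and order shape here are hypothetical, for illustration only:

```python
# A defensive guard found during exploration (hypothetical module).
def total_amount(order):
    if order.get("items") is None:   # guard: why can items be None?
        return 0.0                   # -> Scenario: orders can arrive with no items
    return sum(i["price"] * i["qty"] for i in order["items"])
```

The guard itself answers "what"; the scenario you write must answer "why": under what real conditions does an order arrive without items, and what must the system do when it does?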

### Source 2: What Could Go Wrong (Domain Knowledge)

Don't limit yourself to what the code already defends against. Use your knowledge of similar systems to generate realistic failure scenarios that the code **should** handle. For every major subsystem, ask:

- "What happens if this process is killed mid-operation?" (state machines, file I/O, batch processing)
- "What happens if external input is subtly wrong?" (validation pipelines, API integrations)
- "What happens if this runs at 10x scale?" (batch processing, databases, queues)
- "What happens if two operations overlap?" (concurrency, file locks, shared state)
- "What produces correct-looking output that is actually wrong?" (randomness, statistical operations, type coercion)

These are not hypothetical — they are things that happen to every system of this type. Write them as **architectural vulnerability analyses**: "Because `save_state()` lacks an atomic rename pattern, a mid-write crash during a 10,000-record batch will leave a corrupted state file — the next run gets JSONDecodeError and cannot resume without manual intervention. At scale (9,240 records across 64 batches), this pattern risks silent loss of 1,693+ records with nothing to flag them as missing." Concrete numbers and specific consequences make scenarios authoritative and non-negotiable. An AI session reading "records can be lost" will argue the standard down. An AI session reading a specific failure mode with quantified impact will not.
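The mid-write crash failure mode is easy to reproduce. A minimal sketch, with an invented state layout standing in for whatever the project persists:

```python
import json

# Simulate the failure mode: a crash mid-write leaves a truncated file.
state = {"processed_ids": list(range(10_000))}
full = json.dumps(state)
truncated = full[: len(full) // 2]   # process killed halfway through the write

try:
    json.loads(truncated)
    recovered = True
except json.JSONDecodeError:
    recovered = False                # next run cannot resume without manual repair

assert recovered is False
```

This is exactly the "cannot resume without manual intervention" consequence: the half-written file is not merely stale, it is unparseable.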

### The Narrative Voice

Each scenario's "What happened" must read like an architectural vulnerability analysis, not an abstract specification. Include:

- **Specific quantities** — "308 records across 64 batches" not "some records"
- **Cascade consequences** — "cascading through all subsequent pipeline steps, requiring reprocessing of 4,300 records instead of 308"
- **Detection difficulty** — "nothing would flag them as missing" or "only statistical verification would catch it"
- **Root cause in code** — "`random.seed(index)` creates correlated sequences because sequential integers produce related random streams"

The narrative voice serves a critical purpose: it makes standards non-negotiable. Abstract requirements ("records should not be lost") invite rationalization. Specific failure modes with quantified impact ("a mid-batch crash silently loses 1,693 records with no detection mechanism") do not. Frame these as "this architecture permits the following failure" — grounded in the actual code, not fabricated as past incidents.

### Combining Both Sources

The strongest scenarios combine a defensive pattern found in the code with domain knowledge about why it matters:

1. Find the defensive code: `save_state()` writes to a temp file, then renames it.
2. Ask what failure this prevents: a mid-write crash leaves a corrupted state file.
3. Write the scenario as a vulnerability analysis: "Without the atomic rename pattern, a crash mid-write leaves state.json 50% complete. The next run gets JSONDecodeError and cannot resume without manual intervention."
4. Ground it in code: "Read persistence.py line ~340: verify the temp-file + rename pattern."
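The defensive pattern in step 1 can be sketched as follows. This is a generic illustration of the atomic-rename idiom, not the project's actual `save_state()`; the function name and JSON payload are assumptions:

```python
import json
import os
import tempfile

def save_state_atomic(path, state):
    # Atomic rename pattern: write the full payload to a temp file in the
    # same directory, then replace the target in one step. Readers see
    # either the old complete file or the new complete file, never a
    # partially written one.
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(json.dumps(state))
            f.flush()
            os.fsync(f.fileno())    # ensure the bytes are on disk first
        os.replace(tmp_path, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)
        raise
```

The temp file must live in the same directory as the target: `os.replace` is only atomic within a single filesystem.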

### The "Why" Requirement

Every coverage target, every quality gate, every standard must have a "why" that references a specific scenario or risk. Without rationale, a future AI session will optimize for speed and argue the standard down.

Bad: "Core logic: 100% coverage"

Good: "Core logic: 100% — because `random.seed(index)` created correlated sequences that produced 77.5% bias instead of 50/50. Subtle bugs here produce plausible-but-wrong output. Only statistical verification catches them."

The "why" is not documentation — it is protection against erosion.
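The "only statistical verification catches them" point can be made concrete with a distribution gate. A minimal sketch; the rates and tolerance below are illustrative, not the project's actual thresholds:

```python
def bias_check(outcomes, tolerance=0.02):
    # Statistical gate for a 50/50 requirement: return the observed rate
    # and whether it sits within tolerance. A "no exception was thrown"
    # test would never catch the 77.5%-bias failure described above.
    rate = sum(1 for o in outcomes if o) / len(outcomes)
    return rate, abs(rate - 0.5) <= tolerance

# The failure mode: plausible-looking output with the wrong distribution.
rate, ok = bias_check([True] * 775 + [False] * 225)
assert (rate, ok) == (0.775, False)

# A healthy distribution passes the same gate.
rate, ok = bias_check([True] * 498 + [False] * 502)
assert ok and abs(rate - 0.5) < 0.02
```

In a real suite, the outcomes would come from running the generator under test with a fixed configuration, so the gate is reproducible.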

## Calibrating Scenario Count

Aim for 2+ scenarios per core module (the modules identified as most complex or fragile). For a medium-sized project, this typically yields 8–10 scenarios. Fewer is fine for small projects; more for complex ones. If you're finding very few scenarios, it usually means the exploration was shallow rather than the project being simple — go back and read function bodies more carefully. Quality matters more than count: one scenario that precisely captures an architectural vulnerability is worth more than three generic "what if the input is bad" scenarios.

## Self-Critique Before Finishing

After drafting all scenarios, review each one and ask:

1. **"Would an AI session argue this standard down?"** If yes, the "why" isn't concrete enough. Add numbers, consequences, and detection difficulty.
2. **"Does the 'What happened' read like a vulnerability analysis or an abstract spec?"** If it reads like a spec, rewrite it with specific quantities, cascading consequences, and grounding in actual code.
3. **"Is there a scenario I'm not seeing?"** Think about what a different AI model would flag. Architecture models catch data flow problems. Edge-case models catch boundary conditions. What are you blind to?

## Critical Rule

Each scenario's "How to verify" section must map to at least one automated test in the functional test file. If a scenario can't be automated, note why (it may require the Human Gate) — but most scenarios should be testable.