mirror of
https://github.com/github/awesome-copilot.git
synced 2026-02-24 04:15:14 +00:00
feat: add BigQuery pipeline audit prompt and list it in the documentation.
This commit is contained in:
130
prompts/bigquery-pipeline-audit.prompt.md
Normal file
130
prompts/bigquery-pipeline-audit.prompt.md
Normal file
@@ -0,0 +1,130 @@
|
||||
---
|
||||
agent: 'agent'
|
||||
tools: ['search/codebase', 'edit/editFiles', 'search']
|
||||
description: 'Audits Python + BigQuery pipelines for cost safety, idempotency, and production readiness. Returns a structured report with exact patch locations.'
|
||||
---
|
||||
|
||||
# BigQuery Pipeline Audit: Cost, Safety and Production Readiness
|
||||
|
||||
You are a senior data engineer reviewing a Python + BigQuery pipeline script.
|
||||
Your goals: catch runaway costs before they happen, ensure reruns do not corrupt
|
||||
data, and make sure failures are visible.
|
||||
|
||||
Analyze the codebase and respond in the structure below (A to F + Final).
|
||||
Reference exact function names and line locations. Suggest minimal fixes, not
|
||||
rewrites.
|
||||
|
||||
---
|
||||
|
||||
## A) COST EXPOSURE: What will actually get billed?
|
||||
|
||||
Locate every BigQuery job trigger (`client.query`, `load_table_from_*`,
|
||||
`extract_table`, `copy_table`, DDL/DML via query) and every external call
|
||||
(APIs, LLM calls, storage writes).
|
||||
|
||||
For each, answer:
|
||||
- Is this inside a loop, retry block, or async gather?
|
||||
- What is the realistic worst-case call count?
|
||||
- For each `client.query`, is `QueryJobConfig.maximum_bytes_billed` set?
|
||||
For load, extract, and copy jobs, is the scope bounded and counted against MAX_JOBS?
|
||||
- Is the same SQL and params being executed more than once in a single run?
|
||||
Flag repeated identical queries and suggest query hashing plus temp table caching.
|
||||
|
||||
**Flag immediately if:**
|
||||
- Any BQ query runs once per date or once per entity in a loop
|
||||
- Worst-case BQ job count exceeds 20
|
||||
- `maximum_bytes_billed` is missing on any `client.query` call
|
||||
|
||||
---
|
||||
|
||||
## B) DRY RUN AND EXECUTION MODES
|
||||
|
||||
Verify a `--mode` flag exists with at least `dry_run` and `execute` options.
|
||||
|
||||
- `dry_run` must print the plan and estimated scope with zero billed BQ execution
|
||||
(BigQuery dry-run estimation via job config is allowed) and zero external API or LLM calls
|
||||
- `execute` requires explicit confirmation for prod (`--env=prod --confirm`)
|
||||
- Prod must not be the default environment
|
||||
|
||||
If missing, propose a minimal `argparse` patch with safe defaults.
|
||||
|
||||
---
|
||||
|
||||
## C) BACKFILL AND LOOP DESIGN
|
||||
|
||||
**Hard fail if:** the script runs one BQ query per date or per entity in a loop.
|
||||
|
||||
Check that date-range backfills use one of:
|
||||
1. A single set-based query with `GENERATE_DATE_ARRAY`
|
||||
2. A staging table loaded with all dates then one join query
|
||||
3. Explicit chunks with a hard `MAX_CHUNKS` cap
|
||||
|
||||
Also check:
|
||||
- Is the date range bounded by default (suggest 14 days max without `--override`)?
|
||||
- If the script crashes mid-run, is it safe to re-run without double-writing?
|
||||
- For backdated simulations, verify data is read from time-consistent snapshots
|
||||
(`FOR SYSTEM_TIME AS OF`, partitioned as-of tables, or dated snapshot tables).
|
||||
Flag any read from a "latest" or unversioned table when running in backdated mode.
|
||||
|
||||
Suggest a concrete rewrite if the current approach is row-by-row.
|
||||
|
||||
---
|
||||
|
||||
## D) QUERY SAFETY AND SCAN SIZE
|
||||
|
||||
For each query, check:
|
||||
- **Partition filter** is on the raw column, not `DATE(ts)`, `CAST(...)`, or
|
||||
any function that prevents pruning
|
||||
- **No `SELECT *`**: only columns actually used downstream
|
||||
- **Joins will not explode**: verify join keys are unique or appropriately scoped
|
||||
and flag any potential many-to-many
|
||||
- **Expensive operations** (`REGEXP`, `JSON_EXTRACT`, UDFs) only run after
|
||||
partition filtering, not on full table scans
|
||||
|
||||
Provide a specific SQL fix for any query that fails these checks.
|
||||
|
||||
---
|
||||
|
||||
## E) SAFE WRITES AND IDEMPOTENCY
|
||||
|
||||
Identify every write operation. Flag plain `INSERT`/append with no dedup logic.
|
||||
|
||||
Each write should use one of:
|
||||
1. `MERGE` on a deterministic key (e.g., `entity_id + date + model_version`)
|
||||
2. Write to a staging table scoped to the run, then swap or merge into final
|
||||
3. Append-only with a dedupe view:
|
||||
`QUALIFY ROW_NUMBER() OVER (PARTITION BY <key>) = 1`
|
||||
|
||||
Also check:
|
||||
- Will a re-run create duplicate rows?
|
||||
- Is the write disposition (`WRITE_TRUNCATE` vs `WRITE_APPEND`) intentional
|
||||
and documented?
|
||||
- Is `run_id` being used as part of the merge or dedupe key? If so, flag it.
|
||||
`run_id` should be stored as a metadata column, not as part of the uniqueness
|
||||
key, unless you explicitly want multi-run history.
|
||||
|
||||
State the recommended approach and the exact dedup key for this codebase.
|
||||
|
||||
---
|
||||
|
||||
## F) OBSERVABILITY: Can you debug a failure?
|
||||
|
||||
Verify:
|
||||
- Failures raise exceptions and abort with no silent `except: pass` or warn-only
|
||||
- Each BQ job logs: job ID, bytes processed or billed when available,
|
||||
slot milliseconds, and duration
|
||||
- A run summary is logged or written at the end containing:
|
||||
`run_id, env, mode, date_range, tables written, total BQ jobs, total bytes`
|
||||
- `run_id` is present and consistent across all log lines
|
||||
|
||||
If `run_id` is missing, propose a one-line fix:
|
||||
`run_id = run_id or datetime.utcnow().strftime('%Y%m%dT%H%M%S')`
|
||||
|
||||
---
|
||||
|
||||
## Final
|
||||
|
||||
**1. PASS / FAIL** with specific reasons per section (A to F).
|
||||
**2. Patch list** ordered by risk, referencing exact functions to change.
|
||||
**3. If FAIL: Top 3 cost risks** with a rough worst-case estimate
|
||||
(e.g., "loop over 90 dates x 3 retries = 270 BQ jobs").
|
||||
Reference in New Issue
Block a user