diff --git a/docs/README.prompts.md b/docs/README.prompts.md index c618c44c..190fd5e7 100644 --- a/docs/README.prompts.md +++ b/docs/README.prompts.md @@ -31,6 +31,7 @@ Ready-to-use prompt templates for specific development scenarios and tasks, defi | [Azure Cosmos DB NoSQL Data Modeling Expert System Prompt](../prompts/cosmosdb-datamodeling.prompt.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fcosmosdb-datamodeling.prompt.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode-insiders%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fcosmosdb-datamodeling.prompt.md) | Step-by-step guide for capturing key application requirements for NoSQL use-case and produce Azure Cosmos DB Data NoSQL Model design using best practices and common patterns, artifacts_produced: "cosmosdb_requirements.md" file and "cosmosdb_data_model.md" file | | [Azure Cost Optimize](../prompts/az-cost-optimize.prompt.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Faz-cost-optimize.prompt.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode-insiders%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Faz-cost-optimize.prompt.md) | Analyze Azure resources used in the app (IaC files and/or resources in a target rg) and optimize costs - creating GitHub issues for identified optimizations. | | [Azure Resource Health & Issue Diagnosis](../prompts/azure-resource-health-diagnose.prompt.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fazure-resource-health-diagnose.prompt.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode-insiders%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fazure-resource-health-diagnose.prompt.md) | Analyze Azure resource health, diagnose issues from logs and telemetry, and create a remediation plan for identified problems. | +| [BigQuery Pipeline Audit: Cost, Safety and Production Readiness](../prompts/bigquery-pipeline-audit.prompt.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fbigquery-pipeline-audit.prompt.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode-insiders%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fbigquery-pipeline-audit.prompt.md) | Audits Python + BigQuery pipelines for cost safety, idempotency, and production readiness. Returns a structured report with exact patch locations. | | [Boost Prompt](../prompts/boost-prompt.prompt.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fboost-prompt.prompt.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode-insiders%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fboost-prompt.prompt.md) | Interactive prompt refinement workflow: interrogates scope, deliverables, constraints; copies final markdown to clipboard; never writes code. Requires the Joyride extension. | | [C# Async Programming Best Practices](../prompts/csharp-async.prompt.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fcsharp-async.prompt.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode-insiders%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fcsharp-async.prompt.md) | Get best practices for C# async programming | | [C# Documentation Best Practices](../prompts/csharp-docs.prompt.md)
[![Install in VS Code](https://img.shields.io/badge/VS_Code-Install-0098FF?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fcsharp-docs.prompt.md)
[![Install in VS Code Insiders](https://img.shields.io/badge/VS_Code_Insiders-Install-24bfa5?style=flat-square&logo=visualstudiocode&logoColor=white)](https://aka.ms/awesome-copilot/install/prompt?url=vscode-insiders%3Achat-prompt%2Finstall%3Furl%3Dhttps%3A%2F%2Fraw.githubusercontent.com%2Fgithub%2Fawesome-copilot%2Fmain%2Fprompts%2Fcsharp-docs.prompt.md) | Ensure that C# types are documented with XML comments and follow best practices for documentation. | diff --git a/prompts/bigquery-pipeline-audit.prompt.md b/prompts/bigquery-pipeline-audit.prompt.md new file mode 100644 index 00000000..5031bee5 --- /dev/null +++ b/prompts/bigquery-pipeline-audit.prompt.md @@ -0,0 +1,130 @@ +--- +agent: 'agent' +tools: ['search/codebase', 'edit/editFiles', 'search'] +description: 'Audits Python + BigQuery pipelines for cost safety, idempotency, and production readiness. Returns a structured report with exact patch locations.' +--- + +# BigQuery Pipeline Audit: Cost, Safety and Production Readiness + +You are a senior data engineer reviewing a Python + BigQuery pipeline script. +Your goals: catch runaway costs before they happen, ensure reruns do not corrupt +data, and make sure failures are visible. + +Analyze the codebase and respond in the structure below (A to F + Final). +Reference exact function names and line locations. Suggest minimal fixes, not +rewrites. + +--- + +## A) COST EXPOSURE: What will actually get billed? + +Locate every BigQuery job trigger (`client.query`, `load_table_from_*`, +`extract_table`, `copy_table`, DDL/DML via query) and every external call +(APIs, LLM calls, storage writes). + +For each, answer: +- Is this inside a loop, retry block, or async gather? +- What is the realistic worst-case call count? +- For each `client.query`, is `QueryJobConfig.maximum_bytes_billed` set? + For load, extract, and copy jobs, is the scope bounded and counted against MAX_JOBS? +- Is the same SQL and params being executed more than once in a single run? 
+ Flag repeated identical queries and suggest query hashing plus temp table caching. + +**Flag immediately if:** +- Any BQ query runs once per date or once per entity in a loop +- Worst-case BQ job count exceeds 20 +- `maximum_bytes_billed` is missing on any `client.query` call + +--- + +## B) DRY RUN AND EXECUTION MODES + +Verify a `--mode` flag exists with at least `dry_run` and `execute` options. + +- `dry_run` must print the plan and estimated scope with zero billed BQ execution + (BigQuery dry-run estimation via job config is allowed) and zero external API or LLM calls +- `execute` requires explicit confirmation for prod (`--env=prod --confirm`) +- Prod must not be the default environment + +If missing, propose a minimal `argparse` patch with safe defaults. + +--- + +## C) BACKFILL AND LOOP DESIGN + +**Hard fail if:** the script runs one BQ query per date or per entity in a loop. + +Check that date-range backfills use one of: +1. A single set-based query with `GENERATE_DATE_ARRAY` +2. A staging table loaded with all dates then one join query +3. Explicit chunks with a hard `MAX_CHUNKS` cap + +Also check: +- Is the date range bounded by default (suggest 14 days max without `--override`)? +- If the script crashes mid-run, is it safe to re-run without double-writing? +- For backdated simulations, verify data is read from time-consistent snapshots + (`FOR SYSTEM_TIME AS OF`, partitioned as-of tables, or dated snapshot tables). + Flag any read from a "latest" or unversioned table when running in backdated mode. + +Suggest a concrete rewrite if the current approach is row-by-row. 
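As an illustration of option 3 in section C (explicit chunks with a hard `MAX_CHUNKS` cap and a bounded default range), here is a minimal sketch using only the standard library. The names `plan_chunks`, `chunk_days`, `DEFAULT_MAX_DAYS`, and the cap of 10 are hypothetical and not part of the prompt; only the 14-day default and the `MAX_CHUNKS` idea come from the text above.

```python
from datetime import date, timedelta

MAX_CHUNKS = 10        # hypothetical hard cap on chunks (one BQ job per chunk)
DEFAULT_MAX_DAYS = 14  # default bound on the date range, as suggested in section C


def plan_chunks(start, end, chunk_days=7, override=False):
    """Split the inclusive range [start, end] into bounded date chunks.

    Raises ValueError instead of silently running an unbounded backfill.
    """
    if end < start:
        raise ValueError("end date is before start date")
    if chunk_days < 1:
        raise ValueError("chunk_days must be at least 1")

    total_days = (end - start).days + 1
    if total_days > DEFAULT_MAX_DAYS and not override:
        raise ValueError(
            f"range of {total_days} days exceeds {DEFAULT_MAX_DAYS}; pass override=True"
        )

    chunks = []
    cursor = start
    while cursor <= end:
        chunk_end = min(cursor + timedelta(days=chunk_days - 1), end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end + timedelta(days=1)

    # The cap applies even when the range check was overridden.
    if len(chunks) > MAX_CHUNKS:
        raise ValueError(f"{len(chunks)} chunks exceeds MAX_CHUNKS={MAX_CHUNKS}")
    return chunks


plan = plan_chunks(date(2024, 1, 1), date(2024, 1, 14))
```

Each returned `(chunk_start, chunk_end)` pair would then drive one set-based query over that sub-range, so the job count is known, and capped, before anything is billed.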
+ +--- + +## D) QUERY SAFETY AND SCAN SIZE + +For each query, check: +- **Partition filter** is on the raw column, not `DATE(ts)`, `CAST(...)`, or + any function that prevents pruning +- **No `SELECT *`**: only columns actually used downstream +- **Joins will not explode**: verify join keys are unique or appropriately scoped + and flag any potential many-to-many +- **Expensive operations** (`REGEXP`, `JSON_EXTRACT`, UDFs) only run after + partition filtering, not on full table scans + +Provide a specific SQL fix for any query that fails these checks. + +--- + +## E) SAFE WRITES AND IDEMPOTENCY + +Identify every write operation. Flag plain `INSERT`/append with no dedup logic. + +Each write should use one of: +1. `MERGE` on a deterministic key (e.g., `entity_id + date + model_version`) +2. Write to a staging table scoped to the run, then swap or merge into final +3. Append-only with a dedupe view: + `QUALIFY ROW_NUMBER() OVER (PARTITION BY <key>) = 1` + +Also check: +- Will a re-run create duplicate rows? +- Is the write disposition (`WRITE_TRUNCATE` vs `WRITE_APPEND`) intentional + and documented? +- Is `run_id` being used as part of the merge or dedupe key? If so, flag it. + `run_id` should be stored as a metadata column, not as part of the uniqueness + key, unless you explicitly want multi-run history. + +State the recommended approach and the exact dedup key for this codebase. + +--- + +## F) OBSERVABILITY: Can you debug a failure? + +Verify: +- Failures raise exceptions and abort, with no silent `except: pass` or warn-only handling +- Each BQ job logs: job ID, bytes processed or billed when available, + slot milliseconds, and duration +- A run summary is logged or written at the end containing: + `run_id, env, mode, date_range, tables written, total BQ jobs, total bytes` +- `run_id` is present and consistent across all log lines + +If `run_id` is missing, propose a one-line fix: +`run_id = run_id or datetime.utcnow().strftime('%Y%m%dT%H%M%S')` + +--- + +## Final + +**1.
PASS / FAIL** with specific reasons per section (A to F). +**2. Patch list** ordered by risk, referencing exact functions to change. +**3. If FAIL: Top 3 cost risks** with a rough worst-case estimate +(e.g., "loop over 90 dates x 3 retries = 270 BQ jobs").
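
The rough worst-case estimate asked for in item 3 is plain multiplication. A small sketch, assuming a hypothetical helper `worst_case_jobs` and treating the "3 retries" in the example as three attempts per iteration; the threshold of 20 comes from the flag in section A:

```python
JOB_COUNT_THRESHOLD = 20  # section A flags a worst-case BQ job count above 20


def worst_case_jobs(loop_iterations, jobs_per_iteration=1, max_attempts=1):
    """Upper bound on BQ jobs, assuming every iteration uses all its attempts."""
    return loop_iterations * jobs_per_iteration * max_attempts


# The example from item 3: a loop over 90 dates with 3 attempts each.
estimate = worst_case_jobs(loop_iterations=90, max_attempts=3)
print(estimate, estimate > JOB_COUNT_THRESHOLD)  # prints "270 True"
```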