Fabric Lakehouse Skill

This is a new skill for the Copilot agent to work with Fabric Lakehouse
This commit is contained in:
Ted Vilutis
2026-02-16 18:18:48 -08:00
parent fc0ffa8cb3
commit 2dcc97df98
4 changed files with 330 additions and 0 deletions

View File

@@ -35,6 +35,7 @@ Skills differ from other primitives by supporting bundled assets (scripts, code
| [copilot-sdk](../skills/copilot-sdk/SKILL.md) | Build agentic applications with GitHub Copilot SDK. Use when embedding AI agents in apps, creating custom tools, implementing streaming responses, managing sessions, connecting to MCP servers, or creating custom agents. Triggers on Copilot SDK, GitHub SDK, agentic app, embed Copilot, programmable agent, MCP server, custom agent. | None |
| [create-web-form](../skills/create-web-form/SKILL.md) | Create robust, accessible web forms with best practices for HTML structure, CSS styling, JavaScript interactivity, form validation, and server-side processing. Use when asked to "create a form", "build a web form", "add a contact form", "make a signup form", or when building any HTML form with data handling. Covers PHP and Python backends, MySQL database integration, REST APIs, XML data exchange, accessibility (ARIA), and progressive web apps. | `references/accessibility.md`<br />`references/aria-form-role.md`<br />`references/css-styling.md`<br />`references/form-basics.md`<br />`references/form-controls.md`<br />`references/form-data-handling.md`<br />`references/html-form-elements.md`<br />`references/html-form-example.md`<br />`references/hypertext-transfer-protocol.md`<br />`references/javascript.md`<br />`references/php-cookies.md`<br />`references/php-forms.md`<br />`references/php-json.md`<br />`references/php-mysql-database.md`<br />`references/progressive-web-app.md`<br />`references/python-as-web-framework.md`<br />`references/python-contact-form.md`<br />`references/python-flask-app.md`<br />`references/python-flask.md`<br />`references/security.md`<br />`references/styling-web-forms.md`<br />`references/web-api.md`<br />`references/web-performance.md`<br />`references/xml.md` |
| [excalidraw-diagram-generator](../skills/excalidraw-diagram-generator/SKILL.md) | Generate Excalidraw diagrams from natural language descriptions. Use when asked to "create a diagram", "make a flowchart", "visualize a process", "draw a system architecture", "create a mind map", or "generate an Excalidraw file". Supports flowcharts, relationship diagrams, mind maps, and system architecture diagrams. Outputs .excalidraw JSON files that can be opened directly in Excalidraw. | `references/element-types.md`<br />`references/excalidraw-schema.md`<br />`scripts/.gitignore`<br />`scripts/README.md`<br />`scripts/add-arrow.py`<br />`scripts/add-icon-to-diagram.py`<br />`scripts/split-excalidraw-library.py`<br />`templates/business-flow-swimlane-template.excalidraw`<br />`templates/class-diagram-template.excalidraw`<br />`templates/data-flow-diagram-template.excalidraw`<br />`templates/er-diagram-template.excalidraw`<br />`templates/flowchart-template.excalidraw`<br />`templates/mindmap-template.excalidraw`<br />`templates/relationship-template.excalidraw`<br />`templates/sequence-diagram-template.excalidraw` |
| [fabric-lakehouse](../skills/fabric-lakehouse/SKILL.md) | Provide definition and context about Fabric Lakehouse and its capabilities for software systems and AI-powered features. Help users design, build, and optimize Lakehouse solutions using best practices. | `references/getdata.md`<br />`references/pyspark.md` |
| [finnish-humanizer](../skills/finnish-humanizer/SKILL.md) | Detect and remove AI-generated markers from Finnish text, making it sound like a native Finnish speaker wrote it. Use when asked to "humanize", "naturalize", or "remove AI feel" from Finnish text, or when editing .md/.txt files containing Finnish content. Identifies 26 patterns (12 Finnish-specific + 14 universal) and 4 style markers. | `references/patterns.md` |
| [gh-cli](../skills/gh-cli/SKILL.md) | GitHub CLI (gh) comprehensive reference for repositories, issues, pull requests, Actions, projects, releases, gists, codespaces, organizations, extensions, and all GitHub operations from the command line. | None |
| [git-commit](../skills/git-commit/SKILL.md) | Execute git commit with conventional commit message analysis, intelligent staging, and message generation. Use when user asks to commit changes, create a git commit, or mentions "/commit". Supports: (1) Auto-detecting type and scope from changes, (2) Generating conventional commit messages from diff, (3) Interactive commit with optional type/scope/description overrides, (4) Intelligent file staging for logical grouping | None |

View File

@@ -0,0 +1,106 @@
---
name: fabric-lakehouse
description: 'Provide definition and context about Fabric Lakehouse and its capabilities for software systems and AI-powered features. Help users design, build, and optimize Lakehouse solutions using best practices.'
metadata:
author: tedvilutis
version: "1.0"
---
# When to Use This Skill
Use this skill when you need to:
- Generate a document or explanation that includes the definition and context of Fabric Lakehouse and its capabilities.
- Design, build, and optimize Lakehouse solutions using best practices.
- Understand the core concepts and components of a Lakehouse in Microsoft Fabric.
- Learn how to manage tabular and non-tabular data within a Lakehouse.
# Fabric Lakehouse
## Core Concepts
### What is a Lakehouse?
A Lakehouse in Microsoft Fabric is an item that gives users a place to store both tabular data, such as tables, and non-tabular data, such as files. It combines the flexibility of a data lake with the management capabilities of a data warehouse. It provides:
- **Unified storage** in OneLake for structured and unstructured data
- **Delta Lake format** for ACID transactions, versioning, and time travel
- **SQL analytics endpoint** for T-SQL queries
- **Semantic model** for Power BI integration
- Support for other table formats such as CSV and Parquet
- Support for any file format
- Tools for table optimization and data management
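The Delta Lake bullet above mentions versioning and time travel; here is a minimal, hedged sketch (assuming a Fabric notebook where `spark` is already provided and a hypothetical Delta table named `sales_orders`):
```python
# Read the current version of the table
df_current = spark.read.table("sales_orders")

# Delta time travel: read an earlier version by version number
df_v1 = spark.sql("SELECT * FROM sales_orders VERSION AS OF 1")

# Delta time travel: read the table as it was at a point in time
df_old = spark.sql("SELECT * FROM sales_orders TIMESTAMP AS OF '2024-01-01'")
```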
### Key Components
- **Delta Tables**: Managed tables with ACID compliance and schema enforcement
- **Files**: Unstructured/semi-structured data in the Files section
- **SQL Endpoint**: Auto-generated read-only SQL interface for querying
- **Shortcuts**: Virtual links to external/internal data without copying
- **Fabric Materialized Views**: Pre-computed tables for fast query performance
### Tabular data in a Lakehouse
Tabular data is stored as tables under the "Tables" folder. The main format for Lakehouse tables is Delta. A Lakehouse can also store tabular data in other formats such as CSV or Parquet, but those formats are available only for Spark querying.
Tables can be internal, where the data is stored under the "Tables" folder, or external, where only a reference to the table is stored under the "Tables" folder and the data itself lives in the referenced location. Tables are referenced through Shortcuts, which can be internal, pointing to another location in Fabric, or external, pointing to data stored outside of Fabric.
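A short, hedged PySpark sketch of the difference (table and path names are hypothetical; `spark` is the session provided in a Fabric notebook):
```python
# Sample data for illustration
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

# Managed Delta table: appears under "Tables" and is available to the
# SQL analytics endpoint and semantic model
df.write.format("delta").mode("overwrite").saveAsTable("customers")

# Parquet copy written under "Files": still tabular, but per the note above
# it is queryable only through Spark, not the SQL endpoint
df.write.mode("overwrite").parquet("Files/staging/customers_parquet")
df_parquet = spark.read.parquet("Files/staging/customers_parquet")
```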
### Schemas for tables in a Lakehouse
When creating a lakehouse, the user can choose to enable schemas. Schemas are used to organize Lakehouse tables: they are implemented as folders under the "Tables" folder, with tables stored inside those folders. The default schema is "dbo", and it cannot be deleted or renamed. All other schemas are optional and can be created, renamed, or deleted. A user can reference a schema located in another lakehouse through a Schema Shortcut, which references all tables in the destination schema with a single shortcut.
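For a schema-enabled lakehouse, a minimal sketch of creating a schema and writing a table into it (the `sales` schema and `orders` table are hypothetical):
```python
# Create an optional schema alongside the default "dbo"
spark.sql("CREATE SCHEMA IF NOT EXISTS sales")

# Write a managed Delta table into that schema (stored under Tables/sales)
df = spark.createDataFrame([(1001, 499.00)], ["order_id", "amount"])
df.write.format("delta").mode("overwrite").saveAsTable("sales.orders")

# Query it with the schema-qualified name
spark.sql("SELECT COUNT(*) AS order_count FROM sales.orders").show()
```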
### Files in a Lakehouse
Files are stored under the "Files" folder. Users can create folders and subfolders to organize their files. Any file format can be stored in a Lakehouse.
### Fabric Materialized Views
A set of pre-computed tables that are automatically updated on a schedule. They provide fast query performance for complex aggregations and joins. Materialized views are defined using PySpark or Spark SQL stored in an associated Notebook.
### Spark Views
Logical tables defined by a SQL query. They do not store data but provide a virtual layer for querying. Views are defined using Spark SQL and stored in the Lakehouse next to Tables.
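A minimal sketch of defining a Spark view (the view and table names are hypothetical):
```python
# Create a view over an existing Delta table; no data is copied
spark.sql("""
CREATE OR REPLACE VIEW active_customers AS
SELECT customer_id, name, email
FROM customers
WHERE status = 'active'
""")

# Query the view like a table
spark.sql("SELECT * FROM active_customers LIMIT 10").show()
```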
## Security
### Item access or control plane security
Users can have workspace roles (Admin, Member, Contributor, Viewer) that provide different levels of access to a Lakehouse and its contents. Users can also be granted access through the Lakehouse sharing capabilities.
### Data access or OneLake Security
For data access, use the OneLake security model, which is based on Microsoft Entra ID (formerly Azure Active Directory) and role-based access control (RBAC). Lakehouse data is stored in OneLake, so access to data is controlled through OneLake permissions. In addition to object-level permissions, a Lakehouse also supports column-level and row-level security for tables, allowing fine-grained control over who can see specific columns or rows in a table.
## Lakehouse Shortcuts
Shortcuts create virtual links to data without copying:
### Types of Shortcuts
- **Internal**: Link to other Fabric Lakehouses/tables, cross-workspace data sharing
- **ADLS Gen2**: Azure Data Lake Storage Gen2, external Azure storage
- **Amazon S3**: AWS S3 buckets, cross-cloud data access
- **Dataverse**: Microsoft Dataverse, business application data
- **Google Cloud Storage**: GCS buckets, cross-cloud data access
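Shortcuts themselves are created through the Fabric UI or API; once they exist, Spark reads them like any other Lakehouse table or path. A hedged sketch with hypothetical shortcut names:
```python
# A table shortcut under "Tables" behaves like a local Delta table
df_orders = spark.read.table("orders_shortcut")

# A file shortcut under "Files" is read by path, just like local files
df_raw = spark.read.option("header", "true").csv("Files/external_sales/daily.csv")
```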
## Performance Optimization
### V-Order Optimization
For faster data reads with the semantic model, enable V-Order optimization on Delta tables. This presorts data in a way that improves query performance for common access patterns.
### Table Optimization
Tables can also be optimized using the OPTIMIZE command, which compacts small files into larger ones and can also apply Z-ordering to improve query performance on specific columns. Regular optimization helps maintain performance as data is ingested and updated over time. The VACUUM command can be used to clean up old files and free up storage space, especially after updates and deletes.
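As a hedged alternative to the SQL commands shown in [PySpark code](references/pyspark.md), the same maintenance can be driven from Python with the Delta Lake API (the table name is hypothetical; assumes the Delta Lake Python package bundled with the Fabric Spark runtime):
```python
from delta.tables import DeltaTable

tbl = DeltaTable.forName(spark, "silver_transactions")

# Compact small files; optionally Z-order by frequently filtered columns
tbl.optimize().executeZOrderBy("customer_id", "transaction_date")

# Remove files no longer referenced by the table (retention in hours)
tbl.vacuum(168)
```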
## Lineage
The Lakehouse item supports lineage, which allows users to track the origin and transformations of data. Lineage information is automatically captured for tables and files in a Lakehouse, showing how data flows from source to destination. This helps with debugging, auditing, and understanding data dependencies.
## PySpark Code Examples
See [PySpark code](references/pyspark.md) for details.
## Getting data into Lakehouse
See [Get data](references/getdata.md) for details.

View File

@@ -0,0 +1,36 @@
### Data Factory Integration
Microsoft Fabric includes Data Factory for ETL/ELT orchestration:
- **180+ connectors** for data sources
- **Copy activity** for data movement
- **Dataflow Gen2** for transformations
- **Notebook activity** for Spark processing
- **Scheduling** and triggers
### Pipeline Activities
| Activity | Description |
|----------|-------------|
| Copy Data | Move data between sources and Lakehouse |
| Notebook | Execute Spark notebooks |
| Dataflow | Run Dataflow Gen2 transformations |
| Stored Procedure | Execute SQL procedures |
| ForEach | Loop over items |
| If Condition | Conditional branching |
| Get Metadata | Retrieve file/folder metadata |
| Lakehouse Maintenance | Optimize and vacuum Delta tables |
### Orchestration Patterns
```
Pipeline: Daily_ETL_Pipeline
├── Get Metadata (check for new files)
├── ForEach (process each file)
│ ├── Copy Data (bronze layer)
│ └── Notebook (silver transformation)
├── Notebook (gold aggregation)
└── Lakehouse Maintenance (optimize tables)
```
---

View File

@@ -0,0 +1,187 @@
### Spark Configuration (Best Practices)
```python
# Enable Fabric optimizations
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
```
### Reading Data
```python
# Read CSV file
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("Files/bronze/data.csv")

# Read JSON file
df = spark.read.format("json").load("Files/bronze/data.json")

# Read Parquet file
df = spark.read.format("parquet").load("Files/bronze/data.parquet")

# Read Delta table by name
df = spark.read.table("my_delta_table")

# Query with Spark SQL (lakehouse-qualified table name)
df = spark.sql("SELECT * FROM lakehouse.my_table")
```
### Writing Delta Tables
```python
# Write DataFrame as managed Delta table
df.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("silver_customers")

# Write with partitioning
df.write.format("delta") \
    .mode("overwrite") \
    .partitionBy("year", "month") \
    .saveAsTable("silver_transactions")

# Append to existing table
df.write.format("delta") \
    .mode("append") \
    .saveAsTable("silver_events")
```
### Delta Table Operations (CRUD)
```python
# UPDATE
spark.sql("""
UPDATE silver_customers
SET status = 'active'
WHERE last_login > '2024-01-01'
""")
# DELETE
spark.sql("""
DELETE FROM silver_customers
WHERE is_deleted = true
""")
# MERGE (Upsert)
spark.sql("""
MERGE INTO silver_customers AS target
USING staging_customers AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")
```
### Schema Definition
```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType, DecimalType
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("amount", DecimalType(18, 2), True),
    StructField("created_at", TimestampType(), True)
])

df = spark.read.format("csv") \
    .schema(schema) \
    .option("header", "true") \
    .load("Files/bronze/customers.csv")
```
### SQL Magic in Notebooks
```sql
%%sql
-- Query Delta table directly
SELECT
customer_id,
COUNT(*) as order_count,
SUM(amount) as total_amount
FROM gold_orders
GROUP BY customer_id
ORDER BY total_amount DESC
LIMIT 10
```
### V-Order Optimization
```python
# Enable V-Order for read optimization
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
```
### Table Optimization
```sql
%%sql
-- Optimize table (compact small files)
OPTIMIZE silver_transactions
-- Optimize with Z-ordering on query columns
OPTIMIZE silver_transactions ZORDER BY (customer_id, transaction_date)
-- Vacuum old files (default 7 days retention)
VACUUM silver_transactions
-- Vacuum with custom retention
VACUUM silver_transactions RETAIN 168 HOURS
```
### Incremental Load Pattern
```python
from pyspark.sql.functions import col

# Get the last processed watermark
last_watermark = spark.sql("""
SELECT MAX(processed_timestamp) as watermark
FROM silver_orders
""").collect()[0]["watermark"]

# On the very first run the target is empty and the watermark is None;
# fall back to an early date so the filter still returns all rows
if last_watermark is None:
    last_watermark = "1900-01-01"

# Load only new records
new_records = spark.read.table("bronze_orders") \
    .filter(col("created_at") > last_watermark)

# Merge new records
new_records.createOrReplaceTempView("staging_orders")
spark.sql("""
MERGE INTO silver_orders AS target
USING staging_orders AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")
```
### SCD Type 2 Pattern
```python
# Close existing records
spark.sql("""
UPDATE dim_customer
SET is_current = false, end_date = current_timestamp()
WHERE customer_id IN (SELECT customer_id FROM staging_customer)
AND is_current = true
""")
# Insert new versions
spark.sql("""
INSERT INTO dim_customer
SELECT
customer_id,
name,
email,
address,
current_timestamp() as start_date,
null as end_date,
true as is_current
FROM staging_customer
""")
```