# data-engineering
ETL pipeline construction, data warehouse design, batch processing workflows, and data-driven feature development
## Installation

```bash
npx claude-plugins install @wshobson/claude-code-workflows/data-engineering
```
## Contents

Folders: `agents`, `commands`, `skills`
## Included Skills
This plugin includes 4 skill definitions:
### airflow-dag-patterns
> Build production Apache Airflow DAGs with best practices for operators, sensors, testing, and deployment. Use when creating data pipelines, orchestrating workflows, or scheduling batch jobs.
<details>
<summary>View skill definition</summary>
# Apache Airflow DAG Patterns
Production-ready patterns for Apache Airflow including DAG design, operators, sensors, testing, and deployment strategies.
## When to Use This Skill
- Creating data pipeline orchestration with Airflow
- Designing DAG structures and dependencies
- Implementing custom operators and sensors
- Testing Airflow DAGs locally
- Setting up Airflow in production
- Debugging failed DAG runs
## Core Concepts
### 1. DAG Design Principles
| Principle | Description |
| --------------- | --------------------------------------- |
| **Idempotent** | Running twice produces the same result |
| **Atomic** | Tasks succeed or fail completely |
| **Incremental** | Process only new/changed data |
| **Observable** | Logs, metrics, alerts at every step |
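As a loose illustration of how the idempotent and incremental principles combine in practice, the sketch below reloads exactly one logical-date partition per run, so a rerun overwrites rather than duplicates. This is a minimal sketch and not part of the skill definition itself: the `fct_orders` table, the commented-out `warehouse`/`source` calls, and the DAG id are made-up placeholders.

```python
# Minimal sketch: idempotent, incremental partition reload (illustrative only).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_orders_partition(ds, **_):
    """Reload exactly one logical-date partition so reruns overwrite
    (idempotent) and each run touches only that day's data (incremental)."""
    # warehouse.execute("DELETE FROM fct_orders WHERE load_date = %s", [ds])  # hypothetical client
    # rows = source.fetch_orders(date=ds)                                     # hypothetical client
    # warehouse.insert("fct_orders", rows)
    print(f"reloading partition {ds}")


with DAG(
    dag_id="idempotent_incremental_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    PythonOperator(
        task_id="load_orders_partition",
        python_callable=load_orders_partition,  # Airflow injects `ds` from the run context
    )
```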
### 2. Task Dependencies
```python
# Linear
task1 >> task2 >> task3

# Fan-out
task1 >> [task2, task3, task4]

# Fan-in
[task1, task2, task3] >> task4

# Complex
task1 >> task2 >> task4
task1 >> task3 >> task4
```
## Quick Start
```python
# dags/example_dag.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.empty import EmptyOperator

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'retry_exponential_backoff': True,
    'max_retry_delay': t
...(truncated)
```
</details>
### data-quality-frameworks
> Implement data quality validation with Great Expectations, dbt tests, and data contracts. Use when building data quality pipelines, implementing validation rules, or establishing data contracts.
<details>
<summary>View skill definition</summary>
# Data Quality Frameworks
Production patterns for implementing data quality with Great Expectations, dbt tests, and data contracts to ensure reliable data pipelines.
## When to Use This Skill
- Implementing data quality checks in pipelines
- Setting up Great Expectations validation
- Building comprehensive dbt test suites
- Establishing data contracts between teams
- Monitoring data quality metrics
- Automating data validation in CI/CD
## Core Concepts
### 1. Data Quality Dimensions
| Dimension | Description | Example Check |
| ---------------- | ------------------------ | -------------------------------------------------- |
| **Completeness** | No missing values | `expect_column_values_to_not_be_null` |
| **Uniqueness** | No duplicates | `expect_column_values_to_be_unique` |
| **Validity** | Values in expected range | `expect_column_values_to_be_in_set` |
| **Accuracy** | Data matches reality | Cross-reference validation |
| **Consistency** | No contradictions | `expect_column_pair_values_A_to_be_greater_than_B` |
| **Timeliness** | Data is recent | `expect_column_max_to_be_between` |
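As a small, hedged illustration of the first three dimensions, the classic pandas-backed Great Expectations entry point (`ge.from_pandas`) can run these checks inline; newer Great Expectations releases use a different API, and the DataFrame below is invented purely for the example.

```python
import great_expectations as ge
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "status": ["paid", "shipped", "paid"],
})

# Classic pandas-backed API; entry points differ in recent GE versions.
ge_df = ge.from_pandas(df)

# Completeness: no missing order ids.
print(ge_df.expect_column_values_to_not_be_null("order_id").success)
# Uniqueness: order ids must not repeat.
print(ge_df.expect_column_values_to_be_unique("order_id").success)
# Validity: status values come from a known set.
print(ge_df.expect_column_values_to_be_in_set("status", ["paid", "shipped", "refunded"]).success)
```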
### 2. Testing Pyramid for Data
```
        /\
       /  \   Integration Tests (cross-table)
      /────\
     /      \  Unit Tests (single column)
    /────────\
   /          \ Sc
…(truncated)
```
</details>
### dbt-transformation-patterns
> Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.
<details>
<summary>View skill definition</summary>
# dbt Transformation Patterns
Production-ready patterns for dbt (data build tool) including model organization, testing strategies, documentation, and incremental processing.
## When to Use This Skill
- Building data transformation pipelines with dbt
- Organizing models into staging, intermediate, and marts layers
- Implementing data quality tests
- Creating incremental models for large datasets
- Documenting data models and lineage
- Setting up dbt project structure
## Core Concepts
### 1. Model Layers (Medallion Architecture)
```
sources/        Raw data definitions
    ↓
staging/        1:1 with source, light cleaning
    ↓
intermediate/   Business logic, joins, aggregations
    ↓
marts/          Final analytics tables
```
### 2. Naming Conventions
| Layer | Prefix | Example |
| ------------ | -------------- | ------------------------------ |
| Staging | `stg_` | `stg_stripe__payments` |
| Intermediate | `int_` | `int_payments_pivoted` |
| Marts | `dim_`, `fct_` | `dim_customers`, `fct_orders` |
## Quick Start
```yaml
# dbt_project.yml
name: "analytics"
version: "1.0.0"
profile: "analytics"

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]

vars:
  start_date: "2020-01-01"

models:
  analytics:
    staging:
      +materialized: view
      +schema: staging
    intermediate:
      +materialized: ephemeral
...(truncated)
```
</details>
### spark-optimization
> Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning. Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.
<details>
<summary>View skill definition</summary>
# Apache Spark Optimization
Production patterns for optimizing Apache Spark jobs including partitioning strategies, memory management, shuffle optimization, and performance tuning.
## When to Use This Skill
- Optimizing slow Spark jobs
- Tuning memory and executor configuration
- Implementing efficient partitioning strategies
- Debugging Spark performance issues
- Scaling Spark pipelines for large datasets
- Reducing shuffle and data skew
## Core Concepts
### 1. Spark Execution Model
```
Driver Program
    ↓
Job (triggered by action)
    ↓
Stages (separated by shuffles)
    ↓
Tasks (one per partition)
```
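In other words, transformations only extend the logical plan; a job is created the moment an action runs, and stages appear wherever that plan needs a shuffle. A minimal sketch of this behavior, with a hypothetical input path:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ExecutionModelDemo").getOrCreate()

# Transformations are lazy: these lines only build the plan, nothing executes.
orders = spark.read.parquet("/data/orders")        # hypothetical path
large = orders.filter(F.col("amount") > 100)       # narrow transformation, no shuffle

# The action triggers a job; Spark splits it into stages at shuffle boundaries.
print(large.count())
```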
### 2. Key Performance Factors
| Factor | Impact | Solution |
| ----------------- | --------------------- | ----------------------------- |
| **Shuffle** | Network I/O, disk I/O | Minimize wide transformations |
| **Data Skew** | Uneven task duration | Salting, broadcast joins |
| **Serialization** | CPU overhead | Use Kryo, columnar formats |
| **Memory** | GC pressure, spills | Tune executor memory |
| **Partitions** | Parallelism | Right-size partitions |
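Two of the cheapest wins from this table are broadcasting a small dimension table (which removes the shuffle from a join and sidesteps skew on the join key) and right-sizing partitions after heavy filtering. The sketch below assumes illustrative table paths and column names, not anything from the skill itself.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("JoinTuningSketch").getOrCreate()

facts = spark.read.parquet("/data/fct_orders")      # large table (illustrative)
dims = spark.read.parquet("/data/dim_customers")    # small table (illustrative)

# Broadcast the small side: every executor gets a copy, so the large table
# is never shuffled for the join, and skewed join keys stop mattering.
joined = facts.join(F.broadcast(dims), "customer_id")

# After filtering, coalesce to avoid writing thousands of tiny partitions.
recent = joined.filter(F.col("order_date") >= "2024-01-01").coalesce(64)
recent.write.mode("overwrite").parquet("/data/out/recent_orders")
```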
## Quick Start
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
# Create optimized Spark session
spark = (SparkSession.builder
    .appName("OptimizedJob")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabl
...(truncated)
```
</details>
## Source
[View on GitHub](https://github.com/wshobson/agents)