# data-engineering
ETL pipeline construction, data warehouse design, batch processing workflows, and data-driven feature development
## Installation

```bash
npx claude-plugins install @wshobson/claude-code-workflows/data-engineering
```
## Contents

Folders: `agents`, `commands`, `skills`
## Included Skills
This plugin includes 4 skill definitions:
### airflow-dag-patterns
> Build production Apache Airflow DAGs with best practices for operators, sensors, testing, and deployment. Use when creating data pipelines, orchestrating workflows, or scheduling batch jobs.
<details>
<summary>View skill definition</summary>
# Apache Airflow DAG Patterns
Production-ready patterns for Apache Airflow including DAG design, operators, sensors, testing, and deployment strategies.
## When to Use This Skill
- Creating data pipeline orchestration with Airflow
- Designing DAG structures and dependencies
- Implementing custom operators and sensors
- Testing Airflow DAGs locally
- Setting up Airflow in production
- Debugging failed DAG runs
## Core Concepts
### 1. DAG Design Principles
| Principle | Description |
| --------------- | --------------------------------------- |
| **Idempotent** | Running twice produces the same result |
| **Atomic** | Tasks succeed or fail completely |
| **Incremental** | Process only new/changed data |
| **Observable** | Logs, metrics, alerts at every step |
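As a loose illustration of how the idempotent and incremental principles combine in practice, the sketch below reloads exactly one logical-date partition per run, so a rerun overwrites rather than duplicates. This is a minimal sketch and not part of the skill definition itself: the `fct_orders` table, the commented-out `warehouse`/`source` calls, and the DAG id are made-up placeholders.

```python
# Minimal sketch: idempotent, incremental partition reload (illustrative only).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_orders_partition(ds, **_):
    """Reload exactly one logical-date partition so reruns overwrite
    (idempotent) and each run touches only that day's data (incremental)."""
    # warehouse.execute("DELETE FROM fct_orders WHERE load_date = %s", [ds])  # hypothetical client
    # rows = source.fetch_orders(date=ds)                                     # hypothetical client
    # warehouse.insert("fct_orders", rows)
    print(f"reloading partition {ds}")


with DAG(
    dag_id="idempotent_incremental_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    PythonOperator(
        task_id="load_orders_partition",
        python_callable=load_orders_partition,  # Airflow injects `ds` from the run context
    )
```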
### 2. Task Dependencies
```python
# Linear
task1 >> task2 >> task3

# Fan-out
task1 >> [task2, task3, task4]

# Fan-in
[task1, task2, task3] >> task4

# Complex
task1 >> task2 >> task4
task1 >> task3 >> task4
```
## Quick Start
```python
# dags/example_dag.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.empty import EmptyOperator

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'retry_exponential_backoff': True,
    'max_retry_delay': t
...(truncated)
```
</details>
### data-quality-frameworks
> Implement data quality validation with Great Expectations, dbt tests, and data contracts. Use when building data quality pipelines, implementing validation rules, or establishing data contracts.
<details>
<summary>View skill definition</summary>
# Data Quality Frameworks
Production patterns for implementing data quality with Great Expectations, dbt tests, and data contracts to ensure reliable data pipelines.
## When to Use This Skill
- Implementing data quality checks in pipelines
- Setting up Great Expectations validation
- Building comprehensive dbt test suites
- Establishing data contracts between teams
- Monitoring data quality metrics
- Automating data validation in CI/CD
## Core Concepts
### 1. Data Quality Dimensions
| Dimension | Description | Example Check |
| ---------------- | ------------------------ | -------------------------------------------------- |
| **Completeness** | No missing values | `expect_column_values_to_not_be_null` |
| **Uniqueness** | No duplicates | `expect_column_values_to_be_unique` |
| **Validity** | Values in expected range | `expect_column_values_to_be_in_set` |
| **Accuracy** | Data matches reality | Cross-reference validation |
| **Consistency** | No contradictions | `expect_column_pair_values_A_to_be_greater_than_B` |
| **Timeliness** | Data is recent | `expect_column_max_to_be_between` |
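As a small, hedged illustration of the first three dimensions, the classic pandas-backed Great Expectations entry point (`ge.from_pandas`) can run these checks inline; newer Great Expectations releases use a different API, and the DataFrame below is invented purely for the example.

```python
import great_expectations as ge
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "status": ["paid", "shipped", "paid"],
})

# Classic pandas-backed API; entry points differ in recent GE versions.
ge_df = ge.from_pandas(df)

# Completeness: no missing order ids.
print(ge_df.expect_column_values_to_not_be_null("order_id").success)
# Uniqueness: order ids must not repeat.
print(ge_df.expect_column_values_to_be_unique("order_id").success)
# Validity: status values come from a known set.
print(ge_df.expect_column_values_to_be_in_set("status", ["paid", "shipped", "refunded"]).success)
```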
### 2. Testing Pyramid for Data
```
        /\
       /  \   Integration Tests (cross-table)
      /────\
     /      \  Unit Tests (single column)
    /────────\
   /          \ Sc
…(truncated)
```
</details>
### dbt-transformation-patterns
> Master dbt (data build tool) for analytics engineering with model organization, testing, documentation, and incremental strategies. Use when building data transformations, creating data models, or implementing analytics engineering best practices.
<details>
<summary>View skill definition</summary>
# dbt Transformation Patterns
Production-ready patterns for dbt (data build tool) including model organization, testing strategies, documentation, and incremental processing.
## When to Use This Skill
- Building data transformation pipelines with dbt
- Organizing models into staging, intermediate, and marts layers
- Implementing data quality tests
- Creating incremental models for large datasets
- Documenting data models and lineage
- Setting up dbt project structure
## Core Concepts
### 1. Model Layers (Medallion Architecture)
```
sources/        Raw data definitions
    ↓
staging/        1:1 with source, light cleaning
    ↓
intermediate/   Business logic, joins, aggregations
    ↓
marts/          Final analytics tables
```
### 2. Naming Conventions
| Layer | Prefix | Example |
| ------------ | -------------- | ------------------------------ |
| Staging | `stg_` | `stg_stripe__payments` |
| Intermediate | `int_` | `int_payments_pivoted` |
| Marts | `dim_`, `fct_` | `dim_customers`, `fct_orders` |
## Quick Start
```yaml
# dbt_project.yml
name: "analytics"
version: "1.0.0"
profile: "analytics"

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]

vars:
  start_date: "2020-01-01"

models:
  analytics:
    staging:
      +materialized: view
      +schema: staging
    intermediate:
      +materialized: ephemeral
...(truncated)
```
</details>
### spark-optimization
> Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning. Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.
<details>
<summary>View skill definition</summary>
# Apache Spark Optimization
Production patterns for optimizing Apache Spark jobs including partitioning strategies, memory management, shuffle optimization, and performance tuning.
## When to Use This Skill
- Optimizing slow Spark jobs
- Tuning memory and executor configuration
- Implementing efficient partitioning strategies
- Debugging Spark performance issues
- Scaling Spark pipelines for large datasets
- Reducing shuffle and data skew
## Core Concepts
### 1. Spark Execution Model
```
Driver Program
    ↓
Job (triggered by action)
    ↓
Stages (separated by shuffles)
    ↓
Tasks (one per partition)
```
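In other words, transformations only extend the logical plan; a job is created the moment an action runs, and stages appear wherever that plan needs a shuffle. A minimal sketch of this behavior, with a hypothetical input path:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ExecutionModelDemo").getOrCreate()

# Transformations are lazy: these lines only build the plan, nothing executes.
orders = spark.read.parquet("/data/orders")        # hypothetical path
large = orders.filter(F.col("amount") > 100)       # narrow transformation, no shuffle

# The action triggers a job; Spark splits it into stages at shuffle boundaries.
print(large.count())
```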
### 2. Key Performance Factors
| Factor | Impact | Solution |
| ----------------- | --------------------- | ----------------------------- |
| **Shuffle** | Network I/O, disk I/O | Minimize wide transformations |
| **Data Skew** | Uneven task duration | Salting, broadcast joins |
| **Serialization** | CPU overhead | Use Kryo, columnar formats |
| **Memory** | GC pressure, spills | Tune executor memory |
| **Partitions** | Parallelism | Right-size partitions |
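Two of the cheapest wins from this table are broadcasting a small dimension table (which removes the shuffle from a join and sidesteps skew on the join key) and right-sizing partitions after heavy filtering. The sketch below assumes illustrative table paths and column names, not anything from the skill itself.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("JoinTuningSketch").getOrCreate()

facts = spark.read.parquet("/data/fct_orders")      # large table (illustrative)
dims = spark.read.parquet("/data/dim_customers")    # small table (illustrative)

# Broadcast the small side: every executor gets a copy, so the large table
# is never shuffled for the join, and skewed join keys stop mattering.
joined = facts.join(F.broadcast(dims), "customer_id")

# After filtering, coalesce to avoid writing thousands of tiny partitions.
recent = joined.filter(F.col("order_date") >= "2024-01-01").coalesce(64)
recent.write.mode("overwrite").parquet("/data/out/recent_orders")
```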
## Quick Start
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
# Create optimized Spark session
spark = (SparkSession.builder
    .appName("OptimizedJob")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabl
...(truncated)
```
</details>
## Source
[View on GitHub](https://github.com/wshobson/agents)