data-pipeline-builder

Build data pipelines without framework expertise. Extract from a supported source, transform with built-in steps or custom code, and load to a supported destination, all through natural-language commands.

What It Does

Extract data — From databases, APIs, files, S3, GCS, Kafka
Transform — Filters, mappings, aggregations, joins, custom code
Load — To databases, data warehouses, files, APIs
Schedule — Cron-based or event-triggered execution
Monitor — Pipeline status, throughput, error rates
Validate — Schema checks, data quality rules

Quick Start

# 1. Create a simple pipeline
create pipeline from mysql users to postgres users_backup

# 2. Add transformation
add transform to users-backup: filter where active = true

# 3. Schedule it
schedule users-backup daily at 2:00 AM

# 4. Run and monitor
run pipeline users-backup
check pipeline status
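
Since scheduling is described as cron-based, the phrase in step 3 presumably resolves to a cron expression. The following is a minimal sketch of such a mapping, not the tool's actual parser: the function name `phrase_to_cron` and the two phrase shapes it accepts are assumptions made for illustration.

```python
import re


def phrase_to_cron(phrase: str) -> str:
    """Translate 'hourly' or 'daily at H:MM AM/PM' into a 5-field cron expression.

    Hypothetical helper: only these two phrase shapes are handled.
    """
    text = phrase.strip().lower()
    if text == "hourly":
        return "0 * * * *"  # minute 0 of every hour
    match = re.fullmatch(r"daily at (\d{1,2}):(\d{2}) (am|pm)", text)
    if match is None:
        raise ValueError(f"unsupported schedule phrase: {phrase!r}")
    hour, minute = int(match.group(1)), int(match.group(2))
    if not (1 <= hour <= 12 and 0 <= minute <= 59):
        raise ValueError(f"invalid time in schedule phrase: {phrase!r}")
    # Convert the 12-hour clock to 24-hour: 12 AM -> 0, 12 PM -> 12.
    hour = hour % 12 + (12 if match.group(3) == "pm" else 0)
    return f"{minute} {hour} * * *"


print(phrase_to_cron("daily at 2:00 AM"))  # 0 2 * * *
```

The cron fields are minute, hour, day of month, month, and day of week, so `0 2 * * *` fires once a day at 02:00 in whatever time zone the scheduler uses.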

Common Use Cases

🔄 Database Synchronization

# Sync production to analytics warehouse
create pipeline from mysql production.orders \
  to bigquery analytics.orders

# Run incremental sync every hour
schedule orders-sync hourly
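
How an hourly run decides which rows are new is not described here. One common approach is a high-water mark on a timestamp column. The sketch below illustrates that idea with two in-memory SQLite databases standing in for MySQL and BigQuery; the `updated_at` column, the table layout, and the `sync_incremental` function are all hypothetical.

```python
import sqlite3


def sync_incremental(src: sqlite3.Connection, dst: sqlite3.Connection) -> int:
    """Copy rows from src.orders that are newer than the latest row in dst.orders."""
    # High-water mark: newest timestamp already loaded ('' when dst is empty).
    watermark = dst.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM orders"
    ).fetchone()[0]
    rows = src.execute(
        "SELECT id, total, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # Upsert by primary key so a changed row replaces its old copy.
    dst.executemany(
        "INSERT OR REPLACE INTO orders (id, total, updated_at) VALUES (?, ?, ?)",
        rows,
    )
    dst.commit()
    return len(rows)


DDL = "CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, updated_at TEXT)"
source, target = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
source.execute(DDL)
target.execute(DDL)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 9.99, "2024-01-01T00:00:00"), (2, 5.00, "2024-01-01T01:00:00")],
)
first_run = sync_incremental(source, target)   # copies both existing rows
source.execute("INSERT INTO orders VALUES (3, 7.50, '2024-01-01T02:00:00')")
second_run = sync_incremental(source, target)  # copies only the new row
print(first_run, second_run)
```

ISO 8601 timestamps stored as text sort in chronological order, which is what makes the string comparison against the watermark valid in this sketch.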

📊 API Data Extraction

# Pull data from REST API
create pipeline from api https://api.shop.com/orders \
  to postgres analytics.orders

# Add authentication
set source auth: bearer token xxx
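
Bearer authentication conventionally means sending the token in an `Authorization` header. The sketch below builds such a request with the Python standard library without sending it. The `build_request` helper and the `SHOP_API_TOKEN` environment variable are hypothetical; how the tool itself stores credentials is not documented here.

```python
import os
import urllib.request


def build_request(url: str, token: str) -> urllib.request.Request:
    """Return a GET request that carries the token as a Bearer credential."""
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
    )


# Read the token from the environment rather than hard-coding it;
# "xxx" is only the placeholder used in the command above.
token = os.environ.get("SHOP_API_TOKEN", "xxx")
request = build_request("https://api.shop.com/orders", token)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the call; that step is left out so the sketch runs offline.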

🧹 Data Cleaning

# Clean and transform data
create pipeline from csv raw_data.csv to postgres clean_data

add transform: \
  remove duplicates on email \
  fill nulls in age with 0 \
  validate email format
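
For readers who want to see what these three steps amount to, here is a minimal plain-Python sketch. It rests on assumptions the commands do not spell out: rows with an invalid email are dropped, duplicates are matched case-insensitively with the first occurrence kept, and values stay strings as they would when read from CSV. The `clean` function and the email pattern are illustrative, not the tool's implementation.

```python
import re
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]

# Deliberately loose check: something@something.tld, no spaces.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def clean(rows: List[Row]) -> List[Row]:
    """Drop invalid and duplicate emails, and fill missing ages with "0"."""
    seen = set()
    cleaned = []
    for row in rows:
        email = (row.get("email") or "").strip().lower()
        if not EMAIL_RE.fullmatch(email):
            continue  # validate email format: drop rows that fail
        if email in seen:
            continue  # remove duplicates on email: keep the first occurrence
        seen.add(email)
        age = row.get("age")
        # fill nulls in age with 0 (None and empty string both count as null)
        filled_age = age if age not in (None, "") else "0"
        cleaned.append({**row, "email": email, "age": filled_age})
    return cleaned


raw = [
    {"email": "ann@example.com", "age": "31"},
    {"email": "ANN@example.com", "age": "40"},  # duplicate of the first row
    {"email": "bob@example.com", "age": ""},    # missing age
    {"email": "not-an-email", "age": "22"},     # fails validation
]
print(clean(raw))
```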

📈 Analytics Preparation

# Aggregate for dashboards
create pipeline from postgres transactions \
  to postgres daily_summary

add transform: \
  group by date, product \
  aggregate sum(revenue), count(*) \
  where date >= yesterday
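
The transform above reads like a SQL `GROUP BY` with a `WHERE` filter. The sketch below shows the equivalent logic in plain Python, assuming the filter is applied before grouping (as in SQL) and that "yesterday" means one day before the run date. The `daily_summary` function and the record layout are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, List, Tuple


def daily_summary(
    transactions: List[dict], today: date
) -> Dict[Tuple[date, str], dict]:
    """Sum revenue and count rows per (date, product) from yesterday onward."""
    cutoff = today - timedelta(days=1)  # where date >= yesterday
    summary = defaultdict(lambda: {"revenue": 0.0, "count": 0})
    for tx in transactions:
        if tx["date"] < cutoff:
            continue  # filtered out before grouping
        bucket = summary[(tx["date"], tx["product"])]  # group by date, product
        bucket["revenue"] += tx["revenue"]             # sum(revenue)
        bucket["count"] += 1                           # count(*)
    return dict(summary)


run_date = date(2024, 1, 10)
sample = [
    {"date": date(2024, 1, 9), "product": "widget", "revenue": 10.0},
    {"date": date(2024, 1, 9), "product": "widget", "revenue": 5.0},
    {"date": date(2024, 1, 10), "product": "gadget", "revenue": 7.5},
    {"date": date(2024, 1, 8), "product": "widget", "revenue": 99.0},  # too old
]
for key, totals in daily_summary(sample, run_date).items():
    print(key, totals)
```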

All Commands

| Command | Purpose |
|---------|---------|
| `create pipeline from <src> to <dst>` | Define new pipeline |
| `add transform <pipeline>` | Add transformation step |
| `schedule <pipeline> <when>` | Set run schedule |
| `run pipeline <name>` | Execute immediately |
| `check pipeline status` | View running pipelines |
| `pause pipeline <name>` | Stop scheduled runs |
| `view logs <pipeline>` | See execution history |
| `validate <pipeline>` | Test without executing |

Supported Sources & Destinations

Databases: MySQL, PostgreSQL, MongoDB, Redis, SQLite

Cloud Storage: S3, GCS, Azure Blob

Data Warehouses: BigQuery, Snowflake, Redshift

Streaming: Kafka, Kinesis, Pub/Sub

Files: CSV, JSON, Parquet, Excel

Requirements

Node.js 18+ or Python 3.8+
Source/destination connectors (auto-installed)
Optional: Airflow or Dagster for orchestration