Data Pipeline Guide: ETL, ELT, and Modern Architectures

Complete data pipeline guide: ETL vs ELT, batch vs streaming, dbt, Airflow, Kafka, and how modern data stacks work end to end.


Key Takeaways

A data pipeline is the infrastructure that moves data from where it originates to where it creates value. Modern data engineering has standardized around ELT architectures, SQL-based transformations with dbt, and orchestration with Airflow. This guide covers the patterns and tools that make up the modern data stack.

01

ETL vs ELT

ETL (Extract, Transform, Load) transforms data before loading it into the warehouse. This was necessary when storage was expensive and transformation compute was on dedicated servers outside the warehouse.

ELT (Extract, Load, Transform) loads raw data directly into the warehouse first, then transforms it using the warehouse's own compute. This is the modern standard because cloud storage is cheap, warehouse compute is powerful and cost-effective, and keeping raw data enables rebuilding transformations when requirements change.

The modern data stack: ingestion tool (Fivetran, Airbyte) extracts and loads raw data into the warehouse. dbt transforms it. BI tools (Looker, Metabase, Tableau) query the transformed models.
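The ELT flow can be sketched end to end in a few lines, with SQLite standing in for the warehouse. This is a toy illustration of the pattern only; in practice an ingestion tool lands the raw data and a cloud warehouse runs the SQL, and the table and column names here are made up:

```python
import sqlite3

# Extract + Load: land the raw records in the warehouse untouched.
raw_orders = [
    (1, "2026-01-03", "completed", 120.0),
    (2, "2026-01-15", "cancelled", 80.0),
    (3, "2026-02-02", "completed", 200.0),
]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_orders (id INTEGER, order_date TEXT, status TEXT, total REAL)")
con.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_orders)

# Transform: run SQL inside the warehouse. The raw table stays intact,
# so the model can be rebuilt from scratch when requirements change.
con.execute("""
    CREATE TABLE revenue_by_month AS
    SELECT substr(order_date, 1, 7) AS month, SUM(total) AS total_revenue
    FROM raw_orders
    WHERE status = 'completed'
    GROUP BY 1
""")
print(con.execute("SELECT * FROM revenue_by_month ORDER BY month").fetchall())
# → [('2026-01', 120.0), ('2026-02', 200.0)]
```

The key property: because the transform is just SQL over preserved raw data, dropping and recreating `revenue_by_month` with new logic costs one query, not a re-extraction.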

02

dbt: SQL Transformations as Code

-- models/marts/revenue_by_month.sql
SELECT
  DATE_TRUNC(order_date, MONTH) AS month,  -- BigQuery syntax; Snowflake/Postgres: DATE_TRUNC('month', order_date)
  SUM(order_total) AS total_revenue,
  COUNT(DISTINCT customer_id) AS unique_customers
FROM {{ ref('stg_orders') }}
WHERE status = 'completed'
GROUP BY 1

# models/schema.yml
# Declare data quality tests per column: unique, not_null, accepted_values

The ref() function creates a dependency between models. dbt builds a DAG of all models, runs them in dependency order, and generates documentation with lineage graphs automatically. dbt tests validate data quality on every run.
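The dependency-ordering behavior is easy to see with a few lines of Python. This is a toy topological sort, not dbt's implementation, and the model names besides stg_orders are invented:

```python
from graphlib import TopologicalSorter

# Each model maps to the models it ref()s. dbt derives this graph by
# parsing the {{ ref('...') }} calls out of each model's SQL.
deps = {
    "stg_orders": set(),                      # reads a raw source table
    "stg_customers": set(),
    "revenue_by_month": {"stg_orders"},
    "customer_lifetime_value": {"stg_orders", "stg_customers"},
}

# Running models in topological order guarantees every model's
# inputs exist before it is built -- staging first, marts after.
run_order = list(TopologicalSorter(deps).static_order())
print(run_order)
```

The same graph drives `dbt run`'s execution order, the lineage view in dbt docs, and selective builds like `dbt run --select stg_orders+` (a model plus everything downstream of it).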

03

Orchestration with Apache Airflow

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

def run_fivetran_sync():
    """Trigger the Fivetran connector sync (stub: call the Fivetran API here)."""

def send_slack_alert():
    """Post a pipeline-status message to Slack (stub)."""

with DAG('daily_pipeline', schedule='@daily', start_date=datetime(2026, 1, 1), catchup=False) as dag:
    extract = PythonOperator(task_id='extract', python_callable=run_fivetran_sync)
    transform = DbtCloudRunJobOperator(task_id='dbt_run', job_id=12345)
    notify = PythonOperator(task_id='notify', python_callable=send_slack_alert)

    extract >> transform >> notify

Airflow handles retries, alerting, SLA monitoring, and backfilling historical runs. Set retries=3 and a retry_delay on each task (or once in the DAG's default_args). Monitor DAG health in the Airflow UI.
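Conceptually, retries and retry_delay wrap each task in logic like the following. This is a plain-Python sketch of the behavior, not Airflow's scheduler code, and the flaky task is invented for illustration:

```python
import time

def run_with_retries(task, retries=3, retry_delay=0.0):
    """Run task(); on failure, wait retry_delay seconds and try again,
    up to `retries` extra attempts -- roughly what Airflow does per task."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise          # out of retries: the task is marked failed
            time.sleep(retry_delay)

# A flaky extract that fails twice before succeeding, as a transient
# API error during a sync might.
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient API error")
    return "synced"

print(run_with_retries(flaky_extract, retries=3))  # → synced
```

Because most pipeline failures are transient (rate limits, network blips, warehouse queueing), a retry policy like this resolves the majority of incidents with no human involvement.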

04

Real-Time with Kafka

Kafka is a distributed event streaming platform. Producers write events to topics; consumers read and process them. Kafka retains events for a configurable period (days to weeks), enabling multiple independent consumers and event replay.
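The retention-plus-offsets model is what makes independent consumers and replay possible. Here is a toy in-memory sketch of those semantics; it is not a Kafka client (in practice you would use a library such as confluent-kafka), and it ignores partitions and brokers entirely:

```python
class Topic:
    """A Kafka topic reduced to its essence: an append-only log.
    Each consumer group tracks its own offset, so groups read
    independently and can rewind to replay retained history."""

    def __init__(self):
        self.log = []        # retained events (Kafka expires these by age/size)
        self.offsets = {}    # consumer group -> next offset to read

    def produce(self, event):
        self.log.append(event)

    def consume(self, group, max_events=10):
        start = self.offsets.get(group, 0)
        batch = self.log[start:start + max_events]
        self.offsets[group] = start + len(batch)   # commit the new offset
        return batch

    def seek(self, group, offset):
        self.offsets[group] = offset               # rewind for replay

orders = Topic()
for e in ["order_1", "order_2", "order_3"]:
    orders.produce(e)

print(orders.consume("warehouse_loader"))  # each group reads the full log...
print(orders.consume("dashboard"))         # ...without affecting the others
orders.seek("warehouse_loader", 0)
print(orders.consume("warehouse_loader"))  # replay from the beginning
```

Note what this buys you over a queue: consuming an event does not delete it, so adding a new downstream system later means pointing a new consumer group at offset 0, not re-exporting history.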

Common patterns: application events (orders, clicks, signups) streamed to Kafka, consumed by a Flink or Spark Streaming job, aggregated, and written to the warehouse in near real-time. Separate consumer reads the same events and writes to an operational database for real-time dashboards.

Kafka complexity is real: managing partitions, consumer groups, schema registry, and offset management requires operational expertise. Cloud managed options (Confluent Cloud, AWS MSK, Redpanda Cloud) significantly reduce this burden.

05

Frequently Asked Questions

What is the difference between ETL and ELT?

ETL (Extract, Transform, Load) transforms data before loading it into the destination warehouse. ELT (Extract, Load, Transform) loads raw data first, then transforms it inside the warehouse. ELT is the modern standard for cloud data warehouses because storage is cheap, warehouse compute is powerful, and preserving raw data enables rebuilding transformations when requirements change.

What is dbt?

dbt (data build tool) is the standard tool for data transformation in modern data stacks. You write SQL SELECT statements as dbt models; dbt wraps them in CREATE TABLE or VIEW statements and runs them in your data warehouse. dbt manages dependencies between models, generates documentation, runs data quality tests, and provides complete lineage from source tables to final analytics models.

Do I need Apache Kafka for data pipelines?

Not unless you need near real-time data (latency under minutes). Most analytics workloads are well-served by daily or hourly batch pipelines. Use Kafka when: you need to process events in real-time (fraud detection, real-time recommendations), your data volumes exceed what batch ingestion can handle within your latency budget, or you need multiple independent consumers of the same event stream.

What tools does a data engineer need to know in 2026?

Core skills: SQL (essential), Python (for pipeline code and orchestration), dbt (transformation), Airflow or Prefect (orchestration), and one cloud data warehouse (BigQuery, Snowflake, or Redshift). Secondary skills: Kafka (streaming), Spark (large-scale processing), Terraform (infrastructure as code), and at least one ingestion tool (Fivetran or Airbyte). Start with SQL, Python, dbt, and Airflow before adding streaming tools.

Data pipelines are the infrastructure of the data-driven company. Get the skills.

Join professionals from Denver, NYC, Dallas, LA, and Chicago for two days of hands-on AI and tech training. $1,490. June–October 2026 (Thu–Fri). Seats are limited.

Reserve Your Seat

Note: Information reflects early 2026.

Our Take

ETL vs ELT was the wrong debate. The real split is batch vs event.

The ETL-vs-ELT framing dominated data conferences for most of a decade and has now largely been settled in favor of ELT — load raw data into a powerful warehouse and transform it there. That's the correct answer for analytical workloads, and it won. But treating that as the big question means missing the actually important one: whether your data should move in batches or as a continuous event stream. That choice determines your architecture, your cost, your complexity, and your team's skill mix far more than the ETL/ELT debate ever did.

Our read is that most companies should default to batch and be honest about when streaming is actually required. Streaming is genuinely necessary for fraud detection, personalization at the session level, real-time bidding, and a handful of other use cases. For everything else, batch runs every five or fifteen minutes are cheaper, simpler, more reliable, and indistinguishable from streaming to the end user. The number of dashboards that 'need real-time data' and actually need it is smaller than people think. Most want it because it feels modern, not because the business requires it.

For engineers picking a skill stack in 2026: get fluent in dbt, Airflow or Dagster, and one warehouse (Snowflake or BigQuery). Then learn Kafka and Flink when a real streaming requirement shows up, not before. That sequence optimizes for what you'll actually do.


Published By

Precision AI Academy

Practitioner-focused AI education · 2-day in-person bootcamp in 5 U.S. cities

Precision AI Academy publishes deep-dives on applied AI engineering for working professionals. Founded by Bo Peng (Kaggle Top 200) who leads the in-person bootcamp in Denver, NYC, Dallas, LA, and Chicago.

Kaggle Top 200 · Federal AI Practitioner · 5 U.S. Cities · Thu–Fri Cohorts