Key Takeaways
- ELT is the modern standard: Modern cloud warehouses (BigQuery, Snowflake) are powerful enough to run transformations. Load raw data first, then transform inside the warehouse — this preserves source data and makes transformations reversible.
- dbt is the transformation layer: dbt turns SQL SELECT statements into version-controlled, tested, documented warehouse transformations. It is the standard for the T in ELT in 2026.
- Airflow orchestrates everything: Apache Airflow defines pipelines as Python DAGs. It schedules runs, monitors execution, retries failures, and sends alerts. Managed options: MWAA (AWS), Cloud Composer (GCP), Astronomer.
- Streaming usually means Kafka: when you need data fresher than hourly batch runs can deliver, Kafka is the standard streaming backbone. Producers publish events to topics; consumers process them in near real time.
A data pipeline is the infrastructure that moves data from where it originates to where it creates value. Modern data engineering has standardized around ELT architectures, SQL-based transformations with dbt, and orchestration with Airflow. This guide covers the patterns and tools that make up the modern data stack.
ETL vs ELT
ETL (Extract, Transform, Load) transforms data before loading it into the warehouse. This was necessary when storage was expensive and transformation compute was on dedicated servers outside the warehouse.
ELT (Extract, Load, Transform) loads raw data directly into the warehouse first, then transforms it using the warehouse's own compute. This is the modern standard because cloud storage is cheap, warehouse compute is powerful and cost-effective, and keeping raw data enables rebuilding transformations when requirements change.
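The ELT flow can be sketched end to end with Python's built-in sqlite3 standing in for the warehouse. This is a toy stand-in, not a real warehouse, and the table and column names are invented for illustration — but the shape is the same: land raw rows first, then build models from them with SQL inside the warehouse.

```python
import sqlite3

# "Extract + Load": land raw source rows in the warehouse untouched.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id TEXT, status TEXT, order_total REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("o1", "completed", 120.0), ("o2", "cancelled", 75.0), ("o3", "completed", 30.0)],
)

# "Transform": build the analytics model inside the warehouse, from raw data.
# Because raw_orders is preserved, this step can be dropped and rebuilt at any
# time when requirements change -- the core argument for ELT.
conn.execute(
    """
    CREATE TABLE revenue AS
    SELECT SUM(order_total) AS total_revenue
    FROM raw_orders
    WHERE status = 'completed'
    """
)
print(conn.execute("SELECT total_revenue FROM revenue").fetchone()[0])  # 150.0
```

In a real stack, the "load" step is Fivetran or Airbyte writing to BigQuery or Snowflake, and the "transform" step is a dbt model — but the division of labor is exactly this.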
In the modern data stack, an ingestion tool (Fivetran, Airbyte) extracts and loads raw data into the warehouse, dbt transforms it, and BI tools (Looker, Metabase, Tableau) query the transformed models.
dbt: SQL Transformations as Code
```sql
-- models/marts/revenue_by_month.sql
SELECT
    DATE_TRUNC(order_date, MONTH) AS month,
    SUM(order_total) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM {{ ref('stg_orders') }}
WHERE status = 'completed'
GROUP BY 1

-- tests/schema.yml defines data quality tests: unique, not_null, accepted_values
```
The ref() function creates a dependency between models. dbt builds a DAG of all models, runs them in dependency order, and generates documentation with lineage graphs automatically. dbt tests validate data quality on every run.
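The data quality tests mentioned above live in a YAML file alongside the models. A minimal sketch, assuming a `stg_orders` staging model with an `order_id` key and a `status` column (model, column, and accepted values here are illustrative):

```yaml
# models/staging/schema.yml (illustrative model and column names)
version: 2
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['completed', 'cancelled', 'refunded']
```

`dbt test` compiles each of these into a SQL query that returns failing rows, so a non-empty result fails the run.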
Orchestration with Apache Airflow
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

with DAG('daily_pipeline',
         schedule_interval='@daily',
         start_date=datetime(2026, 1, 1)) as dag:
    extract = PythonOperator(task_id='extract', python_callable=run_fivetran_sync)
    transform = DbtCloudRunJobOperator(task_id='dbt_run', job_id=12345)
    notify = PythonOperator(task_id='notify', python_callable=send_slack_alert)

    extract >> transform >> notify
```
Airflow handles retries, alerting, SLA monitoring, and backfilling historical runs. Set `retries=3` and `retry_delay` on each task (or once in `default_args`). Monitor DAG health in the Airflow UI.
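Retry and alerting behavior is typically set once in `default_args` and inherited by every task in the DAG. A minimal sketch — the specific values are illustrative, not recommendations:

```python
from datetime import timedelta

# Passed as DAG(..., default_args=default_args); each task inherits these
# settings unless it overrides them individually.
default_args = {
    "retries": 3,                         # re-run a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),  # wait between attempts
    "retry_exponential_backoff": True,    # 5 min, 10 min, 20 min, ...
    "email_on_failure": True,             # alert once all retries are exhausted
}
print(default_args["retries"])  # 3
```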
Real-Time with Kafka
Kafka is a distributed event streaming platform. Producers write events to topics; consumers read and process them. Kafka retains events for a configurable period (days to weeks), enabling multiple independent consumers and event replay.
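The core ideas — an append-only log, per-consumer-group offsets, and replay — can be sketched in plain Python. This is an in-memory toy standing in for a single Kafka partition, not a Kafka client; real code would use a library such as confluent-kafka against a broker:

```python
class Topic:
    """Append-only event log, like a single Kafka partition."""

    def __init__(self):
        self.events = []   # retained events; Kafka expires them by age or size
        self.offsets = {}  # consumer group -> next offset to read

    def produce(self, event):
        self.events.append(event)

    def consume(self, group):
        """Return unread events for this group and advance its offset."""
        start = self.offsets.get(group, 0)
        batch = self.events[start:]
        self.offsets[group] = len(self.events)
        return batch

orders = Topic()
orders.produce({"order_id": "o1", "total": 120.0})
orders.produce({"order_id": "o2", "total": 30.0})

# Two independent consumer groups each read the full stream.
print(len(orders.consume("warehouse-loader")))   # 2
print(len(orders.consume("dashboard-updater")))  # 2

# Replay: reset a group's offset to reprocess retained events from the start.
orders.offsets["warehouse-loader"] = 0
print(len(orders.consume("warehouse-loader")))   # 2
```

Because the log is retained rather than consumed destructively, adding a new consumer (or re-running a buggy one) never requires re-extracting data from the source — the same property that makes multiple independent consumers cheap in Kafka.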
A common pattern: application events (orders, clicks, signups) are streamed to Kafka, consumed by a Flink or Spark Streaming job, aggregated, and written to the warehouse in near real time. A separate consumer reads the same events and writes to an operational database that powers real-time dashboards.
Kafka complexity is real: managing partitions, consumer groups, schema registry, and offset management requires operational expertise. Cloud managed options (Confluent Cloud, AWS MSK, Redpanda Cloud) significantly reduce this burden.
Frequently Asked Questions
What is the difference between ETL and ELT?
ETL (Extract, Transform, Load) transforms data before loading it into the destination warehouse. ELT (Extract, Load, Transform) loads raw data first, then transforms it inside the warehouse. ELT is the modern standard for cloud data warehouses because storage is cheap, warehouse compute is powerful, and preserving raw data enables rebuilding transformations when requirements change.
What is dbt?
dbt (data build tool) is the standard tool for data transformation in modern data stacks. You write SQL SELECT statements as dbt models; dbt wraps them in CREATE TABLE or VIEW statements and runs them in your data warehouse. dbt manages dependencies between models, generates documentation, runs data quality tests, and provides complete lineage from source tables to final analytics models.
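What dbt does mechanically can be sketched in a few lines: resolve `ref()` to a concrete, schema-qualified table name, then wrap the SELECT in DDL. This is a deliberate simplification of dbt's actual Jinja templating and materialization logic, and the `analytics` schema name is assumed for illustration:

```python
import re

def compile_model(name: str, sql: str, schema: str = "analytics") -> str:
    """Resolve {{ ref('...') }} references, then wrap the SELECT in CREATE TABLE AS."""
    resolved = re.sub(r"\{\{\s*ref\('(\w+)'\)\s*\}\}", rf"{schema}.\1", sql)
    return f"CREATE TABLE {schema}.{name} AS (\n{resolved}\n)"

model_sql = "SELECT SUM(order_total) AS revenue FROM {{ ref('stg_orders') }}"
print(compile_model("revenue_by_month", model_sql))
```

The `ref()` calls are also how dbt discovers the dependency graph: each resolved reference is an edge in the DAG, which is why dbt can run models in the right order and draw lineage without any extra configuration.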
Do I need Apache Kafka for data pipelines?
Not unless you need near real-time data (latency under minutes). Most analytics workloads are well-served by daily or hourly batch pipelines. Use Kafka when: you need to process events in real-time (fraud detection, real-time recommendations), your data volumes exceed what batch ingestion can handle within your latency budget, or you need multiple independent consumers of the same event stream.
What tools does a data engineer need to know in 2026?
Core skills: SQL (essential), Python (for pipeline code and orchestration), dbt (transformation), Airflow or Prefect (orchestration), and one cloud data warehouse (BigQuery, Snowflake, or Redshift). Secondary skills: Kafka (streaming), Spark (large-scale processing), Terraform (infrastructure as code), and at least one ingestion tool (Fivetran or Airbyte). Start with SQL, Python, dbt, and Airflow before adding streaming tools.
Data pipelines are the infrastructure of the data-driven company. Get the skills.
Join professionals from Denver, NYC, Dallas, LA, and Chicago for two days of hands-on AI and tech training. $1,490. June–October 2026 (Thu–Fri). Seats are limited.
Note: Information reflects early 2026.
ETL vs ELT was the wrong debate. The real split is batch vs event.
The ETL-vs-ELT framing dominated data conferences for most of a decade and has now largely been settled in favor of ELT — load raw data into a powerful warehouse and transform it there. That's the correct answer for analytical workloads, and it won. But treating that as the big question means missing the actually important one: whether your data should move in batches or as a continuous event stream. That choice determines your architecture, your cost, your complexity, and your team's skill mix far more than the ETL/ELT debate ever did.
Our read is that most companies should default to batch and be honest about when streaming is actually required. Streaming is genuinely necessary for fraud detection, personalization at the session level, real-time bidding, and a handful of other use cases. For everything else, batch runs every five or fifteen minutes are cheaper, simpler, more reliable, and indistinguishable from streaming to the end user. The number of dashboards that 'need real-time data' and actually need it is smaller than people think. Most want it because it feels modern, not because the business requires it.
For engineers picking a skill stack in 2026: get fluent in dbt, Airflow or Dagster, and one warehouse (Snowflake or BigQuery). Then learn Kafka and Flink when a real streaming requirement shows up, not before. That sequence optimizes for what you'll actually do.