Module 3: BigQuery and Data Warehouse – What You Should Know Before Loading Your First Dataset
This post documents my experience with Module 3 of the Data Engineering Zoomcamp by DataTalks.Club — a free 9-week course covering the modern data engineering stack end-to-end. The module covers data warehousing concepts, loading NYC taxi data into Google BigQuery via GCS, and understanding how to make queries fast and cheap through partitioning and clustering.
Stack: Google BigQuery, Google Cloud Storage, Python 3.11. Skills Gained: OLTP/OLAP distinctions, external vs native BigQuery tables, three partitioning strategies, clustering, query cost optimization, BigQuery internals.
Last Module I Orchestrated Pipelines. This Module I Need Somewhere To Put The Data.
After Module 2 I had Kestra running beautiful parametric pipelines that loaded NYC taxi data into PostgreSQL with proper staging, deduplication, and monthly scheduling. It felt complete.
Module 3 starts by pointing out that PostgreSQL is fine for operational data, but you wouldn’t run analytical queries against it at scale. Not because it’s bad — because it’s built for different things. You’d be fighting the tool instead of using it.
Enter data warehouses. And specifically, Google BigQuery.
OLTP vs OLAP, Or: Why You Can’t Just Use One Database for Everything
The module opens with a conceptual split that’s worth internalizing properly, because it explains a lot of architectural decisions you’ll see in the wild.
OLTP (Online Transaction Processing) is what you reach for when you need to record things happening right now. An order placed, a user logging in, a payment processed. The operations are INSERT, UPDATE, DELETE — short, fast, affecting a few rows at a time. MySQL, PostgreSQL. Normalized schemas to avoid redundancy. Storage in gigabytes.
OLAP (Online Analytical Processing) is what you reach for when you need to understand what happened across millions of rows. Revenue by region over the last 18 months. Trip durations by hour of day across two years of taxi data. The operations are almost exclusively SELECT with aggregations. BigQuery, Snowflake, Redshift. Denormalized schemas because joins at petabyte scale are painful. Storage in terabytes and petabytes.
The key distinction isn’t just scale — it’s the shape of the queries. OLTP queries touch a few rows and return immediately. OLAP queries touch most of the table and take seconds or minutes. Trying to run OLAP workloads on an OLTP database isn’t wrong, exactly — it just gets expensive and slow in ways that are hard to fix retroactively.
ETL, ELT, and Why the Letter Order Matters
Data warehouses use ETL: Extract, Transform, Load. You clean and reshape the data before it lands in the warehouse.
Data lakes use ELT: Extract, Load, Transform. You dump raw data first and figure out the shape later.
I’ve seen people treat these as interchangeable. They’re not. The difference is where the transformation logic lives and when it runs. In an ETL world, your warehouse only contains clean, structured, ready-to-query data. In an ELT world, you might be querying against raw JSON blobs and doing the cleanup in SQL at query time.
Both are valid. This module is ETL: we transform the NY Taxi parquet files, load them to GCS, and then query clean structured data from BigQuery.
External Tables: Querying GCS Without Moving the Data
BigQuery’s external table feature is the first thing in this module that genuinely surprised me.
You can point BigQuery at a file (or set of files) in GCS and query it as if it were a native table — without copying anything into BigQuery’s storage:
```sql
CREATE OR REPLACE EXTERNAL TABLE `taxi-rides-ny.nytaxi.external_yellow_tripdata`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://nyc-tl-data/yellow_taxi/yellow_tripdata_2019-*.parquet']
);
```

That wildcard in the URI is doing a lot of work. One table definition, twelve monthly files, query them all transparently.
Two things to know about external tables before you rely on them:
BigQuery can’t determine the row count. The table exists, you can query it, but SELECT COUNT(*) requires a full scan and metadata like estimated row count shows as unknown. For rough orientation that’s fine; for anything that depends on size estimates it’s not.
Performance is worse than native tables. Every query against an external table goes over the network to GCS. For exploration and initial loading that’s fine. For production analytical queries that run on a schedule, convert to a native table first:
```sql
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_non_partitioned` AS
SELECT * FROM `taxi-rides-ny.nytaxi.external_yellow_tripdata`;
```

The external table pattern I'll actually use: create it as a temporary bridge to load data, then drop it. Same approach as the staging table pattern from Module 2, but at the GCS/BigQuery layer.
Partitioning: Paying Only for the Data You Actually Need
BigQuery charges by the amount of data scanned. A query against a 100GB table that touches all 100GB costs more than the same query that — because of smart table design — touches only 2GB.
Partitioning is the main tool for controlling this.
A partitioned table stores its data in physical blocks, one per partition. When you filter on the partition column, BigQuery reads only the matching blocks and skips the rest. The cost and time savings are proportional to how much you skip.
Three ways to partition:
Integer range — partition by ranges in an INT column. Useful if your natural partition key is something like customer_id or a numeric category:
```sql
CREATE TABLE mydataset.newtable (customer_id INT64, date1 DATE)
PARTITION BY
  RANGE_BUCKET(customer_id, GENERATE_ARRAY(0, 100, 10))
OPTIONS (require_partition_filter = TRUE);
```

Time-unit column — partition by a DATE, TIMESTAMP, or DATETIME column in the table. This is the one you'll use 90% of the time:
```sql
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
PARTITION BY DATE(tpep_pickup_datetime) AS
SELECT * FROM `taxi-rides-ny.nytaxi.external_yellow_tripdata`;
```

Ingestion time (_PARTITIONDATE) — BigQuery assigns rows to partitions based on when they were loaded, not the values in the data. Useful when your data doesn't have a meaningful timestamp column, or when you care more about when something arrived than when it happened:
```sql
CREATE TABLE mydataset.newtable (transaction_id INT64)
PARTITION BY _PARTITIONDATE
OPTIONS (partition_expiration_days = 3, require_partition_filter = TRUE);
```

The require_partition_filter = TRUE option is worth noting. With it set, any query that doesn't filter on the partition column will error rather than silently scanning the whole table. Useful for enforcing cost discipline on a shared dataset.
Clustering: The Partition’s Less-Famous Friend
Clustering sorts rows within partitions according to one or more columns. Where partitioning eliminates entire blocks from the scan, clustering allows BigQuery to skip individual row groups within a block.
```sql
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned_clustered`
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID AS
SELECT * FROM `taxi-rides-ny.nytaxi.external_yellow_tripdata`;
```

The column order in CLUSTER BY matters — it defines the sort priority. Think of it like a composite index.
When to reach for clustering over partitioning:
- Your partitions are very small (under ~1 GB each) — partitioning overhead isn’t worth it
- You’d need more than 4,000 partitions — that’s BigQuery’s hard limit
- Your pipelines modify most partitions every run — BigQuery maintains metadata per partition, and frequent updates across many partitions add overhead
In practice: partition by date (almost always), cluster by the columns you filter on most often after the date.
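A sketch of a query shaped to exploit that advice, filtering on the partition date first and the clustered column second (table name follows the snippet above):

```sql
SELECT COUNT(*) AS trips
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned_clustered`
WHERE DATE(tpep_pickup_datetime)
      BETWEEN '2019-06-01' AND '2019-06-30'  -- partition pruning: skip other months' blocks
  AND VendorID = 1;                          -- cluster skipping: skip row groups within the blocks
```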
Best Practices, Or: How to Not Get a Surprise Bill
The module covers a list of cost and performance guidelines. The ones I found non-obvious:
Never use SELECT * in production queries. BigQuery is columnar. Selecting all columns defeats the entire storage architecture and scans everything. Select only what you need.
Price your queries before running them. BigQuery’s query validator shows estimated bytes scanned before execution. Use it. A query that scans 500GB costs real money even if it finishes in two seconds.
WITH clauses are not prepared statements. This one trips people up. In PostgreSQL, CTEs can be materialized once and reused (that was the default behavior before version 12). In BigQuery, each reference to a CTE re-executes it. If you reference the same CTE three times in a query, it runs three times. Materialize into a temp table if you need to reference it multiple times.
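A hedged sketch of the workaround, run as a multi-statement query so the temp table lives for the session (the aggregation itself is illustrative, not from the course):

```sql
-- Computed once, stored for the session.
CREATE TEMP TABLE daily_totals AS
SELECT DATE(tpep_pickup_datetime) AS trip_date,
       SUM(total_amount) AS revenue
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
GROUP BY trip_date;

-- Each of these reads the temp table instead of re-running the aggregation.
SELECT MAX(revenue) FROM daily_totals;
SELECT AVG(revenue) FROM daily_totals;
```

Had `daily_totals` been a CTE referenced twice, the GROUP BY would have executed twice.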
In JOINs, put the largest table first. BigQuery’s distributed query engine performs better when the large table is on the left side of the join and the small table can be broadcast to all workers. Counterintuitive if you’re used to OLTP JOIN optimization rules.
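A sketch of the ordering rule. The zone lookup join is the classic NYC taxi example; the exact table name `taxi_zone_lookup` is an assumption here, and any small dimension table plays the same role:

```sql
-- Large fact table on the left; the small lookup table can be
-- broadcast to every worker instead of shuffled.
SELECT z.Zone, COUNT(*) AS trips
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned` AS t  -- big
JOIN `taxi-rides-ny.nytaxi.taxi_zone_lookup` AS z             -- small
  ON t.PULocationID = z.LocationID
GROUP BY z.Zone;
```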
How BigQuery Actually Works (The 10-Minute Version)
The module points to a Google blog post on BigQuery internals. Here’s my distillation:
Your SQL query enters the Dremel engine, which converts it into an execution tree. The leaves of the tree are “slots” — workers that read data and perform computation. The branches are “mixers” — aggregators that combine results from slots.
Data lives in Colossus, Google’s distributed file system, stored in a columnar format called ColumnIO. Columnar storage means reading a single column doesn’t require touching the adjacent columns at all — which is why SELECT one_column FROM giant_table is dramatically faster than SELECT * FROM giant_table.
Slots communicate via Jupiter (Google's internal networking infrastructure). The whole thing is scheduled and resource-allocated by Borg, the same cluster manager that runs most of Google's infrastructure.
The architectural takeaway: BigQuery is designed from the ground up for parallel column-oriented reads across distributed storage. That’s why the performance model is so different from row-oriented databases — and why the optimization rules are different too.
Things That Will Bite You
The estimated cost is an estimate, not a cap. The query validator tells you “this query will process approximately X GB.” It’s a good estimate for partitioned tables with clear filters, but it can be wrong for complex queries with multiple joins. Set a personal bytes-billed limit in your BigQuery settings if you’re exploring a large dataset.
Partition pruning only works if the filter is on the partition column directly. If you partition by DATE(tpep_pickup_datetime) and then filter with WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-01-01' AND '2019-01-31', BigQuery will prune correctly. But if you wrap the column in a different function — WHERE EXTRACT(YEAR FROM tpep_pickup_datetime) = 2019 — partition pruning won't trigger and you'll scan the whole table. The filter needs to match the partition expression.
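Side by side, under the DATE(tpep_pickup_datetime) partitioning used earlier (check the validator's byte estimate on each to confirm the difference on your own tables):

```sql
-- Prunes: the filter matches the partition expression.
SELECT COUNT(*)
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-01-01' AND '2019-12-31';

-- Does not prune: the column is wrapped in a different function, full scan.
SELECT COUNT(*)
FROM `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
WHERE EXTRACT(YEAR FROM tpep_pickup_datetime) = 2019;
```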
External table wildcard patterns don’t validate at creation time. If you write uris = ['gs://my-bucket/data/yellow_tripdata_2019-*.parquet'] and the files don’t exist yet, BigQuery creates the external table without complaint. You won’t find out until you run a query against it. “Table created successfully” is not confirmation that the files are there.
The 4,000 partition limit is per table, not per query. If you’re partitioning by day on a dataset that will grow for years, do the math. Daily partitioning on a 20-year dataset is 7,300 partitions — over the limit. Switch to monthly or weekly partitioning, or set a partition expiration.
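Both escape hatches, sketched under the same assumed table names as above. Coarser time-unit partitioning keeps the count small (240 monthly partitions for 20 years), and an expiration lets old daily partitions age out:

```sql
-- Monthly partitions instead of daily: TIMESTAMP_TRUNC on the timestamp column.
CREATE OR REPLACE TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_monthly`
PARTITION BY TIMESTAMP_TRUNC(tpep_pickup_datetime, MONTH) AS
SELECT * FROM `taxi-rides-ny.nytaxi.external_yellow_tripdata`;

-- Or keep daily partitions but drop them after roughly three years.
ALTER TABLE `taxi-rides-ny.nytaxi.yellow_tripdata_partitioned`
SET OPTIONS (partition_expiration_days = 1095);
```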