I Just Wanted to Run a Database. Docker Had Other Plans.
This post documents my experience with Module 1 of the Data Engineering Zoomcamp by DataTalks.Club — a free 9-week course covering the modern data engineering stack end-to-end. The module covers containerization with Docker, local data ingestion pipelines with Python, SQL fundamentals using NYC taxi data, and cloud infrastructure provisioning with Terraform and GCP.
Stack: Docker 24, PostgreSQL 15, pgAdmin 4, Python 3.11 (pandas, SQLAlchemy), Terraform 1.7, Google Cloud Platform. Skills Gained: Container orchestration with Docker Compose, PostgreSQL database setup, chunked data ingestion patterns, Infrastructure as Code with Terraform, GCS and BigQuery provisioning.
Late to the Party
I signed up for the Data Engineering Zoomcamp because I wanted to understand how data pipelines actually work in production — not the tutorial version, the real one. Except I missed the start date. The course runs once a year, starts in January, and somehow I only noticed it existed when it was already halfway through. Classic. So I joined mid-course, caught up on the recorded lectures, and started from Module 1 while everyone else was already doing Spark.
Week 1 was Docker and PostgreSQL. I thought: how hard can it be? Three hours later I was reading Docker networking documentation at midnight. Here’s what happened.
The Setup: PostgreSQL and pgAdmin via Docker Compose
The goal was straightforward: run PostgreSQL locally using Docker, load 1.3 million rows of NYC taxi data into it, and query it through pgAdmin — a graphical interface for PostgreSQL. Everything defined in a docker-compose.yml for reproducibility.
```yaml
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: root
      POSTGRES_PASSWORD: root
      POSTGRES_DB: ny_taxi
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: admin@admin.com
      PGADMIN_DEFAULT_PASSWORD: root
    ports:
      - "8080:80"
    depends_on:
      - postgres

volumes:
  postgres_data:
```

Two services. Forty lines. `docker-compose up -d` and done.
Except it wasn’t done.
The First Surprise: depends_on Lies to You
pgAdmin kept failing to connect to PostgreSQL. The error said the server wasn’t available. I checked — both containers were running, green status in docker ps. So what’s the problem?
Turns out depends_on: postgres means “start the postgres container before pgAdmin.” It does NOT mean “wait until postgres is actually ready to accept connections.” The database needs a few seconds to initialize after the container starts. pgAdmin was connecting before PostgreSQL finished booting. The fix: add a proper health check so Docker Compose waits for PostgreSQL to be genuinely ready before starting pgAdmin:
```yaml
postgres:
  image: postgres:15
  healthcheck:
    test: ["CMD-SHELL", "pg_isready -U root"]
    interval: 5s
    timeout: 5s
    retries: 5
pgadmin:
  depends_on:
    postgres:
      condition: service_healthy  # wait for the health check, not just container start
```

The health check alone isn't enough — pgAdmin's `depends_on` also has to ask for the `service_healthy` condition, otherwise Compose still only waits for the container to start.

The Second Surprise: Where Is My Data?
After fixing the connection issue I loaded the taxi data, queried a few rows, felt great about myself, and ran docker-compose down to clean up. When I brought everything back up with docker-compose up, the database was empty. I had forgotten about volumes.
Without a named volume, Docker stores PostgreSQL data inside the container itself. When the container dies, the data dies with it. The postgres_data: named volume in the compose file maps database files to a Docker-managed storage area that survives container restarts.
This is obvious in retrospect. But it’s also a perfect metaphor for what I’m learning: in data engineering, persistence and durability are never accidental. You have to think about them explicitly, at every layer of the stack.
Loading 1.3 Million Rows Without Killing Your RAM
Once the setup was actually working, loading the data was the interesting part. The naive approach:
```python
df = pd.read_csv('yellow_tripdata_2021-01.csv')
df.to_sql('yellow_taxi_trips', engine, if_exists='replace')
```

This works — but it loads the entire 400MB CSV into memory before writing anything to the database. On a machine with limited RAM this is a problem. On a cloud VM with 2GB of RAM (which is exactly what I’ll be using for my own project later), this crashes. The right way is chunking:
```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')

df_iter = pd.read_csv('yellow_tripdata_2021-01.csv',
                      iterator=True,
                      chunksize=100000)

for chunk in df_iter:
    chunk.to_sql('yellow_taxi_trips', engine, if_exists='append', index=False)
```

Read 100,000 rows, write them, discard them from memory, read the next 100,000. Memory stays flat regardless of file size. This is the streaming pattern — and it’s everywhere in real data engineering. Kafka consumers do this. Spark does this. Even the simplest ETL script should do this.
I’ll be using this exact pattern when building my own pipeline later, processing hourly electricity price data from ~20 European countries.
The Thing Nobody Told Me About Docker Networking
Inside Docker Compose, services communicate using their service name as the hostname — not localhost.
So when pgAdmin asks “where is my PostgreSQL server?”, the answer is not localhost:5432. It’s postgres:5432 — the service name from the compose file. This is because each container runs in its own network namespace; localhost inside pgAdmin points to the pgAdmin container itself, not the host machine or the postgres container.
This feels counterintuitive at first. But it’s exactly how production microservices work. When an API container talks to a database container in Kubernetes, it uses a service name too. Docker Compose is a local simulation of the same concept. Once this clicked, a lot of other architectural patterns started making sense.
The second part of module 1 was Terraform — and this is where things went from “interesting” to “oh, this is how real infrastructure works.”
The task: use Terraform to create two GCP resources — a Cloud Storage bucket (data lake) and a BigQuery dataset (data warehouse). No clicking in the GCP console. Just code. First I had to set up a GCP account, create a project, and download a service account key as a JSON file. The key gives Terraform permission to create resources on your behalf:
```bash
export GOOGLE_APPLICATION_CREDENTIALS="./gcp-credentials.json"
```

Then the Terraform config:
```hcl
provider "google" {
  project = var.project_id
  region  = "europe-west1"
}

resource "google_storage_bucket" "data_lake" {
  name     = "${var.project_id}-data-lake"
  location = "EU"

  lifecycle_rule {
    action {
      type = "Delete"
    }
    condition {
      age = 30 # auto-delete files older than 30 days
    }
  }
}

resource "google_bigquery_dataset" "dataset" {
  dataset_id = "ny_taxi"
  location   = "EU"
}
```

Three commands and everything exists in GCP:
```bash
terraform init    # download the GCP provider plugin
terraform plan    # show what will be created (dry run)
terraform apply   # actually create the resources
```

And when you’re done:
```bash
terraform destroy # delete everything the config created
```

That last command is the one I’ll use most. GCP charges by the hour for compute resources. `terraform destroy` is how you make sure you’re not paying for something you forgot about at 3am.
What Terraform Taught Me About Infrastructure
Before this module I thought infrastructure was something you set up once and never touched again. Terraform changed that mental model completely.
When you define infrastructure as code, it becomes something you can version control, review, and reproduce. If you mess something up, you destroy it and recreate it in minutes. If a colleague wants to replicate your setup, they run terraform apply. No 20-page setup guide. No “works on my machine.”
There’s also something called idempotency — if you run terraform apply twice on the same config, the second run does nothing because the resources already exist and match the config. This is a property you want in all data engineering systems: running something twice should produce the same result as running it once.
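The idea is easy to see in miniature. Here's a toy sketch of the reconcile loop — obviously not how Terraform is implemented, just the property it guarantees: compare desired state to actual state, act only on the difference.

```python
def apply(state, desired):
    """Toy 'apply': create or update only what differs; return the actions taken."""
    actions = []
    for name, config in desired.items():
        if state.get(name) != config:  # only touch resources that drifted or are missing
            state[name] = config
            actions.append(f"create {name}")
    return actions

state = {}  # nothing exists yet
desired = {"bucket": {"location": "EU"}, "dataset": {"id": "ny_taxi"}}

print(apply(state, desired))  # ['create bucket', 'create dataset']
print(apply(state, desired))  # [] — second run finds nothing to do
```

Running it twice leaves the world in exactly the same state as running it once — that's the property.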
I’ll be using Terraform heavily in EnergyLens to provision the Kafka VM, the GCS data lake, and the BigQuery dataset. The entire infrastructure in one terraform apply. That’s the goal.
Why I’m Doing This
I’m going through the DE Zoomcamp because I’m building EnergyLens — a real-time pipeline that ingests electricity price data and weather data for 20+ European countries, processes it through Kafka, stores it in BigQuery, transforms it with dbt, and visualizes it in Looker Studio.
The Docker + PostgreSQL setup from week 1 is the simplest possible version of the infrastructure I’ll need. Same concepts, different scale. Local container vs cloud VM. PostgreSQL vs BigQuery. Manual script vs automated pipeline.
That’s the thing about foundations — they feel trivial until you realize everything else is built on top of them.