1. Introduction

Over the past few days, I've been learning data modeling, which is a foundational step in my journey as a data engineer and data scientist. Data modeling is one of those topics that beginners often skip, overlooking how beneficial it is when designing new OLTP or OLAP database infrastructure. In the short term it might seem straightforward, but its long-term benefits are substantial.

In this area, I've explored several crucial concepts such as the data flow between the Bronze, Silver, and Gold layers, the Medallion architecture, incremental data loading, merging/upserting data using the MERGE INTO SQL query, fact and dimension tables, Slowly Changing Dimensions (SCD Types 1 & 2), and schema designs like the Star and Snowflake models. At first glance, each of these topics may appear simple, but in reality, they can save significant time and effort in the future if the data pipeline is properly modeled.

Let's dive deeper into these topics, shall we?

1.1 Motivation Behind Learning Data Modeling

Data modeling provides the foundational framework for building scalable and maintainable data systems. Understanding these principles early prevents costly redesigns later in a data engineering career.

1.2 Why Beginners Often Skip Data Modeling

Many beginners focus immediately on building pipelines and writing code without considering the underlying structure. This short-term approach often leads to technical debt and system maintenance challenges as data volume grows.

1.3 Short-Term Simplicity vs Long-Term Impact

While basic scripts may work initially, a poorly modeled data architecture becomes increasingly expensive to maintain. Investing time in proper data modeling upfront saves significant effort and resources in the long term.


2. Conceptual Foundations

2.1 What Is Data Modeling in Data Engineering?

Data modeling in data engineering involves designing the structure and relationships of data within a system. It encompasses understanding how data flows through different layers, ensuring proper normalization, and optimizing for both storage and query performance.

2.2 OLTP vs OLAP: Why Modeling Matters

OLTP (Online Transaction Processing) systems prioritize write performance and data integrity for transactional workloads. OLAP (Online Analytical Processing) systems optimize for read performance and complex analytical queries. Data modeling differs significantly between these paradigms, affecting schema design, normalization strategies, and indexing approaches.
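
To make the contrast concrete, here's a minimal generic SQL sketch (the orders table and its columns are hypothetical): the OLTP side does narrow, row-level writes, while the OLAP side scans and aggregates history.

-- OLTP-style workload: a single transactional write
INSERT INTO orders (order_id, customer_id, country, amount, created_at)
VALUES (1001, 42, 'DE', 19.99, current_timestamp());

-- OLAP-style workload: an analytical aggregation over months of history
SELECT country, date_trunc('MONTH', created_at) AS order_month, SUM(amount) AS revenue
FROM orders
GROUP BY country, date_trunc('MONTH', created_at);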

2.3 Overview of Topics Covered

This document covers the Medallion Architecture (Bronze-Silver-Gold layers), fact and dimension tables, Slowly Changing Dimensions, Star and Snowflake schema designs, and incremental loading strategies essential for modern data engineering.


3. Medallion Architecture Overview

Bronze–Silver–Gold Architecture

Picture this: data as a river of mud from your sources (APIs, CSVs, databases). You don't want to drink it straight – you filter it through stages until it's crystal clear. That's the Medallion Architecture, a Databricks-coined "quality funnel". It enforces progressive refinement: Bronze for raw ingestion (no judgments), Silver for cleaning and normalization (fix the mess), Gold for business-ready aggregation (serve the steak).

Data as a Quality Funnel

Assumption check: I assumed raw data is always the "truth"; in reality, sources lie (typos, biases), which is exactly why Bronze preserves the raw feed for audits. The flaw in my earlier thinking: if I over-transform early, I lose auditability; layering it this way keeps that intact.

Design Assumptions and Trade-offs

The Medallion Architecture balances several competing concerns: preserving raw data for auditability while progressively refining it for usability, managing storage efficiency while maintaining data quality, and enabling incremental loads while supporting complete rebuilds when necessary.


3.1 🥉 Bronze Layer — Raw Data Ingestion

Purpose of the Bronze Layer

The Bronze layer is the landing zone. The rule here is strict: do not transform business logic. If the source sends bad data, you keep it. If something breaks tomorrow, Bronze allows you to replay history without asking the source system again.

Immutability and Auditability

Bronze layer data remains immutable, preserving the exact state of data as received from source systems. This immutability ensures auditability—you can always trace back to see what data was originally received, enabling debugging and compliance auditing when data issues arise downstream.

Incremental Load Strategy

Flow: (1) Check if table exists and get last load date. (2) Filter source data post-that date for incremental pull. (3) Create/replace Bronze table with filtered data.

Table Existence Check

I started by assuming a source table data_modeling.default.source_data with columns like order_id, customer_name, last_update, etc.

Python for logic, SQL for creation:

if spark.catalog.tableExists("data_modeling.bronze.bronze_table"):
    result = spark.sql("SELECT MAX(last_update) AS max_date FROM data_modeling.bronze.bronze_table").collect()[0][0]
    if result is None:
        last_load_date = '2000-02-29' # fallback start date
    else:
        last_load_date = result.strftime("%Y-%m-%d") # format as string
else:
    last_load_date = '2000-02-29'

This block answers a critical question: From which point in time should I fetch new data?

  • If the Bronze table already exists, I query the maximum last_update value.
  • If the table does not exist or is empty, I fall back to a safe historical date.

Fallback Load Date Logic

The fallback mechanism ensures that even if the Bronze table is empty or damaged, the system can restart from a known safe point. This prevents data loss and enables recovery from infrastructure failures.

Incremental Data Extraction

Next, filter and create the temp view:

spark.sql(f"""SELECT * FROM data_modeling.default.source_data
          WHERE last_update > '{last_load_date}'""").createOrReplaceTempView("bronze_source")

This creates a temporary view containing only incremental data.

  • spark.sql(f"""SELECT * FROM ... WHERE last_update > '{last_load_date}'""") – Pulls all columns from source, but only rows newer than last load.
  • .createOrReplaceTempView("bronze_source") – Registers the result as a temporary view Spark can query like a table; it is session-scoped and disposable, not persisted to storage.

Temporary Views and Materialization

Finally, I materialize it into the Bronze table.

CREATE OR REPLACE TABLE data_modeling.bronze.bronze_table
AS
SELECT * FROM bronze_source;

This layer exists purely for safety, replayability, and auditability.

Why No Business Logic Belongs Here

The Bronze layer must remain pure—containing only raw data as extracted from source systems. Any business logic or transformations belong in higher layers (Silver, Gold), ensuring that Bronze always serves as a reliable audit trail and recovery point.


3.2 🥈 Silver Layer — Data Cleaning and Normalization

Purpose of the Silver Layer

If Bronze is about trust in the source, Silver is about trust in the data. In this layer we typically fix nulls, deduplicate records, and enforce data types. In industry, data is usually normalized into a Snowflake-style model here (separate tables for customers, products, and so on) to avoid redundancy, but I kept it simple.

Data Cleaning and Standardization

The Silver layer applies consistent data quality rules: standardizing formats, handling missing values, enforcing data types, and removing duplicates.
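
To make that concrete, here's a minimal Spark SQL sketch of what such rules could look like against my Bronze table; the specific choices (trimming names, defaulting missing countries, deduplicating on order_id) are illustrative assumptions, not the exact pipeline from the lecture.

-- Illustrative cleaning pass over Bronze (assumed columns: order_id, customer_name, country, last_update)
SELECT
  order_id,
  trim(customer_name) AS customer_name,       -- standardize formats
  coalesce(country, 'UNKNOWN') AS country,    -- handle missing values
  CAST(last_update AS date) AS last_update    -- enforce types
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY last_update DESC) AS rn
  FROM data_modeling.bronze.bronze_table
) AS dedup
WHERE rn = 1;                                 -- keep only the latest row per order_id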

Transformation Workflow

Flow: (1) Transform Bronze data (e.g., uppercase names, add timestamp). (2) Create temp view. (3) Build Silver table. (4) MERGE for upserts.

spark.sql("""
SELECT *, upper(customer_name) AS upper_names, current_date AS curent_time 
FROM data_modeling.bronze.bronze_table""").createOrReplaceTempView("silver_source")

Incremental Upserts Using MERGE

Create Silver table if it did not already exist:

CREATE TABLE IF NOT EXISTS data_modeling.silver.silver_table
AS
SELECT * FROM silver_source;

The star: MERGE INTO for incremental upsert:

MERGE INTO data_modeling.silver.silver_table
USING silver_source
ON data_modeling.silver.silver_table.order_id = silver_source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

  • MERGE INTO target_table USING source ON key_match – Core upsert: Compares order_id (unique key)
  • WHEN MATCHED THEN UPDATE SET * – If ID exists, overwrite all columns (SCD Type 1 vibe)
  • WHEN NOT MATCHED THEN INSERT * – New ID? Add full row

3.3 🥇 Gold Layer — Business-Ready Data

Purpose of the Gold Layer

Silver's normalized for purity; Gold flattens to Star Schema for BI tools – joins pre-baked, aggregations ready. Target: Analysts querying "revenue by country?" without lag.

Star Schema Philosophy

The Star Schema denormalizes dimension tables into a single flat layer, minimizing the number of joins required for queries. While this increases storage requirements, it dramatically improves query performance for BI tools.

Dimension Table Construction

Flow: (1) Extract distinct dims from Silver (add surrogate keys via ROW_NUMBER). (2) Build fact by joining back.

Deduplication of Business Keys

First, dimensions (e.g., dimcustomer):

CREATE OR REPLACE TABLE data_modeling.gold.dimcustomer AS
SELECT *, ROW_NUMBER() OVER (ORDER BY customer_id) AS dimsagkeycust
FROM (
  SELECT DISTINCT
    customer_id,
    customer_name,
    customer_email,
    upper_names
  FROM data_modeling.silver.silver_table
) AS distinct_customers

Surrogate Key Generation

Surrogate keys provide stable, system-assigned identifiers independent of business logic. These keys remain constant even if business keys change.

Fact Table Design

Fact tables contain:

  • Measures: Numeric values you perform calculations on (quantity, price, amount)
  • Descriptors: Foreign keys pointing to dimensions, providing context for measures

CREATE OR REPLACE TABLE data_modeling.gold.factsales AS
SELECT
  p.dimsagkeyprod,
  s.dimsagkeysales,
  c.dimsagkeycountry,
  pay.dimsagkeypay,
  cust.dimsagkeycust,
  f.quantity,
  f.price
FROM data_modeling.silver.silver_table AS f
LEFT JOIN data_modeling.gold.dimproduct AS p ON f.product_id = p.product_id
LEFT JOIN data_modeling.gold.dimcountry AS c ON f.country = c.country
LEFT JOIN data_modeling.gold.dimpayments AS pay ON f.payment_method = pay.payment_method
LEFT JOIN data_modeling.gold.dimcustomer AS cust ON f.customer_id = cust.customer_id
LEFT JOIN data_modeling.gold.dimsales AS s ON f.order_id = s.order_id

4. Incremental Loading Revisited

Incremental Flow Across Layers

Incremental loading ties Bronze and Silver together. Bronze captures changes using last_update. Silver consumes those changes and merges them using MERGE INTO. Gold is then rebuilt or incrementally refreshed from Silver.

Minimizing Data Movement

The key idea is simple: never move more data than necessary. By filtering at Bronze based on change timestamps and propagating only changed rows through Silver, incremental loading minimizes I/O, compute costs, and processing time.

Rebuild vs Refresh Trade-offs

Incremental refresh of Gold layer fact tables requires careful design of slowly changing dimension logic. In some cases, a complete rebuild of Gold tables from Silver may be more efficient and maintainable than complex incremental logic, depending on data volume and change frequency.
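
For intuition, here's how the two options could look as a sketch (changed_fact_rows is a hypothetical view shaped like factsales, i.e. already joined to the dimension keys):

-- Option A: full rebuild – rerun the CREATE OR REPLACE TABLE factsales join from the Gold section.

-- Option B: incremental refresh – merge only the changed rows
MERGE INTO data_modeling.gold.factsales AS tgt
USING changed_fact_rows AS src
ON tgt.dimsagkeysales = src.dimsagkeysales
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *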


5. Types of Fact Tables

A fact table is the central table in a data warehouse. It stores the actual measurements, typically numerical and additive, like sales amount or website clicks. It sits at the core of a star or snowflake schema, surrounded by dimension tables that give context to the facts.

5.1 Transactional Facts

Transactional Fact (Append-Only Events):

Transaction_ID | Date_ID  | Product_ID | Quantity | Total_Price
1001           | 20250101 | P123       | 2        | 10.00
1002           | 20250101 | P456       | 1        | 5.00

Grain: Per transaction. No updates – just add rows. Great for "peak hour sales."
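
Because transactional facts are fully additive, rollups like daily totals (or hourly peaks, if the grain carried a timestamp) are a plain GROUP BY. A sketch, assuming a hypothetical fact_transactions table with the columns above:

-- Total sales per day from the append-only transactional fact
SELECT Date_ID, SUM(Total_Price) AS daily_sales
FROM fact_transactions
GROUP BY Date_ID
ORDER BY daily_sales DESC;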

5.2 Periodic Snapshot Facts

Periodic Snapshot (Time-Boxed States):

Month_ID | Product_ID | Warehouse_ID | Units_In_Stock
202501   | P123       | W1           | 500
202502   | P123       | W1           | 420

Grain: End-of-month picture of stock levels. Semi-additive (you can sum across warehouses, but not across time).
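
Semi-additive in practice: summing across warehouses is fine, but summing the same product's stock across months double-counts it, so across time you pick one snapshot instead. A sketch, assuming a hypothetical fact_inventory_snapshot table:

-- Valid: total stock across warehouses for a single month
SELECT Month_ID, SUM(Units_In_Stock) AS total_stock
FROM fact_inventory_snapshot
WHERE Month_ID = 202502
GROUP BY Month_ID;

-- Across time, take the most recent snapshot per product/warehouse instead of summing
SELECT Product_ID, Warehouse_ID, Units_In_Stock
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY Product_ID, Warehouse_ID ORDER BY Month_ID DESC) AS rn
  FROM fact_inventory_snapshot
) AS latest
WHERE rn = 1;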

5.3 Accumulating Snapshot Facts

Accumulating Snapshot (Lifecycle Milestones):

Order_ID | Date_Placed | Date_Packed | Date_Shipped | Date_Delivered | Total_Value
555      | 2025-01-01  | 2025-01-02  | 2025-01-03   | 2025-01-05     | 150.00

Rows update in place as milestones happen (e.g., the ship date gets filled in later). Great for measuring "delivery lag."
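
The "delivery lag" measure falls straight out of the milestone dates. A sketch, assuming a hypothetical fact_orders table with the columns above:

-- Average days from placement to delivery, counting only completed lifecycles
SELECT AVG(DATEDIFF(Date_Delivered, Date_Placed)) AS avg_delivery_lag_days
FROM fact_orders
WHERE Date_Delivered IS NOT NULL;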

5.4 Factless and Aggregate Facts

Specials: Factless (just COUNT events, e.g., attendance); Aggregate (pre-summed for speed, e.g., daily totals).

Fact Types Comparison Table:

Feature        | Transactional       | Periodic Snapshot          | Accumulating Snapshot
Grain          | One row per event   | One row per time period    | One row per process lifecycle
Updates        | Never (append only) | Never (append only)        | Always (updated milestones)
Date Dimension | 1 date key          | 1 date key (end of period) | Multiple date keys
Primary Goal   | Detailed history    | Trend/state tracking       | Efficiency/process speed

6. Dimension Tables and SCDs

Dimensions add context, but they evolve (customers move). Types keep it sane. Dimensions are the "who/what/where" descriptors that provide context for numeric measures in fact tables.

6.1 Slowly Changing Dimensions Overview

SCD (Slowly Changing Dimensions): Core for history. Type 0: Fixed (birthdate – never change). Type 1: Overwrite (fix "Jhon" to "John" – lose old). Type 2: New row (preserve both with dates).

6.1.1 SCD Type 0

Type 0 dimensions are immutable. Once a dimension record is created, it never changes. This is appropriate for attributes that should never change, such as birth dates or original product codes.
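
The lecture has no dedicated code for Type 0, but a minimal sketch would be a MERGE that only ever inserts; matched rows are deliberately left alone (scd0_tgt and scd0_src are hypothetical names following the same pattern as the Type 1 and Type 2 examples below):

-- SCD Type 0: never update existing rows, only insert new ones
MERGE INTO data_modeling.gold.scd0_tgt
USING scd0_src
ON data_modeling.gold.scd0_tgt.id = scd0_src.id
WHEN NOT MATCHED THEN INSERT *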

6.1.2 SCD Type 1: Overwrite – Simple, History-Lossy

SCD Type 1 simply overwrites the old data with the new data; no history is kept. If a customer's address changes, you just update the address, and the old address is gone. Simple and straightforward, but you lose history.

MERGE INTO data_modeling.gold.scd1_tgt
USING scd1_src
ON data_modeling.gold.scd1_tgt.id = scd1_src.id
WHEN MATCHED AND scd1_src.currentDate >= data_modeling.gold.scd1_tgt.currentDate THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

6.1.3 SCD Type 2: Versioned – History Keeper

SCD Type 2 keeps history. When data changes, you add a new record with the updated info and mark the old one as inactive, preserving the historical context.

-- First MERGE: Expire old if changed
MERGE INTO data_modeling.gold.scd2_tgt AS scd2_tgt
USING scd2_src
ON scd2_tgt.id = scd2_src.id AND scd2_tgt.filter_flag = 1
WHEN MATCHED AND (scd2_src.cat <> scd2_tgt.cat AND scd2_src.currentDate >= scd2_tgt.currentDate) 
THEN UPDATE SET scd2_tgt.end_date = current_date(), scd2_tgt.filter_flag = 0

-- Second MERGE: Insert new version
MERGE INTO data_modeling.gold.scd2_tgt AS scd2_tgt
USING scd2_src
ON scd2_tgt.id = scd2_src.id AND scd2_tgt.filter_flag = 1
WHEN NOT MATCHED THEN INSERT (id, cat, currentDate, start_date, end_date, filter_flag)
VALUES (scd2_src.id, scd2_src.cat, scd2_src.currentDate, scd2_src.start_date, scd2_src.end_date, scd2_src.filter_flag)

6.2 Conformed Dimensions

A conformed dimension is a dimension table that's shared across multiple fact tables, ensuring consistency. Example: A Date dimension table is used across Sales, Inventory, and Shipping fact tables.
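
In query terms, "conformed" just means every fact table joins to the same physical dimension. A sketch with illustrative table names:

-- The same dim_date table serves two different fact tables
SELECT d.calendar_month, SUM(s.amount) AS total_sales
FROM fact_sales AS s
JOIN dim_date AS d ON s.date_id = d.date_id
GROUP BY d.calendar_month;

SELECT d.calendar_month, SUM(i.units_in_stock) AS total_stock
FROM fact_inventory AS i
JOIN dim_date AS d ON i.date_id = d.date_id
GROUP BY d.calendar_month;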

6.3 Junk Dimensions

A junk dimension is a single table that combines multiple low-cardinality attributes, like flags and indicators, to simplify the data model.
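
A sketch of what that could look like; the flag names and table are purely illustrative:

-- One junk dimension holding every combination of low-cardinality flags
CREATE TABLE IF NOT EXISTS data_modeling.gold.dimorderflags (
  order_flags_key INT,
  is_gift         BOOLEAN,
  is_expedited    BOOLEAN,
  is_returned     BOOLEAN
);
-- The fact table then stores a single order_flags_key instead of three separate flag columns.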

6.4 Degenerate Dimensions

A degenerate dimension is a dimension attribute that's stored directly in the fact table, without a separate dimension table. It's typically a transaction-specific identifier, like an order number.

6.5 Role-Playing Dimensions

A role-playing dimension is a single dimension table that's used multiple times in the same fact table, with each instance playing a different role. Example: A Date dimension table is used for Order Date, Ship Date, and Delivery Date.
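
The trick is simply aliasing the same dimension table once per role. A sketch with illustrative names:

-- dim_date plays three roles through three aliases
SELECT
  od.calendar_date AS order_date,
  sd.calendar_date AS ship_date,
  dd.calendar_date AS delivery_date,
  f.total_value
FROM fact_orders AS f
JOIN dim_date AS od ON f.order_date_id = od.date_id
JOIN dim_date AS sd ON f.ship_date_id = sd.date_id
JOIN dim_date AS dd ON f.delivery_date_id = dd.date_id;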

6.6 Outrigger Dimensions

An outrigger dimension is a dimension table that's related to another dimension table, providing additional context and attributes.

6.7 Shrunken Dimensions

A shrunken dimension is a subset of a larger dimension table, containing a smaller set of attributes or rows.


7. Star Schema vs Snowflake Schema

7.1 Conceptual Comparison

The classic engineering trade-off: Query Speed vs. Data Integrity.

7.2 Booking Fact Example

Star Schema (Denormalized): Think of this as a "Flat File" approach. In your Host Dim, you would have columns like Host_Name, Host_City, Host_State, and Host_Country all in one single table.

Snowflake Schema (Normalized): This is the "Organized Folders" approach. You take that Host Dim and pull out the Country into its own table. Now, the Host Dim only stores a Country_ID.

7.3 Query Path Differences

Schema    | Path to get "Country Name"                              | Number of Joins
Star      | Booking_Fact → Host_Dim (country name is already there) | 1 join
Snowflake | Booking_Fact → Host_Dim → Country_Lookup                | 2 joins
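
Written out as queries, the extra hop looks like this (a sketch; the booking_amount, host_id, and Country_Name columns are assumed for illustration):

-- Star: the country name lives directly on the host dimension (1 join)
SELECT h.Host_Country, SUM(b.booking_amount) AS revenue
FROM Booking_Fact AS b
JOIN Host_Dim AS h ON b.host_id = h.host_id
GROUP BY h.Host_Country;

-- Snowflake: one extra hop through the country lookup (2 joins)
SELECT c.Country_Name, SUM(b.booking_amount) AS revenue
FROM Booking_Fact AS b
JOIN Host_Dim AS h ON b.host_id = h.host_id
JOIN Country_Lookup AS c ON h.Country_ID = c.Country_ID
GROUP BY c.Country_Name;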

7.4 Storage vs Performance Trade-offs

Storage Efficiency (Snowflake Wins): Snowflake is much more efficient because it follows Third Normal Form (3NF) logic. By storing "Country" once in a lookup table, you save massive amounts of space.

Query Efficiency (Star Wins): Star schemas are "Read-Optimized." Because there are fewer "hops," the database engine can create a simpler Execution Plan.

7.5 BI Optimization

Summary Table:

Feature     | Star Schema             | Snowflake Schema
Structure   | Denormalized (flat)     | Normalized (layered)
Joins       | Fewer (fast)            | More (slower)
Storage     | Higher (redundant)      | Lower (optimized)
Maintenance | Harder (updates repeat) | Easier (update once)
Best For    | BI reporting/dashboards | Complex data warehousing

8. Key Learnings and Reflections

8.1 Why Data Modeling Matters

Data modeling is not an optional preliminary step but a critical investment in system stability and scalability. Well-designed data models enable efficient queries, reduce maintenance burden, and provide a solid foundation for future analytical needs.

8.2 Engineering Trade-offs

Every design decision involves trade-offs. Star schemas optimize query performance at the cost of storage; Snowflake schemas optimize storage at the cost of query complexity. SCD Type 1 simplifies implementation but loses history; Type 2 preserves history with added complexity. The key is understanding your specific requirements and choosing intentionally.

8.3 Impact on Future Pipelines

Proper data modeling at the foundation prevents expensive redesigns later. Investing time upfront in Bronze-Silver-Gold architecture, proper dimension design, and appropriate schema selection pays dividends as data volume grows and analytical demands increase.


9. Acknowledgements

This knowledge wouldn't have been possible without Ansh Lamba's Data Modelling lecture on YouTube: https://youtu.be/K7C1sWKQU-o

Huge thanks to him too!


Note: Written by me, corrected with the help of Grok, ChatGPT and Gemini, and taught by Ansh Lamba.