My Journey into Data Modeling: A Foundational Step in Data Engineering
Published: 06-01-2026
1. Introduction
Over the past few days, I've been learning data modeling, which is a foundational step in my journey as a data engineer and data scientist. Data modeling is one of those topics that beginners often skip, forgetting how beneficial it can be when designing new OLTP or OLAP database infrastructures. In the short term, it might seem straightforward, but its broader implications are extremely valuable.
In this area, I've explored several crucial concepts such as the data flow between the Bronze, Silver, and Gold layers, the Medallion architecture, incremental data loading, merging/upserting data using the MERGE INTO SQL query, fact and dimension tables, Slowly Changing Dimensions (SCD Types 1 & 2), and schema designs like the Star and Snowflake models. At first glance, each of these topics may appear simple, but in reality, they can save significant time and effort in the future if the data pipeline is properly modeled.
Let's dive deeper into these topics, shall we?
1.1 Motivation Behind Learning Data Modeling
Data modeling provides the foundational framework for building scalable and maintainable data systems. Understanding these principles early prevents costly redesigns later in a data engineering career.
1.2 Why Beginners Often Skip Data Modeling
Many beginners focus immediately on building pipelines and writing code without considering the underlying structure. This short-term approach often leads to technical debt and system maintenance challenges as data volume grows.
1.3 Short-Term Simplicity vs Long-Term Impact
While basic scripts may work initially, a poorly modeled data architecture becomes increasingly expensive to maintain. Investing time in proper data modeling upfront saves significant effort and resources in the long term.
2. Conceptual Foundations
2.1 What Is Data Modeling in Data Engineering?
Data modeling in data engineering involves designing the structure and relationships of data within a system. It encompasses understanding how data flows through different layers, ensuring proper normalization, and optimizing for both storage and query performance.
2.2 OLTP vs OLAP: Why Modeling Matters
OLTP (Online Transaction Processing) systems prioritize write performance and data integrity for transactional workloads. OLAP (Online Analytical Processing) systems optimize for read performance and complex analytical queries. Data modeling differs significantly between these paradigms, affecting schema design, normalization strategies, and indexing approaches.
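To make the contrast concrete, here is a minimal sketch (with hypothetical demo.* table names) of the same orders data shaped for each workload: a narrow, normalized table for transactional writes versus a wide, pre-aggregated table for analytical reads.
# Assumes an active SparkSession named `spark`, as in the rest of this post,
# and a hypothetical `demo` schema.

# OLTP-style: narrow and normalized, optimized for single-row writes and integrity.
spark.sql("""
CREATE TABLE IF NOT EXISTS demo.oltp_orders (
    order_id    BIGINT,
    customer_id BIGINT,
    product_id  BIGINT,
    quantity    INT,
    last_update TIMESTAMP
)
""")

# OLAP-style: pre-aggregated and denormalized, optimized for scans and aggregations.
spark.sql("""
CREATE TABLE IF NOT EXISTS demo.olap_daily_sales AS
SELECT
    CAST(last_update AS DATE) AS order_date,
    product_id,
    SUM(quantity)             AS total_quantity,
    COUNT(*)                  AS order_count
FROM demo.oltp_orders
GROUP BY CAST(last_update AS DATE), product_id
""")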
2.3 Overview of Topics Covered
This document covers the Medallion Architecture (Bronze-Silver-Gold layers), fact and dimension tables, Slowly Changing Dimensions, Star and Snowflake schema designs, and incremental loading strategies essential for modern data engineering.
3. Medallion Architecture Overview
Bronze–Silver–Gold Architecture
Picture this: data as a river of mud from your sources (APIs, CSVs, databases). You don't want to drink it straight – you filter it through stages until it's crystal clear. That's the Medallion Architecture, a Databricks-coined "quality funnel". It enforces progressive refinement: Bronze for raw ingestion (no judgments), Silver for cleaning and normalization (fix the mess), Gold for business-ready aggregation (serve the steak).
Data as a Quality Funnel
Assumption check: I assumed raw data is always the "truth". Counterpoint: sources lie (typos, biases), so Bronze preserves them as-is for audits. The flaw in my thinking? If I over-transform early, I lose auditability; layering keeps that in check.
Design Assumptions and Trade-offs
The Medallion Architecture balances several competing concerns: preserving raw data for auditability while progressively refining it for usability, managing storage efficiency while maintaining data quality, and enabling incremental loads while supporting complete rebuilds when necessary.
3.1 🥉 Bronze Layer — Raw Data Ingestion
Purpose of the Bronze Layer
The Bronze layer is the landing zone. The rule here is strict: do not transform business logic. If the source sends bad data, you keep it. If something breaks tomorrow, Bronze allows you to replay history without asking the source system again.
Immutability and Auditability
Bronze layer data remains immutable, preserving the exact state of data as received from source systems. This immutability ensures auditability—you can always trace back to see what data was originally received, enabling debugging and compliance auditing when data issues arise downstream.
Incremental Load Strategy
Flow: (1) Check whether the Bronze table exists and get the last load date. (2) Filter the source data to rows after that date for the incremental pull. (3) Create/replace the Bronze table with the filtered data.
Table Existence Check
I started by assuming a source table data_modeling.default.source_data with columns like order_id, customer_name, last_update, etc.
Python for logic, SQL for creation:
if spark.catalog.tableExists("data_modeling.bronze.bronze_table"):
    # Table exists: use the latest last_update as the incremental watermark
    result = spark.sql("SELECT MAX(last_update) AS max_date FROM data_modeling.bronze.bronze_table").collect()[0][0]
    if result is None:
        last_load_date = '2000-02-29'  # empty table: fall back to the start date
    else:
        last_load_date = result.strftime("%Y-%m-%d")  # format as string
else:
    last_load_date = '2000-02-29'  # no table yet: fall back to the start date
This block answers a critical question: From which point in time should I fetch new data?
- If the Bronze table already exists, I query the maximum last_update value.
- If the table does not exist or is empty, I fall back to a safe historical date.
Fallback Load Date Logic
The fallback mechanism ensures that even if the Bronze table is empty or damaged, the system can restart from a known safe point. This prevents data loss and enables recovery from infrastructure failures.
Incremental Data Extraction
Next, filter and create the temp view:
spark.sql(f"""SELECT * FROM data_modeling.default.source_data
WHERE last_update > '{last_load_date}'""").createOrReplaceTempView("bronze_source")
This creates a temporary view containing only incremental data.
spark.sql(f"""SELECT * FROM ... WHERE last_update > '{last_load_date}'""")– Pulls all columns from source, but only rows newer than last load..createOrReplaceTempView("bronze_source")– Makes this a temporary table Spark can query like a real one – disposable, in-memory for speed.
Temporary Views and Materialization
Finally, I materialize it into the Bronze table.
CREATE OR REPLACE TABLE data_modeling.bronze.bronze_table
AS
SELECT * FROM bronze_source;
This layer exists purely for safety, replayability, and auditability.
Why No Business Logic Belongs Here
The Bronze layer must remain pure—containing only raw data as extracted from source systems. Any business logic or transformations belong in higher layers (Silver, Gold), ensuring that Bronze always serves as a reliable audit trail and recovery point.
3.2 🥈 Silver Layer — Data Cleaning and Normalization
Purpose of the Silver Layer
If Bronze is about trust in the source, Silver is about trust in the data. In this layer we usually fix nulls, deduplicate records, and enforce data types. In industry, data is generally normalized into a Snowflake-style layout here (separate tables for customers, products, and so on) to avoid redundancy, but I kept it simple.
Data Cleaning and Standardization
The Silver layer applies consistent data quality rules: standardizing formats, handling missing values, enforcing data types, and removing duplicates.
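As a sketch of what those rules can look like in PySpark (using the columns assumed earlier: order_id, customer_name, quantity, price), rather than the exact transformations from my notebook:
from pyspark.sql import functions as F

# Read the raw Bronze data (`spark` is the active SparkSession).
bronze_df = spark.table("data_modeling.bronze.bronze_table")

cleaned_df = (
    bronze_df
    .withColumn("customer_name", F.trim(F.col("customer_name")))  # standardize formats
    .withColumn("quantity", F.col("quantity").cast("int"))        # enforce data types
    .withColumn("price", F.col("price").cast("double"))
    .fillna({"customer_name": "UNKNOWN"})                         # handle missing values
    .dropDuplicates(["order_id"])                                 # remove duplicates
)
cleaned_df.createOrReplaceTempView("silver_cleaned")  # ready for the MERGE step below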
Transformation Workflow
Flow: (1) Transform Bronze data (e.g., uppercase names, add timestamp). (2) Create temp view. (3) Build Silver table. (4) MERGE for upserts.
spark.sql("""
SELECT *, upper(customer_name) AS upper_names, current_date AS current_time
FROM data_modeling.bronze.bronze_table""").createOrReplaceTempView("silver_source")
Incremental Upserts Using MERGE
Create the Silver table if it does not already exist:
CREATE TABLE IF NOT EXISTS data_modeling.silver.silver_table
AS
SELECT * FROM silver_source;
The star: MERGE INTO for incremental upsert:
MERGE INTO data_modeling.silver.silver_table
USING silver_source
ON data_modeling.silver.silver_table.order_id = silver_source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
- MERGE INTO target_table USING source ON key_match – Core upsert: compares order_id (the unique key).
- WHEN MATCHED THEN UPDATE SET * – If the ID exists, overwrite all columns (SCD Type 1 vibe).
- WHEN NOT MATCHED THEN INSERT * – New ID? Add the full row.
3.3 🥇 Gold Layer — Business-Ready Data
Purpose of the Gold Layer
Silver's normalized for purity; Gold flattens to Star Schema for BI tools – joins pre-baked, aggregations ready. Target: Analysts querying "revenue by country?" without lag.
Star Schema Philosophy
The Star Schema denormalizes dimension tables into a single flat layer, minimizing the number of joins required for queries. While this increases storage requirements, it dramatically improves query performance for BI tools.
Dimension Table Construction
Flow: (1) Extract distinct dims from Silver (add surrogate keys via ROW_NUMBER). (2) Build fact by joining back.
Deduplication of Business Keys
First, dimensions (e.g., dimcustomer):
CREATE OR REPLACE TABLE data_modeling.gold.dimcustomer AS
SELECT *, ROW_NUMBER() OVER (ORDER BY customer_id) AS dimsagkeycust FROM (
  SELECT DISTINCT
    customer_id,
    customer_name,
    customer_email,
    upper_names
  FROM data_modeling.silver.silver_table
) AS customers
Surrogate Key Generation
Surrogate keys provide stable, system-assigned identifiers independent of business logic. These keys remain constant even if business keys change.
Fact Table Design
Fact tables contain:
- Measures: Numeric values you perform calculations on (quantity, price, amount)
- Descriptors: Foreign keys pointing to dimensions, providing context for measures
CREATE OR REPLACE TABLE data_modeling.gold.factsales AS
SELECT
p.dimsagkeyprod,
s.dimsagkeysales,
c.dimsagkeycountry,
pay.dimsagkeypay,
cust.dimsagkeycust,
f.quantity,
f.price
FROM data_modeling.silver.silver_table AS f
LEFT JOIN data_modeling.gold.dimproduct AS p ON f.product_id = p.product_id
LEFT JOIN data_modeling.gold.dimcountry AS c ON f.country = c.country
LEFT JOIN data_modeling.gold.dimpayments AS pay ON f.payment_method = pay.payment_method
LEFT JOIN data_modeling.gold.dimcustomer AS cust ON f.customer_id = cust.customer_id
LEFT JOIN data_modeling.gold.dimsales AS s ON f.order_id = s.order_id
4. Incremental Loading Revisited
Incremental Flow Across Layers
Incremental loading ties Bronze and Silver together. Bronze captures changes using last_update. Silver consumes those changes and merges them using MERGE INTO. Gold is then rebuilt or incrementally refreshed from Silver.
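Putting section 3 together, one incremental run might be wrapped up like this. This is a recap sketch reusing the statements above; the function names are mine, not from the original notebook, and the Silver table is assumed to already exist (created as in section 3.2).
def get_last_load_date():
    # Bronze watermark: the latest last_update already loaded, or the fallback start date.
    if spark.catalog.tableExists("data_modeling.bronze.bronze_table"):
        result = spark.sql(
            "SELECT MAX(last_update) AS max_date FROM data_modeling.bronze.bronze_table"
        ).collect()[0][0]
        if result is not None:
            return result.strftime("%Y-%m-%d")
    return '2000-02-29'

def run_incremental_load():
    last_load_date = get_last_load_date()

    # Bronze: pull only rows newer than the watermark.
    spark.sql(f"""SELECT * FROM data_modeling.default.source_data
                  WHERE last_update > '{last_load_date}'""").createOrReplaceTempView("bronze_source")
    spark.sql("CREATE OR REPLACE TABLE data_modeling.bronze.bronze_table AS SELECT * FROM bronze_source")

    # Silver: transform the new Bronze rows, then MERGE (upsert) them.
    spark.sql("""SELECT *, upper(customer_name) AS upper_names, current_date AS current_time
                 FROM data_modeling.bronze.bronze_table""").createOrReplaceTempView("silver_source")
    spark.sql("""MERGE INTO data_modeling.silver.silver_table AS tgt
                 USING silver_source AS src
                 ON tgt.order_id = src.order_id
                 WHEN MATCHED THEN UPDATE SET *
                 WHEN NOT MATCHED THEN INSERT *""")

    # Gold: rebuild the dimensions and fact table from Silver (see section 3.3).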
Minimizing Data Movement
The key idea is simple: never move more data than necessary. By filtering at Bronze based on change timestamps and propagating only changed rows through Silver, incremental loading minimizes I/O, compute costs, and processing time.
Rebuild vs Refresh Trade-offs
Incremental refresh of Gold layer fact tables requires careful design of slowly changing dimension logic. In some cases, a complete rebuild of Gold tables from Silver may be more efficient and maintainable than complex incremental logic, depending on data volume and change frequency.
5. Types of Fact Tables
A fact table is the main table in a data warehouse, storing the actual measurements, typically numerical and additive values like sales amount or website clicks. It forms the core of a star or snowflake schema, surrounded by dimension tables that give context to the facts.
5.1 Transactional Facts
Transactional Fact (Append-Only Events):
| Transaction_ID | Date_ID | Product_ID | Quantity | Total_Price |
|---|---|---|---|---|
| 1001 | 20250101 | P123 | 2 | 10.00 |
| 1002 | 20250101 | P456 | 1 | 5.00 |
Grain: Per transaction. No updates – just add rows. Great for "peak hour sales."
5.2 Periodic Snapshot Facts
Periodic Snapshot (Time-Boxed States):
| Month_ID | Product_ID | Warehouse_ID | Units_In_Stock |
|---|---|---|---|
| 202501 | P123 | W1 | 500 |
| 202502 | P123 | W1 | 420 |
Grain: End-of-month snapshot. Semi-additive (you can sum across warehouses, but not across time).
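To make the semi-additivity concrete, here is a small sketch against a hypothetical fact_inventory_snapshot table mirroring the columns above: summing across warehouses within one month is valid, while summing across months would double-count stock, so an average (or latest value) is used instead.
# Hypothetical table mirroring the columns above: Month_ID, Product_ID, Warehouse_ID, Units_In_Stock.

# Valid: sum across warehouses within a single month.
spark.sql("""
SELECT Month_ID, Product_ID, SUM(Units_In_Stock) AS total_stock
FROM fact_inventory_snapshot
WHERE Month_ID = 202501
GROUP BY Month_ID, Product_ID
""").show()

# Not meaningful to SUM across months (it double-counts stock); average it instead.
spark.sql("""
SELECT Product_ID, Warehouse_ID, AVG(Units_In_Stock) AS avg_stock_over_time
FROM fact_inventory_snapshot
GROUP BY Product_ID, Warehouse_ID
""").show()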
5.3 Accumulating Snapshot Facts
Accumulating Snapshot (Lifecycle Milestones):
| Order_ID | Date_Placed | Date_Packed | Date_Shipped | Date_Delivered | Total_Value |
|---|---|---|---|---|---|
| 555 | 2025-01-01 | 2025-01-02 | 2025-01-03 | 2025-01-05 | 150.00 |
Rows are updated in place as milestones occur (e.g., the ship date is added). Measures "delivery lag."
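A minimal sketch, assuming a Delta table named fact_order_fulfillment with the columns shown above: one statement updates the row in place when a milestone arrives, and one computes the delivery lag.
# Hypothetical Delta table mirroring the columns above (Order_ID, Date_Placed, ..., Date_Delivered).

# Update the existing row in place when a new milestone arrives (e.g., the order ships).
spark.sql("""
UPDATE fact_order_fulfillment
SET Date_Shipped = current_date()
WHERE Order_ID = 555 AND Date_Shipped IS NULL
""")

# Measure "delivery lag" for completed lifecycles.
spark.sql("""
SELECT Order_ID, DATEDIFF(Date_Delivered, Date_Placed) AS delivery_lag_days
FROM fact_order_fulfillment
WHERE Date_Delivered IS NOT NULL
""").show()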
5.4 Factless and Aggregate Facts
Specials: Factless (just COUNT events, e.g., attendance); Aggregate (pre-summed for speed, e.g., daily totals).
Fact Types Comparison Table:
| Feature | Transactional | Periodic Snapshot | Accumulating Snapshot |
|---|---|---|---|
| Grain | One row per event | One row per time period | One row per process lifecycle |
| Updates | Never (Append only) | Never (Append only) | Always (Updated milestones) |
| Date Dimension | 1 Date key | 1 Date key (End of period) | Multiple Date keys |
| Primary Goal | Detailed history | Trend/State tracking | Efficiency/Process speed |
6. Dimension Tables and SCDs
Dimensions add context, but they evolve (customers move). Types keep it sane. Dimensions are the "who/what/where" descriptors that provide context for numeric measures in fact tables.
6.1 Slowly Changing Dimensions Overview
SCD (Slowly Changing Dimensions): Core for history. Type 0: Fixed (birthdate – never change). Type 1: Overwrite (fix "Jhon" to "John" – lose old). Type 2: New row (preserve both with dates).
6.1.1 SCD Type 0
Type 0 dimensions are immutable. Once a dimension record is created, it never changes. This is appropriate for attributes that should never change, such as birth dates or original product codes.
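A minimal sketch of how Type 0 can be enforced with the same MERGE pattern used below: there is simply no WHEN MATCHED clause, so existing rows are never touched. The table names (scd0_tgt, scd0_src) are illustrative, mirroring the Type 1 and Type 2 examples.
# Type 0: insert new ids, never update existing rows (no WHEN MATCHED clause).
spark.sql("""
MERGE INTO data_modeling.gold.scd0_tgt AS scd0_tgt
USING scd0_src
ON scd0_tgt.id = scd0_src.id
WHEN NOT MATCHED THEN INSERT *
""")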
6.1.2 SCD Type 1: Overwrite – Simple, History-Lossy
SCD Type 1 means overwriting the old data with the new data, with no history kept. So if a customer's address changes, you just update the address, and poof, the old address is gone. Simple and straightforward, but you're losing history.
MERGE INTO data_modeling.gold.scd1_tgt
USING scd1_src
ON data_modeling.gold.scd1_tgt.id = scd1_src.id
WHEN MATCHED AND scd1_src.currentDate >= data_modeling.gold.scd1_tgt.currentDate THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
6.1.3 SCD Type 2: Versioned – History Keeper
SCD Type 2 is all about keeping history. When data changes, you add a new record with the updated info and mark the old one as inactive, preserving the historical context.
-- First MERGE: Expire old if changed
MERGE INTO data_modeling.gold.scd2_tgt AS scd2_tgt
USING scd2_src
ON scd2_tgt.id = scd2_src.id AND scd2_tgt.filter_flag = 1
WHEN MATCHED AND (scd2_src.cat <> scd2_tgt.cat AND scd2_src.currentDate >= scd2_tgt.currentDate)
THEN UPDATE SET scd2_tgt.end_date = current_date(), scd2_tgt.filter_flag = 0
-- Second MERGE: Insert new version
MERGE INTO data_modeling.gold.scd2_tgt AS scd2_tgt
USING scd2_src
ON scd2_tgt.id = scd2_src.id AND scd2_tgt.filter_flag = 1
WHEN NOT MATCHED THEN INSERT (id, cat, currentDate, start_date, end_date, filter_flag)
VALUES (scd2_src.id, scd2_src.cat, scd2_src.currentDate, scd2_src.start_date, scd2_src.end_date, scd2_src.filter_flag)
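With the versioned rows in place, the flag and date columns make the two common lookups straightforward. A small sketch, assuming active rows carry filter_flag = 1 and a NULL (or far-future) end_date:
# Current state: only the active version of each id.
spark.sql("""
SELECT * FROM data_modeling.gold.scd2_tgt
WHERE filter_flag = 1
""").show()

# Point-in-time ("as of") lookup: what the dimension looked like on a given date.
spark.sql("""
SELECT * FROM data_modeling.gold.scd2_tgt
WHERE start_date <= '2025-06-01'
  AND (end_date IS NULL OR end_date > '2025-06-01')
""").show()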
6.2 Conformed Dimensions
A conformed dimension is a dimension table that's shared across multiple fact tables, ensuring consistency. Example: A Date dimension table is used across Sales, Inventory, and Shipping fact tables.
6.3 Junk Dimensions
A junk dimension is a single table that combines multiple low-cardinality attributes, like flags and indicators, to simplify the data model.
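A minimal sketch of what that can look like, using hypothetical flag columns (is_gift, is_express, payment_confirmed) on a hypothetical orders_source table: the distinct flag combinations become one small dimension with its own surrogate key.
# Hypothetical: combine several low-cardinality flags into a single junk dimension.
spark.sql("""
CREATE OR REPLACE TABLE data_modeling.gold.dimjunk AS
SELECT
    ROW_NUMBER() OVER (ORDER BY is_gift, is_express, payment_confirmed) AS dimsagkeyjunk,
    is_gift,
    is_express,
    payment_confirmed
FROM (
    SELECT DISTINCT is_gift, is_express, payment_confirmed
    FROM orders_source  -- hypothetical source holding the flag columns
) AS flags
""")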
6.4 Degenerate Dimensions
A degenerate dimension is a dimension attribute that's stored directly in the fact table, without a separate dimension table. It's typically a transaction-specific identifier, like an order number.
6.5 Role-Playing Dimensions
A role-playing dimension is a single dimension table that's used multiple times in the same fact table, with each instance playing a different role. Example: A Date dimension table is used for Order Date, Ship Date, and Delivery Date.
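A small sketch of how that looks in a query: one hypothetical dimdate table joined three times under different aliases, once per role (the table and key names are illustrative).
# One Date dimension, three roles: order date, ship date, delivery date.
spark.sql("""
SELECT
    f.order_id,
    order_dt.full_date    AS order_date,
    ship_dt.full_date     AS ship_date,
    delivery_dt.full_date AS delivery_date
FROM factsales AS f
LEFT JOIN dimdate AS order_dt    ON f.order_date_key    = order_dt.date_key
LEFT JOIN dimdate AS ship_dt     ON f.ship_date_key     = ship_dt.date_key
LEFT JOIN dimdate AS delivery_dt ON f.delivery_date_key = delivery_dt.date_key
""").show()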
6.6 Outrigger Dimensions
An outrigger dimension is a dimension table that's related to another dimension table, providing additional context and attributes.
6.7 Shrunken Dimensions
A shrunken dimension is a subset of a larger dimension table, containing a smaller set of attributes or rows.
7. Star Schema vs Snowflake Schema
7.1 Conceptual Comparison
The classic engineering trade-off: Query Speed vs. Data Integrity.
7.2 Booking Fact Example
Star Schema (Denormalized): Think of this as a "Flat File" approach. In your Host Dim, you would have columns like Host_Name, Host_City, Host_State, and Host_Country all in one single table.
Snowflake Schema (Normalized): This is the "Organized Folders" approach. You take that Host Dim and pull out the Country into its own table. Now, the Host Dim only stores a Country_ID.
7.3 Query Path Differences
| Schema | Path to get "Country Name" | Number of Joins |
|---|---|---|
| Star | Booking_Fact → Host_Dim (Country name is already there) | 1 Join |
| Snowflake | Booking_Fact → Host_Dim → Country_Lookup | 2 Joins |
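Written out as queries (with hypothetical table and column names for the booking example), the difference in join depth looks like this:
# Star: the country name already lives on the denormalized host dimension -- one join.
spark.sql("""
SELECT h.host_country, COUNT(*) AS bookings
FROM booking_fact AS b
JOIN host_dim AS h ON b.host_id = h.host_id
GROUP BY h.host_country
""").show()

# Snowflake: the host dimension only stores a country_id -- two joins.
spark.sql("""
SELECT c.country_name, COUNT(*) AS bookings
FROM booking_fact AS b
JOIN host_dim AS h ON b.host_id = h.host_id
JOIN country_lookup AS c ON h.country_id = c.country_id
GROUP BY c.country_name
""").show()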
7.4 Storage vs Performance Trade-offs
Storage Efficiency (Snowflake Wins): Snowflake is more storage-efficient because it follows Third Normal Form (3NF) logic. By storing "Country" once in a lookup table instead of repeating it on every host row, you avoid a significant amount of redundant storage.
Query Efficiency (Star Wins): Star schemas are "Read-Optimized." Because there are fewer "hops," the database engine can create a simpler Execution Plan.
7.5 BI Optimization
Summary Table:
| Feature | Star Schema | Snowflake Schema |
|---|---|---|
| Structure | Denormalized (Flat) | Normalized (Layered) |
| Joins | Fewer (Fast) | More (Slower) |
| Storage | Higher (Redundant) | Lower (Optimized) |
| Maintenance | Harder (Updates repeat) | Easier (Update once) |
| Best For | BI Reporting/Dashboards | Complex Data Warehousing |
8. Key Learnings and Reflections
8.1 Why Data Modeling Matters
Data modeling is not an optional preliminary step but a critical investment in system stability and scalability. Well-designed data models enable efficient queries, reduce maintenance burden, and provide a solid foundation for future analytical needs.
8.2 Engineering Trade-offs
Every design decision involves trade-offs. Star schemas optimize query performance at the cost of storage; Snowflake schemas optimize storage at the cost of query complexity. SCD Type 1 simplifies implementation but loses history; Type 2 preserves history with added complexity. The key is understanding your specific requirements and choosing intentionally.
8.3 Impact on Future Pipelines
Proper data modeling at the foundation prevents expensive redesigns later. Investing time upfront in Bronze-Silver-Gold architecture, proper dimension design, and appropriate schema selection pays dividends as data volume grows and analytical demands increase.
9. Acknowledgements
This knowledge wouldn't have been possible without Ansh Lamba's Data Modelling lecture on YouTube: https://youtu.be/K7C1sWKQU-o
Huge thanks to him too!
Note: Written by me, corrected with the help of Grok, ChatGPT and Gemini, and taught by Ansh Lamba.