A Comparative Analysis of Data Lake, Data Warehouse, and Lakehouse Architectures

Comparing data lake, data warehouse, and lakehouse architectures sits at the boundary between data engineering and analytics. Whether your data lives in Postgres, Snowflake, or a data lake, the concepts in this lesson let you ingest, query, and reshape it efficiently at scale.

Why This Comparison Matters

A data scientist fluent in SQL is independent of analysts and engineers for most data access. That independence lets you iterate faster and ask sharper, better-informed questions of the data.

  • Keep analytical queries declarative — describe what, not how.
  • Lean on window functions and CTEs for readability.
  • Design indexes based on query patterns, not table structure.
  • Know when to push computation to the database versus to Python.
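The last bullet is easiest to see side by side: the same aggregation pushed to the database versus computed in Python after fetching every row. A minimal sketch using SQLite's in-memory engine; the `payments` table and its columns are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE payments (customer TEXT, amount REAL);
INSERT INTO payments VALUES
 ('Ana', 42.0), ('Bob', 99.5), ('Ana', 17.3);
""")

# Pushed to the database: one row per group crosses the wire.
db_totals = dict(con.execute(
    "SELECT customer, SUM(amount) FROM payments GROUP BY customer"))

# Done in Python: every raw row crosses the wire first.
py_totals = {}
for customer, amount in con.execute("SELECT customer, amount FROM payments"):
    py_totals[customer] = py_totals.get(customer, 0.0) + amount

print(db_totals)
print(py_totals)
```

Both produce identical totals; the difference is how much data leaves the database, which is what dominates on real table sizes.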

How These Architectures Show Up in Practice

In a typical project, an understanding of data lake, data warehouse, and lakehouse architectures is combined with the rest of the SQL & Databases toolkit. You rarely use any one technique in isolation; the real skill is knowing which combination fits the problem you are trying to solve, and being able to explain that choice to a non-technical stakeholder.

Expect to use this when pulling training data, shipping a dashboard, optimising a slow query, or architecting a new analytical table.
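The first of those tasks, pulling training data, often amounts to a single well-shaped query that returns a model-ready matrix. A minimal sketch, assuming a hypothetical `signups` table; in a real project the connection would point at your warehouse rather than an in-memory SQLite database.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE signups (user_id INT, plan TEXT, sessions INT, churned INT);
INSERT INTO signups VALUES
 (1,'free',  2, 1), (2,'pro', 30, 0),
 (3,'free',  1, 1), (4,'pro', 12, 0);
""")

# One query that does the feature engineering: numeric columns first, label last.
rows = con.execute("""
    SELECT sessions,
           CASE plan WHEN 'pro' THEN 1 ELSE 0 END AS is_pro,
           churned
    FROM signups
    ORDER BY user_id
""").fetchall()

X = [r[:-1] for r in rows]  # feature columns
y = [r[-1] for r in rows]   # label column
print(X)
print(y)
```

Encoding the categorical `plan` column inside the query keeps the Python side to a simple fetch, which is exactly the "push computation to the database" habit from the checklist above.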


Code Examples (5 runnable snippets)

Copy any block into a file or notebook and run it end-to-end — each example stands alone.

Example 1: SQLAlchemy Core: typed schema + bulk insert

# Example 1: SQLAlchemy Core: typed schema + bulk insert
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        Integer, String, Float, select, func)

engine = create_engine("sqlite:///:memory:", future=True)
meta   = MetaData()

orders = Table("orders", meta,
    Column("id",        Integer, primary_key=True),
    Column("customer",  String(64), nullable=False),
    Column("amount",    Float,  nullable=False),
)
meta.create_all(engine)

with engine.begin() as conn:
    conn.execute(orders.insert(), [
        {"customer": "Ana", "amount": 42.0},
        {"customer": "Bob", "amount": 99.5},
        {"customer": "Ana", "amount": 17.3},
    ])
    stmt = (select(orders.c.customer, func.sum(orders.c.amount).label("total"))
            .group_by(orders.c.customer))
    for row in conn.execute(stmt):
        print(row.customer, round(row.total, 2))

Example 2: Recursive CTE for hierarchical data

# Example 2: Recursive CTE for hierarchical data
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE org (id INT PRIMARY KEY, name TEXT, manager_id INT);
INSERT INTO org VALUES
 (1,'Ada',    NULL),
 (2,'Ben',     1),
 (3,'Cai',     1),
 (4,'Dee',     2),
 (5,'Eli',     2),
 (6,'Fay',     3),
 (7,'Gio',     5);
""")

query = """
WITH RECURSIVE chain(id, name, level, path) AS (
    SELECT id, name, 0, name
    FROM   org
    WHERE  manager_id IS NULL
    UNION ALL
    SELECT o.id, o.name, c.level + 1, c.path || ' / ' || o.name
    FROM   org o JOIN chain c ON o.manager_id = c.id
)
SELECT level, id, printf('%s%s', replace(hex(zeroblob(level*2)),'00','  '),
                         name) AS tree
FROM chain
ORDER BY path;
"""
for row in con.execute(query):
    print(row)

Example 3: Window functions and common table expressions

# Example 3: Window functions and common table expressions
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (day DATE, region TEXT, amount REAL);
INSERT INTO sales VALUES
 ('2026-01-01','NA', 1200), ('2026-01-02','NA', 1550),
 ('2026-01-01','EU',  900), ('2026-01-02','EU', 1180),
 ('2026-01-03','NA', 1700), ('2026-01-03','EU', 1220);
""")

query = """
WITH daily AS (
    SELECT region, day, SUM(amount) AS revenue
    FROM sales GROUP BY region, day
)
SELECT region, day, revenue,
       SUM(revenue) OVER (PARTITION BY region ORDER BY day) AS running_total,
       RANK()       OVER (PARTITION BY day    ORDER BY revenue DESC) AS day_rank
FROM daily
ORDER BY day, region;
"""
for row in con.execute(query):
    print(row)

Example 4: Parameterised upsert against an indexed table

# Example 4: Parameterised upsert against an indexed table
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE users (
    id     INTEGER PRIMARY KEY,
    email  TEXT UNIQUE NOT NULL,
    visits INTEGER NOT NULL DEFAULT 0
);
""")

def record_visit(email: str) -> None:
    con.execute("""
        INSERT INTO users (email, visits) VALUES (?, 1)
        ON CONFLICT(email) DO UPDATE
            SET visits = visits + 1;
    """, (email,))

for e in ["a@x.com", "b@x.com", "a@x.com", "a@x.com", "b@x.com"]:
    record_visit(e)

for row in con.execute("SELECT email, visits FROM users ORDER BY visits DESC"):
    print(row)

Example 5: EXPLAIN QUERY PLAN before and after indexing

# Example 5: EXPLAIN QUERY PLAN before and after indexing
import sqlite3, random

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INT, kind TEXT);")

rng  = random.Random(0)
rows = [(i, rng.randint(1, 10_000), rng.choice(["click", "view", "buy"]))
        for i in range(200_000)]
con.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

sql = "SELECT COUNT(*) FROM events WHERE user_id = 4242 AND kind = 'buy'"
print("BEFORE index:")
for r in con.execute("EXPLAIN QUERY PLAN " + sql):
    print(" ", r)

con.execute("CREATE INDEX idx_events_user_kind ON events(user_id, kind);")
print("AFTER index:")
for r in con.execute("EXPLAIN QUERY PLAN " + sql):
    print(" ", r)
print("count =", con.execute(sql).fetchone()[0])