A Comparative Analysis of Data Lake, Data Warehouse, and Lakehouse Architectures

Comparing data lake, data warehouse, and lakehouse architectures sits at the boundary between data engineering and analytics. Whether your data lives in Postgres, Snowflake, or a data lake, the concepts in this lesson let you ingest, query, and reshape it efficiently at scale.

Why This Comparison Matters

A data scientist fluent in SQL is independent of analysts and engineers for most data access. That independence lets you iterate faster and ask sharper, better-informed questions of the data.

  • Keep analytical queries declarative — describe what, not how.
  • Lean on window functions and CTEs for readability.
  • Design indexes based on query patterns, not table structure.
  • Know when to push computation to the database versus to Python.
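The last bullet is easiest to see side by side: the same aggregation pushed to the database versus computed in Python after fetching every row. A minimal sketch using SQLite's in-memory engine; the `payments` table and its columns are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE payments (customer TEXT, amount REAL);
INSERT INTO payments VALUES
 ('Ana', 42.0), ('Bob', 99.5), ('Ana', 17.3);
""")

# Pushed to the database: one row per group crosses the wire.
db_totals = dict(con.execute(
    "SELECT customer, SUM(amount) FROM payments GROUP BY customer"))

# Done in Python: every raw row crosses the wire first.
py_totals = {}
for customer, amount in con.execute("SELECT customer, amount FROM payments"):
    py_totals[customer] = py_totals.get(customer, 0.0) + amount

print(db_totals)
print(py_totals)
```

Both produce identical totals; the difference is how much data leaves the database, which is what dominates on real table sizes.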

How These Architectures Show Up in Practice

In a typical project, an understanding of data lake, data warehouse, and lakehouse architectures is combined with the rest of the SQL & Databases toolkit. You rarely use any one technique in isolation; the real skill is knowing which combination fits the problem you are trying to solve, and being able to explain that choice to a non-technical stakeholder.

Expect to use this when pulling training data, shipping a dashboard, optimising a slow query, or architecting a new analytical table.
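The first of those tasks, pulling training data, often amounts to a single well-shaped query that returns a model-ready matrix. A minimal sketch, assuming a hypothetical `signups` table; in a real project the connection would point at your warehouse rather than an in-memory SQLite database.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE signups (user_id INT, plan TEXT, sessions INT, churned INT);
INSERT INTO signups VALUES
 (1,'free',  2, 1), (2,'pro', 30, 0),
 (3,'free',  1, 1), (4,'pro', 12, 0);
""")

# One query that does the feature engineering: numeric columns first, label last.
rows = con.execute("""
    SELECT sessions,
           CASE plan WHEN 'pro' THEN 1 ELSE 0 END AS is_pro,
           churned
    FROM signups
    ORDER BY user_id
""").fetchall()

X = [r[:-1] for r in rows]  # feature columns
y = [r[-1] for r in rows]   # label column
print(X)
print(y)
```

Encoding the categorical `plan` column inside the query keeps the Python side to a simple fetch, which is exactly the "push computation to the database" habit from the checklist above.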


Code Examples (5 runnable snippets)

Copy any block into a file or notebook and run it end-to-end — each example stands alone.

Example 1: SQLAlchemy Core: typed schema + bulk insert

# Example 1: SQLAlchemy Core: typed schema + bulk insert
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        Integer, String, Float, select, func)

engine = create_engine("sqlite:///:memory:", future=True)
meta   = MetaData()

orders = Table("orders", meta,
    Column("id",        Integer, primary_key=True),
    Column("customer",  String(64), nullable=False),
    Column("amount",    Float,  nullable=False),
)
meta.create_all(engine)

with engine.begin() as conn:
    conn.execute(orders.insert(), [
        {"customer": "Ana", "amount": 42.0},
        {"customer": "Bob", "amount": 99.5},
        {"customer": "Ana", "amount": 17.3},
    ])
    stmt = (select(orders.c.customer, func.sum(orders.c.amount).label("total"))
            .group_by(orders.c.customer))
    for row in conn.execute(stmt):
        print(row.customer, round(row.total, 2))

Example 2: Recursive CTE for hierarchical data

# Example 2: Recursive CTE for hierarchical data
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE org (id INT PRIMARY KEY, name TEXT, manager_id INT);
INSERT INTO org VALUES
 (1,'Ada',    NULL),
 (2,'Ben',     1),
 (3,'Cai',     1),
 (4,'Dee',     2),
 (5,'Eli',     2),
 (6,'Fay',     3),
 (7,'Gio',     5);
""")

query = """
WITH RECURSIVE chain(id, name, level, path) AS (
    SELECT id, name, 0, name
    FROM   org
    WHERE  manager_id IS NULL
    UNION ALL
    SELECT o.id, o.name, c.level + 1, c.path || ' / ' || o.name
    FROM   org o JOIN chain c ON o.manager_id = c.id
)
SELECT level, id, printf('%s%s', replace(hex(zeroblob(level*2)),'00','  '),
                         name) AS tree
FROM chain
ORDER BY path;
"""
for row in con.execute(query):
    print(row)

Example 3: Window functions and common table expressions

# Example 3: Window functions and common table expressions
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (day DATE, region TEXT, amount REAL);
INSERT INTO sales VALUES
 ('2026-01-01','NA', 1200), ('2026-01-02','NA', 1550),
 ('2026-01-01','EU',  900), ('2026-01-02','EU', 1180),
 ('2026-01-03','NA', 1700), ('2026-01-03','EU', 1220);
""")

query = """
WITH daily AS (
    SELECT region, day, SUM(amount) AS revenue
    FROM sales GROUP BY region, day
)
SELECT region, day, revenue,
       SUM(revenue) OVER (PARTITION BY region ORDER BY day) AS running_total,
       RANK()       OVER (PARTITION BY day    ORDER BY revenue DESC) AS day_rank
FROM daily
ORDER BY day, region;
"""
for row in con.execute(query):
    print(row)

Example 4: Parameterised upsert against an indexed table

# Example 4: Parameterised upsert against an indexed table
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE users (
    id     INTEGER PRIMARY KEY,
    email  TEXT UNIQUE NOT NULL,
    visits INTEGER NOT NULL DEFAULT 0
);
""")

def record_visit(email: str) -> None:
    con.execute("""
        INSERT INTO users (email, visits) VALUES (?, 1)
        ON CONFLICT(email) DO UPDATE
            SET visits = visits + 1;
    """, (email,))

for e in ["a@x.com", "b@x.com", "a@x.com", "a@x.com", "b@x.com"]:
    record_visit(e)

for row in con.execute("SELECT email, visits FROM users ORDER BY visits DESC"):
    print(row)

Example 5: EXPLAIN QUERY PLAN before and after indexing

# Example 5: EXPLAIN QUERY PLAN before and after indexing
import sqlite3, random

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INT, kind TEXT);")

rng  = random.Random(0)
rows = [(i, rng.randint(1, 10_000), rng.choice(["click", "view", "buy"]))
        for i in range(200_000)]
con.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

sql = "SELECT COUNT(*) FROM events WHERE user_id = 4242 AND kind = 'buy'"
print("BEFORE index:")
for r in con.execute("EXPLAIN QUERY PLAN " + sql):
    print(" ", r)

con.execute("CREATE INDEX idx_events_user_kind ON events(user_id, kind);")
print("AFTER index:")
for r in con.execute("EXPLAIN QUERY PLAN " + sql):
    print(" ", r)
print("count =", con.execute(sql).fetchone()[0])