A Survey of Database Internals: Storage Engines and Query Processing
This lesson sits at the boundary between data engineering and analytics. Whether your data lives in Postgres, Snowflake or a data lake, the concepts here help you ingest, query and reshape it efficiently at scale.
Why Database Internals Matter
A data scientist fluent in SQL can handle most data access without depending on analysts or engineers. That independence lets you iterate faster and ask sharper, better-informed questions of the data.
- Keep analytical queries declarative: describe what, not how.
- Lean on window functions and CTEs for readability.
- Design indexes based on query patterns, not table structure.
- Know when to push computation to the database versus to Python.
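The first and last points above can be illustrated with a small comparison. This is a minimal sketch, not a benchmark: the `orders` table and its rows are made up for illustration, and an in-memory SQLite database stands in for whatever engine you actually use. It computes the same per-customer totals twice, once by pulling every row into Python and once by letting the database do the grouping.

```python
import sqlite3

# Hypothetical table and data, used only for this illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Ana", 40.0), ("Bob", 100.0), ("Ana", 20.0), ("Cy", 5.0)],
)

# Option A: transfer every row, then aggregate imperatively in Python.
totals_py = {}
for customer, amount in con.execute("SELECT customer, amount FROM orders"):
    totals_py[customer] = totals_py.get(customer, 0.0) + amount

# Option B: declarative. The query states the result wanted and the
# database returns one row per customer.
totals_sql = dict(
    con.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)

print(totals_py)
print(totals_sql)
```

Both options give the same totals here. On a large table, Option B usually moves far less data out of the database, which is the main reason to push aggregation down; logic that is awkward to express in SQL may still be clearer in Python.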
How Database Internals Show Up in Practice
In a typical project, an understanding of storage engines and query processing is combined with the rest of the SQL & Databases toolkit. You rarely use any one technique in isolation; the real skill is knowing which combination fits the problem you are trying to solve, and being able to explain that choice to a non-technical stakeholder.
Expect to use this when pulling training data, shipping a dashboard, optimising a slow query or architecting a new analytical table.
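For the training-data case, one common pattern is to stream rows in fixed-size batches instead of loading the whole result at once. The sketch below assumes a hypothetical `samples` table and a helper named `iter_batches`; neither comes from the lesson, and the batch size is arbitrary.

```python
import sqlite3

# Hypothetical table and data, used only for this illustration.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE samples (id INTEGER PRIMARY KEY, feature REAL, label INT)"
)
con.executemany(
    "INSERT INTO samples (feature, label) VALUES (?, ?)",
    [(i * 0.5, i % 2) for i in range(10)],
)


def iter_batches(connection, sql, batch_size):
    """Yield lists of rows so the full result is never held in memory at once."""
    cursor = connection.execute(sql)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:  # an empty list means the cursor is exhausted
            break
        yield batch


sql = "SELECT feature, label FROM samples ORDER BY id"
batch_sizes = [len(batch) for batch in iter_batches(con, sql, 4)]
print(batch_sizes)
```

With ten rows and a batch size of four, the helper yields batches of four, four and two rows. The `ORDER BY id` keeps the batches reproducible between runs.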
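Code Examples: Database Internals, Storage Engines and Query Processing (5 runnable snippets)
Copy any block into a file or notebook and run it end-to-end; each example stands alone. Examples 1 to 4 need only Python's standard library; Example 5 also requires SQLAlchemy (1.4 or later).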
Example 1: Recursive CTE for hierarchical data
# Example 1: Recursive CTE for hierarchical data
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE org (id INT PRIMARY KEY, name TEXT, manager_id INT);
    INSERT INTO org VALUES
        (1,'Ada', NULL),
        (2,'Ben', 1),
        (3,'Cai', 1),
        (4,'Dee', 2),
        (5,'Eli', 2),
        (6,'Fay', 3),
        (7,'Gio', 5);
""")

# The anchor member selects the root (no manager); the recursive member
# joins each employee to the row already produced for their manager.
query = """
WITH RECURSIVE chain(id, name, level, path) AS (
    SELECT id, name, 0, name
    FROM org
    WHERE manager_id IS NULL
    UNION ALL
    SELECT o.id, o.name, c.level + 1, c.path || ' / ' || o.name
    FROM org o JOIN chain c ON o.manager_id = c.id
)
-- hex(zeroblob(n)) yields n '00' pairs; replacing each pair with a space
-- indents the name by two spaces per level.
SELECT level, id,
       printf('%s%s', replace(hex(zeroblob(level*2)),'00',' '), name) AS tree
FROM chain
ORDER BY path;
"""

for row in con.execute(query):
    print(row)
Example 2: Window functions and common table expressions
# Example 2: Window functions and common table expressions
# Window functions require SQLite 3.25 or later.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (day DATE, region TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('2026-01-01','NA', 1200), ('2026-01-02','NA', 1550),
        ('2026-01-01','EU', 900), ('2026-01-02','EU', 1180),
        ('2026-01-03','NA', 1700), ('2026-01-03','EU', 1220);
""")

# The CTE aggregates to one row per region and day; the window functions
# then add a per-region running total and a per-day rank by revenue.
query = """
WITH daily AS (
    SELECT region, day, SUM(amount) AS revenue
    FROM sales GROUP BY region, day
)
SELECT region, day, revenue,
       SUM(revenue) OVER (PARTITION BY region ORDER BY day) AS running_total,
       RANK() OVER (PARTITION BY day ORDER BY revenue DESC) AS day_rank
FROM daily
ORDER BY day, region;
"""

for row in con.execute(query):
    print(row)
Example 3: Parameterised upsert against an indexed table
# Example 3: Parameterised upsert against an indexed table
# ON CONFLICT ... DO UPDATE requires SQLite 3.24 or later.
import sqlite3

con = sqlite3.connect(":memory:")
# The UNIQUE constraint on email creates the index the upsert relies on.
con.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        email TEXT UNIQUE NOT NULL,
        visits INTEGER NOT NULL DEFAULT 0
    );
""")


def record_visit(email: str) -> None:
    # Insert a new user, or increment the counter if the email already exists.
    con.execute("""
        INSERT INTO users (email, visits) VALUES (?, 1)
        ON CONFLICT(email) DO UPDATE
        SET visits = visits + 1;
    """, (email,))


for e in ["a@x.com", "b@x.com", "a@x.com", "a@x.com", "b@x.com"]:
    record_visit(e)

for row in con.execute("SELECT email, visits FROM users ORDER BY visits DESC"):
    print(row)
Example 4: EXPLAIN QUERY PLAN before and after indexing
# Example 4: EXPLAIN QUERY PLAN before and after indexing
import random
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INT, kind TEXT);")

# Seeded generator so the data set is the same on every run.
rng = random.Random(0)
rows = [(i, rng.randint(1, 10_000), rng.choice(["click", "view", "buy"]))
        for i in range(200_000)]
con.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

sql = "SELECT COUNT(*) FROM events WHERE user_id = 4242 AND kind = 'buy'"

# Without an index on the filtered columns the plan is typically a full scan.
print("BEFORE index:")
for r in con.execute("EXPLAIN QUERY PLAN " + sql):
    print(" ", r)

# The composite index matches both equality predicates in the WHERE clause.
con.execute("CREATE INDEX idx_events_user_kind ON events(user_id, kind);")

print("AFTER index:")
for r in con.execute("EXPLAIN QUERY PLAN " + sql):
    print(" ", r)

print("count =", con.execute(sql).fetchone()[0])
Example 5: SQLAlchemy Core: typed schema + bulk insert
# Example 5: SQLAlchemy Core: typed schema + bulk insert
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        Integer, String, Float, select, func)

engine = create_engine("sqlite:///:memory:", future=True)
meta = MetaData()
orders = Table("orders", meta,
    Column("id", Integer, primary_key=True),
    Column("customer", String(64), nullable=False),
    Column("amount", Float, nullable=False),
)
meta.create_all(engine)

# engine.begin() opens a transaction and commits when the block exits.
with engine.begin() as conn:
    # Passing a list of dicts runs the insert once per row in a single call.
    conn.execute(orders.insert(), [
        {"customer": "Ana", "amount": 42.0},
        {"customer": "Bob", "amount": 99.5},
        {"customer": "Ana", "amount": 17.3},
    ])
    stmt = (select(orders.c.customer, func.sum(orders.c.amount).label("total"))
            .group_by(orders.c.customer))
    for row in conn.execute(stmt):
        print(row.customer, round(row.total, 2))