Dualo
SQL Essentials

Joins — INNER, LEFT, RIGHT, FULL

Combine rows across tables using keys. Pick the right join type or you'll silently lose or duplicate data.

2 min read

INNER vs OUTER: INNER retains only matching pairs. OUTER (LEFT/RIGHT/FULL) preserves non-matching rows from one or both sides, padding with NULL. Which outer you use depends which side must be preserved.

Physical join algorithms (picked by the ): Nested Loop Join — good for small outer + indexed inner; O(N·log M). Hash Join — builds a hash table on one side, probes with the other; best for large unsorted inputs; O(N + M). Merge Join — both inputs pre-sorted on the key; streams both; O(N + M) with pre-sort cost. Use EXPLAIN ANALYZE to see which the planner picks and why.

Semi-join and anti-join: `WHERE EXISTS (SELECT 1 FROM orders WHERE orders.customer_id = c.id)` is a SEMI-JOIN — customers with any order, no duplication. `WHERE NOT EXISTS (...)` is an ANTI-JOIN — customers without orders. The optimizer usually transforms `IN` / `NOT IN` subqueries into these.

`IN` vs `EXISTS` vs `LEFT JOIN` for 'customers with orders': all correct semantically. `EXISTS` stops at the first match (no duplication, often fastest for sparse matches). `IN` with a subquery has NULL pitfalls (`NOT IN` fails if the subquery returns any NULL). `LEFT JOIN ... WHERE right.id IS NOT NULL` works but can duplicate if not used carefully.

Join on multiple columns: `ON a.region = b.region AND a.year = b.year`. Composite join keys are fine — the engine may use a if available; otherwise each condition contributes to filtering.

Avoid implicit Cartesian: writing `FROM customers, orders WHERE c.id = o.customer_id` works but is error-prone (forget the WHERE = full cross join, hundreds of million of rows). Use explicit `JOIN ... ON ...` syntax — the compiler catches missing conditions.

Join reordering & fan-out: when A ⋈ B ⋈ C, order matters for performance. Optimizer tries to apply filters first, then join smallest intermediate result with next. Statistics drive this — out-of-date stats (no recent ANALYZE/VACUUM in Postgres, no UPDATE STATISTICS in SQL Server) cause bad plans.

Non-equi joins: joins on `<` / `>` / `BETWEEN` instead of `=`. Used in range matching (sessions overlapping a time range, prices valid at a given date). Hash join can't apply; merge join on range with pre-sort, or nested loop — generally more expensive.

Diagram

Grounded on https://www.postgresql.org/docs/current/queries-table-expressions.html

Next up

GROUP BY & aggregates

Roll up rows into buckets and summarize: COUNT, SUM, AVG, MIN, MAX. The foundation of every report.