SELECT and WHERE basics
Imagine you’ve opened a dataset and feel a little overwhelmed by rows and columns — where do you even start? Right away, two words will become your best friends: SELECT and WHERE. SELECT (the SQL keyword that tells the database which columns to return) and WHERE (the clause that tells the database which rows you care about) let us turn a noisy table into a tidy answer. How do you filter rows in SQL? We’ll walk through that question step by step, like unfolding a map before a trip.
First, let’s meet SELECT as if it were a shopping list. SELECT names the columns you want to see — for example, SELECT name, email asks the database to give you only the name and email columns from a table. A column is a single attribute (like “name” or “price”), and a row is one record (like one customer or one purchase). When you write an SQL query (a request the database understands to retrieve or manipulate data), SELECT is the part that decides which attributes show up in your result.
Next, picture WHERE as a sieve that keeps only the rows that match a condition. WHERE is a clause (a section of the query) that holds a condition — a true/false test — called a predicate. For example: SELECT name, age FROM employees WHERE department = 'Sales' means: give me name and age from the employees table but only for rows where the department is Sales. We use = for equality, < and > for comparisons, LIKE for simple pattern matching (think wildcards like %), and IN to match against a short list. Strings are wrapped in quotes because they are text values.
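To see SELECT and WHERE at work, here is a minimal sketch using Python's built-in sqlite3 module; the employees table and its rows are invented sample data, and the query mirrors the one above (an ORDER BY is added only to make the output order predictable).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, age INTEGER, department TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ana", 34, "Sales"), ("Ben", 28, "Engineering"), ("Cara", 41, "Sales")],
)

# WHERE keeps only rows whose predicate is true; SELECT picks the columns.
rows = conn.execute(
    "SELECT name, age FROM employees WHERE department = 'Sales' ORDER BY name"
).fetchall()
print(rows)  # [('Ana', 34), ('Cara', 41)]
```

Only the Sales rows survive the filter, and only the name and age columns appear in the result.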
Building on this foundation, let’s see how SELECT and WHERE work together in a little choreography. Conceptually, we decide which rows matter first (WHERE) and then choose the columns to show (SELECT), even though SQL engines have an internal order of operations. That matters when you start adding aggregation (summing or averaging) — WHERE filters rows before aggregation, so if you want to filter aggregated groups you’d use HAVING later. For now, remember: WHERE narrows the dataset; SELECT shapes the view you see.
Along the way, you’ll run into a few common traps — but nothing you can’t handle. One is NULL, which represents missing or unknown values; NULL is not the same as an empty string, and you check it with IS NULL or IS NOT NULL rather than =. Another is logical precedence: AND binds more tightly than OR, so use parentheses to make your intent explicit: WHERE (salary > 70000 AND title = 'Senior') OR remote = true. Also be careful with case sensitivity — some databases treat text comparisons differently — and remember that forgetting quotes around text or misnaming a column will return an error instead of results.
To make this concrete, try a tiny experiment: open any sample table and run SELECT * FROM table_name WHERE 1 = 0; — this asks for no rows and lets you inspect the column names without being swamped. Then replace 1 = 0 with a real condition like age >= 30 or country IN ('US', 'CA') and watch the results reshape. That practice will help the rules stick because you’ll see how SELECT and WHERE change what the data looks like.
Now that you can confidently choose columns with SELECT and narrow rows with WHERE, we’re ready to layer in sorting, grouping, and aggregation. In the next section we’ll take the filtered view you’ve learned to build and learn how to order it and summarize it — the kind of techniques that turn raw rows into insight.
GROUP BY aggregations
Imagine you’ve just filtered a noisy table with SELECT and WHERE and you can feel a pattern trying to emerge — now you want numbers that summarize groups, not dozens of individual rows. Building on that foundation, we meet a clause that turns rows into summaries by grouping them around a shared value; this is where GROUP BY and aggregation become your friends. Aggregation means combining multiple row values into a single number — think totals, averages, or counts — so you can answer questions like “How much did each product sell?” or “Which region has the most customers?” How do you turn rows into meaningful summaries? We’ll walk that through together.
First, let’s name the players. A group is a set of rows that share the same value for one or more columns; the column or columns you use to form those sets are called the group key. An aggregate function is a built-in operation that reduces a group to a single value: COUNT (number of rows), SUM (total of a numeric column), AVG (average), MIN and MAX (smallest and largest). Each aggregate function takes a column and returns a single result for every group — like turning a pile of receipts into one total per store.
Next, understand the difference between filtering rows and filtering groups. WHERE limits which rows enter the grouping step; it’s applied before aggregation. If you need to filter based on an aggregated result (for example, only products whose total sales exceed $10,000), you use HAVING, which runs after the aggregation and tests the group-level numbers. This two-stage filtering — first rows with WHERE, then groups with HAVING — keeps your logic clear and prevents incorrect counts or sums.
Let’s make this concrete with a simple example you can try on any sample dataset. Imagine a sales table with columns (product_id, sale_date, revenue). To get total revenue per product we pick the group key and an aggregate: SELECT product_id, SUM(revenue) AS total_revenue FROM sales GROUP BY product_id;. This query returns one row per product_id with the summed revenue. Reading the result is like looking at a scoreboard: each product_id is a team, and total_revenue is their score.
Often you’ll want multiple aggregates and a nicer column name. You can combine COUNT and AVG and give each aggregated column an alias (a temporary name) for readability: SELECT product_id, COUNT(*) AS orders, AVG(revenue) AS avg_order_value FROM sales GROUP BY product_id ORDER BY orders DESC;. Aliasing with AS makes results human-friendly, and ORDER BY works at the end to sort your groups — for example, to find your top-selling products at a glance.
Watch out for a few common traps as you practice. If a column appears in SELECT but is not wrapped in an aggregate function, it must appear in GROUP BY; otherwise the database will complain because it can’t decide which row’s value to show. NULL represents missing values and can affect aggregates (COUNT(column) ignores NULLs, but COUNT(*) counts rows regardless). When you need time-based summaries, group by a transformed expression such as DATE(sale_date) or a truncated month so that timestamps fall into the buckets you expect.
Now that you can convert rows into group summaries and control which groups appear, you’ve unlocked a powerful way to turn raw data into insight. With this tool in your toolkit, the next logical step is learning how to order, limit, and compare those summaries — and how window functions let you compute running totals and ranks without collapsing rows. We’ll take that next step together.
Joins: INNER, LEFT, RIGHT
Building on the SELECT and WHERE groundwork we just covered, imagine you have answers scattered across two different tables and you want to bring them together — this is where JOINs become your toolbox. A join is an operation that combines rows from two tables based on a related column (usually a key), and the most common flavors you’ll meet are INNER JOIN, LEFT JOIN, and RIGHT JOIN. Right away, remember: a table is a grid of rows (records) and columns (attributes), a primary key is a unique identifier for a row, and a foreign key is a column that points to that identifier in another table. Understanding these basics makes joins feel less like magic and more like careful matchmaking between tables.
First, let’s meet the INNER JOIN — think of it as a handshake that only happens when both sides agree. An INNER JOIN returns only the rows where the join condition matches on both tables; if either side doesn’t have a partner, that row is left out. For example, SELECT c.id, c.name, o.total FROM customers c INNER JOIN orders o ON c.id = o.customer_id; gives you only customers who have at least one order. This is great when you care about paired data only — like “customers with purchases” — and it prevents empty-match noise from appearing in your results.
Next, the LEFT JOIN is like telling the database: keep everyone from the left table, and attach matching info from the right when available. A LEFT JOIN returns all rows from the left table and fills in NULLs for any columns from the right table that don’t have a match. For instance, SELECT c.id, c.name, o.total FROM customers c LEFT JOIN orders o ON c.id = o.customer_id; will also show customers who haven’t ordered yet, with NULL in the total column. Use LEFT JOIN when you want a complete list on one side — for example, auditing which customers haven’t purchased — and celebrate the NULLs as signposts to missing relationships.
A RIGHT JOIN is the mirror image of LEFT JOIN: it keeps every row from the right table and matches from the left when possible. In practice, RIGHT JOIN behaves the same as flipping the tables and using LEFT JOIN, so many people prefer to avoid RIGHT JOIN for clarity. You might see SELECT o.id, o.total, c.name FROM customers c RIGHT JOIN orders o ON c.id = o.customer_id; but swapping the FROM order and using LEFT JOIN gives the same result and is often easier to read. When you encounter RIGHT JOIN in existing code, treat it as a cue to check which table the author intended as the “complete” list.
Joins also bring a few practical complications we should name now. NULLs from unmatched rows need to be handled explicitly (IS NULL or IS NOT NULL) when filtering; a one-to-many relationship (one customer, many orders) will multiply rows after a join; and joining on the wrong columns (or forgetting to qualify columns with table aliases like c.id) can silently return incorrect results. A quick sanity check is to run SELECT COUNT(*) before and after a join to see how row counts change — if you didn’t expect multiplication but see it, that’s a hint you’ve joined on a non-unique key.
So how do you decide between INNER and LEFT joins in practice? Use INNER JOIN when a match on both sides is required for your question, and use LEFT JOIN when you want all rows from one table regardless of matches on the other. RIGHT JOIN is functionally useful but often avoidable by flipping table order. As we move on to grouping and aggregations, remember this: the choice of join changes which rows enter your GROUP BY step, so pick the join that preserves the records you need before you start summing or averaging. With that in mind, we’ll next look at how joining and aggregating together can reveal totals, gaps, and surprising patterns.
Window functions overview
Imagine you’ve just produced a tidy grouped table and joined the right pieces together, but you still want calculations that live on each original row — that’s where window functions step in. Window functions (also called SQL window functions) let you compute values across a set of rows that are related to the current row without collapsing those rows into one summary, so you keep the full detail while adding summary insights. Think of them as putting a movable magnifying glass over the table: for each row the window looks around, performs a calculation, and writes the result back next to that row. This ability to compute running totals, ranks, and neighbor comparisons while preserving every row is what makes window functions feel like a superpower for analysts learning to tell richer stories with data.
First, let’s meet the three parts that act like the stage directions for any window calculation: the function, the OVER clause, and the window specification inside OVER. The function is the actor — things like ROW_NUMBER() (gives a sequential number per partition), RANK() (gives rank with gaps for ties), SUM() or AVG() used as window aggregates, and LEAD()/LAG() (which look at the next or previous row). The OVER clause tells the database which rows to include for each current row; inside it you can PARTITION BY to split the table into independent groups (like grouping by customer), ORDER BY to define sequence for running calculations, and optionally declare a frame (for example ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) to say exactly which neighbors count. Unlike GROUP BY which reduces rows into groups and returns one result per group, these elements let us compute group-aware metrics while leaving every original row intact.
What does that look like in practice, and when would you use it? Suppose you want a running total per customer: you’d write SUM(amount) OVER (PARTITION BY customer_id ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and every sale row will carry its cumulative spend to date — this is a running total. If you want to rank products by revenue within each category, you might use ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) to give each product a position, or use RANK() if you want ties to share a rank and leave gaps. To compare a row with its predecessor (for example to compute day-over-day change) LAG(value, 1) pulls the previous row’s value into the current row so you can subtract and see deltas; LEAD does the same for the next row. These examples show how window functions let you ask row-level comparative questions that used to require awkward self-joins or subqueries.
There are practical details that make window functions behave predictably and efficiently, so let’s call them out. The ORDER BY inside the OVER clause matters for any calculation that depends on sequence: without a stable order, running totals and ROW_NUMBER() are ambiguous, so always specify it when order matters. The notion of a frame controls which rows around the current row are visible to the function; ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW produces a cumulative sum, while ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING creates a three-row moving average. Performance-wise, partitioning aligns with how the database can parallelize work, so choose sensible partition keys (too many tiny partitions or one massive partition can both hurt). Also be aware of tie behavior: RANK() leaves gaps for equal values while DENSE_RANK() does not, and NULL values can affect ordering unless you handle them explicitly.
Building on what we learned about SELECT, GROUP BY, and JOINs, window functions empower us to ask new questions without losing the row-level detail we care about. They let us compute running totals, assign ranks, and peek at neighboring rows directly in the SELECT list, which keeps queries readable and often faster than complex subqueries. Try converting one grouped metric you already know into a window version — for example, instead of grouping to get total sales per day, add a running daily total next to each sale — and you’ll feel how naturally these tools fit into an analyst’s workflow. With that practice under our belt, we’re ready to see concrete, commonly used window queries and best practices that make these patterns repeatable and reliable.
Subqueries and CTEs
Imagine you’ve pulled a messy result from a join or a GROUP BY and you need one more layer of logic to answer the question — this is the moment nested queries and named temporary results become your friends. A subquery (a query written inside another query) and a CTE — short for common table expression, a temporary named result set declared with WITH — let you structure that extra logic so it reads like a short story instead of a single giant sentence. We’ll walk through what each does, why they feel different, and when one will make your life easier.
A subquery appears where you need a single value or a filtered list and hides inside the main SQL like a little helper. Think of a scalar subquery that returns one value (for example SELECT (SELECT MAX(sales) FROM monthly) AS top_sale) or an IN-style subquery that produces a list (WHERE product_id IN (SELECT id FROM new_products)). A correlated subquery is a special kind that refers to the outer query’s current row and is evaluated per-row; that makes it powerful but potentially slow because the database may re-run it many times. When you see EXISTS, IN, or a nested SELECT in the WHERE or SELECT list, you’re looking at a subquery doing focused, inline work.
A CTE moves that helper query out of the middle so you can name it, read it, and reuse it. You declare it at the top with WITH recent_sales AS (SELECT * FROM sales WHERE sale_date >= '2026-01-01') and then treat recent_sales like a temporary table in the following SELECT. This makes complex queries much easier to read because you separate logical steps: first we define a filtered set, then we join, aggregate, or rank it. CTEs also support recursion in systems that allow it, which means you can write queries that walk hierarchical data (like organizational charts) in a clear, linear way.
So how do you decide when to use a subquery versus a CTE? If the logic is tiny and used only once — for example, a single scalar value or a small lookup — a subquery keeps the code compact and close to where it’s needed. If you’re breaking a large problem into logical stages, want to reuse the intermediate result multiple times, or simply want to make your intent readable to a colleague, a CTE usually wins. Performance can vary by database: some engines inline simple CTEs so they behave like subqueries, while others materialize them; the practical rule is to pick clarity first and then test performance.
Let’s take a short journey converting a tricky nested example into a friendly CTE. You might start with a dense WHERE clause that contains IN (SELECT id FROM customers WHERE signup_date > ...) and find it hard to follow; instead, extract the inner SELECT as WITH new_customers AS (SELECT id FROM customers WHERE signup_date > '2026-01-01') SELECT o.* FROM orders o JOIN new_customers nc ON o.customer_id = nc.id. This transformation reads like a recipe: first filter customers, then use that filtered set to get orders. Breaking complex logic into sequential CTEs (you can declare multiple with commas) helps you test each step independently and makes the finished query easier to explain in conversations or code reviews.
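The refactored query runs as written; here it is against invented customers and orders tables, where only one customer signed up after the cutoff date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, signup_date TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, '2026-02-01'), (2, '2025-06-15');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 2, 75.0);
""")

# Step 1: name the filtered set; step 2: join against it like a table.
rows = conn.execute(
    """
    WITH new_customers AS (
        SELECT id FROM customers WHERE signup_date > '2026-01-01'
    )
    SELECT o.id, o.total
    FROM orders o
    JOIN new_customers nc ON o.customer_id = nc.id
    """
).fetchall()
print(rows)  # [(10, 50.0)]
```

Customer 2 signed up in 2025, so their order 11 is excluded by the named intermediate set.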
There are a few practical guardrails to keep things reliable. Correlated subqueries can be expressive but may become a performance trap on large tables, so compare them against JOINs or window functions when you need per-row context. Use CTEs to improve readability and modularity, but remember to check the execution plan — some databases will materialize CTEs and some will not, which affects memory and speed. When in doubt, profile with EXPLAIN or run both versions on a representative dataset; the clearer version is usually the one you’ll maintain weeks from now.
Building on what we covered about joins, GROUP BY, and window functions, these tools let us structure step-by-step logic without losing sight of the row-level detail. Practice by taking one of your dense queries and first isolating inner SELECTs into named CTEs, then see whether rewriting correlated subqueries into joins or window functions keeps correctness while improving speed. That small habit of naming intermediate results is the kind of craft—part readability, part performance—that turns messy queries into dependable analysis you’ll be proud to share.
Query tuning and best practices
Imagine you’ve just written a query that answers the question you care about, but it stumbles and takes minutes instead of seconds — this is the moment query tuning and best practices become your saving grace. Query tuning is simply the process of making a query run faster and use fewer resources; best practices are the repeatable habits that help you do that reliably. We’ll treat this like a lab: measure what’s slow, change one thing at a time, and confirm the improvement. That mindset alone will save you hours and build confidence as you explore SQL performance.
The first step is measurement: don’t change before you know where the pain is. How do you find the slow part of a query? Ask the database for its execution plan (often via EXPLAIN or EXPLAIN ANALYZE), which is the planner’s roadmap showing how it will read tables, apply filters, and combine rows. An execution plan tells you estimated versus actual rows, which operations are most expensive, and whether indexes are being used; reading it is like inspecting a car’s dashboard to find which subsystem is overheating. Run the plan on representative data and capture timings; this is the baseline we’ll use to validate every tuning decision.
One of the most powerful levers in query tuning is the index — a small structure that helps the database find rows quickly, like the index in a book rather than scanning every page. Add an index on columns you filter (WHERE) or join on, and consider a composite index when multiple columns are frequently queried together; a composite index stores the combined column values in one sorted structure so the planner can satisfy multi-column lookups faster. Be mindful: too many indexes slow writes and waste space, and an index only helps when a predicate is “sargable” — that is, written so it can be matched against the index’s sort order (for example, avoid wrapping an indexed column in a function, which typically forces a full scan). A covering index (one that contains all columns a query needs) can sometimes make the database skip the table read entirely, which is a big win.
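You can watch the planner pick up an index with a plan inspection. This sketch uses SQLite's EXPLAIN QUERY PLAN syntax (other databases use EXPLAIN or EXPLAIN ANALYZE with different output formats); the users table and index name are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, country TEXT)")
conn.execute("CREATE INDEX idx_users_country ON users (country)")

# The last column of each plan row is a human-readable description of the step.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT email FROM users WHERE country = 'CA'"
).fetchall()
detail = plan[0][-1]
print(detail)  # mentions idx_users_country, e.g. 'SEARCH ... USING INDEX idx_users_country ...'
```

If the plan said SCAN instead of SEARCH ... USING INDEX, you would know the predicate is not using the index and it is time to check sargability.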
How you write a query matters almost as much as what you index. Push filters as early as possible so the database trims rows before heavy work like joins or aggregation; this means applying WHERE conditions and moving restrictive predicates into subqueries or CTEs if it clarifies intent. Replace SELECT * with explicit columns to avoid reading and transferring unneeded data, and be cautious with correlated subqueries that run per row — try rewriting them as JOINs, window functions, or a CTE used once. When choosing EXISTS versus IN, prefer EXISTS for correlated checks on large tables and watch how NULLs and duplicates affect semantics; these choices influence both correctness and performance.
Don’t forget the ecosystem: the query planner relies on accurate statistics about your data distribution, so regular stats updates (ANALYZE in many systems) and maintenance tasks matter. For very large tables, consider partitioning — splitting a table into smaller segments — so queries touch only relevant partitions instead of the whole dataset. Monitor slow-query logs or a performance dashboard to catch regressions over time; tuning is iterative and context-dependent, and what helped yesterday may not be ideal after data volumes change.
At the end of the day, practical query tuning and best practices boil down to a simple loop: measure with EXPLAIN, make one targeted change (add or adjust an index, rewrite a predicate, reduce columns), test again, and observe the impact. We’ve built this guidance on the SELECT, WHERE, JOIN, and aggregation foundations you’ve already learned, so the next step is to practice on a real slow query: run an execution plan, try one small change, and celebrate the improvement when it arrives. That steady, experimental approach is how we turn fragile queries into dependable analytics you can trust.