Introduction to aggregate functions: COUNT, SUM, AVG, MIN, MAX
Aggregate functions compress many row values into a single summary value — for example, counting rows, adding up numeric values, or finding the highest or lowest entry. They operate on sets of rows (or on each group when used with GROUP BY) rather than returning a value per row, and they form the backbone of summarizing, reporting, and analytics queries in SQL. (dev.mysql.com)
At a conceptual level you can think of the five core aggregates as follows:
- COUNT: returns how many items are in a set (with several variants: COUNT(*), COUNT(column), COUNT(DISTINCT column)).
- SUM: returns the total (summation) of a numeric column across the set.
- AVG: returns the arithmetic mean (sum divided by number of non-NULL values).
- MIN: returns the smallest value (works for numbers, strings, dates, etc.).
- MAX: returns the largest value (also works for numbers, strings, dates, etc.).
How NULLs and empty sets behave (common gotchas). Aggregate functions typically ignore NULL values in the input — that is, NULLs do not contribute to SUM, AVG, MIN, or MAX calculations, and COUNT(expression) counts only non-NULL expressions. COUNT(*) is the important exception: it counts rows (including rows that contain NULL values in other columns). When a query matches no rows, most aggregates (SUM, AVG, MIN, MAX) return NULL rather than zero; COUNT(*) returns zero for no matching rows. Use COALESCE(…) when you want to replace a NULL aggregate result with a default (for example, COALESCE(SUM(amount), 0)). (dev.mysql.com)
Practical examples — one small example table and step-by-step queries
Assume a compact Employee table for examples:
-- sample table (for illustration)
CREATE TABLE employees (
id INT,
name VARCHAR(50),
department VARCHAR(50),
salary NUMERIC(10,2),
hired DATE
);
INSERT INTO employees (id, name, department, salary, hired) VALUES
(1, 'Alice', 'Engineering', 85000.00, '2020-03-01'),
(2, 'Bob', 'Engineering', 72000.00, '2019-08-15'),
(3, 'Carol', 'Marketing', 65000.00, '2021-01-10'),
(4, 'Dave', 'Marketing', NULL, '2021-11-05'),
(5, 'Eve', 'Sales', 59000.00, '2018-06-25');
1) COUNT variations
-- count all rows
SELECT COUNT(*) AS total_rows FROM employees;
-- count non-null salary values
SELECT COUNT(salary) AS salary_count FROM employees;
-- count distinct departments
SELECT COUNT(DISTINCT department) AS distinct_depts FROM employees;
- COUNT(*) returns the number of rows (5 in this sample). COUNT(salary) returns 4 because one salary is NULL. COUNT(DISTINCT department) returns 3. This illustrates the difference between counting rows vs counting non-NULL expressions and removing duplicates. (dev.mysql.com)
2) SUM and AVG
-- total payroll (NULL salaries ignored)
SELECT SUM(salary) AS total_payroll FROM employees;
-- average salary (NULL ignored; AVG = SUM / count_of_non_nulls)
SELECT AVG(salary) AS avg_salary FROM employees;
-- force zero when SUM would be NULL
SELECT COALESCE(SUM(salary), 0) AS total_payroll_nonnull FROM employees WHERE department = 'Legal';
- SUM and AVG exclude NULLs from their calculations. If no rows match a WHERE clause, SUM and AVG will return NULL (so use COALESCE to substitute a numeric default). AVG divides by the number of non-NULL values, not by the total rowcount. (postgresql.org)
3) MIN and MAX (numbers, strings, dates)
-- highest and lowest salaries
SELECT MIN(salary) AS min_salary, MAX(salary) AS max_salary FROM employees;
-- earliest and latest hire dates
SELECT MIN(hired) AS earliest_hire, MAX(hired) AS latest_hire FROM employees;
-- lexicographic min/max on names
SELECT MIN(name) AS first_name_alphabetically, MAX(name) AS last_name_alphabetically FROM employees;
- MIN and MAX operate on whatever data type you supply: numeric, date/time, or text (text is compared lexicographically). NULL values are ignored. For empty result sets these will return NULL, so handle that if you need a fallback. (postgresql.org)
4) Combining aggregates and grouping
-- totals and averages per department
SELECT
department,
COUNT(*) AS headcount,
SUM(salary) AS total_salary,
AVG(salary) AS avg_salary,
MIN(salary) AS lowest_salary,
MAX(salary) AS highest_salary
FROM employees
GROUP BY department
ORDER BY total_salary DESC;
- Running multiple aggregates alongside GROUP BY returns one summary row per group (department in this example). Note that GROUP BY groups rows first and then the aggregates compute across each group. This is a primary way to turn granular data into summaries for dashboards and reports. (dev.mysql.com)
Advanced notes and common pitfalls
- DISTINCT inside aggregates: You can use COUNT(DISTINCT col) or SUM(DISTINCT col) to remove duplicates before aggregation; this is useful when you want the number of unique items rather than raw totals. Use DISTINCT sparingly because it can add sorting or hashing overhead. (mysqltutorial.org)
- Data type and precision: Be aware of result types. Some systems promote integer sums/averages to a wider numeric or floating type; others return a type consistent with input. That affects rounding and precision (especially for AVG). Check your RDBMS docs if precision matters. (postgresql.org)
- Performance considerations: Aggregating over an entire table can require scanning many rows. Some databases do not optimize MIN/MAX/COUNT over the whole table using indexes in every case; PostgreSQL, for example, will often perform a sequential scan for aggregate queries unless a different query plan (like ORDER BY … LIMIT 1) is used to leverage an index for MIN/MAX (see the sketch after this list). If performance is critical, test plans and consider indexes or pre-aggregated summary tables. (postgresql.org)
- Replacing NULL results: If you expect zero instead of NULL when no rows match, wrap aggregates in COALESCE (e.g., COALESCE(SUM(qty), 0)). This is very common in reporting queries where a numeric zero is semantically easier to consume than NULL. (postgresql.org)
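For example, if an index on salary exists (an assumption for this sketch — it is not part of the sample DDL above), the ORDER BY … LIMIT 1 rewrite mentioned in the performance note looks like this:
-- index-friendly alternative to SELECT MAX(salary) FROM employees;
-- lets the planner walk a salary index from the top and stop at the first row
SELECT salary
FROM employees
WHERE salary IS NOT NULL -- mirror MAX's NULL-ignoring behavior
ORDER BY salary DESC
LIMIT 1;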
Putting it together: a checklist for using aggregates correctly
- Decide whether you need row counts (COUNT(*)) or counts of non-NULL expressions (COUNT(col)).
- Remember aggregates ignore NULLs; use COALESCE when you need defaults.
- Use DISTINCT inside aggregates only when you actually want uniqueness.
- Test data types and rounding for SUM/AVG; cast if necessary.
- Use GROUP BY to produce per-group summaries and ORDER/LIMIT or indexes to optimize MIN/MAX if needed.
These principles and examples give you the practical foundation to start summarizing data in SQL with COUNT, SUM, AVG, MIN, and MAX. For RDBMS-specific behavior and exact return types, consult the vendor documentation (MySQL, PostgreSQL, SQL Server) because small implementation differences (types, optimization behavior) do exist. (dev.mysql.com)
GROUP BY syntax and simple examples
Grouping turns row-level data into summary rows by partitioning result rows on one or more columns or expressions, then computing aggregates for each partition. In a SELECT query the GROUP BY clause comes after WHERE (which filters input rows) and before HAVING, ORDER BY and LIMIT; the engine first applies FROM/WHERE, then groups the remaining rows, computes aggregates per group, filters groups with HAVING, and finally orders/limits the result set. (postgrespro.com)
A few core rules to keep in mind before looking at examples: non-aggregated expressions in the SELECT list must either appear in GROUP BY or be functionally dependent on the grouped columns (meaning the grouped columns uniquely determine them); otherwise the query is non-standard and many engines will reject it (MySQL enforces this with ONLY_FULL_GROUP_BY by default). Aggregate functions (SUM, AVG, MIN, MAX, COUNT, etc.) operate on the rows in each group and — unless the documentation says otherwise — they ignore NULL input values; SUM/AVG return NULL when the group has no non-NULL inputs, so use COALESCE if you want 0 instead of NULL. (dev.mysql.com)
Example 1 — basic per-group counts and aggregates (single grouping column)
-- count and aggregates per department
SELECT
department,
COUNT(*) AS headcount, -- counts rows in each department
SUM(salary) AS total_salary, -- NULL salaries are ignored in SUM
AVG(salary) AS avg_salary
FROM employees
GROUP BY department
ORDER BY total_salary DESC;
Step-by-step: WHERE (if present) filters rows first; GROUP BY department collapses rows into one group per department; the aggregates compute over those rows. In the sample employees table SUM(salary) ignores the NULL salary row; if every salary in a group were NULL, SUM would return NULL — use COALESCE(SUM(salary), 0) to force 0. (dev.mysql.com)
Example 2 — grouping by multiple columns or expressions
-- hires per department per year (grouping by an expression)
SELECT
department,
EXTRACT(YEAR FROM hired) AS hire_year,
COUNT(*) AS hires
FROM employees
GROUP BY department, EXTRACT(YEAR FROM hired)
ORDER BY department, hire_year;
You can group by any deterministic expression (functions of columns, date_trunc/EXTRACT, concatenation, CASE expressions, etc.). If the same expression appears in both SELECT and GROUP BY, either repeat it verbatim in GROUP BY or compute it in a subquery/CTE and group by its alias in the outer query for readability (as shown below). The order of columns in GROUP BY doesn’t change grouping semantics (it only affects how you think about multi-column grouping). (postgrespro.com)
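A readable variant of Example 2 that computes the grouping expression once: the CTE below derives hire_year and the outer query groups by the alias (a sketch of the pattern just described):
-- compute the expression once, then group by its alias
WITH hires AS (
SELECT department,
EXTRACT(YEAR FROM hired) AS hire_year
FROM employees
)
SELECT department, hire_year, COUNT(*) AS hires
FROM hires
GROUP BY department, hire_year
ORDER BY department, hire_year;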
Example 3 — filtering groups with HAVING (vs filtering rows with WHERE)
-- departments with at least 2 employees and average salary above 60k
SELECT
department,
COUNT(*) AS headcount,
AVG(salary) AS avg_salary
FROM employees
WHERE hired >= '2018-01-01' -- filters input rows first
GROUP BY department
HAVING COUNT(*) >= 2 AND AVG(salary) > 60000; -- filters groups afterward
Remember: WHERE cannot reference aggregate results; HAVING is evaluated after aggregation and is where you apply conditions that depend on aggregates. If your HAVING clause only refers to raw columns (not aggregates), it may be clearer and more efficient to move that condition into WHERE so unnecessary groups are never built. (postgresql.org)
Example 4 — mixing grouped and non-grouped columns (common pitfall)
-- risky: selecting non-grouped columns without aggregation
SELECT department, name, COUNT(*) FROM employees GROUP BY department;
Standard SQL requires that every non-aggregated column in SELECT either appears in GROUP BY or is functionally dependent on the grouped columns. MySQL historically allowed nondeterministic choices for such columns, but modern MySQL enforces ONLY_FULL_GROUP_BY by default and will reject queries like the example unless you add the non-aggregated column to GROUP BY or use an aggregation (or ANY_VALUE() for a deliberate arbitrary choice). To write portable, deterministic SQL, include only grouped columns and aggregates (or compute the extra columns with an aggregate like MIN(name) or via a window function / subquery). (dev.mysql.com)
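If you deliberately accept an arbitrary representative value per group on MySQL, ANY_VALUE() makes that intent explicit (a minimal sketch; MySQL-specific):
-- MySQL: pick an arbitrary name per department without violating ONLY_FULL_GROUP_BY
SELECT department,
ANY_VALUE(name) AS example_name, -- arbitrary name from the group
COUNT(*) AS headcount
FROM employees
GROUP BY department;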
Example 5 — practical patterns and small refinements
-- 1) show departments and replace NULL sums with 0
SELECT department, COALESCE(SUM(salary), 0) AS total_salary
FROM employees
GROUP BY department;
-- 2) count distinct values per group
SELECT department, COUNT(DISTINCT name) AS distinct_employees
FROM employees
GROUP BY department;
-- 3) rollup for subtotals (MySQL/Postgres/SQL Server have different syntaxes)
-- MySQL example (WITH ROLLUP modifier produces subtotal and grand-total rows):
SELECT YEAR(hired) AS y, department, SUM(salary) AS s
FROM employees
GROUP BY YEAR(hired), department WITH ROLLUP;
Notes on the patterns above: wrap aggregates with COALESCE when you need numeric zeros instead of NULLs; use COUNT(DISTINCT …) when you want unique counts inside groups; WITH ROLLUP (or GROUPING SETS/CUBE) gives subtotals but produces rows with NULLs in grouping columns to indicate subtotal/grand-total levels — treat those NULLs carefully when consuming results. Different DBMS have slightly different extensions and syntax for rollup/grouping sets, so check your engine’s docs when you need subtotal behavior. (dev.mysql.com)
Quick debugging checklist when GROUP BY returns unexpected results
- Did you include every non-aggregated SELECT column in GROUP BY (or otherwise make it functionally dependent)? If not, make the query explicit or add aggregation. (dev.mysql.com)
- Are NULLs skewing your aggregates? Remember aggregates ignore NULL inputs; use COALESCE if you want default values. (dev.mysql.com)
- Should a condition be WHERE (filter rows) or HAVING (filter groups)? Move simple row-level filters to WHERE to avoid unnecessary aggregation. (postgresql.org)
- If you see nondeterministic values in non-standard queries, avoid relying on engine-specific permissive behavior — make the grouping explicit or use ANY_VALUE() in MySQL only when you intentionally accept an arbitrary choice. (dev.mysql.com)
These examples show common, practical ways to apply GROUP BY: group by one or more columns or expressions, compute aggregates per group, and optionally filter groups with HAVING or add subtotals with grouping extensions. Start with simple groups, verify NULL handling and deterministic SELECT lists, and then add HAVING or grouping-sets features as needed for reporting or analytics.
Using COUNT, SUM and AVG in real queries (examples and common patterns)
Start with a few concrete, copy-paste examples that you can adapt immediately. Use the sample employees table from earlier (id, name, department, salary, hired) or any similar schema — these patterns are portable across MySQL, PostgreSQL and SQL Server.
Count patterns you’ll use daily:
-- total rows (always counts rows)
SELECT COUNT(*) AS total_rows FROM employees;
-- count non-NULL values in a column
SELECT COUNT(salary) AS salaries_reported FROM employees;
-- count distinct values
SELECT COUNT(DISTINCT department) AS distinct_departments FROM employees;
Remember the behavioral rule: aggregate functions ignore NULL inputs except COUNT(*) which counts rows. COUNT(column) and COUNT(DISTINCT …) skip NULLs; SUM/AVG/MIN/MAX ignore NULLs and return NULL if there are no non-NULL inputs. Use COALESCE(…) when you need a numeric default instead of NULL. (dev.mysql.com)
Conditional counts and partial aggregates are extremely useful when you want several related metrics in one scan. Two common patterns are the CASE/SUM trick (works everywhere) and the FILTER clause (PostgreSQL-specific and more readable):
-- portable: conditional counts using SUM + CASE
SELECT
department,
SUM(CASE WHEN salary IS NOT NULL THEN 1 ELSE 0 END) AS headcount_with_salary,
SUM(CASE WHEN salary IS NULL THEN 1 ELSE 0 END) AS headcount_missing_salary,
SUM(CASE WHEN EXTRACT(YEAR FROM hired) = 2021 THEN 1 ELSE 0 END) AS hired_2021
FROM employees
GROUP BY department;
-- PostgreSQL: same idea with FILTER (clearer when many filters)
SELECT
department,
COUNT(*) FILTER (WHERE salary IS NOT NULL) AS headcount_with_salary,
COUNT(*) FILTER (WHERE salary IS NULL) AS headcount_missing_salary,
COUNT(*) FILTER (WHERE EXTRACT(YEAR FROM hired) = 2021) AS hired_2021
FROM employees
GROUP BY department;
PostgreSQL’s FILTER clause attaches a per-aggregate WHERE and keeps the query tidy when you need many filtered aggregates. For engines that don’t support FILTER, SUM(CASE WHEN …) is the usual portable alternative. (postgresql.org)
Summing and averaging with nulls, defaults and precision:
-- total payroll per department, replace NULL with 0 when desired
SELECT department,
COALESCE(SUM(salary), 0) AS total_payroll,
AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
-- if you want an average that treats NULL as 0 (rare, but sometimes required):
SELECT AVG(COALESCE(salary, 0)) AS avg_treat_null_as_zero FROM employees;
-- avoid integer division / preserve precision (example in SQL Server / MySQL):
SELECT CAST(SUM(salary) AS DECIMAL(18,2)) / NULLIF(COUNT(salary), 0) AS avg_salary_safe
FROM employees;
Notes and patterns:
- COALESCE(SUM(…), 0) is the common way to convert a missing SUM (NULL when no non-NULL inputs) to zero for reports or downstream math.
- AVG ignores NULLs and divides by the count of non-NULL values; to compute your own average you can do SUM(…) / NULLIF(COUNT(…), 0) to avoid division-by-zero and to control casts/precision. The database engine often calculates AVG internally from SUM and COUNT, so explicit NULL/zero handling is sometimes necessary. (techcommunity.microsoft.com)
Counting very large tables and engine-specific notes:
- SQL Server: if you need a row count type that cannot overflow INT, use COUNT_BIG(*) — it returns BIGINT instead of INT and is required in some indexed view scenarios. Use COUNT_BIG when tables are extremely large or when the view/indexing rules require it. (blog.sqlauthority.com)
- MySQL / InnoDB: a plain SELECT COUNT(*) over an entire InnoDB table can require scanning the clustered index because InnoDB does not maintain an always-updated, global row count (unlike MyISAM, which stores a fast row count). When you need frequent, ultra-fast total-row answers, consider maintaining a lightweight counter table (see the sketch after this list) or relying on approximate stats; otherwise COUNT(*) will be accurate but potentially expensive on very large InnoDB tables. For filtered counts (WHERE clauses) a proper index on the filtered columns is the usual optimization. (cnblogs.com)
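A minimal sketch of the counter-table idea for MySQL — the table name, trigger names and seeding step here are illustrative, not a standard API:
-- hypothetical counter table kept in sync by triggers
CREATE TABLE row_counts (
table_name VARCHAR(64) PRIMARY KEY,
n BIGINT NOT NULL
);
INSERT INTO row_counts (table_name, n)
SELECT 'employees', COUNT(*) FROM employees; -- seed once
CREATE TRIGGER employees_count_ins AFTER INSERT ON employees
FOR EACH ROW UPDATE row_counts SET n = n + 1 WHERE table_name = 'employees';
CREATE TRIGGER employees_count_del AFTER DELETE ON employees
FOR EACH ROW UPDATE row_counts SET n = n - 1 WHERE table_name = 'employees';
-- ultra-fast total-row read
SELECT n FROM row_counts WHERE table_name = 'employees';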
Common grouping-and-filtering patterns that combine COUNT, SUM, AVG:
-- headcount and payroll, but only for active hires, and only show groups with at least 2 people
SELECT department,
COUNT(*) AS headcount,
COALESCE(SUM(salary),0) AS total_salary,
AVG(salary) AS avg_salary
FROM employees
WHERE hired >= '2018-01-01' -- row-level filter applied before grouping
GROUP BY department
HAVING COUNT(*) >= 2 -- group-level filter applied after aggregation
ORDER BY total_salary DESC;
When to use HAVING vs WHERE: push any predicate that can be evaluated per-row into WHERE so the engine groups fewer rows; use HAVING for conditions that depend on aggregates.
When you need multiple aggregates with different filters (e.g., total, returned, voided orders), compute them in one pass using conditional aggregation — this is much cheaper than separate GROUP BY scans:
SELECT
store_id,
COUNT(*) AS total_orders,
SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS completed_orders,
SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END) AS refunded_amount,
AVG(CASE WHEN status = 'completed' THEN amount END) AS avg_completed_amount
FROM orders
GROUP BY store_id;
COUNT(DISTINCT …) and distinct-summing: useful but costly. DISTINCT inside aggregates eliminates duplicates before aggregation, which can force sorting or hashing large intermediate sets. Use it only when you truly need uniqueness (for example, COUNT(DISTINCT customer_id) per campaign). Many engines also offer approximate DISTINCT functions for very large data (engine-dependent) — check your platform docs before using them in production. (dev.mysql.com)
Using window functions instead of GROUP BY when you need per-row context
If you need both row-level data and group-level metrics side-by-side (for example each employee’s salary plus department average), use window functions rather than grouping and joining back:
SELECT
id,
name,
department,
salary,
AVG(salary) OVER (PARTITION BY department) AS department_avg_salary,
COUNT(*) OVER (PARTITION BY department) AS department_headcount
FROM employees;
Window aggregates compute the same SUM/AVG/COUNT logic as grouped aggregates but return the value for each row in the partition — handy for percent-of-group calculations, ranking and row-by-row comparisons.
Performance and correctness checklist (practical):
- Decide whether you want row counts (COUNT(*)) or non-NULL counts (COUNT(column)). COUNT(*) is the canonical way to count rows in a result set. (dev.mysql.com)
- Use COALESCE(…) around SUM/AVG results when you require numeric defaults (e.g., 0) instead of NULL.
- Use SUM(CASE WHEN …) for conditional aggregates in engines without FILTER; prefer FILTER in PostgreSQL for clarity when available. (postgresql.org)
- Watch out for integer division: cast numerator/denominator to a decimal type or use explicit CAST/CONVERT so AVG or SUM/COUNT calculations preserve precision.
- COUNT(DISTINCT …) is functionally useful but expensive; consider approximate distinct functions (if supported) or pre-aggregation where appropriate. (dev.mysql.com)
- For extremely large tables, consider maintaining summary tables, using a counter table updated by your application or triggers for ultra-fast reads, or rely on approximate statistics when exact counts are not required. For SQL Server, use COUNT_BIG when you expect counts beyond INT range. (cnblogs.com)
Putting these patterns together lets you write compact, efficient queries that return many related metrics in a single scan. Start by choosing the correct counting semantics for your need (rows vs non-NULL values), decide how NULLs should be treated, prefer filtered/conditional aggregates over multiple passes, and only reach for engine-specific optimizations (COUNT_BIG, FILTER) when your platform supports them or you need their benefits.
Filtering groups: WHERE vs HAVING with practical examples
In SQL the engine filters rows and groups them in a fixed order: row-level predicates are applied first (WHERE), rows are then grouped (GROUP BY) and aggregates are computed, and finally group-level predicates are applied (HAVING). This execution order is why WHERE cannot reference aggregate results (COUNT, SUM, AVG, etc.) — those values don’t exist yet when WHERE runs — while HAVING can, because it is evaluated after aggregation. (postgresql.org)
Practical rule-of-thumb you should follow when writing queries: push any condition that can be evaluated on a per-row basis into WHERE so the query processes fewer rows during grouping; reserve HAVING only for conditions that depend on aggregates or truly must operate after grouping. Moving suitable predicates to WHERE often reduces I/O and CPU because GROUP BY will operate on a smaller input set. (alibabacloud.com)
Step-by-step examples using the sample employees table (id, name, department, salary, hired). Each example shows the intent, the query, and why one clause is preferable.
Example A — filter rows first, then group (preferred when possible)
Intent: consider only employees hired since 2019, then report departments with at least two such hires.
Query:
SELECT department,
COUNT(*) AS recent_hires,
AVG(salary) AS avg_salary
FROM employees
WHERE hired >= '2019-01-01' -- filter rows before grouping
GROUP BY department
HAVING COUNT(*) >= 2; -- filter groups after aggregation
Why: the hired >= '2019-01-01' test removes irrelevant rows before GROUP BY, so the database groups fewer rows and aggregates less data. The HAVING clause is used only to test the aggregate COUNT(*), which cannot be moved into WHERE. (postgresql.org)
Example B — attempting to use aggregate in WHERE (incorrect) and the fix
Incorrect (won’t work in standard SQL):
SELECT department, COUNT(*) AS headcount
FROM employees
WHERE COUNT(*) > 1 -- invalid: COUNT(*) is not available to WHERE
GROUP BY department;
Fix using HAVING:
SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department
HAVING COUNT(*) > 1; -- valid: evaluated after grouping
This illustrates the central semantic difference: WHERE filters rows, HAVING filters groups. (postgresql.org)
Example C — conditional aggregates (compute many metrics in one scan) and when HAVING tests those aggregates
Intent: for each department produce total payroll, number of employees with NULL salaries, and then show only departments whose payroll (sum of non-NULL salaries) exceeds $150,000.
Query (portable pattern):
SELECT department,
COALESCE(SUM(salary), 0) AS total_payroll,
SUM(CASE WHEN salary IS NULL THEN 1 ELSE 0 END) AS missing_salary_count
FROM employees
GROUP BY department
HAVING COALESCE(SUM(salary), 0) > 150000;
Notes: SUM(CASE WHEN …) is a portable way to compute conditional counts; the HAVING clause references the aggregate SUM(salary) to filter groups whose payroll is above the threshold. Using COALESCE avoids NULL behavior when no non-NULL salaries exist. If your RDBMS supports FILTER (PostgreSQL), you can make the conditional aggregate clearer using FILTER (WHERE …). (mysqltutorial.org)
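For comparison, the same query written with PostgreSQL's FILTER clause (a sketch; PostgreSQL-specific syntax):
SELECT department,
COALESCE(SUM(salary), 0) AS total_payroll,
COUNT(*) FILTER (WHERE salary IS NULL) AS missing_salary_count
FROM employees
GROUP BY department
HAVING COALESCE(SUM(salary), 0) > 150000;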
Example D — HAVING without GROUP BY (the whole result set as one group)
When a query contains aggregates but no GROUP BY, the entire result set is treated as a single group and HAVING can be used to filter that single summary. This is sometimes surprising but standard-compliant behavior.
Example:
-- Return the total payroll only if the total exceeds $300k
SELECT SUM(salary) AS total_payroll
FROM employees
HAVING SUM(salary) > 300000; -- applies to the single aggregate row
This acts as a guard: the query returns one row only when the HAVING condition is true. (mysqltutorial.org)
Example E — when to replace HAVING with a subquery (or move to WHERE) for clarity and portability
Some engines (and coding standards) discourage putting predicates on non-aggregated columns into HAVING. If you need to filter by an expression that is derived in the SELECT list (an alias) or you want to reuse an aggregate in a predicate in a more portable way, compute aggregates in a derived table/CTE and then filter in an outer WHERE.
Pattern using a CTE (portable and explicit):
WITH dept_stats AS (
SELECT department,
COUNT(*) AS headcount,
SUM(salary) AS total_salary
FROM employees
GROUP BY department
)
SELECT *
FROM dept_stats
WHERE headcount >= 2 AND total_salary > 150000; -- now a row-level filter on derived rows
Why: this separates aggregation from filtering and can make intent clearer to other developers and to tools. It’s also a way to avoid relying on DBMS-specific alias/HAVING behaviors. (stackoverflow.com)
Practical caveats and engine-specific behaviors you must watch for
- Aliases and non-standard extensions: some databases (notably MySQL) permit the use of SELECT aliases in HAVING or GROUP BY as an extension to the standard, while standard SQL disallows using SELECT aliases inside WHERE and in some cases in HAVING. Relying on this extension makes your SQL less portable. If portability matters, repeat the expression or use a subquery/CTE. (dev.mysql.com)
- Pushdown optimizations: modern engines and cloud DBMS variants may push parts of a HAVING condition down into earlier phases (effectively moving them into WHERE) when the predicate only depends on grouped columns and not on aggregates. This can improve performance, but you should still write your query to apply row-level filters in WHERE when possible — don’t rely on the optimizer to automatically rewrite every HAVING into a WHERE. (alibabacloud.com)
- Performance guidance: treat HAVING as potentially more expensive than WHERE because HAVING is executed after aggregation; if you put a filter in HAVING that could have been evaluated earlier, the database will waste time grouping and computing aggregates you didn’t need. Use explain/actual plans to verify whether predicates are being applied early or late in your engine. (stackoverflow.com)
Quick checklist for deciding WHERE vs HAVING when you write a query
- Can the predicate be evaluated on each raw row (no aggregates or GROUP BY-only columns)? Put it in WHERE.
- Does the predicate reference an aggregate (COUNT, SUM, AVG, MIN, MAX) or a grouped result computed across multiple rows? Put it in HAVING.
- Can you compute the aggregate in a subquery/CTE and then use WHERE in the outer query for readability or portability? Prefer that when clarity or portability is important.
- Want many related metrics in a single scan? Use conditional aggregation (SUM(CASE WHEN …) or FILTER in PostgreSQL) and then HAVING for group-level thresholds.
- If you depend on alias behavior in HAVING (or GROUP BY), verify that your target RDBMS permits that extension — MySQL does, but many others are stricter. (mysqltutorial.org)
Hands-on debugging tips
- If a HAVING condition seems to do nothing, confirm whether the expression actually uses an aggregate. If not, move it to WHERE and rerun.
- Use EXPLAIN/EXPLAIN ANALYZE to see whether the optimizer applies filtering before or after grouping. If the plan shows grouping early and HAVING applied late, consider moving row-level predicates to WHERE (see the example after this list).
- When results look non-deterministic (e.g., selecting non-grouped columns without aggregation), check your DBMS’s GROUP BY rules (some engines allow nonstandard behavior). Make the grouping explicit or use aggregates like MIN()/MAX() or window functions to express deterministic intent. (postgresql.org)
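For example, in PostgreSQL (EXPLAIN ANALYZE runs the query and reports the actual plan; a sketch against the sample table):
EXPLAIN ANALYZE
SELECT department, COUNT(*) AS headcount
FROM employees
WHERE hired >= '2019-01-01'
GROUP BY department
HAVING COUNT(*) >= 2;
-- expect the WHERE predicate on the scan node and the HAVING filter on the aggregate node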
Grouping with JOINs, multiple columns, NULLs and DISTINCT
When you aggregate data that comes from joined tables, the primary risk is accidental multiplication of rows: joins with one-to-many relationships expand the input set that GROUP BY sees, so aggregates computed after the join may reflect the expanded (and often unintended) row-count rather than the logical entity you intended to summarize. A common real-world situation is counting orders per customer while also joining order_items: joining orders -> order_items turns each order into multiple rows (one per item), so COUNT(*) or COUNT(o.id) returns the number of items, not the number of orders. The two straightforward, correct strategies are (A) deduplicate/aggregate the child table before joining, or (B) use DISTINCT inside the aggregate to count unique parent keys; both yield correct logical results but have different performance characteristics and tradeoffs. (postgresql.org)
Step-by-step demonstration (orders/customers example). Suppose you have customers, orders and order_items. The naive query that produces the wrong answer looks like this:
-- incorrect: counts order_item rows, not orders
SELECT c.customer_id,
COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY c.customer_id;
Why it’s wrong: the JOIN to order_items multiplies each order into as many rows as items it contains, so COUNT(o.order_id) behaves like COUNT(oi.order_id). To fix the logic, choose one of the patterns below and pick the one that matches your data size and index layout. (postgresql.org)
Pattern A — aggregate the many-side first, then join (recommended for correctness and usually for performance):
WITH order_counts AS (
SELECT order_id, COUNT(*) AS items_per_order
FROM order_items
GROUP BY order_id
)
SELECT c.customer_id,
COUNT(o.order_id) AS order_count,
SUM(oc.items_per_order) AS total_items
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
LEFT JOIN order_counts oc ON oc.order_id = o.order_id
GROUP BY c.customer_id;
Why this helps: the expensive per-order aggregation runs only once on the child table; joins afterwards operate on already-reduced rows, avoiding row explosion and double-counting. This idea — “aggregate first, join later” — is a common performance and correctness pattern in PostgreSQL, MySQL and other engines. (postgresql.org)
Pattern B — use COUNT(DISTINCT parent_key) when you need uniqueness but can tolerate the cost:
-- counts distinct orders per customer even though details are joined
SELECT c.customer_id,
COUNT(DISTINCT o.order_id) AS order_count
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY c.customer_id;
Caveat: COUNT(DISTINCT …) (and other DISTINCT aggregates) can be significantly more expensive than a pre-aggregation because it requires deduplication (hashing or sorting) over potentially large intermediate sets; on very large datasets it can dominate query cost unless supported by useful indexes or enough working memory. Evaluate plans and query memory (e.g., work_mem in PostgreSQL) before choosing COUNT(DISTINCT) as the default fix. (citusdata.com)
NULLs in grouping columns behave differently than simple boolean equality semantics might suggest. The SQL language treats multiple NULL markers in a grouping column as a single group — that is, all NULLs are placed into one group when you GROUP BY that column. Practically this means you will see a single summary row for rows whose grouping key is NULL (not one row per NULL). If you need to separate NULLs from other values or label them explicitly, use COALESCE() (or a CASE) to map NULL to a sentinel value, for example COALESCE(department, '(no department)').
Example: group rows where department may be NULL
-- NULL departments are grouped into a single row
SELECT COALESCE(department, '(no department)') AS dept_label,
COUNT(*) AS headcount
FROM employees
GROUP BY COALESCE(department, '(no department)');
When multiple grouping columns are used, GROUP BY forms groups on the tuple of column values; two rows are in the same group only if every grouped expression is “not distinct” (i.e., equal or both NULL) for those rows. This lets you group by combinations safely, but keep in mind that NULLs in any one element are handled as explained above. Use explicit COALESCE or conditional logic if your reporting wants NULLs shown or treated differently for certain dimensions. (postgresql.org)
Be careful selecting non-aggregated columns that are not listed in GROUP BY. Standard SQL requires every non-aggregated expression in the SELECT list to appear in GROUP BY (or be functionally dependent on the grouped columns). Some database engines (notably MySQL historically) have permissive modes and/or implement the SQL optional feature for functional dependencies; modern MySQL enforces ONLY_FULL_GROUP_BY by default and will reject queries that return non-deterministic values unless you explicitly indicate you accept nondeterminism (e.g., ANY_VALUE()) or you rely on a recognized functional dependency (like grouping by a primary-key column). For portable, deterministic SQL, include grouping expressions for every non-aggregated column or use aggregates (MIN, MAX, ANY_VALUE) deliberately. (dev.mysql.com)
Practical patterns for multi-column grouping with joins
- When grouping by multiple dimensions that come from different tables, prefer to compute group keys consistently (for example, compute a derived column like EXTRACT(YEAR FROM t.ts) in a CTE and refer to that alias to avoid duplicating expression logic in GROUP BY).
- If you need both the row-level detail and the group-level totals, consider window functions (PARTITION BY) instead of GROUP BY+JOIN-back patterns — window functions compute aggregates per partition while keeping the original rows intact. When you only need summary rows, GROUP BY is still the right tool. (Test explain plans: sometimes a grouped subquery joined to the small dimension table is faster than letting the planner perform grouping after large joins.) (postgresql.org)
Conditional aggregation and avoiding join-induced inflation. If your goal is to compute multiple related metrics in one scan (for example: total_orders, completed_orders, refunded_amount), use conditional aggregates so the engine scans the base table once and computes multiple counters. If you must join other tables that would multiply rows, push the condition into a pre-aggregation on the child table and then JOIN the reduced result into the parent. Example:
-- compute several counters per customer while avoiding double counts
WITH order_agg AS (
SELECT o.customer_id,
COUNT(DISTINCT o.order_id) AS total_orders, -- DISTINCT avoids inflation from the item join
COUNT(DISTINCT CASE WHEN o.status = 'completed' THEN o.order_id END) AS completed_orders,
SUM(oi.qty * oi.unit_price) AS total_order_value
FROM orders o
LEFT JOIN order_items oi ON oi.order_id = o.order_id
GROUP BY o.customer_id
)
SELECT c.customer_id, oa.*
FROM customers c
LEFT JOIN order_agg oa ON oa.customer_id = c.customer_id;
This pattern keeps business logic (the per-order aggregation) isolated and prevents multiplication from additional joins. It also makes it easier to add indexes on the child table’s grouping keys and to tune memory for the aggregation stage. (postgresql.org)
When to prefer DISTINCT vs pre-aggregation vs windowing
- Use pre-aggregation (aggregate child-table first) when the child table is large and you can reduce rows dramatically before joining — almost always the best for performance and clarity.
- Use COUNT(DISTINCT …) when the deduplication set is small enough (or you have suitable indexes) and you prefer a single-pass query for simplicity; profile carefully because COUNT(DISTINCT) can trigger expensive hashing or sorting. (citusdata.com)
- Use window functions when you need group-level values available per row (e.g., percentage of group) without collapsing to summary rows. Window functions avoid an explicit GROUP BY/JOIN cycle but may still require scanning the same data; compare plans. (postgresql.org)
Debugging tips and quick checklist before you trust aggregate results after joins
- Ask: is any join producing one-to-many relationships? If yes, consider pre-aggregating the many side or using DISTINCT on the parent key in the aggregate. (postgresql.org)
- Check whether grouping columns contain NULLs and whether you want NULLs merged into a single group — map NULLs explicitly if not. (stackoverflow.com)
- If your query selects non-aggregated columns that aren’t in GROUP BY, confirm your RDBMS accepts the query deterministically (functional dependency) or else add aggregates/ANY_VALUE or include the column in GROUP BY. This avoids nondeterministic results or engine-specific behavior. (dev.mysql.com)
- Run EXPLAIN (or EXPLAIN ANALYZE) to see whether grouping happens before or after joins and whether the planner is able to push down aggregates — if the plan groups after a huge join, rewrite the query to aggregate earlier. (postgresql.org)
By combining an understanding of how joins expand rows, how NULLs are grouped, how DISTINCT/COUNT(DISTINCT) work, and when an RDBMS enforces grouping determinism, you can write concise, correct GROUP BY queries that avoid double-counting, handle NULLs intentionally, and scale to large datasets. Test with representative data, examine query plans, and prefer explicit pre-aggregation when joins would otherwise cause multiplicative blow-ups. (postgresql.org)
Advanced grouping: ROLLUP, CUBE, GROUPING SETS and window-function alternatives
When you need more than a single level of aggregation — for example per-department, per-year, per-department-per-year, plus subtotals and a grand total — SQL offers compact, server-side primitives that generate those multiple summary rows in one pass: GROUPING SETS, ROLLUP and CUBE. These let you produce many grouped combinations in a single scan (much cheaper than separate GROUP BY + UNION ALL queries) and provide helper functions (GROUPING, GROUPING_ID) to distinguish “real” NULLs from the summary rows the engine injected. The exact syntax and feature support varies by engine (PostgreSQL, SQL Server, Oracle, Snowflake, DuckDB implement full ANSI GROUPING SETS/CUBE/ROLLUP; MySQL historically supports WITH ROLLUP and GROUPING() but not every engine exposes full GROUPING SETS/CUBE). (postgresql.org)
How these constructs relate (conceptual shortcuts you can rely on): GROUPING SETS enumerates exactly which grouping combinations you want; ROLLUP generates a hierarchy of subtotals (first n, then n-1, …, then grand total); CUBE generates all combinations (the power set) of the listed grouping expressions. ROLLUP and CUBE are just shorthand for particular GROUPING SETS lists, so use GROUPING SETS when you want fine-grained control and ROLLUP/CUBE when you want the built-in common patterns. The engine usually implements these in a single pass (MixedAggregate / multi-level aggregate logic) rather than literally doing multiple GROUP BY + UNION ALL queries, which is why they are both convenient and efficient. (postgresql.org)
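As a concrete illustration of the shorthand relationship, the two queries below are equivalent (a minimal sketch against the employees table):
-- ROLLUP shorthand ...
SELECT department, EXTRACT(YEAR FROM hired) AS hire_year, SUM(salary) AS total_salary
FROM employees
GROUP BY ROLLUP (department, EXTRACT(YEAR FROM hired));
-- ... expands to this explicit GROUPING SETS list
SELECT department, EXTRACT(YEAR FROM hired) AS hire_year, SUM(salary) AS total_salary
FROM employees
GROUP BY GROUPING SETS (
(department, EXTRACT(YEAR FROM hired)),
(department),
()
);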
Practical examples (use the employees table introduced earlier). These examples use standard (ANSI) GROUP BY syntax where possible; adapt to your RDBMS if it requires the legacy WITH ROLLUP form.
1) GROUPING SETS — pick exactly the combinations you want in one query
-- totals by (department, year), by department only, by year only, and a grand total
SELECT
department,
EXTRACT(YEAR FROM hired) AS hire_year,
SUM(salary) AS total_salary
FROM employees
GROUP BY GROUPING SETS (
(department, EXTRACT(YEAR FROM hired)), -- department + year
(department), -- department totals
(EXTRACT(YEAR FROM hired)), -- year totals
() -- grand total
)
ORDER BY department, hire_year;
Why use this: you read employees once and the DB returns the four logical grouping levels in one result set, instead of running four separate GROUP BY queries and UNIONing them. That reduces IO and intermediate work. (postgresql.org)
2) ROLLUP — hierarchical subtotals (dept -> year -> grand total when ordered this way)
-- ROLLUP creates a sequence of subtotals for the ordered list of grouping columns
SELECT
department,
EXTRACT(YEAR FROM hired) AS hire_year,
SUM(salary) AS total_salary
FROM employees
GROUP BY ROLLUP (department, EXTRACT(YEAR FROM hired))
ORDER BY department NULLS LAST, hire_year NULLS LAST; -- ordering often helps readability
Notes: ROLLUP produces rows where one or more grouping columns are NULL to indicate subtotal/grand-total rows. The order of columns in ROLLUP matters — it rolls up from left to right (first department, then year, then the () grand total). Use GROUPING() or GROUPING_ID() to detect which rows are subtotals versus “real” rows if your grouping columns may contain NULL values themselves. (postgresql.org)
3) CUBE — all combinations (use carefully; it can explode)
-- CUBE produces every combination of department & year (dept-year, dept-only, year-only, and overall)
SELECT
department,
EXTRACT(YEAR FROM hired) AS hire_year,
SUM(salary) AS total_salary
FROM employees
GROUP BY CUBE (department, EXTRACT(YEAR FROM hired))
ORDER BY department, hire_year;
Caveat: for N grouping columns CUBE produces up to 2^N grouping sets. That’s convenient for small N but can be computationally expensive for many dimensions — prefer explicit GROUPING SETS when you only need specific combinations. (postgresql.org)
4) Distinguishing subtotal rows: GROUPING() and GROUPING_ID()
When the engine injects subtotal rows it usually sets the unused grouping columns to SQL NULL. If your data can contain real NULLs you need a reliable way to tell whether a NULL is a data value or a subtotal marker. Use GROUPING(expr) (returns 1 when expr is NOT present in the grouping for that row, 0 otherwise) or GROUPING_ID(…) (returns a compact integer bitmap representing which columns are aggregated away). Example:
-- label subtotals; GROUPING() returns 1 when the column is a super-aggregate marker
SELECT
CASE WHEN GROUPING(department) = 1 THEN 'ALL DEPARTMENTS' ELSE department END AS dept_label,
CASE WHEN GROUPING(EXTRACT(YEAR FROM hired)) = 1 THEN 'ALL YEARS' ELSE TO_CHAR(EXTRACT(YEAR FROM hired),'9999') END AS year_label,
SUM(salary) AS total_salary,
GROUPING_ID(department, EXTRACT(YEAR FROM hired)) AS grouping_id
FROM employees
GROUP BY ROLLUP (department, EXTRACT(YEAR FROM hired))
ORDER BY grouping_id, dept_label, year_label;
You can also filter for subtotal rows: HAVING GROUPING_ID(department, EXTRACT(YEAR FROM hired)) > 0 will keep only subtotal/grand-total rows. GROUPING_ID is especially handy for programmatic filtering or when producing nicely labeled reports. Note that names vary by platform: SQL Server and Oracle expose GROUPING_ID(…), while PostgreSQL produces the same bitmap with a multi-argument GROUPING(a, b, …). (postgresql.org)
5) Engine differences and portability notes
- PostgreSQL, SQL Server, Oracle, Snowflake, DuckDB: full support for GROUPING SETS / ROLLUP / CUBE and the GROUPING helper; most also expose GROUPING_ID (PostgreSQL instead uses multi-argument GROUPING for the bitmap). Use the ANSI-style GROUP BY GROUPING SETS (…) / GROUP BY ROLLUP(…) / GROUP BY CUBE(…) syntax. (postgresql.org)
- MySQL: supports WITH ROLLUP and the GROUPING() helper in modern versions; it historically did not support full GROUPING SETS/CUBE in the core server (some MySQL engines/extensions such as HeatWave add extra support). If your target is MySQL, check the specific version and whether you must use the legacy GROUP BY … WITH ROLLUP syntax or the ROLLUP(…) modifier. When portability matters, test the specific server version and adapt queries (or fall back to UNION ALL emulation). (dev.mysql.com)
6) Emulating GROUPING SETS / ROLLUP on systems that lack them
If your engine lacks GROUPING SETS or CUBE, you can emulate them with UNION ALL of separate GROUP BY queries. This is portable but may be more expensive because the table is scanned once per grouping set.
-- emulate GROUPING SETS((department, hire_year), (department), (hire_year), ())
SELECT department, EXTRACT(YEAR FROM hired) AS hire_year, SUM(salary) AS total_salary
FROM employees
GROUP BY department, EXTRACT(YEAR FROM hired)
UNION ALL
SELECT department, NULL AS hire_year, SUM(salary)
FROM employees
GROUP BY department
UNION ALL
SELECT NULL AS department, EXTRACT(YEAR FROM hired), SUM(salary)
FROM employees
GROUP BY EXTRACT(YEAR FROM hired)
UNION ALL
SELECT NULL, NULL, SUM(salary) FROM employees;
Use this only if you cannot upgrade or if grouping sets would generate a smaller number of combinations but you still require compatibility. When you do have grouping-sets support, prefer it for performance and clarity. (cybertec-postgresql.com)
7) Window-function alternatives and when to pick them
Window functions (OVER (PARTITION BY …)) are not a substitute for GROUPING SETS/CUBE when your goal is to produce multiple aggregated grouping rows in a single result set. Instead they serve a complementary role:
- Window functions compute aggregates for each row’s partition but do not collapse rows. Use them when you need per-row detail plus group-level metrics (e.g., each employee’s salary together with their department total). For example:
SELECT
id, name, department, salary,
SUM(salary) OVER (PARTITION BY department) AS dept_total,
SUM(salary) OVER () AS grand_total,
ROUND(100.0 * salary / NULLIF(SUM(salary) OVER (PARTITION BY department),0), 2) AS pct_of_dept
FROM employees;
This returns one row per employee with department and global totals attached — something GROUP BY alone cannot do without joining back aggregated data. Window functions also implement running totals, row ranking (RANK(), ROW_NUMBER()), percentiles, and other row-aware analytics. (docs.jade.fyi)
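For instance, a running payroll total and a within-department salary rank look like this (a sketch using the sample table; NULLS LAST keeps the NULL salary at the bottom of the ranking):
SELECT
id, name, department, salary,
SUM(salary) OVER (ORDER BY hired) AS running_payroll, -- running total by hire date
RANK() OVER (PARTITION BY department ORDER BY salary DESC NULLS LAST) AS salary_rank_in_dept
FROM employees;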
- If you need both detail rows and the summary rows together (the usual reporting layout: rows followed by subtotals and totals), combine window functions for the per-row context with a separate GROUP BY (or a GROUPING SETS query) for the subtotal rows, then UNION ALL them into the final report, or produce the detail with windowed aggregates and append the summary rows using a small aggregated query. That keeps work compact: detail uses windowing (no re-aggregation of the detail rows) and totals use GROUP BY (or ROLLUP) for efficient summarization.
Example pattern: per-row detail + subtotal rows appended
WITH detail AS (
SELECT id, name, department, salary,
SUM(salary) OVER (PARTITION BY department) AS dept_total
FROM employees
),
subtotals AS (
SELECT department, SUM(salary) AS total_salary
FROM employees
GROUP BY department
UNION ALL
SELECT NULL AS department, SUM(salary) FROM employees
)
SELECT * FROM detail
UNION ALL
SELECT NULL AS id, '[SUBTOTAL]' AS name, department, NULL AS salary, total_salary FROM subtotals
ORDER BY department, id NULLS LAST; -- tweak ordering for display
8) Practical debugging and best practices
- If your grouping columns can be NULL, rely on GROUPING() / GROUPING_ID() to decide which NULLs are summary markers and which are true data values. (postgresql.org)
- Prefer GROUPING SETS when you only need a handful of combinations from many possible dimensions (it avoids the exponential blowup of CUBE). Prefer CUBE only for small numbers (N ≤ 3 is common) when you truly need every combination. (postgresql.org)
- When migrating between engines (MySQL ↔ PostgreSQL ↔ SQL Server), remember syntax differences: MySQL historically offers WITH ROLLUP and supports GROUPING() in recent versions, but full GROUPING SETS/CUBE support is not uniform across all MySQL distributions; verify per-version behavior. When portability is critical, explicitly test the target server and prefer ANSI GROUPING SETS where supported. (dev.mysql.com)
- Use ORDER BY carefully: some servers historically restricted ORDER BY with rollup; modern versions are better, but if you need a specific display order wrap the aggregated result in a derived table and sort in an outer query. (dev.mysql.com)
- For very large fact tables, consider summary tables or materialized views for the most-common grouping combinations (pre-aggregating avoids repeating heavy computations), and use GROUPING SETS/ROLLUP for ad-hoc multi-level reports. Also examine the optimizer’s plan — grouping-set implementations are often optimized to avoid multiple scans, but if you see repeated large scans in EXPLAIN, try rewriting or pre-aggregating. (cybertec-postgresql.com)
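For the summary-table idea, PostgreSQL's materialized views are one concrete option (a sketch; the view name and refresh timing are illustrative):
-- pre-aggregate the most common grouping combination once
CREATE MATERIALIZED VIEW dept_year_salary AS
SELECT department,
EXTRACT(YEAR FROM hired) AS hire_year,
SUM(salary) AS total_salary,
COUNT(*) AS headcount
FROM employees
GROUP BY department, EXTRACT(YEAR FROM hired);
-- cheap reads for reports
SELECT * FROM dept_year_salary WHERE department = 'Engineering';
-- refresh when the underlying data changes (scheduled or on demand)
REFRESH MATERIALIZED VIEW dept_year_salary;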
9) Quick checklist when you build multi-level reports
- Choose GROUPING SETS when you need a small set of specific groupings. Use ROLLUP for hierarchical subtotals. Use CUBE only if you need all combinations and N is small. (postgresql.org)
- Detect subtotal rows with GROUPING() or GROUPING_ID(), and replace NULL markers with human-friendly labels in the SELECT list. (postgresql.org)
- Use window functions for per-row context (percent-of-group, rank, running totals), and combine with aggregated rows via UNION ALL when you need both row-level details and group subtotals in a single report. (docs.jade.fyi)
- Test on a production-sized sample and inspect EXPLAIN/EXPLAIN ANALYZE plans — grouping constructs are powerful, but the optimizer’s behavior and memory settings (hash vs sort aggregates, work_mem in PostgreSQL, etc.) will determine real-world performance. (cybertec-postgresql.com)
Taken together, GROUPING SETS / ROLLUP / CUBE and GROUPING()/GROUPING_ID give you a compact, readable way to produce multi-level reports from SQL. Use GROUPING SETS for precision, ROLLUP for natural hierarchies, CUBE for complete cross-tab exploration (with caution), and window functions when you need rich row-level context paired with group-level metrics.