Define quality metrics and thresholds
Start by selecting a small set of measurable quality dimensions tied to business impact: accuracy (correct values), completeness (absence of missing values), consistency (cross-field and cross-source agreement), uniqueness (no duplicates), validity (schema/type conformance), and timeliness (freshness). For each dimension, define a numeric metric (e.g., percent nulls, duplicate rate, schema-mismatch rate, distributional divergence, latency in hours) and a clear unit of measurement so checks can run automatically and produce comparable outputs.
Set thresholds using a mix of empirical baselines and business tolerances. Derive a baseline from historical data (median and relevant percentiles) and set thresholds relative to that baseline or to SLAs: a permissive “warning” threshold (e.g., null rate > 2%) and a stricter “critical” threshold (e.g., null rate > 5%). Where distributions matter, use statistical distances (KL divergence, PSI) with chosen cutoffs (for example PSI > 0.2 or KL > 0.1) calibrated on past incidents. For schema and validity, set near-zero tolerances (for example, critical if schema-mismatch rate > 0.01%) when downstream jobs cannot tolerate change.
Provide concrete, actionable examples that are easy to automate: allow at most 2% missing values in customer_id, allow ≤0.1% duplicate rows, require 99.9% schema conformance, require data freshness within 24 hours. Prefer tiered alerts: a warning triggers additional profiling and retries; a critical alert halts the pipeline, notifies data engineers, and triggers automated rollback if available. Revisit thresholds periodically: log incidents, measure false positives/negatives, and adjust thresholds based on the cost of remediation versus business risk.
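As a minimal sketch of how such tiered thresholds might be wired into an automated check (the metric names and cutoff values below are illustrative, not recommendations):
THRESHOLDS = {
    "null_rate":       {"warning": 0.02, "critical": 0.05},
    "duplicate_rate":  {"warning": 0.001, "critical": 0.005},
    "schema_mismatch": {"warning": 0.0, "critical": 0.0001},
    "freshness_hours": {"warning": 4, "critical": 24},
}

def evaluate(metrics: dict) -> dict:
    """Map each computed metric to a severity: 'ok', 'warning', or 'critical'."""
    severities = {}
    for name, value in metrics.items():
        limits = THRESHOLDS.get(name)
        if limits is None:
            continue  # no contract defined for this metric
        if value > limits["critical"]:
            severities[name] = "critical"
        elif value > limits["warning"]:
            severities[name] = "warning"
        else:
            severities[name] = "ok"
    return severities

print(evaluate({"null_rate": 0.03, "freshness_hours": 2}))
# {'null_rate': 'warning', 'freshness_hours': 'ok'}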
Missing values and null checks
Treat nulls as a first-class quality signal: they indicate schema issues, ingestion errors, or legitimate absence, and they often break joins, ML models, and aggregations. Compute per-column null rates and track them over time with the same warning/critical thresholds used elsewhere. Treat primary keys and foreign keys with near-zero tolerance (example: critical if null rate > 0.01%); allow larger tolerances for optional fields (example: warning at 1–2%, critical at 5%).
Detect nulls robustly by checking database NULLs plus common sentinels and formatting problems (empty strings, whitespace, ‘N/A’, ‘-1’, extreme outliers used as placeholders). Use column-specific rules: numeric columns may use -999 or 0 as sentinels, text columns may contain empty or whitespace-only values. Example SQL check:
SELECT COUNT(*) AS cnt,
SUM(CASE WHEN col IS NULL OR TRIM(col) = '' OR col IN ('N/A','-') THEN 1 ELSE 0 END) AS missing
FROM table;
Automate simple tests in pipeline unit checks and data contracts: fail fast for critical fields, emit warnings and profile for noncritical fields. Log sample rows for every alert and capture timestamps so engineers can reproduce issues. For remediation choose conservative strategies: add a missing-indicator feature for ML, prefer model-based or group-wise imputations over global means, and avoid silent automatic fills for keys. When imputing, record imputation method and provenance in metadata so downstream consumers can filter or re-run with raw data. Finally, monitor missingness drift with time-series alerts and surface correlated null patterns (same source, same time window) to detect systemic failures rather than individual bad records.
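A pandas-based sketch of a per-column missingness check that also counts sentinel values (the sentinel list and sample columns are assumptions for illustration):
import pandas as pd

SENTINELS = {"", "N/A", "-", "-999"}  # adjust per column in practice

def missing_rates(df: pd.DataFrame) -> pd.Series:
    """Per-column missing rate counting NULLs, blank strings, and sentinels."""
    def rate(col: pd.Series) -> float:
        as_text = col.astype(str).str.strip()
        return (col.isna() | as_text.isin(SENTINELS)).mean()
    return df.apply(rate)

df = pd.DataFrame({"customer_id": ["1", None, "N/A"],
                   "email": ["a@x.com", "  ", "b@x.com"]})
print(missing_rates(df))  # customer_id ≈ 0.67, email ≈ 0.33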
Duplicate and uniqueness checks
Uniqueness problems break joins, inflate aggregates, and poison model training, so treat uniqueness as a first-class metric: track the duplicate rate (duplicate rows divided by total rows) and distinct-key coverage over time, with clear warning and critical thresholds that map to business risk (for example, warning at ~0.1% duplicate rate, critical at ~0.5% for high-volume customer datasets; tighten to near-zero for primary keys). Always log samples and timestamps for any alert so engineers can reproduce issues.
Detect exact duplicates with a grouped count on primary/composite key columns; a simple SQL check is effective for automation:
SELECT key_cols, COUNT(*) AS dup_count
FROM table
GROUP BY key_cols
HAVING COUNT(*) > 1;
For near-duplicates (typos, different formatting, duplicate customers across sources) use normalized keys, deterministic fingerprinting (lowercasing, punctuation removal), and similarity algorithms (Levenshtein/TF-IDF or cosine on name/email) with calibrated thresholds. Hash-based blocking or locality-sensitive hashing speeds up fuzzy joins on large tables.
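A rough sketch of deterministic fingerprinting plus a simple similarity score, using only the standard library (the normalization rules and the 0.9 cutoff are illustrative; production matching usually adds blocking and a dedicated similarity library):
import hashlib
import re
from difflib import SequenceMatcher

def fingerprint(name: str, email: str) -> str:
    """Deterministic key: lowercase, drop punctuation and whitespace, then hash."""
    normalized = re.sub(r"[^a-z0-9@.]", "", f"{name}|{email}".lower())
    return hashlib.sha1(normalized.encode()).hexdigest()

def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Flag near-duplicates by character-level similarity (ratio in [0, 1])."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

print(fingerprint("Jane Doe", "JANE.DOE@example.com")
      == fingerprint("jane doe", "jane.doe@example.com"))  # True: same fingerprint
print(is_near_duplicate("Acme Corp.", "ACME Corp"))         # True: likely the same entity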
Define and automate remediation: policy-driven dedupe (keep first/last/most-complete), merge rules for conflicting attributes, and preserve provenance by writing canonical_id and source_rank fields rather than deleting raw rows. Enforce uniqueness checks early in ingestion for critical keys (fail-fast), and run periodic backfills that both reconcile duplicates and record metrics so thresholds can be tuned based on observed false positives and operational cost.
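For the keep-most-complete policy with provenance, a pandas sketch (the canonical_id and source_rank columns mirror the fields mentioned above; the completeness heuristic is an assumption):
import pandas as pd

def mark_canonical(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Assign canonical_id per key and rank rows by completeness; rank 1 is the
    surviving record. Raw rows are kept rather than deleted."""
    out = df.copy()
    completeness = out.notna().sum(axis=1)
    out["canonical_id"] = out.groupby(key).ngroup()
    out["source_rank"] = (completeness.groupby(out[key])
                          .rank(method="first", ascending=False)
                          .astype(int))
    return out

df = pd.DataFrame({"customer_id": [1, 1, 2],
                   "email": ["a@x.com", None, "b@x.com"],
                   "phone": [None, "555-0100", "555-0101"]})
print(mark_canonical(df, "customer_id"))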
Schema and data type drift
Schema changes and subtle type shifts are common failure modes: new or removed columns, widened types (int → float), string-encoded numerics, or inconsistent timestamp formats break joins, aggregations, and model featurizers. Detect these early by treating schema as testable metadata: persist a canonical schema (names, types, nullability, allowed domains) and compare each batch against it.
Automate three lightweight checks per field: exact type conformance (percent of rows matching the expected type), presence (column exists and null rate within threshold), and format/domain validity (regex, enum membership, or numeric range). Use tiered thresholds (warning vs. critical) so nonblocking changes trigger profiling while critical deviations halt downstream jobs. Example SQL to find type mismatches in an integer column (TRY_CAST or the engine's equivalent safe-cast function):
SELECT COUNT(*) AS total,
SUM(CASE WHEN TRY_CAST(col AS BIGINT) IS NULL AND col IS NOT NULL THEN 1 ELSE 0 END) AS type_errors
FROM dataset;
For new columns, detect additions and surface diffs with sample values and type guesses; for removed columns, alert immediately and require a schema migration. Record a schema hash and a last-seen sample snapshot in metadata; compare hashes each run to generate a human-readable diff that includes new columns, widened types, and changed nullability.
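A sketch of the schema-hash-and-diff idea in Python (the {column: type} schema representation is an assumption; real pipelines would read it from the catalog or data contract):
import hashlib
import json

def schema_hash(schema: dict) -> str:
    """Stable hash of {column: type} metadata; any change alters the hash."""
    return hashlib.sha256(json.dumps(schema, sort_keys=True).encode()).hexdigest()

def schema_diff(expected: dict, observed: dict) -> dict:
    """Human-readable diff: added, removed, and type-changed columns."""
    return {
        "added": sorted(set(observed) - set(expected)),
        "removed": sorted(set(expected) - set(observed)),
        "type_changed": {c: (expected[c], observed[c])
                         for c in set(expected) & set(observed)
                         if expected[c] != observed[c]},
    }

expected = {"customer_id": "BIGINT", "amount": "INT"}
observed = {"customer_id": "BIGINT", "amount": "DOUBLE", "coupon": "VARCHAR"}
print(schema_hash(expected) == schema_hash(observed))  # False: schemas differ
print(schema_diff(expected, observed))
# {'added': ['coupon'], 'removed': [], 'type_changed': {'amount': ('INT', 'DOUBLE')}}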
Remediation strategies should be policy-driven: fail fast for primary and foreign keys, allow controlled coercions for optional fields (log the casts and add provenance columns like _original_col and _cast_reason), and gate schema evolution through an automated approval or versioned contract. Run shadow tests when evolving schemas: route a copy of the new batch to downstream consumers to validate behavior before flipping production.
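A sketch of a controlled coercion that preserves provenance, using pandas (the _original_*/_cast_reason_* column naming follows the convention mentioned above and is an assumption):
import pandas as pd

def coerce_numeric(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Cast a string-encoded numeric column while keeping the raw values
    and recording whether each row cast cleanly."""
    out = df.copy()
    out[f"_original_{col}"] = out[col]
    cast = pd.to_numeric(out[col], errors="coerce")
    failed = cast.isna() & out[col].notna()
    out[f"_cast_reason_{col}"] = failed.map({True: "uncastable", False: "cast_ok"})
    out[col] = cast
    return out

df = pd.DataFrame({"amount": ["10", "12.5", "N/A"]})
print(coerce_numeric(df, "amount"))  # the "N/A" row becomes NaN, flagged "uncastable"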
Finally, surface compact alerts with failing samples, metric deltas (e.g., type-mismatch rate), schema diff, and a recommended action (rollback, cast, or migrate). Automating detection plus lightweight governance keeps pipelines resilient while enabling safe, auditable schema change.
Distribution and anomaly detection
Detecting distributional shifts and record-level anomalies prevents subtle data rot that breaks models and analytics. Regularly profile numeric features (means, quantiles, tails), categorical frequencies, and timestamp coverage; compare each batch to a historical baseline to catch both gradual drift and sudden spikes.
Quantitative comparisons work well: use PSI or KL for feature-level drift (PSI > 0.2 as a practical warning), Kolmogorov–Smirnov or Mann–Whitney tests for continuous distributions (p < 0.01 signals noteworthy change), and quantile- or histogram-based deltas for heavy-tailed data. For single-record outliers compute robust z-scores or median absolute deviation and flag extreme values (e.g., |z| > 5). For categorical fields surface new/unexpected categories and large shifts in top-k frequency share.
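A compact PSI computation, sketched with numpy (the ten-bin layout and the 0.2 cutoff follow the examples above; the clipping epsilon is an implementation detail to keep the logarithm finite):
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index of a new batch against a baseline."""
    # Bin edges from baseline quantiles; digitize maps values to bins 0..bins-1.
    inner_edges = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_pct = np.bincount(np.digitize(reference, inner_edges), minlength=bins) / len(reference)
    cur_pct = np.bincount(np.digitize(current, inner_edges), minlength=bins) / len(current)
    ref_pct = np.clip(ref_pct, 1e-6, None)  # avoid log(0) and division by zero
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)
new_batch = rng.normal(1, 1, 10_000)   # 1-sigma mean shift
print(psi(baseline, new_batch) > 0.2)  # True: drift exceeds the warning cutoff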
Complement statistical tests with unsupervised anomaly detectors for multivariate patterns: Isolation Forests, One-Class SVMs, or lightweight autoencoders, tuned conservatively (for example, contamination ≈ 0.01) and ensembled across methods to reduce false positives. For large datasets, apply feature hashing or PCA to speed up detection, and log feature contributions so alerts are actionable.
Automate checks in the pipeline: run batch profiling, compute delta metrics, persist baselines, and attach sampled anomalous rows and top contributing features to alerts. Use tiered responses: warning (profile + retry), critical (hold/rollback + engineer review). Record every alert with metadata (batch id, data source, metric deltas, sample rows) so thresholds can be calibrated from real incidents and false-positive rates reduced over time. A minimal Isolation Forest check looks like:
from sklearn.ensemble import IsolationForest

# X_ref: baseline feature matrix; X_batch: features for the incoming batch.
clf = IsolationForest(contamination=0.01, random_state=0).fit(X_ref)
pred = clf.predict(X_batch)  # -1 = anomaly, 1 = normal
Freshness, volume, timeliness checks
Freshness and volume checks are safety nets against stale analytics and silent data loss. Implement three automated, tiered tests on each ingestion: (1) batch freshness — compute the lag between the pipeline run time and the most recent event timestamp; flag a warning if the lag exceeds a soft SLA (example: > 4 hours) and critical if it exceeds a hard SLA (example: > 24 hours). Example SQL snippet:
SELECT MAX(event_ts) AS last_seen,
EXTRACT(EPOCH FROM (NOW() - MAX(event_ts)))/3600 AS lag_hours
FROM raw.events;
(2) volume sanity — compare incoming row counts and partition counts to historical baselines (median and interquartile range). Alert when counts fall outside an acceptable band (e.g., <90% or >110% of median) or when expected partitions/hours are missing. Simple rule: if cnt < 0.9 * median(cnt) then warning; if cnt < 0.5 * median(cnt) then critical.
SELECT COUNT(*) AS cnt FROM raw.events WHERE ds = '{{batch_date}}';
(3) timeliness coverage — ensure expected time windows are present (no large gaps in event timestamps) and that watermark advancing logic behaves as intended. Check per-source hour/window coverage and surface longest gap and percent-covered-hours.
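A sketch of per-source hour coverage and longest-gap reporting with pandas (the event_ts column and 24-hour window are assumptions):
import pandas as pd

def hour_coverage(events: pd.DataFrame, ts_col: str = "event_ts",
                  window_hours: int = 24) -> dict:
    """Share of expected hourly buckets that contain events, plus the longest gap."""
    end = events[ts_col].max().ceil("h")
    expected = pd.date_range(end - pd.Timedelta(hours=window_hours), end, freq="h")
    observed = pd.DatetimeIndex(events[ts_col].dt.floor("h").unique())
    gaps = events[ts_col].sort_values().diff().dropna()
    return {
        "percent_covered_hours": 1 - len(expected.difference(observed)) / len(expected),
        "longest_gap_hours": gaps.max().total_seconds() / 3600 if len(gaps) else None,
    }

ts = pd.to_datetime(["2024-01-01 00:05", "2024-01-01 01:10", "2024-01-01 07:45"])
print(hour_coverage(pd.DataFrame({"event_ts": ts})))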
Automate responses: a warning triggers auto-reprofiling, notifies on-call, and enqueues retries; a critical alert pauses downstream consumers, includes sample rows and metadata in the alert, and triggers an RCA workflow or rollback. Persist metrics (lag, cnt, missing_partitions) to a monitoring store so thresholds can be tuned from historical false-positive rates and incident impact.