Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error Detection in Tables
- Qixu Chun
- Yeye He
- Weiwei Cui
- Song Ge
- Haidong Zhang
- Dongmei Zhang
- Surajit Chaudhuri
SIGMOD 2025
Data cleaning is a long-standing challenge in data management. While powerful logical and statistical algorithms have been developed to detect and repair data errors in tables, existing algorithms predominantly rely on domain experts to first manually specify data-quality constraints specific to a given table before data-cleaning algorithms can be applied.
In this work, we propose a new class of data-quality constraints that we call Semantic-Domain Constraints, which can be reliably inferred and automatically applied to any table, without requiring domain experts to manually specify them on a per-table basis.
We develop a principled framework to systematically learn such constraints from table corpora using large-scale statistical tests, which can further be distilled into a core set of constraints using our optimization framework, with provable quality guarantees. Extensive evaluations show that this new class of constraints can be used to both (1) directly detect errors on real tables in the wild, and (2) augment existing expert-driven data-cleaning techniques as a new class of complementary constraints.
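As an illustration only, the following minimal Python sketch shows the general shape such a constraint could take: if a sufficiently large fraction of a column's values belong to a semantic domain, the remaining values are flagged as likely errors. The `DomainConstraint` class, the ISO-date regex used as a domain detector, and the 0.8 threshold are all hypothetical choices made for this example; they are not taken from the paper, where constraints are learned from table corpora through statistical tests rather than set by hand.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DomainConstraint:
    """Hypothetical constraint: 'if most values in a column belong to a
    semantic domain, the values that do not belong are likely errors'."""
    name: str
    in_domain: Callable[[str], bool]  # assumed domain-membership test
    min_match_ratio: float            # assumed threshold, hand-picked here

    def detect_errors(self, column: List[str]) -> List[str]:
        if not column:
            return []
        matches = [self.in_domain(v) for v in column]
        ratio = sum(matches) / len(column)
        # Precondition: the constraint only fires when the column clearly
        # belongs to the domain; otherwise nothing is flagged.
        if ratio < self.min_match_ratio:
            return []
        return [v for v, ok in zip(column, matches) if not ok]


# Illustrative domain detector: ISO-style dates such as 2024-01-05.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

date_constraint = DomainConstraint(
    name="iso_date",
    in_domain=lambda v: ISO_DATE.fullmatch(v) is not None,
    min_match_ratio=0.8,
)

column = [
    "2024-01-05", "2024-02-11", "2024-03-19",
    "2024-04-02", "2024-05-27", "n/a?",
]
flagged = date_constraint.detect_errors(column)
print(flagged)
```

Because the constraint is conditioned on the column's own contents rather than on a table-specific rule, the same object can be applied to any column; a column that does not look like the domain simply produces no flags.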
Our extensively labeled benchmark dataset with 2400 real data columns, as well as our code, are made available to facilitate future research.