2025.emnlp-industry.183@ACL


Quality Assessment of Tabular Data using Large Language Models and Code Generation

Authors: Ashlesha Akella, Akshar Kaul, Krishnasuri Narayanam, Sameep Mehta

Reliable data quality is crucial for downstream analysis of tabular datasets, yet rule-based validation often suffers from inefficiency, reliance on human intervention, and high computational cost. We present a three-stage framework that combines statistical inlier detection with LLM-driven rule and code generation. After filtering data samples through traditional clustering, we iteratively prompt LLMs to produce semantically valid quality rules, then use code-generating LLMs to synthesize executable validators for those rules. To generate reliable quality rules, we support the LLMs with retrieval-augmented generation (RAG), drawing on external knowledge sources and domain-specific few-shot examples. Robust guardrails ensure the accuracy and consistency of both rules and code snippets. Extensive evaluations on benchmark datasets confirm the effectiveness of our approach.
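To make the shape of the three stages concrete, here is a minimal Python sketch. It is not the authors' implementation, and every name in it (`filter_inliers`, `RULE`, `validate_age`, `passes_guardrail`) is hypothetical. It assumes a single numeric column, substitutes a median/MAD modified z-score for the clustering step mentioned in the abstract, and hand-writes the rule and validator that an LLM would be prompted to produce. The guardrail shown (a validator must accept nearly all inlier samples) is likewise an assumed example of a consistency check.

```python
from statistics import median


def filter_inliers(values, threshold=3.5):
    """Stage 1 (stand-in): keep values whose modified z-score is within threshold.

    Assumption: the abstract mentions clustering; a median/MAD score is
    swapped in here only to keep the sketch dependency-free.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: only values equal to it count as inliers.
        return [v for v in values if v == med]
    return [v for v in values if 0.6745 * abs(v - med) / mad <= threshold]


# Stage 2 (stand-in): a quality rule of the kind an LLM might be prompted
# to produce from inlier samples. Hand-written here; hypothetical format.
RULE = {
    "column": "age",
    "description": "age must be an integer between 0 and 120",
}


def validate_age(value):
    """Stage 3 (stand-in): a validator a code-generating LLM might emit for RULE."""
    return isinstance(value, int) and not isinstance(value, bool) and 0 <= value <= 120


def passes_guardrail(validator, inlier_samples, min_pass_rate=0.95):
    """Assumed guardrail: reject a validator that fails too many inlier samples."""
    if not inlier_samples:
        return False
    passed = sum(1 for v in inlier_samples if validator(v))
    return passed / len(inlier_samples) >= min_pass_rate


ages = [23, 31, 45, 38, 29, 52, 41, 36, 27, 480]
inliers = filter_inliers(ages)
validator_ok = passes_guardrail(validate_age, inliers)
flagged = [v for v in ages if not validate_age(v)] if validator_ok else []
```

In this toy run the value 480 is excluded from the inlier sample, the validator passes the guardrail on the remaining nine values, and applying it to the full column flags only 480. In a real pipeline the rule text and validator body would come from model output, which is why a check against trusted samples before execution is a plausible guardrail.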

Subject: EMNLP.2025 - Industry Track