A Sobering Look at Tabular Data Generation via Probabilistic Circuits
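AI Summary
The paper criticizes the tabular data generation field's over-reliance on diffusion models and puts forward an alternative based on probabilistic circuits.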
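Main Contributions
- Identifies the limitations of current evaluation protocols for tabular data generation
- Revisits deep probabilistic circuits (PCs) as a simple baseline for tabular data generation
- Shows that PCs deliver competitive or superior performance to SotA models on tabular data generation, at a fraction of the cost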
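Methodology
The paper builds hierarchical mixture models in the form of deep probabilistic circuits (PCs) to generate tabular data, and evaluates the generated data with stricter metrics.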
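To make the idea of a mixture-based circuit concrete, the following is a minimal sketch of a shallow probabilistic circuit: one sum node (a mixture) over two product nodes, each factorising into a Gaussian leaf for a numeric feature and a categorical leaf for a discrete one. It is not the authors' implementation and does not use their code; the feature names (`age`, `job`), the parameters, and the helper functions are all hypothetical. A deep PC would stack many such sum and product layers, but the two properties illustrated here carry over: sampling is a single top-down pass, and marginals are computed exactly by letting a leaf output 1.

```python
import math
import random

# Hypothetical toy circuit: a sum node (mixture) over two product nodes.
# Each product node factorises into one Gaussian leaf (numeric feature "age")
# and one categorical leaf (feature "job"). All parameters are made up.
WEIGHTS = [0.6, 0.4]
COMPONENTS = [
    {"age": (30.0, 5.0), "job": {"student": 0.7, "engineer": 0.3}},
    {"age": (50.0, 8.0), "job": {"student": 0.1, "engineer": 0.9}},
]


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian leaf."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def likelihood(age=None, job=None):
    """Evaluate the circuit bottom-up.

    A feature left as None is marginalised out exactly by letting
    its leaf output 1, so no numerical integration is needed.
    """
    total = 0.0
    for weight, comp in zip(WEIGHTS, COMPONENTS):
        p_age = 1.0 if age is None else gaussian_pdf(age, *comp["age"])
        p_job = 1.0 if job is None else comp["job"].get(job, 0.0)
        total += weight * p_age * p_job  # sum node over product nodes
    return total


def sample(rng):
    """Ancestral sampling: pick a mixture component, then sample each leaf."""
    comp = rng.choices(COMPONENTS, weights=WEIGHTS, k=1)[0]
    age = rng.gauss(*comp["age"])
    jobs = list(comp["job"])
    job = rng.choices(jobs, weights=[comp["job"][j] for j in jobs], k=1)[0]
    return {"age": age, "job": job}


rng = random.Random(0)
rows = [sample(rng) for _ in range(5)]          # five synthetic table rows
p_engineer = likelihood(job="engineer")         # 0.6 * 0.3 + 0.4 * 0.9
p_joint = likelihood(age=45.0, job="engineer")  # joint density at one point
```

Because numeric and categorical features each get their own leaf type, heterogeneous columns need no special encoding, which is the sense in which such models handle tabular data natively.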
Original Abstract
Tabular data is more challenging to generate than text and images, due to its heterogeneous features and much lower sample sizes. On this task, diffusion-based models are the current state-of-the-art (SotA) model class, achieving almost perfect performance on commonly used benchmarks. In this paper, we question the perception of progress for tabular data generation. First, we highlight the limitations of current protocols to evaluate the fidelity of generated data, and advocate for alternative ones. Next, we revisit a simple baseline -- hierarchical mixture models in the form of deep probabilistic circuits (PCs) -- which delivers competitive or superior performance to SotA models for a fraction of the cost. PCs are the generative counterpart of decision forests, and as such can natively handle heterogeneous data as well as deliver tractable probabilistic generation and inference. Finally, in a rigorous empirical analysis we show that the apparent saturation of progress for SotA models is largely due to the use of inadequate metrics. As such, we highlight that there is still much to be done to generate realistic tabular data. Code available at https://github.com/april-tools/tabpc.