Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance
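AI Summary
Proposes the DiSCo and Table-GLS frameworks, which decouple table structure from content to improve the efficiency and generalization of LVLMs on table reasoning.
Main Contributions
- Proposes the DiSCo framework, which decouples structure from content.
- Proposes the Table-GLS framework, which performs structure-guided reasoning.
- Shows experimentally that the framework improves LVLMs' table understanding and reasoning capabilities.
Methodology
DiSCo separates structural abstraction from semantic grounding, while Table-GLS performs structured exploration and evidence-grounded inference, using global-to-local structural guidance to steer the LVLM through table reasoning.
Original Abstract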
Reasoning over table images remains challenging for Large Vision-Language Models (LVLMs) due to complex layouts and tightly coupled structure-content information. Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability. This work addresses a key question: how to adapt LVLMs to table reasoning with minimal annotation and no external tools? Specifically, we first introduce DiSCo, a Disentangled Structure-Content alignment framework that explicitly separates structural abstraction from semantic grounding during multimodal alignment, efficiently adapting LVLMs to table structures. Building on DiSCo, we further present Table-GLS, a Global-to-Local Structure-guided reasoning framework that performs table reasoning via structured exploration and evidence-grounded inference. Extensive experiments across diverse benchmarks demonstrate that our framework efficiently enhances LVLMs' table understanding and reasoning capabilities, particularly generalizing to unseen table structures.