LLM Reasoning Relevance: 9/10

Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance

Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Youcheng Pan, Xiaoqiang Zhou, Min Zhang
arXiv: 2602.03491v1 Published: 2026-02-03 Updated: 2026-02-03

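AI Summary

Proposes the DiSCo and Table-GLS frameworks, which decouple table structure from content to improve the efficiency and generalization of LVLMs on table reasoning.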

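Main Contributions

  • Proposes DiSCo, an alignment framework that decouples table structure from content.
  • Proposes Table-GLS, a framework for structure-guided reasoning.
  • Experiments show that the framework effectively improves LVLMs' table understanding and reasoning capabilities.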

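Methodology

DiSCo separates structural abstraction from semantic grounding during alignment. Table-GLS performs structured exploration and evidence-grounded inference, using global-to-local structural guidance to steer the LVLM through table reasoning.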

Original Abstract

Reasoning over table images remains challenging for Large Vision-Language Models (LVLMs) due to complex layouts and tightly coupled structure-content information. Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability. This work addresses a key question: how to adapt LVLMs to table reasoning with minimal annotation and no external tools? Specifically, we first introduce DiSCo, a Disentangled Structure-Content alignment framework that explicitly separates structural abstraction from semantic grounding during multimodal alignment, efficiently adapting LVLMs to table structures. Building on DiSCo, we further present Table-GLS, a Global-to-Local Structure-guided reasoning framework that performs table reasoning via structured exploration and evidence-grounded inference. Extensive experiments across diverse benchmarks demonstrate that our framework efficiently enhances LVLMs' table understanding and reasoning capabilities, particularly generalizing to unseen table structures.

Tags

LVLM table-reasoning decoupling structure-guidance multimodal

arXiv Categories

cs.CV cs.CL