Multimodal Learning Relevance: 9/10

Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model

Athos Georgiou
arXiv: 2603.28554v1 Published: 2026-03-30 Updated: 2026-03-30

AI Summary

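Hydra unifies document retrieval and generation in a single vision-language model, reducing system complexity and cutting peak GPU memory by 41% compared with running separate models.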

Key Contributions

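  • Proposes Hydra, a dual-head architecture that unifies retrieval and generation in one model
  • Switches retrieval on and off with a single LoRA adapter trained only for retrieval; disabling it recovers the base model's generation quality
  • Shows, in a proof-of-concept extension to Qwen2.5-Omni-3B, that the mechanism generalizes to audio retrieval and video embedding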

Methodology

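Hydra uses a dual-head architecture: one head performs ColBERT-style late-interaction retrieval and the other performs autoregressive generation. A LoRA adapter toggled at inference switches the retrieval function on or off.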
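To make "ColBERT-style late interaction" concrete, here is a minimal sketch of the standard MaxSim scoring rule in plain Python. It is an illustration only, not Hydra's code: the function names (`dot`, `maxsim_score`, `rank_pages`) and the toy 2-dimensional vectors are assumptions made for this example, and a real system would use normalized, higher-dimensional embeddings produced by the model.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: for each query vector, take the best-matching
    document vector (max dot product), then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_pages(query_vecs: Sequence[Vector], pages: Dict[str, List[Vector]]) -> List[str]:
    """Return page names sorted by descending MaxSim score (hypothetical helper)."""
    return sorted(pages, key=lambda name: maxsim_score(query_vecs, pages[name]), reverse=True)


# Toy multi-vector embeddings: two query tokens, two document pages.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.0, 1.0]],  # matches both query vectors
    "page_b": [[1.0, 0.0], [1.0, 0.0]],  # matches only the first query vector
}

ranking = rank_pages(query, pages)
```

Because every query vector is matched against every document vector at scoring time, each document is stored as a set of vectors rather than one pooled vector; this is the multi-vector output that the abstract attributes to the adapter-enabled mode.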

Original Abstract

Visual document understanding typically requires separate retrieval and generation models, doubling memory and system complexity. We present Hydra, a dual-head approach that provides both ColBERT-style late-interaction retrieval and autoregressive generation from a single vision-language model (VLM). A single LoRA adapter, trained only for retrieval, is toggled at inference: enabling it produces multi-vector embeddings; disabling it recovers the base model's generation quality -- byte-identical outputs in 100% of 10,500 greedy and stochastic samples, with max delta-ANLS = 0.0044 across 15,301 samples on four VQA benchmarks (three informative; ChartQA is near-zero for both models under greedy decoding) when compared against an independent base-model pipeline. We identify three engineering requirements (attention-mode restoration, lm_head preservation, KV-cache-aware decoding) whose omission silently breaks generation despite correct weight recovery. On ViDoRe V1, Hydra (4B) is within 1 percentage point of a controlled single-head baseline in a single training run, with higher aggregate scores on V2 and V3 that are concentrated on a subset of tasks; multi-seed experiments are needed to confirm these trends. The single-model design reduces peak GPU memory by 41%, though adapter switching introduces throughput overhead under concurrent serving loads. An ablation shows that GritLM-style joint training provides no benefit within the LoRA-based (r=16) training regime. A proof-of-concept extension to Qwen2.5-Omni-3B demonstrates that the mechanism generalizes to audio retrieval and video embedding, with speech generation.
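The adapter toggle described in the abstract can be illustrated with a toy LoRA layer. This sketch assumes the usual LoRA formulation, where the output is the frozen base projection plus a scaled low-rank update, and it assumes the adapter is kept separate from the base weights rather than merged into them. The class name `ToyLoRALinear`, the `adapter_enabled` flag, and all numeric values are hypothetical, and the rank is 1 here for readability (the abstract reports r=16).

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply matrix m (rows x cols) by vector v (length cols)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class ToyLoRALinear:
    """Linear layer with an unmerged LoRA update that can be switched on or off."""

    def __init__(self, weight: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float):
        self.weight = weight              # frozen base weight, shape (out, in)
        self.lora_a = lora_a              # down-projection, shape (r, in)
        self.lora_b = lora_b              # up-projection, shape (out, r)
        self.scale = alpha / len(lora_a)  # standard alpha / r scaling
        self.adapter_enabled = False

    def forward(self, x: List[float]) -> List[float]:
        base = matvec(self.weight, x)
        if not self.adapter_enabled:
            # Base weights are never modified, so this path is the base model.
            return base
        delta = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scale * d for b, d in zip(base, delta)]


layer = ToyLoRALinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_a=[[1.0, 1.0]],
    lora_b=[[0.5], [-0.5]],
    alpha=2.0,
)
x = [1.0, 2.0]

generation_out = layer.forward(x)   # adapter off: base projection only
layer.adapter_enabled = True
retrieval_out = layer.forward(x)    # adapter on: base plus low-rank update
layer.adapter_enabled = False
restored_out = layer.forward(x)     # adapter off again
```

In this toy setting, turning the adapter off reproduces the base output exactly because the base weights are untouched. The abstract notes that in the full model weight recovery alone is not sufficient: attention-mode restoration, lm_head preservation, and KV-cache-aware decoding are also required, and none of those are modeled here.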

Tags

Vision-language models, document retrieval, generation, LoRA, multimodal

arXiv Categories

cs.CV cs.AI cs.IR