AI Agents Relevance: 9/10

OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams

Yibin Yan, Jilan Xu, Shangzhe Di, Haoning Wu, Weidi Xie
arXiv: 2603.12265v1 Published: 2026-03-12 Updated: 2026-03-12

AI Summary

OmniStream is a unified streaming visual backbone that perceives, reconstructs, and acts from diverse visual inputs.

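Main Contributions

  • Proposes OmniStream, a unified streaming visual backbone
  • Introduces causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE)
  • Shows that a single frozen backbone generalizes across image and video probing, streaming geometric reconstruction, video and spatial reasoning, and robotic manipulation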

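Methodology

Pre-trains with a synergistic multi-task framework that couples static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment across 29 datasets.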

Original Abstract

Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), our model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream using a synergistic multi-task framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment on 29 datasets. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream achieves consistently competitive performance with specialized experts across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, as well as robotic manipulation (unseen at training). Rather than pursuing benchmark-specific dominance, our work demonstrates the viability of training a single, versatile vision backbone that generalizes across semantic, spatial, and temporal reasoning, i.e., a more meaningful step toward general-purpose visual understanding for interactive and embodied agents.
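
The frame-by-frame processing with a persistent KV-cache mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `StreamingAttention`, `attend`, and `offline_block_causal` are hypothetical names, the layer is single-head with identity Q/K/V projections, 3D-RoPE is omitted, and letting tokens within a frame attend to each other is an assumption made here for simplicity. The sketch only shows that caching keys and values from past frames gives the same result as recomputing causal attention over the full history.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention of each query over all given keys/values."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out

class StreamingAttention:
    """Hypothetical single-head layer (identity projections) with a persistent KV-cache."""

    def __init__(self):
        self.keys = []    # cached keys from all frames seen so far
        self.values = []  # cached values from all frames seen so far

    def step(self, frame_tokens):
        # Append the new frame to the cache, then attend only from the new
        # frame's tokens; past frames are never recomputed.
        self.keys.extend(frame_tokens)
        self.values.extend(frame_tokens)
        return attend(frame_tokens, self.keys, self.values)

def offline_block_causal(frames):
    """Reference: for frame t, rebuild the context from frames 0..t from scratch."""
    outputs = []
    for t in range(len(frames)):
        context = [tok for frame in frames[: t + 1] for tok in frame]
        outputs.append(attend(frames[t], context, context))
    return outputs

# Three toy "frames", each with two 2-dimensional tokens (made-up values).
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
layer = StreamingAttention()
streamed = [layer.step(frame) for frame in frames]
```

In this toy setting the per-step cost grows with the cache length rather than with the square of the full sequence length, which is the usual motivation for caching in online inference.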

Tags

Vision, Streaming, Multi-Task Learning, AI Agent

arXiv Categories

cs.CV