AURA: Always-On Understanding and Real-Time Assistance via Video Streams - AI Paper In-Depth Analysis

TL;DR
AURA enables VideoLLMs to process continuous video streams with real-time responses through dual sliding windows and silent-speech balanced loss. It outperforms Gemini-1.5-Pro by 6.0% and achieves 312ms end-to-end latency.

Core Question

How can Video Large Language Models be enabled to continuously observe live video streams and support real-time question answering with selective silence, timely responses, and long-horizon context management?

Core Method

AURA implements Interactive Video Stream Context Management using dual sliding windows over video (N = 30 seconds) and QA interactions (M = 10 groups). A Coarse-to-Fine Data Engine generates training data for three QA types through five stages including synthesis, refinement, and quality verification. Silent-Speech Balanced Loss down-weights silent tokens to address imbalance, while the inference framework leverages KV-cache reuse and streaming optimizations.

Key components

The system integrates AURA with ASR and TTS modules operating asynchronously for continuous streaming perception.
Video chunks are combined with transcribed user speech and inserted into context as user messages.
An improved truncation strategy allows the video window to extend within a margin (N') to enable KV prefix cache reuse.
Context truncation removes N' oldest video chunks at once when the window reaches N+N', reducing cache recomputation frequency.
Multiple optimizations including streaming output and multiprocess resource isolation support practical real-time deployment.
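The margin-based truncation described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `StreamContext`, the margin value of 10, and the assumption of one video chunk per second are all hypothetical, chosen only to make the mechanism concrete. The sketch tracks how often the oldest part of the video window is dropped, since each drop changes the context prefix and would presumably force KV-cache recomputation.

```python
from collections import deque


class StreamContext:
    """Hypothetical dual sliding window over video chunks and QA groups."""

    def __init__(self, n_video=30, margin=10, m_qa=10):
        self.n_video = n_video        # N: video window size (assumes 1 chunk per second)
        self.margin = margin          # N': truncation margin (assumed value, not from the source)
        self.video = []               # video chunks currently in context, oldest first
        self.qa = deque(maxlen=m_qa)  # M: most recent QA groups; older ones fall off
        self.truncations = 0          # number of times the context prefix changed

    def add_video_chunk(self, chunk):
        self.video.append(chunk)
        # Let the window grow past N, and only when it reaches N + N'
        # drop the N' oldest chunks in a single step.
        if len(self.video) >= self.n_video + self.margin:
            del self.video[: self.margin]
            self.truncations += 1

    def add_qa_group(self, question, answer):
        # deque(maxlen=M) evicts the oldest QA group automatically.
        self.qa.append((question, answer))


ctx = StreamContext()
for second in range(100):
    ctx.add_video_chunk(second)
for i in range(12):
    ctx.add_qa_group(f"q{i}", f"a{i}")
```

Under these assumed values, feeding 100 one-second chunks leaves the 30 most recent chunks in context after 7 batch truncations, whereas trimming one chunk on every step past N would change the prefix 70 times.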

Reproducibility Assessment

Low reproducibility (0%)

Missing Reproduction Details

No actual code available despite statement claiming code is released
No training data available or described in detail
Missing training hyperparameters: learning rate, batch size, number of epochs, optimizer settings, weight decay, learning rate schedule
Missing random seeds for reproducibility
No specific ASR and TTS model implementations specified
Video chunk size parameter not specified
Value of the truncation margin N' not provided (N = 30 seconds and M = 10 groups are stated)
Training data preprocessing steps not detailed
Training duration and convergence criteria not specified
Loss functions and fine-tuning methodology not described

Limitations (as stated by the authors)

Most existing VideoLLMs are still designed for offline settings. In this paradigm, a complete video or a pre-collected segment is first buffered and then analyzed. Although this setup is well-suited for post hoc analysis, it limits the system's ability to respond promptly to ongoing events.
Decoupled architectures rely on two separately deployed models, where a trigger model determines whether the primary VideoLLM should respond. Because the trigger model does not share the same contextual state with the primary model and is typically much smaller in scale, the triggering accuracy and its consistency with response generation may be limited, leading to unstable system behavior.
Unified architectures offer a higher performance upper bound, but these works are generally limited to captioning-style narration tasks and remain less effective for complex open-ended video question answering.
Although recent full-duplex models integrate question answering into a single framework, they still lack sufficient robustness for long-duration streaming and may suffer from memory overflow or performance degradation during extended inference.
Although MiniCPM-o-4.5 supports a full-duplex multimodal live-streaming mode, we find that it often becomes silent in this setting and may produce irrelevant responses for video streams longer than two minutes.

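This analysis was generated automatically by the PDF reading assistant and is for reference only; it does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, please refer to arXiv.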

Analysis time: 2026-04-26T13:36:00+00:00 · Data source: Paper Collector