TL;DR
AURA enables VideoLLMs to process continuous video streams and respond in real time, using dual sliding windows and a Silent-Speech Balanced Loss. It outperforms Gemini-1.5-Pro by 6.0% and achieves 312 ms end-to-end latency.
Verified: 0 · Insufficient evidence: 0 · Unverifiable: 0 · Reproducibility: N/A (confidence: 0%)

Core Question

How can Video Large Language Models be enabled to continuously observe live video streams and support real-time question answering with selective silence, timely responses, and long-horizon context management?

Core Method

Approach: AURA implements Interactive Video Stream Context Management using dual sliding windows over video (N = 30 seconds) and QA interactions (M = 10 groups). A Coarse-to-Fine Data Engine generates training data for three QA types through five stages, including synthesis, refinement, and quality verification. A Silent-Speech Balanced Loss down-weights silent tokens to address imbalance, while the inference framework leverages KV-cache reuse and streaming optimizations.

Key components:
- The system integrates AURA with ASR and TTS modules operating asynchronously for continuous streaming perception.
- Video chunks are combined with transcribed user speech and inserted into the context as user messages.
- An improved truncation strategy allows the video window to extend within a margin (N') to enable KV prefix cache reuse.
- Context truncation removes the N' oldest video chunks at once when the window reaches N + N', reducing how often the cache must be recomputed.
- Multiple optimizations, including streaming output and multiprocess resource isolation, support practical real-time deployment.
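The dual-window truncation described above can be illustrated with a minimal sketch. It is not AURA's implementation: the class name `StreamContext`, its fields, the margin value N' = 10, and the assumption that one video chunk covers one second are all hypothetical choices made for illustration. Only N = 30, M = 10, and the rule "drop the N' oldest chunks once the window reaches N + N'" come from the summary.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StreamContext:
    """Hypothetical dual sliding-window context (illustrative names only)."""
    n: int = 30        # base video window N, in chunks (assumes 1 chunk = 1 second)
    margin: int = 10   # margin N'; the value 10 is an assumption
    m: int = 10        # QA window M, in question/answer groups
    video: list = field(default_factory=list)
    qa: deque = field(default_factory=deque)
    truncations: int = 0   # how many times the video prefix changed

    def __post_init__(self):
        # A bounded deque silently drops the oldest QA group past M entries.
        self.qa = deque(self.qa, maxlen=self.m)

    def add_chunk(self, chunk):
        """Append a video chunk; truncate in one batch at N + N'."""
        self.video.append(chunk)
        if len(self.video) >= self.n + self.margin:
            # Remove the N' oldest chunks at once, so the remaining prefix
            # stays stable (and cacheable) until the margin fills again.
            del self.video[:self.margin]
            self.truncations += 1

    def add_qa(self, question, answer):
        """Record one QA group in the interaction window."""
        self.qa.append((question, answer))
```

Under these assumptions, feeding 100 chunks changes the context prefix 7 times, whereas a window that slides by one chunk at every step would change it on each of the 70 chunks after the 30th. Fewer prefix changes mean fewer occasions on which a KV prefix cache would need to be recomputed, which is the motivation the summary gives for the margin.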

Claim Verification

Reproducibility Assessment

Low reproducibility (0%)

Missing Reproduction Details

Limitations (Author-Stated)

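This analysis was generated automatically by PDF Reading Assistant and is for reference only; it does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.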

Analysis time: 2026-04-26T13:36:00+00:00 · Data source: Paper Collector