OneVL enables efficient one-step latent reasoning for autonomous driving via dual-modal auxiliary decoders and prefill inference. On NAVSIM it outperforms explicit CoT (88.84 vs 88.29 PDM-score) while matching answer-only latency, with reported speedups of 1.5-2.3× over explicit CoT.
Core Question
How can Vision-Language-Action models for autonomous driving achieve efficient latent chain-of-thought reasoning that matches or exceeds explicit CoT performance while eliminating inference latency overhead?
Core Method
Approach: OneVL introduces dual-modal auxiliary decoders (a language decoder that reconstructs CoT reasoning and a visual decoder that predicts future frames as world model supervision) alongside a prefill inference mechanism that enables single-pass latent token generation. A three-stage training pipeline progressively aligns the latent bottleneck with trajectory prediction objectives.
Key components:
- World models in autonomous driving serve three purposes: data generation, closed-loop evaluation, and representation learning.
- Cosmos integrates multimodal inputs to synthesize training data for robotic and autonomous driving systems.
- DICC and AD-R1 use world models as interactive simulators for adversarial evaluation and reinforcement learning.
- OneVL uses future visual token prediction as a training-only auxiliary to certify latent bottlenecks, then discards it at inference.
- OneVL augments a pretrained VLM with compact latent tokens and dual auxiliary decoders.
- The architecture enables multimodal explanation through language and visual reconstruction.
- The design supports one-step latent reasoning with vision-language explanations.
- OneVL uses Qwen3-VL-4B-Instruct as its backbone VLM for processing interleaved image and text inputs.
- The model consists of a Vision Encoder (ViT), a Visual Projector (MLP Aligner), and a Large Language Model (LLM).
- All components are initialized from a pretrained checkpoint and remain fully trainable in Stages 0 and 2.
Section IDs: sec_8, sec_9, sec_10, sec_19, sec_37
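As a rough illustration of the prefill idea summarized above, the sketch below appends a fixed number of latent placeholder tokens to the prompt so that they are encoded in the same forward pass as the prompt instead of being generated one at a time. Only the counts C_t = 2 and C_v = 4 come from this report; the placeholder strings, the function names, the ordering of the latents, and the toy step-counting model are assumptions made for illustration and are not taken from the paper.

```python
from typing import List

# Latent token counts quoted in this report (C_t = 2, C_v = 4).
C_T = 2  # language latent tokens
C_V = 4  # visual latent tokens

# Hypothetical placeholder strings; the paper's actual special tokens are not given here.
LANG_LATENT = "<lat_t>"
VIS_LATENT = "<lat_v>"


def build_prefill_sequence(prompt_tokens: List[str]) -> List[str]:
    """Append latent placeholders so they are encoded together with the prompt.

    The ordering (visual latents before language latents) is an assumption.
    """
    return prompt_tokens + [VIS_LATENT] * C_V + [LANG_LATENT] * C_T


def sequential_decode_steps(n_cot_tokens: int, n_traj_tokens: int, prefill: bool) -> int:
    """Count autoregressive decode steps under a toy model.

    With prefill, the latent tokens cost no decode steps, so only the
    trajectory tokens are generated one by one. Without prefill, the explicit
    CoT tokens are decoded before the trajectory tokens.
    """
    return n_traj_tokens if prefill else n_cot_tokens + n_traj_tokens


if __name__ == "__main__":
    print(build_prefill_sequence(["<image>", "plan", "the", "trajectory"]))
    # Token counts below are made-up inputs, not measurements from the paper.
    print(sequential_decode_steps(n_cot_tokens=100, n_traj_tokens=20, prefill=False))
    print(sequential_decode_steps(n_cot_tokens=100, n_traj_tokens=20, prefill=True))
```

Under this toy model the saving comes entirely from skipping the CoT decode steps; the trajectory tokens are still decoded sequentially in both cases.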
Claim Verification
The dual-modal auxiliary decoders are fully specified in Sections 3.2-3.4 with detailed architecture descriptions, input construction formulas, and training objectives. Figure 3 provides a complete overview. The framework is empirically validated acr
The prefill inference mechanism is clearly described in Section 3.5 (p_44-p_46). The paper explains how latent tokens are prefilled into the prompt context and provides latency measurements demonstrating the efficiency gains across all four benchmark
The three-stage training pipeline is described in detail in Section 4 with clear motivation for each stage. Crucially, ablation studies in Section 5.5 (Table 7) demonstrate that skipping this training causes catastrophic failure (PDM-Score drops from
The language auxiliary decoder is fully specified in Section 3.2 with input construction (Eq. 1-2) and training objective (Eq. 3). It is quantitatively evaluated in Section 5.4 with Meta Action Accuracy, STS Score, and LLM-as-Judge metrics showing it
The visual auxiliary decoder is fully specified in Section 3.3-3.4 with motivation, input construction (Eq. 4), and training objective (Eq. 5). Qualitative results in Figure 8 show it generates spatially coherent future frames. Ablation studies confi
The paper mentions the MLP head variant in p_14 and p_118, claiming 0.24s latency (4.16 Hz, 5.4% of AR model latency). However, there is no dedicated table or figure presenting these results systematically. The claim appears in prose without the same
Table 1 (NAVSIM) shows OneVL achieves 88.84 PDM-score vs AR CoT+Answer at 88.29. Tables 2-5 show similar patterns across ROADWork, Impromptu, and APR1. COCONUT, CODI, and SIM-CoT all underperform AR baselines. This directly supports the claim that On
The latency numbers are partially verifiable but the speedup factors have calculation issues. For NAVSIM: OneVL 4.46s vs AR CoT+Answer 6.71s = 1.50× (matches claim). For ROADWork: OneVL 4.71s vs AR CoT+Answer 10.83s = 2.30× (matches claim). However,
Table 1 (referenced in p_94) shows OneVL prefill latency at 4.46s and AR Answer at 4.49s on NAVSIM. The 0.03s difference is negligible, so the two latencies are essentially identical. However, no error bars or variance measures are provided, which would strengthen confidence i
Table 6 (referenced in p_108) explicitly shows Meta Action Accuracy: OneVL 71.00 vs SIM-CoT 67.20, a difference of 3.8 percentage points. This is a clear, quantitative result from a controlled comparison.
Table 7 (referenced in p_109-p_110) shows the ablation results: OneVL w/o visual decoder scores 87.97, full OneVL scores 88.84. The difference is exactly +0.87 PDM-score. This is a well-designed ablation study with clear numerical evidence.
Table 7 shows OneVL w/o language decoder scores 88.53, full OneVL scores 88.84. The difference is +0.31 PDM-score. This is a clear ablation result with precise numerical evidence.
Table 7 and p_112 provide clear numerical evidence: full OneVL achieves 88.84 PDM-score, while OneVL w/o staged training achieves only 67.13. The drop is exactly 21.71 points. This is a dramatic and well-documented ablation result.
The gradient norms are reported in p_113: 378.22 for direct approach vs 0.28 for three-stage strategy. These are specific numerical measurements from training dynamics. However, the paper doesn't show the full gradient norm curves over training, only
The final trajectory prediction losses are reported in p_113: 0.186 for end-to-end method vs 0.136 for three-stage approach. This provides quantitative evidence of optimization difficulty. However, the paper doesn't show the full loss curves, only fi
Tables 1-5 show COCONUT, CODI, and SIM-CoT all underperform AR Answer on NAVSIM, ROADWork, and Impromptu. The exception is APR1 where some latent methods are competitive. This is a consistent finding across multiple benchmarks. However, the 'except f
Table 1 shows AR CoT+Answer at 88.29 vs AR Answer at 87.47 on NAVSIM. The difference is exactly +0.82 PDM-score (paper rounds to +0.80). This is clear numerical evidence from a controlled comparison.
Table 1 shows OneVL at 88.84 vs AR CoT+Answer at 88.29 on NAVSIM, a difference of +0.55 PDM-score. The claim about 'fewer tokens' is supported by the latent token design (C_t=2 language latent tokens, C_v=4 visual latent tokens vs full CoT text seque
This is directly supported by the ablation results in p_110 and p_123. Visual decoder contributes +0.87 (88.84 - 87.97), language decoder contributes +0.31 (88.84 - 88.53). The asymmetry is clearly documented.
Table 7 shows OneVL w/o visual decoder at 87.97 vs full OneVL at 88.84. The -0.87 drop is precisely calculated. This is redundant with claim_11 but still well-supported.
... 41 claims in total
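The arithmetic behind several of the notes above can be rechecked with a few lines of Python. This is only a sanity check on the figures as quoted in this report (it does not read the paper's tables), and the dictionary layout and helper names are illustrative.

```python
# PDM-scores on NAVSIM as quoted in the verification notes above.
PDM = {
    "OneVL": 88.84,
    "AR CoT+Answer": 88.29,
    "AR Answer": 87.47,
    "OneVL w/o visual decoder": 87.97,
    "OneVL w/o language decoder": 88.53,
    "OneVL w/o staged training": 67.13,
}

# Latencies in seconds as quoted in the verification notes above.
LATENCY_S = {
    "NAVSIM": {"OneVL": 4.46, "AR CoT+Answer": 6.71},
    "ROADWork": {"OneVL": 4.71, "AR CoT+Answer": 10.83},
}


def pdm_delta(a: str, b: str) -> float:
    """PDM-score of configuration a minus configuration b, rounded to 2 decimals."""
    return round(PDM[a] - PDM[b], 2)


def speedup_over_cot(benchmark: str) -> float:
    """Latency of AR CoT+Answer divided by OneVL latency, rounded to 2 decimals."""
    lat = LATENCY_S[benchmark]
    return round(lat["AR CoT+Answer"] / lat["OneVL"], 2)


if __name__ == "__main__":
    print("OneVL vs CoT+Answer:", pdm_delta("OneVL", "AR CoT+Answer"))
    print("CoT+Answer vs Answer:", pdm_delta("AR CoT+Answer", "AR Answer"))
    print("visual decoder:", pdm_delta("OneVL", "OneVL w/o visual decoder"))
    print("language decoder:", pdm_delta("OneVL", "OneVL w/o language decoder"))
    print("staged training:", pdm_delta("OneVL", "OneVL w/o staged training"))
    for name in LATENCY_S:
        print(name, "speedup:", speedup_over_cot(name))
```

The results agree with the deltas stated in the notes (+0.55, +0.82, +0.87, +0.31, and 21.71) and with the 1.50× and 2.30× speedups.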
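Reproducibility Assessment
Low reproducibility (0%)
Missing reproduction details
- Full code implementation unavailable (no public code repository found)
- Training dataset unavailable
- Missing hyperparameters: learning rate, batch size, number of training epochs, optimizer type and configuration
- Random seed not specified
- Hardware specifications not stated (GPU model, count, memory, etc.)
- Incomplete training-stage details: the specific training configurations for Stage 0 and Stage 2 are not described in detail
- Specific architectural details of the latent token interface (dimensions, count, positional encoding, etc.)
- Detailed architecture and implementation details of the auxiliary decoders
- Data preprocessing steps not described
- Train/validation/test split not described
Limitations (as stated by the authors)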
- The current system requires roughly 3× memory during training, since three full 4B model instances must be held in memory. This is mitigated by DeepSpeed ZeRO-2 but still imposes nontrivial infrastructure requirements.
- The latent token count was chosen empirically; a systematic study of the trade-off between latent token count and representation capacity is left for future work.
- While OneVL's prefill mechanism eliminates latent CoT overhead, the trajectory tokens themselves are still generated autoregressively.
- Extending the world model decoder to multi-camera inputs would enable 360-degree future-scene prediction and more comprehensive causal scene understanding, further strengthening the compression targets available to the visual latents.
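To give a sense of scale for the first limitation, the sketch below estimates per-GPU model-state memory for three 4B-parameter instances using the standard ZeRO accounting for mixed-precision Adam (2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, 12 for fp32 optimizer states). The precision, the optimizer, the assumption that all three instances are trainable, the 8-GPU count, and treating "4B" as exactly 4e9 parameters are all assumptions; the report states none of them. Activation memory is ignored.

```python
GB = 1e9  # decimal gigabytes


def model_state_bytes_per_gpu(n_params: int, n_gpus: int, zero_stage: int) -> float:
    """Per-GPU model-state memory for mixed-precision Adam, excluding activations.

    Stage 0 replicates everything, stage 1 shards optimizer states, and
    stage 2 shards optimizer states and gradients across n_gpus.
    """
    weights = 2 * n_params   # 16-bit weights
    grads = 2 * n_params     # 16-bit gradients
    optim = 12 * n_params    # fp32 master weights, momentum, variance
    if zero_stage == 0:
        return float(weights + grads + optim)
    if zero_stage == 1:
        return weights + grads + optim / n_gpus
    if zero_stage == 2:
        return weights + (grads + optim) / n_gpus
    raise ValueError("this sketch only covers ZeRO stages 0, 1, and 2")


if __name__ == "__main__":
    one_instance = 4_000_000_000       # assumed parameter count for one "4B" model
    three_instances = 3 * one_instance
    n_gpus = 8                         # hypothetical cluster size
    for label, n in [("1 instance", one_instance), ("3 instances", three_instances)]:
        for stage in (0, 2):
            gb = model_state_bytes_per_gpu(n, n_gpus, stage) / GB
            print(f"{label}, ZeRO-{stage}: {gb:.0f} GB per GPU")
```

Under these assumptions ZeRO-2 shrinks the gradient and optimizer terms with the number of GPUs, but the 16-bit weights of all three instances remain replicated on every device, which is consistent with the authors' remark that the requirement is mitigated rather than removed.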
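This analysis was generated automatically by PDF Reading Assistant. It is for reference only and does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-04-22T01:19:25+00:00 · Data source: Paper Collector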