OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation - AI Paper In-Depth Analysis

TL;DR
OneVL enables efficient one-step latent reasoning for autonomous driving via dual-modal auxiliary decoders and prefill inference. It outperforms explicit CoT on NAVSIM (88.84 vs 88.29 PDM-score) while matching answer-only latency and running 1.5-2.3× faster than explicit CoT.

Confirmed

Insufficient evidence

Unverifiable

N/A

Reproducibility

Confidence

86%

Core Question

How can Vision-Language-Action models for autonomous driving achieve efficient latent chain-of-thought reasoning that matches or exceeds explicit CoT performance while eliminating inference latency overhead?

Core Method

Approach: OneVL introduces dual-modal auxiliary decoders (a language decoder that reconstructs CoT reasoning and a visual decoder that predicts future frames as world model supervision) alongside a prefill inference mechanism that enables single-pass latent token generation. A three-stage training pipeline progressively aligns the latent bottleneck with trajectory prediction objectives.

Key components:
World models in autonomous driving serve three purposes: data generation, closed-loop evaluation, and representation learning.
Cosmos integrates multimodal inputs to synthesize training data for robotic and autonomous driving systems.
DICC and AD-R1 use world models as interactive simulators for adversarial evaluation and reinforcement learning.
OneVL uses future visual token prediction as training-only auxiliary to certify latent bottlenecks, then discards it at inference.
OneVL augments a pretrained VLM with compact latent tokens and dual auxiliary decoders.
The architecture enables multimodal explanation through language and visual reconstruction.
The design supports one-step latent reasoning with vision-language explanations.
OneVL uses Qwen3-VL-4B-Instruct as its backbone VLM for processing interleaved image and text inputs.
The model consists of Vision Encoder (ViT), Visual Projector (MLP Aligner), and Large Language Model (LLM).
All components are initialized from pretrained checkpoint and remain fully trainable in Stages 0 and 2.

Source sections: sec_8, sec_9, sec_10, sec_19, sec_37
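The prefill idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the placeholder strings, the ordering of visual before language latents, and both function names are hypothetical. The default counts (4 visual, 2 language latent tokens) are the C_v and C_t values quoted in the claim verification in this analysis.

```python
from typing import List

# Hypothetical placeholder strings; the real special tokens are not given in this analysis.
VIS_LATENT = "<vis_latent>"
LANG_LATENT = "<lang_latent>"


def build_prefill_prompt(prompt_tokens: List[str],
                         n_visual: int = 4,
                         n_language: int = 2) -> List[str]:
    """Append fixed latent placeholders so they are consumed in the prefill pass.

    The visual-then-language ordering is an assumption made for illustration.
    """
    if n_visual < 0 or n_language < 0:
        raise ValueError("latent token counts must be non-negative")
    return prompt_tokens + [VIS_LATENT] * n_visual + [LANG_LATENT] * n_language


def decode_steps(n_latent: int, n_answer: int, prefill: bool) -> int:
    """Count sequential decoding steps under a simplified cost model.

    With prefill, latent positions are part of the prompt and cost no extra
    decoding steps; without it, each latent token is one sequential step.
    """
    return n_answer if prefill else n_latent + n_answer


prompt = build_prefill_prompt(["<image>", "Plan", "the", "trajectory"])
steps_prefill = decode_steps(n_latent=6, n_answer=10, prefill=True)
steps_sequential = decode_steps(n_latent=6, n_answer=10, prefill=False)
```

Under this simplified model, the only sequential cost left is the answer (trajectory) tokens, which is consistent with the reported latency matching answer-only prediction.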

Claim Verification

Confirmed (95%) We present OneVL, a framework that overcomes the limitations of prior latent CoT methods through two key innovations. First, we introduce dual-modal auxiliary decoders: a language auxiliary decoder that reconstructs human-readable CoT reasoning from compact language latent tokens, and a visual auxiliary decoder that predicts anticipated future frames from visual latent representations.
The dual-modal auxiliary decoders are fully specified in Sections 3.2-3.4 with detailed architecture descriptions, input construction formulas, and training objectives. Figure 3 provides a complete overview. The framework is empirically validated acr

Confirmed (95%) Second, we design a prefill inference mechanism. At inference time, the latent tokens (both visual and language) are prefilled into the model's context as fixed prompt inputs, enabling single-pass generation of all latent tokens.
The prefill inference mechanism is clearly described in Section 3.5 (p_44-p_46). The paper explains how latent tokens are prefilled into the prompt context and provides latency measurements demonstrating the efficiency gains across all four benchmark

Confirmed (95%) A principled three-stage training pipeline progressively aligns the latent bottleneck with trajectory prediction, ensuring that the compressed representations capture causal structure rather than memorized patterns.
The three-stage training pipeline is described in detail in Section 4 with clear motivation for each stage. Crucially, ablation studies in Section 5.5 (Table 7) demonstrate that skipping this training causes catastrophic failure (PDM-Score drops from

Confirmed (90%) The language auxiliary decoder D_l aims to recover human-readable CoT reasoning text from the compact language latent hidden states.
The language auxiliary decoder is fully specified in Section 3.2 with input construction (Eq. 1-2) and training objective (Eq. 3). It is quantitatively evaluated in Section 5.4 with Meta Action Accuracy, STS Score, and LLM-as-Judge metrics showing it

Confirmed (90%) The visual auxiliary decoder D_v aims to predict anticipated future-frame visual tokens. This visual prediction objective serves as a world model auxiliary, supplementing language-only latent CoT.
The visual auxiliary decoder is fully specified in Section 3.3-3.4 with motivation, input construction (Eq. 4), and training objective (Eq. 5). Qualitative results in Figure 8 show it generates spatially coherent future frames. Ablation studies confi

Insufficient evidence (60%) For real-world deployment, appending an MLP head for producing trajectory further reduces latency to 0.24s (4.16 Hz), just 5.4% of the AR model's latency, offering a practical deployment option.
The paper mentions the MLP head variant in p_14 and p_118, claiming 0.24s latency (4.16 Hz, 5.4% of AR model latency). However, there is no dedicated table or figure presenting these results systematically. The claim appears in prose without the same
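The figures in this claim can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted in the claim; the variable names are illustrative. The baseline latency is derived from the stated 5.4% fraction and is an inference, not a value reported here.

```python
# Values quoted in the claim above.
mlp_latency_s = 0.24
claimed_fraction_of_ar = 0.054

# Update rate is the reciprocal of per-sample latency.
rate_hz = 1.0 / mlp_latency_s  # about 4.167, i.e. 4.16 when truncated to two decimals

# Baseline latency implied by "5.4% of the AR model's latency".
implied_ar_latency_s = mlp_latency_s / claimed_fraction_of_ar  # about 4.44 s

print(f"rate: {rate_hz:.3f} Hz, implied AR latency: {implied_ar_latency_s:.2f} s")
```

The implied baseline of roughly 4.4s is in line with the 4.46s to 4.49s NAVSIM latencies quoted elsewhere in this analysis, so the three numbers in the claim are mutually consistent even though no dedicated table backs them.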

Confirmed (90%) OneVL is the only latent CoT method that outperforms explicit autoregressive CoT, directly supporting our hypothesis that tighter compression encourages more generalizable reasoning.
Table 1 (NAVSIM) shows OneVL achieves 88.84 PDM-score vs AR CoT+Answer at 88.29. Tables 2-5 show similar patterns across ROADWork, Impromptu, and APR1. COCONUT, CODI, and SIM-CoT all underperform AR baselines. This directly supports the claim that On

Insufficient evidence (50%) On NAVSIM, the latency matches AR answer-only prediction and is 1.5× faster than explicit autoregressive CoT. On ROADWork, prefill latency is identical to answer-only and 2.3× faster than its explicit counterpart.
The latency numbers are partially verifiable but the speedup factors have calculation issues. For NAVSIM: OneVL 4.46s vs AR CoT+Answer 6.71s = 1.50× (matches claim). For ROADWork: OneVL 4.71s vs AR CoT+Answer 10.83s = 2.30× (matches claim). However,
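The two speedup factors follow directly from the latencies quoted above. A small sketch of the calculation, with `speedup` as an illustrative helper name:

```python
def speedup(baseline_s: float, candidate_s: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if candidate_s <= 0:
        raise ValueError("candidate latency must be positive")
    return baseline_s / candidate_s


# Latencies (seconds) as quoted in the verification note above.
navsim_speedup = speedup(baseline_s=6.71, candidate_s=4.46)
roadwork_speedup = speedup(baseline_s=10.83, candidate_s=4.71)

print(f"NAVSIM: {navsim_speedup:.2f}x, ROADWork: {roadwork_speedup:.2f}x")
```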

Confirmed (85%) On NAVSIM, OneVL with prefill inference achieves 4.46s latency, essentially identical to AR answer-only prediction (4.49s).
Table 1 (referenced in p_94) shows OneVL prefill latency at 4.46s and AR Answer at 4.49s on NAVSIM. The difference of 0.03s is indeed essentially identical. However, no error bars or variance measures are provided, which would strengthen confidence i

Confirmed (95%) OneVL achieves a Meta Action Accuracy of 71.00, a significant 3.8 improvement over SIM-CoT (67.20).
Table 6 (referenced in p_108) explicitly shows Meta Action Accuracy: OneVL 71.00 vs SIM-CoT 67.20, a difference of 3.8 percentage points. This is a clear, quantitative result from a controlled comparison.

Confirmed (95%) Comparing OneVL w/o visual decoder (87.97) to the full OneVL model (88.84), we find that the visual auxiliary decoder contributes +0.87 score.
Table 7 (referenced in p_109-p_110) shows the ablation results: OneVL w/o visual decoder scores 87.97, full OneVL scores 88.84. The difference is exactly +0.87 PDM-score. This is a well-designed ablation study with clear numerical evidence.

Confirmed (95%) A comparison between OneVL without its language decoder (88.53) and the full OneVL model (88.84) validates the contribution of the language auxiliary decoder and language latent tokens, yielding a modest performance gain of +0.31.
Table 7 shows OneVL w/o language decoder scores 88.53, full OneVL scores 88.84. The difference is +0.31 PDM-score. This is a clear ablation result with precise numerical evidence.

Confirmed (95%) Direct end-to-end joint fine-tuning fails catastrophically, causing the PDM-Score to drop by 21.71 points (from 88.84 to 67.13).
Table 7 and p_112 provide clear numerical evidence: full OneVL achieves 88.84 PDM-score, while OneVL w/o staged training achieves only 67.13. The drop is exactly 21.71 points. This is a dramatic and well-documented ablation result.
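The ablation deltas cited in this analysis can be recomputed from the quoted Table 7 scores. The dictionary keys below are descriptive labels chosen for this sketch, not names taken from the paper.

```python
FULL_ONEVL_PDM = 88.84

# PDM-scores for each ablated variant, as quoted in this analysis.
ablation_scores = {
    "w/o visual decoder": 87.97,
    "w/o language decoder": 88.53,
    "w/o staged training": 67.13,
}

# Drop relative to the full model, rounded to the two decimals the scores carry.
pdm_drops = {
    name: round(FULL_ONEVL_PDM - score, 2)
    for name, score in ablation_scores.items()
}

for name, drop in pdm_drops.items():
    print(f"{name}: -{drop:.2f} PDM-score")
```

The staged-training ablation is about 25 times larger than the visual-decoder ablation, which is why it is described as a catastrophic failure rather than an incremental loss.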

Confirmed (85%) The direct approach suffers from severe "gradient shock" at initialization, with an exploding gradient norm of 378.22 that destabilizes the pre-trained backbone. In contrast, the three-stage strategy maintains a stable gradient norm of 0.28.
The gradient norms are reported in p_113: 378.22 for direct approach vs 0.28 for three-stage strategy. These are specific numerical measurements from training dynamics. However, the paper doesn't show the full gradient norm curves over training, only
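For readers unfamiliar with the metric, a gradient norm is commonly reported as the global L2 norm over all parameter gradients. The sketch below assumes that definition; this analysis does not confirm which norm the paper uses, and `global_grad_norm` is an illustrative helper.

```python
import math
from typing import Iterable, Sequence


def global_grad_norm(grads: Iterable[Sequence[float]]) -> float:
    """Global L2 norm: square root of the sum of squares over every gradient entry."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


# Toy gradients for two parameter tensors (made-up values, for illustration only).
toy_norm = global_grad_norm([[3.0, 0.0], [4.0]])

# Ratio between the two norms quoted in the claim above.
shock_ratio = 378.22 / 0.28

print(f"toy norm: {toy_norm}, reported ratio: {shock_ratio:.0f}x")
```

The reported values differ by roughly three orders of magnitude, which supports the instability argument even without full gradient curves.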

Confirmed (85%) The end-to-end method causes catastrophic task interference. The backbone struggles to optimize conflicting objectives simultaneously, resulting in a much higher final trajectory prediction loss (0.186 vs. 0.136).
The final trajectory prediction losses are reported in p_113: 0.186 for end-to-end method vs 0.136 for three-stage approach. This provides quantitative evidence of optimization difficulty. However, the paper doesn't show the full loss curves, only fi

Confirmed (85%) All three adapted latent CoT methods perform substantially worse than answer-only AR prediction, except for the APR1 dataset. This is a critical finding that purely linguistic latent CoT approaches that were designed for text-only reasoning tasks do not transfer effectively to the multimodal spatial-temporal domain of autonomous driving trajectory prediction.
Tables 1-5 show COCONUT, CODI, and SIM-CoT all underperform AR Answer on NAVSIM, ROADWork, and Impromptu. The exception is APR1 where some latent methods are competitive. This is a consistent finding across multiple benchmarks. However, the 'except f

Confirmed (95%) Comparing AR CoT+Answer (88.29) to AR Answer (87.47) on NAVSIM confirms that explicit reasoning supervision provides meaningful trajectory improvements (+0.80 PDM-score).
Table 1 shows AR CoT+Answer at 88.29 vs AR Answer at 87.47 on NAVSIM. The difference is exactly +0.82 PDM-score (paper rounds to +0.80). This is clear numerical evidence from a controlled comparison.

Confirmed (90%) OneVL's implicit latent reasoning achieves better performance than explicit AR CoT+Answer (88.84 vs. 88.29), despite using fewer tokens to express the reasoning.
Table 1 shows OneVL at 88.84 vs AR CoT+Answer at 88.29 on NAVSIM, a difference of +0.55 PDM-score. The claim about 'fewer tokens' is supported by the latent token design (C_t=2 language latent tokens, C_v=4 visual latent tokens vs full CoT text seque
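The NAVSIM comparisons quoted in this and the preceding claim reduce to three pairwise differences. The sketch below recomputes them from the quoted Table 1 scores; the dictionary keys simply reuse the method labels from this analysis.

```python
# NAVSIM PDM-scores as quoted from Table 1 in this analysis.
navsim_pdm = {
    "AR Answer": 87.47,
    "AR CoT+Answer": 88.29,
    "OneVL": 88.84,
}


def pdm_gain(method: str, baseline: str) -> float:
    """PDM-score gain of `method` over `baseline`, rounded to two decimals."""
    return round(navsim_pdm[method] - navsim_pdm[baseline], 2)


cot_over_answer = pdm_gain("AR CoT+Answer", "AR Answer")
onevl_over_cot = pdm_gain("OneVL", "AR CoT+Answer")
onevl_over_answer = pdm_gain("OneVL", "AR Answer")
```

The first difference comes out to 0.82, so the +0.80 stated in the claim is a rounded figure.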

Confirmed (95%) The visual auxiliary decoder contributes +0.87 PDM-score compared to +0.31 for the language auxiliary decoder.
This is directly supported by the ablation results in p_110 and p_123. Visual decoder contributes +0.87 (88.84 - 87.97), language decoder contributes +0.31 (88.84 - 88.53). The asymmetry is clearly documented.

Confirmed (95%) OneVL w/o visual decoder scores 87.97, still below the full model (88.84); removing the world model supervision alone accounts for a -0.87 drop, even when all other components remain intact.
Table 7 shows OneVL w/o visual decoder at 87.97 vs full OneVL at 88.84. The -0.87 drop is precisely calculated. This is redundant with claim_11 but still well-supported.

... 41 claims in total

Reproducibility Assessment

Low reproducibility (0%)

Missing Reproduction Details

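Full code implementation unavailable (no public code repository found)
Training dataset unavailable
Missing hyperparameters: learning rate, batch size, number of training epochs, optimizer type and configuration
Random seed not specified
Hardware environment not specified (GPU model, count, memory, etc.)
Incomplete training-stage details: the specific training configurations for Stage 0 and Stage 2 are not described in detail
Specific architectural details of the latent token interface (dimensions, count, positional encoding, etc.)
Detailed architecture and implementation of the auxiliary decoders
Data preprocessing steps not described
Train/validation/test split not described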

Limitations (Author-Stated)

The current system requires roughly 3× memory during training, since three full 4B model instances must be held in memory. This is mitigated by DeepSpeed ZeRO-2 but still imposes nontrivial infrastructure requirements.
The latent token count was chosen empirically. Thus a systematic study of the trade-off between latent token count and representation capacity is left for future work.
While OneVL's prefill mechanism eliminates latent CoT overhead, the trajectory tokens themselves are still generated autoregressively.
Extending the world model decoder to multi-camera inputs would enable 360-degree future-scene prediction and more comprehensive causal scene understanding, further strengthening the compression targets available to the visual latents.

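This analysis was generated automatically by the PDF reading assistant and is for reference only; it does not constitute an academic peer-review opinion. The verification conclusions and reproducibility assessment are based on automated analysis of the paper's text and may be biased. For the original paper, see arXiv.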

Analysis time: 2026-04-22T01:19:25+00:00 · Data source: Paper Collector