SteerViT enables text-steerable visual representations by inserting cross-attention layers into frozen Vision Transformers. With 21M parameters, it achieves 96% accuracy on conditional retrieval versus 44% for DINOv2, outperforming billion-parameter models while maintaining representation quality t…
Core Question
How can visual representations be made steerable by text prompts at inference time, allowing vision models to focus on specific queried concepts rather than defaulting to dominant salient objects?
Core Method
SteerViT inserts lightweight gated cross-attention layers into every other frozen ViT block, allowing visual patch tokens to attend to text embeddings projected through a multimodal adapter. The model is trained on referential segmentation using 162k images with 2.28M image-text pairs, employing a zero-initialized tanh gate that preserves the frozen ViT's capabilities while gradually activating the conditioning pathway.
Key components:
- Four components: frozen visual encoder, frozen text encoder, multimodal adapter, and gated cross-attention layers.
- Cross-attention layers are inserted into every other Transformer block (e.g., 6 layers for 12-block ViT-B).
- Visual patch tokens attend to adapted text tokens through cross-attention.
- Tanh gate with learnable scalar is zero-initialized, ensuring the model starts identical to frozen ViT.
- The gate receives learning signals during optimization, gradually activating the conditioning pathway.
- GeneCIS Focus Object benchmark evaluates conditional retrieval in real images with scene-matching and object-presence requirements.
- SteerViT achieves 25.4% R@1 in zero-shot evaluation, substantially outperforming DINOv2 (9.6%) and specialized baselines (18.7%).
- Text-based representational steering transfers effectively from controlled synthetic benchmarks to real-world retrieval tasks.
- SteerViT and OV localization models can be reliably steered through text while standard ViTs collapse to salient concepts.
- MLLMs offer moderate steerability but bear significant computational cost and diminished visual feature quality.
Source sections: sec_6, sec_11, sec_25
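The zero-initialized gate can be illustrated with a minimal single-head sketch in plain Python. This is not the authors' implementation. The function names (`cross_attention`, `gated_block`), the omission of learned query/key/value projections, and the residual form x + tanh(alpha) * CrossAttn(x, text) are assumptions made for illustration. Text tokens are assumed to have already been mapped to the patch dimension by the adapter.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(patches, text):
    """Each patch token (query) attends over the text tokens (keys and values).

    Assumption: no learned projections; tokens are used directly as Q, K, V.
    """
    dim = len(text[0])
    attended = []
    for query in patches:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in text
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * tok[i] for w, tok in zip(weights, text)) for i in range(dim)]
        )
    return attended


def gated_block(patches, text, alpha):
    """Assumed residual update: x + tanh(alpha) * CrossAttn(x, text)."""
    gate = math.tanh(alpha)  # alpha = 0 gives gate = 0, so the block is a no-op
    attended = cross_attention(patches, text)
    return [
        [x + gate * a for x, a in zip(patch, att)]
        for patch, att in zip(patches, attended)
    ]
```

With `alpha = 0.0` the block returns the patch tokens unchanged, which is the property that lets training start from the frozen ViT's behavior. Any non-zero `alpha` mixes text information into the patches.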
Claim Verification
The method is fully specified in Section 3 with architectural details (cross-attention layers, gating mechanism). The 21M parameter count is stated multiple times (p_11, p_13, p_15). The framework is demonstrated to work on multiple ViT backbones (DI
CORE benchmark is fully described in p_33 and p_64-p_66 with specific details: 100 images per scene, 6 scenes from SUN397, 5 inpainted objects per scene, one-vs-all retrieval setup, and evaluation methodology.
MOSAIC benchmark construction is described in p_38 and p_68 with specific details: 4 PASCAL-VOC images stitched into 2x2 mosaic, 363 composite images total.
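The 2x2 stitching step can be sketched as follows. Images are represented here as plain lists of pixel rows, and the four tiles are assumed to share the same height and width. How the benchmark actually resizes or crops the PASCAL-VOC images is not stated in this analysis, and `stitch_2x2` is a hypothetical helper name.

```python
def stitch_2x2(top_left, top_right, bottom_left, bottom_right):
    """Stitch four equally sized images (lists of pixel rows) into one 2x2 mosaic."""
    tiles = (top_left, top_right, bottom_left, bottom_right)
    height, width = len(top_left), len(top_left[0])
    # Assumption: tiles were already resized to a common shape.
    for tile in tiles:
        if len(tile) != height or any(len(row) != width for row in tile):
            raise ValueError("all four tiles must share the same height and width")
    # Concatenate rows side by side, then stack the two halves vertically.
    top = [left + right for left, right in zip(top_left, top_right)]
    bottom = [left + right for left, right in zip(bottom_left, bottom_right)]
    return top + bottom
```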
The architectural contribution is fully specified in Section 3 (p_18-p_23) with cross-attention formulation, gating mechanism, and integration with frozen ViT blocks. The paradigm inversion is clearly described.
The 96% retrieval accuracy is stated in p_34. While not shown in a table, the number is specific and the CORE benchmark methodology is fully described. Per-scene results in Table 7 provide supporting evidence.
DINOv2's 44% acc@1 on CORE is stated in p_34. The number is specific and consistent with the benchmark methodology.
The 0.02% boost from post-hoc element-wise addition is stated in p_35. This is a specific numerical result demonstrating late fusion ineffectiveness.
The claim references Tab. 1 and states specific performance differences (49 and 20 percentage points). While the exact InternVL3 numbers aren't shown in provided text, the claim is specific and Table 7 shows InternVL3 results. The 21M parameter compa
Specific PR-AUC numbers (50.2% for SteerViT vs 14.3% for DINOv2) are stated in p_38 for the MOSAIC benchmark.
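For readers unfamiliar with the metric, PR-AUC is commonly approximated by average precision, the step-wise sum under the precision-recall curve. The sketch below shows that approximation. Whether the paper uses this estimator or a trapezoidal one is not stated here, and `average_precision` is an illustrative name.

```python
def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve (average precision).

    scores: predicted relevance per item (higher = more likely positive).
    labels: 1 for positive items, 0 for negatives.
    Tied scores are ranked in their input order (Python's sort is stable).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos
```

For example, scores `[0.9, 0.8, 0.7, 0.6]` with labels `[1, 0, 1, 0]` give precisions 1/1 and 2/3 at the two positives, so the result is 5/6.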
Specific R@1 numbers for GeneCIS Focus Object are stated in p_36 with reference to Tab. 2: SteerViT 25.4%, DINOv2 9.6%, specialized baseline 18.7%.
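R@1 is the fraction of queries whose top-ranked gallery item is the correct one. A minimal sketch follows, assuming cosine similarity as the ranking function. The similarity measure used in the paper is not specified in this analysis, and the names `cosine` and `recall_at_1` are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_1(queries, gallery, targets):
    """Fraction of queries whose most similar gallery item is the labelled target.

    targets[i] is the gallery index that counts as correct for queries[i].
    Assumption: ranking by cosine similarity.
    """
    hits = 0
    for query, target in zip(queries, targets):
        best = max(range(len(gallery)), key=lambda i: cosine(query, gallery[i]))
        hits += int(best == target)
    return hits / len(queries)
```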
Specific PR-AUC numbers for PODS are stated in p_43 with reference to Fig. 8: SteerViT with instance-level descriptions 58.1%, fine-tuned DINOv2 48.0%.
Specific PRO scores for MVTec AD are stated in p_50-p_51 and p_71 with reference to Table 8: SteerViT 82.1, SAM3 54.5, CLIPseg 34.6, FADE 84.5.
Specific ROC_P and PRO scores for VisA are stated in p_71: SteerViT ROC_P 92.1 vs FADE 91.5, PRO 82.0 vs 79.3.
The 47.7 percentage point drop for SteerViT with incorrect prompts is stated in p_35 and p_66, with Table 7 showing per-scene results with correct/incorrect prompts.
The 52.5 point average improvement over DINOv2 on CORE is stated in p_66, with Table 7 providing per-scene breakdown.
Specific training iteration results are stated in p_76: 50k iterations yields CORE 95.3% (vs 43.5 frozen DINOv2), FG-CLS 89.6 (vs 89.2 frozen DINOv2).
Specific training iteration results are stated in p_76: PODS improves from 49.9 to 58.1, RefCOCOg from 63.4 to 70.6 between 50k and 450k iterations.
Specific ablation results are stated in p_75 with reference to Tab. 9: segmentation outperforms pointing with FG-CLS +7.3, ADE20k +8.0, PODS +12.4.
The design choice is stated in p_18 and demonstrated with experiments on multiple backbones in p_54 and Table 5, showing SteerViT improves steerability for DINOv2, SigLIP, and MAE.
The text encoder choice is clearly specified in p_19: frozen RoBERTa-Large for token-level embeddings.
... 41 claims in total
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- Training hyperparameters (learning rate, batch size, number of epochs, optimizer type, weight decay, learning rate schedule)
- Training data - what dataset was used to train the multimodal adapter and cross-attention layers
- Random seeds for reproducibility
- Hardware specifications (GPU type, memory requirements, training time)
- MLP architecture details (hidden dimensions, activation functions, number of parameters)
- Cross-attention implementation details (number of attention heads, dropout rates)
- Loss functions and training objectives
- Data preprocessing steps for images and text
- Evaluation protocol details (feature extraction methods, pooling strategies, distance metrics)
- Code repository and trained model weights
Limitations (as stated by the authors)
- The simplest formulation assigns a hard target (1 at the center patch, 0 elsewhere), but we find this unstable to optimize in practice for both classification and regression.
- While the predicted heatmaps are especially accurate for texture-based inputs, the zero-shot setting makes it impossible to perfectly localize all anomalies.
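This analysis was generated automatically by PDF 阅读助手 and is for reference only; it does not constitute an academic review opinion. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-04-25T13:33:21+00:00 · Data source: Paper Collector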