WildDet3D introduces an open-vocabulary monocular 3D detector that unifies text, point, and box prompts within a geometry-aware architecture. It achieves 22.6-24.8 AP_3D on WildDet3D-Bench, far above the prior best method (2.3 AP). The work contributes a 1M-image dataset with 13.
Core Question
How can we build a general-purpose monocular 3D object detector that generalizes to long-tailed and unseen categories, supports multiple prompt modalities (text, points, boxes), and leverages optional geometric cues like depth?
Core Method
WildDet3D uses dual-vision encoders (ViT-H for semantics, DINOv2 ViT-L for geometry) with a depth fusion module, a promptable detector supporting four prompt types, and a 3D detection head with unambiguous rotation normalization. The authors construct WildDet3D-Data through a three-stage pipeline: candidate generation from multiple 3D methods, rule-based geometric filtering, and human/VLM-based selection.
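The verification notes in this report characterize the depth fusion module as a ControlNet-style residual design. As a rough illustration of what that phrase usually means, the sketch below adds a projected geometry feature onto a semantic feature, with the projection initialised to zero so that the fused output initially equals the semantic input. The class name `ResidualDepthFusion`, the plain-list feature vectors, and the zero initialisation are assumptions made for illustration; the paper's actual module is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class ResidualDepthFusion:
    """Toy residual fusion: fused = semantic + W @ geometry (hypothetical, not the paper's code)."""

    dim_sem: int
    dim_geo: int
    weight: list = field(init=False)

    def __post_init__(self):
        # Assumption: zero-initialised projection, as in ControlNet's zero convolutions,
        # so the geometry branch contributes nothing until the weights are trained.
        self.weight = [[0.0] * self.dim_geo for _ in range(self.dim_sem)]

    def __call__(self, semantic, geometry):
        # Project the geometry feature into the semantic feature space.
        residual = [sum(w * g for w, g in zip(row, geometry)) for row in self.weight]
        # Add the projection as a residual on top of the semantic feature.
        return [s + r for s, r in zip(semantic, residual)]


fusion = ResidualDepthFusion(dim_sem=3, dim_geo=2)
untrained = fusion([1.0, 2.0, 3.0], [5.0, 7.0])  # identical to the semantic input

fusion.weight[0][1] = 0.5  # stand-in for a learned weight
trained = fusion([1.0, 2.0, 3.0], [5.0, 7.0])  # first channel now receives depth signal
```

The appeal of this pattern is that a pretrained semantic encoder is left undisturbed at the start of training, and depth information is introduced only as the projection weights move away from zero.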
Claim Verification
The paper fully specifies the WildDet3D architecture in Section 2, including dual encoders, depth fusion module, promptable detector accepting four prompt types (text, point, box, exemplar), and 3D detection head. The model is evaluated on multiple b
Table 1 summarizes the dataset with specific quantitative details: ~1M images, 3.7M valid 3D annotations, 13.5K object categories. The 138× increase over Omni3D is calculable from Omni3D's 98 categories (mentioned in paragraph 90) vs 13.5K categories
Section 3 describes the three-stage pipeline in detail with Figure 4 showing the overview. Stage 1 (paragraphs 68-71): five complementary models generate candidates. Stage 2 (paragraphs 73-78): rule-based filters including edge contact, occlusion, si
Section 2.1 fully specifies the dual-vision encoders (image encoder: ViT-H with SimpleFPN from SAM3; RGBD encoder: DINOv2 ViT-L/14 from LingBot-Depth) and depth fusion module with ControlNet-style residual design. The architecture is completely descr
Section 2.2 describes the promptable detector that accepts fused vision features and four prompt types to produce unified query representations. The per-prompt batching strategy is specified, and the detector is evaluated across all prompt modalities
Section 2.3 fully specifies the 3D detection head with ray features from spherical harmonics, depth prompt branch, and the unambiguous rotation normalization technique. The head aggregates depth, 2D spatial, and semantic features as described.
The auxiliary 2D detection and depth estimation heads are described in Section 2.4. The 2D head's contribution is validated by ablation (claim 19 shows a 19.1 AP drop when it is removed). However, the depth estimation head's contribution is NOT directly abl
The two-step unambiguous rotation normalization technique is fully specified in paragraphs 34-39. However, there is NO ablation study comparing performance with vs. without this normalization. The paper describes the technique and its motivation but
The 3D confidence branch is described in paragraphs 41-43 with a specific formulation. Claim 20 provides ablation evidence: removing it drops AP by 0.8 (30.2→29.4), validating its contribution.
Table 3 shows results on WildDet3D-Bench. The specific numbers 22.6 AP_3D (text), 24.8 AP_3D (box), and 2.3 AP for 3D-MOOD are stated in paragraph 101 and should appear in the table. The comparison is quantitative and directly supports the claim.
Paragraph 2 and paragraph 113 mention these results. Table 4 shows the effect of sparse/ground-truth depth, with specific numbers 41.6 AP (text) and 47.2 AP (box) when ground-truth depth is provided.
The Omni3D results (34.2 and 36.4 AP_3D) are stated in paragraph 2. The training efficiency claim (12 epochs vs 80-120 for baselines) is stated in paragraph 103. These are quantitative comparisons.
The overall ODS numbers (40.3 on Argoverse 2, 48.9 on ScanNet) are stated in paragraph 105 and should appear in Table 5. However, the claim about 'particularly large improvements on novel categories unseen during training' lacks specific quantitative
Paragraph 101 states the specific numbers: AP_rare = 47.4 for WildDet3D vs. 2.4 for 3D-MOOD. This is quantitative evidence supporting the claim about rare category performance.
Paragraph 105 states specific mAOE numbers: 0.526 on AV2 and 0.437 on ScanNet for WildDet3D, compared to 0.580 and 0.655 for 3D-MOOD Swin-B. Lower mAOE is better, confirming the claim.
Paragraph 105 states specific mATE numbers: 0.714 for WildDet3D vs. 0.755 for Swin-B on AV2. Lower mATE is better, confirming the claim about translation accuracy.
Paragraph 106 states specific numbers: ScanNet improves from 48.9 to 50.2 ODS (+1.3) with GT depth, while AV2 shows marginal gain (40.3 to 40.4). This is quantitative evidence.
Paragraph 108 states specific numbers from Table 6: monocular model 7.5 AP vs DetAny3D 7.1 AP; with real depth 27.7 AP, which is 2.8× over OVMono3D-LIFT's 9.9 AP. These are quantitative comparisons.
Paragraph 110 provides specific ablation numbers: AP drops from 30.2 to 11.1 (-19.1) when 2D head removed. Indoor datasets show largest drops: SUNRGBD 33.9→5.1, Objectron 56.8→10.9. This is strong quantitative evidence.
Paragraph 111 provides specific ablation numbers: removing 3D confidence head drops AP by 0.8 (30.2→29.4). This is quantitative evidence.
... 59 claims in total
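Several of the ratios and deltas quoted in the notes above can be re-derived from the figures they cite. The snippet below is a small arithmetic check, not part of the paper; the variable names are illustrative, and 13.5K is taken as exactly 13,500 although the source gives it as a rounded figure.

```python
# Figures as quoted in the verification notes (13.5K treated as 13,500; an assumption).
omni3d_categories = 98
wilddet3d_categories = 13_500
category_ratio = wilddet3d_categories / omni3d_categories  # reported as 138x

# Real-depth result vs. OVMono3D-LIFT (reported as 2.8x).
real_depth_ap, lift_ap = 27.7, 9.9
depth_ratio = real_depth_ap / lift_ap

# Ablation and depth-input deltas, rounded to one decimal as in the report.
drop_without_2d_head = round(30.2 - 11.1, 1)
drop_without_confidence = round(30.2 - 29.4, 1)
scannet_gain_gt_depth = round(50.2 - 48.9, 1)
av2_gain_gt_depth = round(40.4 - 40.3, 1)

# Lower is better for mAOE and mATE, so a positive margin favours WildDet3D.
maoe_margin_av2 = round(0.580 - 0.526, 3)
maoe_margin_scannet = round(0.655 - 0.437, 3)
mate_margin_av2 = round(0.755 - 0.714, 3)

print(round(category_ratio), round(depth_ratio, 1))
print(drop_without_2d_head, drop_without_confidence)
print(scannet_gain_gt_depth, av2_gain_gt_depth)
print(maoe_margin_av2, maoe_margin_scannet, mate_margin_av2)
```

Under these assumptions the recomputed values agree with the quoted 138×, 2.8×, -19.1, -0.8, and +1.3 figures, which suggests the numbers in the notes are at least internally consistent.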
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- No code repository available - model architecture implementation details for SAM3 backbone, LingBot-Depth geometry backend, and 3D detection head are not publicly accessible
- WildDet3D-Data training dataset not available - the paper's main training data (human and synthetic annotations) cannot be accessed
- Random seeds not specified for reproducibility of training and evaluation
- GPU hardware specifications not provided (only mentions 32 GPUs without model/type)
- Learning rate schedule details referenced as 'in appendix' but appendix content not available
- Data mixing ratios for Stage 2 training referenced as 'in appendix' but not provided
- Loss functions and training objectives not described in implementation details
- Detailed model architecture specifications missing (layer dimensions, attention mechanisms, etc.)
- Training time and computational cost not reported
- Detailed preprocessing steps for each dataset not fully specified
Limitations (as stated by the authors)
- While WildDet3D achieves strong results across diverse settings, several limitations remain.
- VLM scoring alone cannot substitute for human judgment. Even at score 10, which accounts for 64.5% of all candidates (310K), the human rejection rate remains 16.7%. This gap motivates our two-stage design: VLM scoring as an efficient pre-filter, followed by human verification for quality-critical subsets.
- The LLM-based size filters and VLM scoring heuristics used in the pipeline inherit biases from their training data, potentially affecting which annotations are retained or rejected for underrepresented object types or scenes.
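This analysis was generated automatically by PDF Reading Assistant (PDF 阅读助手). It is for reference only and does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-04-21T07:13:07+00:00 · Data source: Paper Collector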