FORGE benchmarks 18 multimodal large language models (MLLMs) on manufacturing tasks using 2D and 3D data across three scenarios. Models handle macroscopic recognition well but struggle with fine-grained reasoning, especially surface inspection. Supervised fine-tuning of a small model (Qwen2.5-VL-3B-Instruct) yields large gains, bringing it level with larger models.
Core Question
How well do current Multimodal Large Language Models perform on fine-grained manufacturing tasks requiring domain-specific reasoning, and what are the key bottlenecks limiting their performance?
Core Method
Approach: The authors constructed FORGE, a benchmark comprising approximately 12,000 samples from 2D images (~3,000) and 3D point clouds (14 categories, 90 models) rendered as three-view projections. Three manufacturing tasks were designed as multiple-choice questions: WORKVERI (material sorting), SURFINSP (defect classification), and ASSYVERI (assembly verification), evaluated across 18 MLLMs using exact-match accuracy.
Key components:
- 18 representative MLLMs are evaluated in the benchmark.
- Models span both open-source and closed-source families for comprehensive coverage.
Source section: sec_7
Claim Verification
The paper provides comprehensive evidence for FORGE benchmark: detailed dataset construction (p_10, p_46-54), three evaluation tasks (p_13-17), evaluation of 18 MLLMs (p_19, Table 2), and complete experimental results. The benchmark is fully specifie
Specific quantitative details provided: 3D Point Cloud Subset with 14 workpiece categories across 90 distinct models (p_10, p_49), Image Subset with ~3,000 images across 4 scenarios (p_10, p_50), specific model number examples (M10-M20 for nuts), and
Three tasks clearly defined with mapping to manufacturing applications: WORKVERI (material sorting, p_14, p_55), SURFINSP (quality inspection, p_15), ASSYVERI (assembly recognition, p_16). Each task has detailed specifications and evaluation protocol
Training dataset mentioned in p_33 and p_46. SFT experiments conducted on Qwen2.5-VL-3B-Instruct with specific configurations (p_57). Total dataset includes ~30,000 samples including training data (p_46). However, the exact split between benchmark an
While the paper presents a dataset with aligned 2D images and 3D point clouds, the claim of being 'first' requires comparison with prior work. The paper cites related work (MMAD, MME-Industry, DesignQA) but does not systematically verify that no prio
Three tasks are explicitly designed and named: WORKVERI (p_14), SURFINSP (p_15), ASSYVERI (p_16). Each task has detailed specifications, input/output formats, and evaluation criteria.
18 MLLMs evaluated (p_19, Table 2), three evaluation settings (zero-shot, Ref-Cond, ICD mentioned in p_20), main results in Table 3, extended results in Appendix A.2. The evaluation is comprehensive and systematic.
SFT experiments demonstrate training utility: 90.8% improvement on WORKVERI three-view (p_33), 27.1% relative gain on ASSYVERI image (p_33). Training configurations provided (p_57). Gains measured on held-out scenarios validate transferability.
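The two SFT figures above are easier to compare once absolute and relative gains are kept apart; the summary labels the 27.1% figure as relative but does not say which kind the 90.8% figure is. The following is a minimal sketch of the arithmetic only. The function names and the example accuracies are illustrative assumptions and are not taken from the paper.

```python
def absolute_gain(before, after):
    """Difference in accuracy, in the same units as the inputs."""
    return after - before


def relative_gain(before, after):
    """Improvement of `after` over `before`, as a fraction of `before`."""
    if before <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (after - before) / before


# Hypothetical accuracies chosen for illustration, not reported results.
baseline, finetuned = 0.40, 0.50
print(f"absolute: {absolute_gain(baseline, finetuned):.3f}")
print(f"relative: {relative_gain(baseline, finetuned):.3f}")
```

A low baseline makes the relative figure much larger than the absolute one, so the two should not be compared directly.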
This is a limitation claim about the broader field. While the paper cites references [42,22,41,47] to support this, verifying the accuracy of this characterization of 'current manufacturing datasets' would require comprehensive survey of the field be
This is a limitation claim about prior datasets. The paper asserts this gap but does not provide systematic comparison with existing datasets to verify the claim. Would require external verification of prior dataset characteristics.
This is a limitation claim about the field. The paper identifies this gap but verifying the absence of systematic benchmarks would require comprehensive survey of existing work beyond this paper.
Specific numbers provided: 3D Point Cloud Subset with 14 workpiece categories and 90 distinct models (p_10, p_49), Image Subset with ~3,000 images across 4 scenarios (p_10, p_50). Tables 9 and 10 provide detailed breakdowns.
Two-step process described: automated contour/coordinate extraction followed by manual refinement (p_11, p_52). The methodology is clearly specified.
Batch sample synthesis described: stitching 4-5 individual point clouds with random orientations, automatic label generation (p_11, p_52). Methodology is clearly specified.
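The batch synthesis step described above might look roughly like the sketch below. It assumes each part is a list of (x, y, z) tuples with a string label, restricts the random orientation to a rotation about the z-axis, and lays parts out along the x-axis; all three choices, and the names `rotate_z` and `synthesize_batch`, are assumptions made for illustration rather than the authors' pipeline.

```python
import math
import random


def rotate_z(points, angle):
    """Rotate (x, y, z) points about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def synthesize_batch(parts, rng, spacing=10.0):
    """Stitch 4 or 5 labeled parts into one scene.

    `parts` is a list of (label, points) pairs. Returns the merged points
    and a per-point label list generated automatically from the sources.
    """
    k = rng.choice([4, 5])
    chosen = rng.sample(parts, k)
    scene, labels = [], []
    for slot, (label, pts) in enumerate(chosen):
        rotated = rotate_z(pts, rng.uniform(0.0, 2.0 * math.pi))
        # Shift each part along x; parts stay apart only if each one
        # fits within `spacing` (hypothetical layout rule).
        scene.extend((x + slot * spacing, y, z) for x, y, z in rotated)
        labels.extend([label] * len(pts))
    return scene, labels


rng = random.Random(0)
parts = [(f"part_{i}", [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.5)])
         for i in range(6)]
scene, labels = synthesize_batch(parts, rng)
print(len(scene), sorted(set(labels)))
```

Because labels are emitted alongside the points, no manual annotation is needed for the stitched scene.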
Four defect types specified (Crack, Deformation, Dent, Cut) with morphology-based algorithms and non-rigid deformation (p_11, p_53). Defect point proportion constrained 5-15% (p_53).
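To make the 5-15% constraint concrete, here is a minimal sketch of one way to inject a local defect while keeping the defect-point proportion in that range. It is a deliberately simple stand-in (a uniform downward shift of the points nearest a random seed), not the morphology-based algorithms or non-rigid deformation the paper describes; `add_dent` and its parameters are hypothetical.

```python
import math
import random


def add_dent(points, rng, depth=0.2, min_frac=0.05, max_frac=0.15):
    """Shift the points nearest a random seed point along -z.

    Returns (new_points, defect_mask). The fraction of shifted points is
    drawn from [min_frac, max_frac].
    """
    n = len(points)
    frac = rng.uniform(min_frac, max_frac)
    k = max(1, round(frac * n))
    seed = points[rng.randrange(n)]
    # Rank all point indices by distance to the seed and keep the k closest.
    order = sorted(range(n), key=lambda i: math.dist(points[i], seed))
    defect = set(order[:k])
    new_points = [
        (x, y, z - depth) if i in defect else (x, y, z)
        for i, (x, y, z) in enumerate(points)
    ]
    mask = [i in defect for i in range(n)]
    return new_points, mask


rng = random.Random(1)
surface = [(float(i % 20), float(i // 20), 0.0) for i in range(200)]
dented, mask = add_dent(surface, rng)
print(f"defect proportion: {sum(mask) / len(surface):.3f}")
```

Selecting by nearest neighbors keeps the defect spatially contiguous, which is what a dent or crack on a real surface would look like.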
Multi-view projection strategy clearly described: three-view (3V) images with front, side, and top orthogonal projections (p_12, p_54). Rationale provided (general MLLMs lack native 3D encoders).
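The three-view idea can be sketched as dropping one coordinate per view and rasterizing the result, which yields 2D inputs for a model without a native 3D encoder. The axis convention (front looks along y, side along x, top along z), the binary occupancy rendering, and the function names are assumptions for illustration; the paper's actual renderer is not specified in this summary.

```python
def three_view_projections(points):
    """Orthogonally project (x, y, z) points onto three axis-aligned planes."""
    return {
        "front": [(x, z) for x, y, z in points],  # view along the y-axis
        "side": [(y, z) for x, y, z in points],   # view along the x-axis
        "top": [(x, y) for x, y, z in points],    # view along the z-axis
    }


def rasterize(points_2d, size=32):
    """Render 2D points into a size x size binary occupancy grid."""
    xs = [p[0] for p in points_2d]
    ys = [p[1] for p in points_2d]
    min_x, min_y = min(xs), min(ys)
    # One shared scale for both axes preserves the aspect ratio.
    span = max(max(xs) - min_x, max(ys) - min_y) or 1.0
    grid = [[0] * size for _ in range(size)]
    for u, v in points_2d:
        col = min(size - 1, int((u - min_x) / span * size))
        row = min(size - 1, int((v - min_y) / span * size))
        grid[size - 1 - row][col] = 1  # flip rows so +v points up
    return grid


cloud = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 3)]
views = three_view_projections(cloud)
for name, pts in views.items():
    print(name, sum(map(sum, rasterize(pts, size=4))))
```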
Final dataset size specified as ~12,000 samples across all tasks (p_12). Note: p_46 mentions ~30,000 samples including training data, suggesting benchmark subset is ~12,000.
MCQ format clearly specified for all tasks (p_20). Options correspond to parts with normalized coordinates or letter labels.
Exact-match accuracy explicitly stated as evaluation metric (p_20). Prediction extraction and comparison with ground-truth described.
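Exact-match scoring over multiple-choice answers can be sketched as follows. The extraction rules are an assumption: the paper's parsing method is not detailed in this summary, so the two heuristics below (a bare option letter, or an explicit "answer is X" phrase) and the names `extract_choice` and `exact_match_accuracy` are hypothetical.

```python
import re


def extract_choice(response, valid="ABCD"):
    """Return the predicted option letter, or None if none is found."""
    text = response.strip()
    # Case 1: the whole response is one option letter, e.g. "B" or "(B)."
    m = re.fullmatch(rf"\(?([{valid}])\)?[.:]?", text)
    if m:
        return m.group(1)
    # Case 2: an explicit phrase such as "The answer is C" or "Answer: (A)".
    # Only the word "answer" is case-insensitive; the letter must be uppercase.
    m = re.search(rf"(?i:answer)\s*(?:is|:)?\s*\(?([{valid}])\)?\b", text)
    return m.group(1) if m else None


def exact_match_accuracy(responses, gold):
    """Fraction of responses whose extracted letter equals the gold letter."""
    if not gold or len(responses) != len(gold):
        raise ValueError("responses and gold must be non-empty and equal length")
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


responses = ["B", "The answer is C.", "I think it is a nut", "Answer: (A)"]
gold = ["B", "C", "D", "B"]
print(exact_match_accuracy(responses, gold))
```

A response with no recognizable letter counts as wrong here; a different policy (for example, excluding unparseable responses) would change the reported accuracy, which is why the parsing rule matters for reproducibility.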
The claim mentions 'three progressively informative settings' but only names 'Zero-Shot'. The other two settings (Ref-Cond and ICD) are mentioned elsewhere (p_42) but not fully defined in the evaluation protocol section. The description is incomplete
... 60 claims in total
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- No code available - implementation details for evaluation pipeline, prompt templates, and fine-tuning procedures are not accessible
- No data available - the FORGE benchmark dataset (images, point clouds, annotations) is not publicly released
- Model versions and API details - specific versions of the 18 MLLMs evaluated and API configurations for closed-source models are not specified
- Inference hyperparameters - temperature, top-p, max tokens, and other generation parameters for model inference are not reported
- Fine-tuning details - learning rate, batch size, epochs, optimizer, and training configuration for the cross-scenario generalization experiments are missing
- Random seeds - no random seed information provided for reproducibility of stochastic processes
- Hardware specifications - computational resources and environment details are not mentioned
- Data splits - train/validation/test split ratios and sample counts are not specified
- Complete prompt templates - only partial prompt examples are shown, full prompts for all tasks and settings are not provided
- Response parsing implementation - exact method for extracting MCQ letters from free-form model responses is not detailed
Limitations (as stated by the authors)
- Current manufacturing datasets are constrained by limited scale and diversity, so that many studies rely on simulated or CAD-based data
- Many current manufacturing datasets merely treat manufacturing workpieces as generic visual subjects. They fail to integrate explicit, fine-grained domain semantics (e.g., model numbers of workpiece) that are essential to the rigorous demands of real-world manufacturing
- There is a lack of systematic and representative benchmarks to assess the reasoning, understanding, and decision-making capabilities of MLLMs in manufacturing scenarios
- Insufficient manufacturing domain knowledge and morphology understanding are the key gaps
- The insufficient capability to internalize and reason about complex manufacturing standards makes this domain an arduous challenge for future MLLM development
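This analysis was generated automatically by the PDF reading assistant and is for reference only; it does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-04-20T01:25:34+00:00 · Data source: Paper Collector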