Programming with Data introduces a test-driven framework for LLM data engineering using three-level knowledge structures to trace model failures to specific data deficiencies and repair them through targeted patches, achieving consistent improvements across 16 disciplines and model scales.
Core Question
How can LLM data engineering be transformed from an open-loop process into a closed-loop, test-driven framework that enables tracing model failures back to specific training data deficiencies and repairing them through targeted patches?
Core Method
The authors extract a three-level knowledge structure (L1 Key Concepts, L2 Knowledge Relations, L3 Reasoning Chains) from 117K textbook-grade documents across 16 disciplines. The ProDa pipeline comprises three components: Builder, which synthesizes training data from L1/L2; Tester, which constructs benchmarks from L3 chains before training; and Debugger, which traces failures backward through the knowledge structure to generate targeted patches. Models are fine-tuned in two stages: V1 (static synthesis), then V2 (diagnostic repair).
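The backward trace performed by the Debugger can be illustrated with a minimal sketch. It assumes that an L3 chain is an ordered sequence of L2 triples over L1 concepts; the class names, field names, example triples, and the frequency-based ranking are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical containers; the field names are illustrative, not the paper's schema.
@dataclass(frozen=True)
class Relation:
    """L2: a (subject, predicate, object) triple over L1 concepts."""
    subject: str
    predicate: str
    obj: str

@dataclass(frozen=True)
class Chain:
    """L3: an ordered sequence of L2 relations."""
    chain_id: str
    steps: tuple

def trace_failures(chains, failed_ids):
    """Count how often each L2 relation and L1 concept occurs in failed chains."""
    relation_hits, concept_hits = Counter(), Counter()
    for chain in chains:
        if chain.chain_id not in failed_ids:
            continue
        for rel in chain.steps:
            relation_hits[rel] += 1
            concept_hits[rel.subject] += 1
            concept_hits[rel.obj] += 1
    return relation_hits, concept_hits

# Toy knowledge structure (invented for illustration).
r1 = Relation("enzyme", "lowers", "activation energy")
r2 = Relation("activation energy", "limits", "reaction rate")
r3 = Relation("temperature", "raises", "reaction rate")
chains = [Chain("c1", (r1, r2)), Chain("c2", (r3,)), Chain("c3", (r2, r3))]

# Suppose the model failed the benchmark items built from chains c1 and c3.
relation_hits, concept_hits = trace_failures(chains, failed_ids={"c1", "c3"})

# Assumed heuristic: the most frequently implicated relation is patched first.
patch_targets = [rel for rel, _ in relation_hits.most_common(1)]
print(patch_targets)
```

Ranking by frequency is only one possible heuristic; the summary does not say how the Debugger prioritizes patch targets.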
Claim Verification
The paper formally defines the Programming with Data paradigm in Section 2, providing a structured framework with clear analogies to software engineering (requirements specification, source code, compilation, unit testing, debugging). The paradigm is
The three-level structure (L1 Key Concepts, L2 Knowledge Relations, L3 Reasoning Chains) is clearly defined in paragraphs 9-11 with specific descriptions of each level. The paper provides concrete examples of extracted L1, L2, L3 structures in the su
The paper provides detailed descriptions of all three pipeline components: Builder (paragraph 14), Tester (paragraph 18), and Debugger (paragraphs 19-21). Each component's function is specified with respect to the shared three-level knowledge structu
The CORE principle is defined in paragraph 13 with all four components (Contextualized, Organized, Rigorous, Evolving). Each standard is operationalized through specific pipeline components in subsequent paragraphs (15-20).
The paper describes ProDa Studio in paragraphs 49-51 with Figure 7 showing the interface. However, there is no evidence that this IDE is actually released or available for use. The paper mentions releasing 'a structured knowledge base, benchmark suit
The instantiation across 16 disciplines is demonstrated through experimental results (paragraph 22, Figure 3a, Figure 4). The paper shows per-discipline accuracy distributions and benchmark construction across all 16 domains. The 'releasing as open r
The top-down extraction order is clearly described in paragraph 14 and elaborated in paragraph 55. The process is specified: L3 chains extracted first, decomposed into L2 triples, then L1 concepts harvested from L2 subjects and objects.
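The top-down order described here can be sketched as follows. The sketch assumes each L3 chain is already available as a list of (subject, predicate, object) steps; the function names and toy chains are hypothetical, and the coverage check is only one plausible reading of the reachability property the paper attributes to this order.

```python
# Hypothetical input: each L3 chain is a list of (subject, predicate, object) steps.
l3_chains = {
    "c1": [("enzyme", "lowers", "activation energy"),
           ("activation energy", "limits", "reaction rate")],
    "c2": [("temperature", "raises", "reaction rate")],
}

def extract_top_down(chains):
    """Derive L2 triples from L3 chains, then L1 concepts from those triples."""
    l2_relations = set()
    for steps in chains.values():
        l2_relations.update(steps)            # L3 -> L2: decompose chains into triples
    l1_concepts = set()
    for subject, _, obj in l2_relations:
        l1_concepts.update((subject, obj))    # L2 -> L1: harvest subjects and objects
    return l1_concepts, l2_relations

def is_covered(chains, l1_concepts, l2_relations):
    """True if every chain step appears in L2 and both of its endpoints appear in L1."""
    return all(
        step in l2_relations and step[0] in l1_concepts and step[2] in l1_concepts
        for steps in chains.values()
        for step in steps
    )

l1, l2 = extract_top_down(l3_chains)
print(sorted(l1))
print(is_covered(l3_chains, l1, l2))
```

Because the lower levels are derived from the chains themselves, the coverage check holds by construction in this sketch. A bottom-up extraction would need a separate check that no chain refers to a concept or relation that was never extracted.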
The paper argues for the top-down approach in paragraphs 14 and 56, stating it guarantees reachability (equation 5). However, there is no empirical validation through ablation comparing top-down vs bottom-up approaches. The claim is asserted but not
The specific number (160K instances) and the design choice (from L1/L2, without L3) are stated in paragraph 26.
The specific model families and scales are listed in paragraph 26 with the statement that identical hyperparameters were used.
The adversarial distractor construction is described in paragraph 19 and detailed in paragraphs 59-62 with three perturbation operators (INVREL, TRUNC, NEARMISS).
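As a rough illustration of perturbation-based distractors, the sketch below applies three operators to a toy reasoning chain. The summary gives only the operator names; the behaviors here are guesses from those names and are not the paper's definitions. The chain representation and the substitute concept are likewise hypothetical.

```python
# A chain is a list of (subject, predicate, object) steps (assumed representation).

def invrel(chain):
    """Guess at INVREL: reverse the direction of the final relation."""
    *head, (subject, predicate, obj) = chain
    return head + [(obj, predicate, subject)]

def trunc(chain):
    """Guess at TRUNC: drop the final step so the reasoning stops early."""
    return chain[:-1]

def nearmiss(chain, substitute):
    """Guess at NEARMISS: swap the final object for a plausible but wrong concept."""
    *head, (subject, predicate, _) = chain
    return head + [(subject, predicate, substitute)]

# Toy chain, invented for illustration.
chain = [
    ("enzyme", "lowers", "activation energy"),
    ("activation energy", "limits", "reaction rate"),
]

distractors = [
    invrel(chain),
    trunc(chain),
    nearmiss(chain, "equilibrium constant"),
]
for d in distractors:
    print(d)
```

Each operator returns a new list and leaves the original chain unchanged, so the correct chain and its distractors can be presented side by side as answer options.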
The instance-level orthogonality between benchmark (L3-derived) and training data (L1/L2-derived) is described in paragraph 63 with the structural separation formalized.
Specific numbers are provided: 117,000 documents, 48,000 high-quality chunks, ~1.5 billion tokens, 10:1 compression ratio. Referenced to Figure 3a.
Specific correlation coefficient (ρ = 0.847) is provided in paragraph 23 with reference to Figure 4a.
Specific correlation coefficients are provided: GPQA (ρ = 0.943) and MMLU-Pro (ρ = 0.905) in paragraph 23.
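For readers unfamiliar with this kind of statistic, the sketch below computes a rank correlation between two sets of benchmark scores. It assumes ρ denotes Spearman's rank correlation, which this summary does not confirm, and the scores are invented toy values, not numbers from the paper.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy per-model scores on two benchmarks (invented, not from the paper).
toy_scores_a = [41.0, 55.5, 62.3, 70.1, 76.0]
toy_scores_b = [30.2, 38.9, 37.5, 52.0, 60.4]
rho = spearman(toy_scores_a, toy_scores_b)
print(round(rho, 3))  # 0.9: one adjacent pair of models swaps rank
```

A value near 1 means the two benchmarks order the models almost identically, which is the sense in which a high ρ supports the benchmark's external validity.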
Specific accuracy (approximately 76% for frontier models) and the monotonically increasing trajectory for Qwen series are stated in paragraph 23 with reference to Figure 4b.
The claim is supported by paragraph 24 with reference to Figure 4c showing per-discipline accuracy distributions.
Specific accuracy numbers are provided in paragraph 27 with reference to Table 1: Qwen-2.5-7B-V1 (65.86% vs 62.31%), Qwen-2.5-32B-V1 (76.54% vs 73.61%).
Specific accuracy numbers are provided in paragraph 27: Qwen-3-4B-V1 (65.79% vs 54.62%, +11.17), Qwen-3-14B-V1 (76.44% vs 74.13%, +2.31).
Specific accuracy (77.35%) for Qwen-3-32B-V1 is provided in paragraph 27, stated as highest V1 score.
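The gains implied by the quoted Table 1 figures can be checked with a few lines of arithmetic. The accuracy pairs below are copied from this summary; the summary does not say which baseline the second number in each pair refers to, so it is labeled only as a reference score.

```python
# (V1 accuracy, reference accuracy) in percent, as quoted in this summary.
table1 = {
    "Qwen-2.5-7B": (65.86, 62.31),
    "Qwen-2.5-32B": (76.54, 73.61),
    "Qwen-3-4B": (65.79, 54.62),
    "Qwen-3-14B": (76.44, 74.13),
}

# Gain in percentage points, rounded to match the two-decimal reporting.
gains = {model: round(v1 - ref, 2) for model, (v1, ref) in table1.items()}

for model, gain in gains.items():
    print(f"{model}: +{gain:.2f} points")
```

The computed gains of +11.17 and +2.31 for the two Qwen-3 models match the figures stated in the summary, and the smallest model shows by far the largest improvement.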
... 51 claims in total
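Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- Specific hyperparameter settings (learning rate, batch size, number of epochs, optimizer, etc.): the paper mentions only "identical hyperparameters" without giving values
- Random seed settings
- Hardware specifications (GPU/TPU model, count, training time)
- Implementation details of the data synthesis process (how the 160K instances are generated from L1 concepts and L2 relations)
- Detailed definitions of the L1/L2/L3 levels and their extraction methods
- Specific scope and content of the 16 disciplines
- Detailed description of the ProDa-16 evaluation benchmark and the implementation of its evaluation metrics
- Specific contents of the test suite and its validation method
- Data preprocessing steps
- Model checkpoints and weight files
Limitations (as stated by the authors)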
- For specific model sizes, ProDa-V1 fails to eclipse the officially aligned versions. This performance divergence exposes a fundamental mechanistic constraint: the inherent limitations of static data synthesis.
- Irrespective of the rigor applied to corpus filtering and structured extraction, a one-off static data injection (first-pass compilation) inevitably leaves conceptual blind spots. Unlike human experts in RLHF, static synthesis cannot dynamically rectify model-specific 'concept gaps,' nor can it adequately cover the long-tail errors inherent in multi-step reasoning.
- The present work establishes the macro-level architecture of Programming with Data; it does not exhaust the design space it opens. Every module constitutes a research problem in its own right, each admitting improvement as the community brings domain-specific techniques to bear.
- We particularly expect intersections with retrieval-augmented generation for grounding synthesized data in primary sources, and with mechanistic interpretability for fine-grained diagnosis.
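This analysis was generated automatically by PDF 阅读助手 (PDF Reading Assistant) and is for reference only; it does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-04-30T07:33:29+00:00 · Data source: Paper Collector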