ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation - AI Paper In-Depth Analysis

TL;DR
ACES breaks the circular dependency in code-test evaluation by using leave-one-out AUC (LOO-AUC) to measure test quality without knowing which code is correct. ACES-O achieves 84.

Core Question

How can we evaluate test quality and select correct code solutions when neither generated code nor test cases are guaranteed to be correct?

Core Method

Approach: The authors formalize code ranking as a weighted voting problem over a pass matrix and introduce LOO-AUC to measure test quality by evaluating each test's consistency with rankings from the remaining tests. ACES-C computes closed-form weights using pass-rate correction under an average-quality assumption, while ACES-O jointly optimizes weights through a differentiable LOO-AUC objective without requiring this assumption.

Key components:
Post-hoc execution-only baselines: Majority Voting, CodeT (consensus-set scoring), and MBR-exec (pairwise output comparison).
Methods requiring additional information: SC+Spec, Self-collaborate, MPSC, and DS3 (static analysis).
Direct inference baselines from GPT-3.5-Turbo, GPT-4, DeepSeek-Coder, WizardCoder, and CodeLlama.
All post-hoc methods use identical GPT-3.5-Turbo candidates and test cases for fair comparison.
ACES is evaluated on Qwen2.5-Coder-7B, Qwen2.5-Coder-14B, and DeepSeek-Coder-V2-16B.
Benchmarks include HumanEval, HumanEval+, and LeetCodeDataset.
Table 5 reports all tasks; Table 6 reports non-trivial tasks with 0 < n+ < n.
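
The weighted-voting and LOO-AUC idea described above can be illustrated with a small sketch. This is one plausible reading of the description, not the authors' implementation: the pass matrix `P` (candidates by tests), the weight vector `w`, the helper names `auc` and `loo_auc`, and the tie handling (ties count 0.5) are all assumptions made for illustration.

```python
import numpy as np

def auc(scores, labels):
    """AUC of `scores` against binary `labels`; ties count as 0.5.

    Returns nan when one class is empty (a constant test column).
    """
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def loo_auc(P, w):
    """Per-test leave-one-out AUC (assumed definition).

    For test j, candidates are scored by the weighted vote of all
    other tests; the AUC measures how well that score separates
    candidates that pass test j from candidates that fail it.
    """
    P = np.asarray(P, dtype=float)
    w = np.asarray(w, dtype=float)
    total = P @ w  # weighted-vote score of each candidate
    out = np.empty(P.shape[1])
    for j in range(P.shape[1]):
        loo_scores = total - w[j] * P[:, j]  # drop test j's own vote
        out[j] = auc(loo_scores, P[:, j].astype(int))
    return out

# Toy pass matrix: 4 candidates (rows) x 3 tests (columns).
P = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
])
w = np.ones(3)        # uniform weights, i.e. plain majority voting
scores = P @ w        # ranking scores per candidate
print(loo_auc(P, w))  # per-test values: 0.25, 0.5, 0.0
```

In this toy matrix the third test ranks candidates opposite to the other two, so its LOO-AUC is 0.0; under this reading, such a test would be a candidate for down-weighting.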

Claim Verification

Reproducibility Assessment

Low reproducibility (0%)

Missing Reproduction Details

No code repository available - ACES-C and ACES-O algorithms cannot be implemented
No data available - candidates and test cases from Huang et al. [2024] not provided
Incomplete algorithm implementation details - Appendix B referenced but not accessible, ACES-C details missing
No random seeds specified for tie-breaking and any stochastic components
No hardware specifications (GPU requirements, memory)
No software environment details (frameworks, dependencies, versions)
Preprocessing steps not fully specified beyond removing constant columns
Generation prompts and parameters for candidates/test cases only referenced (C.13) but not shown
Exact implementation of Pass@k estimator not provided
Runtime and computational cost details missing
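
For context on the Pass@k item above, the estimator most commonly used in the code-generation literature is the unbiased one introduced with HumanEval (Chen et al., 2021). The sketch below implements that standard formula; whether the ACES paper uses this exact form, or a ranking-based variant that checks the top k candidates after reweighting, is not stated in the text analyzed here, so treat this as an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k).

    n: number of sampled candidates, c: number that are correct,
    k: number of candidates drawn without replacement.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k draw contains a correct candidate
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 10 samples of which c = 3 are correct, `pass_at_k(10, 3, 1)` is 0.3, the plain fraction of correct samples.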

Limitations (as stated by the authors)

In the Hard region, no method passes more than 1 of 14 tasks, suggesting the pass matrix alone is insufficient when average test quality is very low and complementary signals such as static analysis may be needed.
The LOO-AUC framework opens several research directions. Incorporating correlations among LLM-generated tests could yield tighter bounds and stronger weighting schemes.
The surrogate objective is non-convex because LOO-AUC_j(w) depends on w through the leave-one-out scores, so gradient ascent converges to a stationary point rather than a global maximum.
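
To make the last limitation concrete, here is a rough sketch of gradient ascent on a smoothed LOO-AUC. It is not the paper's objective, which this analysis does not reproduce: the sigmoid smoothing, the temperature `tau`, the finite-difference gradient, the renormalization of weights, and all function names are assumptions chosen to keep the example short. Because the objective is non-convex, the loop keeps the best iterate it has seen and makes no claim of reaching a global maximum.

```python
import numpy as np

def smooth_loo_auc(P, w, tau=0.1):
    """Mean over tests of a sigmoid-smoothed LOO-AUC (hypothetical surrogate)."""
    P = np.asarray(P, dtype=float)
    w = np.asarray(w, dtype=float)
    total = P @ w
    vals = []
    for j in range(P.shape[1]):
        s = total - w[j] * P[:, j]  # leave-one-out scores for test j
        pos, neg = s[P[:, j] == 1], s[P[:, j] == 0]
        if len(pos) == 0 or len(neg) == 0:
            continue  # constant column carries no ranking signal
        diff = pos[:, None] - neg[None, :]
        vals.append((1.0 / (1.0 + np.exp(-diff / tau))).mean())
    return float(np.mean(vals)) if vals else float("nan")

def numeric_grad(f, w, eps=1e-5):
    """Central finite-difference gradient of f at w."""
    g = np.zeros_like(w)
    for k in range(len(w)):
        e = np.zeros_like(w)
        e[k] = eps
        g[k] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

def gradient_ascent(P, w0, lr=0.5, steps=200, tau=0.1):
    """Projected gradient ascent; returns the best iterate seen."""
    def f(v):
        return smooth_loo_auc(P, v, tau)

    w = np.asarray(w0, dtype=float)
    w = w / w.sum()  # assumes w0 is non-negative with positive sum
    best, best_val = w.copy(), f(w)
    for _ in range(steps):
        w_new = np.clip(w + lr * numeric_grad(f, w), 0.0, None)
        if w_new.sum() <= 0:
            break  # all weights clipped to zero; stop early
        w = w_new / w_new.sum()  # keep weights on the simplex
        val = f(w)
        if val > best_val:
            best, best_val = w.copy(), val
    return best

# Toy pass matrix: 4 candidates (rows) x 3 tests (columns).
P_toy = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
])
w_opt = gradient_ascent(P_toy, np.ones(3))
print(w_opt, smooth_loo_auc(P_toy, w_opt))
```

Different starting weights can end at different stationary points, which is the behavior the authors' limitation describes.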

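This analysis was generated automatically by a PDF reading assistant. It is for reference only and does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may be biased. For the original paper, see arXiv.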

Analysis time: 2026-04-25T13:23:42+00:00 · Data source: Paper Collector