ACES breaks the circular dependency in code-test evaluation using leave-one-out AUC to measure test quality without knowing code correctness. ACES-O achieves 84.
Core Question
How can we evaluate test quality and select correct code solutions when neither generated code nor test cases are guaranteed to be correct?
Core Method
Approach: The authors formalize code ranking as a weighted voting problem over a pass matrix and introduce LOO-AUC to measure test quality by evaluating each test's consistency with rankings from remaining tests. ACES-C computes closed-form weights using pass-rate correction under an average-quality assumption, while ACES-O jointly optimizes weights through a differentiable LOO-AUC objective without requiring this assumption.
Key components:
- Post-hoc execution-only baselines: Majority Voting, CodeT (consensus-set scoring), and MBR-exec (pairwise output comparison).
- Methods requiring additional information: SC+Spec, Self-collaborate, MPSC, and DS3 (static analysis).
- Direct inference baselines from GPT-3.5-Turbo, GPT-4, DeepSeek-Coder, WizardCoder, and CodeLlama.
- All post-hoc methods use identical GPT-3.5-Turbo candidates and test cases for fair comparison.
- ACES is evaluated on Qwen2.5-Coder-7B, Qwen2.5-Coder-14B, and DeepSeek-Coder-V2-16B.
- Benchmarks include HumanEval, HumanEval+, and LeetCodeDataset.
- Table 5 reports all tasks; Table 6 reports non-trivial tasks with 0 < n+ < n.
Source sections: sec_12, sec_27
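The weighted-voting and LOO-AUC ideas in the approach description can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `loo_auc` and `rank_codes`, the half-credit tie handling, and the reading of LOO-AUC as "the AUC of test j's pass/fail labels against scores computed from all other tests" are assumptions made here for illustration.

```python
import numpy as np

def loo_auc(pass_matrix, weights, j):
    """AUC of test j's pass/fail column against leave-one-out scores.

    pass_matrix[i, k] is 1 if candidate i passes test k, else 0.
    Assumed reading: a test is 'good' if candidates that pass it also
    tend to score higher under the weighted vote of the remaining tests.
    """
    B = np.asarray(pass_matrix, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Weighted vote with test j's own contribution removed.
    loo_scores = B @ w - w[j] * B[:, j]
    passed = loo_scores[B[:, j] == 1]
    failed = loo_scores[B[:, j] == 0]
    if passed.size == 0 or failed.size == 0:
        return float("nan")  # constant column: AUC is undefined
    diff = passed[:, None] - failed[None, :]
    # Pairwise win rate; ties get half credit (an assumption).
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def rank_codes(pass_matrix, weights):
    """Order candidates by weighted vote, best first (stable on ties)."""
    B = np.asarray(pass_matrix, dtype=float)
    w = np.asarray(weights, dtype=float)
    return np.argsort(-(B @ w), kind="stable")

# Toy pass matrix: 4 candidates (rows) x 3 tests (columns).
B = [[1, 1, 0],
     [1, 1, 0],
     [0, 1, 1],
     [0, 0, 1]]
uniform = [1.0, 1.0, 1.0]
aucs = [loo_auc(B, uniform, j) for j in range(3)]  # [0.25, 0.5, 0.0]
```

With uniform weights, test 2 gets an LOO-AUC of 0.0 in this toy matrix because the candidates that pass it are exactly the ones the other tests rank lowest; a weighting scheme could then down-weight such a test.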
Claim Verification
Reproducibility Assessment
Low reproducibility (0%)
Missing reproduction details
- No code repository available: ACES-C and ACES-O algorithms cannot be implemented
- No data available: candidates and test cases from Huang et al. [2024] not provided
- Incomplete algorithm implementation details: Appendix B referenced but not accessible, ACES-C details missing
- No random seeds specified for tie-breaking and any stochastic components
- No hardware specifications (GPU requirements, memory)
- No software environment details (frameworks, dependencies, versions)
- Preprocessing steps not fully specified beyond removing constant columns
- Generation prompts and parameters for candidates/test cases only referenced (C.13) but not shown
- Exact implementation of Pass@k estimator not provided
- Runtime and computational cost details missing
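Two of the gaps listed above, the Pass@k estimator and the constant-column preprocessing, have common default forms that a reproduction attempt might start from. The sketch below uses the widely used unbiased Pass@k estimator, 1 - C(n-c, k) / C(n, k), and drops tests that every candidate passes or every candidate fails. Whether the paper uses exactly these forms is an assumption, and the function names are hypothetical.

```python
from math import comb
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased Pass@k: n samples drawn, c of them correct, budget k.

    Assumed here; the analyzed paper does not spell out its estimator.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def drop_constant_columns(pass_matrix):
    """Remove tests passed by all candidates or by none.

    Such columns cannot change the relative order of candidates.
    """
    B = np.asarray(pass_matrix)
    keep = B.min(axis=0) != B.max(axis=0)
    return B[:, keep]
```

For example, `pass_at_k(10, 3, 1)` is 1 - 7/10, and `drop_constant_columns([[1, 1, 0], [1, 0, 0]])` keeps only the middle column.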
Limitations (as stated by the authors)
- In the Hard region, no method passes more than 1 of 14 tasks, suggesting that the pass matrix alone is insufficient when average test quality is very low and that complementary signals such as static analysis may be needed.
- The LOO-AUC framework opens several research directions. Incorporating correlations among LLM-generated tests could yield tighter bounds and stronger weighting schemes.
- The surrogate objective is non-convex because LOO-AUC_j(w) depends on w through the leave-one-out scores, so gradient ascent converges to a stationary point rather than a global maximum.
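The non-convexity point can be made concrete with a small sketch of gradient ascent on a smoothed LOO-AUC. Everything specific here is an assumption for illustration and not taken from the paper: the sigmoid surrogate with temperature `tau`, the softmax parameterization of the weights, the finite-difference gradient (used in place of automatic differentiation), and the step-halving rule.

```python
import numpy as np

def soft_loo_auc(B, w, tau=0.1):
    """Mean over tests of a sigmoid-smoothed LOO-AUC (assumed surrogate)."""
    full = B @ w
    total, count = 0.0, 0
    for j in range(B.shape[1]):
        loo = full - w[j] * B[:, j]  # scores without test j
        passed, failed = loo[B[:, j] == 1], loo[B[:, j] == 0]
        if passed.size == 0 or failed.size == 0:
            continue  # constant column: AUC undefined, skip it
        diff = passed[:, None] - failed[None, :]
        total += np.mean(1.0 / (1.0 + np.exp(-diff / tau)))
        count += 1
    return total / max(count, 1)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def ascend(B, theta0, steps=200, lr=1.0, eps=1e-5):
    """Gradient ascent with a central finite-difference gradient.

    A step is accepted only if it improves the objective; otherwise the
    step size is halved. The result is a local solution that depends on
    theta0, not a certified global maximum.
    """
    B = np.asarray(B, dtype=float)
    theta = np.asarray(theta0, dtype=float).copy()
    objective = lambda t: soft_loo_auc(B, softmax(t))
    best = objective(theta)
    for _ in range(steps):
        grad = np.array([
            (objective(theta + eps * e) - objective(theta - eps * e)) / (2 * eps)
            for e in np.eye(theta.size)
        ])
        candidate = theta + lr * grad
        value = objective(candidate)
        if value > best:
            theta, best = candidate, value
        else:
            lr *= 0.5
    return softmax(theta), best
```

Running `ascend` from several different `theta0` values and comparing the returned objective values is one simple way to see whether different starting points end at different stationary points.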
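This analysis was generated automatically by the PDF Reading Assistant and is for reference only; it does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper's text and may be inaccurate. For the original paper, see arXiv.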
Analysis time: 2026-04-25T13:23:42+00:00 · Data source: Paper Collector