TL;DR
ACES breaks the circular dependency in code-test evaluation by using leave-one-out AUC (LOO-AUC) to measure test quality without knowing which code is correct. ACES-O achieves 84.
Verified: 0
Insufficient evidence: 0
Unverifiable: 0
Reproducibility: N/A
Confidence: 0%

Core Question

How can we evaluate test quality and select correct code solutions when neither generated code nor test cases are guaranteed to be correct?

Core Method

Approach: The authors formalize code ranking as a weighted voting problem over a pass matrix and introduce LOO-AUC to measure test quality by evaluating each test's consistency with rankings from the remaining tests. ACES-C computes closed-form weights using pass-rate correction under an average-quality assumption, while ACES-O jointly optimizes weights through a differentiable LOO-AUC objective without requiring this assumption.

Key components:

- Post-hoc execution-only baselines: Majority Voting, CodeT (consensus-set scoring), and MBR-exec (pairwise output comparison).
- Methods requiring additional information: SC+Spec, Self-collaborate, MPSC, and DS3 (static analysis).
- Direct inference baselines from GPT-3.5-Turbo, GPT-4, DeepSeek-Coder, WizardCoder, and CodeLlama.
- All post-hoc methods use identical GPT-3.5-Turbo candidates and test cases for fair comparison.
- ACES is evaluated on Qwen2.5-Coder-7B, Qwen2.5-Coder-14B, and DeepSeek-Coder-V2-16B.
- Benchmarks include HumanEval, HumanEval+, and LeetCodeDataset.
- Table 5 reports all tasks; Table 6 reports non-trivial tasks with 0 < n+ < n.

Source sections: sec_12, sec_27
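
The weighted-voting and LOO-AUC ideas described above can be illustrated with a small sketch. This is one plausible reading of the summary, not the authors' implementation: the function names, the toy pass matrix, the uniform weights, and the tie-handling rule (ties count as 0.5) are all assumptions made for illustration. It does not implement the ACES-C closed-form weights or the ACES-O optimization, whose details are not given here.

```python
import numpy as np

def weighted_vote_scores(pass_matrix, weights):
    """Score each candidate as the weighted sum of the tests it passes."""
    return pass_matrix @ weights

def auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as 0.5 (an assumption). Returns None when the labels do not
    split the candidates into two non-empty groups.
    """
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        return None
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def loo_auc(pass_matrix, weights, j):
    """AUC of test j's pass/fail column against scores from all other tests."""
    keep = np.ones(pass_matrix.shape[1], dtype=bool)
    keep[j] = False  # leave test j out of the ranking
    scores = pass_matrix[:, keep] @ weights[keep]
    return auc(scores, pass_matrix[:, j])

# Hypothetical pass matrix: rows are code candidates, columns are tests.
# Entry (i, j) is 1 if candidate i passes test j.
B = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
])
w = np.ones(B.shape[1])  # uniform weights, i.e. plain majority voting

scores = weighted_vote_scores(B, w)
loo = [loo_auc(B, w, j) for j in range(B.shape[1])]
print("candidate scores:", scores)
print("per-test LOO-AUC:", loo)
```

In this toy matrix the last test passes exactly the candidates that the other tests rank lowest, so its LOO-AUC is 0.0. A weighting scheme could use that signal to down-weight the test without ever knowing which candidate is actually correct.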

Claim Verification

Reproducibility Assessment

Low reproducibility (0%)

Missing Reproduction Details

Limitations (as stated by the authors)

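This analysis was generated automatically by PDF Reading Assistant and is for reference only; it does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.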

Analysis time: 2026-04-25T13:23:42+00:00 · Data source: Paper Collector