CORAL introduces autonomous multi-agent evolution where agents control decisions through shared memory and heartbeats. It achieves new SOTA on 8 of 11 tasks with 3-10× higher improvement rates and 10× fewer evaluations than fixed evolutionary search, demonstrating that autonomous agents enable more…
Core question
Can stronger performance emerge when evolutionary algorithm decisions are delegated to autonomous agents, and can multiple autonomous agents scale more effectively through horizontal parallelism by exploring in parallel and building on each other's progress?
Core method
CORAL implements autonomous multi-agent evolution through three mechanisms:
- shared persistent memory, structured as a file system with attempts/, notes/, and skills/ folders;
- asynchronous multi-agent organization, where N agents run in isolated workspaces while sharing access to evaluators and memory;
- heartbeat mechanisms for reflection, consolidation, and redirection.
The framework is evaluated on 6 mathematical optimization tasks, 5 systems optimization tasks, and 2 stress-test problems, comparing against OpenEvolve, ShinkaEvolve, and EvoX baselines. (Source sections: sec_4, sec_43.)
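As a rough illustration of the first mechanism, the sketch below builds a shared memory root with the three folders named above and links them into per-agent workspaces with symbolic links, so that every agent sees the same files. The function names, the `agent_<id>` directory naming, and the temporary-directory demo are assumptions made for this example; they are not taken from the CORAL implementation.

```python
import tempfile
from pathlib import Path

# Folder names come from the summary above; everything else is hypothetical.
MEMORY_FOLDERS = ("attempts", "notes", "skills")


def create_shared_memory(root: Path) -> Path:
    """Create the shared memory root with its three folders."""
    for name in MEMORY_FOLDERS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def create_workspace(base: Path, agent_id: int, memory: Path) -> Path:
    """Create an isolated workspace and symlink the shared folders into it."""
    workspace = base / f"agent_{agent_id}"  # hypothetical naming scheme
    workspace.mkdir(parents=True, exist_ok=True)
    for name in MEMORY_FOLDERS:
        link = workspace / name
        if not link.exists():
            link.symlink_to(memory / name, target_is_directory=True)
    return workspace


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        memory = create_shared_memory(base / "memory")
        ws0 = create_workspace(base, 0, memory)
        ws1 = create_workspace(base, 1, memory)
        # A note written through agent 0's workspace is visible to agent 1.
        (ws0 / "notes" / "idea.md").write_text("try a different tiling")
        print((ws1 / "notes" / "idea.md").read_text())
```

Because the workspace entries are links rather than copies, there is a single source of truth for attempts, notes, and skills, which is one simple way to get the consistency the paper attributes to its symlinked layout.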
Claim verification
The paper provides a complete specification of the CORAL framework in Sections 3-4 and Appendix C, including detailed descriptions of shared persistent memory (p_13-15), multi-agent organization (p_16-17), and heartbeat mechanisms (p_18-20). The fram
The paper clearly distinguishes three paradigms in Section 3: fixed evolutionary search (p_10), autonomous single-agent evolution (p_11), and autonomous multi-agent evolution (p_11). Figure 1 illustrates the progression. This is a well-articulated co
All three mechanisms are fully specified and implemented: shared persistent memory (p_13-15), asynchronous multi-agent organization (p_16-17), and heartbeat-based interventions (p_18-20). The framework is demonstrated through experiments.
The paper explicitly states this design in p_13: 'CORAL's shared persistent memory M is structured as a file system with symbolic links to an agent's workspace (also a file system) to maintain consistency.' This is a concrete implementation detail.
The three root folders are explicitly defined in p_14-15 with clear descriptions of their purposes: attempts/, notes/, and skills/. This is a concrete design specification.
The paper explicitly states this design in p_16: 'CORAL naturally extends from a single autonomous agent to a population of N agents that run asynchronously. Each agent i maintains its own local context C(i)t and executes in an isolated workspace whi
The paper explicitly states this in p_17: 'coordination between agents occurs primarily through shared persistent memory. Similar to the single-agent scenario, each agent may autonomously read and write to a shared workspace.'
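The coordination pattern quoted above (independent agents that interact only by reading and writing shared memory) can be sketched with cooperative coroutines. This is a toy model under loud assumptions: the quadratic `evaluate` function, the JSON record format, the file naming, and the "mutate the current best attempt" policy are all invented for illustration, and real agents would be separate LLM-driven processes rather than `asyncio` tasks.

```python
import asyncio
import json
import random
import tempfile
from pathlib import Path
from typing import Optional


def evaluate(candidate: float) -> float:
    """Toy shared evaluator (hypothetical): higher is better, peak at 3.0."""
    return -(candidate - 3.0) ** 2


def best_attempt(attempts_dir: Path) -> Optional[dict]:
    """Read every recorded attempt and return the highest-scoring one."""
    records = [json.loads(p.read_text()) for p in attempts_dir.glob("*.json")]
    return max(records, key=lambda r: r["score"], default=None)


async def run_agent(agent_id: int, attempts_dir: Path, steps: int) -> None:
    rng = random.Random(agent_id)  # agent-local state, never shared
    for step in range(steps):
        parent = best_attempt(attempts_dir)  # may be another agent's attempt
        start = parent["candidate"] if parent else 0.0
        candidate = start + rng.uniform(-1.0, 1.0)
        record = {
            "agent": agent_id,
            "parent_agent": parent["agent"] if parent else None,
            "candidate": candidate,
            "score": evaluate(candidate),
        }
        path = attempts_dir / f"a{agent_id}_s{step}.json"
        path.write_text(json.dumps(record))
        await asyncio.sleep(0)  # yield so the agents interleave


async def run_population(attempts_dir: Path, n_agents: int, steps: int) -> dict:
    agents = (run_agent(i, attempts_dir, steps) for i in range(n_agents))
    await asyncio.gather(*agents)
    return best_attempt(attempts_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        best = asyncio.run(run_population(Path(tmp), n_agents=4, steps=10))
        print(best)
```

No agent calls another agent directly; the only channel is the shared attempts directory, which is the property the quoted passage emphasizes.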
The paper explicitly states this in p_18: 'CORAL imposes a heartbeat mechanism that functions like a Reminder App, periodically prompting the agents to exercise self-reflection and pivoting for new ideas when existing approaches plateau.'
The paper explicitly describes all three heartbeat types in p_20 with specific trigger conditions and purposes. This is a complete design specification.
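To make the idea of trigger-based heartbeats concrete, here is a minimal sketch of a scheduler that decides which of the three heartbeat types (reflection, consolidation, redirection) is due after each attempt. The trigger conditions are not given in this excerpt, so every rule and threshold below is a hypothetical placeholder: periodic reflection, less frequent consolidation, and redirection when recent attempts fail to beat the earlier best.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HeartbeatConfig:
    # All thresholds are hypothetical placeholders, not values from the paper.
    reflect_every: int = 3       # attempts between reflection prompts
    consolidate_every: int = 10  # attempts between consolidation prompts
    plateau_window: int = 5      # recent attempts checked for a plateau


def due_heartbeats(scores: List[float], cfg: HeartbeatConfig) -> List[str]:
    """Return the heartbeats due after the latest attempt (higher is better)."""
    n = len(scores)
    due: List[str] = []
    if n == 0:
        return due
    if n % cfg.reflect_every == 0:
        due.append("reflection")
    if n % cfg.consolidate_every == 0:
        due.append("consolidation")
    if n > cfg.plateau_window:
        best_before = max(scores[:-cfg.plateau_window])
        best_recent = max(scores[-cfg.plateau_window:])
        if best_recent <= best_before:  # no recent attempt set a new best
            due.append("redirection")
    return due


if __name__ == "__main__":
    cfg = HeartbeatConfig()
    history = [5.0, 1.0, 1.0, 1.0, 1.0, 1.0]
    print(due_heartbeats(history, cfg))  # ['reflection', 'redirection']
```

In a real system each returned label would map to a prompt injected into the agent's context; here the function only reports which prompts would fire.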
The claim states '2.5× higher improvement rate' but p_28 states 'CORAL's improvement rate is 3-10× higher.' These numbers are inconsistent. The '10× fewer evaluations' claim is roughly consistent with p_28 ('5-20 evaluations versus 60-100'), but the
The paper provides specific numbers in p_3 and p_21: official best score of 1,363 cycles, and CORAL achieves 1,103 cycles. The math checks out: (1363-1103)/1363 ≈ 19.1%, approximately 20%.
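The arithmetic in this check can be reproduced in a couple of lines. The helper name `relative_reduction` is introduced here for illustration only; the two cycle counts are the ones quoted above.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Cycle counts quoted in the verification note above.
reduction = relative_reduction(1363, 1103)
print(f"{reduction:.1%}")  # 19.1%
```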
The paper states in p_28: 'As shown in Table 1, CORAL achieves the best final score on all 11 tasks, establishing new SOTA on 8 tasks.' Without seeing the actual table data, I accept the paper's explicit claim with moderate confidence.
The paper states in p_28: 'CORAL's improvement rate is 3-10× higher, and it typically converges within 5-20 evaluations versus 60-100 for fixed evolutionary search methods.' Without seeing Table 1, I accept the paper's explicit claim.
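As a quick sanity check on how the quoted ranges relate to a "10× fewer evaluations" summary, the sketch below computes the smallest, largest, and midpoint ratios implied by "5-20 evaluations versus 60-100". The variable names are made up for this example, and the midpoint ratio is only one of several reasonable ways to summarize two ranges.

```python
# Evaluation-count ranges quoted from p_28 (low, high).
coral_evals = (5, 20)
fixed_evals = (60, 100)

lowest_ratio = fixed_evals[0] / coral_evals[1]   # most conservative pairing
highest_ratio = fixed_evals[1] / coral_evals[0]  # most favorable pairing
midpoint_ratio = (sum(fixed_evals) / 2) / (sum(coral_evals) / 2)

print(lowest_ratio, midpoint_ratio, highest_ratio)  # 3.0 6.4 20.0
```

A factor of 10 falls inside the 3× to 20× span, so the headline figure is compatible with the quoted ranges, though the ranges alone do not pin it down.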
The paper states in p_29: 'co-evolution achieving an 18.3% cycle reduction on Kernel Engineering and a 5.0% score increase on Polyominoes.' Without seeing Table 2, I accept the paper's explicit claim with moderate confidence.
The paper explicitly states in p_29 and p_46 that CORAL achieves new SOTA on Kernel Engineering without web search, and achieves 89.4% coverage on Polyominoes with web search, surpassing previous SOTA of 87%.
The paper states in p_30: 'When evaluated on the math and systems suites using a fully open-source stack (MiniMax M2.5 + OpenCode), 4-agent co-evolution consistently improves final scores over the single-agent counterpart across most tasks (Table 2).
The paper provides specific numbers in p_32: 'on Transaction (61% local test rate) and Kernel Engineering (57%)' with reference to Tables 4 and 5. The claim about local execution improving more often is also stated.
The paper provides specific numbers in p_33: 0.05 knowledge artifacts per attempt on standard tasks, 0.55 and 0.68 on advanced tasks (10× more), +2 percentage points gain on standard tasks, 55% improvement on Kernel Engineering vs 26% on standard tas
The paper provides specific numbers in p_36: '36% of attempts use another agent's commit as their parent, and these improve at 17% versus 9% for all attempts. The majority (66%) of new records originate from a cross-agent parent.'
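The quoted percentages also imply a rough improvement rate for attempts that build on an agent's own commits. The back-of-envelope calculation below is not reported in the paper; it assumes that the 9% figure covers all attempts (cross-agent ones included) and that the rounded percentages are exact, so the result is only approximate.

```python
# Figures quoted from p_36.
cross_share = 0.36  # share of attempts with a cross-agent parent
cross_rate = 0.17   # improvement rate of those attempts
all_rate = 0.09     # improvement rate over all attempts

# Assumption: all_rate = cross_share * cross_rate + (1 - cross_share) * same_rate
same_rate = (all_rate - cross_share * cross_rate) / (1 - cross_share)
lift = cross_rate / all_rate

print(f"implied same-agent improvement rate: {same_rate:.1%}")  # 4.5%
print(f"cross-agent lift over the overall rate: {lift:.2f}x")   # 1.89x
```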
The paper provides specific numbers in p_36: 'direct code transfer is rarer (12%) but still very powerful (50% versus a 19% average improvement rate); transfer instead occurs more often through shared notes and skills, with 87% of rounds referencing
... 48 claims in total
Reproducibility assessment
Low reproducibility (0%)
Missing reproduction details
- No code available - implementation cannot be reproduced without the CORAL system code
- No data/seed programs available - baseline comparisons require identical seed programs and evaluators
- Random seeds not specified for the 4 independent runs
- Heartbeat configuration details (Table 7) not provided in excerpt - critical hyperparameters unknown
- LLM API parameters not specified (temperature, top-p, etc.) for Claude Code + Opus 4.6 and MiniMax M2.5
- Exact evaluator implementations not provided
- Computational budget in terms of LLM calls/tokens not specified, only wall-clock time given
- Specific Linux environment details (OS version, dependencies) not provided
- Agent initialization details beyond 'identical initialization' not specified
- Exact benchmark task definitions and evaluation metrics not provided
Limitations (as stated by the authors)
- CORAL relies on frontier foundation models that can handle relatively complex coding-agent workflows, which makes full deployment on local devices difficult.
- Multi-agent evolution currently lacks bootstrapped heterogeneity: all agents are initialized identically and given access to the same information.
- Our current setting assumes the availability of a reasonably well-specified evaluator. However, for many important open-ended problems, evaluators are themselves difficult to obtain, incomplete, or even fundamentally ambiguous.
- An exciting direction for future work is therefore to train customized small models tailored to CORAL.
- Future work could inject distinct personalities, roles, or private information into different agents to encourage greater behavioral diversity and, in turn, a more efficient evolutionary process.
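This analysis was generated automatically by PDF 阅读助手 (PDF Reading Assistant). It is for reference only and does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-04-25T07:19:27+00:00 · Data source: Paper Collector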