CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery - AI Paper In-Depth Analysis

TL;DR
CORAL introduces autonomous multi-agent evolution where agents control decisions through shared memory and heartbeats. It achieves new SOTA on 8 of 11 tasks with 3-10× higher improvement rates and 10× fewer evaluations than fixed evolutionary search, demonstrating that autonomous agents enable more…

Verified

Insufficient evidence

Unverifiable

N/A

Reproducibility

Confidence

90%

Core Question

Can stronger performance emerge when evolutionary algorithm decisions are delegated to autonomous agents, and can multiple autonomous agents scale more effectively through horizontal parallelism by exploring in parallel and building on each other's progress?

Core Method

CORAL implements autonomous multi-agent evolution through three mechanisms: shared persistent memory structured as a file system with attempts/, notes/, and skills/ folders; asynchronous multi-agent organization where N agents run in isolated workspaces while sharing access to evaluators and memory; and heartbeat mechanisms for reflection, consolidation, and redirection. The framework is evaluated on 6 mathematical optimization tasks, 5 systems optimization tasks, and 2 stress-test problems, and is compared against the OpenEvolve, ShinkaEvolve, and EvoX baselines.

Claim Verification

Verified (95%) We introduce CORAL, a framework for autonomous multi-agent evolution on open-ended problems. CORAL shifts decision-making from fixed algorithms to the agents themselves, supported by a shared persistent memory for continuous evolution.
The paper provides a complete specification of the CORAL framework in Sections 3-4 and Appendix C, including detailed descriptions of shared persistent memory (p_13-15), multi-agent organization (p_16-17), and heartbeat mechanisms (p_18-20). The fram

Verified (90%) We formulate autonomous evolution as a distinct paradigm for open-ended discovery and distinguish autonomous single-agent and multi-agent evolution from prior fixed evolutionary search.
The paper clearly distinguishes three paradigms in Section 3: fixed evolutionary search (p_10), autonomous single-agent evolution (p_11), and autonomous multi-agent evolution (p_11). Figure 1 illustrates the progression. This is a well-articulated co

Verified (95%) We introduce CORAL, a framework that realizes this paradigm through shared persistent memory, asynchronous multi-agent organization, and heartbeat-based interventions for long-horizon search.
All three mechanisms are fully specified and implemented: shared persistent memory (p_13-15), asynchronous multi-agent organization (p_16-17), and heartbeat-based interventions (p_18-20). The framework is demonstrated through experiments.

Verified (95%) CORAL's shared persistent memory M is structured as a file system with symbolic links to an agent's workspace (also a file system) to maintain consistency.
The paper explicitly states this design in p_13: 'CORAL's shared persistent memory M is structured as a file system with symbolic links to an agent's workspace (also a file system) to maintain consistency.' This is a concrete implementation detail.

Verified (95%) We define three root folders storing different types of knowledge: attempts/ records historical evaluations and solutions; notes/ records observations, learnings, and reflections from all agents; skills/ records reusable procedures, tools, scripts, and implementation patterns transferable across attempts.
The three root folders are explicitly defined in p_14-15 with clear descriptions of their purposes: attempts/, notes/, and skills/. This is a concrete design specification.
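The folder layout and symbolic-link design described in the two claims above can be illustrated with a minimal sketch. Only the folder names attempts/, notes/, and skills/ come from the paper as quoted here; the function names, the agent_0/agent_1 workspace names, and the temporary-directory setup are hypothetical and are not CORAL's actual code.

```python
import tempfile
from pathlib import Path

# Root folders named in the paper; everything else below is illustrative.
MEMORY_FOLDERS = ("attempts", "notes", "skills")


def init_shared_memory(root: Path) -> Path:
    """Create the shared memory root with its three knowledge folders."""
    for name in MEMORY_FOLDERS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def link_workspace(workspace: Path, memory_root: Path) -> None:
    """Expose each shared folder inside an agent workspace via a symlink."""
    workspace.mkdir(parents=True, exist_ok=True)
    for name in MEMORY_FOLDERS:
        link = workspace / name
        if not link.exists():
            link.symlink_to(memory_root / name, target_is_directory=True)


def demo() -> list[str]:
    """Write a note through one workspace and list it through another."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        memory = init_shared_memory(base / "memory")
        for agent in ("agent_0", "agent_1"):  # hypothetical workspace names
            link_workspace(base / agent, memory)
        (base / "agent_0" / "notes" / "idea.md").write_text("try tiling")
        return sorted(p.name for p in (base / "agent_1" / "notes").iterdir())


print(demo())
```

Because both workspaces link to the same directories, a file written by one agent is immediately visible to the other without any copy step, which is one plausible reading of "to maintain consistency".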

Verified (95%) CORAL naturally extends from a single autonomous agent to a population of N agents that run asynchronously. Each agent i maintains its own local context C_t^{(i)} and executes in an isolated workspace while sharing access to the same evaluator and shared persistent memory M via symbolic link
The paper explicitly states this design in p_16: 'CORAL naturally extends from a single autonomous agent to a population of N agents that run asynchronously. Each agent i maintains its own local context C_t^{(i)} and executes in an isolated workspace whi

Verified (95%) Coordination between agents occurs primarily through shared persistent memory. Similar to the single-agent scenario, each agent may autonomously read and write to a shared workspace.
The paper explicitly states this in p_17: 'coordination between agents occurs primarily through shared persistent memory. Similar to the single-agent scenario, each agent may autonomously read and write to a shared workspace.'
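A toy sketch of this coordination pattern, under strong simplifying assumptions: threads stand in for agents, an in-memory list stands in for the shared attempts/ folder, and a one-dimensional evaluator stands in for a real task. None of the names below (SharedMemory, run_agent, run_population, evaluate) come from the paper.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedMemory:
    """In-memory stand-in for the shared attempts/ folder (an assumption)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.attempts: list[tuple[int, float, float]] = []  # (agent, candidate, score)

    def best(self):
        """Return the highest-scoring attempt so far, or None if empty."""
        with self._lock:
            return max(self.attempts, key=lambda a: a[2], default=None)

    def record(self, agent_id: int, candidate: float, score: float) -> None:
        with self._lock:
            self.attempts.append((agent_id, candidate, score))


def evaluate(candidate: float) -> float:
    """Toy shared evaluator: the score peaks at candidate == 10."""
    return -abs(candidate - 10.0)


def run_agent(agent_id: int, memory: SharedMemory, steps: int) -> None:
    candidate = float(agent_id)  # local state, private to this agent
    for _ in range(steps):
        top = memory.best()
        if top is not None and top[2] > evaluate(candidate):
            candidate = top[1]  # build on a better attempt, possibly another agent's
        if candidate < 10.0:
            candidate += 1.0  # toy "mutation": step toward the optimum
        memory.record(agent_id, candidate, evaluate(candidate))


def run_population(n_agents: int = 4, steps: int = 10) -> SharedMemory:
    memory = SharedMemory()
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        futures = [pool.submit(run_agent, i, memory, steps) for i in range(n_agents)]
        for future in futures:
            future.result()  # surface any exception raised inside an agent
    return memory


memory = run_population()
print(len(memory.attempts), memory.best())
```

The agents never message each other directly; the only channel is the shared store, which mirrors the claim that coordination happens primarily through memory.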

Verified (95%) CORAL imposes a heartbeat mechanism that functions like a Reminder App, periodically prompting the agents to exercise self-reflection and pivoting for new ideas when existing approaches plateau.
The paper explicitly states this in p_18: 'CORAL imposes a heartbeat mechanism that functions like a Reminder App, periodically prompting the agents to exercise self-reflection and pivoting for new ideas when existing approaches plateau.'

Verified (95%) CORAL implements three heartbeat types. The first is a per-iteration reflection heartbeat, which encourages the agent to record useful notes during ongoing work. The second is a periodic consolidation heartbeat, triggered after a fixed number of attempts, which prompts the agent to review progress, organize and refine accumulated notes, and distill reusable procedures into skills. The third is a stagnation-triggered redirection heartbeat, activated when the agent shows no improvement for several rounds
The paper explicitly describes all three heartbeat types in p_20 with specific trigger conditions and purposes. This is a complete design specification.
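The three trigger conditions can be sketched as a small scheduling function. The heartbeat names follow the claim above, but the thresholds (consolidate_every=5, stagnation_rounds=3) are hypothetical placeholders, since the paper's heartbeat configuration is not available in this excerpt; the function and class names are likewise assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatConfig:
    consolidate_every: int = 5  # hypothetical: attempts between consolidations
    stagnation_rounds: int = 3  # hypothetical: rounds without improvement


def heartbeats_for(attempt_index: int, stale_rounds: int, cfg: HeartbeatConfig) -> list[str]:
    """Return the heartbeats that fire after the given (1-based) attempt."""
    fired = ["reflection"]  # per-iteration heartbeat fires every time
    if attempt_index % cfg.consolidate_every == 0:
        fired.append("consolidation")  # periodic, after a fixed number of attempts
    if stale_rounds >= cfg.stagnation_rounds:
        fired.append("redirection")  # stagnation-triggered
    return fired


def simulate(scores: list[float], cfg: Optional[HeartbeatConfig] = None) -> list[list[str]]:
    """Replay a score trace (higher is better) and log fired heartbeats."""
    cfg = cfg or HeartbeatConfig()
    best = float("-inf")
    stale = 0
    log = []
    for i, score in enumerate(scores, start=1):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
        # Assumption: the stagnation counter is not reset after a redirection.
        log.append(heartbeats_for(i, stale, cfg))
    return log


for step, fired in enumerate(simulate([1, 2, 2, 2, 2]), start=1):
    print(step, fired)
```

With this trace, the first four attempts fire only the reflection heartbeat, while the fifth fires all three because it coincides with both the consolidation period and three rounds without improvement.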

Unverifiable (85%) CORAL establishes new SOTA on 8 of 11 tasks in mathematical and systems optimization, with a 2.5× higher improvement rate and 10× fewer evaluations than fixed evolutionary search baselines.
The claim states '2.5× higher improvement rate' but p_28 states 'CORAL's improvement rate is 3-10× higher.' These numbers are inconsistent. The '10× fewer evaluations' claim is roughly consistent with p_28 ('5-20 evaluations versus 60-100'), but the
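The consistency check in this rationale can be made explicit with a few lines of interval arithmetic. The ranges (5-20 versus 60-100 evaluations, and a 3-10× improvement rate) are the figures quoted from p_28; the helper name ratio_bounds is hypothetical.

```python
def ratio_bounds(baseline: tuple[float, float], coral: tuple[float, float]) -> tuple[float, float]:
    """Smallest and largest baseline/CORAL ratio implied by two ranges."""
    lowest = baseline[0] / coral[1]   # fewest baseline evals vs. most CORAL evals
    highest = baseline[1] / coral[0]  # most baseline evals vs. fewest CORAL evals
    return lowest, highest


# Evaluations to converge: 60-100 for fixed search vs. 5-20 for CORAL (p_28).
low, high = ratio_bounds((60, 100), (5, 20))
evals_claim_ok = low <= 10 <= high  # is "10x fewer evaluations" inside the range?

# Improvement rate: the claim says 2.5x, while p_28 reports 3-10x.
rate_claim_ok = 3 <= 2.5 <= 10

print(low, high, evals_claim_ok, rate_claim_ok)
```

The implied evaluation ratio spans 3× to 20×, so the 10× figure is compatible with the quoted ranges, whereas 2.5× falls below the reported 3-10× band.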

Verified (90%) On the stress-test Kernel Engineering task, four co-evolving agents push the score from 1,363 to 1,103 cycles (a 20% gain), surpassing the previous best result.
The paper provides specific numbers in p_3 and p_21: official best score of 1,363 cycles, and CORAL achieves 1,103 cycles. The math checks out: (1363-1103)/1363 ≈ 19.1%, approximately 20%.
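The arithmetic in this rationale can be reproduced directly. The two cycle counts are the ones quoted above; the helper name relative_reduction is only illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction when a lower-is-better metric drops from before to after."""
    return (before - after) / before


# Kernel Engineering: 1,363 cycles (previous best) to 1,103 cycles (CORAL).
pct = 100 * relative_reduction(1363, 1103)
print(f"{pct:.1f}%")
```

The result rounds to 19.1%, so the paper's "20% gain" is a rounded figure rather than an exact one.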

Verified (85%) CORAL achieves the best final score on all 11 tasks, establishing new SOTA on 8 tasks.
The paper states in p_28: 'As shown in Table 1, CORAL achieves the best final score on all 11 tasks, establishing new SOTA on 8 tasks.' Without seeing the actual table data, I accept the paper's explicit claim with moderate confidence.

Verified (85%) CORAL's improvement rate is 3-10× higher, and it typically converges within 5-20 evaluations versus 60-100 for fixed evolutionary search methods.
The paper states in p_28: 'CORAL's improvement rate is 3-10× higher, and it typically converges within 5-20 evaluations versus 60-100 for fixed evolutionary search methods.' Without seeing Table 1, I accept the paper's explicit claim.

Verified (80%) 4-agent co-evolution pushes performance even further. The largest improvements appear on the stress-test problems, where single-agent runs tend to plateau early, with co-evolution achieving an 18.3% cycle reduction on Kernel Engineering and a 5.0% score increase on Polyominoes.
The paper states in p_29: 'co-evolution achieving an 18.3% cycle reduction on Kernel Engineering and a 5.0% score increase on Polyominoes.' Without seeing Table 2, I accept the paper's explicit claim with moderate confidence.

Verified (90%) Without web search, CORAL already establishes a new SOTA on the Kernel Engineering task. With web search enabled, CORAL also achieves a new SOTA (89.4) on the Polyominoes problem
The paper explicitly states in p_29 and p_46 that CORAL achieves new SOTA on Kernel Engineering without web search, and achieves 89.4% coverage on Polyominoes with web search, surpassing previous SOTA of 87%.

Verified (80%) When evaluated on the math and systems suites using a fully open-source stack (MiniMax M2.5 + OpenCode), 4-agent co-evolution consistently improves final scores over the single-agent counterpart across most tasks
The paper states in p_30: 'When evaluated on the math and systems suites using a fully open-source stack (MiniMax M2.5 + OpenCode), 4-agent co-evolution consistently improves final scores over the single-agent counterpart across most tasks (Table 2).

Verified (85%) Attempts with local execution improve more often than the average attempt on the same task. This effect is the strongest on tasks involving compiled code: on Transaction (61% local test rate) and Kernel Engineering (57%)
The paper provides specific numbers in p_32: 'on Transaction (61% local test rate) and Kernel Engineering (57%)' with reference to Tables 4 and 5. The claim about local execution improving more often is also stated.

Verified (85%) On standard tasks, agents create only 0.05 knowledge artifacts per attempt, and knowledge access yields only a small gain (+2 percentage points over attempts without knowledge access). On advanced tasks, agents create over 10× more knowledge per attempt (0.55 and 0.68), and knowledge access is much more strongly associated with improvement: 55% on Kernel Engineering versus 26% on standard tasks
The paper provides specific numbers in p_33: 0.05 knowledge artifacts per attempt on standard tasks, 0.55 and 0.68 on advanced tasks (10× more), +2 percentage points gain on standard tasks, 55% improvement on Kernel Engineering vs 26% on standard tas

Verified (85%) On Kernel Engineering, 36% of attempts use another agent's commit as their parent, and these improve at 17% versus 9% for all attempts. The majority (66%) of new records originate from a cross-agent parent.
The paper provides specific numbers in p_36: '36% of attempts use another agent's commit as their parent, and these improve at 17% versus 9% for all attempts. The majority (66%) of new records originate from a cross-agent parent.'
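To make the three statistics in this claim concrete, the sketch below shows one way such figures could be computed from an attempt log. The Attempt record schema, the field names, and the toy data are all hypothetical; the paper's actual logging format is not described in this excerpt, and the numbers printed here are not the paper's.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Attempt:
    agent: str
    parent_agent: Optional[str]  # None for a seed attempt with no parent
    improved: bool               # did this attempt beat its parent?
    new_record: bool = False     # did it set a new global best?


def is_cross_agent(attempt: Attempt) -> bool:
    """True when the parent commit was produced by a different agent."""
    return attempt.parent_agent is not None and attempt.parent_agent != attempt.agent


def rate(flags: Iterable[bool]) -> float:
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def cross_agent_stats(attempts: list[Attempt]) -> dict[str, float]:
    cross = [a for a in attempts if is_cross_agent(a)]
    records = [a for a in attempts if a.new_record]
    return {
        "cross_share": rate(is_cross_agent(a) for a in attempts),
        "cross_improve_rate": rate(a.improved for a in cross),
        "overall_improve_rate": rate(a.improved for a in attempts),
        "record_cross_share": rate(is_cross_agent(a) for a in records),
    }


toy_log = [  # invented data for illustration only
    Attempt("a0", None, True, new_record=True),
    Attempt("a1", "a0", True, new_record=True),
    Attempt("a1", "a1", False),
    Attempt("a0", "a1", False),
    Attempt("a0", "a0", False),
]
stats = cross_agent_stats(toy_log)
print(stats)
```

In the paper's terms, cross_share corresponds to the 36% figure, cross_improve_rate and overall_improve_rate to the 17% versus 9% comparison, and record_cross_share to the 66% figure.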

Verified (85%) On Polyominoes, direct code transfer is rarer (12%) but still very powerful (50% versus a 19% average improvement rate); transfer instead occurs more often through shared notes and skills, with 87% of rounds referencing knowledge committed by other agents.
The paper provides specific numbers in p_36: 'direct code transfer is rarer (12%) but still very powerful (50% versus a 19% average improvement rate); transfer instead occurs more often through shared notes and skills, with 87% of rounds referencing

... 48 claims in total

Reproducibility Assessment

Low reproducibility (0%)

Missing Reproduction Details

No code available - implementation cannot be reproduced without the CORAL system code
No data/seed programs available - baseline comparisons require identical seed programs and evaluators
Random seeds not specified for the 4 independent runs
Heartbeat configuration details (Table 7) not provided in excerpt - critical hyperparameters unknown
LLM API parameters not specified (temperature, top-p, etc.) for Claude Code + Opus 4.6 and MiniMax M2.5
Exact evaluator implementations not provided
Computational budget in terms of LLM calls/tokens not specified, only wall-clock time given
Specific Linux environment details (OS version, dependencies) not provided
Agent initialization details beyond 'identical initialization' not specified
Exact benchmark task definitions and evaluation metrics not provided

Limitations (as stated by the authors)

CORAL relies on frontier foundation models that can handle relatively complex coding-agent workflows, which makes full deployment on local devices difficult.
Multi-agent evolution currently lacks bootstrapped heterogeneity: all agents are initialized identically and given access to the same information.
Our current setting assumes the availability of a reasonably well-specified evaluator. However, for many important open-ended problems, evaluators are themselves difficult to obtain, incomplete, or even fundamentally ambiguous.
An exciting direction for future work is therefore to train customized small models tailored to CORAL.
Future work could inject distinct personalities, roles, or private information into different agents to encourage greater behavioral diversity and, in turn, a more efficient evolutionary process.

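This analysis was generated automatically by the PDF reading assistant and is for reference only; it does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper text and may contain errors. For the original paper, please refer to arXiv.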

Analysis time: 2026-04-25T07:19:27+00:00 · Data source: Paper Collector