Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering - AI Paper In-Depth Analysis

TL;DR
This paper proposes externalization—relocating cognitive burdens into persistent external structures—as the unifying logic behind LLM agent advances. Memory converts recall to retrieval, skills convert generation to composition, and protocols convert ad-hoc interaction to structured exchange, coord…

Verified

Insufficient evidence

Unverifiable

N/A

Reproducibility

Confidence

83%

Core Question

What transition logic unifies recent advances in LLM agents across memory, skills, protocols, and harness engineering?

Core Method

Approach: The paper provides a systems-level review synthesizing existing work on LLM agents through the lens of cognitive externalization. It draws on cognitive science frameworks including Norman's cognitive artifacts, Kirsh's complementary strategies, and Hutchins' distributed cognition to analyze how memory, skills, protocols, and harnesses reorganize cognitive work.

Key components:
The central design question is how aggressively active reasoning should be separated from stored state.
Memory architecture design follows a taxonomy that guides how state is externalized and accessed.

Section IDs: sec_8

Claim Verification

Verified (85%) Our central thesis is that externalization (the progressive relocation of cognitive burdens from the model's internal computation into persistent, inspectable, and reusable external structures) is the transition logic (the mechanism that explains why each architectural shift has occurred and what forms of reliability it sought to preserve) that unifies recent advances in memory, skills, protocols, and harness engineering for language agents.
This is the central theoretical contribution of the paper. The paper systematically develops this thesis across Sections 2-7, providing extensive conceptual analysis, literature synthesis, and a coherent framework connecting memory, skills, protocols

Verified (90%) We offer a systems-level review organized around four claims: Memory systems externalize an agent's state across time and convert long-horizon continuity into selective retrieval. Skill systems externalize procedural expertise and convert implicit know-how into explicit reusable operating guidance. Protocols externalize interaction structure and convert ambiguous communication into interoperable, machine-readable contracts. Harness engineering unifies these externalized modules into a coherent runtime environment with constraints, observability, feedback loops, and control points.
The paper is explicitly organized around these four claims, devoting Sections 3, 4, 5, and 6 to each respectively. Each section provides detailed analysis, taxonomies, examples, and citations. The organizational structure itself demonstrates this con

Verified (85%) The recent history of LLM agents can be understood as a progressive movement outward from the model itself. Capabilities were first treated as properties of weights, then as properties of prompts and context windows, and are now increasingly treated as properties of the broader infrastructure in which the model operates.
The paper provides a well-documented historical analysis with Figure 2 visualizing the trajectory, specific timeline (2022-2026), and concrete examples at each stage (GPT-4, prompting techniques, RAG, Auto-GPT, BabyAGI, etc.). The three-layer model (

Verified (80%) We distinguish the following four dimensions of externalized state: Working context, Episodic experience, Semantic knowledge, and Personalized memory.
Section 3.1 develops this taxonomy with conceptual definitions and examples for each dimension: working context (immediate state), episodic experience (prior runs), semantic knowledge (abstractions), and personalized memory (user-specific information
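
The four dimensions named above can be pictured as fields of a single state container. The sketch below is illustrative only and is not taken from the paper: the class name `ExternalizedState`, its field names, and the word-overlap `retrieve` method are all assumptions chosen to show how externalized state might be stored and selectively retrieved.

```python
from dataclasses import dataclass, field


@dataclass
class ExternalizedState:
    """Hypothetical container for the four dimensions of externalized state."""
    working_context: dict = field(default_factory=dict)  # immediate task state
    episodic: list = field(default_factory=list)         # records of prior runs
    semantic: dict = field(default_factory=dict)         # abstracted knowledge
    personalized: dict = field(default_factory=dict)     # user-specific information

    def retrieve(self, query: str, limit: int = 3) -> list:
        """Return episodic entries sharing at least one word with the query.

        Word overlap is a stand-in for a real retriever (assumption).
        """
        words = set(query.lower().split())
        hits = [e for e in self.episodic if words & set(e.lower().split())]
        return hits[:limit]


state = ExternalizedState()
state.episodic.append("deploy failed on staging")
state.episodic.append("unit tests passed")
print(state.retrieve("why did deploy fail"))
```

Keeping the dimensions as separate fields makes it possible to apply different retention and access policies to each, which is one plausible reading of why the taxonomy distinguishes them.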

Verified (85%) We identify three coupled components of procedural expertise: operational procedures, decision heuristics, and normative constraints. Together they define the reusable unit of know-how that a harness can externalize.
Section 4.1 develops this three-component framework with conceptual analysis and citations. Operational procedures address process stability (citing Hsiao et al. 2025, Nandi et al. 2026), decision heuristics address branching choices (citing Gigerenz

Verified (85%) We identify four pathways by which procedural know-how enters the system: authored by experts, distilled from episodic memory and trajectories, discovered through environment exploration and self-induction, or composed from existing units.
Section 4.4 develops four acquisition pathways with concrete examples: authored (SKILL.md, AGENTS.md), distilled (Skill Set Optimization, MemSkill), discovered (Voyager, PolySkill), and composed (hierarchical skill repertoires). Each pathway has spec

Verified (80%) We propose a three-stage evolution from execution primitives to capability packages: Stage 1: Atomic Execution Primitives, Stage 2: Large-scale Primitive Selection, Stage 3: Skill as Packaged Expertise.
Section 4.2 develops this three-stage evolution with specific examples: Stage 1 (Toolformer for atomic execution), Stage 2 (Gorilla, ToolLLM, ToolNet for large-scale selection), Stage 3 (program-based skill induction, web skill libraries, computer-us

Verified (80%) We identify four main boundary conditions for skill reliability: semantic alignment, portability and staleness, unsafe composition, and context-dependent degradation.
Section 4.5 develops four boundary conditions with citations: semantic alignment (SkillProbe, Ross et al. 2025), portability and staleness (Wang et al. 2025c, SkillsBench), unsafe composition (Liu et al. 2026, Wang et al. 2026c), and context-dependen

Verified (85%) We identify four couplings that connect skills to the broader harness: conditioning on memory, binding through protocols, runtime governance, and lifecycle feedback.
Section 4.6 develops four couplings with conceptual analysis: conditioning on memory (retrieved state informs skill selection), binding through protocols (skills grounded via protocolized interfaces), runtime governance (permission checks, approval g
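
A minimal sketch of how the four couplings might meet in one harness call is given below. It is an assumption-laden illustration, not the paper's design: `Skill`, `execute`, the `requires` permission set, and the `outcomes` log are hypothetical names. The permission check stands in for runtime governance, the dict-in/string-out callable for a protocolized interface, the `memory` argument for conditioning on retrieved state, and the outcome log for lifecycle feedback.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Skill:
    """Hypothetical skill record used only for this illustration."""
    name: str
    requires: set                 # permissions the skill needs
    run: Callable[[dict], str]    # fixed interface: dict in, string out
    outcomes: list = field(default_factory=list)  # success/failure history


def execute(skill: Skill, memory: list, granted: set) -> str:
    # Runtime governance: refuse to run if any required permission is missing.
    missing = skill.requires - granted
    if missing:
        skill.outcomes.append(False)  # lifecycle feedback: record the denial
        return f"denied: missing {sorted(missing)}"
    # Conditioning on memory: retrieved state is passed through the interface.
    result = skill.run({"memory": memory})
    skill.outcomes.append(True)       # lifecycle feedback: record the success
    return result


summarize = Skill(
    name="summarize",
    requires={"read"},
    run=lambda args: f"summary of {len(args['memory'])} notes",
)
print(execute(summarize, ["note a", "note b"], granted={"read"}))
print(execute(summarize, [], granted=set()))
```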

Unverifiable (95%) The 'lost in the middle' phenomenon shows that models attend unevenly across long inputs, with retrieval accuracy dropping sharply for information placed in the center of the context.
This is a claim about external research findings (Liu et al. 2024a). The paper cites this finding but does not reproduce the original data, methodology, or experimental setup. The claim cannot be verified from this paper alone - it requires accessing

Unverifiable (95%) Large-scale empirical studies of public skill ecosystems report substantial rates of vulnerabilities, including prompt injection, data exfiltration, privilege escalation, and supply-chain risk.
This is a claim about external research findings (Liu et al. 2026). The paper reports this finding from another study but does not provide the original empirical data, methodology, or specific vulnerability rates. Cannot be verified from this paper a

Unverifiable (95%) Attack-oriented studies further show that skill files themselves can become realistic prompt-injection surfaces for current agents.
This is a claim about external research findings (Wang et al. 2026c). The paper cites this attack-oriented study but does not reproduce the experiments or provide the original evidence. Cannot be verified from this paper alone.

Unverifiable (95%) SkillsBench indicates that skill utility varies substantially across domains and model-agent configurations.
This is a claim about external research findings (SkillsBench, Li et al. 2026c). The paper reports this finding but does not provide the benchmark methodology, specific domains tested, or quantitative variation data. Cannot be verified from this pape

Unverifiable (95%) SkillProbe identifies semantic-behavioral inconsistency as a fundamental flaw in existing skill marketplaces.
This is a claim about external research findings (SkillProbe, Guo et al. 2026). The paper cites this study but does not provide the original methodology, specific inconsistencies identified, or empirical evidence. Cannot be verified from this paper a

Verified (80%) A central limitation of parametric knowledge is that it is difficult to selectively update, compose, and govern.
This is a conceptual limitation claim that the paper develops with reasoning and citations. The paper explains why parametric knowledge is difficult to selectively update (requires retraining), compose (coupled in weights), and govern (distributed ac

Verified (85%) Context windows are finite, costly at scale, and often noisy when overloaded with marginally relevant material.
This is a well-established limitation that the paper discusses with conceptual analysis. The paper explains the finiteness (token limits), cost at scale (computational overhead), and noise issues (marginally relevant material degrading performance).

Verified (85%) Context is also ephemeral: unless state is explicitly externalized elsewhere, every new session begins with partial amnesia.
This is a conceptual limitation that the paper develops clearly. The paper explains that without explicit externalization, each new session starts fresh ('partial amnesia'), which is a fundamental architectural constraint of current LLM systems.
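
One simple way to picture "explicitly externalizing" session state is to write it to a file at the end of a session and read it back at the start of the next. The sketch below assumes a JSON file as the external store; the function names `save_state` and `load_state` and the example keys are hypothetical and not drawn from the paper.

```python
import json
import tempfile
from pathlib import Path


def save_state(path: Path, state: dict) -> None:
    """Persist session state so a later session can resume from it."""
    path.write_text(json.dumps(state))


def load_state(path: Path) -> dict:
    """Load prior state; a missing file means the session starts empty."""
    return json.loads(path.read_text()) if path.exists() else {}


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "agent_state.json"
    print(load_state(store))                       # first session: nothing to recall
    save_state(store, {"task": "draft", "step": 2})
    print(load_state(store))                       # next session: state is recovered
```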

Verified (80%) Even as context lengths have expanded dramatically, from 2K tokens to over 100K and beyond, the fundamental tension persists: more capacity does not eliminate the need for selective curation.
This is a conceptual claim about the persistence of curation needs despite capacity expansion. The paper provides reasoning: expanded context (2K to 100K+ tokens) doesn't eliminate the need for selective curation because the fundamental tension betwe
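
Selective curation under a fixed budget can be illustrated with a greedy selection over scored snippets. This is a sketch under stated assumptions rather than a method from the paper: the function name `curate`, the relevance scores, and the whitespace word count used as a crude token estimate are all hypothetical.

```python
def curate(snippets: list, budget: int) -> list:
    """Greedily keep the highest-scoring snippets that fit within the budget.

    `snippets` is a list of (relevance_score, text) pairs; cost is estimated
    as the number of whitespace-separated words (an assumption, not a tokenizer).
    """
    chosen, used = [], 0
    for score, text in sorted(snippets, key=lambda s: s[0], reverse=True):
        cost = len(text.split())
        if used + cost <= budget:
            chosen.append(text)
            used += cost
    return chosen


candidates = [
    (0.9, "error log from last run"),
    (0.2, "unrelated changelog entry from 2019"),
    (0.7, "user prefers short answers"),
]
print(curate(candidates, budget=9))
```

Raising `budget` admits more snippets, but the low-relevance entry is still the last one considered, which mirrors the point that a larger window changes how much fits rather than removing the need to rank what goes in.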

Verified (80%) Updating a single fact (say, the current head of state of a country) requires retraining, knowledge editing, or patching through additional alignment layers, all of which risk unintended side effects on other capabilities.
This is a conceptual limitation with supporting citations. The paper explains that updating a single fact requires retraining, knowledge editing, or alignment patches, with citations to knowledge editing literature (Meng et al. 2022, Mitchell et al.
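
For contrast with the parametric case described above, the sketch below shows what a single-fact update looks like when the fact lives in an external store: one key changes, other entries are untouched, and the change is recorded. The class `FactStore`, its `history` log, and the placeholder keys and values are hypothetical illustrations, not part of the paper.

```python
from dataclasses import dataclass, field


@dataclass
class FactStore:
    """Hypothetical external key-value store with a change log."""
    facts: dict = field(default_factory=dict)
    history: list = field(default_factory=list)  # (key, old_value, new_value)

    def update(self, key: str, value: str) -> None:
        # Record the old value first so the change is inspectable afterwards.
        self.history.append((key, self.facts.get(key), value))
        self.facts[key] = value


store = FactStore({"head_of_state:Exampleland": "A", "capital:Exampleland": "C"})
store.update("head_of_state:Exampleland", "B")
print(store.facts)
print(store.history)
```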

Verified (80%) Auditing why a model behaved a certain way is difficult because relevant knowledge is distributed across billions of parameters rather than encoded as inspectable modules.
This is a conceptual limitation with supporting citation. The paper explains that auditing is difficult because knowledge is distributed across billions of parameters rather than in inspectable modules, citing Zhao et al. 2024 on explainability.

... 41 claims in total

Reproducibility Assessment

Low reproducibility (0%)

Missing Reproduction Details

No code repository available
No data or supplementary materials available
Unclear availability statements - fragmented text does not indicate clear access to resources
No methodological details provided for the review/survey process
No criteria for paper selection or inclusion in the review
No details on how the taxonomy/framework was developed
No information about systematic review methodology
No details on analysis procedures or evaluation criteria
Reference to Du [2026a] taxonomy but no implementation details provided

Limitations (as stated by the authors)

A central limitation of parametric knowledge is that it is difficult to selectively update, compose, and govern.
Context windows are finite, costly at scale, and often noisy when overloaded with marginally relevant material.
Context is also ephemeral: unless state is explicitly externalized elsewhere, every new session begins with partial amnesia.
Even as context lengths have expanded dramatically, from 2K tokens to over 100K and beyond, the fundamental tension persists: more capacity does not eliminate the need for selective curation.
Updating a single fact (say, the current head of state of a country) requires retraining, knowledge editing, or patching through additional alignment layers, all of which risk unintended side effects on other capabilities.

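This analysis was generated automatically by the PDF Reading Assistant and is for reference only; it does not constitute an academic peer review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper's text and may be biased. For the original paper, see arXiv.

Analysis time: 2026-04-27T01:08:50+00:00 · Data source: Paper Collector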