OpenSage: Self-programming Agent Generation Engine

Hongwei Li¹,*, Zhun Wang²,*, Qinrun Dai³, Yuzhou Nie¹,⁶, Jinjun Peng⁴, Ruitong Liu³, Jingyang Zhang⁶, Kaijie Zhu¹, Jingxuan He², Lun Wang⁷, Yangruibo Ding⁵, Yueqi Chen³, Wenbo Guo¹,⁶, Dawn Song²,⁶

¹UC Santa Barbara ²UC Berkeley ³University of Colorado Boulder ⁴Columbia University ⁵UCLA ⁶Virtue AI ⁷Google DeepMind

*Equal contribution

An AI-centered agent generation engine that enables LLMs to self-create agent topology, synthesize toolsets, and manage structured memory for complex real-world tasks.


Overview

OpenSage (Open Self-programming Agent Generation Engine) is an AI-centered agent framework designed to shift agent development from a human-engineered, fixed paradigm to an AI-driven, self-programming one. Instead of requiring developers to hand-design workflows, tool lists, and memory logic for each task, OpenSage provides a minimal scaffold that lets the model create and orchestrate these components at runtime.

OpenSage is built around three core systems that strongly influence agent performance:

Self-generating agent topology: the agent can dynamically create, run, and terminate sub-agents during execution, supporting both vertical decomposition (specialists per sub-task) and horizontal exploration (parallel plans + ensemble).
Dynamic tool synthesis and management: the agent can write task-specific tools (e.g., scripts, generators, analyzers) and use a hierarchical tool organization with tool-specific sandboxing, state caching, and async/background execution.
Hierarchical, graph-based memory: a short-term execution-memory graph plus a long-term knowledge graph, bridged by a dedicated memory agent that retrieves, stores, and updates high-signal knowledge at the point of need.

Results

OpenSage is evaluated on three state-of-the-art benchmarks: CyberGym, Terminal-Bench 2.0, and SWE-Bench Pro.

Key Techniques

Self-generating Agent Topology

Create/run/stop sub-agents at runtime
Vertical decomposition with specialist sub-agents
Horizontal parallel exploration via agent ensemble
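
The three capabilities above can be illustrated with a minimal sketch. It is not OpenSage's API: the names `SubAgent`, `ParentAgent`, `vertical`, and `horizontal` are hypothetical, the solvers are plain Python callables standing in for LLM-backed agents, and majority voting is only one possible way to combine an ensemble.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM-backed worker: maps a task string to an answer.
Solver = Callable[[str], str]


@dataclass
class SubAgent:
    name: str
    solve: Solver
    running: bool = False

    def run(self, task: str) -> str:
        self.running = True
        try:
            return self.solve(task)
        finally:
            self.running = False


@dataclass
class ParentAgent:
    children: Dict[str, SubAgent] = field(default_factory=dict)

    def create(self, name: str, solve: Solver) -> SubAgent:
        """Create a sub-agent at runtime and keep a handle to it."""
        agent = SubAgent(name, solve)
        self.children[name] = agent
        return agent

    def stop(self, name: str) -> None:
        """Terminate a sub-agent by dropping it from the topology."""
        self.children.pop(name, None)

    def vertical(self, subtasks: Dict[str, str]) -> Dict[str, str]:
        """Vertical decomposition: each sub-task goes to the specialist of the same name."""
        return {name: self.children[name].run(task) for name, task in subtasks.items()}

    def horizontal(self, task: str, names: List[str]) -> str:
        """Horizontal exploration: run agents in parallel on one task, then majority-vote."""
        with ThreadPoolExecutor(max_workers=len(names)) as pool:
            answers = list(pool.map(lambda n: self.children[n].run(task), names))
        return Counter(answers).most_common(1)[0][0]


parent = ParentAgent()
parent.create("reader", lambda t: f"summary of {t}")
parent.create("fixer", lambda t: f"patch for {t}")
vertical_out = parent.vertical({"reader": "bug report", "fixer": "parser.c"})

# Three candidate plans explored in parallel; two of them agree on "A".
for i, answer in enumerate(["A", "B", "A"]):
    parent.create(f"plan{i}", lambda t, a=answer: a)
ensemble_out = parent.horizontal("which fix?", ["plan0", "plan1", "plan2"])
parent.stop("plan1")
```

Each specialist in this sketch sees only its own sub-task string, which mirrors the idea of isolating context per sub-agent rather than sharing one long history.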

Dynamic Tool Synthesis

AI-writable tools as first-class artifacts
Hierarchical organization for discovery at scale
Tool-specific sandboxing, caching, async execution
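
A rough sketch of these ideas follows. Everything in it is an assumption made for illustration: `ToolRegistry` and its methods are hypothetical names, the hierarchy is reduced to slash-separated paths, caching is reduced to memoizing results per argument tuple (the state caching described above may be richer), and sandboxing is omitted entirely, so the `exec` call must not be used on untrusted code as written.

```python
import asyncio
from typing import Any, Callable, Dict, List, Tuple


class ToolRegistry:
    """Hypothetical registry: tools live under slash-separated paths."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}
        self._cache: Dict[Tuple[str, Tuple[Any, ...]], Any] = {}
        self.calls = 0  # number of real executions (cache misses)

    def register(self, path: str, fn: Callable[..., Any]) -> None:
        self._tools[path] = fn

    def register_source(self, path: str, source: str, entry: str) -> None:
        """Register a tool from source text, as a model-written tool might arrive."""
        namespace: Dict[str, Any] = {}
        exec(source, namespace)  # NOT sandboxed; illustration only
        self.register(path, namespace[entry])

    def discover(self, prefix: str = "") -> List[str]:
        """List tools under one branch of the hierarchy."""
        return sorted(p for p in self._tools if p.startswith(prefix))

    def call(self, path: str, *args: Any) -> Any:
        """Run a tool, reusing a cached result for repeated (hashable) arguments."""
        key = (path, args)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._tools[path](*args)
        return self._cache[key]

    async def call_background(self, path: str, *args: Any) -> Any:
        """Run a blocking tool in a worker thread so the caller's event loop stays free."""
        return await asyncio.to_thread(self.call, path, *args)


registry = ToolRegistry()
registry.register_source(
    "analysis/text/count_lines",
    "def count_lines(text):\n    return len(text.splitlines())\n",
    entry="count_lines",
)
registry.register("fuzz/seeds/repeat", lambda s, n: s * n)

found = registry.discover("analysis/")
first = registry.call("analysis/text/count_lines", "a\nb\nc")
second = registry.call("analysis/text/count_lines", "a\nb\nc")  # served from cache
background = asyncio.run(registry.call_background("fuzz/seeds/repeat", "ab", 3))
```

Prefix-based discovery lets a caller inspect one branch of the tool tree at a time instead of being handed every tool at once.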

Hierarchical Memory

Graph-based short-term execution memory
Long-term knowledge graph for reusable insights
Memory agent for retrieval, updates, and compaction
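
The two memory levels and the agent that bridges them can be sketched as below. This is a toy model under stated assumptions, not the OpenSage implementation: `Graph` and `MemoryAgent` are hypothetical names, both graphs are in-memory dictionaries, and retrieval uses simple word overlap where a real system would likely use an LLM or embeddings.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Graph:
    nodes: Dict[str, str] = field(default_factory=dict)            # id -> text
    edges: Set[Tuple[str, str, str]] = field(default_factory=set)  # (src, relation, dst)

    def add(self, node_id: str, text: str) -> None:
        self.nodes[node_id] = text

    def link(self, src: str, relation: str, dst: str) -> None:
        self.edges.add((src, relation, dst))


class MemoryAgent:
    """Hypothetical bridge between short-term execution memory and long-term knowledge."""

    def __init__(self) -> None:
        self.short_term = Graph()
        self.long_term = Graph()

    def record_step(self, step_id: str, text: str, parent: Optional[str] = None) -> None:
        """Append an execution step, linked to the step that led to it."""
        self.short_term.add(step_id, text)
        if parent is not None:
            self.short_term.link(parent, "then", step_id)

    def promote(self, step_id: str, key: str) -> None:
        """Copy a high-signal finding into the long-term graph."""
        self.long_term.add(key, self.short_term.nodes[step_id])
        self.long_term.link(key, "derived_from", step_id)

    def compact(self, keep: Set[str]) -> None:
        """Drop short-term steps not in `keep`, along with their edges."""
        st = self.short_term
        st.nodes = {k: v for k, v in st.nodes.items() if k in keep}
        st.edges = {e for e in st.edges if e[0] in keep and e[2] in keep}

    def retrieve(self, query: str) -> List[str]:
        """Return long-term entries sharing at least one word with the query, best first."""
        words = set(query.lower().split())
        scored = [(len(words & set(t.lower().split())), t) for t in self.long_term.nodes.values()]
        return [t for score, t in sorted(scored, reverse=True) if score > 0]


memory = MemoryAgent()
memory.record_step("s1", "ran test suite, 3 failures in parser module")
memory.record_step("s2", "parser fails on empty input because of unchecked index", parent="s1")
memory.record_step("s3", "listed directory contents", parent="s2")
memory.promote("s2", key="parser-empty-input")
memory.compact(keep={"s3"})  # the finding from s2 survives only in long-term memory
hits = memory.retrieve("why does the parser fail on empty input")
```

The usage lines show the motivating case: after compaction removes step `s2` from short-term memory, the promoted finding can still be retrieved from the long-term graph.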

Domain-Specific Toolkit

OpenSage includes a toolkit spanning software engineering and security, covering both static and dynamic analysis.

Category	Tool set	Libraries	Features
Static	Code analysis	Joern, CodeQL	CPG query, call graph analysis, dataflow slicing, semantic-aware search
Dynamic	Fuzzing	AFL++, LibFuzzer	Customizable seed generation, mutation, scoring
Dynamic	Coverage	LLVM-Cov	Query coverage with Neo4j, generate detailed reports
Dynamic	Debugger	GDB, PDB	Breakpoints, inspect states, trace execution, custom commands

More Key Findings

In addition to the overall benchmark scores, our analyses provide more concrete evidence for why OpenSage works: agent topology, tooling, and memory each contribute materially, and the framework supports practical patterns like heterogeneous model collaboration.

Topology + tooling are both necessary (CyberGym). This plot quantifies how OpenSage’s key design choices affect end-to-end vulnerability reproduction on a 300-instance subset.

The left panel isolates agent topology effects (horizontal ensemble and vertical dynamic sub-agents). The right panel isolates the tooling system contribution versus a raw terminal and a no-feature baseline.

Topology improves performance (CyberGym). This ablation compares variants that disable the horizontal ensemble or vertical dynamic sub-agent creation.

Removing either capability reduces the resolved rate, while disabling all OpenSage features lowers it much further. The horizontal ensemble improves performance via parallel exploration, while vertical dynamic sub-agents help by decomposing tasks and isolating context (reducing context overflow and information loss when summarization triggers).

Tooling is more than a shell (CyberGym). This ablation evaluates OpenSage’s tooling system against a variant that replaces it with a raw terminal interface.

This highlights the benefit of dynamic tool synthesis plus tool/runtime management beyond “just having a bash tool”.

Memory helps long-horizon tasks (SWE-Bench Pro). This experiment compares OpenSage’s memory against a no-memory baseline and Mem0ᵍ.

OpenSage’s hierarchical, agent-managed memory achieves the best performance among the compared baselines. Long-horizon tasks benefit from explicitly storing high-signal intermediate findings and retrieving them at the point of need, especially after history compaction or when revisiting earlier decisions.

Terminal-Bench large-small collaboration results

Heterogeneous model collaboration (Terminal-Bench). The figure shows resolved rate and cost when combining a strong “planner/reviewer” with a cheaper “executor” via sub-agents.

This demonstrates OpenSage’s flexibility for cost/quality trade-offs by assigning different models to different roles.
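
The role split described above can be sketched as follows. The sketch is illustrative only: `Role` and `solve` are hypothetical names, the responders are stub functions rather than real models, and the per-call costs are made-up integer units chosen to keep the arithmetic exact, not measured prices.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Role:
    model_name: str
    cost_per_call: int               # hypothetical cost units, not real pricing
    respond: Callable[[str], str]    # stub standing in for a model call
    calls: int = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return self.respond(prompt)

    @property
    def spent(self) -> int:
        return self.calls * self.cost_per_call


def solve(task: str, planner: Role, executor: Role) -> Tuple[List[str], str]:
    """The strong model plans and reviews; the cheap model executes each step."""
    steps = [s.strip() for s in planner(f"plan: {task}").split(";")]
    results = [executor(step) for step in steps]
    verdict = planner("review: " + " | ".join(results))
    return results, verdict


def strong_stub(prompt: str) -> str:
    # Returns a fixed two-step plan, or an approval when asked to review.
    return "inspect logs; apply fix" if prompt.startswith("plan:") else "approved"


planner = Role("large-model", cost_per_call=100, respond=strong_stub)
executor = Role("small-model", cost_per_call=10, respond=lambda step: f"done({step})")

results, verdict = solve("service crashes on start", planner, executor)
total_cost = planner.spent + executor.spent  # 2 * 100 + 2 * 10
```

Under these made-up numbers, the planner accounts for most of the cost while being called only twice, which is the kind of trade-off that per-role model assignment exposes.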

Citation

@misc{li2026opensage,
  title={OpenSage: Self-programming Agent Generation Engine},
  author={Hongwei Li and Zhun Wang and Qinrun Dai and Yuzhou Nie and Jinjun Peng and Ruitong Liu and Jingyang Zhang and Kaijie Zhu and Jingxuan He and Lun Wang and Yangruibo Ding and Yueqi Chen and Wenbo Guo and Dawn Song},
  year={2026},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
}