Evaluations - Batch Processing Entry Point

Evaluation scripts run agents on benchmark datasets for performance measurement and testing.

Command

cd src/aigise/evaluations
python cybergym/cybergym_vul_detection.py run \
  --agent-id my_agent \
  --config-path /path/to/config.toml \
  --max_llm_calls 75 \
  --use_multiprocessing \
  --max_workers 3

Step-by-Step Workflow

Step 1: Script Initialization

  1. The Fire library parses the command-line arguments
  2. Creates an Evaluation class instance with the parameters:
     • agent_id: Identifier for the agent
     • config_path: Path to the TOML configuration
     • max_llm_calls: Maximum LLM calls per task
     • use_multiprocessing: Use processes instead of threads
     • max_workers: Number of parallel workers
  3. Sets up logging and instrumentation (Langfuse, OpenTelemetry)
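
A minimal sketch of this entry-point pattern with Fire; the class shape mirrors the parameters above but is illustrative, not the actual source:

import fire

class Evaluation:
    def __init__(self, agent_id, config_path, max_llm_calls=75,
                 use_multiprocessing=False, max_workers=1):
        # Store the CLI parameters for use in later steps.
        self.agent_id = agent_id
        self.config_path = config_path
        self.max_llm_calls = max_llm_calls
        self.use_multiprocessing = use_multiprocessing
        self.max_workers = max_workers

    def run(self):
        # Invoked as `python <script>.py run --agent-id ... --config-path ...`.
        print(f"Running evaluation for {self.agent_id}")

if __name__ == "__main__":
    fire.Fire(Evaluation)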

Step 2: Load Dataset

self.dataset = self._get_dataset()
  1. Loads the benchmark dataset (e.g., HuggingFace datasets, JSON files)
  2. The dataset contains multiple samples/tasks to evaluate
  3. Example: the CyberGym dataset has vulnerability detection tasks
  4. Each sample contains:
     • Task description
     • Expected outputs (ground truth)
     • Metadata (file paths, vulnerability info, etc.)
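
A hedged sketch of what a HuggingFace-backed _get_dataset could look like; the dataset id below is a placeholder, not the real CyberGym identifier:

from datasets import load_dataset

def _get_dataset(self):
    # Placeholder dataset id; the actual benchmark source may be a JSON file
    # or a different HuggingFace dataset.
    ds = load_dataset("some-org/cybergym-vul-detection", split="test")
    # Each record is expected to carry the task description, ground truth,
    # and metadata used later during evaluation.
    return list(ds)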

Step 3: Prepare General Environment (_prepare_general_env)

This sets up shared resources used across all evaluation tasks.

3.1 Create Base Configuration

  1. Loads the base configuration from the TOML file
  2. Expands template variables
  3. Stores the result on the class for later use
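
A minimal sketch of this step, assuming ${VAR}-style placeholders expanded with string.Template (the actual expansion mechanism may differ):

import tomllib
from string import Template

def _load_base_config(config_path: str, variables: dict) -> dict:
    with open(config_path, "rb") as f:
        config = tomllib.load(f)

    def expand(value):
        # Recursively substitute ${VAR} placeholders in string values.
        if isinstance(value, str):
            return Template(value).safe_substitute(variables)
        if isinstance(value, dict):
            return {k: expand(v) for k, v in value.items()}
        if isinstance(value, list):
            return [expand(v) for v in value]
        return value

    return expand(config)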

3.2 Setup Evaluation Directories

self.eval_output_dir = Path(f"evals/{self.agent_id}/...")
self.eval_output_dir.mkdir(parents=True, exist_ok=True)
  • Creates output directories for results
  • Structure: evals/{agent_id}/{benchmark_name}/{timestamp}/
  • Stores agent outputs, logs, artifacts
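
A sketch of how such a timestamped layout can be built (the helper name is hypothetical):

from datetime import datetime
from pathlib import Path

def _make_eval_output_dir(agent_id: str, benchmark_name: str) -> Path:
    # evals/{agent_id}/{benchmark_name}/{timestamp}/
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    out_dir = Path("evals") / agent_id / benchmark_name / timestamp
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir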

Step 4: Generate Samples (Parallel Execution)

The evaluation runs tasks in parallel (or sequentially for debugging), using one of three execution modes:

Mode A: Multiprocessing (generate())

with ProcessPoolExecutor(max_workers=self.max_workers) as executor:
    futures = {
        executor.submit(_run_sample_in_process, self, sample): sample
        for sample in self.dataset
    }
  • Each sample runs in a separate process
  • True parallelism (bypasses the Python GIL)
  • Processes are isolated (no shared memory)
  • Requires serializable data
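
The snippet above only submits the work; a minimal sketch of how the futures are typically drained (the error handling shown is illustrative):

from concurrent.futures import as_completed

results = []
for future in as_completed(futures):
    sample = futures[future]
    try:
        results.append(future.result())
    except Exception as exc:
        # A failed sample should not abort the whole evaluation run.
        print(f"Sample failed ({sample!r}): {exc}")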

Mode B: Multithreading (generate_threaded())

with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
    futures = {
        executor.submit(run_sample_in_thread, sample): sample
        for sample in self.dataset
    }
  • Each sample runs in a separate thread
  • Shared memory (resources can be shared)
  • Limited by the GIL for CPU-bound work
  • Better for I/O-bound operations

Mode C: Single Thread (generate_single_thread())

  • Sequential execution, one sample at a time
  • Used for debugging; issues are easiest to trace
  • Much slower
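
A minimal sketch of the sequential mode, assuming _generate_sample can be called synchronously here:

def generate_single_thread(self):
    # Illustrative: process every sample one after another on the main thread.
    results = []
    for sample in self.dataset:
        results.append(self._generate_sample(sample))
    return results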

Step 5: Process Each Sample (_generate_sample or _run_sample_in_process)

For each sample in the dataset:

5.1 Create Evaluation Task

task = self._create_task(sample)
  1. Extracts the sample data
  2. Creates an EvaluationTask object with:
     • session_id: Unique ID for this task
     • sample: Original sample data
     • aigise_session: Will be created in the next step
     • Metadata (task name, description, etc.)
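
A hypothetical shape for the EvaluationTask container described above (field names beyond those listed are assumptions):

import uuid
from dataclasses import dataclass, field
from typing import Any

@dataclass
class EvaluationTask:
    sample: dict[str, Any]
    prompt: str = ""
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    metadata: dict[str, Any] = field(default_factory=dict)
    aigise_session: Any = None  # filled in by step 5.2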

5.2 Create OpenSage Session

aigise_session = get_aigise_session(
    aigise_session_id=task.session_id,
    config_path=self.config_path
)
  • Creates isolated OpenSage session for this task
  • Loads configuration
  • Each task gets its own session (isolation)

5.3 Prepare Task-Specific Environment (_prepare_environment)

This is benchmark-specific. Example for CyberGym:

  1. Extract code/data:
     • Extracts the source code into the sandbox
     • Copies test files and build scripts
     • Sets up the project structure

  2. Initialize sandboxes:

     aigise_session.sandboxes.initialize_shared_volumes()
     await aigise_session.sandboxes.launch_all_sandboxes()
     await aigise_session.sandboxes.initialize_all_sandboxes()

     • Creates shared volumes
     • Launches the required sandbox containers
     • Initializes the sandboxes (tools, dependencies)

  3. Set the source directory:

     aigise_session.config.src_dir_in_sandbox = "/shared/code"

     • Tells tools where to find the source code

  4. Git repository setup (if applicable):
     • Finds the git repository in the sandbox
     • Checks out the main/master branch
     • Updates src_dir_in_sandbox to the repository path

5.4 Load Agent

mk_agent = self._load_mk_agent()
agent = mk_agent(aigise_session_id=task.session_id)
  1. Imports agent module
  2. Calls mk_agent() function with session ID
  3. Agent is configured for this specific session
  4. Agent has access to task-specific sandboxes and resources
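
A sketch of how the agent factory might be imported dynamically; the module path convention shown is an assumption, not the actual project layout:

import importlib

def _load_mk_agent(agent_id: str):
    # Assumed convention: each agent lives in a module named after its id
    # and exposes an mk_agent() factory.
    module = importlib.import_module(f"agents.{agent_id}")
    return module.mk_agent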

5.5 Create ADK Session and Runner

inner_session_service = InMemorySessionService()
await inner_session_service.create_session(
    app_name=app_name,
    user_id=self.user_id + "_" + meta_data,
    session_id=task.session_id,
    state={"aigise_session_id": task.session_id},
)

runner = Runner(
    agent=agent,
    app_name=app_name,
    session_service=inner_session_service,
)
  • Creates ADK session that maps to OpenSage session
  • Stores aigise_session_id in session state
  • Creates ADK Runner for agent execution

5.6 Run Agent

run_config = RunConfig(max_llm_calls=self.max_llm_calls)

async for event in runner.run_async(
    user_id=user_id,
    session_id=task.session_id,
    run_config=run_config,
    new_message=types.Content(parts=[types.Part(text=task.prompt)]),
):
    # Process events
    if isinstance(event, types.FunctionResponse):
        ...  # Tool execution results
    elif isinstance(event, types.Candidate):
        ...  # Agent responses
  1. Runner starts agent execution:
     • Sends the prompt to the agent
     • The agent enters a reason-act loop

  2. Agent reasoning:
     • Calls the LLM for reasoning
     • Decides which tools to use
     • Generates function calls

  3. Tool execution:
     • The Runner executes tools in the sandbox
     • Tools access session resources
     • Results are returned to the agent

  4. Iteration:
     • The agent processes tool results
     • Decides the next action
     • Continues until completion or the max call limit

  5. Completion:
     • The agent generates a final response
     • The Runner finishes execution
     • Events are collected

5.7 Collect Results

result = {
    "session_id": task.session_id,
    "prompt": task.prompt,
    "response": agent_response,
    "events": events,
    "metadata": {...},
}
  • Extracts the agent response
  • Collects execution metadata:
     • Number of LLM calls
     • Tools used
     • Execution time
     • Errors (if any)

5.8 Save Results

self._save_result(task, result)
  • Saves result to file (JSON)
  • Location: evals/{agent_id}/{benchmark}/results/{task_id}.json
  • Includes full event history for analysis
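
A sketch of how one result could be persisted in that layout (the helper signature is hypothetical):

import json
from pathlib import Path

def _save_result(results_dir: Path, task_id: str, result: dict) -> None:
    results_dir.mkdir(parents=True, exist_ok=True)
    out_path = results_dir / f"{task_id}.json"
    # default=str keeps non-JSON-serializable values (e.g. timestamps) readable.
    out_path.write_text(json.dumps(result, indent=2, default=str))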

5.9 Cleanup Task Session

cleanup_aigise_session(task.session_id)
  • Stops sandbox containers
  • Removes shared volumes
  • Cleans up session resources
  • Frees Docker resources

Step 6: Collect All Results

After all samples complete:

  1. Aggregates results from all tasks
  2. Collects statistics:
     • Success rate
     • Average execution time
     • Tool usage patterns
     • Error rates
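
An illustrative aggregation over the collected results; the field names are assumptions about the result dictionaries:

def summarize(results: list[dict]) -> dict:
    total = len(results)
    successes = sum(1 for r in results if not r.get("error"))
    total_time = sum(r.get("execution_time", 0.0) for r in results)
    return {
        "total": total,
        "success_rate": successes / total if total else 0.0,
        "avg_execution_time_s": total_time / total if total else 0.0,
    }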

Step 7: Evaluate Results (evaluate())

self.evaluate()
  1. Load ground truth:
     • Loads expected outputs from the dataset
     • Loads agent results from the saved files

  2. Compare outputs:
     • Compares agent output against the ground truth
     • Calculates metrics:
        • Accuracy
        • Precision/Recall (if applicable)
        • Custom benchmark metrics

  3. Generate report:
     • Creates an evaluation report
     • Includes metrics, statistics, examples
     • Saves to evals/{agent_id}/{benchmark}/evaluation_report.json

  4. Display summary:
     • Prints metrics to the console
     • Shows top failures/successes
     • Provides analysis
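
A minimal sketch of the comparison step, reduced to plain accuracy; real benchmark-specific metrics (e.g. for CyberGym) will be more involved:

def accuracy(predictions: dict[str, str], ground_truth: dict[str, str]) -> float:
    # Compare predictions and ground truth keyed by task/session id.
    shared = set(predictions) & set(ground_truth)
    if not shared:
        return 0.0
    correct = sum(1 for k in shared if predictions[k] == ground_truth[k])
    return correct / len(shared)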

Key Characteristics

Isolation

  • Each task gets its own OpenSage session
  • Separate sandbox containers
  • No interference between tasks

Parallelism

  • Multiple tasks run simultaneously
  • Configurable worker count
  • Process or thread-based execution

Reproducibility

  • Deterministic task execution
  • Results saved with full event history
  • Can replay specific tasks

Resource Management

  • Sessions cleaned up after each task
  • Containers stopped and removed
  • Prevents resource leaks

Comparison with opensage web

Aspect        opensage web                Evaluations
Purpose       Development, debugging      Performance measurement
Sessions      Single long-lived session   Multiple short-lived sessions
Interaction   Interactive chat            Batch processing
Parallelism   Single user                 Multiple tasks in parallel
Cleanup       Manual (on exit)            Automatic (per task)
Output        Real-time events            Saved results files

Example Evaluation Flow

Dataset (100 tasks)
Process Pool (3 workers)
  ├─ Worker 1: Task 1 → Session 1 → Agent → Result 1
  ├─ Worker 2: Task 2 → Session 2 → Agent → Result 2
  ├─ Worker 3: Task 3 → Session 3 → Agent → Result 3
  ├─ Worker 1: Task 4 → Session 4 → Agent → Result 4
  └─ ...
All Results Collected
Evaluation (compare vs ground truth)
Report Generated