Adding a New Evaluation Benchmark¶
Overview¶
Evaluations are used to benchmark agent performance on specific tasks. The evaluation system in AIgiSE is built on top of the base Evaluation class, which provides a complete framework for running benchmarks, managing sandboxes, collecting outputs, and generating metrics.
Entry Points¶
The Evaluation class provides multiple entry points for running evaluations, each suited for different use cases:
1. generate() - Multiprocessing Mode (Default)¶
- Uses ProcessPoolExecutor for true parallelism
- Each sample runs in a separate process
- Best for production runs with maximum parallelism
- Bypasses Python's GIL for true concurrent execution
2. generate_threaded() - Multithreading Mode¶
- Uses ThreadPoolExecutor for parallel execution
- Each sample runs in a separate thread
- Useful when multiprocessing has serialization issues
- Shares memory across threads
3. generate_single_thread() - Single-Threaded Mode¶
- Sequential execution in a single thread
- Best for debugging and development
- Easiest to debug with step-by-step execution
4. run() - Auto-Select Mode¶
- Automatically selects execution mode based on the use_multiprocessing flag
- If use_multiprocessing=True: calls generate() (multiprocessing)
- If use_multiprocessing=False: calls generate_threaded() (multithreading)
- Calls evaluate() after generation completes
- Recommended for most use cases
5. run_debug() - Debug Mode¶
- Calls generate_single_thread() followed by evaluate()
- Best for debugging and development
- Slower but easier to debug
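The relationship between these entry points can be pictured with a small sketch; this is conceptual only, not the actual Evaluation implementation:

class EntryPointSketch:
    """Conceptual dispatch of run()/run_debug(); not the real Evaluation class."""

    def __init__(self, use_multiprocessing: bool = True) -> None:
        self.use_multiprocessing = use_multiprocessing

    # Stubs standing in for the real generation/evaluation methods.
    def generate(self) -> None: ...
    def generate_threaded(self) -> None: ...
    def generate_single_thread(self) -> None: ...
    def evaluate(self) -> None: ...

    def run(self) -> None:
        # Auto-select a generation backend, then compute metrics.
        if self.use_multiprocessing:
            self.generate()            # one process per sample
        else:
            self.generate_threaded()   # one thread per sample
        self.evaluate()

    def run_debug(self) -> None:
        # Sequential generation, then metrics.
        self.generate_single_thread()
        self.evaluate()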
Usage Example¶
Evaluations use Python Fire for their command-line interface. You can run them in several ways:
Using command-line (recommended):
# Option 1: Auto-select mode (uses generate() or generate_threaded() based on use_multiprocessing)
python -m aigise.evaluations.my_benchmark.my_evaluation \
--dataset_path="org/dataset" \
--agent_dir="examples/agents/my_agent" \
--max_workers=6 \
--use_multiprocessing=true \
run
# Option 2: Explicit multiprocessing mode
python -m aigise.evaluations.my_benchmark.my_evaluation \
--dataset_path="org/dataset" \
--agent_dir="examples/agents/my_agent" \
generate
# Option 3: Multithreading mode
python -m aigise.evaluations.my_benchmark.my_evaluation \
--dataset_path="org/dataset" \
--agent_dir="examples/agents/my_agent" \
generate_threaded
# Option 4: Single-threaded debugging mode
python -m aigise.evaluations.my_benchmark.my_evaluation \
--dataset_path="org/dataset" \
--agent_dir="examples/agents/my_agent" \
run_debug
# Or using direct file path
python src/aigise/evaluations/my_benchmark/my_evaluation.py \
--dataset_path="org/dataset" \
--agent_dir="examples/agents/my_agent" \
run
Using Python API:
from aigise.evaluations import MyEvaluation
# Create evaluation instance
evaluation = MyEvaluation(
    dataset_path="org/dataset",
    agent_dir="examples/agents/my_agent",
    max_workers=6,
    use_multiprocessing=True,
)

# Option 1: Auto-select mode (recommended)
evaluation.run()

# Option 2: Explicit multiprocessing
evaluation.generate()

# Option 3: Multithreading
evaluation.generate_threaded()

# Option 4: Single-threaded debugging
evaluation.run_debug()
Steps to Create a New Evaluation¶
1. Create Evaluation Module¶
Create a new directory under src/aigise/evaluations/ with your benchmark name:
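For example, a typical layout matching the module path used in the examples below:

src/aigise/evaluations/
└── my_benchmark/
    ├── __init__.py        # makes the package importable
    └── my_evaluation.py   # your Evaluation subclass (see below)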
2. Implement Evaluation Class¶
Create a class that inherits from Evaluation and implements required abstract methods:
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

import datasets

from aigise.evaluations import Evaluation, EvaluationTask


@dataclass
class MyEvaluation(Evaluation):
    """Custom evaluation benchmark.

    This class is automatically registered by name (lowercase).
    You can retrieve it later using: get_evaluation_class("myevaluation")
    """

    # Required fields from parent class
    dataset_path: str = "org/dataset_name"  # HuggingFace dataset or local path
    agent_dir: str = "examples/agents/my_agent"  # Directory containing agent.py

    # Optional configuration overrides
    max_llm_calls: int = 100
    max_workers: int = 6
    use_multiprocessing: bool = True
    run_until_explicit_finish: bool = True
    use_cache: bool = True

    # Custom fields for your benchmark
    custom_param: str = "default_value"

    # Implement required abstract methods
    def _get_sample_id(self, sample: dict) -> str:
        """Extract unique task ID from sample.

        This ID is used for:
        - Output directory naming
        - Task identification in logs
        - Result tracking
        """
        return sample["task_id"]  # or sample.get("id"), etc.

    def _get_user_msg_first(self, sample: dict) -> str:
        """Extract the initial prompt/message to send to the agent.

        This is the first message that will trigger agent execution.
        """
        return sample["prompt"]  # or sample.get("question"), etc.

    # Optional: Override methods for custom behavior
    def _get_dataset(self) -> datasets.Dataset:
        """Load dataset with custom filtering or preprocessing."""
        dataset = super()._get_dataset()
        # Add custom filtering logic if needed
        # dataset = dataset.filter(lambda x: x["difficulty"] == "hard")
        return dataset

    def _create_task(self, sample: dict) -> EvaluationTask:
        """Create custom task with additional fields if needed."""
        task = super()._create_task(sample)
        # Add custom fields to task if needed
        return task

    def _get_input_data_path(self, sample: dict) -> str:
        """Specify input data directory for this sample."""
        task_id = self._get_sample_id(sample)
        return str(Path(self.input_data_path) / task_id) if self.input_data_path else ""

    def _get_cache_dir(self, sample: dict) -> str:
        """Specify cache directory for sandbox state."""
        task_id = self._get_sample_id(sample)
        return str(Path(self.cache_dir) / task_id) if self.cache_dir else ""

    def _get_output_dir_in_sandbox(self, sample: dict) -> str | tuple | None:
        """Specify which sandbox directories to export after execution."""
        return "/output"  # or ("/output1", "/output2") for multiple dirs

    def customized_modify_and_save_results(
        self,
        *,
        results: list | None,
        failed_samples: list[str] | None,
        mode: str,
    ) -> None:
        """Post-process and save aggregated results after all samples complete.

        Args:
            results: List of successful sample outputs
            failed_samples: List of task IDs that failed
            mode: Execution mode ("multiprocess", "threaded", or "single_thread")
        """
        # Calculate metrics, save summary, etc.
        pass

    def evaluate(self) -> None:
        """Analyze collected results and generate final metrics.

        This is called after generate() completes. Implement your
        evaluation logic here (accuracy, pass rate, etc.).
        """
        # Load results from output_dir
        # Calculate metrics
        # Save evaluation report
        pass
3. Configuration Template¶
Create a configuration template in src/aigise/evaluations/configs/:
# src/aigise/evaluations/configs/my_benchmark_config.toml
[llm]
model_name = "gemini-2.0-flash-exp"
temperature = 0.7
[sandbox]
[sandbox.main]
type = "docker"
image = "python:3.11"
working_dir = "/workspace"
# Template variables can be used:
# ${TASK_NAME} - Replaced with actual task ID
# ${PROJECT_RELATIVE_SHARED_DATA_PATH} - Replaced with data path
4. Registration¶
The evaluation class is automatically registered when imported. The registration name is the lowercase class name.
Example:
- Class name: MyEvaluation → Registered as: "myevaluation"
- Retrieve with: get_evaluation_class("myevaluation")
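A minimal sketch of looking the class up by its registered name, assuming get_evaluation_class is importable from aigise.evaluations (adjust the import path to wherever it actually lives in your tree):

# Registration happens at import time, so import the module defining the class first.
import aigise.evaluations.my_benchmark.my_evaluation  # noqa: F401

from aigise.evaluations import get_evaluation_class  # assumed import location

EvaluationClass = get_evaluation_class("myevaluation")  # lowercase class name
evaluation = EvaluationClass(
    dataset_path="org/dataset",
    agent_dir="examples/agents/my_agent",
)
evaluation.run()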
5. Running the Evaluation¶
Since evaluations use Python Fire, you can run them from the command line:
# Run with auto-select mode (recommended)
python -m aigise.evaluations.my_benchmark.my_evaluation \
--dataset_path="org/dataset" \
--agent_dir="examples/agents/my_agent" \
--max_workers=6 \
--output_dir="results/my_benchmark" \
run
# Or for debugging (single-threaded)
python -m aigise.evaluations.my_benchmark.my_evaluation \
--dataset_path="org/dataset" \
--agent_dir="examples/agents/my_agent" \
run_debug
# Or directly specify execution method
python -m aigise.evaluations.my_benchmark.my_evaluation \
--dataset_path="org/dataset" \
--agent_dir="examples/agents/my_agent" \
generate # or generate_threaded, generate_single_thread
Or programmatically:
from aigise.evaluations import MyEvaluation
# Create and run
evaluation = MyEvaluation(
    dataset_path="org/dataset",
    agent_dir="examples/agents/my_agent",
    output_dir="results/my_benchmark",  # Optional, auto-generated if not provided
    max_workers=6,
)

# Run evaluation
evaluation.run()  # or evaluation.run_debug() for debugging
Evaluation Lifecycle¶
Each evaluation sample goes through the following lifecycle:
1. Task Creation (_create_task())
- Convert dataset sample to EvaluationTask
- Extract task ID, prompt, paths, etc.
2. Environment Preparation (_prepare_environment())
- Initialize AIgiSE session
- Load/launch sandboxes
- Set up Neo4j (if enabled)
- Load cached sandbox states (if use_cache=True)
3. Agent Preparation (_prepare_agent())
- Load mk_agent function from agent_dir
- Create agent instance
- Configure model (if use_config_model=True)
4. Agent Execution (_run_agent())
- Send prompt to agent
- Run agent with configured limits
- Track LLM calls, costs, etc.
- Handle run_until_explicit_finish loop
5. Output Collection (_collect_outputs())
- Export sandbox outputs (if output_dir_in_sandbox specified)
- Export Neo4j database
- Save session trace
- Calculate cost information
6. Cleanup
- Clean up sandboxes
- Close sessions
- Save error information (if failed)
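Put together, the per-sample flow looks roughly like the sketch below; this is illustrative pseudocode, and the real signatures and return values of these methods may differ:

# Rough per-sample flow (illustrative only; not the actual implementation).
def process_one_sample(evaluation, sample: dict):
    task = evaluation._create_task(sample)           # 1. Task creation
    try:
        evaluation._prepare_environment(task)        # 2. Session, sandboxes, Neo4j, cache
        agent = evaluation._prepare_agent(task)      # 3. mk_agent from agent_dir
        result = evaluation._run_agent(agent, task)  # 4. Prompt + execution limits
        evaluation._collect_outputs(task)            # 5. Exports, traces, cost info
        return result
    finally:
        pass  # 6. Cleanup: sandboxes, sessions, error info (always runs)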
Key Methods to Override¶
Required Abstract Methods¶
- _get_sample_id(sample: dict) -> str: Extract unique task ID
- _get_user_msg_first(sample: dict) -> str: Extract initial prompt
Optional Methods (with Defaults)¶
- _get_dataset() -> datasets.Dataset: Load and filter dataset
- _create_task(sample: dict) -> EvaluationTask: Create task instance
- _get_input_data_path(sample: dict) -> str: Input data directory
- _get_cache_dir(sample: dict) -> str: Cache directory
- _get_output_dir_in_sandbox(sample: dict) -> str | tuple | None: Output dirs to export
- _prepare_general_env() -> None: Setup shared across all samples
- _before_initialize_hooks(aigise_session, task) -> None: Hooks before sandbox init
- customized_modify_and_save_results(results, failed_samples, mode) -> None: Post-processing
- evaluate() -> None: Final evaluation and metrics
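As a reference point, a benchmark that keeps every default only needs the two required methods; a minimal sketch (field defaults and dataset column names are placeholders):

from dataclasses import dataclass

from aigise.evaluations import Evaluation


@dataclass
class MinimalEvaluation(Evaluation):
    """Smallest possible benchmark: only the required methods are implemented."""

    dataset_path: str = "org/dataset"            # placeholder dataset
    agent_dir: str = "examples/agents/my_agent"  # placeholder agent directory

    def _get_sample_id(self, sample: dict) -> str:
        return sample["task_id"]  # adapt to your dataset's ID column

    def _get_user_msg_first(self, sample: dict) -> str:
        return sample["prompt"]   # adapt to your dataset's prompt column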
Output Structure¶
Each evaluation run creates an output directory with the following structure:
evals/
└── myevaluation/
└── yymmdd_HHMMSS/
├── evaluation_master.log # Master log for entire run
├── eval_params.json # Evaluation parameters
├── task_001/
│ ├── execution_debug.log # DEBUG-level log
│ ├── execution_info.log # INFO-level log
│ ├── config_used.toml # Config used for this task
│ ├── cost_info.json # Token usage and costs
│ ├── session_trace.json # Complete session events
│ ├── session_trace.txt # Human-readable trace
│ ├── metadata.json # Task metadata
│ ├── sandbox_output/ # Exported from sandbox
│ └── neo4j_history/ # Neo4j database export
└── task_002/
└── ...
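An evaluate() implementation can walk these per-task directories to build a summary. A rough sketch, assuming the file layout above and a total_cost key inside cost_info.json (the actual schema may differ):

import json
from pathlib import Path


def evaluate(self) -> None:
    """Illustrative only: aggregate per-task cost files into a summary."""
    output_dir = Path(self.output_dir)
    summary = {"tasks": 0, "total_cost": 0.0}

    for task_dir in sorted(p for p in output_dir.iterdir() if p.is_dir()):
        cost_file = task_dir / "cost_info.json"
        if not cost_file.exists():
            continue  # skip non-task directories or failed exports
        cost_info = json.loads(cost_file.read_text())
        summary["tasks"] += 1
        # "total_cost" is an assumed key; adapt to the real cost_info.json schema
        summary["total_cost"] += cost_info.get("total_cost", 0.0)

    (output_dir / "evaluation_summary.json").write_text(json.dumps(summary, indent=2))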
Configuration Options¶
Key configuration options available in Evaluation:
| Option | Type | Default | Description |
|---|---|---|---|
| dataset_path | str | Required | HuggingFace dataset or local path |
| agent_dir | str | Required | Directory with agent.py |
| max_llm_calls | int | 100 | Maximum LLM calls per task |
| max_workers | int | 6 | Parallel workers |
| use_multiprocessing | bool | True | Use multiprocessing vs threading |
| use_cache | bool | True | Load/cache sandbox states |
| run_until_explicit_finish | bool | True | Keep running until task finished |
| use_config_model | bool | False | Use model from config file |
| llm_retry_count | int | 3 | Retries for LLM API calls |
| llm_retry_timeout | int | 30 | Timeout per LLM request (seconds) |
| log_level | str | "INFO" | Terminal log level |
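These options map directly to constructor arguments (or --flags on the command line); an illustrative combination:

from aigise.evaluations import MyEvaluation

# Illustrative overrides of the options above; values are examples only.
evaluation = MyEvaluation(
    dataset_path="org/dataset",
    agent_dir="examples/agents/my_agent",
    max_llm_calls=50,
    max_workers=4,
    use_multiprocessing=False,  # fall back to threading
    use_cache=False,            # always start sandboxes from scratch
    log_level="DEBUG",
)
evaluation.run()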
Examples¶
See existing evaluations for reference:
- src/aigise/evaluations/cybergym/__init__.py - Base class of evaluation
- src/aigise/evaluations/cybergym/cybergym_static.py - Full-featured evaluation
- src/aigise/evaluations/mock_debug/mock_debug_evaluation.py - Minimal example
- src/aigise/evaluations/secodeplt/vul_detection.py - Another example
See Also¶
- Development Guides - Other development guides
- Testing Debugging - Testing evaluations
- src/aigise/evaluations/__init__.py - Base Evaluation class implementation