Adding a New Evaluation Benchmark

Overview

Evaluations are used to benchmark agent performance on specific tasks. The evaluation system in AIgiSE is built on top of the base Evaluation class, which provides a complete framework for running benchmarks, managing sandboxes, collecting outputs, and generating metrics.

Entry Points

The Evaluation class provides multiple entry points for running evaluations, each suited for different use cases:

1. generate() - Multiprocessing Mode (Default)

  • Uses ProcessPoolExecutor, so each sample runs in a separate process
  • Bypasses Python's GIL for true parallel execution
  • Best for production runs that need maximum parallelism

2. generate_threaded() - Multithreading Mode

  • Uses ThreadPoolExecutor for parallel execution
  • Each sample runs in a separate thread
  • Useful when multiprocessing has serialization issues
  • Shares memory across threads

3. generate_single_thread() - Single-Threaded Mode

  • Sequential execution in a single thread
  • Best for debugging and development: supports step-by-step execution

4. run() - Auto-Select Mode

  • Automatically selects execution mode based on use_multiprocessing flag
  • If use_multiprocessing=True: calls generate() (multiprocessing)
  • If use_multiprocessing=False: calls generate_threaded() (multithreading)
  • Calls evaluate() after generation completes
  • Recommended for most use cases (see the sketch after this list)

5. run_debug() - Debug Mode

  • Calls generate_single_thread() followed by evaluate()
  • Slower, but the easiest mode to debug during development
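
The auto-select behavior of run() boils down to the following dispatch, based on the mode descriptions above (a sketch of the logic, not the exact implementation):

def run(self):
    if self.use_multiprocessing:
        self.generate()            # multiprocessing mode
    else:
        self.generate_threaded()   # multithreading mode
    self.evaluate()                # always evaluate after generation completes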

Usage Example

Evaluations use Python Fire for their command-line interface. You can run evaluations in several ways:

Using command-line (recommended):

# Option 1: Auto-select mode (uses generate() or generate_threaded() based on use_multiprocessing)
python -m aigise.evaluations.my_benchmark.my_evaluation \
  --dataset_path="org/dataset" \
  --agent_dir="examples/agents/my_agent" \
  --max_workers=6 \
  --use_multiprocessing=true \
  run

# Option 2: Explicit multiprocessing mode
python -m aigise.evaluations.my_benchmark.my_evaluation \
  --dataset_path="org/dataset" \
  --agent_dir="examples/agents/my_agent" \
  generate

# Option 3: Multithreading mode
python -m aigise.evaluations.my_benchmark.my_evaluation \
  --dataset_path="org/dataset" \
  --agent_dir="examples/agents/my_agent" \
  generate_threaded

# Option 4: Single-threaded debugging mode
python -m aigise.evaluations.my_benchmark.my_evaluation \
  --dataset_path="org/dataset" \
  --agent_dir="examples/agents/my_agent" \
  run_debug

# Or using direct file path
python src/aigise/evaluations/my_benchmark/my_evaluation.py \
  --dataset_path="org/dataset" \
  --agent_dir="examples/agents/my_agent" \
  run

Using Python API:

from aigise.evaluations import MyEvaluation

# Create an evaluation instance (named to avoid shadowing the built-in eval)
evaluation = MyEvaluation(
    dataset_path="org/dataset",
    agent_dir="examples/agents/my_agent",
    max_workers=6,
    use_multiprocessing=True,
)

# Option 1: Auto-select mode (recommended)
evaluation.run()

# Option 2: Explicit multiprocessing
evaluation.generate()

# Option 3: Multithreading
evaluation.generate_threaded()

# Option 4: Single-threaded debugging
evaluation.run_debug()

Steps to Create a New Evaluation

1. Create Evaluation Module

Create a new directory under src/aigise/evaluations/ with your benchmark name:

src/aigise/evaluations/
└── my_benchmark/
    ├── __init__.py
    └── my_evaluation.py
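
Importing the class in the package's __init__.py is enough to trigger the automatic registration described in step 4; a minimal sketch:

# src/aigise/evaluations/my_benchmark/__init__.py
from .my_evaluation import MyEvaluation  # the import side effect registers the class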

2. Implement Evaluation Class

Create a class that inherits from Evaluation and implements required abstract methods:

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

import datasets

from aigise.evaluations import Evaluation, EvaluationTask

@dataclass
class MyEvaluation(Evaluation):
    """Custom evaluation benchmark.

    This class is automatically registered by name (lowercase).
    You can retrieve it later using: get_evaluation_class("myevaluation")
    """

    # Required fields from parent class
    dataset_path: str = "org/dataset_name"  # HuggingFace dataset or local path
    agent_dir: str = "examples/agents/my_agent"  # Directory containing agent.py

    # Optional configuration overrides
    max_llm_calls: int = 100
    max_workers: int = 6
    use_multiprocessing: bool = True
    run_until_explicit_finish: bool = True
    use_cache: bool = True

    # Custom fields for your benchmark
    custom_param: str = "default_value"

    # Implement required abstract methods
    def _get_sample_id(self, sample: dict) -> str:
        """Extract unique task ID from sample.

        This ID is used for:
        - Output directory naming
        - Task identification in logs
        - Result tracking
        """
        return sample["task_id"]  # or sample.get("id"), etc.

    def _get_user_msg_first(self, sample: dict) -> str:
        """Extract the initial prompt/message to send to agent.

        This is the first message that will trigger agent execution.
        """
        return sample["prompt"]  # or sample.get("question"), etc.

    # Optional: Override methods for custom behavior
    def _get_dataset(self) -> datasets.Dataset:
        """Load dataset with custom filtering or preprocessing."""
        dataset = super()._get_dataset()
        # Add custom filtering logic if needed
        # dataset = dataset.filter(lambda x: x["difficulty"] == "hard")
        return dataset

    def _create_task(self, sample: dict) -> EvaluationTask:
        """Create custom task with additional fields if needed."""
        task = super()._create_task(sample)
        # Add custom fields to task if needed
        return task

    def _get_input_data_path(self, sample: dict) -> str:
        """Specify input data directory for this sample."""
        task_id = self._get_sample_id(sample)
        return str(Path(self.input_data_path) / task_id) if self.input_data_path else ""

    def _get_cache_dir(self, sample: dict) -> str:
        """Specify cache directory for sandbox state."""
        task_id = self._get_sample_id(sample)
        return str(Path(self.cache_dir) / task_id) if self.cache_dir else ""

    def _get_output_dir_in_sandbox(self, sample: dict) -> str | tuple | None:
        """Specify which sandbox directories to export after execution."""
        return "/output"  # or ("/output1", "/output2") for multiple dirs

    def customized_modify_and_save_results(
        self,
        *,
        results: list | None,
        failed_samples: list[str] | None,
        mode: str,
    ) -> None:
        """Post-process and save aggregated results after all samples complete.

        Args:
            results: List of successful sample outputs
            failed_samples: List of task IDs that failed
            mode: Execution mode ("multiprocess", "threaded", or "single_thread")
        """
        # Calculate metrics, save summary, etc.
        pass

    def evaluate(self) -> None:
        """Analyze collected results and generate final metrics.

        This is called after generate() completes. Implement your
        evaluation logic here (accuracy, pass rate, etc.).
        """
        # Load results from output_dir
        # Calculate metrics
        # Save evaluation report
        pass
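
For the command-line invocations shown in this guide to work, the module also needs a Fire entry point; a minimal sketch, assuming the class is exposed to Fire directly (Fire maps constructor arguments to flags and method names to commands):

if __name__ == "__main__":
    import fire

    fire.Fire(MyEvaluation)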

3. Configuration Template

Create a configuration template in src/aigise/evaluations/configs/:

# src/aigise/evaluations/configs/my_benchmark_config.toml
[llm]
model_name = "gemini-2.0-flash-exp"
temperature = 0.7

[sandbox]
[sandbox.main]
type = "docker"
image = "python:3.11"
working_dir = "/workspace"

# Template variables can be used:
# ${TASK_NAME} - Replaced with actual task ID
# ${PROJECT_RELATIVE_SHARED_DATA_PATH} - Replaced with data path
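
Conceptually, these placeholders are substituted per task before the config is applied. The exact mechanism is internal to AIgiSE, but a sketch using Python's string.Template (which shares the ${VAR} syntax) illustrates the idea:

from pathlib import Path
from string import Template

config_text = Path("my_benchmark_config.toml").read_text()
rendered = Template(config_text).substitute(
    TASK_NAME="task_001",  # hypothetical task ID
    PROJECT_RELATIVE_SHARED_DATA_PATH="data/my_benchmark",  # hypothetical path
)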

4. Registration

The evaluation class is automatically registered when imported. The registration name is the lowercase class name.

Example:

  • Class name: MyEvaluation → registered as "myevaluation"
  • Retrieve with: get_evaluation_class("myevaluation")
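
A retrieval sketch (the import path for get_evaluation_class is an assumption; adjust to wherever the registry lives in your tree):

from aigise.evaluations import get_evaluation_class  # import path assumed

EvaluationCls = get_evaluation_class("myevaluation")
evaluation = EvaluationCls(
    dataset_path="org/dataset",
    agent_dir="examples/agents/my_agent",
)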

5. Running the Evaluation

Since evaluations use Python Fire, you can run them from the command line:

# Run with auto-select mode (recommended)
python -m aigise.evaluations.my_benchmark.my_evaluation \
  --dataset_path="org/dataset" \
  --agent_dir="examples/agents/my_agent" \
  --max_workers=6 \
  --output_dir="results/my_benchmark" \
  run

# Or for debugging (single-threaded)
python -m aigise.evaluations.my_benchmark.my_evaluation \
  --dataset_path="org/dataset" \
  --agent_dir="examples/agents/my_agent" \
  run_debug

# Or directly specify execution method
python -m aigise.evaluations.my_benchmark.my_evaluation \
  --dataset_path="org/dataset" \
  --agent_dir="examples/agents/my_agent" \
  generate  # or generate_threaded, generate_single_thread

Or programmatically:

from aigise.evaluations import MyEvaluation

# Create and run
evaluation = MyEvaluation(
    dataset_path="org/dataset",
    agent_dir="examples/agents/my_agent",
    output_dir="results/my_benchmark",  # Optional, auto-generated if not provided
    max_workers=6,
)

# Run evaluation
evaluation.run()  # or evaluation.run_debug() for debugging

Evaluation Lifecycle

Each evaluation sample goes through the following lifecycle (a condensed code sketch follows the list):

  1. Task Creation (_create_task())
     • Convert dataset sample to EvaluationTask
     • Extract task ID, prompt, paths, etc.

  2. Environment Preparation (_prepare_environment())
     • Initialize AIgiSE session
     • Load/launch sandboxes
     • Set up Neo4j (if enabled)
     • Load cached sandbox states (if use_cache=True)

  3. Agent Preparation (_prepare_agent())
     • Load mk_agent function from agent_dir
     • Create agent instance
     • Configure model (if use_config_model=True)

  4. Agent Execution (_run_agent())
     • Send prompt to agent
     • Run agent with configured limits
     • Track LLM calls, costs, etc.
     • Handle run_until_explicit_finish loop

  5. Output Collection (_collect_outputs())
     • Export sandbox outputs (if output_dir_in_sandbox specified)
     • Export Neo4j database
     • Save session trace
     • Calculate cost information

  6. Cleanup
     • Clean up sandboxes
     • Close sessions
     • Save error information (if failed)
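
Condensed, the per-sample flow looks roughly like this (method names are from the list above; the actual signatures and return values in the base class may differ):

# Inside the base Evaluation class, roughly:
task = self._create_task(sample)            # 1. task creation
session = self._prepare_environment(task)   # 2. sandboxes, Neo4j, caches
agent = self._prepare_agent(session, task)  # 3. load mk_agent, build agent
self._run_agent(agent, task)                # 4. drive the agent within limits
self._collect_outputs(session, task)        # 5. export outputs, traces, costs
# 6. cleanup runs last: sandboxes, sessions, error info on failure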

Key Methods to Override

Required Abstract Methods

  • _get_sample_id(sample: dict) -> str: Extract unique task ID
  • _get_user_msg_first(sample: dict) -> str: Extract initial prompt

Optional Methods (with Defaults)

  • _get_dataset() -> datasets.Dataset: Load and filter dataset
  • _create_task(sample: dict) -> EvaluationTask: Create task instance
  • _get_input_data_path(sample: dict) -> str: Input data directory
  • _get_cache_dir(sample: dict) -> str: Cache directory
  • _get_output_dir_in_sandbox(sample: dict) -> str | tuple | None: Output dirs to export
  • _prepare_general_env() -> None: Setup shared across all samples
  • _before_initialize_hooks(aigise_session, task) -> None: Hooks before sandbox init
  • customized_modify_and_save_results(results, failed_samples, mode) -> None: Post-processing
  • evaluate() -> None: Final evaluation and metrics (illustrated below)
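
As an illustration of the last hook, a minimal evaluate() might aggregate per-task results from the run's output directory. This is a sketch: the directory layout follows the Output Structure section below, and the "passed" field is a hypothetical value your benchmark would have written.

import json
from pathlib import Path

def evaluate(self) -> None:
    # Walk the per-task directories created under output_dir.
    results = []
    for meta_file in Path(self.output_dir).glob("*/metadata.json"):
        results.append(json.loads(meta_file.read_text()))
    passed = sum(1 for r in results if r.get("passed"))  # hypothetical field
    report = {
        "total": len(results),
        "passed": passed,
        "pass_rate": passed / len(results) if results else 0.0,
    }
    (Path(self.output_dir) / "evaluation_report.json").write_text(
        json.dumps(report, indent=2)
    )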

Output Structure

Each evaluation run creates an output directory with the following structure:

evals/
└── myevaluation/
    └── yymmdd_HHMMSS/
        ├── evaluation_master.log       # Master log for entire run
        ├── eval_params.json            # Evaluation parameters
        ├── task_001/
        │   ├── execution_debug.log     # DEBUG-level log
        │   ├── execution_info.log      # INFO-level log
        │   ├── config_used.toml        # Config used for this task
        │   ├── cost_info.json          # Token usage and costs
        │   ├── session_trace.json      # Complete session events
        │   ├── session_trace.txt       # Human-readable trace
        │   ├── metadata.json           # Task metadata
        │   ├── sandbox_output/         # Exported from sandbox
        │   └── neo4j_history/          # Neo4j database export
        └── task_002/
            └── ...
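
To inspect a single task after a run, the per-task JSON files can be read directly. Paths follow the tree above; the run timestamp and the JSON field names are examples, not guaranteed values:

import json
from pathlib import Path

task_dir = Path("evals/myevaluation/240101_120000/task_001")  # hypothetical run
cost = json.loads((task_dir / "cost_info.json").read_text())
metadata = json.loads((task_dir / "metadata.json").read_text())
print(cost, metadata)  # field names depend on the framework version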

Configuration Options

Key configuration options available in Evaluation:

Option                     Type   Default    Description
dataset_path               str    Required   HuggingFace dataset or local path
agent_dir                  str    Required   Directory with agent.py
max_llm_calls              int    100        Maximum LLM calls per task
max_workers                int    6          Parallel workers
use_multiprocessing        bool   True       Use multiprocessing vs. threading
use_cache                  bool   True       Load/cache sandbox states
run_until_explicit_finish  bool   True       Keep running until the task finishes
use_config_model           bool   False      Use model from config file
llm_retry_count            int    3          Retries for LLM API calls
llm_retry_timeout          int    30         Timeout per LLM request (seconds)
log_level                  str    "INFO"     Terminal log level
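
Each option maps one-to-one to a constructor argument (and therefore to a Fire flag); for example:

evaluation = MyEvaluation(
    dataset_path="org/dataset",
    agent_dir="examples/agents/my_agent",
    max_llm_calls=50,
    use_multiprocessing=False,  # fall back to threading
    llm_retry_count=5,
    log_level="DEBUG",
)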

Examples

See existing evaluations for reference:

  • src/aigise/evaluations/cybergym/__init__.py - Base evaluation class for the benchmark
  • src/aigise/evaluations/cybergym/cybergym_static.py - Full-featured evaluation
  • src/aigise/evaluations/mock_debug/mock_debug_evaluation.py - Minimal example
  • src/aigise/evaluations/secodeplt/vul_detection.py - Vulnerability-detection evaluation example

See Also