Benchmarking with Goose
The Goose benchmarking system lets you evaluate Goose's performance on complex tasks under one or more system
configurations.
This guide covers how to use the goose bench command to run benchmarks and analyze results.
Quick Start
- The benchmarking system includes several evaluation suites. Run the following to list every valid selector:
goose bench selectors
- Create a basic configuration file:
goose bench init-config -n bench-config.json
cat bench-config.json
{
"models": [
{
"provider": "databricks",
"name": "goose",
"parallel_safe": true
}
],
"evals": [
{
"selector": "core",
"parallel_safe": true
}
],
"repeat": 1
}
...etc.
- Run the benchmark:
goose bench run -c bench-config.json
Configuration File
The benchmark configuration is specified in a JSON file with the following structure:
{
"models": [
{
"provider": "databricks",
"name": "goose",
"parallel_safe": true,
"tool_shim": {
"use_tool_shim": false,
"tool_shim_model": null
}
}
],
"evals": [
{
"selector": "core",
"post_process_cmd": null,
"parallel_safe": true
}
],
"include_dirs": [],
"repeat": 2,
"run_id": null,
"eval_result_filename": "eval-results.json",
"run_summary_filename": "run-results-summary.json",
"env_file": null
}
Configuration Options
Models Section
Each model entry in the models array specifies:
- provider: The model provider (e.g., "databricks")
- name: Model identifier
- parallel_safe: Whether the model can be run in parallel
- tool_shim: Optional configuration for tool shimming
  - use_tool_shim: Enable/disable tool shimming
  - tool_shim_model: Optional model to use for tool shimming
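For example, a model entry with tool shimming enabled might look like the following; the provider and model names here are placeholders, so substitute ones available in your environment:
{
  "provider": "ollama",
  "name": "some-local-model",
  "parallel_safe": true,
  "tool_shim": {
    "use_tool_shim": true,
    "tool_shim_model": "some-shim-model"
  }
}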
Evals Section
Each evaluation entry in the evals array specifies:
- selector: The evaluation suite to run (e.g., "core")
- post_process_cmd: Optional path to a post-processing script
- parallel_safe: Whether the evaluation can run in parallel
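You can also list multiple entries to run more than one suite in a single benchmark. The second selector below is a placeholder; use goose bench selectors to find valid names:
{
  "evals": [
    {
      "selector": "core",
      "parallel_safe": true
    },
    {
      "selector": "another-suite",
      "parallel_safe": false
    }
  ]
}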
General Options
- include_dirs: Additional directories to include in the evaluation
- repeat: Number of times to repeat each evaluation
- run_id: Optional identifier for the benchmark run
- eval_result_filename: Name of the evaluation results file
- run_summary_filename: Name of the summary results file
- env_file: Optional path to an environment file
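For instance, to repeat every evaluation three times under a named run, you might set the following (both values are illustrative):
{
  "repeat": 3,
  "run_id": "nightly-001"
}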
Mechanics of the include_dirs option
The include_dirs config parameter makes the items at every listed path available to all evaluations.
It accomplishes this by:
- copying each included asset into the top-level directory created for each model/provider pair
- then, at evaluation run-time, copying whichever asset an evaluation explicitly requires into the eval-specific directory
  - only if the evaluation code specifically pulls it in
  - and only if the evaluation is covered by one of the configured selectors and therefore runs
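As a hypothetical illustration, with "include_dirs": ["/data/fixtures"], an asset such as cities.csv would be staged for each run and then surface inside an evaluation's directory only on demand:
run-${i}/
    cities.csv
    core/developer/list_files/
        cities.csv    (present only if this evaluation pulls it in and its selector runs)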
Customizing Evaluations
You can customize runs in several ways:
- Using Post-Processing Commands after evaluation (see the script sketch after this list):
{
"evals": [
{
"selector": "core",
"post_process_cmd": "/path/to/process-script.sh",
"parallel_safe": true
}
]
}
- Including Additional Data:
{
"include_dirs": [
"/path/to/custom/eval/data"
]
}
- Setting Environment Variables:
{
"env_file": "/path/to/env-file"
}
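As a sketch of a post-processing command, the script below assumes it runs where eval-results.json was written and that jq is installed; both assumptions should be verified against your own runs:
#!/usr/bin/env bash
# process-script.sh -- hypothetical post-processing step.
# Assumes eval-results.json is reachable from the working directory;
# adjust the path for your setup.
set -euo pipefail
jq '.' eval-results.json > eval-results-pretty.json
The file referenced by env_file holds KEY=value pairs, one per line; the variable below is only an example:
EXAMPLE_API_KEY=replace-me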
Output and Results
The benchmark generates two main output files within a file hierarchy similar to the following.
Results from running each model/provider pair are stored within their own directory:
benchmark-${datetime}/
    ${model}-${provider}[-tool-shim[-${shim-model}]]/
        run-${i}/
            ${an-include_dir-asset}
            run-results-summary.json
            core/developer/list_files/
                ${an-include_dir-asset}
                eval-results.json
- eval-results.json: Contains detailed results from each evaluation, including:
  - Individual test case results
  - Model responses
  - Scoring metrics
  - Error logs
- run-results-summary.json: A collection of all eval results across all suites.
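To spot-check a run, you can pretty-print either file with a JSON tool; this assumes jq is installed and makes no assumption about the files' internal schema:
jq '.' run-results-summary.json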
Debug Mode
For detailed logging, you can enable debug mode:
RUST_LOG=debug goose bench run -c bench-config.json
Advanced Usage
Tool Shimming
Tool shimming allows you to use non-tool-capable models with Goose, provided Ollama is installed on the
system.
See this guide for important details on tool shimming.