Running Evaluations
The Eval Set Config
An eval set config is a YAML file that defines a grid of tasks, agents, and models to evaluate. Hawk runs every combination.
Here's a minimal example:
```yaml
tasks:
  - package: git+https://github.com/UKGovernmentBEIS/inspect_evals
    name: inspect_evals
    items:
      - name: mbpp

models:
  - package: openai
    name: openai
    items:
      - name: gpt-4o-mini

limit: 1  # optional: cap the number of samples
```
Submit it:
```bash
hawk eval-set config.yaml
```
Adding Agents
```yaml
agents:
  - package: git+https://github.com/METR/inspect-agents
    name: metr_agents
    items:
      - name: react
        args:
          max_attempts: 3
```
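To make "every combination" concrete, here is a minimal Python sketch of how a grid of tasks, agents, and models expands. It is an illustration only, not Hawk's implementation: the `config` dict is a hand-written stand-in for a parsed config, `item_names` and `expand_grid` are hypothetical helpers, the second model (`gpt-4o`) is added purely to make the grid larger than one, and treating a missing `agents` key as a single "no agent" slot is an assumption.

```python
from itertools import product

# Hypothetical stand-in for a parsed eval set config (not loaded from YAML here).
config = {
    "tasks": [{"name": "inspect_evals", "items": [{"name": "mbpp"}]}],
    "agents": [{"name": "metr_agents", "items": [{"name": "react"}]}],
    "models": [
        {"name": "openai", "items": [{"name": "gpt-4o-mini"}, {"name": "gpt-4o"}]}
    ],
}


def item_names(groups):
    """Flatten package groups into 'package_name/item_name' labels."""
    return [f"{g['name']}/{i['name']}" for g in groups for i in g["items"]]


def expand_grid(cfg):
    """Return every (task, agent, model) combination as a list of tuples."""
    tasks = item_names(cfg["tasks"])
    # Assumption: a config without agents still yields one run per task/model pair.
    agents = item_names(cfg.get("agents", [])) or [None]
    models = item_names(cfg["models"])
    return list(product(tasks, agents, models))


grid = expand_grid(config)
for task, agent, model in grid:
    print(task, agent, model)
```

With one task, one agent, and two models, the grid has two entries; adding a second task would double it.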
Eval Parameters
These fields are passed through to inspect_ai.eval_set():
| Field | Description |
|---|---|
| `limit` | Maximum samples to evaluate |
| `time_limit` | Per-sample time limit in seconds |
| `message_limit` | Maximum messages per sample |
| `epochs` | Number of evaluation epochs |
| `eval_set_id` | Custom ID (auto-generated if omitted) |
| `metadata` | Arbitrary metadata dictionary |
| `tags` | List of tags for organization |
For the complete schema, see hawk/core/types/evals.py or the Inspect AI docs.
Secrets and API Keys
Pass environment variables to your eval runner with --secret or --secrets-file:
```bash
# From your environment
hawk eval-set config.yaml --secret MY_API_KEY

# From a file
hawk eval-set config.yaml --secrets-file .env

# Both
hawk eval-set config.yaml --secrets-file .env --secret ANOTHER_KEY
```
By default, Hawk routes model API calls through its managed LLM proxy (supporting OpenAI, Anthropic, and Google Vertex). To use your own API keys instead, pass them as secrets and disable the proxy's token refresh:
```yaml
runner:
  environment:
    INSPECT_ACTION_RUNNER_REFRESH_URL: ""
```
You can also declare required secrets in your config to catch missing credentials before the job starts:
```yaml
runner:
  secrets:
    - name: DATASET_ACCESS_KEY
      description: API key for dataset access
```
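If you want a similar check in your own tooling before calling `hawk eval-set`, the idea can be sketched in a few lines of Python. This is not Hawk's validation code: `missing_secrets` is a hypothetical helper, the `declared` list is a hand-written copy of the `runner.secrets` entries, and treating an empty value as missing is an assumption.

```python
import os

# Hand-written copy of the runner.secrets entries from the config above.
declared = [
    {"name": "DATASET_ACCESS_KEY", "description": "API key for dataset access"},
]


def missing_secrets(declared_secrets, environ=None):
    """Return the names of declared secrets that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [s["name"] for s in declared_secrets if not environ.get(s["name"])]


missing = missing_secrets(declared)
if missing:
    print("Missing secrets:", ", ".join(missing))
else:
    print("All declared secrets are set.")
```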
Additional Packages
Install extra Python packages into the runner's virtualenv:
```yaml
packages:
  - git+https://github.com/some-org/some-package
```
Private GitHub repos work automatically if Hawk's GitHub token has access. Both git@github.com: and ssh://git@github.com/ URL formats are supported and converted to HTTPS internally.
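As a rough illustration of that conversion, the following Python sketch rewrites the two SSH-style forms into HTTPS. It only mirrors the behavior described above and is not Hawk's internal code: `to_https` is a hypothetical helper, and any handling of prefixes such as `git+` is left out.

```python
SCP_PREFIX = "git@github.com:"
SSH_PREFIX = "ssh://git@github.com/"
HTTPS_PREFIX = "https://github.com/"


def to_https(url: str) -> str:
    """Rewrite the two SSH-style GitHub URL forms to HTTPS; leave others unchanged."""
    for prefix in (SCP_PREFIX, SSH_PREFIX):
        if url.startswith(prefix):
            return HTTPS_PREFIX + url[len(prefix):]
    return url


print(to_https("git@github.com:some-org/some-package"))
print(to_https("ssh://git@github.com/some-org/some-package"))
```

Both calls print the same HTTPS URL, which is the point: either spelling resolves to the same repository.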
[Experimental] Custom Runner Images
You can use your own Docker image for the runner instead of the default:
```yaml
runner:
  image: "<ecr-url>/custom-runners:my-image-v1"
```
Or via the CLI: hawk eval-set config.yaml --image <image-uri>
Images must have an explicit tag (:v1) or digest (@sha256:...). Untagged references and the :latest tag are rejected.
Each Hawk deployment includes a custom-runners ECR repo with immutable tags. Get its URL with pulumi stack output custom_runners_ecr_url. Public images from any registry also work.
See the Dockerfile in infra/runner-image/ for an example of what a valid image looks like.
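The tag-or-digest rule can be checked locally before submitting. The Python sketch below is one reading of that rule, not Hawk's actual validator: `has_pinned_version` is a hypothetical helper, the registry hostnames are placeholders, and the digest check only looks for a non-empty `sha256:` value.

```python
def has_pinned_version(image_ref: str) -> bool:
    """Return True if image_ref has a sha256 digest or an explicit non-latest tag."""
    # A digest reference looks like repo@sha256:<hex>.
    if "@" in image_ref:
        digest = image_ref.rsplit("@", 1)[1]
        return digest.startswith("sha256:") and len(digest) > len("sha256:")
    # Only the last path segment can carry a tag; an earlier colon belongs
    # to a registry port such as localhost:5000.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")


for ref in (
    "registry.example.com/custom-runners:my-image-v1",
    "registry.example.com/custom-runners:latest",
    "registry.example.com/custom-runners",
):
    print(ref, has_pinned_version(ref))
```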
Monitoring
Logs
```bash
hawk logs              # last 100 log lines for current job
hawk logs -f           # follow logs in real-time
hawk logs -n 50        # last 50 lines
hawk logs JOB_ID -f    # follow a specific job
```
Status
```bash
hawk status              # JSON report: pod state, logs, metrics
hawk status --hours 48   # include 48 hours of log data
```
Web Viewer
```bash
hawk web                 # open current eval set in browser
hawk web EVAL_SET_ID     # open a specific eval set
hawk view-sample UUID    # open a specific sample
```
Listing and Inspecting Results
```bash
hawk list eval-sets               # list all eval sets
hawk list evals [EVAL_SET_ID]     # list evals in an eval set
hawk list samples [EVAL_SET_ID]   # list samples
hawk transcript UUID              # download a sample transcript (markdown)
hawk transcript UUID --raw        # download as raw JSON
hawk transcripts [EVAL_SET_ID]    # download all transcripts
```
Running Locally
Run evals on your own machine instead of the cluster. Useful for debugging.
```bash
hawk local eval-set examples/simple.eval-set.yaml
```
This creates a fresh virtualenv in a temp directory, installs dependencies, and runs the evaluation the same way the cluster would.
Debugging with --direct
Use --direct to skip the virtualenv and run in your current Python environment:
```bash
hawk local eval-set examples/simple.eval-set.yaml --direct
```
This lets you set breakpoints in your IDE and debug from the start. Note that --direct installs dependencies into your current environment.
Using an AI Gateway
Route model calls through a managed AI gateway:
```bash
export HAWK_AI_GATEWAY_URL=https://your-gateway.example.com
hawk login
hawk local eval-set examples/simple.eval-set.yaml
```
Sample Editing
Batch edit sample scores or invalidate samples:
```bash
hawk edit-samples edits.json
```
Accepts JSON arrays or JSONL:
```json
[
  {"sample_uuid": "...", "details": {"type": "score_edit", ...}},
  {"sample_uuid": "...", "details": {"type": "invalidate_sample", ...}}
]
```
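Since both a JSON array and JSONL are accepted, the same list of edits can be serialized either way. The Python sketch below shows the two encodings side by side using only the standard library. The UUID strings are placeholders, and the `details` objects carry only the `type` key because the remaining fields are elided above; real edits will need whatever additional fields Hawk expects. The helper names are hypothetical.

```python
import json

# Placeholder edits: UUIDs are dummies and "details" holds only the "type"
# key, because the other fields are elided in the documentation above.
edits = [
    {"sample_uuid": "<sample-uuid-1>", "details": {"type": "score_edit"}},
    {"sample_uuid": "<sample-uuid-2>", "details": {"type": "invalidate_sample"}},
]


def to_json_array(items):
    """Serialize the edits as a single JSON array."""
    return json.dumps(items, indent=2)


def to_jsonl(items):
    """Serialize the edits as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(item) for item in items) + "\n"


def from_jsonl(text):
    """Parse JSONL text back into a list of edits, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


print(to_json_array(edits))
print(to_jsonl(edits), end="")
```

Either string can be written to a file and passed to `hawk edit-samples`; JSONL is convenient when edits are appended incrementally by a script.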
Resource Cleanup
```bash
hawk delete                # delete current eval set's Kubernetes resources (logs are kept)
hawk delete EVAL_SET_ID    # delete a specific eval set's resources
```