
# Running Evaluations

## The Eval Set Config

An eval set config is a YAML file that defines a grid of tasks, agents, and models to evaluate. Hawk runs every combination.

Here's a minimal example:

```yaml
tasks:
  - package: git+https://github.com/UKGovernmentBEIS/inspect_evals
    name: inspect_evals
    items:
      - name: mbpp

models:
  - package: openai
    name: openai
    items:
      - name: gpt-4o-mini

limit: 1 # optional: cap the number of samples
```

Submit it:

```bash
hawk eval-set config.yaml
```

## Adding Agents

```yaml
agents:
  - package: git+https://github.com/METR/inspect-agents
    name: metr_agents
    items:
      - name: react
        args:
          max_attempts: 3
```

## Eval Parameters

These fields are passed through to `inspect_ai.eval_set()`:

| Field | Description |
| --- | --- |
| `limit` | Maximum samples to evaluate |
| `time_limit` | Per-sample time limit in seconds |
| `message_limit` | Maximum messages per sample |
| `epochs` | Number of evaluation epochs |
| `eval_set_id` | Custom ID (auto-generated if omitted) |
| `metadata` | Arbitrary metadata dictionary |
| `tags` | List of tags for organization |

For the complete schema, see `hawk/core/types/evals.py` or the Inspect AI docs.
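
A config combining several of these parameters might look like the fragment below. All values here are illustrative, not recommended defaults:

```yaml
# Illustrative values only
limit: 10
time_limit: 600 # 10 minutes per sample
message_limit: 50
epochs: 2
tags:
  - smoke-test
metadata:
  owner: eval-team
```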

## Secrets and API Keys

Pass environment variables to your eval runner with `--secret` or `--secrets-file`:

```bash
# From your environment
hawk eval-set config.yaml --secret MY_API_KEY

# From a file
hawk eval-set config.yaml --secrets-file .env

# Both
hawk eval-set config.yaml --secrets-file .env --secret ANOTHER_KEY
```

By default, Hawk routes model API calls through its managed LLM proxy (supporting OpenAI, Anthropic, and Google Vertex). To use your own API keys instead, pass them as secrets and disable the proxy's token refresh:

```yaml
runner:
  environment:
    INSPECT_ACTION_RUNNER_REFRESH_URL: ""
```

You can also declare required secrets in your config to catch missing credentials before the job starts:

```yaml
runner:
  secrets:
    - name: DATASET_ACCESS_KEY
      description: API key for dataset access
```

## Additional Packages

Install extra Python packages into the runner's virtualenv:

```yaml
packages:
  - git+https://github.com/some-org/some-package
```

Private GitHub repos work automatically if Hawk's GitHub token has access. Both `git@github.com:` and `ssh://git@github.com/` URL formats are supported and converted to HTTPS internally.
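
The rewriting is roughly of this shape. This is an illustrative sketch, not Hawk's actual code, and the function name is made up:

```python
def github_url_to_https(url: str) -> str:
    """Illustrative sketch: map both supported SSH URL forms to HTTPS,
    leaving other URLs untouched. Not Hawk's actual implementation."""
    for prefix in ("git@github.com:", "ssh://git@github.com/"):
        if url.startswith(prefix):
            # Replace the SSH prefix with the HTTPS GitHub base URL
            return "https://github.com/" + url[len(prefix):]
    return url
```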

## [Experimental] Custom Runner Images

You can use your own Docker image for the runner instead of the default:

```yaml
runner:
  image: "<ecr-url>/custom-runners:my-image-v1"
```

Or via the CLI:

```bash
hawk eval-set config.yaml --image <image-uri>
```

Images must have an explicit tag (`:v1`) or digest (`@sha256:...`). Tagless references and `:latest` are rejected.
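
That rule can be sketched as follows. This is an illustration of the constraint, not Hawk's actual validation code:

```python
import re

def image_ref_ok(ref: str) -> bool:
    """Sketch of the rule above: accept a sha256 digest or an explicit
    tag, reject tagless references and :latest. Not Hawk's real code."""
    if "@sha256:" in ref:
        return True  # pinned by digest
    # A tag is a trailing ":<name>" where <name> has no slashes,
    # so registry ports like "host:5000/img" don't count as tags.
    match = re.search(r":([\w][\w.-]*)$", ref)
    return bool(match) and match.group(1) != "latest"
```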

Each Hawk deployment includes a `custom-runners` ECR repo with immutable tags. Get its URL with `pulumi stack output custom_runners_ecr_url`. Public images from any registry also work.

Look at the Dockerfile in `infra/runner-image/` to see what a valid image looks like.

## Monitoring

### Logs

```bash
hawk logs           # last 100 log lines for current job
hawk logs -f        # follow logs in real-time
hawk logs -n 50     # last 50 lines
hawk logs JOB_ID -f # follow a specific job
```

### Status

```bash
hawk status            # JSON report: pod state, logs, metrics
hawk status --hours 48 # include 48 hours of log data
```

### Web Viewer

```bash
hawk web              # open current eval set in browser
hawk web EVAL_SET_ID  # open a specific eval set
hawk view-sample UUID # open a specific sample
```

## Listing and Inspecting Results

```bash
hawk list eval-sets             # list all eval sets
hawk list evals [EVAL_SET_ID]   # list evals in an eval set
hawk list samples [EVAL_SET_ID] # list samples
hawk transcript UUID            # download a sample transcript (markdown)
hawk transcript UUID --raw      # download as raw JSON
hawk transcripts [EVAL_SET_ID]  # download all transcripts
```

## Running Locally

Run evals on your own machine instead of the cluster. Useful for debugging.

```bash
hawk local eval-set examples/simple.eval-set.yaml
```

This creates a fresh virtualenv in a temp directory, installs dependencies, and runs the evaluation the same way the cluster would.

### Debugging with `--direct`

Use `--direct` to skip the virtualenv and run in your current Python environment:

```bash
hawk local eval-set examples/simple.eval-set.yaml --direct
```

This lets you set breakpoints in your IDE and debug from the start. Note that `--direct` installs dependencies into your current environment.

### Using an AI Gateway

Route model calls through a managed AI gateway:

```bash
export HAWK_AI_GATEWAY_URL=https://your-gateway.example.com
hawk login
hawk local eval-set examples/simple.eval-set.yaml
```

## Sample Editing

Batch edit sample scores or invalidate samples:

```bash
hawk edit-samples edits.json
```

Accepts JSON arrays or JSONL:

```json
[
  {"sample_uuid": "...", "details": {"type": "score_edit", ...}},
  {"sample_uuid": "...", "details": {"type": "invalidate_sample", ...}}
]
```

## Resource Cleanup

```bash
hawk delete             # delete current eval set's Kubernetes resources (logs are kept)
hawk delete EVAL_SET_ID # delete a specific eval set's resources
```