Running Evaluations¶

The Eval Set Config¶

An eval set config is a YAML file that defines a grid of tasks, agents, and models to evaluate. Hawk runs every combination.

Here's a minimal example:

tasks:
  - package: git+https://github.com/UKGovernmentBEIS/inspect_evals
    name: inspect_evals
    items:
      - name: mbpp

models:
  - package: openai
    name: openai
    items:
      - name: gpt-4o-mini

limit: 1  # optional: cap the number of samples

Submit it:

hawk eval-set config.yaml

Adding Agents¶

agents:
  - package: git+https://github.com/METR/inspect-agents
    name: metr_agents
    items:
      - name: react
        args:
          max_attempts: 3

Eval Parameters¶

These fields are passed through to inspect_ai.eval_set():

Field	Description
`limit`	Maximum samples to evaluate
`time_limit`	Per-sample time limit in seconds
`message_limit`	Maximum messages per sample
`epochs`	Number of evaluation epochs
`eval_set_id`	Custom ID (auto-generated if omitted)
`metadata`	Arbitrary metadata dictionary
`tags`	List of tags for organization

For the complete schema, see hawk/core/types/evals.py or the Inspect AI docs.

Secrets and API Keys¶

Pass environment variables to your eval runner with --secret or --secrets-file:

# From your environment
hawk eval-set config.yaml --secret MY_API_KEY

# From a file
hawk eval-set config.yaml --secrets-file .env

# Both
hawk eval-set config.yaml --secrets-file .env --secret ANOTHER_KEY

By default, Hawk routes model API calls through its managed LLM proxy (supporting OpenAI, Anthropic, and Google Vertex). To use your own API keys instead, pass them as secrets and disable the proxy's token refresh:

runner:
  environment:
    INSPECT_ACTION_RUNNER_REFRESH_URL: ""

You can also declare required secrets in your config to catch missing credentials before the job starts:

runner:
  secrets:
    - name: DATASET_ACCESS_KEY
      description: API key for dataset access

Additional Packages¶

Install extra Python packages into the runner's virtualenv:

packages:
  - git+https://github.com/some-org/some-package

Private GitHub repos work automatically if Hawk's GitHub token has access. Both git@github.com: and ssh://git@github.com/ URL formats are supported and converted to HTTPS internally.

[Experimental] Custom Runner Images¶

You can use your own Docker image for the runner instead of the default:

runner:
  image: "<ecr-url>/custom-runners:my-image-v1"

Or via the CLI: hawk eval-set config.yaml --image <image-uri>

Images must have an explicit tag (:v1) or digest (@sha256:...). Tagless and :latest are rejected.

Each Hawk deployment includes a custom-runners ECR repo with immutable tags. Get its URL with pulumi stack output custom_runners_ecr_url. Public images from any registry also work.

Look at the dockerfile in infra/runner-image/ to what a valid image looks like.

Monitoring¶

Logs¶

hawk logs                 # last 100 log lines for current job
hawk logs -f              # follow logs in real-time
hawk logs -n 50           # last 50 lines
hawk logs JOB_ID -f       # follow a specific job

Status¶

hawk status               # JSON report: pod state, logs, metrics
hawk status --hours 48    # include 48 hours of log data

Web Viewer¶

hawk web                  # open current eval set in browser
hawk web EVAL_SET_ID      # open a specific eval set
hawk view-sample UUID     # open a specific sample

Listing and Inspecting Results¶

hawk list eval-sets                     # list all eval sets
hawk list evals [EVAL_SET_ID]           # list evals in an eval set
hawk list samples [EVAL_SET_ID]         # list samples
hawk transcript UUID                    # download a sample transcript (markdown)
hawk transcript UUID --raw              # download as raw JSON
hawk transcripts [EVAL_SET_ID]          # download all transcripts

Running Locally¶

Run evals on your own machine instead of the cluster. Useful for debugging.

hawk local eval-set examples/simple.eval-set.yaml

This creates a fresh virtualenv in a temp directory, installs dependencies, and runs the evaluation the same way the cluster would.

Providing model API keys¶

Unlike cluster runs, hawk local does not route model calls through Hawk's managed LLM proxy (Middleman). Inspect calls each provider's API directly, so you must supply the provider API keys yourself — otherwise the run fails with authentication errors from the provider.

Set the environment variables for the providers your models use, for example:

export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
hawk local eval-set examples/simple.eval-set.yaml

Or keep them in a file and load it with --secrets-file:

# .env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

hawk local eval-set examples/simple.eval-set.yaml --secrets-file .env

You can also forward individual variables from your current shell with --secret NAME (see Secrets and API Keys above). Generate keys from your provider's dashboard (e.g. platform.openai.com, console.anthropic.com).

Note

To route through a managed gateway instead of using your own keys, see Using an AI Gateway below.

Debugging with `--direct`¶

Use --direct to skip the virtualenv and run in your current Python environment:

hawk local eval-set examples/simple.eval-set.yaml --direct

This lets you set breakpoints in your IDE and debug from the start. Note that --direct installs dependencies into your current environment via uv pip install, but model-provider packages (openai, anthropic, etc.) must already be present in the environment hawk was installed into. If they're missing, add them when installing hawk:

uv tool install --python 3.13 --reinstall-package hawk \
  "hawk[cli,runner] @ git+https://github.com/METR/hawk#subdirectory=hawk" --with openai

Note

Tasks from inspect_evals require openai even with a different model provider — inspect_evals imports it at load time, and missing it makes the task registry silently fail with LookupError.

Using an AI Gateway¶

Route model calls through a managed AI gateway:

export HAWK_AI_GATEWAY_URL=https://your-gateway.example.com
hawk login
hawk local eval-set examples/simple.eval-set.yaml

Sample Editing¶

Batch edit sample scores or invalidate samples:

hawk edit-samples edits.json

Accepts JSON arrays or JSONL:

[
  {"sample_uuid": "...", "details": {"type": "score_edit", ...}},
  {"sample_uuid": "...", "details": {"type": "invalidate_sample", ...}}
]

Importing Eval Files¶

Upload locally-produced .eval files into Hawk's warehouse so they appear alongside natively-run eval sets:

hawk import path/to/file.eval               # single file
hawk import path/to/dir/                     # directory of .eval files
hawk import path/to/dir/ --name my-import    # friendly name in the eval_set_id

Imported eval sets get IDs prefixed imported- and have metadata.imported = true set. They appear in the warehouse, viewer, and hawk download identically to natively-run eval sets.

Importing Scans¶

Upload one or more locally-produced Scout scans into the warehouse so their scanner results appear alongside natively-run scans:

hawk scan import path/to/scan_id=.../               # a single scan results directory
hawk scan import path/to/run/                        # a folder of scan directories
hawk scan import path/to/run/ --name my-import

PATH is either a single Scout scan results directory (the scan_id=... directory Scout writes) or a folder containing several such directories. Each scan directory must contain a _scan.json spec and at least one per-scanner .parquet file (_summary.json is uploaded too if present). All the scans in one import land under a single fresh, imported--prefixed scan run, each with a freshly-generated scan id, so they never collide with existing warehouse scans. The scanned transcripts must already be in the warehouse: Hawk derives the models that gate access (the scanner models plus the models of the scanned transcripts' source eval sets, matching what a native hawk scan run over those eval sets would require) and refuses to import if any scanned transcript can't be resolved, or if you lack permission for those models.

Running Human Evaluations¶

For evaluations driven by a human instead of an LLM agent, Hawk provisions a sandbox and exposes it via SSH through the shared jumphost. config.yaml can be any eval-set YAML — the server swaps in the configured human-agent solver, clamps epochs=1 / limit=1, and defaults runner.cleanup to false so the sandbox sticks around between SSH sessions. To set args on the installed agent (e.g., user, record_session), add a human_eval.agent_args block to your config; it's shallow-merged onto the agent. Pass --no-rewrite to keep your config's agents/solvers block untouched (the SSH key is still injected on every solver).

# 1. Register the human and their SSH public key (one-time)
hawk human register --name jane --ssh-key "ssh-ed25519 AAAA..."

# 2. Start the eval — provisions a sandbox and registers jane's key on the jumphost
hawk human eval start config.yaml --human jane

# 3. Get a ready-to-paste SSH command for the sandbox
hawk human eval ssh-command          # uses the last-started eval-set
# — or —
hawk human eval ssh-command <eval-set-id>

ssh-command polls the eval logs for the agent's connection line and prints ssh -J ssh-user@<jumphost> <user>@<sandbox-ip> -p <port>. Load your private key into ssh-agent first (ssh-add /path/to/key) — the -J ProxyJump uses the agent for both the jumphost and sandbox hops. Use --timeout SECONDS to bound how long it waits for the sandbox to come up (default 600).

After hawk delete <eval-set-id> (or eval completion), the SSH key is removed from the jumphost and ssh-command refuses subsequent invocations.

Stopping and Deleting¶

hawk stop tells the running eval to finish gracefully: active samples are scored with whatever work they've done so far, results are written to S3, and the job exits on its own. Use this to end an eval early but keep the partial results.

hawk delete kills the job immediately and tears down its Kubernetes resources. Use this when you don't care about partial results and just want the job gone. Logs already written to S3 are kept.

hawk stop                  # gracefully stop current eval set (score partial work)
hawk stop --error          # mark samples as errors (will retry if retries are configured)
hawk stop --sample UUID    # stop a single sample

hawk delete                # tear down current eval set's Kubernetes resources (logs are kept)
hawk delete EVAL_SET_ID    # delete a specific eval set's resources