CLI Reference¶

Authentication¶

Command	Description
`hawk login`	Authenticate via browser/PKCE; `--device` for the Device flow (not supported by the default Cognito auth)
`hawk auth access-token`	Print a valid access token to stdout
`hawk auth refresh-token`	Print the current refresh token

Evaluations¶

Command	Description
`hawk eval-set CONFIG`	Submit an evaluation set
`hawk eval-set resume [ID]`	Resume a crashed eval set from its last checkpoint
`hawk local eval-set CONFIG`	Run eval locally

hawk eval-set options:

Option	Description
`--image URI`	Full container image URI for the runner
`--image-tag TAG`	Specify runner image tag (within the default repo)
`--secrets-file FILE`	Load secrets from file (repeatable)
`--secret NAME`	Pass env var as secret (repeatable)
`--skip-confirm`	Skip unknown field warnings
`--log-dir-allow-dirty`	Allow dirty log directory

hawk eval-set resume additionally accepts --config FILE — resume with an updated, checkpoint-compatible config (e.g. to fix a crashing scorer); requires an explicit ID. Compatibility is not validated — see Checkpointing & Resume for what's safe to change. If the previous run has finished (or crashed), resume clears its leftover release automatically. It refuses (409) only when that run is still running — stop it first (hawk stop or hawk delete) — or when its state can't be confirmed, in which case clear it with hawk delete and retry.

Eval Set Configuration¶

Checkpointing¶

Eval sets can periodically snapshot in-progress samples to durable storage, allowing resumption after crashes. See Checkpointing & Resume for requirements, configuration options, and the resume workflow. Checkpoints fire only for agents or solvers that integrate Inspect's checkpointer (e.g. a checkpoint-aware react agent) and require the sandbox to permit root exec for in-sandbox capture.

Scans¶

Command	Description
`hawk scan run CONFIG`	Start a Scout scan
`hawk scan resume [ID]`	Resume an interrupted scan
`hawk scan import PATH`	Import a locally-produced scan directory, or a folder of scan directories, into the warehouse under one run (`--name NAME`)
`hawk local scan CONFIG`	Run scan locally

Monitoring¶

Command	Description
`hawk logs [JOB_ID]`	View logs (`-f` to follow, `-n` for line count)
`hawk status [JOB_ID]`	Status report, always emitted as JSON — no flag needed (`--hours` for log window)
`hawk watch [JOB_ID]`	Live per-task / per-sample status (streams until the run finishes; `--json` for a raw single snapshot)
`hawk trace [JOB_ID]`	View the Inspect trace log from a running runner pod
`hawk stacktrace [JOB_ID]`	Capture a live py-spy stack dump of the runner process (live only)

hawk watch shows the same live view as the web viewer's status page — per-task progress bars, retries, limits, scores, and scheduling/pod trouble — streamed over SSE with an automatic polling fallback.

Option	Description
`--no-follow`	Print a single snapshot and exit instead of streaming
`--json`	Output the raw status JSON (implies a single snapshot)

hawk logs options:

Option	Description
`-n`, `--lines INT`	Number of lines to show (default: 100)
`-f`, `--follow`	Follow mode — continuously poll for new logs
`--hours INT`	Hours of data to search (default: 5 years)
`--poll-interval FLOAT`	Seconds between polls in follow mode (default: 3.0)

hawk trace options:

Option	Description
`-n`, `--lines INT`	Number of lines from the end of the trace (default: 100)
`--full`	Fetch the entire trace file instead of just the tail
`-f`, `--follow`	Follow the trace, printing new lines as the eval appends them (Ctrl-C to stop)
`--poll-interval FLOAT`	Seconds between polls in `--follow` mode (default: 3.0)
`--raw`	Output the unmodified JSON-lines trace records instead of formatted lines

The trace log records enter/exit events for model calls, subprocesses, and other long-running actions — useful for diagnosing a stuck or in-progress eval. Records are formatted like hawk logs output; use --raw to get the underlying JSON-lines for tooling. Access requires the same model-group permissions as viewing the eval's results. Only available while the runner pod is still running.

hawk trace abc123 -f                            # Follow new trace lines live (Ctrl-C to stop)
hawk trace abc123 --full --raw > trace.log      # Raw trace for: inspect trace anomalies trace.log

hawk stacktrace options:

Option	Description
`--native`	Include native (C-extension) stack frames
`--json`	Output py-spy's JSON instead of the formatted text dump

Captures a live py-spy stack dump of the runner process (PID 1) inside the runner pod. Useful for diagnosing a stuck eval — shows exactly where each thread is blocked right now. Live only (runner pod must be running).

Viewing Results¶

Command	Description
`hawk web [EVAL_SET_ID]`	Open eval set in browser
`hawk view-sample UUID`	Open a specific sample in browser
`hawk list jobs`	List your launched jobs (eval-sets and scans); `--all` for all visible jobs
`hawk list eval-sets`	List all eval sets
`hawk list evals [ID]`	List evals in an eval set
`hawk list samples [ID]`	List samples in an eval set
`hawk transcript UUID`	Download a sample transcript
`hawk transcripts [ID]`	Download all transcripts for an eval set
`hawk download-artifacts [ID]`	Download sample artifact files for an eval set

hawk list samples options:

Option	Description
`--eval TEXT`	Filter to a specific eval file
`--limit INT`	Max samples to show (default: 50)

hawk transcript / hawk transcripts options:

Option	Description
`--output-dir DIR`	Write to files instead of stdout
`--raw`	Raw JSON instead of markdown
`--limit INT`	Limit number of transcripts

hawk download-artifacts options:

Option	Description
`--sample UUID`	Download artifacts for one sample only
`--output-dir DIR` / `-o DIR`	Output directory (default: `artifacts/<eval-set-id>`)

Artifacts are written as <output-dir>/<sample-uuid>/<artifact-path>. When --output-dir is omitted, the output directory is artifacts/<eval-set-id>. Existing files are overwritten.

When EVAL_SET_ID is omitted, Hawk uses the last eval set from the current session.

Management¶

Command	Description
`hawk stop [EVAL_SET_ID]`	Stop eval gracefully, scoring partial work (`--error`, `--sample UUID`)
`hawk delete [EVAL_SET_ID]`	Delete eval set's Kubernetes resources (logs are kept)
`hawk edit-samples FILE`	Submit sample edits (JSON or JSONL)
`hawk import PATH`	Import locally-produced `.eval` files into the warehouse (`--name NAME`)

Other¶

Command	Description
`hawk config`	Print the current CLI configuration
`hawk version`	Print the local CLI version and the deployed server version
`hawk models`	List models accessible via the LLM proxy
`hawk scan-export SCANNER_RESULT_UUID`	Export scan results as CSV

Human Registry¶

Manage external participants and their SSH public keys. This feature allows humans to perform an evaluation, for example to create human baselines.

Command	Description
`hawk human register --name NAME --ssh-key KEY`	Register a new human
`hawk human list`	List all registered humans
`hawk human update NAME --ssh-key KEY`	Update a human's SSH public key
`hawk human delete NAME`	Remove a human from the registry

Human Evaluations¶

Run evaluations where a registered human does the work inside the sandbox instead of an LLM agent.

Command	Description
`hawk human eval start CONFIG --human NAME`	Start a human evaluation. Accepts any eval-set YAML — the server swaps in the configured human-agent solver and clamps `epochs=1` / `limit=1`. Add `human_eval.agent_args` to the config to set args on the installed agent (e.g., `user`, `record_session`). Same secrets/image flags as `hawk eval-set`, plus `--no-rewrite` to pass the config through unchanged
`hawk human eval ssh-command [EVAL_SET_ID]`	Print a copy-paste-ready SSH command for the sandbox (defaults to the most recently started eval-set)

ssh-command polls the eval logs for the agent's SSH connection details and prints a ssh -J command that hops through the shared jumphost to reach the sandbox pod. Pass --timeout SECONDS (default 600) to bound how long it waits for the sandbox to come up.