Debugging Stuck Evaluations¶
This guide documents debugging techniques for evaluations that aren't progressing.
Quick Diagnosis Checklist¶
- Verify authentication:
hawk auth access-token > /dev/null || echo "Run 'hawk login' first" - Check job status:
hawk status <eval-set-id>— JSON report with pod status, logs, metrics - View logs:
hawk logs <eval-set-id>orhawk logs -ffor follow mode - List samples:
hawk list samples <eval-set-id>— see which samples completed/failed - Get transcript:
hawk transcript <sample-uuid>— view conversation for a specific sample - Check buffer in S3: Download
.buffer/segments to analyze eval state - Test API directly: Reproduce the error outside Inspect to isolate the issue
System Architecture¶
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ hawk CLI │────>│ API Server │────>│ Helm │
│ (eval-set) │ │ (FastAPI) │ │ (releases) │
└──────────────┘ └──────────────┘ └──────────────┘
│
v
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ S3 Logs │<────│ Runner Pod │────>│ Sandbox Pods │
│ (.eval, │ │ (runner ns) │ │ (eval-set ns)│
│ .buffer/) │ └──────────────┘ └──────────────┘
└──────────────┘ │
v
┌──────────────┐ ┌──────────────┐
│ Middleman │────>│ Provider APIs│
│ (auth proxy)│ │ │
└──────────────┘ └──────────────┘
Common Error Patterns¶
API 500 Errors with Retries¶
What it looks like:
Error code: 500 - Internal server error
Retry attempt 45 of 50 failed. Waiting 1800 seconds before next retry.
Diagnosis:
- Download the
.buffer/from S3 to find the failing request - Test the request through Middleman:
curl https://middleman.internal.metr.org/... - Test directly against the provider API to isolate whether it's a Middleman issue
Note
500 errors are NOT token limit issues. Token limits return 400 errors.
Token/Context Limit Errors (400)¶
Error code: 400 - invalid_request_error
This IS a token limit issue. Check message count, configured token_limit, and model context window.
Retry Logs¶
Retry log messages include a context prefix identifying the sample:
[nWJu3Mz mmlu/42/1 openai/gpt-4o] -> openai/gpt-4o retry 3 (retrying in 24 seconds) [RateLimitError 429 rate_limit_exceeded]
The prefix format is [{sample_uuid} {task}/{sample_id}/{epoch} {model}].
Tip
The OpenAI SDK does not show the actual error in its retry messages. You must test with curl directly to see the real error.
Pod UID Mismatch¶
Pod UID mismatch: expected abc123, got def456
The sandbox pod was killed and restarted. There's nothing to do — Inspect will automatically retry the sample.
OOMKilled Pods¶
Check pod status via hawk status or:
bash
kubectl describe pod -n <runner-namespace> <pod-name> | grep -A5 "Last State"
Look for OOMKilled in the termination reason.
Accessing Eval State¶
From S3¶
```bash
List eval set contents¶
aws s3 ls s3://
Download buffer for analysis¶
aws s3 sync s3://
Download completed .eval files¶
aws s3 sync s3://
Reading .eval Files¶
.eval files are zip archives. Read them using the inspect_ai library:
```python from inspect_ai.log import read_eval_log
log = read_eval_log("path/to/file.eval") for sample in log.samples: print(sample.id, sample.score) ```
Sample Buffer Analysis¶
The .buffer/ directory contains SQLite databases with eval state.
Check overall progress:
```sql SELECT COUNT(*) as total_events FROM events;
SELECT json_extract(data, '$.state') as state, COUNT(*) FROM samples GROUP BY state; ```
Detect if eval is progressing (FAIL-OK pattern):
sql
SELECT
id,
CASE WHEN json_extract(data, '$.error') IS NULL THEN 'OK' ELSE 'FAIL' END as status
FROM events
WHERE json_extract(data, '$.event') = 'model_output'
ORDER BY id DESC LIMIT 50;
FAIL, OK, FAIL, OK...= Progressing (transient errors, retries succeed)FAIL, FAIL, FAIL...= Stuck (something changed, now all fail)
Find pending events (stuck API calls):
sql
SELECT COUNT(*) FROM events WHERE json_extract(data, '$.pending') = 1;
Direct API Testing¶
Test through Middleman to isolate whether the issue is Inspect AI, Middleman, or the provider.
```bash TOKEN=$(hawk auth access-token)
Test Anthropic¶
curl --max-time 300 -X POST https://middleman.internal.metr.org/anthropic/v1/messages \ -H "Authorization: Bearer $TOKEN" \ -H "Content-Type: application/json" \ -d '{ "model": "claude-sonnet-4-20250514", "max_tokens": 100, "messages": [{"role": "user", "content": "Say hello"}] }'
Test OpenAI-compatible APIs¶
curl --max-time 300 -X POST https://middleman.internal.metr.org/openai/v1/chat/completions \ -H "Authorization: Bearer $TOKEN" \ -H "Content-Type: application/json" \ -d '{ "model": "gpt-4o", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 100 }' ```
Interpretation:
| Result | Meaning |
|---|---|
| All tests pass | Issue is content-specific or Inspect AI bug |
| Middleman fails, direct to provider works | Middleman issue |
| Both fail | Provider API issue |
| Large request fails with 400 | Token limit |
| Large request fails with 500 | API bug with large context |
Early Warning Signs¶
HTTP Retry Count Growth¶
Task progress logs include an "HTTP retries" counter:
{"message": "Task: search_server, Steps: 12/12 100%, ... HTTP retries: 5,710", ...}
Tasks can complete successfully despite thousands of retries. A growing retry count indicates API instability, but isn't fatal until retries stop succeeding entirely.
Severity calculation: Retry count x wait time = stuck duration. E.g., 45 retries x 1800s = 22+ hours stuck.
Command Cheatsheet¶
Hawk CLI¶
bash
hawk status <eval-set-id> # JSON monitoring report
hawk logs <eval-set-id> # View logs
hawk logs -f # Follow logs live
hawk list samples <eval-set-id> # List samples with status
hawk transcript <sample-uuid> # View sample conversation
hawk delete <eval-set-id> # Delete eval and cleanup
hawk eval-set <config.yaml> # Start/restart eval
S3 Access¶
bash
aws s3 ls s3://<bucket>/evals/<eval-set-id>/
aws s3 sync s3://<bucket>/evals/<eval-set-id>/.buffer/ /tmp/buffer/
Kubectl (Advanced)¶
bash
kubectl get pods -n <runner-ns> | grep <eval-set-id> # Find runner pod
kubectl logs -n <runner-ns> <pod-name> --tail=200 # Pod logs
kubectl get pods -n <eval-set-id> # Sandbox pods
kubectl describe pod -n <runner-ns> <pod-name> # Full pod details
Escalation Checklist¶
If you can't resolve the issue:
- Gather evidence:
hawk statusoutput, error patterns, sample buffer state, failing request payloads - Determine scope: One eval or multiple? One task or all? One model or all?
- Check external status: Provider status pages, AWS status, internal monitoring dashboards
- Document timeline: When did the eval start? When did it get stuck? Last successful operation?