Deploying Hawk on AWS

Just want to run evals?

If you already have access to a Hawk deployment, you just need the CLI. See Installing the CLI for setup and usage.

This guide takes you from zero to a working Hawk deployment on AWS. You'll need an AWS account and a domain name. Authentication uses a Cognito user pool by default; you can use your existing OIDC identity provider instead.

1. Install prerequisites

brew install pulumi awscli uv python@3.13 jq
nvm install 22
npm install -g pnpm

You also need Docker running — the deploy builds ~12 container images.

On Linux, install the same tools instead: Pulumi, uv, the AWS CLI, Python 3.13+, Node.js 22 (see .nvmrc), pnpm, Docker, and jq.
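Before continuing, it can help to confirm that the prerequisites are actually on your PATH. The following is a minimal sketch, not a script from the Hawk repo: `missing_tools` is a hypothetical helper, and the binary names (`aws`, `python3`, `node`) are assumed from the package list above. It checks presence only, not versions, and does not check that the Docker daemon is running.

```shell
# Print each command name from the arguments that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Binary names are assumptions based on the install list above.
missing=$(missing_tools pulumi aws uv python3 node pnpm docker jq)
if [ -n "$missing" ]; then
  # Unquoted on purpose so the names print on one line.
  echo "Missing prerequisites:" $missing
else
  echo "All prerequisites found on PATH"
fi
```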

2. Clone the repo

git clone https://github.com/METR/hawk.git
cd hawk

Before you deploy: sizing and quotas

Hawk's default footprint needs roughly 9 Fargate vCPUs at peak during a deploy: Middleman (2) and API (2), each transiently doubled by ECS rolling deployments, plus Viewer (0.25) and db-migrate (0.5). Steady state is about 4.75. Running one eval needs about 10 EC2 vCPUs: the managed nodegroup (4), a system node (about 2), and a 4-vCPU runner node. Brand-new AWS accounts typically get only 4 Fargate and 8 EC2 On-Demand vCPUs, so a default deploy fails midway with VcpuLimitExceeded. Pick one of these paths before deploying:

Path A: request quota increases (recommended for real use)

Raise the Fargate On-Demand, Fargate Spot, and EC2 vCPU quotas to fit Hawk's default footprint. Request them on day one — approval can take days, and brand-new accounts are often denied or already at the open-request cap (start with Path B while you wait). The three aws service-quotas commands and the denial/escalation guidance are in the Configuration Reference: Quota increases.
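To see what the quota requests need to cover, the default footprint can be reproduced with a few lines of arithmetic. This sketch only restates the per-service figures quoted in this section; it does not query AWS, and the variable names are illustrative.

```shell
# Fargate at peak: Middleman (2) and API (2) are each doubled during an ECS
# rolling deployment; Viewer (0.25) and db-migrate (0.5) are not.
fargate_peak=$(awk 'BEGIN { print 2 * (2 + 2) + 0.25 + 0.5 }')
# Fargate steady state: the same workloads without the doubling.
fargate_steady=$(awk 'BEGIN { print 2 + 2 + 0.25 + 0.5 }')
# EC2 for one eval: managed nodegroup (4) + system node (~2) + runner node (4).
ec2_one_eval=$(awk 'BEGIN { print 4 + 2 + 4 }')

echo "Fargate peak:   $fargate_peak vCPU (typical fresh-account quota: 4)"
echo "Fargate steady: $fargate_steady vCPU"
echo "EC2, one eval:  $ec2_one_eval vCPU (typical fresh-account quota: 8)"
```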

Path B: starter tier that fits default quotas

Shrink the deployment to fit a fresh account's defaults by adding this block to your stack config (Pulumi.<stack>.yaml):

# Fargate On-Demand (quota 4): shrink the two 2-vCPU services
hawk:middlemanTaskCpu: "256"
hawk:middlemanTaskMemory: "1024"
hawk:apiTaskCpu: "1024"
hawk:apiTaskMemory: "2048"
# Fargate Spot (quota 4): the default 8-vCPU importer job never schedules,
# leaving the warehouse (`hawk list evals/samples`) empty forever
hawk:evalLogImporterVcpu: "2"
hawk:evalLogImporterMemory: "16384"
hawk:sampleEditorVcpu: "1"
hawk:sampleEditorMemory: "4096"
# EC2 (quota 8): 1 controller node (2 vCPU) + a 2-vCPU node per eval,
# instead of 4 vCPU of controller nodes + a 4-vCPU node per eval
hawk:karpenterNodeGroupDesiredSize: "1"
hawk:runnerCpu: "1"
hawk:runnerMemory: "4Gi"
# CPU-only: skip the GPU operator's always-on system pods and the g4dn pool
hawk:enableGpuOperator: "false"
# Make quota exhaustion fail visibly instead of retrying forever
hawk:karpenterNodePoolCpuLimit: "8"

Per-eval-set YAML can still override the runner (runner.cpu / runner.memory) when a specific eval needs more.

The trade-offs: lower Middleman/API throughput, slower imports of very large eval logs, single controller node (no HA) — fine for trying Hawk out, not for heavy parallel eval traffic. Note that a later pulumi up that replaces the API or Middleman task transiently doubles that service (ECS rolling deploy), which can wedge under a 4-vCPU Fargate quota until the old task drains. To move to Path A later: raise the quotas, remove the overrides, and run pulumi up.
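As a rough check that the Fargate On-Demand overrides fit under a 4-vCPU quota, the arithmetic can be sketched as below. It assumes Viewer (0.25) and db-migrate (0.5) keep the sizes quoted earlier and counts only the four Fargate workloads this section names; `cpu_units_to_vcpu` is an illustrative helper, not part of Hawk.

```shell
# ECS task CPU is expressed in units where 1024 = 1 vCPU.
cpu_units_to_vcpu() { awk -v u="$1" 'BEGIN { print u / 1024 }'; }

middleman=$(cpu_units_to_vcpu 256)   # hawk:middlemanTaskCpu
api=$(cpu_units_to_vcpu 1024)        # hawk:apiTaskCpu

# Worst case: both services doubled by a rolling deploy at the same time,
# plus Viewer (0.25) and db-migrate (0.5), assumed unchanged from the defaults.
peak=$(awk -v m="$middleman" -v a="$api" 'BEGIN { print 2 * (m + a) + 0.25 + 0.5 }')
echo "Path B Fargate On-Demand peak: $peak vCPU (quota: 4)"
```

By this count the margin is under 1 vCPU, so anything else running on Fargate On-Demand at the same time can still exhaust the quota.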

3. Set up Pulumi state backend

aws configure  # or: aws sso login --profile your-profile

Default region

Set the default region to the region you'll deploy to (aws:region in step 4, e.g. us-west-2) — the commands below create resources in that region. If you use an SSO profile, the profile's region applies; override with export AWS_DEFAULT_REGION=us-west-2 if it differs.

Create an S3 bucket and KMS key for Pulumi state:

# Region to deploy to; must match aws:region in step 4
REGION=us-west-2
# Suffixing your account ID makes the bucket name globally unique
BUCKET="hawk-pulumi-state-$(aws sts get-caller-identity --query Account --output text)"
aws s3 mb "s3://$BUCKET" --region "$REGION"
# Create the KMS key first, then point the alias at it
KEY_ID="$(aws kms create-key --region "$REGION" --query KeyMetadata.KeyId --output text)"
aws kms create-alias --alias-name alias/pulumi-secrets --region "$REGION" --target-key-id "$KEY_ID"

Log in to the S3 backend, then verify it took effect:

pulumi login "s3://$BUCKET?region=$REGION&awssdk=v2"
pulumi whoami -v   # Backend URL must show YOUR bucket before any stack operation

Credential troubleshooting

If pulumi login fails with NoCredentialProviders or AccessDenied: No AWSAccessKey was presented, your AWS credentials aren't visible to Pulumi. Make sure you ran aws configure (not just aws login, which doesn't persist credentials for other tools). If using SSO profiles or short-lived credentials, ensure AWS_PROFILE is set, or export credentials explicitly:

eval "$(aws configure export-credentials --format env)"

A failed login does not log you out — Pulumi stays pointed at whatever backend was active before, possibly a different organization's state bucket on a shared machine. Never run stack operations until pulumi whoami -v shows your bucket.
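The "check the backend before any stack operation" rule can be turned into a small guard. This is a sketch under assumptions: `backend_is_bucket` is a hypothetical helper, and it only looks for the text `s3://<bucket>` anywhere in the `pulumi whoami -v` output rather than parsing a specific field. Because it is a plain substring match, a bucket whose name merely starts with yours would also pass.

```shell
# Succeeds only if the given `pulumi whoami -v` output mentions s3://<bucket>.
backend_is_bucket() {
  whoami_output=$1
  bucket=$2
  printf '%s\n' "$whoami_output" | grep -qF "s3://$bucket"
}

# Usage (assumes $BUCKET is still set from the commands above):
#   backend_is_bucket "$(pulumi whoami -v)" "$BUCKET" \
#     || { echo "Pulumi is not pointed at s3://$BUCKET" >&2; exit 1; }
```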

4. Create and configure your stack

cd infra
# Copy the example config FIRST — `pulumi stack init` merges its KMS metadata
# (secretsprovider/encryptedkey) into the existing file; copying afterwards
# would overwrite that metadata and break secret encryption.
cp ../Pulumi.example.yaml ../Pulumi.my-org.yaml
pulumi stack init my-org \
  --secrets-provider="awskms://alias/pulumi-secrets?region=<same as aws:region>&awssdk=v2"
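After `pulumi stack init`, one way to confirm the KMS metadata was merged into the stack file is to look for the two keys named in the comment above. A minimal sketch; `has_kms_metadata` is a hypothetical helper, and it assumes both keys appear at the top level of the file, at the start of a line.

```shell
# Succeeds if the stack file has both top-level keys that a KMS-backed
# stack is expected to carry.
has_kms_metadata() {
  grep -q '^secretsprovider:' "$1" && grep -q '^encryptedkey:' "$1"
}

# Usage, from infra/ after `pulumi stack init`:
#   has_kms_metadata ../Pulumi.my-org.yaml || echo "KMS metadata missing" >&2
```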

Edit Pulumi.my-org.yaml with your values. At minimum, you need:

Replace domain placeholders

Names like example.com are IANA documentation domains, not domains you can register. Use DNS names you control; the values in the example config are illustrative only.

config:
  aws:region: us-west-2
  hawk:domain: example.com            # root domain you control — Hawk prepends `hawk.` automatically (services resolve at api.hawk.example.com, viewer.hawk.example.com)
  hawk:publicDomain: example.com       # parent domain for DNS zones and TLS certs
  hawk:primarySubnetCidr: "10.0.0.0/16"
  hawk:org: yourco                     # short org slug, used in globally-unique S3 bucket and Cognito
                                       # domain names — do not leave it as the default "myorg"

That's enough to get started. The environment name defaults to your stack name. Hawk will create a Cognito user pool for authentication automatically.

hawk:domain must equal or be a subdomain of hawk:publicDomain — service certificates validate in the publicDomain Route 53 zone, and preflight rejects mismatched pairs. See Configuration Reference: Domain & DNS for the full rule, the four DNS-strategy options, and the single-subdomain (Cloudflare) pairing.
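The "equal or subdomain" rule can be checked locally before running preflight. This is an illustrative sketch of the rule as stated here, not the preflight script's actual implementation; `is_same_or_subdomain` is a hypothetical name.

```shell
# Succeeds if $1 equals $2 or ends with ".$2" (compared case-insensitively).
is_same_or_subdomain() {
  domain=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  parent=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  case "$domain" in
    "$parent" | *."$parent") return 0 ;;
    *) return 1 ;;
  esac
}

# Example pairs: the second has a similar name but is not a subdomain.
for pair in "hawk.example.com example.com" "notexample.com example.com"; do
  set -- $pair
  if is_same_or_subdomain "$1" "$2"; then
    echo "ok:       hawk:domain=$1 hawk:publicDomain=$2"
  else
    echo "mismatch: hawk:domain=$1 hawk:publicDomain=$2"
  fi
done
```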

Before deploying with TLS enabled, make sure a Route 53 public hosted zone exists for hawk:publicDomain and that your registrar or parent DNS delegates to that zone's NS records. If the subdomain was delegated before (e.g. a prior deployment), replace any existing NS records at the parent rather than adding alongside them — a mixed nameserver set causes intermittent SERVFAIL. Leave hawk:createPublicZone unset or "false" for strict preflight; creating the public zone during pulumi up cannot be validated ahead of time.
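A mixed nameserver set is easy to miss by eye, so comparing the zone's NS records with what DNS currently returns can help. The sketch below is an assumption-laden example, not part of Hawk: `normalize_ns` and `ns_sets_match` are hypothetical helpers, and the commented usage relies on `dig` being installed and on your resolver's (possibly cached) view of the delegation.

```shell
# Normalize a newline-separated nameserver list: lowercase, strip the
# trailing dot, drop blank lines, sort, de-duplicate.
normalize_ns() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed -e 's/\.$//' -e '/^$/d' | sort -u
}

# Succeeds if both lists contain exactly the same nameservers.
ns_sets_match() {
  [ "$(normalize_ns "$1")" = "$(normalize_ns "$2")" ]
}

# Usage (<zone-id> is your Route 53 hosted zone ID; example.com is a placeholder):
#   zone_ns=$(aws route53 get-hosted-zone --id <zone-id> \
#     --query 'DelegationSet.NameServers[]' --output text | tr '\t' '\n')
#   live_ns=$(dig +short NS example.com)
#   ns_sets_match "$zone_ns" "$live_ns" || echo "delegation does not match the zone" >&2
```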

If you already have an OIDC provider (Okta, Auth0, etc.), you can use it instead:

# Optional: use your own OIDC provider instead of Cognito
hawk:oidcClientId: "your-client-id"
hawk:oidcAudience: "your-audience"
hawk:oidcIssuer: "https://login.example.com/oauth2/default"

Known issue: EKS may reject some availability zones

If pulumi up fails with UnsupportedAvailabilityZoneException, exclude the rejected zone via hawk:excludeZoneIds — see Configuration Reference: Infrastructure Options.

5. Deploy

Before your first deploy, make sure Docker Hub authentication is set up — the build pulls base images from Docker Hub, which rate-limits anonymous pulls:

docker login          # Docker Hub — required; anonymous pulls are rate-limited (https://hub.docker.com/)
docker login dhi.io   # Docker Hardened Images — Hawk's Python base lives here (free Community tier; same Docker Hub credentials work)

amd64/x86_64 build hosts: set armImagesEnabled: "false"

hawk:armImagesEnabled defaults to "true", which builds arm64 images. On an amd64/x86_64 host those become QEMU-emulated cross-builds that are very slow and can hang pulumi up indefinitely (there is no per-image build timeout). Set hawk:armImagesEnabled: "false" in your stack config before deploying from an amd64 machine.
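To find out whether this applies to your build host, check the machine architecture. A minimal sketch; `is_amd64_host` is a hypothetical helper, and it only prints a reminder rather than editing your stack config.

```shell
# Succeeds if the given machine string names an amd64/x86_64 host.
is_amd64_host() {
  case "$1" in
    x86_64 | amd64) return 0 ;;
    *) return 1 ;;
  esac
}

arch=$(uname -m)
if is_amd64_host "$arch"; then
  echo 'amd64 host: set hawk:armImagesEnabled: "false" in your stack config'
else
  echo "$arch host: this check does not call for changing hawk:armImagesEnabled"
fi
```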

Run the preflight checks before creating AWS resources, then deploy:

../scripts/dev/preflight.sh
pulumi up

Secrets encryption (AWS KMS)

With pulumi stack init ... --secrets-provider="awskms://alias/pulumi-secrets?region=<same as aws:region>&awssdk=v2" (step 4), secret stack configuration is encrypted using KMS, not a passphrase. Do not set PULUMI_CONFIG_PASSPHRASE or rely on passphrase-based encryption for Hawk stacks. If Pulumi prompts for a passphrase, the stack is probably not using the KMS secrets provider — align the stack with step 4 (see Pulumi: changing secrets providers) instead of configuring a passphrase.

If pulumi up reports passphrase must be set on a KMS-backed stack, the secrets-provider URL is missing its ?region= — a bare awskms://alias/... resolves the alias against the machine's ambient AWS region, not the stack's aws:region. Recreate/migrate the stack with awskms://alias/pulumi-secrets?region=<aws:region>&awssdk=v2.
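A missing region parameter can be spotted by inspecting the secrets-provider URL directly. The sketch below is illustrative: `kms_url_has_region` is a hypothetical helper, and the commented usage assumes the URL is stored on a top-level `secretsprovider:` line of the stack file.

```shell
# Succeeds if an awskms:// URL carries a non-empty region query parameter.
kms_url_has_region() {
  case "$1" in
    awskms://*\?region=?* | awskms://*\?*\&region=?*) return 0 ;;
    *) return 1 ;;
  esac
}

for url in \
  'awskms://alias/pulumi-secrets?region=us-west-2&awssdk=v2' \
  'awskms://alias/pulumi-secrets'; do
  if kms_url_has_region "$url"; then
    echo "ok:             $url"
  else
    echo "missing region: $url"
  fi
done

# Usage against a stack file, from infra/:
#   kms_url_has_region "$(sed -n 's/^secretsprovider: *//p' ../Pulumi.my-org.yaml)"
```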

pulumi up creates roughly 400 AWS resources, including a VPC, an EKS cluster, an ALB, ECS services, Aurora PostgreSQL, S3 buckets, and Lambda functions. The first deploy takes about 15 to 20 minutes.

Custom domain / DNS setup

If you want TLS certificates and public DNS for your deployment, create or import a Route 53 hosted zone for your publicDomain before running preflight, then delegate DNS to that zone. See Configuration Reference: DNS / Cloudflare for DNS strategy details.

6. Set up LLM API keys

Hawk routes model API calls through its built-in LLM proxy (Middleman), which needs at least one provider's API key.

Middleman starts healthy with no provider keys and refreshes them from Secrets Manager every few minutes, so the first deploy completes before any keys are set; models for a provider appear shortly after you add its key. To set a key (script paths from here on are relative to the repository root):

scripts/dev/set-api-keys.sh <env> OPENAI_API_KEY=sk-...

This stores the key in Secrets Manager and restarts Middleman. You can set multiple keys at once:

scripts/dev/set-api-keys.sh <env> OPENAI_API_KEY=sk-... ANTHROPIC_API_KEY=sk-ant-...

Replace <env> with your hawk:env value (e.g., production), which defaults to your stack name. Supported providers: OpenAI, Anthropic, Gemini, DeepInfra, DeepSeek, Fireworks, Mistral, OpenRouter, Together, xAI.

7. Create a user (Cognito only)

If you're using the default Cognito authentication, create a user:

scripts/dev/create-cognito-user.sh <stack> you@example.com

The script reads the Cognito user pool from your Pulumi stack outputs, creates the user, and prints the login credentials. Skip this step if you're using your own OIDC provider.

Users can only call models in a model group they belong to. Models that Middleman autodiscovers from your provider keys land in the model-access-public group, and users whose tokens carry no group claims get model-access-public by default (hawk:defaultPermissions), so a plain Cognito user can call autodiscovered models out of the box. Create and assign groups when you change hawk:defaultPermissions or need finer-grained model access:

scripts/dev/manage-cognito-groups.sh <stack> create model-access-public
scripts/dev/manage-cognito-groups.sh <stack> add-user model-access-public you@example.com

See Security: Access Control for details.

8. Install the Hawk CLI and run your first eval

Install the CLI using the command in the main README, then configure it for your deployment:

uv run python scripts/dev/generate-env.py <stack> > hawk/.env

hawk login
hawk eval-set hawk/examples/simple.eval-set.yaml
hawk logs -f   # watch it run
hawk web       # open results in browser

If your first eval looks stuck (small accounts)

Runner pod Pending: the default runner requests ~2.3 vCPU and needs a 4-vCPU node. On the default 8-vCPU EC2 quota, set hawk:runnerCpu: "1" deploy-wide (Path B does this) or runner.cpu: "1" per eval-set YAML (the shipped example sets it) — see sizing and quotas.
hawk list evals empty: the warehouse imports an eval only after it reaches a terminal state (success/error/cancelled), and the import then takes roughly 15-70 seconds (the first import also pulls a container image). The viewer (hawk web) reads S3 directly and renders earlier. If the list stays empty for minutes, check the eval-log-importer Batch job logs and the -events-dlq/-batch-dlq SQS queues. On a fresh account, also check the Fargate Spot vCPU quota: the importer needs 8 Spot vCPUs by default and sits unschedulable below that (Path A raises the quota; Path B sets hawk:evalLogImporterVcpu: "2" / hawk:evalLogImporterMemory: "16384").

Tear down

To delete a deployment, use the teardown script:

scripts/dev/teardown.sh <stack>

It disables the deletion guards (hawk:protectResources=false + pulumi up), drains Karpenter nodes with a bounded wait, runs pulumi destroy, removes the stack, and prints the remaining manual cleanup for bootstrap resources (state bucket, KMS key, Route 53 zone, DNS delegation). For the manual sequence and troubleshooting (stuck NodeClaims, ALB deletion protection, non-empty buckets), see Managing Your Deployment: Tearing Down.

Warning

Don't pipe long-running destroys through tee without set -o pipefail: by default the pipeline reports tee's exit status, so Pulumi's non-zero exit code is masked as success.
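The masking is standard shell pipeline behavior and can be reproduced without Pulumi. In this sketch `false` stands in for a failing destroy, and `status_of` is a hypothetical helper that captures a command's exit status; the pipelines run in explicit `bash -c` subshells so the `pipefail` option does not leak into your current shell.

```shell
# Print the exit status of the given command without aborting the script.
status_of() { if "$@"; then echo 0; else echo $?; fi; }

# Plain pipeline: the status is tee's, so the failure on the left is hidden.
plain_status=$(status_of bash -c 'false | tee /dev/null')

# With pipefail, a failing command anywhere in the pipeline fails the pipeline.
pipefail_status=$(status_of bash -c 'set -o pipefail; false | tee /dev/null')

echo "plain=$plain_status pipefail=$pipefail_status"
```

If you do want a log file, either set `pipefail` in the shell that runs the pipeline or redirect output to a file and follow it with `tail -f` from another terminal.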