Scaling Architecture

Ash scales horizontally in two dimensions: the data plane (runners that host sandboxes) and the control plane (coordinators that route requests). Each dimension scales independently.

Three Operating Modes

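Start with Mode 1 (standalone: everything on a single machine). Move to Mode 2 (one coordinator with multiple runners) when one machine isn't enough. Move to Mode 3 (multiple coordinators) when one coordinator isn't enough or you need redundancy.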

Session Routing

Every session is pinned to a runner. The coordinator selects the runner with the most available capacity at session creation time.

Once assigned, all subsequent messages for that session route to the same runner.
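The selection and pinning rules above can be sketched in a few lines. This is an illustrative model, not Ash's implementation: the `Runner` and `Router` classes, their field names, and the in-memory `assignments` map are all assumptions made for the example (in multi-coordinator mode the mapping would live in the shared database).

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Runner:
    # Hypothetical shape of a registered runner.
    runner_id: str
    max_sandboxes: int
    active_sandboxes: int = 0

    @property
    def available(self) -> int:
        return self.max_sandboxes - self.active_sandboxes


@dataclass
class Router:
    runners: Dict[str, Runner] = field(default_factory=dict)
    # session_id -> runner_id (the "pin").
    assignments: Dict[str, str] = field(default_factory=dict)

    def create_session(self, session_id: str) -> str:
        """Pin a new session to the runner with the most free capacity."""
        candidates = [r for r in self.runners.values() if r.available > 0]
        if not candidates:
            raise RuntimeError("no runner has free capacity")
        chosen = max(candidates, key=lambda r: r.available)
        chosen.active_sandboxes += 1
        self.assignments[session_id] = chosen.runner_id
        return chosen.runner_id

    def route(self, session_id: str) -> str:
        """Later messages always go to the pinned runner, whatever the load."""
        return self.assignments[session_id]
```

Note that `route` never re-evaluates capacity: the choice is made once, at creation time, which is what keeps a session's sandbox and its messages on the same machine.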

Runner Registration and Heartbeat

Runners self-register with the control plane and send periodic heartbeats with pool statistics.
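As a rough illustration of what a heartbeat might carry, the sketch below builds a JSON payload on the runner side and records it on the control-plane side. The payload fields (`runnerId`, `sentAt`, `pool`) and the `Registry` class are hypothetical; the actual wire format and endpoints are not described in this page.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass
class PoolStats:
    # Hypothetical pool statistics reported by a runner.
    active_sandboxes: int
    max_sandboxes: int


def build_heartbeat(runner_id: str, stats: PoolStats,
                    now: Optional[float] = None) -> str:
    """Serialize one heartbeat message (assumed JSON shape)."""
    payload = {
        "runnerId": runner_id,
        "sentAt": time.time() if now is None else now,
        "pool": asdict(stats),
    }
    return json.dumps(payload, sort_keys=True)


class Registry:
    """Control-plane view: last heartbeat time and pool stats per runner."""

    def __init__(self) -> None:
        self.last_seen: Dict[str, float] = {}
        self.pool: Dict[str, dict] = {}

    def record(self, raw: str) -> None:
        msg = json.loads(raw)
        self.last_seen[msg["runnerId"]] = msg["sentAt"]
        self.pool[msg["runnerId"]] = msg["pool"]
```

The recorded timestamp is what a liveness check would compare against, and the pool statistics are what capacity-based runner selection would read.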

Graceful Runner Shutdown

When a runner shuts down cleanly, it deregisters from the coordinator. Its sessions are paused immediately, without waiting for the next 30-second liveness sweep.

Dead Runner Detection

If a runner crashes without deregistering, it is caught by the coordinator's liveness sweep, which runs every 30 seconds (plus 0-5 seconds of random jitter to prevent a thundering herd across coordinators). The dead runner's sessions are then bulk-paused in a single query.
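A minimal sketch of such a sweep, using an in-memory SQLite database as a stand-in for Postgres/CockroachDB. The table and column names (`runners.last_heartbeat_at`, `sessions.status`) and the `timeout_s` staleness threshold are assumptions for illustration; only the 30-second interval and the 0-5 second jitter come from the text.

```python
import random
import sqlite3

SWEEP_INTERVAL_S = 30.0  # from the text
MAX_JITTER_S = 5.0       # from the text


def next_sweep_delay(rng: random.Random) -> float:
    """Seconds until the next sweep: fixed interval plus random jitter."""
    return SWEEP_INTERVAL_S + rng.uniform(0.0, MAX_JITTER_S)


def sweep_dead_runners(db: sqlite3.Connection, now: float,
                       timeout_s: float) -> int:
    """Pause all active sessions on runners with a stale heartbeat.

    One UPDATE covers every affected session. Running it again is a
    no-op, so several coordinators can sweep independently.
    """
    cutoff = now - timeout_s  # timeout_s is a hypothetical threshold
    with db:  # commit on success, roll back on error
        cur = db.execute(
            """
            UPDATE sessions
               SET status = 'paused'
             WHERE status = 'active'
               AND runner_id IN (
                   SELECT id FROM runners WHERE last_heartbeat_at < ?
               )
            """,
            (cutoff,),
        )
        return cur.rowcount
```

Because the `WHERE` clause only matches sessions that are still active, a second sweep by another coordinator finds nothing to change, which is why the sweep does not need coordination.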

Multi-Coordinator (Mode 3)

In multi-coordinator mode, all coordinators share the same database (Postgres or CockroachDB). The runner registry and session state live in the database — coordinators hold no authoritative state in memory.

Key properties:

  • Any coordinator can route to any runner (DB is source of truth)
  • Coordinators don't talk to each other
  • Each coordinator has a unique ID (hostname-PID) reported in GET /health and startup logs
  • Liveness sweep runs on all coordinators independently (idempotent, with random jitter to prevent thundering herd)
  • SSE reconnection handles coordinator failover (no session migration)
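The hostname-PID identifier in the list above is cheap to derive and stays unique as long as no two coordinator processes share both a hostname and a PID. A one-function sketch (the function name is hypothetical, and this is not necessarily how Ash builds the string):

```python
import os
import socket


def coordinator_id() -> str:
    """Return an ID of the form "<hostname>-<pid>" for this process."""
    return f"{socket.gethostname()}-{os.getpid()}"
```

Since the PID changes on restart, a restarted coordinator shows up under a new ID, which makes restarts visible when comparing GET /health output with startup logs.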

Coordinator Failover

Capacity Estimates

Component             | Per Instance            | Limit            | Bottleneck
Coordinator           | ~10,000 SSE connections | Network/CPU      | SSE proxy fan-out
Runner (8 vCPU, 16GB) | 30-120 sessions         | Memory           | Depends on sandbox memory limit
Database (CRDB)       | ~5,000 queries/sec      | Single-node CRDB | Session creation path only

Scaling math:

  • 3 coordinators = ~30,000 concurrent SSE streams
  • 10 runners (256MB/sandbox) = ~600 concurrent sessions
  • You'll run out of runner capacity before coordinator capacity
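The arithmetic behind these estimates can be reproduced with a small calculator. The 1 GB per-runner reserve for the OS and the runner process is an assumption, chosen because it reproduces the figures above (30-120 sessions per 16 GB runner, ~600 sessions for 10 runners at 256 MB per sandbox); the function names are hypothetical.

```python
def sessions_per_runner(runner_memory_mb: int, sandbox_memory_mb: int,
                        reserved_mb: int = 1024) -> int:
    """Sandboxes that fit in memory after an assumed fixed reserve."""
    return (runner_memory_mb - reserved_mb) // sandbox_memory_mb


def cluster_capacity(runners: int, coordinators: int,
                     sandbox_memory_mb: int,
                     runner_memory_mb: int = 16 * 1024,
                     sse_per_coordinator: int = 10_000) -> dict:
    """Rough totals for the data plane and the control plane."""
    return {
        "sessions": runners * sessions_per_runner(runner_memory_mb,
                                                  sandbox_memory_mb),
        "sse_streams": coordinators * sse_per_coordinator,
    }


print(cluster_capacity(runners=10, coordinators=3, sandbox_memory_mb=256))
# {'sessions': 600, 'sse_streams': 30000}
```

With these inputs the session total is 50 times smaller than the SSE stream total, which is the sense in which runner capacity runs out first.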

Database Tables for Scaling

Environment Variables

Coordinator

Variable            | Default    | Description
ASH_MODE            | standalone | Set to coordinator for multi-runner mode
ASH_DATABASE_URL    | (none)     | Postgres/CRDB connection string (required for multi-coordinator)
ASH_PORT            | 4100       | HTTP listen port
ASH_INTERNAL_SECRET | (none)     | Shared secret for runner auth. If set, all /api/internal/* endpoints require Authorization: Bearer <secret>. Required for multi-machine deployments.
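The ASH_INTERNAL_SECRET rule (no check when unset, a matching bearer token when set) can be sketched as a single predicate. The function name and signature are hypothetical and this is not Ash's actual middleware; the constant-time comparison is a general good practice for secrets, not something the page states.

```python
import hmac
from typing import Optional


def internal_request_allowed(authorization: Optional[str],
                             secret: Optional[str]) -> bool:
    """Decide whether a request to /api/internal/* may proceed.

    authorization: value of the Authorization header, if present.
    secret: value of ASH_INTERNAL_SECRET, or None/"" when unset.
    """
    if not secret:
        # No secret configured: internal endpoints are not protected.
        return True
    if authorization is None:
        return False
    expected = f"Bearer {secret}"
    # Constant-time comparison avoids leaking the secret via timing.
    return hmac.compare_digest(authorization.encode(), expected.encode())
```

The first branch is the reason the secret is required for multi-machine deployments: without it, anything that can reach the coordinator can call the internal endpoints.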

Runner

Variable                  | Default      | Description
ASH_RUNNER_ID             | runner-{pid} | Unique runner identifier
ASH_RUNNER_PORT           | 4200         | HTTP listen port
ASH_SERVER_URL            | (none)       | Coordinator URL for registration (use LB URL in multi-coordinator mode)
ASH_RUNNER_ADVERTISE_HOST | (none)       | Host reachable from coordinator
ASH_MAX_SANDBOXES         | 1000         | Maximum concurrent sandboxes
ASH_INTERNAL_SECRET       | (none)       | Must match the coordinator's ASH_INTERNAL_SECRET
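To show how the runner defaults resolve, here is a small loader that reads the variables listed above from a mapping such as `os.environ`. The `RunnerConfig` class and `load_runner_config` function are hypothetical helpers for illustration; only the variable names and defaults come from this page.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RunnerConfig:
    runner_id: str
    port: int
    server_url: Optional[str]
    advertise_host: Optional[str]
    max_sandboxes: int
    internal_secret: Optional[str]


def load_runner_config(env: Mapping[str, str]) -> RunnerConfig:
    """Apply the documented defaults to whatever the environment sets."""
    return RunnerConfig(
        runner_id=env.get("ASH_RUNNER_ID", f"runner-{os.getpid()}"),
        port=int(env.get("ASH_RUNNER_PORT", "4200")),
        server_url=env.get("ASH_SERVER_URL"),
        advertise_host=env.get("ASH_RUNNER_ADVERTISE_HOST"),
        max_sandboxes=int(env.get("ASH_MAX_SANDBOXES", "1000")),
        internal_secret=env.get("ASH_INTERNAL_SECRET"),
    )
```

Calling `load_runner_config(os.environ)` on a machine with none of the variables set yields port 4200, a limit of 1000 sandboxes, and an ID derived from the process ID. Because the default ID depends on the PID, setting ASH_RUNNER_ID explicitly gives a runner a stable identity across restarts.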

When to Scale

Symptom                                  | Action
CPU/memory maxed on single machine       | Add runners (Mode 2)
Need high availability for control plane | Add coordinators (Mode 3)
SSE connections saturating coordinator   | Add coordinators (Mode 3)
Session creation latency increasing      | Add runners or increase ASH_MAX_SANDBOXES
All runners at capacity                  | Add more runner nodes

Measure First

Don't scale until you have numbers. A single standalone Ash server handles dozens of concurrent sessions. Use ASH_DEBUG_TIMING=1 and the /metrics endpoint to find the actual bottleneck before adding complexity.