Multi-Machine Setup
Most deployments do not need multi-machine mode. A single machine running in standalone mode can handle dozens of concurrent sessions. Read this page only when you need more capacity than one machine provides.
When to Use
Use multi-machine mode when:
- You need more concurrent sessions than a single server can handle.
- You want to isolate sandbox execution from the control plane for reliability.
- You need to scale sandbox capacity independently of the API server.
Architecture
- Coordinator (the Ash server in coordinator mode): handles all client-facing HTTP traffic, manages the database, routes session creation to runners, and proxies messages to the correct runner.
- Runners: each runner manages a local SandboxPool, creates sandbox processes, and communicates with the bridge inside each sandbox over Unix sockets. Runners do not serve external traffic directly.
Standalone Mode (Default)
In standalone mode (ASH_MODE=standalone), the server creates a local SandboxPool and handles everything in one process. This is the default and the right choice for single-machine deployments.
Client --> Ash Server (standalone) --> SandboxPool --> Bridge --> Claude SDK
No runners are needed. The server is both control plane and execution plane.
Coordinator Mode
In coordinator mode (ASH_MODE=coordinator), the server does not create a local sandbox pool. Instead, it waits for runners to register and provides capacity through them.
Starting the Coordinator
export ASH_MODE=coordinator
export ASH_DATABASE_URL="postgresql://ash:password@db-host:5432/ash"
export ASH_API_KEY=my-api-key
export ASH_INTERNAL_SECRET=my-runner-secret # Required: authenticates runner registration
export ANTHROPIC_API_KEY=sk-ant-...
node packages/server/dist/index.js
# Or via Docker:
# ash start -e ASH_MODE=coordinator -e ASH_DATABASE_URL=... -e ASH_INTERNAL_SECRET=...
The coordinator logs:
Ash server listening on 0.0.0.0:4100 (mode: coordinator)
At this point, the server accepts API requests but cannot create sessions until at least one runner registers.
Starting a Runner
On each runner machine:
export ASH_RUNNER_ID=runner-1
export ASH_SERVER_URL=http://coordinator-host:4100
export ASH_RUNNER_PORT=4200
export ASH_RUNNER_ADVERTISE_HOST=10.0.1.5 # IP the coordinator can reach
export ASH_MAX_SANDBOXES=50
export ASH_INTERNAL_SECRET=my-runner-secret # Must match coordinator
export ANTHROPIC_API_KEY=sk-ant-...
node packages/runner/dist/index.js
The runner:
- Creates a SandboxPool with a lightweight in-memory database for sandbox tracking.
- Starts a Fastify HTTP server on port 4200.
- Sends a registration request to ASH_SERVER_URL/api/internal/runners/register (with exponential backoff retry on failure).
- Begins sending heartbeats every 10 seconds to ASH_SERVER_URL/api/internal/runners/heartbeat, including pool stats (running, warming, waiting counts).
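The pool stats carried in each heartbeat can be pictured as a small payload builder. A sketch only: the field names below are assumptions inferred from the counts listed above, not the actual wire format.

```typescript
// Sketch of a heartbeat payload; field names are assumptions based
// on the pool stats described above (running, warming, waiting).
interface PoolStats {
  running: number;
  warming: number;
  waiting: number;
}

interface Heartbeat {
  runnerId: string;
  maxSandboxes: number;
  stats: PoolStats;
}

function buildHeartbeat(
  runnerId: string,
  maxSandboxes: number,
  stats: PoolStats
): Heartbeat {
  return { runnerId, maxSandboxes, stats };
}

// A runner would POST something like this every 10 seconds to
// ASH_SERVER_URL/api/internal/runners/heartbeat with
// Authorization: Bearer $ASH_INTERNAL_SECRET.
const hb = buildHeartbeat("runner-1", 50, { running: 12, warming: 3, waiting: 0 });
```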
The coordinator logs:
[coordinator] Runner runner-1 registered at 10.0.1.5:4200 (max 50)
Session Routing
When a client creates a session, the coordinator selects a runner using least-loaded routing: it picks the runner with the most available capacity (max sandboxes minus running and warming sandboxes).
If no remote runners are healthy, the server falls back to its local backend, which exists only in standalone mode. In pure coordinator mode with no healthy runners, session creation fails with an error.
Once a session is assigned to a runner, all subsequent messages for that session are routed to the same runner. The runner_id is stored in the session record in the database.
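The least-loaded selection above can be sketched as a pure function. The types and names are illustrative, not the server's actual implementation:

```typescript
// Sketch of least-loaded routing: available capacity is max
// sandboxes minus running and warming, and the runner with the
// most headroom wins.
interface RunnerInfo {
  id: string;
  maxSandboxes: number;
  running: number;
  warming: number;
}

function availableCapacity(r: RunnerInfo): number {
  return r.maxSandboxes - r.running - r.warming;
}

function pickRunner(runners: RunnerInfo[]): RunnerInfo | undefined {
  // Filter out full runners; undefined means no capacity anywhere,
  // in which case session creation fails.
  const open = runners.filter((r) => availableCapacity(r) > 0);
  if (open.length === 0) return undefined;
  return open.reduce((best, r) =>
    availableCapacity(r) > availableCapacity(best) ? r : best
  );
}
```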
Failure Handling
Graceful Shutdown
When a runner shuts down cleanly (receives SIGTERM), it calls POST /api/internal/runners/deregister. The coordinator immediately bulk-pauses all active sessions on that runner in a single query and removes it from the registry. No 30-second wait.
Runner Crashes
If a runner crashes without deregistering, the coordinator detects it via missed heartbeats. After 30 seconds without a heartbeat (RUNNER_LIVENESS_TIMEOUT_MS), the coordinator:
- Bulk-pauses all active/starting sessions on that runner (single UPDATE query).
- Removes the runner from the routing table.
- Purges stale entries from its local backend cache.
Each coordinator adds random 0-5s jitter to its sweep interval to prevent thundering herd when multiple coordinators sweep independently.
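The timeout and jitter arithmetic can be sketched as follows. RUNNER_LIVENESS_TIMEOUT_MS comes from the text; the other constant and function names are assumptions:

```typescript
// Sketch of the liveness check: a runner is dead after 30s without
// a heartbeat, and each sweep is offset by 0-5s of random jitter.
const RUNNER_LIVENESS_TIMEOUT_MS = 30_000;
const SWEEP_INTERVAL_MS = 30_000;
const MAX_JITTER_MS = 5_000;

function isDead(lastHeartbeatAt: number, now: number): boolean {
  return now - lastHeartbeatAt > RUNNER_LIVENESS_TIMEOUT_MS;
}

function nextSweepDelay(random: () => number = Math.random): number {
  // Base interval plus jitter, so independent coordinators don't
  // all sweep the shared runners table at the same instant.
  return SWEEP_INTERVAL_MS + Math.floor(random() * MAX_JITTER_MS);
}
```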
Paused sessions can be resumed later. If ASH_SNAPSHOT_URL is configured, the runner persists workspace state to cloud storage before eviction, enabling resume on a different runner.
Runner Comes Back
If a runner restarts with the same ASH_RUNNER_ID, it re-registers with the coordinator. The coordinator updates the connection info (host, port) and resumes routing new sessions to it.
Existing sessions that were paused when the runner died are not automatically resumed. The client must explicitly resume them via POST /api/sessions/:id/resume.
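A minimal client sketch for resuming a paused session through the POST /api/sessions/:id/resume endpoint mentioned above. The bearer-token auth header and base URL are assumptions for illustration, not a documented client API:

```typescript
// Builds the documented resume endpoint path for a session.
function resumeUrl(baseUrl: string, sessionId: string): string {
  return `${baseUrl}/api/sessions/${encodeURIComponent(sessionId)}/resume`;
}

// Assumption: the API accepts Authorization: Bearer <ASH_API_KEY>.
async function resumeSession(
  baseUrl: string,
  apiKey: string,
  sessionId: string
): Promise<boolean> {
  const res = await fetch(resumeUrl(baseUrl, sessionId), {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  return res.ok;
}
```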
No Runners Available
If all runners are dead or at capacity, session creation returns an HTTP 503 error:
{
"error": "No runners available and no local backend configured"
}
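A client can treat this 503 as transient and retry session creation with exponential backoff. A sketch, assuming a POST /api/sessions creation endpoint and bearer-token auth (both assumptions here, not documented above):

```typescript
// Exponential backoff schedule: 1s, 2s, 4s, ... capped at maxMs.
function backoffDelays(attempts: number, baseMs = 1_000, maxMs = 30_000): number[] {
  return Array.from({ length: attempts }, (_, i) =>
    Math.min(baseMs * 2 ** i, maxMs)
  );
}

// Retries only on 503 (no capacity); returns any other response
// immediately. Endpoint path and auth header are assumptions.
async function createSessionWithRetry(
  baseUrl: string,
  apiKey: string
): Promise<Response> {
  let last!: Response;
  for (const delayMs of [0, ...backoffDelays(4)]) {
    if (delayMs > 0) await new Promise((r) => setTimeout(r, delayMs));
    last = await fetch(`${baseUrl}/api/sessions`, {
      method: "POST",
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    if (last.status !== 503) return last;
  }
  return last;
}
```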
Example: Two Runners with Docker Compose
version: "3.8"
services:
db:
image: postgres:16
environment:
POSTGRES_USER: ash
POSTGRES_PASSWORD: ash
POSTGRES_DB: ash
volumes:
- pgdata:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U ash"]
interval: 5s
timeout: 5s
retries: 5
coordinator:
image: ghcr.io/ash-ai/ash:0.1.0
init: true
ports:
- "4100:4100"
environment:
- ASH_MODE=coordinator
- ASH_DATABASE_URL=postgresql://ash:ash@db:5432/ash
- ASH_API_KEY=${ASH_API_KEY}
- ASH_INTERNAL_SECRET=${ASH_INTERNAL_SECRET}
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
depends_on:
db:
condition: service_healthy
runner-1:
image: ghcr.io/ash-ai/ash:0.1.0
init: true
privileged: true
command: ["node", "packages/runner/dist/index.js"]
volumes:
- runner1-data:/data
environment:
- ASH_RUNNER_ID=runner-1
- ASH_SERVER_URL=http://coordinator:4100
- ASH_RUNNER_ADVERTISE_HOST=runner-1
- ASH_MAX_SANDBOXES=50
- ASH_INTERNAL_SECRET=${ASH_INTERNAL_SECRET}
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
depends_on:
- coordinator
runner-2:
image: ghcr.io/ash-ai/ash:0.1.0
init: true
privileged: true
command: ["node", "packages/runner/dist/index.js"]
volumes:
- runner2-data:/data
environment:
- ASH_RUNNER_ID=runner-2
- ASH_SERVER_URL=http://coordinator:4100
- ASH_RUNNER_ADVERTISE_HOST=runner-2
- ASH_MAX_SANDBOXES=50
- ASH_INTERNAL_SECRET=${ASH_INTERNAL_SECRET}
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
depends_on:
- coordinator
volumes:
pgdata:
runner1-data:
runner2-data:
export ANTHROPIC_API_KEY=sk-ant-...
export ASH_API_KEY=my-production-key
export ASH_INTERNAL_SECRET=my-runner-secret
docker compose up -d
Multi-Coordinator (High Availability)
For production deployments that need control-plane redundancy or must handle more than ~10,000 concurrent SSE connections, you can run multiple coordinators behind a load balancer.
Multi-coordinator requires a shared database. All coordinators must point to the same Postgres or CockroachDB instance via ASH_DATABASE_URL. SQLite does not support multi-coordinator.
Architecture
Coordinators are stateless — all routing decisions come from the database. Any coordinator can route to any runner.
Starting Multiple Coordinators
# Coordinator 1
ASH_MODE=coordinator \
ASH_DATABASE_URL="postgresql://ash:password@db-host:5432/ash" \
ASH_API_KEY=my-api-key \
ASH_INTERNAL_SECRET=my-runner-secret \
ANTHROPIC_API_KEY=sk-ant-... \
ASH_PORT=4100 \
node packages/server/dist/index.js
# Coordinator 2 (same config, different host)
ASH_MODE=coordinator \
ASH_DATABASE_URL="postgresql://ash:password@db-host:5432/ash" \
ASH_API_KEY=my-api-key \
ASH_INTERNAL_SECRET=my-runner-secret \
ANTHROPIC_API_KEY=sk-ant-... \
ASH_PORT=4100 \
node packages/server/dist/index.js
Each coordinator logs its unique ID (hostname-PID) on startup for debugging:
Ash server listening on 0.0.0.0:4100 (mode: coordinator, id: ip-10-0-1-5-12345)
Load Balancer Configuration
upstream ash_coordinators {
server coordinator-1:4100;
server coordinator-2:4100;
}
server {
listen 443 ssl;
location / {
proxy_pass http://ash_coordinators;
proxy_http_version 1.1;
proxy_set_header Connection ''; # Required for SSE
proxy_buffering off; # Required for SSE
proxy_read_timeout 86400s; # Long-lived SSE streams
}
}
- No sticky sessions needed. Any coordinator can handle any request.
- Health check: GET /health on each coordinator.
- SSE failover: If a coordinator dies mid-stream, the client's SSE auto-reconnects through the load balancer to a different coordinator. The new coordinator looks up the session in the shared database and re-establishes the proxy to the runner. No session migration needed.
Runners with Multi-Coordinator
Runners register with the load balancer URL (not a specific coordinator):
ASH_SERVER_URL=http://load-balancer:4100 \
ASH_RUNNER_ID=runner-1 \
ASH_INTERNAL_SECRET=my-runner-secret \
node packages/runner/dist/index.js
Heartbeats go through the load balancer. Any coordinator that receives a heartbeat writes it to the shared database, where all other coordinators can see it.
How It Works
The runner registry lives in the runners table in the shared database. All coordinators read and write to this table:
- Registration: Runner sends POST /api/internal/runners/register → coordinator upserts into the runners table (with retry and exponential backoff).
- Heartbeat: Runner sends POST /api/internal/runners/heartbeat → coordinator updates active_count, warming_count, last_heartbeat_at.
- Deregistration: Runner sends POST /api/internal/runners/deregister on graceful shutdown → coordinator bulk-pauses sessions and deletes the runner in one pass.
- Selection: On session creation, the coordinator queries SELECT ... FROM runners ORDER BY available_capacity DESC LIMIT 1.
- Liveness sweep: All coordinators run the sweep independently (every 30s plus random 0-5s jitter). Operations are idempotent: multiple coordinators marking the same dead runner's sessions as paused is harmless.
- Auth: When ASH_INTERNAL_SECRET is set, all /api/internal/* endpoints require Authorization: Bearer <secret>.
For more details on the scaling architecture, see Scaling Architecture.
Limitations
- Cross-runner resume requires cloud persistence. Without ASH_SNAPSHOT_URL, a session paused on runner-1 cannot be resumed on runner-2 because the workspace state is local to runner-1's filesystem. Configure S3 or GCS snapshots for cross-runner resume.
- No automatic session migration. If a runner is overloaded, existing sessions are not moved to a less-loaded runner. Only new sessions benefit from load-based routing.
- Runner state is in-memory. Each runner uses an in-memory database for sandbox tracking (not SQLite). If a runner crashes, its sandbox tracking is lost. On restart, it re-registers with fresh state. The coordinator detects the gap via missed heartbeats and pauses affected sessions.