Multi-Machine Setup
Most deployments do not need multi-machine mode. A single machine running in standalone mode can handle dozens of concurrent sessions. Read this page only when you need more capacity than one machine provides.
When to Use
Use multi-machine mode when:
- You need more concurrent sessions than a single server can handle.
- You want to isolate sandbox execution from the control plane for reliability.
- You need to scale sandbox capacity independently of the API server.
Architecture
- Coordinator (the Ash server in coordinator mode): handles all client-facing HTTP traffic, manages the database, routes session creation to runners, and proxies messages to the correct runner.
- Runners: each runner manages a local SandboxPool, creates sandbox processes, and communicates with the bridge inside each sandbox over Unix sockets. Runners do not serve external traffic directly.
Standalone Mode (Default)
In standalone mode (ASH_MODE=standalone), the server creates a local SandboxPool and handles everything in one process. This is the default and the right choice for single-machine deployments.
Client --> Ash Server (standalone) --> SandboxPool --> Bridge --> Claude SDK
No runners are needed. The server is both control plane and execution plane.
Coordinator Mode
In coordinator mode (ASH_MODE=coordinator), the server does not create a local sandbox pool. Instead, it waits for runners to register and provides capacity through them.
Starting the Coordinator
export ASH_MODE=coordinator
export ASH_DATABASE_URL="postgresql://ash:password@db-host:5432/ash"
export ASH_API_KEY=my-api-key
export ASH_INTERNAL_SECRET=my-runner-secret # Required: authenticates runner registration
export ANTHROPIC_API_KEY=sk-ant-...
node packages/server/dist/index.js
# Or via Docker:
# ash start -e ASH_MODE=coordinator -e ASH_DATABASE_URL=... -e ASH_INTERNAL_SECRET=...
The coordinator logs:
Ash server listening on 0.0.0.0:4100 (mode: coordinator)
At this point, the server accepts API requests but cannot create sessions until at least one runner registers.
Starting a Runner
On each runner machine:
export ASH_RUNNER_ID=runner-1
export ASH_SERVER_URL=http://coordinator-host:4100
export ASH_RUNNER_PORT=4200
export ASH_RUNNER_ADVERTISE_HOST=10.0.1.5 # IP the coordinator can reach
export ASH_MAX_SANDBOXES=50
export ASH_INTERNAL_SECRET=my-runner-secret # Must match coordinator
export ANTHROPIC_API_KEY=sk-ant-...
node packages/runner/dist/index.js
The runner:
- Creates a SandboxPool with a lightweight in-memory database for sandbox tracking.
- Starts a Fastify HTTP server on port 4200.
- Sends a registration request to ASH_SERVER_URL/api/internal/runners/register (with exponential backoff retry on failure).
- Begins sending heartbeats every 10 seconds to ASH_SERVER_URL/api/internal/runners/heartbeat, including pool stats (running, warming, waiting counts).
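The pool stats carried in each heartbeat can be pictured as a small payload builder. A sketch only: the field names below are assumptions inferred from the counts listed above, not the actual wire format.

```typescript
// Sketch of a heartbeat payload; field names are assumptions based
// on the pool stats described above (running, warming, waiting).
interface PoolStats {
  running: number;
  warming: number;
  waiting: number;
}

interface Heartbeat {
  runnerId: string;
  maxSandboxes: number;
  stats: PoolStats;
}

function buildHeartbeat(
  runnerId: string,
  maxSandboxes: number,
  stats: PoolStats
): Heartbeat {
  return { runnerId, maxSandboxes, stats };
}

// A runner would POST something like this every 10 seconds to
// ASH_SERVER_URL/api/internal/runners/heartbeat with
// Authorization: Bearer $ASH_INTERNAL_SECRET.
const hb = buildHeartbeat("runner-1", 50, { running: 12, warming: 3, waiting: 0 });
```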
The coordinator logs:
[coordinator] Runner runner-1 registered at 10.0.1.5:4200 (max 50)
Session Routing
When a client creates a session, the coordinator selects a runner using least-loaded routing: it picks the runner with the most available capacity (max sandboxes minus running and warming sandboxes).
If no remote runners are healthy, the server falls back to its local backend, which exists only in standalone mode. In pure coordinator mode with no healthy runners, session creation fails with an error.
Once a session is assigned to a runner, all subsequent messages for that session are routed to the same runner. The runner_id is stored in the session record in the database.
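The least-loaded selection above can be sketched as a pure function. The types and names are illustrative, not the server's actual implementation:

```typescript
// Sketch of least-loaded routing: available capacity is max
// sandboxes minus running and warming, and the runner with the
// most headroom wins.
interface RunnerInfo {
  id: string;
  maxSandboxes: number;
  running: number;
  warming: number;
}

function availableCapacity(r: RunnerInfo): number {
  return r.maxSandboxes - r.running - r.warming;
}

function pickRunner(runners: RunnerInfo[]): RunnerInfo | undefined {
  // Filter out full runners; undefined means no capacity anywhere,
  // in which case session creation fails.
  const open = runners.filter((r) => availableCapacity(r) > 0);
  if (open.length === 0) return undefined;
  return open.reduce((best, r) =>
    availableCapacity(r) > availableCapacity(best) ? r : best
  );
}
```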
Failure Handling
Graceful Shutdown
When a runner shuts down cleanly (receives SIGTERM), it calls POST /api/internal/runners/deregister. The coordinator immediately bulk-pauses all active sessions on that runner in a single query and removes it from the registry. No 30-second wait.
Runner Crashes
If a runner crashes without deregistering, the coordinator detects it via missed heartbeats. After 30 seconds without a heartbeat (RUNNER_LIVENESS_TIMEOUT_MS), the coordinator:
- Bulk-pauses all active/starting sessions on that runner (single UPDATE query).
- Removes the runner from the routing table.
- Purges stale entries from its local backend cache.
Each coordinator adds random 0-5s jitter to its sweep interval to prevent thundering herd when multiple coordinators sweep independently.
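The timeout and jitter arithmetic can be sketched as follows. RUNNER_LIVENESS_TIMEOUT_MS comes from the text; the other constant and function names are assumptions:

```typescript
// Sketch of the liveness check: a runner is dead after 30s without
// a heartbeat, and each sweep is offset by 0-5s of random jitter.
const RUNNER_LIVENESS_TIMEOUT_MS = 30_000;
const SWEEP_INTERVAL_MS = 30_000;
const MAX_JITTER_MS = 5_000;

function isDead(lastHeartbeatAt: number, now: number): boolean {
  return now - lastHeartbeatAt > RUNNER_LIVENESS_TIMEOUT_MS;
}

function nextSweepDelay(random: () => number = Math.random): number {
  // Base interval plus jitter, so independent coordinators don't
  // all sweep the shared runners table at the same instant.
  return SWEEP_INTERVAL_MS + Math.floor(random() * MAX_JITTER_MS);
}
```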
Paused sessions can be resumed later. If ASH_SNAPSHOT_URL is configured, the runner persists workspace state to cloud storage before eviction, enabling resume on a different runner.
Runner Comes Back
If a runner restarts with the same ASH_RUNNER_ID, it re-registers with the coordinator. The coordinator updates the connection info (host, port) and resumes routing new sessions to it.
Existing sessions that were paused when the runner died are not automatically resumed. The client must explicitly resume them via POST /api/sessions/:id/resume.
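A minimal client sketch for resuming a paused session through the POST /api/sessions/:id/resume endpoint mentioned above. The bearer-token auth header and base URL are assumptions for illustration, not a documented client API:

```typescript
// Builds the documented resume endpoint path for a session.
function resumeUrl(baseUrl: string, sessionId: string): string {
  return `${baseUrl}/api/sessions/${encodeURIComponent(sessionId)}/resume`;
}

// Assumption: the API accepts Authorization: Bearer <ASH_API_KEY>.
async function resumeSession(
  baseUrl: string,
  apiKey: string,
  sessionId: string
): Promise<boolean> {
  const res = await fetch(resumeUrl(baseUrl, sessionId), {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  return res.ok;
}
```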
No Runners Available
If all runners are dead or at capacity, session creation returns an HTTP 503 error:
{
"error": "No runners available and no local backend configured"
}
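A client can treat this 503 as transient and retry session creation with exponential backoff. A sketch, assuming a POST /api/sessions creation endpoint and bearer-token auth (both assumptions here, not documented above):

```typescript
// Exponential backoff schedule: 1s, 2s, 4s, ... capped at maxMs.
function backoffDelays(attempts: number, baseMs = 1_000, maxMs = 30_000): number[] {
  return Array.from({ length: attempts }, (_, i) =>
    Math.min(baseMs * 2 ** i, maxMs)
  );
}

// Retries only on 503 (no capacity); returns any other response
// immediately. Endpoint path and auth header are assumptions.
async function createSessionWithRetry(
  baseUrl: string,
  apiKey: string
): Promise<Response> {
  let last!: Response;
  for (const delayMs of [0, ...backoffDelays(4)]) {
    if (delayMs > 0) await new Promise((r) => setTimeout(r, delayMs));
    last = await fetch(`${baseUrl}/api/sessions`, {
      method: "POST",
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    if (last.status !== 503) return last;
  }
  return last;
}
```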
Example: Two Runners with Docker Compose
version: "3.8"
services:
db:
image: postgres:16
environment:
POSTGRES_USER: ash
POSTGRES_PASSWORD: ash
POSTGRES_DB: ash
volumes:
- pgdata:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U ash"]
interval: 5s
timeout: 5s
retries: 5
coordinator:
image: ghcr.io/ash-ai/ash:0.1.0
init: true
ports:
- "4100:4100"
environment:
- ASH_MODE=coordinator
- ASH_DATABASE_URL=postgresql://ash:ash@db:5432/ash
- ASH_API_KEY=${ASH_API_KEY}
- ASH_INTERNAL_SECRET=${ASH_INTERNAL_SECRET}
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
depends_on:
db:
condition: service_healthy
runner-1:
image: ghcr.io/ash-ai/ash:0.1.0
init: true
privileged: true
command: ["node", "packages/runner/dist/index.js"]
volumes:
- runner1-data:/data
environment:
- ASH_RUNNER_ID=runner-1
- ASH_SERVER_URL=http://coordinator:4100
- ASH_RUNNER_ADVERTISE_HOST=runner-1
- ASH_MAX_SANDBOXES=50
- ASH_INTERNAL_SECRET=${ASH_INTERNAL_SECRET}
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
depends_on:
- coordinator
runner-2:
image: ghcr.io/ash-ai/ash:0.1.0
init: true
privileged: true
command: ["node", "packages/runner/dist/index.js"]
volumes:
- runner2-data:/data
environment:
- ASH_RUNNER_ID=runner-2
- ASH_SERVER_URL=http://coordinator:4100
- ASH_RUNNER_ADVERTISE_HOST=runner-2
- ASH_MAX_SANDBOXES=50
- ASH_INTERNAL_SECRET=${ASH_INTERNAL_SECRET}
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
depends_on:
- coordinator
volumes:
pgdata:
runner1-data:
runner2-data:
export ANTHROPIC_API_KEY=sk-ant-...
export ASH_API_KEY=my-production-key
export ASH_INTERNAL_SECRET=my-runner-secret
docker compose up -d
Multi-Coordinator (High Availability)
For production deployments that need control-plane redundancy or must handle more than ~10,000 concurrent SSE connections, you can run multiple coordinators behind a load balancer.
Multi-coordinator requires a shared database. All coordinators must point to the same Postgres or CockroachDB instance via ASH_DATABASE_URL. SQLite does not support multi-coordinator.
Architecture
Coordinators are stateless — all routing decisions come from the database. Any coordinator can route to any runner.
Starting Multiple Coordinators
# Coordinator 1
ASH_MODE=coordinator \
ASH_DATABASE_URL="postgresql://ash:password@db-host:5432/ash" \
ASH_API_KEY=my-api-key \
ASH_INTERNAL_SECRET=my-runner-secret \
ANTHROPIC_API_KEY=sk-ant-... \
ASH_PORT=4100 \
node packages/server/dist/index.js
# Coordinator 2 (same config, different host)
ASH_MODE=coordinator \
ASH_DATABASE_URL="postgresql://ash:password@db-host:5432/ash" \
ASH_API_KEY=my-api-key \
ASH_INTERNAL_SECRET=my-runner-secret \
ANTHROPIC_API_KEY=sk-ant-... \
ASH_PORT=4100 \
node packages/server/dist/index.js
Each coordinator logs its unique ID (hostname-PID) on startup for debugging:
Ash server listening on 0.0.0.0:4100 (mode: coordinator, id: ip-10-0-1-5-12345)
Load Balancer Configuration
upstream ash_coordinators {
server coordinator-1:4100;
server coordinator-2:4100;
}
server {
listen 443 ssl;
location / {
proxy_pass http://ash_coordinators;
proxy_http_version 1.1;
proxy_set_header Connection ''; # Required for SSE
proxy_buffering off; # Required for SSE
proxy_read_timeout 86400s; # Long-lived SSE streams
}
}
- No sticky sessions needed. Any coordinator can handle any request.
- Health check: GET /health on each coordinator.
- SSE failover: If a coordinator dies mid-stream, the client's SSE auto-reconnects through the load balancer to a different coordinator. The new coordinator looks up the session in the shared database and re-establishes the proxy to the runner. No session migration needed.
Runners with Multi-Coordinator
Runners register with the load balancer URL (not a specific coordinator):
ASH_SERVER_URL=http://load-balancer:4100 \
ASH_RUNNER_ID=runner-1 \
ASH_INTERNAL_SECRET=my-runner-secret \
node packages/runner/dist/index.js
Heartbeats go through the load balancer. Any coordinator that receives a heartbeat writes it to the shared database, where all other coordinators can see it.
How It Works
The runner registry lives in the runners table in the shared database. All coordinators read and write to this table:
- Registration: Runner sends POST /api/internal/runners/register → coordinator upserts into the runners table (with retry and exponential backoff).
- Heartbeat: Runner sends POST /api/internal/runners/heartbeat → coordinator updates active_count, warming_count, last_heartbeat_at.
- Deregistration: Runner sends POST /api/internal/runners/deregister on graceful shutdown → coordinator bulk-pauses sessions and deletes the runner in one pass.
- Selection: On session creation, the coordinator queries SELECT ... FROM runners ORDER BY available_capacity DESC LIMIT 1.
- Liveness sweep: All coordinators run the sweep independently (every 30s plus random 0-5s jitter). Operations are idempotent: multiple coordinators marking the same dead runner's sessions as paused is harmless.
- Auth: When ASH_INTERNAL_SECRET is set, all /api/internal/* endpoints require Authorization: Bearer <secret>.
For more details on the scaling architecture, see Scaling Architecture.
Limitations
- Cross-runner resume requires cloud persistence. Without ASH_SNAPSHOT_URL, a session paused on runner-1 cannot be resumed on runner-2 because the workspace state is local to runner-1's filesystem. Configure S3 or GCS snapshots for cross-runner resume.
- No automatic session migration. If a runner is overloaded, existing sessions are not moved to a less-loaded runner. Only new sessions benefit from load-based routing.
- Runner state is in-memory. Each runner uses an in-memory database for sandbox tracking (not SQLite). If a runner crashes, its sandbox tracking is lost. On restart, it re-registers with fresh state. The coordinator detects the gap via missed heartbeats and pauses affected sessions.