# Ash Documentation > Complete documentation for Ash — an open-source system for deploying and orchestrating AI agents. --- # Introduction Source: https://docs.ash-cloud.ai/ # What is Ash? Ash is a self-hostable system for deploying and orchestrating AI agents. You define an agent as a folder with a `CLAUDE.md` system prompt, deploy it to a server, and interact with it through a REST API, CLI, or SDKs. Every agent session runs inside an isolated sandbox. ## Who is Ash For? Ash is for developers and teams who want to run Claude-powered agents in production without giving up control of their infrastructure. | If you need... | Ash gives you... | |---|---| | AI agents behind an API | REST endpoints with SSE streaming | | Stateful conversations | Sessions that persist, pause, resume, and survive restarts | | Security isolation | Sandboxes with cgroups, bubblewrap, and environment allowlists | | Full infrastructure control | Self-hosted on Docker, EC2, ECS, GCE, or bare metal | | Multi-language clients | TypeScript SDK, Python SDK, CLI, and raw curl | ## Core Concepts at a Glance ```mermaid graph LR Agent["Agent
(folder with CLAUDE.md)"] Server["Ash Server
(REST API)"] Session["Session
(stateful conversation)"] Sandbox["Sandbox
(isolated process)"] Bridge["Bridge
(Claude Code SDK)"] Agent -->|deployed to| Server Server -->|creates| Session Session -->|runs in| Sandbox Sandbox -->|contains| Bridge ``` - **Agent** -- A folder containing a `CLAUDE.md` system prompt and optional config. Like a Docker image: the blueprint, not the running instance. - **Session** -- A stateful conversation between a client and a deployed agent. Created, paused, resumed, or ended through the API. - **Sandbox** -- An isolated child process running a single session. Restricted environment, resource limits, filesystem isolation. - **Bridge** -- The process inside each sandbox that connects to the Claude Code SDK and streams responses back to the server. - **Server** -- The Fastify HTTP server that exposes the REST API, manages agents, routes sessions, and persists state. ## Key Differentiators ### Self-Hosted, Not SaaS Ash runs on your infrastructure. Docker, EC2, ECS Fargate, GCE, or bare metal. Your data stays on your machines. No vendor lock-in, no external API gateway, no third-party dependencies beyond the Claude API itself. ### Agent = Folder No YAML manifests, no complex deployment pipelines. An agent is a directory with a `CLAUDE.md` file. Add `.claude/settings.json` for permissions, `.mcp.json` for MCP tools, and `.claude/skills/` for reusable workflows. Deploy with `ash deploy ./my-agent`. ### Sessions That Survive Sessions persist to SQLite or Postgres. They survive server restarts, can be paused and resumed days later, and hand off between machines in a multi-runner setup. Warm resume is 1.7ms (sandbox still alive). Cold resume is 32ms (new process + state restoration). ### Fast by Default Ash adds sub-millisecond overhead per message (0.41ms p50). Session creation is 44ms. Pool operations are microsecond-scale. The latency users feel is dominated by the LLM API, not Ash. See the [benchmarks](/guides/monitoring) for full numbers. 
### Real Isolation Every session sandbox runs with an environment allowlist (host credentials never leak in), cgroups v2 resource limits on Linux, and bubblewrap filesystem isolation. The agent inside the sandbox is treated as untrusted code. ### Thin Wrapper Philosophy Ash wraps the Claude Code SDK without reinventing it. SDK types flow through the system unchanged -- from bridge to server to client. Ash adds orchestration (sessions, sandboxes, streaming, persistence) but does not translate or redefine the AI layer. ## Quick Example ```bash # Install and start the server npm install -g @ash-ai/cli ash start # Define an agent (one file is all you need) mkdir my-agent echo "You are a helpful coding assistant." > my-agent/CLAUDE.md # Deploy and chat ash deploy ./my-agent --name my-agent ash chat my-agent "Explain closures in JavaScript" ``` The response streams back in real time. Under the hood, Ash deploys the agent to its registry, spawns an isolated sandbox, starts a bridge process with your `CLAUDE.md` as the system prompt, and proxies the Claude SDK's streaming response as SSE events. ## How Does Ash Compare? | | Ash | Generic Sandbox APIs | Managed Agent Platforms | |---|---|---|---| | **Focus** | AI agent orchestration | Code execution | Agent hosting | | **Infrastructure** | Self-hosted | Cloud/SaaS | Cloud/SaaS | | **Session model** | Persistent, resumable | Ephemeral | Varies | | **Isolation** | OS-level (cgroups, bwrap) | Provider-dependent | Provider-dependent | | **AI integration** | Deep (Claude Code SDK) | None (BYO) | Framework-specific | | **Data control** | Full (your machines) | Partial | Limited | For detailed comparisons, see [Ash vs ComputeSDK](/comparisons/computesdk) and [Ash vs Blaxel](/comparisons/blaxel). 
## Next Steps - **[Installation](/getting-started/installation)** -- Get the CLI installed and the server running - **[Quickstart](/getting-started/quickstart)** -- Deploy your first agent in two minutes - **[Key Concepts](/getting-started/concepts)** -- Deep dive into agents, sessions, sandboxes, and bridges - **[Architecture](/architecture/overview)** -- How all the pieces fit together --- # Installation Source: https://docs.ash-cloud.ai/getting-started/installation # Installation Get the Ash CLI installed and the server running. ## Prerequisites | Requirement | Details | |-------------|---------| | **Node.js** | >= 20 ([download](https://nodejs.org/)) | | **Docker** | Required for `ash start` ([install Docker](https://docs.docker.com/get-docker/)) | | **Anthropic API key** | Get one at [console.anthropic.com](https://console.anthropic.com/) | ## Install the CLI ```bash npm install -g @ash-ai/cli ``` Verify the installation: ```bash ash --help ``` You should see a list of available commands including `start`, `deploy`, `session`, `agent`, and `health`. ## Set Your API Key Ash needs an Anthropic API key to run agents. Export it in your shell: ```bash export ANTHROPIC_API_KEY=sk-ant-... ``` For persistence, add the export to your shell profile (`~/.bashrc`, `~/.zshrc`, etc.). ## Start the Server ```bash ash start ``` This pulls the Ash Docker image, starts the container, and waits for the server to become healthy: ``` Pulling ghcr.io/ash-ai/ash:latest... Starting Ash server... Waiting for server to be ready... Ash server is running. URL: http://localhost:4100 API key: ash_xxxxxxxx (saved to ~/.ash/config.json) Data dir: ~/.ash ``` The server auto-generates an API key on first start. The CLI captures it and saves it to `~/.ash/config.json`, so subsequent CLI commands authenticate automatically. If you need the key for SDK usage, read it from `~/.ash/config.json` or set `ASH_API_KEY` as an environment variable. 
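For example, a client application might resolve the key by preferring the `ASH_API_KEY` environment variable and falling back to the CLI's saved config file -- a minimal sketch, assuming the key is stored under an `apiKey` field (the field name is an assumption; check your actual `~/.ash/config.json`):

```python
import json
import os
from pathlib import Path
from typing import Optional

def load_ash_api_key(config_path: str = "~/.ash/config.json") -> Optional[str]:
    """Resolve the Ash API key: prefer ASH_API_KEY, then the CLI config file."""
    env_key = os.environ.get("ASH_API_KEY")
    if env_key:
        return env_key
    path = Path(config_path).expanduser()
    if not path.exists():
        return None
    config = json.loads(path.read_text())
    # "apiKey" is an assumed field name -- inspect the file on your machine
    return config.get("apiKey")
```

This mirrors how the CLI itself authenticates automatically after `ash start`: the environment variable wins, the saved config is the fallback.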
### `ash start` Options

| Option | Default | Description |
|--------|---------|-------------|
| `--port <port>` | `4100` | Host port to expose |
| `--database-url <url>` | SQLite (`data/ash.db`) | Use Postgres or CockroachDB instead of SQLite. Example: `postgresql://user:pass@host:5432/ash` |
| `--env KEY=VALUE` | -- | Pass extra environment variables to the container. Can be specified multiple times. |
| `--tag <tag>` | `latest` | Docker image tag |
| `--image <image>` | -- | Full Docker image name (overrides default + tag) |
| `--no-pull` | -- | Skip pulling the image (use a local build) |

Examples:

```bash
# Custom port
ash start --port 5000

# Use Postgres
ash start --database-url "postgresql://localhost:5432/ash"

# Pass additional environment variables
ash start --env ASH_SNAPSHOT_URL=s3://my-bucket/snapshots/

# Use a local dev image
ash start --image ash-dev --no-pull
```

## Verify the Server

Check that the server is running and healthy:

```bash
ash health
```

Expected output:

```json
{
  "status": "ok",
  "activeSessions": 0,
  "activeSandboxes": 0,
  "uptime": 5
}
```

You can also check container status:

```bash
ash status
```

## Stopping the Server

```bash
ash stop
```

This stops and removes the Docker container. Session data persists in `~/.ash` (SQLite) or your configured database.

## View Logs

```bash
ash logs     # Show server logs
ash logs -f  # Follow logs in real-time
```

## Next Step

With the server running, follow the [Quickstart](quickstart.md) to deploy your first agent.

---

# Quickstart

Source: https://docs.ash-cloud.ai/getting-started/quickstart

# Quickstart

Deploy an agent and chat with it. This takes about two minutes, assuming you have completed [Installation](installation.md).

## 1. Define an Agent

An agent is a folder with a `CLAUDE.md` file. The `CLAUDE.md` is the system prompt -- it tells the agent who it is and how to behave.

```bash
mkdir my-agent
cat > my-agent/CLAUDE.md << 'EOF'
You are a helpful coding assistant.
Answer questions about JavaScript and TypeScript.
Keep answers concise. Include working code examples. EOF ``` That is the only required file. For production agents, you can add `.claude/settings.json` (tool permissions), `.claude/skills/` (reusable skills), and `.mcp.json` (MCP server connections). See [Key Concepts](concepts.md) for more. ## 2. Deploy and Chat ```bash ash deploy ./my-agent --name my-agent ash chat my-agent "What is a closure in JavaScript?" ``` The response streams back in real time, with the session ID printed at the end: ``` A closure is a function that retains access to variables from its enclosing scope, even after the outer function has returned... Session: 550e8400-e29b-41d4-a716-446655440000 ``` The session stays alive so you can continue the conversation: ```bash ash chat --session 550e8400-e29b-41d4-a716-446655440000 "Now explain with an example" ``` When you are done, end the session: ```bash ash session end 550e8400-e29b-41d4-a716-446655440000 ``` Use `ash chat --end` for one-shot messages that don't need follow-ups -- it ends the session automatically after the response. ## Detailed Flow (Optional) If you need more control -- multiple messages, pause/resume, or session inspection -- use the session commands directly: ```bash # Create a session ash session create my-agent # → { "id": "550e8400-...", "status": "active", "agentName": "my-agent" } # Send messages (replace SESSION_ID with the actual ID) ash session send SESSION_ID "What is a closure in JavaScript?" ash session send SESSION_ID "Now explain it with an example" # End the session when done ash session end SESSION_ID ``` --- ## Using the SDKs The CLI is convenient for testing. For applications, use one of the SDKs. 
**TypeScript:**

```bash
npm install @ash-ai/sdk
```

```typescript
import { AshClient } from '@ash-ai/sdk';

const client = new AshClient({
  serverUrl: 'http://localhost:4100',
  apiKey: process.env.ASH_API_KEY,
});

// Create a session
const session = await client.createSession('my-agent');

// Send a message and stream the response
for await (const event of client.sendMessageStream(session.id, 'What is a closure?')) {
  if (event.type === 'message') {
    process.stdout.write(event.data);
  }
}

// Clean up
await client.endSession(session.id);
```

**Python:**

```bash
pip install ash-ai-sdk
```

```python
import os

from ash_ai import AshClient

client = AshClient(
    server_url="http://localhost:4100",
    api_key=os.environ.get("ASH_API_KEY"),
)

# Create a session
session = client.create_session("my-agent")

# Send a message and stream the response
for event in client.send_message_stream(session.id, "What is a closure?"):
    if event.type == "message":
        print(event.data, end="")

# Clean up
client.end_session(session.id)
```

**curl:**

```bash
# Create a session
curl -s -X POST http://localhost:4100/api/sessions \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $ASH_API_KEY" \
  -d '{"agent":"my-agent"}'

# Send a message (returns an SSE stream)
curl -N -X POST http://localhost:4100/api/sessions/SESSION_ID/messages \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $ASH_API_KEY" \
  -d '{"content":"What is a closure?"}'

# End the session
curl -s -X DELETE http://localhost:4100/api/sessions/SESSION_ID \
  -H "Authorization: Bearer $ASH_API_KEY"
```

---

## What Just Happened

When you ran those commands, here is what Ash did under the hood:

1. **`ash deploy`** -- Copied your agent folder to the server's agent registry and recorded it in the database.
2. **`ash session create`** -- Created a session record in the database and spawned an isolated sandbox process. Inside that sandbox, a bridge process started and loaded your `CLAUDE.md` as the system prompt.
3. **`ash session send`** -- Sent your message to the bridge over a Unix socket.
The bridge called the Claude Agent SDK, which streamed the response back. Ash proxied each chunk as a Server-Sent Event (SSE) over HTTP to your terminal. 4. **`ash session end`** -- Marked the session as ended in the database and destroyed the sandbox process. The sandbox is an isolated child process with a restricted environment -- only allowlisted variables reach it, and on Linux it runs with cgroup resource limits and filesystem isolation via bubblewrap. ## Next Steps - [Key Concepts](concepts.md) -- Understand agents, sessions, sandboxes, bridges, and the server - [CLI Reference](/cli/overview) -- All commands and flags - [API Reference](/api/overview) -- REST endpoints, SSE format, request/response schemas - [TypeScript SDK](/sdks/typescript) -- Full TypeScript client documentation - [Python SDK](/sdks/python) -- Full Python client documentation --- # Key Concepts Source: https://docs.ash-cloud.ai/getting-started/concepts # Key Concepts Ash has five core concepts. Understanding how they relate to each other will help you make sense of the rest of the documentation. ## The Five Concepts | Concept | What it is | Analogy | |---------|-----------|---------| | **Agent** | A folder containing `CLAUDE.md` (system prompt) and optional config files. Defines the behavior and permissions of an AI agent. | A Docker image -- the blueprint, not the running instance. | | **Session** | A stateful conversation between a client and a deployed agent. Has a lifecycle (starting, active, paused, ended). Persisted in the database. | A container instance -- created from the image, has state, can be stopped and restarted. | | **Sandbox** | An isolated child process that runs a single session. Restricted environment variables, resource limits (cgroups on Linux), and filesystem isolation (bubblewrap). | A jail cell -- the agent runs inside it and cannot access anything outside. | | **Bridge** | A process inside each sandbox that connects to the Claude Agent SDK. 
Reads the agent's `CLAUDE.md`, receives commands from the server over a Unix socket, and streams responses back. | A translator -- it speaks the server's protocol on one side and the Claude SDK's API on the other. | | **Server** | The Fastify HTTP server that exposes the REST API, manages the agent registry, routes sessions, and persists state to SQLite or Postgres. | The control tower -- it coordinates everything but does not do the AI work itself. | ## How They Connect ```mermaid graph LR Client["Client
(CLI / SDK / Browser)"] Server["Server
(Fastify, port 4100)"] DB["Database
(SQLite / Postgres)"] Pool["Sandbox Pool"] S1["Sandbox"] B1["Bridge"] SDK1["Claude Agent SDK"] Client -->|HTTP + SSE| Server Server --> DB Server --> Pool Pool --> S1 S1 -->|contains| B1 B1 -->|Unix socket| Server B1 --> SDK1 ``` **The data flow for a single message:** 1. The client sends `POST /api/sessions/:id/messages` with a prompt. 2. The server looks up the session and its associated sandbox. 3. The server sends a `query` command to the bridge over the Unix socket. 4. The bridge calls the Claude Agent SDK's `query()` function. 5. The SDK streams response messages back to the bridge. 6. The bridge writes each message to the Unix socket. 7. The server reads each message and writes it as an SSE frame to the HTTP response. 8. The client receives the streamed response. ## Agent Structure An agent is a folder. The only required file is `CLAUDE.md`: ``` my-agent/ ├── CLAUDE.md # System prompt (required) ├── .claude/ │ ├── settings.json # Tool permissions │ └── skills/ │ └── search-and-summarize/ │ └── SKILL.md # Reusable skill definition └── .mcp.json # MCP server connections ``` **Minimal agent** -- one file: ``` my-agent/ └── CLAUDE.md ``` **Production agent** -- skills, MCP tools, scoped permissions: ``` research-agent/ ├── CLAUDE.md # "You are a research assistant..." 
├── .mcp.json # Connect to web fetch, memory servers └── .claude/ ├── settings.json # Allow: Bash, WebSearch, mcp__fetch └── skills/ ├── search-and-summarize/ │ └── SKILL.md └── write-memo/ └── SKILL.md ``` ## Session Lifecycle Every session moves through a defined set of states: ```mermaid stateDiagram-v2 [*] --> starting : POST /api/sessions starting --> active : Sandbox ready starting --> error : Sandbox failed active --> active : Send messages active --> paused : Pause active --> ended : End session active --> error : Sandbox crashed paused --> active : Resume error --> active : Resume ended --> [*] ``` ### State Descriptions | State | Description | |-------|-------------| | **starting** | Session created, sandbox is being spawned. Brief transient state. | | **active** | Sandbox is running. The session can send and receive messages. | | **paused** | Session is paused. The sandbox may still be alive (enabling fast resume) or may have been cleaned up. Workspace state is persisted. | | **error** | The sandbox crashed or failed to start. The session can be resumed -- Ash will spawn a new sandbox and restore the previous workspace. | | **ended** | Terminal state. The session was explicitly ended by the client. Cannot be resumed -- create a new session instead. | ### Resume: Fast Path vs. Cold Path When you resume a paused or errored session, Ash takes one of two paths: - **Fast path**: The sandbox process is still alive. Ash flips the status back to `active` immediately. This is instant. - **Cold path**: The sandbox process is gone (crashed, server restarted, idle cleanup). Ash creates a new sandbox in the same workspace directory, restoring the `.claude` session state so the Claude SDK picks up the previous conversation. This takes a few seconds. In both cases, the conversation history is preserved. The client can continue sending messages as if nothing happened. 
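The two resume paths can be sketched as a simple state check -- an illustrative model only, not Ash's implementation; `Session`, `Sandbox`, and the `spawn_sandbox` callback are hypothetical names:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Sandbox:
    alive: bool = True

@dataclass
class Session:
    status: str
    workspace: str
    sandbox: Optional[Sandbox] = None

def resume(session: Session, spawn_sandbox: Callable[[str], Sandbox]) -> str:
    """Resume a paused or errored session; return which path was taken."""
    if session.status not in ("paused", "error"):
        raise ValueError(f"cannot resume a session in state {session.status!r}")
    if session.sandbox is not None and session.sandbox.alive:
        # Fast path: the sandbox process is still running -- flip the status.
        session.status = "active"
        return "fast"
    # Cold path: spawn a new sandbox in the same workspace so the
    # previous conversation state is restored.
    session.sandbox = spawn_sandbox(session.workspace)
    session.status = "active"
    return "cold"
```

Either way the caller ends up with an `active` session; only the latency differs.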
## Sandbox Isolation Each sandbox runs as an isolated child process with multiple layers of protection: | Layer | Linux | macOS (dev) | |-------|-------|-------------| | Process limits | cgroups v2 (pids.max) | -- | | Memory limits | cgroups v2 (memory.max) | -- | | CPU limits | cgroups v2 (cpu.max) | -- | | File size limits | cgroups + tmpfs | ulimit -f | | Environment | Strict allowlist | Strict allowlist | | Filesystem | bubblewrap | Restricted cwd | The environment allowlist ensures only explicitly permitted variables reach the sandbox: `PATH`, `HOME`, `LANG`, `TERM`, `ANTHROPIC_API_KEY`, and a few others. Everything else is blocked. Host credentials like `AWS_ACCESS_KEY_ID` never enter the sandbox. ## Next Steps - [Quickstart](quickstart.md) -- Deploy your first agent - [CLI Reference](/cli/overview) -- All commands and flags - [Architecture](/architecture/overview) -- Deep dive into the system design --- # Use Cases Source: https://docs.ash-cloud.ai/use-cases # Use Cases Ash is a general-purpose platform for deploying AI agents. Here are common patterns for what you can build. ## Customer Support Agent Deploy an agent that handles support tickets, looks up account data via MCP tools, and follows your company's support playbook. ``` support-agent/ CLAUDE.md # Support playbook and escalation rules .mcp.json # Connect to CRM, knowledge base .claude/ settings.json # Allow: WebFetch, mcp__crm__*, mcp__kb__* skills/ lookup-account.md # /lookup-account workflow process-refund.md # /process-refund workflow ``` **Why Ash:** Each support conversation is a persistent session. If the customer comes back later, resume the session with full context. The agent runs in an isolated sandbox, so it can't access other customers' data. MCP servers connect the agent to your CRM and knowledge base without exposing raw database access. ## Code Review Bot Build a bot that reviews pull requests, clones repos into sandboxes, runs tests, and posts structured feedback. 
```typescript // Triggered by GitHub webhook const session = await client.createSession('code-reviewer'); for await (const event of client.sendMessageStream( session.id, `Review this PR:\n${prDiff}\n\nClone the repo and run tests.`, )) { // Stream review results back to your webhook handler } // Post the review to GitHub, then clean up await client.endSession(session.id); ``` **Why Ash:** Sandbox isolation means the agent can clone repos and run `npm test` without affecting your host. Each review gets its own sandbox with its own filesystem. Streaming lets you show review progress in real time. ## Research Assistant with Memory Deploy an agent that searches the web, synthesizes findings, and remembers context across sessions using an MCP memory server. ``` research-agent/ CLAUDE.md # Research methodology and output format .mcp.json # fetch + memory MCP servers .claude/ settings.json # Allow: WebFetch, mcp__fetch__*, mcp__memory__* skills/ search-and-summarize/ SKILL.md # /search-and-summarize workflow write-memo/ SKILL.md # /write-memo workflow ``` **Why Ash:** Sessions persist across restarts. Pause a research session, come back days later, and resume where you left off. The memory MCP server stores facts persistently inside the sandbox workspace, building knowledge over time. ## Multi-Tenant SaaS Integration Build a SaaS feature where each of your customers gets their own AI assistant, with tenant-specific tools injected via per-session MCP servers. ```typescript // For each customer request, inject their specific MCP tools const session = await client.createSession('assistant', { mcpServers: { 'customer-api': { command: 'npx', args: ['-y', '@your-org/customer-mcp', '--tenant', customerId], env: { CUSTOMER_TOKEN: customerToken, }, }, }, }); ``` **Why Ash:** Per-session MCP servers let you inject tenant-specific tools at runtime without redeploying the agent. 
Each customer's session runs in its own sandbox with its own environment, so credentials never cross boundaries. ## Data Processing Pipeline Run agents that ingest data, execute analysis in sandboxed environments, and stream results back to your application. ```typescript const session = await client.createSession('data-analyst'); // Upload a CSV to the sandbox await client.uploadFile(session.id, '/workspace/data.csv', csvBuffer); // Ask the agent to analyze it for await (const event of client.sendMessageStream( session.id, 'Analyze data.csv. Calculate summary statistics, identify outliers, and produce a report.', )) { if (event.type === 'message') { const text = extractTextFromEvent(event.data); if (text) process.stdout.write(text); } } // Download the generated report const report = await client.downloadFile(session.id, '/workspace/report.md'); ``` **Why Ash:** The agent can install Python packages, write scripts, and execute code inside the sandbox without affecting your host system. File upload/download APIs let you pass data in and pull results out. Streaming shows progress as the analysis runs. ## Background Automation Agent Deploy a long-running agent that monitors systems, runs periodic checks, and takes action when needed. ``` monitor-agent/ CLAUDE.md # Monitoring procedures and alert rules .claude/ settings.json # Allow: Bash(*), WebFetch, mcp__slack__* .mcp.json # Slack MCP server for alerts ``` ```typescript // Create a long-lived session const session = await client.createSession('monitor-agent'); // Send periodic check instructions setInterval(async () => { for await (const event of client.sendMessageStream( session.id, 'Run your health checks and report any issues to Slack.', )) { // Log results } }, 5 * 60 * 1000); // Every 5 minutes ``` **Why Ash:** The session persists indefinitely. The agent builds context over time -- it knows what it checked last, what's normal, and what's changed. Pause the session during maintenance windows and resume after. 
## Patterns to Notice Across all use cases, a few patterns repeat: 1. **Agent as folder** -- Define behavior in `CLAUDE.md`, not code. Change the prompt, redeploy, done. 2. **Session persistence** -- Long-lived, resumable conversations are the default, not a special case. 3. **Sandbox isolation** -- Agents run untrusted code safely. Clone repos, run scripts, install packages. 4. **MCP servers** -- Connect agents to your systems (CRM, databases, APIs) through a standard protocol. 5. **Streaming** -- Real-time responses via SSE. Show progress, not just final answers. ## Next Steps - **[Quickstart](/getting-started/quickstart)** -- Deploy your first agent - **[Defining an Agent](/guides/defining-an-agent)** -- Full guide to agent structure - **[Managing Sessions](/guides/managing-sessions)** -- Session lifecycle and persistence - **[Streaming Responses](/guides/streaming-responses)** -- SSE events and SDK helpers --- # Defining an Agent Source: https://docs.ash-cloud.ai/guides/defining-an-agent # Defining an Agent An agent in Ash is a folder on disk. At minimum, it contains a single file: `CLAUDE.md`. This file defines the agent's identity, capabilities, and behavior. Ash reads this folder when you deploy, copies it into a sandbox, and uses it as the system prompt for every session. ## Minimal Agent The simplest possible agent is a directory with one file: ``` my-agent/ CLAUDE.md ``` The `CLAUDE.md` is the only required file. It contains the instructions the agent follows during every conversation. ```markdown title="my-agent/CLAUDE.md" # Customer Support Agent You are a customer support agent for Acme Corp. You help users troubleshoot product issues, process returns, and answer billing questions. 
## Behavior - Be polite and professional - Ask clarifying questions before making assumptions - If you cannot resolve an issue, escalate by telling the user to email support@acme.com ``` Deploy it: ```bash ash deploy ./my-agent --name customer-support ``` That is a working agent. It will respond to messages using the instructions in `CLAUDE.md`. ## Production Agent A production agent adds configuration for permissions, MCP servers, and skills: ``` research-assistant/ CLAUDE.md .claude/ settings.json skills/ search-and-summarize.md analyze-code.md .mcp.json ``` ### CLAUDE.md The system prompt defines identity, capabilities, and behavior rules: ```markdown title="research-assistant/CLAUDE.md" # Research Assistant Agent You are a research assistant powered by Ash. You help users research topics, analyze code, and produce structured reports. ## Capabilities You have access to: - **Web fetching** via the `fetch` MCP server - **Persistent memory** via the `memory` MCP server - **Skills** -- invoke /search-and-summarize or /analyze-code for structured workflows ## Behavior - Use your tools to find accurate information before answering - Store important facts in memory so you can recall them later - Be concise but thorough -- cite sources when you fetch web content ## Identity When asked about yourself, say you are the Research Assistant powered by Ash. ``` ### .claude/settings.json Controls which tools the agent is allowed to use without asking for confirmation, and optionally sets the default model. This maps directly to the Claude Code SDK's permission system. ```json title="research-assistant/.claude/settings.json" { "model": "claude-sonnet-4-5-20250929", "permissions": { "allow": [ "Bash(npm install:*)", "Bash(node:*)", "Read", "Write", "Glob", "Grep", "WebFetch", "mcp__fetch__*", "mcp__memory__*" ] } } ``` The `model` field sets the default model for the agent. 
This is the model the SDK uses unless overridden at the API level (see [Model Precedence](#model-precedence) below). The `allow` list uses glob patterns. Each entry permits the agent to use that tool without human approval. Tools not listed will be blocked or require approval depending on the session's permission mode. Common patterns: | Pattern | Allows | |---------|--------| | `Read` | Reading any file | | `Write` | Writing any file | | `Bash(node:*)` | Running any command starting with `node` | | `Bash(npm install:*)` | Running npm install commands | | `mcp__fetch__*` | All tools from the `fetch` MCP server | | `WebFetch` | The built-in web fetch tool | ### .mcp.json Configures MCP (Model Context Protocol) servers available to the agent. Each server provides additional tools the agent can call. ```json title="research-assistant/.mcp.json" { "mcpServers": { "fetch": { "command": "npx", "args": ["-y", "@anthropic-ai/mcp-fetch"] }, "memory": { "command": "npx", "args": ["-y", "@anthropic-ai/mcp-memory"], "env": { "MEMORY_FILE": "./memory.json" } } } } ``` MCP servers run as child processes inside the sandbox. The `env` field sets environment variables specific to that server. Paths are relative to the agent's workspace directory. You can also inject MCP servers at session creation time using the `mcpServers` field on `POST /api/sessions`. Session-level entries are merged into the agent's `.mcp.json` (session overrides agent on key conflict). This enables the **sidecar pattern** — your host app exposes tenant-specific tools as MCP endpoints. See [Per-Session MCP Servers](../api/sessions.md#per-session-mcp-servers) for details. ### .claude/skills/ Skills are markdown files that define reusable workflows the agent can invoke. Each file becomes a slash command. ```markdown title="research-assistant/.claude/skills/search-and-summarize.md" # /search-and-summarize Search the web for a given topic and produce a structured summary. ## Steps 1. 
Use the fetch tool to search for the topic 2. Read the top 3-5 results 3. Synthesize a summary with key findings 4. List all sources with URLs at the bottom ## Output Format Return a markdown document with sections: Overview, Key Findings, Sources. ``` The filename (minus `.md`) becomes the skill name. The agent can invoke it when a user references `/search-and-summarize` in a message. ## Folder Structure Reference ``` agent-name/ CLAUDE.md # Required. Agent system prompt. .claude/ settings.json # Optional. Tool permissions + default model. skills/ skill-name.md # Optional. Reusable workflows. .mcp.json # Optional. MCP server configuration. package.json # Optional. Dependencies installed at sandbox start. setup.sh # Optional. Runs once when sandbox initializes. ``` If a `package.json` is present, Ash runs `npm install` inside the sandbox when the session starts. If a `setup.sh` is present, it runs after dependency installation. ## What Happens at Deploy When you run `ash deploy ./my-agent --name my-agent`: 1. Ash validates that the directory contains `CLAUDE.md` 2. The agent files are copied to `~/.ash/agents/my-agent/` 3. The agent is registered with the server (name, path, version) 4. If an agent with that name already exists, its version is incremented The agent folder becomes the working directory for every session sandbox. Files the agent creates during a session are written to the sandbox workspace, not back to the agent definition. ## Model Precedence The model used for a conversation is resolved with the following precedence (highest to lowest): 1. **Per-message model** — passed in the `model` field of `POST /api/sessions/:id/messages` 2. **Session-level model** — set when creating the session via `POST /api/sessions` 3. **Agent record model** — set on the agent via the API 4. **Agent settings file** — the `model` field in `.claude/settings.json` 5. 
**SDK default** — the Claude Code SDK's built-in default model

This means you can deploy an agent with a default model in `.claude/settings.json`, override it for specific sessions, and override it again for individual messages — all without redeploying the agent. When a new model comes out, you can start using it immediately by passing it at the session or message level.

---

# Deploying Agents

Source: https://docs.ash-cloud.ai/guides/deploying-agents

# Deploying Agents

Deploying an agent registers it with the Ash server so sessions can be created against it. The agent folder is copied to the server's data directory and validated.

## Deploy with the CLI

```bash
ash deploy ./path/to/agent --name my-agent
```

The `--name` flag sets the agent name. If omitted, the directory name is used.

### What happens during deploy

1. **Validation** -- Ash checks that the directory contains a `CLAUDE.md` file. If it does not, the deploy fails with an error.
2. **Copy** -- The agent files are copied to `~/.ash/agents/<name>/`. This ensures the server can access them even if the original directory moves.
3. **Registration** -- The server creates or updates the agent record in its database. Each deploy increments the agent's version number.

```
$ ash deploy ./research-assistant --name research-bot
Copied agent files to /Users/you/.ash/agents/research-bot
Deployed agent: {
  "id": "a1b2c3d4-...",
  "name": "research-bot",
  "version": 1,
  "path": "agents/research-bot",
  "createdAt": "2025-01-15T10:30:00.000Z",
  "updatedAt": "2025-01-15T10:30:00.000Z"
}
```

## Updating an Agent

Redeploy with the same name to update an agent. Ash overwrites the agent files and increments the version:

```bash
# Edit your agent's CLAUDE.md, then redeploy
ash deploy ./research-assistant --name research-bot
```

Existing sessions continue using the version they started with. New sessions pick up the updated agent.
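The create-or-update versioning behavior can be sketched as a tiny in-memory registry -- an illustrative model only; `AgentRegistry` and its record shape are hypothetical, not Ash's actual schema:

```python
from typing import Dict, Optional

class AgentRegistry:
    """Toy model of deploy semantics: the first deploy of a name creates the
    record at version 1; each redeploy under the same name bumps the version."""

    def __init__(self) -> None:
        self._agents: Dict[str, dict] = {}

    def deploy(self, name: str, path: str) -> dict:
        existing = self._agents.get(name)
        if existing is None:
            record = {"name": name, "path": path, "version": 1}
        else:
            # Overwrite the files (path) and increment the version
            record = {**existing, "path": path, "version": existing["version"] + 1}
        self._agents[name] = record
        return record

    def get(self, name: str) -> Optional[dict]:
        return self._agents.get(name)
```

Sessions created before a redeploy would keep the version they started with; only new sessions read the updated record.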
## Listing Agents ```typescript const client = new AshClient({ serverUrl: 'http://localhost:4100', apiKey: process.env.ASH_API_KEY }); const agents = await client.listAgents(); console.log(agents); ``` ```python from ash_sdk import AshClient client = AshClient("http://localhost:4100", api_key=os.environ["ASH_API_KEY"]) agents = client.list_agents() print(agents) ``` ```bash ash agent list ``` ```bash curl $ASH_SERVER_URL/api/agents ``` Response: ```json { "agents": [ { "id": "a1b2c3d4-...", "name": "research-bot", "version": 2, "path": "/data/agents/research-bot", "createdAt": "2025-01-15T10:30:00.000Z", "updatedAt": "2025-01-15T12:00:00.000Z" } ] } ``` ## Getting Agent Details ```typescript const agent = await client.getAgent('research-bot'); ``` ```python agent = client.get_agent("research-bot") ``` ```bash ash agent info research-bot ``` ```bash curl $ASH_SERVER_URL/api/agents/research-bot ``` ## Deleting an Agent Deleting an agent removes its registration from the server. Existing sessions that were created from the agent continue to run, but no new sessions can be created. ```typescript await client.deleteAgent('research-bot'); ``` ```python client.delete_agent("research-bot") ``` ```bash ash agent delete research-bot ``` ```bash curl -X DELETE $ASH_SERVER_URL/api/agents/research-bot ``` ## API Reference | Method | Endpoint | Description | |--------|----------|-------------| | `POST` | `/api/agents` | Deploy (create or update) an agent | | `GET` | `/api/agents` | List all agents | | `GET` | `/api/agents/:name` | Get agent details | | `DELETE` | `/api/agents/:name` | Delete an agent | ### POST /api/agents Request body: ```json { "name": "research-bot", "path": "agents/research-bot" } ``` The `path` field is resolved relative to the server's data directory. When deploying via the CLI, this is handled automatically. 
Response (201): ```json { "agent": { "id": "a1b2c3d4-...", "name": "research-bot", "version": 1, "path": "/data/agents/research-bot", "createdAt": "2025-01-15T10:30:00.000Z", "updatedAt": "2025-01-15T10:30:00.000Z" } } ``` Error (400) -- missing CLAUDE.md: ```json { "error": "Agent directory must contain CLAUDE.md", "statusCode": 400 } ``` --- # Managing Sessions Source: https://docs.ash-cloud.ai/guides/managing-sessions # Managing Sessions A session is a stateful conversation between a client and a deployed agent. Each session runs inside an isolated sandbox with its own workspace directory. Sessions persist messages across turns and can be paused, resumed, and ended. ## Session States | State | Description | |-------|-------------| | `starting` | Sandbox is being created. Transitions to `active` on success or `error` on failure. | | `active` | Sandbox is running and accepting messages. | | `paused` | Sandbox may still be alive but the session is idle. Can be resumed. | | `ended` | Session is terminated. Sandbox is destroyed. Cannot be resumed. | | `error` | Something went wrong (sandbox crash, runner unavailable). Can be resumed. | State transitions: ``` starting --> active --> paused --> active (resume) \ \-> ended \-> error --> active (resume) \-> ended ``` ## Creating a Session ```typescript const client = new AshClient({ serverUrl: 'http://localhost:4100', apiKey: process.env.ASH_API_KEY }); const session = await client.createSession('my-agent'); console.log(session.id); // "a1b2c3d4-..." console.log(session.status); // "active" ``` ```python from ash_sdk import AshClient client = AshClient("http://localhost:4100", api_key=os.environ["ASH_API_KEY"]) session = client.create_session("my-agent") print(session.id) # "a1b2c3d4-..." 
print(session.status) # "active" ``` ```bash ash session create my-agent ``` ```bash curl -X POST $ASH_SERVER_URL/api/sessions \ -H "Content-Type: application/json" \ -d '{"agent": "my-agent"}' ``` Response (201): ```json { "session": { "id": "a1b2c3d4-...", "agentName": "my-agent", "sandboxId": "a1b2c3d4-...", "status": "active", "model": null, "createdAt": "2025-01-15T10:30:00.000Z", "lastActiveAt": "2025-01-15T10:30:00.000Z" } } ``` ### Creating a Session with a Model Override You can specify a model when creating a session. This overrides the agent's default model for the entire session. ```typescript const session = await client.createSession('my-agent', { model: 'claude-opus-4-6' }); ``` ```python session = client.create_session("my-agent", model="claude-opus-4-6") ``` ```bash ash session create my-agent --model claude-opus-4-6 ``` ```bash curl -X POST $ASH_SERVER_URL/api/sessions \ -H "Content-Type: application/json" \ -d '{"agent": "my-agent", "model": "claude-opus-4-6"}' ``` ### Creating a Session with Per-Session MCP Servers You can inject MCP servers at session creation time. This enables the **sidecar pattern**: your host application exposes tenant-specific tools as MCP endpoints, and each session connects to its own URL. Session-level MCP servers are merged into the agent's `.mcp.json`. If both define a server with the same key, the session entry wins. ```typescript const session = await client.createSession('my-agent', { mcpServers: { 'customer-tools': { url: 'http://host-app:8000/mcp?tenant=t_abc123' }, }, }); ``` ```bash curl -X POST $ASH_SERVER_URL/api/sessions \ -H "Content-Type: application/json" \ -d '{ "agent": "my-agent", "mcpServers": { "customer-tools": { "url": "http://host-app:8000/mcp?tenant=t_abc123" } } }' ``` ### Creating a Session with a System Prompt Override You can replace the agent's `CLAUDE.md` for a specific session. The agent definition is not modified — only the sandbox workspace copy is overwritten. 
```typescript const session = await client.createSession('my-agent', { systemPrompt: 'You are a support agent for tenant t_abc123. Be concise.', }); ``` ```bash curl -X POST $ASH_SERVER_URL/api/sessions \ -H "Content-Type: application/json" \ -d '{ "agent": "my-agent", "systemPrompt": "You are a support agent for tenant t_abc123. Be concise." }' ``` ### Combining MCP Servers and System Prompt For full per-tenant customization, pass both `mcpServers` and `systemPrompt` together: ```typescript const session = await client.createSession('my-agent', { mcpServers: { 'tenant-tools': { url: `http://host-app:8000/mcp?tenant=${tenantId}` }, }, systemPrompt: `You are a support agent for ${tenantName}. Use the tenant-tools MCP server to look up account data.`, }); ``` ```bash curl -X POST $ASH_SERVER_URL/api/sessions \ -H "Content-Type: application/json" \ -d '{ "agent": "my-agent", "mcpServers": { "tenant-tools": { "url": "http://host-app:8000/mcp?tenant=t_abc123" } }, "systemPrompt": "You are a support agent for Acme Corp. Use the tenant-tools MCP server to look up account data." }' ``` ## Sending Messages Messages are sent via POST and return an SSE stream. See the [Streaming Responses](./streaming-responses.md) guide for full details on consuming the stream. ```typescript for await (const event of client.sendMessageStream(session.id, 'What is the capital of France?')) { if (event.type === 'message') { console.log(event.data); } else if (event.type === 'done') { console.log('Turn complete'); } } ``` ```python for event in client.send_message_stream(session.id, "What is the capital of France?"): if event.type == "message": print(event.data) elif event.type == "done": print("Turn complete") ``` ```bash ash session send "What is the capital of France?" 
```

```bash
curl -X POST $ASH_SERVER_URL/api/sessions/SESSION_ID/messages \
  -H "Content-Type: application/json" \
  -d '{"content": "What is the capital of France?"}' \
  -N
```

### Per-Message Model Override

You can override the model for a single message. This takes the highest precedence — it overrides both the session model and the agent's default. Useful for using a more capable model on hard tasks or a cheaper model on simple ones.

```typescript
for await (const event of client.sendMessageStream(session.id, 'Analyze this complex codebase', {
  model: 'claude-opus-4-6',
})) {
  // This message uses Opus regardless of the session/agent default
}
```

```python
for event in client.send_message_stream(session.id, "Analyze this complex codebase", model="claude-opus-4-6"):
    pass  # This message uses Opus regardless of the session/agent default
```

```bash
curl -X POST $ASH_SERVER_URL/api/sessions/SESSION_ID/messages \
  -H "Content-Type: application/json" \
  -d '{"content": "Analyze this complex codebase", "model": "claude-opus-4-6"}' \
  -N
```

## Multi-Turn Conversations

Sessions preserve full conversation context across turns. Each message builds on the previous ones.

```typescript
const session = await client.createSession('my-agent');

// Turn 1
for await (const event of client.sendMessageStream(session.id, 'My name is Alice.')) {
  // Agent acknowledges
}

// Turn 2 -- agent remembers context from turn 1
for await (const event of client.sendMessageStream(session.id, 'What is my name?')) {
  if (event.type === 'message') {
    const text = extractTextFromEvent(event.data);
    if (text) console.log(text); // "Your name is Alice."
  }
}
```

Messages are persisted to the database.
You can retrieve them later:

```typescript
const messages = await client.listMessages(session.id);
for (const msg of messages) {
  console.log(`[${msg.role}] ${msg.content}`);
}
```

```python
session = client.create_session("my-agent")

# Turn 1
for event in client.send_message_stream(session.id, "My name is Alice."):
    pass  # Agent acknowledges

# Turn 2 -- agent remembers context from turn 1
for event in client.send_message_stream(session.id, "What is my name?"):
    if event.type == "message":
        data = event.data
        if data.get("type") == "assistant":
            for block in data.get("message", {}).get("content", []):
                if block.get("type") == "text":
                    print(block["text"])  # "Your name is Alice."
```

Messages are persisted to the database. You can retrieve them later:

```python
messages = client.list_messages(session.id)
for msg in messages:
    print(f"[{msg.role}] {msg.content}")
```

## Pausing a Session

Pausing a session marks it as idle. The sandbox may remain alive for fast resume, but the session stops accepting new messages until resumed.

```typescript
const paused = await client.pauseSession(session.id);
console.log(paused.status); // "paused"
```

```python
session = client.pause_session(session.id)
```

```bash
ash session pause <session-id>
```

```bash
curl -X POST $ASH_SERVER_URL/api/sessions/SESSION_ID/pause
```

## Resuming a Session

Resume brings a paused or errored session back to `active`. Ash uses two resume paths:

**Fast path (warm resume):** If the original sandbox is still alive, the session resumes instantly with no state loss. This is the common case when resuming shortly after pausing.

**Cold path (cold resume):** If the sandbox was reclaimed (idle timeout, OOM, server restart), Ash creates a new sandbox. Workspace state is restored from the persisted snapshot if available. Conversation history is preserved in the database regardless.
```typescript
const resumed = await client.resumeSession(session.id);
console.log(resumed.status); // "active"
```

```python
session = client.resume_session(session.id)
```

```bash
ash session resume <session-id>
```

```bash
curl -X POST $ASH_SERVER_URL/api/sessions/SESSION_ID/resume
```

Response includes the resume path taken:

```json
{
  "session": {
    "id": "a1b2c3d4-...",
    "status": "active",
    "sandboxId": "a1b2c3d4-..."
  }
}
```

## Ending a Session

Ending a session destroys the sandbox and marks the session as permanently closed. The session's messages and events remain in the database for retrieval, but no new messages can be sent.

```typescript
const ended = await client.endSession(session.id);
console.log(ended.status); // "ended"
```

```python
session = client.end_session(session.id)
```

```bash
ash session end <session-id>
```

```bash
curl -X DELETE $ASH_SERVER_URL/api/sessions/SESSION_ID
```

## Listing Sessions

```typescript
// All sessions
const sessions = await client.listSessions();

// Filter by agent
const agentSessions = await client.listSessions('my-agent');
```

```python
sessions = client.list_sessions()
```

```bash
ash session list
```

```bash
# All sessions
curl $ASH_SERVER_URL/api/sessions

# Filter by agent
curl "$ASH_SERVER_URL/api/sessions?agent=my-agent"
```

## API Reference

| Method | Endpoint | Description |
|--------|----------|-------------|
| `POST` | `/api/sessions` | Create a session |
| `GET` | `/api/sessions` | List sessions (optional `?agent=` filter) |
| `GET` | `/api/sessions/:id` | Get session details |
| `POST` | `/api/sessions/:id/messages` | Send a message (returns SSE stream) |
| `GET` | `/api/sessions/:id/messages` | List persisted messages |
| `POST` | `/api/sessions/:id/pause` | Pause a session |
| `POST` | `/api/sessions/:id/resume` | Resume a session |
| `DELETE` | `/api/sessions/:id` | End a session |

---

# Streaming Responses

Source: https://docs.ash-cloud.ai/guides/streaming-responses

# Streaming Responses

When you send a message to a session, the response is delivered
as a Server-Sent Events (SSE) stream. Events arrive in real time as the agent thinks, uses tools, and generates text. ## SSE Event Types The stream carries three event types: | Event | Description | |-------|-------------| | `message` | An SDK message from the agent. Contains assistant text, tool use, tool results, or stream deltas. | | `error` | An error occurred during processing. | | `done` | The agent's turn is complete. | Each SSE frame has the format: ``` event: message data: {"type": "assistant", "message": {"content": [{"type": "text", "text": "Hello!"}]}} event: done data: {"sessionId": "a1b2c3d4-..."} ``` The `data` field of `message` events carries raw SDK message objects passed through from the Claude Code SDK. The shape varies by message type (`assistant`, `user`, `result`, `stream_event`). ## Basic Streaming The `sendMessageStream` method returns an async generator of typed events: ```typescript const client = new AshClient({ serverUrl: 'http://localhost:4100', apiKey: process.env.ASH_API_KEY }); const session = await client.createSession('my-agent'); for await (const event of client.sendMessageStream(session.id, 'Explain TCP in one paragraph.')) { switch (event.type) { case 'message': { const text = extractTextFromEvent(event.data); if (text) { process.stdout.write(text); } break; } case 'error': console.error('Error:', event.data.error); break; case 'done': console.log('\nDone.'); break; } } ``` ```python from ash_sdk import AshClient client = AshClient("http://localhost:4100", api_key=os.environ["ASH_API_KEY"]) session = client.create_session("my-agent") for event in client.send_message_stream(session.id, "Explain TCP in one paragraph."): if event.type == "message": data = event.data # Extract text from assistant messages if data.get("type") == "assistant": content = data.get("message", {}).get("content", []) for block in content: if block.get("type") == "text": print(block["text"], end="") elif event.type == "error": print(f"Error: 
{event.data.get('error')}") elif event.type == "done": print("\nDone.") ``` Use the `-N` flag to disable output buffering so events print as they arrive: ```bash curl -N -X POST $ASH_SERVER_URL/api/sessions/SESSION_ID/messages \ -H "Content-Type: application/json" \ -d '{"content": "Hello!"}' ``` Output: ``` event: message data: {"type":"assistant","message":{"content":[{"type":"text","text":"Hello! How can I help you?"}]}} event: done data: {"sessionId":"a1b2c3d4-..."} ``` ## Display Items For richer output that includes tool use and tool results, use `extractDisplayItems`: ```typescript for await (const event of client.sendMessageStream(session.id, 'List files in /tmp')) { if (event.type === 'message') { const items = extractDisplayItems(event.data); if (items) { for (const item of items) { switch (item.type) { case 'text': console.log(item.content); break; case 'tool_use': console.log(`[Tool: ${item.toolName}] ${item.toolInput}`); break; case 'tool_result': console.log(`[Result] ${item.content}`); break; } } } } } ``` ```python for event in client.send_message_stream(session.id, "List files in /tmp"): if event.type == "message": data = event.data if data.get("type") == "assistant": for block in data.get("message", {}).get("content", []): if block.get("type") == "text": print(block["text"]) elif block.get("type") == "tool_use": print(f"[Tool: {block['name']}] {block.get('input', '')}") elif data.get("type") == "result": for block in data.get("content", []): if block.get("type") == "text": print(f"[Result] {block['text']}") ``` ## Partial Messages (Real-Time Streaming) By default, `message` events contain complete SDK messages. 
To receive incremental text deltas as the agent types, enable `includePartialMessages`: ```typescript for await (const event of client.sendMessageStream( session.id, 'Write a haiku about servers.', { includePartialMessages: true }, )) { if (event.type === 'message') { const delta = extractStreamDelta(event.data); if (delta) { process.stdout.write(delta); // Character-by-character streaming } } } ``` The `extractStreamDelta` helper extracts text from `content_block_delta` stream events. It returns `null` for non-delta events, so you can safely call it on every message. ```python for event in client.send_message_stream( session.id, "Write a haiku about servers.", include_partial_messages=True, ): if event.type == "message": data = event.data if data.get("type") == "stream_event": evt = data.get("event", {}) if evt.get("type") == "content_block_delta": delta = evt.get("delta", {}) if delta.get("type") == "text_delta": print(delta.get("text", ""), end="", flush=True) ``` ## Browser (Raw Fetch) For browser applications that do not use the SDK, parse the SSE stream directly with `ReadableStream`: ```javascript const response = await fetch('http://localhost:4100/api/sessions/SESSION_ID/messages', { method: 'POST', headers: { 'Content-Type': 'application/json', 'Authorization': 'Bearer YOUR_API_KEY', }, body: JSON.stringify({ content: 'Hello!' 
}), }); const reader = response.body.getReader(); const decoder = new TextDecoder(); let buffer = ''; let currentEvent = ''; while (true) { const { done, value } = await reader.read(); if (done) break; buffer += decoder.decode(value, { stream: true }); const lines = buffer.split('\n'); buffer = lines.pop() || ''; for (const line of lines) { if (line.startsWith('event: ')) { currentEvent = line.slice(7).trim(); } else if (line.startsWith('data: ')) { const data = JSON.parse(line.slice(6)); if (currentEvent === 'message') { // Handle message console.log(data); } else if (currentEvent === 'done') { console.log('Stream complete'); } else if (currentEvent === 'error') { console.error(data.error); } } } } ``` ## Error Handling Errors can arrive at two levels: **connection errors** (network failure, server restart) throw exceptions, and **agent errors** (sandbox crash, SDK error) arrive as `error` events within the stream. Handle both: ```typescript try { for await (const event of client.sendMessageStream(sessionId, 'Hello')) { if (event.type === 'message') { const text = extractTextFromEvent(event.data); if (text) process.stdout.write(text); } else if (event.type === 'error') { // Agent-level error (sandbox crash, OOM, SDK error) console.error('Agent error:', event.data.error); } else if (event.type === 'done') { console.log('\nDone.'); } } } catch (err) { // Connection-level error (network failure, server restart, 404) console.error('Connection error:', err.message); } ``` ```python try: for event in client.send_message_stream(session_id, "Hello"): if event.type == "message": data = event.data if data.get("type") == "assistant": for block in data.get("message", {}).get("content", []): if block.get("type") == "text": print(block["text"], end="") elif event.type == "error": # Agent-level error (sandbox crash, OOM, SDK error) print(f"Agent error: {event.data.get('error')}") elif event.type == "done": print("\nDone.") except Exception as e: # Connection-level error (network 
failure, server restart)
    print(f"Connection error: {e}")
```

## Reconnection with Retry

When an SSE stream disconnects (server restart, network blip, load balancer timeout), retry with exponential backoff. If the session's sandbox was destroyed, resume it before retrying.

```typescript
const client = new AshClient({
  serverUrl: 'http://localhost:4100',
  apiKey: process.env.ASH_API_KEY,
});

async function sleep(ms: number) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function streamWithRetry(
  sessionId: string,
  content: string,
  maxRetries = 3,
): Promise<string> {
  let fullText = '';

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      for await (const event of client.sendMessageStream(sessionId, content)) {
        if (event.type === 'message') {
          const text = extractTextFromEvent(event.data);
          if (text) {
            fullText += text;
            process.stdout.write(text);
          }
        } else if (event.type === 'error') {
          throw new Error(`Agent error: ${event.data.error}`);
        }
      }
      return fullText; // Stream completed successfully
    } catch (err) {
      console.warn(`Stream attempt ${attempt + 1} failed: ${(err as Error).message}`);
      if (attempt === maxRetries - 1) throw err;

      // Check if the session needs recovery before retrying
      try {
        const session = await client.getSession(sessionId);
        if (session.status === 'paused' || session.status === 'error') {
          await client.resumeSession(sessionId);
          console.log('Session resumed after disconnect');
        }
      } catch {
        // Server might be temporarily unreachable — wait and retry
      }

      // Exponential backoff: 1s, 2s, 4s
      await sleep(Math.pow(2, attempt) * 1000);
    }
  }

  return fullText;
}

// Usage
const session = await client.createSession('my-agent');
const result = await streamWithRetry(session.id, 'Analyze this code');
```

```python
import os
import time

from ash_sdk import AshClient

client = AshClient(
    server_url="http://localhost:4100",
    api_key=os.environ["ASH_API_KEY"],
)

def stream_with_retry(session_id: str, content: str, max_retries: int = 3) -> str:
    full_text = ""
    for attempt in
range(max_retries): try: for event in client.send_message_stream(session_id, content): if event.type == "message": data = event.data if data.get("type") == "assistant": for block in data.get("message", {}).get("content", []): if block.get("type") == "text": full_text += block["text"] print(block["text"], end="", flush=True) elif event.type == "error": raise Exception(f"Agent error: {event.data.get('error')}") return full_text # Stream completed successfully except Exception as e: print(f"\nStream attempt {attempt + 1} failed: {e}") if attempt == max_retries - 1: raise # Check if the session needs recovery try: session = client.get_session(session_id) if session.status in ("paused", "error"): client.resume_session(session_id) print("Session resumed after disconnect") except Exception: pass # Server temporarily unreachable # Exponential backoff time.sleep(2 ** attempt) return full_text # Usage session = client.create_session("my-agent") result = stream_with_retry(session.id, "Analyze this code") ``` ## Backpressure Ash handles backpressure automatically on the server side. When your client reads the SSE stream slowly, the server pauses the upstream agent rather than buffering unbounded data in memory. **What this means for your client:** - **You do not need to implement client-side backpressure.** Read the stream at whatever pace you can handle. If you process events slowly, the server waits. - **Memory is bounded.** The server buffers at most one SSE frame plus the kernel TCP send buffer (typically 128 KB - 1 MB). There is no application-level buffering. - **Slow clients get disconnected after 30 seconds.** If your client stops reading for more than 30 seconds, the server closes the stream with a timeout error. Reconnect and resume the session to continue. See [SSE Backpressure](../architecture/sse-backpressure.md) for the full server-side implementation. 
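The pull-based behavior described above can be seen in miniature with an async generator: the producer's body only runs when the consumer requests the next value, so it can never run more than one frame ahead of a slow reader. This is a standalone sketch of the mechanism, not Ash's server code:

```typescript
// A toy producer: records each frame it generates, then yields it.
// Because async generators are pull-based, the body pauses at `yield`
// until the consumer asks for the next frame.
async function* produceFrames(count: number, produced: string[]): AsyncGenerator<string> {
  for (let i = 0; i < count; i++) {
    const frame = `frame-${i}`;
    produced.push(frame);
    yield frame;
  }
}

// A deliberately slow consumer. Returns true if the producer never ran
// ahead of consumption (exactly one frame generated per frame read).
async function consumeSlowly(count: number): Promise<boolean> {
  const produced: string[] = [];
  let consumed = 0;
  let bounded = true;
  for await (const frame of produceFrames(count, produced)) {
    consumed++;
    if (produced.length !== consumed) bounded = false;
    await new Promise((resolve) => setTimeout(resolve, 5)); // simulate a slow client
  }
  return bounded && produced.length === count;
}
```

The SSE stream gives you the same property from the client's perspective: reading slowly simply delays the producer; it does not queue unbounded data on your behalf.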
## Helper Functions Reference The `@ash-ai/shared` package exports three helper functions for extracting content from stream events: | Function | Purpose | Returns | |----------|---------|---------| | `extractTextFromEvent(data)` | Extract text content from assistant messages | `string \| null` | | `extractDisplayItems(data)` | Extract structured items (text, tool use, tool results) | `DisplayItem[] \| null` | | `extractStreamDelta(data)` | Extract incremental text from partial stream events | `string \| null` | All three accept the `data` field from a `message` event and return `null` for events that do not match their expected type. --- # Working with Files Source: https://docs.ash-cloud.ai/guides/working-with-files # Working with Files Each session runs inside an isolated sandbox with its own workspace directory. Files the agent creates, modifies, or downloads during a session are accessible through the files API. This lets you review agent-written code, download generated artifacts, or inspect the workspace state. ## Listing Files Retrieve a flat list of all files in a session's workspace. 
```typescript
const client = new AshClient({ serverUrl: 'http://localhost:4100', apiKey: process.env.ASH_API_KEY });

const result = await client.getSessionFiles(sessionId);
console.log(`Source: ${result.source}`); // "sandbox" or "snapshot"
for (const file of result.files) {
  console.log(`${file.path} (${file.size} bytes, modified ${file.modifiedAt})`);
}
```

```python
import os

import httpx

from ash_sdk import AshClient

client = AshClient("http://localhost:4100", api_key=os.environ["ASH_API_KEY"])

# The Python SDK uses the raw API response
resp = httpx.get(f"http://localhost:4100/api/sessions/{session_id}/files")
data = resp.json()
for f in data["files"]:
    print(f"{f['path']} ({f['size']} bytes)")
```

```bash
curl $ASH_SERVER_URL/api/sessions/SESSION_ID/files
```

Response:

```json
{
  "files": [
    { "path": "CLAUDE.md", "size": 512, "modifiedAt": "2025-01-15T10:30:00.000Z" },
    { "path": "src/index.ts", "size": 1024, "modifiedAt": "2025-01-15T10:35:00.000Z" },
    { "path": "output/report.md", "size": 4096, "modifiedAt": "2025-01-15T10:36:00.000Z" }
  ],
  "source": "sandbox"
}
```

The `source` field indicates where the file listing came from:

| Source | Meaning |
|--------|---------|
| `sandbox` | Read from the live sandbox process. The session is active or paused with its sandbox still running. |
| `snapshot` | Read from a persisted workspace snapshot. The sandbox was reclaimed but workspace state was saved. |

## Downloading a File (Raw)

Download a file as raw bytes. This is the default behavior and works for any file type — text, binary, images, PDFs, etc. Files up to 100 MB are supported.
```typescript const { buffer, mimeType, source } = await client.downloadSessionFile(sessionId, 'output/report.pdf'); console.log(`Type: ${mimeType}, Source: ${source}`); fs.writeFileSync('report.pdf', buffer); ``` ```python resp = httpx.get(f"http://localhost:4100/api/sessions/{session_id}/files/output/report.pdf") with open("report.pdf", "wb") as f: f.write(resp.content) print(f"Type: {resp.headers['content-type']}") print(f"Source: {resp.headers['x-ash-source']}") ``` ```bash # Download raw file curl -o report.pdf $ASH_SERVER_URL/api/sessions/SESSION_ID/files/output/report.pdf ``` The response includes these headers: | Header | Description | |--------|-------------| | `Content-Type` | MIME type based on file extension (e.g. `application/pdf`, `text/typescript`) | | `Content-Disposition` | Suggested filename for download | | `Content-Length` | File size in bytes | | `X-Ash-Source` | `sandbox` or `snapshot` | ## Reading a File (JSON) For text files, you can get the content inline as a JSON response by adding `?format=json`. This is useful for building UIs that display file content directly. Limited to 1 MB. ```typescript const file = await client.getSessionFile(sessionId, 'src/index.ts'); console.log(`Path: ${file.path}`); console.log(`Size: ${file.size} bytes`); console.log(`Source: ${file.source}`); console.log(file.content); ``` ```python resp = httpx.get( f"http://localhost:4100/api/sessions/{session_id}/files/src/index.ts", params={"format": "json"} ) data = resp.json() print(f"Path: {data['path']}") print(f"Size: {data['size']} bytes") print(data["content"]) ``` ```bash curl "$ASH_SERVER_URL/api/sessions/SESSION_ID/files/src/index.ts?format=json" ``` Response: ```json { "path": "src/index.ts", "content": "console.log('hello world');\n", "size": 28, "source": "sandbox" } ``` ### Limitations (JSON mode) - Maximum file size is 1 MB. For larger files, use the raw download. - Content is read as UTF-8 text. Binary files should use the raw download instead. 
- Path traversal (`..`) and absolute paths (`/`) are rejected with a 400 error. - Certain directories are excluded from listings: `node_modules`, `.git`, `__pycache__`, `.cache`, `.npm`, `.venv`, and other common dependency/cache directories. ## Workspace Isolation Each session's workspace is isolated from other sessions and from the host system. The agent can read and write files within its workspace but cannot access files outside of it. When a session is created, the agent definition folder is copied into the sandbox workspace. Any files the agent creates during the session live alongside the agent definition files. When a session is paused or ended, the workspace state is persisted as a snapshot. If the session is later resumed with a new sandbox (cold resume), the snapshot is restored so the agent picks up where it left off. ## Use Cases **Reviewing agent-written code.** After an agent writes code in response to a prompt, list the workspace files and read specific files to review what was generated. 
```typescript const session = await client.createSession('code-writer'); // Ask the agent to write something for await (const event of client.sendMessageStream(session.id, 'Write a Python fibonacci function')) { // wait for completion } // Review what was written const files = await client.getSessionFiles(session.id); for (const f of files.files) { if (f.path.endsWith('.py')) { const content = await client.getSessionFile(session.id, f.path); console.log(`--- ${content.path} ---`); console.log(content.content); } } ``` ```python session = client.create_session("code-writer") # Ask the agent to write something for event in client.send_message_stream(session.id, "Write a Python fibonacci function"): pass # wait for completion # Review what was written resp = httpx.get(f"http://localhost:4100/api/sessions/{session.id}/files") for f in resp.json()["files"]: if f["path"].endswith(".py"): file_resp = httpx.get( f"http://localhost:4100/api/sessions/{session.id}/files/{f['path']}", params={"format": "json"} ) data = file_resp.json() print(f"--- {data['path']} ---") print(data["content"]) ``` **Downloading binary artifacts.** If an agent generates images, PDFs, or other binary files, download them directly. ```typescript // Download a generated image const { buffer } = await client.downloadSessionFile(session.id, 'output/chart.png'); fs.writeFileSync('chart.png', buffer); ``` ```python # Download a generated image resp = httpx.get(f"http://localhost:4100/api/sessions/{session.id}/files/output/chart.png") with open("chart.png", "wb") as f: f.write(resp.content) ``` **Accessing files after a session ends.** Files remain available from the persisted snapshot. 
```typescript await client.endSession(session.id); // Files still accessible from snapshot const report = await client.getSessionFile(session.id, 'output/report.md'); ``` ```python client.end_session(session.id) # Files still accessible from snapshot resp = httpx.get( f"http://localhost:4100/api/sessions/{session.id}/files/output/report.md", params={"format": "json"} ) report = resp.json() ``` ## API Reference | Method | Endpoint | Description | |--------|----------|-------------| | `GET` | `/api/sessions/:id/files` | List all files in the session workspace | | `GET` | `/api/sessions/:id/files/*path` | Download a file (raw bytes by default, JSON with `?format=json`) | --- # Authentication Source: https://docs.ash-cloud.ai/guides/authentication # Authentication Ash uses Bearer token authentication to protect API endpoints. All requests to `/api/*` routes require a valid API key. Authentication is always enabled — the server auto-generates an API key on first start if one is not provided. ## Auto-Generated API Key When you run `ash start` for the first time, the server automatically generates a secure API key (prefixed `ash_`) and: 1. Stores the hashed key in the database. 2. Writes the plaintext key to `~/.ash/initial-api-key`. 3. Logs the key to stdout. The CLI automatically picks up this key and saves it to `~/.ash/config.json`. No manual configuration is needed for local development. ## Manual Configuration To use a specific API key instead of the auto-generated one, set the `ASH_API_KEY` environment variable: ```bash export ASH_API_KEY="your-key-here" ``` Or pass it when starting the server: ```bash ash start -e ASH_API_KEY=your-key-here ``` When `ASH_API_KEY` is set, the server uses it directly instead of auto-generating one. 
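The generate-then-hash flow described above (plaintext key shown once, only a hash stored) can be sketched as follows. The key length, hash algorithm, and function names here are assumptions for illustration, not Ash's actual implementation:

```typescript
import { createHash, randomBytes, timingSafeEqual } from 'node:crypto';

// Generate an `ash_`-prefixed key (length is an assumption for illustration).
function generateApiKey(): string {
  return 'ash_' + randomBytes(24).toString('hex');
}

// Store only this hash; the plaintext key is shown to the user once.
function hashApiKey(key: string): string {
  return createHash('sha256').update(key).digest('hex');
}

// Compare hashes in constant time so verification does not leak timing info.
function verifyApiKey(presented: string, storedHash: string): boolean {
  const a = Buffer.from(hashApiKey(presented), 'hex');
  const b = Buffer.from(storedHash, 'hex');
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Storing only the hash in the database is why the plaintext key must be captured from `~/.ash/initial-api-key` or the startup logs when the server first runs.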
## Sending Authenticated Requests Pass the API key when creating the client: ```typescript const client = new AshClient({ serverUrl: 'http://localhost:4100', apiKey: 'your-generated-key-here', }); // All subsequent calls include the Authorization header automatically const agents = await client.listAgents(); ``` ```python from ash_sdk import AshClient client = AshClient( "http://localhost:4100", api_key="your-generated-key-here", ) agents = client.list_agents() ``` Set the `ASH_API_KEY` environment variable: ```bash export ASH_API_KEY="your-generated-key-here" ash agent list ``` Or pass it inline: ```bash ASH_API_KEY="your-generated-key-here" ash agent list ``` Include the `Authorization` header with the `Bearer` scheme: ```bash curl $ASH_SERVER_URL/api/agents \ -H "Authorization: Bearer your-generated-key-here" ``` ## Public Endpoints The following endpoints do not require authentication, even when `ASH_API_KEY` is set: | Endpoint | Description | |----------|-------------| | `GET /health` | Server health check | | `GET /metrics` | Prometheus metrics | | `GET /docs/*` | API documentation (Swagger UI) | ## Error Responses ### 401 -- Missing Authorization Header Returned when the request has no `Authorization` header: ```json { "error": "Missing Authorization header", "statusCode": 401 } ``` ### 401 -- Invalid API Key Returned when the `Authorization` header is present but the key does not match: ```json { "error": "Invalid API key", "statusCode": 401 } ``` ### 401 -- Malformed Header Returned when the `Authorization` header does not use the `Bearer ` format: ```json { "error": "Invalid Authorization header format", "statusCode": 401 } ``` ## Auth Resolution Order When a request arrives, the server resolves authentication in the following order: 1. **Public endpoints** (`/health`, `/metrics`, `/docs/*`) -- skip auth entirely. 2. **Internal endpoints** (`/api/internal/*`) -- authenticated via `ASH_INTERNAL_SECRET` (used for runner registration). 3.
**Bearer token present** -- validate against `ASH_API_KEY` or the database API keys table. Accept if matched. 4. **No match** -- reject with 401. --- # Monitoring Source: https://docs.ash-cloud.ai/guides/monitoring # Monitoring Ash exposes health checks, Prometheus metrics, debug timing, and structured logs for production monitoring. ## Health Endpoint `GET /health` returns the server's current status. This endpoint does not require authentication. ```bash curl $ASH_SERVER_URL/health ``` Response: ```json { "status": "ok", "activeSessions": 3, "activeSandboxes": 5, "uptime": 86400, "pool": { "total": 10, "cold": 2, "warming": 1, "warm": 2, "waiting": 3, "running": 2, "maxCapacity": 1000, "resumeWarmHits": 42, "resumeColdHits": 7 } } ``` | Field | Description | |-------|-------------| | `status` | Always `"ok"` if the server is reachable. | | `activeSessions` | Number of sessions with status `active`. | | `activeSandboxes` | Number of live sandbox processes. | | `uptime` | Seconds since server start. | | `pool.total` | Total sandboxes in the pool (all states). | | `pool.warm` | Sandboxes ready to accept work immediately. | | `pool.running` | Sandboxes actively processing a message. | | `pool.maxCapacity` | Maximum number of sandboxes the pool allows. | | `pool.resumeWarmHits` | Times a session resumed with its sandbox still alive (fast path). | | `pool.resumeColdHits` | Times a session resumed by creating a new sandbox (cold path). | ## Prometheus Metrics `GET /metrics` returns metrics in Prometheus text exposition format. This endpoint does not require authentication. ```bash curl $ASH_SERVER_URL/metrics ``` Response: ``` # HELP ash_up Whether the Ash server is up (always 1 if reachable). # TYPE ash_up gauge ash_up 1 # HELP ash_uptime_seconds Seconds since server start. # TYPE ash_uptime_seconds gauge ash_uptime_seconds 86400 # HELP ash_active_sessions Number of active sessions. 
# TYPE ash_active_sessions gauge ash_active_sessions 3 # HELP ash_active_sandboxes Number of live sandbox processes. # TYPE ash_active_sandboxes gauge ash_active_sandboxes 5 # HELP ash_pool_sandboxes Sandbox count by state. # TYPE ash_pool_sandboxes gauge ash_pool_sandboxes{state="cold"} 2 ash_pool_sandboxes{state="warming"} 1 ash_pool_sandboxes{state="warm"} 2 ash_pool_sandboxes{state="waiting"} 3 ash_pool_sandboxes{state="running"} 2 # HELP ash_pool_max_capacity Maximum sandbox capacity. # TYPE ash_pool_max_capacity gauge ash_pool_max_capacity 1000 # HELP ash_resume_total Total session resumes by path (warm=sandbox alive, cold=new sandbox). # TYPE ash_resume_total counter ash_resume_total{path="warm"} 42 ash_resume_total{path="cold"} 7 ``` ### Metric Reference | Metric | Type | Description | |--------|------|-------------| | `ash_up` | gauge | Always 1 if the server is reachable. Use for up/down alerting. | | `ash_uptime_seconds` | gauge | Seconds since server process started. | | `ash_active_sessions` | gauge | Sessions currently in `active` state. | | `ash_active_sandboxes` | gauge | Live sandbox processes (includes all states). | | `ash_pool_sandboxes` | gauge | Sandbox count broken down by state label: `cold`, `warming`, `warm`, `waiting`, `running`. | | `ash_pool_max_capacity` | gauge | Maximum sandboxes the pool will create. | | `ash_resume_total` | counter | Cumulative session resumes by path: `warm` (sandbox alive) or `cold` (new sandbox). 
| ### Prometheus Configuration Add Ash as a scrape target in `prometheus.yml`: ```yaml scrape_configs: - job_name: 'ash' scrape_interval: 15s static_configs: - targets: ['localhost:4100'] metrics_path: /metrics ``` ### Example PromQL Queries Active sessions over time: ```promql ash_active_sessions ``` Warm resume hit rate (percentage of resumes that were fast): ```promql ash_resume_total{path="warm"} / (ash_resume_total{path="warm"} + ash_resume_total{path="cold"}) ``` Pool utilization (fraction of capacity in use): ```promql sum(ash_pool_sandboxes) / ash_pool_max_capacity ``` Running sandboxes (actively processing messages): ```promql ash_pool_sandboxes{state="running"} ``` Alert when pool is over 80% capacity: ```promql sum(ash_pool_sandboxes) / ash_pool_max_capacity > 0.8 ``` ## Debug Timing Set `ASH_DEBUG_TIMING=1` to enable per-message timing instrumentation. When enabled, the server writes one JSON line to stderr for each message processed: ```bash ASH_DEBUG_TIMING=1 ash start ``` Timing output: ```json { "type": "timing", "source": "server", "sessionId": "a1b2c3d4-...", "sandboxId": "a1b2c3d4-...", "lookupMs": 0.42, "firstEventMs": 145.8, "totalMs": 2340.5, "eventCount": 12, "timestamp": "2025-01-15T10:30:00.000Z" } ``` | Field | Description | |-------|-------------| | `lookupMs` | Time to look up the session and sandbox. | | `firstEventMs` | Time from request to first SSE event (time-to-first-token). | | `totalMs` | Total request duration. | | `eventCount` | Number of SSE events sent. | Timing is zero-overhead when `ASH_DEBUG_TIMING` is not set. The check is a single `process.env` read per message. ## Structured Logs Ash writes structured JSON log lines to stderr. Each line is a self-contained JSON object. 
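These structured lines are also easy to consume from code rather than `jq`. A minimal sketch that filters events by `type` and computes the warm-resume ratio (the helper names are ours; the log shape follows the examples in this section):

```python
import json
from typing import Iterable, Iterator

def filter_events(lines: Iterable[str], event_type: str) -> Iterator[dict]:
    """Yield parsed log objects matching a given `type`, skipping non-JSON lines."""
    for line in lines:
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text log lines are ignored
        if isinstance(obj, dict) and obj.get("type") == event_type:
            yield obj

def warm_hit_rate(lines: Iterable[str]) -> float:
    """Fraction of resume_hit events that took the warm (fast) path."""
    paths = [e["path"] for e in filter_events(lines, "resume_hit")]
    return paths.count("warm") / len(paths) if paths else 0.0
```

Feed it `sys.stdin` (or a log file) to get the same numbers as the `jq` recipes, but usable from a monitoring script.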
### Resume Logging Every session resume emits a log line (always on, not gated by `ASH_DEBUG_TIMING`): ```json { "type": "resume_hit", "path": "warm", "sessionId": "a1b2c3d4-...", "agentName": "my-agent", "ts": "2025-01-15T10:30:00.000Z" } ``` The `path` field is `warm` (sandbox still alive) or `cold` (new sandbox created). ### Log Analysis with jq Filter resume events: ```bash ash start 2>&1 | jq -c 'select(.type == "resume_hit")' ``` Count warm vs cold resumes: ```bash ash start 2>&1 | jq -c 'select(.type == "resume_hit")' | \ jq -s 'group_by(.path) | map({path: .[0].path, count: length})' ``` Filter timing data for a specific session: ```bash ash start 2>&1 | jq -c 'select(.type == "timing" and .sessionId == "SESSION_ID")' ``` Find slow messages (time-to-first-token over 500ms): ```bash ash start 2>&1 | jq -c 'select(.type == "timing" and .firstEventMs > 500)' ``` Average time-to-first-token: ```bash ash start 2>&1 | jq -cs '[.[] | select(.type == "timing")] | (map(.firstEventMs) | add) / length' ``` --- # Docker (Default) Source: https://docs.ash-cloud.ai/self-hosting/docker # Docker (Default) The recommended way to run Ash is via Docker. The `ash start` command manages the entire Docker lifecycle for you -- pulling the image, creating volumes, and starting the container with the correct flags. ## Quick Start ```bash npm install -g @ash-ai/cli export ANTHROPIC_API_KEY=sk-ant-... ash start ``` That is it. The server is now running at `http://localhost:4100`. ## What `ash start` Does When you run `ash start`, the CLI performs the following steps in order: 1. **Checks Docker** -- verifies Docker is installed and the daemon is running. 2. **Removes stale containers** -- if a stopped `ash-server` container exists, it is removed. 3. **Creates `~/.ash/`** -- ensures the persistent data directory exists on the host. 4. **Pulls the image** -- downloads `ghcr.io/ash-ai/ash:0.1.0` (skip with `--no-pull`). 5. 
**Starts the container** -- runs `docker run` with the flags described below. 6. **Waits for healthy** -- polls `GET /health` until the server responds 200 (up to 30 seconds). ### Docker Flags The container is started with these flags: | Flag | Purpose | |------|---------| | `--init` | Runs [tini](https://github.com/krallin/tini) as PID 1 so signals (SIGTERM, SIGINT) are forwarded correctly to child processes. Without this, sandbox processes can become zombies on shutdown. | | `--cgroupns=host` | Shares the host's cgroup namespace so the entrypoint script can create per-sandbox cgroups for memory, CPU, and process limits. | | `-v ~/.ash:/data` | Mounts the host data directory into the container. All persistent state -- SQLite database, agent definitions, session workspaces -- lives here. | | `-p 4100:4100` | Exposes the API on the host. Configurable with `--port`. | | `-e ANTHROPIC_API_KEY=...` | Passes your API key into the container. The key is read from your shell environment. | ### Entrypoint: cgroup v2 Setup The container uses a custom entrypoint (`docker-entrypoint.sh`) that configures cgroup v2 delegation before starting the server. This enables per-sandbox resource limits (memory, CPU, process count). If cgroup v2 is not available (older kernels or restricted Docker configurations), the server falls back to ulimit-based limits. 
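To see which path your host will take, you can probe for cgroup v2 the same way container runtimes commonly do: check for the `cgroup.controllers` file at the unified-hierarchy mount root. A sketch (the helper is ours, not part of Ash):

```python
from pathlib import Path

def cgroup_v2_available(root: Path = Path("/sys/fs/cgroup")) -> bool:
    """cgroup v2 (the unified hierarchy) exposes cgroup.controllers at the mount root."""
    return (root / "cgroup.controllers").is_file()
```

If this returns `False` on your host, expect the ulimit-based fallback rather than per-sandbox cgroup limits.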
## Lifecycle Commands ```bash # Start the server (pulls image, creates container, waits for healthy) ash start # Check server status (container state + health endpoint) ash status # View logs (add -f to follow) ash logs ash logs -f # Stop and remove the container ash stop ``` ## Configuration Pass environment variables to the container with `-e`: ```bash ash start -e ASH_MAX_SANDBOXES=50 # Override the auto-generated API key (optional — not required for basic setup) ash start -e ASH_API_KEY=my-secret-key ``` Use `--database-url` to connect to an external database instead of the default SQLite: ```bash ash start --database-url "postgresql://user:pass@host:5432/ash" ``` Use `--port` to change the host port: ```bash ash start --port 8080 ``` Use `--image` to run a custom image (for example, a local build): ```bash docker build -t ash-dev . ash start --image ash-dev --no-pull ``` See the [Configuration Reference](./configuration.md) for all environment variables and [Streaming Telemetry](./telemetry.md) for OpenTelemetry tracing and event collection setup. ## Volume Mount Layout The host directory `~/.ash/` is mounted into the container at `/data/`. Here is what it contains: ``` ~/.ash/ (host) → /data/ (container) ├── ash.db SQLite database (agents, sessions, sandboxes, messages) ├── agents/ Deployed agent definitions │ └── my-agent/ │ ├── CLAUDE.md │ └── .claude/ ├── sessions/ Persisted session workspaces │ └── <session-id>/ │ ├── workspace/ Snapshot of the sandbox filesystem │ └── metadata.json Agent name, persist timestamp └── sandboxes/ Active sandbox working directories └── <sandbox-id>/ └── workspace/ ``` Because all state lives in `~/.ash/`, you can stop and restart the container without losing data. Sessions, agents, and the database survive across restarts.
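Because that one directory is the entire persistent state, backing up an Ash host is just archiving it. A minimal sketch (the helper is ours; stop the container first if you want a consistent SQLite snapshot):

```python
import tarfile
from pathlib import Path

def backup_data_dir(data_dir: Path, dest: Path) -> Path:
    """Archive the Ash data directory (database, agents, sessions) as a .tar.gz."""
    with tarfile.open(dest, "w:gz") as tar:
        tar.add(data_dir, arcname=data_dir.name)
    return dest
```

Restoring is the inverse: extract the archive back to `~/.ash/` before starting the container.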
## Docker Compose for Production For production deployments with CockroachDB (or PostgreSQL): ```yaml version: "3.8" services: ash: image: ghcr.io/ash-ai/ash:0.1.0 init: true privileged: true ports: - "4100:4100" volumes: - ash-data:/data environment: - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY} - ASH_API_KEY=${ASH_API_KEY} # Required for Docker Compose — auto-generation only works with `ash start` - ASH_DATABASE_URL=postgresql://ash:ash@cockroach:26257/ash?sslmode=disable - ASH_MAX_SANDBOXES=200 - ASH_IDLE_TIMEOUT_MS=1800000 depends_on: cockroach: condition: service_healthy cockroach: image: cockroachdb/cockroach:v24.1.0 command: start-single-node --insecure ports: - "26257:26257" - "8080:8080" volumes: - cockroach-data:/cockroach/cockroach-data healthcheck: test: ["CMD", "curl", "-f", "http://localhost:8080/health"] interval: 5s timeout: 5s retries: 10 volumes: ash-data: cockroach-data: ``` Start with: ```bash export ANTHROPIC_API_KEY=sk-ant-... export ASH_API_KEY=my-production-key docker compose up -d ``` ## Resource Recommendations | Concurrent Sessions | CPU | RAM | Disk | |---------------------|-----|-----|------| | 1--5 | 2 cores | 4 GB | 20 GB | | 5--20 | 4 cores | 8 GB | 50 GB | | 20--50 | 8 cores | 16 GB | 100 GB | Each active sandbox uses up to 2 GB of memory (configurable via resource limits) and 1 GB of disk by default. Plan capacity based on your peak concurrent session count, not total sessions -- idle sessions are evicted to disk and do not consume memory. ## Using a Local Build If you are developing Ash itself or need a custom image: ```bash # Build the image from the repository root docker build -t ash-dev . # Start using the local image (skip pulling from registry) ash start --image ash-dev --no-pull ``` The Dockerfile builds the full monorepo, installs `@anthropic-ai/claude-code` globally, creates a non-root sandbox user, and configures the entrypoint for cgroup v2 delegation. 
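Both keys in the Compose file above are read from the shell environment, and a missing `ASH_API_KEY` is an easy mistake since auto-generation only applies to `ash start`. A preflight sketch that catches unset values before `docker compose up` (the helper and variable list are ours, derived from the Compose example above):

```python
import os

# Variables the Compose file above expects from the shell environment.
REQUIRED_COMPOSE_VARS = ("ANTHROPIC_API_KEY", "ASH_API_KEY")

def missing_compose_vars(env=os.environ) -> list:
    """Return the names of required Compose variables that are unset or empty."""
    return [name for name in REQUIRED_COMPOSE_VARS if not env.get(name)]
```

Run it (or an equivalent shell check) in your deploy pipeline and abort early if the list is non-empty.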
--- # Deploy to AWS EC2 Source: https://docs.ash-cloud.ai/self-hosting/ec2 # Deploy to AWS EC2 This guide walks through deploying Ash to an EC2 instance using the included deploy script. The script provisions an Ubuntu instance, installs Docker, builds the Ash image, starts the server, and deploys the example QA Bot agent. ## Prerequisites - **AWS CLI v2** installed and configured ([install guide](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html)) - **EC2 key pair** created in your target region ([create a key pair](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/create-key-pairs.html)) - **ANTHROPIC_API_KEY** for agent execution ## Quick Start ```bash # Clone the repo git clone https://github.com/ash-ai-org/ash.git cd ash # Create .env from the example cp .env.example .env # Edit .env with your credentials (see below) # Deploy ./scripts/deploy-ec2.sh ``` The script takes 5--8 minutes. When it finishes, it prints the server URL, SSH command, and instructions for connecting the QA Bot UI. ## Configuration Create a `.env` file in the project root with the following variables: ### Required | Variable | Description | |----------|-------------| | `ANTHROPIC_API_KEY` | Your Anthropic API key for agent execution | | `AWS_ACCESS_KEY_ID` | AWS access key with EC2 permissions | | `AWS_SECRET_ACCESS_KEY` | AWS secret key | | `EC2_KEY_NAME` | Name of your EC2 key pair (as shown in the AWS console) | | `EC2_KEY_PATH` | Path to the private key file, e.g. 
`~/.ssh/my-key.pem` | ### Optional | Variable | Default | Description | |----------|---------|-------------| | `AWS_DEFAULT_REGION` | `us-east-1` | AWS region to deploy in | | `EC2_INSTANCE_TYPE` | `t3.large` | Instance type (2 vCPU, 8 GB RAM) | | `EC2_VOLUME_SIZE` | `30` | Root volume size in GB | | `EC2_SECURITY_GROUP_ID` | (created) | Use an existing security group instead of creating one | | `EC2_SUBNET_ID` | (default VPC) | Deploy into a specific subnet | | `ASH_PORT` | `4100` | Port to expose the API on | Example `.env`: ```bash ANTHROPIC_API_KEY=sk-ant-api03-... AWS_ACCESS_KEY_ID=AKIA... AWS_SECRET_ACCESS_KEY=... EC2_KEY_NAME=my-key EC2_KEY_PATH=~/.ssh/my-key.pem AWS_DEFAULT_REGION=us-east-1 EC2_INSTANCE_TYPE=t3.large ``` ## What the Deploy Script Does 1. **Finds the latest Ubuntu 22.04 AMI** in your region. 2. **Creates a security group** (`ash-server-sg`) with ports 22 (SSH) and 4100 (API) open. Skipped if you provide `EC2_SECURITY_GROUP_ID`. 3. **Launches a `t3.large` instance** with a 30 GB gp3 volume and a user-data script that installs Docker, Node.js 20, and pnpm. 4. **Waits for SSH** and the user-data script to complete (~2 minutes). 5. **Syncs the project** to the instance via rsync (excludes `node_modules`, `.git`, `dist`). 6. **Builds the Docker image** on the instance (`docker build -t ash-dev .`). This takes 3--5 minutes on a `t3.large`. 7. **Starts the container** with `--init`, `--privileged`, `--cgroupns=host`, and the volume mount. 8. **Waits for healthy** by polling `GET /health`. 9. **Deploys the qa-bot agent** by copying agent files and calling the API. ## Connecting the QA Bot Example After deployment, the QA Bot agent is ready. To connect the example Next.js UI: ```bash # From your local machine (not the EC2 instance) ASH_SERVER_URL=http://<instance-ip>:4100 pnpm --filter qa-bot dev ``` This starts the QA Bot frontend locally, pointing at your remote Ash server.
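The "waits for healthy" step of the deploy script is easy to reproduce in your own automation. A stdlib-only sketch that polls `GET /health` until the server responds 200 (the function name, timeout, and interval are our choices):

```python
import time
import urllib.request

def wait_for_healthy(url: str, timeout_s: float = 60.0, interval_s: float = 2.0) -> bool:
    """Poll GET {url}/health until it returns 200 or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # server not up yet (connection refused, DNS, timeout); keep polling
        time.sleep(interval_s)
    return False
```

Useful as a gate in CI or a deploy script before pointing traffic at a freshly launched instance.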
## Deploying Your Own Agent SSH into the instance and copy your agent folder to the data directory: ```bash ssh -i ~/.ssh/my-key.pem ubuntu@<instance-ip> # Copy your agent files mkdir -p ~/.ash/agents/my-agent # Place your CLAUDE.md, .claude/ settings, etc. in ~/.ash/agents/my-agent/ # Deploy via API curl -X POST $ASH_SERVER_URL/api/agents \ -H "Content-Type: application/json" \ -d '{"name": "my-agent", "path": "agents/my-agent"}' ``` Alternatively, use the Ash CLI or SDK from your local machine: ```bash export ASH_SERVER_URL=http://<instance-ip>:4100 ash deploy ./my-agent --name my-agent ``` ## Monitoring ### Logs ```bash # From your local machine ssh -i ~/.ssh/my-key.pem ubuntu@<instance-ip> 'docker logs -f ash-server' ``` ### Health Check ```bash curl http://<instance-ip>:4100/health | jq . ``` Returns active session count, sandbox pool stats, and uptime. ### API Docs Open `http://<instance-ip>:4100/docs` in a browser for the Swagger UI. ## Troubleshooting ### "Key file not found" Verify `EC2_KEY_PATH` in your `.env` points to the correct `.pem` file. The script sets permissions to `400` automatically. ### "Instance has no public IP" Your VPC or subnet does not auto-assign public IPs. Either: - Set `EC2_SUBNET_ID` to a public subnet, or - Enable "Auto-assign public IPv4 address" on your subnet in the AWS console. ### "Server did not become healthy within 60 seconds" SSH in and check the Docker logs: ```bash ssh -i ~/.ssh/my-key.pem ubuntu@<instance-ip> docker logs ash-server ``` Common causes: - Missing `ANTHROPIC_API_KEY` -- the server starts but agents cannot execute. - Docker build failed -- check for network issues during `pnpm install`. ### "Setup did not complete within 5 minutes" The user-data script (Docker + Node.js installation) is taking too long. SSH in and check: ```bash ssh -i ~/.ssh/my-key.pem ubuntu@<instance-ip> cat /var/log/cloud-init-output.log ``` ## Tearing Down ```bash ./scripts/teardown-ec2.sh ``` This terminates the EC2 instance, waits for termination to complete, and deletes the security group if the script created it.
The teardown script reads from the `.ec2-instance` state file that was created during deployment. ## Cost Estimate | Resource | Spec | Hourly Cost (us-east-1) | |----------|------|------------------------| | EC2 `t3.large` | 2 vCPU, 8 GB RAM | ~$0.083 | | EBS gp3 | 30 GB | ~$0.003 | | **Total** | | **~$0.086/hour (~$62/month)** | Data transfer costs are additional. Actual costs depend on region and usage patterns. --- # Deploy to Google Cloud Source: https://docs.ash-cloud.ai/self-hosting/gce # Deploy to Google Cloud This guide walks through deploying Ash to a Google Compute Engine (GCE) instance using the included deploy script. The script provisions an Ubuntu VM, installs Docker, builds the Ash image, starts the server, and deploys the example QA Bot agent. ## Prerequisites - **gcloud CLI** installed ([install guide](https://cloud.google.com/sdk/docs/install)) - **GCP project** with Compute Engine API enabled and billing configured - **ANTHROPIC_API_KEY** for agent execution ## Quick Start ```bash # Clone the repo git clone https://github.com/ash-ai-org/ash.git cd ash # Create .env from the example cp .env.example .env # Edit .env with your credentials (see below) # Deploy ./scripts/deploy-gce.sh ``` The script takes 5--8 minutes. When it finishes, it prints the server URL, SSH command, and instructions for connecting the QA Bot UI. ## Configuration Create a `.env` file in the project root with the following variables: ### Required | Variable | Description | |----------|-------------| | `ANTHROPIC_API_KEY` | Your Anthropic API key for agent execution | | `GCP_PROJECT_ID` | Your GCP project ID. Falls back to `gcloud config get-value project` if not set. 
| ### Optional | Variable | Default | Description | |----------|---------|-------------| | `GCP_ZONE` | `us-east1-b` | Compute Engine zone | | `GCP_MACHINE_TYPE` | `e2-standard-2` | Machine type (2 vCPU, 8 GB RAM) | | `GCP_DISK_SIZE` | `30` | Boot disk size in GB (SSD) | | `ASH_PORT` | `4100` | Port to expose the API on | Example `.env`: ```bash ANTHROPIC_API_KEY=sk-ant-api03-... GCP_PROJECT_ID=my-project-123 GCP_ZONE=us-east1-b GCP_MACHINE_TYPE=e2-standard-2 ``` ## GCP Setup from Scratch If you do not have a GCP project configured yet: ```bash # 1. Install gcloud CLI # macOS: brew install --cask google-cloud-sdk # Or download from https://cloud.google.com/sdk/docs/install # 2. Authenticate gcloud auth login # 3. Create a project (or use an existing one) gcloud projects create my-ash-project --name="Ash Server" # 4. Set the project as default gcloud config set project my-ash-project # 5. Enable the Compute Engine API gcloud services enable compute.googleapis.com # 6. Enable billing (required for Compute Engine) # Go to https://console.cloud.google.com/billing and link a billing account # to your project ``` ## What the Deploy Script Does 1. **Ensures a firewall rule** (`allow-ash-api`) exists for port 4100. Creates one if it does not exist. 2. **Creates a Compute Engine instance** (`ash-server`) with Ubuntu 22.04, SSD boot disk, and the `ash-server` network tag. 3. **Runs a startup script** that installs Docker, Node.js 20, pnpm, rsync, and jq. 4. **Waits for SSH** and the startup script to complete (~2 minutes). 5. **Syncs the project** to the instance by creating a tarball and using `gcloud compute scp`. 6. **Builds the Docker image** on the instance (`docker build -t ash-dev .`). This takes 3--5 minutes. 7. **Starts the container** with `--init`, `--privileged`, `--cgroupns=host`, and the volume mount. 8. **Waits for healthy** by polling `GET /health`. 9. **Deploys the qa-bot agent** by copying agent files and calling the API. 
## Using the SDK After deployment, connect from your application: ```typescript const client = new AshClient({ serverUrl: "http://<instance-ip>:4100", apiKey: "<your-api-key>", }); // Create a session const session = await client.createSession({ agentName: "qa-bot" }); // Send a message (SSE streaming) const stream = client.sendMessage(session.id, { message: "What is the capital of France?", }); for await (const event of stream) { if (event.type === "assistant") { process.stdout.write(event.content); } } ``` ## Monitoring ### Logs ```bash gcloud compute ssh ash-server --zone=us-east1-b \ --command='docker logs -f ash-server' ``` ### Health Check ```bash curl http://<instance-ip>:4100/health | jq . ``` ### API Docs Open `http://<instance-ip>:4100/docs` in a browser for the Swagger UI. ## Troubleshooting ### "gcloud: command not found" Install the gcloud CLI: ```bash # macOS brew install --cask google-cloud-sdk # Linux curl https://sdk.cloud.google.com | bash exec -l $SHELL gcloud init ``` ### "Your current active account does not have permission" Re-authenticate: ```bash gcloud auth login gcloud config set project YOUR_PROJECT_ID ``` ### "Compute Engine API has not been enabled" ```bash gcloud services enable compute.googleapis.com ``` This can take a minute to propagate. Wait and retry. ### "Instance has no external IP" The default network configuration includes an external IP. If you are using a custom VPC without external IPs, you need to either add an access config or use Cloud NAT + Identity-Aware Proxy for SSH. ### "Firewall rule blocks traffic" Verify the rule exists and the instance has the correct network tag: ```bash gcloud compute firewall-rules describe allow-ash-api gcloud compute instances describe ash-server --zone=us-east1-b \ --format='get(tags.items)' ``` ## Tearing Down ```bash ./scripts/teardown-gce.sh ``` This deletes the Compute Engine instance and the `allow-ash-api` firewall rule. The teardown script reads from the `.gce-instance` state file created during deployment.
## Cost Estimate | Resource | Spec | Hourly Cost (us-east1) | |----------|------|------------------------| | GCE `e2-standard-2` | 2 vCPU, 8 GB RAM | ~$0.067 | | Boot disk (pd-ssd) | 30 GB | ~$0.005 | | **Total** | | **~$0.072/hour (~$52/month)** | Data transfer costs are additional. Actual costs depend on region and usage patterns. ## EC2 vs GCE Comparison | | AWS EC2 | Google Cloud GCE | |---|---------|-----------------| | **Deploy command** | `./scripts/deploy-ec2.sh` | `./scripts/deploy-gce.sh` | | **Default instance** | `t3.large` (2 vCPU, 8 GB) | `e2-standard-2` (2 vCPU, 8 GB) | | **Default region** | `us-east-1` | `us-east1-b` | | **SSH access** | `ssh -i key.pem ubuntu@IP` | `gcloud compute ssh ash-server` | | **Auth method** | AWS access key + secret | `gcloud auth login` | | **Required credentials** | `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `EC2_KEY_NAME`, `EC2_KEY_PATH` | `GCP_PROJECT_ID` (+ gcloud auth) | | **Estimated cost** | ~$0.086/hour | ~$0.072/hour | | **Teardown** | `./scripts/teardown-ec2.sh` | `./scripts/teardown-gce.sh` | Both scripts produce identical results: a running Ash server with the QA Bot agent deployed. Choose whichever cloud you already have an account with. --- # Configuration Reference Source: https://docs.ash-cloud.ai/self-hosting/configuration # Configuration Reference All Ash configuration is done via environment variables. There are no config files. This page documents every variable the server and runner recognize. ## Server Variables | Variable | Default | Description | |----------|---------|-------------| | `ASH_PORT` | `4100` | Port the HTTP server listens on. | | `ASH_HOST` | `0.0.0.0` | Bind address. Use `127.0.0.1` to restrict to localhost. | | `ASH_DATA_DIR` | `./data` (native) or `/data` (Docker) | Root directory for all persistent state: SQLite database, agent definitions, session workspaces, sandbox working directories. | | `ASH_MODE` | `standalone` | Server mode. `standalone` runs sandboxes locally. 
`coordinator` runs as a pure control plane with no local sandbox pool -- runners must register to provide capacity. See [Multi-Machine Setup](./multi-machine.md). | | `ASH_DATABASE_URL` | (none) | PostgreSQL or CockroachDB connection string. When set, the server uses Postgres instead of SQLite. Format: `postgresql://user:password@host:port/dbname`. | | `ASH_MAX_SANDBOXES` | `1000` | Maximum number of sandbox entries (live + cold) tracked in the database. When this limit is reached, the pool evicts the least-recently-used sandbox. | | `ASH_IDLE_TIMEOUT_MS` | `1800000` (30 min) | How long a sandbox can sit idle (in the `waiting` state) before the idle sweep evicts it. Evicted sandboxes have their workspace persisted and are marked `cold`. | | `ASH_API_KEY` | (auto-generated) | Single-tenant API key for authentication. All API requests (except `/health` and `/docs`) must include `Authorization: Bearer <key>`. If not set, the server auto-generates a key on first start and writes it to `{ASH_DATA_DIR}/initial-api-key`. | | `ASH_SNAPSHOT_URL` | (none) | Cloud storage URL for session workspace snapshots. Enables cross-machine session resume. Format: `s3://bucket/prefix/` or `gs://bucket/prefix/`. Requires the appropriate SDK installed (`@aws-sdk/client-s3` for S3, `@google-cloud/storage` for GCS). | | `ASH_BRIDGE_ENTRY` | (auto-detected) | Absolute path to the bridge process entry point (`packages/bridge/dist/index.js`). Normally auto-detected from the monorepo layout. Override only for custom installations. | | `ASH_DEBUG_TIMING` | `0` | Set to `1` to enable timing instrumentation on the hot path. Logs latency for sandbox creation, bridge connect, message round-trip, and SSE delivery. | | `ANTHROPIC_API_KEY` | (none) | **Required.** Passed into sandbox processes via the environment allowlist. The Claude Agent SDK uses this to authenticate with the Anthropic API. | ## Telemetry Variables These variables enable optional telemetry.
Both systems are zero-overhead when not configured. See [Streaming Telemetry](./telemetry.md) for setup guides and examples. | Variable | Default | Description | |----------|---------|-------------| | `OTEL_EXPORTER_OTLP_ENDPOINT` | (none) | gRPC endpoint for OpenTelemetry trace export (e.g. `http://jaeger:4317`). Tracing is completely disabled when not set. | | `OTEL_SERVICE_NAME` | `ash-coordinator` | Service name in OpenTelemetry traces. Bridge processes default to `ash-bridge`. | | `OTEL_TRACES_SAMPLER` | (none) | Optional OTEL sampling strategy (e.g. `parentbased_traceidratio`). | | `ASH_TELEMETRY_URL` | (none) | HTTP endpoint for streaming event telemetry (session lifecycle, messages, tool calls). When not set and `ASH_CLOUD_URL` is present, auto-configured to send to Ash Cloud. | | `ASH_TELEMETRY_KEY` | (none) | Optional bearer token for authenticating with the telemetry endpoint. When auto-configured for Ash Cloud, defaults to `ASH_API_KEY`. | | `ASH_CLOUD_URL` | (none) | Ash Cloud URL (set automatically by `ash login` + `ash start`). When present and `ASH_TELEMETRY_URL` is not set, the server auto-configures event telemetry to send to `/api/telemetry/ingest`. | ## Runner Variables These variables configure runner processes in [multi-machine mode](./multi-machine.md). | Variable | Default | Description | |----------|---------|-------------| | `ASH_RUNNER_ID` | `runner-` | Unique identifier for this runner. Must be stable across restarts for re-registration to work correctly. | | `ASH_RUNNER_PORT` | `4200` | Port the runner's HTTP server listens on. | | `ASH_RUNNER_HOST` | `0.0.0.0` | Bind address for the runner. | | `ASH_SERVER_URL` | (none) | URL of the coordinator server (e.g., `http://coordinator:4100`). When set, the runner registers itself with the coordinator and begins sending heartbeats. | | `ASH_RUNNER_ADVERTISE_HOST` | (same as `ASH_RUNNER_HOST`) | The hostname or IP the coordinator should use to reach this runner. 
Useful when the runner binds to `0.0.0.0` but needs to advertise a specific IP or hostname to the coordinator. | Runner processes also read `ASH_DATA_DIR`, `ASH_MAX_SANDBOXES`, `ASH_IDLE_TIMEOUT_MS`, `ASH_BRIDGE_ENTRY`, and `ANTHROPIC_API_KEY` with the same semantics as the server. ## Database ### SQLite (Default) SQLite is the default database. It requires zero configuration. The database file is created at `{ASH_DATA_DIR}/ash.db` on first startup. SQLite is configured with: - **WAL mode** for concurrent reads during writes. - **Foreign keys enabled** for referential integrity. - **Automatic migrations** on startup -- schema changes are applied idempotently. SQLite is the right choice for single-machine deployments. It handles hundreds of concurrent sessions without issue. ### PostgreSQL / CockroachDB Set `ASH_DATABASE_URL` to use Postgres or CockroachDB: ```bash # PostgreSQL export ASH_DATABASE_URL="postgresql://ash:password@localhost:5432/ash" # CockroachDB export ASH_DATABASE_URL="postgresql://ash:password@localhost:26257/ash?sslmode=disable" ``` The server auto-detects the database type from the connection string prefix (`postgresql://` or `postgres://`). **Connection retry behavior:** On startup, the server attempts to connect to the database with exponential backoff. It retries up to 5 times with delays of 1s, 2s, 4s, 8s, and 16s (total ~31 seconds). If all attempts fail, the server exits with an error. This is designed for Docker Compose deployments where the database container may start slightly after the server. **Schema migrations** are applied automatically on startup, just like SQLite. Tables and indexes are created with `IF NOT EXISTS` semantics. Use Postgres or CockroachDB when: - You need the database to be on a separate machine from the server. - You are running in coordinator mode with multiple runners and want a shared database. - You want to use your existing database infrastructure for backups, monitoring, and replication.
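The startup retry policy described above is ordinary exponential backoff. A sketch of the same schedule, useful if your own tooling needs to wait for the database alongside the server (the `connect` callable and helper name are ours):

```python
import time

def connect_with_backoff(connect, retries: int = 5, base_delay_s: float = 1.0, sleep=time.sleep):
    """Try once, then retry up to `retries` times with exponentially growing delays
    (1s, 2s, 4s, 8s, 16s by default). Re-raises the last error when exhausted."""
    for attempt in range(retries + 1):
        try:
            return connect()
        except OSError:
            if attempt == retries:
                raise  # all attempts failed; Ash exits with an error at this point
            sleep(base_delay_s * (2 ** attempt))
```

The injectable `sleep` makes the schedule testable without real waiting.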
## Authentication

Ash supports the following authentication modes:

### Auto-Generated Key (Default)

When `ASH_API_KEY` is not set, the server auto-generates a secure API key on first start. The key is stored (hashed) in the database and the plaintext is written to `{ASH_DATA_DIR}/initial-api-key`. The CLI automatically picks up this key via `ash start`. This key appears in the dashboard's **API Keys** page and can be revoked there.

### Explicit API Key

Set `ASH_API_KEY` to use a specific key instead of auto-generating:

```bash
export ASH_API_KEY=my-secret-key
```

All API requests must then include:

```
Authorization: Bearer my-secret-key
```

The environment variable key is shown in the dashboard's **API Keys** page with an `env` badge. To change or remove it, update the environment variable and restart the server.

### Dashboard-Created Keys

You can also create additional API keys directly from the dashboard's **API Keys** page. These keys are stored in the database and can be revoked from the dashboard. Both environment-variable and dashboard-created keys work simultaneously.

For multi-tenant authentication, create API keys via the internal API. Each key is associated with a `tenant_id`, and requests authenticated with that key are scoped to that tenant's agents, sessions, and data.
### Public Endpoints

The following endpoints do not require an API key:

- `GET /health`
- `GET /metrics` (Prometheus metrics)
- `/docs` (Swagger UI)
- `/api/internal/*` (runner registration and heartbeats; protected by the `ASH_INTERNAL_SECRET` bearer token when that variable is set)

## Environment Variable Summary

Here is every variable in one table for quick reference:

| Variable | Default | Component |
|----------|---------|-----------|
| `ANTHROPIC_API_KEY` | -- | Server, Runner |
| `ASH_PORT` | `4100` | Server |
| `ASH_HOST` | `0.0.0.0` | Server |
| `ASH_DATA_DIR` | `./data` | Server, Runner |
| `ASH_MODE` | `standalone` | Server |
| `ASH_DATABASE_URL` | (SQLite) | Server |
| `ASH_MAX_SANDBOXES` | `1000` | Server, Runner |
| `ASH_IDLE_TIMEOUT_MS` | `1800000` | Server, Runner |
| `ASH_API_KEY` | (auto-generated) | Server |
| `ASH_INTERNAL_SECRET` | (none) | Server, Runner |
| `ASH_SNAPSHOT_URL` | (none) | Server, Runner |
| `ASH_BRIDGE_ENTRY` | (auto) | Server, Runner |
| `ASH_DEBUG_TIMING` | `0` | Server, Runner |
| `ASH_RUNNER_ID` | `runner-` | Runner |
| `ASH_RUNNER_PORT` | `4200` | Runner |
| `ASH_RUNNER_HOST` | `0.0.0.0` | Runner |
| `ASH_SERVER_URL` | (none) | Runner |
| `ASH_RUNNER_ADVERTISE_HOST` | (bind host) | Runner |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | (none) | Server, Runner |
| `OTEL_SERVICE_NAME` | `ash-coordinator` | Server, Runner |
| `ASH_TELEMETRY_URL` | (none) | Server |
| `ASH_TELEMETRY_KEY` | (none) | Server |
| `ASH_CLOUD_URL` | (none) | Server |

---

# Multi-Machine Setup

Source: https://docs.ash-cloud.ai/self-hosting/multi-machine

# Multi-Machine Setup

Most deployments do not need multi-machine mode. A single machine running in standalone mode can handle dozens of concurrent sessions. Read this page only when you need more capacity than one machine provides.

## When to Use

Use multi-machine mode when:

- You need more concurrent sessions than a single server can handle.
- You want to isolate sandbox execution from the control plane for reliability.
- You need to scale sandbox capacity independently of the API server.
## Architecture ```mermaid graph TB Client[Client / SDK] -->|HTTP + SSE| Coordinator subgraph Control Plane Coordinator[Ash Server
mode=coordinator
port 4100] DB[(Database
Postgres / CockroachDB)] Coordinator --> DB end subgraph Runner Node 1 R1[Ash Runner
port 4200] R1 --> S1[Sandbox Pool] S1 --> B1[Bridge 1] S1 --> B2[Bridge 2] end subgraph Runner Node 2 R2[Ash Runner
port 4200] R2 --> S2[Sandbox Pool] S2 --> B3[Bridge 3] S2 --> B4[Bridge 4] end Coordinator -->|HTTP| R1 Coordinator -->|HTTP| R2 R1 -->|Heartbeat| Coordinator R2 -->|Heartbeat| Coordinator ``` - **Coordinator** (the Ash server in `coordinator` mode): handles all client-facing HTTP traffic, manages the database, routes session creation to runners, and proxies messages to the correct runner. - **Runners**: each runner manages a local `SandboxPool`, creates sandbox processes, and communicates with the bridge inside each sandbox over Unix sockets. Runners do not serve external traffic directly. ## Standalone Mode (Default) In standalone mode (`ASH_MODE=standalone`), the server creates a local `SandboxPool` and handles everything in one process. This is the default and the right choice for single-machine deployments. ``` Client --> Ash Server (standalone) --> SandboxPool --> Bridge --> Claude SDK ``` No runners are needed. The server is both control plane and execution plane. ## Coordinator Mode In coordinator mode (`ASH_MODE=coordinator`), the server does not create a local sandbox pool. Instead, it waits for runners to register and provides capacity through them. ### Starting the Coordinator ```bash export ASH_MODE=coordinator export ASH_DATABASE_URL="postgresql://ash:password@db-host:5432/ash" export ASH_API_KEY=my-api-key export ASH_INTERNAL_SECRET=my-runner-secret # Required: authenticates runner registration export ANTHROPIC_API_KEY=sk-ant-... node packages/server/dist/index.js # Or via Docker: # ash start -e ASH_MODE=coordinator -e ASH_DATABASE_URL=... -e ASH_INTERNAL_SECRET=... ``` The coordinator logs: ``` Ash server listening on 0.0.0.0:4100 (mode: coordinator) ``` At this point, the server accepts API requests but cannot create sessions until at least one runner registers. 
### Starting a Runner On each runner machine: ```bash export ASH_RUNNER_ID=runner-1 export ASH_SERVER_URL=http://coordinator-host:4100 export ASH_RUNNER_PORT=4200 export ASH_RUNNER_ADVERTISE_HOST=10.0.1.5 # IP the coordinator can reach export ASH_MAX_SANDBOXES=50 export ASH_INTERNAL_SECRET=my-runner-secret # Must match coordinator export ANTHROPIC_API_KEY=sk-ant-... node packages/runner/dist/index.js ``` The runner: 1. Creates a `SandboxPool` with a lightweight in-memory database for sandbox tracking. 2. Starts a Fastify HTTP server on port 4200. 3. Sends a registration request to `ASH_SERVER_URL/api/internal/runners/register` (with exponential backoff retry on failure). 4. Begins sending heartbeats every 10 seconds to `ASH_SERVER_URL/api/internal/runners/heartbeat`, including pool stats (running, warming, waiting counts). The coordinator logs: ``` [coordinator] Runner runner-1 registered at 10.0.1.5:4200 (max 50) ``` ## Session Routing When a client creates a session, the coordinator selects a runner using **least-loaded routing**: it picks the runner with the most available capacity (max sandboxes minus running and warming sandboxes). If no remote runners are healthy, the coordinator falls back to the local backend (if running in standalone mode). In pure coordinator mode with no healthy runners, session creation fails with an error. Once a session is assigned to a runner, all subsequent messages for that session are routed to the same runner. The `runner_id` is stored in the session record in the database. ## Failure Handling ### Graceful Shutdown When a runner shuts down cleanly (receives `SIGTERM`), it calls `POST /api/internal/runners/deregister`. The coordinator immediately bulk-pauses all active sessions on that runner in a single query and removes it from the registry. No 30-second wait. ### Runner Crashes If a runner crashes without deregistering, the coordinator detects it via missed heartbeats. 
After 30 seconds without a heartbeat (`RUNNER_LIVENESS_TIMEOUT_MS`), the coordinator: 1. Bulk-pauses all active/starting sessions on that runner (single `UPDATE` query). 2. Removes the runner from the routing table. 3. Purges stale entries from its local backend cache. Each coordinator adds random 0-5s jitter to its sweep interval to prevent thundering herd when multiple coordinators sweep independently. Paused sessions can be resumed later. If `ASH_SNAPSHOT_URL` is configured, the runner persists workspace state to cloud storage before eviction, enabling resume on a different runner. ### Runner Comes Back If a runner restarts with the same `ASH_RUNNER_ID`, it re-registers with the coordinator. The coordinator updates the connection info (host, port) and resumes routing new sessions to it. Existing sessions that were paused when the runner died are **not** automatically resumed. The client must explicitly resume them via `POST /api/sessions/:id/resume`. ### No Runners Available If all runners are dead or at capacity, session creation returns an HTTP 503 error: ```json { "error": "No runners available and no local backend configured" } ``` ## Example: Two Runners with Docker Compose ```yaml version: "3.8" services: db: image: postgres:16 environment: POSTGRES_USER: ash POSTGRES_PASSWORD: ash POSTGRES_DB: ash volumes: - pgdata:/var/lib/postgresql/data healthcheck: test: ["CMD-SHELL", "pg_isready -U ash"] interval: 5s timeout: 5s retries: 5 coordinator: image: ghcr.io/ash-ai/ash:0.1.0 init: true ports: - "4100:4100" environment: - ASH_MODE=coordinator - ASH_DATABASE_URL=postgresql://ash:ash@db:5432/ash - ASH_API_KEY=${ASH_API_KEY} - ASH_INTERNAL_SECRET=${ASH_INTERNAL_SECRET} - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY} depends_on: db: condition: service_healthy runner-1: image: ghcr.io/ash-ai/ash:0.1.0 init: true privileged: true command: ["node", "packages/runner/dist/index.js"] volumes: - runner1-data:/data environment: - ASH_RUNNER_ID=runner-1 - 
ASH_SERVER_URL=http://coordinator:4100 - ASH_RUNNER_ADVERTISE_HOST=runner-1 - ASH_MAX_SANDBOXES=50 - ASH_INTERNAL_SECRET=${ASH_INTERNAL_SECRET} - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY} depends_on: - coordinator runner-2: image: ghcr.io/ash-ai/ash:0.1.0 init: true privileged: true command: ["node", "packages/runner/dist/index.js"] volumes: - runner2-data:/data environment: - ASH_RUNNER_ID=runner-2 - ASH_SERVER_URL=http://coordinator:4100 - ASH_RUNNER_ADVERTISE_HOST=runner-2 - ASH_MAX_SANDBOXES=50 - ASH_INTERNAL_SECRET=${ASH_INTERNAL_SECRET} - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY} depends_on: - coordinator volumes: pgdata: runner1-data: runner2-data: ``` ```bash export ANTHROPIC_API_KEY=sk-ant-... export ASH_API_KEY=my-production-key export ASH_INTERNAL_SECRET=my-runner-secret docker compose up -d ``` ## Multi-Coordinator (High Availability) For production deployments that need control plane redundancy or handle more than ~10,000 concurrent SSE connections, you can run multiple coordinators behind a load balancer. Multi-coordinator requires a shared database. All coordinators must point to the same Postgres or CockroachDB instance via `ASH_DATABASE_URL`. SQLite does not support multi-coordinator. ### Architecture ```mermaid graph TB Client["Client / SDK"] -->|HTTPS| LB["Load Balancer
(nginx, ALB, etc.)"] subgraph Coordinators C1["Coordinator 1
:4100"] C2["Coordinator 2
:4100"] end LB --> C1 LB --> C2 C1 & C2 --> DB[("CockroachDB")] C1 & C2 -->|HTTP| R1["Runner 1"] C1 & C2 -->|HTTP| R2["Runner 2"] ``` Coordinators are stateless — all routing decisions come from the database. Any coordinator can route to any runner. ### Starting Multiple Coordinators ```bash # Coordinator 1 ASH_MODE=coordinator \ ASH_DATABASE_URL="postgresql://ash:password@db-host:5432/ash" \ ASH_API_KEY=my-api-key \ ASH_INTERNAL_SECRET=my-runner-secret \ ANTHROPIC_API_KEY=sk-ant-... \ ASH_PORT=4100 \ node packages/server/dist/index.js # Coordinator 2 (same config, different host) ASH_MODE=coordinator \ ASH_DATABASE_URL="postgresql://ash:password@db-host:5432/ash" \ ASH_API_KEY=my-api-key \ ASH_INTERNAL_SECRET=my-runner-secret \ ANTHROPIC_API_KEY=sk-ant-... \ ASH_PORT=4100 \ node packages/server/dist/index.js ``` Each coordinator logs its unique ID (`hostname-PID`) on startup for debugging: ``` Ash server listening on 0.0.0.0:4100 (mode: coordinator, id: ip-10-0-1-5-12345) ``` ### Load Balancer Configuration ```nginx upstream ash_coordinators { server coordinator-1:4100; server coordinator-2:4100; } server { listen 443 ssl; location / { proxy_pass http://ash_coordinators; proxy_http_version 1.1; proxy_set_header Connection ''; # Required for SSE proxy_buffering off; # Required for SSE proxy_read_timeout 86400s; # Long-lived SSE streams } } ``` - **No sticky sessions needed.** Any coordinator can handle any request. - **Health check:** `GET /health` on each coordinator. - **SSE failover:** If a coordinator dies mid-stream, the client's SSE auto-reconnects through the load balancer to a different coordinator. The new coordinator looks up the session in the shared database and re-establishes the proxy to the runner. No session migration needed. 
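Because all routing state lives in the shared database, any coordinator can perform runner selection. The least-loaded rule combined with the 30-second liveness window can be sketched like this (field and function names are assumptions for illustration; the real implementation is a SQL query against the `runners` table):

```typescript
// Illustrative runner record; field names are assumptions for this sketch.
interface RunnerRow {
  id: string;
  maxSandboxes: number;
  activeCount: number;
  warmingCount: number;
  lastHeartbeatAt: number; // epoch ms of the last heartbeat
}

const LIVENESS_TIMEOUT_MS = 30_000;

// Pick the live runner with the most available capacity
// (max sandboxes minus running and warming), or null if none qualifies.
function selectRunner(runners: RunnerRow[], now: number): RunnerRow | null {
  let best: RunnerRow | null = null;
  let bestCapacity = 0;
  for (const r of runners) {
    if (now - r.lastHeartbeatAt > LIVENESS_TIMEOUT_MS) continue; // stale heartbeat
    const capacity = r.maxSandboxes - r.activeCount - r.warmingCount;
    if (capacity > bestCapacity) {
      best = r;
      bestCapacity = capacity;
    }
  }
  return best; // null when no runner is live or all are at capacity
}
```

Returning `null` corresponds to the HTTP 503 ("No runners available") case described earlier.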
### Runners with Multi-Coordinator

Runners register with the load balancer URL (not a specific coordinator):

```bash
ASH_SERVER_URL=http://load-balancer:4100 \
ASH_RUNNER_ID=runner-1 \
ASH_INTERNAL_SECRET=my-runner-secret \
node packages/runner/dist/index.js
```

Heartbeats go through the load balancer. Any coordinator that receives a heartbeat writes it to the shared database, where all other coordinators can see it.

### How It Works

The runner registry lives in the `runners` table in the shared database. All coordinators read and write to this table:

- **Registration:** Runner sends `POST /api/internal/runners/register` → coordinator upserts into `runners` table (with retry and exponential backoff)
- **Heartbeat:** Runner sends `POST /api/internal/runners/heartbeat` → coordinator updates `active_count`, `warming_count`, `last_heartbeat_at`
- **Deregistration:** Runner sends `POST /api/internal/runners/deregister` on graceful shutdown → coordinator bulk-pauses sessions and deletes the runner in one pass
- **Selection:** On session creation, coordinator queries `SELECT ... FROM runners ORDER BY available_capacity DESC LIMIT 1`
- **Liveness sweep:** All coordinators run the sweep independently (every 30s + random 0-5s jitter). Operations are idempotent — multiple coordinators marking the same dead runner's sessions as paused is harmless.
- **Auth:** When `ASH_INTERNAL_SECRET` is set, all `/api/internal/*` endpoints require `Authorization: Bearer <ASH_INTERNAL_SECRET>`.

For more details on the scaling architecture, see [Scaling Architecture](/architecture/scaling).

## Limitations

- **Cross-runner resume requires cloud persistence.** Without `ASH_SNAPSHOT_URL`, a session paused on runner-1 cannot be resumed on runner-2 because the workspace state is local to runner-1's filesystem. Configure S3 or GCS snapshots for cross-runner resume.
- **No automatic session migration.** If a runner is overloaded, existing sessions are not moved to a less-loaded runner.
Only new sessions benefit from load-based routing.
- **Runner state is in-memory.** Each runner uses an in-memory database for sandbox tracking (not SQLite). If a runner crashes, its sandbox tracking is lost. On restart, it re-registers with fresh state. The coordinator detects the gap via missed heartbeats and pauses affected sessions.

---

# API Overview

Source: https://docs.ash-cloud.ai/api/overview

# API Overview

The Ash REST API is the primary interface for deploying agents, managing sessions, and sending messages. All endpoints are served by the Ash server process.

## Base URL

```
http://localhost:4100
```

The port is configurable via the `ASH_PORT` environment variable (default: `4100`). The host is configurable via `ASH_HOST` (default: `0.0.0.0`).

## Authentication

API requests are authenticated using Bearer tokens in the `Authorization` header:

```
Authorization: Bearer <your-api-key>
```

Authentication behavior depends on server configuration:

| Configuration | Behavior |
|---|---|
| `ASH_API_KEY` set | Single-tenant mode. The Bearer token must match `ASH_API_KEY`. |
| `ASH_API_KEY` not set (auto-generated) | The server auto-generates a key on first start. The CLI picks it up automatically. |
| API keys in database | Multi-tenant mode. Bearer token is hashed and looked up in the `api_keys` table. Each key maps to a tenant. |

Public endpoints (`/health`, `/docs/*`, `/metrics`) do not require authentication.
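In client code, every authenticated call just needs the Bearer header. A minimal sketch (the helper is illustrative, not part of the SDK):

```typescript
// Sketch: build headers for an authenticated Ash API request.
function ashHeaders(apiKey: string): Record<string, string> {
  return {
    "Content-Type": "application/json",
    Authorization: `Bearer ${apiKey}`,
  };
}

// Example usage (assumes a server at the default base URL):
// const res = await fetch("http://localhost:4100/api/agents", {
//   headers: ashHeaders(process.env.ASH_API_KEY ?? ""),
// });
```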
## Content Types

| Direction | Content-Type |
|---|---|
| Request bodies | `application/json` |
| Most responses | `application/json` |
| Message streaming | `text/event-stream` (SSE) |
| Prometheus metrics | `text/plain; version=0.0.4; charset=utf-8` |

## Error Format

All error responses use a consistent JSON structure:

```json
{
  "error": "Human-readable error message",
  "statusCode": 400
}
```

## Common Status Codes

| Code | Meaning |
|---|---|
| `200` | Success |
| `201` | Resource created |
| `400` | Bad request (missing required fields, invalid state transition) |
| `401` | Unauthorized (missing or invalid API key) |
| `404` | Resource not found |
| `410` | Gone (session has ended and cannot be resumed) |
| `500` | Internal server error |
| `503` | Service unavailable (sandbox capacity reached, no runners available) |

## Interactive API Docs

The server ships with built-in Swagger UI and an OpenAPI specification.

| Resource | URL |
|---|---|
| Swagger UI | [http://localhost:4100/docs](http://localhost:4100/docs) |
| OpenAPI spec (JSON) | [http://localhost:4100/docs/json](http://localhost:4100/docs/json) |

The Swagger UI provides interactive request builders for every endpoint, making it useful for exploration and debugging.

## TypeScript Types

If you are using the TypeScript SDK, all request and response types are available as imports. The shared type definitions used by both client and server come from the `@ash-ai/shared` package:

```typescript
import type {
  Agent,
  Session,
  SessionStatus,
  PoolStats,
  HealthResponse,
  ApiError,
  FileEntry,
  ListFilesResponse,
  GetFileResponse,
  AshStreamEvent,
} from '@ash-ai/shared';
```

---

# Agents

Source: https://docs.ash-cloud.ai/api/agents

# Agents

Agents are the deployable units in Ash. An agent is a directory on disk that contains a `CLAUDE.md` file and optional configuration. Deploying an agent registers it with the server so sessions can be created against it.
Deploying the same agent name again performs an upsert: the path is updated and the version is incremented. ## Agent Type ```typescript interface Agent { id: string; // UUID name: string; // Unique agent name tenantId: string; // Tenant that owns this agent version: number; // Auto-incremented on each deploy path: string; // Absolute path to agent directory on server createdAt: string; // ISO 8601 timestamp updatedAt: string; // ISO 8601 timestamp } ``` --- ## Deploy Agent ``` POST /api/agents ``` Registers or updates an agent. The agent directory must contain a `CLAUDE.md` file. If an agent with the same name already exists for this tenant, it is updated (upserted) and its version is incremented. Relative paths are resolved against the server's data directory. ### Request ```json { "name": "qa-bot", "path": "/home/user/agents/qa-bot" } ``` | Field | Type | Required | Description | |---|---|---|---| | `name` | string | Yes | Unique name for the agent | | `path` | string | Yes | Path to the agent directory (must contain `CLAUDE.md`) | ### Response **201 Created** ```json { "agent": { "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890", "name": "qa-bot", "tenantId": "default", "version": 1, "path": "/home/user/agents/qa-bot", "createdAt": "2025-06-15T10:30:00.000Z", "updatedAt": "2025-06-15T10:30:00.000Z" } } ``` ### Errors | Status | Condition | |---|---| | `400` | Missing `name` or `path`, or directory does not contain `CLAUDE.md` | ```json { "error": "Agent directory must contain CLAUDE.md", "statusCode": 400 } ``` --- ## List Agents ``` GET /api/agents ``` Returns all agents belonging to the authenticated tenant. ### Request No request body. No query parameters. 
### Response **200 OK** ```json { "agents": [ { "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890", "name": "qa-bot", "tenantId": "default", "version": 2, "path": "/home/user/agents/qa-bot", "createdAt": "2025-06-15T10:30:00.000Z", "updatedAt": "2025-06-16T14:00:00.000Z" }, { "id": "b2c3d4e5-f6a7-8901-bcde-f12345678901", "name": "code-reviewer", "tenantId": "default", "version": 1, "path": "/home/user/agents/code-reviewer", "createdAt": "2025-06-16T09:00:00.000Z", "updatedAt": "2025-06-16T09:00:00.000Z" } ] } ``` --- ## Get Agent ``` GET /api/agents/:name ``` Returns a single agent by name. ### Path Parameters | Parameter | Type | Description | |---|---|---| | `name` | string | Agent name | ### Response **200 OK** ```json { "agent": { "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890", "name": "qa-bot", "tenantId": "default", "version": 2, "path": "/home/user/agents/qa-bot", "createdAt": "2025-06-15T10:30:00.000Z", "updatedAt": "2025-06-16T14:00:00.000Z" } } ``` ### Errors | Status | Condition | |---|---| | `404` | Agent with the given name does not exist for this tenant | ```json { "error": "Agent not found", "statusCode": 404 } ``` --- ## Delete Agent ``` DELETE /api/agents/:name ``` Removes an agent registration. This does not terminate any active sessions using this agent. ### Path Parameters | Parameter | Type | Description | |---|---|---| | `name` | string | Agent name | ### Response **200 OK** ```json { "ok": true } ``` ### Errors | Status | Condition | |---|---| | `404` | Agent with the given name does not exist for this tenant | ```json { "error": "Agent not found", "statusCode": 404 } ``` --- # Sessions Source: https://docs.ash-cloud.ai/api/sessions # Sessions A session represents an ongoing conversation with a deployed agent. Each session runs inside an isolated sandbox with its own filesystem, process tree, and environment. Sessions have a lifecycle: they are created, become active, can be paused and resumed, and eventually end. 
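The lifecycle just described can be sketched as a transition map (a sketch inferred from the pause, resume, and delete semantics documented on this page; not the server's actual implementation):

```typescript
type SessionStatus = "starting" | "active" | "paused" | "ended" | "error";

// Allowed transitions, inferred from the endpoint docs:
// pause: active -> paused; resume: starting/paused/error -> active;
// delete: any non-ended status -> ended. "ended" is terminal.
const TRANSITIONS: Record<SessionStatus, SessionStatus[]> = {
  starting: ["active", "error", "ended"],
  active: ["paused", "error", "ended"],
  paused: ["active", "ended"],
  error: ["active", "ended"],
  ended: [],
};

function canTransition(from: SessionStatus, to: SessionStatus): boolean {
  return TRANSITIONS[from].includes(to);
}
```

Attempting a disallowed transition through the API returns `400` (invalid state) or `410` (session ended), as documented under each endpoint.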
## Session Type ```typescript interface Session { id: string; // UUID tenantId: string; // Tenant that owns this session agentName: string; // Name of the agent this session runs sandboxId: string; // ID of the sandbox process status: SessionStatus; // Current lifecycle state model: string | null; // Model override for this session (null = use agent default) runnerId: string | null; // Runner hosting the sandbox (null in standalone mode) createdAt: string; // ISO 8601 timestamp lastActiveAt: string; // ISO 8601 timestamp, updated on each message } type SessionStatus = 'starting' | 'active' | 'paused' | 'ended' | 'error'; ``` ### Session Status Transitions ``` starting --> active --> paused --> active (resume) \ \--> ended (delete) \--> ended (delete) \--> error --> active (resume) \--> ended (delete) ``` --- ## Create Session ``` POST /api/sessions ``` Creates a new session for the specified agent. The server allocates a sandbox, copies the agent directory into it, and starts the bridge process. The session is returned in `active` status once the sandbox is ready. ### Request ```json { "agent": "qa-bot", "model": "claude-opus-4-6" } ``` | Field | Type | Required | Description | |---|---|---|---| | `agent` | string | Yes | Name of a previously deployed agent | | `model` | string | No | Model to use for this session. Overrides the agent's default model. Any valid model identifier accepted (e.g. `claude-sonnet-4-5-20250929`, `claude-opus-4-6`). | | `mcpServers` | object | No | Per-session MCP servers. Merged into the agent's `.mcp.json` — session entries override agent entries with the same key. See [Per-Session MCP Servers](#per-session-mcp-servers). | | `systemPrompt` | string | No | System prompt override. Replaces the agent's `CLAUDE.md` for this session only. | | `credentialId` | string | No | Credential ID to inject into sandbox env. | | `extraEnv` | object | No | Extra env vars to inject into the sandbox (merged with credential env). 
| ### Response **201 Created** ```json { "session": { "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "tenantId": "default", "agentName": "qa-bot", "sandboxId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "status": "active", "model": "claude-opus-4-6", "runnerId": null, "createdAt": "2025-06-15T10:30:00.000Z", "lastActiveAt": "2025-06-15T10:30:00.000Z" } } ``` ### Errors | Status | Condition | |---|---| | `400` | Missing `agent` field | | `404` | Agent not found | | `500` | Sandbox creation failed | | `503` | Sandbox capacity reached or no runners available | --- ## List Sessions ``` GET /api/sessions ``` Returns all sessions for the authenticated tenant. Optionally filter by agent name. ### Query Parameters | Parameter | Type | Required | Description | |---|---|---|---| | `agent` | string | No | Filter sessions by agent name | ### Response **200 OK** ```json { "sessions": [ { "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "tenantId": "default", "agentName": "qa-bot", "sandboxId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "status": "active", "model": "claude-opus-4-6", "runnerId": null, "createdAt": "2025-06-15T10:30:00.000Z", "lastActiveAt": "2025-06-15T10:35:00.000Z" }, { "id": "c9bf9e57-1685-4c89-bafb-ff5af830be8a", "tenantId": "default", "agentName": "code-reviewer", "sandboxId": "c9bf9e57-1685-4c89-bafb-ff5af830be8a", "status": "paused", "model": null, "runnerId": null, "createdAt": "2025-06-15T09:00:00.000Z", "lastActiveAt": "2025-06-15T09:15:00.000Z" } ] } ``` ### Example: Filter by Agent ``` GET /api/sessions?agent=qa-bot ``` --- ## Get Session ``` GET /api/sessions/:id ``` Returns a single session by ID. 
### Path Parameters | Parameter | Type | Description | |---|---|---| | `id` | string (UUID) | Session ID | ### Response **200 OK** ```json { "session": { "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "tenantId": "default", "agentName": "qa-bot", "sandboxId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "status": "active", "model": "claude-opus-4-6", "runnerId": null, "createdAt": "2025-06-15T10:30:00.000Z", "lastActiveAt": "2025-06-15T10:35:00.000Z" } } ``` ### Errors | Status | Condition | |---|---| | `404` | Session not found | --- ## Pause Session ``` POST /api/sessions/:id/pause ``` Pauses an active session. The sandbox state is persisted so the session can be resumed later. Only sessions with status `active` can be paused. ### Path Parameters | Parameter | Type | Description | |---|---|---| | `id` | string (UUID) | Session ID | ### Request No request body. ### Response **200 OK** ```json { "session": { "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "tenantId": "default", "agentName": "qa-bot", "sandboxId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "status": "paused", "model": "claude-opus-4-6", "runnerId": null, "createdAt": "2025-06-15T10:30:00.000Z", "lastActiveAt": "2025-06-15T10:35:00.000Z" } } ``` ### Errors | Status | Condition | |---|---| | `400` | Session is not in `active` status | | `404` | Session not found | ```json { "error": "Cannot pause session with status \"paused\"", "statusCode": 400 } ``` --- ## Resume Session ``` POST /api/sessions/:id/resume ``` Resumes a paused, errored, or starting session. The server attempts two resume paths: 1. **Warm resume** -- If the original sandbox is still alive on the same runner, the session is reactivated immediately with no overhead. 2. **Cold resume** -- If the sandbox has been evicted or the runner is gone, a new sandbox is created. Workspace state is restored from a local snapshot or cloud storage if available. Sessions with status `active` are returned as-is (no-op). 
Sessions with status `ended` cannot be resumed. ### Path Parameters | Parameter | Type | Description | |---|---|---| | `id` | string (UUID) | Session ID | ### Request No request body. ### Response **200 OK** ```json { "session": { "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "tenantId": "default", "agentName": "qa-bot", "sandboxId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "status": "active", "model": "claude-opus-4-6", "runnerId": null, "createdAt": "2025-06-15T10:30:00.000Z", "lastActiveAt": "2025-06-15T10:35:00.000Z" } } ``` ### Errors | Status | Condition | |---|---| | `404` | Session or agent not found | | `410` | Session has ended -- create a new session instead | | `500` | Failed to create a new sandbox for cold resume | | `503` | Sandbox capacity reached or no runners available | ```json { "error": "Session has ended \u2014 create a new session", "statusCode": 410 } ``` --- ## End Session ``` DELETE /api/sessions/:id ``` Ends a session. The sandbox state is persisted and the sandbox process is destroyed. Ended sessions cannot be resumed. ### Path Parameters | Parameter | Type | Description | |---|---|---| | `id` | string (UUID) | Session ID | ### Request No request body. ### Response **200 OK** ```json { "session": { "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "tenantId": "default", "agentName": "qa-bot", "sandboxId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "status": "ended", "model": "claude-opus-4-6", "runnerId": null, "createdAt": "2025-06-15T10:30:00.000Z", "lastActiveAt": "2025-06-15T10:35:00.000Z" } } ``` ### Errors | Status | Condition | |---|---| | `404` | Session not found | --- ## Per-Session MCP Servers The `mcpServers` field on `POST /api/sessions` lets you inject MCP servers at session creation time. This enables the **sidecar pattern**: your host application exposes tools as MCP endpoints, and each session connects to a tenant-scoped URL. Session-level servers are merged into the agent's `.mcp.json`. 
If both the agent and the session define a server with the same key, the session entry wins. ### Example: Sidecar Pattern Your host application runs an MCP server that provides tenant-specific tools: ```json { "agent": "support-bot", "mcpServers": { "customer-tools": { "url": "http://host-app:8000/mcp?tenant=t_abc123" } } } ``` The agent's `.mcp.json` might already define shared MCP servers (e.g. `fetch`). The session adds `customer-tools` on top of those. ### McpServerConfig Each MCP server entry supports: | Field | Type | Description | |---|---|---| | `url` | string | Remote MCP server URL (HTTP/SSE transport). Mutually exclusive with `command`. | | `command` | string | Command to spawn a stdio MCP server. Mutually exclusive with `url`. | | `args` | string[] | Arguments for the command. | | `env` | object | Environment variables for the MCP server process. | --- ## Per-Session System Prompt The `systemPrompt` field on `POST /api/sessions` replaces the agent's `CLAUDE.md` for that session. This is useful when the same agent definition needs different instructions per tenant or per use case. ```json { "agent": "support-bot", "systemPrompt": "You are a support agent for Acme Corp tenant t_abc123. Use the customer-tools MCP server to look up their account." } ``` The agent's original `CLAUDE.md` is not modified — only the sandbox workspace copy is overwritten before the bridge starts. --- # Messages Source: https://docs.ash-cloud.ai/api/messages # Messages Messages are how you interact with an agent inside a session. You send a text prompt and receive a stream of Server-Sent Events (SSE) containing the agent's response, including tool use, intermediate results, and the final answer. --- ## Send Message ``` POST /api/sessions/:id/messages ``` Sends a message to the agent running in the specified session. The response is an SSE stream. The session must be in `active` status. 
### Path Parameters | Parameter | Type | Description | |---|---|---| | `id` | string (UUID) | Session ID | ### Request ```json { "content": "What files are in the current directory?", "includePartialMessages": false, "model": "claude-opus-4-6" } ``` | Field | Type | Required | Description | |---|---|---|---| | `content` | string | Yes | The message text to send to the agent | | `includePartialMessages` | boolean | No | When `true`, the stream includes incremental `stream_event` messages with raw API deltas in addition to complete messages. Useful for building real-time streaming UIs. Default: `false`. | | `model` | string | No | Model override for this specific message. Takes precedence over the session-level and agent-level model. Any valid model identifier accepted. | ### Response The response uses `Content-Type: text/event-stream`. The HTTP status is `200` and the body is a stream of SSE frames. #### SSE Event Types The stream contains three event types: `message`, `error`, and `done`. **`message` event** -- An SDK Message object from the Claude Code agent. The shape varies depending on the message type (assistant response, tool use, tool result, etc.). These are passed through from the SDK without transformation. ``` event: message data: {"type":"assistant","message":{"id":"msg_01X...","type":"message","role":"assistant","content":[{"type":"text","text":"The current directory contains the following files:\n\n- src/\n- package.json\n- README.md"}],"model":"claude-sonnet-4-20250514","stop_reason":"end_turn"}} ``` **`error` event** -- An error occurred during message processing. ``` event: error data: {"error":"Bridge connection lost"} ``` **`done` event** -- The agent has finished processing the message. This is always the last event in the stream. 
``` event: done data: {"sessionId":"f47ac10b-58cc-4372-a567-0e02b2c3d479"} ``` ### Pre-Stream Errors If validation fails before the stream starts, the server returns a standard JSON error response (not SSE): | Status | Condition | |---|---| | `400` | Session is not in `active` status | | `404` | Session not found | | `500` | Runner not available or sandbox not found | ```json { "error": "Session is paused", "statusCode": 400 } ``` ### Connection Lifecycle 1. Client sends `POST /api/sessions/:id/messages` with JSON body. 2. Server validates the session and sandbox, then responds with `200` and `Content-Type: text/event-stream`. 3. Server streams `message` events as the agent works (tool calls, text responses, etc.). 4. If an error occurs mid-stream, the server sends an `error` event. 5. The stream ends with a `done` event, then the connection closes. ### Backpressure The server applies backpressure on the SSE stream. If the client stops reading and the kernel TCP send buffer fills up, the server waits up to 30 seconds for the buffer to drain. If the client remains unresponsive after 30 seconds, the server closes the connection. ### Consuming the Stream #### curl ```bash curl -N -X POST $ASH_SERVER_URL/api/sessions/SESSION_ID/messages \ -H "Content-Type: application/json" \ -H "Authorization: Bearer YOUR_API_KEY" \ -d '{"content": "Hello, what can you do?"}' ``` #### JavaScript (EventSource-like) ```javascript const response = await fetch( `http://localhost:4100/api/sessions/${sessionId}/messages`, { method: 'POST', headers: { 'Content-Type': 'application/json', 'Authorization': 'Bearer YOUR_API_KEY', }, body: JSON.stringify({ content: 'Hello, what can you do?' 
}), } ); const reader = response.body.getReader(); const decoder = new TextDecoder(); let buffer = ''; let eventType = ''; while (true) { const { done, value } = await reader.read(); if (done) break; buffer += decoder.decode(value, { stream: true }); const lines = buffer.split('\n'); buffer = lines.pop() || ''; for (const line of lines) { if (line.startsWith('event: ')) { eventType = line.slice(7); } else if (line.startsWith('data: ')) { const data = JSON.parse(line.slice(6)); if (eventType === 'message') { console.log('Message:', data); } else if (eventType === 'error') { console.error('Error:', data.error); } else if (eventType === 'done') { console.log('Done:', data.sessionId); } } } } ``` Note that `eventType` is declared outside the read loop so an `event:` line and its `data:` line stay paired even when a chunk boundary falls between them. --- ## List Messages ``` GET /api/sessions/:id/messages ``` Returns persisted messages for a session. Messages are stored after each completed turn. User messages and complete assistant/result messages are persisted; partial streaming events are not. ### Path Parameters | Parameter | Type | Description | |---|---|---| | `id` | string (UUID) | Session ID | ### Query Parameters | Parameter | Type | Default | Description | |---|---|---|---| | `limit` | integer | 100 | Maximum number of messages to return (1--1000) | | `after` | integer | 0 | Return messages with sequence number greater than this value | ### Response **200 OK** ```json { "messages": [ { "id": "d290f1ee-6c54-4b01-90e6-d701748f0851", "sessionId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "tenantId": "default", "role": "user", "content": "{\"type\":\"user\",\"content\":\"What files are in the current directory?\"}", "sequence": 1, "createdAt": "2025-06-15T10:31:00.000Z" }, { "id": "e391f2ff-7d65-5c12-a1f7-e812859f1962", "sessionId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "tenantId": "default", "role": "assistant", "content": "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"text\",\"text\":\"Here are the files...\"}]}}", "sequence": 2, "createdAt": "2025-06-15T10:31:05.000Z" } ] } ``` The `content` field
is a JSON-encoded string containing the raw SDK message. Parse it to access the full message structure. ### Errors | Status | Condition | |---|---| | `404` | Session not found | --- # Files Source: https://docs.ash-cloud.ai/api/files # Files The Files API provides read access to files in a session's workspace. Each session has an isolated workspace directory where the agent operates. You can list all files and download individual files. Files are resolved from the live sandbox when the session is active. If the sandbox has been evicted (session paused or ended), the server falls back to the most recent persisted snapshot. The `source` field (or `X-Ash-Source` header) in each response indicates which one was used. --- ## List Files ``` GET /api/sessions/:id/files ``` Returns a list of all files in the session's workspace, recursively. Certain directories and file types are excluded automatically: `node_modules`, `.git`, `__pycache__`, `.cache`, `.npm`, `.pnpm-store`, `.yarn`, `.venv`, `venv`, `.tmp`, `tmp`, and files with `.sock`, `.lock`, or `.pid` extensions. 
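These exclusion rules can be mirrored client-side, for example to apply the same filter when syncing a workspace copy locally. A minimal sketch in Python; the `is_excluded` helper is illustrative and not part of the Ash API:

```python
# Sketch of the documented workspace-listing filter. The directory and
# extension sets mirror the exclusions listed above; the helper itself
# is illustrative and not part of Ash.
EXCLUDED_DIRS = {
    "node_modules", ".git", "__pycache__", ".cache", ".npm",
    ".pnpm-store", ".yarn", ".venv", "venv", ".tmp", "tmp",
}
EXCLUDED_EXTENSIONS = (".sock", ".lock", ".pid")

def is_excluded(path: str) -> bool:
    """Return True if a workspace-relative path would be omitted from listings."""
    parts = path.split("/")
    # Exclude anything located under one of the skipped directories...
    if any(part in EXCLUDED_DIRS for part in parts[:-1]):
        return True
    # ...and any file with a skipped extension.
    return parts[-1].endswith(EXCLUDED_EXTENSIONS)

print(is_excluded("node_modules/express/index.js"))  # True
print(is_excluded("src/index.ts"))                   # False
print(is_excluded("server.pid"))                     # True
```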
### Path Parameters | Parameter | Type | Description | |---|---|---| | `id` | string (UUID) | Session ID | ### Response **200 OK** ```json { "files": [ { "path": "CLAUDE.md", "size": 1234, "modifiedAt": "2025-06-15T10:30:00.000Z" }, { "path": "src/index.ts", "size": 567, "modifiedAt": "2025-06-15T10:32:00.000Z" }, { "path": "package.json", "size": 890, "modifiedAt": "2025-06-15T10:30:00.000Z" } ], "source": "sandbox" } ``` | Field | Type | Description | |---|---|---| | `files` | FileEntry[] | Array of file entries | | `files[].path` | string | Path relative to workspace root | | `files[].size` | integer | File size in bytes | | `files[].modifiedAt` | string | ISO 8601 last-modified timestamp | | `source` | string | `"sandbox"` if read from the live sandbox, `"snapshot"` if read from a persisted snapshot | ### Errors | Status | Condition | |---|---| | `404` | Session not found, or no workspace is available for the session | ```json { "error": "No workspace available for this session", "statusCode": 404 } ``` --- ## Download File (Raw) ``` GET /api/sessions/:id/files/*path ``` Downloads a single file from the session's workspace as raw bytes. The file path is specified as the wildcard portion of the URL. By default, the response streams the raw file content with appropriate `Content-Type` based on the file extension. Files up to 100 MB are supported. ### Path Parameters | Parameter | Type | Description | |---|---|---| | `id` | string (UUID) | Session ID | | `*` | string | File path relative to workspace root | ### Query Parameters | Parameter | Type | Default | Description | |---|---|---|---| | `format` | string | `raw` | Response format. `raw` streams the file bytes directly. `json` returns a JSON-wrapped response (see below). 
| ### Example Request ``` GET /api/sessions/f47ac10b-58cc-4372-a567-0e02b2c3d479/files/src/index.ts ``` ### Response (Raw — Default) **200 OK** The raw file bytes are returned with these headers: | Header | Example | Description | |---|---|---| | `Content-Type` | `text/typescript` | MIME type based on file extension (fallback: `application/octet-stream`) | | `Content-Disposition` | `attachment; filename*=UTF-8''index.ts` | Suggests a filename for download | | `Content-Length` | `67` | File size in bytes | | `X-Ash-Source` | `sandbox` | `sandbox` if from live sandbox, `snapshot` if from persisted snapshot | ```bash # Download raw file content curl -O $ASH_SERVER_URL/api/sessions/SESSION_ID/files/output/report.pdf \ -H "Authorization: Bearer YOUR_API_KEY" ``` ### Response (JSON — `?format=json`) **200 OK** ``` GET /api/sessions/:id/files/src/index.ts?format=json ``` ```json { "path": "src/index.ts", "content": "import express from 'express';\n\nconst app = express();\napp.listen(3000);\n", "size": 67, "source": "sandbox" } ``` | Field | Type | Description | |---|---|---| | `path` | string | The requested file path | | `content` | string | Full file content as UTF-8 text | | `size` | integer | File size in bytes | | `source` | string | `"sandbox"` if read from the live sandbox, `"snapshot"` if from a persisted snapshot | JSON mode has a 1 MB file size limit. For larger files, use the default raw mode. ### Errors | Status | Condition | |---|---| | `400` | Missing file path, path contains `..` traversal, path starts with `/`, path is a directory, or file exceeds size limit (1 MB for JSON mode, 100 MB for raw mode) | | `404` | Session not found, no workspace available, or file does not exist | ```json { "error": "File not found", "statusCode": 404 } ``` --- ## Use Cases **Downloading binary artifacts.** After an agent generates images, PDFs, or compiled binaries, download them directly using the raw endpoint. 
```bash # Download a generated PDF curl -o report.pdf $ASH_SERVER_URL/api/sessions/SESSION_ID/files/output/report.pdf \ -H "Authorization: Bearer YOUR_API_KEY" ``` **Inspecting agent-written code.** After an agent writes code, use `?format=json` to get the content inline. ```bash # Read a text file as JSON curl "$ASH_SERVER_URL/api/sessions/SESSION_ID/files/src/index.ts?format=json" \ -H "Authorization: Bearer YOUR_API_KEY" ``` **Building UIs.** The Files API provides the data needed to build file-browser components that show the agent's workspace in real time. **Reviewing changes after a session ends.** Even after a session is paused or ended, files remain accessible from the persisted snapshot, so you can review what the agent produced. --- # Health and Metrics Source: https://docs.ash-cloud.ai/api/health # Health and Metrics Ash exposes health and metrics endpoints for monitoring, alerting, and integration with orchestration systems. Neither endpoint requires authentication. --- ## Health Check ``` GET /health ``` Returns the server's current status, active session and sandbox counts, uptime, and detailed sandbox pool statistics. ### Request No request body. No authentication required. 
### Response **200 OK** ```json { "status": "ok", "activeSessions": 3, "activeSandboxes": 5, "uptime": 86400, "pool": { "total": 5, "cold": 0, "warming": 1, "warm": 1, "waiting": 2, "running": 1, "maxCapacity": 1000, "resumeWarmHits": 42, "resumeColdHits": 7 } } ``` | Field | Type | Description | |---|---|---| | `status` | string | Always `"ok"` if the server is reachable | | `activeSessions` | integer | Number of sessions in `active` status | | `activeSandboxes` | integer | Number of live sandbox processes | | `uptime` | integer | Seconds since server start | | `pool` | PoolStats | Sandbox pool breakdown | ### Pool Stats The `pool` object provides a detailed view of sandbox states: | Field | Type | Description | |---|---|---| | `total` | integer | Total sandboxes in the pool | | `cold` | integer | Sandboxes not yet started | | `warming` | integer | Sandboxes currently starting up | | `warm` | integer | Sandboxes ready but not assigned to a session | | `waiting` | integer | Sandboxes assigned to a session, idle between messages | | `running` | integer | Sandboxes actively processing a message | | `maxCapacity` | integer | Maximum number of sandboxes allowed (configured via `ASH_MAX_SANDBOXES`) | | `resumeWarmHits` | integer | Total warm resumes (sandbox was still alive) | | `resumeColdHits` | integer | Total cold resumes (new sandbox created, state restored) | --- ## Prometheus Metrics ``` GET /metrics ``` Returns metrics in Prometheus text exposition format. No authentication required. ### Request No request body. ### Response **200 OK** with `Content-Type: text/plain; version=0.0.4; charset=utf-8` ``` # HELP ash_up Whether the Ash server is up (always 1 if reachable). # TYPE ash_up gauge ash_up 1 # HELP ash_uptime_seconds Seconds since server start. # TYPE ash_uptime_seconds gauge ash_uptime_seconds 86400 # HELP ash_active_sessions Number of active sessions. 
# TYPE ash_active_sessions gauge ash_active_sessions 3 # HELP ash_active_sandboxes Number of live sandbox processes. # TYPE ash_active_sandboxes gauge ash_active_sandboxes 5 # HELP ash_pool_sandboxes Sandbox count by state. # TYPE ash_pool_sandboxes gauge ash_pool_sandboxes{state="cold"} 0 ash_pool_sandboxes{state="warming"} 1 ash_pool_sandboxes{state="warm"} 1 ash_pool_sandboxes{state="waiting"} 2 ash_pool_sandboxes{state="running"} 1 # HELP ash_pool_max_capacity Maximum sandbox capacity. # TYPE ash_pool_max_capacity gauge ash_pool_max_capacity 1000 # HELP ash_resume_total Total session resumes by path (warm=sandbox alive, cold=new sandbox). # TYPE ash_resume_total counter ash_resume_total{path="warm"} 42 ash_resume_total{path="cold"} 7 ``` ### Metric Reference | Metric | Type | Labels | Description | |---|---|---|---| | `ash_up` | gauge | -- | Always `1` if the server is reachable | | `ash_uptime_seconds` | gauge | -- | Seconds since server process started | | `ash_active_sessions` | gauge | -- | Number of sessions in `active` status | | `ash_active_sandboxes` | gauge | -- | Number of live sandbox processes | | `ash_pool_sandboxes` | gauge | `state` | Sandbox count broken down by state: `cold`, `warming`, `warm`, `waiting`, `running` | | `ash_pool_max_capacity` | gauge | -- | Configured maximum sandbox capacity | | `ash_resume_total` | counter | `path` | Cumulative session resume count by path: `warm` (sandbox still alive) or `cold` (new sandbox created) | --- ## Prometheus Configuration Add the following scrape config to your `prometheus.yml`: ```yaml scrape_configs: - job_name: 'ash' scrape_interval: 15s static_configs: - targets: ['localhost:4100'] metrics_path: '/metrics' ``` --- ## Kubernetes Probes The `/health` endpoint is suitable for both liveness and readiness probes: ```yaml apiVersion: apps/v1 kind: Deployment metadata: name: ash-server spec: template: spec: containers: - name: ash livenessProbe: httpGet: path: /health port: 4100 initialDelaySeconds: 
5 periodSeconds: 10 readinessProbe: httpGet: path: /health port: 4100 initialDelaySeconds: 5 periodSeconds: 5 ``` The liveness probe verifies the server process is responsive. The readiness probe can be used to gate traffic until the server has completed initialization. Both return `200` with `{"status": "ok", ...}` when the server is healthy. --- # TypeScript SDK Source: https://docs.ash-cloud.ai/sdks/typescript # TypeScript SDK The `@ash-ai/sdk` package provides a typed TypeScript client for the Ash REST API. ## Installation ```bash npm install @ash-ai/sdk ``` ```bash pip install ash-ai-sdk ``` ## Client Setup ```typescript import { AshClient } from '@ash-ai/sdk'; const client = new AshClient({ serverUrl: 'http://localhost:4100', apiKey: 'your-api-key', }); ``` The `serverUrl` is the base URL of your Ash server. Trailing slashes are stripped automatically. The server always requires authentication. If you used `ash start`, the CLI saves the auto-generated key to `~/.ash/config.json`. For SDK usage, pass the key explicitly. ```python from ash_ai import AshClient client = AshClient( server_url="http://localhost:4100", api_key="your-api-key", ) ``` The `server_url` is the base URL of your Ash server. The `api_key` is required — the server always requires authentication.
## Methods Reference ### Agents ```typescript // Deploy an agent from a directory path on the server const agent = await client.deployAgent('my-agent', '/path/to/agent'); // List all deployed agents const agents = await client.listAgents(); // Get a specific agent by name const agent = await client.getAgent('my-agent'); // Delete an agent (also deletes its sessions) await client.deleteAgent('my-agent'); ``` ```python # Deploy an agent from a directory path on the server agent = client.deploy_agent(name="my-agent", path="/path/to/agent") # List all deployed agents agents = client.list_agents() # Get a specific agent by name agent = client.get_agent("my-agent") # Delete an agent (also deletes its sessions) client.delete_agent("my-agent") ``` ### Sessions ```typescript // Create a new session for an agent const session = await client.createSession('my-agent'); // Create with per-session MCP servers (sidecar pattern) const session = await client.createSession('my-agent', { mcpServers: { 'tenant-tools': { url: 'http://host-app:8000/mcp?tenant=t_abc123' }, }, }); // Create with a system prompt override const session = await client.createSession('my-agent', { systemPrompt: 'You are a support agent for Acme Corp.', }); // List all sessions (optionally filter by agent name) const sessions = await client.listSessions(); const agentSessions = await client.listSessions('my-agent'); // Get a session by ID const session = await client.getSession(sessionId); // Pause a session (persists workspace state) const paused = await client.pauseSession(sessionId); // Resume a paused or errored session const resumed = await client.resumeSession(sessionId); // End a session permanently const ended = await client.endSession(sessionId); ``` ```python # Create a new session for an agent session = client.create_session("my-agent") # List all sessions (optionally filter by agent name) sessions = client.list_sessions() agent_sessions = client.list_sessions(agent="my-agent") # Get a session by ID 
session = client.get_session(session_id) # Pause a session (persists workspace state) paused = client.pause_session(session_id) # Resume a paused or errored session resumed = client.resume_session(session_id) # End a session permanently ended = client.end_session(session_id) ``` ### Messages #### Streaming Messages (Recommended) `sendMessageStream(sessionId, content, opts?)` returns an async generator that yields parsed `AshStreamEvent` objects: ```typescript for await (const event of client.sendMessageStream(sessionId, 'Analyze this code')) { if (event.type === 'message') { console.log('SDK message:', event.data); } else if (event.type === 'error') { console.error('Error:', event.data.error); } else if (event.type === 'done') { console.log('Turn complete for session:', event.data.sessionId); } } ``` `send_message_stream(session_id, content, **kwargs)` returns an iterator of parsed events: ```python for event in client.send_message_stream(session_id, "Analyze this code"): if event.type == "message": print("SDK message:", event.data) elif event.type == "error": print(f"Error: {event.data['error']}") elif event.type == "done": print(f"Turn complete for session: {event.data['sessionId']}") ``` #### Raw Response (TypeScript only) `sendMessage(sessionId, content, opts?)` returns a raw `Response` object with an SSE stream body. Use this when you need full control over the stream. ```typescript const response = await client.sendMessage(sessionId, 'Hello, agent'); // response.body is a ReadableStream containing SSE frames ``` #### Options Both methods accept options for partial message streaming: ```typescript interface SendMessageOptions { /** Enable partial message streaming. Yields incremental StreamEvent messages * with raw API deltas in addition to complete messages. */ includePartialMessages?: boolean; } ``` When `includePartialMessages` is `true`, the stream includes `stream_event` messages with `content_block_delta` events. 
Use `extractStreamDelta()` to pull text chunks from these events for real-time streaming UIs. ```python # Enable partial message streaming with the include_partial_messages kwarg for event in client.send_message_stream( session_id, "Write a haiku.", include_partial_messages=True, ): if event.type == "message": data = event.data if data.get("type") == "stream_event": evt = data.get("event", {}) if evt.get("type") == "content_block_delta": delta = evt.get("delta", {}) if delta.get("type") == "text_delta": print(delta.get("text", ""), end="", flush=True) ``` ### Messages History ```typescript // List persisted messages for a session const messages = await client.listMessages(sessionId); // With pagination const messages = await client.listMessages(sessionId, { limit: 50, afterSequence: 10, }); ``` ```python # List persisted messages for a session messages = client.list_messages(session_id) # With pagination messages = client.list_messages(session_id, limit=50, after_sequence=10) ``` ### Session Events (Timeline) ```typescript // List timeline events for a session const events = await client.listSessionEvents(sessionId); // Filter by type and paginate const textEvents = await client.listSessionEvents(sessionId, { type: 'text', limit: 100, afterSequence: 0, }); ``` ```python # List timeline events for a session events = client.list_session_events(session_id) # Filter by type and paginate text_events = client.list_session_events(session_id, type="text", limit=100, after_sequence=0) ``` Event types: `text`, `tool_start`, `tool_result`, `reasoning`, `error`, `turn_complete`, `lifecycle`. 
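A fetched timeline can be summarized client-side by tallying these types, e.g. to count how many tool calls a session ran. A sketch, assuming each returned event carries a `type` field (as the `type` filter parameter suggests) and using hypothetical sample data:

```python
from collections import Counter

def summarize_timeline(events: list[dict]) -> Counter:
    """Tally session timeline events by type."""
    return Counter(e.get("type", "unknown") for e in events)

# Hypothetical sample timeline for illustration:
events = [
    {"type": "lifecycle"},
    {"type": "text"},
    {"type": "tool_start"},
    {"type": "tool_result"},
    {"type": "text"},
    {"type": "turn_complete"},
]
summary = summarize_timeline(events)
print(summary["text"], summary["tool_start"])  # 2 1
```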
### Files ```typescript // List files in a session's workspace const { files, source } = await client.getSessionFiles(sessionId); // source is 'sandbox' (live) or 'snapshot' (persisted) // Read a specific file const file = await client.getSessionFile(sessionId, 'src/index.ts'); // file contains path, content, size, and source ``` ```python import httpx headers = {"Authorization": "Bearer YOUR_API_KEY"} # List files in a session's workspace resp = httpx.get(f"http://localhost:4100/api/sessions/{session_id}/files", headers=headers) data = resp.json() # data["source"] is "sandbox" (live) or "snapshot" (persisted) # Read a specific file (format=json returns the content inline) resp = httpx.get(f"http://localhost:4100/api/sessions/{session_id}/files/src/index.ts?format=json", headers=headers) file_data = resp.json() ``` ### Health ```typescript const health = await client.health(); // { // status: 'ok', // activeSessions: 3, // activeSandboxes: 2, // uptime: 3600, // pool: { total: 5, cold: 2, warming: 0, warm: 1, waiting: 1, running: 1, maxCapacity: 1000, ... } // } ``` ```python health = client.health() # { # "status": "ok", # "activeSessions": 3, # "activeSandboxes": 2, # "uptime": 3600, # "pool": { "total": 5, "cold": 2, "warming": 0, "warm": 1, ...
} # } ``` ## Full Streaming Example ```typescript import { AshClient, extractStreamDelta, extractTextFromEvent, extractDisplayItems } from '@ash-ai/sdk'; const client = new AshClient({ serverUrl: 'http://localhost:4100', apiKey: process.env.ASH_API_KEY, }); // Deploy and create session const agent = await client.deployAgent('helper', '/path/to/agent'); const session = await client.createSession('helper'); // Stream with partial messages for real-time output for await (const event of client.sendMessageStream(session.id, 'Write a haiku', { includePartialMessages: true, })) { if (event.type === 'message') { // Extract incremental text deltas for real-time display const delta = extractStreamDelta(event.data); if (delta) { process.stdout.write(delta); continue; } // Extract complete text from finished assistant messages const text = extractTextFromEvent(event.data); if (text) { console.log('\nComplete:', text); } // Extract structured display items (text, tool use, tool results) const items = extractDisplayItems(event.data); if (items) { for (const item of items) { if (item.type === 'tool_use') { console.log(`Tool: ${item.toolName} (${item.toolInput})`); } } } } else if (event.type === 'error') { console.error('Error:', event.data.error); } else if (event.type === 'done') { console.log('Done.'); } } // Clean up await client.endSession(session.id); ``` ```python import os from ash_ai import AshClient client = AshClient( server_url="http://localhost:4100", api_key=os.environ.get("ASH_API_KEY"), ) # Deploy and create session agent = client.deploy_agent(name="helper", path="/path/to/agent") session = client.create_session("helper") # Stream with partial messages for real-time output for event in client.send_message_stream(session.id, "Write a haiku", include_partial_messages=True, ): if event.type == "message": data = event.data # Extract incremental text deltas for real-time display if data.get("type") == "stream_event": evt = data.get("event", {}) if evt.get("type") == "content_block_delta": delta = evt.get("delta", {}) if delta.get("type") == "text_delta": print(delta.get("text",
""), end="", flush=True) continue # Extract complete text from finished assistant messages if data.get("type") == "assistant": for block in data.get("message", {}).get("content", []): if block.get("type") == "text": print(f"\nComplete: {block['text']}") elif block.get("type") == "tool_use": print(f"Tool: {block['name']} ({block.get('input', '')})") elif event.type == "error": print(f"Error: {event.data.get('error')}") elif event.type == "done": print("Done.") # Clean up client.end_session(session.id) ``` ## Helper Functions The SDK re-exports these helpers from `@ash-ai/shared`: | Function | Description | |----------|-------------| | `extractDisplayItems(data)` | Extract structured display items (text, tool use, tool result) from an SDK message. Returns `DisplayItem[]` or `null`. | | `extractTextFromEvent(data)` | Extract plain text content from an assistant message. Returns `string` or `null`. | | `extractStreamDelta(data)` | Extract incremental text delta from a `stream_event` / `content_block_delta`. Only yields values when `includePartialMessages` is enabled. Returns `string` or `null`. | | `parseSSEStream(stream)` | Parse a `ReadableStream` into an async generator of `AshStreamEvent`. Works in both Node.js and browser. | ## Re-exported Types The SDK re-exports these types from `@ash-ai/shared`: ```typescript // Core entities Agent, Session, SessionStatus // Request/Response CreateSessionRequest, SendMessageRequest, DeployAgentRequest ListAgentsResponse, ListSessionsResponse, HealthResponse, ApiError // SSE streaming AshSSEEventType, AshMessageEvent, AshErrorEvent, AshDoneEvent, AshStreamEvent // Display helpers DisplayItem, DisplayItemType // Files FileEntry, ListFilesResponse, GetFileResponse // MCP McpServerConfig ``` ## Error Handling All methods throw on non-2xx responses. The error message is extracted from the API response body. 
```typescript try { const session = await client.createSession('nonexistent-agent'); } catch (err) { // err.message === 'Agent "nonexistent-agent" not found' console.error(err.message); } ``` For streaming, errors can arrive both as thrown exceptions (connection failures) and as `error` events within the stream (agent-level errors): ```typescript try { for await (const event of client.sendMessageStream(sessionId, 'hello')) { if (event.type === 'error') { // Agent-level error (e.g., sandbox crash, SDK error) console.error('Stream error:', event.data.error); } } } catch (err) { // Connection-level error (e.g., network failure, 404) console.error('Connection error:', err.message); } ``` All methods raise on non-2xx responses: ```python from ash_ai import AshApiError try: session = client.create_session(agent="nonexistent") except AshApiError as e: print(f"API error ({e.status_code}): {e.message}") except Exception as e: print(f"Connection error: {e}") ``` For streaming, errors can arrive both as exceptions (connection failures) and as `error` events within the stream (agent-level errors): ```python try: for event in client.send_message_stream(session_id, "hello"): if event.type == "error": # Agent-level error (e.g., sandbox crash, SDK error) print(f"Stream error: {event.data.get('error')}") except Exception as e: # Connection-level error (e.g., network failure, 404) print(f"Connection error: {e}") ``` --- # Python SDK Source: https://docs.ash-cloud.ai/sdks/python # Python SDK The `ash-ai-sdk` Python package provides a client for the Ash REST API. It is auto-generated from the OpenAPI specification. 
## Installation ```bash pip install ash-ai-sdk ``` ```bash npm install @ash-ai/sdk ``` ## Client Setup ```python from ash_ai import AshClient client = AshClient( server_url="http://localhost:4100", api_key="your-api-key", ) ``` ```typescript const client = new AshClient({ serverUrl: 'http://localhost:4100', apiKey: 'your-api-key', }); ``` ## Usage Examples ### Deploy an Agent ```python agent = client.deploy_agent(name="my-agent", path="/path/to/agent") print(f"Deployed: {agent.name} v{agent.version}") ``` ```typescript const agent = await client.deployAgent('my-agent', '/path/to/agent'); console.log(`Deployed: ${agent.name} v${agent.version}`); ``` ### Create a Session ```python session = client.create_session(agent="my-agent") print(f"Session ID: {session.id}") print(f"Status: {session.status}") ``` ```typescript const session = await client.createSession('my-agent'); console.log(`Session ID: ${session.id}`); console.log(`Status: ${session.status}`); ``` ### Send a Message (Streaming) ```python for event in client.send_message_stream(session.id, "Analyze this code"): if event.type == "message": data = event.data if data.get("type") == "assistant" and data.get("message", {}).get("content"): for block in data["message"]["content"]: if block.get("type") == "text": print(block["text"]) elif event.type == "error": print(f"Error: {event.data['error']}") elif event.type == "done": print("Turn complete.") ``` ```typescript for await (const event of client.sendMessageStream(session.id, 'Analyze this code')) { if (event.type === 'message') { const text = extractTextFromEvent(event.data); if (text) console.log(text); } else if (event.type === 'error') { console.error('Error:', event.data.error); } else if (event.type === 'done') { console.log('Turn complete.'); } } ``` ### Pause and Resume ```python # Pause the session (persists workspace state) paused = client.pause_session(session.id) print(f"Status: {paused.status}") # 'paused' # Resume later (fast path if sandbox is 
still alive) resumed = client.resume_session(session.id) print(f"Status: {resumed.status}") # 'active' ``` ```typescript // Pause the session (persists workspace state) const paused = await client.pauseSession(session.id); console.log(`Status: ${paused.status}`); // 'paused' // Resume later (fast path if sandbox is still alive) const resumed = await client.resumeSession(session.id); console.log(`Status: ${resumed.status}`); // 'active' ``` ### End a Session ```python ended = client.end_session(session.id) print(f"Status: {ended.status}") # 'ended' ``` ```typescript const ended = await client.endSession(session.id); console.log(`Status: ${ended.status}`); // 'ended' ``` ### Multi-Turn Conversation ```python session = client.create_session(agent="my-agent") questions = [ "What files are in the workspace?", "Read the main config file.", "Summarize what this project does.", ] for question in questions: print(f"\n> {question}") for event in client.send_message_stream(session.id, question): if event.type == "message": data = event.data if data.get("type") == "assistant": content = data.get("message", {}).get("content", []) for block in content: if block.get("type") == "text": print(block["text"], end="") print() client.end_session(session.id) ``` ```typescript const session = await client.createSession('my-agent'); const questions = [ 'What files are in the workspace?', 'Read the main config file.', 'Summarize what this project does.', ]; for (const question of questions) { console.log(`\n> ${question}`); for await (const event of client.sendMessageStream(session.id, question)) { if (event.type === 'message') { const text = extractTextFromEvent(event.data); if (text) process.stdout.write(text); } } console.log(); } await client.endSession(session.id); ``` ### List Agents and Sessions ```python # List all deployed agents agents = client.list_agents() for agent in agents: print(f"{agent.name} (v{agent.version})") # List all sessions, optionally filtered by agent sessions = 
client.list_sessions(agent="my-agent") for s in sessions: print(f"{s.id} - {s.status}") ``` ```typescript // List all deployed agents const agents = await client.listAgents(); for (const agent of agents) { console.log(`${agent.name} (v${agent.version})`); } // List all sessions, optionally filtered by agent const sessions = await client.listSessions('my-agent'); for (const s of sessions) { console.log(`${s.id} - ${s.status}`); } ``` ## Error Handling ```python from ash_ai import AshApiError try: session = client.create_session(agent="nonexistent") except AshApiError as e: print(f"API error ({e.status_code}): {e.message}") except Exception as e: print(f"Connection error: {e}") ``` ```typescript try { const session = await client.createSession('nonexistent'); } catch (err) { console.error(err.message); } ``` ## Note on SDK Generation The Python SDK is auto-generated from the Ash server's OpenAPI specification using `openapi-python-client`. The spec is generated from Fastify route schemas, so the Python SDK always matches the server's API surface. To regenerate the SDK from source: ```bash make sdk-python ``` This runs `make openapi` first (to generate the spec), then runs the Python client generator. --- # Direct API (curl) Source: https://docs.ash-cloud.ai/sdks/curl # Direct API (curl) No SDK dependencies needed. All Ash functionality is available through HTTP requests. This page shows every operation using `curl`. ## Setup All examples below use the `ASH_SERVER_URL` environment variable. Set it once: ```bash export ASH_SERVER_URL=http://localhost:4100 # default ``` Include the `-H "Authorization: Bearer YOUR_KEY"` header on every request except `/health`. The server always requires authentication — it auto-generates an API key on first start if one is not provided.
## Health Check

```bash
curl $ASH_SERVER_URL/health
```

```json
{
  "status": "ok",
  "activeSessions": 2,
  "activeSandboxes": 2,
  "uptime": 1234,
  "pool": {
    "total": 5,
    "cold": 2,
    "warming": 0,
    "warm": 1,
    "waiting": 1,
    "running": 1,
    "maxCapacity": 1000,
    "resumeWarmHits": 3,
    "resumeColdHits": 1
  }
}
```

## Agents

### Deploy an Agent

```bash
curl -X POST $ASH_SERVER_URL/api/agents \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_KEY" \
  -d '{"name": "my-agent", "path": "/path/to/agent/directory"}'
```

The agent directory must contain a `CLAUDE.md` file. The path is resolved on the server.

### List Agents

```bash
curl $ASH_SERVER_URL/api/agents \
  -H "Authorization: Bearer YOUR_KEY"
```

### Get Agent Details

```bash
curl $ASH_SERVER_URL/api/agents/my-agent \
  -H "Authorization: Bearer YOUR_KEY"
```

### Delete an Agent

```bash
curl -X DELETE $ASH_SERVER_URL/api/agents/my-agent \
  -H "Authorization: Bearer YOUR_KEY"
```

## Sessions

### Create a Session

```bash
curl -X POST $ASH_SERVER_URL/api/sessions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_KEY" \
  -d '{"agent": "my-agent"}'
```

Response:

```json
{
  "session": {
    "id": "a1b2c3d4-...",
    "agentName": "my-agent",
    "sandboxId": "a1b2c3d4-...",
    "status": "active",
    "createdAt": "2026-01-15T10:00:00.000Z",
    "lastActiveAt": "2026-01-15T10:00:00.000Z"
  }
}
```

### Send a Message (SSE Stream)

Use `-N` to disable output buffering so SSE events print in real time:

```bash
curl -N -X POST $ASH_SERVER_URL/api/sessions/SESSION_ID/messages \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_KEY" \
  -d '{"content": "What files are in the workspace?"}'
```

The response is a `text/event-stream`.
Events arrive as:

```
event: message
data: {"type":"assistant","message":{"role":"assistant","content":[{"type":"text","text":"Here are the files..."}]}}

event: message
data: {"type":"result","subtype":"success","session_id":"...","num_turns":1}

event: done
data: {"sessionId":"a1b2c3d4-..."}
```

To enable partial message streaming (incremental text deltas):

```bash
curl -N -X POST $ASH_SERVER_URL/api/sessions/SESSION_ID/messages \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_KEY" \
  -d '{"content": "Write a haiku", "includePartialMessages": true}'
```

### List Sessions

```bash
# All sessions
curl $ASH_SERVER_URL/api/sessions \
  -H "Authorization: Bearer YOUR_KEY"

# Filter by agent
curl "$ASH_SERVER_URL/api/sessions?agent=my-agent" \
  -H "Authorization: Bearer YOUR_KEY"
```

### Get Session Details

```bash
curl $ASH_SERVER_URL/api/sessions/SESSION_ID \
  -H "Authorization: Bearer YOUR_KEY"
```

### Pause a Session

```bash
curl -X POST $ASH_SERVER_URL/api/sessions/SESSION_ID/pause \
  -H "Authorization: Bearer YOUR_KEY"
```

### Resume a Session

```bash
curl -X POST $ASH_SERVER_URL/api/sessions/SESSION_ID/resume \
  -H "Authorization: Bearer YOUR_KEY"
```

### End a Session

```bash
curl -X DELETE $ASH_SERVER_URL/api/sessions/SESSION_ID \
  -H "Authorization: Bearer YOUR_KEY"
```

### List Messages (History)

```bash
# Default: last 100 messages
curl $ASH_SERVER_URL/api/sessions/SESSION_ID/messages \
  -H "Authorization: Bearer YOUR_KEY"

# With pagination
curl "$ASH_SERVER_URL/api/sessions/SESSION_ID/messages?limit=50&after=10" \
  -H "Authorization: Bearer YOUR_KEY"
```

Note: `GET /api/sessions/:id/messages` returns persisted message history, while `POST /api/sessions/:id/messages` sends a new message and returns an SSE stream.
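Because the message stream is plain SSE, it can also be consumed without any SDK. A minimal Python sketch (illustrative, not from the Ash source) that groups `event:`/`data:` lines into frames:

```python
import json

def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of SSE text lines (sketch)."""
    event, data_parts = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_parts.append(line[len("data:"):].strip())
        elif line == "" and event is not None:
            # A blank line terminates the frame
            yield event, json.loads("\n".join(data_parts))
            event, data_parts = None, []

frames = list(parse_sse([
    'event: message',
    'data: {"type":"result","num_turns":1}',
    '',
    'event: done',
    'data: {"sessionId":"a1b2c3d4"}',
    '',
]))
```

In a real client you would feed `parse_sse` the line iterator of the HTTP response body.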
### List Session Events (Timeline)

```bash
# All events
curl $ASH_SERVER_URL/api/sessions/SESSION_ID/events \
  -H "Authorization: Bearer YOUR_KEY"

# Filter by type
curl "$ASH_SERVER_URL/api/sessions/SESSION_ID/events?type=text&limit=50" \
  -H "Authorization: Bearer YOUR_KEY"
```

## Files

### List Workspace Files

```bash
curl $ASH_SERVER_URL/api/sessions/SESSION_ID/files \
  -H "Authorization: Bearer YOUR_KEY"
```

### Read a File

```bash
curl $ASH_SERVER_URL/api/sessions/SESSION_ID/files/src/index.ts \
  -H "Authorization: Bearer YOUR_KEY"
```

## SSE Event Format

The send-message endpoint returns an SSE stream with three event types:

| Event | Data | Description |
|-------|------|-------------|
| `message` | Raw Claude Code SDK `Message` object | Assistant response, tool use, tool result, or final result. The `data.type` field indicates the message kind (`assistant`, `user`, `result`, `stream_event`). |
| `error` | `{"error": "..."}` | An error occurred during processing. |
| `done` | `{"sessionId": "..."}` | The agent's turn is complete. |

Each SSE frame follows the standard format:

```
event: <type>\n
data: <json>\n
\n
```

The `message` event data is a passthrough of the Claude Code SDK's `Message` type. Ash does not translate or wrap these messages -- the SDK's types are the wire format.

---

# CLI Overview

Source: https://docs.ash-cloud.ai/cli/overview

# CLI Overview

The `ash` CLI manages the Ash server lifecycle, deploys agents, and interacts with sessions from the terminal.

## Installation

```bash
npm install -g @ash-ai/cli
```

## Global Configuration

The CLI connects to an Ash server. Set the server URL via environment variable:

```bash
export ASH_SERVER_URL=http://localhost:4100  # default
```

The server always requires authentication. When you run `ash start`, the CLI automatically picks up the server's API key (auto-generated or explicit) and saves it to `~/.ash/config.json`. For remote servers, use `ash link --api-key <key>` to save the key.
## Help

```bash
ash --help
```

```
Usage: ash [options] [command]

Agent orchestration CLI

Options:
  -V, --version   output the version number
  -h, --help      display help for command

Commands:
  start           Start the Ash server in a Docker container
  stop            Stop the Ash server container
  status          Show Ash server status
  logs            Show Ash server logs
  chat            Send a message to an agent (one-shot)
  deploy          Deploy an agent to the server
  session         Manage sessions
  agent           Manage agents
  health          Check server health
  help [command]  display help for command
```

## Command Groups

| Group | Description |
|-------|-------------|
| **Server Lifecycle** | `start`, `stop`, `status`, `logs` -- manage the Ash server Docker container |
| **Quick** | `chat` -- send a message to an agent, keep session alive for follow-ups (`--session <id>` to continue, `--end` to clean up) |
| **Agents** | `deploy`, `agent list`, `agent info`, `agent delete` -- deploy and manage agent definitions |
| **Sessions** | `session create`, `session send`, `session list`, `session pause`, `session resume`, `session end` -- interact with agent sessions |
| **Health** | `health` -- check server health and pool stats |

---

# Server Lifecycle

Source: https://docs.ash-cloud.ai/cli/lifecycle

# Server Lifecycle

The CLI manages an Ash server running in a Docker container. These commands handle the full lifecycle: start, stop, status, and logs.

## `ash start`

Starts the Ash server in a Docker container.

```bash
ash start
```

The command:

1. Checks that Docker is installed and running
2. Removes any stale stopped container
3. Creates the data directory (`~/.ash/`)
4. Pulls the latest image (unless `--no-pull`)
5. Starts the container with port mapping and volume mounts
6. Waits for the health endpoint to respond (up to 30 seconds)

### Options

| Flag | Description | Default |
|------|-------------|---------|
| `--port <port>` | Host port to expose | `4100` |
| `--tag <tag>` | Docker image tag | Latest published version |
| `--image <image>` | Full Docker image name (overrides default + tag) | `ghcr.io/ash-ai/ash` |
| `--no-pull` | Skip pulling the image; use a local build | Pull enabled |
| `--database-url <url>` | PostgreSQL/CockroachDB connection URL | SQLite (default) |
| `-e, --env <key=value>` | Extra env vars to pass to the container (repeatable) | None |

### Examples

```bash
# Start with defaults
ash start

# Use a local dev image
ash start --image ash-dev --no-pull

# Use a specific port
ash start --port 8080

# Use Postgres instead of SQLite
ash start --database-url "postgresql://user:pass@host:5432/ash"

# Pass extra env vars
ash start -e ANTHROPIC_API_KEY=sk-ant-...
```

### Output

```
Starting Ash server...
Waiting for server to be ready...
Ash server is running.
URL: http://localhost:4100
API key: ash_xxxxxxxx (saved to ~/.ash/config.json)
Data dir: /Users/you/.ash
```

The server auto-generates a secure API key on first start and the CLI saves it to `~/.ash/config.json`. Subsequent CLI commands use this key automatically.

## `ash stop`

Stops the running Ash server container.

```bash
ash stop
```

```
Stopping Ash server...
Ash server stopped.
```

If no container is found, prints a message and exits.

## `ash status`

Shows the current state of the Ash server container and, if running, its health stats.

```bash
ash status
```

### Example Output

```
Container: running
ID: a1b2c3d4e5f6
Image: ghcr.io/ash-ai/ash:0.1.0
Active sessions: 3
Active sandboxes: 2
Uptime: 1234s
```

When the container is stopped:

```
Container: exited
ID: a1b2c3d4e5f6
Image: ghcr.io/ash-ai/ash:0.1.0
```

When no container exists:

```
Container: not-found
```

## `ash logs`

Shows logs from the Ash server container.
```bash
ash logs
```

### Options

| Flag | Description |
|------|-------------|
| `-f, --follow` | Follow log output (like `tail -f`) |

### Examples

```bash
# Show recent logs
ash logs

# Follow logs in real time
ash logs --follow
```

---

# Agent Commands

Source: https://docs.ash-cloud.ai/cli/agents

# Agent Commands

Deploy and manage agent definitions on the Ash server.

## `ash deploy <path>`

Deploys an agent from a local directory. The directory must contain a `CLAUDE.md` file.

```bash
ash deploy ./my-agent --name my-agent
```

The command copies the agent directory to `~/.ash/agents/<name>/` (so it is accessible inside the Docker container via volume mount), then registers it with the server.

### Options

| Flag | Description | Default |
|------|-------------|---------|
| `-n, --name <name>` | Agent name | Directory name |

### Example

```bash
ash deploy ./examples/qa-bot/agent --name qa-bot
```

```
Copied agent files to /Users/you/.ash/agents/qa-bot
Deployed agent:
{
  "id": "a1b2c3d4-...",
  "name": "qa-bot",
  "version": 1,
  "path": "agents/qa-bot",
  "createdAt": "2026-01-15T10:00:00.000Z",
  "updatedAt": "2026-01-15T10:00:00.000Z"
}
```

Deploying the same name again increments the version:

```bash
ash deploy ./examples/qa-bot/agent --name qa-bot
```

```
Deployed agent: { ... "version": 2, ... }
```

## `ash agent list`

Lists all deployed agents.

```bash
ash agent list
```

```json
[
  {
    "id": "a1b2c3d4-...",
    "name": "qa-bot",
    "version": 2,
    "path": "agents/qa-bot",
    "createdAt": "2026-01-15T10:00:00.000Z",
    "updatedAt": "2026-01-15T10:05:00.000Z"
  },
  {
    "id": "e5f6a7b8-...",
    "name": "code-reviewer",
    "version": 1,
    "path": "agents/code-reviewer",
    "createdAt": "2026-01-15T11:00:00.000Z",
    "updatedAt": "2026-01-15T11:00:00.000Z"
  }
]
```

## `ash agent info <name>`

Gets details for a specific agent.
```bash
ash agent info qa-bot
```

```json
{
  "id": "a1b2c3d4-...",
  "name": "qa-bot",
  "version": 2,
  "path": "agents/qa-bot",
  "createdAt": "2026-01-15T10:00:00.000Z",
  "updatedAt": "2026-01-15T10:05:00.000Z"
}
```

## `ash agent delete <name>`

Deletes an agent and its associated sessions.

```bash
ash agent delete qa-bot
```

```
Deleted agent: qa-bot
```

---

# Session Commands

Source: https://docs.ash-cloud.ai/cli/sessions

# Session Commands

Create, message, and manage agent sessions from the terminal.

## `ash session create <agent>`

Creates a new session for the named agent. A sandbox is allocated and the agent's workspace is initialized.

```bash
ash session create qa-bot
```

```
Session created:
{
  "id": "b2c3d4e5-1234-5678-9abc-def012345678",
  "agentName": "qa-bot",
  "sandboxId": "b2c3d4e5-1234-5678-9abc-def012345678",
  "status": "active",
  "createdAt": "2026-01-15T10:00:00.000Z",
  "lastActiveAt": "2026-01-15T10:00:00.000Z"
}
```

## `ash session send <sessionId> <message>`

Sends a message to a session and streams the response. SSE events are printed as they arrive.

```bash
ash session send b2c3d4e5-1234-5678-9abc-def012345678 "What files are in the workspace?"
```

```
[message] assistant: {"type":"assistant","message":{"role":"assistant","content":[{"type":"text","text":"..."}]}}
[message] result: {"type":"result","subtype":"success","session_id":"...","num_turns":1}
[done] {"sessionId":"b2c3d4e5-..."}
```

Each line shows the SSE event type in brackets followed by the SDK message type and a truncated JSON preview.

## `ash session list`

Lists all sessions.

```bash
ash session list
```

```json
[
  {
    "id": "b2c3d4e5-...",
    "agentName": "qa-bot",
    "sandboxId": "b2c3d4e5-...",
    "status": "active",
    "createdAt": "2026-01-15T10:00:00.000Z",
    "lastActiveAt": "2026-01-15T10:01:00.000Z"
  },
  {
    "id": "c3d4e5f6-...",
    "agentName": "qa-bot",
    "status": "paused",
    ...
  }
]
```

## `ash session pause <sessionId>`

Pauses an active session. The workspace state is persisted and the sandbox remains alive for fast resume.
```bash
ash session pause b2c3d4e5-1234-5678-9abc-def012345678
```

```
Session paused:
{ "id": "b2c3d4e5-...", "status": "paused", ... }
```

## `ash session resume <sessionId>`

Resumes a paused or errored session. If the sandbox is still alive, resume is instant (warm path). If the sandbox was evicted, a new one is created and the workspace is restored (cold path).

```bash
ash session resume b2c3d4e5-1234-5678-9abc-def012345678
```

```
Session resumed:
{ "id": "b2c3d4e5-...", "status": "active", ... }
```

## `ash session end <sessionId>`

Ends a session permanently. The sandbox is destroyed and the session status is set to `ended`.

```bash
ash session end b2c3d4e5-1234-5678-9abc-def012345678
```

```
Session ended:
{ "id": "b2c3d4e5-...", "status": "ended", ... }
```

## Full Lifecycle Example

```bash
# Deploy an agent
ash deploy ./my-agent --name helper

# Create a session
ash session create helper
# Note the session ID from the output

# Send messages
ash session send SESSION_ID "List the project structure"
ash session send SESSION_ID "Read the README"

# Pause when done for now
ash session pause SESSION_ID

# Resume later
ash session resume SESSION_ID
ash session send SESSION_ID "Summarize what you found"

# End when finished
ash session end SESSION_ID
```

---

# Health

Source: https://docs.ash-cloud.ai/cli/health

# Health

Check the health of a running Ash server.

## `ash health`

Queries the server's `/health` endpoint and prints the response.
```bash
ash health
```

### Example Output

```json
{
  "status": "ok",
  "activeSessions": 3,
  "activeSandboxes": 2,
  "uptime": 7200,
  "pool": {
    "total": 5,
    "cold": 2,
    "warming": 0,
    "warm": 1,
    "waiting": 1,
    "running": 1,
    "maxCapacity": 1000,
    "resumeWarmHits": 5,
    "resumeColdHits": 2
  }
}
```

### Fields

| Field | Description |
|-------|-------------|
| `status` | Always `"ok"` if the server is reachable |
| `activeSessions` | Number of sessions with status `active` |
| `activeSandboxes` | Number of live sandbox processes |
| `uptime` | Seconds since the server started |
| `pool.total` | Total sandbox entries in the database (live + cold) |
| `pool.cold` | Sandboxes with no live process (can be evicted or restored) |
| `pool.warming` | Sandboxes currently starting up |
| `pool.warm` | Sandboxes with a live process, not yet assigned to a message |
| `pool.waiting` | Sandboxes idle between messages (sandbox alive, session paused or between turns) |
| `pool.running` | Sandboxes actively processing a message |
| `pool.maxCapacity` | Maximum number of sandboxes allowed (set by `ASH_MAX_SANDBOXES`) |
| `pool.resumeWarmHits` | Number of resumes that found the sandbox still alive |
| `pool.resumeColdHits` | Number of resumes that required creating a new sandbox |

The health endpoint does not require authentication.

---

# System Overview

Source: https://docs.ash-cloud.ai/architecture/overview

# System Overview

Ash is a thin orchestration layer around the [Claude Code SDK](https://github.com/anthropic-ai/claude-code-sdk-python). It manages agent deployment, session lifecycle, sandbox isolation, and streaming -- adding as little overhead as possible on top of the SDK itself.

## Standalone Mode

In standalone mode, a single server process manages everything: HTTP API, sandbox pool, and bridge processes.

```mermaid
graph LR
    Client["Client (SDK / CLI / curl)"]
    Server["Ash Server<br/>Fastify :4100"]
    Pool["SandboxPool"]
    B1["Bridge 1"]
    B2["Bridge 2"]
    SDK1["Claude Code SDK"]
    SDK2["Claude Code SDK"]
    DB["SQLite / Postgres"]

    Client -->|HTTP + SSE| Server
    Server --> Pool
    Server --> DB
    Pool --> B1
    Pool --> B2
    B1 -->|Unix Socket| SDK1
    B2 -->|Unix Socket| SDK2
```

## Coordinator Mode

In coordinator mode, the server acts as a pure control plane. Sandbox execution is offloaded to remote runner processes on separate machines.

```mermaid
graph LR
    Client["Client"]
    Server["Ash Server<br/>(coordinator)"]
    R1["Runner 1"]
    R2["Runner 2"]
    B1["Bridge"]
    B2["Bridge"]
    DB["Postgres / CRDB"]

    Client -->|HTTP + SSE| Server
    Server --> DB
    Server -->|HTTP| R1
    Server -->|HTTP| R2
    R1 --> B1
    R2 --> B2
```

Runners register with the server via heartbeat. The server routes sessions to the runner with the most available capacity.

## Multi-Coordinator Mode

For high availability and horizontal scaling of the control plane, run multiple coordinators behind a load balancer with a shared database (Postgres or CockroachDB).

```mermaid
graph LR
    Client["Client"]
    LB["Load Balancer"]
    C1["Coordinator 1"]
    C2["Coordinator 2"]
    R1["Runner 1"]
    R2["Runner 2"]
    DB["CRDB"]

    Client -->|HTTPS| LB
    LB --> C1
    LB --> C2
    C1 --> DB
    C2 --> DB
    C1 -->|HTTP| R1
    C1 -->|HTTP| R2
    C2 -->|HTTP| R1
    C2 -->|HTTP| R2
```

Coordinators are stateless — the runner registry and session routing state live in the database. Any coordinator can route to any runner. SSE reconnection handles coordinator failover transparently. See [Scaling Architecture](./scaling) for details.

## Components

| Package | Description |
|---------|-------------|
| `@ash-ai/shared` | Types, protocol definitions, constants. No runtime dependencies. |
| `@ash-ai/sandbox` | SandboxManager, SandboxPool, BridgeClient, resource limits, state persistence. Used by both server and runner. |
| `@ash-ai/bridge` | Runs inside each sandbox process. Receives commands over Unix socket, calls the Claude Code SDK, streams responses back. |
| `@ash-ai/server` | Fastify REST API. Agent registry, session routing, SSE streaming, database access. |
| `@ash-ai/runner` | Worker node for multi-machine deployments. Manages sandboxes on a remote host, registers with the server. |
| `@ash-ai/sdk` | TypeScript client library for the Ash API. |
| `@ash-ai/cli` | `ash` command-line tool. Server lifecycle, agent deployment, session management. |

## Message Hot Path

Every message traverses this path. Ash's goal is to add no more than 1-3ms of overhead on top of the SDK.
```mermaid
sequenceDiagram
    participant C as Client
    participant S as Server (Fastify)
    participant P as Pool
    participant B as Bridge
    participant SDK as Claude Code SDK

    C->>S: POST /api/sessions/:id/messages
    S->>S: Session lookup (DB)
    S->>P: markRunning(sandboxId)
    S->>B: query command (Unix socket)
    B->>SDK: sdk.query(prompt)
    SDK-->>B: Message stream
    B-->>S: message events (Unix socket)
    S-->>C: SSE stream (event: message)
    S->>P: markWaiting(sandboxId)
    S-->>C: event: done
```

## Package Dependency Graph

```mermaid
graph TD
    shared["@ash-ai/shared"]
    sandbox["@ash-ai/sandbox"]
    bridge["@ash-ai/bridge"]
    server["@ash-ai/server"]
    runner["@ash-ai/runner"]
    sdk["@ash-ai/sdk"]
    cli["@ash-ai/cli"]

    sandbox --> shared
    bridge --> shared
    server --> shared
    server --> sandbox
    runner --> shared
    runner --> sandbox
    sdk --> shared
    cli --> shared
```

## Storage Layout

```
data/
  ash.db                  # SQLite database (agents, sessions, sandboxes, messages, events)
  sandboxes/
    <sandboxId>/
      workspace/          # Agent workspace (CLAUDE.md, files, etc.)
  sessions/
    <sessionId>/
      workspace/          # Persisted workspace snapshot (for cold resume)
```

In Postgres/CRDB mode, `ash.db` is replaced by the remote database. The `sandboxes/` and `sessions/` directories remain on the local filesystem.

---

# Sandbox Isolation

Source: https://docs.ash-cloud.ai/architecture/sandbox-isolation

# Sandbox Isolation

Ash treats agent code as untrusted. Each agent session runs inside an isolated sandbox process with restricted access to the host system.

## Security Model

The agent inside the sandbox can execute arbitrary shell commands (that is how the Claude Code SDK works).
The sandbox must prevent:

- Reading host environment variables (credentials, secrets)
- Writing outside the workspace directory
- Consuming unbounded host resources (memory, CPU, disk)
- Interfering with other sandboxes or the host process

## Isolation Layers

| Layer | Linux | macOS (dev) |
|-------|-------|-------------|
| **Process limits** | cgroups v2 | ulimit |
| **Memory limit** | cgroup `memory.max` (default 2048 MB) | ulimit (best-effort) |
| **CPU limit** | cgroup `cpu.max` (default 100% = 1 core) | Not enforced |
| **Disk limit** | Periodic check, kill on exceed (default 1024 MB) | Periodic check, kill on exceed |
| **Max processes** | cgroup `pids.max` (default 64, fork bomb protection) | ulimit |
| **Environment** | Strict allowlist | Strict allowlist |
| **Filesystem** | bubblewrap (bwrap) read-only root, writable workspace | Restricted cwd only |
| **Network** | Network namespace (configurable) | Unrestricted |

Resource limit defaults are defined in `@ash-ai/shared`:

```typescript
const DEFAULT_SANDBOX_LIMITS = {
  memoryMb: 2048,    // Max RSS in MB
  cpuPercent: 100,   // 100 = 1 core
  diskMb: 1024,      // Max workspace size in MB
  maxProcesses: 64,  // Fork bomb protection
};
```

## Environment Variable Allowlist

The sandbox process receives **only** these environment variables. Everything else is blocked.
### Passed through from host (if set)

| Variable | Purpose |
|----------|---------|
| `PATH` | Standard path |
| `NODE_PATH` | Node.js module resolution |
| `HOME` | Home directory (set to workspace dir) |
| `LANG` | Locale |
| `TERM` | Terminal type |
| `ANTHROPIC_API_KEY` | Required for Claude Code SDK |
| `ASH_DEBUG_TIMING` | Enable timing instrumentation |

### Injected by Ash

| Variable | Purpose |
|----------|---------|
| `ASH_BRIDGE_SOCKET` | Path to the Unix socket for bridge communication |
| `ASH_AGENT_DIR` | Original agent directory path |
| `ASH_WORKSPACE_DIR` | Writable workspace directory for this session |
| `ASH_SANDBOX_ID` | Unique sandbox identifier |
| `ASH_SESSION_ID` | Session identifier |

### Everything else: blocked

The sandbox does not inherit `process.env`. Variables like `AWS_SECRET_ACCESS_KEY`, `DATABASE_URL`, `GITHUB_TOKEN`, or any other host secret are never visible inside the sandbox.

```typescript
// From sandbox/manager.ts -- allowlist enforcement
const env: Record<string, string> = {};
for (const key of SANDBOX_ENV_ALLOWLIST) {
  if (process.env[key]) {
    env[key] = process.env[key]!;
  }
}
// Only these vars + injected ASH_* vars are passed to the child process
```

## OOM Detection

When a sandbox process is killed by the kernel's OOM killer (exit code 137 or SIGKILL), Ash detects this and automatically pauses the session. The session can be resumed later with a fresh sandbox.

## Disk Monitoring

A periodic check (every 30 seconds) measures the workspace directory size. If it exceeds `diskMb`, the sandbox is killed immediately.

---

# Bridge Protocol

Source: https://docs.ash-cloud.ai/architecture/bridge-protocol

# Bridge Protocol

The bridge process runs inside each sandbox and communicates with the host server over a Unix domain socket using newline-delimited JSON.
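Newline-delimited JSON framing is easy to implement in any language. A Python sketch of the decode side (illustrative, not the actual bridge code) that also keeps the trailing partial line a real socket reader must buffer between reads:

```python
import json

def decode_ndjson(buffer: str):
    """Split a socket read buffer into complete JSON messages (sketch).

    Returns (messages, remainder): the trailing partial line is kept for
    the next read, since a read() may end mid-message.
    """
    *complete, remainder = buffer.split("\n")
    messages = [json.loads(line) for line in complete if line.strip()]
    return messages, remainder

# A read that ends mid-way through a third message
messages, rest = decode_ndjson('{"ev":"ready"}\n{"ev":"message","data":{}}\n{"ev":"do')
```

The remainder (`rest` here) would be prepended to the next chunk read from the socket.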
## Why Unix Sockets

- Lower overhead than TCP (no network stack, no port allocation)
- No port conflicts when running multiple sandboxes
- Natural 1:1 mapping between socket file and sandbox process
- Socket paths include the sandbox ID for easy identification

Socket path format: `/tmp/ash-<sandboxId>.sock`

## Wire Format

Each message is a single JSON object followed by a newline character (`\n`). Both directions use the same encoding:

```typescript
function encode(msg: BridgeCommand | BridgeEvent): string {
  return JSON.stringify(msg) + '\n';
}

function decode(line: string): BridgeCommand | BridgeEvent {
  return JSON.parse(line.trim());
}
```

These functions are exported from `@ash-ai/shared` as `encode` and `decode`.

## Commands (Server to Bridge)

Commands are sent from the server (or runner) to the bridge process inside the sandbox.

| Command | Fields | Description |
|---------|--------|-------------|
| `query` | `cmd`, `prompt`, `sessionId`, `includePartialMessages?` | Send a message to the agent. The bridge calls the Claude Code SDK and streams responses back. |
| `resume` | `cmd`, `sessionId` | Resume a conversation with the SDK's session resume capability. |
| `interrupt` | `cmd` | Interrupt the current agent turn. |
| `shutdown` | `cmd` | Gracefully shut down the bridge process. |

### Command type definitions

```typescript
interface QueryCommand {
  cmd: 'query';
  prompt: string;
  sessionId: string;
  includePartialMessages?: boolean;
}

interface ResumeCommand {
  cmd: 'resume';
  sessionId: string;
}

interface InterruptCommand {
  cmd: 'interrupt';
}

interface ShutdownCommand {
  cmd: 'shutdown';
}
```

## Events (Bridge to Server)

Events are sent from the bridge process back to the server.

| Event | Fields | Description |
|-------|--------|-------------|
| `ready` | `ev` | Bridge is initialized and ready to accept commands. Sent once on startup. |
| `message` | `ev`, `data` | A raw SDK `Message` object. The `data` field contains the unmodified message from `@anthropic-ai/claude-code`. |
| `error` | `ev`, `error` | An error occurred during processing. |
| `done` | `ev`, `sessionId` | The agent's turn is complete. |

### Event type definitions

```typescript
interface ReadyEvent {
  ev: 'ready';
}

interface MessageEvent {
  ev: 'message';
  data: unknown; // Raw SDK Message -- passthrough, not translated
}

interface ErrorEvent {
  ev: 'error';
  error: string;
}

interface DoneEvent {
  ev: 'done';
  sessionId: string;
}
```

## SDK Message Passthrough

The `message` event's `data` field contains the raw SDK `Message` object exactly as returned by `@anthropic-ai/claude-code`. Ash does not translate, wrap, or modify these messages. This is a deliberate design decision ([ADR 0001](/architecture/decisions#adr-0001-sdk-passthrough-types)). The benefits:

- One type system instead of three (no bridge-specific or SSE-specific message types)
- SDK changes propagate automatically through the entire pipeline
- Clients can use SDK types directly for type-safe message handling
- Less code to maintain

The `data.type` field indicates the SDK message kind: `assistant`, `user`, `result`, `stream_event`, etc.

## Connection Lifecycle

```mermaid
sequenceDiagram
    participant S as Server
    participant B as Bridge

    Note over S: spawn bridge process
    B->>S: ready
    S->>B: query (prompt, sessionId)
    B-->>S: message (SDK Message)
    B-->>S: message (SDK Message)
    B-->>S: done (sessionId)
    Note over S,B: ... more commands ...
    S->>B: shutdown
    Note over B: process exits
```

The bridge sends `ready` immediately after initializing the Unix socket listener. The server waits for this event before sending any commands (with a 10-second timeout).

---

# Session Lifecycle

Source: https://docs.ash-cloud.ai/architecture/session-lifecycle

# Session Lifecycle

A session represents an ongoing conversation between a client and an agent running inside a sandbox.
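The allowed status changes form a small state machine. As a rough Python sketch of a validity check (statuses and transitions as documented on this page):

```python
# Allowed session-status transitions, mirroring the lifecycle documented here.
TRANSITIONS = {
    ("starting", "active"),  # sandbox ready
    ("active", "paused"),    # pause
    ("active", "ended"),     # end
    ("active", "error"),     # sandbox crash / OOM
    ("paused", "active"),    # resume (warm or cold)
    ("error", "active"),     # resume (always cold)
}

def can_transition(current: str, target: str) -> bool:
    """Check whether a session status change is legal (sketch)."""
    return (current, target) in TRANSITIONS
```

Note that `ended` appears in no left-hand position: it is terminal, which is why resuming an ended session returns 410 Gone.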
## State Machine

```mermaid
stateDiagram-v2
    [*] --> starting: POST /api/sessions
    starting --> active: Sandbox ready
    active --> paused: POST .../pause
    active --> ended: DELETE .../:id
    active --> error: Sandbox crash / OOM
    paused --> active: POST .../resume (warm or cold)
    error --> active: POST .../resume (cold)
    ended --> [*]
```

## States

| Status | Description |
|--------|-------------|
| `starting` | Session created, sandbox being allocated. Transient -- transitions to `active` within seconds. |
| `active` | Sandbox is alive and ready to accept messages. |
| `paused` | Session is paused. Workspace state is persisted. The sandbox may still be alive (warm) or evicted (cold). |
| `error` | An error occurred (sandbox crash, OOM kill). Resumable -- a new sandbox will be created on resume. |
| `ended` | Session is permanently ended. The sandbox is destroyed. Cannot be resumed. |

## State Transitions

| Transition | Trigger | What happens |
|-----------|---------|-------------|
| starting -> active | Sandbox process starts, bridge sends `ready` | Session is ready to accept messages |
| active -> paused | `POST /api/sessions/:id/pause` | Workspace state persisted, session marked paused |
| active -> ended | `DELETE /api/sessions/:id` | Workspace persisted, sandbox destroyed, session marked ended |
| active -> error | Sandbox crash or OOM kill | Session marked as error, available for resume |
| paused -> active | `POST /api/sessions/:id/resume` | Warm path (instant) or cold path (new sandbox) |
| error -> active | `POST /api/sessions/:id/resume` | Always cold path (new sandbox created) |

## Pause Flow

When a session is paused:

1. Server persists workspace state to `data/sessions/<sessionId>/workspace/`
2. If cloud storage is configured (`ASH_SNAPSHOT_URL`), workspace is synced to S3/GCS
3. Session status is updated to `paused` in the database
4. The sandbox process remains alive (for potential fast resume)

## Resume Flow

Resume follows a decision tree to minimize latency:

```mermaid
flowchart TD
    A[POST .../resume] --> B{Session status?}
    B -->|ended| C[410 Gone]
    B -->|active| D[Return session as-is]
    B -->|paused / error| E{Same runner available?}
    E -->|yes| F{Sandbox alive?}
    F -->|yes| G["Warm path<br/>(instant resume)"]
    F -->|no| H["Cold path"]
    E -->|no| H
    H --> I{Local workspace exists?}
    I -->|yes| J["Create sandbox<br/>with existing workspace"]
    I -->|no| K{Persisted snapshot?}
    K -->|yes| L[Restore from local snapshot]
    K -->|no| M{Cloud snapshot?}
    M -->|yes| N[Restore from S3/GCS]
    M -->|no| O[Create fresh sandbox]
    L --> J
    N --> J
    O --> J
```

### Warm path

If the original sandbox process is still alive (the session was paused but not evicted), resume is instant. No data is copied, no process is started. The session status is simply updated to `active`.

### Cold path

If the sandbox was evicted or crashed, a new sandbox is created:

1. Check for workspace on local disk (`data/sandboxes/<sandboxId>/workspace/`) → **source: local**
2. If not found, check for persisted snapshot (`data/sessions/<sessionId>/workspace/`) → **source: local**
3. If not found, try restoring from cloud storage (`ASH_SNAPSHOT_URL`) → **source: cloud**
4. If no backup exists, create from fresh agent definition → **source: fresh**
5. Create a new sandbox, reusing the restored workspace if available
6. Update session with new sandbox ID and runner ID

The resume source is tracked in metrics (`ash_resume_cold_total{source="..."}`) so you can monitor how often each path is hit. See [State Persistence & Restore](./state-persistence.md) for the full storage architecture.

## Cloud Persistence

When `ASH_SNAPSHOT_URL` is set to an S3 or GCS URL, workspace snapshots are automatically synced to cloud storage after each completed agent turn and before eviction. This enables resume across server restarts and machine migrations.

## Cold Cleanup

Cold sandbox entries (no live process) are automatically cleaned up after 2 hours of inactivity. Local workspace files and database records are deleted, but **cloud snapshots are preserved** — so sessions can still be resumed from cloud storage after local cleanup. See [Sandbox Pool](./sandbox-pool.md#cold-cleanup) for details.

---

# Sandbox Pool

Source: https://docs.ash-cloud.ai/architecture/sandbox-pool

# Sandbox Pool

The `SandboxPool` manages the lifecycle of all sandboxes in a server or runner process.
It enforces capacity limits, handles eviction, and runs periodic idle sweeps.

## State Machine

Each sandbox transitions through these states:

```mermaid
stateDiagram-v2
    [*] --> cold: Server restart
    cold --> warming: Create requested
    warming --> warm: Bridge ready
    warm --> running: Message received
    running --> waiting: Turn complete
    waiting --> running: Next message
    waiting --> cold: Idle sweep / eviction
```

## States

| State | Process alive? | Description |
|-------|---------------|-------------|
| `cold` | No | Database record only. Process was evicted or server restarted. Workspace may be persisted for later restore. |
| `warming` | Starting | Sandbox process is being created. Bridge not yet ready. |
| `warm` | Yes | Bridge process is alive and connected. Ready to accept its first command. |
| `waiting` | Yes | Between messages. Sandbox is idle, waiting for the next command. Eligible for idle eviction. |
| `running` | Yes | Actively processing a message. Never evicted. |

## Eviction

When a new sandbox needs to be created but the pool is at capacity (`ASH_MAX_SANDBOXES`), eviction kicks in. Candidates are selected in priority order:

| Tier | State | Action |
|------|-------|--------|
| 1 | `cold` | Delete persisted state and database record. No process to kill. |
| 2 | `warm` | Kill the sandbox process. Delete database record. |
| 3 | `waiting` | Persist workspace state, kill the sandbox process, mark as `cold`. The session is paused so it can be resumed later. |
| 4 | `running` | Never evicted. If all sandboxes are running, the create request returns 503. |

Within each tier, the least-recently-used sandbox is evicted first (ordered by `last_used_at`).
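In code, candidate selection amounts to a sort by `(tier, last_used_at)`. A toy Python sketch of the same rule (not the actual pool implementation):

```python
# Tier order mirrors the eviction table: cold first; running is never evicted.
TIER = {"cold": 0, "warm": 1, "waiting": 2}

def pick_eviction_candidate(sandboxes):
    """Return the best sandbox to evict, or None if all are running (sketch)."""
    candidates = [s for s in sandboxes if s["state"] in TIER]
    if not candidates:
        return None  # all running -> caller returns 503
    return min(candidates, key=lambda s: (TIER[s["state"]], s["last_used_at"]))

victim = pick_eviction_candidate([
    {"id": "a", "state": "running", "last_used_at": 10},
    {"id": "b", "state": "waiting", "last_used_at": 5},
    {"id": "c", "state": "waiting", "last_used_at": 50},
    {"id": "d", "state": "warm", "last_used_at": 99},
])
# "d" wins: a warm sandbox outranks waiting ones regardless of recency
```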
### Eviction query ```sql SELECT * FROM sandboxes WHERE state IN ('cold', 'warm', 'waiting') ORDER BY CASE state WHEN 'cold' THEN 0 WHEN 'warm' THEN 1 WHEN 'waiting' THEN 2 END, last_used_at ASC LIMIT 1 ``` ## Idle Sweep A periodic timer (every 60 seconds) checks for sandboxes in the `waiting` state that have been idle longer than `ASH_IDLE_TIMEOUT_MS` (default: 30 minutes). Idle sandboxes are evicted: workspace is persisted, the process is killed, and the database record is marked `cold`. The associated session is paused so it can be resumed later. ```typescript pool.startIdleSweep(); // Start the periodic timer pool.stopIdleSweep(); // Stop the timer (graceful shutdown) ``` ## Cold Cleanup A separate periodic timer (every 5 minutes) removes cold sandbox entries that haven't been used for 2 hours. This prevents unbounded disk growth from accumulated cold entries. Cold cleanup deletes: - The live workspace directory (`data/sandboxes//`) - The local snapshot directory (`data/sessions//workspace/`) - The database record **Cloud snapshots are preserved**, so sessions can still be resumed from cloud storage after local cleanup. See [State Persistence & Restore](./state-persistence.md) for the full restore fallback chain. ```typescript pool.startColdCleanup(); // Start the periodic timer pool.stopColdCleanup(); // Stop the timer (graceful shutdown) ``` ## Configuration | Environment Variable | Default | Description | |---------------------|---------|-------------| | `ASH_MAX_SANDBOXES` | `1000` | Maximum number of sandbox entries (live + cold) in the database | | `ASH_IDLE_TIMEOUT_MS` | `1800000` (30 min) | How long a `waiting` sandbox can be idle before eviction | | `COLD_CLEANUP_TTL_MS` | `7200000` (2 hr) | How long a `cold` sandbox sits before local files are cleaned up | ## Race Condition Safety `markRunning()` is synchronous (updates the in-memory map immediately). 
This prevents a race where an idle sweep could evict a sandbox between when a message arrives and when the sandbox starts processing it. ```typescript // In the message handler -- synchronous, prevents eviction backend.markRunning(session.sandboxId); ``` The database update is fire-and-forget (asynchronous) since the in-memory map is the source of truth for the running state. ## Server Restart On server startup, `pool.init()` calls `markAllSandboxesCold()`, which updates all sandbox records in the database to `cold`. This is correct because: - All sandbox processes were killed when the server stopped - Cold entries can be evicted or used for workspace restoration during resume - The in-memory live map starts empty ```typescript const marked = await this.db.markAllSandboxesCold(); // "Startup: marked 5 stale sandbox(es) as cold" ``` ## Pool Stats The pool exposes statistics for the health endpoint and Prometheus metrics: ```typescript const stats = await pool.statsAsync(); // { // total: 10, // All entries (live + cold) // cold: 3, // No process // warming: 0, // Starting up // warm: 2, // Ready, no session // waiting: 3, // Idle between messages // running: 2, // Processing a message // maxCapacity: 1000, // resumeWarmHits: 15, // Resumes that found sandbox alive // resumeColdHits: 5, // Resumes that needed new sandbox (total) // resumeColdLocalHits: 3, // Cold resume from local disk // resumeColdCloudHits: 1, // Cold resume from cloud storage // resumeColdFreshHits: 1, // Cold resume with no state available // preWarmHits: 2, // Sessions that claimed a pre-warmed sandbox // } ``` The cold resume counters break down where the workspace came from during a cold resume. See [State Persistence & Restore](./state-persistence.md) for details. 
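The resume counters lend themselves to a simple derived metric. A hypothetical dashboard helper (not part of Ash's API) might compute the warm-resume hit rate:

```typescript
// Hypothetical helper -- fraction of resumes that found the sandbox still
// alive. Field names match the statsAsync() shape shown above.
interface ResumeCounters {
  resumeWarmHits: number;
  resumeColdHits: number; // total cold resumes (local + cloud + fresh)
}

function warmResumeRate({ resumeWarmHits, resumeColdHits }: ResumeCounters): number {
  const total = resumeWarmHits + resumeColdHits;
  return total === 0 ? 1 : resumeWarmHits / total;
}

// With the example stats above: 15 / (15 + 5) = 0.75
```

A persistently low rate suggests sandboxes are being evicted before their sessions resume; raising `ASH_IDLE_TIMEOUT_MS` or pool capacity may help.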
--- # SSE Backpressure Source: https://docs.ash-cloud.ai/architecture/sse-backpressure # SSE Backpressure ## Problem When a fast agent produces messages faster than a slow client can consume them, the server-side write buffer grows without bound. With many concurrent sessions, this leads to unbounded memory usage and eventual out-of-memory crashes. ``` Agent (fast) --> Bridge --> Server --> SSE --> Client (slow) ^^^^^^^^^^ Buffer grows here ``` ## Solution Ash respects backpressure at every boundary in the pipeline. When the downstream consumer cannot accept data, the upstream producer pauses. ### Bridge Side The bridge's `send()` function checks the return value of `socket.write()`. If the kernel buffer is full, it waits for the `drain` event before sending more data. This prevents the bridge from flooding the Unix socket. ### Server Side The `writeSSE()` function in the session routes checks if `response.write()` returns `false` (indicating the TCP send buffer is full). If so, it waits for the `drain` event with a 30-second timeout. ```typescript async function writeSSE(raw: ServerResponse, frame: string): Promise<void> { const canWrite = raw.write(frame); if (!canWrite) { const drained = await Promise.race([ new Promise<boolean>((resolve) => { raw.once('drain', () => resolve(true)); }), new Promise<boolean>((resolve) => { setTimeout(() => resolve(false), SSE_WRITE_TIMEOUT_MS); }), ]); if (!drained) { throw new Error('Client write timeout -- closing stream'); } } } ``` If the client does not drain within the timeout, the stream is closed. This prevents a single slow client from holding a sandbox in the `running` state indefinitely. ## Full Pipeline ```mermaid graph LR SDK["Claude SDK"] -->|Messages| Bridge Bridge -->|Unix Socket
await drain| Server Server -->|SSE
await drain| Client style Bridge fill:#f0f0f0 style Server fill:#f0f0f0 ``` At each arrow, the sender checks backpressure before writing. If the receiver is slow, the sender pauses. The pause propagates upstream through the entire pipeline. ## Memory Bound Memory per connection is bounded by the kernel's TCP send buffer size (typically 128 KB - 1 MB depending on OS configuration) plus one pending SSE frame. There is no application-level buffering. ## Configuration | Constant | Value | Description | |----------|-------|-------------| | `SSE_WRITE_TIMEOUT_MS` | 30,000 ms | Maximum time to wait for a slow client to drain before closing the connection | This value is defined in `@ash-ai/shared` and used by the server's SSE writer. --- # Database Source: https://docs.ash-cloud.ai/architecture/database # Database Ash supports two database backends behind a common interface: SQLite (default) for single-machine deployments and PostgreSQL/CockroachDB for multi-machine setups. ## Configuration | Environment Variable | Default | Description | |---------------------|---------|-------------| | `ASH_DATABASE_URL` | Not set (uses SQLite) | PostgreSQL or CockroachDB connection URL | When `ASH_DATABASE_URL` is not set, Ash creates a SQLite database at `data/ash.db`. When set to a `postgresql://` or `postgres://` URL, Ash connects to the specified Postgres-compatible database. 
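Backend selection hinges on the URL scheme. A minimal sketch of that check (illustrative; both `postgres://` and `postgresql://` select Postgres, anything else falls back to SQLite):

```typescript
// Sketch of the scheme check described above (illustrative helper name).
function usesPostgres(databaseUrl?: string): boolean {
  return databaseUrl !== undefined && /^postgres(ql)?:\/\//.test(databaseUrl);
}

usesPostgres('postgresql://user:pw@db:5432/ash'); // true
usesPostgres('postgres://db/ash');                // true
usesPostgres(undefined);                          // false -> SQLite at data/ash.db
```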
## Backend Selection The `initDb()` factory function selects the backend based on the URL: ```typescript export async function initDb(opts: { dataDir: string; databaseUrl?: string }): Promise<Db> { if (opts.databaseUrl && /^postgres(ql)?:\/\//.test(opts.databaseUrl)) { const pgDb = new PgDb(opts.databaseUrl); await pgDb.init(); return pgDb; } else { return new SqliteDb(opts.dataDir); } } ``` ## Common Interface Both backends implement the same `Db` interface: ```typescript interface Db { // Agents upsertAgent(name, path, tenantId?): Promise; getAgent(name, tenantId?): Promise; listAgents(tenantId?): Promise; deleteAgent(name, tenantId?): Promise; // Sessions insertSession(id, agentName, sandboxId, tenantId?, runnerId?, model?): Promise; updateSessionStatus(id, status): Promise; getSession(id): Promise; listSessions(tenantId?, agent?): Promise; touchSession(id): Promise; // ... plus updateSessionSandbox, updateSessionRunner, listSessionsByRunner // Sandboxes insertSandbox(id, agentName, workspaceDir, sessionId?, tenantId?): Promise; updateSandboxState(id, state): Promise; getSandbox(id): Promise; countSandboxes(): Promise; getBestEvictionCandidate(): Promise; getIdleSandboxes(olderThan): Promise; markAllSandboxesCold(): Promise; // ... plus updateSandboxSession, touchSandbox, deleteSandbox // Messages insertMessage(sessionId, role, content, tenantId?): Promise; listMessages(sessionId, tenantId?, opts?): Promise; // Session Events insertSessionEvent(sessionId, type, data, tenantId?): Promise; insertSessionEvents(events): Promise; listSessionEvents(sessionId, tenantId?, opts?): Promise; // API Keys getApiKeyByHash(keyHash): Promise; insertApiKey(id, tenantId, keyHash, label): Promise; // Lifecycle close(): Promise; } ``` ## SQL Dialect Differences | Feature | SQLite | Postgres | |---------|--------|----------| | Timestamps | `datetime('now')` | `now()::TEXT` | | Upsert | `ON CONFLICT(...) DO UPDATE` | `ON CONFLICT(...)
DO UPDATE` | | Parameters | `?` positional | `$1`, `$2` numbered | | Connection model | Single file, in-process | Connection pool (`pg.Pool`) | | Journal mode | WAL | WAL (default in Postgres) | | Column migration | `try/catch` (no `IF NOT EXISTS`) | `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` | | Sequence assignment | `SELECT MAX(sequence)` in transaction | Atomic subquery in `INSERT ... RETURNING` | ## Connection Retry (Postgres) The Postgres backend retries the initial connection with exponential backoff (1s, 2s, 4s, 8s, 16s -- five attempts total, ~31 seconds). This handles common startup races where the database container is not yet ready. ``` [db] Connection attempt 1 failed, retrying in 1000ms... [db] Connection attempt 2 failed, retrying in 2000ms... ``` ## Tables ### agents ```sql CREATE TABLE agents ( id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL DEFAULT 'default', name TEXT NOT NULL, version INTEGER NOT NULL DEFAULT 1, path TEXT NOT NULL, created_at TEXT NOT NULL, updated_at TEXT NOT NULL, UNIQUE(tenant_id, name) ); ``` ### sessions ```sql CREATE TABLE sessions ( id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL DEFAULT 'default', agent_name TEXT NOT NULL, sandbox_id TEXT NOT NULL, status TEXT NOT NULL DEFAULT 'starting', runner_id TEXT, model TEXT, created_at TEXT NOT NULL, last_active_at TEXT NOT NULL ); ``` ### sandboxes ```sql CREATE TABLE sandboxes ( id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL DEFAULT 'default', session_id TEXT, agent_name TEXT NOT NULL, state TEXT NOT NULL DEFAULT 'warming', workspace_dir TEXT NOT NULL, created_at TEXT NOT NULL, last_used_at TEXT NOT NULL ); ``` ### messages ```sql CREATE TABLE messages ( id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL DEFAULT 'default', session_id TEXT NOT NULL, role TEXT NOT NULL, content TEXT NOT NULL, sequence INTEGER NOT NULL, created_at TEXT NOT NULL, UNIQUE(tenant_id, session_id, sequence) ); ``` ### session_events ```sql CREATE TABLE session_events ( id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL 
DEFAULT 'default', session_id TEXT NOT NULL, type TEXT NOT NULL, data TEXT, sequence INTEGER NOT NULL, created_at TEXT NOT NULL, UNIQUE(tenant_id, session_id, sequence) ); ``` ### api_keys ```sql CREATE TABLE api_keys ( id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL, key_hash TEXT NOT NULL UNIQUE, label TEXT NOT NULL DEFAULT '', created_at TEXT NOT NULL ); ``` ## Production Recommendation For single-machine deployments, SQLite with WAL mode is sufficient and requires no external dependencies. For multi-machine deployments (coordinator + runners sharing state), use PostgreSQL or CockroachDB so all nodes share the same database. --- # Scaling Architecture Source: https://docs.ash-cloud.ai/architecture/scaling # Scaling Architecture Ash scales horizontally in two dimensions: the **data plane** (runners that host sandboxes) and the **control plane** (coordinators that route requests). Each dimension scales independently. ## Three Operating Modes ```mermaid graph TB subgraph "Mode 1: Standalone" direction LR C1["Client"] -->|HTTP + SSE| S1["Ash Server
:4100"] S1 --> P1["SandboxPool"] S1 --> DB1["SQLite"] P1 --> B1["Bridge 1"] P1 --> B2["Bridge 2"] end ``` ```mermaid graph TB subgraph "Mode 2: Coordinator + N Runners" direction LR C2["Client"] -->|HTTP + SSE| S2["Coordinator
:4100"] S2 --> DB2["Postgres / CRDB"] S2 -->|HTTP| R1["Runner 1"] S2 -->|HTTP| R2["Runner 2"] S2 -->|HTTP| R3["Runner N"] end ``` ```mermaid graph TB subgraph "Mode 3: N Coordinators + N Runners" direction TB C3["Client"] -->|HTTPS| LB["Load Balancer"] LB --> S3a["Coordinator 1"] LB --> S3b["Coordinator 2"] LB --> S3c["Coordinator M"] S3a & S3b & S3c --> DB3["CRDB"] S3a & S3b & S3c -->|HTTP| R4["Runner 1"] S3a & S3b & S3c -->|HTTP| R5["Runner 2"] S3a & S3b & S3c -->|HTTP| R6["Runner N"] end ``` **Start with Mode 1. Move to Mode 2 when one machine isn't enough. Move to Mode 3 when one coordinator isn't enough or you need redundancy.** ## Session Routing Every session is pinned to a runner. The coordinator selects the runner with the most available capacity at session creation time. ```mermaid sequenceDiagram participant C as Client participant Co as Coordinator participant DB as Database participant R as Runner (selected) C->>Co: POST /api/sessions {agent: "my-agent"} Co->>DB: SELECT best runner (most capacity) DB-->>Co: runner-2 (70 available slots) Co->>R: POST /runner/sandboxes R-->>Co: {sandboxId, workspaceDir} Co->>DB: INSERT session (runner_id = "runner-2") Co-->>C: 201 {session} ``` Once assigned, all subsequent messages for that session route to the same runner: ```mermaid sequenceDiagram participant C as Client participant Co as Coordinator participant DB as Database participant R as Runner (same) C->>Co: POST /api/sessions/:id/messages Co->>DB: SELECT session → runner_id = "runner-2" Co->>R: POST /runner/sandboxes/:id/cmd R-->>Co: SSE stream (bridge events) Co-->>C: SSE stream (proxied) ``` ## Runner Registration and Heartbeat Runners self-register with the control plane and send periodic heartbeats with pool statistics. 
```mermaid sequenceDiagram participant R as Runner participant Co as Coordinator participant DB as Database R->>Co: POST /api/internal/runners/register Co->>DB: UPSERT runners (id, host, port, max) Co-->>R: {ok: true} loop Every 10 seconds R->>Co: POST /api/internal/runners/heartbeat Note right of R: {runnerId, stats: {running: 12, warming: 3, ...}} Co->>DB: UPDATE runners SET active_count, warming_count, last_heartbeat_at Co-->>R: {ok: true} end ``` ## Graceful Runner Shutdown When a runner shuts down cleanly, it deregisters from the coordinator. Sessions are paused immediately — no 30-second wait. ```mermaid sequenceDiagram participant R as Runner participant Co as Coordinator participant DB as Database Note over R: SIGTERM received R->>Co: POST /api/internal/runners/deregister Co->>DB: UPDATE sessions SET status='paused' WHERE runner_id AND status IN ('active','starting') Co->>DB: DELETE FROM runners WHERE id='runner-1' Co-->>R: {ok: true} Note over R: Destroy sandboxes, close server, exit ``` ## Dead Runner Detection If a runner crashes without deregistering, the coordinator sweeps for dead runners every 30 seconds (with random 0-5s jitter to prevent thundering herd across coordinators). Sessions are bulk-paused in a single query. ```mermaid sequenceDiagram participant Co as Coordinator participant DB as Database participant C as Client Note over Co: Liveness sweep (every 30s + jitter) Co->>DB: SELECT runners WHERE last_heartbeat_at <= cutoff DB-->>Co: [runner-3 is stale] Co->>DB: UPDATE sessions SET status='paused' WHERE runner_id='runner-3' AND status IN ('active','starting') Co->>DB: DELETE FROM runners WHERE id='runner-3' Note over C: Client detects disconnect C->>Co: POST /api/sessions/:id/resume Co->>DB: SELECT best healthy runner Note over Co: Cold restore on new runner Co-->>C: 200 {session: {status: 'active'}} ``` ## Multi-Coordinator (Mode 3) In multi-coordinator mode, all coordinators share the same database (Postgres or CockroachDB). 
The runner registry and session state live in the database — coordinators hold no authoritative state in memory. ```mermaid graph TB subgraph "Coordinator 1" Co1["Fastify :4100"] Cache1["Backend Cache
(connection pool)"] Co1 --> Cache1 end subgraph "Coordinator 2" Co2["Fastify :4100"] Cache2["Backend Cache
(connection pool)"] Co2 --> Cache2 end DB[("CRDB
runners table
sessions table")] Co1 --> DB Co2 --> DB Cache1 -->|HTTP| R1["Runner 1"] Cache1 -->|HTTP| R2["Runner 2"] Cache2 -->|HTTP| R1 Cache2 -->|HTTP| R2 R1 -->|Heartbeat| LB["Load Balancer"] R2 -->|Heartbeat| LB LB --> Co1 LB --> Co2 ``` **Key properties:** - Any coordinator can route to any runner (DB is source of truth) - Coordinators don't talk to each other - Each coordinator has a unique ID (`hostname-PID`) reported in `GET /health` and startup logs - Liveness sweep runs on all coordinators independently (idempotent, with random jitter to prevent thundering herd) - SSE reconnection handles coordinator failover (no session migration) ### Coordinator Failover ```mermaid sequenceDiagram participant C as Client participant LB as Load Balancer participant Co1 as Coordinator 1 participant Co2 as Coordinator 2 participant DB as Database participant R as Runner C->>LB: SSE stream (session ABC) LB->>Co1: Forward Co1->>R: Proxy bridge events R-->>Co1: SSE events Co1-->>C: SSE events Note over Co1: Coordinator 1 dies C--xCo1: Connection lost Note over C: SSE auto-reconnects C->>LB: Reconnect LB->>Co2: Route to healthy coordinator Co2->>DB: SELECT session ABC → runner_id Co2->>R: Re-establish proxy R-->>Co2: SSE events resume Co2-->>C: SSE events resume ``` ## Capacity Estimates | Component | Per Instance | Limit | Bottleneck | |-----------|-------------|-------|------------| | Coordinator | ~10,000 SSE connections | Network/CPU | SSE proxy fan-out | | Runner (8 vCPU, 16GB) | 30-120 sessions | Memory | Depends on sandbox memory limit | | Database (CRDB) | ~5,000 queries/sec | Single-node CRDB | Session creation path only | **Scaling math:** - 3 coordinators = ~30,000 concurrent SSE streams - 10 runners (256MB/sandbox) = ~600 concurrent sessions - You'll run out of runner capacity before coordinator capacity ## Database Tables for Scaling ```mermaid erDiagram runners { text id PK text host int port int max_sandboxes int active_count int warming_count text last_heartbeat_at text 
registered_at } sessions { text id PK text agent_name text sandbox_id text status text runner_id FK text created_at text last_active_at } runners ||--o{ sessions : "hosts" ``` ## Environment Variables ### Coordinator | Variable | Default | Description | |----------|---------|-------------| | `ASH_MODE` | `standalone` | Set to `coordinator` for multi-runner mode | | `ASH_DATABASE_URL` | — | Postgres/CRDB connection string (required for multi-coordinator) | | `ASH_PORT` | `4100` | HTTP listen port | | `ASH_INTERNAL_SECRET` | — | Shared secret for runner auth. If set, all `/api/internal/*` endpoints require `Authorization: Bearer `. **Required for multi-machine deployments.** | ### Runner | Variable | Default | Description | |----------|---------|-------------| | `ASH_RUNNER_ID` | `runner-{pid}` | Unique runner identifier | | `ASH_RUNNER_PORT` | `4200` | HTTP listen port | | `ASH_SERVER_URL` | — | Coordinator URL for registration (use LB URL in multi-coordinator mode) | | `ASH_RUNNER_ADVERTISE_HOST` | — | Host reachable from coordinator | | `ASH_MAX_SANDBOXES` | `1000` | Maximum concurrent sandboxes | | `ASH_INTERNAL_SECRET` | — | Must match the coordinator's `ASH_INTERNAL_SECRET` | ## When to Scale | Symptom | Action | |---------|--------| | CPU/memory maxed on single machine | Add runners (Mode 2) | | Need high availability for control plane | Add coordinators (Mode 3) | | SSE connections saturating coordinator | Add coordinators (Mode 3) | | Session creation latency increasing | Add runners or increase `ASH_MAX_SANDBOXES` | | All runners at capacity | Add more runner nodes | Don't scale until you have numbers. A single standalone Ash server handles dozens of concurrent sessions. Use `ASH_DEBUG_TIMING=1` and the `/metrics` endpoint to find the actual bottleneck before adding complexity. --- # Design Decisions Source: https://docs.ash-cloud.ai/architecture/decisions # Design Decisions Architecture Decision Records (ADRs) for significant technical choices in Ash. 
## ADR 0001: SDK Passthrough Types **Date**: 2025-01-15 | **Status**: Accepted **Decision**: Pass Claude Code SDK `Message` objects through the entire pipeline untranslated. The bridge yields raw SDK messages over the Unix socket. The server wraps them in SSE envelopes and streams them to the client. No custom `BridgeEvent` or `SSEEventType` translation layers. **Context**: Ash originally defined three parallel type systems: `BridgeEvent` (7 variants in the bridge), `SSEEventType` (6 values in the server), and a translation layer converting SDK messages to bridge events. Every SDK message was translated twice. **Why**: - One type system instead of three -- less code to maintain - SDK type changes propagate automatically through the pipeline (no manual translation updates) - Clients (CLI, SDK) can use SDK types directly for type-safe message handling - Translation layers do not protect against SDK breaking changes -- they just delay discovery **What Ash owns**: Bridge commands (`query`, `resume`, `interrupt`, `shutdown`), orchestration types (`Session`, `Agent`, `SandboxInfo`, `PoolStats`), and two envelope events (`ready`, `error`). Everything else is SDK passthrough. **Trade-off**: Tighter coupling to the SDK's type shape. If the SDK changes its `Message` type, the wire format changes. This is acceptable because the SDK is the primary dependency -- if it changes, Ash must update regardless. --- ## ADR 0002: HTTP over gRPC for Runner Communication **Date**: 2026-02-18 | **Status**: Accepted **Decision**: Use HTTP + SSE for communication between the server and runner processes instead of gRPC. **Context**: Step 08 of the implementation plan adds runner processes that manage sandboxes on remote hosts. The server needs to communicate with runners for sandbox lifecycle operations and command streaming. **Why**: - **Simplicity**: gRPC adds protobuf schemas, code generation, the `@grpc/grpc-js` dependency, and binary debugging difficulty. 
HTTP uses the same Fastify framework, same patterns, same tools (curl, Swagger, browser). - **No performance bottleneck**: LLM inference takes 2-10 seconds. The HTTP hop from server to runner adds single-digit milliseconds. gRPC would save 1-2ms per request -- irrelevant at this scale. - **Ecosystem alignment**: Runners use the same Fastify framework as the server. Tests use the same patterns. One less technology in the stack. **Alternatives considered**: - **gRPC with bidirectional streaming**: More complex than needed. The command/event flow is naturally request-response with server-push, which SSE handles well. - **WebSocket**: More complex lifecycle management and message framing for the same use case. SSE already handles server-push-only flows. **Trade-off**: If true bidirectional streaming to runners becomes necessary, this decision would need revisiting. This is unlikely because the bridge protocol is inherently request/response. --- # Ash vs ComputeSDK Source: https://docs.ash-cloud.ai/comparisons/computesdk # Ash vs ComputeSDK [ComputeSDK](https://www.computesdk.com/) and Ash solve different but adjacent problems. This page breaks down where they overlap, where they diverge, and when to use each. ## TL;DR - **Ash** is an AI agent platform -- deploy a Claude agent as a folder, get a production REST API with sessions, streaming, sandboxing, and persistence. - **ComputeSDK** is a sandbox abstraction layer -- one API to create isolated compute environments across 8+ cloud providers (E2B, Modal, Railway, etc.). They're complementary, not competitive. ComputeSDK could be a sandbox *provider* that Ash delegates to. 
## Different Problems | | Ash | ComputeSDK | |---|---|---| | **What it is** | Self-hostable system for deploying AI agents | Unified API for generic sandbox compute | | **Core abstraction** | Agent sessions (deploy a CLAUDE.md, chat via REST/SSE) | Sandboxes (create environments, run code/commands) | | **Primary use case** | Host AI agents that persist, resume, and stream | Execute untrusted code, spin up dev environments | | **AI-specific?** | Yes -- thin wrapper around Claude Code SDK | No -- provider-agnostic compute for any workload | | **Infra model** | Self-hosted (your Docker, your machine) | SaaS gateway routing to cloud providers | ## Feature Comparison | Feature | Ash | ComputeSDK | |---|---|---| | **Sandbox isolation** | Bubblewrap, cgroups v2, env allowlist | Provider-dependent | | **Session persistence** | SQLite/Postgres, survives restarts | Stateless by default; named sandboxes for reuse | | **Session resume** | Full context preservation, pause/resume, cross-machine | Not conversation-oriented | | **Streaming** | Native SSE with typed events, backpressure | Request/response for commands | | **Agent definition** | Folder with `CLAUDE.md` -- minimal | N/A -- not agent-oriented | | **Multi-provider** | N/A -- runs your own sandboxes | 8+ providers, swap via env var | | **Overlays/templates** | N/A | Smart overlays with symlinks for fast bootstrap | | **Managed servers** | N/A | Supervised long-lived processes with health checks | | **Filesystem API** | Agent has full workspace inside sandbox | `writeFile`, `readFile`, `mkdir`, etc. 
| | **Shell execution** | Agent runs commands via Claude Code SDK | `runCommand()` API | | **Observability** | Prometheus metrics, structured logs, `/health` | Not documented | | **Multi-machine** | Built-in coordinator + runner architecture | Handled by underlying providers | | **SDKs** | TypeScript + Python | TypeScript | | **CLI** | Full lifecycle (`ash start/deploy/session/health`) | Not documented | | **Self-hostable** | Yes -- Docker, bare metal, or cloud VMs | No -- SaaS gateway required | | **Open source** | Yes | Partially (client SDK open, gateway is SaaS) | ## Architecture Differences ### Ash ``` CLI/SDK ──HTTP──> ash-server ──in-process──> SandboxPool ──unix socket──> Bridge ──> Claude Code SDK (your infra) (bubblewrap) (in sandbox) ``` Ash owns the full stack. Your server, your sandboxes, your data. The server manages sandbox lifecycle directly using OS-level isolation (bubblewrap on Linux, ulimit on macOS). ### ComputeSDK ``` Your code ──HTTP──> ComputeSDK Gateway ──HTTP──> Cloud Provider (E2B / Modal / Railway / ...) (their SaaS) (their infra) ``` ComputeSDK is a routing layer. Your code talks to their gateway, which translates to provider-specific APIs. You don't manage sandboxes -- the provider does. 
## When to Use Each ### Use Ash when you need: - **AI agents that persist** -- sessions that survive restarts, resume days later, hand off between machines - **Full control over infrastructure** -- self-hosted, no external dependencies, data stays on your machines - **Deep sandbox isolation** -- cgroups, bubblewrap, environment allowlists you configure - **Streaming conversations** -- SSE with typed events, backpressure, real-time token streaming - **An agent platform** -- deploy agents as folders, manage via CLI/SDK, monitor with Prometheus ### Use ComputeSDK when you need: - **Generic sandbox compute** -- run arbitrary code, not specifically AI conversations - **Provider flexibility** -- switch between E2B, Modal, Railway without code changes - **Managed infrastructure** -- don't want to run your own servers - **Quick ephemeral environments** -- spin up a sandbox, run a script, tear it down - **Pre-configured templates** -- overlays for fast environment bootstrap ### Use both when: You want Ash's agent orchestration with cloud-hosted sandboxes instead of local ones. A future `SandboxProvider` interface in Ash could delegate sandbox creation to ComputeSDK-supported providers, giving you Ash's session management and streaming with E2B's or Modal's compute. ## Onboarding Comparison ### ComputeSDK -- 3 lines ```typescript const sandbox = await compute.sandbox.create(); const result = await sandbox.runCode('print("Hello World!")'); await sandbox.destroy(); ``` ### Ash -- 4 commands ```bash ash start ash deploy ./my-agent --name my-agent ash session create my-agent ash session send "Hello" ``` ComputeSDK's onboarding is simpler because it solves a simpler problem -- create a sandbox and run code. Ash's extra steps (start server, deploy agent, create session) exist because Ash manages persistent, stateful agent sessions rather than ephemeral compute. 
## Summary Ash and ComputeSDK are in different categories: - **Ash** = AI agent orchestration platform (sessions, streaming, persistence, isolation) - **ComputeSDK** = sandbox compute abstraction (multi-provider, ephemeral, code execution) If you're deploying Claude agents that need production infrastructure, use Ash. If you need generic sandboxed code execution across cloud providers, use ComputeSDK. If you want both, they can complement each other. --- # Ash vs Blaxel Source: https://docs.ash-cloud.ai/comparisons/blaxel # Ash vs Blaxel [Blaxel](https://blaxel.ai) and Ash both provide infrastructure for AI agents, but they make different tradeoffs. This page breaks down where they overlap, where they diverge, and when to use each. ## TL;DR - **Ash** is a self-hostable agent platform -- deploy Claude agents as folders, get production APIs with sessions, streaming, sandboxing, and persistence on your own infrastructure. Sub-millisecond per-message overhead, 44ms cold start, 1.7ms warm resume. - **Blaxel** is a managed cloud platform -- serverless agent hosting, perpetual sandboxes, model gateway, and observability as a service. The core difference: Ash runs on your machines; Blaxel runs on theirs. 
## Different Tradeoffs | | Ash | Blaxel | |---|---|---| | **What it is** | Self-hostable agent orchestration | Managed cloud agent platform | | **Infrastructure model** | Your servers (Docker, EC2, GCE, bare metal) | Their cloud (serverless) | | **Agent definition** | Folder with `CLAUDE.md` | HTTP server (any framework) | | **AI model** | Claude (via Claude Code SDK) | Any model (model gateway) | | **Sandbox model** | OS-level (bubblewrap, cgroups) | MicroVMs | | **Session persistence** | SQLite/Postgres, survives restarts | Snapshot-based | | **Pricing** | Self-hosted (pay for compute + Claude API) | Usage-based SaaS | ## Feature Comparison | Feature | Ash | Blaxel | |---|---|---| | **Agent hosting** | Yes -- deploy folders, get REST API | Yes -- serverless endpoints | | **Sandbox isolation** | Bubblewrap, cgroups v2, env allowlist | MicroVMs (EROFS + tmpfs) | | **Session creation (cold start)** | 44ms p50 (process spawn + bridge connect) | ~25ms (MicroVM resume) | | **Session resume (warm)** | 1.7ms p50 (DB lookup + status flip) | ~25ms (MicroVM resume) | | **Per-message overhead** | 0.41ms p50 (sub-millisecond) | Not published | | **Session persistence** | SQLite/Postgres, pause/resume | Snapshot-based, scale-to-zero | | **Streaming** | Native SSE with typed events, backpressure | Framework-dependent | | **Model support** | Claude (deep SDK integration) | Multi-model (gateway routing) | | **Observability** | Prometheus metrics, structured logs, `/health` | Built-in logs, traces, metrics | | **MCP servers** | Per-agent and per-session MCP config | Hosted MCP servers | | **Batch jobs** | Not built-in | Yes -- async compute | | **Multi-machine** | Built-in coordinator + runner | Managed by platform | | **SDKs** | TypeScript + Python | TypeScript + Python | | **CLI** | Full lifecycle management | Yes | | **Self-hostable** | Yes | No | | **Open source** | Yes | No | | **Data residency** | Full control (your machines) | Their cloud | ## Architecture Differences ### 
Ash ``` CLI/SDK ──HTTP──> Ash Server ──in-process──> SandboxPool ──unix socket──> Bridge ──> Claude Code SDK (your infra) (bubblewrap) (in sandbox) ``` Ash owns the full stack. Your server, your sandboxes, your data. The server manages sandbox lifecycle directly using OS-level isolation. ### Blaxel ``` Your App ──HTTP──> Blaxel Cloud ──> Agent Endpoint (serverless) ──> Model Gateway ──> LLM Provider (their infra) (MicroVM sandbox) ``` Blaxel is a managed platform. You deploy agents to their cloud, which handles scaling, sandboxing, routing, and observability. ## When to Use Each ### Use Ash when: - **You need infrastructure control** -- data must stay on your machines, compliance requirements, air-gapped environments - **You're building with Claude** -- Ash's deep Claude Code SDK integration gives you the full power of the SDK (sessions, tools, MCP, skills) with zero translation layer - **Sessions must persist across restarts** -- Ash's SQLite/Postgres persistence survives crashes, supports pause/resume, and enables multi-day sessions - **You want self-hosted, open source** -- inspect the code, modify the behavior, no vendor lock-in ### Use Blaxel when: - **You want managed infrastructure** -- don't want to run your own servers, prefer pay-per-use - **You use multiple LLM providers** -- Blaxel's model gateway routes between providers with fallback and telemetry - **You want built-in observability** -- logs, traces, and metrics without setting up Prometheus or Grafana - **Framework flexibility matters** -- Blaxel hosts any HTTP server, not just Claude agents ### Use Ash if you're unsure: Self-hosted means you can migrate away at any time. You're not locked into a platform. Start with Ash, and if you later need managed infrastructure, the migration path is straightforward since your agents are just folders. 
## Onboarding Comparison ### Ash -- 3 commands ```bash ash start ash deploy ./my-agent --name my-agent ash chat my-agent "Hello" ``` ### Blaxel -- framework setup + deploy ```bash bl login bl init my-agent # ... write HTTP server code ... bl deploy bl run my-agent --data '{"inputs": "Hello"}' ``` Ash's agent definition is simpler (a folder with `CLAUDE.md`) because it targets a specific SDK. Blaxel requires writing an HTTP server because it supports any framework. ## Performance Ash publishes [real benchmarks](/guides/monitoring). Here's how the numbers compare: | Metric | Ash (measured) | Blaxel (claimed) | |---|---|---| | **Session creation** | 44ms p50 | ~25ms (MicroVM resume) | | **Warm resume** | 1.7ms p50 | ~25ms (MicroVM resume) | | **Cold resume** | 32ms p50 | Not published | | **Per-message overhead** | 0.41ms p50 | Not published | | **Pool operations** | 0.03ms p50 | Not published | Blaxel's 25ms number is for MicroVM resume from a snapshot. Ash's 1.7ms warm resume is actually faster because it's just a DB lookup + status flip -- the sandbox process is still alive. For cold starts (new session creation), Ash's 44ms and Blaxel's ~25ms are in the same ballpark. In both cases, the real latency users feel is dominated by the LLM API response time (~1-3 seconds), not the platform overhead. ## Summary | Dimension | Ash | Blaxel | |---|---|---| | **Control** | Full (self-hosted, open source) | Managed (their cloud) | | **Simplicity** | Agent = folder with `CLAUDE.md` | Agent = HTTP server | | **AI model** | Claude (deep integration) | Any model (gateway) | | **Session creation** | 44ms p50 | ~25ms (claimed) | | **Warm resume** | 1.7ms p50 | ~25ms (claimed) | | **Per-message overhead** | 0.41ms p50 | Not published | | **Best for** | Teams who want control + Claude | Teams who want managed + multi-model | Both are solid choices. The decision comes down to whether you want to own the infrastructure or outsource it. 
--- # Development Setup Source: https://docs.ash-cloud.ai/contributing/development-setup # Development Setup Build Ash from source and run it locally. ## Prerequisites - **Node.js** >= 20 - **pnpm** >= 9 - **Docker** (for sandbox isolation and `ash start`) ## Clone and Install ```bash git clone https://github.com/ash-ai-org/ash.git cd ash pnpm install pnpm build ``` ## Dev Commands | Command | Description | |---------|-------------| | `make build` | Build all packages | | `make test` | Run unit tests | | `make typecheck` | Type-check all packages | | `make test-integration` | Run integration tests (starts real processes) | | `make dev` | Build Docker image, start server, deploy QA Bot agent, start QA Bot UI | | `make dev-no-sandbox` | Start server + QA Bot natively (no Docker, no sandbox isolation) | | `make docker-build` | Build local `ash-dev` Docker image | | `make docker-start` | Build image and start server in Docker | | `make docker-stop` | Stop the server container | | `make docker-status` | Show container status and health | | `make docker-logs` | Show container logs | | `make kill` | Kill processes on dev ports (4100, 3100) and stop Docker | | `make clean` | Remove build artifacts | ### Quick Start (with Docker) ```bash make dev ``` This builds the Docker image, starts the Ash server at `http://localhost:4100`, deploys the QA Bot example agent, and starts the QA Bot web UI at `http://localhost:3100`. ### Quick Start (without Docker) ```bash make dev-no-sandbox ``` This starts both the server and QA Bot natively. No sandbox isolation -- agent code runs in the same process context. Suitable for development when Docker is unavailable. 
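After `make dev` or `make docker-start`, the server answers at `http://localhost:4100`, but the container may take a moment before it accepts requests. In scripts it can help to wait for the server's `/health` endpoint before deploying agents. A minimal readiness-poll sketch -- the endpoint and port come from these docs, while the helper itself and its injectable `fetchFn` are illustrative, not part of Ash:

```typescript
// Poll a URL until it responds with an OK status, or give up after `retries`.
// The injectable fetchFn makes the helper easy to test without a live server.
async function waitForHealthy(
  url: string,
  fetchFn: (u: string) => Promise<{ ok: boolean }>,
  retries = 20,
  delayMs = 250,
): Promise<boolean> {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      if ((await fetchFn(url)).ok) return true;
    } catch {
      // Connection refused while the container is still starting -- keep polling.
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}

// Usage: await waitForHealthy('http://localhost:4100/health', fetch);
```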
## Running a Single Package ```bash # Server only (native, with real Claude SDK) ASH_REAL_SDK=1 pnpm --filter '@ash-ai/server' dev # QA Bot web UI only (needs server running separately) pnpm --filter qa-bot dev # Build a single package pnpm --filter '@ash-ai/shared' build # Test a single package pnpm --filter '@ash-ai/server' test ``` ## Using the CLI from Source Instead of the globally installed `ash`, run the CLI directly with `tsx`: ```bash npx tsx packages/cli/src/index.ts ``` Examples: ```bash # Start server with local dev image npx tsx packages/cli/src/index.ts start --image ash-dev --no-pull # Deploy an agent npx tsx packages/cli/src/index.ts deploy ./examples/qa-bot/agent --name qa-bot # Check status npx tsx packages/cli/src/index.ts status # Check health npx tsx packages/cli/src/index.ts health ``` ## OpenAPI and Python SDK Generation ```bash # Generate OpenAPI spec from Fastify route schemas make openapi # Generate Python SDK from OpenAPI spec (requires openapi-python-client) make sdk-python ``` The OpenAPI spec is generated by starting the server, extracting the schema from Fastify's Swagger plugin, and writing it to `packages/server/openapi.json` (also copied to `docs/openapi.json`). The Python SDK is generated from this spec using `openapi-python-client`, producing the `packages/sdk-python/` package. --- # Project Structure Source: https://docs.ash-cloud.ai/contributing/project-structure # Project Structure Ash is a pnpm monorepo with seven packages, each with a specific responsibility. ## Package Map | Package | npm Name | Description | |---------|----------|-------------| | `packages/shared` | `@ash-ai/shared` | Types, protocol definitions, constants. Zero runtime dependencies. Every other package depends on this. | | `packages/sandbox` | `@ash-ai/sandbox` | `SandboxManager` (process lifecycle), `SandboxPool` (capacity/eviction), `BridgeClient` (Unix socket client), resource limits, state persistence. Used by both server and runner. 
| | `packages/bridge` | `@ash-ai/bridge` | Runs inside each sandbox process. Listens on a Unix socket, receives commands, calls the Claude Code SDK (`@anthropic-ai/claude-code`), streams responses back. | | `packages/server` | `@ash-ai/server` | Fastify REST API. Agent registry, session routing, SSE streaming, database access (SQLite + Postgres). The main entry point. | | `packages/runner` | `@ash-ai/runner` | Worker node for multi-machine deployments. Manages sandboxes on a remote host. Registers with the server via heartbeat. | | `packages/sdk` | `@ash-ai/sdk` | TypeScript client library. `AshClient` class, SSE stream parser, re-exported types. | | `packages/cli` | `@ash-ai/cli` | `ash` command-line tool. Server lifecycle (Docker), agent deployment, session management. | ### Supporting directories | Directory | Description | |-----------|-------------| | `packages/sdk-python` | Python SDK, auto-generated from OpenAPI spec | | `examples/qa-bot` | Next.js chat app that uses Ash to power a QA bot | | `examples/hosted-agent` | Minimal example agent definition (CLAUDE.md + config) | | `examples/python-bot` | Python SDK usage example | | `docs/` | Architecture docs, ADRs, feature docs, runbooks, benchmarks | | `test/` | Integration tests and benchmarks (cross-package) | | `scripts/` | Deployment scripts (EC2, GCE) | ## Dependency Graph ```mermaid graph TD shared["shared
(types, protocol, constants)"] sandbox["sandbox
(manager, pool, bridge client)"] bridge["bridge
(in-sandbox process)"] server["server
(Fastify API, DB)"] runner["runner
(remote worker)"] sdk["sdk
(TypeScript client)"] cli["cli
(ash command)"] sandbox --> shared bridge --> shared server --> shared server --> sandbox runner --> shared runner --> sandbox sdk --> shared cli --> shared ``` The key insight: `sandbox` is a **library**, not a standalone process. It is imported by both `server` (standalone mode) and `runner` (multi-machine mode). ## Build Order Packages must be built in dependency order: 1. `shared` (no dependencies) 2. `sandbox` (depends on `shared`) 3. Everything else (`bridge`, `server`, `runner`, `sdk`, `cli` depend on `shared` and/or `sandbox`) `pnpm build` at the root handles this automatically via workspace dependency resolution. ## Module System All packages use ESM with TypeScript's `NodeNext` module resolution. Import paths include the `.js` extension: ```typescript // Relative imports reference the compiled .js path, even from .ts source (names illustrative): import { encode } from './protocol.js'; ``` ## Key Conventions 1. **SDK types pass through.** Ash uses the Claude Code SDK's `Message` type directly throughout the pipeline. Do not create wrapper types for conversation data. See [ADR 0001](/architecture/decisions#adr-0001-sdk-passthrough-types). 2. **Test boundaries, not glue.** Test API contracts, state transitions, protocol serialization, failure modes, and security invariants. Do not test trivial wrappers, type re-exports, or config loading. 3. **Document what you build.** Features go in `docs/features/`, decisions in `docs/decisions/`, benchmarks in `docs/benchmarks/`. If it is not documented, it is not finished. --- # Testing Guide Source: https://docs.ash-cloud.ai/contributing/testing # Testing Guide ## Philosophy **The test is the spec.** If the behavior is not tested, it is not guaranteed. Tests encode what the system promises. When requirements change, change the test first, then change the code. ## Test Pyramid | Layer | Count | Runner | Description | |-------|-------|--------|-------------| | **Unit** | ~50 | `pnpm test` | Protocol encode/decode, state machines, validators, helpers. Fast, no I/O. 
| | **Integration** | ~15 | `pnpm test:integration` | Full lifecycle: start server, deploy agent, create session, send messages, verify responses. Uses real sockets, real files, real processes (mocked Claude SDK). | | **Isolation** | Linux only | `pnpm test:isolation` | Sandbox security: verify env leaks are blocked, filesystem escapes fail, resource limits are enforced. Requires bubblewrap (bwrap). | | **Load** | On demand | `pnpm bench` | Latency and throughput benchmarks. Pool operations, sandbox startup, message overhead. | ## Running Tests ```bash # All unit tests across all packages pnpm test # Integration tests (starts real server processes) pnpm test:integration # Sandbox isolation tests (Linux with bwrap only) pnpm test:isolation # Benchmarks pnpm bench # Single package pnpm --filter '@ash-ai/server' test pnpm --filter '@ash-ai/shared' test ``` ## What to Test ### Test boundaries Protocol serialization (encode/decode round-trip), API request/response contracts, database queries, bridge command/event handling. These are the surfaces where bugs hide. ```typescript // Good: tests the encode/decode contract test('encode then decode round-trips a query command', () => { const cmd: QueryCommand = { cmd: 'query', prompt: 'hello', sessionId: 'abc' }; const decoded = decode(encode(cmd)); expect(decoded).toEqual(cmd); }); ``` ### Test failure modes What happens when the bridge crashes mid-stream? When the client disconnects? When the sandbox runs out of memory? When the database is unreachable? These are the scenarios that distinguish a demo from a system. 
```typescript // Good: tests crash recovery behavior test('session transitions to error when sandbox crashes', async () => { const session = await createSession('test-agent'); // Kill the sandbox process sandbox.process.kill('SIGKILL'); // Verify session status const updated = await getSession(session.id); expect(updated.status).toBe('error'); }); ``` ### Test invariants The sandbox environment never contains host secrets. An ended session rejects new messages. Eviction never touches a running sandbox. These are the properties that must always hold. ```typescript // Good: tests a security invariant test('sandbox env does not contain host secrets', () => { process.env.AWS_SECRET_ACCESS_KEY = 'supersecret'; const env = buildSandboxEnv(); expect(env.AWS_SECRET_ACCESS_KEY).toBeUndefined(); }); ``` ## What NOT to Test - **Trivial wrappers**: If a function just calls another function and returns the result, testing it adds no value. - **Type re-exports**: `export type { Session } from '@ash-ai/shared'` does not need a test. - **Config loading**: Unless the loading logic has branching or defaults that matter, skip it. ## Mocking Strategy **Mock the Claude SDK, not the OS.** - Use real Unix sockets, real files, real child processes. - Mock `@anthropic-ai/claude-code` to return predictable responses. - Do not mock `fs`, `net`, `child_process`, or `http`. If the test needs these, use them for real. The bridge package tests mock the SDK's `query()` function to yield controlled message sequences. Everything else (socket communication, process lifecycle, file I/O) uses real system calls. 
```typescript // Good: mock the SDK, use real sockets const mockSdk = { async *query(prompt: string) { yield { type: 'assistant', message: { content: [{ type: 'text', text: 'Hello' }] } }; yield { type: 'result', subtype: 'success' }; }, }; // Bad: mock the filesystem jest.mock('fs'); // Don't do this ``` --- # Release Process Source: https://docs.ash-cloud.ai/contributing/releases # Release Process Ash uses [Changesets](https://github.com/changesets/changesets) for versioning, changelogs, and npm publishing. ## Changesets Every pull request that changes package behavior must include a changeset. A changeset is a small markdown file in `.changeset/` that describes what changed and which packages are affected. ### Creating a Changeset ```bash pnpm changeset ``` This launches an interactive prompt that asks: 1. Which packages changed? 2. What type of bump for each? (patch, minor, major) 3. A one-sentence summary of the change The result is a file like `.changeset/cool-dogs-laugh.md`: ```markdown --- "@ash-ai/server": minor "@ash-ai/shared": patch --- Add session events timeline API for tracking agent actions. ``` ### Bump Types | Type | When to use | Examples | |------|-------------|---------| | `patch` | Bug fixes, internal refactors, dependency updates | Fix session timeout, update test helpers, bump vitest | | `minor` | New features, new API endpoints, new CLI commands | Add file listing endpoint, add `ash logs` command | | `major` | Breaking API changes, removed features, changed wire formats | Remove deprecated endpoint, change SSE event names | ### Rules - **One changeset per PR.** If a PR does one thing, one changeset. If it does two unrelated things, split the PR. - **Only include packages that changed.** Check which `packages/*/` directories your diff touches. - **Description is user-facing.** Write what changed from the consumer's perspective, not implementation details. These become CHANGELOG entries and GitHub Release notes. 
- **Internal packages count.** Changes to `@ash-ai/shared`, `@ash-ai/sandbox`, `@ash-ai/bridge` still need changesets. The config automatically bumps their dependents. ### What Does NOT Need a Changeset - Documentation-only changes - CI configuration changes - Test-only changes - Anything that does not affect published package behavior ## CI Flow ```mermaid graph LR A["PR merged to main
(includes changeset)"] --> B["CI opens
'Version Packages' PR"] B --> C["Bumps package.json versions
Generates CHANGELOG entries"] C --> D["Merge 'Version Packages' PR"] D --> E["CI publishes to npm
Creates GitHub Release"] ``` ### Step by step 1. **You merge a PR** that includes a `.changeset/*.md` file. 2. **CI automatically opens a "Version Packages" PR.** This PR: - Bumps `version` in the affected `package.json` files - Generates `CHANGELOG.md` entries from the changeset description - Deletes the consumed `.changeset/*.md` files 3. **You review and merge the "Version Packages" PR.** 4. **CI publishes** the bumped packages to npm and creates GitHub Releases with release notes. ### Preview To see what changesets are pending and what they would do: ```bash pnpm changeset status ``` ### Local Version Bump (rare) Normally CI handles versioning. If you need to bump locally: ```bash make version-packages # Apply pending changesets locally make publish-dry-run # See what would be published make publish # Publish to npm (requires NPM_TOKEN) ```
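When several pending changesets touch the same package, the "Version Packages" PR applies the highest bump any of them requested (major over minor over patch). A self-contained sketch of that precedence rule -- illustrative, not Changesets' actual implementation:

```typescript
// Resolve the final bump per package across all pending changesets:
// the highest requested bump wins (patch < minor < major).
type Bump = 'patch' | 'minor' | 'major';

const rank: Record<Bump, number> = { patch: 0, minor: 1, major: 2 };

function resolveBumps(changesets: Record<string, Bump>[]): Record<string, Bump> {
  const resolved: Record<string, Bump> = {};
  for (const changeset of changesets) {
    for (const [pkg, bump] of Object.entries(changeset)) {
      // Keep the stronger of the existing and newly requested bump.
      if (!(pkg in resolved) || rank[bump] > rank[resolved[pkg]]) {
        resolved[pkg] = bump;
      }
    }
  }
  return resolved;
}
```

For example, one pending changeset with `"@ash-ai/server": minor` plus another with `"@ash-ai/server": patch` resolves to a single minor bump for `@ash-ai/server`.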