# Ash Documentation > Complete documentation for Ash — an open-source system for deploying and orchestrating AI agents. --- # Introduction Source: https://docs.ash-cloud.ai/ # What is Ash? Ash is a self-hostable system for deploying and orchestrating AI agents. You define an agent as a folder with a `CLAUDE.md` system prompt, deploy it to a server, and interact with it through a REST API, CLI, or SDKs. Every agent session runs inside an isolated sandbox. ## Who is Ash For? Ash is for developers and teams who want to run Claude-powered agents in production without giving up control of their infrastructure. | If you need... | Ash gives you... | |---|---| | AI agents behind an API | REST endpoints with SSE streaming | | Stateful conversations | Sessions that persist, pause, resume, and survive restarts | | Security isolation | Sandboxes with cgroups, bubblewrap, and environment allowlists | | Full infrastructure control | Self-hosted on Docker, EC2, ECS, GCE, or bare metal | | Multi-language clients | TypeScript SDK, Python SDK, CLI, and raw curl | ## Core Concepts at a Glance ```mermaid graph LR Agent["Agent
(folder with CLAUDE.md)"] Server["Ash Server
(REST API)"] Session["Session
(stateful conversation)"] Sandbox["Sandbox
(isolated process)"] Bridge["Bridge
(Claude Code SDK)"] Agent -->|deployed to| Server Server -->|creates| Session Session -->|runs in| Sandbox Sandbox -->|contains| Bridge ``` - **Agent** -- A folder containing a `CLAUDE.md` system prompt and optional config. Like a Docker image: the blueprint, not the running instance. - **Session** -- A stateful conversation between a client and a deployed agent. Created, paused, resumed, or ended through the API. - **Sandbox** -- An isolated child process running a single session. Restricted environment, resource limits, filesystem isolation. - **Bridge** -- The process inside each sandbox that connects to the Claude Code SDK and streams responses back to the server. - **Server** -- The Fastify HTTP server that exposes the REST API, manages agents, routes sessions, and persists state. ## Key Differentiators ### Self-Hosted, Not SaaS Ash runs on your infrastructure. Docker, EC2, ECS Fargate, GCE, or bare metal. Your data stays on your machines. No vendor lock-in, no external API gateway, no third-party dependencies beyond the Claude API itself. ### Agent = Folder No YAML manifests, no complex deployment pipelines. An agent is a directory with a `CLAUDE.md` file. Add `.claude/settings.json` for permissions, `.mcp.json` for MCP tools, and `.claude/skills/` for reusable workflows. Deploy with `ash deploy ./my-agent`. ### Sessions That Survive Sessions persist to SQLite or Postgres. They survive server restarts, can be paused and resumed days later, and hand off between machines in a multi-runner setup. Warm resume is 1.7ms (sandbox still alive). Cold resume is 32ms (new process + state restoration). ### Fast by Default Ash adds sub-millisecond overhead per message (0.41ms p50). Session creation is 44ms. Pool operations are microsecond-scale. The latency users feel is dominated by the LLM API, not Ash. See the [benchmarks](/guides/monitoring) for full numbers. 
### Real Isolation Every session sandbox runs with an environment allowlist (host credentials never leak in), cgroups v2 resource limits on Linux, and bubblewrap filesystem isolation. The agent inside the sandbox is treated as untrusted code. ### Thin Wrapper Philosophy Ash wraps the Claude Code SDK without reinventing it. SDK types flow through the system unchanged -- from bridge to server to client. Ash adds orchestration (sessions, sandboxes, streaming, persistence) but does not translate or redefine the AI layer. ## Quick Example ```bash # Install and start the server npm install -g @ash-ai/cli ash start # Define an agent (one file is all you need) mkdir my-agent echo "You are a helpful coding assistant." > my-agent/CLAUDE.md # Deploy and chat ash deploy ./my-agent --name my-agent ash chat my-agent "Explain closures in JavaScript" ``` The response streams back in real time. Under the hood, Ash deploys the agent to its registry, spawns an isolated sandbox, starts a bridge process with your `CLAUDE.md` as the system prompt, and proxies the Claude SDK's streaming response as SSE events. ## How Does Ash Compare? | | Ash | Generic Sandbox APIs | Managed Agent Platforms | |---|---|---|---| | **Focus** | AI agent orchestration | Code execution | Agent hosting | | **Infrastructure** | Self-hosted | Cloud/SaaS | Cloud/SaaS | | **Session model** | Persistent, resumable | Ephemeral | Varies | | **Isolation** | OS-level (cgroups, bwrap) | Provider-dependent | Provider-dependent | | **AI integration** | Deep (Claude Code SDK) | None (BYO) | Framework-specific | | **Data control** | Full (your machines) | Partial | Limited | For detailed comparisons, see [Ash vs ComputeSDK](/comparisons/computesdk) and [Ash vs Blaxel](/comparisons/blaxel). 
## Next Steps - **[Installation](/getting-started/installation)** -- Get the CLI installed and the server running - **[Quickstart](/getting-started/quickstart)** -- Deploy your first agent in two minutes - **[Key Concepts](/getting-started/concepts)** -- Deep dive into agents, sessions, sandboxes, and bridges - **[Architecture](/architecture/overview)** -- How all the pieces fit together --- # Installation Source: https://docs.ash-cloud.ai/getting-started/installation # Installation Get the Ash CLI installed and the server running. ## Prerequisites | Requirement | Details | |-------------|---------| | **Node.js** | >= 20 ([download](https://nodejs.org/)) | | **Docker** | Required for `ash start` ([install Docker](https://docs.docker.com/get-docker/)) | | **Anthropic API key** | Get one at [console.anthropic.com](https://console.anthropic.com/) | ## Install the CLI ```bash npm install -g @ash-ai/cli ``` Verify the installation: ```bash ash --help ``` You should see a list of available commands including `start`, `deploy`, `session`, `agent`, and `health`. ## Set Your API Key Ash needs an Anthropic API key to run agents. Export it in your shell: ```bash export ANTHROPIC_API_KEY=sk-ant-... ``` For persistence, add the export to your shell profile (`~/.bashrc`, `~/.zshrc`, etc.). ## Start the Server ```bash ash start ``` This pulls the Ash Docker image, starts the container, and waits for the server to become healthy: ``` Pulling ghcr.io/ash-ai/ash:latest... Starting Ash server... Waiting for server to be ready... Ash server is running. URL: http://localhost:4100 API key: ash_xxxxxxxx (saved to ~/.ash/config.json) Data dir: ~/.ash ``` The server auto-generates an API key on first start. The CLI captures it and saves it to `~/.ash/config.json`, so subsequent CLI commands authenticate automatically. If you need the key for SDK usage, read it from `~/.ash/config.json` or set `ASH_API_KEY` as an environment variable. 
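For example, a client application might resolve the key by preferring the `ASH_API_KEY` environment variable and falling back to the CLI's saved config file -- a minimal sketch, assuming the key is stored under an `apiKey` field (the field name is an assumption; check your actual `~/.ash/config.json`):

```python
import json
import os
from pathlib import Path
from typing import Optional

def load_ash_api_key(config_path: str = "~/.ash/config.json") -> Optional[str]:
    """Resolve the Ash API key: prefer ASH_API_KEY, then the CLI config file."""
    env_key = os.environ.get("ASH_API_KEY")
    if env_key:
        return env_key
    path = Path(config_path).expanduser()
    if not path.exists():
        return None
    config = json.loads(path.read_text())
    # "apiKey" is an assumed field name -- inspect the file on your machine
    return config.get("apiKey")
```

This mirrors how the CLI itself authenticates automatically after `ash start`: the environment variable wins, the saved config is the fallback.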
### `ash start` Options

| Option | Default | Description |
|--------|---------|-------------|
| `--port <port>` | `4100` | Host port to expose |
| `--database-url <url>` | SQLite (`data/ash.db`) | Use Postgres or CockroachDB instead of SQLite. Example: `postgresql://user:pass@host:5432/ash` |
| `--env KEY=VALUE` | -- | Pass extra environment variables to the container. Can be specified multiple times. |
| `--tag <tag>` | `latest` | Docker image tag |
| `--image <image>` | -- | Full Docker image name (overrides default + tag) |
| `--no-pull` | -- | Skip pulling the image (use a local build) |

Examples:

```bash
# Custom port
ash start --port 5000

# Use Postgres
ash start --database-url "postgresql://localhost:5432/ash"

# Pass additional environment variables
ash start --env ASH_SNAPSHOT_URL=s3://my-bucket/snapshots/

# Use a local dev image
ash start --image ash-dev --no-pull
```

## Verify the Server

Check that the server is running and healthy:

```bash
ash health
```

Expected output:

```json
{
  "status": "ok",
  "activeSessions": 0,
  "activeSandboxes": 0,
  "uptime": 5
}
```

You can also check container status:

```bash
ash status
```

## Stopping the Server

```bash
ash stop
```

This stops and removes the Docker container. Session data persists in `~/.ash` (SQLite) or your configured database.

## View Logs

```bash
ash logs     # Show server logs
ash logs -f  # Follow logs in real-time
```

## Next Step

With the server running, follow the [Quickstart](quickstart.md) to deploy your first agent.

---

# Quickstart

Source: https://docs.ash-cloud.ai/getting-started/quickstart

# Quickstart

Deploy an agent and chat with it. This takes about two minutes, assuming you have completed [Installation](installation.md).

## 1. Define an Agent

An agent is a folder with a `CLAUDE.md` file. The `CLAUDE.md` is the system prompt -- it tells the agent who it is and how to behave.

```bash
mkdir my-agent
cat > my-agent/CLAUDE.md << 'EOF'
You are a helpful coding assistant.
Answer questions about JavaScript and TypeScript.
Keep answers concise. Include working code examples. EOF ``` That is the only required file. For production agents, you can add `.claude/settings.json` (tool permissions), `.claude/skills/` (reusable skills), and `.mcp.json` (MCP server connections). See [Key Concepts](concepts.md) for more. ## 2. Deploy and Chat ```bash ash deploy ./my-agent --name my-agent ash chat my-agent "What is a closure in JavaScript?" ``` The response streams back in real time, with the session ID printed at the end: ``` A closure is a function that retains access to variables from its enclosing scope, even after the outer function has returned... Session: 550e8400-e29b-41d4-a716-446655440000 ``` The session stays alive so you can continue the conversation: ```bash ash chat --session 550e8400-e29b-41d4-a716-446655440000 "Now explain with an example" ``` When you are done, end the session: ```bash ash session end 550e8400-e29b-41d4-a716-446655440000 ``` Use `ash chat --end` for one-shot messages that don't need follow-ups -- it ends the session automatically after the response. ## Detailed Flow (Optional) If you need more control -- multiple messages, pause/resume, or session inspection -- use the session commands directly: ```bash # Create a session ash session create my-agent # → { "id": "550e8400-...", "status": "active", "agentName": "my-agent" } # Send messages (replace SESSION_ID with the actual ID) ash session send SESSION_ID "What is a closure in JavaScript?" ash session send SESSION_ID "Now explain it with an example" # End the session when done ash session end SESSION_ID ``` --- ## Using the SDKs The CLI is convenient for testing. For applications, use one of the SDKs. 
**TypeScript:**

```bash
npm install @ash-ai/sdk
```

```typescript
import { AshClient } from '@ash-ai/sdk';

const client = new AshClient({
  serverUrl: 'http://localhost:4100',
  apiKey: process.env.ASH_API_KEY,
});

// Create a session
const session = await client.createSession('my-agent');

// Send a message and stream the response
for await (const event of client.sendMessageStream(session.id, 'What is a closure?')) {
  if (event.type === 'message') {
    process.stdout.write(event.data);
  }
}

// Clean up
await client.endSession(session.id);
```

**Python:**

```bash
pip install ash-ai-sdk
```

```python
import os

from ash_ai import AshClient

client = AshClient(
    server_url="http://localhost:4100",
    api_key=os.environ.get("ASH_API_KEY"),
)

# Create a session
session = client.create_session("my-agent")

# Send a message and stream the response
for event in client.send_message_stream(session.id, "What is a closure?"):
    if event.type == "message":
        print(event.data, end="")

# Clean up
client.end_session(session.id)
```

**curl:**

```bash
# Create a session
curl -s -X POST http://localhost:4100/api/sessions \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $ASH_API_KEY" \
  -d '{"agent":"my-agent"}'

# Send a message (returns an SSE stream)
curl -N -X POST http://localhost:4100/api/sessions/SESSION_ID/messages \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $ASH_API_KEY" \
  -d '{"content":"What is a closure?"}'

# End the session
curl -s -X DELETE http://localhost:4100/api/sessions/SESSION_ID \
  -H "Authorization: Bearer $ASH_API_KEY"
```

---

## What Just Happened

When you ran those commands, here is what Ash did under the hood:

1. **`ash deploy`** -- Copied your agent folder to the server's agent registry and recorded it in the database.
2. **`ash session create`** -- Created a session record in the database and spawned an isolated sandbox process. Inside that sandbox, a bridge process started and loaded your `CLAUDE.md` as the system prompt.
3. **`ash session send`** -- Sent your message to the bridge over a Unix socket.
The bridge called the Claude Agent SDK, which streamed the response back. Ash proxied each chunk as a Server-Sent Event (SSE) over HTTP to your terminal. 4. **`ash session end`** -- Marked the session as ended in the database and destroyed the sandbox process. The sandbox is an isolated child process with a restricted environment -- only allowlisted variables reach it, and on Linux it runs with cgroup resource limits and filesystem isolation via bubblewrap. ## Next Steps - [Key Concepts](concepts.md) -- Understand agents, sessions, sandboxes, bridges, and the server - [CLI Reference](/cli/overview) -- All commands and flags - [API Reference](/api/overview) -- REST endpoints, SSE format, request/response schemas - [TypeScript SDK](/sdks/typescript) -- Full TypeScript client documentation - [Python SDK](/sdks/python) -- Full Python client documentation --- # Key Concepts Source: https://docs.ash-cloud.ai/getting-started/concepts # Key Concepts Ash has five core concepts. Understanding how they relate to each other will help you make sense of the rest of the documentation. ## The Five Concepts | Concept | What it is | Analogy | |---------|-----------|---------| | **Agent** | A folder containing `CLAUDE.md` (system prompt) and optional config files. Defines the behavior and permissions of an AI agent. | A Docker image -- the blueprint, not the running instance. | | **Session** | A stateful conversation between a client and a deployed agent. Has a lifecycle (starting, active, paused, ended). Persisted in the database. | A container instance -- created from the image, has state, can be stopped and restarted. | | **Sandbox** | An isolated child process that runs a single session. Restricted environment variables, resource limits (cgroups on Linux), and filesystem isolation (bubblewrap). | A jail cell -- the agent runs inside it and cannot access anything outside. | | **Bridge** | A process inside each sandbox that connects to the Claude Agent SDK. 
Reads the agent's `CLAUDE.md`, receives commands from the server over a Unix socket, and streams responses back. | A translator -- it speaks the server's protocol on one side and the Claude SDK's API on the other. | | **Server** | The Fastify HTTP server that exposes the REST API, manages the agent registry, routes sessions, and persists state to SQLite or Postgres. | The control tower -- it coordinates everything but does not do the AI work itself. | ## How They Connect ```mermaid graph LR Client["Client
(CLI / SDK / Browser)"] Server["Server
(Fastify, port 4100)"] DB["Database
(SQLite / Postgres)"] Pool["Sandbox Pool"] S1["Sandbox"] B1["Bridge"] SDK1["Claude Agent SDK"] Client -->|HTTP + SSE| Server Server --> DB Server --> Pool Pool --> S1 S1 -->|contains| B1 B1 -->|Unix socket| Server B1 --> SDK1 ``` **The data flow for a single message:** 1. The client sends `POST /api/sessions/:id/messages` with a prompt. 2. The server looks up the session and its associated sandbox. 3. The server sends a `query` command to the bridge over the Unix socket. 4. The bridge calls the Claude Agent SDK's `query()` function. 5. The SDK streams response messages back to the bridge. 6. The bridge writes each message to the Unix socket. 7. The server reads each message and writes it as an SSE frame to the HTTP response. 8. The client receives the streamed response. ## Agent Structure An agent is a folder. The only required file is `CLAUDE.md`: ``` my-agent/ ├── CLAUDE.md # System prompt (required) ├── .claude/ │ ├── settings.json # Tool permissions │ └── skills/ │ └── search-and-summarize/ │ └── SKILL.md # Reusable skill definition └── .mcp.json # MCP server connections ``` **Minimal agent** -- one file: ``` my-agent/ └── CLAUDE.md ``` **Production agent** -- skills, MCP tools, scoped permissions: ``` research-agent/ ├── CLAUDE.md # "You are a research assistant..." 
├── .mcp.json # Connect to web fetch, memory servers └── .claude/ ├── settings.json # Allow: Bash, WebSearch, mcp__fetch └── skills/ ├── search-and-summarize/ │ └── SKILL.md └── write-memo/ └── SKILL.md ``` ## Session Lifecycle Every session moves through a defined set of states: ```mermaid stateDiagram-v2 [*] --> starting : POST /api/sessions starting --> active : Sandbox ready starting --> error : Sandbox failed active --> active : Send messages active --> paused : Pause active --> ended : End session active --> error : Sandbox crashed paused --> active : Resume error --> active : Resume ended --> [*] ``` ### State Descriptions | State | Description | |-------|-------------| | **starting** | Session created, sandbox is being spawned. Brief transient state. | | **active** | Sandbox is running. The session can send and receive messages. | | **paused** | Session is paused. The sandbox may still be alive (enabling fast resume) or may have been cleaned up. Workspace state is persisted. | | **error** | The sandbox crashed or failed to start. The session can be resumed -- Ash will spawn a new sandbox and restore the previous workspace. | | **ended** | Terminal state. The session was explicitly ended by the client. Cannot be resumed -- create a new session instead. | ### Resume: Fast Path vs. Cold Path When you resume a paused or errored session, Ash takes one of two paths: - **Fast path**: The sandbox process is still alive. Ash flips the status back to `active` immediately. This is instant. - **Cold path**: The sandbox process is gone (crashed, server restarted, idle cleanup). Ash creates a new sandbox in the same workspace directory, restoring the `.claude` session state so the Claude SDK picks up the previous conversation. This takes a few seconds. In both cases, the conversation history is preserved. The client can continue sending messages as if nothing happened. 
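The two resume paths can be sketched as a simple state check -- an illustrative model only, not Ash's implementation; `Session`, `Sandbox`, and the `spawn_sandbox` callback are hypothetical names:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Sandbox:
    alive: bool = True

@dataclass
class Session:
    status: str
    workspace: str
    sandbox: Optional[Sandbox] = None

def resume(session: Session, spawn_sandbox: Callable[[str], Sandbox]) -> str:
    """Resume a paused or errored session; return which path was taken."""
    if session.status not in ("paused", "error"):
        raise ValueError(f"cannot resume a session in state {session.status!r}")
    if session.sandbox is not None and session.sandbox.alive:
        # Fast path: the sandbox process is still running -- flip the status.
        session.status = "active"
        return "fast"
    # Cold path: spawn a new sandbox in the same workspace so the
    # previous conversation state is restored.
    session.sandbox = spawn_sandbox(session.workspace)
    session.status = "active"
    return "cold"
```

Either way the caller ends up with an `active` session; only the latency differs.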
## Sandbox Isolation Each sandbox runs as an isolated child process with multiple layers of protection: | Layer | Linux | macOS (dev) | |-------|-------|-------------| | Process limits | cgroups v2 (pids.max) | -- | | Memory limits | cgroups v2 (memory.max) | -- | | CPU limits | cgroups v2 (cpu.max) | -- | | File size limits | cgroups + tmpfs | ulimit -f | | Environment | Strict allowlist | Strict allowlist | | Filesystem | bubblewrap | Restricted cwd | The environment allowlist ensures only explicitly permitted variables reach the sandbox: `PATH`, `HOME`, `LANG`, `TERM`, `ANTHROPIC_API_KEY`, and a few others. Everything else is blocked. Host credentials like `AWS_ACCESS_KEY_ID` never enter the sandbox. ## Next Steps - [Quickstart](quickstart.md) -- Deploy your first agent - [CLI Reference](/cli/overview) -- All commands and flags - [Architecture](/architecture/overview) -- Deep dive into the system design --- # Use Cases Source: https://docs.ash-cloud.ai/use-cases # Use Cases Ash is a general-purpose platform for deploying AI agents. Here are common patterns for what you can build. ## Customer Support Agent Deploy an agent that handles support tickets, looks up account data via MCP tools, and follows your company's support playbook. ``` support-agent/ CLAUDE.md # Support playbook and escalation rules .mcp.json # Connect to CRM, knowledge base .claude/ settings.json # Allow: WebFetch, mcp__crm__*, mcp__kb__* skills/ lookup-account.md # /lookup-account workflow process-refund.md # /process-refund workflow ``` **Why Ash:** Each support conversation is a persistent session. If the customer comes back later, resume the session with full context. The agent runs in an isolated sandbox, so it can't access other customers' data. MCP servers connect the agent to your CRM and knowledge base without exposing raw database access. ## Code Review Bot Build a bot that reviews pull requests, clones repos into sandboxes, runs tests, and posts structured feedback. 
```typescript // Triggered by GitHub webhook const session = await client.createSession('code-reviewer'); for await (const event of client.sendMessageStream( session.id, `Review this PR:\n${prDiff}\n\nClone the repo and run tests.`, )) { // Stream review results back to your webhook handler } // Post the review to GitHub, then clean up await client.endSession(session.id); ``` **Why Ash:** Sandbox isolation means the agent can clone repos and run `npm test` without affecting your host. Each review gets its own sandbox with its own filesystem. Streaming lets you show review progress in real time. ## Research Assistant with Memory Deploy an agent that searches the web, synthesizes findings, and remembers context across sessions using an MCP memory server. ``` research-agent/ CLAUDE.md # Research methodology and output format .mcp.json # fetch + memory MCP servers .claude/ settings.json # Allow: WebFetch, mcp__fetch__*, mcp__memory__* skills/ search-and-summarize/ SKILL.md # /search-and-summarize workflow write-memo/ SKILL.md # /write-memo workflow ``` **Why Ash:** Sessions persist across restarts. Pause a research session, come back days later, and resume where you left off. The memory MCP server stores facts persistently inside the sandbox workspace, building knowledge over time. ## Multi-Tenant SaaS Integration Build a SaaS feature where each of your customers gets their own AI assistant, with tenant-specific tools injected via per-session MCP servers. ```typescript // For each customer request, inject their specific MCP tools const session = await client.createSession('assistant', { mcpServers: { 'customer-api': { command: 'npx', args: ['-y', '@your-org/customer-mcp', '--tenant', customerId], env: { CUSTOMER_TOKEN: customerToken, }, }, }, }); ``` **Why Ash:** Per-session MCP servers let you inject tenant-specific tools at runtime without redeploying the agent. 
Each customer's session runs in its own sandbox with its own environment, so credentials never cross boundaries. ## Data Processing Pipeline Run agents that ingest data, execute analysis in sandboxed environments, and stream results back to your application. ```typescript const session = await client.createSession('data-analyst'); // Upload a CSV to the sandbox await client.uploadFile(session.id, '/workspace/data.csv', csvBuffer); // Ask the agent to analyze it for await (const event of client.sendMessageStream( session.id, 'Analyze data.csv. Calculate summary statistics, identify outliers, and produce a report.', )) { if (event.type === 'message') { const text = extractTextFromEvent(event.data); if (text) process.stdout.write(text); } } // Download the generated report const report = await client.downloadFile(session.id, '/workspace/report.md'); ``` **Why Ash:** The agent can install Python packages, write scripts, and execute code inside the sandbox without affecting your host system. File upload/download APIs let you pass data in and pull results out. Streaming shows progress as the analysis runs. ## Background Automation Agent Deploy a long-running agent that monitors systems, runs periodic checks, and takes action when needed. ``` monitor-agent/ CLAUDE.md # Monitoring procedures and alert rules .claude/ settings.json # Allow: Bash(*), WebFetch, mcp__slack__* .mcp.json # Slack MCP server for alerts ``` ```typescript // Create a long-lived session const session = await client.createSession('monitor-agent'); // Send periodic check instructions setInterval(async () => { for await (const event of client.sendMessageStream( session.id, 'Run your health checks and report any issues to Slack.', )) { // Log results } }, 5 * 60 * 1000); // Every 5 minutes ``` **Why Ash:** The session persists indefinitely. The agent builds context over time -- it knows what it checked last, what's normal, and what's changed. Pause the session during maintenance windows and resume after. 
## Patterns to Notice Across all use cases, a few patterns repeat: 1. **Agent as folder** -- Define behavior in `CLAUDE.md`, not code. Change the prompt, redeploy, done. 2. **Session persistence** -- Long-lived, resumable conversations are the default, not a special case. 3. **Sandbox isolation** -- Agents run untrusted code safely. Clone repos, run scripts, install packages. 4. **MCP servers** -- Connect agents to your systems (CRM, databases, APIs) through a standard protocol. 5. **Streaming** -- Real-time responses via SSE. Show progress, not just final answers. ## Next Steps - **[Quickstart](/getting-started/quickstart)** -- Deploy your first agent - **[Defining an Agent](/guides/defining-an-agent)** -- Full guide to agent structure - **[Managing Sessions](/guides/managing-sessions)** -- Session lifecycle and persistence - **[Streaming Responses](/guides/streaming-responses)** -- SSE events and SDK helpers --- # Defining an Agent Source: https://docs.ash-cloud.ai/guides/defining-an-agent # Defining an Agent An agent in Ash is a folder on disk. At minimum, it contains a single file: `CLAUDE.md`. This file defines the agent's identity, capabilities, and behavior. Ash reads this folder when you deploy, copies it into a sandbox, and uses it as the system prompt for every session. ## Minimal Agent The simplest possible agent is a directory with one file: ``` my-agent/ CLAUDE.md ``` The `CLAUDE.md` is the only required file. It contains the instructions the agent follows during every conversation. ```markdown title="my-agent/CLAUDE.md" # Customer Support Agent You are a customer support agent for Acme Corp. You help users troubleshoot product issues, process returns, and answer billing questions. 
## Behavior - Be polite and professional - Ask clarifying questions before making assumptions - If you cannot resolve an issue, escalate by telling the user to email support@acme.com ``` Deploy it: ```bash ash deploy ./my-agent --name customer-support ``` That is a working agent. It will respond to messages using the instructions in `CLAUDE.md`. ## Production Agent A production agent adds configuration for permissions, MCP servers, and skills: ``` research-assistant/ CLAUDE.md .claude/ settings.json skills/ search-and-summarize.md analyze-code.md .mcp.json ``` ### CLAUDE.md The system prompt defines identity, capabilities, and behavior rules: ```markdown title="research-assistant/CLAUDE.md" # Research Assistant Agent You are a research assistant powered by Ash. You help users research topics, analyze code, and produce structured reports. ## Capabilities You have access to: - **Web fetching** via the `fetch` MCP server - **Persistent memory** via the `memory` MCP server - **Skills** -- invoke /search-and-summarize or /analyze-code for structured workflows ## Behavior - Use your tools to find accurate information before answering - Store important facts in memory so you can recall them later - Be concise but thorough -- cite sources when you fetch web content ## Identity When asked about yourself, say you are the Research Assistant powered by Ash. ``` ### .claude/settings.json Controls which tools the agent is allowed to use without asking for confirmation, and optionally sets the default model. This maps directly to the Claude Code SDK's permission system. ```json title="research-assistant/.claude/settings.json" { "model": "claude-sonnet-4-5-20250929", "permissions": { "allow": [ "Bash(npm install:*)", "Bash(node:*)", "Read", "Write", "Glob", "Grep", "WebFetch", "mcp__fetch__*", "mcp__memory__*" ] } } ``` The `model` field sets the default model for the agent. 
This is the model the SDK uses unless overridden at the API level (see [Model Precedence](#model-precedence) below). The `allow` list uses glob patterns. Each entry permits the agent to use that tool without human approval. Tools not listed will be blocked or require approval depending on the session's permission mode. Common patterns: | Pattern | Allows | |---------|--------| | `Read` | Reading any file | | `Write` | Writing any file | | `Bash(node:*)` | Running any command starting with `node` | | `Bash(npm install:*)` | Running npm install commands | | `mcp__fetch__*` | All tools from the `fetch` MCP server | | `WebFetch` | The built-in web fetch tool | ### .mcp.json Configures MCP (Model Context Protocol) servers available to the agent. Each server provides additional tools the agent can call. ```json title="research-assistant/.mcp.json" { "mcpServers": { "fetch": { "command": "npx", "args": ["-y", "@anthropic-ai/mcp-fetch"] }, "memory": { "command": "npx", "args": ["-y", "@anthropic-ai/mcp-memory"], "env": { "MEMORY_FILE": "./memory.json" } } } } ``` MCP servers run as child processes inside the sandbox. The `env` field sets environment variables specific to that server. Paths are relative to the agent's workspace directory. You can also inject MCP servers at session creation time using the `mcpServers` field on `POST /api/sessions`. Session-level entries are merged into the agent's `.mcp.json` (session overrides agent on key conflict). This enables the **sidecar pattern** — your host app exposes tenant-specific tools as MCP endpoints. See [Per-Session MCP Servers](../api/sessions.md#per-session-mcp-servers) for details. ### .claude/skills/ Skills are markdown files that define reusable workflows the agent can invoke. Each file becomes a slash command. ```markdown title="research-assistant/.claude/skills/search-and-summarize.md" # /search-and-summarize Search the web for a given topic and produce a structured summary. ## Steps 1. 
Use the fetch tool to search for the topic 2. Read the top 3-5 results 3. Synthesize a summary with key findings 4. List all sources with URLs at the bottom ## Output Format Return a markdown document with sections: Overview, Key Findings, Sources. ``` The filename (minus `.md`) becomes the skill name. The agent can invoke it when a user references `/search-and-summarize` in a message. ## Folder Structure Reference ``` agent-name/ CLAUDE.md # Required. Agent system prompt. .claude/ settings.json # Optional. Tool permissions + default model. skills/ skill-name.md # Optional. Reusable workflows. .mcp.json # Optional. MCP server configuration. package.json # Optional. Dependencies installed at sandbox start. setup.sh # Optional. Runs once when sandbox initializes. ``` If a `package.json` is present, Ash runs `npm install` inside the sandbox when the session starts. If a `setup.sh` is present, it runs after dependency installation. ## What Happens at Deploy When you run `ash deploy ./my-agent --name my-agent`: 1. Ash validates that the directory contains `CLAUDE.md` 2. The agent files are copied to `~/.ash/agents/my-agent/` 3. The agent is registered with the server (name, path, version) 4. If an agent with that name already exists, its version is incremented The agent folder becomes the working directory for every session sandbox. Files the agent creates during a session are written to the sandbox workspace, not back to the agent definition. ## Model Precedence The model used for a conversation is resolved with the following precedence (highest to lowest): 1. **Per-message model** — passed in the `model` field of `POST /api/sessions/:id/messages` 2. **Session-level model** — set when creating the session via `POST /api/sessions` 3. **Agent record model** — set on the agent via the API 4. **Agent settings file** — the `model` field in `.claude/settings.json` 5. 
**SDK default** — the Claude Code SDK's built-in default model

This means you can deploy an agent with a default model in `.claude/settings.json`, override it for specific sessions, and override it again for individual messages — all without redeploying the agent. When a new model comes out, you can start using it immediately by passing it at the session or message level.

---

# Deploying Agents

Source: https://docs.ash-cloud.ai/guides/deploying-agents

# Deploying Agents

Deploying an agent registers it with the Ash server so sessions can be created against it. The agent folder is copied to the server's data directory and validated.

## Deploy with the CLI

```bash
ash deploy ./path/to/agent --name my-agent
```

The `--name` flag sets the agent name. If omitted, the directory name is used.

### What happens during deploy

1. **Validation** -- Ash checks that the directory contains a `CLAUDE.md` file. If it does not, the deploy fails with an error.
2. **Copy** -- The agent files are copied to `~/.ash/agents/<name>/`. This ensures the server can access them even if the original directory moves.
3. **Registration** -- The server creates or updates the agent record in its database. Each deploy increments the agent's version number.

```
$ ash deploy ./research-assistant --name research-bot
Copied agent files to /Users/you/.ash/agents/research-bot
Deployed agent: {
  "id": "a1b2c3d4-...",
  "name": "research-bot",
  "version": 1,
  "path": "agents/research-bot",
  "createdAt": "2025-01-15T10:30:00.000Z",
  "updatedAt": "2025-01-15T10:30:00.000Z"
}
```

## Updating an Agent

Redeploy with the same name to update an agent. Ash overwrites the agent files and increments the version:

```bash
# Edit your agent's CLAUDE.md, then redeploy
ash deploy ./research-assistant --name research-bot
```

Existing sessions continue using the version they started with. New sessions pick up the updated agent.
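The create-or-update versioning behavior can be sketched as a tiny in-memory registry -- an illustrative model only; `AgentRegistry` and its record shape are hypothetical, not Ash's actual schema:

```python
from typing import Dict, Optional

class AgentRegistry:
    """Toy model of deploy semantics: the first deploy of a name creates the
    record at version 1; each redeploy under the same name bumps the version."""

    def __init__(self) -> None:
        self._agents: Dict[str, dict] = {}

    def deploy(self, name: str, path: str) -> dict:
        existing = self._agents.get(name)
        if existing is None:
            record = {"name": name, "path": path, "version": 1}
        else:
            # Overwrite the files (path) and increment the version
            record = {**existing, "path": path, "version": existing["version"] + 1}
        self._agents[name] = record
        return record

    def get(self, name: str) -> Optional[dict]:
        return self._agents.get(name)
```

Sessions created before a redeploy would keep the version they started with; only new sessions read the updated record.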
## Listing Agents ```typescript const client = new AshClient({ serverUrl: 'http://localhost:4100', apiKey: process.env.ASH_API_KEY }); const agents = await client.listAgents(); console.log(agents); ``` ```python from ash_sdk import AshClient client = AshClient("http://localhost:4100", api_key=os.environ["ASH_API_KEY"]) agents = client.list_agents() print(agents) ``` ```bash ash agent list ``` ```bash curl $ASH_SERVER_URL/api/agents ``` Response: ```json { "agents": [ { "id": "a1b2c3d4-...", "name": "research-bot", "version": 2, "path": "/data/agents/research-bot", "createdAt": "2025-01-15T10:30:00.000Z", "updatedAt": "2025-01-15T12:00:00.000Z" } ] } ``` ## Getting Agent Details ```typescript const agent = await client.getAgent('research-bot'); ``` ```python agent = client.get_agent("research-bot") ``` ```bash ash agent info research-bot ``` ```bash curl $ASH_SERVER_URL/api/agents/research-bot ``` ## Deleting an Agent Deleting an agent removes its registration from the server. Existing sessions that were created from the agent continue to run, but no new sessions can be created. ```typescript await client.deleteAgent('research-bot'); ``` ```python client.delete_agent("research-bot") ``` ```bash ash agent delete research-bot ``` ```bash curl -X DELETE $ASH_SERVER_URL/api/agents/research-bot ``` ## API Reference | Method | Endpoint | Description | |--------|----------|-------------| | `POST` | `/api/agents` | Deploy (create or update) an agent | | `GET` | `/api/agents` | List all agents | | `GET` | `/api/agents/:name` | Get agent details | | `DELETE` | `/api/agents/:name` | Delete an agent | ### POST /api/agents Request body: ```json { "name": "research-bot", "path": "agents/research-bot" } ``` The `path` field is resolved relative to the server's data directory. When deploying via the CLI, this is handled automatically. 
Response (201): ```json { "agent": { "id": "a1b2c3d4-...", "name": "research-bot", "version": 1, "path": "/data/agents/research-bot", "createdAt": "2025-01-15T10:30:00.000Z", "updatedAt": "2025-01-15T10:30:00.000Z" } } ``` Error (400) -- missing CLAUDE.md: ```json { "error": "Agent directory must contain CLAUDE.md", "statusCode": 400 } ``` --- # Managing Sessions Source: https://docs.ash-cloud.ai/guides/managing-sessions # Managing Sessions A session is a stateful conversation between a client and a deployed agent. Each session runs inside an isolated sandbox with its own workspace directory. Sessions persist messages across turns and can be paused, resumed, and ended. ## Session States | State | Description | |-------|-------------| | `starting` | Sandbox is being created. Transitions to `active` on success or `error` on failure. | | `active` | Sandbox is running and accepting messages. | | `paused` | Sandbox may still be alive but the session is idle. Can be resumed. | | `ended` | Session is terminated. Sandbox is destroyed. Cannot be resumed. | | `error` | Something went wrong (sandbox crash, runner unavailable). Can be resumed. | State transitions: ``` starting --> active --> paused --> active (resume) \ \-> ended \-> error --> active (resume) \-> ended ``` ## Creating a Session ```typescript const client = new AshClient({ serverUrl: 'http://localhost:4100', apiKey: process.env.ASH_API_KEY }); const session = await client.createSession('my-agent'); console.log(session.id); // "a1b2c3d4-..." console.log(session.status); // "active" ``` ```python from ash_sdk import AshClient client = AshClient("http://localhost:4100", api_key=os.environ["ASH_API_KEY"]) session = client.create_session("my-agent") print(session.id) # "a1b2c3d4-..." 
print(session.status) # "active" ``` ```bash ash session create my-agent ``` ```bash curl -X POST $ASH_SERVER_URL/api/sessions \ -H "Content-Type: application/json" \ -d '{"agent": "my-agent"}' ``` Response (201): ```json { "session": { "id": "a1b2c3d4-...", "agentName": "my-agent", "sandboxId": "a1b2c3d4-...", "status": "active", "model": null, "createdAt": "2025-01-15T10:30:00.000Z", "lastActiveAt": "2025-01-15T10:30:00.000Z" } } ``` ### Creating a Session with a Model Override You can specify a model when creating a session. This overrides the agent's default model for the entire session. ```typescript const session = await client.createSession('my-agent', { model: 'claude-opus-4-6' }); ``` ```python session = client.create_session("my-agent", model="claude-opus-4-6") ``` ```bash ash session create my-agent --model claude-opus-4-6 ``` ```bash curl -X POST $ASH_SERVER_URL/api/sessions \ -H "Content-Type: application/json" \ -d '{"agent": "my-agent", "model": "claude-opus-4-6"}' ``` ### Creating a Session with Per-Session MCP Servers You can inject MCP servers at session creation time. This enables the **sidecar pattern**: your host application exposes tenant-specific tools as MCP endpoints, and each session connects to its own URL. Session-level MCP servers are merged into the agent's `.mcp.json`. If both define a server with the same key, the session entry wins. ```typescript const session = await client.createSession('my-agent', { mcpServers: { 'customer-tools': { url: 'http://host-app:8000/mcp?tenant=t_abc123' }, }, }); ``` ```bash curl -X POST $ASH_SERVER_URL/api/sessions \ -H "Content-Type: application/json" \ -d '{ "agent": "my-agent", "mcpServers": { "customer-tools": { "url": "http://host-app:8000/mcp?tenant=t_abc123" } } }' ``` ### Creating a Session with a System Prompt Override You can replace the agent's `CLAUDE.md` for a specific session. The agent definition is not modified — only the sandbox workspace copy is overwritten. 
```typescript const session = await client.createSession('my-agent', { systemPrompt: 'You are a support agent for tenant t_abc123. Be concise.', }); ``` ```bash curl -X POST $ASH_SERVER_URL/api/sessions \ -H "Content-Type: application/json" \ -d '{ "agent": "my-agent", "systemPrompt": "You are a support agent for tenant t_abc123. Be concise." }' ``` ### Combining MCP Servers and System Prompt For full per-tenant customization, pass both `mcpServers` and `systemPrompt` together: ```typescript const session = await client.createSession('my-agent', { mcpServers: { 'tenant-tools': { url: `http://host-app:8000/mcp?tenant=${tenantId}` }, }, systemPrompt: `You are a support agent for ${tenantName}. Use the tenant-tools MCP server to look up account data.`, }); ``` ```bash curl -X POST $ASH_SERVER_URL/api/sessions \ -H "Content-Type: application/json" \ -d '{ "agent": "my-agent", "mcpServers": { "tenant-tools": { "url": "http://host-app:8000/mcp?tenant=t_abc123" } }, "systemPrompt": "You are a support agent for Acme Corp. Use the tenant-tools MCP server to look up account data." }' ``` ## Sending Messages Messages are sent via POST and return an SSE stream. See the [Streaming Responses](./streaming-responses.md) guide for full details on consuming the stream. ```typescript for await (const event of client.sendMessageStream(session.id, 'What is the capital of France?')) { if (event.type === 'message') { console.log(event.data); } else if (event.type === 'done') { console.log('Turn complete'); } } ``` ```python for event in client.send_message_stream(session.id, "What is the capital of France?"): if event.type == "message": print(event.data) elif event.type == "done": print("Turn complete") ``` ```bash ash session send "What is the capital of France?" 
```

```bash
curl -X POST $ASH_SERVER_URL/api/sessions/SESSION_ID/messages \
  -H "Content-Type: application/json" \
  -d '{"content": "What is the capital of France?"}' \
  -N
```

### Per-Message Model Override

You can override the model for a single message. This takes the highest precedence — it overrides both the session model and the agent's default. Useful for using a more capable model on hard tasks or a cheaper model on simple ones.

```typescript
for await (const event of client.sendMessageStream(session.id, 'Analyze this complex codebase', {
  model: 'claude-opus-4-6',
})) {
  // This message uses Opus regardless of the session/agent default
}
```

```python
for event in client.send_message_stream(session.id, "Analyze this complex codebase", model="claude-opus-4-6"):
    pass  # This message uses Opus regardless of the session/agent default
```

```bash
curl -X POST $ASH_SERVER_URL/api/sessions/SESSION_ID/messages \
  -H "Content-Type: application/json" \
  -d '{"content": "Analyze this complex codebase", "model": "claude-opus-4-6"}' \
  -N
```

## Multi-Turn Conversations

Sessions preserve full conversation context across turns. Each message builds on the previous ones.

```typescript
const session = await client.createSession('my-agent');

// Turn 1
for await (const event of client.sendMessageStream(session.id, 'My name is Alice.')) {
  // Agent acknowledges
}

// Turn 2 -- agent remembers context from turn 1
for await (const event of client.sendMessageStream(session.id, 'What is my name?')) {
  if (event.type === 'message') {
    const text = extractTextFromEvent(event.data);
    if (text) console.log(text); // "Your name is Alice."
  }
}
```

Messages are persisted to the database.
You can retrieve them later:

```typescript
const messages = await client.listMessages(session.id);
for (const msg of messages) {
  console.log(`[${msg.role}] ${msg.content}`);
}
```

```python
session = client.create_session("my-agent")

# Turn 1
for event in client.send_message_stream(session.id, "My name is Alice."):
    pass  # Agent acknowledges

# Turn 2 -- agent remembers context from turn 1
for event in client.send_message_stream(session.id, "What is my name?"):
    if event.type == "message":
        data = event.data
        if data.get("type") == "assistant":
            for block in data.get("message", {}).get("content", []):
                if block.get("type") == "text":
                    print(block["text"])  # "Your name is Alice."
```

Messages are persisted to the database. You can retrieve them later:

```python
messages = client.list_messages(session.id)
for msg in messages:
    print(f"[{msg.role}] {msg.content}")
```

## Pausing a Session

Pausing a session marks it as idle. The sandbox may remain alive for fast resume, but the session stops accepting new messages until resumed.

```typescript
const paused = await client.pauseSession(session.id);
console.log(paused.status); // "paused"
```

```python
session = client.pause_session(session.id)
```

```bash
ash session pause <session-id>
```

```bash
curl -X POST $ASH_SERVER_URL/api/sessions/SESSION_ID/pause
```

## Resuming a Session

Resume brings a paused or errored session back to `active`. Ash uses two resume paths:

**Fast path (warm resume):** If the original sandbox is still alive, the session resumes instantly with no state loss. This is the common case when resuming shortly after pausing.

**Cold path (cold resume):** If the sandbox was reclaimed (idle timeout, OOM, server restart), Ash creates a new sandbox. Workspace state is restored from the persisted snapshot if available. Conversation history is preserved in the database regardless.
```typescript
const resumed = await client.resumeSession(session.id);
console.log(resumed.status); // "active"
```

```python
session = client.resume_session(session.id)
```

```bash
ash session resume <session-id>
```

```bash
curl -X POST $ASH_SERVER_URL/api/sessions/SESSION_ID/resume
```

Response includes the resume path taken:

```json
{
  "session": {
    "id": "a1b2c3d4-...",
    "status": "active",
    "sandboxId": "a1b2c3d4-..."
  }
}
```

## Ending a Session

Ending a session destroys the sandbox and marks the session as permanently closed. The session's messages and events remain in the database for retrieval, but no new messages can be sent.

```typescript
const ended = await client.endSession(session.id);
console.log(ended.status); // "ended"
```

```python
session = client.end_session(session.id)
```

```bash
ash session end <session-id>
```

```bash
curl -X DELETE $ASH_SERVER_URL/api/sessions/SESSION_ID
```

## Listing Sessions

```typescript
// All sessions
const sessions = await client.listSessions();

// Filter by agent
const agentSessions = await client.listSessions('my-agent');
```

```python
sessions = client.list_sessions()
```

```bash
ash session list
```

```bash
# All sessions
curl $ASH_SERVER_URL/api/sessions

# Filter by agent
curl "$ASH_SERVER_URL/api/sessions?agent=my-agent"
```

## API Reference

| Method | Endpoint | Description |
|--------|----------|-------------|
| `POST` | `/api/sessions` | Create a session |
| `GET` | `/api/sessions` | List sessions (optional `?agent=` filter) |
| `GET` | `/api/sessions/:id` | Get session details |
| `POST` | `/api/sessions/:id/messages` | Send a message (returns SSE stream) |
| `GET` | `/api/sessions/:id/messages` | List persisted messages |
| `POST` | `/api/sessions/:id/pause` | Pause a session |
| `POST` | `/api/sessions/:id/resume` | Resume a session |
| `DELETE` | `/api/sessions/:id` | End a session |

---

# Streaming Responses

Source: https://docs.ash-cloud.ai/guides/streaming-responses

# Streaming Responses

When you send a message to a session, the response is delivered
as a Server-Sent Events (SSE) stream. Events arrive in real time as the agent thinks, uses tools, and generates text. ## SSE Event Types The stream carries three event types: | Event | Description | |-------|-------------| | `message` | An SDK message from the agent. Contains assistant text, tool use, tool results, or stream deltas. | | `error` | An error occurred during processing. | | `done` | The agent's turn is complete. | Each SSE frame has the format: ``` event: message data: {"type": "assistant", "message": {"content": [{"type": "text", "text": "Hello!"}]}} event: done data: {"sessionId": "a1b2c3d4-..."} ``` The `data` field of `message` events carries raw SDK message objects passed through from the Claude Code SDK. The shape varies by message type (`assistant`, `user`, `result`, `stream_event`). ## Basic Streaming The `sendMessageStream` method returns an async generator of typed events: ```typescript const client = new AshClient({ serverUrl: 'http://localhost:4100', apiKey: process.env.ASH_API_KEY }); const session = await client.createSession('my-agent'); for await (const event of client.sendMessageStream(session.id, 'Explain TCP in one paragraph.')) { switch (event.type) { case 'message': { const text = extractTextFromEvent(event.data); if (text) { process.stdout.write(text); } break; } case 'error': console.error('Error:', event.data.error); break; case 'done': console.log('\nDone.'); break; } } ``` ```python from ash_sdk import AshClient client = AshClient("http://localhost:4100", api_key=os.environ["ASH_API_KEY"]) session = client.create_session("my-agent") for event in client.send_message_stream(session.id, "Explain TCP in one paragraph."): if event.type == "message": data = event.data # Extract text from assistant messages if data.get("type") == "assistant": content = data.get("message", {}).get("content", []) for block in content: if block.get("type") == "text": print(block["text"], end="") elif event.type == "error": print(f"Error: 
{event.data.get('error')}") elif event.type == "done": print("\nDone.") ``` Use the `-N` flag to disable output buffering so events print as they arrive: ```bash curl -N -X POST $ASH_SERVER_URL/api/sessions/SESSION_ID/messages \ -H "Content-Type: application/json" \ -d '{"content": "Hello!"}' ``` Output: ``` event: message data: {"type":"assistant","message":{"content":[{"type":"text","text":"Hello! How can I help you?"}]}} event: done data: {"sessionId":"a1b2c3d4-..."} ``` ## Display Items For richer output that includes tool use and tool results, use `extractDisplayItems`: ```typescript for await (const event of client.sendMessageStream(session.id, 'List files in /tmp')) { if (event.type === 'message') { const items = extractDisplayItems(event.data); if (items) { for (const item of items) { switch (item.type) { case 'text': console.log(item.content); break; case 'tool_use': console.log(`[Tool: ${item.toolName}] ${item.toolInput}`); break; case 'tool_result': console.log(`[Result] ${item.content}`); break; } } } } } ``` ```python for event in client.send_message_stream(session.id, "List files in /tmp"): if event.type == "message": data = event.data if data.get("type") == "assistant": for block in data.get("message", {}).get("content", []): if block.get("type") == "text": print(block["text"]) elif block.get("type") == "tool_use": print(f"[Tool: {block['name']}] {block.get('input', '')}") elif data.get("type") == "result": for block in data.get("content", []): if block.get("type") == "text": print(f"[Result] {block['text']}") ``` ## Partial Messages (Real-Time Streaming) By default, `message` events contain complete SDK messages. 
To receive incremental text deltas as the agent types, enable `includePartialMessages`: ```typescript for await (const event of client.sendMessageStream( session.id, 'Write a haiku about servers.', { includePartialMessages: true }, )) { if (event.type === 'message') { const delta = extractStreamDelta(event.data); if (delta) { process.stdout.write(delta); // Character-by-character streaming } } } ``` The `extractStreamDelta` helper extracts text from `content_block_delta` stream events. It returns `null` for non-delta events, so you can safely call it on every message. ```python for event in client.send_message_stream( session.id, "Write a haiku about servers.", include_partial_messages=True, ): if event.type == "message": data = event.data if data.get("type") == "stream_event": evt = data.get("event", {}) if evt.get("type") == "content_block_delta": delta = evt.get("delta", {}) if delta.get("type") == "text_delta": print(delta.get("text", ""), end="", flush=True) ``` ## Browser (Raw Fetch) For browser applications that do not use the SDK, parse the SSE stream directly with `ReadableStream`: ```javascript const response = await fetch('http://localhost:4100/api/sessions/SESSION_ID/messages', { method: 'POST', headers: { 'Content-Type': 'application/json', 'Authorization': 'Bearer YOUR_API_KEY', }, body: JSON.stringify({ content: 'Hello!' 
}), }); const reader = response.body.getReader(); const decoder = new TextDecoder(); let buffer = ''; let currentEvent = ''; while (true) { const { done, value } = await reader.read(); if (done) break; buffer += decoder.decode(value, { stream: true }); const lines = buffer.split('\n'); buffer = lines.pop() || ''; for (const line of lines) { if (line.startsWith('event: ')) { currentEvent = line.slice(7).trim(); } else if (line.startsWith('data: ')) { const data = JSON.parse(line.slice(6)); if (currentEvent === 'message') { // Handle message console.log(data); } else if (currentEvent === 'done') { console.log('Stream complete'); } else if (currentEvent === 'error') { console.error(data.error); } } } } ``` ## Error Handling Errors can arrive at two levels: **connection errors** (network failure, server restart) throw exceptions, and **agent errors** (sandbox crash, SDK error) arrive as `error` events within the stream. Handle both: ```typescript try { for await (const event of client.sendMessageStream(sessionId, 'Hello')) { if (event.type === 'message') { const text = extractTextFromEvent(event.data); if (text) process.stdout.write(text); } else if (event.type === 'error') { // Agent-level error (sandbox crash, OOM, SDK error) console.error('Agent error:', event.data.error); } else if (event.type === 'done') { console.log('\nDone.'); } } } catch (err) { // Connection-level error (network failure, server restart, 404) console.error('Connection error:', err.message); } ``` ```python try: for event in client.send_message_stream(session_id, "Hello"): if event.type == "message": data = event.data if data.get("type") == "assistant": for block in data.get("message", {}).get("content", []): if block.get("type") == "text": print(block["text"], end="") elif event.type == "error": # Agent-level error (sandbox crash, OOM, SDK error) print(f"Agent error: {event.data.get('error')}") elif event.type == "done": print("\nDone.") except Exception as e: # Connection-level error (network 
failure, server restart)
    print(f"Connection error: {e}")
```

## Reconnection with Retry

When an SSE stream disconnects (server restart, network blip, load balancer timeout), retry with exponential backoff. If the session's sandbox was destroyed, resume it before retrying.

```typescript
const client = new AshClient({
  serverUrl: 'http://localhost:4100',
  apiKey: process.env.ASH_API_KEY,
});

async function sleep(ms: number) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function streamWithRetry(
  sessionId: string,
  content: string,
  maxRetries = 3,
): Promise<string> {
  let fullText = '';

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      for await (const event of client.sendMessageStream(sessionId, content)) {
        if (event.type === 'message') {
          const text = extractTextFromEvent(event.data);
          if (text) {
            fullText += text;
            process.stdout.write(text);
          }
        } else if (event.type === 'error') {
          throw new Error(`Agent error: ${event.data.error}`);
        }
      }
      return fullText; // Stream completed successfully
    } catch (err) {
      console.warn(`Stream attempt ${attempt + 1} failed: ${(err as Error).message}`);
      if (attempt === maxRetries - 1) throw err;

      // Check if the session needs recovery before retrying
      try {
        const session = await client.getSession(sessionId);
        if (session.status === 'paused' || session.status === 'error') {
          await client.resumeSession(sessionId);
          console.log('Session resumed after disconnect');
        }
      } catch {
        // Server might be temporarily unreachable — wait and retry
      }

      // Exponential backoff: 1s, 2s, 4s
      await sleep(Math.pow(2, attempt) * 1000);
    }
  }

  return fullText;
}

// Usage
const session = await client.createSession('my-agent');
const result = await streamWithRetry(session.id, 'Analyze this code');
```

```python
import os
import time

from ash_sdk import AshClient

client = AshClient(
    server_url="http://localhost:4100",
    api_key=os.environ["ASH_API_KEY"],
)

def stream_with_retry(session_id: str, content: str, max_retries: int = 3) -> str:
    full_text = ""
    for attempt in
range(max_retries): try: for event in client.send_message_stream(session_id, content): if event.type == "message": data = event.data if data.get("type") == "assistant": for block in data.get("message", {}).get("content", []): if block.get("type") == "text": full_text += block["text"] print(block["text"], end="", flush=True) elif event.type == "error": raise Exception(f"Agent error: {event.data.get('error')}") return full_text # Stream completed successfully except Exception as e: print(f"\nStream attempt {attempt + 1} failed: {e}") if attempt == max_retries - 1: raise # Check if the session needs recovery try: session = client.get_session(session_id) if session.status in ("paused", "error"): client.resume_session(session_id) print("Session resumed after disconnect") except Exception: pass # Server temporarily unreachable # Exponential backoff time.sleep(2 ** attempt) return full_text # Usage session = client.create_session("my-agent") result = stream_with_retry(session.id, "Analyze this code") ``` ## Backpressure Ash handles backpressure automatically on the server side. When your client reads the SSE stream slowly, the server pauses the upstream agent rather than buffering unbounded data in memory. **What this means for your client:** - **You do not need to implement client-side backpressure.** Read the stream at whatever pace you can handle. If you process events slowly, the server waits. - **Memory is bounded.** The server buffers at most one SSE frame plus the kernel TCP send buffer (typically 128 KB - 1 MB). There is no application-level buffering. - **Slow clients get disconnected after 30 seconds.** If your client stops reading for more than 30 seconds, the server closes the stream with a timeout error. Reconnect and resume the session to continue. See [SSE Backpressure](../architecture/sse-backpressure.md) for the full server-side implementation. 
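The pull-based behavior described above can be seen in miniature with an async generator: the producer's body only runs when the consumer requests the next value, so it can never run more than one frame ahead of a slow reader. This is a standalone sketch of the mechanism, not Ash's server code:

```typescript
// A toy producer: records each frame it generates, then yields it.
// Because async generators are pull-based, the body pauses at `yield`
// until the consumer asks for the next frame.
async function* produceFrames(count: number, produced: string[]): AsyncGenerator<string> {
  for (let i = 0; i < count; i++) {
    const frame = `frame-${i}`;
    produced.push(frame);
    yield frame;
  }
}

// A deliberately slow consumer. Returns true if the producer never ran
// ahead of consumption (exactly one frame generated per frame read).
async function consumeSlowly(count: number): Promise<boolean> {
  const produced: string[] = [];
  let consumed = 0;
  let bounded = true;
  for await (const frame of produceFrames(count, produced)) {
    consumed++;
    if (produced.length !== consumed) bounded = false;
    await new Promise((resolve) => setTimeout(resolve, 5)); // simulate a slow client
  }
  return bounded && produced.length === count;
}
```

The SSE stream gives you the same property from the client's perspective: reading slowly simply delays the producer; it does not queue unbounded data on your behalf.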
## Helper Functions Reference The `@ash-ai/shared` package exports three helper functions for extracting content from stream events: | Function | Purpose | Returns | |----------|---------|---------| | `extractTextFromEvent(data)` | Extract text content from assistant messages | `string \| null` | | `extractDisplayItems(data)` | Extract structured items (text, tool use, tool results) | `DisplayItem[] \| null` | | `extractStreamDelta(data)` | Extract incremental text from partial stream events | `string \| null` | All three accept the `data` field from a `message` event and return `null` for events that do not match their expected type. --- # Working with Files Source: https://docs.ash-cloud.ai/guides/working-with-files # Working with Files Each session runs inside an isolated sandbox with its own workspace directory. Files the agent creates, modifies, or downloads during a session are accessible through the files API. This lets you review agent-written code, download generated artifacts, or inspect the workspace state. ## Listing Files Retrieve a flat list of all files in a session's workspace. 
```typescript
const client = new AshClient({ serverUrl: 'http://localhost:4100', apiKey: process.env.ASH_API_KEY });

const result = await client.getSessionFiles(sessionId);
console.log(`Source: ${result.source}`); // "sandbox" or "snapshot"
for (const file of result.files) {
  console.log(`${file.path} (${file.size} bytes, modified ${file.modifiedAt})`);
}
```

```python
import os

import httpx

from ash_sdk import AshClient

client = AshClient("http://localhost:4100", api_key=os.environ["ASH_API_KEY"])

# The Python SDK uses the raw API response
resp = httpx.get(f"http://localhost:4100/api/sessions/{session_id}/files")
data = resp.json()
for f in data["files"]:
    print(f"{f['path']} ({f['size']} bytes)")
```

```bash
curl $ASH_SERVER_URL/api/sessions/SESSION_ID/files
```

Response:

```json
{
  "files": [
    { "path": "CLAUDE.md", "size": 512, "modifiedAt": "2025-01-15T10:30:00.000Z" },
    { "path": "src/index.ts", "size": 1024, "modifiedAt": "2025-01-15T10:35:00.000Z" },
    { "path": "output/report.md", "size": 4096, "modifiedAt": "2025-01-15T10:36:00.000Z" }
  ],
  "source": "sandbox"
}
```

The `source` field indicates where the file listing came from:

| Source | Meaning |
|--------|---------|
| `sandbox` | Read from the live sandbox process. The session is active or paused with its sandbox still running. |
| `snapshot` | Read from a persisted workspace snapshot. The sandbox was reclaimed but workspace state was saved. |

## Downloading a File (Raw)

Download a file as raw bytes. This is the default behavior and works for any file type — text, binary, images, PDFs, etc. Files up to 100 MB are supported.
```typescript const { buffer, mimeType, source } = await client.downloadSessionFile(sessionId, 'output/report.pdf'); console.log(`Type: ${mimeType}, Source: ${source}`); fs.writeFileSync('report.pdf', buffer); ``` ```python resp = httpx.get(f"http://localhost:4100/api/sessions/{session_id}/files/output/report.pdf") with open("report.pdf", "wb") as f: f.write(resp.content) print(f"Type: {resp.headers['content-type']}") print(f"Source: {resp.headers['x-ash-source']}") ``` ```bash # Download raw file curl -o report.pdf $ASH_SERVER_URL/api/sessions/SESSION_ID/files/output/report.pdf ``` The response includes these headers: | Header | Description | |--------|-------------| | `Content-Type` | MIME type based on file extension (e.g. `application/pdf`, `text/typescript`) | | `Content-Disposition` | Suggested filename for download | | `Content-Length` | File size in bytes | | `X-Ash-Source` | `sandbox` or `snapshot` | ## Reading a File (JSON) For text files, you can get the content inline as a JSON response by adding `?format=json`. This is useful for building UIs that display file content directly. Limited to 1 MB. ```typescript const file = await client.getSessionFile(sessionId, 'src/index.ts'); console.log(`Path: ${file.path}`); console.log(`Size: ${file.size} bytes`); console.log(`Source: ${file.source}`); console.log(file.content); ``` ```python resp = httpx.get( f"http://localhost:4100/api/sessions/{session_id}/files/src/index.ts", params={"format": "json"} ) data = resp.json() print(f"Path: {data['path']}") print(f"Size: {data['size']} bytes") print(data["content"]) ``` ```bash curl "$ASH_SERVER_URL/api/sessions/SESSION_ID/files/src/index.ts?format=json" ``` Response: ```json { "path": "src/index.ts", "content": "console.log('hello world');\n", "size": 28, "source": "sandbox" } ``` ### Limitations (JSON mode) - Maximum file size is 1 MB. For larger files, use the raw download. - Content is read as UTF-8 text. Binary files should use the raw download instead. 
- Path traversal (`..`) and absolute paths (`/`) are rejected with a 400 error. - Certain directories are excluded from listings: `node_modules`, `.git`, `__pycache__`, `.cache`, `.npm`, `.venv`, and other common dependency/cache directories. ## Workspace Isolation Each session's workspace is isolated from other sessions and from the host system. The agent can read and write files within its workspace but cannot access files outside of it. When a session is created, the agent definition folder is copied into the sandbox workspace. Any files the agent creates during the session live alongside the agent definition files. When a session is paused or ended, the workspace state is persisted as a snapshot. If the session is later resumed with a new sandbox (cold resume), the snapshot is restored so the agent picks up where it left off. ## Use Cases **Reviewing agent-written code.** After an agent writes code in response to a prompt, list the workspace files and read specific files to review what was generated. 
```typescript const session = await client.createSession('code-writer'); // Ask the agent to write something for await (const event of client.sendMessageStream(session.id, 'Write a Python fibonacci function')) { // wait for completion } // Review what was written const files = await client.getSessionFiles(session.id); for (const f of files.files) { if (f.path.endsWith('.py')) { const content = await client.getSessionFile(session.id, f.path); console.log(`--- ${content.path} ---`); console.log(content.content); } } ``` ```python session = client.create_session("code-writer") # Ask the agent to write something for event in client.send_message_stream(session.id, "Write a Python fibonacci function"): pass # wait for completion # Review what was written resp = httpx.get(f"http://localhost:4100/api/sessions/{session.id}/files") for f in resp.json()["files"]: if f["path"].endswith(".py"): file_resp = httpx.get( f"http://localhost:4100/api/sessions/{session.id}/files/{f['path']}", params={"format": "json"} ) data = file_resp.json() print(f"--- {data['path']} ---") print(data["content"]) ``` **Downloading binary artifacts.** If an agent generates images, PDFs, or other binary files, download them directly. ```typescript // Download a generated image const { buffer } = await client.downloadSessionFile(session.id, 'output/chart.png'); fs.writeFileSync('chart.png', buffer); ``` ```python # Download a generated image resp = httpx.get(f"http://localhost:4100/api/sessions/{session.id}/files/output/chart.png") with open("chart.png", "wb") as f: f.write(resp.content) ``` **Accessing files after a session ends.** Files remain available from the persisted snapshot. 
```typescript await client.endSession(session.id); // Files still accessible from snapshot const report = await client.getSessionFile(session.id, 'output/report.md'); ``` ```python client.end_session(session.id) # Files still accessible from snapshot resp = httpx.get( f"http://localhost:4100/api/sessions/{session.id}/files/output/report.md", params={"format": "json"} ) report = resp.json() ``` ## API Reference | Method | Endpoint | Description | |--------|----------|-------------| | `GET` | `/api/sessions/:id/files` | List all files in the session workspace | | `GET` | `/api/sessions/:id/files/*path` | Download a file (raw bytes by default, JSON with `?format=json`) | --- # Authentication Source: https://docs.ash-cloud.ai/guides/authentication # Authentication Ash uses Bearer token authentication to protect API endpoints. All requests to `/api/*` routes require a valid API key. Authentication is always enabled — the server auto-generates an API key on first start if one is not provided. ## Auto-Generated API Key When you run `ash start` for the first time, the server automatically generates a secure API key (prefixed `ash_`) and: 1. Stores the hashed key in the database. 2. Writes the plaintext key to `~/.ash/initial-api-key`. 3. Logs the key to stdout. The CLI automatically picks up this key and saves it to `~/.ash/config.json`. No manual configuration is needed for local development. ## Manual Configuration To use a specific API key instead of the auto-generated one, set the `ASH_API_KEY` environment variable: ```bash export ASH_API_KEY="your-key-here" ``` Or pass it when starting the server: ```bash ash start -e ASH_API_KEY=your-key-here ``` When `ASH_API_KEY` is set, the server uses it directly instead of auto-generating one. 
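The generate-then-hash flow described above (plaintext key shown once, only a hash stored) can be sketched as follows. The key length, hash algorithm, and function names here are assumptions for illustration, not Ash's actual implementation:

```typescript
import { createHash, randomBytes, timingSafeEqual } from 'node:crypto';

// Generate an `ash_`-prefixed key (length is an assumption for illustration).
function generateApiKey(): string {
  return 'ash_' + randomBytes(24).toString('hex');
}

// Store only this hash; the plaintext key is shown to the user once.
function hashApiKey(key: string): string {
  return createHash('sha256').update(key).digest('hex');
}

// Compare hashes in constant time so verification does not leak timing info.
function verifyApiKey(presented: string, storedHash: string): boolean {
  const a = Buffer.from(hashApiKey(presented), 'hex');
  const b = Buffer.from(storedHash, 'hex');
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Storing only the hash in the database is why the plaintext key must be captured from `~/.ash/initial-api-key` or the startup logs when the server first runs.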
## Sending Authenticated Requests Pass the API key when creating the client: ```typescript const client = new AshClient({ serverUrl: 'http://localhost:4100', apiKey: 'your-generated-key-here', }); // All subsequent calls include the Authorization header automatically const agents = await client.listAgents(); ``` ```python from ash_sdk import AshClient client = AshClient( "http://localhost:4100", api_key="your-generated-key-here", ) agents = client.list_agents() ``` Set the `ASH_API_KEY` environment variable: ```bash export ASH_API_KEY="your-generated-key-here" ash agent list ``` Or pass it inline: ```bash ASH_API_KEY="your-generated-key-here" ash agent list ``` Include the `Authorization` header with the `Bearer` scheme: ```bash curl $ASH_SERVER_URL/api/agents \ -H "Authorization: Bearer your-generated-key-here" ``` ## Public Endpoints The following endpoints do not require authentication, even when `ASH_API_KEY` is set: | Endpoint | Description | |----------|-------------| | `GET /health` | Server health check | | `GET /metrics` | Prometheus metrics | | `GET /docs/*` | API documentation (Swagger UI) | ## Error Responses ### 401 -- Missing Authorization Header Returned when the request has no `Authorization` header: ```json { "error": "Missing Authorization header", "statusCode": 401 } ``` ### 401 -- Invalid API Key Returned when the `Authorization` header is present but the key does not match: ```json { "error": "Invalid API key", "statusCode": 401 } ``` ### 401 -- Malformed Header Returned when the `Authorization` header does not use the `Bearer ` format: ```json { "error": "Invalid Authorization header format", "statusCode": 401 } ``` ## Auth Resolution Order When a request arrives, the server resolves authentication in the following order: 1. **Public endpoints** (`/health`, `/metrics`, `/docs/*`) -- skip auth entirely. 2. **Internal endpoints** (`/api/internal/*`) -- authenticated via `ASH_INTERNAL_SECRET` (used for runner registration). 3.
**Bearer token present** -- validate against `ASH_API_KEY` or the database API keys table. Accept if matched. 4. **No match** -- reject with 401. --- # Monitoring Source: https://docs.ash-cloud.ai/guides/monitoring # Monitoring Ash exposes health checks, Prometheus metrics, debug timing, and structured logs for production monitoring. ## Health Endpoint `GET /health` returns the server's current status. This endpoint does not require authentication. ```bash curl $ASH_SERVER_URL/health ``` Response: ```json { "status": "ok", "activeSessions": 3, "activeSandboxes": 5, "uptime": 86400, "pool": { "total": 10, "cold": 2, "warming": 1, "warm": 2, "waiting": 3, "running": 2, "maxCapacity": 1000, "resumeWarmHits": 42, "resumeColdHits": 7 } } ``` | Field | Description | |-------|-------------| | `status` | Always `"ok"` if the server is reachable. | | `activeSessions` | Number of sessions with status `active`. | | `activeSandboxes` | Number of live sandbox processes. | | `uptime` | Seconds since server start. | | `pool.total` | Total sandboxes in the pool (all states). | | `pool.warm` | Sandboxes ready to accept work immediately. | | `pool.running` | Sandboxes actively processing a message. | | `pool.maxCapacity` | Maximum number of sandboxes the pool allows. | | `pool.resumeWarmHits` | Times a session resumed with its sandbox still alive (fast path). | | `pool.resumeColdHits` | Times a session resumed by creating a new sandbox (cold path). | ## Prometheus Metrics `GET /metrics` returns metrics in Prometheus text exposition format. This endpoint does not require authentication. ```bash curl $ASH_SERVER_URL/metrics ``` Response: ``` # HELP ash_up Whether the Ash server is up (always 1 if reachable). # TYPE ash_up gauge ash_up 1 # HELP ash_uptime_seconds Seconds since server start. # TYPE ash_uptime_seconds gauge ash_uptime_seconds 86400 # HELP ash_active_sessions Number of active sessions. 
# TYPE ash_active_sessions gauge ash_active_sessions 3 # HELP ash_active_sandboxes Number of live sandbox processes. # TYPE ash_active_sandboxes gauge ash_active_sandboxes 5 # HELP ash_pool_sandboxes Sandbox count by state. # TYPE ash_pool_sandboxes gauge ash_pool_sandboxes{state="cold"} 2 ash_pool_sandboxes{state="warming"} 1 ash_pool_sandboxes{state="warm"} 2 ash_pool_sandboxes{state="waiting"} 3 ash_pool_sandboxes{state="running"} 2 # HELP ash_pool_max_capacity Maximum sandbox capacity. # TYPE ash_pool_max_capacity gauge ash_pool_max_capacity 1000 # HELP ash_resume_total Total session resumes by path (warm=sandbox alive, cold=new sandbox). # TYPE ash_resume_total counter ash_resume_total{path="warm"} 42 ash_resume_total{path="cold"} 7 ``` ### Metric Reference | Metric | Type | Description | |--------|------|-------------| | `ash_up` | gauge | Always 1 if the server is reachable. Use for up/down alerting. | | `ash_uptime_seconds` | gauge | Seconds since server process started. | | `ash_active_sessions` | gauge | Sessions currently in `active` state. | | `ash_active_sandboxes` | gauge | Live sandbox processes (includes all states). | | `ash_pool_sandboxes` | gauge | Sandbox count broken down by state label: `cold`, `warming`, `warm`, `waiting`, `running`. | | `ash_pool_max_capacity` | gauge | Maximum sandboxes the pool will create. | | `ash_resume_total` | counter | Cumulative session resumes by path: `warm` (sandbox alive) or `cold` (new sandbox). 
| ### Prometheus Configuration Add Ash as a scrape target in `prometheus.yml`: ```yaml scrape_configs: - job_name: 'ash' scrape_interval: 15s static_configs: - targets: ['localhost:4100'] metrics_path: /metrics ``` ### Example PromQL Queries Active sessions over time: ```promql ash_active_sessions ``` Warm resume hit rate (percentage of resumes that were fast): ```promql ash_resume_total{path="warm"} / (ash_resume_total{path="warm"} + ash_resume_total{path="cold"}) ``` Pool utilization (fraction of capacity in use): ```promql sum(ash_pool_sandboxes) / ash_pool_max_capacity ``` Running sandboxes (actively processing messages): ```promql ash_pool_sandboxes{state="running"} ``` Alert when pool is over 80% capacity: ```promql sum(ash_pool_sandboxes) / ash_pool_max_capacity > 0.8 ``` ## Debug Timing Set `ASH_DEBUG_TIMING=1` to enable per-message timing instrumentation. When enabled, the server writes one JSON line to stderr for each message processed: ```bash ASH_DEBUG_TIMING=1 ash start ``` Timing output: ```json { "type": "timing", "source": "server", "sessionId": "a1b2c3d4-...", "sandboxId": "a1b2c3d4-...", "lookupMs": 0.42, "firstEventMs": 145.8, "totalMs": 2340.5, "eventCount": 12, "timestamp": "2025-01-15T10:30:00.000Z" } ``` | Field | Description | |-------|-------------| | `lookupMs` | Time to look up the session and sandbox. | | `firstEventMs` | Time from request to first SSE event (time-to-first-token). | | `totalMs` | Total request duration. | | `eventCount` | Number of SSE events sent. | Timing is zero-overhead when `ASH_DEBUG_TIMING` is not set. The check is a single `process.env` read per message. ## Structured Logs Ash writes structured JSON log lines to stderr. Each line is a self-contained JSON object. 
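These structured lines are also easy to consume from code rather than `jq`. A minimal sketch that filters events by `type` and computes the warm-resume ratio (the helper names are ours; the log shape follows the examples in this section):

```python
import json
from typing import Iterable, Iterator

def filter_events(lines: Iterable[str], event_type: str) -> Iterator[dict]:
    """Yield parsed log objects matching a given `type`, skipping non-JSON lines."""
    for line in lines:
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text log lines are ignored
        if isinstance(obj, dict) and obj.get("type") == event_type:
            yield obj

def warm_hit_rate(lines: Iterable[str]) -> float:
    """Fraction of resume_hit events that took the warm (fast) path."""
    paths = [e["path"] for e in filter_events(lines, "resume_hit")]
    return paths.count("warm") / len(paths) if paths else 0.0
```

Feed it `sys.stdin` (or a log file) to get the same numbers as the `jq` recipes, but usable from a monitoring script.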
### Resume Logging Every session resume emits a log line (always on, not gated by `ASH_DEBUG_TIMING`): ```json { "type": "resume_hit", "path": "warm", "sessionId": "a1b2c3d4-...", "agentName": "my-agent", "ts": "2025-01-15T10:30:00.000Z" } ``` The `path` field is `warm` (sandbox still alive) or `cold` (new sandbox created). ### Log Analysis with jq Filter resume events: ```bash ash start 2>&1 | jq -c 'select(.type == "resume_hit")' ``` Count warm vs cold resumes: ```bash ash start 2>&1 | jq -c 'select(.type == "resume_hit")' | \ jq -s 'group_by(.path) | map({path: .[0].path, count: length})' ``` Filter timing data for a specific session: ```bash ash start 2>&1 | jq -c 'select(.type == "timing" and .sessionId == "SESSION_ID")' ``` Find slow messages (time-to-first-token over 500ms): ```bash ash start 2>&1 | jq -c 'select(.type == "timing" and .firstEventMs > 500)' ``` Average time-to-first-token: ```bash ash start 2>&1 | jq -cs '[.[] | select(.type == "timing")] | (map(.firstEventMs) | add) / length' ``` --- # Docker (Default) Source: https://docs.ash-cloud.ai/self-hosting/docker # Docker (Default) The recommended way to run Ash is via Docker. The `ash start` command manages the entire Docker lifecycle for you -- pulling the image, creating volumes, and starting the container with the correct flags. ## Quick Start ```bash npm install -g @ash-ai/cli export ANTHROPIC_API_KEY=sk-ant-... ash start ``` That is it. The server is now running at `http://localhost:4100`. ## What `ash start` Does When you run `ash start`, the CLI performs the following steps in order: 1. **Checks Docker** -- verifies Docker is installed and the daemon is running. 2. **Removes stale containers** -- if a stopped `ash-server` container exists, it is removed. 3. **Creates `~/.ash/`** -- ensures the persistent data directory exists on the host. 4. **Pulls the image** -- downloads `ghcr.io/ash-ai/ash:0.1.0` (skip with `--no-pull`). 5. 
**Starts the container** -- runs `docker run` with the flags described below. 6. **Waits for healthy** -- polls `GET /health` until the server responds 200 (up to 30 seconds). ### Docker Flags The container is started with these flags: | Flag | Purpose | |------|---------| | `--init` | Runs [tini](https://github.com/krallin/tini) as PID 1 so signals (SIGTERM, SIGINT) are forwarded correctly to child processes. Without this, sandbox processes can become zombies on shutdown. | | `--cgroupns=host` | Shares the host's cgroup namespace so the entrypoint script can create per-sandbox cgroups for memory, CPU, and process limits. | | `-v ~/.ash:/data` | Mounts the host data directory into the container. All persistent state -- SQLite database, agent definitions, session workspaces -- lives here. | | `-p 4100:4100` | Exposes the API on the host. Configurable with `--port`. | | `-e ANTHROPIC_API_KEY=...` | Passes your API key into the container. The key is read from your shell environment. | ### Entrypoint: cgroup v2 Setup The container uses a custom entrypoint (`docker-entrypoint.sh`) that configures cgroup v2 delegation before starting the server. This enables per-sandbox resource limits (memory, CPU, process count). If cgroup v2 is not available (older kernels or restricted Docker configurations), the server falls back to ulimit-based limits. 
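To see which path your host will take, you can probe for cgroup v2 the same way container runtimes commonly do: check for the `cgroup.controllers` file at the unified-hierarchy mount root. A sketch (the helper is ours, not part of Ash):

```python
from pathlib import Path

def cgroup_v2_available(root: Path = Path("/sys/fs/cgroup")) -> bool:
    """cgroup v2 (the unified hierarchy) exposes cgroup.controllers at the mount root."""
    return (root / "cgroup.controllers").is_file()
```

If this returns `False` on your host, expect the ulimit-based fallback rather than per-sandbox cgroup limits.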
## Lifecycle Commands ```bash # Start the server (pulls image, creates container, waits for healthy) ash start # Check server status (container state + health endpoint) ash status # View logs (add -f to follow) ash logs ash logs -f # Stop and remove the container ash stop ``` ## Configuration Pass environment variables to the container with `-e`: ```bash ash start -e ASH_MAX_SANDBOXES=50 # Override the auto-generated API key (optional — not required for basic setup) ash start -e ASH_API_KEY=my-secret-key ``` Use `--database-url` to connect to an external database instead of the default SQLite: ```bash ash start --database-url "postgresql://user:pass@host:5432/ash" ``` Use `--port` to change the host port: ```bash ash start --port 8080 ``` Use `--image` to run a custom image (for example, a local build): ```bash docker build -t ash-dev . ash start --image ash-dev --no-pull ``` See the [Configuration Reference](./configuration.md) for all environment variables and [Streaming Telemetry](./telemetry.md) for OpenTelemetry tracing and event collection setup. ## Volume Mount Layout The host directory `~/.ash/` is mounted into the container at `/data/`. Here is what it contains: ``` ~/.ash/ (host) → /data/ (container) ├── ash.db SQLite database (agents, sessions, sandboxes, messages) ├── agents/ Deployed agent definitions │ └── my-agent/ │ ├── CLAUDE.md │ └── .claude/ ├── sessions/ Persisted session workspaces │ └── <session-id>/ │ ├── workspace/ Snapshot of the sandbox filesystem │ └── metadata.json Agent name, persist timestamp └── sandboxes/ Active sandbox working directories └── <sandbox-id>/ └── workspace/ ``` Because all state lives in `~/.ash/`, you can stop and restart the container without losing data. Sessions, agents, and the database survive across restarts.
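Because that one directory is the entire persistent state, backing up an Ash host is just archiving it. A minimal sketch (the helper is ours; stop the container first if you want a consistent SQLite snapshot):

```python
import tarfile
from pathlib import Path

def backup_data_dir(data_dir: Path, dest: Path) -> Path:
    """Archive the Ash data directory (database, agents, sessions) as a .tar.gz."""
    with tarfile.open(dest, "w:gz") as tar:
        tar.add(data_dir, arcname=data_dir.name)
    return dest
```

Restoring is the inverse: extract the archive back to `~/.ash/` before starting the container.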
## Docker Compose for Production For production deployments with CockroachDB (or PostgreSQL): ```yaml version: "3.8" services: ash: image: ghcr.io/ash-ai/ash:0.1.0 init: true privileged: true ports: - "4100:4100" volumes: - ash-data:/data environment: - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY} - ASH_API_KEY=${ASH_API_KEY} # Required for Docker Compose — auto-generation only works with `ash start` - ASH_DATABASE_URL=postgresql://ash:ash@cockroach:26257/ash?sslmode=disable - ASH_MAX_SANDBOXES=200 - ASH_IDLE_TIMEOUT_MS=1800000 depends_on: cockroach: condition: service_healthy cockroach: image: cockroachdb/cockroach:v24.1.0 command: start-single-node --insecure ports: - "26257:26257" - "8080:8080" volumes: - cockroach-data:/cockroach/cockroach-data healthcheck: test: ["CMD", "curl", "-f", "http://localhost:8080/health"] interval: 5s timeout: 5s retries: 10 volumes: ash-data: cockroach-data: ``` Start with: ```bash export ANTHROPIC_API_KEY=sk-ant-... export ASH_API_KEY=my-production-key docker compose up -d ``` ## Resource Recommendations | Concurrent Sessions | CPU | RAM | Disk | |---------------------|-----|-----|------| | 1--5 | 2 cores | 4 GB | 20 GB | | 5--20 | 4 cores | 8 GB | 50 GB | | 20--50 | 8 cores | 16 GB | 100 GB | Each active sandbox uses up to 2 GB of memory (configurable via resource limits) and 1 GB of disk by default. Plan capacity based on your peak concurrent session count, not total sessions -- idle sessions are evicted to disk and do not consume memory. ## Using a Local Build If you are developing Ash itself or need a custom image: ```bash # Build the image from the repository root docker build -t ash-dev . # Start using the local image (skip pulling from registry) ash start --image ash-dev --no-pull ``` The Dockerfile builds the full monorepo, installs `@anthropic-ai/claude-code` globally, creates a non-root sandbox user, and configures the entrypoint for cgroup v2 delegation. 
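Both keys in the Compose file above are read from the shell environment, and a missing `ASH_API_KEY` is an easy mistake since auto-generation only applies to `ash start`. A preflight sketch that catches unset values before `docker compose up` (the helper and variable list are ours, derived from the Compose example above):

```python
import os

# Variables the Compose file above expects from the shell environment.
REQUIRED_COMPOSE_VARS = ("ANTHROPIC_API_KEY", "ASH_API_KEY")

def missing_compose_vars(env=os.environ) -> list:
    """Return the names of required Compose variables that are unset or empty."""
    return [name for name in REQUIRED_COMPOSE_VARS if not env.get(name)]
```

Run it (or an equivalent shell check) in your deploy pipeline and abort early if the list is non-empty.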
--- # Deploy to AWS EC2 Source: https://docs.ash-cloud.ai/self-hosting/ec2 # Deploy to AWS EC2 This guide walks through deploying Ash to an EC2 instance using the included deploy script. The script provisions an Ubuntu instance, installs Docker, builds the Ash image, starts the server, and deploys the example QA Bot agent. ## Prerequisites - **AWS CLI v2** installed and configured ([install guide](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html)) - **EC2 key pair** created in your target region ([create a key pair](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/create-key-pairs.html)) - **ANTHROPIC_API_KEY** for agent execution ## Quick Start ```bash # Clone the repo git clone https://github.com/ash-ai-org/ash.git cd ash # Create .env from the example cp .env.example .env # Edit .env with your credentials (see below) # Deploy ./scripts/deploy-ec2.sh ``` The script takes 5--8 minutes. When it finishes, it prints the server URL, SSH command, and instructions for connecting the QA Bot UI. ## Configuration Create a `.env` file in the project root with the following variables: ### Required | Variable | Description | |----------|-------------| | `ANTHROPIC_API_KEY` | Your Anthropic API key for agent execution | | `AWS_ACCESS_KEY_ID` | AWS access key with EC2 permissions | | `AWS_SECRET_ACCESS_KEY` | AWS secret key | | `EC2_KEY_NAME` | Name of your EC2 key pair (as shown in the AWS console) | | `EC2_KEY_PATH` | Path to the private key file, e.g. 
`~/.ssh/my-key.pem` | ### Optional | Variable | Default | Description | |----------|---------|-------------| | `AWS_DEFAULT_REGION` | `us-east-1` | AWS region to deploy in | | `EC2_INSTANCE_TYPE` | `t3.large` | Instance type (2 vCPU, 8 GB RAM) | | `EC2_VOLUME_SIZE` | `30` | Root volume size in GB | | `EC2_SECURITY_GROUP_ID` | (created) | Use an existing security group instead of creating one | | `EC2_SUBNET_ID` | (default VPC) | Deploy into a specific subnet | | `ASH_PORT` | `4100` | Port to expose the API on | Example `.env`: ```bash ANTHROPIC_API_KEY=sk-ant-api03-... AWS_ACCESS_KEY_ID=AKIA... AWS_SECRET_ACCESS_KEY=... EC2_KEY_NAME=my-key EC2_KEY_PATH=~/.ssh/my-key.pem AWS_DEFAULT_REGION=us-east-1 EC2_INSTANCE_TYPE=t3.large ``` ## What the Deploy Script Does 1. **Finds the latest Ubuntu 22.04 AMI** in your region. 2. **Creates a security group** (`ash-server-sg`) with ports 22 (SSH) and 4100 (API) open. Skipped if you provide `EC2_SECURITY_GROUP_ID`. 3. **Launches a `t3.large` instance** with a 30 GB gp3 volume and a user-data script that installs Docker, Node.js 20, and pnpm. 4. **Waits for SSH** and the user-data script to complete (~2 minutes). 5. **Syncs the project** to the instance via rsync (excludes `node_modules`, `.git`, `dist`). 6. **Builds the Docker image** on the instance (`docker build -t ash-dev .`). This takes 3--5 minutes on a `t3.large`. 7. **Starts the container** with `--init`, `--privileged`, `--cgroupns=host`, and the volume mount. 8. **Waits for healthy** by polling `GET /health`. 9. **Deploys the qa-bot agent** by copying agent files and calling the API. ## Connecting the QA Bot Example After deployment, the QA Bot agent is ready. To connect the example Next.js UI: ```bash # From your local machine (not the EC2 instance) ASH_SERVER_URL=http://<instance-ip>:4100 pnpm --filter qa-bot dev ``` This starts the QA Bot frontend locally, pointing at your remote Ash server.
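The "waits for healthy" step of the deploy script is easy to reproduce in your own automation. A stdlib-only sketch that polls `GET /health` until the server responds 200 (the function name, timeout, and interval are our choices):

```python
import time
import urllib.request

def wait_for_healthy(url: str, timeout_s: float = 60.0, interval_s: float = 2.0) -> bool:
    """Poll GET {url}/health until it returns 200 or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # server not up yet (connection refused, DNS, timeout); keep polling
        time.sleep(interval_s)
    return False
```

Useful as a gate in CI or a deploy script before pointing traffic at a freshly launched instance.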
## Deploying Your Own Agent SSH into the instance and copy your agent folder to the data directory: ```bash ssh -i ~/.ssh/my-key.pem ubuntu@<instance-ip> # Copy your agent files mkdir -p ~/.ash/agents/my-agent # Place your CLAUDE.md, .claude/ settings, etc. in ~/.ash/agents/my-agent/ # Deploy via API curl -X POST $ASH_SERVER_URL/api/agents \ -H "Content-Type: application/json" \ -d '{"name": "my-agent", "path": "agents/my-agent"}' ``` Alternatively, use the Ash CLI or SDK from your local machine: ```bash export ASH_SERVER_URL=http://<instance-ip>:4100 ash deploy ./my-agent --name my-agent ``` ## Monitoring ### Logs ```bash # From your local machine ssh -i ~/.ssh/my-key.pem ubuntu@<instance-ip> 'docker logs -f ash-server' ``` ### Health Check ```bash curl http://<instance-ip>:4100/health | jq . ``` Returns active session count, sandbox pool stats, and uptime. ### API Docs Open `http://<instance-ip>:4100/docs` in a browser for the Swagger UI. ## Troubleshooting ### "Key file not found" Verify `EC2_KEY_PATH` in your `.env` points to the correct `.pem` file. The script sets permissions to `400` automatically. ### "Instance has no public IP" Your VPC or subnet does not auto-assign public IPs. Either: - Set `EC2_SUBNET_ID` to a public subnet, or - Enable "Auto-assign public IPv4 address" on your subnet in the AWS console. ### "Server did not become healthy within 60 seconds" SSH in and check the Docker logs: ```bash ssh -i ~/.ssh/my-key.pem ubuntu@<instance-ip> docker logs ash-server ``` Common causes: - Missing `ANTHROPIC_API_KEY` -- the server starts but agents cannot execute. - Docker build failed -- check for network issues during `pnpm install`. ### "Setup did not complete within 5 minutes" The user-data script (Docker + Node.js installation) is taking too long. SSH in and check: ```bash ssh -i ~/.ssh/my-key.pem ubuntu@<instance-ip> cat /var/log/cloud-init-output.log ``` ## Tearing Down ```bash ./scripts/teardown-ec2.sh ``` This terminates the EC2 instance, waits for termination to complete, and deletes the security group if the script created it.
The teardown script reads from the `.ec2-instance` state file that was created during deployment. ## Cost Estimate | Resource | Spec | Hourly Cost (us-east-1) | |----------|------|------------------------| | EC2 `t3.large` | 2 vCPU, 8 GB RAM | ~$0.083 | | EBS gp3 | 30 GB | ~$0.003 | | **Total** | | **~$0.086/hour (~$62/month)** | Data transfer costs are additional. Actual costs depend on region and usage patterns. --- # Deploy to Google Cloud Source: https://docs.ash-cloud.ai/self-hosting/gce # Deploy to Google Cloud This guide walks through deploying Ash to a Google Compute Engine (GCE) instance using the included deploy script. The script provisions an Ubuntu VM, installs Docker, builds the Ash image, starts the server, and deploys the example QA Bot agent. ## Prerequisites - **gcloud CLI** installed ([install guide](https://cloud.google.com/sdk/docs/install)) - **GCP project** with Compute Engine API enabled and billing configured - **ANTHROPIC_API_KEY** for agent execution ## Quick Start ```bash # Clone the repo git clone https://github.com/ash-ai-org/ash.git cd ash # Create .env from the example cp .env.example .env # Edit .env with your credentials (see below) # Deploy ./scripts/deploy-gce.sh ``` The script takes 5--8 minutes. When it finishes, it prints the server URL, SSH command, and instructions for connecting the QA Bot UI. ## Configuration Create a `.env` file in the project root with the following variables: ### Required | Variable | Description | |----------|-------------| | `ANTHROPIC_API_KEY` | Your Anthropic API key for agent execution | | `GCP_PROJECT_ID` | Your GCP project ID. Falls back to `gcloud config get-value project` if not set. 
| ### Optional | Variable | Default | Description | |----------|---------|-------------| | `GCP_ZONE` | `us-east1-b` | Compute Engine zone | | `GCP_MACHINE_TYPE` | `e2-standard-2` | Machine type (2 vCPU, 8 GB RAM) | | `GCP_DISK_SIZE` | `30` | Boot disk size in GB (SSD) | | `ASH_PORT` | `4100` | Port to expose the API on | Example `.env`: ```bash ANTHROPIC_API_KEY=sk-ant-api03-... GCP_PROJECT_ID=my-project-123 GCP_ZONE=us-east1-b GCP_MACHINE_TYPE=e2-standard-2 ``` ## GCP Setup from Scratch If you do not have a GCP project configured yet: ```bash # 1. Install gcloud CLI # macOS: brew install --cask google-cloud-sdk # Or download from https://cloud.google.com/sdk/docs/install # 2. Authenticate gcloud auth login # 3. Create a project (or use an existing one) gcloud projects create my-ash-project --name="Ash Server" # 4. Set the project as default gcloud config set project my-ash-project # 5. Enable the Compute Engine API gcloud services enable compute.googleapis.com # 6. Enable billing (required for Compute Engine) # Go to https://console.cloud.google.com/billing and link a billing account # to your project ``` ## What the Deploy Script Does 1. **Ensures a firewall rule** (`allow-ash-api`) exists for port 4100. Creates one if it does not exist. 2. **Creates a Compute Engine instance** (`ash-server`) with Ubuntu 22.04, SSD boot disk, and the `ash-server` network tag. 3. **Runs a startup script** that installs Docker, Node.js 20, pnpm, rsync, and jq. 4. **Waits for SSH** and the startup script to complete (~2 minutes). 5. **Syncs the project** to the instance by creating a tarball and using `gcloud compute scp`. 6. **Builds the Docker image** on the instance (`docker build -t ash-dev .`). This takes 3--5 minutes. 7. **Starts the container** with `--init`, `--privileged`, `--cgroupns=host`, and the volume mount. 8. **Waits for healthy** by polling `GET /health`. 9. **Deploys the qa-bot agent** by copying agent files and calling the API. 
## Using the SDK After deployment, connect from your application: ```typescript const client = new AshClient({ serverUrl: "http://<instance-ip>:4100", apiKey: "<your-api-key>", }); // Create a session const session = await client.createSession({ agentName: "qa-bot" }); // Send a message (SSE streaming) const stream = client.sendMessage(session.id, { message: "What is the capital of France?", }); for await (const event of stream) { if (event.type === "assistant") { process.stdout.write(event.content); } } ``` ## Monitoring ### Logs ```bash gcloud compute ssh ash-server --zone=us-east1-b \ --command='docker logs -f ash-server' ``` ### Health Check ```bash curl http://<instance-ip>:4100/health | jq . ``` ### API Docs Open `http://<instance-ip>:4100/docs` in a browser for the Swagger UI. ## Troubleshooting ### "gcloud: command not found" Install the gcloud CLI: ```bash # macOS brew install --cask google-cloud-sdk # Linux curl https://sdk.cloud.google.com | bash exec -l $SHELL gcloud init ``` ### "Your current active account does not have permission" Re-authenticate: ```bash gcloud auth login gcloud config set project YOUR_PROJECT_ID ``` ### "Compute Engine API has not been enabled" ```bash gcloud services enable compute.googleapis.com ``` This can take a minute to propagate. Wait and retry. ### "Instance has no external IP" The default network configuration includes an external IP. If you are using a custom VPC without external IPs, you need to either add an access config or use Cloud NAT + Identity-Aware Proxy for SSH. ### "Firewall rule blocks traffic" Verify the rule exists and the instance has the correct network tag: ```bash gcloud compute firewall-rules describe allow-ash-api gcloud compute instances describe ash-server --zone=us-east1-b \ --format='get(tags.items)' ``` ## Tearing Down ```bash ./scripts/teardown-gce.sh ``` This deletes the Compute Engine instance and the `allow-ash-api` firewall rule. The teardown script reads from the `.gce-instance` state file created during deployment.
## Cost Estimate | Resource | Spec | Hourly Cost (us-east1) | |----------|------|------------------------| | GCE `e2-standard-2` | 2 vCPU, 8 GB RAM | ~$0.067 | | Boot disk (pd-ssd) | 30 GB | ~$0.005 | | **Total** | | **~$0.072/hour (~$52/month)** | Data transfer costs are additional. Actual costs depend on region and usage patterns. ## EC2 vs GCE Comparison | | AWS EC2 | Google Cloud GCE | |---|---------|-----------------| | **Deploy command** | `./scripts/deploy-ec2.sh` | `./scripts/deploy-gce.sh` | | **Default instance** | `t3.large` (2 vCPU, 8 GB) | `e2-standard-2` (2 vCPU, 8 GB) | | **Default region** | `us-east-1` | `us-east1-b` | | **SSH access** | `ssh -i key.pem ubuntu@IP` | `gcloud compute ssh ash-server` | | **Auth method** | AWS access key + secret | `gcloud auth login` | | **Required credentials** | `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `EC2_KEY_NAME`, `EC2_KEY_PATH` | `GCP_PROJECT_ID` (+ gcloud auth) | | **Estimated cost** | ~$0.086/hour | ~$0.072/hour | | **Teardown** | `./scripts/teardown-ec2.sh` | `./scripts/teardown-gce.sh` | Both scripts produce identical results: a running Ash server with the QA Bot agent deployed. Choose whichever cloud you already have an account with. --- # Configuration Reference Source: https://docs.ash-cloud.ai/self-hosting/configuration # Configuration Reference All Ash configuration is done via environment variables. There are no config files. This page documents every variable the server and runner recognize. ## Server Variables | Variable | Default | Description | |----------|---------|-------------| | `ASH_PORT` | `4100` | Port the HTTP server listens on. | | `ASH_HOST` | `0.0.0.0` | Bind address. Use `127.0.0.1` to restrict to localhost. | | `ASH_DATA_DIR` | `./data` (native) or `/data` (Docker) | Root directory for all persistent state: SQLite database, agent definitions, session workspaces, sandbox working directories. | | `ASH_MODE` | `standalone` | Server mode. `standalone` runs sandboxes locally. 
`coordinator` runs as a pure control plane with no local sandbox pool -- runners must register to provide capacity. See [Multi-Machine Setup](./multi-machine.md). | | `ASH_DATABASE_URL` | (none) | PostgreSQL or CockroachDB connection string. When set, the server uses Postgres instead of SQLite. Format: `postgresql://user:password@host:port/dbname`. | | `ASH_MAX_SANDBOXES` | `1000` | Maximum number of sandbox entries (live + cold) tracked in the database. When this limit is reached, the pool evicts the least-recently-used sandbox. | | `ASH_IDLE_TIMEOUT_MS` | `1800000` (30 min) | How long a sandbox can sit idle (in the `waiting` state) before the idle sweep evicts it. Evicted sandboxes have their workspace persisted and are marked `cold`. | | `ASH_API_KEY` | (auto-generated) | Single-tenant API key for authentication. All API requests (except `/health` and `/docs`) must include `Authorization: Bearer <key>`. If not set, the server auto-generates a key on first start and writes it to `{ASH_DATA_DIR}/initial-api-key`. | | `ASH_SNAPSHOT_URL` | (none) | Cloud storage URL for session workspace snapshots. Enables cross-machine session resume. Format: `s3://bucket/prefix/` or `gs://bucket/prefix/`. Requires the appropriate SDK installed (`@aws-sdk/client-s3` for S3, `@google-cloud/storage` for GCS). | | `ASH_BRIDGE_ENTRY` | (auto-detected) | Absolute path to the bridge process entry point (`packages/bridge/dist/index.js`). Normally auto-detected from the monorepo layout. Override only for custom installations. | | `ASH_DEBUG_TIMING` | `0` | Set to `1` to enable timing instrumentation on the hot path. Logs latency for sandbox creation, bridge connect, message round-trip, and SSE delivery. | | `ANTHROPIC_API_KEY` | (none) | **Required.** Passed into sandbox processes via the environment allowlist. The Claude Agent SDK uses this to authenticate with the Anthropic API. | ## Telemetry Variables These variables enable optional telemetry.
Both systems are zero-overhead when not configured. See [Streaming Telemetry](./telemetry.md) for setup guides and examples. | Variable | Default | Description | |----------|---------|-------------| | `OTEL_EXPORTER_OTLP_ENDPOINT` | (none) | gRPC endpoint for OpenTelemetry trace export (e.g. `http://jaeger:4317`). Tracing is completely disabled when not set. | | `OTEL_SERVICE_NAME` | `ash-coordinator` | Service name in OpenTelemetry traces. Bridge processes default to `ash-bridge`. | | `OTEL_TRACES_SAMPLER` | (none) | Optional OTEL sampling strategy (e.g. `parentbased_traceidratio`). | | `ASH_TELEMETRY_URL` | (none) | HTTP endpoint for streaming event telemetry (session lifecycle, messages, tool calls). When not set and `ASH_CLOUD_URL` is present, auto-configured to send to Ash Cloud. | | `ASH_TELEMETRY_KEY` | (none) | Optional bearer token for authenticating with the telemetry endpoint. When auto-configured for Ash Cloud, defaults to `ASH_API_KEY`. | | `ASH_CLOUD_URL` | (none) | Ash Cloud URL (set automatically by `ash login` + `ash start`). When present and `ASH_TELEMETRY_URL` is not set, the server auto-configures event telemetry to send to `/api/telemetry/ingest`. | ## Runner Variables These variables configure runner processes in [multi-machine mode](./multi-machine.md). | Variable | Default | Description | |----------|---------|-------------| | `ASH_RUNNER_ID` | `runner-` | Unique identifier for this runner. Must be stable across restarts for re-registration to work correctly. | | `ASH_RUNNER_PORT` | `4200` | Port the runner's HTTP server listens on. | | `ASH_RUNNER_HOST` | `0.0.0.0` | Bind address for the runner. | | `ASH_SERVER_URL` | (none) | URL of the coordinator server (e.g., `http://coordinator:4100`). When set, the runner registers itself with the coordinator and begins sending heartbeats. | | `ASH_RUNNER_ADVERTISE_HOST` | (same as `ASH_RUNNER_HOST`) | The hostname or IP the coordinator should use to reach this runner. 
Useful when the runner binds to `0.0.0.0` but needs to advertise a specific IP or hostname to the coordinator. | Runner processes also read `ASH_DATA_DIR`, `ASH_MAX_SANDBOXES`, `ASH_IDLE_TIMEOUT_MS`, `ASH_BRIDGE_ENTRY`, and `ANTHROPIC_API_KEY` with the same semantics as the server. ## Database ### SQLite (Default) SQLite is the default database. It requires zero configuration. The database file is created at `{ASH_DATA_DIR}/ash.db` on first startup. SQLite is configured with: - **WAL mode** for concurrent reads during writes. - **Foreign keys enabled** for referential integrity. - **Automatic migrations** on startup -- schema changes are applied idempotently. SQLite is the right choice for single-machine deployments. It handles hundreds of concurrent sessions without issue. ### PostgreSQL / CockroachDB Set `ASH_DATABASE_URL` to use Postgres or CockroachDB: ```bash # PostgreSQL export ASH_DATABASE_URL="postgresql://ash:password@localhost:5432/ash" # CockroachDB export ASH_DATABASE_URL="postgresql://ash:password@localhost:26257/ash?sslmode=disable" ``` The server auto-detects the database type from the connection string prefix (`postgresql://` or `postgres://`). **Connection retry behavior:** On startup, the server attempts to connect to the database with exponential backoff. It retries up to 5 times with delays of 1s, 2s, 4s, 8s, and 16s (total ~31 seconds). If all attempts fail, the server exits with an error. This is designed for Docker Compose deployments where the database container may start slightly after the server. **Schema migrations** are applied automatically on startup, just like SQLite. Tables and indexes are created with `IF NOT EXISTS` semantics. Use Postgres or CockroachDB when: - You need the database to be on a separate machine from the server. - You are running in coordinator mode with multiple runners and want a shared database. - You want to use your existing database infrastructure for backups, monitoring, and replication.
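The startup retry policy described above is ordinary exponential backoff. A sketch of the same schedule, useful if your own tooling needs to wait for the database alongside the server (the `connect` callable and helper name are ours):

```python
import time

def connect_with_backoff(connect, retries: int = 5, base_delay_s: float = 1.0, sleep=time.sleep):
    """Try once, then retry up to `retries` times with exponentially growing delays
    (1s, 2s, 4s, 8s, 16s by default). Re-raises the last error when exhausted."""
    for attempt in range(retries + 1):
        try:
            return connect()
        except OSError:
            if attempt == retries:
                raise  # all attempts failed; Ash exits with an error at this point
            sleep(base_delay_s * (2 ** attempt))
```

The injectable `sleep` makes the schedule testable without real waiting.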
## Authentication

Ash supports the following authentication modes:

### Auto-Generated Key (Default)

When `ASH_API_KEY` is not set, the server auto-generates a secure API key on first start. The key is stored (hashed) in the database and the plaintext is written to `{ASH_DATA_DIR}/initial-api-key`. The CLI automatically picks up this key via `ash start`. This key appears in the dashboard's **API Keys** page and can be revoked there.

### Explicit API Key

Set `ASH_API_KEY` to use a specific key instead of auto-generating:

```bash
export ASH_API_KEY=my-secret-key
```

All API requests must then include:

```
Authorization: Bearer my-secret-key
```

The environment variable key is shown in the dashboard's **API Keys** page with an `env` badge. To change or remove it, update the environment variable and restart the server.

### Dashboard-Created Keys

You can also create additional API keys directly from the dashboard's **API Keys** page. These keys are stored in the database and can be revoked from the dashboard. Both environment-variable and dashboard-created keys work simultaneously.

For multi-tenant authentication, create API keys via the internal API. Each key is associated with a `tenant_id`, and requests authenticated with that key are scoped to that tenant's agents, sessions, and data.
### Public Endpoints

The following endpoints do not require an API key:

- `GET /health`
- `GET /metrics` (Prometheus metrics)
- `/docs` (Swagger UI)
- `/api/internal/*` (runner registration and heartbeats; protected by the `ASH_INTERNAL_SECRET` bearer token when that variable is set)

## Environment Variable Summary

Here is every variable in one table for quick reference:

| Variable | Default | Component |
|----------|---------|-----------|
| `ANTHROPIC_API_KEY` | -- | Server, Runner |
| `ASH_PORT` | `4100` | Server |
| `ASH_HOST` | `0.0.0.0` | Server |
| `ASH_DATA_DIR` | `./data` | Server, Runner |
| `ASH_MODE` | `standalone` | Server |
| `ASH_DATABASE_URL` | (SQLite) | Server |
| `ASH_MAX_SANDBOXES` | `1000` | Server, Runner |
| `ASH_IDLE_TIMEOUT_MS` | `1800000` | Server, Runner |
| `ASH_API_KEY` | (auto-generated) | Server |
| `ASH_INTERNAL_SECRET` | (none) | Server, Runner |
| `ASH_SNAPSHOT_URL` | (none) | Server, Runner |
| `ASH_BRIDGE_ENTRY` | (auto) | Server, Runner |
| `ASH_DEBUG_TIMING` | `0` | Server, Runner |
| `ASH_RUNNER_ID` | `runner-` | Runner |
| `ASH_RUNNER_PORT` | `4200` | Runner |
| `ASH_RUNNER_HOST` | `0.0.0.0` | Runner |
| `ASH_SERVER_URL` | (none) | Runner |
| `ASH_RUNNER_ADVERTISE_HOST` | (bind host) | Runner |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | (none) | Server, Runner |
| `OTEL_SERVICE_NAME` | `ash-coordinator` | Server, Runner |
| `ASH_TELEMETRY_URL` | (none) | Server |
| `ASH_TELEMETRY_KEY` | (none) | Server |
| `ASH_CLOUD_URL` | (none) | Server |

---

# Multi-Machine Setup

Source: https://docs.ash-cloud.ai/self-hosting/multi-machine

# Multi-Machine Setup

Most deployments do not need multi-machine mode. A single machine running in standalone mode can handle dozens of concurrent sessions. Read this page only when you need more capacity than one machine provides.

## When to Use

Use multi-machine mode when:

- You need more concurrent sessions than a single server can handle.
- You want to isolate sandbox execution from the control plane for reliability.
- You need to scale sandbox capacity independently of the API server.
## Architecture ```mermaid graph TB Client[Client / SDK] -->|HTTP + SSE| Coordinator subgraph Control Plane Coordinator[Ash Server
mode=coordinator
port 4100] DB[(Database
Postgres / CockroachDB)] Coordinator --> DB end subgraph Runner Node 1 R1[Ash Runner
port 4200] R1 --> S1[Sandbox Pool] S1 --> B1[Bridge 1] S1 --> B2[Bridge 2] end subgraph Runner Node 2 R2[Ash Runner
port 4200] R2 --> S2[Sandbox Pool] S2 --> B3[Bridge 3] S2 --> B4[Bridge 4] end Coordinator -->|HTTP| R1 Coordinator -->|HTTP| R2 R1 -->|Heartbeat| Coordinator R2 -->|Heartbeat| Coordinator ``` - **Coordinator** (the Ash server in `coordinator` mode): handles all client-facing HTTP traffic, manages the database, routes session creation to runners, and proxies messages to the correct runner. - **Runners**: each runner manages a local `SandboxPool`, creates sandbox processes, and communicates with the bridge inside each sandbox over Unix sockets. Runners do not serve external traffic directly. ## Standalone Mode (Default) In standalone mode (`ASH_MODE=standalone`), the server creates a local `SandboxPool` and handles everything in one process. This is the default and the right choice for single-machine deployments. ``` Client --> Ash Server (standalone) --> SandboxPool --> Bridge --> Claude SDK ``` No runners are needed. The server is both control plane and execution plane. ## Coordinator Mode In coordinator mode (`ASH_MODE=coordinator`), the server does not create a local sandbox pool. Instead, it waits for runners to register and provides capacity through them. ### Starting the Coordinator ```bash export ASH_MODE=coordinator export ASH_DATABASE_URL="postgresql://ash:password@db-host:5432/ash" export ASH_API_KEY=my-api-key export ASH_INTERNAL_SECRET=my-runner-secret # Required: authenticates runner registration export ANTHROPIC_API_KEY=sk-ant-... node packages/server/dist/index.js # Or via Docker: # ash start -e ASH_MODE=coordinator -e ASH_DATABASE_URL=... -e ASH_INTERNAL_SECRET=... ``` The coordinator logs: ``` Ash server listening on 0.0.0.0:4100 (mode: coordinator) ``` At this point, the server accepts API requests but cannot create sessions until at least one runner registers. 
### Starting a Runner On each runner machine: ```bash export ASH_RUNNER_ID=runner-1 export ASH_SERVER_URL=http://coordinator-host:4100 export ASH_RUNNER_PORT=4200 export ASH_RUNNER_ADVERTISE_HOST=10.0.1.5 # IP the coordinator can reach export ASH_MAX_SANDBOXES=50 export ASH_INTERNAL_SECRET=my-runner-secret # Must match coordinator export ANTHROPIC_API_KEY=sk-ant-... node packages/runner/dist/index.js ``` The runner: 1. Creates a `SandboxPool` with a lightweight in-memory database for sandbox tracking. 2. Starts a Fastify HTTP server on port 4200. 3. Sends a registration request to `ASH_SERVER_URL/api/internal/runners/register` (with exponential backoff retry on failure). 4. Begins sending heartbeats every 10 seconds to `ASH_SERVER_URL/api/internal/runners/heartbeat`, including pool stats (running, warming, waiting counts). The coordinator logs: ``` [coordinator] Runner runner-1 registered at 10.0.1.5:4200 (max 50) ``` ## Session Routing When a client creates a session, the coordinator selects a runner using **least-loaded routing**: it picks the runner with the most available capacity (max sandboxes minus running and warming sandboxes). If no remote runners are healthy, the coordinator falls back to the local backend (if running in standalone mode). In pure coordinator mode with no healthy runners, session creation fails with an error. Once a session is assigned to a runner, all subsequent messages for that session are routed to the same runner. The `runner_id` is stored in the session record in the database. ## Failure Handling ### Graceful Shutdown When a runner shuts down cleanly (receives `SIGTERM`), it calls `POST /api/internal/runners/deregister`. The coordinator immediately bulk-pauses all active sessions on that runner in a single query and removes it from the registry. No 30-second wait. ### Runner Crashes If a runner crashes without deregistering, the coordinator detects it via missed heartbeats. 
After 30 seconds without a heartbeat (`RUNNER_LIVENESS_TIMEOUT_MS`), the coordinator: 1. Bulk-pauses all active/starting sessions on that runner (single `UPDATE` query). 2. Removes the runner from the routing table. 3. Purges stale entries from its local backend cache. Each coordinator adds random 0-5s jitter to its sweep interval to prevent thundering herd when multiple coordinators sweep independently. Paused sessions can be resumed later. If `ASH_SNAPSHOT_URL` is configured, the runner persists workspace state to cloud storage before eviction, enabling resume on a different runner. ### Runner Comes Back If a runner restarts with the same `ASH_RUNNER_ID`, it re-registers with the coordinator. The coordinator updates the connection info (host, port) and resumes routing new sessions to it. Existing sessions that were paused when the runner died are **not** automatically resumed. The client must explicitly resume them via `POST /api/sessions/:id/resume`. ### No Runners Available If all runners are dead or at capacity, session creation returns an HTTP 503 error: ```json { "error": "No runners available and no local backend configured" } ``` ## Example: Two Runners with Docker Compose ```yaml version: "3.8" services: db: image: postgres:16 environment: POSTGRES_USER: ash POSTGRES_PASSWORD: ash POSTGRES_DB: ash volumes: - pgdata:/var/lib/postgresql/data healthcheck: test: ["CMD-SHELL", "pg_isready -U ash"] interval: 5s timeout: 5s retries: 5 coordinator: image: ghcr.io/ash-ai/ash:0.1.0 init: true ports: - "4100:4100" environment: - ASH_MODE=coordinator - ASH_DATABASE_URL=postgresql://ash:ash@db:5432/ash - ASH_API_KEY=${ASH_API_KEY} - ASH_INTERNAL_SECRET=${ASH_INTERNAL_SECRET} - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY} depends_on: db: condition: service_healthy runner-1: image: ghcr.io/ash-ai/ash:0.1.0 init: true privileged: true command: ["node", "packages/runner/dist/index.js"] volumes: - runner1-data:/data environment: - ASH_RUNNER_ID=runner-1 - 
ASH_SERVER_URL=http://coordinator:4100 - ASH_RUNNER_ADVERTISE_HOST=runner-1 - ASH_MAX_SANDBOXES=50 - ASH_INTERNAL_SECRET=${ASH_INTERNAL_SECRET} - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY} depends_on: - coordinator runner-2: image: ghcr.io/ash-ai/ash:0.1.0 init: true privileged: true command: ["node", "packages/runner/dist/index.js"] volumes: - runner2-data:/data environment: - ASH_RUNNER_ID=runner-2 - ASH_SERVER_URL=http://coordinator:4100 - ASH_RUNNER_ADVERTISE_HOST=runner-2 - ASH_MAX_SANDBOXES=50 - ASH_INTERNAL_SECRET=${ASH_INTERNAL_SECRET} - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY} depends_on: - coordinator volumes: pgdata: runner1-data: runner2-data: ``` ```bash export ANTHROPIC_API_KEY=sk-ant-... export ASH_API_KEY=my-production-key export ASH_INTERNAL_SECRET=my-runner-secret docker compose up -d ``` ## Multi-Coordinator (High Availability) For production deployments that need control plane redundancy or handle more than ~10,000 concurrent SSE connections, you can run multiple coordinators behind a load balancer. Multi-coordinator requires a shared database. All coordinators must point to the same Postgres or CockroachDB instance via `ASH_DATABASE_URL`. SQLite does not support multi-coordinator. ### Architecture ```mermaid graph TB Client["Client / SDK"] -->|HTTPS| LB["Load Balancer
(nginx, ALB, etc.)"] subgraph Coordinators C1["Coordinator 1
:4100"] C2["Coordinator 2
:4100"] end LB --> C1 LB --> C2 C1 & C2 --> DB[("CockroachDB")] C1 & C2 -->|HTTP| R1["Runner 1"] C1 & C2 -->|HTTP| R2["Runner 2"] ``` Coordinators are stateless — all routing decisions come from the database. Any coordinator can route to any runner. ### Starting Multiple Coordinators ```bash # Coordinator 1 ASH_MODE=coordinator \ ASH_DATABASE_URL="postgresql://ash:password@db-host:5432/ash" \ ASH_API_KEY=my-api-key \ ASH_INTERNAL_SECRET=my-runner-secret \ ANTHROPIC_API_KEY=sk-ant-... \ ASH_PORT=4100 \ node packages/server/dist/index.js # Coordinator 2 (same config, different host) ASH_MODE=coordinator \ ASH_DATABASE_URL="postgresql://ash:password@db-host:5432/ash" \ ASH_API_KEY=my-api-key \ ASH_INTERNAL_SECRET=my-runner-secret \ ANTHROPIC_API_KEY=sk-ant-... \ ASH_PORT=4100 \ node packages/server/dist/index.js ``` Each coordinator logs its unique ID (`hostname-PID`) on startup for debugging: ``` Ash server listening on 0.0.0.0:4100 (mode: coordinator, id: ip-10-0-1-5-12345) ``` ### Load Balancer Configuration ```nginx upstream ash_coordinators { server coordinator-1:4100; server coordinator-2:4100; } server { listen 443 ssl; location / { proxy_pass http://ash_coordinators; proxy_http_version 1.1; proxy_set_header Connection ''; # Required for SSE proxy_buffering off; # Required for SSE proxy_read_timeout 86400s; # Long-lived SSE streams } } ``` - **No sticky sessions needed.** Any coordinator can handle any request. - **Health check:** `GET /health` on each coordinator. - **SSE failover:** If a coordinator dies mid-stream, the client's SSE auto-reconnects through the load balancer to a different coordinator. The new coordinator looks up the session in the shared database and re-establishes the proxy to the runner. No session migration needed. 
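Because all routing state lives in the shared database, any coordinator can perform runner selection. The least-loaded rule combined with the 30-second liveness window can be sketched like this (field and function names are assumptions for illustration; the real implementation is a SQL query against the `runners` table):

```typescript
// Illustrative runner record; field names are assumptions for this sketch.
interface RunnerRow {
  id: string;
  maxSandboxes: number;
  activeCount: number;
  warmingCount: number;
  lastHeartbeatAt: number; // epoch ms of the last heartbeat
}

const LIVENESS_TIMEOUT_MS = 30_000;

// Pick the live runner with the most available capacity
// (max sandboxes minus running and warming), or null if none qualifies.
function selectRunner(runners: RunnerRow[], now: number): RunnerRow | null {
  let best: RunnerRow | null = null;
  let bestCapacity = 0;
  for (const r of runners) {
    if (now - r.lastHeartbeatAt > LIVENESS_TIMEOUT_MS) continue; // stale heartbeat
    const capacity = r.maxSandboxes - r.activeCount - r.warmingCount;
    if (capacity > bestCapacity) {
      best = r;
      bestCapacity = capacity;
    }
  }
  return best; // null when no runner is live or all are at capacity
}
```

Returning `null` corresponds to the HTTP 503 ("No runners available") case described earlier.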
### Runners with Multi-Coordinator

Runners register with the load balancer URL (not a specific coordinator):

```bash
ASH_SERVER_URL=http://load-balancer:4100 \
ASH_RUNNER_ID=runner-1 \
ASH_INTERNAL_SECRET=my-runner-secret \
node packages/runner/dist/index.js
```

Heartbeats go through the load balancer. Any coordinator that receives a heartbeat writes it to the shared database, where all other coordinators can see it.

### How It Works

The runner registry lives in the `runners` table in the shared database. All coordinators read and write to this table:

- **Registration:** Runner sends `POST /api/internal/runners/register` → coordinator upserts into `runners` table (with retry and exponential backoff)
- **Heartbeat:** Runner sends `POST /api/internal/runners/heartbeat` → coordinator updates `active_count`, `warming_count`, `last_heartbeat_at`
- **Deregistration:** Runner sends `POST /api/internal/runners/deregister` on graceful shutdown → coordinator bulk-pauses sessions and deletes the runner in one pass
- **Selection:** On session creation, coordinator queries `SELECT ... FROM runners ORDER BY available_capacity DESC LIMIT 1`
- **Liveness sweep:** All coordinators run the sweep independently (every 30s + random 0-5s jitter). Operations are idempotent — multiple coordinators marking the same dead runner's sessions as paused is harmless.
- **Auth:** When `ASH_INTERNAL_SECRET` is set, all `/api/internal/*` endpoints require `Authorization: Bearer <ASH_INTERNAL_SECRET>`.

For more details on the scaling architecture, see [Scaling Architecture](/architecture/scaling).

## Limitations

- **Cross-runner resume requires cloud persistence.** Without `ASH_SNAPSHOT_URL`, a session paused on runner-1 cannot be resumed on runner-2 because the workspace state is local to runner-1's filesystem. Configure S3 or GCS snapshots for cross-runner resume.
- **No automatic session migration.** If a runner is overloaded, existing sessions are not moved to a less-loaded runner.
Only new sessions benefit from load-based routing.
- **Runner state is in-memory.** Each runner uses an in-memory database for sandbox tracking (not SQLite). If a runner crashes, its sandbox tracking is lost. On restart, it re-registers with fresh state. The coordinator detects the gap via missed heartbeats and pauses affected sessions.

---

# API Overview

Source: https://docs.ash-cloud.ai/api/overview

# API Overview

The Ash REST API is the primary interface for deploying agents, managing sessions, and sending messages. All endpoints are served by the Ash server process.

## Base URL

```
http://localhost:4100
```

The port is configurable via the `ASH_PORT` environment variable (default: `4100`). The host is configurable via `ASH_HOST` (default: `0.0.0.0`).

## Authentication

API requests are authenticated using Bearer tokens in the `Authorization` header:

```
Authorization: Bearer <your-api-key>
```

Authentication behavior depends on server configuration:

| Configuration | Behavior |
|---|---|
| `ASH_API_KEY` set | Single-tenant mode. The Bearer token must match `ASH_API_KEY`. |
| `ASH_API_KEY` not set (auto-generated) | The server auto-generates a key on first start. The CLI picks it up automatically. |
| API keys in database | Multi-tenant mode. Bearer token is hashed and looked up in the `api_keys` table. Each key maps to a tenant. |

Public endpoints (`/health`, `/docs/*`, `/metrics`) do not require authentication.
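In client code, every authenticated call just needs the Bearer header. A minimal sketch (the helper is illustrative, not part of the SDK):

```typescript
// Sketch: build headers for an authenticated Ash API request.
function ashHeaders(apiKey: string): Record<string, string> {
  return {
    "Content-Type": "application/json",
    Authorization: `Bearer ${apiKey}`,
  };
}

// Example usage (assumes a server at the default base URL):
// const res = await fetch("http://localhost:4100/api/agents", {
//   headers: ashHeaders(process.env.ASH_API_KEY ?? ""),
// });
```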
## Content Types

| Direction | Content-Type |
|---|---|
| Request bodies | `application/json` |
| Most responses | `application/json` |
| Message streaming | `text/event-stream` (SSE) |
| Prometheus metrics | `text/plain; version=0.0.4; charset=utf-8` |

## Error Format

All error responses use a consistent JSON structure:

```json
{
  "error": "Human-readable error message",
  "statusCode": 400
}
```

## Common Status Codes

| Code | Meaning |
|---|---|
| `200` | Success |
| `201` | Resource created |
| `400` | Bad request (missing required fields, invalid state transition) |
| `401` | Unauthorized (missing or invalid API key) |
| `404` | Resource not found |
| `410` | Gone (session has ended and cannot be resumed) |
| `500` | Internal server error |
| `503` | Service unavailable (sandbox capacity reached, no runners available) |

## Interactive API Docs

The server ships with built-in Swagger UI and an OpenAPI specification.

| Resource | URL |
|---|---|
| Swagger UI | [http://localhost:4100/docs](http://localhost:4100/docs) |
| OpenAPI spec (JSON) | [http://localhost:4100/docs/json](http://localhost:4100/docs/json) |

The Swagger UI provides interactive request builders for every endpoint, making it useful for exploration and debugging.

## TypeScript Types

If you are using the TypeScript SDK, all request and response types are available as imports. The shared type definitions used by both client and server come from the `@ash-ai/shared` package:

```typescript
import type {
  Agent,
  Session,
  SessionStatus,
  PoolStats,
  HealthResponse,
  ApiError,
  FileEntry,
  ListFilesResponse,
  GetFileResponse,
  AshStreamEvent,
} from '@ash-ai/shared';
```

---

# Agents

Source: https://docs.ash-cloud.ai/api/agents

# Agents

Agents are the deployable units in Ash. An agent is a directory on disk that contains a `CLAUDE.md` file and optional configuration. Deploying an agent registers it with the server so sessions can be created against it.
Deploying the same agent name again performs an upsert: the path is updated and the version is incremented. ## Agent Type ```typescript interface Agent { id: string; // UUID name: string; // Unique agent name tenantId: string; // Tenant that owns this agent version: number; // Auto-incremented on each deploy path: string; // Absolute path to agent directory on server createdAt: string; // ISO 8601 timestamp updatedAt: string; // ISO 8601 timestamp } ``` --- ## Deploy Agent ``` POST /api/agents ``` Registers or updates an agent. The agent directory must contain a `CLAUDE.md` file. If an agent with the same name already exists for this tenant, it is updated (upserted) and its version is incremented. Relative paths are resolved against the server's data directory. ### Request ```json { "name": "qa-bot", "path": "/home/user/agents/qa-bot" } ``` | Field | Type | Required | Description | |---|---|---|---| | `name` | string | Yes | Unique name for the agent | | `path` | string | Yes | Path to the agent directory (must contain `CLAUDE.md`) | ### Response **201 Created** ```json { "agent": { "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890", "name": "qa-bot", "tenantId": "default", "version": 1, "path": "/home/user/agents/qa-bot", "createdAt": "2025-06-15T10:30:00.000Z", "updatedAt": "2025-06-15T10:30:00.000Z" } } ``` ### Errors | Status | Condition | |---|---| | `400` | Missing `name` or `path`, or directory does not contain `CLAUDE.md` | ```json { "error": "Agent directory must contain CLAUDE.md", "statusCode": 400 } ``` --- ## List Agents ``` GET /api/agents ``` Returns all agents belonging to the authenticated tenant. ### Request No request body. No query parameters. 
### Response **200 OK** ```json { "agents": [ { "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890", "name": "qa-bot", "tenantId": "default", "version": 2, "path": "/home/user/agents/qa-bot", "createdAt": "2025-06-15T10:30:00.000Z", "updatedAt": "2025-06-16T14:00:00.000Z" }, { "id": "b2c3d4e5-f6a7-8901-bcde-f12345678901", "name": "code-reviewer", "tenantId": "default", "version": 1, "path": "/home/user/agents/code-reviewer", "createdAt": "2025-06-16T09:00:00.000Z", "updatedAt": "2025-06-16T09:00:00.000Z" } ] } ``` --- ## Get Agent ``` GET /api/agents/:name ``` Returns a single agent by name. ### Path Parameters | Parameter | Type | Description | |---|---|---| | `name` | string | Agent name | ### Response **200 OK** ```json { "agent": { "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890", "name": "qa-bot", "tenantId": "default", "version": 2, "path": "/home/user/agents/qa-bot", "createdAt": "2025-06-15T10:30:00.000Z", "updatedAt": "2025-06-16T14:00:00.000Z" } } ``` ### Errors | Status | Condition | |---|---| | `404` | Agent with the given name does not exist for this tenant | ```json { "error": "Agent not found", "statusCode": 404 } ``` --- ## Delete Agent ``` DELETE /api/agents/:name ``` Removes an agent registration. This does not terminate any active sessions using this agent. ### Path Parameters | Parameter | Type | Description | |---|---|---| | `name` | string | Agent name | ### Response **200 OK** ```json { "ok": true } ``` ### Errors | Status | Condition | |---|---| | `404` | Agent with the given name does not exist for this tenant | ```json { "error": "Agent not found", "statusCode": 404 } ``` --- # Sessions Source: https://docs.ash-cloud.ai/api/sessions # Sessions A session represents an ongoing conversation with a deployed agent. Each session runs inside an isolated sandbox with its own filesystem, process tree, and environment. Sessions have a lifecycle: they are created, become active, can be paused and resumed, and eventually end. 
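The lifecycle just described can be sketched as a transition map (a sketch inferred from the pause, resume, and delete semantics documented on this page; not the server's actual implementation):

```typescript
type SessionStatus = "starting" | "active" | "paused" | "ended" | "error";

// Allowed transitions, inferred from the endpoint docs:
// pause: active -> paused; resume: starting/paused/error -> active;
// delete: any non-ended status -> ended. "ended" is terminal.
const TRANSITIONS: Record<SessionStatus, SessionStatus[]> = {
  starting: ["active", "error", "ended"],
  active: ["paused", "error", "ended"],
  paused: ["active", "ended"],
  error: ["active", "ended"],
  ended: [],
};

function canTransition(from: SessionStatus, to: SessionStatus): boolean {
  return TRANSITIONS[from].includes(to);
}
```

Attempting a disallowed transition through the API returns `400` (invalid state) or `410` (session ended), as documented under each endpoint.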
## Session Type ```typescript interface Session { id: string; // UUID tenantId: string; // Tenant that owns this session agentName: string; // Name of the agent this session runs sandboxId: string; // ID of the sandbox process status: SessionStatus; // Current lifecycle state model: string | null; // Model override for this session (null = use agent default) runnerId: string | null; // Runner hosting the sandbox (null in standalone mode) createdAt: string; // ISO 8601 timestamp lastActiveAt: string; // ISO 8601 timestamp, updated on each message } type SessionStatus = 'starting' | 'active' | 'paused' | 'ended' | 'error'; ``` ### Session Status Transitions ``` starting --> active --> paused --> active (resume) \ \--> ended (delete) \--> ended (delete) \--> error --> active (resume) \--> ended (delete) ``` --- ## Create Session ``` POST /api/sessions ``` Creates a new session for the specified agent. The server allocates a sandbox, copies the agent directory into it, and starts the bridge process. The session is returned in `active` status once the sandbox is ready. ### Request ```json { "agent": "qa-bot", "model": "claude-opus-4-6" } ``` | Field | Type | Required | Description | |---|---|---|---| | `agent` | string | Yes | Name of a previously deployed agent | | `model` | string | No | Model to use for this session. Overrides the agent's default model. Any valid model identifier accepted (e.g. `claude-sonnet-4-5-20250929`, `claude-opus-4-6`). | | `mcpServers` | object | No | Per-session MCP servers. Merged into the agent's `.mcp.json` — session entries override agent entries with the same key. See [Per-Session MCP Servers](#per-session-mcp-servers). | | `systemPrompt` | string | No | System prompt override. Replaces the agent's `CLAUDE.md` for this session only. | | `credentialId` | string | No | Credential ID to inject into sandbox env. | | `extraEnv` | object | No | Extra env vars to inject into the sandbox (merged with credential env). 
| ### Response **201 Created** ```json { "session": { "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "tenantId": "default", "agentName": "qa-bot", "sandboxId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "status": "active", "model": "claude-opus-4-6", "runnerId": null, "createdAt": "2025-06-15T10:30:00.000Z", "lastActiveAt": "2025-06-15T10:30:00.000Z" } } ``` ### Errors | Status | Condition | |---|---| | `400` | Missing `agent` field | | `404` | Agent not found | | `500` | Sandbox creation failed | | `503` | Sandbox capacity reached or no runners available | --- ## List Sessions ``` GET /api/sessions ``` Returns all sessions for the authenticated tenant. Optionally filter by agent name. ### Query Parameters | Parameter | Type | Required | Description | |---|---|---|---| | `agent` | string | No | Filter sessions by agent name | ### Response **200 OK** ```json { "sessions": [ { "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "tenantId": "default", "agentName": "qa-bot", "sandboxId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "status": "active", "model": "claude-opus-4-6", "runnerId": null, "createdAt": "2025-06-15T10:30:00.000Z", "lastActiveAt": "2025-06-15T10:35:00.000Z" }, { "id": "c9bf9e57-1685-4c89-bafb-ff5af830be8a", "tenantId": "default", "agentName": "code-reviewer", "sandboxId": "c9bf9e57-1685-4c89-bafb-ff5af830be8a", "status": "paused", "model": null, "runnerId": null, "createdAt": "2025-06-15T09:00:00.000Z", "lastActiveAt": "2025-06-15T09:15:00.000Z" } ] } ``` ### Example: Filter by Agent ``` GET /api/sessions?agent=qa-bot ``` --- ## Get Session ``` GET /api/sessions/:id ``` Returns a single session by ID. 
### Path Parameters | Parameter | Type | Description | |---|---|---| | `id` | string (UUID) | Session ID | ### Response **200 OK** ```json { "session": { "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "tenantId": "default", "agentName": "qa-bot", "sandboxId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "status": "active", "model": "claude-opus-4-6", "runnerId": null, "createdAt": "2025-06-15T10:30:00.000Z", "lastActiveAt": "2025-06-15T10:35:00.000Z" } } ``` ### Errors | Status | Condition | |---|---| | `404` | Session not found | --- ## Pause Session ``` POST /api/sessions/:id/pause ``` Pauses an active session. The sandbox state is persisted so the session can be resumed later. Only sessions with status `active` can be paused. ### Path Parameters | Parameter | Type | Description | |---|---|---| | `id` | string (UUID) | Session ID | ### Request No request body. ### Response **200 OK** ```json { "session": { "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "tenantId": "default", "agentName": "qa-bot", "sandboxId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "status": "paused", "model": "claude-opus-4-6", "runnerId": null, "createdAt": "2025-06-15T10:30:00.000Z", "lastActiveAt": "2025-06-15T10:35:00.000Z" } } ``` ### Errors | Status | Condition | |---|---| | `400` | Session is not in `active` status | | `404` | Session not found | ```json { "error": "Cannot pause session with status \"paused\"", "statusCode": 400 } ``` --- ## Resume Session ``` POST /api/sessions/:id/resume ``` Resumes a paused, errored, or starting session. The server attempts two resume paths: 1. **Warm resume** -- If the original sandbox is still alive on the same runner, the session is reactivated immediately with no overhead. 2. **Cold resume** -- If the sandbox has been evicted or the runner is gone, a new sandbox is created. Workspace state is restored from a local snapshot or cloud storage if available. Sessions with status `active` are returned as-is (no-op). 
Sessions with status `ended` cannot be resumed. ### Path Parameters | Parameter | Type | Description | |---|---|---| | `id` | string (UUID) | Session ID | ### Request No request body. ### Response **200 OK** ```json { "session": { "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "tenantId": "default", "agentName": "qa-bot", "sandboxId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "status": "active", "model": "claude-opus-4-6", "runnerId": null, "createdAt": "2025-06-15T10:30:00.000Z", "lastActiveAt": "2025-06-15T10:35:00.000Z" } } ``` ### Errors | Status | Condition | |---|---| | `404` | Session or agent not found | | `410` | Session has ended -- create a new session instead | | `500` | Failed to create a new sandbox for cold resume | | `503` | Sandbox capacity reached or no runners available | ```json { "error": "Session has ended \u2014 create a new session", "statusCode": 410 } ``` --- ## End Session ``` DELETE /api/sessions/:id ``` Ends a session. The sandbox state is persisted and the sandbox process is destroyed. Ended sessions cannot be resumed. ### Path Parameters | Parameter | Type | Description | |---|---|---| | `id` | string (UUID) | Session ID | ### Request No request body. ### Response **200 OK** ```json { "session": { "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "tenantId": "default", "agentName": "qa-bot", "sandboxId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "status": "ended", "model": "claude-opus-4-6", "runnerId": null, "createdAt": "2025-06-15T10:30:00.000Z", "lastActiveAt": "2025-06-15T10:35:00.000Z" } } ``` ### Errors | Status | Condition | |---|---| | `404` | Session not found | --- ## Per-Session MCP Servers The `mcpServers` field on `POST /api/sessions` lets you inject MCP servers at session creation time. This enables the **sidecar pattern**: your host application exposes tools as MCP endpoints, and each session connects to a tenant-scoped URL. Session-level servers are merged into the agent's `.mcp.json`. 
If both the agent and the session define a server with the same key, the session entry wins. ### Example: Sidecar Pattern Your host application runs an MCP server that provides tenant-specific tools: ```json { "agent": "support-bot", "mcpServers": { "customer-tools": { "url": "http://host-app:8000/mcp?tenant=t_abc123" } } } ``` The agent's `.mcp.json` might already define shared MCP servers (e.g. `fetch`). The session adds `customer-tools` on top of those. ### McpServerConfig Each MCP server entry supports: | Field | Type | Description | |---|---|---| | `url` | string | Remote MCP server URL (HTTP/SSE transport). Mutually exclusive with `command`. | | `command` | string | Command to spawn a stdio MCP server. Mutually exclusive with `url`. | | `args` | string[] | Arguments for the command. | | `env` | object | Environment variables for the MCP server process. | --- ## Per-Session System Prompt The `systemPrompt` field on `POST /api/sessions` replaces the agent's `CLAUDE.md` for that session. This is useful when the same agent definition needs different instructions per tenant or per use case. ```json { "agent": "support-bot", "systemPrompt": "You are a support agent for Acme Corp tenant t_abc123. Use the customer-tools MCP server to look up their account." } ``` The agent's original `CLAUDE.md` is not modified — only the sandbox workspace copy is overwritten before the bridge starts. --- # Messages Source: https://docs.ash-cloud.ai/api/messages # Messages Messages are how you interact with an agent inside a session. You send a text prompt and receive a stream of Server-Sent Events (SSE) containing the agent's response, including tool use, intermediate results, and the final answer. --- ## Send Message ``` POST /api/sessions/:id/messages ``` Sends a message to the agent running in the specified session. The response is an SSE stream. The session must be in `active` status. 
### Path Parameters | Parameter | Type | Description | |---|---|---| | `id` | string (UUID) | Session ID | ### Request ```json { "content": "What files are in the current directory?", "includePartialMessages": false, "model": "claude-opus-4-6" } ``` | Field | Type | Required | Description | |---|---|---|---| | `content` | string | Yes | The message text to send to the agent | | `includePartialMessages` | boolean | No | When `true`, the stream includes incremental `stream_event` messages with raw API deltas in addition to complete messages. Useful for building real-time streaming UIs. Default: `false`. | | `model` | string | No | Model override for this specific message. Takes precedence over the session-level and agent-level model. Any valid model identifier accepted. | ### Response The response uses `Content-Type: text/event-stream`. The HTTP status is `200` and the body is a stream of SSE frames. #### SSE Event Types The stream contains three event types: `message`, `error`, and `done`. **`message` event** -- An SDK Message object from the Claude Code agent. The shape varies depending on the message type (assistant response, tool use, tool result, etc.). These are passed through from the SDK without transformation. ``` event: message data: {"type":"assistant","message":{"id":"msg_01X...","type":"message","role":"assistant","content":[{"type":"text","text":"The current directory contains the following files:\n\n- src/\n- package.json\n- README.md"}],"model":"claude-sonnet-4-20250514","stop_reason":"end_turn"}} ``` **`error` event** -- An error occurred during message processing. ``` event: error data: {"error":"Bridge connection lost"} ``` **`done` event** -- The agent has finished processing the message. This is always the last event in the stream. 
``` event: done data: {"sessionId":"f47ac10b-58cc-4372-a567-0e02b2c3d479"} ``` ### Pre-Stream Errors If validation fails before the stream starts, the server returns a standard JSON error response (not SSE): | Status | Condition | |---|---| | `400` | Session is not in `active` status | | `404` | Session not found | | `500` | Runner not available or sandbox not found | ```json { "error": "Session is paused", "statusCode": 400 } ``` ### Connection Lifecycle 1. Client sends `POST /api/sessions/:id/messages` with JSON body. 2. Server validates the session and sandbox, then responds with `200` and `Content-Type: text/event-stream`. 3. Server streams `message` events as the agent works (tool calls, text responses, etc.). 4. If an error occurs mid-stream, the server sends an `error` event. 5. The stream ends with a `done` event, then the connection closes. ### Backpressure The server applies backpressure on the SSE stream. If the client stops reading and the kernel TCP send buffer fills up, the server waits up to 30 seconds for the buffer to drain. If the client remains unresponsive after 30 seconds, the server closes the connection. ### Consuming the Stream #### curl ```bash curl -N -X POST $ASH_SERVER_URL/api/sessions/SESSION_ID/messages \ -H "Content-Type: application/json" \ -H "Authorization: Bearer YOUR_API_KEY" \ -d '{"content": "Hello, what can you do?"}' ``` #### JavaScript (EventSource-like) ```javascript const response = await fetch( `http://localhost:4100/api/sessions/${sessionId}/messages`, { method: 'POST', headers: { 'Content-Type': 'application/json', 'Authorization': 'Bearer YOUR_API_KEY', }, body: JSON.stringify({ content: 'Hello, what can you do?' 
}), } ); const reader = response.body.getReader(); const decoder = new TextDecoder(); let buffer = ''; let eventType = ''; while (true) { const { done, value } = await reader.read(); if (done) break; buffer += decoder.decode(value, { stream: true }); const lines = buffer.split('\n'); buffer = lines.pop() || ''; for (const line of lines) { if (line.startsWith('event: ')) { eventType = line.slice(7); } else if (line.startsWith('data: ')) { const data = JSON.parse(line.slice(6)); if (eventType === 'message') { console.log('Message:', data); } else if (eventType === 'error') { console.error('Error:', data.error); } else if (eventType === 'done') { console.log('Done:', data.sessionId); } } } } ``` Note that `eventType` is declared outside the read loop so an `event:` line and its `data:` line stay paired even when a chunk boundary falls between them. --- ## List Messages ``` GET /api/sessions/:id/messages ``` Returns persisted messages for a session. Messages are stored after each completed turn. User messages and complete assistant/result messages are persisted; partial streaming events are not. ### Path Parameters | Parameter | Type | Description | |---|---|---| | `id` | string (UUID) | Session ID | ### Query Parameters | Parameter | Type | Default | Description | |---|---|---|---| | `limit` | integer | 100 | Maximum number of messages to return (1--1000) | | `after` | integer | 0 | Return messages with sequence number greater than this value | ### Response **200 OK** ```json { "messages": [ { "id": "d290f1ee-6c54-4b01-90e6-d701748f0851", "sessionId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "tenantId": "default", "role": "user", "content": "{\"type\":\"user\",\"content\":\"What files are in the current directory?\"}", "sequence": 1, "createdAt": "2025-06-15T10:31:00.000Z" }, { "id": "e391f2ff-7d65-5c12-a1f7-e812859f1962", "sessionId": "f47ac10b-58cc-4372-a567-0e02b2c3d479", "tenantId": "default", "role": "assistant", "content": "{\"type\":\"assistant\",\"message\":{\"content\":[{\"type\":\"text\",\"text\":\"Here are the files...\"}]}}", "sequence": 2, "createdAt": "2025-06-15T10:31:05.000Z" } ] } ``` The `content` field
is a JSON-encoded string containing the raw SDK message. Parse it to access the full message structure. ### Errors | Status | Condition | |---|---| | `404` | Session not found | --- # Files Source: https://docs.ash-cloud.ai/api/files # Files The Files API provides read access to files in a session's workspace. Each session has an isolated workspace directory where the agent operates. You can list all files and download individual files. Files are resolved from the live sandbox when the session is active. If the sandbox has been evicted (session paused or ended), the server falls back to the most recent persisted snapshot. The `source` field (or `X-Ash-Source` header) in each response indicates which one was used. --- ## List Files ``` GET /api/sessions/:id/files ``` Returns a list of all files in the session's workspace, recursively. Certain directories and file types are excluded automatically: `node_modules`, `.git`, `__pycache__`, `.cache`, `.npm`, `.pnpm-store`, `.yarn`, `.venv`, `venv`, `.tmp`, `tmp`, and files with `.sock`, `.lock`, or `.pid` extensions. 
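These exclusion rules can be mirrored client-side, for example to apply the same filter when syncing a workspace copy locally. A minimal sketch in Python; the `is_excluded` helper is illustrative and not part of the Ash API:

```python
# Sketch of the documented workspace-listing filter. The directory and
# extension sets mirror the exclusions listed above; the helper itself
# is illustrative and not part of Ash.
EXCLUDED_DIRS = {
    "node_modules", ".git", "__pycache__", ".cache", ".npm",
    ".pnpm-store", ".yarn", ".venv", "venv", ".tmp", "tmp",
}
EXCLUDED_EXTENSIONS = (".sock", ".lock", ".pid")

def is_excluded(path: str) -> bool:
    """Return True if a workspace-relative path would be omitted from listings."""
    parts = path.split("/")
    # Exclude anything located under one of the skipped directories...
    if any(part in EXCLUDED_DIRS for part in parts[:-1]):
        return True
    # ...and any file with a skipped extension.
    return parts[-1].endswith(EXCLUDED_EXTENSIONS)

print(is_excluded("node_modules/express/index.js"))  # True
print(is_excluded("src/index.ts"))                   # False
print(is_excluded("server.pid"))                     # True
```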
### Path Parameters | Parameter | Type | Description | |---|---|---| | `id` | string (UUID) | Session ID | ### Response **200 OK** ```json { "files": [ { "path": "CLAUDE.md", "size": 1234, "modifiedAt": "2025-06-15T10:30:00.000Z" }, { "path": "src/index.ts", "size": 567, "modifiedAt": "2025-06-15T10:32:00.000Z" }, { "path": "package.json", "size": 890, "modifiedAt": "2025-06-15T10:30:00.000Z" } ], "source": "sandbox" } ``` | Field | Type | Description | |---|---|---| | `files` | FileEntry[] | Array of file entries | | `files[].path` | string | Path relative to workspace root | | `files[].size` | integer | File size in bytes | | `files[].modifiedAt` | string | ISO 8601 last-modified timestamp | | `source` | string | `"sandbox"` if read from the live sandbox, `"snapshot"` if read from a persisted snapshot | ### Errors | Status | Condition | |---|---| | `404` | Session not found, or no workspace is available for the session | ```json { "error": "No workspace available for this session", "statusCode": 404 } ``` --- ## Download File (Raw) ``` GET /api/sessions/:id/files/*path ``` Downloads a single file from the session's workspace as raw bytes. The file path is specified as the wildcard portion of the URL. By default, the response streams the raw file content with appropriate `Content-Type` based on the file extension. Files up to 100 MB are supported. ### Path Parameters | Parameter | Type | Description | |---|---|---| | `id` | string (UUID) | Session ID | | `*` | string | File path relative to workspace root | ### Query Parameters | Parameter | Type | Default | Description | |---|---|---|---| | `format` | string | `raw` | Response format. `raw` streams the file bytes directly. `json` returns a JSON-wrapped response (see below). 
| ### Example Request ``` GET /api/sessions/f47ac10b-58cc-4372-a567-0e02b2c3d479/files/src/index.ts ``` ### Response (Raw — Default) **200 OK** The raw file bytes are returned with these headers: | Header | Example | Description | |---|---|---| | `Content-Type` | `text/typescript` | MIME type based on file extension (fallback: `application/octet-stream`) | | `Content-Disposition` | `attachment; filename*=UTF-8''index.ts` | Suggests a filename for download | | `Content-Length` | `67` | File size in bytes | | `X-Ash-Source` | `sandbox` | `sandbox` if from live sandbox, `snapshot` if from persisted snapshot | ```bash # Download raw file content curl -O $ASH_SERVER_URL/api/sessions/SESSION_ID/files/output/report.pdf \ -H "Authorization: Bearer YOUR_API_KEY" ``` ### Response (JSON — `?format=json`) **200 OK** ``` GET /api/sessions/:id/files/src/index.ts?format=json ``` ```json { "path": "src/index.ts", "content": "import express from 'express';\n\nconst app = express();\napp.listen(3000);\n", "size": 67, "source": "sandbox" } ``` | Field | Type | Description | |---|---|---| | `path` | string | The requested file path | | `content` | string | Full file content as UTF-8 text | | `size` | integer | File size in bytes | | `source` | string | `"sandbox"` if read from the live sandbox, `"snapshot"` if from a persisted snapshot | JSON mode has a 1 MB file size limit. For larger files, use the default raw mode. ### Errors | Status | Condition | |---|---| | `400` | Missing file path, path contains `..` traversal, path starts with `/`, path is a directory, or file exceeds size limit (1 MB for JSON mode, 100 MB for raw mode) | | `404` | Session not found, no workspace available, or file does not exist | ```json { "error": "File not found", "statusCode": 404 } ``` --- ## Use Cases **Downloading binary artifacts.** After an agent generates images, PDFs, or compiled binaries, download them directly using the raw endpoint. 
```bash # Download a generated PDF curl -o report.pdf $ASH_SERVER_URL/api/sessions/SESSION_ID/files/output/report.pdf \ -H "Authorization: Bearer YOUR_API_KEY" ``` **Inspecting agent-written code.** After an agent writes code, use `?format=json` to get the content inline. ```bash # Read a text file as JSON curl "$ASH_SERVER_URL/api/sessions/SESSION_ID/files/src/index.ts?format=json" \ -H "Authorization: Bearer YOUR_API_KEY" ``` **Building UIs.** The Files API provides the data needed to build file-browser components that show the agent's workspace in real time. **Reviewing changes after a session ends.** Even after a session is paused or ended, files remain accessible from the persisted snapshot, so you can review what the agent produced. --- # Health and Metrics Source: https://docs.ash-cloud.ai/api/health # Health and Metrics Ash exposes health and metrics endpoints for monitoring, alerting, and integration with orchestration systems. Neither endpoint requires authentication. --- ## Health Check ``` GET /health ``` Returns the server's current status, active session and sandbox counts, uptime, and detailed sandbox pool statistics. ### Request No request body. No authentication required. 
### Response **200 OK** ```json { "status": "ok", "activeSessions": 3, "activeSandboxes": 5, "uptime": 86400, "pool": { "total": 5, "cold": 0, "warming": 1, "warm": 1, "waiting": 2, "running": 1, "maxCapacity": 1000, "resumeWarmHits": 42, "resumeColdHits": 7 } } ``` | Field | Type | Description | |---|---|---| | `status` | string | Always `"ok"` if the server is reachable | | `activeSessions` | integer | Number of sessions in `active` status | | `activeSandboxes` | integer | Number of live sandbox processes | | `uptime` | integer | Seconds since server start | | `pool` | PoolStats | Sandbox pool breakdown | ### Pool Stats The `pool` object provides a detailed view of sandbox states: | Field | Type | Description | |---|---|---| | `total` | integer | Total sandboxes in the pool | | `cold` | integer | Sandboxes not yet started | | `warming` | integer | Sandboxes currently starting up | | `warm` | integer | Sandboxes ready but not assigned to a session | | `waiting` | integer | Sandboxes assigned to a session, idle between messages | | `running` | integer | Sandboxes actively processing a message | | `maxCapacity` | integer | Maximum number of sandboxes allowed (configured via `ASH_MAX_SANDBOXES`) | | `resumeWarmHits` | integer | Total warm resumes (sandbox was still alive) | | `resumeColdHits` | integer | Total cold resumes (new sandbox created, state restored) | --- ## Prometheus Metrics ``` GET /metrics ``` Returns metrics in Prometheus text exposition format. No authentication required. ### Request No request body. ### Response **200 OK** with `Content-Type: text/plain; version=0.0.4; charset=utf-8` ``` # HELP ash_up Whether the Ash server is up (always 1 if reachable). # TYPE ash_up gauge ash_up 1 # HELP ash_uptime_seconds Seconds since server start. # TYPE ash_uptime_seconds gauge ash_uptime_seconds 86400 # HELP ash_active_sessions Number of active sessions. 
# TYPE ash_active_sessions gauge ash_active_sessions 3 # HELP ash_active_sandboxes Number of live sandbox processes. # TYPE ash_active_sandboxes gauge ash_active_sandboxes 5 # HELP ash_pool_sandboxes Sandbox count by state. # TYPE ash_pool_sandboxes gauge ash_pool_sandboxes{state="cold"} 0 ash_pool_sandboxes{state="warming"} 1 ash_pool_sandboxes{state="warm"} 1 ash_pool_sandboxes{state="waiting"} 2 ash_pool_sandboxes{state="running"} 1 # HELP ash_pool_max_capacity Maximum sandbox capacity. # TYPE ash_pool_max_capacity gauge ash_pool_max_capacity 1000 # HELP ash_resume_total Total session resumes by path (warm=sandbox alive, cold=new sandbox). # TYPE ash_resume_total counter ash_resume_total{path="warm"} 42 ash_resume_total{path="cold"} 7 ``` ### Metric Reference | Metric | Type | Labels | Description | |---|---|---|---| | `ash_up` | gauge | -- | Always `1` if the server is reachable | | `ash_uptime_seconds` | gauge | -- | Seconds since server process started | | `ash_active_sessions` | gauge | -- | Number of sessions in `active` status | | `ash_active_sandboxes` | gauge | -- | Number of live sandbox processes | | `ash_pool_sandboxes` | gauge | `state` | Sandbox count broken down by state: `cold`, `warming`, `warm`, `waiting`, `running` | | `ash_pool_max_capacity` | gauge | -- | Configured maximum sandbox capacity | | `ash_resume_total` | counter | `path` | Cumulative session resume count by path: `warm` (sandbox still alive) or `cold` (new sandbox created) | --- ## Prometheus Configuration Add the following scrape config to your `prometheus.yml`: ```yaml scrape_configs: - job_name: 'ash' scrape_interval: 15s static_configs: - targets: ['localhost:4100'] metrics_path: '/metrics' ``` --- ## Kubernetes Probes The `/health` endpoint is suitable for both liveness and readiness probes: ```yaml apiVersion: apps/v1 kind: Deployment metadata: name: ash-server spec: template: spec: containers: - name: ash livenessProbe: httpGet: path: /health port: 4100 initialDelaySeconds: 
5 periodSeconds: 10 readinessProbe: httpGet: path: /health port: 4100 initialDelaySeconds: 5 periodSeconds: 5 ``` The liveness probe verifies the server process is responsive. The readiness probe can be used to gate traffic until the server has completed initialization. Both return `200` with `{"status": "ok", ...}` when the server is healthy. --- # TypeScript SDK Source: https://docs.ash-cloud.ai/sdks/typescript # TypeScript SDK The `@ash-ai/sdk` package provides a typed TypeScript client for the Ash REST API. ## Installation ```bash npm install @ash-ai/sdk ``` ```bash pip install ash-ai-sdk ``` ## Client Setup ```typescript import { AshClient } from '@ash-ai/sdk'; const client = new AshClient({ serverUrl: 'http://localhost:4100', apiKey: 'your-api-key', }); ``` The `serverUrl` is the base URL of your Ash server. Trailing slashes are stripped automatically. The server always requires authentication. If you used `ash start`, the CLI saves the auto-generated key to `~/.ash/config.json`. For SDK usage, pass the key explicitly. ```python from ash_ai import AshClient client = AshClient( server_url="http://localhost:4100", api_key="your-api-key", ) ``` The `server_url` is the base URL of your Ash server. The `api_key` is required — the server always requires authentication.
## Methods Reference ### Agents ```typescript // Deploy an agent from a directory path on the server const agent = await client.deployAgent('my-agent', '/path/to/agent'); // List all deployed agents const agents = await client.listAgents(); // Get a specific agent by name const agent = await client.getAgent('my-agent'); // Delete an agent (also deletes its sessions) await client.deleteAgent('my-agent'); ``` ```python # Deploy an agent from a directory path on the server agent = client.deploy_agent(name="my-agent", path="/path/to/agent") # List all deployed agents agents = client.list_agents() # Get a specific agent by name agent = client.get_agent("my-agent") # Delete an agent (also deletes its sessions) client.delete_agent("my-agent") ``` ### Sessions ```typescript // Create a new session for an agent const session = await client.createSession('my-agent'); // Create with per-session MCP servers (sidecar pattern) const session = await client.createSession('my-agent', { mcpServers: { 'tenant-tools': { url: 'http://host-app:8000/mcp?tenant=t_abc123' }, }, }); // Create with a system prompt override const session = await client.createSession('my-agent', { systemPrompt: 'You are a support agent for Acme Corp.', }); // List all sessions (optionally filter by agent name) const sessions = await client.listSessions(); const agentSessions = await client.listSessions('my-agent'); // Get a session by ID const session = await client.getSession(sessionId); // Pause a session (persists workspace state) const paused = await client.pauseSession(sessionId); // Resume a paused or errored session const resumed = await client.resumeSession(sessionId); // End a session permanently const ended = await client.endSession(sessionId); ``` ```python # Create a new session for an agent session = client.create_session("my-agent") # List all sessions (optionally filter by agent name) sessions = client.list_sessions() agent_sessions = client.list_sessions(agent="my-agent") # Get a session by ID 
session = client.get_session(session_id) # Pause a session (persists workspace state) paused = client.pause_session(session_id) # Resume a paused or errored session resumed = client.resume_session(session_id) # End a session permanently ended = client.end_session(session_id) ``` ### Messages #### Streaming Messages (Recommended) `sendMessageStream(sessionId, content, opts?)` returns an async generator that yields parsed `AshStreamEvent` objects: ```typescript for await (const event of client.sendMessageStream(sessionId, 'Analyze this code')) { if (event.type === 'message') { console.log('SDK message:', event.data); } else if (event.type === 'error') { console.error('Error:', event.data.error); } else if (event.type === 'done') { console.log('Turn complete for session:', event.data.sessionId); } } ``` `send_message_stream(session_id, content, **kwargs)` returns an iterator of parsed events: ```python for event in client.send_message_stream(session_id, "Analyze this code"): if event.type == "message": print("SDK message:", event.data) elif event.type == "error": print(f"Error: {event.data['error']}") elif event.type == "done": print(f"Turn complete for session: {event.data['sessionId']}") ``` #### Raw Response (TypeScript only) `sendMessage(sessionId, content, opts?)` returns a raw `Response` object with an SSE stream body. Use this when you need full control over the stream. ```typescript const response = await client.sendMessage(sessionId, 'Hello, agent'); // response.body is a ReadableStream containing SSE frames ``` #### Options Both methods accept options for partial message streaming: ```typescript interface SendMessageOptions { /** Enable partial message streaming. Yields incremental StreamEvent messages * with raw API deltas in addition to complete messages. */ includePartialMessages?: boolean; } ``` When `includePartialMessages` is `true`, the stream includes `stream_event` messages with `content_block_delta` events. 
Use `extractStreamDelta()` to pull text chunks from these events for real-time streaming UIs. ```python # Enable partial message streaming with the include_partial_messages kwarg for event in client.send_message_stream( session_id, "Write a haiku.", include_partial_messages=True, ): if event.type == "message": data = event.data if data.get("type") == "stream_event": evt = data.get("event", {}) if evt.get("type") == "content_block_delta": delta = evt.get("delta", {}) if delta.get("type") == "text_delta": print(delta.get("text", ""), end="", flush=True) ``` ### Messages History ```typescript // List persisted messages for a session const messages = await client.listMessages(sessionId); // With pagination const messages = await client.listMessages(sessionId, { limit: 50, afterSequence: 10, }); ``` ```python # List persisted messages for a session messages = client.list_messages(session_id) # With pagination messages = client.list_messages(session_id, limit=50, after_sequence=10) ``` ### Session Events (Timeline) ```typescript // List timeline events for a session const events = await client.listSessionEvents(sessionId); // Filter by type and paginate const textEvents = await client.listSessionEvents(sessionId, { type: 'text', limit: 100, afterSequence: 0, }); ``` ```python # List timeline events for a session events = client.list_session_events(session_id) # Filter by type and paginate text_events = client.list_session_events(session_id, type="text", limit=100, after_sequence=0) ``` Event types: `text`, `tool_start`, `tool_result`, `reasoning`, `error`, `turn_complete`, `lifecycle`. 
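A fetched timeline can be summarized client-side by tallying these types, e.g. to count how many tool calls a session ran. A sketch, assuming each returned event carries a `type` field (as the `type` filter parameter suggests) and using hypothetical sample data:

```python
from collections import Counter

def summarize_timeline(events: list[dict]) -> Counter:
    """Tally session timeline events by type."""
    return Counter(e.get("type", "unknown") for e in events)

# Hypothetical sample timeline for illustration:
events = [
    {"type": "lifecycle"},
    {"type": "text"},
    {"type": "tool_start"},
    {"type": "tool_result"},
    {"type": "text"},
    {"type": "turn_complete"},
]
summary = summarize_timeline(events)
print(summary["text"], summary["tool_start"])  # 2 1
```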
### Files ```typescript // List files in a session's workspace const { files, source } = await client.getSessionFiles(sessionId); // source is 'sandbox' (live) or 'snapshot' (persisted) // Read a specific file const file = await client.getSessionFile(sessionId, 'src/index.ts'); // file contains path, content, size, and source ``` ```python import httpx headers = {"Authorization": "Bearer YOUR_API_KEY"} # List files in a session's workspace resp = httpx.get(f"http://localhost:4100/api/sessions/{session_id}/files", headers=headers) data = resp.json() # data["source"] is "sandbox" (live) or "snapshot" (persisted) # Read a specific file (format=json returns the content inline) resp = httpx.get(f"http://localhost:4100/api/sessions/{session_id}/files/src/index.ts?format=json", headers=headers) file_data = resp.json() ``` ### Health ```typescript const health = await client.health(); // { // status: 'ok', // activeSessions: 3, // activeSandboxes: 2, // uptime: 3600, // pool: { total: 5, cold: 2, warming: 0, warm: 1, waiting: 1, running: 1, maxCapacity: 1000, ... } // } ``` ```python health = client.health() # { # "status": "ok", # "activeSessions": 3, # "activeSandboxes": 2, # "uptime": 3600, # "pool": { "total": 5, "cold": 2, "warming": 0, "warm": 1, ...
} # } ``` ## Full Streaming Example ```typescript import { AshClient, extractStreamDelta, extractTextFromEvent, extractDisplayItems } from '@ash-ai/sdk'; const client = new AshClient({ serverUrl: 'http://localhost:4100', apiKey: process.env.ASH_API_KEY, }); // Deploy and create session const agent = await client.deployAgent('helper', '/path/to/agent'); const session = await client.createSession('helper'); // Stream with partial messages for real-time output for await (const event of client.sendMessageStream(session.id, 'Write a haiku', { includePartialMessages: true, })) { if (event.type === 'message') { // Extract incremental text deltas for real-time display const delta = extractStreamDelta(event.data); if (delta) { process.stdout.write(delta); continue; } // Extract complete text from finished assistant messages const text = extractTextFromEvent(event.data); if (text) { console.log('\nComplete:', text); } // Extract structured display items (text, tool use, tool results) const items = extractDisplayItems(event.data); if (items) { for (const item of items) { if (item.type === 'tool_use') { console.log(`Tool: ${item.toolName} (${item.toolInput})`); } } } } else if (event.type === 'error') { console.error('Error:', event.data.error); } else if (event.type === 'done') { console.log('Done.'); } } // Clean up await client.endSession(session.id); ``` ```python import os from ash_ai import AshClient client = AshClient( server_url="http://localhost:4100", api_key=os.environ.get("ASH_API_KEY"), ) # Deploy and create session agent = client.deploy_agent(name="helper", path="/path/to/agent") session = client.create_session("helper") # Stream with partial messages for real-time output for event in client.send_message_stream(session.id, "Write a haiku", include_partial_messages=True, ): if event.type == "message": data = event.data # Extract incremental text deltas for real-time display if data.get("type") == "stream_event": evt = data.get("event", {}) if evt.get("type") == "content_block_delta": delta = evt.get("delta", {}) if delta.get("type") == "text_delta": print(delta.get("text",
""), end="", flush=True) continue # Extract complete text from finished assistant messages if data.get("type") == "assistant": for block in data.get("message", {}).get("content", []): if block.get("type") == "text": print(f"\nComplete: {block['text']}") elif block.get("type") == "tool_use": print(f"Tool: {block['name']} ({block.get('input', '')})") elif event.type == "error": print(f"Error: {event.data.get('error')}") elif event.type == "done": print("Done.") # Clean up client.end_session(session.id) ``` ## Helper Functions The SDK re-exports these helpers from `@ash-ai/shared`: | Function | Description | |----------|-------------| | `extractDisplayItems(data)` | Extract structured display items (text, tool use, tool result) from an SDK message. Returns `DisplayItem[]` or `null`. | | `extractTextFromEvent(data)` | Extract plain text content from an assistant message. Returns `string` or `null`. | | `extractStreamDelta(data)` | Extract incremental text delta from a `stream_event` / `content_block_delta`. Only yields values when `includePartialMessages` is enabled. Returns `string` or `null`. | | `parseSSEStream(stream)` | Parse a `ReadableStream` into an async generator of `AshStreamEvent`. Works in both Node.js and browser. | ## Re-exported Types The SDK re-exports these types from `@ash-ai/shared`: ```typescript // Core entities Agent, Session, SessionStatus // Request/Response CreateSessionRequest, SendMessageRequest, DeployAgentRequest ListAgentsResponse, ListSessionsResponse, HealthResponse, ApiError // SSE streaming AshSSEEventType, AshMessageEvent, AshErrorEvent, AshDoneEvent, AshStreamEvent // Display helpers DisplayItem, DisplayItemType // Files FileEntry, ListFilesResponse, GetFileResponse // MCP McpServerConfig ``` ## Error Handling All methods throw on non-2xx responses. The error message is extracted from the API response body. 
```typescript try { const session = await client.createSession('nonexistent-agent'); } catch (err) { // err.message === 'Agent "nonexistent-agent" not found' console.error(err.message); } ``` For streaming, errors can arrive both as thrown exceptions (connection failures) and as `error` events within the stream (agent-level errors): ```typescript try { for await (const event of client.sendMessageStream(sessionId, 'hello')) { if (event.type === 'error') { // Agent-level error (e.g., sandbox crash, SDK error) console.error('Stream error:', event.data.error); } } } catch (err) { // Connection-level error (e.g., network failure, 404) console.error('Connection error:', err.message); } ``` All methods raise on non-2xx responses: ```python from ash_ai import AshApiError try: session = client.create_session(agent="nonexistent") except AshApiError as e: print(f"API error ({e.status_code}): {e.message}") except Exception as e: print(f"Connection error: {e}") ``` For streaming, errors can arrive both as exceptions (connection failures) and as `error` events within the stream (agent-level errors): ```python try: for event in client.send_message_stream(session_id, "hello"): if event.type == "error": # Agent-level error (e.g., sandbox crash, SDK error) print(f"Stream error: {event.data.get('error')}") except Exception as e: # Connection-level error (e.g., network failure, 404) print(f"Connection error: {e}") ``` --- # Python SDK Source: https://docs.ash-cloud.ai/sdks/python # Python SDK The `ash-ai-sdk` Python package provides a client for the Ash REST API. It is auto-generated from the OpenAPI specification. 
## Installation ```bash pip install ash-ai-sdk ``` ```bash npm install @ash-ai/sdk ``` ## Client Setup ```python from ash_ai import AshClient client = AshClient( server_url="http://localhost:4100", api_key="your-api-key", ) ``` ```typescript const client = new AshClient({ serverUrl: 'http://localhost:4100', apiKey: 'your-api-key', }); ``` ## Usage Examples ### Deploy an Agent ```python agent = client.deploy_agent(name="my-agent", path="/path/to/agent") print(f"Deployed: {agent.name} v{agent.version}") ``` ```typescript const agent = await client.deployAgent('my-agent', '/path/to/agent'); console.log(`Deployed: ${agent.name} v${agent.version}`); ``` ### Create a Session ```python session = client.create_session(agent="my-agent") print(f"Session ID: {session.id}") print(f"Status: {session.status}") ``` ```typescript const session = await client.createSession('my-agent'); console.log(`Session ID: ${session.id}`); console.log(`Status: ${session.status}`); ``` ### Send a Message (Streaming) ```python for event in client.send_message_stream(session.id, "Analyze this code"): if event.type == "message": data = event.data if data.get("type") == "assistant" and data.get("message", {}).get("content"): for block in data["message"]["content"]: if block.get("type") == "text": print(block["text"]) elif event.type == "error": print(f"Error: {event.data['error']}") elif event.type == "done": print("Turn complete.") ``` ```typescript for await (const event of client.sendMessageStream(session.id, 'Analyze this code')) { if (event.type === 'message') { const text = extractTextFromEvent(event.data); if (text) console.log(text); } else if (event.type === 'error') { console.error('Error:', event.data.error); } else if (event.type === 'done') { console.log('Turn complete.'); } } ``` ### Pause and Resume ```python # Pause the session (persists workspace state) paused = client.pause_session(session.id) print(f"Status: {paused.status}") # 'paused' # Resume later (fast path if sandbox is 
still alive) resumed = client.resume_session(session.id) print(f"Status: {resumed.status}") # 'active' ``` ```typescript // Pause the session (persists workspace state) const paused = await client.pauseSession(session.id); console.log(`Status: ${paused.status}`); // 'paused' // Resume later (fast path if sandbox is still alive) const resumed = await client.resumeSession(session.id); console.log(`Status: ${resumed.status}`); // 'active' ``` ### End a Session ```python ended = client.end_session(session.id) print(f"Status: {ended.status}") # 'ended' ``` ```typescript const ended = await client.endSession(session.id); console.log(`Status: ${ended.status}`); // 'ended' ``` ### Multi-Turn Conversation ```python session = client.create_session(agent="my-agent") questions = [ "What files are in the workspace?", "Read the main config file.", "Summarize what this project does.", ] for question in questions: print(f"\n> {question}") for event in client.send_message_stream(session.id, question): if event.type == "message": data = event.data if data.get("type") == "assistant": content = data.get("message", {}).get("content", []) for block in content: if block.get("type") == "text": print(block["text"], end="") print() client.end_session(session.id) ``` ```typescript const session = await client.createSession('my-agent'); const questions = [ 'What files are in the workspace?', 'Read the main config file.', 'Summarize what this project does.', ]; for (const question of questions) { console.log(`\n> ${question}`); for await (const event of client.sendMessageStream(session.id, question)) { if (event.type === 'message') { const text = extractTextFromEvent(event.data); if (text) process.stdout.write(text); } } console.log(); } await client.endSession(session.id); ``` ### List Agents and Sessions ```python # List all deployed agents agents = client.list_agents() for agent in agents: print(f"{agent.name} (v{agent.version})") # List all sessions, optionally filtered by agent sessions = 
client.list_sessions(agent="my-agent") for s in sessions: print(f"{s.id} - {s.status}") ``` ```typescript // List all deployed agents const agents = await client.listAgents(); for (const agent of agents) { console.log(`${agent.name} (v${agent.version})`); } // List all sessions, optionally filtered by agent const sessions = await client.listSessions('my-agent'); for (const s of sessions) { console.log(`${s.id} - ${s.status}`); } ``` ## Error Handling ```python from ash_ai import AshApiError try: session = client.create_session(agent="nonexistent") except AshApiError as e: print(f"API error ({e.status_code}): {e.message}") except Exception as e: print(f"Connection error: {e}") ``` ```typescript try { const session = await client.createSession('nonexistent'); } catch (err) { console.error(err.message); } ``` ## Note on SDK Generation The Python SDK is auto-generated from the Ash server's OpenAPI specification using `openapi-python-client`. The spec is generated from Fastify route schemas, so the Python SDK always matches the server's API surface. To regenerate the SDK from source: ```bash make sdk-python ``` This runs `make openapi` first (to generate the spec), then runs the Python client generator. --- # Direct API (curl) Source: https://docs.ash-cloud.ai/sdks/curl # Direct API (curl) No SDK dependencies needed. All Ash functionality is available through HTTP requests. This page shows every operation using `curl`. ## Setup All examples below use the `ASH_SERVER_URL` environment variable. Set it once: ```bash export ASH_SERVER_URL=http://localhost:4100 # default ``` Include the `-H "Authorization: Bearer YOUR_KEY"` header on every request except `/health`. The server always requires authentication — it auto-generates an API key on first start if one is not provided.
## Health Check

```bash
curl $ASH_SERVER_URL/health
```

```json
{
  "status": "ok",
  "activeSessions": 2,
  "activeSandboxes": 2,
  "uptime": 1234,
  "pool": {
    "total": 5,
    "cold": 2,
    "warming": 0,
    "warm": 1,
    "waiting": 1,
    "running": 1,
    "maxCapacity": 1000,
    "resumeWarmHits": 3,
    "resumeColdHits": 1
  }
}
```

## Agents

### Deploy an Agent

```bash
curl -X POST $ASH_SERVER_URL/api/agents \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_KEY" \
  -d '{"name": "my-agent", "path": "/path/to/agent/directory"}'
```

The agent directory must contain a `CLAUDE.md` file. The path is resolved on the server.

### List Agents

```bash
curl $ASH_SERVER_URL/api/agents \
  -H "Authorization: Bearer YOUR_KEY"
```

### Get Agent Details

```bash
curl $ASH_SERVER_URL/api/agents/my-agent \
  -H "Authorization: Bearer YOUR_KEY"
```

### Delete an Agent

```bash
curl -X DELETE $ASH_SERVER_URL/api/agents/my-agent \
  -H "Authorization: Bearer YOUR_KEY"
```

## Sessions

### Create a Session

```bash
curl -X POST $ASH_SERVER_URL/api/sessions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_KEY" \
  -d '{"agent": "my-agent"}'
```

Response:

```json
{
  "session": {
    "id": "a1b2c3d4-...",
    "agentName": "my-agent",
    "sandboxId": "a1b2c3d4-...",
    "status": "active",
    "createdAt": "2026-01-15T10:00:00.000Z",
    "lastActiveAt": "2026-01-15T10:00:00.000Z"
  }
}
```

### Send a Message (SSE Stream)

Use `-N` to disable output buffering so SSE events print in real time:

```bash
curl -N -X POST $ASH_SERVER_URL/api/sessions/SESSION_ID/messages \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_KEY" \
  -d '{"content": "What files are in the workspace?"}'
```

The response is a `text/event-stream`.
Events arrive as:

```
event: message
data: {"type":"assistant","message":{"role":"assistant","content":[{"type":"text","text":"Here are the files..."}]}}

event: message
data: {"type":"result","subtype":"success","session_id":"...","num_turns":1}

event: done
data: {"sessionId":"a1b2c3d4-..."}
```

To enable partial message streaming (incremental text deltas):

```bash
curl -N -X POST $ASH_SERVER_URL/api/sessions/SESSION_ID/messages \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_KEY" \
  -d '{"content": "Write a haiku", "includePartialMessages": true}'
```

### List Sessions

```bash
# All sessions
curl $ASH_SERVER_URL/api/sessions \
  -H "Authorization: Bearer YOUR_KEY"

# Filter by agent
curl "$ASH_SERVER_URL/api/sessions?agent=my-agent" \
  -H "Authorization: Bearer YOUR_KEY"
```

### Get Session Details

```bash
curl $ASH_SERVER_URL/api/sessions/SESSION_ID \
  -H "Authorization: Bearer YOUR_KEY"
```

### Pause a Session

```bash
curl -X POST $ASH_SERVER_URL/api/sessions/SESSION_ID/pause \
  -H "Authorization: Bearer YOUR_KEY"
```

### Resume a Session

```bash
curl -X POST $ASH_SERVER_URL/api/sessions/SESSION_ID/resume \
  -H "Authorization: Bearer YOUR_KEY"
```

### End a Session

```bash
curl -X DELETE $ASH_SERVER_URL/api/sessions/SESSION_ID \
  -H "Authorization: Bearer YOUR_KEY"
```

### List Messages (History)

```bash
# Default: last 100 messages
curl $ASH_SERVER_URL/api/sessions/SESSION_ID/messages \
  -H "Authorization: Bearer YOUR_KEY"

# With pagination
curl "$ASH_SERVER_URL/api/sessions/SESSION_ID/messages?limit=50&after=10" \
  -H "Authorization: Bearer YOUR_KEY"
```

Note: `GET /api/sessions/:id/messages` returns persisted message history, while `POST /api/sessions/:id/messages` sends a new message and returns an SSE stream.
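Because the message stream is plain SSE, it can also be consumed without any SDK. A minimal Python sketch (illustrative, not from the Ash source) that groups `event:`/`data:` lines into frames:

```python
import json

def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of SSE text lines (sketch)."""
    event, data_parts = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_parts.append(line[len("data:"):].strip())
        elif line == "" and event is not None:
            # A blank line terminates the frame
            yield event, json.loads("\n".join(data_parts))
            event, data_parts = None, []

frames = list(parse_sse([
    'event: message',
    'data: {"type":"result","num_turns":1}',
    '',
    'event: done',
    'data: {"sessionId":"a1b2c3d4"}',
    '',
]))
```

In a real client you would feed `parse_sse` the line iterator of the HTTP response body.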
### List Session Events (Timeline)

```bash
# All events
curl $ASH_SERVER_URL/api/sessions/SESSION_ID/events \
  -H "Authorization: Bearer YOUR_KEY"

# Filter by type
curl "$ASH_SERVER_URL/api/sessions/SESSION_ID/events?type=text&limit=50" \
  -H "Authorization: Bearer YOUR_KEY"
```

## Files

### List Workspace Files

```bash
curl $ASH_SERVER_URL/api/sessions/SESSION_ID/files \
  -H "Authorization: Bearer YOUR_KEY"
```

### Read a File

```bash
curl $ASH_SERVER_URL/api/sessions/SESSION_ID/files/src/index.ts \
  -H "Authorization: Bearer YOUR_KEY"
```

## SSE Event Format

The send-message endpoint returns an SSE stream with three event types:

| Event | Data | Description |
|-------|------|-------------|
| `message` | Raw Claude Code SDK `Message` object | Assistant response, tool use, tool result, or final result. The `data.type` field indicates the message kind (`assistant`, `user`, `result`, `stream_event`). |
| `error` | `{"error": "..."}` | An error occurred during processing. |
| `done` | `{"sessionId": "..."}` | The agent's turn is complete. |

Each SSE frame follows the standard format:

```
event: <type>\n
data: <json>\n
\n
```

The `message` event data is a passthrough of the Claude Code SDK's `Message` type. Ash does not translate or wrap these messages -- the SDK's types are the wire format.

---

# CLI Overview

Source: https://docs.ash-cloud.ai/cli/overview

# CLI Overview

The `ash` CLI manages the Ash server lifecycle, deploys agents, and interacts with sessions from the terminal.

## Installation

```bash
npm install -g @ash-ai/cli
```

## Global Configuration

The CLI connects to an Ash server. Set the server URL via environment variable:

```bash
export ASH_SERVER_URL=http://localhost:4100  # default
```

The server always requires authentication. When you run `ash start`, the CLI automatically picks up the server's API key (auto-generated or explicit) and saves it to `~/.ash/config.json`. For remote servers, use `ash link --api-key <key>` to save the key.
## Help

```bash
ash --help
```

```
Usage: ash [options] [command]

Agent orchestration CLI

Options:
  -V, --version   output the version number
  -h, --help      display help for command

Commands:
  start           Start the Ash server in a Docker container
  stop            Stop the Ash server container
  status          Show Ash server status
  logs            Show Ash server logs
  chat            Send a message to an agent (one-shot)
  deploy          Deploy an agent to the server
  session         Manage sessions
  agent           Manage agents
  health          Check server health
  help [command]  display help for command
```

## Command Groups

| Group | Description |
|-------|-------------|
| **Server Lifecycle** | `start`, `stop`, `status`, `logs` -- manage the Ash server Docker container |
| **Quick** | `chat` -- send a message to an agent, keep session alive for follow-ups (`--session <id>` to continue, `--end` to clean up) |
| **Agents** | `deploy`, `agent list`, `agent info`, `agent delete` -- deploy and manage agent definitions |
| **Sessions** | `session create`, `session send`, `session list`, `session pause`, `session resume`, `session end` -- interact with agent sessions |
| **Health** | `health` -- check server health and pool stats |

---

# Server Lifecycle

Source: https://docs.ash-cloud.ai/cli/lifecycle

# Server Lifecycle

The CLI manages an Ash server running in a Docker container. These commands handle the full lifecycle: start, stop, status, and logs.

## `ash start`

Starts the Ash server in a Docker container.

```bash
ash start
```

The command:

1. Checks that Docker is installed and running
2. Removes any stale stopped container
3. Creates the data directory (`~/.ash/`)
4. Pulls the latest image (unless `--no-pull`)
5. Starts the container with port mapping and volume mounts
6. Waits for the health endpoint to respond (up to 30 seconds)

### Options

| Flag | Description | Default |
|------|-------------|---------|
| `--port <port>` | Host port to expose | `4100` |
| `--tag <tag>` | Docker image tag | Latest published version |
| `--image <image>` | Full Docker image name (overrides default + tag) | `ghcr.io/ash-ai/ash` |
| `--no-pull` | Skip pulling the image; use a local build | Pull enabled |
| `--database-url <url>` | PostgreSQL/CockroachDB connection URL | SQLite (default) |
| `-e, --env <key=value>` | Extra env vars to pass to the container (repeatable) | None |

### Examples

```bash
# Start with defaults
ash start

# Use a local dev image
ash start --image ash-dev --no-pull

# Use a specific port
ash start --port 8080

# Use Postgres instead of SQLite
ash start --database-url "postgresql://user:pass@host:5432/ash"

# Pass extra env vars
ash start -e ANTHROPIC_API_KEY=sk-ant-...
```

### Output

```
Starting Ash server...
Waiting for server to be ready...
Ash server is running.
URL: http://localhost:4100
API key: ash_xxxxxxxx (saved to ~/.ash/config.json)
Data dir: /Users/you/.ash
```

The server auto-generates a secure API key on first start and the CLI saves it to `~/.ash/config.json`. Subsequent CLI commands use this key automatically.

## `ash stop`

Stops the running Ash server container.

```bash
ash stop
```

```
Stopping Ash server...
Ash server stopped.
```

If no container is found, prints a message and exits.

## `ash status`

Shows the current state of the Ash server container and, if running, its health stats.

```bash
ash status
```

### Example Output

```
Container: running
ID: a1b2c3d4e5f6
Image: ghcr.io/ash-ai/ash:0.1.0
Active sessions: 3
Active sandboxes: 2
Uptime: 1234s
```

When the container is stopped:

```
Container: exited
ID: a1b2c3d4e5f6
Image: ghcr.io/ash-ai/ash:0.1.0
```

When no container exists:

```
Container: not-found
```

## `ash logs`

Shows logs from the Ash server container.
```bash
ash logs
```

### Options

| Flag | Description |
|------|-------------|
| `-f, --follow` | Follow log output (like `tail -f`) |

### Examples

```bash
# Show recent logs
ash logs

# Follow logs in real time
ash logs --follow
```

---

# Agent Commands

Source: https://docs.ash-cloud.ai/cli/agents

# Agent Commands

Deploy and manage agent definitions on the Ash server.

## `ash deploy <path>`

Deploys an agent from a local directory. The directory must contain a `CLAUDE.md` file.

```bash
ash deploy ./my-agent --name my-agent
```

The command copies the agent directory to `~/.ash/agents/<name>/` (so it is accessible inside the Docker container via volume mount), then registers it with the server.

### Options

| Flag | Description | Default |
|------|-------------|---------|
| `-n, --name <name>` | Agent name | Directory name |

### Example

```bash
ash deploy ./examples/qa-bot/agent --name qa-bot
```

```
Copied agent files to /Users/you/.ash/agents/qa-bot
Deployed agent:
{
  "id": "a1b2c3d4-...",
  "name": "qa-bot",
  "version": 1,
  "path": "agents/qa-bot",
  "createdAt": "2026-01-15T10:00:00.000Z",
  "updatedAt": "2026-01-15T10:00:00.000Z"
}
```

Deploying the same name again increments the version:

```bash
ash deploy ./examples/qa-bot/agent --name qa-bot
```

```
Deployed agent: { ... "version": 2, ... }
```

## `ash agent list`

Lists all deployed agents.

```bash
ash agent list
```

```json
[
  {
    "id": "a1b2c3d4-...",
    "name": "qa-bot",
    "version": 2,
    "path": "agents/qa-bot",
    "createdAt": "2026-01-15T10:00:00.000Z",
    "updatedAt": "2026-01-15T10:05:00.000Z"
  },
  {
    "id": "e5f6a7b8-...",
    "name": "code-reviewer",
    "version": 1,
    "path": "agents/code-reviewer",
    "createdAt": "2026-01-15T11:00:00.000Z",
    "updatedAt": "2026-01-15T11:00:00.000Z"
  }
]
```

## `ash agent info <name>`

Gets details for a specific agent.
```bash
ash agent info qa-bot
```

```json
{
  "id": "a1b2c3d4-...",
  "name": "qa-bot",
  "version": 2,
  "path": "agents/qa-bot",
  "createdAt": "2026-01-15T10:00:00.000Z",
  "updatedAt": "2026-01-15T10:05:00.000Z"
}
```

## `ash agent delete <name>`

Deletes an agent and its associated sessions.

```bash
ash agent delete qa-bot
```

```
Deleted agent: qa-bot
```

---

# Session Commands

Source: https://docs.ash-cloud.ai/cli/sessions

# Session Commands

Create, message, and manage agent sessions from the terminal.

## `ash session create <agent>`

Creates a new session for the named agent. A sandbox is allocated and the agent's workspace is initialized.

```bash
ash session create qa-bot
```

```
Session created:
{
  "id": "b2c3d4e5-1234-5678-9abc-def012345678",
  "agentName": "qa-bot",
  "sandboxId": "b2c3d4e5-1234-5678-9abc-def012345678",
  "status": "active",
  "createdAt": "2026-01-15T10:00:00.000Z",
  "lastActiveAt": "2026-01-15T10:00:00.000Z"
}
```

## `ash session send <sessionId> <message>`

Sends a message to a session and streams the response. SSE events are printed as they arrive.

```bash
ash session send b2c3d4e5-1234-5678-9abc-def012345678 "What files are in the workspace?"
```

```
[message] assistant: {"type":"assistant","message":{"role":"assistant","content":[{"type":"text","text":"..."}]}}
[message] result: {"type":"result","subtype":"success","session_id":"...","num_turns":1}
[done] {"sessionId":"b2c3d4e5-..."}
```

Each line shows the SSE event type in brackets followed by the SDK message type and a truncated JSON preview.

## `ash session list`

Lists all sessions.

```bash
ash session list
```

```json
[
  {
    "id": "b2c3d4e5-...",
    "agentName": "qa-bot",
    "sandboxId": "b2c3d4e5-...",
    "status": "active",
    "createdAt": "2026-01-15T10:00:00.000Z",
    "lastActiveAt": "2026-01-15T10:01:00.000Z"
  },
  {
    "id": "c3d4e5f6-...",
    "agentName": "qa-bot",
    "status": "paused",
    ...
  }
]
```

## `ash session pause <sessionId>`

Pauses an active session. The workspace state is persisted and the sandbox remains alive for fast resume.
```bash
ash session pause b2c3d4e5-1234-5678-9abc-def012345678
```

```
Session paused:
{ "id": "b2c3d4e5-...", "status": "paused", ... }
```

## `ash session resume <sessionId>`

Resumes a paused or errored session. If the sandbox is still alive, resume is instant (warm path). If the sandbox was evicted, a new one is created and the workspace is restored (cold path).

```bash
ash session resume b2c3d4e5-1234-5678-9abc-def012345678
```

```
Session resumed:
{ "id": "b2c3d4e5-...", "status": "active", ... }
```

## `ash session end <sessionId>`

Ends a session permanently. The sandbox is destroyed and the session status is set to `ended`.

```bash
ash session end b2c3d4e5-1234-5678-9abc-def012345678
```

```
Session ended:
{ "id": "b2c3d4e5-...", "status": "ended", ... }
```

## Full Lifecycle Example

```bash
# Deploy an agent
ash deploy ./my-agent --name helper

# Create a session
ash session create helper
# Note the session ID from the output

# Send messages
ash session send SESSION_ID "List the project structure"
ash session send SESSION_ID "Read the README"

# Pause when done for now
ash session pause SESSION_ID

# Resume later
ash session resume SESSION_ID
ash session send SESSION_ID "Summarize what you found"

# End when finished
ash session end SESSION_ID
```

---

# Health

Source: https://docs.ash-cloud.ai/cli/health

# Health

Check the health of a running Ash server.

## `ash health`

Queries the server's `/health` endpoint and prints the response.
```bash
ash health
```

### Example Output

```json
{
  "status": "ok",
  "activeSessions": 3,
  "activeSandboxes": 2,
  "uptime": 7200,
  "pool": {
    "total": 5,
    "cold": 2,
    "warming": 0,
    "warm": 1,
    "waiting": 1,
    "running": 1,
    "maxCapacity": 1000,
    "resumeWarmHits": 5,
    "resumeColdHits": 2
  }
}
```

### Fields

| Field | Description |
|-------|-------------|
| `status` | Always `"ok"` if the server is reachable |
| `activeSessions` | Number of sessions with status `active` |
| `activeSandboxes` | Number of live sandbox processes |
| `uptime` | Seconds since the server started |
| `pool.total` | Total sandbox entries in the database (live + cold) |
| `pool.cold` | Sandboxes with no live process (can be evicted or restored) |
| `pool.warming` | Sandboxes currently starting up |
| `pool.warm` | Sandboxes with a live process, not yet assigned to a message |
| `pool.waiting` | Sandboxes idle between messages (sandbox alive, session paused or between turns) |
| `pool.running` | Sandboxes actively processing a message |
| `pool.maxCapacity` | Maximum number of sandboxes allowed (set by `ASH_MAX_SANDBOXES`) |
| `pool.resumeWarmHits` | Number of resumes that found the sandbox still alive |
| `pool.resumeColdHits` | Number of resumes that required creating a new sandbox |

The health endpoint does not require authentication.

---

# System Overview

Source: https://docs.ash-cloud.ai/architecture/overview

# System Overview

Ash is a thin orchestration layer around the [Claude Code SDK](https://github.com/anthropic-ai/claude-code-sdk-python). It manages agent deployment, session lifecycle, sandbox isolation, and streaming -- adding as little overhead as possible on top of the SDK itself.

## Standalone Mode

In standalone mode, a single server process manages everything: HTTP API, sandbox pool, and bridge processes.

```mermaid
graph LR
    Client["Client (SDK / CLI / curl)"]
    Server["Ash Server<br/>Fastify :4100"]
    Pool["SandboxPool"]
    B1["Bridge 1"]
    B2["Bridge 2"]
    SDK1["Claude Code SDK"]
    SDK2["Claude Code SDK"]
    DB["SQLite / Postgres"]

    Client -->|HTTP + SSE| Server
    Server --> Pool
    Server --> DB
    Pool --> B1
    Pool --> B2
    B1 -->|Unix Socket| SDK1
    B2 -->|Unix Socket| SDK2
```

## Coordinator Mode

In coordinator mode, the server acts as a pure control plane. Sandbox execution is offloaded to remote runner processes on separate machines.

```mermaid
graph LR
    Client["Client"]
    Server["Ash Server<br/>(coordinator)"]
    R1["Runner 1"]
    R2["Runner 2"]
    B1["Bridge"]
    B2["Bridge"]
    DB["Postgres / CRDB"]

    Client -->|HTTP + SSE| Server
    Server --> DB
    Server -->|HTTP| R1
    Server -->|HTTP| R2
    R1 --> B1
    R2 --> B2
```

Runners register with the server via heartbeat. The server routes sessions to the runner with the most available capacity.

## Multi-Coordinator Mode

For high availability and horizontal scaling of the control plane, run multiple coordinators behind a load balancer with a shared database (Postgres or CockroachDB).

```mermaid
graph LR
    Client["Client"]
    LB["Load Balancer"]
    C1["Coordinator 1"]
    C2["Coordinator 2"]
    R1["Runner 1"]
    R2["Runner 2"]
    DB["CRDB"]

    Client -->|HTTPS| LB
    LB --> C1
    LB --> C2
    C1 --> DB
    C2 --> DB
    C1 -->|HTTP| R1
    C1 -->|HTTP| R2
    C2 -->|HTTP| R1
    C2 -->|HTTP| R2
```

Coordinators are stateless — the runner registry and session routing state live in the database. Any coordinator can route to any runner. SSE reconnection handles coordinator failover transparently. See [Scaling Architecture](./scaling) for details.

## Components

| Package | Description |
|---------|-------------|
| `@ash-ai/shared` | Types, protocol definitions, constants. No runtime dependencies. |
| `@ash-ai/sandbox` | SandboxManager, SandboxPool, BridgeClient, resource limits, state persistence. Used by both server and runner. |
| `@ash-ai/bridge` | Runs inside each sandbox process. Receives commands over Unix socket, calls the Claude Code SDK, streams responses back. |
| `@ash-ai/server` | Fastify REST API. Agent registry, session routing, SSE streaming, database access. |
| `@ash-ai/runner` | Worker node for multi-machine deployments. Manages sandboxes on a remote host, registers with the server. |
| `@ash-ai/sdk` | TypeScript client library for the Ash API. |
| `@ash-ai/cli` | `ash` command-line tool. Server lifecycle, agent deployment, session management. |

## Message Hot Path

Every message traverses this path. Ash's goal is to add no more than 1-3ms of overhead on top of the SDK.
```mermaid
sequenceDiagram
    participant C as Client
    participant S as Server (Fastify)
    participant P as Pool
    participant B as Bridge
    participant SDK as Claude Code SDK

    C->>S: POST /api/sessions/:id/messages
    S->>S: Session lookup (DB)
    S->>P: markRunning(sandboxId)
    S->>B: query command (Unix socket)
    B->>SDK: sdk.query(prompt)
    SDK-->>B: Message stream
    B-->>S: message events (Unix socket)
    S-->>C: SSE stream (event: message)
    S->>P: markWaiting(sandboxId)
    S-->>C: event: done
```

## Package Dependency Graph

```mermaid
graph TD
    shared["@ash-ai/shared"]
    sandbox["@ash-ai/sandbox"]
    bridge["@ash-ai/bridge"]
    server["@ash-ai/server"]
    runner["@ash-ai/runner"]
    sdk["@ash-ai/sdk"]
    cli["@ash-ai/cli"]

    sandbox --> shared
    bridge --> shared
    server --> shared
    server --> sandbox
    runner --> shared
    runner --> sandbox
    sdk --> shared
    cli --> shared
```

## Storage Layout

```
data/
  ash.db                  # SQLite database (agents, sessions, sandboxes, messages, events)
  sandboxes/
    <sandboxId>/
      workspace/          # Agent workspace (CLAUDE.md, files, etc.)
  sessions/
    <sessionId>/
      workspace/          # Persisted workspace snapshot (for cold resume)
```

In Postgres/CRDB mode, `ash.db` is replaced by the remote database. The `sandboxes/` and `sessions/` directories remain on the local filesystem.

---

# Sandbox Isolation

Source: https://docs.ash-cloud.ai/architecture/sandbox-isolation

# Sandbox Isolation

Ash treats agent code as untrusted. Each agent session runs inside an isolated sandbox process with restricted access to the host system.

## Security Model

The agent inside the sandbox can execute arbitrary shell commands (that is how the Claude Code SDK works).
The sandbox must prevent:

- Reading host environment variables (credentials, secrets)
- Writing outside the workspace directory
- Consuming unbounded host resources (memory, CPU, disk)
- Interfering with other sandboxes or the host process

## Isolation Layers

| Layer | Linux | macOS (dev) |
|-------|-------|-------------|
| **Process limits** | cgroups v2 | ulimit |
| **Memory limit** | cgroup `memory.max` (default 2048 MB) | ulimit (best-effort) |
| **CPU limit** | cgroup `cpu.max` (default 100% = 1 core) | Not enforced |
| **Disk limit** | Periodic check, kill on exceed (default 1024 MB) | Periodic check, kill on exceed |
| **Max processes** | cgroup `pids.max` (default 64, fork bomb protection) | ulimit |
| **Environment** | Strict allowlist | Strict allowlist |
| **Filesystem** | bubblewrap (bwrap) read-only root, writable workspace | Restricted cwd only |
| **Network** | Network namespace (configurable) | Unrestricted |

Resource limit defaults are defined in `@ash-ai/shared`:

```typescript
const DEFAULT_SANDBOX_LIMITS = {
  memoryMb: 2048,    // Max RSS in MB
  cpuPercent: 100,   // 100 = 1 core
  diskMb: 1024,      // Max workspace size in MB
  maxProcesses: 64,  // Fork bomb protection
};
```

## Environment Variable Allowlist

The sandbox process receives **only** these environment variables. Everything else is blocked.
### Passed through from host (if set)

| Variable | Purpose |
|----------|---------|
| `PATH` | Standard path |
| `NODE_PATH` | Node.js module resolution |
| `HOME` | Home directory (set to workspace dir) |
| `LANG` | Locale |
| `TERM` | Terminal type |
| `ANTHROPIC_API_KEY` | Required for Claude Code SDK |
| `ASH_DEBUG_TIMING` | Enable timing instrumentation |

### Injected by Ash

| Variable | Purpose |
|----------|---------|
| `ASH_BRIDGE_SOCKET` | Path to the Unix socket for bridge communication |
| `ASH_AGENT_DIR` | Original agent directory path |
| `ASH_WORKSPACE_DIR` | Writable workspace directory for this session |
| `ASH_SANDBOX_ID` | Unique sandbox identifier |
| `ASH_SESSION_ID` | Session identifier |

### Everything else: blocked

The sandbox does not inherit `process.env`. Variables like `AWS_SECRET_ACCESS_KEY`, `DATABASE_URL`, `GITHUB_TOKEN`, or any other host secret are never visible inside the sandbox.

```typescript
// From sandbox/manager.ts -- allowlist enforcement
const env: Record<string, string> = {};
for (const key of SANDBOX_ENV_ALLOWLIST) {
  if (process.env[key]) {
    env[key] = process.env[key]!;
  }
}
// Only these vars + injected ASH_* vars are passed to the child process
```

## OOM Detection

When a sandbox process is killed by the kernel's OOM killer (exit code 137 or SIGKILL), Ash detects this and automatically pauses the session. The session can be resumed later with a fresh sandbox.

## Disk Monitoring

A periodic check (every 30 seconds) measures the workspace directory size. If it exceeds `diskMb`, the sandbox is killed immediately.

---

# Bridge Protocol

Source: https://docs.ash-cloud.ai/architecture/bridge-protocol

# Bridge Protocol

The bridge process runs inside each sandbox and communicates with the host server over a Unix domain socket using newline-delimited JSON.
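Newline-delimited JSON framing is easy to implement in any language. A Python sketch of the decode side (illustrative, not the actual bridge code) that also keeps the trailing partial line a real socket reader must buffer between reads:

```python
import json

def decode_ndjson(buffer: str):
    """Split a socket read buffer into complete JSON messages (sketch).

    Returns (messages, remainder): the trailing partial line is kept for
    the next read, since a read() may end mid-message.
    """
    *complete, remainder = buffer.split("\n")
    messages = [json.loads(line) for line in complete if line.strip()]
    return messages, remainder

# A read that ends mid-way through a third message
messages, rest = decode_ndjson('{"ev":"ready"}\n{"ev":"message","data":{}}\n{"ev":"do')
```

The remainder (`rest` here) would be prepended to the next chunk read from the socket.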
## Why Unix Sockets

- Lower overhead than TCP (no network stack, no port allocation)
- No port conflicts when running multiple sandboxes
- Natural 1:1 mapping between socket file and sandbox process
- Socket paths include the sandbox ID for easy identification

Socket path format: `/tmp/ash-<sandboxId>.sock`

## Wire Format

Each message is a single JSON object followed by a newline character (`\n`). Both directions use the same encoding:

```typescript
function encode(msg: BridgeCommand | BridgeEvent): string {
  return JSON.stringify(msg) + '\n';
}

function decode(line: string): BridgeCommand | BridgeEvent {
  return JSON.parse(line.trim());
}
```

These functions are exported from `@ash-ai/shared` as `encode` and `decode`.

## Commands (Server to Bridge)

Commands are sent from the server (or runner) to the bridge process inside the sandbox.

| Command | Fields | Description |
|---------|--------|-------------|
| `query` | `cmd`, `prompt`, `sessionId`, `includePartialMessages?` | Send a message to the agent. The bridge calls the Claude Code SDK and streams responses back. |
| `resume` | `cmd`, `sessionId` | Resume a conversation with the SDK's session resume capability. |
| `interrupt` | `cmd` | Interrupt the current agent turn. |
| `shutdown` | `cmd` | Gracefully shut down the bridge process. |

### Command type definitions

```typescript
interface QueryCommand {
  cmd: 'query';
  prompt: string;
  sessionId: string;
  includePartialMessages?: boolean;
}

interface ResumeCommand {
  cmd: 'resume';
  sessionId: string;
}

interface InterruptCommand {
  cmd: 'interrupt';
}

interface ShutdownCommand {
  cmd: 'shutdown';
}
```

## Events (Bridge to Server)

Events are sent from the bridge process back to the server.

| Event | Fields | Description |
|-------|--------|-------------|
| `ready` | `ev` | Bridge is initialized and ready to accept commands. Sent once on startup. |
| `message` | `ev`, `data` | A raw SDK `Message` object. The `data` field contains the unmodified message from `@anthropic-ai/claude-code`. |
| `error` | `ev`, `error` | An error occurred during processing. |
| `done` | `ev`, `sessionId` | The agent's turn is complete. |

### Event type definitions

```typescript
interface ReadyEvent {
  ev: 'ready';
}

interface MessageEvent {
  ev: 'message';
  data: unknown; // Raw SDK Message -- passthrough, not translated
}

interface ErrorEvent {
  ev: 'error';
  error: string;
}

interface DoneEvent {
  ev: 'done';
  sessionId: string;
}
```

## SDK Message Passthrough

The `message` event's `data` field contains the raw SDK `Message` object exactly as returned by `@anthropic-ai/claude-code`. Ash does not translate, wrap, or modify these messages. This is a deliberate design decision ([ADR 0001](/architecture/decisions#adr-0001-sdk-passthrough-types)). The benefits:

- One type system instead of three (no bridge-specific or SSE-specific message types)
- SDK changes propagate automatically through the entire pipeline
- Clients can use SDK types directly for type-safe message handling
- Less code to maintain

The `data.type` field indicates the SDK message kind: `assistant`, `user`, `result`, `stream_event`, etc.

## Connection Lifecycle

```mermaid
sequenceDiagram
    participant S as Server
    participant B as Bridge

    Note over S: spawn bridge process
    B->>S: ready
    S->>B: query (prompt, sessionId)
    B-->>S: message (SDK Message)
    B-->>S: message (SDK Message)
    B-->>S: done (sessionId)
    Note over S,B: ... more commands ...
    S->>B: shutdown
    Note over B: process exits
```

The bridge sends `ready` immediately after initializing the Unix socket listener. The server waits for this event before sending any commands (with a 10-second timeout).

---

# Session Lifecycle

Source: https://docs.ash-cloud.ai/architecture/session-lifecycle

# Session Lifecycle

A session represents an ongoing conversation between a client and an agent running inside a sandbox.
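The allowed status changes form a small state machine. As a rough Python sketch of a validity check (statuses and transitions as documented on this page):

```python
# Allowed session-status transitions, mirroring the lifecycle documented here.
TRANSITIONS = {
    ("starting", "active"),  # sandbox ready
    ("active", "paused"),    # pause
    ("active", "ended"),     # end
    ("active", "error"),     # sandbox crash / OOM
    ("paused", "active"),    # resume (warm or cold)
    ("error", "active"),     # resume (always cold)
}

def can_transition(current: str, target: str) -> bool:
    """Check whether a session status change is legal (sketch)."""
    return (current, target) in TRANSITIONS
```

Note that `ended` appears in no left-hand position: it is terminal, which is why resuming an ended session returns 410 Gone.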
## State Machine

```mermaid
stateDiagram-v2
    [*] --> starting: POST /api/sessions
    starting --> active: Sandbox ready
    active --> paused: POST .../pause
    active --> ended: DELETE .../:id
    active --> error: Sandbox crash / OOM
    paused --> active: POST .../resume (warm or cold)
    error --> active: POST .../resume (cold)
    ended --> [*]
```

## States

| Status | Description |
|--------|-------------|
| `starting` | Session created, sandbox being allocated. Transient -- transitions to `active` within seconds. |
| `active` | Sandbox is alive and ready to accept messages. |
| `paused` | Session is paused. Workspace state is persisted. The sandbox may still be alive (warm) or evicted (cold). |
| `error` | An error occurred (sandbox crash, OOM kill). Resumable -- a new sandbox will be created on resume. |
| `ended` | Session is permanently ended. The sandbox is destroyed. Cannot be resumed. |

## State Transitions

| Transition | Trigger | What happens |
|-----------|---------|-------------|
| starting -> active | Sandbox process starts, bridge sends `ready` | Session is ready to accept messages |
| active -> paused | `POST /api/sessions/:id/pause` | Workspace state persisted, session marked paused |
| active -> ended | `DELETE /api/sessions/:id` | Workspace persisted, sandbox destroyed, session marked ended |
| active -> error | Sandbox crash or OOM kill | Session marked as error, available for resume |
| paused -> active | `POST /api/sessions/:id/resume` | Warm path (instant) or cold path (new sandbox) |
| error -> active | `POST /api/sessions/:id/resume` | Always cold path (new sandbox created) |

## Pause Flow

When a session is paused:

1. Server persists workspace state to `data/sessions/<sessionId>/workspace/`
2. If cloud storage is configured (`ASH_SNAPSHOT_URL`), workspace is synced to S3/GCS
3. Session status is updated to `paused` in the database
4. The sandbox process remains alive (for potential fast resume)

## Resume Flow

Resume follows a decision tree to minimize latency:

```mermaid
flowchart TD
    A[POST .../resume] --> B{Session status?}
    B -->|ended| C[410 Gone]
    B -->|active| D[Return session as-is]
    B -->|paused / error| E{Same runner available?}
    E -->|yes| F{Sandbox alive?}
    F -->|yes| G["Warm path<br/>(instant resume)"]
    F -->|no| H["Cold path"]
    E -->|no| H
    H --> I{Local workspace exists?}
    I -->|yes| J["Create sandbox<br/>with existing workspace"]
    I -->|no| K{Persisted snapshot?}
    K -->|yes| L[Restore from local snapshot]
    K -->|no| M{Cloud snapshot?}
    M -->|yes| N[Restore from S3/GCS]
    M -->|no| O[Create fresh sandbox]
    L --> J
    N --> J
    O --> J
```

### Warm path

If the original sandbox process is still alive (the session was paused but not evicted), resume is instant. No data is copied, no process is started. The session status is simply updated to `active`.

### Cold path

If the sandbox was evicted or crashed, a new sandbox is created:

1. Check for workspace on local disk (`data/sandboxes/<sandboxId>/workspace/`) → **source: local**
2. If not found, check for persisted snapshot (`data/sessions/<sessionId>/workspace/`) → **source: local**
3. If not found, try restoring from cloud storage (`ASH_SNAPSHOT_URL`) → **source: cloud**
4. If no backup exists, create from fresh agent definition → **source: fresh**
5. Create a new sandbox, reusing the restored workspace if available
6. Update session with new sandbox ID and runner ID

The resume source is tracked in metrics (`ash_resume_cold_total{source="..."}`) so you can monitor how often each path is hit. See [State Persistence & Restore](./state-persistence.md) for the full storage architecture.

## Cloud Persistence

When `ASH_SNAPSHOT_URL` is set to an S3 or GCS URL, workspace snapshots are automatically synced to cloud storage after each completed agent turn and before eviction. This enables resume across server restarts and machine migrations.

## Cold Cleanup

Cold sandbox entries (no live process) are automatically cleaned up after 2 hours of inactivity. Local workspace files and database records are deleted, but **cloud snapshots are preserved** — so sessions can still be resumed from cloud storage after local cleanup. See [Sandbox Pool](./sandbox-pool.md#cold-cleanup) for details.

---

# Sandbox Pool

Source: https://docs.ash-cloud.ai/architecture/sandbox-pool

# Sandbox Pool

The `SandboxPool` manages the lifecycle of all sandboxes in a server or runner process.
It enforces capacity limits, handles eviction, and runs periodic idle sweeps.

## State Machine

Each sandbox transitions through these states:

```mermaid
stateDiagram-v2
    [*] --> cold: Server restart
    cold --> warming: Create requested
    warming --> warm: Bridge ready
    warm --> running: Message received
    running --> waiting: Turn complete
    waiting --> running: Next message
    waiting --> cold: Idle sweep / eviction
```

## States

| State | Process alive? | Description |
|-------|---------------|-------------|
| `cold` | No | Database record only. Process was evicted or server restarted. Workspace may be persisted for later restore. |
| `warming` | Starting | Sandbox process is being created. Bridge not yet ready. |
| `warm` | Yes | Bridge process is alive and connected. Ready to accept its first command. |
| `waiting` | Yes | Between messages. Sandbox is idle, waiting for the next command. Eligible for idle eviction. |
| `running` | Yes | Actively processing a message. Never evicted. |

## Eviction

When a new sandbox needs to be created but the pool is at capacity (`ASH_MAX_SANDBOXES`), eviction kicks in. Candidates are selected in priority order:

| Tier | State | Action |
|------|-------|--------|
| 1 | `cold` | Delete persisted state and database record. No process to kill. |
| 2 | `warm` | Kill the sandbox process. Delete database record. |
| 3 | `waiting` | Persist workspace state, kill the sandbox process, mark as `cold`. The session is paused so it can be resumed later. |
| 4 | `running` | Never evicted. If all sandboxes are running, the create request returns 503. |

Within each tier, the least-recently-used sandbox is evicted first (ordered by `last_used_at`).
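In code, candidate selection amounts to a sort by `(tier, last_used_at)`. A toy Python sketch of the same rule (not the actual pool implementation):

```python
# Tier order mirrors the eviction table: cold first; running is never evicted.
TIER = {"cold": 0, "warm": 1, "waiting": 2}

def pick_eviction_candidate(sandboxes):
    """Return the best sandbox to evict, or None if all are running (sketch)."""
    candidates = [s for s in sandboxes if s["state"] in TIER]
    if not candidates:
        return None  # all running -> caller returns 503
    return min(candidates, key=lambda s: (TIER[s["state"]], s["last_used_at"]))

victim = pick_eviction_candidate([
    {"id": "a", "state": "running", "last_used_at": 10},
    {"id": "b", "state": "waiting", "last_used_at": 5},
    {"id": "c", "state": "waiting", "last_used_at": 50},
    {"id": "d", "state": "warm", "last_used_at": 99},
])
# "d" wins: a warm sandbox outranks waiting ones regardless of recency
```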
### Eviction query ```sql SELECT * FROM sandboxes WHERE state IN ('cold', 'warm', 'waiting') ORDER BY CASE state WHEN 'cold' THEN 0 WHEN 'warm' THEN 1 WHEN 'waiting' THEN 2 END, last_used_at ASC LIMIT 1 ``` ## Idle Sweep A periodic timer (every 60 seconds) checks for sandboxes in the `waiting` state that have been idle longer than `ASH_IDLE_TIMEOUT_MS` (default: 30 minutes). Idle sandboxes are evicted: workspace is persisted, the process is killed, and the database record is marked `cold`. The associated session is paused so it can be resumed later. ```typescript pool.startIdleSweep(); // Start the periodic timer pool.stopIdleSweep(); // Stop the timer (graceful shutdown) ``` ## Cold Cleanup A separate periodic timer (every 5 minutes) removes cold sandbox entries that haven't been used for 2 hours. This prevents unbounded disk growth from accumulated cold entries. Cold cleanup deletes: - The live workspace directory (`data/sandboxes//`) - The local snapshot directory (`data/sessions//workspace/`) - The database record **Cloud snapshots are preserved**, so sessions can still be resumed from cloud storage after local cleanup. See [State Persistence & Restore](./state-persistence.md) for the full restore fallback chain. ```typescript pool.startColdCleanup(); // Start the periodic timer pool.stopColdCleanup(); // Stop the timer (graceful shutdown) ``` ## Configuration | Environment Variable | Default | Description | |---------------------|---------|-------------| | `ASH_MAX_SANDBOXES` | `1000` | Maximum number of sandbox entries (live + cold) in the database | | `ASH_IDLE_TIMEOUT_MS` | `1800000` (30 min) | How long a `waiting` sandbox can be idle before eviction | | `COLD_CLEANUP_TTL_MS` | `7200000` (2 hr) | How long a `cold` sandbox sits before local files are cleaned up | ## Race Condition Safety `markRunning()` is synchronous (updates the in-memory map immediately). 
This prevents a race where an idle sweep could evict a sandbox between when a message arrives and when the sandbox starts processing it. ```typescript // In the message handler -- synchronous, prevents eviction backend.markRunning(session.sandboxId); ``` The database update is fire-and-forget (asynchronous) since the in-memory map is the source of truth for the running state. ## Server Restart On server startup, `pool.init()` calls `markAllSandboxesCold()`, which updates all sandbox records in the database to `cold`. This is correct because: - All sandbox processes were killed when the server stopped - Cold entries can be evicted or used for workspace restoration during resume - The in-memory live map starts empty ```typescript const marked = await this.db.markAllSandboxesCold(); // "Startup: marked 5 stale sandbox(es) as cold" ``` ## Pool Stats The pool exposes statistics for the health endpoint and Prometheus metrics: ```typescript const stats = await pool.statsAsync(); // { // total: 10, // All entries (live + cold) // cold: 3, // No process // warming: 0, // Starting up // warm: 2, // Ready, no session // waiting: 3, // Idle between messages // running: 2, // Processing a message // maxCapacity: 1000, // resumeWarmHits: 15, // Resumes that found sandbox alive // resumeColdHits: 5, // Resumes that needed new sandbox (total) // resumeColdLocalHits: 3, // Cold resume from local disk // resumeColdCloudHits: 1, // Cold resume from cloud storage // resumeColdFreshHits: 1, // Cold resume with no state available // preWarmHits: 2, // Sessions that claimed a pre-warmed sandbox // } ``` The cold resume counters break down where the workspace came from during a cold resume. See [State Persistence & Restore](./state-persistence.md) for details. 
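The resume counters lend themselves to a simple derived metric. A hypothetical dashboard helper (not part of Ash's API) might compute the warm-resume hit rate:

```typescript
// Hypothetical helper -- fraction of resumes that found the sandbox still
// alive. Field names match the statsAsync() shape shown above.
interface ResumeCounters {
  resumeWarmHits: number;
  resumeColdHits: number; // total cold resumes (local + cloud + fresh)
}

function warmResumeRate({ resumeWarmHits, resumeColdHits }: ResumeCounters): number {
  const total = resumeWarmHits + resumeColdHits;
  return total === 0 ? 1 : resumeWarmHits / total;
}

// With the example stats above: 15 / (15 + 5) = 0.75
```

A persistently low rate suggests sandboxes are being evicted before their sessions resume; raising `ASH_IDLE_TIMEOUT_MS` or pool capacity may help.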
--- # SSE Backpressure Source: https://docs.ash-cloud.ai/architecture/sse-backpressure # SSE Backpressure ## Problem When a fast agent produces messages faster than a slow client can consume them, the server-side write buffer grows without bound. With many concurrent sessions, this leads to unbounded memory usage and eventual out-of-memory crashes. ``` Agent (fast) --> Bridge --> Server --> SSE --> Client (slow) ^^^^^^^^^^ Buffer grows here ``` ## Solution Ash respects backpressure at every boundary in the pipeline. When the downstream consumer cannot accept data, the upstream producer pauses. ### Bridge Side The bridge's `send()` function checks the return value of `socket.write()`. If the kernel buffer is full, it waits for the `drain` event before sending more data. This prevents the bridge from flooding the Unix socket. ### Server Side The `writeSSE()` function in the session routes checks if `response.write()` returns `false` (indicating the TCP send buffer is full). If so, it waits for the `drain` event with a 30-second timeout. ```typescript async function writeSSE(raw: ServerResponse, frame: string): Promise<void> { const canWrite = raw.write(frame); if (!canWrite) { const drained = await Promise.race([ new Promise<boolean>((resolve) => { raw.once('drain', () => resolve(true)); }), new Promise<boolean>((resolve) => { setTimeout(() => resolve(false), SSE_WRITE_TIMEOUT_MS); }), ]); if (!drained) { throw new Error('Client write timeout -- closing stream'); } } } ``` If the client does not drain within the timeout, the stream is closed. This prevents a single slow client from holding a sandbox in the `running` state indefinitely. ## Full Pipeline ```mermaid graph LR SDK["Claude SDK"] -->|Messages| Bridge Bridge -->|Unix Socket
await drain| Server Server -->|SSE
await drain| Client style Bridge fill:#f0f0f0 style Server fill:#f0f0f0 ``` At each arrow, the sender checks backpressure before writing. If the receiver is slow, the sender pauses. The pause propagates upstream through the entire pipeline. ## Memory Bound Memory per connection is bounded by the kernel's TCP send buffer size (typically 128 KB - 1 MB depending on OS configuration) plus one pending SSE frame. There is no application-level buffering. ## Configuration | Constant | Value | Description | |----------|-------|-------------| | `SSE_WRITE_TIMEOUT_MS` | 30,000 ms | Maximum time to wait for a slow client to drain before closing the connection | This value is defined in `@ash-ai/shared` and used by the server's SSE writer. --- # Database Source: https://docs.ash-cloud.ai/architecture/database # Database Ash supports two database backends behind a common interface: SQLite (default) for single-machine deployments and PostgreSQL/CockroachDB for multi-machine setups. ## Configuration | Environment Variable | Default | Description | |---------------------|---------|-------------| | `ASH_DATABASE_URL` | Not set (uses SQLite) | PostgreSQL or CockroachDB connection URL | When `ASH_DATABASE_URL` is not set, Ash creates a SQLite database at `data/ash.db`. When set to a `postgresql://` or `postgres://` URL, Ash connects to the specified Postgres-compatible database. 
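Backend selection hinges on the URL scheme. A minimal sketch of that check (illustrative; both `postgres://` and `postgresql://` select Postgres, anything else falls back to SQLite):

```typescript
// Sketch of the scheme check described above (illustrative helper name).
function usesPostgres(databaseUrl?: string): boolean {
  return databaseUrl !== undefined && /^postgres(ql)?:\/\//.test(databaseUrl);
}

usesPostgres('postgresql://user:pw@db:5432/ash'); // true
usesPostgres('postgres://db/ash');                // true
usesPostgres(undefined);                          // false -> SQLite at data/ash.db
```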
## Backend Selection The `initDb()` factory function selects the backend based on the URL: ```typescript export async function initDb(opts: { dataDir: string; databaseUrl?: string }): Promise<Db> { if (opts.databaseUrl && /^postgres(ql)?:\/\//.test(opts.databaseUrl)) { const pgDb = new PgDb(opts.databaseUrl); await pgDb.init(); return pgDb; } else { return new SqliteDb(opts.dataDir); } } ``` ## Common Interface Both backends implement the same `Db` interface: ```typescript interface Db { // Agents upsertAgent(name, path, tenantId?): Promise; getAgent(name, tenantId?): Promise; listAgents(tenantId?): Promise; deleteAgent(name, tenantId?): Promise; // Sessions insertSession(id, agentName, sandboxId, tenantId?, runnerId?, model?): Promise; updateSessionStatus(id, status): Promise; getSession(id): Promise; listSessions(tenantId?, agent?): Promise; touchSession(id): Promise; // ... plus updateSessionSandbox, updateSessionRunner, listSessionsByRunner // Sandboxes insertSandbox(id, agentName, workspaceDir, sessionId?, tenantId?): Promise; updateSandboxState(id, state): Promise; getSandbox(id): Promise; countSandboxes(): Promise; getBestEvictionCandidate(): Promise; getIdleSandboxes(olderThan): Promise; markAllSandboxesCold(): Promise; // ... plus updateSandboxSession, touchSandbox, deleteSandbox // Messages insertMessage(sessionId, role, content, tenantId?): Promise; listMessages(sessionId, tenantId?, opts?): Promise; // Session Events insertSessionEvent(sessionId, type, data, tenantId?): Promise; insertSessionEvents(events): Promise; listSessionEvents(sessionId, tenantId?, opts?): Promise; // API Keys getApiKeyByHash(keyHash): Promise; insertApiKey(id, tenantId, keyHash, label): Promise; // Lifecycle close(): Promise; } ``` ## SQL Dialect Differences | Feature | SQLite | Postgres | |---------|--------|----------| | Timestamps | `datetime('now')` | `now()::TEXT` | | Upsert | `ON CONFLICT(...) DO UPDATE` | `ON CONFLICT(...)
DO UPDATE` | | Parameters | `?` positional | `$1`, `$2` numbered | | Connection model | Single file, in-process | Connection pool (`pg.Pool`) | | Journal mode | WAL | WAL (default in Postgres) | | Column migration | `try/catch` (no `IF NOT EXISTS`) | `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` | | Sequence assignment | `SELECT MAX(sequence)` in transaction | Atomic subquery in `INSERT ... RETURNING` | ## Connection Retry (Postgres) The Postgres backend retries the initial connection with exponential backoff (1s, 2s, 4s, 8s, 16s -- five attempts total, ~31 seconds). This handles common startup races where the database container is not yet ready. ``` [db] Connection attempt 1 failed, retrying in 1000ms... [db] Connection attempt 2 failed, retrying in 2000ms... ``` ## Tables ### agents ```sql CREATE TABLE agents ( id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL DEFAULT 'default', name TEXT NOT NULL, version INTEGER NOT NULL DEFAULT 1, path TEXT NOT NULL, created_at TEXT NOT NULL, updated_at TEXT NOT NULL, UNIQUE(tenant_id, name) ); ``` ### sessions ```sql CREATE TABLE sessions ( id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL DEFAULT 'default', agent_name TEXT NOT NULL, sandbox_id TEXT NOT NULL, status TEXT NOT NULL DEFAULT 'starting', runner_id TEXT, model TEXT, created_at TEXT NOT NULL, last_active_at TEXT NOT NULL ); ``` ### sandboxes ```sql CREATE TABLE sandboxes ( id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL DEFAULT 'default', session_id TEXT, agent_name TEXT NOT NULL, state TEXT NOT NULL DEFAULT 'warming', workspace_dir TEXT NOT NULL, created_at TEXT NOT NULL, last_used_at TEXT NOT NULL ); ``` ### messages ```sql CREATE TABLE messages ( id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL DEFAULT 'default', session_id TEXT NOT NULL, role TEXT NOT NULL, content TEXT NOT NULL, sequence INTEGER NOT NULL, created_at TEXT NOT NULL, UNIQUE(tenant_id, session_id, sequence) ); ``` ### session_events ```sql CREATE TABLE session_events ( id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL 
DEFAULT 'default', session_id TEXT NOT NULL, type TEXT NOT NULL, data TEXT, sequence INTEGER NOT NULL, created_at TEXT NOT NULL, UNIQUE(tenant_id, session_id, sequence) ); ``` ### api_keys ```sql CREATE TABLE api_keys ( id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL, key_hash TEXT NOT NULL UNIQUE, label TEXT NOT NULL DEFAULT '', created_at TEXT NOT NULL ); ``` ## Production Recommendation For single-machine deployments, SQLite with WAL mode is sufficient and requires no external dependencies. For multi-machine deployments (coordinator + runners sharing state), use PostgreSQL or CockroachDB so all nodes share the same database. --- # Scaling Architecture Source: https://docs.ash-cloud.ai/architecture/scaling # Scaling Architecture Ash scales horizontally in two dimensions: the **data plane** (runners that host sandboxes) and the **control plane** (coordinators that route requests). Each dimension scales independently. ## Three Operating Modes ```mermaid graph TB subgraph "Mode 1: Standalone" direction LR C1["Client"] -->|HTTP + SSE| S1["Ash Server
:4100"] S1 --> P1["SandboxPool"] S1 --> DB1["SQLite"] P1 --> B1["Bridge 1"] P1 --> B2["Bridge 2"] end ``` ```mermaid graph TB subgraph "Mode 2: Coordinator + N Runners" direction LR C2["Client"] -->|HTTP + SSE| S2["Coordinator
:4100"] S2 --> DB2["Postgres / CRDB"] S2 -->|HTTP| R1["Runner 1"] S2 -->|HTTP| R2["Runner 2"] S2 -->|HTTP| R3["Runner N"] end ``` ```mermaid graph TB subgraph "Mode 3: N Coordinators + N Runners" direction TB C3["Client"] -->|HTTPS| LB["Load Balancer"] LB --> S3a["Coordinator 1"] LB --> S3b["Coordinator 2"] LB --> S3c["Coordinator M"] S3a & S3b & S3c --> DB3["CRDB"] S3a & S3b & S3c -->|HTTP| R4["Runner 1"] S3a & S3b & S3c -->|HTTP| R5["Runner 2"] S3a & S3b & S3c -->|HTTP| R6["Runner N"] end ``` **Start with Mode 1. Move to Mode 2 when one machine isn't enough. Move to Mode 3 when one coordinator isn't enough or you need redundancy.** ## Session Routing Every session is pinned to a runner. The coordinator selects the runner with the most available capacity at session creation time. ```mermaid sequenceDiagram participant C as Client participant Co as Coordinator participant DB as Database participant R as Runner (selected) C->>Co: POST /api/sessions {agent: "my-agent"} Co->>DB: SELECT best runner (most capacity) DB-->>Co: runner-2 (70 available slots) Co->>R: POST /runner/sandboxes R-->>Co: {sandboxId, workspaceDir} Co->>DB: INSERT session (runner_id = "runner-2") Co-->>C: 201 {session} ``` Once assigned, all subsequent messages for that session route to the same runner: ```mermaid sequenceDiagram participant C as Client participant Co as Coordinator participant DB as Database participant R as Runner (same) C->>Co: POST /api/sessions/:id/messages Co->>DB: SELECT session → runner_id = "runner-2" Co->>R: POST /runner/sandboxes/:id/cmd R-->>Co: SSE stream (bridge events) Co-->>C: SSE stream (proxied) ``` ## Runner Registration and Heartbeat Runners self-register with the control plane and send periodic heartbeats with pool statistics. 
```mermaid sequenceDiagram participant R as Runner participant Co as Coordinator participant DB as Database R->>Co: POST /api/internal/runners/register Co->>DB: UPSERT runners (id, host, port, max) Co-->>R: {ok: true} loop Every 10 seconds R->>Co: POST /api/internal/runners/heartbeat Note right of R: {runnerId, stats: {running: 12, warming: 3, ...}} Co->>DB: UPDATE runners SET active_count, warming_count, last_heartbeat_at Co-->>R: {ok: true} end ``` ## Graceful Runner Shutdown When a runner shuts down cleanly, it deregisters from the coordinator. Sessions are paused immediately — no 30-second wait. ```mermaid sequenceDiagram participant R as Runner participant Co as Coordinator participant DB as Database Note over R: SIGTERM received R->>Co: POST /api/internal/runners/deregister Co->>DB: UPDATE sessions SET status='paused' WHERE runner_id AND status IN ('active','starting') Co->>DB: DELETE FROM runners WHERE id='runner-1' Co-->>R: {ok: true} Note over R: Destroy sandboxes, close server, exit ``` ## Dead Runner Detection If a runner crashes without deregistering, the coordinator sweeps for dead runners every 30 seconds (with random 0-5s jitter to prevent thundering herd across coordinators). Sessions are bulk-paused in a single query. ```mermaid sequenceDiagram participant Co as Coordinator participant DB as Database participant C as Client Note over Co: Liveness sweep (every 30s + jitter) Co->>DB: SELECT runners WHERE last_heartbeat_at <= cutoff DB-->>Co: [runner-3 is stale] Co->>DB: UPDATE sessions SET status='paused' WHERE runner_id='runner-3' AND status IN ('active','starting') Co->>DB: DELETE FROM runners WHERE id='runner-3' Note over C: Client detects disconnect C->>Co: POST /api/sessions/:id/resume Co->>DB: SELECT best healthy runner Note over Co: Cold restore on new runner Co-->>C: 200 {session: {status: 'active'}} ``` ## Multi-Coordinator (Mode 3) In multi-coordinator mode, all coordinators share the same database (Postgres or CockroachDB). 
The runner registry and session state live in the database — coordinators hold no authoritative state in memory. ```mermaid graph TB subgraph "Coordinator 1" Co1["Fastify :4100"] Cache1["Backend Cache
(connection pool)"] Co1 --> Cache1 end subgraph "Coordinator 2" Co2["Fastify :4100"] Cache2["Backend Cache
(connection pool)"] Co2 --> Cache2 end DB[("CRDB
runners table
sessions table")] Co1 --> DB Co2 --> DB Cache1 -->|HTTP| R1["Runner 1"] Cache1 -->|HTTP| R2["Runner 2"] Cache2 -->|HTTP| R1 Cache2 -->|HTTP| R2 R1 -->|Heartbeat| LB["Load Balancer"] R2 -->|Heartbeat| LB LB --> Co1 LB --> Co2 ``` **Key properties:** - Any coordinator can route to any runner (DB is source of truth) - Coordinators don't talk to each other - Each coordinator has a unique ID (`hostname-PID`) reported in `GET /health` and startup logs - Liveness sweep runs on all coordinators independently (idempotent, with random jitter to prevent thundering herd) - SSE reconnection handles coordinator failover (no session migration) ### Coordinator Failover ```mermaid sequenceDiagram participant C as Client participant LB as Load Balancer participant Co1 as Coordinator 1 participant Co2 as Coordinator 2 participant DB as Database participant R as Runner C->>LB: SSE stream (session ABC) LB->>Co1: Forward Co1->>R: Proxy bridge events R-->>Co1: SSE events Co1-->>C: SSE events Note over Co1: Coordinator 1 dies C--xCo1: Connection lost Note over C: SSE auto-reconnects C->>LB: Reconnect LB->>Co2: Route to healthy coordinator Co2->>DB: SELECT session ABC → runner_id Co2->>R: Re-establish proxy R-->>Co2: SSE events resume Co2-->>C: SSE events resume ``` ## Capacity Estimates | Component | Per Instance | Limit | Bottleneck | |-----------|-------------|-------|------------| | Coordinator | ~10,000 SSE connections | Network/CPU | SSE proxy fan-out | | Runner (8 vCPU, 16GB) | 30-120 sessions | Memory | Depends on sandbox memory limit | | Database (CRDB) | ~5,000 queries/sec | Single-node CRDB | Session creation path only | **Scaling math:** - 3 coordinators = ~30,000 concurrent SSE streams - 10 runners (256MB/sandbox) = ~600 concurrent sessions - You'll run out of runner capacity before coordinator capacity ## Database Tables for Scaling ```mermaid erDiagram runners { text id PK text host int port int max_sandboxes int active_count int warming_count text last_heartbeat_at text 
registered_at } sessions { text id PK text agent_name text sandbox_id text status text runner_id FK text created_at text last_active_at } runners ||--o{ sessions : "hosts" ``` ## Environment Variables ### Coordinator | Variable | Default | Description | |----------|---------|-------------| | `ASH_MODE` | `standalone` | Set to `coordinator` for multi-runner mode | | `ASH_DATABASE_URL` | — | Postgres/CRDB connection string (required for multi-coordinator) | | `ASH_PORT` | `4100` | HTTP listen port | | `ASH_INTERNAL_SECRET` | — | Shared secret for runner auth. If set, all `/api/internal/*` endpoints require `Authorization: Bearer `. **Required for multi-machine deployments.** | ### Runner | Variable | Default | Description | |----------|---------|-------------| | `ASH_RUNNER_ID` | `runner-{pid}` | Unique runner identifier | | `ASH_RUNNER_PORT` | `4200` | HTTP listen port | | `ASH_SERVER_URL` | — | Coordinator URL for registration (use LB URL in multi-coordinator mode) | | `ASH_RUNNER_ADVERTISE_HOST` | — | Host reachable from coordinator | | `ASH_MAX_SANDBOXES` | `1000` | Maximum concurrent sandboxes | | `ASH_INTERNAL_SECRET` | — | Must match the coordinator's `ASH_INTERNAL_SECRET` | ## When to Scale | Symptom | Action | |---------|--------| | CPU/memory maxed on single machine | Add runners (Mode 2) | | Need high availability for control plane | Add coordinators (Mode 3) | | SSE connections saturating coordinator | Add coordinators (Mode 3) | | Session creation latency increasing | Add runners or increase `ASH_MAX_SANDBOXES` | | All runners at capacity | Add more runner nodes | Don't scale until you have numbers. A single standalone Ash server handles dozens of concurrent sessions. Use `ASH_DEBUG_TIMING=1` and the `/metrics` endpoint to find the actual bottleneck before adding complexity. --- # Design Decisions Source: https://docs.ash-cloud.ai/architecture/decisions # Design Decisions Architecture Decision Records (ADRs) for significant technical choices in Ash. 
## ADR 0001: SDK Passthrough Types **Date**: 2025-01-15 | **Status**: Accepted **Decision**: Pass Claude Code SDK `Message` objects through the entire pipeline untranslated. The bridge yields raw SDK messages over the Unix socket. The server wraps them in SSE envelopes and streams them to the client. No custom `BridgeEvent` or `SSEEventType` translation layers. **Context**: Ash originally defined three parallel type systems: `BridgeEvent` (7 variants in the bridge), `SSEEventType` (6 values in the server), and a translation layer converting SDK messages to bridge events. Every SDK message was translated twice. **Why**: - One type system instead of three -- less code to maintain - SDK type changes propagate automatically through the pipeline (no manual translation updates) - Clients (CLI, SDK) can use SDK types directly for type-safe message handling - Translation layers do not protect against SDK breaking changes -- they just delay discovery **What Ash owns**: Bridge commands (`query`, `resume`, `interrupt`, `shutdown`), orchestration types (`Session`, `Agent`, `SandboxInfo`, `PoolStats`), and two envelope events (`ready`, `error`). Everything else is SDK passthrough. **Trade-off**: Tighter coupling to the SDK's type shape. If the SDK changes its `Message` type, the wire format changes. This is acceptable because the SDK is the primary dependency -- if it changes, Ash must update regardless. --- ## ADR 0002: HTTP over gRPC for Runner Communication **Date**: 2026-02-18 | **Status**: Accepted **Decision**: Use HTTP + SSE for communication between the server and runner processes instead of gRPC. **Context**: Step 08 of the implementation plan adds runner processes that manage sandboxes on remote hosts. The server needs to communicate with runners for sandbox lifecycle operations and command streaming. **Why**: - **Simplicity**: gRPC adds protobuf schemas, code generation, the `@grpc/grpc-js` dependency, and binary debugging difficulty. 
HTTP uses the same Fastify framework, same patterns, same tools (curl, Swagger, browser). - **No performance bottleneck**: LLM inference takes 2-10 seconds. The HTTP hop from server to runner adds single-digit milliseconds. gRPC would save 1-2ms per request -- irrelevant at this scale. - **Ecosystem alignment**: Runners use the same Fastify framework as the server. Tests use the same patterns. One less technology in the stack. **Alternatives considered**: - **gRPC with bidirectional streaming**: More complex than needed. The command/event flow is naturally request-response with server-push, which SSE handles well. - **WebSocket**: More complex lifecycle management and message framing for the same use case. SSE already handles server-push-only flows. **Trade-off**: If true bidirectional streaming to runners becomes necessary, this decision would need revisiting. This is unlikely because the bridge protocol is inherently request/response. --- # Ash vs ComputeSDK Source: https://docs.ash-cloud.ai/comparisons/computesdk # Ash vs ComputeSDK [ComputeSDK](https://www.computesdk.com/) and Ash solve different but adjacent problems. This page breaks down where they overlap, where they diverge, and when to use each. ## TL;DR - **Ash** is an AI agent platform -- deploy a Claude agent as a folder, get a production REST API with sessions, streaming, sandboxing, and persistence. - **ComputeSDK** is a sandbox abstraction layer -- one API to create isolated compute environments across 8+ cloud providers (E2B, Modal, Railway, etc.). They're complementary, not competitive. ComputeSDK could be a sandbox *provider* that Ash delegates to. 
## Different Problems | | Ash | ComputeSDK | |---|---|---| | **What it is** | Self-hostable system for deploying AI agents | Unified API for generic sandbox compute | | **Core abstraction** | Agent sessions (deploy a CLAUDE.md, chat via REST/SSE) | Sandboxes (create environments, run code/commands) | | **Primary use case** | Host AI agents that persist, resume, and stream | Execute untrusted code, spin up dev environments | | **AI-specific?** | Yes -- thin wrapper around Claude Code SDK | No -- provider-agnostic compute for any workload | | **Infra model** | Self-hosted (your Docker, your machine) | SaaS gateway routing to cloud providers | ## Feature Comparison | Feature | Ash | ComputeSDK | |---|---|---| | **Sandbox isolation** | Bubblewrap, cgroups v2, env allowlist | Provider-dependent | | **Session persistence** | SQLite/Postgres, survives restarts | Stateless by default; named sandboxes for reuse | | **Session resume** | Full context preservation, pause/resume, cross-machine | Not conversation-oriented | | **Streaming** | Native SSE with typed events, backpressure | Request/response for commands | | **Agent definition** | Folder with `CLAUDE.md` -- minimal | N/A -- not agent-oriented | | **Multi-provider** | N/A -- runs your own sandboxes | 8+ providers, swap via env var | | **Overlays/templates** | N/A | Smart overlays with symlinks for fast bootstrap | | **Managed servers** | N/A | Supervised long-lived processes with health checks | | **Filesystem API** | Agent has full workspace inside sandbox | `writeFile`, `readFile`, `mkdir`, etc. 
| | **Shell execution** | Agent runs commands via Claude Code SDK | `runCommand()` API | | **Observability** | Prometheus metrics, structured logs, `/health` | Not documented | | **Multi-machine** | Built-in coordinator + runner architecture | Handled by underlying providers | | **SDKs** | TypeScript + Python | TypeScript | | **CLI** | Full lifecycle (`ash start/deploy/session/health`) | Not documented | | **Self-hostable** | Yes -- Docker, bare metal, or cloud VMs | No -- SaaS gateway required | | **Open source** | Yes | Partially (client SDK open, gateway is SaaS) | ## Architecture Differences ### Ash ``` CLI/SDK ──HTTP──> ash-server ──in-process──> SandboxPool ──unix socket──> Bridge ──> Claude Code SDK (your infra) (bubblewrap) (in sandbox) ``` Ash owns the full stack. Your server, your sandboxes, your data. The server manages sandbox lifecycle directly using OS-level isolation (bubblewrap on Linux, ulimit on macOS). ### ComputeSDK ``` Your code ──HTTP──> ComputeSDK Gateway ──HTTP──> Cloud Provider (E2B / Modal / Railway / ...) (their SaaS) (their infra) ``` ComputeSDK is a routing layer. Your code talks to their gateway, which translates to provider-specific APIs. You don't manage sandboxes -- the provider does. 
## When to Use Each ### Use Ash when you need: - **AI agents that persist** -- sessions that survive restarts, resume days later, hand off between machines - **Full control over infrastructure** -- self-hosted, no external dependencies, data stays on your machines - **Deep sandbox isolation** -- cgroups, bubblewrap, environment allowlists you configure - **Streaming conversations** -- SSE with typed events, backpressure, real-time token streaming - **An agent platform** -- deploy agents as folders, manage via CLI/SDK, monitor with Prometheus ### Use ComputeSDK when you need: - **Generic sandbox compute** -- run arbitrary code, not specifically AI conversations - **Provider flexibility** -- switch between E2B, Modal, Railway without code changes - **Managed infrastructure** -- don't want to run your own servers - **Quick ephemeral environments** -- spin up a sandbox, run a script, tear it down - **Pre-configured templates** -- overlays for fast environment bootstrap ### Use both when: You want Ash's agent orchestration with cloud-hosted sandboxes instead of local ones. A future `SandboxProvider` interface in Ash could delegate sandbox creation to ComputeSDK-supported providers, giving you Ash's session management and streaming with E2B's or Modal's compute. ## Onboarding Comparison ### ComputeSDK -- 3 lines ```typescript const sandbox = await compute.sandbox.create(); const result = await sandbox.runCode('print("Hello World!")'); await sandbox.destroy(); ``` ### Ash -- 4 commands ```bash ash start ash deploy ./my-agent --name my-agent ash session create my-agent ash session send "Hello" ``` ComputeSDK's onboarding is simpler because it solves a simpler problem -- create a sandbox and run code. Ash's extra steps (start server, deploy agent, create session) exist because Ash manages persistent, stateful agent sessions rather than ephemeral compute. 
## Summary Ash and ComputeSDK are in different categories: - **Ash** = AI agent orchestration platform (sessions, streaming, persistence, isolation) - **ComputeSDK** = sandbox compute abstraction (multi-provider, ephemeral, code execution) If you're deploying Claude agents that need production infrastructure, use Ash. If you need generic sandboxed code execution across cloud providers, use ComputeSDK. If you want both, they can complement each other. --- # Ash vs Blaxel Source: https://docs.ash-cloud.ai/comparisons/blaxel # Ash vs Blaxel [Blaxel](https://blaxel.ai) and Ash both provide infrastructure for AI agents, but they make different tradeoffs. This page breaks down where they overlap, where they diverge, and when to use each. ## TL;DR - **Ash** is a self-hostable agent platform -- deploy Claude agents as folders, get production APIs with sessions, streaming, sandboxing, and persistence on your own infrastructure. Sub-millisecond per-message overhead, 44ms cold start, 1.7ms warm resume. - **Blaxel** is a managed cloud platform -- serverless agent hosting, perpetual sandboxes, model gateway, and observability as a service. The core difference: Ash runs on your machines; Blaxel runs on theirs. 
## Different Tradeoffs | | Ash | Blaxel | |---|---|---| | **What it is** | Self-hostable agent orchestration | Managed cloud agent platform | | **Infrastructure model** | Your servers (Docker, EC2, GCE, bare metal) | Their cloud (serverless) | | **Agent definition** | Folder with `CLAUDE.md` | HTTP server (any framework) | | **AI model** | Claude (via Claude Code SDK) | Any model (model gateway) | | **Sandbox model** | OS-level (bubblewrap, cgroups) | MicroVMs | | **Session persistence** | SQLite/Postgres, survives restarts | Snapshot-based | | **Pricing** | Self-hosted (pay for compute + Claude API) | Usage-based SaaS | ## Feature Comparison | Feature | Ash | Blaxel | |---|---|---| | **Agent hosting** | Yes -- deploy folders, get REST API | Yes -- serverless endpoints | | **Sandbox isolation** | Bubblewrap, cgroups v2, env allowlist | MicroVMs (EROFS + tmpfs) | | **Session creation (cold start)** | 44ms p50 (process spawn + bridge connect) | ~25ms (MicroVM resume) | | **Session resume (warm)** | 1.7ms p50 (DB lookup + status flip) | ~25ms (MicroVM resume) | | **Per-message overhead** | 0.41ms p50 (sub-millisecond) | Not published | | **Session persistence** | SQLite/Postgres, pause/resume | Snapshot-based, scale-to-zero | | **Streaming** | Native SSE with typed events, backpressure | Framework-dependent | | **Model support** | Claude (deep SDK integration) | Multi-model (gateway routing) | | **Observability** | Prometheus metrics, structured logs, `/health` | Built-in logs, traces, metrics | | **MCP servers** | Per-agent and per-session MCP config | Hosted MCP servers | | **Batch jobs** | Not built-in | Yes -- async compute | | **Multi-machine** | Built-in coordinator + runner | Managed by platform | | **SDKs** | TypeScript + Python | TypeScript + Python | | **CLI** | Full lifecycle management | Yes | | **Self-hostable** | Yes | No | | **Open source** | Yes | No | | **Data residency** | Full control (your machines) | Their cloud | ## Architecture Differences ### 
Ash ``` CLI/SDK ──HTTP──> Ash Server ──in-process──> SandboxPool ──unix socket──> Bridge ──> Claude Code SDK (your infra) (bubblewrap) (in sandbox) ``` Ash owns the full stack. Your server, your sandboxes, your data. The server manages sandbox lifecycle directly using OS-level isolation. ### Blaxel ``` Your App ──HTTP──> Blaxel Cloud ──> Agent Endpoint (serverless) ──> Model Gateway ──> LLM Provider (their infra) (MicroVM sandbox) ``` Blaxel is a managed platform. You deploy agents to their cloud, which handles scaling, sandboxing, routing, and observability. ## When to Use Each ### Use Ash when: - **You need infrastructure control** -- data must stay on your machines, compliance requirements, air-gapped environments - **You're building with Claude** -- Ash's deep Claude Code SDK integration gives you the full power of the SDK (sessions, tools, MCP, skills) with zero translation layer - **Sessions must persist across restarts** -- Ash's SQLite/Postgres persistence survives crashes, supports pause/resume, and enables multi-day sessions - **You want self-hosted, open source** -- inspect the code, modify the behavior, no vendor lock-in ### Use Blaxel when: - **You want managed infrastructure** -- don't want to run your own servers, prefer pay-per-use - **You use multiple LLM providers** -- Blaxel's model gateway routes between providers with fallback and telemetry - **You want built-in observability** -- logs, traces, and metrics without setting up Prometheus or Grafana - **Framework flexibility matters** -- Blaxel hosts any HTTP server, not just Claude agents ### Use Ash if you're unsure: Self-hosted means you can migrate away at any time. You're not locked into a platform. Start with Ash, and if you later need managed infrastructure, the migration path is straightforward since your agents are just folders. 
## Onboarding Comparison ### Ash -- 3 commands ```bash ash start ash deploy ./my-agent --name my-agent ash chat my-agent "Hello" ``` ### Blaxel -- framework setup + deploy ```bash bl login bl init my-agent # ... write HTTP server code ... bl deploy bl run my-agent --data '{"inputs": "Hello"}' ``` Ash's agent definition is simpler (a folder with `CLAUDE.md`) because it targets a specific SDK. Blaxel requires writing an HTTP server because it supports any framework. ## Performance Ash publishes [real benchmarks](/guides/monitoring). Here's how the numbers compare: | Metric | Ash (measured) | Blaxel (claimed) | |---|---|---| | **Session creation** | 44ms p50 | ~25ms (MicroVM resume) | | **Warm resume** | 1.7ms p50 | ~25ms (MicroVM resume) | | **Cold resume** | 32ms p50 | Not published | | **Per-message overhead** | 0.41ms p50 | Not published | | **Pool operations** | 0.03ms p50 | Not published | Blaxel's 25ms number is for MicroVM resume from a snapshot. Ash's 1.7ms warm resume is actually faster because it's just a DB lookup + status flip -- the sandbox process is still alive. For cold starts (new session creation), Ash's 44ms and Blaxel's ~25ms are in the same ballpark. In both cases, the real latency users feel is dominated by the LLM API response time (~1-3 seconds), not the platform overhead. ## Summary | Dimension | Ash | Blaxel | |---|---|---| | **Control** | Full (self-hosted, open source) | Managed (their cloud) | | **Simplicity** | Agent = folder with `CLAUDE.md` | Agent = HTTP server | | **AI model** | Claude (deep integration) | Any model (gateway) | | **Session creation** | 44ms p50 | ~25ms (claimed) | | **Warm resume** | 1.7ms p50 | ~25ms (claimed) | | **Per-message overhead** | 0.41ms p50 | Not published | | **Best for** | Teams who want control + Claude | Teams who want managed + multi-model | Both are solid choices. The decision comes down to whether you want to own the infrastructure or outsource it. 
--- # Development Setup Source: https://docs.ash-cloud.ai/contributing/development-setup # Development Setup Build Ash from source and run it locally. ## Prerequisites - **Node.js** >= 20 - **pnpm** >= 9 - **Docker** (for sandbox isolation and `ash start`) ## Clone and Install ```bash git clone https://github.com/ash-ai-org/ash.git cd ash pnpm install pnpm build ``` ## Dev Commands | Command | Description | |---------|-------------| | `make build` | Build all packages | | `make test` | Run unit tests | | `make typecheck` | Type-check all packages | | `make test-integration` | Run integration tests (starts real processes) | | `make dev` | Build Docker image, start server, deploy QA Bot agent, start QA Bot UI | | `make dev-no-sandbox` | Start server + QA Bot natively (no Docker, no sandbox isolation) | | `make docker-build` | Build local `ash-dev` Docker image | | `make docker-start` | Build image and start server in Docker | | `make docker-stop` | Stop the server container | | `make docker-status` | Show container status and health | | `make docker-logs` | Show container logs | | `make kill` | Kill processes on dev ports (4100, 3100) and stop Docker | | `make clean` | Remove build artifacts | ### Quick Start (with Docker) ```bash make dev ``` This builds the Docker image, starts the Ash server at `http://localhost:4100`, deploys the QA Bot example agent, and starts the QA Bot web UI at `http://localhost:3100`. ### Quick Start (without Docker) ```bash make dev-no-sandbox ``` This starts both the server and QA Bot natively. No sandbox isolation -- agent code runs in the same process context. Suitable for development when Docker is unavailable. 
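After `make dev` or `make docker-start`, the server answers at `http://localhost:4100`, but the container may take a moment before it accepts requests. In scripts it can help to wait for the server's `/health` endpoint before deploying agents. A minimal readiness-poll sketch -- the endpoint and port come from these docs, while the helper itself and its injectable `fetchFn` are illustrative, not part of Ash:

```typescript
// Poll a URL until it responds with an OK status, or give up after `retries`.
// The injectable fetchFn makes the helper easy to test without a live server.
async function waitForHealthy(
  url: string,
  fetchFn: (u: string) => Promise<{ ok: boolean }>,
  retries = 20,
  delayMs = 250,
): Promise<boolean> {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      if ((await fetchFn(url)).ok) return true;
    } catch {
      // Connection refused while the container is still starting -- keep polling.
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}

// Usage: await waitForHealthy('http://localhost:4100/health', fetch);
```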
## Running a Single Package ```bash # Server only (native, with real Claude SDK) ASH_REAL_SDK=1 pnpm --filter '@ash-ai/server' dev # QA Bot web UI only (needs server running separately) pnpm --filter qa-bot dev # Build a single package pnpm --filter '@ash-ai/shared' build # Test a single package pnpm --filter '@ash-ai/server' test ``` ## Using the CLI from Source Instead of the globally installed `ash`, run the CLI directly with `tsx`: ```bash npx tsx packages/cli/src/index.ts ``` Examples: ```bash # Start server with local dev image npx tsx packages/cli/src/index.ts start --image ash-dev --no-pull # Deploy an agent npx tsx packages/cli/src/index.ts deploy ./examples/qa-bot/agent --name qa-bot # Check status npx tsx packages/cli/src/index.ts status # Check health npx tsx packages/cli/src/index.ts health ``` ## OpenAPI and Python SDK Generation ```bash # Generate OpenAPI spec from Fastify route schemas make openapi # Generate Python SDK from OpenAPI spec (requires openapi-python-client) make sdk-python ``` The OpenAPI spec is generated by starting the server, extracting the schema from Fastify's Swagger plugin, and writing it to `packages/server/openapi.json` (also copied to `docs/openapi.json`). The Python SDK is generated from this spec using `openapi-python-client`, producing the `packages/sdk-python/` package. --- # Project Structure Source: https://docs.ash-cloud.ai/contributing/project-structure # Project Structure Ash is a pnpm monorepo with seven packages, each with a specific responsibility. ## Package Map | Package | npm Name | Description | |---------|----------|-------------| | `packages/shared` | `@ash-ai/shared` | Types, protocol definitions, constants. Zero runtime dependencies. Every other package depends on this. | | `packages/sandbox` | `@ash-ai/sandbox` | `SandboxManager` (process lifecycle), `SandboxPool` (capacity/eviction), `BridgeClient` (Unix socket client), resource limits, state persistence. Used by both server and runner. 
| | `packages/bridge` | `@ash-ai/bridge` | Runs inside each sandbox process. Listens on a Unix socket, receives commands, calls the Claude Code SDK (`@anthropic-ai/claude-code`), streams responses back. | | `packages/server` | `@ash-ai/server` | Fastify REST API. Agent registry, session routing, SSE streaming, database access (SQLite + Postgres). The main entry point. | | `packages/runner` | `@ash-ai/runner` | Worker node for multi-machine deployments. Manages sandboxes on a remote host. Registers with the server via heartbeat. | | `packages/sdk` | `@ash-ai/sdk` | TypeScript client library. `AshClient` class, SSE stream parser, re-exported types. | | `packages/cli` | `@ash-ai/cli` | `ash` command-line tool. Server lifecycle (Docker), agent deployment, session management. | ### Supporting directories | Directory | Description | |-----------|-------------| | `packages/sdk-python` | Python SDK, auto-generated from OpenAPI spec | | `examples/qa-bot` | Next.js chat app that uses Ash to power a QA bot | | `examples/hosted-agent` | Minimal example agent definition (CLAUDE.md + config) | | `examples/python-bot` | Python SDK usage example | | `docs/` | Architecture docs, ADRs, feature docs, runbooks, benchmarks | | `test/` | Integration tests and benchmarks (cross-package) | | `scripts/` | Deployment scripts (EC2, GCE) | ## Dependency Graph ```mermaid graph TD shared["shared
(types, protocol, constants)"] sandbox["sandbox
(manager, pool, bridge client)"] bridge["bridge
(in-sandbox process)"] server["server
(Fastify API, DB)"] runner["runner
(remote worker)"] sdk["sdk
(TypeScript client)"] cli["cli
(ash command)"] sandbox --> shared bridge --> shared server --> shared server --> sandbox runner --> shared runner --> sandbox sdk --> shared cli --> shared ``` The key insight: `sandbox` is a **library**, not a standalone process. It is imported by both `server` (standalone mode) and `runner` (multi-machine mode). ## Build Order Packages must be built in dependency order: 1. `shared` (no dependencies) 2. `sandbox` (depends on `shared`) 3. Everything else (`bridge`, `server`, `runner`, `sdk`, `cli` depend on `shared` and/or `sandbox`) `pnpm build` at the root handles this automatically via workspace dependency resolution. ## Module System All packages use ESM with TypeScript's `NodeNext` module resolution. Import paths include the `.js` extension: ```typescript // Relative imports reference the compiled .js path, even from .ts source (names illustrative): import { encode } from './protocol.js'; ``` ## Key Conventions 1. **SDK types pass through.** Ash uses the Claude Code SDK's `Message` type directly throughout the pipeline. Do not create wrapper types for conversation data. See [ADR 0001](/architecture/decisions#adr-0001-sdk-passthrough-types). 2. **Test boundaries, not glue.** Test API contracts, state transitions, protocol serialization, failure modes, and security invariants. Do not test trivial wrappers, type re-exports, or config loading. 3. **Document what you build.** Features go in `docs/features/`, decisions in `docs/decisions/`, benchmarks in `docs/benchmarks/`. If it is not documented, it is not finished. --- # Testing Guide Source: https://docs.ash-cloud.ai/contributing/testing # Testing Guide ## Philosophy **The test is the spec.** If the behavior is not tested, it is not guaranteed. Tests encode what the system promises. When requirements change, change the test first, then change the code. ## Test Pyramid | Layer | Count | Runner | Description | |-------|-------|--------|-------------| | **Unit** | ~50 | `pnpm test` | Protocol encode/decode, state machines, validators, helpers. Fast, no I/O. 
| | **Integration** | ~15 | `pnpm test:integration` | Full lifecycle: start server, deploy agent, create session, send messages, verify responses. Uses real sockets, real files, real processes (mocked Claude SDK). | | **Isolation** | Linux only | `pnpm test:isolation` | Sandbox security: verify env leaks are blocked, filesystem escapes fail, resource limits are enforced. Requires bubblewrap (bwrap). | | **Load** | On demand | `pnpm bench` | Latency and throughput benchmarks. Pool operations, sandbox startup, message overhead. | ## Running Tests ```bash # All unit tests across all packages pnpm test # Integration tests (starts real server processes) pnpm test:integration # Sandbox isolation tests (Linux with bwrap only) pnpm test:isolation # Benchmarks pnpm bench # Single package pnpm --filter '@ash-ai/server' test pnpm --filter '@ash-ai/shared' test ``` ## What to Test ### Test boundaries Protocol serialization (encode/decode round-trip), API request/response contracts, database queries, bridge command/event handling. These are the surfaces where bugs hide. ```typescript // Good: tests the encode/decode contract test('encode then decode round-trips a query command', () => { const cmd: QueryCommand = { cmd: 'query', prompt: 'hello', sessionId: 'abc' }; const decoded = decode(encode(cmd)); expect(decoded).toEqual(cmd); }); ``` ### Test failure modes What happens when the bridge crashes mid-stream? When the client disconnects? When the sandbox runs out of memory? When the database is unreachable? These are the scenarios that distinguish a demo from a system. 
```typescript // Good: tests crash recovery behavior test('session transitions to error when sandbox crashes', async () => { const session = await createSession('test-agent'); // Kill the sandbox process sandbox.process.kill('SIGKILL'); // Verify session status const updated = await getSession(session.id); expect(updated.status).toBe('error'); }); ``` ### Test invariants The sandbox environment never contains host secrets. An ended session rejects new messages. Eviction never touches a running sandbox. These are the properties that must always hold. ```typescript // Good: tests a security invariant test('sandbox env does not contain host secrets', () => { process.env.AWS_SECRET_ACCESS_KEY = 'supersecret'; const env = buildSandboxEnv(); expect(env.AWS_SECRET_ACCESS_KEY).toBeUndefined(); }); ``` ## What NOT to Test - **Trivial wrappers**: If a function just calls another function and returns the result, testing it adds no value. - **Type re-exports**: `export type { Session } from '@ash-ai/shared'` does not need a test. - **Config loading**: Unless the loading logic has branching or defaults that matter, skip it. ## Mocking Strategy **Mock the Claude SDK, not the OS.** - Use real Unix sockets, real files, real child processes. - Mock `@anthropic-ai/claude-code` to return predictable responses. - Do not mock `fs`, `net`, `child_process`, or `http`. If the test needs these, use them for real. The bridge package tests mock the SDK's `query()` function to yield controlled message sequences. Everything else (socket communication, process lifecycle, file I/O) uses real system calls. 
```typescript // Good: mock the SDK, use real sockets const mockSdk = { async *query(prompt: string) { yield { type: 'assistant', message: { content: [{ type: 'text', text: 'Hello' }] } }; yield { type: 'result', subtype: 'success' }; }, }; // Bad: mock the filesystem jest.mock('fs'); // Don't do this ``` --- # Release Process Source: https://docs.ash-cloud.ai/contributing/releases # Release Process Ash uses [Changesets](https://github.com/changesets/changesets) for versioning, changelogs, and npm publishing. ## Changesets Every pull request that changes package behavior must include a changeset. A changeset is a small markdown file in `.changeset/` that describes what changed and which packages are affected. ### Creating a Changeset ```bash pnpm changeset ``` This launches an interactive prompt that asks: 1. Which packages changed? 2. What type of bump for each? (patch, minor, major) 3. A one-sentence summary of the change The result is a file like `.changeset/cool-dogs-laugh.md`: ```markdown --- "@ash-ai/server": minor "@ash-ai/shared": patch --- Add session events timeline API for tracking agent actions. ``` ### Bump Types | Type | When to use | Examples | |------|-------------|---------| | `patch` | Bug fixes, internal refactors, dependency updates | Fix session timeout, update test helpers, bump vitest | | `minor` | New features, new API endpoints, new CLI commands | Add file listing endpoint, add `ash logs` command | | `major` | Breaking API changes, removed features, changed wire formats | Remove deprecated endpoint, change SSE event names | ### Rules - **One changeset per PR.** If a PR does one thing, one changeset. If it does two unrelated things, split the PR. - **Only include packages that changed.** Check which `packages/*/` directories your diff touches. - **Description is user-facing.** Write what changed from the consumer's perspective, not implementation details. These become CHANGELOG entries and GitHub Release notes. 
- **Internal packages count.** Changes to `@ash-ai/shared`, `@ash-ai/sandbox`, `@ash-ai/bridge` still need changesets. The config automatically bumps their dependents. ### What Does NOT Need a Changeset - Documentation-only changes - CI configuration changes - Test-only changes - Anything that does not affect published package behavior ## CI Flow ```mermaid graph LR A["PR merged to main
(includes changeset)"] --> B["CI opens
'Version Packages' PR"] B --> C["Bumps package.json versions
Generates CHANGELOG entries"] C --> D["Merge 'Version Packages' PR"] D --> E["CI publishes to npm
Creates GitHub Release"] ``` ### Step by step 1. **You merge a PR** that includes a `.changeset/*.md` file. 2. **CI automatically opens a "Version Packages" PR.** This PR: - Bumps `version` in the affected `package.json` files - Generates `CHANGELOG.md` entries from the changeset description - Deletes the consumed `.changeset/*.md` files 3. **You review and merge the "Version Packages" PR.** 4. **CI publishes** the bumped packages to npm and creates GitHub Releases with release notes. ### Preview To see what changesets are pending and what they would do: ```bash pnpm changeset status ``` ### Local Version Bump (rare) Normally CI handles versioning. If you need to bump locally: ```bash make version-packages # Apply pending changesets locally make publish-dry-run # See what would be published make publish # Publish to npm (requires NPM_TOKEN) ```
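When several pending changesets touch the same package, the "Version Packages" PR applies the highest bump any of them requested (major over minor over patch). A self-contained sketch of that precedence rule -- illustrative, not Changesets' actual implementation:

```typescript
// Resolve the final bump per package across all pending changesets:
// the highest requested bump wins (patch < minor < major).
type Bump = 'patch' | 'minor' | 'major';

const rank: Record<Bump, number> = { patch: 0, minor: 1, major: 2 };

function resolveBumps(changesets: Record<string, Bump>[]): Record<string, Bump> {
  const resolved: Record<string, Bump> = {};
  for (const changeset of changesets) {
    for (const [pkg, bump] of Object.entries(changeset)) {
      // Keep the stronger of the existing and newly requested bump.
      if (!(pkg in resolved) || rank[bump] > rank[resolved[pkg]]) {
        resolved[pkg] = bump;
      }
    }
  }
  return resolved;
}
```

For example, one pending changeset with `"@ash-ai/server": minor` plus another with `"@ash-ai/server": patch` resolves to a single minor bump for `@ash-ai/server`.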