Skip to main content

Monitoring

Ash exposes health checks, Prometheus metrics, debug timing, and structured logs for production monitoring.

Health Endpoint

GET /health returns the server's current status. This endpoint does not require authentication.

curl $ASH_SERVER_URL/health

Response:

{
"status": "ok",
"activeSessions": 3,
"activeSandboxes": 5,
"uptime": 86400,
"pool": {
"total": 10,
"cold": 2,
"warming": 1,
"warm": 2,
"waiting": 3,
"running": 2,
"maxCapacity": 1000,
"resumeWarmHits": 42,
"resumeColdHits": 7
}
}
FieldDescription
statusAlways "ok" if the server is reachable.
activeSessionsNumber of sessions with status active.
activeSandboxesNumber of live sandbox processes.
uptimeSeconds since server start.
pool.totalTotal sandboxes in the pool (all states).
pool.warmSandboxes ready to accept work immediately.
pool.runningSandboxes actively processing a message.
pool.maxCapacityMaximum number of sandboxes the pool allows.
pool.resumeWarmHitsTimes a session resumed with its sandbox still alive (fast path).
pool.resumeColdHitsTimes a session resumed by creating a new sandbox (cold path).

Prometheus Metrics

GET /metrics returns metrics in Prometheus text exposition format. This endpoint does not require authentication.

curl $ASH_SERVER_URL/metrics

Response:

# HELP ash_up Whether the Ash server is up (always 1 if reachable).
# TYPE ash_up gauge
ash_up 1

# HELP ash_uptime_seconds Seconds since server start.
# TYPE ash_uptime_seconds gauge
ash_uptime_seconds 86400

# HELP ash_active_sessions Number of active sessions.
# TYPE ash_active_sessions gauge
ash_active_sessions 3

# HELP ash_active_sandboxes Number of live sandbox processes.
# TYPE ash_active_sandboxes gauge
ash_active_sandboxes 5

# HELP ash_pool_sandboxes Sandbox count by state.
# TYPE ash_pool_sandboxes gauge
ash_pool_sandboxes{state="cold"} 2
ash_pool_sandboxes{state="warming"} 1
ash_pool_sandboxes{state="warm"} 2
ash_pool_sandboxes{state="waiting"} 3
ash_pool_sandboxes{state="running"} 2

# HELP ash_pool_max_capacity Maximum sandbox capacity.
# TYPE ash_pool_max_capacity gauge
ash_pool_max_capacity 1000

# HELP ash_resume_total Total session resumes by path (warm=sandbox alive, cold=new sandbox).
# TYPE ash_resume_total counter
ash_resume_total{path="warm"} 42
ash_resume_total{path="cold"} 7

Metric Reference

MetricTypeDescription
ash_upgaugeAlways 1 if the server is reachable. Use for up/down alerting.
ash_uptime_secondsgaugeSeconds since server process started.
ash_active_sessionsgaugeSessions currently in active state.
ash_active_sandboxesgaugeLive sandbox processes (includes all states).
ash_pool_sandboxesgaugeSandbox count broken down by state label: cold, warming, warm, waiting, running.
ash_pool_max_capacitygaugeMaximum sandboxes the pool will create.
ash_resume_totalcounterCumulative session resumes by path: warm (sandbox alive) or cold (new sandbox).

Prometheus Configuration

Add Ash as a scrape target in prometheus.yml:

scrape_configs:
- job_name: 'ash'
scrape_interval: 15s
static_configs:
- targets: ['localhost:4100']
metrics_path: /metrics

Example PromQL Queries

Active sessions over time:

ash_active_sessions

Warm resume hit rate (percentage of resumes that were fast):

ash_resume_total{path="warm"} / (ash_resume_total{path="warm"} + ash_resume_total{path="cold"})

Pool utilization (fraction of capacity in use):

sum(ash_pool_sandboxes) / ash_pool_max_capacity

Running sandboxes (actively processing messages):

ash_pool_sandboxes{state="running"}

Alert when pool is over 80% capacity:

sum(ash_pool_sandboxes) / ash_pool_max_capacity > 0.8

Debug Timing

Set ASH_DEBUG_TIMING=1 to enable per-message timing instrumentation. When enabled, the server writes one JSON line to stderr for each message processed:

ASH_DEBUG_TIMING=1 ash start

Timing output:

{
"type": "timing",
"source": "server",
"sessionId": "a1b2c3d4-...",
"sandboxId": "a1b2c3d4-...",
"lookupMs": 0.42,
"firstEventMs": 145.8,
"totalMs": 2340.5,
"eventCount": 12,
"timestamp": "2025-01-15T10:30:00.000Z"
}
FieldDescription
lookupMsTime to look up the session and sandbox.
firstEventMsTime from request to first SSE event (time-to-first-token).
totalMsTotal request duration.
eventCountNumber of SSE events sent.

Timing is zero-overhead when ASH_DEBUG_TIMING is not set. The check is a single process.env read per message.

Structured Logs

Ash writes structured JSON log lines to stderr. Each line is a self-contained JSON object.

Resume Logging

Every session resume emits a log line (always on, not gated by ASH_DEBUG_TIMING):

{
"type": "resume_hit",
"path": "warm",
"sessionId": "a1b2c3d4-...",
"agentName": "my-agent",
"ts": "2025-01-15T10:30:00.000Z"
}

The path field is warm (sandbox still alive) or cold (new sandbox created).

Log Analysis with jq

Filter resume events:

ash start 2>&1 | jq -c 'select(.type == "resume_hit")'

Count warm vs cold resumes:

ash start 2>&1 | jq -c 'select(.type == "resume_hit")' | \
jq -s 'group_by(.path) | map({path: .[0].path, count: length})'

Filter timing data for a specific session:

ash start 2>&1 | jq -c 'select(.type == "timing" and .sessionId == "SESSION_ID")'

Find slow messages (time-to-first-token over 500ms):

ash start 2>&1 | jq -c 'select(.type == "timing" and .firstEventMs > 500)'

Average time-to-first-token:

ash start 2>&1 | jq -cs '[.[] | select(.type == "timing")] | (map(.firstEventMs) | add) / length'