
Health and Metrics

Ash exposes health and metrics endpoints for monitoring, alerting, and integration with orchestration systems. Neither endpoint requires authentication.


Health Check

GET /health

Returns the server's current status, active session and sandbox counts, uptime, and detailed sandbox pool statistics.

Request

No request body. No authentication required.

Response

200 OK

{
  "status": "ok",
  "activeSessions": 3,
  "activeSandboxes": 5,
  "uptime": 86400,
  "pool": {
    "total": 5,
    "cold": 0,
    "warming": 1,
    "warm": 1,
    "waiting": 2,
    "running": 1,
    "maxCapacity": 1000,
    "resumeWarmHits": 42,
    "resumeColdHits": 7
  }
}

| Field | Type | Description |
| --- | --- | --- |
| status | string | Always "ok" if the server is reachable |
| activeSessions | integer | Number of sessions in active status |
| activeSandboxes | integer | Number of live sandbox processes |
| uptime | integer | Seconds since server start |
| pool | PoolStats | Sandbox pool breakdown |
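
As a rough illustration of consuming this response, the Python sketch below fetches /health and checks that the documented top-level fields are present. The function names and the default base URL are assumptions made for this example (the port is taken from the scrape target used later on this page); they are not part of Ash.

```python
import json
import urllib.request

# Top-level fields documented for the /health response.
REQUIRED_FIELDS = ("status", "activeSessions", "activeSandboxes", "uptime", "pool")


def parse_health(body: bytes) -> dict:
    """Decode a /health response body and verify the documented fields exist."""
    data = json.loads(body)
    for key in REQUIRED_FIELDS:
        if key not in data:
            raise ValueError(f"missing field in /health response: {key}")
    return data


def fetch_health(base_url: str = "http://localhost:4100", timeout: float = 2.0) -> dict:
    """Fetch and parse /health. base_url is an assumed default, not an Ash constant."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return parse_health(resp.read())
```

Calling fetch_health() against a running server should return a dictionary shaped like the JSON above; parse_health can also be used on its own with a captured response body.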

Pool Stats

The pool object provides a detailed view of sandbox states:

| Field | Type | Description |
| --- | --- | --- |
| total | integer | Total sandboxes in the pool |
| cold | integer | Sandboxes not yet started |
| warming | integer | Sandboxes currently starting up |
| warm | integer | Sandboxes ready but not assigned to a session |
| waiting | integer | Sandboxes assigned to a session, idle between messages |
| running | integer | Sandboxes actively processing a message |
| maxCapacity | integer | Maximum number of sandboxes allowed (configured via ASH_MAX_SANDBOXES) |
| resumeWarmHits | integer | Total warm resumes (sandbox was still alive) |
| resumeColdHits | integer | Total cold resumes (new sandbox created, state restored) |
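
The pool fields lend themselves to a few derived figures for dashboards or alerts. The sketch below is one possible way to compute them in Python; the names pool_summary, in_use, utilization, and warm_resume_ratio are illustrative choices for this example, and treating waiting plus running as "in use" is an assumption based on the field descriptions above rather than something Ash defines.

```python
def pool_summary(pool: dict) -> dict:
    """Derive simple capacity and resume figures from a PoolStats object."""
    # Assumption: sandboxes assigned to a session are either waiting or running.
    in_use = pool["waiting"] + pool["running"]
    resumes = pool["resumeWarmHits"] + pool["resumeColdHits"]
    capacity = pool["maxCapacity"]
    return {
        "in_use": in_use,
        # Fraction of the configured maximum currently occupied by the pool.
        "utilization": pool["total"] / capacity if capacity else 0.0,
        # Share of resumes that found the sandbox still alive; None if no resumes yet.
        "warm_resume_ratio": pool["resumeWarmHits"] / resumes if resumes else None,
    }
```

A falling warm_resume_ratio would suggest that more sessions are resuming into freshly created sandboxes, which may be worth watching alongside utilization.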

Prometheus Metrics

GET /metrics

Returns metrics in Prometheus text exposition format. No authentication required.

Request

No request body.

Response

200 OK with Content-Type: text/plain; version=0.0.4; charset=utf-8

# HELP ash_up Whether the Ash server is up (always 1 if reachable).
# TYPE ash_up gauge
ash_up 1

# HELP ash_uptime_seconds Seconds since server start.
# TYPE ash_uptime_seconds gauge
ash_uptime_seconds 86400

# HELP ash_active_sessions Number of active sessions.
# TYPE ash_active_sessions gauge
ash_active_sessions 3

# HELP ash_active_sandboxes Number of live sandbox processes.
# TYPE ash_active_sandboxes gauge
ash_active_sandboxes 5

# HELP ash_pool_sandboxes Sandbox count by state.
# TYPE ash_pool_sandboxes gauge
ash_pool_sandboxes{state="cold"} 0
ash_pool_sandboxes{state="warming"} 1
ash_pool_sandboxes{state="warm"} 1
ash_pool_sandboxes{state="waiting"} 2
ash_pool_sandboxes{state="running"} 1

# HELP ash_pool_max_capacity Maximum sandbox capacity.
# TYPE ash_pool_max_capacity gauge
ash_pool_max_capacity 1000

# HELP ash_resume_total Total session resumes by path (warm=sandbox alive, cold=new sandbox).
# TYPE ash_resume_total counter
ash_resume_total{path="warm"} 42
ash_resume_total{path="cold"} 7

Metric Reference

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| ash_up | gauge | -- | Always 1 if the server is reachable |
| ash_uptime_seconds | gauge | -- | Seconds since server process started |
| ash_active_sessions | gauge | -- | Number of sessions in active status |
| ash_active_sandboxes | gauge | -- | Number of live sandbox processes |
| ash_pool_sandboxes | gauge | state | Sandbox count broken down by state: cold, warming, warm, waiting, running |
| ash_pool_max_capacity | gauge | -- | Configured maximum sandbox capacity |
| ash_resume_total | counter | path | Cumulative session resume count by path: warm (sandbox still alive) or cold (new sandbox created) |
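
For scripts that read /metrics directly rather than through Prometheus, a small parser is often enough. The Python sketch below handles only the simple line shapes shown in the sample output above (a metric name, optional labels, a numeric value, and an optional timestamp); it is not a complete implementation of the exposition format, and parse_metrics and by_label are hypothetical helper names.

```python
import re

# metric_name{label="value",...} value [timestamp]
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> dict:
    """Map (metric name, sorted label pairs) to a float sample value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = tuple(sorted(LABEL_RE.findall(raw_labels or "")))
        samples[(name, labels)] = float(value)
    return samples


def by_label(samples: dict, metric: str, label: str) -> dict:
    """Collect one metric's samples into {label value: sample value}."""
    result = {}
    for (name, labels), value in samples.items():
        label_map = dict(labels)
        if name == metric and label in label_map:
            result[label_map[label]] = value
    return result
```

For example, by_label(parse_metrics(body), "ash_pool_sandboxes", "state") gives the per-state sandbox counts as a dictionary keyed by state name.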

Prometheus Configuration

Add the following scrape config to your prometheus.yml:

scrape_configs:
  - job_name: 'ash'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:4100']
    metrics_path: '/metrics'

Kubernetes Probes

The /health endpoint is suitable for both liveness and readiness probes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ash-server
spec:
  template:
    spec:
      containers:
        - name: ash
          livenessProbe:
            httpGet:
              path: /health
              port: 4100
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 4100
            initialDelaySeconds: 5
            periodSeconds: 5

The liveness probe verifies the server process is responsive. The readiness probe can be used to gate traffic until the server has completed initialization. Both return 200 with {"status": "ok", ...} when the server is healthy.
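
Outside Kubernetes, for instance in a CI job or a startup script, the same check can be approximated by polling /health until it answers with 200. The Python sketch below shows one way to do that; probe_once and wait_until_ready are hypothetical helpers written for this example, and the URL passed in is whatever address your server listens on.

```python
import http.client
import time
import urllib.request


def probe_once(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on url answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and non-2xx responses all count as not ready
        # (urllib's URLError and HTTPError are subclasses of OSError).
        return False


def wait_until_ready(probe, attempts: int = 10, delay: float = 1.0, sleep=time.sleep) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A typical call would be wait_until_ready(lambda: probe_once("http://localhost:4100/health")), where the address is an assumption matching the port used in the examples above. Passing the probe as a callable keeps the retry loop easy to test without a live server.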