Observability

The LLM Service Daemon (LSD) writes every inference, every provider response, and every feedback signal to the same PostgreSQL database that holds its config. There is no separate analytics store to run or keep in sync.

What gets stored

Table family	Contents
`chat_inferences`, `json_inferences`, `embedding_inferences`	One row per inference, by output type (partitioned by time)
`chat_inference_data`, `json_inference_data`, `model_inference_data`	Raw request/response payloads backing the corresponding inference tables
`model_inferences`	Per-provider-attempt telemetry (latency, tokens, which provider actually served the request)
`batch_requests`, `batch_model_inferences`	Batch inference jobs and their results
`boolean_metric_feedback`, `float_metric_feedback`, `comment_feedback`, `demonstration_feedback`	Feedback attached to an inference or episode
`inference_evaluation_runs`, `inference_evaluation_human_feedback`	Evaluation run results and any human feedback collected for them

Materialized aggregates are refreshed automatically for dashboards and cost tracking: inference_by_function_statistics, variant_statistics, model_provider_statistics, and per-minute/per-hour model_latency_histogram_*.

Writes: sync, async, or batched

Postgres is a hard dependency, not an optional sink. Config, auth, rate limiting, and observability all share one connection, so LSD_DATABASE_URL must point at a reachable database or the gateway refuses to start.

What you can toggle is whether observability rows are written over that required connection:

[gateway.observability]
enabled = true        # write inference/feedback rows (default: true)
async_writes = true   # don't block the response on the write (default in production)

[gateway.observability.batch_writes]
enabled = true
flush_interval_ms = 100
max_rows = 1000

Set enabled = false under [gateway.observability] to skip writing observability data entirely, for example in a load-testing setup where you want Postgres for config/auth but don’t want every inference recorded.

Async writes avoid adding write latency to the request path; batch writes coalesce many rows into fewer Postgres round-trips under load.
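To make the batching trade-off concrete, here is a minimal sketch of a write buffer that flushes when it is full or when its oldest row has waited too long. It is an illustration of the general technique, not LSD's implementation: the `RowBatcher` class and its `flush` callback are hypothetical, and it assumes `flush_interval_ms` and `max_rows` mean what their names suggest.

```python
import time
from typing import Callable, List


class RowBatcher:
    """Hypothetical buffer: flush when full, or when the oldest row is too old."""

    def __init__(
        self,
        flush: Callable[[List[dict]], None],
        max_rows: int = 1000,
        flush_interval_ms: int = 100,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush                      # e.g. one multi-row INSERT
        self._max_rows = max_rows
        self._interval_s = flush_interval_ms / 1000.0
        self._clock = clock                      # injectable for testing
        self._rows: List[dict] = []
        self._first_row_at = 0.0

    def add(self, row: dict) -> None:
        if not self._rows:
            # Start the interval timer when the first row enters an empty buffer.
            self._first_row_at = self._clock()
        self._rows.append(row)
        self._maybe_flush()

    def tick(self) -> None:
        """Call periodically so a partially filled batch is not held forever."""
        self._maybe_flush()

    def _maybe_flush(self) -> None:
        if not self._rows:
            return
        full = len(self._rows) >= self._max_rows
        stale = self._clock() - self._first_row_at >= self._interval_s
        if full or stale:
            batch, self._rows = self._rows, []
            self._flush(batch)


if __name__ == "__main__":
    batcher = RowBatcher(lambda rows: print(f"flushing {len(rows)} rows"), max_rows=2)
    for i in range(5):
        batcher.add({"inference_id": i})
    time.sleep(0.15)
    batcher.tick()  # flushes the one remaining row after the interval
```

A larger `max_rows` or a longer interval means fewer round-trips, at the cost of rows taking longer to become visible in the database.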

Querying historical inferences

curl -X POST http://localhost:3000/v1/inferences/list_inferences \
  -H "Content-Type: application/json" \
  -d '{"function_name": "my_function", "limit": 100}'

POST /v1/inferences/list_inferences and POST /v1/inferences/get_inferences let you query stored inferences programmatically, so you don’t need to hand-write SQL against the partitioned tables, though you’re always free to.
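The same list_inferences call can be made from Python using only the standard library. This is a sketch: the helper names are hypothetical, the base URL is taken from the curl example, and the response is returned as decoded JSON because its exact shape is not described here.

```python
import json
import urllib.request


def build_list_inferences_request(
    function_name: str,
    limit: int = 100,
    base_url: str = "http://localhost:3000",  # same address as the curl example
) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({"function_name": function_name, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/inferences/list_inferences",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def list_inferences(function_name: str, limit: int = 100):
    """Send the request and return the decoded JSON body as-is."""
    request = build_list_inferences_request(function_name, limit)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    # Requires a running gateway on localhost:3000.
    print(list_inferences("my_function", limit=100))
```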

OpenTelemetry and Prometheus

OTLP traces: enable with gateway.export.otlp.traces.enabled = true and point OTEL_EXPORTER_OTLP_TRACES_ENDPOINT at your collector. Spans are created per inference, batch, and feedback request.
Prometheus: scrape GET /metrics for request counts, latency histograms, and per-provider stats.
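For a quick look at the metrics without a Prometheus server, the /metrics output can be fetched and parsed by hand. The sketch below handles the common case of the Prometheus text format, where each sample line is a series followed by a value with no trailing timestamp. The metric name in `SAMPLE` is made up for illustration, and serving /metrics on localhost:3000 is an assumption carried over from the curl example above.

```python
import urllib.request
from typing import Dict

# Hypothetical sample; real metric names and labels will differ.
SAMPLE = """\
# HELP request_count Hypothetical request counter
# TYPE request_count counter
request_count{endpoint="inference"} 42
request_count{endpoint="feedback"} 7
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each series (name plus labels) to its value, skipping comment lines."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a "# HELP" / "# TYPE" comment
        # Assumes no trailing timestamp: the value is the last space-separated token.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fetch_metrics(base_url: str = "http://localhost:3000") -> Dict[str, float]:
    """Scrape GET /metrics once and parse the response body."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=10) as response:
        return parse_metrics(response.read().decode("utf-8"))


if __name__ == "__main__":
    for series, value in parse_metrics(SAMPLE).items():
        print(series, value)
```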