Observability
The LLM Service Daemon (LSD) writes every inference, every provider response, and every feedback signal to the same PostgreSQL database that holds its config. There is no separate analytics store to run or keep in sync.
What gets stored
| Table family | Contents |
|---|---|
| chat_inferences, json_inferences, embedding_inferences | One row per inference, by output type (partitioned by time) |
| chat_inference_data, json_inference_data, model_inference_data | Raw request/response payloads backing the above |
| model_inferences | Per-provider-attempt telemetry (latency, tokens, which provider actually served the request) |
| batch_requests, batch_model_inferences | Batch inference jobs and their results |
| boolean_metric_feedback, float_metric_feedback, comment_feedback, demonstration_feedback | Feedback attached to an inference or episode |
| inference_evaluation_runs, inference_evaluation_human_feedback | Evaluation run results and any human feedback collected for them |

Materialized aggregates are refreshed automatically for dashboards and cost tracking: inference_by_function_statistics, variant_statistics, model_provider_statistics, and per-minute/per-hour model_latency_histogram_*.
Writes: sync, async, or batched
Postgres is a hard dependency, not an optional sink. Config, auth, rate limiting, and observability all share one connection, so LSD_DATABASE_URL must point at a reachable database or the gateway refuses to start.
What you can toggle is whether observability rows specifically get written to that already-required connection:
```toml
[gateway.observability]
enabled = true        # write inference/feedback rows (default: true)
async_writes = true   # don't block the response on the write (default in production)

[gateway.observability.batch_writes]
enabled = true
flush_interval_ms = 100
max_rows = 1000
```

Set enabled = false under [gateway.observability] to skip writing observability data entirely, for example in a load-testing setup where you want Postgres for config/auth but don’t want every inference recorded.
Async writes avoid adding write latency to the request path; batch writes coalesce many rows into fewer Postgres round-trips under load.
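To make the batching trade-off concrete, here is a minimal Python sketch of a buffer that flushes either when it holds max_rows rows or when the oldest buffered row has waited flush_interval_ms. It illustrates the general technique only and is not LSD's implementation; the BatchBuffer class, its tick method, and the injected clock are hypothetical names chosen for this example.

```python
import time
from typing import Callable, List, Optional


class BatchBuffer:
    """Illustrative row buffer (hypothetical, not LSD's code).

    Flushes when max_rows rows are buffered, or when tick() is called
    and the oldest buffered row has waited at least flush_interval_ms.
    """

    def __init__(
        self,
        flush: Callable[[List[dict]], None],
        flush_interval_ms: int = 100,
        max_rows: int = 1000,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._interval_s = flush_interval_ms / 1000.0
        self._max_rows = max_rows
        self._clock = clock
        self._rows: List[dict] = []
        self._first_row_at: Optional[float] = None

    def add(self, row: dict) -> None:
        # Remember when the first row of this batch arrived.
        if not self._rows:
            self._first_row_at = self._clock()
        self._rows.append(row)
        # Size-based flush: one round-trip carries many rows.
        if len(self._rows) >= self._max_rows:
            self.flush()

    def tick(self) -> None:
        # Time-based flush: bounds how long a row can sit in memory.
        if self._rows and self._clock() - self._first_row_at >= self._interval_s:
            self.flush()

    def flush(self) -> None:
        if not self._rows:
            return
        rows, self._rows = self._rows, []
        self._first_row_at = None
        self._flush(rows)
```

In a sketch like this, a larger max_rows or flush_interval_ms means fewer round-trips but more rows held in memory (and lost if the process dies) between flushes.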
Querying historical inferences
```bash
curl -X POST http://localhost:3000/v1/inferences/list_inferences \
  -H "Content-Type: application/json" \
  -d '{"function_name": "my_function", "limit": 100}'
```

POST /v1/inferences/list_inferences and POST /v1/inferences/get_inferences let you query stored inferences programmatically. No need to hand-write SQL against the partitioned tables, though you’re always free to.
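The same list_inferences call can be made from Python using only the standard library. This is a sketch under assumptions: the helper names build_list_inferences_request and list_inferences are made up for this example, the base URL is the localhost address from the curl command, and the response is assumed to be JSON (its exact shape is not described here, so the code returns it undecoded beyond json.loads).

```python
import json
import urllib.request


def build_list_inferences_request(
    base_url: str, function_name: str, limit: int = 100
) -> urllib.request.Request:
    """Build the POST request; the body mirrors the curl example."""
    body = json.dumps({"function_name": function_name, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/inferences/list_inferences",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def list_inferences(base_url: str, function_name: str, limit: int = 100):
    """Send the request and parse the body as JSON (assumed response format)."""
    req = build_list_inferences_request(base_url, function_name, limit)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a gateway running locally, `list_inferences("http://localhost:3000", "my_function")` would issue the same request as the curl command.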
OpenTelemetry and Prometheus
- OTLP traces: enable with gateway.export.otlp.traces.enabled = true and point OTEL_EXPORTER_OTLP_TRACES_ENDPOINT at your collector. Spans are created per inference, batch, and feedback request.
- Prometheus: scrape GET /metrics for request counts, latency histograms, and per-provider stats.
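For a quick look at what GET /metrics returns without a full Prometheus setup, the response can be parsed by hand, assuming it uses the standard Prometheus text exposition format. The sketch below is deliberately simplified (it ignores optional timestamps and does not unescape label values), and parse_metrics is a hypothetical helper; no specific LSD metric names are assumed.

```python
from typing import List, Tuple


def parse_metrics(text: str) -> List[Tuple[str, str, float]]:
    """Parse Prometheus text-format samples into (name, raw_labels, value).

    Simplified: skips # HELP / # TYPE comment lines, keeps the label set
    as a raw string, and drops any trailing timestamp.
    """
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # name{label="x",...} value [timestamp]
            head, _, tail = line.rpartition("}")
            name, _, labels = head.partition("{")
        else:
            # name value [timestamp]
            name, _, tail = line.partition(" ")
            labels = ""
        value = float(tail.split()[0])  # float() also accepts +Inf and NaN
        samples.append((name.strip(), labels, value))
    return samples
```

Feeding it the body fetched from the gateway's /metrics endpoint yields one tuple per sample line, which is enough to grep for a counter or eyeball histogram buckets.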