Evaluations

The LLM Service Daemon (LSD) supports two levels of evaluation: scoring a single inference against a datapoint, and grouping a sequence of inferences into a multi-step workflow run that gets scored as a whole.

Inference-level evaluations

Run these with the standalone evaluations binary, which can work against either an embedded config or a running gateway:

cargo run -p evaluations -- \
  --config-file ./config/lsd.toml \
  --gateway-url http://localhost:3000 \
  --evaluation-name my_evaluation \
  --dataset-name my_dataset \
  --variant-name my_variant \
  --concurrency 4

Useful flags:

--function-name + --evaluator-names: run specific evaluators without a named evaluation config
--datapoint-ids: evaluate specific datapoints instead of a whole dataset
--format pretty|jsonl: human-readable or machine-parseable output
--adaptive-stopping-precision evaluator=target: stop early once an evaluator’s confidence interval is tight enough
--cutoffs evaluator=threshold: exit non-zero if an evaluator’s score falls below a threshold (useful in CI)
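The last two flags can be read as simple checks on per-evaluator scores. The sketch below is not LSD's implementation. It assumes that "precision" means the half-width of a 95% normal-approximation confidence interval around the mean score, and that a cutoff is compared against the mean score. The source states neither, and the helper names (parse_pair, ci_half_width, should_stop, failed_cutoffs) are hypothetical.

```python
import math
import statistics


def parse_pair(item: str) -> tuple[str, float]:
    """Split one 'evaluator=value' argument into (name, value)."""
    name, sep, value = item.partition("=")
    if not sep or not name.strip():
        raise ValueError(f"expected evaluator=value, got {item!r}")
    return name.strip(), float(value)


def ci_half_width(scores: list[float], z: float = 1.96) -> float:
    """Half-width of a normal-approximation CI for the mean (assumed definition)."""
    if len(scores) < 2:
        return math.inf  # too few samples to estimate spread
    return z * statistics.stdev(scores) / math.sqrt(len(scores))


def should_stop(scores: list[float], target: float) -> bool:
    """Adaptive stopping: stop once the interval is at least as tight as the target."""
    return ci_half_width(scores) <= target


def failed_cutoffs(mean_scores: dict[str, float], cutoffs: dict[str, float]) -> list[str]:
    """Names of evaluators whose mean score is below their threshold."""
    return sorted(
        name for name, threshold in cutoffs.items() if mean_scores[name] < threshold
    )
```

In a CI job, a non-empty result from failed_cutoffs would correspond to the non-zero exit code described above.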

Evaluator types

| `type` | Behavior |
| --- | --- |
| `exact_match` | Boolean: output matches the expected output exactly |
| `regex` | Boolean: output matches a pattern |
| `tool_use` | Boolean: validates the tool calls made during the inference |
| `llm_judge` | Boolean or float: another LLM scores the output against a rubric; checks for existing human feedback before calling the judge |
| `typescript` | Boolean or float: custom scoring logic you provide |

Evaluator configuration

[evaluations.my_evaluation.evaluators.matches_expected]
type = "exact_match"

[evaluations.my_evaluation.evaluators.no_swearing]
type = "regex"
must_not_match = "(?i)\\b(damn|hell)\\b"

[evaluations.my_evaluation.evaluators.calls_search_tool]
type = "tool_use"
behavior = "any_of"
tools = ["web_search"]

[evaluations.my_evaluation.evaluators.quality_judge]
type = "llm_judge"
input_format = "messages"     # serialized | messages
output_type = "float"         # boolean | float
optimize = "max"
include = { reference_output = true }

tool_use accepts behavior = "none" | "none_of" | "any" | "any_of" | "all_of". tools = [...] is required for every behavior except none and any. regex needs at least one of must_match / must_not_match; when both are set, the output must satisfy both (a logical AND). llm_judge and typescript evaluators also take an optimize direction (max or min), which is used by adaptive stopping and by optimization jobs that target an evaluator's score.

Workflow evaluations

For multi-step agents, start a run and tag each step (episode) as it happens, rather than scoring one inference in isolation:

curl -X POST http://localhost:3000/workflow_evaluation_run \
  -H "Content-Type: application/json" \
  -d '{"variants": {"my_function": "my_variant"}, "project_name": "my_agent"}'

This returns a run_id. Subsequent inferences tagged into that run via POST /workflow_evaluation_run/{run_id}/episode are grouped together, so you can evaluate an entire agent trajectory rather than a single call.

Where results go

Run metadata and scores land in inference_evaluation_runs, with workflow_evaluation_run_episodes added for workflow runs. Human-in-the-loop feedback lands in inference_evaluation_human_feedback. All of these tables live in the same Postgres database as the rest of LSD's data, so they can be queried alongside it.