Evaluations

The LLM Service Daemon (LSD) supports two levels of evaluation: scoring a single inference against a datapoint, and grouping a sequence of inferences into a multi-step workflow run that gets scored as a whole.

Run an evaluation with the standalone evaluations binary, against either an embedded config or a running gateway:

cargo run -p evaluations -- \
  --config-file ./config/lsd.toml \
  --gateway-url http://localhost:3000 \
  --evaluation-name my_evaluation \
  --dataset-name my_dataset \
  --variant-name my_variant \
  --concurrency 4

Useful flags:

  • --function-name + --evaluator-names: run specific evaluators without a named evaluation config
  • --datapoint-ids: evaluate specific datapoints instead of a whole dataset
  • --format pretty|jsonl: human-readable or machine-parseable output
  • --adaptive-stopping-precision evaluator=target: stop early once an evaluator’s confidence interval is tight enough
  • --cutoffs evaluator=threshold: exit non-zero if an evaluator’s score falls below a threshold (useful in CI)
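
To make the --cutoffs behavior concrete, here is a minimal Python sketch of a cutoff check. It is not the binary's actual implementation: the function names, the use of a mean score per evaluator, and the choice to treat a missing score as a failure are all assumptions made for illustration.

```python
def parse_cutoffs(pairs: list[str]) -> dict[str, float]:
    """Turn ["evaluator=threshold", ...] strings into {evaluator: threshold}."""
    cutoffs = {}
    for pair in pairs:
        name, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected evaluator=threshold, got {pair!r}")
        cutoffs[name.strip()] = float(value)
    return cutoffs


def failed_cutoffs(mean_scores: dict[str, float], cutoffs: dict[str, float]) -> list[str]:
    """Names of evaluators whose score is below its threshold.

    Assumption: an evaluator with no recorded score counts as failing.
    """
    return sorted(
        name
        for name, threshold in cutoffs.items()
        if mean_scores.get(name, float("-inf")) < threshold
    )


def exit_code(mean_scores: dict[str, float], cutoffs: dict[str, float]) -> int:
    """0 when every cutoff is met, 1 otherwise (what a CI job would check)."""
    return 1 if failed_cutoffs(mean_scores, cutoffs) else 0


# Hypothetical scores, for illustration only.
example_scores = {"matches_expected": 0.92, "quality_judge": 0.61}
example_cutoffs = parse_cutoffs(["matches_expected=0.9", "quality_judge=0.7"])
print(failed_cutoffs(example_scores, example_cutoffs), exit_code(example_scores, example_cutoffs))
```

With these made-up numbers only quality_judge misses its threshold, so a CI step built on this check would fail.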

Evaluator types:

type         Behavior
exact_match  Boolean: output matches the expected output exactly
regex        Boolean: output matches a pattern
tool_use     Boolean: validates the tool calls made during the inference
llm_judge    Boolean or float: another LLM scores the output against a rubric; checks for existing human feedback before calling the judge
typescript   Boolean or float: custom scoring logic you provide

[evaluations.my_evaluation.evaluators.matches_expected]
type = "exact_match"

[evaluations.my_evaluation.evaluators.no_swearing]
type = "regex"
must_not_match = "(?i)\\b(damn|hell)\\b"

[evaluations.my_evaluation.evaluators.calls_search_tool]
type = "tool_use"
behavior = "any_of"
tools = ["web_search"]

[evaluations.my_evaluation.evaluators.quality_judge]
type = "llm_judge"
input_format = "messages"  # serialized | messages
output_type = "float"      # boolean | float
optimize = "max"
include = { reference_output = true }
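
The no_swearing pattern above can be tried out in isolation. The following Python sketch models a regex evaluator under two assumptions that the source does not confirm: that patterns are applied with search semantics (a match anywhere in the output), and that Python's re dialect agrees with the service's regex engine for this pattern. The function name regex_evaluator is hypothetical.

```python
import re
from typing import Optional

# The TOML string "(?i)\\b(damn|hell)\\b" decodes to this regex.
NO_SWEARING = r"(?i)\b(damn|hell)\b"


def regex_evaluator(
    output: str,
    must_match: Optional[str] = None,
    must_not_match: Optional[str] = None,
) -> bool:
    """Return True when the output satisfies every configured pattern."""
    if must_match is None and must_not_match is None:
        raise ValueError("need at least one of must_match / must_not_match")
    passed = True
    if must_match is not None:
        # Assumed search semantics: the pattern may match anywhere.
        passed = passed and re.search(must_match, output) is not None
    if must_not_match is not None:
        passed = passed and re.search(must_not_match, output) is None
    return passed


print(regex_evaluator("Hello there", must_not_match=NO_SWEARING))
print(regex_evaluator("What the HELL", must_not_match=NO_SWEARING))
```

The word boundaries keep "Hello" from tripping the filter, while the (?i) flag makes "HELL" count as a match.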

tool_use accepts behavior = "none" | "none_of" | "any" | "any_of" | "all_of"; tools = [...] is required for every behavior except none and any. regex needs at least one of must_match / must_not_match; when both are set, the output must satisfy both (logical AND). llm_judge and typescript evaluators also take an optimize direction (max or min), which is used by adaptive stopping and by optimization jobs that target an evaluator’s score.
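
One plausible reading of the five tool_use behaviors, written as a Python sketch. The semantics are inferred from the behavior names and are an assumption, not a confirmed description of the evaluator; tool_use_evaluator and its parameters are hypothetical names.

```python
from typing import Iterable

BEHAVIORS = {"none", "none_of", "any", "any_of", "all_of"}


def tool_use_evaluator(
    called: Iterable[str], behavior: str, tools: Iterable[str] = ()
) -> bool:
    """Check the tool names called during an inference against a behavior."""
    if behavior not in BEHAVIORS:
        raise ValueError(f"unknown behavior: {behavior!r}")
    called_set = set(called)
    if behavior == "none":  # assumed: no tool was called at all
        return not called_set
    if behavior == "any":  # assumed: at least one tool was called
        return bool(called_set)
    wanted = set(tools)
    if not wanted:
        raise ValueError(f"behavior {behavior!r} requires a non-empty tools list")
    if behavior == "none_of":  # assumed: none of the listed tools was called
        return called_set.isdisjoint(wanted)
    if behavior == "any_of":  # assumed: at least one listed tool was called
        return not called_set.isdisjoint(wanted)
    return wanted <= called_set  # all_of: every listed tool was called


print(tool_use_evaluator(["web_search", "calculator"], "any_of", ["web_search"]))
```

Under this reading, the calls_search_tool evaluator from the config passes as soon as web_search appears among the tool calls, regardless of what else was called.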

For multi-step agents, start a run and tag each step (episode) as it happens, rather than scoring one inference in isolation:

curl -X POST http://localhost:3000/workflow_evaluation_run \
  -H "Content-Type: application/json" \
  -d '{"variants": {"my_function": "my_variant"}, "project_name": "my_agent"}'

This returns a run_id. Subsequent inferences tagged into that run via POST /workflow_evaluation_run/{run_id}/episode are grouped together, so you can evaluate an entire agent trajectory rather than a single call.
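
The same flow can be scripted from Python with only the standard library. This is a sketch under stated assumptions: the response is assumed to be JSON with a top-level "run_id" field, and the helper names are hypothetical. The episode request body is not described here, so the sketch only builds the episode URL.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:3000"  # same gateway address as the curl example


def start_run_request(
    variants: dict[str, str], project_name: str, base_url: str = GATEWAY_URL
) -> urllib.request.Request:
    """Build the POST /workflow_evaluation_run request without sending it."""
    body = json.dumps({"variants": variants, "project_name": project_name})
    return urllib.request.Request(
        f"{base_url}/workflow_evaluation_run",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_run(
    variants: dict[str, str], project_name: str, base_url: str = GATEWAY_URL
) -> str:
    """Send the request and return the run id (needs a running gateway)."""
    request = start_run_request(variants, project_name, base_url)
    with urllib.request.urlopen(request) as response:
        # Assumption: the response body is JSON containing "run_id".
        return json.load(response)["run_id"]


def episode_url(run_id: str, base_url: str = GATEWAY_URL) -> str:
    """URL of the episode endpoint for an existing run."""
    return f"{base_url}/workflow_evaluation_run/{run_id}/episode"


request = start_run_request({"my_function": "my_variant"}, "my_agent")
print(request.get_method(), request.full_url)
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a gateway running.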

Run metadata and scores land in inference_evaluation_runs (plus workflow_evaluation_run_episodes for workflow runs), and any human-in-the-loop feedback lands in inference_evaluation_human_feedback. All of these tables are queryable from the same Postgres database as everything else.