Evaluations
The LLM Service Daemon (LSD) supports two levels of evaluation: scoring a single inference against a datapoint, and grouping a sequence of inferences into a multi-step workflow run that gets scored as a whole.
Inference-level evaluations
Run with the standalone evaluations binary, against either an embedded config or a running gateway:

    cargo run -p evaluations -- \
      --config-file ./config/lsd.toml \
      --gateway-url http://localhost:3000 \
      --evaluation-name my_evaluation \
      --dataset-name my_dataset \
      --variant-name my_variant \
      --concurrency 4

Useful flags:

- --function-name + --evaluator-names: run specific evaluators without a named evaluation config
- --datapoint-ids: evaluate specific datapoints instead of a whole dataset
- --format pretty|jsonl: human-readable or machine-parseable output
- --adaptive-stopping-precision evaluator=target: stop early once an evaluator’s confidence interval is tight enough
- --cutoffs evaluator=threshold: exit non-zero if an evaluator’s score falls below a threshold (useful in CI)
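
The two statistical flags can be illustrated with a small Python sketch. It is not LSD's implementation: the source does not say which interval LSD computes, so this assumes a 95% normal-approximation interval on the mean score. The function names (ci_half_width, evaluate_with_early_stop, parse_cutoff, failed_cutoffs) and the min_samples guard are hypothetical.

```python
import math
import statistics


def ci_half_width(scores, z=1.96):
    """Half-width of a normal-approximation confidence interval for the mean.

    Assumption: z=1.96 (roughly 95%); LSD's actual statistic may differ.
    """
    if len(scores) < 2:
        return math.inf  # spread cannot be estimated from fewer than two scores
    return z * statistics.stdev(scores) / math.sqrt(len(scores))


def evaluate_with_early_stop(score_stream, target_precision, min_samples=10):
    """Consume scores until the interval half-width is <= target_precision."""
    scores = []
    for score in score_stream:
        scores.append(float(score))
        if len(scores) >= min_samples and ci_half_width(scores) <= target_precision:
            break  # interval is tight enough; skip the remaining datapoints
    if not scores:
        raise ValueError("no scores were produced")
    return statistics.fmean(scores), len(scores)


def parse_cutoff(spec):
    """Split one 'evaluator=threshold' pair into (name, threshold)."""
    name, _, raw = spec.partition("=")
    if not name or not raw:
        raise ValueError(f"expected evaluator=threshold, got {spec!r}")
    return name, float(raw)


def failed_cutoffs(mean_scores, cutoffs):
    """Names of evaluators whose mean score is below their threshold."""
    return sorted(name for name, threshold in cutoffs.items()
                  if mean_scores[name] < threshold)
```

A CI wrapper built on this idea would exit non-zero whenever failed_cutoffs returns a non-empty list, which mirrors the behavior described for --cutoffs.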
Evaluator types
| type | Behavior |
|---|---|
| exact_match | Boolean: output matches the expected output exactly |
| regex | Boolean: output matches a pattern |
| tool_use | Boolean: validates the tool calls made during the inference |
| llm_judge | Boolean or float: another LLM scores the output against a rubric; checks for existing human feedback before calling the judge |
| typescript | Boolean or float: custom scoring logic you provide |
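
The llm_judge row describes a lookup before the judge call. A minimal Python sketch of that ordering follows. It assumes that stored human feedback is used in place of a judge score. The key used to look up feedback and the names judge_with_feedback_check, human_feedback and call_judge are hypothetical, and LSD may combine the two sources differently.

```python
from typing import Callable, Mapping, Union

Score = Union[bool, float]


def judge_with_feedback_check(
    inference_key: str,
    human_feedback: Mapping[str, Score],
    call_judge: Callable[[str], Score],
) -> Score:
    """Return stored human feedback if present, otherwise ask the judge."""
    # Assumption: a human score, when one exists, takes precedence.
    if inference_key in human_feedback:
        return human_feedback[inference_key]
    # No human feedback on record, so fall back to the judge model.
    return call_judge(inference_key)
```

Checking feedback first avoids paying for a judge call when a human has already scored the same output.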
Evaluator configuration

    [evaluations.my_evaluation.evaluators.matches_expected]
    type = "exact_match"

    [evaluations.my_evaluation.evaluators.no_swearing]
    type = "regex"
    must_not_match = "(?i)\\b(damn|hell)\\b"

    [evaluations.my_evaluation.evaluators.calls_search_tool]
    type = "tool_use"
    behavior = "any_of"
    tools = ["web_search"]

    [evaluations.my_evaluation.evaluators.quality_judge]
    type = "llm_judge"
    input_format = "messages"  # serialized | messages
    output_type = "float"      # boolean | float
    optimize = "max"
    include = { reference_output = true }

tool_use accepts behavior = "none" | "none_of" | "any" | "any_of" | "all_of", with tools = [...] required for every behavior except none/any. regex needs at least one of must_match / must_not_match; both apply as a logical AND. llm_judge and typescript evaluators also take an optimize direction (max/min) used by adaptive stopping and by optimization jobs that target an evaluator’s score.
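
The regex and tool_use rules above can be written out as a Python sketch. The readings of the five behavior values are assumptions inferred from their names (none: no tool was called, any: at least one tool was called, and the *_of variants compare against the tools list). The use of an unanchored search for regex is also an assumption. The function names regex_evaluator and tool_use_evaluator are hypothetical.

```python
import re
from typing import Optional, Sequence


def regex_evaluator(output: str,
                    must_match: Optional[str] = None,
                    must_not_match: Optional[str] = None) -> bool:
    """Pass only if every configured condition holds (logical AND)."""
    if must_match is None and must_not_match is None:
        raise ValueError("regex needs at least one of must_match / must_not_match")
    ok = True
    if must_match is not None:
        ok = ok and re.search(must_match, output) is not None
    if must_not_match is not None:
        ok = ok and re.search(must_not_match, output) is None
    return ok


def tool_use_evaluator(called: Sequence[str], behavior: str,
                       tools: Sequence[str] = ()) -> bool:
    """Check the tools called during an inference against a behavior."""
    called_set, wanted = set(called), set(tools)
    if behavior == "none":        # assumed: no tool calls at all
        return not called_set
    if behavior == "any":         # assumed: at least one tool call
        return bool(called_set)
    if not wanted:
        raise ValueError(f"behavior {behavior!r} requires a non-empty tools list")
    if behavior == "none_of":     # assumed: no listed tool was called
        return called_set.isdisjoint(wanted)
    if behavior == "any_of":      # assumed: at least one listed tool was called
        return not called_set.isdisjoint(wanted)
    if behavior == "all_of":      # assumed: every listed tool was called
        return wanted <= called_set
    raise ValueError(f"unknown behavior {behavior!r}")
```

With the no_swearing pattern from the config, the word boundaries mean that "hello" passes while a standalone "hell" fails.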
Workflow evaluations
For multi-step agents, start a run and tag each step (episode) as it happens, rather than scoring one inference in isolation:

    curl -X POST http://localhost:3000/workflow_evaluation_run \
      -H "Content-Type: application/json" \
      -d '{"variants": {"my_function": "my_variant"}, "project_name": "my_agent"}'

This returns a run_id. Subsequent inferences tagged into that run via POST /workflow_evaluation_run/{run_id}/episode are grouped together, so you can evaluate an entire agent trajectory rather than a single call.
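
The same two calls can be sketched in Python with only the standard library. The endpoint paths and the start-run body come from the text above. The empty JSON body sent to the episode endpoint, the shape of the responses, and the helper names (build_start_run_request, build_episode_request, send_json) are assumptions for illustration.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:3000"
JSON_HEADERS = {"Content-Type": "application/json"}


def build_start_run_request(variants, project_name, gateway_url=GATEWAY_URL):
    """Build the POST that starts a workflow evaluation run."""
    body = json.dumps({"variants": variants, "project_name": project_name})
    return urllib.request.Request(
        f"{gateway_url}/workflow_evaluation_run",
        data=body.encode("utf-8"),
        headers=JSON_HEADERS,
        method="POST",
    )


def build_episode_request(run_id, gateway_url=GATEWAY_URL):
    """Build the POST that opens an episode inside an existing run."""
    # Assumption: the endpoint accepts an empty JSON object as its body.
    return urllib.request.Request(
        f"{gateway_url}/workflow_evaluation_run/{run_id}/episode",
        data=b"{}",
        headers=JSON_HEADERS,
        method="POST",
    )


def send_json(request):
    """Send a request and decode the JSON response (needs a running gateway)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

Against a running gateway, send_json(build_start_run_request({"my_function": "my_variant"}, "my_agent")) would return the response that carries the run_id, which is then passed to build_episode_request for each step of the agent.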
Where results go
Run metadata and scores land in inference_evaluation_runs (plus workflow_evaluation_run_episodes for workflow runs); human-in-the-loop feedback goes to inference_evaluation_human_feedback. All of these tables live in the same Postgres database as the rest of LSD's data, so they can be queried alongside it.