—
- From
- —
- Date
- —
- Subject
- —
Your review
Pass if the expected label and model behavior look right. Fail if anything is substantively wrong. Defer if the product rule itself is ambiguous.
A hands-on walkthrough of the first Family Pulse evaluation: what went into the model, what came back, how each answer was scored, and what this tiny perfect score does—and does not—tell us.
Here the question is narrow: given the same compact email context Family Pulse currently sends to Haiku, can a local model decide whether the message matters and assign the correct product category?
One input shape and one JSON output schema.
Write emails with human-authored expected labels.
Temperature 0, fixed seed, same prompt and context.
Capture raw output, timing, tokens, and errors.
Compute exact checks, then review the traces.
A model is not graded on whether its prose sounds plausible. It must satisfy the product contract exactly. The main metric is exact accuracy: both relevance and category must match the expected answer.
| Criterion | Scored? | Rule |
|---|---|---|
| Valid JSON | Gate | Parses and contains a boolean relevant, allowed category, and 0–1 confidence. |
| Relevance accuracy | Yes | Predicted relevant exactly equals the human label. |
| Category accuracy | Yes | Predicted category exactly equals the expected category, including null. |
| Exact accuracy | Primary | Both relevance and category are correct for the same trace. |
| Confidence | Inspect | Captured but not scored until its meaning is explicitly defined. |
| Latency / tok·s | Inspect | Measures usability on this Mac, not answer quality. |
# Per trace valid_json = parses(output) && matches_schema(output) relevance_right = output.relevant === expected.relevant category_right = output.category === expected.category exact_pass = valid_json && relevance_right && category_right # Across the dataset exact_accuracy = exact_passes / total_cases
0.0 for correctly rejected emails, while the 9B emits 1.0. One reads confidence as “probability this is relevant”; the other reads it as “confidence in my classification.” The prompt currently permits both interpretations.The smaller model was faster. The larger model was more emphatic, but not more accurate on these straightforward examples. That means the current dataset is good enough to prove feasibility and too easy to choose a winner.
| Model | Valid JSON | Relevance | Exact | Avg latency | Throughput |
|---|
Computed metrics tell you where to look. Human review tells you whether the expected label was sensible, whether the model reached the right product behavior, and what the aggregate score conceals.
Pass if the expected label and model behavior look right. Fail if anything is substantively wrong. Defer if the product rule itself is ambiguous.
The evaluation machinery is reusable. The starter evidence is synthetic. New Ollama models can be added by name, and the fixture file can be replaced with any labeled dataset that follows the same input and expected-output shape.
# Try a future local model on the starter set npm run eval:local -- qwen3.5:4b gemma4:12b # Run a private labeled dataset and keep its report private npm run eval:local -- --cases data/evals/private/classifier.json \ qwen3.5:4b qwen3.5:9b \ --report-json data/evals/private/classifier-run.json
data/evals/private/, which this repo now ignores. Only sanitized aggregate findings should move into the shareable HTML.This run proves that local inference is fast enough and that both models can honor the JSON contract. Production confidence comes from representative data, explicit product rules, and deliberate analysis of mistakes.
Real household inboxes contain forwarded chains, truncated bodies, promotional camouflage, malformed dates, and mixed intent.
One mistake would move accuracy by 8.3 percentage points. The sample is far too small for a stable estimate.
This run does not yet compare local output against the current Haiku behavior on the exact same inputs.
We have not defined when low confidence should trigger a frontier-model fallback or a safe rejection.