Family Pulse — Local Classifier Eval Walkthrough

The process

An eval is a repeatable question with an answer key.

Here the question is narrow: given the same compact email context Family Pulse currently sends to Haiku, can a local model decide whether the message matters and assign the correct product category?

1. Choose the contract

One input shape and one JSON output schema.

2. Create cases

Write emails with human-authored expected labels.

3. Freeze settings

Temperature 0, fixed seed, same prompt and context.

4. Run models

Capture raw output, timing, tokens, and errors.

5. Score + inspect

Compute exact checks, then review the traces.
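The contract and frozen settings from steps 1 and 3 can be pinned in one request builder so that no run drifts from them. This is a minimal sketch, assuming the models are called through Ollama's `/api/chat` endpoint with a JSON schema passed as `format`. The names `RESPONSE_SCHEMA`, `FROZEN`, and `buildChatRequest` and the seed value 42 are hypothetical, and the real schema would also restrict `category` to the allowed product categories, which are not listed in this walkthrough.

```typescript
// Hypothetical sketch: names and the seed value are assumptions, not Family Pulse code.

// Output contract from the rubric: boolean `relevant`, `category` (string or null),
// and a 0–1 `confidence`. The scorer should re-validate this shape itself rather
// than trust the server to enforce every schema keyword.
const RESPONSE_SCHEMA = {
  type: "object",
  properties: {
    relevant: { type: "boolean" },
    category: { type: ["string", "null"] },
    confidence: { type: "number", minimum: 0, maximum: 1 },
  },
  required: ["relevant", "category", "confidence"],
} as const;

// Frozen once, reused for every model and every case.
const FROZEN = { temperature: 0, seed: 42 } as const;

interface ChatRequest {
  model: string;
  stream: false;
  format: typeof RESPONSE_SCHEMA;
  options: typeof FROZEN;
  messages: { role: "system" | "user"; content: string }[];
}

// Only the model name and the email context vary between calls.
function buildChatRequest(
  model: string,
  systemPrompt: string,
  emailContext: string,
): ChatRequest {
  return {
    model,
    stream: false,
    format: RESPONSE_SCHEMA,
    options: FROZEN,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: emailContext },
    ],
  };
}
```

Because every request goes through the same builder, two models given the same case differ only in the `model` field, which is what makes their traces comparable.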

Why start here? Classification is bounded, cheap, and objectively scoreable. That makes it a better first local-model experiment than asking a small model to own the full Morning Brief voice or an open-ended assistant conversation.

The rubric

Four checks; one primary pass.

A model is not graded on whether its prose sounds plausible. It must satisfy the product contract exactly. The main metric is exact accuracy: both relevance and category must match the expected answer.

Criterion	Scored?	Rule
Valid JSON	Gate	Parses and contains a boolean `relevant`, allowed `category`, and 0–1 `confidence`.
Relevance accuracy	Yes	Predicted `relevant` exactly equals the human label.
Category accuracy	Yes	Predicted category exactly equals the expected category, including `null`.
Exact accuracy	Primary	Both relevance and category are correct for the same trace.
Confidence	Inspect	Captured but not scored until its meaning is explicitly defined.
Latency / tok/s	Inspect	Measures usability on this Mac, not answer quality.

# Per trace
valid_json       = parses(output) && matches_schema(output)
relevance_right  = output.relevant === expected.relevant
category_right   = output.category === expected.category
exact_pass       = valid_json && relevance_right && category_right

# Across the dataset
exact_accuracy   = exact_passes / total_cases
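The pseudocode above can be turned into a runnable scorer. The following is a sketch under stated assumptions: the `Trace` shape, the function names, and the caller-supplied list of allowed categories are all hypothetical, since the walkthrough does not list the real categories or the trace format. One deliberate choice here is that an output failing the JSON gate counts as wrong on relevance and category as well, so it can never pass.

```typescript
// Hypothetical sketch of the per-trace checks; shapes and names are assumptions.
interface Label { relevant: boolean; category: string | null }
interface Trace { rawOutput: string; expected: Label }
interface TraceScore {
  validJson: boolean;
  relevanceRight: boolean;
  categoryRight: boolean;
  exactPass: boolean;
}

// Gate: the output must parse and match the contract before anything is scored.
function parseOutput(raw: string, allowed: readonly string[]): Label | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof value !== "object" || value === null) return null;
  const o = value as Record<string, unknown>;
  const relevant = o["relevant"];
  const category = o["category"];
  const confidence = o["confidence"];
  const categoryOk =
    category === null || (typeof category === "string" && allowed.includes(category));
  const confidenceOk =
    typeof confidence === "number" && confidence >= 0 && confidence <= 1;
  if (typeof relevant !== "boolean" || !categoryOk || !confidenceOk) return null;
  return { relevant, category: category as string | null };
}

function scoreTrace(trace: Trace, allowed: readonly string[]): TraceScore {
  const output = parseOutput(trace.rawOutput, allowed);
  const validJson = output !== null;
  const relevanceRight = output !== null && output.relevant === trace.expected.relevant;
  const categoryRight = output !== null && output.category === trace.expected.category;
  return {
    validJson,
    relevanceRight,
    categoryRight,
    exactPass: validJson && relevanceRight && categoryRight,
  };
}

// Primary metric: share of traces where relevance and category are both right.
function exactAccuracy(traces: readonly Trace[], allowed: readonly string[]): number {
  if (traces.length === 0) return 0;
  const passes = traces.filter((t) => scoreTrace(t, allowed).exactPass).length;
  return passes / traces.length;
}
```

Confidence is validated for range but otherwise ignored, matching the rubric's decision to capture it without scoring it.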

The first useful finding is not the 100% score. The 4B model emits confidence 0.0 for correctly rejected emails, while the 9B emits 1.0. The 4B is treating confidence as “probability this is relevant”; the 9B is treating it as “confidence in my classification.” The prompt currently permits both interpretations.

The run

Both models cleared the smoke test.

The smaller model was faster. The larger model was more emphatic, but not more accurate on these straightforward examples. That means the current dataset is good enough to prove feasibility and too easy to choose a winner.
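For reference, the latency and throughput columns can be derived from the timing fields Ollama returns with a non-streaming response, where durations are reported in nanoseconds. This is a sketch only: the helper names are hypothetical, and whether this project averages per trace or aggregates across the run is an assumption.

```typescript
// Timing fields as returned by Ollama for a completed response (nanoseconds).
interface OllamaTiming {
  total_duration: number; // wall time for the whole request
  eval_count: number;     // tokens generated
  eval_duration: number;  // time spent generating those tokens
}

const NS_PER_SECOND = 1e9;
const NS_PER_MS = 1e6;

// Generation throughput for one trace, in tokens per second.
function tokensPerSecond(t: OllamaTiming): number {
  if (t.eval_duration <= 0) return 0;
  return t.eval_count / (t.eval_duration / NS_PER_SECOND);
}

// Mean end-to-end latency across traces, in milliseconds (assumed aggregation).
function averageLatencyMs(timings: readonly OllamaTiming[]): number {
  if (timings.length === 0) return 0;
  const totalNs = timings.reduce((sum, t) => sum + t.total_duration, 0);
  return totalNs / timings.length / NS_PER_MS;
}
```

Both numbers describe usability on one machine; neither says anything about whether the answer was right.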

Model	Valid JSON	Relevance	Exact	Avg latency	Throughput


Trace review

Open the black box, one email at a time.

Computed metrics tell you where to look. Human review tells you whether the expected label was sensible, whether the model reached the right product behavior, and what the aggregate score conceals.


Your review

Pass if the expected label and model behavior look right. Fail if anything is substantively wrong. Defer if the product rule itself is ambiguous.

Reviews auto-save in this browser.

Reusable vs. mocked

Built to rerun—not to admire once.

The evaluation machinery is reusable. The starter evidence is synthetic. New Ollama models can be added by name, and the fixture file can be replaced with any labeled dataset that follows the same input and expected-output shape.

Reusable now

Arbitrary Ollama model names
Configurable labeled JSON dataset
Frozen prompt, schema, seed, and settings
Raw traces, exact scoring, and timing
Interactive human review and export

Mocked today

Twelve hand-written synthetic emails
Straightforward category boundaries
No current Haiku comparison
No historical Family Pulse outcomes
No observed production failure mix

Production gate

Representative Gmail sample
Prior Brief output paired to source messages
Human labels written independently
Shared production preprocessing contract
Fallback thresholds and held-out regression set

# Try a future local model on the starter set
npm run eval:local -- qwen3.5:4b gemma4:12b

# Run a private labeled dataset and keep its report private
npm run eval:local -- --cases data/evals/private/classifier.json \
  qwen3.5:4b qwen3.5:9b \
  --report-json data/evals/private/classifier-run.json

Private-data boundary: The ideal next corpus can pair actual Gmail source messages with Family Pulse classifications, extractions, and delivered Morning Brief entries. Keep both the dataset and generated trace report under data/evals/private/, which this repo now ignores. Only sanitized aggregate findings should move into the shareable HTML.

What comes next

A smoke test opens the door. It does not settle the case.

This run proves that local inference is fast enough and that both models can honor the JSON contract. Production confidence comes from representative data, explicit product rules, and deliberate analysis of mistakes.

Synthetic and tidy

Real household inboxes contain forwarded chains, truncated bodies, promotional camouflage, malformed dates, and mixed intent.

Only twelve cases

One mistake would move accuracy by 8.3 percentage points. The sample is far too small for a stable estimate.
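The 8.3-point figure is simply 1/12, and a confidence interval makes "too small" concrete. The sketch below uses the Wilson score interval, a standard choice for small binomial samples; it is offered as one reasonable way to quantify the uncertainty, not as a method this eval already uses.

```typescript
// Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%).
function wilsonInterval(
  passes: number,
  total: number,
  z = 1.96,
): { lower: number; upper: number } {
  if (total <= 0) throw new Error("total must be positive");
  const p = passes / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const centre = (p + z2 / (2 * total)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  return { lower: Math.max(0, centre - half), upper: Math.min(1, centre + half) };
}

// Each of the twelve cases is worth 100 / 12 percentage points.
const stepPerCase = 100 / 12; // about 8.33

// A perfect 12-of-12 run: the lower bound comes out near 0.76.
const perfectRun = wilsonInterval(12, 12);
```

In other words, a clean sweep of twelve cases is still compatible with a true accuracy of roughly 76%, which is why the score cannot separate the two models.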

No cloud baseline

This run does not yet compare local output against the current Haiku behavior on the exact same inputs.

No threshold policy

We have not defined when low confidence should trigger a frontier-model fallback or a safe rejection.

Collect 100–300 representative messages. Sample across categories and obvious negatives; redact or keep the dataset strictly local.
Label before looking at model output. Write the human answer key independently to avoid letting the model influence ground truth.
Write down ambiguous product rules. Decide, for example, whether paid receipts and school newsletters always count as actionable.
Compare 4B, 9B, and current Haiku. Use the same frozen dataset, prompt, schema, and per-case review process.
Group failures, then change one thing. Choose among prompt, preprocessing, threshold, model, or training data; avoid changing all of them at once.
Keep a held-out set. Do not tune on every example. Reserve unseen cases to verify that improvements generalize.
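One way to keep the held-out set stable as the corpus grows is to assign each case by hashing its identifier, so a case never migrates between sets when new messages are added. This is a sketch under assumptions: it presumes each case has a stable `id` field, and the 20% default and the function names are hypothetical. The hash is a 32-bit FNV-1a variant applied to UTF-16 code units rather than bytes.

```typescript
// 32-bit FNV-1a over UTF-16 code units: deterministic, no dependencies.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned
}

// A case is held out when its hash bucket (0–99) falls below the percentage.
function isHeldOut(caseId: string, heldOutPercent = 20): boolean {
  return fnv1a(caseId) % 100 < heldOutPercent;
}

// Partition cases into a tuning set and a held-out regression set.
function splitCases<T extends { id: string }>(
  cases: readonly T[],
  heldOutPercent = 20,
): { tuning: T[]; heldOut: T[] } {
  const tuning: T[] = [];
  const heldOut: T[] = [];
  for (const c of cases) {
    (isHeldOut(c.id, heldOutPercent) ? heldOut : tuning).push(c);
  }
  return { tuning, heldOut };
}
```

The actual proportion on a small corpus will wander around the target percentage, so it is worth checking the held-out count after the first split.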