Family Pulse / Eval notebook
Eval 001 · Local classifier

Can a small model sort the family inbox?

A hands-on walkthrough of the first Family Pulse evaluation: what went into the model, what came back, how each answer was scored, and what this tiny perfect score does—and does not—tell us.

Data: 12 synthetic emails
Privacy: No inbox content

12 test cases
2 local models
100% exact match, both models
0 invalid responses
01
The process

An eval is a repeatable question with an answer key.

Here the question is narrow: given the same compact email context Family Pulse currently sends to Haiku, can a local model decide whether the message matters and assign the correct product category?

  1. Choose the contract. One input shape and one JSON output schema.
  2. Create cases. Write emails with human-authored expected labels.
  3. Freeze settings. Temperature 0, fixed seed, same prompt and context.
  4. Run models. Capture raw output, timing, tokens, and errors.
  5. Score + inspect. Compute exact checks, then review the traces.
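Steps 3 and 4 can be sketched as a request builder plus a timing helper. This is a minimal sketch, not the notebook's actual runner. It assumes Ollama's `/api/chat` endpoint with `stream: false`, a JSON schema passed in `format`, and the `eval_count` / `eval_duration` fields that Ollama reports (duration in nanoseconds). The seed value and all function and type names are hypothetical.

```typescript
// Hypothetical helper names; only the request fields mirror Ollama's /api/chat API.
interface ChatRequest {
  model: string;
  stream: false;
  format: Record<string, unknown>; // JSON schema the response must follow
  options: { temperature: number; seed: number };
  messages: { role: "system" | "user"; content: string }[];
}

// Step 3: freeze everything except the model name. The seed value is an assumption.
const FROZEN_OPTIONS = { temperature: 0, seed: 42 } as const;

function buildChatRequest(
  model: string,
  systemPrompt: string,
  emailContext: string,
  schema: Record<string, unknown>,
): ChatRequest {
  return {
    model,
    stream: false,
    format: schema,
    options: { ...FROZEN_OPTIONS },
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: emailContext },
    ],
  };
}

// Step 4: convert Ollama's eval_count and eval_duration (nanoseconds) to tokens per second.
function tokensPerSecond(evalCount: number, evalDurationNs: number): number {
  return evalDurationNs > 0 ? evalCount / (evalDurationNs / 1e9) : 0;
}
```

Because only `model` varies between calls, any difference between two runs on the same case can be attributed to the model rather than to the prompt or sampling settings.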

Why start here? Classification is bounded, cheap, and objectively scoreable. That makes it a better first local-model experiment than asking a small model to own the full Morning Brief voice or an open-ended assistant conversation.
02
The rubric

Four checks; one primary pass.

A model is not graded on whether its prose sounds plausible. It must satisfy the product contract exactly. The main metric is exact accuracy: both relevance and category must match the expected answer.

Criterion          | Scored? | Rule
-------------------|---------|-----
Valid JSON         | Gate    | Parses and contains a boolean relevant, an allowed category, and a 0–1 confidence.
Relevance accuracy | Yes     | Predicted relevant exactly equals the human label.
Category accuracy  | Yes     | Predicted category exactly equals the expected category, including null.
Exact accuracy     | Primary | Both relevance and category are correct for the same trace.
Confidence         | Inspect | Captured but not scored until its meaning is explicitly defined.
Latency / tok/s    | Inspect | Measures usability on this Mac, not answer quality.
# Per trace
valid_json       = parses(output) && matches_schema(output)
relevance_right  = output.relevant === expected.relevant
category_right   = output.category === expected.category
exact_pass       = valid_json && relevance_right && category_right

# Across the dataset
exact_accuracy   = exact_passes / total_cases
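The pseudocode above could be made runnable roughly as follows. This is a sketch under stated assumptions: the category names are placeholders (the real allowed list lives in the enforced response schema), and the function and type names are illustrative rather than taken from the runner.

```typescript
// Placeholder categories; the real allowed list is defined by the response schema.
const ALLOWED_CATEGORIES = ["school", "bills", "appointments"] as const;
type AllowedCategory = (typeof ALLOWED_CATEGORIES)[number];

interface Label { relevant: boolean; category: AllowedCategory | null }
interface Prediction extends Label { confidence: number }

function isAllowedCategory(v: unknown): v is AllowedCategory {
  return typeof v === "string" && ALLOWED_CATEGORIES.some((c) => c === v);
}

// Gate: return the parsed prediction, or null if the output breaks the contract.
function parsePrediction(raw: string): Prediction | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) return null;
  const { relevant, category, confidence } = value as Record<string, unknown>;
  if (typeof relevant !== "boolean") return null;
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) return null;
  let cat: AllowedCategory | null;
  if (category === null) cat = null;
  else if (isAllowedCategory(category)) cat = category;
  else return null;
  return { relevant, category: cat, confidence };
}

// Per trace: valid JSON, relevance, and category must all hold.
function exactPass(raw: string, expected: Label): boolean {
  const p = parsePrediction(raw);
  return p !== null && p.relevant === expected.relevant && p.category === expected.category;
}

// Across the dataset: share of traces that pass exactly.
function exactAccuracy(traces: { raw: string; expected: Label }[]): number {
  if (traces.length === 0) return 0;
  return traces.filter((t) => exactPass(t.raw, t.expected)).length / traces.length;
}
```

Note that an invalid response counts as a failed case rather than being dropped from the denominator, which matches `exact_pass = valid_json && ...` in the rubric.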
The first useful finding is not the 100% score. The 4B model emits confidence 0.0 for correctly rejected emails, while the 9B emits 1.0. One reads confidence as “probability this is relevant”; the other reads it as “confidence in my classification.” The prompt currently permits both interpretations.
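To see why the two values are not comparable as emitted, the following sketch maps both readings onto one scale, the estimated probability that the email is relevant. The reading names and the function are hypothetical, and the mapping is only an illustration of the ambiguity, not a substitute for defining the field in the prompt.

```typescript
// Two readings of the same field, as described above:
//   "p_relevant"     : confidence is the probability that the email is relevant
//   "self_certainty" : confidence is how sure the model is of its own answer
type ConfidenceReading = "p_relevant" | "self_certainty";

// Convert either reading to a single scale: estimated probability of relevance.
function toRelevanceProbability(
  relevant: boolean,
  confidence: number,
  reading: ConfidenceReading,
): number {
  if (reading === "p_relevant") return confidence;
  return relevant ? confidence : 1 - confidence;
}

// A correctly rejected email, as each model might report it:
const from4B = toRelevanceProbability(false, 0.0, "p_relevant");     // 0
const from9B = toRelevanceProbability(false, 1.0, "self_certainty"); // 0
```

Once the reading is taken into account, the two raw values 0.0 and 1.0 express the same belief, which is why confidence stays unscored until the prompt pins down one meaning.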
03
The run

Both models cleared the smoke test.

The smaller model was faster. The larger model was more emphatic, but not more accurate on these straightforward examples. The current dataset is therefore good enough to show feasibility, but too easy to separate the two models.

Model | Valid JSON | Relevance | Exact | Avg latency | Throughput
04
Trace review

Open the black box, one email at a time.

Computed metrics tell you where to look. Human review tells you whether the expected label was sensible, whether the model reached the right product behavior, and what the aggregate score conceals.

Your review

Pass if the expected label and model behavior look right. Fail if anything is substantively wrong. Defer if the product rule itself is ambiguous.

Reviews auto-save in this browser.
05
Reusable vs. mocked

Built to rerun—not to admire once.

The evaluation machinery is reusable. The starter evidence is synthetic. New Ollama models can be added by name, and the fixture file can be replaced with any labeled dataset that follows the same input and expected-output shape.
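As an illustration of what "the same input and expected-output shape" might mean in practice, here is a sketch of a fixture type and a strict parser. The field names (`id`, `input.sender`, `input.subject`, `input.snippet`, `expected`) are assumptions made for this example; the authoritative shape is whatever the starter fixture file uses.

```typescript
// Hypothetical fixture shape; the real field names come from the runner's fixture file.
interface EvalCase {
  id: string;
  input: { sender: string; subject: string; snippet: string };
  expected: { relevant: boolean; category: string | null };
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Parse a fixture file's text and fail loudly on the first malformed case.
function parseCases(jsonText: string): EvalCase[] {
  const data: unknown = JSON.parse(jsonText);
  if (!Array.isArray(data)) throw new Error("dataset must be a JSON array");
  const seen = new Set<string>();
  return data.map((item: unknown, i: number): EvalCase => {
    if (!isRecord(item)) throw new Error(`case ${i}: not an object`);
    const { id, input, expected } = item;
    if (typeof id !== "string" || seen.has(id)) {
      throw new Error(`case ${i}: missing or duplicate id`);
    }
    seen.add(id);
    if (!isRecord(input) || !isRecord(expected)) {
      throw new Error(`case ${id}: missing input or expected`);
    }
    const { sender, subject, snippet } = input;
    if (typeof sender !== "string" || typeof subject !== "string" || typeof snippet !== "string") {
      throw new Error(`case ${id}: input fields must be strings`);
    }
    const { relevant, category } = expected;
    if (typeof relevant !== "boolean") throw new Error(`case ${id}: relevant must be boolean`);
    // A category is either a string or an explicit null; anything else is rejected.
    const cat = typeof category === "string" ? category : category === null ? null : undefined;
    if (cat === undefined) throw new Error(`case ${id}: category must be a string or null`);
    return { id, input: { sender, subject, snippet }, expected: { relevant, category: cat } };
  });
}
```

Validating the dataset before any model runs keeps a labeling typo from surfacing later as an apparent model failure.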

Reusable now

  • Arbitrary Ollama model names
  • Configurable labeled JSON dataset
  • Frozen prompt, schema, seed, and settings
  • Raw traces, exact scoring, and timing
  • Interactive human review and export

Mocked today

  • Twelve hand-written synthetic emails
  • Straightforward category boundaries
  • No current Haiku comparison
  • No historical Family Pulse outcomes
  • No observed production failure mix

Production gate

  • Representative Gmail sample
  • Prior Brief output paired to source messages
  • Human labels written independently
  • Shared production preprocessing contract
  • Fallback thresholds and held-out regression set
# Try a future local model on the starter set
npm run eval:local -- qwen3.5:4b gemma4:12b

# Run a private labeled dataset and keep its report private
npm run eval:local -- --cases data/evals/private/classifier.json \
  qwen3.5:4b qwen3.5:9b \
  --report-json data/evals/private/classifier-run.json
Private-data boundary. The ideal next corpus can pair actual Gmail source messages with Family Pulse classifications, extractions, and delivered Morning Brief entries. Keep both the dataset and generated trace report under data/evals/private/, which this repo now ignores. Only sanitized aggregate findings should move into the shareable HTML.
06
What comes next

A smoke test opens the door. It does not settle the case.

This run shows that local inference is fast enough to be practical on this machine and that both models can honor the JSON contract on twelve synthetic cases. Production confidence comes from representative data, explicit product rules, and deliberate analysis of mistakes.

Synthetic and tidy

Real household inboxes contain forwarded chains, truncated bodies, promotional camouflage, malformed dates, and mixed intent.

Only twelve cases

One mistake would move accuracy by 8.3 percentage points. The sample is far too small for a stable estimate.
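The arithmetic can be made concrete with a confidence interval. The sketch below uses the 95% Wilson score interval, which is a choice made for this illustration and not something the notebook computes; the function name is hypothetical.

```typescript
// 95% Wilson score interval for a pass rate of `passes` out of `total`.
function wilsonInterval(passes: number, total: number, z = 1.96): [number, number] {
  if (total <= 0) throw new Error("total must be positive");
  const p = passes / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = (p + z2 / (2 * total)) / denom;
  const half = (z / denom) * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total));
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

const stepSize = 100 / 12;                  // one case is about 8.3 percentage points
const [low, high] = wilsonInterval(12, 12); // low is about 0.76, high is about 1
```

Under this interval, twelve passes out of twelve is consistent with a true accuracy anywhere from roughly 76% upward, so the perfect score cannot distinguish a very good classifier from a mediocre one.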

No cloud baseline

This run does not yet compare local output against the current Haiku behavior on the exact same inputs.

No threshold policy

We have not defined when low confidence should trigger a frontier-model fallback or a safe rejection.
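Once such a policy exists, it could be as small as the following sketch. Everything here is hypothetical: the route names, the threshold parameter, and above all the assumption that confidence means "confidence in my classification", which the current prompt does not guarantee.

```typescript
// Hypothetical policy: neither the threshold nor the routes are defined by this eval.
type Route = "accept_local" | "fallback_to_frontier";

interface LocalResult {
  validJson: boolean;
  confidence: number; // assumed here to mean "confidence in my classification"
}

function chooseRoute(result: LocalResult, minConfidence: number): Route {
  // An invalid response can never be trusted, whatever confidence it claims.
  if (!result.validJson) return "fallback_to_frontier";
  return result.confidence >= minConfidence ? "accept_local" : "fallback_to_frontier";
}
```

A defensible value for `minConfidence` would have to come from a representative labeled set, for example by measuring exact accuracy among accepted cases at several candidate thresholds.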

  1. Collect 100–300 representative messages. Sample across categories and obvious negatives; redact or keep the dataset strictly local.
  2. Label before looking at model output. Write the human answer key independently to avoid letting the model influence ground truth.
  3. Write down ambiguous product rules. Decide, for example, whether paid receipts and school newsletters always count as actionable.
  4. Compare 4B, 9B, and current Haiku. Use the same frozen dataset, prompt, schema, and per-case review process.
  5. Group failures, then change one thing. Change the prompt, preprocessing, threshold, model, or training data, but not all of them at once.
  6. Keep a held-out set. Do not tune on every example. Reserve unseen cases to verify that improvements generalize.
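Step 5 can be sketched as a small grouping helper that buckets failing traces by "expected -> predicted" so the largest bucket is reviewed first. The trace shape and field names are illustrative assumptions, not the runner's actual report format.

```typescript
// Hypothetical trace shape; field names are illustrative.
interface ScoredTrace {
  caseId: string;
  expectedCategory: string | null;
  predictedCategory: string | null; // null means the model assigned no category
  exactPass: boolean;
}

// Bucket failing traces by "expected -> predicted", largest bucket first.
function groupFailures(traces: ScoredTrace[]): [string, string[]][] {
  const buckets = new Map<string, string[]>();
  for (const t of traces) {
    if (t.exactPass) continue;
    const key = `${t.expectedCategory ?? "none"} -> ${t.predictedCategory ?? "none"}`;
    const ids = buckets.get(key) ?? [];
    ids.push(t.caseId);
    buckets.set(key, ids);
  }
  return Array.from(buckets.entries()).sort((a, b) => b[1].length - a[1].length);
}
```

If one bucket dominates, it points at a single cause to address, such as an ambiguous product rule or a preprocessing gap, which makes it easier to change one thing and rerun.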