Current model: Qwen3.6-35B-A3B MoE (UD-Q4_K_M). All agents — Frank, Mike, Jubal — run on cha0tiktower. CPU row is the prior baseline on Gemma 4 26B before the tower arrived.

| machine | gen throughput | notes |
|---|---|---|
| cha0tikhome — i5-1235U CPU (Gemma 4 26B) | 10.5 tok/s | avg over 17 daily logs (Apr 2–18); retired |
| cha0tiktower — RTX 5060 Ti 16GB (single GPU) | 74 tok/s | Apr 20 2026, Gemma 4 26B; ~7× the 10.5 tok/s CPU baseline |
| cha0tiktower — 2× RTX 5060 Ti (32GB VRAM) | ~100 tok/s | Qwen3.6-35B-A3B MoE, 65k ctx / f16 KV; ~99 tok/s at 131k ctx / q4_0 KV; 39 experiments to ceiling |
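
If you want to reproduce numbers like these, here is a minimal probe: stream a completion from an OpenAI-compatible endpoint, time the first chunk for TTFT, and count chunks for gen throughput. The URL and model name are placeholders, not the tower's real config, and one streamed chunk is only roughly one token, so treat the output as an approximation.

```python
# Minimal throughput/TTFT probe against an OpenAI-compatible server.
# ENDPOINT and the model name are placeholders, not the tower's real config.
import json
import time

import requests

ENDPOINT = "http://cha0tiktower:8080/v1/chat/completions"  # hypothetical address

def measure(prompt: str, max_tokens: int = 256) -> dict:
    payload = {
        "model": "local",  # llama.cpp-style servers typically ignore this
        "stream": True,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    t0 = time.monotonic()
    first = None
    chunks = 0
    with requests.post(ENDPOINT, json=payload, stream=True, timeout=300) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            delta = json.loads(data)["choices"][0].get("delta", {})
            if delta.get("content"):
                chunks += 1
                if first is None:
                    first = time.monotonic()  # TTFT reference point
    done = time.monotonic()
    # One SSE chunk is roughly one token, so chunks/sec approximates tok/s.
    return {
        "ttft_ms": None if first is None else (first - t0) * 1000,
        "gen_tok_s": (chunks - 1) / (done - first) if chunks > 1 else 0.0,
    }

if __name__ == "__main__":
    print(measure("Summarize Moby-Dick in three sentences."))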

CPU Daily Logs (cha0tikhome)

| Metric | Value | Notes |
|---|---|---|
| Gen throughput range | 5.0 – 11.7 tok/s | Low outlier on high-load days |
| Gen throughput avg | 10.49 tok/s | 17-day sample |
| TTFT (time to first token) | 616 ms – 2224 ms | Varies with swap pressure |
| Server RSS | 18 – 25 GiB | Model needs ~15 GB; rest is context |
| Swap used | 4.6 – 7.7 GiB | Hitting swap regularly on a 32 GB machine |
| Prefill throughput | 7.5 – 9.6 tok/s | |
CPU inference was borderline usable. Swap pressure pushed TTFT above 2 seconds on bad days. The tower fixed this completely — 15.4GB in VRAM, no swap, 51°C under load.
Raw data: inference-perf.tsv (17 daily logs) · latest JSON
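
The summary rows above can be recomputed straight from that TSV. A sketch, assuming one row per day with a gen_tok_s column; the actual column names may differ:

```python
# Recompute range/avg gen throughput from the daily logs.
# The gen_tok_s column name is an assumption about the TSV layout.
import csv
import statistics

with open("inference-perf.tsv", newline="") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

gen = [float(r["gen_tok_s"]) for r in rows]
print(f"days:  {len(gen)}")
print(f"range: {min(gen):.1f} - {max(gen):.1f} tok/s")
print(f"avg:   {statistics.mean(gen):.2f} tok/s")
```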

LongMemEval (Mike)

LongMemEval tests whether a conversational agent can recall facts from long conversation histories. The standard benchmark uses synthetic users with 500+ message histories across 50+ sessions. I ran it against Mike using the single-session-user split (25 examples).
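
The context-window runs boil down to a loop like the sketch below: flatten every session into the prompt, ask, and score. Assumptions: the examples sit in a JSONL export with LongMemEval's question / answer / haystack_sessions fields, the endpoint is the same hypothetical one as above, and grading is a naive substring check rather than the benchmark's judge model.

```python
# Context-window mode: inject the full history, then ask the question.
# Field names assume LongMemEval's schema; grading is deliberately naive.
import json

import requests

ENDPOINT = "http://cha0tiktower:8080/v1/chat/completions"  # hypothetical address

def ask(prompt: str) -> str:
    r = requests.post(ENDPOINT, json={
        "model": "local",
        "temperature": 0,
        "messages": [{"role": "user", "content": prompt}],
    }, timeout=600)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

def build_prompt(ex: dict) -> str:
    # Flatten every session into one transcript (~490k chars per query here).
    turns = [f'{t["role"]}: {t["content"]}'
             for session in ex["haystack_sessions"] for t in session]
    return "Conversation history:\n" + "\n".join(turns) + f"\n\nQuestion: {ex['question']}"

with open("single_session_user.jsonl") as f:  # hypothetical export filename
    examples = [json.loads(line) for line in f]

correct = sum(ex["answer"].lower() in ask(build_prompt(ex)).lower()
              for ex in examples)
print(f"{correct} / {len(examples)}")
```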

| mode | accuracy (25 examples) | notes |
|---|---|---|
| Baseline — no context injection | 0 / 25 | 0%; Mike has no data to draw from |
| Context-window mode — full history injected | 21 / 25 | 84%; best single run, Apr 7 2026 |
| Batched runs — Apr 12 (18 runs × 5 examples) | ~80% | 4/5 correct per batch, consistent |

What Failed

| Category | Examples | Failure mode |
|---|---|---|
| Extraction pipeline | 0/25 usable | glm-4.7-flash returned empty JSON responses; extraction silently failed, nothing was stored |
| Context-window failures | 4/25 | Facts buried deep in 550-turn context; model retrieved the wrong session or hallucinated absence |
| API errors mid-run | 1/25 | OpenRouter 400 error on lme_0017; run aborted mid-question |
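
The silent extraction failure is the most fixable part: validate the model's output before storing anything and raise when it comes back empty. A sketch; parse_extraction and its list-of-facts shape are hypothetical, not Mike's real interface.

```python
# Fail loudly on bad extraction output instead of silently storing nothing.
# The expected list-of-facts JSON shape is an assumption, not Mike's schema.
import json

class ExtractionError(RuntimeError):
    pass

def parse_extraction(raw: str) -> list[dict]:
    if not raw or not raw.strip():
        raise ExtractionError("model returned an empty response")
    try:
        facts = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ExtractionError(f"unparseable JSON: {e}") from e
    if not isinstance(facts, list) or not facts:
        raise ExtractionError("valid JSON but no facts extracted")
    return facts
```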

Latency

| Metric | Value |
|---|---|
| Avg latency per query (context-window) | 84 s – 117 s |
| Min latency | 26 s |
| Max latency | 169 s |
| Context per query | ~500 turns / 490k chars |
The extraction mode never worked cleanly in any run. glm-4.7-flash consistently returned empty responses when asked to parse 80k+ character conversation batches. This is the single biggest open problem in Mike's memory architecture.
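
One plausible mitigation, sketched below: bound the batch size before extraction instead of sending 80k+ characters at once. The 8k character budget is an illustrative assumption, not a measured threshold.

```python
# Split a long conversation into bounded chunks before extraction.
# budget_chars=8_000 is an illustrative default, not a tuned value.
def chunk_turns(turns: list[str], budget_chars: int = 8_000) -> list[list[str]]:
    chunks: list[list[str]] = []
    current: list[str] = []
    size = 0
    for turn in turns:
        if current and size + len(turn) > budget_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(turn)
        size += len(turn)
    if current:
        chunks.append(current)
    return chunks
```
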
Raw data: best run (21/25, JSONL) · baseline run (0/25, JSONL)

Orchestra Test Suite

Orchestra is a conversation capture and wiki compilation tool. The test suite covers parsers, repair logic, wiki processing, and common utilities.

| metric | result | notes |
|---|---|---|
| Total tests | 110 | all passing |
| Runtime | 0.13 s | pytest |
| Failures | 0 | as of Apr 20 2026 |
| Test file | Tests | Covers |
|---|---|---|
| test_capture.py | ~45 | Claude.ai, ChatGPT, generic JSON parsers |
| test_repair.py | ~30 | Dead link pruning, orphan linking, duplicate merge |
| test_wiki.py | ~20 | Wiki compilation, backlink injection |
| test_common.py | ~10 | Shared utilities |
| test_split.py | ~5 | Document splitting |
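
For flavor, here is what one of the capture tests might look like. The orchestra.capture import path, parse_generic_json, and the turn objects are guesses at the module layout, not the real API.

```python
# Style of test_capture.py: parse a minimal export, check normalized turns.
# orchestra.capture.parse_generic_json is a hypothetical import path.
from orchestra.capture import parse_generic_json

def test_generic_json_roundtrip():
    export = {"messages": [
        {"role": "user", "content": "hello"},
        {"role": "assistant", "content": "hi there"},
    ]}
    turns = parse_generic_json(export)
    assert [t.role for t in turns] == ["user", "assistant"]
    assert turns[0].content == "hello"
```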

Frank Benchmark Harness

Frank is a benchmarking harness for local inference endpoints. Five test categories: raw throughput, single tool invocation, chained tools, self-correction, and persona smoke tests.

| item | status | notes |
|---|---|---|
| Harness | built | 5 test categories defined |
| Execution logs | none | not yet run against the cha0tiktower GPU |
Frank was built before the tower arrived. The CPU numbers weren't worth publishing — 10 tok/s makes tool-chaining latency too high to be meaningful. First real run will happen after the dual-GPU install is stable.
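
For when the logs start landing, a sketch of how a chained-tools case could be declared. The category names mirror the five categories above, but the schema itself is an assumption, not Frank's actual one.

```python
# Hypothetical case schema for the harness; not Frank's real one.
from dataclasses import dataclass, field

@dataclass
class BenchCase:
    category: str  # throughput | tool_single | tool_chain | self_correct | persona
    prompt: str
    expected_tools: list[str] = field(default_factory=list)
    max_seconds: float = 60.0  # fail the case if the chain runs longer

CASES = [
    BenchCase(
        category="tool_chain",
        prompt="Look up the weather in Turin, then convert it to Fahrenheit.",
        expected_tools=["get_weather", "convert_units"],
    ),
]
```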