// inference speed
Current model: Qwen3.6-35B-A3B MoE (UD-Q4_K_M). All agents (Frank, Mike, Jubal) run on cha0tiktower. The CPU table below is the prior baseline on Gemma 4 26B, before the tower arrived.
CPU Daily Logs (cha0tikhome)
| Metric | Value | Notes |
|---|---|---|
| Gen throughput range | 5.0 – 11.7 tok/s | Low outlier on high-load days |
| Gen throughput avg | 10.49 tok/s | 17-day sample |
| TTFT (time to first token) | 616ms – 2224ms | Varies with swap pressure |
| Server RSS | 18 – 25 GiB | Model needs ~15 GiB; the rest is context |
| Swap used | 4.6 – 7.7 GiB | Hitting swap regularly on a 32 GiB machine |
| Prefill throughput | 7.5 – 9.6 tok/s | |
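TTFT and gen throughput fall out of the same streamed token timestamps: TTFT is the gap from request to first token, and gen tok/s is measured over the window after the first token. A minimal sketch of that split (the function name and input shape are mine, not the logging code's):

```python
def stream_metrics(t_request, token_times):
    """Compute (TTFT, gen tok/s) from a request timestamp and a list of
    per-token arrival timestamps, all in seconds.

    TTFT covers prefill; gen throughput deliberately excludes it by
    counting only the inter-token window after the first token.
    """
    ttft = token_times[0] - t_request
    gen_window = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / gen_window if gen_window > 0 else 0.0
    return ttft, tps

# Four tokens arriving at 0.5s intervals, first one 0.5s after the request:
print(stream_metrics(0.0, [0.5, 1.0, 1.5, 2.0]))  # TTFT 0.5s, 2.0 tok/s
```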
// mike — longmemeval
LongMemEval tests whether a conversational agent can recall facts from long conversation histories. The standard benchmark uses synthetic users with 500+ message histories across 50+ sessions. I ran it against Mike using the single-session-user split (25 examples).
What Failed
| Category | Examples | Failure mode |
|---|---|---|
| Extraction pipeline | 0/25 usable | glm-4.7-flash returned empty JSON responses; extraction failed silently, so nothing was stored |
| Context-window failures | 4/25 | Facts buried deep in 550-turn context; model retrieved wrong session or hallucinated absence |
| API errors mid-run | 1/25 | OpenRouter 400 error on lme_0017; run aborted mid-question |
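The extraction failure was silent: empty responses passed through and zero facts got stored. A minimal validation guard would have surfaced it immediately; this sketch assumes the extractor is supposed to return a JSON list of fact objects (the real pipeline's schema may differ):

```python
import json

class ExtractionError(RuntimeError):
    """Raised when the extractor LLM returns an unusable payload."""

def parse_extraction(raw: str):
    """Validate an extractor response instead of silently storing nothing.

    Rejects empty strings, invalid JSON, and anything that is not a
    non-empty list, so an empty model response fails loudly at ingest.
    """
    raw = raw.strip()
    if not raw:
        raise ExtractionError("extractor returned an empty response")
    try:
        facts = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ExtractionError(f"extractor returned invalid JSON: {e}") from e
    if not isinstance(facts, list) or not facts:
        raise ExtractionError("extractor returned no facts")
    return facts
```

With this in place, the 0/25 run would have aborted on the first empty glm-4.7-flash reply instead of completing with an empty store.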
Latency
| Metric | Value |
|---|---|
| Avg latency per query (context-window) | 84s – 117s |
| Min latency | 26s |
| Max latency | 169s |
| Context per query | ~500 turns / 490k chars |
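The per-query latencies above are plain wall-clock timings around each agent call. A trivial wrapper of the kind used here (the `ask` callable stands in for the agent's query interface, which is an assumption):

```python
import time

def timed_query(ask, question):
    """Run one benchmark question and return (answer, wall-clock seconds).

    `ask` is any callable that takes the question string and returns the
    agent's answer; with ~500-turn contexts this dominates total runtime.
    """
    t0 = time.perf_counter()
    answer = ask(question)
    return answer, time.perf_counter() - t0
```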
// orchestra — test suite
Orchestra is a conversation capture and wiki compilation tool. The test suite covers parsers, repair logic, wiki processing, and common utilities.
| Test file | Tests | Covers |
|---|---|---|
| test_capture.py | ~45 | Claude.ai, ChatGPT, generic JSON parsers |
| test_repair.py | ~30 | Dead link pruning, orphan linking, duplicate merge |
| test_wiki.py | ~20 | Wiki compilation, backlink injection |
| test_common.py | ~10 | Shared utilities |
| test_split.py | ~5 | Document splitting |
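As a flavor of what test_repair.py exercises, here is a toy dead-link pruner with the kind of assertion the suite would make. Both the function and its signature are hypothetical stand-ins; Orchestra's real repair API will differ:

```python
def prune_dead_links(page_links, existing_pages):
    """Drop wiki links that point at pages which no longer exist.

    Hypothetical stand-in for Orchestra's dead-link repair pass:
    keeps only links whose target is in the set of live pages.
    """
    return [link for link in page_links if link in existing_pages]

def test_prune_dead_links():
    # Links to deleted pages vanish; duplicates to live pages survive.
    live = {"home", "notes"}
    assert prune_dead_links(["home", "gone", "notes", "home"], live) == [
        "home", "notes", "home",
    ]
```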
// frank — inference harness
Frank is a benchmarking harness for local inference endpoints. Five test categories: raw throughput, single tool invocation, chained tools, self-correction, and persona smoke tests.
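For the raw-throughput category, the score is just generated tokens over wall time. A sketch of one probe against an OpenAI-compatible local endpoint; the URL, port, and model name are assumptions, not Frank's actual config:

```python
import json
import time
import urllib.request

def throughput(n_tokens, elapsed_s):
    """Raw-throughput score: generated tokens per second of wall time."""
    return n_tokens / elapsed_s if elapsed_s > 0 else 0.0

def bench_once(base_url="http://cha0tiktower:8080", prompt="Count to fifty."):
    """One raw-throughput probe against an OpenAI-compatible server.

    Assumes a llama.cpp-style /v1/chat/completions endpoint that reports
    usage.completion_tokens; Frank's real harness config may differ.
    """
    body = json.dumps({
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    t0 = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        out = json.load(resp)
    elapsed = time.perf_counter() - t0
    return throughput(out["usage"]["completion_tokens"], elapsed)
```

The tool-invocation, chaining, and self-correction categories layer scripted prompts on top of the same request path and score the responses instead of the timing.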