Current model: Qwen3.6-35B-A3B MoE (UD-Q4_K_M). All agents — Frank, Mike, Jubal — run on cha0tiktower. CPU row is the prior baseline on Gemma 4 26B before the tower arrived.

| machine | gen throughput | notes |
|---|---|---|
| cha0tikhome — i5-1235U CPU (Gemma 4 26B) | 10.5 tok/s | avg over 17 daily logs (Apr 2–18); retired |
| cha0tiktower — RTX 5060 Ti 16GB (single GPU) | 74 tok/s | Apr 20 2026, Gemma 4 26B; ~7× the 10.5 tok/s CPU baseline |
| cha0tiktower — 2× RTX 5060 Ti (32GB VRAM) | ~100 tok/s | Qwen3.6-35B-A3B MoE, 65k ctx / f16 KV; ~99 tok/s at 131k ctx / q4_0 KV; 39 experiments to ceiling |
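
If you want to reproduce numbers like these, here is a minimal probe: stream a completion from an OpenAI-compatible endpoint, time the first chunk for TTFT, and count chunks for gen throughput. The URL and model name are placeholders, not the tower's real config, and one streamed chunk is only roughly one token, so treat the output as an approximation.

```python
# Minimal throughput/TTFT probe against an OpenAI-compatible server.
# ENDPOINT and the model name are placeholders, not the tower's real config.
import json
import time

import requests

ENDPOINT = "http://cha0tiktower:8080/v1/chat/completions"  # hypothetical address

def measure(prompt: str, max_tokens: int = 256) -> dict:
    payload = {
        "model": "local",  # llama.cpp-style servers typically ignore this
        "stream": True,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    t0 = time.monotonic()
    first = None
    chunks = 0
    with requests.post(ENDPOINT, json=payload, stream=True, timeout=300) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            delta = json.loads(data)["choices"][0].get("delta", {})
            if delta.get("content"):
                chunks += 1
                if first is None:
                    first = time.monotonic()  # TTFT reference point
    done = time.monotonic()
    # One SSE chunk is roughly one token, so chunks/sec approximates tok/s.
    return {
        "ttft_ms": None if first is None else (first - t0) * 1000,
        "gen_tok_s": (chunks - 1) / (done - first) if chunks > 1 else 0.0,
    }

if __name__ == "__main__":
    print(measure("Summarize Moby-Dick in three sentences."))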

CPU Daily Logs (cha0tikhome)

| Metric | Value | Notes |
|---|---|---|
| Gen throughput range | 5.0 – 11.7 tok/s | Low outlier on high-load days |
| Gen throughput avg | 10.49 tok/s | 17-day sample |
| TTFT (time to first token) | 616 ms – 2224 ms | Varies with swap pressure |
| Server RSS | 18 – 25 GiB | Model needs ~15 GB; rest is context |
| Swap used | 4.6 – 7.7 GiB | Hitting swap regularly on a 32 GB machine |
| Prefill throughput | 7.5 – 9.6 tok/s | |
CPU inference was borderline usable. Swap pressure pushed TTFT above 2 seconds on bad days. The tower fixed this completely — 15.4GB in VRAM, no swap, 51°C under load.
Raw data: inference-perf.tsv (17 daily logs) · latest JSON
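
The summary rows above can be recomputed straight from that TSV. A sketch, assuming one row per day with a gen_tok_s column; the actual column names may differ:

```python
# Recompute range/avg gen throughput from the daily logs.
# The gen_tok_s column name is an assumption about the TSV layout.
import csv
import statistics

with open("inference-perf.tsv", newline="") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

gen = [float(r["gen_tok_s"]) for r in rows]
print(f"days:  {len(gen)}")
print(f"range: {min(gen):.1f} - {max(gen):.1f} tok/s")
print(f"avg:   {statistics.mean(gen):.2f} tok/s")
```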

LongMemEval (Mike)

LongMemEval tests whether a conversational agent can recall facts from long conversation histories. The standard benchmark uses synthetic users with 500+ message histories across 50+ sessions. I ran it against Mike using the single-session-user split (25 examples).
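
The context-window runs boil down to a loop like the sketch below: flatten every session into the prompt, ask, and score. Assumptions: the examples sit in a JSONL export with LongMemEval's question / answer / haystack_sessions fields, the endpoint is the same hypothetical one as above, and grading is a naive substring check rather than the benchmark's judge model.

```python
# Context-window mode: inject the full history, then ask the question.
# Field names assume LongMemEval's schema; grading is deliberately naive.
import json

import requests

ENDPOINT = "http://cha0tiktower:8080/v1/chat/completions"  # hypothetical address

def ask(prompt: str) -> str:
    r = requests.post(ENDPOINT, json={
        "model": "local",
        "temperature": 0,
        "messages": [{"role": "user", "content": prompt}],
    }, timeout=600)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

def build_prompt(ex: dict) -> str:
    # Flatten every session into one transcript (~490k chars per query here).
    turns = [f'{t["role"]}: {t["content"]}'
             for session in ex["haystack_sessions"] for t in session]
    return "Conversation history:\n" + "\n".join(turns) + f"\n\nQuestion: {ex['question']}"

with open("single_session_user.jsonl") as f:  # hypothetical export filename
    examples = [json.loads(line) for line in f]

correct = sum(ex["answer"].lower() in ask(build_prompt(ex)).lower()
              for ex in examples)
print(f"{correct} / {len(examples)}")
```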

| mode | accuracy (25 examples) | notes |
|---|---|---|
| Baseline — no context injection | 0 / 25 | 0%; Mike has no data to draw from |
| Context-window mode — full history injected | 21 / 25 | 84%; best single run, Apr 7 2026 |
| Batched runs — Apr 12 (18 runs × 5 examples) | ~80% | 4/5 correct per batch, consistent |

What Failed

| Category | Examples | Failure mode |
|---|---|---|
| Extraction pipeline | 0/25 usable | glm-4.7-flash returned empty JSON responses; extraction silently failed, nothing was stored |
| Context-window failures | 4/25 | Facts buried deep in 550-turn context; model retrieved the wrong session or hallucinated absence |
| API errors mid-run | 1/25 | OpenRouter 400 error on lme_0017; run aborted mid-question |
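
The silent extraction failure is the most fixable part: validate the model's output before storing anything and raise when it comes back empty. A sketch; parse_extraction and its list-of-facts shape are hypothetical, not Mike's real interface.

```python
# Fail loudly on bad extraction output instead of silently storing nothing.
# The expected list-of-facts JSON shape is an assumption, not Mike's schema.
import json

class ExtractionError(RuntimeError):
    pass

def parse_extraction(raw: str) -> list[dict]:
    if not raw or not raw.strip():
        raise ExtractionError("model returned an empty response")
    try:
        facts = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ExtractionError(f"unparseable JSON: {e}") from e
    if not isinstance(facts, list) or not facts:
        raise ExtractionError("valid JSON but no facts extracted")
    return facts
```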

Latency

| Metric | Value |
|---|---|
| Avg latency per query (context-window) | 84 s – 117 s |
| Min latency | 26 s |
| Max latency | 169 s |
| Context per query | ~500 turns / 490k chars |
The extraction mode never worked cleanly in any run. glm-4.7-flash consistently returned empty responses when asked to parse 80k+ character conversation batches. This is the single biggest open problem in Mike's memory architecture.
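
One plausible mitigation, sketched below: bound the batch size before extraction instead of sending 80k+ characters at once. The 8k character budget is an illustrative assumption, not a measured threshold.

```python
# Split a long conversation into bounded chunks before extraction.
# budget_chars=8_000 is an illustrative default, not a tuned value.
def chunk_turns(turns: list[str], budget_chars: int = 8_000) -> list[list[str]]:
    chunks: list[list[str]] = []
    current: list[str] = []
    size = 0
    for turn in turns:
        if current and size + len(turn) > budget_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(turn)
        size += len(turn)
    if current:
        chunks.append(current)
    return chunks
```
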
Raw data: best run (21/25, JSONL) · baseline run (0/25, JSONL)

Orchestra Test Suite

Orchestra is a conversation capture and wiki compilation tool. The test suite covers parsers, repair logic, wiki processing, and common utilities.

| metric | result | notes |
|---|---|---|
| Total tests | 110 | all passing |
| Runtime | 0.13 s | pytest |
| Failures | 0 | as of Apr 20 2026 |
| Test file | Tests | Covers |
|---|---|---|
| test_capture.py | ~45 | Claude.ai, ChatGPT, generic JSON parsers |
| test_repair.py | ~30 | Dead link pruning, orphan linking, duplicate merge |
| test_wiki.py | ~20 | Wiki compilation, backlink injection |
| test_common.py | ~10 | Shared utilities |
| test_split.py | ~5 | Document splitting |
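
For flavor, here is what one of the capture tests might look like. The orchestra.capture import path, parse_generic_json, and the turn objects are guesses at the module layout, not the real API.

```python
# Style of test_capture.py: parse a minimal export, check normalized turns.
# orchestra.capture.parse_generic_json is a hypothetical import path.
from orchestra.capture import parse_generic_json

def test_generic_json_roundtrip():
    export = {"messages": [
        {"role": "user", "content": "hello"},
        {"role": "assistant", "content": "hi there"},
    ]}
    turns = parse_generic_json(export)
    assert [t.role for t in turns] == ["user", "assistant"]
    assert turns[0].content == "hello"
```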

Frank Benchmark Harness

Frank is a benchmarking harness for local inference endpoints. Five test categories: raw throughput, single tool invocation, chained tools, self-correction, and persona smoke tests.

| item | status | notes |
|---|---|---|
| Harness | built | 5 test categories defined |
| Execution logs | none | not yet run against the cha0tiktower GPU |
Frank was built before the tower arrived. The CPU numbers weren't worth publishing — 10 tok/s makes tool-chaining latency too high to be meaningful. First real run will happen after the dual-GPU install is stable.
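
For when the logs start landing, a sketch of how a chained-tools case could be declared. The category names mirror the five categories above, but the schema itself is an assumption, not Frank's actual one.

```python
# Hypothetical case schema for the harness; not Frank's real one.
from dataclasses import dataclass, field

@dataclass
class BenchCase:
    category: str  # throughput | tool_single | tool_chain | self_correct | persona
    prompt: str
    expected_tools: list[str] = field(default_factory=list)
    max_seconds: float = 60.0  # fail the case if the chain runs longer

CASES = [
    BenchCase(
        category="tool_chain",
        prompt="Look up the weather in Turin, then convert it to Fahrenheit.",
        expected_tools=["get_weather", "convert_units"],
    ),
]
```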