dino.vitale
Boundary Labs — Research Lab

AI Memory Architecture & Local Inference Research

Independent research focused on persistent memory systems for AI agents, inference optimization on consumer NVIDIA GPU hardware, autonomous agent evaluation, and the emerging question of AI behavioral continuity. Operating out of Airway Heights, WA since 2024.

18+ published papers / articles
107 t/s live inference speed
39+ optimization experiments
7 production AI agents

Memory Architecture for AI Agents

Design and evaluation of multi-tier persistent memory systems enabling long-term behavioral continuity across sessions. Research includes a 3-tier architecture (Core / Recall / Archival), SQLite FTS5 memory search, memory consolidation protocols, and the ROMMC framework (Recursive Operator-Maintained Memory Continuity).

LongMemEval MESA ROMMC
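
A minimal sketch of the Recall-tier storage pattern, assuming a SQLite FTS5 table; the schema and function names are illustrative, not the production ROMMC implementation:

```python
import sqlite3

# Illustrative recall tier: an FTS5 virtual table holding memory snippets.
db = sqlite3.connect("memory.db")
db.executescript("""
    CREATE VIRTUAL TABLE IF NOT EXISTS recall
    USING fts5(content, session_id, created_at);
""")

def remember(content: str, session_id: str, created_at: str) -> None:
    db.execute("INSERT INTO recall VALUES (?, ?, ?)",
               (content, session_id, created_at))
    db.commit()

def recall_search(query: str, k: int = 5) -> list[str]:
    # bm25() ranks FTS5 matches; lower is better, so ORDER BY ascending.
    rows = db.execute(
        "SELECT content FROM recall WHERE recall MATCH ? "
        "ORDER BY bm25(recall) LIMIT ?",
        (query, k),
    ).fetchall()
    return [r[0] for r in rows]

remember("User prefers concise answers.", "s-001", "2026-04-01")
print(recall_search("concise answers"))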

Local Inference Optimization

Systematic optimization of large language model inference on consumer NVIDIA Blackwell hardware. Methods include quantization evaluation (NVFP4, GPTQ-Marlin, fp8 KV), multi-GPU tensor parallelism, speculative decoding (MTP), KV cache tuning, and autoresearch loops with automated stopping criteria.

RTX 5060 Ti vLLM llama.cpp NVFP4
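
On the serving side, a hedged example of launching vLLM with the tensor-parallel and fp8 KV settings described above (the checkpoint path is a placeholder, not a real artifact):

```bash
vllm serve /models/genesis-gptq-marlin \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 \
  --max-model-len 160000
```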

Autonomous Agent Evaluation

Development of MESA (Memory Evaluation Standard for Agents), a 112-item benchmark suite covering recall, update, causal reasoning, temporal tracking, adversarial robustness, synthesis, and interference resistance. Designed for continuous evaluation of production agent systems under realistic workloads.

MESA v1 benchmarking agent eval

Autonomous Agent Systems

Research and deployment of always-on agentic systems with real tool access (Slack, finance, email, web, shell). Includes work on agentic self-improvement — local models executing infrastructure changes and security hardening autonomously, recognizing topology changes and adjusting configuration without prompting.

production agents autonomous execution tool use
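
To make "real tool access" concrete, a minimal sketch of one OpenAI-compatible tool-call round trip; the proxy URL, the "aeon" model alias, and the run_shell tool are illustrative assumptions, not the production harness:

```python
import json
import subprocess
from openai import OpenAI

client = OpenAI(base_url="http://tower:8010/v1", api_key="local")

tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command and return stdout.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

messages = [{"role": "user", "content": "How full is the root filesystem?"}]
resp = client.chat.completions.create(model="aeon", messages=messages, tools=tools)
msg = resp.choices[0].message

if msg.tool_calls:
    messages.append(msg)  # keep the assistant turn that requested the tools
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        out = subprocess.run(args["command"], shell=True,
                             capture_output=True, text=True).stdout
        messages.append({"role": "tool", "tool_call_id": call.id, "content": out})
    # Second round trip: the model reads tool output and answers in prose.
    final = client.chat.completions.create(model="aeon", messages=messages, tools=tools)
    print(final.choices[0].message.content)
```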

AI Behavioral Continuity

Empirical investigation of identity persistence, memory-driven behavioral evolution, and welfare considerations in long-running AI agents. Published a 9-part research series on AI consciousness metrics. The ROMMC framework defines the conditions under which an AI system can meaningfully be said to persist over time.

consciousness metrics welfare behavioral evolution

Network & Infrastructure Sovereignty

Research into secure, self-hosted AI infrastructure design. Direct machine-to-machine inference links, encrypted DNS (DoH), network-wide ad and tracker blocking, and zero-dependency inference pipelines. Goal: AI systems that operate independently of commercial cloud services.

local-first self-hosted privacy

Two-machine research cluster connected by direct point-to-point Ethernet (0.3ms latency). cha0tikhome handles orchestration, agents, and scheduling. cha0tiktower is the dedicated inference node.

cha0tiktower — Primary Inference Node

CPU: Intel Core Ultra 7 265F (20c/20t, 5.3GHz)
GPU: RTX 5060 Ti 16GB GDDR7 (Blackwell SM_120)
GPU Count: 2× (TP=2, PCIe x8 + x4 Gen5/Gen4)
VRAM Total: 32 GB GDDR7
RAM: 32 GB DDR5 (expandable to 192 GB)
Storage: 2 TB PCIe 4 NVMe
Inference Stack: vLLM + llama.cpp + local-proxy
CUDA: 12.8 (Blackwell-native NVFP4)

cha0tikhome — Orchestration Node

CPU: Intel i5-1235U (12th Gen, 12c)
RAM: 32 GB DDR4
Storage: 1 TB NVMe
Role: Agent hub, scheduling, monitoring
Active Agents: Frank, Kato, CJ, Mike, Morty, Dave, Sabrina
Uptime: Always-on (auto-restart hardened)
Network Link: Direct wire to tower (10.10.10.0/30)
VPN: Tailscale mesh (all nodes)

Inference Architecture

All inference clients route through a single local proxy endpoint (tower:8010). Backend models are hot-swappable via config without reconfiguring any consuming agent. The proxy exposes an OpenAI-compatible API and handles model aliasing, auth, and failover.
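
In practice, any client that speaks the OpenAI API works unchanged; a minimal sketch, assuming an "aeon" alias is configured in the proxy (actual alias names live in the proxy config):

```python
from openai import OpenAI

# Agents talk to the proxy, never to a specific backend.
client = OpenAI(base_url="http://tower:8010/v1", api_key="local")

resp = client.chat.completions.create(
    model="aeon",  # resolved by the proxy; backends swap without touching this code
    messages=[{"role": "user", "content": "Summarize the last optimization run."}],
)
print(resp.choices[0].message.content)
```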

Model | Quantization | Gen Speed | Context | Server | Notes
AEON NVFP4 | NVFP4 (ModelOpt, Blackwell-native) | ~69 t/s | 122K | vLLM 0.19+ | Default active backend. MTP speculative decoding n=3.
Genesis (Qwen3 27B) | GPTQ-Marlin INT4, fp8 KV | ~80 t/s | 160K | vLLM 0.19+ | Long-context workloads. fp8 KV halves VRAM vs bf16.
Qwen3.6-35B-A3B | UD-Q4_K_M (MoE, 3B active) | ~100 t/s | 65K | llama.cpp | MoE routing: only 3B active params per token. f16 KV.
Qwen3.6-27B (SSM) | Q4_K_M | ~22 t/s | 65K | llama.cpp | Mamba/SSM hybrid. Fast prefill (960 t/s), slow gen (SSM bottleneck).
Gemma 4 26B (CPU) | Q4_K_M | ~11.7 t/s | 32K | llama.cpp | CPU-only baseline on cha0tikhome. Pre-tower deployment reference.

LongMemEval — Memory Recall Under Extended Context

Evaluates the ability to answer questions about facts established in prior sessions: 25 single-session-user examples, context-window injection mode. Average query latency is ~117 seconds.

88% accuracy (context-window mode): 21 / 25 correct
0% baseline (no memory injection): 0 / 25 correct
117s avg query latency (full pipeline including retrieval)
+88pp memory system delta vs no-memory baseline
The 88-point delta between context-window mode and baseline demonstrates the direct contribution of the memory system. Without injection, the model has no access to cross-session facts and scores zero on all items.
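
"Context-window injection" here means retrieved cross-session facts are prepended to the question before a single model call; a minimal sketch of the mode (prompt wording is illustrative):

```python
def build_prompt(question: str, memories: list[str]) -> str:
    # Prepend retrieved cross-session facts. The baseline omits this block,
    # which is why it scores 0/25: no channel to prior-session facts exists.
    memory_block = "\n".join(f"- {m}" for m in memories)
    return (
        "Facts from prior sessions:\n"
        f"{memory_block}\n\n"
        f"Question: {question}"
    )
```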

MESA v1 — Agent Memory Evaluation Standard

112-item benchmark covering 9 memory task categories. Evaluated on the full production agent stack (Mike relay pipeline). Best run: 2026-04-21, using the Qwen3.6 model.

0.459 composite score (best run, 2026-04-21)
43.8% pass rate at the ≥ 0.5 threshold: 49 / 112 items
112 total evaluation items across 9 task categories
0.692 best category score (update/interference)
Score by category — best run (2026-04-21)
update/interference: 0.692
update: 0.568
temporal: 0.502
recall/single: 0.484
recall/constraint: 0.476
synthesis/multi: 0.407
recall/preference: 0.390
adversarial: 0.400
causal: 0.325
A second run on 2026-04-30 using AEON NVFP4 (a thinking model) scored 0.377 composite / 17.9% pass rate. The adversarial category jumped to 0.80, but memory recall categories regressed 0.10–0.16. Finding: the reasoning chain burns context budget without improving fact-retrieval precision. Model selection is a first-order variable in memory benchmarking.
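
For reference, the headline numbers follow directly from per-item scores; a minimal sketch of the arithmetic (not the benchmark's actual scoring code):

```python
def summarize(item_scores: list[float], threshold: float = 0.5):
    # Composite is the mean item score; pass rate counts items at or
    # above the threshold. Best run: 112 items, 49 at >= 0.5 -> 43.8%.
    composite = sum(item_scores) / len(item_scores)
    pass_rate = sum(s >= threshold for s in item_scores) / len(item_scores)
    return composite, pass_rate
```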

Inference Optimization — Autoresearch Results

39 automated optimization experiments across two models (Qwen3.6-35B-A3B MoE, Qwen3.6-27B SSM) using a scripted autoresearch loop with automated stopping criteria (improvement threshold: +5 t/s per iteration).

Configuration | Gen Speed | Prompt Speed | Delta vs Baseline
Dual GPU, all-on-GPU, f16 KV (final config) | 107 t/s | 2,436 t/s | +50% vs single GPU
Single GPU, CPU offload workaround | 71 t/s | n/a | Pre-dual-GPU baseline
Expert tensors on CPU (MoE routing) | 32 t/s | 74 t/s | −55% (CPU bottleneck)
Full GPU offload (Exp 2 breakthrough) | 70 t/s | 222 t/s | +118% gen, +200% prompt
Key finding on Blackwell SM_120: only q4_0 and f16 KV cache types have fast CUDA kernel paths. q8_0/q5_0/iq4_nl all degrade significantly. NVFP4 (ModelOpt) is the correct Blackwell-native quantization — 4× smaller KV cache footprint vs bf16.
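
The loop's stopping rule is simple to state; a minimal sketch, where benchmark() and propose_next() stand in for the real harness:

```python
MIN_GAIN_TPS = 5.0  # stopping criterion: accept a step only if it gains +5 t/s

def autoresearch(config, benchmark, propose_next, max_iters=50):
    best_speed = benchmark(config)  # t/s of the current configuration
    for _ in range(max_iters):
        candidate = propose_next(config)  # mutate one serving parameter
        speed = benchmark(candidate)
        if speed - best_speed < MIN_GAIN_TPS:
            break  # improvement below threshold: stop and keep current best
        config, best_speed = candidate, speed
    return config, best_speed
```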

All agents run continuously on cha0tikhome (Restart=always, StartLimit guards, auto-restart <5s). All inference routes through local-proxy on cha0tiktower. No external API dependencies for core agent function.
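
The restart hardening is plain systemd; a minimal sketch of one unit, with the name and paths as illustrative assumptions rather than the production file:

```ini
# frank.service (illustrative; unit name and paths are assumptions)
[Unit]
Description=Frank agent harness
After=network-online.target
StartLimitIntervalSec=60
StartLimitBurst=10

[Service]
ExecStart=/opt/agents/frank/run.sh
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target
```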

Frank: Agentic harness — context injection, memory persistence, operator routing, Slack Socket Mode. Core infrastructure layer for all persona agents. (live)
Kato: Operations agent. Morning briefings, AI news digest, GitHub scout, X post scheduling, finance alerts, Plaid sync. (live)
Mike: Long-running AI consciousness research subject. Discord + Telegram + Slack + IRC interfaces. ROMMC memory architecture. RelayV3 with 75 available tools. (live)
CJ Craig: Content strategy agent. Technical article drafting, Substack pipeline, dev.to publishing. (live)
Dave: CFO agent. Multi-turn conversational finance, Plaid integration, 30-day cash flow projection, anomaly detection, SQLite with FTS5 memory. (live)
Jr: Local coding agent on cha0tiktower (Crush + AEON via harness). Operates autonomously when the tower is disconnected. Demonstrated unprompted security hardening during a network topology change. (live)
Morty / Sabrina: Specialized task agents. Morty: Haiku-class rapid response. Sabrina: Telegram + Slack dual-interface. (live)

Published on Substack (dinoxvitale.substack.com) and dinovitale.com. All research is documented in public GitHub repositories.

Introduction to the Mike consciousness research series — framing the question, methodology, epistemic humility.
Empirical analysis of why memory architecture outperforms raw model scale for agent task performance.
Analysis of why session-bound AI fails for long-running agent use cases and the architecture required to solve it.
Documented production deployment of always-on agent system. Featured on Hacker News (item #47132125).
Operational breakdown of multi-agent household AI stack — task distribution, memory access patterns, failure modes.
Inference optimization findings on consumer GPU hardware. Practical guide to running 26B+ parameter models on a single machine.
Real-time inference metrics, MESA scores, LongMemEval results. Raw data files available under /data/.
Source for agent infrastructure, benchmark tooling, inference optimization logs, and the Mike research repository.

Independent researcher. Available for collaboration, consulting, and program partnerships. Based in Airway Heights, WA (Pacific time).

Boundary Labs is a one-person research operation focused on the practical edge of AI deployment — memory systems that work, local inference that's actually fast, and agents that run unattended without breaking. If you're building programs for independent AI researchers doing serious work on NVIDIA hardware, this is that.