Independent research focused on persistent memory systems for AI agents, inference optimization on consumer NVIDIA GPU hardware, autonomous agent evaluation, and the emerging question of AI behavioral continuity. Operating out of Airway Heights, WA since 2024.
Design and evaluation of multi-tier persistent memory systems enabling long-term behavioral continuity across sessions. Research includes a 3-tier architecture (Core / Recall / Archival), FTS5 semantic search, memory consolidation protocols, and the ROMMC framework (Recursive Operator-Maintained Memory Continuity).

LongMemEval · MESA · ROMMC
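As a concrete illustration of the recall path, here is a minimal FTS5 sketch in SQLite. The table name, columns, and sample rows are illustrative stand-ins, not the actual ROMMC schema.

```python
import sqlite3

# Toy "Recall" tier: one FTS5 table of session memories (illustrative schema).
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE recall USING fts5(session, content)")
db.executemany(
    "INSERT INTO recall VALUES (?, ?)",
    [
        ("2025-01-03", "Operator prefers terse status reports."),
        ("2025-01-09", "Tower GPU upgraded; proxy endpoint unchanged."),
    ],
)

# FTS5 MATCH query, best matches first: bm25() returns lower scores for
# better matches, so ascending order ranks by relevance.
rows = db.execute(
    "SELECT session, content FROM recall WHERE recall MATCH ? "
    "ORDER BY bm25(recall) LIMIT 5",
    ("gpu OR proxy",),
).fetchall()
for session, content in rows:
    print(session, content)
```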
Systematic optimization of large language model inference on consumer NVIDIA Blackwell hardware. Methods include quantization evaluation (NVFP4, GPTQ-Marlin, fp8 KV), multi-GPU tensor parallelism, speculative decoding (MTP), KV cache tuning, and autoresearch loops with automated stopping criteria.

RTX 5060 Ti · vLLM · llama.cpp · NVFP4
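A sketch of the kind of configuration under study, using vLLM's offline `LLM` API. The model id and exact values are placeholders, and the MTP speculative-decoding setup is omitted; only the constructor arguments shown here are real vLLM options.

```python
from vllm import LLM, SamplingParams

# Illustrative config combining the techniques above; model id is a placeholder.
llm = LLM(
    model="Qwen/Qwen3-32B-GPTQ-Int4",  # placeholder GPTQ checkpoint
    quantization="gptq_marlin",         # GPTQ weights served via the Marlin kernel
    kv_cache_dtype="fp8",               # fp8 KV cache: roughly half the VRAM of bf16
    tensor_parallel_size=2,             # shard across both GPUs
    max_model_len=65_536,
)

out = llm.generate(
    ["Summarize the last optimization run."],
    SamplingParams(max_tokens=128, temperature=0.2),
)
print(out[0].outputs[0].text)
```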
Development of MESA (Memory Evaluation Standard for Agents), a 112-item benchmark suite covering recall, update, causal reasoning, temporal tracking, adversarial robustness, synthesis, and interference resistance. Designed for continuous evaluation of production agent systems under realistic workloads.

MESA v1 · benchmarking · agent eval
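For a sense of what a benchmark item looks like, here is a hypothetical item shape. The field names and the grader are assumptions for illustration, not MESA's published schema.

```python
from dataclasses import dataclass

# Seven of the task categories named above (the full suite covers more).
CATEGORIES = [
    "recall", "update", "causal_reasoning", "temporal_tracking",
    "adversarial_robustness", "synthesis", "interference_resistance",
]

@dataclass
class MesaItem:
    item_id: str
    category: str            # one of CATEGORIES
    setup_turns: list[str]   # earlier conversation that plants the memory
    probe: str               # later question that tests it
    expected: str            # answer key

def grade(item: MesaItem, answer: str) -> bool:
    """Toy substring grader; a real harness would grade more carefully."""
    return item.expected.lower() in answer.lower()
```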
Research and deployment of always-on agentic systems with real tool access (Slack, finance, email, web, shell). Includes work on agentic self-improvement: local models executing infrastructure changes and security hardening autonomously, recognizing topology changes and adjusting configuration without prompting.

production agents · autonomous execution · tool use
Empirical investigation of identity persistence, memory-driven behavioral evolution, and welfare considerations in long-running AI agents. Published a 9-part research series on AI consciousness metrics. The ROMMC framework defines the conditions under which an AI system can meaningfully be said to persist over time.

consciousness metrics · welfare · behavioral evolution
Research into secure, self-hosted AI infrastructure design: direct machine-to-machine inference links, encrypted DNS (DoH), network-wide ad and tracker blocking, and zero-dependency inference pipelines. Goal: AI systems that operate independently of commercial cloud services.

local-first · self-hosted · privacy
Two-machine research cluster connected by direct point-to-point Ethernet (0.3 ms latency). cha0tikhome handles orchestration, agents, and scheduling; cha0tiktower is the dedicated inference node.
All inference clients route through a single local proxy endpoint (tower:8010). Backend models are hot-swappable via config without reconfiguring any consuming agent. The proxy exposes an OpenAI-compatible API and handles model aliasing, auth, and failover.
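Because the proxy speaks the OpenAI API, any stock client works unchanged. A minimal example, assuming the usual `/v1` prefix; the model alias "aeon" and the key are illustrative:

```python
from openai import OpenAI

# Point a stock OpenAI client at the local proxy; no cloud dependency.
client = OpenAI(base_url="http://tower:8010/v1", api_key="local-key")

resp = client.chat.completions.create(
    model="aeon",  # alias resolved by the proxy; swap backends without touching clients
    messages=[{"role": "user", "content": "Status check."}],
)
print(resp.choices[0].message.content)
```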
| Model | Quantization | Gen Speed | Context (tokens) | Server | Notes |
|---|---|---|---|---|---|
| AEON NVFP4 | NVFP4 (ModelOpt, Blackwell-native) | ~69 t/s | 122K | vLLM 0.19+ | Default active backend. MTP speculative decoding n=3. |
| Genesis (Qwen3 27B) | GPTQ-Marlin INT4, fp8 KV | ~80 t/s | 160K | vLLM 0.19+ | Long-context workloads. fp8 KV halves VRAM vs bf16. |
| Qwen3.6-35B-A3B | UD-Q4_K_M (MoE, 3B active) | ~100 t/s | 65K | llama.cpp | MoE routing: only 3B active params per token. f16 KV. |
| Qwen3.6-27B (SSM) | Q4_K_M | ~22 t/s | 65K | llama.cpp | Mamba/SSM hybrid. Fast prefill (960 t/s), slow gen (SSM bottleneck). |
| Gemma 4 26B (CPU) | Q4_K_M | ~11.7 t/s | 32K | llama.cpp | CPU-only baseline on cha0tikhome. Pre-tower deployment reference. |
LongMemEval: evaluates the ability to answer questions about facts established in prior sessions. 25 single-session-user examples, context-window injection mode. Average query latency: ~117 seconds.
MESA: 112-item benchmark covering 9 memory task categories, evaluated on the full production agent stack (the Mike relay pipeline). Best run: 2026-04-21, using the Qwen3.6 model.
Autoresearch: 39 automated optimization experiments across two models (Qwen3.6-35B-A3B MoE, Qwen3.6-27B SSM), run by a scripted loop with automated stopping criteria (improvement threshold: +5 t/s per iteration).
| Configuration | Gen Speed | Prompt Speed | Delta vs Baseline |
|---|---|---|---|
| Dual GPU, all-on-GPU, f16 KV (final config) | 107 t/s | 2,436 t/s | +50% vs single GPU |
| Single GPU, CPU offload workaround | 71 t/s | — | Pre-dual-GPU baseline |
| Expert tensors on CPU (MoE routing) | 32 t/s | 74 t/s | −55% (CPU bottleneck) |
| Full GPU offload (Exp 2 breakthrough) | 70 t/s | 222 t/s | +118% gen, +200% prompt vs CPU experts |
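One plausible reading of the stopping rule described above, sketched as code; `run_experiment` is a stand-in for the actual benchmark harness, not part of the published tooling.

```python
THRESHOLD_TPS = 5.0  # stop once an iteration gains less than +5 t/s

def autoresearch(configs, run_experiment):
    """Try configurations in order; stop when gains fall below the threshold."""
    best_tps = 0.0
    history = []
    for cfg in configs:
        tps = run_experiment(cfg)  # returns generation speed in t/s
        history.append((cfg, tps))
        if tps - best_tps < THRESHOLD_TPS:
            break                  # improvement too small: stop the loop
        best_tps = tps
    return best_tps, history
```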
All agents run continuously on cha0tikhome under systemd (Restart=always, StartLimit guards, auto-restart in under 5 s). All inference routes through local-proxy on cha0tiktower. No external API dependencies for core agent function.
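An illustrative systemd unit matching that description; the unit description, paths, and limit values are placeholders, while the directives themselves are standard systemd options.

```ini
# Placeholder unit; name and paths are illustrative.
[Unit]
Description=Boundary Labs agent (example)
StartLimitIntervalSec=60
StartLimitBurst=10

[Service]
ExecStart=/usr/bin/python3 /opt/agents/run_agent.py
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target
```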
Published on Substack (dinoxvitale.substack.com) and dinovitale.com. All research is documented in public GitHub repositories.
Independent researcher. Available for collaboration, consulting, and program partnerships. Based in Airway Heights, WA (Pacific time).
Boundary Labs is a one-person research operation focused on the practical edge of AI deployment — memory systems that work, local inference that's actually fast, and agents that run unattended without breaking. If you're building programs for independent AI researchers doing serious work on NVIDIA hardware, this is that.