tesla_agent // Local Agentic AI Dashboard

Critical Infrastructure Warning & Legal Disclaimer

Educational Prototype Only: This interactive site, its guides, scripts, and recommended models are designed strictly for educational research on consumer hardware. They are NOT certified, tested, or safe for automated control, process adjustment, regulatory reporting, or direct operations of public drinking water, wastewater, SCADA, or any critical utility infrastructure. Use at your own risk.

Read the safety brief before you give an agent write access

An agent with high-level permissions is an apprentice handed the SCADA console, a five-year-old with your phone, and a junior staffer with the corporate credit card — all at once. Sandboxing, least-privilege scoping, credential isolation, and a hard spend cap on day one are not optional. The Safety tab and Chapter 11 are the cheapest hour you'll ever spend on this stack.

Welcome to Your Agentic AI Guide

This interactive workspace teaches you how to run a private, local agentic AI workflow on consumer AMD hardware. No API keys, no external cloud dependencies, and 100% data privacy.

Use it four ways: learn the agent stack, reproduce the benchmark rows, choose a model/backend lane, and build safely toward supervised water-utility workflows.

Related writing: Title 22 — water, systems, strategy.

Technical details: Reproducibility Matrix & Technical Deep-Dive.

What is Agentic AI?

Standard chatbots just answer questions. An agent uses tools—such as writing and running code, searching documentation, or inspecting local files—to perform complex, multi-step tasks. Instead of just replying with a single paragraph, an agent operates in a continuous loop: it plans, takes action, evaluates the output, corrects its mistakes, and keeps working until the goal is achieved.

Log Analysis: Parse and transform messy text or spreadsheet logs locally.
Data Checking: Read datasets and automatically flag rows that deviate from rules.
Report Generation: Synthesize multiple notes, guides, or logs into drafted summaries.

Why Local?

For individuals and organizations handling sensitive documents, proprietary logs, or internal codebase files, sending data to public cloud APIs carries massive privacy and security risks. Running a local LLM ensures that your data never leaves your workstation.

$0 API Fees

100% Data Privacy

APU RDNA3.5 Shared VRAM

Local Architecture Stack

Four roles, your hardware. Each layer below names the role; the parenthetical is the reference implementation this repo uses — equivalents work just as well.

Hardware Unified-memory accelerator (ref: AMD Strix Halo, gfx1151, 128 GB / 96 GB GTT)

GPU Compute Path Vulkan/RADV (promoted) — or ROCm/HIP fallback

Inference Server llama.cpp / llama-server (OpenAI-compatible; serves any supported GGUF)

Agent Engine Agent CLI (Hermes / Claude Code / equivalent OpenAI-compatible client)

How Agents Work Together

One agent is often enough — but bigger jobs go better when you arrange several, the way a plant runs a crew rather than one operator. Match the shape to the work:

One agent: a self-contained job, start to finish.
Sequential pipeline: steps that depend on each other (gather → compare → draft).
Batch: the same job over many independent items (e.g. summarize 40 equipment manuals).
Orchestrator: a coordinator that splits a big goal, delegates the parts, and assembles the result.

Full walkthrough with water-industry examples: Chapter 10 — How Agents Work Together.

Step 1: Host Requirements & Verification

Before running local models, verify your hardware, user groups, and kernel settings. Strix Halo requires active APU visibility.

Terminal

# Check if your user is in the render and video groups
groups

# Run the host diagnostic script
bash scripts/setup/check_host.sh

What just happened?

The diagnostic script checks if your system is running Linux, verifies that your graphics chip is detected as gfx1151 (RDNA3.5), and checks if the kernel loader is configured to access the shared RAM pool.

What success looks like

[PASS] Kernel Config (gttsize): Found gttsize=98304
[PASS] Active Kernel Parameter: no_system_mem_limit is enabled (1)
[PASS] ROCm GPU Architecture: gfx1151 visible to ROCm (Radeon APU)
Check complete: 5 passing, 0 failing, 0 warnings.

What if it fails?

Common issue: GPU not visible or driver missing.
If rocminfo fails to report your GPU architecture, run dmesg | grep amdgpu. Ensure secure boot is disabled if the driver fails to load, and verify your user is in the render group (run sudo usermod -aG render,video $USER then log out and back in).

Step 2: Allocate Unified VRAM (GTT size)

Unlike discrete graphics cards with dedicated VRAM, the Strix Halo APU shares system RAM. By default, Linux limits graphics allocations to 25%-50% of memory. We need to override this to allow the GPU to access up to 75% of RAM (96 GB on a 128 GB system).

Terminal

# Apply optimal allocations (requires root privileges)
# Defaults to 75% of your total system RAM.
sudo bash scripts/setup/apply_gtt.sh

# Reboot is required to apply the kernel parameters
sudo reboot

What just happened?

The script creates a configuration file in /etc/modprobe.d/amdgpu_llm_optimized.conf setting gttsize (in MB), no_system_mem_limit=1 (which prevents layers from silently spilling to CPU memory, stalling latency), and ttm limits, then updates the kernel boot image (initramfs).

What success looks like

After reboot, checking GTT memory parameters reports the correct allocated size:

cat /sys/module/amdgpu/parameters/gttsize
# Output on a 128 GB system: 98304 (96 GB in MB)

What if it fails?

Common issue: initramfs build fails or parameters ignore on boot.
Verify that you ran the script with sudo. If parameters are not active after reboot, verify if your system uses a boot loader (e.g. systemd-boot) that requires kernel command line overrides instead of modprobe files.

Step 3: Source Driver Environment Variables

ROCm does not officially support the Strix Halo gfx1151 APU out of the box. We must inject environment flags to override the runtime architecture and configure memory parameters.

Terminal

# Source the environment variables in your active shell
source scripts/setup/set_hsa_env.sh

What just happened?

This script exports critical variables like HSA_OVERRIDE_GFX_VERSION=11.5.1 (pins ROCm to a supported HSA compatibility target for the gfx1151 APU, which is not officially recognized out of the box), and HSA_ENABLE_SDMA=0 (disables System DMA which causes kernel hangs on APUs during large model routing).

What success looks like

GPU Environment Configured:
  HSA_OVERRIDE_GFX_VERSION = 11.5.1
  HSA_ENABLE_SDMA          = 0
  HIP_VISIBLE_DEVICES      = 0

What if it fails?

Common issue: Variables disappear in new terminal.
Environment variables set with export or source only exist in the active terminal window. You must source this file in every new terminal you open before starting model servers or agents. (For convenience, you can append source /path/to/set_hsa_env.sh to your ~/.bashrc file).

Step 4: Build the Vulkan llama-server

We serve on the open-source Vulkan (RADV) backend — the fastest lane on Strix Halo and the default for this stack. There is no prebuilt binary for this hardware, so you compile llama-server once from source at the pinned stable tag (b9247). This takes a few minutes and you only do it once. (A ROCm path is kept as an optional fallback — see Chapter 08.)

4a. Install the build tools (one time). These commands are for Ubuntu/Debian. They install the compiler, CMake, and the Vulkan/RADV driver and headers.

Terminal

sudo apt update
sudo apt install -y git cmake build-essential \
  libvulkan-dev glslc vulkan-tools mesa-vulkan-drivers

4b. Clone and build. This clones into ~/src/llama.cpp and builds the server using all your CPU cores.

Terminal

# Clone llama.cpp into a predictable location and pin the stable tag
mkdir -p ~/src && cd ~/src
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b9247

# Build only the server target, with Vulkan (RADV) enabled
cmake -B build-vulkan -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release --target llama-server -j"$(nproc)"

4c. Point the config at your new binary. Run these from the tesla_agent repo folder. The first line creates your config; the second writes the binary path into it automatically.

Terminal

cp scripts/config.env.example scripts/config.env
sed -i "s|^TESLA_VULKAN_SERVER=.*|TESLA_VULKAN_SERVER=\"$HOME/src/llama.cpp/build-vulkan/bin/llama-server\"|" scripts/config.env

What just happened?

CMake compiled a Vulkan-enabled llama-server at ~/src/llama.cpp/build-vulkan/bin/llama-server. The sed line set TESLA_VULKAN_SERVER in scripts/config.env to that exact path, so the serve script in Step 6 knows where to find it. (Prefer to edit by hand? Open scripts/config.env and set TESLA_VULKAN_SERVER to that path yourself.)

What success looks like

ls -l ~/src/llama.cpp/build-vulkan/bin/llama-server
# -rwxr-xr-x ... llama-server   (present and executable)

grep TESLA_VULKAN_SERVER scripts/config.env
# TESLA_VULKAN_SERVER="/home/you/src/llama.cpp/build-vulkan/bin/llama-server"

What if it fails?

Shader compilation error / glslc too old.
The Vulkan build needs a current glslc shader compiler. If CMake errors during shader compilation, the distro glslc (2023.x) may be too old — install a newer shaderc or build it from source, then re-run the two cmake commands.

cmake: command not found or missing Vulkan headers.
Re-run step 4a; the libvulkan-dev and mesa-vulkan-drivers packages must be installed for the Vulkan build to find its headers and driver.

Step 5: Download the Recommended Model

The starter setup uses the 21.7 GB Qwen 3.6 35B Mixture-of-Experts (MoE) model because it is compact, fast, and the CODE/general workhorse. The full ladder also includes Qwen 3.5 35B as the PLAN/AGENTIC baseline, StepFun Step-3.7-Flash as the QUALITY champion, and an AMERICAN-ONLY tier (gpt-oss-120B and Gemma 4 31B — US-origin models for agencies that may require domestic-only provenance). Use the Model Finder after this baseline is working.

Terminal

# Install the Hugging Face CLI if needed
pip install huggingface_hub

# Download the model GGUF from the verified Unsloth repository
mkdir -p ~/models/qwen3.6-35b-a3b
huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \
  Qwen3.6-35B-A3B-MXFP4_MOE.gguf \
  --local-dir ~/models/qwen3.6-35b-a3b

What just happened?

The Hugging Face client downloads the model segments and reconstructs the single Qwen3.6-35B-A3B-MXFP4_MOE.gguf file on your local drive.

What success looks like

ls -lh ~/models/qwen3.6-35b-a3b/Qwen3.6-35B-A3B-MXFP4_MOE.gguf
# Outputs showing file size: ~22 GB

What if it fails?

Common issue: Disk space exhausted.
The download requires at least 25 GB of free disk space. Ensure your target directory has sufficient room. If the download is interrupted, re-running the command will resume the download from where it stopped.

Step 6: Start the Model Server

With the GTT configuration applied and environment sourced, launch the Vulkan server. We run it with a 32,768 context window and optimized memory parameters.

Terminal

# Make sure TESLA_VULKAN_SERVER is set in scripts/config.env, then:
bash scripts/serving/serve_vulkan.sh

What just happened?

The script launches llama-server on port 8095 via Vulkan. It exports HIP_VISIBLE_DEVICES=-1 to hide the GPU from ROCm (forcing the Vulkan route) and sets the RADV ICD, binds memory layers to the GPU, enables Flash Attention, and loads the active context space.

What success looks like

Look for lines indicating successful model load and socket binds in the server log:

llama_new_context_with_model: n_ctx = 32768, total VRAM = 21.7 GB
llama_server_listening: http://127.0.0.1:8095

What if it fails?

Common issue: TESLA_VULKAN_SERVER path is empty.
serve_vulkan.sh exits if scripts/config.env doesn't point at your Vulkan llama-server binary (Step 4). Also ensure you sourced set_hsa_env.sh in the active terminal and that your GTT size was applied (Step 2).

Step 7: Configure the Hermes Profile

Now that the model API is running, configure the Hermes agent engine to communicate with it.

Terminal

# Generate the Hermes config files and system launcher
bash scripts/serving/create_hermes_profile.sh

What just happened?

The script creates a configuration file in ~/.hermes/profiles/qwen36_mxfp4/config.yaml mapping the correct API URL, local model details, and max tokens. It also compiles a quick launcher executable at ~/.local/bin/qwen36_mxfp4.

What success looks like

created Hermes profile: ~/.hermes/profiles/qwen36_mxfp4
created launcher:       ~/.local/bin/qwen36_mxfp4
To run the agent, use:  qwen36_mxfp4 -t "your task"

What if it fails?

Common issue: command not found (qwen36_mxfp4).
Your system PATH may not check ~/.local/bin/ by default. Check if the directory exists, or add export PATH="$HOME/.local/bin:$PATH" to your ~/.bashrc and source it.

Interactive Model Recommendation

Answer a few questions about your hardware configuration and active goals to select the best model settings.

Primary Task Priority

CODE — fast workhorse First reach for coding/agent tasks. Qwen 3.6 35B MoE on Vulkan. ~58.5 t/s workhorse; optional MTP speed lane +24–39%. CODE — hard-coding challenger Qwen3-Coder-Next on Vulkan b9360. 44.4 t/s decode, 723.2 t/s prefill, orchestrated coding artifact passes saved grader checks. AMERICAN-ONLY — Gemma dense coding second-opinion US-origin (Google) coding second-opinion for agencies that may require domestic models. Orchestrated multi-step pattern required. Gemma 4 31B IT. ~8.25 tok/s decode (dense — slow; prefill ~133.6 tok/s). EXTRACT — no-think fast Telemetry, log parsing, structured extraction. Qwen 3.6 35B MoE with thinking disabled. 43.7 t/s. SYNTHESIS — quality champion Formal reports, master-plan synthesis, plan review. StepFun Step-3.7-Flash MTP — graduated QUALITY champion (2026-06-02). 27.9 t/s decode, coding 5/5 E2E; replaced Qwen 122B. AMERICAN-ONLY — quality/speed (US-origin) For agencies that may require US-origin models. gpt-oss-120B MXFP4 (OpenAI, 3 shards). ~46 t/s; pairwise 5-1 vs Qwen 35B, 4-2 vs Qwen 122B. PLAN / AGENTIC — Qwen 3.5 35B baseline Planning and agentic loops. Qwen 3.5 35B-A3B MoE (MXFP4). ~47.3 t/s, nonce 3/3. COMPANION — small-footprint MoE For concurrent loads (e.g. 26B + 120B fits where 31B + 120B doesn't). Gemma 4 26B-A4B IT. ~44.8 t/s tg128; pp512 ~1003 t/s. Verified plain-control baseline for general reasoning, JSON, and prose. BREAK-GLASS — dense reasoning probe For tough/blocked tasks where a different dense single-trace might unstick. Qwen 3.6 27B Dense. 9.6–11.5 t/s; DFlash lifts to ~31 t/s.

Recommended Configuration

Qwen 3.6 35B MoE (MXFP4 Quant)

Think-On Enabled

Model File: Qwen3.6-35B-A3B-MXFP4_MOE.gguf

Download Size: 21.7 GB

Speed: ~58.5 tokens/sec (Vulkan; ~44.2 ROCm)

Reasoning: Uncapped think-on (do not budget coding)

Start server command:

serve_vulkan.sh

Hermes run config:

max_tokens: 8192

This configuration balances model speed and memory consumption. Keeping thinking enabled is critical for coding tasks.

Strix Halo Benchmark Matrix

Performance comparison on AMD APU system (128GB total RAM, 96-112GB GTT allocated depending run):

Benchmark Stack:
• Hardware: AMD Ryzen Strix Halo APU (gfx1151), 128 GB LPDDR5X RAM (96-112 GB GTT ceiling depending model)
• Backend: llama.cpp/llama-server (b9247; Vulkan/MTP lanes reproduced on b9360; Gemma QAT MTP probes on Atomic b9019) served via ROCm 7.2.x / Vulkan (Mesa RADV 25.2.8)
• Parameters: Temp = 0 (greedy decoding), context budget = 8k-32k, Flash Attention enabled

Model & Mode	Size	Quality	Speed	Gate
gpt-oss-120B MXFP4 — AMERICAN-ONLY quality/speed (US-origin)	~63 GB	5-1 vs Qwen 35B; 4-2 vs Qwen 122B	~46 t/s	3 / 3
Gemma 4 26B-A4B IT (UD-Q6_K_XL) — verified plain-control baseline	21.2 GB	2-4 vs Gemma 31B	44.76 t/s tg128 pp512 1002.76 t/s; reasoning off, F16 KV	3 / 3
Gemma 4 26B-A4B QAT Q4_0 — fast Gemma QAT lane	13.45 GiB	quality control vs non-QAT Q4 pending	59.4 t/s pp 1194.4 t/s; official Google QAT GGUF	3 / 3
Gemma 4 26B-A4B QAT Q4_0 + MTP/Q8 KV — experimental speed probe	13.45 GiB + ~310 MiB	assistant head not QAT-matched	71.0 t/s pp 714.4 t/s; MTP acceptance 56.9%	3 / 3
Gemma 4 12B QAT Q4_0	6.50 GiB	quality control pending	25.7 t/s pp 666.5 t/s	not run
Gemma 4 31B QAT Q4_0	16.44 GiB	quality control pending	11.0 t/s pp 204.2 t/s	not run
Gemma 4 31B QAT Q4_0 + MTP — experimental speed probe	16.44 GiB + ~337 MiB	assistant head not QAT-matched	15.4 t/s pp 118.0 t/s; MTP acceptance 42.5%	not run
Gemma 4 31B IT Q6_K — AMERICAN-ONLY coding second-opinion (US-origin, dense — slow decode)	25.2 GB	4-2 vs Gemma 26B-A4B	~8.25 tok/s tg128; ~7.7 tok/s sustained (dense; pp8192 ~133.6 tok/s)	3 / 3
Qwen 3.6 35B (Vulkan RADV, Think-On) — CODE/general baseline; workhorse default unchanged	21.7 GB	82 / 84	~58.5 t/s	3 / 3
Qwen 3.6 35B MXFP4-MTP (Vulkan RADV) — opt-in speed lane	19.3 GB	same production quant	~72.7 t/s (+24%) prefill not separately captured	3 / 3
Qwen 3.6 35B Q4_K_M-MTP (Vulkan RADV) — opt-in speed lane	20.7 GB	won quality pairwise 4-2	~81.2 t/s (+39%) prefill not separately captured	3 / 3
Qwen 3.6 35B (ROCm, Think-On) — fallback	21.7 GB	82 / 84	44.2 t/s	3 / 3
Qwen 3.6 35B (MXFP4, Think-Off)	21.7 GB	82 / 84	43.7 t/s	1 / 3
Qwen 3.5 35B (MXFP4, Think-On) — PLAN/AGENTIC baseline	21 GB	79 / 84	47.3 t/s	3 / 3
Qwen 3.5 122B (MXFP4, Think-On) — retired 2026-06-02	70 GB	80 / 84	19.4 t/s	3 / 3
Qwen 3.5 122B (MXFP4, Think-Off) — retired 2026-06-02	70 GB	81 / 84	19.5 t/s	3 / 3
Qwen 3.5 122B MTP (MXFP4_MOE, Vulkan RADV) — retired 2026-06-02 (tuned lane, kept as record)	~70 GB	3-3 quality tie vs previous MTP config	28.3 t/s pp 324.9 t/s; DRAFT_N=1, PMIN unset	3 / 3
StepFun Step-3.7-Flash MTP — QUALITY champion (graduated 2026-06-02)	88.79 GiB + 3.5 GB draft	plain StepFun: 6-0 vs gpt-oss-soulfix; 4-0-2 vs 122B	27.9 t/s pp 183.5 t/s; 89.3% MTP acceptance (ub=256)	3 / 3
StepFun Step-3.7-Flash plain — QUALITY champion (plain lane)	88.79 GiB	6-0 vs gpt-oss-soulfix; 4-0-2 vs 122B	20.4-22.3 t/s pp 212.0 t/s; coding 4/5 E2E	3 / 3
Qwen 3.6 27B Dense (UD-Q4_K_XL, Think-On) — experimental, not in stack	16.4 GB	0-6 vs Qwen 122B	9.6-11.5 t/s	3 / 3
Qwen 3.6 27B Dense (UD-Q4_K_XL, Think-Off) — experimental, not in stack	16.4 GB	—	9.6-11.5 t/s	1 / 3
Qwen3-Coder-Next (UD-Q4_K_XL, Vulkan RADV)	49.6 GB	saved orchestrated coding artifact passes grader checks	44.4 t/s pp 723.2 t/s; b9360 promoted	3 / 3

Note — MTP speed lanes are opt-in. The Qwen3.6-35B-A3B-MTP GGUFs carry a native nextn head, so llama-server can self-speculate with --spec-type draft-mtp and no separate draft model. The technique surfaced via the community strix-halo-guide; this repo independently reproduced and quality-gated the MXFP4-MTP and Q4_K_M-MTP lanes. Full audit trail: Reproducibility Matrix & Technical Deep-Dive.

Latest large-model MTP lanes. Qwen 122B MTP (now retired as of 2026-06-02, kept only as a record) reached a tuned Vulkan profile at 28.3 t/s decode with 81.8% MTP-probe acceptance; its quality role is now held by the StepFun champion. StepFun Step-3.7-Flash MTP reaches 27.9 t/s decode (wall std 78.0 s) with 89.3% acceptance using ub=256 — a ubatch sweep (2026-06-06) showed smaller micro-batches cut per-speculative-step latency and compound over long outputs (+7% tg, −5% wall std vs the prior ub=512 default).

Gemma 4 26B-A4B plain control baseline. The no-spec Vulkan lane with --reasoning off and F16 KV measures pp512 1002.76 ± 10.29 t/s and tg128 44.76 ± 0.90 t/s with Hermes nonce 3/3. It is the simpler lane for general reasoning, JSON extraction, and prose; the MTP comparison only pays off on heavy code generation.

Gemma 4 QAT Q4_0 sweep. The official Google QAT 26B-A4B row is now the fastest general Gemma lane measured here: 59.4 t/s decode and 1194.4 t/s prefill. QAT means quantization-aware training: the model is trained or adapted with the low-precision target in mind. The experimental 26B-A4B MTP/Q8 row reaches 71.0 t/s single-stream, but uses a non-QAT-matched assistant head and drops two-slot throughput, so it remains a speed probe.

Note — the dense 27B is benchmarked but NOT in the production stack. Community discussion often treats Qwen 27B as a strong reasoner, but the local Strix Halo benches did not support that routing choice: blind pairwise was 0–6 vs the 122B and normal decode tested around 9.6–11.5 t/s. It is kept as a break-glass option for tough, blocked projects, not a first- or second-line model. Technical aside: DFlash speculative decoding lifts its floor to ~31 t/s (2.82×) with a footprint-minimized Q4_K_M draft.

Visual Performance Analysis

Local Strix Halo speed plus external intelligence and coding scores. Artificial Analysis scores measure cloud/API model capability; local speed is this repo's llama.cpp/Vulkan/ROCm benchmark, so treat the combined view as a routing map, not a universal leaderboard.

Local Decode Leaderboard

External Intelligence vs. Local Decode

External Coding Score vs. Local Decode

Sequential Task Wall Time (1150-token prompt / 2000-token response — lower is faster)

Credit and source notes. Intelligence scores are credited to Artificial Analysis. Coding scores use Artificial Analysis Coding Index where available; for StepFun Step 3.7 Flash, the crawl exposed AA Intelligence but not AA Coding Index, so the coding marker is StepFun's published SWE-Bench Pro score and is labeled separately. Local speed, prefill, nonce, and coding-gate rows are this repo's Strix Halo measurements. Wall time chart: solid bars are directly measured via full_bench.sh (1150-token prompt / 2000-token response, normalized); faded bars are estimates using the same formula from speed data. MTP technique credit remains with strix-halo-guide where noted.

Tuning Insights for Unified Memory

Key findings from benchmarks run on Strix Halo hardware:

Vulkan Performance Win

The Vulkan/RADV backend (using Mesa drivers) runs +13% to +19% faster than the official ROCm HIP backend on MoE decoding. This represents a significant speed boost for long reasoning sessions.

Speculation Path Matters

Older separate-draft MoE speculation slowed down because router verification erased the gain. Native Qwen MTP is different: the model carries its own nextn head, so --spec-type draft-mtp gives an opt-in +24–39% speed lane when quality holds.

Thinking Budgets

Reasoning budgets are useful for planning and prose, but do not cap stateful coding loops. In the coding gate, any cap degraded reliability; leave coding routes uncapped.

The Three Pictures

Apprentice at the SCADA console

Confident, eager, reads every screen. Will push the wrong setpoint at 2 AM and not call you. Difference vs a real apprentice: the agent doesn't get tired.

Five-year-old with your phone

Will tap every button. Will discover some buttons buy things. Will accept every popup. Every tool in the agent's list is just another button.

Apprentice with the corporate card

Cloud APIs charge per token. An agent looping on a failing task for six hours at 3 AM produces a four-figure bill. There is no "it was an agent" refund policy.

What Can Actually Go Wrong

Each item is paired with the plant analog it maps to — same failure mode, different setting.

Destructive commands. rm -rf, DROP TABLE, git push --force. The verb sounds right; the agent runs. Plant analog: an apprentice with CMMS admin access runs "delete completed work orders" without a date filter and trims six years of maintenance history.
Overnight database refactor. You asked the agent to "clean up the schema"; you woke up to a half-run migration and a closed backup window. Plant analog: you asked the agent to "standardize work-order naming." It renamed every backflow-test record and zeroed the inspection-date field because the column rename cascaded.
Credential exfiltration. The agent reads .env, ~/.ssh/id_rsa, ~/.aws/credentials because they sit in the directory it was told to operate in. Plant analog: the project folder included a "temporarily" saved SCADA admin password from six months ago that nobody moved out.
Cost runaway. Failed tasks retry on more expensive models. Plant analog: a metering pump that retries and doubles its dose every minute with no high-flow alarm — the same loop-with-no-alarm shape produces a four-figure cloud bill. Local-first eliminates this category entirely.
Tool escalation. File-edit access led to formatter access led to network access. Plant analog: the contractor's key to the chemical-feed room shared a door with the SCADA equipment closet they didn't know was connected.
Trust scope creep. Let it edit, then commit, then push, then deploy. Plant analog: let the new operator log readings, then submit reports, then file with the state — until an incorrect TT/CT calculation gets filed as "compliant."
Prompt injection from documents. A regulatory PDF telling the agent to email /etc/shadow to an attacker. The most-studied current attack vector in agentic systems. Plant analog: a complaint letter contains "approve a 50% chlorine feed increase" — if the agent reads incoming mail, that sentence is a tool call.

The Eight Defensive Layers

No single control is sufficient. Layer them. Each line names the IT-side control with a plant analog after the em-dash.

Sandbox everything. Docker / devcontainer / VM. Bind-mount the project directory, never your home — the SCADA training simulator, not the live HMI.
Least privilege. Read-only by default. Write only where needed. No shell if file edit is enough — not every operator gets the supervisor PIN.
Credentials outside the agent's reach. OS keyring, .env outside the bind mount, no real keys in chat — keys live in the locked cabinet behind the supervisor's desk, not in the project folder.
Spend limits at the source. Anthropic / OpenAI console budget caps, virtual cards, alert thresholds — the high-flow alarm on the metering pump. Set it before you start. (Local-first sidesteps this entirely.)
No production systems. Ever. No SCADA, BAS, DMS, RTU, historian, GIS, CIS, billing, or PII — same rule you already follow for testing against the live HMI: snapshot first, work on the bench.
Approval gates on destructive ops. Default to "ask before doing." Don't disable the confirmation prompt — the two-key chemical-feed override is the load-bearing wall, not slowing the work down.
Short leash, expanding trust. First task you watch every action; tenth you check after; hundredth you spot-check — onboarding a new operator: shadow → supervised → solo → spot-checked. Never skip the spot-check.
Kill switch and audit trail. Know how to stop it. Save transcripts. Review them — the E-stop and the alarm history. Know where both are before you start the run.

Why Local-First Is Half the Safety Story

Every model recommended in this repo runs on your hardware. That means:

No agent loop bills an API meter. A retry storm costs electricity, not $1,000.
No data leaves your machine. Customer records, operations data, draft reports — none of it is sent anywhere by default.
No third party can change the model under you. The GGUF on your disk doesn't move unless you move it.

If you only follow one principle from this chapter: start local, stay local until you have a concrete reason to leave.

Before You Turn It Loose — Checklist

Print this. Tape it next to your monitor.

☐ Agent is in a sandbox. Its /root/ is not my /home/.
☐ Agent has access only to the directory it needs for this task.
☐ No real credentials are inside that directory.
☐ If using cloud APIs: a hard spend cap is set on the provider side. Today.
☐ No production SCADA / BAS / RTU / historian / GIS / CIS / billing / PII is reachable.
☐ Destructive operations require my explicit approval, every time.
☐ I know the exact command to stop the agent.
☐ Transcripts are being saved.
☐ Worst-case mistake on this task is recoverable in under an hour.

If you can't check every box, narrow the scope until you can.

The full chapter — failure-mode catalog, scenario walkthrough, incident playbook, and further reading — lives in the guide.

Read Chapter 11 — Agent Safety

Terminology Glossary

Complex AI and driver concepts explained in simple, jargon-free language with analogies and references.

1. Hardware & Memory Architecture

APU (Accelerated Processing Unit) Reference

ELI5: AMD's term for a single chip that contains both the computer's CPU manager and GPU math speed-runner. Strix Halo shares the system's 128 GB memory pool between CPU and GPU.

Analogy: A premium kitchen machine that blends and cooks on the same counter instead of buying two separate appliances.

Unified Memory (UMA) Google Glossary

ELI5: A system where the CPU and GPU share the exact same physical system RAM pool.

Analogy: A chef and assistant sharing a single massive counter space instead of running between separate tables.

GTT Size (Graphics Translation Table)

ELI5: The operating system setting that controls how much shared RAM the GPU is allowed to access for graphics and compute allocations. This guide's reference setup uses a 96 GB GTT pool on a 128 GB machine.

Analogy: A boundary line in a shared room telling the graphics card how much space it can use without taking over the whole house.

ROCm / HIP ROCm docs

ELI5: AMD's GPU compute platform, similar in role to NVIDIA CUDA. It provides the HIP backend used by llama.cpp and other local inference stacks.

Analogy: A bilingual translator bridging the gap between your code and AMD graphics hardware.

Vulkan & Mesa RADV

ELI5: Vulkan is a graphics/compute API. RADV is Mesa's open-source Vulkan driver for AMD GPUs. In this guide, Vulkan/RADV is the fastest measured default path for local generation rows.

Analogy: Vulkan is the road system; RADV is the road crew keeping the AMD lanes paved.

AMDVLK

ELI5: AMD's former open-source Vulkan driver. Prefer Mesa RADV; stale AMDVLK ICD files can quietly make the wrong Vulkan driver load.

Analogy: An old road sign that still sends traffic down the wrong street.

tuned

ELI5: A Linux service that applies performance profiles. The accelerator-performance profile can reduce power-management drag during local LLM runs.

Analogy: Telling the plant to use the high-load test profile instead of the energy-saving schedule.

2. Machine Learning & LLM Core

LLM (Large Language Model) Google Glossary

ELI5: A massive autocomplete engine trained on billions of texts to predict the next word.

Analogy: A hyper-smart predictive text keyboard that has read the entire internet.

Tokens Google Glossary

ELI5: Word-pieces that the AI reads and writes. In English, a token is roughly three-quarters of a word, though exact counts depend on the tokenizer.

Analogy: Cutting text into syllable Lego bricks to build sentences rather than reading individual letters.

Prompt Processing (pp)

ELI5: How fast the model reads your input prompt, measured in tokens per second. Higher is better.

Analogy: How fast a reviewer can read the packet before writing comments.

Token Generation (tg)

ELI5: How fast the model writes its response, measured in tokens per second. This is the speed you feel while chatting.

Analogy: How fast the reviewer can dictate the final answer after reading the packet.

Context Window (Context Size) HF docs

ELI5: The size of the AI's active notepad (short-term memory).

Analogy: A notebook where the AI writes down what you said and what it did. If it runs out of pages, it starts forgetting the beginning.

Flash Attention & Speculative Decoding

ELI5: Speed-reading math tricks and word-guessing loops to accelerate model processing. On Strix Halo, keep Flash Attention enabled with the equivalent of -fa on / -fa 1.

Analogy: Indexing a book for fast lookups (Attention) and having an assistant draft text for a senior editor to quickly check (Speculative).

MTP (Multi-Token Prediction)

ELI5: A self-speculative speed trick where the model has its own built-in head for guessing the next few tokens, so no separate draft model is needed.

Analogy: The senior editor has likely next phrases penciled into the margin and can approve several words at once.

nextn Head

ELI5: A model component trained to guess more than one next token at a time.

Analogy: Instead of predicting one word, it sketches the next short phrase for the main model to check.

Chat Template (Jinja2) HF Guide

ELI5: A formatting script that wraps conversational history (User, Assistant, System) and tool-calling data in structured XML or markdown tags so the AI model understands where thoughts end and tool requests begin.

Analogy: A standardized utility log sheet. No matter who takes the measurements, they write them in the exact same boxes so the state compliance auditor can read them instantly.

3. Model Formats & Compression

GGUF (GPT-Generated Unified Format)

ELI5: The file format used by llama.cpp and related tools to store local AI models. A .gguf file contains model weights plus metadata needed for inference.

Analogy: An `.mp3` or `.zip` file specifically optimized for loading AI brains.

Quantization Google Glossary

ELI5: A compression technique that lowers model decimals to integers to shrink file size.

Common labels: Q4_K_M is a balanced 4-bit quant; Q8_0 is higher-quality 8-bit at roughly 2x the weight size; UD-Q4_K_XL is Unsloth Dynamic 4-bit with higher precision for important layers; BF16 is 16-bit precision and much larger.

Analogy: Saving a huge raw photo as a JPEG. It takes up 70% less space, but looks identical to your eyes.

Mixture-of-Experts (MoE)

ELI5: A model design where only a small part of the brain is active for each token. A 30B-A3B model has about 30 billion total parameters but activates about 3 billion per token.

Analogy: A hospital with 8 specialist doctors. For a cold, only the 2 required specialists treat you, keeping it fast and cheap.

Dense Model

ELI5: A model where all parameters are used for every token. A dense 7B model uses all 7 billion parameters every time it writes a token.

Analogy: Every specialist reviews every patient, even routine cases. That can be thorough, but it is slower.

Q6_K / UD-Q6_K_XL

ELI5: GGUF compression formats that keep more detail than 4-bit formats while still fitting local hardware.

Analogy: A larger field notebook with clearer handwriting: more room than the tiny version, but easier to read back accurately.

llama.cpp

ELI5: The open-source C++ inference library that powers many local LLM tools. It can run GGUF models through CPU, Vulkan, ROCm/HIP, and other backends.

Analogy: The engine under the hood. Different apps may have different dashboards, but many are driving with this engine.

Ollama

ELI5: A user-friendly tool for downloading and running local LLMs with commands like ollama run model-name. This repo uses llama-server directly, but Ollama is a common llama.cpp-based path.

Analogy: An appliance wrapper around the engine: easier controls, less manual wiring.

Gemma 4

ELI5: A separate model family from Google. In this stack, Gemma 4 31B is a cross-family coding experiment — used for quality verification and second-opinion checks, not as a throughput workhorse. It is a dense model: every token reads all 31B parameters, so decode runs at ~8 tok/s on Strix Halo, much slower than the faster MoE lanes (~46–81 tok/s depending on model and MTP opt-in).

Analogy: A specialist reviewer from a different firm — slower to consult, but valuable for a second opinion on tricky plans. Not someone you route every job to.

gpt-oss-120B

ELI5: A large open-weight model (OpenAI). In this stack it is the AMERICAN-ONLY quality/speed lane — the US-origin pick for agencies that may require domestic-only model provenance (the general QUALITY champion, StepFun, is non-US in origin).

Analogy: The trusted domestic-supplier option you keep on hand for customers whose rules say "buy American," even when another vendor tops the leaderboard.

4. Agentic Workflows

Agent (Agentic AI)

ELI5: A chatbot given tools and a goal, running in a loop until it's finished.

Analogy: Giving an assistant a mouse and keyboard, saying: "Clean this file and let me know when done," instead of just asking for advice.

Tool Call

ELI5: The moment the AI decides to run an external program (e.g., a file reader or bash command) instead of guessing.

Analogy: A chef looking up a recipe in an index rather than trying to remember the measurements.

Nonce Gate

ELI5: A verification test to prove the agent is executing tools rather than hallucinating answers.

Analogy: Putting a secret word inside a box and asking the agent to open the box and tell you the word. If they echo it back, they successfully used the key.

Sandbox (Docker / Container) Docker Docs

ELI5: A secure, isolated virtual room inside your computer where the AI agent is allowed to write and run code without risk of altering or damaging your actual operating system.

Analogy: A safety hood in a laboratory. You run chemical reactions inside the hood to contain fumes and spills, protecting the rest of the building.

Orchestrator Pattern (Multi-Agent) Agentic Design

ELI5: A workflow where a main "Manager" AI agent takes a complex user goal, breaks it into smaller sub-tasks, delegates them to specialized "Worker" sub-agents, and compiles their final outputs.

Analogy: A chief operator coordinating plant maintenance, laboratory testing, and electrical crews rather than trying to do every job himself.

Pairwise Scorecard

ELI5: A blind A/B comparison where two model answers are shuffled and judged prompt by prompt.

Analogy: A taste test with the labels covered. It helps when normal scores are too close to settle the choice.

pass^3 Gate

ELI5: A reliability rule requiring three clean end-to-end passes instead of one lucky success.

Analogy: Starting equipment three times cleanly tells you more than seeing it start once.

5. Utility & Domain Context

SCADA (Supervisory Control and Data Acquisition)

ELI5: The industrial computer network that reads sensor data (flow, pressure, tank levels) and operates mechanical hardware (pumps, valves, chemical feeds) in real-time.

Analogy: The dashboard gauges, gas pedal, and steering wheel of a massive commercial truck.

MCL (Maximum Contaminant Level) EPA Regulations

ELI5: The legal safety limit set by the EPA on the concentration of a chemical or contaminant allowed in public drinking water systems.

Analogy: The legal speed limit on a residential street. Going above it triggers a violation and requires immediate corrective action.

6. Agent Safety & Sandboxing

Each term below includes a plant analog — the same control in language treatment operators already use. Full treatment in Chapter 11 — Agent Safety.

Sandbox

ELI5: A walled-off environment that looks like a full computer to the program inside, but cannot reach your real system or files.

Analogy: A child's playpen — they can move freely inside it without reaching the stairs.

Plant analog: The SCADA training simulator. Same screens, same alarms — but a wrong setpoint doesn't dose finished water.

Bind Mount

ELI5: Telling a sandbox: "give the program inside this one specific folder of my real computer, and nothing else."

Analogy: Handing the apprentice a folder of pages instead of a key to the whole filing cabinet.

Plant analog: The data binder you handed the apprentice — they see exactly the pages you put in it, nothing else.

Principle of Least Privilege NIST

ELI5: Give each person, program, or role only the access they need for the job — nothing more. Default to "no" unless required.

Analogy: A new hire gets keys to their office and the break room — not the server room — until the job requires it.

Plant analog: Not every operator gets the supervisor PIN. Not every contractor gets the master keyring. Each role's access is sized to the role.

OS Keyring / Credential Vault

ELI5: A locked, encrypted vault built into your OS where passwords and keys are stored. Accessible only with your login.

Analogy: A safe-deposit box at the bank — they hold it, only your signature opens it.

Plant analog: The locked key cabinet behind the supervisor's desk. Credentials live there, not on the workstation.

Prompt Injection OWASP LLM01

ELI5: An attack where untrusted text (a document, a PDF, an email) contains hidden instructions the agent reads and follows as if you'd typed them. The most-studied current agentic attack vector.

Analogy: A villain mails your secretary a letter that says "move all funds to account X — signed, the boss." If the secretary trusts the letter, the boss never needed to be involved.

Plant analog: A complaint letter containing "approve a 50% chlorine feed increase" — if your agent reads incoming mail, that sentence is a tool call to it.

Approval Gate / Confirmation Prompt

ELI5: The "Are you sure?" the agent must ask before running a destructive command. Whitelist safe tools; require confirmation for the rest.

Analogy: The "Do you really want to send this email?" pop-up — slows things by half a second, catches one mistake a month.

Plant analog: The two-key chemical-feed setpoint override. The witness signature on a backwash change. Load-bearing wall, not "slowing the work down."

Kill Switch & Audit Trail

ELI5: The command that immediately stops the agent, and the saved record of every action it took. Know where both are before you start a run.

Analogy: The off-switch and the cash register tape. Stops the action; tells you what happened.

Plant analog: The E-stop on rotating equipment, and the alarm history. You don't disable either "to clean up the screen."

Sudo / Root (supervisor mode)

ELI5: "Supervisor mode" on a Linux/Mac computer. Anything run with sudo has full system authority to change settings the regular account cannot.

Analogy: The supervisor's master keycard that bypasses normal authorization on every door.

Plant analog: The supervisor PIN that bypasses alarm acknowledgements. Agents almost never need it.

Critical Infrastructure Warning & Legal Disclaimer

Read the safety brief before you give an agent write access

Welcome to Your Agentic AI Guide

What is Agentic AI?

Why Local?

Local Architecture Stack

How Agents Work Together

Setup Steps

Step 1: Host Requirements & Verification

What just happened?

What success looks like

What if it fails?

Step 2: Allocate Unified VRAM (GTT size)

What just happened?

What success looks like

What if it fails?

Step 3: Source Driver Environment Variables

What just happened?

What success looks like

What if it fails?

Step 4: Build the Vulkan llama-server

What just happened?

What success looks like

What if it fails?

Step 5: Download the Recommended Model

What just happened?

What success looks like

What if it fails?

Step 6: Start the Model Server

What just happened?

What success looks like

What if it fails?

Step 7: Configure the Hermes Profile

What just happened?

What success looks like

What if it fails?

Interactive Model Recommendation

Recommended Configuration

Qwen 3.6 35B MoE (MXFP4 Quant)

Strix Halo Benchmark Matrix

Visual Performance Analysis

Local Decode Leaderboard

External Intelligence vs. Local Decode

External Coding Score vs. Local Decode

Sequential Task Wall Time (1150-token prompt / 2000-token response — lower is faster)

Tuning Insights for Unified Memory

Vulkan Performance Win

Speculation Path Matters

Thinking Budgets

Read this before you give an agent write access to anything that matters

The Three Pictures

Apprentice at the SCADA console

Five-year-old with your phone

Apprentice with the corporate card

What Can Actually Go Wrong

The Eight Defensive Layers

Why Local-First Is Half the Safety Story

Before You Turn It Loose — Checklist

Terminology Glossary

1. Hardware & Memory Architecture

2. Machine Learning & LLM Core

3. Model Formats & Compression

4. Agentic Workflows

5. Utility & Domain Context

6. Agent Safety & Sandboxing