Dashboard Overview
Welcome to Your Agentic AI Guide
This interactive workspace teaches you how to run a private, local agentic AI workflow on consumer AMD hardware. No API keys, no external cloud dependencies, and 100% data privacy.
Use it four ways: learn the agent stack, reproduce the benchmark rows, choose a model/backend lane, and build safely toward supervised water-utility workflows.
Related writing: Title 22 — water, systems, strategy.
Technical details: Reproducibility Matrix & Technical Deep-Dive.
What is Agentic AI?
Standard chatbots just answer questions. An agent uses tools—such as writing and running code, searching documentation, or inspecting local files—to perform complex, multi-step tasks. Instead of just replying with a single paragraph, an agent operates in a continuous loop: it plans, takes action, evaluates the output, corrects its mistakes, and keeps working until the goal is achieved.
- Log Analysis: Parse and transform messy text or spreadsheet logs locally.
- Data Checking: Read datasets and automatically flag rows that deviate from rules.
- Report Generation: Synthesize multiple notes, guides, or logs into drafted summaries.
Why Local?
For individuals and organizations handling sensitive documents, proprietary logs, or internal code, sending data to public cloud APIs puts it on infrastructure you do not control. Running a local LLM keeps prompts, files, and outputs on your workstation.
Local Architecture Stack
Four roles, your hardware. Each layer below names the role; the parenthetical is the reference implementation this repo uses — equivalents work just as well.
How Agents Work Together
One agent is often enough — but bigger jobs go better when you arrange several, the way a plant runs a crew rather than one operator. Match the shape to the work:
- One agent: a self-contained job, start to finish.
- Sequential pipeline: steps that depend on each other (gather → compare → draft).
- Batch: the same job over many independent items (e.g. summarize 40 equipment manuals).
- Orchestrator: a coordinator that splits a big goal, delegates the parts, and assembles the result.
Full walkthrough with water-industry examples: Chapter 10 — How Agents Work Together.
Setup Steps
Step 1: Host Requirements & Verification
Before running local models, verify your hardware, user groups, and kernel settings. On Strix Halo, the integrated GPU must be visible to ROCm.
# Check if your user is in the render and video groups
groups
# Run the host diagnostic script
bash scripts/setup/check_host.sh
What just happened?
The diagnostic script checks that your system is running Linux, verifies that your graphics chip is detected as gfx1151 (RDNA 3.5), and checks that the kernel parameters are configured to let the GPU access the shared RAM pool.
What success looks like
[PASS] Kernel Config (gttsize): Found gttsize=98304
[PASS] Active Kernel Parameter: no_system_mem_limit is enabled (1)
[PASS] ROCm GPU Architecture: gfx1151 visible to ROCm (Radeon APU)
Check complete: 5 passing, 0 failing, 0 warnings.
What if it fails?
Common issue: GPU not visible or driver missing.
If rocminfo fails to report your GPU architecture, run dmesg | grep amdgpu. Ensure secure boot is disabled if the driver fails to load, and verify your user is in the render group (run sudo usermod -aG render,video $USER then log out and back in).
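The group check can also be scripted. This is a minimal sketch, not part of check_host.sh; the function name missing_groups is hypothetical, and it only compares names in the output of id -nG.

```shell
# Report which required groups are missing from a space-separated group list.
# $1 = the group list to inspect, remaining arguments = required group names.
# Prints each missing name on its own line; succeeds only when none are missing.
missing_groups() {
    have=" $1 "          # pad with spaces so only whole names match
    shift
    rc=0
    for g in "$@"; do
        case "$have" in
            *" $g "*) ;;             # group is present
            *) echo "$g"; rc=1 ;;    # group is missing
        esac
    done
    return "$rc"
}

# Compare the current user's groups against render and video.
if missing_groups "$(id -nG)" render video; then
    echo "ok: user is in render and video"
else
    echo "add the groups listed above, then log out and back in"
fi
```

Passing the group list as an argument keeps the check easy to try against a sample string before trusting it on a real account.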
Step 2: Allocate Unified VRAM (GTT size)
Unlike discrete graphics cards with dedicated VRAM, the Strix Halo APU shares system RAM. By default, Linux limits graphics allocations to 25%-50% of memory. We need to override this to allow the GPU to access up to 75% of RAM (96 GB on a 128 GB system).
# Apply optimal allocations (requires root privileges)
# Defaults to 75% of your total system RAM.
sudo bash scripts/setup/apply_gtt.sh
# Reboot is required to apply the kernel parameters
sudo reboot
What just happened?
The script writes /etc/modprobe.d/amdgpu_llm_optimized.conf, which sets gttsize (in MB), no_system_mem_limit=1 (so model layers do not silently spill to CPU memory and stall generation), and the ttm limits. It then rebuilds the kernel boot image (initramfs).
What success looks like
After reboot, checking GTT memory parameters reports the correct allocated size:
cat /sys/module/amdgpu/parameters/gttsize
# Output on a 128 GB system: 98304 (96 GB in MB)
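The 98304 figure is plain arithmetic: 75% of 128 GB, expressed in MB. The sketch below reproduces it so you can predict the value for other RAM sizes. The function names are hypothetical, and the page-count helper assumes 4 KiB pages and that page-based limits (such as the ttm limits) cover the same budget; check what apply_gtt.sh actually writes before relying on it.

```shell
# Convert "percent of total RAM" into the MB value used for gttsize.
# $1 = total RAM in GB, $2 = percent handed to the GPU (defaults to 75).
gtt_mb() {
    echo $(( $1 * 1024 * ${2:-75} / 100 ))
}

# The same budget expressed in 4 KiB pages (assumption: page-based limits
# are sized to match gttsize). 1 MB = 256 pages of 4 KiB.
gtt_pages() {
    echo $(( $(gtt_mb "$1" "${2:-75}") * 256 ))
}

gtt_mb 128        # prints 98304
gtt_pages 128     # prints 25165824
```

On a 64 GB machine the same call, gtt_mb 64, gives 49152.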
What if it fails?
Common issue: initramfs build fails or parameters are ignored on boot.
Verify that you ran the script with sudo. If the parameters are not active after reboot, check whether your system uses a boot loader (e.g. systemd-boot) that requires kernel command-line overrides instead of modprobe files.
Step 3: Source Driver Environment Variables
ROCm does not officially support the Strix Halo gfx1151 APU out of the box. We must inject environment flags to override the runtime architecture and configure memory parameters.
# Source the environment variables in your active shell
source scripts/setup/set_hsa_env.sh
What just happened?
This script exports critical variables such as HSA_OVERRIDE_GFX_VERSION=11.5.1 (pins ROCm to an HSA compatibility target for the gfx1151 APU) and HSA_ENABLE_SDMA=0 (disables System DMA, which causes kernel hangs on APUs during large model routing).
What success looks like
GPU Environment Configured:
HSA_OVERRIDE_GFX_VERSION = 11.5.1
HSA_ENABLE_SDMA = 0
HIP_VISIBLE_DEVICES = 0
What if it fails?
Common issue: Variables disappear in new terminal.
Environment variables set with export or source exist only in the active terminal session. You must source this file in every new terminal before starting model servers or agents. For convenience, you can append source /path/to/set_hsa_env.sh to your ~/.bashrc file.
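Because a missing variable only shows up later as a confusing server failure, a quick pre-flight check can help. This is a sketch under the assumption that the three variables shown in the success output are the ones that matter; the function name missing_env is hypothetical.

```shell
# Print the name of every variable in the argument list that is unset or empty.
# Succeeds only when all of them have a value.
missing_env() {
    rc=0
    for name in "$@"; do
        eval "val=\${$name:-}"     # indirect lookup of the variable called $name
        if [ -z "$val" ]; then
            echo "$name"
            rc=1
        fi
    done
    return "$rc"
}

# Check the variables listed in the success output for this step.
missing_env HSA_OVERRIDE_GFX_VERSION HSA_ENABLE_SDMA HIP_VISIBLE_DEVICES \
    || echo "missing variables listed above; run: source scripts/setup/set_hsa_env.sh"
```

A value of 0 counts as set, so HSA_ENABLE_SDMA=0 passes the check.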
Step 4: Build the Vulkan llama-server
We serve on the open-source Vulkan (RADV) backend — the fastest lane on Strix Halo and the default for this stack. There is no prebuilt binary for this hardware, so you compile llama-server once from source at the pinned stable tag (b9247). This takes a few minutes and you only do it once. (A ROCm path is kept as an optional fallback — see Chapter 08.)
4a. Install the build tools (one time). These commands are for Ubuntu/Debian. They install the compiler, CMake, and the Vulkan/RADV driver and headers.
sudo apt update
sudo apt install -y git cmake build-essential \
libvulkan-dev glslc vulkan-tools mesa-vulkan-drivers
4b. Clone and build. This clones into ~/src/llama.cpp and builds the server using all your CPU cores.
# Clone llama.cpp into a predictable location and pin the stable tag
mkdir -p ~/src && cd ~/src
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b9247
# Build only the server target, with Vulkan (RADV) enabled
cmake -B build-vulkan -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release --target llama-server -j"$(nproc)"
4c. Point the config at your new binary. Run these from the tesla_agent repo folder. The first line creates your config; the second writes the binary path into it automatically.
cp scripts/config.env.example scripts/config.env
sed -i "s|^TESLA_VULKAN_SERVER=.*|TESLA_VULKAN_SERVER=\"$HOME/src/llama.cpp/build-vulkan/bin/llama-server\"|" scripts/config.env
What just happened?
CMake compiled a Vulkan-enabled llama-server at ~/src/llama.cpp/build-vulkan/bin/llama-server. The sed line set TESLA_VULKAN_SERVER in scripts/config.env to that exact path, so the serve script in Step 6 knows where to find it. (Prefer to edit by hand? Open scripts/config.env and set TESLA_VULKAN_SERVER to that path yourself.)
What success looks like
ls -l ~/src/llama.cpp/build-vulkan/bin/llama-server
# -rwxr-xr-x ... llama-server (present and executable)
grep TESLA_VULKAN_SERVER scripts/config.env
# TESLA_VULKAN_SERVER="/home/you/src/llama.cpp/build-vulkan/bin/llama-server"
What if it fails?
Shader compilation error / glslc too old.
The Vulkan build needs a current glslc shader compiler. If CMake errors during shader compilation, the distro glslc (2023.x) may be too old — install a newer shaderc or build it from source, then re-run the two cmake commands.
cmake: command not found or missing Vulkan headers.
Re-run step 4a; the libvulkan-dev and mesa-vulkan-drivers packages must be installed for the Vulkan build to find its headers and driver.
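Before re-running the build, it can save time to confirm that the tools are actually on your PATH. A minimal sketch follows; the function name missing_tools is hypothetical, and the tool list is an assumption based on the packages installed in step 4a (vulkaninfo ships with vulkan-tools).

```shell
# Print every command from the argument list that cannot be found on PATH.
# Succeeds only when all of them are available.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Check the build prerequisites for the Vulkan llama-server.
missing_tools git cmake glslc vulkaninfo \
    || echo "re-run step 4a to install the tools listed above"
```

This only proves the commands exist. It does not check the glslc version, so a shader compilation error can still point to an outdated compiler.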
Step 5: Download the Recommended Model
The starter setup uses the 21.7 GB Qwen 3.6 35B Mixture-of-Experts (MoE) model because it is compact, fast, and the CODE/general workhorse. The full ladder also includes Qwen 3.5 35B as the PLAN/AGENTIC baseline, StepFun Step-3.7-Flash as the QUALITY champion, and an AMERICAN-ONLY tier (gpt-oss-120B and Gemma 4 31B — US-origin models for agencies that may require domestic-only provenance). Use the Model Finder after this baseline is working.
# Install the Hugging Face CLI if needed
pip install huggingface_hub
# Download the model GGUF from the verified Unsloth repository
mkdir -p ~/models/qwen3.6-35b-a3b
huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \
Qwen3.6-35B-A3B-MXFP4_MOE.gguf \
--local-dir ~/models/qwen3.6-35b-a3b
What just happened?
The Hugging Face client downloads the model segments and reconstructs the single Qwen3.6-35B-A3B-MXFP4_MOE.gguf file on your local drive.
What success looks like
ls -lh ~/models/qwen3.6-35b-a3b/Qwen3.6-35B-A3B-MXFP4_MOE.gguf
# Output should show a file size of about 22 GB
What if it fails?
Common issue: Disk space exhausted.
The download requires at least 25 GB of free disk space. Ensure your target directory has sufficient room. If the download is interrupted, re-running the command will resume the download from where it stopped.
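The free-space requirement can be checked up front. This sketch reads the available space from df; the function name has_free_gb is hypothetical, and it treats 1 GB as 1024 × 1024 KB.

```shell
# Succeed when the filesystem holding directory $1 has at least $2 GB free.
has_free_gb() {
    # -P forces one line per filesystem; column 4 is the available space in KB.
    avail_kb=$(df -Pk "$1" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# Check the current directory against the 25 GB requirement.
if has_free_gb . 25; then
    echo "enough room for the download"
else
    echo "less than 25 GB free here"
fi
```

Point it at the directory you plan to download into, or at its parent if the directory does not exist yet.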
Step 6: Start the Model Server
With the GTT configuration applied and environment sourced, launch the Vulkan server. We run it with a 32,768 context window and optimized memory parameters.
# Make sure TESLA_VULKAN_SERVER is set in scripts/config.env, then:
bash scripts/serving/serve_vulkan.sh
What just happened?
The script launches llama-server on port 8095 via Vulkan. It exports HIP_VISIBLE_DEVICES=-1 to hide the GPU from ROCm (forcing the Vulkan route), sets the RADV ICD, places the model layers on the GPU, enables Flash Attention, and allocates the 32,768-token context.
What success looks like
Look for lines indicating successful model load and socket binds in the server log:
llama_new_context_with_model: n_ctx = 32768, total VRAM = 21.7 GB
llama_server_listening: http://127.0.0.1:8095
What if it fails?
Common issue: TESLA_VULKAN_SERVER path is empty.
serve_vulkan.sh exits if scripts/config.env doesn't point at your Vulkan llama-server binary (Step 4). Also ensure you sourced set_hsa_env.sh in the active terminal and that your GTT size was applied (Step 2).
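Loading a model of this size takes a while, so a short polling loop is a convenient way to tell "still loading" from "failed". The helper below is a generic retry sketch with a hypothetical name, wait_until. The commented example assumes curl is installed and that your llama-server build exposes the /health endpoint on port 8095.

```shell
# Retry a probe command until it succeeds.
# $1 = maximum attempts, $2 = seconds to wait between attempts,
# remaining arguments = the probe command to run.
wait_until() {
    tries=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$tries" ]; do
        "$@" >/dev/null 2>&1 && return 0   # probe passed
        n=$(( n + 1 ))
        sleep "$delay"
    done
    return 1                               # gave up
}

# Example (not run here): poll the server for up to a minute.
# wait_until 30 2 curl -sf http://127.0.0.1:8095/health && echo "server is up"
```

Keeping the probe as an argument means the same loop works for any readiness check, not only HTTP.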
Step 7: Configure the Hermes Profile
Now that the model API is running, configure the Hermes agent engine to communicate with it.
# Generate the Hermes config files and system launcher
bash scripts/serving/create_hermes_profile.sh
What just happened?
The script writes a configuration file at ~/.hermes/profiles/qwen36_mxfp4/config.yaml that sets the API URL, local model details, and max tokens. It also creates a quick launcher at ~/.local/bin/qwen36_mxfp4.
What success looks like
created Hermes profile: ~/.hermes/profiles/qwen36_mxfp4
created launcher: ~/.local/bin/qwen36_mxfp4
To run the agent, use: qwen36_mxfp4 -t "your task"
What if it fails?
Common issue: command not found (qwen36_mxfp4).
Your PATH may not include ~/.local/bin by default. Check that the directory exists, then add export PATH="$HOME/.local/bin:$PATH" to your ~/.bashrc and source it.
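To see whether the launcher directory is already on your PATH, a small check like the following can help. It is a sketch with a hypothetical function name, path_has, and it matches entries exactly, so a PATH entry written with a trailing slash would not count.

```shell
# Succeed when directory $1 is an entry in a colon-separated path list.
# $2 is the list to search and defaults to the current PATH.
path_has() {
    case ":${2:-$PATH}:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report whether the launcher directory needs to be added.
if path_has "$HOME/.local/bin"; then
    echo "launcher directory is on PATH"
else
    echo "launcher directory is not on PATH yet"
fi
```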
Interactive Model Recommendation
Answer a few questions about your hardware configuration and active goals to select the best model settings.
Recommended Configuration
Qwen 3.6 35B MoE (MXFP4 Quant)
Think-On Enabled
Start server command:
serve_vulkan.sh
Hermes run config:
max_tokens: 8192
Strix Halo Benchmark Matrix
Performance comparison on AMD APU system (128 GB total RAM, 96-112 GB GTT allocated depending on the run):
Benchmark Stack:
• Hardware: AMD Ryzen Strix Halo APU (gfx1151), 128 GB LPDDR5X RAM (96-112 GB GTT ceiling depending on the model)
• Backend: llama.cpp/llama-server (b9247; Vulkan/MTP lanes reproduced on b9360; Gemma QAT MTP probes on Atomic b9019) served via ROCm 7.2.x / Vulkan (Mesa RADV 25.2.8)
• Parameters: Temp = 0 (greedy decoding), context budget = 8k-32k, Flash Attention enabled
| Model & Mode | Size | Quality | Speed | Gate |
|---|---|---|---|---|
| gpt-oss-120B MXFP4 — AMERICAN-ONLY quality/speed (US-origin) | ~63 GB | 5-1 vs Qwen 35B; 4-2 vs Qwen 122B | ~46 t/s | 3 / 3 |
| Gemma 4 26B-A4B IT (UD-Q6_K_XL) — verified plain-control baseline | 21.2 GB | 2-4 vs Gemma 31B | 44.76 t/s tg128; pp512 1002.76 t/s; reasoning off, F16 KV | 3 / 3 |
| Gemma 4 26B-A4B QAT Q4_0 — fast Gemma QAT lane | 13.45 GiB | quality control vs non-QAT Q4 pending | 59.4 t/s; pp 1194.4 t/s; official Google QAT GGUF | 3 / 3 |
| Gemma 4 26B-A4B QAT Q4_0 + MTP/Q8 KV — experimental speed probe | 13.45 GiB + ~310 MiB | assistant head not QAT-matched | 71.0 t/s; pp 714.4 t/s; MTP acceptance 56.9% | 3 / 3 |
| Gemma 4 12B QAT Q4_0 | 6.50 GiB | quality control pending | 25.7 t/s; pp 666.5 t/s | not run |
| Gemma 4 31B QAT Q4_0 | 16.44 GiB | quality control pending | 11.0 t/s; pp 204.2 t/s | not run |
| Gemma 4 31B QAT Q4_0 + MTP — experimental speed probe | 16.44 GiB + ~337 MiB | assistant head not QAT-matched | 15.4 t/s; pp 118.0 t/s; MTP acceptance 42.5% | not run |
| Gemma 4 31B IT Q6_K — AMERICAN-ONLY coding second-opinion (US-origin, dense — slow decode) | 25.2 GB | 4-2 vs Gemma 26B-A4B | ~8.25 tok/s tg128; ~7.7 tok/s sustained (dense; pp8192 ~133.6 tok/s) | 3 / 3 |
| Qwen 3.6 35B (Vulkan RADV, Think-On) — CODE/general baseline; workhorse default unchanged | 21.7 GB | 82 / 84 | ~58.5 t/s | 3 / 3 |
| Qwen 3.6 35B MXFP4-MTP (Vulkan RADV) — opt-in speed lane | 19.3 GB | same production quant | ~72.7 t/s (+24%); prefill not separately captured | 3 / 3 |
| Qwen 3.6 35B Q4_K_M-MTP (Vulkan RADV) — opt-in speed lane | 20.7 GB | won quality pairwise 4-2 | ~81.2 t/s (+39%); prefill not separately captured | 3 / 3 |
| Qwen 3.6 35B (ROCm, Think-On) — fallback | 21.7 GB | 82 / 84 | 44.2 t/s | 3 / 3 |
| Qwen 3.6 35B (MXFP4, Think-Off) | 21.7 GB | 82 / 84 | 43.7 t/s | 1 / 3 |
| Qwen 3.5 35B (MXFP4, Think-On) — PLAN/AGENTIC baseline | 21 GB | 79 / 84 | 47.3 t/s | 3 / 3 |
| Qwen 3.5 122B (MXFP4, Think-On) — retired 2026-06-02 | 70 GB | 80 / 84 | 19.4 t/s | 3 / 3 |
| Qwen 3.5 122B (MXFP4, Think-Off) — retired 2026-06-02 | 70 GB | 81 / 84 | 19.5 t/s | 3 / 3 |
| Qwen 3.5 122B MTP (MXFP4_MOE, Vulkan RADV) — retired 2026-06-02 (tuned lane, kept as record) | ~70 GB | 3-3 quality tie vs previous MTP config | 28.3 t/s; pp 324.9 t/s; DRAFT_N=1, PMIN unset | 3 / 3 |
| StepFun Step-3.7-Flash MTP — QUALITY champion (graduated 2026-06-02) | 88.79 GiB + 3.5 GB draft | plain StepFun: 6-0 vs gpt-oss-soulfix; 4-0-2 vs 122B | 27.9 t/s; pp 183.5 t/s; 89.3% MTP acceptance (ub=256) | 3 / 3 |
| StepFun Step-3.7-Flash plain — QUALITY champion (plain lane) | 88.79 GiB | 6-0 vs gpt-oss-soulfix; 4-0-2 vs 122B | 20.4-22.3 t/s; pp 212.0 t/s; coding 4/5 E2E | 3 / 3 |
| Qwen 3.6 27B Dense (UD-Q4_K_XL, Think-On) — experimental, not in stack | 16.4 GB | 0-6 vs Qwen 122B | 9.6-11.5 t/s | 3 / 3 |
| Qwen 3.6 27B Dense (UD-Q4_K_XL, Think-Off) — experimental, not in stack | 16.4 GB | — | 9.6-11.5 t/s | 1 / 3 |
| Qwen3-Coder-Next (UD-Q4_K_XL, Vulkan RADV) | 49.6 GB | saved orchestrated coding artifact passes grader checks | 44.4 t/s; pp 723.2 t/s; b9360 promoted | 3 / 3 |
Note — MTP speed lanes are opt-in. The Qwen3.6-35B-A3B-MTP GGUFs carry a native nextn head, so llama-server can self-speculate with --spec-type draft-mtp and no separate draft model. The technique surfaced via the community strix-halo-guide; this repo independently reproduced and quality-gated the MXFP4-MTP and Q4_K_M-MTP lanes. Full audit trail: Reproducibility Matrix & Technical Deep-Dive.
Latest large-model MTP lanes. Qwen 122B MTP (now retired as of 2026-06-02, kept only as a record) reached a tuned Vulkan profile at 28.3 t/s decode with 81.8% MTP-probe acceptance; its quality role is now held by the StepFun champion. StepFun Step-3.7-Flash MTP reaches 27.9 t/s decode (wall std 78.0 s) with 89.3% acceptance using ub=256 — a ubatch sweep (2026-06-06) showed smaller micro-batches cut per-speculative-step latency and compound over long outputs (+7% tg, −5% wall std vs the prior ub=512 default).
Gemma 4 26B-A4B plain control baseline. The no-spec Vulkan lane with --reasoning off and F16 KV measures pp512 1002.76 ± 10.29 t/s and tg128 44.76 ± 0.90 t/s with Hermes nonce 3/3. It is the simpler lane for general reasoning, JSON extraction, and prose; the MTP comparison only pays off on heavy code generation.
Gemma 4 QAT Q4_0 sweep. The official Google QAT 26B-A4B row is now the fastest general Gemma lane measured here: 59.4 t/s decode and 1194.4 t/s prefill. QAT means quantization-aware training: the model is trained or adapted with the low-precision target in mind. The experimental 26B-A4B MTP/Q8 row reaches 71.0 t/s single-stream, but uses a non-QAT-matched assistant head and drops two-slot throughput, so it remains a speed probe.
Note — the dense 27B is benchmarked but NOT in the production stack. Community discussion often treats Qwen 27B as a strong reasoner, but the local Strix Halo benches did not support that routing choice: blind pairwise was 0–6 vs the 122B and normal decode tested around 9.6–11.5 t/s. It is kept as a break-glass option for tough, blocked projects, not a first- or second-line model. Technical aside: DFlash speculative decoding lifts its floor to ~31 t/s (2.82×) with a footprint-minimized Q4_K_M draft.
Visual Performance Analysis
Local Strix Halo speed plus external intelligence and coding scores. Artificial Analysis scores measure cloud/API model capability; local speed is this repo's llama.cpp/Vulkan/ROCm benchmark, so treat the combined view as a routing map, not a universal leaderboard.
Local Decode Leaderboard
External Intelligence vs. Local Decode
External Coding Score vs. Local Decode
Sequential Task Wall Time (1150-token prompt / 2000-token response — lower is faster)
Credit and source notes. Intelligence scores are credited to Artificial Analysis. Coding scores use Artificial Analysis Coding Index where available; for StepFun Step 3.7 Flash, the crawl exposed AA Intelligence but not AA Coding Index, so the coding marker is StepFun's published SWE-Bench Pro score and is labeled separately. Local speed, prefill, nonce, and coding-gate rows are this repo's Strix Halo measurements. Wall time chart: solid bars are directly measured via full_bench.sh (1150-token prompt / 2000-token response, normalized); faded bars are estimates using the same formula from speed data. MTP technique credit remains with strix-halo-guide where noted.
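The notes above do not spell out the wall-time formula. A natural reading, and an assumption here, is prompt tokens divided by prefill speed plus response tokens divided by decode speed. That reading reproduces the 78.0 s figure quoted for StepFun MTP, which is some evidence it is close. The function name wall_seconds is hypothetical.

```shell
# Estimated wall time in seconds for one 1150-token prompt and 2000-token reply.
# $1 = prefill speed in t/s, $2 = decode speed in t/s.
# Assumed formula: 1150 / prefill + 2000 / decode.
wall_seconds() {
    awk -v pp="$1" -v tg="$2" 'BEGIN { printf "%.1f\n", 1150 / pp + 2000 / tg }'
}

wall_seconds 183.5 27.9       # StepFun MTP row: prints 78.0
wall_seconds 1002.76 44.76    # Gemma 4 26B-A4B plain row: prints 45.8
```

Decode dominates in both cases, which is why the decode column is the better first filter when choosing a lane for long outputs.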
Tuning Insights for Unified Memory
Key findings from benchmarks run on Strix Halo hardware:
Vulkan Performance Win
The Vulkan/RADV backend (using Mesa drivers) runs 13% to 19% faster than the official ROCm HIP backend on MoE decoding, a gain that adds up over long reasoning sessions.
Speculation Path Matters
Older separate-draft MoE speculation slowed down because router verification erased the gain. Native Qwen MTP is different: the model carries its own nextn head, so --spec-type draft-mtp gives an opt-in +24–39% speed lane when quality holds.
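The +24% and +39% figures follow directly from the decode speeds in the benchmark matrix, using the ~58.5 t/s Vulkan Think-On row as the baseline. A small sketch of the arithmetic, with a hypothetical function name:

```shell
# Percentage speedup of a faster lane over a baseline, rounded to a whole number.
# $1 = baseline decode speed in t/s, $2 = faster lane decode speed in t/s.
speedup_pct() {
    awk -v base="$1" -v fast="$2" 'BEGIN { printf "%.0f\n", (fast / base - 1) * 100 }'
}

speedup_pct 58.5 72.7    # MXFP4-MTP lane: prints 24
speedup_pct 58.5 81.2    # Q4_K_M-MTP lane: prints 39
```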
Thinking Budgets
Reasoning budgets are useful for planning and prose, but should not be applied to stateful coding loops. In the coding gate, any cap degraded reliability; leave coding routes uncapped.
The Three Pictures
Apprentice at the SCADA console
Confident, eager, reads every screen. Will push the wrong setpoint at 2 AM and not call you. Difference vs a real apprentice: the agent doesn't get tired.
Five-year-old with your phone
Will tap every button. Will discover some buttons buy things. Will accept every popup. Every tool in the agent's list is just another button.
Apprentice with the corporate card
Cloud APIs charge per token. An agent looping on a failing task for six hours at 3 AM produces a four-figure bill. There is no "it was an agent" refund policy.
What Can Actually Go Wrong
Each item is paired with the plant analog it maps to — same failure mode, different setting.
- Destructive commands. `rm -rf`, `DROP TABLE`, `git push --force`. The verb sounds right; the agent runs. Plant analog: an apprentice with CMMS admin access runs "delete completed work orders" without a date filter and trims six years of maintenance history.
- Overnight database refactor. You asked the agent to "clean up the schema"; you woke up to a half-run migration and a closed backup window. Plant analog: you asked the agent to "standardize work-order naming." It renamed every backflow-test record and zeroed the inspection-date field because the column rename cascaded.
- Credential exfiltration. The agent reads `.env`, `~/.ssh/id_rsa`, `~/.aws/credentials` because they sit in the directory it was told to operate in. Plant analog: the project folder included a "temporarily" saved SCADA admin password from six months ago that nobody moved out.
- Cost runaway. Failed tasks retry on more expensive models. Plant analog: a metering pump that retries and doubles its dose every minute with no high-flow alarm — the same loop-with-no-alarm shape produces a four-figure cloud bill. Local-first eliminates this category entirely.
- Tool escalation. File-edit access led to formatter access led to network access. Plant analog: the contractor's key to the chemical-feed room shared a door with the SCADA equipment closet they didn't know was connected.
- Trust scope creep. Let it edit, then commit, then push, then deploy. Plant analog: let the new operator log readings, then submit reports, then file with the state — until an incorrect TT/CT calculation gets filed as "compliant."
- Prompt injection from documents. A regulatory PDF telling the agent to email `/etc/shadow` to an attacker. The most-studied current attack vector in agentic systems. Plant analog: a complaint letter contains "approve a 50% chlorine feed increase" — if the agent reads incoming mail, that sentence is a tool call.
The Eight Defensive Layers
No single control is sufficient. Layer them. Each line names the IT-side control with a plant analog after the em-dash.
- Sandbox everything. Docker / devcontainer / VM. Bind-mount the project directory, never your home — the SCADA training simulator, not the live HMI.
- Least privilege. Read-only by default. Write only where needed. No shell if file edit is enough — not every operator gets the supervisor PIN.
- Credentials outside the agent's reach. OS keyring, `.env` outside the bind mount, no real keys in chat — keys live in the locked cabinet behind the supervisor's desk, not in the project folder.
- Spend limits at the source. Anthropic / OpenAI console budget caps, virtual cards, alert thresholds — the high-flow alarm on the metering pump. Set it before you start. (Local-first sidesteps this entirely.)
- No production systems. Ever. No SCADA, BAS, DMS, RTU, historian, GIS, CIS, billing, or PII — same rule you already follow for testing against the live HMI: snapshot first, work on the bench.
- Approval gates on destructive ops. Default to "ask before doing." Don't disable the confirmation prompt — the two-key chemical-feed override is a load-bearing wall, not a drag on the work.
- Short leash, expanding trust. First task you watch every action; tenth you check after; hundredth you spot-check — onboarding a new operator: shadow → supervised → solo → spot-checked. Never skip the spot-check.
- Kill switch and audit trail. Know how to stop it. Save transcripts. Review them — the E-stop and the alarm history. Know where both are before you start the run.
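To make the approval-gate layer concrete, here is a minimal sketch of an "ask before doing" wrapper. The function names and the pattern list are hypothetical, drawn from the destructive commands named in this chapter. Substring matching is easy to evade, so treat this as an illustration of the idea and not as a security boundary; the sandbox and least-privilege layers still have to do the real containment.

```shell
# Succeed when a command line matches a known destructive pattern.
needs_approval() {
    case "$1" in
        *"rm -rf"*|*"DROP TABLE"*|*"git push --force"*|*"git push -f"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Run a command string, but require a typed "yes" first if it looks destructive.
guarded_run() {
    if needs_approval "$1"; then
        printf 'Destructive command: %s\nType yes to run it: ' "$1"
        read -r answer
        [ "$answer" = "yes" ] || { echo "skipped"; return 1; }
    fi
    sh -c "$1"
}
```

With these defined, guarded_run "ls -l" runs immediately, while guarded_run "rm -rf build" pauses for confirmation and is skipped on any answer other than yes.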
Why Local-First Is Half the Safety Story
Every model recommended in this repo runs on your hardware. That means:
- No agent loop bills an API meter. A retry storm costs electricity, not $1,000.
- No data leaves your machine. Customer records, operations data, draft reports — none of it is sent anywhere by default.
- No third party can change the model under you. The GGUF on your disk doesn't move unless you move it.
If you only follow one principle from this chapter: start local, stay local until you have a concrete reason to leave.
Before You Turn It Loose — Checklist
Print this. Tape it next to your monitor.
- ☐ Agent is in a sandbox. Its `/root/` is not my `/home/`.
- ☐ Agent has access only to the directory it needs for this task.
- ☐ No real credentials are inside that directory.
- ☐ If using cloud APIs: a hard spend cap is set on the provider side. Today.
- ☐ No production SCADA / BAS / RTU / historian / GIS / CIS / billing / PII is reachable.
- ☐ Destructive operations require my explicit approval, every time.
- ☐ I know the exact command to stop the agent.
- ☐ Transcripts are being saved.
- ☐ Worst-case mistake on this task is recoverable in under an hour.
If you can't check every box, narrow the scope until you can.
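The "no real credentials" box is easier to tick with a quick scan of the directory you are about to hand over. The sketch below looks only at file names; the function name find_secrets and the name list are assumptions based on the files mentioned in this chapter, and a clean result does not prove that no secrets are embedded inside other files.

```shell
# List files under directory $1 whose names suggest stored credentials.
find_secrets() {
    find "$1" -type f \( -name '.env' -o -name '.env.*' -o -name 'id_rsa' \
        -o -name 'id_ed25519' -o -name 'credentials' -o -name '*.pem' \) 2>/dev/null
}

# Example (not run here): review the project folder before starting the agent.
# find_secrets ~/projects/my-task
```

An empty result means none of the listed names were found. Anything it prints should be moved out of the directory before the agent gets access.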
The full chapter — failure-mode catalog, scenario walkthrough, incident playbook, and further reading — lives in the guide.
Terminology Glossary
Complex AI and driver concepts explained in simple, jargon-free language with analogies and references.
1. Hardware & Memory Architecture
ELI5: AMD's term for a single chip that contains both the computer's CPU manager and GPU math speed-runner. Strix Halo shares the system's 128 GB memory pool between CPU and GPU.
Analogy: A premium kitchen machine that blends and cooks on the same counter instead of buying two separate appliances.
ELI5: A system where the CPU and GPU share the exact same physical system RAM pool.
Analogy: A chef and assistant sharing a single massive counter space instead of running between separate tables.
ELI5: The operating system setting that controls how much shared RAM the GPU is allowed to access for graphics and compute allocations. This guide's reference setup uses a 96 GB GTT pool on a 128 GB machine.
Analogy: A boundary line in a shared room telling the graphics card how much space it can use without taking over the whole house.
ELI5: AMD's GPU compute platform, similar in role to NVIDIA CUDA. It provides the HIP backend used by llama.cpp and other local inference stacks.
Analogy: A bilingual translator bridging the gap between your code and AMD graphics hardware.
ELI5: Vulkan is a graphics/compute API. RADV is Mesa's open-source Vulkan driver for AMD GPUs. In this guide, Vulkan/RADV is the fastest measured default path for local generation rows.
Analogy: Vulkan is the road system; RADV is the road crew keeping the AMD lanes paved.
ELI5: AMD's former open-source Vulkan driver. Prefer Mesa RADV; stale AMDVLK ICD files can quietly make the wrong Vulkan driver load.
Analogy: An old road sign that still sends traffic down the wrong street.
ELI5: A Linux service that applies performance profiles. The accelerator-performance profile can reduce power-management drag during local LLM runs.
Analogy: Telling the plant to use the high-load test profile instead of the energy-saving schedule.
2. Machine Learning & LLM Core
ELI5: A massive autocomplete engine trained on billions of texts to predict the next word.
Analogy: A hyper-smart predictive text keyboard that has read the entire internet.
ELI5: Word-pieces that the AI reads and writes. In English, a token is roughly three-quarters of a word, though exact counts depend on the tokenizer.
Analogy: Cutting text into syllable Lego bricks to build sentences rather than reading individual letters.
ELI5: How fast the model reads your input prompt, measured in tokens per second. Higher is better.
Analogy: How fast a reviewer can read the packet before writing comments.
ELI5: How fast the model writes its response, measured in tokens per second. This is the speed you feel while chatting.
Analogy: How fast the reviewer can dictate the final answer after reading the packet.
ELI5: The size of the AI's active notepad (short-term memory).
Analogy: A notebook where the AI writes down what you said and what it did. If it runs out of pages, it starts forgetting the beginning.
ELI5: Speed-reading math tricks and word-guessing loops to accelerate model processing. On Strix Halo, keep Flash Attention enabled with the equivalent of -fa on / -fa 1.
Analogy: Indexing a book for fast lookups (Attention) and having an assistant draft text for a senior editor to quickly check (Speculative).
ELI5: A self-speculative speed trick where the model has its own built-in head for guessing the next few tokens, so no separate draft model is needed.
Analogy: The senior editor has likely next phrases penciled into the margin and can approve several words at once.
ELI5: A model component trained to guess more than one next token at a time.
Analogy: Instead of predicting one word, it sketches the next short phrase for the main model to check.
ELI5: A formatting script that wraps conversational history (User, Assistant, System) and tool-calling data in structured XML or markdown tags so the AI model understands where thoughts end and tool requests begin.
Analogy: A standardized utility log sheet. No matter who takes the measurements, they write them in the exact same boxes so the state compliance auditor can read them instantly.
3. Model Formats & Compression
ELI5: The file format used by llama.cpp and related tools to store local AI models. A .gguf file contains model weights plus metadata needed for inference.
Analogy: An `.mp3` or `.zip` file specifically optimized for loading AI brains.
ELI5: A compression technique that lowers model decimals to integers to shrink file size.
Common labels: Q4_K_M is a balanced 4-bit quant; Q8_0 is higher-quality 8-bit at roughly 2x the weight size; UD-Q4_K_XL is Unsloth Dynamic 4-bit with higher precision for important layers; BF16 is 16-bit precision and much larger.
Analogy: Saving a huge raw photo as a JPEG. It takes up 70% less space, but looks identical to your eyes.
ELI5: A model design where only a small part of the brain is active for each token. A 30B-A3B model has about 30 billion total parameters but activates about 3 billion per token.
Analogy: A hospital with 8 specialist doctors. For a cold, only the 2 required specialists treat you, keeping it fast and cheap.
ELI5: A model where all parameters are used for every token. A dense 7B model uses all 7 billion parameters every time it writes a token.
Analogy: Every specialist reviews every patient, even routine cases. That can be thorough, but it is slower.
ELI5: GGUF compression formats that keep more detail than 4-bit formats while still fitting local hardware.
Analogy: A larger field notebook with clearer handwriting: more room than the tiny version, but easier to read back accurately.
ELI5: The open-source C++ inference library that powers many local LLM tools. It can run GGUF models through CPU, Vulkan, ROCm/HIP, and other backends.
Analogy: The engine under the hood. Different apps may have different dashboards, but many are driving with this engine.
ELI5: A user-friendly tool for downloading and running local LLMs with commands like ollama run model-name. This repo uses llama-server directly, but Ollama is a common llama.cpp-based path.
Analogy: An appliance wrapper around the engine: easier controls, less manual wiring.
ELI5: A separate model family from Google. In this stack, Gemma 4 31B is a cross-family coding experiment — used for quality verification and second-opinion checks, not as a throughput workhorse. It is a dense model: every token reads all 31B parameters, so decode runs at ~8 tok/s on Strix Halo, much slower than the faster MoE lanes (~46–81 tok/s depending on model and MTP opt-in).
Analogy: A specialist reviewer from a different firm — slower to consult, but valuable for a second opinion on tricky plans. Not someone you route every job to.
ELI5: A large open-weight model (OpenAI). In this stack it is the AMERICAN-ONLY quality/speed lane — the US-origin pick for agencies that may require domestic-only model provenance (the general QUALITY champion, StepFun, is non-US in origin).
Analogy: The trusted domestic-supplier option you keep on hand for customers whose rules say "buy American," even when another vendor tops the leaderboard.
4. Agentic Workflows
ELI5: A chatbot given tools and a goal, running in a loop until it's finished.
Analogy: Giving an assistant a mouse and keyboard, saying: "Clean this file and let me know when done," instead of just asking for advice.
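The loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code this repo runs: `run_agent`, the action dictionaries, and the scripted stand-in for the model are all hypothetical.

```python
def run_agent(goal, model_step, tools, max_steps=10):
    """Ask the model for an action, run the tool, feed the result back, repeat."""
    history = [("goal", goal)]
    for _ in range(max_steps):
        action = model_step(history)
        if action["type"] == "finish":
            return action["answer"], history
        result = tools[action["tool"]](action["input"])
        history.append((action["tool"], result))
    raise RuntimeError("step budget exhausted before the goal was reached")

# Scripted stand-in for a real model: call one tool, then finish.
def scripted_model(history):
    if len(history) == 1:
        return {"type": "tool", "tool": "count_lines", "input": "a\nb\nc"}
    return {"type": "finish", "answer": f"The file has {history[-1][1]} lines."}

tools = {"count_lines": lambda text: len(text.splitlines())}
answer, history = run_agent("Count the lines", scripted_model, tools)
print(answer)  # The file has 3 lines.
```

The `max_steps` budget is a design choice worth keeping in any real loop: it stops an agent that never reaches a finish action.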
ELI5: The moment the AI decides to run an external program (e.g., a file reader or bash command) instead of guessing.
Analogy: A chef looking up a recipe in an index rather than trying to remember the measurements.
ELI5: A verification test to prove the agent is executing tools rather than hallucinating answers.
Analogy: Putting a secret word inside a locked box and asking the agent to open it and tell you the word. If the agent echoes the word back, it really used the key instead of guessing.
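A minimal sketch of such a test in Python, assuming the agent can be called through a single hook that receives a file path and returns its reply as text. `canary_check` and both stand-in agents are hypothetical.

```python
import secrets
import tempfile
from pathlib import Path

def canary_check(ask_agent):
    """Return True only if the reply contains a secret the agent could not guess."""
    secret = secrets.token_hex(8)  # 16 random hex characters, new on every run
    with tempfile.TemporaryDirectory() as tmp:
        box = Path(tmp) / "canary.txt"
        box.write_text(secret)
        reply = ask_agent(box)  # hypothetical hook: "read this file and report it"
    return secret in reply

# Stand-in agents for illustration only.
def tool_using_agent(path):
    return f"The word is {Path(path).read_text()}"

def guessing_agent(path):
    return "The word is swordfish"

tool_result = canary_check(tool_using_agent)
guess_result = canary_check(guessing_agent)
print(tool_result, guess_result)  # True False
```

Because the secret is random and exists only inside the file, a reply that contains it is strong evidence that the file-reading tool actually ran.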
ELI5: A secure, isolated virtual room inside your computer where the AI agent is allowed to write and run code without risk of altering or damaging your actual operating system.
Analogy: A safety hood in a laboratory. You run chemical reactions inside the hood to contain fumes and spills, protecting the rest of the building.
ELI5: A workflow where a main "Manager" AI agent takes a complex user goal, breaks it into smaller sub-tasks, delegates them to specialized "Worker" sub-agents, and compiles their final outputs.
Analogy: A chief operator coordinating plant maintenance, laboratory testing, and electrical crews rather than trying to do every job alone.
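The manager and worker split can be sketched as three plain functions. All names here (`orchestrate`, `plan`, `workers`, `compile_report`) are hypothetical, and the workers are simple stand-ins for what would be separate model calls.

```python
def orchestrate(goal, plan, workers, compile_report):
    """Manager: split the goal, delegate each sub-task, compile the outputs."""
    subtasks = plan(goal)
    outputs = [(role, workers[role](task)) for role, task in subtasks]
    return compile_report(goal, outputs)

def plan(goal):
    # A real manager agent would produce this list itself.
    return [
        ("maintenance", "list overdue work orders"),
        ("lab", "summarize this week's samples"),
    ]

workers = {
    "maintenance": lambda task: f"maintenance done: {task}",
    "lab": lambda task: f"lab done: {task}",
}

def compile_report(goal, outputs):
    lines = [f"Goal: {goal}"] + [f"- {role}: {text}" for role, text in outputs]
    return "\n".join(lines)

report = orchestrate("Weekly plant status", plan, workers, compile_report)
print(report)
```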
ELI5: A blind A/B comparison where two model answers are shuffled and judged prompt by prompt.
Analogy: A taste test with the labels covered. It helps when normal scores are too close to settle the choice.
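One way to keep the labels covered is to shuffle each pair and hold the answer key back until the judge has picked. The sketch below assumes a judge that returns 0 or 1 for the answer it prefers; `blind_pair`, `tally`, and the length-based judge are hypothetical stand-ins.

```python
import random

def blind_pair(answer_a, answer_b, rng):
    """Shuffle two answers; return the shown texts and the hidden label order."""
    pair = [("A", answer_a), ("B", answer_b)]
    rng.shuffle(pair)
    shown = [text for _, text in pair]
    key = [label for label, _ in pair]
    return shown, key

def tally(prompts, judge, seed=0):
    """Judge every prompt blind and count wins per model."""
    rng = random.Random(seed)
    wins = {"A": 0, "B": 0}
    for prompt, answer_a, answer_b in prompts:
        shown, key = blind_pair(answer_a, answer_b, rng)
        pick = judge(prompt, shown[0], shown[1])  # judge sees no labels
        wins[key[pick]] += 1
    return wins

# Stand-in judge that prefers the longer answer, for illustration only.
def longer_wins(prompt, first, second):
    return 0 if len(first) >= len(second) else 1

prompts = [
    ("Explain the term", "A long and detailed answer", "Short"),
    ("Define the system", "Another long answer here", "Brief"),
]
wins = tally(prompts, longer_wins)
print(wins)  # {'A': 2, 'B': 0}
```

Shuffling per prompt matters because a judge that favors whichever answer appears first would otherwise bias the result toward one model.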
ELI5: A reliability rule requiring three clean end-to-end passes instead of one lucky success.
Analogy: Starting equipment three times cleanly tells you more than seeing it start once.
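A sketch of the rule as a gate function, assuming the three passes must come back to back and that one failure ends the attempt. `reliability_gate` and the scripted runs are hypothetical.

```python
def reliability_gate(run_once, required=3):
    """Pass only if every one of `required` end-to-end runs succeeds."""
    for attempt in range(1, required + 1):
        if not run_once():
            return False, attempt  # report which run failed
    return True, required

# Scripted outcomes standing in for real end-to-end runs.
steady = iter([True, True, True])
flaky = iter([True, False, True])

steady_result = reliability_gate(lambda: next(steady))
flaky_result = reliability_gate(lambda: next(flaky))
print(steady_result, flaky_result)  # (True, 3) (False, 2)
```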
5. Utility & Domain Context
ELI5: The industrial computer network that reads sensor data (flow, pressure, tank levels) and operates mechanical hardware (pumps, valves, chemical feeds) in real-time.
Analogy: The dashboard gauges, gas pedal, and steering wheel of a massive commercial truck.
ELI5: The legal safety limit set by the EPA on the concentration of a chemical or contaminant allowed in public drinking water systems.
Analogy: The legal speed limit on a residential street. Going above it triggers a violation and requires immediate corrective action.
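A data-checking agent could flag results above a limit with a check like the one below. The limit, site names, and results are made-up values for illustration, not regulatory figures, and real compliance calculations may involve averaging or rounding rules that this sketch does not model.

```python
def flag_exceedances(samples, limit_mg_per_l):
    """Return the samples whose result is above the limit."""
    return [s for s in samples if s["result_mg_per_l"] > limit_mg_per_l]

HYPOTHETICAL_LIMIT_MG_PER_L = 0.010  # placeholder, not a real regulatory value

samples = [
    {"site": "Well 1", "result_mg_per_l": 0.004},
    {"site": "Well 2", "result_mg_per_l": 0.012},
    {"site": "Well 3", "result_mg_per_l": 0.010},  # at the limit, not above it
]

flagged = flag_exceedances(samples, HYPOTHETICAL_LIMIT_MG_PER_L)
print([s["site"] for s in flagged])  # ['Well 2']
```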
6. Agent Safety & Sandboxing
Each term below includes a plant analog: the same control, described in the language that treatment plant operators already use. For the full discussion, see Chapter 11 (Agent Safety).
ELI5: A walled-off environment that looks like a full computer to the program inside, but cannot reach your real system or files.
Analogy: A child's playpen — they can move freely inside it without reaching the stairs.
Plant analog: The SCADA training simulator. Same screens, same alarms — but a wrong setpoint doesn't dose finished water.
ELI5: Telling a sandbox: "give the program inside this one specific folder of my real computer, and nothing else."
Analogy: Handing the apprentice a folder of pages instead of a key to the whole filing cabinet.
Plant analog: The data binder you handed the apprentice — they see exactly the pages you put in it, nothing else.
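As one illustration, a container runtime such as Docker can expose a single host folder to the program inside. The sketch below only builds the command. It assumes Docker is the sandbox, which this guide does not state, and `sandbox_command` and the folder name are hypothetical.

```python
from pathlib import Path

def sandbox_command(host_dir, image, command):
    """Build a docker run command that exposes one folder, read-only, no network."""
    host = Path(host_dir).resolve()  # bind mounts need an absolute host path
    return [
        "docker", "run", "--rm",
        "--network", "none",         # no network access from inside
        "-v", f"{host}:/work:ro",    # the one shared folder, mounted read-only
        "-w", "/work",
        image, *command,
    ]

cmd = sandbox_command(
    "./plant_logs",
    "python:3.12-slim",
    ["python", "-c", "import os; print(os.listdir('.'))"],
)
print(" ".join(cmd[:8]))
```

The `:ro` suffix makes the shared folder read-only. Drop it only when the job genuinely needs to write results back to the host.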
ELI5: Give each person, program, or role only the access they need for the job — nothing more. Default to "no" unless required.
Analogy: A new hire gets keys to their office and the break room — not the server room — until the job requires it.
Plant analog: Not every operator gets the supervisor PIN. Not every contractor gets the master keyring. Each role's access is sized to the role.
ELI5: A locked, encrypted vault built into your OS where passwords and keys are stored. Accessible only with your login.
Analogy: A safe-deposit box at the bank — they hold it, only your signature opens it.
Plant analog: The locked key cabinet behind the supervisor's desk. Credentials live there, not on the workstation.
ELI5: An attack where untrusted text (a document, a PDF, an email) contains hidden instructions the agent reads and follows as if you'd typed them. It is among the most-studied attack vectors for agents today.
Analogy: A villain mails your secretary a letter that says "move all funds to account X — signed, the boss." If the secretary trusts the letter, the boss never needed to be involved.
Plant analog: A complaint letter containing "approve a 50% chlorine feed increase" — if your agent reads incoming mail, that sentence is a tool call to it.
ELI5: The "Are you sure?" the agent must ask before running a destructive command. Whitelist safe tools; require confirmation for the rest.
Analogy: The "Do you really want to send this email?" pop-up — slows things by half a second, catches one mistake a month.
Plant analog: The two-key chemical-feed setpoint override. The witness signature on a backwash change. Load-bearing wall, not "slowing the work down."
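A minimal sketch of a confirmation gate, assuming tools are identified by name and that a `confirm` callback asks the human. The allowlist contents and `gated_call` are hypothetical.

```python
SAFE_TOOLS = {"read_file", "list_dir"}  # hypothetical allowlist of read-only tools

def gated_call(tool_name, run_tool, confirm):
    """Run allowlisted tools directly; anything else needs an explicit yes."""
    if tool_name not in SAFE_TOOLS and not confirm(tool_name):
        return {"status": "blocked", "result": None}
    return {"status": "ran", "result": run_tool()}

# Stand-in for a human who declines every request.
def deny_all(tool_name):
    return False

read = gated_call("read_file", lambda: "file contents", deny_all)
delete = gated_call("delete_file", lambda: "deleted", deny_all)
print(read["status"], delete["status"])  # ran blocked
```

In interactive use, `confirm` could wrap `input()` and return True only on an explicit "yes", so the default answer is always no.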
ELI5: The command that immediately stops the agent, and the saved record of every action it took. Know where both are before you start a run.
Analogy: The off-switch and the cash register tape. Stops the action; tells you what happened.
Plant analog: The E-stop on rotating equipment, and the alarm history. You don't disable either "to clean up the screen."
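Both pieces can be sketched together: a stop flag that is checked before every action, and an append-only log with one JSON line per action. The `AuditedRun` class is hypothetical. In practice the kill switch is often process-level (for example, stopping the agent process), which this in-process sketch does not replace.

```python
import json
import tempfile
import threading
import time
from pathlib import Path

class AuditedRun:
    def __init__(self, log_path):
        self.log_path = Path(log_path)
        self.stop = threading.Event()  # the kill switch

    def step(self, action, fn):
        """Log the action, then run it, unless the run has been stopped."""
        if self.stop.is_set():
            self._record(action, "refused: stopped")
            raise RuntimeError("run stopped by operator")
        self._record(action, "started")
        return fn()

    def _record(self, action, status):
        entry = {"time": time.time(), "action": action, "status": status}
        with self.log_path.open("a") as f:
            f.write(json.dumps(entry) + "\n")

with tempfile.TemporaryDirectory() as tmp:
    run = AuditedRun(Path(tmp) / "audit.jsonl")
    run.step("read_file", lambda: "ok")
    run.stop.set()  # operator hits the switch
    try:
        run.step("write_file", lambda: "never runs")
    except RuntimeError:
        pass
    entries = [json.loads(line) for line in run.log_path.read_text().splitlines()]

print([(e["action"], e["status"]) for e in entries])
```

The log records the refused action as well as the completed one, so the record shows what the agent tried to do after the stop.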
ELI5: "Supervisor mode" on a Linux/Mac computer. Anything run with sudo has full system authority to change settings the regular account cannot.
Analogy: The supervisor's master keycard that bypasses normal authorization on every door.
Plant analog: The supervisor PIN that bypasses alarm acknowledgements. Agents almost never need it.