This chapter walks you through setting up your AMD Strix Halo host to run local agentic AI. Because unified memory allocations and GPU version overrides can easily break, every step includes failure-recovery guidelines.
Please open a terminal and follow these steps in order.
First, check if your system meets the hardware requirements and has access to the graphics driver.
# Check if your user is part of the required graphics groups
groups
# Run the project host validation check
bash scripts/setup/check_host.sh
The script verifies that your kernel detects the AMD APU as the gfx1151 architecture and checks whether your user account has permission to read the GPU control queues (which requires membership in the render and video groups).
The output should report PASS on Visibility and Kernel checks:
[PASS] ROCm GPU Architecture: gfx1151 visible to ROCm (Radeon APU)
Check complete: 5 passing, 0 failing, 0 warnings.
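If you want to see the group requirement on its own, the membership part of the check can be sketched in a few lines of bash. This is only an illustration of the idea, not the contents of check_host.sh; the function name missing_gpu_groups is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the project scripts).
# Prints each required GPU group that is absent from a space-separated group list.
missing_gpu_groups() {
  local have=" $1 " group
  for group in render video; do
    case "$have" in
      *" $group "*) ;;                # group present, nothing to report
      *) printf '%s\n' "$group" ;;    # group missing
    esac
  done
}

# Check the current user's groups as reported by id.
missing="$(missing_gpu_groups "$(id -nG)")"
if [ -n "$missing" ]; then
  echo "Not a member of:" $missing
else
  echo "render and video memberships look fine"
fi
```

Passing the group list as an argument keeps the function easy to try with sample input, for example missing_gpu_groups "alice sudo render".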
rocminfo not found or gfx1151 is not listed
Ensure the open-source AMD driver is installed. Run dmesg | grep amdgpu to verify that the graphics driver loaded on boot.
If the groups command does not list render and video, add your user to both groups:
sudo usermod -aG render,video $USER
Then log out of your Linux session and log back in to apply the group memberships.
By default, Linux limits graphics allocations to 25%-50% of your total RAM. To run large models, we must raise the Graphics Translation Table (GTT) limit so the GPU can use up to 75% of your RAM.
# Apply GTT configurations (requires sudo/root)
# This will detect your RAM and configure GTT size (e.g. 96 GB on 128 GB setups)
sudo bash scripts/setup/apply_gtt.sh
# A reboot is mandatory to apply these kernel overrides
sudo reboot
The script creates /etc/modprobe.d/amdgpu_llm_optimized.conf and writes kernel module options to it. It sets no_system_mem_limit=1 (critical: this prevents the GPU driver from silently spilling active computation layers back to the slow CPU) and configures the TTM page limits. It then rebuilds your system’s initramfs boot image so the options are applied at boot.
After rebooting, check that the GTT module parameter reads the custom allocation (in MB):
cat /sys/module/amdgpu/parameters/gttsize
# For a 128 GB system, this must print: 98304
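The 98304 figure follows from the 75% rule: 128 GB is 131072 MB, and 75% of that is 98304 MB. The sketch below reproduces that arithmetic so you can work out the expected value for other RAM sizes. It is not the logic of apply_gtt.sh, which may detect RAM differently; the function names are hypothetical, and the page conversion assumes 4 KiB pages.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the project scripts).

# 75% of the nominal RAM size, expressed in MB (the unit gttsize uses).
gtt_mb_for_ram_gb() {
  local ram_gb="$1"
  echo $(( ram_gb * 1024 * 75 / 100 ))
}

# The same amount expressed as a page count, ASSUMING 4 KiB pages.
pages_for_mb() {
  local mb="$1"
  echo $(( mb * 1024 / 4 ))
}

gtt_mb_for_ram_gb 128    # prints 98304
gtt_mb_for_ram_gb 64     # prints 49152
pages_for_mb 98304       # prints 25165824
```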
apply_gtt.sh: Permission denied
You must execute this script with sudo or as root, as it modifies system files under /etc/modprobe.d/.
If the setting does not take effect after rebooting, you may be using a boot setup (e.g., systemd-boot or rEFInd) that ignores modprobe configuration files. In that case, manually add amdgpu.gttsize=98304 to your kernel command line in your bootloader config (e.g., /boot/loader/entries/).
ROCm does not support the Strix Halo gfx1151 chip automatically. You must load override environment variables in your active terminal.
# Source the variables (run this in every new terminal session)
source scripts/setup/set_hsa_env.sh
This exports parameters to your current shell. HSA_OVERRIDE_GFX_VERSION=11.5.1 fools the driver into treating the APU as a compatible discrete GPU. HSA_ENABLE_SDMA=0 disables system DMA, preventing kernel lockups during large memory routing.
GPU Environment Configured:
HSA_OVERRIDE_GFX_VERSION = 11.5.1
HSA_ENABLE_SDMA = 0
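Because the variables live only in the shell that sourced them, a quick guard before launching anything can save a confusing failure later. The following is a minimal sketch under that assumption; check_hsa_env is a hypothetical name and is not provided by the project.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the project scripts).
# Returns 0 only if both overrides hold the values this chapter expects.
check_hsa_env() {
  local status=0
  if [ "${HSA_OVERRIDE_GFX_VERSION:-}" != "11.5.1" ]; then
    echo "HSA_OVERRIDE_GFX_VERSION is not set to 11.5.1" >&2
    status=1
  fi
  if [ "${HSA_ENABLE_SDMA:-}" != "0" ]; then
    echo "HSA_ENABLE_SDMA is not set to 0" >&2
    status=1
  fi
  return "$status"
}

if check_hsa_env 2>/dev/null; then
  echo "HSA overrides are active in this shell"
else
  echo "Run: source scripts/setup/set_hsa_env.sh"
fi
```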
bash: scripts/setup/set_hsa_env.sh: No such file or directory
Make sure you are running the command from the root of the tesla_agent directory.
The variables apply only to the shell that sourced them. If you open a new terminal, run source scripts/setup/set_hsa_env.sh again before executing model commands.
We serve on the open-source Vulkan (RADV) backend, which is the fastest lane on Strix Halo and the default for this stack. There is no prebuilt binary for this hardware, so you compile llama-server from source at the pinned stable tag (b9247). The build takes a few minutes and you only do it once. (A ROCm path is kept as an optional fallback; see Chapter 08: Speed and Tuning.)
4a. Install the build tools (one time). These commands are for Ubuntu/Debian — they install the compiler, CMake, and the Vulkan/RADV driver and headers.
sudo apt update
sudo apt install -y git cmake build-essential \
libvulkan-dev glslc vulkan-tools mesa-vulkan-drivers
4b. Clone and build. This clones into ~/src/llama.cpp and builds the server target using all your CPU cores.
# Clone llama.cpp into a predictable location and pin the stable tag
mkdir -p ~/src && cd ~/src
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b9247
# Build only the server target, with Vulkan (RADV) enabled
cmake -B build-vulkan -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release --target llama-server -j"$(nproc)"
4c. Point the config at your new binary. Run these from the tesla_agent repo folder. The first line creates your config; the second writes the binary path into it automatically.
cp scripts/config.env.example scripts/config.env
sed -i "s|^TESLA_VULKAN_SERVER=.*|TESLA_VULKAN_SERVER=\"$HOME/src/llama.cpp/build-vulkan/bin/llama-server\"|" scripts/config.env
CMake compiled a Vulkan-enabled llama-server at ~/src/llama.cpp/build-vulkan/bin/llama-server. The sed line set TESLA_VULKAN_SERVER in scripts/config.env to that exact path, so the serve script in Step 6 knows where to find it. Prefer to edit by hand? Open scripts/config.env and set TESLA_VULKAN_SERVER to that path yourself.
ls -l ~/src/llama.cpp/build-vulkan/bin/llama-server
# -rwxr-xr-x ... llama-server (present and executable)
grep TESLA_VULKAN_SERVER scripts/config.env
# TESLA_VULKAN_SERVER="/home/you/src/llama.cpp/build-vulkan/bin/llama-server"
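The two checks above can also be combined: read the path back out of the config file and test that it points at an executable. The sketch below assumes config.env uses plain KEY="value" lines, as the sed command in 4c implies; config_value is a hypothetical helper, not part of the project.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the project scripts).
# Prints the value of KEY from a KEY="value" style file, without the quotes.
config_value() {
  local key="$1" file="$2" line
  line="$(grep -E "^${key}=" "$file" 2>/dev/null | tail -n 1)"
  [ -n "$line" ] || return 1     # key not found (or file unreadable)
  line="${line#*=}"              # drop everything up to the first '='
  line="${line%\"}"              # drop a trailing double quote, if any
  line="${line#\"}"              # drop a leading double quote, if any
  printf '%s\n' "$line"
}

# Run from the tesla_agent repo folder.
if [ -f scripts/config.env ]; then
  server_bin="$(config_value TESLA_VULKAN_SERVER scripts/config.env)" || server_bin=""
  if [ -n "$server_bin" ] && [ -x "$server_bin" ]; then
    echo "OK: $server_bin is executable"
  else
    echo "TESLA_VULKAN_SERVER is missing or does not point to an executable"
  fi
fi
```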
glslc too old
The Vulkan build needs a current glslc shader compiler. The distro glslc (2023.x) can be too old; install a newer shaderc (or build it from source) and re-run the two cmake commands.
cmake: command not found or missing Vulkan headers
Re-run step 4a; the libvulkan-dev and mesa-vulkan-drivers packages must be installed for the Vulkan build to find its headers and driver.
We download the recommended Qwen 3.6 35B MoE model quantized in the space-efficient MXFP4 format.
# Create models directory
mkdir -p ~/models/qwen3.6-35b-a3b
# Download the model from the verified Hugging Face repository
huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \
Qwen3.6-35B-A3B-MXFP4_MOE.gguf \
--local-dir ~/models/qwen3.6-35b-a3b
The Hugging Face client downloads the 21.7 GB model file directly to your drive.
ls -lh ~/models/qwen3.6-35b-a3b/Qwen3.6-35B-A3B-MXFP4_MOE.gguf
# Outputs a file of size: ~22 GB
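Since the file is about 22 GB, it can be worth confirming free space before starting or retrying the download. The helper below is a small sketch that relies on the portable df -Pk output format; has_free_gb is a hypothetical name, and the 25 GB threshold simply leaves some headroom over the file size.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the project scripts).
# Succeeds if the filesystem holding DIR has at least NEED_GB gigabytes free.
has_free_gb() {
  local dir="$1" need_gb="$2" avail_kb
  # df -Pk prints one header line, then one data line; column 4 is available KB.
  avail_kb="$(df -Pk "$dir" 2>/dev/null | awk 'NR==2 {print $4}')"
  [ -n "$avail_kb" ] || return 1
  [ "$avail_kb" -ge $(( need_gb * 1024 * 1024 )) ]
}

if has_free_gb "$HOME" 25; then
  echo "Enough free space for the model download"
else
  echo "Less than 25 GB free under $HOME"
fi
```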
Disk quota exceeded or No space left on device
This model file requires ~22 GB. Ensure your drive has at least 25 GB of free space.
If the download is interrupted, re-run the same huggingface-cli download command. It will scan the directory and resume downloading the missing chunks.
Now, launch the inference API server on port 8095 using the Vulkan backend.
# Make sure TESLA_VULKAN_SERVER is set in scripts/config.env, then launch in foreground to monitor logs
bash scripts/serving/serve_vulkan.sh
The launcher script starts your Vulkan-built llama-server on port 8095, configures a 32k context size, turns on Flash Attention, and loads the model into graphics memory. It hides the GPU from ROCm (HIP_VISIBLE_DEVICES=-1) and selects the RADV Vulkan driver.
Keep the terminal open and check the output log:
llama_new_context_with_model: n_ctx = 32768, total VRAM = 21.7 GB
llama_server_listening: http://127.0.0.1:8095
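Instead of watching the log, you can poll the server from a second terminal until it answers. This sketch assumes the pinned llama-server build exposes the /health endpoint that upstream llama.cpp provides and that curl is installed; wait_for_health is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the project scripts).
# Polls URL once per second until it returns HTTP success or TIMEOUT seconds pass.
wait_for_health() {
  local url="$1" timeout="${2:-60}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if curl -fsS --max-time 2 "$url" >/dev/null 2>&1; then
      return 0                      # server answered successfully
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1                          # gave up after TIMEOUT attempts
}

# Example (commented out so that sourcing this file does not block):
# wait_for_health "http://127.0.0.1:8095/health" 120 && echo "server is ready"
```

Loading a 21.7 GB model can take a while, so a generous timeout is a reasonable choice here.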
amdgpu: allocation failed or hipErrorOutOfMemory
You either forgot to reboot after setting the GTT size, or you did not source set_hsa_env.sh in the current terminal window. Close the server, run source scripts/setup/set_hsa_env.sh, and try again.
If port 8095 is already in use, open scripts/config.env and change TESLA_PORT to a different number (e.g. 8096), then run the script again.
In a separate terminal window, initialize the Hermes configuration so the agent can communicate with the server.
# Generate the profile config (run this from the root of the tesla_agent directory)
bash scripts/serving/create_hermes_profile.sh
The script creates a configuration file in ~/.hermes/profiles/qwen36_mxfp4/config.yaml specifying our local server address and model names. It also creates a command launcher in ~/.local/bin/qwen36_mxfp4.
created Hermes profile: ~/.hermes/profiles/qwen36_mxfp4
created launcher: ~/.local/bin/qwen36_mxfp4
qwen36_mxfp4: command not found
Your user binary directory ~/.local/bin is not in your system shell search path. Run:
export PATH="$HOME/.local/bin:$PATH"
Then append the same line to the end of your ~/.bashrc file so the change persists in new terminals.
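To confirm whether the directory is already on your search path before editing ~/.bashrc, a check along these lines is enough. It is a minimal sketch; path_contains is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the project scripts).
# Succeeds if DIR is an exact entry in a colon-separated path list (default: $PATH).
path_contains() {
  local dir="$1" path_list="${2:-$PATH}"
  case ":$path_list:" in
    *":$dir:"*) return 0 ;;
    *) return 1 ;;
  esac
}

if path_contains "$HOME/.local/bin"; then
  echo "$HOME/.local/bin is already on PATH"
else
  echo "$HOME/.local/bin is not on PATH yet"
fi
```

Wrapping the list in colons before matching avoids false positives from partial matches such as /home/you/.local/binaries.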