tesla_agent

Chapter 05: Step-by-Step Setup

This chapter walks you through setting up your AMD Strix Halo host to run local agentic AI. Because unified memory allocations and GPU version overrides can easily break, every step includes failure-recovery guidelines.

Please open a terminal and follow these steps in order.


Step 1: Verify Host Compatibility

First, check if your system meets the hardware requirements and has access to the graphics driver.

# Check if your user is part of the required graphics groups
groups

# Run the project host validation check
bash scripts/setup/check_host.sh

What just happened?

The script verifies that your kernel detects the AMD APU as the gfx1151 architecture and checks whether your user account has permission to read the GPU control queues (which requires membership in the render and video groups).

What success looks like:

The output should report PASS for the visibility and kernel checks:

[PASS] ROCm GPU Architecture: gfx1151 visible to ROCm (Radeon APU)
Check complete: 5 passing, 0 failing, 0 warnings.

What to do if it fails:
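
A likely cause of a failed permission check is a user account that is missing from the render or video group. The sketch below shows one way to spot that. The helper name missing_gpu_groups is hypothetical and is not part of the project scripts.

```shell
# Hypothetical helper (not part of scripts/setup): print whichever of the
# render/video groups are absent from a space-separated group list.
missing_gpu_groups() {
  _missing=""
  for _g in render video; do
    case " $1 " in
      *" $_g "*) ;;                                  # group is present
      *) _missing="${_missing:+$_missing }$_g" ;;    # remember the gap
    esac
  done
  printf '%s\n' "$_missing"
}

# Usage: pass the current user's groups; empty output means both are present.
#   missing_gpu_groups "$(id -nG)"
# If a group is reported, add it, then log out and back in:
#   sudo usermod -aG render,video "$USER"
```

After changing group membership, re-run check_host.sh in a fresh login session, because existing shells keep the old group list.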


Step 2: Override GPU Memory Allocation (GTT Size)

By default, Linux limits graphics allocations to between 25% and 50% of your total RAM. To run large models, we must raise the Graphics Translation Table (GTT) limit so the GPU can use up to 75% of your RAM.

# Apply GTT configurations (requires sudo/root)
# This will detect your RAM and configure GTT size (e.g. 96 GB on 128 GB setups)
sudo bash scripts/setup/apply_gtt.sh

# A reboot is mandatory to apply these kernel overrides
sudo reboot

What just happened?

The script creates /etc/modprobe.d/amdgpu_llm_optimized.conf and writes kernel module options to it. It sets no_system_mem_limit=1 (critical: prevents the GPU driver from silently spilling active computation layers back to the slow CPU) and configures the TTM page limits. It then rebuilds your system's initramfs boot image so the options are applied at boot.
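
As a rough illustration of the sizing arithmetic, the sketch below computes a 75% allocation in MiB and the equivalent number of memory pages. The function names are hypothetical, and both the 4 KiB page size and the idea that apply_gtt.sh uses exactly this calculation are assumptions, not taken from the script.

```shell
# 75% of total RAM (given in GiB), expressed in MiB.
gtt_mib_for_ram_gib() {
  echo $(( $1 * 1024 * 3 / 4 ))
}

# Number of 4 KiB pages covering a size given in MiB (page size is assumed).
pages_for_mib() {
  echo $(( $1 * 1024 / 4 ))
}

# Usage:
#   gtt_mib_for_ram_gib 128    # 98304
#   pages_for_mib 98304        # 25165824
```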

What success looks like:

After rebooting, check that the GTT module parameter reports the custom allocation (in MiB):

cat /sys/module/amdgpu/parameters/gttsize
# For a 128 GB system, this must print: 98304

What to do if it fails:
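
If the value is wrong after a reboot, it helps to compare the live parameter against the expected number before changing anything. The sketch below is one way to do that. check_gtt is a hypothetical helper, and its optional second argument exists only so the function can be pointed at a different file.

```shell
# Compare the live gttsize value with the expected one (in MiB).
# $1 = expected value, $2 = parameter file (defaults to the amdgpu sysfs path).
check_gtt() {
  _file="${2:-/sys/module/amdgpu/parameters/gttsize}"
  if [ ! -r "$_file" ]; then
    echo "cannot read $_file (is the amdgpu module loaded?)"
    return 2
  fi
  _actual="$(cat "$_file")"
  if [ "$_actual" = "$1" ]; then
    echo "OK: gttsize=$_actual"
  else
    echo "MISMATCH: gttsize=$_actual, expected $1"
    return 1
  fi
}

# Usage on a 128 GB system:
#   check_gtt 98304
# On a mismatch, confirm the file written in this step exists, then re-run
# apply_gtt.sh and reboot again:
#   cat /etc/modprobe.d/amdgpu_llm_optimized.conf
```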


Step 3: Configure Driver Environment Variables

ROCm does not support the Strix Halo gfx1151 chip automatically. You must load override environment variables in your active terminal.

# Source the variables (run this in every new terminal session)
source scripts/setup/set_hsa_env.sh

What just happened?

This exports parameters to your current shell. HSA_OVERRIDE_GFX_VERSION=11.5.1 makes the ROCm runtime treat the APU as a gfx1151 target (version 11.5.1) regardless of what it detects on its own. HSA_ENABLE_SDMA=0 disables the SDMA (system DMA) copy engines, preventing kernel lockups during large memory transfers.

What success looks like:

GPU Environment Configured:
  HSA_OVERRIDE_GFX_VERSION = 11.5.1
  HSA_ENABLE_SDMA          = 0

What to do if it fails:
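
A frequent cause of missing variables is running the script with bash instead of source, which exports them only in a short-lived child shell. The sketch below checks the two variables named in this step. check_hsa_env is a hypothetical helper, not part of the project scripts.

```shell
# Report whether the two overrides from this step are set in the current shell.
check_hsa_env() {
  _rc=0
  [ "${HSA_OVERRIDE_GFX_VERSION:-}" = "11.5.1" ] || {
    echo "HSA_OVERRIDE_GFX_VERSION is '${HSA_OVERRIDE_GFX_VERSION:-unset}', expected 11.5.1"
    _rc=1
  }
  [ "${HSA_ENABLE_SDMA:-}" = "0" ] || {
    echo "HSA_ENABLE_SDMA is '${HSA_ENABLE_SDMA:-unset}', expected 0"
    _rc=1
  }
  return "$_rc"
}

# Usage:
#   check_hsa_env && echo "environment looks right"
# If it reports a problem, load the variables with source (not bash):
#   source scripts/setup/set_hsa_env.sh
```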


Step 4: Build the Vulkan llama-server

We serve on the open-source Vulkan (RADV) backend — the fastest lane on Strix Halo and the default for this stack. There is no prebuilt binary for this hardware, so you compile llama-server once from source at the pinned stable tag (b9247). It takes a few minutes and you only do it once. (A ROCm path is kept as an optional fallback; see Chapter 08 — Speed and Tuning.)

4a. Install the build tools (one time). These commands are for Ubuntu/Debian — they install the compiler, CMake, and the Vulkan/RADV driver and headers.

sudo apt update
sudo apt install -y git cmake build-essential \
  libvulkan-dev glslc vulkan-tools mesa-vulkan-drivers

4b. Clone and build. This clones into ~/src/llama.cpp and builds the server target using all your CPU cores.

# Clone llama.cpp into a predictable location and pin the stable tag
mkdir -p ~/src && cd ~/src
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b9247

# Build only the server target, with Vulkan (RADV) enabled
cmake -B build-vulkan -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release --target llama-server -j"$(nproc)"

4c. Point the config at your new binary. Run these from the tesla_agent repo folder. The first line creates your config; the second writes the binary path into it automatically.

cp scripts/config.env.example scripts/config.env
sed -i "s|^TESLA_VULKAN_SERVER=.*|TESLA_VULKAN_SERVER=\"$HOME/src/llama.cpp/build-vulkan/bin/llama-server\"|" scripts/config.env

What just happened?

CMake compiled a Vulkan-enabled llama-server at ~/src/llama.cpp/build-vulkan/bin/llama-server. The sed line set TESLA_VULKAN_SERVER in scripts/config.env to that exact path, so the serve script in Step 6 knows where to find it. Prefer to edit by hand? Open scripts/config.env and set TESLA_VULKAN_SERVER to that path yourself.

What success looks like:

ls -l ~/src/llama.cpp/build-vulkan/bin/llama-server
# -rwxr-xr-x ... llama-server   (present and executable)

grep TESLA_VULKAN_SERVER scripts/config.env
# TESLA_VULKAN_SERVER="/home/you/src/llama.cpp/build-vulkan/bin/llama-server"

What to do if it fails:
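
Build failures often trace back to a tool from step 4a that did not install. A minimal sketch for spotting that follows. missing_tools is a hypothetical helper, and the list of command names in the usage line is an assumption about what the build needs on PATH.

```shell
# Print every command name from the arguments that is not found on PATH.
missing_tools() {
  for _tool in "$@"; do
    command -v "$_tool" >/dev/null 2>&1 || echo "$_tool"
  done
}

# Usage:
#   missing_tools git cmake g++ glslc vulkaninfo
# Empty output means all were found. Otherwise re-run the apt install line
# from 4a, delete the build-vulkan directory, and run both cmake commands again.
```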


Step 5: Download the Model

We download the recommended Qwen 3.6 35B MoE model quantized in the space-efficient MXFP4 format.

# Create models directory
mkdir -p ~/models/qwen3.6-35b-a3b

# Download the model from the verified Hugging Face repository
huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \
  Qwen3.6-35B-A3B-MXFP4_MOE.gguf \
  --local-dir ~/models/qwen3.6-35b-a3b

What just happened?

The Hugging Face client downloads the 21.7 GB model file directly to your drive.

What success looks like:

ls -lh ~/models/qwen3.6-35b-a3b/Qwen3.6-35B-A3B-MXFP4_MOE.gguf
# Outputs a file of size: ~22 GB

What to do if it fails:
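
An interrupted download usually leaves a file much smaller than expected. The sketch below checks the size against a minimum. file_at_least_bytes is a hypothetical helper, and the 20 GB threshold in the usage line is an assumption chosen to sit just below the size quoted in this step.

```shell
# Succeed only if the file exists and is at least the given number of bytes.
# $1 = file path, $2 = minimum size in bytes.
file_at_least_bytes() {
  [ -f "$1" ] || { echo "missing: $1"; return 2; }
  _size=$(( $(wc -c < "$1") ))
  [ "$_size" -ge "$2" ] || { echo "too small: $_size bytes"; return 1; }
}

# Usage:
#   file_at_least_bytes ~/models/qwen3.6-35b-a3b/Qwen3.6-35B-A3B-MXFP4_MOE.gguf 20000000000
# If it fails, check free disk space and re-run the download command:
#   df -h ~/models
```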


Step 6: Start the Model Server

Now, launch the inference API server on port 8095 using the Vulkan backend.

# Make sure TESLA_VULKAN_SERVER is set in scripts/config.env, then launch in foreground to monitor logs
bash scripts/serving/serve_vulkan.sh

What just happened?

The launcher script starts your Vulkan-built llama-server on port 8095, configures a 32k context size, turns on Flash Attention, and loads the model into graphics memory. It hides the GPU from ROCm (HIP_VISIBLE_DEVICES=-1) and selects the RADV Vulkan driver.

What success looks like:

Keep the terminal open and check the output log:

llama_new_context_with_model: n_ctx = 32768, total VRAM = 21.7 GB
llama_server_listening: http://127.0.0.1:8095

What to do if it fails:
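
Two things worth ruling out are another process already holding port 8095 and a server that starts but never finishes loading the model. The sketch below filters a socket listing for a listening port. port_listening is a hypothetical helper that reads the listing from standard input, and it assumes the address:port layout printed by ss -ltn.

```shell
# Succeed if the socket listing on stdin shows an address ending in port $1.
port_listening() {
  grep -Eq ":$1[[:space:]]"
}

# Usage, before launching the server:
#   ss -ltn | port_listening 8095 && echo "port 8095 is already in use"
# Once the server is running, its health endpoint should answer:
#   curl -s http://127.0.0.1:8095/health
```

If the port is free and the server still exits, confirm that the path stored in TESLA_VULKAN_SERVER points at an executable file and that the model file from Step 5 is complete.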


Step 7: Create the Agent Profile

In a separate terminal window, initialize the Hermes configuration so the agent can communicate with the server.

# Generate the profile config (run this from the repository root directory)
bash scripts/serving/create_hermes_profile.sh

What just happened?

The script creates a configuration file at ~/.hermes/profiles/qwen36_mxfp4/config.yaml that specifies the local server address and model names. It also creates a command launcher at ~/.local/bin/qwen36_mxfp4.

What success looks like:

created Hermes profile: ~/.hermes/profiles/qwen36_mxfp4
created launcher:       ~/.local/bin/qwen36_mxfp4

What to do if it fails:
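
If the profile is created but the qwen36_mxfp4 command is not found, ~/.local/bin may not be on your PATH. A minimal sketch of that check follows. path_contains is a hypothetical helper, not part of the project scripts.

```shell
# Succeed if directory $1 appears as an entry in the colon-separated list $2.
path_contains() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   path_contains "$HOME/.local/bin" "$PATH" || echo "add ~/.local/bin to PATH"
# A typical fix for the current session:
#   export PATH="$HOME/.local/bin:$PATH"
```

If the script itself reports an error, make sure you ran it from the repository root and that the server from Step 6 is still running in the other terminal.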