There’s no way to really understand something unless you dive into first principles. This is especially true for AI coding agents. What is the editor doing under the hood? I try to peel the onion a bit.
Desire: Use Ollama
I could just keep doing what I’ve been doing and use VSCode with GitHub Copilot. My company pays for a license, and I’ve been digging into it; in fact, that’s what I used to help build this new tool.
I want to use Ollama to host a local model. I tried this with both Zed and VSCode and the results were TERRIBLE. I suspect I was doing something wrong. But it was basically not viable.
But… that may be by design. The IDEs these days are loss-leaders meant to pull you into their revenue model: consumption of models tied to their platform.
Now, I’m not actually accusing them of breaking anything, but I can see why they wouldn’t want to support something that neither generates revenue nor ties you more closely to their ecosystem.
Installing Ollama
Follow the instructions here to install Ollama.
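For the record, on Linux it boils down to a one-line install script, and on macOS you can grab the app from ollama.com or install it with Homebrew:
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# macOS (Homebrew)
brew install ollama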

My Program
To understand what I did, you probably want to see my code here. More on the code later.
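Under the hood, anything driving a local model this way is talking to the Ollama server over HTTP on 127.0.0.1:11434. A minimal sketch of a single completion request looks like this (the model tag and prompt here are just placeholders, not what my program sends):
# one non-streaming completion against the local Ollama API
curl -s http://127.0.0.1:11434/api/generate \
  -d '{"model": "qwen2.5-coder:7b", "prompt": "Explain what a build system does.", "stream": false}'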
Testing: Ollama-Hosted Models, Tokens per Second
I have two computing platforms that I want to test: my shiny new MacBook Pro with an M4 Pro chip and 48GB of unified memory, and a new Beelink SER9 Pro AI Mini PC with an AMD Ryzen AI 9 HX 370 and 64GB of memory… which is unlikely to be fully unified. In fact, half of the exercise is to learn how to use Ollama to host a local model on that kind of computer. Will it work? How do you set it up?
Here’s the model I chose to test with, so that the comparison is apples to apples:
ollama show qwen2.5-coder:7b
  Model
    architecture        qwen2
    parameters          7.6B
    context length      32768
    embedding length    3584
    quantization        Q4_K_M

  Capabilities
    completion
    tools
    insert

  System
    You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

  License
    Apache License
    Version 2.0, January 2004
    ...
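Getting an identical build onto both machines is just a matter of pulling the same tag (the quantization comes with the tag, so it’s Q4_K_M on both):
ollama pull qwen2.5-coder:7b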
ChatGPT said to expect the following from my Mac:
| Mode | Quantization | Expected TPS (tokens/sec) |
| -------------------- | ------------ | ------------------------- |
| **Prompt** | `q4_K_M` | 35–50 |
| **Generate** | `q4_K_M` | 55–85 |
| **Total End-to-End** |              | ~60–75 average            |
MacBook Pro:
ollama run qwen2.5-coder:7b --verbose
<fed it the contents of the prompt.txt file in my code repository>
<inference results removed for brevity>
total duration: 14.808806709s
load duration: 28.828542ms
prompt eval count: 51 token(s)
prompt eval duration: 300.190041ms
prompt eval rate: 169.89 tokens/s
eval count: 620 token(s)
eval duration: 14.477839s
eval rate: 42.82 tokens/s
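As a quick sanity check, the reported eval rate is just the eval count divided by the eval duration:
echo "scale=2; 620 / 14.477839" | bc
# 42.82  -- matches the reported eval rate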
Almost 43 TPS seems on the low end, but it’s a non-trivial prompt. I even made sure that I had disabled Spotlight so that background indexing would not affect the eval rates:
sudo mdutil -a -i off
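And to turn indexing back on once the benchmarking is done:
sudo mdutil -a -i on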
How about the Ryzen AI 9 HX 370?
ollama run qwen2.5-coder:7b --verbose
<fed it the contents of the prompt.txt file in my code repository>
<inference results removed for brevity>
total duration: 34.769163118s
load duration: 12.838666ms
prompt eval count: 71 token(s)
prompt eval duration: 543.123423ms
prompt eval rate: 130.73 tokens/s
eval count: 584 token(s)
eval duration: 34.204503309s
eval rate: 17.07 tokens/s
Ugh. 17 TPS. That’s terrible.
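A number that low is worth cross-checking against whether the model is even on the GPU; Ollama falls back to CPU silently, and `ollama ps` shows the split while a model is loaded:
ollama ps
# the PROCESSOR column reports something like "100% GPU" or "100% CPU"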
Damn - turns out ROCm does not support this GPU
ROCm officially targets:
- Discrete GPUs: e.g. Radeon RX 5000/6000/7000 series (RDNA)
- Some integrated RDNA 2 (gfx1036/1037)
- Professional GPUs: Instinct MI200/MI300

As of ROCm 6.4, the gfx1150 (RDNA 3 iGPU like the 780M/890M) is still not on the supported list: 🔗 https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html
Well, that stinks. End of this effort.

Seriously. No point even going any further. What a waste of time.
The Road to Hell (or, what I did before I found it wasn’t supported)
Let’s re-install Ollama; perhaps I did it incorrectly when I set up the system.
curl -fsSL https://ollama.com/install.sh | sh
>>> Cleaning up old version at /usr/local/lib/ollama
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
>>> Downloading Linux ROCm amd64 bundle
######################################################################## 100.0%
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
>>> AMD GPU ready.
Retested and it’s still 17 TPS. Hmmm.
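Ollama runs as a systemd service on Linux, so its startup log records which GPUs it detected (the exact wording varies by version):
journalctl -u ollama --no-pager | grep -iE 'gpu|rocm|amdgpu'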
Next step: make sure my GPU is set up and enabled in Ollama. After cloning Ollama, I first had to find out which GPU I had:
rocminfo | grep -A4 'Name:'
Name: AMD Ryzen AI 9 HX 370 w/ Radeon 890M
Uuid: CPU-XX
Marketing Name: AMD Ryzen AI 9 HX 370 w/ Radeon 890M
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
--
Name: gfx1150
Uuid: GPU-XX
Marketing Name: AMD Radeon Graphics
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
--
Name: amdgcn-amd-amdhsa--gfx1150
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
--
Name: amdgcn-amd-amdhsa--gfx11-generic
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
So it’s the gfx1150.
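If you just want the architecture strings without eyeballing the full rocminfo dump, a one-liner pulls them out:
rocminfo | grep -oE 'gfx[0-9a-f]+' | sort -u
# gfx1150, plus the generic gfx11 target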
I tried building llama.cpp (the inference engine Ollama builds on) with HIP support for it:
#!/bin/bash
# Exit on error
set -e

# Set your target GPU architecture (set this to whatever YOU have!)
export AMDGPU_TARGET="gfx1150"

# Ensure dependencies are installed
echo "Installing build dependencies..."
sudo apt update
sudo apt install -y git cmake build-essential rocm-hip-libraries rocm-opencl-runtime

# Clone llama.cpp if it doesn't already exist
if [ ! -d "llama.cpp" ]; then
    echo "Cloning llama.cpp..."
    git clone https://github.com/ggerganov/llama.cpp.git
fi
cd llama.cpp

# Clean previous build (optional)
rm -rf build

# Configure the build
echo "Configuring with ROCm support for $AMDGPU_TARGET..."
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS="$AMDGPU_TARGET"

# Build using all available cores
echo "Building llama.cpp with HIP acceleration..."
cmake --build build -j"$(nproc)"

echo "Build complete! Run with:"
echo "./build/bin/llama-cli -m path/to/your-model.gguf -p \"Hello, world\" --gpu-layers 40"
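In hindsight, the thing to check before kicking off a build like this is whether the HIP toolchain is actually present (assuming the standard /opt/rocm layout):
which hipcc || ls /opt/rocm/bin/hipcc   # is the HIP compiler installed at all?
/opt/rocm/bin/hipcc --version           # should report a HIP version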
Hmmm. No joy. Looks like I don’t actually have the HIP compiler installed properly.
So let’s go reinstall that:
sudo apt install python3-setuptools python3-wheel
sudo apt update
wget https://repo.radeon.com/amdgpu-install/6.4.1/ubuntu/noble/amdgpu-install_6.4.60401-1_all.deb
sudo apt install ./amdgpu-install_6.4.60401-1_all.deb
# amdgpu-install -y --usecase=workstation,rocm
amdgpu-install --usecase=workstation -y --vulkan=pro --opencl=rocr
dkms status
# expecting: amdgpu/6.12.12-2164967.24.04, 6.8.0-60-generic, x86_64: installed
rocminfo
# expecting: ROCk module version 6.12.12 is loaded - and a lot of data
# but this is the important part
# Name: gfx1150
clinfo
# expecting: Name: gfx1150
/opt/rocm/bin/hipcc --version
# expecting: HIP version: 6.4.43483-a187df25c and some data