Agent Building: Ollama Hosted Models Tokens per Second

There’s no way to really understand something unless you dive into first principles. This is especially true for AI coding agents. What is the editor doing under the hood? I try to peel the onion a bit.

Desire: Use Ollama

I could just keep doing what I’ve been doing, and use VSCode with GitHub Copilot. My company pays for a license and I’ve been digging into that. In fact, I used that to approach building this new tool.

I want to use Ollama to host a local model. I tried this with both Zed and VSCode and the results were TERRIBLE. I suspect I was doing something wrong. But it was basically not viable.

But… that may be by design. The IDEs these days are loss-leaders to pull you into their revenue model: consumption of models tied to their platform.

Now, I’m not actually accusing them of breaking anything, but I can see why they wouldn’t be eager to support something that neither generates revenue for them nor ties you any more closely to their ecosystem.

Installing Ollama

Follow the instructions here to install Ollama.
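For reference, the install boils down to roughly this on the two platforms I’m testing (the Linux one-liner shows up again later in this post; Homebrew is one of the options on macOS):

# macOS (Homebrew formula; the download from ollama.com works too)
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# sanity check
ollama --version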

My Program

To understand what I did, you probably want to see my code here. More on the code later.
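As a rough sketch of what a tool like this does under the hood (this is not the actual program, just an illustration assuming the standard Ollama REST API on its default port and jq installed), you can ask the local server for a completion and compute tokens per second from the eval_count and eval_duration fields in the response:

# Hypothetical sketch: eval_duration is reported in nanoseconds
RESP=$(curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:7b",
  "prompt": "Write a function that reverses a string.",
  "stream": false
}')
echo "$RESP" | jq '{eval_count, eval_duration, tokens_per_sec: (.eval_count / (.eval_duration / 1e9))}'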

Testing: Ollama Hosted Models Tokens per Second

I have two computing platforms that I want to test: my shiny new MacBook Pro with an M4 Pro chip and 48GB of unified memory, and a new Beelink SER9 Pro AI Mini PC with an AMD Ryzen AI 9 HX 370 and 64GB of memory… which is unlikely to be fully unified. In fact, half of the exercise is to learn how to use Ollama to host a local model on that kind of computer. Will it work? How do you set it up?

Here’s the model I chose to test with so that we have apples to apples:

ollama show qwen2.5-coder:7b
  Model
    architecture        qwen2
    parameters          7.6B
    context length      32768
    embedding length    3584
    quantization        Q4_K_M

  Capabilities
    completion
    tools
    insert

  System
    You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

  License
    Apache License
    Version 2.0, January 2004
    ...
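If you want to reproduce this, pull the same tag on both machines first so the quantization matches:

ollama pull qwen2.5-coder:7b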

ChatGPT said to expect the following from my Mac:

| Mode                 | Quantization | Expected TPS (tokens/sec) |
| -------------------- | ------------ | ------------------------- |
| **Prompt**           | `q4_K_M`     | 35–50                     |
| **Generate**         | `q4_K_M`     | 55–85                     |
| **Total End-to-End** |              | \~60–75 average           |
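For reference when reading the ollama run --verbose output below: the “prompt eval rate” line corresponds to the Prompt row and the “eval rate” line corresponds to the Generate row.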

MacBook Pro:

ollama run qwen2.5-coder:7b --verbose
<fed it the contents of the prompt.txt file in my code repository>
<inference results removed for brevity>
total duration:       14.808806709s
load duration:        28.828542ms
prompt eval count:    51 token(s)
prompt eval duration: 300.190041ms
prompt eval rate:     169.89 tokens/s
eval count:           620 token(s)
eval duration:        14.477839s
eval rate:            42.82 tokens/s

Almost 43 TPS is at the low end of (actually a bit below) that prediction, but it’s a non-trivial prompt. I even made sure Spotlight indexing was disabled so that background load didn’t skew the prompt eval rate:

sudo mdutil -a -i off
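
To turn indexing back on afterwards:

sudo mdutil -a -i on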

How about the Ryzen AI 9 HX 370?

ollama run qwen2.5-coder:7b --verbose
<fed it the contents of the prompt.txt file in my code repository>
<inference results removed for brevity>
total duration:       34.769163118s
load duration:        12.838666ms
prompt eval count:    71 token(s)
prompt eval duration: 543.123423ms
prompt eval rate:     130.73 tokens/s
eval count:           584 token(s)
eval duration:        34.204503309s
eval rate:            17.07 tokens/s

Ugh. 17 TPS. That’s terrible.

Damn - turns out ROCm does not support this GPU

ROCm officially targets:

- Discrete GPUs: e.g. Radeon RX 5000/6000/7000 series (RDNA)
- Some integrated RDNA 2 (gfx1036/1037)
- Professional GPUs: Instinct MI200/MI300

As of ROCm 6.4, the gfx1150 (RDNA 3 iGPUs like the 780M/890M) is still not on the supported list: https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html
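
To see which architecture ROCm reports for your own GPU before getting your hopes up (the same check I run in more detail below):

rocminfo | grep -i gfx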

Well, that stinks. End of this effort.

Seriously. No point even going any further. What a waste of time.

The Road to Hell (or, what I did before I found it wasn’t supported)

Let’s re-install Ollama; perhaps I did it incorrectly when I set up the system.

curl -fsSL https://ollama.com/install.sh | sh
>>> Cleaning up old version at /usr/local/lib/ollama
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
>>> Downloading Linux ROCm amd64 bundle
######################################################################## 100.0%
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
>>> AMD GPU ready.

Retested and it’s still 17 TPS. Hmmm.
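
Before blaming the hardware, it’s worth confirming whether Ollama is even using the GPU. Assuming the systemd service the installer just created, two quick checks are ollama ps (the PROCESSOR column shows the CPU/GPU split for a loaded model) and the service log, which reports what GPUs were detected at startup:

ollama ps
journalctl -u ollama --no-pager | grep -iE 'gpu|rocm' | tail -n 20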

Next step: make sure my GPU is set up and enabled in Ollama. After cloning the Ollama repo, I had to find out which GPU I had:

rocminfo | grep -A4 'Name:'
  Name:                    AMD Ryzen AI 9 HX 370 w/ Radeon 890M
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen AI 9 HX 370 w/ Radeon 890M
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
--
  Name:                    gfx1150
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon Graphics
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
--
      Name:                    amdgcn-amd-amdhsa--gfx1150
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
--
      Name:                    amdgcn-amd-amdhsa--gfx11-generic
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR

So it’s the gfx1150.

I tried to build llama.cpp with HIP support for that target:

#!/bin/bash

# Exit on error
set -e

# Set your target GPU architecture
export AMDGPU_TARGET="gfx1150"
# you need to set this to whatever YOU have!!!

# Ensure dependencies are installed
echo "Installing build dependencies..."
sudo apt update
sudo apt install -y git cmake build-essential rocm-hip-libraries rocm-opencl-runtime

# Clone llama.cpp if it doesn't already exist
if [ ! -d "llama.cpp" ]; then
    echo "Cloning llama.cpp..."
    git clone https://github.com/ggerganov/llama.cpp.git
fi

cd llama.cpp

# Clean previous build (optional)
rm -rf build

# Configure the build
echo "Configuring with ROCm support for $AMDGPU_TARGET..."
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS="$AMDGPU_TARGET"

# Build using all available cores
echo "Building llama.cpp with HIP acceleration..."
cmake --build build -j$(nproc)

echo "Build complete! Run with:"
echo "./build/bin/main -m path/to/your-model.gguf -p \"Hello, world\" --gpu-layers 40"

Hmmm. No joy. Looks like I don’t actually have the HIP compiler installed properly.
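
A quick way to confirm that the HIP compiler really is the missing piece:

which hipcc || echo "hipcc not on PATH"
ls /opt/rocm/bin/hipcc 2>/dev/null || echo "no hipcc under /opt/rocm"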


So let’s go reinstall that:

sudo apt install python3-setuptools python3-wheel
sudo apt update
wget https://repo.radeon.com/amdgpu-install/6.4.1/ubuntu/noble/amdgpu-install_6.4.60401-1_all.deb
sudo apt install ./amdgpu-install_6.4.60401-1_all.deb
# amdgpu-install -y --usecase=workstation,rocm
amdgpu-install --usecase=workstation -y --vulkan=pro --opencl=rocr
dkms status
# expecting: amdgpu/6.12.12-2164967.24.04, 6.8.0-60-generic, x86_64: installed
rocminfo
# expecting: ROCk module version 6.12.12 is loaded - and a lot of data
# but this is the important part
# Name:                    gfx1150
clinfo
# expecting:  Name:						 gfx1150
/opt/rocm/bin/hipcc --version
# expecting: HIP version: 6.4.43483-a187df25c and some data
