Tag Archives: Llama.cpp

Building llama.cpp with CUDA on Debian 13 “Trixie”

If you’ve recently upgraded to Debian 13 or are fresh on a Trixie system, you may be eager to tap the power of your NVIDIA GPU for machine‑learning workloads. This post walks you through every step required to set up the necessary drivers, libraries, and build environment.


Why Enable CUDA in llama.cpp?

The original binaries of llama.cpp run on the CPU, which is perfectly fine for small models but can become a bottleneck with larger weights. By enabling the -DGGML_CUDA=ON flag, the project compiles the CUDA kernels that allow your NVIDIA GPU to perform inference. The result is a dramatic reduction in latency and a higher throughput for text generation tasks.


Prerequisites

  • A Debian 13 machine with an NVIDIA GPU that supports CUDA 11 or later.
  • Sudo access (or root) to install packages and modify system configuration.
  • An active internet connection so the package manager can fetch the necessary files.

Step 1 – Update Kernel Headers

Your system needs the headers that match the running kernel so that the NVIDIA driver can compile its kernel modules.

apt install linux-headers-$(uname -r)

This command pulls the headers for the current kernel release and installs them into the standard package locations.


Step 2 – Add Non‑Free Firmware Repositories

The Debian base repositories do not expose the proprietary firmware and driver packages needed for NVIDIA GPUs. By creating an additional source list file, we allow apt to pull the required non‑free components.

Create the file /etc/apt/sources.list.d/non‑free.sources and paste the following content:

Types: deb deb-src
URIs: http://deb.debian.org/debian/
Suites: trixie
Components: non-free-firmware contrib
Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg

Types: deb deb-src
URIs: http://security.debian.org/debian-security/
Suites: trixie-security
Components: non-free-firmware contrib
Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg

Types: deb deb-src
URIs: http://deb.debian.org/debian/
Suites: trixie-updates
Components: non-free-firmware contrib
Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg

After saving the file, refresh the package lists so the new entries become available:

apt update

Step 3 – Install the NVIDIA Driver and CUDA Toolkit

3.1 Bring in the NVIDIA Keyring

The NVIDIA distribution for Debian ships a keyring package that allows your system to verify the authenticity of the driver packages.

wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb

3.2 Install Driver Packages

apt -V install nvidia-driver-cuda nvidia-kernel-dkms

The meta‑package nvidia-driver-cuda pulls the latest driver binaries and the CUDA toolkit for the current kernel. It also installs nvidia-kernel-dkms, which provides a Dynamic Kernel Module Support interface so the driver can be built against any future kernel version.

3.3 Regenerate Initramfs and Update GRUB

After installing the driver modules, you must ensure that the initramfs contains the new driver and that GRUB will boot into the updated kernel configuration.

update-initramfs -u -k all
update-grub

Reboot the machine to let the new driver take effect.

3.4 Install the CUDA Toolkit

With the driver in place, install the toolkit components that provide nvcc, libraries, and headers used by llama.cpp.

apt install nvidia-cuda-toolkit

Step 4 – Install Build Dependencies

The build process for llama.cpp requires several libraries and developer tools. Installing them up front keeps the compile step straightforward.

apt install libtcmalloc-minimal4 libcurl4-openssl-dev glslc cmake make git pkg-config

These packages provide memory allocation utilities, SSL support, the GLSL compiler, CMake, Make, Git, and generic build configuration tools.


Step 5 – Clone and Compile llama.cpp

With the environment prepared, fetch the source code and build it.

cd ~
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake .. \
  -DGGML_AVX=ON \
  -DGGML_AVX_VNNI=ON \
  -DGGML_AVX2=ON \
  -DGGML_CUDA=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_CURL=ON
make -j8
echo 'export PATH=$PATH:'$(realpath bin) >> ~/.bashrc

After the build finishes, log out and back in again so the newly added binaries become visible in your shell path.


Step 6 – Keep the Driver in Sync with Kernel Updates

Kernel upgrades are common, and the driver must be rebuilt against each new kernel. The following routine ensures the driver modules stay current.

apt install linux-headers-$(uname -r)
apt install --reinstall nvidia-driver-cuda nvidia-kernel-dkms
apt install nvidia-cuda-toolkit
update-initramfs -u -k all
update-grub

Running this sequence after any kernel upgrade guarantees that the driver continues to load correctly.


Step 7 – Updating the Source Tree

When the upstream llama.cpp project publishes a new release or a bug fix, refresh your local copy and rebuild:

cd ~
cd llama.cpp/

# Clean the working directory
git clean -xdf
mkdir build

# Pull the latest changes and submodules
git pull
git submodule update --recursive

# Rebuild
cd build/
cmake .. \
  -DGGML_AVX=ON \
  -DGGML_AVX_VNNI=ON \
  -DGGML_AVX2=ON \
  -DGGML_CUDA=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_CURL=ON
make -j8

Running Large Language Models on Cheap Old RX 580 GPUs with llama.cpp and Vulkan

LLMs and GPUs

In recent years, the landscape of artificial intelligence has shifted dramatically with the rise of large language models (LLMs). These models are incredibly powerful but also resource-intensive — typically requiring high-end GPUs like NVIDIA’s RTX 4090s or AMD’s latest Radeon Instinct series to run effectively.

But what if you don’t have access to such hardware? What if your budget is limited, or you already own older GPUs like the AMD Radeon RX 580? Surprisingly, there’s still a way to get meaningful performance out of these aging cards — especially with the right software stack and a bit of ingenuity.

This guide walks through how to leverage the AMD Radeon RX 580 — an aging yet capable GPU — to run large language models using llama.cpp via Vulkan API support, even though ROCm (the newer AMD compute framework) no longer supports it.


Hardware Overview: The Radeon RX 580

The Radeon RX 580 is part of AMD’s Polaris generation, released in 2016. While not cutting-edge today, it still offers:

  • 8 GB GDDR5 memory (sufficient for many smaller models)
  • 2,304 stream processors
  • 14nm process
  • Good PCIe 3.0 bandwidth

Although it’s no longer officially supported in newer versions of ROCm, the RX 580 retains full compatibility with Vulkan drivers, making it ideal for running modern AI inference engines.


Software Stack: llama.cpp + Vulkan

llama.cpp is a lightweight C++ implementation of the LLaMA architecture that allows you to run LLMs directly on your CPU or GPU.

It supports multiple backends including:

  • CPU (default)
  • CUDA (NVIDIA)
  • Metal (Apple Silicon)
  • Vulkan (AMD & Intel GPUs)

By enabling Vulkan support during compilation, we can tap into the RX 580’s full potential.


Installing Vulkan Drivers on Debian 12

Before we build llama.cpp, we need to ensure the system has proper Vulkan support:

sudo apt install vulkan-tools libtcmalloc-minimal4 libcurl4-openssl-dev glslc cmake make git pkg-config libvulkan-dev

These packages provide:

  • vulkan-tools: Tools for testing Vulkan applications
  • libtcmalloc-minimal4: Memory allocator for performance
  • libcurl4-openssl-dev: For downloading models via HTTP
  • glslc: GLSL shader compiler (needed for Vulkan)
  • cmake, make, git, pkg-config: Build dependencies
  • libvulkan-dev: Required for Vulkan development

Once installed, you can verify Vulkan support:

vulkaninfo | grep -i RX

You should see your GPU listed in the output.


Installing llama.cpp with Vulkan Support

Let’s walk through the full installation process.

Step 1: Clone the Repository

cd ~
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build

Step 2: Configure CMake for Vulkan

Build llama.cpp with Vulkan enabled:

cmake .. \
  -DGGML_AVX=ON \
  -DGGML_AVX_VNNI=ON \
  -DGGML_AVX2=ON \
  -DGGML_VULKAN=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_CURL=ON

This configuration enables:

  • AVX instructions for faster CPU ops
  • AVX2 / VNNI optimizations (for better performance on supported CPUs)
  • Vulkan backend support for AMD GPUs
  • Curl support for downloading GGUF models from Hugging Face

Step 3: Compile and Install

make -j8
echo 'export PATH=$PATH:'$(realpath bin) >> ~/.bashrc

Log out and back in to update your environment variables so llama-cli and llama-server are available in your terminal.


Running Models with llama-cli and llama-server

Now that everything is built, let’s test it out with some sample commands.

Using llama-cli

Run a model using the CLI interface:

llama-cli -m deepseek-r1:8B --device Vulkan0 -ngl 99

This command:

  • Loads a model named deepseek-r1:8B
  • Uses device Vulkan0 (first Vulkan-compatible GPU detected)
  • Sets -ngl 99 to offload all layers to GPU

You can optionally specify the full model path or use Hugging Face URLs (with the -hf flag if supported).

Using llama-server

To expose your model via an API endpoint:

llama-server --host 0.0.0.0 -hf unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_M --device Vulkan0 -ngl 99

This starts a server listening on all interfaces (0.0.0.0) and uses:

  • unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_M as the model (quantized to 4-bit)
  • Device Vulkan0
  • All layers (-ngl 99) loaded into GPU memory

Multi-GPU Setup

If you have more than one RX 580 (or other Vulkan-compatible GPUs), you can split the model across multiple devices:

llama-server --host 0.0.0.0 -hf unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q8_K_XL --device Vulkan0,Vulkan1

And for even larger models, like Qwen3-Coder-30B-A3B-Instruct-GGUF:

llama-server \
  --host 0.0.0.0 \
  -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q8_K_XL \
  -ngl 99 \
  --threads -1 \
  --ctx-size 32684 \
  --temp 0.7 \
  --min-p 0.0 \
  --top-p 0.80 \
  --top-k 20 \
  --repeat-penalty 1.05 \
  --device Vulkan0,Vulkan1,Vulkan2,Vulkan3,Vulkan4

This will use up to five GPUs, distributing load across them and enabling inference of 30B parameter models.


Updating llama.cpp

When new updates are released, just run:

cd ~/llama.cpp/
git clean -xdf
git pull
git submodule update --recursive
cd build/
cmake .. \
  -DGGML_AVX=ON \
  -DGGML_AVX_VNNI=ON \
  -DGGML_AVX2=ON \
  -DGGML_VULKAN=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_CURL=ON
make -j8

Performance Notes: RX 580 Limitations and Workarounds

While the RX 580 isn’t the fastest GPU on the market, it can still run impressive models when properly configured. Here are some key takeaways:

  • Small to medium-sized models (e.g., 7B–13B parameters) run smoothly with minimal latency.
  • Larger models (like 30B) require:
  • Quantized weights (Q4, Q8_K_XL)
  • Multi-GPU setup
  • Longer wait times for responses
  • Threading optimization (--threads -1)
  • Higher context sizes (--ctx-size)

Despite limitations, a cluster of 5 RX 580s can handle a 30B parameter model, which is quite remarkable for such older hardware.


Final Thoughts

The RX 580 may be old, but it still holds value in the world of AI inference. Thanks to the llama.cpp project’s Vulkan backend support, it’s possible to run large language models on low-cost hardware that would otherwise be unusable for AI workloads.

With careful configuration and the right software stack, you can build a capable local LLM inference rig using nothing more than a few secondhand GPUs. Whether you’re training, experimenting, or just curious about AI, this setup provides a great foundation to get started.

If you’re looking to repurpose an old rig or build a cost-effective edge AI box, the RX 580 + Vulkan + llama.cpp combination is worth exploring — and you might be surprised at what it can do.


Have questions or need help setting up your own RX 580-based LLM cluster? Leave a comment below or share your experience in the comments!