LLMs and GPUs
In recent years, the landscape of artificial intelligence has shifted dramatically with the rise of large language models (LLMs). These models are incredibly powerful but also resource-intensive — typically requiring high-end GPUs like NVIDIA’s RTX 4090s or AMD’s latest Radeon Instinct series to run effectively.
But what if you don’t have access to such hardware? What if your budget is limited, or you already own older GPUs like the AMD Radeon RX 580? Surprisingly, there’s still a way to get meaningful performance out of these aging cards — especially with the right software stack and a bit of ingenuity.
This guide walks through how to leverage the AMD Radeon RX 580 — an aging yet capable GPU — to run large language models using llama.cpp via Vulkan API support, even though ROCm (the newer AMD compute framework) no longer supports it.
Hardware Overview: The Radeon RX 580
The Radeon RX 580 is part of AMD’s Polaris generation, released in 2016. While not cutting-edge today, it still offers:
- 8 GB GDDR5 memory (sufficient for many smaller models)
- 2,304 stream processors
- 14nm process
- Good PCIe 3.0 bandwidth
Although it’s no longer officially supported in newer versions of ROCm, the RX 580 retains full compatibility with Vulkan drivers, making it ideal for running modern AI inference engines.
Software Stack: llama.cpp + Vulkan
llama.cpp is a lightweight C++ implementation of the LLaMA architecture that allows you to run LLMs directly on your CPU or GPU.
It supports multiple backends including:
- CPU (default)
- CUDA (NVIDIA)
- Metal (Apple Silicon)
- Vulkan (AMD & Intel GPUs)
By enabling Vulkan support during compilation, we can tap into the RX 580’s full potential.
Installing Vulkan Drivers on Debian 12
Before we build llama.cpp, we need to ensure the system has proper Vulkan support:
sudo apt install vulkan-tools libtcmalloc-minimal4 libcurl4-openssl-dev glslc cmake make git pkg-config libvulkan-dev
These packages provide:
vulkan-tools: Tools for testing Vulkan applicationslibtcmalloc-minimal4: Memory allocator for performancelibcurl4-openssl-dev: For downloading models via HTTPglslc: GLSL shader compiler (needed for Vulkan)cmake,make,git,pkg-config: Build dependencieslibvulkan-dev: Required for Vulkan development
Once installed, you can verify Vulkan support:
vulkaninfo | grep -i RX
You should see your GPU listed in the output.
Installing llama.cpp with Vulkan Support
Let’s walk through the full installation process.
Step 1: Clone the Repository
cd ~
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
Step 2: Configure CMake for Vulkan
Build llama.cpp with Vulkan enabled:
cmake .. \
-DGGML_AVX=ON \
-DGGML_AVX_VNNI=ON \
-DGGML_AVX2=ON \
-DGGML_VULKAN=ON \
-DCMAKE_BUILD_TYPE=Release \
-DLLAMA_CURL=ON
This configuration enables:
- AVX instructions for faster CPU ops
- AVX2 / VNNI optimizations (for better performance on supported CPUs)
- Vulkan backend support for AMD GPUs
- Curl support for downloading GGUF models from Hugging Face
Step 3: Compile and Install
make -j8
echo 'export PATH=$PATH:'$(realpath bin) >> ~/.bashrc
Log out and back in to update your environment variables so llama-cli and llama-server are available in your terminal.
Running Models with llama-cli and llama-server
Now that everything is built, let’s test it out with some sample commands.
Using llama-cli
Run a model using the CLI interface:
llama-cli -m deepseek-r1:8B --device Vulkan0 -ngl 99
This command:
- Loads a model named
deepseek-r1:8B - Uses device
Vulkan0(first Vulkan-compatible GPU detected) - Sets
-ngl 99to offload all layers to GPU
You can optionally specify the full model path or use Hugging Face URLs (with the -hf flag if supported).
Using llama-server
To expose your model via an API endpoint:
llama-server --host 0.0.0.0 -hf unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_M --device Vulkan0 -ngl 99
This starts a server listening on all interfaces (0.0.0.0) and uses:
unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q4_K_Mas the model (quantized to 4-bit)- Device
Vulkan0 - All layers (
-ngl 99) loaded into GPU memory
Multi-GPU Setup
If you have more than one RX 580 (or other Vulkan-compatible GPUs), you can split the model across multiple devices:
llama-server --host 0.0.0.0 -hf unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF:Q8_K_XL --device Vulkan0,Vulkan1
And for even larger models, like Qwen3-Coder-30B-A3B-Instruct-GGUF:
llama-server \
--host 0.0.0.0 \
-hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q8_K_XL \
-ngl 99 \
--threads -1 \
--ctx-size 32684 \
--temp 0.7 \
--min-p 0.0 \
--top-p 0.80 \
--top-k 20 \
--repeat-penalty 1.05 \
--device Vulkan0,Vulkan1,Vulkan2,Vulkan3,Vulkan4
This will use up to five GPUs, distributing load across them and enabling inference of 30B parameter models.
Updating llama.cpp
When new updates are released, just run:
cd ~/llama.cpp/
git clean -xdf
git pull
git submodule update --recursive
cd build/
cmake .. \
-DGGML_AVX=ON \
-DGGML_AVX_VNNI=ON \
-DGGML_AVX2=ON \
-DGGML_VULKAN=ON \
-DCMAKE_BUILD_TYPE=Release \
-DLLAMA_CURL=ON
make -j8
Performance Notes: RX 580 Limitations and Workarounds
While the RX 580 isn’t the fastest GPU on the market, it can still run impressive models when properly configured. Here are some key takeaways:
- Small to medium-sized models (e.g., 7B–13B parameters) run smoothly with minimal latency.
- Larger models (like 30B) require:
- Quantized weights (Q4, Q8_K_XL)
- Multi-GPU setup
- Longer wait times for responses
- Threading optimization (
--threads -1) - Higher context sizes (
--ctx-size)
Despite limitations, a cluster of 5 RX 580s can handle a 30B parameter model, which is quite remarkable for such older hardware.
Final Thoughts
The RX 580 may be old, but it still holds value in the world of AI inference. Thanks to the llama.cpp project’s Vulkan backend support, it’s possible to run large language models on low-cost hardware that would otherwise be unusable for AI workloads.
With careful configuration and the right software stack, you can build a capable local LLM inference rig using nothing more than a few secondhand GPUs. Whether you’re training, experimenting, or just curious about AI, this setup provides a great foundation to get started.
If you’re looking to repurpose an old rig or build a cost-effective edge AI box, the RX 580 + Vulkan + llama.cpp combination is worth exploring — and you might be surprised at what it can do.
Have questions or need help setting up your own RX 580-based LLM cluster? Leave a comment below or share your experience in the comments!
friend, you will help me a lot if you tell me how to build llama.cpp + vulkan + rx 580 + windows 10. Or how to forward the llama.cpp service from wsl to windows 10
Artem and OP:
https://github.com/ggml-org/llama.cpp/releases/download/b6421/llama-b6421-bin-win-vulkan-x64.zip
https://github.com/ggml-org/llama.cpp/releases
Just make sure you have Vulkan API setup for Windows and your RX 580/RX480 card.Great work. I got 14 x RX480 8GB working Ubuntu 24.04 in Vulkan. I can even load the 70B deepseek R1 model.
Problem is I currently only have 8GB ddr4 system memory my 3000mhz 16GB is too fast.
Swapfile seems to crash with amdgpu, so I had to flat out remove the swap files to reduce paging crash. Yes running 8GB system Ram, and 10xrx480 -8GB and 4x rx 480 flashed with 4gb (but have 8GB) Will have to reflash the 4gb to 8gb to get 112GB. Currently I have 96GB DDR VRAM Accessible.
Also noting even with 8GB, Vulkan only seems to at most access 4GB of the 8GB on each GPU, not like OpenCL or CUDA where it accesses close to 99% of the memory.