# 🏎️ Making `ilab` go fast
By default, `ilab` will attempt to use your GPU for inference and synthesis.
This works on a wide variety of common systems, but less-common configurations
may require some additional tinkering to get it enabled. This document aims to
describe how you can GPU-accelerate `ilab` on a variety of different
environments.

`ilab` relies on two Python packages that can be GPU accelerated: `torch` and
`llama-cpp-python`. In short, you'll need to replace the default versions of
these packages with versions that have been compiled for GPU-specific support,
recompile `ilab`, then run it.
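Before you change anything, it can help to confirm whether your current
install is already accelerated. A minimal check, using two functions that also
appear in the Testing section below:

```shell
# Both should print True on a GPU-accelerated install
python -c "import torch; print(torch.cuda.is_available())"
python -c "import llama_cpp; print(llama_cpp.llama_supports_gpu_offload())"
```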
## Python 3.11 (Linux only)
NOTE: This section may be outdated. At least AMD ROCm works fine with Python 3.12 and Torch 2.2.1+rocm5.7 binaries.
Unfortunately, at the time of writing, `torch` does not have GPU-specific
support for the latest Python (3.12), so if you're on Linux, it's recommended
to set up a Python 3.11-specific `venv` and install `ilab` into that to
minimize issues. (macOS ships Python 3.9, so this step shouldn't be
necessary.) Here's how to do that on Fedora with `dnf`:
```shell
# Install Python 3.11
sudo dnf install python3.11 python3.11-devel

# Remove old venv from instructlab/ directory (if it exists)
rm -r venv

# Create and activate new Python 3.11 venv
python3.11 -m venv venv
source venv/bin/activate

# Install ilab (assumes a locally-cloned repo)
# Clone the repo first if you haven't already done so (either command works):
#   gh repo clone instructlab/instructlab
#   git clone https://github.com/instructlab/instructlab.git
pip install ./instructlab/
```
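As a quick sanity check (paths assume the `venv` created above), confirm the
new environment is actually the one active before rebuilding any packages:

```shell
# Verify the venv's interpreter before replacing packages
python --version   # should print Python 3.11.x
which python       # should point into ./venv/bin/
```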
With Python 3.11 installed, it’s time to replace some packages!
## `llama-cpp-python` backends
Go to the project's GitHub to see the supported backends. Whichever backend
you choose, you'll see a `pip install` command. First you have to purge pip's
wheel cache to force a rebuild of `llama-cpp-python`:

```shell
pip cache remove llama_cpp_python
```

You'll want to add a few options to ensure it gets installed over the existing
package, has the desired backend, and is the correct version:

```shell
pip install --force-reinstall llama_cpp_python==0.2.79 -C cmake.args="-DLLAMA_$BACKEND=on"
```

where `$BACKEND` is one of `HIPBLAS` (ROCm), `CUDA`, `METAL` (Apple Silicon
MPS), `CLBLAST` (OpenCL), or another backend listed in `llama-cpp-python`'s
documentation.
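Once the reinstall completes, you can confirm the wheel was actually rebuilt
with the backend enabled rather than restored from a cached CPU build (the
same helper is used in the Testing section below):

```shell
# Prints the installed version and whether GPU offload was compiled in
python -c "import llama_cpp; print(llama_cpp.__version__, llama_cpp.llama_supports_gpu_offload())"
```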
## Nvidia/CUDA
`torch` should already ship with CUDA support, so you only have to replace
`llama-cpp-python`.

Ensure you have the latest proprietary Nvidia drivers installed. You can
easily validate whether you are using the `nouveau` or `nvidia` kernel driver
with the following command. If your output shows `Kernel driver in use: nouveau`,
you are not running with the proprietary Nvidia drivers.
```shell
# Check video driver
sudo dnf install pciutils
lspci -n -n -k | grep -A 2 -e VGA -e 3D
```
If needed, install the proprietary Nvidia drivers:
```shell
# Enable RPM Fusion Repos
sudo dnf install https://mirrors.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm https://mirrors.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm

# Install Nvidia Drivers
# There may be extra steps for enabling secure boot. View the following blog for further details: https://blog.monosoul.dev/2022/05/17/automatically-sign-nvidia-kernel-module-in-fedora-36/
sudo dnf install akmod-nvidia xorg-x11-drv-nvidia-cuda

# Reboot to load new kernel drivers
sudo reboot

# Check video driver
lspci -n -n -k | grep -A 2 -e VGA -e 3D
```
You should now see `Kernel driver in use: nvidia`. The next step is to ensure
CUDA 12.4 is installed.
```shell
# Install CUDA 12.4 and nvtop to monitor GPU usage
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/fedora39/x86_64/cuda-fedora39.repo
sudo dnf clean all
sudo dnf -y install cuda-toolkit-12-4 nvtop
```
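You can confirm the toolkit and driver landed before rebuilding anything
(assuming the default `/usr/local/cuda` install prefix; `nvcc` ships with the
toolkit, `nvidia-smi` with the driver):

```shell
# Verify the CUDA toolkit and driver installed correctly
/usr/local/cuda/bin/nvcc --version   # should report CUDA 12.4
nvidia-smi                           # should list your GPU and driver version
```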
Go to the project's GitHub to see the supported backends. Find the `CUDA`
backend. You'll see a `pip install` command. You'll want to add a few options
to ensure it gets installed over the existing package: `--force-reinstall`.
Your final command should look like this:
```shell
# Verify CUDA can be found in your PATH variable
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64
export PATH=$PATH:$CUDA_HOME/bin

# Recompile llama-cpp-python using CUDA
pip cache remove llama_cpp_python
pip install --force-reinstall llama_cpp_python==0.2.79 -C cmake.args="-DLLAMA_CUDA=on"

# Re-install InstructLab
pip install instructlab/.
```
If you are running Fedora 40, you need to replace the
`# Recompile llama-cpp-python using CUDA` step above with the following until
CUDA supports GCC v14.1+:
```shell
# Recompile llama-cpp-python using CUDA
sudo dnf install clang17
CUDAHOSTCXX=$(which clang++-17) pip install --force-reinstall llama_cpp_python==0.2.79 -C cmake.args="-DLLAMA_CUDA=on"
```
Proceed to the Initialize section of the CLI README, and use the `nvtop`
utility to validate GPU utilization when interacting with `ilab model chat`
or `ilab data generate`.
## AMD/ROCm
Your user account must be in the `video` and `render` groups to have
permission to access the GPU hardware. If the `id` command does not show both
groups, then run the following command. You have to log out and log in again
to refresh your current user session.
```shell
sudo usermod -a -G render,video $LOGNAME
```
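After logging back in, you can confirm the membership took effect with the
`id` command mentioned above:

```shell
# Both groups should appear in the output after a fresh login
id | grep -E 'video|render'
```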
### ROCm container
The most convenient approach is the ROCm toolbox container. The container comes with PyTorch, llama-cpp, and other dependencies pre-installed and ready to use.
### Manual installation
`torch` does not yet ship with AMD ROCm support, so you'll need to install a
version compiled with support. Visit the PyTorch "Get Started Locally" page
and use the matrix installer tool to find the ROCm package. Selecting
`Stable, Linux, Pip, Python, ROCm 5.7` in the matrix installer spits out the
following command:
```shell
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0
```
You don't need `torchvision` or `torchaudio`, so get rid of those. You also
want to make very sure you're installing the right package, and not the old
one that doesn't have GPU support, so add these options: `--force-reinstall`
and `--no-cache-dir`. Your command should look like below. Run it to install
the new version of `torch`.
```shell
pip install torch --force-reinstall --no-cache-dir --index-url https://download.pytorch.org/whl/rocm6.0
```
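You can verify that the ROCm build landed by checking `torch.version.hip`
(also shown in the Testing section); it should print a HIP version rather
than `None`:

```shell
# A ROCm build reports a HIP version; CPU and CUDA builds report None
python -c "import torch; print(torch.version.hip)"
```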
With that done, it's time to move on to `llama-cpp-python`.
#### hipBLAS
If using hipBLAS, you may need to install additional ROCm and hipBLAS dependencies:
```shell
# Optionally enable the repo.radeon.com repository, available through AMD documentation or Radeon Software for Linux for RHEL 9.3 at https://www.amd.com/en/support/linux-drivers
# The above will get you the latest 6.x drivers, and will not work with rocm5.7 pytorch
# To grab the rocm 5.7 drivers: https://repo.radeon.com/amdgpu-install/23.30.3/rhel/9.2/

# ROCm Dependencies
sudo dnf install rocm-dev rocm-utils rocm-llvm rocminfo

# hipBLAS dependencies
sudo dnf install hipblas-devel hipblas rocblas-devel
```
With those dependencies installed, you should be able to install (and build)
`llama-cpp-python`!

You can use `rocminfo | grep gfx` from the `rocminfo` package or `amdgpu-arch`
from the `clang-tools-extra` package to find your GPU model to include in the
build command; this may not be necessary in Fedora 40+ or ROCm 6.0+. You
should see something like the following if you have an AMD integrated and
dedicated GPU:
```shell
$ rocminfo | grep gfx
  Name:                    gfx1100
  Name:                    amdgcn-amd-amdhsa--gfx1100
  Name:                    gfx1036
  Name:                    amdgcn-amd-amdhsa--gfx1036
```
In this case, `gfx1100` is the model we're looking for (our dedicated GPU), so
we'll include that in our build command as follows:
```shell
export PATH=/opt/rocm/llvm/bin:$PATH
pip cache remove llama_cpp_python
CMAKE_ARGS="-DLLAMA_HIPBLAS=on -DCMAKE_C_COMPILER='/opt/rocm/llvm/bin/clang' -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++ -DCMAKE_PREFIX_PATH=/opt/rocm -DAMDGPU_TARGETS=gfx1100" FORCE_CMAKE=1 pip install --force-reinstall llama_cpp_python==0.2.79
```
Note: This explicitly forces the build to use the ROCm compilers and prefix
path for dependency resolution in the CMake build. It works around an issue in
the CMake and ROCm versions in Fedora 39 and below, which is fixed in Fedora
40. With Fedora 40's ROCm packages, use
`CMAKE_ARGS="-DLLAMA_HIPBLAS=on -DCMAKE_C_COMPILER=/usr/bin/clang -DCMAKE_CXX_COMPILER=/usr/bin/clang++ -DAMDGPU_TARGETS=gfx1100"`
instead.
Once that package is installed, recompile `ilab` with `pip install .`. You
also need to tell HIP which GPU to use; you can find this out via `rocminfo`,
although it is typically GPU 0. To set which device is visible to HIP, set
`export HIP_VISIBLE_DEVICES=0` for GPU 0. You may also have to set
`HSA_OVERRIDE_GFX_VERSION` to override ROCm's GFX version detection, for
example `export HSA_OVERRIDE_GFX_VERSION=10.3.0` to force an unsupported
`gfx1032` card to use the supported `gfx1030` version. The environment
variable `AMD_LOG_LEVEL` enables debug logging of the ROCm libraries, for
example `AMD_LOG_LEVEL=3` to print API calls to `stderr`.
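Putting those together, a typical pre-launch environment for a single-GPU
system might look like the following sketch; the values shown are the examples
from the paragraph above and depend on your hardware:

```shell
# Values are illustrative; adjust for your GPU (see the rocminfo output above)
export HIP_VISIBLE_DEVICES=0             # expose only GPU 0 to HIP
export HSA_OVERRIDE_GFX_VERSION=10.3.0   # only if your gfx version is unsupported
export AMD_LOG_LEVEL=3                   # optional: verbose ROCm API logging to stderr
ilab model serve
```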
Now you can skip to the Testing section.
## CLBlast (OpenCL)
Your final command should look like so (this uses `CLBlast`):
```shell
pip cache remove llama_cpp_python
pip install --force-reinstall llama_cpp_python==0.2.79 -C cmake.args="-DLLAMA_CLBLAST=on"
```
Once that package is installed, recompile `ilab` with `pip install .` and skip
to the Testing section.
## Metal/Apple Silicon
The default `ilab` installation should have Metal support enabled. If that
isn't the case, these steps might help to enable it.

`torch` should already ship with Metal support, so you only have to replace
`llama-cpp-python`. Go to the project's GitHub to see the supported backends.
Find the `Metal` backend. You'll see a `pip install` command. You'll want to
add a few options to ensure it gets installed over the existing package:
`--force-reinstall` and `--no-cache-dir`. Your final command should look like
so:
```shell
pip cache remove llama_cpp_python
pip install --force-reinstall llama_cpp_python==0.2.79 -C cmake.args="-DLLAMA_METAL=on"
```
Once that package is installed, recompile `ilab` with `pip install .` and skip
to the Testing section.
## Testing
Test your changes by chatting to the LLM. Run `ilab model serve` and
`ilab model chat` and chat to the LLM. If you notice significantly faster
inference, congratulations! You've enabled GPU acceleration. You should also
notice that the `ilab data generate` step takes significantly less time. You
can use tools like `nvtop` and `radeontop` to monitor GPU usage.
Use the scripts `containers/bin/debug-pytorch` and `containers/bin/debug-llama`
to verify that PyTorch and llama-cpp are able to use your GPU.
The `torch` and `llama_cpp` packages provide functions to debug GPU support.
Here is an example from an AMD ROCm system with a single GPU, a ROCm build of
PyTorch, and llama-cpp built with hipBLAS. Don't be confused by the fact that
PyTorch uses the `torch.cuda` API for ROCm, or that llama-cpp reports hipBLAS
as cuBLAS; the packages treat ROCm like a variant of CUDA.
```python
>>> import torch
>>> torch.__version__
'2.2.1+rocm5.7'
>>> torch.version.cuda or 'n/a'
'n/a'
>>> torch.version.hip or 'n/a'
'5.7.31921-d1770ee1b'
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_name(torch.cuda.current_device())
'AMD Radeon RX 7900 XT'
>>> import llama_cpp
>>> llama_cpp.__version__
'0.2.56'
>>> llama_cpp.llama_supports_gpu_offload()
True
>>> llama_cpp.llama_backend_init()
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XT, compute capability 11.0, VMM: no
```
## Training
`ilab model train` also experimentally supports GPU acceleration on Linux.
Details of working setups are included above. Training is memory-intensive and
requires a modern GPU to work. The GPU must support `bfloat16` or `fp16` and
have at least 17 GiB of free GPU memory. Nvidia CUDA on WSL2 is able to use
shared host memory (USM) if GPU memory is not sufficient, but that comes with
a performance penalty. Training on the Linux kernel requires all data to fit
in GPU memory. We are working on improvements like 4-bit quantization.
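To check whether you meet the memory requirement before starting a run,
`torch.cuda.mem_get_info()` reports free and total memory for the current
device; a quick sketch that works for both CUDA and ROCm builds, since ROCm
uses the `torch.cuda` API:

```shell
# Print free vs. total GPU memory in GiB for device 0
python -c "import torch; f, t = torch.cuda.mem_get_info(); print(f'{f / 2**30:.1f} GiB free of {t / 2**30:.1f} GiB')"
```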
It has been successfully tested on:

- Nvidia GeForce RTX 3090 (24 GiB), Fedora 39, PyTorch 2.2.1, CUDA 12.1
- Nvidia GeForce RTX 3060 Ti (8 GiB + 9 GiB shared), Fedora 39 on WSL2, CUDA 12.1
- Nvidia Tesla V100 (16 GB) on AWS `p3.2xlarge`, Fedora 39, PyTorch 2.2.1, 4-bit quantization
- AMD Radeon RX 7900 XT (20 GiB), Fedora 39, PyTorch 2.2.1+rocm5.7
- AMD Radeon RX 7900 XTX (24 GiB), Fedora 39, PyTorch 2.2.1+rocm5.7
- AMD Radeon RX 6700 XT (12 GiB), Fedora 39, PyTorch 2.2.1+rocm5.7, 4-bit quantization
Incompatible devices:

- Nvidia cards with Turing architecture (GeForce RTX 20 series) or older. They
  lack support for `bfloat16` and `fp16`.
Note: PyTorch implements AMD ROCm support on top of its `torch.cuda` API and
treats AMD GPUs as CUDA devices. In a ROCm build of PyTorch, `cuda:0` is
actually the first ROCm device.
Note: Training does not use a local lab server. You can stop `ilab model serve`
to free up GPU memory.
```shell
ilab model train --device cuda
LINUX_TRAIN.PY: PyTorch device is 'cuda:0'
  NVidia CUDA version: n/a
  AMD ROCm HIP version: 5.7.31921-d1770ee1b
  Device 'cuda:0' is 'AMD Radeon RX 7900 XT'
  Free GPU memory: 19.9 GiB of 20.0 GiB
LINUX_TRAIN.PY: NUM EPOCHS IS: 1
...
```