Intel Gaudi / Habana Labs HPU with SynapseAI¶

WARNING Intel Gaudi support is currently under development and not ready for production.

NOTE These instructions install llama-cpp-python for CPU only. Inference in ilab model chat, ilab model serve, and ilab data generate does not use hardware acceleration.

System requirements¶

RHEL 9 on x86_64 (tested with RHEL 9.3 and patched installer)
Intel Gaudi 2 device
Habana Labs software stack (tested with 1.16.2)
software from Habana Vault for RHEL and PyTorch
software from the HabanaAI GitHub org, such as the optimum-habana fork

System preparation¶

Kernel modules, firmware, firmware tools¶

Enable the CRB and EPEL repositories

sudo subscription-manager repos --enable codeready-builder-for-rhel-9-$(arch)-rpms
sudo dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm

Add the Habana Vault repository as /etc/yum.repos.d/Habana-Vault.repo

[vault]
name=Habana Vault
baseurl=https://vault.habana.ai/artifactory/rhel/9/9.2
enabled=1
repo_gpgcheck=0
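
One way to create the file from the shell is sketched below. The function name vault_repo is illustrative and not part of any Habana tooling. It only prints the repository definition shown above, so the privileged write stays a separate, visible step.

```shell
# Hypothetical helper: print the repository definition from this guide.
vault_repo() {
    cat <<'EOF'
[vault]
name=Habana Vault
baseurl=https://vault.habana.ai/artifactory/rhel/9/9.2
enabled=1
repo_gpgcheck=0
EOF
}

# Usage (needs root for the target directory):
#   vault_repo | sudo tee /etc/yum.repos.d/Habana-Vault.repo
```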

Install the firmware and firmware tools

sudo dnf install habanalabs-firmware habanalabs-firmware-tools

Install the kernel drivers. This builds and installs several kernel modules with DKMS.

sudo dnf install habanalabs

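Load the kernel drivers. The -a option is needed to load several modules in one call; without it, modprobe treats the extra names as parameters for the first module.

sudo modprobe -a habanalabs_en habanalabs_cn habanalabs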

Check the journal for the device

journalctl -o cat | grep habanalabs
habanalabs hl0: Loading secured firmware to device, may take some time...
habanalabs hl0: preboot full version: 'Preboot version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:03:16)'
habanalabs hl0: boot-fit version 49.0.0-sec-9
habanalabs hl0: Successfully loaded firmware to device
habanalabs hl0: Linux version 49.0.0-sec-9
habanalabs hl0: Found GAUDI2 device with 96GB DRAM
habanalabs hl0: hwmon1: add sensors information
habanalabs hl0: Successfully added device 0000:19:00.0 to habanalabs driver
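
To script this check rather than read the log by eye, a small filter can count the devices the driver reports as added. This is a sketch that assumes the message wording shown in the sample output above; count_added_devices is a made-up name.

```shell
# Hypothetical helper: count "Successfully added device" messages.
# Reads log text on stdin so it can be fed from journalctl.
count_added_devices() {
    grep -c 'Successfully added device .* to habanalabs driver'
}

# Usage:
#   journalctl -o cat | count_added_devices
```

grep -c prints 0 and returns a non-zero exit status when nothing matches, so the function can also be used directly in an if condition.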

Check hl-smi

hl-smi
+-----------------------------------------------------------------------------+
| HL-SMI Version:                              hl-1.15.1-fw-49.0.0.0          |
| Driver Version:                                     1.15.1-62f612b          |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-225              N/A  | 0000:19:00.0     N/A |                   0  |
| N/A   29C   N/A    93W / 600W |    768MiB / 98304MiB |     0%           N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes:                                               AIP Memory |
|  AIP       PID   Type   Process name                             Usage      |
|=============================================================================|
|   0        N/A   N/A    N/A                                      N/A        |
+=============================================================================+

See Intel Gaudi SW Stack for RHEL 9.2 for detailed documentation.
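
For monitoring, the memory column can be pulled out of the default table with sed. This sketch assumes the table layout shown in the sample output above, which may differ in other hl-smi versions; hl_mem_usage is an illustrative name.

```shell
# Hypothetical helper: print "<used> <total>" in MiB for each device row
# of the default hl-smi table read from stdin.
hl_mem_usage() {
    sed -n 's/.*| *\([0-9][0-9]*\)MiB *\/ *\([0-9][0-9]*\)MiB *|.*/\1 \2/p'
}

# Usage:
#   hl-smi | hl_mem_usage
```

For the sample output above this prints 768 98304.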

Other tools¶

The Habana Vault repository provides several other tools, such as OCI container runtime hooks.

sudo dnf install habanalabs-graph habanatool habanalabs-thunk habanalabs-container-runtime

Install Python, Intel oneMKL, and PyTorch stack¶

Retrieve installer script

curl -O https://vault.habana.ai/artifactory/gaudi-installer/1.15.1/habanalabs-installer.sh
chmod +x habanalabs-installer.sh

NOTE

Habana Labs Installer 1.15.1 only supports RHEL 9.2 and fails on 9.3 and later. You can work around the limitation by patching the installer to report 9.2:
sed -i 's/OS_VERSION=\$VERSION_ID/OS_VERSION=9.2/' habanalabs-installer.sh
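
Because sed exits successfully even when the pattern does not match, it can help to keep a backup and confirm that the file changed. The wrapper below is a sketch around the same sed expression; patch_installer and the .orig suffix are illustrative choices.

```shell
# Hypothetical helper: apply the OS_VERSION patch to the given script,
# keep the original as <script>.orig, and fail if nothing was changed.
patch_installer() {
    script="$1"
    cp "$script" "$script.orig"
    sed -i 's/OS_VERSION=\$VERSION_ID/OS_VERSION=9.2/' "$script"
    if cmp -s "$script" "$script.orig"; then
        echo "pattern not found; $script is unchanged" >&2
        return 1
    fi
}

# Usage:
#   patch_installer habanalabs-installer.sh
```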

Install dependencies (use --verbose for verbose logging). This installs several RPM packages, downloads Intel compilers and libraries, downloads and compiles Python 3.10, and more.

export MAKEFLAGS="-j$(nproc)"
./habanalabs-installer.sh install --type dependencies --skip-install-firmware

Install PyTorch with Habana Labs framework in a virtual environment:

export HABANALABS_VIRTUAL_DIR=$HOME/habanalabs-venv
./habanalabs-installer.sh install --type pytorch --venv

Validate installation:

./habanalabs-installer.sh validate

Habana Labs’ PyTorch stack¶

Habana Labs ships a modified fork of PyTorch that is built with Intel’s oneAPI Math Kernel Library (oneMKL). The actual HPU bindings and helpers are provided by the habana_frameworks package. Importing habana_frameworks sub-packages registers hpu device support, the torch.hpu module, and dynamo backends.
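
A quick way to confirm that the stack sees the device is to query it from the virtual environment's Python. The sketch below assumes the bindings are importable as habana_frameworks.torch.hpu and expose is_available() and device_count(); check_hpu is an illustrative name, and the function has to run inside the activated virtual environment.

```shell
# Hypothetical helper: report whether PyTorch can see an HPU.
check_hpu() {
    python3 - <<'EOF'
import sys

try:
    # Assumption: this import path matches the installed Habana packages.
    import habana_frameworks.torch.hpu as hthpu
except ImportError:
    print("habana_frameworks is not importable in this environment")
    sys.exit(1)

print(f"available={hthpu.is_available()} devices={hthpu.device_count()}")
EOF
}

# Usage:
#   . $HABANALABS_VIRTUAL_DIR/bin/activate
#   check_hpu
```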

The SFTTrainer from trl does not work with the Habana stack. Use the GaudiSFTTrainer from optimum-habana instead. The version on PyPI is currently broken, but the HabanaAI optimum-habana-fork works.

Install and run InstructLab with Intel Gaudi¶

Install InstructLab from checkout with additional dependencies:

. $HABANALABS_VIRTUAL_DIR/bin/activate
pip install -r instructlab/requirements-hpu.txt ./instructlab

TIP If llama-cpp-python fails to build with the error unsupported instruction `vpdpbusd', install it with CFLAGS="-mno-avx" pip install ....

Training environment (see Habana runtime environment variables):

# environment variables for training
export TSAN_OPTIONS='ignore_noninstrumented_modules=1'
export TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD=7516192768
export LD_PRELOAD=/lib64/libtcmalloc.so

# work around race condition on systems with lots of cores
export OMP_NUM_THREADS=16

# Gaudi configuration
export PT_HPU_LAZY_MODE=0
export PT_HPU_ENABLE_EAGER_CACHE=TRUE
export PT_HPU_EAGER_4_STAGE_PIPELINE_ENABLE=TRUE
export PT_ENABLE_INT64_SUPPORT=1

# additional environment variables for debugging
#export ENABLE_CONSOLE=true
#export LOG_LEVEL_ALL=5
#export LOG_LEVEL_PT_FALLBACK=1
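
If LD_PRELOAD points at a file that does not exist, the dynamic loader prints an error for every process started from that shell. A guarded variant of the tcmalloc line is sketched below; preload_tcmalloc is an illustrative name, and the default path is the one used above.

```shell
# Hypothetical helper: preload tcmalloc only if the library is installed.
preload_tcmalloc() {
    lib="${1:-/lib64/libtcmalloc.so}"
    if [ -e "$lib" ]; then
        export LD_PRELOAD="$lib"
    else
        echo "warning: $lib not found, LD_PRELOAD left unchanged" >&2
        return 1
    fi
}

# Usage:
#   preload_tcmalloc
```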

Train on HPU

ilab model train --device=hpu

Output:

LINUX_TRAIN.PY: Using device 'hpu'
============================= HABANA PT BRIDGE CONFIGURATION ===========================
 PT_HPU_LAZY_MODE = 0
 PT_RECIPE_CACHE_PATH =
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG =
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 48
CPU RAM       : 263943121 KB
------------------------------------------------------------------------------
Device count: 1
  hpu:0 is 'GAUDI2', cap: 1.15.1.b3dea3b61 (sramBaseAddress=1153202979533225984, dramBaseAddress=1153203082662772736, sramSize=50331648, dramSize=102106132480, tpcEnabledMask=16777215, dramEnabled=1, fd=21, device_id=0, device_type=4)
PT and Habana Environment variables
  HABANALABS_HLTHUNK_TESTS_BIN_PATH="/opt/habanalabs/src/hl-thunk/tests/arc"
  HABANA_LOGS="/var/log/habana_logs/"
  HABANA_PLUGINS_LIB_PATH="/usr/lib/habanatools/habana_plugins"
  HABANA_PROFILE="profile_api_light"
  HABANA_SCAL_BIN_PATH="/opt/habanalabs/engines_fw"
  PT_ENABLE_INT64_SUPPORT="1"
  PT_HPU_EAGER_4_STAGE_PIPELINE_ENABLE="TRUE"
  PT_HPU_ENABLE_EAGER_CACHE="TRUE"
  PT_HPU_LAZY_MODE="0"

Container¶

sudo dnf install habanalabs-container-runtime podman
make hpu
podman run -ti --privileged -v ./data:/opt/app-root/src:z localhost/instructlab:hpu
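
Inside the container, a quick check confirms that the HPU device files were passed through. This sketch looks for /dev/hl* nodes only; has_hpu_devices is an illustrative name, and the optional directory argument exists only to make the function easy to try out.

```shell
# Hypothetical helper: succeed if any hl* node exists in the directory
# (defaults to /dev).
has_hpu_devices() {
    dir="${1:-/dev}"
    for node in "$dir"/hl*; do
        [ -e "$node" ] && return 0
    done
    return 1
}

# Usage:
#   has_hpu_devices || echo "no /dev/hl* devices visible" >&2
```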

Known issues and limitations¶

Training is limited to a single device. DistributedDataParallel is not supported yet.
On systems with lots of CPU cores, training sometimes crashes with a segfault right after “loading the base model”. The backtrace suggests a race condition in libgomp or oneMKL. Use the environment variable OMP_NUM_THREADS to reduce the number of OpenMP threads, e.g. OMP_NUM_THREADS=1.
The habana-container-hook can cause podman build to fail.
Training parameters are not optimized and verified for best results.
llama-cpp has no hardware acceleration backend for HPUs. Inference (ilab data generate and ilab model chat) is slow and CPU bound.
The container requires --privileged. A non-privileged container is missing /dev/hl* and other device files for HPUs.
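
For the segfault workaround listed above, the thread count can be derived from the machine instead of hard-coded. The sketch below caps OMP_NUM_THREADS at a chosen maximum without exceeding the number of cores. The cap of 16 mirrors the training environment in this guide and is not a tuned value; capped_threads is an illustrative name.

```shell
# Hypothetical helper: print the smaller of the cap (default 16) and the
# core count (default: output of nproc).
capped_threads() {
    max="${1:-16}"
    cores="${2:-$(nproc)}"
    if [ "$cores" -lt "$max" ]; then
        echo "$cores"
    else
        echo "$max"
    fi
}

export OMP_NUM_THREADS="$(capped_threads 16)"
```

If crashes persist, lower the cap, down to 1 in the worst case.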