What is XNACK on AMD GPUs, and How to Enable the Feature
Or: how to avoid 4000% performance degradation when using HIP Managed Memory
Introduction
On AMD GPUs, the feature XNACK is essential for running HIP code that uses Managed Memory, or SYCL code that uses Unified Shared Memory (USM). Performance can degrade by as much as 4000% without XNACK.
Exactly what XNACK does, and how it can be enabled, is poorly documented in all but a few places. I believe this article is the only comprehensive guide on the entire Web.
Background
Managed Memory
Traditionally in GPU programming, there’s a split between “host memory” accessible by the CPU, and “device memory” accessible by the GPU. To move data from host to device memory or vice versa, programmers must copy the data manually via a memcpy()-like library function.
However, this is a tedious task. Thus, GPU runtimes started to support the use of Managed Memory or Unified Memory. Under this memory model, CPU and GPU memory accesses can be freely mixed together, and both devices share the same address space. The underlying runtime is responsible for moving data implicitly, transparent to the programmer.
In AMD HIP, Managed Memory is allocated via hipMallocManaged(). In SYCL, Unified Shared Memory is allocated via sycl::malloc_shared(), which is usually a wrapper over HIP’s Managed Memory on AMD GPUs.
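To illustrate the model, here is a minimal HIP sketch where the same pointer is touched by both the host and the device without any explicit copies (the kernel and buffer size are arbitrary examples, not from any particular codebase):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(float *data, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

int main() {
  const size_t n = 1 << 20;
  float *data = nullptr;
  // One allocation, visible to both CPU and GPU.
  hipMallocManaged(reinterpret_cast<void **>(&data), n * sizeof(float));
  for (size_t i = 0; i < n; ++i) data[i] = 1.0f;  // host write
  hipLaunchKernelGGL(scale, dim3((n + 255) / 256), dim3(256), 0, 0, data, n);
  hipDeviceSynchronize();
  printf("data[0] = %f\n", data[0]);  // host read, no explicit memcpy
  hipFree(data);
}
```

With XNACK enabled, the runtime migrates pages between host and device on demand; without it, every GPU access in scale() goes over the bus.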
XNACK
XNACK (a combination of the letter X and non-acknowledgement, NACK) is a feature that allows the GPU to retry memory accesses after a page fault, when the accessed data does not exist in VRAM. This is needed for on-demand page migration between the host and the GPU, allowing memory copying to occur automatically (on APUs with integrated graphics, no actual copying takes place).
Without XNACK, HIP’s Managed Memory operates in a degraded mode: memory will not be automatically migrated based on access patterns, and Managed Memory behaves like pinned Host Memory, where every access from the GPU is a transfer over PCIe. If the code assumes Managed Memory always works and does not make use of data prefetching, this will be extremely slow. I’ve seen performance degradation of as much as 4000% without XNACK.
Thus, if the code depends on Managed Memory, XNACK should be enabled.
According to an AMD tutorial (beware that HSA_XNACK=1 is not enough to enable XNACK; please keep reading the rest of this article):
On MI200 GPUs there is an option to automatically migrate pages of memory between host and device. This is important for managed memory, where the locality of the data is important for performance. Depending on the system, page migration may be disabled by default in which case managed memory will act like pinned host memory and suffer degraded performance.
Enabling page migration allows the GPU (or host) to retry after a page fault (normally a memory access error), and instead retrieve the missing page. On MI200 platforms we can enable page migration by setting the environment variable HSA_XNACK=1. While this environment variable is required at kernel runtime to enable page migration, it is also helpful to enable this environment variable at compile time, which can change the performance of any compiled kernels.
According to the documentation of Oak Ridge National Laboratory’s supercomputer (again, beware that HSA_XNACK=1 is not enough to enable XNACK; please keep reading the rest of this article):
XNACK (pronounced X-knack) refers to the AMD GPU’s ability to retry memory accesses that fail due to a page fault. The XNACK mode of an MI250X can be changed by setting the environment variable HSA_XNACK before starting a process that uses the GPU. Valid values are 0 (disabled) and 1 (enabled), and all processes connected to a GPU must use the same XNACK setting. The default MI250X on Crusher is HSA_XNACK=0.
If HSA_XNACK=0, page faults in GPU kernels are not handled and will terminate the kernel. Therefore all memory locations accessed by the GPU must either be resident in the GPU HBM or mapped by the HIP runtime. Memory regions may be migrated between the host DDR4 and GPU HBM using explicit HIP library functions such as hipMemAdvise and hipPrefetchAsync, but memory will not be automatically migrated based on access patterns alone.
If HSA_XNACK=1, page faults in GPU kernels will trigger a page table lookup. If the memory location can be made accessible to the GPU, either by being migrated to GPU HBM or being mapped for remote access, the appropriate action will occur and the access will be replayed. Page migration will happen between CPU DDR4 and GPU HBM according to page touch. The exceptions are if the programmer uses a HIP library call such as hipPrefetchAsync to request migration, or if a preferred location is set via hipMemAdvise, or if GPU HBM becomes full and the page must forcibly be evicted back to CPU DDR4 to make room for other data.
Supported Hardware
Not all GPUs are supported. Most GFX9 GPUs from the GCN series support XNACK, but only APU platforms enable it by default. On dedicated graphics cards, it’s disabled by the Linux amdgpu kernel driver, possibly due to stability concerns, as it’s still an experimental feature.
For users of GFX10/GFX11 GPUs from the RDNA series, unfortunately, XNACK is no longer supported. Only compute cards from the CDNA series have XNACK support, such as the Instinct MI100 and MI200, which also belong to the GFX9 family.
Thus, use of Unified Shared Memory, which is the recommended practice and heavily used in SYCL programming, suffers a serious hit. By not supporting XNACK on consumer-grade desktop GPUs, AMD has essentially made a core feature in SYCL almost useless, forcing it to be an exclusive feature for datacenter users running CDNA cards with a price tag of $5000. This is unfortunate, but it is something that developers who want to write cross-platform GPU code need to live with (for the highest performance, you may want to use manual data movement anyway, so it’s not a total loss; more on that later).
According to AMD:
Page migration is not always available – e.g. on the AMD RDNA™ 2 GPUs or in operating systems that do not support heterogeneous memory management (HMM).
The ISA table from ROCm source code currently reads:
ISAREG_ENTRY_GEN("gfx810:xnack-", 8, 1, 0, unsupported, disabled, 64)
ISAREG_ENTRY_GEN("gfx810:xnack+", 8, 1, 0, unsupported, enabled, 64)
ISAREG_ENTRY_GEN("gfx900:xnack-", 9, 0, 0, unsupported, disabled, 64)
ISAREG_ENTRY_GEN("gfx900:xnack+", 9, 0, 0, unsupported, enabled, 64)
ISAREG_ENTRY_GEN("gfx902:xnack-", 9, 0, 2, unsupported, disabled, 64)
ISAREG_ENTRY_GEN("gfx902:xnack+", 9, 0, 2, unsupported, enabled, 64)
ISAREG_ENTRY_GEN("gfx904:xnack-", 9, 0, 4, unsupported, disabled, 64)
ISAREG_ENTRY_GEN("gfx904:xnack+", 9, 0, 4, unsupported, enabled, 64)
ISAREG_ENTRY_GEN("gfx906:xnack-", 9, 0, 6, any, disabled, 64)
ISAREG_ENTRY_GEN("gfx906:xnack+", 9, 0, 6, any, enabled, 64)
ISAREG_ENTRY_GEN("gfx908:xnack-", 9, 0, 8, any, disabled, 64)
ISAREG_ENTRY_GEN("gfx908:xnack+", 9, 0, 8, any, enabled, 64)
ISAREG_ENTRY_GEN("gfx909:xnack-", 9, 0, 9, unsupported, disabled, 64)
ISAREG_ENTRY_GEN("gfx909:xnack+", 9, 0, 9, unsupported, enabled, 64)
ISAREG_ENTRY_GEN("gfx90a:xnack-", 9, 0, 10, any, disabled, 64)
ISAREG_ENTRY_GEN("gfx90a:xnack+", 9, 0, 10, any, enabled, 64)
ISAREG_ENTRY_GEN("gfx90c:xnack-", 9, 0, 12, unsupported, disabled, 64)
ISAREG_ENTRY_GEN("gfx90c:xnack+", 9, 0, 12, unsupported, enabled, 64)
ISAREG_ENTRY_GEN("gfx1010:xnack-", 10, 1, 0, unsupported, disabled, 32)
ISAREG_ENTRY_GEN("gfx1010:xnack+", 10, 1, 0, unsupported, enabled, 32)
ISAREG_ENTRY_GEN("gfx1011:xnack-", 10, 1, 1, unsupported, disabled, 32)
ISAREG_ENTRY_GEN("gfx1011:xnack+", 10, 1, 1, unsupported, enabled, 32)
ISAREG_ENTRY_GEN("gfx1012:xnack-", 10, 1, 2, unsupported, disabled, 32)
ISAREG_ENTRY_GEN("gfx1012:xnack+", 10, 1, 2, unsupported, enabled, 32)
ISAREG_ENTRY_GEN("gfx1013:xnack-", 10, 1, 3, unsupported, disabled, 32)
ISAREG_ENTRY_GEN("gfx1013:xnack+", 10, 1, 3, unsupported, enabled, 32)
ISAREG_ENTRY_GEN("gfx1030", 10, 3, 0, unsupported, unsupported, 32)
ISAREG_ENTRY_GEN("gfx1031", 10, 3, 1, unsupported, unsupported, 32)
ISAREG_ENTRY_GEN("gfx1032", 10, 3, 2, unsupported, unsupported, 32)
ISAREG_ENTRY_GEN("gfx1033", 10, 3, 3, unsupported, unsupported, 32)
ISAREG_ENTRY_GEN("gfx1034", 10, 3, 4, unsupported, unsupported, 32)
ISAREG_ENTRY_GEN("gfx1035", 10, 3, 5, unsupported, unsupported, 32)
ISAREG_ENTRY_GEN("gfx1036", 10, 3, 6, unsupported, unsupported, 32)
ISAREG_ENTRY_GEN("gfx1100", 11, 0, 0, unsupported, unsupported, 32)
ISAREG_ENTRY_GEN("gfx1101", 11, 0, 1, unsupported, unsupported, 32)
ISAREG_ENTRY_GEN("gfx1102", 11, 0, 2, unsupported, unsupported, 32)
ISAREG_ENTRY_GEN("gfx1103", 11, 0, 3, unsupported, unsupported, 32)
Furthermore, according to the comments in the Linux kernel, even though the XNACK feature is present on some GFX10 and newer GPUs, the implementation is incomplete and can even lead to deadlocks, so XNACK is disabled.
bool kfd_process_xnack_mode(struct kfd_process *p, bool supported)
{
	int i;

	/* On most GFXv9 GPUs, the retry mode in the SQ must match the
	 * boot time retry setting. Mixing processes with different
	 * XNACK/retry settings can hang the GPU.
	 *
	 * Different GPUs can have different noretry settings depending
	 * on HW bugs or limitations. We need to find at least one
	 * XNACK mode for this process that's compatible with all GPUs.
	 * Fortunately GPUs with retry enabled (noretry=0) can run code
	 * built for XNACK-off. On GFXv9 it may perform slower.
	 *
	 * Therefore applications built for XNACK-off can always be
	 * supported and will be our fallback if any GPU does not
	 * support retry.
	 */
	for (i = 0; i < p->n_pdds; i++) {
		struct kfd_dev *dev = p->pdds[i]->dev;

		/* Only consider GFXv9 and higher GPUs. Older GPUs don't
		 * support the SVM APIs and don't need to be considered
		 * for the XNACK mode selection.
		 */
		if (!KFD_IS_SOC15(dev))
			continue;
		/* Aldebaran can always support XNACK because it can support
		 * per-process XNACK mode selection. But let the dev->noretry
		 * setting still influence the default XNACK mode.
		 */
		if (supported && KFD_SUPPORT_XNACK_PER_PROCESS(dev))
			continue;
		/* GFXv10 and later GPUs do not support shader preemption
		 * during page faults. This can lead to poor QoS for queue
		 * management and memory-manager-related preemptions or
		 * even deadlocks.
		 */
		if (KFD_GC_VERSION(dev) >= IP_VERSION(10, 1, 1))
			return false;
		if (dev->noretry)
			return false;
	}
	return true;
}
Stability
Furthermore, even on compute cards like the MI100 or the MI250X, support was non-existent until recent years, and even now it’s still experimental. According to a 2020 research paper:
UM only works on recent AMD GPUs, including Vega10 and MI100. Older GPUs such as Fiji and Polaris are not supported. There are two flavors of the support: XNACK-enabled and XNACK-disabled. In the XNACK-enabled mode […] The XNACK-enabled mode only has experimental support. Not all the math libraries included in ROCm support the XNACK-enabled mode on current hardware. A mode can be chosen at boot-time, and the default is XNACK-disabled. Due to the uncertainties of the XNACK-enabled mode, our evaluation is limited to the XNACK-disabled mode. We would like to investigate the XNACK-enabled mode in our future work
As of July 2023, there’s an open bug in ROCm’s GitHub repository, created by AMD developers, indicating that the feature is still somewhat experimental:
Occasionally, HSA_XNACK forces some page faults, and memory may not initialize as expected in specific cases. This issue is under investigation and will be fixed in a future release.
How do I enable XNACK?
To enable XNACK, the following conditions should be satisfied:
- It must be supported by the hardware.
- It must be supported by the Linux kernel (by using an up-to-date version with HMM support).
- It must be enabled in the Linux kernel.
- It must be enabled in the runtime via the environment variable HSA_XNACK before running a HIP program.
- (Optionally) The code should preferably be compiled for an xnack+ target.
These are explained below.
Checking XNACK Status
You can check whether XNACK is enabled by looking at your platform name in clinfo or rocminfo. If the platform name has xnack-, it means XNACK is disabled. If it has xnack+, it means XNACK is enabled. If it has no xnack suffix at all, XNACK is probably unsupported.
On GFX8 APUs with integrated graphics, it’s usually enabled by default. On dedicated graphics cards, it’s usually disabled.
For example, on my AMD Radeon Instinct MI50 (Radeon Pro VII), the platform name is (with XNACK enabled):
$ rocminfo | grep Name:
Name: AMD Ryzen 5 3500X 6-Core Processor
Marketing Name: AMD Ryzen 5 3500X 6-Core Processor
Vendor Name: CPU
Name: gfx906
Marketing Name: AMD Radeon Pro VII
Vendor Name: AMD
Name: amdgcn-amd-amdhsa--gfx906:sramecc+:xnack+
To check whether XNACK is really enabled at the application level (taking the effect of both the kernel and HSA_XNACK into account), AMD has a small demo program at /opt/rocm/hip/bin/hipInfo. Run it with AMD_LOG_LEVEL=4 HSA_XNACK=1 /opt/rocm/hip/bin/hipInfo; it should report the following at the beginning of the output.
Initializing HSA stack.
Loading COMGR library.
Numa selects cpu
HMM support: 1, xnack: 1, direct host access: 0
On my particular distro (Gentoo), hipInfo is not built by default, but it can be found in /usr/share/hip/samples/1_Utils/hipInfo. Change the Makefile’s HIPCC=$(HIP_PATH)/bin/hipcc to HIPCC=hipcc and run make to build.
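On such systems, the whole procedure boils down to something like the following sketch (the sed pattern assumes the Makefile’s HIPCC line looks exactly as described above; adjust paths for your distro):

```shell
cd /usr/share/hip/samples/1_Utils/hipInfo
# Point the Makefile at the system-wide hipcc instead of $(HIP_PATH)/bin/hipcc.
sed -i 's|^HIPCC=.*|HIPCC=hipcc|' Makefile
make
# Verbose logging shows the HMM/xnack line near the top of the output.
AMD_LOG_LEVEL=4 HSA_XNACK=1 ./hipInfo
```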
Enabling XNACK
To enable XNACK in the Linux kernel, the kernel must be up-to-date with HMM support. Then, it can be enabled by two methods: passing the boot-time kernel argument amdgpu.noretry=0 in the bootloader configuration, or, alternatively, passing the amdgpu kernel module parameter noretry=0 by declaring options amdgpu noretry=0 in /etc/modprobe.d.
The Linux kernel documentation says:
noretry (int): Disable XNACK retry in the SQ by default on GFXv9 hardware. On ASICs that do not support per-process XNACK this also disables retry page faults. (0 = retry enabled, 1 = retry disabled, -1 auto (default))
Then, before running a HIP program, the environment variable HSA_XNACK=1 must be set. It may be a good idea to put export HSA_XNACK=1 in your shell profile.
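Putting the kernel and runtime pieces together, a setup might look like this (the modprobe.d file name is arbitrary, and the /sys path is a standard way to read back a module parameter; a reboot or module reload is needed for step 1 to take effect):

```shell
# 1. Kernel: allow the amdgpu driver to retry on page faults.
echo "options amdgpu noretry=0" | sudo tee /etc/modprobe.d/amdgpu-xnack.conf
# (alternatively, add amdgpu.noretry=0 to the kernel command line), then reboot.

# 2. Verify the current module parameter (0 = retry enabled).
cat /sys/module/amdgpu/parameters/noretry

# 3. Runtime: enable XNACK for HIP programs in this shell.
export HSA_XNACK=1
```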
Compiler Options
To further increase performance, it’s possible to specifically target AMD GPUs with the compile-time assumption that XNACK is always enabled. To do this, add the suffix :xnack+ to your target name. For example, instead of targeting gfx906, you can target gfx906:xnack+.
It’s worth noting that object code compiled with xnack+ does not work on GPUs without XNACK. Thus, if you decide to do this, you may want to compile another version without xnack as a fallback.
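With hipcc, building both variants might look like this sketch (the source and output file names are hypothetical; --offload-arch accepts the same target strings shown in the ISA table above):

```shell
# Build assuming XNACK is available and enabled:
hipcc --offload-arch=gfx906:xnack+ -o saxpy_xnack saxpy.cpp
# Fallback build for running without XNACK:
hipcc --offload-arch=gfx906:xnack- -o saxpy_noxnack saxpy.cpp
```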
Personally, I didn’t notice a performance difference in my own test code.
According to LLVM’s documentation:
If specified, generate code that can only be loaded and executed in a process that has a matching setting for XNACK replay. If not specified for code object V2 to V3, generate code that can be loaded and executed in a process with XNACK replay enabled. If not specified for code object V4 or above, generate code that can be loaded and executed in a process with either setting of XNACK replay.
XNACK replay can be used for demand paging and page migration. If enabled in the device, then if a page fault occurs the code may execute incorrectly unless generated with XNACK replay enabled, or generated for code object V4 or above without specifying XNACK replay. Executing code that was generated with XNACK replay enabled, or generated for code object V4 or above without specifying XNACK replay, on a device that does not have XNACK replay enabled will execute correctly but may be less performant than code generated for XNACK replay disabled.
According to GCC’s documentation:
-mxnack=on
-mxnack=off
-mxnack=any
Compile binaries suitable for devices with the XNACK feature enabled, disabled, or either mode. Some devices always require XNACK and some allow the user to configure XNACK. The compiled code must match the device mode. At present this option is a placeholder for support that is not yet implemented.
Managed Memory without XNACK
As explained in the Background section, without XNACK, memory will not be automatically migrated based on access patterns; Managed Memory behaves like pinned Host Memory, every GPU access is a transfer over PCIe, and performance can degrade by as much as 4000%.
But it doesn’t mean that all hope is lost. If manual memory management is acceptable, it’s still possible to use Managed Memory with manual data movement via prefetching and hints.
In addition, since automatic page migration is not perfect, prefetching may be necessary to achieve the highest performance regardless of whether XNACK is enabled.
HIP
HIP provides two memory management hints: hipMemPrefetchAsync() and hipMemAdvise(). hipMemPrefetchAsync() accepts a pointer and a size in bytes; it can be used basically as a manual memory copy function. After prefetching was added, I found that the performance degradation of Managed Memory without XNACK became just 200%, not the previous 1000% to 4000% I’ve seen.
For further tuning, hipMemAdvise() accepts the following flags:
- hipMemAdviseSetReadMostly: Data will mostly be read and only occasionally be written to
- hipMemAdviseSetPreferredLocation: Set the preferred location for the data as the specified device
- hipMemAdviseSetAccessedBy: Data will be accessed by the specified device, so prevent page faults as much as possible
- hipMemAdviseSetCoarseGrain: The default memory model is fine-grain, which allows coherent operations between host and device while executing kernels. Coarse-grain can be used for data that only needs to be coherent at dispatch boundaries, for better performance
In my case, I found hipMemAdviseSetReadMostly and hipMemAdviseSetPreferredLocation had no effect, but hipMemAdviseSetCoarseGrain was able to close the final performance gap between Device Memory and Managed Memory, at the expense of data coherency during simultaneous execution between host and device; this may or may not be acceptable for your application.
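A minimal sketch of this staging workflow, assuming device 0 and an already-allocated managed buffer (error handling omitted; the function name is an arbitrary example):

```cpp
#include <hip/hip_runtime.h>

// Manually stage a managed buffer onto device 0 before launching kernels.
void stage_for_device(float *managed, size_t bytes, hipStream_t stream) {
  // Hint: host and device only need coherence at dispatch boundaries.
  hipMemAdvise(managed, bytes, hipMemAdviseSetCoarseGrain, 0);
  // Explicitly migrate the pages to device 0 ahead of kernel launch,
  // playing the role of a manual host-to-device copy.
  hipMemPrefetchAsync(managed, bytes, 0, stream);
}
```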
SYCL
SYCL provides sycl::queue::prefetch() and sycl::queue::mem_advise() with the same functionality. The flags accepted by mem_advise() are not standardized; they’re specific to a particular runtime.
On AMD, SYCL is implemented on top of HIP. Thus, SYCL’s Unified Shared Memory is implemented as hipMallocManaged(). For example, USM is implemented on AMD GPUs as follows in OpenSYCL:
void *hip_allocator::allocate_usm(size_t bytes)
{
  hip_device_manager::get().activate_device(_dev);

  void *ptr;
  auto err = hipMallocManaged(&ptr, bytes);
  if (err != hipSuccess) {
    register_error(__hipsycl_here(),
                   error_info{"hip_allocator: hipMallocManaged() failed",
                              error_code{"HIP", err},
                              error_type::memory_allocation_error});
    return nullptr;
  }
  return ptr;
}
Thus, SYCL’s demand paging depends entirely on HIP. sycl::queue::prefetch() is a wrapper around hipMemPrefetchAsync(), and sycl::queue::mem_advise() is a wrapper around hipMemAdvise(), and thus accepts the same flags (but you should perform a runtime check instead of hardcoding the SYCL code for AMD).
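A sketch using the standard SYCL 2020 interface (the advice value 0 is a neutral placeholder, since the real values are backend-specific; portable code should query the backend at runtime before passing HIP flag values):

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  const size_t n = 1 << 20;
  float *data = sycl::malloc_shared<float>(n, q);

  // Migrate the pages toward the device ahead of first use.
  q.prefetch(data, n * sizeof(float));

  // mem_advise takes a backend-specific integer; on a HIP backend the
  // hipMemAdvise flag values would be used. 0 is a placeholder here.
  q.mem_advise(data, n * sizeof(float), 0);

  q.parallel_for(sycl::range<1>{n},
                 [=](sycl::id<1> i) { data[i] = 1.0f; }).wait();
  sycl::free(data, q);
}
```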
It’s also worth noting that in OpenSYCL, the standard asynchronous sycl::queue::mem_advise() is not supported yet. Instead, a non-standard extension sycl::mem_advise() is used in its place. You may need a preprocessor macro to be compatible with both Intel DPC++ and OpenSYCL.