What is XNACK on AMD GPUs, and How to Enable the Feature
Or: how to avoid 4000% performance degradation when using HIP Managed Memory
Introduction
On AMD GPUs, the feature XNACK is essential for running HIP code that uses Managed Memory, or SYCL code that uses Unified Shared Memory (USM). Performance can degrade by as much as 4000% without XNACK.
Exactly what XNACK does, and how it can be enabled, is poorly documented in all but a few places. I believe this article is the only comprehensive guide on the entire Web.
Background
Managed Memory
Traditionally in GPU programming, there’s a split between “host memory” accessible by the CPU, and “device memory” accessible by the GPU. To move data from host to device memory or vice versa, programmers must copy the data manually via a memcpy()-like library function.
However, this is a tedious task. Thus, GPU runtimes started to support the use of Managed Memory or Unified Memory. Under this memory model, CPU and GPU memory accesses can be freely mixed together, and both devices share the same address space. The underlying runtime is responsible for moving data implicitly, transparent to the programmer.
In AMD HIP, Managed Memory is allocated via hipMallocManaged(). In SYCL, Unified Shared Memory is allocated via sycl::malloc_shared(), which is usually a wrapper over HIP’s Managed Memory on AMD GPUs.
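To illustrate the model, here is a minimal HIP sketch where the same pointer is touched by both the host and the device without any explicit copies (the kernel and buffer size are arbitrary examples, not from any particular codebase):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(float *data, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

int main() {
  const size_t n = 1 << 20;
  float *data = nullptr;
  // One allocation, visible to both CPU and GPU.
  hipMallocManaged(reinterpret_cast<void **>(&data), n * sizeof(float));
  for (size_t i = 0; i < n; ++i) data[i] = 1.0f;  // host write
  hipLaunchKernelGGL(scale, dim3((n + 255) / 256), dim3(256), 0, 0, data, n);
  hipDeviceSynchronize();
  printf("data[0] = %f\n", data[0]);  // host read, no explicit memcpy
  hipFree(data);
}
```

With XNACK enabled, the runtime migrates pages between host and device on demand; without it, every GPU access in scale() goes over the bus.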
XNACK
XNACK (a combination of the letter X and non-acknowledgement, NACK) is a feature that allows the GPU to retry memory accesses after a page fault, when the accessed data does not exist in VRAM. This is needed for on-demand page migration between the host and the GPU, allowing memory copying to occur automatically (on APUs with integrated graphics, no actual copying takes place).
Without XNACK, HIP’s Managed Memory operates in a degraded mode: memory will not be automatically migrated based on access patterns, and Managed Memory behaves like pinned Host Memory, where every access from the GPU is a transfer over PCIe. If the code assumes Managed Memory always works and does not make use of data prefetching, this will be extremely slow. I’ve seen performance degradation of as much as 4000% without XNACK.
Thus, if the code depends on Managed Memory, XNACK should be enabled.
According to an AMD tutorial (beware that HSA_XNACK=1 is not enough to enable XNACK; please keep reading the rest of this article):
On MI200 GPUs there is an option to automatically migrate pages of memory between host and device. This is important for managed memory, where the locality of the data is important for performance. Depending on the system, page migration may be disabled by default in which case managed memory will act like pinned host memory and suffer degraded performance.
Enabling page migration allows the GPU (or host) to retry after a page fault (normally a memory access error), and instead retrieve the missing page. On MI200 platforms we can enable page migration by setting the environment variable HSA_XNACK=1. While this environment variable is required at kernel runtime to enable page migration, it is also helpful to enable this environment variable at compile time, which can change the performance of any compiled kernels.
According to the documentation of Oak Ridge National Laboratory’s supercomputer (again, beware that HSA_XNACK=1 is not enough to enable XNACK; please keep reading the rest of this article):
XNACK (pronounced X-knack) refers to the AMD GPU’s ability to retry memory accesses that fail due to a page fault. The XNACK mode of an MI250X can be changed by setting the environment variable HSA_XNACK before starting a process that uses the GPU. Valid values are 0 (disabled) and 1 (enabled), and all processes connected to a GPU must use the same XNACK setting. The default MI250X on Crusher is HSA_XNACK=0.
If HSA_XNACK=0, page faults in GPU kernels are not handled and will terminate the kernel. Therefore all memory locations accessed by the GPU must either be resident in the GPU HBM or mapped by the HIP runtime. Memory regions may be migrated between the host DDR4 and GPU HBM using explicit HIP library functions such as hipMemAdvise and hipPrefetchAsync, but memory will not be automatically migrated based on access patterns alone.
If HSA_XNACK=1, page faults in GPU kernels will trigger a page table lookup. If the memory location can be made accessible to the GPU, either by being migrated to GPU HBM or being mapped for remote access, the appropriate action will occur and the access will be replayed. Page migration will happen between CPU DDR4 and GPU HBM according to page touch. The exceptions are if the programmer uses a HIP library call such as hipPrefetchAsync to request migration, or if a preferred location is set via hipMemAdvise, or if GPU HBM becomes full and the page must forcibly be evicted back to CPU DDR4 to make room for other data.
Supported Hardware
Not all GPUs are supported. Most GFX9 GPUs from the GCN series support XNACK, but only APU platforms enable it by default. On dedicated graphics cards, it’s disabled by the Linux amdgpu kernel driver, possibly due to stability concerns, as it’s still an experimental feature.
For users of GFX10/GFX11 GPUs from the RDNA series, unfortunately, XNACK is no longer supported. Only compute cards from the CDNA series have XNACK support, such as the Instinct MI100 and MI200, which also belong to the GFX9 family.
Thus, use of Unified Shared Memory, which is the recommended practice and heavily used in SYCL programming, suffers a serious hit. By not supporting XNACK on consumer-grade desktop GPUs, AMD has essentially made a core feature in SYCL almost useless, forcing it to be an exclusive feature for datacenter users running CDNA cards with a price tag of $5000. This is unfortunate, but it is something that developers who want to write cross-platform GPU code need to live with (for the highest performance, you may want to use manual data movement anyway, so it’s not a total loss; more on that later).
According to AMD:
Page migration is not always available – e.g. on the AMD RDNA™ 2 GPUs or in operating systems that do not support heterogeneous memory management (HMM).
The ISA table from ROCm source code currently reads:
ISAREG_ENTRY_GEN("gfx810:xnack-", 8, 1, 0, unsupported, disabled, 64)
ISAREG_ENTRY_GEN("gfx810:xnack+", 8, 1, 0, unsupported, enabled, 64)
ISAREG_ENTRY_GEN("gfx900:xnack-", 9, 0, 0, unsupported, disabled, 64)
ISAREG_ENTRY_GEN("gfx900:xnack+", 9, 0, 0, unsupported, enabled, 64)
ISAREG_ENTRY_GEN("gfx902:xnack-", 9, 0, 2, unsupported, disabled, 64)
ISAREG_ENTRY_GEN("gfx902:xnack+", 9, 0, 2, unsupported, enabled, 64)
ISAREG_ENTRY_GEN("gfx904:xnack-", 9, 0, 4, unsupported, disabled, 64)
ISAREG_ENTRY_GEN("gfx904:xnack+", 9, 0, 4, unsupported, enabled, 64)
ISAREG_ENTRY_GEN("gfx906:xnack-", 9, 0, 6, any, disabled, 64)
ISAREG_ENTRY_GEN("gfx906:xnack+", 9, 0, 6, any, enabled, 64)
ISAREG_ENTRY_GEN("gfx908:xnack-", 9, 0, 8, any, disabled, 64)
ISAREG_ENTRY_GEN("gfx908:xnack+", 9, 0, 8, any, enabled, 64)
ISAREG_ENTRY_GEN("gfx909:xnack-", 9, 0, 9, unsupported, disabled, 64)
ISAREG_ENTRY_GEN("gfx909:xnack+", 9, 0, 9, unsupported, enabled, 64)
ISAREG_ENTRY_GEN("gfx90a:xnack-", 9, 0, 10, any, disabled, 64)
ISAREG_ENTRY_GEN("gfx90a:xnack+", 9, 0, 10, any, enabled, 64)
ISAREG_ENTRY_GEN("gfx90c:xnack-", 9, 0, 12, unsupported, disabled, 64)
ISAREG_ENTRY_GEN("gfx90c:xnack+", 9, 0, 12, unsupported, enabled, 64)
ISAREG_ENTRY_GEN("gfx1010:xnack-", 10, 1, 0, unsupported, disabled, 32)
ISAREG_ENTRY_GEN("gfx1010:xnack+", 10, 1, 0, unsupported, enabled, 32)
ISAREG_ENTRY_GEN("gfx1011:xnack-", 10, 1, 1, unsupported, disabled, 32)
ISAREG_ENTRY_GEN("gfx1011:xnack+", 10, 1, 1, unsupported, enabled, 32)
ISAREG_ENTRY_GEN("gfx1012:xnack-", 10, 1, 2, unsupported, disabled, 32)
ISAREG_ENTRY_GEN("gfx1012:xnack+", 10, 1, 2, unsupported, enabled, 32)
ISAREG_ENTRY_GEN("gfx1013:xnack-", 10, 1, 3, unsupported, disabled, 32)
ISAREG_ENTRY_GEN("gfx1013:xnack+", 10, 1, 3, unsupported, enabled, 32)
ISAREG_ENTRY_GEN("gfx1030", 10, 3, 0, unsupported, unsupported, 32)
ISAREG_ENTRY_GEN("gfx1031", 10, 3, 1, unsupported, unsupported, 32)
ISAREG_ENTRY_GEN("gfx1032", 10, 3, 2, unsupported, unsupported, 32)
ISAREG_ENTRY_GEN("gfx1033", 10, 3, 3, unsupported, unsupported, 32)
ISAREG_ENTRY_GEN("gfx1034", 10, 3, 4, unsupported, unsupported, 32)
ISAREG_ENTRY_GEN("gfx1035", 10, 3, 5, unsupported, unsupported, 32)
ISAREG_ENTRY_GEN("gfx1036", 10, 3, 6, unsupported, unsupported, 32)
ISAREG_ENTRY_GEN("gfx1100", 11, 0, 0, unsupported, unsupported, 32)
ISAREG_ENTRY_GEN("gfx1101", 11, 0, 1, unsupported, unsupported, 32)
ISAREG_ENTRY_GEN("gfx1102", 11, 0, 2, unsupported, unsupported, 32)
ISAREG_ENTRY_GEN("gfx1103", 11, 0, 3, unsupported, unsupported, 32)
Furthermore, according to the comments in the Linux kernel, even though the XNACK feature is present on some GFX10 and newer GPUs, the implementation is incomplete and can even lead to deadlocks, so XNACK is disabled.
bool kfd_process_xnack_mode(struct kfd_process *p, bool supported)
{
	int i;

	/* On most GFXv9 GPUs, the retry mode in the SQ must match the
	 * boot time retry setting. Mixing processes with different
	 * XNACK/retry settings can hang the GPU.
	 *
	 * Different GPUs can have different noretry settings depending
	 * on HW bugs or limitations. We need to find at least one
	 * XNACK mode for this process that's compatible with all GPUs.
	 * Fortunately GPUs with retry enabled (noretry=0) can run code
	 * built for XNACK-off. On GFXv9 it may perform slower.
	 *
	 * Therefore applications built for XNACK-off can always be
	 * supported and will be our fallback if any GPU does not
	 * support retry.
	 */
	for (i = 0; i < p->n_pdds; i++) {
		struct kfd_dev *dev = p->pdds[i]->dev;

		/* Only consider GFXv9 and higher GPUs. Older GPUs don't
		 * support the SVM APIs and don't need to be considered
		 * for the XNACK mode selection.
		 */
		if (!KFD_IS_SOC15(dev))
			continue;
		/* Aldebaran can always support XNACK because it can support
		 * per-process XNACK mode selection. But let the dev->noretry
		 * setting still influence the default XNACK mode.
		 */
		if (supported && KFD_SUPPORT_XNACK_PER_PROCESS(dev))
			continue;
		/* GFXv10 and later GPUs do not support shader preemption
		 * during page faults. This can lead to poor QoS for queue
		 * management and memory-manager-related preemptions or
		 * even deadlocks.
		 */
		if (KFD_GC_VERSION(dev) >= IP_VERSION(10, 1, 1))
			return false;
		if (dev->noretry)
			return false;
	}
	return true;
}
Stability
Furthermore, even on compute cards like the MI100 or the MI250X, support was non-existent until recent years, and even now it’s still experimental. According to a 2020 research paper:
UM only works on recent AMD GPUs, including Vega10 and MI100. Older GPUs such as Fiji and Polaris are not supported. There are two flavors of the support: XNACK-enabled and XNACK-disabled. In the XNACK-enabled mode […] The XNACK-enabled mode only has experimental support. Not all the math libraries included in ROCm support the XNACK-enabled mode on current hardware. A mode can be chosen at boot-time, and the default is XNACK-disabled. Due to the uncertainties of the XNACK-enabled mode, our evaluation is limited to the XNACK-disabled mode. We would like to investigate the XNACK-enabled mode in our future work
As of July 2023, there’s an open bug in ROCm’s GitHub repository, created by AMD developers, indicating that the feature is still somewhat experimental:
Occasionally, HSA_XNACK forces some page faults, and memory may not initialize as expected in specific cases. This issue is under investigation and will be fixed in a future release.
How do I enable XNACK?
To enable XNACK, the following conditions should be satisfied:
- It must be supported by the hardware.
- It must be supported by the Linux kernel (by using an up-to-date version with HMM support).
- It must be enabled in the Linux kernel.
- It must be enabled in the runtime via the environment variable HSA_XNACK before running a HIP program.
- (Optionally) The code should preferably be compiled for an xnack+ target.
These are explained below.
Checking XNACK Status
You can check whether XNACK is enabled by looking at your platform name in clinfo or rocminfo. If the platform name has xnack-, it means XNACK is disabled. If it has xnack+, it means XNACK is enabled. If it has no xnack suffix at all, XNACK is probably unsupported.
On GFX8 APUs with integrated graphics, it’s usually enabled by default. On dedicated graphics cards, it’s usually disabled.
For example, on my AMD Radeon Instinct MI50 (Radeon Pro VII), the platform name is (with XNACK enabled):
$ rocminfo | grep Name:
Name: AMD Ryzen 5 3500X 6-Core Processor
Marketing Name: AMD Ryzen 5 3500X 6-Core Processor
Vendor Name: CPU
Name: gfx906
Marketing Name: AMD Radeon Pro VII
Vendor Name: AMD
Name: amdgcn-amd-amdhsa--gfx906:sramecc+:xnack+
To check whether XNACK is really enabled at the application level (taking the effect of both the kernel and HSA_XNACK into account), AMD has a small demo program at /opt/rocm/hip/bin/hipInfo. Run it with AMD_LOG_LEVEL=4 HSA_XNACK=1 /opt/rocm/hip/bin/hipInfo; it should report the following at the beginning of the output.
Initializing HSA stack.
Loading COMGR library.
Numa selects cpu
HMM support: 1, xnack: 1, direct host access: 0
On my particular distro (Gentoo), hipInfo is not built by default, but it can be found in /usr/share/hip/samples/1_Utils/hipInfo. Change the Makefile’s HIPCC=$(HIP_PATH)/bin/hipcc to HIPCC=hipcc and run make to build.
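On such systems, the whole procedure boils down to something like the following sketch (the sed pattern assumes the Makefile’s HIPCC line looks exactly as described above; adjust paths for your distro):

```shell
cd /usr/share/hip/samples/1_Utils/hipInfo
# Point the Makefile at the system-wide hipcc instead of $(HIP_PATH)/bin/hipcc.
sed -i 's|^HIPCC=.*|HIPCC=hipcc|' Makefile
make
# Verbose logging shows the HMM/xnack line near the top of the output.
AMD_LOG_LEVEL=4 HSA_XNACK=1 ./hipInfo
```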
Enabling XNACK
To enable XNACK in the Linux kernel, the kernel must be up-to-date with HMM support. Then, it can be enabled by two methods: passing the boot-time kernel argument amdgpu.noretry=0 in the bootloader configuration, or, alternatively, passing the amdgpu kernel module parameter noretry=0 by declaring options amdgpu noretry=0 in /etc/modprobe.d.
The Linux kernel documentation says:
noretry (int): Disable XNACK retry in the SQ by default on GFXv9 hardware. On ASICs that do not support per-process XNACK this also disables retry page faults. (0 = retry enabled, 1 = retry disabled, -1 auto (default))
Then, before running a HIP program, the environment variable HSA_XNACK=1 must be set. It may be a good idea to put export HSA_XNACK=1 in your shell profile.
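Putting the kernel and runtime pieces together, a setup might look like this (the modprobe.d file name is arbitrary, and the /sys path is a standard way to read back a module parameter; a reboot or module reload is needed for step 1 to take effect):

```shell
# 1. Kernel: allow the amdgpu driver to retry on page faults.
echo "options amdgpu noretry=0" | sudo tee /etc/modprobe.d/amdgpu-xnack.conf
# (alternatively, add amdgpu.noretry=0 to the kernel command line), then reboot.

# 2. Verify the current module parameter (0 = retry enabled).
cat /sys/module/amdgpu/parameters/noretry

# 3. Runtime: enable XNACK for HIP programs in this shell.
export HSA_XNACK=1
```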
Compiler Options
To further increase performance, it’s possible to specifically target AMD GPUs with the compile-time assumption that XNACK is always enabled. To do this, add the suffix :xnack+ to your target name. For example, instead of targeting gfx906, you can target gfx906:xnack+.
It’s worth noting that object code compiled with xnack+ does not work on GPUs without XNACK. Thus, if you decide to do this, you may want to compile another version without xnack as a fallback.
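With hipcc, building both variants might look like this sketch (the source and output file names are hypothetical; --offload-arch accepts the same target strings shown in the ISA table above):

```shell
# Build assuming XNACK is available and enabled:
hipcc --offload-arch=gfx906:xnack+ -o saxpy_xnack saxpy.cpp
# Fallback build for running without XNACK:
hipcc --offload-arch=gfx906:xnack- -o saxpy_noxnack saxpy.cpp
```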
Personally, I didn’t notice a performance difference in my own test code.
According to LLVM’s documentation:
If specified, generate code that can only be loaded and executed in a process that has a matching setting for XNACK replay. If not specified for code object V2 to V3, generate code that can be loaded and executed in a process with XNACK replay enabled. If not specified for code object V4 or above, generate code that can be loaded and executed in a process with either setting of XNACK replay.
XNACK replay can be used for demand paging and page migration. If enabled in the device, then if a page fault occurs the code may execute incorrectly unless generated with XNACK replay enabled, or generated for code object V4 or above without specifying XNACK replay. Executing code that was generated with XNACK replay enabled, or generated for code object V4 or above without specifying XNACK replay, on a device that does not have XNACK replay enabled will execute correctly but may be less performant than code generated for XNACK replay disabled.
According to GCC’s documentation:
-mxnack=on
-mxnack=off
-mxnack=any
Compile binaries suitable for devices with the XNACK feature enabled, disabled, or either mode. Some devices always require XNACK and some allow the user to configure XNACK. The compiled code must match the device mode. At present this option is a placeholder for support that is not yet implemented.
Managed Memory without XNACK
As explained in the Background section, without XNACK, memory will not be automatically migrated based on access patterns; Managed Memory behaves like pinned Host Memory, every GPU access is a transfer over PCIe, and performance can degrade by as much as 4000%.
But it doesn’t mean that all hope is lost. If manual memory management is acceptable, it’s still possible to use Managed Memory with manual data movement via prefetching and hints.
In addition, since automatic page migration is not perfect, prefetching may be necessary to achieve the highest performance regardless of whether XNACK is enabled.
HIP
HIP provides two memory management hints: hipMemPrefetchAsync() and hipMemAdvise(). hipMemPrefetchAsync() accepts a pointer and a size in bytes; it can be used basically as a manual memory copy function. After prefetching was added, I found that the performance degradation of Managed Memory without XNACK became just 200%, not the previous 1000% to 4000% I’ve seen.
For further tuning, hipMemAdvise() accepts the following flags:
- hipMemAdviseSetReadMostly: Data will mostly be read and only occasionally be written to
- hipMemAdviseSetPreferredLocation: Set the preferred location for the data as the specified device
- hipMemAdviseSetAccessedBy: Data will be accessed by the specified device, so prevent page faults as much as possible
- hipMemAdviseSetCoarseGrain: The default memory model is fine-grain, which allows coherent operations between host and device while executing kernels. Coarse-grain can be used for data that only needs to be coherent at dispatch boundaries, for better performance
In my case, I found hipMemAdviseSetReadMostly and hipMemAdviseSetPreferredLocation had no effect, but hipMemAdviseSetCoarseGrain was able to close the final performance gap between Device Memory and Managed Memory, at the expense of data coherency during simultaneous execution between host and device; this may or may not be acceptable for your application.
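A minimal sketch of this staging workflow, assuming device 0 and an already-allocated managed buffer (error handling omitted; the function name is an arbitrary example):

```cpp
#include <hip/hip_runtime.h>

// Manually stage a managed buffer onto device 0 before launching kernels.
void stage_for_device(float *managed, size_t bytes, hipStream_t stream) {
  // Hint: host and device only need coherence at dispatch boundaries.
  hipMemAdvise(managed, bytes, hipMemAdviseSetCoarseGrain, 0);
  // Explicitly migrate the pages to device 0 ahead of kernel launch,
  // playing the role of a manual host-to-device copy.
  hipMemPrefetchAsync(managed, bytes, 0, stream);
}
```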
SYCL
SYCL provides sycl::queue::prefetch() and sycl::queue::mem_advise() with the same functionality. The flags accepted by mem_advise() are not standardized; they’re specific to a particular runtime.
On AMD, SYCL is implemented on top of HIP. Thus, SYCL’s Unified Shared Memory is implemented as hipMallocManaged(). For example, USM is implemented on AMD GPUs as follows in OpenSYCL:
void *hip_allocator::allocate_usm(size_t bytes)
{
  hip_device_manager::get().activate_device(_dev);

  void *ptr;
  auto err = hipMallocManaged(&ptr, bytes);
  if (err != hipSuccess) {
    register_error(__hipsycl_here(),
                   error_info{"hip_allocator: hipMallocManaged() failed",
                              error_code{"HIP", err},
                              error_type::memory_allocation_error});
    return nullptr;
  }
  return ptr;
}
Thus, SYCL’s demand paging depends entirely on HIP. sycl::queue::prefetch() is a wrapper around hipMemPrefetchAsync(), and sycl::queue::mem_advise() is a wrapper around hipMemAdvise(), and thus accepts the same flags (but you should perform a runtime check instead of hardcoding the SYCL code for AMD).
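A sketch using the standard SYCL 2020 interface (the advice value 0 is a neutral placeholder, since the real values are backend-specific; portable code should query the backend at runtime before passing HIP flag values):

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  const size_t n = 1 << 20;
  float *data = sycl::malloc_shared<float>(n, q);

  // Migrate the pages toward the device ahead of first use.
  q.prefetch(data, n * sizeof(float));

  // mem_advise takes a backend-specific integer; on a HIP backend the
  // hipMemAdvise flag values would be used. 0 is a placeholder here.
  q.mem_advise(data, n * sizeof(float), 0);

  q.parallel_for(sycl::range<1>{n},
                 [=](sycl::id<1> i) { data[i] = 1.0f; }).wait();
  sycl::free(data, q);
}
```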
It’s also worth noting that in OpenSYCL, the standard asynchronous sycl::queue::mem_advise() is not supported yet. Instead, a non-standard extension sycl::mem_advise() is used in its place. You may need a preprocessor macro to be compatible with both Intel DPC++ and OpenSYCL.