On the hardware side, there is the hierarchy (fine to coarse):

- work item (thread)
- wavefront
- work group
- compute unit (CU)

All OpenMP and OpenACC levels are used, i.e.:

- OpenMP's simd and OpenACC's vector map to work items (threads)
- OpenMP's threads ("parallel") and OpenACC's workers map to wavefronts
- OpenMP's teams and OpenACC's gangs use a threadpool with the size of the number of teams or gangs, respectively

The used sizes are:

- The number of teams is the specified num_teams (OpenMP) or num_gangs (OpenACC), or otherwise the number of CUs; it is limited to two times the number of CUs.
- num_threads (OpenMP) and num_workers (OpenACC) override the default number of wavefronts per team if smaller.
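
For illustration, these sizes can be influenced from user code with the usual OpenMP clauses. The following is a minimal sketch (the clause values are arbitrary examples, not recommendations); the OpenACC num_gangs/num_workers clauses work analogously.

    #include <stdio.h>

    int main (void)
    {
      int sum = 0;

      /* Request at most 8 teams with 4 threads each; without these clauses
         the runtime derives the sizes from the number of compute units as
         described above.  */
      #pragma omp target teams distribute parallel for \
          num_teams(8) num_threads(4) reduction(+:sum) map(tofrom:sum)
      for (int i = 0; i < 1024; i++)
        sum += i;

      printf ("sum = %d\n", sum);
      return 0;
    }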

Implementation remarks:

I/O within OpenMP target regions and OpenACC compute regions is supported using the C library printf functions and the Fortran print/write statements.
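
For example, device code can print directly, as in the following minimal C sketch (the Fortran print/write statements behave analogously):

    #include <stdio.h>
    #include <omp.h>

    int main (void)
    {
      #pragma omp target
      {
        /* Executed on the device; the output is forwarded to the host.  */
        printf ("hello from device %d\n", omp_get_device_num ());
      }
      return 0;
    }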

Reverse offload regions (i.e. target regions with device(ancestor:1)) are processed serially per target region such that the next reverse offload region is only executed after the previous one returned.
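
A sketch of such a reverse offload region, assuming the translation unit declares requires reverse_offload (the variable names are illustrative):

    #include <stdio.h>

    #pragma omp requires reverse_offload

    int main (void)
    {
      #pragma omp target
      {
        int device_result = 42;  /* computed on the GPU */

        /* Reverse offload: this nested target region runs on the host and
           only returns once it has finished.  */
        #pragma omp target device(ancestor: 1) map(to: device_result)
        printf ("host received %d\n", device_result);
      }
      return 0;
    }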

OpenMP code that has a requires directive with unified_shared_memory is only supported if all AMD GPUs have the HSA_AMD_SYSTEM_INFO_SVM_ACCESSIBLE_BY_DEFAULT property; for discrete GPUs, this may require setting the HSA_XNACK environment variable to ‘1’; for systems with both an APU and a discrete GPU that does not support XNACK, consider using ROCR_VISIBLE_DEVICES to enable only the APU. If not supported, all AMD GPU devices are removed from the list of available devices (“host fallback”).
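
A minimal sketch of code relying on this feature; whether it actually runs on the GPU or falls back to the host depends on the conditions above:

    #include <stdio.h>
    #include <stdlib.h>

    #pragma omp requires unified_shared_memory

    int main (void)
    {
      int n = 1000;
      int *a = (int *) malloc (n * sizeof (int));

      /* With unified shared memory, no map clause is needed: the device
         accesses the host allocation directly.  */
      #pragma omp target teams distribute parallel for
      for (int i = 0; i < n; i++)
        a[i] = i;

      printf ("a[%d] = %d\n", n - 1, a[n - 1]);
      free (a);
      return 0;
    }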

The available stack size can be changed using the GCN_STACK_SIZE environment variable; the default is 32 kiB per thread.

Low-latency memory (omp_low_lat_mem_space) is supported when the access trait is set to cgroup. The default pool size is automatically scaled to share the 64 kiB LDS memory between the number of teams configured to run on each compute unit, but may be adjusted at runtime by setting the environment variable GOMP_GCN_LOWLAT_POOL=bytes.
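
As a sketch, such an allocator can be constructed on the device with the access trait set to cgroup (sizes and clause values are illustrative):

    #include <omp.h>

    int main (void)
    {
      #pragma omp target
      #pragma omp teams num_teams(1)
      {
        /* Request the low-latency (LDS-backed) memory space; the access
           trait must be cgroup for the allocation to succeed.  */
        omp_alloctrait_t traits[] = { { omp_atk_access, omp_atv_cgroup } };
        omp_allocator_handle_t lowlat
          = omp_init_allocator (omp_low_lat_mem_space, 1, traits);

        int *buf = (int *) omp_alloc (64 * sizeof (int), lowlat);
        if (buf)
          {
            buf[0] = 123;
            omp_free (buf, lowlat);
          }
        omp_destroy_allocator (lowlat);
      }
      return 0;
    }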

omp_low_lat_mem_alloc cannot be used with true low-latency memory because the definition implies the omp_atv_all trait; main graphics memory is used instead.

omp_cgroup_mem_alloc, omp_pteam_mem_alloc, and omp_thread_mem_alloc all use low-latency memory as the first preference, and fall back to main graphics memory when the low-latency pool is exhausted.
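
For instance, per-team scratch space could be requested through the predefined omp_pteam_mem_alloc allocator; the fallback to main graphics memory is transparent to the program (a sketch, with illustrative sizes):

    #include <omp.h>

    int main (void)
    {
      #pragma omp target teams num_teams(4)
      {
        /* One scratch buffer per team; placed in the low-latency pool while
           space is left, otherwise in main graphics memory.  */
        int *scratch = (int *) omp_alloc (256 * sizeof (int), omp_pteam_mem_alloc);

        #pragma omp parallel num_threads(8)
        if (scratch)
          scratch[omp_get_thread_num ()] = omp_get_thread_num ();

        omp_free (scratch, omp_pteam_mem_alloc);
      }
      return 0;
    }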

The unique identifier (UID) is read using the HSA runtime library's HSA_AMD_AGENT_INFO_UUID attribute. For GPUs, it is currently ‘GPU-’ followed by 16 lower-case hex digits, yielding a string like GPU-f914a2142fc3413a. The output matches the one used by rocminfo.
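
Assuming the OpenMP device-UID routines (such as omp_get_uid_from_device, added in OpenMP 6.0) are available in the installed libgomp, the identifier can be queried as in this sketch:

    #include <omp.h>
    #include <stdio.h>

    int main (void)
    {
      /* Print the UID of every non-host device; for AMD GPUs this yields
         strings such as GPU-f914a2142fc3413a, matching rocminfo.  */
      for (int dev = 0; dev < omp_get_num_devices (); dev++)
        printf ("device %d: %s\n", dev, omp_get_uid_from_device (dev));
      return 0;
    }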