On the hardware side, there is the hierarchy (fine to coarse):

- work item (thread)
- wavefront
- work group
- compute unit (CU)

All OpenMP and OpenACC levels are used, i.e.:

- OpenMP's simd and OpenACC's vector map to work items (threads)
- OpenMP's threads ("parallel") and OpenACC's workers map to wavefronts
- OpenMP's teams and OpenACC's gangs use a threadpool with the size of the number of teams or gangs, respectively

The used sizes are:

- The number of teams is the specified num_teams (OpenMP) or num_gangs (OpenACC), or otherwise the number of CUs; it is limited to two times the number of CUs.
- num_threads (OpenMP) and num_workers (OpenACC) override the default number of wavefronts per team if smaller.
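
For illustration, these sizes can be influenced from user code with the usual OpenMP clauses. The following is a minimal sketch (the clause values are arbitrary examples, not recommendations); the OpenACC num_gangs/num_workers clauses work analogously.

    #include <stdio.h>

    int main (void)
    {
      int sum = 0;

      /* Request at most 8 teams with 4 threads each; without these clauses
         the runtime derives the sizes from the number of compute units as
         described above.  */
      #pragma omp target teams distribute parallel for \
          num_teams(8) num_threads(4) reduction(+:sum) map(tofrom:sum)
      for (int i = 0; i < 1024; i++)
        sum += i;

      printf ("sum = %d\n", sum);
      return 0;
    }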

Implementation remarks:

I/O within OpenMP target regions and OpenACC compute regions is supported using the C library printf functions and the Fortran print/write statements.
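
For example, device code can print directly, as in the following minimal C sketch (the Fortran print/write statements behave analogously):

    #include <stdio.h>
    #include <omp.h>

    int main (void)
    {
      #pragma omp target
      {
        /* Executed on the device; the output is forwarded to the host.  */
        printf ("hello from device %d\n", omp_get_device_num ());
      }
      return 0;
    }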

Reverse offload regions (i.e. target regions with device(ancestor:1)) are processed serially per target region such that the next reverse offload region is only executed after the previous one returned.
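
A sketch of such a reverse offload region, assuming the translation unit declares requires reverse_offload (the variable names are illustrative):

    #include <stdio.h>

    #pragma omp requires reverse_offload

    int main (void)
    {
      #pragma omp target
      {
        int device_result = 42;  /* computed on the GPU */

        /* Reverse offload: this nested target region runs on the host and
           only returns once it has finished.  */
        #pragma omp target device(ancestor: 1) map(to: device_result)
        printf ("host received %d\n", device_result);
      }
      return 0;
    }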

OpenMP code that has a requires directive with unified_shared_memory is only supported if all AMD GPUs have the HSA_AMD_SYSTEM_INFO_SVM_ACCESSIBLE_BY_DEFAULT property; for discrete GPUs, this may require setting the HSA_XNACK environment variable to ‘1’; for systems with both an APU and a discrete GPU that does not support XNACK, consider using ROCR_VISIBLE_DEVICES to enable only the APU. If not supported, all AMD GPU devices are removed from the list of available devices (“host fallback”).
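
A minimal sketch of code relying on this feature; whether it actually runs on the GPU or falls back to the host depends on the conditions above:

    #include <stdio.h>
    #include <stdlib.h>

    #pragma omp requires unified_shared_memory

    int main (void)
    {
      int n = 1000;
      int *a = (int *) malloc (n * sizeof (int));

      /* With unified shared memory, no map clause is needed: the device
         accesses the host allocation directly.  */
      #pragma omp target teams distribute parallel for
      for (int i = 0; i < n; i++)
        a[i] = i;

      printf ("a[%d] = %d\n", n - 1, a[n - 1]);
      free (a);
      return 0;
    }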

The available stack size can be changed using the GCN_STACK_SIZE environment variable; the default is 32 kiB per thread.

Low-latency memory (omp_low_lat_mem_space) is supported when the access trait is set to cgroup. The default pool size is automatically scaled to share the 64 kiB LDS memory between the number of teams configured to run on each compute unit, but may be adjusted at runtime by setting the environment variable GOMP_GCN_LOWLAT_POOL=bytes.
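
As a sketch, such an allocator can be constructed on the device with the access trait set to cgroup (sizes and clause values are illustrative):

    #include <omp.h>

    int main (void)
    {
      #pragma omp target
      #pragma omp teams num_teams(1)
      {
        /* Request the low-latency (LDS-backed) memory space; the access
           trait must be cgroup for the allocation to succeed.  */
        omp_alloctrait_t traits[] = { { omp_atk_access, omp_atv_cgroup } };
        omp_allocator_handle_t lowlat
          = omp_init_allocator (omp_low_lat_mem_space, 1, traits);

        int *buf = (int *) omp_alloc (64 * sizeof (int), lowlat);
        if (buf)
          {
            buf[0] = 123;
            omp_free (buf, lowlat);
          }
        omp_destroy_allocator (lowlat);
      }
      return 0;
    }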

omp_low_lat_mem_alloc cannot be used with true low-latency memory because the definition implies the omp_atv_all trait; main graphics memory is used instead.

omp_cgroup_mem_alloc, omp_pteam_mem_alloc, and omp_thread_mem_alloc all use low-latency memory as the first preference, and fall back to main graphics memory when the low-latency pool is exhausted.
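
For instance, per-team scratch space could be requested through the predefined omp_pteam_mem_alloc allocator; the fallback to main graphics memory is transparent to the program (a sketch, with illustrative sizes):

    #include <omp.h>

    int main (void)
    {
      #pragma omp target teams num_teams(4)
      {
        /* One scratch buffer per team; placed in the low-latency pool while
           space is left, otherwise in main graphics memory.  */
        int *scratch = (int *) omp_alloc (256 * sizeof (int), omp_pteam_mem_alloc);

        #pragma omp parallel num_threads(8)
        if (scratch)
          scratch[omp_get_thread_num ()] = omp_get_thread_num ();

        omp_free (scratch, omp_pteam_mem_alloc);
      }
      return 0;
    }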

The unique identifier (UID) is read using the HSA runtime library's HSA_AMD_AGENT_INFO_UUID attribute. For GPUs, it is currently ‘GPU-’ followed by 16 lower-case hex digits, yielding a string like GPU-f914a2142fc3413a. The output matches the one used by rocminfo.
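
Assuming the OpenMP device-UID routines (such as omp_get_uid_from_device, added in OpenMP 6.0) are available in the installed libgomp, the identifier can be queried as in this sketch:

    #include <omp.h>
    #include <stdio.h>

    int main (void)
    {
      /* Print the UID of every non-host device; for AMD GPUs this yields
         strings such as GPU-f914a2142fc3413a, matching rocminfo.  */
      for (int dev = 0; dev < omp_get_num_devices (); dev++)
        printf ("device %d: %s\n", dev, omp_get_uid_from_device (dev));
      return 0;
    }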