LoopManagers
Documentation for LoopManagers.
LoopManagers.LoopManagers — Module
Module LoopManagers provides computing managers to pass to functions using the performance portability module ManagedLoops. It implements the API functions defined by ManagedLoops for the provided managers. Currently supported are SIMD and/or multithreaded execution on the CPU; offloading to GPUs via CUDA and oneAPI is experimental.
Additional iteration/offloading strategies (e.g. cache-friendly iteration) can be implemented by defining new manager types and implementing specialized versions of ManagedLoops.offload.
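The dispatch pattern behind managers can be illustrated with a small, self-contained toy. This is a hypothetical sketch, not the real ManagedLoops API: the names `Manager`, `Sequential`, `Threaded` and `run_loop!` are invented here for illustration, and the real `offload` machinery is richer.

```julia
# Hypothetical sketch (NOT the real ManagedLoops API): managers are plain
# types, and an `offload`-like function dispatches on the manager type to
# pick an execution strategy for the same loop body.
abstract type Manager end
struct Sequential <: Manager end
struct Threaded <: Manager end

# Run `body(i)` for each index in `range`; strategy chosen by dispatch.
run_loop!(body, ::Sequential, range) = foreach(body, range)
function run_loop!(body, ::Threaded, range)
    Threads.@threads for i in range
        body(i)
    end
end

# The same user code works with any manager:
out = zeros(8)
run_loop!(i -> out[i] = i^2, Sequential(), 1:8)
```

Adding a new iteration strategy then amounts to defining one more manager type and one more `run_loop!` method, which mirrors how new manager types specialize `ManagedLoops.offload`.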
LoopManagers.GPUConfig — Type
```julia
config = GPUConfig(nwarp, repeat)
```

Return a singleton object `config` influencing how loops are executed on a GPU. Pass `config` to `configure`, and use the resulting manager with `@with` or `offload`.
- `nwarp` is the number of warps per block; typical values are 1, 2, 4, 8. Higher values are beneficial for small kernels, but not for those using many registers.
- `repeat` is the number of elements taken care of by each thread; a higher value can help amortize the cost of launching the kernel.
- An excessive value of `nwarp*repeat` may result in too few warp blocks to fill the GPU.
For 1D loops:
- the loop is split into chunks of size `warpsize*nwarp*repeat`, with `warpsize` a GPU-dependent number of threads per warp (usually 32);
- each thread takes care of `repeat` indices separated by `warpsize*nwarp`, resulting in contiguous memory accesses;
- the number of warp blocks is roughly the loop count divided by `warpsize*nwarp*repeat`.
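The 1D index mapping described above can be checked with a few lines of plain Julia. This is an illustration of the stated chunking, not code from LoopManagers; the mapping below is assumed from the description (variables are renamed `wsize`/`rep` to avoid clashing with Base names).

```julia
# Illustration of the 1D chunking described above (plain Julia; the actual
# GPU indexing is internal to LoopManagers and is assumed here).
wsize, nwarp, rep = 32, 2, 4
chunk = wsize * nwarp * rep          # 256 loop indices per warp block

# 1-based global indices handled by thread `t` (0-based) of block `b` (0-based):
# `rep` indices separated by wsize*nwarp = 64.
indices(b, t) = [b*chunk + t + k*wsize*nwarp + 1 for k in 0:rep-1]

indices(0, 0)   # -> [1, 65, 129, 193]
indices(0, 1)   # -> [2, 66, 130, 194]
```

Consecutive threads touch consecutive indices at each step, which is what makes the memory accesses contiguous within a warp.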
For 2D loops:
- the inner loop is distributed among a single warp, so that data depending only on the outer loop index can be reused.
- the outer loop is distributed among warp blocks.
- each thread takes care of `repeat` outer indices separated by `nwarp`;
- the number of warp blocks is roughly the outer loop count divided by `nwarp*repeat`.
LoopManagers.KernelAbstractions_GPU — Type
```julia
gpu = KernelAbstractions_GPU(gpu::KernelAbstractions.GPU, ArrayType)

# examples
gpu = KernelAbstractions_GPU(CUDABackend(), CuArray)
gpu = KernelAbstractions_GPU(ROCBackend(), ROCArray)
gpu = KernelAbstractions_GPU(oneBackend(), oneArray)
```

Returns a manager that offloads computations to a KernelAbstractions GPU backend. The returned manager will call `ArrayType(data)` when it needs to transfer data to the device.
LoopManagers.MainThread — Type
```julia
manager = MainThread(cpu_manager=PlainCPU(), nthreads=Threads.nthreads())
```

Returns a multithread manager derived from `cpu_manager`, initially in sequential mode. In this mode, `manager` behaves exactly like `cpu_manager`. When `manager` is passed to `ManagedLoops.parallel`, `nthreads` threads are spawned. The manager passed to the threads works in parallel mode: it behaves like `cpu_manager`, except that the outer loop is distributed among threads. Furthermore `ManagedLoops.barrier` and `ManagedLoops.share` allow synchronisation and data sharing across threads.
```julia
main_mgr = MainThread()
LoopManagers.parallel(main_mgr) do thread_mgr
    x = LoopManagers.share(thread_mgr) do master_mgr
        randn()
    end
    println("Thread $(Threads.threadid()) has drawn $x.")
end
```

LoopManagers.MultiThread — Type
```julia
manager = MultiThread(b=PlainCPU(), nt=Threads.nthreads())
```

Returns a multithread manager derived from the CPU manager `b`, following a fork-join pattern. When `manager` is passed to `ManagedLoops.offload`, `nt` threads are spawned (fork), each working on a subset of indices. Execution continues only after all threads have finished (join), so that `barrier` is not needed between two uses of `offload` and does nothing.
It is highly recommended to pin the Julia threads to specific cores. The simplest way is probably to set JULIA_EXCLUSIVE=1 before launching Julia. See also Julia Discourse
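The fork-join pattern described above can be sketched in plain Julia. This is an illustration only: the real `MultiThread` manager performs this internally, and `forkjoin!` is an invented name.

```julia
# Plain-Julia sketch of the fork-join pattern (illustrative; the real
# MultiThread manager does the splitting and joining internally).
function forkjoin!(body, out, nt)
    n = length(out)
    tasks = map(1:nt) do t                # fork: one task per "thread"
        lo = div((t - 1) * n, nt) + 1     # this worker's index subset
        hi = div(t * n, nt)
        Threads.@spawn for i in lo:hi
            out[i] = body(i)
        end
    end
    foreach(wait, tasks)                  # join: continue only when all done
    return out
end

out = zeros(Int, 10)
forkjoin!(i -> 2i, out, 4)   # out == [2, 4, ..., 20]
```

Because `forkjoin!` only returns after every worker has finished, no explicit barrier is needed between two consecutive calls, matching the behaviour described for `offload`.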
LoopManagers.PlainCPU — Type
```julia
manager = PlainCPU()
```

Manager for sequential execution on the CPU. LLVM will try to vectorize loops marked with `@simd`. This works mostly for simple loops and arithmetic computations. For Julia-side vectorization, especially of mathematical functions, see `VectorizedCPU`.
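A quick plain-Julia illustration of the kind of loop LLVM can auto-vectorize under `@simd`, independent of any manager:

```julia
# A simple arithmetic reduction: @simd allows the compiler to reorder the
# accumulation, which is what enables SIMD code generation here.
function sumsq(a)
    s = zero(eltype(a))
    @simd for i in eachindex(a)
        @inbounds s += a[i] * a[i]
    end
    return s
end

sumsq(Float64.(1:4))   # 1 + 4 + 9 + 16 = 30.0
```

Loops with data-dependent branches or non-trivial function calls typically defeat this auto-vectorization, which is the gap `VectorizedCPU` and `@vec` address.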
LoopManagers.SRange — Type
```julia
range = SRange(start, step, stop)
```

Return a range similar to `start:step:stop` with the following differences:

- `step` is in the type domain and must be known at compile time;
- `start`, `step` and `stop` are converted to `UInt32`.
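The phrase "in the type domain" means the step is carried as a type parameter rather than a field, so the compiler sees it as a constant. Below is a hypothetical sketch of that idea; the name `StaticStepRange` is invented and the real `SRange` implementation may differ.

```julia
# Hypothetical sketch of a "step in the type domain" range (illustrative
# only; NOT the actual SRange implementation).
struct StaticStepRange{step}   # `step` is a type parameter, not a field
    start::UInt32
    stop::UInt32
end

StaticStepRange(start, step, stop) =
    StaticStepRange{UInt32(step)}(UInt32(start), UInt32(stop))

# The compiler knows `step` statically inside any method like this one:
Base.collect(r::StaticStepRange{step}) where {step} =
    collect(r.start:step:r.stop)

r = StaticStepRange(1, 2, 9)
collect(r)   # UInt32[1, 3, 5, 7, 9]
```

Because `step` is a compile-time constant, loops over such a range can be unrolled or strength-reduced without any runtime branching on the step value.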
LoopManagers.SingleCPU — Type
```julia
abstract type SingleCPU <: HostManager end
```

Parent type for managers executing on a single core. Derived types should specialize [`distribute`](@ref) or [`offload_single`](@ref) and leave `offload` as it is.
LoopManagers.VectorizedCPU — Type
```julia
manager = VectorizedCPU()
```

Returns a manager for executing loops with optional explicit SIMD vectorization. Only inner loops marked with `@vec` will use explicit vectorization. If this causes errors, use `@simd` instead of `@vec`. Vectorization of loops marked with `@simd` is left to the Julia/LLVM compiler, as with `PlainCPU`.
LoopManagers.distribute — Method
Divide work among vectorized CPU threads.
LoopManagers.warpsize — Method
```julia
nthreads = warpsize(gpu)
```

Returns the number of threads per warp for a KernelAbstractions GPU. Defaults to 32.