LoopManagers

Documentation for LoopManagers.

LoopManagers.LoopManagers (Module)

Module LoopManagers provides computing managers to pass to functions using the performance portability module ManagedLoops. It implements the API functions defined by ManagedLoops for the provided managers. Currently supported are SIMD and/or multithreaded execution on the CPU. Offloading to GPU via CUDA and oneAPI is experimental.

Additional iteration/offloading strategies (e.g. cache-friendly iteration) can be implemented by defining new manager types and implementing specialized versions of ManagedLoops.offload.

LoopManagers.GPUConfig (Type)
config = GPUConfig(nwarp, repeat)

Return singleton object config influencing how loops are executed on a GPU. Pass config to configure, and use the resulting manager with @with or offload.

  • nwarp is the number of warps per block; typical values are 1, 2, 4, 8. Higher values are beneficial for small kernels, but not for kernels using many registers.
  • repeat is the number of elements handled by each thread; a higher value can help amortize the cost of launching the kernel.
  • An excessive value of nwarp*repeat may result in too few warp blocks to fill the GPU.

For 1D loops:

  • the loop is split into chunks of size warpsize*nwarp*repeat, with warpsize a GPU-dependent number of threads per warp (usually 32).
  • each thread takes care of repeat indices separated by warpsize*nwarp, resulting in contiguous memory accesses.
  • the number of warp blocks is roughly the loop count divided by warpsize*nwarp*repeat.
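
As a concrete illustration of this arithmetic (plain Julia, no GPU required; warpsize, nwarp, repeat and the loop count below are arbitrary example values):

```julia
# Chunking arithmetic for a 1D loop under GPUConfig(nwarp, rep).
# warpsize is GPU-dependent (usually 32); the other values are examples.
warpsize, nwarp, rep = 32, 4, 2
N = 1_000_000                       # loop count

chunk   = warpsize * nwarp * rep    # indices handled per warp block
nblocks = cld(N, chunk)             # ~ N / (warpsize*nwarp*rep) warp blocks

# Thread t (0-based) within a block handles `rep` indices separated by
# warpsize*nwarp, so consecutive threads touch consecutive memory locations.
thread_indices(t) = [t + k * warpsize * nwarp for k in 0:rep-1]
```

With these example values each block covers 256 indices; thread 0 handles offsets 0 and 128 within its chunk and thread 1 handles 1 and 129, so accesses within a warp stay contiguous.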

For 2D loops:

  • the inner loop is distributed among a single warp, so that data depending only on the outer loop index can be reused.
  • the outer loop is distributed among warp blocks.
  • each thread takes care of repeat outer indices separated by nwarp.
  • the number of warp blocks is roughly the outer loop count divided by nwarp*repeat.
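
The same arithmetic for the 2D case, again in plain Julia with illustrative values of nwarp and repeat:

```julia
# Distribution of a 2D loop under GPUConfig(nwarp, rep): the inner loop is
# spread over one warp; the outer loop is spread over warps and blocks.
nwarp, rep  = 4, 2
outer_count = 1000

nblocks = cld(outer_count, nwarp * rep)   # ~ outer_count / (nwarp*rep)

# The warp with index w (0-based) within a block, i.e. each of its threads,
# handles `rep` outer indices separated by nwarp.
outer_indices(w) = [w + k * nwarp for k in 0:rep-1]
```
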
LoopManagers.KernelAbstractions_GPU (Type)
gpu = KernelAbstractions_GPU(gpu::KernelAbstractions.GPU, ArrayType)
# examples
gpu = KernelAbstractions_GPU(CUDABackend(), CuArray)
gpu = KernelAbstractions_GPU(ROCBackend(), ROCArray)
gpu = KernelAbstractions_GPU(oneBackend(), oneArray)

Returns a manager that offloads computations to a KernelAbstractions GPU backend. The returned manager will call ArrayType(data) when it needs to transfer data to the device.

Note

While KernelAbstractions_GPU is always available, implementations of offload for it are available only if the module KernelAbstractions is loaded by the main program or one of its dependencies.

LoopManagers.MainThread (Type)
manager = MainThread(cpu_manager=PlainCPU(), nthreads=Threads.nthreads())

Returns a multithreaded manager derived from cpu_manager, initially in sequential mode. In this mode, manager behaves exactly like cpu_manager. When manager is passed to ManagedLoops.parallel, nthreads threads are spawned. The manager passed to each thread operates in parallel mode: it behaves like cpu_manager, except that the outer loop is distributed among threads. Furthermore, ManagedLoops.barrier and ManagedLoops.share allow synchronisation and data sharing across threads.

main_mgr = MainThread()
LoopManagers.parallel(main_mgr) do thread_mgr
    x = LoopManagers.share(thread_mgr) do master_mgr
        randn()
    end
    println("Thread $(Threads.threadid()) has drawn $x.")
end
LoopManagers.MultiThread (Type)
manager = MultiThread(b=PlainCPU(), nt=Threads.nthreads())

Returns a multithreaded manager derived from the CPU manager b, following a fork-join pattern. When manager is passed to ManagedLoops.offload, nt threads are spawned (fork), each working on a subset of indices. Execution continues only after all threads have finished (join), so that barrier is not needed between two uses of offload and does nothing.
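
A hedged usage sketch: the kernel definition below assumes ManagedLoops' @loops and @vec macros and should be checked against the ManagedLoops documentation; only MultiThread and PlainCPU come from the text above, and axpy! is a hypothetical kernel name.

```julia
using ManagedLoops: @loops, @vec      # assumed import path
using LoopManagers: MultiThread, PlainCPU

# Hypothetical kernel: the first argument slot receives the manager.
@loops function axpy!(_, y, a, x)
    let irange = eachindex(x, y)
        @vec for i in irange
            y[i] += a * x[i]
        end
    end
end

mgr = MultiThread(PlainCPU())   # fork-join over Threads.nthreads() threads
x, y = randn(10^6), zeros(10^6)
axpy!(mgr, y, 2.0, x)           # outer loop split among threads, implicit join
```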

Tip

It is highly recommended to pin Julia threads to specific cores. The simplest way is probably to set the environment variable JULIA_EXCLUSIVE=1 before launching Julia. See also the Julia Discourse.

LoopManagers.PlainCPU (Type)
manager = PlainCPU()

Manager for sequential execution on the CPU. LLVM will try to vectorize loops marked with @simd. This works mostly for simple loops and arithmetic computations. For Julia-side vectorization, especially of mathematical functions, see VectorizedCPU.

LoopManagers.SRange (Type)

range = SRange(start, step, stop)

Return a range similar to start:step:stop with the following differences:

  • step is in the type domain and must be known at compile time.
  • start, step and stop are converted to UInt32.
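
A minimal sketch of what this implies, assuming SRange iterates like the equivalent start:step:stop range (an assumption, not verified against the implementation):

```julia
using LoopManagers

# Similar to 1:2:9, but the step is a compile-time constant and the
# bounds are stored as UInt32.
r = LoopManagers.SRange(1, 2, 9)
# Iterating r should then visit 1, 3, 5, 7, 9 as UInt32 values.
```
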
LoopManagers.SingleCPU (Type)
abstract type SingleCPU <: HostManager end

Parent type for managers executing on a single core. Derived types should specialize distribute or offload_single and leave offload as it is.

LoopManagers.VectorizedCPU (Type)
manager = VectorizedCPU()

Returns a manager for executing loops with optional explicit SIMD vectorization. Only inner loops marked with @vec will use explicit vectorization. If this causes errors, use @simd instead of @vec. Vectorization of loops marked with @simd is left to the Julia/LLVM compiler, as with PlainCPU.
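
A hedged sketch of explicit vectorization; as above, the @loops/@vec kernel syntax is assumed from ManagedLoops and should be checked against its documentation, and square! is a hypothetical kernel name.

```julia
using ManagedLoops: @loops, @vec   # assumed import path
using LoopManagers: VectorizedCPU

# Hypothetical kernel: @vec marks the inner loop for explicit SIMD.
@loops function square!(_, out, a)
    let irange = eachindex(out, a)
        @vec for i in irange
            out[i] = a[i] * a[i]
        end
    end
end

a   = randn(1024)
out = similar(a)
square!(VectorizedCPU(), out, a)   # explicit SIMD via @vec
```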

Note

ManagedLoops.no_simd(::VectorizedCPU) returns a PlainCPU.

LoopManagers.warpsize (Method)
nthreads = warpsize(gpu)

Returns the number of threads per warp for a KernelAbstractions gpu. Defaults to 32.
