LoopManagers
Documentation for LoopManagers.
LoopManagers.LoopManagers — Module
Module LoopManagers provides computing managers to pass to functions using the performance portability module ManagedLoops. It implements the API functions defined by ManagedLoops for the provided managers. Currently supported are SIMD and/or multithreaded execution on the CPU; offloading to GPUs via CUDA and oneAPI is experimental.
Additional iteration/offloading strategies (e.g. cache-friendly iteration) can be implemented by defining new manager types and implementing specialized versions of ManagedLoops.offload.
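The dispatch pattern behind managers can be illustrated with a small, self-contained toy. This is a hypothetical sketch, not the real ManagedLoops API: the names `Manager`, `Sequential`, `Threaded` and `run_loop!` are invented here for illustration, and the real `offload` machinery is richer.

```julia
# Hypothetical sketch (NOT the real ManagedLoops API): managers are plain
# types, and an `offload`-like function dispatches on the manager type to
# pick an execution strategy for the same loop body.
abstract type Manager end
struct Sequential <: Manager end
struct Threaded <: Manager end

# Run `body(i)` for each index in `range`; strategy chosen by dispatch.
run_loop!(body, ::Sequential, range) = foreach(body, range)
function run_loop!(body, ::Threaded, range)
    Threads.@threads for i in range
        body(i)
    end
end

# The same user code works with any manager:
out = zeros(8)
run_loop!(i -> out[i] = i^2, Sequential(), 1:8)
```

Adding a new iteration strategy then amounts to defining one more manager type and one more `run_loop!` method, which mirrors how new manager types specialize `ManagedLoops.offload`.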
LoopManagers.GPUConfig — Type
```julia
config = GPUConfig(nwarp, repeat)
```

Return a singleton object `config` influencing how loops are executed on a GPU. Pass `config` to `configure`, and use the resulting manager with `@with` or `offload`.
- `nwarp` is the number of warps per block; typical values are 1, 2, 4, 8. Higher values are beneficial for small kernels, but not for those using many registers.
- `repeat` is the number of elements taken care of by each thread; a higher value can help amortize the cost of launching the kernel.
- An excessive value of `nwarp*repeat` may result in too few warp blocks to fill the GPU.
For 1D loops:
- the loop is split into chunks of size `warpsize*nwarp*repeat`, with `warpsize` a GPU-dependent number of threads per warp (usually 32);
- each thread takes care of `repeat` indices separated by `warpsize*nwarp`, resulting in contiguous memory accesses;
- the number of warp blocks is roughly the loop count divided by `warpsize*nwarp*repeat`.
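The 1D index mapping described above can be checked with a few lines of plain Julia. This is an illustration of the stated chunking, not code from LoopManagers; the mapping below is assumed from the description (variables are renamed `wsize`/`rep` to avoid clashing with Base names).

```julia
# Illustration of the 1D chunking described above (plain Julia; the actual
# GPU indexing is internal to LoopManagers and is assumed here).
wsize, nwarp, rep = 32, 2, 4
chunk = wsize * nwarp * rep          # 256 loop indices per warp block

# 1-based global indices handled by thread `t` (0-based) of block `b` (0-based):
# `rep` indices separated by wsize*nwarp = 64.
indices(b, t) = [b*chunk + t + k*wsize*nwarp + 1 for k in 0:rep-1]

indices(0, 0)   # -> [1, 65, 129, 193]
indices(0, 1)   # -> [2, 66, 130, 194]
```

Consecutive threads touch consecutive indices at each step, which is what makes the memory accesses contiguous within a warp.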
For 2D loops:
- the inner loop is distributed among a single warp, so that data depending only on the outer loop index can be reused.
- the outer loop is distributed among warp blocks.
- each thread takes care of `repeat` outer indices separated by `nwarp`;
- the number of warp blocks is roughly the outer loop count divided by `nwarp*repeat`.
LoopManagers.KernelAbstractions_GPU — Type
```julia
gpu = KernelAbstractions_GPU(gpu::KernelAbstractions.GPU, ArrayType)

# examples
gpu = KernelAbstractions_GPU(CUDABackend(), CuArray)
gpu = KernelAbstractions_GPU(ROCBackend(), ROCArray)
gpu = KernelAbstractions_GPU(oneBackend(), oneArray)
```

Returns a manager that offloads computations to a KernelAbstractions GPU backend. The returned manager will call `ArrayType(data)` when it needs to transfer data to the device.
LoopManagers.MainThread — Type
```julia
manager = MainThread(cpu_manager=PlainCPU(), nthreads=Threads.nthreads())
```

Returns a multithread manager derived from `cpu_manager`, initially in sequential mode. In this mode, `manager` behaves exactly like `cpu_manager`. When `manager` is passed to `ManagedLoops.parallel`, `nthreads` threads are spawned. The manager passed to the threads works in parallel mode: it behaves like `cpu_manager`, except that the outer loop is distributed among threads. Furthermore `ManagedLoops.barrier` and `ManagedLoops.share` allow synchronisation and data sharing across threads.
```julia
main_mgr = MainThread()
LoopManagers.parallel(main_mgr) do thread_mgr
    x = LoopManagers.share(thread_mgr) do master_mgr
        randn()
    end
    println("Thread $(Threads.threadid()) has drawn $x.")
end
```

LoopManagers.MultiThread — Type
```julia
manager = MultiThread(b=PlainCPU(), nt=Threads.nthreads())
```

Returns a multithread manager derived from the CPU manager `b`, following a fork-join pattern. When `manager` is passed to `ManagedLoops.offload`, `nt` threads are spawned (fork), each working on a subset of indices. Execution continues only after all threads have finished (join), so that `barrier` is not needed between two uses of `offload` and does nothing.
It is highly recommended to pin the Julia threads to specific cores. The simplest way is probably to set JULIA_EXCLUSIVE=1 before launching Julia. See also Julia Discourse
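The fork-join pattern described above can be sketched in plain Julia. This is an illustration only: the real `MultiThread` manager performs this internally, and `forkjoin!` is an invented name.

```julia
# Plain-Julia sketch of the fork-join pattern (illustrative; the real
# MultiThread manager does the splitting and joining internally).
function forkjoin!(body, out, nt)
    n = length(out)
    tasks = map(1:nt) do t                # fork: one task per "thread"
        lo = div((t - 1) * n, nt) + 1     # this worker's index subset
        hi = div(t * n, nt)
        Threads.@spawn for i in lo:hi
            out[i] = body(i)
        end
    end
    foreach(wait, tasks)                  # join: continue only when all done
    return out
end

out = zeros(Int, 10)
forkjoin!(i -> 2i, out, 4)   # out == [2, 4, ..., 20]
```

Because `forkjoin!` only returns after every worker has finished, no explicit barrier is needed between two consecutive calls, matching the behaviour described for `offload`.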
LoopManagers.PlainCPU — Type
```julia
manager = PlainCPU()
```

Manager for sequential execution on the CPU. LLVM will try to vectorize loops marked with `@simd`. This works mostly for simple loops and arithmetic computations. For Julia-side vectorization, especially of mathematical functions, see `VectorizedCPU`.
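A quick plain-Julia illustration of the kind of loop LLVM can auto-vectorize under `@simd`, independent of any manager:

```julia
# A simple arithmetic reduction: @simd allows the compiler to reorder the
# accumulation, which is what enables SIMD code generation here.
function sumsq(a)
    s = zero(eltype(a))
    @simd for i in eachindex(a)
        @inbounds s += a[i] * a[i]
    end
    return s
end

sumsq(Float64.(1:4))   # 1 + 4 + 9 + 16 = 30.0
```

Loops with data-dependent branches or non-trivial function calls typically defeat this auto-vectorization, which is the gap `VectorizedCPU` and `@vec` address.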
LoopManagers.SRange — Type
```julia
range = SRange(start, step, stop)
```

Return a range similar to `start:step:stop` with the following differences:

- `step` is in the type domain and must be known at compile time;
- `start`, `step` and `stop` are converted to `UInt32`.
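The phrase "in the type domain" means the step is carried as a type parameter rather than a field, so the compiler sees it as a constant. Below is a hypothetical sketch of that idea; the name `StaticStepRange` is invented and the real `SRange` implementation may differ.

```julia
# Hypothetical sketch of a "step in the type domain" range (illustrative
# only; NOT the actual SRange implementation).
struct StaticStepRange{step}   # `step` is a type parameter, not a field
    start::UInt32
    stop::UInt32
end

StaticStepRange(start, step, stop) =
    StaticStepRange{UInt32(step)}(UInt32(start), UInt32(stop))

# The compiler knows `step` statically inside any method like this one:
Base.collect(r::StaticStepRange{step}) where {step} =
    collect(r.start:step:r.stop)

r = StaticStepRange(1, 2, 9)
collect(r)   # UInt32[1, 3, 5, 7, 9]
```

Because `step` is a compile-time constant, loops over such a range can be unrolled or strength-reduced without any runtime branching on the step value.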
LoopManagers.SingleCPU — Type
```julia
abstract type SingleCPU <: HostManager end
```

Parent type for managers executing on a single core. Derived types should specialize [`distribute`](@ref) or [`offload_single`](@ref) and leave `offload` as it is.
LoopManagers.VectorizedCPU — Type
```julia
manager = VectorizedCPU()
```

Returns a manager for executing loops with optional explicit SIMD vectorization. Only inner loops marked with `@vec` will use explicit vectorization. If this causes errors, use `@simd` instead of `@vec`. Vectorization of loops marked with `@simd` is left to the Julia/LLVM compiler, as with `PlainCPU`.
LoopManagers.distribute — Method
Divide work among vectorized CPU threads.
LoopManagers.warpsize — Method
```julia
nthreads = warpsize(gpu)
```

Returns the number of threads per warp for a KernelAbstractions GPU. Defaults to 32.