Merged
30 changes: 29 additions & 1 deletion docs/src/shmem_design.md
@@ -35,12 +35,40 @@ The high-level view of the design is:
(different operators require different arguments, and therefore different
types and amounts of shmem).
- Recursively fill the shmem for all `StencilBroadcasted`. This is done
by reading the argument data from `getidx`
by reading the argument data from `getidx`. See the section below for more details.
- The destination field is filled with the result of `getidx` (as it is without
shmem), except that we overload `getidx` (for supported `StencilBroadcasted`
types) to retrieve the result of `getidx` via `fd_operator_evaluate`, which
retrieves the result from the shmem, instead of global memory.

### Populating shared memory, and memory access safety

We fill shared memory recursively, traversing the broadcast expression depth-first:
we visit the leaves of the expression first, then work our way up.
It's important to note that `StencilBroadcasted` and `Broadcasted` objects can be
interleaved.

Let's take `DivergenceF2C()(f*GradientC2F()(a*b))` as an example (depicted in
the image below).
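
As a rough illustration of the traversal order for this expression, here is a minimal sketch using toy types. `Leaf`, `PointwiseNode`, `StencilNode`, and `fill_order!` are hypothetical names invented for this example, not ClimaCore API. The sketch only records the order in which stencil nodes would be filled.

```julia
# Hypothetical stand-ins for the broadcast types (illustration only).
struct Leaf
    name::Symbol
end
struct PointwiseNode        # plays the role of `Broadcasted`
    args::Tuple
end
struct StencilNode          # plays the role of `StencilBroadcasted`
    op::Symbol
    args::Tuple
end

# Visit the arguments first, then the node itself, recording the order in
# which stencil nodes would have their shmem filled.
fill_order!(order, ::Leaf) = order
function fill_order!(order, node::PointwiseNode)
    foreach(arg -> fill_order!(order, arg), node.args)
    return order
end
function fill_order!(order, node::StencilNode)
    foreach(arg -> fill_order!(order, arg), node.args)
    push!(order, node.op)   # filled only after everything below it
    return order
end

# Toy version of DivergenceF2C()(f * GradientC2F()(a * b))
grad = StencilNode(:GradientC2F, (PointwiseNode((Leaf(:a), Leaf(:b))),))
expr = StencilNode(:DivergenceF2C, (PointwiseNode((Leaf(:f), grad)),))
order = fill_order!(Symbol[], expr)
```

In this toy model the inner gradient node is recorded before the outer divergence node, even though a pointwise node sits between them.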

Recursion must go through the entire expression in order to ensure that we've
reached all of the leaves of the `StencilBroadcasted` objects (otherwise, we
could introduce race conditions with memory access). The leaves of the
`StencilBroadcasted` will call `getidx`, below which there are (by definition)
no more `StencilBroadcasted`, and those `getidx` calls will read from global
memory. All subsequent reads will be from shmem (as they will be caught by the
`getidx(parent_space, bc::StencilBroadcasted{CUDAWithShmemColumnStencilStyle}, idx, hidx)`
method defined in the `ClimaCoreCUDAExt` module).
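
The way a more specific method can "catch" reads for one style can be sketched with plain multiple dispatch. Everything below (`ToyDefaultStyle`, `ToyShmemStyle`, `ToyStencil`, `toy_getidx`) is hypothetical and only mimics the shape of the real `getidx` overload. It is not the code in `ClimaCoreCUDAExt`.

```julia
struct ToyDefaultStyle end
struct ToyShmemStyle end      # plays the role of CUDAWithShmemColumnStencilStyle

struct ToyStencil{Style}
    global_data::Vector{Float64}   # stand-in for global memory
    shmem::Vector{Float64}         # stand-in for the shared-memory cache
end

# Generic fallback: read from "global memory".
toy_getidx(bc::ToyStencil, i) = (:global, bc.global_data[i])
# More specific method: nodes carrying the shmem style read from the cache.
toy_getidx(bc::ToyStencil{ToyShmemStyle}, i) = (:shmem, bc.shmem[i])

plain = ToyStencil{ToyDefaultStyle}([1.0, 2.0], [10.0, 20.0])
cached = ToyStencil{ToyShmemStyle}([1.0, 2.0], [10.0, 20.0])
```

Calling `toy_getidx(plain, 2)` takes the fallback, while `toy_getidx(cached, 2)` is routed to the cache purely by the style type parameter.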

In the diagram below, we traverse and fill the yellow highlighted sections
(bottom first and top last). The algorithmic impact of using shared memory is
that the duplicate global memory reads (highlighted in red circles) become one
global memory read (performed in `fd_operator_fill_shmem!`).
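
To make the read-count argument concrete, here is a CPU-only sketch that assumes a two-point difference stencil (an assumption for illustration; the real operators and their stencils differ). `count_reads` is a hypothetical helper, and the "cache" is an ordinary array standing in for shmem.

```julia
function count_reads(n)
    a = collect(1.0:n)            # stand-in for argument data in global memory

    # Without a cache: each interior point i reads a[i-1] and a[i] directly,
    # so neighbouring points read the same entries twice.
    reads_direct = 0
    for i in 2:n
        _ = a[i] - a[i - 1]
        reads_direct += 2
    end

    # With a cache: every entry is read from "global memory" once, then the
    # stencil is evaluated from the cache.
    cache = similar(a)
    reads_cached = 0
    for i in 1:n
        cache[i] = a[i]
        reads_cached += 1
    end
    grad = [cache[i] - cache[i - 1] for i in 2:n]

    return reads_direct, reads_cached, grad
end

reads_direct, reads_cached, grad = count_reads(5)
```

For `n` entries this toy stencil performs `2(n - 1)` direct reads versus `n` cached reads; the saving grows with the stencil width.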

Finally, it's important to note that threads must be synchronized after each node
in the tree is filled, to avoid race conditions for subsequent
`getidx(parent_space, bc::StencilBroadcasted{CUDAWithShmemColumnStencilStyle}, idx, hidx)`
calls (which are retrieved via shmem).

![](shmem_diagram_example.png)
Binary file added docs/src/shmem_diagram_example.png
13 changes: 11 additions & 2 deletions ext/cuda/operators_fd_shmem_common.jl
@@ -323,9 +323,18 @@ Base.@propagate_inbounds function fd_resolve_shmem!(
ᶜidx = get_cent_idx(idx)
ᶠidx = get_face_idx(idx)

_fd_resolve_shmem!(idx, hidx, bds, sbc.args...) # propagate idx, not bc_idx recursively through broadcast expressions
# Here, we recurse depth-first. We visit leaves of the broadcast expression,
# then work our way up. The StencilBroadcasted and Broadcasted can be
# interleaved (e.g., `DivergenceF2C()(f*GradientC2F()(a*b))`). The leaves of
# the StencilBroadcasted will call `getidx`, below which there are
# (by definition) no more `StencilBroadcasted`, and those `getidx` calls
# will read from global memory. Immediately above those reads, all
# subsequent reads will be from shmem (as they will be caught by the
# `getidx` defined above).
_fd_resolve_shmem!(idx, hidx, bds, sbc.args...)

# After recursion, check if shmem is supported for this operator
# Once we're about ready to fill the shmem, check if shmem is supported for
# this operator
Operators.fd_shmem_is_supported(sbc) || return nothing

# There are `Nf` threads, where `Nf` is the number of face levels. So,