Memory Model Relaxation Annotations

Introduction
Memory Model Relaxation Annotations (MMRAs) are target-defined properties on instructions that can be used to selectively relax constraints placed by the memory model. For example:
- The use of VulkanMemoryModel in a SPIRV program allows certain memory operations to be reordered across acquire or release operations.
- OpenCL APIs expose primitives to only fence a specific set of address spaces. Carrying that information to the backend can enable the use of faster synchronization instructions, rather than fencing all address spaces every time.
MMRAs offer an opt-in system for targets to relax the default LLVM memory model. As such, they are attached to an operation using LLVM metadata which can always be dropped without affecting correctness.
Definitions
- memory operation
A load, a store, an atomic, or a function call that is marked as accessing memory.
- synchronizing operation
An instruction that synchronizes memory with other threads (e.g. an atomic or a fence).
- tag
Metadata attached to a memory or synchronizing operation that represents some target-defined property regarding memory synchronization.
An operation may have multiple tags that each represent a different property.
A tag is composed of a pair of metadata strings: a prefix and a suffix.
In LLVM IR, the pair is represented using a metadata tuple. In other cases (comments, documentation, etc.), we may use the prefix:suffix notation. For example:

!0 = !{!"scope", !"workgroup"} # scope:workgroup
!1 = !{!"scope", !"device"}    # scope:device
!2 = !{!"scope", !"system"}    # scope:system
Note
The only semantics relevant to the optimizer is the “compatibility” relation defined below. All other semantics are target defined.
Tags can also be organised in lists to allow operations to specify all of the tags they belong to. Such a list is referred to as a “set of tags”.
!0 = !{!"scope", !"workgroup"}
!1 = !{!"sync-as", !"private"}
!2 = !{!0, !1}
Note
If an operation does not have MMRA metadata, it's treated as if it has an empty list (!{}) of tags.

Note that it is not an error if a tag is not recognized by the instruction it is applied to, or by the current target. Such tags are simply ignored.
Both synchronizing operations and memory operations can have zero or more tags attached to them using the !mmra syntax.

For the sake of readability in examples below, we use a (non-functional) short syntax to represent MMRA metadata:

store %ptr1 # foo:bar
store %ptr1 !mmra !{!"foo", !"bar"}

These two notations can be used in this document and are strictly equivalent. However, only the second version is functional.
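For reference, the functional form can be produced through LLVM's generic metadata API. The C++ sketch below attaches the foo:bar tag from the example above to a store, and also shows how a set of several tags becomes a tuple of tag tuples. It is a minimal illustration assuming nothing beyond the standard MDNode/MDString interfaces; the helper names are ours and not part of any LLVM API.

#include "llvm/IR/Instructions.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Metadata.h"

using namespace llvm;

// Attach the single tag foo:bar, i.e. produce: store ... !mmra !{!"foo", !"bar"}
static void attachFooBar(StoreInst *SI) {
  LLVMContext &Ctx = SI->getContext();
  Metadata *Tag[] = {MDString::get(Ctx, "foo"), MDString::get(Ctx, "bar")};
  SI->setMetadata("mmra", MDNode::get(Ctx, Tag));
}

// Attach a set of two tags, i.e. conceptually:
//   store ... !mmra !{!{!"scope", !"workgroup"}, !{!"sync-as", !"private"}}
static void attachTagSet(StoreInst *SI) {
  LLVMContext &Ctx = SI->getContext();
  Metadata *Scope[] = {MDString::get(Ctx, "scope"), MDString::get(Ctx, "workgroup")};
  Metadata *SyncAs[] = {MDString::get(Ctx, "sync-as"), MDString::get(Ctx, "private")};
  Metadata *Set[] = {MDNode::get(Ctx, Scope), MDNode::get(Ctx, SyncAs)};
  SI->setMetadata("mmra", MDNode::get(Ctx, Set));
}

Reading the tags back is the mirror image: getMetadata("mmra") returns the attached node, whose operands are either two strings (a single tag) or nested tag tuples (a set of tags).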
- compatibility
Two sets of tags are said to be compatible iff, for every unique tag prefix P present in at least one set:

- the other set contains no tag with prefix P, or
- at least one tag with prefix P is common to both sets.
The above definition implies that an empty set is always compatible with any other set. This is an important property as it ensures that if a transform drops the metadata on an operation, it can never affect correctness. In other words, the memory model cannot be relaxed further by deleting metadata from instructions.
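To make the relation concrete, here is a small, self-contained C++ sketch of the compatibility check, with tags modelled as (prefix, suffix) string pairs rather than LLVM metadata nodes. It is illustrative only; the type and function names are ours and not part of any LLVM API.

#include <set>
#include <string>
#include <utility>

// A tag is a (prefix, suffix) pair of strings; a set of tags is a std::set.
using Tag = std::pair<std::string, std::string>;
using TagSet = std::set<Tag>;

// Two tag sets are compatible iff, for every prefix present in both sets,
// at least one tag with that prefix is common to both. Prefixes present in
// only one set never matter, so checking A's prefixes against B suffices.
bool areCompatible(const TagSet &A, const TagSet &B) {
  for (const Tag &ATag : A) {
    const std::string &Prefix = ATag.first;
    bool BothHavePrefix = false;
    bool ShareATag = false;
    for (const Tag &BTag : B) {
      if (BTag.first != Prefix)
        continue;
      BothHavePrefix = true;
      if (A.count(BTag)) // same (prefix, suffix) present in both sets
        ShareATag = true;
    }
    if (BothHavePrefix && !ShareATag)
      return false;
  }
  return true; // also covers the case where either set is empty
}

For instance, {scope:workgroup} is compatible with {scope:workgroup, sync-as:private} (they share a scope tag) but not with {scope:device}.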
The happens-before Relation
Compatibility checks can be used to opt out of the happens-before relation established between two instructions.
- Ordering
When two instructions' metadata are not compatible, any program order between them is not in happens-before.

For example, consider two tags foo:bar and foo:baz exposed by a target:

A: store %ptr1 # foo:bar
B: store %ptr2 # foo:baz
X: store atomic release %ptr3 # foo:bar
In the above figure, A is compatible with X, and hence A happens-before X. But B is not compatible with X, and hence B does not happen-before X.

- Synchronization
If a synchronizing operation has one or more tags, then whether it synchronizes-with and participates in the seq_cst order with other operations is target dependent.

Whether the following example synchronizes with another sequence depends on the target-defined semantics of foo:bar and foo:bux.

fence release # foo:bar
store atomic %ptr1 # foo:bux
Examples
- Example 1:
A: store ptr addrspace(1) %ptr2 # sync-as:1 vulkan:nonprivate
B: store atomic release ptr addrspace(1) %ptr3 # sync-as:0 vulkan:nonprivate
A and B are not ordered relative to each other (no happens-before) because their sets of tags are not compatible.
Note that the sync-as value does not have to match the addrspace value. For example, in Example 1, a store-release to a location in addrspace(1) only wants to synchronize with operations happening in addrspace(0).

- Example 2:
A: store ptr addrspace(1) %ptr2 # sync-as:1 vulkan:nonprivate
B: store atomic release ptr addrspace(1) %ptr3 # sync-as:1 vulkan:nonprivate
The ordering of A and B is unaffected because their sets of tags are compatible.
Note that A and B may or may not be in happens-before due to other reasons.
- Example 3:
A: store ptr addrspace(1) %ptr2 # sync-as:1 vulkan:nonprivate
B: store atomic release ptr addrspace(1) %ptr3 # vulkan:nonprivate
The ordering of A and B is unaffected because their sets of tags are compatible.
- Example 4:
A: store ptr addrspace(1) %ptr2 # sync-as:1
B: store atomic release ptr addrspace(1) %ptr3 # sync-as:2
A and B do not have to be ordered relative to each other (no happens-before) because their sets of tags are not compatible.
Use-cases
SPIRV NonPrivatePointer

MMRAs can support the SPIRV capability VulkanMemoryModel, where synchronizing operations only affect memory operations that specify NonPrivatePointer semantics.

The example below is generated from a SPIRV program using the following recipe:
- Add vulkan:nonprivate to every synchronizing operation.
- Add vulkan:nonprivate to every non-atomic memory operation that is marked NonPrivatePointer.
- Add vulkan:private to the tags of every non-atomic memory operation that is not marked NonPrivatePointer.
Thread T1:
  A: store %ptr1 # vulkan:nonprivate
  B: store %ptr2 # vulkan:private
  X: store atomic release %ptr3 # vulkan:nonprivate

Thread T2:
  Y: load atomic acquire %ptr3 # vulkan:nonprivate
  C: load %ptr2 # vulkan:private
  D: load %ptr1 # vulkan:nonprivate
Compatibility ensures that operation A is ordered relative to X while operation D is ordered relative to Y. If X synchronizes with Y, then A happens-before D. No such relation can be inferred about operations B and C.

Note
The Vulkan Memory Model considers all atomic operations non-private.
Whether vulkan:nonprivate would be specified on atomic operations is an implementation detail, as an atomic operation is always nonprivate. The implementation may choose to be explicit and emit IR with vulkan:nonprivate on every atomic operation, or it could choose to only emit vulkan:private and assume vulkan:nonprivate by default.

Operations marked with vulkan:private effectively opt out of the happens-before order in a SPIRV program since they are incompatible with every synchronizing operation. Note that SPIRV operations that are not marked NonPrivatePointer are not entirely private to the thread: they are implicitly synchronized at the start or end of a thread by the Vulkan system-synchronizes-with relationship. This example assumes that the target-defined semantics of vulkan:private correctly implements this property.

This scheme is general enough to express the interoperability of SPIRV programs with other environments.
Thread T1:
  A: store %ptr1 # vulkan:nonprivate
  X: store atomic release %ptr2 # vulkan:nonprivate

Thread T2:
  Y: load atomic acquire %ptr2 # foo:bar
  B: load %ptr1
In the above example, thread T1 originates from a SPIRV program while thread T2 originates from a non-SPIRV program. Whether X can synchronize with Y is target defined. If X synchronizes with Y, then A happens-before B (because A/X and Y/B are compatible).

Implementation Example
Consider the implementation of SPIRV NonPrivatePointer on a target where all memory operations are cached, and the entire cache is flushed or invalidated at a release or acquire, respectively. A possible scheme is that, when translating a SPIRV program, memory operations marked NonPrivatePointer should not be cached, and the cache contents should not be touched during acquire and release operations.

This could be implemented using the tags that share the vulkan: prefix, as follows:

For memory operations:
- Operations with vulkan:nonprivate should bypass the cache.
- Operations with vulkan:private should be cached.
- Operations that specify neither or both should conservatively bypass the cache to ensure correctness.
For synchronizing operations:

- Operations with vulkan:nonprivate should not flush or invalidate the cache.
- Operations with vulkan:private should flush or invalidate the cache.
- Operations that specify neither or both should conservatively flush or invalidate the cache to ensure correctness.
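If it helps to see these rules as code, the following self-contained C++ sketch (reusing the (prefix, suffix) tag representation from the compatibility sketch earlier) makes the conservative defaults explicit. The function names are illustrative and not part of any LLVM or target API.

#include <string>
#include <utility>
#include <vector>

using Tag = std::pair<std::string, std::string>;
using TagSet = std::vector<Tag>;

static bool hasTag(const TagSet &Tags, const char *Prefix, const char *Suffix) {
  for (const Tag &T : Tags)
    if (T.first == Prefix && T.second == Suffix)
      return true;
  return false;
}

// Memory operations: cache only when unambiguously vulkan:private.
bool shouldBypassCache(const TagSet &Tags) {
  bool Private = hasTag(Tags, "vulkan", "private");
  bool NonPrivate = hasTag(Tags, "vulkan", "nonprivate");
  if (Private && !NonPrivate)
    return false; // cached
  return true;    // nonprivate, both, or neither: bypass conservatively
}

// Synchronizing operations: skip the flush/invalidate only when the
// operation is unambiguously vulkan:nonprivate.
bool mustFlushCache(const TagSet &Tags) {
  bool Private = hasTag(Tags, "vulkan", "private");
  bool NonPrivate = hasTag(Tags, "vulkan", "nonprivate");
  if (NonPrivate && !Private)
    return false; // no flush/invalidate needed
  return true;    // private, both, or neither: flush conservatively
}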
Note
In such an implementation, dropping the metadata on an operation, while not affecting correctness, may have significant performance implications, e.g. an operation bypassing the cache when it shouldn't.
Memory Types
MMRAs may express the selective synchronization of different memory types.
As an example, a target may expose a sync-as:<N> tag to pass information about which address spaces are synchronized by the execution of a synchronizing operation.

Note
Address spaces are used here as a common example, but this concept can apply to other “memory types”. What “memory types” means here is up to the target.
# let 1 = global address space
# let 3 = local address space

Thread T1:
  A: store %ptr1 # sync-as:1
  B: store %ptr2 # sync-as:3
  X: store atomic release ptr addrspace(0) %ptr3 # sync-as:3

Thread T2:
  Y: load atomic acquire ptr addrspace(0) %ptr3 # sync-as:3
  C: load %ptr2 # sync-as:3
  D: load %ptr1 # sync-as:1
In the above figure, X and Y are atomic operations on a location in the global address space. If X synchronizes with Y, then B happens-before C in the local address space. But no such statement can be made about operations A and D, although they are performed on a location in the global address space.

Implementation Example: Adding Address Space Information to Fences
Languages such as OpenCL C provide fence operations such as atomic_work_item_fence that can take an explicit address space to fence.

By default, LLVM has no means to carry that information in the IR, so the information is lost during lowering to LLVM IR. This means that targets such as AMDGPU have to conservatively emit instructions to fence all address spaces in all cases, which can have a noticeable performance impact in high-performance applications.
MMRAs may be used to preserve that information at the IR level, all the way through code generation. For example, a fence that only affects the global address space addrspace(1) may be lowered as

fence release # sync-as:1

and the target may use the presence of sync-as:1 to infer that it must only emit instructions to fence the global address space.

Note that as MMRAs are opt-in, a fence that does not have MMRA metadata could still be lowered conservatively, so this optimization would only apply if the front-end emits the MMRA metadata on the fence instructions.
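As a sketch of how a backend might consume this, the following self-contained C++ snippet collects the address spaces named by a fence's sync-as tags; an empty result means the fence carries no MMRA information and must conservatively fence everything. This is illustrative only, not AMDGPU's actual lowering code, and the helper name is ours.

#include <set>
#include <string>
#include <utility>
#include <vector>

using Tag = std::pair<std::string, std::string>;
using TagSet = std::vector<Tag>;

// Collect the address spaces named by sync-as:<N> tags on a fence.
// An empty result means "no information: fence all address spaces".
std::set<std::string> addressSpacesToFence(const TagSet &FenceTags) {
  std::set<std::string> Spaces;
  for (const Tag &T : FenceTags)
    if (T.first == "sync-as")
      Spaces.insert(T.second);
  return Spaces;
}

A target could then emit its cheaper global-only fence sequence when the returned set is exactly {"1"}, and fall back to fencing all address spaces otherwise.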
Additional Topics
Note
The following sections are informational.
Performance Impact
MMRAs are a way to capture optimization opportunities in the program. But when an operation mentions no tags or conflicting tags, the target may need to produce conservative code to ensure correctness at the cost of performance. This can happen in the following situations:
- When a target first introduces MMRAs, the frontend might not have been updated to emit them.
- An optimization may drop MMRA metadata.
- An optimization may add arbitrary tags to an operation.
Note that targets can always choose to ignore (or even drop) MMRAs and revert to the default behavior/codegen heuristics without affecting correctness.
Consequences of the Absence of happens-before
In the happens-before section, we defined how a happens-before relation between two instructions can be broken by leveraging compatibility between MMRAs. When the instructions are incompatible and there is no happens-before relation, we say that the instructions “do not have to be ordered relative to each other”.
“Ordering” in this context is a very broad term which covers both static and runtime aspects.
When there is no ordering constraint, we could statically reorder the instructions in an optimizer transform if the reordering does not break other constraints such as single-location coherence. Static reordering is one consequence of breaking happens-before, but it is not the most interesting one.
Run-time consequences are more interesting. When there is a happens-before relation between instructions, the target has to emit synchronization code to ensure that other threads will observe the effects of the instructions in the right order.
For instance, the target may have to wait for previous loads & stores to finish before starting a fence-release, or there may be a need to flush a memory cache before executing the next instruction. In the absence of happens-before, there is no such requirement and no waiting or flushing is required. This may noticeably speed up execution in some cases.
Combining Operations
If a pass can combine multiple memory or synchronizing operations into one, it needs to be able to combine MMRAs. One possible way to achieve this is by doing a prefix-wise union of the tag sets.
Let A and B be two tag sets, and U be the prefix-wise union of A and B. For every unique tag prefix P present in A or B:
- If either A or B has no tags with prefix P, no tags with prefix P are added to U.
- If both A and B have at least one tag with prefix P, all tags with prefix P from both sets are added to U.
Passes should avoid aggressively combining MMRAs, as this can result in significant losses of information. While this cannot affect correctness, it may affect performance.
As a general rule of thumb, common passes such as SimplifyCFG that aggressively combine/reorder operations should only combine instructions that have identical sets of tags. Passes that combine less frequently, or that are well aware of the cost of combining the MMRAs, can use the prefix-wise union described above; a sketch of such a combination follows the examples below.
Examples:
A: store release %ptr1 # foo:x, foo:y, bar:x
B: store release %ptr2 # foo:x, bar:y

# Unique prefixes P = [foo, bar]
# "foo:x" is common to A and B so it's added to U.
# "bar:x" != "bar:y" so it's not added to U.
U: store release %ptr3 # foo:x


A: store release %ptr1 # foo:x, foo:y
B: store release %ptr2 # foo:x, bux:y

# Unique prefixes P = [foo, bux]
# "foo:x" is common to A and B so it's added to U.
# No tags have the prefix "bux" in A.
U: store release %ptr3 # foo:x


A: store release %ptr1
B: store release %ptr2 # foo:x, bar:y

# Unique prefixes P = [foo, bar]
# No tags with "foo" or "bar" in A, so no tags added.
U: store release %ptr3
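The worked examples above keep exactly the tags that appear in both sets. The following self-contained C++ sketch (again using (prefix, suffix) string pairs) reproduces that behaviour; it is illustrative only and is not LLVM's actual MMRA combining helper.

#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <utility>

using Tag = std::pair<std::string, std::string>;
using TagSet = std::set<Tag>;

// Combine two tag sets as in the examples above: a tag is kept in U only if
// it is present in both A and B.
TagSet combineTags(const TagSet &A, const TagSet &B) {
  TagSet U;
  std::set_intersection(A.begin(), A.end(), B.begin(), B.end(),
                        std::inserter(U, U.begin()));
  return U;
}

// First example above: A = {foo:x, foo:y, bar:x}, B = {foo:x, bar:y}
//   combineTags(A, B) == {foo:x}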