C11Tester: A Race Detector for C/C++ Atomics
Technical Report

Weiyu Luo (University of California, Irvine; Irvine, California, USA; weiyul7@uci.edu) and Brian Demsky (University of California, Irvine; Irvine, California, USA; bdemsky@uci.edu)
Abstract.

Writing correct concurrent code that uses atomics under the C/C++ memory model is extremely difficult. We present C11Tester, a race detector for the C/C++ memory model that can explore executions in a larger fragment of the C/C++ memory model than previous race detector tools. Relative to previous work, C11Tester’s larger fragment includes behaviors that are exhibited by ARM processors. C11Tester uses a new constraint-based algorithm to implement modification order that is optimized to allow C11Tester to make decisions in terms of application-visible behaviors. We evaluate C11Tester on several benchmark applications, and compare C11Tester’s performance to both tsan11rec, the state of the art tool that controls scheduling for C/C++; and tsan11, the state of the art tool that does not control scheduling.

1. Introduction

The C/C++11 standards added a weak memory model with support for low-level atomic operations (cpp11spec; c11spec) that allows experts to craft efficient concurrent data structures that scale better or provide stronger liveness guarantees than lock-based data structures. The potential benefits of atomics can lure both expert and novice developers to use them. However, writing correct concurrent code using these atomic operations is extremely difficult.

Simply executing concurrent code is not an effective approach to testing. Exposing concurrency bugs often requires executing a specific path that might only occur when the program is heavily loaded during deployment, executed on a specific processor, or compiled with a specific compiler. Some prior work helps record and replay buggy executions (deloren). Debuggers like Symbiosis (symbiosis) and Cortex (cortex) focus on sequential consistency and test programs by modifying thread scheduling of given initial executions. However, both the thread scheduling and the relaxed behavior of C/C++ atomics are sources of nondeterminism in C/C++ programs that use atomics. Thus, it is necessary to develop tools to help test for concurrency bugs. We present the C11Tester tool for testing C/C++ programs that use atomics.

Figure 1 presents an overview of the C11Tester system. C11Tester is implemented as a dynamically linked library together with an LLVM compiler pass, which instruments atomic operations, non-atomic accesses to shared memory locations, and fence operations with function calls into the C11Tester dynamic library. The C++ and pthread library functions are overridden by the C11Tester library—C11Tester implements its own threading library using fibers to precisely control the scheduling of each thread. The C11Tester library implements a race detector and C11Tester reports any races or assertion violations that it discovers.

Figure 1. C11Tester system overview

The C/C++ memory model defines the modification order relation to totally order all atomic stores to a memory location. This relation captures the notion of cache coherence. The modification order is not directly observable by the program execution — it is only observed indirectly through its effects on program visible behaviors such as the values returned by loads. Under the C/C++ memory model, modification order cannot be extended to be a total order over all stores that is consistent with the happens-before relation.

This paper presents a new technique for scaling a constraint-based treatment of the modification order relation to long executions. This technique allows C11Tester to support a larger fragment of the C/C++ memory model than previous race detectors. In particular, this technique can handle the full range of modification orders that are permitted by the C/C++ memory model.

Constraint-based modification order delays decisions about the modification order until the decisions have observable effects on the program’s behavior. For example, when an algorithm decides which store a load will read from, C11Tester adds the corresponding constraints to the modification order. This approach allows testing algorithms to focus on program visible behaviors such as the value a load reads and does not require them to eagerly decide the modification order.

Fibers provide a more efficient means to control thread schedules than kernel threads. However, C/C++ programs commonly make use of thread local storage (TLS) and fibers do not directly support TLS. This paper presents a new technique, thread context borrowing, that allows fiber-based scheduling to support thread local storage without incurring dependencies on TLS implementation details that can vary across different library versions.

1.1. Comparison to Prior Work on Testing C/C++11

Prior data race detectors for C/C++11, such as tsan11 (tsan11) and tsan11rec (tsan11rec), require that $\textit{hb}\cup\textit{rf}\cup\textit{mo}\cup\textit{sc}$ be acyclic, and thus miss potentially bug-revealing executions that are both allowed by the C/C++ memory model and produced by mainstream hardware, including ARM processors. We have found examples of bugs that C11Tester can detect but tsan11 and tsan11rec miss, because under this restriction the set of $\textit{hb}\cup\textit{rf}$ edges orders writes in the modification order.

C11Tester’s constraint-based approach to modification order supports a larger fragment of the C/C++ memory model than tsan11 and tsan11rec. C11Tester adds minor constraints to the C/C++ memory model to forbid out-of-thin-air (OOTA) executions for relaxed atomics. Furthermore, these constraints appear to incur minimal overheads on existing ARM processors (oota) while x86 and PowerPC processors already implement these constraints.

1.2. Contributions

This paper makes the following contributions:

  • Scalable Concurrency Testing Tool: It presents a tool for the C/C++ memory model that can test full programs.

  • Supports a Larger Fragment of the C/C++ Memory Model: It presents a tool that supports a larger fragment of the C/C++ memory model than previous tools.

  • Constraint-Based Modification Order: The modification order relation is not directly visible to the application, instead it constrains the behaviors of visible relations such as the reads-from relation. Eagerly selecting the modification order limits the choices of stores that a load can read from and thus limits the information available to algorithms. We develop a scalable constraint-based approach to modeling the modification order relation that allows algorithms to ignore the modification order relation and focus on program visible behaviors.

  • Support for Limiting Memory Usage: The size of the C/C++ execution graph and execution trace grows as the program executes and thus limits the length of executions that a testing tool can support. Naively freeing portions of the graph can cause a tool to produce executions that are forbidden by the memory model. We present techniques that can limit the memory usage of C11Tester while ensuring that C11Tester only produces executions that are allowed by the C/C++ memory model.

  • Fiber-based Support for Thread Local Storage: Fibers are the most efficient way to control the scheduling of the application under test, but supporting thread local storage with fibers is problematic. We develop a novel approach for borrowing the context of a kernel thread to support thread local storage.

  • Evaluation: We evaluate C11Tester on several applications and compare against both tsan11 and tsan11rec. We show that C11Tester can find bugs that tsan11 and tsan11rec miss. We present a performance comparison with both tsan11 and tsan11rec.

2. C/C++ Atomics

In this section, we present general background on the C/C++ memory model and then discuss the fragment of the C/C++ memory model that C11Tester supports. The C and C++ standards were extended in 2011 to include a weak memory model that provides precise guarantees about the behavior of both the compiler and the underlying processor. The standards divide memory locations into two types: normal types, which are accessed using normal memory primitives; and atomic types, which are accessed using atomic memory primitives. The standards forbid data races on normal memory types and allow arbitrary accesses to atomic memory types. Accesses to atomic memory types have an optional memory_order argument that explicitly specifies the ordering constraints. Any operation on an atomic object will have one of six memory orders, each of which falls into one or more of the following categories. Like all other tools for the C/C++ memory model, compilers, and work on formalization to our knowledge, C11Tester does not support the consume memory order and thus we omit consume in our presentation.

seq-cst::

memory_order_seq_cst – strongest memory ordering, there exists a total order of all operations with this memory ordering. Loads that are seq_cst either read from the last store in the seq_cst order or from some store that is not part of seq_cst total order.

release::

memory_order_release, memory_order_acq_rel, and memory_order_seq_cst – when a load-acquire reads from a store-release, it establishes a happens-before relation between the store and the load. Release sequences generalize this notion to allow intervening RMW operations to not break synchronization.

acquire::

memory_order_acquire, memory_order_acq_rel, and memory_order_seq_cst – may form release/acquire synchronization.

relaxed::

memory_order_relaxed – weakest memory ordering. The only constraints for relaxed memory operations are a per-location total order, the modification order, that is equivalent to cache coherence.

The C/C++ memory model expresses program behavior in the form of binary relations or orderings. We briefly summarize the relations:

  • Sequenced-Before: The evaluation order within a program establishes an intra-thread sequenced-before (sb) relation—a strict preorder of the atomic operations over the execution of a single thread.

  • Reads-From: The reads-from (rf) relation consists of store-load pairs (X,Y)(X,Y) such that YY takes its value from XX. In the C/C++ memory model, this relation is non-trivial, as a given load operation may read from one of many potential stores in the execution.

  • Synchronizes-With: The synchronizes-with (sw) relation captures the synchronization that occurs when certain atomic operations interact across threads.

  • Happens-Before: In the absence of memory operations with the consume memory ordering, the happens-before (hb) relation is the transitive closure of the union of the sequenced-before and the synchronizes-with relations.

  • Sequentially Consistent: All operations that declare the memory_order_seq_cst memory order have a total ordering (sc) in the program execution.

  • Modification Order: Each atomic object in a program has an associated modification order (mo)—a total order of all stores to that object—which informally represents an ordering in which those stores may be observed by the rest of the program.

2.1. Example

To explore some of the key concepts of the memory-ordering operations provided by the C/C++ memory model, consider the example in Figure 2, assuming that two independent threads execute the methods threadA() and threadB(). This example uses the C++ syntax for atomics; shared, concurrently-accessed variables are given an atomic type, whose loads and stores are marked with an explicit memory_order governing their inter-thread ordering and visibility properties. In the example, the memory operations are specified to have the relaxed memory ordering, which is the weakest ordering in the C/C++ memory model and allows memory operations to different locations to be reordered.

In this example, a few simple interleavings of threadA() and threadB() show that we may see executions in which $\{{\tt r1=r2=0}\}$, $\{{\tt r1=r2=1}\}$, or $\{{\tt r1=0}\wedge{\tt r2=1}\}$, but it is somewhat counter-intuitive that we may also see $\{{\tt r1=1}\wedge{\tt r2=0}\}$, in which the first load statement sees the second store but the second load statement does not see the first store. While this latter behavior cannot occur under a sequentially-consistent execution of this program, it is, in fact, allowed by the relaxed memory ordering used in the example (and achieved by compiler or processor reorderings).

Now, consider a modification of the same example, where the store and load on variable y (the second statement of threadA() and the first statement of threadB() in Figure 2) use memory_order_release and memory_order_acquire, respectively, so that when the load-acquire reads from the store-release, they form a release/acquire synchronization pair. Then in any execution where r1 = 1, and thus the load-acquire reads from the store-release, the synchronization between the store-release and the load-acquire forms an ordering between threadA() and threadB(): the actions in threadB() after the acquire must observe the effects of the actions in threadA() before the release. In the terminology of the C/C++ memory model, we say that all actions in threadA() sequenced before the release happen before all actions in threadB() sequenced after the acquire.

So when r1 = 1, threadB() must see r2 = 1. In summary, this modified example allows only three of the four previously-described behaviors: $\{{\tt r1=r2=0}\}$, $\{{\tt r1=r2=1}\}$, and $\{{\tt r1=0}\wedge{\tt r2=1}\}$.

#include <atomic>
#include <cstdio>
using namespace std;

atomic<int> x(0), y(0);

void threadA() {
    x.store(1, memory_order_relaxed);
    y.store(1, memory_order_relaxed);       // the store on y discussed in the text
}

void threadB() {
    int r1 = y.load(memory_order_relaxed);  // the load on y discussed in the text
    int r2 = x.load(memory_order_relaxed);
    printf("r1 = %d\n", r1);
    printf("r2 = %d\n", r2);
}
Figure 2. A Variant of Message Passing in C++
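
The release/acquire modification discussed above can be sketched as follows. This is an illustrative harness and not code from the paper: the function name run_message_passing and the use of std::thread are our own assumptions. It runs the two thread bodies once and returns the pair (r1, r2); under release/acquire the outcome r1 = 1 with r2 = 0 must never appear.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical harness: runs the modified example once and returns (r1, r2).
std::pair<int, int> run_message_passing() {
    std::atomic<int> x(0), y(0);
    int r1 = -1, r2 = -1;

    std::thread a([&] {
        x.store(1, std::memory_order_relaxed);
        y.store(1, std::memory_order_release);   // store-release on y
    });
    std::thread b([&] {
        r1 = y.load(std::memory_order_acquire);  // load-acquire on y
        r2 = x.load(std::memory_order_relaxed);
    });
    a.join();
    b.join();

    // If the acquire read from the release (r1 == 1), the store to x
    // happens before the load of x, so r2 must be 1.
    assert(!(r1 == 1 && r2 == 0));
    return {r1, r2};
}
```

Running the harness repeatedly on real hardware will typically show only a subset of the three allowed outcomes, which is one reason a tool that controls scheduling and reads-from choices is useful.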

2.2. C11Tester’s C/C++ Memory Model Fragment

We next describe the fragment of the C/C++ memory model that C11Tester supports. Our memory model has the following changes based on the formalization of Batty et al. (c11popl):

1) Use the C/C++20 release sequence definition: Since the original C/C++11 memory model, the definition of release sequences has been weakened (releasesequences). This change is part of the C/C++20 standard (cpp-draft-n4849). C11Tester uses the newly weakened definition. The new definition of release sequences does not allow memory_order_relaxed stores by the thread that originally performed the memory_order_release store that heads the release sequence to appear in the release sequence.

2) Require that $\textit{hb}\cup\textit{sc}\cup\textit{rf}$ be acyclic: Supporting load buffering or out-of-thin-air executions is extremely difficult, and the existing approaches introduce high overheads in dynamic tools (prescientmemory; oopsla2013; toplascdschecker). Thus, we prohibit out-of-thin-air executions with an assumption similar to one made by much work on the C/C++ memory model: we add the constraint that the union of the happens-before, sequential consistency, and reads-from relations, i.e., $\textit{hb}\cup\textit{sc}\cup\textit{rf}$, is acyclic (vafeiadis2013relaxed). (Footnote 1: The C/C++11 memory model already requires that $\textit{hb}\cup\textit{sc}$ is acyclic.) This feature of the C/C++ memory model is known to be generally problematic and similar solutions have been proposed to fix the C/C++ memory model (mspc14; N3786; N3710; oota).

3) Strengthen consume atomics to acquire: No compilers support the consume access mode. Instead, all compilers strengthen consume atomics to acquire.

We formalize the above changes in the Appendix. Our fragment of the C/C++ memory model is larger than that of tsan11 and tsan11rec (tsan11; tsan11rec). The tsan11 and tsan11rec tools add a very strong restriction to the C/C++ memory model that requires that $\textit{hb}\cup\textit{sc}\cup\textit{rf}\cup\textit{mo}$ be acyclic.

3. C11Tester Overview

We present our algorithm in this section. In our presentation, we adapt some terminology and symbols from stateless model checking (dpor). We denote the initial state with $s_0$. We associate every state transition $t$ taken by thread $p$ with the dynamic operation that effected the transition. We use $\textit{enabled}(s)$ to denote the set of all threads that are enabled in state $s$ (threads can be disabled when waiting on a mutex or condition variable, or when completed). We say that $\textit{next}(s,p)$ is the next transition in thread $p$ at state $s$.

procedure Explore
    $s := s_0$
    while $\textit{enabled}(s)$ is not empty do
        Select $p$ from $\textit{enabled}(s)$
        $t := \textit{next}(s,p)$
        $\textit{behaviors}(t) := \{\text{Initial behaviors}\}$
        Select a behavior $b$ from $\textit{behaviors}(t)$
        $s := \textit{Execute}(s,t,b)$
    end while
end procedure
Figure 3. Pseudocode for C11Tester’s Algorithm

Figure 3 presents pseudocode for C11Tester’s exploration algorithm. C11Tester calls Explore multiple times—each time generates one program execution. Recall from Section 2 that the thread schedule does not uniquely define the behavior of C/C++ atomics. Therefore, we split the exploration into two components: (1) selecting the next thread to execute and (2) selecting the behavior of that thread’s next operation. C11Tester has a pluggable framework for testing algorithms—C11Tester generates a set of legal choices for the next thread and behavior, and then the plugin selects the next thread and behavior. The default plugin implements a random strategy.
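
The two-level choice in Figure 3 (first a thread, then a behavior) can be illustrated with a small sketch. Everything here is a toy model of our own and not C11Tester's implementation: Transition, ToyThread, Chooser, Step, and explore are hypothetical names, and the set of behaviors for each transition is fixed up front, whereas C11Tester computes the legal behaviors dynamically.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical toy model: a transition is the list of behaviors it may
// exhibit (for example, the values a load may return).
using Transition = std::vector<int>;
using ToyThread = std::vector<Transition>;

// Stand-in for the testing plugin: given n > 0 legal choices,
// return an index in [0, n).
using Chooser = std::function<std::size_t(std::size_t)>;

struct Step {
    std::size_t thread;  // thread p that took the transition
    int behavior;        // behavior b selected for it
};

// One call produces one execution, mirroring the loop in Figure 3.
std::vector<Step> explore(const std::vector<ToyThread>& threads,
                          const Chooser& choose) {
    std::vector<std::size_t> pc(threads.size(), 0);  // next transition per thread
    std::vector<Step> trace;
    while (true) {
        // enabled(s): threads that still have a transition to take.
        std::vector<std::size_t> enabled;
        for (std::size_t i = 0; i < threads.size(); ++i) {
            if (pc[i] < threads[i].size()) enabled.push_back(i);
        }
        if (enabled.empty()) break;

        std::size_t p = enabled[choose(enabled.size())];  // select p
        const Transition& t = threads[p][pc[p]];          // t := next(s, p)
        assert(!t.empty());
        int b = t[choose(t.size())];                      // select b
        trace.push_back(Step{p, b});                      // s := Execute(s, t, b)
        ++pc[p];
    }
    return trace;
}
```

Passing a chooser backed by a random number generator gives a random strategy in the spirit of the default plugin; a deterministic chooser makes a run reproducible.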

Scheduling

Thread scheduling decisions are made at each atomic operation, threading operation, or synchronization operation (such as locking a mutex). Every time a thread finishes a visible operation, the next thread to execute is randomly selected from the set of enabled threads. However, when a thread performs several consecutive stores with memory order release or relaxed, the scheduler executes these stores consecutively without interruption from other threads. Executing these stores consecutively does not limit the set of possible executions and provides C11Tester with more stores to select from when deciding which store a load should read from. This decision also reduces bias in comparison to a purely randomized algorithm.

For example, in Figure 4, under a purely randomized algorithm, the probability that r1 = 1 is much greater than that of r1 = 2, because in order for r1 = 2, the scheduler must schedule threadA() twice before threadB() is scheduled. However, under C11Tester’s strategy, once threadA() is scheduled to run, both of its stores to x are performed consecutively. So when the load is encountered, the may-read-from set (defined in the paragraphs below) either contains only the initial store (the initialization of x) or contains all three stores. Thus, r1 is equally likely to read 1 or 2.

#include <atomic>
using namespace std;

atomic<int> x(0);                       // initial store

void threadA() {
    x.store(1, memory_order_relaxed);   // first store
    x.store(2, memory_order_relaxed);   // second store
}

void threadB() {
    int r1 = x.load(memory_order_relaxed);
}
Figure 4. Bias of a Purely Randomized Algorithm
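
As a rough worked calculation for Figure 4, assume (our assumption, not a claim from the text) that the scheduler picks uniformly among enabled threads at each step and that the load picks uniformly among the stores in its may-read-from set. Under a purely randomized algorithm, threadB() runs after zero, one, or two stores of threadA() with probability 1/2, 1/4, and 1/4, and the may-read-from set then has one, two, or three stores. Under C11Tester's strategy, threadB() runs either before both stores or after both, each with probability 1/2.

```latex
\begin{align*}
P_{\text{random}}(r1 = 1) &= \tfrac{1}{4}\cdot\tfrac{1}{2} + \tfrac{1}{4}\cdot\tfrac{1}{3} = \tfrac{5}{24},
&
P_{\text{random}}(r1 = 2) &= \tfrac{1}{4}\cdot\tfrac{1}{3} = \tfrac{2}{24},\\
P_{\text{consecutive}}(r1 = 1) &= \tfrac{1}{2}\cdot\tfrac{1}{3} = \tfrac{1}{6},
&
P_{\text{consecutive}}(r1 = 2) &= \tfrac{1}{2}\cdot\tfrac{1}{3} = \tfrac{1}{6}.
\end{align*}
```

Under these assumptions the purely randomized algorithm reads 1 two and a half times as often as 2, while executing the stores consecutively makes the two values equally likely.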

Transition Behaviors

The source of multiple behaviors for a given schedule arises from the reads-from relation—in C/C++, loads can read from stores besides just the “last” store to an atomic object.

We use the concept of a may-read-from set, which is an overapproximation of the stores that a given atomic load may read from that just considers constraints from the happens-before relation. The may-read-from set for a load $Y$ is constructed as:

\begin{align*}
\textit{may-read-from}(Y) = \{X\in\textit{stores}(Y)\mid{}&\neg(Y\stackrel{\textit{hb}}{\rightarrow}X)\;\wedge\\
&(\nexists Z\in\textit{stores}(Y)\,.\,X\stackrel{\textit{hb}}{\rightarrow}Z\stackrel{\textit{hb}}{\rightarrow}Y)\},
\end{align*}

where $\textit{stores}(Y)$ denotes the set of all stores to the same object from which $Y$ reads. C11Tester selects a store from the may-read-from set. C11Tester then checks that establishing this rf relation does not violate constraints imposed by the modification order, as described in Section 4. If the given selection is not allowed, C11Tester repeats the selection process. C11Tester delays the modification order check until after a selection is made to optimize for performance.
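
The may-read-from definition above can be sketched directly. This is a minimal illustration under our own assumptions: events are numbered with small integers, the happens-before relation is supplied as an already transitively closed boolean matrix (HbMatrix), and mayReadFrom is a hypothetical name. C11Tester itself derives happens-before information during execution and does not materialize such a matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: hb[a][b] is true iff event a happens before event b,
// and the matrix is already transitively closed.
using HbMatrix = std::vector<std::vector<bool>>;

// Returns the stores X in `stores` such that the load does not happen
// before X, and no other store Z satisfies X hb Z hb load.
std::vector<std::size_t> mayReadFrom(const std::vector<std::size_t>& stores,
                                     std::size_t load, const HbMatrix& hb) {
    assert(load < hb.size());
    std::vector<std::size_t> result;
    for (std::size_t x : stores) {
        if (hb[load][x]) continue;  // the load cannot read from a later store
        bool hidden = false;
        for (std::size_t z : stores) {
            if (z != x && hb[x][z] && hb[z][load]) {
                hidden = true;      // x is overwritten by z before the load
                break;
            }
        }
        if (!hidden) result.push_back(x);
    }
    return result;
}
```

For Figure 4 with events 0 (initial store), 1 and 2 (the stores in threadA()), and 3 (the load), the only happens-before edges among these events are 0 to 1, 0 to 2, 0 to 3, and 1 to 2, so the function returns all three stores. If the load were additionally ordered after both stores of threadA(), only store 2 would remain.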

4. Memory Model Support

In this section, we present how C11Tester efficiently supports key aspects of the C/C++ memory model.

CDSChecker (oopsla2013) initially introduced the technique of using a constraint-based treatment of modification order to remove redundancy from the search space it explores. There are essentially two types of constraints on the modification order: (1) that a store $s_A$ is modification ordered before a store $s_B$ and (2) that a store $s_A$ immediately precedes an RMW $r_B$ in the modification order.

CDSChecker models these constraints using a modification order graph. Two types of edges correspond to these two types of constraints. Edges only exist between two nodes if they both represent memory accesses to the same location. There is a cycle in the modification order graph if and only if the graph corresponds to an unsatisfiable set of constraints. Otherwise, a topological sort of the graph (with the additional constraint that an RMW node immediately follows the store that it reads from) yields a modification order that is consistent with the observed program behavior. CDSChecker used depth first search to check for cycles in the graph. CDSChecker would add edges to the modification order graph to determine whether a given reads-from edge was plausible — if the edge made the set of constraints unsatisfiable, CDSChecker would rollback the changes that the edge made to the graph.

This approach works well for model checking where the graphs are small—the fundamental scalability limits of model checking ensure that the executions always contain a very small number of stores. This approach is infeasible when executions (and thus the modification order graphs) can contain millions of atomic stores, because the graph traversals become extremely expensive.

4.1. Modification Order Graph

We next describe the modification order graph in more detail. We represent modification order (mo) as a set of constraints, built as a constraint graph, namely the modification order graph (mo-graph). A node in the mo-graph represents a single store or RMW in the execution. There are two types of edges in the graph. An mo edge from node $A$ to node $B$ represents the constraint $A\stackrel{\textit{mo}}{\rightarrow}B$. An rmw edge from node $A$ to node $B$ represents the constraint that $A$ must immediately precede $B$, or formally that $A\stackrel{\textit{mo}}{\rightarrow}B$ and $\forall C.\ C\neq A\wedge C\neq B\Rightarrow(A\stackrel{\textit{mo}}{\rightarrow}C\Rightarrow B\stackrel{\textit{mo}}{\rightarrow}C)\wedge(C\stackrel{\textit{mo}}{\rightarrow}B\Rightarrow C\stackrel{\textit{mo}}{\rightarrow}A)$.

C11Tester must only ensure that there exists some mo that satisfies the set of constraints, or equivalently an acyclic mo-graph. C11Tester dynamically adds edges to mo-graph when new rf and hb relations are formed. We briefly summarize the properties of mo as implications (oopsla2013) in Figure 5. C11Tester maintains a per-thread list of atomic memory accesses to each memory location. Whenever a new atomic load or store is executed, C11Tester uses this list to evaluate the implications in Figure 5 as well as additional implications for fences.

[Figure 5 panels, each an implication between two execution-graph fragments: Read-Read Coherence; Write-Read Coherence; Read-Write Coherence; Write-Write Coherence; Seq-cst / MO Consistency; Seq-cst Write-Read Coherence; RMW / MO Consistency; RMW Atomicity.]
Figure 5. Modification order implications. On the left side of each implication, $A$, $B$, $C$, $X$, and $Y$ must be distinct.

4.2. Clock Vectors

Due to the high cost of graph traversals for large graphs, graph traversals are not a feasible implementation approach for C11Tester. We next describe how we adapt clock vectors (l-clocks) to efficiently compute reachability in the mo-graph and scale the constraint-based modification order approach to large executions. We associate a clock vector with each node in the mo-graph. It is important to note that our use of clock vectors in the mo-graph is not to track the happens-before relation. Instead we use clock vectors to efficiently compute reachability between nodes in the mo-graph. Thus, our mo-graph clock vectors model a partial order that contains the current set of ordering constraints on the modification order.

Each event $E$ in C11Tester has a unique sequence number $s_E$. (Footnote 2: Events in each thread consist of atomic operations, thread creation and join, mutex lock and unlock, and other synchronization operations.) Sequence numbers are drawn from a global counter of events across all threads, which is incremented by one at each event. We denote the thread that executed $E$ as $t_E$. Each node in the mo-graph represents an atomic store. The initial mo-graph clock vector $\perp_{CV_A}$ associated with the node representing an atomic store $A$, the union operator $\cup$, and the comparison operator $\leq$ for mo-graph clock vectors are defined as follows:

\begin{align*}
\perp_{CV_A} &= \lambda t.\ \text{if } t == t_A \text{ then } s_A \text{ else } 0,\\
CV_1\cup CV_2 &\triangleq \lambda t.\ \max(CV_1(t),CV_2(t)),\\
CV_1\leq CV_2 &\triangleq \forall t.\ CV_1(t)\leq CV_2(t).
\end{align*}

Note that two mo-graph clock vectors can only be compared if their associated nodes represent atomic stores to the same memory location.
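
The three definitions above translate almost directly into code. The following is a minimal sketch, assuming threads are identified by small integer indices and that a sequence number of 0 means "no event"; the type name ClockVector and its member names are our own and are not taken from C11Tester.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical mo-graph clock vector: clock[t] is the entry for thread t.
struct ClockVector {
    std::vector<unsigned> clock;

    // Initial vector for a store executed by thread tid with sequence
    // number seq: seq in slot tid, 0 everywhere else.
    static ClockVector initial(std::size_t tid, unsigned seq) {
        assert(seq > 0);  // 0 is reserved for "no event" in this sketch
        ClockVector cv;
        cv.clock.assign(tid + 1, 0);
        cv.clock[tid] = seq;
        return cv;
    }

    // Missing slots are treated as 0.
    unsigned get(std::size_t t) const {
        return t < clock.size() ? clock[t] : 0;
    }

    // Union operator: pointwise maximum. Returns true if this vector changed.
    bool merge(const ClockVector& other) {
        bool changed = false;
        if (other.clock.size() > clock.size()) clock.resize(other.clock.size(), 0);
        for (std::size_t t = 0; t < other.clock.size(); ++t) {
            if (other.clock[t] > clock[t]) {
                clock[t] = other.clock[t];
                changed = true;
            }
        }
        return changed;
    }

    // Comparison operator: every entry is at most the matching entry of other.
    bool leq(const ClockVector& other) const {
        for (std::size_t t = 0; t < clock.size(); ++t) {
            if (clock[t] > other.get(t)) return false;
        }
        return true;
    }
};
```

Having merge report whether anything changed matches how the Merge procedure is used later to decide whether a change needs to be propagated further.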

The mo-graph clock vectors are updated when new mo relations are formed. For example, if $A\stackrel{\textit{mo}}{\rightarrow}B$ is a newly formed mo relation, then the node $B$’s mo-graph clock vector is merged with that of node $A$, i.e., $CV_B := CV_A\cup CV_B$. If $CV_B$ is updated by this merge, the change in $CV_B$ must be propagated to all nodes reachable from $B$ using the union operator.

Figure 6 presents pseudocode for updating the modification order graph. The Merge procedure merges the mo-graph clock vector of the src node into the dst node and returns true if the dst mo-graph clock vector changed. The AddEdge procedure adds a new modification order edge to the graph. It first compares mo-graph clock vectors to check if the edge is redundant and, if so, drops the edge update. Recall that RMW operations are ordered immediately after the stores that they read from. To implement this, AddEdge checks to see if the from node has an rmw edge, and if so, follows the rmw edge. AddEdge finally adds the relevant edge, and then propagates any changes in the mo-graph clock vectors. The AddRMWEdge procedure has two parameters, where the rmw node reads from the from node. It first adds an rmw edge and then migrates any outgoing edges from the source of the edge to the rmw node. Finally, it calls the AddEdge procedure to add a normal modification order edge and to propagate mo-graph clock vector changes.

Figure 7 presents pseudocode for the helper method AddEdges that adds a set of edges to the mo-graph. The parameter set is a set of atomic stores or RMWs, and $S$ is an atomic store or RMW. The GetNode method converts an atomic action to the corresponding node in the mo-graph. If such a node does not exist yet, then the method will create a new node in the mo-graph.

procedure Merge(Node dst, Node src)
    if src.cv $\leq$ dst.cv then
        return false
    end if
    dst.cv := dst.cv $\cup$ src.cv
    return true
end procedure

procedure AddEdge(Node from, Node to)
    mustAddEdge := (from.rmw == to $\vee$ from.tid == to.tid)
    if from.cv $\leq$ to.cv $\wedge$ $\neg$mustAddEdge then
        return
    end if
    while from.rmw $\neq$ null do
        next := from.rmw
        if next == to then
            break
        end if
        from := next
    end while
    from.edges := from.edges $\cup$ { to }
    if Merge(to, from) then
        Q := { to }
        while Q is not empty do
            node := remove item from Q
            for each dst in node.edges do
                if Merge(dst, node) then
                    Q := Q $\cup$ { dst }
                end if
            end for
        end while
    end if
end procedure

procedure AddRMWEdge(Node from, Node rmw)
    from.rmw := rmw
    for each dst in from.edges do
        if dst $\neq$ rmw then
            rmw.edges := rmw.edges $\cup$ { dst }
        end if
    end for
    from.edges := $\emptyset$
    AddEdge(from, rmw)
end procedure
Figure 6. Pseudocode for Updating mo-graph
procedure AddEdges(set, $S$)
    $n_S$ := GetNode($S$)
    for each $e$ in set do
        $n_e$ := GetNode($e$)
        AddEdge($n_e$, $n_S$)
    end for
end procedure
Figure 7. Helper method for adding a set of edges to the mo-graph

Theorem 4 guarantees the soundness of our use of mo-graph clock vectors. We present the theorem and its proof in Section 5. This theorem states that we can solely rely on mo-graph clock vectors to compute reachability between nodes in mo-graph.
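
To make the procedures in Figure 6 concrete, the following sketch transcribes them into C++ and adds a reachability query. It is an illustration under our own assumptions and not C11Tester's source: the Node layout, the fixed thread count, and the helper names are hypothetical, and reachable assumes, as the redundancy check in AddEdge suggests, that a node to is reachable from a node from exactly when from.cv $\leq$ to.cv.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical node layout following the fields used in Figure 6.
struct Node {
    std::size_t tid;            // thread that executed the store
    std::vector<unsigned> cv;   // mo-graph clock vector, one slot per thread
    std::vector<Node*> edges;   // outgoing mo edges
    Node* rmw = nullptr;        // outgoing rmw edge, if any

    Node(std::size_t tid_, unsigned seq, std::size_t nthreads)
        : tid(tid_), cv(nthreads, 0) {
        assert(tid_ < nthreads && seq > 0);
        cv[tid_] = seq;         // initial clock vector
    }
};

// a <= b pointwise.
bool leq(const std::vector<unsigned>& a, const std::vector<unsigned>& b) {
    assert(a.size() == b.size());
    for (std::size_t t = 0; t < a.size(); ++t)
        if (a[t] > b[t]) return false;
    return true;
}

// Merge: fold src's clock vector into dst; true if dst changed.
bool merge(Node& dst, const Node& src) {
    if (leq(src.cv, dst.cv)) return false;
    for (std::size_t t = 0; t < dst.cv.size(); ++t)
        dst.cv[t] = std::max(dst.cv[t], src.cv[t]);
    return true;
}

// Set-style insertion into an edge list.
void addUnique(std::vector<Node*>& edges, Node* n) {
    if (std::find(edges.begin(), edges.end(), n) == edges.end()) edges.push_back(n);
}

// AddEdge: add an mo edge and propagate clock vector changes.
void addEdge(Node* from, Node* to) {
    bool mustAddEdge = (from->rmw == to) || (from->tid == to->tid);
    if (leq(from->cv, to->cv) && !mustAddEdge) return;  // redundant edge
    while (from->rmw != nullptr) {                      // follow rmw edges
        Node* next = from->rmw;
        if (next == to) break;
        from = next;
    }
    addUnique(from->edges, to);
    if (merge(*to, *from)) {
        std::deque<Node*> queue{to};
        while (!queue.empty()) {
            Node* node = queue.front();
            queue.pop_front();
            for (Node* dst : node->edges)
                if (merge(*dst, *node)) queue.push_back(dst);
        }
    }
}

// AddRMWEdge: rmw reads from `from` and must immediately follow it.
void addRMWEdge(Node* from, Node* rmw) {
    from->rmw = rmw;
    for (Node* dst : from->edges)
        if (dst != rmw) addUnique(rmw->edges, dst);  // migrate outgoing edges
    from->edges.clear();
    addEdge(from, rmw);
}

// Assumed reachability test based only on clock vectors.
bool reachable(const Node& from, const Node& to) { return leq(from.cv, to.cv); }
```

In this sketch, adding an edge from B to C and then from A to B updates C's clock vector through propagation, so reachable(A, C) holds without any graph traversal at query time.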

4.3. Eliminating Rollback in Mo-graph

Prior work on constraint-based modification order utilized rollback when it was determined that a given reads-from relation was not feasible (oopsla2013; toplascdschecker). C11Tester may also hit such infeasible executions because the may-read-from set defined in Section 3 is an overapproximation of the set of stores that a load can read from. To determine precisely whether a load can read from a store, a naive approach is to add edges to the mo-graph and then utilize rollback if adding these edges introduces cycles in the mo-graph. However, the addition of clock vectors and clock vector propagation makes rollback much more expensive. It is thus critical that C11Tester avoids the need for rollback. We now discuss how C11Tester avoids rollback.

The mo-graph is updated whenever a new atomic store, atomic load, or atomic RMW is encountered. Processing a new atomic store, atomic load, or atomic RMW can potentially add multiple edges to the mo-graph. We next analyze each case to understand how to avoid rollback:

  • Atomic Store: Since an atomic load can only read from past stores, a newly created store node in mo-graph has no outgoing edges. By the properties of mo, only incoming edges from other nodes to this new node will be created. Hence, a new store node cannot introduce any cycles.

  • Atomic Load: Consider a new atomic load $Y$ that reads from a store $X_0$. Forming a new rf relation may only cause edges to be created from other nodes to the node representing the store $X_0$. We denote this set of "other nodes" as $\textit{ReadPriorSet}(X_0)$ and compute it using the ReadPriorSet procedure in Figure LABEL:alg:priorset. Lines LABEL:line:rpriorset-s1, LABEL:line:rpriorset-s2, and LABEL:line:rpriorset-s3 in the ReadPriorSet procedure consider statements 5, 4, and 6 in Section 29.3 of the C++11 standard. Line LABEL:line:rpriorset-s4 in the procedure considers write-read and read-read coherences. Therefore, the set returned by the ReadPriorSet procedure captures the set of stores from which new mo relations are to be formed if the rf relation is established.

    Before forming the rf relation, C11Tester checks whether any node in $\textit{ReadPriorSet}(X_0)$ is reachable from $X_0$. If so, then having load $Y$ read from store $X_0$ will introduce a cycle in the mo-graph, so we discard $X_0$ and try another store. While it is possible for a cycle to contain two or more edges in the set of newly created edges, this also implies that there is a cycle that contains only one new edge (since all new edges have the same destination).

  • Atomic RMWs: An atomic RMW is similar to both a load and store, but with the constraint that it must be immediately modification ordered after the store it reads from. We implement this by moving modification order edges from the store it reads from to the RMW. Thus, the same checks used by the load suffice to check for cycles for atomic RMWs.

Thus, C11Tester first computes a set of edges that reading from a given store would add to the mo-graph. Then for each edge, it checks the mo-graph clock vectors to see if the destination of the edge can reach the source of the edge. If none of the edges would create a cycle, it adds all of the edges to the mo-graph using the AddEdge and AddRMWEdge procedures.
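
The check described above can be sketched in C++ as follows. This is an illustration under stated assumptions rather than C11Tester's code: `StoreNode`, `leq`, and `canReadFrom` are hypothetical names, the prior set is taken as an input instead of being computed by the ReadPriorSet procedure, and skipping the candidate itself is an assumption made to avoid a self edge.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using ClockVector = std::vector<unsigned>;

// Hypothetical store node carrying only its mo-graph clock vector.
struct StoreNode {
    ClockVector cv;
};

// a <= b iff every slot of a is <= the matching slot of b (missing slots are 0).
bool leq(const ClockVector& a, const ClockVector& b) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        unsigned bi = i < b.size() ? b[i] : 0;
        if (a[i] > bi) return false;
    }
    return true;
}

// Returns true if a load may read from candidate without creating a cycle.
// priorSet plays the role of ReadPriorSet(candidate): each p in it would gain
// a new edge p -> candidate.
bool canReadFrom(const StoreNode& candidate,
                 const std::vector<const StoreNode*>& priorSet) {
    for (const StoreNode* p : priorSet) {
        if (p == &candidate) continue;  // assumption: no self edge is added
        // The edge p -> candidate closes a cycle iff p is already reachable
        // from candidate, i.e. CV_candidate <= CV_p.
        if (leq(candidate.cv, p->cv)) return false;
    }
    return true;
}
```

Because every check runs before any edge is added, a rejected candidate leaves the graph and its clock vectors untouched, so nothing has to be undone.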

5. Correctness of Mo-graph

To prove the correctness of mo-graphs, we first prove three lemmas and then prove Theorem 4. Lemma 1 and Lemma 2 characterize some important properties of mo-graph clock vectors. Lemma 3 proves one direction of Theorem 4. In the rest of this section, mo-graph clock vectors are simply referred to as clock vectors.

Lemma 1.

Let $C_0 \xrightarrow{\textit{mo}} C_1 \xrightarrow{\textit{mo}} \dots \xrightarrow{\textit{mo}} C_n$ be a path in a modification order graph $G$ such that $CV_{C_0} \leq \dots \leq CV_{C_n}$. Then if any new edge $E$ is added to $G$ using procedures in Figure 7, it holds that

(5.1) $CV_{C_0}' \leq \dots \leq CV_{C_n}'$

for the updated clock vectors. We define $CV_{C_i}' := CV_{C_i}$ if the values of $CV_{C_i}$ are not actually updated.

Proof.

To simplify notation, we define $CV_i := CV_{C_i}$ for all $i \in \{0, \dots, n\}$. Let us first consider the case where no rmw edge is added, i.e., the AddRMWEdge procedure is not called.

By the definition of the union operator, each slot in a clock vector can only increase when the Merge procedure is called. By the structure of the AddEdge procedure, a node $X$ is added to $Q$ if and only if this node's clock vector is updated by the Merge procedure.

Assume that adding the new edge $E$ updates at least one of $CV_0, \dots, CV_n$; otherwise, the claim is trivial. Let $i$ be the smallest integer in $\{0, \dots, n\}$ such that $CV_i$ is updated. Then $CV_k' = CV_k$ for all $k \in I := \{0, \dots, i-1\}$, and we have

(5.2) $CV_0' \leq \dots \leq CV_i'.$

If $i = 0$, then we take $I = \varnothing$. There are two cases.

Case 1: Suppose $CV_i' \leq CV_j$ for some $j \in \{i+1, \dots, n\}$, and let $j_0$ be the smallest such integer. Then $CV_k' = CV_k$ for all $k \in \{j_0, \dots, n\}$, as the nodes $\{C_{j_0}, \dots, C_n\}$ will not be added to $Q$ in the AddEdge procedure, and it holds trivially that

(5.3) $CV_{j_0}' \leq \dots \leq CV_n'.$

By lines 14 to 24 in the AddEdge procedure, we have

(5.4) $CV_k' = CV_k \cup CV_{k-1}',$

for all $k \in S := \{i+1, \dots, j_0 - 1\}$. If $j_0$ happens to be $i+1$, then take $S = \varnothing$. Hence $CV_{k-1}' \leq CV_k'$ for all $k \in S$. Combining this with inequality (5.2), we have

$CV_0' \leq \dots \leq CV_i' \leq \dots \leq CV_{j_0-1}'.$

Together with inequality (5.3), we only need to show that $CV_{j_0-1}' \leq CV_{j_0}'$ to complete the proof.

If $j_0 = i+1$, then we are done, because by assumption $CV_i' \leq CV_{j_0} = CV_{j_0}'$. If $j_0 > i+1$, then $CV_i' \leq CV_{j_0}$ and $CV_{i+1} \leq CV_{j_0}$ imply that $CV_{i+1}' = CV_{i+1} \cup CV_i' \leq CV_{j_0} = CV_{j_0}'$. Based on equation (5.4), we can deduce in a similar way that $CV_{i+2}' \leq \dots \leq CV_{j_0-1}' \leq CV_{j_0}'$.

Case 2: Suppose $CV_i' \nleq CV_j$ for all $j \in \{i+1, \dots, n\}$. Then by lines 14 to 24 in the AddEdge procedure, all nodes $\{C_i, \dots, C_n\}$ are added to $Q$ in the AddEdge procedure, and $CV_k' = CV_k \cup CV_{k-1}'$ for all $k \in S := \{i+1, \dots, n\}$. This recursive formula guarantees that $CV_{k-1}' \leq CV_k'$ for all $k \in S$. Therefore, combining with inequality (5.2), we have $CV_0' \leq \dots \leq CV_n'$.

Now suppose the newly added edge $E$ is a rmw edge, and let $P$ denote the path $C_0 \xrightarrow{\textit{mo}} \dots \xrightarrow{\textit{mo}} C_n$. If $E: X \xrightarrow{\textit{rmw}} C_i$ where $i \in \{0, \dots, n\}$ and $X$ is some node not in path $P$, then the path $P$ remains unchanged and AddEdge($X$, $C_i$) is called. Then the above proof shows that inequality (5.1) holds. If $E: C_i \xrightarrow{\textit{rmw}} X$, then $C_i \xrightarrow{\textit{mo}} C_{i+1}$ is migrated to $X \xrightarrow{\textit{mo}} C_{i+1}$ by lines 3 to 7 in the AddRMWEdge procedure, and $C_i \xrightarrow{\textit{mo}} X$ is added.

If $X$ is not in path $P$, then path $P$ becomes

$C_0 \xrightarrow{\textit{mo}} \dots \xrightarrow{\textit{mo}} C_i \xrightarrow{\textit{mo}} X \xrightarrow{\textit{mo}} C_{i+1} \xrightarrow{\textit{mo}} \dots \xrightarrow{\textit{mo}} C_n.$

Since AddEdge($C_i$, $X$) is called, the same proof as in the case without rmw edges applies. If $X$ is in path $P$, then $X$ can only be $C_{i+1}$ and the path $P$ remains unchanged. Otherwise, a cycle is created and this execution is invalid. In any case, the same proof applies. ∎
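
As a small worked instance of Case 2 in the proof above, assume four threads and a path $C_0 \rightarrow C_1 \rightarrow C_2$, and suppose a new edge $X \rightarrow C_0$ is added, where $X$ is a store by the fourth thread. The concrete vectors are hypothetical values chosen only to illustrate the propagation.

```latex
\begin{align*}
CV_0 &= (1,0,0,0), \quad CV_1 = (1,2,0,0), \quad CV_2 = (1,2,3,0), \quad CV_X = (0,0,0,5) \\
CV_0' &= CV_0 \cup CV_X  = (1,0,0,5) \\
CV_1' &= CV_1 \cup CV_0' = (1,2,0,5) \\
CV_2' &= CV_2 \cup CV_1' = (1,2,3,5)
\end{align*}
```

Here $i = 0$ and $CV_0' \nleq CV_j$ for $j \in \{1, 2\}$, so every node on the path is updated, and the chain $CV_0' \leq CV_1' \leq CV_2'$ still holds.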

Let $\vec{x} = (x_1, x_2, \dots, x_n)$. We define the projection function $U_i$ that extracts the $i^{\textit{th}}$ position of $\vec{x}$ as $U_i(\vec{x}) = x_i$, where we assume $i \leq n$.

Lemma 2.

Let $A$ be a store with sequence number $s_A$ performed by thread $i$ in an acyclic modification order graph $G$. Then $U_i(CV_A) = U_i(\perp_{CV_A}) = s_A$ throughout each execution that terminates.

Proof.

We prove this by contradiction. Let $S = \{A_1, A_2, \dots\}$ be the sequence of stores performed by thread $i$ with sequence numbers $\{s_1, s_2, \dots\}$, respectively. Suppose that at some point in a terminating execution a store in this sequence has an $i^{\textit{th}}$ slot that exceeds its sequence number, and let $A_n$ be the first store in the sequence with $U_i(CV_{A_n}) > s_n$. Sequence numbers are strictly increasing and, by the Merge procedure, $U_i(CV_{A_n}) \in \{s_{n+1}, s_{n+2}, \dots\}$. Let $U_i(CV_{A_n}) = s_N$ for some $N > n$.

For $U_i(CV_{A_n})$ to increase to $s_N$ from $s_n$, $CV_{A_n}$ must be merged with the clock vector of some node $X$ (i.e., some store $X$) in $G$ such that $U_i(CV_X) = s_N$. Such $X$ is modification ordered before $A_n$.

If $X$ is performed by thread $i$, then $X$ has to be the store $A_N$, because $U_i(CV_{A_j})$ is unique for all stores $A_j$ in the sequence $S$ other than $A_n$. Then $\perp_{CV_X} \geq \perp_{CV_{A_n}}$. By the definition of initial values of clock vectors and sequence numbers, $X$ happens after and is modification ordered after $A_n$. However, $X$ is also modification ordered before $A_n$, and we have a cycle in $G$. This is a contradiction.

If $X$ is not performed by thread $i$, then $U_i(\perp_{CV_X}) = 0$. For $U_i(CV_X)$ to be $s_N$, $X$ must be modification ordered after some store $Y$ in $G$ such that $U_i(CV_Y) = s_N$. If $Y$ is done by thread $i$, then the same argument as in the last paragraph leads to a contradiction; otherwise, by repeating the argument in this paragraph finitely many times (there are only finitely many stores in such a terminating execution), we would eventually deduce that $X$ is modification ordered after some store by thread $i$. Hence, we would have a cycle in $G$, a contradiction. ∎

Lemma 3.

Let $A$ and $B$ be two nodes that write to the same location in an acyclic modification order graph $G$. If $B$ is reachable from $A$ in $G$, then $CV_A \leq CV_B$.

Proof.

Suppose that $B$ is reachable from $A$ in $G$. Let $A \xrightarrow{\textit{mo}} C_1 \xrightarrow{\textit{mo}} \dots \xrightarrow{\textit{mo}} C_{n-1} \xrightarrow{\textit{mo}} B$ be the shortest path $P$ from $A$ to $B$ in graph $G$. To simplify notation, $X \xrightarrow{\textit{mo}} Y$ is abbreviated as $X \rightarrow Y$ in the following. As the AddRMWEdge procedure calls the AddEdge procedure to create an mo edge, we can assume that all the mo edges in $P$ are created by directly calling AddEdge.

Base Case 1: Suppose the path $P$ has length 1, i.e., $A$ immediately precedes $B$. Then when the edge $A \rightarrow B$ was formed by calling AddEdge($A$, $B$), $CV_B$ was merged with $CV_A$ in line 14 of the AddEdge procedure. In other words, $CV_B = CV_B \cup CV_A \geq CV_A$.

Base Case 2: Suppose the path $P$ has length 2, i.e., $A \rightarrow C_1 \rightarrow B$. There are two cases:

(a) If $A \rightarrow C_1$ was formed first, then $CV_A \leq CV_{C_1}$. When $C_1 \rightarrow B$ was formed, $CV_B$ was merged with $CV_{C_1}$ and $CV_{C_1} \leq CV_B$. According to Lemma 1, adding the edge $C_1 \rightarrow B$ or any edge not in path $P$ (if any such edges were formed before $C_1 \rightarrow B$ was formed) to $G$ would not break the inequality $CV_A \leq CV_{C_1}$. It follows that $CV_A \leq CV_{C_1} \leq CV_B$.

(b) If $C_1 \rightarrow B$ was formed first, then $CV_{C_1} \leq CV_B$. Based on Lemma 1, this inequality remains true when $A \rightarrow C_1$ was formed. Therefore $CV_A \leq CV_{C_1} \leq CV_B$.

Inductive Step: Suppose that $B$ being reachable from $A$ implies that $CV_A \leq CV_B$ for all paths with length $k$ or less, for some $k \geq 2$. We want to prove that the same holds for paths with length $k+1$. Let $P$ be a path from $A$ to $B$ with length $k+1$,

$P: A = C_0 \rightarrow C_1 \rightarrow \dots \rightarrow C_k \rightarrow C_{k+1} = B.$

We denote $A$ as $C_0$ and $B$ as $C_{k+1}$ in the following.

Let $E: C_i \rightarrow C_{i+1}$ be the last edge formed in path $P$, where $i \in \{0, \dots, k\}$. Then before edge $E$ was formed, the inductive hypothesis implies that $CV_{C_0} \leq \dots \leq CV_{C_i}$ and $CV_{C_{i+1}} \leq \dots \leq CV_{C_{k+1}}$, because both $C_0 \rightarrow \dots \rightarrow C_i$ and $C_{i+1} \rightarrow \dots \rightarrow C_{k+1}$ have length $k$ or less. Lemma 1 guarantees that

$CV_{C_0} \leq \dots \leq CV_{C_i},$
$CV_{C_{i+1}} \leq \dots \leq CV_{C_{k+1}}$

remain true whenever an edge not in path $P$ was added to $G$, as well as at the moment when $E$ was formed. Therefore when the edge $E$ was formed, we have $CV_{C_i} \leq CV_{C_{i+1}}$, and

$CV_A = CV_{C_0} \leq \dots \leq CV_{C_{k+1}} = CV_B.$ ∎

Theorem 4.

Let $A$ and $B$ be two nodes that write to the same location in an acyclic modification order graph $G$ for a terminating execution. Then $CV_A \leq CV_B$ iff $B$ is reachable from $A$ in $G$.

Proof.

Lemma 3 proves the backward direction, so we only need to prove the forward direction. Suppose that $CV_A \leq CV_B$. Let us first consider the situation where the graph $G$ contains no rmw edges.

Case 1: $A$ and $B$ are two stores performed by the same thread with thread id $i$. Then either $A$ happens before $B$ or $B$ happens before $A$. If $A$ happens before $B$, then $A$ precedes $B$ in the modification order because $A$ and $B$ are performed by the same thread. Hence $B$ is reachable from $A$ in $G$. We want to show that the other case is impossible.

If $B$ happens before $A$ and hence precedes $A$ in the modification order, then $A$ is reachable from $B$. By Lemma 3, $A$ being reachable from $B$ implies that $CV_B \leq CV_A$. Since $CV_A \leq CV_B$ by assumption, we deduce that $CV_A = CV_B$. This is impossible according to Lemma 2, because each store has a unique sequence number and $U_i(CV_A) = s_A \neq s_B = U_i(CV_B)$, implying that $CV_A \neq CV_B$.

Case 2: $A$ and $B$ are two stores done by different threads. Suppose that $A$ is performed by thread $i$. Let $CV_A = (\dots, s_A, \dots)$ and $CV_B = (\dots, t_b, \dots)$ where both $s_A$ and $t_b$ are in the $i^{\textit{th}}$ position. By assumption, we have $0 < s_A \leq t_b$.

Since $B$ is not performed by thread $i$, we have $U_i(\perp_{CV_B}) = 0$. We can apply an argument similar to the second, third, and fourth paragraphs in the proof of Lemma 2 and deduce that $B$ is modification ordered after $A$ or some store sequenced after $A$. Since modification order is consistent with the sequenced-before relation, it follows that $B$ is reachable from $A$ in graph $G$.

Now, consider the case where rmw edges are present. Adding a rmw edge from a node $S$ to a node $R$ first transfers to $R$ all outgoing mo edges coming from $S$ and then adds a normal mo edge from $S$ to $R$. So, any updates in $CV_S$ are propagated to all nodes that are reachable from $S$. Therefore, the above argument still applies. ∎

6. Operational Model

We present our operational model with respect to the tsan11 (tsan11) core language described by the grammar in Figure 8. A program is a sequence of statements. LocNA and LocA denote disjoint sets of non-atomic and atomic memory locations. A statement can take one of these forms: an if statement, an assignment of the result of an expression to a non-atomic location, a fork of a new thread, a join of a thread via its thread handle, or an atomic statement. The symbol $\epsilon$ denotes an empty statement. Atomic statements denoted by StmtA include atomic loads, stores, RMWs, and fences. An RMW takes a functor, F, to implement RMW operations, such as atomic_fetch_add. We omit loops for simplicity and leave the details of an expression unspecified. We omit lock and unlock operations because they can be implemented with atomic statements.

Prog  ::= Stmt ; $\epsilon$
Stmt  ::= Stmt ; Stmt
        | if (LocNA) {Stmt} else {Stmt}
        | LocNA := Expr
        | LocNA = Fork(Prog)
        | Join(LocNA)
        | StmtA
        | $\epsilon$
StmtA ::= LocNA = Load(LocA, MO)
        | Store(LocNA, LocA, MO)
        | RMW(LocA, MO, F)
        | Fence(MO)
MO    ::= relaxed | release | acquire | rel_acq
        | seq_cst
Expr  ::= <literal> | LocNA | Expr op Expr
Figure 8. Syntax for our core language
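
To make the grammar concrete, here is a small C++ program of the kind the core language is meant to model, with comments mapping each statement to a production. The mapping and the function name `messagePassing` are illustrative assumptions, not part of the paper. The reader uses a single if statement rather than a spin loop because the core language omits loops.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Returns 42 if the reader observed the flag, or -1 if it ran too early.
int messagePassing() {
    int data = 0;                 // a LocNA location
    std::atomic<int> flag{0};     // a LocA location
    int seen = -1;                // a LocNA location

    std::thread writer([&] {      // LocNA = Fork(Prog)
        data = 42;                                  // LocNA := Expr
        flag.store(1, std::memory_order_release);   // Store(LocNA, LocA, release)
    });
    std::thread reader([&] {      // LocNA = Fork(Prog)
        int f = flag.load(std::memory_order_acquire);   // LocNA = Load(LocA, acquire)
        if (f) {                  // if (LocNA) {Stmt} else {Stmt}
            seen = data;          // safe: the acquire load read the release store
        } else {
            seen = -1;
        }
    });

    writer.join();                // Join(LocNA)
    reader.join();                // Join(LocNA)
    return seen;
}
```

Both outcomes are allowed, since the result depends on scheduling. Replacing both memory orders with relaxed would make the read of `data` a data race whenever the flag is observed.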

States:

$\textit{Tid} \triangleq \mathbb{Z}$, $\textit{Seq} \triangleq \mathbb{Z}$, $\mathbb{C} : \textit{Tid} \rightarrow \textit{CV}$,
$\mathbb{F}^{\textit{rel}} : \textit{Tid} \rightarrow \textit{CV}$, $\mathbb{RF} : \textit{Seq} \rightarrow \textit{CV}$, $\mathbb{F}^{\textit{acq}} : \textit{Tid} \rightarrow \textit{CV}$

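[RELEASE STORE] $\dfrac{\mathbb{RF}' = \mathbb{RF}[s := \mathbb{C}_t]}{(\mathbb{C}, \mathbb{RF}, \mathbb{F}^{\textit{rel}}, \mathbb{F}^{\textit{acq}}) \Rightarrow^{\textit{store\_rel}(s, t)} (\mathbb{C}, \mathbb{RF}', \mathbb{F}^{\textit{rel}}, \mathbb{F}^{\textit{acq}})}$

[RELAXED STORE] $\dfrac{\mathbb{RF}' = \mathbb{RF}[s := \mathbb{F}^{\textit{rel}}_t]}{(\mathbb{C}, \mathbb{RF}, \mathbb{F}^{\textit{rel}}, \mathbb{F}^{\textit{acq}}) \Rightarrow^{\textit{store\_rlx}(s, t)} (\mathbb{C}, \mathbb{RF}', \mathbb{F}^{\textit{rel}}, \mathbb{F}^{\textit{acq}})}$

[RELEASE RMW] $\dfrac{\mathbb{RF}' = \mathbb{RF}[s := \mathbb{C}_t \cup \mathbb{RF}_{s'}]}{(\mathbb{C}, \mathbb{RF}, \mathbb{F}^{\textit{rel}}, \mathbb{F}^{\textit{acq}}) \Rightarrow^{\textit{rmw\_rel}(s, t),\ \textit{rf}(s', t')} (\mathbb{C}, \mathbb{RF}', \mathbb{F}^{\textit{rel}}, \mathbb{F}^{\textit{acq}})}$

[RELAXED RMW] $\dfrac{\mathbb{RF}' = \mathbb{RF}[s := \mathbb{F}^{\textit{rel}}_t \cup \mathbb{RF}_{s'}]}{(\mathbb{C}, \mathbb{RF}, \mathbb{F}^{\textit{rel}}, \mathbb{F}^{\textit{acq}}) \Rightarrow^{\textit{rmw\_rlx}(s, t),\ \textit{rf}(s', t')} (\mathbb{C}, \mathbb{RF}', \mathbb{F}^{\textit{rel}}, \mathbb{F}^{\textit{acq}})}$

[ACQUIRE LOAD] $\dfrac{\mathbb{C}' = \mathbb{C}[t := \mathbb{C}_t \cup \mathbb{RF}_{s'}]}{(\mathbb{C}, \mathbb{RF}, \mathbb{F}^{\textit{rel}}, \mathbb{F}^{\textit{acq}}) \Rightarrow^{\textit{load\_acq}(s, t),\ \textit{rf}(s', t')} (\mathbb{C}', \mathbb{RF}, \mathbb{F}^{\textit{rel}}, \mathbb{F}^{\textit{acq}})}$

[RELAXED LOAD] $\dfrac{{\mathbb{F}^{\textit{acq}}}' = \mathbb{F}^{\textit{acq}}[t := \mathbb{F}^{\textit{acq}}_t \cup \mathbb{RF}_{s'}]}{(\mathbb{C}, \mathbb{RF}, \mathbb{F}^{\textit{rel}}, \mathbb{F}^{\textit{acq}}) \Rightarrow^{\textit{load\_rlx}(s, t),\ \textit{rf}(s', t')} (\mathbb{C}, \mathbb{RF}, \mathbb{F}^{\textit{rel}}, {\mathbb{F}^{\textit{acq}}}')}$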

[RELEASE FENCE] {mathpar} \inferrule* F^rel ’ = F^rel [ t := C _t ] ( C, RF, F^rel , F^acq ) ⇒^fence_rel(t)