C++11 introduced a standardized memory model. What does it mean? And how will it affect C++ programming?

Time: 2022-04-22 00:31:35

C++11 introduced a standardized memory model, but what exactly does that mean? And how is it going to affect C++ programming?

This article (by Gavin Clarke who quotes Herb Sutter) says that,

The memory model means that C++ code now has a standardized library to call regardless of who made the compiler and on what platform it's running. There's a standard way to control how different threads talk to the processor's memory.

"When you are talking about splitting [code] across different cores that's in the standard, we are talking about the memory model. We are going to optimize it without breaking the following assumptions people are going to make in the code," Sutter said.

Well, I can memorize this and similar paragraphs available online (as I've had my own memory model since birth :P) and can even post as answer to questions asked by others, but to be honest, I don't exactly understand this.

So, what I basically want to know is, C++ programmers used to develop multi-threaded applications even before, so how does it matter if it's POSIX threads, or Windows threads, or C++11 threads? What are the benefits? I want to understand the low-level details.

I also get this feeling that the C++11 memory model is somehow related to C++11 multi-threading support, as I often see these two together. If it is, how exactly? Why should they be related?

As I don't know how internals of multi-threading works, and what memory model means in general, please help me understand these concepts. :-)

6 Answers

#1


1715  

First, you have to learn to think like a Language Lawyer.

The C++ specification does not make reference to any particular compiler, operating system, or CPU. It makes reference to an abstract machine that is a generalization of actual systems. In the Language Lawyer world, the job of the programmer is to write code for the abstract machine; the job of the compiler is to actualize that code on a concrete machine. By coding rigidly to the spec, you can be certain that your code will compile and run without modification on any system with a compliant C++ compiler, whether today or 50 years from now.

The abstract machine in the C++98/C++03 specification is fundamentally single-threaded. So it is not possible to write multi-threaded C++ code that is "fully portable" with respect to the spec. The spec does not even say anything about the atomicity of memory loads and stores or the order in which loads and stores might happen, never mind things like mutexes.

Of course, you can write multi-threaded code in practice for particular concrete systems -- like pthreads or Windows. But there is no standard way to write multi-threaded code for C++98/C++03.

The abstract machine in C++11 is multi-threaded by design. It also has a well-defined memory model; that is, it says what the compiler may and may not do when it comes to accessing memory.

Consider the following example, where a pair of global variables are accessed concurrently by two threads:

           Global
           int x, y;

Thread 1            Thread 2
x = 17;             cout << y << " ";
y = 37;             cout << x << endl;

What might Thread 2 output?

Under C++98/C++03, this is not even Undefined Behavior; the question itself is meaningless because the standard does not contemplate anything called a "thread".

Under C++11, the result is Undefined Behavior, because loads and stores need not be atomic in general. Which may not seem like much of an improvement... And by itself, it's not.

But with C++11, you can write this:

           Global
           atomic<int> x, y;

Thread 1                 Thread 2
x.store(17);             cout << y.load() << " ";
y.store(37);             cout << x.load() << endl;

Now things get much more interesting. First of all, the behavior here is defined. Thread 2 could now print 0 0 (if it runs before Thread 1), 37 17 (if it runs after Thread 1), or 0 17 (if it runs after Thread 1 assigns to x but before it assigns to y).

What it cannot print is 37 0, because the default mode for atomic loads/stores in C++11 is to enforce sequential consistency. This just means all loads and stores must be "as if" they happened in the order you wrote them within each thread, while operations among threads can be interleaved however the system likes. So the default behavior of atomics provides both atomicity and ordering for loads and stores.
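For completeness, here is a minimal, self-contained version of the same experiment using std::thread (my own sketch; the names are arbitrary, and the set of possible outputs is exactly the one described above):

    #include <atomic>
    #include <iostream>
    #include <thread>

    std::atomic<int> x(0), y(0);

    void writer() {
        x.store(17);              // store() defaults to memory_order_seq_cst
        y.store(37);
    }

    void reader() {
        int a = y.load();         // load() defaults to memory_order_seq_cst
        int b = x.load();
        std::cout << a << " " << b << std::endl;
    }

    int main() {
        std::thread t1(writer), t2(reader);
        t1.join();
        t2.join();
        // Possible outputs: "0 0", "0 17", "37 17"; never "37 0".
    }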

Now, on a modern CPU, ensuring sequential consistency can be expensive. In particular, the compiler is likely to emit full-blown memory barriers between every access here. But if your algorithm can tolerate out-of-order loads and stores; i.e., if it requires atomicity but not ordering; i.e., if it can tolerate 37 0 as output from this program, then you can write this:

           Global
           atomic<int> x, y;

Thread 1                            Thread 2
x.store(17,memory_order_relaxed);   cout << y.load(memory_order_relaxed) << " ";
y.store(37,memory_order_relaxed);   cout << x.load(memory_order_relaxed) << endl;

The more modern the CPU, the more likely this is to be faster than the previous example.
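A typical place where relaxed ordering is enough on its own is a plain event counter: each increment must be atomic so no update is lost, but no ordering against other memory operations is needed. A small sketch (my own illustration, not part of the example above):

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    std::atomic<long> hits(0);   // shared statistics counter

    void worker() {
        for (int i = 0; i < 100000; ++i) {
            // Atomicity is required (no lost updates); ordering is not.
            hits.fetch_add(1, std::memory_order_relaxed);
        }
    }

    int main() {
        std::vector<std::thread> pool;
        for (int i = 0; i < 4; ++i) pool.emplace_back(worker);
        for (auto& t : pool) t.join();
        std::printf("%ld\n", hits.load(std::memory_order_relaxed));  // always 400000
    }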

Finally, if you just need to keep particular loads and stores in order, you can write:

           Global
           atomic<int> x, y;

Thread 1                            Thread 2
x.store(17,memory_order_release);   cout << y.load(memory_order_acquire) << " ";
y.store(37,memory_order_release);   cout << x.load(memory_order_acquire) << endl;

This takes us back to the ordered loads and stores -- so 37 0 is no longer a possible output -- but it does so with minimal overhead. (In this trivial example, the result is the same as full-blown sequential consistency; in a larger program, it would not be.)
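The canonical use of acquire/release is handing data from one thread to another through a flag. A hedged sketch (my own example): the release store publishes the ordinary write, and the acquire load that observes the flag is guaranteed to see that write.

    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready(false);

    void producer() {
        payload = 42;                                  // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) // 3. wait for the flag
            ;                                          //    (spin; fine for a demo)
        assert(payload == 42);                         // 4. guaranteed to see the data
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }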

Of course, if the only outputs you want to see are 0 0 or 37 17, you can just wrap a mutex around the original code. But if you have read this far, I bet you already know how that works, and this answer is already longer than I intended :-).

So, bottom line. Mutexes are great, and C++11 standardizes them. But sometimes for performance reasons you want lower-level primitives (e.g., the classic double-checked locking pattern). The new standard provides high-level gadgets like mutexes and condition variables, and it also provides low-level gadgets like atomic types and the various flavors of memory barrier. So now you can write sophisticated, high-performance concurrent routines entirely within the language specified by the standard, and you can be certain your code will compile and run unchanged on both today's systems and tomorrow's.
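As one illustration, the double-checked locking pattern mentioned above can now be written portably with these primitives. A sketch, assuming a hypothetical Widget type (in real code a function-local static or std::call_once is usually simpler):

    #include <atomic>
    #include <mutex>

    struct Widget { /* ... */ };           // hypothetical expensive-to-create type

    std::atomic<Widget*> instance(nullptr);
    std::mutex init_mutex;

    Widget* get_instance() {
        Widget* p = instance.load(std::memory_order_acquire);   // first (cheap) check
        if (p == nullptr) {
            std::lock_guard<std::mutex> lock(init_mutex);
            p = instance.load(std::memory_order_relaxed);        // second check, under the lock
            if (p == nullptr) {
                p = new Widget;
                instance.store(p, std::memory_order_release);     // publish the fully built object
            }
        }
        return p;
    }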

Although to be frank, unless you are an expert and working on some serious low-level code, you should probably stick to mutexes and condition variables. That's what I intend to do.

For more on this stuff, see this blog post.

#2


263  

I will just give the analogy with which I understand memory consistency models (or memory models, for short). It is inspired by Leslie Lamport's seminal paper "Time, Clocks, and the Ordering of Events in a Distributed System". The analogy is apt and has fundamental significance, but may be overkill for many people. However, I hope it provides a mental image (a pictorial representation) that facilitates reasoning about memory consistency models.

Let’s view the histories of all memory locations in a space-time diagram in which the horizontal axis represents the address space (i.e., each memory location is represented by a point on that axis) and the vertical axis represents time (we will see that, in general, there is not a universal notion of time). The history of values held by each memory location is, therefore, represented by a vertical column at that memory address. Each value change is due to one of the threads writing a new value to that location. By a memory image, we will mean the aggregate/combination of values of all memory locations observable at a particular time by a particular thread.

Quoting from "A Primer on Memory Consistency and Cache Coherence"

The intuitive (and most restrictive) memory model is sequential consistency (SC) in which a multithreaded execution should look like an interleaving of the sequential executions of each constituent thread, as if the threads were time-multiplexed on a single-core processor.

That global memory order can vary from one run of the program to another and may not be known beforehand. The characteristic feature of SC is the set of horizontal slices in the address-space-time diagram representing planes of simultaneity (i.e., memory images). On a given plane, all of its events (or memory values) are simultaneous. There is a notion of Absolute Time, in which all threads agree on which memory values are simultaneous. In SC, at every time instant, there is only one memory image shared by all threads. That's, at every instant of time, all processors agree on the memory image (i.e., the aggregate content of memory). Not only does this imply that all threads view the same sequence of values for all memory locations, but also that all processors observe the same combinations of values of all variables. This is the same as saying all memory operations (on all memory locations) are observed in the same total order by all threads.

In relaxed memory models, each thread will slice up address-space-time in its own way, the only restriction being that slices of each thread shall not cross each other because all threads must agree on the history of every individual memory location (of course, slices of different threads may, and will, cross each other). There is no universal way to slice it up (no privileged foliation of address-space-time). Slices do not have to be planar (or linear). They can be curved and this is what can make a thread read values written by another thread out of the order they were written in. Histories of different memory locations may slide (or get stretched) arbitrarily relative to each other when viewed by any particular thread. Each thread will have a different sense of which events (or, equivalently, memory values) are simultaneous. The set of events (or memory values) that are simultaneous to one thread are not simultaneous to another. Thus, in a relaxed memory model, all threads still observe the same history (i.e., sequence of values) for each memory location. But they may observe different memory images (i.e., combinations of values of all memory locations). Even if two different memory locations are written by the same thread in sequence, the two newly written values may be observed in different order by other threads.

[Picture from Wikipedia]

Readers familiar with Einstein’s Special Theory of Relativity will notice what I am alluding to. Translating Minkowski’s words into the memory models realm: address space and time are shadows of address-space-time. In this case, each observer (i.e., thread) will project shadows of events (i.e., memory stores/loads) onto his own world-line (i.e., his time axis) and his own plane of simultaneity (his address-space axis). Threads in the C++11 memory model correspond to observers that are moving relative to each other in special relativity. Sequential consistency corresponds to the Galilean space-time (i.e., all observers agree on one absolute order of events and a global sense of simultaneity).

The resemblance between memory models and special relativity stems from the fact that both define a partially-ordered set of events, often called a causal set. Some events (i.e., memory stores) can affect (but not be affected by) other events. A C++11 thread (or observer in physics) is no more than a chain (i.e., a totally ordered set) of events (e.g., memory loads and stores to possibly different addresses).

In relativity, some order is restored to the seemingly chaotic picture of partially ordered events, since the only temporal ordering that all observers agree on is the ordering among “timelike” events (i.e., those events that are in principle connectible by any particle going slower than the speed of light in a vacuum). Only the timelike related events are invariantly ordered. Time in Physics, Craig Callender.

In C++11 memory model, a similar mechanism (the acquire-release consistency model) is used to establish these local causality relations.

To provide a definition of memory consistency and a motivation for abandoning SC, I will quote from "A Primer on Memory Consistency and Cache Coherence"

For a shared memory machine, the memory consistency model defines the architecturally visible behavior of its memory system. The correctness criterion for a single processor core partitions behavior between “one correct result” and “many incorrect alternatives”. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many (more) incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads.

Relaxed or weak memory consistency models are motivated by the fact that most memory orderings in strong models are unnecessary. If a thread updates ten data items and then a synchronization flag, programmers usually do not care if the data items are updated in order with respect to each other but only that all data items are updated before the flag is updated (usually implemented using FENCE instructions). Relaxed models seek to capture this increased ordering flexibility and preserve only the orders that programmers “require” to get both higher performance and correctness of SC. For example, in certain architectures, FIFO write buffers are used by each core to hold the results of committed (retired) stores before writing the results to the caches. This optimization enhances performance but violates SC. The write buffer hides the latency of servicing a store miss. Because stores are common, being able to avoid stalling on most of them is an important benefit. For a single-core processor, a write buffer can be made architecturally invisible by ensuring that a load to address A returns the value of the most recent store to A even if one or more stores to A are in the write buffer. This is typically done by either bypassing the value of the most recent store to A to the load from A, where “most recent” is determined by program order, or by stalling a load of A if a store to A is in the write buffer. When multiple cores are used, each will have its own bypassing write buffer. Without write buffers, the hardware is SC, but with write buffers, it is not, making write buffers architecturally visible in a multicore processor.

Store-store reordering may happen if a core has a non-FIFO write buffer that lets stores depart in a different order than the order in which they entered. This might occur if the first store misses in the cache while the second hits or if the second store can coalesce with an earlier store (i.e., before the first store). Load-load reordering may also happen on dynamically-scheduled cores that execute instructions out of program order. That can behave the same as reordering stores on another core (Can you come up with an example interleaving between two threads?). Reordering an earlier load with a later store (a load-store reordering) can cause many incorrect behaviors, such as loading a value after releasing the lock that protects it (if the store is the unlock operation). Note that store-load reorderings may also arise due to local bypassing in the commonly implemented FIFO write buffer, even with a core that executes all instructions in program order.
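The store-load reordering described here is exactly what the classic store-buffering litmus test exposes. A sketch of that test in C++11 (my own illustration; with relaxed atomics, or with plain stores and loads on hardware that has store buffers, both loads can observe 0):

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x(0), y(0);
    int r1, r2;

    void run1() { x.store(1, std::memory_order_relaxed); r1 = y.load(std::memory_order_relaxed); }
    void run2() { y.store(1, std::memory_order_relaxed); r2 = x.load(std::memory_order_relaxed); }

    int main() {
        std::thread a(run1), b(run2);
        a.join();
        b.join();
        // Under sequential consistency (e.g. if every operation above used
        // memory_order_seq_cst), at least one store precedes the other thread's
        // load, so r1 == 0 && r2 == 0 is impossible. With relaxed ordering it is
        // allowed; a single run rarely shows it, but looping many times might.
        std::printf("r1=%d r2=%d\n", r1, r2);
    }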

Because cache coherence and memory consistency are sometimes confused, it is instructive to also have this quote:

Unlike consistency, cache coherence is neither visible to software nor required. Coherence seeks to make the caches of a shared-memory system as functionally invisible as the caches in a single-core system. Correct coherence ensures that a programmer cannot determine whether and where a system has caches by analyzing the results of loads and stores. This is because correct coherence ensures that the caches never enable new or different functional behavior (programmers may still be able to infer likely cache structure using timing information). The main purpose of cache coherence protocols is maintaining the single-writer-multiple-readers (SWMR) invariant for every memory location. An important distinction between coherence and consistency is that coherence is specified on a per-memory location basis, whereas consistency is specified with respect to all memory locations.

Continuing with our mental picture, the SWMR invariant corresponds to the physical requirement that there be at most one particle located at any one location but there can be an unlimited number of observers of any location.

#3


74  

This is now a years-old question, but being very popular, it's worth mentioning a fantastic resource for learning about the C++11 memory model: Herb Sutter's talk described below. I see no point in summing it up just to make yet another full answer, but given that this is the guy who actually wrote the standard, I think the talk is well worth watching.

Herb Sutter has a three-hour talk about the C++11 memory model titled "atomic<> Weapons", available on the Channel9 site - part 1 and part 2. The talk is pretty technical and covers the following topics:

  1. Optimizations, Races, and the Memory Model
  2. Ordering – What: Acquire and Release
  3. Ordering – How: Mutexes, Atomics, and/or Fences
  4. Other Restrictions on Compilers and Hardware
  5. Code Gen & Performance: x86/x64, IA64, POWER, ARM
  6. Relaxed Atomics

The talk doesn't elaborate on the API, but rather on the reasoning, background, under the hood and behind the scenes (did you know relaxed semantics were added to the standard only because POWER and ARM do not support synchronized load efficiently?).

#4


66  

It means that the standard now defines multi-threading, and it defines what happens in the context of multiple threads. Of course, people used varying implementations, but that's like asking why we should have a std::string when we could all be using a home-rolled string class.

When you're talking about POSIX threads or Windows threads, then this is a bit of an illusion as actually you're talking about x86 threads, as it's a hardware function to run concurrently. The C++0x memory model makes guarantees, whether you're on x86, or ARM, or MIPS, or anything else you can come up with.

#5


48  

For languages not specifying a memory model, you are writing code for the language and the memory model specified by the processor architecture. The processor may choose to re-order memory accesses for performance. So, if your program has data races (a data race is when it's possible for multiple cores / hyper-threads to access the same memory concurrently) then your program is not cross platform because of its dependence on the processor memory model. You may refer to the Intel or AMD software manuals to find out how the processors may re-order memory accesses.

Very importantly, locks (and concurrency semantics with locking) are typically implemented in a cross platform way... So if you are using standard locks in a multithreaded program with no data races then you don't have to worry about cross platform memory models.
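For example, a program that only touches shared data while holding a standard lock (shown here with C++11's std::mutex, purely as an illustration) needs no reasoning about processor memory models at all:

    #include <iostream>
    #include <mutex>
    #include <thread>

    std::mutex m;
    long total = 0;   // shared data, only touched while holding m

    void add(long n) {
        std::lock_guard<std::mutex> lock(m);  // lock/unlock provide all needed ordering
        total += n;
    }

    int main() {
        std::thread t1(add, 10), t2(add, 32);
        t1.join();
        t2.join();
        std::cout << total << std::endl;      // always 42
    }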

Interestingly, Microsoft compilers for C++ have acquire / release semantics for volatile which is a C++ extension to deal with the lack of a memory model in C++ http://msdn.microsoft.com/en-us/library/12a04hfd(v=vs.80).aspx. However, given that Windows runs on x86 / x64 only, that's not saying much (Intel and AMD memory models make it easy and efficient to implement acquire / release semantics in a language).

#6


22  

If you use mutexes to protect all your data, you really shouldn't need to worry. Mutexes have always provided sufficient ordering and visibility guarantees.

Now, if you used atomics, or lock-free algorithms, you need to think about the memory model. The memory model describes precisely when atomics provide ordering and visibility guarantees, and provides portable fences for hand-coded guarantees.

Previously, atomics would be done using compiler intrinsics, or some higher level library. Fences would have been done using CPU-specific instructions (memory barriers).
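To make that contrast concrete: what used to require vendor-specific intrinsics can now be written with the portable std::atomic_thread_fence. The intrinsic names in the comments are compiler/CPU-specific examples and are only meant as a reminder of the pre-C++11 situation:

    #include <atomic>

    int data = 0;
    std::atomic<bool> flag(false);

    void publish() {
        data = 123;
        // Pre-C++11, non-portable barriers (examples; availability depends on the toolchain):
        //   __sync_synchronize();   // GCC builtin full barrier
        //   _mm_sfence();           // x86 SSE store-fence intrinsic
        // C++11 portable equivalent:
        std::atomic_thread_fence(std::memory_order_release);
        flag.store(true, std::memory_order_relaxed);
    }

    void consume() {
        while (!flag.load(std::memory_order_relaxed)) { /* spin */ }
        std::atomic_thread_fence(std::memory_order_acquire);
        // The release/acquire fence pair orders the accesses, so data == 123
        // is guaranteed to be visible here.
    }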
