How can 2 threads share the same cache line?

Time: 2021-10-19 00:36:28

I am using a custom network protocol library. This library is built on TCP/IP and is supposed to be used for high-frequency messaging. It is a non-blocking library and uses callbacks as the interface to integrate with the caller.

I am no performance expert, and that is why I decided to ask this question here. The custom library comes with a particular constraint, outlined below:

"Callee should not invoke any of the library's API under the context of the callback thread. If they attempt to do so, the thread will hang"

The only way to overcome the API restriction is to start another thread that processes messages and invokes the library to send responses. The library thread and the processing thread would share a common queue, which would be protected by a mutex, and use wait_notify() calls to indicate the presence of a message.
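
For concreteness, here is a minimal sketch of that hand-off, assuming the payload is a std::string and using a std::condition_variable for the wait/notify pair (names like Message, on_message, and process_loop are placeholders, not the library's API):

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <string>

    using Message = std::string;  // stand-in for whatever the library delivers

    std::mutex mtx;
    std::condition_variable cv;
    std::queue<Message> pending;

    // Runs on the library's callback thread: enqueue and wake the worker.
    void on_message(Message msg) {
        {
            std::lock_guard<std::mutex> lock(mtx);
            pending.push(std::move(msg));
        }
        cv.notify_one();  // one wakeup (and likely one context switch) per message
    }

    // Runs on the processing thread: sleep until a message arrives, then respond.
    void process_loop() {
        for (;;) {
            std::unique_lock<std::mutex> lock(mtx);
            cv.wait(lock, [] { return !pending.empty(); });
            Message msg = std::move(pending.front());
            pending.pop();
            lock.unlock();
            // ... process msg and call the library's send API from this thread ...
        }
    }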

If I am receiving 80k messages per second, then I would be putting threads to sleep and waking them up pretty often, performing thread context switches ~80k times per second.

Plus, as there are two threads, they will not share the message buffer in the L1 cache. The cache line containing the message would first be filled by the library's thread, then evicted and pulled into the process thread's core's L1 cache. Am I missing something or is it possible that the library's design is not meant for high performance use cases?

My questions are:

  1. I have seen warnings like "Don't use this API in a callback's context as it can cause locks." across many libraries. What are the common design choices that cause such design constraints? They could use recursive locks if it were simply a matter of the same thread acquiring the lock multiple times. Is this a re-entrancy issue, and what challenges might lead an API owner to make a non-re-entrant API?

  2. Is there a way in the above design model, where the library thread and process thread can share the same core, and consequently share a cache line?

  3. How expensive are volatile sig_atomic_t's as a mechanism to share data between two threads?

  4. Given a high frequency scenario, what is a light-weight way to share information between two threads?

The library and my application are built on C++ and Linux.

2 solutions

#1

How can 2 threads share the same cache line?

Threads have nothing to do with cache lines. At least not explicitly. What can go wrong is a cache flush on context switch and TLB invalidation, but since the threads share the same virtual address mapping, caches should generally be oblivious to these things.

What are the common design choices that cause such design constraints?

Implementors of the library do not want to deal with:

  1. Complex locking schemes.
  2. Re-entrant logic (i.e. you call send(), the library calls you back with on_error(), from which you call send() again; supporting that would require special care on their part, as the sketch below illustrates).
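
To make the re-entrancy hazard concrete, here is a hypothetical illustration (not this library's actual code) of why calling back into an API from its own callback thread hangs, when the implementation holds a plain, non-recursive mutex across the callback:

    #include <mutex>

    std::mutex internal_mtx;                            // the library's internal lock
    void (*user_callback)(const char* data) = nullptr;  // registered by the caller

    // The library holds its internal mutex while dispatching callbacks ...
    void dispatch(const char* data) {
        std::lock_guard<std::mutex> lock(internal_mtx);
        user_callback(data);
    }

    // ... so a send() from inside the callback tries to lock a mutex the
    // same thread already owns: with std::mutex that is a deadlock.
    void send(const char* data) {
        std::lock_guard<std::mutex> lock(internal_mtx);
        // ... hand data to the transport ...
    }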

I personally consider it a very bad thing to have an API designed around callbacks when it comes to high performance, and especially to network-related things, though sometimes it makes life a lot simpler for both users and developers (in terms of sheer ease of writing the code). The only exception to this might be CPU interrupt handling, but that is a different story, and you can hardly call that an API.

They could use recursive locks if it were simply a matter of the same thread acquiring the lock multiple times.

Recursive mutexes are relatively expensive. People who care about run-time efficiency tend to avoid them where possible.

Is there a way in the above design model, where the library thread and process thread can share the same core, and consequently share a cache line?

Yes. You will have to pin both threads to the same CPU core, for example by using sched_setaffinity(). But this also goes beyond a single program: the whole environment must be configured correctly. For example, you may want to prevent the OS from running anything on that core except your two threads (including interrupts), and prevent these two threads from migrating to a different CPU.
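
As a minimal sketch of the pinning itself (Linux-specific; core number 2 is an arbitrary placeholder, and isolating that core from the scheduler and from IRQs is separate system configuration):

    #include <pthread.h>
    #include <sched.h>
    #include <thread>

    // Pin a std::thread to one CPU core; returns true on success.
    bool pin_to_core(std::thread& t, int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &set) == 0;
    }

    int main() {
        std::thread library_thread([] { /* run the library's event loop */ });
        std::thread process_thread([] { /* drain the queue, send responses */ });

        // Both threads share core 2, so data written by one is likely still
        // hot in that core's L1/L2 when the other reads it.
        pin_to_core(library_thread, 2);
        pin_to_core(process_thread, 2);

        library_thread.join();
        process_thread.join();
    }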

How expensive are volatile sig_atomic_t's as a mechanism to share data between two threads?

By itself it is not expensive. In a multi-core environment, however, you may see cache invalidation, stalls, increased MESI traffic, etc. Given that both of the threads are on the same core and nothing intrudes, the only penalty is not being able to cache the variable, which is OK since it should not be cached (i.e. the compiler will always fetch it from memory, be that a cache or main memory).
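
In practice this means one thread setting a flag and the other polling it, as sketched below; note that volatile only stops the compiler from caching the value in a register and gives no cross-thread ordering guarantees, so modern C++ would reach for std::atomic<int> at essentially the same cost:

    #include <csignal>  // sig_atomic_t

    volatile sig_atomic_t data_ready = 0;

    // Library thread: publish the fact that a message is waiting.
    void producer() { data_ready = 1; }

    // Processing thread: poll the flag (spin, or yield between checks).
    void consumer() { while (!data_ready) { /* spin */ } }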

Given a high frequency scenario, what is a light-weight way to share information between two threads?

Read and write from/to the same memory, possibly without any system calls, blocking calls, etc. For example, one can implement a ring buffer for two concurrent threads using memory barriers and nothing else, at least on the Intel architecture. You have to be extremely attentive to details in order to do that. If, however, something must be explicitly synchronized, then atomic instructions are the next level. Haswell also comes with Transactional Memory that can be used for low-overhead synchronization. After that, nothing is fast.
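
A minimal sketch of such a single-producer/single-consumer ring buffer, written with C++11 acquire/release atomics rather than raw barriers (on x86 these compile to plain loads and stores plus compiler ordering); the capacity N is assumed to be a power of two:

    #include <array>
    #include <atomic>
    #include <cstddef>

    template <typename T, std::size_t N>  // N must be a power of two
    class SpscRing {
    public:
        // Called only by the producer thread.
        bool push(const T& value) {
            std::size_t h = head_.load(std::memory_order_relaxed);
            if (h - tail_.load(std::memory_order_acquire) == N) return false;  // full
            buf_[h & (N - 1)] = value;
            head_.store(h + 1, std::memory_order_release);  // publish to consumer
            return true;
        }

        // Called only by the consumer thread.
        bool pop(T& out) {
            std::size_t t = tail_.load(std::memory_order_relaxed);
            if (head_.load(std::memory_order_acquire) == t) return false;  // empty
            out = buf_[t & (N - 1)];
            tail_.store(t + 1, std::memory_order_release);  // free the slot
            return true;
        }

    private:
        std::array<T, N> buf_{};
        // Keep the two indices on separate cache lines to avoid false sharing.
        alignas(64) std::atomic<std::size_t> head_{0};  // written only by the producer
        alignas(64) std::atomic<std::size_t> tail_{0};  // written only by the consumer
    };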

Also, take a look at the Intel Architectures Developer's Manual, Chapter 11, about memory cache & control.

#2

An important thing to keep in mind here is that when working on network applications, the more important performance metric is "latency-per-task" and not the raw CPU-cycle throughput of the entire application. To that end, thread message queues tend to be a very good method for responding to activity in the quickest possible fashion.

80k messages per second on today's server infrastructure (or even my Core i3 laptop) borders on insignificant territory -- especially insofar as L1 cache performance is concerned. If the threads are doing a significant amount of work, then it's not unreasonable at all to expect the CPU to flush through the L1 cache every time a message is processed; and if the messages are not doing very much work at all, then it just doesn't matter, because the messaging is probably going to be less than 1% of the CPU load regardless of L1 policy.

At that rate of messaging I would recommend a passive threading model, e.g. one where threads are woken up to handle messages and then fall back asleep. That will give you the best latency-vs-performance model. It's not the most performance-efficient method, but it will be the best at responding quickly to network requests (which is usually what you want to favor when doing network programming).

On today's architectures (2.8 GHz, 4+ cores), I wouldn't even begin to worry about raw performance unless I expected to be handling maybe 1 million queued messages per second. And even then, it'd depend a bit on exactly how much Real Work the messages are expected to perform. If they aren't expected to do much more than prep and send some packets, then 1 million is definitely conservative.

Is there a way in the above design model, where the library thread and process thread can share the same core, and consequently share a cache line?

No. I mean, sure there is, if you want to roll your own Operating System. But if you want to run in a multitasking environment with the expectation of sharing the CPU with other tasks, then "No." And locking threads to cores is something that is very likely to hurt your threads' average response times, without providing much in the way of better performance. (Any performance gain would be subject to the system being used exclusively for your software, and would probably evaporate on a system running multiple tasks.)

Given a high frequency scenario, what is a light-weight way to share information between two threads?

Message queues. :) Seriously. I don't mean to sound silly, but that's what message queues are for. They share information between two threads, and they're typically light-weight about it. If you want to reduce context switches, only signal the worker to drain the queue after some number of messages have accumulated (or after some timeout period, in case of low activity) -- but be wary that this will increase your program's response time/latency.
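
As a sketch of that batching idea (the threshold and timeout values are arbitrary placeholders; the consumer also wakes on a timeout so stragglers are not stuck in the queue during quiet periods):

    #include <chrono>
    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <queue>
    #include <string>

    constexpr std::size_t kBatchSize = 32;                   // placeholder
    constexpr auto kMaxWait = std::chrono::milliseconds(1);  // placeholder

    std::mutex mtx;
    std::condition_variable cv;
    std::queue<std::string> q;

    void producer_enqueue(std::string msg) {
        std::size_t depth;
        {
            std::lock_guard<std::mutex> lock(mtx);
            q.push(std::move(msg));
            depth = q.size();
        }
        if (depth >= kBatchSize) cv.notify_one();  // one wakeup per batch
    }

    void consumer_drain() {
        for (;;) {
            std::unique_lock<std::mutex> lock(mtx);
            cv.wait_for(lock, kMaxWait, [] { return q.size() >= kBatchSize; });
            while (!q.empty()) {  // drain a full batch, or timed-out stragglers
                std::string msg = std::move(q.front());
                q.pop();
                // ... process msg ...
            }
        }
    }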
