Efficient Linux sockets (DMA / zero-copy)

Time: 2022-12-31 11:03:33

I'm building a very high-performance Linux server (based on epoll, non-blocking sockets, and async disk IO [based on io_submit/io_getevents/eventfd]). Some of my benchmarks show that the way I handle sockets isn't efficient enough for my requirements. In particular, I'm concerned with getting data from a userspace buffer to the network card, and from the network card back to a userspace buffer (let's ignore the sendfile call for now).

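For context, a rough sketch of how the disk-IO side described above is typically wired together: io_submit completions are delivered through an eventfd, which the epoll loop watches like any other fd. This is an illustrative fragment, not code from the question; it assumes libaio (link with -laio) and a file opened with O_DIRECT using a suitably aligned buffer:

    #define _GNU_SOURCE
    #include <libaio.h>
    #include <stddef.h>
    #include <sys/epoll.h>
    #include <sys/eventfd.h>

    /* Submit one async read whose completion wakes an existing epoll loop. */
    static int submit_read_with_eventfd(io_context_t ctx, int epfd,
                                        int file_fd, void *buf, size_t len)
    {
        static struct iocb cb;               /* must outlive the request */
        struct iocb *cbs[1] = { &cb };

        int efd = eventfd(0, EFD_NONBLOCK);
        if (efd < 0)
            return -1;

        io_prep_pread(&cb, file_fd, buf, len, 0);
        io_set_eventfd(&cb, efd);            /* completion increments efd */
        if (io_submit(ctx, 1, cbs) != 1)
            return -1;

        /* Watch efd; on EPOLLIN, reap the result with io_getevents(). */
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
        return epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);
    }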

From what I understand, calling read/write on a non-blocking Linux socket isn't fully asynchronous: the system call blocks while it copies the buffer from userspace to the kernel (or the other way around), and only then returns. Is there a way to avoid this overhead on Linux? In particular, is there a fully asynchronous write call I can make on a socket that would return immediately, DMA the userspace buffer to the network card as necessary, and signal/set an event/etc. on completion? I know Windows has an interface for this, but I couldn't find anything like it for Linux.

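To make that behaviour concrete: even on a non-blocking socket, a successful write() has already copied the bytes into the kernel socket buffer by the time it returns; "non-blocking" only means the call won't wait for buffer space to appear. A minimal sketch (socket and buffer names are illustrative):

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    static ssize_t try_send(int sock, const void *buf, size_t len)
    {
        /* make the socket non-blocking */
        fcntl(sock, F_SETFL, fcntl(sock, F_GETFL, 0) | O_NONBLOCK);

        ssize_t n = write(sock, buf, len);
        if (n >= 0)
            return n;    /* n bytes already copied user -> kernel, synchronously */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return 0;    /* socket buffer full: wait for EPOLLOUT, then retry */
        return -1;       /* real error */
    }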

Thanks!

2 Answers

#1 (score: 19)

There's been some talk on linux-kernel recently about providing an API for something along these lines, but the sticking point is that you can't DMA from general userspace buffers to the network card, because:

  • What looks like contiguous data in the userspace linear address space is probably not contiguous in physical memory, which is a problem if the network card doesn't do scatter-gather DMA;
  • On many machines, not all physical memory addresses are "DMA-able". There's no way at the moment for a userspace application to specifically request a DMA-able buffer.

On recent kernels, you could try using vmsplice and splice together to achieve what you want: vmsplice the pages you want to send into a pipe (with SPLICE_F_GIFT), then splice them from the pipe into the socket (with SPLICE_F_MOVE), as sketched below.

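A minimal sketch of that path (the helper name is made up here; it assumes page-aligned data no larger than the pipe capacity, and with SPLICE_F_GIFT the pages must not be touched again after gifting):

    #define _GNU_SOURCE
    #include <fcntl.h>      /* vmsplice, splice, SPLICE_F_* */
    #include <sys/uio.h>
    #include <unistd.h>

    /* Gift pages into a pipe, then move the page references to a socket. */
    static ssize_t giftwrap_send(int sock, void *pages, size_t len)
    {
        int p[2];
        if (pipe(p) < 0)
            return -1;

        struct iovec iov = { .iov_base = pages, .iov_len = len };

        /* Hand the pages to the kernel; the caller gives up ownership. */
        ssize_t in = vmsplice(p[1], &iov, 1, SPLICE_F_GIFT);

        ssize_t out = -1;
        if (in > 0)
            out = splice(p[0], NULL, sock, NULL, (size_t)in, SPLICE_F_MOVE);

        close(p[0]);
        close(p[1]);
        return out;
    }

Real code would reuse one pipe across calls and handle short vmsplice/splice returns rather than creating a pipe per send.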

#2 (score: 1)

AFAIK you are using the most efficient calls available if you can't use sendfile(2). Various aspects of efficient, high-performance networking code are covered by The C10K problem.

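For reference, the sendfile(2) path this answer mentions keeps the file-to-socket copy entirely inside the kernel; a minimal sketch (descriptor names are illustrative):

    #include <sys/sendfile.h>
    #include <unistd.h>

    /* Stream a whole file to a socket without a userspace copy. */
    static int send_whole_file(int sock, int file_fd, size_t file_len)
    {
        off_t off = 0;   /* sendfile advances this as it goes */
        while ((size_t)off < file_len) {
            ssize_t n = sendfile(sock, file_fd, &off, file_len - (size_t)off);
            if (n < 0)
                return -1;   /* on non-blocking sockets, EAGAIN means wait for EPOLLOUT */
            if (n == 0)
                break;       /* unexpected EOF */
        }
        return 0;
    }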
