Does Linux have zero-copy? splice or sendfile?

Time: 2022-12-31 11:03:39

When splice was introduced, it was discussed on the kernel list that sendfile had been re-implemented on top of splice. The documentation for the splice flag SPLICE_F_MOVE states:

Attempt to move pages instead of copying. This is only a hint to the kernel: pages may still be copied if the kernel cannot move the pages from the pipe, or if the pipe buffers don't refer to full pages. The initial implementation of this flag was buggy: therefore starting in Linux 2.6.21 it is a no-op (but is still permitted in a splice() call); in the future, a correct implementation may be restored.

So does that mean Linux has no zero-copy method for writing to sockets? Or was this fixed at some point and nobody has updated the documentation for years? Does either sendfile or splice have a zero-copy implementation in any of the latest 3.x kernel versions?

Since Google has no answer to this query, I'm creating this question for the next poor schmuck who wants to know whether there is any benefit to using vmsplice and splice, or sendfile, over plain old write.

2 answers

#1


15  

sendfile has been zero-copy from the beginning, and still is (assuming the hardware allows it, which is usually the case). Being zero-copy was the entire point of having this syscall in the first place. Nowadays sendfile is implemented as a wrapper around splice.
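
For reference, here is a minimal sketch of what "just use sendfile" might look like in C. The helper names send_file_range and sendfile_demo are made up for illustration, and the demo uses a temp file and a socketpair in place of a real file and TCP connection, so treat it as one way to call the API rather than a definitive implementation.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send up to `count` bytes of in_fd, starting at `offset`, to out_fd.
 * Returns the number of bytes sent, or -1 on error (errno is set). */
ssize_t send_file_range(int out_fd, int in_fd, off_t offset, size_t count)
{
    size_t sent = 0;
    while (sent < count) {
        /* The kernel advances `offset` by however much it transferred. */
        ssize_t n = sendfile(out_fd, in_fd, &offset, count - sent);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0) /* end of file reached before `count` bytes */
            break;
        sent += (size_t)n;
    }
    return (ssize_t)sent;
}

/* Hypothetical self-check: a temp file stands in for the payload and a
 * socketpair stands in for a TCP connection. Returns bytes received. */
ssize_t sendfile_demo(void)
{
    static const char msg[] = "hello, sendfile";
    const size_t len = sizeof msg - 1;
    char buf[64];
    int sv[2];

    FILE *tmp = tmpfile();
    if (tmp == NULL)
        return -1;
    if (fwrite(msg, 1, len, tmp) != len || fflush(tmp) != 0)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    ssize_t sent = send_file_range(sv[0], fileno(tmp), 0, len);
    ssize_t got = (sent <= 0) ? -1 : read(sv[1], buf, sizeof buf);
    /* Whatever arrived must match the start of the file contents. */
    assert(got < 0 || memcmp(buf, msg, (size_t)got) == 0);

    close(sv[0]);
    close(sv[1]);
    fclose(tmp);
    return got;
}
```

Note that there is no user-space buffer holding the file data on the sending side: the only buffer in the sketch belongs to the receiver.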

That suggests that splice, too, is zero-copy, and this is indeed the case. At least in theory, and at least in some cases. The problem is figuring out how to correctly use it so it works reliably and so it is zero-copy. The documentation is... sparse, to say the least.

In particular, splice only works zero-copy if the pages were given as a "gift", i.e. you no longer own them (formally, that is; in reality you still do). That is a non-issue if you simply splice a file descriptor onto a socket, but it is a big issue if you want to splice data from your application's address space, or from one pipe to another. It is unclear what to do with the pages afterwards (and when). The documentation states that you may never touch the pages again or do anything with them. So if you follow the letter of the documentation, you must leak the memory.
That's obviously not correct (it can't be), but there is no good way of knowing (for you, at least) when it is safe to reuse or release that memory. The kernel doing a sendfile would know: as soon as it receives the TCP ACK, it knows that the data is never needed again. The problem is that you never get to see an ACK. All you know when splice has returned is that the data has been accepted for sending (you have no idea whether it has already been sent or received, nor when that will happen).
This means you need to figure it out at the application layer, either by doing manual ACKs (which you get for free if you already run a reliable protocol over UDP), or by assuming that if the other side sends an answer to your request, it obviously must have received the request.
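
To make the ownership problem concrete, here is a rough sketch of the "gift, then wait for an application-level ACK" pattern. Everything named here (gift_buffer_to_socket, write_all, gift_demo, the one-byte 'A' acknowledgement) is an assumption made for illustration; the man pages do not prescribe this protocol, and the fallback to write is my own addition for systems where vmsplice is unavailable.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Plain copying path, used only when vmsplice is not available. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n <= 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Queue a page-aligned buffer on sock_fd by reference: vmsplice it into a
 * pipe, then splice the pipe into the socket. `len` must fit in the pipe.
 * Returns 0 on success, -1 on error. */
int gift_buffer_to_socket(int sock_fd, void *buf, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    /* SPLICE_F_GIFT requires a page-aligned address and length. */
    assert((uintptr_t)buf % page == 0 && len % page == 0);

    int p[2];
    if (pipe(p) < 0)
        return -1;

    struct iovec iov = { .iov_base = buf, .iov_len = len };
    int rc = -1;
    ssize_t queued = vmsplice(p[1], &iov, 1, SPLICE_F_GIFT);
    if (queued < 0) {
        /* vmsplice failed (for example ENOSYS in a restricted sandbox):
         * fall back to a plain copying write. */
        rc = write_all(sock_fd, buf, len);
    } else if ((size_t)queued == len) {
        size_t left = len;
        while (left > 0) {
            ssize_t n = splice(p[0], NULL, sock_fd, NULL, left, SPLICE_F_MOVE);
            if (n <= 0)
                break;
            left -= (size_t)n;
        }
        rc = (left == 0) ? 0 : -1;
    }
    close(p[0]);
    close(p[1]);
    return rc;
}

/* Hypothetical self-check over a socketpair. Returns 0 if the peer saw the
 * payload and the sender received the acknowledgement byte. */
int gift_demo(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    int sv[2];
    char ack = 0;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    char *page = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *rx = malloc(len);
    if (page == MAP_FAILED || rx == NULL)
        return -1;
    memset(page, 'x', len);

    if (gift_buffer_to_socket(sv[0], page, len) < 0)
        return -1;

    /* Peer: consume the payload, then send a one-byte application ACK. */
    if (recv(sv[1], rx, len, MSG_WAITALL) != (ssize_t)len)
        return -1;
    int ok = (rx[0] == 'x' && rx[len - 1] == 'x');
    if (write(sv[1], "A", 1) != 1)
        return -1;

    /* Sender: the page is released only after the ACK has arrived. */
    if (read(sv[0], &ack, 1) != 1 || ack != 'A')
        return -1;
    munmap(page, len);
    free(rx);
    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}
```

The important line is the read of the acknowledgement before munmap: the return of splice alone tells the sender nothing about whether the kernel is done with the page.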

Another thing you have to manage is the finite pipe space. The default capacity is very small (64 KiB on most systems), and even if you increase it, you can't just naively splice a file of any size. sendfile, on the other hand, will just let you do that, which is cool.
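
The bookkeeping that sendfile spares you looks roughly like the loop below: fill the pipe from the file, drain it into the socket, repeat. This is only a sketch; splice_file_to_socket, its chunk parameter, and splice_demo are hypothetical names, and a production version would also need to handle EINTR and non-blocking descriptors.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Move up to `count` bytes from file_fd (at its current offset) to sock_fd
 * through a pipe, at most `chunk` bytes per round.
 * Returns bytes moved, or -1 on error. */
ssize_t splice_file_to_socket(int sock_fd, int file_fd, size_t count,
                              size_t chunk)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    int cap = fcntl(p[1], F_GETPIPE_SZ);
    if (cap > 0 && (size_t)cap < chunk) {
        /* Try to grow the pipe; a refusal just means smaller rounds. */
        (void)fcntl(p[1], F_SETPIPE_SZ, (int)chunk);
        cap = fcntl(p[1], F_GETPIPE_SZ);
    }
    if (cap > 0 && (size_t)cap < chunk)
        chunk = (size_t)cap;

    size_t total = 0;
    int failed = 0;
    while (total < count && !failed) {
        size_t want = count - total < chunk ? count - total : chunk;
        ssize_t in = splice(file_fd, NULL, p[1], NULL, want, SPLICE_F_MOVE);
        if (in < 0)
            failed = 1;
        if (in <= 0) /* 0 means end of file */
            break;
        /* Drain everything just queued before refilling the pipe. */
        for (ssize_t left = in; left > 0; ) {
            ssize_t out = splice(p[0], NULL, sock_fd, NULL, (size_t)left,
                                 SPLICE_F_MOVE);
            if (out <= 0) {
                failed = 1;
                break;
            }
            left -= out;
        }
        total += (size_t)in;
    }
    close(p[0]);
    close(p[1]);
    return failed ? -1 : (ssize_t)total;
}

/* Hypothetical self-check: 12000 bytes in 4096-byte rounds over a
 * socketpair. Returns the number of bytes the peer received. */
ssize_t splice_demo(void)
{
    const size_t len = 12000;
    int sv[2];
    char *src = malloc(len), *rx = malloc(len);
    FILE *tmp = tmpfile();
    if (!src || !rx || !tmp || socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    memset(src, 'z', len);
    if (fwrite(src, 1, len, tmp) != len || fflush(tmp) != 0)
        return -1;
    if (lseek(fileno(tmp), 0, SEEK_SET) != 0)
        return -1;

    if (splice_file_to_socket(sv[0], fileno(tmp), len, 4096) != (ssize_t)len)
        return -1;
    ssize_t got = recv(sv[1], rx, len, MSG_WAITALL);
    /* If everything arrived, it must be byte-for-byte the file contents. */
    assert(got != (ssize_t)len || memcmp(src, rx, len) == 0);

    close(sv[0]);
    close(sv[1]);
    fclose(tmp);
    free(src);
    free(rx);
    return got;
}
```

Compare this with a single sendfile call on the same two descriptors, which needs no pipe and no inner drain loop.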

All in all, sendfile is nice because it just works, and it works well, and you don't need to care about any of the above details. It's not a panacea, but it sure is a great addition.
I would, personally, stay away from splice and its family until the whole thing is greatly overhauled and until it is 100% clear what you have to do (and when) and what you don't have to do.

The real, effective gains over plain old write are marginal for most applications, anyway. I recall some less than polite comments by Mr. Torvalds a few years ago (when BSD had a form of write that would do some magic with remapping pages to get zero-copy, and Linux didn't) which pointed out that making a copy usually isn't any issue, but playing tricks with pages is [won't repeat that here].

#2


2  

To quote the relevant man page on splice, as of 2014-07-08:

Though we talk of copying, actual copies are generally avoided. The kernel does this by implementing a pipe buffer as a set of reference-counted pointers to pages of kernel memory. The kernel creates "copies" of pages in a buffer by creating new pointers (for the output buffer) referring to the pages, and increasing the reference counts for the pages: only pointers are copied, not the pages of the buffer.

Therefore, yes: splice is currently documented as zero-copy in most cases.
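
A small way to see the "copies are just new references" semantics from user space is tee, the sibling call that duplicates pipe contents without consuming them. The sketch below (tee_demo is a made-up name) cannot prove that no bytes were copied inside the kernel; it only shows that the same data becomes readable from two pipes after a single write.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Returns 0 if data queued in one pipe can be duplicated into a second pipe
 * with tee() and then read, intact, from both. */
int tee_demo(void)
{
    static const char msg[] = "shared pages";
    const size_t len = sizeof msg - 1;
    char a[32] = {0}, b[32] = {0};
    int src[2], copy[2];

    if (pipe(src) < 0 || pipe(copy) < 0)
        return -1;
    if (write(src[1], msg, len) != (ssize_t)len)
        return -1;

    /* tee() makes the data available in `copy`; `src` is not consumed. */
    ssize_t n = tee(src[0], copy[1], len, 0);
    if (n != (ssize_t)len)
        return -1;

    ssize_t ra = read(src[0], a, sizeof a);
    ssize_t rb = read(copy[0], b, sizeof b);
    /* Both pipes must yield the full message. */
    assert(ra == n && rb == n);

    close(src[0]);
    close(src[1]);
    close(copy[0]);
    close(copy[1]);
    return (memcmp(a, msg, len) == 0 && memcmp(b, msg, len) == 0) ? 0 : -1;
}
```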
