Reading pixel data from the default framebuffer in OpenGL: performance of FBO vs. PBO

Time: 2022-06-02 05:58:20

My goal is to read the contents of the default OpenGL framebuffer and store the pixel data in a cv::Mat. Apparently there are two different ways of achieving this:

1) Synchronous: use FBO and glReadPixels

cv::Mat a = cv::Mat::zeros(cv::Size(1920, 1080), CV_8UC3);
glReadPixels(0, 0, 1920, 1080, GL_BGR, GL_UNSIGNED_BYTE, a.data);

2) Asynchronous: use PBO and glReadPixels

cv::Mat b = cv::Mat::zeros(cv::Size(1920, 1080), CV_8UC3);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo_userImage);
    glReadPixels(0, 0, 1920, 1080, GL_BGR, GL_UNSIGNED_BYTE, 0);
    unsigned char* ptr = static_cast<unsigned char*>(glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY));
    std::copy(ptr, ptr + 1920 * 1080 * 3 * sizeof(unsigned char), b.data);
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

From all the information I collected on this topic, the asynchronous version 2) should be much faster. However, comparing the elapsed time of both versions shows that the differences are often minimal, and sometimes version 1) even outperforms the PBO variant.

For performance checks, I've inserted the following code (based on this answer):

std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
....
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << std::endl;

I've also experimented with the usage hint when creating the PBO: I didn't find much of a difference between GL_DYNAMIC_COPY and GL_STREAM_READ here.

I'd be happy for suggestions on how to increase the speed of this pixel read operation from the framebuffer even further.

1 solution

#1


Your second version is not asynchronous at all, since you're mapping the buffer immediately after triggering the copy. The map call will then block until the contents of the buffer are available, effectively becoming synchronous.

Or: depending on the driver, it will block when you actually read from the mapped memory. In other words, the driver may implement the mapping in such a way that it causes a page fault and a subsequent synchronization. It doesn't really matter in your case, since you are still accessing that data straight away due to the std::copy.

The proper way of doing this is by using sync objects and fences.

Keep your PBO setup, but after issuing the glReadPixels into a PBO, insert a sync object into the stream via glFenceSync. Then, some time later, poll for that fence sync object to be complete (or just wait for it altogether) via glClientWaitSync.

If glClientWaitSync reports that the commands before the fence are complete, you can now read from the buffer without an expensive CPU/GPU sync. (If the driver is particularly stupid and didn't already move the buffer contents into mappable addresses, in spite of your usage hints on the PBO, you can use another thread to perform the map. glGetBufferSubData can therefore be cheaper, as the data doesn't need to be in a mappable range.)
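
The polling step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the answer's own code: `FenceState`, `tryFetch` and the `Backend` wrapper (with `poll`, `map`, `unmap`) are hypothetical names, and the real OpenGL calls they would stand for are named only in the comments so that the sketch compiles without a GL context. On the issuing side, the fence would be created right after the glReadPixels into the PBO with glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Mirrors the four results glClientWaitSync can report:
// GL_ALREADY_SIGNALED, GL_CONDITION_SATISFIED, GL_TIMEOUT_EXPIRED, GL_WAIT_FAILED.
enum class FenceState { AlreadySignaled, ConditionSatisfied, TimeoutExpired, WaitFailed };

// Non-blocking readback attempt. `Backend` is a hypothetical wrapper that a
// real program would implement roughly as:
//   poll()  : glClientWaitSync(fence, 0, 0)   (timeout 0, returns immediately)
//   map()   : glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
//             glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY)
//   unmap() : glUnmapBuffer(GL_PIXEL_PACK_BUFFER); glDeleteSync(fence)
// Returns true only if the pixels were copied into dst.
template <class Backend>
bool tryFetch(Backend& gpu, unsigned char* dst, std::size_t bytes) {
    assert(dst != nullptr);
    const FenceState state = gpu.poll();
    if (state != FenceState::AlreadySignaled &&
        state != FenceState::ConditionSatisfied) {
        return false;  // GPU not done yet (or the wait failed): try again later
    }
    const unsigned char* src = gpu.map();  // the copy has finished, so no GPU stall is expected
    if (src == nullptr) {
        return false;  // mapping failed
    }
    std::memcpy(dst, src, bytes);
    gpu.unmap();
    return true;
}
```

Calling something like `tryFetch` once per frame and carrying on with other work whenever it returns false is what keeps the readback off the critical path. Passing a non-zero timeout to glClientWaitSync instead would correspond to the "just wait for it altogether" variant.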

If you need to do this on a frame-by-frame basis, you will very likely need more than one PBO, that is, a small pool of them. This is because at the next frame the readback of the previous frame's data may not be complete yet and the corresponding fence not yet signalled. (Yes, GPUs are massively pipelined these days, and they will be some frames behind your submission queue.)
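
A minimal sketch of the bookkeeping such a pool needs, assuming the PBOs are used strictly round-robin. `PboRing` and its member names are hypothetical, and the OpenGL calls appear only in comments so that the class compiles on its own; a real program would keep an array of N PBO names and N fence handles alongside it.

```cpp
#include <cassert>
#include <cstddef>

// Tracks which of N PBOs is free for the next glReadPixels and which one
// holds the oldest readback still waiting on its fence.
template <std::size_t N>
class PboRing {
public:
    // Slot to use for this frame's readback, or -1 if all N readbacks are
    // still in flight (the pool is too small for the GPU's latency).
    // Real code would then bind pbo[slot], call glReadPixels(..., nullptr)
    // and store fence[slot] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0).
    int acquire() {
        if (inFlight_ == N) return -1;
        const std::size_t slot = (oldest_ + inFlight_) % N;
        ++inFlight_;
        return static_cast<int>(slot);
    }

    // Oldest slot whose data has not been fetched yet, or -1 if none.
    // Real code would poll fence[slot] with glClientWaitSync(fence, 0, 0).
    int oldestPending() const {
        return inFlight_ == 0 ? -1 : static_cast<int>(oldest_);
    }

    // Call once the oldest slot's fence has signalled and its data was copied out.
    void release() {
        assert(inFlight_ > 0);
        oldest_ = (oldest_ + 1) % N;
        --inFlight_;
    }

private:
    std::size_t oldest_ = 0;    // index of the oldest in-flight slot
    std::size_t inFlight_ = 0;  // number of readbacks not yet fetched
};
```

If `acquire` returns -1 regularly, the pool is smaller than the number of frames the GPU runs behind. The options are then to enlarge N or to skip the readback for that frame.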
