AMD 64-bit Dual Core Optimization

Time: 2022-09-01 10:05:32

We have a graphics-intensive application that seems to be experiencing problems on AMD 64-bit dual-core platforms that are not apparent on Intel platforms.


Running the application causes the CPU to run at 100%, in particular when using the code for shadows and lighting (OpenGL).


Does anyone know of specific issues with AMD processors that could cause this, where to start tracking down the problem, and/or ways to optimize the code base to avoid these issues?


Note: the application generally works well on mid-range hardware, and my dev machine has an NVIDIA GTX 260 card in it, so lack of graphics power should not be an issue.


6 Answers

#1



Note that AMD64 is a NUMA architecture - if you are using a multi-processor box you may be running lots of memory accesses across the HyperTransport bus, which will be slower than local memory and may explain the behaviour.


This will not be the case between cores on a single socket, so feel free to ignore this if you are not using a multi-socket machine.


Linux is NUMA-aware (i.e. it has system services to allocate memory from the local bank and to bind processes to specific CPUs). I believe that Windows Server 2003, Server 2008 and Vista are NUMA-aware, but XP is not. Most of the proprietary Unix variants such as Solaris have NUMA support as well.

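For illustration, a minimal Linux sketch of those two services (libnuma plus sched_setaffinity, link with -lnuma); the node and CPU numbers are placeholders of mine, not something from the answer above:

```cpp
// Pin the calling process to CPU 0 and allocate its working set from the
// local memory bank of NUMA node 0, so hot data stays off HyperTransport.
#include <numa.h>      // numa_available, numa_alloc_onnode, numa_free
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET, sched_setaffinity
#include <cstdio>

int main()
{
    if (numa_available() < 0) {
        std::printf("libnuma: NUMA not supported on this system\n");
        return 1;
    }

    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);                        // assume CPU 0 sits on node 0
    sched_setaffinity(0, sizeof(mask), &mask);

    const size_t bytes = 64 * 1024 * 1024;
    void* buffer = numa_alloc_onnode(bytes, 0);   // memory local to node 0
    // ... run the memory-heavy work on this buffer from the pinned process ...
    numa_free(buffer, bytes);
    return 0;
}
```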

#2



Late answer here.


Dunno if this is related, but in some Win32 OpenGL drivers, SwapBuffers() will not yield the CPU while waiting for vsync, making it very easy to hit 100% CPU utilisation.


The solution I use for this is to measure the time since the last SwapBuffers() completed, which tells me how far away the next vsync is. So before calling SwapBuffers(), I take short Sleep()s until I detect that vsync is imminent. This way SwapBuffers() doesn't have to wait long for vsync, and so doesn't hog the CPU too badly.


Note that you may have to use timeBeginPeriod() to get sufficient Sleep() precision for this to work reliably.

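A rough Win32/C++ sketch of that idea is below; the 60 Hz refresh rate, the 2 ms safety margin, and the function names are assumptions of mine rather than part of the answer:

```cpp
// Sleep until the next vsync is imminent, then call SwapBuffers(), so the
// driver's busy-wait only covers the last millisecond or two of the frame.
#include <windows.h>
#include <mmsystem.h>   // timeBeginPeriod (link with winmm.lib)

static LARGE_INTEGER g_freq, g_lastSwap;

void InitFramePacing()
{
    QueryPerformanceFrequency(&g_freq);
    QueryPerformanceCounter(&g_lastSwap);
    timeBeginPeriod(1);                     // ask for ~1 ms Sleep() granularity
}

void PresentFrame(HDC hdc)
{
    const double framePeriodMs  = 1000.0 / 60.0;  // assumed 60 Hz refresh
    const double safetyMarginMs = 2.0;            // wake up slightly early

    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);
    double elapsedMs = (now.QuadPart - g_lastSwap.QuadPart) * 1000.0 / g_freq.QuadPart;

    // Take short sleeps until vsync is close, instead of spinning in the driver.
    while (elapsedMs < framePeriodMs - safetyMarginMs) {
        Sleep(1);
        QueryPerformanceCounter(&now);
        elapsedMs = (now.QuadPart - g_lastSwap.QuadPart) * 1000.0 / g_freq.QuadPart;
    }

    SwapBuffers(hdc);                       // should now return quickly
    QueryPerformanceCounter(&g_lastSwap);   // next frame measures from here
}
```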

#3



Depending on how you've done your shadows and other graphics code, it's possible that you've "fallen off the fast path" and the graphics driver has started doing software emulation. This can happen if you have complicated pipelines, or are using too many conditionals (or just too many instructions) in shader code.


I would make sure that this particular graphics card supports all the features you are using.

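As a quick sanity check, something along these lines at startup can reveal whether the driver has silently dropped to a software renderer or is missing a capability the shadow path relies on; GLEW and the extension names are just examples, not a claim about what the application actually uses:

```cpp
// Log the active renderer and verify the extensions the shadow/lighting path
// needs. A renderer string like "GDI Generic" or "llvmpipe" means software.
#include <GL/glew.h>   // any extension loader works; GLEW is only an example
#include <cstdio>

void CheckGraphicsPath()   // call after context creation and glewInit()
{
    const char* renderer = reinterpret_cast<const char*>(glGetString(GL_RENDERER));
    const char* version  = reinterpret_cast<const char*>(glGetString(GL_VERSION));
    std::printf("GL_RENDERER: %s\nGL_VERSION: %s\n",
                renderer ? renderer : "(null)", version ? version : "(null)");

    // Example check: fall back to a simpler lighting path if the shader
    // extensions used by the shadow code are not supported on this card/driver.
    if (!glewIsSupported("GL_ARB_vertex_shader GL_ARB_fragment_shader")) {
        std::printf("Shader extensions missing - disabling shadow mapping\n");
    }
}
```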

#4



I would invest in profiling software to track down the actual cause of the problem.


On Linux, Valgrind (which contains Cachegrind and Callgrind) plus KCachegrind can help you work out where all the heavy function calls are going.


Also, compile with full debug symbols and it can even show the assembly code at the slow function calls.


If you're using an Intel-specific compiler, this may be part of your problem (not definite, though), so try the GCC family.


Also, you may want to dive into OpenMP and threads if you haven't already.


#5



Hm - if you use shadows the GPU should be under load, so it's unlikely that the GPU is rendering the frames faster than the CPU can send graphics data. In that case 100% CPU load is OK and even expected.


It could simply be a borked OpenGL driver that burns CPU cycles in a spinlock somewhere. To find out exactly what's going on, I suggest you run a profiling tool such as CodeAnalyst from AMD (free the last time I used it).


Profile your program for a couple of minutes and take a look at where the time is spent. If you see a big peak in the OpenGL drivers and not in your application, get a new driver. Otherwise you at least get an idea of what's going on.


Btw - let me guess, you're using an ATI card, right? I don't want to offend any ATI fans out there, but their OpenGL drivers are not exactly stellar. If you're unlucky you may even have used a feature that the card does not support or that is disabled due to a silicon bug. The driver will fall back into software rasterization mode in this case. This will slow things down a lot and give you 100% CPU load even if your program is single-threaded.


#6



Also, the cache is not shared between the two cores, which might cause a lack of performance when sharing data among multiple threads.

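If that turns out to be the bottleneck, one common mitigation is to keep each thread's hot data on its own cache line so the two cores' private caches don't ping-pong a shared line. A minimal sketch, assuming a 64-byte cache line (the struct and names are illustrative, not from the answer above):

```cpp
// Give each worker thread its own cache-line-sized slot so frequent writes by
// one core do not invalidate the line in the other core's private cache.
#include <thread>
#include <cstddef>
#include <cstdio>

constexpr std::size_t kCacheLine = 64;    // assumed cache-line size

struct alignas(kCacheLine) PerThreadCounter {
    long value = 0;
    char pad[kCacheLine - sizeof(long)];  // keep neighbours off this line
};

int main()
{
    PerThreadCounter counters[2];

    auto work = [&counters](int idx) {
        for (long i = 0; i < 100000000; ++i)
            counters[idx].value++;        // each thread touches only its own line
    };

    std::thread a(work, 0), b(work, 1);
    a.join();
    b.join();
    std::printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}
```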
