Bug fixed by four nops inside an if(0): the world no longer makes sense

Time: 2022-05-18 19:27:52

I was writing a function to figure out if a given system of linear inequalities has a solution, when all of a sudden it started giving the wrong answers after a seemingly innocuous change.

I undid some changes, re-did them, and then proceeded to fiddle for the next two hours, until I had reduced it to absurdity.

The following, inserted anywhere into the function body, but nowhere else in the program, fixes it:

if(0) {
    __asm__("nop\n");
    __asm__("nop\n");
    __asm__("nop\n");
    __asm__("nop\n");
}

It's for a school assignment, so I probably shouldn't post the function on the web, but this is so ridiculous that I don't think any context is going to help you. And all the function does is a bunch of math and looping. It doesn't even touch memory that isn't allocated on the stack.

Please help me make sense of the world! I'm loath to chalk it up to GCC, since the first rule of debugging is not to blame the compiler. But heck, I'm about to. I'm running Mac OS 10.5 on a G5 tower, and the compiler in question identifies itself as 'powerpc-apple-darwin9-gcc-4.0.1', but I'm thinking it could be an impostor...

UPDATE: Curiouser and curiouser... I diffed the .s files with nops and without. Not only are there too many differences to check, but without the nops the .s file is 196,620 bytes, and with them it's 156,719 bytes. (!)

UPDATE 2: Wow, should have posted the code! I came back to the code today, with fresh eyes, and immediately saw the error. See my sheepish self-answer below.

9 answers

#1


Most times when you modify the code inconsequentially and it fixes your problem, it's a memory corruption problem of some sort. We may need to see the actual code to do proper analysis, but that would be my first guess, based on the available information.

#2


It's faulty pointer arithmetic, either directly (through a pointer) or indirectly (by going past the end of an array). Check all your arrays. Don't forget that if your array is

 int a[4];

then a[4] doesn't exist.

What you're doing is overwriting something on the stack accidentally. The stack contains locals, parameters, and the return address from your function. You might be damaging the return address in a way that the extra nops cure.

For example, if you have some code that is adding something to the return address, inserting those extra 16 bytes of nops would cure the problem, because instead of returning past the next line of code, you return into the middle of some nops.

One way you might be adding something to the return address is by going past the end of a local array or a parameter, for example:

  int a[4];
  a[4]++;
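
For contrast, here is a minimal sketch of a loop that stays inside the array. The names `sum_four` and `ARRAY_LEN` are made up for illustration and are not from the poster's code; the point is only that deriving the bound from the array itself keeps the index in 0..3, so a[4] is never touched.

```c
#include <stddef.h>

/* Number of elements in a true array (not a pointer). Hypothetical helper. */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Sums a local four-element array without ever indexing a[4]. */
static int sum_four(void)
{
    int a[4] = {1, 2, 3, 4};
    int total = 0;

    /* i runs over 0, 1, 2, 3; the bound comes from the array's own size. */
    for (size_t i = 0; i < ARRAY_LEN(a); i++) {
        total += a[i];
    }
    return total;
}
```

Note that `ARRAY_LEN` only works on an array that is in scope as an array; once it has decayed to a pointer parameter, the length has to be passed separately.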

#3


I came back to this after a few days busy with other things, and figured it out right away. Sorry I didn't post the code sooner, but it was hard coming up with a minimal example that displayed the problem.

The root problem was that I left out the return statements in the recursive function. I had:

bool function() {
    /* lots of code */
    function();   /* bug: the result of the recursive call is discarded */
}

When it should have been:

bool function() {
    /* lots of code */
    return function();
}

The buggy version worked at first because, through the magic of optimization, the right value happened to be in the right register at the right time, and made it to the right place.

The bug was originally introduced when I broke the first call out into its own special-cased function. At that point, the extra nops made the difference between this first case being inlined directly into the general recursive function or not.

Then, for reasons that I don't fully understand, inlining this first case led to the right value not being in the right place at the right time, and the function returning junk.
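
To make the shape of the fix concrete, here is a small sketch. `any_negative` is a made-up stand-in, not the assignment's function. In C, using the value of a non-void function that fell off its end is undefined behavior, which fits the "right register at the right time" symptom; GCC's `-Wreturn-type` warning, which `-Wall` enables, reports this kind of missing return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the recursive function: returns true if any
 * element of values[index..count-1] is negative. */
static bool any_negative(const int *values, int count, int index)
{
    assert(values != NULL || count == 0);

    if (index >= count) {
        return false;   /* base case: nothing left to check */
    }
    if (values[index] < 0) {
        return true;
    }
    /* The bug in the thread was the equivalent of writing this call
     * without `return`, so control fell off the end of the function. */
    return any_negative(values, count, index + 1);
}
```

Every path through the function now ends in an explicit `return`, so the result no longer depends on what the optimizer happened to leave in the return register.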

#4


Does it happen in both debug and release builds (with symbols and without)? Does it behave the same way under a debugger? Is the code multithreaded? Are you compiling with optimizations? Can you try another machine?

#5


Can you confirm that you are indeed getting different executables when you add the if(0) {nops}? I don't see nops on my system.

$ gcc --version
powerpc-apple-darwin9-gcc-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5490)

$ cat nop.c
void foo()
{
    if (0) {
        __asm__("nop");
        __asm__("nop");
        __asm__("nop");
        __asm__("nop");
    }
}

$ gcc nop.c -S -O0 -o -
    .
    .
_foo:
    stmw r30,-8(r1)
    stwu r1,-48(r1)
    mr r30,r1
    lwz r1,0(r1)
    lmw r30,-8(r1)
    blr

$ gcc nop.c -S -O3 -o -
    .
    .
_foo:
    blr

#6


My guess is stack corruption -- though gcc should optimize anything inside an if(0) out, I would have thought.

You could try sticking a big array on the stack in your function and see if that also fixes it -- that would also implicate stack corruption.

Are you sure you're running what you think you're running? (dumb question, but it happens.)

#7


Looks like you will need to put in some hard work and elbow grease.

Your problem sounds similar to something I have debugged in the past, where my app was running normally... when out of nowhere it jumped to a different part of the app and the call stack got completely messed up (however, this was embedded programming)!

It sounds like you are spending your time "thinking" about "what should be happening"... when you should be "looking" at "what is actually happening". A lot of the time, the hardest bugs are things that you would never think "should happen".

I would approach the problem like so:

  1. Break out your favorite debugger
  2. Start stepping through your code and watch the call stack and local variables and look for suspicious activity
  3. Make the system fail
  4. Focus in on where the system is failing

Focus on iterating your code changes:

  1. Make code changes that will "make the system fail".
  2. Run/debug and watch.
  3. If it runs fine, you are looking at or trying the wrong thing and need to try something else. If you make it fail, you have made progress towards finding the bug.
  4. If you don't know where or how the system fails, you will not be able to solve the problem.


This will be a good opportunity to build your debugging skills. For more help on building them, check out the book "9 rules for debugging".

Here is a poster from the book:

9 Rules of debugging image http://tbn2.google.com/images?q=tbn:BXfm745sxN6oKM:http://www.debuggingrules.com/debuggingrules.jpg


Concrete suggestions:

  1. If you think it is the compiler, then try a different platform/OS/compiler.
  2. Once you have ruled out the platform/OS/compiler, try restructuring the code. Look for the "clever" parts of the code and see if they are actually doing what they were meant to do... maybe the clever solution wasn't actually clever and is doing something else.

#8


I am the author of "Debugging" so kindly referenced above by Trevor Boyd Smith. He has it right -- the key rules here are #2 Make It Fail (which you seem to be doing okay), and #3 Quit Thinking and Look. The conjectures above are very good (demonstrating mastery of rule #1 -- Understand the System -- in this case the way code size can change a bug). But actually watching it fail with a debugger will show you what's actually happening without guesswork.

#9


Break out that one function into a separate .c file (or .cpp or whatever). Compile just that one file with the nops and without them, to .s files and compare them.
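
One way to set that comparison up, sketched under assumptions: `probe` is a placeholder for the real function, and `WITH_NOPS` and the file name `probe.c` are names invented here. Guarding the nops with a macro lets the same source file produce both .s files.

```c
/* Hypothetical stand-in for the function under suspicion.
 *
 * Build the two variants from the same file and compare them, e.g.:
 *   gcc -O2 -S probe.c -o without.s
 *   gcc -O2 -S -DWITH_NOPS probe.c -o with.s
 *   diff -u without.s with.s
 * Use whatever optimization level the real build uses. */
int probe(int x)
{
#ifdef WITH_NOPS
    /* The "fix" from the question, switched on by -DWITH_NOPS. */
    if (0) {
        __asm__("nop\n");
        __asm__("nop\n");
        __asm__("nop\n");
        __asm__("nop\n");
    }
#endif
    return x * 2 + 1;
}
```

With only one small function per file, the diff should be short enough to read, unlike the whole-program .s files mentioned in the question.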

Try an old version of gcc. Go back 5 or 10 years and see if things get stranger.
