How to learn C debugging and best practices

Time: 2022-09-05 18:42:47

I've written an Apache module in C. Under certain conditions, I can get it to segfault, but I have no idea as to why. At this point, it could be my code, it could be the way I'm compiling the program, or it could be a bug in the OS library (the segfault happens during a call to dlopen()).

I've tried running through GDB and Valgrind with no success. GDB gives me a backtrace into the dlopen() system call that appears meaningless. In Valgrind, the bug actually seems to disappear or at least become non-reproducible. On the other hand, I'm a total novice when it comes to these tools.

I'm a little new to production quality C programming (I started on C many years ago, but have never worked professionally with it.) What is the best way for me to go about learning the ropes of debugging programs? What other tools should I be investigating? In summary, how do you figure out how to tackle new bug challenges?

EDIT: Just to clarify, I want to thank Sydius's and dmckee's input. I had taken a look at Apache's guide and am fairly familiar with dlopen (and dlsym and dlclose). My module works for the most part (it's at about 3k lines of code and, as long as I don't activate this one section, things seem to work just fine.)

I guess this is where my original question comes from - I don't know what to do next. I know I haven't used GDB and Valgrind to their full potential. I know that I may not be compiling with the exact right flags. But I'm having trouble figuring out more. I can find beginner's guides that tell me what I already know, and man pages that tell me more than I need to know but with no guidance.

7 answers

#1


8  

This link may help with your specific problem: Apache Debugging Guide. Experience with specific problems is one of the best ways to get better in the general case.

#2


5  

Unfortunately the GNU tools are not the best, and my experience is that the dynamic linker muddies the waters enormously. If you can get Apache to link statically with your module that will enable gdb especially to perform more reliably. I don't know how easy that is; a lot depends on the Apache build system.

It's worrisome but not shocking that you can't easily reproduce the bug with valgrind.

Regarding compiling with the right flags, both valgrind and gdb will give you much better information if you compile everything in sight with -g -O0. Don't believe the claims on the gcc man page that gcc -g -O is good enough; it isn't---even -O will cause variables in the source code to be eliminated by the optimizer.

#3


3  

I'm sure that debugging techniques are, in general, language independent, and there is no such thing as "C debugging".
There are a lot of different tools that can help you find simple problems like memory leaks or plain mistakes in the code; sometimes they can even catch simple memory overruns. But for really hard-to-find problems, such as those originating from multitasking/interrupts or DMA memory corruption, the only tools are your brain and well-written code (written with the expectation that it will be debugged). You can find more about preparing your code for debugging here. It seems from Sydius's post that Apache already has a good tracing mechanism in place, so just use it and add something similar to your code base.
In addition, I would say that another important step in debugging is "don't assume/think". Base all your steps on bare facts, and prove each assumption with 100% accuracy before you take another step based on it. Basing your debugging on assumptions will usually lead you in the wrong direction.

Edit after Dave's clarification:

Your next step should be to find the smallest part of the code that causes the problem. You said that if you disable a certain section, the module loads. Make this section as small as possible: remove/mock everything in it until you find, ideally, the one line that causes the module not to load. Once you find this line, it will be exactly the right time to start using your brain :) Just don't forget to verify 100% that this is the line.

#4


2  

Very general advice:

  • Look again at that backtrace. Are any of the stack frames in code you control? If so, what line, and what is happening there?

  • Do you know what dlopen() does? If not, read the manual. If the backtrace does not include any of your code, this may well be failing at the time Apache tries to load your code. Are you sure you've built the module with the right compiler options?

  • Effective debugging requires knowing your environment and tools. Sydius's advice is good here.

  • If you're stuck on other paths, check that you can write, load, and run a trivial module. Probably you'll find an example of this in almost any documentation on the subject.


To Dave's clarification: the stage between beginner and expert can be a tough spot.

Are you calling in libraries in the offending code that you don't use elsewhere? Maybe the loader path is messed up just for that resource.

Aside from that I'm just about out of advice. Sorry.


NB: I had occasion to read David J. Agans' book Debugging last year. It is not software specific, but is a good read, and helpful even if you are already a pretty good debugger.

#5


2  

The fact that it is failing on the dlopen() call seems a bit suspect to me. There are a number of things that can go wrong when attempting to open a shared object; but none of them should cause a seg fault.

The one exception I can think of is a problem in the library initialization of the SO. On that basis, I would suggest a few things you could try to get more information.

  • Check your library path, and ensure that the library you're trying to load is in this path. (Note: Since you're using Apache, I think you also need to check the library path for the user under which Apache is running. (I think the user is "nobody".) I believe you're looking for the LD_LIBRARY_PATH environment variable.) Also note that if you have multiple versions of the library, this can be really important. Make sure you're loading the correct version of the library.

  • As a general debugging principle, try to simplify the problem. Given that I know little about Apache modules, I would try to remove Apache from the equation: Try writing a simple C program that does little more than a dlopen() and possibly the subsequent dlsym(), then exits. This program provides a much simpler environment to troubleshoot and/or debug. If this program runs cleanly, then you may need to look more closely at what's different when the program seg faults. (What's Apache doing differently?) On the other hand, if your program also seg faults, you may consider a potential problem with the library, your compiler switches for the program, and the code in the program. (Or all of the above.)

While I may not have offered very many general purpose debugging tips, I hope something here may have been helpful.

#6


1  

I had a look at the valgrind documentation, and by default it doesn't check child processes. It wouldn't surprise me at all if Apache ran your module in a child process. Please try

valgrind --trace-children=yes ....

#7


1  

To our non-CS students (i.e. electrical engineering, math, and physics students), I recommend "The Practice of Programming" by Kernighan in the programming lectures. It does a good job of delivering some basic concepts that aid development (like testing and, here it comes, debugging).

If you are already an experienced programmer, it may be too basic for you. In that case I have just one more of these Zen proverbs for you: "Wisdom without the filtering through experience is worthless".

One answer I can only back up: look again at the stack trace. It is the most relevant aid in debugging, especially at the borders where execution crosses between modules (particularly between your code and the library/OS). Look at the arguments of each function there and check that they are sane.
