What do C and assembler actually compile to?

Time: 2021-12-04 12:43:26

So I found out that C(++) programs don't actually compile to plain "binary" (I may have gotten some things wrong here; if so, I'm sorry :D) but to a range of things (symbol table, OS-related stuff, ...). But...

  • Does an assembler "compile" to pure binary? That is, with nothing extra besides resources like predefined strings, etc.

  • If C compiles to something other than plain binary, how can that small assembler bootloader just copy the instructions from the HDD to memory and execute them? I mean, if the OS kernel, which is probably written in C, compiles to something different from plain binary, how does the bootloader handle it?

Edit: I know that an assembler doesn't "compile", because it only has your machine's instruction set. I didn't find a good word for what an assembler "assembles" to. If you have one, leave it here as a comment and I'll change it.

12 answers

#1


38  

C typically compiles to assembler, just because that makes life easy for the poor compiler writer.

Assembly code normally assembles (not "compiles") to relocatable object code. You can think of this as binary machine code and binary data, but with lots of decoration and metadata. The key parts are:

  • Code and data appear in named "sections".

  • Relocatable object files may include definitions of labels, which refer to locations within the sections.

  • Relocatable object files may include "holes" that are to be filled with the values of labels defined elsewhere. The official name for such a hole is a relocation entry.

For example, if you compile and assemble (but don't link) this program

#include <stdio.h>
int main(void) { printf("Hello, world\n"); return 0; }

you are likely to wind up with a relocatable object file with

  • A text section containing the machine code for main

  • A label definition for main, which points to the beginning of the text section

  • A rodata (read-only data) section containing the bytes of the string literal "Hello, world\n"

  • A relocation entry that depends on printf and that points to a "hole" in a call instruction in the middle of the text section.
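To make the "hole" idea concrete, here is a minimal sketch in C of what a linker does with one relocation entry. The struct layout and the names toy_reloc and apply_reloc are invented for illustration; real formats such as ELF and COFF carry more fields and many relocation types. The sketch assumes an x86-style call whose 4-byte operand is relative to the address of the next instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified relocation entry (not a real ELF/COFF layout). */
struct toy_reloc {
    size_t offset;      /* where in the text section the 4-byte hole starts */
    const char *symbol; /* name of the label whose address fills the hole */
};

/* Fill a 4-byte little-endian, PC-relative hole, in the style of an x86
 * `call rel32`: stored value = target - address of the byte after the hole.
 * text_base is the address the text section was assigned at link time. */
void apply_reloc(uint8_t *text, size_t text_len, uint32_t text_base,
                 const struct toy_reloc *r, uint32_t symbol_addr)
{
    assert(text != NULL && r != NULL);
    assert(r->offset + 4 <= text_len);

    uint32_t next_insn = text_base + (uint32_t)r->offset + 4;
    uint32_t rel = symbol_addr - next_insn;

    for (int i = 0; i < 4; i++)
        text[r->offset + i] = (uint8_t)(rel >> (8 * i));
}
```

Until the linker has decided where main and printf will live, the four bytes after the call opcode stay as placeholders; the relocation entry is the note telling it which bytes to patch and with which symbol.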

If you are on a Unix system, a relocatable object file is generally called a .o file, as in hello.o. You can explore the label definitions and uses with a simple tool called nm, and you can get more detailed information from a somewhat more complicated tool called objdump.

I teach a class that covers these topics, and I have students write an assembler and linker, which takes a couple of weeks. By the time they've done that, most of them have a pretty good handle on relocatable object code. It's not such an easy thing.

#2


34  

Let's take a C program.

When you run 'gcc' or 'cl' on the C program, it will go through these stages:

  1. Preprocessing (#include, #ifdef, trigraph analysis, encoding translations, comment management, macros, ...).
  2. Lexical analysis (producing tokens and lexical errors).
  3. Syntactic analysis (producing a parse tree and syntax errors).
  4. Semantic analysis (producing a symbol table, scoping information and scoping/typing errors).
  5. Output into assembly (or another intermediate format).
  6. Optimization of the assembly (or intermediate format), probably still in textual form.
  7. Assembling of the assembly into some binary object format.
  8. Linking of the object code with whatever static libraries are needed, as well as relocating it if needed.
  9. Output of the final executable in ELF or COFF format.
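As a rough illustration of the stages above, GCC can be told to stop after several of them. The sketch below is a small C file with the relevant commands in a comment; the file name stages.c and the function area are made up for the example, while -E, -S and -c are standard GCC options.

```c
#include <assert.h>

#define SQUARE(x) ((x) * (x)) /* expanded by the preprocessor, before lexing proper */

/*
 * Assuming this file is saved as stages.c (a hypothetical name):
 *   gcc -E stages.c   stop after preprocessing and print the expanded source
 *   gcc -S stages.c   stop after code generation and write stages.s (assembly)
 *   gcc -c stages.c   stop after assembling and write stages.o (object file)
 * Linking stages.o into an executable additionally needs a file that defines main.
 */
int area(int side)
{
    assert(side >= 0);
    return SQUARE(side);
}
```

Looking at the -E and -S outputs side by side is a quick way to see that the macro has disappeared by the time assembly is produced.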

In practice, some of these steps may be done at the same time, but this is the logical order.

Note that there's a 'container' in ELF or COFF format around the actual executable binary.
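One way to see the container is to look at the first bytes of the file. The following sketch, with an invented enum and function name, tells apart two common cases: ELF files begin with the bytes 0x7F 'E' 'L' 'F', and Windows PE executables (which are COFF-derived) begin with a DOS stub whose first two bytes are 'M' 'Z'. It is only a first-bytes check, not a validator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical classification used only in this example. */
enum container { CONTAINER_UNKNOWN, CONTAINER_ELF, CONTAINER_MZ };

enum container classify(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len >= 4 && memcmp(buf, "\x7f" "ELF", 4) == 0)
        return CONTAINER_ELF; /* ELF magic number */
    if (len >= 2 && buf[0] == 'M' && buf[1] == 'Z')
        return CONTAINER_MZ;  /* DOS "MZ" header, also found at the start of PE files */
    return CONTAINER_UNKNOWN; /* could be a raw binary or some other format */
}
```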

You will find that a book on compilers (I recommend the Dragon book, the standard introductory book in the field) will have all the information you need and more.

As Marco commented, linking and loading is a large area, and the Dragon book more or less stops at the output of the executable binary. Actually going from there to running on an operating system is a decently complex process, which Levine covers in Linkers and Loaders.

I've wiki'd this answer to let people fix any errors or add information.

#3


17  

There are different phases in translating C++ into a binary executable. The language specification describes translation only abstractly and does not prescribe how a compiler must implement it. However, I will describe the common translation phases.

Source C++ To Assembly or Intermediate Language

Some compilers actually translate the C++ code into an assembly language or an intermediate language. This is not a required phase, but it is helpful for debugging and optimization.

Assembly To Object Code

The next common step is to translate assembly language into object code. The object code contains machine code with relative addresses and open references to external subroutines (methods or functions). In general, the translator puts as much information into an object file as it can; everything else is left unresolved.

Linking Object Code(s)

The linking phase combines one or more object files, resolves references and eliminates duplicate subroutines. The final output is an executable file. This file contains information for the operating system, along with relative addresses.

Executing Binary Files

The operating system loads the executable file, usually from a hard drive, and places it into memory. The OS may convert relative addresses into physical locations. The OS may also prepare resources (such as DLLs and GUI widgets) that are required by the executable (and which may be listed in the executable file).

Compiling Directly To Binary

Some compilers, such as the ones used in embedded systems, can compile from C++ directly to executable binary code. This code has physical addresses instead of relative addresses and does not require an OS to load it.

Advantages

One of the advantages of these phases is that C++ programs can be broken into pieces, compiled individually and linked at a later time. They can even be linked with pieces from other developers (a.k.a. libraries). This allows developers to compile only the pieces under development and link in pieces that are already validated. In general, the translation from C++ to object code is the time-consuming part of the process. Also, a person doesn't want to wait for all the phases to complete when there is an error in the source code.

Keep an open mind and always expect the Third Alternative (Option).

#4


3  

To answer your questions, please note that the details vary, as there are different processors, different platforms, and different assemblers and C compilers. In this case, I will talk about the Intel x86 platform.

  1. Assemblers do not produce pure binary. They produce raw machine code organized into segments such as data, text and bss, to name but a few; this is called object code. The linker steps in and adjusts the segments to make the result executable, that is, ready to run. Incidentally, the default output when you compile using gcc is 'a.out', which is shorthand for Assembler Output.
  2. Boot loaders are built with a special directive. Back in the days of DOS, it was common to find a directive such as ORG 100h, which marks the assembler code as being of the old .COM variety, from before .EXE took over in popularity. You also did not need an assembler to produce a .COM file: the old debug.exe that came with MS-DOS did the trick for small, simple programs. .COM files did not need a linker and were a straight ready-to-run binary format. Here's a simple session using DEBUG.
a 0100
mov AH,07
int 21
cmp AL,00
jnz 010c
mov AH,07
int 21
mov AH,4C
int 21

r CX
10
n respond.com
w
q

This produces a ready-to-run .COM program called 'respond.com' that waits for a keystroke and does not echo it to the screen. Notice the 'a 0100' at the beginning, which shows that the instruction pointer starts at 100h; that is the hallmark of a .COM file. This old script was mainly used in batch files to wait for a response without echoing it. The original script can be found here.
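Since a .COM file has no header, the file is nothing more than the instruction bytes themselves. As a sketch, the same 16 bytes (10h, matching the value written to CX) can be spelled out in C; the encodings are the standard 8086 ones for these instructions, while the array name respond_com and the helper write_com are invented for this example.

```c
#include <assert.h>
#include <stdio.h>

/* The respond.com image, byte for byte; offsets are relative to 0100h. */
const unsigned char respond_com[16] = {
    0xB4, 0x07, /* 0100: mov ah,07  (read a key without echo) */
    0xCD, 0x21, /* 0102: int 21 */
    0x3C, 0x00, /* 0104: cmp al,00  (00 means an extended key follows) */
    0x75, 0x04, /* 0106: jnz 010c   (relative jump of +4 from 0108) */
    0xB4, 0x07, /* 0108: mov ah,07  (read the second byte of the key) */
    0xCD, 0x21, /* 010A: int 21 */
    0xB4, 0x4C, /* 010C: mov ah,4C  (exit, return code in AL) */
    0xCD, 0x21  /* 010E: int 21 */
};

/* Write the image to `path`; returns 0 on success, -1 on failure. */
int write_com(const char *path)
{
    assert(path != NULL);
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    size_t n = fwrite(respond_com, 1, sizeof respond_com, f);
    int closed = fclose(f);
    return (n == sizeof respond_com && closed == 0) ? 0 : -1;
}
```

There is no symbol table, section list or relocation entry anywhere in those bytes, which is exactly why no linker or loader logic is needed to run them.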

Again, in the case of boot loaders, the code is converted to a flat binary format. There was a program that used to come with DOS, called EXE2BIN, whose job was to convert the raw object code into a format that could be copied onto a bootable disk for booting. Remember that no linker is run against the assembled code, as the linker targets the runtime environment and sets up the code to make it runnable and executable.

When booting, the BIOS expects code to be at segment:offset 0000:7C00 (address 0x7c00), if my memory serves me correctly. The code (after being EXE2BIN'd) starts executing there. The bootloader then relocates itself lower down in memory and continues loading: it issues int 0x13 to read from the disk, switches on the A20 gate, enables the DMA, and switches into protected mode, since the BIOS runs in 16-bit mode. The data read from the disk is loaded into memory, and the bootloader then issues a far jump into that data, which is code (likely written in C). That is in essence how the system boots.

OK, the previous paragraph sounds abstract and simple, and I may have missed something, but that is how it works in a nutshell.

Hope this helps. Best regards, Tom.

#5


1  

They compile to a file in a specific format (COFF for Windows, etc.), composed of headers and segments, some of which contain "plain binary" opcodes. Assemblers and compilers (such as C compilers) create the same sort of output. Some formats, such as the old *.COM files, had no headers, but still carried certain assumptions (such as where in memory the file would get loaded or how big it could be).

On Windows machines, the OS's bootstrapper is in a disk sector loaded by the BIOS, and both of these are "plain". Once the OS has loaded its loader, it can read files that have headers and segments.

Does that help?

#6


1  

To answer the assembly part of the question: assembly doesn't compile to binary, as I understand it. Assembly === binary. It translates directly. Each assembly operation has a binary string that directly matches it. Each operation has a binary code, and each register has a binary address.

That is, unless Assembler != Assembly and I'm misunderstanding your question.

#7


1  

There are two things that you may be mixing up here. Generally there are two topics:

The latter may compile to the former in the process of assembly. Some intermediate formats are not assembled, but executed by a virtual machine. In the case of C++, it may be compiled into CIL, which is assembled into a .NET assembly, hence there may be some confusion.

But in general C and C++ are usually compiled into binary, or in other words, into an executable file format.

#8


1  

You have a lot of answers to read through, but I think I can keep this succinct.

"Binary code" refers to the bits that feed through the microprocessor's circuits. The microprocessor loads each instruction from memory in sequence and does whatever it says. Different processor families have different instruction formats: x86, ARM, PowerPC, etc. You point the processor at the instruction you want by giving it the address of that instruction in memory, and then it chugs merrily along through the rest of the program.

When you want to load a program into the processor, you first have to make the binary code accessible in memory so that it has an address in the first place. The C compiler outputs a file in the filesystem, which has to be loaded into a new virtual address space. Therefore, in addition to the binary code, that file has to include the information that it contains binary code, and what its address space should look like.

A bootloader has different requirements, so its file format might be different. But the idea is the same: binary code is always a payload in a larger file format, which includes at a minimum a sanity check to ensure that it's written in the correct instruction set.
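ELF is one format where this sanity check is easy to see: its header has a 16-bit e_machine field, at byte offset 18, that names the instruction set. The sketch below reads that field from a header already loaded into a buffer; the function name elf_machine is invented, and the three constants are the values the ELF specification assigns to those architectures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A few e_machine values from the ELF specification. */
enum { EM_386 = 3, EM_ARM = 40, EM_X86_64 = 62 };

/* Returns the e_machine field, or -1 if the buffer does not look like an
 * ELF header. Byte 5 of the header (EI_DATA) gives the byte order. */
int elf_machine(const unsigned char *hdr, size_t len)
{
    assert(hdr != NULL);
    if (len < 20 || memcmp(hdr, "\x7f" "ELF", 4) != 0)
        return -1;
    if (hdr[5] == 1) /* little-endian */
        return hdr[18] | (hdr[19] << 8);
    if (hdr[5] == 2) /* big-endian */
        return (hdr[18] << 8) | hdr[19];
    return -1;
}
```

A loader can compare the returned value with the machine it is running on and refuse the file before touching any of the code inside it.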

C compilers and assemblers are typically configured to produce object files, which can be collected into static library files. For embedded applications, you're more likely to find a compiler that produces something like a raw memory image with instructions beginning at address zero. Otherwise, you can write a linker that converts the output of the C compiler into whatever else you want.

#9


0  

As I understand it, a chipset (CPU, etc.) has a set of registers for storing data and understands a set of instructions for manipulating these registers. The instructions are things like 'store this value to this register', 'move this value', or 'compare these two values'. These instructions are often expressed in short human-grokable alphabetic codes (assembly language, or assembler), which are mapped to the numbers that the chipset understands. Those numbers are presented to the chip in binary (machine code).

Those codes are the lowest level that the software gets down to. Going deeper than that gets into the architecture of the actual chip, which is something I haven't gotten involved in.
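The mapping from instructions to numbers can be sketched with a toy machine in C. Everything here is invented for illustration (the three opcodes, the 3-byte instruction layout, the four registers); no real CPU uses this encoding, but the fetch, decode and execute loop has the same shape.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Invented opcodes for a toy machine. */
enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2 };

/* Each instruction is 3 bytes: opcode, destination register (0-3), operand.
 * OP_LOAD: reg[dst] = operand.  OP_ADD: reg[dst] += reg[operand].
 * Returns the final value of register 0. */
int run_toy(const uint8_t *code, size_t len)
{
    int reg[4] = {0};
    size_t pc = 0; /* program counter: index of the next instruction */

    assert(code != NULL);
    while (pc + 2 < len) {
        uint8_t op = code[pc];
        uint8_t dst = code[pc + 1] & 3;
        uint8_t arg = code[pc + 2];
        pc += 3;

        if (op == OP_LOAD)
            reg[dst] = arg;
        else if (op == OP_ADD)
            reg[dst] += reg[arg & 3];
        else
            break; /* OP_HALT or an unknown opcode stops the machine */
    }
    return reg[0];
}
```

An "assembler" for this machine would simply turn text such as "LOAD r0, 2" into the bytes 1, 0, 2.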

#10


0  

There are plenty of answers above for you to look at, but I thought I'd add these resources that'll give you a flavour of what happens. Basically, on Windows and Linux, someone has tried to create the tiniest executable possible: ELF on Linux, PE on Windows.

Both run through what is removed and why, and use an assembler to construct the files by hand, without relying on options like -felf that do it for you.

Hope that helps.

Edit: you could also take a look at the assembly for a bootloader like the one in TrueCrypt (http://www.truecrypt.org) or "stage1" of GRUB (the bit that actually gets written to the MBR).

#11


0  

Executable files (PE format on Windows) cannot be used to boot the computer because the PE loader is not yet in memory.

The way bootstrapping works is that the master boot record on the disk contains a blob of a few hundred bytes of code. The BIOS of the computer (in ROM on the motherboard) loads this blob into memory and sets the CPU instruction pointer to the beginning of this boot code.
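The boot blob is about as close to "plain binary" as it gets, but even it follows one convention: on a classic PC BIOS, a bootable 512-byte sector ends with the two signature bytes 0x55 and 0xAA. A minimal check in C might look like this; the function name has_boot_signature is made up, and the sector is assumed to have been read into memory already.

```c
#include <assert.h>
#include <stddef.h>

enum { SECTOR_SIZE = 512 };

/* Returns 1 if the buffer is one 512-byte sector ending in 0x55 0xAA. */
int has_boot_signature(const unsigned char *sector, size_t len)
{
    assert(sector != NULL);
    return len == SECTOR_SIZE
        && sector[510] == 0x55
        && sector[511] == 0xAA;
}
```

Everything before those two bytes is raw code and data (plus the partition table in a master boot record), with no headers describing it.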

The boot code then loads a "second stage" loader from the root directory; on Windows this is called NTLDR (no extension). This is raw machine code that, like the MBR loader, is loaded into memory cold and executed.

NTLDR has the full capability to load PE files, including DLLs and drivers.

#12


-4  

C(++) (unmanaged) really compiles to plain binary. The OS-related stuff consists of BIOS and OS function calls; they're different for each OS, but still binary.
1. Assembler compiles to pure binary but, strange as it may seem, the result is less optimized than C(++).
2. The OS kernel, as well as the bootloader, is also written in C, so no problems here.

Java, Managed C++ and other .NET stuff compile into some pseudocode (MSIL in .NET), which makes it cross-OS and cross-platform, but requires a local interpreter or translator to run.

#1


38  

C typically compiles to assembler, just because that makes life easy for the poor compiler writer.

C通常编译为汇编程序,因为这使得编写编译器的编写变得很容易。

Assembly code always assembles (not "compiles") to relocatable object code. You can think of this as binary machine code and binary data, but with lots of decoration and metadata. The key parts are:

汇编代码总是汇编(而不是“编译”)以重新定位目标代码。您可以将其视为二进制机器代码和二进制数据,但需要大量的修饰和元数据。关键部分:

  • Code and data appear in named "sections".

    代码和数据出现在命名的“节”中。

  • Relocatable object files may include definitions of labels, which refer to locations within the sections.

    可重定位对象文件可以包含标签的定义,标签指的是区域内的位置。

  • Relocatable object files may include "holes" that are to be filled with the values of labels defined elsewhere. The official name for such a hole is a relocation entry.

    可重定位的对象文件可能包含“漏洞”,这些“漏洞”将填充其他地方定义的标签的值。这样一个洞的官方名称是一个重新安置的入口。

For example, if you compile and assemble (but don't link) this program

例如,如果编译并组装(但不链接)此程序

int main () { printf("Hello, world\n"); }

you are likely to wind up with a relocatable object file with

您可能会得到一个可重定位对象文件

  • A text section containing the machine code for main

    包含main机器代码的文本部分

  • A label definition for main which points to the beginning of the text section

    指向文本部分开头的main的标签定义

  • A rodata (read-only data) section containing the bytes of the string literal "Hello, world\n"

    一个rodata(只读数据)部分,包含字符串文字“Hello, world\n”的字节

  • A relocation entry that depends on printf and that points to a "hole" in a call instruction in the middle of a text section.

    一个重定位条目,它依赖于printf并指向文本部分中间的调用指令中的一个“洞”。

If you are on a Unix system a relocatable object file is generally called a .o file, as in hello.o, and you can explore the label definitions and uses with a simple tool called nm, and you can get more detailed information from a somewhat more complicated tool called objdump.

如果您在Unix系统上,可重定位对象文件通常称为.o文件,如hello中所示。o,您可以使用一个简单的工具nm来研究标签定义和使用,您可以从一个更复杂的工具objdump获得更详细的信息。

I teach a class that covers these topics, and I have students write an assembler and linker, which takes a couple of weeks, but when they've done that most of them have a pretty good handle on relocatable object code. It's not such an easy thing.

我教过一个涵盖这些主题的课程,我让学生们编写汇编程序和链接器,这需要几周的时间,但是当他们完成之后,他们中的大多数人都能很好地处理可重定位的对象代码。这不是一件容易的事。

#2


34  

Let's take a C program.

我们取一个C程序。

When you run 'gcc' or 'cl' on the c program, it will go through these stages:

当你在c程序上运行“gcc”或“cl”时,它将经历这些阶段:

  1. Preprocessor lexing(#include, #ifdef, trigraph analysis, encoding translations, comment management, macros...)
  2. 预处理程序lexing(#include, #ifdef, trigraph analysis,编码翻译,注释管理,宏…)
  3. Lexical analysis(producing tokens and lexical errors).
  4. 词法分析(生成标记和词法错误)。
  5. Syntactical analysis(producing a parse tree and syntactical errors).
  6. 语法分析(生成解析树和语法错误)。
  7. Semantic analysis(producing a symbol table, scoping information and scoping/typing errors).
  8. 语义分析(生成符号表、范围信息和范围/类型错误)。
  9. Output into assembly(or another intermediate format)
  10. 输出到程序集(或其他中间格式)
  11. Optimization of assembly(as above). Probably in ASM strings still.
  12. 优化组装(如上所述)。可能还是在ASM中。
  13. Assembling of the assembly into some binary object format.
  14. 将程序集组装成某种二进制对象格式。
  15. Linking of the assembly into whatever static libraries are needed, as well as relocating it if needed.
  16. 将程序集链接到需要的任何静态库中,并在需要时重新定位程序集。
  17. Output of final executable in elf or coff format.
  18. 最后可执行文件的输出精灵或coff格式。

In practice, some of these steps may be done at the same time, but this is the logical order.

在实践中,有些步骤可以同时完成,但这是逻辑顺序。

Note that there's a 'container' of elf or coff format around the actual executable binary.

注意,在实际的可执行二进制文件周围有一个elf或coff格式的“容器”。

You will find that a book on compilers(I recommend the Dragon book, the standard introductory book in the field) will have all the information you need and more.

你会发现一本关于编译器的书(我推荐《龙书》,这个领域的标准入门书)会提供你所需要的所有信息。

As Marco commented, linking and loading is a large area and the Dragon book more or less stops at the output of the executable binary. To actually go from there to running on an operating system is a decently complex process, which Levine in Linkers and Loaders covers.

正如Marco所言,链接和加载是一个很大的区域,而Dragon book或多或少会在可执行二进制文件的输出端停止。实际上,要在操作系统上运行,是一个非常复杂的过程,Levine在Linkers和Loaders中介绍了这个过程。

I've wiki'd this answer to let people tweak any errors/add information.

我有wiki这个答案,可以让人们修改任何错误/添加信息。

#3


17  

There are different phases in translating C++ into a binary executable. The language specification does not explicitly state the translation phases. However, I will describe the common translation phases.

将c++翻译成二进制可执行文件有不同的阶段。语言规范没有明确地说明翻译阶段。然而,我将描述常见的翻译阶段。

Source C++ To Assembly or Itermediate Language

Some compilers actually translate the C++ code into an assembly language or an intermediate language. This is not a required phase, but helpful in debugging and optimizations.

有些编译器实际上将c++代码转换为汇编语言或中间语言。这不是必需的阶段,但对调试和优化很有帮助。

Assembly To Object Code

The next common step is to translate Assembly language into an Object code. The object code contains assembly code with relative addresses and open references to external subroutines (methods or functions). In general, the translator puts in as much information into an object file as it can, everything else is unresolved.

下一个常见的步骤是将汇编语言转换成一个对象代码。对象代码包含具有相对地址的汇编代码和对外部子例程(方法或函数)的开放引用。一般来说,转换器将尽可能多的信息输入到对象文件中,其他的一切都无法解决。

Linking Object Code(s)

The linking phase combines one or more object codes, resolves references and eliminates duplicate subroutines. The final output is an executable file. This file contains information for the operating system and relative addresses.

链接阶段组合一个或多个对象代码,解析引用并消除重复的子例程。最后的输出是一个可执行文件。此文件包含操作系统的信息和相关地址。

Executing Binary Files

The Operating System loads the executable file, usually from a hard drive, and places it into memory. The OS may convert relative addresses into physical locations. The OS may also prepare resources (such as DLLs and GUI widgets) that are required by the executable (which may be stated in the Executable file).

操作系统通常从硬盘上加载可执行文件,并将其放入内存中。操作系统可以将相对地址转换为物理位置。操作系统还可以准备可执行文件所需的资源(如dll和GUI窗口小部件)(可在可执行文件中声明)。

Compiling Directly To Binary Some compilers, such as the ones used in Embedded Systems, have the capability to compile from C++ directly to an executable binary code. This code will have physical addresses instead of relative address and not require an OS to load.

一些编译器(如嵌入式系统中使用的)可以直接编译成二进制代码,从c++直接编译成可执行的二进制代码。此代码将具有物理地址,而不是相对地址,不需要加载操作系统。

Advantages

One of the advantages of these phases is that C++ programs can be broken into pieces, compiled individually and linked at a later time. They can even be linked with pieces from other developers (a.k.a. libraries). This allows developers to only compiler pieces in development and link in pieces that are already validated. In general, the translation from C++ to object is the time consuming part of the process. Also, a person doesn't want to wait for all the phases to complete when there is an error in the source code.

这些阶段的优点之一是,c++程序可以分解成不同的部分,单独编译,并在稍后进行链接。它们甚至可以与其他开发人员(也称为库)的作品链接在一起。这允许开发人员只在开发中编译部分,并将已验证的部分链接到其中。通常,从c++到object的转换是过程中耗时的部分。此外,当源代码中出现错误时,一个人不希望等待所有的阶段完成。

Keep an open mind and always expect the Third Alternative (Option).

保持开放的心态,永远期待第三种选择。

#4


3  

To answer your questions, please note that this is subjective as there are different processors, different platforms, different assemblers and C compilers, in this case, I will talk about the Intel x86 platform.

要回答您的问题,请注意这是主观的,因为有不同的处理器、不同的平台、不同的汇编器和C编译器,在本例中,我将讨论Intel x86平台。

  1. Assemblers do not compile to pure binary, they are raw machine code, defined with segments, such as data, text and bss to name but a few, this is called object code. The Linker steps in and adjusts the segments to make it executable, that is, ready to run. Incidentally, the default output when you compile using gcc is 'a.out', that is a shorthand for Assembler Output.
  2. 汇编程序不会编译成纯二进制,它们是原始的机器代码,用段(如数据、文本和bss)来定义,但是有一些,这叫做对象代码。链接器进入并调整段以使其可执行,也就是说,可以运行。顺便说一句,使用gcc编译时的默认输出是“a”。这是汇编输出的简写。
  3. Boot loaders have a special directive defined, back in the days of DOS, it would be common to find a directive such as .Org 100h, which defines the assembler code to be of the old .COM variety before .EXE took over in popularity. Also, you did not need to have a assembler to produce a .COM file, using the old debug.exe that came with MSDOS, did the trick for small simple programs, the .COM files did not need a linker and were straight ready-to-run binary format. Here's a simple session using DEBUG.
  4. 引导加载器有一个特殊的指令定义,在DOS的日子里,通常会找到一个指令,比如org 100h,它将汇编代码定义为以前的com类型。exe流行起来。此外,您不需要一个汇编程序来使用旧的调试生成一个。com文件。MSDOS附带的exe对简单的小程序起了作用,. com文件不需要链接器,而是直接准备运行的二进制格式。下面是一个使用DEBUG的简单会话。
1:*a 0100
2:* mov AH,07
3:* int 21
4:* cmp AL,00
5:* jnz 010c
6:* mov AH,07
7:* int 21
8:* mov AH,4C
9:* int 21
10:*
11:*r CX
12:*10
13:*n respond.com
14:*w
15:*q

This produces a ready-to-run .COM program called 'respond.com' that waits for a keystroke and not echo it to the screen. Notice, the beginning, the usage of 'a 100h' which shows that the Instruction pointer starts at 100h which is the feature of a .COM. This old script was mainly used in batch files waiting for a response and not echo it. The original script can be found here.

这将生成一个名为“respond.com”的可运行的。com程序,该程序等待一个击键,而不会将其回显到屏幕上。注意,开始时,使用'a 100h'表示指令指针从100h开始,这是。com的特性。这个旧脚本主要用于等待响应的批处理文件,而不是回显。原始的脚本可以在这里找到。

Again, in the case of boot loaders, they are converted to a binary format, there was a program that used to come with DOS, called EXE2BIN. That was the job of converting the raw object code into a format that can be copied on to a bootable disk for booting. Remember no linker is run against the assembled code, as the linker is for the runtime environment and sets up the code to make it runnable and executable.

再一次,在引导加载程序的情况下,它们被转换成二进制格式,有一个程序过去与DOS一起使用,叫做EXE2BIN。这就是将原始对象代码转换为可以复制到可引导磁盘上进行引导的格式的工作。记住,没有链接器是针对汇编的代码运行的,因为链接器是针对运行时环境的,并设置代码使其可运行和可执行。

The BIOS when booting, expects code to be at segment:offset, 0x7c00, if my memory serves me correct, the code (after being EXE2BIN'd), will start executing, then the bootloader relocates itself lower down in memory and continue loading by issuing int 0x13 to read from the disk, switch on the A20 gate, enable the DMA, switch onto protected mode as the BIOS is in 16bit mode, then the data read from the disk is loaded into memory, then the bootloader issues a far jump into the data code (likely to be written in C). That is in essence how the system boots.

BIOS启动时,预计代码段:抵消,0 x7c00,如果我没记错的话正确,代码(EXE2BIN后会),将开始执行,然后引导装载程序拆迁本身降低内存中并继续加载通过发行int 0 * 13从磁盘读取,A20门开关,使DMA,切换到保护模式是BIOS在16位模式中,然后从磁盘读取的数据加载到内存中,然后引导加载程序向数据代码(很可能是用C语言编写的)发出一个较大的跳转,这本质上就是系统引导的方式。

Ok, the previous paragraph sounds abstracted and simple, I may have missed out something, but that is how it is in a nutshell.

好吧,前一段听起来很抽象也很简单,我可能漏掉了一些东西,但简单来说就是这样。

Hope this helps, Best regards, Tom.

祝你好运,汤姆。

#5


1  

They compile to a file in a specific format (COFF for Windows, etc), composed of headers and segments, some of which have "plain binary" op codes. Assemblers and compilers (such as C) create the same sort of output. Some formats, such as the old *.COM files, had no headers, but still had certain assumptions (such as where in memory it would get loaded or how big it could be).

它们以一种特定的格式(Windows的COFF等)编译到一个文件中,该格式由头文件和段文件组成,其中一些文件具有“纯二进制”op代码。汇编程序和编译器(如C)创建相同类型的输出。一些格式,比如旧的*。COM文件,没有头,但是仍然有一些假设(比如在内存中它会被加载,或者它有多大)。

On Windows machines, the OS's boostrapper is in a disk sector loaded by the BIOS, where both of these are "plain". Once the OS has loaded its loader, it can read files that have headers and segments.

在Windows机器上,操作系统的boostrapper位于BIOS加载的磁盘扇区中,这两个扇区都是“普通的”。一旦操作系统加载了加载程序,它就可以读取具有头文件和段文件。

Does that help?

这有帮助吗?

#6


1  

To answer the assembly part of the question, assembly doesn't compile to binary as I understand it. Assembly === binary. It directly translates. Each assembly operation has a binary string that directly matches it. Each operation has a binary code, and each register variable has a binary address.

要回答问题的汇编部分,我理解的汇编语言不是二进制的。组装= = =二进制。它直接翻译。每个程序集操作都有一个直接匹配它的二进制字符串。每个操作都有一个二进制代码,每个寄存器变量都有一个二进制地址。

That is, unless Assembler != Assembly and I'm misunderstanding your question.


#7


1  

There are two things that you may be mixing up here. Generally there are two topics:

The latter may be translated into the former in the process of assembly. Some intermediate formats are not assembled but executed by a virtual machine. In the case of C++, it may be compiled into CIL, which is assembled into a .NET assembly, hence there may be some confusion.
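I don't have a CIL example at hand, but the same idea of an intermediate format run by a virtual machine can be seen in Python itself, so take this as an analogy rather than as .NET code. The sketch compiles an expression to Python bytecode and lets the interpreter execute it; the exact bytes and listing depend on the Python version.

```python
import dis

# Compile source text to a code object holding VM bytecode,
# not native machine code.
code = compile("a + b", "<sketch>", "eval")
bytecode = code.co_code                  # raw bytes for the Python VM
listing = dis.Bytecode(code).dis()       # human-readable disassembly

# The virtual machine, not the CPU directly, executes the bytecode.
result = eval(code, {"a": 2, "b": 3})
```

Printing `listing` shows the VM's own "assembly language", which plays the role that CIL plays for .NET.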

But in general, C and C++ are compiled into binary, or in other words, into an executable file format.

#8


1  

You have a lot of answers to read through, but I think I can keep this succinct.


"Binary code" refers to the bits that feed through the microprocessor's circuits. The microprocessor loads each instruction from memory in sequence, doing whatever they say. Different processor families have different formats for instructions: x86, ARM, PowerPC, etc. You point the processor at the instruction you want by giving it the address of the instruction in memory, and then it chugs merrily along through the rest of the program.


When you want to load a program into the processor, you first have to make the binary code accessible in memory so it has an address in the first place. The C compiler outputs a file in the filesystem, which has to be loaded into a new virtual address space. Therefore, in addition to binary code, that file has to include the information that it has binary code, and what its address space should look like.


A bootloader has different requirements, so its file format might be different. But the idea is the same: binary code is always a payload in a larger file format, which includes at a minimum a sanity check to ensure that it's written in the correct instruction set.

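Here is a minimal Python sketch of that sanity check, using the ELF64 header as the "larger file format". It builds a bare 64-byte header and then verifies the magic bytes and the instruction set before returning the entry address. The helper names `build_header` and `check_header` are invented for this example, and a real loader reads much more (program headers, segments, permissions).

```python
import struct

# ELF64 file header: e_ident, e_type, e_machine, e_version, e_entry,
# e_phoff, e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum,
# e_shentsize, e_shnum, e_shstrndx (64 bytes, little-endian).
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")
ET_EXEC = 2
EM_X86_64 = 62


def build_header(entry: int) -> bytes:
    """Build a bare ELF64 header for x86-64 (illustration only)."""
    # Magic, 64-bit class, little-endian data, version 1, then padding.
    ident = b"\x7fELF" + bytes([2, 1, 1]) + b"\x00" * 9
    return ELF64_HEADER.pack(ident, ET_EXEC, EM_X86_64, 1, entry,
                             64, 0, 0, 64, 56, 0, 0, 0, 0)


def check_header(raw: bytes) -> int:
    """Return the entry address if the header looks like x86-64 ELF."""
    fields = ELF64_HEADER.unpack(raw[:ELF64_HEADER.size])
    ident, _etype, machine, _version, entry = fields[:5]
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if machine != EM_X86_64:
        raise ValueError("wrong instruction set")
    return entry
```

The entry address is the "what its address space should look like" part in miniature: it tells the loader where execution starts once the code is mapped.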

C compilers and assemblers are typically configured to produce object files or static library files. For embedded applications, you're more likely to find a toolchain which produces something like a raw memory image with instructions beginning at address zero. Otherwise, you can write a linker which converts the output of the C compiler into whatever else you want.
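To give a feel for the "raw memory image" case, here is a small Python sketch that flattens sections into one image. It assumes each section is already resolved to a load address, which is the linker's job; `flatten` and the `(address, bytes)` pair layout are made up for this example and are not any real tool's API.

```python
def flatten(sections, base=0):
    """Lay out (load_address, data) pairs as one raw image from `base`.

    Gaps between sections are filled with zero bytes.
    """
    end = max((addr + len(data) for addr, data in sections), default=base)
    image = bytearray(end - base)
    for addr, data in sections:
        if addr < base:
            raise ValueError("section starts below the image base")
        start = addr - base
        image[start:start + len(data)] = data
    return bytes(image)


# Code at address 0 and a small data section at address 8.
image = flatten([(0, b"\x90\xc3"), (8, b"hi")])
```

The result has no headers at all: byte 0 of `image` is the first instruction, which is what a ROM or a simple embedded loader expects.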

#9


0  

As I understand it, a chipset (CPU, etc.) will have a set of registers for storing data, and understand a set of instructions for manipulating these registers. The instructions will be things like 'store this value to this register', 'move this value', or 'compare these two values'. These instructions are often expressed in short human-grokable alphabetic codes (assembly language, or assembler) which are mapped to the numbers that the chipset understands. Those numbers are presented to the chip in binary (machine code).
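A toy version of such a machine can be sketched in a few lines of Python. Everything here is hypothetical (the register names, the three operations, the tuple format); it only mirrors the 'store', 'move' and 'compare' examples from the paragraph and is not modelled on any real chip.

```python
def run(program):
    """Execute a list of (op, a, b) tuples on a toy register machine."""
    regs = {"r0": 0, "r1": 0, "flag": 0}
    for op, a, b in program:
        if op == "load":        # store an immediate value in register a
            regs[a] = b
        elif op == "move":      # copy register b into register a
            regs[a] = regs[b]
        elif op == "cmp":       # set flag to 1 if the registers are equal
            regs["flag"] = int(regs[a] == regs[b])
        else:
            raise ValueError(f"unknown instruction: {op}")
    return regs


state = run([("load", "r0", 7), ("move", "r1", "r0"), ("cmp", "r0", "r1")])
```

A real CPU does the same kind of thing, except that each tuple is a bit pattern and the loop is implemented in hardware.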

Those codes are the lowest level that the software gets down to. Going deeper than that gets into the architecture of the actual chip, which is something I haven't gotten involved in.


#10


0  

There are plenty of answers above for you to look at, but I thought I'd add these resources that'll give you a flavour of what happens. Basically, on both Windows and Linux, someone has tried to create the tiniest executable possible: an ELF file on Linux and a PE file on Windows.

Both walk through what is removed and why, and they use assemblers to construct the files by hand, without using options like -felf that do it for you.

Hope that helps.


Edit: you could also take a look at the assembly for a bootloader like the one in TrueCrypt (http://www.truecrypt.org) or "stage1" of GRUB (the bit that actually gets written to the MBR).

#11


0  

Executable files (PE format on Windows) cannot be used to boot the computer because the PE loader is not in memory.

The way bootstrapping works is that the master boot record on the disk contains a blob of a few hundred bytes of code. The BIOS of the computer (in ROM on the motherboard) loads this blob into memory and sets the CPU instruction pointer to the beginning of this boot code.

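For reference, the classic MBR layout is 446 bytes of boot code, four 16-byte partition table entries, and a two-byte 0x55 0xAA signature. The Python sketch below splits a 512-byte sector along those lines; `split_mbr` is a made-up name, and the sample sector is synthetic rather than read from a real disk.

```python
def split_mbr(sector: bytes):
    """Split a classic 512-byte MBR into code, partition table, signature."""
    if len(sector) != 512:
        raise ValueError("an MBR is exactly one 512-byte sector")
    code = sector[:446]                                   # raw boot code
    table = [sector[446 + 16 * i:446 + 16 * (i + 1)] for i in range(4)]
    bootable = sector[510:] == b"\x55\xaa"                # signature check
    return code, table, bootable


# Synthetic sector: an endless loop, an empty table, and the signature.
sample = b"\xeb\xfe" + b"\x00" * 508 + b"\x55\xaa"
code, table, bootable = split_mbr(sample)
```

The BIOS does not interpret any of this beyond the signature; it copies the whole sector into memory and jumps to the first byte of `code`.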

The boot code then loads a "second stage" loader from the root directory; on Windows it is called NTLDR (no extension). This is raw machine code that, like the MBR loader, is loaded into memory cold and executed.

NTLDR has the full capability to load PE files including DLLs and drivers.


#12


-4  

C(++) (unmanaged) really compiles to plain binary. The OS-related stuff consists of BIOS and OS function calls; they're different for each OS, but still binary.
1. Assembler compiles to pure binary, but, strange as it may seem, the result is less optimized than C(++).
2. The OS kernel, as well as the bootloader, is also written in C, so no problems here.

Java, Managed C++, and .NET languages compile into an intermediate code (MSIL in .NET), which makes them cross-OS and cross-platform, but requires a local interpreter or translator to run.