When does a memory load cause a bus error on x86-64 Linux?

I used to think that x86-64 supports unaligned memory access and that an invalid memory access always causes a segmentation fault (except, perhaps, for SIMD instructions like movdqa or movaps). Nevertheless, I recently observed a bus error with a normal mov instruction. Here is a reproducer:

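/* Load 8 bytes from address a, using %rbp as the base register. */
void test(void *a)
{
    asm("mov %0, %%rbp\n\t"
        "mov 0(%%rbp), %%rdx\n\t"
        : : "r"(a) : "rbp", "rdx");
}

int main(void)
{
    /* Address copied from the coredump of the buggy program. */
    test((void *)0x706a2e3630332d69);
    return 0;
}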

(must be compiled with frame pointer omission, e.g. gcc -O test.c && ./a.out).

The mov 0(%rbp), %rdx instruction and the address 0x706a2e3630332d69 were copied from a coredump of the buggy program. Changing the address to 0 causes a segfault, but merely aligning it to 0x706a2e3630332d60 still gives a bus error (my guess is that this is related to the address space being 48-bit on x86-64).

The question is: which addresses cause bus error (SIGBUS)? Is it determined by architecture or configured by OS kernel (i.e. in page table, control registers or something similar)?

3 Answers

#1

SIGBUS is in a sad state. There's no consensus between operating systems on what it should mean, and when it is generated varies wildly between operating systems, CPU architectures, configurations and the phase of the moon. Unless you work with a very specific configuration, you should treat it as "just like SIGSEGV, but different".

I suspect that originally it was supposed to mean "you tried a memory access that could not possibly be successful no matter what the kernel does", so in other words the exact bit pattern you have in the address can never be a valid memory access. Most commonly this would mean unaligned access on strict alignment architectures. Then some systems started using it for accesses to virtual address space that doesn't exist (like in your example, the address you have can't exist). Then by accident some systems made it also mean that userland tried to touch kernel memory (since at least technically it's virtual address space that doesn't exist from the point of view of userland). Then it became just random.

Other than that I've seen SIGBUS from:

access to a non-existent physical address from mmap'ed hardware.
exec of a non-exec mapping.
access to a perfectly valid mapping, but overcommitted memory couldn't be faulted in at that moment (I've seen SIGSEGV, SIGKILL and SIGBUS here; at least one operating system does this differently depending on which architecture you're on).
memory management deadlocks (and other "something went horribly wrong, but we don't know what" memory management errors).
stack red zone access.
hardware errors (ECC memory, PCI bus parity errors, etc.).
access to an mmap'ed file where the file contents don't exist (past the end of the file or in a hole).
access to an mmap'ed file where the file contents should exist, but don't (I/O errors).
access to normal memory that got swapped out when the swap-in couldn't be performed (I/O error).
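
Some of these cases can be told apart at run time, because a handler installed with SA_SIGINFO receives an si_code value. The following is only a sketch, assuming a POSIX system; the names bus_code_name, report_sigbus and install_sigbus_reporter are made up for this example, and many of the causes above arrive with a code that falls into the "other" branch.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Map the POSIX-defined SIGBUS si_code values to their names.
 * (Hypothetical helper; systems may deliver other codes as well.) */
static const char *bus_code_name(int si_code)
{
    switch (si_code) {
    case BUS_ADRALN: return "BUS_ADRALN";  /* invalid address alignment */
    case BUS_ADRERR: return "BUS_ADRERR";  /* nonexistent physical address */
    case BUS_OBJERR: return "BUS_OBJERR";  /* object-specific hardware error */
    default:         return "other";
    }
}

/* SA_SIGINFO-style handler: print the code name and exit.
 * Only async-signal-safe calls are used here. */
static void report_sigbus(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    const char *name = bus_code_name(info->si_code);
    ssize_t r = write(STDERR_FILENO, name, strlen(name));
    r = write(STDERR_FILENO, "\n", 1);
    (void)r;
    _exit(1);
}

/* Install the reporter for SIGBUS; returns 0 on success, -1 on error. */
static int install_sigbus_reporter(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = report_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

Calling install_sigbus_reporter() early in main is enough to get a one-line report before the process exits; info->si_addr could be printed too, but formatting a pointer safely inside a handler takes more code than fits in a sketch.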

#2

Generally, a SIGBUS can be sent on an unaligned memory access, e.g. when writing a 64-bit integer to an address that is not 8-byte aligned. However, on recent systems either the hardware itself handles it correctly (albeit a bit slower than an aligned access), or the OS emulates the access in an exception handler (with 2 or more separate memory accesses).
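
If the code has to run on strict-alignment architectures as well, a common way to avoid the problem entirely is to go through memcpy, which has no alignment requirement. A minimal sketch (the helper names load_u64_unaligned and store_u64_unaligned are invented for this example):

```c
#include <stdint.h>
#include <string.h>

/* Read a 64-bit value from an address that may not be 8-byte aligned.
 * memcpy works for any alignment; on x86-64 an optimizing compiler
 * will typically reduce it to a single plain load. */
static uint64_t load_u64_unaligned(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Write a 64-bit value to an address that may not be 8-byte aligned. */
static void store_u64_unaligned(void *p, uint64_t v)
{
    memcpy(p, &v, sizeof v);
}
```

For instance, with unsigned char buf[16], calling store_u64_unaligned(buf + 1, x) followed by load_u64_unaligned(buf + 1) gives back x on any architecture, regardless of how it treats misaligned loads.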

In this case, the problem is that the address lies outside the permissible virtual address space. Although a pointer is 64 bits wide, x86-64 processors using 4-level paging implement only 48 bits of virtual address, and bits 63..47 must all be equal (a "canonical" address). That leaves two valid ranges: 0x0-0x00007fffffffffff and 0xffff800000000000-0xffffffffffffffff. Linux gives the lower range, 0-(2^47-1), to its processes; the upper range is used by the kernel.
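
On x86-64 the hardware rule is usually stated as "the address must be canonical": the bits above the implemented width must all be copies of the highest implemented bit. A small sketch of that check, assuming 48 implemented bits (4-level paging); VA_BITS and is_canonical are names chosen for this example:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 48 implemented virtual address bits (4-level paging). */
#define VA_BITS 48

/* An address is canonical when bits 63..(VA_BITS-1) are all 0 or all 1. */
static bool is_canonical(uint64_t addr)
{
    uint64_t upper = addr >> (VA_BITS - 1);   /* bits 63..47 */
    return upper == 0 || upper == (UINT64_MAX >> (VA_BITS - 1));
}
```

Under this assumption both 0x706a2e3630332d69 and its aligned variant 0x706a2e3630332d60 fail the check, while 0 passes it (it is canonical, just not mapped).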

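This means that the kernel sends a SIGBUS here because the address itself is invalid: it is non-canonical (bits 63..47 are not all equal), so no page table could ever map it. A SIGSEGV, by contrast, usually means that an access to a well-formed address failed (missing page entry, wrong access rights, etc.). Note that not every address >= 0x800000000000 behaves this way. On x86-64 Linux, a non-canonical access whose base register is %rbp or %rsp raises a stack fault (#SS), which the kernel reports as SIGBUS, whereas the same access through any other base register raises a general protection fault (#GP) and is reported as SIGSEGV. Canonical kernel addresses touched from user space also give SIGSEGV.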

#3

The only situation where POSIX specifically requires generation of a SIGBUS is when you create a file-backed mmap region that extends beyond the end of the backing file by more than a whole page, and then access addresses sufficiently far past the end. (The exact words are "References within the address range starting at pa and continuing for len bytes to whole pages following the end of an object shall result in delivery of a SIGBUS signal.", from the specification of mmap.)

In all other circumstances, whether you get a SIGSEGV or a SIGBUS for an invalid memory access, or no signal at all, is left completely up to the implementation.
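
The one POSIX-mandated case is easy to reproduce. The sketch below maps two pages over a one-byte temporary file and reads from the second page, which lies entirely past the end of the file. The function name signal_from_read_past_eof and the /tmp path template are made up for this example, and it assumes /tmp is writable.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_jmp;
static volatile sig_atomic_t caught_sig;

/* Record which signal arrived and jump back out of the faulting read. */
static void on_fault(int sig)
{
    caught_sig = sig;
    siglongjmp(fault_jmp, 1);
}

/* Returns the signal raised by reading a page wholly past EOF,
 * 0 if no signal was raised, or -1 if the setup failed. */
static int signal_from_read_past_eof(void)
{
    char path[] = "/tmp/sigbus-demo-XXXXXX";   /* hypothetical name */
    long page = sysconf(_SC_PAGESIZE);
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                              /* file lives until close */
    if (write(fd, "x", 1) != 1) {              /* file is 1 byte long */
        close(fd);
        return -1;
    }

    void *map = mmap(NULL, 2 * (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED)
        return -1;
    volatile char *p = map;

    struct sigaction sa, old_bus, old_segv;
    sa.sa_handler = on_fault;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, &old_bus);
    sigaction(SIGSEGV, &sa, &old_segv);

    caught_sig = 0;
    if (sigsetjmp(fault_jmp, 1) == 0) {
        (void)p[0];       /* first page: backed by the file */
        (void)p[page];    /* second page: entirely past EOF */
    }

    sigaction(SIGBUS, &old_bus, NULL);
    sigaction(SIGSEGV, &old_segv, NULL);
    munmap(map, 2 * (size_t)page);
    return caught_sig;
}
```

Comparing the return value against SIGBUS and SIGSEGV shows which signal the implementation chose; by the POSIX wording quoted above it should be SIGBUS. Both signals are caught so that a system that deviates does not simply kill the process.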
