
Time: 2023-01-05 03:07:22

I'm trying to understand the relationship between the C system call API, the syscall assembly instruction, and the exception mechanism (interrupts) used to switch contexts between processes. There's a lot to work out on my own, so please bear with me.


Is my understanding correct that C language system calls are implemented by compiler as syscall's with respective code in assembly, which in turn, are implemented by OS as exceptions mechanism (interrupts)?


So the call to the write function in the following C code:


#include <unistd.h>

int main(void)
{
    write(2, "There was an error writing to standard out\n", 44);
    return 0;
}

Is compiled to assembly as a syscall instruction:


mov eax,4       ; system call number (sys_write)

And the instruction, in turn, is implemented by the OS through the exception mechanism (interrupts)?


4 answers




The syscall instruction itself acts like a glorified jump: it is a hardware-supported way to jump efficiently and safely from unprivileged user space into the kernel.
The syscall instruction jumps to a kernel entry point that dispatches the call.


Before x86_64, two other mechanisms were used: the int instruction and the sysenter instruction.
They have different entry points (still present today in 32-bit kernels, and in 64-bit kernels that can run 32-bit user-space programs).
The former uses the x86 interrupt machinery and can be confused with exception dispatching (which also uses the interrupt machinery).
However, exceptions are unplanned events raised by the CPU, while int deliberately generates a software interrupt, which is, again, a glorified jump.


The C language doesn't concern itself with system calls; it relies on the C runtime to perform all interactions with the environment of the eventual program.


The C runtime implements the above-mentioned interactions through an environment specific mechanism.
There could be various layers of software abstractions but in the end the OS APIs get called.


The term API denotes a contract. Strictly speaking, using an API doesn't require invoking a piece of kernel code (the trend is to implement non-critical functions in user space to limit the exploitable code). Here we are only interested in the subset of the API that requires a privilege switch.


Under Linux, the kernel exposes a set of services accessible from user space; these entry points are called system calls.
Under Windows, the kernel services (which are reached with the same mechanism as their Linux analogues) are considered private, in the sense that they are not required to be stable across versions.
A set of functions exported by system DLLs (e.g. kernel32.dll, user32.dll and, beneath them, ntdll.dll) is used as the entry points instead; these in turn reach the kernel services through a (private) system call.
Note that under Linux most system calls have a POSIX wrapper around them, so these wrappers, which are ordinary C functions, can be used to invoke a system call.
The underlying ABI is different, and so is the error reporting; the wrapper translates between the two worlds.


The C runtime calls the OS APIs. On Linux the system calls are used directly because they are public (in the sense that they are stable across versions), while on Windows the usual DLLs, such as kernel32.dll, are marked as dependencies and used.


This brings us to the point where a user-mode program, whether it is part of the C runtime (Linux) or part of an API DLL (Windows), needs to invoke code in the kernel.


The x86 architecture historically offered different ways to do so, for example a call gate.
Another way is through the int instruction, which has a few advantages:


  • It is what the BIOS and DOS did in their day.
    In real mode, using an int instruction is convenient because a vector number (e.g. 21h) is easier to remember than a far address (e.g. 0f000h:0fff0h).
  • It saves the flags.
  • It is easy to set up (setting up an ISR is relatively easy).

With the modernization of the architecture this mechanism turned out to have a big disadvantage: it is slow. Before the introduction of the sysenter (note, sysenter not syscall) instruction there was no faster alternative (a call gate would be equally slow).


With the advent of the Pentium Pro/II[1], a new pair of instructions, sysenter and sysexit, was introduced to make system calls faster.
Linux has used them since version 2.5, and I believe they are still used today on 32-bit systems.
I won't explain the whole mechanism of the sysenter instruction and the companion VDSO necessary to use it; suffice it to say that it was faster than the int mechanism (I can no longer find the article by Andy Glew in which he says that sysenter turned out to be slow on the Pentium III, and I don't know how it performs nowadays).


With the advent of x86-64, the AMD response to sysenter, i.e. the syscall/sysret pair, became the de facto way to switch from user mode to kernel mode.
This is because syscall is actually fast and very simple (it copies rip and rflags into rcx and r11 respectively, masks rflags, and jumps to an address set in IA32_LSTAR).


64-bit versions of both Linux and Windows use syscall.


To recap, control can be given to the kernel through three mechanisms:


  • Software interrupts.
    This was int 80h for 32-bit Linux (pre 2.5) and int 2eh for 32-bit Windows.
  • Via sysenter.
    Used by 32-bit versions of Linux since 2.5.
  • Via syscall.
    Used by 64-bit versions of Linux and Windows.

Here is a nice page to put it in a better shape.


The C runtime is usually a static library, thus pre-compiled, that uses one of the three methods above.


The syscall instruction transfers control directly to a kernel entry point (see entry_64.S).
It is an instruction that does just that; it is not implemented by the OS, it is used by the OS.


The term exception is overloaded in CS: C++ has exceptions, and so do Java and C#.
The OS can have a language-agnostic exception trapping mechanism (under Windows it was once called SEH and has since been rewritten).
The CPU also has exceptions.
I believe we are talking about the last meaning.


Exceptions are dispatched through the interrupt machinery; they are a kind of interrupt.
It goes without saying that while exceptions are synchronous (they happen at specific, replayable points), they are "unwanted". They are exceptional in the sense that programmers tend to avoid them, and when they happen it is due to a bug, an unhandled corner case or a bad situation.
Thus they are not used to transfer control to the kernel (although they could be).


Software interrupts (which are synchronous too) were used instead; the mechanism is almost exactly the same (exceptions can have a status code pushed on the kernel stack) but the semantics are different.
We never dereferenced a null pointer, accessed an unmapped page or did anything similar to invoke a system call; we used the int instruction instead.




Is my understanding correct that C language system calls are implemented by compiler as syscall's with respective code in assembly […]?




The C compiler handles system calls the same way that it handles calls to any other function:


# write(2, "There was an error writing to standard out\n", 44);
mov    $44, %edx
lea    .LC0(%rip), %rsi  # address of the string
mov    $2, %edi
call   write

The implementation of these functions in libc (your system's C library) will probably contain a syscall instruction, or whatever the equivalent is on your system's architecture.






Yes, the C application calls a C library function, and buried in the C library is a system-specific call or set of calls that uses an architecture-specific way to reach the operating system, which has an exception/interrupt handler set up to deal with these system calls. It doesn't actually have to be architecture-specific: it can simply jump or call to a well-known address. With the modern desire for security and protection modes, though, a simple call won't have those added features, even if it is still functionally correct.


How the library is implemented is implementation-defined. How the compiler connects your code to that library, at run time or link time, can happen in a number of ways; there is no single way it can or needs to happen, so that is implementation-defined as well. As long as it is functionally correct and doesn't conflict with the C standard, it can work.


With operating systems like Windows and Linux, and others on our phones and tablets, there is a strong desire to isolate applications from the system so that they cannot do damage in various ways. Protection is therefore desired, and you need an architecture-specific way to make a function call into the operating system that is not a normal call, because it switches modes. If the architecture has more than one way to do this, the operating system can choose one or more of them as part of its design.


A "software interrupt" is one common way. As with hardware interrupts, most solutions include a table of handler addresses; by extending that table, some of the vectors can be tied to a software-created "interrupt" (triggered by a special instruction rather than by a signal changing state on an input) that still goes through the same sequence: stop, save some state, call the vector, and so on.




Not a direct answer to the question but this might interest you (I don't have enough karma to comment) - it explains all the user space execution (including glibc and how it does syscalls) in detail:




You'll probably be interested in particular in 'Step 8 - Final string written to standard output':


And what does __libc_write look like...?


000000000040f9c0 <__libc_write>:
  40f9c0:  83 3d c5 bb 2a 00 00   cmpl   $0x0,0x2abbc5(%rip)  # 6bb58c <__libc_multiple_threads>
  40f9c7:  75 14                  jne    40f9dd <__write_nocancel+0x14>

000000000040f9c9 <__write_nocancel>:
  40f9c9: b8 01 00 00 00          mov    $0x1,%eax
  40f9ce: 0f 05                   syscall 

Write simply checks the threading state and, assuming all is well, moves the write syscall number (1) in to EAX and enters the kernel.


Some notes:


  • The x86-64 Linux write syscall number is 1; on old 32-bit x86 it was 4
  • rdi holds the file descriptor (stdout here)
  • rsi points to the string
  • rdx is the string size count

Note that this was for the author's x86-64 Linux system.


For x86, this provides some help:




Under Linux the execution of a system call is invoked by a maskable interrupt or exception class transfer, caused by the instruction int 0x80. We use vector 0x80 to transfer control to the kernel. This interrupt vector is initialized during system startup, along with other important vectors like the system clock vector.


But as a general answer for a Linux kernel:


Is my understanding correct that C language system calls are implemented by compiler as syscall's with respective code in assembly, which in turn, are implemented by OS as exceptions mechanism (interrupts)?






