Syscall and Sysret
note: this article may contain errors (spelling, code, or others). If you find one, do not hesitate to open an issue or a PR at github.com/supercip971/supercip971.github.io ❤️
What is a syscall?
Syscalls let userspace programs trigger kernel actions. They are like complex functions that link the program and the kernel. For example, we can have a syscall to allocate memory, one to open a file... This is an important part of the kernel, and it needs to be very fast because a user application can make a lot of syscalls.
What were we doing before for syscalls?
Before (and some kernels still do this, and it is still quite effective) we used CPU interrupts: an interrupt takes you directly into the kernel by executing the handler code that the interrupt table points to.
The int instruction raises a given interrupt; for example, we can use
int 68 to raise interrupt number 68.
Some OSes reserve one interrupt for syscalls (wingos used interrupt 127, Linux uses 128...). This interrupt may be the only one that a ring 3 process is allowed to raise. In the interrupt handler, the registers are saved and used as arguments for the syscall.
note: all registers can be used, but RCX and R11 should not be used (if we want to easily make the kernel portable to 64bit syscall/sysret) because they are needed to save the cpu state with the
After the execution of the syscall code in the interrupt, we can modify the value of the RAX register so that it contains the syscall return value.
However, interrupts are slow for syscalls: the CPU has to check a lot of things on every interrupt, so this is not the best solution available.
Before syscall and sysret: sysenter and sysexit
Sysenter and sysexit were added by Intel. One problem with sysenter and sysexit in 32-bit mode is that you cannot assume they are supported: the instructions may not be available. Another problem is that, for each syscall, the return address must be written to RDX and the return stack to RCX. That would be fine, but the kernel does not know what the caller's RIP and RSP were!
The user app must put a return address and a return stack to syscall parameters themselves:
I think this is sketchy and can be a source of errors. This is my opinion, but I think syscall/sysret are 100 times better than sysenter/sysexit.
What are syscall and sysret?
First what are syscall and sysret?
Syscall and sysret are long mode instructions for making syscalls from userspace to the kernel. They allow faster and safer syscalls. They are faster because they assume fixed, consistent segments, so the CPU can skip most of the segment checks it performs for an interrupt.
Faster certainly but what is the gain in performance?
I wanted to test the "getpid" syscall on GNU/Linux, once with an interrupt and once with the syscall instruction (using g++ -O3 and Google Benchmark).
here are the results:
the syscall is 2 times faster than the interrupt!
note: I have a ryzen 5 3600X so results can be different on other cpus and systems
However, setting up the syscall instruction is a bit more complicated than setting up a syscall through an interrupt:
First you need to enable syscall/sysret by setting bit 0 of the EFER model-specific register (address 0xC0000080).
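As a rough sketch of this step, enabling the instructions is a read-modify-write of that MSR. The helper name `efer_with_syscall` and the `rdmsr`/`wrmsr` wrappers mentioned in the comment are my own assumptions, not wingos code. The helper only computes the new value, so it can be checked outside a kernel:

```cpp
#include <cstdint>

// Address of the EFER model-specific register, as given in the text.
constexpr std::uint32_t MSR_EFER = 0xC0000080;
// Bit 0 of EFER (SCE, System Call Extensions) enables syscall/sysret.
constexpr std::uint64_t EFER_SCE = 1ull << 0;

// Hypothetical helper: returns the EFER value with syscall/sysret enabled,
// leaving every other bit (for example the long mode bits) untouched.
constexpr std::uint64_t efer_with_syscall(std::uint64_t efer)
{
    return efer | EFER_SCE;
}

// In a kernel this would be used roughly like this (wrappers are assumed):
//   wrmsr(MSR_EFER, efer_with_syscall(rdmsr(MSR_EFER)));
static_assert(efer_with_syscall(0x500) == 0x501, "only bit 0 may change");
```

Setting the bit with an OR rather than writing a constant matters because EFER also carries the long mode bits.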
Then you need to set up the GDT segments for syscall:
The STAR MSR must contain the segment selector used when the syscall is entered (ring 0) and the one used when the syscall returns (ring 3), and the GDT entries must also follow a precise order:
SELECTOR_1: must be kernel code
SELECTOR_1 + 8: must be kernel data
SELECTOR_2 + 8: must be user data
SELECTOR_2 + 16: must be user code
So in wingos I changed the order of the gdt to have:
That way I can have
SELECTOR_1 = kernel code
SELECTOR_2 = kernel data | 3
It may look weird, but it is one of the only solutions I found, unless I put an empty entry in between.
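To make the selector arithmetic concrete, here is a small sketch that builds the STAR value. It assumes a GDT ordered as null, kernel code, kernel data, user data, user code (my guess at a layout that satisfies the rules above, not necessarily the exact wingos table). The name `make_star` is hypothetical. The STAR MSR itself lives at address 0xC0000081.

```cpp
#include <cstdint>

// Assumed GDT byte offsets: null, kernel code, kernel data, user data, user code.
constexpr std::uint16_t KERNEL_CODE = 0x08;
constexpr std::uint16_t KERNEL_DATA = 0x10;
constexpr std::uint16_t USER_DATA   = 0x18;
constexpr std::uint16_t USER_CODE   = 0x20;
constexpr std::uint16_t RING3       = 3; // requested privilege level bits

constexpr std::uint16_t SELECTOR_1 = KERNEL_CODE;
constexpr std::uint16_t SELECTOR_2 = KERNEL_DATA | RING3;

// STAR layout: bits 32..47 hold the syscall selector base (SELECTOR_1),
// bits 48..63 hold the sysret selector base (SELECTOR_2).
constexpr std::uint64_t make_star(std::uint16_t syscall_sel, std::uint16_t sysret_sel)
{
    return (static_cast<std::uint64_t>(sysret_sel) << 48) |
           (static_cast<std::uint64_t>(syscall_sel) << 32);
}

// The ordering rules from the list above, checked at compile time.
static_assert(SELECTOR_1 + 8 == KERNEL_DATA, "kernel data must follow kernel code");
static_assert(SELECTOR_2 + 8 == (USER_DATA | RING3), "user data at SELECTOR_2 + 8");
static_assert(SELECTOR_2 + 16 == (USER_CODE | RING3), "user code at SELECTOR_2 + 16");

constexpr std::uint64_t STAR_VALUE = make_star(SELECTOR_1, SELECTOR_2);
```

With this layout `STAR_VALUE` is `0x0013000800000000`, and the `static_assert` lines fail at compile time if the GDT order ever changes.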
Then you have to load the address of the syscall handler in the
The syscall handler:
Before talking about the syscall handler I should tell you that in 64bit and with smp, there is a local structure for each cpu stored in the gs register (other kernels can use fs). This structure contains a temporary stack for the syscall, an address to store the process stack temporarily (and maybe other things...).
the local cpu structure stored in gs:
So at each syscall we change the stack temporarily to use the syscall_stack.
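As an illustration only, a per-CPU structure of this kind could look like the sketch below. The type name `cpu_local`, the field `saved_stack`, and the field order are my assumptions; only `syscall_stack` is named in the text. Since the assembly handler reaches the fields at fixed offsets from gs, it helps to pin the layout at compile time:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical per-CPU structure; names and order are assumptions.
struct cpu_local
{
    std::uint64_t syscall_stack; // top of the temporary kernel stack used during a syscall
    std::uint64_t saved_stack;   // where the process stack pointer is parked meanwhile
};

// Offsets the assembly handler would use, for example gs:[0] and gs:[8].
constexpr std::size_t SYSCALL_STACK_OFFSET = offsetof(cpu_local, syscall_stack);
constexpr std::size_t SAVED_STACK_OFFSET   = offsetof(cpu_local, saved_stack);

static_assert(SYSCALL_STACK_OFFSET == 0, "handler expects syscall_stack at gs:[0]");
static_assert(SAVED_STACK_OFFSET == 8, "handler expects saved_stack at gs:[8]");
```

If someone later adds a field at the front of the structure, the build breaks instead of the handler silently loading the wrong stack.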
But in 64-bit mode a user program can write to the gs register (with wrgsbase)! That can be really problematic... So we use the swapgs instruction, which swaps the user gs with the gs stored in the KERNEL_GS MSR, so we can 'secure' the use of the gs register. At the end of the syscall_handle we can call swapgs again to restore the previous value of gs.
Also, when entering the syscall_handle, the CPU puts the return RIP (the address of the instruction after syscall) in RCX and the previous value of RFLAGS in R11. The processor reads them back to restore RIP and RFLAGS when the syscall returns.
Here is my syscall handler:
We should not pop the rax register because we want to keep its value.
Then the syscall_higher_handler manages which syscall to call from the rax register (which stores the syscall id).
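A dispatcher of this kind can be sketched as a table lookup indexed by rax. Everything below apart from the name `syscall_higher_handler` is hypothetical: the frame layout, the use of rdi/rsi as arguments, the two placeholder syscalls, and the error value are assumptions for illustration, not the wingos implementation.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical view of the registers saved by the assembly handler.
struct syscall_frame
{
    std::uint64_t rax; // syscall id on entry, return value on exit
    std::uint64_t rdi; // first argument (assumed convention)
    std::uint64_t rsi; // second argument (assumed convention)
};

using syscall_fn = std::uint64_t (*)(std::uint64_t, std::uint64_t);

// Two placeholder syscalls, purely for illustration.
std::uint64_t sys_nop(std::uint64_t, std::uint64_t) { return 0; }
std::uint64_t sys_add(std::uint64_t a, std::uint64_t b) { return a + b; }

constexpr syscall_fn syscall_table[] = {sys_nop, sys_add};
constexpr std::size_t syscall_count = sizeof(syscall_table) / sizeof(syscall_table[0]);
constexpr std::uint64_t SYSCALL_INVALID = ~0ull; // assumed error value

// Picks the syscall from rax and stores the result back into the saved rax,
// which is why the assembly side must not overwrite rax when restoring registers.
void syscall_higher_handler(syscall_frame *frame)
{
    if (frame->rax >= syscall_count)
    {
        frame->rax = SYSCALL_INVALID; // unknown id: never index past the table
        return;
    }
    frame->rax = syscall_table[frame->rax](frame->rdi, frame->rsi);
}
```

The bounds check comes first because the id comes straight from userspace and cannot be trusted.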
How does userspace call the syscall?
It's like interrupt but we replace
int $127 with
We also need to change the asm code to push and pop the R11 and RCX registers, because the syscall instruction overwrites them (RCX with RIP and R11 with RFLAGS).
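For comparison, here is a minimal userspace sketch for Linux x86-64 rather than wingos, since that is where the getpid test above was run. Instead of pushing and popping RCX and R11 by hand, it lists them as clobbers so the compiler saves them when needed; that is a swapped-in technique, not the author's code. The function name `raw_getpid` is hypothetical, and 39 is the getpid number on Linux x86-64.

```cpp
#include <unistd.h>

// Linux x86-64 syscall number for getpid.
constexpr long SYS_GETPID_X86_64 = 39;

// Hypothetical wrapper: calls getpid through the raw syscall instruction.
long raw_getpid()
{
#if defined(__linux__) && defined(__x86_64__)
    long ret;
    asm volatile("syscall"
                 : "=a"(ret)                // the return value comes back in rax
                 : "a"(SYS_GETPID_X86_64)   // the syscall id goes in rax
                 : "rcx", "r11", "memory"); // the cpu overwrites rcx (RIP) and r11 (RFLAGS)
    return ret;
#else
    // Fallback so the sketch still builds on other platforms.
    return static_cast<long>(getpid());
#endif
}
```

Calling `raw_getpid()` should return the same value as the libc `getpid()`.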
Et voila! This was how syscall/sysret was implemented in wingos!