There are still memory areas below addr_limit.seg that user-mode processes are not allowed to access or to pass as a parameter to a system call. References to these areas are not checked in advance; instead, the segmentation faults caused by such unauthorized accesses are caught, and a correct error code is returned. But how can we make sure that a segmentation fault that occurred in kernel mode was caused by an invalid parameter and not by a kernel bug or something equally serious? This is achieved by forcing all accesses to the process address space to go through a very limited set of macros and assembly language instructions, the addresses of which are known. The files include/asm-i386/uaccess.h and arch/i386/lib/getuser.S contain a set of macros and functions for checking the pointers passed to system calls and for accessing process address space memory. The addresses of all assembly language instructions that have the right to access the process address space are placed in the system exception table defined in arch/i386/mm/extable.c. Along with every instruction address in the table there is the address of the code that should be executed if that instruction causes a segmentation fault. This associated code is called fixup code, and it usually simply returns from the system call with an error code in eax. Every kernel module has its own exception table as well, and such tables are loaded into memory at the same time as their parent modules.
When a segmentation fault occurs while running in kernel mode, the do_page_fault() handler function located in arch/i386/mm/fault.c tries to find the address of the instruction causing the fault in the exception tables[2]. If the address is found, the fixup code associated with the instruction is run. If the address of the instruction is not found in any exception table, there is a bug in the kernel and a kernel oops is performed.
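The lookup described above can be modelled in user space. The following is only a sketch: the names ex_entry and find_fixup are hypothetical, and the sorted table with a binary search is an assumption made for the example, not something taken from the kernel sources.

```c
#include <stddef.h>

/* Hypothetical user-space model of one exception table entry:
 * the address of an instruction that may touch user memory,
 * paired with the address of its fixup code. */
struct ex_entry {
    unsigned long insn;
    unsigned long fixup;
};

/* Binary search over a table sorted by insn (sorting is assumed here).
 * Returns the fixup address, or 0 if the faulting address is not
 * listed, which in the kernel would mean a bug and an oops. */
unsigned long find_fixup(const struct ex_entry *table, size_t n,
                         unsigned long fault_addr)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (table[mid].insn == fault_addr)
            return table[mid].fixup;
        if (table[mid].insn < fault_addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```

A fault handler built on this model would call find_fixup() with the faulting instruction pointer and either jump to the returned address or, on 0, treat the fault as a kernel bug.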
Direct Linux syscalls
Often you will be told that using the C library (libc) is the only way, and that direct system calls are bad. This is true. To some extent. In general, you should know that libc is not sacred: in most cases it only does some checks, then calls the kernel, and then sets errno. You can easily do this in your program as well (if you need to), and your program will be a dozen times smaller, which also improves performance, simply because you're not using shared libraries (static binaries are faster). Using or not using libc in assembly programming is more a question of taste/belief than something practical. Remember, Linux aims to be POSIX compliant, and so does libc. This means that the syntax of almost all libc "system calls" exactly matches the syntax of the real kernel system calls (and vice versa). Besides, GNU libc (glibc) becomes slower and slower from version to version, and eats more and more memory, so cases of using direct system calls become quite usual. But the main drawback of throwing libc away is that you will possibly need to implement several libc-specific functions (that are not just syscall wrappers) on your own (printf() and Co.)... and you are ready for that, aren't you? :)
Here is a summary of the pros and cons of direct system calls.
Pros:
the smallest possible size; squeezing the last byte out of the system
the highest possible speed; squeezing cycles out of your favorite benchmark
full control: you can adapt your program/library to your specific language or memory requirements or whatever
no pollution by libc cruft
no pollution by C calling conventions (if you're developing your own language or environment)
static binaries make you independent from libc upgrades or crashes, or from dangling #! path to an interpreter (and are faster)
just for the fun of it (don't you get a kick out of assembly programming?)
Cons:
If any other program on your computer uses the libc, then duplicating the libc code will actually waste memory, not save it.
Services redundantly implemented in many static binaries are a waste of memory. But you can make your libc replacement a shared library.
Size is much better saved by having some kind of bytecode, wordcode, or structure interpreter than by writing everything in assembly. (the interpreter itself could be written either in C or assembly.) The best way to keep multiple binaries small is to not have multiple binaries, but instead to have an interpreter process files with #! prefix. This is how OCaml works when used in wordcode mode (as opposed to optimized native code mode), and it is compatible with using the libc. This is also how Tom Christiansen's Perl PowerTools reimplementation of unix utilities works. Finally, one last way to keep things small, that doesn't depend on an external file with a hardcoded path, be it library or interpreter, is to have only one binary, and have multiply-named hard or soft links to it: the same binary will provide everything you need in an optimal space, with no redundancy of subroutines or useless binary headers; it will dispatch its specific behavior according to its argv[0] ; in case it isn't called with a recognized name, it might default to a shell, and be possibly thus also usable as an interpreter!
You cannot benefit from the many functionalities that libc provides besides mere linux syscalls: that is, functionality described in section 3 of the manual pages, as opposed to section 2, such as malloc, threads, locale, password, high-level network management, etc.
Therefore, you might have to reimplement large parts of libc, from printf() to malloc() and gethostbyname. It's redundant with the libc effort, and can be quite boring sometimes. Note that some people have already reimplemented "light" replacements for parts of the libc -- check them out! (Redhat's minilibc, Rick Hohensee's libsys, Felix von Leitner's dietlibc, Christian Fowelin's libASM; the asmutils project is working on a pure assembly libc)
Static libraries prevent you from benefiting from libc upgrades as well as from libc add-ons such as the zlibc package, which does on-the-fly transparent decompression of gzip-compressed files.
The few instructions added by the libc can be a ridiculously small speed overhead as compared to the cost of a system call. If speed is a concern, your main problem is in your usage of system calls, not in their wrapper's implementation.
Using the standard assembly API for system calls is much slower than using the libc API when running in micro-kernel versions of Linux such as L4Linux, that have their own faster calling convention, and pay high convention-translation overhead when using the standard one (L4Linux comes with libc recompiled with their syscall API; of course, you could recompile your code with their API, too).
See previous discussion for general speed optimization issue.
If syscalls are too slow for you, you might want to hack the kernel sources (in C) instead of staying in userland.
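The single-binary approach mentioned in the cons above, where one executable is reached through several links and picks its behavior from argv[0], might look roughly like this. It is a minimal sketch: the applet names and the functions base_name and dispatch are hypothetical, chosen only for illustration.

```c
#include <string.h>

/* Hypothetical applets, each standing in for a separate utility. */
static int applet_true(void)  { return 0; }
static int applet_false(void) { return 1; }

struct applet {
    const char *name;
    int (*run)(void);
};

static const struct applet applets[] = {
    { "true",  applet_true  },
    { "false", applet_false },
};

/* Strip any directory part, so "/bin/true" and "true" match alike. */
const char *base_name(const char *path)
{
    const char *slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}

/* Pick the behavior from the invoked name; -1 means the name was not
 * recognized, where a real multi-call binary might fall back to a shell. */
int dispatch(const char *argv0)
{
    const char *name = base_name(argv0);
    size_t i;

    for (i = 0; i < sizeof applets / sizeof applets[0]; i++)
        if (strcmp(name, applets[i].name) == 0)
            return applets[i].run();
    return -1;
}
```

A main() that simply returns dispatch(argv[0]) would then behave as "true" or "false" depending on the name of the hard or soft link used to start it.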
If you've pondered the above pros and cons, and still want to use direct syscalls, then here is some advice.
You can easily define your system calling functions in a portable way in C (as opposed to unportably in assembly), by including asm/unistd.h and using the provided macros.
Since you're trying to replace it, go get the sources for the libc, and grok them. (And if you think you can do better, then send feedback to the authors!)
As an example of pure assembly code that does everything you want, examine the Linux assembly resources.
Basically, you issue an int 0x80, with the __NR_syscallname number (from asm/unistd.h) in eax, and parameters (up to six) in ebx, ecx, edx, esi, edi, ebp respectively.
The result is returned in eax, with a negative result being an error, whose opposite is what libc would put into errno. The user stack is not touched, so you needn't have a valid one when doing a syscall.
Passing a sixth parameter in ebp appeared in Linux 2.4; previous Linux versions understand only 5 parameters in registers.
Linux Kernel Internals, and especially its How System Calls Are Implemented on i386 Architecture? chapter, will give you a more robust overview.
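The register convention and the negative-errno return value described above can be sketched in C with inline assembly. This is an illustration under assumptions, not a definitive wrapper: raw_syscall3 and to_libc_result are hypothetical names, the int 0x80 path is only compiled on i386, and on any other architecture the sketch falls back to libc's syscall() and converts its result back to the raw form. The -4095 bound is what current C libraries use to recognize an error value; it is not stated in the text above.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Issue a system call with up to three parameters and return the raw
 * kernel result: >= 0 on success, -errno on failure. */
long raw_syscall3(long nr, long a, long b, long c)
{
#if defined(__i386__)
    long ret;

    /* eax = syscall number, ebx/ecx/edx = first three parameters */
    __asm__ volatile ("int $0x80"
                      : "=a" (ret)
                      : "0" (nr), "b" (a), "c" (b), "d" (c)
                      : "memory");
    return ret;
#else
    /* Not i386: go through libc's syscall() and turn its
     * (-1, errno) result back into the raw negative-errno form. */
    long ret = syscall(nr, a, b, c);

    return ret == -1 ? -(long)errno : ret;
#endif
}

/* What a libc wrapper does with the raw value: set errno, return -1.
 * The range check (-4095..-1) is an assumption borrowed from libc. */
long to_libc_result(long raw)
{
    if (raw < 0 && raw >= -4095) {
        errno = (int)-raw;
        return -1;
    }
    return raw;
}
```

For example, raw_syscall3(SYS_close, -1, 0, 0) yields -EBADF, and passing that value through to_libc_result() gives the familiar -1 with errno set to EBADF.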
As for the invocation arguments passed to a process upon startup, the general principle is that the stack originally contains the number of arguments argc, then the list of pointers that constitute *argv, then a null-terminated sequence of null-terminated variable=value strings for the environment. For more details, do examine the Linux assembly resources, read the sources of the C startup code from your libc (crt0.S or crt1.S), or those from the Linux kernel (exec.c and binfmt_*.c in ).
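To make the startup layout concrete, here is a small sketch that walks such a stack image. The names startup_info and parse_initial_stack are hypothetical, and the sketch assumes that every slot is one pointer wide and that the argv pointers are themselves followed by a NULL before the environment begins.

```c
#include <stddef.h>
#include <stdint.h>

struct startup_info {
    long   argc;
    char **argv;
    char **envp;
    size_t envc;   /* number of environment strings found */
};

/* sp points at the slot holding argc. Assumed layout:
 *   argc, argv[0] ... argv[argc-1], NULL, envp[0] ..., NULL */
struct startup_info parse_initial_stack(char **sp)
{
    struct startup_info info;

    info.argc = (long)(intptr_t)sp[0];
    info.argv = sp + 1;
    info.envp = info.argv + info.argc + 1;   /* skip argv's NULL */
    info.envc = 0;
    while (info.envp[info.envc] != NULL)
        info.envc++;
    return info;
}
```

In a real program without libc, the entry point would pass the initial stack pointer to a routine like this before anything else pushes onto the stack.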
The source code here is an example of such a kernel module. We want to `spy' on a certain user, and to printk() a message whenever that user opens a file. Towards this end, we replace the system call to open a file with our own function, called our_sys_open. This function checks the uid (user's id) of the current process, and if it's equal to the uid we spy on, it calls printk() to display the name of the file to be opened. Then, either way, it calls the original open() function with the same parameters, to actually open the file.
The init_module function replaces the appropriate location in sys_call_table and keeps the original pointer in a variable. The cleanup_module function uses that variable to restore everything back to normal. This approach is dangerous, because of the possibility of two kernel modules changing the same system call. Imagine we have two kernel modules, A and B. A's open system call will be A_open and B's will be B_open. Now, when A is inserted into the kernel, the system call is replaced with A_open, which will call the original sys_open when it's done. Next, B is inserted into the kernel, which replaces the system call with B_open, which will call what it thinks is the original system call, A_open, when it's done.
Now, if B is removed first, everything will be well---it will simply restore the system call to A_open, which calls the original. However, if A is removed and then B is removed, the system will crash. A's removal will restore the system call to the original, sys_open, cutting B out of the loop. Then, when B is removed, it will restore the system call to what it thinks is the original, A_open, which is no longer in memory. At first glance, it appears we could solve this particular problem by checking if the system call is equal to our open function and if so not changing it at all (so that B won't change the system call when it's removed), but that will cause an even worse problem. When A is removed, it sees that the system call was changed to B_open so that it is no longer pointing to A_open, so it won't restore it to sys_open before it is removed from memory. Unfortunately, B_open will still try to call A_open which is no longer there, so that even without removing B the system would crash.
I can think of two ways to prevent this problem. The first is to restore the call to the original value, sys_open. Unfortunately, sys_open is not part of the kernel system table in /proc/ksyms, so we can't access it. The other solution is to use the reference count to prevent root from rmmod'ing the module once it is loaded. This is good for production modules, but bad for an educational sample --- which is why I didn't do it here.
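The removal-order problem with modules A and B can be reproduced harmlessly in user space. The sketch below is a simulation only: a single function pointer stands in for the sys_call_table slot, sys_open_model is a made-up stand-in for the real sys_open, and "unloading" a module merely means running its restore step.

```c
typedef int (*open_fn)(const char *);

/* One-slot stand-in for sys_call_table[__NR_open]. */
static open_fn table_open;

/* Made-up stand-in for the kernel's original sys_open. */
static int sys_open_model(const char *name) { (void)name; return 3; }

/* Each module remembers whatever was in the table when it was loaded. */
static open_fn a_saved, b_saved;

static int A_open(const char *name) { return a_saved(name); }
static int B_open(const char *name) { return b_saved(name); }

void load_A(void)   { a_saved = table_open; table_open = A_open; }
void unload_A(void) { table_open = a_saved; }
void load_B(void)   { b_saved = table_open; table_open = B_open; }
void unload_B(void) { table_open = b_saved; }

/* Remove A first, then B. Returns 1 if the table ends up pointing
 * at A_open even though A is supposedly gone. */
int dangling_after_A_then_B(void)
{
    table_open = sys_open_model;
    load_A();
    load_B();
    unload_A();   /* table points at sys_open_model, B is bypassed */
    unload_B();   /* B restores what it saved: A_open */
    return table_open == A_open;
}

/* Remove B first, then A. Returns 1 if the original is back in place. */
int safe_after_B_then_A(void)
{
    table_open = sys_open_model;
    load_A();
    load_B();
    unload_B();   /* table points at A_open again */
    unload_A();   /* A restores sys_open_model */
    return table_open == sys_open_model;
}
```

In the real kernel the first case leaves the table pointing into freed module memory, so the next open() call crashes; here both functions just return 1 to confirm the final state of the slot.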
/* syscall.c
 *
 * System call "stealing" sample.
 */

/* Copyright (C) 2001 by Peter Jay Salzman */

/* The necessary header files */

/* Standard in kernel modules */
#include <linux/kernel.h>   /* We're doing kernel work */
#include <linux/module.h>   /* Specifically, a module */

/* Deal with CONFIG_MODVERSIONS */
#if CONFIG_MODVERSIONS==1
#define MODVERSIONS
#include <linux/modversions.h>
#endif

#include <sys/syscall.h>    /* The list of system calls */

/* For the current (process) structure, we need
 * this to know who the current user is. */
#include <linux/sched.h>

/* In 2.2.3 /usr/include/linux/version.h includes a
 * macro for this, but 2.0.35 doesn't - so I add it
 * here if necessary. */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
#endif

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
#include <asm/uaccess.h>
#endif

/* The system call table (a table of functions). We
 * just define this as external, and the kernel will
 * fill it up for us when we are insmod'ed */
extern void *sys_call_table[];

/* UID we want to spy on - will be filled from the
 * command line */
int uid;

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
MODULE_PARM(uid, "i");
#endif

/* A pointer to the original system call. The reason
 * we keep this, rather than call the original function
 * (sys_open), is because somebody else might have
 * replaced the system call before us. Note that this
 * is not 100% safe, because if another module
 * replaced sys_open before us, then when we're inserted
 * we'll call the function in that module - and it
 * might be removed before we are.
 *
 * Another reason for this is that we can't get sys_open.
 * It's a static variable, so it is not exported. */
asmlinkage int (*original_call)(const char *, int, int);

/* For some reason, in 2.2.3 current->uid gave me
 * zero, not the real user ID. I tried to find what went
 * wrong, but I couldn't do it in a short time, and
 * I'm lazy - so I'll just use the system call to get the
 * uid, the way a process would.
 *
 * For some reason, after I recompiled the kernel this
 * problem went away. */
asmlinkage int (*getuid_call)();

/* The function we'll replace sys_open (the function
 * called when you call the open system call) with. To
 * find the exact prototype, with the number and type
 * of arguments, we find the original function first
 * (it's at fs/open.c).
 *
 * In theory, this means that we're tied to the
 * current version of the kernel. In practice, the
 * system calls almost never change (it would wreak havoc
 * and require programs to be recompiled, since the system
 * calls are the interface between the kernel and the
 * processes). */
asmlinkage int our_sys_open(const char *filename, int flags, int mode)
{
  int i = 0;
  char ch;

  /* Check if this is the user we're spying on */
  if (uid == getuid_call()) {
    /* getuid_call is the getuid system call,
     * which gives the uid of the user who
     * ran the process which called the system
     * call we got */

    /* Report the file, if relevant */
    printk("Opened file by %d: ", uid);
    do {
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
      get_user(ch, filename+i);
#else
      ch = get_user(filename+i);
#endif
      i++;
      printk("%c", ch);
    } while (ch != 0);
    printk("\n");
  }

  /* Call the original sys_open - otherwise, we lose
   * the ability to open files */
  return original_call(filename, flags, mode);
}

/* Initialize the module - replace the system call */
int init_module()
{
  /* Warning - too late for it now, but maybe for
   * next time... */
  printk("I'm dangerous. I hope you did a ");
  printk("sync before you insmod'ed me.\n");
  printk("My counterpart, cleanup_module(), is even ");
  printk("more dangerous. If\n");
  printk("you value your file system, it will ");
  printk("be \"sync; rmmod\" \n");
  printk("when you remove this module.\n");

  /* Keep a pointer to the original function in
   * original_call, and then replace the system call
   * in the system call table with our_sys_open */
  original_call = sys_call_table[__NR_open];
  sys_call_table[__NR_open] = our_sys_open;

  /* To get the address of the function for system
   * call foo, go to sys_call_table[__NR_foo]. */

  printk("Spying on UID:%d\n", uid);

  /* Get the system call for getuid */
  getuid_call = sys_call_table[__NR_getuid];

  return 0;
}

/* Cleanup - restore the original system call */
void cleanup_module()
{
  /* Return the system call back to normal */
  if (sys_call_table[__NR_open] != our_sys_open) {
    printk("Somebody else also played with the ");
    printk("open system call\n");
    printk("The system may be left in ");
    printk("an unstable state.\n");
  }

  sys_call_table[__NR_open] = original_call;
}
Sysenter and the vsyscall page
It has been observed that a 2 GHz Pentium 4 was much slower than an 850 MHz Pentium III on certain tasks, and that this slowness is caused by the very large overhead of the traditional int 0x80 interrupt on a Pentium 4. Some models of the i386 family do have faster ways to enter the kernel. On Pentium II there is the sysenter instruction. Also AMD has a syscall instruction. It would be good if these could be used.
Something else is that in some applications gettimeofday() is called very often, for example for timestamping all transactions. It would be nice if it could be implemented with very low overhead.
One way of obtaining a fast gettimeofday() is by writing the current time in a fixed place, on a page mapped into the memory of all applications, and updating this location on each clock interrupt. These applications could then read this fixed location with a single instruction - no system call required.
There might be other data that the kernel could make available in a read-only way to the process, like perhaps the current process ID. A vsyscall is a "system" call that avoids crossing the userspace-kernel boundary.
Linux is in the process of implementing such ideas. Since Linux 2.5.53 there is a fixed page, called the vsyscall page, filled by the kernel. At kernel initialization time the routine sysenter_setup() is called. It sets up a non-writable page and writes code for the sysenter instruction if the CPU supports that, and for the classical int 0x80 otherwise. Thus, the C library can use the fastest type of system call by jumping to a fixed address in the vsyscall page.
Concerning gettimeofday(), a vsyscall version for the x86-64 is already part of the vanilla kernel. Patches for i386 exist. (An example of the kind of timing differences: John Stultz reports on an experiment where he measures gettimeofday() and finds 1.67 us for the int 0x80 way, 1.24 us for the sysenter way, and 0.88 us for the vsyscall.)
Some details
The kernel maps a page (0xffffe000-0xffffefff) in the memory of every process. (This is the last but one addressable page. The last is not mapped - maybe to avoid bugs related to wraparound.) We can read it:
/* get vsyscall page */
#include <unistd.h>
#include <string.h>
int main() {
char *p = (char *) 0xffffe000;
char buf[4096];
#if 0
write(1, p, 4096);
/* this gives EFAULT */
#else
memcpy(buf, p, 4096);
write(1, buf, 4096);
#endif
return 0;
}
and if we do, we find an ELF binary.
% ./get_vsyscall_page > syspage
% file syspage
syspage: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), stripped
% objdump -h syspage
syspage: file format elf32-i386
Sections:
Idx Name Size VMA LMA File off Algn
0 .hash 00000050 ffffe094 ffffe094 00000094 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
1 .dynsym 000000f0 ffffe0e4 ffffe0e4 000000e4 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
2 .dynstr 00000056 ffffe1d4 ffffe1d4 000001d4 2**0
CONTENTS, ALLOC, LOAD, READONLY, DATA
3 .gnu.version 0000001e ffffe22a ffffe22a 0000022a 2**1
CONTENTS, ALLOC, LOAD, READONLY, DATA
4 .gnu.version_d 00000038 ffffe248 ffffe248 00000248 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
5 .text 00000047 ffffe400 ffffe400 00000400 2**5
CONTENTS, ALLOC, LOAD, READONLY, CODE
6 .eh_frame_hdr 00000024 ffffe448 ffffe448 00000448 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
7 .eh_frame 0000010c ffffe46c ffffe46c 0000046c 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
8 .dynamic 00000078 ffffe578 ffffe578 00000578 2**2
CONTENTS, ALLOC, LOAD, DATA
9 .useless 0000000c ffffe5f0 ffffe5f0 000005f0 2**2
CONTENTS, ALLOC, LOAD, DATA
% objdump -d syspage
syspage: file format elf32-i386
Disassembly of section .text:
ffffe400 <.text>:
ffffe400: 51 push %ecx
ffffe401: 52 push %edx
ffffe402: 55 push %ebp
ffffe403: 89 e5 mov %esp,%ebp
ffffe405: 0f 34 sysenter
ffffe407: 90 nop
ffffe408: 90 nop
... more nops ...
ffffe40d: 90 nop
ffffe40e: eb f3 jmp 0xffffe403
ffffe410: 5d pop %ebp
ffffe411: 5a pop %edx
ffffe412: 59 pop %ecx
ffffe413: c3 ret
... zero bytes ...
ffffe420: 58 pop %eax
ffffe421: b8 77 00 00 00 mov $0x77,%eax
ffffe426: cd 80 int $0x80
ffffe428: 90 nop
ffffe429: 90 nop
... more nops ...
ffffe43f: 90 nop
ffffe440: b8 ad 00 00 00 mov $0xad,%eax
ffffe445: cd 80 int $0x80
The interesting addresses here are found via
% grep ffffe System.map
ffffe000 A VSYSCALL_BASE
ffffe400 A __kernel_vsyscall
ffffe410 A SYSENTER_RETURN
ffffe420 A __kernel_sigreturn
ffffe440 A __kernel_rt_sigreturn
%
So __kernel_vsyscall pushes a few registers and does a sysenter instruction. And SYSENTER_RETURN pops the registers again and returns. And __kernel_sigreturn and __kernel_rt_sigreturn do system calls 119 and 173, that is, sigreturn and rt_sigreturn, respectively.
What about the jump just before SYSENTER_RETURN? It is a trick to handle restarting of system calls with 6 parameters. As Linus said: I'm a disgusting pig, and proud of it to boot.
The code involved is most easily seen from a slightly earlier patch.
A tiny demo program.
#include <stdio.h>
int pid;
int main() {
__asm__(
"movl $20, %eax \n"
"call 0xffffe400 \n"
"movl %eax, pid \n"
);
printf("pid is %d\n", pid);
return 0;
}
This does the getpid() system call (__NR_getpid is 20) using call 0xffffe400 instead of int 0x80.
However, the proper thing to do is not call 0xffffe400 but call *%gs:0x18. If %gs has been set up so that it addresses 0xffffe000, then at location 0xffffe018 we find the value of __kernel_vsyscall, the entry point of the kernel vsyscalls. Such general setup requires the parsing of the ELF headers of this vsyscall page, but then is future-proof.
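As a first step towards locating the kernel-provided ELF image without a hardcoded address, one can ask the C library for its base and check the ELF magic before parsing any headers. This is a sketch under assumptions: it relies on getauxval() and the AT_SYSINFO_EHDR auxiliary vector entry, which are offered by later C libraries and kernels than the ones discussed here, and the function names are hypothetical.

```c
#include <elf.h>
#include <stdint.h>
#include <string.h>
#include <sys/auxv.h>

/* Returns 1 if p starts with the four-byte ELF magic "\177ELF". */
int looks_like_elf(const unsigned char *p)
{
    return memcmp(p, ELFMAG, SELFMAG) == 0;
}

/* Base address of the kernel-provided ELF image, or NULL if the
 * kernel did not pass AT_SYSINFO_EHDR in the auxiliary vector.
 * (Assumption: getauxval() is available in the C library in use.) */
const unsigned char *kernel_elf_base(void)
{
    return (const unsigned char *)(uintptr_t)getauxval(AT_SYSINFO_EHDR);
}
```

If kernel_elf_base() returns a non-NULL pointer that passes looks_like_elf(), the dynamic symbol table of that image can then be searched for the entry points, instead of assuming that they live at 0xffffe400.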