read() SysCall Internals
1 An Example: read(fd, buf, count)
- Shown below is the (conceptual version of the) Linux code for the
syscall
read(fd, buf, count)
511 SYSCALL_read(unsigned int fd, char __user * buf, size_t count) 512 { 513 struct fd f = fdget_pos(fd); 514 ssize_t ret = -EBADF; 515 516 if (f.file) { 517 loff_t pos = file_pos_read(f.file); 518 ret = vfs_read(f.file, buf, count, &pos); 519 if (ret >= 0) 520 file_pos_write(f.file, pos); 521 fdput_pos(f); 522 } 523 return ret; 524 }
The return value is stored in ret
pessimistically initialized to
-EBADF
. [Read man errno
.] If the file descriptor fd is good,
get the number of bytes read so far, and ask vfs_read
to get from
that position on count
bytes, deposit them in the buffer buf
, and
set ret
to the return value given by vfs_read
. Note the use of
&
in &pos
; the value of pos
may/will get updated. The
fdput_pos(f)
updates the number of bytes read so far. Return the
value of ret
.
- The bulk of the work is done by
vfs_read()
, which is a kernel internal ordinary routine, not a system call.
In Linux, the read system call can influence the write system call,
and vice versa. In the above file_pos_write(f.file, pos)
updates
the position where the write syscall would write in this f.file
.
Note that we could not check the size of the buffer buf
. If this
user supplied buffer is shorter than count
bytes, user process will
crash with a segmentation error.
1.1 C Language Macros
The Linux kernel source defines a macro called SYSCALL_DEFINEx
in a
file named syscalls.h
located within linux/
. There are
SYSCALL_DEFINEx
for x = 0 to 6; the x is the number of arguments
that the call takes.
E.g., the standard read
system call is defined as follows:
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count) { ... code elided ... }
Note that type decalarations precede the name of a variable. The
syscall read
takes three arguments, namely: fd, the file descriptor,
buf, an array of bytes where the bytes read from the file will be
placed, and count, the number of bytes to read from the file.
(Run find / -name syscalls.h
to find the full pathname(s) of this
file. For a fster result, replace the /
with the pathname of the
dir where you untarred the Linux src code tree.)
1.2 SYSCALL_DEFINE3(read, ...)
In the actual Linux source code, line 511 above is replaced with the following. Why? Because of the need to collect syscall info into a kernel-wide table, etc.
511 SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf,size_t,count)
The SYSCALL_DEFINE3
is a C-preprocessor (cpp) #define
macro. It
expands to the following plain C function definitions below.
The macro expands this definition into several more lines that help
with the kernel build.
The macro SYSCALL_DEFINE3
cooked up three function names: sys_read,
SYSC_read, SyS_read
(note the lc/UC details)
The first asmlinkage long sys_read(...);
is a forward declaration,
and the second one is the definition of SyS_read
, which uses the
SYSC_read
static (i.e., visible only within this file) function.
What is asmlinkage
? See below.
asmlinkage long sys_read(unsigned int fd, char __user * buf, size_t count) __attribute__((alias(__stringify(SyS_read)))); static inline long SYSC_read(unsigned int fd, char __user * buf, size_t count); asmlinkage long SyS_read(long int fd, long int buf, long int count); asmlinkage long SyS_read(long int fd, long int buf, long int count) { long ret = SYSC_read((unsigned int) fd, (char __user *) buf, (size_t) count); asmlinkage_protect(3, ret, fd, buf, count); return ret; } static inline long SYSC_read(unsigned int fd, char __user * buf, size_t count) { struct fd f = fdget_pos(fd); ssize_t ret = -EBADF; ... code elided ... }
1.3 asmlinkage
#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))
asmlinkage
informs the compiler that function arguments are on the CPU stack, instead of registers.- A system call does a table lookup using its first argument, the system call number, and up to four more remaining arguments are passed along to the actual system call procedure.
- system call achieves this feat simply by leaving its other arguments (which were passed to it in registers) on the stack.