diff --git a/course/07-System_calls.md b/course/07-System_calls.md
new file mode 100644
index 0000000000000000000000000000000000000000..d0a682424b23ba19703d8c6dcf979609a8294108
--- /dev/null
+++ b/course/07-System_calls.md
@@ -0,0 +1,326 @@
+---
+author: Florent Gluck - Florent.Gluck@hesge.ch
+
+title: System Calls
+
+date: \vspace{.5cm} \footnotesize \today
+
+pandoc-latex-fontsize:
+  - classes: [tiny]
+    size: tiny
+  - classes: [verysmall]
+    size: scriptsize
+  - classes: [small]
+    size: footnotesize
+  - classes: [huge, important]
+    size: huge
+---
+
+[//]: # ----------------------------------------------------------------
+##
+
+\centering
+{ width=100% }
+
+[//]: # ----------------------------------------------------------------
+# Reminder: why a kernel?
+
+[//]: # ----------------------------------------------------------------
+## Kernel's purpose
+
+- **Purpose** of the kernel:
+  - to **provide services** to tasks (read a file, access devices, e.g. screen, keyboard, etc.)
+  - to **multiplex hardware** resources (RAM, CPU, devices) among tasks (processes)
+  - to **isolate** tasks from each other and from the kernel (protection)
+- \textcolor{myred}{A kernel bug can \textbf{crash} the whole system!}
+- However: a buggy application **should not** be able to crash the system or impact other applications!
+
+[//]: # ----------------------------------------------------------------
+## Kernel mode vs user mode
+
+- \textcolor{myred}{\textbf{Kernel mode}} is **privileged**
+  - kernel can do anything, without any restrictions
+
+\vspace{.5cm}
+
+- \textcolor{mygreen}{\textbf{User mode}} is **restricted**
+  - user applications are **unprivileged**
+  - what an application can or cannot do is **controlled by the kernel**
+
+[//]: # ----------------------------------------------------------------
+## Reminder: Protection mechanism
+
+- IA-32 supports 4 privilege levels, called **rings**
+- The most privileged level is \textcolor{myred}{\textbf{ring 0}}
+- Usually, the kernel runs in \textcolor{myred}{\textbf{ring 0}} while user code (applications) runs in \textcolor{mygreen}{\textbf{ring 3}}
+
+\vspace{.2cm}
+
+\centering
+{ width=60% }
+
+[//]: # ----------------------------------------------------------------
+## CPU privileged mode
+
+\textcolor{myred}{\textbf{Ring 0}} is the most privileged level
+
+- Bootloader and kernel run in \textcolor{myred}{ring 0}
+
+- \textcolor{myred}{Ring 0} is privileged, because in \textcolor{myred}{ring 0} we can:
+  - access the full instruction set and registers
+  - access the whole CPU address space
+  - program the MMU (paging) and interrupt vector table (IDT)
+  - restrict a task's PMIO address space (`in/out` instructions)
+
+[//]: # ----------------------------------------------------------------
+## CPU restricted (unprivileged) mode
+
+\small
+
+\textcolor{mygreen}{\textbf{Ring 3}} is the least privileged, the one in which applications run
+
+- In \textcolor{mygreen}{ring 3}, we **\textcolor{myred}{CANNOT}**:
+  - use the whole set of CPU instructions
+  - read/write outside the task's address space (i.e. non-mapped pages)
+  - access the task's page dir/table (and read/write the `cr3` register)
+  - access and load the IDT (and execute the `lidt` instruction)
+  - access restricted areas of the PMIO address space
+  - mask/unmask hardware interrupts (`cli`/`sti` instructions)
+
+- If a **task does any of the above**, the **CPU raises an exception**
+  - must be caught by kernel $\rightarrow$ typically kills the offending task!
+
+[//]: # ----------------------------------------------------------------
+## Services
+
+::: incremental
+
+- Given that a task in \textcolor{mygreen}{ring 3 (user mode)} is limited/non-privileged, how can it access system resources, such as screen, disk, keyboard, etc.?
+- Through **system calls**
+  - kernel exposes a limited number of functions callable from \textcolor{mygreen}{ring 3 (user mode)}
+  - these functions are called "system calls" (syscalls)
+  - during the execution of a system call, a **change of privilege** occurs
+
+:::
+
+[//]: # ----------------------------------------------------------------
+## Change of privilege during syscall
+
+\centering
+{ width=100% }
+
+[//]: # ----------------------------------------------------------------
+## Why system calls?
+
+- System calls (syscalls) are the kernel API
+  - the set of functions the kernel exposes to user applications
+  - syscalls are services to applications!
+
+- \textcolor{myred}{Without syscalls, user tasks would be able to do almost nothing}
+  - can't access devices (I/O)
+  - can't extend their address space (e.g. `malloc()`)
+  - etc.
+
+- Consequently, user applications **require syscalls** to do anything meaningful!
+
+[//]: # ----------------------------------------------------------------
+## Examples of system calls
+
+\textcolor{mygreen}{Syscalls are \textbf{required} for anything related to}:
+
+- I/O (device) access
+- dynamic memory allocation (`malloc`)
+- accessing files
+- executing/terminating a process
+- inter-process communication (IPC)
+- timers
+- etc.
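
To make the list above more concrete, here is a minimal sketch of two syscalls invoked directly by number, bypassing the usual library wrappers. It assumes a Linux system with glibc, which provides the generic `syscall()` function and the `SYS_*` constants in `<sys/syscall.h>`. The helper names `raw_getpid` and `raw_write` are hypothetical and exist only for this illustration.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/syscall.h>   /* SYS_getpid, SYS_write (Linux-specific numbers) */
#include <sys/types.h>
#include <unistd.h>        /* syscall(), getpid() */

/* Hypothetical helper: ask the kernel for our PID by syscall number. */
long raw_getpid(void) {
    return syscall(SYS_getpid);
}

/* Hypothetical helper: write a NUL-terminated string to a file descriptor.
   Returns the number of bytes written, or -1 on error (errno is set). */
long raw_write(int fd, const char *msg) {
    return syscall(SYS_write, fd, msg, strlen(msg));
}
```

Calling `raw_getpid()` should return the same value as glibc's `getpid()`, and `raw_write(1, "hello\n")` should print the string on standard output and return the number of bytes written. In both cases the task leaves user mode, the kernel does the work, and control returns to the task.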
+
+[//]: # ----------------------------------------------------------------
+## When system calls are not needed
+
+\textcolor{myred}{However, syscalls are \textbf{not needed} for}:
+
+- changing the content of a string
+- calling a function
+- copying memory within the task's address space
+- etc.
+
+[//]: # ----------------------------------------------------------------
+## Linux system calls
+
+\small
+
+Example of Linux system calls:
+
+- `open, read, write, close, fork, mmap, getpid`, etc.
+
+List of Linux system calls: `man syscalls`
+
+```{.tiny}
+System call                Kernel        Notes
+--------------------------------------------------------------------
+...
+accept(2)                  2.0           See notes on socketcall(2)
+accept4(2)                 2.6.28
+...
+bind(2)                    2.0           See notes on socketcall(2)
+bpf(2)                     3.18
+...
+capset(2)                  2.2
+chdir(2)                   1.0
+chmod(2)                   1.0
+...
+clone2(2)                  2.4           IA-64 only
+clone(2)                   1.0
+clone3(2)                  5.3
+...
+```
+
+[//]: # ----------------------------------------------------------------
+## System calls and privilege levels
+
+- As stated before, an unprivileged task (running in \textcolor{mygreen}{ring 3}) is extremely limited
+- Such a task requires functions exposed by the kernel to do anything meaningful
+  - these functions are called **system calls**
+- System calls are functions **implemented in the kernel** (executing in \textcolor{myred}{ring 0})...
+- But they **can be called by unprivileged tasks** (executing in \textcolor{mygreen}{ring 3})!
+
+[//]: # ----------------------------------------------------------------
+## System calls: how?
+
+- How to allow some code executing at a low privilege level (\textcolor{mygreen}{ring 3}) to call code at a higher privilege level (\textcolor{myred}{ring 0})?
+- We program a special **software interrupt** that's allowed to be called from a lower privilege level
+  - achieved by creating a specifically built interrupt descriptor callable from \textcolor{mygreen}{ring 3}
+- A more efficient way, but less portable, is to use dedicated CPU instructions:
+  - `sysenter/sysexit` (introduced by Intel)
+  - `syscall/sysret` (introduced by AMD)
+
+[//]: # ----------------------------------------------------------------
+## System call implementation example
+
+- Implementation example using a software interrupt
+
+- Using the code provided in lab3, `idt.c` is modified by adding a new interrupt handler for software interrupts:
+  ```{.verysmall .c}
+  // IDT entry 123: system call
+  idt[123] = idt_build_entry(GDT_KERNEL_CODE_SELECTOR,
+                             (uint32_t)&_syscall_handler,
+                             TYPE_TRAP_GATE,
+                             DPL_USER);
+  ```
+
+- Then, from some unprivileged task code, a syscall is executed by triggering software interrupt 123 with (in assembly):
+  ```{.small .c}
+  int 123
+  ```
+
+[//]: # ----------------------------------------------------------------
+## Many system calls
+
+- Typically, a kernel implements many syscalls (Linux has > 350)
+  - not enough software interrupts for all possible syscalls (the IDT has only 256 entries)
+- Moreover: syscalls usually have arguments
+
+- How to solve these two problems?
+
+[//]: # ----------------------------------------------------------------
+## How to handle many system calls?
+
+Basic idea:
+
+- All syscalls use the **same** software interrupt
+- The syscall number is just an extra argument passed to the software interrupt
+
+[//]: # ----------------------------------------------------------------
+## How to handle system call arguments?
+
+
+How to pass multiple arguments to the software interrupt?
+
+- Three solutions:
+  (1) use CPU registers
+  (2) use the stack
+  (3) use a dedicated area of memory shared between the kernel and the task
+
+Here, we will use (1)
+
+[//]: # ----------------------------------------------------------------
+## System calls dispatch table
+
+- How to handle the different syscalls given that a **single** software interrupt is triggered?
+
+- Solution: by using a syscall **dispatch table**:
+  - a table of functions, one per syscall, is implemented in the kernel (array of pointers to functions)
+  - the syscall number is an index into this table
+
+[//]: # ----------------------------------------------------------------
+## System calls: overview
+
+\centering
+{ width=100% }
+
+[//]: # ----------------------------------------------------------------
+## Full workflow of a syscall
+
+\centering
+{ width=100% }
+
+[//]: # ----------------------------------------------------------------
+## System library
+
+\small
+
+- System calls resemble function calls
+- However:
+  - syscalls are not very readable
+  - programmer must remember every syscall number
+  - usually too low-level
+- Solution?
+  - introduce a system library that abstracts system calls and provides a programmer-friendly API
+  - under GNU/Linux, the system library is glibc
+  - the system library is often a mixture of system calls and pure user space code
+  - the system linker typically links every application to the system library
+
+[//]: # ----------------------------------------------------------------
+## System calls overhead
+
+\small
+
+::: incremental
+
+Compared to function calls, system calls are **very expensive**
+
+- Why?
+- Because each system call implies:
+  1. the caller context must be saved
+  1. change of privilege (security checks)
+  1. kernel code execution
+  1. change of privilege
+  1. caller context must be restored
+
+:::
+
+[//]: # ----------------------------------------------------------------
+## System calls performance consideration
+
+::: incremental
+
+Given that system calls are expensive:
+
+- Applications should minimize the number of system calls they make
+- Kernels should avoid exposing more system calls than really necessary
+
+\vspace{.5cm}
+
+- Quiz: difference between `read` and `fread`?
+
+:::
+
+[//]: # ----------------------------------------------------------------
+## Resources
+
+\small
+
+- Operating Systems: Three Easy Pieces, Remzi H. and Andrea C. Arpaci-Dusseau. Arpaci-Dusseau Books\
+\footnotesize [\textcolor{myblue}{http://pages.cs.wisc.edu/~remzi/OSTEP/}](http://pages.cs.wisc.edu/~remzi/OSTEP/)
diff --git a/course/07-System_calls.pdf b/course/07-System_calls.pdf
new file mode 100644
index 0000000000000000000000000000000000000000..2ebde2fcc5d71b739d4017a556ceb33923fb2165
Binary files /dev/null and b/course/07-System_calls.pdf differ
diff --git a/course/images/syscall_full_workflow.odg b/course/images/syscall_full_workflow.odg
index dfb09c005ef600869b10c9a04e113e5155dd6509..f692efb3685d8f6ad880f5821e2d63448ad391ec 100644
Binary files a/course/images/syscall_full_workflow.odg and b/course/images/syscall_full_workflow.odg differ
diff --git a/course/images/syscall_full_workflow.png b/course/images/syscall_full_workflow.png
index 4ce6f70fe251b769d25a480166fe9db303f2e808..65a783fe1dccd772eb3b5bd4d4c5e5c83f3ce141 100644
Binary files a/course/images/syscall_full_workflow.png and b/course/images/syscall_full_workflow.png differ
diff --git a/course/images/syscalls_dispatch_table.odg b/course/images/syscalls_dispatch_table.odg
index d234cab0aad41ba6c06d51208c2f4c8efff1379e..0557daaf80934e71afb813666ef131f5771cc994 100644
Binary files a/course/images/syscalls_dispatch_table.odg and b/course/images/syscalls_dispatch_table.odg differ
diff --git a/course/images/syscalls_dispatch_table.png b/course/images/syscalls_dispatch_table.png
index a16652636a3370741ab10c4fe5556078e0f43597..eaa755c44005ac5d0e4ac4936b4d629478262a6e 100644
Binary files a/course/images/syscalls_dispatch_table.png and b/course/images/syscalls_dispatch_table.png differ