OOM Killer: Creating our own CGroups

15 Jun 2024

Last time, when triggering the OOM Killer, we were mostly spectators: we looked at how Docker creates cgroups, how memory limits show up on the cgroup when we limit a container’s memory, and what it looks like when the OOM Killer gets involved and kills our process when we run a command that exceeds those limits.

Today I want to go a bit deeper and expand our understanding. We learn by doing, not by watching. So, instead of watching Docker create a cgroup, we’re going to create our own.

First, let’s look at our cgroup tree in systemd:

# Terminal 1
dan:~ % systemctl status
...
           └─user.slice
             └─user-1000.slice
               ├─session-1.scope
               │ ├─ 697 "sshd: dan [priv]"
               │ ├─ 711 "sshd: dan@pts/0"
               │ ├─ 712 -zsh
               │ ├─3533 systemctl status
               │ └─3534 pager
               ├─session-3.scope
               │ ├─1495 "sshd: dan [priv]"
               │ ├─1501 "sshd: dan@pts/1"
               │ └─1502 -zsh
               └─user@1000.service
                 └─init.scope
                   ├─700 /lib/systemd/systemd --user
                   └─701 "(sd-pam)"

Let’s create our own cgroup called dans-slice inside user-1000.slice:

# Terminal 1
dan:~ % sudo mkdir /sys/fs/cgroup/user.slice/user-1000.slice/dans-slice
dan:~ % ls -p /sys/fs/cgroup/user.slice/user-1000.slice/dans-slice
cgroup.controllers      cpu.pressure         memory.pressure
cgroup.events           cpu.stat             memory.reclaim
cgroup.freeze           cpu.weight           memory.stat
cgroup.kill             cpu.weight.nice      memory.swap.current
cgroup.max.depth        io.pressure          memory.swap.events
cgroup.max.descendants  memory.current       memory.swap.high
cgroup.pressure         memory.events        memory.swap.max
cgroup.procs            memory.events.local  memory.zswap.current
cgroup.stat             memory.high          memory.zswap.max
cgroup.subtree_control  memory.low           pids.current
cgroup.threads          memory.max           pids.events
cgroup.type             memory.min           pids.max
cpu.idle                memory.numa_stat     pids.peak
cpu.max                 memory.oom.group
cpu.max.burst           memory.peak

If we check systemctl status now, we’ll see that dans-slice doesn’t appear in the cgroup tree: systemd hides cgroups that don’t contain any processes.

Let’s add a process to it.

# Terminal 2
dan:~ % sleep 1000
# Terminal 1
dan:~ % pgrep sleep
3635
dan:~ % echo 3635 | sudo tee /sys/fs/cgroup/user.slice/user-1000.slice/dans-slice/cgroup.procs
3635
dan:~ % systemctl status
...
           └─user.slice
             └─user-1000.slice
               ├─dans-slice         ## <---- our cgroup
               │ └─3635 sleep 1000  ## <---- our sleep process
               ├─session-1.scope
               │ ├─ 697 "sshd: dan [priv]"
               │ ├─ 711 "sshd: dan@pts/0"
               │ ├─ 712 -zsh
               │ ├─3655 systemctl status
               │ └─3656 pager
               ├─session-3.scope
               │ ├─1495 "sshd: dan [priv]"
               │ ├─1501 "sshd: dan@pts/1"
               │ └─1502 -zsh
               └─user@1000.service
                 └─init.scope
                   ├─700 /lib/systemd/systemd --user
                   └─701 "(sd-pam)"

sleep is unlikely to run out of memory any time soon, so let’s Ctrl-C it and run something a bit different.

Let’s create a shell and put it in dans-slice; then we’ll be able to see that subprocesses are created within the same cgroup. This will save us from continually needing to echo pids into cgroup.procs.

# Terminal 2
dan:~ % bash
dan@debian:~$
# Terminal 1
dan:~ % pgrep bash | sudo tee /sys/fs/cgroup/user.slice/user-1000.slice/dans-slice/cgroup.procs
3713
dan:~ % systemctl status | fgrep -A 100 user.slice | head
           └─user.slice
             └─user-1000.slice
               ├─dans-slice
               │ └─3713 bash
               ├─session-1.scope
               │ ├─ 697 "sshd: dan [priv]"
               │ ├─ 711 "sshd: dan@pts/0"
               │ ├─ 712 -zsh
               │ ├─3781 systemctl status
               │ ├─3782 grep -F -A 100 user.slice
# Terminal 2
dan@debian:~$ sleep 10
# Terminal 1
dan:~ % systemctl status | fgrep -A 100 user.slice | head
           └─user.slice
             └─user-1000.slice
               ├─dans-slice
               │ ├─3713 bash
               │ └─3787 sleep 10  ## <-- subprocess appears alongside its parent
               ├─session-1.scope
               │ ├─ 697 "sshd: dan [priv]"
               │ ├─ 711 "sshd: dan@pts/0"
               │ ├─ 712 -zsh
               │ ├─3788 systemctl status
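
Under cgroup v2 this inheritance is automatic: fork() creates the child in its parent’s cgroup, and any process can check where it lives by reading /proc/self/cgroup. Here’s a minimal C sketch that demonstrates it; run it from the bash inside dans-slice and both processes should report our cgroup’s path:

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Print the caller's cgroup as the kernel reports it. */
static void print_cgroup(const char* who)
{
    FILE* f = fopen("/proc/self/cgroup", "r");
    if (f == NULL) {
        perror("fopen");
        return;
    }

    char line[256];
    while (fgets(line, sizeof line, f) != NULL) {
        printf("%s: %s", who, line);
    }
    fclose(f);
}

int main()
{
    print_cgroup("parent");

    pid_t pid = fork();
    if (pid == 0) {
        /* The child starts life in its parent's cgroup;
           no write to cgroup.procs is needed. */
        print_cgroup("child");
        return 0;
    }
    waitpid(pid, NULL, 0);
}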

Since we’re interested in the OOM Killer, let’s set ourselves some limits: this time, 20 MiB, and no swapping allowed. I leave it as an exercise for the reader to investigate how memory.swap.* and memory.zswap.* behave.

# Terminal 1
dan:~ % cat /sys/fs/cgroup/user.slice/user-1000.slice/dans-slice/memory.current
73728
dan:~ % echo $((20 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/user.slice/user-1000.slice/dans-slice/memory.max
20971520
dan:~ % echo 0 | sudo tee /sys/fs/cgroup/user.slice/user-1000.slice/dans-slice/memory.swap.max
0
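
As an aside, everything we’ve been doing with echo and tee is just file I/O on the cgroup filesystem, so nothing stops us from doing it from a program instead. Here’s a minimal sketch of setting the same 20 MiB limit from C (same path as above, and it has to run as root):

#include <stdio.h>

int main()
{
    /* Writing a byte count to memory.max sets the hard limit,
       exactly as echo | tee did above. Run as root. */
    const char* path =
        "/sys/fs/cgroup/user.slice/user-1000.slice/dans-slice/memory.max";
    FILE* f = fopen(path, "w");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }

    fprintf(f, "%d\n", 20 * 1024 * 1024);
    fclose(f);
}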

Now let’s try to exceed these memory limits. First, let’s write a program that allocates 50 MiB.

# Terminal 1
dan:~ % mkdir -p src/bigleak && cd src/bigleak
dan:~/src/bigleak % edit bigleak.c
...
dan:~/src/bigleak % cat bigleak.c
#include <stdlib.h>
#include <unistd.h>

int main()
{
    /* Allocate 50 MiB, then just keep the process alive. */
    void* x = malloc(50 * 1024 * 1024);

    for (;;) {
        sleep(1);
    }

    free(x); /* Never reached. */
}
dan:~/src/bigleak % make bigleak
cc     bigleak.c   -o bigleak
# Terminal 2
dan@debian:~/src/bigleak$ ./bigleak

Why don’t we get “Killed”? Let’s check how much memory it’s using:

# Terminal 1
dan:~/src/bigleak % cat /sys/fs/cgroup/user.slice/user-1000.slice/dans-slice/memory.current
380928

Only 372 KiB …

Perhaps it’s because we haven’t actually used the memory: malloc hands us virtual address space, but the kernel doesn’t back it with physical pages (or charge them to our cgroup) until we touch it. Let’s write a new program that writes to the memory, MiB by MiB:

# Terminal 1
dan:~/src/bigleak % edit bigleak2.c
...
dan:~/src/bigleak % cat bigleak2.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    enum { N = 50 * 1024 * 1024 };
    char* x = malloc(N);

    /* Touch the buffer one MiB at a time, forcing the kernel to back
       each chunk with physical memory charged to our cgroup. */
    for (int i = 0; i < N; i += (1024 * 1024)) {
        printf("i = %d\n", i);
        memset(x + i, 0, 1024 * 1024);
    }

    free(x);
}
dan:~/src/bigleak % make bigleak2
cc     bigleak2.c   -o bigleak2
# Terminal 2
^C
dan@debian:~/src/bigleak$ ./bigleak2
i = 0
i = 1048576
i = 2097152
i = 3145728
i = 4194304
i = 5242880
i = 6291456
i = 7340032
i = 8388608
i = 9437184
i = 10485760
i = 11534336
i = 12582912
i = 13631488
i = 14680064
i = 15728640
i = 16777216
i = 17825792
i = 18874368
i = 19922944
Killed
dan@debian:~/src/bigleak$
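
That terse “Killed” is bash telling us the process was terminated by a signal, specifically SIGKILL, which is what the OOM Killer sends. If you want to see the signal number for yourself, here’s a minimal wrapper sketch (the name runsig.c is my own) that runs a command and reports how it ended; running ./runsig ./bigleak2 from the bash inside dans-slice should report signal 9:

#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char** argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        execvp(argv[1], argv + 1);
        perror("execvp");
        _exit(127);
    }

    int status;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status)) {
        /* An OOM kill shows up here as signal 9, SIGKILL. */
        printf("killed by signal %d (%s)\n",
               WTERMSIG(status), strsignal(WTERMSIG(status)));
    } else {
        printf("exited with status %d\n", WEXITSTATUS(status));
    }
}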

We can again see the details in the kernel logs with:

sudo journalctl --system --dmesg -xe --utc

Or, since I want to keep the output narrow enough to fit the blog, let’s take it straight from dmesg:

# Terminal 1
dan:~/src/bigleak % sudo dmesg | fgrep -A 1000 bigleak2 | cut -c 16-
bigleak2 invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
CPU: 1 PID: 4045 Comm: bigleak2 Not tainted 6.1.0-21-arm64 #1  Debian 6.1.90-1
Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
Call trace:
 dump_backtrace+0xe4/0x140
 show_stack+0x20/0x30
 dump_stack_lvl+0x64/0x80
 dump_stack+0x18/0x34
 dump_header+0x4c/0x200
 oom_kill_process+0x2ec/0x2f0
 out_of_memory+0xec/0x590
 mem_cgroup_out_of_memory+0x134/0x14c
 try_charge_memcg+0x584/0x66c
 charge_memcg+0x54/0xc0
 __mem_cgroup_charge+0x40/0x8c
 __handle_mm_fault+0x620/0x1100
 handle_mm_fault+0xe4/0x260
 do_page_fault+0x174/0x3c0
 do_translation_fault+0x54/0x70
 do_mem_abort+0x4c/0xa0
 el0_da+0x48/0xf0
 el0t_64_sync_handler+0xac/0x120
 el0t_64_sync+0x18c/0x190
memory: usage 20480kB, limit 20480kB, failcnt 148
swap: usage 0kB, limit 0kB, failcnt 0
Memory cgroup stats for /user.slice/user-1000.slice/dans-slice:
anon 20705280
file 0
kernel 188416
kernel_stack 16384
pagetables 81920
sec_pagetables 0
percpu 0
sock 0
vmalloc 0
shmem 0
zswap 0
zswapped 0
file_mapped 0
file_dirty 0
file_writeback 0
swapcached 0
anon_thp 18874368
file_thp 0
shmem_thp 0
inactive_anon 20652032
active_anon 4096
inactive_file 0
active_file 0
unevictable 0
slab_reclaimable 0
slab_unreclaimable 18752
slab 18752
workingset_refault_anon 0
workingset_refault_file 0
workingset_activate_anon 0
workingset_activate_file 0
workingset_restore_anon 0
workingset_restore_file 0
workingset_nodereclaim 0
pgscan 0
pgsteal 0
pgscan_kswapd 0
pgscan_direct 0
pgsteal_kswapd 0
pgsteal_direct 0
pgfault 2461
pgmajfault 0
pgrefill 0
pgactivate 0
pgdeactivate 0
pglazyfree 0
pglazyfreed 0
zswpin 0
zswpout 0
thp_fault_alloc 19
thp_collapse_alloc 0
Tasks state (memory values in pages):
[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[   3713]  1000  3713     2115     1257    57344        0             0 bash
[   4045]  1000  4045    13348     5298    90112        0             0 bigleak2
oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=user.slice,mems_allowed=0,oom_memcg=/user.slice/user-1000.slice/dans-slice,task_memcg=/user.slice/user-1000.slice/dans-slice,task=bigleak2,pid=4045,uid=1000
Memory cgroup out of memory: Killed process 4045 (bigleak2) total-vm:53392kB, anon-rss:20020kB, file-rss:1172kB, shmem-rss:0kB, UID:1000 pgtables:88kB oom_score_adj:0

So, what happened there? As we start to use the memory, we take page faults and the kernel tries to find us physical pages, charging each one to our cgroup (that’s the try_charge_memcg in the call trace above). Once we reached the limit of the physical memory available to our cgroup, the charge could no longer be satisfied, so the kernel chose to kill a process from our cgroup to free some memory. It went for bigleak2.
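
The cgroup keeps count of these events too: memory.events in our cgroup directory holds counters named low, high, max, oom and oom_kill, and after the run above the last two should be non-zero. One final sketch to dump them (same path as before):

#include <stdio.h>

int main()
{
    /* Read the key/value pairs out of memory.events for dans-slice. */
    const char* path =
        "/sys/fs/cgroup/user.slice/user-1000.slice/dans-slice/memory.events";
    FILE* f = fopen(path, "r");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }

    char key[64];
    long value;
    while (fscanf(f, "%63s %ld", key, &value) == 2) {
        printf("%-8s %ld\n", key, value);
    }
    fclose(f);
}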

So, to wrap up: we learned how to create our own cgroup, set memory limits for it, and exhaust those limits by writing to memory we’ve been allocated.