Ready for version 5.9

Linux 5.9 has been released

Summary: This release implements better management of anonymous (malloc'ed) memory; a new cgroup slab controller that improves slab utilization by allowing memory cgroups to share slab memory; support for proactive memory defragmentation; CPU capacity awareness for the deadline scheduling class; support for running BPF programs on socket lookups; a new close_range() system call for easier closing of entire ranges of file descriptors; support for the FSGSBASE x86 instructions, which provide faster context switching; NFS support for extended attributes; and support for ZSTD-compressed kernel, ramdisk and initramfs. As always, there are many other new drivers and improvements.

Better management of anonymous memory

This release implements better workload detection and protection of anonymous memory (memory that is not file-backed, i.e. malloc'ed memory). The Linux kernel manages anonymous memory by placing its pages on either the active list or the inactive list. Under memory pressure, unused pages are moved from the active to the inactive list and unmapped, giving them a chance of being referenced again (a soft fault) before being moved to swap if the pressure continues.

In the previous implementation, newly created or swapped-in pages were placed on the active list, which could force actively used pages onto the inactive list. In this release, newly created or swapped-in anonymous pages start on the inactive list (thus protecting existing hot workloads) and are only promoted to the active list when they are referenced enough. Additionally, because this change can also cause newly created or swapped-in anonymous pages to push existing pages on the inactive list out to swap, the existing workingset detection mechanisms have been extended to cover the anonymous LRU list so that better decisions can be made.
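The placement policy described above can be illustrated with a toy model. This is only a sketch, not kernel code: the `toy_` names are invented for illustration, and the threshold of two references is an assumption standing in for "referenced enough".

```c
/* Toy model of the new anonymous-page placement policy.
 * All names are hypothetical; this is not the kernel's data structure. */
enum toy_lru { TOY_INACTIVE, TOY_ACTIVE };

struct toy_page {
    enum toy_lru list;   /* which LRU list the page currently sits on */
    unsigned int refs;   /* references seen while on the inactive list */
};

/* Assumed threshold standing in for "referenced enough". */
#define TOY_PROMOTE_AFTER_REFS 2

/* New or swapped-in anonymous pages start on the inactive list. */
void toy_page_init(struct toy_page *p)
{
    p->list = TOY_INACTIVE;
    p->refs = 0;
}

/* A reference promotes the page only once the threshold is reached. */
void toy_page_reference(struct toy_page *p)
{
    if (p->list == TOY_INACTIVE && ++p->refs >= TOY_PROMOTE_AFTER_REFS)
        p->list = TOY_ACTIVE;
}
```

In this model a page that is touched only once stays on the inactive list and cannot displace pages that are already active, which is the protection the release aims for.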

New cgroup slab controller shares slab memory

The previous cgroup slab memory controller was based on the idea of replicating slab allocator internals for each memory cgroup, so those cgroups didn't share slab memory, which led to low slab utilization and higher slab memory usage. The slab controller used to be an opt-in feature, but today it is enabled by default in the memory controller, and modern systems with systemd create many cgroups, so these inefficiencies affect many people.

This release incorporates a new cgroup slab memory controller that allows slab memory to be shared between memory cgroups. For Facebook, it saved a significant amount of memory, ranging from high hundreds of MBs to single GBs per host; on average the size of slab memory was reduced by 35-45%. Desktop systems also benefit: on a 16GB Fedora system, the new slab controller saves ~45-50% of slab memory, measured just after loading the system.

Proactive memory compaction

Huge pages (i.e. pages bigger than 4KB on x86) are a processor feature that can improve performance due to reduced TLB usage. Making use of these pages requires large amounts of contiguous free memory, which can be difficult to obtain when memory is heavily fragmented. Linux supports memory compaction (i.e. defragmentation), but it is only triggered when a huge page needs to be allocated, which takes time and hence hurts allocation latency. This release adds support for proactive memory compaction, that is, automatically triggering memory compaction ahead of any allocation request, so that future allocations can succeed faster.
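As a small illustration, the sketch below reads the tunable that controls how aggressively the kernel compacts in the background. The text above does not name it; in kernels built with compaction support it is exposed as `/proc/sys/vm/compaction_proactiveness` (a value from 0 to 100, where 0 disables proactive compaction). The helper name is made up for this example.

```c
#include <stdio.h>

/* Hypothetical helper: returns the current proactiveness setting (0..100),
 * or -1 if the file is missing (older kernel, or compaction not built in)
 * or cannot be parsed. */
int read_compaction_proactiveness(void)
{
    FILE *f = fopen("/proc/sys/vm/compaction_proactiveness", "r");
    int value;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Writing a new value to the same file (as root) changes the setting; higher values make background compaction more aggressive.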

Recommended LWN article: Proactive compaction for the kernel

New close_range() system call for easier closing of file descriptors

This release incorporates a new system call, close_range(2). It efficiently closes a range of file descriptors, up to all file descriptors of the calling task. E.g., {{{close_range(3, ~0U);}}} will close all descriptors past stderr. It turns out that quite a few projects need to do exactly that: service managers, libcs, container runtimes, programming language runtimes/standard libraries (Rust/Python). This system call has been coordinated with FreeBSD, so it is also available there.
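A minimal sketch of how a program might use the new call while still running on older kernels follows. The helper name `close_fd_range` is hypothetical, the raw syscall number is only supplied for libc headers that predate the call, and treating `EPERM` as "not available" is an assumption about sandboxed environments rather than documented behaviour of close_range(2).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* close_range(2) is syscall 436 on x86-64 and arm64; older libc headers
 * may not define it yet. */
#ifndef __NR_close_range
#define __NR_close_range 436
#endif

/* Hypothetical helper, not a libc function: closes every descriptor in
 * [first, last]. Returns 0 on success, -1 with errno set on failure. */
int close_fd_range(unsigned int first, unsigned int last)
{
    if (first > last) {
        errno = EINVAL;
        return -1;
    }

    /* Fast path: a single system call on Linux 5.9 and later. */
    if (syscall(__NR_close_range, first, last, 0U) == 0)
        return 0;

    /* ENOSYS: kernel older than 5.9. EPERM: assumed to come from a
     * seccomp sandbox that rejects system calls it does not know. */
    if (errno != ENOSYS && errno != EPERM)
        return -1;

    /* Fallback: close one by one, bounded by the per-process fd limit. */
    long max_open = sysconf(_SC_OPEN_MAX);
    if (max_open <= 0)
        max_open = 1024; /* arbitrary guess when the limit is unknown */
    if ((unsigned long)first >= (unsigned long)max_open)
        return 0;
    if ((unsigned long)last >= (unsigned long)max_open)
        last = (unsigned int)(max_open - 1);

    for (unsigned int fd = first; ; fd++) {
        close((int)fd); /* EBADF for unused slots is expected and ignored */
        if (fd == last)
            break;
    }
    return 0;
}
```

A service manager would typically call `close_fd_range(3, ~0U)` in the child right before exec. The fallback loop shows what the system call replaces: one close() per possible descriptor, which gets slow when the fd limit is large.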

Support for running BPF programs on socket lookups

As with every new version, there are many improvements to BPF. An interesting new feature is a new BPF program type named {{{BPF_PROG_TYPE_SK_LOOKUP}}}, which runs when the transport layer is looking up a listening socket for a new connection request (TCP), or when looking up an unconnected socket for a packet (UDP). This serves as a mechanism to overcome the limits of the bind() API. Two use cases driving this work are: 1) steering packets destined to an IP range on a fixed port to a single socket, and 2) steering packets destined to an IP address on any port to a single socket.
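The two use cases can be modelled in plain userspace C to show the kind of decision such a program makes. This is deliberately not a BPF program and uses no kernel or libbpf API: the `steer_rule` structure, the function names and the convention that port 0 means "any port" are all invented for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical steering rule; addresses are IPv4 in host byte order. */
struct steer_rule {
    uint32_t prefix;         /* network address of the range */
    unsigned int prefix_len; /* 0..32; 32 matches a single address */
    uint16_t port;           /* 0 means "any port" in this model */
    int socket_id;           /* stand-in for the target socket */
};

/* True when the destination falls inside the rule's range and port. */
bool rule_matches(const struct steer_rule *r, uint32_t dst_ip, uint16_t dst_port)
{
    uint32_t mask;

    if (r->prefix_len > 32)
        return false;
    mask = r->prefix_len == 0 ? 0 : UINT32_MAX << (32 - r->prefix_len);
    if ((dst_ip & mask) != (r->prefix & mask))
        return false;
    return r->port == 0 || r->port == dst_port;
}

/* Returns the socket_id of the first matching rule, or -1 to signal that
 * the regular socket lookup should proceed. */
int steer_lookup(const struct steer_rule *rules, int n,
                 uint32_t dst_ip, uint16_t dst_port)
{
    for (int i = 0; i < n; i++)
        if (rule_matches(&rules[i], dst_ip, dst_port))
            return rules[i].socket_id;
    return -1;
}
```

A rule such as `{192.0.2.0/24, port 7}` covers the first use case and `{198.51.100.1/32, any port}` covers the second; neither can be expressed with a single bind() call.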

CPU Capacity awareness for the deadline scheduling class

Since Linux 3.14

Recommended LWN article: Capacity awareness for the deadline scheduler

Faster context switching with support for FSGSBASE x86 instructions

The FSGSBASE instructions are an Intel feature that has been available for a long time. They allow direct access to the FS and GS segment base registers. In addition to benefits for applications, performance improvements to the OS context switch code are possible by making use of these instructions.
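Applications that want to use the instructions directly need to know whether the kernel has enabled them. A minimal sketch of that check follows, based on the `HWCAP2_FSGSBASE` bit (bit 1 of `AT_HWCAP2`) that the kernel's x86-64 documentation describes; the function name is hypothetical, and on non-x86-64 builds the sketch simply reports 0.

```c
#include <elf.h>
#include <sys/auxv.h>

/* May be missing from older userspace headers. */
#ifndef HWCAP2_FSGSBASE
#define HWCAP2_FSGSBASE (1 << 1)
#endif

/* Hypothetical helper: returns 1 if the kernel has enabled the FSGSBASE
 * instructions for userspace, 0 otherwise. */
int fsgsbase_enabled(void)
{
#if defined(__x86_64__)
    /* The kernel sets this bit in the auxiliary vector only when it has
     * turned the feature on, so it is safer than testing CPUID alone. */
    return (getauxval(AT_HWCAP2) & HWCAP2_FSGSBASE) != 0;
#else
    return 0;
#endif
}
```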

Recommended LWN article: A possible end to the FSGSBASE saga

NFS support for extended attributes

This release incorporates support for extended attributes ( RFC 8276

Support for ZSTD compressed kernel, ramdisk and initramfs

This release adds support for a ZSTD-compressed kernel, ramdisk, and initramfs in the kernel boot process (ZSTD-compressed ramdisk and initramfs are supported on all architectures; the ZSTD-compressed kernel is only hooked up to x86 for now). ZSTD offers good compression ratios and very high decompression speeds. When Facebook switched from an xz-compressed initramfs to a zstd-compressed initramfs, decompression time shrank from 12 seconds to 3 seconds. When they switched from an xz-compressed kernel to a zstd-compressed kernel, they saved 2 seconds of boot time.