Linux 6.13 has been released
Summary: This release includes a new lazy preemption model that provides more preemption opportunities than the voluntary preemption mode often used as the default, but not as many as the full preemption mode. There is also support for fine-grained timestamps, without the performance overhead that would often come with providing high-resolution timestamps for every single file; lightweight guard pages; support for storage with atomic writes in XFS and Ext4; support for NAPI suspension during idle periods; a new networking device API to configure TX H/W shaping; various io_uring improvements; ARM support for running Linux in a protected VM under the Arm Confidential Compute Architecture; ARM support for user-space shadow stacks; and a reference counting mechanism for files that is slightly more scalable. As always, there are many other features, new drivers, improvements and fixes. Also, you might be interested in the LWN merge window report: part 1
Lazy preemption: a bit more preemption
The Linux kernel supports four different preemption modes. There is a "full preemption" mode, but since preemption is usually at odds with performance, most Linux kernels default to the "voluntary preemption" mode, which provides some preemption opportunities, but fewer than full preemption.
This release adds a "lazy preemption" mode that aims to be a bridge between the voluntary and the full preemption mode. It optimizes fair-class preemption by delaying preemption requests to the tick boundary, while working as full preemption for RR/FIFO/DEADLINE classes.
Recommended LWN article: The long road to lazy preemption
Support for multi-grain file timestamps: fine-grained timestamps, without the performance overhead
Some applications (notably, NFS) need higher-resolution timestamps on files, but higher-resolution timestamps on all files can increase the rate at which metadata needs to be written to the disk. In this release, Linux adds support for fine-grained timestamps, but uses them only for files whose timestamps processes have actually queried. This allows for finer-grained timestamps without the performance overhead.
Documentation: Multigrain Timestamps
Recommended LWN article: Rethinking multi-grain timestamps
Support for atomic writes
Some storage hardware supports atomic write operations: a write larger than the storage's sector size is either written in full or not at all, so it is never torn. This release adds support for atomic writes in XFS, Ext4's Direct I/O, and some md RAID modes.
Recommended LWN article: Atomic writes without tears
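From user space, an atomic write is requested by passing the {{{RWF_ATOMIC}}} flag to {{{pwritev2()}}}, and the limits for a given file are reported by {{{statx()}}} as a minimum and maximum atomic write unit. The {{{pwritev2()}}} man page requires the total length to be a power of two within those limits, and the offset to be naturally aligned to the length. The sketch below only checks those rules before a write is attempted; the function names are hypothetical, and the limits are assumed to have been read from {{{statx()}}} beforehand.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if v is a nonzero power of two. */
static bool is_power_of_two(uint64_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Hypothetical pre-flight check for an RWF_ATOMIC write.
 * unit_min/unit_max are the atomic write limits reported for the file
 * (assumed to come from statx()); offset and len describe the write.
 */
static bool atomic_write_ok(uint64_t offset, uint64_t len,
                            uint32_t unit_min, uint32_t unit_max)
{
    if (!is_power_of_two(len))               /* length must be a power of two */
        return false;
    if (len < unit_min || len > unit_max)    /* and within the reported limits */
        return false;
    return offset % len == 0;                /* offset naturally aligned to len */
}
```

For example, with limits of 512 and 65536 bytes, a 4096-byte write at offset 8192 passes the check, while the same write at offset 512 does not because the offset is not a multiple of the length. A write that fails these rules is rejected by the kernel rather than silently performed non-atomically.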
NAPI suspension for more efficient networking
Interrupt mitigation in networking loads can be accomplished with busy polling, and can be quite efficient, but it cannot effectively support both low- and high-load situations.
This release adds a new packet delivery mode that alternates between busy polling and interrupt-based delivery depending on the busy and idle periods of the application. During a busy period, the system operates in busy-polling mode, which avoids interference. During an idle period, the system falls back to interrupt deferral, but with a small timeout to avoid excessive latencies.
New networking device API to configure TX H/W shaping
There are several shaping-related driver APIs, but none is flexible enough to meet existing demand from vendors. This release introduces new device APIs to configure TX H/W shaping in a flexible way. The new functionality is exposed via a newly defined generic netlink interface and includes introspection capabilities.
API documentation: Family net-shaper netlink specification
Lightweight guard pages
A guard page is a page that, when accessed, causes a fatal signal to be raised. Installing a guard page in certain places can be useful in various situations. Until now, users had to establish {{{PROT_NONE}}} ranges to achieve this, but this is costly memory-wise: it needs a VMA for each and every one of these regions, and they become unmergeable with surrounding VMAs.
This release implements a {{{MADV_GUARD_INSTALL}}} flag for the {{{madvise()}}} system call, which installs a guard page without that overhead, thus making it cheaper and easier to use these pages.
Various io_uring improvements
This release adds support for various io_uring features:
* Add support for ring resizing, so apps can start with a small ring and grow it as needed
* Support for sending a sync message to another ring, without having a ring available to send a normal async message
* Add support for just doing partial buffer clones, rather than always cloning the entire buffer table
* Add support for fixed wait regions, rather than needing to copy the same wait data tons of times for each wait operation
* Add static NAPI support, where a specific NAPI instance is used rather than having a list of them available that need lookup
* Regions, param pre-mapping and reg waits extension: a better and more generic API for ring/memory/region registration, which also turns the API for extending registered waits into a generic parameter passing mechanism. That will be useful in the future to implement more flexible ring creation, especially when the same huge page / mapping is to be shared
* Add support for hybrid IO polling, which is a variant of strict IOPOLL but with an initial sleep delay to avoid spinning too early and wasting resources on devices that aren't necessarily in the < 5 usec category wrt latencies
ARM64 virtualization and security improvements
This release adds support in the ARM architecture for:
* Running Linux in a protected VM under the Arm Confidential Compute Architecture (CCA) (Arm Confidential Compute Architecture documentation)
* Guarded Control Stack in userspace (ARM's implementation of shadow stacks), which provides hardware-protected stacks of return addresses, intended to provide hardening against return-oriented programming (ROP) attacks and to make it easier to gather call stacks for applications such as profiling.
Reference counting mechanism for more scalable file operations
This release introduces a new reference counting mechanism for files. It gives a consistent improvement of up to 3-5% on workloads with many threads.