Linux 4.5 has been released
Summary: This release adds a new copy_file_range(2) system call that makes it possible to copy files without transferring data through userspace; experimental PowerPlay power management for modern Radeon GPUs; scalability improvements in the Btrfs free space handling; support for GCC's Undefined Behavior Sanitizer (-fsanitize=undefined); Forward Error Correction support in the device-mapper's verity target; support for the MADV_FREE flag in madvise(); the new cgroup unified hierarchy is considered stable; scalability improvements for SO_REUSEPORT UDP sockets; scalability improvements for epoll; and better memory accounting of sockets in the memory controller. There are also new drivers and many other small improvements.
Copy offloading with new copy_file_range(2) system call
Copying a file consists of reading the data from a file into user space memory, then copying that memory to the destination file. There is nothing wrong with this way of doing things, but it requires extra copies of the data to and from the process memory. This release adds a system call, copy_file_range(2), which copies a range of data from one file to another, avoiding the cost of transferring data from the kernel to user space and then back into the kernel.
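As a rough illustration of the difference, the sketch below copies a file through Python's {{{os.copy_file_range}}} wrapper (available on Linux since Python 3.8) and falls back to the classic read/write loop when offloading is not possible. The function name {{{copy_file}}}, the chunk size and the set of errno values that trigger the fallback are assumptions made for this example, not part of the kernel interface.

```python
import errno
import os
import tempfile

# errno values treated as "offloading is unavailable here". This set is an
# assumption of the sketch (old kernel, unsupported filesystem pairing, ...).
_FALLBACK_ERRNOS = {errno.ENOSYS, errno.EXDEV, errno.EINVAL, errno.EOPNOTSUPP}


def copy_file(src_path, dst_path, chunk=1 << 20):
    """Copy src_path to dst_path and return the number of bytes copied."""
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    offload = hasattr(os, "copy_file_range")
    total = 0
    try:
        while True:
            if offload:
                try:
                    # The kernel moves up to `chunk` bytes between the two
                    # descriptors; the data never enters this process.
                    copied = os.copy_file_range(src, dst, chunk)
                except OSError as exc:
                    if exc.errno not in _FALLBACK_ERRNOS:
                        raise
                    offload = False
                    continue
            else:
                # Classic path: read into user space, then write back out.
                data = os.read(src, chunk)
                copied = len(data)
                view = memoryview(data)
                while view:
                    view = view[os.write(dst, view):]
            if copied == 0:  # end of file reached
                break
            total += copied
    finally:
        os.close(src)
        os.close(dst)
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "src.bin")
        target = os.path.join(tmp, "dst.bin")
        with open(source, "wb") as f:
            f.write(os.urandom(3 * 1024 * 1024))
        print(copy_file(source, target), "bytes copied")
```

Both paths advance the file offsets of the two descriptors, so the fallback can resume from wherever the offloaded copy stopped.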
This system call is only very slightly faster
Recommended LWN articles: 1: copy_file_range()
Raw man page: copy_file_range.2
Code: commit
Experimental PowerPlay support brings high performance to the amdgpu driver
Modern GPUs start running in low power, low performance modes. To get the best performance, they need to dynamically change their frequency. But doing that requires good power management. This release adds support for PowerPlay
PowerPlay support is not enabled by default for all kinds of hardware supported in this release due to stability concerns; in these cases the use of PowerPlay can be forced with the "amdgpu.powerplay=1" kernel option.
Code: see link
Btrfs free space handling scalability improvements
Filesystems need to keep track of which blocks are being used and which ones are free. They also need to store information about the free space somewhere, because it's too costly to generate it from scratch. Btrfs has been able to store a cache of the available free space since 2.6.37
This release includes a new, experimental way of representing the free space cache that takes less work overall to update on each commit and fixes the scalability issues. It is not the default yet, and can be enabled with the {{{-o space_cache=v2}}} mount option. On the first mount with this option set, the new free space tree will be created and a read-only compatibility flag will be enabled (older kernels will be able to read the filesystem, but not write to it). It is possible to revert to the old free space cache (and remove the compatibility flag) by mounting the filesystem with the options {{{-o clear_cache,space_cache=v1}}}.
Code: commit
Support for GCC's Undefined Behavior Sanitizer (-fsanitize=undefined)
UBSAN (Undefined Behaviour SANitizer) is a debugging tool available since GCC 4.9 (see -fsanitize=undefined documentation
In this release, Linux supports compiling the kernel with the Undefined Behavior Sanitizer enabled, using the {{{-fsanitize}}} options {{{shift, integer-divide-by-zero, unreachable, vla-bound, null, signed-integer-overflow, bounds, object-size, returns-nonnull-attribute, bool, enum}}} and, optionally, {{{alignment}}}. Most of the work is done by the compiler; all the kernel does is handle the printing of errors.
Links:
* Red Hat Developer blog entry about ubsan
* GCC's -fsanitize documentation
Code: commit
Forward Error Correction support in the device-mapper's verity target
The device-mapper's "verity" target, used by popular platforms such as Android
This release adds Forward Error Correction
Code: commit
Add MADV_FREE flag to madvise(2)
When an application wants to signal the kernel that it isn't going to use a range of memory in the near future, it can use the MADV_DONTNEED flag, so the kernel can free resources associated with it. Subsequent accesses in the range will succeed, but will result either in reloading of the memory contents from the underlying mapped file or zero-fill-on-demand pages for mappings without an underlying file. But some kinds of applications (notably, memory allocators) can reuse that memory range after a short time, and MADV_DONTNEED forces them to incur page faults, page allocation, page zeroing, etc. To avoid that overhead, other operating systems such as the BSDs have supported MADV_FREE
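A minimal sketch of how an allocator-like program might use the new flag, written with Python's {{{mmap.madvise}}} (Python 3.8+). It relies on the behaviour documented in madvise(2): MADV_FREE applies to private anonymous mappings, the kernel may reclaim the pages lazily under memory pressure, and a later write to a page cancels the pending free for that page. The helper names {{{new_private_buffer}}} and {{{release_lazily}}} are hypothetical, and the fallback to MADV_DONTNEED is an assumption of this example.

```python
import mmap

PAGE = mmap.PAGESIZE


def new_private_buffer(pages=4):
    """Return a private anonymous mapping, the kind MADV_FREE applies to."""
    return mmap.mmap(-1, pages * PAGE, flags=mmap.MAP_PRIVATE,
                     prot=mmap.PROT_READ | mmap.PROT_WRITE)


def release_lazily(buf):
    """Tell the kernel the contents of buf are disposable.

    Returns the name of the advice that was applied.
    """
    if hasattr(mmap, "MADV_FREE"):
        try:
            buf.madvise(mmap.MADV_FREE)
            return "MADV_FREE"
        except OSError:
            pass  # e.g. a kernel older than 4.5 rejects the advice
    # Fallback chosen for this sketch: drop the pages immediately.
    buf.madvise(mmap.MADV_DONTNEED)
    return "MADV_DONTNEED"


if __name__ == "__main__":
    buf = new_private_buffer()
    buf[:5] = b"stale"
    print("advice used:", release_lazily(buf))
    # The old contents must now be treated as gone. Writing again makes
    # the range valid for the new data.
    buf[:5] = b"fresh"
    print(buf[:5])
    buf.close()
```

After the advice is given, the program should not rely on the old contents: they may still be there or may read back as zeros, depending on whether the kernel has reclaimed the pages.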
Recommended LWN article: Volatile ranges and MADV_FREE
Code: commit
Better epoll multithread scalability
When multiple epoll
This release introduces a new {{{EPOLLEXCLUSIVE}}} flag that can be passed as part of the {{{event}}} argument during an epoll_ctl(2)
Recommended LWN article: Epoll evolving: Better multi-threaded behavior
Code: commit
cgroup unified hierarchy is considered stable
cgroups, or control groups, are a feature introduced in Linux 2.6.24
In this release, the unified hierarchy is considered stable, and it's no longer hidden behind that developer flag. It can be mounted using the {{{cgroup2}}} filesystem type (unfortunately, the cpu controller for cgroup2 hasn't made it into this release; only the memory and io controllers are available at the moment). For more details, including the detailed reasoning behind the migration to the unified hierarchy, see the cgroup2 documentation: Documentation/cgroup-v2.txt
Code: (merge)
Performance improvements for SO_REUSEPORT UDP sockets
This release includes two optimizations for {{{SO_REUSEPORT}}} sockets (for now, only UDP sockets):
 * Two new socket options, {{{SO_ATTACH_REUSEPORT_CBPF}}} and {{{SO_ATTACH_REUSEPORT_EBPF}}}, allow attaching a classic or extended BPF program. These BPF programs define how packets are assigned to the sockets in the {{{SO_REUSEPORT}}} group, that is, the sockets bound to the same port.
 * Faster lookup when selecting a {{{SO_REUSEPORT}}} socket for an incoming packet. Previously, the lookup process needed to consider all sockets; in this release an appropriate socket can be found much faster (see the commit link for benchmarks).
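For readers unfamiliar with {{{SO_REUSEPORT}}} groups, here is a small sketch of what such a group looks like from user space: several UDP sockets bound to the same address and port, with the kernel choosing one of them for each incoming datagram. It does not attach a BPF program, so the selection is left entirely to the kernel. The function names and the use of the loopback address are assumptions made for this example.

```python
import select
import socket


def open_reuseport_group(host, port, count):
    """Bind `count` UDP sockets to the same host/port via SO_REUSEPORT."""
    group = []
    for _ in range(count):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # The option must be set on every socket before bind().
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind((host, port))
        if port == 0:
            # First socket got an ephemeral port; reuse it for the rest.
            port = sock.getsockname()[1]
        group.append(sock)
    return group


def receivers_for_one_datagram(group, payload=b"ping", timeout=1.0):
    """Send one datagram to the group and return the sockets that got it."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        client.sendto(payload, group[0].getsockname())
    readable, _, _ = select.select(group, [], [], timeout)
    return readable


if __name__ == "__main__":
    group = open_reuseport_group("127.0.0.1", 0, 3)
    print("shared port:", group[0].getsockname()[1])
    ready = receivers_for_one_datagram(group)
    print("sockets that received the datagram:", len(ready))
    for sock in group:
        sock.close()
```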
Code: commit
Proper control of socket memory usage in the memory controller
In past releases, socket buffers were accounted separately in the cgroup's memory controller, without any pressure equalization between anonymous memory, page cache, and the socket buffers. When the socket buffer pool was exhausted, buffer allocations would fail and cause network performance to tank, regardless of whether there was still memory available to the group or not. Likewise, struggling anonymous or cache working sets could not dip into an idle socket memory pool. Because of this, the feature was not usable for many real life applications.
In this release, the new unified memory controller accounts all types of memory pages it is tracking on behalf of a cgroup in a single pool. Upon pressure, the VM reclaims and shrinks and puts pressure on whatever memory consumer in that pool is within its reach. When the VM has trouble freeing memory, the network code is instructed to stop growing the cgroup's transmit windows. Overhead is only incurred when a non-root control group is created and the memory controller is instructed to track and account the memory footprint of that group. The {{{cgroup.memory=nosocket}}} option can be specified on the boot command line to override any runtime configuration and forcibly exclude socket memory from active memory resource control.
Code: commit