Linux 4.4 has been released
Summary: This release adds 3D support in the virtual GPU driver, which allows 3D hardware-accelerated graphics in virtualization guests; loop device support for Direct I/O and Asynchronous I/O, which saves memory and increases performance; support for Open-channel SSDs, which are devices that share the responsibility of the Flash Translation Layer with the operating system; completely lockless TCP listener handling, which allows for faster and more scalable TCP servers; journalled RAID5 in the MD layer, which fixes the RAID write hole; eBPF programs that can now be run by unprivileged users and be made persistent, with perf gaining support for eBPF programs as well; a new mlock2() syscall that allows users to request memory to be locked on page fault; and block polling support for improved performance in high-end storage devices. There are also new drivers and many other small improvements.
Faster and leaner loop device with Direct I/O and Asynchronous I/O support
This release introduces support for Direct I/O and asynchronous I/O on the loop block device. There are several advantages to using direct I/O and AIO for reads/writes on the loop device's backing file: double caching is avoided, because Direct I/O bypasses the page cache, which reduces memory usage a lot; unlike user-space direct I/O, there is no cost of pinning pages; and in some cases context switches are avoided because concurrent submissions can be avoided. See the commits for benchmarks.
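The memory saving comes from bypassing the page cache: with direct I/O the loop driver no longer keeps a second cached copy of the backing file's data. As a minimal sketch of what direct I/O means at the syscall level, here is a plain ''O_DIRECT'' read of a regular file (not the loop driver itself; the 4096-byte alignment is an assumption, the real requirement depends on the filesystem and device):

```python
import mmap
import os
import tempfile

ALIGN = 4096  # assumed O_DIRECT alignment; really filesystem/device dependent

def read_direct(path, length):
    """Read `length` bytes with O_DIRECT, or return None if unsupported."""
    try:
        fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    except OSError:
        return None  # some filesystems (e.g. tmpfs) reject O_DIRECT
    buf = mmap.mmap(-1, length)  # mmap'ed memory is page-aligned
    try:
        os.preadv(fd, [buf], 0)  # read straight into the aligned buffer
    finally:
        os.close(fd)
    return buf[:length]

# Create a backing file with one aligned block of recognizable data.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\xab" * ALIGN)
    path = f.name

data = read_direct(path, ALIGN)
assert data is None or data == b"\xab" * ALIGN
os.unlink(path)
```

Note that O_DIRECT requires the buffer, file offset and length to be block-aligned, which is why the sketch reads into page-aligned mmap'ed memory rather than an ordinary bytearray.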
Code: commit
3D support in virtual GPU driver
virtio-gpu is a driver for virtualization guests that allows them to use the host graphics card efficiently. In this release, it allows the virtualization guest to use the capabilities of the host GPU to accelerate 3D rendering. In practice, this means that a virtualized Linux guest can run an OpenGL game while using the GPU acceleration capabilities of the host, as shown in this
44m linux.conf talk about the project
Code: commit
LightNVM adds support for Open-Channel SSDs
Open-channel SSDs are devices that share responsibilities with the operating system in order to implement and maintain features that typical SSDs keep strictly in firmware. These include the Flash Translation Layer (FTL), bad block management, and hardware units such as the flash controller, the interface controller, and a large number of flash chips. In this way, Open-channel SSDs expose direct access to their physical flash storage, while keeping a subset of the internal features of SSDs.
LightNVM is a specification that adds support for Open-channel SSDs. LightNVM allows the host to manage data placement, garbage collection, and parallelism. Device-specific responsibilities such as bad block management, FTL extensions to support atomic IOs, or metadata persistence are still handled by the device. This Linux release adds support for LightNVM (and adds the corresponding support to NVMe as well).
Recommended LWN article: Taking control of SSDs with LightNVM
Code: commit
TCP listener handling completely lockless, making TCP servers faster and more scalable
In this release, and as the result of an effort that started two years ago, the TCP implementation has been refactored to make the TCP listener fast path completely lockless. During tests, a server was able to process 3,500,000 SYN packets per second on one listener and still have available CPU cycles - about 2 to 3 orders of magnitude more than was possible before. SO_REUSEPORT has also been extended (see the Networking section) to add proper CPU/NUMA affinities, so that heavy-duty TCP servers can get proper siloing thanks to multi-queue NICs.
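The SO_REUSEPORT mechanism mentioned above lets multiple listening sockets bind the same port, with the kernel spreading incoming connections across them - one listener per worker/CPU instead of one contended listener. A minimal sketch with plain sockets (the CPU/NUMA affinity tuning referenced above is not shown here):

```python
import socket

def make_listener(port=0):
    """Create a TCP listener that shares its port via SO_REUSEPORT."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen(128)
    return s

# A typical server would create one such listener per worker process/CPU;
# the kernel load-balances incoming connections between them.
a = make_listener()
port = a.getsockname()[1]
b = make_listener(port)   # a second listener on the very same port succeeds
assert b.getsockname()[1] == port
a.close()
b.close()
```

All sockets must set SO_REUSEPORT before binding; without it, the second bind() would fail with EADDRINUSE.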
Code: commit
Preliminary journalled RAID5 MD support
This release adds journalled RAID5 support to the MD (RAID/LVM) layer. With a journal device configured (typically NVRAM or SSD), data/parity writes to the RAID array go to the log first, and only then to the RAID array disks. If a crash happens, data can be recovered from the log. This can speed up RAID resync, and it fixes the RAID5 write hole issue: a crash during degraded operations can no longer result in data corruption. In future releases the journal will also be used to improve performance and latency.
Code: merge
Unprivileged eBPF + persistent eBPF programs
'''Unprivileged eBPF'''
eBPF programs got their own syscall in Linux 3.18, but until now only privileged users were allowed to use it, because eBPF programs could be used as an attack vector against the kernel. In this release, unprivileged users are also allowed to use the ''bpf(2)'' syscall, restricted to programs of the socket filter type; the feature can be disabled with the ''kernel.unprivileged_bpf_disabled'' sysctl.
Recommended LWN article: Unprivileged bpf()
Code: commit
'''Persistent eBPF maps/progs'''
This release also adds support for "persistent" eBPF maps/programs. "Persistent" is to be understood as meaning that maps/programs have a facility that lets them survive process termination. This is desired by various eBPF subsystem users, for example the tc classifier/action: whenever tc parses the ELF object and extracts and loads maps/progs into the kernel, these file descriptors are out of reach after the tc instance exits, so a subsequent tc invocation cannot access/relocate on this resource, and therefore maps cannot easily be shared, e.g. between the ingress and egress networking data path.
To fix issues like these, a new minimal file system has been created that can hold map/prog objects, mounted at ''/sys/fs/bpf/''. Any subsequent mounts within a given namespace will point to the same instance. The file system allows for creating a user-defined directory structure. The objects for maps/progs are created/fetched through ''bpf(2)'' along with a pathname, using two new commands (''BPF_OBJ_PIN''/''BPF_OBJ_GET'') that in turn create the file system nodes. The user can use those nodes to access maps and progs later on, through ''bpf(2)''.
Code: commit
perf + eBPF integration
In this release, eBPF programs have been integrated with perf. When perf is given an eBPF .c source file (or a .o file built for the 'bpf' target with clang), it will automatically build, validate and load it into the kernel, where it can then be used by ''perf trace'' and other tools.
Users can attach an eBPF program as a filter, like this: ''# perf record --event ./hello_world.o ls''; the eBPF program is attached to a newly created perf event, which works with all tools.
Code: commit
Block polling support
This release adds basic support for polling for specific IO to complete, which can improve latency and throughput on very fast devices. Currently, O_DIRECT synchronous reads/writes are supported. This support is only intended for testing; in future releases, stats tracking will be used to auto-tune it. For now, for benchmark and testing purposes, a sysfs file (''io_poll'') has been added that controls whether polling is enabled or not.
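The knob lives under each block device's request-queue directory in sysfs. A small sketch that merely lists the current setting for any devices that expose it (writing ''1'' to the file to enable polling requires root, and only makes sense on drivers with poll support):

```python
import glob

# Each request queue exposes its polling switch at
# /sys/block/<dev>/queue/io_poll (0 = disabled, 1 = enabled).
def io_poll_settings():
    settings = {}
    for path in glob.glob("/sys/block/*/queue/io_poll"):
        with open(path) as f:
            settings[path] = f.read().strip()
    return settings  # empty dict on kernels/devices without the knob

for path, value in io_poll_settings().items():
    print(path, "->", value)
```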
Recommended LWN article: Block-layer I/O polling
Code: commit
mlock2() syscall allows users to request memory to be locked on page fault
''mlock()'' allows a user to prevent program memory from being paged out, but this comes at the cost of faulting in the entire mapping when it is allocated. For large mappings this is not ideal: for example, security applications that need ''mlock()'' are forced to lock an entire buffer, no matter how big it is. Or consider large graph models where the path through the graph is not known until run time: they are forced to lock the entire graph, or to lock page by page as pages are faulted in.
The new ''mlock2()'' syscall creates a middle ground. Pages are marked to be placed on the unevictable LRU (locked) when they are first used, but they are not faulted in by the mlock call. The new system call takes a flags argument along with the start address and size. This flags argument gives the caller the ability to request memory to be locked in the traditional way, or to be locked only after a page is faulted in. New calls are added for ''munlock()'' and ''munlockall()'' which give the caller a way to specify which flags are supposed to be cleared. A new MCL flag is added to ''mlockall()'' to mirror the lock-on-fault behaviour of ''mlock2()''. Finally, a flag for ''mmap()'' is added that allows a user to specify that the covered area should not be paged out, but only after the memory has been used for the first time.
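A sketch of invoking the new syscall directly, since C libraries of the time had no wrapper for it (the syscall number 325 is the x86-64 value and MLOCK_ONFAULT = 0x01 - both are assumptions to check against your architecture's headers; on pre-4.4 kernels the call simply fails with ENOSYS):

```python
import ctypes
import mmap
import os

libc = ctypes.CDLL(None, use_errno=True)
NR_MLOCK2 = 325       # x86-64 syscall number (architecture-specific!)
MLOCK_ONFAULT = 0x01  # lock pages only as they are faulted in

def mlock_onfault(buf):
    """Mark a writable buffer lock-on-fault; True on success, False otherwise."""
    c = ctypes.c_char.from_buffer(buf)
    addr = ctypes.addressof(c)
    del c  # release the buffer export so `buf` stays usable afterwards
    ret = libc.syscall(NR_MLOCK2, ctypes.c_void_p(addr),
                       ctypes.c_size_t(len(buf)), ctypes.c_int(MLOCK_ONFAULT))
    if ret != 0:
        # ENOSYS: pre-4.4 kernel; ENOMEM/EPERM: RLIMIT_MEMLOCK too low
        print("mlock2 failed:", os.strerror(ctypes.get_errno()))
        return False
    return True

buf = mmap.mmap(-1, mmap.PAGESIZE * 16)
if mlock_onfault(buf):
    buf[0] = 1  # only now is this one page faulted in - and locked
```

Unlike a plain ''mlock()'', no page of the 16-page mapping is faulted in by the call itself; each page is pinned only on first touch.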
Recommended LWN article: Deferred memory locking
Code: commit