Linux 4.10 was released
Summary: This release adds support for virtualized GPUs, a new 'perf c2c' tool for cacheline contention analysis in NUMA systems, a new 'perf sched timehist' command for a detailed history of task scheduling, improved writeback management that should make the system more responsive under heavy writing load, a new hybrid block polling method that uses less CPU than pure polling, support for ARM devices such as the Nexus 5 & 6 or Allwinner A64, a feature that allows attaching eBPF programs to cgroups, an experimental MD RAID5 writeback cache, support for Intel Cache Allocation Technology, and many other improvements and new drivers.
Virtual GPU support
This release adds support for Intel GVT-g for KVM (a.k.a. KVMGT), a full GPU virtualization solution with mediated pass-through, starting from 4th generation Intel Core (Haswell) processors with Intel Graphics. This feature is based on a new VFIO Mediated Device framework. Unlike direct pass-through alternatives, the mediated device framework allows KVMGT to offer a complete virtualized GPU with full GPU features to each of the virtualized guests, with a part of the performance-critical resources directly assigned, while still having performance close to native. The capability of running a native graphics driver inside a VM, without hypervisor intervention in performance-critical paths, achieves a good balance among performance, features, and sharing capability.
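The mediated device framework exposes the vGPU types a host GPU can offer through sysfs. A minimal sketch of creating one, assuming a host with KVMGT enabled; the PCI address and the type name are illustrative and vary by machine:

{{{# List the vGPU types offered by the host's integrated GPU
# (0000:00:02.0 is the usual address of Intel integrated graphics).
ls /sys/bus/pci/devices/0000:00:02.0/mdev_supported_types/

# Create a mediated device instance by writing a UUID to the
# 'create' attribute of one of the listed types; the resulting
# mdev can then be handed to a guest through VFIO.
UUID=$(uuidgen)
echo "$UUID" > /sys/bus/pci/devices/0000:00:02.0/mdev_supported_types/i915-GVTg_V4_2/create}}}

The type name ({{{i915-GVTg_V4_2}}} here) encodes the vGPU flavour and is hardware dependent; check the directories actually listed on your system.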
For more details, see these papers:
A Full GPU Virtualization Solution with Mediated Pass-Through
KVMGT: a Full GPU Virtualization Solution
Code: VFIO Mediated device commit
New 'perf c2c' tool, for cacheline contention analysis
In modern systems with multiple processors, different memory modules are physically connected to different CPUs. In these NUMA (Non-Uniform Memory Access) systems, accessing memory attached to a remote CPU is slower than accessing local memory, so the placement of data relative to the threads that use it matters for performance.
perf c2c (for "cache to cache") is a new tool designed to analyse and track down performance problems caused by false sharing on NUMA systems. The tool is based on x86's load latency and precise store facility events provided by Intel CPUs. At a high level, perf c2c will show you:
* The cachelines where false sharing was detected.
* The readers and writers to those cachelines, and the offsets where those accesses occurred.
* The pid, tid, instruction address, function name, and binary object name for those readers and writers.
* The source file and line number for each reader and writer.
* The average load latency for the loads to those cachelines.
* Which NUMA nodes the samples for a cacheline came from, and which CPUs were involved.
and more. For more details on perf c2c and how to use it, see https://joemario.github.io/blog/2016/09/01/c2c-blog/
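As with other perf tools, c2c works in a record/report cycle. A minimal session might look like this (the sampling duration is just an example):

{{{# Sample memory loads and stores system-wide for 10 seconds,
# using the CPU's load-latency / precise-store events.
perf c2c record -a -- sleep 10

# Summarize the contended cachelines, with per-offset detail on
# the readers and writers involved.
perf c2c report}}}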
Code: (merge)
Detailed history of scheduling events with perf sched timehist
'perf sched timehist' provides an analysis of scheduling events. Example usage: {{{$ perf sched record -- sleep 1; perf sched timehist}}}. By default it shows the individual schedule events, including the wait time (time between sched-out and next sched-in events for the task), the task scheduling delay (time between wakeup and actually running) and run time for the task:
{{{    time    cpu    task name          wait time  sch delay  run time
                      [tid/pid]             (msec)     (msec)    (msec)
-------- ------ -----------------------  ---------  ---------  --------
1.874569 [0011] gcc[31949]                   0.014      0.000     1.148
1.874591 [0010] gcc[31951]                   0.000      0.000     0.024
1.874603 [0010] migration/10[59]             3.350      0.004     0.011
1.874604 [0011] <idle>
1.874723 [0005] <idle>
1.874746 [0005] gcc[31949]                   0.153      0.078     0.022
}}}
For more details, see this article from Brendan Gregg: perf sched for Linux CPU scheduler analysis
Code: (merge)
Improved writeback management
Since the dawn of time, the way Linux synchronizes to disk the data written to memory by processes (aka. background writeback) has sucked. When Linux writes all that data in the background, it should have little impact on foreground activity. That's the definition of background activity... But for as long as anyone can remember, heavy buffered writers have not behaved like that. For instance, if you do something like {{{$ dd if=/dev/zero of=foo bs=1M count=10k}}}, or try to copy files to USB storage, and then try to start a browser or any other large app, it basically won't start before the buffered writeback is done, and your desktop, or command shell, feels unresponsive. These problems happen because heavy writes (the kind of write activity caused by the background writeback) fill up the block layer, and other IO requests have to wait a long time to be attended (for more details, see the LWN article recommended below).
This release adds a mechanism that throttles back buffered writeback, which makes it more difficult for heavy writers to monopolize the IO request queue, and thus provides a smoother experience on Linux desktops and shells than people were used to. The throttling algorithm monitors the latencies of requests and shrinks or grows the request queue depth accordingly, which means that it is auto-tunable: in general, a user should not have to touch the settings. This feature needs to be enabled explicitly in the kernel configuration (and, as should be expected, there may be regressions).
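A sketch of checking and tuning the feature, assuming a kernel built with the writeback throttling option ({{{CONFIG_BLK_WBT}}}); the knob shown is the per-device latency target the algorithm tries to meet:

{{{# Confirm the running kernel was built with writeback throttling.
grep BLK_WBT /boot/config-$(uname -r)

# Read the per-device latency target in microseconds; 0 disables
# the throttling, and the default is picked based on the device.
cat /sys/block/sda/queue/wbt_lat_usec

# Lower the target to throttle background writeback more aggressively.
echo 10000 > /sys/block/sda/queue/wbt_lat_usec}}}

The device name ({{{sda}}}) and the value are illustrative; lower targets favor foreground latency at the cost of background throughput.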
Recommended LWN article: Toward less-annoying background writeback
Code: commit
Hybrid block polling
Linux 4.4 added support for polling for request completions in the block layer, a mechanism that can improve latency on very fast devices (such as NVMe) at the cost of burning CPU in a busy loop. This release adds a hybrid polling mode: instead of polling continuously, the kernel puts the process to sleep for a while and only starts polling shortly before the request is expected to complete, which retains most of the latency benefit while using much less CPU.
Hybrid block polling is disabled by default. A new sysfs file under {{{/sys/block/}}} allows selecting between classic and hybrid polling per device.
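A sketch of switching a device to hybrid polling, assuming the {{{io_poll_delay}}} queue attribute introduced alongside this feature (the device name is illustrative; check your kernel's block sysfs documentation for the exact semantics):

{{{# Read the current polling mode for an NVMe namespace:
# -1 selects classic busy polling, 0 adaptive hybrid polling.
cat /sys/block/nvme0n1/queue/io_poll_delay

# Enable adaptive hybrid polling: sleep for roughly half of the
# mean completion time of recent requests, then start polling.
echo 0 > /sys/block/nvme0n1/queue/io_poll_delay}}}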
Code: commit
Better support for ARM devices such as Nexus 5 & 6 or Allwinner A64
As evidence of the work being done to bring Android and mainline kernels together, this release includes support for ARM SoCs and boards such as:
* Huawei Nexus 6P (Angler)
* LG Nexus 5X (Bullhead)
* Nexbox A1 and A95X Android TV boxes
* Pine64 development board based on Allwinner A64
* Globalscale Marvell ESPRESSOBin community board based on Armada 3700
* Renesas "R-Car Starter Kit Pro" (M3ULCB) low-cost automotive board
Code: (merge)
Allow attaching eBPF programs to cgroups
This release adds eBPF hooks for cgroups, to allow eBPF programs for network filtering and accounting to be attached to cgroups, so that they apply to all sockets of all tasks placed in that cgroup. A new BPF program type is added, {{{BPF_PROG_TYPE_CGROUP_SKB}}}. The bpf(2) syscall gains two new commands, {{{BPF_PROG_ATTACH}}} and {{{BPF_PROG_DETACH}}}, to attach a program to a cgroup and detach it again.
Recommended LWN article: Network filtering for control groups
Code: commit
This release also adds a new cgroup-based program type, {{{BPF_PROG_TYPE_CGROUP_SOCK}}}. Similarly to {{{BPF_PROG_TYPE_CGROUP_SKB}}}, these programs can be attached to a cgroup and run any time a process in the cgroup opens an {{{AF_INET}}} or {{{AF_INET6}}} socket. Currently, only {{{sk_bound_dev_if}}} is exported to user space for modification by a BPF program.
Code: commit
Experimental MD raid5 writeback cache and FAILFAST support
This release implements a RAID5 writeback cache in the MD subsystem (Multiple Devices). Its goal is to aggregate writes so that full-stripe writes can be issued, reducing read-modify-write cycles. It is helpful, for example, for workloads that do sequential writes followed by fsync.
This feature is experimental and off by default.
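The cache builds on MD's RAID5 journal device support. A sketch of setting it up, assuming a reasonably recent mdadm; the device names are purely illustrative, and the {{{journal_mode}}} knob is the one described in the MD documentation (verify it exists on your kernel before relying on it):

{{{# Create a RAID5 array with a fast journal device (e.g. an SSD
# partition); writes land in the journal before the data disks.
mdadm --create /dev/md0 --level=5 --raid-devices=3 \
      /dev/sda1 /dev/sdb1 /dev/sdc1 --write-journal /dev/nvme0n1p1

# Switch the journal from the default write-through mode to the
# experimental write-back cache mode.
echo write-back > /sys/block/md0/md/journal_mode}}}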
Code: commit
This release also adds "failfast" support: RAID disks whose IOs fail are quickly marked as broken and avoided from then on, which can improve latency.
Code: commit
Support for Intel Cache Allocation Technology
An Intel feature that allows setting policies on the L2/L3 CPU caches; e.g., real-time tasks could be assigned dedicated cache space. For more details, read the recommended LWN article: Controlling access to the memory cache
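The feature is exposed to user space through the new resctrl filesystem. A minimal sketch, assuming a CPU with Cache Allocation Technology and a kernel built with Intel RDT support; the group name, cache mask, and PID are illustrative:

{{{# Mount the resource control filesystem.
mount -t resctrl resctrl /sys/fs/resctrl

# Create a resource group and assign it a slice of the L3 cache on
# cache domain 0; the capacity bitmask (here 'f') is hardware
# dependent, see /sys/fs/resctrl/info for the valid range.
mkdir /sys/fs/resctrl/rt_tasks
echo "L3:0=f" > /sys/fs/resctrl/rt_tasks/schemata

# Move a task into the group by writing its PID.
echo 1234 > /sys/fs/resctrl/rt_tasks/tasks}}}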
Code: commit