
Performance impact of CPU bug fixes

25 August 2018 Andrea Mauro


With all the Meltdown, Spectre, Foreshadow and related bugs that affect several CPUs, you may be interested in the overall performance impact of all the related patches.

There isn’t a simple answer, because it really varies by processor vendor (Intel CPUs are more affected than AMD CPUs), probably also by CPU family, certainly by the type of workload (CPU-bound workloads will be more affected, but it also depends on which instructions are used), and also by the type of environment.

On a bare-metal system you can have one type of impact; in a hypervisor-based (or container-based) environment you can have a different one. The impact may also depend on the system load.

What does it mean? It’s not only a performance degradation related to some CPU instructions that may run slower: some patches also make significant changes to internal OS (or hypervisor) tasks, like memory management or CPU scheduling. This means that you can have an increase in some housekeeping tasks, an increase in context switch time, or other types of degradation.

This can increase CPU (or memory) latency and slow your application, even if your CPU usage is “low”. For example, it’s like having a vSphere environment with “low” CPU usage but high CPU ready time: if you look at the idle CPU resources everything seems fine, but the reality is different and some VMs can perform really badly.

Also, it’s not clear yet whether those patches have a direct impact on the driver stack, for example in storage or network management…

The different mitigations involve a combination of different types of fixes: some are OS software based, such as the Microsoft and Linux versions of the “kernel page table isolation” protection; some are hardware based, like the CPU microcode updates; and some are applied at the hypervisor level or are hypervisor assisted (for virtualized environments).

But how can you estimate the impact (before applying the patches) and how can you measure it (after the patches have been applied)?

As written in an old post (Analyze performance impact of Spectre and Meltdown patches), you must track the following key areas of your environment: performance, configuration, and capacity.

Planning is very important to avoid surprises, but it can only estimate the possible impact. A correct measurement after the reconfiguration is also important.

There are different tools that can help in the performance and capacity analysis, for example:

Turbonomic includes modeling and planning features to measure the impact of changes on utilization and performance in your environment, including how to optimize in light of those changes. There is an interesting post (Mitigating the Meltdown and Spectre Patch Performance) that provides several details on how to use the out-of-the-box planning features to create a custom plan that adds a simulated percentage of load to the environment.
vRealize Operations adds some dashboards and features to help you assess your environment. The usage of those tools is quite similar for any resource and capacity assessment.
Login VSI provides powerful performance measurement tools for Citrix XenApp, Citrix XenDesktop, VMware Horizon or Microsoft RDS environments. The standard free trial covers 20 users / 5 days. According to Login VSI, if you use the code L1TF the trial is upgraded to a 50-user / 30-day version to help you get a first impression of the exact impact of these mitigations.
Other benchmark tools can also be useful.
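
The idea of adding a simulated percentage of load can also be approximated by hand. The following is only a minimal Python sketch, not how any of the tools above work internally: the host names, utilization figures, the 15% overhead and the 80% threshold are all hypothetical values chosen for illustration.

```python
def hosts_over_threshold(cpu_util_pct, added_load_pct, threshold_pct=80.0):
    """Return hosts whose projected CPU utilization exceeds the threshold.

    cpu_util_pct: dict of host name -> current average CPU utilization (%).
    added_load_pct: simulated relative overhead (10 means +10% of current load).
    threshold_pct: hypothetical utilization ceiling you do not want to cross.
    """
    projected = {
        host: util * (1 + added_load_pct / 100.0)
        for host, util in cpu_util_pct.items()
    }
    return {host: round(p, 1) for host, p in projected.items() if p > threshold_pct}


# Hypothetical hosts and utilization values, for illustration only.
hosts = {"esx01": 55.0, "esx02": 74.0, "esx03": 68.0}
at_risk = hosts_over_threshold(hosts, added_load_pct=15)
print(at_risk)
```

With these made-up numbers only esx02 crosses the threshold, which is the kind of host you would want to test first.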

Observing the duration of some common tasks before and after the patches can also be an empirical way to identify the performance impact. For example, some backup operations involve deduplication, compression and encryption, and all those operations are mostly CPU bound.
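
A before/after comparison like this can be reduced to a single number. The sketch below is a minimal Python example under stated assumptions: the durations are invented, and the median is used only because it is less sensitive to one unusually slow run.

```python
from statistics import median


def slowdown_pct(before_s, after_s):
    """Percentage increase of the median task duration after patching."""
    base = median(before_s)
    return (median(after_s) - base) / base * 100.0


# Hypothetical backup job durations in seconds, before and after the patches.
backup_before = [1810, 1795, 1820, 1802]
backup_after = [1930, 1955, 1941, 1948]

print(f"Backup job slowdown: {slowdown_pct(backup_before, backup_after):.1f}%")
```

Collect several runs on both sides: a single run can easily be skewed by the normal day-to-day variation of the job.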

Where do you need proper planning? Certainly in every big production environment, especially those with a high ratio of virtual CPUs (or containers or processes) to physical cores. In this case, the “housekeeping” part of the degradation could be very significant. And of course in every Citrix/VMware/RDS environment, which fits perfectly in the high CPU overprovisioning case!
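
To see where a host stands, the vCPU-to-core ratio is easy to compute. This is just an illustrative Python sketch; the VM sizes and core count are hypothetical, and no specific "safe" ratio is implied.

```python
def vcpu_to_core_ratio(vcpus_per_vm, physical_cores):
    """Ratio between the vCPUs assigned to VMs and the physical cores of a host."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return sum(vcpus_per_vm) / physical_cores


# Hypothetical VDI host: 60 desktops with 2 vCPUs each on 24 physical cores.
ratio = vcpu_to_core_ratio([2] * 60, physical_cores=24)
print(f"vCPU:pCore ratio = {ratio:.1f}:1")
```

The higher this ratio, the more the scheduling-related part of the degradation deserves a real test.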

Also, it’s very important to understand that there are different impacts at different layers. For example in a vSphere environment you have:

Virtualization layer overhead: This includes only the ESXi patches and maybe the relevant CPU microcode but without Guest Operating System mitigation patches. In most cases, the mitigations at this level may have a minimal performance impact for most workloads on a representative range of recent server processors.
Full stack overhead: This includes all virtualization layer mitigations above with the addition of Guest Operating System mitigation patches.
The impact of these mitigations will vary depending on your application. Applications with very heavy system call usage, including those with very high IO rates, will show a more significant impact than their counterparts with lower system call usage.

Of course, you are interested in the full stack overhead, not in the overhead of a specific layer. This can make it more confusing to find the real impact of those mitigation patches, and makes it even more important to run real tests on the entire stack!

And when you consider the entire stack, also consider the storage part, because it can be affected, and it usually is affected in a hyper-converged environment. Again, this shows the importance of running a real test on the real case.

But where can you conduct those tests? A good option could be to use the DR site, if it exists and is similar to the main site, because there you can also have a copy of your workload with real data. Otherwise, you can measure the impact only after the remediation and you need to estimate it before proceeding.

Different sources give different data…

TechCrunch says that “The Meltdown fix may reduce the performance of Intel chips by as little as 5 percent or as much as 30 — but there will be some hit. Whatever it is, it’s better than the alternative.”

Intel says that the patches for Spectre variant 4 can cause a performance hit of up to 8%.

There are some interesting benchmarks and analyses on Linux kernels, like: The Performance Cost Of Spectre / Meltdown / Foreshadow Mitigations On Linux 4.19. Intel CPUs lose 10-15% and AMD CPUs around 5%, BUT these figures do not consider the hypervisor layer!

LoginVSI has used its tool to measure the performance impact on a VDI environment.

VMware has published some interesting reports for the main issues, but only the latest, related to L1 Terminal Fault, is very accurate, with several performance tests.

There are some tests on different types of workloads with different mitigations for this issue.

Patch applied: Sequential-Context attack vector mitigation:

Mitigation of the Sequential attack vector is achieved through a standard update to the product versions which will be listed in VMSA-2018-0020 when available. Below are the results of the performance impact observed in our test environments for enterprise-class workloads.

Application Workload / Guest OS	Performance Degradation after applying patch
Database OLTP / Windows	3%
Database OLTP / Linux	3%
Mixed Workload / Linux	1%
Java / Linux	<1%
VDI / Windows	3%

ESXi Side-Channel-Aware Scheduler enabled: Concurrent-Context Attack vector Mitigation:
Mitigation of the Concurrent attack vector requires enablement of a new option known as the ESXi Side-Channel-Aware Scheduler (ESXi SCA Scheduler) which will be included with the updates listed in VMSA-2018-0020. The performance impact of this mitigation depends on the remaining usable host CPU capacity prior to enabling the mitigation.

For example, consider the results below for an OLTP Database workload:

Application Workload / Guest OS	% Host CPU Utilization before enabling the ESXi SCA Scheduler	Remaining usable Host CPU Capacity before enabling the ESXi SCA Scheduler	Performance Degradation after enabling the ESXi SCA Scheduler
OLTP Database / Linux	90%	0%	32%
OLTP Database / Linux	77%	~15%	20%
OLTP Database / Linux	62%	~30%	1%

Note: In this example, peak application throughput is reached at 90% host CPU utilization.
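
For a rough idea of what happens between the measured points, you can interpolate the three OLTP results in the table above. This Python sketch is only an assumption-laden illustration: linear interpolation and the clamping outside the measured range are my simplifications, not part of the VMware data, and the numbers apply only to that specific OLTP workload.

```python
from bisect import bisect_right

# (host CPU utilization %, performance degradation %) from the OLTP table above.
POINTS = [(62.0, 1.0), (77.0, 20.0), (90.0, 32.0)]


def estimate_degradation(util_pct):
    """Linearly interpolate the degradation for a given host CPU utilization.

    Assumption: values outside the measured range are clamped to the
    nearest measured point, because there is no data to extrapolate from.
    """
    xs = [x for x, _ in POINTS]
    if util_pct <= xs[0]:
        return POINTS[0][1]
    if util_pct >= xs[-1]:
        return POINTS[-1][1]
    i = bisect_right(xs, util_pct)
    (x0, y0), (x1, y1) = POINTS[i - 1], POINTS[i]
    return y0 + (y1 - y0) * (util_pct - x0) / (x1 - x0)


for util in (50, 70, 85):
    print(f"{util}% host CPU -> ~{estimate_degradation(util):.1f}% degradation")
```

Treat the output as a starting point for planning a test, not as a prediction for your own workload.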

This analysis demonstrates how the load of a multi-system environment can really affect the performance degradation, and probably the same applies to container and RDS/VDI environments.

So what? A good way to estimate the impact could be to consider the worst case and use the following table:

Vulnerability	Other names	CVE	Intel	AMD	Mitigation	Performance impact
GPZ Variant 1 (Spectre)	Bounds Check Bypass (BCB)	2017-5753	Several	Unclear	Hypervisor OS	Minimal
GPZ Variant 2 (Spectre)	Branch Target Injection (BTI)	2017-5715		Yes	Firmware Hypervisor	Up to 20%
GPZ Variant 3 (Meltdown)	Rogue Data Cache Load (RDCL)	2017-5754		None	Hypervisor OS	Minimal
SpectreNG Variant 3a	Rogue System Register Read (RSRE)	2018-3640	Several	None	Microcode	Up to 8%
SpectreNG Variant 4	Speculative Store Bypass (SSB)	2018-3639	Several	Family 15 processors (“Bulldozer” products)	Hypervisor OS	Up to 8%
SpectreNG	Lazy FP State Restore	2018-3665	Intel Core	None	OS only	Not for servers
Spectre Variant 1.1	Bounds Check Bypass Store (BCBS)	2018-3693	Several	Unclear	OS only	Minimal
Spectre Variant 1.2	Read-only protection bypass (RPB)		Several	None (yet)		Unclear
Spectre Variant 5 SpectreRSB	ret2spec Return Mispredict		Unclear	Unclear		Unclear
Foreshadow	L1 Terminal Fault (L1TF) SGX	2018-3615	Several	None (yet)	OS	Up to 30%
Foreshadow-NG	L1 Terminal Fault (L1TF) OS/SMM	2018-3620			OS
Foreshadow-NG	L1 Terminal Fault (L1TF) VMM	2018-3646			Hypervisor

But it’s probably too conservative to think that you can lose 50% of performance on an Intel-based system if you apply all the patches… Again, this shows the importance of performing good testing afterwards as well, to verify the real impact.
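
One possible way to combine the “up to” figures from the table is to compound them, so that each mitigation takes its share of the performance that is left. The Python sketch below assumes the impacts are independent and multiplicative; this is my assumption for illustration, not necessarily how the 50% figure was derived, and real impacts rarely stack this way.

```python
def combined_worst_case(impacts_pct):
    """Compound several worst-case percentage losses into a single loss."""
    remaining = 1.0
    for pct in impacts_pct:
        remaining *= 1.0 - pct / 100.0
    return (1.0 - remaining) * 100.0


# "Up to" values for Intel taken from the table above.
worst_case = {
    "Variant 2 (BTI)": 20,
    "Variant 3a (RSRE)": 8,
    "Variant 4 (SSB)": 8,
    "L1TF (Foreshadow)": 30,
}

total = combined_worst_case(worst_case.values())
print(f"Compounded worst case: {total:.1f}%")
```

Under these assumptions the result lands a little above 50%, which is in line with the very conservative estimate above.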

Also, we have to consider that some remediations are being improved and optimized… but new bugs can also be found… so it’s just a battle between gains and losses…

For sure the impact on AMD CPUs is lower than on Intel CPUs… And at this time the new AMD-based servers are quite interesting, because part of the bugs do not apply to this family… yet… Be aware that probably more bugs are found on Intel CPUs just because they are the majority…

Also, choosing a different CPU platform will limit your VM mobility (VM live migration is not possible between CPUs from different vendors), so it can make sense if you start from scratch, but maybe not if you already have Intel-based hosts.

Andrea Mauro

Virtualization, Cloud and Storage Architect. Tech Field delegate. VMUG IT Co-Founder and board member. VMware VMTN Moderator and vExpert 2010-24. Dell TechCenter Rockstar 2014-15. Microsoft MVP 2014-16. Veeam Vanguard 2015-23. Nutanix NTC 2014-20. Several certifications including: VCDX-DCV, VCP-DCV/DT/Cloud, VCAP-DCA/DCD/CIA/CID/DTA/DTD, MCSA, MCSE, MCITP, CCA, NPP.


#1 | Written by Fabio about 4 years ago.

Hi Mauro,
Are there any contraindications if SCAv1 or SCAv2 is disabled after it has been enabled?

Thank You
Fabio
- #2 | Written by Andrea Mauro about 4 years ago.
  
  You are exposed to the CPU bug
