Performance Analysis of Scientific Computing Workloads on General Purpose TEEs

Ayaz Akram, Anna Giannakou, Venkatesh Akella, Jason Lowe-Power, Sean Peisert. 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021).

Paper Download · Extended arXiv Version · Presentation Download · Presentation Video · Source Code

HPC TEEs

Scientific computing sometimes involves computation on sensitive data. Depending on the data and the execution environment, the HPC (high-performance computing) user or data provider may require confidentiality and/or integrity guarantees. The table below shows some examples of sensitive data that providers may want to share with data scientists.

| Domain | Data provider | Data types | Applications |
| --- | --- | --- | --- |
| Health care | Hospital or VA | Health records, medical images, gene sequences | Machine learning models |
| Transportation | Public transportation, delivery/logistics firms | Driving routes | Graph analysis, optimization |
| Energy | Utility company | Home & building energy usage | Real-time demand/response prediction |

One possible architecture to enable HPC on sensitive data is to leverage trusted execution environments (TEEs). Although TEEs were originally designed for a narrow set of workloads, new hardware-accelerated TEEs like AMD's SEV and Intel's SGX may provide the performance and programmability required for HPC.

Thus, in this paper we study a diverse set of workloads on AMD SEV and Intel SGX.

We make the following three findings:

SEV may be suitable for HPC with careful memory management

With a NUMA-aware data placement policy, we find that the slowdown when running HPC applications with SEV is manageable (1-1.5x). However, without careful data placement, the slowdown can be significant (1-3.4x) compared to running the application without SEV enabled.

The graphs below explain why SEV performs poorly with the default NUMA placement. The first figure shows the memory allocated over time when SEV is enabled: all of the memory is allocated on a single NUMA node, reducing the peak memory bandwidth and increasing the memory latency for the application. Compare this with the second figure, which shows the memory allocated over time with SEV disabled: the data is spread nearly evenly across all of the NUMA nodes, increasing the memory bandwidth and improving performance.

SEV memory usage over time with default allocation

Memory usage over time without SEV

To solve this problem, you can either carefully configure your virtual machine with the same NUMA arrangement as the host platform, or you can force the system to interleave the memory allocation when spawning the virtual machine.
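As a concrete sketch of the second approach, the VM launch command can be prefixed with `numactl` so that guest memory pages are interleaved across NUMA nodes at allocation time. The QEMU flags, memory/CPU sizes, and disk path below are illustrative placeholders, not our exact configuration:

```shell
# Interleave the VM's memory allocations across all NUMA nodes.
# numactl comes from the numactl package; QEMU arguments are illustrative.
numactl --interleave=all \
  qemu-system-x86_64 \
    -enable-kvm \
    -m 64G -smp 32 \
    -drive file=guest-disk.qcow2,if=virtio
```

With libvirt-managed guests, the equivalent effect can be achieved with a `<numatune><memory mode="interleave" .../></numatune>` element in the domain XML.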

Virtualization required by SEV hurts performance

To use AMD’s SEV, you must run your workload inside a virtual machine. Virtualization is well known to cause performance problems for memory-intensive workloads. However, we also found that some I/O-intensive workloads, which are not uncommon for big data applications in HPC, suffer from virtualization overheads. That said, most workloads perform about the same with SEV as they do when executed natively.
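For reference, launching an SEV-protected guest under QEMU/KVM looks roughly like the following. This is a sketch: the object id, `cbitpos`, and `reduced-phys-bits` values are machine-specific, and the memory, CPU, and disk settings are placeholders:

```shell
# Launch a KVM guest with SEV memory encryption enabled (illustrative values).
# cbitpos and reduced-phys-bits must match the host CPU's SEV capabilities.
qemu-system-x86_64 \
  -enable-kvm \
  -machine q35,confidential-guest-support=sev0 \
  -object sev-guest,id=sev0,cbitpos=47,reduced-phys-bits=1 \
  -m 16G -smp 8 \
  -drive file=guest-disk.qcow2,if=virtio
```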

This figure shows the slowdown relative to native execution for some “real world” HPC workloads. The key bar is the rightmost one in each set (green, hatched), which shows the performance of SEV with NUMA interleaving enabled.

SEV performance on HPC applications

SGX is inappropriate for HPC applications

Finally, we also evaluated Intel’s SGX and found that it was not appropriate for running HPC applications. SGX was designed for applications that can be easily partitioned into “secure” and “insecure” components and that do not execute much computation in the “secure” portion.

Thus, we found that SGX has the following problems.

  1. It was difficult to program. Running unmodified applications was very difficult and, in some cases, impossible.
  2. SGX limits the amount of memory a program can access (~100 MB on our system and ~200 MB on more recent designs). This is a poor fit for HPC applications, which may require hundreds of gigabytes of working memory.
  3. System calls and multi-threading perform poorly under SGX.

Overall, we found extreme performance penalties when running workloads on SGX as shown below.

SGX performance on HPC applications

You can find more information about how we ran our experiments in this blog post on Setting up Trusted HPC System in the Cloud.

Acknowledgements

This work was performed in collaboration with Lawrence Berkeley National Labs.

Citation

Ayaz Akram, Anna Giannakou, Venkatesh Akella, Jason Lowe-Power, and Sean Peisert. “Performance Analysis of Scientific Computing Workloads on General Purpose TEEs.” In 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2021.
