Setting up a Trusted HPC System in the Cloud
The emphasis on security in computer systems has been increasing steadily as users grow more concerned about data privacy. Cloud providers have started to offer secure execution on sensitive data using additional security mechanisms on top of the isolation provided by traditional virtual machines (VMs). Providing secure enclaves with the help of hardware-based trusted execution environments (TEEs) appears to be a widespread trend. Examples of such secure enclave deployments include AWS Nitro Enclaves, Google Confidential Cloud Computing, Microsoft’s Azure Confidential Computing, and IBM Cloud Data Shield.
In this blog post, we will look at how to build a simple experimental setup in Google Cloud for running HPC-like workloads securely. Recently, Google’s Confidential Cloud Computing initiative announced the availability of confidential virtual machines based on AMD’s SEV (Secure Encrypted Virtualization) for trusted execution support. When SEV is enabled, all data stored in main memory for a particular virtual machine (VM) is encrypted. This ensures the data cannot be read by other VMs, the hypervisor, or even individuals with physical access to the main memory hardware.
HPC workloads are mostly distributed and scale across multiple machines/nodes (e.g., using OpenMPI); at HPC centers, high-bandwidth network interconnects are used for communication among these nodes. Unfortunately, SEV does not yet support secure multi-node computation. If you execute workloads that scale across VMs on different nodes, the data cannot be sent encrypted between them under SEV, since each VM has its own key (generated at random by the hardware). Moreover, SEV incorporates the data’s physical address into its encryption function in addition to the key, making the transfer of encrypted data even harder. Today, in Google Cloud infrastructure, as long as traffic does not cross the boundary of one of Google Cloud’s data centers, communication between different nodes is authenticated in some form, but it still happens in the clear. Therefore, in this blog post, we will use a VPN connection to secure the communication. Although VPN-based communication is expected to be slow, it currently seems to be the only way to secure communication out of the box.
The first part of this blog post presents performance data to show what kind of slowdowns can be expected on Google Cloud VMs compared to insecure execution. The second part (the Appendix) provides a step-by-step guide to setting up Google Cloud VMs with all the tools required to run the tests presented in this write-up.
Performance Analysis
Here, we will look at performance data to analyze the slowdowns that secure execution can introduce. We use MPI workloads in these tests, which are very common in the HPC domain. There are multiple open-source MPI implementations (OpenMPI, MPICH, MVAPICH) that can potentially be used inside VMs. We use OpenMPI here, built with UCX, a framework (a collection of libraries and interfaces) for building HPC protocols such as RMA, fragmentation, and MPI tag matching, with support for different communication transports such as RDMA, TCP, and shared memory. The steps taken to install OpenMPI, UCX, and some MPI workloads are provided in the Appendix.
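UCX picks a transport automatically at run time, but the choice can be pinned through environment variables; we rely on this later when launching benchmarks over the VPN tunnel. A minimal sketch (the interface name ens4 is illustrative):
# List the transports and devices UCX detects on this node
ucx_info -d
# Restrict UCX to TCP over a specific network interface
export UCX_TLS=tcp
export UCX_NET_DEVICES=ens4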
Figures 1 and 2 (below) show the bandwidth and latency numbers for selected micro-benchmarks from the OSU MPI micro-benchmarks suite. The three configurations shown in the figures are described below:
No_Conf: No confidentiality support (SEV is disabled)
Conf: Confidentiality support (SEV) is enabled, but the communication channel is unencrypted
VPN_Conf: Confidentiality support (SEV) is enabled, and the communication channel is secured using a VPN connection
As shown in Figures 1 and 2, although the bandwidth and latency numbers are quite close for the No_Conf and Conf configurations, they are much worse when the VPN is used for communication. This is expected, since the VPN adds encryption and tunneling overhead to every message.
Figure 3 shows the execution time for the NAS Parallel Benchmarks (Class C), which are often used to benchmark HPC systems, under the three configurations defined above.
Figure 4 shows the same data as the observed slowdown relative to insecure execution.
As can be seen in the figure, the slowdown with the Conf configuration suggests no significant performance degradation, and the slowdown with VPN_Conf for most of the benchmarks is not as bad as in the micro-benchmarks above.
However, it should be noted that we were limited by the maximum number of processes, which is one of the reasons the tests do not include the higher classes of the NAS Parallel Benchmarks (e.g., Class D) that are more representative of the HPC workloads scientists will be interested in.
Conclusion
To summarize the above results: while it is possible today to run HPC workloads securely in the cloud (such as Google Cloud), there are limitations:
- Only 8 vCPUs in total are available (for now) in Google Cloud, which severely limits the kinds of HPC workloads that can be executed on these resources.
- While data is stored encrypted, communication is in the clear, necessitating application-specific solutions or a VPN.
- Using a VPN for secure communication is undesirable due to the performance overheads.
Appendix
Next is a discussion of the steps taken to set up the VMs on Google Cloud and the software installed to perform the experiments that produced the data shown above.
Setting up a Confidential VM in Google Cloud
On Google Cloud Compute Engine, when creating your VM instance via CREATE INSTANCE, select the Confidential VM service option, as shown in Figure 5. This automatically changes the machine type to the correct option (namely the N2D series, powered by AMD EPYC Rome CPUs). Different virtual CPU and memory configurations are available (including high-memory and high-CPU options). However, currently (last checked in the first week of November 2020), Google Cloud does not allow using more than 8 vCPUs in total per data center, so we will set up two VMs of 4 vCPUs each for this blog. We expect these limitations to be relaxed in the future.
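If you prefer the command line, a similar VM can be created with the gcloud CLI. The sketch below uses an illustrative instance name and zone; Confidential VMs require the host maintenance policy to be TERMINATE, and you may also need to select an SEV-capable image via --image-family/--image-project:
gcloud beta compute instances create cvm-node1 --zone=us-central1-a --machine-type=n2d-standard-4 --confidential-compute --maintenance-policy=TERMINATE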
Once you create your instances (two in this case), you can access them through the SSH tab, which opens a secure SSH window with direct access to the VM. You can verify that SEV is enabled on these instances by running:
ls -l /dev/sev
If SEV is enabled, the above command should not complain about /dev/sev not being present.
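Alternatively, you can look for SEV in the kernel log; on a Confidential VM the boot messages should report that memory encryption is active (the exact wording may vary across kernel versions):
sudo dmesg | grep -i sev
The output should contain a line similar to "AMD Memory Encryption Features active: SEV".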
Install the required libraries:
sudo apt-get update && sudo apt-get install autoconf make gcc gfortran libtool numactl libnuma-dev g++ flex
VPN Setup
First, we will set up a VPN connection using OpenVPN.
Most of these instructions are based on this article. Below is a brief summary of all the steps that need to be taken, but if you are interested in the detailed reasoning for every step, please refer to the article.
We will assume that the first VM instance acts as the server and the other as the client.
First, install OpenVPN and EasyRSA (EasyRSA will be used for setting up certificates):
sudo apt-get update
sudo apt-get install openvpn easy-rsa
Setting up the certificates:
make-cadir ~/openvpn-ca
cd ~/openvpn-ca
There should be a file titled vars in this directory, which contains most of the configuration parameters for certificate creation. If you want, you can change the parameters (like KEY_COUNTRY, KEY_EMAIL, etc.) used in key creation, or just keep their default values.
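For reference, the relevant part of vars contains entries like the following (the values here are purely illustrative):
export KEY_COUNTRY="US"
export KEY_PROVINCE="CA"
export KEY_CITY="SanFrancisco"
export KEY_ORG="MyOrganization"
export KEY_EMAIL="admin@example.com"
export KEY_OU="MyOrganizationalUnit"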
Then,
source vars
Now, build the certificate authority (CA):
./build-ca
For the server certificate and key:
./build-key-server server
If you want, you can enter a challenge password at this point, or just keep it blank and press enter.
Then, generate the Diffie-Hellman parameters and an HMAC key for tls-auth:
./build-dh
openvpn --genkey --secret keys/tiv.key
We will create the client certificate on the same machine as well:
cd ~/openvpn-ca
source vars
./build-key client
Now, to set up the OpenVPN server (the OpenVPN installation should have already created /etc/openvpn):
cd ~/openvpn-ca/keys
sudo cp ca.crt server.crt server.key tiv.key dh2048.pem /etc/openvpn
Start from the default sample configuration for the OpenVPN server:
gunzip -c /usr/share/doc/openvpn/examples/sample-config-files/server.conf.gz | sudo tee /etc/openvpn/server.conf
Next, we need to edit this configuration file:
sudo nano /etc/openvpn/server.conf
Find tls-auth tiv.key 0 in the file, remove the ; in front of it, and add key-direction 0 under it:
tls-auth tiv.key 0 # This file is secret
key-direction 0
Uncomment the AES-128-CBC line and add an auth line below it:
cipher AES-128-CBC
auth SHA256
Then uncomment:
user nobody
group nogroup
Also uncomment the lines containing dhcp-option:
push "dhcp-option DNS 208.67.222.222"
push "dhcp-option DNS 208.67.220.220"
We will use UDP for communication, so make sure it is uncommented and TCP is commented out. You can also configure any port number you want to use (the default is UDP 1194):
port 443
;proto tcp
proto udp
Make sure, in your file, you also have:
cert server.crt
key server.key
Save all the settings and close /etc/openvpn/server.conf.
Some other changes are needed in the OS settings:
Open sudo nano /etc/sysctl.conf and uncomment:
net.ipv4.ip_forward=1
Save and close the file, then run sudo sysctl -p for this change to take effect.
To update firewall rules, first check:
ip route | grep default
Note the interface in the output of the above command that follows dev. For instance, if it is ens0, we will update the firewall rules as follows:
sudo nano /etc/ufw/before.rules
Add the following lines:
# OPENVPN
# NAT Table
*nat
:POSTROUTING ACCEPT [0:0]
# OpenVPN client traffic
-A POSTROUTING -s 10.8.0.0/8 -o ens0 -j MASQUERADE
COMMIT
# OPENVPN
Save and close, then do sudo nano /etc/default/ufw and change DEFAULT_FORWARD_POLICY to ACCEPT. Save and close.
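If ufw is active on the server, also allow OpenSSH and the OpenVPN port/protocol chosen above, then reload ufw so the new rules take effect (a sketch assuming the UDP port 443 configured earlier):
sudo ufw allow 443/udp
sudo ufw allow OpenSSH
sudo ufw disable
sudo ufw enable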
The OpenVPN server can then be started using:
sudo systemctl start openvpn@server
You can check the status of this service using:
sudo systemctl status openvpn@server
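Optionally, to have the VPN come back up automatically after a reboot, enable the service as well:
sudo systemctl enable openvpn@server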
Now for client configuration:
mkdir -p ~/clients/files
chmod 700 ~/clients/files
cp /usr/share/doc/openvpn/examples/sample-config-files/client.conf ~/clients/base.conf
Edit the file with nano ~/clients/base.conf.
Find the remote directive and add your server’s IP address and port number for OpenVPN:
remote [server IP] [port number]
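For example, with an illustrative server address of 203.0.113.10 and the port 443 chosen in the server configuration:
remote 203.0.113.10 443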
Make sure the proto is set to udp, as in the server’s configuration above.
Uncomment:
user nobody
group nogroup
Comment out the following lines:
# ca ca.crt
# cert client.crt
# key client.key
Make sure you have the following in your file as well:
cipher AES-128-CBC
auth SHA256
key-direction 1
Also add the following lines:
script-security 2
up /etc/openvpn/update-resolv-conf
down /etc/openvpn/update-resolv-conf
Save and close, then create nano ~/clients/gen_config.sh with the following:
#!/bin/bash
# Embed the CA certificate, client certificate/key, and tls-auth key
# inline into a single .ovpn client configuration file.
KEY_DIR=~/openvpn-ca/keys
OUTPUT_DIR=~/clients/files
BASE_CONFIG=~/clients/base.conf
cat ${BASE_CONFIG} \
<(echo -e '<ca>') \
${KEY_DIR}/ca.crt \
<(echo -e '</ca>\n<cert>') \
${KEY_DIR}/${1}.crt \
<(echo -e '</cert>\n<key>') \
${KEY_DIR}/${1}.key \
<(echo -e '</key>\n<tls-auth>') \
${KEY_DIR}/tiv.key \
<(echo -e '</tls-auth>') \
> ${OUTPUT_DIR}/${1}.ovpn
Run this script:
chmod 700 ~/clients/gen_config.sh
cd ~/clients
./gen_config.sh client
This should generate client.ovpn in ~/clients/files/.
From your SSH session, you can download /home/username/clients/files/client.ovpn using the Download file option and upload it to the client VM instance (again using the upload option in the SSH session).
Make sure the libraries (including OpenVPN) are installed on the client VM as well, and run:
sudo openvpn --config client.ovpn
If everything worked, this should create a VPN connection between the server and the client. Your SSH window on the client VM will become unresponsive, as it is now tunneling all internet traffic through the server.
You can directly create an SSH connection from inside the server to access the client.
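To confirm the tunnel is up, you can ping the server’s VPN-side address from the client (10.8.0.1 is the default server address in the sample configuration) and inspect the tunnel interface:
ping -c 3 10.8.0.1
ip addr show tun0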
Setting up the System to Run HPC-like Workloads
Next, we will look at the steps to install all the required libraries needed to run MPI (Message Passing Interface) workloads on our 2-node system.
UCX Installation
First, create a directory .ucx in your home directory and then run:
git clone https://github.com/openucx/ucx.git ucx
cd ucx
git checkout v1.8.1
./autogen.sh
mkdir build
cd build
../configure --prefix=/home/"$USER"/.ucx/
make -j4
make install -j4
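To verify the UCX installation, you can print its version (ucx_info is installed alongside the libraries):
/home/"$USER"/.ucx/bin/ucx_info -v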
OpenMPI Installation
Create a directory .openmpi in your home directory and then run:
git clone https://github.com/open-mpi/ompi.git
cd ompi
git checkout v4.0.4
./autogen.pl
mkdir build
cd build
../configure --prefix=/home/"$USER"/.openmpi/ --with-ucx=/home/"$USER"/.ucx/
make -j4
make install -j4
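As a quick sanity check that OpenMPI built correctly and picked up the UCX support, you can query the build (ompi_info ships with OpenMPI):
/home/"$USER"/.openmpi/bin/mpirun --version
/home/"$USER"/.openmpi/bin/ompi_info | grep -i ucx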
OSU MPI Micro-benchmarks
To compile the OSU MPI micro-benchmarks:
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.6.3.tar.gz
tar -xzvf osu-micro-benchmarks-5.6.3.tar.gz
cd osu-micro-benchmarks-5.6.3
./configure CC=/home/"$USER"/.openmpi/bin/mpicc CXX=/home/"$USER"/.openmpi/bin/mpicxx
make -j64
The above UCX, OpenMPI, and OSU micro-benchmark installation steps should be performed on all nodes in the system.
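The NAS Parallel Benchmarks used in the performance section are available from the NAS website (https://www.nas.nasa.gov/software/npb.html). Below is a rough sketch of building the MPI version, assuming NPB 3.4.x and the OpenMPI installation above; check the variable names against the template shipped with your NPB version:
cd NPB3.4.1/NPB3.4-MPI
cp config/make.def.template config/make.def
# Point the MPI compiler variables in config/make.def at the OpenMPI wrappers, e.g.:
#   MPIFC = /home/$USER/.openmpi/bin/mpif90
#   MPICC = /home/$USER/.openmpi/bin/mpicc
make bt CLASS=C
The resulting binary (e.g. bin/bt.C.x) can then be launched with mpirun in the same way as the OSU benchmark shown next.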
Now you can run any of these workloads using commands like the following:
/home/"$USER"/.openmpi/bin/mpirun -mca btl self -x UCX_NET_DEVICES=tun0 -x UCX_TLS=tcp -np 2 -host [Client-IP-Address],[Server-IP-Address] /home/"$USER"/osu-micro-benchmarks-5.6.3/mpi/pt2pt/osu_bw
Note: The above command should be run on the client machine; tun0 in this command is the network interface for the tunnel connection.
Issues and Solutions
Updating the grub configuration in Google Cloud VMs:
sudo vim /etc/default/grub.d/50-cloudimg-settings.cfg
To generate an SSH key and configure a passwordless SSH connection with the other VMs (required to run MPI workloads):
ssh-keygen
Then copy the contents of id_rsa.pub into authorized_keys on the other VM, for example:
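A brief sketch of these two steps (user and host names are illustrative; ssh-copy-id appends your public key to the remote authorized_keys):
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
ssh-copy-id username@[Other-VM-IP]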