AutoTM Artifact Workflow
This section outlines how to run the experiments performed in the AutoTM paper and generate Figures 7 through 12. The code for these experiments lives in `$AUTOTM_HOME/experiments/Benchmarker`. Unless otherwise specified, all commands given below should be executed from this directory, and Julia should be started with `julia --project`.
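To confirm that the intended project environment is active, you can check from the REPL. This is an illustrative sanity check using only the standard `Pkg` library, not part of the artifact itself:

```julia
# Print the path of the active project; it should point at
# .../experiments/Benchmarker/Project.toml.
using Pkg
println(Pkg.project().path)
```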
PMM - Configuring 1LM and 2LM
Servers with Intel Optane DC can be configured to run in either 1LM/AppDirect mode, where reads and writes to PMM are managed manually, or 2LM/Memory Mode, where PMM is accessed as main memory with DRAM acting as a transparent cache.
Most of the AutoTM code expects to run in 1LM mode with PMM mounted to `/mnt/public`. Scripts are provided in the `$AUTOTM_HOME/scripts` directory to aid in switching modes.
Switching to 1LM
Reboot the system and select 1LM in the BIOS. After reboot, navigate to `$AUTOTM_HOME/scripts` and run

```sh
sudo ./change_1lm.sh
```

Reboot the system again. After the system comes online again, navigate back to `$AUTOTM_HOME/scripts` and run

```sh
sudo ./setup_1lm.sh
```
The script `setup_1lm.sh` will destroy all data in PMM namespace 1.0. DO NOT run this script if that namespace holds any data that must be preserved.
The setup script will create a new file system on the NVDIMMs on Socket 1 and perform a direct-access (DAX) filesystem mount at `/mnt`.
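Before launching long experiments, it can be worth confirming from Julia that the mount point is present and writable. This is a minimal illustrative check, not part of the artifact scripts:

```julia
# Confirm the PMM-backed filesystem is mounted and writable.
pmm_dir = "/mnt"
@assert isdir(pmm_dir) "expected a PMM filesystem mounted at $pmm_dir"
testfile = joinpath(pmm_dir, "autotm_write_test")
touch(testfile)
rm(testfile)
```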
Switching to 2LM
Reboot the system and select 2LM in the BIOS. After reboot, navigate to `$AUTOTM_HOME/scripts` and run

```sh
sudo ./change_2lm.sh
```

Reboot the system again. That is all.
PMM - Conventional Benchmarks
Make sure the system is in AppDirect mode and that `setup_1lm.sh` has been executed.
Kernel Profiling
Because of memory fragmentation, kernel timing profiling must be performed as a separate step before the benchmarks themselves are run.
To perform kernel profiling, run

```julia
using Benchmarker, AutoTM
Benchmarker.kernel_profile(Benchmarker.conventional_functions())
```
Kernel profiling for all networks can take hours. Grab a cup of coffee and let AutoTM do its thing.
The serialized data structure for the cached kernel profiles lives in `$AUTOTM_HOME/data/caches`.
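If a profiling run is interrupted, you can see what has already been cached before re-running. The snippet below is an illustrative check using only Base Julia, assuming `$AUTOTM_HOME` is set in the environment; the cache file layout itself is an implementation detail:

```julia
# List cached kernel-profile files, if any.
cache_dir = joinpath(ENV["AUTOTM_HOME"], "data", "caches")
if isdir(cache_dir)
    foreach(println, readdir(cache_dir))
else
    @warn "no kernel-profile cache found at $cache_dir"
end
```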
Running Benchmarks
Reboot the system before running these benchmarks. Ensure the system is under light load for best results.
```julia
using Benchmarker, AutoTM

optimizers = [
    AutoTM.Optimizer.Static,
    AutoTM.Optimizer.Synchronous,
    AutoTM.Optimizer.Numa,
]
ratios = Benchmarker.common_ratios()

for fn in Benchmarker.conventional_functions()
    Benchmarker.run_conventional(fn, optimizers, ratios)
end
```
Results for these runs will be stored in `$AUTOTM_HOME/experiments/Benchmarker/data/cpu`.
Generating Plots
To generate Figures 7, 9, and 11, run the following:

```julia
using Benchmarker

# Figure 7
Benchmarker.plot_speedup()

# Figure 9
Benchmarker.plot_costs()

# Figure 11
Benchmarker.plot_conventional_error()
```
Test Run
For verification purposes, a small Vgg19 network is included.
```julia
using Benchmarker, AutoTM

Benchmarker.kernel_profile(Benchmarker.test_vgg())
Benchmarker.run_conventional(
    Benchmarker.test_vgg(),
    [AutoTM.Optimizer.Static, AutoTM.Optimizer.Synchronous, AutoTM.Optimizer.Numa],
    Benchmarker.common_ratios(),
)

# Generate plots
Benchmarker.plot_speedup(
    models = [Benchmarker.test_vgg()],
)
Benchmarker.plot_conventional_error(
    models = [Benchmarker.test_vgg()],
)
Benchmarker.plot_costs(
    pairs = [Benchmarker.test_vgg() => "synchronous"],
)
```
PMM - Inception Case Study
This experiment explores the sensitivity of the ILP formulation to PMM/DRAM ratios. Make sure the kernels are profiled prior to performing this experiment.
Running the Experiment
The Inception case study simply involves running the `conventional_inception()` workload for a large number of PMM-to-DRAM ratios.

```julia
using Benchmarker
Benchmarker.inception_case_study()
```
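Conceptually, this call performs a sweep equivalent to the sketch below, though the exact set of ratios and the optimizer it uses internally may differ. The loop is illustrative only and is not how the artifact is meant to be driven:

```julia
using Benchmarker, AutoTM

# Illustrative sweep: run the Inception workload once per ratio.
# The optimizer choice here is illustrative.
for ratio in Benchmarker.common_ratios()
    Benchmarker.run_conventional(
        Benchmarker.conventional_inception(),
        [AutoTM.Optimizer.Synchronous],
        [ratio],
    )
end
```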
Generating Plots
To generate Figures 10a, 10b, and 10c, run

```julia
using Benchmarker
Benchmarker.inception_case_study_plots()
```
PMM - Large Networks
This experiment compares AutoTM with hardware-managed 2LM. The workloads used for this experiment each require on the order of 650 GB of memory and thus far exceed the size of local DRAM.
Kernel Profiling
As with the conventional workloads, kernel profiling must be performed first. The command given below will perform all profiling. Be warned that, because of the large number of unique kernels in DenseNet, profiling can take about a day. Thus, you may want to run only a subset of the workloads (an example follows the code block below).
```julia
using Benchmarker

workloads = [
    Benchmarker.large_vgg(),
    Benchmarker.large_inception(),
    Benchmarker.large_resnet(),
    Benchmarker.large_densenet(),
]
Benchmarker.kernel_profile(workloads)
```
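To profile only a subset, pass a shorter vector. For example, to skip the day-long DenseNet profile:

```julia
# Profile only VGG and Inception.
Benchmarker.kernel_profile([
    Benchmarker.large_vgg(),
    Benchmarker.large_inception(),
])
```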
AutoTM Data
Due to the large size of these workloads, the system should be rebooted between each run to minimize memory fragmentation. This is not absolutely necessary, but it can help with consistency.
```julia
using Benchmarker, AutoTM

### Run each of the large workloads

# Vgg
Benchmarker.run_large(Benchmarker.large_vgg(), AutoTM.Optimizer.Static)
Benchmarker.run_large(Benchmarker.large_vgg(), AutoTM.Optimizer.Synchronous)

# Inception
Benchmarker.run_large(Benchmarker.large_inception(), AutoTM.Optimizer.Static)
Benchmarker.run_large(Benchmarker.large_inception(), AutoTM.Optimizer.Synchronous)

# Resnet
Benchmarker.run_large(Benchmarker.large_resnet(), AutoTM.Optimizer.Static)
Benchmarker.run_large(Benchmarker.large_resnet(), AutoTM.Optimizer.Synchronous)

# DenseNet
Benchmarker.run_large(Benchmarker.large_densenet(), AutoTM.Optimizer.Static)
Benchmarker.run_large(Benchmarker.large_densenet(), AutoTM.Optimizer.Synchronous)
```
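If you choose to skip the reboots and run several workloads in one session, the same calls can be expressed as a loop equivalent to the explicit list above:

```julia
using Benchmarker, AutoTM

workloads = [
    Benchmarker.large_vgg(),
    Benchmarker.large_inception(),
    Benchmarker.large_resnet(),
    Benchmarker.large_densenet(),
]

# Run both optimizers over every large workload.
for workload in workloads, optimizer in (AutoTM.Optimizer.Static, AutoTM.Optimizer.Synchronous)
    Benchmarker.run_large(workload, optimizer)
end
```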
2LM Data
Switch the system to 2LM using the process outlined above. Once the system is in 2LM, run the following commands:

```julia
using Benchmarker

Benchmarker.run_2lm(Benchmarker.large_vgg())
Benchmarker.run_2lm(Benchmarker.large_inception())
Benchmarker.run_2lm(Benchmarker.large_resnet())
Benchmarker.run_2lm(Benchmarker.large_densenet())
```
Generating Plots
This generates Figure 8.
```julia
using Benchmarker
Benchmarker.plot_large()
```
GPU
Preparation
To allow data to be moved to the host system, CUDA needs pinned host memory. Before running, raise the locked-memory limit with

```sh
ulimit -l unlimited
```

Navigate to the Benchmarker directory:

```sh
cd $AUTOTM_HOME/experiments/Benchmarker
```

Start a new Julia session:

```sh
julia --project
```

In the Julia REPL, make sure all dependencies are installed:

```
julia> ]
(Benchmarker) pkg> instantiate
```
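Equivalently, dependencies can be instantiated non-interactively by passing the following snippet to `julia --project -e '...'`:

```julia
# Non-interactive equivalent of the REPL package steps above.
using Pkg
Pkg.instantiate()
```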
Profiling
Because of memory overheads, the GPU experiments are split into two parts. The first part involves generating the kernel profile information. The second part is the actual running of the experiments themselves.
To generate the kernel profile data, perform the following sequence of commands in the `Benchmarker` directory:

```julia
using Benchmarker, AutoTM
Benchmarker.gpu_profile()
```
When the system finishes profiling, exit the Julia session.
Running Benchmarks
Julia must be restarted between each benchmark: while ngraph is responsible for one large allocation holding intermediate data, input and output tensors on the Julia side are managed by CuArrays. These two allocation sources generally confuse each other across multiple runs, so the most consistent way to get results is to restart Julia.
A script has been provided in the `Benchmarker` directory. To run it, execute

```sh
julia --color=yes gpu_script.jl
```
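Internally, a driver of this kind typically launches each benchmark in a fresh Julia process, along the lines of the sketch below. This is illustrative only: the payload string is a placeholder, and the actual `gpu_script.jl` may be organized differently.

```julia
# Launch each payload in a fresh Julia process so the ngraph and
# CuArrays allocations never coexist across benchmarks.
payloads = [
    "using Benchmarker; Benchmarker.gpu_profile()",  # placeholder payload
]
for payload in payloads
    run(`julia --project -e $payload`)
end
```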
There are some default variables set for the amount of GPU DRAM and for the overhead of the ngraph/CUDA runtimes. These are set to 11 GB and 1 GB respectively for an RTX 2080 Ti. With a different GPU or CUDA version, these will need to be changed. For example, if your GPU has 6 GB of memory, these values may be set using
```julia
using Benchmarker, AutoTM
Benchmarker.GPU_MAX_MEMORY[] = 6_000_000_000
Benchmarker.GPU_MEMORY_OVERHEAD[] = 1_000_000_000
```
Memory overhead can be queried using `nvidia-smi`.
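If you prefer to derive `GPU_MAX_MEMORY` programmatically, something like the following works. This is an illustrative helper, not part of Benchmarker; it assumes a single GPU and that `nvidia-smi` is on the `PATH`:

```julia
using Benchmarker

# nvidia-smi reports memory.total in MiB; convert to bytes.
# Consider rounding down to leave extra headroom.
out = read(`nvidia-smi --id=0 --query-gpu=memory.total --format=csv,noheader,nounits`, String)
Benchmarker.GPU_MAX_MEMORY[] = parse(Int, strip(out)) * 2^20
```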
Generating Plots
Following the benchmark runs, the GPU performance plot (Figure 12) is generated using

```julia
using Benchmarker
Benchmarker.plot_gpu_performance()
```