AutoTM Artifact Workflow
This section outlines how to run the experiments performed in the AutoTM paper and generate Figures 7 through 12. The code for these experiments lives in `$AUTOTM_HOME/experiments/Benchmarker`. Unless otherwise specified, all commands given below should be executed from this directory, and Julia should be started with `julia --project`.
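To confirm that the intended project environment is active, you can check from the REPL. This is an illustrative sanity check using only the standard `Pkg` library, not part of the artifact itself:

```julia
# Print the path of the active project; it should point at
# .../experiments/Benchmarker/Project.toml.
using Pkg
println(Pkg.project().path)
```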
PMM - Configuring 1LM and 2LM
Servers with Intel Optane DC can be configured to run in either 1LM/AppDirect mode, where reads and writes to PMM are managed manually, or 2LM/Memory Mode, where PMM is accessed as main memory with DRAM acting as a transparent cache.
Most of the AutoTM code expects to run in 1LM mode with PMM mounted to `/mnt/public`. Scripts are provided in the `$AUTOTM_HOME/scripts` directory to aid in switching modes.
Switching to 1LM
Reboot the system and select 1LM in the BIOS. After reboot, navigate to `$AUTOTM_HOME/scripts` and run

```sh
sudo ./change_1lm.sh
```

Reboot the system again. After the system comes online again, navigate back to `$AUTOTM_HOME/scripts` and run

```sh
sudo ./setup_1lm.sh
```
The script `setup_1lm.sh` will destroy all data in PMM namespace 1.0. DO NOT run this script if that namespace holds any data that must be preserved.
The setup script will create a new file system on the NVDIMMs on Socket 1 and perform a direct-access (DAX) filesystem mount at `/mnt`.
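Before launching long experiments, it can be worth confirming from Julia that the mount point is present and writable. This is a minimal illustrative check, not part of the artifact scripts:

```julia
# Confirm the PMM-backed filesystem is mounted and writable.
pmm_dir = "/mnt"
@assert isdir(pmm_dir) "expected a PMM filesystem mounted at $pmm_dir"
testfile = joinpath(pmm_dir, "autotm_write_test")
touch(testfile)
rm(testfile)
```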
Switching to 2LM
Reboot the system and select 2LM in the BIOS. After reboot, navigate to `$AUTOTM_HOME/scripts` and run

```sh
sudo ./change_2lm.sh
```

Reboot the system again. That is all.
PMM - Conventional Benchmarks
Make sure the system is in AppDirect mode and that `setup_1lm.sh` has been executed.
Kernel Profiling
Because of memory fragmentation, kernel timing profiling must be performed as a separate step before the benchmarks themselves are run.
To perform kernel profiling, run

```julia
using Benchmarker, AutoTM
Benchmarker.kernel_profile(Benchmarker.conventional_functions())
```
Kernel profiling for all networks can take hours. Grab a cup of coffee and let AutoTM do its thing.
The serialized data structure for the cached kernel profiles lives in `$AUTOTM_HOME/data/caches`.
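If a profiling run is interrupted, you can see what has already been cached before re-running. The snippet below is an illustrative check using only Base Julia, assuming `$AUTOTM_HOME` is set in the environment; the cache file layout itself is an implementation detail:

```julia
# List cached kernel-profile files, if any.
cache_dir = joinpath(ENV["AUTOTM_HOME"], "data", "caches")
if isdir(cache_dir)
    foreach(println, readdir(cache_dir))
else
    @warn "no kernel-profile cache found at $cache_dir"
end
```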
Running Benchmarks
Reboot the system before running these benchmarks. Ensure the system is under light load for best results.
```julia
using Benchmarker, AutoTM

optimizers = [
    AutoTM.Optimizer.Static,
    AutoTM.Optimizer.Synchronous,
    AutoTM.Optimizer.Numa,
]
ratios = Benchmarker.common_ratios()

for fn in Benchmarker.conventional_functions()
    Benchmarker.run_conventional(fn, optimizers, ratios)
end
```
Results for these runs will be stored in `$AUTOTM_HOME/experiments/Benchmarker/data/cpu`.
Generating Plots
To generate Figures 7, 9, and 11, run the following:

```julia
using Benchmarker

# Figure 7
Benchmarker.plot_speedup()

# Figure 9
Benchmarker.plot_costs()

# Figure 11
Benchmarker.plot_conventional_error()
```
Test Run
For verification purposes, a small Vgg19 network is included.
```julia
using Benchmarker, AutoTM

Benchmarker.kernel_profile(Benchmarker.test_vgg())
Benchmarker.run_conventional(
    Benchmarker.test_vgg(),
    [AutoTM.Optimizer.Static, AutoTM.Optimizer.Synchronous, AutoTM.Optimizer.Numa],
    Benchmarker.common_ratios(),
)

# Generate plots
Benchmarker.plot_speedup(
    models = [Benchmarker.test_vgg()],
)
Benchmarker.plot_conventional_error(
    models = [Benchmarker.test_vgg()],
)
Benchmarker.plot_costs(
    pairs = [Benchmarker.test_vgg() => "synchronous"],
)
```
PMM - Inception Case Study
This experiment explores the sensitivity of the ILP formulation to PMM/DRAM ratios. Make sure the kernels are profiled prior to performing this experiment.
Running the Experiment
The Inception case study simply involves running the `conventional_inception()` workload for a large number of PMM-to-DRAM ratios.

```julia
using Benchmarker
Benchmarker.inception_case_study()
```
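Conceptually, this call performs a sweep equivalent to the sketch below, though the exact set of ratios and the optimizer it uses internally may differ. The loop is illustrative only and is not how the artifact is meant to be driven:

```julia
using Benchmarker, AutoTM

# Illustrative sweep: run the Inception workload once per ratio.
# The optimizer choice here is illustrative.
for ratio in Benchmarker.common_ratios()
    Benchmarker.run_conventional(
        Benchmarker.conventional_inception(),
        [AutoTM.Optimizer.Synchronous],
        [ratio],
    )
end
```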
Generating Plots
To generate Figures 10a, 10b, and 10c, run

```julia
using Benchmarker
Benchmarker.inception_case_study_plots()
```
PMM - Large Networks
This experiment compares AutoTM with hardware-managed 2LM. The workloads used for this experiment each require on the order of 650 GB of memory and thus far exceed the size of local DRAM.
Kernel Profiling
As with the conventional workloads, kernel profiling must be performed first. The command given below will perform all profiling. Be warned that, because of the large number of unique kernels in DenseNet, profiling can take about a day. Thus, you may want to run only a subset of the workloads (an example follows the code block below).
```julia
using Benchmarker

workloads = [
    Benchmarker.large_vgg(),
    Benchmarker.large_inception(),
    Benchmarker.large_resnet(),
    Benchmarker.large_densenet(),
]
Benchmarker.kernel_profile(workloads)
```
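To profile only a subset, pass a shorter vector. For example, to skip the day-long DenseNet profile:

```julia
# Profile only VGG and Inception.
Benchmarker.kernel_profile([
    Benchmarker.large_vgg(),
    Benchmarker.large_inception(),
])
```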
AutoTM Data
Due to the large size of these workloads, the system should be rebooted between each run to minimize memory fragmentation. This is not absolutely necessary, but it can help with consistency.
```julia
using Benchmarker, AutoTM

### Run each of the large workloads

# Vgg
Benchmarker.run_large(Benchmarker.large_vgg(), AutoTM.Optimizer.Static)
Benchmarker.run_large(Benchmarker.large_vgg(), AutoTM.Optimizer.Synchronous)

# Inception
Benchmarker.run_large(Benchmarker.large_inception(), AutoTM.Optimizer.Static)
Benchmarker.run_large(Benchmarker.large_inception(), AutoTM.Optimizer.Synchronous)

# Resnet
Benchmarker.run_large(Benchmarker.large_resnet(), AutoTM.Optimizer.Static)
Benchmarker.run_large(Benchmarker.large_resnet(), AutoTM.Optimizer.Synchronous)

# DenseNet
Benchmarker.run_large(Benchmarker.large_densenet(), AutoTM.Optimizer.Static)
Benchmarker.run_large(Benchmarker.large_densenet(), AutoTM.Optimizer.Synchronous)
```
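If you choose to skip the reboots and run several workloads in one session, the same calls can be expressed as a loop equivalent to the explicit list above:

```julia
using Benchmarker, AutoTM

workloads = [
    Benchmarker.large_vgg(),
    Benchmarker.large_inception(),
    Benchmarker.large_resnet(),
    Benchmarker.large_densenet(),
]

# Run both optimizers over every large workload.
for workload in workloads, optimizer in (AutoTM.Optimizer.Static, AutoTM.Optimizer.Synchronous)
    Benchmarker.run_large(workload, optimizer)
end
```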
2LM Data
Switch the system to 2LM using the process outlined above. Once the system is in 2LM, run the following commands:

```julia
using Benchmarker

Benchmarker.run_2lm(Benchmarker.large_vgg())
Benchmarker.run_2lm(Benchmarker.large_inception())
Benchmarker.run_2lm(Benchmarker.large_resnet())
Benchmarker.run_2lm(Benchmarker.large_densenet())
```
Generating Plots
This generates Figure 8.
```julia
using Benchmarker
Benchmarker.plot_large()
```
GPU
Preparation
To allow data to be moved to the host system, CUDA needs pinned host memory. Before running, raise the locked-memory limit with

```sh
ulimit -l unlimited
```

Navigate to the Benchmarker directory:

```sh
cd $AUTOTM_HOME/experiments/Benchmarker
```

Start a new Julia session:

```sh
julia --project
```

In the Julia REPL, make sure all dependencies are installed:

```
julia> ]
(Benchmarker) pkg> instantiate
```
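Equivalently, dependencies can be instantiated non-interactively by passing the following snippet to `julia --project -e '...'`:

```julia
# Non-interactive equivalent of the REPL package steps above.
using Pkg
Pkg.instantiate()
```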
Profiling
Because of memory overheads, the GPU experiments are split into two parts. The first part involves generating the kernel profile information. The second part is the actual running of the experiments themselves.
To generate the kernel profile data, perform the following sequence of commands in the `Benchmarker` directory:

```julia
using Benchmarker, AutoTM
Benchmarker.gpu_profile()
```
When the system finishes profiling, exit the Julia session.
Running Benchmarks
Julia must be restarted between each benchmark: while ngraph is responsible for one large allocation holding intermediate data, input and output tensors on the Julia side are managed by CuArrays. These two allocation sources generally confuse each other across multiple runs, so the most consistent way to get results is to restart Julia.
A script has been provided in the `Benchmarker` directory. To run it, execute

```sh
julia --color=yes gpu_script.jl
```
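Internally, a driver of this kind typically launches each benchmark in a fresh Julia process, along the lines of the sketch below. This is illustrative only: the payload string is a placeholder, and the actual `gpu_script.jl` may be organized differently.

```julia
# Launch each payload in a fresh Julia process so the ngraph and
# CuArrays allocations never coexist across benchmarks.
payloads = [
    "using Benchmarker; Benchmarker.gpu_profile()",  # placeholder payload
]
for payload in payloads
    run(`julia --project -e $payload`)
end
```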
There are some default variables set for the amount of GPU DRAM and for the overhead of the ngraph/CUDA runtimes. These are set to 11 GB and 1 GB respectively for an RTX 2080 Ti. With a different GPU or CUDA version, these will need to be changed. For example, if your GPU has 6 GB of memory, these values may be set using
```julia
using Benchmarker, AutoTM
Benchmarker.GPU_MAX_MEMORY[] = 6_000_000_000
Benchmarker.GPU_MEMORY_OVERHEAD[] = 1_000_000_000
```
Memory overhead can be queried using `nvidia-smi`.
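If you prefer to derive `GPU_MAX_MEMORY` programmatically, something like the following works. This is an illustrative helper, not part of Benchmarker; it assumes a single GPU and that `nvidia-smi` is on the `PATH`:

```julia
using Benchmarker

# nvidia-smi reports memory.total in MiB; convert to bytes.
# Consider rounding down to leave extra headroom.
out = read(`nvidia-smi --id=0 --query-gpu=memory.total --format=csv,noheader,nounits`, String)
Benchmarker.GPU_MAX_MEMORY[] = parse(Int, strip(out)) * 2^20
```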
Generating Plots
Following the benchmark runs, the GPU performance plot (Figure 12) is generated using

```julia
using Benchmarker
Benchmarker.plot_gpu_performance()
```