Experiment Customization

There are several ways to customize experiments.

CPU - Changing the Number of Threads

The CPU portion of the code defaults to using 24 threads on socket 1 of a dual-socket system (sockets numbered 0 and 1). This can be changed by calling

AutoTM.setup_affinities(; omp_num_threads = nthreads)

By default, thread affinities are assigned one per physical core, essentially disabling hyperthreading. If you really want hyperthreading, the keyword argument threads_per_core = 2 may be passed to setup_affinities.
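
For example, to enable two threads per core on 24 physical cores (a hypothetical configuration; match the thread count to your hardware):

AutoTM.setup_affinities(; omp_num_threads = 48, threads_per_core = 2)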

Note

Calls to setup_affinities only take effect if made before the first run of the ngraph compiler; the LLVM compiler backing ngraph does not cope with changing the number of OMP threads after that point.

Also, avoid extreme values like omp_num_threads = 1024; the behavior in that regime is untested.

Kernel profiles are parameterized by the number of threads, so profiles gathered with different thread counts will not clobber one another.

CPU - Changing DRAM Limits

Supporting different PMM-to-DRAM ratios is straightforward. When calling the run_conventional entry function, custom ratios may be passed. These ratios are expressed using Julia's native Rational{Int} type. For example, for a 16-to-1 PMM-to-DRAM ratio, simply pass 16 // 1. The resulting call might look like

Benchmarker.run_conventional(
    Benchmarker.test_vgg(),
    [AutoTM.Optimizer.Synchronous],
    16 // 1
)

Alternatively, a hard DRAM limit can be set by passing an Int number of bytes as the third argument.
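
For example, to cap DRAM at 2 GiB, reusing the network and optimizer from the ratio example above:

Benchmarker.run_conventional(
    Benchmarker.test_vgg(),
    [AutoTM.Optimizer.Synchronous],
    2^31    # hard DRAM limit: 2 GiB, expressed in bytes
)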

GPU - Changing DRAM Limits

The GPU DRAM limits can be adjusted by setting the GPU_MAX_MEMORY and GPU_MEMORY_OVERHEAD variables as described in GPU.
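
As a rough sketch, assuming these are exposed as mutable Ref globals in the AutoTM module (an assumption; the GPU section is the authoritative reference for the exact mechanism, and the values below are purely illustrative):

# Hypothetical: cap usable GPU memory at 11 GiB, reserving 1 GiB of overhead.
AutoTM.GPU_MAX_MEMORY[] = 11 * 2^30
AutoTM.GPU_MEMORY_OVERHEAD[] = 2^30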

CPU/GPU - New Networks

The code to create the benchmarked networks lives in $AUTOTM_HOME/AutoTM/src/zoo. Networks are modeled using Julia's Flux machine learning library and are converted into ngraph computation graphs (with the help of nGraph.jl and the mighty Cassette).

Custom networks can be defined externally and passed as an AutoTM.Actualizer to the functions Benchmarker.run_conventional or Benchmarker.run_gpu. A detailed example is given below.

Suppose we want to model a simple MLP.

using Benchmarker, Flux, AutoTM, nGraph

# Define a function that returns a simple MLP wrapped up in an `Actualizer`.
function mlp(batchsize)
    # Define the network
    network = Flux.Chain(
        Dense(4096, 4096, Flux.relu),
        Dense(4096, 4096, Flux.relu),
        Dense(4096, 4096, Flux.relu),
        Dense(4096, 10, Flux.relu),
        softmax,
    ) 

    # Create input array
    X = randn(Float32, 4096, batchsize)

    # Create dummy one-hot labels, selecting one random class per column
    Y = zeros(Float32, 10, batchsize)
    for i in 1:batchsize
        Y[rand(1:10), i] = one(eltype(Y))
    end

    # Define the loss function.
    loss(x, y) = Flux.crossentropy(network(x), y)
    return AutoTM.Actualizer(loss, X, Y; optimizer = nGraph.SGD(Float32(0.005)))
end

# This function can now be passed to Benchmarker.run_gpu.
# For example, to run with a batchsize of 16:
Benchmarker.run_gpu(() -> mlp(16))

The results from the above will end up in $AUTOTM_HOME/experiments/Benchmarker/data/gpu with the name unknown_network. Results can be inspected by deserializing the data:

using Serialization
data = deserialize("data/gpu/unknown_network_asynchronous_gpu_profile.jls");
display(first(data.runs))

# Roughly Expected Output
#   :bytes_async_moved_dram    => 2003336
#   :bytes_input_tensors       => 545962020
#   :predicted_runtime         => 0.41705
#   :pmem_alloc_size           => 0x000000000012b000
#   :num_async_move_nodes      => 32
#   :num_dram_async_move_nodes => 17
#   :move_time                 => 0.0
#   :dram_alloc_size           => 269910080
#   :num_input_tensors         => 107
#   :num_dram_move_nodes       => 17
#   :actual_runtime            => 0.00424737
#   :bytes_output_tensors      => 744613584
#   :bytes_async_moved_pmem    => 2002056
#   :num_dram_input_tensors    => 107
#   :tensor_size_map           => Dict(...)
#   :num_dram_output_tensors   => 72
#   :bytes_moved_pmem          => 2002056
#   :num_pmem_async_move_nodes => 15
#   :num_kernels               => 72
#   :bytes_async_moved         => 4005392
#   :bytes_moved               => 0
#   :dram_limit                => 8597
#   :bytes_dram_input_tensors  => 545962020
#   :bytes_dram_output_tensors => 744613584
#   :bytes_moved_dram          => 2003336
#   :num_move_nodes            => 0
#   :num_output_tensors        => 72
#   :num_pmem_move_nodes       => 15
#   :oracle_time               => 4140.0
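
Each entry in data.runs behaves like a dictionary keyed by the symbols shown above. As a small sketch (assuming that key/value structure), the measured runtimes can be collected across runs:

# Gather the measured runtime of every run and report the fastest.
runtimes = [run[:actual_runtime] for run in data.runs]
println("fastest run: ", minimum(runtimes), " seconds")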