Running SDK on a Wafer-Scale Cluster

Note

The Cerebras Wafer-Scale Cluster Appliance running CSSoft 2.0 supports SDK 0.9. For SDK 0.9 documentation, see here.

In addition to the containerized Singularity build of the Cerebras SDK (see Installation and Setup for installation information), the SDK is also supported on Cerebras Wafer-Scale Clusters running in Appliance Mode.

This page documents some modifications needed to your code to run on a Wafer-Scale Cluster. For more information about setting up and using a Wafer-Scale Cluster, see the documentation here. In particular, see here for setting up your Python virtual environment and installing the Appliance Python wheel.

Summary

The Cerebras Wafer-Scale Cluster is our solution to training massive neural networks with near-linear scaling. The Wafer-Scale Cluster consists of one or more CS-2 systems, together with special CPU nodes, memory servers, and interconnects, presented to the end user as a single system, or Appliance. The Appliance is responsible for job scheduling and allocation of the systems.

There are two types of SDK jobs that can run on the Appliance: compile jobs, which are used to compile code on a worker node, and run jobs, which either run the compiled code on a worker node using the simulator, or run the code on a real CS-2 within the Appliance.

We will walk through some changes necessary to compile and run your code on a Wafer-Scale Cluster. Modified code examples for supporting a Wafer-Scale Cluster can be requested from developer@cerebras.net.

Note that there are currently some limitations for running SDK jobs on a Wafer-Scale Cluster. Unlike ML jobs, SDK jobs can only use a single worker node and CS-2.

Compiling

As an example, we’ll walk through porting GEMV 1: A Complete Program. In the containerized SDK setup, this code is compiled with the following command:

cslc ./layout.csl --fabric-dims=8,3 --fabric-offsets=4,1 --memcpy --channels=1 -o out

To compile for the Wafer-Scale Cluster, we use a Python script that launches a compile job:

import json
from cerebras_appliance.sdk import SdkCompiler

# Instantiate compiler
compiler = SdkCompiler()

# Launch compile job
artifact_id = compiler.compile(
    ".",
    "layout.csl",
    "--fabric-dims=8,3 --fabric-offsets=4,1 --memcpy --channels=1 -o out",
)

# Write the artifact_id to a JSON file
with open("artifact_id.json", "w", encoding="utf8") as f:
    json.dump({"artifact_id": artifact_id}, f)

The SdkCompiler::compile function takes three arguments:

  • the directory containing the CSL code files,

  • the name of the top-level CSL code file that contains the layout block,

  • and the compiler arguments.

The function returns an artifact ID, which is used when running to locate the compile artifacts on the Appliance. We write this artifact ID to a JSON file which will be read by the runner object in the Python host code.

Just as before, simply pass the full dimensions of the target system to the --fabric-dims argument to compile for a real hardware run.
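For instance, a compile job targeting hardware might look like the following sketch. Here we assume the target is a full CS-2, whose fabric is 757 x 996; substitute the dimensions of your own target system.

# Same compile script as above, but targeting a real CS-2.
# --fabric-dims=757,996 assumes a full CS-2 fabric; adjust for your system.
artifact_id = compiler.compile(
    ".",
    "layout.csl",
    "--fabric-dims=757,996 --fabric-offsets=4,1 --memcpy --channels=1 -o out",
)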

Running

In the containerized SDK setup, our Python host code for running is as follows:

import argparse
import numpy as np

from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType, MemcpyOrder

# Read arguments
parser = argparse.ArgumentParser()
parser.add_argument('--name', help="the test compile output dir")
parser.add_argument('--cmaddr', help="IP:port for CS system")
args = parser.parse_args()

# Matrix dimensions
M = 4
N = 6

# Construct A, x, b
A = np.arange(M*N, dtype=np.float32).reshape(M, N)
x = np.full(shape=N, fill_value=1.0, dtype=np.float32)
b = np.full(shape=M, fill_value=2.0, dtype=np.float32)

# Calculate expected y
y_expected = A@x + b

# Construct a runner using SdkRuntime
runner = SdkRuntime(args.name, cmaddr=args.cmaddr)

# Load and run the program
runner.load()
runner.run()

# Launch the init_and_compute function on device
runner.launch('init_and_compute', nonblock=False)

# Copy y back from device
y_symbol = runner.get_id('y')
y_result = np.zeros([1*1*M], dtype=np.float32)
runner.memcpy_d2h(y_result, y_symbol, 0, 0, 1, 1, M, streaming=False,
                  order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT,
                  nonblock=False)

# Stop the program
runner.stop()

# Ensure that the result matches our expectation
np.testing.assert_allclose(y_result, y_expected, atol=0.01, rtol=0)
print("SUCCESS!")

For Appliance Mode, we need a few modifications to the host code:

import json
import os

import numpy as np

from cerebras_appliance.pb.sdk.sdk_common_pb2 import MemcpyDataType, MemcpyOrder
from cerebras_appliance.sdk import SdkRuntime

# Matrix dimensions
M = 4
N = 6

# Construct A, x, b
A = np.arange(M*N, dtype=np.float32).reshape(M, N)
x = np.full(shape=N, fill_value=1.0, dtype=np.float32)
b = np.full(shape=M, fill_value=2.0, dtype=np.float32)

# Calculate expected y
y_expected = A@x + b

# Read the artifact_id from the JSON file
with open("artifact_id.json", "r", encoding="utf8") as f:
    data = json.load(f)
    artifact_id = data["artifact_id"]

# Instantiate a runner object using a context manager
with SdkRuntime(artifact_id, simulator=True) as runner:
    # Launch the init_and_compute function on device
    runner.launch('init_and_compute', nonblock=False)

    # Copy y back from device
    y_symbol = runner.get_id('y')
    y_result = np.zeros([1*1*M], dtype=np.float32)
    runner.memcpy_d2h(y_result, y_symbol, 0, 0, 1, 1, M, streaming=False,
                      order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT,
                      nonblock=False)

# Ensure that the result matches our expectation
np.testing.assert_allclose(y_result, y_expected, atol=0.01, rtol=0)
print("SUCCESS!")

In particular, note that:

  • The imports have changed to reflect Appliance modules.

  • We no longer need to specify a compile output directory. Instead, we read our artifact ID from the JSON file generated when compiling.

  • We no longer need to specify a CM address when running on real hardware. Instead, we simply pass a flag to the SdkRuntime constructor specifying whether to run in the simulator or on hardware.

  • load() and run() are replaced by start().

  • We can use a context manager for the runner object. If we do so, start() and stop() are called implicitly on entering and exiting the with block.

  • Without a context manager, you must call start() and stop() explicitly, as shown in the sketch below.
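As an illustration, here is a minimal sketch of the same run without a context manager, assuming that passing simulator=False to the SdkRuntime constructor selects a real CS-2 within the Appliance, as described above:

import json
import numpy as np

from cerebras_appliance.pb.sdk.sdk_common_pb2 import MemcpyDataType, MemcpyOrder
from cerebras_appliance.sdk import SdkRuntime

M = 4

# Read the artifact_id produced by the compile job
with open("artifact_id.json", "r", encoding="utf8") as f:
    artifact_id = json.load(f)["artifact_id"]

# simulator=False requests a real CS-2 within the Appliance (assumption:
# inverse of the simulator=True flag shown above)
runner = SdkRuntime(artifact_id, simulator=False)

# Without a context manager, start() and stop() must be called explicitly
runner.start()

# Launch the init_and_compute function on device
runner.launch('init_and_compute', nonblock=False)

# Copy y back from device
y_symbol = runner.get_id('y')
y_result = np.zeros([1*1*M], dtype=np.float32)
runner.memcpy_d2h(y_result, y_symbol, 0, 0, 1, 1, M, streaming=False,
                  order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT,
                  nonblock=False)

# Stop the program
runner.stop()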