Libraries¶
A library is imported by enclosing its name in angle brackets.
// main.csl
const math = @import_module("<math>");
// Euclidean distance between the points (x0, y0) and (x1, y1)
fn distance(x0 : f16, y0 : f16, x1 : f16, y1 : f16) f16 {
  return math.sqrt((x0-x1)*(x0-x1) + (y0-y1)*(y0-y1));
}
<complex>¶
The complex library provides structs containing real and imag components and basic complex functions.
complex is a generic struct parameterized by its field type. The non-generic names complex_32 and complex_64 are also provided; these define a complex number using two f16 values and a complex number using two f32 values, respectively.
get_complex is a generic constructor that returns a complex struct based on the type of its inputs. The non-generic get_complex_32 and get_complex_64 constructor functions are provided as well:
// Returns struct {real: T, imag: T} where T can be f16 or f32
fn complex(comptime T: type) type
const complex_32 = complex(f16); // struct {real: f16, imag: f16}
const complex_64 = complex(f32); // struct {real: f32, imag: f32}
// Can operate on f16 or f32
fn get_complex(r: anytype, i: @type_of(r)) complex(@type_of(r))
fn get_complex_32(r : f16, i : f16) complex_32
fn get_complex_64(r : f32, i : f32) complex_64
The following functions are provided for operating on complex numbers. They are written as generic functions to facilitate use in other libraries or abstractions. In addition, non-generic complex_32 and complex_64 functions are provided. These functions have names suffixed with _32 and _64, respectively.
// x, y can be complex_32 or complex_64
fn add_complex(x: anytype, y: @type_of(x)) @type_of(x)
fn subtract_complex(x: anytype, y: @type_of(x)) @type_of(x)
fn multiply_complex(x: anytype, y: @type_of(x)) @type_of(x)
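As a rough illustration of how the constructor and the arithmetic functions above fit together, the sketch below builds a complex_32 value and combines it with the generic functions. The module alias cplx and the helper name square_plus_one are hypothetical, and the sketch assumes the library can be imported without parameters.

```csl
const cplx = @import_module("<complex>");

// Hypothetical helper: returns z*z + (1 + 0i) for a complex_32 value.
fn square_plus_one(z: cplx.complex_32) cplx.complex_32 {
  // get_complex_32 takes the real part first, then the imaginary part.
  var one = cplx.get_complex_32(1.0, 0.0);
  var z_squared = cplx.multiply_complex(z, z);
  return cplx.add_complex(z_squared, one);
}
```

The generic functions are used here because their signatures are listed explicitly; the _32-suffixed variants should be usable in the same way.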
<debug>¶
The debug library provides a tracing mechanism to record tagged values.
// Record values of the specified type
fn trace_bool(x : bool) void
fn trace_u8(x : u8) void
fn trace_i8(x : i8) void
fn trace_u16(x : u16) void
fn trace_i16(x : i16) void
fn trace_f16(x : f16) void
fn trace_u32(x : u32) void
fn trace_i32(x : i32) void
fn trace_f32(x : f32) void
// Record a compile-time string
fn trace_string(comptime str : comptime_string) void
// Generic version
fn trace(x : anytype) void
// Record timestamp using the <time> library
fn trace_timestamp() void
// These functions are for internal use, recording raw words
fn tagged_put_u8(tag : u8, x : u8) void
fn put_u16(x : u16) void
fn put_u32(x : u32) void
A minimal example of a PE program recording timestamps and values using an imported instance of the <debug> library:
// pe_program.csl
// When importing an instance of the <debug> module, two things must be
// specified:
//
// * (key: comptime_string) user-specified key that can be used to
// retrieve the contents of the trace buffer after execution
// * (buffer_size: comptime_int) size of buffer for recording traces
const trace = @import_module(
  "<debug>",
  .{ .key = "debug_example",
     .buffer_size = 100,
   }
);
var global : i16 = 0;
task main_task() void {
  // Record timestamp for beginning of task
  trace.trace_timestamp();
  // Record a compile-time string
  trace.trace_string("Hello, world");
  // Update global variable and record
  global = 5;
  trace.trace_i16(global);
  // Record timestamp for end of task
  trace.trace_timestamp();
}
<directions>¶
The directions library provides utility functions for manipulating directions.
fn rotate_clockwise(d : direction) direction
fn rotate_counterclockwise(d : direction) direction
fn flip_vertical(d : direction) direction
fn flip_horizontal(d : direction) direction
fn flip(d : direction) direction
<empty>¶
This library is empty on purpose. This allows a conditional module import as follows:
const cache = @import_module(if (stage == 0) "<empty>" else cache_name);
<layout>¶
This library provides access to information about where the PE is located. Specifically, the x and y coordinates in the rectangle can be accessed at runtime, allowing code to be shared between PEs at different locations.
const layout_module = @import_module("<layout>");
// Return the 0-indexed x and y coord
// Only supported on WSE-2
layout_module.get_x_coord() u16;
layout_module.get_y_coord() u16;
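A minimal sketch of how the runtime coordinates might be used to let one source file behave differently depending on the PE's position. The variable and task names are hypothetical, and how the task gets activated is left out.

```csl
const layout_module = @import_module("<layout>");

// Hypothetical flags recording where this PE sits in the rectangle.
var in_first_column: bool = false;
var in_first_row: bool = false;

task locate() void {
  // Coordinates are 0-indexed, so 0 denotes the first column or row.
  in_first_column = (layout_module.get_x_coord() == 0);
  in_first_row = (layout_module.get_y_coord() == 0);
}
```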
<malloc>¶
The malloc library implements an arena allocator using a statically allocated buffer.
In arena allocators, a single buffer (arena) is used to ensure that all objects are allocated sequentially in memory. Allocating and deallocating memory are fast operations, requiring an addition and/or assignment. The free operation frees all allocated objects at once.
The parameter buffer_num_words specifies the size, in words, of the statically allocated buffer.
If the parameter asserts_enabled is true, all allocations assert that the buffer has enough free memory. The default is false.
// specify buffer size
const mem = @import_module("<malloc>", .{ .buffer_num_words = <buffer_size> });
// mem provides the following API:
// returns pointer to num_values of type T
fn malloc(comptime T: type, num_values: u16) [*]T
// non-generic versions, return pointer to num_values of corresponding type
fn malloc_i16(num_values:u16) [*]i16
fn malloc_u16(num_values:u16) [*]u16
fn malloc_f16(num_values:u16) [*]f16
fn malloc_i32(num_values:u16) [*]i32
fn malloc_u32(num_values:u16) [*]u32
fn malloc_f32(num_values:u16) [*]f32
// returns true if an allocation of num_values elements of type T would
// succeed
fn has_enough_space(comptime T: type, num_values: u16) bool
// non-generic versions
// returns true if an allocation of num_words elements of type
// i16, u16, f16 would succeed:
fn has_enough_words(num_words:u16) bool
// returns true if an allocation of num_double_words elements of type
// i32, u32, f32 would succeed:
fn has_enough_double_words(num_double_words:u16) bool
// frees the entire buffer
fn free() void;
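The allocate, use, free-everything pattern described above can be sketched as follows. The buffer size of 64 words, the function name scratch_sum, and the use of asserts_enabled are illustrative choices rather than recommendations.

```csl
// 64 words of arena storage; with asserts enabled, an allocation that
// does not fit triggers an assertion.
const mem = @import_module("<malloc>", .{ .buffer_num_words = 64,
                                          .asserts_enabled = true });

// Hypothetical helper: fills a scratch buffer of n f32 values and sums it.
fn scratch_sum(n: u16) f32 {
  // Check capacity first so the caller can handle the failure case.
  if (!mem.has_enough_space(f32, n)) {
    return 0.0;
  }
  var buf = mem.malloc(f32, n);
  var total: f32 = 0.0;
  var i: u16 = 0;
  while (i < n) : (i += 1) {
    buf[i] = 1.0;
    total += buf[i];
  }
  // free() releases every allocation made from the arena, not just buf.
  mem.free();
  return total;
}
```

Because free() resets the whole arena, this pattern fits best when all allocations share the same lifetime.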
<math>¶
The math library functions are named using the convention <operationName>_<principalType>(). For example, the sin function over f32 values has the name sin_f32.
Math constants¶
The following can be used anywhere a floating point number is needed.
const PI : comptime_float
const E : comptime_float
Math functions¶
The <math> library provides standard mathematical functions. They are written as generic functions to facilitate use in other libraries or abstractions. In addition, non-generic f16 and f32 functions are provided. These functions have names suffixed with _f16 and _f32, respectively.
The following functions are provided:
// T can be f16 or f32
fn POSITIVE_INF(comptime T: type) T
fn NEGATIVE_INF(comptime T: type) T
fn NaN(comptime T: type) T
// x can be f16, f32, i8, i16, i32,
// u8, u16, or u32
fn abs(x: anytype) @type_of(x)
fn max(x: anytype, y: @type_of(x)) @type_of(x)
fn min(x: anytype, y: @type_of(x)) @type_of(x)
fn sign(x: anytype) @type_of(x)
// x can be f16 or f32
fn ceil(x: anytype) @type_of(x)
fn cos(x: anytype) @type_of(x)
fn exp(x: anytype) @type_of(x)
fn floor(x: anytype) @type_of(x)
fn fscale(f: anytype, s: i16) @type_of(f)
fn inv(x: anytype) @type_of(x)
fn invsqrt(x: anytype) @type_of(x)
fn isNaN(x: anytype) bool
fn isInf(x: anytype) bool
fn isFinite(x: anytype) bool
fn isSignaling(x: anytype) bool
fn log(x: anytype) @type_of(x)
fn pow(x: anytype, y: @type_of(x)) @type_of(x)
fn sig(x: anytype) @type_of(x)
fn signbit(x: anytype) bool
fn sin(x: anytype) @type_of(x)
fn sqrt(x: anytype) @type_of(x)
fn tanh(x: anytype) @type_of(x)
Example¶
const math = @import_module("<math>");
var x: f16;
task t() void {
  if (!math.isFinite(x)) {
    x = 0.0;
  }
  var one = math.pow(math.sin(x), 2.0) + math.pow(math.cos(x), 2.0);
  // sin^2 + cos^2 should be 1, so its logarithm should be close to 0
  if (math.abs(math.log(one)) > 0.001) {
    x = math.NaN(f16);
  }
}
The same code can be written using non-generic functions:
task t() void {
  if (!math.isFinite_f16(x)) {
    x = 0.0;
  }
  var one = math.pow_f16(math.sin_f16(x), 2.0) +
            math.pow_f16(math.cos_f16(x), 2.0);
  // sin^2 + cos^2 should be 1, so its logarithm should be close to 0
  if (math.abs_f16(math.log_f16(one)) > 0.001) {
    x = math.NaN_f16;
  }
}
Note on sin and cos accuracy¶
Both f16 and f32 versions of sin and cos will produce incorrect results when abs(x) ≥ 16384π (approximately 51472).
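One way to guard against this limit is to check the argument before calling sin or cos. The sketch below is only an illustration: the wrapper name checked_sin, the constant name SIN_COS_LIMIT, and the choice to return NaN for out-of-range inputs are all assumptions, not part of the library.

```csl
const math = @import_module("<math>");

// 16384 * pi, the magnitude at which sin and cos stop being accurate.
const SIN_COS_LIMIT: f32 = 16384.0 * math.PI;

// Hypothetical wrapper: returns NaN instead of an inaccurate result.
fn checked_sin(x: f32) f32 {
  if (math.abs(x) >= SIN_COS_LIMIT) {
    return math.NaN(f32);
  }
  return math.sin(x);
}
```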
<random>¶
fn set_global_prng_seed(seed: u32) void
fn random_f16(lower: f16, upper: f16) f16
fn random_f32(lower: f32, upper: f32) f32
fn random_pow_u32(pow : u16) u32
fn random_normal_f32() f32
<tile_config>¶
The tile_config library contains APIs relating to the hardware configuration of a PE. It contains the following top-level constants:
// The base addresses of memory-mapped registers
const addresses: enum(reg_type)
// The type of a word-sized memory-mapped register
const reg_type: type
const reg_ptr: type = *reg_type
// The type of a memory-mapped register occupying two words
const double_reg_type: type
// The name of the target architecture, such as "wse2"
const target_name: comptime_string
// The size of a word in bytes
const word_size: comptime_int
The tile_config library also contains an API to access the PE’s coordinates in the rectangle at runtime. It is not supported on WSE-1; attempting to use this API when compiling for WSE-1 produces a compile error.
const fabric_coord: enum(reg_type) {
X,
Y
};
fn get_fabric_coord(dimension: fabric_coord) u16
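A small sketch of reading both coordinates at runtime with this API. The variable and task names are hypothetical, and the enum values are assumed to be reachable as config.fabric_coord.X and config.fabric_coord.Y.

```csl
const config = @import_module("<tile_config>");

// Hypothetical storage for this PE's position in the rectangle.
var pe_x: u16 = 0;
var pe_y: u16 = 0;

task read_coords() void {
  pe_x = config.get_fabric_coord(config.fabric_coord.X);
  pe_y = config.get_fabric_coord(config.fabric_coord.Y);
}
```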
filters¶
This submodule of tile_config contains APIs for configuring filters:
// The number of filters provided by the architecture.
const num_filters: comptime_int
// Set the active limit of a counter filter identified by `filter_id`
// to `limit`.
fn set_active_limit(filter_id: u16, limit: reg_type) void
// Set the maximum counter value of a counter filter identified by
// `filter_id` to `max_counter`.
fn set_max_counter(filter_id: u16, max_counter: reg_type) void
// Set the counter value of a counter filter identified by `filter_id`
// to `counter`.
fn set_counter(filter_id: u16, counter: reg_type) void
These functions can be used like:
const config = @import_module("<tile_config>");
// Set the counter of filter ID 1 to 0
config.filters.set_counter(1, 0);
teardown¶
This submodule of tile_config contains teardown APIs:
// Returns the task ID that is reserved for the teardown handler.
fn get_task_id() local_task_id
// Return the values of the "teardown-pending" registers combined into
// one value. Only the first invocation of this function per-task is
// guaranteed to return the correct value. Any additional calls per-task
// will have undefined results.
fn get_pending() double_reg_type
// Given a value that represents the "teardown-pending" state, which has 1
// bit per routable color indicating the ones that are currently in
// teardown state, return `true` iff the input color `c` is in teardown
// state.
fn is_pending(value: double_reg_type, c: color) bool
// Exit the teardown state for a given color `c`.
fn exit(c: color) void
These functions can be used like:
const config = @import_module("<tile_config>");
// Check if teardown is pending on color 8 or 9
var pendings = config.teardown.get_pending();
var pending_8_or_9: bool = config.teardown.is_pending(pendings, @get_color(8)) or
                           config.teardown.is_pending(pendings, @get_color(9));
task_priority¶
This submodule of tile_config contains APIs for configuring task priority:
// Enum for task priorities: either HIGH or LOW.
const level = enum(u16) {
LOW = 0,
HIGH = 1
};
// Updates the task priority associated with `task_id` to `priority`.
fn update_task_priority(task_id: anytype, priority: level) void
// Sets the task priority associated with `task_id` to high.
fn set_task_priority(task_id: anytype) void
// Sets the task priority associated with `task_id` to low.
fn clear_task_priority(task_id: anytype) void
The provided task_id can be a data_task_id or local_task_id to set the priority of the associated task. In addition, the priority of tasks activated by wavelets, including tasks bound to a control_task_id, can be configured using the color that carries the wavelets.
Note that updates to task priority made at runtime may take a few clock cycles to take effect. These functions may be used at comptime or at runtime.
These functions can be used like:
const config = @import_module("<tile_config>");
const task_priority = config.task_priority;
const task_priority_level = task_priority.level;
param high_id: data_task_id;
param low_id: local_task_id;
comptime {
  // Equivalent to:
  // task_priority.update_task_priority(
  //   high_id,
  //   task_priority_level.HIGH);
  task_priority.set_task_priority(high_id);
}
task main() void {
  // Equivalent to:
  // task_priority.update_task_priority(
  //   low_id,
  //   task_priority_level.LOW);
  task_priority.clear_task_priority(low_id);
}
main_thread_priority¶
This submodule of tile_config contains APIs for configuring main thread priority. The main thread is the thread that executes non-async operations. Operations tagged with async execute on a microthread, which is associated with a fabric input or output queue. Main thread priority and microthread priority determine the relative scheduling priority of the threads.
// Enum for main thread priorities. The meanings of main thread priority
// levels are relative to microthread priorities, as follows:
//
// MEDIUM_LOW: Between low- and medium-priority microthreads.
// MEDIUM: Same priority as medium-priority microthreads.
// MEDIUM_HIGH: Between medium- and high-priority microthreads.
// HIGH: Same priority as high-priority microthreads.
const level = enum(u16) {
MEDIUM_LOW = ...,
MEDIUM = ...,
MEDIUM_HIGH = ...,
HIGH = ...
};
// Updates the priority for the main thread to `priority`. Note that updates
// to main thread priority made at runtime may take a few clock cycles to
// take effect. This function may be used at comptime or at runtime.
fn update_main_thread_priority(priority: level) void;
This function can be used like:
const config = @import_module("<tile_config>");
const mt_priority = config.main_thread_priority;
comptime {
  mt_priority.update_main_thread_priority(mt_priority.level.MEDIUM);
}
task main() void {
  mt_priority.update_main_thread_priority(mt_priority.level.MEDIUM_HIGH);
}
control_transform¶
This submodule of tile_config contains a function for setting the mask for transforming the index part of control wavelets. This function is to be used together with the DSD property control_transform to XOR the first six bits of the index portion of a wavelet with the specified mask.
fn set_mask(mask: reg_type) void
This function can be used like:
const tile_config = @import_module("<tile_config>");
const ctrl_xform = tile_config.control_transform;
var in_dsd = @get_dsd(fabin_dsd, .{ .fabric_color = recv_channel,
                                    .extent = 100,
                                    .input_queue = @get_input_queue(0),
                                    .control_transform = true });
const out_dsd = @get_dsd(fabout_dsd, .{ .extent = 100,
                                        .fabric_color = send_channel,
                                        .output_queue = @get_output_queue(1),
                                        .control_transform = true });
var buf = @zeros([5]u32);
const fifo = @allocate_fifo(buf);
task buffer() void {
  @mov32(fifo, in_dsd, .{ .async = true });
  @mov32(out_dsd, fifo, .{ .async = true });
}
comptime {
  ctrl_xform.set_mask(2);
}
The set_mask function can be used either at comptime or runtime. Only the first six bits of the mask are taken into account.
exceptions¶
This submodule of tile_config contains functions for setting values in the exception mask register. The exception mask register determines which exceptions cause the processor to stop. An unmasked exception causes the processor to immediately stop execution. A masked exception allows execution to continue. By default, all exceptions are masked. The functions in this submodule can be used to unmask them.
// Exceptions which can be unmasked with the below functions
const PERF_CNT_0_OVERFLOW = ...;
const PERF_CNT_1_OVERFLOW = ...;
const SW_EXCEPTION = ...;
const FP_UNDERFLOW = ...;
const FP_OVERFLOW = ...;
const FP_INEXACT = ...;
const FP_INVALID = ...;
const FP_DIV_BY_0 = ...;
// Unmask one of the exceptions above at comptime
fn set_exception_mask_comptime(comptime exception_mask: reg_type) void;
// Unmask one of the exceptions above at runtime
fn set_exception_mask(exception_mask: reg_type) void;
This submodule can be used as follows:
const tile_config = @import_module("<tile_config>");
const exceptions = tile_config.exceptions;
fn fp_div_by_0() f32 {
  // Set exception mask for FP_DIV_BY_0.
  // When floating point divide by zero occurs,
  // processor will stop execution.
  exceptions.set_exception_mask(exceptions.FP_DIV_BY_0);
  var x : f32 = 42.0;
  var y : f32 = 0.0;
  // This operation is a divide by zero, so the processor should stop here
  return x / y;
}
Each call to set_exception_mask overwrites the exception mask register. Multiple exceptions can be unmasked simultaneously as follows:
// FP_DIV_BY_0 is unmasked.
exceptions.set_exception_mask(exceptions.FP_DIV_BY_0);
// FP_DIV_BY_0 is masked again.
// FP_OVERFLOW and FP_UNDERFLOW are now unmasked.
exceptions.set_exception_mask(exceptions.FP_OVERFLOW
& exceptions.FP_UNDERFLOW);
<time>¶
The time library returns the current 48-bit timestamp counter as three 16-bit unsigned integers in little endian form.
fn enable_tsc() void
fn disable_tsc() void
fn get_timestamp(result : *[3]u16) void;
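A minimal sketch of bracketing a region of code with two timestamps. The buffer and task names are hypothetical, and the sketch assumes the counter has to be enabled with enable_tsc before it is read.

```csl
const time = @import_module("<time>");

// Each timestamp is three 16-bit words in little-endian order:
// word 0 holds bits 0-15, word 1 bits 16-31, word 2 bits 32-47.
var t_start = @zeros([3]u16);
var t_end = @zeros([3]u16);

task timed_work() void {
  time.enable_tsc();
  time.get_timestamp(&t_start);
  // ... work to be measured goes here ...
  time.get_timestamp(&t_end);
  time.disable_tsc();
}
```

The elapsed count is the difference between the two 48-bit values reconstructed from t_end and t_start.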
<kernels>¶
This library differs from all other libraries in that it provides kernels, as opposed to individual functions. The “tally” kernel implements a two-phase tally, used to coordinate the work done by multiple PEs. It is documented in the kernel code itself.
<tally>¶
The tally library implements a two-phase tally kernel that allows PEs within a rectangle to communicate progress/completion to the host.
The library consists of two modules:
<kernels/tally/layout>: imported once and used in the layout block to parameterize each PE’s tally behavior.
<kernels/tally/pe>: imported once by each PE, consuming the parameters generated by the layout module.
A minimal example of importing and using both modules, starting with the layout module:
// code.csl
const tally = @import_module("<kernels/tally/layout>", .{
  .kernel_height=8,
  .kernel_width=4,
  .phase2_tally=0,
  .colors=[3]color{@get_color(1), @get_color(2), @get_color(3)},
  .output_color=@get_color(0),
});
layout {
  @set_rectangle(4, 8);
  for (@range(u16, 4)) |i| {
    for (@range(u16, 8)) |j| {
      @set_tile_code(i, j, "pe.csl", .{
        .tally_params = tally.get_params(i, j),
      });
    }
  }
}
And the per-PE module:
// pe.csl
param tally_params:comptime_struct;
const tally = @import_module("<kernels/tally/pe>",
  @concat_structs(tally_params, .{
    .output_queues=[2]u16{0, 1},
  }));
task done() void {
  tally.signal_completion();
}
...
The tally kernel operates in two phases.
In the first phase, every PE must signal completion at least once. For kernels where each PE knows when it is finished, this is the only phase needed.
The first phase ends when every PE has signaled completion at least once.
During the second phase, PEs can bump (increase) the global tally. When the global tally meets or exceeds the phase2_tally parameter, the kernel signals completion by sending the total to the North on output_color from the PE at (kernel_width - 1, 0).
The second phase is optional. If phase2_tally == 0, the second phase will be skipped and the output signal on output_color will be 0.
<collectives_2d>¶
This library implements collective communication directives that allow PEs to communicate data with one another.
The library consists of two modules:
<collectives_2d/params>: Imported once to parameterize each PE in the layout block.
<collectives_2d/pe>: Imported once per dimension per PE. Contains collective communication directives for a single axis.
<collectives_2d/params>¶
The parameter module exposes a compile-time helper function for configuring PEs to use <collectives_2d>:
fn get_params(Px: u16, Py: u16, ids: comptime_struct) comptime_struct
Px is the PE’s x-coordinate.
Py is the PE’s y-coordinate.
ids is a struct that is expected to have either the x-related fields, the y-related fields, or all four, of the following:
x_colors: a struct containing 2 distinct colors as anonymous fields
x_entrypoints: a struct containing 2 distinct local task IDs as anonymous fields
y_colors: a struct containing 2 distinct colors as anonymous fields
y_entrypoints: a struct containing 2 distinct local task IDs as anonymous fields
Returns a struct containing the parameters necessary to import library modules for the specified PE. This struct contains:
x: an opaque struct containing parameters needed to configure collective communications in the x-dimension.
y: an opaque struct containing parameters needed to configure collective communications in the y-dimension.
<collectives_2d/pe>¶
The following directives are currently supported:
fn init() void
fn broadcast(root: u16, buf: [*]u32, count: u16, callback: local_task_id) void
fn scatter(root: u16, send_buf: [*]u32, recv_buf: [*]u32, count: u16,
callback: local_task_id) void
fn gather(root: u16, send_buf: [*]u32, recv_buf: [*]u32, count: u16,
callback: local_task_id) void
fn reduce_fadds(root: u16, send_buf: [*]f32, recv_buf: [*]f32, count: u16,
callback: local_task_id) void
init initializes the library. It must be invoked for each axis.
broadcast transmits the contents of buf from the root PE to the buf of other PEs in the row or column. count should be the length of buf. It is akin to MPI_Bcast.
scatter transmits count-many elements from send_buf from the root PE to the recv_buf of other PEs in the row/column. It is akin to MPI_Scatter.
gather accumulates count-many elements from send_buf of other PEs into the recv_buf of the root PE. It is akin to MPI_Gather.
When distributing elements with scatter or aggregating them with gather across N PEs, the send_buf (for scatter) or the recv_buf (for gather) should have space for count * N elements.
reduce_fadds computes a sum reduction over buffers of f32, akin to MPI_Reduce with MPI_SUM.
In general, all PEs must call the same directive with the same root and count. The primitives have the following common parameters:
root is the root PE for network configuration,
send_buf is a buffer containing data to be transmitted,
recv_buf is a buffer for holding data received,
count is the number of elements to be transmitted,
callback is activated when the primitive completes.
The user can configure the resources of collectives_2d. Each imported module must be assigned queue IDs (queues) and DSR IDs (dest_dsr_ids, src0_dsr_ids, src1_dsr_ids). If the user does not specify these parameters explicitly, the default values apply. The following example shows the default values of queue IDs and DSR IDs of collectives_2d.
A minimal example that sets up PEs to broadcast 10 elements from the root PE to every other PE in the row/column consists of the following layout code:
// code.csl
param width: u16;
param height: u16;
param root: u16;
const c2d = @import_module("<collectives_2d/params>");
layout {
  @set_rectangle(width, height);
  var x: u16 = 0;
  while (x < width) : (x += 1) {
    var y: u16 = 0;
    while (y < height) : (y += 1) {
      const c2d_params = c2d.get_params(x, y, .{
        .x_colors = .{
          @get_color(0),
          @get_color(1)
        },
        .x_entrypoints = .{
          @get_local_task_id(2),
          @get_local_task_id(3)
        },
        .y_colors = .{
          @get_color(4),
          @get_color(5)
        },
        .y_entrypoints = .{
          @get_local_task_id(6),
          @get_local_task_id(7)
        },
      });
      @set_tile_code(
        x,
        y,
        "pe.csl",
        .{ .root = root, .c2d_params = c2d_params }
      );
    }
  }
}
And the per-PE module:
// pe.csl
param c2d_params: comptime_struct;
// Root PE index, passed in by the layout code
param root: u16;
const rect_height = @get_rectangle().height;
const rect_width = @get_rectangle().width;
// Pick two task IDs not used in the library for callbacks
const x_task_id = @get_local_task_id(15);
const y_task_id = @get_local_task_id(16);
const len = 10;
var x_data = @zeros([len]u32);
var y_data = @zeros([len]u32);
const mpi_x = @import_module(
  "<collectives_2d/pe>",
  .{ .dim_params = c2d_params.x,
     .queues = [2]u16{2,4},
     .dest_dsr_ids = [1]u16{1},
     .src0_dsr_ids = [1]u16{1},
     .src1_dsr_ids = [1]u16{1}
   }
);
const mpi_y = @import_module(
  "<collectives_2d/pe>",
  .{ .dim_params = c2d_params.y,
     .queues = [2]u16{3,5},
     .dest_dsr_ids = [1]u16{2},
     .src0_dsr_ids = [1]u16{2},
     .src1_dsr_ids = [1]u16{2}
   }
);
// Receive buffers for the non-root PEs
var x_recv = @zeros([len]u32);
var y_recv = @zeros([len]u32);
task x_task() void {
  var send_buf = @ptrcast([*]u32, &x_data);
  var recv_buf = @ptrcast([*]u32, &x_recv);
  // The root sends its data; every other PE receives into its own buffer
  if (root == mpi_x.pe_id) {
    mpi_x.broadcast(root, send_buf, len, x_task_id);
  } else {
    mpi_x.broadcast(root, recv_buf, len, x_task_id);
  }
}
task y_task() void {
  var send_buf = @ptrcast([*]u32, &y_data);
  var recv_buf = @ptrcast([*]u32, &y_recv);
  if (root == mpi_y.pe_id) {
    mpi_y.broadcast(root, send_buf, len, y_task_id);
  } else {
    mpi_y.broadcast(root, recv_buf, len, y_task_id);
  }
}
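The buffer sizing rule for scatter described earlier (count elements per PE, so count * N elements on the sending side) can be sketched in the same style. This is an illustrative fragment, not a complete program: the file name, the num_pes parameter, the buffer names, and the done task are hypothetical, the module is imported with default queue and DSR IDs, and how scatter_task gets activated is left out.

```csl
// pe_scatter.csl (hypothetical file name)
param c2d_params: comptime_struct;
param root: u16;
// Hypothetical parameter: number of PEs (N) along the x axis
param num_pes: u16;

// Number of elements delivered to each PE
const count = 4;

const mpi_x = @import_module("<collectives_2d/pe>",
                             .{ .dim_params = c2d_params.x });

// The sending side needs count * N elements; each PE receives count.
var send_data = @zeros([count * num_pes]u32);
var recv_data = @zeros([count]u32);

// Task ID chosen to avoid the entrypoints reserved for the library
const done_id = @get_local_task_id(15);

task done() void {
  // recv_data now holds this PE's share of the root's send_data
}

task scatter_task() void {
  // init must be invoked once for each axis before any directive
  mpi_x.init();
  mpi_x.scatter(root,
                @ptrcast([*]u32, &send_data),
                @ptrcast([*]u32, &recv_data),
                count, done_id);
}

comptime {
  @bind_local_task(done, done_id);
}
```

For simplicity every PE allocates the full send buffer here, although only the root's copy is actually transmitted.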