.. _sdk-rel-notes-cumulative.rst:

SDK Release Notes
=================

The following are the release notes for the Cerebras SDK.

.. _v1-0-0:

Version 1.0.0
-------------

Released 13 November 2023

.. note::

   The Cerebras Wafer-Scale Cluster Appliance running CSSoft 2.0 supports SDK 0.9. For SDK 0.9 documentation, `see here `_.

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- CSL language and compiler enhancements:

  - Introduces the ``data_task_id``, ``local_task_id``, and ``control_task_id`` types, to explicitly differentiate the three types of tasks. Values of these types are created via the new ``@get_data_task_id``, ``@get_local_task_id``, and ``@get_control_task_id`` builtins, respectively. ``@get_data_task_id`` generates a task ID from a routable ``color``, while ``@get_local_task_id`` and ``@get_control_task_id`` generate task IDs from an integer within the range of allowed IDs. See :ref:`language-task-ids` for more information on the new task type system.
  - Introduces the ``@bind_data_task``, ``@bind_local_task``, and ``@bind_control_task`` builtins for binding tasks to the corresponding task ID type. Data tasks must take either one or two arguments (corresponding to the contents of a wavelet's payload), and local tasks must take no arguments.
  - Colors which are used by a ``fabin_dsd`` to receive data and are not explicitly bound to a task no longer need to be blocked at compile time. The initial state of a ``data_task_id`` not explicitly bound to a task is now blocked.
  - Introduces the ``@get_int`` builtin to return the numerical value of values of type ``data_task_id``, ``control_task_id``, ``local_task_id``, ``color``, ``input_queue``, and ``output_queue``, as well as values of any ``enum`` or integer type. ``@get_color_id`` is now deprecated.
  - ``@activate`` builtin and ``.activate`` field of builtins on DSDs now take values of type ``local_task_id`` as an argument.
    Using ``@activate`` or the ``.activate`` field on a value of type ``color`` is now deprecated.
  - ``.activate_pop`` and ``.activate_push`` fields of FIFOs now take values of type ``local_task_id`` as an argument. Using these fields on a value of type ``color`` is now deprecated.
  - ``@block`` and ``@unblock`` builtins and ``.unblock`` field of builtins on DSDs now take values of type ``local_task_id`` or ``data_task_id`` as arguments.
  - The ``@rpc`` builtin now takes values of type ``data_task_id``. It no longer accepts values of type ``color``.
  - Introduces the ``cslc`` compiler flag ``--warnings-as-errors``, to treat compiler warnings as errors.
  - The ``cslc`` compiler script, which launches a container to run the compiler, now reads the ``CSL_IMPORT_PATH`` environment variable to search additional paths for ``@import_module``.

- CSL ``memcpy`` library enhancements:

  - The ``memcpy`` library has been rewritten to use the new task ID types.

- Other CSL library enhancements:

  - ``collectives_2d`` library has been rewritten to use the new task ID types.

- ``SdkRuntime`` host runtime enhancements:

  - Introduces new functionality in the ``sdk_utils`` module to simplify data type transformations for ``memcpy_h2d()`` and ``memcpy_d2h()`` calls.
  - Introduces new functionality in the ``sdk_utils`` module to process elapsed timestamp data.
  - Introduces ``suppress_simfab_trace`` option in the ``SdkRuntime`` constructor to suppress generation of ``simfab_traces`` files when running.

- Example programs:

  - Example programs have been reorganized, renumbered, and updated.
  - Introduces three new example programs in the GEMV series, demonstrating more complex communication patterns.
  - Introduces a series of pipelining example programs to demonstrate the use of ``memcpy`` ``streaming`` mode to create a computation pipeline on the WSE.

- Documentation improvements:

  - Introduces new documentation on debugging CSL programs. See :ref:`debugging-guide`.
  - Expands installation documentation to include Apptainer for running the SDK container. See :ref:`install-guide`.

Deprecations
~~~~~~~~~~~~

- Support for ``CSELFRunner`` has now been fully removed. All programs should use the ``SdkRuntime`` host runtime.
- The ``call()`` function in the ``SdkRuntime`` Python host API has been deprecated. Use ``launch()`` instead, which includes argument type checking.
- ``cslc`` no longer accepts ``--channels=0`` when compiling, as this setting corresponded to ``CSELFRunner`` ``memcpy`` support.
- The ``@get_color_id`` and ``@bind_task`` builtins have been deprecated.
- Using values of type ``color`` with the ``@activate`` builtin or the ``.activate``, ``.activate_pop``, and ``.activate_push`` fields has been deprecated.
- The ``@rpc`` builtin no longer accepts values of type ``color``. Values of type ``data_task_id`` must be used instead.

Known issues
~~~~~~~~~~~~

- The bandwidth of memory transfers saturates at around 8 IO channels.
- When a DSD operation uses an explicit ``fabin`` DSR, the compiler does not bind the color to the associated input queue at runtime. Instead, the user has to bind the color to the input queue explicitly via ``@initialize_queue``. See ``pe.csl`` in :ref:`sdkruntime-stencil-3d-7pts` for an example.
- The 1D FFT example program may fail to compile if ``Nz >= 256``, triggering an internal compiler exception.

Notes for future releases
~~~~~~~~~~~~~~~~~~~~~~~~~

- Using the ``@bind_task`` builtin to bind a task to a ``color`` is now deprecated. This builtin will be removed in a future release. Use ``@bind_data_task`` for wavelet-triggered data tasks, ``@bind_local_task`` for self-activated tasks, and ``@bind_control_task`` for control wavelet-triggered tasks.
- Using the ``@get_color_id`` builtin to get the numerical value of a color is now deprecated. This builtin will be removed in a future release. Use ``@get_int`` instead.
- Using the ``@activate`` builtin on a ``color`` is now deprecated.
  The ability to do this will be removed in a future release.

.. _v0-9-0:

Version 0.9.0
-------------

Released 2 October 2023

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- CSL language and compiler enhancements:

  - ``@get_tensor_ptr`` is now legal in code that contains no exported symbols, and will compile. If ``@get_tensor_ptr`` is executed at runtime when no symbols have been exported, then an ``assert(false)`` will be hit.
  - Introduces ``@has_exported_tensors`` builtin, which evaluates to ``true`` at comptime if the program contains any exported tensors.
  - Introduces ``extern`` keyword. The ``extern`` storage class declares that a symbol for a variable or function is expected to be defined in an ``export`` declaration elsewhere. See :ref:`language-syntax-storage-classes`.
  - Introduces ``export`` keyword. The ``export`` storage class defines a variable or function with a certain name and type, and makes that variable or function available to other object files that are linked with the object being compiled. See :ref:`language-syntax-storage-classes`.
  - Introduces ``linkname`` keyword, which can be used to specify the name of the ELF symbol corresponding to the variable. See :ref:`language-syntax`.
  - Introduces support for function pointers. See :ref:`language-syntax`.
  - Introduces new FIFO DSR types ``dsr_fifo_dest`` and ``dsr_fifo_src``, which allow FIFOs to be used with explicit DSRs. See :ref:`language-dsrs`.
  - The ``bool`` type is no longer allowed with the ``@zeros`` builtin. ``@constants`` should be used instead to initialize an array with ``false``.
  - Bitwise not operator ``~`` is no longer allowed on the ``bool`` type.
  - Logical not operator ``!`` is no longer allowed on integer types.
  - Compiler diagnostics for circular dependencies have been improved.

- CSL ``memcpy`` library enhancements:

  - The ``memcpy`` framework reserves two DSRs, ``dsr_dest 0`` and ``dsr_src1 0``, to enable improved performance and reduce resource usage.
    The user should avoid using these explicit DSRs.
  - The ``.data_type`` field is no longer needed when importing ``memcpy`` to support copy mode.

- Other CSL library enhancements:

  - The ``collectives_2d`` library has been rewritten to use explicit DSRs, enabling improved performance and reducing resource usage. By default, the library uses ``dsr_dest``, ``dsr_src0``, and ``dsr_src1`` IDs 1 and 2, for the X and Y dimensions, respectively, but can be configured to use other IDs when imported.
  - The input and output queue IDs of ``collectives_2d`` are also now configurable when imported. By default, the X dimension uses queues ``2`` and ``4``, and the Y dimension uses queues ``3`` and ``5``.
  - The ``tile_config`` library contains a new ``exceptions`` submodule, which can be used to unmask exceptions. See :ref:`language-libraries-tile-config`.

- ``SdkRuntime`` host runtime additions:

  - Introduces an ``sdk_utils`` library which includes utility functions to prepare data sent with ``memcpy_h2d`` and process data received from ``memcpy_d2h``. See :ref:`sdkruntime-api-reference`.

- Example programs additions:

  - Adds ``SdkRuntime`` versions of ``gemv-checkerboard-pattern`` and ``gemv-collectives``, which implement two different approaches for computing GEMV. See :ref:`sdkruntime-gemv-checkerboard` and :ref:`sdkruntime-gemv-collectives`.
  - Adds ``SdkRuntime`` version of ``cholesky``, which computes the Cholesky decomposition of a symmetric positive-definite matrix. See :ref:`sdkruntime-cholesky`.
  - Adds additional ``SdkRuntime`` tutorial example programs, including demos of sparse tensor operations, switches, filters, FIFOs, and the ``@map`` builtin.
  - See the ``csl-examples`` `GitHub repository `_ for more example programs, including a 1D and 2D FFT, ``histogram-torus``, ``mandelbrot``, and ``wide-multiplication``.

- Documentation improvements:

  - Introduces additional documentation on the ``SdkRuntime`` Python host API, including the new ``sdk_utils`` library.
    See :ref:`sdkruntime-api-reference`.

Resolved issues
~~~~~~~~~~~~~~~

- Fixes crash when compiling pointer to array of non-scalars.
- Fixes crash when compiling pointer coercion from multidimensional array to 1D pointer of unknown size.
- Fixes LLVM backend bug which previously produced incorrect addresses in certain circumstances, resulting in "Invalid address" errors in the simulator. This in particular could cause issues with the ``collectives_2d`` library.
- Fixes behavior of CSL ``math`` library's ``isSignaling(x)`` for checking if ``x`` is a signaling NaN.
- Fixes a bug where programs using ``collectives_2d`` stall if the width or height of the core rectangle is greater than 160 PEs.
- The simulator can now support programs with height greater than 256 PEs.
- ``csdb`` has been fixed to correctly read core dumps from SDK programs.

Known issues
~~~~~~~~~~~~

- The Singularity image may fail to work on Debian-based Linux distributions. The image works best with a Fedora-based distribution such as Red Hat or Rocky.
- The bandwidth of memory transfers saturates at around 8 IO channels.
- When a DSD operation uses an explicit ``fabin`` DSR, the compiler does not bind the color to the associated input queue at runtime. Instead, the user has to bind the color to the input queue explicitly via ``@initialize_queue``. See ``pe.csl`` in :ref:`sdkruntime-stencil-3d-7pts` for an example.

Notes for future releases
~~~~~~~~~~~~~~~~~~~~~~~~~

- The ``CSELFRunner`` host runtime has been deprecated. It will be completely removed in a future release.

.. _v0-8-0:

Version 0.8.0
-------------

Released 21 June 2023

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Introduces support for Cerebras Wafer-Scale Clusters running in appliance mode. This support is limited to Python host code using the ``SdkRuntime`` host runtime, and only one SDK compile or execute job can be launched at a time, using no more than one Cerebras system. See :ref:`appliance-mode`.
- CSL language and compiler enhancements:

  - Introduces ``@get_output_queue`` builtin for creating output queue types. Using integers for output queue IDs is now deprecated and produces a warning.
  - Introduces additional improvements and enhancements to internal builtins for supporting remote procedure calls (RPCs).
  - Introduces improved error handling for type casts using the ``@as`` builtin.
  - ``@load_to_dsr`` now allows runtime-determined colors in the ``@activate`` and ``@unblock`` fields.
  - The grammar of ``@initialize_queue`` has been updated. Previously, initializing a queue with ID ``queue_id`` on color ``color_id`` took the form ``@initialize_queue(queue_id, color_id);``. The new syntax is ``@initialize_queue(queue_id, .{.color = color_id});``.

- CSL ``memcpy`` library enhancements:

  - The ``memcpy`` library can now support multiple types in the same kernel. The user still needs to import ``memcpy.csl`` with the ``.data_type =`` field. The semantic meaning of ``.data_type`` is to enable copy mode for the host runtime.

- ``SdkRuntime`` host runtime enhancements:

  - Introduces a ``debug_utils`` library which includes ``get_symbol``, ``get_symbol_rect``, and ``read_trace``, providing parity with ``CSELFRunner``'s debug support. Note that this library is available for simulator runs only.
  - Introduces a ``launch`` function, which features type checking and a variable number of arguments for kernel launches with the RPC mechanism. The legacy ``memcpy_launch`` function has been deprecated, and users should use ``launch`` instead.
  - ``memcpy_d2h`` and ``memcpy_h2d`` now feature dimension and data type checking for the host tensor.
  - The bandwidth of D2H transfers is greatly improved for systems running in weight streaming mode.

- Benchmark programs additions:

  - Adds ``spmv-hypersparse`` to demonstrate a hypersparse matrix-vector multiplication. See :ref:`sdkruntime-spmv-hypersparse`.
  - Adds ``stencil-3d-7pts`` to demonstrate a sparse matrix-vector product using a matrix generated by a finite difference seven-point stencil. See :ref:`sdkruntime-stencil-3d-7pts`.
  - Adds ``bicgstab``, ``powerMethod``, ``conjugateGradient``, and ``preconditionedConjugateGradient`` to demonstrate iterative methods on a seven-point stencil. See :ref:`sdkruntime-bicgstab`, :ref:`sdkruntime-power-method`, :ref:`sdkruntime-conjugate-gradient`, and :ref:`sdkruntime-preconditioned-conjugate-gradient`.
  - Adds ``single-tile-matvec``, which benchmarks the performance of single-PE matrix-vector products in terms of aggregate wafer memory bandwidth and FLOPS. See :ref:`sdkruntime-single-tile-matvec`.

- Documentation improvements:

  - Introduces new tutorials for ``SdkRuntime`` built around computing a GEMV.
  - Introduces additional documentation on the ``SdkRuntime`` Python host API. See :ref:`sdkruntime-api-reference`.

Resolved issues
~~~~~~~~~~~~~~~

- When using ``SdkRuntime``, a nonblocking ``memcpy_d2h`` before ``stop()`` no longer triggers a segmentation fault.
- Programs using ``SdkRuntime`` now load correctly in the SDK GUI.

Known issues
~~~~~~~~~~~~

- The bandwidth of memory transfers saturates at around 8 IO channels.
- When a DSD operation uses an explicit ``fabin`` DSR, the compiler does not bind the color to the associated input queue at runtime. Instead, the user has to bind the color to the input queue explicitly via ``@initialize_queue``.

Notes for future releases
~~~~~~~~~~~~~~~~~~~~~~~~~

- The ``CSELFRunner`` host runtime has been deprecated. It will be completely removed in a future release.

.. _v0-7-0:

Version 0.7.0
-------------

Released 17 April 2023

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- CSL language and compiler enhancements:

  - Introduces ``@set_teardown_handler`` builtin which virtualizes the teardown task and allows for separate definitions of teardown operations for different colors.
  - Introduces ``@rpc`` builtin which automatically generates an RPC interpreter for exported functions. Used with the ``call`` host function added to ``SdkRuntime``. Note that exported symbols may not have struct or enum types, and exported functions may have at most 15 parameters.
  - Introduces ``@get_input_queue`` builtin for creating input queue types. Using integers for input queue IDs is now deprecated and produces a warning.
  - Variables now have a ``linksection`` attribute. With the ``--link-section-address-bytes`` flag, this allows global variables to be placed at a specific address.
  - Introduces ``control_transform`` field for DSDs to transform the index portion of control wavelets.
  - Introduces ``@dfilt`` builtin which instructs an input queue to drop all data wavelets until a certain number of control wavelets are encountered.
  - DSD ``.activate`` field now allows a runtime-determined color value.
  - Deprecated color config syntax has been removed.
  - Compiler task table packing optimization increases performance of small tasks.

- CSL library enhancements:

  - ``tile_config`` library introduces ``control_transform`` submodule to set mask when transforming index portion of control wavelets.
  - ``collectives_2d`` library now uses the virtualized teardown task, allowing for interoperability with programs that use ``memcpy`` and the ``SdkRuntime`` host runtime.

- ``SdkRuntime`` host runtime enhancements:

  - ``SdkRuntime`` introduces a ``call`` function to greatly simplify kernel launches with the RPC mechanism. Functions exported in device code with the ``@rpc`` builtin are now directly host-callable.
  - ``memcpy`` library now supports 16-bit for copy mode.
  - ``memcpy`` library now reserves color 27 to deliver better performance.
  - Both ``copy`` and ``streaming`` mode now support 16-bit data.
    Note that in ``streaming`` mode, the ``MemcpyDataType`` parameter in ``memcpy_h2d`` and ``memcpy_d2h`` host calls has no effect, and the user must handle the data appropriately in the receiving wavelet-triggered task.
  - The ``memcpy_h2d`` and ``memcpy_d2h`` host functions take an argument to specify the packing of the 3D input/output tensor into a 1D array, either row-major or column-major. The column-major option improves bandwidth of data transfers when the host data is packed in that order.
  - The ``memcpy_h2d`` and ``memcpy_d2h`` host functions have new function signatures to better handle the increased number of transfer type arguments. These are passed in a ``struct`` in the C++ interface, or as required ``kwargs`` in the Python interface. This release supports the following options:

    - ``DataType``: (new option) 16-bit or 32-bit
    - ``Order``: (new option) row-major or column-major
    - ``streaming``: true or false
    - ``nonblock``: true or false

  - The runtime can seamlessly aggregate consecutive nonblocking ``memcpy_h2d`` calls, improving the bandwidth of bursts of small transfers.

- Benchmark programs additions and enhancements:

  - Adds ``bandwidthTest`` to benchmark data transfer performance between host and device. See :ref:`sdkruntime-bandwidthTest`.
  - Adds a version of ``gemm-collectives_2d`` using ``SdkRuntime``, which showcases the interoperability of the ``collectives_2d`` library with ``memcpy``. See :ref:`sdkruntime-gemm-collectives`.
  - Benchmark programs written with ``SdkRuntime`` and using the RPC mechanism to launch device kernels have been rewritten to use ``call`` in the host code and the ``@rpc`` builtin in the device code, greatly reducing the complexity of the programs.

- Documentation improvements:

  - Example programs have been reorganized into ``CSELFRunner`` and ``SdkRuntime`` sections, to clearly differentiate programs by their host runtime.
  - Adds appendix to describe SIMD operations on DSDs. See :ref:`language-appendix-simd`.
  - Adds five tutorial example programs using ``SdkRuntime``, mirroring those written to use ``CSELFRunner``.
  - Adds improved documentation on ``SdkRuntime`` and its host API.

Resolved issues
~~~~~~~~~~~~~~~

- Runtime expressions with ``comptime``-only types in comparisons no longer crash the compiler.
- ``comptime`` switch expressions can now switch on ``comptime_int``.
- Binding more than one task to the same color now produces a compiler error.
- Compiler now checks that dimensionality of a tensor access expression does not exceed max dimensionality of type.

Known issues
~~~~~~~~~~~~

- Programs using the ``SdkRuntime`` host runtime may fail to load in the ``sdk-gui`` when invoked with ``sdk_debug_shell visualize``.
- The bandwidth of D2H (device to host) memory transfers using ``memcpy`` is about 7x to 8x lower than that of H2D (host to device) transfers.
- The bandwidth of memory transfers saturates at around 8 IO channels.
- When a DSD operation uses an explicit ``fabin`` DSR, the compiler does not bind the color to the associated input queue at runtime. Instead, the user has to bind the color to the input queue explicitly via ``@initialize_queue``.
- When using ``SdkRuntime``, if the last call before ``stop()`` is a nonblocking ``memcpy_d2h``, then ``stop()`` may trigger a segmentation fault.

Notes for future releases
~~~~~~~~~~~~~~~~~~~~~~~~~

- The ``CSELFRunner`` runtime will be deprecated in a future release. Code should be ported to the ``SdkRuntime`` runtime.
- Using integers for input queue IDs is now deprecated and will be removed in a future release.

.. _v0-6-0:

Version 0.6.0
-------------

Released 22 December 2022

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Compile times are improved due to enhanced caching support.
- Introduces a new host-side runtime, ``SdkRuntime``, with greatly improved host-to-device and device-to-host data transfer performance.

  - Supports host-to-device (H2D) copy to a device CSL variable address (``memcpy_h2d``), device-to-host (D2H) copy from a device CSL variable address (``memcpy_d2h``), and launch of CSL device kernels (``memcpy_launch``).
  - See :ref:`tensor-streaming` for more details. For examples using the new API, see :ref:`residual-memcpy` and :ref:`stencil-memcpy`.

- The legacy runtime, ``CSELFRunner``, now supports host-to-device and device-to-host copy using the memcpy API.
- CSL language enhancements:

  - Support for normal-mode FIFOs.
  - Introduces explicit DSRs, providing a more efficient way to execute DSD operations.
  - Initial RPC (remote procedure call) support, with a mechanism for host-device communication using shared symbols.
  - Additional support for DSD-to-scalar operations.
  - Support for setting task and microthread priority at comptime and runtime.
  - Improved assertion failure messages in ``@comptime_assert``.
  - The ``.unblock`` DSD field can now be used at runtime and comptime.

- CSL library enhancements:

  - Introduces ``collectives_2d`` library, which implements MPI-like communication primitives over rows or columns of PEs.
  - New generic API for math libraries.
  - Introduces ``directions`` library, which provides utility functions for manipulating directions.
  - Adds efficient implementations of ``sin_f16`` and ``cos_f16``.
  - Adds ``issignaling_f16`` and ``issignaling_f32``, which check for a signaling NaN.
  - A new version of the ``memcpy`` library supports copies to and from an address, and has been updated to support the new runtime. See :ref:`residual-memcpy` and :ref:`stencil-memcpy` examples.

- ``cs_readelf`` improvements:

  - Adds ``--visualize`` command line option for drawing ASCII art representation of PE populations. See ``--help`` information for details.
  - All addresses (both command line option inputs and printed outputs) are now in byte (8-bit) units instead of word (16-bit) units.

- New benchmark programs:

  - Dense Cholesky decomposition.
  - Hadamard product, demonstrating selective batched execution mode.
  - GEMV with collective communications, demonstrating the ``collectives_2d`` library.

- Documentation improvements:

  - Adds a new introductory tutorials section to provide step-by-step instruction for learning CSL. See :ref:`csl-tutorials`.
  - Adds new example demonstrating the use of the ``debug`` library for tracing values at runtime.
  - Adds sections on generics and DSRs. See :ref:`language-generics` and :ref:`language-dsrs`.

Resolved issues
~~~~~~~~~~~~~~~

- Relative paths are now handled correctly when importing code files as modules.

Known issues
~~~~~~~~~~~~

- The copy mode of ``memcpy`` only supports 32-bit data. To copy 16-bit data to the device, streaming mode must be used instead.
- If there are two device-to-host (D2H) ``memcpy`` calls in a non-blocking sequence, and the first D2H is non-blocking, then the run can stall, especially when using back-to-back D2H calls. To avoid this risk, the user must use blocking D2H calls instead.

Notes for future releases
~~~~~~~~~~~~~~~~~~~~~~~~~

- The ``CSELFRunner`` runtime will be deprecated in a future release. Code should be ported to the ``SdkRuntime`` runtime.

.. _v0-5-1:

Version 0.5.1
-------------

Released 27 September 2022

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- An optional new implementation for tensor streaming is available. The new implementation is described in :ref:`tensor-streaming`, along with instructions for porting kernels to use the new implementation. Two new CSL code examples, :ref:`residual-memcpy` and :ref:`stencil-memcpy`, are provided for reference.
- The SDK GUI has introduced new features, detailed in :ref:`sdk-gui`. Major new features include:

  - Updated display of routing.
  - Addition of instruction tracing in the timeline.

- CSL language enhancements:

  - Runtime support for named struct types.
  - ``switch`` support.
  - ``comptime`` and ``anytype`` function argument support.
  - ``comptime_string`` support.
  - Either color or task can now be used for DSD config operations.

- CSL library enhancements:

  - Initial complex number support.
  - Runtime support for finding the position of the running PE within the rectangle.

.. _v0-4-0:

Version 0.4.0
-------------

Released 29 April 2022

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- New CLI tool ``csdb`` introduced. ``csdb`` currently supports debugging on hardware and will eventually support simulation debugging.
- New CLI tool ``cs_readelf`` introduced.
- As of 0.3.1, the numbers in the ELF binary names do NOT correspond to PE coordinates.
- To access prior versions of SDK documentation, please email ``developer@cerebras.net``.

Known issues
~~~~~~~~~~~~

- In the SDK GUI timeline view, clicking multiple PEs on the grid in quick succession may result in a JSON error. To avoid this error, please wait for the timeline to load before clicking the next PE. If you see this error for a PE, click a different PE, allow the timeline to load, and then click the original PE again.
- If you launch ``csdb`` and type ``ctrl+x``, the container will lock up and prevent further action. If this happens, you must exit and re-launch your terminal session.
- ``cslc --help`` returns options for ``cslc-driver``, which are very similar tools, but not exactly the same. Please note that some options listed may not be available in ``cslc``.

Notes for future releases
~~~~~~~~~~~~~~~~~~~~~~~~~

- ``csdb`` CLIs will replace ``sdk_debug_shell`` CLIs in a future release. ``sdk_debug_shell`` will be deprecated.
- Content under ``CSL Code Examples`` will be moved to the ``csl-examples`` GitHub repository in a future release. Please let us know if you need access to this repository by emailing ``developer@cerebras.net``.

.. _v0-3-1:

Version 0.3.1
-------------

Released 25 February 2022

New features and enhancements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Compile time is faster now due to caching improvements.
- Support for FIFOs is added.
  See :ref:`language-dsds` for documentation and ``@allocate_fifo`` in :ref:`language-builtins`.

  - See :ref:`sdkruntime-topic-08-fifos` for an example showing how to use ``@allocate_fifo``.

- Support for switching and filtering is added. With this feature, you can specify the routing configuration for a specific color at a specific processing element (PE). This can be done in a layout block (``@set_color_config``) or in a processing element’s top-level ``comptime`` block (``@set_local_color_config``). See :ref:`language-builtins` for documentation.

  - See :ref:`sdkruntime-topic-05-switches` and :ref:`sdkruntime-topic-07-filters` for examples.

- Support for microthreads is added. See :ref:`language-dsds` for documentation.
- Library support is added. See :ref:`language-libraries` for a full list of supported library functions.
- Added the following built-ins. See :ref:`language-builtins` for a full list of supported built-ins.

  - ``@set_dsd_base_addr``
  - ``@random16``
  - ``@is_same_type``
  - ``@is_comptime``

- Compile time floating point constants are now automatically type-casted as needed. So, instead of ``@as(f32, 1.0)`` (see :ref:`language-builtins`) or ``@as(f16, 1.0)``, simply write ``1.0``.
- Runtime floating point constants no longer default to type ``f16`` but to ``comptime_float``. If you want a runtime variable, you now need to explicitly specify the desired type of that variable. For example, instead of ``var x = 0.0;`` (wrong), write ``var x: f16 = 0.0;``.
- Adds support for setting the state of the pseudo-random number generator (PRNG).
- Adds support for using general purpose registers (GPRs) as destination for DSD operations:

  .. code-block:: csl

     var result: f16 = 1.0;
     const buffer = [3]f16 {100.0, 250.0, 349.0};

     task fooTask() void {
       const dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{3} -> buffer[i] });
       @faddh(&result, result, dsd);
     }

- Asynchronous DSD operations must have at least one fabric DSD operand.
  Non-compliant code will now trigger an error message.
- Adds support for the dot operator to access members of structs. Implemented for compile time only.
- Colors can now be compared using ``==`` and ``!=`` operators.
- DSD operations, for example, ``add16``, now support unsigned integer operands.
- A new ``--verbose`` compiler flag shows progress.

Requirements and unsupported features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- The SDK requires that the `overlay filesystem `_ functionality is available on your Linux system.
- This SDK is supported only on Linux systems.
- There are no guarantees for forward- or backward-compatibility for this release.
- The SDK does not support running external Python scripts in the Singularity container.
- The SDK only supports running the versions of packages provided in the Singularity container.

Resolved issues
~~~~~~~~~~~~~~~

- Fixes a bug that prevented unit innermost dimension loops in ``mem4d_dsds``.
- Fixes a bug so that ``mem4d_dsds`` is now allowed to set the ``wavelet_index_offset`` bit.
- Compile time and runtime semantics of ``@set_dsd_base_addr`` (see :ref:`language-builtins`) were different. This is fixed and now they are the same.

Known issues
~~~~~~~~~~~~

- When using the SDK GUI, via ``sdk_debug_shell visualize --artifact_dir`` command, if the artifacts in the artifact directory change, then the SDK GUI will continue to show the old artifact data in a cache. To view the new artifacts, restart the SDK GUI by running the command ``sdk_debug_shell visualize --artifact_dir``.
- When you run the command ``sdk_debug_shell visualize --artifact_dir`` to invoke the SDK GUI, you will see the following error message. This message can be safely ignored.

  .. code-block:: bash

     $ sdk_debug_shell visualize --artifact_dir /cb/cold/user1/sandbox/sdk_tool_rel-0.3.1/residual
     WARNING:cerebras.common.decorators:Call to deprecated function EnumFiles
     WARNING:root: . is not a valid workdir.
     ERROR:root:plan.meta not found in current directory or subdirectories.
     ERROR:root:No entries will be displayed.
     Click this link to open URL: http://user1:8000/?session_id=12b77f285e
     Click this link to open URL: http://172.xx.51.216:8000/?session_id=12b77f285e
     Press Ctrl-C to exit
     ERROR:root:Error reading A_1_1.elf
     ERROR:root:Error reading A_0_1.elf
     ERROR:root:Error reading A_1_0.elf
     ERROR:root:Error reading A_0_0.elf

- The SDK GUI currently displays the color values only in the range of 0-14 inclusive.

.. _v0-2-1:

Version 0.2.1
-------------

Released 5 November 2021

This release adds usability improvements and fixes bugs encountered in the 0.2.0 debug tool CLIs. This release also adds compatibility with the Cerebras R0.9 Software Release, so the CS system hardware does not require re-imaging in order to use the SDK.

- This SDK is supported only on Linux systems.
- There are no guarantees for forward- or backward-compatibility for this release.
- The SDK requires that the `overlay filesystem `_ functionality is available on your Linux system.
- The SDK only supports running the versions of packages provided in the Singularity container.
- If the CSL compiler aborts with the LLVM error message ``"PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace, preprocessed source, and associated run script."`` then do not report to llvm.org but instead report the problem to Cerebras.
- The visualizer tool will not display single-ended routes, i.e., routes where PE A transmits to PE B, but PE B is missing a receiving route, and vice-versa.
- ``CSELFRunner`` supports single-node host only.
- We are no longer actively supporting or maintaining the CASM and Spoke workflow of version 0.1.x. Migration of code to CSL is needed.
- The following examples in the ``cslang/benchmarks`` directory of the SDK can be run only in simulation, and not on the CS system:

  - ``cslang/benchmarks/FFT``
  - ``cslang/benchmarks/wide-multiplication``

- The ``cslang/benchmarks/FFT`` example incorrectly states "SUCCESS" on test completion.
- To run the CSL examples on the CS-1, you must manually emit a wavelet to terminate the runtime.

.. _v0-2-0:

Version 0.2.0
-------------

Released 12 October 2021

- This SDK is supported only on Linux systems.
- There are no guarantees for forward- or backward-compatibility for this release.
- The SDK 0.2.0 requires that the `overlay filesystem `_ functionality is available on your Linux system.
- Hardware support for SDK 0.2.0 is limited to the CS-1.
- The SDK does not support running external Python scripts in the Singularity container.
- The SDK only supports running the versions of packages provided in the Singularity container.
- The SDK 0.2.0 image for the CS-1 is incompatible with the Cerebras Graph Compiler (CGC) 0.9.0 image. Hence, the SDK system image must be loaded in order to run CSL programs on the CS-1 system.
- The visualizer tool will not display single-ended routes, i.e., routes where PE A transmits to PE B, but PE B is missing a receiving route, and vice-versa.
- ``CSELFRunner`` supports single-node host only.
- The following examples in the ``cslang/benchmarks`` directory of the SDK can be run only in simulation, and not on the CS system:

  - ``cslang/benchmarks/FFT``.
  - ``cslang/benchmarks/wide-multiplication``.

Pre-release Version 0.2.0
-------------------------

Released 27 August 2021

- Initial availability of the Pre-release 0.2.0 of the SDK documentation.