ZML Concepts
Model lifecycle
ZML is an inference stack that helps run Machine Learning (ML) models, particularly Neural Networks (NN).
The lifecycle of a model is implemented in the following steps:
Open the model file and read the shapes of the weights, but don't load the weights yet.
Using the loaded shapes and optional metadata, instantiate a model struct with Tensors, representing the shape and layout of each layer of the NN.
Compile the model struct and its forward function into an accelerator-specific executable. The forward function describes the mathematical operations corresponding to the model inference.
Load the model weights from disk onto the accelerator memory.
Load some user inputs, and copy them to the accelerator.
Call the executable using the weights and the user inputs.
Fetch the returned model output from the accelerator into host memory, and finally present it to the user.
When all user inputs have been processed, free the executable resources, weights and inputs.
Some details:
Note that the compilation and weight loading steps are both bottlenecks to your model startup time, but they can be done in parallel thanks to Zig's std.Io interface.
The accelerator is typically a GPU, but can be another chip, or even the CPU itself, churning vector instructions.
Tensor Bros.
In ZML, we leverage Zig's static type system to differentiate between a few concepts, hence we not only have a Tensor to work with, like other ML frameworks, but also Buffer, Slice, and Shape.
Let's explain all that.
Shape: describes a multi-dimension array.
- Shape.init(.{16}, .f32) represents a vector of 16 floats of 32 bits precision.
- Shape.init(.{512, 1024}, .f16) represents a matrix of 512*1024 floats of 16 bits precision, i.e. a [512][1024]f16 array.
A Shape is only metadata, it doesn't point to or own any memory. The Shape struct can also represent a regular number, aka a scalar: Shape.init(.{}, .i32) represents a 32-bit signed integer.
Slice: is the combination of a Shape and raw bytes (bytes that are on the CPU).
- can own the underlying memory
- but can also accommodate non-owned memory.
Buffer: is a multi-dimension array, whose memory is allocated on an accelerator.
- contains a handle that the ZML runtime can use to convert it into a physical address, but there is no guarantee this address is visible from the CPU.
- can be created from a Slice by calling Buffer.fromSlice(...).
Tensor: is a mathematical object representing an intermediary result of a computation or an input to an executable (including weights).
- is basically a Shape with an attached MLIR value when in the context of a compilation.
- is basically a
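To make the "only metadata" point concrete, here is a minimal sketch showing that dimensions plus an element type are enough to size the storage without touching any data. SketchShape, DataType, count and byteSize are hypothetical names made up for this illustration, they are not the zml API.

```zig
const std = @import("std");

// Hypothetical stand-ins, NOT the zml API: a tiny element-type enum and a
// shape that stores only dimensions and element type, never a data pointer.
const DataType = enum {
    f16,
    f32,
    i32,

    // Size of one element in bytes.
    fn sizeOf(self: DataType) usize {
        return switch (self) {
            .f16 => 2,
            .f32, .i32 => 4,
        };
    }
};

const SketchShape = struct {
    dims: []const usize,
    dtype: DataType,

    // Number of elements: the product of all dimensions (1 for a scalar).
    fn count(self: SketchShape) usize {
        var n: usize = 1;
        for (self.dims) |d| n *= d;
        return n;
    }

    // Bytes that a host or device allocation of this shape would need.
    fn byteSize(self: SketchShape) usize {
        return self.count() * self.dtype.sizeOf();
    }
};

test "shape metadata is enough to size the storage" {
    const vector = SketchShape{ .dims = &.{16}, .dtype = .f32 };
    const matrix = SketchShape{ .dims = &.{ 512, 1024 }, .dtype = .f16 };
    const scalar = SketchShape{ .dims = &.{}, .dtype = .i32 };

    try std.testing.expectEqual(@as(usize, 64), vector.byteSize());
    try std.testing.expectEqual(@as(usize, 1024 * 1024), matrix.byteSize());
    try std.testing.expectEqual(@as(usize, 4), scalar.byteSize());
}
```

The [512][1024]f16 matrix from the list above therefore needs exactly 1 MiB, and the scalar case falls out naturally because a product over zero dimensions is 1.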
The model struct
The model struct is the Zig code that describes your Neural Network (NN). Let's look at the following model architecture:
This is how we can describe it in a Zig struct:
const zml = @import("zml");

const Model = struct {
    input_layer: zml.Tensor,
    output_layer: zml.Tensor,

    // Describes the mathematical operations of the model inference.
    pub fn forward(self: Model, input: zml.Tensor) zml.Tensor {
        const hidden = self.input_layer.matmul(input);
        const output = self.output_layer.matmul(hidden);
        return output;
    }
};
NNs are generally seen as a composition of smaller NNs, which are split into layers. ZML makes it easy to mirror this structure in your code.
const Model = struct {
    input_layer: MyOtherLayer,
    output_layer: MyLastLayer,

    // Each layer exposes its own forward function, so the model just chains them.
    pub fn forward(self: Model, input: zml.Tensor) zml.Tensor {
        const hidden = self.input_layer.forward(input);
        const output = self.output_layer.forward(hidden);
        return output;
    }
};
The zml.nn module provides a number of well-known layers to bootstrap models more easily.
Since the Model struct contains Tensors, it is only ever useful during the compilation stage, but not during inference. If we want to represent the model with actual Buffers, we can use zml.Bufferized(Model), which is a mirror struct of Model but with a Buffer replacing every Tensor.
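To picture what "mirror struct" means, here is a hand-written sketch of the relationship: same field names, with the tensor type swapped for a buffer type. SketchTensor, SketchBuffer and ModelBuffers are hypothetical stand-ins invented for this example. ZML derives the mirror for you, and this sketch makes no claim about how it does so internally.

```zig
const std = @import("std");

// Hypothetical stand-ins for illustration only, not the real zml types.
const SketchTensor = struct { rank: u8 }; // compile-time description
const SketchBuffer = struct { device_handle: u64 }; // accelerator memory handle

// Compilation-time view of the model: every weight is a tensor description.
const Model = struct {
    input_layer: SketchTensor,
    output_layer: SketchTensor,
};

// Hand-written mirror: same field names, each tensor replaced by a buffer.
const ModelBuffers = struct {
    input_layer: SketchBuffer,
    output_layer: SketchBuffer,
};

test "the mirror keeps field names and swaps the field type" {
    const model_fields = comptime std.meta.fields(Model);
    const buffer_fields = comptime std.meta.fields(ModelBuffers);
    try std.testing.expectEqual(model_fields.len, buffer_fields.len);

    // Walk both structs in lockstep and compare them field by field.
    inline for (model_fields, buffer_fields) |m, b| {
        try std.testing.expect(std.mem.eql(u8, m.name, b.name));
        try std.testing.expect(m.type == SketchTensor);
        try std.testing.expect(b.type == SketchBuffer);
    }
}
```

Keeping the field names identical is what lets the weights loaded at inference time line up with the tensors that the forward function was compiled against.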
Strong type checking
Let's look at the model life cycle again, but this time annotated with the corresponding types.
Open the model file and read the shapes of the weights -> zml.Shape
Instantiate a model struct -> Model struct (with zml.Tensor inside)
Compile the model struct and its forward function into an executable. forward is a Tensor -> Tensor function, executable is a zml.Executable
Load the model weights from disk, onto accelerator memory -> zml.Bufferized(Model) struct (with zml.Buffer inside)
Load some user inputs (custom struct), encode them into arrays of numbers (zml.Slice), and copy them to the accelerator (zml.Buffer).
Call the executable on the user inputs. module.call accepts zml.Buffer arguments and returns zml.Buffer
Return the model output (zml.Buffer) to the host (zml.Slice), decode it (custom struct) and finally return to the user.
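The benefit of giving each stage its own type can be sketched with two toy types: once host data and device data are distinct, skipping the copy to the accelerator becomes a compile error rather than a runtime surprise. HostSlice, DeviceBuffer, toDevice, toHost and call are hypothetical names for this illustration only and do not reflect the zml signatures.

```zig
const std = @import("std");

// Hypothetical stand-ins, not the zml API. Host and device data get distinct
// types even though both wrap the same memory in this toy version.
const HostSlice = struct { values: []const f32 };
const DeviceBuffer = struct { values: []const f32 };

// Host -> accelerator copy: the only way to obtain a DeviceBuffer here.
fn toDevice(host: HostSlice) DeviceBuffer {
    return .{ .values = host.values };
}

// Accelerator -> host copy.
fn toHost(device: DeviceBuffer) HostSlice {
    return .{ .values = device.values };
}

// Stand-in for calling a compiled executable: buffers in, buffer out.
// This toy version ignores the weights and returns the input unchanged.
fn call(weights: DeviceBuffer, input: DeviceBuffer) DeviceBuffer {
    _ = weights;
    return input;
}

test "inputs must go through toDevice before call" {
    const weights = toDevice(.{ .values = &.{ 0.5, 0.5 } });
    const user_input = HostSlice{ .values = &.{ 1.0, 2.0 } };

    // call(weights, user_input) would not compile: HostSlice is not DeviceBuffer.
    const output = toHost(call(weights, toDevice(user_input)));
    try std.testing.expectEqualSlices(f32, user_input.values, output.values);
}
```

The same idea extends to the other stages: a function that only accepts tensors cannot be handed a buffer by mistake, so the order of the lifecycle is enforced by the compiler.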