Deploying Models on a Server

To run models on remote GPU/TPU machines, it is inconvenient to have to check out your project’s repository and compile it on every target. Instead, you most likely want to cross-compile right on your development machine, for every supported target architecture and accelerator.

See Getting Started with ZML if you need more information on how to compile a model.

Here's a quick recap:

You can compile models for accelerator runtimes by appending one or more of the following arguments to the command line when compiling or running a model:

  • NVIDIA CUDA: --@zml//runtimes:cuda=true
  • AMD ROCm: --@zml//runtimes:rocm=true
  • Google TPU: --@zml//runtimes:tpu=true
  • Disable CPU (force an accelerator): --@zml//runtimes:cpu=false
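
If you always target the same accelerator, you don't have to retype these flags on every invocation. Here is a minimal sketch using Bazel's standard .bazelrc mechanism (plain Bazel, nothing ZML-specific); adjust the flags to your hardware:

# .bazelrc in your workspace root; applied to every bazel build / run
build --@zml//runtimes:cuda=true
build --@zml//runtimes:cpu=false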

So, to run the OpenLLaMA model from above on your development machine with an NVIDIA GPU, run the following:

cd examples
bazel run -c opt //llama:OpenLLaMA-3B --@zml//runtimes:cuda=true

Cross-Compiling and Creating a TAR for Your Server

Currently, ZML lets you cross-compile to one of the following target architectures:

  • Linux X86_64: --platforms=@zml//platforms:linux_amd64
  • Linux ARM64: --platforms=@zml//platforms:linux_arm64
  • macOS ARM64: --platforms=@zml//platforms:macos_arm64

As an example, here is how you build the above OpenLLaMA model for CUDA on Linux X86_64:

cd examples
bazel build -c opt //llama:OpenLLaMA-3B               \
            --@zml//runtimes:cuda=true                \
            --@zml//runtimes:cpu=false                \
            --platforms=@zml//platforms:linux_amd64
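
To sanity-check that the build really produced a Linux x86_64 binary, you can inspect it with the standard file utility (the output path below is an assumption; Bazel prints the actual location at the end of the build):

file bazel-bin/llama/OpenLLaMA-3B
# expected output mentions: ELF 64-bit ... x86-64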

Creating the TAR

When cross-compiling, it is convenient to produce a compressed TAR file that you can copy to the target host, so you can unpack it there and run the model.

Let's use MNIST as an example.

If not present already, add an "archive" target to the model's BUILD.bazel, like this:

load("@aspect_bazel_lib//lib:tar.bzl", "mtree_spec", "tar")

# Manifest, required for building the tar archive
mtree_spec(
    name = "mtree",
    srcs = [":mnist"],
)

# Create a tar archive from the above manifest
tar(
    name = "archive",
    srcs = [":mnist"],
    args = [
        "--options",
        "zstd:compression-level=9",
    ],
    compress = "zstd",
    mtree = ":mtree",
)
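
If you want to see what will end up in the archive, you can build just the manifest target and inspect it (a quick check, assuming the target names from the snippet above):

bazel build //mnist:mtree
# Bazel prints the path of the generated manifest under bazel-bin/mnist/;
# it lists every file the tar rule will pack.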

Then, build the TAR archive:

# cd examples
bazel build -c opt //mnist:archive                    \
            --@zml//runtimes:cuda=true                \
            --@zml//runtimes:cpu=false                \
            --platforms=@zml//platforms:linux_amd64

Note the //mnist:archive notation: we build the archive target, not the mnist binary itself.

The resulting tar file will be in bazel-bin/mnist/archive.tar.zst.
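
Before copying it anywhere, you can list the archive's contents as a sanity check (GNU tar auto-detects the zstd compression when reading, provided zstd is installed):

tar -tf bazel-bin/mnist/archive.tar.zst | head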

Run It on the Server

You can copy the TAR archive onto your Linux X86_64 NVIDIA GPU server, untar it there, and run the model:

# on your machine
scp bazel-bin/mnist/archive.tar.zst destination-server:
ssh destination-server   # to enter the server

# ... on the server
tar xvf archive.tar.zst
./mnist \
    'mnist.runfiles/_main~_repo_rules~com_github_ggerganov_ggml_mnist/file/mnist.pt' \
    'mnist.runfiles/_main~_repo_rules~com_github_ggerganov_ggml_mnist_data/file/mnist.ylc'
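
If the model complains that no GPU is available, a quick first check on the server is whether the driver sees the device at all (standard NVIDIA tooling, independent of ZML):

nvidia-smi    # should list the GPU plus driver and CUDA versions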

The easiest way to figure out the command-line arguments of an example model is to consult the model's BUILD.bazel and check its args section. It references, e.g., weights files that are defined either in the same BUILD.bazel file or in a weights.bzl file.
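
For illustration, such an args section typically uses Bazel's $(location ...) expansion to resolve labels to file paths. This is a hypothetical sketch, not the actual contents of mnist/BUILD.bazel; the rule name and labels are placeholders:

# Hypothetical excerpt; the real rule name and labels live in
# mnist/BUILD.bazel and/or weights.bzl.
some_binary_rule(
    name = "mnist",
    args = [
        "$(location @com_github_ggerganov_ggml_mnist//file)",
        "$(location @com_github_ggerganov_ggml_mnist_data//file)",
    ],
    data = [
        "@com_github_ggerganov_ggml_mnist//file",
        "@com_github_ggerganov_ggml_mnist_data//file",
    ],
)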

You can also consult the console output when running your model locally:

bazel run //mnist

INFO: Analyzed target //mnist:mnist (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //mnist:mnist up-to-date:
  bazel-bin/mnist/mnist
INFO: Elapsed time: 0.302s, Critical Path: 0.00s
INFO: 3 processes: 3 internal.
INFO: Build completed successfully, 3 total actions
INFO: Running command line: bazel-bin/mnist/mnist ../_main~_repo_rules~com_github_ggerganov_ggml_mnist/file/mnist.pt ../_main~_repo_rules~com_github_ggerganov_ggml_mnist_data/file/mnist.ylc
# ...

You can see the command line in the last INFO: line above. On the server, you just need to replace the leading ../ in each argument with the mnist.runfiles directory unpacked from your TAR.