Ideally, the ML framework used to run a model should just be an implementation detail. By decoupling your inference code from specific frameworks, you can easily keep up with the cutting edge.
How much overhead does Carton have?
Most of Carton is implemented in optimized async Rust code. Preliminary benchmarks with small inputs show an overhead of less than 100 microseconds (0.0001 seconds) per inference call.
We're continuing to optimize, including making better use of shared memory. This should bring overhead for models with large inputs down to similar levels.
What platforms does Carton support?
Currently, Carton supports the following platforms:
x86_64 Linux and macOS
aarch64 Linux (e.g. Linux on AWS Graviton)
aarch64 macOS (e.g. M1 and M2 Apple Silicon chips)
WebAssembly (metadata access only for now, but WebGPU runners are coming soon)
What is "a carton"?
A carton is the output of the packing step. It is a zip file that contains your original model and some metadata. Packing does not modify the original model, which avoids error-prone conversion steps.
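Because a carton is just a zip archive, you can inspect one with standard tooling. The sketch below builds a toy archive with the same overall shape (the untouched model file plus a metadata entry) and lists its contents with Python's `zipfile` module. The entry names and metadata format here are illustrative assumptions, not Carton's actual internal layout.

```python
import os
import tempfile
import zipfile

# Build a toy archive shaped like a carton: the original model bytes,
# stored as-is, plus a metadata file. Entry names are illustrative only.
tmpdir = tempfile.mkdtemp()
carton_path = os.path.join(tmpdir, "model.carton")

with zipfile.ZipFile(carton_path, "w") as zf:
    zf.writestr("model/model.pt", b"original model bytes")
    zf.writestr("carton.toml", 'model_name = "example"\n')

# Any zip tool can list what's inside -- the model entry is intact,
# byte-for-byte, alongside the metadata.
with zipfile.ZipFile(carton_path) as zf:
    names = zf.namelist()
    model_bytes = zf.read("model/model.pt")

print(names)
```

The key property this illustrates is that the model inside the archive is the same artifact you packed, so no conversion or re-export step sits between you and the original framework's serialized model.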
Why use Carton instead of ONNX?
ONNX converts models, while Carton wraps them. Under the hood, Carton uses the underlying framework (e.g. PyTorch) to actually execute a model. This matters because it makes it easy to use custom ops, TensorRT, etc. without changes. For some sophisticated models, conversion steps (e.g. to ONNX) can be problematic and require validation. By removing these conversion steps, Carton enables faster experimentation, deployment, and iteration.
With that said, we plan to support ONNX models within Carton. This lets you use ONNX if you choose, and it enables some interesting use cases (like running models in-browser with WASM).