Pack a Hugging Face Transformers model

There are two ways to do this:

  • You can pack a model using the rust-bert runner. This library is a partial Rust port of the transformers library and doesn't require Python at runtime. It allows your model to run completely in native code. If your model is supported, see the Rust-Bert section below to get started.

  • You can pack a model as arbitrary Python code. See the Python Runner section below to get started.

Python Runner

For complete examples, you can look at the code that packs the Intel DPT depth estimation model and the code that packs Stable Diffusion XL.

You can also look at the Python packing docs for more detail.

1. Get the model

Within our packing code, the first thing we're going to do is get the model we want to pack:

py
# At /path/to/my/model/pack.py
import asyncio
import cartonml as carton

from transformers import DPTForDepthEstimation, DPTFeatureExtractor

async def main():
    # Download the model components we need
    model_id = "Intel/dpt-hybrid-midas"
    model_revision = "fc1dad95a6337f3979a108e336932338130255a0"
    model = DPTForDepthEstimation.from_pretrained(
        model_id,
        revision=model_revision,
        cache_dir="./to_pack/model",
    )

    feature_extractor = DPTFeatureExtractor.from_pretrained(
        model_id,
        revision=model_revision,
        cache_dir="./to_pack/model",
    )

    # ...

asyncio.run(main())

Note the cache_dir argument. This saves everything needed to run the model to that directory.
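If you want to confirm what was downloaded, a quick sanity check like the one below can help. This is purely illustrative and not part of the packing flow; the exact directory layout inside cache_dir is managed by huggingface_hub and varies between versions.

py
# Hypothetical sanity check: list everything that landed in the cache
# directory we passed as cache_dir above. The layout inside it is managed
# by huggingface_hub, so don't rely on specific file names.
import os

for root, _dirs, files in os.walk("./to_pack/model"):
    for name in files:
        print(os.path.join(root, name))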

2. Create the model entrypoint

This is the code that runs when your model is loaded. For more details, see the Python packing docs.

py
# At /path/to/my/model/to_pack/infer.py
import torch
from transformers import DPTForDepthEstimation, DPTFeatureExtractor

class Model:
    def __init__(self):
        # Load the model we downloaded (notice the `local_files_only=True`)
        model_id = "Intel/dpt-hybrid-midas"
        model_revision = "fc1dad95a6337f3979a108e336932338130255a0"
        self.model = DPTForDepthEstimation.from_pretrained(
            model_id,
            revision=model_revision,
            cache_dir="./model",
            local_files_only=True,
        )

        self.feature_extractor = DPTFeatureExtractor.from_pretrained(
            model_id,
            revision=model_revision,
            cache_dir="./model",
            local_files_only=True,
        )

        if torch.cuda.is_available():
            self.model.to("cuda")

    def infer_with_tensors(self, tensors):
        image = tensors["image"]

        # ... Use self.model and self.feature_extractor to compute `prediction` ...

        return {
            "depth": prediction
        }

def get_model():
    return Model()

Again, notice the cache_dir and local_files_only arguments. This loads everything from the directory we set up in our packing code.
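The inference body above is elided. For reference, here is a minimal sketch of what it might look like using the standard transformers DPT API. This is illustrative only: it assumes `image` arrives as an array the feature extractor accepts, and it is not the exact code used by the packed Intel DPT model.

py
# A minimal sketch of the elided inference body (illustrative assumptions,
# not the actual packed model's code)
inputs = self.feature_extractor(images=image, return_tensors="pt")
if torch.cuda.is_available():
    inputs = {k: v.to("cuda") for k, v in inputs.items()}

with torch.no_grad():
    outputs = self.model(**inputs)
    # DPTForDepthEstimation returns a `predicted_depth` tensor
    prediction = outputs.predicted_depth.squeeze().cpu().numpy()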

We also want to create a requirements.txt file in the same directory as infer.py (/path/to/my/model/to_pack/):

transformers==4.31.0
accelerate==0.21.0
torch==2.0.1

3. Pack the model

Continuing from the code in step 1, we can let Carton know about the entrypoint we defined:

py
# At /path/to/my/model/pack.py
import asyncio
import os

import cartonml as carton

from transformers import DPTForDepthEstimation, DPTFeatureExtractor

async def main():
    # ...
    # Continued from step 1 above

    packed_model_path = await carton.pack(
        os.path.join(os.path.dirname(os.path.abspath(__file__)), "to_pack"),
        runner_name="python",
        required_framework_version="=3.10",
        runner_opts={
            "entrypoint_package": "infer",
            "entrypoint_fn": "get_model",
        },
        # ...
        # See the link below for a list of other information you can provide when packing a model
    )

asyncio.run(main())

There are several other options (e.g. description, examples, etc.) you can provide when packing a model. See the packing code for the Intel DPT depth estimation model for an example.
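As a rough sketch of what attaching metadata might look like (the parameter names below, such as model_name and short_description, are assumptions; check the Carton packing docs for the exact names):

py
# Hypothetical sketch of packing with metadata attached. The metadata
# parameter names are assumptions; verify them against the Carton docs.
packed_model_path = await carton.pack(
    os.path.join(os.path.dirname(os.path.abspath(__file__)), "to_pack"),
    runner_name="python",
    required_framework_version="=3.10",
    runner_opts={
        "entrypoint_package": "infer",
        "entrypoint_fn": "get_model",
    },
    model_name="depth_model",                              # assumed parameter name
    short_description="Monocular depth estimation model",  # assumed parameter name
)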

4. Improve packing speed and load time (optional)

Optionally, we can tell Carton to load large files directly from Hugging Face instead of storing them in the model. This can improve packing and loading time in some cases.

py
# At /path/to/my/model/pack.py
import asyncio
import os

import cartonml as carton
from cartonml.utils.hf import get_linked_files

from transformers import DPTForDepthEstimation, DPTFeatureExtractor

async def main():
    # ...
    # Continued from step 1 above

    # For a smaller output model file, we can let Carton know to pull large files from HF instead of storing
    # them in the output file. This can also lead to better caching behavior if the same files are used across
    # several models
    linked_files = get_linked_files(model_id, model_revision)

    packed_model_path = await carton.pack(
        os.path.join(os.path.dirname(os.path.abspath(__file__)), "to_pack"),
        runner_name="python",
        required_framework_version="=3.10",
        runner_opts={
            "entrypoint_package": "infer",
            "entrypoint_fn": "get_model",
        },
        linked_files=linked_files,
        # ...
        # See the link below for a list of other information you can provide when packing a model
    )

asyncio.run(main())

Rust-Bert

This type of model uses the rust-bert runner. This library is a partial Rust port of the transformers library and doesn't require Python at runtime. It allows your model to run completely in native code.

Text Generation

First, you need to create a folder to contain the model you want to pack. In this example, we'll use {MODEL_PATH}, but you should replace that with your folder path.

Once you've selected a text generation model from Hugging Face, download the following files into {MODEL_PATH}/model (a download sketch follows the list below):

  • The model (rust_model.ot)
  • The configuration (config.json)
  • The vocabulary file (e.g. vocab.json, matching the config below)
  • The merges file (optional, usually named something like merges.txt)
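As a sketch, you could fetch these files with huggingface_hub. The repo id gpt2 is just a stand-in; which files a given repo actually ships (and whether it includes rust_model.ot at all) depends on the model you chose.

py
# Hypothetical download sketch using huggingface_hub. Replace the repo id,
# revision, and file names with the ones for your chosen model, and
# {MODEL_PATH} with your folder path.
import os
import shutil

from huggingface_hub import hf_hub_download

os.makedirs("{MODEL_PATH}/model", exist_ok=True)
for filename in ["rust_model.ot", "config.json", "vocab.json", "merges.txt"]:
    cached = hf_hub_download(repo_id="gpt2", filename=filename)
    shutil.copy(cached, os.path.join("{MODEL_PATH}/model", filename))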

Next, we'll create a config file so that Carton knows the type of the model and where the required files are:

json
{
    "TextGeneration": {
        "model_type": "GPT2",
        "model_path": "./model/rust_model.ot",
        "config_path": "./model/config.json",
        "vocab_path": "./model/vocab.json",
        "merges_path": "./model/merges.txt"
    }
}

This file should be at {MODEL_PATH}/config.json. The valid values for model_type are here.

Finally, you can pack the model as follows:

py
import asyncio
import cartonml as carton

async def main():
    packed_model_path = await carton.pack(
        "{MODEL_PATH}",  # Don't forget to change this!
        runner_name="rust-bert",
        required_framework_version="=0.21.0",
        # ...
        # See the link below for a list of other information you can provide when packing a model
    )

asyncio.run(main())

There are several other options (e.g. description, examples, etc.) you can provide when packing a model.

You can also provide linked_files, as in step 4 of the Python Runner section above. This can speed up packing and loading of some large models.
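For example, a hedged sketch combining the two, assuming the files in {MODEL_PATH}/model came from a single Hugging Face repo and revision:

py
# Hypothetical sketch: pack the rust-bert model with linked files, as in
# step 4 of the Python Runner section. Replace the repo id and revision
# (ideally a commit hash) with the ones your files actually came from.
import asyncio
import cartonml as carton
from cartonml.utils.hf import get_linked_files

async def main():
    linked_files = get_linked_files("gpt2", "main")  # assumed repo id and revision

    packed_model_path = await carton.pack(
        "{MODEL_PATH}",  # Don't forget to change this!
        runner_name="rust-bert",
        required_framework_version="=0.21.0",
        linked_files=linked_files,
    )

asyncio.run(main())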