Why I decided to use NVIDIA Triton for inference


Neural networks are cool. They can solve various tasks and are used everywhere. Let's imagine you have trained one for medicine. It performs well. What's next? Nobody has access to it yet. You need a way to serve it for inference.

I faced exactly this problem. I didn't know much about specialized serving frameworks, but I knew Docker. So the decision was to build the inference pipeline myself.

Using a self-made inference pipeline

The architecture was really simple. I built an API with FastAPI which puts requests into a RabbitMQ queue. The PyTorch model runs in a container, consuming messages from one queue and posting results to another. To handle more requests, I simply created more replicas of the neural net container.
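Conceptually, each worker container looked roughly like this (a minimal sketch, not my actual code: the queue names, message format and TorchScript loading are assumptions for illustration):

import json

import pika
import torch

# Load the model once per container (assuming a TorchScript export).
model = torch.jit.load("model.pt")
model.eval()

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="requests")
channel.queue_declare(queue="results")


def on_message(ch, method, properties, body):
    payload = json.loads(body)
    inputs = torch.tensor(payload["image"], dtype=torch.float32)
    with torch.no_grad():
        prediction = model(inputs)
    # Post the result to the output queue and acknowledge the message.
    ch.basic_publish(
        exchange="",
        routing_key="results",
        body=json.dumps({"id": payload["id"], "result": prediction.tolist()}),
    )
    ch.basic_ack(delivery_tag=method.delivery_tag)


channel.basic_consume(queue="requests", on_message_callback=on_message)
channel.start_consuming()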

It worked OK. But then new models were trained and more features were added (segmentation of new pathologies, segmentation of the region of interest, rotation of images). The processing graph became more complex and included many nodes, some of which could run in parallel.

The simplest solution was to just run all the models sequentially. But this wasn't an option because of the time constraints.

The second thought was to use queues to make models run in parallel. At first, this seemed OK and fast to implement. But a closer look revealed some problems:

  • What happens if a net fails?
  • How do you make a model (node) wait for responses from several previous ones? (You can easily publish to multiple queues, but consuming from several and joining the results is much harder.)
  • How do you limit the maximum number of simultaneously running models? (You have constraints on GPU memory.)
  • How do you make sure all models are ready to run?

Using Triton

As more and more problems kept arising, I realized that I was trying to build my own inference framework from scratch. I don't think it's a good idea to reinvent the wheel, so I decided to use NVIDIA Triton Inference Server.

The official site says:

NVIDIA Triton™ Inference Server is an open-source inference serving software that helps standardize model deployment and execution and delivers fast and scalable AI in production.

So I gave it a try. Implementation took more time than I expected. These are my thoughts:

What I loved:

Error handling

If an error happens in a node, it is passed on to the next one and so on, so you can handle it anywhere further down the graph.
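For example, in a BLS node you can inspect the response of an upstream model and decide what to do with its error. A minimal sketch, assuming the Python backend; the model and tensor names are made up:

# Inside a BLS model's execute() method (Python backend).
import triton_python_backend_utils as pb_utils


def call_segmentation(image_tensor):
    infer_request = pb_utils.InferenceRequest(
        model_name="segmentation",
        requested_output_names=["MASK"],
        inputs=[image_tensor],
    )
    infer_response = infer_request.exec()

    # An error raised in the upstream node travels with its response,
    # so the caller decides whether to re-raise it or recover.
    if infer_response.has_error():
        raise pb_utils.TritonModelException(infer_response.error().message())

    return pb_utils.get_output_tensor_by_name(infer_response, "MASK")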

Defining the processing graph

Triton has two options for connecting your nodes: ensembling (for not so complicated graphs) and BLS, Business Logic Scripting (for example, when the choice of the next node depends on the current result). That not only solves the problem of multiple inputs to a model, but also standardizes the way you define the information flow. I believe it makes the code easier for other developers to understand.
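An ensemble, for instance, is declared right in config.pbtxt. A small sketch (the model and tensor names are invented):

name: "pipeline"
platform: "ensemble"
input [
  {
    name: "RAW_IMAGE"
    data_type: TYPE_UINT8
    dims: [ -1, -1, 3 ]
  }
]
output [
  {
    name: "MASK"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map { key: "INPUT" value: "RAW_IMAGE" }
      output_map { key: "OUTPUT" value: "PREPROCESSED" }
    },
    {
      model_name: "segmentation"
      model_version: -1
      input_map { key: "INPUT" value: "PREPROCESSED" }
      output_map { key: "OUTPUT" value: "MASK" }
    }
  ]
}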

Limiting the max in parallel

Triton has a rate limiter feature. You define resources and how much of each resource every model instance uses. If less of a resource is free than a model needs, Triton won't start that node until enough is released.
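A sketch of how this looks in config.pbtxt (the resource name and numbers are made up; the server itself has to be started with rate limiting enabled, e.g. --rate-limit=execution_count --rate-limit-resource=gpu_mem_mb:8000):

instance_group [
  {
    count: 1
    kind: KIND_GPU
    rate_limiter {
      resources [
        {
          name: "gpu_mem_mb"
          count: 2000
        }
      ]
      priority: 1
    }
  }
]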

Health check

Triton has a built-in health check, so you can check whether the server is live and whether every model is ready.
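The readiness endpoints are exposed over HTTP/gRPC, so a check is a one-liner (assuming the default HTTP port 8000 and a made-up model name):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print(client.is_server_live())                 # GET /v2/health/live
print(client.is_server_ready())                # GET /v2/health/ready
print(client.is_model_ready("segmentation"))   # GET /v2/models/segmentation/ready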

Docker images

Triton has ready-to-use Docker images. There are even images that include only a few backends (like PyTorch or TensorFlow) to reduce the image size.
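Starting the server is basically one command (the image tag and paths are examples; 8000/8001/8002 are the HTTP, gRPC and metrics ports):

docker run --gpus=all --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tritonserver:23.10-py3 \
  tritonserver --model-repository=/models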

What was annoying:

Strict model repository structure

Triton expects you to place all the models in a specific format:

└── model_repository
    ├── some_model
    │   ├── 1
    │   │   └── model.py
    │   └── config.pbtxt
    └── another_model
        ├── 1
        │   └── model.py
        ├── 2
        │   └── model.py
        └── config.pbtxt

You cannot organize a nested structure because all the models must be at the same level. You must also create a folder for the model version (the numbered one), even if there is no model file inside it (as with ensemble models).

Lazy loading of models

All models are lazy loaded. A model is completely loaded only when the first request is received, which causes delays on that request. You have to warm it up manually.
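Warmup can at least be configured per model in config.pbtxt. A sketch (the input name and shape are made up; zero_data just sends zero-filled tensors at load time):

model_warmup [
  {
    name: "warmup_sample"
    batch_size: 1
    inputs {
      key: "INPUT__0"
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 512, 512 ]
        zero_data: true
      }
    }
  }
]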

RAM consumption

Triton uses CUDA memory really sparingly, but I can't say the same about RAM. Not only does Triton use a lot of it, it also doesn't want to release it.

NVIDIA only

Triton works only with NVIDIA GPUs.

Documentation in GitHub

Triton doesn't have a dedicated documentation website. All the documentation is in a pile of .md files spread across several GitHub repositories.

Conclusion

In my opinion, NVIDIA Triton Inference Server is a good framework for serving your models. It has all the features needed to build complex processing graphs. Most importantly, it is rapidly developing and has a large community.