Running parallel Ollama inference

An easy way to run parallel inference with multiple Ollama instances on the same machine.

Problem

Ollama is built around the concept of a single backend server.

Even on a system with good specs, every new request is queued and processed one at a time, so that each individual query gets maximum resource usage.

This is fine for personal laptop use. But on a powerful machine with tens of gigabytes of GPU RAM and multiple users running inference at the same time, it becomes a bottleneck.

Solution

Run multiple Ollama instances as isolated containers using Docker.

Approach 1

Use Docker Swarm

```bash
docker swarm init
docker service create --replicas 10 \
  --name ollama \
  --constraint 'node.labels.gpu==true' \
  --mount type=volume,source=ollama,target=/root/.ollama \
  --publish published=11434,target=11434 \
  ollama/ollama
```

This creates 10 replicas managed by Docker Swarm.

The base URL is the same as that of a single instance, so you connect to it just like a regular standalone Ollama server; Swarm's ingress routing mesh distributes incoming requests across the replicas.
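
For example, a request to the published port looks exactly like a request to a standalone server (assuming a model such as `llama3` has already been pulled; the model name here is illustrative):

```bash
# The Swarm routing mesh forwards this to whichever replica it picks.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```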

Problem: this approach has no GPU support, since Docker Swarm services cannot easily pass GPUs through to the replicas.

Approach 2

Manage everything ourselves, with a tiny bit of help from LiteLLM.

Steps:

  1. Write a script that creates the specified number (n) of Ollama containers, publishing them on consecutive host ports starting at 11434 (a sketch of such a script follows this list).

  2. Ensure that all of them use the same volume to store the models, so the model weights are not duplicated on disk.

  3. Create a LiteLLM proxy server using a custom config file (generated automatically by the create script).

  4. Connect to this LiteLLM proxy just as you would connect to a single Ollama instance.
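
The create script itself is not reproduced here, but a minimal sketch could look like the following. It assumes the NVIDIA container toolkit is available for `--gpus all`, names the containers `ollama-0` … `ollama-(n-1)`, and serves `llama3`; these choices are illustrative, not taken from the original script.

```bash
#!/usr/bin/env bash
# create.sh (sketch): start N Ollama containers on consecutive host ports,
# all sharing one model volume, and emit a LiteLLM config listing each one.
set -euo pipefail

N="${1:-4}"          # number of instances to start
BASE_PORT=11434
CONFIG=config.yaml

echo "model_list:" > "$CONFIG"

for i in $(seq 0 $((N - 1))); do
  PORT=$((BASE_PORT + i))

  docker run -d \
    --gpus all \
    --name "ollama-$i" \
    -v ollama:/root/.ollama \
    -p "$PORT:11434" \
    ollama/ollama

  # Every entry reuses the same model_name, which is how LiteLLM
  # load-balances requests across the instances.
  cat >> "$CONFIG" <<EOF
  - model_name: llama3
    litellm_params:
      model: ollama/llama3
      api_base: http://localhost:$PORT
EOF
done
```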

destroy.sh
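
A matching teardown sketch, assuming the `ollama-<i>` naming convention from the create script above:

```bash
#!/usr/bin/env bash
# destroy.sh (sketch): stop and remove every container started by create.sh.
set -euo pipefail

for c in $(docker ps -aq --filter "name=ollama-"); do
  docker rm -f "$c"
done

# The shared model volume is kept so models need not be re-downloaded;
# uncomment to remove it as well.
# docker volume rm ollama
```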

test.sh
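
A small smoke test could fire a few concurrent requests through the proxy (assuming LiteLLM is listening on its default port 4000 and the `llama3` model_name from the generated config):

```bash
#!/usr/bin/env bash
# test.sh (sketch): send several concurrent requests through the LiteLLM
# proxy to confirm they are spread across the Ollama instances.
set -euo pipefail

PROXY="${PROXY:-http://localhost:4000}"

for i in $(seq 1 5); do
  curl -s "$PROXY/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"model": "llama3", "messages": [{"role": "user", "content": "Say hello"}]}' &
done
wait
```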

Sample config.yaml
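
The generated file is not included here; a hypothetical sample for two instances on ports 11434 and 11435 could look like this (repeating the same model_name across entries is what lets LiteLLM load-balance between them):

```yaml
model_list:
  - model_name: llama3
    litellm_params:
      model: ollama/llama3
      api_base: http://localhost:11434
  - model_name: llama3
    litellm_params:
      model: ollama/llama3
      api_base: http://localhost:11435
```

The proxy is then started with `litellm --config config.yaml` and exposes an OpenAI-compatible endpoint that clients can point at instead of a single Ollama server.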
