Running parallel Ollama inference
An easy way to run parallel inference with multiple Ollama instances, all running on the same machine
Problem
Ollama is built around a single-backend concept.
Even on a system with good specs, every new request is processed through a queue so that each individual query gets the maximum available resources.
This is useful on a personal laptop. But if you have a powerful machine with tens of gigabytes of GPU RAM, and multiple users running inference at the same time, it becomes a problem.
Solution
Run multiple Ollama instances as isolated containers using Docker.
References:
Ollama blog announcing docker containers
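For context, a single GPU-enabled Ollama container is started like this with the official image (the container and volume names follow the Ollama docs; both approaches below build on this):

```bash
# One standalone Ollama container with GPU access, models stored in a named volume.
docker run -d --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
```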
Approach 1
Use Docker Swarm
```bash
docker swarm init

docker service create --replicas 10 \
  --name ollama \
  --constraint 'node.labels.gpu==true' \
  --mount type=volume,source=ollama,target=/root/.ollama \
  --publish published=11434,target=11434 \
  ollama/ollama
```

This creates 10 replicas managed by Docker Swarm.
The base URL is the same as that of a single instance, so you can connect to it just like connecting to a regular standalone Ollama instance.
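For instance, any client that speaks the Ollama API can simply point at the published port, and Swarm's routing mesh spreads incoming connections across the replicas. The model name below is just an illustration and must already be available in the shared volume:

```bash
# Talks to the swarm service exactly like a standalone Ollama server.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?"
}'
```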
Problem: this approach does not have GPU support.
Approach 2
Manage everything ourselves with a tiny bit of help from litellm
Steps:
Write a script that creates the specified number (n) of Ollama containers, exposed on ports 11434 to 11434+n (a sketch of such a create script follows this list).
Ensure that all of them use the same volume to store the models, so that each model is downloaded and stored only once.
Create a litellm proxy server using a custom config file (generated automatically by the create script; a sample is shown further below).
Connect to this litellm proxy just as you would connect to a single Ollama instance.
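A minimal sketch of such a create script, assuming the containers are named ollama-0 … ollama-(n-1) and a single example model (llama3 here); this is not the author's exact script:

```bash
#!/usr/bin/env bash
# create.sh <n> -- hypothetical sketch, not the original script.
# Starts <n> GPU-enabled Ollama containers on ports 11434, 11435, ... sharing
# one model volume, and generates a litellm config that routes across them.
set -euo pipefail

N="${1:-2}"                  # number of Ollama instances
MODEL="${MODEL:-llama3}"     # assumed model; pull it once into the shared volume
CONFIG="config.yaml"

echo "model_list:" > "$CONFIG"

for i in $(seq 0 $((N - 1))); do
  PORT=$((11434 + i))

  # Every container mounts the same named volume, so models are stored once.
  docker run -d \
    --gpus=all \
    --name "ollama-$i" \
    -v ollama:/root/.ollama \
    -p "$PORT":11434 \
    ollama/ollama

  # Entries sharing the same model_name are load-balanced by litellm.
  cat >> "$CONFIG" <<EOF
  - model_name: $MODEL
    litellm_params:
      model: ollama/$MODEL
      api_base: http://localhost:$PORT
EOF
done

echo "Started $N containers and wrote $CONFIG"
```

The proxy itself is then started with `litellm --config config.yaml`, and clients talk to the proxy's port (4000 by default) instead of an individual Ollama port.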
destroy.sh
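A matching teardown script might look like the following (hypothetical sketch; it assumes the ollama-0 … ollama-(n-1) naming used in the create sketch above and keeps the shared model volume):

```bash
#!/usr/bin/env bash
# destroy.sh <n> -- hypothetical sketch paired with the create script above.
# Removes the ollama-0 .. ollama-(n-1) containers; the shared model volume is
# left in place so models do not have to be pulled again next time.
set -euo pipefail

N="${1:-2}"

for i in $(seq 0 $((N - 1))); do
  docker rm -f "ollama-$i" || true
done
```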
test.sh
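A small smoke test could fire a handful of concurrent requests through the proxy (hypothetical sketch; it assumes the proxy listens on localhost:4000 and serves a model named llama3):

```bash
#!/usr/bin/env bash
# test.sh -- hypothetical sketch: send a few requests in parallel through the
# litellm proxy, which spreads them across the Ollama containers.
set -euo pipefail

PROXY="${PROXY:-http://localhost:4000}"
MODEL="${MODEL:-llama3}"

for i in 1 2 3 4; do
  curl -s "$PROXY/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"Request $i: say hello\"}]}" &
done
wait
```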
Sample config.yaml
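For illustration, a generated config for two instances might look like this (model name and ports are assumptions; entries with the same model_name tell litellm to balance between the two api_base endpoints):

```yaml
model_list:
  - model_name: llama3
    litellm_params:
      model: ollama/llama3
      api_base: http://localhost:11434
  - model_name: llama3
    litellm_params:
      model: ollama/llama3
      api_base: http://localhost:11435
```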