An easy way to run parallel inference with multiple Ollama instances on the same machine
Problem
Ollama is built around a single-backend design.
Even on a well-specced system, every new request is queued and processed one at a time so that a single query gets the maximum available resources.
This is sensible for a personal laptop. But on a powerful machine with tens of gigabytes of GPU memory and multiple users running inference at the same time, it becomes a bottleneck.
Solution
Run multiple Ollama instances as isolated containers using Docker.
The base URL stays the same as for a single instance, so you can connect to it just like a regular standalone Ollama instance.
Problem: this approach doesn't have GPU support.
Approach 2
Manage everything ourselves with a tiny bit of help from litellm
Steps:
Write a script to create the specified number (n) of Ollama containers, listening on ports 11434+1 through 11434+n.
Ensure that all of them use the same Docker volume to store the models, so the model weights only need to be stored once.
Start a LiteLLM proxy server using a custom config file (generated automatically by the create script; an example of the generated file is shown after create.sh below).
Connect to this LiteLLM proxy just as you would connect to a single Ollama instance.
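Before diving into the scripts, here is the overall flow. This is a sketch that assumes the scripts below are saved as create.sh, test.sh, and destroy.sh, that the LiteLLM proxy is installed (e.g. pip install 'litellm[proxy]'), and that the first container is named ollama1 as in create.sh:
# 1. Spin up the ollama containers and generate config.yml
bash create.sh
# 2. Pull the model once -- every container reads from the same shared volume
docker exec ollama1 ollama pull llama2
# 3. Start the LiteLLM proxy, pinned to port 8000 (the port test.sh expects)
litellm --config config.yml --port 8000
# 4. In another terminal, fire parallel requests
bash test.sh
# 5. Tear everything down when finished
bash destroy.sh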
create.sh
#!/bin/bash
# create.sh - to create a given set of docker containers running llama2
CONFIG_FILE="config.yml"
VOLUME_NAME="ollama_volume"
MODEL="llama2"
echo "model_list:" > $CONFIG_FILE
# Define function to create and manage ollama containers
create_ollama_container() {
  local container_name="ollama$1"
  local port=$2

  # If a container with the same name already exists, remove it
  if docker ps -a --format '{{.Names}}' | grep -q "^$container_name$"; then
    echo "Removing existing container: $container_name"
    docker rm -f "$container_name"
  fi

  # If another container already publishes the target port, remove it too
  local existing
  existing=$(docker ps -a --format '{{.Names}} {{.Ports}}' | grep "0.0.0.0:$port->" | awk '{print $1}')
  if [ -n "$existing" ]; then
    echo "Removing existing container on port $port"
    docker rm -f $existing
  fi

  # Create and run the new container with GPU access and the shared model volume
  docker run -d --gpus=all \
    -v $VOLUME_NAME:/root/.ollama \
    -p $port:11434 \
    --name $container_name \
    ollama/ollama
  echo "Created container: $container_name on port $port"

  # Add an entry for this backend to the LiteLLM config file
  echo "  - model_name: $MODEL" >> $CONFIG_FILE
  echo "    litellm_params:" >> $CONFIG_FILE
  echo "      model: ollama/$MODEL" >> $CONFIG_FILE
  echo "      api_base: http://localhost:$port" >> $CONFIG_FILE
}
# Create n (here 6) ollama containers with unique names on ports 11435-11440
for i in {1..6}; do
  port=$((11434 + i))
  create_ollama_container $i $port
done
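For reference, with the defaults above (six containers, MODEL=llama2) the generated config.yml should look roughly like the snippet below; you never write it by hand. Because every entry shares the same model_name, the LiteLLM proxy load-balances requests for "llama2" across all of the api_base endpoints:
model_list:
  - model_name: llama2
    litellm_params:
      model: ollama/llama2
      api_base: http://localhost:11435
  - model_name: llama2
    litellm_params:
      model: ollama/llama2
      api_base: http://localhost:11436
  # ... one entry per container, up to http://localhost:11440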
destroy.sh
#!/bin/bash
# destroy.sh - stop and remove ALL docker containers on this machine (not just the ollama ones)
docker stop $(docker ps -aq) && docker rm $(docker ps -aq)
test.sh
#!/bin/bash
# test.sh - make parallel requests to the llama2 instances via the LiteLLM proxy
rm -f file.txt && echo "file.txt cleared"
# Define the command to be executed
curl_command="curl --location 'http://0.0.0.0:8000/chat/completions' --header 'Content-Type: application/json' --data ' {
\"model\": \"llama2\",
\"messages\": [
{
\"role\": \"user\",
\"content\": \"tell a long story about unicorns\"
}
]
}' | jq .choices[0].message.content >> file.txt"
# Run the command n (here 3) times in parallel, each in the background
for i in {1..3}; do
  eval "$curl_command" && echo "Command $i succeeded." &
done
# Wait for all background processes to finish
wait
echo "All commands completed."