An easy way to run parallel inference with multiple Ollama instances on the same machine
Problem
Ollama is built around a single-backend design.
Even on a well-specced system, every new request is queued and processed one at a time, so that each individual query gets maximum resource usage.
This is sensible for a personal laptop. But on a powerful machine with tens of GB of GPU RAM and multiple users running inference at the same time, it becomes a bottleneck.
Solution
Run multiple Ollama instances as isolated containers using Docker.
The base URL is the same as that of a single instance, so you can connect to it just like you would connect to a regular standalone Ollama instance.
Problem: this approach doesn't have GPU support.
Approach 2
Manage everything ourselves, with a tiny bit of help from LiteLLM.
Steps:
Write a script to create the specified number (n) of Ollama containers, one per port from 11434+1 to 11434+n.
Ensure that all of them use the same volume to store the models, so that disk space is shared rather than duplicated.
Create a LiteLLM proxy server using a custom config file (generated automatically by the create script); an example of the generated config and the proxy launch command is sketched after this list.
Connect to this LiteLLM proxy just as you would connect to a single Ollama instance.
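For reference, with two instances the generated config.yml would look roughly like this (the entries mirror the echo statements in create.sh below). LiteLLM treats entries that share a model_name as one group and spreads requests across their api_base endpoints. The proxy can then be started with the litellm CLI; the exact flags can vary between versions, so treat the command as a sketch (port 8000 is assumed here because test.sh below points at it):

model_list:
  - model_name: llama2
    litellm_params:
      model: ollama/llama2
      api_base: http://localhost:11435
  - model_name: llama2
    litellm_params:
      model: ollama/llama2
      api_base: http://localhost:11436

# install the proxy extras and start the proxy on port 8000
pip install 'litellm[proxy]'
litellm --config config.yml --port 8000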
create.sh

#!/bin/bash
# create.sh - create a given number of docker containers running llama2

CONFIG_FILE="config.yml"
VOLUME_NAME="ollama_volume"
MODEL="llama2"

echo "model_list:" > $CONFIG_FILE

# Create (or recreate) a single ollama container and register it in the config
create_ollama_container() {
  local container_name="ollama$1"
  local port=$2

  # If a container with the same name exists, remove it
  if docker ps -a --format '{{.Names}}' | grep -q "^$container_name$"; then
    echo "Removing existing container: $container_name"
    docker rm -f $container_name
  fi

  # If a container is already bound to the specified port, remove it
  if docker ps -a --format '{{.Ports}}' | grep -q "0.0.0.0:$port->"; then
    echo "Removing existing container on port $port"
    docker rm -f $(docker ps -a --format '{{.Names}} {{.Ports}}' | grep "0.0.0.0:$port->" | awk '{print $1}')
  fi

  # Create and run the new container, mounting the shared model volume
  docker run -d --gpus=all \
    -v $VOLUME_NAME:/root/.ollama \
    -p $port:11434 \
    --name $container_name \
    ollama/ollama
  echo "Created container: $container_name on port $port"

  # Add an entry for this instance to the litellm config file
  echo "  - model_name: $MODEL" >> $CONFIG_FILE
  echo "    litellm_params:" >> $CONFIG_FILE
  echo "      model: ollama/$MODEL" >> $CONFIG_FILE
  echo "      api_base: http://localhost:$port" >> $CONFIG_FILE
}

# Create n different ollama containers with unique names and ports
for i in {1..6}; do
  port=$((11434+i))
  create_ollama_container $i $port
done
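One detail the script doesn't cover: a freshly created ollama/ollama container has no models in it, so llama2 has to be pulled once. Since every container mounts the same ollama_volume, pulling through any one of them is enough for all of them. A minimal sequence, assuming the script above is saved as create.sh:

chmod +x create.sh
./create.sh
# Pull the model once through ollama1; the shared volume makes it visible to the other containers
docker exec -it ollama1 ollama pull llama2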
destroy.sh
#!/bin/bash
# Script to destroy running docker containers
docker stop $(docker ps -a -q) && docker rm $(docker ps -a -q)
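Note that this stops and removes every container on the host, not just the Ollama ones. If other containers are running on the machine, a more targeted variant using Docker's name filter may be safer:

# Remove only containers whose names contain "ollama"
docker rm -f $(docker ps -aq --filter "name=ollama")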
test.sh
#!/bin/bash
# test.sh - make requests to a given number of llama2 instances

rm -f file.txt && echo "file.txt cleared"

# Define the command to be executed
curl_command="curl --location 'http://0.0.0.0:8000/chat/completions' --header 'Content-Type: application/json' --data '{
  \"model\": \"llama2\",
  \"messages\": [
    {
      \"role\": \"user\",
      \"content\": \"tell a long story about unicorns\"
    }
  ]
}' | jq .choices[0].message.content >> file.txt"

# Run the command n times in parallel
for i in {1..3}; do
  eval "$curl_command" && echo "Command $i succeeded." &  # Run each request in the background
done

# Wait for all background processes to finish
wait
echo "All commands completed."
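To check that the proxy is actually spreading the load, one option is to run the test and watch the containers from a second terminal; docker stats shows live per-container resource usage, and each instance's logs show the requests it served:

bash test.sh
# in another terminal: live resource usage per container
docker stats
# or inspect recent log lines from each instance
for i in {1..6}; do echo "== ollama$i =="; docker logs --tail 5 ollama$i; done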