Running parallel Ollama inference

An easy way to run inference in parallel against multiple Ollama instances, all on the same machine.

Problem

Ollama is built around a single backend: even on a well-specced system, every new request is queued and processed one at a time, so that a single query gets maximum use of the available resources.

That is fine for personal laptop usage. But on a powerful machine with tens of GB of GPU memory and multiple users running inference at the same time, this queueing becomes a bottleneck.
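
A rough way to observe this sequential behaviour on a single Ollama instance (a sketch, assuming a local instance on the default port 11434 with llama2 already pulled): both requests are fired together, but the second one finishes roughly one full generation after the first.

# Fire two requests at the same instance at the same time and compare finish times
for i in 1 2; do
  ( curl -s http://localhost:11434/api/generate \
      -d '{"model": "llama2", "prompt": "Write a haiku about queues", "stream": false}' > /dev/null
    echo "request $i finished at $(date +%T)" ) &
done
wait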

Solution

Run multiple Ollama instances as isolated containers using Docker.

References:

  • announcing Docker containers (Ollama blog)

Approach 1

Use Docker Swarm.

docker swarm init
docker service create --replicas 10 \
    --name ollama \
    --constraint 'node.labels.gpu==true' \
    --mount type=volume,source=ollama,target=/root/.ollama \
    --publish published=11434,target=11434 \
    ollama/ollama

This creates 10 replicas managed by Docker Swarm.

The base URL is the same as for a single instance, so you can connect to the service exactly as you would to a regular standalone Ollama instance.
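
The swarm routing mesh load-balances incoming requests across the replicas, so clients keep talking to the single published port. For example (a sketch, assuming the default port 11434 and that the llama2 model has already been pulled into the shared volume):

curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'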

Problem: this approach doesn't have GPU support.

Approach 2

Manage everything ourselves, with a tiny bit of help from LiteLLM.

Steps:

  1. Write a script to create the specified number (n) of Ollama containers, exposed on ports 11434+1 through 11434+n.

  2. Ensure that all of them use the same Docker volume to store models, so as to save disk space.

  3. Create a LiteLLM proxy server using a custom config file (generated automatically by the create script).

  4. Connect to this LiteLLM proxy just as you would connect to a single Ollama instance. The end-to-end flow is sketched right after this list.
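
Tying the steps together, the flow looks roughly like this (a sketch, assuming the scripts below are saved next to each other and LiteLLM is installed with pip install 'litellm[proxy]'; port 8000 is chosen to match test.sh, and newer LiteLLM versions may default to a different port):

# Steps 1 & 2: spin up the containers; create.sh also writes config.yml
bash create.sh

# Step 3: start the LiteLLM proxy on top of the generated config
litellm --config config.yml --port 8000

# Step 4: in another shell, talk to the proxy as if it were a single Ollama instance
bash test.sh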

#!/bin/bash

# create.sh - create a given number of Ollama containers running llama2
# and generate the matching LiteLLM config file

CONFIG_FILE="config.yml"
VOLUME_NAME="ollama_volume"
MODEL="llama2"

echo "model_list:" > $CONFIG_FILE

# Define function to create and manage ollama containers
create_ollama_container() {
    local container_name="ollama$1"
    local port=$2

    # Check if the container with the same name exists, if yes, remove it
    if docker ps -a --format '{{.Names}}' | grep -q "^$container_name$"; then
        echo "Removing existing container: $container_name"
        docker rm -f $container_name
    fi

    # Check if another container already publishes this port; if yes, remove it
    existing=$(docker ps -a --format '{{.Names}} {{.Ports}}' | grep "0.0.0.0:$port->" | awk '{print $1}')
    if [ -n "$existing" ]; then
        echo "Removing existing container on port $port"
        docker rm -f $existing
    fi

    # Create and run the new container
    docker run -d --gpus=all \
        -v $VOLUME_NAME:/root/.ollama \
        -p $port:11434 \
        --name $container_name \
        ollama/ollama

    echo "Created container: $container_name on port $port"


    # Add entry to the config file
    echo "  - model_name: $MODEL" >> $CONFIG_FILE
    echo "    litellm_params:" >> $CONFIG_FILE
    echo "        model: ollama/$MODEL" >> $CONFIG_FILE
    echo "        api_base: http://localhost:$port" >> $CONFIG_FILE

}

# Create n different ollama containers with unique names and ports
N=${1:-6}   # number of containers; defaults to 6, override with: ./create.sh <n>
for i in $(seq 1 "$N"); do
    port=$((11434 + i))
    create_ollama_container "$i" "$port"
done
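
After the containers are up, the model still needs to be present in the shared volume (unless it was pulled earlier); pulling it through any one container makes it visible to all of them. A quick sketch, assuming the default of six containers and the names/ports used in create.sh above:

# Pull the model once; every container sees it via ollama_volume
docker exec ollama1 ollama pull llama2

# Sanity check: each instance should answer on Ollama's /api/tags endpoint
for i in {1..6}; do
    port=$((11434 + i))
    echo -n "ollama$i (port $port): "
    curl -s -o /dev/null -w "%{http_code}\n" "http://localhost:$port/api/tags"
done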

destroy.sh

#!/bin/bash
# destroy.sh - stop and remove the ollama containers created by create.sh

docker stop $(docker ps -a -q --filter "name=ollama") && docker rm $(docker ps -a -q --filter "name=ollama")

test.sh

#!/bin/bash
# test.sh - fire parallel requests at the LiteLLM proxy fronting the llama2 instances

rm -f file.txt && echo "file.txt cleared"

# Define the command to be executed
curl_command="curl --location 'http://0.0.0.0:8000/chat/completions' --header 'Content-Type: application/json' --data ' {
  \"model\": \"llama2\",
  \"messages\": [
    {
      \"role\": \"user\",
      \"content\": \"tell a long story about unicorns\"
    }
  ]
}' | jq .choices[0].message.content  >> file.txt"

# Run the command n times in parallel
for i in {1..3}; do
    eval "$curl_command" && echo "Command $i succeeded." &  # Run the command in the background
done

# Wait for all background processes to finish
wait

echo "All commands completed."

Sample config.yml (generated by create.sh)

model_list:
  - model_name: llama2
    litellm_params:
        model: ollama/llama2
        api_base: http://localhost:11435
  - model_name: llama2
    litellm_params:
        model: ollama/llama2
        api_base: http://localhost:11436
  - model_name: llama2
    litellm_params:
        model: ollama/llama2
        api_base: http://localhost:11437
  - model_name: llama2
    litellm_params:
        model: ollama/llama2
        api_base: http://localhost:11438
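
Once the proxy is running with this config, a quick way to confirm it has loaded the entries is to ask it which models it serves (assuming the proxy listens on port 8000 as in test.sh; the exact response shape depends on the LiteLLM version):

curl -s http://0.0.0.0:8000/v1/models | jq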
