Different options for doing LLM inference

Note formatted by Claude from my voicenote transcription, about the different options I've explored for hosting gen-AI models.

Intro

While many developers default to using OpenAI's APIs for large language model (LLM) inference, there are numerous alternatives available that offer different advantages in terms of cost, performance, and flexibility. This guide explores various options for LLM inference, including both self-hosted and cloud-based solutions, as well as serverless and traditional hosting approaches.

Cloud-Based Solutions

1. Replicate

  • Pros:

    • Offers thousands of models, including popular and fine-tuned versions

    • Supports custom model deployment (both public and private)

    • Easy to get started quickly

  • Cons:

    • Initial pricing may be higher compared to other options

  • Recommended if you want to build a prototype quickly for a startup (see the minimal example below)
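
A minimal sketch of calling a hosted model through Replicate's Python client. It assumes REPLICATE_API_TOKEN is set and uses Llama 3 8B Instruct as an example model slug; swap in whichever model you actually want:

```python
# pip install replicate
# Assumes REPLICATE_API_TOKEN is set in the environment.
import replicate

# For language models, replicate.run returns an iterator of output tokens.
output = replicate.run(
    "meta/meta-llama-3-8b-instruct",  # example model slug
    input={"prompt": "Explain LLM inference in one sentence."},
)
print("".join(output))
```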

2. Modal.com

  • Pros:

    • Affordable pricing (e.g., 10 applications for $9 on GPUs)

    • Serverless architecture

  • Cons:

    • Cold start issues, which can lead to longer initial processing times

  • Based on my experience with Indic Subtitler, this is cheap, but on the free tier you will see noticeable lag if a large model has to be loaded into memory every time the serverless container boots up (see the sketch below)
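
As a rough sketch of what a serverless GPU function on Modal looks like (the app name, image setup, and model here are my own illustrative assumptions), the cold-start cost mentioned above comes from the model load that runs whenever a fresh container spins up:

```python
# pip install modal   (then `modal setup` to authenticate, and `modal run this_file.py`)
import modal

app = modal.App("llm-inference-demo")  # hypothetical app name

# Container image with the inference dependencies baked in.
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.function(gpu="A10G", image=image)
def generate(prompt: str) -> str:
    # Runs inside the remote container. On a cold start the model weights
    # are loaded from scratch, which is where the extra latency comes from.
    from transformers import pipeline

    pipe = pipeline("text-generation", model="gpt2")  # tiny model for illustration
    return pipe(prompt, max_new_tokens=50)[0]["generated_text"]

@app.local_entrypoint()
def main():
    print(generate.remote("Serverless GPU inference is"))
```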

3. Hugging Face Spaces

  • Pros:

    • Always running, no cold start issues

    • Exposes an API via Gradio, callable from Python (see the sketch below)

  • Cons:

    • No pay-as-you-go pricing
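
Because a Space exposes its Gradio app as an API, you can call it from Python with gradio_client. The Space name and endpoint below are placeholders for whatever Space you deploy:

```python
# pip install gradio_client
from gradio_client import Client

client = Client("username/my-llm-space")  # placeholder Space name

# The api_name depends on how the Gradio app defines its endpoints;
# "/predict" is the usual default for a single-function app.
result = client.predict("Explain LLM inference briefly.", api_name="/predict")
print(result)
```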

4. Fal.ai

  • Offers serverless inference options

  • Compare it with Replicate before deciding

5. Groq

  • Pros:

    • Significantly cheaper than OpenAI (e.g., roughly 10x lower cost for the Whisper model; see the example call after this list)

    • Faster processing due to specialized LPU chips

  • Cons:

    • Limited model availability (currently offering Whisper v3, Llama 2, Llama 3, Gemma, Mixtral)

    • Requires contacting for premium trial, with potential delays as of now
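
Groq's Python SDK follows the OpenAI chat-completions shape, so switching over is mostly a matter of changing the client and the model name. The model ID below is one of their hosted Llama 3 variants:

```python
# pip install groq
# Assumes GROQ_API_KEY is set in the environment.
from groq import Groq

client = Groq()
response = client.chat.completions.create(
    model="llama3-8b-8192",  # one of Groq's hosted models
    messages=[{"role": "user", "content": "Why are LPU chips fast for inference?"}],
)
print(response.choices[0].message.content)
```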

6. AWS Options

  • Bedrock: Umbrella service that exposes various foundation models behind a single managed API (example call below)

  • SageMaker: For model training and deployment
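
For Bedrock, inference goes through the bedrock-runtime client in boto3. A rough sketch using a Meta Llama model ID; the exact model ID and request-body format depend on which model you have enabled in your account:

```python
# pip install boto3   (with AWS credentials and Bedrock model access configured)
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Model ID is an assumption; check which models are enabled in your account.
response = client.invoke_model(
    modelId="meta.llama3-8b-instruct-v1:0",
    body=json.dumps({"prompt": "Explain LLM inference briefly.", "max_gen_len": 128}),
)
result = json.loads(response["body"].read())
print(result["generation"])
```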

7. NVIDIA NIM (New)

  • Pros:

    • Packages LLMs as containers (a Docker/Kubernetes-style approach)

    • Allows local development and easy cloud deployment

    • Simple inference, similar to OpenAI APIs

    • Supports local testing and cloud deployment

  • Cons:

    • Currently not open for everyone (requires filling out a form to try)

  • How it works:

    • Containerize your LLM (e.g., an open-weights model such as Llama 3)

    • Develop locally

    • Push to the cloud, which spins up a dedicated container in which the model runs (see the sketch below)
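
Because a running NIM container exposes an OpenAI-compatible endpoint, local inference against it looks like a normal OpenAI call pointed at localhost. The container image, port, and model name below follow NVIDIA's published examples, but treat them as assumptions for your specific model:

```python
# A NIM container is started with Docker, roughly like:
#   docker run --gpus all -p 8000:8000 nvcr.io/nim/meta/llama3-8b-instruct
# (the exact image name and tag come from the NVIDIA NGC catalog)

# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")
response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[{"role": "user", "content": "Hello from a local NIM container!"}],
)
print(response.choices[0].message.content)
```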

Self-Hosted Solutions

1. ExLlamaV2

  • Pros:

    • One of the fastest open-source LLM inference options

    • Supports quantized models

    • Uses its own EXL2 quantization format (GPTQ models are also supported)

  • Cons:

    • Cannot offload to CPU; requires good VRAM

2. vLLM

  • Pros:

    • High-performance, high-throughput inference (see the example below)

  • Cons:

    • Limited support for quantized models compared to llama.cpp
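
A minimal offline-inference sketch with vLLM's Python API (the model name is just an example; vLLM can also serve an OpenAI-compatible HTTP endpoint via python -m vllm.entrypoints.openai.api_server):

```python
# pip install vllm   (requires a CUDA-capable GPU)
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # example model
params = SamplingParams(temperature=0.8, max_tokens=128)

outputs = llm.generate(["Explain LLM inference briefly."], params)
for output in outputs:
    print(output.outputs[0].text)
```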

3. llama.cpp

  • Pros:

    • Supports quantized models

    • Ships an HTTP server for exposing the model over an API (see the example below)
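
The HTTP server is started from the command line and can then be hit from any HTTP client. The binary name and the /completion endpoint below match recent llama.cpp builds, but check your version (older builds named the binary ./server):

```python
# Start the server first, for example:
#   ./llama-server -m ./models/model.gguf --port 8080

import requests

response = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "Explain LLM inference briefly.", "n_predict": 128},
)
print(response.json()["content"])
```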

4. BentoML

  • Supports serverless inference

  • Kubernetes-based

  • Open Source

5. Seldon.io

  • Kubernetes-based open-source solution for scaling models

  • Alternative to BentoML

6. Ollama

  • Pros:

    • One of the easiest ways to run models locally

    • Supports quantized GGUF-format models, easily pullable from the Ollama model hub

    • Offers both CLI and API interfaces

    • Recently added support for concurrent requests and running models in parallel

  • How it works:

    • Pull models from Ollama hub

    • Run locally with simple commands

    • Use the CLI or the local REST API for inference (see the example below)
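
The same flow in commands and code, using the local REST API that Ollama serves on port 11434 (the model name is whatever you pulled):

```python
# First, from the CLI:
#   ollama pull llama3
#   ollama run llama3        # interactive chat in the terminal
#
# Then, against the local REST API (default port 11434):
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Explain LLM inference briefly.", "stream": False},
)
print(response.json()["response"])
```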

Considerations When Choosing an LLM Inference Solution

  1. Cost: Compare pricing structures, especially for long-term usage.

  2. Performance: Consider factors like cold start times and inference speed.

  3. Scalability: Evaluate how well the solution can handle increasing loads.

  4. Flexibility: Check support for custom models and fine-tuning capabilities.

  5. Ease of Use: Consider the learning curve and integration complexity.

  6. Hardware Requirements: For self-hosted solutions, ensure you have the necessary hardware (e.g., GPUs with sufficient VRAM).

Conclusion

While OpenAI's APIs are popular, exploring alternative LLM inference options can lead to cost savings, improved performance, or better alignment with specific project requirements. For those just starting, cloud-based solutions like Replicate or Modal.com offer an easy entry point. As your needs grow or become more specific, you might consider transitioning to more comprehensive services like AWS or self-hosted solutions for greater control and customization.

For local development and testing, tools like Ollama provide an excellent balance of ease of use and flexibility. NVIDIA NIM presents an interesting containerized approach that could bridge the gap between local development and cloud deployment.

Remember to regularly reassess your chosen solution as the field of AI and LLMs is rapidly evolving, with new options and improvements emerging frequently.
