Running Multiple Open Source LLMs Locally with Ollama

Introduction

Large language models (LLMs) power a wide range of applications, from chatbots to content generation. While cloud-based LLMs are popular, running them locally brings advantages such as enhanced privacy, reduced latency, and greater control over customization. Ollama is a tool designed for exactly this purpose, letting you run open-source LLMs like Mistral, Llama2, and Llama3 on your own machine.

In this blog post, we'll explore how to use Ollama to run multiple open-source LLMs, discuss its basic and advanced features, and provide complete code snippets to build a powerful local LLM setup.

Why Run LLMs Locally?

Running LLMs locally offers several benefits:

  1. Privacy: Local operation ensures that your data stays on your machine, reducing the risk of unauthorized access.

  2. Reduced Latency: Processing occurs locally, leading to faster response times compared to cloud-based LLMs.

  3. Customization: You have full control over the models, allowing for deeper customization and optimization.

Ollama facilitates this local setup, offering a platform to run various open-source LLMs without depending on cloud services.

Getting Started with Ollama

To use Ollama, ensure you meet the following system requirements and set up your environment accordingly.

  • Operating System: Ollama runs on macOS, Windows, and Linux.

  • Hardware: A modern multi-core processor and at least 8 GB of RAM; larger models benefit from 16 GB or more.

Installing Ollama

To install Ollama, follow these steps:

  1. Download Ollama: Visit the Ollama website or the Ollama GitHub repository and download the installer for your platform (a Linux one-liner is shown after these steps).

  2. Create a Virtual Environment: If you plan to follow the Python examples in this post, create a virtual environment to keep their dependencies isolated.

# Create a virtual environment
python -m venv ollama_env
source ollama_env/bin/activate  # On Windows, use `ollama_env\Scripts\activate`
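
If you are on Linux, Ollama also ships an official one-line install script (this assumes curl is available); a quick version check afterwards confirms the CLI is on your PATH:

```bash
# Install Ollama on Linux via the official script
curl -fsSL https://ollama.com/install.sh | sh

# Verify the CLI is available
ollama --version
```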

Installing Dependencies

The Ollama application itself has no Python dependencies, but the examples in this post use the official ollama Python client (and, later, FastAPI). Install the client inside your virtual environment:

# Install the Ollama Python client
pip install ollama

With the client installed, your Python code can talk to the local Ollama server and the open-source LLMs it hosts.

Available Open Source Models

Ollama supports various open-source models, including:

  • Mistral

  • Llama2

  • Llama3

  • Vicuna

  • Gemma

  • Phi-3

To download a model, use the ollama pull command; to chat with it interactively, use ollama run. The ollama Python client provides equivalent calls.

# Download the Llama2 model
ollama pull llama2

# Pull and query Llama2 from Python
import ollama

ollama.pull('llama2')
result = ollama.generate(model='llama2', prompt='Hello!')
print(result['response'])
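
Once a model has been pulled, you can sanity-check it straight from the terminal. ollama run starts an interactive session, and it also accepts a one-shot prompt:

```bash
# Interactive chat session with Llama2
ollama run llama2

# One-shot prompt: prints the answer and exits
ollama run llama2 "Summarize what a large language model is in one sentence."
```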

Building a Simple Chatbot with Ollama

To create a chatbot with Ollama, make sure the model you want has been pulled, then define a function that passes user input to it and returns the generated reply. The example below uses the official ollama Python client.

# Import the Ollama Python client
import ollama

# Model to chat with (must already be pulled locally)
MODEL_NAME = 'llama2'

# Define a function to handle user input and generate responses
def chatbot_response(question):
    response = ollama.chat(
        model=MODEL_NAME,
        messages=[{'role': 'user', 'content': question}],
    )
    return response['message']['content']

# Test the chatbot
user_question = "What's the weather today?"
chatbot_answer = chatbot_response(user_question)

print("Chatbot response:", chatbot_answer)

This code snippet demonstrates a simple chatbot that responds to user questions using the Llama2 model. You can expand this example to build more complex chatbots or integrate it with other applications.
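
The function above is single-turn: it forgets everything between questions. Below is a minimal sketch of a multi-turn loop that keeps the conversation history in a messages list so the model can refer back to earlier turns (the model name and exit keywords are arbitrary choices here):

```python
# Minimal multi-turn chat loop using the ollama client
import ollama

messages = []  # running conversation history

while True:
    user_input = input("You: ")
    if user_input.lower() in {"quit", "exit"}:
        break

    # Add the user's turn, then ask the model for a reply
    messages.append({"role": "user", "content": user_input})
    response = ollama.chat(model="llama2", messages=messages)

    answer = response["message"]["content"]
    # Keep the assistant's reply so the next turn has full context
    messages.append({"role": "assistant", "content": answer})
    print("Chatbot:", answer)
```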

Creating a FastAPI Server with Ollama

You can also put Ollama behind a FastAPI server, letting users choose which open-source model should answer each request (or query several models, as shown later). Here's a complete code snippet for building a FastAPI server with Ollama:

```bash
# Install FastAPI and Uvicorn
pip install fastapi uvicorn
```
```python
# Import necessary libraries
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import ollama

# Create a FastAPI instance
app = FastAPI()

# Define a request model for user input
class UserRequest(BaseModel):
    question: str
    model: str = "llama2"  # Default model

# Endpoint to process user questions and generate responses
@app.post("/ask")
def ask_question(user_request: UserRequest):
    model_name = user_request.model
    question = user_request.question

    # Check if the model is supported
    supported_models = ["llama2", "mistral", "llama3"]
    if model_name not in supported_models:
        raise HTTPException(status_code=400, detail="Model not supported")

    # Ask the selected model via the local Ollama server
    result = ollama.generate(model=model_name, prompt=question)

    return {"model": model_name, "response": result["response"]}

# Run the FastAPI server
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="127.0.0.1", port=8000)
```

The FastAPI server lets users send a question and choose which model should answer it; to compare models, send one request per model. Here's an example request and response where a user selects a model and gets an answer:

# Send a POST request to the FastAPI server
curl -X POST "http://127.0.0.1:8000/ask" -H "Content-Type: application/json" -d '{"question":"What is the capital of France?","model":"llama2"}'

Response:

{
  "model": "llama2",
  "response": "The capital of France is Paris."
}

Here's an example that sends the same question to two different models:

```bash
# Send a POST request with two different models
curl -X POST "http://127.0.0.1:8000/ask" -H "Content-Type: application/json" -d '{"question":"Who was Napoleon Bonaparte?","model":"mistral"}'

curl -X POST "http://127.0.0.1:8000/ask" -H "Content-Type: application/json" -d '{"question":"Who was Napoleon Bonaparte?","model":"llama2"}'
```

Responses:

# Response from Mistral
{
  "model": "mistral",
  "response": "Napoleon Bonaparte was a French military leader and emperor who conquered much of Europe in the early 19th century."
}

# Response from Llama2
{
  "model": "llama2",
  "response": "Napoleon Bonaparte was a French military and political leader who rose to prominence during the French Revolution and became emperor of France."
}

These examples demonstrate how the FastAPI server can handle user requests and provide responses based on the selected model(s).
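
If you want a single request to fan out to several models at once, one option is an extra endpoint that accepts a list of model names and queries each in turn. The /ask-many endpoint below is a hypothetical extension of the app defined above, not part of Ollama itself:

```python
from typing import List

from fastapi import HTTPException
from pydantic import BaseModel
import ollama

# Hypothetical request model: one question, several model names
class MultiModelRequest(BaseModel):
    question: str
    models: List[str] = ["llama2", "mistral"]

# Extra endpoint registered on the same `app` as the server above
@app.post("/ask-many")
def ask_many(request: MultiModelRequest):
    supported_models = ["llama2", "mistral", "llama3"]
    answers = []
    for model_name in request.models:
        if model_name not in supported_models:
            raise HTTPException(status_code=400, detail=f"Model not supported: {model_name}")
        # Query each selected model sequentially via the local Ollama server
        result = ollama.generate(model=model_name, prompt=request.question)
        answers.append({"model": model_name, "response": result["response"]})
    return {"question": request.question, "answers": answers}
```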

Exploring the Ollama API for Advanced Features

The Ollama API offers a rich set of endpoints that allow you to interact with and manage large language models (LLMs) on your local machine. This section covers some of the key features provided by the Ollama API, including generating completions, listing local models, creating models from Modelfiles, and more. The Ollama API typically runs on localhost at port 11434. You can start it by running ollama serve in your terminal or command line.
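
Before calling any endpoint, it's worth confirming that the server is actually listening. A plain GET on the root URL returns a short status message:

```bash
# Start the Ollama server (the desktop app may already be running it for you)
ollama serve

# In another terminal, confirm it responds
curl http://localhost:11434
# Expected output: Ollama is running
```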

Generate a Completion

To generate a completion with a specified model and prompt, use the POST /api/generate endpoint. By default this endpoint streams, so the reply arrives as a series of JSON objects. Here's how to generate a completion:

Request (Streaming)
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'

The response will return a stream of JSON objects, with the final response including additional data about the generation process:

{
  "model": "llama2",
  "created_at": "2023-08-04T08:52:19.385406455-07:00",
  "response": "The",
  "done": false
}

Request (No Streaming)

To receive the response in one reply, you can set stream to false:

curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

This will return a single JSON object with the full response:

{
  "model": "llama2",
  "created_at": "2023-08-04T19:22:45.499127Z",
  "response": "The sky is blue because it is the color of the sky.",
  "done": true,
  "context": [1, 2, 3],
  "total_duration": 5043500667,
  "load_duration": 5025959,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 325953000,
  "eval_count": 290,
  "eval_duration": 4709213000
}

Request (JSON Mode)

To structure the response as a well-formed JSON object, set the format parameter to json. It's crucial to instruct the model to respond in JSON within the prompt:

curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "What color is the sky at different times of the day? Respond using JSON.",
  "format": "json",
  "stream": false
}'

The response field will then contain a well-formed JSON string:

{
  "model": "llama2",
  "created_at": "2023-11-09T21:07:55.186497Z",
  "response": "{\n\"morning\": {\n\"color\": \"blue\"\n},\n\"noon\": {\n\"color\": \"blue-gray\"\n},\n\"afternoon\": {\n\"color\": \"warm gray\"\n},\n\"evening\": {\n\"color\": \"orange\"\n}\n}\n",
  "done": true,
  "context": [1, 2, 3],
  "total_duration": 4648158584,
  "load_duration": 4071084,
  "prompt_eval_count": 36,
  "prompt_eval_duration": 439038000,
  "eval_count": 180,
  "eval_duration": 4196918000
}
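
You don't have to use curl; any HTTP client works. Here's a small sketch using the requests library (assumed to be installed) that reads the streamed JSON objects line by line and stitches the generated text together:

```python
import json

import requests

# Stream a completion from the local Ollama server
payload = {"model": "llama2", "prompt": "Why is the sky blue?"}
with requests.post("http://localhost:11434/api/generate", json=payload, stream=True) as resp:
    resp.raise_for_status()
    parts = []
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            # The final object also carries timing statistics (in nanoseconds)
            print("Total duration (ns):", chunk.get("total_duration"))
    print("".join(parts))
```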

Other Ollama API Endpoints

In addition to generating completions, Ollama offers several other operations for managing models and interacting with the server. The quickest way to use them is through the CLI commands below; each one is backed by a corresponding REST endpoint, and most are also available through the Python client (see the sketch after this list):

  • Create a Model: Use ollama create with a Modelfile to create a model:

      ollama create mymodel -f ./Modelfile
    
  • List Local Models: List all models installed on your machine:

      ollama list
    
  • Pull a Model: Pull a model from the Ollama library:

      ollama pull llama3
    
  • Delete a Model: Remove a model from your machine:

      ollama rm llama3
    
  • Copy a Model: Copy a model to create a new version:

      ollama cp llama3 my-model
    

These commands give you the flexibility to manage and customize the models on your local machine.
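
If you prefer to stay in Python, the ollama client exposes matching calls for most of these operations; a quick sketch:

```python
import ollama

# List the models installed locally (equivalent to `ollama list`)
print(ollama.list())

# Pull a model from the library (equivalent to `ollama pull llama3`)
ollama.pull("llama3")

# Copy a model under a new name (equivalent to `ollama cp llama3 my-model`)
ollama.copy("llama3", "my-model")

# Delete a model (equivalent to `ollama rm my-model`)
ollama.delete("my-model")
```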

Streaming Responses and Conventions

The Ollama API has certain conventions, such as model names following the model:tag format, and durations returned in nanoseconds. Additionally, certain endpoints stream responses, while others return non-streamed responses.

For streaming endpoints, like POST /api/generate, a series of JSON objects is returned as the model generates the response. The final object includes statistics and additional data from the request, such as:

  • total_duration: Time spent generating the response.

  • load_duration: Time spent loading the model.

  • prompt_eval_count: Number of tokens in the prompt.

  • eval_count: Number of tokens in the response.

Understanding these conventions is essential for effectively working with the Ollama API.
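
Since all durations are reported in nanoseconds, a quick conversion keeps the numbers readable. Using the figures from the non-streaming example above:

```python
# Durations in the API response are in nanoseconds
total_duration_ns = 5043500667   # from the example response above
eval_count = 290                 # tokens generated
eval_duration_ns = 4709213000

print(f"Total time: {total_duration_ns / 1e9:.2f} s")
print(f"Tokens per second: {eval_count / (eval_duration_ns / 1e9):.1f}")
```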

Best Practices and Tips for Using Ollama

To ensure optimal performance with Ollama, consider these best practices and tips:

  • Resource Management: Ensure your system has sufficient memory and CPU capacity for running LLMs.

  • Data Security: Since you're running models locally, take measures to protect your data and ensure privacy.

  • Regular Updates: Keep Ollama and your models updated to access the latest features and bug fixes.

If you encounter issues, try these troubleshooting tips:

  • Restart Ollama: Sometimes restarting Ollama can resolve minor issues.

  • Reinstall Dependencies: If you experience problems, reinstalling key dependencies might help.

  • Check the Logs: Reviewing Ollama's server logs can reveal errors or bottlenecks; typical locations are shown below.
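
Where the logs live depends on the platform. On a Linux install that uses the systemd service, the journal is the place to look, while the macOS app writes to a log file under ~/.ollama (these are the documented defaults and may differ on your setup):

```bash
# Linux (systemd service): follow the Ollama server logs
journalctl -u ollama -f

# macOS: the server log written by the desktop app
cat ~/.ollama/logs/server.log
```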

Conclusion

We've covered a lot in this post, from setting up Ollama to diving into the nuts and bolts of running LLMs like Llama3, all on your own computer. The code we went through is your toolkit for building things like chatbots and text analysis tools without sending your data off to a server somewhere else.

By trying out different models and tweaking the settings, you can really make these tools work just right for whatever project you're tackling. And with the Ollama API, you've got even more control to get things running smoothly and keep them that way.

If you found this guide helpful and you're looking to learn more or get the latest tips on using language models and other cool tech, don't forget to follow me. There's always something new to discover and more clever ways to use tools like Ollama, and I'm here to share those with you.

That's it for now! Get out there, start playing around with these models, and see what you can create. Keep an eye out for more posts if you're interested in upping your tech game!

Resources and Further Reading

For additional information and resources on Ollama and open-source LLMs, check out the following:

Ollama Official Website

Ollama GitHub Repository

Llama2 GitHub Repository

These resources offer detailed documentation and community support to help you further explore the capabilities of Ollama and the open-source LLMs it supports.