# Deploy Chroma DB on AWS EC2

### Introduction

ChromaDB is an open-source vector database for storing, managing, and retrieving embeddings. Its core feature is semantic search: given a query, it returns the stored items whose embeddings are closest to it, which is useful in applications such as NLP and image analysis. This guide walks through hosting ChromaDB on an AWS EC2 instance with Docker and working with it from a Python client.

**Note:** If you're not familiar with ChromaDB and its capabilities, see [this introduction to ChromaDB](https://blog.futuresmart.ai/chromadb-an-open-source-vector-embedding-database).

**Table of Contents:**

* Why do we need to host Chroma DB?
    
* Prerequisites and setting up
    
* Accessing the hosted Chroma DB
    
* Managing Collections in Chroma
    
* Adding Data to a Collection
    
* Querying a Collection
    
* Updating and Deleting Data in a Collection
    
    # Why do we need to host Chroma DB?
    
    1. **Accessibility**: When you host ChromaDB, it becomes accessible from anywhere with an internet connection. You can access your database from your laptop, Google Colab, or multiple applications without worrying about the physical location of your data.
        
    2. **Collaboration**: Hosting ChromaDB allows you to collaborate with others more effectively. You can share access to the hosted database with team members or collaborators, making it easier to work on a project together.
        
    3. **Data Synchronization**: Hosting ChromaDB gives everyone a single, shared copy of the data. You won't need to manually copy data folders between machines whenever something changes; every client reads from and writes to the same server.
        
    4. **Scalability**: Hosting your database on a server provides the flexibility to scale resources as needed. You can accommodate larger datasets or higher traffic without worrying about the limitations of your local machine.
        
    5. **Data Security**: Depending on the hosting service, you can benefit from enhanced security measures, such as data encryption and access controls, to protect your valuable data.
        
    
    # Prerequisites and setting up
    
    To set up ChromaDB on EC2, create a virtual machine, install Docker on it, and run the official Chroma Docker image:
    
    1. **Set Up a Virtual Machine (VM)**:
        
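        * Create a VM (an EC2 instance) on AWS (Amazon Web Services). The commands below assume an Ubuntu image.
            
        * Choose an instance type with sufficient RAM (e.g., 4 GB or more).
            
        * Create a key pair for SSH access to the VM.
            
        * In the instance's security group, allow inbound TCP traffic on port 22 (SSH) and port 8000 (the Chroma server), ideally only from your own IP addresses.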
            
    2. **Connect to the VM**:
        
        Use the key pair to open an SSH session on the EC2 instance:
        
        ```bash
        ssh -i [your-key.pem] ubuntu@[your-instance-ip]
        ```
        
    3. **Install Docker on the Ubuntu EC2 instance**:
        
        ```bash
        # Update the package list
        sudo apt update
        
        # Install required packages
        sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
        
        # Add the Docker repository (the architecture is detected instead of hard-coded)
        curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
        echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
        
        # Install Docker
        sudo apt update
        sudo apt install -y docker-ce docker-ce-cli containerd.io
        
        # Start Docker now and enable it at boot
        sudo systemctl start docker
        sudo systemctl enable docker
        
        # Verify the Docker installation
        sudo docker --version
        ```
        
    4. **Get the Chroma Docker image from** [**Docker Hub**](https://hub.docker.com/r/chromadb/chroma) **and run it**
        
        ```bash
        # Pull the image
        sudo docker pull chromadb/chroma
        
        # Run the image and publish it on port 8000 of the virtual machine
        # (this runs in the foreground; add -d to run it in the background)
        sudo docker run -p 8000:8000 chromadb/chroma
        ```
        
    
    # Accessing the hosted Chroma DB
    
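    **Installing the Chroma client**
    
    On the machine you will connect from (your laptop, Google Colab, etc.), install the `chromadb` Python package. The leading `!` is notebook syntax; drop it when running in a regular shell.
    
    ```python
    !pip install chromadb
    ```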
        
    
    **Connect to the server running in the Docker container**
    
    ```python
    import chromadb
    
    # Create a Chroma client that talks to the server over HTTP.
    # Replace the placeholder with your VM's public IP address.
    client = chromadb.HttpClient(host="<your-vm-public-ip>", port=8000)
    ```
    
    This client instance is the bridge between your application and the data stored on the Chroma server. The examples below refer to it as `client`.
        
    
    The database is now reachable at your VM's public IP address on port 8000.
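    
    If the client cannot connect, a quick first check is whether port 8000 is reachable at all from your machine. The snippet below is a minimal sketch using only the Python standard library; `is_port_open` is a hypothetical helper written for this article, not part of `chromadb`, and it only tests TCP reachability, not whether Chroma itself is healthy.
    
    ```python
    import socket
    
    
    def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
        """Return True if a TCP connection to host:port succeeds within the timeout."""
        try:
            # create_connection resolves the host and attempts a TCP handshake
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Covers refused connections, timeouts, and unresolvable host names
            return False
    
    
    # Example (placeholder host; replace with your VM's public IP address):
    # print(is_port_open("<your-vm-public-ip>", 8000))
    ```
    
    A `False` result usually points at the network path (for example, the security group or a stopped container) rather than at the client code.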
    
    # Managing Collections in Chroma
    
    **Creating Your Collection**
    
    Collections are like data containers. You can create one like this:
    
    ```python
    from chromadb.utils import embedding_functions
    emb_fn = embedding_functions.DefaultEmbeddingFunction()  # Chroma's default embedding function
    # Create a collection that embeds documents with emb_fn
    collection = client.create_collection(name="my_collection", embedding_function=emb_fn)
    ```
    
    **Taking a Peek at Collections**
    
    Need to work with a collection that already exists? Fetch it by name:
    
    ```python
    # Get a handle to an existing collection (pass the same embedding function used to create it)
    collection = client.get_collection(name="my_collection", embedding_function=emb_fn)
    ```
    
    **Bidding Farewell to Collections**
    
    When a collection's purpose is served, it's time to let it go:
    
    ```python
    # Say goodbye to the collection
    client.delete_collection(name="my_collection")
    ```
    
    With collections, organizing your data turns from a puzzle into a walk in the park. Next, let's put some data into one.
    
    # Adding Data to a Collection
    
    ChromaDB lets you add data to a collection with the `.add` method. Depending on which arguments you pass, it accepts raw documents, precomputed embeddings, metadata, or a combination of these.
    
    **Adding Raw Documents**
    
    For simple data addition, use `.add` with the `documents` parameter. ChromaDB will tokenize and embed the documents using the collection's embedding function. Each list needs one entry per ID:
    
    ```python
    collection.add(
        documents=["doc1", "doc2", "doc3"],
        metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}],
        ids=["id1", "id2", "id3"]
    )
    ```
    
    **Direct Embedding with Metadata**
    
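    Alternatively, if you have already computed the embeddings, pass them along with the documents and metadata. ChromaDB then stores the documents without embedding them again:
    
    ```python
    collection.add(
        documents=["doc1", "doc2", "doc3"],
        embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]],
        metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}],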
        ids=["id1", "id2", "id3"]
    )
    ```
    
    **Linking External Vectors**
    
    If your documents are stored elsewhere, you can add just the embeddings and metadata, and use the IDs to link them back to your documents:
    
    ```python
    collection.add(
        embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]],
        metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}],
        ids=["id1", "id2", "id3"]
    )
    ```
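    
    The lists passed to `.add` are parallel: entry *i* of `documents`, `embeddings`, and `metadatas` all describe entry *i* of `ids`. A length mismatch is easy to introduce and is cheaper to catch before the request goes to the server. The following is a minimal sketch of such a check; `check_batch` is a hypothetical helper written for this article, not a `chromadb` function, and the exact validation the server performs may differ.
    
    ```python
    from typing import Optional, Sequence
    
    
    def check_batch(
        ids: Sequence[str],
        documents: Optional[Sequence[str]] = None,
        embeddings: Optional[Sequence[Sequence[float]]] = None,
        metadatas: Optional[Sequence[dict]] = None,
    ) -> None:
        """Raise ValueError if the lists are not parallel or the IDs repeat."""
        if len(set(ids)) != len(ids):
            raise ValueError("ids must be unique within a batch")
        fields = {"documents": documents, "embeddings": embeddings, "metadatas": metadatas}
        for name, values in fields.items():
            # A field left as None is simply not being sent, so it is skipped
            if values is not None and len(values) != len(ids):
                raise ValueError(f"{name} has {len(values)} entries but there are {len(ids)} ids")
        if embeddings is not None and len({len(vector) for vector in embeddings}) > 1:
            raise ValueError("all embeddings must have the same dimension")
    
    
    # Passes silently: three IDs, three documents, three metadata entries
    check_batch(
        ids=["id1", "id2", "id3"],
        documents=["doc1", "doc2", "doc3"],
        metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}],
    )
    ```
    
    Calling `check_batch` with the same arguments you are about to hand to `collection.add` turns a confusing server-side error into an immediate, local one.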
    
    # Querying a Collection
    
    **Querying with Query Embeddings**
    
    The `.query` method is where the semantic search happens. Given one or more query embeddings, it returns the `n_results` closest stored items for each of them:
    
    ```python
    collection.query(
        query_embeddings=[[11.1, 12.1, 13.1],[1.1, 2.3, 3.2]],
        n_results=10,
        where={"metadata_field": "is_equal_to_this"},
        where_document={"$contains":"search_string"}
    )
    ```
    
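    You can use the optional `where` and `where_document` filters to refine your search based on metadata or document content. The field name, value, and search string shown here are placeholders.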
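    
    Conceptually, querying with embeddings is a nearest-neighbor search: each query vector is compared with the stored vectors and the closest ones are returned. The brute-force sketch below illustrates only that idea. It is not how ChromaDB is implemented (a real vector database relies on an index rather than a full scan), the squared Euclidean distance is an assumption made for the example, and `squared_l2` and `nearest_ids` are hypothetical names.
    
    ```python
    from typing import Dict, List, Sequence
    
    
    def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
        """Squared Euclidean distance between two vectors of equal length."""
        if len(a) != len(b):
            raise ValueError("vectors must have the same dimension")
        return sum((x - y) ** 2 for x, y in zip(a, b))
    
    
    def nearest_ids(
        stored: Dict[str, Sequence[float]], query: Sequence[float], n_results: int
    ) -> List[str]:
        """Return the IDs of the n_results stored vectors closest to the query."""
        ranked = sorted(stored, key=lambda item_id: squared_l2(stored[item_id], query))
        return ranked[:n_results]
    
    
    # Toy data reusing the vectors from the examples in this article
    stored = {"id1": [1.1, 2.3, 3.2], "id2": [4.5, 6.9, 4.4]}
    print(nearest_ids(stored, [1.1, 2.3, 3.2], n_results=1))
    print(nearest_ids(stored, [11.1, 12.1, 13.1], n_results=2))
    ```
    
    The first query is identical to the vector stored under `id1`, so `id1` ranks first; the second query lies closer to `id2`, so `id2` is listed before `id1`.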
    
    **Querying with Query Texts**
    
    Alternatively, you can query using query texts. ChromaDB handles the embedding, allowing you to retrieve results based on these texts:
    
    ```python
    collection.query(
        query_texts=["doc10", "thus spake zarathustra"],
        n_results=10,
        where={"metadata_field": "is_equal_to_this"},
        where_document={"$contains":"search_string"}
    )
    ```
    
    # Updating and Deleting Data in a Collection
    
    In ChromaDB, adapting and refining your dataset is a seamless process. With the `.update` and `.upsert` methods, you can easily modify existing entries or introduce new ones. Additionally, when it's time to trim down, ChromaDB's `.delete` method offers a way to remove data.
    
    **Refining with** `.update` **and** `.upsert`
    
    Whether it's enhancing metadata, changing embeddings, or updating documents, ChromaDB's `.update` method has you covered. Use it to modify items that already exist in the collection, identified by their IDs:
    
    ```python
    collection.update(
        ids=["id1", "id2", "id3"],
        embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]],
        metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}],
        documents=["doc1", "doc2", "doc3"],
    )
    ```
    
    The `.upsert` method combines updates and additions: items whose IDs already exist are updated, and items whose IDs are new are added:
    
    ```python
    collection.upsert(
        ids=["id1", "id2", "id3"],
        embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2]],
        metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}],
        documents=["doc1", "doc2", "doc3"],
    )
    ```
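    
    The difference between the two methods is easiest to see on a toy model. The sketch below uses a plain dictionary as a stand-in for a collection; it is not `chromadb` code, `toy_update` and `toy_upsert` are hypothetical names, and how a real `.update` call reports unknown IDs is an assumption left open here.
    
    ```python
    from typing import Dict, List
    
    
    def toy_update(store: Dict[str, str], ids: List[str], documents: List[str]) -> None:
        """Overwrite documents only for IDs that are already in the store."""
        for item_id, document in zip(ids, documents):
            if item_id in store:
                store[item_id] = document
            # Unknown IDs are skipped here; the real server may log or raise instead
    
    
    def toy_upsert(store: Dict[str, str], ids: List[str], documents: List[str]) -> None:
        """Overwrite existing IDs and insert the ones that are missing."""
        for item_id, document in zip(ids, documents):
            store[item_id] = document
    
    
    store = {"id1": "old doc1"}
    toy_update(store, ["id1", "id2"], ["new doc1", "new doc2"])
    print(store)  # id2 is not added by an update
    
    toy_upsert(store, ["id1", "id2"], ["newer doc1", "new doc2"])
    print(store)  # id2 now exists
    ```
    
    In practice this means `.upsert` is the safer choice when you are not sure whether an item has been added before.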
    
    **Trimming with** `.delete`
    
    When it's time to clean up, the `.delete` method steps in. Delete items by their IDs, optionally combined with a `where` metadata filter:
    
    ```python
    collection.delete(
        ids=["id1", "id2", "id3"],
        where={"chapter": "20"}
    )
    ```
    
    Remember, `.delete` is a powerful action that permanently removes data, so exercise caution.
    
    **In a Nutshell**
    
    Updating, adding, or removing data in ChromaDB is a breeze with these methods. Your dataset remains dynamic and tailored to your needs.
    
    # **Summary**
    
    1. **Semantic Search with ChromaDB:** ChromaDB's vector database stores, manages, and retrieves embeddings, which makes semantic search for NLP and image analysis straightforward.
        
    2. **Hosting on EC2:** Running ChromaDB on a server makes one shared database reachable from your laptop, Google Colab, or multiple applications.
        
    3. **Docker Made Easy**: ChromaDB + Docker = smooth sailing. We ran the official image on the VM and connected to it with `HttpClient` in a client/server setup.
        
    4. **Working with Collections**: Creating collections and adding, updating, and deleting data takes only a few method calls.
        
    5. **Uncover Insights**: Whether you search with embeddings or plain text, `.query` returns the closest matches, optionally filtered by metadata or document content.
        
    
    ### Next Step
    
    If you're eager to learn more about using vector databases like ChromaDB to build applications with LangChain, we recommend watching this video tutorial:
    
* [Video tutorial on YouTube](https://youtu.be/5NG8mefEsCU?si=v4l1oLGgXiyPKp8c)
    
* # References
    
    1. [Hosting Chroma DB on AWS EC2: Server Setup and Client Connection Tutorial](https://www.youtube.com/watch?v=F6yXY0F8lig)
        
    2. [Chroma DB deployment documentation](https://docs.trychroma.com/deployment)
