Vector search using LangChain, Weaviate, and OpenSearch


With the popularity of ChatGPT and Large Language Models (LLMs), everybody is talking about them. On my LinkedIn home page, about 90% of the posts seem to be about ChatGPT, AI, and LLMs. With my experience in search solutions and my interest in everything related to Natural Language Processing and search, I had to start working on solutions as well.

A few weeks ago, I attended the Haystack conference in Charlottesville, where I listened to good talks and had interesting conversations with like-minded people. There I learned about a framework called LangChain. LangChain is a framework for developing applications powered by language models; it makes working with other products, like vector databases and large language models, much more straightforward. OpenSearch and Weaviate, two products I know well, are among its integrations. I decided to experiment with a similarity search using both.

This blog post discusses that sample. It shows the steps needed to accomplish the following tasks:

  • Read content using a LangChain loader.
  • Store data in OpenSearch and Weaviate using the LangChain VectorStore interface.
  • Perform a similarity search using the LangChain VectorStore interface.
  • Print the results, including the score used for sorting.

Running the project yourself

You can find the source code of the project on GitHub:

git clone git@github.com:jettro/MyDataPipeline.git
git checkout blog-langchain-vectorstores

You need to set up Python to run the sample. Besides Python, you need a running OpenSearch instance; I have provided a docker-compose file in the infra folder. For Weaviate, I advise using a sandbox environment. You must also create a .env file in your project’s root folder. You can use the env_template file as a template for your .env file.
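
For reference, a .env file for this sample could look like the sketch below. The variable names OS_USERNAME, OS_PASSWORD, and OPEN_AI_API_KEY come from the sample code; the values are placeholders that you replace with your own credentials, and the env_template file may contain more entries, for instance for your Weaviate sandbox.

# Hypothetical .env contents; replace the placeholder values with your own
OS_USERNAME=admin
OS_PASSWORD=admin
OPEN_AI_API_KEY=your-openai-api-key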

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Now you can run the file run_langchain_ro_vac.py. Before you do, change the do_load_content parameters at the bottom to True; that way, you load the content into OpenSearch and Weaviate. After one successful run, you can switch them back to False.

Load the content using LangChain

LangChain uses Loaders to fetch data. It comes with many different loaders for databases, CSV files, and remote JSON files. It does not, however, support remote XML files out of the box. The dataset we are using comes from the Dutch government. It contains frequently asked questions, called “Vraag Antwoord Combinaties” (question-answer combinations). The default format is XML; you can request JSON if you want, but I stuck with XML. Therefore, I had to create a custom XML loader. The code for the loader is in the code block below.

import xml.etree.ElementTree as ET
from urllib.request import urlopen

from langchain.document_loaders.base import BaseLoader
from langchain.schema import Document


class CustomXMLLoader(BaseLoader):
    def __init__(self, file_path: str, encoding: str = "utf-8"):
        super().__init__()
        self.file_path = file_path
        self.encoding = encoding

    def load(self) -> list[Document]:
        # Fetch the remote XML file and parse it into an element tree
        with urlopen(self.file_path) as f:
            tree = ET.parse(f)
        root = tree.getroot()

        docs = []
        for document in root:
            # Extract relevant data from the XML element
            text = document.find("question").text
            metadata = {"docid": document.find("id").text, "dataurl": document.find("dataurl").text}
            # Create a Document object with the extracted data
            doc = Document(page_content=text, metadata=metadata)
            # Append the Document object to the list of documents
            docs.append(doc)

        return docs

Below is the function that makes use of the XML loader. The specific VectorStore is passed into the method.

def load_content(vector_store: VectorStore) -> None:
    custom_xml_loader = CustomXMLLoader(file_path="https://opendata.rijksoverheid.nl/v1/infotypes/faq?rows=200")
    docs = custom_xml_loader.load()

    run_logging.info("Store the content")
    vector_store.add_documents(docs)

Before inserting documents into Weaviate, you have to provide the schema for the class. Weaviate can create a default schema for your content, but that does not work for similarity search: you need a field that stores an embedded version of the content. Since loading the schema has nothing to do with LangChain, I will not post all of that code; please look at the provided repository. I do want to show you the part of the schema that makes the LangChain similarity search work. In the schema, I configure the class RijksoverheidVac and the field text, which is the default field for LangChain. We can change this if we want.

{
  "class": "RijksoverheidVac",
  "description": "Dit is een vraag voor de Rijksoverheid ",
  "vectorizer": "text2vec-openai",
  "properties": [
    {
      "dataType": [
        "text"
      ],
      "moduleConfig": {
        "text2vec-openai": {"skip": false, "vectorizePropertyName": false}
      },
      "name": "text"
    }
  ]
}
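
Loading the schema is a one-off step against the Weaviate client. Below is a minimal sketch of what that could look like with a v3-style weaviate-client; the sandbox URL is a placeholder, and the repository wraps the client in its own WeaviateClient class, so the actual code differs.

import os

import weaviate

# Connect to the sandbox; text2vec-openai needs the OpenAI key as a header
client = weaviate.Client(
    url="https://your-sandbox.weaviate.network",
    additional_headers={"X-OpenAI-Api-Key": os.getenv("OPEN_AI_API_KEY")},
)

# A trimmed-down version of the class definition shown above
vac_class = {
    "class": "RijksoverheidVac",
    "vectorizer": "text2vec-openai",
    "properties": [
        {"dataType": ["text"], "name": "text"},
    ],
}

client.schema.create_class(vac_class)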

Create Weaviate VectorStore and execute a similarity search

Next, we can construct the Weaviate client, the Weaviate VectorStore, and the VectorStoreIndexWrapper. Notice in the code below that:

  • I use a wrapper around the Weaviate client. This wrapper makes interacting with Weaviate easy. LangChain uses its own wrapper within the VectorStore.
  • Using the parameter do_load_content, you can control a fresh load of the content.
  • In the additional parameter, we pass the field certainty (I added this feature in a pull request :-)).
  • I use the field certainty in the _additional field of the metadata to print the Weaviate certainty, which is used as the score.

def run_weaviate(query: str = "enter your query", do_load_content: bool = False) -> None:
    weaviate_client = WeaviateClient()
    vector_store = Weaviate(
        client=weaviate_client.client,
        index_name=WEAVIATE_CLASS,
        text_key="text"
    )

    if do_load_content:
        load_weaviate_schema(weaviate_client=weaviate_client)
        load_content(vector_store=vector_store)

    index = VectorStoreIndexWrapper(vectorstore=vector_store)
    docs = index.vectorstore.similarity_search(
        query=query,
        search_distance=0.6,
        additional=["certainty"])

    print("\nResults from: Weaviate")
    for doc in docs:
        print(f"{doc.metadata['_additional']['certainty']} - {doc.page_content}")

This method gives you the power to do a similarity search against Weaviate.
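
As a side note on that score: as far as I know, for classes that use cosine distance, Weaviate derives the certainty from the distance as certainty = 1 - distance / 2, which maps it onto a 0 to 1 scale. A small, hypothetical helper illustrates the conversion:

def certainty_from_distance(distance: float) -> float:
    """Convert a Weaviate cosine distance (0..2) into a certainty (0..1)."""
    return 1.0 - distance / 2.0


# A certainty of roughly 0.93, like the top hit shown later in this post,
# corresponds to a cosine distance of roughly 0.14
print(certainty_from_distance(0.14))  # 0.93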

Create OpenSearch VectorStore and execute a similarity search

Next, I show you that working with OpenSearch is similar to Weaviate. It is not the same, but similar. Managing the index works from LangChain, so there is no need to create it ourselves. The following code block should now be self-explanatory. Notice that we have to provide our own embedding function here: Weaviate uses the schema to determine how to create the embeddings, while for OpenSearch we provide them ourselves. In the end, we use OpenAI embeddings for both.

def run_opensearch(query: str = "enter your query", do_load_content: bool = False) -> None:
    auth = (os.getenv('OS_USERNAME'), os.getenv('OS_PASSWORD'))
    opensearch_client = OpenSearchClient()

    vector_store = OpenSearchVectorSearch(
        index_name=OPENSEARCH_INDEX,
        embedding_function=OpenAIEmbeddings(openai_api_key=os.getenv('OPEN_AI_API_KEY')),
        opensearch_url="https://localhost:9200",
        use_ssl=True,
        verify_certs=False,
        ssl_show_warn=False,
        http_auth=auth
    )

    if do_load_content:
        opensearch_client.delete_index(OPENSEARCH_INDEX)
        load_content(vector_store=vector_store)

    docs = vector_store.similarity_search_with_score(query=query)
    print("\nResults from: OpenSearch")
    for doc, _score in docs:
        print(f"{_score} - {doc.page_content}")
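
By default, LangChain returns four documents; if you want more or fewer results, you can pass the k parameter, as in this small variation:

# Hypothetical variation: request the ten best matches instead of the default four
docs = vector_store.similarity_search_with_score(query=query, k=10)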

Testing the integration

Now you can test the example with the code below. The query string is Dutch and roughly translates to “may I use a green flashing light”. You should see results similar to mine.

if __name__ == '__main__':
    run_logging.info("Starting the script Langchain Rijksoverheid Vraag Antwoord Combinaties")

    query_str = "mag ik een groen zwaai licht"

    run_weaviate(query=query_str,
                 do_load_content=False)

    run_opensearch(query=query_str,
                   do_load_content=False)

This is the output.

Results from: Weaviate
0.9312953054904938 - Mag ik een zwaailicht of een sirene gebruiken op mijn auto?
0.9251135289669037 - Hoe kan ik mijn duurzame initiatief voor een Green Deal aanmelden?
0.9233253002166748 - Wanneer moet ik mijn autoverlichting gebruiken?
0.9228493869304657 - Wat is groen sparen of beleggen?

Results from: OpenSearch
0.78848106 - Mag ik een zwaailicht of een sirene gebruiken op mijn auto?
0.7636849 - Wat is groen sparen of beleggen?
0.755817 - Hoe kan ik mijn duurzame initiatief voor een Green Deal aanmelden?
0.75559974 - Wanneer moet ik mijn autoverlichting gebruiken?

Concluding

I hope you learned that it is not hard to start working with vector-based similarity search using OpenAI, LangChain, Weaviate, and OpenSearch. These are exciting times. We can improve search results using vector-based search, and we can start using hybrid search. More on these topics in another blog post.