RAG optimisation: use an LLM to chunk your text semantically

DALL·E generated image showing the flow for semantic chunks

In a previous blog post, I wrote about providing a suitable context for an LLM to answer questions using your content. That post discusses choosing a proper chunking mechanism and matching the semantics of your question or statement. In this post, you will learn about more advanced chunking and embedding techniques. Some of the mechanisms we look at are hierarchical chunking, rule-based chunking, and semantic chunking.

First things first, why do you need chunking for semantic search?

When searching for semantic meaning in a collection of documents, you can create a vector representation of each document. This works for short, to-the-point documents: a news item that deals with one specific topic can be transformed into a single vector that is available for topic search. Compare it to writing a summary of the content. If you can write a very short summary that contains all the required information, there is a good chance that one vector for the whole document will work. Now imagine you are looking for a small fact that you did not include in the summary. A single vector for the whole document misses the semantic meaning of that fact as well.

You overcome this problem by using a splitter to create chunks from the complete article. Two well-known chunking mechanisms are:

  • Sentence splitting: each sentence becomes its own chunk
  • Max token splitting: each chunk holds at most a fixed number of tokens
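To make the two mechanisms concrete, here is a minimal sketch of both splitters using only the standard library. The function names are hypothetical, the sentence detection is a naive regular expression, and whitespace-separated words stand in for tokens; a real tokenizer will count differently.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def split_max_tokens(text: str, max_tokens: int) -> List[str]:
    """Group words into chunks of at most max_tokens words (words approximate tokens)."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]


text = "We prefer Python. We prepared a GUI. Other languages work too!"
sentence_chunks = split_sentences(text)
token_chunks = split_max_tokens(text, max_tokens=4)
print(sentence_chunks)
print(token_chunks)
```

Notice how the max token splitter cuts straight through sentences, while the sentence splitter produces chunks that may be too short to carry a complete thought.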

The challenge of semantic chunking

Each chunk must have enough semantic meaning to be relevant when people search for something. The problem with a sentence is that it might not contain all the relevant information for one semantic topic. The same can happen with a max token chunk; in addition, a max token chunk can discuss multiple semantic meanings or knowledge items at once. The vector for such a chunk represents roughly an average of everything in it. As a result, a question about one of those topics may not match the chunk well, even though the chunk contains relevant information.
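The averaging effect can be illustrated with a toy calculation. This is only a sketch: the two-dimensional topic vectors are made up, and real embedding models do not literally average topic vectors, but the dilution works in a similar way.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_vector(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of a list of vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]


# Made-up topic directions: one axis per topic
python_topic = [1.0, 0.0]
gui_topic = [0.0, 1.0]

focused_chunk = gui_topic                               # chunk about one topic
mixed_chunk = mean_vector([python_topic, gui_topic])    # chunk about both topics

query = [0.0, 1.0]  # a question about the GUI only
focused_score = cosine(query, focused_chunk)
mixed_score = cosine(query, mixed_chunk)
print(f"focused: {focused_score:.3f}, mixed: {mixed_score:.3f}")
```

The focused chunk matches the query perfectly, while the mixed chunk scores about 0.707, even though it contains the same GUI information.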

Overcoming the semantic meaning problem of chunks

In short, small chunks can lose their semantic meaning without the surrounding context, and large chunks can contain multiple semantic meanings. What if we could give each chunk a clean separation of knowledge or semantic meaning? Can LLMs help us get to that knowledge-based chunking?

You can follow along with my Jupyter Notebook. The notebook starts with a text containing explicit sections. The next code block shows the SectionSplitter that we use for the example.

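import re
from typing import List

class SectionSplitter(Splitter):
  def split(self, input_document: InputDocument) -> List[Chunk]:
    # Sections are separated by one or more blank lines
    sections = re.split(r"\n\s*\n", input_document.text)
    print(f"Num sections: {len(sections)}")

    chunks_ = []
    for i, section in enumerate(sections):
      chunk_ = Chunk(
        # constructor arguments are cut off in this listing
      )
      chunks_.append(chunk_)

    return chunks_

  @staticmethod
  def name() -> str:
    return SectionSplitter.__name__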
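If you want to try the section-splitting logic without RAG4p, the following sketch uses a small stand-in class. SimpleChunk and split_sections are hypothetical names for illustration and are not part of the library; the id format is an assumption as well.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleChunk:
    """Hypothetical stand-in for a chunk; not the RAG4p Chunk class."""
    document_id: str
    chunk_id: int
    total_chunks: int
    chunk_text: str

    def get_id(self) -> str:
        return f"{self.document_id}_{self.chunk_id}"


def split_sections(document_id: str, text: str) -> List[SimpleChunk]:
    """Split text on blank lines and wrap each non-empty section in a chunk."""
    sections = [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]
    return [SimpleChunk(document_id, i, len(sections), section)
            for i, section in enumerate(sections)]


sample = ("First section, line one.\nStill the first section.\n\n"
          "Second section.\n   \nThird section.")
section_chunks = split_sections("input-doc", sample)
for section_chunk in section_chunks:
    print(section_chunk.get_id(), "->", section_chunk.chunk_text)
```

The regular expression also treats a line that contains only whitespace as a section separator, which is why the sample text yields three sections.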


The parent Splitter class comes from the RAG4p library. We use this splitter to chop the text into chunks, which are the input for the semantic or knowledge-based splitter. The next code block shows how to use OpenAI to create the knowledge chunks.

import json
from typing import List

from openai import OpenAI

openai_client = OpenAI(api_key=key_loader.get_openai_api_key())

def fetch_knowledge_chunks(orig_chunk: Chunk) -> List[Chunk]:
    # The tail of the prompt is cut off in this listing; closing the
    # example format and appending the text is the minimal completion.
    prompt = f"""Task: Extract Knowledge Chunks
    Please extract knowledge chunks from the following text. 
    Each chunk should capture distinct, self-contained units 
    of information in a subject-description format. Return 
    the extracted knowledge chunks as a JSON object or array, 
    ensuring that each chunk includes both the subject and 
    its corresponding description. Use the format: 
    {{"knowledge_chunks": [
      {{"subject": "subject", "description": "description"}}
    ]}}
    Text: {orig_chunk.chunk_text}"""

    completion = openai_client.chat.completions.create(
        # model=... is cut off in this listing; pass the chat model you use
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "You are an assistant that takes "
                        "apart a piece of text into semantic "
                        "chunks to be used in a RAG system."},
            {"role": "user", "content": prompt},
        ],
    )
    answer = json.loads(completion.choices[0].message.content)

    chunks_ = []
    for index, kc in enumerate(answer["knowledge_chunks"]):
        chunk_ = Chunk(
            # the leading constructor arguments are cut off in this listing
            f'{kc["subject"]}: {kc["description"]}', 
            {"original_text": orig_chunk.chunk_text, 
             "original_chunk_id": orig_chunk.get_id(), 
             "original_total_chunks": orig_chunk.total_chunks})
        chunks_.append(chunk_)
    return chunks_

Notice that the OpenAI client supports a response_format parameter, which lets you tell OpenAI to return a JSON object as the response. I still had to describe the expected JSON format in the prompt; before I did, the structure of the response was not consistent. We can use this function with the sections generated by the previous step. The first section contains the following text:
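Because the structure of an LLM response can still vary, it may help to validate the JSON before turning it into chunks. The sketch below is an assumption about how you could do that, not code from the notebook; parse_knowledge_chunks is a hypothetical helper that keeps only items with both a subject and a description.

```python
import json
from typing import Dict, List


def parse_knowledge_chunks(raw: str) -> List[Dict[str, str]]:
    """Return only the well-formed subject/description items from a JSON reply."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    # Accept both {"knowledge_chunks": [...]} and a bare JSON array
    items = data.get("knowledge_chunks", []) if isinstance(data, dict) else data
    if not isinstance(items, list):
        return []
    valid = []
    for item in items:
        if (isinstance(item, dict)
                and isinstance(item.get("subject"), str)
                and isinstance(item.get("description"), str)):
            valid.append({"subject": item["subject"],
                          "description": item["description"]})
    return valid


raw_reply = ('{"knowledge_chunks": ['
             '{"subject": "Workshop", "description": "A hands-on workshop."}, '
             '{"subject": "Broken item without description"}]}')
parsed = parse_knowledge_chunks(raw_reply)
print(parsed)
```

Items that miss a field are dropped silently here; depending on your use case, you may prefer to log them or retry the request.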

Ever thought about building your very own question-answering system? Like the one that powers Siri, Alexa, or Google Assistant? Well, we’ve got something awesome lined up for you! In our hands-on workshop, we’ll guide you through the ins and outs of creating a question-answering system. We prefer using Python for the workshop. We have prepared a GUI that works with python. If you prefer another language, you can still do the workshop, but you will miss the GUI to test your application.

The following knowledge chunks are extracted in the format “subject: description”:

  • Building a question-answering system: Learn how to build a question-answering system like the ones that power Siri, Alexa, or Google Assistant.
  • Workshop: Participate in a hands-on workshop that guides you through the creation of a question-answering system.
  • Programming Language Preference: Python is the preferred programming language for the workshop.
  • GUI for Python: A GUI has been prepared that works with Python for testing your application.
  • Programming Language Flexibility: Other programming languages can be used in the workshop, but the GUI for testing will not be available.

Executing vector queries

We can reuse some of the components from the RAG4p framework. We use the InternalContentStore to store the chunks, create the embeddings, and execute queries.

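from rag4p.integrations.openai import EMBEDDING_SMALL

# Create an in memory content store to hold some chunks
openai_embedder = OpenAIEmbedder(
    # constructor arguments are cut off in this listing
)
content_store = InternalContentStore(
    embedder=openai_embedder,
    # remaining arguments are cut off in this listing
)

for chunk in chunks:
    knowledge_chunks = fetch_knowledge_chunks(chunk)
    # the call that stores knowledge_chunks in content_store
    # is cut off in this listing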

With the content store in place, we can start executing queries:

result = content_store.find_relevant_chunks(
    "What are examples of a RAG system?")

for found_chunk in result:
    print(f"Score: {found_chunk.score:.3f}, "
          f"Chunk: {found_chunk.get_id()}, "
          f"Num chunks: {found_chunk.total_chunks} \n "
          f"{found_chunk.chunk_text}")

Finding relevant chunks for query: 
    What are examples of a RAG system?
Score: 1.188, Chunk: input-doc_0_0, Num chunks: 5 
 Building a question-answering system: 
    The process of creating a 
   system similar to Siri, Alexa, or Google Assistant that 
   can answer questions.
Score: 1.213, Chunk: input-doc_4_3, Num chunks: 4 
 Pipeline Creation: Tools for creating a pipeline include 
   Langchain and Custom solutions.
Score: 1.223, Chunk: input-doc_4_1, Num chunks: 4 
 Large Language Model: Large Language Models that can be 
   used include OpenAI, HuggingFace, Cohere, PaLM, and 
Score: 1.223, Chunk: input-doc_0_1, Num chunks: 5 
 Workshop offering: A hands-on workshop that guides 
   participants through creating a question-answering system.

When a chunk is matched, we can obtain the original text from which it was taken. This is a good base for context when sending the data back to an LLM to generate an answer to the provided question.
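The idea of matching on a small knowledge chunk and then handing the larger original text to the LLM can be sketched without any external service. Everything here is a stand-in: the embedding is a toy word count instead of a real embedding model, and KnowledgeChunk and find_relevant are hypothetical names, not the RAG4p API. In this sketch, a lower score means a closer match.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def distance(a: Counter, b: Counter) -> float:
    """Cosine distance between two word-count vectors; lower means more similar."""
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 if norm == 0 else 1.0 - dot / norm


@dataclass
class KnowledgeChunk:
    chunk_text: str
    properties: Dict[str, str] = field(default_factory=dict)


def find_relevant(chunks: List[KnowledgeChunk], query: str,
                  max_results: int = 1) -> List[Tuple[float, KnowledgeChunk]]:
    """Score every chunk against the query and return the closest ones first."""
    query_vector = embed(query)
    scored = [(distance(query_vector, embed(c.chunk_text)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0])[:max_results]


section = ("We prefer using Python for the workshop. "
           "We have prepared a GUI that works with python.")
store = [
    KnowledgeChunk("Programming Language Preference: Python is the preferred "
                   "programming language for the workshop.",
                   {"original_text": section}),
    KnowledgeChunk("Vector Store: Tools for storing vectors include OpenSearch, "
                   "Elasticsearch, and Weaviate.",
                   {"original_text": "Use a vector store "
                                     "(OpenSearch, Elasticsearch, Weaviate)"}),
]

best_score, best_chunk = find_relevant(store, "Which programming language is preferred?")[0]
llm_context = best_chunk.properties["original_text"]
print(f"Score: {best_score:.3f}\nContext: {llm_context}")
```

The query is matched against the short, focused knowledge chunk, but the context that goes to the LLM is the full section the chunk was extracted from.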

Answering the question using RAG (Retrieval Augmented Generation)

Next, we use the context to answer questions using a Large Language Model. The next code block shows how we construct the answer.

question = "What are examples of a q&a systems?"
result = content_store.find_relevant_chunks(
    question, max_results=1)

found_chunk = result[0]
context = found_chunk.properties["original_text"]
openai_answer_generator = OpenaiAnswerGenerator(
    # constructor arguments are cut off in this listing
)
answer = openai_answer_generator.generate_answer(
    question, context)

Examples of question-answering systems are Siri, Alexa, and Google Assistant.

The question “What will we learn?” receives the following answer.

You will learn how to work with vector stores and Large Language Models, as well as how to combine these two elements to perform semantic searches, which go beyond traditional keyword-based searches.

It becomes much harder when someone asks a question that requires more than one knowledge chunk, because the question no longer asks for a single knowledge item. In that case, we have to do the same thing with the question: extract its knowledge parts.

First, we create a function that extracts sub-questions from the provided question.

def fetch_knowledge_question_chunks(orig_text: str) -> List[str]:
    # The tail of the prompt is cut off in this listing; appending
    # the question is the minimal completion.
    prompt = f"""Task: Extract Knowledge parts from question 
        to use in a RAG system
        Please extract sub questions from the following 
        question. Each sub-question should ask for distinct, 
        self-contained units of information. Return the 
        subquestions as a JSON array, ensuring that each 
        item is a question. Use the format: 
        {{"sub_questions": ["question1", "question2"]}}
        Question: {orig_text}"""

    completion = openai_client.chat.completions.create(
        # model=... is cut off in this listing; pass the chat model you use
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "You are an assistant that takes apart "
                        "a question into sub-questions."},
            {"role": "user", "content": prompt},
        ],
    )

    answer_ = json.loads(completion.choices[0].message.content)

    parts_ = []
    if "sub_questions" not in answer_:
        print(f"Error in answer: {answer_}")
        return parts_

    for know_part in answer_["sub_questions"]:
        parts_.append(know_part)

    return parts_

In the next step, we extract the sub-questions, obtain relevant parts for each sub-question, combine the different original texts into one context, and ask the LLM for an answer to the complete question with the created context.

question = ("What is semantic search and what vector stores are "
            "we using?")
query_parts = fetch_knowledge_question_chunks(question)
context_parts = []
for part in query_parts:
    result = content_store.find_relevant_chunks(
        part, max_results=1)
    found_chunk = result[0]
    print(f"Score: {found_chunk.score:.3f}, "
          f"Chunk: {found_chunk.get_id()}, "
          f"Num chunks: {found_chunk.total_chunks} \n"
          f"{found_chunk.chunk_text}")
    # Keep the original section text of the matched knowledge chunk
    context_parts.append(found_chunk.properties["original_text"])

context = " ".join(context_parts)
openai_answer_generator = OpenaiAnswerGenerator(
    # constructor arguments are cut off in this listing
)
answer = openai_answer_generator.generate_answer(
    question, context)

print(f"Context: \n{context}")
print(f"\nAnswer: \n{answer}")

What is semantic search?
Finding relevant chunks for query: What is semantic search?
Score: 0.678, Chunk: input-doc_1_2, Num chunks: 4 
Introduction to semantic search: Semantic search is the next 
big thing after traditional keyword-based searches.

What vector stores are we using?
Finding relevant chunks for query: What vector stores are 
we using?
Score: 0.881, Chunk: input-doc_4_0, Num chunks: 4 
Vector Store: Tools for storing vectors include OpenSearch, 
Elasticsearch, and Weaviate.

Context: 
You'll get your hands dirty with vector stores and Large 
Language Models, we help you combine these two in a way 
you've never done before. You've probably used search 
engines for keyword-based searches, right? Well, prepare to 
have your mind blown. We'll dive into something called 
semantic search, which is the next big thing after traditional 
searches. It’s like moving from asking Google to search "best 
pizza places" to "Where can I find a pizza place that my 
gluten-intolerant, vegan friend would love?" – you get the 
idea, right? 
Some of the highlights of the workshop: 
- Use a vector store (OpenSearch, Elasticsearch, Weaviate)
- Use a Large Language Model (OpenAI, HuggingFace, Cohere, 
      PaLM, Bedrock)
- Use a tool for content extraction (Unstructured, Llama)
- Create your pipeline (Langchain, Custom)

Answer: 
Semantic search is a type of search that goes beyond 
traditional keyword-based searches and understands the context 
and intent of the query. For example, instead of just searching 
for "best pizza places," semantic search can understand and 
find "Where can I find a pizza place that my gluten-intolerant, 
vegan friend would love?"

The vector stores being used in the workshop are OpenSearch, 
Elasticsearch, and Weaviate.

We got a very nice answer from the LLM based on the context we provided.
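One thing to watch for: when two sub-questions match knowledge chunks that come from the same section, a simple loop adds that original text to the context twice. A possible refinement, sketched here with hypothetical names and a lookup table standing in for the vector search, is to skip texts that were already added.

```python
from typing import Callable, List


def build_context(sub_questions: List[str],
                  retrieve: Callable[[str], str]) -> str:
    """Collect one original text per sub-question, skipping duplicates in first-seen order."""
    seen = set()
    parts = []
    for sub_question in sub_questions:
        original_text = retrieve(sub_question)
        if original_text not in seen:
            seen.add(original_text)
            parts.append(original_text)
    return " ".join(parts)


# Hypothetical lookup table standing in for the vector search
fake_index = {
    "What is semantic search?": "Section about semantic search.",
    "What vector stores are we using?": "Section listing the tools.",
    "Which LLMs are we using?": "Section listing the tools.",
}
deduplicated_context = build_context(list(fake_index), fake_index.__getitem__)
print(deduplicated_context)
```

This keeps the context shorter, which leaves more room in the prompt for other relevant sections.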


