Mastering RAG: Choosing the Right Vector Embedding Model for Your RAG Application

Learn how to evaluate your RAG model and understand how performance changes with the choice of embeddings!

Retrieval-Augmented Generation (RAG) applications are becoming increasingly popular as large language models (LLMs) improve. These applications combine retrieval and generation to provide accurate, contextually relevant responses. Validating and evaluating RAG applications is crucial because the quality of the generated responses heavily depends on the effectiveness of the underlying retrieval and embedding models. Proper evaluation helps optimize performance, enhancing user experience and reliability. This tutorial explores how the choice of embedding model affects RAG performance, detailing the evaluation metrics and demonstrating the impact of different embeddings on a naive RAG pipeline. By understanding these factors, developers can make informed decisions to improve their RAG applications.

What Are Embedding Models?

Embedding models are neural network-based models designed to convert high-dimensional data, such as text, images, or audio, into dense vector representations. These vectors, or embeddings, capture the semantic information of the input data in a lower-dimensional space, making it easier to perform various machine learning tasks. If you want to know more about embedding models, read this blog.

In Retrieval-Augmented Generation (RAG), embedding models play a crucial role by converting both the query and the corpus of documents into embeddings. These embeddings are then used to find the documents most relevant to the query, which are subsequently used to generate the final response. The quality and relevance of the generated response depend heavily on the accuracy and efficiency of the embedding model. The process has four steps:

  1. Query Embedding: When a user inputs a query, the embedding model converts this query into a dense vector representation.

  2. Document Embedding: Similarly, all the documents in the corpus are pre-processed and converted into embeddings.

  3. Similarity Search: The query embedding is then compared with the document embeddings to find the most similar documents. This is typically done using similarity measures such as cosine similarity.

  4. Response Generation: The retrieved documents are then passed to a generative model (like a Transformer-based model) to generate a response that is informed by the retrieved documents.
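The four steps above can be sketched in a few lines of plain Python. This is only an illustration: the `embed` function below is a toy word-count stand-in for a real embedding model, and the names `embed`, `retrieve`, and `build_prompt` are hypothetical helpers, not part of any library used later in this tutorial. The generation step is reduced to assembling the prompt that would be sent to the LLM.

```python
import math
from collections import Counter

def embed(text, vocabulary):
    """Toy stand-in for an embedding model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, documents, vocabulary, k=2):
    """Steps 1-3: embed the query, embed the documents, rank by cosine similarity."""
    query_vector = embed(query, vocabulary)
    document_vectors = [embed(doc, vocabulary) for doc in documents]
    ranked = sorted(
        zip(documents, document_vectors),
        key=lambda pair: cosine_similarity(query_vector, pair[1]),
        reverse=True,
    )
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query, context_documents):
    """Step 4 (input side only): the prompt a generative model would receive."""
    context = "\n".join(context_documents)
    return f"Answer the question based only on the following context:\n{context}\n\nQuestion: {query}"

documents = [
    "llama is a language model released by meta",
    "qdrant is a vector database",
    "cosine similarity compares two vectors",
]
vocabulary = sorted({word for doc in documents for word in doc.split()})

query = "which database stores vector embeddings"
top_documents = retrieve(query, documents, vocabulary, k=1)
prompt = build_prompt(query, top_documents)
```

A real embedding model replaces `embed` with a learned dense vector, which is what lets it match queries and documents that share meaning but not exact words.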

The choice of the embedding model can significantly impact the performance of the RAG system. A more accurate embedding model will result in better retrieval of relevant documents which, in turn, will improve the quality of the generated responses. Factors such as the domain of the documents, the type of queries, and the computational resources available should be considered when selecting an embedding model.

How Embedding Models Are Evaluated

Embedding models are evaluated using a variety of metrics that assess their performance on different tasks. The MTEB (Massive Text Embedding Benchmark) leaderboard provides a comprehensive evaluation across multiple datasets and tasks.

The leaderboard gives a standardized way to compare embedding models. Because models are evaluated on a wide range of datasets and tasks, it exposes their strengths and weaknesses in different scenarios and shows whether a model performs well across diverse applications rather than on a single task.

Let’s now understand each of the metrics of the MTEB leaderboard and how they are used to evaluate embedding models. These metrics include:

  • Average: The average performance across all 56 datasets. This metric provides a comprehensive overview of how well an embedding model performs across diverse tasks and datasets.

  • Classification Average: The average performance on 12 classification datasets, where models are evaluated based on how accurately they can categorize texts into predefined classes. It involves training a logistic regression classifier on the embeddings of the training set and testing it on the embeddings of the test set.

  • Clustering Average: The average performance on 11 clustering datasets, where models are evaluated based on how well they can group similar texts together.

  • Pair Classification Average: The average performance on 3 pair classification datasets, where models are evaluated based on how well they can predict whether pairs of texts are duplicates or paraphrases.

  • Reranking Average: The average performance on 4 reranking datasets, where models are evaluated based on how well they can rearrange a list of candidates so that the most relevant items are at the top.

  • Retrieval Average: The average performance on 15 retrieval datasets, where models are evaluated based on how well they can find the most relevant documents given a query.

  • STS Average: The average performance on 10 Semantic Textual Similarity (STS) datasets, where models are evaluated based on how well the similarity of their embeddings for a pair of sentences agrees with human similarity ratings.

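  • Summarization Average: The performance on a single summarization dataset. Embedding models do not generate summaries here; they are evaluated based on how well the embedding similarity between machine-generated and human-written summaries agrees with human quality judgments.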
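One consequence of the definitions above is worth making explicit: if the overall Average is taken across all 56 datasets, it is not the plain mean of the seven category averages, because the categories contain different numbers of datasets. The sketch below works through that arithmetic. The dataset counts come from the list above; the per-category scores are made-up numbers used only for illustration, and `overall_average` is a hypothetical helper.

```python
# Number of datasets per task category, as listed above.
DATASET_COUNTS = {
    "classification": 12,
    "clustering": 11,
    "pair_classification": 3,
    "reranking": 4,
    "retrieval": 15,
    "sts": 10,
    "summarization": 1,
}

def overall_average(category_averages, dataset_counts=DATASET_COUNTS):
    """Average over all datasets, reconstructed from per-category averages."""
    total_datasets = sum(dataset_counts.values())
    weighted_sum = sum(
        category_averages[name] * count for name, count in dataset_counts.items()
    )
    return weighted_sum / total_datasets

# Hypothetical per-category averages, NOT real leaderboard numbers.
example_scores = {
    "classification": 75.0,
    "clustering": 45.0,
    "pair_classification": 85.0,
    "reranking": 60.0,
    "retrieval": 50.0,
    "sts": 80.0,
    "summarization": 30.0,
}

weighted = overall_average(example_scores)
unweighted = sum(example_scores.values()) / len(example_scores)
```

With these illustrative numbers the two values differ by more than a point, so a model's rank on the overall Average is driven mostly by the large categories (retrieval, classification, clustering, STS). For a RAG application, the Retrieval Average is usually the more relevant column to look at.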

If you’re interested in knowing more about the MTEB leaderboard and its evaluation criteria, you can read the MTEB paper.

How to Evaluate a RAG App

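Now that we understand the importance of embeddings, let’s see how we can use different embedding models in our RAG application and evaluate the overall performance. Our architecture stack consists of LangChain for orchestration, Qdrant as the vector database, Hugging Face for the embedding models, Llama 3 served through Ollama as the LLM, and RAGAs for evaluation.

The embedding models that we will use in this tutorial are dunzhang/stella_en_1.5B_v5, BAAI/bge-large-en, BAAI/bge-small-en, Salesforce/SFR-Embedding-2_R, and Alibaba-NLP/gte-Qwen2-7B-instruct.

Here, we will guide you through building a RAG application using this stack and demonstrate how these embedding models can impact its performance.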

While building the RAG application, we will be covering the following steps:

  1. Loading the data.

  2. Splitting it into chunks.

  3. Storing the embeddings of the chunks in the Qdrant database.

  4. Performing similarity search using the user query on the Qdrant database.

  5. Using the retrieved documents as context for the Llama 3 model to generate the answer.

The RAGAs framework is designed to help evaluate any RAG (Retrieval-Augmented Generation) application. It employs the following evaluation metrics:

  • Faithfulness: This metric measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context, and is scaled to a range of (0,1), with higher scores being better. Essentially, it assesses how well the generated answer is supported by the retrieved context. If the data in the generated answer matches the context data, the faithfulness score will be high.

  • Answer Relevancy: This metric evaluates how relevant the generated answer is to the given question. It is calculated from the question, the context, and the answer. Similar to faithfulness, it is scaled to a range of (0,1), with higher scores being better. It is calculated as the mean cosine similarity of the original question to a number of artificial questions, which are generated while reverse-engineering the answer.

  • Context Relevancy: This measures how relevant the retrieved context is to the given question. Ideally, the retrieved context should contain exclusively the essential information needed to address the query. It is calculated from the question and the context. The process involves identifying all the sentences in the retrieved context that are relevant to the question and then calculating the relevancy score. This metric is also scaled to a range of (0,1), with higher scores being better.

  • Context Recall: Context recall measures how much of the ground truth answer is covered by the retrieved context. It is calculated from the ground truth answer and the retrieved context. The estimation is done by analyzing each sentence in the ground truth answer to determine whether it can be attributed to the retrieved context; the score is the fraction of sentences that can. If all the sentences in the ground truth answer are attributable to the retrieved context, the context recall score is 1.

  • Context Precision: Context precision measures whether the retrieved chunks that are relevant to the ground truth are ranked near the top of the retrieved contexts. It is calculated from the question, the ground truth, and the contexts. If all the relevant chunks are ranked above the irrelevant ones, the context precision score will be high.
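To make the definitions above concrete, here is a minimal sketch of the arithmetic behind four of these metrics. It assumes the hard part has already been done: in RAGAs an LLM produces the per-claim, per-sentence, and per-chunk verdicts, whereas here they are passed in as plain lists of booleans. The function names are hypothetical and this is not the RAGAs implementation, only the scoring step as described in the list.

```python
import math

def faithfulness_score(claim_supported):
    """Fraction of claims in the answer that are supported by the context."""
    if not claim_supported:
        return float("nan")
    return sum(claim_supported) / len(claim_supported)

def context_recall_score(sentence_attributable):
    """Fraction of ground-truth sentences attributable to the retrieved context."""
    if not sentence_attributable:
        return float("nan")
    return sum(sentence_attributable) / len(sentence_attributable)

def context_precision_score(chunk_relevant):
    """Mean of precision@k over every rank k that holds a relevant chunk."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def answer_relevancy_score(question_vector, generated_question_vectors):
    """Mean cosine similarity between the question and questions generated from the answer."""
    similarities = [
        cosine_similarity(question_vector, vector)
        for vector in generated_question_vectors
    ]
    return sum(similarities) / len(similarities)

# Four retrieved chunks, relevant ones at ranks 3 and 4 versus ranks 1, 3 and 4.
late_hits = context_precision_score([False, False, True, True])
early_hits = context_precision_score([True, False, True, True])
```

The two calls at the end show why ranking matters for context precision: both lists contain relevant chunks, but the one with a relevant chunk at rank 1 scores much higher.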

Here, we will use the above-mentioned metrics to evaluate the RAG application. However, in RAGAs, there are many more metrics available to evaluate the RAG application. For more information about the metrics, you can check the RAGAs documentation.

Now that we’ve understood all the concepts and reasons for evaluating a RAG application, let’s dive into coding.

Building the RAG App

Here, we will be using a PDF file that contains information about the Llama series of models. The PDF file is compiled from the Wikipedia page on Llama (the language model family) and is available in the GitHub repository for this tutorial.

from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def create_chunks_from_pdf(data_path, chunk_size, chunk_overlap):

  '''
  This function takes a directory of PDF files and creates chunks of text from each file.
  The text is split into chunks of size `chunk_size` with an overlap of `chunk_overlap`.
  This chunk is then converted into a langchain Document object.

  Args:
     data_path (str): The path to the directory containing the PDF files.
     chunk_size (int): The size of each chunk.
     chunk_overlap (int): The overlap between each chunk.

  Returns:
     docs (list): A list of langchain Document objects, each containing a chunk of text.
  '''

  # Load the documents from the directory
  loader = DirectoryLoader(data_path, loader_cls=PyPDFLoader)

  # Split the documents into chunks
  text_splitter = RecursiveCharacterTextSplitter(
     chunk_size=chunk_size,
     chunk_overlap=chunk_overlap,
     length_function=len,
     is_separator_regex=False,
  )
  docs = loader.load_and_split(text_splitter=text_splitter)
  return docs

data_path = '../data'
chunk_size = 500
chunk_overlap = 50

docs = create_chunks_from_pdf(data_path, chunk_size, chunk_overlap)

We start by loading the documents from the directory. We then split them into chunks of at most `chunk_size` characters, with consecutive chunks overlapping by up to `chunk_overlap` characters, and get the result back as a list of LangChain Document objects.
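To see what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines, so its chunks vary in length); `split_with_overlap` is a hypothetical helper that only illustrates the two parameters.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# 1,200 characters of dummy text, split with the same settings as the tutorial.
sample_text = "".join(str(i % 10) for i in range(1200))
chunks = split_with_overlap(sample_text, chunk_size=500, chunk_overlap=50)
chunk_lengths = [len(chunk) for chunk in chunks]
```

The overlap means the last 50 characters of one chunk are repeated at the start of the next, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.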

The next step is to index the documents in the Qdrant database. Before moving on to that step, we need to load the embeddings model that we will use to convert the documents into embeddings. To build the simple RAG model, we will start with the ‘BAAI/bge-large-en’ embeddings model. In the evaluation section, we will also try other embedding models to see how the performance changes.

from langchain_huggingface import HuggingFaceEmbeddings

embedding_models = ['BAAI/bge-large-en']

# Load the embeddings model
embeddings = HuggingFaceEmbeddings(model_name=embedding_models[0])

Great. Now that we have created the chunked documents and loaded the embeddings model, we can index the documents using Qdrant.

from langchain_qdrant import Qdrant


def index_documents_and_retrieve(docs, embeddings):

   '''
   This function uses the Qdrant library to index the documents using the chunked text and embeddings model.
   For the simplicity of the example, we are using in-memory storage only.

   Args:
    docs: List of chunked documents generated by the langchain document loader
    embeddings: Embeddings model object used to embed the documents and the queries

   Returns:
    retriever: Qdrant retriever object which can be used to retrieve the relevant documents
   '''

   qdrant = Qdrant.from_documents(
       docs,
       embeddings,
       location=":memory:",  # Local mode with in-memory storage only
       collection_name="my_documents",
   )

   retriever = qdrant.as_retriever()

   return retriever

retriever = index_documents_and_retrieve(docs, embeddings)

To keep this tutorial simple, we have indexed the documents in Qdrant using the in-memory mode. In a production environment, you can use the persistent mode to store the indexed documents on disk. If you want to know more about these approaches, you can read my other blogs on Qdrant (Hindi-Language AI Chatbot for Enterprises Using Qdrant, MLFlow, and LangChain; Create a Superhero Character Chatbot Using Haystack, Qdrant, Llama 3, and Streamlit; Enhancing Data Security with Role-Based Access Control of Qdrant Vector Database).

Now that we have the retriever object, we can use it to retrieve the relevant documents based on the query and, finally, generate the answer using the LLM model.

from langchain_community.chat_models import ChatOllama

model_id = "llama3:instruct"

# Load the Llama 3 model using Ollama
llm = ChatOllama(model=model_id)

Let’s build the simple RAG chain using the retriever and the LLM model.

from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

def build_rag_chain(llm, retriever):

   '''
   This function builds the RAG chain using the LLM model and the retriever object.
   The RAG chain is built using the following steps:
   1. Retrieve the relevant documents using the retriever object
   2. Pass the retrieved documents to the LLM model along with prompt generated using the context and question
   3. Parse the output of the LLM model


   Args:
    llm: LLM model object
    retriever: Qdrant retriever object


   Returns:
    rag_chain: RAG chain object which can be used to answer the questions based on the context
   '''

   template = """
       Answer the question based only on the following context:

       {context}

       Question: {question}
       """

   prompt = PromptTemplate(
       template=template,
       input_variables=["context","question"]
       )
   rag_chain = (
       {"context": retriever,  "question": RunnablePassthrough()}
       | prompt
       | llm
       | StrOutputParser()
   )
   return rag_chain

rag_chain = build_rag_chain(llm, retriever)
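One detail of the chain above is easy to miss: the retriever returns a list of Document objects, and that list is placed into `{context}` as-is, so the prompt contains the string form of each object, metadata included. A common alternative is to join only the text of the chunks before filling the prompt. The sketch below shows such a helper; `format_docs` and `StubDocument` are hypothetical names, and `StubDocument` is only a stand-in with the same `page_content` and `metadata` attributes so that the example runs without LangChain installed.

```python
from dataclasses import dataclass, field

@dataclass
class StubDocument:
    """Hypothetical stand-in for a LangChain Document (same attribute names)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs, separator="\n\n"):
    """Join the text of the retrieved chunks, dropping metadata and object wrappers."""
    return separator.join(doc.page_content for doc in docs)

retrieved = [
    StubDocument("LLaMA was initially announced in February 2023.", {"page": 0}),
    StubDocument("LLaMA 2 was trained on approximately 2 trillion tokens.", {"page": 3}),
]

context_as_objects = str(retrieved)       # roughly what the chain above passes along
context_as_text = format_docs(retrieved)  # only the chunk text
```

In the chain, this would mean using `retriever | format_docs` in place of `retriever` for the `"context"` entry, assuming a LangChain version that coerces plain functions into runnables when they are piped. Whether this improves the scores is something the evaluation later in this tutorial could measure; it is offered here as an option, not as part of the original pipeline.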

Now that we have the RAG model, we can use it to generate the answers for the queries. Let’s test it out.

rag_chain.invoke('What is this document about?')

Output

Based on the provided context, this document appears to be about LLaMA, a language model that was initially announced in February 2023. The document discusses the release of the model, its training and architecture, as well as reactions to its leak and accessibility. It also mentions issues related to unauthorized distribution, DMCA takedown requests, and subsequent releases of updated versions under different licenses.

There you go! You have successfully built a RAG app using LangChain, Qdrant, Hugging Face, Ollama, and Llama 3. You can use this RAG app to answer questions based on the context of the documents.
Let’s move on to the next section, where we will learn how to evaluate the RAG app using the evaluation metrics discussed in the above sections.

In the previous section, we saw how to build a simple RAG app that generates answers based on the user’s query using a custom PDF. Now, in this section, we will see how to use the same app and evaluate it using the RAG metrics provided by the RAGAs library.

Before demonstrating how to evaluate the RAG app, let’s first create a set of questions and ground truth answers to assess the model. This will help us evaluate the performance of the RAG application. For ease and simplicity, we will use only five questions and their ground truth answers extracted from the Llama document we just loaded.

questions = [
   "What is the developer of Llama model?",
   "How many versions of Llama model are there?",
   "How is Llama 3 model different from GPT-3?",
   "Can you provide some information about the training datset of Llama-1 model?",
   "What is the context length of Llama 3 model?",
]

ground_truth = [
   "Meta AI is the developer of the Llama model.",
   "There are three versions of the Llama model.",
   "There are minor architectural differences. Compared to GPT-3, LLaMA uses SwiGLU activation function instead of GeLU; uses rotary positional embeddings instead of absolute positional embedding; uses root-mean-squared layer-normalization instead of standard layer-normalization; increases context length to 8k in Llama 3 (compared to 4k in Llama 2 and 2k in Llama 1 and GPT-3)",
   "LLaMA 1 foundational models were trained on a data set with 1.4 trillion tokens, drawn from publicly available data sources, including: Webpages scraped by CommonCrawl; Open source repositories of source code from GitHub; Wikipedia in 20 different languages; Public domain books from Project Gutenberg; Books3 books dataset; The LaTeX source code for scientific papers uploaded to ArXiv; Questions and answers from Stack Exchange websites. On April 17, 2023, TogetherAI launched a project named RedPajama to reproduce and distribute an open source version of the LLaMA dataset. The dataset has approximately 1.2 trillion tokens and is publicly available for download",
   "The context length of Llama 3 model is 8192 tokens.",
]

Great! Now that we have the test/evaluation set to evaluate the RAG app, let’s move on to the next step, which is properly formatting the evaluation set using the `datasets` library by Hugging Face.

from datasets import Dataset

def create_test_case(questions, ground_truth, rag_chain, retriever):
   '''
   This function creates a test case for the RAG model
   It takes a list of questions and the corresponding ground truth answers.
   It then uses the RAG model to generate answers for the questions.
   It also retrieves the relevant documents for each question.
   Finally, it combines all the information into a dataset object and returns it.

   Args:
       questions: list of strings, questions to be answered
       ground_truth: list of strings, corresponding ground truth answers
       rag_chain: RAG model
       retriever: Retriever object


   Returns:
       dataset: Dataset object containing the questions, answers, contexts and ground truth answers
   '''
   data = {"question": [], "answer": [], "contexts": [], "ground_truth": ground_truth}

   for query in questions:
       data["question"].append(query)
       # generate the answer with the RAG chain
       data["answer"].append(rag_chain.invoke(query))
       # store the text of the retrieved chunks (invoke() replaces the deprecated get_relevant_documents())
       data["contexts"].append([doc.page_content for doc in retriever.invoke(query)])

   dataset = Dataset.from_dict(data)

   return dataset

For now, we will just keep this function to be used later in the evaluation process.
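Before handing the data to the evaluator, it can help to check that it has the shape the function above produces: four columns of equal length, with `contexts` holding a list of strings per question. The following is a small sketch of such a check in plain Python. `validate_eval_data` is a hypothetical helper, not part of `datasets` or RAGAs, and the column names are simply the ones used in this tutorial.

```python
REQUIRED_COLUMNS = ("question", "answer", "contexts", "ground_truth")

def validate_eval_data(data):
    """Check column presence, equal lengths, and that each contexts entry is a list of strings.

    Returns the number of rows if the data is well formed, otherwise raises ValueError.
    """
    missing = [name for name in REQUIRED_COLUMNS if name not in data]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    lengths = {name: len(data[name]) for name in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns have different lengths: {lengths}")

    for index, contexts in enumerate(data["contexts"]):
        if not isinstance(contexts, list) or not all(isinstance(c, str) for c in contexts):
            raise ValueError(f"contexts[{index}] must be a list of strings")

    return lengths["question"]

example = {
    "question": ["What is the context length of Llama 3 model?"],
    "answer": ["The context length of Llama 3 model is 8k."],
    "contexts": [["first retrieved chunk", "second retrieved chunk"]],
    "ground_truth": ["The context length of Llama 3 model is 8192 tokens."],
}
row_count = validate_eval_data(example)
```

A typical mistake this catches is passing more ground truth answers than questions, or storing each context as a single string instead of a list of chunk strings.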

Now, let’s create a function that will evaluate the RAG app using the RAGAs library.

from ragas import evaluate

from ragas.metrics import (
   faithfulness,
   answer_relevancy,
   context_relevancy,
   context_recall,
   context_precision,
)

import nest_asyncio
nest_asyncio.apply()

def evaluate_rag(dataset, llm, embeddings):

   '''
   This function evaluates the RAG model on a dataset using the specified metrics

   Args:
       dataset: Dataset object containing the questions, answers, contexts and ground truth answers
       llm: LLM model
       embeddings: Embeddings object

   Returns:
       result: dictionary containing the evaluation results
   '''
   result = evaluate(
       dataset=dataset,
       llm=llm,
       embeddings=embeddings,
       metrics=[
           context_relevancy,
           context_precision,
           faithfulness,
           answer_relevancy,
           context_recall,
       ],
       raise_exceptions=True
   )

   return result

Great! It’s as simple as that. Now we can evaluate the RAG app using the RAGAs library. Let’s move on to the next step, which is to see all these steps in action.

Now, let’s quickly load the LLM model.

from langchain_community.chat_models import ChatOllama

model_id = "llama3:instruct"

llm = ChatOllama(model=model_id)

Let’s choose the embedding models on which we want to evaluate the RAG app. Here, we will be using five embedding models. These are:

  1. dunzhang/stella_en_1.5B_v5

  2. BAAI/bge-large-en

  3. BAAI/bge-small-en

  4. Salesforce/SFR-Embedding-2_R

  5. Alibaba-NLP/gte-Qwen2-7B-instruct

For more information about each of these embedding models, you can refer to the Hugging Face model hub or visit the MTEB Leaderboard.

embedding_models = ['dunzhang/stella_en_1.5B_v5', 'BAAI/bge-large-en', 'BAAI/bge-small-en', 'Salesforce/SFR-Embedding-2_R', 'Alibaba-NLP/gte-Qwen2-7B-instruct']

Now that we have everything in place, let’s tie together all the code from this section as well as the previous section to see how our RAG app performs on the evaluation set.

import sys
sys.path.append('..')

from tqdm.notebook import tqdm
from langchain_huggingface import HuggingFaceEmbeddings

# load the utility functions
from utils import create_chunks_from_pdf, index_documents_and_retrieve, build_rag_chain, evaluate_rag, create_test_case

# define all the parameters
data_path = '../data'
chunk_size = 500
chunk_overlap = 50

# empty lists to store the results
context_relevancy_result, context_precision_result, context_recall_result, faithfulness_result, answer_relevancy_result = [], [], [], [], []

# loop over all the embedding models
for embedding_model_name in tqdm(embedding_models, desc="Embedding Models"):

   # load the embeddings model
   embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)

   docs = create_chunks_from_pdf(data_path, chunk_size, chunk_overlap) # create langchain chunked documents
   retriever = index_documents_and_retrieve(docs, embeddings) # index the documents and get the retriever object
   rag_chain = build_rag_chain(llm, retriever) # build the RAG model
   dataset = create_test_case(questions, ground_truth, rag_chain, retriever) # create the test case for each question using each embedding model
   evaluation_result = evaluate_rag(dataset, llm, embeddings) # evaluate the RAG model on the test case

   # store the results
   context_relevancy_result.append(evaluation_result['context_relevancy'])
   context_precision_result.append(evaluation_result['context_precision'])
   context_recall_result.append(evaluation_result['context_recall'])
   faithfulness_result.append(evaluation_result['faithfulness'])
   answer_relevancy_result.append(evaluation_result['answer_relevancy'])

   # print the results to see how the model is performing
   print(f"Embedding Model: {embedding_model_name}")
   for rows in evaluation_result.to_pandas().iterrows():
       print(f"Question: {rows[1]['question']}")
       print(f"Answer: {rows[1]['answer']}")
       print("Ground Truth: ", rows[1]['ground_truth'])
       print(f"Context Relevancy: {rows[1]['context_relevancy']}")
       print(f"Context Precision: {rows[1]['context_precision']}")
       print(f"Faithfulness: {rows[1]['faithfulness']}")
       print(f"Answer Relevancy: {rows[1]['answer_relevancy']}")
       print(f"Context Recall: {rows[1]['context_recall']}")
       print("="*100)
   print("\\_/"*50)
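With only five questions per model, a single failed metric can distort a comparison, and a metric can come back as `nan` for an individual question (one such value appears in the output of this tutorial). The sketch below shows one way to aggregate per-question scores into a per-model summary while skipping `nan` values. `nan_mean` and `summarize` are hypothetical helpers, the keys `model_a` and `model_b` are placeholders, and the faithfulness values are per-question scores of the kind printed by the loop above.

```python
import math

def nan_mean(scores):
    """Mean of the scores, ignoring nan entries; nan if nothing valid remains."""
    valid = [score for score in scores if not math.isnan(score)]
    return sum(valid) / len(valid) if valid else float("nan")

def summarize(per_model_scores):
    """Collapse {model: {metric: [per-question scores]}} into {model: {metric: mean}}."""
    return {
        model: {metric: nan_mean(values) for metric, values in metrics.items()}
        for model, metrics in per_model_scores.items()
    }

# Placeholder model names; per-question faithfulness scores for two runs.
per_model_scores = {
    "model_a": {"faithfulness": [0.7142857142857143, 1.0, 1.0, 0.2, 1.0]},
    "model_b": {"faithfulness": [0.75, 1.0, float("nan"), 0.6, 1.0]},
}

summary = summarize(per_model_scores)
```

Skipping `nan` is a design choice: it keeps the other four questions usable, but the mean for that model is then based on fewer samples, which is worth reporting alongside the score.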

Analysis of the Results

Question: What is the developer of Llama model?
Answer: Based on the provided context, there is no direct mention of the developer of the Llama model. However, it is mentioned that Meta AI's testing shows that Llama 3 (70B) beats Gemini and Claude in most benchmarks. Additionally, Mark Zuckerberg is quoted as stating that the team was surprised by the 70B model's continued learning at the end of training, suggesting that Meta AI may be involved in the development of the Llama model.
Ground Truth:  Meta AI is the developer of the Llama model.
Context Relevancy: 0.008849557522123894
Context Precision: 0.3333333333
Faithfulness: 0.7142857142857143
Answer Relevancy: 0.0
Context Recall: 1.0
====================================================================================================
Question: How many versions of Llama model are there?
Answer: According to the provided context, the question is based on the following text:

"...plans on releasing multimodal models, models capable of conversing in multiple languages, and models with larger context windows. A version with 400B+ parameters is currently being trained."

From this text, we can infer that there are at least three versions of Llama model mentioned: 8B, 70B, and 400B+. Additionally, other versions such as 7B, 13B, and 34B are also mentioned. Therefore, the total number of versions is at least six.

However, if we consider all the versions mentioned in the provided text, there are a total of nine different versions:

1. 8B
2. 70B
3. 400B+
4. 7B
5. 13B
6. 34B
7. 500B (foundation model)
8. 20B (long-context data)

So, the correct answer is: there are at least nine versions of Llama model.
Ground Truth:  There are three versions of the Llama model.
Context Relevancy: 0.0037593984962406013
Context Precision: 0.4166666666458333
Faithfulness: 1.0
Answer Relevancy: 0.8121186314853447
Context Recall: 1.0
====================================================================================================
Question: How is Llama 3 model different from GPT-3?
Answer: According to the provided context, LLaMA 3 uses:

* SwiGLU activation function instead of GeLU
* Rotary positional embeddings instead of absolute positional embedding
* Root-mean-squared layer-normalization instead of standard layer-normalization
* Increases context length to 8k (compared to 4k in LLaMA 2 and 2k in LLaMA 1)

These differences distinguish LLaMA 3 from GPT-3.
Ground Truth:  There are minor architectural differences. Compared to GPT-3, LLaMA uses SwiGLU activation function instead of GeLU; uses rotary positional embeddings instead of absolute positional embedding; uses root-mean-squared layer-normalization instead of standard layer-normalization; increases context length to 8k in Llama 3 (compared to 4k in Llama 2 and 2k in Llama 1 and GPT-3)
Context Relevancy: 0.023972602739726026
Context Precision: 0.9999999999666667
Faithfulness: 1.0
Answer Relevancy: 0.6712139913349059
Context Recall: 1.0
====================================================================================================
Question: Can you provide some information about the training datset of Llama-1 model?
Answer: Based on the provided context, there is no specific information about the training dataset of Llama-1 model. The context only mentions that LLaMA's developers focused their effort on scaling the model's performance by increasing the volume of training data, rather than the number of parameters. It does not provide details about the size or composition of the training dataset for LLaMA-1 specifically.

However, it is mentioned that LLaMA-3 was pre-trained on approximately 15 trillion tokens of text gathered from "publicly available sources" with instruct models fine-tuned on "publicly available instruction datasets, as well as over 10M human-annotated examples". Since LLaMA-1 and LLaMA-2 are not explicitly mentioned in this context, it is unclear whether their training datasets were similar or different.
Ground Truth:  LLaMA 1 foundational models were trained on a data set with 1.4 trillion tokens, drawn from publicly available data sources, including: Webpages scraped by CommonCrawl; Open source repositories of source code from GitHub; Wikipedia in 20 different languages; Public domain books from Project Gutenberg; Books3 books dataset; The LaTeX source code for scientific papers uploaded to ArXiv; Questions and answers from Stack Exchange websites. On April 17, 2023, TogetherAI launched a project named RedPajama to reproduce and distribute an open source version of the LLaMA dataset. The dataset has approximately 1.2 trillion tokens and is publicly available for download
Context Relevancy: 0.00684931506849315
Context Precision: 0.999999999975
Faithfulness: 0.2
Answer Relevancy: 0.0
Context Recall: 1.0
====================================================================================================
Question: What is the context length of Llama 3 model?
Answer: According to the provided context, the answer can be found in the second document with metadata `{'page': 3}`:

`...Context \nlength\nCorpus \nsize...\n6,300\n[30]`

The context length of Llama 3 is 6,300.
Ground Truth:  The context length of Llama 3 model is 8192 tokens.
Context Relevancy: 0.007272727272727273
Context Precision: 0.0
Faithfulness: 1.0
Answer Relevancy: 0.9954413982146993
Context Recall: 1.0
====================================================================================================
\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/
Question: What is the developer of Llama model?
Answer: Based on the provided context, there is no information that directly answers the question "What is the developer of Llama model?" However, I can extract some relevant information.

According to page 5, a software developer named Georgi Gerganov released llama.cpp as open-source on March 10, 2023. This implies that Georgi Gerganov might be involved in developing or re-implementing the Llama model.

However, it's important to note that there is no direct information provided about the original developer of the Llama language model.
Ground Truth:  Meta AI is the developer of the Llama model.
Context Relevancy: 0.011278195488721804
Context Precision: 0.9999999999666667
Faithfulness: 0.75
Answer Relevancy: 0.0
Context Recall: 0.6666666666666666
====================================================================================================
Question: How many versions of Llama model are there?
Answer: Based on the provided context, I can see that:

1. Initially, only LLaMA 1 was available as a foundation model.
2. Later, Meta announced LLaMA 2, which has three model sizes: 7, 13, and 70 billion parameters.

There is no mention of other versions beyond LLaMA 1 and LLaMA 2 in the provided context.
Ground Truth:  There are three versions of the Llama model.
Context Relevancy: 0.003676470588235294
Context Precision: 0.49999999995
Faithfulness: 1.0
Answer Relevancy: 0.9167566639184695
Context Recall: 1.0
====================================================================================================
Question: How is Llama 3 model different from GPT-3?
Answer: Based on the provided context, specifically the document with page number 3, we can see that LLaMA 3 uses several differences compared to GPT-3:

1. It uses SwiGLU activation function instead of GeLU.
2. It uses rotary positional embeddings instead of absolute positional embedding.
3. It uses root-mean-squared layer-normalization instead of standard layer-normalization.

These architectural differences make LLaMA 3 distinct from GPT-3.
Ground Truth:  There are minor architectural differences. Compared to GPT-3, LLaMA uses SwiGLU activation function instead of GeLU; uses rotary positional embeddings instead of absolute positional embedding; uses root-mean-squared layer-normalization instead of standard layer-normalization; increases context length to 8k in Llama 3 (compared to 4k in Llama 2 and 2k in Llama 1 and GPT-3)
Context Relevancy: 0.01984126984126984
Context Precision: 0.999999999975
Faithfulness: nan
Answer Relevancy: 0.974393495260279
Context Recall: 0.0
====================================================================================================
Question: Can you provide some information about the training datset of Llama-1 model?
Answer: Based on the provided context, it seems that there is no specific information about the training dataset of Llama-1 model. The text only mentions details about LLaMA 2 and LLaMA 3 models, but not LLaMA 1.

However, it does mention that LLaMA 2 was trained on a data set with approximately 2 trillion tokens, which was curated to remove Web sites that often disclose personal data of people, and also upsamples sources considered trustworthy.
Ground Truth:  LLaMA 1 foundational models were trained on a data set with 1.4 trillion tokens, drawn from publicly available data sources, including: Webpages scraped by CommonCrawl; Open source repositories of source code from GitHub; Wikipedia in 20 different languages; Public domain books from Project Gutenberg; Books3 books dataset; The LaTeX source code for scientific papers uploaded to ArXiv; Questions and answers from Stack Exchange websites. On April 17, 2023, TogetherAI launched a project named RedPajama to reproduce and distribute an open source version of the LLaMA dataset. The dataset has approximately 1.2 trillion tokens and is publicly available for download
Context Relevancy: 0.009230769230769232
Context Precision: 0.999999999975
Faithfulness: 0.6
Answer Relevancy: 0.0
Context Recall: 1.0
====================================================================================================
Question: What is the context length of Llama 3 model?
Answer: According to the text, the context length of Llama 3 model is 8k.
Ground Truth:  The context length of Llama 3 model is 8192 tokens.
Context Relevancy: 0.009433962264150943
Context Precision: 0.8055555555287036
Faithfulness: 1.0
Answer Relevancy: 0.9999999999999524
Context Recall: 1.0
====================================================================================================
\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/
Question: What is the developer of Llama model?
Answer: Based on the provided context, there is no direct mention of the developer of the Llama model. However, it is mentioned that Georgi Gerganov released an open-source implementation of Llama in C++ called "llama.cpp" on March 10, 2023. This might suggest that Gerganov had some involvement or contribution to the development of the Llama model, but it is not explicitly stated.
Ground Truth:  Meta AI is the developer of the Llama model.
Context Relevancy: 0.011673151750972763
Context Precision: 0.999999999975
Faithfulness: 0.8333333333333334
Answer Relevancy: 0.9882436320170883
Context Recall: 0.5
====================================================================================================
Question: How many versions of Llama model are there?
Answer: Based on the given context, I can see that:

* There is a mention of "Llama 2" in the document content.
* There is also a mention of "LLaMA-1 models" which suggests that there is at least one previous version of the LLaMA model.

However, it's not explicitly stated how many versions there are.
Ground Truth:  There are three versions of the Llama model.
Context Relevancy: 0.0037174721189591076
Context Precision: 0.99999999995
Faithfulness: 0.8333333333333334
Answer Relevancy: 0.9970516901507306
Context Recall: 1.0
====================================================================================================
Question: How is Llama 3 model different from GPT-3?
Answer: According to the provided context, LLaMA 3 model differs from GPT-3 in the following ways:

* Uses SwiGLU activation function instead of GeLU.
* Uses rotary positional embeddings instead of absolute positional embedding.
* Uses root-mean-squared layer-normalization instead of standard layer-normalization.
* Increases context length to 8k (compared to 4k in LLaMA 2 and 2k in LLaMA 1).

These differences are mentioned on page 3 of the document.
Ground Truth:  There are minor architectural differences. Compared to GPT-3, LLaMA uses SwiGLU activation function instead of GeLU; uses rotary positional embeddings instead of absolute positional embedding; uses root-mean-squared layer-normalization instead of standard layer-normalization; increases context length to 8k in Llama 3 (compared to 4k in Llama 2 and 2k in Llama 1 and GPT-3)
Context Relevancy: 0.011235955056179775
Context Precision: 0.9166666666361111
Faithfulness: 0.5
Answer Relevancy: 0.9878473756310687
Context Recall: 0.8333333333333334
====================================================================================================
Question: Can you provide some information about the training datset of Llama-1 model?
Answer: Based on the provided context, I can answer your question. Unfortunately, there is no specific information about the training dataset of Llama-1 model in the given text. The relevant text only mentions that LLaMA's developers focused their effort on scaling the model's performance by increasing the volume of training data, rather than the number of parameters, reasoning that the dominating cost for LLMs is from doing inference on the trained model rather than the computational cost of the training process.

However, if you're looking for information about the training datasets used for other LLaMA models, such as LLaMA-2 or LLaMA-3, the text does provide some details. For example, it mentions that LLaMA-2 uses a 21,000-word corpus and that LLaMA-3's training dataset size is not publicly disclosed due to being unreleased.
Ground Truth:  LLaMA 1 foundational models were trained on a data set with 1.4 trillion tokens, drawn from publicly available data sources, including: Webpages scraped by CommonCrawl; Open source repositories of source code from GitHub; Wikipedia in 20 different languages; Public domain books from Project Gutenberg; Books3 books dataset; The LaTeX source code for scientific papers uploaded to ArXiv; Questions and answers from Stack Exchange websites. On April 17, 2023, TogetherAI launched a project named RedPajama to reproduce and distribute an open source version of the LLaMA dataset. The dataset has approximately 1.2 trillion tokens and is publicly available for download
Context Relevancy: 0.009933774834437087
Context Precision: 0.999999999975
Faithfulness: 0.625
Answer Relevancy: 0.9066550990620251
Context Recall: 0.42857142857142855
====================================================================================================
Question: What is the context length of Llama 3 model?
Answer: Based on the provided context, we can find the answer by searching for the phrase "context length" in the page content. On page 4, we find:

"...Llama 2 models. Unlike GPT-4 which increased context length during fine-tuning, Llama 2 and Code Llama - Chat have the same context length of 4K tokens."

However, since this information is about Llama 2, not Llama 3, it's unclear whether the context length remains the same for Llama 3.

Unfortunately, there is no specific information provided about the context length of Llama 3 model. The answer to this question cannot be determined based on the given context.
Ground Truth:  The context length of Llama 3 model is 8192 tokens.
Context Relevancy: 0.007352941176470588
Context Precision: 0.249999999975
Faithfulness: 0.2
Answer Relevancy: 0.0
Context Recall: 1.0
====================================================================================================
\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/
Question: What is the developer of Llama model?
Answer: According to the provided context, the developer(s) of the Llama model are Meta AI.
Ground Truth:  Meta AI is the developer of the Llama model.
Context Relevancy: 0.018633540372670808
Context Precision: 0.99999999995
Faithfulness: 1.0
Answer Relevancy: 0.9480041718419262
Context Recall: 1.0
====================================================================================================
Question: How many versions of Llama model are there?
Answer: Based on the provided context, I can see that there are at least 3 versions of the LLaMA model mentioned:

1. LLaMA (original)
2. LLaMA-2
3. LLaMA-3

Additionally, it's mentioned that there were unauthorized copies shared via BitTorrent, and later on, subsequent versions of LLaMA were made accessible outside academia and released under licenses permitting some commercial use.

So, to answer your question, there are at least 3 named versions of the LLaMA model: LLaMA, LLaMA-2, and LLaMA-3.
Ground Truth:  There are three versions of the Llama model.
Context Relevancy: 0.013636363636363636
Context Precision: 0.99999999995
Faithfulness: 0.3333333333333333
Answer Relevancy: 0.8903150332447121
Context Recall: 1.0
====================================================================================================
Question: How is Llama 3 model different from GPT-3?
Answer: According to the provided context, LLaMA 3 uses rotary positional embeddings instead of absolute positional embeddings, and root-mean-squared layer-normalization instead of standard layer-normalization. Additionally, it increases the context length to 8k compared to 4k in LLaMA 2 and 2k in LLaMA 1 and GPT-3.
Ground Truth:  There are minor architectural differences. Compared to GPT-3, LLaMA uses SwiGLU activation function instead of GeLU; uses rotary positional embeddings instead of absolute positional embedding; uses root-mean-squared layer-normalization instead of standard layer-normalization; increases context length to 8k in Llama 3 (compared to 4k in Llama 2 and 2k in Llama 1 and GPT-3)
Context Relevancy: 0.00816326530612245
Context Precision: 0.999999999975
Faithfulness: 1.0
Answer Relevancy: 0.9010357159755958
Context Recall: 0.8333333333333334
====================================================================================================
Question: Can you provide some information about the training datset of Llama-1 model?
Answer: According to the provided context, the training dataset for LLaMA-1 model is described on page 4. It states that LLaMA's developers focused their effort on scaling the model's performance by increasing the volume of training data, rather than the number of parameters.

It also mentions that the data set used for training LLaMA-1 has approximately 1.4 trillion tokens, drawn from publicly available sources, including:

* Webpages scraped by CommonCrawl
* Open source repositories of source code from GitHub
* Wikipedia in 20 different languages
* Public domain books from Project Gutenberg
* Books3 dataset
* The LaTeX source code for scientific papers uploaded to ArXiv
* Questions and answers from Stack Exchange websites
Ground Truth:  LLaMA 1 foundational models were trained on a data set with 1.4 trillion tokens, drawn from publicly available data sources, including: Webpages scraped by CommonCrawl; Open source repositories of source code from GitHub; Wikipedia in 20 different languages; Public domain books from Project Gutenberg; Books3 books dataset; The LaTeX source code for scientific papers uploaded to ArXiv; Questions and answers from Stack Exchange websites. On April 17, 2023, TogetherAI launched a project named RedPajama to reproduce and distribute an open source version of the LLaMA dataset. The dataset has approximately 1.2 trillion tokens and is publicly available for download
Context Relevancy: 0.00967741935483871
Context Precision: 0.999999999975
Faithfulness: 0.5
Answer Relevancy: 0.9082873596473494
Context Recall: nan
====================================================================================================
Question: What is the context length of Llama 3 model?
Answer: Based on the provided context, the answer to the question can be found in the page content of the Document with metadata {'page': 3}:

`Llama\n3\n... \nContext \nlength\nCorpus \nsize\nCommercial \nviability?\nLLaMA\nFebruary \n24,\n2023\n●\n6.7B \n●\n13B \n●\n32.5B \n●\n65.2B\n6,300\n[30]\n2048\n1–1.4T\nNo\nLlama\n2\nJuly\n18, \n2023\n●\n6.7B \n●\n13B \n●\n69B\n21,000\n[31]\n4096\n2T\nYes\nCode\nLlama\nAugust\n24, \n2023\n●\n6.7B \n●\n13B \n●\n33.7B \n●\n69B\nLlama\n3\nApril\n18, \n2024\n●\n8B \n●\n70.6B \n●\n400B+ \n(unreleas \ned)\n100,000\n[32]\n[33]\n8192\n15T\nArchitecture\nand\ntraining\nArchitecture\nLLaMA\nuses\nthe\ntransformer\narchitecture,\nthe`

The context length of Llama 3 model is mentioned as `8k` (8 thousand) in the text.
Ground Truth:  The context length of Llama 3 model is 8192 tokens.
Context Relevancy: 0.02336448598130841
Context Precision: 0.99999999995
Faithfulness: 1.0
Answer Relevancy: 0.9999999999994146
Context Recall: 1.0
====================================================================================================
\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/
Question: What is the developer of Llama model?
Answer: Based on the provided context, the developer of the Llama model is Meta.
Ground Truth:  Meta AI is the developer of the Llama model.
Context Relevancy: 0.005649717514124294
Context Precision: 0.499999999975
Faithfulness: 1.0
Answer Relevancy: 0.8480938578182253
Context Recall: 1.0
====================================================================================================
Question: How many versions of Llama model are there?
Answer: Based on the provided context, we can see that there are at least 3 versions of the LLaMA model:

* LLaMA 1
* LLaMA 2
* LLaMA 3

Additionally, it is mentioned that future versions (LLaMA 5, 6, and 7) are planned.
Ground Truth:  There are three versions of the Llama model.
Context Relevancy: 0.01020408163265306
Context Precision: 0.8055555555287036
Faithfulness: 1.0
Answer Relevancy: 0.5249875097848541
Context Recall: 1.0
====================================================================================================
Question: How is Llama 3 model different from GPT-3?
Answer: According to the provided context, LLaMA 3 uses the following components that differ from GPT-3:

1. **SwiGLU activation function**: Instead of GeLU used in GPT-3.
2. **Rotary positional embeddings**: Instead of absolute positional embedding used in GPT-3.
3. **Root-mean-squared layer normalization**: Instead of standard layer normalization used in GPT-3.

Additionally, LLaMA 3 has a longer context length of 8k, compared to 4k in LLaMA 2 and 2k in LLaMA 1.
Ground Truth:  There are minor architectural differences. Compared to GPT-3, LLaMA uses SwiGLU activation function instead of GeLU; uses rotary positional embeddings instead of absolute positional embedding; uses root-mean-squared layer-normalization instead of standard layer-normalization; increases context length to 8k in Llama 3 (compared to 4k in Llama 2 and 2k in Llama 1 and GPT-3)
Context Relevancy: 0.018072289156626505
Context Precision: 0.6388888888675925
Faithfulness: 1.0
Answer Relevancy: 0.9076189370181525
Context Recall: 0.8
====================================================================================================
Question: Can you provide some information about the training datset of Llama-1 model?
Answer: I can't answer that question. The provided context does not mention Llama-1 model or its training dataset. It only provides information about the Llama language model and its pages 0, 3, 5, and 6.
Ground Truth:  LLaMA 1 foundational models were trained on a data set with 1.4 trillion tokens, drawn from publicly available data sources, including: Webpages scraped by CommonCrawl; Open source repositories of source code from GitHub; Wikipedia in 20 different languages; Public domain books from Project Gutenberg; Books3 books dataset; The LaTeX source code for scientific papers uploaded to ArXiv; Questions and answers from Stack Exchange websites. On April 17, 2023, TogetherAI launched a project named RedPajama to reproduce and distribute an open source version of the LLaMA dataset. The dataset has approximately 1.2 trillion tokens and is publicly available for download
Context Relevancy: 0.00966183574879227
Context Precision: 0.999999999975
Faithfulness: 0.6
Answer Relevancy: 0.0
Context Recall: 0.5
====================================================================================================
Question: What is the context length of Llama 3 model?
Answer: According to the given context, the context length of Llama 3 model is 8k. This can be found on page 3 of the document, where it says "increases context length to 8k in Llama 3".
Ground Truth:  The context length of Llama 3 model is 8192 tokens.
Context Relevancy: 0.0047169811320754715
Context Precision: 0.49999999995
Faithfulness: 1.0
Answer Relevancy: 0.9999999999980403
Context Recall: 1.0
====================================================================================================
\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/\_/
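
A few of the per-question scores in the log above are `nan` (for example, one Faithfulness value and one Context Recall value). If the per-question scores are averaged into a single number per embedding model, a plain mean would turn the whole average into `nan`. The sketch below shows one way to guard against that. It is not necessarily how the original notebook aggregates its scores, and `nan_safe_mean` is a hypothetical helper name introduced only for this illustration.

```python
import math


def nan_safe_mean(scores):
    """Return (mean of the non-NaN scores, number of NaN scores skipped)."""
    valid = [s for s in scores if not math.isnan(s)]
    skipped = len(scores) - len(valid)
    # If every score is NaN there is nothing to average, so report NaN.
    mean = sum(valid) / len(valid) if valid else float("nan")
    return mean, skipped


# Faithfulness scores copied from the first four questions visible in the
# log excerpt above (one of them is nan).
faithfulness_scores = [1.0, float("nan"), 0.6, 1.0]

mean, skipped = nan_safe_mean(faithfulness_scores)
print(f"mean={mean:.3f}, skipped={skipped}")
```

Reporting the number of skipped scores alongside the mean keeps the gap visible: with only five questions per model, a single skipped score changes the average noticeably.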

Plot the Results

Awesome! Now we have successfully evaluated the RAG app using the RAGAs library. Let’s plot the results to see how the RAG app performs on the evaluation set using different embedding models.

from matplotlib import pyplot as plt

# One (title, per-model scores) pair per RAGAs metric, in plotting order.
metrics = [
    ("Context Relevancy", context_relevancy_result),
    ("Context Precision", context_precision_result),
    ("Context Recall", context_recall_result),
    ("Faithfulness", faithfulness_result),
    ("Answer Relevancy", answer_relevancy_result),
]

# One bar chart per metric, side by side.
_, axes = plt.subplots(1, len(metrics), figsize=(20, 5))

for ax, (title, scores) in zip(axes, metrics):
    ax.bar(embedding_models, scores)
    ax.set_title(title)
    # Rotate the model names so long identifiers do not overlap.
    ax.tick_params(axis="x", labelrotation=90)

plt.tight_layout()
plt.show()

On this five-question evaluation set, the embedding model from Salesforce gives the best results overall. Choosing the right embedding model is essential for the RAG app to perform well.
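
Reading a winner off five separate bar charts can be subjective, so it may help to collapse the metrics into one score per model. The sketch below ranks models by the unweighted mean of their metric scores. The function name `rank_models`, the model names, and all numbers are placeholders for illustration; they are not results from the run above.

```python
from statistics import mean


def rank_models(results):
    """Sort models by the unweighted mean of their metric scores, best first."""
    overall = {
        model: mean(list(metric_scores.values()))
        for model, metric_scores in results.items()
    }
    return sorted(overall.items(), key=lambda item: item[1], reverse=True)


# Placeholder numbers for illustration only; NOT results from the run above.
example_results = {
    "model_a": {"context_precision": 0.9, "context_recall": 0.8, "faithfulness": 0.7},
    "model_b": {"context_precision": 0.6, "context_recall": 0.9, "faithfulness": 0.6},
}

for model, score in rank_models(example_results):
    print(f"{model}: {score:.3f}")
```

An unweighted mean treats every metric as equally important and assumes the metrics share a comparable scale. In the log above, Context Relevancy values are far smaller than the other scores (roughly 0.003 to 0.02), so that metric would barely influence such an average unless it is rescaled or weighted.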

Conclusion

Retrieval-Augmented Generation (RAG) enhances large language models by integrating relevant external information, which makes it well suited to applications that require high accuracy and contextual relevance. Evaluating RAG systems is essential to ensure quality and reliability, and the RAGAs framework provides comprehensive metrics for this purpose. The choice of embedding model significantly impacts RAG performance, because high-quality embeddings improve document retrieval and, in turn, response generation. By combining these evaluation metrics with resources such as the MTEB leaderboard for shortlisting candidate embedding models, developers can optimize their RAG systems for better performance and reliability.

GitHub

The code for this tutorial can be found at: https://github.com/quamernasim/evaluate-rag-with-ragas-langchain-llama3-qdrant/tree/main

Related Article

The original article was published at https://quamernasim.medium.com/mastering-rag-choosing-the-right-vector-embedding-model-for-your-rag-application-bbb57517890e

References