September 28, 2023 · 5 min read

RAG Evaluation




Retrieval Augmented Generation (RAG) is a very popular framework, or class, of LLM application. The basic principle of RAG is to leverage external data sources to give LLMs contextual reference. In the recent past I have written extensively on different RAG approaches and pipelines. But how can we evaluate, measure and quantify the performance of a RAG pipeline?

Any RAG implementation has two aspects: retrieval and generation. The context is established via the retrieval process; generation is performed by the LLM, which composes the answer from the retrieved information.

When evaluating a RAG pipeline, both of these elements need to be evaluated separately and together: an overall score to track performance, and individual scores to pinpoint the aspects to improve.
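The two stages can be sketched as below. This is a toy illustration, not a real implementation: a word-overlap ranker stands in for the retriever, and a prompt template stands in for the LLM call; all names and documents are invented for the example.

```python
# Toy sketch of the two RAG stages: retrieval then generation.

def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Rank documents by word overlap with the question (toy retrieval)."""
    q_words = set(question.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def generate(question: str, contexts: list[str]) -> str:
    """Stand-in for the generation step: a real pipeline would send
    this prompt, with the retrieved contexts inlined, to an LLM."""
    return (
        f"Answer using only this context:\n{' '.join(contexts)}\n\n"
        f"Q: {question}"
    )

docs = [
    "RAG grounds LLM answers in retrieved documents.",
    "Eligibility traces are a reinforcement learning concept.",
]
contexts = retrieve("How does RAG ground answers?", docs)
print(generate("How does RAG ground answers?", contexts))
```

Evaluating retrieval means scoring what `retrieve` returns; evaluating generation means scoring what the LLM does with it, which is exactly the split Ragas makes below.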

Ragas uses LLMs to evaluate a RAG pipeline while also providing actionable metrics using as little annotated data as possible.

Ragas references the following data:

Question: These are the questions your RAG pipeline will be evaluated on.

Answer: The answer generated by the RAG pipeline and presented to the user.

Contexts: The contexts passed into the LLM to answer the question.

Ground Truths: The ground truth answers to the questions.

The following output is produced by Ragas:

Retrieval: context_relevancy and context_recall, which represent the performance of your retrieval system.

Generation: faithfulness, which measures hallucinations, and answer_relevancy, which measures how relevant the answer is to the question.


The harmonic mean of these four metrics gives you the ragas score, a single measure of the performance of your QA system across all the important aspects. (Source)
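As a quick check on the arithmetic, the aggregate score can be reproduced from the four metric values; the scores below are made up for illustration, not real Ragas output.

```python
# Reproduce the overall score as the harmonic mean of the four metrics.
from statistics import harmonic_mean

scores = {
    "context_relevancy": 0.85,
    "context_recall": 0.90,
    "faithfulness": 0.95,
    "answer_relevancy": 0.80,
}
ragas_score = harmonic_mean(scores.values())
print(round(ragas_score, 4))  # ≈ 0.8714
```

The harmonic mean is a deliberate choice: it is dragged down sharply by any single low metric, so a pipeline cannot hide a weak retriever behind a strong generator, or vice versa.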

Considering the data, the questions should be representative of user questions.

The example below uses a dataset with fields for Index, Question, Ground Truth, Answer and Reference Context.

Here is a complete working code example to run your own application; all you will need is an OpenAI API Key, as seen below.


To view the data: in the resulting table the question is visible, together with the ground truth text, the answer and the context. Alongside these are the context relevancy, faithfulness, answer relevancy, context recall and harmfulness scores.
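The shape of that table is roughly as below; it is built here by hand with made-up scores purely to show the layout (in Ragas itself the same view comes from the result object's `to_pandas()` method).

```python
# Illustrative layout of the per-question evaluation table.
import pandas as pd

df = pd.DataFrame(
    {
        "question": ["What is LLM Drift?"],
        "ground_truths": [["Changes in LLM behaviour over a short period."]],
        "answer": ["LLM Drift refers to changes in model responses over time."],
        "contexts": [["A recent study coined the term LLM Drift ..."]],
        "context_relevancy": [0.82],
        "context_recall": [1.0],
        "faithfulness": [1.0],
        "answer_relevancy": [0.91],
        "harmfulness": [0],
    }
)
print(df[["faithfulness", "answer_relevancy"]])
```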

Lastly, I have a question for the community…there is obviously a need to observe, inspect and fine-tune data.

And in this case it is the data RAG accesses: it can be improved with an enhanced chunking strategy, a better embedding model, or an optimised prompt at the heart of the RAG implementation.
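As one concrete example of such a lever, a chunking strategy can be as simple as a fixed-size sliding window with overlap; the window and overlap sizes below are arbitrary, chosen only for illustration.

```python
# Fixed-size, overlapping word-window chunking (one simple strategy).

def chunk_text(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words, with `overlap` words
    shared between consecutive chunks so context is not cut mid-idea."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + size]))
        if start + size >= len(words):
            break
    return chunks

chunks = chunk_text("word " * 120, size=50, overlap=10)
print(len(chunks))  # 3 chunks: words 0-49, 40-89, 80-119
```

Re-running the Ragas retrieval metrics before and after a change like this is exactly how the context_relevancy and context_recall scores earn their keep.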

But this brings us back to the importance of data management, ideally via a data-centric latent space. Intelligently managing and updating the data used for benchmarking will become increasingly important.

I’m currently the Chief Evangelist @ HumanFirst. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.
