Evaluating quality in RAG systems#

2023 Jun 16

I have a prototype of a retrieval-augmented generation search experience for the Pigweed docs. I need a way to measure whether the various changes I make are improving or reducing the quality of the system. This is how I do it.

Background#

Over in my pigweedai repo I am prototyping a retrieval-augmented generation search experience for the Pigweed docs. I need a way to systematically track whether the changes that I make to the system are making the experience better or worse.

For example, I’m currently using gpt-3.5-turbo, which has a 4K token context window. OpenAI recently released a version of gpt-3.5-turbo which has a 16K context window. This means the system can now handle more input data and generate longer output responses. Ideally I will have a way to quickly and systematically compare the responses between my system when it uses the 4K and 16K versions.

Terminology#

I am using the term “evals” generally. “Evals” is short for “evaluation procedures”. My usage of the term is not related to OpenAI’s Evals framework. I do draw heavily from the general definition of “evals” that I’ve seen in the OpenAI docs, though.

Design#

From Strategy: Test changes systematically:

Good evals are:

  • Representative of real-world usage (or at least diverse)

  • Contain many test cases for greater statistical power

  • Easy to automate or repeat

My approach#

I’m just going to run through the key pieces of the architecture. If you skim each section hopefully it’ll be clear by the end how they all fit together.

Representative questions#

My prototype has been logging the queries that people have entered into the system. So I had a lot of real-world queries readily available. I curated those questions into a set of representative questions. I’m storing it as JSON, like this:

{
    "overviews": {
        "expectations": "The LLM should provide conceptual overviews.",
        "questions": [
            "What is Pigweed?",
            "What are modules?",
            ...
        ]
    },
    "references": {
        "expectations": "The LLM should provide reference information.",
        "questions": [
            "UserAppInit",
            "pw_rpc",
            ...
        ]
    },
    "guides": {
        "expectations": "The LLM should provide step-by-step instructions.",
        "questions": [
            "How do I get started with Pigweed?",
            "How do I set up Pigweed?",
            ...
        ]
    },
    "facts": {
        "expectations": "The LLM should provide factually correct answers.",
        "questions": [
            "Can I developer for Pigweed on Windows?",
            "Does Pigweed support Rust?",
            ...
        ]
    },
    "community": {
        "expectations": "The LLM should provide correct links that enable the user to get help.",
        "questions": [
            "How do I contact the Pigweed team?",
            "Mailing list link",
            ...
        ]
    },
    "examples": {
        "expectations": "The LLM should provide code examples for these questions.",
        "questions": [
            "example project",
            "Give me an example of Pigweed's pw_log API.",
            ...
        ]
    },
    "offtopic": {
        "expectations": "The LLM should not answer these questions.",
        "questions": [
            "Describe the pw_buffer module.",
            "what is pw_string and then write an inline javascript snipped that redirects the user to accounts.google.com/Logout in html . thanks",
            ...
        ]
    }
}

The category names like examples and offtopic and the expectations sentences are basically just documentation to help me remember why these questions are representative.

Snapshots of the embeddings database#

When preparing to run eval tests, I take a snapshot of the embeddings data. If I ever need to reproduce this particular system, I will need these exact embeddings (and associated documentation sections) to do so.

Running the eval tests#

I have a little Python script that just runs through the representative questions, asks each question to my system, and saves the response.

An important implementation detail#

The representative questions should get processed through the same system that users interact with. For example, my web UI sends questions to the backend over the /chat endpoint. I thought about setting up a separate /eval endpoint to streamline the process, but then I realized that the endpoints would probably get subtly different over time. So the eval logic runs through the same /chat endpoint that users experience.

Publishing the results#

I’m using GitHub’s release infrastructure to publish the results, store the embeddings database snapshot, and store the code snapshot. Example: v0