This repository contains examples of how to use Haystack to evaluate systems built with Haystack across different tasks and datasets.
This repository is structured as:
Here we provide full examples of how to use Haystack to evaluate systems, also built with Haystack, for different tasks and datasets.
Name | Dataset | Evaluation Metrics | 📚 Article |
---|---|---|---|
RAG with parameter search | ARAGOG | ContextRelevance, Faithfulness, Semantic Answer Similarity | Benchmarking Haystack Pipelines for Optimal Performance |
Baseline RAG vs HyDE using Harness | ARAGOG | ContextRelevance, Faithfulness, Semantic Answer Similarity | - |
Extractive QA with parameter search | SQuAD | Answer Exact Match, DocumentMRR, DocumentMAP, DocumentRecall, Semantic Answer Similarity | - |
Name | Description |
---|---|
Baseline RAG | Retrieval-Augmented Generation (RAG) is an approach that combines a retriever and a generator to answer questions. The retriever finds relevant documents and the generator creates an answer based on the retrieved documents. |
HyDE | Hypothetical Document Embeddings (HyDE) generates a hypothetical document from the query and uses its embedding to retrieve similar documents from the document embedding space. |
Extractive QA | Extractive Question Answering (QA) is a task where the model is given a question and a document and it has to find the answer to the question in the document. The answer is typically a span of text from the document. |
Sentence-Window | Sentence-Window is a technique that uses a sliding window to extract chunks/sentences from a document. The extracted chunks/sentences are then used to generate answers to questions. |
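
As a rough illustration of the sliding-window idea described for Sentence-Window, the sketch below splits a text into sentences and groups them into overlapping windows. It is plain Python and not Haystack's implementation; the names `split_sentences`, `sentence_windows`, `window_size`, and `stride` are hypothetical and chosen only for this example, and the regex-based sentence splitter is a naive assumption.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_windows(text: str, window_size: int = 3, stride: int = 1) -> List[str]:
    """Slide a window of `window_size` sentences over the text, moving `stride` sentences each step."""
    if window_size < 1 or stride < 1:
        raise ValueError("window_size and stride must be >= 1")
    sentences = split_sentences(text)
    if not sentences:
        return []
    # A text shorter than one window yields a single chunk containing everything.
    if len(sentences) <= window_size:
        return [" ".join(sentences)]
    last_start = len(sentences) - window_size
    return [
        " ".join(sentences[start:start + window_size])
        for start in range(0, last_start + 1, stride)
    ]


if __name__ == "__main__":
    for chunk in sentence_windows("A b. C d. E f. G h.", window_size=2):
        print(chunk)
```

With `window_size=2` and the default `stride=1`, the four-sentence example yields three overlapping chunks. A larger `stride` reduces the overlap between chunks, at the cost of less shared context.
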
Name | Suitable Metrics | Description |
---|---|---|
ARAGOG | ContextRelevance, Faithfulness, Semantic Answer Similarity | A collection of papers from arXiv covering topics around Transformers and Large Language Models, all in PDF format. |
SQuAD 2.0 | Answer Exact Match, DocumentMRR, DocumentMAP, DocumentRecall, Semantic Answer Similarity | A collection of questions and answers from Wikipedia articles, typically used for training and evaluating models for extractive question-answering tasks. |
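
To give a feel for what the statistical metrics listed above measure, here is a minimal plain-Python sketch of exact match, reciprocal rank, average precision, and recall for a single query. It is not Haystack's evaluator code: the function names are hypothetical, documents are represented by made-up string IDs, and average precision is normalized by the total number of relevant documents, which is one common convention and may differ from what a given library does. MRR and MAP are the means of the per-query reciprocal rank and average precision.

```python
from typing import Sequence, Set


def exact_match(predicted: Sequence[str], ground_truth: Sequence[str]) -> float:
    """Fraction of predicted answers that are string-identical to their reference."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not predicted:
        return 0.0
    return sum(p == g for p, g in zip(predicted, ground_truth)) / len(predicted)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@k over the ranks k where a relevant document appears.

    Assumption: normalized by the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def recall(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Share of the relevant documents that appear anywhere in the retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


if __name__ == "__main__":
    retrieved_ids = ["d3", "d1", "d7", "d2"]  # hypothetical ranking for one query
    relevant_ids = {"d1", "d2"}               # hypothetical ground truth
    print(reciprocal_rank(retrieved_ids, relevant_ids))    # 0.5
    print(average_precision(retrieved_ids, relevant_ids))  # 0.5
    print(recall(retrieved_ids, relevant_ids))             # 1.0
```

Semantic Answer Similarity, Faithfulness, and ContextRelevance are model-based metrics and are not covered by this sketch.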