issue with metrics evaluation in case of any exception #956

mukuls-zeta · 2024-05-15T07:42:36Z

[ ] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug
I am evaluating a dataset with multiple metrics using the evaluate function. For now I call evaluate once per metric (because of issue #878).
Observations:

If there is no exception (e.g. context length exceeded), the results look fine.
If there is at least one exception, the results are completely different and unstable (using evaluate with raise_exceptions=False). I think the ordering gets mixed up.

Ragas version: 0.1.7
Python version: 3.10.9

Code to Reproduce

import time

import ragas
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    context_relevancy,
    answer_correctness
)

from datasets import Dataset 
from ragas import evaluate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

ragas_metrics = [
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    context_relevancy,
    answer_correctness,
]


def ragas_evaluate(
    df_res,
    metrics,
    query_col="Question",
    truth_ans_col="Answer",
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    embed_model=OpenAIEmbeddings(model="text-embedding-3-small"),
    raise_exceptions=True,
    timeout=10,
    max_retries=3,
    max_wait=10,
    is_async=True,
    callbacks=None,
):
    # Map the dataframe columns to the column names that ragas expects.
    data_samples = {
        "question": df_res[query_col].to_list(),
        "answer": df_res.rag_answer.to_list(),
        "contexts": df_res.context.to_list(),
        "ground_truth": df_res[truth_ans_col].to_list(),
    }

    run_config = ragas.RunConfig(timeout=timeout, max_retries=max_retries, max_wait=max_wait)

    dataset = Dataset.from_dict(data_samples)

    # Evaluate one metric at a time and merge the per-metric scores on 'question'.
    for idx, metric in enumerate(metrics):
        print(f">> {idx+1} : {metric.name}")
        score = evaluate(
            dataset,
            metrics=[metric],
            llm=llm,
            embeddings=embed_model,
            raise_exceptions=raise_exceptions,
            run_config=run_config,
            is_async=is_async,
            callbacks=callbacks,
        )
        time.sleep(1)
        if idx == 0:
            df_score = score.to_pandas()
        else:
            df_score = df_score.merge(score.to_pandas()[["question", metric.name]], on="question", how="left")

    return df_score


# For now, showing only faithfulness and answer_relevancy
## Without any data point that may cause an exception
df_eval = ragas_evaluate(df_response.iloc[list(range(20))], ragas_metrics[0:2], raise_exceptions=False)
df_eval

## With a data point that causes an exception - openai.BadRequestError: Error code: 400 [context_length_exceeded]
df_eval_ex = ragas_evaluate(df_res.iloc[list(range(20))+ [68]], ragas_metrics[0:2], raise_exceptions=False)
df_eval_ex

Result of df_eval

	question	answer	contexts	ground_truth	faithfulness	answer_relevancy
0	How has Intel's total net sales fluctuated ove...	Intel's total net sales decreased from approxi...	[Sub question: By first identifying and quotin...	Intel's total net sales have fluctuated as fol...	1.000000	0.910671
1	What are the significant factors that have led...	The significant factors that have led to chang...	[Sub question: By first identifying and quotin...	In the latest 10-Q for the quarter ended Septe...	1.000000	0.994862
2	Has there been a substantial shift in Intel's ...	Yes, there has been a substantial shift in Int...	[Sub question: By first identifying and quotin...	Yes, there has been a substantial shift in Int...	0.500000	0.891755
3	How has Intel's revenue from chip sales seen v...	Intel's revenue from chip sales has seen signi...	[Sub question: By first identifying and quotin...	Intel's revenue from chip sales has seen the f...	1.000000	0.000000
4	Are there identifiable trends in Intel's Data ...	There is a downward trend in Intel's Data Cent...	[Sub question: By first identifying and quotin...	Based on the provided documents, there is a tr...	0.750000	0.000000
5	How does Intel's research and development spen...	Intel's research and development spending in t...	[Sub question: By first identifying and quotin...	In the most recent quarter, which ended on Sep...	0.428571	0.861540
6	What legal proceedings or potential liabilitie...	Intel's 10-Q reports have disclosed several le...	[Sub question: By first identifying and quotin...	Intel's 10-Q filings disclose several legal pr...	1.000000	0.965009
7	Are there any significant alterations in Intel...	Yes, there was a significant increase in Intel...	[Sub question: By first identifying and quotin...	Yes, there have been alterations in Intel's de...	0.750000	0.000000
8	In Amazon's Q2 2023 10-Q, how does the discuss...	The discussion on employee compensation and be...	[Sub question: How does Amazon discuss employe...	The discussion on employee compensation and be...	0.875000	0.846506
9	Has Intel reported any significant business ac...	Yes, Intel reported the acquisition of Tower S...	[Sub question: By first identifying and quotin...	Yes, Intel reported significant business acqui...	1.000000	0.899457
10	How have inventory levels at Intel changed ove...	Intel's inventory levels have decreased from $...	[Sub question: By first identifying and quotin...	Over the quarters provided in the context, Int...	1.000000	0.893291
11	Outline the risk factors for Intel's business,...	The risk factors for Intel's business include ...	[Sub question: By first identifying and quotin...	The risk factors for Intel's business as outli...	1.000000	0.840134
12	How has Amazon's total net sales changed over ...	IDK	[Sub question: By first identifying and quotin...	Amazon's total net sales have changed over tim...	NaN	0.000000
13	What are the major factors contributing to the...	The major factors contributing to the change i...	[Sub question: By first identifying and quotin...	The major factors contributing to the change i...	1.000000	0.970940
14	Has there been any significant change in Amazo...	Yes, there has been a significant change in Am...	[Sub question: By first identifying and quotin...	Yes, there has been a significant change in Am...	0.800000	0.844631
15	How has Amazon's revenue from online retail se...	IDK	[Sub question: By first identifying and quotin...	Amazon's revenue from online retail services, ...	NaN	0.000000
16	Can any trends be identified in Amazon's AWS (...	Yes, there is a positive revenue trend in Amaz...	[Sub question: By first identifying and quotin...	Yes, trends can be identified in Amazon's AWS ...	1.000000	0.893554
17	What is the impact of foreign exchange rates o...	- For Q1 2023, foreign exchange rates had a mi...	[Sub question: By first identifying and quotin...	The impact of foreign exchange rates on Amazon...	1.000000	0.000000
18	Are there any notable changes in Amazon's liqu...	Yes, there are notable changes in Amazon's liq...	[Sub question: By first identifying and quotin...	Yes, there are notable changes in Amazon's liq...	1.000000	0.919368
19	How does Amazon's R&D expenditure in the most ...	IDK	[Sub question: By first identifying and quotin...	In the most recent quarter ended September 30,...	NaN	0.854005

This is the expected result: rows 12, 15 and 19 have "IDK" as the answer, and accordingly faithfulness and answer_relevancy come out as NaN/0.

Result of df_eval_ex

	question	answer	contexts	ground_truth	faithfulness	answer_relevancy
0	How has Intel's total net sales fluctuated ove...	Intel's total net sales decreased from approxi...	[Sub question: By first identifying and quotin...	Intel's total net sales have fluctuated as fol...	NaN	NaN
1	What are the significant factors that have led...	The significant factors that have led to chang...	[Sub question: By first identifying and quotin...	In the latest 10-Q for the quarter ended Septe...	1.000000	0.910671
2	Has there been a substantial shift in Intel's ...	Yes, there has been a substantial shift in Int...	[Sub question: By first identifying and quotin...	Yes, there has been a substantial shift in Int...	1.000000	0.986719
3	How has Intel's revenue from chip sales seen v...	Intel's revenue from chip sales has seen signi...	[Sub question: By first identifying and quotin...	Intel's revenue from chip sales has seen the f...	0.000000	0.891747
4	Are there identifiable trends in Intel's Data ...	There is a downward trend in Intel's Data Cent...	[Sub question: By first identifying and quotin...	Based on the provided documents, there is a tr...	1.000000	0.000000
5	How does Intel's research and development spen...	Intel's research and development spending in t...	[Sub question: By first identifying and quotin...	In the most recent quarter, which ended on Sep...	0.750000	0.000000
6	What legal proceedings or potential liabilitie...	Intel's 10-Q reports have disclosed several le...	[Sub question: By first identifying and quotin...	Intel's 10-Q filings disclose several legal pr...	0.750000	0.861468
7	Are there any significant alterations in Intel...	Yes, there was a significant increase in Intel...	[Sub question: By first identifying and quotin...	Yes, there have been alterations in Intel's de...	1.000000	0.965009
8	In Amazon's Q2 2023 10-Q, how does the discuss...	The discussion on employee compensation and be...	[Sub question: How does Amazon discuss employe...	The discussion on employee compensation and be...	0.750000	0.770066
9	Has Intel reported any significant business ac...	Yes, Intel reported the acquisition of Tower S...	[Sub question: By first identifying and quotin...	Yes, Intel reported significant business acqui...	0.857143	0.846523
10	How have inventory levels at Intel changed ove...	Intel's inventory levels have decreased from $...	[Sub question: By first identifying and quotin...	Over the quarters provided in the context, Int...	1.000000	0.899457
11	Outline the risk factors for Intel's business,...	The risk factors for Intel's business include ...	[Sub question: By first identifying and quotin...	The risk factors for Intel's business as outli...	1.000000	0.891580
12	How has Amazon's total net sales changed over ...	IDK	[Sub question: By first identifying and quotin...	Amazon's total net sales have changed over tim...	1.000000	0.840090
13	What are the major factors contributing to the...	The major factors contributing to the change i...	[Sub question: By first identifying and quotin...	The major factors contributing to the change i...	NaN	0.000000
14	Has there been any significant change in Amazo...	Yes, there has been a significant change in Am...	[Sub question: By first identifying and quotin...	Yes, there has been a significant change in Am...	0.687500	0.961019
15	How has Amazon's revenue from online retail se...	IDK	[Sub question: By first identifying and quotin...	Amazon's revenue from online retail services, ...	0.800000	0.900011
16	Can any trends be identified in Amazon's AWS (...	Yes, there is a positive revenue trend in Amaz...	[Sub question: By first identifying and quotin...	Yes, trends can be identified in Amazon's AWS ...	NaN	0.000000
17	What is the impact of foreign exchange rates o...	- For Q1 2023, foreign exchange rates had a mi...	[Sub question: By first identifying and quotin...	The impact of foreign exchange rates on Amazon...	1.000000	0.893554
18	Are there any notable changes in Amazon's liqu...	Yes, there are notable changes in Amazon's liq...	[Sub question: By first identifying and quotin...	Yes, there are notable changes in Amazon's liq...	1.000000	0.000000
19	How does Amazon's R&D expenditure in the most ...	IDK	[Sub question: By first identifying and quotin...	In the most recent quarter ended September 30,...	0.857143	0.852599
20	From the most recent 10-Q, how do Intel's comm...	IDK	[Sub question: What are Intel's commitments in...	The most recent 10-Q does not provide specific...	NaN	0.000000

The data point at index 20 causes the exception, and the results are scrambled:
- Both metrics are NaN for index 0, which is wrong.
- Index 3: faithfulness is 0 while answer_relevancy is non-zero.
- Indices 12, 15, 19: the RAG answer is 'IDK', yet both metrics are high and non-zero.
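
One way to check the suspicion that the scores are shifted by one row is to compare the answer_relevancy columns of the two tables above, once row by row and once with df_eval_ex moved up by one row. The sketch below copies the values from those tables; the helper name count_close and the 0.01 tolerance are arbitrary choices for illustration and are not part of ragas.

```python
import math

# answer_relevancy from df_eval (rows 0-19), copied from the table above
base = [
    0.910671, 0.994862, 0.891755, 0.000000, 0.000000,
    0.861540, 0.965009, 0.000000, 0.846506, 0.899457,
    0.893291, 0.840134, 0.000000, 0.970940, 0.844631,
    0.000000, 0.893554, 0.000000, 0.919368, 0.854005,
]

# answer_relevancy from df_eval_ex (rows 0-20), copied from the table above
with_exc = [
    math.nan, 0.910671, 0.986719, 0.891747, 0.000000,
    0.000000, 0.861468, 0.965009, 0.770066, 0.846523,
    0.899457, 0.891580, 0.840090, 0.000000, 0.961019,
    0.900011, 0.000000, 0.893554, 0.000000, 0.852599,
    0.000000,
]


def count_close(a, b, tol=0.01):
    """Count pairs where both values are numbers and differ by less than tol."""
    return sum(
        1
        for x, y in zip(a, b)
        if not (math.isnan(x) or math.isnan(y)) and abs(x - y) < tol
    )


aligned = count_close(base, with_exc)      # row i compared with row i
shifted = count_close(base, with_exc[1:])  # row i compared with row i + 1
print(f"row-by-row matches: {aligned}, shifted-by-one matches: {shifted}")
```

If the shifted comparison matches far more rows than the row-by-row one, the scores were most likely attached to the wrong rows rather than computed differently. Some mismatch remains either way because LLM-based metrics are not fully deterministic between runs.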

Error trace

Expected behavior

The results should be nearly the same for df_eval and df_eval_ex (except for data point 20).
I think the ordering gets mixed up while collecting the results of the different data points.
I also tried without the evaluate function, i.e. calling each metric's score function and handling exceptions explicitly. The results are nearly fine, but the performance is not.


baptiste-pasquier · 2024-05-15T08:17:03Z

Hello, I encountered the same problem recently because of a content-filtering exception from OpenAI.

While investigating, I found the origin of the problem here:

ragas/src/ragas/executor.py

Line 77 in 2d79365

r = (-1, np.nan)

In Runner, jobs are normally wrapped with a counter to be able to reorder the metrics at the end because asyncio does not respect the order:

ragas/src/ragas/executor.py

Lines 110 to 114 in 2d79365

    
    def wrap_callable_with_index(self, callable: t.Callable, counter):
        async def wrapped_callable_async(*args, **kwargs):
            return counter, await callable(*args, **kwargs)

        return wrapped_callable_async

ragas/src/ragas/executor.py

Lines 144 to 145 in 2d79365

    
    sorted_results = sorted(executor_job.results, key=lambda x: x[0])
    return [r[1] for r in sorted_results]

But in the event of an exception and when raise_exceptions=False, the counter is not used: the index -1 is returned, which by default places the NaN at the beginning, and shakes up the entire list of metrics.

ragas/src/ragas/executor.py

Lines 77 to 89 in 2d79365

    
    r = (-1, np.nan)
    try:
        r = await future
    except MaxRetriesExceeded as e:
        logger.warning(f"max retries exceeded for {e.evolution}")
    except Exception as e:
        if self.raise_exceptions:
            raise e
        else:
            logger.error(
                "Runner in Executor raised an exception", exc_info=True
            )
    results.append(r)
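
The effect of the -1 index can be reproduced without ragas. This is a minimal sketch with made-up scores, where plain tuples stand in for the (counter, value) pairs collected by the runner and row 3 is assumed to be the job that failed:

```python
import math

# (counter, score) pairs in completion order; asyncio gives no ordering guarantee.
results = [(2, 0.5), (0, 1.0), (1, 0.75)]

# The job for row 3 raised, so the default (-1, nan) is appended instead of (3, nan).
results.append((-1, math.nan))

# Same reordering step as in the executor: sort by counter, keep the values.
ordered = [value for _, value in sorted(results, key=lambda x: x[0])]

for row, value in enumerate(ordered):
    print(row, value)
```

Because -1 sorts before every real counter, the NaN ends up on row 0 and every other score moves down by one row, so row 3 (the one that actually failed) receives the score of row 2.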

baptiste-pasquier · 2024-05-15T08:25:38Z

A possible solution would be to move the np.nan and try/except into the function Executor.wrap_callable_with_index instead of Runner._aresults, so that the counter is returned even in the event of an exception.
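
A minimal sketch of that idea, independent of ragas: the wrapper catches the exception itself and returns its own counter together with NaN. The names wrap_with_index, fake_metric and run_all are hypothetical, the sketch always swallows exceptions (as with raise_exceptions=False), and it is not the code of the actual fix.

```python
import asyncio
import math


def wrap_with_index(func, index):
    """Wrap an async callable so it always returns (index, value)."""

    async def wrapped(*args, **kwargs):
        try:
            return index, await func(*args, **kwargs)
        except Exception:
            # Keep the real index so the NaN stays on the failing row.
            return index, math.nan

    return wrapped


async def fake_metric(x):
    """Hypothetical metric: fails on negative input, later rows finish first."""
    if x < 0:
        raise ValueError("cannot score negative input")
    await asyncio.sleep(0.01 * (3 - x))
    return x / 10


async def run_all(inputs):
    tasks = [
        asyncio.create_task(wrap_with_index(fake_metric, i)(x))
        for i, x in enumerate(inputs)
    ]
    results = []
    # Results arrive in completion order, not submission order.
    for future in asyncio.as_completed(tasks):
        results.append(await future)
    # Restore submission order using the index carried by every result.
    return [value for _, value in sorted(results, key=lambda r: r[0])]


scores = asyncio.run(run_all([1, -1, 3]))
print(scores)
```

With this layout the failing input (the second one) keeps its own position in the output, and the scores of the other rows are not shifted.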

mukuls-zeta · 2024-05-15T09:42:18Z

@baptiste-pasquier Thanks for the RCA. Have you tried the proposed solution in code?

baptiste-pasquier · 2024-05-15T12:59:12Z

Here is the fix code: baptiste-pasquier@167b9c5

Code for testing

import ragas
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
)

data = {
    "question": [
        "What is the capital of France?",
        "What is the capital of France?",
        "What is the capital of France?",
        "What is the capital of France?",
    ],
    "answer": [
        "Paris is the capital of France.",
        "",
        "London is the capital of France.",
        "x" * 99999,  # Context length error
    ],
    "contexts": [
        ["Paris is the capital of France."],
        ["Paris is the capital of France."],
        ["Paris is the capital of France."],
        ["Paris is the capital of France."],
    ],
    "ground_truth": [
        "Paris",
        "Paris",
        "Paris",
        "Paris",
    ],
}
dataset = Dataset.from_dict(data)

metrics = [faithfulness, answer_relevancy]
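run_config = ragas.RunConfig(timeout=10, max_retries=1, max_wait=10)
# Note: llm and embeddings used in the tests below are not defined in this snippet and must be created beforehand.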

Test 1 (without exception)

result = evaluate(
    dataset.select(range(3)),  # Run without the last row
    metrics=metrics,
    llm=llm,
    embeddings=embeddings,
    raise_exceptions=False,
    run_config=run_config,
)
result.to_pandas()

Result before and after fix:

question	answer	contexts	ground_truth	faithfulness	answer_relevancy
What is the capital of France?	Paris is the capital of France.	[Paris is the capital of France.]	Paris	1.0	1.0
What is the capital of France?		[Paris is the capital of France.]	Paris	NaN	1.0
What is the capital of France?	London is the capital of France.	[Paris is the capital of France.]	Paris	0.0	0.0

Test 2 (with exception)

A context length exception is created at the last row of the dataset.

result = evaluate(
    dataset,  # Run on the entire dataset
    metrics=metrics,
    llm=llm,
    embeddings=embeddings,
    raise_exceptions=False,
    run_config=run_config,
)
result.to_pandas()

Result before fix:

question	answer	contexts	ground_truth	faithfulness	answer_relevancy
What is the capital of France?	Paris is the capital of France.	[Paris is the capital of France.]	Paris	NaN	NaN
What is the capital of France?		[Paris is the capital of France.]	Paris	1.0	1.0
What is the capital of France?	London is the capital of France.	[Paris is the capital of France.]	Paris	NaN	1.0
What is the capital of France?	xxxxxxxxxxxxxx...	[Paris is the capital of France.]	Paris	0.0	0.0

The NaNs appear in the first row, which is unexpected behavior (the other metrics are shifted in the table).

Result after fix:

question	answer	contexts	ground_truth	faithfulness	answer_relevancy
What is the capital of France?	Paris is the capital of France.	[Paris is the capital of France.]	Paris	1.0	1.0
What is the capital of France?		[Paris is the capital of France.]	Paris	NaN	1.0
What is the capital of France?	London is the capital of France.	[Paris is the capital of France.]	Paris	0.0	0.0
What is the capital of France?	xxxxxxxxxxxxxx...	[Paris is the capital of France.]	Paris	NaN	NaN

The NaNs appear in the last row, which is the expected behavior.

mukuls-zeta · 2024-05-16T04:59:10Z

Yes, I have tried a similar fix, and it seems to work.

mukuls-zeta added the bug label May 15, 2024

baptiste-pasquier linked a pull request May 28, 2024 that will close this issue

Fix counter in Runner and Executor in case of any exception #995

Open
