issue with metrics evaluation in case of any exception #956

Open
mukuls-zeta opened this issue May 15, 2024 · 5 comments · May be fixed by #995
Labels
bug Something isn't working

Comments

@mukuls-zeta

[ ] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug
I am evaluating a dataset with multiple metrics using the evaluate function. For now I call evaluate once per metric (because of issue #878).
Observation -

  • If there is no exception (e.g. context length exceeded), the results look fine.
  • If there is at least one exception, the results are completely different and unstable (using evaluate with raise_exceptions=False). I think the ordering gets messed up.

Ragas version: 0.1.7
Python version: 3.10.9

Code to Reproduce

import time  # needed for time.sleep() in ragas_evaluate

import ragas
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    context_relevancy,
    answer_correctness
)

from datasets import Dataset 
from ragas import evaluate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

ragas_metrics = [ faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    context_relevancy,
    answer_correctness]

def ragas_evaluate(df_res, metrics, query_col = "Question", truth_ans_col = "Answer" ,
                       llm = ChatOpenAI(model="gpt-3.5-turbo"),
                       embed_model = OpenAIEmbeddings(model="text-embedding-3-small"),
                       raise_exceptions = True,
                       timeout=10, max_retries=3, max_wait=10,
                       is_async = True,
                        callbacks = None
                   ):

        data_samples = {
            'question': df_res[query_col].to_list(),
            'answer': df_res.rag_answer.to_list(),
            'contexts' : df_res.context.to_list(),
            'ground_truth': df_res[truth_ans_col].to_list()
        }

        run_config = ragas.RunConfig(timeout=timeout, max_retries=max_retries, max_wait=max_wait)

        dataset = Dataset.from_dict(data_samples)
   
        for idx, metric in enumerate(metrics):
            print(f">> {idx+1} : {metric.name}")
            score = evaluate(dataset, metrics= [metric], llm = llm,
                        embeddings= embed_model, raise_exceptions=raise_exceptions, run_config=run_config,
                             is_async=is_async, callbacks=callbacks)
            time.sleep(1)
            if idx == 0:
                df_score = score.to_pandas()
            else:
                df_score = df_score.merge(score.to_pandas()[['question',metric.name]], on = 'question', how = "left")

        return df_score


# for now showing with only faithfulness and answer_relevancy
## without any data point that may cause an exception
df_eval = ragas_evaluate(df_response.iloc[list(range(20))], ragas_metrics[0:2], raise_exceptions=False)
df_eval

## with a data point that causes an exception - openai.BadRequestError: Error code: 400 [context_length_exceeded]
df_eval_ex = ragas_evaluate(df_res.iloc[list(range(20))+ [68]], ragas_metrics[0:2], raise_exceptions=False)
df_eval_ex



Result of df_eval

| | question | answer | contexts | ground_truth | faithfulness | answer_relevancy |
|---|---|---|---|---|---|---|
| 0 | How has Intel's total net sales fluctuated ove... | Intel's total net sales decreased from approxi... | [Sub question: By first identifying and quotin... | Intel's total net sales have fluctuated as fol... | 1.000000 | 0.910671 |
| 1 | What are the significant factors that have led... | The significant factors that have led to chang... | [Sub question: By first identifying and quotin... | In the latest 10-Q for the quarter ended Septe... | 1.000000 | 0.994862 |
| 2 | Has there been a substantial shift in Intel's ... | Yes, there has been a substantial shift in Int... | [Sub question: By first identifying and quotin... | Yes, there has been a substantial shift in Int... | 0.500000 | 0.891755 |
| 3 | How has Intel's revenue from chip sales seen v... | Intel's revenue from chip sales has seen signi... | [Sub question: By first identifying and quotin... | Intel's revenue from chip sales has seen the f... | 1.000000 | 0.000000 |
| 4 | Are there identifiable trends in Intel's Data ... | There is a downward trend in Intel's Data Cent... | [Sub question: By first identifying and quotin... | Based on the provided documents, there is a tr... | 0.750000 | 0.000000 |
| 5 | How does Intel's research and development spen... | Intel's research and development spending in t... | [Sub question: By first identifying and quotin... | In the most recent quarter, which ended on Sep... | 0.428571 | 0.861540 |
| 6 | What legal proceedings or potential liabilitie... | Intel's 10-Q reports have disclosed several le... | [Sub question: By first identifying and quotin... | Intel's 10-Q filings disclose several legal pr... | 1.000000 | 0.965009 |
| 7 | Are there any significant alterations in Intel... | Yes, there was a significant increase in Intel... | [Sub question: By first identifying and quotin... | Yes, there have been alterations in Intel's de... | 0.750000 | 0.000000 |
| 8 | In Amazon's Q2 2023 10-Q, how does the discuss... | The discussion on employee compensation and be... | [Sub question: How does Amazon discuss employe... | The discussion on employee compensation and be... | 0.875000 | 0.846506 |
| 9 | Has Intel reported any significant business ac... | Yes, Intel reported the acquisition of Tower S... | [Sub question: By first identifying and quotin... | Yes, Intel reported significant business acqui... | 1.000000 | 0.899457 |
| 10 | How have inventory levels at Intel changed ove... | Intel's inventory levels have decreased from $... | [Sub question: By first identifying and quotin... | Over the quarters provided in the context, Int... | 1.000000 | 0.893291 |
| 11 | Outline the risk factors for Intel's business,... | The risk factors for Intel's business include ... | [Sub question: By first identifying and quotin... | The risk factors for Intel's business as outli... | 1.000000 | 0.840134 |
| 12 | How has Amazon's total net sales changed over ... | IDK | [Sub question: By first identifying and quotin... | Amazon's total net sales have changed over tim... | NaN | 0.000000 |
| 13 | What are the major factors contributing to the... | The major factors contributing to the change i... | [Sub question: By first identifying and quotin... | The major factors contributing to the change i... | 1.000000 | 0.970940 |
| 14 | Has there been any significant change in Amazo... | Yes, there has been a significant change in Am... | [Sub question: By first identifying and quotin... | Yes, there has been a significant change in Am... | 0.800000 | 0.844631 |
| 15 | How has Amazon's revenue from online retail se... | IDK | [Sub question: By first identifying and quotin... | Amazon's revenue from online retail services, ... | NaN | 0.000000 |
| 16 | Can any trends be identified in Amazon's AWS (... | Yes, there is a positive revenue trend in Amaz... | [Sub question: By first identifying and quotin... | Yes, trends can be identified in Amazon's AWS ... | 1.000000 | 0.893554 |
| 17 | What is the impact of foreign exchange rates o... | - For Q1 2023, foreign exchange rates had a mi... | [Sub question: By first identifying and quotin... | The impact of foreign exchange rates on Amazon... | 1.000000 | 0.000000 |
| 18 | Are there any notable changes in Amazon's liqu... | Yes, there are notable changes in Amazon's liq... | [Sub question: By first identifying and quotin... | Yes, there are notable changes in Amazon's liq... | 1.000000 | 0.919368 |
| 19 | How does Amazon's R&D expenditure in the most ... | IDK | [Sub question: By first identifying and quotin... | In the most recent quarter ended September 30,... | NaN | 0.854005 |
  • This is the expected result: rows 12, 15, and 19 have "IDK" as the answer, and accordingly faithfulness and answer_relevancy come out as NaN or 0.



Result of df_eval_ex

| | question | answer | contexts | ground_truth | faithfulness | answer_relevancy |
|---|---|---|---|---|---|---|
| 0 | How has Intel's total net sales fluctuated ove... | Intel's total net sales decreased from approxi... | [Sub question: By first identifying and quotin... | Intel's total net sales have fluctuated as fol... | NaN | NaN |
| 1 | What are the significant factors that have led... | The significant factors that have led to chang... | [Sub question: By first identifying and quotin... | In the latest 10-Q for the quarter ended Septe... | 1.000000 | 0.910671 |
| 2 | Has there been a substantial shift in Intel's ... | Yes, there has been a substantial shift in Int... | [Sub question: By first identifying and quotin... | Yes, there has been a substantial shift in Int... | 1.000000 | 0.986719 |
| 3 | How has Intel's revenue from chip sales seen v... | Intel's revenue from chip sales has seen signi... | [Sub question: By first identifying and quotin... | Intel's revenue from chip sales has seen the f... | 0.000000 | 0.891747 |
| 4 | Are there identifiable trends in Intel's Data ... | There is a downward trend in Intel's Data Cent... | [Sub question: By first identifying and quotin... | Based on the provided documents, there is a tr... | 1.000000 | 0.000000 |
| 5 | How does Intel's research and development spen... | Intel's research and development spending in t... | [Sub question: By first identifying and quotin... | In the most recent quarter, which ended on Sep... | 0.750000 | 0.000000 |
| 6 | What legal proceedings or potential liabilitie... | Intel's 10-Q reports have disclosed several le... | [Sub question: By first identifying and quotin... | Intel's 10-Q filings disclose several legal pr... | 0.750000 | 0.861468 |
| 7 | Are there any significant alterations in Intel... | Yes, there was a significant increase in Intel... | [Sub question: By first identifying and quotin... | Yes, there have been alterations in Intel's de... | 1.000000 | 0.965009 |
| 8 | In Amazon's Q2 2023 10-Q, how does the discuss... | The discussion on employee compensation and be... | [Sub question: How does Amazon discuss employe... | The discussion on employee compensation and be... | 0.750000 | 0.770066 |
| 9 | Has Intel reported any significant business ac... | Yes, Intel reported the acquisition of Tower S... | [Sub question: By first identifying and quotin... | Yes, Intel reported significant business acqui... | 0.857143 | 0.846523 |
| 10 | How have inventory levels at Intel changed ove... | Intel's inventory levels have decreased from $... | [Sub question: By first identifying and quotin... | Over the quarters provided in the context, Int... | 1.000000 | 0.899457 |
| 11 | Outline the risk factors for Intel's business,... | The risk factors for Intel's business include ... | [Sub question: By first identifying and quotin... | The risk factors for Intel's business as outli... | 1.000000 | 0.891580 |
| 12 | How has Amazon's total net sales changed over ... | IDK | [Sub question: By first identifying and quotin... | Amazon's total net sales have changed over tim... | 1.000000 | 0.840090 |
| 13 | What are the major factors contributing to the... | The major factors contributing to the change i... | [Sub question: By first identifying and quotin... | The major factors contributing to the change i... | NaN | 0.000000 |
| 14 | Has there been any significant change in Amazo... | Yes, there has been a significant change in Am... | [Sub question: By first identifying and quotin... | Yes, there has been a significant change in Am... | 0.687500 | 0.961019 |
| 15 | How has Amazon's revenue from online retail se... | IDK | [Sub question: By first identifying and quotin... | Amazon's revenue from online retail services, ... | 0.800000 | 0.900011 |
| 16 | Can any trends be identified in Amazon's AWS (... | Yes, there is a positive revenue trend in Amaz... | [Sub question: By first identifying and quotin... | Yes, trends can be identified in Amazon's AWS ... | NaN | 0.000000 |
| 17 | What is the impact of foreign exchange rates o... | - For Q1 2023, foreign exchange rates had a mi... | [Sub question: By first identifying and quotin... | The impact of foreign exchange rates on Amazon... | 1.000000 | 0.893554 |
| 18 | Are there any notable changes in Amazon's liqu... | Yes, there are notable changes in Amazon's liq... | [Sub question: By first identifying and quotin... | Yes, there are notable changes in Amazon's liq... | 1.000000 | 0.000000 |
| 19 | How does Amazon's R&D expenditure in the most ... | IDK | [Sub question: By first identifying and quotin... | In the most recent quarter ended September 30,... | 0.857143 | 0.852599 |
| 20 | From the most recent 10-Q, how do Intel's comm... | IDK | [Sub question: What are Intel's commitments in... | The most recent 10-Q does not provide specific... | NaN | 0.000000 |
  • The data point at index 20 causes an exception, and the results get messed up.
    • Both metrics are NaN for index 0, which is wrong.
    • Index 3: faithfulness is 0 while answer_relevancy is non-zero.
    • Indexes 12, 15, 19: the RAG answer is 'IDK', but both metrics are still high and non-zero.
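The pattern in the two tables is consistent with a one-position shift. Here is a minimal check using only the answer_relevancy values of the first rows, copied from the tables above. The variable names and the 0.01 tolerance are my own assumptions (the tolerance is meant to absorb run-to-run variation of the LLM), so treat it as a rough sanity check rather than proof:

```python
import math

# answer_relevancy for rows 0-6 of df_eval (run without the failing row).
baseline = [0.910671, 0.994862, 0.891755, 0.000000, 0.000000, 0.861540, 0.965009]

# answer_relevancy for rows 0-7 of df_eval_ex (run with the failing row 20).
with_exception = [math.nan, 0.910671, 0.986719, 0.891747,
                  0.000000, 0.000000, 0.861468, 0.965009]

# Drop the leading NaN and compare position by position.
# abs_tol=0.01 is an assumed tolerance, not a value taken from ragas.
shifted = with_exception[1:]
matches = [math.isclose(a, b, abs_tol=0.01) for a, b in zip(baseline, shifted)]

print("leading NaN:", math.isnan(with_exception[0]))
print("rows match after shifting by one:", all(matches))
```

If the scores were attached to the right rows, the comparison would be expected to line up without dropping the first element.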

Error trace

Expected behavior

  • The results should be nearly the same for df_eval and df_eval_ex (except for data point 20).
  • I think the ordering gets messed up while collecting the results of the different data points.
  • I tried without the evaluate function, i.e. calling each metric's score function and handling exceptions explicitly; the results are nearly fine, but the performance is not.


@mukuls-zeta mukuls-zeta added the bug Something isn't working label May 15, 2024
@baptiste-pasquier

baptiste-pasquier commented May 15, 2024

Hello, I recently ran into the same problem because of an exception raised by OpenAI content filtering.

While investigating I found the origin of the problem here:

r = (-1, np.nan)

In Runner, jobs are normally wrapped with a counter to be able to reorder the metrics at the end because asyncio does not respect the order:

ragas/src/ragas/executor.py

Lines 110 to 114 in 2d79365

def wrap_callable_with_index(self, callable: t.Callable, counter):
    async def wrapped_callable_async(*args, **kwargs):
        return counter, await callable(*args, **kwargs)

    return wrapped_callable_async

ragas/src/ragas/executor.py

Lines 144 to 145 in 2d79365

sorted_results = sorted(executor_job.results, key=lambda x: x[0])
return [r[1] for r in sorted_results]

But when an exception occurs and raise_exceptions=False, the counter is not used: the index -1 is returned, so the sort places the NaN at the beginning and shifts the entire list of metrics by one position.

r = (-1, np.nan)
try:
    r = await future
except MaxRetriesExceeded as e:
    logger.warning(f"max retries exceeded for {e.evolution}")
except Exception as e:
    if self.raise_exceptions:
        raise e
    else:
        logger.error(
            "Runner in Executor raised an exception", exc_info=True
        )
results.append(r)
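To make the effect concrete, here is a small standalone simulation of the sort step. It does not import ragas; the list contents and names are made up for illustration, and only the `(-1, nan)` placeholder and the `sorted(..., key=lambda x: x[0])` call mirror the snippets above:

```python
import math

# (index, score) pairs for three jobs that finished out of order.
completed = [(2, 0.0), (0, 1.0), (1, 0.5)]

# The job with index 3 raised, so the default placeholder is appended
# and its real index is lost.
failed_placeholder = (-1, math.nan)
results = completed + [failed_placeholder]

# Same reordering step as in executor.py: sort by index, keep the score.
ordered = [score for _, score in sorted(results, key=lambda x: x[0])]

# The NaN now sits at position 0 and every real score moved down by one,
# although the failing job was the last one.
print(ordered)
```

The failing row ends up looking fine (it inherits a neighbour's score) while row 0 looks broken, which matches what the tables in the issue show.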

@baptiste-pasquier

A possible solution would be to move the np.nan and the try/except into the function Executor.wrap_callable_with_index instead of Runner._aresults, so that the counter is returned even in the event of an exception.
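A rough sketch of that idea, written as standalone asyncio code rather than as a patch against ragas. The helper `score`, the driver `run_all`, and the `raise_exceptions` argument on the wrapper are hypothetical names chosen for this illustration; only the principle (catch inside the wrapper, always return the counter) comes from the proposal above:

```python
import asyncio
import math
import typing as t


def wrap_callable_with_index(callable_: t.Callable, counter: int,
                             raise_exceptions: bool = False):
    async def wrapped(*args, **kwargs):
        try:
            result = await callable_(*args, **kwargs)
        except Exception:
            if raise_exceptions:
                raise
            result = math.nan  # placeholder score for the failed job
        # The counter is returned in both the success and the failure case.
        return counter, result

    return wrapped


async def score(value: int) -> float:
    # Hypothetical metric: negative inputs simulate a failing LLM call.
    if value < 0:
        raise ValueError("simulated failure")
    await asyncio.sleep(0.01 * (3 - value))  # later inputs finish earlier
    return float(value)


async def run_all(inputs: t.List[int]) -> t.List[float]:
    jobs = [wrap_callable_with_index(score, i)(v) for i, v in enumerate(inputs)]
    results = []
    for future in asyncio.as_completed(jobs):
        results.append(await future)
    # Reorder by the original index, as the executor does.
    return [r for _, r in sorted(results, key=lambda x: x[0])]


scores = asyncio.run(run_all([0, 1, 2, -1]))
print(scores)
```

With the index preserved, the NaN stays in the slot of the job that failed, even though that job is the first to finish here.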

@mukuls-zeta
Author

@baptiste-pasquier Thanks for the RCA. Have you tried the possible solution as a workaround?

@baptiste-pasquier

Here is the fix code: baptiste-pasquier@167b9c5

Code for testing

import ragas
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
)

data = {
    "question": [
        "What is the capital of France?",
        "What is the capital of France?",
        "What is the capital of France?",
        "What is the capital of France?",
    ],
    "answer": [
        "Paris is the capital of France.",
        "",
        "London is the capital of France.",
        "x" * 99999,  # Context length error
    ],
    "contexts": [
        ["Paris is the capital of France."],
        ["Paris is the capital of France."],
        ["Paris is the capital of France."],
        ["Paris is the capital of France."],
    ],
    "ground_truth": [
        "Paris",
        "Paris",
        "Paris",
        "Paris",
    ],
}
dataset = Dataset.from_dict(data)

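metrics = [faithfulness, answer_relevancy]
run_config = ragas.RunConfig(timeout=10, max_retries=1, max_wait=10)

# Assumption: llm and embeddings are not defined in the original snippet.
# The models from the reproduction code in the issue description are used here.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-3.5-turbo")
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")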

Test 1 (without exception)

result = evaluate(
    dataset.select(range(3)),  # Run without the last row
    metrics=metrics,
    llm=llm,
    embeddings=embeddings,
    raise_exceptions=False,
    run_config=run_config,
)
result.to_pandas()

Result before and after fix:

| question | answer | contexts | ground_truth | faithfulness | answer_relevancy |
|---|---|---|---|---|---|
| What is the capital of France? | Paris is the capital of France. | [Paris is the capital of France.] | Paris | 1.0 | 1.0 |
| What is the capital of France? | | [Paris is the capital of France.] | Paris | NaN | 1.0 |
| What is the capital of France? | London is the capital of France. | [Paris is the capital of France.] | Paris | 0.0 | 0.0 |

Test 2 (with exception)

A context length exception is created at the last row of the dataset.

result = evaluate(
    dataset,  # Run on the entire dataset
    metrics=metrics,
    llm=llm,
    embeddings=embeddings,
    raise_exceptions=False,
    run_config=run_config,
)
result.to_pandas()

Result before fix:

| question | answer | contexts | ground_truth | faithfulness | answer_relevancy |
|---|---|---|---|---|---|
| What is the capital of France? | Paris is the capital of France. | [Paris is the capital of France.] | Paris | $\color{red}\textsf{NaN}$ | $\color{red}\textsf{NaN}$ |
| What is the capital of France? | | [Paris is the capital of France.] | Paris | 1.0 | 1.0 |
| What is the capital of France? | London is the capital of France. | [Paris is the capital of France.] | Paris | NaN | 1.0 |
| What is the capital of France? | xxxxxxxxxxxxxx... | [Paris is the capital of France.] | Paris | 0.0 | 0.0 |

The NaNs appear in the first row, which is unexpected behavior (the other metrics are shifted in the table).

Result after fix:

| question | answer | contexts | ground_truth | faithfulness | answer_relevancy |
|---|---|---|---|---|---|
| What is the capital of France? | Paris is the capital of France. | [Paris is the capital of France.] | Paris | 1.0 | 1.0 |
| What is the capital of France? | | [Paris is the capital of France.] | Paris | NaN | 1.0 |
| What is the capital of France? | London is the capital of France. | [Paris is the capital of France.] | Paris | 0.0 | 0.0 |
| What is the capital of France? | xxxxxxxxxxxxxx... | [Paris is the capital of France.] | Paris | $\color{red}\textsf{NaN}$ | $\color{red}\textsf{NaN}$ |

The NaNs appear in the last row, which is the expected behavior.

@mukuls-zeta
Author

Yes, I have tried a similar fix, and it seems to work.
