Addressed situation when assign_default_confidence() returns only dataframe with all NaN confidence values #548

Merged on Jun 26, 2024 (3 commits)

@hrshdhgd (Contributor) commented Jun 25, 2024

Ok, so here was the problem:

When the dataframe whose redundant rows had to be filtered out had all NaN values for confidence, the line

df, nan_df = assign_default_confidence(df)

returned an empty dataframe as df and the entire source dataframe as nan_df.

Due to this, the following line:

dfmax = df.groupby(key, as_index=False)[CONFIDENCE].apply(max).drop_duplicates()

resulted in dfmax = {}, which is of type pandas.Series rather than a dataframe. Hence the confusion.
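To make the failure condition concrete, here is a minimal sketch of how an all-NaN confidence column leaves nothing to group. It uses only pandas; `split_by_confidence` is a hypothetical stand-in for `assign_default_confidence()`, assumed here to separate rows with a confidence value from rows without one, and the column names and sample rows are made up for illustration.

```python
import numpy as np
import pandas as pd

CONFIDENCE = "confidence"


def split_by_confidence(df: pd.DataFrame):
    """Hypothetical stand-in for assign_default_confidence():
    return (rows with a confidence value, rows whose confidence is NaN)."""
    has_conf = df[CONFIDENCE].notna()
    return df[has_conf], df[~has_conf]


# Illustrative mapping rows where every confidence value is missing.
source = pd.DataFrame(
    {
        "subject_id": ["a:1", "a:1"],
        "predicate_id": ["skos:exactMatch", "skos:exactMatch"],
        "object_id": ["b:1", "b:1"],
        CONFIDENCE: [np.nan, np.nan],
    }
)

df, nan_df = split_by_confidence(source)

# df has no rows left, so any groupby over it has no groups to aggregate.
print(df.empty)
print(len(nan_df))
```

With every row routed to `nan_df`, the groupby in the next step operates on an empty frame, which is the situation the guard is meant to catch.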

The correct way to handle this is simply to add an if statement:

sssom-py/src/sssom/util.py

Lines 447 to 469 in ffa2109

    if not df.empty:
        dfmax = df.groupby(key, as_index=False)[CONFIDENCE].apply(max).drop_duplicates()
        max_conf: Dict[Tuple[str, ...], float] = {}
        for _, row in dfmax.iterrows():
            if ignore_predicate:
                max_conf[(row[SUBJECT_ID], row[OBJECT_ID])] = row[CONFIDENCE]
            else:
                max_conf[(row[SUBJECT_ID], row[OBJECT_ID], row[PREDICATE_ID])] = row[CONFIDENCE]
        if ignore_predicate:
            df = df[
                df.apply(
                    lambda x: x[CONFIDENCE] >= max_conf[(x[SUBJECT_ID], x[OBJECT_ID])],
                    axis=1,
                )
            ]
        else:
            df = df[
                df.apply(
                    lambda x: x[CONFIDENCE]
                    >= max_conf[(x[SUBJECT_ID], x[OBJECT_ID], x[PREDICATE_ID])],
                    axis=1,
                )
            ]
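For readers who want to try the guard in isolation, the following is a self-contained sketch of the same idea. It is not the code in `util.py`: the function name `keep_max_confidence`, the column names, and the sample data are assumptions for illustration, and it swaps the dictionary lookup for `groupby(...).transform("max")`. The point it shares with the fix is the early exit on an empty dataframe.

```python
import pandas as pd

CONFIDENCE = "confidence"
KEY = ["subject_id", "object_id", "predicate_id"]


def keep_max_confidence(df: pd.DataFrame, key: list) -> pd.DataFrame:
    """Keep only the rows holding the highest confidence within each key group.

    Hypothetical helper: an empty input is returned unchanged instead of
    being passed to groupby.
    """
    if df.empty:
        return df
    # Broadcast each group's maximum confidence back onto its rows.
    group_max = df.groupby(key)[CONFIDENCE].transform("max")
    return df[df[CONFIDENCE] >= group_max]


rows = pd.DataFrame(
    {
        "subject_id": ["a:1", "a:1", "a:2"],
        "predicate_id": ["skos:exactMatch"] * 3,
        "object_id": ["b:1", "b:1", "b:2"],
        CONFIDENCE: [0.5, 0.9, 0.7],
    }
)

filtered = keep_max_confidence(rows, KEY)
empty_result = keep_max_confidence(rows.iloc[0:0], KEY)
```

Here `filtered` drops the 0.5 row because the same key has a 0.9 row, while `empty_result` comes back as an empty dataframe with its columns intact.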

I've added an explicit test and it passes. Fixes #546

@hrshdhgd hrshdhgd requested a review from matentzn June 25, 2024 02:02
@matentzn (Collaborator) left a comment

LGTM, thank you!

@hrshdhgd hrshdhgd merged commit e0dfcb3 into master Jun 26, 2024
6 checks passed
@hrshdhgd hrshdhgd deleted the issue-546 branch June 26, 2024 21:52
Successfully merging this pull request may close these issues.

Make filter_redundant_rows resistant to the case that there is an empty confidence column