Reopening issue 222 #223

Open
luffycodes opened this issue Feb 2, 2021 · 1 comment

luffycodes commented Feb 2, 2021

Hey, I read the paper you suggested. Can you please explain why the early stopping algorithm should have access to the test dataset?

I understand that the final filtered metrics can have access to the test dataset (as done in the paper, to report the actual performance), but the paper does not mention stopping early based on performance measured with the test dataset (one can use the validation dataset to get an estimate of the filtered metrics).

Citing the paper by Bordes verbatim: "selected the best model by early stopping using the mean rank on the validation sets (with a total of at most 1,000 epochs over the training data)"

Originally posted by @luffycodes in #222 (comment)


sumitpai commented Feb 2, 2021

We are not evaluating the early stopping performance on the test set. The MRR computed during early stopping is only on the validation set X['valid'].

We only use the test set to filter known facts out of the negatives generated for each validation triple. If the test set is a list of known facts, rather than unsure hypotheses, we filter out test set triples while generating negatives both during validation and testing.
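If it helps, the idea is simply that the filter is the union of all known triples, regardless of which split they live in. A minimal sketch in plain Python (the splits and names below are placeholders, not the library's internals):

```python
import numpy as np

# Illustrative splits; in practice these come from the dataset loader.
X = {
    'train': np.array([['Alec_Guinness', 'acted_in', 'star_wars']]),
    'valid': np.array([['Harrison_Ford', 'acted_in', 'star_wars']]),
    'test':  np.array([['Carrie_Fisher', 'acted_in', 'star_wars']]),
}

# The filter is simply the set of ALL known facts: train + valid + test.
filter_triples = {tuple(t) for t in np.concatenate([X['train'], X['valid'], X['test']])}

def is_valid_negative(triple):
    """A candidate corruption is kept only if it is not a known fact."""
    return tuple(triple) not in filter_triples
```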

Consider this example:
<Alec_Guinness, acted_in, star_wars>, and several other similar triples (people who acted in Star Wars), are in the train set

<Harrison_Ford, acted_in, star_wars> is in the validation set

and assume that the following are in the test set:

<Carrie_Fisher, acted_in, star_wars>
<Natalie_Portman, acted_in, star_wars>
<Mark_Hamill, acted_in, star_wars>
and 97 more such facts in the test set.

In other words, we have 100 facts about actors who acted in Star Wars in our test set.

During early stopping, say we check for subject-side corruption only, i.e. for each triple in the validation set we replace the subject with ALL the unique entities present in the graph. Then we filter out all the known facts. Finally, we score and rank them.

In the above example, if we had not filtered out ALL known facts, i.e. if we had not used the test set in the filter, then when we generate corruptions for <Harrison_Ford, acted_in, star_wars> the corruptions would also include those 100 triples from the test set. Now if our model ranks <Harrison_Ford, acted_in, star_wars> as 101 (say the 100 test set triples are ranked better than it), would you call it a bad model? To get the true performance of the model on the validation set, we must include the test set in the filter.
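To make that ranking step concrete, here is a rough sketch of filtered subject-side ranking (the score function is a stand-in for the trained model; the function and its names are illustrative, not AmpliGraph internals):

```python
def filtered_rank(test_triple, all_entities, known_triples, score_fn):
    """Rank of test_triple among its subject-side corruptions, after
    removing corruptions that are themselves known facts."""
    s, p, o = test_triple
    corruptions = [(e, p, o) for e in all_entities
                   if (e, p, o) != test_triple and (e, p, o) not in known_triples]
    true_score = score_fn(test_triple)
    # Rank = 1 + number of surviving corruptions scored higher than the true triple.
    return 1 + sum(score_fn(c) > true_score for c in corruptions)
```

Without the test triples in known_triples, the 100 Star Wars actors from the test set survive as corruptions; a good model will score them highly and push <Harrison_Ford, acted_in, star_wars> down to rank 101. With them in the filter, only genuine negatives remain and the rank reflects the model's actual quality.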

Just to summarize, we do not perform early stopping on the test set. We just use it to filter out known facts from the corruptions of validation triples in order to get the true performance of the model on the validation set.
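Concretely, in the fit call the early stopping metric is computed on X['valid'] only, while X['test'] appears solely inside the filter of known facts. Something along these lines (roughly the 1.x API; double check the parameter names against the docs for the version you are on):

```python
import numpy as np
from ampligraph.latent_features import TransE  # any model class works the same way

model = TransE(batches_count=100, epochs=1000, k=100)

model.fit(X['train'],
          early_stopping=True,
          early_stopping_params={
              'x_valid': X['valid'],   # the metric (e.g. MRR) is computed on these triples only
              'criteria': 'mrr',
              # filter of known facts = train + valid + test
              'x_filter': np.concatenate([X['train'], X['valid'], X['test']]),
              'check_interval': 10,
              'stop_interval': 4,
          })
```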

Does that answer your question?

I guess @lukostaz can give a clearer explanation for this (tagging him in this thread).
