Reopening issue 222 #223

Open
luffycodes opened this issue Feb 2, 2021 · 1 comment

luffycodes commented Feb 2, 2021

Hey, I read the paper you suggested. Can you please explain why the early stopping algorithm should have access to the test dataset?

I understand that the final filtered metrics can have access to the test dataset (as done in the paper, to report the actual performance), but the paper does not mention stopping early based on performance measured with the test dataset (one can use the validation dataset to get an estimate of the filtered metrics).

Citing the paper by Bordes verbatim: "selected the best model by early stopping using the mean rank on the validation sets (with a total of at most 1,000 epochs over the training data)"

Originally posted by @luffycodes in #222 (comment)


sumitpai commented Feb 2, 2021

We are not evaluating the early stopping performance on the test set. The MRR computed during early stopping is only on the validation set X['valid'].

We only use the test set to filter known facts out of the negatives generated for each validation triple. If the test set is a list of known facts, rather than unsure hypotheses, we filter out test set triples while generating negatives both during validation and testing.
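If it helps, the idea is simply that the filter is the union of all known triples, regardless of which split they live in. A minimal sketch in plain Python (the splits and names below are placeholders, not the library's internals):

```python
import numpy as np

# Illustrative splits; in practice these come from the dataset loader.
X = {
    'train': np.array([['Alec_Guinness', 'acted_in', 'star_wars']]),
    'valid': np.array([['Harrison_Ford', 'acted_in', 'star_wars']]),
    'test':  np.array([['Carrie_Fisher', 'acted_in', 'star_wars']]),
}

# The filter is simply the set of ALL known facts: train + valid + test.
filter_triples = {tuple(t) for t in np.concatenate([X['train'], X['valid'], X['test']])}

def is_valid_negative(triple):
    """A candidate corruption is kept only if it is not a known fact."""
    return tuple(triple) not in filter_triples
```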

Consider this example:
<Alec_Guinness, acted_in, star_wars>, and several other similar triples (people who acted in Star Wars), are in the train set

<Harrison_Ford, acted_in, star_wars> is in the validation set

and assume that the following are in the test set:

<Carrie_Fisher, acted_in, star_wars>
<Natalie_Portman, acted_in, star_wars>
<Mark_Hamill, acted_in, star_wars>
and 97 more such facts in the test set.

In other words, we have 100 facts about actors who acted in Star Wars in our test set.

During early stopping, say we check for subject-side corruption only, i.e. for each triple in the validation set we replace the subject with ALL the unique entities present in the graph. Then we filter out all the known facts. Finally, we score and rank them.

In the above example, if we had not filtered out ALL known facts, i.e. if we had not used the test set in the filter, then when we generate corruptions for <Harrison_Ford, acted_in, star_wars> the corruptions would also include those 100 triples from the test set. Now if our model ranks <Harrison_Ford, acted_in, star_wars> as 101 (say the 100 test set triples are ranked better than it), would you call it a bad model? To get the true performance of the model on the validation set, we must include the test set in the filter.
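To make that ranking step concrete, here is a rough sketch of filtered subject-side ranking (the score function is a stand-in for the trained model; the function and its names are illustrative, not AmpliGraph internals):

```python
def filtered_rank(test_triple, all_entities, known_triples, score_fn):
    """Rank of test_triple among its subject-side corruptions, after
    removing corruptions that are themselves known facts."""
    s, p, o = test_triple
    corruptions = [(e, p, o) for e in all_entities
                   if (e, p, o) != test_triple and (e, p, o) not in known_triples]
    true_score = score_fn(test_triple)
    # Rank = 1 + number of surviving corruptions scored higher than the true triple.
    return 1 + sum(score_fn(c) > true_score for c in corruptions)
```

Without the test triples in known_triples, the 100 Star Wars actors from the test set survive as corruptions; a good model will score them highly and push <Harrison_Ford, acted_in, star_wars> down to rank 101. With them in the filter, only genuine negatives remain and the rank reflects the model's actual quality.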

Just to summarize, we do not perform early stopping on the test set. We just use it to filter out known facts from the corruptions of validation triples in order to get the true performance of the model on the validation set.
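Concretely, in the fit call the early stopping metric is computed on X['valid'] only, while X['test'] appears solely inside the filter of known facts. Something along these lines (roughly the 1.x API; double check the parameter names against the docs for the version you are on):

```python
import numpy as np
from ampligraph.latent_features import TransE  # any model class works the same way

model = TransE(batches_count=100, epochs=1000, k=100)

model.fit(X['train'],
          early_stopping=True,
          early_stopping_params={
              'x_valid': X['valid'],   # the metric (e.g. MRR) is computed on these triples only
              'criteria': 'mrr',
              # filter of known facts = train + valid + test
              'x_filter': np.concatenate([X['train'], X['valid'], X['test']]),
              'check_interval': 10,
              'stop_interval': 4,
          })
```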

Does that answer your question?

I guess @lukostaz can give a clearer explanation for this (tagging him in this thread).
