Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add intra_distance_score evaluation #103

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

mokarakaya
Copy link

Issue #90 (Diversification metrics for evaluation)

Intra_distance diversity is probably mostly considered metric of diversity.
Therefore I'd like to add it first.

@maciejkula
Copy link
Owner

maciejkula commented Mar 17, 2018

Thank you for this!

I think to move this forward, we'll want to do two things.

  1. Can we add some references to the metric in the docstring? Maybe some papers that use it?
  2. We'll need to make this fast. From a cursory glance at the code, I suspect it's incredibly slow.

On making this fast: as far as I understand, we want to compute the cosine distance between the columns i and j of the (sparse) interaction matrix. There are a couple of things that we can do to make this faster than the current implementation:

  1. We convert the interactions object to a sparse matrix, transpose it, and make it CSR. This way, rows represent items. Call this mat.
  2. We get the lengths of the item vectors by calling lenghts = mat.getnnz(axis=1) (I think, the axis argument always throws me).
  3. We can get the length of the intersection of i and j by doing
numerator = np.in1d(mat[i].indices, mat[j].indices, assume_unique=True).sum()
denominator = lengths[i] * lengths[j]
distance = numerator / denominator

This way we don't need the cache either.

  1. Let's return this in a (num_users, k * (k-1) / 2) array (it's a list of lists right now).

We can make it even faster by using the fact that indices are sorted and using numba, but this is probably a good first step.

Copy link
Owner

@maciejkula maciejkula left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, I made some additional comments.

More generally:

  1. Can you give me a sense of how slow/fast this is? Could you post how long it takes for the Movielens 100k dataset, for instance?
  2. We will need a version of this for sequence-based models. Have a look at the MRR routines to see how the two differ.


distances = []
test = test.tocsr()
mat = train.tocoo().T.tocsr()
Copy link
Owner

@maciejkula maciejkula Apr 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the coo conversion necessary?

"""

distances = []
test = test.tocsr()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we need test for? For knowing which users to compute the predictions for?

Maybe a cleaner way of doing the same would be to allow the user to pass in an optional array of user ids for which the metric should be computed.

for user_id, row in enumerate(test):
predictions = -model.predict(user_id)
rec_list = predictions.argsort()[:k]
distance = [
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally find nested list comprehensions very confusing. Could we use nested for loops here?

@mokarakaya
Copy link
Author

Thank you very much for the comments. I agree.
I'll check and update accordingly.

In addition to these comments, I'm still looking for a way to move distance function to input parameters. The current function will be the default one since it's really fast (I'll post exact times separately)

@mokarakaya
Copy link
Author

I've fixed the review comments except for the sequence-based models solution.

calling intra_distance_score function in tests takes 6 seconds (only the function) when I run locally.

I should check how we can achieve to run this on sequence-based models;
We need to convert the sequence to items array with user_ids in order to compute the distance between items.

So we need a matrix like this to calculate the distance;
[[userId1, userId2, userId3],
[userId2, userId3]]

where each row represents an item.

Any advise or guidance would be greatly appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants