Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse matrix support in hierarchical clustering #173

Open
hecking opened this issue Jul 16, 2019 · 4 comments
Open

Sparse matrix support in hierarchical clustering #173

hecking opened this issue Jul 16, 2019 · 4 comments

Comments

@hecking
Copy link

hecking commented Jul 16, 2019

I ran into an Out of memory error when applying single linkage hierarchical clustering on a large sparse distance matrix.

I figured out that the problem is in the first line of the hclust_minimum function

function hclust_minimum(ds::AbstractMatrix{T}) where T<:Real
     d = Matrix(ds)

This creates a dense matrix from any kind of input distance matrix. If ds is a sparse matrix that still fits into the memory, this line can crash the procedure if the dense counterpart of df becomes to big for the working memory.

This issue could easily be solved using a SparseMatrixCSC matrix from the SparseArray module.

function hclust_minimum(ds::AbstractMatrix{T}) where T<:Real
     d = SparseMatrixCSC(ds)

I think for the other types of hierarchical clustering it will be similar.

@alyst
Copy link
Member

alyst commented Jul 16, 2019

hclust_minimum() method really uses the d in the "dense" way updating all elements of one row/col per iteration, so I'm not sure that

  • SparseMatrixCSC is the right type for d
  • just switching the type would be enough to benefit from d sparsity.

Actually, it's somewhat intriguing to have a sparse distance matrix.
Does it mean that most of your datapoints are duplicates (zero distances)?
Maybe for your clustering problem you can consider removing the duplicates first, and then running hclust() with the reduced distance matrix that only contains the distances between distinct points?

@hecking
Copy link
Author

hecking commented Jul 16, 2019

Regarding the updating of the distance matrix there is no difference between sparse or dense storage.

I agree that sparse distance matrices are rare but this can happen with non-metric distances. In my case I have to deal with very large matrices where distances are not known (needed) between many pairs of data points. In these cases the distance is set to 0. The distance has a negative value for more similar data points, e.g. for duplicates the distance would be -1.

@alyst
Copy link
Member

alyst commented Jul 16, 2019

Thanks for the clarifications.
I think we definitely should not do a hard switch from Matrix to SparseMatrixCSC, because it will kill the performance for most of the use cases.

But you can try introducing the type parameter that defines which matrix format to use. Something like this:

function hclust_minimum(::Type{M}, ds::AbstractMatrix{T}) where {M<:AbstractMatrix, T<:Real}
     d = M(ds)
    ....
end

And sparse keyword argument to hclust() (hclust(..., sparse=false/true)) so that if sparse=true, it calls hclust_minimum(SparseMatrixCSC, ...) and hclust_minimum(Matrix, ...) otherwise.
You can submit the PR introducing this change, but I personally consider this approach highly experimental.

@hecking
Copy link
Author

hecking commented Jul 17, 2019

Thank you for the suggestions. I think it can be a quite flexible solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants