This package boosts a sparse matrix multiplication followed by selecting the top-n multiplication
sparse_dot_topn provides a fast way to perform a sparse matrix multiplication followed by selection of the top-n multiplication results.
In practice, comparing very large feature vectors and picking the best matches often comes down to a sparse matrix multiplication followed by selecting the top-n multiplication results.
sparse_dot_topn provides a (parallelised) sparse matrix multiplication implementation that integrates selecting the top-n values, resulting in a significantly lower memory footprint and improved performance. On an Apple M2 Pro, multiplying two 20k x 193k TF-IDF matrices with sparse_dot_topn can be up to 6 times faster when retaining the top 10 values per row and utilising 8 cores. See the benchmark directory for details.
sp_matmul_topn supports {CSR, CSC, COO} matrices with {32, 64}bit {int, float} data.
Note that COO and CSC inputs are converted to the CSR format and are therefore slower.
Two options to further reduce memory requirements are threshold and density.
Optionally, the values can be sorted such that the first column for a given row contains the largest value.
Note that sp_matmul_topn(A, B, top_n=B.shape[1]) is equal to sp_matmul(A, B) and A.dot(B).
If you are migrating from v0.* please see the migration guide below for details.
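To make the semantics concrete, here is a minimal pure-Python sketch of what "multiply, then keep the top-n values per row" means. It is a naive dense reference and not the library's implementation. The function name matmul_topn_reference is hypothetical, and the assumption that the threshold is a strict "greater than" comparison is ours.

```python
def matmul_topn_reference(A, B, top_n, threshold=None, sort=False):
    """Naive dense reference: keep the top_n largest non-zero values per row of A @ B.

    A and B are lists of lists. Returns one list of (column, value) pairs per row of A.
    Assumption: `threshold` keeps only values strictly greater than it.
    """
    n_cols = len(B[0])
    result = []
    for row in A:
        # Dense row of the product A @ B.
        products = [sum(a * B[k][j] for k, a in enumerate(row)) for j in range(n_cols)]
        # Keep non-zero entries, optionally only those above the threshold.
        candidates = [
            (j, v)
            for j, v in enumerate(products)
            if v != 0 and (threshold is None or v > threshold)
        ]
        # Select the top_n largest values (largest first).
        top = sorted(candidates, key=lambda jv: jv[1], reverse=True)[:top_n]
        if not sort:
            # Without sorting, entries keep their original column order.
            top.sort(key=lambda jv: jv[0])
        result.append(top)
    return result


A = [[1, 0, 2], [0, 3, 0]]
B = [[1, 2, 0, 4], [0, 1, 5, 0], [3, 0, 1, 1]]
print(matmul_topn_reference(A, B, top_n=2))
print(matmul_topn_reference(A, B, top_n=2, sort=True))
# With top_n equal to the number of columns of B, nothing is filtered out.
print(matmul_topn_reference(A, B, top_n=len(B[0])))
```

The sketch materialises every dense product row, which is exactly the cost that integrating the top-n selection into the multiplication avoids.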
import scipy.sparse as sparse
from sparse_dot_topn import sp_matmul, sp_matmul_topn
A = sparse.random(1000, 100, density=0.1, format="csr")
B = sparse.random(100, 2000, density=0.1, format="csr")
# Compute C and retain the top 10 values per row
C = sp_matmul_topn(A, B, top_n=10)
# or parallelised matrix multiplication without top-n selection
C = sp_matmul(A, B, n_threads=2)
# or with top-n selection
C = sp_matmul_topn(A, B, top_n=10, n_threads=2)
# If you are only interested in values above a certain threshold
C = sp_matmul_topn(A, B, top_n=10, threshold=0.8)
# If you set the threshold we cannot easily determine the number of non-zero
# entries beforehand. Therefore, we allocate memory for `ceil(top_n * A.shape[0] * density)`
# non-zero entries. You can set the expected density to reduce the number of pre-allocated
# entries. Note that if we allocate too little, one or more expensive copies will need to happen.
C = sp_matmul_topn(A, B, top_n=10, threshold=0.8, density=0.1)
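As a rough illustration of the pre-allocation formula quoted in the comments above, the sketch below evaluates ceil(top_n * A.shape[0] * density) for a few inputs. The helper name preallocated_entries is hypothetical. It only reproduces the arithmetic and says nothing about how the library allocates or grows its buffers.

```python
import math


def preallocated_entries(n_rows, top_n, density):
    """Number of non-zero entries reserved up front: ceil(top_n * n_rows * density)."""
    return math.ceil(top_n * n_rows * density)


n_rows, top_n = 1000, 10
# density=1.0 reserves room for top_n values in every row (the upper bound).
print(preallocated_entries(n_rows, top_n, density=1.0))
# A lower expected density reserves proportionally less memory.
print(preallocated_entries(n_rows, top_n, density=0.25))
```

If the threshold turns out to let through more values than the density estimate allowed for, the buffer has to grow, which is the expensive copy mentioned above.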
sparse_dot_topn provides wheels for CPython 3.8 to 3.12 for:
pip install sparse_dot_topn
sparse_dot_topn relies on a C++ extension for the computationally intensive multiplication routine.
Note that the wheels vendor (ship) OpenMP with the extension to provide parallelisation out-of-the-box.
This may cause issues when used in combination with other libraries that ship OpenMP, like PyTorch.
If you run into any issues with OpenMP, see INSTALLATION.md for help or run the function without specifying the n_threads argument.
Installing from source requires a C++17 compatible compiler. If you have a compiler available, it is advised to install from source rather than from the wheel, as this enables architecture-specific optimisations.
You can install from source using:
pip install sparse_dot_topn --no-binary sparse_dot_topn
sparse_dot_topn provides some configuration options when building from source. Building from source can enable architecture specific optimisations and is recommended for those that have a C++ compiler installed. See INSTALLATION.md for details.
The top-n multiplication of two large O(10M+) sparse matrices can be broken down into smaller chunks. For example, one may want to split sparse matrices into matrices with just 1M rows, and do the (top-n) multiplication of all those matrix pairs. Reasons to do this are to reduce the memory footprint of each pair, and to employ available distributed computing power.
The pairs can be distributed and calculated over a cluster (e.g. we use a Spark cluster). The resulting matrix products are then zipped and stacked in order to reproduce the full matrix product.
Here's an example of how to do this, where we are matching 1000 rows in sparse matrix A against 600 rows in sparse matrix B, and both A and B are split into chunks.
import numpy as np
import scipy.sparse as sparse
from sparse_dot_topn import sp_matmul_topn, zip_sp_matmul_topn
# 1a. Example matching 1000 rows in sparse matrix A against 600 rows in sparse matrix B.
rng = np.random.default_rng(42)  # seeded generator (seed chosen arbitrarily) for reproducibility
A = sparse.random(1000, 2000, density=0.1, format="csr", dtype=np.float32, random_state=rng)
B = sparse.random(600, 2000, density=0.1, format="csr", dtype=np.float32, random_state=rng)
# 1b. Reference full matrix product with top-n
C_ref = sp_matmul_topn(A, B.T, top_n=10, threshold=0.01, sort=True)
# 2a. Split the sparse matrices. Here A is split into five parts, and B into three parts.
As = [A[i*200:(i+1)*200] for i in range(5)]
Bs = [B[:100], B[100:300], B[300:]]
# 2b. Perform the top-n multiplication of all sub-matrix pairs, here in a double loop.
# E.g. all sub-matrix pairs could be distributed over a cluster and multiplied there.
Cs = [[sp_matmul_topn(Aj, Bi.T, top_n=10, threshold=0.01, sort=True) for Bi in Bs] for Aj in As]
# 2c. top-n zipping of the C-matrices, done over the index of the B sub-matrices.
Czip = [zip_sp_matmul_topn(top_n=10, C_mats=Cis) for Cis in Cs]
# 2d. stacking over zipped C-matrices, done over the index of the A sub-matrices
# The resulting matrix C equals C_ref.
C = sparse.vstack(Czip, dtype=np.float32)
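Why zipping works can be sketched in pure Python: any value that belongs to the top-n of a full row must also be in the top-n of its own chunk, so merging the per-chunk candidates and selecting the top-n again loses nothing. The function zip_topn and its list-based data layout are hypothetical stand-ins for illustration. They are not the library's zip_sp_matmul_topn, which operates on CSR matrices.

```python
import heapq


def zip_topn(top_n, chunk_results):
    """Merge per-chunk top-n rows into the top-n rows of the full product.

    chunk_results: list of (n_cols, rows) pairs, one per chunk of B, where
    rows[i] is a list of (local_column, value) pairs for row i.
    Returns, per row, (global_column, value) pairs sorted by value, largest first.
    """
    n_rows = len(chunk_results[0][1])
    merged = [[] for _ in range(n_rows)]
    offset = 0
    for n_cols, rows in chunk_results:
        for i, row in enumerate(rows):
            # Shift local column indices to their position in the full matrix.
            merged[i].extend((col + offset, val) for col, val in row)
        offset += n_cols
    # Re-select the top_n over the merged candidates of each row.
    return [heapq.nlargest(top_n, row, key=lambda cv: cv[1]) for row in merged]


# One full product row [5, 1, 4, 3, 9, 2], computed in two chunks of
# 2 and 4 columns, each already reduced to its own top-2 values.
chunk_a = (2, [[(0, 5), (1, 1)]])
chunk_b = (4, [[(2, 9), (0, 4)]])
print(zip_topn(2, [chunk_a, chunk_b]))
```

The column offset is what lets each chunk be computed independently, for instance on different workers, without knowing where it sits in the full matrix.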
sparse_dot_topn v1 is a significant change from v0.* with new bindings and a new API.
The new version adds support for CPython 3.12 and now supports both ints and floats.
Internally we switched to a max-heap to collect the top-n values, which significantly reduces the memory footprint.
The former implementation had O(n_columns) complexity for the top-n selection, where we now have O(top-n) complexity.
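The idea of heap-based selection can be sketched with Python's heapq module. This is only an illustration of the general technique, using a bounded min-heap (the only kind heapq offers), and not the library's C++ max-heap. The function name top_n_of_row is hypothetical.

```python
import heapq


def top_n_of_row(values, top_n):
    """Return the top_n largest (value, column) pairs of a row, largest first.

    Only top_n candidates are held at any time, so the extra memory is
    proportional to top_n rather than to the number of columns.
    """
    if top_n <= 0:
        return []
    heap = []  # min-heap: heap[0] is the smallest value currently retained
    for col, val in enumerate(values):
        if len(heap) < top_n:
            heapq.heappush(heap, (val, col))
        elif val > heap[0][0]:
            # The new value beats the weakest retained one: replace it.
            heapq.heapreplace(heap, (val, col))
    return sorted(heap, reverse=True)


print(top_n_of_row([0.3, 0.9, 0.1, 0.7, 0.5], top_n=2))
```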
awesome_cossim_topn has been deprecated and will be removed in a future version.
Users should switch to sp_matmul_topn, which is largely compatible.
For example:
C = awesome_cossim_topn(A, B, ntop=10)
can be replicated using:
C = sp_matmul_topn(A, B, top_n=10, threshold=0.0, sort=True)
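When migrating many call sites, the keyword changes described in this migration guide could be captured in a small translator. The sketch below is hypothetical: the function translate_v0_kwargs is not part of the library, and the v0 default values it uses (lower_bound=0.0, use_threads=False, n_jobs=1) are assumptions to check against the version you are migrating from.

```python
def translate_v0_kwargs(ntop, lower_bound=0.0, use_threads=False, n_jobs=1):
    """Map v0 awesome_cossim_topn keywords to sp_matmul_topn keywords.

    The v0 defaults used here are assumptions, not taken from the library.
    """
    kwargs = {
        "top_n": ntop,             # ntop was renamed
        "threshold": lower_bound,  # lower_bound was renamed
        "sort": True,              # v0 returned sorted results; v1 does not by default
    }
    if use_threads:
        # use_threads and n_jobs were combined into n_threads.
        kwargs["n_threads"] = n_jobs
    return kwargs


print(translate_v0_kwargs(10))
print(translate_v0_kwargs(10, lower_bound=0.8, use_threads=True, n_jobs=4))
```

The resulting dictionary can then be unpacked into the new call, for example sp_matmul_topn(A, B, **translate_v0_kwargs(10)).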
ntop has been renamed to top_n
lower_bound has been renamed to threshold
use_threads and n_jobs have been combined into n_threads
The return_best_ntop option has been removed
The test_nnz_max option has been removed
B is auto-transposed when its shape is not compatible but its transpose is.
The output of return_best_ntop can be replicated with:
C = sp_matmul_topn(A, B, top_n=10)
best_ntop = np.diff(C.indptr).max()
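The snippet above relies on the CSR row pointer: consecutive entries of indptr mark where each row's stored values start and end, so their differences are the per-row counts of retained values. A minimal pure-Python equivalent, with a hypothetical helper name and a made-up indptr, could look like this:

```python
def max_nnz_per_row(indptr):
    """Largest number of stored entries in any row, given a CSR row pointer."""
    return max((end - start for start, end in zip(indptr, indptr[1:])), default=0)


# Row pointer of a hypothetical 4-row result whose rows hold 2, 0, 3 and 1 entries.
indptr = [0, 2, 2, 5, 6]
print(max_nnz_per_row(indptr))
```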
threshold is no longer 0.0 but disabled by default
This enables proper functioning for matrices that contain negative values.
Additionally, a different data structure is used internally when collecting non-zero results, which has a much lower memory footprint than before.
This means that the effect of the threshold parameter on performance and memory requirements is negligible.
If the threshold is None, we pre-compute the number of non-zero entries; this can significantly reduce the required memory at a mild (~10%) performance penalty.
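The effect of the changed threshold default on negative values can be illustrated with a toy selection over one row of products. This is a sketch with a hypothetical helper, assuming the threshold is a strict "greater than" filter. It does not call the library.

```python
def top_n_values(values, top_n, threshold=None):
    """Top_n largest non-zero values of a row, optionally filtered by a threshold."""
    kept = [v for v in values if v != 0 and (threshold is None or v > threshold)]
    return sorted(kept, reverse=True)[:top_n]


row = [-0.5, 0.2, -0.9]
# With a 0.0 threshold (the old default) the negative products are discarded.
print(top_n_values(row, top_n=2, threshold=0.0))
# With the threshold disabled, negative values can be part of the top-n.
print(top_n_values(row, top_n=2))
```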
sort = False, the result matrix is no longer sorted by default
The matrix is returned with the same column order as if no filtering of the top-n results had taken place.
This means that when you set top_n equal to the number of columns of B you obtain the same result as normal multiplication, i.e. sp_matmul_topn(A, B, top_n=B.shape[1]) is equal to A.dot(B).
Contributions are very welcome, please see CONTRIBUTING for details.
This package was developed and is maintained by authors (previously) affiliated with ING Analytics Wholesale Banking Advanced Analytics. The original implementation was based on a modified version of SciPy's CSR multiplication implementation. You can read about it in a blog (mirror) written by Zhe Sun.