.. -*- mode: rst -*-

.. image:: doc/hdbscan_logo.png
  :width: 600
  :alt: HDBSCAN logo
  :align: center

The ``fast_hdbscan`` library provides a simple implementation of the HDBSCAN clustering algorithm designed specifically
for high performance on multicore machines with low-dimensional data (2D to about 20D). The algorithm runs in parallel
and can make effective use of as many cores as you wish to throw at a problem. It is thus ideal for large SMP systems,
and even modern multicore laptops.

This library provides a re-implementation of a subset of the HDBSCAN algorithm that is compatible with the
`hdbscan <https://github.com/scikit-learn-contrib/hdbscan>`_ library for data that is Euclidean and
low dimensional. The primary advantages of this library over the standard ``hdbscan`` library are:

This library does not support all the features and input formats available in the hdbscan library.

The ``fast_hdbscan`` library follows the ``hdbscan`` library in using the sklearn API. You can use the
``fast_hdbscan`` class ``HDBSCAN`` exactly as you would that of the ``hdbscan`` library, with the caveat that
``fast_hdbscan`` supports only a subset of the parameters and options of ``hdbscan``. Nonetheless, if you have
low-dimensional Euclidean data (e.g. the output of UMAP), you can use this library as a straightforward drop-in
replacement for ``hdbscan``:

.. code:: python

    import fast_hdbscan
    from sklearn.datasets import make_blobs

    data, _ = make_blobs(1000)

    clusterer = fast_hdbscan.HDBSCAN(min_cluster_size=10)
    cluster_labels = clusterer.fit_predict(data)
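
Once you have cluster labels, a quick summary is often useful. The following is a minimal sketch, not part of
``fast_hdbscan``: ``summarize_labels`` is a hypothetical helper, and it assumes that ``fast_hdbscan`` follows the
``hdbscan`` convention of marking noise points with the label ``-1``. The labels here are hand-written stand-ins
for the output of ``fit_predict``.

.. code:: python

    from collections import Counter


    def summarize_labels(labels):
        """Return (n_clusters, n_noise, sizes) for a sequence of cluster labels.

        Assumption: the label -1 marks noise, as in the hdbscan library.
        """
        counts = Counter(int(label) for label in labels)
        # Remove the noise bucket so that only real clusters remain.
        n_noise = counts.pop(-1, 0)
        sizes = dict(sorted(counts.items()))
        return len(sizes), n_noise, sizes


    # Hand-written labels standing in for clusterer.fit_predict(data)
    example_labels = [0, 0, 1, -1, 1, 1, -1, 2]
    n_clusters, n_noise, sizes = summarize_labels(example_labels)
    print(n_clusters, n_noise, sizes)

The helper accepts any iterable of integer-like labels, so the array returned by ``fit_predict`` can be passed
in directly.
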
fast_hdbscan requires:

``fast_hdbscan`` can be installed via pip:

.. code:: bash

    pip install fast_hdbscan

To manually install this package:

.. code:: bash

    wget https://github.com/TutteInstitute/fast_hdbscan/archive/main.zip
    unzip main.zip
    rm main.zip
    cd fast_hdbscan-main
    python setup.py install

The algorithm used here is an adaptation of the algorithms described in the papers:

    McInnes L, Healy J. *Accelerated Hierarchical Density Based Clustering*
    In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42.
    2017 `[pdf] <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8215642>`_

    R. Campello, D. Moulavi, and J. Sander, *Density-Based Clustering Based on
    Hierarchical Density Estimates*
    In: Advances in Knowledge Discovery and Data Mining, Springer, pp 160-172.
    2013

fast_hdbscan is BSD (2-clause) licensed. See the LICENSE file for details.

Contributions are more than welcome! If you have ideas for features or projects, please get in touch. Everything from
code to notebooks to examples and documentation is equally valuable, so please don't feel you can't contribute.
To contribute, please `fork the project <https://github.com/TutteInstitute/fast_hdbscan/issues#fork-destination-box>`_,
make your changes, and submit a pull request. We will do our best to work through any issues with you and get your
code merged in.