pyckmeans
pyckmeans is a Python package for Consensus K-Means and Weighted Ensemble Consensus of Random (WECR) K-Means clustering, especially in the context of DNA sequence data. To evaluate the quality of clusterings, pyckmeans implements several internal validation metrics.
In addition to the clustering functionality, it provides tools for working with DNA sequence data such as reading and writing of DNA alignment files, calculating genetic distances, and Principle Coordinate Analysis (PCoA) for dimensionality reduction.
Consensus K-Means
Consensus K-Means is an unsupervised ensemble clustering algorithm, combining multiple K-Means clusterings, where each K-Means is trained on a subset of the data (random subset) and a subset of the the features (random subspace). The predicted cluster memberships of the single clusterings are combined to a consensus (or co-association) matrix, determining the number of times each pair of samples was clustered together over all clusterings. This matrix can be interpreted as similarity matrix and can be used to resolve the final consensus clustering by subjecting it to a last clustering step, e.g. hierarchical, or spectral clustering.
WECR K-Means
Weighted Ensemble Consensus of Random (WECR) K-Means is a semi-supervised ensemble clustering algorithm. Similar to consensus K-Means, it is based on a collection of K-Means clusterings, which are each trained on a random subset of data and a random subspace of features. In addition, for each single clustering the number of clusters k is also randomized. This library of clusterings is subjected to weighting function that integrates user-supplied must-link and must-not-link constraints, as well as an internal cluster validation criterion. The constraints represent the semi-supervised component of WECR K-Means: the user can provide prior knowledge considering the composition of the clusters. Must-link and must-not-link constraints imply that a pair of samples (observations, data points) is expected to be found in the same or different clusters, respectively. Based on the clusterings and the calculated weights, a weighted consensus (co-association) matrix is constructed, which is subjected to Cluster-based Similariry Partitioning (CSPA; e.g. hierarchical clustering) or spectral clustering to resolve the consensus clustering.
Documentation
See pyckmeans' RTD Documentation for details.
Installation
pyckmeans can be installed using pip, Conda, or from source.
pip
pip install pyckmeans
Conda
conda install pyckmeans -c TankredO
From Source
The installation from source requires git
and a c++ compiler.
git clone https://github.com/TankredO/pyckmeans
cd pyckmeans
pip install .
Usage
Examples using the Python API:
Consensus K-Means: Clustering a Data Matrix (Single K)
from pyckmeans import CKmeans
import sklearn.datasets
x, _ = sklearn.datasets.make_blobs(n_samples=50, n_features=2, centers=3, random_state=75)
ckm = CKmeans(k=3, n_rep=100, p_samp=0.8, p_feat=0.5)
ckm.fit(x)
ckm_res = ckm.predict(x)
fig = ckm_res.plot(figsize=(7,7))
ckm_res.cmatrix
print('Bayesian Information Criterion:', ckm_res.bic)
print('Davies-Bouldin Index:', ckm_res.db)
print('Silhouette Score:', ckm_res.sil)
print('Calinski-Harabasz Index:', ckm_res.ch)
print('Cluster Membership:', ckm_res.cl)
Bayesian Information Criterion: 50.21824821939818
Davies-Bouldin Index: 0.2893792767901513
Silhouette Score: 0.7827738719266039
Calinski-Harabasz Index: 630.8235586596012
Cluster Membership: [0 2 1 0 2 2 1 0 2 1 0 0 2 0 2 2 1 1 1 1 0 1 2 2 2 2 1 0 2 2 1 0 1 1 0 0 0
1 0 1 2 1 2 2 1 0 0 0 0 1]
Consensus K-Means: Clustering a Data Matrix (Multi K)
The MultiCKmeans
class allows to train multiple CKmeans
objects a once.
This is, for example, useful for exploring clustering for different values of k.
from pyckmeans import MultiCKMeans
import sklearn.datasets
x, _ = sklearn.datasets.make_blobs(n_samples=50, n_features=10, centers=3, random_state=44)
mckm = MultiCKMeans(k=[2, 3, 4, 5], n_rep=100, p_samp=0.8, p_feat=0.5)
mckm.fit(x)
mckm_res = mckm.predict(x)
print('Metrics:')
print(mckm_res.metrics)
mckm_res.plot_metrics(figsize=(10,5))
ckm_res_k3 = mckm_res.ckmeans_results[1]
Metrics:
k sil bic db ch
0 2 0.574369 225.092100 0.646401 59.733498
1 3 0.788207 126.358519 0.302979 387.409107
2 4 0.563343 126.979355 1.214520 271.019424
3 5 0.339466 128.061382 1.698652 211.080143
Consensus K-Means: Clustering Sequence Data
from pyckmeans import MultiCKMeans, NucleotideAlignment, pcoa
from IPython.display import display
import numpy
numpy.random.seed(0)
aln = NucleotideAlignment.from_file('datasets/rhodanthemum_ct85_msl68.snps.phy')
print('Nucleotide alignment:', aln)
dst = aln.distance(distance_type='k2p')
pcoa_res = pcoa(dst, correction='lingoes')
print('Eigenvalues:')
display(pcoa_res.values)
vectors = pcoa_res.get_vectors(
filter_by='eigvals_rel_corrected_cum',
filter_th=0.8,
out_format='pandas'
)
mckm = MultiCKMeans(
k=range(2, 20),
n_rep=50,
p_samp=0.8,
p_feat=0.8
)
mckm.fit(vectors)
mckm_res = mckm.predict(vectors)
mckm_res.plot_metrics(figsize=(12, 7))
ckm_res_k7 = mckm_res.ckmeans_results[5]
fig = ckm_res_k7.plot(figsize=(14,14))
Nucleotide alignment: <NucleotideAlignment; #samples: 108, #sites: 6752>
Eigenvalues:
| eigvals | eigvals_rel | eigvals_rel_cum | eigvals_rel_corrected | eigvals_rel_corrected_cum |
---|
0 | 0.115972 | 0.471458 | 0.233986 | 0.233986 | 0.233986 |
---|
1 | 0.039585 | 0.160924 | 0.317016 | 0.083030 | 0.317016 |
---|
2 | 0.035079 | 0.142604 | 0.391140 | 0.074125 | 0.391140 |
---|
3 | 0.017383 | 0.070665 | 0.430295 | 0.039154 | 0.430295 |
---|
4 | 0.009831 | 0.039965 | 0.454525 | 0.024230 | 0.454525 |
---|
... | ... | ... | ... | ... | ... |
---|
103 | -0.001325 | -0.005388 | 0.998575 | 0.001457 | 0.998575 |
---|
104 | -0.001693 | -0.006881 | 0.999654 | 0.001079 | 0.999654 |
---|
105 | -0.001884 | -0.007660 | 1.000000 | 0.000346 | 1.000000 |
---|
106 | -0.002255 | -0.009168 | 1.000000 | 0.000000 | 1.000000 |
---|
107 | -0.002430 | -0.009880 | 1.000000 | 0.000000 | 1.000000 |
---|
108 rows × 5 columns
WECR K-Means: Clustering Sequence Data
from pyckmeans import WECR, NucleotideAlignment, pcoa
aln = NucleotideAlignment.from_file('datasets/rhodanthemum_ct85_msl68.snps.phy')
dst = aln.distance(distance_type='k2p')
pcoa_res = pcoa(dst, correction='lingoes')
vectors = pcoa_res.get_vectors(
filter_by='eigvals_rel_corrected_cum',
filter_th=0.8,
out_format='pandas'
)
wecr = WECR(
k=range(2, 20),
n_rep=1000,
p_samp=0.6,
p_feat=0.6,
)
wecr.fit(vectors)
wecr_res = wecr.predict(vectors)
wecr_res.plot_metrics(figsize=(12, 7))
wecr_res.plot(k=6, figsize=(14,14))
cluster_membership = wecr_res.get_cl(k=6, with_names=True)
print('cluster_membership:')
print(cluster_membership)
cluster_membership:
PP-R002-01 0
PP-R002-01-dupl 0
PP-R017-04 4
PP-R017-04-dupl 4
PP-R019-01 5
..
R044-02 3
R044-12 3
R045-02 0
R045-06 0
R045-25 0
Length: 108, dtype: int32