# clustertk

A comprehensive Python toolkit for cluster analysis with full pipeline support.

ClusterTK provides a complete, sklearn-style pipeline for clustering: from raw data preprocessing to cluster interpretation and export. Perfect for data analysts who want powerful clustering without writing hundreds of lines of code.
```shell
# Core functionality
pip install clustertk

# With visualization support
pip install "clustertk[viz]"
```
```python
import pandas as pd
from clustertk import ClusterAnalysisPipeline

# Load data
df = pd.read_csv('your_data.csv')

# Create and fit the pipeline
pipeline = ClusterAnalysisPipeline(
    dim_reduction='auto',        # Smart selection (PCA/UMAP/None based on algorithm)
    handle_missing='median',
    correlation_threshold=0.85,
    n_clusters=None,             # Auto-detect optimal number
    verbose=True
)
pipeline.fit(df, feature_columns=['feature1', 'feature2', 'feature3'])

# Get results
labels = pipeline.labels_
profiles = pipeline.cluster_profiles_
metrics = pipeline.metrics_
print(f"Found {pipeline.n_clusters_} clusters")
print(f"Silhouette score: {metrics['silhouette']:.3f}")

# Export
pipeline.export_results('results.csv')
pipeline.export_report('report.html')

# Visualize (requires clustertk[viz])
pipeline.plot_clusters_2d()
pipeline.plot_cluster_heatmap()
```
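The `correlation_threshold` parameter above implies that highly correlated feature pairs are pruned before clustering. A minimal stdlib sketch of how such pruning can work (illustrative logic only, not ClusterTK's actual implementation):

```python
import statistics

def pearson(a, b):
    """Pearson correlation between two equal-length numeric sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def drop_correlated(columns, threshold=0.85):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name, col in columns.items():
        if all(abs(pearson(col, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

cols = {'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]}
print(drop_correlated(cols))  # 'b' is a scaled copy of 'a', so it is dropped
```

Greedy first-come-first-kept pruning like this is the usual approach; which member of a correlated pair survives depends on column order.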
Raw Data → Preprocessing → Feature Selection → Dimensionality Reduction
→ Clustering → Evaluation → Interpretation → Export
Each step is configurable through pipeline parameters or can be run independently.
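The staged flow above can be sketched as a chain of callables, each consuming the previous stage's output (a design sketch, not ClusterTK's internals):

```python
def run_pipeline(stages, data):
    """Run each named stage in order, feeding each one's output to the next."""
    for name, stage in stages:
        data = stage(data)
    return data

# Toy stages standing in for preprocessing -> scaling
stages = [
    ("preprocess", lambda xs: [x for x in xs if x is not None]),  # drop missing
    ("scale",      lambda xs: [x / max(xs) for x in xs]),         # normalize to max
]
print(run_pipeline(stages, [2, None, 4]))  # -> [0.5, 1.0]
```

Structuring stages this way is what lets any single step be swapped out or run on its own.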
| Algorithm | Number of features | Auto-selected reduction |
|---|---|---|
| K-Means / GMM | <50 | None |
| K-Means / GMM | ≥50 | PCA |
| HDBSCAN / DBSCAN | <30 | None |
| HDBSCAN / DBSCAN | ≥30 | UMAP |
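The table's rules can be expressed as a small helper (a sketch of the documented decision logic; the library's actual selection code may differ):

```python
def select_dim_reduction(algorithm, n_features):
    """Pick a reduction method per the documented auto-selection rules."""
    if algorithm in ("kmeans", "gmm"):
        return "pca" if n_features >= 50 else None
    if algorithm in ("hdbscan", "dbscan"):
        return "umap" if n_features >= 30 else None
    return None

print(select_dim_reduction("kmeans", 80))   # -> pca
print(select_dim_reduction("hdbscan", 20))  # -> None
```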
```python
# Well suited to high-dimensional density-based clustering
pipeline = ClusterAnalysisPipeline(
    dim_reduction='umap',          # Preserves local density
    umap_n_components=10,          # NOT 2! For clustering, not visualization
    clustering_algorithm='hdbscan'
)
pipeline.fit(high_dim_data)

# UMAP preserves density, so HDBSCAN finds real clusters
print(f"Found {pipeline.n_clusters_} clusters")
print(f"Noise ratio: {pipeline.cluster_profiles_.noise_ratio_:.1%}")
```
Why UMAP for HDBSCAN? UMAP preserves local density structure, which is exactly what density-based algorithms like HDBSCAN rely on.

Important: use `n_components=10-20` for clustering, not the 2-3 components used for visualization; 2-3 dimensions discard too much structure to cluster on.
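Reducing to a moderate number of dimensions matters because of distance concentration: in high dimensions, pairwise distances between points become nearly uniform, starving density-based methods of contrast. A quick stdlib demonstration of the effect:

```python
import math
import random
import statistics

def distance_contrast(dim, n=150, seed=0):
    """Coefficient of variation of pairwise distances between uniform
    random points; lower means distances carry less information."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return statistics.stdev(dists) / statistics.fmean(dists)

print(f"dim=2:   contrast {distance_contrast(2):.2f}")
print(f"dim=100: contrast {distance_contrast(100):.2f}")
# The contrast shrinks as dimension grows, so a 10-20 dimensional
# embedding keeps enough contrast while taming the curse of dimensionality.
```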
```python
# Problem: you have 30 features, but not all are useful for clustering.
# More features ≠ better clustering (curse of dimensionality).

# Step 1: Fit on all features
pipeline = ClusterAnalysisPipeline(dim_reduction='pca')
pipeline.fit(df)  # 30 features -> Silhouette: 0.42

# Step 2: Find which features matter most
importance = pipeline.get_pca_feature_importance()
print(importance.head(10))  # Top 10 features by PCA loadings

# Step 3: Try refitting with the top 10 features
comparison = pipeline.refit_with_top_features(
    n_features=10,
    importance_method='permutation',  # Best for clustering quality
    compare_metrics=True,
    update_pipeline=False             # Just compare, don't update yet
)

# Step 4: If metrics improved, update the pipeline
if comparison['metrics_improved']:
    print(f"Improvement: {comparison['weighted_improvement']:+.1%}")
    pipeline.refit_with_top_features(n_features=10, update_pipeline=True)
    # New silhouette: 0.58 (+38% improvement!)
```
Why feature selection? More features do not automatically mean better clusters: irrelevant features dilute the distances clustering relies on (the curse of dimensionality), so a well-chosen subset often scores higher.

Three importance methods:

- `'permutation'` - best for clustering quality (default)
- `'contribution'` - variance ratio analysis
- `'pca'` - PCA loadings (only if `dim_reduction='pca'`)

```python
# Understand which features drive your clustering
results = pipeline.analyze_feature_importance(method='all')

# View permutation importance
print(results['permutation'].head())

# View feature contribution (variance ratio)
print(results['contribution'].head())

# Use top features for a focused analysis
top_features = results['permutation'].head(5)['feature'].tolist()
```
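The idea behind permutation importance can be sketched in a few lines: shuffle one feature at a time and measure how much a clustering-quality score drops. This is an illustrative sketch with a toy score function; ClusterTK's own implementation details may differ:

```python
import random

def permutation_importance(X, labels, score_fn, seed=0):
    """Importance of feature j = score drop after shuffling column j."""
    rng = random.Random(seed)
    base = score_fn(X, labels)
    importances = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        rng.shuffle(col)
        shuffled = [row[:j] + [col[i]] + row[j + 1:] for i, row in enumerate(X)]
        importances.append(base - score_fn(shuffled, labels))
    return importances

# Toy score: distance between the two cluster means, summed over features
def separation(X, labels):
    means = {}
    for lab in set(labels):
        rows = [r for r, l in zip(X, labels) if l == lab]
        means[lab] = [sum(c) / len(rows) for c in zip(*rows)]
    a, b = means.values()
    return sum(abs(x - y) for x, y in zip(a, b))

X = [[0.0, 5.0], [0.1, 5.0], [10.0, 5.0], [10.1, 5.0]]
imp = permutation_importance(X, [0, 0, 1, 1], separation)
# Feature 0 separates the two clusters; feature 1 is constant,
# so shuffling it changes nothing and its importance is exactly 0.
```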
```python
# Compare multiple algorithms automatically
results = pipeline.compare_algorithms(
    X=df,
    feature_columns=['feature1', 'feature2', 'feature3'],
    algorithms=['kmeans', 'gmm', 'hierarchical', 'dbscan'],
    n_clusters_range=(2, 8)
)

print(results['comparison'])  # DataFrame with per-algorithm metrics
print(f"Best algorithm: {results['best_algorithm']}")

# Visualize the comparison
pipeline.plot_algorithm_comparison(results)
```
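Conceptually, such a comparison fits each candidate, scores its labels with the same metric, and keeps the winner. A minimal generic sketch with toy fitters (not ClusterTK's API):

```python
def pick_best(fitters, X, score_fn):
    """Fit every candidate on X, score its labels, return all scores and the winner."""
    scores = {name: score_fn(X, fit(X)) for name, fit in fitters.items()}
    return scores, max(scores, key=scores.get)

# Toy candidates: threshold splits, scored by how balanced the clusters are
fitters = {
    "split_at_5": lambda xs: [int(x > 5) for x in xs],
    "split_at_9": lambda xs: [int(x > 9) for x in xs],
}
balance = lambda xs, labels: -abs(labels.count(0) - labels.count(1))
scores, best = pick_best(fitters, [1, 2, 8, 9], balance)
print(best)  # -> split_at_5
```

In practice the score would be a real clustering metric such as the silhouette score, but the selection loop is the same.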
```python
# Customer segmentation with auto-detected cluster count
pipeline = ClusterAnalysisPipeline(
    n_clusters=None,          # Auto-detect
    auto_name_clusters=True
)
pipeline.fit(
    customers_df,
    feature_columns=['age', 'income', 'purchases'],
    category_mapping={
        'demographics': ['age', 'income'],
        'behavior': ['purchases']
    }
)
pipeline.export_report('customer_segments.html')
```
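The `category_mapping` argument suggests features are grouped into categories for interpretation. A sketch of what per-category cluster profiling can look like (hypothetical aggregation, not necessarily ClusterTK's output format):

```python
from statistics import fmean

def category_profiles(rows, labels, category_mapping):
    """Average each feature per cluster, grouped by feature category."""
    profiles = {}
    for lab in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == lab]
        profiles[lab] = {
            cat: {f: fmean(m[f] for m in members) for f in feats}
            for cat, feats in category_mapping.items()
        }
    return profiles

rows = [{"age": 25, "income": 40, "purchases": 3},
        {"age": 27, "income": 44, "purchases": 5},
        {"age": 60, "income": 90, "purchases": 1}]
mapping = {"demographics": ["age", "income"], "behavior": ["purchases"]}
print(category_profiles(rows, [0, 0, 1], mapping))
```

Grouping features this way keeps reports readable when there are many features per cluster.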
```python
# Anomaly detection: DBSCAN labels outliers as noise (-1)
pipeline = ClusterAnalysisPipeline(
    clustering_algorithm='dbscan'
)
pipeline.fit(transactions_df)
anomalies = transactions_df[pipeline.labels_ == -1]
```
More examples: docs/examples.md
Optional visualization dependencies are installed with `pip install clustertk[viz]`.

Contributions are welcome!
MIT License - see LICENSE file for details.
If you use ClusterTK in your research, please cite:
```bibtex
@software{clustertk2024,
  author = {Veselov, Aleksey},
  title  = {ClusterTK: A Comprehensive Python Toolkit for Cluster Analysis},
  year   = {2024},
  url    = {https://github.com/alexeiveselov92/clustertk}
}
```
Made with ❤️ for the data science community