Clust-learn

A Python package for extracting information from large and high-dimensional mixed-type data through explainable cluster analysis.


[Figure: clust-learn visualizations]


Table of contents

  1. Introduction
  2. Overall architecture
  3. Implementation
  4. Installation
  5. Version and license information
  6. Bug reports and future work
  7. User guide & API
    1. Data preprocessing
      1. Data imputation
      2. Outliers
    2. Dimensionality reduction
    3. Clustering
    4. Classifier
  8. Citing

1. Introduction

clust-learn enables users to run end-to-end explainable cluster analysis to extract information from large, high-dimensional, mixed-type data. It provides a framework that guides the user through data preprocessing, dimensionality reduction, clustering, and classification of the obtained clusters. It is designed to require very few lines of code, with a strong focus on explainability.
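
As a rough sketch of what that looks like end to end, the snippet below chains the modules using the classes and functions documented in the user guide (section 7). The import paths, column names, and return-value assumptions are illustrative only and are flagged in the comments.

import pandas as pd

# NOTE: the module paths below are assumptions for illustration;
# check the project documentation for the actual package layout.
from clearn.dimensionality_reduction import DimensionalityReduction
from clearn.clustering import Clustering
from clearn.classifier import Classifier

df = pd.read_csv("data.csv")      # hypothetical mixed-type dataset
num_vars = ["age", "income"]      # hypothetical numerical columns
cat_vars = ["region", "segment"]  # hypothetical categorical columns

# Dimensionality reduction (data preprocessing omitted here; see section 7.i)
dr = DimensionalityReduction(df, num_vars=num_vars, cat_vars=cat_vars)
components = dr.transform()  # assumed to return the extracted components

# Clustering on the extracted components (already comparable, so normalize=False)
cl = Clustering(components, algorithms='kmeans', normalize=False)
clusters = cl.compute_clusters(max_clusters=10)  # assumed to return cluster labels

# Explaining the clusters with a supervised classifier
clf = Classifier(df, predictor_cols=num_vars + cat_vars, target=clusters,
                 num_cols=num_vars, cat_cols=cat_vars)
clf.train_model()
clf.plot_shap_importances()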

2. Overall architecture

clust-learn is organized into four modules, one for each component of the methodological framework presented here.

Figure 1 shows the package layout with the functionalities covered by each module, along with the techniques used, the explainability strategies available, and the main functions and class methods encapsulating them.


[Figure 1: clust-learn package structure]


3. Implementation

The package is implemented with Python 3.9 using open source libraries. It relies heavily on pandas and scikit-learn. Read the complete list of requirements here.

It can be installed manually or from pip/PyPI (see Section 4. Installation).

4. Installation

The package is on PyPI. Simply run:

pip install clust-learn

5. Version and license information

The version documented here is 0.2.7, as published on PyPI.

6. Bug reports and future work

Please report bugs and feature requests by creating a new issue here.

7. User guide & API

clust-learn is organized into four modules:

  1. Data preprocessing
  2. Dimensionality reduction
  3. Clustering
  4. Classifier

Figure 1 (Section 2) shows the functionalities covered by each module, the techniques used, the explainability strategies available, and the main functions and class methods encapsulating them.

The four modules are designed to be used sequentially to ensure robust and explainable results. However, each of them is independent and can be used separately to suit different use cases.

7.i. Data preprocessing

Data preprocessing consists of a set of manipulation and transformation tasks performed on raw data before it is analyzed. Data quality is essential for obtaining robust and reliable results, yet real-world data is often incomplete, noisy, or inconsistent. Data preprocessing is therefore a crucial step in any analytical study.

7.i.a. Data imputation
compute_missing()
compute_missing(df, normalize=True)

Calculates the percentage or count of missing values per column.

Parameters

  • df : pandas.DataFrame
    • DataFrame containing the data.
  • normalize : boolean, default=True
    • If True, percentages are returned; otherwise, counts.

Returns

  • missing_df : pandas.DataFrame
    • DataFrame with the percentage or count of missing values per column.
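
A minimal usage sketch (df is assumed to be a pandas.DataFrame loaded elsewhere):

# Percentage of missing values per column; set normalize=False for raw counts
missing_df = compute_missing(df, normalize=True)
print(missing_df)
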
missing_values_heatmap()
missing_values_heatmap(df, output_path=None, savefig_kws=None)

Plots a heatmap to visualize missing values (shown in a light color).

Parameters

  • df : pandas.DataFrame
    • DataFrame containing the data.
  • output_path : str, default=None
    • Path to save figure as image.
  • savefig_kws : dict, default=None
    • Save figure options.
impute_missing_values()
impute_missing_values(df, num_vars, cat_vars, num_pair_kws=None, mixed_pair_kws=None, cat_pair_kws=None, graph_thres=0.05, k=8, max_missing_thres=0.33)

This function imputes missing values following these steps (a conceptual sketch of step 2 follows the list):

  1. One-to-one model based imputation for strongly related variables.
  2. Cluster based hot deck imputation where clusters are obtained as the connected components of an undirected graph G=(V,E), where V is the set of variables and E the pairs of variables with mutual information above a predefined threshold.
  3. Records with a proportion of missing values above a predefined threshold are discarded to ensure the quality of the hot deck imputation.
  4. Hot deck imputation for the remaining missing values considering all variables together.
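
To make step 2 concrete, here is a small self-contained sketch of how variable clusters can be obtained as the connected components of a mutual-information graph. This is a conceptual illustration, not the package's internal implementation, and it assumes discrete-valued columns (numerical variables would need to be discretized first):

import networkx as nx
import pandas as pd
from sklearn.metrics import normalized_mutual_info_score

def variable_clusters(df: pd.DataFrame, thres: float = 0.05):
    """Connected components of G=(V,E), where V is the set of variables and
    E contains the pairs whose mutual information score exceeds thres."""
    cols = list(df.columns)
    graph = nx.Graph()
    graph.add_nodes_from(cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            pair = df[[a, b]].dropna()  # mutual information needs complete pairs
            if normalized_mutual_info_score(pair[a], pair[b]) > thres:
                graph.add_edge(a, b)
    return list(nx.connected_components(graph))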

Parameters

  • df : pandas.DataFrame
    • Data frame containing the data with potential missing values.
  • num_vars : str, list, pandas.Series, or numpy.array
    • Numerical variable name(s).
  • cat_vars : str, list, pandas.Series, or numpy.array
    • Categorical variable name(s).
  • {num,mixed,cat}_pair_kws : dict, default=None
    • Additional keyword arguments to pass to compute imputation pairs for one-to-one model based imputation, namely:
      • For numerical pairs, corr_thres and method for setting the correlation coefficient threshold and method. By default, corr_thres=0.7 and method='pearson'.
      • For mixed-type pairs, np2_thres for setting a threshold on partial eta squared, with 0.14 as the default value.
      • For categorical pairs, mi_thres for setting a threshold on mutual information score. By default, mi_thres=0.6.
  • graph_thres : float, default=0.05
    • Threshold to determine if two variables are similar based on mutual information score, and therefore are an edge of the graph from which variable clusters are derived.
  • k : int, default=8
    • Number of neighbors to consider in hot deck imputation.
  • max_missing_thres: float, default=0.33
    • Maximum proportion of missing values per observation allowed before the final general hot deck imputation (see step 3 above).

Returns

  • final_pairs : pandas.DataFrame
    • DataFrame with pairs of highly correlated variables (var1: variable with values to impute; var2: variable to be used as independent variable for model-based imputation), together with the proportion of missing values in var1 and var2.
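
A usage sketch with the numerical pairing threshold tightened; column names are hypothetical, and note that (as documented above) the return value describes the one-to-one imputation pairs rather than the imputed data itself:

final_pairs = impute_missing_values(
    df,
    num_vars=["age", "income"],
    cat_vars=["region", "segment"],
    num_pair_kws={"corr_thres": 0.8, "method": "spearman"},  # stricter 1-to-1 pairing
    k=8,
    max_missing_thres=0.33,
)
print(final_pairs)  # variable pairs used for model-based imputation
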
plot_imputation_distribution_assessment()
plot_imputation_distribution_assessment(df_prior, df_posterior, imputed_vars, sample_frac=1.0, prior_kws=None, posterior_kws=None, output_path=None, savefig_kws=None)

Plots, for each imputed variable, a comparison of its distribution before and after imputation.

Parameters

  • df_prior : pandas.DataFrame
    • DataFrame containing the data before imputation.
  • df_posterior : pandas.DataFrame
    • DataFrame containing the data after imputation.
  • imputed_vars : list
    • List of variables with imputed values.
  • sample_frac : float, default=1.0
    • If < 1, only a random sample of this fraction of the data is plotted.
  • {prior,posterior}_kws : dict, default=None
    • Additional keyword arguments to pass to the kdeplot.
  • output_path : str, default=None
    • Path to save figure as image.
  • savefig_kws : dict, default=None
    • Save figure options.
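
For example, keeping a snapshot of the data taken before imputation enables a before/after comparison (this assumes df holds the imputed values after running impute_missing_values as in the previous sketch):

df_prior = df.copy()  # snapshot taken before running the imputation
# ... run impute_missing_values(df, ...) as in the previous sketch ...
plot_imputation_distribution_assessment(
    df_prior, df, imputed_vars=["age", "income"],
    output_path="imputation_assessment.png",
)
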
7.i.b. Outliers
remove_outliers()
remove_outliers(df, variables, iforest_kws=None)

Removes outliers using the Isolation Forest algorithm.

Parameters

  • df : pandas.DataFrame
    • DataFrame containing the data.
  • variables : list
    • Variables with potential outliers.
  • iforest_kws : dict, default=None
    • IsolationForest algorithm hyperparameters.

Returns

  • df_inliers : pandas.DataFrame
    • DataFrame with inliers (i.e. observations that are not outliers).
  • df_outliers : pandas.DataFrame
    • DataFrame with outliers.
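
A usage sketch; contamination and random_state are standard scikit-learn IsolationForest hyperparameters passed through iforest_kws:

df_inliers, df_outliers = remove_outliers(
    df,
    variables=["age", "income"],  # hypothetical columns with potential outliers
    iforest_kws={"contamination": 0.01, "random_state": 0},
)
print(f"Removed {len(df_outliers)} outliers out of {len(df)} observations")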

7.ii. Dimensionality reduction

All the functionality of this module is encapsulated in the DimensionalityReduction class so that the original data, the instances of the models used, and any other relevant information are self-maintained and always accessible.

DimensionalityReduction class
dr = DimensionalityReduction(df, num_vars=None, cat_vars=None, num_algorithm='pca', cat_algorithm='mca', num_kwargs=None, cat_kwargs=None)

Parameters

  • df : pandas.DataFrame
    • Data table containing the data with the original variables.
  • num_vars : string, list, pandas.Series, or numpy.array
    • Numerical variable name(s).
  • cat_vars : string, list, pandas.Series, or numpy.array
    • Categorical variable name(s).
  • num_algorithm : string
    • Algorithm to be used for dimensionality reduction of numerical variables. By default, PCA is used. The current version also supports SPCA.
  • cat_algorithm : string
    • Algorithm to be used for dimensionality reduction of categorical variables. By default, MCA is used. The current version doesn't support other algorithms.
  • num_kwargs : dictionary
    • Additional keyword arguments to pass to the model used for numerical variables.
  • cat_kwargs : dictionary
    • Additional keyword arguments to pass to the model used for categorical variables.

Attributes

  • n_components_ : int
    • Final number of extracted components.
  • min_explained_variance_ratio_ : float
    • Minimum explained variance ratio. By default, 0.5.
  • num_trans_ : pandas.DataFrame
    • Extracted components from numerical variables.
  • cat_trans_ : pandas.DataFrame
    • Extracted components from categorical variables.
  • num_components_ : list
    • List of names assigned to the extracted components from numerical variables.
  • cat_components_ : list
    • List of names assigned to the extracted components from categorical variables.
  • pca_ : sklearn.decomposition.PCA
    • PCA instance used to speed up some computations and for comparison purposes.

Methods
transform()

transform(self, n_components=None, min_explained_variance_ratio=0.5)

Transforms a DataFrame df to a lower dimensional space.

num_main_contributors()

num_main_contributors(self, thres=0.5, n_contributors=None, dim_idx=None, component_description=None, col_description=None, output_path=None)

Computes the original numerical variables with the strongest relation to the derived variable(s), measured by the Pearson correlation coefficient.

cat_main_contributors()

cat_main_contributors(self, thres=0.14, n_contributors=None, dim_idx=None, component_description=None, col_description=None, output_path=None)

Computes the original categorical variables with the strongest relation to the derived variable(s), measured by the correlation ratio.

cat_main_contributors_stats()

cat_main_contributors_stats(self, thres=0.14, n_contributors=None, dim_idx=None, output_path=None)

Computes, for every value of each categorical variable, the mean and standard deviation of the derived variables that are strongly related to that categorical variable (based on the correlation ratio).

plot_num_explained_variance()

plot_num_explained_variance(self, thres=0.5, plots='all', output_path=None, savefig_kws=None)

Plots the explained variance (ratio, cumulative, and/or normalized) for numerical variables.

plot_cat_explained_variance()

plot_cat_explained_variance(self, thres=0.5, plots='all', output_path=None, savefig_kws=None)

Plots the explained variance (ratio, cumulative, and/or normalized) for categorical variables.

plot_num_main_contributors()

plot_num_main_contributors(self, thres=0.5, n_contributors=5, dim_idx=None, output_path=None, savefig_kws=None)

Plots the main contributors (original variables with the strongest relation to the derived variables) for every derived variable.

plot_cat_main_contributor_distribution()

plot_cat_main_contributor_distribution(self, thres=0.14, n_contributors=None, dim_idx=None, output_path=None, savefig_kws=None)

Plots the distribution of the main categorical contributors to every derived variable.
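
A usage sketch tying these methods together; transform() is assumed to return the table of extracted components:

dr = DimensionalityReduction(df, num_vars=num_vars, cat_vars=cat_vars)
components = dr.transform(min_explained_variance_ratio=0.5)
print(dr.n_components_)  # final number of extracted components

# Explainability: relate the components back to the original variables
dr.plot_num_explained_variance(thres=0.5, plots='all')
num_contrib = dr.num_main_contributors(thres=0.5)
cat_contrib = dr.cat_main_contributors(thres=0.14)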

7.iii. Clustering

The Clustering class encapsulates all the functionality of this module and stores the data, the instances of the algorithms used, and other relevant information so it is always accessible.

Clustering class
cl = Clustering(df, algorithms='kmeans', normalize=False)

Parameters

  • df : pandas.DataFrame
    • Data frame containing the data to be clustered.
  • algorithms : instance or list of instances
    • Algorithm instances to be used for clustering. They must implement the fit and set_params methods.
  • normalize : bool
    • Whether to apply data normalization for fair comparisons between variables. If dimensionality reduction is applied beforehand, normalization should not be applied.

Attributes

  • dimensions_ : list
    • List of columns of the input data frame.
  • instances_ : dict
    • Pairs of algorithm name and its instance.
  • metric_ : string
    • The cluster validation metric used. Four metrics available: ['inertia', 'davies_bouldin_score', 'silhouette_score', 'calinski_harabasz_score'].
  • optimal_config_ : tuple
    • Tuple with the optimal configuration for clustering, containing the algorithm name, number of clusters, and value of the chosen validation metric.
  • scores_ : dict
    • Pairs of algorithm name and a list of values of the chosen validation metric over a range of numbers of clusters.

Methods
compute_clusters()

compute_clusters(self, n_clusters=None, metric='inertia', max_clusters=10, prefix=None, weights=None)

Calculates clusters. If more than one algorithm is passed to the class constructor, the optimal number of clusters is first computed for each algorithm based on the metric passed to the method, and the algorithm with the best performance at its optimal number of clusters is then selected. The result therefore shows the clusters calculated with the best-performing algorithm under these criteria.

describe_clusters()

describe_clusters(self, df_ext=None, variables=None, cluster_filter=None, statistics=['mean', 'median', 'std'], output_path=None)

Describes clusters based on internal or external continuous variables. For categorical variables use describe_clusters_cat().

describe_clusters_cat()

describe_clusters_cat(self, cat_array, cat_name, order=None, normalize=False, use_weights=False, output_path=None)

Describes clusters based on external categorical variables. The result is a contingency table. For continuous variables use describe_clusters().

compare_cluster_means_to_global_means()

compare_cluster_means_to_global_means(self, df_original=None, output_path=None)

Computes, for every cluster and every internal variable, the relative difference between the intra-cluster mean and the global mean.

anova_tests()

anova_tests(self, df_test=None, vars_test=None, cluster_filter=None, output_path=None)

Runs ANOVA tests for a given set of continuous variables (internal or external) to test dependency with clusters.

chi2_test()

chi2_test(self, cat_array)

Runs Chi-squared tests for a given categorical variable to test dependency with clusters.

plot_score_comparison()

plot_score_comparison(self, output_path=None, savefig_kws=None)

Plots the comparison in performance between the different clustering algorithms.

plot_optimal_components_normalized()

plot_optimal_components_normalized(self, output_path=None, savefig_kws=None)

Plots the normalized curve used for computing the optimal number of clusters.

plot_clustercount()

plot_clustercount(self, use_weights=False, output_path=None, savefig_kws=None)

Plots a bar plot with cluster counts.

plot_cluster_means_to_global_means_comparison()

plot_cluster_means_to_global_means_comparison(self, use_weights=False, df_original=None, xlabel=None, ylabel=None, levels=[-0.50, -0.32, -0.17, -0.05, 0.05, 0.17, 0.32, 0.50], output_path=None, savefig_kws=None)

Plots, for every cluster and every internal variable, the relative difference between the intra-cluster mean and the global mean.

plot_distribution_comparison_by_cluster()

plot_distribution_comparison_by_cluster(self, df_ext=None, xlabel=None, ylabel=None, output_path=None, savefig_kws=None)

Plots violin plots per cluster for the continuous variables of interest, to understand differences in their distributions across clusters.

plot_clusters_2D()

plot_clusters_2D(self, coor1, coor2, use_weights=False, style_kwargs=dict(), output_path=None, savefig_kws=None)

Plots two 2D plots:

  • A scatter plot styled by the categorical variable hue.
  • A 2D plot comparing cluster centroids and optionally the density area.

plot_cat_distribution_by_cluster()

plot_cat_distribution_by_cluster(self, cat_array, cat_label, order=None, cluster_label=None, use_weights=False, output_path=None, savefig_kws=None)

Plots the relative contingency table of the clusters with a categorical variable as a stacked bar plot.
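
A usage sketch; per the constructor table above, algorithm instances must implement fit and set_params, which all scikit-learn clustering estimators do:

from sklearn.cluster import AgglomerativeClustering, KMeans

cl = Clustering(components, algorithms=[KMeans(), AgglomerativeClustering()],
                normalize=False)  # components are already comparable
cl.compute_clusters(metric='silhouette_score', max_clusters=10)
print(cl.optimal_config_)  # (algorithm name, number of clusters, metric value)

cl.describe_clusters(statistics=['mean', 'std'])
cl.plot_clustercount()
cl.plot_cluster_means_to_global_means_comparison()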

7.iv. Classifier

The functionality of this module is encapsulated in the Classifier class, which is also responsible for storing the original data, the instances of the models used, and any other relevant information.

Classifier class
classifier = Classifier(df, predictor_cols, target, num_cols=None, cat_cols=None)

Parameters

  • df : pandas.DataFrame
    • Data frame containing the data.
  • predictor_cols : list of string
    • List of columns to use as predictors.
  • target : numpy.array or list
    • Values of the target variable.
  • num_cols : list
    • List of numerical columns from predictor_cols.
  • cat_cols : list
    • List of categorical columns from predictor_cols.

Attributes

  • filtered_features_ : list
    • List of columns of the input data frame.
  • labels_ : list
    • List of class labels.
  • model_ : instance of TransformerMixin and BaseEstimator from sklearn.base
    • Trained classifier.
  • X_train_ : numpy.array
    • Train split of predictors.
  • X_test_ : numpy.array
    • Test split of predictors.
  • y_train_ : numpy.array
    • Train split of target.
  • y_test_ : numpy.array
    • Test split of target.
  • grid_result_ : sklearn.model_selection.GridSearchCV
    • Instance of the fitted estimator used for hyperparameter tuning.

Methods
train_model()

train_model(self, model=None, feature_selection=True, features_to_keep=[], feature_selection_model=None, hyperparameter_tuning=False, param_grid=None, train_size=0.8, balance_classes=False)

This method trains a classification model.

By default, it uses XGBoost, but any other scikit-learn-compatible estimator can be used.

The building process consists of three main steps:

  • Feature Selection (optional)

First removes highly correlated features, using a classification model and SHAP values to determine which ones to keep, and then applies Recursive Feature Elimination with Cross-Validation (RFECV) to the remaining features.

  • Hyperparameter tuning (optional)

Runs grid search with cross-validation for hyperparameter tuning. Note the parameter grid must be passed.

  • Model training

Trains a classification model with the selected features and hyperparameters. By default, an XGBoost classifier will be trained.

Note both hyperparameter tuning and model training are run on a train set. Train-test split is performed using sklearn.model_selection.train_test_split.
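
A usage sketch; the parameter grid uses standard XGBoost hyperparameter names and is a hypothetical choice for illustration:

clf = Classifier(df, predictor_cols=num_vars + cat_vars, target=clusters,
                 num_cols=num_vars, cat_cols=cat_vars)

param_grid = {"max_depth": [3, 5], "n_estimators": [100, 300]}  # hypothetical grid
clf.train_model(
    feature_selection=True,
    hyperparameter_tuning=True,
    param_grid=param_grid,  # required when hyperparameter_tuning=True
    train_size=0.8,
    balance_classes=True,
)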

hyperparameter_tuning_metrics()

hyperparameter_tuning_metrics(self, output_path=None)

This method returns the mean and standard deviation of the cross-validation scores for every hyperparameter combination tried during hyperparameter tuning.

confusion_matrix()

confusion_matrix(self, test=True, sum_stats=True, output_path=None)

This method returns the confusion matrix of the classification model.

classification_report()

classification_report(self, test=True, output_path=None)

This method returns the sklearn.metrics.classification_report in pandas.DataFrame format.

This report contains the intra-class metrics precision, recall and F1-score, together with the global accuracy, and macro average and weighted average of the three intra-class metrics.

plot_shap_importances()

plot_shap_importances(self, n_top=7, output_path=None, savefig_kws=None)

Plots SHAP feature importances, calculated as the combined average of the absolute SHAP values across all classes.

plot_shap_importances_beeswarm()

plot_shap_importances_beeswarm(self, class_id, class_name=None, n_top=10, output_path=None, savefig_kws=None)

Plots a summary of SHAP values for a specific class of the target variable, using a SHAP beeswarm plot.

plot_confusion_matrix()

plot_confusion_matrix(self, test=True, sum_stats=True, output_path=None, savefig_kws=None)

Plots the confusion matrix of the classification model as a seaborn heatmap.

plot_roc_curves()

plot_roc_curves(self, test=True, labels=None, output_path=None, savefig_kws=None)

Plots the ROC curve for every class.
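
Once trained, the model can be assessed on the test split with the evaluation methods documented above, e.g.:

cm = clf.confusion_matrix(test=True, sum_stats=True)
report = clf.classification_report(test=True)  # per-class metrics as a DataFrame
clf.plot_confusion_matrix(test=True)
clf.plot_roc_curves(test=True)
clf.plot_shap_importances(n_top=7)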

8. Citing

Alvarez-Garcia, M., Ibar-Alonso, R., & Arenas-Parra, M. (2024). A comprehensive framework for explainable cluster analysis. Information Sciences, 663, 120282. https://doi.org/10.1016/j.ins.2024.120282
