anticor-features

Anti-correlation based feature selection for single cell (and other) omics datasets

PyPI version: 0.2.7

Features

  • Unsupervised feature selection based on gene-gene anti-correlations.
  • Automatically filters out genes in mitochondrial, ribosomal, and other pathways (customizable).
  • Scales to large datasets using HDF5-backed intermediate files.
  • Integrated Python API and command-line interface.
  • Passes null-dataset tests for robust selection.
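As a rough intuition for what "anti-correlation based" selection means, the toy sketch below flags features that are strongly negatively correlated with at least one other feature. It is not the package's algorithm: the Pearson correlation, the fixed `cutoff`, and the function names are assumptions made purely for illustration, whereas the package describes FPR/FDR-controlled statistics.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    if denom == 0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denom


def toy_anticor_select(exprs, feature_ids, cutoff=-0.5):
    """Return features with at least one pairwise correlation below `cutoff`.

    `cutoff` is a hypothetical fixed threshold used only in this toy example.
    """
    selected = set()
    for i, j in combinations(range(len(feature_ids)), 2):
        if pearson(exprs[i], exprs[j]) < cutoff:
            selected.update((feature_ids[i], feature_ids[j]))
    # Preserve the original feature order in the output
    return [f for f in feature_ids if f in selected]


# Genes in rows, cells in columns
toy_exprs = [
    [5, 4, 6, 0, 0, 1],  # geneA: high in the first three cells
    [0, 1, 0, 5, 6, 4],  # geneB: high in the last three cells
    [3, 3, 3, 3, 3, 3],  # geneC: constant, carries no information
]
toy_ids = ["geneA", "geneB", "geneC"]
toy_selected = toy_anticor_select(toy_exprs, toy_ids)
print(toy_selected)  # prints ['geneA', 'geneB']
```

The two mutually exclusive genes mark different cell populations, which is why they are kept, while the constant gene is dropped.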

Installation

Requires Python 3.6 or higher.

Install from PyPI:

pip install anticor_features

Or install from source:

git clone https://github.com/scottyler89/anticor_fs.git
cd anticor_fs
pip install .

Dependencies

  • h5py
  • numpy
  • pandas
  • scipy
  • seaborn
  • matplotlib
  • numba
  • ray
  • gprofiler-official (>=0.3.5)
  • psutil

Quickstart Python API

from anticor_features.anticor_features import get_anti_cor_genes

# exprs: array-like or HDF5 dataset with genes in rows and cells in columns
# feature_ids: list of gene IDs matching rows of exprs
# species: g:Profiler species code (e.g., "hsapiens" or "mmusculus")
anti_cor_table = get_anti_cor_genes(exprs, feature_ids, species="hsapiens")

# Filter selected genes
selected = anti_cor_table.loc[anti_cor_table["selected"], "gene"].tolist()
print(selected)

See the g:Profiler organism list for valid species codes: https://biit.cs.ut.ee/gprofiler/page/organism-list
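Because the call expects genes in rows and cells in columns, data stored the other way round needs transposing first. The helper below is a minimal sketch of that preparation step using plain Python lists; `to_genes_by_cells` is a hypothetical name and is not part of the package.

```python
def to_genes_by_cells(cells_by_genes, gene_ids):
    """Transpose a cells x genes table to genes x cells and check the IDs line up."""
    genes_by_cells = [list(column) for column in zip(*cells_by_genes)]
    if len(genes_by_cells) != len(gene_ids):
        raise ValueError(
            "expected %d gene IDs, got %d" % (len(genes_by_cells), len(gene_ids))
        )
    return genes_by_cells


# Two cells (rows) measured over three genes (columns)
cells_by_genes = [
    [1, 0, 3],
    [0, 2, 0],
]
feature_ids = ["geneA", "geneB", "geneC"]

exprs = to_genes_by_cells(cells_by_genes, feature_ids)
print(exprs)  # prints [[1, 0], [0, 2], [3, 0]]
```

In practice you would likely convert `exprs` to a NumPy array or an HDF5 dataset before passing it on, since the comment above describes the input as array-like.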

Customization

  • pre_remove_features: list of gene IDs to exclude before analysis.
  • pre_remove_pathways: list of GO term codes whose genes will be removed.
  • min_express_n: minimum number of cells a gene must be expressed in to be considered (set to -1 to disable filtering, e.g., for non-expression or non-single-cell data).
  • scratch_dir: directory for temporary HDF5 files (default: system temp directory).
  • bin_size: number of features per batch when computing correlation matrix.
  • FPR and FDR: false positive rate and false discovery rate for negative correlations.
  • num_pos_cor: minimum number of positive correlations to select a feature.
  • offline_mode: when True, disallow network calls (requires a local ID bank for default pathway removal).
  • id_bank_dir: directory containing precomputed ID banks (defaults to the packaged/shipped bank; override with ANTICOR_FEATURES_ID_BANK_DIR).
  • use_live_pathway_lookup: when True, force live GO-term resolution (g:Profiler) instead of using the shipped/local ID bank.
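The options above can be collected in one place and passed as keyword arguments. The sketch below assumes each listed name is accepted by get_anti_cor_genes as a keyword, as the list implies. Every value is an illustrative placeholder rather than a package default, and `build_selection_options` and `run_selection` are hypothetical helper names.

```python
import tempfile


def build_selection_options(scratch_dir=None):
    """Collect illustrative (non-default) settings for a feature-selection run."""
    return {
        "species": "hsapiens",
        "pre_remove_features": ["MALAT1"],  # hypothetical gene to exclude
        # GO:0005739 = mitochondrion, GO:0005840 = ribosome (example terms only)
        "pre_remove_pathways": ["GO:0005739", "GO:0005840"],
        "min_express_n": 50,   # placeholder value
        "scratch_dir": scratch_dir or tempfile.gettempdir(),
        "bin_size": 5000,      # placeholder value
        "FPR": 0.001,          # placeholder value
        "FDR": 0.05,           # placeholder value
        "num_pos_cor": 10,     # placeholder value
    }


def run_selection(exprs, feature_ids, **options):
    """Call the package with the given options (imported lazily)."""
    # Lazy import: the options can be built and inspected without the package installed
    from anticor_features.anticor_features import get_anti_cor_genes
    return get_anti_cor_genes(exprs, feature_ids, **options)


options = build_selection_options()
print(sorted(options))
# With real data: anti_cor_table = run_selection(exprs, feature_ids, **options)
```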

Offline / HPC usage (no g:Profiler dependency)

anticor_features uses the packaged/shipped ID bank by default for the default pathway removal (no g:Profiler needed).

To guarantee fully offline runs with no fallback network calls, first generate a local ID bank in an environment that has network access:

python3 scripts/build_id_bank.py --species hsapiens --provider ncbi

Then run feature selection with offline_mode=True (point to your custom bank via ANTICOR_FEATURES_ID_BANK_DIR or id_bank_dir=).
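A minimal sketch of the offline setup in Python follows. The bank path is a placeholder, and `configure_offline_bank` and `run_offline` are hypothetical helper names. The sketch sets both the environment variable and the id_bank_dir argument for illustration, although the text above suggests either one is enough.

```python
import os


def configure_offline_bank(bank_dir):
    """Point the package at a local ID bank and return offline keyword arguments."""
    os.environ["ANTICOR_FEATURES_ID_BANK_DIR"] = bank_dir
    return {"offline_mode": True, "id_bank_dir": bank_dir}


def run_offline(exprs, feature_ids, bank_dir, species="hsapiens"):
    """Run feature selection without network access (package imported lazily)."""
    from anticor_features.anticor_features import get_anti_cor_genes
    offline_kwargs = configure_offline_bank(bank_dir)
    return get_anti_cor_genes(exprs, feature_ids, species=species, **offline_kwargs)


# "/path/to/id_bank" is a placeholder; use the directory your bank was written to
offline_options = configure_offline_bank("/path/to/id_bank")
print(offline_options)
```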

Using with Non-Expression or Other Omics Data

For datasets that are not single-cell or gene-expression matrices (e.g., bulk omics, proteomics, metabolomics, or other feature embeddings), you can skip the minimum-expression filter and run only the anti-correlation statistics by setting min_express_n=-1. For example:

anti_cor_df = get_anti_cor_genes(
    embed_df,
    feature_ids=embed_df.index.tolist(),
    pre_remove_features=[],
    pre_remove_pathways=[],
    min_express_n=-1
)

The minimum-expression requirement is meaningful only for count-based single-cell data. With min_express_n=-1 it is disabled, so every feature enters the statistical analysis.

Scanpy Integration

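AnnData stores cells in rows and genes in columns, so when using Scanpy, transpose the data matrix before passing it in:

import pandas as pd
from anticor_features.anticor_features import get_anti_cor_genes

# Transpose so that genes are in rows and cells are in columns
anti_cor_table = get_anti_cor_genes(
    adata.X.T,
    adata.var.index.tolist(),
    species="hsapiens"
)

# Attach the selection statistics to the gene annotations
adata.var = pd.concat([adata.var, anti_cor_table], axis=1)
selected = anti_cor_table.loc[anti_cor_table["selected"], "gene"].tolist()

# Keep the full matrix in .raw, then subset to the selected genes
adata.raw = adata
adata = adata[:, selected]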

Command-Line Interface

python3 -m anticor_features.anticor_features \
  -i exprs.tsv \
  -species mmusculus \
  -out_file anti_cor_features.tsv \
  -scratch_dir /path/to/tmp \
  -use_default_pathway_removal

Options:

  • -i, --infile: input expression matrix (TSV or HDF5).
  • -species: g:Profiler species code (default: "hsapiens").
  • -out_file: output file path for the results table.
  • -hdf5: treat input as HDF5 with dataset key "infile".
  • -ids: file with feature (gene) IDs (no header) for HDF5 input.
  • -cols: file with sample (cell) IDs (with header) for HDF5 input.
  • -scratch_dir: directory for temporary files.
  • -use_default_pathway_removal: remove default mitochondrial, ribosomal, and related pathways.
  • -h, --help: display full help message.
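When the CLI is driven from a pipeline script, it can help to assemble the argument list programmatically. The sketch below uses only the flags listed above; `build_cli_command` is a hypothetical helper, and the file paths are the placeholders from the example command.

```python
import shlex
import sys


def build_cli_command(infile, out_file, species="hsapiens",
                      scratch_dir=None, default_pathway_removal=True):
    """Assemble the argument list for the anticor_features command-line interface."""
    cmd = [
        sys.executable, "-m", "anticor_features.anticor_features",
        "-i", infile,
        "-species", species,
        "-out_file", out_file,
    ]
    if scratch_dir:
        cmd += ["-scratch_dir", scratch_dir]
    if default_pathway_removal:
        cmd.append("-use_default_pathway_removal")
    return cmd


cmd = build_cli_command(
    "exprs.tsv",
    "anti_cor_features.tsv",
    species="mmusculus",
    scratch_dir="/path/to/tmp",
)
# Show the command as a shell-safe string
print(" ".join(shlex.quote(part) for part in cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with paths that contain spaces.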

Performance

Compute time scales with the number of features and the batch size (bin_size). Selecting anti-correlated features from ~10k genes and ~3k cells typically takes 1–2 minutes, including network time for g:Profiler. Larger datasets take longer.

License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

Contact

Scott Tyler scottyler89+bitbucket@gmail.com
