ncbi-counts

Download the NCBI-generated RNA-seq count data by specifying the Series accession number(s), and the regular expression of the Sample attributes.

  • 0.2.0
  • PyPI
ncbi_counts

If you only need a count matrix covering all samples (GSM) in a series (GSE), you do not need this library. It is useful when you want, for each GSE, count matrices restricted to specified control-group and treatment-group samples.

Installation

From PyPI:

pip install ncbi-counts

Usage

python -m ncbi_counts [-h] [-n NORM] [-a ANNOT_VER] [-k [KEEP_ANNOT ...]] [-s SRC_DIR] [-o OUTPUT] [-q] [-S SEP] [-y GSM_YAML] [-c] FILE

Options

positional arguments:
  FILE                  Path to input file (.yaml, .yml) which represents each GSE accession number(s) which contains a sequence of maps with two keys: 'control' and 'treatment'. Each of these maps further contains key(s) (e.g., 'title', 'characteristics_ch1').

options:
  -h, --help            show this help message and exit
  -n NORM, --norm-type NORM
                        Normalization type of counts (choices: None, fpkm, tpm, default: None)
  -a ANNOT_VER, --annot-ver ANNOT_VER
                        Annotation version of counts (default: GRCh38.p13)
  -k [KEEP_ANNOT ...], --keep-annot [KEEP_ANNOT ...]
                        Annotation column(s) to keep (choices: Symbol, Description, Synonyms, GeneType, EnsemblGeneID, Status, ChrAcc, ChrStart, ChrStop, Orientation, Length, GOFunctionID, GOProcessID, GOComponentID, GOFunction, GOProcess, GOComponent, default: None)
  -s SRC_DIR, --src-dir SRC_DIR
                        A directory to save the source obtained from NCBI (default: ./)
  -o OUTPUT, --output OUTPUT
                        A directory to save the count matrix (or matrices) (default: ./)
  -q, --silent          If True, suppress warnings (default: False)
  -S SEP, --sep SEP     Separator between group and GSM in column (default: -)
  -y GSM_YAML, --yaml GSM_YAML
                        Path to save YAML file which contains GSMs (default: None)
  -c, --cleanup         If True, remove source files (default: False)

Command-line Example

To create a mock vs. CoV2 comparison pair for each tissue in GSE164073, prepare the YAML file shown after the following note. The words beginning with "!!" are type hints and can be omitted.

[!NOTE] The acceptable options for Sample attributes (such as 'title' and 'characteristics_ch1') are listed in the Sample Attributes table and the SOFT download section of the SOFT submission instructions page. Use the values in the 'Label' column of the table as keys in the YAML file, omitting the '!Sample_' prefix.

For a comprehensive list of attributes for all samples in a series, the GEOparse library is useful:

import GEOparse
GEOparse.get_GEO("GSExxxxx").phenotype_data

The YAML file for the GSE164073 example:

GSE164073: !!seq
- control: !!map
    title: !!str Cornea
    characteristics_ch1: !!str mock
  treatment: !!map
    title: !!str Cornea
    characteristics_ch1: !!str SARS-CoV-2
- control: !!map
    title: !!str Limbus
    characteristics_ch1: !!str mock
  treatment: !!map
    title: !!str Limbus
    characteristics_ch1: !!str SARS-CoV-2
- control: !!map
    title: !!str Sclera
    characteristics_ch1: !!str mock
  treatment: !!map
    title: !!str Sclera
    characteristics_ch1: !!str SARS-CoV-2

Alternatively, to specify the GSMs directly, prepare the following YAML file:

GSE164073: !!seq
- control: !!map
    geo_accession: !!str ^GSM4996084$|^GSM4996085$|^GSM4996086$
  treatment: !!map
    geo_accession: !!str ^GSM4996087$|^GSM4996088$|^GSM4996089$
- control: !!map
    geo_accession: !!str ^GSM4996090$|^GSM4996091$|^GSM4996092$
  treatment: !!map
    geo_accession: !!str ^GSM4996093$|^GSM4996094$|^GSM4996095$
- control: !!map
    geo_accession: !!str ^GSM4996096$|^GSM4996097$|^GSM4996098$
  treatment: !!map
    geo_accession: !!str ^GSM4996099$|^GSM4996100$|^GSM4996101$
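
Both YAML styles above come down to matching regular expressions against Sample attributes. The sketch below illustrates that idea in plain Python; it is not the library's implementation. It assumes patterns are applied with `re.search` semantics and that every key in a map must match. The sample titles and characteristics are invented for illustration, and `select_samples` and `exact_accessions` are hypothetical helper names.

```python
import re

# Hypothetical Sample attribute table. The accessions come from the example
# above; the titles and characteristics are invented, not real GEO metadata.
SAMPLES = [
    {"geo_accession": "GSM4996084", "title": "Cornea_mock_1", "characteristics_ch1": "treatment: mock"},
    {"geo_accession": "GSM4996087", "title": "Cornea_CoV2_1", "characteristics_ch1": "treatment: SARS-CoV-2"},
    {"geo_accession": "GSM4996090", "title": "Limbus_mock_1", "characteristics_ch1": "treatment: mock"},
    {"geo_accession": "GSM4996093", "title": "Limbus_CoV2_1", "characteristics_ch1": "treatment: SARS-CoV-2"},
]


def select_samples(samples, criteria):
    """Return accessions of samples whose attributes match every pattern.

    Assumption: each pattern is applied with re.search, so an unanchored
    pattern matches anywhere in the attribute value.
    """
    return [
        s["geo_accession"]
        for s in samples
        if all(re.search(pattern, s.get(attr, "")) for attr, pattern in criteria.items())
    ]


def exact_accessions(gsms):
    """Build an anchored alternation such as ^GSM1$|^GSM2$ from accessions."""
    return "|".join(f"^{re.escape(g)}$" for g in gsms)


# Attribute-based selection, as in the first YAML file.
control = select_samples(SAMPLES, {"title": "Cornea", "characteristics_ch1": "mock"})
# Accession-based selection, as in the second YAML file.
treatment = select_samples(SAMPLES, {"geo_accession": exact_accessions(["GSM4996087"])})
print(control, treatment)
```

Under these assumptions, anchoring each accession with ^ and $ keeps a shorter pattern such as GSM499608 from matching several samples by prefix.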

Then run the following command (the "Symbol" column is kept in this example):

python -m ncbi_counts sample_regex.yaml -k Symbol -c

This produces the following files:

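GSE164073-1.tsv

| GeneID | Symbol | control-GSM4996084 | control-GSM4996085 | control-GSM4996086 | treatment-GSM4996088 | treatment-GSM4996087 | treatment-GSM4996089 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A1BG | 144 | 197 | 157 | 156 | 133 | 122 |
| 2 | A2M | 254 | 276 | 262 | 178 | 153 | 178 |
| 3 | A2MP1 | 1 | 0 | 2 | 0 | 0 | 0 |
| 9 | NAT1 | 97 | 133 | 103 | 83 | 93 | 88 |
| ... | ... | ... | ... | ... | ... | ... | ... |

GSE164073-2.tsv

| GeneID | Symbol | control-GSM4996092 | control-GSM4996091 | control-GSM4996090 | treatment-GSM4996095 | treatment-GSM4996094 | treatment-GSM4996093 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A1BG | 175 | 167 | 203 | 143 | 145 | 145 |
| 2 | A2M | 261 | 158 | 427 | 215 | 145 | 169 |
| 3 | A2MP1 | 0 | 0 | 0 | 0 | 0 | 2 |
| 9 | NAT1 | 122 | 100 | 133 | 90 | 78 | 80 |
| ... | ... | ... | ... | ... | ... | ... | ... |

GSE164073-3.tsv

| GeneID | Symbol | control-GSM4996098 | control-GSM4996097 | control-GSM4996096 | treatment-GSM4996099 | treatment-GSM4996100 | treatment-GSM4996101 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | A1BG | 158 | 115 | 140 | 136 | 124 | 145 |
| 2 | A2M | 3337 | 2261 | 2536 | 1524 | 1288 | 1807 |
| 3 | A2MP1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | NAT1 | 83 | 64 | 68 | 65 | 52 | 79 |
| ... | ... | ... | ... | ... | ... | ... | ... |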
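
To work with an output file downstream, the column names can be split back into group and GSM using the separator (the default "-", set by -S). The following is a minimal sketch using only the standard library, with the first rows of GSE164073-1.tsv inlined so it runs without the file. `group_columns` and `group_means` are hypothetical helper names, not part of ncbi_counts.

```python
import csv
import io

# First rows of GSE164073-1.tsv, inlined so the sketch runs without the file.
TSV = (
    "GeneID\tSymbol\tcontrol-GSM4996084\tcontrol-GSM4996085\tcontrol-GSM4996086\t"
    "treatment-GSM4996088\ttreatment-GSM4996087\ttreatment-GSM4996089\n"
    "1\tA1BG\t144\t197\t157\t156\t133\t122\n"
    "2\tA2M\t254\t276\t262\t178\t153\t178\n"
)


def group_columns(header, sep="-", annot=("GeneID", "Symbol")):
    """Map each group name to the GSMs found in the count column names."""
    groups = {}
    for col in header:
        if col in annot:
            continue  # skip the gene ID and kept annotation columns
        group, gsm = col.split(sep, 1)
        groups.setdefault(group, []).append(gsm)
    return groups


def group_means(record, groups, sep="-"):
    """Mean count per group for one gene, given a row as a column -> value dict."""
    return {
        group: sum(int(record[f"{group}{sep}{gsm}"]) for gsm in gsms) / len(gsms)
        for group, gsms in groups.items()
    }


rows = list(csv.reader(io.StringIO(TSV), delimiter="\t"))
header, data = rows[0], rows[1:]
groups = group_columns(header)

for row in data:
    record = dict(zip(header, row))
    print(record["Symbol"], group_means(record, groups))
```

To read a real file instead, replace `io.StringIO(TSV)` with an open file handle on the .tsv path.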

If you don't need the source files downloaded from NCBI, delete them, or pass -c (--cleanup) as in the command above to remove them automatically.

Example in Python

To get the output as a pandas DataFrame, please refer to the following code:

from ncbi_counts import Series

series = Series(
    "GSE164073",
    [
        {
            "control": {"title": "Cornea", "characteristics_ch1": "mock"},
            "treatment": {"title": "Cornea", "characteristics_ch1": "SARS-CoV-2"},
        },
        {
            "control": {"title": "Limbus", "characteristics_ch1": "mock"},
            "treatment": {"title": "Limbus", "characteristics_ch1": "SARS-CoV-2"},
        },
        {
            "control": {"geo_accession": "^GSM499609[6-8]$"},
            "treatment": {"geo_accession": "^GSM4996099$|^GSM4996100$|^GSM4996101$"},
        },
    ],
    keep_annot=["Symbol"],
    save_to=None,
)
series.generate_pair_matrix()
# series.cleanup()  # remove source files
series.pair_count_list[0]  # Corresponds to GSE164073-1.tsv
series.pair_count_list[1]  # Corresponds to GSE164073-2.tsv
series.pair_count_list[2]  # Corresponds to GSE164073-3.tsv
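
When every pair follows the same pattern, the list of maps passed to Series can be generated rather than typed out. This is a small sketch: `make_pairs` is a hypothetical helper, not part of ncbi_counts, and the list it returns has the same shape as the second argument in the example above.

```python
def make_pairs(tissues, control="mock", treatment="SARS-CoV-2"):
    """Build one control/treatment map per tissue, keyed by Sample attributes."""
    return [
        {
            "control": {"title": tissue, "characteristics_ch1": control},
            "treatment": {"title": tissue, "characteristics_ch1": treatment},
        }
        for tissue in tissues
    ]


pairs = make_pairs(["Cornea", "Limbus", "Sclera"])
print(pairs[0])

# The list can then be passed to Series as in the example above
# (requires ncbi_counts to be installed and network access to NCBI):
# from ncbi_counts import Series
# series = Series("GSE164073", pairs, keep_annot=["Symbol"], save_to=None)
```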

License

ncbi_counts is released under an MIT license.
