hca2scea
A tool to assist in the automatic conversion of an hca metadata spreadsheet to scea metadata MAGE-TAB files.
Installation
pip install hca2scea
Description
The tool takes as input an HCA metadata spreadsheet and converts the metadata to SCEA MAGE-TAB files which are then saved to an output directory.
Usage
To run it as a package, after installing it via pip:
$ hca2scea -h
usage: hca2scea [-h] -s SPREADSHEET -id PROJECT_UUID -study STUDY [-name {cs_name,cs_id,sp_name,sp_id,other}] -ac
ACCESSION_NUMBER -c CURATORS [CURATORS ...] -et {baseline,differential} [-facs] -f EXPERIMENTAL_FACTORS
[EXPERIMENTAL_FACTORS ...] -pd PUBLIC_RELEASE_DATE -hd HCA_UPDATE_DATE [-o OUTPUT_DIR] [-zip]
run hca -> scea tool
optional arguments:
-h, --help show this help message and exit
-s SPREADSHEET, --spreadsheet SPREADSHEET
Please provide a path to the HCA project spreadsheet.
-id PROJECT_UUID, --project_uuid PROJECT_UUID
Please provide an HCA ingest project submission id.
-study STUDY Please provide the SRA or ENA study accession.
-name {cs_name,cs_id,sp_name,sp_id,other}
Please indicate which field to use as the sample name. cs=cell suspension, sp = specimen.
-ac ACCESSION_NUMBER, --accession_number ACCESSION_NUMBER
Provide an E-HCAD accession number. Please find the next suitable accession number by checking
the google tracker sheet.
-c CURATORS [CURATORS ...], --curators CURATORS [CURATORS ...]
space separated names of curators
-et {baseline,differential}, --experiment_type {baseline,differential}
Please indicate whether this is a baseline or differential experimental design
-facs Please specify if FACS was used to isolate single cells.
-f EXPERIMENTAL_FACTORS [EXPERIMENTAL_FACTORS ...], --experimental_factors EXPERIMENTAL_FACTORS [EXPERIMENTAL_FACTORS ...]
space separated list of experimental factors
-pd PUBLIC_RELEASE_DATE, --public_release_date PUBLIC_RELEASE_DATE
Please enter the public release date in this format: YYYY-MM-DD
-hd HCA_UPDATE_DATE, --hca_update_date HCA_UPDATE_DATE
Please enter the last time the HCA prohect submission was updated in this format: YYYY-MM-DD
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
Provide full path to preferred output dir
-zip, --zip_format Please indicate whether you would like the script to output alltxt files separately or together
in 1 zip file.```
To run it as a python module:
```shell script
cd /path-to/hca2scea
python -m hca-to-scea-tools.hca_to_scea.hca2scea -h
Arguments
Argument | Argument name | Description | Required? |
---|
-s | HCA spreadsheet | Path to HCA spreadsheet (.xlsx) | yes |
-id | HCA project uuid | This is added to the 'secondary accessions' field in idf file | yes |
-c | Curator initials | HCA Curator initials. Space-separated list. | yes |
-ac | accession number | Provide an SCEA accession number (integer). | yes |
-et | Experiment type | Must be 1 of [differential,baseline] | yes |
-f | Factor value | A space-separated list of user-defined factor values e.g. age disease | yes |
-pd | Dataset publication date | provide in YYYY-MM-DD E.g. from GEO | yes |
-hd | HCA last update date | provide in YYYY-MM-DD The last time the HCA project was updated in ingest UI (production) | yes |
-study | study accession (SRPxxx) | The study accession will be used to find the paths to the fastq files for the given runs | yes |
-name | HCA name field | Which HCA field to use for the biomaterial names columns. Must be 1 of | no |
| | [cs_name, cs_id, sp_name, sp_id, other] where cs indicates cell suspension and sp indicates | |
| | specimen from organism. Default is cs_name. | |
-facs | optional argument | If FACS was used as a single cell isolation method, indicate this by adding the -facs argument. | no |
-o | optional argument | An output dir path can optionally be provided. If it does not exist, it will be created. | no |
-zip | optional argument | Indicate if you would like the resulting output files to be output in a single zip file. | no |
Definitions
Factor values
A factor value is a chosen experimental characteristic which can be used to group or differentiate samples. Multiple factor values can be entered and should be chosen from the following list.
- Known disease(s)
- Development stage
- Organ
- Organ part
- Selected cell type(s)
- Individual
There must be at least 1 factor value. If you cannot identify a factor value i.e. all donors and samples share the same metadata with respect to the above list of factor values, then enter 'Individual'.
Datasets with more than 1 technolgoy type are not eligible for SCEA. Therefore, technology type is not a valid factor value.
Experiment type
An experiment with samples which can be grouped or differentiatied by a factor value is classified as 'differential'. The list of possible factor values can be found above.
If 1 or more factor values other than 'Individual' is identified, then the experiment type should be 'Differential'. If the only factor value is 'Individual', then the experiment type should be 'Baseline'.
Related E-HCAD-ID
If the project has been split into two separate E-HCAD datasets, due to different technologies being used in the same project, or any other reason, then enter the E-HCAD-ID for the other dataset here. Example: E-HCAD-50.
Examples
Required arguments only
python3 hca2scea.py -s /home/aday/GSE111976-endometrium_MC_SCEA.xlsx -id 379ed69e-be05-48bc-af5e-a7fc589709bf -study SRP135922 -ac 50 -c AD -et differential -f menstrual cycle day -pd 2021-06-29 -hd 2021-02-12
Specify optional name argument
python3 hca2scea.py -s /home/aday/GSE111976-endometrium_MC_SCEA.xlsx -id 379ed69e-be05-48bc-af5e-a7fc589709bf -study SRP135922 -name cs_name -ac 50 -c AD -et differential -f menstrual cycle day -pd 2021-06-29 -hd 2021-02-12
Specify that FACS was used
python3 hca2scea.py -s /home/aday/GSE111976-endometrium_MC_SCEA.xlsx -id 379ed69e-be05-48bc-af5e-a7fc589709bf -study SRP135922 -ac 50 -c AD -et differential -f menstrual cycle day -pd 2021-06-29 -hd 2021-02-12 -facs
Specify optional output dir
python3 hca2scea.py -s /home/aday/GSE111976-endometrium_MC_SCEA.xlsx -id 379ed69e-be05-48bc-af5e-a7fc589709bf -study SRP135922 -ac 50 -c AD -et differential -f menstrual cycle day -pd 2021-06-29 -hd 2021-02-12 -o my_output_dir
Developer Notes
Developing Code in Editable Mode
Using pip
's editable mode, projects using hca-to-scea as a dependency can refer to the latest code in this repository
directly without installing it through PyPI. This can be done either by manually cloning the code
base:
pip install -e path/to/hca2scea
or by having pip
do it automatically by providing a reference to this repository:
pip install -e \
git+https://github.com/ebi-ait/hca-to-scea-tools.git\
#egg=hca2scea
Publish to PyPI
-
Create PyPI Account through the registration page.
Take note that PyPI requires email addresses to be verified before publishing.
-
Package the project for distribution.
python setup.py sdist
-
Install Twine
pip install twine
-
Upload the distribution package to PyPI.
twine upload dist/*
Running python setup.py sdist
will create a package in the dist
directory of the project
base directory.