=======
muMerge
A tool for combining bed regions from multiple bed files in a probabilistically principled manner.
Installation
It is recommended to install mumerge using a virtual environment or package manager, e.g. venv or conda. Specifically, because bedtools must be available at the command line, we recommend you create a new environment with conda and install bedtools from bioconda, as follows:
::
(base) $ conda create -n mumerge_env python=3.9
(base) $ conda activate mumerge_env
(mumerge_env) $ conda install -c bioconda bedtools
To confirm installation, check the bedtools version:
::
(mumerge_env) $ bedtools --version
bedtools v2.30.0
Now, with bedtools available within your environment, you can install mumerge as follows.
Via pip
The simplest way of installing mumerge within your virtual environment is using pip. Check that pip is installed in your virtual environment and, if not, install it (e.g. conda install pip). Also, be sure to use the appropriate version of Python if you have multiple versions installed. mumerge can then be installed with one of the following commands.
From PyPI (recommended):
::
$ python -m pip install mumerge
If successful, mumerge should now be callable from the command line.
From GitHub:
::
$ python -m pip install git+https://github.com/Dowell-Lab/mumerge
To upgrade to the latest version of mumerge from a previous one, include --upgrade in either of the previous pip commands.
Via git clone
Alternatively, you can download mumerge and all supporting files by cloning the GitHub repository to your local machine using git:
::
$ git clone https://github.com/Dowell-Lab/mumerge.git
If you clone the repo, you may want to add the directory mumerge/mumerge to your system PATH variable (how to do this depends on your platform/OS) so that you can run mumerge directly from the command line.
Dependencies
NumPy will be installed automatically when using pip to install mumerge. However, bedtools must be installed manually and made available in your system path prior to running mumerge.
Bedtools
muMerge relies on bedtools to group together those bed regions from the input bed files that will be combined probabilistically by muMerge. This grouping is done using the bedtools merge command, which must be available at the command line.
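Because a missing bedtools executable is an easy thing to overlook, it can help to check for it before launching a run. The following is a minimal sketch, not part of muMerge itself; the helper name ``tool_available`` is hypothetical, and it only checks that an executable named bedtools is on the current PATH, not its version.

```python
import shutil


def tool_available(name="bedtools"):
    """Return True if an executable called `name` is found on the current PATH.

    `tool_available` is a hypothetical helper for illustration; it is not
    provided by muMerge.
    """
    return shutil.which(name) is not None


if not tool_available("bedtools"):
    print("bedtools not found on PATH; install it before running mumerge")
```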
Usage
For general usage, see the help menu:
::
$ mumerge -h
This will return the general commands needed to run muMerge:
::
usage: mumerge.py [-h] [-H] [-i INPUT] [-o OUTPUT] [-w WIDTH] [-m MERGED] [-r] [-v]
Merges region calls (mu) generated by Tfit, or other peak calling functions across
multiple samples and replicates.
optional arguments:
-h, --help show this help message and exit
-H, --HELP Verbose help info about the input format.
-i INPUT, --input INPUT
Input file (full path) containing bedfiles, sample ID's and
replicate grouping names (tab delimited). Each sample on separate
line. First line header, equal to '#file<TAB>sampid<TAB>group',
required. 'file' must be full path. 'sampid' can be any string.
'group' can be string or integer. See '-H' help flag for more
information.
-o OUTPUT, --output OUTPUT
Output file basename (full path, sans extension). WARNING:
will overwrite any existing file)
-w WIDTH, --width WIDTH
The ratio of a the sigma for the corresponding probabilty
distribution to the bed region (half-width) --- sigma:half-bed
(default: 1). The choice for this parameter will depend on the
data type as well as how bed regions were inferred from the
expression data.
-m MERGED, --merged MERGED
Sorted bedfile (full path) containing the regions over which
to combine the sample bedfiles. If not specified, mumerge will
generate one directly from the sample bedfiles.
-r, --remove_singletons
Remove calls not present in more than 1 sample
-v, --verbose Verbose printing during processing.
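To make the -w/--width ratio concrete, here is a small arithmetic sketch. It assumes, purely for illustration, that each bed region is summarized by a distribution centered on the region midpoint, with sigma equal to the ratio times the half-width; the function name ``region_to_mu_sigma`` is hypothetical and this is not muMerge's internal code.

```python
def region_to_mu_sigma(start, stop, width_ratio=1.0):
    """Convert a bed region into an illustrative (mu, sigma) pair.

    Assumptions (not taken from the muMerge source):
      * mu is the midpoint of the region
      * sigma = width_ratio * half-width, following the 'sigma:half-bed'
        description of the -w flag
    """
    half_width = (stop - start) / 2
    mu = start + half_width
    sigma = width_ratio * half_width
    return mu, sigma


# Default ratio of 1: sigma equals the half-width of the region.
print(region_to_mu_sigma(100, 300))        # (200.0, 100.0)
# A ratio of 0.5 halves sigma for the same region.
print(region_to_mu_sigma(100, 300, 0.5))   # (200.0, 50.0)
```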
Input file
The <INPUT> file is a tab-delimited text file that lists the paths to the BED files to be merged, along with a sample name and condition/replicate group for each sample. In the example below, there are 4 samples in two groups (control and treatment).
::
#file sampid group
/path/to/sample1.bed sample1 control
/path/to/sample2.bed sample2 control
/path/to/sample3.bed sample3 treatment
/path/to/sample4.bed sample4 treatment
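Since the fields must be separated by tabs, generating the input file from a script avoids accidental spaces. The sketch below writes the example above; the helper name ``write_mumerge_input`` is hypothetical, and the paths are the placeholder paths from the example.

```python
import io


def write_mumerge_input(handle, samples):
    """Write (bed_path, sampid, group) rows as tab-delimited lines.

    `write_mumerge_input` is a hypothetical helper, not part of muMerge.
    """
    # Required header line: '#file<TAB>sampid<TAB>group'
    handle.write("\t".join(["#file", "sampid", "group"]) + "\n")
    for bed_path, sampid, group in samples:
        handle.write("\t".join([bed_path, sampid, str(group)]) + "\n")


samples = [
    ("/path/to/sample1.bed", "sample1", "control"),
    ("/path/to/sample2.bed", "sample2", "control"),
    ("/path/to/sample3.bed", "sample3", "treatment"),
    ("/path/to/sample4.bed", "sample4", "treatment"),
]

# Written to an in-memory buffer here; pass an open file handle instead
# to create a real input file.
buffer = io.StringIO()
write_mumerge_input(buffer, samples)
print(buffer.getvalue(), end="")
```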
You can find this information using the -H flag, i.e. running mumerge -H, which will return the following:
::
INPUT FILE
----------
Input file containing bedfiles, sample ID's, and replicate groupings. Input
file (indicated by the '-i' flag) should be of the following (tab delimited)
format:
#file sampid group
/full/path/file1.bed sampid1 A
/full/path/file2.bed sampid2 B
...
Header line indicated by '#' character must be included and fields must
follow the same order as non-header lines. The order of subsequent lines does
not matter. File paths should be full paths to bed files, however you can
also specify paths that are relative to the input file location. 'group'
identifiers should group files that are technical/biological replicates.
Different experimental conditions should recieve different 'group' identifiers.
The 'group' identifier can be of type 'int' or 'str'. If 'sampid' is not
specified, then default sample ID's will be used.
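The rules in this help text can be checked before a run with a short script. The following is a rough sketch under stated assumptions: ``read_mumerge_input`` is a hypothetical helper, it requires all three fields on every line (it does not handle the case where 'sampid' is omitted), and it resolves relative paths against a directory you pass in, standing in for the input file's location.

```python
import os


def read_mumerge_input(text, input_dir="."):
    """Parse input-file text into (bed_path, sampid, group) tuples.

    Hypothetical helper, not part of muMerge. Raises ValueError if the
    header is missing or a line does not have three tab-delimited fields.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0].split("\t") != ["#file", "sampid", "group"]:
        raise ValueError("first line must be '#file<TAB>sampid<TAB>group'")

    records = []
    for ln in lines[1:]:
        fields = ln.split("\t")
        if len(fields) != 3:
            raise ValueError(f"expected 3 tab-delimited fields: {ln!r}")
        bed_path, sampid, group = fields
        # Relative paths are interpreted relative to the input file's directory.
        if not os.path.isabs(bed_path):
            bed_path = os.path.join(input_dir, bed_path)
        records.append((bed_path, sampid, group))
    return records


example = (
    "#file\tsampid\tgroup\n"
    "/full/path/file1.bed\tsampid1\tA\n"
    "file2.bed\tsampid2\tB\n"
)
for record in read_mumerge_input(example, input_dir="/full/path"):
    print(record)
```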
Output files
muMerge returns the merged regions in BED file format (project_id_MUMERGE.bed). Additionally, a log file (project_id.log) that summarizes the run is included, along with intermediate files (project_id_MISCALLS.bed and project_id_BEDTOOLS_MERGE.bed).
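The merged BED output can be loaded for downstream work with a few lines of Python. This is a minimal sketch that assumes only the standard first three BED columns (chromosome, start, stop) and ignores anything after them; ``read_bed`` is a hypothetical helper, and the two sample lines are the regions this README reports for the packaged demo.

```python
def read_bed(text):
    """Parse BED-formatted text into (chrom, start, stop) tuples.

    Hypothetical helper, not part of muMerge. Only the first three
    columns are read; any further columns are ignored.
    """
    regions = []
    for line in text.splitlines():
        # Skip blank lines and common BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()  # BED is tab-delimited; split() also tolerates spaces
        regions.append((fields[0], int(fields[1]), int(fields[2])))
    return regions


demo = "chr1\t150\t350\nchr1\t600\t900\n"
for chrom, start, stop in read_bed(demo):
    print(chrom, start, stop, "length:", stop - start)
```

To read a real output file, pass the file contents, e.g. ``read_bed(open("project_id_MUMERGE.bed").read())``.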
Demo
The additional help menu (mumerge -H) also contains information on a muMerge demo included with the package. The menu will specify where the demo files are located (install location depends on the platform) and how to run them. The demo consists of an input muMerge file which references two short bedfiles (a.bed and b.bed) that are located in the same directory. Running the demo (replace <fullpath> with the path to the input file, which depends on where you installed it):
::
$ mumerge -v -i <fullpath>/mumerge_demo.input -o ./demo_out
will return the following to stdout:
::
Generating 'bedtools merge' bedfile...
Building bed-regions dictionary...
# Sample_ID Filename
# sampA <fullpath>/a.bed
# sampB <fullpath>/b.bed
Processed 2 of 2 regions
and will produce the following files:
::
./demo_out.log
./demo_out_BEDTOOLS_MERGE.bed
./demo_out_MISCALLS.bed
./demo_out_MUMERGE.bed
If run correctly, demo_out_MUMERGE.bed should have two bed lines (chr1 150 350 and chr1 600 900), demo_out_MISCALLS.bed should be empty, and demo_out.log should contain meta information about the run.
Platforms
- Linux
- macOS
- Windows Subsystem for Linux (WSL)
Runtime
The overall run time depends on the number of input BED files and regions being merged. A test case, where 8 samples (~30,000 regions) with 6 condition groups were merged, took about 12 minutes on a MacBook Pro with a 2.3 GHz Intel Core i9 running macOS 10.14.6.
Cite
Please cite the following article if you use muMerge: `Transcription factor enrichment analysis (TFEA) quantifies the activity of multiple transcription factors from a single experiment <https://doi.org/10.1038/s42003-021-02153-7>`_
BibTeX citation:
::
@article{rubin2021transcription,
title={Transcription factor enrichment analysis (TFEA) quantifies the activity of multiple transcription factors from a single experiment},
author={Rubin, Jonathan D and Stanley, Jacob T and Sigauke, Rutendo F and Levandowski, Cecilia B and Maas, Zachary L and Westfall, Jessica and Taatjes, Dylan J and Dowell, Robin D},
journal={Communications biology},
volume={4},
number={1},
pages={1--15},
year={2021},
publisher={Nature Publishing Group}
}