Foldcomp compresses protein structures efficiently by encoding them as torsion angles. It compresses the backbone atoms to 8 bytes and the side chain to an additional 4-5 bytes per residue, so an average-sized protein of 350 residues requires ~4.2 kB. Foldcomp is a C++ library with Python bindings.
Foldcomp's compressed format stores protein structures in only 13 bytes per residue, which reduces the required storage space by an order of magnitude compared to saving 3D coordinates directly. We achieve this reduction by encoding the torsion angles of the backbone as well as the side-chain angles in a compact binary file format (FCZ).
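The per-residue figures can be turned into a rough size estimate. The sketch below is only back-of-the-envelope arithmetic: the function name `estimate_fcz_payload_bytes` is made up for illustration, and the estimate ignores any file header and the absolute atom coordinates that are saved at intervals (see the `--break` option), so real FCZ files will be somewhat larger.

```python
def estimate_fcz_payload_bytes(n_residues, backbone_bytes=8, side_chain_bytes=5):
    """Rough per-residue payload estimate (hypothetical helper, not part of Foldcomp)."""
    if n_residues < 0:
        raise ValueError("n_residues must be non-negative")
    return n_residues * (backbone_bytes + side_chain_bytes)


# Side chains take 4-5 bytes per residue, so report both ends of the range.
for side_chain_bytes in (4, 5):
    size = estimate_fcz_payload_bytes(350, side_chain_bytes=side_chain_bytes)
    print(f"350 residues, {side_chain_bytes} side-chain bytes: {size} bytes")
```

With 8 backbone bytes plus 5 side-chain bytes, the upper end of the range corresponds to the 13 bytes per residue quoted above.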
Foldcomp currently only supports compression of single-chain PDB files.
# Install Foldcomp Python package
pip install foldcomp
# Download static binaries for Linux
wget https://mmseqs.com/foldcomp/foldcomp-linux-x86_64.tar.gz
# Download static binaries for Linux (ARM64)
wget https://mmseqs.com/foldcomp/foldcomp-linux-arm64.tar.gz
# Download binary for macOS
wget https://mmseqs.com/foldcomp/foldcomp-macos-universal.tar.gz
# Download binary for Windows (x64)
wget https://mmseqs.com/foldcomp/foldcomp-windows-x64.zip
Executable
# Compression
foldcomp compress <pdb|cif> [<fcz>]
foldcomp compress [-t number] <dir|tar(.gz)> [<dir|tar|db>]
# Decompression
foldcomp decompress <fcz|tar> [<pdb>]
foldcomp decompress [-t number] <dir|tar(.gz)|db> [<dir|tar>]
# Decompressing a subset of Foldcomp database
foldcomp decompress [-t number] --id-list <idlist.txt> <db> [<dir|tar>]
# Extraction of sequence or pLDDT
foldcomp extract [--plddt|--amino-acid] <fcz> [<fasta>]
foldcomp extract [--plddt|--amino-acid] [-t number] <dir|tar(.gz)|db> [<fasta_out>]
# Check
foldcomp check <fcz>
foldcomp check [-t number] <dir|tar(.gz)|db>
# RMSD
foldcomp rmsd <pdb|cif> <pdb|cif>
# Options
-h, --help print this help message
-v, --version print version
-t, --threads threads for (de)compression of folders/tar files [default=1]
-r, --recursive recursively look for files in directory [default=0]
-f, --file input is a list of files [default=0]
-a, --alt use alternative atom order [default=false]
-b, --break interval size to save absolute atom coordinates [default=25]
-z, --tar save as tar file [default=false]
-d, --db save as database [default=false]
-y, --overwrite overwrite existing files [default=false]
-l, --id-list a file of id list to be processed (only for database input)
--skip-discontinuous skip PDB with discontinuous residues (only batch compression)
--check check FCZ before and skip entries with error (only for batch decompression)
--plddt extract pLDDT score (only for extraction mode)
--fasta extract amino acid sequence (only for extraction mode)
--no-merge do not merge output files (only for extraction mode)
--time measure time for compression/decompression
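As a sketch of how the `--id-list` form of `foldcomp decompress` might be scripted, the Python snippet below writes an ID list and assembles the matching argument list. The helper names `write_id_list` and `build_decompress_command` and the output directory `out_dir` are hypothetical. The one-ID-per-line file layout is an assumption, not something stated in the usage text above. The snippet only builds and prints the command; it does not run Foldcomp.

```python
import shlex
import tempfile
from pathlib import Path


def write_id_list(ids, path):
    """Write IDs to a text file. Assumption: one ID per line."""
    path = Path(path)
    path.write_text("\n".join(ids) + "\n")
    return path


def build_decompress_command(db, id_list, out_dir, threads=1):
    """Assemble: foldcomp decompress [-t number] --id-list <idlist.txt> <db> [<dir|tar>]"""
    return [
        "foldcomp", "decompress",
        "-t", str(threads),
        "--id-list", str(id_list),
        str(db), str(out_dir),
    ]


with tempfile.TemporaryDirectory() as tmp:
    id_file = write_id_list(["d1asha_", "d1it2a_"], Path(tmp) / "idlist.txt")
    written_ids = id_file.read_text().split()
    cmd = build_decompress_command("test/example_db", id_file, "out_dir", threads=4)
    print(shlex.join(cmd))
```

To actually execute the command, the list could be passed to `subprocess.run(cmd, check=True)`, provided the `foldcomp` binary is on the `PATH` and the ID list file still exists.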
Downloading Databases
We offer prebuilt databases for multiple large sets of predicted protein structures and a Python helper to download the database files.
You can download the AlphaFoldDB Swiss-Prot with the following command:
Note: We skipped all structures with discontinuous residues or other issues.
Here is a list of the affected predictions:
full (~21M),
high-quality (~100k),
v2023_02 (~10k)
If you want other prebuilt datasets, please get in touch with us through our GitHub issues.
If you have issues downloading the databases, you can navigate directly to our download server and download the required files. For example, for afdb_uniprot_v4 these are afdb_uniprot_v4, afdb_uniprot_v4.index, afdb_uniprot_v4.dbtype, afdb_uniprot_v4.lookup, and optionally afdb_uniprot_v4.source.
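After a manual download it can be useful to check that every required file is present. The following is a minimal sketch that derives the file names from the database name using the suffixes listed above. The helper names `database_files` and `missing_files` are hypothetical, and the sketch assumes that other prebuilt databases follow the same naming pattern as afdb_uniprot_v4.

```python
import tempfile
from pathlib import Path

# Suffixes taken from the file list above; ".source" is optional.
REQUIRED_SUFFIXES = ("", ".index", ".dbtype", ".lookup")
OPTIONAL_SUFFIXES = (".source",)


def database_files(db_name, include_optional=False):
    """Return the expected file names for a database (hypothetical helper)."""
    suffixes = REQUIRED_SUFFIXES + (OPTIONAL_SUFFIXES if include_optional else ())
    return [db_name + suffix for suffix in suffixes]


def missing_files(db_dir, db_name):
    """Return the required files that are not present in db_dir."""
    return [f for f in database_files(db_name) if not (Path(db_dir) / f).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    # Simulate a partial download: only the data file and its index exist.
    for name in ("afdb_uniprot_v4", "afdb_uniprot_v4.index"):
        (Path(tmp) / name).touch()
    still_missing = missing_files(tmp, "afdb_uniprot_v4")
    print("Still missing:", still_missing)
```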
Python API
You can find more in-depth examples of using Foldcomp's Python interface in the example notebook:
import foldcomp

# 01. Handling a FCZ file
# Open a fcz file
with open("test/compressed.fcz", "rb") as fcz:
    fcz_binary = fcz.read()

# Decompress
(name, pdb) = foldcomp.decompress(fcz_binary)  # name: file name, pdb: pdb binary string

# Save to a pdb file
with open(name, "w") as pdb_file:
    pdb_file.write(pdb)

# Get data as dictionary
data_dict = foldcomp.get_data(fcz_binary)  # foldcomp.get_data(pdb) also works
# Keys: phi, psi, omega, torsion_angles, residues, bond_angles, coordinates
data_dict["phi"]             # phi angles (C-N-CA-C)
data_dict["psi"]             # psi angles (N-CA-C-N)
data_dict["omega"]           # omega angles (CA-C-N-CA)
data_dict["torsion_angles"]  # torsion angles of the backbone as list (phi + psi + omega)
data_dict["bond_angles"]     # bond angles of the backbone as list
data_dict["residues"]        # amino acid residues as string
data_dict["coordinates"]     # coordinates of the backbone as list

# 02. Iterate over a database of FCZ files
# Open a foldcomp database
ids = ["d1asha_", "d1it2a_"]
with foldcomp.open("test/example_db", ids=ids) as db:
    # Iterate through database
    for (name, pdb) in db:
        # save entries as separate pdb files
        with open(name + ".pdb", "w") as pdb_file:
            pdb_file.write(pdb)
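The atom quadruplets named in the comments above (for example C-N-CA-C for phi) each define a dihedral angle. To illustrate what such an angle is, here is a pure-Python sketch of the standard atan2-based dihedral formula for four 3D points. It is a textbook computation, not Foldcomp's internal implementation, and the function name `dihedral_degrees` is made up for this example.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four points, via atan2."""
    b0 = _sub(p1, p0)  # first bond vector
    b1 = _sub(p2, p1)  # central bond (rotation axis)
    b2 = _sub(p3, p2)  # last bond vector
    n1 = _cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    return math.degrees(math.atan2(y, x))


# Toy coordinates (not real protein atoms): only the last point changes.
cis_angle = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans_angle = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
print(cis_angle, trans_angle)
```

Using atan2 rather than acos keeps the sign of the angle and avoids numerical trouble near 0 and 180 degrees.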
Subsetting Databases
If you are dealing with millions of entries, we recommend using the createsubdb command
of MMseqs2 to subset databases.
The following commands can be used to subset the AlphaFold UniProt DB with given IDs.