riot-na

Antibody numbering software

  • 1.2.5
  • PyPI

RIOT - Rapid Immunoglobulin Overview Tool

Have some raw antibody sequences? Find matching germlines, perform numbering and get results in a familiar AIRR format!

RIOT supports both nucleotide and amino acid sequences as well as all major schemes: KABAT, CHOTHIA, MARTIN and IMGT.

MOTIVATION

Antibodies are a cornerstone of the immune system, playing a pivotal role in identifying and neutralizing infections caused by bacteria, viruses, and other pathogens. Understanding their structure and function can provide insights into both the body's natural defenses and the principles behind many therapeutic interventions, including vaccines and antibody-based drugs. The analysis and annotation of antibody sequences, including the identification of variable, diversity, joining, and constant genes, as well as the delineation of framework regions and complementarity determining regions, are essential for that understanding. Analyzing large volumes of antibody sequences is now routine in antibody discovery and requires fast and accurate tools. While there are existing tools designed for the annotation and numbering of antibody sequences, they often have limitations such as being restricted to either nucleotide or amino acid sequences, reliance on non-uniform germline databases, or slow execution times. Here we present the Rapid Immunoglobulin Overview Tool (RIOT), a novel open source solution for antibody numbering that addresses these shortcomings. RIOT handles nucleotide and amino acid sequence processing, comes with a free germline database, and is computationally efficient. We hope the tool will facilitate rapid annotation of antibody sequencing outputs, to the benefit of understanding antibody biology and discovering novel therapeutics.

  • PyPI
  • Source code
  • Colab example

Requirements

  • Python ^3.10

Quickstart

> pip install riot-na

> riot_na -s GGGCGTTTTGGCAC...

{
    "sequence_header": "-",
    "sequence": "GGGCGTTTTGGCAC...",
    "numbering_scheme": "imgt",
    "locus": "igh",
    "stop_codon": False,
    "vj_in_frame": True,
    "v_frameshift": False,
    "j_frameshift": False,
    "productive": True,
    "rev_comp": False,
    "complete_vdj": True,
    "v_call": "IGHV1-69*01",
    "d_call": "IGHD3-3*01",
    "j_call": "IGHJ6*02",
    "c_call": "IGHM",
    "v_frame": 0,
    ...
}

Installation

RIOT is distributed as prebuilt binary wheels for all major platforms. Just run the following in your chosen virtualenv:

pip install riot-na

Usage

CLI

Usage: riot_na [OPTIONS]

Options:
  -f, --input-file PATH           Path to input FASTA file.
  -s, --sequence TEXT             Input sequence.
  -o, --output-file PATH          Path to output CSV file. If not specified,
                                  stdout is used.
  --scheme [kabat|chothia|imgt|martin]
                                  Which numbering scheme should be used: IMGT,
                                  KABAT, CHOTHIA, MARTIN. Default IMGT
  --species [human|mouse]         Which species germline sequences should be
                                  used. Default is all species.
  --input-type [nt|aa]            What kind of sequences are provided on
                                  input. Default is nucleotide sequences.
  -p, --ncpu INTEGER              Number of parallel processes to use. Default
                                  is number of physical cores.
  -e, --extend_alignment BOOLEAN  Include unaligned beginning of the query
                                  sequence in numbering. This option impacts
                                  only amino acid sequences passed with -s option.
  --help                          Show this message and exit.

Examples:

Run on a single sequence and print the output to stdout:

riot_na -s <sequence>

Run on a single sequence and save the output to a CSV file:

riot_na -s <sequence> -o result.csv

Run on a FASTA file:

riot_na -f input.fasta -o results.csv
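
When the CLI is driven from a script, it can help to assemble the command from validated options. The sketch below uses only the flags listed in the help text above; the helper names build_riot_command and run_riot are hypothetical and not part of riot_na.

```python
import subprocess
from typing import List, Optional

SCHEMES = {"kabat", "chothia", "imgt", "martin"}
SPECIES = {"human", "mouse"}
INPUT_TYPES = {"nt", "aa"}


def build_riot_command(
    input_fasta: str,
    output_csv: str,
    scheme: str = "imgt",
    species: Optional[str] = None,
    input_type: str = "nt",
    ncpu: Optional[int] = None,
) -> List[str]:
    """Assemble a riot_na command line (hypothetical helper, not part of riot_na)."""
    if scheme not in SCHEMES:
        raise ValueError(f"unsupported scheme: {scheme}")
    if input_type not in INPUT_TYPES:
        raise ValueError(f"unsupported input type: {input_type}")
    if species is not None and species not in SPECIES:
        raise ValueError(f"unsupported species: {species}")

    cmd = ["riot_na", "-f", input_fasta, "-o", output_csv,
           "--scheme", scheme, "--input-type", input_type]
    # Leaving --species out keeps the documented default (all species).
    if species is not None:
        cmd += ["--species", species]
    # Leaving -p out keeps the documented default (number of physical cores).
    if ncpu is not None:
        cmd += ["-p", str(ncpu)]
    return cmd


def run_riot(cmd: List[str]) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


cmd = build_riot_command("input.fasta", "results.csv",
                         scheme="kabat", species="human", ncpu=4)
print(" ".join(cmd))
```

Calling run_riot(cmd) would then execute the tool, provided riot_na is installed and on the PATH.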

API Nucleotides

from riot_na import create_riot_nt, Organism, Scheme, RiotNumberingNT, AirrRearrangementEntryNT

riot_nt: RiotNumberingNT = create_riot_nt(allowed_species = [Organism.HOMO_SAPIENS])
airr_result: AirrRearrangementEntryNT = riot_nt.run_on_sequence(
                    header = "SRR13857054.957936",
                    query_sequence = "GAACCAAACTGACTGTCCTAGGCCAGCCCAAGTCTTCGCCATCAGTCACCCTGTTTCCACCTTCCCCTGAAGAGCTAAAAAAA",
                    scheme = Scheme.KABAT
                )

API Amino Acids

from riot_na import create_riot_aa, Organism, Scheme, RiotNumberingAA, AirrRearrangementEntryAA

riot_aa: RiotNumberingAA = create_riot_aa(allowed_species = [Organism.HOMO_SAPIENS])
airr_result: AirrRearrangementEntryAA = riot_aa.run_on_sequence(
                    header = "SRR13385915.5101835",
                    query_sequence = "QVTLKESGPVLVKPTETLTLTCTVSGFSLSNARMGVSWIRQPPGKALEWLAHIFSNDEKSYSTSLKSRLTISKDTSKSQVVLTMTNMDPGDTATYYCARRGGTIFGVVIILVRRPPL",
                    scheme = Scheme.KABAT,
                    extend_alignment = True
                )
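
To number many sequences through the API rather than the CLI, one option is to read FASTA records and call run_on_sequence for each of them. The sketch below is a minimal illustration: read_fasta and number_records are hypothetical helpers, and EchoNumberer is a stand-in used only so the example runs without riot_na installed. In real use you would pass the object returned by create_riot_aa() or create_riot_nt() instead.

```python
from typing import Any, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def number_records(numberer: Any, records: Iterable[Tuple[str, str]], **kwargs: Any) -> List[Any]:
    """Call numberer.run_on_sequence(header=..., query_sequence=...) per record."""
    return [
        numberer.run_on_sequence(header=header, query_sequence=seq, **kwargs)
        for header, seq in records
    ]


class EchoNumberer:
    """Stand-in for a RIOT numbering object; it only echoes its input."""

    def run_on_sequence(self, header: str, query_sequence: str, **kwargs: Any) -> dict:
        return {"sequence_header": header, "sequence_aa": query_sequence}


fasta_text = """>seq1
QVTLKESGPV
LVKPTETLTL
>seq2
EVQLVESGGG
"""

results = number_records(EchoNumberer(), read_fasta(fasta_text.splitlines()))
for entry in results:
    print(entry["sequence_header"], len(entry["sequence_aa"]))
```

Extra keyword arguments such as scheme are forwarded unchanged, so a call like number_records(riot_aa, records, scheme=Scheme.KABAT) mirrors the API example above.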

Multiprocessing

RIOT uses a precompiled Rust module for prefiltering. As a result, RiotNumberingNT/AA objects cannot be pickled, so you cannot pass one as a worker parameter to e.g. mp.Pool() or use it directly in a Spark UDF. There is a simple workaround: create the object inside each worker and cache it, for example with the caching mechanism from the cachetools package. Below you can find working and non-working examples.

The following will not work:

import functools
import multiprocessing as mp
from riot_na import create_riot_aa, AirrRearrangementEntryAA, RiotNumberingAA

seqs = ["EVQLVESGGGLVQPGGSLRLSCAASGFNIKDTYIHWVRQAPGKGLEWVARIYPTNGYTRYADSVKGRFTISADTSKNTAYLQMNSLRAEDTAVYYCARGGSFYYYYMDVWGQGTLVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKKVEPKSCDKTGHHHHHHHHG"] * 10

def worker(riot: RiotNumberingAA, seq: str) -> AirrRearrangementEntryAA:
    airr = riot.run_on_sequence("-", seq)
    return airr

riot = create_riot_aa()

worker_partial = functools.partial(worker, riot)

with mp.Pool() as pool:
    res = pool.map(worker_partial, seqs)

# Output: TypeError: cannot pickle 'builtins.Prefiltering' object

The proper way:

import multiprocessing as mp
from cachetools import cached
from riot_na import create_riot_aa, AirrRearrangementEntryAA

seqs = ["EVQLVESGGGLVQPGGSLRLSCAASGFNIKDTYIHWVRQAPGKGLEWVARIYPTNGYTRYADSVKGRFTISADTSKNTAYLQMNSLRAEDTAVYYCARGGSFYYYYMDVWGQGTLVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKKVEPKSCDKTGHHHHHHHHG"] * 10

# Built on first use inside each worker process and then reused,
# so the unpicklable object never has to cross a process boundary.
@cached(cache={})
def get_riot():
    return create_riot_aa()

def worker(seq: str) -> AirrRearrangementEntryAA:
    riot = get_riot()
    airr = riot.run_on_sequence("-", seq)
    return airr

with mp.Pool() as pool:
    res = pool.map(worker, seqs)

Spark UDF example:

import json
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from cachetools import cached
from riot_na import create_riot_aa, RiotNumberingAA

spark = SparkSession.builder.appName("Riot on Spark").getOrCreate()

seq = "EVQLVESGGGLVQPGGSLRLSCAASGFNIKDTYIHWVRQAPGKGLEWVARIYPTNGYTRYADSVKGRFTISADTSKNTAYLQMNSLRAEDTAVYYCARGGSFYYYYMDVWGQGTLVTVSSASTKGPSVFPLAPSSKSTSGGTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTQTYICNVNHKPSNTKVDKKVEPKSCDKTGHHHHHHHHG"

df = spark.createDataFrame([{"seq":  seq}]*10)

@cached(cache={})
def get_riot() -> RiotNumberingAA:
    return create_riot_aa()

@udf
def number_sequence(seq: str)-> str:
    riot = get_riot()
    airr = riot.run_on_sequence("-", seq)
    return json.dumps(airr.__dict__)

df.select(number_sequence("seq")).collect()

For a pure Python solution, see the riot_na/api/api_mp.py file. We use that approach ourselves because we want to keep RIOT's dependencies to a minimum.

Germline database

RIOT uses OGRDB as its primary source of germline alleles. The database version as of 22.01.2024 was used. C genes are imported from the IgBLAST FTP site: https://ftp.ncbi.nih.gov/blast/executables/igblast/release/database/ncbi_human_c_genes.tar

Data format

This section describes the fields of the numbering result object AirrRearrangementEntry(AA). It is based on the AIRR Rearrangement Schema format, extended by 7 columns highlighted in the table (bold) and stripped of the unnecessary ones. There are also some differences in the fields' definitions, so the AIRR format specification should be treated only as a loose reference. Description of the original AIRR format can be found here.

Attributes:

  1. All fields in the format are required (always present)
  2. All fields are nullable, with the exception of sequence_header and sequence

AirrRearrangementNT field definitions

| Name | Type | Definition |
| --- | --- | --- |
| sequence_header | string | FASTA header for the given input sequence (when numbering a FASTA file) or value of the sequence_header parameter (when using the RiotNumberingNT API). |
| sequence | string | The query nucleotide sequence. Usually, this is the unmodified input sequence, but it can be reverse complemented if needed. |
| sequence_aa | string | Translated query sequence. |
| numbering_scheme | enum ["imgt", "kabat", "chothia", "martin"] | Numbering scheme used, default is "imgt". |
| locus | string | Gene locus (chain type). |
| stop_codon | boolean | True if the aligned sequence contains a stop codon. |
| vj_in_frame | boolean | True if the V and J gene alignments are in-frame. In detail: the distance between the v_alignment reading frame and the j_alignment reading frame is divisible by 3. |
| v_frameshift | boolean | True if the V gene in the query nucleotide sequence contains a translational frameshift relative to the frame of the V gene reference sequence. In other words: sum of insertions and deletions between consecutive matches in alignment is divisible by 3. |
| j_frameshift | boolean | True if the J gene in the query nucleotide sequence contains a translational frameshift relative to the frame of the J gene reference sequence. In other words: sum of insertions and deletions between consecutive matches in alignment is divisible by 3. |
| productive | boolean | True if the V(D)J sequence is predicted to be productive. In detail: stop_codon is False and vj_in_frame is True. |
| rev_comp | boolean | True if the alignment is on the opposite strand (reverse complemented) with respect to the query sequence. If True, indicates the sequence field contains a reverse complemented original query sequence. |
| complete_vdj | boolean | True if the sequence alignment spans the entire V(D)J region. Meaning, sequence_alignment includes both the first V gene codon that encodes the mature polypeptide chain (i.e., after the leader sequence) and the last complete codon of the J gene (i.e., before the J-C splice site). This does not require an absence of deletions within the internal FWR and CDR regions of the alignment. |
| v_call | string | V gene with allele. |
| d_call | string | D gene with allele. |
| j_call | string | J gene with allele. |
| c_call | string | Constant region gene with allele. |
| v_frame | enum [0, 1, 2] | V frame offset from v_alignment_start. |
| j_frame | enum [0, 1, 2] | J frame offset from j_alignment_start. |
| sequence_alignment | string | Gapped alignment of query sequence spanning V-J segment aligned to germline, reverse complemented if needed. |
| germline_alignment | string | Gapped aligned germline sequence spanning the same region as the sequence_alignment field (V(D)J region). Segments between matched germlines are gapped to match query sequence length. |
| sequence_alignment_aa | string | Amino acid translation of the sequence_alignment. |
| germline_alignment_aa | string | Amino acid translation of the germline_alignment. |
| v_alignment_start | integer | Start position of the V gene alignment in sequence_alignment (1-based closed interval). |
| v_alignment_end | integer | End position of the V gene alignment in sequence_alignment (1-based closed interval). |
| d_alignment_start | integer | Start position of the D gene alignment in sequence_alignment (1-based closed interval). |
| d_alignment_end | integer | End position of the D gene alignment in sequence_alignment (1-based closed interval). |
| j_alignment_start | integer | Start position of the J gene alignment in sequence_alignment (1-based closed interval). |
| j_alignment_end | integer | End position of the J gene alignment in sequence_alignment (1-based closed interval). |
| c_alignment_start | integer | Start position of the C gene alignment in sequence_alignment (1-based closed interval). |
| c_alignment_end | integer | End position of the C gene alignment in sequence_alignment (1-based closed interval). |
| v_sequence_alignment | string | Aligned portion of query sequence assigned to the V gene. |
| v_sequence_alignment_aa | string | Amino acid translation of the v_sequence_alignment field. |
| v_germline_alignment | string | Aligned V gene germline sequence. |
| v_germline_alignment_aa | string | Aligned amino acid V gene germline sequence. |
| d_sequence_alignment | string | Aligned portion of query sequence assigned to the D gene. |
| d_germline_alignment | string | Aligned D gene germline sequence. |
| j_sequence_alignment | string | Aligned portion of query sequence assigned to the J gene. |
| j_sequence_alignment_aa | string | Amino acid translation of the j_sequence_alignment field. |
| j_germline_alignment | string | Aligned J gene germline sequence. |
| j_germline_alignment_aa | string | Aligned amino acid J gene germline sequence. |
| c_sequence_alignment | string | Aligned portion of query sequence assigned to the constant region. |
| c_germline_alignment | string | Aligned constant region germline sequence. |
| fwr1 | string | Nucleotide sequence of the aligned FWR1 region. |
| fwr1_aa | string | Amino acid translation of the fwr1 field. |
| cdr1 | string | Nucleotide sequence of the aligned CDR1 region. |
| cdr1_aa | string | Amino acid translation of the cdr1 field. |
| fwr2 | string | Nucleotide sequence of the aligned FWR2 region. |
| fwr2_aa | string | Amino acid translation of the fwr2 field. |
| cdr2 | string | Nucleotide sequence of the aligned CDR2 region. |
| cdr2_aa | string | Amino acid translation of the cdr2 field. |
| fwr3 | string | Nucleotide sequence of the aligned FWR3 region. |
| fwr3_aa | string | Amino acid translation of the fwr3 field. |
| cdr3 | string | Nucleotide sequence of the aligned CDR3 region. |
| cdr3_aa | string | Amino acid translation of the cdr3 field. |
| fwr4 | string | Nucleotide sequence of the aligned FWR4 region. |
| fwr4_aa | string | Amino acid translation of the fwr4 field. |
| junction | string | Junction region nucleotide sequence, where the junction is defined as the CDR3 plus the two flanking conserved codons. |
| junction_aa | string | Amino acid translation of the junction. |
| junction_length | integer | Number of nucleotides in the junction sequence. |
| junction_aa_length | integer | Number of amino acids in the junction sequence. |
| v_score | number | Alignment score (Smith-Waterman) for the V gene alignment. |
| d_score | number | Alignment score (Smith-Waterman) for the D gene alignment. |
| j_score | number | Alignment score (Smith-Waterman) for the J gene alignment. |
| c_score | number | Alignment score (Smith-Waterman) for the C gene alignment. |
| v_cigar | string | CIGAR string for the V gene alignment. |
| d_cigar | string | CIGAR string for the D gene alignment. |
| j_cigar | string | CIGAR string for the J gene alignment. |
| c_cigar | string | CIGAR string for the C gene alignment. |
| v_support | number | V gene alignment E-value. Note: Every value less than 1.4e-45 will appear as 0.0 (due to single-precision floating point standard limitation). |
| d_support | number | D gene alignment E-value. Note: Every value less than 1.4e-45 will appear as 0.0 (due to single-precision floating point standard limitation). |
| j_support | number | J gene alignment E-value. Note: Every value less than 1.4e-45 will appear as 0.0 (due to single-precision floating point standard limitation). |
| c_support | number | C gene alignment E-value. Note: Every value less than 1.4e-45 will appear as 0.0 (due to single-precision floating point standard limitation). |
| v_identity | number | Fractional identity for the V gene alignment. |
| d_identity | number | Fractional identity for the D gene alignment. |
| j_identity | number | Fractional identity for the J gene alignment. |
| c_identity | number | Fractional identity for the C gene alignment. |
| v_sequence_start | integer | Start position of the V gene in the query sequence (1-based closed interval). |
| v_sequence_end | integer | End position of the V gene in the query sequence (1-based closed interval). |
| d_sequence_start | integer | Start position of the D gene in the query sequence (1-based closed interval). |
| d_sequence_end | integer | End position of the D gene in the query sequence (1-based closed interval). |
| j_sequence_start | integer | Start position of the J gene in the query sequence (1-based closed interval). |
| j_sequence_end | integer | End position of the J gene in the query sequence (1-based closed interval). |
| c_sequence_start | integer | Start position of the C gene in the query sequence (1-based closed interval). |
| c_sequence_end | integer | End position of the C gene in the query sequence (1-based closed interval). |
| v_germline_start | integer | Alignment start position in the V gene reference sequence (1-based closed interval). |
| v_germline_end | integer | Alignment end position in the V gene reference sequence (1-based closed interval). |
| d_germline_start | integer | Alignment start position in the D gene reference sequence (1-based closed interval). |
| d_germline_end | integer | Alignment end position in the D gene reference sequence (1-based closed interval). |
| j_germline_start | integer | Alignment start position in the J gene reference sequence (1-based closed interval). |
| j_germline_end | integer | Alignment end position in the J gene reference sequence (1-based closed interval). |
| c_germline_start | integer | Alignment start position in the C gene reference sequence (1-based closed interval). |
| c_germline_end | integer | Alignment end position in the C gene reference sequence (1-based closed interval). |
| fwr1_start | integer | FWR1 start position in the query sequence (1-based closed interval). |
| fwr1_end | integer | FWR1 end position in the query sequence (1-based closed interval). |
| cdr1_start | integer | CDR1 start position in the query sequence (1-based closed interval). |
| cdr1_end | integer | CDR1 end position in the query sequence (1-based closed interval). |
| fwr2_start | integer | FWR2 start position in the query sequence (1-based closed interval). |
| fwr2_end | integer | FWR2 end position in the query sequence (1-based closed interval). |
| cdr2_start | integer | CDR2 start position in the query sequence (1-based closed interval). |
| cdr2_end | integer | CDR2 end position in the query sequence (1-based closed interval). |
| fwr3_start | integer | FWR3 start position in the query sequence (1-based closed interval). |
| fwr3_end | integer | FWR3 end position in the query sequence (1-based closed interval). |
| cdr3_start | integer | CDR3 start position in the query sequence (1-based closed interval). |
| cdr3_end | integer | CDR3 end position in the query sequence (1-based closed interval). |
| fwr4_start | integer | FWR4 start position in the query sequence (1-based closed interval). |
| fwr4_end | integer | FWR4 end position in the query sequence (1-based closed interval). |
| sequence_aa_scheme_cigar | string | CIGAR string defining sequence_aa to scheme alignment. |
| scheme_residue_mapping | json string | Scheme numbering of sequence_alignment_aa - positions not present in this sequence are not included. |
| positional_scheme_mapping | json string | Mapping from absolute residue position in sequence_alignment_aa (0-based) to corresponding scheme position. |
| exc | string | Exception (if any) thrown during ANARCI numbering. |
| additional_validation_flags | json string | JSON string containing additional validation flags. |
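
Because the positions above are 1-based closed intervals, converting them to Python's 0-based, half-open slices is a common source of off-by-one errors. The sketch below shows the conversion together with the productive rule stated in the table. The helper names are hypothetical, and it assumes a missing offset is reported as None or -1.

```python
from typing import Optional


def region_from_offsets(sequence: str, start: Optional[int], end: Optional[int]) -> Optional[str]:
    """Slice a region given 1-based, closed [start, end] offsets.

    Returns None when an offset is missing (None or -1) or the interval
    is empty, instead of silently returning a wrong slice.
    """
    if start is None or end is None or start < 1 or end < start:
        return None
    # 1-based closed [start, end] corresponds to 0-based half-open [start - 1, end).
    return sequence[start - 1:end]


def is_productive(stop_codon: bool, vj_in_frame: bool) -> bool:
    """Productive rule from the table: no stop codon and V/J in frame."""
    return (not stop_codon) and vj_in_frame


query = "ACGTACGTAC"
print(region_from_offsets(query, 3, 6))    # GTAC (positions 3, 4, 5 and 6)
print(region_from_offsets(query, -1, 6))   # None, because the start is missing
print(is_productive(stop_codon=False, vj_in_frame=True))
```

A region sliced this way has length end - start + 1, which is a quick sanity check when comparing offsets against fields such as cdr3 or fwr1.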

Additional validation flags

The following table describes additional validation flags calculated alongside the main fields. The last 5 flags, regarding conserved residues, apply only when using the IMGT scheme.

| Field name | AIRR fields required for calculation | Description |
| --- | --- | --- |
| regions_in_aligned_sequence | all regions (fwr1, cdr1, fwr2 …); sequence_alignment | True if all region sequences, concatenated, are present in sequence_alignment. |
| regions_aa_in_aligned_sequence_aa | all _aa (fwr1_aa, cdr1_aa, …); sequence_alignment_aa | True if all region_aa sequences, concatenated, are present in sequence_alignment_aa. |
| translated_regions_in_aligned_sequence_aa | all regions (fwr1, cdr1, fwr2 …); sequence_alignment_aa; v_frame | True if all region sequences, concatenated and translated using v_frame, are present in sequence_alignment_aa. |
| correct_vj_in_frame | v_alignment_start; v_frame; j_alignment_start; j_frame | True if vj_in_frame is equal to: distance between the v_alignment translation frame and the j_alignment translation frame is divisible by 3. |
| cdr3_in_junction | cdr3; junction; cdr3_aa; junction_aa | True if cdr3 is present in junction and cdr3_aa is present in junction_aa. |
| locus_as_in_v_gene | locus; v_call | True if locus is consistent with the one specified in V gene (v_call). |
| v_gene_alignment | sequence; v_sequence_start; v_sequence_end; v_sequence_alignment | True if v_sequence_alignment is equal to the substring of sequence from position v_sequence_start to v_sequence_end. |
| j_gene_alignment | sequence; j_sequence_start; j_sequence_end; j_sequence_alignment | True if j_sequence_alignment is equal to the substring of sequence from position j_sequence_start to j_sequence_end. |
| c_gene_alignment | sequence; c_sequence_start; c_sequence_end; c_sequence_alignment | True if c_sequence_alignment is equal to the substring of sequence from position c_sequence_start to c_sequence_end. |
| no_negative_offsets_inside_v_alignment | fwr1_start; fwr1_end; cdr1_start; cdr1_end; fwr2_start; fwr2_end; cdr2_start; cdr2_end; fwr3_start; fwr3_end; cdr3_start | True if there is no negative (missing) offset inside the V alignment. Example of a violation: fwr1_start == 1; fwr1_end == 35; cdr1_start == -1; cdr1_end == 65. |
| no_negative_offsets_inside_j_alignment | cdr3_end; fwr4_start; fwr4_end | True if there is no negative (missing) offset inside the J alignment. Example of a violation: cdr3_end == 293; fwr4_start == -1; fwr4_end == 326. |
| consecutive_offsets | all _start and _end | True if consecutive region_start and region_end offsets are ascending, and no region_start is greater than the corresponding region_end. |
| no_empty_cdr3 | cdr3 | True if cdr3 is present. |
| primary_sequence_in_sequence_alignment_aa | sequence_alignment_aa; scheme_residue_mapping | True if concatenation of scheme_residue_mapping amino acids results in a sequence that is a part of sequence_alignment_aa and amino acids are in correct order. |
| no_insertion_next_to_deletion_aa | sequence_aa_scheme_cigar | True if there are no insertions next to deletions - indicates correct CIGARs merging process. |
| insertions_in_correct_places | scheme_residue_mapping; numbering_scheme; locus | True if insertions are on scheme-allowed positions. |
| correct_fwr1_offsets | sequence; v_sequence_start; fwr1_start; fwr1_end; fwr1 | True if fwr1 is equal to the substring of sequence cut from position fwr1_start up to fwr1_end. If fwr1_start is -1 (missing), v_sequence_start is used as a starting offset instead. |
| correct_cdr1_offsets | sequence; cdr1_start; cdr1_end; cdr1 | True if cdr1 is equal to the substring of sequence cut from position cdr1_start up to cdr1_end. |
| correct_fwr2_offsets | sequence; fwr2_start; fwr2_end; fwr2 | True if fwr2 is equal to the substring of sequence cut from position fwr2_start up to fwr2_end. |
| correct_cdr2_offsets | sequence; cdr2_start; cdr2_end; cdr2 | True if cdr2 is equal to the substring of sequence cut from position cdr2_start up to cdr2_end. |
| correct_fwr3_offsets | sequence; fwr3_start; fwr3_end; fwr3 | True if fwr3 is equal to the substring of sequence cut from position fwr3_start up to fwr3_end. |
| correct_cdr3_offsets | sequence; cdr3_start; cdr3_end; cdr3 | True if cdr3 is equal to the substring of sequence cut from position cdr3_start up to cdr3_end. |
| correct_fwr4_offsets | sequence; j_sequence_end; fwr4_start; fwr4_end; fwr4 | True if fwr4 is equal to the substring of sequence cut from position fwr4_start up to fwr4_end. If fwr4_end is -1 (missing), j_sequence_end is used as an ending offset instead. |
| no_empty_fwr1_in_v | v_sequence_alignment; fwr1 | True if fwr1 is present. |
| no_empty_cdr1_in_v | v_sequence_alignment; cdr1 | True if cdr1 is present. |
| no_empty_fwr2_in_v | v_sequence_alignment; fwr2 | True if fwr2 is present. |
| no_empty_cdr2_in_v | v_sequence_alignment; cdr2 | True if cdr2 is present. |
| no_empty_fwr3_in_v | v_sequence_alignment; fwr3 | True if fwr3 is present. |
| no_empty_fwr4_in_j | j_sequence_alignment; fwr4 | True if fwr4 is present. |
| conserved_C23_present | imgt_residue_mapping | True if conserved Cysteine on IMGT position 23 is present. |
| conserved_W41_present | imgt_residue_mapping | True if conserved Tryptophan on IMGT position 41 is present. |
| conserved_C104_present | imgt_residue_mapping | True if conserved Cysteine on IMGT position 104 is present. |
| conserved_W118_heavy_present | imgt_residue_mapping | True if conserved Tryptophan on IMGT position 118 is present (heavy chain only). |
| conserved_F118_light_present | imgt_residue_mapping | True if conserved Phenylalanine on IMGT position 118 is present (light chain only). |
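
Since additional_validation_flags is delivered as a JSON string, a small helper can list which checks did not pass. The sketch below assumes the string decodes to a flat object that maps flag names to true, false or null; that layout is an assumption made for illustration, and failed_flags is a hypothetical helper, so adjust it to the actual output.

```python
import json
from typing import List, Optional


def failed_flags(additional_validation_flags: Optional[str]) -> List[str]:
    """Return the names of validation flags whose value is exactly False.

    Flags that are null (not evaluated) are not reported as failures.
    """
    if not additional_validation_flags:
        return []
    flags = json.loads(additional_validation_flags)
    return sorted(name for name, value in flags.items() if value is False)


# Hand-written sample, not real RIOT output.
sample = json.dumps({
    "no_empty_cdr3": True,
    "cdr3_in_junction": False,
    "conserved_C23_present": True,
    "conserved_W41_present": None,
})

print(failed_flags(sample))
```

Filtering a result set on an empty list of failed flags is one simple way to keep only entries that passed every check that was evaluated.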

AirrRearrangementAA field definitions

The AIRR data format was developed for nucleotide sequences. For the amino acid pipeline, a similar format was created. Most fields are analogous to the nucleotide-based ones, with an _aa suffix in the name.

| Name | Type | Definition |
| --- | --- | --- |
| sequence_header | string | FASTA header for the given input sequence (when numbering a FASTA file) or value of the sequence_header parameter (when using the RiotNumberingAA API). |
| sequence_aa | string | The query sequence. |
| numbering_scheme | enum ["imgt", "kabat", "chothia", "martin"] | Numbering scheme used, default is "imgt". |
| locus | string | Gene locus (chain type). |
| stop_codon | boolean | True if the aligned sequence contains a stop codon. |
| productive | boolean | True if the V(D)J sequence is predicted to be productive. In detail: stop_codon is False and V and J genes are detected. |
| complete_vdj | boolean | True if the sequence alignment spans the entire V(D)J region. Meaning, sequence alignment includes both the first V amino acid and the last of the J gene (i.e., before the J-C splice site). This does not require an absence of deletions within the internal FWR and CDR regions of the alignment. |
| v_call | string | V gene with allele. |
| j_call | string | J gene with allele. |
| germline_alignment_aa | string | Assembled, aligned, full-length inferred germline sequence spanning the same region as the sequence_alignment_aa field (V-J region). |
| sequence_alignment_aa | string | Segment of query sequence spanning V-J aligned to germline. |
| v_alignment_start_aa | integer | Start position of the V gene alignment in sequence_alignment_aa (1-based closed interval). |
| v_alignment_end_aa | integer | End position of the V gene alignment in sequence_alignment_aa (1-based closed interval). |
| j_alignment_start_aa | integer | Start position of the J gene alignment in sequence_alignment_aa (1-based closed interval). |
| j_alignment_end_aa | integer | End position of the J gene alignment in sequence_alignment_aa (1-based closed interval). |
| v_sequence_alignment_aa | string | Aligned portion of query sequence assigned to the V gene. |
| v_germline_alignment_aa | string | Aligned V gene germline sequence. |
| j_sequence_alignment_aa | string | Aligned portion of query sequence assigned to the J gene. |
| j_germline_alignment_aa | string | Aligned J gene germline sequence. |
| fwr1_aa | string | Amino acid sequence of the aligned FWR1 region. |
| cdr1_aa | string | Amino acid sequence of the aligned CDR1 region. |
| fwr2_aa | string | Amino acid sequence of the aligned FWR2 region. |
| cdr2_aa | string | Amino acid sequence of the aligned CDR2 region. |
| fwr3_aa | string | Amino acid sequence of the aligned FWR3 region. |
| cdr3_aa | string | Amino acid sequence of the aligned CDR3 region. |
| fwr4_aa | string | Amino acid sequence of the aligned FWR4 region. |
| junction_aa | string | Junction region amino acid sequence, where the junction is defined as the CDR3 plus the two flanking conserved amino acids. |
| junction_aa_length | integer | Number of amino acids in the junction sequence. |
| v_score_aa | number | Alignment score (Smith-Waterman) for the V gene alignment. |
| j_score_aa | number | Alignment score (Smith-Waterman) for the J gene alignment. |
| v_cigar_aa | string | CIGAR string for the V gene alignment. |
| j_cigar_aa | string | CIGAR string for the J gene alignment. |
| v_support_aa | number | V gene alignment E-value. Note: Every value less than 1.4e-45 will appear as 0.0 (due to single-precision floating point standard limitation). |
| j_support_aa | number | J gene alignment E-value. Note: Every value less than 1.4e-45 will appear as 0.0 (due to single-precision floating point standard limitation). |
| v_identity_aa | number | Fractional identity for the V gene alignment. |
| j_identity_aa | number | Fractional identity for the J gene alignment. |
| v_sequence_start_aa | integer | Start position of the V gene in the query sequence (1-based closed interval). |
| v_sequence_end_aa | integer | End position of the V gene in the query sequence (1-based closed interval). |
| j_sequence_start_aa | integer | Start position of the J gene in the query sequence (1-based closed interval). |
| j_sequence_end_aa | integer | End position of the J gene in the query sequence (1-based closed interval). |
| v_germline_start_aa | integer | Alignment start position in the V gene reference sequence (1-based closed interval). |
| v_germline_end_aa | integer | Alignment end position in the V gene reference sequence (1-based closed interval). |
| j_germline_start_aa | integer | Alignment start position in the J gene reference sequence (1-based closed interval). |
| j_germline_end_aa | integer | Alignment end position in the J gene reference sequence (1-based closed interval). |
| fwr1_start_aa | integer | FWR1 start position in the query sequence (1-based closed interval). |
| fwr1_end_aa | integer | FWR1 end position in the query sequence (1-based closed interval). |
| cdr1_start_aa | integer | CDR1 start position in the query sequence (1-based closed interval). |
| cdr1_end_aa | integer | CDR1 end position in the query sequence (1-based closed interval). |
| fwr2_start_aa | integer | FWR2 start position in the query sequence (1-based closed interval). |
| fwr2_end_aa | integer | FWR2 end position in the query sequence (1-based closed interval). |
| cdr2_start_aa | integer | CDR2 start position in the query sequence (1-based closed interval). |
| cdr2_end_aa | integer | CDR2 end position in the query sequence (1-based closed interval). |
| fwr3_start_aa | integer | FWR3 start position in the query sequence (1-based closed interval). |
| fwr3_end_aa | integer | FWR3 end position in the query sequence (1-based closed interval). |
| cdr3_start_aa | integer | CDR3 start position in the query sequence (1-based closed interval). |
| cdr3_end_aa | integer | CDR3 end position in the query sequence (1-based closed interval). |
| fwr4_start_aa | integer | FWR4 start position in the query sequence (1-based closed interval). |
| fwr4_end_aa | integer | FWR4 end position in the query sequence (1-based closed interval). |
| sequence_aa_scheme_cigar | string | CIGAR string defining sequence_alignment_aa to scheme alignment. |
| scheme_residue_mapping | json string | Scheme numbering of sequence_alignment_aa - positions not present in this sequence are not included. |
| positional_scheme_mapping | json string | Mapping from absolute residue position in sequence_alignment_aa (0-based) to corresponding scheme position. |
| exc | string | Exception (if any) thrown during ANARCI numbering. |
| additional_validation_flags | json string | JSON string containing additional validation flags. |
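
The note on the support fields follows from how single-precision floats behave: the smallest positive value they can hold is about 1.4e-45, and anything sufficiently far below that rounds to zero. The sketch below reproduces the effect with the standard struct module. It assumes, as the note implies, that E-values pass through a 32-bit float, and as_float32 is a hypothetical helper.

```python
import struct


def as_float32(value: float) -> float:
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


tiny_but_representable = as_float32(1e-30)   # well inside the float32 range
underflowed = as_float32(1e-46)              # below the float32 range

print(tiny_but_representable > 0.0)   # True
print(underflowed)                    # 0.0
```

In practice this means a support value of 0.0 should be read as "smaller than roughly 1.4e-45" rather than as an exact zero, for example when ranking alignments or taking logarithms.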

Additional validation flags (AA)

The following table describes additional validation flags calculated alongside the main fields. The last 5 flags, regarding conserved residues, apply only when using the IMGT scheme.

| Flag | AIRR fields required for calculation | Description |
| --- | --- | --- |
| regions_aa_in_aligned_sequence_aa | all _aa (fwr1_aa_aa, cdr1_aa_aa, …); sequence_alignment_aa | True if all region_aa sequences, concatenated, are present in sequence_alignment_aa. |
| locus_as_in_v_gene | locus; v_call | True if locus is consistent with the one specified in V gene (v_call). |
| v_gene_alignment_aa | sequence; v_sequence_start_aa; v_sequence_end_aa; v_sequence_alignment_aa | True if v_sequence_alignment_aa is equal to the substring of sequence from position v_sequence_start_aa to v_sequence_end_aa. |
| j_gene_alignment_aa | sequence; j_sequence_start_aa; j_sequence_end_aa; j_sequence_alignment_aa | True if j_sequence_alignment_aa is equal to the substring of sequence from position j_sequence_start_aa to j_sequence_end_aa. |
| no_negative_offsets_inside_v_alignment_aa | fwr1_aa_start_aa; fwr1_aa_end_aa; cdr1_aa_start_aa; cdr1_aa_end_aa; fwr2_aa_start_aa; fwr2_aa_end_aa; cdr2_aa_start_aa; cdr2_aa_end_aa; fwr3_aa_start_aa; fwr3_aa_end_aa; cdr3_aa_start_aa | True if there is no negative (missing) offset inside the V alignment. Example of a violation: fwr1_aa_start_aa == 1; fwr1_aa_end_aa == 26; cdr1_aa_start_aa == -1; cdr1_aa_end_aa == 38. |
| no_negative_offsets_inside_j_alignment_aa | cdr3_aa_end_aa; fwr4_aa_start_aa; fwr4_aa_end_aa | True if there is no negative (missing) offset inside the J alignment. Example of a violation: cdr3_aa_end_aa == 117; fwr4_aa_start_aa == -1; fwr4_aa_end_aa == 128. |
| consecutive_offsets_aa | all _start_aa and _end_aa | True if consecutive region_start_aa and region_end_aa offsets are ascending, and no region_start_aa is greater than the corresponding region_end_aa. |
| no_empty_cdr3_aa | cdr3_aa | True if cdr3_aa is present. |
| primary_sequence_in_sequence_alignment_aa | sequence_alignment_aa; scheme_residue_mapping | True if concatenating the scheme_residue_mapping amino acids yields a sequence that is part of sequence_alignment_aa, with the amino acids in the correct order. |
| no_insertion_next_to_deletion_aa | sequence_aa_scheme_cigar | True if there are no insertions next to deletions, which indicates a correct CIGAR merging process. |
| insertions_in_correct_places | scheme_residue_mapping; numbering_scheme; locus | True if insertions are at scheme-allowed positions. |
| correct_fwr1_aa_offsets | sequence; v_sequence_start_aa; fwr1_aa_start_aa; fwr1_aa_end_aa; fwr1_aa | True if fwr1_aa is equal to the substring of sequence cut from position fwr1_aa_start_aa up to fwr1_aa_end_aa. If fwr1_aa_start_aa is -1 (missing), v_sequence_start_aa is used as the starting offset instead. |
| correct_cdr1_aa_offsets | sequence; cdr1_aa_start_aa; cdr1_aa_end_aa; cdr1_aa | True if cdr1_aa is equal to the substring of sequence cut from position cdr1_aa_start_aa up to cdr1_aa_end_aa. |
| correct_fwr2_aa_offsets | sequence; fwr2_aa_start_aa; fwr2_aa_end_aa; fwr2_aa | True if fwr2_aa is equal to the substring of sequence cut from position fwr2_aa_start_aa up to fwr2_aa_end_aa. |
| correct_cdr2_aa_offsets | sequence; cdr2_aa_start_aa; cdr2_aa_end_aa; cdr2_aa | True if cdr2_aa is equal to the substring of sequence cut from position cdr2_aa_start_aa up to cdr2_aa_end_aa. |
| correct_fwr3_aa_offsets | sequence; fwr3_aa_start_aa; fwr3_aa_end_aa; fwr3_aa | True if fwr3_aa is equal to the substring of sequence cut from position fwr3_aa_start_aa up to fwr3_aa_end_aa. |
| correct_cdr3_aa_offsets | sequence; cdr3_aa_start_aa; cdr3_aa_end_aa; cdr3_aa | True if cdr3_aa is equal to the substring of sequence cut from position cdr3_aa_start_aa up to cdr3_aa_end_aa. |
| correct_fwr4_aa_offsets | sequence; j_sequence_end_aa; fwr4_aa_start_aa; fwr4_aa_end_aa; fwr4_aa | True if fwr4_aa is equal to the substring of sequence cut from position fwr4_aa_start_aa up to fwr4_aa_end_aa. If fwr4_aa_end_aa is -1 (missing), j_sequence_end_aa is used as the ending offset instead. |
| no_empty_fwr1_aa_in_v | v_sequence_alignment_aa; fwr1_aa | True if fwr1_aa is present. |
| no_empty_cdr1_aa_in_v | v_sequence_alignment_aa; cdr1_aa | True if cdr1_aa is present. |
| no_empty_fwr2_aa_in_v | v_sequence_alignment_aa; fwr2_aa | True if fwr2_aa is present. |
| no_empty_cdr2_aa_in_v | v_sequence_alignment_aa; cdr2_aa | True if cdr2_aa is present. |
| no_empty_fwr3_aa_in_v | v_sequence_alignment_aa; fwr3_aa | True if fwr3_aa is present. |
| no_empty_fwr4_aa_in_j | j_sequence_alignment_aa; fwr4_aa | True if fwr4_aa is present. |
| conserved_C23_present | scheme_residue_mapping | True if the conserved Cysteine at IMGT position 23 is present. |
| conserved_W41_present | scheme_residue_mapping | True if the conserved Tryptophan at IMGT position 41 is present. |
| conserved_C104_present | scheme_residue_mapping | True if the conserved Cysteine at IMGT position 104 is present. |
| conserved_W118_heavy_present | scheme_residue_mapping | True if the conserved Tryptophan at IMGT position 118 is present (heavy chain only). |
| conserved_F118_light_present | scheme_residue_mapping | True if the conserved Phenylalanine at IMGT position 118 is present (light chain only). |
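
The flags above are returned in the additional_validation_flags field as a JSON string. The sketch below shows one way to inspect them and to reproduce the idea behind consecutive_offsets_aa. It is not RIOT's implementation: `failed_flags` and `offsets_are_consecutive` are hypothetical helpers, the flat `{flag_name: bool}` layout of the JSON is an assumption, and skipping regions with a missing (-1) offset is also an assumption, since that case is not specified here.

```python
import json
from typing import List, Tuple


def failed_flags(flags_json: str) -> List[str]:
    """List the names of flags whose value is False.

    Assumes the JSON string decodes to a flat {flag_name: bool} object;
    that layout is an assumption of this sketch.
    """
    flags = json.loads(flags_json)
    return sorted(name for name, value in flags.items() if value is False)


def offsets_are_consecutive(regions: List[Tuple[int, int]]) -> bool:
    """Check (start, end) pairs listed in FWR1..FWR4 order.

    Follows the description of consecutive_offsets_aa: offsets must ascend
    and no start may exceed its end. Regions with a missing (-1) offset are
    skipped, which is an assumption rather than documented behavior.
    """
    previous_end = 0
    for start, end in regions:
        if start < 0 or end < 0:
            continue
        if start > end or start <= previous_end:
            return False
        previous_end = end
    return True


# Made-up values for illustration only.
example = '{"no_empty_cdr3_aa": true, "conserved_C23_present": false}'
print(failed_flags(example))  # ['conserved_C23_present']
print(offsets_are_consecutive([(1, 26), (27, 38), (39, 55)]))  # True
print(offsets_are_consecutive([(1, 26), (20, 38)]))  # False
```

Listing only the failed flags keeps large result sets easy to triage, since most rows are expected to pass every check.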

Examples

Sample usage of the software is presented at https://colab.research.google.com/drive/1xKO4udsX5gmnY88eDKWsQaUnHsLuFwVA?usp=sharing. To let users run RIOT with a custom database, we also provide a Google Colab notebook that shows how to build a custom germline database for RIOT. It is available at https://colab.research.google.com/drive/1VCStUKgZ1ggi2Xf5YV7hFWHxxzP29BjK?usp=sharing.

Development

RIOT uses a prefiltering module written in Rust, so installing from source requires a few extra steps.

# Install Poetry

curl -sSL https://install.python-poetry.org | python3 - --version 1.7.1

# Add `export PATH="$HOME/.local/bin:$PATH"` to your shell configuration file.

# Download and run the Rust installation script
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y

# Restart shell to reload PATH

# Verify the installation
poetry --version
rustc --version
cargo --version

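# Clone the repository
git clone https://github.com/NaturalAntibody/riot_na
cd riot_na

# Install Python dependencies, then build the Rust extension in release mode (-r)
poetry install
poetry run maturin develop -r
poetry install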

Citing this work

The code and data in this package are based on the following paper (to be released once it clears peer review). If you use it, please cite:

@misc{riot,
      title={RIOT - Rapid Immunoglobulin Overview Tool - rapid annotation of nucleotide and amino acid immunoglobulin sequences using an open germline database.},
      author={Paweł Dudzic and Bartosz Janusz and Tadeusz Satława and Dawid Chomicz and Tomasz Gawłowski and Rafał Grabowski and Przemysław Jóźwiak and Mateusz Tarkowski and Maciej Mycielski and Sonia Wróbel and Konrad Krawczyk*}
}
