ReverTra is a practical tool for mapping protein (amino-acid) sequences to species-optimized codon sequences. It uses models developed by Tomer Sidi, Tamir Tuller, and Rachel Kolody, built on a deep-learning transformer architecture and trained on mRNA sequences and alignments from 4 species: S. cerevisiae, S. pombe, E. coli, and B. subtilis. For details on the models, please refer to our paper [ref] and explore the project code here. The project code also includes working notebooks for model inference and data exploration.
The motivation for a species-optimized codon tool is the critical role that codon usage plays in the efficiency of protein expression. Different species prefer different codon usage patterns, which can significantly affect translational efficiency and other aspects of gene expression. Although ReverTra has not been tested in an experimental setting, we demonstrated that it predicts the codons of endogenous genes better than the common alternative.
ReverTra can optimize codon sequences for 4 host species: S. cerevisiae, S. pombe, E. coli, and B. subtilis. By providing a user-friendly tool for this purpose, we aim to help researchers and bioengineers streamline their protein expression efforts and run more accurate and effective studies across diverse biological contexts.
To generate an optimized codon sequence, the user must provide an out-of-host protein (amino-acid) sequence and specify the target host species and the desired expression level of the translated protein. In the model configuration section, the user can also choose which model generates the sequences: the window size of the sequences, and whether the model receives a single (amino-acid) sequence or a pair of sequences that includes a codon sequence from one of the hosts the model was trained on (i.e., mimicking).
* For a demo application of the model inference evaluation procedure from the paper, please visit ReverTra-Evaluation-DEMO.
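The inputs described above can be checked before they are submitted. The following is a minimal sketch in Python, not part of the ReverTra code base: the helper names `clean_protein_sequence` and `check_host` are hypothetical, and the assumption that only the 20 standard one-letter amino-acid codes are accepted is ours, not something the tool documents.

```python
# Hypothetical pre-submission checks; names and rules are illustrative only.

# Assumption: only the 20 standard one-letter amino-acid codes are accepted.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

# The four host species listed for ReverTra.
HOST_SPECIES = ("S. cerevisiae", "S. pombe", "E. coli", "B. subtilis")


def clean_protein_sequence(raw: str) -> str:
    """Remove whitespace, upper-case, and reject non-standard residue symbols."""
    sequence = "".join(raw.split()).upper()
    if not sequence:
        raise ValueError("protein sequence is empty")
    invalid = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"unsupported residue symbols: {', '.join(invalid)}")
    return sequence


def check_host(host: str) -> str:
    """Return the host name unchanged if it is one of the supported species."""
    if host not in HOST_SPECIES:
        raise ValueError(f"host must be one of {HOST_SPECIES}, got {host!r}")
    return host
```

For example, `clean_protein_sequence("mk tay\nIAK")` yields `"MKTAYIAK"`, while a sequence containing an ambiguity code such as `B` raises `ValueError`. The expression level is left out of the sketch because its allowed values are not listed here.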
(1) Inference type - Mask/Mimic; the two inference types are presented in the paper. In mask mode, the input to the model is the AA sequence of the target protein. In mimic mode, an additional codon sequence aligned to the target AA sequence is also provided to the model.
(2) Model window size - 10/30/50/75/100/150; in the paper, we present different models trained with different window sizes. Each option in this category activates a different model for prediction.
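The two configuration options can be captured in a small validated structure. The sketch below is an illustration under stated assumptions, not the tool's actual interface: `ModelConfig` and its field names are hypothetical, and the rule that mask mode takes no codon sequence is our reading of the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Option values as listed in the model configuration section.
INFERENCE_TYPES = ("mask", "mimic")
WINDOW_SIZES = (10, 30, 50, 75, 100, 150)


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical container for the two model configuration options."""

    inference_type: str
    window_size: int
    # Codon sequence aligned to the target AA sequence; used in mimic mode only.
    mimic_codons: Optional[str] = None

    def __post_init__(self) -> None:
        if self.inference_type not in INFERENCE_TYPES:
            raise ValueError(f"inference_type must be one of {INFERENCE_TYPES}")
        if self.window_size not in WINDOW_SIZES:
            raise ValueError(f"window_size must be one of {WINDOW_SIZES}")
        if self.inference_type == "mimic" and not self.mimic_codons:
            raise ValueError("mimic mode needs an aligned codon sequence")
        if self.inference_type == "mask" and self.mimic_codons is not None:
            raise ValueError("mask mode takes the amino-acid sequence only")
```

With this sketch, `ModelConfig("mask", 50)` is accepted, whereas `ModelConfig("mask", 20)` and `ModelConfig("mimic", 30)` both raise `ValueError`, the first because 20 is not an offered window size and the second because no codon sequence was supplied.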