Abstract:
The elucidation of three-dimensional protein structure plays a pivotal role in
comprehending biological phenomena. Structure directly governs protein function, and
knowledge of it hence aids drug discovery. The development of protein structure
prediction algorithms, notably AlphaFold2 and ESMFold, has the potential to shift the
paradigm of protein-based therapeutic
discovery. Turning an amino acid chain into 3D domains and docking them can help
unlock a protein’s full potential. In addition, the effects of mutations on the domain
structure can be studied in detail. Prediction scores from extensive studies
were examined in the hope of finding newer modalities for transforming protein
therapeutics. Most of these studies failed to find any utility in these algorithms, and a
few suggested, despite their dismal findings, that some utility may yet be found. The
inventors of the algorithms cautioned that the predicted structures and scores have no
utility beyond regurgitating structures from the known structure databases. A few
possible applications, as considered in this study, are to predict pre-translation variations,
mutations, and structural changes. A potential correlation between repeatedly
manufactured batches of a therapeutic protein and the structure prediction score, taken
as a measure of thermodynamic instability, is considered. A total of 204 unmodified
FDA-approved therapeutic proteins were examined for correlation between their
prediction scores and their available physicochemical and functional properties. Slight
residual differences between the commercial therapeutic proteins and the structures
reported in the PDB were found. The potential impact of mutations on the prediction
scores was also studied. No correlation was found
between the prediction score and any tested attribute. The algorithms exhibited lower
confidence in predicting structures for sequences with low identity scores when tested
against the UniProt and PDB databases. Other deployed algorithms (i.e., trRosetta)
were also concluded to be more relevant to domain manipulation. Reliable structure
prediction from these algorithms depends heavily on the model’s architecture and
training data. Ultimately, it was concluded that none of these algorithms has any
value beyond showing how well they can reproduce a known or partially known
structure. The comparison of AF2 and ESMF resulted in an R² of 0.69, vouching
for their orthogonality. However, the R² value for physicochemical attributes was as low
as 0.07. Because the prediction scores lack any significant correlation with
physicochemical and functional properties, they cannot vouch for the in vivo stability
and molecular functionality of a protein. Furthermore, when novel randomized and
mutated sequences are provided to these algorithms, they fail to predict structures with
acceptable accuracy.
This is largely due to the absence of similar folds in the training datasets (i.e.,
UniProt and PDB) of these algorithms. Although it might seem that these algorithms
go beyond regurgitating available data, this might not be the case. In this context, these
algorithms are considered no different from GPT-4, which also regurgitates available
data. These algorithms do not demonstrate that the Levinthal paradox has been solved;
it remains unsolved.
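
The R² values quoted above summarize how strongly two sets of paired scores track each other. The abstract does not state how R² was computed, so the following is only a minimal sketch, assuming it is the squared Pearson correlation of a simple linear fit between paired per-protein scores. The function name `r_squared` and the example score lists are hypothetical placeholders, not data or code from the study.

```python
from math import fsum


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length score lists.

    Assumption: R^2 is taken as the coefficient of determination of a
    simple linear regression of y on x, which equals Pearson r squared.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = fsum(x) / len(x)
    mean_y = fsum(y) / len(y)
    # Sums of cross-products and squared deviations about the means.
    sxy = fsum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = fsum((a - mean_x) ** 2 for a in x)
    syy = fsum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when one variable is constant")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical per-protein confidence scores (NOT data from the study).
af2_scores = [91.2, 85.4, 78.9, 66.3, 94.0]
esmf_scores = [88.7, 80.1, 70.2, 69.5, 90.3]
print(f"R^2 = {r_squared(af2_scores, esmf_scores):.2f}")
```

The same function could be applied to prediction scores paired with any single physicochemical attribute; a value near 0, such as the 0.07 reported above, would indicate that the score explains almost none of the variance in that attribute under a linear model.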