Jinjin Li's team at the Artificial Intelligence and Micro Structure Laboratory (AIMS-Lab), Department of Micro/Nano Electronics, School of Electronic Information and Electrical Engineering (SEIEE), Shanghai Jiao Tong University (SJTU), recently proposed a method that combines unsupervised and supervised machine learning to predict how single-point amino acid mutations affect the folding free energies of proteins. The paper, "Clustered tree regression to learn protein energy change with mutated amino acid" (CTR), was published in the internationally renowned journal Briefings in Bioinformatics (WOS SCI: Q1, impact factor: 13.994, ranked 1/57 in Mathematical & Computational Biology).
Proteins, the most versatile class of biological polymers, perform almost all biological functions by folding into a wide variety of three-dimensional (3D) structures, whose stability is determined by their folding free energies. Mutations drive biological evolution by introducing diversity into genomes and altering the structures and functions of the corresponding proteins. Besides conferring evolutionary advantages, mutations can also change protein structure and stability in ways that cause malfunction, disease and drug resistance. Many researchers therefore work on predicting the effects of such mutations at the molecular level, using either traditional experiments or machine learning methods. These predictions help explain and diagnose related diseases and supply high-precision algorithms for predicting the structures and properties of biological macromolecules (e.g., peptides and enzymes). However, traditional experiments are expensive and time-consuming, which severely limits the development of related methods, and most early machine learning methods relied on 3D protein structures, which made them impractical because 3D structural data are scarce and the computational costs are high.
Against this background, the AIMS team proposed CTR, an AI-powered method that extracts rich features from protein sequences. Unlike traditional bioinformatics approaches, CTR requires no structural information, so it bypasses the complexity of structure-based prediction. By combining unsupervised and supervised learning, it reduces the accuracy loss caused by discrepancies in feature distributions. As a result, CTR outperformed previous methods, achieving a root-mean-square error (RMSE) of 0.94 kcal/mol and a Pearson correlation coefficient (PCC) of 0.83 on the FireProt dataset. In general, CTR is 3-4 orders of magnitude faster than traditional methods, and it supports high-precision protein prediction and structure design based on next-generation artificial intelligence technologies.
Overview of CTR. A, Protein chains. B, Single-point amino acid mutation (e.g., from Glu to Val) on a chain. C, Flowchart of CTR: Step 1, extracting physicochemical properties, positional features and evolutionary features; Step 2, dividing features into two groups using K-means; Step 3, feeding each group of features into its own boosted tree regressor. D, Predicted change in protein folding free energy (ΔΔG) upon mutation.
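The cluster-then-regress flow in Steps 2 and 3 can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that K-means partitions the samples (one feature vector per mutation) into two groups, it uses scikit-learn's GradientBoostingRegressor as a stand-in for the boosted tree regressor, and the class name ClusteredTreeRegressor and the synthetic data are hypothetical. Feature extraction (Step 1) is not shown.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor


class ClusteredTreeRegressor:
    """Hypothetical sketch: K-means grouping, then one boosted-tree model per group."""

    def __init__(self, n_clusters=2, random_state=0):
        self.kmeans = KMeans(n_clusters=n_clusters, n_init=10,
                             random_state=random_state)
        self.random_state = random_state
        self.models = {}

    def fit(self, X, y):
        # Unsupervised step: assign every training sample to a cluster.
        labels = self.kmeans.fit_predict(X)
        # Supervised step: fit a separate regressor on each cluster's samples.
        for k in np.unique(labels):
            mask = labels == k
            model = GradientBoostingRegressor(random_state=self.random_state)
            self.models[int(k)] = model.fit(X[mask], y[mask])
        return self

    def predict(self, X):
        # Route each sample to the regressor of its nearest cluster centre.
        labels = self.kmeans.predict(X)
        y_pred = np.empty(len(X))
        for k, model in self.models.items():
            mask = labels == k
            if mask.any():
                y_pred[mask] = model.predict(X[mask])
        return y_pred


# Synthetic stand-in data (assumption): two well-separated groups whose
# targets depend on the features in opposite ways.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 1.0, (100, 5)),
               rng.normal(3.0, 1.0, (100, 5))])
y = np.where(X[:, 0] < 0, X[:, 1], -X[:, 1]) + rng.normal(0.0, 0.1, 200)

ctr = ClusteredTreeRegressor().fit(X, y)
pred = ctr.predict(X)
```

Training one regressor per cluster lets each model fit a single, more homogeneous feature distribution, which is the motivation the article gives for combining the two learning stages.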
CTR's unsupervised feature clustering is the key to its improvement in accuracy over previous methods. Combining unsupervised and supervised learning allows CTR to exploit the inherent distributional patterns in the data more effectively and to decouple the feature distributions. The two feature groups obtained by clustering both had similar, Gaussian-like distributions, yet they differed significantly in their distributions of wild-type and mutant amino acids. The model could therefore capture the characteristics of each feature group and make targeted predictions, substantially improving prediction accuracy and speed.
Results of the clustering process. A, Visualization of the feature groups after dimensionality reduction. B, Distributions of the two feature groups. C, Probability and cumulative probability of wild-type amino acids for each group of features. D, Probability and cumulative probability of mutant amino acids for each group.
On the FireProt dataset, the RMSE was only 0.94 kcal/mol and the PCC was as high as 0.83. The distribution of the predicted values was very close to that of the experimental values, with an inlier ratio of 90.88%. These results indicate that CTR's regression model outperformed mainstream regression models. The fast and accurate CTR method can facilitate large-scale studies of single-point amino acid mutations and enable the prediction of other protein properties for which structural information is scarce.
Prediction results of CTR with XGBoost and other regression methods. A, Scatter plot of experimental protein folding free energy change (ΔΔG) values against values predicted using XGBoost. CTR achieves a PCC of 0.83 and an RMSE of 0.94 kcal/mol. B, Distributions of experimental and predicted ΔΔG values for XGBoost. C, Residuals (calculated as the difference between experimental and predicted values) with the one-sigma boundary (calculated as the standard deviation of predicted values) for XGBoost. Inliers (90.88%), whose residuals fall inside the one-sigma boundary, are highlighted. D, Residuals for different methods. Note that two regressors are trained for each method. From left to right: Bayesian Ridge (BR), Kernel Ridge (KR), Support Vector Regression (SVR), Gradient Boosting (GB), Neural Network (NN) and XGBoost (XGB). XGBoost produces the fewest outliers.
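The three quantities reported above (RMSE, PCC and inlier ratio) can be computed as in the following sketch, which follows the caption's definitions: residuals are experimental minus predicted values, and the one-sigma boundary is the standard deviation of the predicted values. The use of the population standard deviation, the function names and the toy ΔΔG values are assumptions made for illustration; they are not data from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def evaluate(y_true, y_pred):
    """Return (RMSE, PCC, inlier ratio) for experimental vs. predicted values."""
    n = len(y_true)
    # Residual = experimental value minus predicted value.
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    pcc = pearson(y_true, y_pred)
    # One-sigma boundary: standard deviation of the predictions
    # (population form, an assumption).
    mean_pred = sum(y_pred) / n
    sigma = math.sqrt(sum((p - mean_pred) ** 2 for p in y_pred) / n)
    # Inliers are samples whose residual lies inside [-sigma, +sigma].
    inlier_ratio = sum(abs(r) <= sigma for r in residuals) / n
    return rmse, pcc, inlier_ratio


# Toy ΔΔG values in kcal/mol (made up for illustration only).
y_true = [1.0, -0.5, 2.0, 0.0, -1.5]
y_pred = [0.8, -0.2, 1.5, 0.1, -1.0]
rmse, pcc, inlier_ratio = evaluate(y_true, y_pred)
print(f"RMSE={rmse:.3f} kcal/mol, PCC={pcc:.3f}, inliers={inlier_ratio:.1%}")
```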
Shanghai Jiao Tong University (SJTU) is the sole corresponding institution for this research. Prof. Jinjin Li of the Artificial Intelligence and Micro Structure Laboratory (AIMS-Lab), Department of Micro/Nano Electronics, School of Electronic Information and Electrical Engineering (SEIEE), SJTU, is the corresponding author. The first author is Hongwei Tu, an undergraduate student in the first cohort (2019) of the artificial intelligence major at SEIEE, SJTU.
The artificial intelligence major at Shanghai Jiao Tong University is oriented toward the needs of the National Innovation-Driven Development Strategy and the New Generation of Artificial Intelligence Development Plan. It actively explores ways to cultivate top talent in basic research and high-end interdisciplinary talent skilled in AI-based methods. This research is an important application of artificial intelligence in the life sciences. As green biomedicine and carbon neutrality are expected to have a significant impact on the global economy over the next 20 years, Shanghai Jiao Tong University's AI talent training will also seek to provide strong talent support for the development of national interdisciplinary applications. This work was funded by the National Key R&D Program of China, the National Natural Science Foundation of China, the Shanghai Science and Technology Project, and the SJTU Global Strategic Partnership Fund.
Link to the paper:
https://academic.oup.com/bib/advance-article/doi/10.1093/bib/bbac374/6702668