Abstract:
AlphaFold 2 revolutionized the accuracy of protein structure prediction. However, its performance depends heavily on the quality and quantity of natural homologs that can be detected in protein sequence libraries for a given sequence, and neither can be controlled. The lack of high-quality homologs is one key reason why only 36% of human proteome residues were predicted with high confidence. This problem was not solved by the arrival of AlphaFold 3 or other more recent techniques. The situation for RNAs is even worse: because RNAs are poorly conserved in sequence space, there is currently no reliable way to search for homologs of an RNA whose secondary structure is unknown. Here we will show how lab-generated homologous sequences and language models can help overcome this problem. By integrating AI with high-throughput sequencing, this advance paves the way for rapid, cost-effective structure prediction for proteins and RNAs.