Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-Ligand Structure Prediction Models

Lihang Liu¹, Shanzhuo Zhang¹*, Donglong He¹, Xianbin Ye¹, Jingbo Zhou¹, Xiaonan Zhang¹, Yaoyao Jiang³, Weiming Diao⁴, Hang Yin³,⁴, Hua Chai², Fan Wang¹, Jingzhou He¹, Liang Zheng², Yonghui Li²†, Xiaomin Fang¹

¹PaddleHelix team, Baidu Inc.
²National Supercomputing Center in Chengdu
³School of Pharmaceutical Sciences, Key Laboratory of Bioorganic Phosphorous Chemistry and Chemical Biology (Ministry of Education), Tsinghua-Peking Center for Life Sciences, Tsinghua University, Beijing, China
⁴Toll Biotech Co. Ltd (Beijing), Beijing, China

*Equal contributions. †Corresponding authors. Email: fangxiaomin01@baidu.com and liyh@cdcszx.cn

Abstract

Protein-ligand structure prediction is an essential task in drug discovery, predicting the binding interactions between small molecules (ligands) and target proteins (receptors). Although conventional physics-based docking tools are widely utilized, their accuracy is compromised by limited conformational sampling and imprecise scoring functions. Recent advances have incorporated deep learning techniques to improve the accuracy of protein-ligand structure prediction. Nevertheless, experimentally validated docking conformations remain costly to obtain, which raises concerns about the generalizability of these deep learning-based methods given the limited training data. In this work, we show that by pre-training on large-scale docking conformations generated by traditional physics-based docking tools and then fine-tuning with a limited set of experimentally validated receptor-ligand complexes, we can obtain a protein-ligand structure prediction model with outstanding performance. Specifically, this process involved the generation of 100 million docking conformations for protein-ligand pairings, an endeavor consuming roughly 1 million CPU core days. The proposed model, HelixDock, aims to acquire the physical knowledge encapsulated in the physics-based docking tools during the pre-training phase. HelixDock has been rigorously benchmarked against both physics-based and deep learning-based baselines, demonstrating its exceptional precision and robust transferability in predicting binding conformations. In addition, our investigation reveals the scaling laws governing pre-trained protein-ligand structure prediction models, indicating consistent improvements in performance with increases in model parameters and the volume of pre-training data. Moreover, we applied HelixDock to several drug discovery-related tasks to validate its practical utility. HelixDock demonstrates outstanding capabilities on both cross-docking and structure-based virtual screening benchmarks. This study illuminates the strategic advantage of leveraging a vast and varied repository of data generated by physics-based tools to advance the frontiers of AI-driven drug discovery.

Keywords: Protein-ligand structure prediction · Large-scale docking dataset · Scaling laws · Drug discovery

1 Introduction

Protein-ligand structure prediction, a computational technique central to drug discovery, predicts the binding poses between small molecules (ligands) and target proteins (receptors). Because of its pivotal role in helping scientists identify potential drug candidates efficiently, the demand for improved accuracy in complex structure prediction has attracted substantial research attention in recent years.
The protein-ligand structure prediction field has proceeded along two principal paths: physics-based methods and deep learning-based models. The conventionally used physics-based docking tools, such as LeDock [1], AutoDock [2], AutoDock Vina [3], Smina [4], and Glide [5, 6], are built on physics-based force fields and take into consideration factors such as shape complementarity, electrostatics, hydrogen bonding, and van der Waals forces to produce candidate binding poses for a given receptor and ligand. These tools apply various sampling techniques to explore the conformational space and evaluate the sampled poses with scoring functions. Despite this sound theoretical basis, the approach still faces serious challenges due to limited conformational sampling and imprecise scoring. Recently, deep learning-based methods such as EquiBind [7], TankBind [8], DiffDock [9], and Uni-Mol [10] have emerged as alternatives, leveraging known complex structures to train models that can potentially surpass traditional physics-based tools in prediction accuracy. However, because experimentally determining receptor-ligand complex structures is expensive, very limited data are available for training. As noted in [11], the performance of these models is susceptible to being over-optimistic, raising concerns about their generalization capabilities, particularly when faced with novel complexes (i.e., complexes dissimilar to those in the training set).

Here we reasoned that an improved protein-ligand structure prediction model could be developed by pre-training a deep learning-based structure prediction model on large-scale data generated by physics-based docking tools, which could combine the complementary advantages of the two approaches. Pre-training a model on a large and diverse dataset has demonstrated efficacy in improving accuracy in various applications, particularly natural language processing and computer vision [12, 13, 14]. Recent endeavors in the life sciences have sought to understand small molecules and proteins by exploiting large-scale data. Specifically, some studies [15, 16, 17, 10, 18, 19, 20] learn molecular representations from large-scale molecular databases that can be used for a variety of property prediction tasks, such as toxicity and binding affinity. Other studies [21, 22, 23, 24, 25, 26] learn protein representations from huge numbers of protein sequences, which are then utilized for attribute annotation and structure prediction. In particular, Uni-Mol [10] trained a molecular model and a protein pocket model independently for protein-ligand structure prediction. However, the interaction dynamics between protein receptors and ligand molecules are often overlooked due to the limited number of known receptor-ligand binding conformations. The PDBbind database, despite being a widely acknowledged and comprehensive receptor-ligand complex repository, contains only around 20,000 experimentally validated structures, highlighting the data scarcity challenge. Though several molecular docking datasets with large numbers of molecules, e.g., the Cross-docking benchmark [27] and a SARS-CoV-2 dataset [28], can supplement the data, these datasets predominantly focus on a small number of frequently studied protein targets, leading to potential biases in model training. In contrast, pre-training on an extensive and diverse dataset generated by physics-based docking tools could endow a deep learning model with the physical knowledge already encoded in those tools, thereby enhancing both precision and generalizability.

Our proposed solution, HelixDock, pre-trains an SE(3)-equivariant network on an extensive collection of docking conformations generated by conventional physics-based tools and fine-tunes it with experimentally verified receptor-ligand complexes. HelixDock focuses on site-specific protein-ligand structure prediction: small molecules are docked into pre-identified binding pockets, usually identified through experimental techniques or expert knowledge about the protein target. We generated 100 million binding poses using traditional physics-based molecular docking tools, a task consuming roughly 1 million CPU core days. The generated dataset contains hundreds of thousands of protein targets with known structures and millions of drug-like small molecules (Figure 1b). The generated poses are distributed across an extensive conformational space, as illustrated in Figure 1a. These binding poses were then utilized for pre-training our SE(3)-equivariant network (Figure 1d), which serves as the foundation for learning general knowledge for complex structure prediction. Then, a small number of precise receptor-ligand complex structures, which occupy a narrow, limited conformational space (Figure 1a), were leveraged to fine-tune the model (Figure 1c). While the quality of the binding poses generated by traditional physics-based docking tools may not match that of experimentally determined complex structures, we believe that our model can still distill the physical interaction knowledge between receptors and ligands from a large-scale dataset of rough binding poses.

Our comparative analyses benchmark HelixDock against various strong baseline methods, including physics-based docking tools and deep learning-based models, showing outstanding performance. HelixDock exhibits a comprehensive performance advantage over the baseline methods on multiple test sets. Specifically, HelixDock gives highly accurate predictions on two popular benchmarks, the PDBbind core set and the PoseBusters benchmark. Furthermore, HelixDock's superior performance on protein targets with low sequence identity to the training set underscores its robust transferability in protein-ligand structure prediction.

Notably, we train multiple versions of the structure prediction model, with varying numbers of parameters and varying amounts of pre-training data, aiming to uncover the scaling laws governing the performance of deep learning-based structure prediction models. Investigating the scaling laws [29, 30, 31] relating the size of the training dataset and the number of model parameters to performance provides a rough estimate of the capabilities of deep protein-ligand structure prediction models. We show experimentally that pre-trained models consistently improve as model parameters increase, while non-pre-trained models struggle to achieve significant performance breakthroughs with parameter increments (Figure 1e).
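Scaling-law studies of this kind [29, 30, 31] typically summarize such trends with a power-law fit. Purely as an illustrative ansatz (an assumed functional form for exposition, not a fit reported in this paper; the paper's empirical trends are shown in Figure 1e), the relationship is often written as:

```latex
% Illustrative power-law ansatz common in scaling-law studies
% (an assumption for exposition, not this paper's reported fit):
% prediction error \varepsilon versus model size N and data size D.
\varepsilon(N) \approx \alpha\, N^{-\beta}, \qquad
\varepsilon(D) \approx \gamma\, D^{-\delta},
\qquad \alpha, \beta, \gamma, \delta > 0 .
```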
We sought to validate the practical applicability of HelixDock by applying it to two key tasks in drug discovery: cross-docking and structure-based virtual screening. In cross-docking, HelixDock maintains a high success rate across two datasets, outperforming the other baselines. Moreover, in structure-based virtual screening, HelixDock demonstrates a remarkably high enrichment factor (EF) on the benchmark DUD-E dataset [32], which encompasses 102 targets. Our study showcases the potential of using large-scale and diverse receptor-ligand docking poses for pre-training deep learning-based docking models, and it can shed light on future investigations that use vast and varied data generated by physics-based tools to advance AI-driven drug discovery.

Figure 1: Overall framework of HelixDock: a deep learning-based protein-ligand structure prediction model enhanced by massive and diverse binding poses.

2 Methodology

2.1 Generation of Large-scale Docking Complexes

| Dataset | #Protein | #Protein Family | #Complex | Structure Source |
|---|---|---|---|---|
| PDBbind version 2020 | 19k | 661 | 19k | Experimentally determined |
| HelixDock-Database | 189k | 2,589 | 100m | Generated by AutoDock Vina |

Table 1: Statistics of the HelixDock-Database and PDBbind.

Figure 2: Comparison of the HelixDock-Database and PDBbind. (a) Protein family comparison of the HelixDock-Database and PDBbind. (b) UMAP of Morgan fingerprints of ligands from the HelixDock-Database and PDBbind.

HelixDock utilizes a vast and varied array of docking complexes generated with the physics-based docking tool AutoDock Vina [2] to enhance the accuracy of protein-ligand structure prediction. As shown in Figure 1, our methodology first generates poses for approximately 100 million receptor-ligand pairs on a supercomputing platform. These systematically generated docking complexes serve as the pre-training dataset for HelixDock, allowing the model to learn the underlying physical principles of molecular interactions. We refer to this dataset as the HelixDock-Database. Its construction consists of three phases.
In the first phase, we establish the protein target set by consolidating all proteins with known 3D structures from the Protein Data Bank (PDB) [33] available up to September 30, 2022, encompassing approximately 189 thousand proteins. These proteins span a diverse array of protein families, 2,589 distinct families in total. In comparison, the widely acknowledged receptor-ligand complex dataset PDBbind [34] incorporates only 19 thousand proteins covering only 661 protein families, as shown in Table 1. The distributions of the top-40 protein families within the HelixDock-Database are illustrated in Figure 2a and contrasted with the protein family distributions within PDBbind. Subsequently, fpocket [35], an advanced pocket mining tool, is employed to identify potential binding pockets within each protein to streamline the subsequent docking procedures. The two most probable pockets from each protein are selected, yielding approximately 378 thousand candidate target pockets.

In the second phase, we curate the ligand set, acquiring 265 million drug-like small molecules from Enamine (enamine.net), a compound library renowned for its prevalent use in virtual screening. These ligands are selected to ensure extensive coverage of drug-like chemical space. Figure 2b contrasts the chemical space of 1 million ligands randomly selected from the HelixDock-Database with that of the entire set of 19,000 ligands from PDBbind. This comparative visualization employs the UMAP algorithm [36] to project the Morgan fingerprints of these ligands into two dimensions. As shown in Figure 2b, the ligands within the HelixDock-Database cover the chemical space more expansively than those within PDBbind.
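As a concrete illustration of this visualization step, the following minimal Python sketch (an illustrative reconstruction, not the authors' code; it assumes RDKit and umap-learn are installed, and the SMILES lists are placeholders for the real ligand sets) computes Morgan fingerprints and projects them into two dimensions with UMAP:

```python
# Minimal sketch: project ligand Morgan fingerprints to 2D with UMAP.
# Assumes RDKit and umap-learn are installed; SMILES are placeholders.
import numpy as np
import umap
from rdkit import Chem
from rdkit.Chem import AllChem

def morgan_fingerprints(smiles_list, radius=2, n_bits=2048):
    """Return an (N, n_bits) 0/1 matrix of Morgan fingerprints."""
    fps = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparsable molecules
        bv = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        fps.append(np.array(bv, dtype=np.uint8))
    return np.stack(fps)

helixdock_smiles = ["CCO", "c1ccccc1O"]        # placeholder ligands
pdbbind_smiles = ["CC(=O)Oc1ccccc1C(=O)O"]     # placeholder ligands

X = morgan_fingerprints(helixdock_smiles + pdbbind_smiles)
# Jaccard is a natural distance for binary fingerprints (our choice here).
emb = umap.UMAP(n_components=2, metric="jaccard").fit_transform(X)
print(emb.shape)  # (N, 2) coordinates for plotting the two chemical spaces
```

In practice this would run over the million-scale ligand samples described above rather than the toy lists shown here.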
In the final phase, we execute a comprehensive large-scale docking campaign, pairing targets from the pre-established pocket set with ligands from the curated ligand set through random selection, as illustrated in Figure 1b. That is, one target pocket and one compound are randomly sampled to constitute a target-ligand pair, yielding approximately 100 million such pairs. AutoDock Vina [2], a widely recognized docking package, is employed to generate the docking poses. To maintain a balanced representation of protein targets, we cluster proteins by sequence similarity with the MMSeqs2 tool [37], and the likelihood of each protein being sampled is set inversely proportional to the size of its protein cluster, ensuring that a diverse range of protein targets is represented. The docking of all 100 million pairs takes approximately 1 million CPU core days to complete.
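The cluster-balanced sampling described above can be sketched in a few lines of Python (an illustrative reconstruction under stated assumptions: cluster assignments come from MMSeqs2 output, and all identifiers below are hypothetical placeholders):

```python
# Sketch: sample proteins with probability inversely proportional to the
# size of their sequence-similarity cluster (cluster ids assumed to come
# from MMSeqs2; all names here are hypothetical placeholders).
import random
from collections import Counter

def inverse_cluster_weights(protein_to_cluster):
    """Weight each protein by 1 / (size of its cluster)."""
    cluster_sizes = Counter(protein_to_cluster.values())
    return {p: 1.0 / cluster_sizes[c] for p, c in protein_to_cluster.items()}

protein_to_cluster = {"prot_a": 0, "prot_b": 0, "prot_c": 1}  # toy example
weights = inverse_cluster_weights(protein_to_cluster)
proteins = list(weights)
ligands = ["lig_1", "lig_2", "lig_3"]  # placeholder ligand ids

# Draw one target-ligand pair: a cluster-balanced protein plus a random ligand.
target = random.choices(proteins, weights=[weights[p] for p in proteins], k=1)[0]
ligand = random.choice(ligands)
print(target, ligand)
```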
The coarse-grained conformations derived through this methodology serve as the foundation for pre-training the deep learning-based model. Their precision does not match that of experimentally derived receptor-ligand complex conformations. Nonetheless, we postulate that a molecular docking tool can convey the fundamental physical principles that govern docking interactions, since physics-based docking tools traverse the conformational space and scrutinize conformations with sophisticated physics-based scoring functions.

2.2 Training Paradigm and Network Architecture of HelixDock

HelixDock's training process consists of two main phases: a pre-training phase and a fine-tuning phase. In the pre-training phase, the model is trained on the HelixDock-Database, whose large-scale data helps the model learn general features and patterns of protein-ligand structures, providing a broad understanding and a robust foundation for structure prediction. In the fine-tuning phase, the model is further refined on the general set of PDBbind version 2020 [34], which provides high-quality, experimentally derived co-crystal conformations. To ensure unbiased evaluation, the core set is excluded from fine-tuning. This two-step process allows HelixDock to combine the strengths of large-scale pre-training with targeted fine-tuning, resulting in a highly accurate and reliable protein-ligand structure prediction model.

HelixDock improves protein-ligand conformation prediction accuracy by leveraging geometry-aware neural network architectures and adopting a molecular diffusion process. Following the recent success of diffusion models in learning continuous distributions [38], we generate training inputs by adding Gaussian noise to the 3D coordinates of all heavy atoms in the ligand; HelixDock is then trained to reverse the noising process by predicting the original docking/crystal conformations, rather than predicting the injected noise as in the original formulation. More concretely, at each time step of the diffusion process, HelixDock directly predicts the 3D coordinates of all heavy atoms in the ligand from the protein pocket (with known structure) and the ligand (with noised structure). The architecture of HelixDock consists of two main modules: InteractionLearner, which uses a hybrid of graph neural networks (GNNs) and the transformer-style GeoFormer network to model interactions between protein pockets and ligands; and StructurePredictor, which adopts an end-to-end SE(3)-equivariant network to predict and iteratively refine the 3D coordinates of the ligand atoms.

When using HelixDock to predict protein-ligand conformations, we randomly sample multiple noise inputs to generate a diverse array of conformations, broadly exploring potential binding poses. Similar to traditional docking tools and DiffDock, we use a scoring function to evaluate and rank these conformations. The scoring function assesses predicted binding affinity and physical plausibility, allowing us to identify the most accurate and biologically relevant conformations.
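To make the train-then-rank procedure concrete, here is a minimal sketch of the coordinate-prediction objective and the sample-then-rank inference described above (an illustrative reconstruction under stated assumptions, not the authors' implementation: `model`, the linear noise schedule, the 10-step refinement loop, and `score_fn` are all hypothetical placeholders):

```python
# Sketch of the "predict the original coordinates" denoising objective
# and multi-sample inference described above. `model` is any network
# mapping (pocket, noised ligand coords, t) -> predicted ligand coords.
import torch

def diffusion_training_step(model, pocket, ligand_xyz, max_sigma=1.0):
    """One training step: noise the ligand coords, predict the originals."""
    t = torch.rand(())                       # random diffusion time in [0, 1)
    sigma = max_sigma * t                    # simple linear schedule (assumed)
    noised_xyz = ligand_xyz + sigma * torch.randn_like(ligand_xyz)
    pred_xyz = model(pocket, noised_xyz, t)  # predict clean coordinates
    return ((pred_xyz - ligand_xyz) ** 2).mean()  # coordinate regression loss

def predict_and_rank(model, pocket, ligand_xyz_init, score_fn, n_samples=8):
    """Inference: sample several noise seeds, keep the best-scoring pose."""
    poses = []
    for _ in range(n_samples):
        x = ligand_xyz_init + torch.randn_like(ligand_xyz_init)
        for t in torch.linspace(1.0, 0.0, steps=10):  # iterative refinement
            x = model(pocket, x, t)
        poses.append(x)
    return max(poses, key=score_fn)  # rank candidates with the scoring function
```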
3 Pre-training on Large-scale Docking Complexes Enhances Complex Structure Prediction

Figure 3: Overall evaluation of HelixDock and the baseline methods for complex structure prediction. (a) Percentage of predictions with RMSD ≤ 2 Å (success rate) on the PDBbind core set. (b) Percentage with RMSD ≤ 1 Å on the PDBbind core set. (c) Percentage with RMSD ≤ 2 Å on the PoseBusters benchmark. (d) Success rate on the PoseBusters benchmark with cases stratified by sequence identity to the PDBbind 2020 general set. (e) Two examples from PoseBusters: PDB ID 7AA0 with a high sequence identity of 1.0 and PDB ID 6XM9 with a low sequence identity of 0.24. (f) PoseBusters quality checks on predictions of DiffDock, AlphaFold-latest, and HelixDock. (g) Performance comparison across various protein families.

3.1 Superior Performance of HelixDock Compared to Baseline Methods

We began our evaluation by examining the overall performance of HelixDock in predicting complex structures, evaluated on the PDBbind core set [39] and the PoseBusters benchmark [40]. We compared HelixDock against two categories of protein-ligand structure prediction methods: physics-based docking tools, including LeDock [41], Smina [4], AutoDock Vina [2], and rDock [42], and deep learning-based models, including Uni-Mol [43] and DiffDock [9]. In accordance with prior research [44, 7], we employ the root mean square deviation (RMSD) between the predicted ligand poses and the corresponding native binding poses as our evaluation metric, where only heavy atoms are taken into consideration.
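For reference, the metric can be sketched as follows (a minimal illustration assuming the predicted and native poses share the same frame and heavy-atom ordering; symmetry-aware RMSD, as computed by common evaluation tools, is not handled here):

```python
# Sketch: heavy-atom RMSD and the RMSD <= 2 Angstrom success rate.
# Assumes predicted and native coordinates share the same frame and
# atom order; symmetry-equivalent atoms are not handled.
import numpy as np

def rmsd(pred_xyz: np.ndarray, native_xyz: np.ndarray) -> float:
    """Root mean square deviation over heavy-atom coordinates, in Angstroms."""
    diff = pred_xyz - native_xyz                       # (n_atoms, 3)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

def success_rate(pairs, threshold=2.0) -> float:
    """Fraction of (pred, native) pose pairs with RMSD below the threshold."""
    hits = [rmsd(p, n) <= threshold for p, n in pairs]
    return sum(hits) / len(hits)
```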
HelixDock’s performance is particularly distinguished at the percentage of RMSD </span><math alttext="\leq 1" class="ltx_Math" display="inline" id="S3.SS1.p2.2.m2.1"><semantics id="S3.SS1.p2.2.m2.1a"><mrow id="S3.SS1.p2.2.m2.1.1" xref="S3.SS1.p2.2.m2.1.1.cmml"><mi id="S3.SS1.p2.2.m2.1.1.2" xref="S3.SS1.p2.2.m2.1.1.2.cmml"></mi><mo id="S3.SS1.p2.2.m2.1.1.1" mathcolor="#000000" xref="S3.SS1.p2.2.m2.1.1.1.cmml">≤</mo><mn id="S3.SS1.p2.2.m2.1.1.3" mathcolor="#000000" xref="S3.SS1.p2.2.m2.1.1.3.cmml">1</mn></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.p2.2.m2.1b"><apply id="S3.SS1.p2.2.m2.1.1.cmml" xref="S3.SS1.p2.2.m2.1.1"><leq id="S3.SS1.p2.2.m2.1.1.1.cmml" xref="S3.SS1.p2.2.m2.1.1.1"></leq><csymbol cd="latexml" id="S3.SS1.p2.2.m2.1.1.2.cmml" xref="S3.SS1.p2.2.m2.1.1.2">absent</csymbol><cn id="S3.SS1.p2.2.m2.1.1.3.cmml" type="integer" xref="S3.SS1.p2.2.m2.1.1.3">1</cn></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.p2.2.m2.1c">\leq 1</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.p2.2.m2.1d">≤ 1</annotation></semantics></math><span class="ltx_text" id="S3.SS1.p2.2.4" style="color:#000000;">Å, surpassing the top baseline method by over 20%. This significant improvement highlights HelixDock’s exceptional capability in predicting conformational details with high precision. </span><span class="ltx_text" id="S3.SS1.p2.2.5"></span></p> </div> </section> <section class="ltx_subsection" id="S3.SS2"> <h3 class="ltx_title ltx_title_subsection" style="color:#000000;"> <span class="ltx_tag ltx_tag_subsection">3.2 </span>HelixDock demonstrates Its Transferability on Novel Complexes</h3> <div class="ltx_para ltx_noindent" id="S3.SS2.p1"> <p class="ltx_p" id="S3.SS2.p1.1"><span class="ltx_text" id="S3.SS2.p1.1.1" style="color:#000000;">Even though the deep-learning-based baselines exhibit an obvious advantage over physics-based methods on the PDBbind core set, previous research </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" id="S3.SS2.p1.1.2.1" style="color:#000000;">[</span><a class="ltx_ref" href="https://arxiv.org/html/2310.13913v4#bib.bib11" title="">11</a><span class="ltx_text" id="S3.SS2.p1.1.3.2" style="color:#000000;">]</span></cite><span class="ltx_text" id="S3.SS2.p1.1.4" style="color:#000000;"> has indicated that deep-learning models trained on the general set are susceptible to overestimating their performance on the core set given the observed similarities between the samples in the PDBbind core set and PDBbind general set. The transferability of deep learning methods to novel complexes, particularly novel protein targets, is a key concern.</span></p> </div> <div class="ltx_para ltx_noindent" id="S3.SS2.p2"> <p class="ltx_p" id="S3.SS2.p2.1"><span class="ltx_text" id="S3.SS2.p2.1.1" style="color:#000000;">To this end, we further evaluated our model using the recently developed PoseBusters Benchmark set, a high-quality collection of protein-ligand complexes </span><cite class="ltx_cite ltx_citemacro_cite"><span class="ltx_text" id="S3.SS2.p2.1.2.1" style="color:#000000;">[</span><a class="ltx_ref" href="https://arxiv.org/html/2310.13913v4#bib.bib40" title="">40</a><span class="ltx_text" id="S3.SS2.p2.1.3.2" style="color:#000000;">]</span></cite><span class="ltx_text" id="S3.SS2.p2.1.4" style="color:#000000;">. This set comprises 428 complexes introduced since 2021, with most exhibiting low similarity to samples in our training dataset. 
HelixDock achieves a high success rate of 85.6% in structural prediction on this set, as depicted in Figure 3c. This surpasses the baseline methods, whose results are taken from the PoseBusters records and from AlphaFold-latest [45]. Notably, while other deep-learning-based methods generally fall short of physics-based approaches on this benchmark, HelixDock and AlphaFold-latest stand out as exceptions, consistently delivering superior results.

To further assess the transferability of HelixDock to novel protein targets, we stratified the PoseBusters Benchmark set according to each target receptor's maximum sequence identity with the proteins in the PDBbind 2020 General Set [34], following the methodology described in [40]. The test cases are categorized into three groups by maximum percentage sequence identity: low [0%, 30%], medium (30%, 90%], and high (90%, 100%]. As shown in Figure 3d, physics-based methods exhibit consistent performance across all three groups, whereas previous deep-learning-based methods decline on proteins of lower sequence identity; DiffDock's success rate, for instance, drops dramatically from 51% in the high-identity group to just 16% in the low-identity group. In contrast, HelixDock maintains consistent performance across these groups, exemplifying its robust generalization to new protein targets. As depicted in Figure 3e, HelixDock delivers accurate docking across PoseBusters complexes irrespective of their sequence similarity to the PDBbind 2020 general set, whether at 1.0 or 0.24.
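As an illustration, the stratified analysis behind Figure 3d amounts to binning test cases by maximum sequence identity and computing per-bin success rates; a sketch with made-up toy values follows.

```python
import pandas as pd

# Toy stand-ins for PoseBusters complexes: maximum sequence identity of each
# receptor to the PDBbind 2020 General Set, plus the predicted-pose RMSD.
df = pd.DataFrame({
    "max_seq_identity": [0.24, 0.12, 0.55, 0.88, 0.95, 1.00],
    "rmsd":             [1.1,  2.7,  0.9,  1.6,  0.8,  1.3],
})

# The three groups used in the text: low [0%, 30%], medium (30%, 90%],
# and high (90%, 100%].
df["group"] = pd.cut(df["max_seq_identity"],
                     bins=[0.0, 0.30, 0.90, 1.00],
                     labels=["low", "medium", "high"],
                     include_lowest=True)

# Success rate (RMSD <= 2 Å) per identity group, as plotted in Figure 3d.
print((df["rmsd"] <= 2.0).groupby(df["group"]).mean())
```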
3.3 HelixDock Produces Physically Plausible Molecular Structures

Unlike physics-based methods, deep-learning-based conformation prediction methods are not guaranteed to produce physically plausible molecular structures, which is one of the main reasons these methods are frequently criticized. To evaluate the chemical and geometric consistency of the predicted ligand conformations, the PoseBusters test suite [40] implements a series of standard quality checks that assess the stereochemistry of the ligands as well as the physical plausibility of intra- and intermolecular measurements. Ligand conformations that pass all of these tests are deemed 'PB-valid'.

The comparison of HelixDock with DiffDock and AlphaFold-latest on these quality checks is shown in Figure 3f. Overall, HelixDock achieves a PB-valid rate of 85.2%; it surpasses or matches DiffDock in 15 of the 18 quality checks and outperforms or equals AlphaFold-latest in 12 of them. While HelixDock does not guarantee that every generated conformation is physically plausible, its overall ability to produce valid structures is commendable.
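For reference, the same checks are available as a Python package; a usage sketch is shown below, assuming the posebusters package's documented PoseBusters interface and hypothetical file paths.

```python
from posebusters import PoseBusters

# The "redock" configuration runs the full suite of stereochemistry,
# intramolecular, and intermolecular checks against the crystal ligand
# and the receptor.
buster = PoseBusters(config="redock")
results = buster.bust(
    mol_pred="predicted_ligand.sdf",  # hypothetical paths
    mol_true="crystal_ligand.sdf",
    mol_cond="receptor.pdb",
)

# A pose is "PB-valid" when every check in its row passes.
print(results.all(axis=1))
```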
3.4 HelixDock Maintains Highly Accurate Predictions Across Various Protein Families

We extracted complexes released between January 1, 2020, and December 31, 2022, from the RCSB PDB (https://rcsb.org) and categorized the protein targets by protein family. Figure 3g displays the results for the top 15 protein families. Notably, HelixDock records the lowest RMSD values across all considered protein families, and achieves a median RMSD below 2 Å for 10 of them. These outcomes validate the robustness of HelixDock across a wide spectrum of protein families and reaffirm its promise as a reliable tool for protein-ligand structure prediction.

4 Scaling Law for Protein-Ligand Structure Prediction

Figure 4: Scaling laws for protein-ligand structure prediction. (a) Scaling laws of model size in the pre-training stage. (b) Scaling laws of model size in the fine-tuning stage, with models evaluated on the PDBbind core set and the PoseBusters benchmark. (c) Scaling laws of pre-training data size in the pre-training stage. (d) Scaling laws of pre-training data size in the fine-tuning stage, with models evaluated on the PDBbind core set and the PoseBusters benchmark.
Recent advances in natural language processing and computer vision have highlighted the phenomenon of empirical scaling laws, whereby enlarging both the data size and the model size can substantially boost model performance across a variety of applications [29, 30, 31, 46, 47]. In this section, we explore the effects of the model size N, characterized by the number of model parameters, and the pre-training dataset size D in the domain of protein-ligand structure prediction. Our objective is to experimentally investigate the empirical scaling laws in this specific application; we hope the analysis offers valuable insights for our future work and that of the broader research community.

We investigate the empirical scaling laws at both the pre-training and the fine-tuning phase of our model. For the pre-training phase, we report the average RMSD of the pre-trained models on a held-out pre-training test set, a randomly chosen subset of the HelixDock-Database comprising 36,000 samples that cover 180 distinct proteins. For the fine-tuning phase, we report the average RMSD of models fine-tuned on PDBbind and evaluated on the PDBbind core set and the PoseBusters benchmark. Due to computational resource constraints, every pre-trained model variant is pre-trained for a fixed budget of 120,000 steps. In the fine-tuning phase, we perform three independent runs for each model to reduce variance and report the average results.
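The text specifies only that the held-out set is a random subset covering 180 distinct proteins; one plausible construction, sketched below under the assumption that the split is made at the protein level so that held-out proteins are unseen during pre-training, is the following (identifiers are hypothetical).

```python
import random

def protein_level_split(samples, n_heldout_proteins=180, seed=0):
    """Hold out every sample belonging to a random subset of proteins.

    samples: iterable of (protein_id, sample) pairs. All samples of a
    held-out protein land on the test side, so no held-out protein is
    seen during pre-training.
    """
    proteins = sorted({pid for pid, _ in samples})
    rng = random.Random(seed)
    heldout = set(rng.sample(proteins, n_heldout_proteins))
    train = [s for pid, s in samples if pid not in heldout]
    test = [s for pid, s in samples if pid in heldout]
    return train, test
```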
From our experimental results, it is evident that both the model size and the pre-training dataset size play a crucial role in enhancing the predictive accuracy of the complex structure prediction task. The scaling laws previously validated in other fields remain effective in protein-ligand structure prediction.

4.1 Relations between the Performance and the Model Sizes

We begin by studying the relation between the model size (ranging from 3×10^4 to 10^8 parameters) and the RMSD values on these test sets during the pre-training and the fine-tuning stage.
Notably, when increasing the model size N, we keep the pre-training data size fixed at D = 10^8 for all variants. We also assess the models without pre-training to analyze the impact of pre-training on complex structure prediction. In line with prior research [46], we express RMSD as a function of model size using the power-law relationship described in Equation (1):

$$\text{RMSD}(N) \approx (N/N_c)^{\alpha_N}. \qquad (1)$$

Here, N_c and α_N are constant parameters of the power law, fitted to the data points; the exponent α_N indicates, to some extent, the degree of performance improvement with scale.
<math alttext="\alpha_{N}" class="ltx_Math" display="inline" id="S4.SS1.p1.7.m3.1"><semantics id="S4.SS1.p1.7.m3.1a"><msub id="S4.SS1.p1.7.m3.1.1" xref="S4.SS1.p1.7.m3.1.1.cmml"><mi id="S4.SS1.p1.7.m3.1.1.2" xref="S4.SS1.p1.7.m3.1.1.2.cmml">α</mi><mi id="S4.SS1.p1.7.m3.1.1.3" xref="S4.SS1.p1.7.m3.1.1.3.cmml">N</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.p1.7.m3.1b"><apply id="S4.SS1.p1.7.m3.1.1.cmml" xref="S4.SS1.p1.7.m3.1.1"><csymbol cd="ambiguous" id="S4.SS1.p1.7.m3.1.1.1.cmml" xref="S4.SS1.p1.7.m3.1.1">subscript</csymbol><ci id="S4.SS1.p1.7.m3.1.1.2.cmml" xref="S4.SS1.p1.7.m3.1.1.2">𝛼</ci><ci id="S4.SS1.p1.7.m3.1.1.3.cmml" xref="S4.SS1.p1.7.m3.1.1.3">𝑁</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.p1.7.m3.1c">\alpha_{N}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.p1.7.m3.1d">italic_α start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT</annotation></semantics></math> provides insights into the degree of performance improvement to some extent.</p> </div> <div class="ltx_para ltx_noindent" id="S4.SS1.p2"> <p class="ltx_p" id="S4.SS1.p2.1">The results of the pre-training stage and the fine-tuning stage are depicted in Figure <a class="ltx_ref" href="https://arxiv.org/html/2310.13913v4#S4.F4" title="Figure 4 ‣ 4 Scaling Law for Protein-Ligand Structure Prediction ‣ Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-Ligand Structure Prediction Models"><span class="ltx_text ltx_ref_tag">4</span></a>a and Figure <a class="ltx_ref" href="https://arxiv.org/html/2310.13913v4#S4.F4" title="Figure 4 ‣ 4 Scaling Law for Protein-Ligand Structure Prediction ‣ Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-Ligand Structure Prediction Models"><span class="ltx_text ltx_ref_tag">4</span></a>b, respectively. Our results strongly suggest a robust correlation between the accuracy of structure prediction models and their respective model sizes during both the pre-training and fine-tuning stages. 
An interesting observation is that, when pre-training is not involved (the w/o pre-train variants in Figure 4b), it becomes increasingly challenging to achieve additional gains by scaling up the model, on both the PDBbind core set (after N = 3×10^6) and PoseBusters (after N = 3×10^6). On the other hand, when pre-training is employed (the w/ pre-train variants in Figure 4b), scaling up the model consistently leads to continued improvements in performance. In both stages, the relationship between N and RMSD is closely approximated by the power law of Equation (1). Moreover, as the model size grows, we observe a more substantial performance gap between the pre-trained and non-pre-trained variants on the PoseBusters benchmark than on the PDBbind core set, which might be attributed to the greater complexity of the PoseBusters cases.
This finding underscores the pivotal role of pre-training in augmenting model scalability for protein-ligand structure prediction.

4.2 Relations between the Performance and the Pre-training Data Sizes

In this section, we validate one of our primary contributions through empirical scaling-law experiments: the generation of an extensive collection of docking complexes and the utilization of these docking complexes for pre-training. Specifically, we vary the pre-training data size from 10^5 to 10^8 and evaluate the performance of the respective pre-trained models.
The model size is kept fixed at N = 10^7. As in the analysis of model size, we use a power-law function to fit the relationship between the RMSD values and the pre-training dataset size:

$$\text{RMSD}(D) \approx (D/D_c)^{\alpha_D}. \qquad (2)$$
xref="S4.E2.m1.2.2.1.1.1.3.3">𝐷</ci></apply></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.E2.m1.2c">\text{RMSD}(D)\approx(D/D_{c})^{\alpha_{D}}.</annotation><annotation encoding="application/x-llamapun" id="S4.E2.m1.2d">RMSD ( italic_D ) ≈ ( italic_D / italic_D start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_α start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT end_POSTSUPERSCRIPT .</annotation></semantics></math></td> <td class="ltx_eqn_cell ltx_eqn_center_padright"></td> <td class="ltx_eqn_cell ltx_eqn_eqno ltx_align_middle ltx_align_right" rowspan="1"><span class="ltx_tag ltx_tag_equation ltx_align_right">(2)</span></td> </tr></tbody> </table> </div> <div class="ltx_para ltx_noindent" id="S4.SS2.p3"> <p class="ltx_p" id="S4.SS2.p3.1">The results are illustrated in Figure <a class="ltx_ref" href="https://arxiv.org/html/2310.13913v4#S4.F4" title="Figure 4 ‣ 4 Scaling Law for Protein-Ligand Structure Prediction ‣ Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-Ligand Structure Prediction Models"><span class="ltx_text ltx_ref_tag">4</span></a>c and Figure <a class="ltx_ref" href="https://arxiv.org/html/2310.13913v4#S4.F4" title="Figure 4 ‣ 4 Scaling Law for Protein-Ligand Structure Prediction ‣ Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-Ligand Structure Prediction Models"><span class="ltx_text ltx_ref_tag">4</span></a>d. There is a discernible correlation between the pre-training dataset size (<math alttext="D" class="ltx_Math" display="inline" id="S4.SS2.p3.1.m1.1"><semantics id="S4.SS2.p3.1.m1.1a"><mi id="S4.SS2.p3.1.m1.1.1" xref="S4.SS2.p3.1.m1.1.1.cmml">D</mi><annotation-xml encoding="MathML-Content" id="S4.SS2.p3.1.m1.1b"><ci id="S4.SS2.p3.1.m1.1.1.cmml" xref="S4.SS2.p3.1.m1.1.1">𝐷</ci></annotation-xml><annotation encoding="application/x-tex" id="S4.SS2.p3.1.m1.1c">D</annotation><annotation encoding="application/x-llamapun" id="S4.SS2.p3.1.m1.1d">italic_D</annotation></semantics></math>) and the performance of the model (measured by RMSD). These findings suggest a potential for further enhancement in the structure prediction model’s performance with increasing pre-training dataset sizes, demonstrating the substantial benefits of the large-scale pre-training docking complexes.</p> </div> </section> </section> <section class="ltx_section" id="S5"> <h2 class="ltx_title ltx_title_section" style="color:#000000;"> <span class="ltx_tag ltx_tag_section">5 </span>Practicality in Drug Discovery</h2> <div class="ltx_para ltx_noindent" id="S5.p1"> <p class="ltx_p" id="S5.p1.1"><span class="ltx_text" id="S5.p1.1.1" style="color:#000000;">To validate the practical utility of HelixDock in drug development, we first apply it to cross-docking (i.e., unbound docking) and structure-based virtual screening, evaluating its performance in these critical tasks.</span></p> </div> <section class="ltx_subsection" id="S5.SS1"> <h3 class="ltx_title ltx_title_subsection" style="color:#000000;"> <span class="ltx_tag ltx_tag_subsection">5.1 </span>HelixDock Generalizes Well to Cross-Docking</h3> <figure class="ltx_figure" id="S5.F5"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_missing ltx_missing_image" id="S5.F5.g1" src=""/> <figcaption class="ltx_caption ltx_centering" style="color:#000000;"><span class="ltx_tag ltx_tag_figure">Figure 5: </span>Cross-docking. 
The preceding experiments assumed the receptor structure to be in its optimal (bound) conformation, which does not align with real-world protein-ligand structure prediction, where receptor structures are flexible. Cross-docking involves extracting a ligand from a co-crystal complex and docking it into a different conformation (an apo structure) of the same protein, rather than into the ligand's original holo structure. We evaluate HelixDock on two cross-docking datasets: PDBbind-CrossDocked-Core [48], comprising 1,058 cross-docked complexes, and APObind-Core [49], comprising 229 cross-docked complexes.

As illustrated in Figure 5a, HelixDock continues to achieve a high success rate on both PDBbind-CrossDocked-Core and APObind-Core, reaching 81.7% and 72.9%, respectively. In contrast, the performance of the baseline methods on these cross-docking datasets falls significantly short, and the drop is especially pronounced compared with their effectiveness in the re-docking setting, where receptor and ligand structures are more compatible.
This decline underscores the difficulty of predicting complex structures for alternative protein conformations, a challenge that HelixDock appears particularly well-equipped to overcome.

In Figure 5b, we analyze the RMSD of each protein-ligand pair predicted by HelixDock, DiffDock, and AutoDock Vina on both holo and apo structures. Across all methods, many samples show decreased accuracy when the target protein is in the apo state rather than the holo state; HelixDock, notably, is comparatively less affected, and the majority of its predictions remain accurate despite the structural variation. For example, as illustrated in Figure 5c, both HelixDock and AutoDock Vina perform strongly on the 2XNB complex in its holo structure, but on the apo structure HelixDock maintains a low RMSD of 1.153 Å, whereas the RMSD of AutoDock Vina rises significantly to 5.388 Å.
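The per-pair analysis behind Figure 5b can be reproduced with a few lines of table manipulation; the sketch below uses hypothetical columns and toy values.

```python
import pandas as pd

# One row per protein-ligand pair and method, with the RMSD of the pose
# predicted against the holo and the apo receptor (values made up).
df = pd.DataFrame({
    "pair":      ["2XNB", "1ABC", "3DEF"],
    "method":    ["HelixDock", "HelixDock", "AutoDock Vina"],
    "rmsd_holo": [0.9, 1.4, 1.1],
    "rmsd_apo":  [1.2, 1.9, 5.4],
})

# Degradation from holo to apo, and whether a pair stays under the
# 2 Å success threshold in both settings.
df["delta"] = df["rmsd_apo"] - df["rmsd_holo"]
df["robust"] = (df["rmsd_holo"] <= 2.0) & (df["rmsd_apo"] <= 2.0)
print(df.groupby("method")[["delta", "robust"]].mean())
```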
5.2 HelixDock Demonstrates Application Potential in Structure-Based Virtual Screening

Figure 6: Performance of HelixDock on structure-based virtual screening tasks and experimental validation. (a) Overall comparison of HelixDock and baseline methods on the virtual screening benchmark DUD-E. (b) Detailed comparison of HelixDock and AutoDock Vina on all 102 targets in DUD-E. (c) The co-crystal pose of JNK3 bound to isoquinolone (left, PDB code 2ZDT [50]), and the binding poses of ChEMBL194806 (https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL194806/) predicted by HelixDock (middle) and Vina (right).

Virtual screening is a pivotal technique in drug discovery, employed to identify promising drug candidates from large-scale compound libraries. Typical structure-based virtual screening approaches generate the ligand conformation bound to the protein target and then apply a scoring function to assess the binding strength; the highest-scoring compounds are selected for further validation. The accuracy of these binding poses significantly influences the final outcomes [51].

To assess the efficacy of HelixDock, we compared its virtual screening capability with that of AutoDock Vina, a widely used docking tool, and KarmaDock [52], an advanced AI-based virtual screening method. We evaluated these methods using the enrichment factor (EF) [53, 39] on the DUD-E dataset [32], a widely adopted virtual screening benchmark comprising actives and decoys for 102 protein targets. As illustrated in Figure 6a, HelixDock surpasses both KarmaDock and AutoDock Vina in EF at the 1% and 5% thresholds, underscoring its strong potential in real virtual screening applications. Specifically, HelixDock achieves an EF(1%) of 29.4, a significant 31.7% improvement over AutoDock Vina's 22.3.
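The enrichment factor at a fraction x compares the active rate among the top-scored x of the library with the active rate of the whole library; a minimal sketch:

```python
import numpy as np

def enrichment_factor(scores: np.ndarray, is_active: np.ndarray,
                      top_frac: float) -> float:
    """EF(top_frac): hit rate in the top-scored fraction divided by the
    overall hit rate. Higher scores are assumed to mean stronger
    predicted binding.
    """
    n_top = max(1, int(round(len(scores) * top_frac)))
    top = np.argsort(-scores)[:n_top]
    return float(is_active[top].mean() / is_active.mean())

# EF(1%) as reported in Figure 6a would be enrichment_factor(s, a, 0.01).
```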
Further analysis is presented in Figure 6b, which compares HelixDock and AutoDock Vina across all protein targets in the DUD-E dataset: HelixDock consistently outperforms AutoDock Vina in EF on most targets, demonstrating its robustness and superior performance in virtual screening scenarios.

We also find that the binding poses generated by HelixDock exhibit superior binding modes compared with those produced by other computational tools. As illustrated in Figure 6c, we visualized a protein target from the DUD-E database [32], c-Jun N-terminal kinase 3 (JNK3), and compared the co-crystallized reference ligand's binding pose (ligand: isoquinolone, PDB code: 2ZDT) with the poses of another active compound (ChEMBL ID: 194806) generated by HelixDock and Vina. Previous research has highlighted that a hydrogen bond between isoquinolone and the backbone NH of MET-149 in the hinge region, together with hydrogen bonds to ASN-152 and LYS-68, is crucial for JNK3 inhibition [54]. We observe that the binding pose predicted by HelixDock forms hydrogen bonds with both ASN-152 and MET-149, whereas the pose predicted by Vina establishes only one hydrogen bond, with MET-149. The configuration predicted by HelixDock thus more accurately replicates the binding mode found in the co-crystal structure, suggesting that HelixDock captures essential protein-ligand interactions and thereby provides a more reliable starting point for structure-based lead optimization.
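The interaction analysis in Figure 6c can be approximated with a crude geometric criterion; the sketch below flags donor-acceptor heavy-atom pairs within a common 3.5 Å cutoff (coordinates are made up), whereas a real assignment would also check donor-H-acceptor angles and atom typing.

```python
import numpy as np

def hbond_contacts(donors: np.ndarray, acceptors: np.ndarray,
                   cutoff: float = 3.5) -> np.ndarray:
    """Index pairs of donor/acceptor atoms within `cutoff` Å of each other."""
    d = np.linalg.norm(donors[:, None, :] - acceptors[None, :, :], axis=-1)
    return np.argwhere(d <= cutoff)

# Toy coordinates: ligand polar atoms vs. pocket atoms such as the MET-149
# backbone NH and the ASN-152 / LYS-68 side chains mentioned above.
ligand_polar = np.array([[0.0, 0.0, 0.0], [4.1, 0.2, 0.0]])
pocket_polar = np.array([[2.9, 0.5, 0.1], [8.0, 8.0, 8.0]])
print(hbond_contacts(ligand_polar, pocket_polar))
```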
</section> </section> <section class="ltx_section" id="S6"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">6 </span>Conclusion and Future Work</h2> <div class="ltx_para ltx_noindent" id="S6.p1"> <p class="ltx_p" id="S6.p1.1">Protein-ligand structure prediction remains indispensable in drug discovery for its computational efficiency in predicting binding interactions. Recent strides in deep learning-based methods offer a more accurate alternative to physics-based docking, although these methods face inherent generalization challenges due to the limited training data available. HelixDock, by contrast, capitalizes on a vast dataset of docking poses generated with physics-based tools, unlocking the potential of deep learning-based protein-ligand structure prediction.</p> </div> <div class="ltx_para ltx_noindent" id="S6.p2"> <p class="ltx_p" id="S6.p2.1">Our findings demonstrate the superiority of HelixDock over baseline methods, particularly on challenging samples. Moreover, our study indicates that scaling up the pre-training data size can enhance performance, and that, once pre-training is applied, increasing the model size can lead to further improvement. In conclusion, our work not only highlights the substantial potential of large-scale data in the life sciences but may also offer inspiration to researchers in related fields.</p> </div> <div class="ltx_para ltx_noindent" id="S6.p3"> <p class="ltx_p" id="S6.p3.1">Several promising directions for future research include:</p> <ul class="ltx_itemize" id="S6.I1"> <li class="ltx_item" id="S6.I1.i1" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S6.I1.i1.p1"> <p class="ltx_p" id="S6.I1.i1.p1.1">Larger-scale data: We have generated hundreds of millions of molecular docking poses for pre-training and plan to expand this dataset further, leveraging the benefits of scale to enhance model precision.</p> </div> </li> <li class="ltx_item" id="S6.I1.i2" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para" id="S6.I1.i2.p1"> <p class="ltx_p" id="S6.I1.i2.p1.1">Higher-precision data: Recognizing the precision limits of docking poses generated by physics-based methods, we plan to explore more accurate techniques, such as molecular dynamics simulations, to refine the conformational data.</p> </div> </li> <li class="ltx_item" id="S6.I1.i3" style="list-style-type:none;"> <span class="ltx_tag ltx_tag_item">•</span> <div class="ltx_para ltx_noindent" id="S6.I1.i3.p1"> <p class="ltx_p" id="S6.I1.i3.p1.1">Wider applications: Our ongoing efforts involve applying the extensive conformational data to improve related tasks such as affinity prediction and molecular property estimation. Additionally, we are considering applying pre-training strategies to large molecules, for example to predict protein complex structures.</p> </div> </li> </ul> </div> <div class="ltx_para ltx_noindent" id="S6.p4"> <p class="ltx_p" id="S6.p4.1"><span class="ltx_text ltx_font_bold" id="S6.p4.1.1">Code availability:</span> The source code, trained weights, and inference code of HelixDock will be freely available on GitHub (<a class="ltx_ref ltx_url ltx_font_typewriter" href="https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/molecular_docking/helixdock" title="">https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/molecular_docking/helixdock</a>) to enable reproduction of our experimental results. A web service of HelixDock is also available at <a class="ltx_ref ltx_url ltx_font_typewriter" href="https://paddlehelix.baidu.com/app/drug/helix-dock/forecast" title="">https://paddlehelix.baidu.com/app/drug/helix-dock/forecast</a> to provide efficient structure predictions.</p> </div> <div class="ltx_pagination ltx_role_newpage"></div> </section> <section class="ltx_bibliography" id="bib"> <h2 class="ltx_title ltx_title_bibliography">References</h2> <ul class="ltx_biblist"> <li class="ltx_bibitem" id="bib.bib1"> <span class="ltx_tag ltx_tag_bibitem">[1]</span> <span class="ltx_bibblock"> Hongtao Zhao and Amedeo Caflisch. </span> <span class="ltx_bibblock">Discovery of zap70 inhibitors by high-throughput docking into a conformation of its kinase domain generated by molecular dynamics. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib1.1.1">Bioorganic & medicinal chemistry letters</span>, 23(20):5721–5726, 2013. </span> </li> <li class="ltx_bibitem" id="bib.bib2"> <span class="ltx_tag ltx_tag_bibitem">[2]</span> <span class="ltx_bibblock"> Oleg Trott and Arthur J Olson. 
</span> <span class="ltx_bibblock">Autodock vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib2.1.1">Journal of computational chemistry</span>, 31(2):455–461, 2010. </span> </li> <li class="ltx_bibitem" id="bib.bib3"> <span class="ltx_tag ltx_tag_bibitem">[3]</span> <span class="ltx_bibblock"> Jerome Eberhardt, Diogo Santos-Martins, Andreas F. Tillack, and Stefano Forli. </span> <span class="ltx_bibblock">Autodock vina 1.2.0: New docking methods, expanded force field, and python bindings. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib3.1.1">Journal of Chemical Information and Modeling</span>, 61(8):3891–3898, 2021. </span> <span class="ltx_bibblock">PMID: 34278794. </span> </li> <li class="ltx_bibitem" id="bib.bib4"> <span class="ltx_tag ltx_tag_bibitem">[4]</span> <span class="ltx_bibblock"> David Ryan Koes, Matthew P Baumgartner, and Carlos J Camacho. </span> <span class="ltx_bibblock">Lessons learned in empirical scoring with smina from the csar 2011 benchmarking exercise. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib4.1.1">Journal of chemical information and modeling</span>, 53(8):1893–1904, 2013. </span> </li> <li class="ltx_bibitem" id="bib.bib5"> <span class="ltx_tag ltx_tag_bibitem">[5]</span> <span class="ltx_bibblock"> Richard A Friesner, Jay L Banks, Robert B Murphy, Thomas A Halgren, Jasna J Klicic, Daniel T Mainz, Matthew P Repasky, Eric H Knoll, Mee Shelley, Jason K Perry, et al. </span> <span class="ltx_bibblock">Glide: a new approach for rapid, accurate docking and scoring. 1. method and assessment of docking accuracy. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib5.1.1">Journal of medicinal chemistry</span>, 47(7):1739–1749, 2004. </span> </li> <li class="ltx_bibitem" id="bib.bib6"> <span class="ltx_tag ltx_tag_bibitem">[6]</span> <span class="ltx_bibblock"> Thomas A Halgren, Robert B Murphy, Richard A Friesner, Hege S Beard, Leah L Frye, W Thomas Pollard, and Jay L Banks. </span> <span class="ltx_bibblock">Glide: a new approach for rapid, accurate docking and scoring. 2. enrichment factors in database screening. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib6.1.1">Journal of medicinal chemistry</span>, 47(7):1750–1759, 2004. </span> </li> <li class="ltx_bibitem" id="bib.bib7"> <span class="ltx_tag ltx_tag_bibitem">[7]</span> <span class="ltx_bibblock"> Hannes Stärk, Octavian Ganea, Lagnajit Pattanaik, Regina Barzilay, and Tommi Jaakkola. </span> <span class="ltx_bibblock">Equibind: Geometric deep learning for drug binding structure prediction. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib7.1.1">International Conference on Machine Learning</span>, pages 20503–20521. PMLR, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib8"> <span class="ltx_tag ltx_tag_bibitem">[8]</span> <span class="ltx_bibblock"> Wei Lu, Qifeng Wu, Jixian Zhang, Jiahua Rao, Chengtao Li, and Shuangjia Zheng. </span> <span class="ltx_bibblock">Tankbind: Trigonometry-aware neural networks for drug-protein binding structure prediction. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib8.1.1">bioRxiv</span>, pages 2022–06, 2022. 
</span> </li> <li class="ltx_bibitem" id="bib.bib9"> <span class="ltx_tag ltx_tag_bibitem">[9]</span> <span class="ltx_bibblock"> Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, and Tommi Jaakkola. </span> <span class="ltx_bibblock">Diffdock: Diffusion steps, twists, and turns for molecular docking. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib9.1.1">arXiv preprint arXiv:2210.01776</span>, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib10"> <span class="ltx_tag ltx_tag_bibitem">[10]</span> <span class="ltx_bibblock"> Gengmo Zhou, Zhifeng Gao, Qiankun Ding, Hang Zheng, Hongteng Xu, Zhewei Wei, Linfeng Zhang, and Guolin Ke. </span> <span class="ltx_bibblock">Uni-mol: A universal 3d molecular representation learning framework. </span> <span class="ltx_bibblock">2023. </span> </li> <li class="ltx_bibitem" id="bib.bib11"> <span class="ltx_tag ltx_tag_bibitem">[11]</span> <span class="ltx_bibblock"> Paul G Francoeur, Tomohide Masuda, Jocelyn Sunseri, Andrew Jia, Richard B Iovanisci, Ian Snyder, and David R Koes. </span> <span class="ltx_bibblock">Three-dimensional convolutional neural networks and a cross-docked data set for structure-based drug design. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib11.1.1">Journal of chemical information and modeling</span>, 60(9):4200–4215, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib12"> <span class="ltx_tag ltx_tag_bibitem">[12]</span> <span class="ltx_bibblock"> Dumitru Erhan, Aaron Courville, Yoshua Bengio, and Pascal Vincent. </span> <span class="ltx_bibblock">Why does unsupervised pre-training help deep learning? </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib12.1.1">Proceedings of the thirteenth international conference on artificial intelligence and statistics</span>, pages 201–208. JMLR Workshop and Conference Proceedings, 2010. </span> </li> <li class="ltx_bibitem" id="bib.bib13"> <span class="ltx_tag ltx_tag_bibitem">[13]</span> <span class="ltx_bibblock"> Kaiming He, Ross Girshick, and Piotr Dollár. </span> <span class="ltx_bibblock">Rethinking imagenet pre-training. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib13.1.1">Proceedings of the IEEE/CVF International Conference on Computer Vision</span>, pages 4918–4927, 2019. </span> </li> <li class="ltx_bibitem" id="bib.bib14"> <span class="ltx_tag ltx_tag_bibitem">[14]</span> <span class="ltx_bibblock"> Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. </span> <span class="ltx_bibblock">Unified language model pre-training for natural language understanding and generation. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib14.1.1">Advances in neural information processing systems</span>, 32, 2019. </span> </li> <li class="ltx_bibitem" id="bib.bib15"> <span class="ltx_tag ltx_tag_bibitem">[15]</span> <span class="ltx_bibblock"> Weihua Hu, Bowen Liu, Joseph Gomes, Marinka Zitnik, Percy Liang, Vijay S. Pande, and Jure Leskovec. </span> <span class="ltx_bibblock">Strategies for pre-training graph neural networks. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib15.1.1">8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020</span>. OpenReview.net, 2020. 
</span> </li> <li class="ltx_bibitem" id="bib.bib16"> <span class="ltx_tag ltx_tag_bibitem">[16]</span> <span class="ltx_bibblock"> Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. </span> <span class="ltx_bibblock">Self-supervised graph transformer on large-scale molecular data. </span> <span class="ltx_bibblock">In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, <span class="ltx_text ltx_font_italic" id="bib.bib16.1.1">Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual</span>, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib17"> <span class="ltx_tag ltx_tag_bibitem">[17]</span> <span class="ltx_bibblock"> Xiaomin Fang, Lihang Liu, Jieqiong Lei, Donglong He, Shanzhuo Zhang, Jingbo Zhou, Fan Wang, Hua Wu, and Haifeng Wang. </span> <span class="ltx_bibblock">Geometry-enhanced molecular representation learning for property prediction. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib17.1.1">Nature Machine Intelligence</span>, 4(2):127–134, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib18"> <span class="ltx_tag ltx_tag_bibitem">[18]</span> <span class="ltx_bibblock"> Can Chen, Jingbo Zhou, Fan Wang, Xue Liu, and Dejing Dou. </span> <span class="ltx_bibblock">Structure-aware protein self-supervised learning. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib18.1.1">Bioinformatics</span>, 39(4):btad189, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib19"> <span class="ltx_tag ltx_tag_bibitem">[19]</span> <span class="ltx_bibblock"> Shuangli Li, Jingbo Zhou, Tong Xu, Dejing Dou, and Hui Xiong. </span> <span class="ltx_bibblock">Geomgcl: Geometric graph contrastive learning for molecular property prediction. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib19.1.1">Proceedings of the AAAI Conference on Artificial Intelligence</span>, volume 36, pages 4541–4549, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib20"> <span class="ltx_tag ltx_tag_bibitem">[20]</span> <span class="ltx_bibblock"> Shanzhuo Zhang, Zhiyuan Yan, Yueyang Huang, Lihang Liu, Donglong He, Wei Wang, Xiaomin Fang, Xiaonan Zhang, Fan Wang, Hua Wu, et al. </span> <span class="ltx_bibblock">Helixadmet: a robust and endpoint extensible admet system incorporating self-supervised knowledge transfer. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib20.1.1">Bioinformatics</span>, 38(13):3444–3453, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib21"> <span class="ltx_tag ltx_tag_bibitem">[21]</span> <span class="ltx_bibblock"> Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Peter Chen, John Canny, Pieter Abbeel, and Yun Song. </span> <span class="ltx_bibblock">Evaluating protein transfer learning with tape. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib21.1.1">Advances in neural information processing systems</span>, 32, 2019. </span> </li> <li class="ltx_bibitem" id="bib.bib22"> <span class="ltx_tag ltx_tag_bibitem">[22]</span> <span class="ltx_bibblock"> Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. 
</span> <span class="ltx_bibblock">Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib22.1.1">PNAS</span>, 118(15):e2016239118, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib23"> <span class="ltx_tag ltx_tag_bibitem">[23]</span> <span class="ltx_bibblock"> Roshan M Rao, Jason Liu, Robert Verkuil, Joshua Meier, John Canny, Pieter Abbeel, Tom Sercu, and Alexander Rives. </span> <span class="ltx_bibblock">Msa transformer. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib23.1.1">International Conference on Machine Learning</span>, pages 8844–8856. PMLR, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib24"> <span class="ltx_tag ltx_tag_bibitem">[24]</span> <span class="ltx_bibblock"> Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. </span> <span class="ltx_bibblock">Evolutionary-scale prediction of atomic-level protein structure with a language model. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib24.1.1">Science</span>, 379(6637):1123–1130, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib25"> <span class="ltx_tag ltx_tag_bibitem">[25]</span> <span class="ltx_bibblock"> Konstantin Weißenow, Michael Heinzinger, and Burkhard Rost. </span> <span class="ltx_bibblock">Protein language-model embeddings for fast, accurate, and alignment-free protein structure prediction. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib25.1.1">Structure</span>, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib26"> <span class="ltx_tag ltx_tag_bibitem">[26]</span> <span class="ltx_bibblock"> Xiaomin Fang, Fan Wang, Lihang Liu, Jingzhou He, Dayong Lin, Yingfei Xiang, Kunrui Zhu, Xiaonan Zhang, Hua Wu, Hui Li, et al. </span> <span class="ltx_bibblock">A method for multiple-sequence-alignment-free protein structure prediction using a protein language model. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib26.1.1">Nature Machine Intelligence</span>, pages 1–10, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib27"> <span class="ltx_tag ltx_tag_bibitem">[27]</span> <span class="ltx_bibblock"> Shayne D Wierbowski, Bentley M Wingert, Jim Zheng, and Carlos J Camacho. </span> <span class="ltx_bibblock">Cross-docking benchmark for automated pose and ranking prediction of ligand binding. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib27.1.1">Protein Science</span>, 29(1):298–305, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib28"> <span class="ltx_tag ltx_tag_bibitem">[28]</span> <span class="ltx_bibblock"> David M Rogers, Rupesh Agarwal, Josh V Vermaas, Micholas Dean Smith, Rajitha T Rajeshwar, Connor Cooper, Ada Sedova, Swen Boehm, Matthew Baker, Jens Glaser, et al. </span> <span class="ltx_bibblock">Sars-cov2 billion-compound docking. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib28.1.1">Scientific Data</span>, 10(1):173, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib29"> <span class="ltx_tag ltx_tag_bibitem">[29]</span> <span class="ltx_bibblock"> Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. 
</span> <span class="ltx_bibblock">Deep learning scaling is predictable, empirically. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib29.1.1">arXiv preprint arXiv:1712.00409</span>, 2017. </span> </li> <li class="ltx_bibitem" id="bib.bib30"> <span class="ltx_tag ltx_tag_bibitem">[30]</span> <span class="ltx_bibblock"> Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. </span> <span class="ltx_bibblock">Scaling laws for neural language models. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib30.1.1">arXiv preprint arXiv:2001.08361</span>, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib31"> <span class="ltx_tag ltx_tag_bibitem">[31]</span> <span class="ltx_bibblock"> Ibrahim M Alabdulmohsin, Behnam Neyshabur, and Xiaohua Zhai. </span> <span class="ltx_bibblock">Revisiting neural scaling laws in language and vision. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib31.1.1">Advances in Neural Information Processing Systems</span>, 35:22300–22312, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib32"> <span class="ltx_tag ltx_tag_bibitem">[32]</span> <span class="ltx_bibblock"> Michael M Mysinger, Michael Carchia, John J Irwin, and Brian K Shoichet. </span> <span class="ltx_bibblock">Directory of useful decoys, enhanced (dud-e): better ligands and decoys for better benchmarking. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib32.1.1">Journal of medicinal chemistry</span>, 55(14):6582–6594, 2012. </span> </li> <li class="ltx_bibitem" id="bib.bib33"> <span class="ltx_tag ltx_tag_bibitem">[33]</span> <span class="ltx_bibblock"> Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov, and Philip E Bourne. </span> <span class="ltx_bibblock">The protein data bank. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib33.1.1">Nucleic acids research</span>, 28(1):235–242, 2000. </span> </li> <li class="ltx_bibitem" id="bib.bib34"> <span class="ltx_tag ltx_tag_bibitem">[34]</span> <span class="ltx_bibblock"> Renxiao Wang, Xueliang Fang, Yipin Lu, and Shaomeng Wang. </span> <span class="ltx_bibblock">The pdbbind database: Collection of binding affinities for protein–ligand complexes with known three-dimensional structures. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib34.1.1">Journal of medicinal chemistry</span>, 47(12):2977–2980, 2004. </span> </li> <li class="ltx_bibitem" id="bib.bib35"> <span class="ltx_tag ltx_tag_bibitem">[35]</span> <span class="ltx_bibblock"> Vincent Le Guilloux, Peter Schmidtke, and Pierre Tuffery. </span> <span class="ltx_bibblock">Fpocket: an open source platform for ligand pocket detection. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib35.1.1">BMC bioinformatics</span>, 10(1):1–11, 2009. </span> </li> <li class="ltx_bibitem" id="bib.bib36"> <span class="ltx_tag ltx_tag_bibitem">[36]</span> <span class="ltx_bibblock"> Leland McInnes, John Healy, and James Melville. </span> <span class="ltx_bibblock">Umap: Uniform manifold approximation and projection for dimension reduction. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib36.1.1">arXiv preprint arXiv:1802.03426</span>, 2018. 
</span> </li> <li class="ltx_bibitem" id="bib.bib37"> <span class="ltx_tag ltx_tag_bibitem">[37]</span> <span class="ltx_bibblock"> Martin Steinegger and Johannes Söding. </span> <span class="ltx_bibblock">Mmseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib37.1.1">Nature biotechnology</span>, 35(11):1026–1028, 2017. </span> </li> <li class="ltx_bibitem" id="bib.bib38"> <span class="ltx_tag ltx_tag_bibitem">[38]</span> <span class="ltx_bibblock"> Jonathan Ho, Ajay Jain, and Pieter Abbeel. </span> <span class="ltx_bibblock">Denoising diffusion probabilistic models. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib38.1.1">Advances in neural information processing systems</span>, 33:6840–6851, 2020. </span> </li> <li class="ltx_bibitem" id="bib.bib39"> <span class="ltx_tag ltx_tag_bibitem">[39]</span> <span class="ltx_bibblock"> Minyi Su, Qifan Yang, Yu Du, Guoqin Feng, Zhihai Liu, Yan Li, and Renxiao Wang. </span> <span class="ltx_bibblock">Comparative assessment of scoring functions: the casf-2016 update. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib39.1.1">Journal of chemical information and modeling</span>, 59(2):895–913, 2018. </span> </li> <li class="ltx_bibitem" id="bib.bib40"> <span class="ltx_tag ltx_tag_bibitem">[40]</span> <span class="ltx_bibblock"> Martin Buttenschoen, Garrett M Morris, and Charlotte M Deane. </span> <span class="ltx_bibblock">Posebusters: Ai-based docking methods fail to generate physically valid poses or generalise to novel sequences. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib40.1.1">Chemical Science</span>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib41"> <span class="ltx_tag ltx_tag_bibitem">[41]</span> <span class="ltx_bibblock"> Na Zhang and Hongtao Zhao. </span> <span class="ltx_bibblock">Enriching screening libraries with bioactive fragment space. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib41.1.1">Bioorganic & Medicinal Chemistry Letters</span>, 26(15):3594–3597, 2016. </span> </li> <li class="ltx_bibitem" id="bib.bib42"> <span class="ltx_tag ltx_tag_bibitem">[42]</span> <span class="ltx_bibblock"> Sergio Ruiz-Carmona, Daniel Alvarez-Garcia, Nicolas Foloppe, A Beatriz Garmendia-Doval, Szilveszter Juhos, Peter Schmidtke, Xavier Barril, Roderick E Hubbard, and S David Morley. </span> <span class="ltx_bibblock">rdock: a fast, versatile and open source program for docking ligands to proteins and nucleic acids. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib42.1.1">PLoS computational biology</span>, 10(4):e1003571, 2014. </span> </li> <li class="ltx_bibitem" id="bib.bib43"> <span class="ltx_tag ltx_tag_bibitem">[43]</span> <span class="ltx_bibblock"> Gengmo Zhou, Zhifeng Gao, Qiankun Ding, Hang Zheng, Hongteng Xu, Zhewei Wei, Linfeng Zhang, and Guolin Ke. </span> <span class="ltx_bibblock">Uni-mol: a universal 3d molecular representation learning framework. </span> <span class="ltx_bibblock">2023. </span> </li> <li class="ltx_bibitem" id="bib.bib44"> <span class="ltx_tag ltx_tag_bibitem">[44]</span> <span class="ltx_bibblock"> Zhe Wang, Huiyong Sun, Xiaojun Yao, Dan Li, Lei Xu, Youyong Li, Sheng Tian, and Tingjun Hou. 
</span> <span class="ltx_bibblock">Comprehensive evaluation of ten docking programs on a diverse set of protein–ligand complexes: the prediction accuracy of sampling power and scoring power. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib44.1.1">Physical Chemistry Chemical Physics</span>, 18(18):12964–12975, 2016. </span> </li> <li class="ltx_bibitem" id="bib.bib45"> <span class="ltx_tag ltx_tag_bibitem">[45]</span> <span class="ltx_bibblock"> A glimpse of the next generation of alphafold. </span> <span class="ltx_bibblock"><a class="ltx_ref ltx_url ltx_font_typewriter" href="https://deepmind.google/discover/blog/a-glimpse-of-the-next-generation-of-alphafold/" title="">https://deepmind.google/discover/blog/a-glimpse-of-the-next-generation-of-alphafold/</a>. </span> <span class="ltx_bibblock">Accessed: 2023-10-13. </span> </li> <li class="ltx_bibitem" id="bib.bib46"> <span class="ltx_tag ltx_tag_bibitem">[46]</span> <span class="ltx_bibblock"> Behrooz Ghorbani, Orhan Firat, Markus Freitag, Ankur Bapna, Maxim Krikun, Xavier Garcia, Ciprian Chelba, and Colin Cherry. </span> <span class="ltx_bibblock">Scaling laws for neural machine translation. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib46.1.1">arXiv preprint arXiv:2109.07740</span>, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib47"> <span class="ltx_tag ltx_tag_bibitem">[47]</span> <span class="ltx_bibblock"> Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. </span> <span class="ltx_bibblock">Scaling vision transformers. </span> <span class="ltx_bibblock">In <span class="ltx_text ltx_font_italic" id="bib.bib47.1.1">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</span>, pages 12104–12113, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib48"> <span class="ltx_tag ltx_tag_bibitem">[48]</span> <span class="ltx_bibblock"> Chao Shen, Xueping Hu, Junbo Gao, Xujun Zhang, Haiyang Zhong, Zhe Wang, Lei Xu, Yu Kang, Dongsheng Cao, and Tingjun Hou. </span> <span class="ltx_bibblock">The impact of cross-docked poses on performance of machine learning classifier for protein–ligand binding pose prediction. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib48.1.1">Journal of cheminformatics</span>, 13:1–18, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib49"> <span class="ltx_tag ltx_tag_bibitem">[49]</span> <span class="ltx_bibblock"> Rishal Aggarwal, Akash Gupta, and U Priyakumar. </span> <span class="ltx_bibblock">Apobind: a dataset of ligand unbound protein conformations for machine learning applications in de novo drug design. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib49.1.1">arXiv preprint arXiv:2108.09926</span>, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib50"> <span class="ltx_tag ltx_tag_bibitem">[50]</span> <span class="ltx_bibblock"> Yasutomi Asano, Shuji Kitamura, Taiichi Ohra, Fumio Itoh, Masahiro Kajino, Tomoko Tamura, Manami Kaneko, Shota Ikeda, Hideki Igata, Tomohiro Kawamoto, Satoshi Sogabe, Shin-ichi Matsumoto, Toshimasa Tanaka, Masashi Yamaguchi, Hiroyuki Kimura, and Shoji Fukumoto. </span> <span class="ltx_bibblock">Discovery, synthesis and biological evaluation of isoquinolones as novel and highly selective JNK inhibitors (2). </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib50.1.1">Bioorganic & medicinal chemistry</span>, 16(8):4699–4714, 2008. 
</span> </li> <li class="ltx_bibitem" id="bib.bib51"> <span class="ltx_tag ltx_tag_bibitem">[51]</span> <span class="ltx_bibblock"> Evanthia Lionta, George Spyrou, Demetrios K. Vassilatis, and Zoe Cournia. </span> <span class="ltx_bibblock">Structure-based virtual screening for drug discovery: principles, applications and recent advances. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib51.1.1">Current Topics in Medicinal Chemistry</span>, 14(16):1923–1938, 2014. </span> </li> <li class="ltx_bibitem" id="bib.bib52"> <span class="ltx_tag ltx_tag_bibitem">[52]</span> <span class="ltx_bibblock"> Xujun Zhang, Odin Zhang, Chao Shen, Wanglin Qu, Shicheng Chen, Hanqun Cao, Yu Kang, Zhe Wang, Ercheng Wang, Jintu Zhang, et al. </span> <span class="ltx_bibblock">Efficient and accurate large library ligand docking with karmadock. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib52.1.1">Nature Computational Science</span>, 3(9):789–804, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib53"> <span class="ltx_tag ltx_tag_bibitem">[53]</span> <span class="ltx_bibblock"> Yan Li, Minyi Su, Zhihai Liu, Jie Li, Jie Liu, Li Han, and Renxiao Wang. </span> <span class="ltx_bibblock">Assessing protein–ligand interaction scoring functions with the casf-2013 benchmark. </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib53.1.1">Nature protocols</span>, 13(4):666–680, 2018. </span> </li> <li class="ltx_bibitem" id="bib.bib54"> <span class="ltx_tag ltx_tag_bibitem">[54]</span> <span class="ltx_bibblock"> Yasutomi Asano, Shuji Kitamura, Taiichi Ohra, Fumio Itoh, Masahiro Kajino, Tomoko Tamura, Manami Kaneko, Shota Ikeda, Hideki Igata, Tomohiro Kawamoto, et al. </span> <span class="ltx_bibblock">Discovery, synthesis and biological evaluation of isoquinolones as novel and highly selective jnk inhibitors (2). </span> <span class="ltx_bibblock"><span class="ltx_text ltx_font_italic" id="bib.bib54.1.1">Bioorganic & medicinal chemistry</span>, 16(8):4699–4714, 2008. </span> </li> </ul> </section> <div class="ltx_pagination ltx_role_newpage"></div> </article> </div> <footer class="ltx_page_footer"></footer> </div> </body> </html>