<!DOCTYPE html> <html lang="en"> <head> <meta content="text/html; charset=utf-8" http-equiv="content-type"/> <title>SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction</title>  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/> <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv-fonts.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/latexml_styles.css" rel="stylesheet" type="text/css"/> <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.3.3/html2canvas.min.js"></script> <script src="/static/browse/0.3.4/js/addons_new.js"></script> <script src="/static/browse/0.3.4/js/feedbackOverlay.js"></script> <base href="/html/2406.08961v1/"/></head> <body> <nav class="ltx_page_navbar"> <nav class="ltx_TOC"> <ol class="ltx_toclist"> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S1" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">1 Introduction</a></li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S2" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">2 Related work</a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S2.SS0.SSS0.Px1" title="In 2 Related work ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity 
Prediction">Non-structural datasets on drug-target interaction for bioactivity prediction.</a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S2.SS0.SSS0.Px2" title="In 2 Related work ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">Structural datasets based on experimental structures for bioactivity prediction.</a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S2.SS0.SSS0.Px3" title="In 2 Related work ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">Structural datasets based on modeling structures for bioactivity prediction.</a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">3 SIU dataset construction and overview</a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS1" title="In 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">3.1 Methods</a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_subsubsection"> <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS1.SSS1" title="In 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">3.1.1 Data cleaning and deduplication</a> <ol class="ltx_toclist ltx_toclist_subsubsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" 
href="https://arxiv.org/html/2406.08961v1#S3.SS1.SSS1.Px1" title="In 3.1.1 Data cleaning and deduplication ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">Bioactivity data extracting.</a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS1.SSS1.Px2" title="In 3.1.1 Data cleaning and deduplication ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">PDB structure retrieval and mapping.</a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsubsection"> <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS1.SSS2" title="In 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">3.1.2 Structural data construction via multi-software docking</a> <ol class="ltx_toclist ltx_toclist_subsubsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS1.SSS2.Px1" title="In 3.1.2 Structural data construction via multi-software docking ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">Molecular docking.</a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS1.SSS2.Px2" title="In 3.1.2 Structural data construction via multi-software docking ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">Consensus filtering of docking poses.</a></li> </ol> </li> <li class="ltx_tocentry 
ltx_tocentry_subsubsection"> <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS1.SSS3" title="In 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">3.1.3 Data construction for downstream tasks</a> <ol class="ltx_toclist ltx_toclist_subsubsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS1.SSS3.Px1" title="In 3.1.3 Data construction for downstream tasks ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">Dataset organization for unbiased bioactivity prediction.</a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS1.SSS3.Px2" title="In 3.1.3 Data construction for downstream tasks ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">Dataset split.</a></li> </ol> </li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS2" title="In 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">3.2 Dataset overview</a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS2.SSS0.Px1" title="In 3.2 Dataset overview ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">Large-scale.</a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" 
href="https://arxiv.org/html/2406.08961v1#S3.SS2.SSS0.Px2" title="In 3.2 Dataset overview ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">Diversity.</a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS2.SSS0.Px3" title="In 3.2 Dataset overview ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">High-quality.</a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS2.SSS0.Px4" title="In 3.2 Dataset overview ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">Well-organized.</a></li> </ol> </li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S4" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">4 SIU experiments and analysis</a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S4.SS1" title="In 4 SIU experiments and analysis ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">4.1 Analysis</a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S4.SS1.SSS0.Px1" title="In 4.1 Analysis ‣ 4 SIU experiments and analysis ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">Different difficulties of assay types.</a></li> <li class="ltx_tocentry 
ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S4.SS1.SSS0.Px2" title="In 4.1 Analysis ‣ 4 SIU experiments and analysis ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">Influence of measuring correlation with same PDB IDs.</a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S4.SS1.SSS0.Px3" title="In 4.1 Analysis ‣ 4 SIU experiments and analysis ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">Effectiveness of training on larger dataset.</a></li> </ol> </li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S5" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">5 Limitations and future work</a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S5.SS0.SSS0.Px1" title="In 5 Limitations and future work ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">Limitations.</a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S5.SS0.SSS0.Px2" title="In 5 Limitations and future work ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">Future work.</a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S6" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">6 Conclusion</a></li> <li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" 
href="https://arxiv.org/html/2406.08961v1#A1" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">A Dataset and Code Availability</a></li> <li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#A2" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">B Dataset overview</a></li> <li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#A3" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">C Test set construction</a></li> <li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#A4" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">D Model Training</a></li> <li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#A5" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">E Potential negative impact of SIU</a></li> </ol></nav> </nav> <div class="ltx_page_main"> <div class="ltx_page_content"> <article class="ltx_document ltx_authors_1line"> <h1 class="ltx_title ltx_title_document">SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction</h1> <div class="ltx_authors"> Yanwen Huang<sup>1,*</sup>, Bowen Gao<sup>2,*</sup>, Yinjun Jia<sup>3</sup>, Hongbo Ma<sup>4</sup>, Wei-Ying Ma<sup>2</sup>, Ya-Qin Zhang<sup>2</sup>, Yanyan Lan<sup>2</sup>. <sup>1</sup>Department of Pharmaceutical Science, Peking University; <sup>2</sup>Institute for AI Industry Research (AIR), Tsinghua University; <sup>3</sup>School of Life Sciences, Tsinghua University; <sup>4</sup>Department of Computer Science and Technology, Tsinghua University. <sup>*</sup>Equal contribution. Work was done while Yanwen Huang was an intern at AIR. Correspondence to lanyanyan@air.tsinghua.edu.cn 
</div> <div class="ltx_abstract"> <h6 class="ltx_title ltx_title_abstract">Abstract</h6> Small molecules play a pivotal role in modern medicine, and scrutinizing their interactions with protein targets is essential for the discovery and development of novel, life-saving therapeutics. The term "bioactivity" encompasses various biological effects resulting from these interactions, including both binding and functional responses. The magnitude of bioactivity dictates the therapeutic or toxic pharmacological outcomes of small molecules, rendering accurate bioactivity prediction crucial for the development of safe and effective drugs. However, existing structural datasets of small molecule-protein interactions are often limited in scale and lack systematically organized bioactivity labels, thereby impeding our understanding of these interactions and precise bioactivity prediction. In this study, we introduce a comprehensive dataset of small molecule-protein interactions, consisting of over a million binding structures, each annotated with real biological activity labels. This dataset is designed to facilitate unbiased bioactivity prediction. We evaluated several classical models on this dataset, and the results demonstrate that the task of unbiased bioactivity prediction is challenging yet essential. </div> <section class="ltx_section" id="S1"> <h2 class="ltx_title ltx_title_section"> 1 Introduction</h2> <div class="ltx_para ltx_noindent" id="S1.p1"> Bioactivity encapsulates various types of distinct measurements derived from different wet lab conditions, including both binding affinity and the spectrum of biological effects resulting from small molecule-protein interactions. 
Accurate bioactivity prediction is fundamental for discerning therapeutic potential and off-target toxicity effects, guiding medicinal chemistry efforts in discovering and optimizing potential small-molecule therapeutics, and is thus pivotal to the development of safe and effective drugs <cite class="ltx_cite ltx_citemacro_cite">Tropsha et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib52" title="">2024</a>); Gaulton and Overington (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib22" title="">2010</a>)</cite>. For a small molecule to modulate the function of its protein target and exert its biological effects, it must be recognized by the protein through three-dimensional complementarity of shape and properties <cite class="ltx_cite ltx_citemacro_cite">Verma et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib55" title="">2010</a>)</cite>. This level of detail is indispensable, as knowing only the materials of a lock (protein) and the blueprint of a key (small molecule) is insufficient; 3D information is required to understand how the key fits into the lock and functions <cite class="ltx_cite ltx_citemacro_cite">Koshland Jr (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib31" title="">1995</a>); Eschenmoser (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib18" title="">1995</a>)</cite>. Current challenges in bioactivity prediction largely stem from the scarcity of high-quality, 3D structural data on small molecule-protein interactions. </div> <div class="ltx_para ltx_noindent" id="S1.p2"> The existing structural data on small molecule-protein interactions are markedly insufficient. Structural data derived from wet-lab experiments are limited, owing to the laborious and time-consuming nature of these assays. Additionally, this data often lacks comprehensive bioactivity annotations and is poorly organized with respect to bioactivity assay types. 
While computational modeling approaches have been employed to generate structural datasets, these efforts have yielded datasets of modest size with limited molecular diversity. The paucity of high-quality structural data still imposes a significant barrier to accurate bioactivity prediction, highlighting the critical need for computationally generated, large-scale, high-quality datasets. </div> <div class="ltx_para ltx_noindent" id="S1.p3"> To address this critical need, we present SIU: a million-scale Structural small molecule-protein Interaction dataset for Unbiased bioactivity prediction, the largest and most comprehensive structural dataset available to date. SIU comprises over 5.34 million conformations, integrating both structural and bioactivity information for small molecule-protein interactions. Within this dataset, small molecule-protein pairs feature 1.38 million rigorously curated bioactivity annotations, each with a clear assay type designation. Our dataset provides extensive coverage of diverse small molecules, encompassing both active and inactive compounds, thereby surpassing the limitations of datasets restricted to molecules structurally similar to co-crystal ligands. It also includes a wide array of protein targets, covering all common protein classes, with each protein associated with multiple PDB IDs reflecting distinct pocket conformations. Our robust data generation pipeline employs a multi-software docking and consensus-filtering approach to ensure the precise modeling of small molecule-protein complexes. Bioactivity labels are meticulously curated and systematically organized according to assay types. SIU represents a significant advancement, offering a solid foundation for unbiased bioactivity prediction and enabling more accurate and comprehensive pharmaceutical investigations. 
</div> <div class="ltx_para ltx_noindent" id="S1.p4"> We conducted experiments with several classical baseline models, and the results demonstrate that our large-scale dataset can improve model performance compared to the widely used PDBbind dataset <cite class="ltx_cite ltx_citemacro_cite">Wang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib57" title="">2004</a>, <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib58" title="">2005</a>)</cite>. Additionally, correlations computed over a mixed pool of protein-molecule pairs are significantly higher than correlations computed after grouping by PDB ID, that is, over the molecules associated with a single protein pocket. This indicates that achieving a high correlation within a PDB ID is more challenging, and that the within-PDB-ID correlation is a more informative metric for evaluating the bioactivity prediction ability of models. This highlights the importance of the unbiased bioactivity prediction task we introduce. The task focuses on the bioactivity differences among molecules within a protein pocket, rather than on the bias introduced by the different bioactivity ranges of different protein pockets. </div> <div class="ltx_para ltx_noindent" id="S1.p5"> In summary, our main contributions are threefold: (1) We introduced a million-scale structural dataset to address the exigent demands of the AI-driven drug discovery (AIDD) community; (2) We devised and rigorously validated a robust, scalable pipeline for producing high-fidelity structural data of small molecule-protein interactions; (3) We accentuated the significance of differentiating among bioactivity assay types and meticulously curated the dataset to enhance this practice in the training and evaluation of bioactivity prediction models. 
</div> </section> <section class="ltx_section" id="S2"> <h2 class="ltx_title ltx_title_section"> 2 Related work</h2> <section class="ltx_paragraph" id="S2.SS0.SSS0.Px1"> <h5 class="ltx_title ltx_title_paragraph">Non-structural datasets on drug-target interaction for bioactivity prediction.</h5> <div class="ltx_para ltx_noindent" id="S2.SS0.SSS0.Px1.p1"> A multitude of datasets are available for drug-target affinity (DTA) prediction; however, these datasets frequently lack structural data concerning the interactions between small molecules and their corresponding targets <cite class="ltx_cite ltx_citemacro_cite">Ekins et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib16" title="">2017</a>)</cite>. Large-scale bioactivity databases such as ChEMBL <cite class="ltx_cite ltx_citemacro_cite">Mendez et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib38" title="">2019</a>); Gaulton et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib23" title="">2012</a>)</cite>, PubChem <cite class="ltx_cite ltx_citemacro_cite">Kim et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib30" title="">2016</a>)</cite>, GuideToPharmacology <cite class="ltx_cite ltx_citemacro_cite">Pawson et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib43" title="">2014</a>)</cite>, and DrugBank <cite class="ltx_cite ltx_citemacro_cite">Wishart et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib62" title="">2018</a>); Law et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib33" title="">2014</a>)</cite> are invaluable resources. Research efforts in DTA prediction primarily focus on binding affinity labels, with datasets such as Davis <cite class="ltx_cite ltx_citemacro_cite">Davis et al. 
(<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib14" title="">2011</a>)</cite> and KIBA <cite class="ltx_cite ltx_citemacro_cite">Tang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib50" title="">2014</a>)</cite> being widely utilized. MoleculeNet <cite class="ltx_cite ltx_citemacro_cite">Wu et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib63" title="">2018</a>)</cite> also comprises non-structural bioactivity data. </div> </section> <section class="ltx_paragraph" id="S2.SS0.SSS0.Px2"> <h5 class="ltx_title ltx_title_paragraph">Structural datasets based on experimental structures for bioactivity prediction.</h5> <div class="ltx_para ltx_noindent" id="S2.SS0.SSS0.Px2.p1"> The Protein Data Bank (PDB) <cite class="ltx_cite ltx_citemacro_cite">Berman et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib5" title="">2000</a>)</cite>, established in 1971, has been an indispensable resource for structural biology, providing extensive structural data on proteins and other biomolecules. However, it lacks direct bioactivity annotations and a systematic categorization of small molecules, and it often includes non-specific and biologically irrelevant compounds. To address these limitations, several specialized databases have been developed, including PDBbind <cite class="ltx_cite ltx_citemacro_cite">Wang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib57" title="">2004</a>, <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib58" title="">2005</a>)</cite>, Binding MOAD <cite class="ltx_cite ltx_citemacro_cite">Hu et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib29" title="">2005</a>)</cite>, KiBank <cite class="ltx_cite ltx_citemacro_cite">Zhang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib66" title="">2004</a>)</cite>, AffinDB <cite class="ltx_cite ltx_citemacro_cite">Block et al. 
(<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib7" title="">2006</a>)</cite>, and BioLiP <cite class="ltx_cite ltx_citemacro_cite">Yang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib64" title="">2012</a>); Wei et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib59" title="">2024</a>)</cite>. Their curation significantly enhances the utility of these datasets for structure-based pharmaceutical research. Nevertheless, the reliance on labor-intensive experimental data acquisition limits the scalability and rapid expansion of these databases. Moreover, some databases lack explicit guidelines against mixing assay types. </div> </section> <section class="ltx_paragraph" id="S2.SS0.SSS0.Px3"> <h5 class="ltx_title ltx_title_paragraph">Structural datasets based on modeling structures for bioactivity prediction.</h5> <div class="ltx_para ltx_noindent" id="S2.SS0.SSS0.Px3.p1"> Computational methods have been employed to construct datasets modeling small molecule-protein interaction structures that correspond with experimental bioactivities. The Natural Ligand DataBase (NLDB) <cite class="ltx_cite ltx_citemacro_cite">Murakami et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib40" title="">2016</a>)</cite> includes 7,053 complex structures, some of which are computationally modeled. The eModel-BDB <cite class="ltx_cite ltx_citemacro_cite">Naderi et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib41" title="">2018</a>)</cite> reports 200,005 structural entries, though it encounters issues such as steric clashes <cite class="ltx_cite ltx_citemacro_cite">Li et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib35" title="">2024</a>)</cite>. BindingNet <cite class="ltx_cite ltx_citemacro_cite">Li et al. 
(<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib35" title="">2024</a>)</cite> represents a novel dataset comprising 69,816 high-quality modeled structures obtained through comparative complex structure modeling. This modeling technique, however, requires the modeled small molecules to have structural similarity to co-crystal ligands from experiments, thereby limiting the quantity and diversity of the small molecules modeled in BindingNet. </div> </section> </section> <section class="ltx_section" id="S3"> <h2 class="ltx_title ltx_title_section"> 3 SIU dataset construction and overview</h2> <div class="ltx_para ltx_noindent" id="S3.p1"> The SIU dataset is a pioneering resource for predicting bioactivity, offering a comprehensive collection of small molecule-protein interactions with meticulously annotated bioactivity information. This section details the construction methods employed to ensure the data’s quality, diversity, and organization for downstream tasks. </div> <figure class="ltx_figure" id="S3.F1"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="228" id="S3.F1.g1" src="extracted/5664306/images/m.png" width="598"/> <figcaption class="ltx_caption ltx_centering">Figure 1: Pipeline for SIU construction. (A) Small molecules and protein targets were obtained from corresponding databases, cleaned, and deduplicated. Different small molecules binding to the same protein and different pockets (different PDB IDs) of the same protein were filtered and analyzed. (B) These were then subjected to a multi-software docking pipeline, where the small molecules were prepared and docked to their wet-experiment confirmed targets using three different software programs. The resulting poses were filtered through a voting mechanism to construct the final dataset. 
(C) The dataset is well-organized and contains multiple pockets for each protein and multiple molecules for each pocket, allowing downstream tasks to be performed per PDB ID and per assay type.</figcaption> </figure> <section class="ltx_subsection" id="S3.SS1"> <h3 class="ltx_title ltx_title_subsection"> 3.1 Methods</h3> <section class="ltx_subsubsection" id="S3.SS1.SSS1"> <h4 class="ltx_title ltx_title_subsubsection"> 3.1.1 Data cleaning and deduplication</h4> <section class="ltx_paragraph" id="S3.SS1.SSS1.Px1"> <h5 class="ltx_title ltx_title_paragraph">Bioactivity data extracting.</h5> <div class="ltx_para ltx_noindent" id="S3.SS1.SSS1.Px1.p1"> We retrieved non-structural bioactivity data from established databases: ChEMBL <cite class="ltx_cite ltx_citemacro_cite">Mendez et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib38" title="">2019</a>); Gaulton et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib23" title="">2012</a>)</cite> and BindingDB <cite class="ltx_cite ltx_citemacro_cite">Chen et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib9" title="">2001</a>); Liu et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib37" title="">2007</a>); Gilson et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib25" title="">2016</a>)</cite>. Molecules were filtered based on predefined criteria. Assays measuring bioactivity of small molecule-protein interactions were selected and filtered. The protein target information for each assay was carefully identified and standardized using UniProt IDs <cite class="ltx_cite ltx_citemacro_cite">Consortium (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib11" title="">2015</a>); uni (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib1" title="">2017</a>)</cite>, ensuring consistency across datasets and facilitating matching with experimental structural data. 
</div> <div class="ltx_para ltx_noindent" id="S3.SS1.SSS1.Px1.p2"> The small-molecule filtering criteria exclude molecules that are not drug-like, based on molecular weight, atom composition, and element restrictions. All small molecules retained their original IUPAC International Chemical Identifier (InChI) keys <cite class="ltx_cite ltx_citemacro_cite">Heller et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib28" title="">2015</a>)</cite> and Simplified Molecular Input Line Entry System (SMILES) notations <cite class="ltx_cite ltx_citemacro_cite">Weininger (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib60" title="">1988</a>); Weininger et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib61" title="">1989</a>)</cite> from the databases to avoid mismatches due to different software calculations. Additionally, docking many structurally similar small molecules to a single target wastes computational resources. We therefore examined targets associated with an excessively high number of small molecules and introduced an additional filter based on extended-connectivity fingerprint (ECFP) similarity between small molecules <cite class="ltx_cite ltx_citemacro_cite">Rogers and Hahn (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib47" title="">2010</a>)</cite>, preserving both the quality and diversity of small molecules while minimizing the computational expense of molecular docking. </div> <div class="ltx_para ltx_noindent" id="S3.SS1.SSS1.Px1.p3"> The bioactivity data filtering process is also rigorously defined to ensure high quality. Data from ChEMBL and BindingDB were independently extracted and cleaned before merging. 
For ChEMBL, criteria included assays involving only a single protein target, assays being either binding or functional, and bioactivity labels having standard relations, values, and units (i.e., <math alttext="pM" class="ltx_Math" display="inline" id="S3.SS1.SSS1.Px1.p3.1.m1.1"><semantics id="S3.SS1.SSS1.Px1.p3.1.m1.1a"><mrow id="S3.SS1.SSS1.Px1.p3.1.m1.1.1" xref="S3.SS1.SSS1.Px1.p3.1.m1.1.1.cmml"><mi id="S3.SS1.SSS1.Px1.p3.1.m1.1.1.2" xref="S3.SS1.SSS1.Px1.p3.1.m1.1.1.2.cmml">p</mi><mo id="S3.SS1.SSS1.Px1.p3.1.m1.1.1.1" xref="S3.SS1.SSS1.Px1.p3.1.m1.1.1.1.cmml">⁢</mo><mi id="S3.SS1.SSS1.Px1.p3.1.m1.1.1.3" xref="S3.SS1.SSS1.Px1.p3.1.m1.1.1.3.cmml">M</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS1.Px1.p3.1.m1.1b"><apply id="S3.SS1.SSS1.Px1.p3.1.m1.1.1.cmml" xref="S3.SS1.SSS1.Px1.p3.1.m1.1.1"><times id="S3.SS1.SSS1.Px1.p3.1.m1.1.1.1.cmml" xref="S3.SS1.SSS1.Px1.p3.1.m1.1.1.1"></times><ci id="S3.SS1.SSS1.Px1.p3.1.m1.1.1.2.cmml" xref="S3.SS1.SSS1.Px1.p3.1.m1.1.1.2">𝑝</ci><ci id="S3.SS1.SSS1.Px1.p3.1.m1.1.1.3.cmml" xref="S3.SS1.SSS1.Px1.p3.1.m1.1.1.3">𝑀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS1.Px1.p3.1.m1.1c">pM</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS1.Px1.p3.1.m1.1d">italic_p italic_M</annotation></semantics></math>, <math alttext="nM" class="ltx_Math" display="inline" id="S3.SS1.SSS1.Px1.p3.2.m2.1"><semantics id="S3.SS1.SSS1.Px1.p3.2.m2.1a"><mrow id="S3.SS1.SSS1.Px1.p3.2.m2.1.1" xref="S3.SS1.SSS1.Px1.p3.2.m2.1.1.cmml"><mi id="S3.SS1.SSS1.Px1.p3.2.m2.1.1.2" xref="S3.SS1.SSS1.Px1.p3.2.m2.1.1.2.cmml">n</mi><mo id="S3.SS1.SSS1.Px1.p3.2.m2.1.1.1" xref="S3.SS1.SSS1.Px1.p3.2.m2.1.1.1.cmml">⁢</mo><mi id="S3.SS1.SSS1.Px1.p3.2.m2.1.1.3" xref="S3.SS1.SSS1.Px1.p3.2.m2.1.1.3.cmml">M</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS1.Px1.p3.2.m2.1b"><apply id="S3.SS1.SSS1.Px1.p3.2.m2.1.1.cmml" xref="S3.SS1.SSS1.Px1.p3.2.m2.1.1"><times 
id="S3.SS1.SSS1.Px1.p3.2.m2.1.1.1.cmml" xref="S3.SS1.SSS1.Px1.p3.2.m2.1.1.1"></times><ci id="S3.SS1.SSS1.Px1.p3.2.m2.1.1.2.cmml" xref="S3.SS1.SSS1.Px1.p3.2.m2.1.1.2">𝑛</ci><ci id="S3.SS1.SSS1.Px1.p3.2.m2.1.1.3.cmml" xref="S3.SS1.SSS1.Px1.p3.2.m2.1.1.3">𝑀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS1.Px1.p3.2.m2.1c">nM</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS1.Px1.p3.2.m2.1d">italic_n italic_M</annotation></semantics></math>, or <math alttext="{\mu}M" class="ltx_Math" display="inline" id="S3.SS1.SSS1.Px1.p3.3.m3.1"><semantics id="S3.SS1.SSS1.Px1.p3.3.m3.1a"><mrow id="S3.SS1.SSS1.Px1.p3.3.m3.1.1" xref="S3.SS1.SSS1.Px1.p3.3.m3.1.1.cmml"><mi id="S3.SS1.SSS1.Px1.p3.3.m3.1.1.2" xref="S3.SS1.SSS1.Px1.p3.3.m3.1.1.2.cmml">μ</mi><mo id="S3.SS1.SSS1.Px1.p3.3.m3.1.1.1" xref="S3.SS1.SSS1.Px1.p3.3.m3.1.1.1.cmml">⁢</mo><mi id="S3.SS1.SSS1.Px1.p3.3.m3.1.1.3" xref="S3.SS1.SSS1.Px1.p3.3.m3.1.1.3.cmml">M</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS1.Px1.p3.3.m3.1b"><apply id="S3.SS1.SSS1.Px1.p3.3.m3.1.1.cmml" xref="S3.SS1.SSS1.Px1.p3.3.m3.1.1"><times id="S3.SS1.SSS1.Px1.p3.3.m3.1.1.1.cmml" xref="S3.SS1.SSS1.Px1.p3.3.m3.1.1.1"></times><ci id="S3.SS1.SSS1.Px1.p3.3.m3.1.1.2.cmml" xref="S3.SS1.SSS1.Px1.p3.3.m3.1.1.2">𝜇</ci><ci id="S3.SS1.SSS1.Px1.p3.3.m3.1.1.3.cmml" xref="S3.SS1.SSS1.Px1.p3.3.m3.1.1.3">𝑀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS1.Px1.p3.3.m3.1c">{\mu}M</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS1.Px1.p3.3.m3.1d">italic_μ italic_M</annotation></semantics></math>). BindingDB data were extracted using similar logic, with slightly different filters due to database differences. The cleaned datasets were merged using InChI keys for small molecules and UniProt IDs for protein targets, ensuring precise matching of bioactivity labels to their respective small molecule-protein interactions. 
All small molecule-protein pairs with matched bioactivity labels were subsequently docked. The bioactivity information was standardized to a unit of <math alttext="mol/L" class="ltx_Math" display="inline" id="S3.SS1.SSS1.Px1.p3.4.m4.1"><semantics id="S3.SS1.SSS1.Px1.p3.4.m4.1a"><mrow id="S3.SS1.SSS1.Px1.p3.4.m4.1.1" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.cmml"><mrow id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.cmml"><mi id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.2" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.2.cmml">m</mi><mo id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.1" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.1.cmml">⁢</mo><mi id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.3" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.3.cmml">o</mi><mo id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.1a" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.1.cmml">⁢</mo><mi id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.4" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.4.cmml">l</mi></mrow><mo id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.1" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.1.cmml">/</mo><mi id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.3" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.3.cmml">L</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS1.Px1.p3.4.m4.1b"><apply id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.cmml" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1"><divide id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.1.cmml" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.1"></divide><apply id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.cmml" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2"><times id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.1.cmml" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.1"></times><ci id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.2.cmml" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.2">𝑚</ci><ci id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.3.cmml" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.3">𝑜</ci><ci id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.4.cmml" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.4">𝑙</ci></apply><ci id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.3.cmml" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.3">𝐿</ci></apply></annotation-xml><annotation encoding="application/x-tex" 
id="S3.SS1.SSS1.Px1.p3.4.m4.1c">mol/L</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS1.Px1.p3.4.m4.1d">italic_m italic_o italic_l / italic_L</annotation></semantics></math> (<math alttext="M" class="ltx_Math" display="inline" id="S3.SS1.SSS1.Px1.p3.5.m5.1"><semantics id="S3.SS1.SSS1.Px1.p3.5.m5.1a"><mi id="S3.SS1.SSS1.Px1.p3.5.m5.1.1" xref="S3.SS1.SSS1.Px1.p3.5.m5.1.1.cmml">M</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS1.Px1.p3.5.m5.1b"><ci id="S3.SS1.SSS1.Px1.p3.5.m5.1.1.cmml" xref="S3.SS1.SSS1.Px1.p3.5.m5.1.1">𝑀</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS1.Px1.p3.5.m5.1c">M</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS1.Px1.p3.5.m5.1d">italic_M</annotation></semantics></math>) and converted to negative logarithmic values, similar to datasets for drug-target binding affinity prediction <cite class="ltx_cite ltx_citemacro_cite">Öztürk et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib42" title="">2018</a>)</cite>. </div> </section> <section class="ltx_paragraph" id="S3.SS1.SSS1.Px2"> <h5 class="ltx_title ltx_title_paragraph">PDB structure retrieval and mapping.</h5> <div class="ltx_para ltx_noindent" id="S3.SS1.SSS1.Px2.p1"> The protein structures were downloaded and matched with their UniProt IDs to ensure accurate alignment with bioactivity data. These structures were parsed into individual PDB format files, each representing a distinct pocket. Identified by PDB IDs, pockets are functional regions of the protein that interact with small molecules. We developed a filtering mechanism that leverages chemical and biological knowledge to eliminate PDB files containing non-specific or biologically irrelevant co-crystallized ligands not occupying genuine binding sites. The number of PDB files associated with each protein target varied significantly. Docking all these structures is computationally expensive and offers diminishing returns in terms of novel information. 
We addressed this issue by implementing Fast Local Alignment of Protein Pockets (FLAPP) <cite class="ltx_cite ltx_citemacro_cite">Sankar et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib48" title="">2022</a>)</cite> and other methods to further deduplicate the pocket library. As illustrated in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.F3" title="Figure 3 ‣ Dataset split. ‣ 3.1.3 Data construction for downstream tasks ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">3</a>(A), this step efficiently removed highly similar pockets, resulting in a more streamlined pocket collection for docking simulations. </div> </section> </section> <section class="ltx_subsubsection" id="S3.SS1.SSS2"> <h4 class="ltx_title ltx_title_subsubsection"> 3.1.2 Structural data construction via multi-software docking</h4> <section class="ltx_paragraph" id="S3.SS1.SSS2.Px1"> <h5 class="ltx_title ltx_title_paragraph">Molecular docking.</h5> <div class="ltx_para ltx_noindent" id="S3.SS1.SSS2.Px1.p1"> SIU employs multiple docking software programs <cite class="ltx_cite ltx_citemacro_cite">Friesner et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib20" title="">2004</a>); Verdonk et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib54" title="">2003</a>); Trott and Olson (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib53" title="">2010</a>)</cite>, reducing reliance on any individual docking software. Initial 3D conformations for the small molecules were generated. For molecules with chiral centers, different stereoisomers were explored and included. Ionization states of molecules at physiological pH were also considered to ensure accurate representations of their charged forms. 
Multiple conformations were prepared for each small molecule to account for their flexibility. The preprocessed data were organized into formats compatible with the chosen docking software. The protein targets were prepared, and grid files were generated according to each software’s specific requirements to ensure compatibility. Small molecules were then docked into the binding pockets of the protein structures. </div> </section> <section class="ltx_paragraph" id="S3.SS1.SSS2.Px2"> <h5 class="ltx_title ltx_title_paragraph">Consensus filtering of docking poses.</h5> <div class="ltx_para ltx_noindent" id="S3.SS1.SSS2.Px2.p1"> The molecular docking results undergo rigorous scrutiny to ensure the retention of only credible poses. SIU employs a stringent filtering process: only those docking poses that exhibit consistency across at least two out of three different docking software results are retained. This consensus-based approach mitigates the inclusion of erroneous or misleading docking poses, thereby augmenting the overall quality and reliability of the dataset. </div> <figure class="ltx_figure" id="S3.F2"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="154" id="S3.F2.g1" src="extracted/5664306/images/rmsd_example_3.png" width="598"/> <figcaption class="ltx_caption ltx_centering">Figure 2: Capability of RMSD to quantify differences in docking poses. (A) RMSD 1.544, well-superimposed poses. (B) RMSD 1.985, similar binding modes. (C) RMSD 8.095, fundamentally different binding modes.</figcaption> </figure> <div class="ltx_para ltx_noindent" id="S3.SS1.SSS2.Px2.p2"> Different docking poses of the same small molecule-PDB pair were evaluated using the root mean square deviation (RMSD). Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.F2" title="Figure 2 ‣ Consensus filtering of docking poses. 
‣ 3.1.2 Structural data construction via multi-software docking ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">2</a> shows the RMSD and corresponding poses of a single Glide docking pose compared with the top three docking poses generated by GOLD. When the RMSD is about 2 Å, the key interactions are maintained, indicating a potentially valid docking result. This example underscores the importance of RMSD as a metric for evaluating the consistency and reliability of docking poses, with higher RMSD values indicating divergent binding modes that may arise from incorrect small molecule-protein interaction mode predictions. We further investigated the trade-off between pose accuracy and the quantity of retained data, conducting experiments to observe the impact of varying RMSD values on these factors. Co-crystal poses, used as the ground truth, were extracted from PDB complexes and redocked into the original PDB pockets according to our docking procedure. The results, shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.F3" title="Figure 3 ‣ Dataset split. ‣ 3.1.3 Data construction for downstream tasks ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">3</a>(B), indicate that when the RMSD is less than 2 Å, a significant number of molecules can be retained, and the success ratio of the poses is satisfactory; as the RMSD increases, the number of retained poses rises slightly, but the accuracy of these poses decreases significantly. Therefore, an RMSD of 2 Å was selected as the cutoff. 
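The consensus rule described above (retain a pose only if at least two of the three docking programs agree within the RMSD cutoff) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: each pose is assumed to be a list of (x, y, z) atom coordinates with identical atom ordering, RMSD is computed in place without realignment because all poses share the pocket frame, molecular symmetry is ignored, and the cutoff is assumed to be in angstroms. The names `pose_rmsd`, `consensus_poses`, and the program labels are hypothetical.

```python
import math

RMSD_CUTOFF = 2.0  # cutoff from the text; units assumed to be angstroms


def pose_rmsd(a, b):
    """In-place RMSD between two poses given as lists of (x, y, z) tuples."""
    if len(a) != len(b) or not a:
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = sum(
        (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        for (x1, y1, z1), (x2, y2, z2) in zip(a, b)
    )
    return math.sqrt(squared / len(a))


def consensus_poses(poses_by_program, cutoff=RMSD_CUTOFF):
    """Keep poses that agree with a pose from at least one other program.

    poses_by_program maps a program label to a list of poses. A pose that is
    within the cutoff of another program's pose is supported by two programs.
    Returns a list of (program, pose_index) pairs.
    """
    kept = []
    for program, poses in poses_by_program.items():
        for index, pose in enumerate(poses):
            supported = any(
                pose_rmsd(pose, other) <= cutoff
                for other_program, others in poses_by_program.items()
                if other_program != program
                for other in others
            )
            if supported:
                kept.append((program, index))
    return kept


def shift(pose, dx, dy, dz):
    """Translate every atom of a pose by a fixed offset."""
    return [(x + dx, y + dy, z + dz) for x, y, z in pose]


# Toy example: a three-atom ligand with one pose per program.
base = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
poses = {
    "program_a": [base],
    "program_b": [shift(base, 0.5, 0.0, 0.0)],  # close to program_a
    "program_c": [shift(base, 0.0, 8.0, 0.0)],  # far from both others
}
kept = consensus_poses(poses)
```

In this toy case the poses from `program_a` and `program_b` support each other and are kept, while the `program_c` pose has no partner within the cutoff and is dropped. A production implementation would also need symmetry-aware atom matching, which this sketch omits.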
</div> </section> </section> <section class="ltx_subsubsection" id="S3.SS1.SSS3"> <h4 class="ltx_title ltx_title_subsubsection"> 3.1.3 Data construction for downstream tasks</h4> <section class="ltx_paragraph" id="S3.SS1.SSS3.Px1"> <h5 class="ltx_title ltx_title_paragraph">Dataset organization for unbiased bioactivity prediction.</h5> <div class="ltx_para ltx_noindent" id="S3.SS1.SSS3.Px1.p1"> We organized the data by PDB structure and by assay type to facilitate unbiased bioactivity prediction, addressing the common issue of mixing the dissociation constant (<math alttext="K_{d}" class="ltx_Math" display="inline" id="S3.SS1.SSS3.Px1.p1.1.m1.1"><semantics id="S3.SS1.SSS3.Px1.p1.1.m1.1a"><msub id="S3.SS1.SSS3.Px1.p1.1.m1.1.1" xref="S3.SS1.SSS3.Px1.p1.1.m1.1.1.cmml"><mi id="S3.SS1.SSS3.Px1.p1.1.m1.1.1.2" xref="S3.SS1.SSS3.Px1.p1.1.m1.1.1.2.cmml">K</mi><mi id="S3.SS1.SSS3.Px1.p1.1.m1.1.1.3" xref="S3.SS1.SSS3.Px1.p1.1.m1.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS3.Px1.p1.1.m1.1b"><apply id="S3.SS1.SSS3.Px1.p1.1.m1.1.1.cmml" xref="S3.SS1.SSS3.Px1.p1.1.m1.1.1"><csymbol cd="ambiguous" id="S3.SS1.SSS3.Px1.p1.1.m1.1.1.1.cmml" xref="S3.SS1.SSS3.Px1.p1.1.m1.1.1">subscript</csymbol><ci id="S3.SS1.SSS3.Px1.p1.1.m1.1.1.2.cmml" xref="S3.SS1.SSS3.Px1.p1.1.m1.1.1.2">𝐾</ci><ci id="S3.SS1.SSS3.Px1.p1.1.m1.1.1.3.cmml" xref="S3.SS1.SSS3.Px1.p1.1.m1.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS3.Px1.p1.1.m1.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS3.Px1.p1.1.m1.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math>) <cite class="ltx_cite ltx_citemacro_cite">Lineweaver and Burk (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib36" title="">1934</a>)</cite> and the inhibition constant (<math alttext="K_{i}" class="ltx_Math" display="inline" id="S3.SS1.SSS3.Px1.p1.2.m2.1"><semantics 
id="S3.SS1.SSS3.Px1.p1.2.m2.1a"><msub id="S3.SS1.SSS3.Px1.p1.2.m2.1.1" xref="S3.SS1.SSS3.Px1.p1.2.m2.1.1.cmml"><mi id="S3.SS1.SSS3.Px1.p1.2.m2.1.1.2" xref="S3.SS1.SSS3.Px1.p1.2.m2.1.1.2.cmml">K</mi><mi id="S3.SS1.SSS3.Px1.p1.2.m2.1.1.3" xref="S3.SS1.SSS3.Px1.p1.2.m2.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS3.Px1.p1.2.m2.1b"><apply id="S3.SS1.SSS3.Px1.p1.2.m2.1.1.cmml" xref="S3.SS1.SSS3.Px1.p1.2.m2.1.1"><csymbol cd="ambiguous" id="S3.SS1.SSS3.Px1.p1.2.m2.1.1.1.cmml" xref="S3.SS1.SSS3.Px1.p1.2.m2.1.1">subscript</csymbol><ci id="S3.SS1.SSS3.Px1.p1.2.m2.1.1.2.cmml" xref="S3.SS1.SSS3.Px1.p1.2.m2.1.1.2">𝐾</ci><ci id="S3.SS1.SSS3.Px1.p1.2.m2.1.1.3.cmml" xref="S3.SS1.SSS3.Px1.p1.2.m2.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS3.Px1.p1.2.m2.1c">K_{i}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS3.Px1.p1.2.m2.1d">italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math>) <cite class="ltx_cite ltx_citemacro_cite">Yung-Chi and Prusoff (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib65" title="">1973</a>)</cite> data and neglecting the data of other bioactivities. This meticulous organization supports the evaluation of small molecule-protein interactions with high fidelity, ensuring that the inherent differences in PDB files and assay types are respected. </div> </section> <section class="ltx_paragraph" id="S3.SS1.SSS3.Px2"> <h5 class="ltx_title ltx_title_paragraph">Dataset split.</h5> <div class="ltx_para ltx_noindent" id="S3.SS1.SSS3.Px2.p1"> To ensure the generalizability of the experimental findings with SIU, we employed a manual curation approach for dataset splitting. We selected a set of 10 representative protein targets to serve as the test set. 
These targets were intentionally chosen to cover a diverse range of protein classes, including well-known drug targets such as G-Protein Coupled Receptors (GPCRs) <cite class="ltx_cite ltx_citemacro_cite">Hauser et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib27" title="">2017</a>)</cite>, kinases <cite class="ltx_cite ltx_citemacro_cite">Attwood et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib4" title="">2021</a>); Cohen et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib10" title="">2021</a>)</cite>, and cytochromes <cite class="ltx_cite ltx_citemacro_cite">Danielson (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib13" title="">2002</a>)</cite>. This selection strategy was designed to encompass the bioactivity landscape across various protein functionalities, thereby enhancing the applicability of our results to a wider range of potential drug discovery applications. We conducted non-homology analyses at two levels, 0.6 and 0.9, to ensure the independence and diversity of the training and test sets. Both versions, 0.9 and 0.6, allocate 21,528 data pairs for testing. Version 0.9 includes 1,250,807 data pairs for training and validation, while version 0.6 includes 386,330. </div> <figure class="ltx_figure" id="S3.F3"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="299" id="S3.F3.g1" src="extracted/5664306/images/fig_all_new_2.png" width="598"/> <figcaption class="ltx_caption ltx_centering">Figure 3: Filter selection and dataset statistics. (A) Distribution of the number of PDB files per protein target before and after filtering. (B) Influence of RMSD on success and retention ratios. 
(C) Pairwise t-test p-value differences between the negative logarithmic assay values of four representative assay types, visualized in a heatmap, along with the distribution of the values for each type. (D) Differences in assay values for ten representative protein targets, illustrated by a heatmap of their pairwise t-test p-values, and their distribution.</figcaption> </figure> </section> </section> </section> <section class="ltx_subsection" id="S3.SS2"> <h3 class="ltx_title ltx_title_subsection"> 3.2 Dataset overview</h3> <section class="ltx_paragraph" id="S3.SS2.SSS0.Px1"> <h5 class="ltx_title ltx_title_paragraph">Large-scale.</h5> <div class="ltx_para ltx_noindent" id="S3.SS2.SSS0.Px1.p1"> The SIU dataset comprises 5,342,250 conformations detailing small molecule-protein interactions, each entry providing comprehensive structural and bioactivity information. It includes 1,385,201 bioactivity labels derived from wet-lab experiments, each with standardized values and clear assay type annotations. </div> </section> <section class="ltx_paragraph" id="S3.SS2.SSS0.Px2"> <h5 class="ltx_title ltx_title_paragraph">Diversity.</h5> <div class="ltx_para ltx_noindent" id="S3.SS2.SSS0.Px2.p1"> SIU offers an extensive range of data, encompassing 214,686 diverse small molecules and 1,720 distinct protein targets. It includes experimentally validated low-bioactivity or inactive molecules, often absent in structural datasets from wet-lab experiments, thus providing valuable negative data for AIDD. The dataset features extensive protein pocket coverage, including proteins from humans, E. coli <cite class="ltx_cite ltx_citemacro_cite">Vila et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib56" title="">2016</a>)</cite>, various viruses, and other organisms. It spans major protein classes such as GPCRs <cite class="ltx_cite ltx_citemacro_cite">Hauser et al. 
(<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib27" title="">2017</a>)</cite>, kinases <cite class="ltx_cite ltx_citemacro_cite">Attwood et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib4" title="">2021</a>); Cohen et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib10" title="">2021</a>)</cite>, nuclear receptors <cite class="ltx_cite ltx_citemacro_cite">Robinson-Rechavi et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib46" title="">2003</a>)</cite>, cytochromes <cite class="ltx_cite ltx_citemacro_cite">Danielson (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib13" title="">2002</a>)</cite>, ion channels <cite class="ltx_cite ltx_citemacro_cite">Ashcroft (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib3" title="">1999</a>)</cite>, and other proteins involved in complex biological processes such as epigenetics <cite class="ltx_cite ltx_citemacro_cite">Gibney and Nolan (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib24" title="">2010</a>); Feinberg (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib19" title="">2008</a>)</cite> and transcription <cite class="ltx_cite ltx_citemacro_cite">Cramer (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib12" title="">2019</a>); Lambert et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib32" title="">2018</a>)</cite>. As illustrated in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.F3" title="Figure 3 ‣ Dataset split. ‣ 3.1.3 Data construction for downstream tasks ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">3</a>(D), the assay values of different protein targets vary significantly. 
This broad coverage ensures a comprehensive representation of small molecule-protein interaction modes, enhancing the relevance of our bioactivity prediction tasks to real biological environments. </div> </section> <section class="ltx_paragraph" id="S3.SS2.SSS0.Px3"> <h5 class="ltx_title ltx_title_paragraph">High-quality.</h5> <div class="ltx_para ltx_noindent" id="S3.SS2.SSS0.Px3.p1"> The structural information on small molecule-protein interactions in SIU is of high quality, due to our multi-software voting mechanism that maximizes docking accuracy within computational limits. As detailed in the structural data construction section, we achieved a satisfactory balance between data accuracy and scale, presenting high-quality data unobtainable with a single docking software or solely by ranking based on software-predicted docking scores. Docking software often provides successful simulated docking poses within the top-ranking positions, but these are not always ranked first by docking scores. Our method, however, is based on the consistency of docking pose sampling across different algorithms. By examining consensus among different docking algorithms, we effectively ensure more accurate docking pose data. </div> </section> <section class="ltx_paragraph" id="S3.SS2.SSS0.Px4"> <h5 class="ltx_title ltx_title_paragraph">Well-organized.</h5> <div class="ltx_para ltx_noindent" id="S3.SS2.SSS0.Px4.p1"> SIU’s bioactivity labels are meticulously curated and systematically organized by PDB IDs and assay types, ensuring data integrity and enabling effective PDB-wise and assay-wise comparisons. This organization offers a robust resource for unbiased bioactivity prediction, addressing the limitations of existing datasets that often fail to distinguish clearly between different bioactivity assay types. Traditional measurements of correlations in bioactivity prediction tasks are often ineffective due to the lack of clarity in existing datasets. 
SIU can also address this problem, ensuring more precise and meaningful analyses. Our structured approach facilitates nuanced assessments, such as evaluating the impact of specific small molecule modifications on protein interactions or comparing the efficacy of different compounds within the same protein pocket context. </div> <div class="ltx_para ltx_noindent" id="S3.SS2.SSS0.Px4.p2"> We argue that the assay types should not be merged due to their distinct characteristics. The heatmap in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.F3" title="Figure 3 ‣ Dataset split. ‣ 3.1.3 Data construction for downstream tasks ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">3</a>(C) presents the results of pairwise t-tests for the four assay types, revealing that half maximal inhibitory concentration (<math alttext="IC_{50}" class="ltx_Math" display="inline" id="S3.SS2.SSS0.Px4.p2.1.m1.1"><semantics id="S3.SS2.SSS0.Px4.p2.1.m1.1a"><mrow id="S3.SS2.SSS0.Px4.p2.1.m1.1.1" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.cmml"><mi id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.2" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.2.cmml">I</mi><mo id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.1" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.1.cmml">⁢</mo><msub id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.cmml"><mi id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.2" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.2.cmml">C</mi><mn id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.3" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.SSS0.Px4.p2.1.m1.1b"><apply id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1"><times id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.1"></times><ci id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.2.cmml" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.2">𝐼</ci><apply id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.cmml" 
xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3"><csymbol cd="ambiguous" id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.1.cmml" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3">subscript</csymbol><ci id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.2.cmml" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.2">𝐶</ci><cn id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.3.cmml" type="integer" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.SSS0.Px4.p2.1.m1.1c">IC_{50}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.SSS0.Px4.p2.1.m1.1d">italic_I italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math>) differs the most in mean value compared to the other assay types, followed by half maximal effective concentration (<math alttext="EC_{50}" class="ltx_Math" display="inline" id="S3.SS2.SSS0.Px4.p2.2.m2.1"><semantics id="S3.SS2.SSS0.Px4.p2.2.m2.1a"><mrow id="S3.SS2.SSS0.Px4.p2.2.m2.1.1" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.cmml"><mi id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.2" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.2.cmml">E</mi><mo id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.1" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.1.cmml">⁢</mo><msub id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.cmml"><mi id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.2" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.2.cmml">C</mi><mn id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.3" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.SSS0.Px4.p2.2.m2.1b"><apply id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1"><times id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.1"></times><ci id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.2.cmml" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.2">𝐸</ci><apply id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.cmml" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3"><csymbol cd="ambiguous" id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.1.cmml" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3">subscript</csymbol><ci 
id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.2.cmml" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.2">𝐶</ci><cn id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.3.cmml" type="integer" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.SSS0.Px4.p2.2.m2.1c">EC_{50}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.SSS0.Px4.p2.2.m2.1d">italic_E italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math>). In contrast, the means of <math alttext="K_{i}" class="ltx_Math" display="inline" id="S3.SS2.SSS0.Px4.p2.3.m3.1"><semantics id="S3.SS2.SSS0.Px4.p2.3.m3.1a"><msub id="S3.SS2.SSS0.Px4.p2.3.m3.1.1" xref="S3.SS2.SSS0.Px4.p2.3.m3.1.1.cmml"><mi id="S3.SS2.SSS0.Px4.p2.3.m3.1.1.2" xref="S3.SS2.SSS0.Px4.p2.3.m3.1.1.2.cmml">K</mi><mi id="S3.SS2.SSS0.Px4.p2.3.m3.1.1.3" xref="S3.SS2.SSS0.Px4.p2.3.m3.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS2.SSS0.Px4.p2.3.m3.1b"><apply id="S3.SS2.SSS0.Px4.p2.3.m3.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.3.m3.1.1"><csymbol cd="ambiguous" id="S3.SS2.SSS0.Px4.p2.3.m3.1.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.3.m3.1.1">subscript</csymbol><ci id="S3.SS2.SSS0.Px4.p2.3.m3.1.1.2.cmml" xref="S3.SS2.SSS0.Px4.p2.3.m3.1.1.2">𝐾</ci><ci id="S3.SS2.SSS0.Px4.p2.3.m3.1.1.3.cmml" xref="S3.SS2.SSS0.Px4.p2.3.m3.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.SSS0.Px4.p2.3.m3.1c">K_{i}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.SSS0.Px4.p2.3.m3.1d">italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="K_{d}" class="ltx_Math" display="inline" id="S3.SS2.SSS0.Px4.p2.4.m4.1"><semantics id="S3.SS2.SSS0.Px4.p2.4.m4.1a"><msub id="S3.SS2.SSS0.Px4.p2.4.m4.1.1" xref="S3.SS2.SSS0.Px4.p2.4.m4.1.1.cmml"><mi id="S3.SS2.SSS0.Px4.p2.4.m4.1.1.2" xref="S3.SS2.SSS0.Px4.p2.4.m4.1.1.2.cmml">K</mi><mi id="S3.SS2.SSS0.Px4.p2.4.m4.1.1.3" 
xref="S3.SS2.SSS0.Px4.p2.4.m4.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS2.SSS0.Px4.p2.4.m4.1b"><apply id="S3.SS2.SSS0.Px4.p2.4.m4.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.4.m4.1.1"><csymbol cd="ambiguous" id="S3.SS2.SSS0.Px4.p2.4.m4.1.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.4.m4.1.1">subscript</csymbol><ci id="S3.SS2.SSS0.Px4.p2.4.m4.1.1.2.cmml" xref="S3.SS2.SSS0.Px4.p2.4.m4.1.1.2">𝐾</ci><ci id="S3.SS2.SSS0.Px4.p2.4.m4.1.1.3.cmml" xref="S3.SS2.SSS0.Px4.p2.4.m4.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.SSS0.Px4.p2.4.m4.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.SSS0.Px4.p2.4.m4.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math> are relatively similar. However, even when the means are not significantly different, the assay types cannot be considered equivalent due to their distinct behaviors. As demonstrated in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.F3" title="Figure 3 ‣ Dataset split. ‣ 3.1.3 Data construction for downstream tasks ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">3</a>(C), the bioactivity assay values vary significantly, with their negative logarithmic values ranging from 2 to 11 and exhibiting markedly different distributions. 
The distribution of <math alttext="K_{d}" class="ltx_Math" display="inline" id="S3.SS2.SSS0.Px4.p2.5.m5.1"><semantics id="S3.SS2.SSS0.Px4.p2.5.m5.1a"><msub id="S3.SS2.SSS0.Px4.p2.5.m5.1.1" xref="S3.SS2.SSS0.Px4.p2.5.m5.1.1.cmml"><mi id="S3.SS2.SSS0.Px4.p2.5.m5.1.1.2" xref="S3.SS2.SSS0.Px4.p2.5.m5.1.1.2.cmml">K</mi><mi id="S3.SS2.SSS0.Px4.p2.5.m5.1.1.3" xref="S3.SS2.SSS0.Px4.p2.5.m5.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS2.SSS0.Px4.p2.5.m5.1b"><apply id="S3.SS2.SSS0.Px4.p2.5.m5.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.5.m5.1.1"><csymbol cd="ambiguous" id="S3.SS2.SSS0.Px4.p2.5.m5.1.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.5.m5.1.1">subscript</csymbol><ci id="S3.SS2.SSS0.Px4.p2.5.m5.1.1.2.cmml" xref="S3.SS2.SSS0.Px4.p2.5.m5.1.1.2">𝐾</ci><ci id="S3.SS2.SSS0.Px4.p2.5.m5.1.1.3.cmml" xref="S3.SS2.SSS0.Px4.p2.5.m5.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.SSS0.Px4.p2.5.m5.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.SSS0.Px4.p2.5.m5.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math> stands out in particular, as its upper values are substantially higher than those of the other assay types. 
The <math alttext="K_{d}" class="ltx_Math" display="inline" id="S3.SS2.SSS0.Px4.p2.6.m6.1"><semantics id="S3.SS2.SSS0.Px4.p2.6.m6.1a"><msub id="S3.SS2.SSS0.Px4.p2.6.m6.1.1" xref="S3.SS2.SSS0.Px4.p2.6.m6.1.1.cmml"><mi id="S3.SS2.SSS0.Px4.p2.6.m6.1.1.2" xref="S3.SS2.SSS0.Px4.p2.6.m6.1.1.2.cmml">K</mi><mi id="S3.SS2.SSS0.Px4.p2.6.m6.1.1.3" xref="S3.SS2.SSS0.Px4.p2.6.m6.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS2.SSS0.Px4.p2.6.m6.1b"><apply id="S3.SS2.SSS0.Px4.p2.6.m6.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.6.m6.1.1"><csymbol cd="ambiguous" id="S3.SS2.SSS0.Px4.p2.6.m6.1.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.6.m6.1.1">subscript</csymbol><ci id="S3.SS2.SSS0.Px4.p2.6.m6.1.1.2.cmml" xref="S3.SS2.SSS0.Px4.p2.6.m6.1.1.2">𝐾</ci><ci id="S3.SS2.SSS0.Px4.p2.6.m6.1.1.3.cmml" xref="S3.SS2.SSS0.Px4.p2.6.m6.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.SSS0.Px4.p2.6.m6.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.SSS0.Px4.p2.6.m6.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math> distribution is more peaked, indicating a narrow concentration around the central values despite its broad range. </div> <div class="ltx_para ltx_noindent" id="S3.SS2.SSS0.Px4.p3"> Moreover, SIU provides multiple small molecule 3D poses for each protein pocket, allowing for unbiased comparison of small molecule poses while maintaining a constant protein pocket environment. This approach yields detailed information on how variations in small molecule poses influence their interactions with the protein pocket, considering factors such as shape and electrostatic complementarity. Ultimately, this enhances the modeling of the relationship between these interactions and observed bioactivity, advancing our understanding of small molecule-protein interactions and their effects. 
</div> </section> </section> </section> <section class="ltx_section" id="S4"> <h2 class="ltx_title ltx_title_section"> 4 SIU experiments and analysis</h2> <figure class="ltx_table" id="S4.T1"> <figcaption class="ltx_caption ltx_centering">Table 1: Results for multi-task learning with different label types. We show results for 3D-CNN, GNN, Uni-Mol, and ProFSA trained on the SIU 0.9 version.</figcaption> <div class="ltx_inline-block ltx_align_center ltx_transformed_outer" id="S4.T1.6" style="width:379.0pt;height:314pt;vertical-align:-0.0pt;">
<table class="ltx_tabular ltx_align_middle">
<thead class="ltx_thead">
<tr class="ltx_tr"><th class="ltx_td ltx_th ltx_th_column">Label type</th><th class="ltx_td ltx_th ltx_th_column">Model</th><th class="ltx_td ltx_th ltx_th_column">RMSE</th><th class="ltx_td ltx_th ltx_th_column">MAE</th><th class="ltx_td ltx_th ltx_th_column">Pearson</th><th class="ltx_td ltx_th ltx_th_column">Pearson∗</th><th class="ltx_td ltx_th ltx_th_column">Spearman</th><th class="ltx_td ltx_th ltx_th_column">Spearman∗</th></tr>
</thead>
<tbody class="ltx_tbody">
<tr class="ltx_tr"><td class="ltx_td" rowspan="4"><math alttext="IC_{50}" class="ltx_Math" display="inline" id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1"><semantics id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1a"><mrow id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.cmml"><mi id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.2" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.2.cmml">I</mi><mo id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.1" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.1.cmml">⁢</mo><msub id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.cmml"><mi id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.2" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.2.cmml">C</mi><mn id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.3" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1b"><apply id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.cmml" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1"><times id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.1.cmml" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.1"></times><ci id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.2.cmml" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.2">𝐼</ci><apply id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.cmml" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3"><csymbol cd="ambiguous" id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.1.cmml" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3">subscript</csymbol><ci id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.2.cmml" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.2">𝐶</ci><cn id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.3.cmml" type="integer" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1c">IC_{50}</annotation><annotation encoding="application/x-llamapun" id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1d">italic_I italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math></td><td class="ltx_td">3D-CNN</td><td class="ltx_td">1.560</td><td class="ltx_td">1.275</td><td class="ltx_td">0.158</td><td class="ltx_td">0.044</td><td class="ltx_td">0.154</td><td class="ltx_td">0.040</td></tr>
<tr class="ltx_tr"><td class="ltx_td">GNN</td><td class="ltx_td">1.412</td><td class="ltx_td">1.141</td><td class="ltx_td">0.336</td><td class="ltx_td">0.241</td><td class="ltx_td">0.316</td><td class="ltx_td">0.235</td></tr>
<tr class="ltx_tr"><td class="ltx_td">Uni-Mol</td><td class="ltx_td">1.353</td><td class="ltx_td">1.092</td><td class="ltx_td">0.462</td><td class="ltx_td">0.343</td><td class="ltx_td">0.466</td><td class="ltx_td">0.351</td></tr>
<tr class="ltx_tr"><td class="ltx_td">ProFSA</td><td class="ltx_td">1.361</td><td class="ltx_td">1.108</td><td class="ltx_td">0.382</td><td class="ltx_td">0.331</td><td class="ltx_td">0.356</td><td class="ltx_td">0.317</td></tr>
<tr class="ltx_tr"><td class="ltx_td" rowspan="4"><math alttext="EC_{50}" class="ltx_Math" display="inline" id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1"><semantics id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1a"><mrow id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.cmml"><mi id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.2" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.2.cmml">E</mi><mo id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.1" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.1.cmml">⁢</mo><msub id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.cmml"><mi id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.2" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.2.cmml">C</mi><mn id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.3" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1b"><apply id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.cmml" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1"><times id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.1.cmml" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.1"></times><ci id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.2.cmml" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.2">𝐸</ci><apply id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.cmml" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3"><csymbol cd="ambiguous" id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.1.cmml" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3">subscript</csymbol><ci id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.2.cmml" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.2">𝐶</ci><cn id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.3.cmml" type="integer" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1c">EC_{50}</annotation><annotation encoding="application/x-llamapun" id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1d">italic_E italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math></td><td class="ltx_td">3D-CNN</td><td class="ltx_td">1.518</td><td class="ltx_td">1.234</td><td class="ltx_td">0.128</td><td class="ltx_td">0.010</td><td class="ltx_td">0.128</td><td class="ltx_td">0.004</td></tr>
<tr class="ltx_tr"><td class="ltx_td">GNN</td><td class="ltx_td">1.334</td><td class="ltx_td">1.025</td><td class="ltx_td">0.444</td><td class="ltx_td">0.108</td><td class="ltx_td">0.481</td><td class="ltx_td">0.120</td></tr>
<tr class="ltx_tr"><td class="ltx_td">Uni-Mol</td><td class="ltx_td">1.273</td><td class="ltx_td">1.017</td><td class="ltx_td">0.428</td><td class="ltx_td">0.178</td><td class="ltx_td">0.461</td><td class="ltx_td">0.144</td></tr>
<tr class="ltx_tr"><td class="ltx_td">ProFSA</td><td class="ltx_td">1.255</td><td class="ltx_td">0.971</td><td class="ltx_td">0.438</td><td class="ltx_td">0.204</td><td class="ltx_td">0.495</td><td class="ltx_td">0.154</td></tr>
<tr class="ltx_tr"><td class="ltx_td" rowspan="4"><math alttext="K_{i}" class="ltx_Math" display="inline" id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1"><semantics id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1a"><msub id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1" xref="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.cmml"><mi id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.2" xref="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.2.cmml">K</mi><mi id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.3" xref="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1b"><apply id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.cmml" xref="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1"><csymbol cd="ambiguous" id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.1.cmml" xref="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1">subscript</csymbol><ci id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.2.cmml" xref="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.2">𝐾</ci><ci id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.3.cmml" xref="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1c">K_{i}</annotation><annotation encoding="application/x-llamapun" id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1d">italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math></td><td class="ltx_td">3D-CNN</td><td class="ltx_td">1.534</td><td class="ltx_td">1.260</td><td class="ltx_td">0.201</td><td class="ltx_td">0.025</td><td class="ltx_td">0.200</td><td class="ltx_td">0.021</td></tr>
<tr class="ltx_tr"><td class="ltx_td">GNN</td><td class="ltx_td">1.814</td><td class="ltx_td">1.504</td><td class="ltx_td">0.247</td><td class="ltx_td">0.099</td><td class="ltx_td">0.107</td><td class="ltx_td">0.058</td></tr>
<tr class="ltx_tr"><td class="ltx_td">Uni-Mol</td><td class="ltx_td">1.390</td><td class="ltx_td">1.133</td><td class="ltx_td">0.375</td><td class="ltx_td">0.092</td><td class="ltx_td">0.324</td><td class="ltx_td">0.056</td></tr>
<tr class="ltx_tr"><td class="ltx_td">ProFSA</td><td class="ltx_td">1.374</td><td class="ltx_td">1.142</td><td class="ltx_td">0.405</td><td class="ltx_td">0.149</td><td class="ltx_td">0.365</td><td class="ltx_td">0.127</td></tr>
<tr class="ltx_tr"><td class="ltx_td" rowspan="4"><math alttext="K_{d}" class="ltx_Math" display="inline" id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1"><semantics id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1a"><msub id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1" xref="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.cmml"><mi id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.2" xref="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.2.cmml">K</mi><mi id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.3" xref="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1b"><apply id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.cmml" xref="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1"><csymbol cd="ambiguous" id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.1.cmml" xref="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1">subscript</csymbol><ci id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.2.cmml" xref="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.2">𝐾</ci><ci id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.3.cmml" xref="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math></td><td class="ltx_td">3D-CNN</td><td class="ltx_td">1.503</td><td class="ltx_td">1.233</td><td class="ltx_td">0.173</td><td class="ltx_td">0.024</td><td class="ltx_td">0.167</td><td class="ltx_td">0.038</td></tr>
<tr class="ltx_tr"><td class="ltx_td">GNN</td><td class="ltx_td">1.711</td><td class="ltx_td">1.431</td><td class="ltx_td">-0.068</td><td class="ltx_td">0.065</td><td class="ltx_td">-0.147</td><td class="ltx_td">0.033</td></tr>
<tr class="ltx_tr"><td class="ltx_td">Uni-Mol</td><td class="ltx_td">1.429</td><td class="ltx_td">1.223</td><td class="ltx_td">-0.084</td><td class="ltx_td">0.155</td><td class="ltx_td">-0.175</td><td class="ltx_td">0.144</td></tr>
<tr class="ltx_tr"><td class="ltx_td">ProFSA</td><td class="ltx_td">1.546</td><td class="ltx_td">1.334</td><td class="ltx_td">-0.172</td><td class="ltx_td">0.057</td><td class="ltx_td">-0.205</td><td class="ltx_td">0.029</td></tr>
</tbody>
</table>
</div> </figure> <div class="ltx_para ltx_noindent" id="S4.p1"> We conducted experiments using several baseline models to analyze our SIU dataset. 
The models tested include a voxel-grid-based 3D-CNN, a Graph Neural Network (GNN), and pretrained models such as Uni-Mol <cite class="ltx_cite ltx_citemacro_citep">(Zhou et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib67" title="">2022</a>)</cite> and ProFSA <cite class="ltx_cite ltx_citemacro_citep">(Gao et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib21" title="">2023</a>)</cite>. Our experiments were performed in both Multi-Task Learning (MTL) and single-target settings. In the MTL setting, all data were combined to train a single MTL model. In the single-target setting, the Uni-Mol model was trained separately on individual labels. </div> <div class="ltx_para ltx_noindent" id="S4.p2"> The metrics used in our analysis are Root Mean Square Error (RMSE), Mean Absolute Error (MAE), the general Pearson and Spearman correlations, and the same correlations after grouping by PDB ID. The general correlations are calculated over all pairs of protein pockets and molecules pooled together, whereas the grouped correlations are calculated across the different molecules within a single protein pocket. We use Pearson∗ to denote the Pearson correlation grouped by PDB ID, and Spearman∗ to denote the Spearman correlation grouped by PDB ID. </div> <div class="ltx_para ltx_noindent" id="S4.p3"> Results for multi-task learning are shown in Table <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S4.T1" title="Table 1 ‣ 4 SIU experiments and analysis ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">1</a>, and results for single-task learning are shown in Table <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S4.T2" title="Table 2 ‣ 4 SIU experiments and analysis ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">2</a>. 
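The difference between the general and the PDB-grouped correlations can be illustrated with a minimal sketch. It is not the evaluation code used for the tables: the function names, the toy data, the minimum group size, and the choice to average per-pocket correlations are all assumptions made for illustration, since the text does not state how per-pocket values are aggregated.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; NaN if undefined."""
    n = len(x)
    if n < 2:
        return math.nan
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return math.nan
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def grouped(metric, pdb_ids, y_true, y_pred, min_size=3):
    """Mean of per-pocket correlations (assumed aggregation).

    Pockets with fewer than `min_size` molecules, or with an undefined
    correlation, are skipped; `min_size` is a hypothetical threshold.
    """
    groups = defaultdict(lambda: ([], []))
    for pid, t, p in zip(pdb_ids, y_true, y_pred):
        groups[pid][0].append(t)
        groups[pid][1].append(p)
    scores = []
    for t, p in groups.values():
        if len(t) >= min_size:
            r = metric(t, p)
            if not math.isnan(r):
                scores.append(r)
    return sum(scores) / len(scores) if scores else math.nan


# Toy data: two hypothetical pockets whose labels sit at different levels.
# Predictions capture the pocket-level offset but invert the ranking of
# molecules inside each pocket.
ids = ["pocketA"] * 3 + ["pocketB"] * 3
y_true = [4.0, 4.5, 5.0, 8.0, 8.5, 9.0]
y_pred = [4.6, 4.5, 4.4, 8.6, 8.5, 8.4]

pooled = pearson(y_true, y_pred)
print("general Pearson:", round(pooled, 3))
print("grouped Pearson:", round(grouped(pearson, ids, y_true, y_pred), 3))
print("grouped Spearman:", round(grouped(spearman, ids, y_true, y_pred), 3))
```

In this toy case the general Pearson correlation is about 0.97 while both grouped correlations are -1, which is why a model can score well on the general metrics by recognizing the pocket alone and still fail to rank molecules within a pocket.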
</div> <figure class="ltx_table" id="S4.T2"> <figcaption class="ltx_caption ltx_centering">Table 2: Results for single-task training with different label types. We show results for the Uni-Mol model trained on the PDBbind dataset and on the 0.6 and 0.9 versions of our SIU dataset.</figcaption> <div class="ltx_inline-block ltx_align_center ltx_transformed_outer" id="S4.T2.6" style="width:381.7pt;height:230pt;vertical-align:-0.0pt;">
<table class="ltx_tabular ltx_align_middle">
<thead class="ltx_thead">
<tr class="ltx_tr"><th class="ltx_td ltx_th ltx_th_column">Label type</th><th class="ltx_td ltx_th ltx_th_column">Train Set</th><th class="ltx_td ltx_th ltx_th_column">RMSE</th><th class="ltx_td ltx_th ltx_th_column">MAE</th><th class="ltx_td ltx_th ltx_th_column">Pearson</th><th class="ltx_td ltx_th ltx_th_column">Pearson∗</th><th class="ltx_td ltx_th ltx_th_column">Spearman</th><th class="ltx_td ltx_th ltx_th_column">Spearman∗</th></tr>
</thead>
<tbody class="ltx_tbody">
<tr class="ltx_tr"><td class="ltx_td" rowspan="3"><math alttext="IC_{50}" class="ltx_Math" display="inline" id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1"><semantics id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1a"><mrow id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.cmml"><mi id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.2" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.2.cmml">I</mi><mo id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.1" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.1.cmml">⁢</mo><msub id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.cmml"><mi id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.2" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.2.cmml">C</mi><mn id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.3" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1b"><apply id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.cmml" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1"><times id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.1.cmml" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.1"></times><ci id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.2.cmml" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.2">𝐼</ci><apply id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.cmml" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3"><csymbol cd="ambiguous" id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.1.cmml" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3">subscript</csymbol><ci id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.2.cmml" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.2">𝐶</ci><cn id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.3.cmml" type="integer" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1c">IC_{50}</annotation><annotation encoding="application/x-llamapun" id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1d">italic_I italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math></td><td class="ltx_td">PDBbind</td><td class="ltx_td">1.575</td><td class="ltx_td">1.279</td><td class="ltx_td">0.430</td><td class="ltx_td">0.245</td><td class="ltx_td">0.425</td><td class="ltx_td">0.229</td></tr>
<tr class="ltx_tr"><td class="ltx_td">SIU 0.6</td><td class="ltx_td">1.407</td><td class="ltx_td">1.138</td><td class="ltx_td">0.461</td><td class="ltx_td">0.317</td><td class="ltx_td">0.463</td><td class="ltx_td">0.311</td></tr>
<tr class="ltx_tr"><td class="ltx_td">SIU 0.9</td><td class="ltx_td">1.357</td><td class="ltx_td">1.099</td><td class="ltx_td">0.470</td><td class="ltx_td">0.345</td><td class="ltx_td">0.474</td><td class="ltx_td">0.347</td></tr>
<tr class="ltx_tr"><td class="ltx_td" rowspan="2"><math alttext="EC_{50}" class="ltx_Math" display="inline" id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1"><semantics id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1a"><mrow id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.cmml"><mi id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.2" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.2.cmml">E</mi><mo id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.1" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.1.cmml">⁢</mo><msub id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.cmml"><mi id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.2" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.2.cmml">C</mi><mn id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.3" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1b"><apply id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.cmml" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1"><times id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.1.cmml" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.1"></times><ci id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.2.cmml" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.2">𝐸</ci><apply id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.cmml" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3"><csymbol cd="ambiguous" id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.1.cmml" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3">subscript</csymbol><ci id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.2.cmml" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.2">𝐶</ci><cn id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.3.cmml" type="integer" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1c">EC_{50}</annotation><annotation encoding="application/x-llamapun" id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1d">italic_E italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math></td><td class="ltx_td">SIU 0.6</td><td class="ltx_td">1.400</td><td class="ltx_td">1.163</td><td class="ltx_td">0.280</td><td class="ltx_td">0.171</td><td class="ltx_td">0.284</td><td class="ltx_td">0.150</td></tr>
<tr class="ltx_tr"><td class="ltx_td">SIU 0.9</td><td class="ltx_td">1.340</td><td class="ltx_td">1.096</td><td class="ltx_td">0.384</td><td class="ltx_td">0.196</td><td class="ltx_td">0.379</td><td class="ltx_td">0.142</td></tr>
<tr class="ltx_tr"><td class="ltx_td" rowspan="3"><math alttext="K_{i}" class="ltx_Math" display="inline" id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1"><semantics id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1a"><msub id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1" xref="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.cmml"><mi id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.2" xref="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.2.cmml">K</mi><mi id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.3" xref="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1b"><apply id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.cmml" xref="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1"><csymbol cd="ambiguous" id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.1.cmml" xref="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1">subscript</csymbol><ci id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.2.cmml" xref="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.2">𝐾</ci><ci id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.3.cmml" xref="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1c">K_{i}</annotation><annotation encoding="application/x-llamapun" id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1d">italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math></td><td class="ltx_td">PDBbind</td><td class="ltx_td">1.315</td><td class="ltx_td">1.085</td><td class="ltx_td">0.368</td><td class="ltx_td">0.040</td><td class="ltx_td">0.323</td><td class="ltx_td">0.026</td></tr>
<tr class="ltx_tr"><td class="ltx_td">SIU 0.6</td><td class="ltx_td">1.255</td><td class="ltx_td">1.034</td><td class="ltx_td">0.472</td><td class="ltx_td">0.106</td><td class="ltx_td">0.452</td><td class="ltx_td">0.112</td></tr>
<tr class="ltx_tr"><td class="ltx_td">SIU 0.9</td><td class="ltx_td">1.235</td><td class="ltx_td">1.017</td><td class="ltx_td">0.485</td><td class="ltx_td">0.036</td><td class="ltx_td">0.452</td><td class="ltx_td">0.041</td></tr>
<tr class="ltx_tr"><td class="ltx_td" rowspan="3"><math alttext="K_{d}" class="ltx_Math" display="inline" id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1"><semantics id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1a"><msub id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1" xref="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.cmml"><mi id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.2" xref="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.2.cmml">K</mi><mi id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.3" xref="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1b"><apply id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.cmml" xref="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1"><csymbol cd="ambiguous" id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.1.cmml" xref="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1">subscript</csymbol><ci id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.2.cmml" xref="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.2">𝐾</ci><ci id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.3.cmml" xref="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math></td><td class="ltx_td">PDBbind</td><td class="ltx_td">1.565</td><td class="ltx_td">1.308</td><td class="ltx_td">0.041</td><td class="ltx_td">0.010</td><td class="ltx_td">0.004</td><td class="ltx_td">0.006</td></tr>
<tr class="ltx_tr"><td class="ltx_td">SIU 0.6</td><td class="ltx_td">1.389</td><td class="ltx_td">1.192</td><td class="ltx_td">-0.149</td><td class="ltx_td">0.052</td><td class="ltx_td">-0.206</td><td class="ltx_td">0.022</td></tr>
<tr class="ltx_tr"><td class="ltx_td">SIU 0.9</td><td class="ltx_td">1.364</td><td class="ltx_td">1.141</td><td class="ltx_td">-0.033</td><td class="ltx_td">0.103</td><td class="ltx_td">-0.082</td><td class="ltx_td">0.065</td></tr>
</tbody>
</table>
</div> </figure> <figure class="ltx_figure" id="S4.F4"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_2"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S4.F4.sf1"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="367" id="S4.F4.sf1.g1" src="extracted/5664306/images/correlation_agg.png" width="598"/> <figcaption class="ltx_caption">(a) </figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_2"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S4.F4.sf2"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="366" id="S4.F4.sf2.g1" src="extracted/5664306/images/correlation_datasets.png" width="598"/> <figcaption class="ltx_caption">(b) </figcaption> </figure> </div> </div> <figcaption class="ltx_caption 
ltx_centering">Figure 4: (a) Pearson and Spearman correlations for various label types, calculated both before and after grouping by PDB IDs. (b) Pearson correlations after grouping PDB IDs for different assay types trained on different datasets.</figcaption> </figure> <section class="ltx_subsection" id="S4.SS1"> <h3 class="ltx_title ltx_title_subsection"> 4.1 Analysis</h3> <section class="ltx_paragraph" id="S4.SS1.SSS0.Px1"> <h5 class="ltx_title ltx_title_paragraph">Different difficulties of assay types.</h5> <div class="ltx_para ltx_noindent" id="S4.SS1.SSS0.Px1.p1"> The bioactivity prediction difficulty varies among different assay types. The <math alttext="K_{d}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.1.m1.1"><semantics id="S4.SS1.SSS0.Px1.p1.1.m1.1a"><msub id="S4.SS1.SSS0.Px1.p1.1.m1.1.1" xref="S4.SS1.SSS0.Px1.p1.1.m1.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.1.m1.1.1.2" xref="S4.SS1.SSS0.Px1.p1.1.m1.1.1.2.cmml">K</mi><mi id="S4.SS1.SSS0.Px1.p1.1.m1.1.1.3" xref="S4.SS1.SSS0.Px1.p1.1.m1.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.1.m1.1b"><apply id="S4.SS1.SSS0.Px1.p1.1.m1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.1.m1.1.1"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.1.m1.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.1.m1.1.1">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.1.m1.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.1.m1.1.1.2">𝐾</ci><ci id="S4.SS1.SSS0.Px1.p1.1.m1.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.1.m1.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.1.m1.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.1.m1.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math> task is the most challenging, primarily due to the varying correlations between different assay types, as shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.F3" title="Figure 3 ‣ Dataset split. 
‣ 3.1.3 Data construction for downstream tasks ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">3</a>(C). Although the means of <math alttext="K_{i}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.2.m2.1"><semantics id="S4.SS1.SSS0.Px1.p1.2.m2.1a"><msub id="S4.SS1.SSS0.Px1.p1.2.m2.1.1" xref="S4.SS1.SSS0.Px1.p1.2.m2.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.2.m2.1.1.2" xref="S4.SS1.SSS0.Px1.p1.2.m2.1.1.2.cmml">K</mi><mi id="S4.SS1.SSS0.Px1.p1.2.m2.1.1.3" xref="S4.SS1.SSS0.Px1.p1.2.m2.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.2.m2.1b"><apply id="S4.SS1.SSS0.Px1.p1.2.m2.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.2.m2.1.1"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.2.m2.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.2.m2.1.1">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.2.m2.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.2.m2.1.1.2">𝐾</ci><ci id="S4.SS1.SSS0.Px1.p1.2.m2.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.2.m2.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.2.m2.1c">K_{i}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.2.m2.1d">italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="K_{d}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.3.m3.1"><semantics id="S4.SS1.SSS0.Px1.p1.3.m3.1a"><msub id="S4.SS1.SSS0.Px1.p1.3.m3.1.1" xref="S4.SS1.SSS0.Px1.p1.3.m3.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.3.m3.1.1.2" xref="S4.SS1.SSS0.Px1.p1.3.m3.1.1.2.cmml">K</mi><mi id="S4.SS1.SSS0.Px1.p1.3.m3.1.1.3" xref="S4.SS1.SSS0.Px1.p1.3.m3.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.3.m3.1b"><apply id="S4.SS1.SSS0.Px1.p1.3.m3.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.3.m3.1.1"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.3.m3.1.1.1.cmml" 
xref="S4.SS1.SSS0.Px1.p1.3.m3.1.1">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.3.m3.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.3.m3.1.1.2">𝐾</ci><ci id="S4.SS1.SSS0.Px1.p1.3.m3.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.3.m3.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.3.m3.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.3.m3.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math> labels do not differ statistically, the correlation between these two data groups remains limited. The intrinsic differences in assay types of bioactivity arise from the principles of the wet-lab experiments used to measure them. Binding assays focus on the direct interaction between the small molecule and the protein target, providing insights into the strength and specificity of this binding through metrics like <math alttext="K_{i}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.4.m4.1"><semantics id="S4.SS1.SSS0.Px1.p1.4.m4.1a"><msub id="S4.SS1.SSS0.Px1.p1.4.m4.1.1" xref="S4.SS1.SSS0.Px1.p1.4.m4.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.4.m4.1.1.2" xref="S4.SS1.SSS0.Px1.p1.4.m4.1.1.2.cmml">K</mi><mi id="S4.SS1.SSS0.Px1.p1.4.m4.1.1.3" xref="S4.SS1.SSS0.Px1.p1.4.m4.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.4.m4.1b"><apply id="S4.SS1.SSS0.Px1.p1.4.m4.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.4.m4.1.1"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.4.m4.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.4.m4.1.1">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.4.m4.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.4.m4.1.1.2">𝐾</ci><ci id="S4.SS1.SSS0.Px1.p1.4.m4.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.4.m4.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.4.m4.1c">K_{i}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.4.m4.1d">italic_K start_POSTSUBSCRIPT italic_i 
end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="K_{d}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.5.m5.1"><semantics id="S4.SS1.SSS0.Px1.p1.5.m5.1a"><msub id="S4.SS1.SSS0.Px1.p1.5.m5.1.1" xref="S4.SS1.SSS0.Px1.p1.5.m5.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.5.m5.1.1.2" xref="S4.SS1.SSS0.Px1.p1.5.m5.1.1.2.cmml">K</mi><mi id="S4.SS1.SSS0.Px1.p1.5.m5.1.1.3" xref="S4.SS1.SSS0.Px1.p1.5.m5.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.5.m5.1b"><apply id="S4.SS1.SSS0.Px1.p1.5.m5.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.5.m5.1.1"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.5.m5.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.5.m5.1.1">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.5.m5.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.5.m5.1.1.2">𝐾</ci><ci id="S4.SS1.SSS0.Px1.p1.5.m5.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.5.m5.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.5.m5.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.5.m5.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math>, using techniques such as surface plasmon resonance (SPR) <cite class="ltx_cite ltx_citemacro_cite">Schasfoort (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib49" title="">2017</a>); Englebienne et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib17" title="">2003</a>)</cite> and isothermal titration calorimetry (ITC) <cite class="ltx_cite ltx_citemacro_cite">Leavitt and Freire (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib34" title="">2001</a>)</cite>. 
In contrast, functional assays measure the biological response elicited by the small molecule on the target, capturing its effect on a biological system and often quantified by <math alttext="IC_{50}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.6.m6.1"><semantics id="S4.SS1.SSS0.Px1.p1.6.m6.1a"><mrow id="S4.SS1.SSS0.Px1.p1.6.m6.1.1" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.2" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.2.cmml">I</mi><mo id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.1" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.1.cmml">⁢</mo><msub id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.cmml"><mi id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.2" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.2.cmml">C</mi><mn id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.3" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.6.m6.1b"><apply id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1"><times id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.1"></times><ci id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.2">𝐼</ci><apply id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.1.cmml" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.2.cmml" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.2">𝐶</ci><cn id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.3.cmml" type="integer" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.6.m6.1c">IC_{50}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.6.m6.1d">italic_I italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="EC_{50}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.7.m7.1"><semantics 
id="S4.SS1.SSS0.Px1.p1.7.m7.1a"><mrow id="S4.SS1.SSS0.Px1.p1.7.m7.1.1" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.2" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.2.cmml">E</mi><mo id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.1" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.1.cmml">⁢</mo><msub id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.cmml"><mi id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.2" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.2.cmml">C</mi><mn id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.3" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.7.m7.1b"><apply id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1"><times id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.1"></times><ci id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.2">𝐸</ci><apply id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.1.cmml" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.2.cmml" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.2">𝐶</ci><cn id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.3.cmml" type="integer" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.7.m7.1c">EC_{50}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.7.m7.1d">italic_E italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math> by enzyme activity assays <cite class="ltx_cite ltx_citemacro_cite">Bisswanger (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib6" title="">2014</a>); Hall (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib26" title="">1996</a>)</cite> or other wet experiment techniques. 
The inherent differences in what these assays measure mean that their values cannot be directly compared <cite class="ltx_cite ltx_citemacro_cite">Yung-Chi and Prusoff (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib65" title="">1973</a>)</cite>. Furthermore, even within the categories of binding and functional assays, metrics should not be used interchangeably, as <math alttext="K_{i}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.8.m8.1"><semantics id="S4.SS1.SSS0.Px1.p1.8.m8.1a"><msub id="S4.SS1.SSS0.Px1.p1.8.m8.1.1" xref="S4.SS1.SSS0.Px1.p1.8.m8.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.8.m8.1.1.2" xref="S4.SS1.SSS0.Px1.p1.8.m8.1.1.2.cmml">K</mi><mi id="S4.SS1.SSS0.Px1.p1.8.m8.1.1.3" xref="S4.SS1.SSS0.Px1.p1.8.m8.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.8.m8.1b"><apply id="S4.SS1.SSS0.Px1.p1.8.m8.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.8.m8.1.1"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.8.m8.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.8.m8.1.1">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.8.m8.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.8.m8.1.1.2">𝐾</ci><ci id="S4.SS1.SSS0.Px1.p1.8.m8.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.8.m8.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.8.m8.1c">K_{i}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.8.m8.1d">italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="K_{d}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.9.m9.1"><semantics id="S4.SS1.SSS0.Px1.p1.9.m9.1a"><msub id="S4.SS1.SSS0.Px1.p1.9.m9.1.1" xref="S4.SS1.SSS0.Px1.p1.9.m9.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.9.m9.1.1.2" xref="S4.SS1.SSS0.Px1.p1.9.m9.1.1.2.cmml">K</mi><mi id="S4.SS1.SSS0.Px1.p1.9.m9.1.1.3" xref="S4.SS1.SSS0.Px1.p1.9.m9.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.9.m9.1b"><apply 
id="S4.SS1.SSS0.Px1.p1.9.m9.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.9.m9.1.1"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.9.m9.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.9.m9.1.1">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.9.m9.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.9.m9.1.1.2">𝐾</ci><ci id="S4.SS1.SSS0.Px1.p1.9.m9.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.9.m9.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.9.m9.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.9.m9.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math> describe different aspects of binding affinity, just as <math alttext="IC_{50}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.10.m10.1"><semantics id="S4.SS1.SSS0.Px1.p1.10.m10.1a"><mrow id="S4.SS1.SSS0.Px1.p1.10.m10.1.1" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.2" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.2.cmml">I</mi><mo id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.1" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.1.cmml">⁢</mo><msub id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.cmml"><mi id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.2" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.2.cmml">C</mi><mn id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.3" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.10.m10.1b"><apply id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1"><times id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.1"></times><ci id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.2">𝐼</ci><apply id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.1.cmml" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3">subscript</csymbol><ci 
id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.2.cmml" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.2">𝐶</ci><cn id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.3.cmml" type="integer" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.10.m10.1c">IC_{50}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.10.m10.1d">italic_I italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="EC_{50}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.11.m11.1"><semantics id="S4.SS1.SSS0.Px1.p1.11.m11.1a"><mrow id="S4.SS1.SSS0.Px1.p1.11.m11.1.1" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.2" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.2.cmml">E</mi><mo id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.1" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.1.cmml">⁢</mo><msub id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.cmml"><mi id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.2" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.2.cmml">C</mi><mn id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.3" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.11.m11.1b"><apply id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1"><times id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.1"></times><ci id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.2">𝐸</ci><apply id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.1.cmml" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.2.cmml" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.2">𝐶</ci><cn id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.3.cmml" type="integer" 
xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.11.m11.1c">EC_{50}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.11.m11.1d">italic_E italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math> describe different aspects of biological response. </div> </section> <section class="ltx_paragraph" id="S4.SS1.SSS0.Px2"> <h5 class="ltx_title ltx_title_paragraph">Influence of measuring correlation within the same PDB ID.</h5> <div class="ltx_para ltx_noindent" id="S4.SS1.SSS0.Px2.p1"> As demonstrated in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S4.F4.sf1" title="In Figure 4 ‣ 4 SIU experiments and analysis ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">4(a)</a>, grouping data by PDB ID across all assay types results in a significant decline in both Pearson and Spearman correlations. This observation suggests that it is more challenging to achieve high correlation when comparing the binding affinities of different molecules within the same pocket. This challenge primarily arises from the skewed distribution of binding affinities across protein pockets, as shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.F3" title="Figure 3 ‣ Dataset split. ‣ 3.1.3 Data construction for downstream tasks ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">3</a>(D). Furthermore, these findings highlight that conventional approaches, which measure correlation without grouping by PDB ID, may not effectively capture a model’s ability to differentiate between molecules targeting the same protein. 
Such discriminatory capacity is crucial in drug discovery, where what matters is how well a model separates molecules for a specific target rather than how well it correlates across diverse targets. This underscores the necessity of our dataset, which measures correlation within the same PDB ID and thus provides a more relevant assessment of a deep learning model’s utility in drug discovery. </div> </section> <section class="ltx_paragraph" id="S4.SS1.SSS0.Px3"> <h5 class="ltx_title ltx_title_paragraph">Effectiveness of training on a larger dataset.</h5> <div class="ltx_para ltx_noindent" id="S4.SS1.SSS0.Px3.p1"> We compare models trained on the PDBbind 2020 dataset with those trained on SIU versions 0.6 and 0.9. Notably, the PDBbind 2020 dataset was used in its entirety, without any filtering to exclude pockets similar to those in the test set. As illustrated in Table <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S4.T2" title="Table 2 ‣ 4 SIU experiments and analysis ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">2</a> and Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S4.F4" title="Figure 4 ‣ 4 SIU experiments and analysis ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">4</a>, models trained on the SIU datasets outperform those trained on PDBbind, despite the latter’s lack of homology removal. This underscores the effectiveness of our large-scale dataset in enhancing model learning for binding affinity prediction. The 0.9 version also performs better than the 0.6 version, indicating the influence of both homology removal and dataset scale. 
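The PDB-ID-grouped evaluation discussed above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the minimum group size of 3, the equal-weight averaging over groups, and the toy PDB IDs and values are all assumptions made for illustration. The sketch computes Pearson and Spearman correlations once on pooled data and once within each PDB ID.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns None when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def per_pdb_correlation(records, min_size=3):
    """records: iterable of (pdb_id, predicted, measured) tuples (assumed layout)."""
    groups = defaultdict(list)
    for pdb_id, predicted, measured in records:
        groups[pdb_id].append((predicted, measured))
    result = {}
    for pdb_id, pairs in groups.items():
        if min_size > len(pairs):
            continue  # too few molecules for a meaningful correlation
        predicted, measured = zip(*pairs)
        r, rho = pearson(predicted, measured), spearman(predicted, measured)
        if r is not None and rho is not None:
            result[pdb_id] = (r, rho)
    return result


# Hypothetical toy data: two pockets with very different mean affinities.
records = [
    ("1abc", 5.0, 5.2), ("1abc", 5.1, 4.9), ("1abc", 5.2, 5.0),
    ("2xyz", 8.0, 8.1), ("2xyz", 8.1, 7.9), ("2xyz", 8.2, 8.0),
]
pooled_r = pearson([p for _, p, _ in records], [m for _, _, m in records])
grouped = per_pdb_correlation(records)
mean_group_r = sum(r for r, _ in grouped.values()) / len(grouped)
print("pooled Pearson:", pooled_r)
print("mean per-PDB Pearson:", mean_group_r)
```

On this toy input the pooled Pearson correlation is above 0.9 because the predictions separate the two pockets, while the correlation within each pocket is negative. This is the kind of gap that the grouped metric is meant to expose.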
</div> </section> </section> </section> <section class="ltx_section" id="S5"> <h2 class="ltx_title ltx_title_section"> 5 Limitations and future work</h2> <section class="ltx_paragraph" id="S5.SS0.SSS0.Px1"> <h5 class="ltx_title ltx_title_paragraph">Limitations.</h5> <div class="ltx_para ltx_noindent" id="S5.SS0.SSS0.Px1.p1"> Despite our rigorous methodology, the structural data we obtained are still predicted poses rather than experimentally validated interactions. Accurately modeling small molecule-protein interactions under physiological conditions remains a substantial challenge and highlights the need for continued advancements in this field. </div> </section> <section class="ltx_paragraph" id="S5.SS0.SSS0.Px2"> <h5 class="ltx_title ltx_title_paragraph">Future work.</h5> <div class="ltx_para ltx_noindent" id="S5.SS0.SSS0.Px2.p1"> We aim to provide larger and more reliable datasets for various drug discovery tasks. We are developing datasets for pairwise ranking, alongside organizing data for unbiased bioactivity prediction. We will ensure that comparisons are made only between docking poses derived from the same PDB and bioactivity data from identical assay types. Additionally, this work validates our approach to the automated generation of extensive, high-quality data as feasible and scalable. In this work, to address the computational demands of the molecular docking stage, we reduced the dataset by deduplicating small molecules and pocket structures. Future research could build upon these methods to construct even larger datasets efficiently, thereby advancing the understanding of small molecule-protein interactions in AIDD. 
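The pairing constraint stated above (same PDB, identical assay type) can be sketched as follows. This is only an illustration under stated assumptions, not the planned dataset pipeline: the record fields (mol_id, pdb_id, assay_type, value), the function name, and the convention that a larger value ranks first are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_ranking_pairs(records, min_gap=0.0):
    """Return (higher, lower) molecule-ID pairs drawn only from records
    that share both the same PDB ID and the same assay type."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["pdb_id"], rec["assay_type"])].append(rec)

    pairs = []
    for members in groups.values():
        for a, b in combinations(members, 2):
            gap = a["value"] - b["value"]
            if abs(gap) > min_gap:  # skip ties and near-ties
                higher, lower = (a, b) if gap > 0 else (b, a)
                pairs.append((higher["mol_id"], lower["mol_id"]))
    return pairs


# Hypothetical toy records; identifiers and values are made up.
toy_records = [
    {"mol_id": "m1", "pdb_id": "1abc", "assay_type": "Ki", "value": 7.0},
    {"mol_id": "m2", "pdb_id": "1abc", "assay_type": "Ki", "value": 6.0},
    {"mol_id": "m3", "pdb_id": "1abc", "assay_type": "IC50", "value": 8.0},
    {"mol_id": "m4", "pdb_id": "2xyz", "assay_type": "Ki", "value": 9.0},
    {"mol_id": "m5", "pdb_id": "2xyz", "assay_type": "Ki", "value": 9.0},
]
ranking_pairs = build_ranking_pairs(toy_records)
print(ranking_pairs)
```

Here m3 is never paired with m1 or m2 because its assay type differs, and m4 and m5 produce no pair because their values tie. Raising min_gap would additionally drop pairs whose difference may fall within experimental noise.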
</div> </section> </section> <section class="ltx_section" id="S6"> <h2 class="ltx_title ltx_title_section"> 6 Conclusion</h2> <div class="ltx_para ltx_noindent" id="S6.p1"> To meet the pressing demands in drug discovery and development, we introduced SIU, a large-scale, diverse, accurate, and well-curated dataset for unbiased bioactivity prediction. SIU was constructed using meticulously designed and robust pipelines, ensuring its exceptional quality and comprehensiveness. Our experimental results show that the large scale of SIU makes it a superior training dataset that enhances model performance, and that it provides a reliable framework for unbiased bioactivity prediction tasks, facilitating more meaningful model evaluations. We believe that the full potential of SIU has yet to be uncovered, and we hope that its introduction will catalyze significant advancements in the field of drug discovery and development. </div> </section> <section class="ltx_bibliography" id="bib"> <h2 class="ltx_title ltx_title_bibliography">References</h2> <ul class="ltx_biblist"> <li class="ltx_bibitem" id="bib.bib1"> uni (2017) Uniprot: the universal protein knowledgebase. Nucleic acids research, 45(D1):D158–D169, 2017. </li> <li class="ltx_bibitem" id="bib.bib2"> Altucci et al. (2007) Lucia Altucci, Mark D Leibowitz, Kathleen M Ogilvie, Angel R de Lera, and Hinrich Gronemeyer. Rar and rxr modulation in cancer and metabolic disease. Nature reviews Drug discovery, 6(10):793–810, 2007. </li> <li class="ltx_bibitem" id="bib.bib3"> Ashcroft (1999) Frances M Ashcroft. Ion channels and disease. Academic press, 1999. </li> <li class="ltx_bibitem" id="bib.bib4"> Attwood et al. (2021) Misty M Attwood, Doriano Fabbro, Aleksandr V Sokolov, Stefan Knapp, and Helgi B Schiöth. Trends in kinase drug discovery: targets, indications and inhibitor design. Nature Reviews Drug Discovery, 20(11):839–861, 2021. </li> <li class="ltx_bibitem" id="bib.bib5"> Berman et al. 
(2000) Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov, and Philip E Bourne. The protein data bank. Nucleic acids research, 28(1):235–242, 2000. </li> <li class="ltx_bibitem" id="bib.bib6"> Bisswanger (2014) Hans Bisswanger. Enzyme assays. Perspectives in Science, 1(1-6):41–55, 2014. </li> <li class="ltx_bibitem" id="bib.bib7"> Block et al. (2006) Peter Block, Christoph A Sotriffer, Ingo Dramburg, and Gerhard Klebe. Affindb: a freely accessible database of affinities for protein–ligand complexes from the pdb. Nucleic acids research, 34(suppl_1):D522–D526, 2006. </li> <li class="ltx_bibitem" id="bib.bib8"> Bureik et al. (2002) Matthias Bureik, Michael Lisurek, and Rita Bernhardt. The human steroid hydroxylases cyp11b1 and cyp11b2. 2002. </li> <li class="ltx_bibitem" id="bib.bib9"> Chen et al. (2001) Xi Chen, Ming Liu, and Michael K Gilson. Bindingdb: a web-accessible molecular recognition database. Combinatorial chemistry & high throughput screening, 4(8):719–725, 2001. </li> <li class="ltx_bibitem" id="bib.bib10"> Cohen et al. (2021) Philip Cohen, Darren Cross, and Pasi A Jänne. Kinase drug discovery 20 years after imatinib: progress and future directions. Nature reviews drug discovery, 20(7):551–569, 2021. </li> <li class="ltx_bibitem" id="bib.bib11"> Consortium (2015) UniProt Consortium. Uniprot: a hub for protein information. Nucleic acids research, 43(D1):D204–D212, 2015. </li> <li class="ltx_bibitem" id="bib.bib12"> Cramer (2019) Patrick Cramer. Organization and regulation of gene transcription. Nature, 573(7772):45–54, 2019. </li> <li class="ltx_bibitem" id="bib.bib13"> Danielson (2002) P B Danielson. The cytochrome p450 superfamily: biochemistry, evolution and drug metabolism in humans. Current drug metabolism, 3(6):561–597, 2002. </li> <li class="ltx_bibitem" id="bib.bib14"> Davis et al. 
(2011) Mindy I Davis, Jeremy P Hunt, Sanna Herrgard, Pietro Ciceri, Lisa M Wodicka, Gabriel Pallares, Michael Hocker, Daniel K Treiber, and Patrick P Zarrinkar. Comprehensive analysis of kinase inhibitor selectivity. Nature biotechnology, 29(11):1046–1051, 2011. </li> <li class="ltx_bibitem" id="bib.bib15"> Denisov et al. (2005) Ilia G Denisov, Thomas M Makris, Stephen G Sligar, and Ilme Schlichting. Structure and chemistry of cytochrome p450. Chemical reviews, 105(6):2253–2278, 2005. </li> <li class="ltx_bibitem" id="bib.bib16"> Ekins et al. (2017) S Ekins, AM Clark, C Southan, BA Bunin, and AJ Williams. Chapter 16. small-molecule bioactivity databases. High Throughput Screening Methods, pages 344–371, 2017. </li> <li class="ltx_bibitem" id="bib.bib17"> Englebienne et al. (2003) Patrick Englebienne, Anne Van Hoonacker, and Michel Verhas. Surface plasmon resonance: principles, methods and applications in biomedical sciences. Spectroscopy, 17(2-3):255–273, 2003. </li> <li class="ltx_bibitem" id="bib.bib18"> Eschenmoser (1995) Albert Eschenmoser. One hundred years lock-and-key principle. Angewandte Chemie International Edition in English, 33(23-24):2363–2363, 1995. </li> <li class="ltx_bibitem" id="bib.bib19"> Feinberg (2008) Andrew P Feinberg. Epigenetics at the epicenter of modern medicine. Jama, 299(11):1345–1350, 2008. </li> <li class="ltx_bibitem" id="bib.bib20"> Friesner et al. (2004) Richard A Friesner, Jay L Banks, Robert B Murphy, Thomas A Halgren, Jasna J Klicic, Daniel T Mainz, Matthew P Repasky, Eric H Knoll, Mee Shelley, Jason K Perry, et al. Glide: a new approach for rapid, accurate docking and scoring. 1. method and assessment of docking accuracy. Journal of medicinal chemistry, 47(7):1739–1749, 2004. </li> <li class="ltx_bibitem" id="bib.bib21"> Gao et al. (2023) Bowen Gao, Yinjun Jia, YuanLe Mo, Yuyan Ni, Wei-Ying Ma, Zhi-Ming Ma, and Yanyan Lan. Self-supervised pocket pretraining via protein fragment-surroundings alignment. 
In The Twelfth International Conference on Learning Representations, 2023. </li> <li class="ltx_bibitem" id="bib.bib22"> Gaulton and Overington (2010) Anna Gaulton and John P Overington. Role of open chemical data in aiding drug discovery and design. Future Medicinal Chemistry, 2(6):903–907, 2010. </li> <li class="ltx_bibitem" id="bib.bib23"> Gaulton et al. (2012) Anna Gaulton, Louisa J Bellis, A Patricia Bento, Jon Chambers, Mark Davies, Anne Hersey, Yvonne Light, Shaun McGlinchey, David Michalovich, Bissan Al-Lazikani, et al. Chembl: a large-scale bioactivity database for drug discovery. Nucleic acids research, 40(D1):D1100–D1107, 2012. </li> <li class="ltx_bibitem" id="bib.bib24"> Gibney and Nolan (2010) ER Gibney and CM Nolan. Epigenetics and gene expression. Heredity, 105(1):4–13, 2010. </li> <li class="ltx_bibitem" id="bib.bib25"> Gilson et al. (2016) Michael K Gilson, Tiqing Liu, Michael Baitaluk, George Nicola, Linda Hwang, and Jenny Chong. Bindingdb in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic acids research, 44(D1):D1045–D1053, 2016. </li> <li class="ltx_bibitem" id="bib.bib26"> Hall (1996) George M Hall. Methods of testing protein functionality. Springer Science & Business Media, 1996. </li> <li class="ltx_bibitem" id="bib.bib27"> Hauser et al. (2017) Alexander S Hauser, Misty M Attwood, Mathias Rask-Andersen, Helgi B Schiöth, and David E Gloriam. Trends in gpcr drug discovery: new agents, targets and indications. Nature reviews Drug discovery, 16(12):829–842, 2017. </li> <li class="ltx_bibitem" id="bib.bib28"> Heller et al. (2015) Stephen R Heller, Alan McNaught, Igor Pletnev, Stephen Stein, and Dmitrii Tchekhovskoi. Inchi, the iupac international chemical identifier. Journal of cheminformatics, 7:1–34, 2015. </li> <li class="ltx_bibitem" id="bib.bib29"> Hu et al. (2005) Liegi Hu, Mark L Benson, Richard D Smith, Michael G Lerner, and Heather A Carlson. 
Binding moad (mother of all databases). Proteins: Structure, Function, and Bioinformatics, 60(3):333–340, 2005. </li> <li class="ltx_bibitem" id="bib.bib30"> Kim et al. (2016) Sunghwan Kim, Paul A Thiessen, Evan E Bolton, Jie Chen, Gang Fu, Asta Gindulyte, Lianyi Han, Jane He, Siqian He, Benjamin A Shoemaker, et al. Pubchem substance and compound databases. Nucleic acids research, 44(D1):D1202–D1213, 2016. </li> <li class="ltx_bibitem" id="bib.bib31"> Koshland Jr (1995) Daniel E Koshland Jr. The key–lock theory and the induced fit theory. Angewandte Chemie International Edition in English, 33(23-24):2375–2378, 1995. </li> <li class="ltx_bibitem" id="bib.bib32"> Lambert et al. (2018) Samuel A Lambert, Arttu Jolma, Laura F Campitelli, Pratyush K Das, Yimeng Yin, Mihai Albu, Xiaoting Chen, Jussi Taipale, Timothy R Hughes, and Matthew T Weirauch. The human transcription factors. Cell, 172(4):650–665, 2018. </li> <li class="ltx_bibitem" id="bib.bib33"> Law et al. (2014) Vivian Law, Craig Knox, Yannick Djoumbou, Tim Jewison, An Chi Guo, Yifeng Liu, Adam Maciejewski, David Arndt, Michael Wilson, Vanessa Neveu, et al. Drugbank 4.0: shedding new light on drug metabolism. Nucleic acids research, 42(D1):D1091–D1097, 2014. </li> <li class="ltx_bibitem" id="bib.bib34"> Leavitt and Freire (2001) Stephanie Leavitt and Ernesto Freire. Direct measurement of protein binding energetics by isothermal titration calorimetry. Current opinion in structural biology, 11(5):560–566, 2001. </li> <li class="ltx_bibitem" id="bib.bib35"> Li et al. (2024) Xuelian Li, Cheng Shen, Hui Zhu, Yujian Yang, Qing Wang, Jincai Yang, and Niu Huang. A high-quality data set of protein–ligand binding interactions via comparative complex structure modeling. Journal of Chemical Information and Modeling, 2024. </li> <li class="ltx_bibitem" id="bib.bib36"> Lineweaver and Burk (1934) Hans Lineweaver and Dean Burk. The determination of enzyme dissociation constants. 
Journal of the American chemical society, 56(3):658–666, 1934. </li> <li class="ltx_bibitem" id="bib.bib37"> Liu et al. (2007) Tiqing Liu, Yuhmei Lin, Xin Wen, Robert N Jorissen, and Michael K Gilson. Bindingdb: a web-accessible database of experimentally determined protein–ligand binding affinities. Nucleic acids research, 35(suppl_1):D198–D201, 2007. </li> <li class="ltx_bibitem" id="bib.bib38"> Mendez et al. (2019) David Mendez, Anna Gaulton, A Patrícia Bento, Jon Chambers, Marleen De Veij, Eloy Félix, María Paula Magariños, Juan F Mosquera, Prudence Mutowo, Michał Nowotka, et al. Chembl: towards direct deposition of bioassay data. Nucleic acids research, 47(D1):D930–D940, 2019. </li> <li class="ltx_bibitem" id="bib.bib39"> Mori and Mishina (1995) H Mori and M Mishina. Structure and function of the nmda receptor channel. Neuropharmacology, 34(10):1219–1237, 1995. </li> <li class="ltx_bibitem" id="bib.bib40"> Murakami et al. (2016) Yoichi Murakami, Satoshi Omori, and Kengo Kinoshita. Nldb: a database for 3d protein–ligand interactions in enzymatic reactions. Journal of structural and functional genomics, 17:101–110, 2016. </li> <li class="ltx_bibitem" id="bib.bib41"> Naderi et al. (2018) Misagh Naderi, Rajiv Gandhi Govindaraj, and Michal Brylinski. e model-bdb: a database of comparative structure models of drug-target interactions from the binding database. Gigascience, 7(8):giy091, 2018. </li> <li class="ltx_bibitem" id="bib.bib42"> Öztürk et al. (2018) Hakime Öztürk, Arzucan Özgür, and Elif Ozkirimli. Deepdta: deep drug–target binding affinity prediction. Bioinformatics, 34(17):i821–i829, 2018. </li> <li class="ltx_bibitem" id="bib.bib43"> Pawson et al. (2014) Adam J Pawson, Joanna L Sharman, Helen E Benson, Elena Faccenda, Stephen PH Alexander, O Peter Buneman, Anthony P Davenport, John C McGrath, John A Peters, Christopher Southan, et al. The iuphar/bps guide to pharmacology: an expert-driven knowledgebase of drug targets and their ligands. 
Nucleic acids research, 42(D1):D1098–D1106, 2014. </li> <li class="ltx_bibitem" id="bib.bib44"> Qu and Tang (2010) Liyan Qu and Xiuwen Tang. Bexarotene: a promising anticancer agent. Cancer chemotherapy and pharmacology, 65:201–205, 2010. </li> <li class="ltx_bibitem" id="bib.bib45"> Reisberg et al. (2003) Barry Reisberg, Rachelle Doody, Albrecht Stöffler, Frederick Schmitt, Steven Ferris, and Hans Jörg Möbius. Memantine in moderate-to-severe alzheimer’s disease. New England Journal of Medicine, 348(14):1333–1341, 2003. </li> <li class="ltx_bibitem" id="bib.bib46"> Robinson-Rechavi et al. (2003) Marc Robinson-Rechavi, Hector Escriva Garcia, and Vincent Laudet. The nuclear receptor superfamily. Journal of cell science, 116(4):585–586, 2003. </li> <li class="ltx_bibitem" id="bib.bib47"> Rogers and Hahn (2010) David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of chemical information and modeling, 50(5):742–754, 2010. </li> <li class="ltx_bibitem" id="bib.bib48"> Sankar et al. (2022) Santhosh Sankar, Naren Chandran Sakthivel, and Nagasuma Chandra. Fast local alignment of protein pockets (flapp): a system-compiled program for large-scale binding site alignment. Journal of Chemical Information and Modeling, 62(19):4810–4819, 2022. </li> <li class="ltx_bibitem" id="bib.bib49"> Schasfoort (2017) Richard Schasfoort. Introduction to surface plasmon resonance. 2017. </li> <li class="ltx_bibitem" id="bib.bib50"> Tang et al. (2014) Jing Tang, Agnieszka Szwajda, Sushil Shakyawar, Tao Xu, Petteri Hintsanen, Krister Wennerberg, and Tero Aittokallio. Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. Journal of Chemical Information and Modeling, 54(3):735–743, 2014. </li> <li class="ltx_bibitem" id="bib.bib51"> Townshend et al. 
(2021) Raphael John Lamarre Townshend, Martin Vögele, Patricia Adriana Suriana, Alexander Derry, Alexander Powers, Yianni Laloudakis, Sidhika Balachandar, Bowen Jing, Brandon M Anderson, Stephan Eismann, et al. Atom3d: Tasks on molecules in three dimensions. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. </li> <li class="ltx_bibitem" id="bib.bib52"> Tropsha et al. (2024) Alexander Tropsha, Olexandr Isayev, Alexandre Varnek, Gisbert Schneider, and Artem Cherkasov. Integrating qsar modelling and deep learning in drug discovery: the emergence of deep qsar. Nature Reviews Drug Discovery, 23(2):141–155, 2024. </li> <li class="ltx_bibitem" id="bib.bib53"> Trott and Olson (2010) Oleg Trott and Arthur J Olson. Autodock vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of computational chemistry, 31(2):455–461, 2010. </li> <li class="ltx_bibitem" id="bib.bib54"> Verdonk et al. (2003) Marcel L Verdonk, Jason C Cole, Michael J Hartshorn, Christopher W Murray, and Richard D Taylor. Improved protein–ligand docking using gold. Proteins: Structure, Function, and Bioinformatics, 52(4):609–623, 2003. </li> <li class="ltx_bibitem" id="bib.bib55"> Verma et al. (2010) Jitender Verma, Vijay M Khedkar, and Evans C Coutinho. 3d-qsar in drug design-a review. Current topics in medicinal chemistry, 10(1):95–115, 2010. </li> <li class="ltx_bibitem" id="bib.bib56"> Vila et al. (2016) Julien Vila, Emma Sáez-López, James R Johnson, Ute Römling, Ulrich Dobrindt, Rafael Cantón, CG Giske, Thierry Naas, Alessandra Carattoli, Margarita Martínez-Medina, et al. Escherichia coli: an old friend with new tidings. FEMS microbiology reviews, 40(4):437–463, 2016. </li> <li class="ltx_bibitem" id="bib.bib57"> Wang et al. (2004) Renxiao Wang, Xueliang Fang, Yipin Lu, and Shaomeng Wang. 
The pdbbind database: Collection of binding affinities for protein- ligand complexes with known three-dimensional structures. Journal of medicinal chemistry, 47(12):2977–2980, 2004. </li> <li class="ltx_bibitem" id="bib.bib58"> Wang et al. (2005) Renxiao Wang, Xueliang Fang, Yipin Lu, Chao-Yie Yang, and Shaomeng Wang. The pdbbind database: methodologies and updates. Journal of medicinal chemistry, 48(12):4111–4119, 2005. </li> <li class="ltx_bibitem" id="bib.bib59"> Wei et al. (2024) Hong Wei, Wenkai Wang, Zhenling Peng, and Jianyi Yang. Q-biolip: A comprehensive resource for quaternary structure-based protein–ligand interactions. Genomics, Proteomics & Bioinformatics, page qzae001, 2024. </li> <li class="ltx_bibitem" id="bib.bib60"> Weininger (1988) David Weininger. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences, 28(1):31–36, 1988. </li> <li class="ltx_bibitem" id="bib.bib61"> Weininger et al. (1989) David Weininger, Arthur Weininger, and Joseph L Weininger. Smiles. 2. algorithm for generation of unique smiles notation. Journal of chemical information and computer sciences, 29(2):97–101, 1989. </li> <li class="ltx_bibitem" id="bib.bib62"> Wishart et al. (2018) David S Wishart, Yannick D Feunang, An C Guo, Elvis J Lo, Ana Marcu, Jason R Grant, Tanvir Sajed, Daniel Johnson, Carin Li, Zinat Sayeeda, et al. Drugbank 5.0: a major update to the drugbank database for 2018. Nucleic acids research, 46(D1):D1074–D1082, 2018. </li> <li class="ltx_bibitem" id="bib.bib63"> Wu et al. (2018) Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018. </li> <li class="ltx_bibitem" id="bib.bib64"> Yang et al. (2012) Jianyi Yang, Ambrish Roy, and Yang Zhang. 
Biolip: a semi-manually curated database for biologically relevant ligand–protein interactions. Nucleic acids research, 41(D1):D1096–D1103, 2012. </li> <li class="ltx_bibitem" id="bib.bib65"> Yung-Chi and Prusoff (1973) Cheng Yung-Chi and William H Prusoff. Relationship between the inhibition constant (ki) and the concentration of inhibitor which causes 50 per cent inhibition (i50) of an enzymatic reaction. Biochemical pharmacology, 22(23):3099–3108, 1973. </li> <li class="ltx_bibitem" id="bib.bib66"> Zhang et al. (2004) Junwei Zhang, Masahiro Aizawa, Shinji Amari, Yoshio Iwasawa, Tatsuya Nakano, and Kotoko Nakata. Development of kibank, a database supporting structure-based drug design. Computational biology and chemistry, 28(5-6):401–407, 2004. </li> <li class="ltx_bibitem" id="bib.bib67"> Zhou et al. (2022) Gengmo Zhou, Zhifeng Gao, Qiankun Ding, Hang Zheng, Hongteng Xu, Zhewei Wei, Linfeng Zhang, and Guolin Ke. Uni-mol: A universal 3d molecular representation learning framework. In The Eleventh International Conference on Learning Representations, 2022. </li> </ul> </section> <div class="ltx_pagination ltx_role_newpage"></div> <section class="ltx_appendix" id="A1"> <h2 class="ltx_title ltx_title_appendix"> Appendix A Dataset and Code Availability</h2> <div class="ltx_para ltx_noindent" id="A1.p1"> The whole dataset and corresponding descriptions can be found at <a class="ltx_ref ltx_href" href="https://huggingface.co/datasets/bgao95/SIU" title="">https://huggingface.co/datasets/bgao95/SIU</a> </div> <div class="ltx_para ltx_noindent" id="A1.p2"> The code and instructions used to train the baseline models can be found at <a class="ltx_ref ltx_href" href="https://github.com/bowen-gao/SIU" title="">https://github.com/bowen-gao/SIU</a> </div> <div class="ltx_para ltx_noindent" id="A1.p3"> The dataset is hosted by Hugging Face. The license is CC BY 4.0. We bear all responsibility in case of violation of rights. 
</div> <div class="ltx_para ltx_noindent" id="A1.p4"> The data we are using/curating doesn’t contain personally identifiable information or offensive content. </div> </section> <section class="ltx_appendix" id="A2"> <h2 class="ltx_title ltx_title_appendix"> Appendix B Dataset overview</h2> <div class="ltx_para ltx_noindent" id="A2.p1"> SIU represents a large-scale, high-quality dataset of small molecule-protein interactions, meticulously organized to facilitate unbiased bioactivity prediction, both PDB-wise and assay-type-wise. The dataset comprises a total of 5,342,250 conformations. Each instance in the dataset provides detailed information about small molecule-protein interactions, including the coordinates and element types of each atom in the small molecule and the corresponding pockets of each interaction. Additionally, the assay value and type of each conformation, along with other critical information, are carefully obtained and retained from the original bioactivity databases. This includes the UniProt ID and PDB ID of the protein pockets, as well as the InChI keys <cite class="ltx_cite ltx_citemacro_citep">[Heller et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib28" title="">2015</a>]</cite> and SMILES <cite class="ltx_cite ltx_citemacro_cite">Weininger [<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib60" title="">1988</a>], Weininger et al. [<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib61" title="">1989</a>]</cite> notations of the small molecules. 
</div> <figure class="ltx_table" id="A2.T3"> <figcaption class="ltx_caption ltx_centering">Table 3: The label count for 4 representative assay types in SIU total, SIU 0.9, and 0.6 versions.</figcaption> <table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle" id="A2.T3.5"> <thead class="ltx_thead"> <tr class="ltx_tr" id="A2.T3.5.6.1"> <th class="ltx_td ltx_th ltx_th_column ltx_th_row ltx_border_r ltx_border_tt" id="A2.T3.5.6.1.1"></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" colspan="4" id="A2.T3.5.6.1.2">SIU 0.9 version</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" colspan="4" id="A2.T3.5.6.1.3">SIU 0.6 version</th> </tr> <tr class="ltx_tr" id="A2.T3.5.7.2"> <th class="ltx_td ltx_th ltx_th_column ltx_th_row ltx_border_r" id="A2.T3.5.7.2.1"></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A2.T3.5.7.2.2">Total</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A2.T3.5.7.2.3">Train</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A2.T3.5.7.2.4">Valid</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_t" id="A2.T3.5.7.2.5">Test</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A2.T3.5.7.2.6">Total</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A2.T3.5.7.2.7">Train</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A2.T3.5.7.2.8">Valid</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A2.T3.5.7.2.9">Test</th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="A2.T3.1.1"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t" id="A2.T3.1.1.1"><math alttext="MTL" class="ltx_Math" display="inline" id="A2.T3.1.1.1.m1.1"><semantics id="A2.T3.1.1.1.m1.1a"><mrow id="A2.T3.1.1.1.m1.1.1" 
xref="A2.T3.1.1.1.m1.1.1.cmml"><mi id="A2.T3.1.1.1.m1.1.1.2" xref="A2.T3.1.1.1.m1.1.1.2.cmml">M</mi><mo id="A2.T3.1.1.1.m1.1.1.1" xref="A2.T3.1.1.1.m1.1.1.1.cmml">⁢</mo><mi id="A2.T3.1.1.1.m1.1.1.3" xref="A2.T3.1.1.1.m1.1.1.3.cmml">T</mi><mo id="A2.T3.1.1.1.m1.1.1.1a" xref="A2.T3.1.1.1.m1.1.1.1.cmml">⁢</mo><mi id="A2.T3.1.1.1.m1.1.1.4" xref="A2.T3.1.1.1.m1.1.1.4.cmml">L</mi></mrow><annotation-xml encoding="MathML-Content" id="A2.T3.1.1.1.m1.1b"><apply id="A2.T3.1.1.1.m1.1.1.cmml" xref="A2.T3.1.1.1.m1.1.1"><times id="A2.T3.1.1.1.m1.1.1.1.cmml" xref="A2.T3.1.1.1.m1.1.1.1"></times><ci id="A2.T3.1.1.1.m1.1.1.2.cmml" xref="A2.T3.1.1.1.m1.1.1.2">𝑀</ci><ci id="A2.T3.1.1.1.m1.1.1.3.cmml" xref="A2.T3.1.1.1.m1.1.1.3">𝑇</ci><ci id="A2.T3.1.1.1.m1.1.1.4.cmml" xref="A2.T3.1.1.1.m1.1.1.4">𝐿</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.T3.1.1.1.m1.1c">MTL</annotation><annotation encoding="application/x-llamapun" id="A2.T3.1.1.1.m1.1d">italic_M italic_T italic_L</annotation></semantics></math></th> <td class="ltx_td ltx_align_center ltx_border_t" id="A2.T3.1.1.2">1272335</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A2.T3.1.1.3">1125727</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A2.T3.1.1.4">125080</td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="A2.T3.1.1.5">21528</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A2.T3.1.1.6">407858</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A2.T3.1.1.7">347697</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A2.T3.1.1.8">38633</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A2.T3.1.1.9">21528</td> </tr> <tr class="ltx_tr" id="A2.T3.2.2"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="A2.T3.2.2.1"><math alttext="IC_{50}" class="ltx_Math" display="inline" id="A2.T3.2.2.1.m1.1"><semantics id="A2.T3.2.2.1.m1.1a"><mrow id="A2.T3.2.2.1.m1.1.1" xref="A2.T3.2.2.1.m1.1.1.cmml"><mi id="A2.T3.2.2.1.m1.1.1.2" 
xref="A2.T3.2.2.1.m1.1.1.2.cmml">I</mi><mo id="A2.T3.2.2.1.m1.1.1.1" xref="A2.T3.2.2.1.m1.1.1.1.cmml">⁢</mo><msub id="A2.T3.2.2.1.m1.1.1.3" xref="A2.T3.2.2.1.m1.1.1.3.cmml"><mi id="A2.T3.2.2.1.m1.1.1.3.2" xref="A2.T3.2.2.1.m1.1.1.3.2.cmml">C</mi><mn id="A2.T3.2.2.1.m1.1.1.3.3" xref="A2.T3.2.2.1.m1.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="A2.T3.2.2.1.m1.1b"><apply id="A2.T3.2.2.1.m1.1.1.cmml" xref="A2.T3.2.2.1.m1.1.1"><times id="A2.T3.2.2.1.m1.1.1.1.cmml" xref="A2.T3.2.2.1.m1.1.1.1"></times><ci id="A2.T3.2.2.1.m1.1.1.2.cmml" xref="A2.T3.2.2.1.m1.1.1.2">𝐼</ci><apply id="A2.T3.2.2.1.m1.1.1.3.cmml" xref="A2.T3.2.2.1.m1.1.1.3"><csymbol cd="ambiguous" id="A2.T3.2.2.1.m1.1.1.3.1.cmml" xref="A2.T3.2.2.1.m1.1.1.3">subscript</csymbol><ci id="A2.T3.2.2.1.m1.1.1.3.2.cmml" xref="A2.T3.2.2.1.m1.1.1.3.2">𝐶</ci><cn id="A2.T3.2.2.1.m1.1.1.3.3.cmml" type="integer" xref="A2.T3.2.2.1.m1.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.T3.2.2.1.m1.1c">IC_{50}</annotation><annotation encoding="application/x-llamapun" id="A2.T3.2.2.1.m1.1d">italic_I italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math></th> <td class="ltx_td ltx_align_center" id="A2.T3.2.2.2">962063</td> <td class="ltx_td ltx_align_center" id="A2.T3.2.2.3">854230</td> <td class="ltx_td ltx_align_center" id="A2.T3.2.2.4">94859</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A2.T3.2.2.5">12974</td> <td class="ltx_td ltx_align_center" id="A2.T3.2.2.6">320594</td> <td class="ltx_td ltx_align_center" id="A2.T3.2.2.7">276969</td> <td class="ltx_td ltx_align_center" id="A2.T3.2.2.8">30651</td> <td class="ltx_td ltx_align_center" id="A2.T3.2.2.9">12974</td> </tr> <tr class="ltx_tr" id="A2.T3.3.3"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="A2.T3.3.3.1"><math alttext="EC_{50}" class="ltx_Math" display="inline" id="A2.T3.3.3.1.m1.1"><semantics id="A2.T3.3.3.1.m1.1a"><mrow 
id="A2.T3.3.3.1.m1.1.1" xref="A2.T3.3.3.1.m1.1.1.cmml"><mi id="A2.T3.3.3.1.m1.1.1.2" xref="A2.T3.3.3.1.m1.1.1.2.cmml">E</mi><mo id="A2.T3.3.3.1.m1.1.1.1" xref="A2.T3.3.3.1.m1.1.1.1.cmml">⁢</mo><msub id="A2.T3.3.3.1.m1.1.1.3" xref="A2.T3.3.3.1.m1.1.1.3.cmml"><mi id="A2.T3.3.3.1.m1.1.1.3.2" xref="A2.T3.3.3.1.m1.1.1.3.2.cmml">C</mi><mn id="A2.T3.3.3.1.m1.1.1.3.3" xref="A2.T3.3.3.1.m1.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="A2.T3.3.3.1.m1.1b"><apply id="A2.T3.3.3.1.m1.1.1.cmml" xref="A2.T3.3.3.1.m1.1.1"><times id="A2.T3.3.3.1.m1.1.1.1.cmml" xref="A2.T3.3.3.1.m1.1.1.1"></times><ci id="A2.T3.3.3.1.m1.1.1.2.cmml" xref="A2.T3.3.3.1.m1.1.1.2">𝐸</ci><apply id="A2.T3.3.3.1.m1.1.1.3.cmml" xref="A2.T3.3.3.1.m1.1.1.3"><csymbol cd="ambiguous" id="A2.T3.3.3.1.m1.1.1.3.1.cmml" xref="A2.T3.3.3.1.m1.1.1.3">subscript</csymbol><ci id="A2.T3.3.3.1.m1.1.1.3.2.cmml" xref="A2.T3.3.3.1.m1.1.1.3.2">𝐶</ci><cn id="A2.T3.3.3.1.m1.1.1.3.3.cmml" type="integer" xref="A2.T3.3.3.1.m1.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.T3.3.3.1.m1.1c">EC_{50}</annotation><annotation encoding="application/x-llamapun" id="A2.T3.3.3.1.m1.1d">italic_E italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math></th> <td class="ltx_td ltx_align_center" id="A2.T3.3.3.2">97952</td> <td class="ltx_td ltx_align_center" id="A2.T3.3.3.3">84067</td> <td class="ltx_td ltx_align_center" id="A2.T3.3.3.4">9508</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A2.T3.3.3.5">4377</td> <td class="ltx_td ltx_align_center" id="A2.T3.3.3.6">32842</td> <td class="ltx_td ltx_align_center" id="A2.T3.3.3.7">25675</td> <td class="ltx_td ltx_align_center" id="A2.T3.3.3.8">2790</td> <td class="ltx_td ltx_align_center" id="A2.T3.3.3.9">4377</td> </tr> <tr class="ltx_tr" id="A2.T3.4.4"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="A2.T3.4.4.1"><math alttext="K_{i}" class="ltx_Math" 
display="inline" id="A2.T3.4.4.1.m1.1"><semantics id="A2.T3.4.4.1.m1.1a"><msub id="A2.T3.4.4.1.m1.1.1" xref="A2.T3.4.4.1.m1.1.1.cmml"><mi id="A2.T3.4.4.1.m1.1.1.2" xref="A2.T3.4.4.1.m1.1.1.2.cmml">K</mi><mi id="A2.T3.4.4.1.m1.1.1.3" xref="A2.T3.4.4.1.m1.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="A2.T3.4.4.1.m1.1b"><apply id="A2.T3.4.4.1.m1.1.1.cmml" xref="A2.T3.4.4.1.m1.1.1"><csymbol cd="ambiguous" id="A2.T3.4.4.1.m1.1.1.1.cmml" xref="A2.T3.4.4.1.m1.1.1">subscript</csymbol><ci id="A2.T3.4.4.1.m1.1.1.2.cmml" xref="A2.T3.4.4.1.m1.1.1.2">𝐾</ci><ci id="A2.T3.4.4.1.m1.1.1.3.cmml" xref="A2.T3.4.4.1.m1.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.T3.4.4.1.m1.1c">K_{i}</annotation><annotation encoding="application/x-llamapun" id="A2.T3.4.4.1.m1.1d">italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math></th> <td class="ltx_td ltx_align_center" id="A2.T3.4.4.2">198091</td> <td class="ltx_td ltx_align_center" id="A2.T3.4.4.3">175442</td> <td class="ltx_td ltx_align_center" id="A2.T3.4.4.4">19447</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A2.T3.4.4.5">3202</td> <td class="ltx_td ltx_align_center" id="A2.T3.4.4.6">47946</td> <td class="ltx_td ltx_align_center" id="A2.T3.4.4.7">40188</td> <td class="ltx_td ltx_align_center" id="A2.T3.4.4.8">4556</td> <td class="ltx_td ltx_align_center" id="A2.T3.4.4.9">3202</td> </tr> <tr class="ltx_tr" id="A2.T3.5.5"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_bb ltx_border_r" id="A2.T3.5.5.1"><math alttext="K_{d}" class="ltx_Math" display="inline" id="A2.T3.5.5.1.m1.1"><semantics id="A2.T3.5.5.1.m1.1a"><msub id="A2.T3.5.5.1.m1.1.1" xref="A2.T3.5.5.1.m1.1.1.cmml"><mi id="A2.T3.5.5.1.m1.1.1.2" xref="A2.T3.5.5.1.m1.1.1.2.cmml">K</mi><mi id="A2.T3.5.5.1.m1.1.1.3" xref="A2.T3.5.5.1.m1.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="A2.T3.5.5.1.m1.1b"><apply id="A2.T3.5.5.1.m1.1.1.cmml" 
xref="A2.T3.5.5.1.m1.1.1"><csymbol cd="ambiguous" id="A2.T3.5.5.1.m1.1.1.1.cmml" xref="A2.T3.5.5.1.m1.1.1">subscript</csymbol><ci id="A2.T3.5.5.1.m1.1.1.2.cmml" xref="A2.T3.5.5.1.m1.1.1.2">𝐾</ci><ci id="A2.T3.5.5.1.m1.1.1.3.cmml" xref="A2.T3.5.5.1.m1.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.T3.5.5.1.m1.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="A2.T3.5.5.1.m1.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math></th> <td class="ltx_td ltx_align_center ltx_border_bb" id="A2.T3.5.5.2">54570</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A2.T3.5.5.3">47347</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A2.T3.5.5.4">5347</td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_r" id="A2.T3.5.5.5">1876</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A2.T3.5.5.6">17509</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A2.T3.5.5.7">14003</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A2.T3.5.5.8">1630</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A2.T3.5.5.9">1876</td> </tr> </tbody> </table> </figure> <div class="ltx_para ltx_noindent" id="A2.p2"> Additionally, the dataset encompasses 1,385,201 assay labels (over 1.38 million), each derived from a corresponding wet-lab bioactivity experiment, ensuring the reliability and accuracy of the bioactivity information. SIU includes 1,720 diverse protein targets. Because each protein can possess multiple distinct binding pockets, we applied rigorous deduplication, resulting in a total of 9,662 unique pockets. The dataset also features a substantial and diverse collection of 214,686 small molecules across all pockets. Importantly, we have only included protein pocket-small molecule pairs confirmed to be active or inactive through wet-lab experiments, amounting to 1,291,362 pairs (over 1.29 million). 
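</div> <div class="ltx_para ltx_noindent"> The counts in this appendix and in Table 3 can be cross-checked with a few lines of arithmetic. The sketch below is illustrative; the variable and function names are hypothetical, and the numbers are copied from the text and the table.

```python
# Dataset-level counts quoted in this appendix.
N_CONFORMATIONS = 5_342_250
N_ASSAY_LABELS = 1_385_201
N_PAIRS = 1_291_362
N_TARGETS = 1_720
N_POCKETS = 9_662

# Table 3 rows as (total, train, valid, test).
TABLE3 = {
    "SIU 0.9": {
        "MTL": (1272335, 1125727, 125080, 21528),
        "IC50": (962063, 854230, 94859, 12974),
        "EC50": (97952, 84067, 9508, 4377),
        "Ki": (198091, 175442, 19447, 3202),
        "Kd": (54570, 47347, 5347, 1876),
    },
    "SIU 0.6": {
        "MTL": (407858, 347697, 38633, 21528),
        "IC50": (320594, 276969, 30651, 12974),
        "EC50": (32842, 25675, 2790, 4377),
        "Ki": (47946, 40188, 4556, 3202),
        "Kd": (17509, 14003, 1630, 1876),
    },
}


def split_is_consistent(row):
    """Return True if train + valid + test equals the reported total."""
    total, train, valid, test = row
    return train + valid + test == total


# Check every row of both versions.
all_consistent = all(
    split_is_consistent(row)
    for version in TABLE3.values()
    for row in version.values()
)

# Check that both versions report the same test-set counts.
same_test_sets = all(
    TABLE3["SIU 0.9"][name][3] == TABLE3["SIU 0.6"][name][3]
    for name in TABLE3["SIU 0.9"]
)

# Simple averages derived from the dataset-level counts.
conformations_per_pair = N_CONFORMATIONS / N_PAIRS
pockets_per_target = N_POCKETS / N_TARGETS

print(all_consistent, same_test_sets)
print(round(conformations_per_pair, 2), round(pockets_per_target, 2))
```

With the numbers as printed, every row's train, validation, and test counts add up to its total, and the test counts are identical in the two versions. The dataset averages roughly 4.1 conformations per pair and about 5.6 pockets per target.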
</div> <div class="ltx_para ltx_noindent" id="A2.p3"> As shown in Table <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#A2.T3" title="Table 3 ‣ Appendix B Dataset overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">3</a>, we included various assay types with significant amounts of data to demonstrate that bioactivity values from different assay types should not be mixed and to facilitate the future assay-type-specific use of SIU. Additionally, to examine the structural differences among small molecules from the top four assay types, a random sample from each was visualized using t-SNE with ECFP fingerprints (radius = 3, 1024-bit vectors), as depicted in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#A2.F5" title="Figure 5 ‣ Appendix B Dataset overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction">5</a>. </div> <figure class="ltx_figure" id="A2.F5"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="358" id="A2.F5.g1" src="extracted/5664306/images/shuffled_tsne_02.png" width="419"/> <figcaption class="ltx_caption ltx_centering">Figure 5: Visualization of chemical structure differences among small molecules from the top four assay types using t-SNE with ECFP fingerprints.</figcaption> </figure> </section> <section class="ltx_appendix" id="A3"> <h2 class="ltx_title ltx_title_appendix"> Appendix C Test set construction</h2> <div class="ltx_para ltx_noindent" id="A3.p1"> To ensure the robustness and generalizability of the experimental findings with SIU, we meticulously curated a test set composed of 10 protein targets, as listed in Table <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#A3.T4" title="Table 4 ‣ Appendix C Test set construction ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity 
Prediction">4</a>. These targets were selected to represent a wide range of protein classes, including G protein-coupled receptors (GPCRs), kinases, cytochromes, nuclear receptors, ion channels, epigenetic targets, and others, ensuring broad coverage of the bioactivity landscape. For example, "C11B1_HUMAN" belongs to the cytochrome P450 family, which is involved in the metabolism of various drugs <cite class="ltx_cite ltx_citemacro_cite">Bureik et al. [<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib8" title="">2002</a>], Denisov et al. [<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib15" title="">2005</a>]</cite>. "RARG_HUMAN" belongs to the nuclear receptor family, which is targeted by drugs such as bexarotene, used to treat certain cancers <cite class="ltx_cite ltx_citemacro_cite">Altucci et al. [<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib2" title="">2007</a>], Qu and Tang [<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib44" title="">2010</a>]</cite>. "NMDE1_HUMAN" represents the NMDA receptor, a critical glutamate receptor in neurons implicated in various neurological disorders, with memantine being an approved NMDA receptor antagonist for moderate to severe Alzheimer’s disease <cite class="ltx_cite ltx_citemacro_cite">Mori and Mishina [<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib39" title="">1995</a>], Reisberg et al. [<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib45" title="">2003</a>]</cite>. Covering targets with such varied functions broadens the applicability of our results to drug discovery. 
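</div> <div class="ltx_para ltx_noindent"> A target-held-out split of this kind can be sketched as follows. The UniProt accessions and pair counts are copied from Table 4; the record layout (a dictionary with a <code>uniprot_id</code> key) and the function name are assumptions for illustration, not the released data format or the authors' exact procedure.

```python
from typing import Dict, Iterable, List, Tuple

# Pair counts per held-out test target, copied from Table 4.
TABLE4_PAIR_COUNTS: Dict[str, int] = {
    "P61073": 1376,  # CXCR4_HUMAN, GPCR
    "P42866": 2379,  # OPRM_MOUSE, GPCR
    "Q00535": 2189,  # CDK5_HUMAN, Kinase
    "Q04759": 2320,  # KPCT_HUMAN, Kinase
    "P15538": 2427,  # C11B1_HUMAN, Cytochrome
    "P13631": 1888,  # RARG_HUMAN, Nuclear Receptor
    "Q12879": 2144,  # NMDE1_HUMAN, Ion Channel
    "Q9UGN5": 2251,  # PARP2_HUMAN, Epigenetic
    "Q86WV6": 2495,  # STING_HUMAN, Others
    "Q96SW2": 2059,  # CRBN_HUMAN, Others
}
TEST_UNIPROT_IDS = frozenset(TABLE4_PAIR_COUNTS)


def split_by_target(
    records: Iterable[Dict[str, str]],
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Send every record of a held-out target to the test set.

    Returns (train_and_valid_records, test_records).
    """
    train_valid: List[Dict[str, str]] = []
    test: List[Dict[str, str]] = []
    for record in records:
        if record["uniprot_id"] in TEST_UNIPROT_IDS:
            test.append(record)
        else:
            train_valid.append(record)
    return train_valid, test


# Toy usage with made-up records ("TOY_TARGET" is not a real accession).
toy_records = [
    {"uniprot_id": "P61073", "smiles": "CCO"},
    {"uniprot_id": "TOY_TARGET", "smiles": "CCN"},
]
toy_train_valid, toy_test = split_by_target(toy_records)
total_test_pairs = sum(TABLE4_PAIR_COUNTS.values())
print(len(toy_train_valid), len(toy_test), total_test_pairs)
```

The pair counts in Table 4 sum to 21,528, which matches the MTL test-set size reported in Table 3.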
</div> <figure class="ltx_table" id="A3.T4"> <figcaption class="ltx_caption ltx_centering">Table 4: The curated test set of 10 protein targets, covering a diverse range of protein classes and displaying an even distribution of small molecule-pocket pair counts.</figcaption> <table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle" id="A3.T4.1"> <thead class="ltx_thead"> <tr class="ltx_tr" id="A3.T4.1.1.1"> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="A3.T4.1.1.1.1">UniProt</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="A3.T4.1.1.1.2">Gene name</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="A3.T4.1.1.1.3">Class</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="A3.T4.1.1.1.4">Small molecule-</th> </tr> <tr class="ltx_tr" id="A3.T4.1.2.2"> <th class="ltx_td ltx_th ltx_th_column ltx_border_r" id="A3.T4.1.2.2.1"></th> <th class="ltx_td ltx_th ltx_th_column ltx_border_r" id="A3.T4.1.2.2.2"></th> <th class="ltx_td ltx_th ltx_th_column ltx_border_r" id="A3.T4.1.2.2.3"></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="A3.T4.1.2.2.4">pocket pair count</th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="A3.T4.1.3.1"> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="A3.T4.1.3.1.1">P61073</td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="A3.T4.1.3.1.2">CXCR4_HUMAN</td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="A3.T4.1.3.1.3">GPCR</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A3.T4.1.3.1.4">1376</td> </tr> <tr class="ltx_tr" id="A3.T4.1.4.2"> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.4.2.1">P42866</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.4.2.2">OPRM_MOUSE</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.4.2.3">GPCR</td> <td 
class="ltx_td ltx_align_center" id="A3.T4.1.4.2.4">2379</td> </tr> <tr class="ltx_tr" id="A3.T4.1.5.3"> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.5.3.1">Q00535</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.5.3.2">CDK5_HUMAN</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.5.3.3">Kinase</td> <td class="ltx_td ltx_align_center" id="A3.T4.1.5.3.4">2189</td> </tr> <tr class="ltx_tr" id="A3.T4.1.6.4"> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.6.4.1">Q04759</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.6.4.2">KPCT_HUMAN</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.6.4.3">Kinase</td> <td class="ltx_td ltx_align_center" id="A3.T4.1.6.4.4">2320</td> </tr> <tr class="ltx_tr" id="A3.T4.1.7.5"> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.7.5.1">P15538</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.7.5.2">C11B1_HUMAN</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.7.5.3">Cytochrome</td> <td class="ltx_td ltx_align_center" id="A3.T4.1.7.5.4">2427</td> </tr> <tr class="ltx_tr" id="A3.T4.1.8.6"> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.8.6.1">P13631</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.8.6.2">RARG_HUMAN</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.8.6.3">Nuclear Receptor</td> <td class="ltx_td ltx_align_center" id="A3.T4.1.8.6.4">1888</td> </tr> <tr class="ltx_tr" id="A3.T4.1.9.7"> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.9.7.1">Q12879</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.9.7.2">NMDE1_HUMAN</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.9.7.3">Ion Channel</td> <td class="ltx_td ltx_align_center" id="A3.T4.1.9.7.4">2144</td> </tr> <tr class="ltx_tr" id="A3.T4.1.10.8"> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.10.8.1">Q9UGN5</td> <td class="ltx_td 
ltx_align_center ltx_border_r" id="A3.T4.1.10.8.2">PARP2_HUMAN</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.10.8.3">Epigenetic</td> <td class="ltx_td ltx_align_center" id="A3.T4.1.10.8.4">2251</td> </tr> <tr class="ltx_tr" id="A3.T4.1.11.9"> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.11.9.1">Q86WV6</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.11.9.2">STING_HUMAN</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.11.9.3">Others</td> <td class="ltx_td ltx_align_center" id="A3.T4.1.11.9.4">2495</td> </tr> <tr class="ltx_tr" id="A3.T4.1.12.10"> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_r" id="A3.T4.1.12.10.1">Q96SW2</td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_r" id="A3.T4.1.12.10.2">CRBN_HUMAN</td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_r" id="A3.T4.1.12.10.3">Others</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A3.T4.1.12.10.4">2059</td> </tr> </tbody> </table> </figure> </section> <section class="ltx_appendix" id="A4"> <h2 class="ltx_title ltx_title_appendix"> Appendix D Model Training</h2> <div class="ltx_para ltx_noindent" id="A4.p1"> For the GNN model, we use the same model as in atom3d <cite class="ltx_cite ltx_citemacro_citep">[Townshend et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib51" title="">2021</a>]</cite> (<a class="ltx_ref ltx_href" href="https://github.com/drorlab/atom3d/" title="">https://github.com/drorlab/atom3d/</a>). We train the model on one NVIDIA A100 GPU with a batch size of 256, a maximum of 20 epochs, the Adam optimizer, and a learning rate of 1e-3. 
</div> <div class="ltx_para ltx_noindent" id="A4.p2"> For the 3D-CNN model, we use the same model as in atom3d <cite class="ltx_cite ltx_citemacro_citep">[Townshend et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib51" title="">2021</a>]</cite> (<a class="ltx_ref ltx_href" href="https://github.com/drorlab/atom3d/" title="">https://github.com/drorlab/atom3d/</a>). We train the model on one NVIDIA A100 GPU with a batch size of 256, a maximum of 20 epochs, the Adam optimizer, and a learning rate of 1e-4. </div> <div class="ltx_para ltx_noindent" id="A4.p3"> For the Uni-Mol model, we use the pretrained model weights provided at <a class="ltx_ref ltx_href" href="https://github.com/dptech-corp/Uni-Mol/" title="">https://github.com/dptech-corp/Uni-Mol/</a>. The outputs of the pretrained molecular encoder and pocket encoder are concatenated and passed through a four-layer Multi-Layer Perceptron (MLP) with hidden dimensions of 1024, 521, 256, and 128. We train the model on four NVIDIA A100 GPUs with a batch size of 384, a maximum of 50 epochs, the Adam optimizer, and a learning rate of 1e-4. </div> <div class="ltx_para ltx_noindent" id="A4.p4"> For the ProFSA model, we use the pretrained model weights provided at <a class="ltx_ref ltx_href" href="https://github.com/bowen-gao/ProFSA" title="">https://github.com/bowen-gao/ProFSA</a>. The outputs of the pretrained molecular encoder and pocket encoder are concatenated and passed through a four-layer Multi-Layer Perceptron (MLP) with hidden dimensions of 1024, 521, 256, and 128. We train the model on four NVIDIA A100 GPUs with a batch size of 384, a maximum of 50 epochs, the Adam optimizer, and a learning rate of 1e-4. 
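</div> <div class="ltx_para ltx_noindent"> For side-by-side comparison, the training settings of the four baselines can be collected in one small configuration table. The sketch below is illustrative: the class, field, and function names are hypothetical, and treating the batch size as a global (not per-GPU) value in <code>steps_per_epoch</code> is an assumption, since the text does not say which is meant.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Training settings for one baseline (hypothetical container)."""

    num_gpus: int        # number of NVIDIA A100 GPUs
    batch_size: int
    max_epochs: int
    optimizer: str
    learning_rate: float


# Values copied from the paragraphs of this appendix.
BASELINES = {
    "GNN": TrainConfig(1, 256, 20, "Adam", 1e-3),
    "3D-CNN": TrainConfig(1, 256, 20, "Adam", 1e-4),
    "Uni-Mol": TrainConfig(4, 384, 50, "Adam", 1e-4),
    "ProFSA": TrainConfig(4, 384, 50, "Adam", 1e-4),
}


def steps_per_epoch(config: TrainConfig, num_train_examples: int) -> int:
    """Optimizer steps per epoch, assuming batch_size is the global batch size."""
    return math.ceil(num_train_examples / config.batch_size)


for name, config in BASELINES.items():
    print(name, config)
```

The two atom3d baselines differ only in learning rate, while the two pretrained-encoder baselines share identical settings.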
</div> <div class="ltx_para ltx_noindent" id="A4.p5"> Further details can be found at <a class="ltx_ref ltx_href" href="https://github.com/bowen-gao/SIU" title="">https://github.com/bowen-gao/SIU</a>. </div> </section> <section class="ltx_appendix" id="A5"> <h2 class="ltx_title ltx_title_appendix"> Appendix E Potential negative impact of SIU</h2> <div class="ltx_para ltx_noindent" id="A5.p1"> While our dataset, SIU, represents a significant advancement in the field of bioactivity prediction, it is important to acknowledge potential limitations and areas of concern. Despite our robust multi-software docking approach and consensus filtering, the inherent reliance on computational methods may still introduce certain biases or inaccuracies in the modeled small molecule-protein interactions. These potential inaccuracies could inadvertently mislead researchers, leading to less reliable predictions and potentially diverting attention away from promising compounds or targets. </div> <div class="ltx_pagination ltx_role_newpage"></div> </section> </article> </div> <footer class="ltx_page_footer"> <div class="ltx_page_logo">Generated on Thu Jun 13 09:41:37 2024 by <a class="ltx_LaTeXML_logo" href="http://dlmf.nist.gov/LaTeXML/">LaTeXML<img alt="Mascot Sammy" 
src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAsAAAAOCAYAAAD5YeaVAAAAAXNSR0IArs4c6QAAAAZiS0dEAP8A/wD/oL2nkwAAAAlwSFlzAAALEwAACxMBAJqcGAAAAAd0SU1FB9wKExQZLWTEaOUAAAAddEVYdENvbW1lbnQAQ3JlYXRlZCB3aXRoIFRoZSBHSU1Q72QlbgAAAdpJREFUKM9tkL+L2nAARz9fPZNCKFapUn8kyI0e4iRHSR1Kb8ng0lJw6FYHFwv2LwhOpcWxTjeUunYqOmqd6hEoRDhtDWdA8ApRYsSUCDHNt5ul13vz4w0vWCgUnnEc975arX6ORqN3VqtVZbfbTQC4uEHANM3jSqXymFI6yWazP2KxWAXAL9zCUa1Wy2tXVxheKA9YNoR8Pt+aTqe4FVVVvz05O6MBhqUIBGk8Hn8HAOVy+T+XLJfLS4ZhTiRJgqIoVBRFIoric47jPnmeB1mW/9rr9ZpSSn3Lsmir1fJZlqWlUonKsvwWwD8ymc/nXwVBeLjf7xEKhdBut9Hr9WgmkyGEkJwsy5eHG5vN5g0AKIoCAEgkEkin0wQAfN9/cXPdheu6P33fBwB4ngcAcByHJpPJl+fn54mD3Gg0NrquXxeLRQAAwzAYj8cwTZPwPH9/sVg8PXweDAauqqr2cDjEer1GJBLBZDJBs9mE4zjwfZ85lAGg2+06hmGgXq+j3+/DsixYlgVN03a9Xu8jgCNCyIegIAgx13Vfd7vdu+FweG8YRkjXdWy329+dTgeSJD3ieZ7RNO0VAXAPwDEAO5VKndi2fWrb9jWl9Esul6PZbDY9Go1OZ7PZ9z/lyuD3OozU2wAAAABJRU5ErkJggg=="/></a> </div></footer> </div> </body> </html>