<!DOCTYPE html> <html lang="en"> <head> <meta content="text/html; charset=utf-8" http-equiv="content-type"/> <title>SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction</title> <!--Generated on Thu Jun 13 09:41:37 2024 by LaTeXML (version 0.8.8) http://dlmf.nist.gov/LaTeXML/.--> <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/> <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/ar5iv-fonts.0.7.9.min.css" rel="stylesheet" type="text/css"/> <link href="/static/browse/0.3.4/css/latexml_styles.css" rel="stylesheet" type="text/css"/> <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/js/bootstrap.bundle.min.js"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/html2canvas/1.3.3/html2canvas.min.js"></script> <script src="/static/browse/0.3.4/js/addons_new.js"></script> <script src="/static/browse/0.3.4/js/feedbackOverlay.js"></script> <base href="/html/2406.08961v1/"/></head> <body> <nav class="ltx_page_navbar"> <nav class="ltx_TOC"> <ol class="ltx_toclist"> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S1" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">1 </span>Introduction</span></a></li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S2" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">2 </span>Related work</span></a> <ol class="ltx_toclist 
ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S2.SS0.SSS0.Px1" title="In 2 Related work ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title">Non-structural datasets on drug-target interaction for bioactivity prediction.</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S2.SS0.SSS0.Px2" title="In 2 Related work ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title">Structural datasets based on experimental structures for bioactivity prediction.</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S2.SS0.SSS0.Px3" title="In 2 Related work ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title">Structural datasets based on modeling structures for bioactivity prediction.</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3 </span>SIU dataset construction and overview</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS1" title="In 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.1 
</span>Methods</span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_subsubsection"> <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS1.SSS1" title="In 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.1.1 </span>Data cleaning and deduplication</span></a> <ol class="ltx_toclist ltx_toclist_subsubsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS1.SSS1.Px1" title="In 3.1.1 Data cleaning and deduplication ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title">Bioactivity data extracting.</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS1.SSS1.Px2" title="In 3.1.1 Data cleaning and deduplication ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title">PDB structure retrieval and mapping.</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsubsection"> <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS1.SSS2" title="In 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.1.2 </span>Structural data construction via multi-software docking</span></a> <ol class="ltx_toclist ltx_toclist_subsubsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a 
class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS1.SSS2.Px1" title="In 3.1.2 Structural data construction via multi-software docking ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title">Molecular docking.</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS1.SSS2.Px2" title="In 3.1.2 Structural data construction via multi-software docking ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title">Consensus filtering of docking poses.</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsubsection"> <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS1.SSS3" title="In 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.1.3 </span>Data construction for downstream tasks</span></a> <ol class="ltx_toclist ltx_toclist_subsubsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS1.SSS3.Px1" title="In 3.1.3 Data construction for downstream tasks ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title">Dataset organization for unbiased bioactivity prediction.</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS1.SSS3.Px2" title="In 3.1.3 Data construction for downstream tasks ‣ 3.1 
Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title">Dataset split.</span></a></li> </ol> </li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS2" title="In 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">3.2 </span>Dataset overview</span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS2.SSS0.Px1" title="In 3.2 Dataset overview ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title">Large-scale.</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS2.SSS0.Px2" title="In 3.2 Dataset overview ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title">Diversity.</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS2.SSS0.Px3" title="In 3.2 Dataset overview ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title">High-quality.</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.SS2.SSS0.Px4" title="In 3.2 Dataset overview ‣ 3 SIU 
dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title">Well-organized.</span></a></li> </ol> </li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S4" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4 </span>SIU experiments and analysis</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_subsection"> <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S4.SS1" title="In 4 SIU experiments and analysis ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">4.1 </span>Analysis</span></a> <ol class="ltx_toclist ltx_toclist_subsection"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S4.SS1.SSS0.Px1" title="In 4.1 Analysis ‣ 4 SIU experiments and analysis ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title">Different difficulties of assay types.</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S4.SS1.SSS0.Px2" title="In 4.1 Analysis ‣ 4 SIU experiments and analysis ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title">Influence of measuring correlation with same PDB IDs.</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S4.SS1.SSS0.Px3" title="In 4.1 
Analysis ‣ 4 SIU experiments and analysis ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title">Effectiveness of training on larger dataset.</span></a></li> </ol> </li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"> <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S5" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">5 </span>Limitations and future work</span></a> <ol class="ltx_toclist ltx_toclist_section"> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S5.SS0.SSS0.Px1" title="In 5 Limitations and future work ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title">Limitations.</span></a></li> <li class="ltx_tocentry ltx_tocentry_paragraph"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S5.SS0.SSS0.Px2" title="In 5 Limitations and future work ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title">Future work.</span></a></li> </ol> </li> <li class="ltx_tocentry ltx_tocentry_section"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S6" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">6 </span>Conclusion</span></a></li> <li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#A1" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title"><span 
class="ltx_tag ltx_tag_ref">A </span>Dataset and Code Availability</span></a></li> <li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#A2" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">B </span>Dataset overview</span></a></li> <li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#A3" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">C </span>Test set construction</span></a></li> <li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#A4" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">D </span>Model Training</span></a></li> <li class="ltx_tocentry ltx_tocentry_appendix"><a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#A5" title="In SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_title"><span class="ltx_tag ltx_tag_ref">E </span>Potential negative impact of SIU</span></a></li> </ol></nav> </nav> <div class="ltx_page_main"> <div class="ltx_page_content"> <article class="ltx_document ltx_authors_1line"> <h1 class="ltx_title ltx_title_document">SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction</h1> <div class="ltx_authors"> <span class="ltx_creator ltx_role_author"> <span class="ltx_personname"> Yanwen Huang<sup class="ltx_sup" id="id1.1.id1">1</sup>    , Bowen Gao<sup class="ltx_sup" id="id2.2.id2">2</sup><span 
class="ltx_note ltx_role_footnotemark" id="footnotex1"><sup class="ltx_note_mark">1</sup><span class="ltx_note_outer"><span class="ltx_note_content"><sup class="ltx_note_mark">1</sup><span class="ltx_note_type">footnotemark: </span><span class="ltx_tag ltx_tag_note">1</span></span></span></span>  , Yinjun Jia<sup class="ltx_sup" id="id3.3.id3">3</sup>, Hongbo Ma<sup class="ltx_sup" id="id4.4.id4">4</sup>, <br class="ltx_break"/><span class="ltx_text ltx_font_bold" id="id5.5.id5">Wei-Ying Ma<sup class="ltx_sup" id="id5.5.id5.1">2</sup></span>, <span class="ltx_text ltx_font_bold" id="id6.6.id6">Ya-Qin Zhang<sup class="ltx_sup" id="id6.6.id6.1">2</sup></span>, <span class="ltx_text ltx_font_bold" id="id7.7.id7">Yanyan Lan<sup class="ltx_sup" id="id7.7.id7.1">2</sup></span> <br class="ltx_break"/> <br class="ltx_break"/><sup class="ltx_sup" id="id8.8.id8">1</sup>Department of Pharmaceutical Science, Peking University <br class="ltx_break"/><sup class="ltx_sup" id="id9.9.id9">2</sup>Institute for AI Industry Research (AIR), Tsinghua University <br class="ltx_break"/><sup class="ltx_sup" id="id10.10.id10">3</sup>School of Life Sciences, Tsinghua University <br class="ltx_break"/><sup class="ltx_sup" id="id11.11.id11">4</sup>Department of Computer Science and Technology, Tsinghua University <br class="ltx_break"/> </span><span class="ltx_author_notes">Equal contribution. Work was done while Yanwen Huang was an intern at AIR. Correspondence to <span class="ltx_text ltx_font_typewriter" id="id12.12.id1">lanyanyan@air.tsinghua.edu.cn</span></span></span> </div> <div class="ltx_abstract"> <h6 class="ltx_title ltx_title_abstract">Abstract</h6> <p class="ltx_p" id="id13.id1">Small molecules play a pivotal role in modern medicine, and scrutinizing their interactions with protein targets is essential for the discovery and development of novel, life-saving therapeutics. 
The term "bioactivity" encompasses various biological effects resulting from these interactions, including both binding and functional responses. The magnitude of bioactivity dictates the therapeutic or toxic pharmacological outcomes of small molecules, rendering accurate bioactivity prediction crucial for the development of safe and effective drugs. However, existing structural datasets of small molecule-protein interactions are often limited in scale and lack systematically organized bioactivity labels, thereby impeding our understanding of these interactions and precise bioactivity prediction. In this study, we introduce a comprehensive dataset of small molecule-protein interactions, consisting of over a million binding structures, each annotated with real biological activity labels. This dataset is designed to facilitate unbiased bioactivity prediction. We evaluated several classical models on this dataset, and the results demonstrate that the task of unbiased bioactivity prediction is challenging yet essential.</p> </div> <section class="ltx_section" id="S1"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2> <div class="ltx_para ltx_noindent" id="S1.p1"> <p class="ltx_p" id="S1.p1.1">Bioactivity encapsulates various types of distinct measurements derived from different wet lab conditions, including both binding affinity and the spectrum of biological effects resulting from small molecule-protein interactions. Accurate bioactivity prediction is fundamental for discerning therapeutic potential and off-target toxicity effects, guiding medicinal chemistry efforts in discovering and optimizing potential small-molecule therapeutics, and is thus pivotal to the development of safe and effective drugs <cite class="ltx_cite ltx_citemacro_cite">Tropsha et al. 
(<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib52" title="">2024</a>); Gaulton and Overington (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib22" title="">2010</a>)</cite>. For a small molecule to modulate the function of its protein target and exert its biological effects, it must be recognized by the protein through three-dimensional complementarity of shape and properties <cite class="ltx_cite ltx_citemacro_cite">Verma et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib55" title="">2010</a>)</cite>. This level of detail is indispensable, as knowing only the materials of a lock (protein) and the blueprint of a key (small molecule) is insufficient; 3D information is required to understand how the key fits into the lock and functions <cite class="ltx_cite ltx_citemacro_cite">Koshland Jr (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib31" title="">1995</a>); Eschenmoser (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib18" title="">1995</a>)</cite>. Current challenges in bioactivity prediction largely stem from the scarcity of high-quality, 3D structural data on small molecule-protein interactions.</p> </div> <div class="ltx_para ltx_noindent" id="S1.p2"> <p class="ltx_p" id="S1.p2.1">The existing structural data on small molecule-protein interactions are markedly insufficient. Structural data derived from wet-lab experiments are limited, owing to the laborious and time-consuming nature of these assays. Additionally, this data often lacks comprehensive bioactivity annotations and is poorly organized with respect to bioactivity assay types. While computational modeling approaches have been employed to generate structural datasets, these efforts have yielded datasets of modest size with limited molecular diversity. 
The paucity of high-quality structural data still imposes a significant barrier to accurate bioactivity prediction, highlighting the critical need for computationally generated, large-scale, high-quality datasets.</p> </div> <div class="ltx_para ltx_noindent" id="S1.p3"> <p class="ltx_p" id="S1.p3.1">To address this critical need, we present <span class="ltx_text ltx_font_bold" id="S1.p3.1.1">SIU</span>: a million-scale <span class="ltx_text ltx_font_bold" id="S1.p3.1.2">S</span>tructural small molecule-protein <span class="ltx_text ltx_font_bold" id="S1.p3.1.3">I</span>nteraction dataset for <span class="ltx_text ltx_font_bold" id="S1.p3.1.4">U</span>nbiased bioactivity prediction,<span class="ltx_text ltx_font_bold" id="S1.p3.1.5"> the largest and most comprehensive structural dataset available to date</span>. SIU comprises over 5.34 million conformations, integrating both structural and bioactivity information for small molecule-protein interactions. Within this dataset, small molecule-protein pairs feature 1.38 million rigorously curated bioactivity annotations, each with a clear assay type designation. Our dataset provides extensive coverage of diverse small molecules, encompassing both active and inactive compounds, thereby surpassing the limitations of datasets restricted to molecules structurally similar to co-crystal ligands. It also includes a wide array of protein targets, covering all common protein classes, with each protein associated with multiple PDB IDs reflecting distinct pocket conformations. Our robust data generation pipeline employs a multi-software docking and consensus filtering approach to ensure the precise modeling of small molecule-protein complexes. Bioactivity labels are meticulously curated and systematically organized according to assay types. 
SIU represents a significant advancement, offering a solid foundation for unbiased bioactivity prediction and enabling more accurate and comprehensive pharmaceutical investigations.</p> </div> <div class="ltx_para ltx_noindent" id="S1.p4"> <p class="ltx_p" id="S1.p4.1">We conducted experiments with several classical baseline models, and the results demonstrate that <span class="ltx_text ltx_font_bold" id="S1.p4.1.1">our large-scale dataset can improve model performance compared to the widely used PDBbind dataset <cite class="ltx_cite ltx_citemacro_cite">Wang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib57" title="">2004</a>, <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib58" title="">2005</a>)</cite>.</span> Additionally, the correlations computed over all protein-molecule pairs pooled together are significantly higher than those computed after grouping by PDB ID, that is, over the molecules of a single protein pocket. This indicates that <span class="ltx_text ltx_font_bold" id="S1.p4.1.2">the correlation within PDB IDs is more challenging to achieve and serves as a more important metric for evaluating the bioactivity prediction ability of models.</span> This highlights the importance of the unbiased bioactivity prediction task we introduced. 
This task focuses on the bioactivity difference for different molecules within a protein pocket, instead of the bias introduced by the different bioactivity ranges for different protein pockets.</p> </div> <div class="ltx_para ltx_noindent" id="S1.p5"> <p class="ltx_p" id="S1.p5.1">In conclusion,<span class="ltx_text ltx_font_bold" id="S1.p5.1.1"> our main contributions are threefold: (1) </span>We introduced a million-scale structural dataset to address the exigent demands of the AI-driven drug discovery (AIDD) community;<span class="ltx_text ltx_font_bold" id="S1.p5.1.2"> (2) </span>We devised and rigorously validated a robust, scalable pipeline for producing high-fidelity structural data of small molecule-protein interactions;<span class="ltx_text ltx_font_bold" id="S1.p5.1.3"> (3) </span>We accentuated the significance of differentiating among bioactivity assay types and meticulously curated the dataset to enhance this practice in the training and evaluation of bioactivity prediction models.</p> </div> </section> <section class="ltx_section" id="S2"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">2 </span>Related work</h2> <section class="ltx_paragraph" id="S2.SS0.SSS0.Px1"> <h5 class="ltx_title ltx_title_paragraph">Non-structural datasets on drug-target interaction for bioactivity prediction.</h5> <div class="ltx_para ltx_noindent" id="S2.SS0.SSS0.Px1.p1"> <p class="ltx_p" id="S2.SS0.SSS0.Px1.p1.1">A multitude of datasets are available for drug-target affinity (DTA) prediction; however, these datasets frequently lack structural data concerning the interactions between small molecules and their corresponding targets <cite class="ltx_cite ltx_citemacro_cite">Ekins et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib16" title="">2017</a>)</cite>. Large-scale bioactivity databases such as ChEMBL <cite class="ltx_cite ltx_citemacro_cite">Mendez et al. 
(<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib38" title="">2019</a>); Gaulton et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib23" title="">2012</a>)</cite>, PubChem <cite class="ltx_cite ltx_citemacro_cite">Kim et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib30" title="">2016</a>)</cite>, GuideToPharmacology <cite class="ltx_cite ltx_citemacro_cite">Pawson et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib43" title="">2014</a>)</cite>, and DrugBank <cite class="ltx_cite ltx_citemacro_cite">Wishart et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib62" title="">2018</a>); Law et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib33" title="">2014</a>)</cite> are invaluable resources. Research efforts in DTA prediction primarily focus on binding affinity labels, with datasets such as Davis <cite class="ltx_cite ltx_citemacro_cite">Davis et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib14" title="">2011</a>)</cite> and KIBA <cite class="ltx_cite ltx_citemacro_cite">Tang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib50" title="">2014</a>)</cite> being widely utilized. MoleculeNet <cite class="ltx_cite ltx_citemacro_cite">Wu et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib63" title="">2018</a>)</cite> also comprises non-structural bioactivity data.</p> </div> </section> <section class="ltx_paragraph" id="S2.SS0.SSS0.Px2"> <h5 class="ltx_title ltx_title_paragraph">Structural datasets based on experimental structures for bioactivity prediction.</h5> <div class="ltx_para ltx_noindent" id="S2.SS0.SSS0.Px2.p1"> <p class="ltx_p" id="S2.SS0.SSS0.Px2.p1.1">The Protein Data Bank (PDB) <cite class="ltx_cite ltx_citemacro_cite">Berman et al. 
(<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib5" title="">2000</a>)</cite>, established in 1971, has been an indispensable resource for structural biology, providing extensive structural data on proteins and other biomolecules. However, it lacks direct bioactivity annotations and a systematic categorization of small molecules, often including non-specific and biologically irrelevant compounds. To address these limitations, several specialized databases have been developed, including PDBbind <cite class="ltx_cite ltx_citemacro_cite">Wang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib57" title="">2004</a>, <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib58" title="">2005</a>)</cite>, Binding MOAD <cite class="ltx_cite ltx_citemacro_cite">Hu et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib29" title="">2005</a>)</cite>, KiBank <cite class="ltx_cite ltx_citemacro_cite">Zhang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib66" title="">2004</a>)</cite>, AffinDB <cite class="ltx_cite ltx_citemacro_cite">Block et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib7" title="">2006</a>)</cite>, and BioLiP <cite class="ltx_cite ltx_citemacro_cite">Yang et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib64" title="">2012</a>); Wei et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib59" title="">2024</a>)</cite>. The curation behind these databases significantly enhances their utility for structure-based pharmaceutical research. Nevertheless, the reliance on labor-intensive experimental data acquisition limits the scalability and rapid expansion of these databases. Moreover, some databases lack explicit guidelines against mixing data from different assay types. 
</p> </div> </section> <section class="ltx_paragraph" id="S2.SS0.SSS0.Px3"> <h5 class="ltx_title ltx_title_paragraph">Structural datasets based on modeling structures for bioactivity prediction.</h5> <div class="ltx_para ltx_noindent" id="S2.SS0.SSS0.Px3.p1"> <p class="ltx_p" id="S2.SS0.SSS0.Px3.p1.1">Computational methods have been employed to construct datasets modeling small molecule-protein interaction structures that correspond with experimental bioactivities. The Natural Ligand DataBase (NLDB) <cite class="ltx_cite ltx_citemacro_cite">Murakami et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib40" title="">2016</a>)</cite> includes 7,053 complex structures, some of which are computationally modeled. The eModel-BDB <cite class="ltx_cite ltx_citemacro_cite">Naderi et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib41" title="">2018</a>)</cite> reports 200,005 structural entries, though it encounters issues such as steric clashes <cite class="ltx_cite ltx_citemacro_cite">Li et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib35" title="">2024</a>)</cite>. BindingNet <cite class="ltx_cite ltx_citemacro_cite">Li et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib35" title="">2024</a>)</cite> represents a novel dataset comprising 69,816 high-quality modeled structures obtained through comparative complex structure modeling. 
This modeling technique, however, requires the modeled small molecules to have structural similarity to co-crystal ligands from experiments, thereby limiting the quantity and diversity of the small molecules modeled in BindingNet.</p> </div> </section> </section> <section class="ltx_section" id="S3"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">3 </span>SIU dataset construction and overview</h2> <div class="ltx_para ltx_noindent" id="S3.p1"> <p class="ltx_p" id="S3.p1.1">The SIU dataset is a pioneering resource for predicting bioactivity, offering a comprehensive collection of small molecule-protein interactions with meticulously annotated bioactivity information. This section details the construction methods employed to ensure the data’s quality, diversity, and organization for downstream tasks.</p> </div> <figure class="ltx_figure" id="S3.F1"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="228" id="S3.F1.g1" src="extracted/5664306/images/m.png" width="598"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 1: </span><span class="ltx_text ltx_font_bold" id="S3.F1.5.1">Pipeline for SIU construction.</span> <span class="ltx_text ltx_font_bold" id="S3.F1.6.2">(A)</span> Small molecules and protein targets were obtained from corresponding databases, cleaned, and deduplicated. Different small molecules binding to the same protein and different pockets (different PDB IDs) of the same protein were filtered and analyzed. <span class="ltx_text ltx_font_bold" id="S3.F1.7.3">(B)</span> These were then subjected to a multi-software docking pipeline, where the small molecules were prepared and docked to their experimentally confirmed targets using three different software programs. The resulting poses were filtered through a voting mechanism to construct the final dataset. 
<span class="ltx_text ltx_font_bold" id="S3.F1.8.4">(C)</span> The dataset is well organized and contains multiple pockets for each protein and multiple molecules for each pocket, allowing downstream tasks to be performed per PDB entry and per assay type.</figcaption> </figure> <section class="ltx_subsection" id="S3.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.1 </span>Methods</h3> <section class="ltx_subsubsection" id="S3.SS1.SSS1"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection">3.1.1 </span>Data cleaning and deduplication</h4> <section class="ltx_paragraph" id="S3.SS1.SSS1.Px1"> <h5 class="ltx_title ltx_title_paragraph">Bioactivity data extraction.</h5> <div class="ltx_para ltx_noindent" id="S3.SS1.SSS1.Px1.p1"> <p class="ltx_p" id="S3.SS1.SSS1.Px1.p1.1">We retrieved non-structural bioactivity data from established databases: ChEMBL <cite class="ltx_cite ltx_citemacro_cite">Mendez et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib38" title="">2019</a>); Gaulton et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib23" title="">2012</a>)</cite> and BindingDB <cite class="ltx_cite ltx_citemacro_cite">Chen et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib9" title="">2001</a>); Liu et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib37" title="">2007</a>); Gilson et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib25" title="">2016</a>)</cite>. Molecules were filtered based on predefined criteria. Assays measuring bioactivity of small molecule-protein interactions were selected and filtered. 
The protein target information for each assay was carefully identified and standardized using UniProt IDs <cite class="ltx_cite ltx_citemacro_cite">Consortium (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib11" title="">2015</a>); uni (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib1" title="">2017</a>)</cite>, ensuring consistency across datasets and facilitating matching with experimental structural data.</p> </div> <div class="ltx_para ltx_noindent" id="S3.SS1.SSS1.Px1.p2"> <p class="ltx_p" id="S3.SS1.SSS1.Px1.p2.1">The small molecule filtering criteria, which cover molecular weight, atom composition, and element restrictions, are designed to exclude molecules that are not drug-like. All small molecules retained their original IUPAC International Chemical Identifier (InChI) keys <cite class="ltx_cite ltx_citemacro_cite">Heller et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib28" title="">2015</a>)</cite> and Simplified Molecular Input Line Entry System (SMILES) notations <cite class="ltx_cite ltx_citemacro_cite">Weininger (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib60" title="">1988</a>); Weininger et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib61" title="">1989</a>)</cite> from the databases to avoid mismatches due to different software calculations. Additionally, docking structurally similar small molecules against a single target wastes computational resources. 
We examined targets associated with an excessively high number of small molecules and introduced a new filter based on small molecule extended-connectivity fingerprints (ECFP) similarity <cite class="ltx_cite ltx_citemacro_cite">Rogers and Hahn (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib47" title="">2010</a>)</cite>, ensuring both quality and diversity of small molecules while minimizing the computational expense of molecular docking.</p> </div> <div class="ltx_para ltx_noindent" id="S3.SS1.SSS1.Px1.p3"> <p class="ltx_p" id="S3.SS1.SSS1.Px1.p3.5">The bioactivity data filtering process is also rigorously defined to ensure high quality. Data from ChEMBL and BindingDB were independently extracted and cleaned before merging. For ChEMBL, criteria included assays involving only a single protein target, assays being either binding or functional, and bioactivity labels having standard relations, values, and units (i.e., <math alttext="pM" class="ltx_Math" display="inline" id="S3.SS1.SSS1.Px1.p3.1.m1.1"><semantics id="S3.SS1.SSS1.Px1.p3.1.m1.1a"><mrow id="S3.SS1.SSS1.Px1.p3.1.m1.1.1" xref="S3.SS1.SSS1.Px1.p3.1.m1.1.1.cmml"><mi id="S3.SS1.SSS1.Px1.p3.1.m1.1.1.2" xref="S3.SS1.SSS1.Px1.p3.1.m1.1.1.2.cmml">p</mi><mo id="S3.SS1.SSS1.Px1.p3.1.m1.1.1.1" xref="S3.SS1.SSS1.Px1.p3.1.m1.1.1.1.cmml">⁢</mo><mi id="S3.SS1.SSS1.Px1.p3.1.m1.1.1.3" xref="S3.SS1.SSS1.Px1.p3.1.m1.1.1.3.cmml">M</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS1.Px1.p3.1.m1.1b"><apply id="S3.SS1.SSS1.Px1.p3.1.m1.1.1.cmml" xref="S3.SS1.SSS1.Px1.p3.1.m1.1.1"><times id="S3.SS1.SSS1.Px1.p3.1.m1.1.1.1.cmml" xref="S3.SS1.SSS1.Px1.p3.1.m1.1.1.1"></times><ci id="S3.SS1.SSS1.Px1.p3.1.m1.1.1.2.cmml" xref="S3.SS1.SSS1.Px1.p3.1.m1.1.1.2">𝑝</ci><ci id="S3.SS1.SSS1.Px1.p3.1.m1.1.1.3.cmml" xref="S3.SS1.SSS1.Px1.p3.1.m1.1.1.3">𝑀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS1.Px1.p3.1.m1.1c">pM</annotation><annotation 
encoding="application/x-llamapun" id="S3.SS1.SSS1.Px1.p3.1.m1.1d">italic_p italic_M</annotation></semantics></math>, <math alttext="nM" class="ltx_Math" display="inline" id="S3.SS1.SSS1.Px1.p3.2.m2.1"><semantics id="S3.SS1.SSS1.Px1.p3.2.m2.1a"><mrow id="S3.SS1.SSS1.Px1.p3.2.m2.1.1" xref="S3.SS1.SSS1.Px1.p3.2.m2.1.1.cmml"><mi id="S3.SS1.SSS1.Px1.p3.2.m2.1.1.2" xref="S3.SS1.SSS1.Px1.p3.2.m2.1.1.2.cmml">n</mi><mo id="S3.SS1.SSS1.Px1.p3.2.m2.1.1.1" xref="S3.SS1.SSS1.Px1.p3.2.m2.1.1.1.cmml">⁢</mo><mi id="S3.SS1.SSS1.Px1.p3.2.m2.1.1.3" xref="S3.SS1.SSS1.Px1.p3.2.m2.1.1.3.cmml">M</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS1.Px1.p3.2.m2.1b"><apply id="S3.SS1.SSS1.Px1.p3.2.m2.1.1.cmml" xref="S3.SS1.SSS1.Px1.p3.2.m2.1.1"><times id="S3.SS1.SSS1.Px1.p3.2.m2.1.1.1.cmml" xref="S3.SS1.SSS1.Px1.p3.2.m2.1.1.1"></times><ci id="S3.SS1.SSS1.Px1.p3.2.m2.1.1.2.cmml" xref="S3.SS1.SSS1.Px1.p3.2.m2.1.1.2">𝑛</ci><ci id="S3.SS1.SSS1.Px1.p3.2.m2.1.1.3.cmml" xref="S3.SS1.SSS1.Px1.p3.2.m2.1.1.3">𝑀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS1.Px1.p3.2.m2.1c">nM</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS1.Px1.p3.2.m2.1d">italic_n italic_M</annotation></semantics></math>, or <math alttext="{\mu}M" class="ltx_Math" display="inline" id="S3.SS1.SSS1.Px1.p3.3.m3.1"><semantics id="S3.SS1.SSS1.Px1.p3.3.m3.1a"><mrow id="S3.SS1.SSS1.Px1.p3.3.m3.1.1" xref="S3.SS1.SSS1.Px1.p3.3.m3.1.1.cmml"><mi id="S3.SS1.SSS1.Px1.p3.3.m3.1.1.2" xref="S3.SS1.SSS1.Px1.p3.3.m3.1.1.2.cmml">μ</mi><mo id="S3.SS1.SSS1.Px1.p3.3.m3.1.1.1" xref="S3.SS1.SSS1.Px1.p3.3.m3.1.1.1.cmml">⁢</mo><mi id="S3.SS1.SSS1.Px1.p3.3.m3.1.1.3" xref="S3.SS1.SSS1.Px1.p3.3.m3.1.1.3.cmml">M</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS1.Px1.p3.3.m3.1b"><apply id="S3.SS1.SSS1.Px1.p3.3.m3.1.1.cmml" xref="S3.SS1.SSS1.Px1.p3.3.m3.1.1"><times id="S3.SS1.SSS1.Px1.p3.3.m3.1.1.1.cmml" xref="S3.SS1.SSS1.Px1.p3.3.m3.1.1.1"></times><ci 
id="S3.SS1.SSS1.Px1.p3.3.m3.1.1.2.cmml" xref="S3.SS1.SSS1.Px1.p3.3.m3.1.1.2">𝜇</ci><ci id="S3.SS1.SSS1.Px1.p3.3.m3.1.1.3.cmml" xref="S3.SS1.SSS1.Px1.p3.3.m3.1.1.3">𝑀</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS1.Px1.p3.3.m3.1c">{\mu}M</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS1.Px1.p3.3.m3.1d">italic_μ italic_M</annotation></semantics></math>). BindingDB data were extracted using similar logic, with slightly different filters due to database differences. The cleaned datasets were merged using InChI keys for small molecules and UniProt IDs for protein targets, ensuring precise matching of bioactivity labels to their respective small molecule-protein interactions. All small molecule-protein pairs with matched bioactivity labels were subsequently docked. The bioactivity information was standardized to a unit of <math alttext="mol/L" class="ltx_Math" display="inline" id="S3.SS1.SSS1.Px1.p3.4.m4.1"><semantics id="S3.SS1.SSS1.Px1.p3.4.m4.1a"><mrow id="S3.SS1.SSS1.Px1.p3.4.m4.1.1" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.cmml"><mrow id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.cmml"><mi id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.2" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.2.cmml">m</mi><mo id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.1" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.1.cmml">⁢</mo><mi id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.3" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.3.cmml">o</mi><mo id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.1a" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.1.cmml">⁢</mo><mi id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.4" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.4.cmml">l</mi></mrow><mo id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.1" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.1.cmml">/</mo><mi id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.3" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.3.cmml">L</mi></mrow><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS1.Px1.p3.4.m4.1b"><apply id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.cmml" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1"><divide 
id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.1.cmml" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.1"></divide><apply id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.cmml" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2"><times id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.1.cmml" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.1"></times><ci id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.2.cmml" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.2">𝑚</ci><ci id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.3.cmml" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.3">𝑜</ci><ci id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.4.cmml" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.2.4">𝑙</ci></apply><ci id="S3.SS1.SSS1.Px1.p3.4.m4.1.1.3.cmml" xref="S3.SS1.SSS1.Px1.p3.4.m4.1.1.3">𝐿</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS1.Px1.p3.4.m4.1c">mol/L</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS1.Px1.p3.4.m4.1d">italic_m italic_o italic_l / italic_L</annotation></semantics></math> (<math alttext="M" class="ltx_Math" display="inline" id="S3.SS1.SSS1.Px1.p3.5.m5.1"><semantics id="S3.SS1.SSS1.Px1.p3.5.m5.1a"><mi id="S3.SS1.SSS1.Px1.p3.5.m5.1.1" xref="S3.SS1.SSS1.Px1.p3.5.m5.1.1.cmml">M</mi><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS1.Px1.p3.5.m5.1b"><ci id="S3.SS1.SSS1.Px1.p3.5.m5.1.1.cmml" xref="S3.SS1.SSS1.Px1.p3.5.m5.1.1">𝑀</ci></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS1.Px1.p3.5.m5.1c">M</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS1.Px1.p3.5.m5.1d">italic_M</annotation></semantics></math>) and converted to negative logarithmic values, similar to datasets for drug-target binding affinity prediction <cite class="ltx_cite ltx_citemacro_cite">Öztürk et al. 
(<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib42" title="">2018</a>)</cite>.</p> </div> </section> <section class="ltx_paragraph" id="S3.SS1.SSS1.Px2"> <h5 class="ltx_title ltx_title_paragraph">PDB structure retrieval and mapping.</h5> <div class="ltx_para ltx_noindent" id="S3.SS1.SSS1.Px2.p1"> <p class="ltx_p" id="S3.SS1.SSS1.Px2.p1.1">The protein structures were downloaded and matched with their UniProt IDs to ensure accurate alignment with bioactivity data. These structures were parsed into individual PDB format files, each representing a distinct pocket. Identified by PDB IDs, pockets are functional regions of the protein that interact with small molecules. We developed a filtering mechanism that leverages chemical and biological knowledge to eliminate PDB files containing non-specific or biologically irrelevant co-crystallized ligands not occupying genuine binding sites. The number of PDB files associated with each protein target varied significantly. Docking all these structures is computationally expensive and offers diminishing returns in terms of novel information. We addressed this issue by implementing Fast Local Alignment of Protein Pockets (FLAPP) <cite class="ltx_cite ltx_citemacro_cite">Sankar et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib48" title="">2022</a>)</cite> and other methods to further deduplicate the pocket library. As illustrated in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.F3" title="Figure 3 ‣ Dataset split. 
‣ 3.1.3 Data construction for downstream tasks ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_tag">3</span></a>(A), this step efficiently removed highly similar pockets, resulting in a more streamlined pocket collection for docking simulations.</p> </div> </section> </section> <section class="ltx_subsubsection" id="S3.SS1.SSS2"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection">3.1.2 </span>Structural data construction via multi-software docking</h4> <section class="ltx_paragraph" id="S3.SS1.SSS2.Px1"> <h5 class="ltx_title ltx_title_paragraph">Molecular docking.</h5> <div class="ltx_para ltx_noindent" id="S3.SS1.SSS2.Px1.p1"> <p class="ltx_p" id="S3.SS1.SSS2.Px1.p1.1">SIU employs multiple docking software programs <cite class="ltx_cite ltx_citemacro_cite">Friesner et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib20" title="">2004</a>); Verdonk et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib54" title="">2003</a>); Trott and Olson (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib53" title="">2010</a>)</cite>, reducing reliance on any individual docking software. Initial 3D conformations for the small molecules were generated. For molecules with chiral centers, different stereoisomers were explored and included. Ionization states of molecules at physiological pH were also considered to ensure accurate representations of their charged forms. Multiple conformations were prepared for each small molecule to account for their flexibility. The preprocessed data were organized into formats compatible with the chosen docking software. The protein targets were prepared, and grid files were generated according to each software’s specific requirements to ensure compatibility. 
Small molecules were then docked into the binding pockets of the protein structures.</p> </div> </section> <section class="ltx_paragraph" id="S3.SS1.SSS2.Px2"> <h5 class="ltx_title ltx_title_paragraph">Consensus filtering of docking poses.</h5> <div class="ltx_para ltx_noindent" id="S3.SS1.SSS2.Px2.p1"> <p class="ltx_p" id="S3.SS1.SSS2.Px2.p1.1">The molecular docking results undergo rigorous scrutiny to ensure the retention of only credible poses. SIU employs a stringent filtering process: only those docking poses that exhibit consistency across at least two out of three different docking software results are retained. This consensus-based approach mitigates the inclusion of erroneous or misleading docking poses, thereby augmenting the overall quality and reliability of the dataset.</p> </div> <figure class="ltx_figure" id="S3.F2"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="154" id="S3.F2.g1" src="extracted/5664306/images/rmsd_example_3.png" width="598"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 2: </span><span class="ltx_text ltx_font_bold" id="S3.F2.5.1">Capability of RMSD to quantify differences in docking poses.</span> <span class="ltx_text ltx_font_bold" id="S3.F2.6.2">(A)</span> RMSD 1.544, well-superimposed poses. <span class="ltx_text ltx_font_bold" id="S3.F2.7.3">(B)</span> RMSD 1.985, similar binding modes. <span class="ltx_text ltx_font_bold" id="S3.F2.8.4">(C)</span> RMSD 8.095, fundamentally different binding modes.</figcaption> </figure> <div class="ltx_para ltx_noindent" id="S3.SS1.SSS2.Px2.p2"> <p class="ltx_p" id="S3.SS1.SSS2.Px2.p2.1">Different docking poses of the same small molecule-PDB pair were evaluated using the root mean square deviation (RMSD). Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.F2" title="Figure 2 ‣ Consensus filtering of docking poses. 
‣ 3.1.2 Structural data construction via multi-software docking ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_tag">2</span></a> shows the RMSD and corresponding poses of a single Glide docking pose compared with the top three docking poses generated by GOLD. When RMSD is about 2, the key interactions are maintained, indicating a potentially valid docking result. This example underscores the importance of RMSD as a metric for evaluating the consistency and reliability of docking poses, with higher RMSD values indicating divergent binding modes that may arise from incorrect small molecule-protein interaction mode predictions. We further investigated the trade-off between pose accuracy and the quantity of retained data, conducting experiments to observe the impact of varying RMSD values on these factors. Co-crystal poses, used as the ground truth, were extracted from PDB complexes and redocked into the original PDB pockets according to our docking procedure. The results, shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.F3" title="Figure 3 ‣ Dataset split. ‣ 3.1.3 Data construction for downstream tasks ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_tag">3</span></a>(B), indicate that when the RMSD is less than 2, a significant number of molecules can be retained, and the success ratio of the poses is satisfactory; as the RMSD increases, the number of retained poses slightly rises, but the accuracy of these poses significantly decreases. 
Therefore, an RMSD of 2 was selected as the cutoff.</p> </div> </section> </section> <section class="ltx_subsubsection" id="S3.SS1.SSS3"> <h4 class="ltx_title ltx_title_subsubsection"> <span class="ltx_tag ltx_tag_subsubsection">3.1.3 </span>Data construction for downstream tasks</h4> <section class="ltx_paragraph" id="S3.SS1.SSS3.Px1"> <h5 class="ltx_title ltx_title_paragraph">Dataset organization for unbiased bioactivity prediction.</h5> <div class="ltx_para ltx_noindent" id="S3.SS1.SSS3.Px1.p1"> <p class="ltx_p" id="S3.SS1.SSS3.Px1.p1.2">We organized the data by PDB entry and by assay type to facilitate unbiased bioactivity prediction, addressing the common issue of mixing the dissociation constant (<math alttext="K_{d}" class="ltx_Math" display="inline" id="S3.SS1.SSS3.Px1.p1.1.m1.1"><semantics id="S3.SS1.SSS3.Px1.p1.1.m1.1a"><msub id="S3.SS1.SSS3.Px1.p1.1.m1.1.1" xref="S3.SS1.SSS3.Px1.p1.1.m1.1.1.cmml"><mi id="S3.SS1.SSS3.Px1.p1.1.m1.1.1.2" xref="S3.SS1.SSS3.Px1.p1.1.m1.1.1.2.cmml">K</mi><mi id="S3.SS1.SSS3.Px1.p1.1.m1.1.1.3" xref="S3.SS1.SSS3.Px1.p1.1.m1.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS3.Px1.p1.1.m1.1b"><apply id="S3.SS1.SSS3.Px1.p1.1.m1.1.1.cmml" xref="S3.SS1.SSS3.Px1.p1.1.m1.1.1"><csymbol cd="ambiguous" id="S3.SS1.SSS3.Px1.p1.1.m1.1.1.1.cmml" xref="S3.SS1.SSS3.Px1.p1.1.m1.1.1">subscript</csymbol><ci id="S3.SS1.SSS3.Px1.p1.1.m1.1.1.2.cmml" xref="S3.SS1.SSS3.Px1.p1.1.m1.1.1.2">𝐾</ci><ci id="S3.SS1.SSS3.Px1.p1.1.m1.1.1.3.cmml" xref="S3.SS1.SSS3.Px1.p1.1.m1.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS3.Px1.p1.1.m1.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS3.Px1.p1.1.m1.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math>) <cite class="ltx_cite ltx_citemacro_cite">Lineweaver and Burk (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib36" title="">1934</a>)</cite> and the 
inhibition constant (<math alttext="K_{i}" class="ltx_Math" display="inline" id="S3.SS1.SSS3.Px1.p1.2.m2.1"><semantics id="S3.SS1.SSS3.Px1.p1.2.m2.1a"><msub id="S3.SS1.SSS3.Px1.p1.2.m2.1.1" xref="S3.SS1.SSS3.Px1.p1.2.m2.1.1.cmml"><mi id="S3.SS1.SSS3.Px1.p1.2.m2.1.1.2" xref="S3.SS1.SSS3.Px1.p1.2.m2.1.1.2.cmml">K</mi><mi id="S3.SS1.SSS3.Px1.p1.2.m2.1.1.3" xref="S3.SS1.SSS3.Px1.p1.2.m2.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS1.SSS3.Px1.p1.2.m2.1b"><apply id="S3.SS1.SSS3.Px1.p1.2.m2.1.1.cmml" xref="S3.SS1.SSS3.Px1.p1.2.m2.1.1"><csymbol cd="ambiguous" id="S3.SS1.SSS3.Px1.p1.2.m2.1.1.1.cmml" xref="S3.SS1.SSS3.Px1.p1.2.m2.1.1">subscript</csymbol><ci id="S3.SS1.SSS3.Px1.p1.2.m2.1.1.2.cmml" xref="S3.SS1.SSS3.Px1.p1.2.m2.1.1.2">𝐾</ci><ci id="S3.SS1.SSS3.Px1.p1.2.m2.1.1.3.cmml" xref="S3.SS1.SSS3.Px1.p1.2.m2.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS1.SSS3.Px1.p1.2.m2.1c">K_{i}</annotation><annotation encoding="application/x-llamapun" id="S3.SS1.SSS3.Px1.p1.2.m2.1d">italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math>) <cite class="ltx_cite ltx_citemacro_cite">Yung-Chi and Prusoff (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib65" title="">1973</a>)</cite> data and neglecting the data of other bioactivities. This meticulous organization supports the evaluation of small molecule-protein interactions with high fidelity, ensuring that the inherent differences in PDB files and assay types are respected.</p> </div> </section> <section class="ltx_paragraph" id="S3.SS1.SSS3.Px2"> <h5 class="ltx_title ltx_title_paragraph">Dataset split.</h5> <div class="ltx_para ltx_noindent" id="S3.SS1.SSS3.Px2.p1"> <p class="ltx_p" id="S3.SS1.SSS3.Px2.p1.1">To ensure the generalizability of the experimental findings with SIU, we employed a manual curation approach for dataset splitting. 
We selected a set of 10 representative protein targets to serve as the test set. These targets were intentionally chosen to cover a diverse range of protein classes, including well-known drug targets such as G-Protein Coupled Receptors (GPCRs) <cite class="ltx_cite ltx_citemacro_cite">Hauser et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib27" title="">2017</a>)</cite>, kinases <cite class="ltx_cite ltx_citemacro_cite">Attwood et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib4" title="">2021</a>); Cohen et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib10" title="">2021</a>)</cite>, and cytochromes <cite class="ltx_cite ltx_citemacro_cite">Danielson (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib13" title="">2002</a>)</cite>. This selection strategy was designed to encompass the bioactivity landscape across various protein functionalities, thereby enhancing the applicability of our results to a wider range of potential drug discovery applications. We conducted non-homology analyses at two levels, 0.6 and 0.9, to ensure the independence and diversity of the training and test sets. Both versions, 0.9 and 0.6, allocate 21,528 data pairs for testing. Version 0.9 includes 1,250,807 data pairs for training and validation, while version 0.6 includes 386,330 data pairs for these purposes.</p> </div> <figure class="ltx_figure" id="S3.F3"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_landscape" height="299" id="S3.F3.g1" src="extracted/5664306/images/fig_all_new_2.png" width="598"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 3: </span><span class="ltx_text ltx_font_bold" id="S3.F3.6.1">Filter selection and dataset statistics.</span> <span class="ltx_text ltx_font_bold" id="S3.F3.7.2">(A)</span> Distribution of the number of PDB files per protein target before and after filtering. 
<span class="ltx_text ltx_font_bold" id="S3.F3.8.3">(B)</span> Influence of RMSD on success and retention ratios. <span class="ltx_text ltx_font_bold" id="S3.F3.9.4">(C)</span> Pairwise t-test p-value differences between the negative logarithmic assay values of four representative assay types, visualized in a heatmap, along with the distribution of the values for each type. <span class="ltx_text ltx_font_bold" id="S3.F3.10.5">(D)</span> Differences in assay values for ten representative protein targets, illustrated by a heatmap of their pairwise t-test p-values, and their distribution.</figcaption> </figure> </section> </section> </section> <section class="ltx_subsection" id="S3.SS2"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">3.2 </span>Dataset overview</h3> <section class="ltx_paragraph" id="S3.SS2.SSS0.Px1"> <h5 class="ltx_title ltx_title_paragraph">Large-scale.</h5> <div class="ltx_para ltx_noindent" id="S3.SS2.SSS0.Px1.p1"> <p class="ltx_p" id="S3.SS2.SSS0.Px1.p1.1">The SIU dataset comprises 5,342,250 conformations detailing small molecule-protein interactions, each entry providing comprehensive structural and bioactivity information. It includes 1,385,201 bioactivity labels derived from wet experiments, each with standardized values and clear assay type annotations.</p> </div> </section> <section class="ltx_paragraph" id="S3.SS2.SSS0.Px2"> <h5 class="ltx_title ltx_title_paragraph">Diversity.</h5> <div class="ltx_para ltx_noindent" id="S3.SS2.SSS0.Px2.p1"> <p class="ltx_p" id="S3.SS2.SSS0.Px2.p1.1">SIU offers an extensive range of data, encompassing 214,686 diverse small molecules and 1,720 distinct protein targets. It includes experimentally validated low-bioactivity or inactive molecules, often absent in structural datasets from wet experiments, thus providing valuable negative data for AIDD. 
The dataset features extensive protein pocket coverage, including proteins from humans, <span class="ltx_text ltx_font_italic" id="S3.SS2.SSS0.Px2.p1.1.1">E. coli</span> <cite class="ltx_cite ltx_citemacro_cite">Vila et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib56" title="">2016</a>)</cite>, various viruses, and other organisms. It spans major protein classes such as GPCRs <cite class="ltx_cite ltx_citemacro_cite">Hauser et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib27" title="">2017</a>)</cite>, kinases <cite class="ltx_cite ltx_citemacro_cite">Attwood et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib4" title="">2021</a>); Cohen et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib10" title="">2021</a>)</cite>, nuclear receptors <cite class="ltx_cite ltx_citemacro_cite">Robinson-Rechavi et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib46" title="">2003</a>)</cite>, cytochromes <cite class="ltx_cite ltx_citemacro_cite">Danielson (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib13" title="">2002</a>)</cite>, ion channels <cite class="ltx_cite ltx_citemacro_cite">Ashcroft (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib3" title="">1999</a>)</cite>, and other proteins involved in complex biological processes like epigenetics <cite class="ltx_cite ltx_citemacro_cite">Gibney and Nolan (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib24" title="">2010</a>); Feinberg (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib19" title="">2008</a>)</cite> and transcription <cite class="ltx_cite ltx_citemacro_cite">Cramer (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib12" title="">2019</a>); Lambert et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib32" title="">2018</a>)</cite>. 
As illustrated in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.F3" title="Figure 3 ‣ Dataset split. ‣ 3.1.3 Data construction for downstream tasks ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_tag">3</span></a>(D), the assay values of different protein targets vary significantly. This broad coverage ensures a comprehensive representation of small molecule-protein interaction modes, enhancing the relevance of our bioactivity prediction tasks to real biological environments.</p> </div> </section> <section class="ltx_paragraph" id="S3.SS2.SSS0.Px3"> <h5 class="ltx_title ltx_title_paragraph">High-quality.</h5> <div class="ltx_para ltx_noindent" id="S3.SS2.SSS0.Px3.p1"> <p class="ltx_p" id="S3.SS2.SSS0.Px3.p1.1">The structural information on small molecule-protein interactions in SIU is of high quality, due to our multi-software voting mechanism that maximizes docking accuracy within computational limits. As detailed in the structural data construction section, we achieved a satisfactory balance between data accuracy and scale, presenting high-quality data unobtainable with a single docking software or solely by ranking based on software-predicted docking scores. Docking software often provides successful simulated docking poses within the top-ranking positions, but these are not always ranked first by docking scores. Our method, however, is based on the consistency of docking pose sampling across different algorithms. 
By examining consensus among different docking algorithms, we effectively ensure more accurate docking pose data.</p> </div> </section> <section class="ltx_paragraph" id="S3.SS2.SSS0.Px4"> <h5 class="ltx_title ltx_title_paragraph">Well-organized.</h5> <div class="ltx_para ltx_noindent" id="S3.SS2.SSS0.Px4.p1"> <p class="ltx_p" id="S3.SS2.SSS0.Px4.p1.1">SIU’s bioactivity labels are meticulously curated and systematically organized by PDB IDs and assay types, ensuring data integrity and enabling effective PDB-wise and assay-wise comparisons. This organization offers a robust resource for unbiased bioactivity prediction, addressing the limitations of existing datasets that often fail to distinguish clearly between different bioactivity assay types. Traditional measurements of correlations in bioactivity prediction tasks are often ineffective due to the lack of clarity in existing datasets. SIU can also address this problem, ensuring more precise and meaningful analyses. Our structured approach facilitates nuanced assessments, such as evaluating the impact of specific small molecule modifications on protein interactions or comparing the efficacy of different compounds within the same protein pocket context.</p> </div> <div class="ltx_para ltx_noindent" id="S3.SS2.SSS0.Px4.p2"> <p class="ltx_p" id="S3.SS2.SSS0.Px4.p2.6">We argue that <span class="ltx_text ltx_font_bold" id="S3.SS2.SSS0.Px4.p2.6.1">the assay types should not be merged due to their distinct characteristics</span>. The heatmap in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.F3" title="Figure 3 ‣ Dataset split. 
‣ 3.1.3 Data construction for downstream tasks ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_tag">3</span></a>(C) presents the results of pairwise t-tests for the four assay types, revealing that half maximal inhibitory concentration (<math alttext="IC_{50}" class="ltx_Math" display="inline" id="S3.SS2.SSS0.Px4.p2.1.m1.1"><semantics id="S3.SS2.SSS0.Px4.p2.1.m1.1a"><mrow id="S3.SS2.SSS0.Px4.p2.1.m1.1.1" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.cmml"><mi id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.2" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.2.cmml">I</mi><mo id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.1" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.1.cmml">⁢</mo><msub id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.cmml"><mi id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.2" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.2.cmml">C</mi><mn id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.3" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.SSS0.Px4.p2.1.m1.1b"><apply id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1"><times id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.1"></times><ci id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.2.cmml" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.2">𝐼</ci><apply id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.cmml" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3"><csymbol cd="ambiguous" id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.1.cmml" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3">subscript</csymbol><ci id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.2.cmml" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.2">𝐶</ci><cn id="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.3.cmml" type="integer" xref="S3.SS2.SSS0.Px4.p2.1.m1.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.SSS0.Px4.p2.1.m1.1c">IC_{50}</annotation><annotation encoding="application/x-llamapun" 
id="S3.SS2.SSS0.Px4.p2.1.m1.1d">italic_I italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math>) differs the most in mean value compared to the other assay types, followed by half maximal effective concentration (<math alttext="EC_{50}" class="ltx_Math" display="inline" id="S3.SS2.SSS0.Px4.p2.2.m2.1"><semantics id="S3.SS2.SSS0.Px4.p2.2.m2.1a"><mrow id="S3.SS2.SSS0.Px4.p2.2.m2.1.1" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.cmml"><mi id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.2" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.2.cmml">E</mi><mo id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.1" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.1.cmml">⁢</mo><msub id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.cmml"><mi id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.2" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.2.cmml">C</mi><mn id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.3" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="S3.SS2.SSS0.Px4.p2.2.m2.1b"><apply id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1"><times id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.1"></times><ci id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.2.cmml" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.2">𝐸</ci><apply id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.cmml" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3"><csymbol cd="ambiguous" id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.1.cmml" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3">subscript</csymbol><ci id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.2.cmml" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.2">𝐶</ci><cn id="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.3.cmml" type="integer" xref="S3.SS2.SSS0.Px4.p2.2.m2.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.SSS0.Px4.p2.2.m2.1c">EC_{50}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.SSS0.Px4.p2.2.m2.1d">italic_E italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math>). 
In contrast, the means of <math alttext="K_{i}" class="ltx_Math" display="inline" id="S3.SS2.SSS0.Px4.p2.3.m3.1"><semantics id="S3.SS2.SSS0.Px4.p2.3.m3.1a"><msub id="S3.SS2.SSS0.Px4.p2.3.m3.1.1" xref="S3.SS2.SSS0.Px4.p2.3.m3.1.1.cmml"><mi id="S3.SS2.SSS0.Px4.p2.3.m3.1.1.2" xref="S3.SS2.SSS0.Px4.p2.3.m3.1.1.2.cmml">K</mi><mi id="S3.SS2.SSS0.Px4.p2.3.m3.1.1.3" xref="S3.SS2.SSS0.Px4.p2.3.m3.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS2.SSS0.Px4.p2.3.m3.1b"><apply id="S3.SS2.SSS0.Px4.p2.3.m3.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.3.m3.1.1"><csymbol cd="ambiguous" id="S3.SS2.SSS0.Px4.p2.3.m3.1.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.3.m3.1.1">subscript</csymbol><ci id="S3.SS2.SSS0.Px4.p2.3.m3.1.1.2.cmml" xref="S3.SS2.SSS0.Px4.p2.3.m3.1.1.2">𝐾</ci><ci id="S3.SS2.SSS0.Px4.p2.3.m3.1.1.3.cmml" xref="S3.SS2.SSS0.Px4.p2.3.m3.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.SSS0.Px4.p2.3.m3.1c">K_{i}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.SSS0.Px4.p2.3.m3.1d">italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="K_{d}" class="ltx_Math" display="inline" id="S3.SS2.SSS0.Px4.p2.4.m4.1"><semantics id="S3.SS2.SSS0.Px4.p2.4.m4.1a"><msub id="S3.SS2.SSS0.Px4.p2.4.m4.1.1" xref="S3.SS2.SSS0.Px4.p2.4.m4.1.1.cmml"><mi id="S3.SS2.SSS0.Px4.p2.4.m4.1.1.2" xref="S3.SS2.SSS0.Px4.p2.4.m4.1.1.2.cmml">K</mi><mi id="S3.SS2.SSS0.Px4.p2.4.m4.1.1.3" xref="S3.SS2.SSS0.Px4.p2.4.m4.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS2.SSS0.Px4.p2.4.m4.1b"><apply id="S3.SS2.SSS0.Px4.p2.4.m4.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.4.m4.1.1"><csymbol cd="ambiguous" id="S3.SS2.SSS0.Px4.p2.4.m4.1.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.4.m4.1.1">subscript</csymbol><ci id="S3.SS2.SSS0.Px4.p2.4.m4.1.1.2.cmml" xref="S3.SS2.SSS0.Px4.p2.4.m4.1.1.2">𝐾</ci><ci id="S3.SS2.SSS0.Px4.p2.4.m4.1.1.3.cmml" 
xref="S3.SS2.SSS0.Px4.p2.4.m4.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.SSS0.Px4.p2.4.m4.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.SSS0.Px4.p2.4.m4.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math> are relatively similar. However, even when the means are not significantly different, the assay types cannot be considered equivalent due to their distinct behaviors. As demonstrated in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.F3" title="Figure 3 ‣ Dataset split. ‣ 3.1.3 Data construction for downstream tasks ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_tag">3</span></a>(C), the bioactivity assay values vary significantly, with their negative logarithmic values ranging from 2 to 11 and exhibiting markedly different distributions. 
The distribution of <math alttext="K_{d}" class="ltx_Math" display="inline" id="S3.SS2.SSS0.Px4.p2.5.m5.1"><semantics id="S3.SS2.SSS0.Px4.p2.5.m5.1a"><msub id="S3.SS2.SSS0.Px4.p2.5.m5.1.1" xref="S3.SS2.SSS0.Px4.p2.5.m5.1.1.cmml"><mi id="S3.SS2.SSS0.Px4.p2.5.m5.1.1.2" xref="S3.SS2.SSS0.Px4.p2.5.m5.1.1.2.cmml">K</mi><mi id="S3.SS2.SSS0.Px4.p2.5.m5.1.1.3" xref="S3.SS2.SSS0.Px4.p2.5.m5.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS2.SSS0.Px4.p2.5.m5.1b"><apply id="S3.SS2.SSS0.Px4.p2.5.m5.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.5.m5.1.1"><csymbol cd="ambiguous" id="S3.SS2.SSS0.Px4.p2.5.m5.1.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.5.m5.1.1">subscript</csymbol><ci id="S3.SS2.SSS0.Px4.p2.5.m5.1.1.2.cmml" xref="S3.SS2.SSS0.Px4.p2.5.m5.1.1.2">𝐾</ci><ci id="S3.SS2.SSS0.Px4.p2.5.m5.1.1.3.cmml" xref="S3.SS2.SSS0.Px4.p2.5.m5.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.SSS0.Px4.p2.5.m5.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.SSS0.Px4.p2.5.m5.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math> is particularly unique, as its upper values are substantially higher. 
The <math alttext="K_{d}" class="ltx_Math" display="inline" id="S3.SS2.SSS0.Px4.p2.6.m6.1"><semantics id="S3.SS2.SSS0.Px4.p2.6.m6.1a"><msub id="S3.SS2.SSS0.Px4.p2.6.m6.1.1" xref="S3.SS2.SSS0.Px4.p2.6.m6.1.1.cmml"><mi id="S3.SS2.SSS0.Px4.p2.6.m6.1.1.2" xref="S3.SS2.SSS0.Px4.p2.6.m6.1.1.2.cmml">K</mi><mi id="S3.SS2.SSS0.Px4.p2.6.m6.1.1.3" xref="S3.SS2.SSS0.Px4.p2.6.m6.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="S3.SS2.SSS0.Px4.p2.6.m6.1b"><apply id="S3.SS2.SSS0.Px4.p2.6.m6.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.6.m6.1.1"><csymbol cd="ambiguous" id="S3.SS2.SSS0.Px4.p2.6.m6.1.1.1.cmml" xref="S3.SS2.SSS0.Px4.p2.6.m6.1.1">subscript</csymbol><ci id="S3.SS2.SSS0.Px4.p2.6.m6.1.1.2.cmml" xref="S3.SS2.SSS0.Px4.p2.6.m6.1.1.2">𝐾</ci><ci id="S3.SS2.SSS0.Px4.p2.6.m6.1.1.3.cmml" xref="S3.SS2.SSS0.Px4.p2.6.m6.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S3.SS2.SSS0.Px4.p2.6.m6.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="S3.SS2.SSS0.Px4.p2.6.m6.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math> distribution is more peaked, indicating a narrow concentration around the central values despite its broad range.</p> </div> <div class="ltx_para ltx_noindent" id="S3.SS2.SSS0.Px4.p3"> <p class="ltx_p" id="S3.SS2.SSS0.Px4.p3.1">Moreover, SIU provides multiple small molecule 3D poses for each protein pocket, allowing for unbiased comparison of small molecule poses while maintaining a constant protein pocket environment. This approach yields detailed information on how variations in small molecule poses influence their interactions with the protein pocket, considering factors such as shape and electrostatic complementarity. 
Ultimately, this enhances the modeling of the relationship between these interactions and observed bioactivity, advancing our understanding of small molecule-protein interactions and their effects.</p> </div> </section> </section> </section> <section class="ltx_section" id="S4"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">4 </span>SIU experiments and analysis</h2> <figure class="ltx_table" id="S4.T1"> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 1: </span>Results for multi-task learning with different label types. We show results for 3D-CNN, GNN, Uni-Mol, and ProFSA trained on SIU version 0.9.</figcaption> <div class="ltx_inline-block ltx_align_center ltx_transformed_outer" id="S4.T1.6" style="width:379.0pt;height:314pt;vertical-align:-0.0pt;"><span class="ltx_transformed_inner" style="transform:translate(0.0pt,0.0pt) scale(1,1) ;"> <p class="ltx_p" id="S4.T1.6.6"><span class="ltx_text" id="S4.T1.6.6.6"> <span class="ltx_tabular ltx_guessed_headers ltx_align_middle" id="S4.T1.6.6.6.6"> <span class="ltx_thead"> <span class="ltx_tr" id="S4.T1.2.2.2.2.2"> <span class="ltx_td ltx_th ltx_th_column ltx_th_row ltx_border_tt" id="S4.T1.2.2.2.2.2.3"></span> <span class="ltx_td ltx_th ltx_th_column ltx_th_row ltx_border_r ltx_border_tt" id="S4.T1.2.2.2.2.2.4"></span> <span class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S4.T1.2.2.2.2.2.5">RMSE</span> <span class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S4.T1.2.2.2.2.2.6">MAE</span> <span class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S4.T1.2.2.2.2.2.7">Pearson</span> <span class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S4.T1.1.1.1.1.1.1">Pearson<sup class="ltx_sup" id="S4.T1.1.1.1.1.1.1.1">∗</sup></span> <span class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S4.T1.2.2.2.2.2.8">Spearman</span> <span class="ltx_td ltx_align_center 
ltx_th ltx_th_column ltx_border_tt" id="S4.T1.2.2.2.2.2.2">Spearman<sup class="ltx_sup" id="S4.T1.2.2.2.2.2.2.1">∗</sup></span></span> </span> <span class="ltx_tbody"> <span class="ltx_tr" id="S4.T1.3.3.3.3.3"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t ltx_rowspan ltx_rowspan_4" id="S4.T1.3.3.3.3.3.1"><span class="ltx_text" id="S4.T1.3.3.3.3.3.1.1"> <span class="ltx_tabular ltx_align_middle" id="S4.T1.3.3.3.3.3.1.1.1"> <span class="ltx_tr" id="S4.T1.3.3.3.3.3.1.1.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="S4.T1.3.3.3.3.3.1.1.1.1.1"><math alttext="IC_{50}" class="ltx_Math" display="inline" id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1"><semantics id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1a"><mrow id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.cmml"><mi id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.2" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.2.cmml">I</mi><mo id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.1" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.1.cmml">⁢</mo><msub id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.cmml"><mi id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.2" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.2.cmml">C</mi><mn id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.3" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1b"><apply id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.cmml" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1"><times id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.1.cmml" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.1"></times><ci id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.2.cmml" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.2">𝐼</ci><apply id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.cmml" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3"><csymbol cd="ambiguous" id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.1.cmml" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3">subscript</csymbol><ci id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.2.cmml" 
xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.2">𝐶</ci><cn id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.3.cmml" type="integer" xref="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1c">IC_{50}</annotation><annotation encoding="application/x-llamapun" id="S4.T1.3.3.3.3.3.1.1.1.1.1.m1.1d">italic_I italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math></span></span> </span></span></span> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t" id="S4.T1.3.3.3.3.3.2">3D-CNN</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.3.3.3.3.3.3">1.560</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.3.3.3.3.3.4">1.275</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.3.3.3.3.3.5">0.158</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.3.3.3.3.3.6">0.044</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.3.3.3.3.3.7">0.154</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.3.3.3.3.3.8">0.040</span></span> <span class="ltx_tr" id="S4.T1.6.6.6.6.7.1"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.T1.6.6.6.6.7.1.1">GNN</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.7.1.2">1.412</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.7.1.3">1.141</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.7.1.4">0.336</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.7.1.5">0.241</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.7.1.6">0.316</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.7.1.7">0.235</span></span> <span class="ltx_tr" id="S4.T1.6.6.6.6.8.2"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.T1.6.6.6.6.8.2.1">Uni-Mol</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.8.2.2">1.353</span> 
<span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.8.2.3">1.092</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.8.2.4">0.462</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.8.2.5">0.343</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.8.2.6">0.466</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.8.2.7">0.351</span></span> <span class="ltx_tr" id="S4.T1.6.6.6.6.9.3"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.T1.6.6.6.6.9.3.1">ProFSA</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.9.3.2">1.361</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.9.3.3">1.108</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.9.3.4">0.382</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.9.3.5">0.331</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.9.3.6">0.356</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.9.3.7">0.317</span></span> <span class="ltx_tr" id="S4.T1.4.4.4.4.4"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t ltx_rowspan ltx_rowspan_4" id="S4.T1.4.4.4.4.4.1"><span class="ltx_text" id="S4.T1.4.4.4.4.4.1.1"> <span class="ltx_tabular ltx_align_middle" id="S4.T1.4.4.4.4.4.1.1.1"> <span class="ltx_tr" id="S4.T1.4.4.4.4.4.1.1.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="S4.T1.4.4.4.4.4.1.1.1.1.1"><math alttext="EC_{50}" class="ltx_Math" display="inline" id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1"><semantics id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1a"><mrow id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.cmml"><mi id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.2" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.2.cmml">E</mi><mo id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.1" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.1.cmml">⁢</mo><msub id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.cmml"><mi id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.2" 
xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.2.cmml">C</mi><mn id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.3" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1b"><apply id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.cmml" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1"><times id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.1.cmml" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.1"></times><ci id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.2.cmml" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.2">𝐸</ci><apply id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.cmml" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3"><csymbol cd="ambiguous" id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.1.cmml" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3">subscript</csymbol><ci id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.2.cmml" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.2">𝐶</ci><cn id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.3.cmml" type="integer" xref="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1c">EC_{50}</annotation><annotation encoding="application/x-llamapun" id="S4.T1.4.4.4.4.4.1.1.1.1.1.m1.1d">italic_E italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math></span></span> </span></span></span> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t" id="S4.T1.4.4.4.4.4.2">3D-CNN</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.4.4.4.4.4.3">1.518</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.4.4.4.4.4.4">1.234</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.4.4.4.4.4.5">0.128</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.4.4.4.4.4.6">0.010</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.4.4.4.4.4.7">0.128</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.4.4.4.4.4.8">0.004</span></span> <span class="ltx_tr" 
id="S4.T1.6.6.6.6.10.4"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.T1.6.6.6.6.10.4.1">GNN</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.10.4.2">1.334</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.10.4.3">1.025</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.10.4.4">0.444</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.10.4.5">0.108</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.10.4.6">0.481</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.10.4.7">0.120</span></span> <span class="ltx_tr" id="S4.T1.6.6.6.6.11.5"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.T1.6.6.6.6.11.5.1">Uni-Mol</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.11.5.2">1.273</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.11.5.3">1.017</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.11.5.4">0.428</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.11.5.5">0.178</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.11.5.6">0.461</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.11.5.7">0.144</span></span> <span class="ltx_tr" id="S4.T1.6.6.6.6.12.6"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.T1.6.6.6.6.12.6.1">ProFSA</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.12.6.2">1.255</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.12.6.3">0.971</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.12.6.4">0.438</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.12.6.5">0.204</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.12.6.6">0.495</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.12.6.7">0.154</span></span> <span class="ltx_tr" id="S4.T1.5.5.5.5.5"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t ltx_rowspan ltx_rowspan_4" 
id="S4.T1.5.5.5.5.5.1"><span class="ltx_text" id="S4.T1.5.5.5.5.5.1.1"> <span class="ltx_tabular ltx_align_middle" id="S4.T1.5.5.5.5.5.1.1.1"> <span class="ltx_tr" id="S4.T1.5.5.5.5.5.1.1.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="S4.T1.5.5.5.5.5.1.1.1.1.1"><math alttext="K_{i}" class="ltx_Math" display="inline" id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1"><semantics id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1a"><msub id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1" xref="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.cmml"><mi id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.2" xref="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.2.cmml">K</mi><mi id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.3" xref="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1b"><apply id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.cmml" xref="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1"><csymbol cd="ambiguous" id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.1.cmml" xref="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1">subscript</csymbol><ci id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.2.cmml" xref="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.2">𝐾</ci><ci id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.3.cmml" xref="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1c">K_{i}</annotation><annotation encoding="application/x-llamapun" id="S4.T1.5.5.5.5.5.1.1.1.1.1.m1.1d">italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math></span></span> </span></span></span> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t" id="S4.T1.5.5.5.5.5.2">3D-CNN</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.5.5.5.5.5.3">1.534</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.5.5.5.5.5.4">1.260</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.5.5.5.5.5.5">0.201</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.5.5.5.5.5.6">0.025</span> <span 
class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.5.5.5.5.5.7">0.200</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.5.5.5.5.5.8">0.021</span></span> <span class="ltx_tr" id="S4.T1.6.6.6.6.13.7"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.T1.6.6.6.6.13.7.1">GNN</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.13.7.2">1.814</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.13.7.3">1.504</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.13.7.4">0.247</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.13.7.5">0.099</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.13.7.6">0.107</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.13.7.7">0.058</span></span> <span class="ltx_tr" id="S4.T1.6.6.6.6.14.8"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.T1.6.6.6.6.14.8.1">Uni-Mol</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.14.8.2">1.390</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.14.8.3">1.133</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.14.8.4">0.375</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.14.8.5">0.092</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.14.8.6">0.324</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.14.8.7">0.056</span></span> <span class="ltx_tr" id="S4.T1.6.6.6.6.15.9"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.T1.6.6.6.6.15.9.1">ProFSA</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.15.9.2">1.374</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.15.9.3">1.142</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.15.9.4">0.405</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.15.9.5">0.149</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.15.9.6">0.365</span> <span class="ltx_td ltx_align_center" 
id="S4.T1.6.6.6.6.15.9.7">0.127</span></span> <span class="ltx_tr" id="S4.T1.6.6.6.6.6"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_bb ltx_border_t ltx_rowspan ltx_rowspan_4" id="S4.T1.6.6.6.6.6.1"><span class="ltx_text" id="S4.T1.6.6.6.6.6.1.1"> <span class="ltx_tabular ltx_align_middle" id="S4.T1.6.6.6.6.6.1.1.1"> <span class="ltx_tr" id="S4.T1.6.6.6.6.6.1.1.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="S4.T1.6.6.6.6.6.1.1.1.1.1"><math alttext="K_{d}" class="ltx_Math" display="inline" id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1"><semantics id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1a"><msub id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1" xref="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.cmml"><mi id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.2" xref="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.2.cmml">K</mi><mi id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.3" xref="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1b"><apply id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.cmml" xref="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1"><csymbol cd="ambiguous" id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.1.cmml" xref="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1">subscript</csymbol><ci id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.2.cmml" xref="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.2">𝐾</ci><ci id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.3.cmml" xref="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="S4.T1.6.6.6.6.6.1.1.1.1.1.m1.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math></span></span> </span></span></span> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t" id="S4.T1.6.6.6.6.6.2">3D-CNN</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.6.6.6.6.6.3">1.503</span> <span class="ltx_td ltx_align_center ltx_border_t" 
id="S4.T1.6.6.6.6.6.4">1.233</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.6.6.6.6.6.5">0.173</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.6.6.6.6.6.6">0.024</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.6.6.6.6.6.7">0.167</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T1.6.6.6.6.6.8">0.038</span></span> <span class="ltx_tr" id="S4.T1.6.6.6.6.16.10"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.T1.6.6.6.6.16.10.1">GNN</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.16.10.2">1.711</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.16.10.3">1.431</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.16.10.4">-0.068</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.16.10.5">0.065</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.16.10.6">-0.147</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.16.10.7">0.033</span></span> <span class="ltx_tr" id="S4.T1.6.6.6.6.17.11"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.T1.6.6.6.6.17.11.1">Uni-Mol</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.17.11.2">1.429</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.17.11.3">1.223</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.17.11.4">-0.084</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.17.11.5">0.155</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.17.11.6">-0.175</span> <span class="ltx_td ltx_align_center" id="S4.T1.6.6.6.6.17.11.7">0.144</span></span> <span class="ltx_tr" id="S4.T1.6.6.6.6.18.12"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_bb ltx_border_r" id="S4.T1.6.6.6.6.18.12.1">ProFSA</span> <span class="ltx_td ltx_align_center ltx_border_bb" id="S4.T1.6.6.6.6.18.12.2">1.546</span> <span class="ltx_td ltx_align_center ltx_border_bb" 
id="S4.T1.6.6.6.6.18.12.3">1.334</span> <span class="ltx_td ltx_align_center ltx_border_bb" id="S4.T1.6.6.6.6.18.12.4">-0.172</span> <span class="ltx_td ltx_align_center ltx_border_bb" id="S4.T1.6.6.6.6.18.12.5">0.057</span> <span class="ltx_td ltx_align_center ltx_border_bb" id="S4.T1.6.6.6.6.18.12.6">-0.205</span> <span class="ltx_td ltx_align_center ltx_border_bb" id="S4.T1.6.6.6.6.18.12.7">0.029</span></span> </span> </span></span></p> </span></div> </figure> <div class="ltx_para ltx_noindent" id="S4.p1"> <p class="ltx_p" id="S4.p1.1">We conducted experiments using several baseline models to analyze our SIU dataset. The models tested include a voxel-grid-based 3D-CNN model, a Graph Neural Network (GNN) model, and pretrained models such as Uni-Mol <cite class="ltx_cite ltx_citemacro_citep">(Zhou et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib67" title="">2022</a>)</cite> and ProFSA <cite class="ltx_cite ltx_citemacro_citep">(Gao et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib21" title="">2023</a>)</cite>. Our experiments were performed in both Multi-Task Learning (MTL) and single-target settings. In the MTL setting, all data were combined to train a single MTL model. In the single-target setting, the Uni-Mol model was trained separately on individual labels.</p> </div> <div class="ltx_para ltx_noindent" id="S4.p2"> <p class="ltx_p" id="S4.p2.2">The metrics used in our analysis include Root Mean Square Error (RMSE), Mean Absolute Error (MAE), general Pearson and Spearman correlations, and the same correlations computed after grouping by PDB IDs. The general Pearson and Spearman correlations are computed over all protein pocket-molecule pairs pooled together. The grouped correlations are computed separately for the different molecules within each protein pocket. 
We use Pearson<sup class="ltx_sup" id="S4.p2.2.1">∗</sup> to represent Pearson correlation grouped by PDB IDs, and Spearman<sup class="ltx_sup" id="S4.p2.2.2">∗</sup> to represent Spearman correlation grouped by PDB IDs.</p> </div> <div class="ltx_para ltx_noindent" id="S4.p3"> <p class="ltx_p" id="S4.p3.1">Results for multi-task learning are shown in Table <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S4.T1" title="Table 1 ‣ 4 SIU experiments and analysis ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_tag">1</span></a>, and the results for single-task learning are shown in Table <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S4.T2" title="Table 2 ‣ 4 SIU experiments and analysis ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_tag">2</span></a>.</p> </div> <figure class="ltx_table" id="S4.T2"> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 2: </span>Results for single-task training with different label types. 
We show results for the Uni-Mol model trained on the PDBbind dataset and on versions 0.6 and 0.9 of our SIU dataset.</figcaption> <div class="ltx_inline-block ltx_align_center ltx_transformed_outer" id="S4.T2.6" style="width:381.7pt;height:230pt;vertical-align:-0.0pt;"><span class="ltx_transformed_inner" style="transform:translate(0.0pt,0.0pt) scale(1,1) ;"> <p class="ltx_p" id="S4.T2.6.6"><span class="ltx_text" id="S4.T2.6.6.6"> <span class="ltx_tabular ltx_guessed_headers ltx_align_middle" id="S4.T2.6.6.6.6"> <span class="ltx_thead"> <span class="ltx_tr" id="S4.T2.2.2.2.2.2"> <span class="ltx_td ltx_th ltx_th_column ltx_th_row ltx_border_tt" id="S4.T2.2.2.2.2.2.3"></span> <span class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_th_row ltx_border_r ltx_border_tt" id="S4.T2.2.2.2.2.2.4">Train Set</span> <span class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S4.T2.2.2.2.2.2.5">RMSE</span> <span class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S4.T2.2.2.2.2.2.6">MAE</span> <span class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S4.T2.2.2.2.2.2.7">Pearson</span> <span class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S4.T2.1.1.1.1.1.1">Pearson<sup class="ltx_sup" id="S4.T2.1.1.1.1.1.1.1">∗</sup></span> <span class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S4.T2.2.2.2.2.2.8">Spearman</span> <span class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="S4.T2.2.2.2.2.2.2">Spearman<sup class="ltx_sup" id="S4.T2.2.2.2.2.2.2.1">∗</sup></span></span> </span> <span class="ltx_tbody"> <span class="ltx_tr" id="S4.T2.3.3.3.3.3"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t ltx_rowspan ltx_rowspan_3" id="S4.T2.3.3.3.3.3.1"><span class="ltx_text" id="S4.T2.3.3.3.3.3.1.1"> <span class="ltx_tabular ltx_align_middle" id="S4.T2.3.3.3.3.3.1.1.1"> <span class="ltx_tr" id="S4.T2.3.3.3.3.3.1.1.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" 
id="S4.T2.3.3.3.3.3.1.1.1.1.1"><math alttext="IC_{50}" class="ltx_Math" display="inline" id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1"><semantics id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1a"><mrow id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.cmml"><mi id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.2" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.2.cmml">I</mi><mo id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.1" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.1.cmml">⁢</mo><msub id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.cmml"><mi id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.2" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.2.cmml">C</mi><mn id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.3" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1b"><apply id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.cmml" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1"><times id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.1.cmml" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.1"></times><ci id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.2.cmml" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.2">𝐼</ci><apply id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.cmml" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3"><csymbol cd="ambiguous" id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.1.cmml" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3">subscript</csymbol><ci id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.2.cmml" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.2">𝐶</ci><cn id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.3.cmml" type="integer" xref="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1c">IC_{50}</annotation><annotation encoding="application/x-llamapun" id="S4.T2.3.3.3.3.3.1.1.1.1.1.m1.1d">italic_I italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math></span></span> </span></span></span> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r 
ltx_border_t" id="S4.T2.3.3.3.3.3.2">PDBbind</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.3.3.3.3.3.3">1.575</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.3.3.3.3.3.4">1.279</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.3.3.3.3.3.5">0.430</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.3.3.3.3.3.6">0.245</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.3.3.3.3.3.7">0.425</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.3.3.3.3.3.8">0.229</span></span> <span class="ltx_tr" id="S4.T2.6.6.6.6.7.1"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.T2.6.6.6.6.7.1.1">SIU 0.6</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.7.1.2">1.407</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.7.1.3">1.138</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.7.1.4">0.461</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.7.1.5">0.317</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.7.1.6">0.463</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.7.1.7">0.311</span></span> <span class="ltx_tr" id="S4.T2.6.6.6.6.8.2"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.T2.6.6.6.6.8.2.1">SIU 0.9</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.8.2.2">1.357</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.8.2.3">1.099</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.8.2.4">0.470</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.8.2.5">0.345</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.8.2.6">0.474</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.8.2.7">0.347</span></span> <span class="ltx_tr" id="S4.T2.4.4.4.4.4"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t ltx_rowspan ltx_rowspan_2" id="S4.T2.4.4.4.4.4.1"><span class="ltx_text" 
id="S4.T2.4.4.4.4.4.1.1"> <span class="ltx_tabular ltx_align_middle" id="S4.T2.4.4.4.4.4.1.1.1"> <span class="ltx_tr" id="S4.T2.4.4.4.4.4.1.1.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="S4.T2.4.4.4.4.4.1.1.1.1.1"><math alttext="EC_{50}" class="ltx_Math" display="inline" id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1"><semantics id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1a"><mrow id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.cmml"><mi id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.2" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.2.cmml">E</mi><mo id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.1" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.1.cmml">⁢</mo><msub id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.cmml"><mi id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.2" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.2.cmml">C</mi><mn id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.3" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1b"><apply id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.cmml" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1"><times id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.1.cmml" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.1"></times><ci id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.2.cmml" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.2">𝐸</ci><apply id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.cmml" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3"><csymbol cd="ambiguous" id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.1.cmml" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3">subscript</csymbol><ci id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.2.cmml" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.2">𝐶</ci><cn id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.3.cmml" type="integer" xref="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1c">EC_{50}</annotation><annotation encoding="application/x-llamapun" id="S4.T2.4.4.4.4.4.1.1.1.1.1.m1.1d">italic_E 
italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math></span></span> </span></span></span> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t" id="S4.T2.4.4.4.4.4.2">SIU 0.6</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.4.4.4.4.4.3">1.400</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.4.4.4.4.4.4">1.163</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.4.4.4.4.4.5">0.280</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.4.4.4.4.4.6">0.171</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.4.4.4.4.4.7">0.284</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.4.4.4.4.4.8">0.150</span></span> <span class="ltx_tr" id="S4.T2.6.6.6.6.9.3"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.T2.6.6.6.6.9.3.1">SIU 0.9</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.9.3.2">1.340</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.9.3.3">1.096</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.9.3.4">0.384</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.9.3.5">0.196</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.9.3.6">0.379</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.9.3.7">0.142</span></span> <span class="ltx_tr" id="S4.T2.5.5.5.5.5"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_t ltx_rowspan ltx_rowspan_3" id="S4.T2.5.5.5.5.5.1"><span class="ltx_text" id="S4.T2.5.5.5.5.5.1.1"> <span class="ltx_tabular ltx_align_middle" id="S4.T2.5.5.5.5.5.1.1.1"> <span class="ltx_tr" id="S4.T2.5.5.5.5.5.1.1.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="S4.T2.5.5.5.5.5.1.1.1.1.1"><math alttext="K_{i}" class="ltx_Math" display="inline" id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1"><semantics id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1a"><msub id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1" 
xref="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.cmml"><mi id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.2" xref="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.2.cmml">K</mi><mi id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.3" xref="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1b"><apply id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.cmml" xref="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1"><csymbol cd="ambiguous" id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.1.cmml" xref="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1">subscript</csymbol><ci id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.2.cmml" xref="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.2">𝐾</ci><ci id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.3.cmml" xref="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1c">K_{i}</annotation><annotation encoding="application/x-llamapun" id="S4.T2.5.5.5.5.5.1.1.1.1.1.m1.1d">italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math></span></span> </span></span></span> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t" id="S4.T2.5.5.5.5.5.2">PDBbind</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.5.5.5.5.5.3">1.315</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.5.5.5.5.5.4">1.085</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.5.5.5.5.5.5">0.368</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.5.5.5.5.5.6">0.040</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.5.5.5.5.5.7">0.323</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.5.5.5.5.5.8">0.026</span></span> <span class="ltx_tr" id="S4.T2.6.6.6.6.10.4"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.T2.6.6.6.6.10.4.1">SIU 0.6</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.10.4.2">1.255</span> <span class="ltx_td ltx_align_center" 
id="S4.T2.6.6.6.6.10.4.3">1.034</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.10.4.4">0.472</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.10.4.5">0.106</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.10.4.6">0.452</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.10.4.7">0.112</span></span> <span class="ltx_tr" id="S4.T2.6.6.6.6.11.5"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.T2.6.6.6.6.11.5.1">SIU 0.9</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.11.5.2">1.235</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.11.5.3">1.017</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.11.5.4">0.485</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.11.5.5">0.036</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.11.5.6">0.452</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.11.5.7">0.041</span></span> <span class="ltx_tr" id="S4.T2.6.6.6.6.6"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_bb ltx_border_t ltx_rowspan ltx_rowspan_3" id="S4.T2.6.6.6.6.6.1"><span class="ltx_text" id="S4.T2.6.6.6.6.6.1.1"> <span class="ltx_tabular ltx_align_middle" id="S4.T2.6.6.6.6.6.1.1.1"> <span class="ltx_tr" id="S4.T2.6.6.6.6.6.1.1.1.1"> <span class="ltx_td ltx_nopad_r ltx_align_center" id="S4.T2.6.6.6.6.6.1.1.1.1.1"><math alttext="K_{d}" class="ltx_Math" display="inline" id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1"><semantics id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1a"><msub id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1" xref="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.cmml"><mi id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.2" xref="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.2.cmml">K</mi><mi id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.3" xref="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1b"><apply id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.cmml" 
xref="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1"><csymbol cd="ambiguous" id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.1.cmml" xref="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1">subscript</csymbol><ci id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.2.cmml" xref="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.2">𝐾</ci><ci id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.3.cmml" xref="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="S4.T2.6.6.6.6.6.1.1.1.1.1.m1.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math></span></span> </span></span></span> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t" id="S4.T2.6.6.6.6.6.2">PDBbind</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.6.6.6.6.6.3">1.565</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.6.6.6.6.6.4">1.308</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.6.6.6.6.6.5">0.041</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.6.6.6.6.6.6">0.010</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.6.6.6.6.6.7">0.004</span> <span class="ltx_td ltx_align_center ltx_border_t" id="S4.T2.6.6.6.6.6.8">0.006</span></span> <span class="ltx_tr" id="S4.T2.6.6.6.6.12.6"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="S4.T2.6.6.6.6.12.6.1">SIU 0.6</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.12.6.2">1.389</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.12.6.3">1.192</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.12.6.4">-0.149</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.12.6.5">0.052</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.12.6.6">-0.206</span> <span class="ltx_td ltx_align_center" id="S4.T2.6.6.6.6.12.6.7">0.022</span></span> <span class="ltx_tr" 
id="S4.T2.6.6.6.6.13.7"> <span class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_bb ltx_border_r" id="S4.T2.6.6.6.6.13.7.1">SIU 0.9</span> <span class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.6.6.6.6.13.7.2">1.364</span> <span class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.6.6.6.6.13.7.3">1.141</span> <span class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.6.6.6.6.13.7.4">-0.033</span> <span class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.6.6.6.6.13.7.5">0.103</span> <span class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.6.6.6.6.13.7.6">-0.082</span> <span class="ltx_td ltx_align_center ltx_border_bb" id="S4.T2.6.6.6.6.13.7.7">0.065</span></span> </span> </span></span></p> </span></div> </figure> <figure class="ltx_figure" id="S4.F4"> <div class="ltx_flex_figure"> <div class="ltx_flex_cell ltx_flex_size_2"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S4.F4.sf1"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="367" id="S4.F4.sf1.g1" src="extracted/5664306/images/correlation_agg.png" width="598"/> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure">(a) </span></figcaption> </figure> </div> <div class="ltx_flex_cell ltx_flex_size_2"> <figure class="ltx_figure ltx_figure_panel ltx_align_center" id="S4.F4.sf2"><img alt="Refer to caption" class="ltx_graphics ltx_img_landscape" height="366" id="S4.F4.sf2.g1" src="extracted/5664306/images/correlation_datasets.png" width="598"/> <figcaption class="ltx_caption"><span class="ltx_tag ltx_tag_figure">(b) </span></figcaption> </figure> </div> </div> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 4: </span>(a) Pearson and Spearman correlations for various label types, calculated both before and after grouping by PDB IDs. 
(b) Pearson correlations after grouping PDB IDs for different assay types trained on different datasets.</figcaption> </figure> <section class="ltx_subsection" id="S4.SS1"> <h3 class="ltx_title ltx_title_subsection"> <span class="ltx_tag ltx_tag_subsection">4.1 </span>Analysis</h3> <section class="ltx_paragraph" id="S4.SS1.SSS0.Px1"> <h5 class="ltx_title ltx_title_paragraph">Different difficulties of assay types.</h5> <div class="ltx_para ltx_noindent" id="S4.SS1.SSS0.Px1.p1"> <p class="ltx_p" id="S4.SS1.SSS0.Px1.p1.11">The bioactivity prediction difficulty varies among different assay types. The <math alttext="K_{d}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.1.m1.1"><semantics id="S4.SS1.SSS0.Px1.p1.1.m1.1a"><msub id="S4.SS1.SSS0.Px1.p1.1.m1.1.1" xref="S4.SS1.SSS0.Px1.p1.1.m1.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.1.m1.1.1.2" xref="S4.SS1.SSS0.Px1.p1.1.m1.1.1.2.cmml">K</mi><mi id="S4.SS1.SSS0.Px1.p1.1.m1.1.1.3" xref="S4.SS1.SSS0.Px1.p1.1.m1.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.1.m1.1b"><apply id="S4.SS1.SSS0.Px1.p1.1.m1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.1.m1.1.1"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.1.m1.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.1.m1.1.1">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.1.m1.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.1.m1.1.1.2">𝐾</ci><ci id="S4.SS1.SSS0.Px1.p1.1.m1.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.1.m1.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.1.m1.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.1.m1.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math> task is the most challenging, primarily due to the varying correlations between different assay types, as shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.F3" title="Figure 3 ‣ Dataset split. 
‣ 3.1.3 Data construction for downstream tasks ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_tag">3</span></a>(C). Although the means of <math alttext="K_{i}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.2.m2.1"><semantics id="S4.SS1.SSS0.Px1.p1.2.m2.1a"><msub id="S4.SS1.SSS0.Px1.p1.2.m2.1.1" xref="S4.SS1.SSS0.Px1.p1.2.m2.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.2.m2.1.1.2" xref="S4.SS1.SSS0.Px1.p1.2.m2.1.1.2.cmml">K</mi><mi id="S4.SS1.SSS0.Px1.p1.2.m2.1.1.3" xref="S4.SS1.SSS0.Px1.p1.2.m2.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.2.m2.1b"><apply id="S4.SS1.SSS0.Px1.p1.2.m2.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.2.m2.1.1"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.2.m2.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.2.m2.1.1">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.2.m2.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.2.m2.1.1.2">𝐾</ci><ci id="S4.SS1.SSS0.Px1.p1.2.m2.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.2.m2.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.2.m2.1c">K_{i}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.2.m2.1d">italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="K_{d}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.3.m3.1"><semantics id="S4.SS1.SSS0.Px1.p1.3.m3.1a"><msub id="S4.SS1.SSS0.Px1.p1.3.m3.1.1" xref="S4.SS1.SSS0.Px1.p1.3.m3.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.3.m3.1.1.2" xref="S4.SS1.SSS0.Px1.p1.3.m3.1.1.2.cmml">K</mi><mi id="S4.SS1.SSS0.Px1.p1.3.m3.1.1.3" xref="S4.SS1.SSS0.Px1.p1.3.m3.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.3.m3.1b"><apply id="S4.SS1.SSS0.Px1.p1.3.m3.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.3.m3.1.1"><csymbol cd="ambiguous" 
id="S4.SS1.SSS0.Px1.p1.3.m3.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.3.m3.1.1">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.3.m3.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.3.m3.1.1.2">𝐾</ci><ci id="S4.SS1.SSS0.Px1.p1.3.m3.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.3.m3.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.3.m3.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.3.m3.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math> labels do not differ statistically, the correlation between these two data groups remains limited. The intrinsic differences in assay types of bioactivity arise from the principles of the wet-lab experiments used to measure them. Binding assays focus on the direct interaction between the small molecule and the protein target, providing insights into the strength and specificity of this binding through metrics like <math alttext="K_{i}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.4.m4.1"><semantics id="S4.SS1.SSS0.Px1.p1.4.m4.1a"><msub id="S4.SS1.SSS0.Px1.p1.4.m4.1.1" xref="S4.SS1.SSS0.Px1.p1.4.m4.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.4.m4.1.1.2" xref="S4.SS1.SSS0.Px1.p1.4.m4.1.1.2.cmml">K</mi><mi id="S4.SS1.SSS0.Px1.p1.4.m4.1.1.3" xref="S4.SS1.SSS0.Px1.p1.4.m4.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.4.m4.1b"><apply id="S4.SS1.SSS0.Px1.p1.4.m4.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.4.m4.1.1"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.4.m4.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.4.m4.1.1">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.4.m4.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.4.m4.1.1.2">𝐾</ci><ci id="S4.SS1.SSS0.Px1.p1.4.m4.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.4.m4.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.4.m4.1c">K_{i}</annotation><annotation encoding="application/x-llamapun" 
id="S4.SS1.SSS0.Px1.p1.4.m4.1d">italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="K_{d}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.5.m5.1"><semantics id="S4.SS1.SSS0.Px1.p1.5.m5.1a"><msub id="S4.SS1.SSS0.Px1.p1.5.m5.1.1" xref="S4.SS1.SSS0.Px1.p1.5.m5.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.5.m5.1.1.2" xref="S4.SS1.SSS0.Px1.p1.5.m5.1.1.2.cmml">K</mi><mi id="S4.SS1.SSS0.Px1.p1.5.m5.1.1.3" xref="S4.SS1.SSS0.Px1.p1.5.m5.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.5.m5.1b"><apply id="S4.SS1.SSS0.Px1.p1.5.m5.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.5.m5.1.1"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.5.m5.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.5.m5.1.1">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.5.m5.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.5.m5.1.1.2">𝐾</ci><ci id="S4.SS1.SSS0.Px1.p1.5.m5.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.5.m5.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.5.m5.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.5.m5.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math>, using techniques such as surface plasmon resonance (SPR) <cite class="ltx_cite ltx_citemacro_cite">Schasfoort (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib49" title="">2017</a>); Englebienne et al. (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib17" title="">2003</a>)</cite> and isothermal titration calorimetry (ITC) <cite class="ltx_cite ltx_citemacro_cite">Leavitt and Freire (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib34" title="">2001</a>)</cite>. 
In contrast, functional assays measure the biological response elicited by the small molecule on the target, capturing its effect on a biological system and often quantified by <math alttext="IC_{50}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.6.m6.1"><semantics id="S4.SS1.SSS0.Px1.p1.6.m6.1a"><mrow id="S4.SS1.SSS0.Px1.p1.6.m6.1.1" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.2" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.2.cmml">I</mi><mo id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.1" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.1.cmml">⁢</mo><msub id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.cmml"><mi id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.2" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.2.cmml">C</mi><mn id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.3" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.6.m6.1b"><apply id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1"><times id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.1"></times><ci id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.2">𝐼</ci><apply id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.1.cmml" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.2.cmml" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.2">𝐶</ci><cn id="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.3.cmml" type="integer" xref="S4.SS1.SSS0.Px1.p1.6.m6.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.6.m6.1c">IC_{50}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.6.m6.1d">italic_I italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="EC_{50}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.7.m7.1"><semantics 
id="S4.SS1.SSS0.Px1.p1.7.m7.1a"><mrow id="S4.SS1.SSS0.Px1.p1.7.m7.1.1" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.2" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.2.cmml">E</mi><mo id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.1" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.1.cmml">⁢</mo><msub id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.cmml"><mi id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.2" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.2.cmml">C</mi><mn id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.3" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.7.m7.1b"><apply id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1"><times id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.1"></times><ci id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.2">𝐸</ci><apply id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.1.cmml" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.2.cmml" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.2">𝐶</ci><cn id="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.3.cmml" type="integer" xref="S4.SS1.SSS0.Px1.p1.7.m7.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.7.m7.1c">EC_{50}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.7.m7.1d">italic_E italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math> by enzyme activity assays <cite class="ltx_cite ltx_citemacro_cite">Bisswanger (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib6" title="">2014</a>); Hall (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib26" title="">1996</a>)</cite> or other wet experiment techniques. 
<span class="ltx_text ltx_font_bold" id="S4.SS1.SSS0.Px1.p1.11.1">The inherent differences in what these assays measure mean that their values cannot be directly compared <cite class="ltx_cite ltx_citemacro_cite">Yung-Chi and Prusoff (<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib65" title="">1973</a>)</cite>.</span> Furthermore, even within the categories of binding and functional assays, metrics should not be used interchangeably, as <math alttext="K_{i}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.8.m8.1"><semantics id="S4.SS1.SSS0.Px1.p1.8.m8.1a"><msub id="S4.SS1.SSS0.Px1.p1.8.m8.1.1" xref="S4.SS1.SSS0.Px1.p1.8.m8.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.8.m8.1.1.2" xref="S4.SS1.SSS0.Px1.p1.8.m8.1.1.2.cmml">K</mi><mi id="S4.SS1.SSS0.Px1.p1.8.m8.1.1.3" xref="S4.SS1.SSS0.Px1.p1.8.m8.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.8.m8.1b"><apply id="S4.SS1.SSS0.Px1.p1.8.m8.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.8.m8.1.1"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.8.m8.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.8.m8.1.1">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.8.m8.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.8.m8.1.1.2">𝐾</ci><ci id="S4.SS1.SSS0.Px1.p1.8.m8.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.8.m8.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.8.m8.1c">K_{i}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.8.m8.1d">italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="K_{d}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.9.m9.1"><semantics id="S4.SS1.SSS0.Px1.p1.9.m9.1a"><msub id="S4.SS1.SSS0.Px1.p1.9.m9.1.1" xref="S4.SS1.SSS0.Px1.p1.9.m9.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.9.m9.1.1.2" xref="S4.SS1.SSS0.Px1.p1.9.m9.1.1.2.cmml">K</mi><mi id="S4.SS1.SSS0.Px1.p1.9.m9.1.1.3" xref="S4.SS1.SSS0.Px1.p1.9.m9.1.1.3.cmml">d</mi></msub><annotation-xml 
encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.9.m9.1b"><apply id="S4.SS1.SSS0.Px1.p1.9.m9.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.9.m9.1.1"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.9.m9.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.9.m9.1.1">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.9.m9.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.9.m9.1.1.2">𝐾</ci><ci id="S4.SS1.SSS0.Px1.p1.9.m9.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.9.m9.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.9.m9.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.9.m9.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math> describe different aspects of binding affinity, just as <math alttext="IC_{50}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.10.m10.1"><semantics id="S4.SS1.SSS0.Px1.p1.10.m10.1a"><mrow id="S4.SS1.SSS0.Px1.p1.10.m10.1.1" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.2" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.2.cmml">I</mi><mo id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.1" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.1.cmml">⁢</mo><msub id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.cmml"><mi id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.2" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.2.cmml">C</mi><mn id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.3" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.10.m10.1b"><apply id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1"><times id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.1"></times><ci id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.2">𝐼</ci><apply id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.1.cmml" 
xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.2.cmml" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.2">𝐶</ci><cn id="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.3.cmml" type="integer" xref="S4.SS1.SSS0.Px1.p1.10.m10.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.10.m10.1c">IC_{50}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.10.m10.1d">italic_I italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math> and <math alttext="EC_{50}" class="ltx_Math" display="inline" id="S4.SS1.SSS0.Px1.p1.11.m11.1"><semantics id="S4.SS1.SSS0.Px1.p1.11.m11.1a"><mrow id="S4.SS1.SSS0.Px1.p1.11.m11.1.1" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.cmml"><mi id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.2" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.2.cmml">E</mi><mo id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.1" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.1.cmml">⁢</mo><msub id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.cmml"><mi id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.2" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.2.cmml">C</mi><mn id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.3" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="S4.SS1.SSS0.Px1.p1.11.m11.1b"><apply id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1"><times id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.1.cmml" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.1"></times><ci id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.2.cmml" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.2">𝐸</ci><apply id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.cmml" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3"><csymbol cd="ambiguous" id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.1.cmml" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3">subscript</csymbol><ci id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.2.cmml" xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.2">𝐶</ci><cn id="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.3.cmml" type="integer" 
xref="S4.SS1.SSS0.Px1.p1.11.m11.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="S4.SS1.SSS0.Px1.p1.11.m11.1c">EC_{50}</annotation><annotation encoding="application/x-llamapun" id="S4.SS1.SSS0.Px1.p1.11.m11.1d">italic_E italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math> describe different aspects of biological response.</p> </div> </section> <section class="ltx_paragraph" id="S4.SS1.SSS0.Px2"> <h5 class="ltx_title ltx_title_paragraph">Influence of measuring correlation within the same PDB IDs.</h5> <div class="ltx_para ltx_noindent" id="S4.SS1.SSS0.Px2.p1"> <p class="ltx_p" id="S4.SS1.SSS0.Px2.p1.1">As demonstrated in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S4.F4.sf1" title="In Figure 4 ‣ 4 SIU experiments and analysis ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_tag">4(a)</span></a>, grouping data by PDB ID across all assay types results in a significant decline in both Pearson and Spearman correlations. This observation suggests that it is more challenging to achieve high correlation when comparing binding affinities of different molecules within the same pocket. This challenge primarily arises from the skewed distribution of binding affinities across various protein pockets, as shown in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S3.F3" title="Figure 3 ‣ Dataset split. ‣ 3.1.3 Data construction for downstream tasks ‣ 3.1 Methods ‣ 3 SIU dataset construction and overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_tag">3</span></a>(D). 
Furthermore, these findings highlight that <span class="ltx_text ltx_font_bold" id="S4.SS1.SSS0.Px2.p1.1.1">conventional approaches to measuring correlation without grouping by PDB ID may not effectively capture a model’s ability to differentiate between molecules targeting the same protein.</span> Such discriminatory capacity is crucial in drug discovery, emphasizing the importance of focusing on molecular interactions specific to each target rather than general correlations across diverse targets. This underscores the necessity of our dataset, which measures correlation within the same PDB IDs, providing a more relevant assessment of a deep learning model’s utility in drug discovery.</p> </div> </section> <section class="ltx_paragraph" id="S4.SS1.SSS0.Px3"> <h5 class="ltx_title ltx_title_paragraph">Effectiveness of training on a larger dataset.</h5> <div class="ltx_para ltx_noindent" id="S4.SS1.SSS0.Px3.p1"> <p class="ltx_p" id="S4.SS1.SSS0.Px3.p1.1">We compare models trained on the PDBbind 2020 dataset with those trained on SIU versions 0.6 and 0.9. Notably, the PDBbind 2020 dataset was used in its entirety, without implementing any filtering techniques to exclude pockets similar to those in the test set. As illustrated in Table <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S4.T2" title="Table 2 ‣ 4 SIU experiments and analysis ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_tag">2</span></a> and Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#S4.F4" title="Figure 4 ‣ 4 SIU experiments and analysis ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_tag">4</span></a>, models trained on the SIU datasets outperform those trained on PDBbind, despite the latter’s lack of homology removal. 
<span class="ltx_text ltx_font_bold" id="S4.SS1.SSS0.Px3.p1.1.1">This underscores the effectiveness of our large-scale dataset in enhancing model learning for binding affinity prediction.</span> The 0.9 version also outperforms the 0.6 version, indicating the influence of both homology removal and dataset scale.</p> </div> </section> </section> </section> <section class="ltx_section" id="S5"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">5 </span>Limitations and future work</h2> <section class="ltx_paragraph" id="S5.SS0.SSS0.Px1"> <h5 class="ltx_title ltx_title_paragraph">Limitations.</h5> <div class="ltx_para ltx_noindent" id="S5.SS0.SSS0.Px1.p1"> <p class="ltx_p" id="S5.SS0.SSS0.Px1.p1.1">Despite our rigorous methodology, the structural data we obtained are still predicted poses rather than experimentally validated interactions. The challenge of accurately modeling small molecule-protein interactions in physiological conditions remains substantial and highlights the need for continued advancements in this field.</p> </div> </section> <section class="ltx_paragraph" id="S5.SS0.SSS0.Px2"> <h5 class="ltx_title ltx_title_paragraph">Future work.</h5> <div class="ltx_para ltx_noindent" id="S5.SS0.SSS0.Px2.p1"> <p class="ltx_p" id="S5.SS0.SSS0.Px2.p1.1">We aim to provide larger and more reliable datasets for various drug discovery tasks. We are developing datasets for pairwise ranking, alongside organizing data for unbiased bioactivity prediction. We will ensure that comparisons are made only between docking poses derived from the same PDB and bioactivity data from identical assay types. Additionally, this work validates that our approach to automatically generating extensive high-quality data is feasible and scalable. 
Although in this work we deduplicated small molecules and pocket structures to address the computational demands of the molecular docking stage, future research could build upon these methods to construct even larger datasets efficiently, thereby advancing the understanding of small molecule-protein interactions in AIDD. </p> </div> </section> </section> <section class="ltx_section" id="S6"> <h2 class="ltx_title ltx_title_section"> <span class="ltx_tag ltx_tag_section">6 </span>Conclusion</h2> <div class="ltx_para ltx_noindent" id="S6.p1"> <p class="ltx_p" id="S6.p1.1">To meet the pressing demands in drug discovery and development, we introduced SIU, a large-scale, diverse, accurate, and well-curated dataset for unbiased bioactivity prediction. SIU was constructed using meticulously designed and robust pipelines, ensuring its exceptional quality and comprehensiveness. Our experimental results validate the large-scale nature of SIU as a superior training dataset that enhances model performance and provides a reliable framework for unbiased bioactivity prediction tasks, facilitating more meaningful model evaluations. We anticipate that the full potential of SIU remains to be uncovered, and we hope that its introduction will catalyze significant advancements in the field of drug discovery and development.</p> </div> </section> <section class="ltx_bibliography" id="bib"> <h2 class="ltx_title ltx_title_bibliography">References</h2> <ul class="ltx_biblist"> <li class="ltx_bibitem" id="bib.bib1"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">uni (2017)</span> <span class="ltx_bibblock"> Uniprot: the universal protein knowledgebase. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib1.1.1">Nucleic acids research</em>, 45(D1):D158–D169, 2017. </span> </li> <li class="ltx_bibitem" id="bib.bib2"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Altucci et al. 
(2007)</span> <span class="ltx_bibblock"> Lucia Altucci, Mark D Leibowitz, Kathleen M Ogilvie, Angel R de Lera, and Hinrich Gronemeyer. </span> <span class="ltx_bibblock">Rar and rxr modulation in cancer and metabolic disease. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib2.1.1">Nature reviews Drug discovery</em>, 6(10):793–810, 2007. </span> </li> <li class="ltx_bibitem" id="bib.bib3"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ashcroft (1999)</span> <span class="ltx_bibblock"> Frances M Ashcroft. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib3.1.1">Ion channels and disease</em>. </span> <span class="ltx_bibblock">Academic press, 1999. </span> </li> <li class="ltx_bibitem" id="bib.bib4"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Attwood et al. (2021)</span> <span class="ltx_bibblock"> Misty M Attwood, Doriano Fabbro, Aleksandr V Sokolov, Stefan Knapp, and Helgi B Schiöth. </span> <span class="ltx_bibblock">Trends in kinase drug discovery: targets, indications and inhibitor design. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib4.1.1">Nature Reviews Drug Discovery</em>, 20(11):839–861, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib5"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Berman et al. (2000)</span> <span class="ltx_bibblock"> Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov, and Philip E Bourne. </span> <span class="ltx_bibblock">The protein data bank. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib5.1.1">Nucleic acids research</em>, 28(1):235–242, 2000. </span> </li> <li class="ltx_bibitem" id="bib.bib6"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Bisswanger (2014)</span> <span class="ltx_bibblock"> Hans Bisswanger. </span> <span class="ltx_bibblock">Enzyme assays. 
</span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib6.1.1">Perspectives in Science</em>, 1(1-6):41–55, 2014. </span> </li> <li class="ltx_bibitem" id="bib.bib7"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Block et al. (2006)</span> <span class="ltx_bibblock"> Peter Block, Christoph A Sotriffer, Ingo Dramburg, and Gerhard Klebe. </span> <span class="ltx_bibblock">Affindb: a freely accessible database of affinities for protein–ligand complexes from the pdb. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib7.1.1">Nucleic acids research</em>, 34(suppl_1):D522–D526, 2006. </span> </li> <li class="ltx_bibitem" id="bib.bib8"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Bureik et al. (2002)</span> <span class="ltx_bibblock"> Matthias Bureik, Michael Lisurek, and Rita Bernhardt. </span> <span class="ltx_bibblock">The human steroid hydroxylases cyp11b1 and cyp11b2. </span> <span class="ltx_bibblock">2002. </span> </li> <li class="ltx_bibitem" id="bib.bib9"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Chen et al. (2001)</span> <span class="ltx_bibblock"> Xi Chen, Ming Liu, and Michael K Gilson. </span> <span class="ltx_bibblock">Bindingdb: a web-accessible molecular recognition database. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib9.1.1">Combinatorial chemistry &amp; high throughput screening</em>, 4(8):719–725, 2001. </span> </li> <li class="ltx_bibitem" id="bib.bib10"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Cohen et al. (2021)</span> <span class="ltx_bibblock"> Philip Cohen, Darren Cross, and Pasi A Jänne. </span> <span class="ltx_bibblock">Kinase drug discovery 20 years after imatinib: progress and future directions. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib10.1.1">Nature reviews drug discovery</em>, 20(7):551–569, 2021. 
</span> </li> <li class="ltx_bibitem" id="bib.bib11"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Consortium (2015)</span> <span class="ltx_bibblock"> UniProt Consortium. </span> <span class="ltx_bibblock">Uniprot: a hub for protein information. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib11.1.1">Nucleic acids research</em>, 43(D1):D204–D212, 2015. </span> </li> <li class="ltx_bibitem" id="bib.bib12"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Cramer (2019)</span> <span class="ltx_bibblock"> Patrick Cramer. </span> <span class="ltx_bibblock">Organization and regulation of gene transcription. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib12.1.1">Nature</em>, 573(7772):45–54, 2019. </span> </li> <li class="ltx_bibitem" id="bib.bib13"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Danielson (2002)</span> <span class="ltx_bibblock"> PB Danielson. </span> <span class="ltx_bibblock">The cytochrome p450 superfamily: biochemistry, evolution and drug metabolism in humans. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib13.1.1">Current drug metabolism</em>, 3(6):561–597, 2002. </span> </li> <li class="ltx_bibitem" id="bib.bib14"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Davis et al. (2011)</span> <span class="ltx_bibblock"> Mindy I Davis, Jeremy P Hunt, Sanna Herrgard, Pietro Ciceri, Lisa M Wodicka, Gabriel Pallares, Michael Hocker, Daniel K Treiber, and Patrick P Zarrinkar. </span> <span class="ltx_bibblock">Comprehensive analysis of kinase inhibitor selectivity. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib14.1.1">Nature biotechnology</em>, 29(11):1046–1051, 2011. </span> </li> <li class="ltx_bibitem" id="bib.bib15"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Denisov et al. 
(2005)</span> <span class="ltx_bibblock"> Ilia G Denisov, Thomas M Makris, Stephen G Sligar, and Ilme Schlichting. </span> <span class="ltx_bibblock">Structure and chemistry of cytochrome p450. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib15.1.1">Chemical reviews</em>, 105(6):2253–2278, 2005. </span> </li> <li class="ltx_bibitem" id="bib.bib16"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Ekins et al. (2017)</span> <span class="ltx_bibblock"> S Ekins, AM Clark, C Southan, BA Bunin, and AJ Williams. </span> <span class="ltx_bibblock">Chapter 16. small-molecule bioactivity databases. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib16.1.1">High Throughput Screening Methods</em>, pages 344–371, 2017. </span> </li> <li class="ltx_bibitem" id="bib.bib17"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Englebienne et al. (2003)</span> <span class="ltx_bibblock"> Patrick Englebienne, Anne Van Hoonacker, and Michel Verhas. </span> <span class="ltx_bibblock">Surface plasmon resonance: principles, methods and applications in biomedical sciences. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib17.1.1">Spectroscopy</em>, 17(2-3):255–273, 2003. </span> </li> <li class="ltx_bibitem" id="bib.bib18"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Eschenmoser (1995)</span> <span class="ltx_bibblock"> Albert Eschenmoser. </span> <span class="ltx_bibblock">One hundred years lock-and-key principle. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib18.1.1">Angewandte Chemie International Edition in English</em>, 33(23-24):2363–2363, 1995. </span> </li> <li class="ltx_bibitem" id="bib.bib19"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Feinberg (2008)</span> <span class="ltx_bibblock"> Andrew P Feinberg. </span> <span class="ltx_bibblock">Epigenetics at the epicenter of modern medicine. 
</span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib19.1.1">Jama</em>, 299(11):1345–1350, 2008. </span> </li> <li class="ltx_bibitem" id="bib.bib20"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Friesner et al. (2004)</span> <span class="ltx_bibblock"> Richard A Friesner, Jay L Banks, Robert B Murphy, Thomas A Halgren, Jasna J Klicic, Daniel T Mainz, Matthew P Repasky, Eric H Knoll, Mee Shelley, Jason K Perry, et al. </span> <span class="ltx_bibblock">Glide: a new approach for rapid, accurate docking and scoring. 1. method and assessment of docking accuracy. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib20.1.1">Journal of medicinal chemistry</em>, 47(7):1739–1749, 2004. </span> </li> <li class="ltx_bibitem" id="bib.bib21"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Gao et al. (2023)</span> <span class="ltx_bibblock"> Bowen Gao, Yinjun Jia, YuanLe Mo, Yuyan Ni, Wei-Ying Ma, Zhi-Ming Ma, and Yanyan Lan. </span> <span class="ltx_bibblock">Self-supervised pocket pretraining via protein fragment-surroundings alignment. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib21.1.1">The Twelfth International Conference on Learning Representations</em>, 2023. </span> </li> <li class="ltx_bibitem" id="bib.bib22"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Gaulton and Overington (2010)</span> <span class="ltx_bibblock"> Anna Gaulton and John P Overington. </span> <span class="ltx_bibblock">Role of open chemical data in aiding drug discovery and design. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib22.1.1">Future Medicinal Chemistry</em>, 2(6):903–907, 2010. </span> </li> <li class="ltx_bibitem" id="bib.bib23"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Gaulton et al. 
(2012)</span> <span class="ltx_bibblock"> Anna Gaulton, Louisa J Bellis, A Patricia Bento, Jon Chambers, Mark Davies, Anne Hersey, Yvonne Light, Shaun McGlinchey, David Michalovich, Bissan Al-Lazikani, et al. </span> <span class="ltx_bibblock">Chembl: a large-scale bioactivity database for drug discovery. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib23.1.1">Nucleic acids research</em>, 40(D1):D1100–D1107, 2012. </span> </li> <li class="ltx_bibitem" id="bib.bib24"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Gibney and Nolan (2010)</span> <span class="ltx_bibblock"> ER Gibney and CM Nolan. </span> <span class="ltx_bibblock">Epigenetics and gene expression. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib24.1.1">Heredity</em>, 105(1):4–13, 2010. </span> </li> <li class="ltx_bibitem" id="bib.bib25"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Gilson et al. (2016)</span> <span class="ltx_bibblock"> Michael K Gilson, Tiqing Liu, Michael Baitaluk, George Nicola, Linda Hwang, and Jenny Chong. </span> <span class="ltx_bibblock">Bindingdb in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib25.1.1">Nucleic acids research</em>, 44(D1):D1045–D1053, 2016. </span> </li> <li class="ltx_bibitem" id="bib.bib26"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Hall (1996)</span> <span class="ltx_bibblock"> George M Hall. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib26.1.1">Methods of testing protein functionality</em>. </span> <span class="ltx_bibblock">Springer Science &amp; Business Media, 1996. </span> </li> <li class="ltx_bibitem" id="bib.bib27"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Hauser et al. 
(2017)</span> <span class="ltx_bibblock"> Alexander S Hauser, Misty M Attwood, Mathias Rask-Andersen, Helgi B Schiöth, and David E Gloriam. </span> <span class="ltx_bibblock">Trends in gpcr drug discovery: new agents, targets and indications. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib27.1.1">Nature reviews Drug discovery</em>, 16(12):829–842, 2017. </span> </li> <li class="ltx_bibitem" id="bib.bib28"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Heller et al. (2015)</span> <span class="ltx_bibblock"> Stephen R Heller, Alan McNaught, Igor Pletnev, Stephen Stein, and Dmitrii Tchekhovskoi. </span> <span class="ltx_bibblock">Inchi, the iupac international chemical identifier. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib28.1.1">Journal of cheminformatics</em>, 7:1–34, 2015. </span> </li> <li class="ltx_bibitem" id="bib.bib29"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Hu et al. (2005)</span> <span class="ltx_bibblock"> Liegi Hu, Mark L Benson, Richard D Smith, Michael G Lerner, and Heather A Carlson. </span> <span class="ltx_bibblock">Binding moad (mother of all databases). </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib29.1.1">Proteins: Structure, Function, and Bioinformatics</em>, 60(3):333–340, 2005. </span> </li> <li class="ltx_bibitem" id="bib.bib30"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Kim et al. (2016)</span> <span class="ltx_bibblock"> Sunghwan Kim, Paul A Thiessen, Evan E Bolton, Jie Chen, Gang Fu, Asta Gindulyte, Lianyi Han, Jane He, Siqian He, Benjamin A Shoemaker, et al. </span> <span class="ltx_bibblock">Pubchem substance and compound databases. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib30.1.1">Nucleic acids research</em>, 44(D1):D1202–D1213, 2016. 
</span> </li> <li class="ltx_bibitem" id="bib.bib31"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Koshland Jr (1995)</span> <span class="ltx_bibblock"> Daniel E Koshland Jr. </span> <span class="ltx_bibblock">The key–lock theory and the induced fit theory. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib31.1.1">Angewandte Chemie International Edition in English</em>, 33(23-24):2375–2378, 1995. </span> </li> <li class="ltx_bibitem" id="bib.bib32"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Lambert et al. (2018)</span> <span class="ltx_bibblock"> Samuel A Lambert, Arttu Jolma, Laura F Campitelli, Pratyush K Das, Yimeng Yin, Mihai Albu, Xiaoting Chen, Jussi Taipale, Timothy R Hughes, and Matthew T Weirauch. </span> <span class="ltx_bibblock">The human transcription factors. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib32.1.1">Cell</em>, 172(4):650–665, 2018. </span> </li> <li class="ltx_bibitem" id="bib.bib33"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Law et al. (2014)</span> <span class="ltx_bibblock"> Vivian Law, Craig Knox, Yannick Djoumbou, Tim Jewison, An Chi Guo, Yifeng Liu, Adam Maciejewski, David Arndt, Michael Wilson, Vanessa Neveu, et al. </span> <span class="ltx_bibblock">Drugbank 4.0: shedding new light on drug metabolism. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib33.1.1">Nucleic acids research</em>, 42(D1):D1091–D1097, 2014. </span> </li> <li class="ltx_bibitem" id="bib.bib34"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Leavitt and Freire (2001)</span> <span class="ltx_bibblock"> Stephanie Leavitt and Ernesto Freire. </span> <span class="ltx_bibblock">Direct measurement of protein binding energetics by isothermal titration calorimetry. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib34.1.1">Current opinion in structural biology</em>, 11(5):560–566, 2001. 
</span> </li> <li class="ltx_bibitem" id="bib.bib35"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Li et al. (2024)</span> <span class="ltx_bibblock"> Xuelian Li, Cheng Shen, Hui Zhu, Yujian Yang, Qing Wang, Jincai Yang, and Niu Huang. </span> <span class="ltx_bibblock">A high-quality data set of protein–ligand binding interactions via comparative complex structure modeling. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib35.1.1">Journal of Chemical Information and Modeling</em>, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib36"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Lineweaver and Burk (1934)</span> <span class="ltx_bibblock"> Hans Lineweaver and Dean Burk. </span> <span class="ltx_bibblock">The determination of enzyme dissociation constants. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib36.1.1">Journal of the American chemical society</em>, 56(3):658–666, 1934. </span> </li> <li class="ltx_bibitem" id="bib.bib37"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Liu et al. (2007)</span> <span class="ltx_bibblock"> Tiqing Liu, Yuhmei Lin, Xin Wen, Robert N Jorissen, and Michael K Gilson. </span> <span class="ltx_bibblock">Bindingdb: a web-accessible database of experimentally determined protein–ligand binding affinities. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib37.1.1">Nucleic acids research</em>, 35(suppl_1):D198–D201, 2007. </span> </li> <li class="ltx_bibitem" id="bib.bib38"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Mendez et al. (2019)</span> <span class="ltx_bibblock"> David Mendez, Anna Gaulton, A Patrícia Bento, Jon Chambers, Marleen De Veij, Eloy Félix, María Paula Magariños, Juan F Mosquera, Prudence Mutowo, Michał Nowotka, et al. </span> <span class="ltx_bibblock">Chembl: towards direct deposition of bioassay data. 
</span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib38.1.1">Nucleic acids research</em>, 47(D1):D930–D940, 2019. </span> </li> <li class="ltx_bibitem" id="bib.bib39"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Mori and Mishina (1995)</span> <span class="ltx_bibblock"> H Mori and M Mishina. </span> <span class="ltx_bibblock">Structure and function of the nmda receptor channel. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib39.1.1">Neuropharmacology</em>, 34(10):1219–1237, 1995. </span> </li> <li class="ltx_bibitem" id="bib.bib40"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Murakami et al. (2016)</span> <span class="ltx_bibblock"> Yoichi Murakami, Satoshi Omori, and Kengo Kinoshita. </span> <span class="ltx_bibblock">Nldb: a database for 3d protein–ligand interactions in enzymatic reactions. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib40.1.1">Journal of structural and functional genomics</em>, 17:101–110, 2016. </span> </li> <li class="ltx_bibitem" id="bib.bib41"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Naderi et al. (2018)</span> <span class="ltx_bibblock"> Misagh Naderi, Rajiv Gandhi Govindaraj, and Michal Brylinski. </span> <span class="ltx_bibblock">e model-bdb: a database of comparative structure models of drug-target interactions from the binding database. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib41.1.1">Gigascience</em>, 7(8):giy091, 2018. </span> </li> <li class="ltx_bibitem" id="bib.bib42"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Öztürk et al. (2018)</span> <span class="ltx_bibblock"> Hakime Öztürk, Arzucan Özgür, and Elif Ozkirimli. </span> <span class="ltx_bibblock">Deepdta: deep drug–target binding affinity prediction. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib42.1.1">Bioinformatics</em>, 34(17):i821–i829, 2018. 
</span> </li> <li class="ltx_bibitem" id="bib.bib43"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Pawson et al. (2014)</span> <span class="ltx_bibblock"> Adam J Pawson, Joanna L Sharman, Helen E Benson, Elena Faccenda, Stephen PH Alexander, O Peter Buneman, Anthony P Davenport, John C McGrath, John A Peters, Christopher Southan, et al. </span> <span class="ltx_bibblock">The iuphar/bps guide to pharmacology: an expert-driven knowledgebase of drug targets and their ligands. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib43.1.1">Nucleic acids research</em>, 42(D1):D1098–D1106, 2014. </span> </li> <li class="ltx_bibitem" id="bib.bib44"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Qu and Tang (2010)</span> <span class="ltx_bibblock"> Liyan Qu and Xiuwen Tang. </span> <span class="ltx_bibblock">Bexarotene: a promising anticancer agent. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib44.1.1">Cancer chemotherapy and pharmacology</em>, 65:201–205, 2010. </span> </li> <li class="ltx_bibitem" id="bib.bib45"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Reisberg et al. (2003)</span> <span class="ltx_bibblock"> Barry Reisberg, Rachelle Doody, Albrecht Stöffler, Frederick Schmitt, Steven Ferris, and Hans Jörg Möbius. </span> <span class="ltx_bibblock">Memantine in moderate-to-severe alzheimer’s disease. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib45.1.1">New England Journal of Medicine</em>, 348(14):1333–1341, 2003. </span> </li> <li class="ltx_bibitem" id="bib.bib46"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Robinson-Rechavi et al. (2003)</span> <span class="ltx_bibblock"> Marc Robinson-Rechavi, Hector Escriva Garcia, and Vincent Laudet. </span> <span class="ltx_bibblock">The nuclear receptor superfamily. 
</span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib46.1.1">Journal of cell science</em>, 116(4):585–586, 2003. </span> </li> <li class="ltx_bibitem" id="bib.bib47"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Rogers and Hahn (2010)</span> <span class="ltx_bibblock"> David Rogers and Mathew Hahn. </span> <span class="ltx_bibblock">Extended-connectivity fingerprints. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib47.1.1">Journal of chemical information and modeling</em>, 50(5):742–754, 2010. </span> </li> <li class="ltx_bibitem" id="bib.bib48"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Sankar et al. (2022)</span> <span class="ltx_bibblock"> Santhosh Sankar, Naren Chandran Sakthivel, and Nagasuma Chandra. </span> <span class="ltx_bibblock">Fast local alignment of protein pockets (flapp): a system-compiled program for large-scale binding site alignment. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib48.1.1">Journal of Chemical Information and Modeling</em>, 62(19):4810–4819, 2022. </span> </li> <li class="ltx_bibitem" id="bib.bib49"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Schasfoort (2017)</span> <span class="ltx_bibblock"> Richard Schasfoort. </span> <span class="ltx_bibblock">Introduction to surface plasmon resonance. </span> <span class="ltx_bibblock">2017. </span> </li> <li class="ltx_bibitem" id="bib.bib50"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Tang et al. (2014)</span> <span class="ltx_bibblock"> Jing Tang, Agnieszka Szwajda, Sushil Shakyawar, Tao Xu, Petteri Hintsanen, Krister Wennerberg, and Tero Aittokallio. </span> <span class="ltx_bibblock">Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. 
</span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib50.1.1">Journal of Chemical Information and Modeling</em>, 54(3):735–743, 2014. </span> </li> <li class="ltx_bibitem" id="bib.bib51"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Townshend et al. (2021)</span> <span class="ltx_bibblock"> Raphael John Lamarre Townshend, Martin Vögele, Patricia Adriana Suriana, Alexander Derry, Alexander Powers, Yianni Laloudakis, Sidhika Balachandar, Bowen Jing, Brandon M Anderson, Stephan Eismann, et al. </span> <span class="ltx_bibblock">Atom3d: Tasks on molecules in three dimensions. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib51.1.1">Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)</em>, 2021. </span> </li> <li class="ltx_bibitem" id="bib.bib52"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Tropsha et al. (2024)</span> <span class="ltx_bibblock"> Alexander Tropsha, Olexandr Isayev, Alexandre Varnek, Gisbert Schneider, and Artem Cherkasov. </span> <span class="ltx_bibblock">Integrating qsar modelling and deep learning in drug discovery: the emergence of deep qsar. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib52.1.1">Nature Reviews Drug Discovery</em>, 23(2):141–155, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib53"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Trott and Olson (2010)</span> <span class="ltx_bibblock"> Oleg Trott and Arthur J Olson. </span> <span class="ltx_bibblock">Autodock vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib53.1.1">Journal of computational chemistry</em>, 31(2):455–461, 2010. 
</span> </li> <li class="ltx_bibitem" id="bib.bib54"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Verdonk et al. (2003)</span> <span class="ltx_bibblock"> Marcel L Verdonk, Jason C Cole, Michael J Hartshorn, Christopher W Murray, and Richard D Taylor. </span> <span class="ltx_bibblock">Improved protein–ligand docking using gold. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib54.1.1">Proteins: Structure, Function, and Bioinformatics</em>, 52(4):609–623, 2003. </span> </li> <li class="ltx_bibitem" id="bib.bib55"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Verma et al. (2010)</span> <span class="ltx_bibblock"> Jitender Verma, Vijay M Khedkar, and Evans C Coutinho. </span> <span class="ltx_bibblock">3d-qsar in drug design-a review. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib55.1.1">Current topics in medicinal chemistry</em>, 10(1):95–115, 2010. </span> </li> <li class="ltx_bibitem" id="bib.bib56"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Vila et al. (2016)</span> <span class="ltx_bibblock"> Julien Vila, Emma Sáez-López, James R Johnson, Ute Römling, Ulrich Dobrindt, Rafael Cantón, CG Giske, Thierry Naas, Alessandra Carattoli, Margarita Martínez-Medina, et al. </span> <span class="ltx_bibblock">Escherichia coli: an old friend with new tidings. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib56.1.1">FEMS microbiology reviews</em>, 40(4):437–463, 2016. </span> </li> <li class="ltx_bibitem" id="bib.bib57"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wang et al. (2004)</span> <span class="ltx_bibblock"> Renxiao Wang, Xueliang Fang, Yipin Lu, and Shaomeng Wang. </span> <span class="ltx_bibblock">The pdbbind database: Collection of binding affinities for protein- ligand complexes with known three-dimensional structures. 
</span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib57.1.1">Journal of medicinal chemistry</em>, 47(12):2977–2980, 2004. </span> </li> <li class="ltx_bibitem" id="bib.bib58"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wang et al. (2005)</span> <span class="ltx_bibblock"> Renxiao Wang, Xueliang Fang, Yipin Lu, Chao-Yie Yang, and Shaomeng Wang. </span> <span class="ltx_bibblock">The pdbbind database: methodologies and updates. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib58.1.1">Journal of medicinal chemistry</em>, 48(12):4111–4119, 2005. </span> </li> <li class="ltx_bibitem" id="bib.bib59"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wei et al. (2024)</span> <span class="ltx_bibblock"> Hong Wei, Wenkai Wang, Zhenling Peng, and Jianyi Yang. </span> <span class="ltx_bibblock">Q-biolip: A comprehensive resource for quaternary structure-based protein–ligand interactions. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib59.1.1">Genomics, Proteomics &amp; Bioinformatics</em>, page qzae001, 2024. </span> </li> <li class="ltx_bibitem" id="bib.bib60"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Weininger (1988)</span> <span class="ltx_bibblock"> David Weininger. </span> <span class="ltx_bibblock">Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib60.1.1">Journal of chemical information and computer sciences</em>, 28(1):31–36, 1988. </span> </li> <li class="ltx_bibitem" id="bib.bib61"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Weininger et al. (1989)</span> <span class="ltx_bibblock"> David Weininger, Arthur Weininger, and Joseph L Weininger. </span> <span class="ltx_bibblock">Smiles. 2. algorithm for generation of unique smiles notation. 
</span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib61.1.1">Journal of chemical information and computer sciences</em>, 29(2):97–101, 1989. </span> </li> <li class="ltx_bibitem" id="bib.bib62"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wishart et al. (2018)</span> <span class="ltx_bibblock"> David S Wishart, Yannick D Feunang, An C Guo, Elvis J Lo, Ana Marcu, Jason R Grant, Tanvir Sajed, Daniel Johnson, Carin Li, Zinat Sayeeda, et al. </span> <span class="ltx_bibblock">Drugbank 5.0: a major update to the drugbank database for 2018. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib62.1.1">Nucleic acids research</em>, 46(D1):D1074–D1082, 2018. </span> </li> <li class="ltx_bibitem" id="bib.bib63"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Wu et al. (2018)</span> <span class="ltx_bibblock"> Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. </span> <span class="ltx_bibblock">Moleculenet: a benchmark for molecular machine learning. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib63.1.1">Chemical science</em>, 9(2):513–530, 2018. </span> </li> <li class="ltx_bibitem" id="bib.bib64"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yang et al. (2012)</span> <span class="ltx_bibblock"> Jianyi Yang, Ambrish Roy, and Yang Zhang. </span> <span class="ltx_bibblock">Biolip: a semi-manually curated database for biologically relevant ligand–protein interactions. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib64.1.1">Nucleic acids research</em>, 41(D1):D1096–D1103, 2012. </span> </li> <li class="ltx_bibitem" id="bib.bib65"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Yung-Chi and Prusoff (1973)</span> <span class="ltx_bibblock"> Cheng Yung-Chi and William H Prusoff. 
</span> <span class="ltx_bibblock">Relationship between the inhibition constant (ki) and the concentration of inhibitor which causes 50 per cent inhibition (i50) of an enzymatic reaction. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib65.1.1">Biochemical pharmacology</em>, 22(23):3099–3108, 1973. </span> </li> <li class="ltx_bibitem" id="bib.bib66"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhang et al. (2004)</span> <span class="ltx_bibblock"> Junwei Zhang, Masahiro Aizawa, Shinji Amari, Yoshio Iwasawa, Tatsuya Nakano, and Kotoko Nakata. </span> <span class="ltx_bibblock">Development of kibank, a database supporting structure-based drug design. </span> <span class="ltx_bibblock"><em class="ltx_emph ltx_font_italic" id="bib.bib66.1.1">Computational biology and chemistry</em>, 28(5-6):401–407, 2004. </span> </li> <li class="ltx_bibitem" id="bib.bib67"> <span class="ltx_tag ltx_role_refnum ltx_tag_bibitem">Zhou et al. (2022)</span> <span class="ltx_bibblock"> Gengmo Zhou, Zhifeng Gao, Qiankun Ding, Hang Zheng, Hongteng Xu, Zhewei Wei, Linfeng Zhang, and Guolin Ke. </span> <span class="ltx_bibblock">Uni-mol: A universal 3d molecular representation learning framework. </span> <span class="ltx_bibblock">In <em class="ltx_emph ltx_font_italic" id="bib.bib67.1.1">The Eleventh International Conference on Learning Representations</em>, 2022. 
</span> </li> </ul> </section> <div class="ltx_pagination ltx_role_newpage"></div> <section class="ltx_appendix" id="A1"> <h2 class="ltx_title ltx_title_appendix"> <span class="ltx_tag ltx_tag_appendix">Appendix A </span>Dataset and Code Availability</h2> <div class="ltx_para ltx_noindent" id="A1.p1"> <p class="ltx_p" id="A1.p1.1">The whole dataset and corresponding descriptions can be found at <br class="ltx_break"/><a class="ltx_ref ltx_href" href="https://huggingface.co/datasets/bgao95/SIU" title="">https://huggingface.co/datasets/bgao95/SIU</a></p> </div> <div class="ltx_para ltx_noindent" id="A1.p2"> <p class="ltx_p" id="A1.p2.1">The code and instructions used to train the baseline models can be found at <br class="ltx_break"/><a class="ltx_ref ltx_href" href="https://github.com/bowen-gao/SIU" title="">https://github.com/bowen-gao/SIU</a></p> </div> <div class="ltx_para ltx_noindent" id="A1.p3"> <p class="ltx_p" id="A1.p3.1">The dataset is hosted by Hugging Face. The license is CC BY 4.0. We bear all responsibility in case of violation of rights.</p> </div> <div class="ltx_para ltx_noindent" id="A1.p4"> <p class="ltx_p" id="A1.p4.1">The data we are using/curating doesn’t contain personally identifiable information or offensive content.</p> </div> </section> <section class="ltx_appendix" id="A2"> <h2 class="ltx_title ltx_title_appendix"> <span class="ltx_tag ltx_tag_appendix">Appendix B </span>Dataset overview</h2> <div class="ltx_para ltx_noindent" id="A2.p1"> <p class="ltx_p" id="A2.p1.1">SIU represents a large-scale, high-quality dataset of small molecule-protein interactions, meticulously organized to facilitate unbiased bioactivity prediction, both PDB-wise and assay-type-wise. The dataset comprises a total of 5,342,250 conformations. 
Each instance in the dataset provides detailed information about small molecule-protein interactions, including the coordinates and element types of each atom in the small molecule and the corresponding pockets of each interaction. Additionally, the assay value and type of each conformation, along with other critical information, are carefully obtained and retained from the original bioactivity databases. This includes the UniProt ID and PDB ID of the protein pockets, as well as the InChI keys <cite class="ltx_cite ltx_citemacro_citep">[Heller et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib28" title="">2015</a>]</cite> and SMILES <cite class="ltx_cite ltx_citemacro_cite">Weininger [<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib60" title="">1988</a>], Weininger et al. [<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib61" title="">1989</a>]</cite> notations of the small molecules.</p> </div> <figure class="ltx_table" id="A2.T3"> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 3: </span>The label count for 4 representative assay types in SIU total, SIU 0.9, and 0.6 versions.</figcaption> <table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle" id="A2.T3.5"> <thead class="ltx_thead"> <tr class="ltx_tr" id="A2.T3.5.6.1"> <th class="ltx_td ltx_th ltx_th_column ltx_th_row ltx_border_r ltx_border_tt" id="A2.T3.5.6.1.1"></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" colspan="4" id="A2.T3.5.6.1.2"><span class="ltx_text ltx_font_bold" id="A2.T3.5.6.1.2.1">SIU 0.9 version</span></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" colspan="4" id="A2.T3.5.6.1.3"><span class="ltx_text ltx_font_bold" id="A2.T3.5.6.1.3.1">SIU 0.6 version</span></th> </tr> <tr class="ltx_tr" id="A2.T3.5.7.2"> <th class="ltx_td ltx_th ltx_th_column ltx_th_row ltx_border_r" id="A2.T3.5.7.2.1"></th> <th 
class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A2.T3.5.7.2.2">Total</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A2.T3.5.7.2.3">Train</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A2.T3.5.7.2.4">Valid</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_t" id="A2.T3.5.7.2.5">Test</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A2.T3.5.7.2.6">Total</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A2.T3.5.7.2.7">Train</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A2.T3.5.7.2.8">Valid</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_t" id="A2.T3.5.7.2.9">Test</th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="A2.T3.1.1"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r ltx_border_t" id="A2.T3.1.1.1"><math alttext="MTL" class="ltx_Math" display="inline" id="A2.T3.1.1.1.m1.1"><semantics id="A2.T3.1.1.1.m1.1a"><mrow id="A2.T3.1.1.1.m1.1.1" xref="A2.T3.1.1.1.m1.1.1.cmml"><mi id="A2.T3.1.1.1.m1.1.1.2" xref="A2.T3.1.1.1.m1.1.1.2.cmml">M</mi><mo id="A2.T3.1.1.1.m1.1.1.1" xref="A2.T3.1.1.1.m1.1.1.1.cmml">⁢</mo><mi id="A2.T3.1.1.1.m1.1.1.3" xref="A2.T3.1.1.1.m1.1.1.3.cmml">T</mi><mo id="A2.T3.1.1.1.m1.1.1.1a" xref="A2.T3.1.1.1.m1.1.1.1.cmml">⁢</mo><mi id="A2.T3.1.1.1.m1.1.1.4" xref="A2.T3.1.1.1.m1.1.1.4.cmml">L</mi></mrow><annotation-xml encoding="MathML-Content" id="A2.T3.1.1.1.m1.1b"><apply id="A2.T3.1.1.1.m1.1.1.cmml" xref="A2.T3.1.1.1.m1.1.1"><times id="A2.T3.1.1.1.m1.1.1.1.cmml" xref="A2.T3.1.1.1.m1.1.1.1"></times><ci id="A2.T3.1.1.1.m1.1.1.2.cmml" xref="A2.T3.1.1.1.m1.1.1.2">𝑀</ci><ci id="A2.T3.1.1.1.m1.1.1.3.cmml" xref="A2.T3.1.1.1.m1.1.1.3">𝑇</ci><ci id="A2.T3.1.1.1.m1.1.1.4.cmml" xref="A2.T3.1.1.1.m1.1.1.4">𝐿</ci></apply></annotation-xml><annotation encoding="application/x-tex" 
id="A2.T3.1.1.1.m1.1c">MTL</annotation><annotation encoding="application/x-llamapun" id="A2.T3.1.1.1.m1.1d">italic_M italic_T italic_L</annotation></semantics></math></th> <td class="ltx_td ltx_align_center ltx_border_t" id="A2.T3.1.1.2">1272335</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A2.T3.1.1.3">1125727</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A2.T3.1.1.4">125080</td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="A2.T3.1.1.5">21528</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A2.T3.1.1.6">407858</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A2.T3.1.1.7">347697</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A2.T3.1.1.8">38633</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A2.T3.1.1.9">21528</td> </tr> <tr class="ltx_tr" id="A2.T3.2.2"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="A2.T3.2.2.1"><math alttext="IC_{50}" class="ltx_Math" display="inline" id="A2.T3.2.2.1.m1.1"><semantics id="A2.T3.2.2.1.m1.1a"><mrow id="A2.T3.2.2.1.m1.1.1" xref="A2.T3.2.2.1.m1.1.1.cmml"><mi id="A2.T3.2.2.1.m1.1.1.2" xref="A2.T3.2.2.1.m1.1.1.2.cmml">I</mi><mo id="A2.T3.2.2.1.m1.1.1.1" xref="A2.T3.2.2.1.m1.1.1.1.cmml">⁢</mo><msub id="A2.T3.2.2.1.m1.1.1.3" xref="A2.T3.2.2.1.m1.1.1.3.cmml"><mi id="A2.T3.2.2.1.m1.1.1.3.2" xref="A2.T3.2.2.1.m1.1.1.3.2.cmml">C</mi><mn id="A2.T3.2.2.1.m1.1.1.3.3" xref="A2.T3.2.2.1.m1.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="A2.T3.2.2.1.m1.1b"><apply id="A2.T3.2.2.1.m1.1.1.cmml" xref="A2.T3.2.2.1.m1.1.1"><times id="A2.T3.2.2.1.m1.1.1.1.cmml" xref="A2.T3.2.2.1.m1.1.1.1"></times><ci id="A2.T3.2.2.1.m1.1.1.2.cmml" xref="A2.T3.2.2.1.m1.1.1.2">𝐼</ci><apply id="A2.T3.2.2.1.m1.1.1.3.cmml" xref="A2.T3.2.2.1.m1.1.1.3"><csymbol cd="ambiguous" id="A2.T3.2.2.1.m1.1.1.3.1.cmml" xref="A2.T3.2.2.1.m1.1.1.3">subscript</csymbol><ci id="A2.T3.2.2.1.m1.1.1.3.2.cmml" xref="A2.T3.2.2.1.m1.1.1.3.2">𝐶</ci><cn 
id="A2.T3.2.2.1.m1.1.1.3.3.cmml" type="integer" xref="A2.T3.2.2.1.m1.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.T3.2.2.1.m1.1c">IC_{50}</annotation><annotation encoding="application/x-llamapun" id="A2.T3.2.2.1.m1.1d">italic_I italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math></th> <td class="ltx_td ltx_align_center" id="A2.T3.2.2.2">962063</td> <td class="ltx_td ltx_align_center" id="A2.T3.2.2.3">854230</td> <td class="ltx_td ltx_align_center" id="A2.T3.2.2.4">94859</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A2.T3.2.2.5">12974</td> <td class="ltx_td ltx_align_center" id="A2.T3.2.2.6">320594</td> <td class="ltx_td ltx_align_center" id="A2.T3.2.2.7">276969</td> <td class="ltx_td ltx_align_center" id="A2.T3.2.2.8">30651</td> <td class="ltx_td ltx_align_center" id="A2.T3.2.2.9">12974</td> </tr> <tr class="ltx_tr" id="A2.T3.3.3"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="A2.T3.3.3.1"><math alttext="EC_{50}" class="ltx_Math" display="inline" id="A2.T3.3.3.1.m1.1"><semantics id="A2.T3.3.3.1.m1.1a"><mrow id="A2.T3.3.3.1.m1.1.1" xref="A2.T3.3.3.1.m1.1.1.cmml"><mi id="A2.T3.3.3.1.m1.1.1.2" xref="A2.T3.3.3.1.m1.1.1.2.cmml">E</mi><mo id="A2.T3.3.3.1.m1.1.1.1" xref="A2.T3.3.3.1.m1.1.1.1.cmml">⁢</mo><msub id="A2.T3.3.3.1.m1.1.1.3" xref="A2.T3.3.3.1.m1.1.1.3.cmml"><mi id="A2.T3.3.3.1.m1.1.1.3.2" xref="A2.T3.3.3.1.m1.1.1.3.2.cmml">C</mi><mn id="A2.T3.3.3.1.m1.1.1.3.3" xref="A2.T3.3.3.1.m1.1.1.3.3.cmml">50</mn></msub></mrow><annotation-xml encoding="MathML-Content" id="A2.T3.3.3.1.m1.1b"><apply id="A2.T3.3.3.1.m1.1.1.cmml" xref="A2.T3.3.3.1.m1.1.1"><times id="A2.T3.3.3.1.m1.1.1.1.cmml" xref="A2.T3.3.3.1.m1.1.1.1"></times><ci id="A2.T3.3.3.1.m1.1.1.2.cmml" xref="A2.T3.3.3.1.m1.1.1.2">𝐸</ci><apply id="A2.T3.3.3.1.m1.1.1.3.cmml" xref="A2.T3.3.3.1.m1.1.1.3"><csymbol cd="ambiguous" id="A2.T3.3.3.1.m1.1.1.3.1.cmml" 
xref="A2.T3.3.3.1.m1.1.1.3">subscript</csymbol><ci id="A2.T3.3.3.1.m1.1.1.3.2.cmml" xref="A2.T3.3.3.1.m1.1.1.3.2">𝐶</ci><cn id="A2.T3.3.3.1.m1.1.1.3.3.cmml" type="integer" xref="A2.T3.3.3.1.m1.1.1.3.3">50</cn></apply></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.T3.3.3.1.m1.1c">EC_{50}</annotation><annotation encoding="application/x-llamapun" id="A2.T3.3.3.1.m1.1d">italic_E italic_C start_POSTSUBSCRIPT 50 end_POSTSUBSCRIPT</annotation></semantics></math></th> <td class="ltx_td ltx_align_center" id="A2.T3.3.3.2">97952</td> <td class="ltx_td ltx_align_center" id="A2.T3.3.3.3">84067</td> <td class="ltx_td ltx_align_center" id="A2.T3.3.3.4">9508</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A2.T3.3.3.5">4377</td> <td class="ltx_td ltx_align_center" id="A2.T3.3.3.6">32842</td> <td class="ltx_td ltx_align_center" id="A2.T3.3.3.7">25675</td> <td class="ltx_td ltx_align_center" id="A2.T3.3.3.8">2790</td> <td class="ltx_td ltx_align_center" id="A2.T3.3.3.9">4377</td> </tr> <tr class="ltx_tr" id="A2.T3.4.4"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_r" id="A2.T3.4.4.1"><math alttext="K_{i}" class="ltx_Math" display="inline" id="A2.T3.4.4.1.m1.1"><semantics id="A2.T3.4.4.1.m1.1a"><msub id="A2.T3.4.4.1.m1.1.1" xref="A2.T3.4.4.1.m1.1.1.cmml"><mi id="A2.T3.4.4.1.m1.1.1.2" xref="A2.T3.4.4.1.m1.1.1.2.cmml">K</mi><mi id="A2.T3.4.4.1.m1.1.1.3" xref="A2.T3.4.4.1.m1.1.1.3.cmml">i</mi></msub><annotation-xml encoding="MathML-Content" id="A2.T3.4.4.1.m1.1b"><apply id="A2.T3.4.4.1.m1.1.1.cmml" xref="A2.T3.4.4.1.m1.1.1"><csymbol cd="ambiguous" id="A2.T3.4.4.1.m1.1.1.1.cmml" xref="A2.T3.4.4.1.m1.1.1">subscript</csymbol><ci id="A2.T3.4.4.1.m1.1.1.2.cmml" xref="A2.T3.4.4.1.m1.1.1.2">𝐾</ci><ci id="A2.T3.4.4.1.m1.1.1.3.cmml" xref="A2.T3.4.4.1.m1.1.1.3">𝑖</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.T3.4.4.1.m1.1c">K_{i}</annotation><annotation encoding="application/x-llamapun" 
id="A2.T3.4.4.1.m1.1d">italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT</annotation></semantics></math></th> <td class="ltx_td ltx_align_center" id="A2.T3.4.4.2">198091</td> <td class="ltx_td ltx_align_center" id="A2.T3.4.4.3">175442</td> <td class="ltx_td ltx_align_center" id="A2.T3.4.4.4">19447</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A2.T3.4.4.5">3202</td> <td class="ltx_td ltx_align_center" id="A2.T3.4.4.6">47946</td> <td class="ltx_td ltx_align_center" id="A2.T3.4.4.7">40188</td> <td class="ltx_td ltx_align_center" id="A2.T3.4.4.8">4556</td> <td class="ltx_td ltx_align_center" id="A2.T3.4.4.9">3202</td> </tr> <tr class="ltx_tr" id="A2.T3.5.5"> <th class="ltx_td ltx_align_center ltx_th ltx_th_row ltx_border_bb ltx_border_r" id="A2.T3.5.5.1"><math alttext="K_{d}" class="ltx_Math" display="inline" id="A2.T3.5.5.1.m1.1"><semantics id="A2.T3.5.5.1.m1.1a"><msub id="A2.T3.5.5.1.m1.1.1" xref="A2.T3.5.5.1.m1.1.1.cmml"><mi id="A2.T3.5.5.1.m1.1.1.2" xref="A2.T3.5.5.1.m1.1.1.2.cmml">K</mi><mi id="A2.T3.5.5.1.m1.1.1.3" xref="A2.T3.5.5.1.m1.1.1.3.cmml">d</mi></msub><annotation-xml encoding="MathML-Content" id="A2.T3.5.5.1.m1.1b"><apply id="A2.T3.5.5.1.m1.1.1.cmml" xref="A2.T3.5.5.1.m1.1.1"><csymbol cd="ambiguous" id="A2.T3.5.5.1.m1.1.1.1.cmml" xref="A2.T3.5.5.1.m1.1.1">subscript</csymbol><ci id="A2.T3.5.5.1.m1.1.1.2.cmml" xref="A2.T3.5.5.1.m1.1.1.2">𝐾</ci><ci id="A2.T3.5.5.1.m1.1.1.3.cmml" xref="A2.T3.5.5.1.m1.1.1.3">𝑑</ci></apply></annotation-xml><annotation encoding="application/x-tex" id="A2.T3.5.5.1.m1.1c">K_{d}</annotation><annotation encoding="application/x-llamapun" id="A2.T3.5.5.1.m1.1d">italic_K start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT</annotation></semantics></math></th> <td class="ltx_td ltx_align_center ltx_border_bb" id="A2.T3.5.5.2">54570</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A2.T3.5.5.3">47347</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A2.T3.5.5.4">5347</td> <td class="ltx_td ltx_align_center 
ltx_border_bb ltx_border_r" id="A2.T3.5.5.5">1876</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A2.T3.5.5.6">17509</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A2.T3.5.5.7">14003</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A2.T3.5.5.8">1630</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A2.T3.5.5.9">1876</td> </tr> </tbody> </table> </figure> <div class="ltx_para ltx_noindent" id="A2.p2"> <p class="ltx_p" id="A2.p2.1">Additionally, the dataset encompasses 1,385,201 assay labels, each derived from a corresponding wet-lab bioactivity experiment, which supports the reliability and accuracy of the bioactivity information. SIU includes 1,720 diverse protein targets. Each protein may possess multiple distinct binding pockets, which are verified through rigorous deduplication, resulting in a total of 9,662 unique pockets. The dataset also features a substantial and diverse collection of 214,686 small molecules across all pockets. Importantly, we have only included protein pocket-small molecule pairs confirmed to be active or inactive through wet-lab experiments, amounting to 1,291,362 pairs.</p> </div> <div class="ltx_para ltx_noindent" id="A2.p3"> <p class="ltx_p" id="A2.p3.1">As shown in Table <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#A2.T3" title="Table 3 ‣ Appendix B Dataset overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_tag">3</span></a>, we included various assay types with significant amounts of data to demonstrate that bioactivity values from different assay types should not be mixed and to facilitate future assay-type-specific use of SIU. 
Additionally, to examine the structural differences among small molecules from the top four assay types, a random sample from each was visualized using t-SNE with ECFP fingerprints (radius = 3, 1024-bit vectors), as depicted in Figure <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#A2.F5" title="Figure 5 ‣ Appendix B Dataset overview ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_tag">5</span></a>.</p> </div> <figure class="ltx_figure" id="A2.F5"><img alt="Refer to caption" class="ltx_graphics ltx_centering ltx_img_square" height="358" id="A2.F5.g1" src="extracted/5664306/images/shuffled_tsne_02.png" width="419"/> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_figure">Figure 5: </span>Visualization of chemical structure differences among small molecules from the top four assay types using t-SNE with ECFP fingerprints.</figcaption> </figure> </section> <section class="ltx_appendix" id="A3"> <h2 class="ltx_title ltx_title_appendix"> <span class="ltx_tag ltx_tag_appendix">Appendix C </span>Test set construction</h2> <div class="ltx_para ltx_noindent" id="A3.p1"> <p class="ltx_p" id="A3.p1.1">To ensure the robustness and generalizability of the experimental findings with SIU, we curated a test set of 10 protein targets, as listed in Table <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#A3.T4" title="Table 4 ‣ Appendix C Test set construction ‣ SIU: A Million-Scale Structural Small Molecule-Protein Interaction Dataset for Unbiased Bioactivity Prediction"><span class="ltx_text ltx_ref_tag">4</span></a>. These targets were selected to represent a wide range of protein classes, including G-protein-coupled receptors (GPCRs), kinases, cytochromes, nuclear receptors, ion channels, epigenetic targets, and others, ensuring broad coverage of the bioactivity landscape. 
For example, "C11B1_HUMAN" belongs to the cytochrome P450 family, which is involved in the metabolism of various drugs <cite class="ltx_cite ltx_citemacro_cite">Bureik et al. [<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib8" title="">2002</a>], Denisov et al. [<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib15" title="">2005</a>]</cite>. "RARG_HUMAN" belongs to the Nuclear Receptor family, with drugs like bexarotene used for certain cancers <cite class="ltx_cite ltx_citemacro_cite">Altucci et al. [<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib2" title="">2007</a>], Qu and Tang [<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib44" title="">2010</a>]</cite>. "NMDE1_HUMAN" represents the NMDA receptor, a critical glutamate receptor in neurons implicated in various neurological disorders, with memantine being an approved NMDA receptor antagonist for moderate to severe Alzheimer’s disease <cite class="ltx_cite ltx_citemacro_cite">Mori and Mishina [<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib39" title="">1995</a>], Reisberg et al. [<a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib45" title="">2003</a>]</cite>. 
Including these targets across various functionalities enhances the applicability of our results in drug discovery.</p> </div> <figure class="ltx_table" id="A3.T4"> <figcaption class="ltx_caption ltx_centering"><span class="ltx_tag ltx_tag_table">Table 4: </span>The curated test set of 10 protein targets, covering a diverse range of protein classes and displaying an even distribution of small molecule-pocket pair counts.</figcaption> <table class="ltx_tabular ltx_centering ltx_guessed_headers ltx_align_middle" id="A3.T4.1"> <thead class="ltx_thead"> <tr class="ltx_tr" id="A3.T4.1.1.1"> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="A3.T4.1.1.1.1">UniProt</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="A3.T4.1.1.1.2">Gene name</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_r ltx_border_tt" id="A3.T4.1.1.1.3">Class</th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt" id="A3.T4.1.1.1.4">Small molecule-</th> </tr> <tr class="ltx_tr" id="A3.T4.1.2.2"> <th class="ltx_td ltx_th ltx_th_column ltx_border_r" id="A3.T4.1.2.2.1"></th> <th class="ltx_td ltx_th ltx_th_column ltx_border_r" id="A3.T4.1.2.2.2"></th> <th class="ltx_td ltx_th ltx_th_column ltx_border_r" id="A3.T4.1.2.2.3"></th> <th class="ltx_td ltx_align_center ltx_th ltx_th_column" id="A3.T4.1.2.2.4">pocket pair count</th> </tr> </thead> <tbody class="ltx_tbody"> <tr class="ltx_tr" id="A3.T4.1.3.1"> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="A3.T4.1.3.1.1">P61073</td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="A3.T4.1.3.1.2">CXCR4_HUMAN</td> <td class="ltx_td ltx_align_center ltx_border_r ltx_border_t" id="A3.T4.1.3.1.3">GPCR</td> <td class="ltx_td ltx_align_center ltx_border_t" id="A3.T4.1.3.1.4">1376</td> </tr> <tr class="ltx_tr" id="A3.T4.1.4.2"> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.4.2.1">P42866</td> <td 
class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.4.2.2">OPRM_MOUSE</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.4.2.3">GPCR</td> <td class="ltx_td ltx_align_center" id="A3.T4.1.4.2.4">2379</td> </tr> <tr class="ltx_tr" id="A3.T4.1.5.3"> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.5.3.1">Q00535</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.5.3.2">CDK5_HUMAN</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.5.3.3">Kinase</td> <td class="ltx_td ltx_align_center" id="A3.T4.1.5.3.4">2189</td> </tr> <tr class="ltx_tr" id="A3.T4.1.6.4"> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.6.4.1">Q04759</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.6.4.2">KPCT_HUMAN</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.6.4.3">Kinase</td> <td class="ltx_td ltx_align_center" id="A3.T4.1.6.4.4">2320</td> </tr> <tr class="ltx_tr" id="A3.T4.1.7.5"> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.7.5.1">P15538</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.7.5.2">C11B1_HUMAN</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.7.5.3">Cytochrome</td> <td class="ltx_td ltx_align_center" id="A3.T4.1.7.5.4">2427</td> </tr> <tr class="ltx_tr" id="A3.T4.1.8.6"> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.8.6.1">P13631</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.8.6.2">RARG_HUMAN</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.8.6.3">Nuclear Receptor</td> <td class="ltx_td ltx_align_center" id="A3.T4.1.8.6.4">1888</td> </tr> <tr class="ltx_tr" id="A3.T4.1.9.7"> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.9.7.1">Q12879</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.9.7.2">NMDE1_HUMAN</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.9.7.3">Ion Channel</td> <td class="ltx_td ltx_align_center" 
id="A3.T4.1.9.7.4">2144</td> </tr> <tr class="ltx_tr" id="A3.T4.1.10.8"> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.10.8.1">Q9UGN5</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.10.8.2">PARP2_HUMAN</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.10.8.3">Epigenetic</td> <td class="ltx_td ltx_align_center" id="A3.T4.1.10.8.4">2251</td> </tr> <tr class="ltx_tr" id="A3.T4.1.11.9"> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.11.9.1">Q86WV6</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.11.9.2">STING_HUMAN</td> <td class="ltx_td ltx_align_center ltx_border_r" id="A3.T4.1.11.9.3">Others</td> <td class="ltx_td ltx_align_center" id="A3.T4.1.11.9.4">2495</td> </tr> <tr class="ltx_tr" id="A3.T4.1.12.10"> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_r" id="A3.T4.1.12.10.1">Q96SW2</td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_r" id="A3.T4.1.12.10.2">CRBN_HUMAN</td> <td class="ltx_td ltx_align_center ltx_border_bb ltx_border_r" id="A3.T4.1.12.10.3">Others</td> <td class="ltx_td ltx_align_center ltx_border_bb" id="A3.T4.1.12.10.4">2059</td> </tr> </tbody> </table> </figure> </section> <section class="ltx_appendix" id="A4"> <h2 class="ltx_title ltx_title_appendix"> <span class="ltx_tag ltx_tag_appendix">Appendix D </span>Model Training</h2> <div class="ltx_para ltx_noindent" id="A4.p1"> <p class="ltx_p" id="A4.p1.1">For the GNN model, we use the same model as in ATOM3D <cite class="ltx_cite ltx_citemacro_citep">[Townshend et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib51" title="">2021</a>]</cite>, available at <a class="ltx_ref ltx_href" href="https://github.com/drorlab/atom3d/" title="">https://github.com/drorlab/atom3d/</a>. We train the model using one NVIDIA A100 GPU. 
The batch size is 256, the maximum number of epochs is 20, the optimizer is Adam, and the learning rate is 1e-3.</p> </div> <div class="ltx_para ltx_noindent" id="A4.p2"> <p class="ltx_p" id="A4.p2.1">For the 3D-CNN model, we use the same model as in ATOM3D <cite class="ltx_cite ltx_citemacro_citep">[Townshend et al., <a class="ltx_ref" href="https://arxiv.org/html/2406.08961v1#bib.bib51" title="">2021</a>]</cite>, available at <a class="ltx_ref ltx_href" href="https://github.com/drorlab/atom3d/" title="">https://github.com/drorlab/atom3d/</a>. We train the model using one NVIDIA A100 GPU. The batch size is 256, the maximum number of epochs is 20, the optimizer is Adam, and the learning rate is 1e-4.</p> </div> <div class="ltx_para ltx_noindent" id="A4.p3"> <p class="ltx_p" id="A4.p3.1">For the Uni-Mol model, we use the pretrained model weights provided at <a class="ltx_ref ltx_href" href="https://github.com/dptech-corp/Uni-Mol/" title="">https://github.com/dptech-corp/Uni-Mol/</a>. The outputs of the pretrained molecular encoder and pocket encoder are concatenated and passed through a four-layer multi-layer perceptron (MLP) with hidden dimensions 1024, 521, 256, and 128. We use four NVIDIA A100 GPUs to train the model. The batch size is 384, the maximum number of epochs is 50, the optimizer is Adam, and the learning rate is 1e-4.</p> </div> <div class="ltx_para ltx_noindent" id="A4.p4"> <p class="ltx_p" id="A4.p4.1">For the ProFSA model, we use the pretrained model weights provided at <a class="ltx_ref ltx_href" href="https://github.com/bowen-gao/ProFSA" title="">https://github.com/bowen-gao/ProFSA</a>. The outputs of the pretrained molecular encoder and pocket encoder are concatenated and passed through a four-layer multi-layer perceptron (MLP) with hidden dimensions 1024, 521, 256, and 128. We use four NVIDIA A100 GPUs to train the model. 
The batch size is 384, the maximum number of epochs is 50, the optimizer is Adam, and the learning rate is 1e-4.</p> </div> <div class="ltx_para ltx_noindent" id="A4.p5"> <p class="ltx_p" id="A4.p5.1">Details can be found at <a class="ltx_ref ltx_href" href="https://github.com/bowen-gao/SIU" title="">https://github.com/bowen-gao/SIU</a>.</p> </div> </section> <section class="ltx_appendix" id="A5"> <h2 class="ltx_title ltx_title_appendix"> <span class="ltx_tag ltx_tag_appendix">Appendix E </span>Potential negative impact of SIU</h2> <div class="ltx_para ltx_noindent" id="A5.p1"> <p class="ltx_p" id="A5.p1.1">While our dataset, SIU, represents a significant advancement in the field of bioactivity prediction, it is important to acknowledge potential limitations and areas of concern. Despite our robust multi-software docking approach and consensus filtering, the inherent reliance on computational methods may still introduce biases or inaccuracies in the modeled small molecule-protein interactions. These inaccuracies could mislead researchers, leading to less reliable predictions and potentially diverting attention away from promising compounds or targets.</p> </div> <div class="ltx_pagination ltx_role_newpage"></div> </section> </article> </div> <footer class="ltx_page_footer"> <div class="ltx_page_logo">Generated on Thu Jun 13 09:41:37 2024 by <a class="ltx_LaTeXML_logo" href="http://dlmf.nist.gov/LaTeXML/"><span style="letter-spacing:-0.2em; margin-right:0.1em;">L<span class="ltx_font_smallcaps" style="position:relative; bottom:2.2pt;">a</span>T<span class="ltx_font_smallcaps" style="font-size:120%;position:relative; bottom:-0.2ex;">e</span></span><span style="font-size:90%; position:relative; bottom:-0.2ex;">XML</span><img alt="Mascot Sammy" src=""/></a> </div></footer> </div> </body> </html>
