Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics | PLOS ONE
<!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml" xmlns:dc="http://purl.org/dc/terms/" xmlns:doi="http://dx.doi.org/" lang="en" xml:lang="en" itemscope itemtype="http://schema.org/Article" class="no-js"> <head prefix="og: http://ogp.me/ns#"> <link rel="stylesheet" href="/resource/css/screen.css?79f248ebefa43b7800a14562e5049ab4"/> <!-- allows for extra head tags --> <!-- hello --> <link rel="stylesheet" type="text/css" href="https://fonts.googleapis.com/css?family=Open+Sans:400,400i,600"> <link media="print" rel="stylesheet" type="text/css" href="/resource/css/print.css"/> <script type="text/javascript"> var siteUrlPrefix = "/plosone/"; </script> <script src="/resource/js/vendor/modernizr-v2.7.1.js" type="text/javascript"></script> <script src="/resource/js/vendor/detectizr.min.js" type="text/javascript"></script> <link rel="shortcut icon" href="/resource/img/favicon.ico" type="image/x-icon"/> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> <link rel="canonical" href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141287" /> <meta name="description" content="We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. 
To evaluate this method, we apply it in classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%±0.06% is obtained, outperforming existing family classification methods. In addition, we use ProtVec representation to predict disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich with phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in Protein Data Bank (PDB) with a 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that by only providing sequence data for various proteins into this model, accurate information about protein structure can be determined. Importantly, this model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest. Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics. The related data is available at Life Language Processing Website: http://llp.berkeley.edu and Harvard Dataverse: http://dx.doi.org/10.7910/DVN/JMFHTN." /> <meta name="citation_abstract" content="We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. 
In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it in classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%±0.06% is obtained, outperforming existing family classification methods. In addition, we use ProtVec representation to predict disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich with phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in Protein Data Bank (PDB) with a 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that by only providing sequence data for various proteins into this model, accurate information about protein structure can be determined. Importantly, this model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest. Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics. 
The related data is available at Life Language Processing Website: http://llp.berkeley.edu and Harvard Dataverse: http://dx.doi.org/10.7910/DVN/JMFHTN."> <meta name="keywords" content="Protein domains,Structural proteins,Biophysics,Protein structure prediction,Protein structure databases,Support vector machines,Bioinformatics,Protein structure" /> <meta name="citation_doi" content="10.1371/journal.pone.0141287"/> <meta name="citation_author" content="Ehsaneddin Asgari"/> <meta name="citation_author_institution" content="Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California 94720, United States of America"/> <meta name="citation_author" content="Mohammad R. K. Mofrad"/> <meta name="citation_author_institution" content="Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California 94720, United States of America"/> <meta name="citation_author_institution" content="Physical Biosciences Division, Lawrence Berkeley National Lab, Berkeley, California 94720, United States of America"/> <meta name="citation_title" content="Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics"/> <meta itemprop="name" content="Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics"/> <meta name="citation_journal_title" content="PLOS ONE"/> <meta name="citation_journal_abbrev" content="PLOS ONE"/> <meta name="citation_date" content="Nov 10, 2015"/> <meta name="citation_firstpage" content="e0141287"/> <meta name="citation_issue" content="11"/> <meta name="citation_volume" content="10"/> <meta name="citation_issn" content="1932-6203"/> <meta name="citation_publisher" content="Public Library of Science"/> <meta name="citation_pdf_url" content="https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0141287&type=printable"> <meta 
name="citation_article_type" content="Research Article"> <meta name="dc.identifier" content="10.1371/journal.pone.0141287" /> <meta name="twitter:card" content="summary" /> <meta name="twitter:site" content="plosone"/> <meta name="twitter:title" content="Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics" /> <meta property="twitter:description" content="We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it in classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%±0.06% is obtained, outperforming existing family classification methods. In addition, we use ProtVec representation to predict disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich with phenylalanine-glycine repeats (FG-Nups). 
Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in Protein Data Bank (PDB) with a 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that by only providing sequence data for various proteins into this model, accurate information about protein structure can be determined. Importantly, this model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest. Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics. The related data is available at Life Language Processing Website: http://llp.berkeley.edu and Harvard Dataverse: http://dx.doi.org/10.7910/DVN/JMFHTN." /> <meta property="twitter:image" content="https://journals.plos.org/plosone/article/figure/image?id=10.1371/journal.pone.0141287.g004&size=inline" /> <meta property="og:type" content="article" /> <meta property="og:url" content="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0141287"/> <meta property="og:title" content="Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics"/> <meta property="og:description" content="We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. 
In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it in classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%±0.06% is obtained, outperforming existing family classification methods. In addition, we use ProtVec representation to predict disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich with phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in Protein Data Bank (PDB) with a 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that by only providing sequence data for various proteins into this model, accurate information about protein structure can be determined. Importantly, this model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest. Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics. 
The related data is available at Life Language Processing Website: http://llp.berkeley.edu and Harvard Dataverse: http://dx.doi.org/10.7910/DVN/JMFHTN."/> <meta property="og:image" content="https://journals.plos.org/plosone/article/figure/image?id=10.1371/journal.pone.0141287.g004&size=inline"/> <meta name="citation_reference" content="citation_title=Genomics and natural language processing;citation_author=MD Yandell;citation_author=WH Majoros;citation_journal_title=Nature Reviews Genetics;citation_volume=3;citation_number=3;citation_issue=8;citation_first_page=601;citation_last_page=610;citation_publication_date=2002;"/> <meta name="citation_reference" content="citation_title=The language of genes;citation_author=DB Searls;citation_journal_title=Nature;citation_volume=420;citation_number=420;citation_issue=6912;citation_first_page=211;citation_last_page=217;citation_publication_date=2002;"/> <meta name="citation_reference" content="citation_title=Word decoding of protein amino acid sequences with availability analysis: a linguistic approach;citation_author=K Motomura;citation_author=T Fujita;citation_author=M Tsutsumi;citation_author=S Kikuzato;citation_author=M Nakamura;citation_author=JM Otaki;citation_journal_title=PloS one;citation_volume=7;citation_number=7;citation_issue=11;citation_first_page=e50039;citation_publication_date=2012;"/> <meta name="citation_reference" content="citation_title=Modeling structure-function relationships in synthetic DNA sequences using attribute grammars;citation_author=Y Cai;citation_author=MW Lux;citation_author=L Adam;citation_author=J Peccoud;citation_journal_title=PLoS Comput Biol;citation_volume=5;citation_number=5;citation_issue=10;citation_first_page=e1000529;citation_publication_date=2009;"/> <meta name="citation_reference" content="citation_title=Least squares support vector machine classifiers;citation_author=JA Suykens;citation_author=J Vandewalle;citation_journal_title=Neural processing 
letters;citation_volume=9;citation_number=9;citation_issue=3;citation_first_page=293;citation_last_page=300;citation_publication_date=1999;"/> <meta name="citation_reference" content="Hinton GE. Distributed representations. School of Computer Science at Carnegie Mellon University. 1984;."/> <meta name="citation_reference" content="citation_title=Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data;citation_author=TA Lasko;citation_author=JC Denny;citation_author=MA Levy;citation_journal_title=PloS one;citation_volume=8;citation_number=8;citation_issue=6;citation_first_page=e66341;citation_publication_date=2013;"/> <meta name="citation_reference" content="citation_title=The human splicing code reveals new insights into the genetic determinants of disease;citation_author=HY Xiong;citation_author=B Alipanahi;citation_author=LJ Lee;citation_author=H Bretschneider;citation_author=D Merico;citation_author=RK Yuen;citation_journal_title=Science;citation_volume=347;citation_number=347;citation_issue=6218;citation_first_page=1254806;citation_publication_date=2015;"/> <meta name="citation_reference" content="citation_title=Natural language processing (almost) from scratch;citation_author=R Collobert;citation_author=J Weston;citation_author=L Bottou;citation_author=M Karlen;citation_author=K Kavukcuoglu;citation_author=P Kuksa;citation_journal_title=The Journal of Machine Learning Research;citation_volume=12;citation_number=12;citation_first_page=2493;citation_last_page=2537;citation_publication_date=2011;"/> <meta name="citation_reference" content="citation_title=Distributed representations of words and phrases and their compositionality;citation_author=T Mikolov;citation_author=I Sutskever;citation_author=K Chen;citation_author=GS Corrado;citation_author=J Dean;citation_journal_title=Advances in neural information processing 
systems;citation_first_page=3111;citation_last_page=3119;citation_publication_date=2013;"/> <meta name="citation_reference" content="Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:13013781. 2013;."/> <meta name="citation_reference" content="citation_title=An efficient algorithm for large-scale detection of protein families;citation_author=AJ Enright;citation_author=S Van Dongen;citation_author=CA Ouzounis;citation_journal_title=Nucleic acids research;citation_volume=30;citation_number=30;citation_issue=7;citation_first_page=1575;citation_last_page=1584;citation_publication_date=2002;"/> <meta name="citation_reference" content="citation_title=Predicting function: from genes to genomes and back;citation_author=P Bork;citation_author=T Dandekar;citation_author=Y Diaz-Lazcoz;citation_author=F Eisenhaber;citation_author=M Huynen;citation_author=Y Yuan;citation_journal_title=Journal of molecular biology;citation_volume=283;citation_number=283;citation_issue=4;citation_first_page=707;citation_last_page=725;citation_publication_date=1998;"/> <meta name="citation_reference" content="citation_title=HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment;citation_author=M Remmert;citation_author=A Biegert;citation_author=A Hauser;citation_author=J Söding;citation_journal_title=Nature methods;citation_volume=9;citation_number=9;citation_issue=2;citation_first_page=173;citation_last_page=175;citation_publication_date=2012;"/> <meta name="citation_reference" content="citation_title=Pfam: the protein families database;citation_author=RD Finn;citation_author=A Bateman;citation_author=J Clements;citation_author=P Coggill;citation_author=RY Eberhardt;citation_author=SR Eddy;citation_journal_title=Nucleic acids research;citation_first_page=gkt1223;citation_publication_date=2013;"/> <meta name="citation_reference" content="citation_title=SVM-Prot: web-based support vector machine 
software for functional classification of a protein from its primary sequence;citation_author=C Cai;citation_author=L Han;citation_author=ZL Ji;citation_author=X Chen;citation_author=YZ Chen;citation_journal_title=Nucleic acids research;citation_volume=31;citation_number=31;citation_issue=13;citation_first_page=3692;citation_last_page=3697;citation_publication_date=2003;"/> <meta name="citation_reference" content="Leslie CS, Eskin E, Noble WS. The spectrum kernel: A string kernel for SVM protein classification. In: Pacific symposium on biocomputing. vol. 7. World Scientific; 2002. p. 566–575."/> <meta name="citation_reference" content="citation_title=Predicting protein function by genomic context: quantitative evaluation and qualitative inferences;citation_author=M Huynen;citation_author=B Snel;citation_author=W Lathe;citation_author=P Bork;citation_journal_title=Genome research;citation_volume=10;citation_number=10;citation_issue=8;citation_first_page=1204;citation_last_page=1210;citation_publication_date=2000;"/> <meta name="citation_reference" content="citation_title=SCOP: a structural classification of proteins database for the investigation of sequences and structures;citation_author=AG Murzin;citation_author=SE Brenner;citation_author=T Hubbard;citation_author=C Chothia;citation_journal_title=Journal of molecular biology;citation_volume=247;citation_number=247;citation_issue=4;citation_first_page=536;citation_last_page=540;citation_publication_date=1995;"/> <meta name="citation_reference" content="citation_title=Characterization of protein hubs by inferring interacting motifs from protein interactions;citation_author=R Aragues;citation_author=A Sali;citation_author=J Bonet;citation_author=MA Marti-Renom;citation_author=B Oliva;citation_journal_title=PloS Computational Biology;citation_volume=3.9;citation_number=3;citation_first_page=e178;citation_publication_date=2007;"/> <meta name="citation_reference" content="citation_title=Function and structure of 
inherently disordered proteins;citation_author=AK Dunker;citation_author=I Silman;citation_author=VN Uversky;citation_author=JL Sussman;citation_journal_title=Current opinion in structural biology;citation_volume=18;citation_number=18;citation_issue=6;citation_first_page=756;citation_last_page=764;citation_publication_date=2008;"/> <meta name="citation_reference" content="citation_title=Intrinsically unstructured proteins and their functions;citation_author=HJ Dyson;citation_author=PE Wright;citation_journal_title=Nature reviews Molecular cell biology;citation_volume=6;citation_number=6;citation_issue=3;citation_first_page=197;citation_last_page=208;citation_publication_date=2005;"/> <meta name="citation_reference" content="citation_title=Mechanism of coupled folding and binding of an intrinsically disordered protein;citation_author=K Sugase;citation_author=HJ Dyson;citation_author=PE Wright;citation_journal_title=Nature;citation_volume=447;citation_number=447;citation_issue=7147;citation_first_page=1021;citation_last_page=1025;citation_publication_date=2007;"/> <meta name="citation_reference" content="citation_title=Predicting intrinsic disorder in proteins: an overview;citation_author=B He;citation_author=K Wang;citation_author=Y Liu;citation_author=B Xue;citation_author=VN Uversky;citation_author=AK Dunker;citation_journal_title=Cell research;citation_volume=19;citation_number=19;citation_issue=8;citation_first_page=929;citation_last_page=949;citation_publication_date=2009;"/> <meta name="citation_reference" content="citation_title=Nuclear pore complex: biochemistry and biophysics of nucleocytoplasmic transport in health and disease;citation_author=T Jamali;citation_author=Y Jamali;citation_author=M Mehrbod;citation_author=M Mofrad;citation_journal_title=Int Rev Cell Mol Biol;citation_volume=287;citation_number=287;citation_first_page=233;citation_last_page=286;citation_publication_date=2011;"/> <meta name="citation_reference" content="citation_title=DisProt: 
the database of disordered proteins;citation_author=M Sickmeier;citation_author=JA Hamilton;citation_author=T LeGall;citation_author=V Vacic;citation_author=MS Cortese;citation_author=A Tantos;citation_journal_title=Nucleic acids research;citation_volume=35;citation_number=35;citation_issue=suppl 1;citation_first_page=D786;citation_last_page=D793;citation_publication_date=2007;"/> <meta name="citation_reference" content="citation_title=Physical motif clustering within intrinsically disordered nucleoporin sequences reveals universal functional features;citation_author=D Ando;citation_author=M Colvin;citation_author=M Rexach;citation_author=A Gopinathan;citation_journal_title=PloS one;citation_volume=8;citation_number=8;citation_issue=9;citation_first_page=e73831;citation_publication_date=2013;"/> <meta name="citation_reference" content="citation_title=Higher Nucleoporin-Importin β Affinity at the Nuclear Basket Increases Nucleocytoplasmic Import;citation_author=M Azimi;citation_author=MR Mofrad;citation_journal_title=PloS one;citation_volume=8;citation_number=8;citation_issue=11;citation_first_page=e81741;citation_publication_date=2013;"/> <meta name="citation_reference" content="Peyro M, Soheilypour M, Lee BL, Mofrad M. Evolutionary conserved sequence features optimizes nucleoporins behavior for cargo transportation through nuclear pore complex. Scientific Reports. 
In press 2015;."/> <meta name="citation_reference" content="citation_title=Visualization of multiple alignments, phylogenies and gene family evolution;citation_author=JB Procter;citation_author=J Thompson;citation_author=I Letunic;citation_author=C Creevey;citation_author=F Jossinet;citation_author=GJ Barton;citation_journal_title=Nature methods;citation_volume=7;citation_number=7;citation_first_page=S16;citation_last_page=S25;citation_publication_date=2010;"/> <meta name="citation_reference" content="citation_title=Artemis: sequence visualization and annotation;citation_author=K Rutherford;citation_author=J Parkhill;citation_author=J Crook;citation_author=T Horsnell;citation_author=P Rice;citation_author=MA Rajandream;citation_journal_title=Bioinformatics;citation_volume=16;citation_number=16;citation_issue=10;citation_first_page=944;citation_last_page=945;citation_publication_date=2000;"/> <meta name="citation_reference" content="Ganapathiraju M, Weisser D, Rosenfeld R, Carbonell J, Reddy R, Klein-Seetharaman J. Comparative n-gram analysis of whole-genome protein sequences. In: Proceedings of the second international conference on Human Language Technology Research. Morgan Kaufmann Publishers Inc.; 2002. p. 
76–81."/> <meta name="citation_reference" content="citation_title=Mining for class-specific motifs in protein sequence classification;citation_author=SM Srinivasan;citation_author=S Vural;citation_author=BR King;citation_author=C Guda;citation_journal_title=BMC bioinformatics;citation_volume=14;citation_number=14;citation_issue=1;citation_first_page=96;citation_publication_date=2013;"/> <meta name="citation_reference" content="citation_title=Subfamily specific conservation profiles for proteins based on n-gram patterns;citation_author=JK Vries;citation_author=X Liu;citation_journal_title=BMC bioinformatics;citation_volume=9;citation_number=9;citation_issue=1;citation_first_page=72;citation_publication_date=2008;"/> <meta name="citation_reference" content="Goldberg Y, Levy O. word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:14023722. 2014;."/> <meta name="citation_reference" content="citation_title=Visualizing data using t-SNE;citation_author=L Van der Maaten;citation_author=G Hinton;citation_journal_title=Journal of Machine Learning Research;citation_volume=9;citation_number=9;citation_issue=2579–2605;citation_first_page=85;citation_publication_date=2008;"/> <meta name="citation_reference" content="citation_title=Proteins and proteomics: A laboratory manual;citation_author=E McGregor;citation_journal_title=Journal of Proteome Research;citation_volume=3;citation_number=3;citation_issue=4;citation_first_page=694;citation_last_page=694;citation_publication_date=2004;"/> <meta name="citation_reference" content="citation_title=The RCSB Protein Data Bank: new resources for research and education;citation_author=PW Rose;citation_author=C Bi;citation_author=WF Bluhm;citation_author=CH Christie;citation_author=D Dimitropoulos;citation_author=S Dutta;citation_journal_title=Nucleic acids 
research;citation_volume=41;citation_number=41;citation_issue=D1;citation_first_page=D475;citation_last_page=D482;citation_publication_date=2013;"/> <meta name="citation_reference" content="citation_title=Visualization of SNPs with t-SNE;citation_author=A Platzer;citation_journal_title=PloS one;citation_volume=8;citation_number=8;citation_issue=2;citation_first_page=e56883;citation_publication_date=2013;"/> <!-- DoubleClick overall ad setup script --> <script type='text/javascript'> var googletag = googletag || {}; googletag.cmd = googletag.cmd || []; (function() { var gads = document.createElement('script'); gads.async = true; gads.type = 'text/javascript'; var useSSL = 'https:' == document.location.protocol; gads.src = (useSSL ? 'https:' : 'http:') + '//www.googletagservices.com/tag/js/gpt.js'; var node = document.getElementsByTagName('script')[0]; node.parentNode.insertBefore(gads, node); })(); </script> <!-- DoubleClick ad slot setup script --> <script id="doubleClickSetupScript" type='text/javascript'> googletag.cmd.push(function() { googletag.defineSlot('/75507958/PONE_728x90_ATF', [728, 90], 'div-gpt-ad-1458247671871-0').addService(googletag.pubads()); googletag.defineSlot('/75507958/PONE_160x600_BTF', [160, 600], 'div-gpt-ad-1458247671871-1').addService(googletag.pubads()); var personalizedAds = window.plosCookieConsent && window.plosCookieConsent.hasConsented('advertising'); googletag.pubads().setRequestNonPersonalizedAds(personalizedAds ? 
0 : 1); googletag.pubads().enableSingleRequest(); googletag.enableServices(); }); </script> <script type="text/javascript"> var WombatConfig = WombatConfig || {}; WombatConfig.journalKey = "PLoSONE"; WombatConfig.journalName = "PLOS ONE"; WombatConfig.figurePath = "/plosone/article/figure/image"; WombatConfig.figShareInstitutionString = "plos"; WombatConfig.doiResolverPrefix = "https://dx.plos.org/"; </script> <script type="text/javascript"> var WombatConfig = WombatConfig || {}; WombatConfig.metrics = WombatConfig.metrics || {}; WombatConfig.metrics.referenceUrl = "http://lagotto.io/plos"; WombatConfig.metrics.googleScholarUrl = "https://scholar.google.com/scholar"; WombatConfig.metrics.googleScholarCitationUrl = WombatConfig.metrics.googleScholarUrl + "?hl=en&lr=&q="; WombatConfig.metrics.crossrefUrl = "https://www.crossref.org"; </script> <script defer="defer" src="/resource/js/defer.js?13928eb59791c3cc61cf"></script><script src="/resource/js/sync.js?13928eb59791c3cc61cf"></script> <script src="/resource/js/vendor/jquery.min.js" type="text/javascript"></script> <script type="text/javascript" src="https://widgets.figshare.com/static/figshare.js"></script> <script src="/resource/js/vendor/fastclick/lib/fastclick.js" type="text/javascript"></script> <script src="/resource/js/vendor/foundation/foundation.js" type="text/javascript"></script> <script src="/resource/js/vendor/underscore-min.js" type="text/javascript"></script> <script src="/resource/js/vendor/underscore.string.min.js" type="text/javascript"></script> <script src="/resource/js/vendor/moment.js" type="text/javascript"></script> <script src="/resource/js/vendor/jquery-ui-effects.min.js" type="text/javascript"></script> <script src="/resource/js/vendor/foundation/foundation.tooltip.js" type="text/javascript"></script> <script src="/resource/js/vendor/foundation/foundation.dropdown.js" type="text/javascript"></script> <script src="/resource/js/vendor/foundation/foundation.tab.js" 
type="text/javascript"></script> <script src="/resource/js/vendor/foundation/foundation.reveal.js" type="text/javascript"></script> <script src="/resource/js/vendor/foundation/foundation.slider.js" type="text/javascript"></script> <script src="/resource/js/util/utils.js" type="text/javascript"></script> <script src="/resource/js/components/toggle.js" type="text/javascript"></script> <script src="/resource/js/components/truncate_elem.js" type="text/javascript"></script> <script src="/resource/js/components/tooltip_hover.js" type="text/javascript"></script> <script src="/resource/js/vendor/jquery.dotdotdot.js" type="text/javascript"></script> <!--For Google Tag manager to be able to track site information --> <script> dataLayer = [{ 'mobileSite': 'false', 'desktopSite': 'true' }]; </script> <title>Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics | PLOS ONE</title> </head> <body class="article plosone"> <!-- Google Tag Manager --> <noscript><iframe src="//www.googletagmanager.com/ns.html?id=GTM-TP26BH" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript> <script> (function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start': new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0], j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src= '//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f); })(window,document,'script','dataLayer','GTM-TP26BH'); </script> <noscript><iframe src="//www.googletagmanager.com/ns.html?id=GTM-MQQMGF" height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript> <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start': new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0], j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src= '//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f); 
})(window,document,'script','dataLayer','GTM-MQQMGF');</script> <!-- End Google Tag Manager --> <!-- New Relic --> <script type="text/javascript"> ;window.NREUM||(NREUM={});NREUM.init={distributed_tracing:{enabled:true},privacy:{cookies_enabled:true},ajax:{deny_list:["bam.nr-data.net"]}}; window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var o=e[n]={exports:{}};t[n][0].call(o.exports,function(e){var o=t[n][1][e];return r(o||e)},o,o.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o]);return r}({1:[function(t,e,n){function r(t){try{s.console&&console.log(t)}catch(e){}}var o,i=t("ee"),a=t(32),s={};try{o=localStorage.getItem("__nr_flags").split(","),console&&"function"==typeof console.log&&(s.console=!0,o.indexOf("dev")!==-1&&(s.dev=!0),o.indexOf("nr_dev")!==-1&&(s.nrDev=!0))}catch(c){}s.nrDev&&i.on("internal-error",function(t){r(t.stack)}),s.dev&&i.on("fn-err",function(t,e,n){r(n.stack)}),s.dev&&(r("NR AGENT IN DEVELOPMENT MODE"),r("flags: "+a(s,function(t,e){return t}).join(", ")))},{}],2:[function(t,e,n){function r(t,e,n,r,s){try{l?l-=1:o(s||new UncaughtException(t,e,n),!0)}catch(f){try{i("ierr",[f,c.now(),!0])}catch(d){}}return"function"==typeof u&&u.apply(this,a(arguments))}function UncaughtException(t,e,n){this.message=t||"Uncaught error with no additional information",this.sourceURL=e,this.line=n}function o(t,e){var n=e?null:c.now();i("err",[t,n])}var i=t("handle"),a=t(33),s=t("ee"),c=t("loader"),f=t("gos"),u=window.onerror,d=!1,p="nr@seenError";if(!c.disabled){var l=0;c.features.err=!0,t(1),window.onerror=r;try{throw new Error}catch(h){"stack"in h&&(t(14),t(13),"addEventListener"in 
window&&t(7),c.xhrWrappable&&t(15),d=!0)}s.on("fn-start",function(t,e,n){d&&(l+=1)}),s.on("fn-err",function(t,e,n){d&&!n[p]&&(f(n,p,function(){return!0}),this.thrown=!0,o(n))}),s.on("fn-end",function(){d&&!this.thrown&&l>0&&(l-=1)}),s.on("internal-error",function(t){i("ierr",[t,c.now(),!0])})}},{}],3:[function(t,e,n){var r=t("loader");r.disabled||(r.features.ins=!0)},{}],4:[function(t,e,n){function r(){U++,L=g.hash,this[u]=y.now()}function o(){U--,g.hash!==L&&i(0,!0);var t=y.now();this[h]=~~this[h]+t-this[u],this[d]=t}function i(t,e){E.emit("newURL",[""+g,e])}function a(t,e){t.on(e,function(){this[e]=y.now()})}var s="-start",c="-end",f="-body",u="fn"+s,d="fn"+c,p="cb"+s,l="cb"+c,h="jsTime",m="fetch",v="addEventListener",w=window,g=w.location,y=t("loader");if(w[v]&&y.xhrWrappable&&!y.disabled){var x=t(11),b=t(12),E=t(9),R=t(7),O=t(14),T=t(8),P=t(15),S=t(10),M=t("ee"),N=M.get("tracer"),C=t(23);t(17),y.features.spa=!0;var L,U=0;M.on(u,r),b.on(p,r),S.on(p,r),M.on(d,o),b.on(l,o),S.on(l,o),M.buffer([u,d,"xhr-resolved"]),R.buffer([u]),O.buffer(["setTimeout"+c,"clearTimeout"+s,u]),P.buffer([u,"new-xhr","send-xhr"+s]),T.buffer([m+s,m+"-done",m+f+s,m+f+c]),E.buffer(["newURL"]),x.buffer([u]),b.buffer(["propagate",p,l,"executor-err","resolve"+s]),N.buffer([u,"no-"+u]),S.buffer(["new-jsonp","cb-start","jsonp-error","jsonp-end"]),a(T,m+s),a(T,m+"-done"),a(S,"new-jsonp"),a(S,"jsonp-end"),a(S,"cb-start"),E.on("pushState-end",i),E.on("replaceState-end",i),w[v]("hashchange",i,C(!0)),w[v]("load",i,C(!0)),w[v]("popstate",function(){i(0,U>1)},C(!0))}},{}],5:[function(t,e,n){function r(){var t=new PerformanceObserver(function(t,e){var n=t.getEntries();s(v,[n])});try{t.observe({entryTypes:["resource"]})}catch(e){}}function o(t){if(s(v,[window.performance.getEntriesByType(w)]),window.performance["c"+p])try{window.performance[h](m,o,!1)}catch(t){}else try{window.performance[h]("webkit"+m,o,!1)}catch(t){}}function 
i(t){}if(window.performance&&window.performance.timing&&window.performance.getEntriesByType){var a=t("ee"),s=t("handle"),c=t(14),f=t(13),u=t(6),d=t(23),p="learResourceTimings",l="addEventListener",h="removeEventListener",m="resourcetimingbufferfull",v="bstResource",w="resource",g="-start",y="-end",x="fn"+g,b="fn"+y,E="bstTimer",R="pushState",O=t("loader");if(!O.disabled){O.features.stn=!0,t(9),"addEventListener"in window&&t(7);var T=NREUM.o.EV;a.on(x,function(t,e){var n=t[0];n instanceof T&&(this.bstStart=O.now())}),a.on(b,function(t,e){var n=t[0];n instanceof T&&s("bst",[n,e,this.bstStart,O.now()])}),c.on(x,function(t,e,n){this.bstStart=O.now(),this.bstType=n}),c.on(b,function(t,e){s(E,[e,this.bstStart,O.now(),this.bstType])}),f.on(x,function(){this.bstStart=O.now()}),f.on(b,function(t,e){s(E,[e,this.bstStart,O.now(),"requestAnimationFrame"])}),a.on(R+g,function(t){this.time=O.now(),this.startPath=location.pathname+location.hash}),a.on(R+y,function(t){s("bstHist",[location.pathname+location.hash,this.startPath,this.time])}),u()?(s(v,[window.performance.getEntriesByType("resource")]),r()):l in window.performance&&(window.performance["c"+p]?window.performance[l](m,o,d(!1)):window.performance[l]("webkit"+m,o,d(!1))),document[l]("scroll",i,d(!1)),document[l]("keypress",i,d(!1)),document[l]("click",i,d(!1))}}},{}],6:[function(t,e,n){e.exports=function(){return"PerformanceObserver"in window&&"function"==typeof window.PerformanceObserver}},{}],7:[function(t,e,n){function r(t){for(var e=t;e&&!e.hasOwnProperty(u);)e=Object.getPrototypeOf(e);e&&o(e)}function o(t){s.inPlace(t,[u,d],"-",i)}function i(t,e){return t[1]}var a=t("ee").get("events"),s=t("wrap-function")(a,!0),c=t("gos"),f=XMLHttpRequest,u="addEventListener",d="removeEventListener";e.exports=a,"getPrototypeOf"in Object?(r(document),r(window),r(f.prototype)):f.prototype.hasOwnProperty(u)&&(o(window),o(f.prototype)),a.on(u+"-start",function(t,e){var n=t[1];if(null!==n&&("function"==typeof n||"object"==typeof n)){var 
r=c(n,"nr@wrapped",function(){function t(){if("function"==typeof n.handleEvent)return n.handleEvent.apply(n,arguments)}var e={object:t,"function":n}[typeof n];return e?s(e,"fn-",null,e.name||"anonymous"):n});this.wrapped=t[1]=r}}),a.on(d+"-start",function(t){t[1]=this.wrapped||t[1]})},{}],8:[function(t,e,n){function r(t,e,n){var r=t[e];"function"==typeof r&&(t[e]=function(){var t=i(arguments),e={};o.emit(n+"before-start",[t],e);var a;e[m]&&e[m].dt&&(a=e[m].dt);var s=r.apply(this,t);return o.emit(n+"start",[t,a],s),s.then(function(t){return o.emit(n+"end",[null,t],s),t},function(t){throw o.emit(n+"end",[t],s),t})})}var o=t("ee").get("fetch"),i=t(33),a=t(32);e.exports=o;var s=window,c="fetch-",f=c+"body-",u=["arrayBuffer","blob","json","text","formData"],d=s.Request,p=s.Response,l=s.fetch,h="prototype",m="nr@context";d&&p&&l&&(a(u,function(t,e){r(d[h],e,f),r(p[h],e,f)}),r(s,"fetch",c),o.on(c+"end",function(t,e){var n=this;if(e){var r=e.headers.get("content-length");null!==r&&(n.rxSize=r),o.emit(c+"done",[null,e],n)}else o.emit(c+"done",[t],n)}))},{}],9:[function(t,e,n){var r=t("ee").get("history"),o=t("wrap-function")(r);e.exports=r;var i=window.history&&window.history.constructor&&window.history.constructor.prototype,a=window.history;i&&i.pushState&&i.replaceState&&(a=i),o.inPlace(a,["pushState","replaceState"],"-")},{}],10:[function(t,e,n){function r(t){function e(){f.emit("jsonp-end",[],l),t.removeEventListener("load",e,c(!1)),t.removeEventListener("error",n,c(!1))}function n(){f.emit("jsonp-error",[],l),f.emit("jsonp-end",[],l),t.removeEventListener("load",e,c(!1)),t.removeEventListener("error",n,c(!1))}var r=t&&"string"==typeof t.nodeName&&"script"===t.nodeName.toLowerCase();if(r){var o="function"==typeof t.addEventListener;if(o){var a=i(t.src);if(a){var d=s(a),p="function"==typeof d.parent[d.key];if(p){var l={};u.inPlace(d.parent,[d.key],"cb-",l),t.addEventListener("load",e,c(!1)),t.addEventListener("error",n,c(!1)),f.emit("new-jsonp",[t.src],l)}}}}}function 
o(){return"addEventListener"in window}function i(t){var e=t.match(d);return e?e[1]:null}function a(t,e){var n=t.match(l),r=n[1],o=n[3];return o?a(o,e[r]):e[r]}function s(t){var e=t.match(p);return e&&e.length>=3?{key:e[2],parent:a(e[1],window)}:{key:t,parent:window}}var c=t(23),f=t("ee").get("jsonp"),u=t("wrap-function")(f);if(e.exports=f,o()){var d=/[?&](?:callback|cb)=([^&#]+)/,p=/(.*)\.([^.]+)/,l=/^(\w+)(\.|$)(.*)$/,h=["appendChild","insertBefore","replaceChild"];Node&&Node.prototype&&Node.prototype.appendChild?u.inPlace(Node.prototype,h,"dom-"):(u.inPlace(HTMLElement.prototype,h,"dom-"),u.inPlace(HTMLHeadElement.prototype,h,"dom-"),u.inPlace(HTMLBodyElement.prototype,h,"dom-")),f.on("dom-start",function(t){r(t[0])})}},{}],11:[function(t,e,n){var r=t("ee").get("mutation"),o=t("wrap-function")(r),i=NREUM.o.MO;e.exports=r,i&&(window.MutationObserver=function(t){return this instanceof i?new i(o(t,"fn-")):i.apply(this,arguments)},MutationObserver.prototype=i.prototype)},{}],12:[function(t,e,n){function r(t){var e=i.context(),n=s(t,"executor-",e,null,!1),r=new f(n);return i.context(r).getCtx=function(){return e},r}var o=t("wrap-function"),i=t("ee").get("promise"),a=t("ee").getOrSetContext,s=o(i),c=t(32),f=NREUM.o.PR;e.exports=i,f&&(window.Promise=r,["all","race"].forEach(function(t){var e=f[t];f[t]=function(n){function r(t){return function(){i.emit("propagate",[null,!o],a,!1,!1),o=o||!t}}var o=!1;c(n,function(e,n){Promise.resolve(n).then(r("all"===t),r(!1))});var a=e.apply(f,arguments),s=f.resolve(a);return s}}),["resolve","reject"].forEach(function(t){var e=f[t];f[t]=function(t){var n=e.apply(f,arguments);return t!==n&&i.emit("propagate",[t,!0],n,!1,!1),n}}),f.prototype["catch"]=function(t){return this.then(null,t)},f.prototype=Object.create(f.prototype,{constructor:{value:r}}),c(Object.getOwnPropertyNames(f),function(t,e){try{r[e]=f[e]}catch(n){}}),o.wrapInPlace(f.prototype,"then",function(t){return function(){var 
e=this,n=o.argsToArray.apply(this,arguments),r=a(e);r.promise=e,n[0]=s(n[0],"cb-",r,null,!1),n[1]=s(n[1],"cb-",r,null,!1);var c=t.apply(this,n);return r.nextPromise=c,i.emit("propagate",[e,!0],c,!1,!1),c}}),i.on("executor-start",function(t){t[0]=s(t[0],"resolve-",this,null,!1),t[1]=s(t[1],"resolve-",this,null,!1)}),i.on("executor-err",function(t,e,n){t[1](n)}),i.on("cb-end",function(t,e,n){i.emit("propagate",[n,!0],this.nextPromise,!1,!1)}),i.on("propagate",function(t,e,n){this.getCtx&&!e||(this.getCtx=function(){if(t instanceof Promise)var e=i.context(t);return e&&e.getCtx?e.getCtx():this})}),r.toString=function(){return""+f})},{}],13:[function(t,e,n){var r=t("ee").get("raf"),o=t("wrap-function")(r),i="equestAnimationFrame";e.exports=r,o.inPlace(window,["r"+i,"mozR"+i,"webkitR"+i,"msR"+i],"raf-"),r.on("raf-start",function(t){t[0]=o(t[0],"fn-")})},{}],14:[function(t,e,n){function r(t,e,n){t[0]=a(t[0],"fn-",null,n)}function o(t,e,n){this.method=n,this.timerDuration=isNaN(t[1])?0:+t[1],t[0]=a(t[0],"fn-",this,n)}var i=t("ee").get("timer"),a=t("wrap-function")(i),s="setTimeout",c="setInterval",f="clearTimeout",u="-start",d="-";e.exports=i,a.inPlace(window,[s,"setImmediate"],s+d),a.inPlace(window,[c],c+d),a.inPlace(window,[f,"clearImmediate"],f+d),i.on(c+u,r),i.on(s+u,o)},{}],15:[function(t,e,n){function r(t,e){d.inPlace(e,["onreadystatechange"],"fn-",s)}function o(){var t=this,e=u.context(t);t.readyState>3&&!e.resolved&&(e.resolved=!0,u.emit("xhr-resolved",[],t)),d.inPlace(t,y,"fn-",s)}function i(t){x.push(t),m&&(E?E.then(a):w?w(a):(R=-R,O.data=R))}function a(){for(var t=0;t<x.length;t++)r([],x[t]);x.length&&(x=[])}function s(t,e){return e}function c(t,e){for(var n in t)e[n]=t[n];return e}t(7);var f=t("ee"),u=f.get("xhr"),d=t("wrap-function")(u),p=t(23),l=NREUM.o,h=l.XHR,m=l.MO,v=l.PR,w=l.SI,g="readystatechange",y=["onload","onerror","onabort","onloadstart","onloadend","onprogress","ontimeout"],x=[];e.exports=u;var b=window.XMLHttpRequest=function(t){var e=new 
h(t);try{u.emit("new-xhr",[e],e),e.addEventListener(g,o,p(!1))}catch(n){try{u.emit("internal-error",[n])}catch(r){}}return e};if(c(h,b),b.prototype=h.prototype,d.inPlace(b.prototype,["open","send"],"-xhr-",s),u.on("send-xhr-start",function(t,e){r(t,e),i(e)}),u.on("open-xhr-start",r),m){var E=v&&v.resolve();if(!w&&!v){var R=1,O=document.createTextNode(R);new m(a).observe(O,{characterData:!0})}}else f.on("fn-end",function(t){t[0]&&t[0].type===g||a()})},{}],16:[function(t,e,n){function r(t){if(!s(t))return null;var e=window.NREUM;if(!e.loader_config)return null;var n=(e.loader_config.accountID||"").toString()||null,r=(e.loader_config.agentID||"").toString()||null,f=(e.loader_config.trustKey||"").toString()||null;if(!n||!r)return null;var h=l.generateSpanId(),m=l.generateTraceId(),v=Date.now(),w={spanId:h,traceId:m,timestamp:v};return(t.sameOrigin||c(t)&&p())&&(w.traceContextParentHeader=o(h,m),w.traceContextStateHeader=i(h,v,n,r,f)),(t.sameOrigin&&!u()||!t.sameOrigin&&c(t)&&d())&&(w.newrelicHeader=a(h,m,v,n,r,f)),w}function o(t,e){return"00-"+e+"-"+t+"-01"}function i(t,e,n,r,o){var i=0,a="",s=1,c="",f="";return o+"@nr="+i+"-"+s+"-"+n+"-"+r+"-"+t+"-"+a+"-"+c+"-"+f+"-"+e}function a(t,e,n,r,o,i){var a="btoa"in window&&"function"==typeof window.btoa;if(!a)return null;var s={v:[0,1],d:{ty:"Browser",ac:r,ap:o,id:t,tr:e,ti:n}};return i&&r!==i&&(s.d.tk=i),btoa(JSON.stringify(s))}function s(t){return f()&&c(t)}function c(t){var e=!1,n={};if("init"in NREUM&&"distributed_tracing"in NREUM.init&&(n=NREUM.init.distributed_tracing),t.sameOrigin)e=!0;else if(n.allowed_origins instanceof Array)for(var r=0;r<n.allowed_origins.length;r++){var o=h(n.allowed_origins[r]);if(t.hostname===o.hostname&&t.protocol===o.protocol&&t.port===o.port){e=!0;break}}return e}function f(){return"init"in NREUM&&"distributed_tracing"in NREUM.init&&!!NREUM.init.distributed_tracing.enabled}function u(){return"init"in NREUM&&"distributed_tracing"in 
NREUM.init&&!!NREUM.init.distributed_tracing.exclude_newrelic_header}function d(){return"init"in NREUM&&"distributed_tracing"in NREUM.init&&NREUM.init.distributed_tracing.cors_use_newrelic_header!==!1}function p(){return"init"in NREUM&&"distributed_tracing"in NREUM.init&&!!NREUM.init.distributed_tracing.cors_use_tracecontext_headers}var l=t(29),h=t(18);e.exports={generateTracePayload:r,shouldGenerateTrace:s}},{}],17:[function(t,e,n){function r(t){var e=this.params,n=this.metrics;if(!this.ended){this.ended=!0;for(var r=0;r<p;r++)t.removeEventListener(d[r],this.listener,!1);e.aborted||(n.duration=a.now()-this.startTime,this.loadCaptureCalled||4!==t.readyState?null==e.status&&(e.status=0):i(this,t),n.cbTime=this.cbTime,s("xhr",[e,n,this.startTime,this.endTime,"xhr"],this))}}function o(t,e){var n=c(e),r=t.params;r.hostname=n.hostname,r.port=n.port,r.protocol=n.protocol,r.host=n.hostname+":"+n.port,r.pathname=n.pathname,t.parsedOrigin=n,t.sameOrigin=n.sameOrigin}function i(t,e){t.params.status=e.status;var n=v(e,t.lastSize);if(n&&(t.metrics.rxSize=n),t.sameOrigin){var r=e.getResponseHeader("X-NewRelic-App-Data");r&&(t.params.cat=r.split(", ").pop())}t.loadCaptureCalled=!0}var a=t("loader");if(a.xhrWrappable&&!a.disabled){var s=t("handle"),c=t(18),f=t(16).generateTracePayload,u=t("ee"),d=["load","error","abort","timeout"],p=d.length,l=t("id"),h=t(24),m=t(22),v=t(19),w=t(23),g=NREUM.o.REQ,y=window.XMLHttpRequest;a.features.xhr=!0,t(15),t(8),u.on("new-xhr",function(t){var e=this;e.totalCbs=0,e.called=0,e.cbTime=0,e.end=r,e.ended=!1,e.xhrGuids={},e.lastSize=null,e.loadCaptureCalled=!1,e.params=this.params||{},e.metrics=this.metrics||{},t.addEventListener("load",function(n){i(e,t)},w(!1)),h&&(h>34||h<10)||t.addEventListener("progress",function(t){e.lastSize=t.loaded},w(!1))}),u.on("open-xhr-start",function(t){this.params={method:t[0]},o(this,t[1]),this.metrics={}}),u.on("open-xhr-end",function(t,e){"loader_config"in NREUM&&"xpid"in 
NREUM.loader_config&&this.sameOrigin&&e.setRequestHeader("X-NewRelic-ID",NREUM.loader_config.xpid);var n=f(this.parsedOrigin);if(n){var r=!1;n.newrelicHeader&&(e.setRequestHeader("newrelic",n.newrelicHeader),r=!0),n.traceContextParentHeader&&(e.setRequestHeader("traceparent",n.traceContextParentHeader),n.traceContextStateHeader&&e.setRequestHeader("tracestate",n.traceContextStateHeader),r=!0),r&&(this.dt=n)}}),u.on("send-xhr-start",function(t,e){var n=this.metrics,r=t[0],o=this;if(n&&r){var i=m(r);i&&(n.txSize=i)}this.startTime=a.now(),this.listener=function(t){try{"abort"!==t.type||o.loadCaptureCalled||(o.params.aborted=!0),("load"!==t.type||o.called===o.totalCbs&&(o.onloadCalled||"function"!=typeof e.onload))&&o.end(e)}catch(n){try{u.emit("internal-error",[n])}catch(r){}}};for(var s=0;s<p;s++)e.addEventListener(d[s],this.listener,w(!1))}),u.on("xhr-cb-time",function(t,e,n){this.cbTime+=t,e?this.onloadCalled=!0:this.called+=1,this.called!==this.totalCbs||!this.onloadCalled&&"function"==typeof n.onload||this.end(n)}),u.on("xhr-load-added",function(t,e){var n=""+l(t)+!!e;this.xhrGuids&&!this.xhrGuids[n]&&(this.xhrGuids[n]=!0,this.totalCbs+=1)}),u.on("xhr-load-removed",function(t,e){var n=""+l(t)+!!e;this.xhrGuids&&this.xhrGuids[n]&&(delete this.xhrGuids[n],this.totalCbs-=1)}),u.on("xhr-resolved",function(){this.endTime=a.now()}),u.on("addEventListener-end",function(t,e){e instanceof y&&"load"===t[0]&&u.emit("xhr-load-added",[t[1],t[2]],e)}),u.on("removeEventListener-end",function(t,e){e instanceof y&&"load"===t[0]&&u.emit("xhr-load-removed",[t[1],t[2]],e)}),u.on("fn-start",function(t,e,n){e instanceof y&&("onload"===n&&(this.onload=!0),("load"===(t[0]&&t[0].type)||this.onload)&&(this.xhrCbStart=a.now()))}),u.on("fn-end",function(t,e){this.xhrCbStart&&u.emit("xhr-cb-time",[a.now()-this.xhrCbStart,this.onload,e],e)}),u.on("fetch-before-start",function(t){function e(t,e){var n=!1;return 
e.newrelicHeader&&(t.set("newrelic",e.newrelicHeader),n=!0),e.traceContextParentHeader&&(t.set("traceparent",e.traceContextParentHeader),e.traceContextStateHeader&&t.set("tracestate",e.traceContextStateHeader),n=!0),n}var n,r=t[1]||{};"string"==typeof t[0]?n=t[0]:t[0]&&t[0].url?n=t[0].url:window.URL&&t[0]&&t[0]instanceof URL&&(n=t[0].href),n&&(this.parsedOrigin=c(n),this.sameOrigin=this.parsedOrigin.sameOrigin);var o=f(this.parsedOrigin);if(o&&(o.newrelicHeader||o.traceContextParentHeader))if("string"==typeof t[0]||window.URL&&t[0]&&t[0]instanceof URL){var i={};for(var a in r)i[a]=r[a];i.headers=new Headers(r.headers||{}),e(i.headers,o)&&(this.dt=o),t.length>1?t[1]=i:t.push(i)}else t[0]&&t[0].headers&&e(t[0].headers,o)&&(this.dt=o)}),u.on("fetch-start",function(t,e){this.params={},this.metrics={},this.startTime=a.now(),this.dt=e,t.length>=1&&(this.target=t[0]),t.length>=2&&(this.opts=t[1]);var n,r=this.opts||{},i=this.target;"string"==typeof i?n=i:"object"==typeof i&&i instanceof g?n=i.url:window.URL&&"object"==typeof i&&i instanceof URL&&(n=i.href),o(this,n);var s=(""+(i&&i instanceof g&&i.method||r.method||"GET")).toUpperCase();this.params.method=s,this.txSize=m(r.body)||0}),u.on("fetch-done",function(t,e){this.endTime=a.now(),this.params||(this.params={}),this.params.status=e?e.status:0;var n;"string"==typeof this.rxSize&&this.rxSize.length>0&&(n=+this.rxSize);var r={txSize:this.txSize,rxSize:n,duration:a.now()-this.startTime};s("xhr",[this.params,r,this.startTime,this.endTime,"fetch"],this)})}},{}],18:[function(t,e,n){var r={};e.exports=function(t){if(t in r)return r[t];var e=document.createElement("a"),n=window.location,o={};e.href=t,o.port=e.port;var i=e.href.split("://");!o.port&&i[1]&&(o.port=i[1].split("/")[0].split("@").pop().split(":")[1]),o.port&&"0"!==o.port||(o.port="https"===i[0]?"443":"80"),o.hostname=e.hostname||n.hostname,o.pathname=e.pathname,o.protocol=i[0],"/"!==o.pathname.charAt(0)&&(o.pathname="/"+o.pathname);var 
a=!e.protocol||":"===e.protocol||e.protocol===n.protocol,s=e.hostname===document.domain&&e.port===n.port;return o.sameOrigin=a&&(!e.hostname||s),"/"===o.pathname&&(r[t]=o),o}},{}],19:[function(t,e,n){function r(t,e){var n=t.responseType;return"json"===n&&null!==e?e:"arraybuffer"===n||"blob"===n||"json"===n?o(t.response):"text"===n||""===n||void 0===n?o(t.responseText):void 0}var o=t(22);e.exports=r},{}],20:[function(t,e,n){function r(){}function o(t,e,n,r){return function(){return u.recordSupportability("API/"+e+"/called"),i(t+e,[f.now()].concat(s(arguments)),n?null:this,r),n?void 0:this}}var i=t("handle"),a=t(32),s=t(33),c=t("ee").get("tracer"),f=t("loader"),u=t(25),d=NREUM;"undefined"==typeof window.newrelic&&(newrelic=d);var p=["setPageViewName","setCustomAttribute","setErrorHandler","finished","addToTrace","inlineHit","addRelease"],l="api-",h=l+"ixn-";a(p,function(t,e){d[e]=o(l,e,!0,"api")}),d.addPageAction=o(l,"addPageAction",!0),d.setCurrentRouteName=o(l,"routeName",!0),e.exports=newrelic,d.interaction=function(){return(new r).get()};var m=r.prototype={createTracer:function(t,e){var n={},r=this,o="function"==typeof e;return i(h+"tracer",[f.now(),t,n],r),function(){if(c.emit((o?"":"no-")+"fn-start",[f.now(),r,o],n),o)try{return e.apply(this,arguments)}catch(t){throw c.emit("fn-err",[arguments,this,t],n),t}finally{c.emit("fn-end",[f.now()],n)}}}};a("actionText,setName,setAttribute,save,ignore,onEnd,getContext,end,get".split(","),function(t,e){m[e]=o(h,e)}),newrelic.noticeError=function(t,e){"string"==typeof t&&(t=new Error(t)),u.recordSupportability("API/noticeError/called"),i("err",[t,f.now(),!1,e])}},{}],21:[function(t,e,n){function r(t){if(NREUM.init){for(var e=NREUM.init,n=t.split("."),r=0;r<n.length-1;r++)if(e=e[n[r]],"object"!=typeof e)return;return e=e[n[n.length-1]]}}e.exports={getConfiguration:r}},{}],22:[function(t,e,n){e.exports=function(t){if("string"==typeof t&&t.length)return t.length;if("object"==typeof t){if("undefined"!=typeof ArrayBuffer&&t 
instanceof ArrayBuffer&&t.byteLength)return t.byteLength;if("undefined"!=typeof Blob&&t instanceof Blob&&t.size)return t.size;if(!("undefined"!=typeof FormData&&t instanceof FormData))try{return JSON.stringify(t).length}catch(e){return}}}},{}],23:[function(t,e,n){var r=!1;try{var o=Object.defineProperty({},"passive",{get:function(){r=!0}});window.addEventListener("testPassive",null,o),window.removeEventListener("testPassive",null,o)}catch(i){}e.exports=function(t){return r?{passive:!0,capture:!!t}:!!t}},{}],24:[function(t,e,n){var r=0,o=navigator.userAgent.match(/Firefox[\/\s](\d+\.\d+)/);o&&(r=+o[1]),e.exports=r},{}],25:[function(t,e,n){function r(t,e){var n=[a,t,{name:t},e];return i("storeMetric",n,null,"api"),n}function o(t,e){var n=[s,t,{name:t},e];return i("storeEventMetrics",n,null,"api"),n}var i=t("handle"),a="sm",s="cm";e.exports={constants:{SUPPORTABILITY_METRIC:a,CUSTOM_METRIC:s},recordSupportability:r,recordCustom:o}},{}],26:[function(t,e,n){function r(){return s.exists&&performance.now?Math.round(performance.now()):(i=Math.max((new Date).getTime(),i))-a}function o(){return i}var i=(new Date).getTime(),a=i,s=t(34);e.exports=r,e.exports.offset=a,e.exports.getLastTimestamp=o},{}],27:[function(t,e,n){function r(t){return!(!t||!t.protocol||"file:"===t.protocol)}e.exports=r},{}],28:[function(t,e,n){function r(t,e){var n=t.getEntries();n.forEach(function(t){"first-paint"===t.name?p("timing",["fp",Math.floor(t.startTime)]):"first-contentful-paint"===t.name&&p("timing",["fcp",Math.floor(t.startTime)])})}function o(t,e){var n=t.getEntries();if(n.length>0){var r=n[n.length-1];if(c&&c<r.startTime)return;p("lcp",[r])}}function i(t){t.getEntries().forEach(function(t){t.hadRecentInput||p("cls",[t])})}function a(t){if(t instanceof v&&!g){var e=Math.round(t.timeStamp),n={type:t.type};e<=l.now()?n.fid=l.now()-e:e>l.offset&&e<=Date.now()?(e-=l.offset,n.fid=l.now()-e):e=l.now(),g=!0,p("timing",["fi",e,n])}}function 
s(t){"hidden"===t&&(c=l.now(),p("pageHide",[c]))}if(!("init"in NREUM&&"page_view_timing"in NREUM.init&&"enabled"in NREUM.init.page_view_timing&&NREUM.init.page_view_timing.enabled===!1)){var c,f,u,d,p=t("handle"),l=t("loader"),h=t(31),m=t(23),v=NREUM.o.EV;if("PerformanceObserver"in window&&"function"==typeof window.PerformanceObserver){f=new PerformanceObserver(r);try{f.observe({entryTypes:["paint"]})}catch(w){}u=new PerformanceObserver(o);try{u.observe({entryTypes:["largest-contentful-paint"]})}catch(w){}d=new PerformanceObserver(i);try{d.observe({type:"layout-shift",buffered:!0})}catch(w){}}if("addEventListener"in document){var g=!1,y=["click","keydown","mousedown","pointerdown","touchstart"];y.forEach(function(t){document.addEventListener(t,a,m(!1))})}h(s)}},{}],29:[function(t,e,n){function r(){function t(){return e?15&e[n++]:16*Math.random()|0}var e=null,n=0,r=window.crypto||window.msCrypto;r&&r.getRandomValues&&(e=r.getRandomValues(new Uint8Array(31)));for(var o,i="xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx",a="",s=0;s<i.length;s++)o=i[s],"x"===o?a+=t().toString(16):"y"===o?(o=3&t()|8,a+=o.toString(16)):a+=o;return a}function o(){return a(16)}function i(){return a(32)}function a(t){function e(){return n?15&n[r++]:16*Math.random()|0}var n=null,r=0,o=window.crypto||window.msCrypto;o&&o.getRandomValues&&Uint8Array&&(n=o.getRandomValues(new Uint8Array(31)));for(var i=[],a=0;a<t;a++)i.push(e().toString(16));return i.join("")}e.exports={generateUuid:r,generateSpanId:o,generateTraceId:i}},{}],30:[function(t,e,n){function r(t,e){if(!o)return!1;if(t!==o)return!1;if(!e)return!0;if(!i)return!1;for(var n=i.split("."),r=e.split("."),a=0;a<r.length;a++)if(r[a]!==n[a])return!1;return!0}var o=null,i=null,a=/Version\/(\S+)\s+Safari/;if(navigator.userAgent){var s=navigator.userAgent,c=s.match(a);c&&s.indexOf("Chrome")===-1&&s.indexOf("Chromium")===-1&&(o="Safari",i=c[1])}e.exports={agent:o,version:i,match:r}},{}],31:[function(t,e,n){function r(t){function 
e(){t(s&&document[s]?document[s]:document[i]?"hidden":"visible")}"addEventListener"in document&&a&&document.addEventListener(a,e,o(!1))}var o=t(23);e.exports=r;var i,a,s;"undefined"!=typeof document.hidden?(i="hidden",a="visibilitychange",s="visibilityState"):"undefined"!=typeof document.msHidden?(i="msHidden",a="msvisibilitychange"):"undefined"!=typeof document.webkitHidden&&(i="webkitHidden",a="webkitvisibilitychange",s="webkitVisibilityState")},{}],32:[function(t,e,n){function r(t,e){var n=[],r="",i=0;for(r in t)o.call(t,r)&&(n[i]=e(r,t[r]),i+=1);return n}var o=Object.prototype.hasOwnProperty;e.exports=r},{}],33:[function(t,e,n){function r(t,e,n){e||(e=0),"undefined"==typeof n&&(n=t?t.length:0);for(var r=-1,o=n-e||0,i=Array(o<0?0:o);++r<o;)i[r]=t[e+r];return i}e.exports=r},{}],34:[function(t,e,n){e.exports={exists:"undefined"!=typeof window.performance&&window.performance.timing&&"undefined"!=typeof window.performance.timing.navigationStart}},{}],ee:[function(t,e,n){function r(){}function o(t){function e(t){return t&&t instanceof r?t:t?f(t,c,a):a()}function n(n,r,o,i,a){if(a!==!1&&(a=!0),!l.aborted||i){t&&a&&t(n,r,o);for(var s=e(o),c=m(n),f=c.length,u=0;u<f;u++)c[u].apply(s,r);var p=d[y[n]];return p&&p.push([x,n,r,s]),s}}function i(t,e){g[t]=m(t).concat(e)}function h(t,e){var n=g[t];if(n)for(var r=0;r<n.length;r++)n[r]===e&&n.splice(r,1)}function m(t){return g[t]||[]}function v(t){return p[t]=p[t]||o(n)}function w(t,e){l.aborted||u(t,function(t,n){e=e||"feature",y[n]=e,e in d||(d[e]=[])})}var g={},y={},x={on:i,addEventListener:i,removeEventListener:h,emit:n,get:v,listeners:m,context:e,buffer:w,abort:s,aborted:!1};return x}function i(t){return f(t,c,a)}function a(){return new r}function s(){(d.api||d.feature)&&(l.aborted=!0,d=l.backlog={})}var c="nr@context",f=t("gos"),u=t(32),d={},p={},l=e.exports=o();e.exports.getOrSetContext=i,l.backlog=d},{}],gos:[function(t,e,n){function r(t,e,n){if(o.call(t,e))return t[e];var 
r=n();if(Object.defineProperty&&Object.keys)try{return Object.defineProperty(t,e,{value:r,writable:!0,enumerable:!1}),r}catch(i){}return t[e]=r,r}var o=Object.prototype.hasOwnProperty;e.exports=r},{}],handle:[function(t,e,n){function r(t,e,n,r){o.buffer([t],r),o.emit(t,e,n)}var o=t("ee").get("handle");e.exports=r,r.ee=o},{}],id:[function(t,e,n){function r(t){var e=typeof t;return!t||"object"!==e&&"function"!==e?-1:t===window?0:a(t,i,function(){return o++})}var o=1,i="nr@id",a=t("gos");e.exports=r},{}],loader:[function(t,e,n){function r(){if(!P++){var t=T.info=NREUM.info,e=v.getElementsByTagName("script")[0];if(setTimeout(f.abort,3e4),!(t&&t.licenseKey&&t.applicationID&&e))return f.abort();c(R,function(e,n){t[e]||(t[e]=n)});var n=a();s("mark",["onload",n+T.offset],null,"api"),s("timing",["load",n]);var r=v.createElement("script");0===t.agent.indexOf("http://")||0===t.agent.indexOf("https://")?r.src=t.agent:r.src=h+"://"+t.agent,e.parentNode.insertBefore(r,e)}}function o(){"complete"===v.readyState&&i()}function i(){s("mark",["domContent",a()+T.offset],null,"api")}var a=t(26),s=t("handle"),c=t(32),f=t("ee"),u=t(30),d=t(27),p=t(21),l=t(23),h=p.getConfiguration("ssl")===!1?"http":"https",m=window,v=m.document,w="addEventListener",g="attachEvent",y=m.XMLHttpRequest,x=y&&y.prototype,b=!d(m.location);NREUM.o={ST:setTimeout,SI:m.setImmediate,CT:clearTimeout,XHR:y,REQ:m.Request,EV:m.Event,PR:m.Promise,MO:m.MutationObserver};var E=""+location,R={beacon:"bam.nr-data.net",errorBeacon:"bam.nr-data.net",agent:"js-agent.newrelic.com/nr-spa-1212.min.js"},O=y&&x&&x[w]&&!/CriOS/.test(navigator.userAgent),T=e.exports={offset:a.getLastTimestamp(),now:a,origin:E,features:{},xhrWrappable:O,userAgent:u,disabled:b};if(!b){t(20),t(28),v[w]?(v[w]("DOMContentLoaded",i,l(!1)),m[w]("load",r,l(!1))):(v[g]("onreadystatechange",o),m[g]("onload",r)),s("mark",["firstbyte",a.getLastTimestamp()],null,"api");var P=0}},{}],"wrap-function":[function(t,e,n){function r(t,e){function n(e,n,r,c,f){function 
nrWrapper(){var i,a,u,p;try{a=this,i=d(arguments),u="function"==typeof r?r(i,a):r||{}}catch(l){o([l,"",[i,a,c],u],t)}s(n+"start",[i,a,c],u,f);try{return p=e.apply(a,i)}catch(h){throw s(n+"err",[i,a,h],u,f),h}finally{s(n+"end",[i,a,p],u,f)}}return a(e)?e:(n||(n=""),nrWrapper[p]=e,i(e,nrWrapper,t),nrWrapper)}function r(t,e,r,o,i){r||(r="");var s,c,f,u="-"===r.charAt(0);for(f=0;f<e.length;f++)c=e[f],s=t[c],a(s)||(t[c]=n(s,u?c+r:r,o,c,i))}function s(n,r,i,a){if(!h||e){var s=h;h=!0;try{t.emit(n,r,i,e,a)}catch(c){o([c,n,r,i],t)}h=s}}return t||(t=u),n.inPlace=r,n.flag=p,n}function o(t,e){e||(e=u);try{e.emit("internal-error",t)}catch(n){}}function i(t,e,n){if(Object.defineProperty&&Object.keys)try{var r=Object.keys(t);return r.forEach(function(n){Object.defineProperty(e,n,{get:function(){return t[n]},set:function(e){return t[n]=e,e}})}),e}catch(i){o([i],n)}for(var a in t)l.call(t,a)&&(e[a]=t[a]);return e}function a(t){return!(t&&t instanceof Function&&t.apply&&!t[p])}function s(t,e){var n=e(t);return n[p]=t,i(t,n,u),n}function c(t,e,n){var r=t[e];t[e]=s(r,n)}function f(){for(var t=arguments.length,e=new Array(t),n=0;n<t;++n)e[n]=arguments[n];return e}var u=t("ee"),d=t(33),p="nr@original",l=Object.prototype.hasOwnProperty,h=!1;e.exports=r,e.exports.wrapFunction=s,e.exports.wrapInPlace=c,e.exports.argsToArray=f},{}]},{},["loader",2,17,5,3,4]); ;NREUM.loader_config={accountID:"804283",trustKey:"804283",agentID:"402703674",licenseKey:"cf99e8d2a3",applicationID:"402703674"} ;NREUM.info={beacon:"bam.nr-data.net",errorBeacon:"bam.nr-data.net",licenseKey:"cf99e8d2a3", // Modified this value from the generated script, to pass prod vs dev applicationID: window.location.hostname.includes('journals.plos.org') ? 
"402703674" : "402694889", sa:1} </script> <!-- End New Relic -->
<main id="main-content"> <div class="set-grid"> <header class="title-block"> <div class="article-meta"> <div class="classifications"> <p class="license-short" id="licenseShort">Open Access</p> <p class="peer-reviewed" id="peerReviewed">Peer-reviewed</p> <div class="article-type" > <p class="type-article" id="artType">Research Article</p> </div> </div> </div>
<div class="article-title-etc"> <div class="title-authors"> <h1 id="artTitle">Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics</h1> <ul class="author-list clearfix" data-js-tooltip="tooltip_container" id="author-list"> <li data-js-tooltip="tooltip_trigger" > <a data-author-id="0" class="author-name" > Ehsaneddin Asgari,</a> <div id="author-meta-0" class="author-info" data-js-tooltip="tooltip_target"> <p id="authAffiliations-0"><span class="type">Affiliation</span> Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California 94720, United States of America </p> </div> </li> <li data-js-tooltip="tooltip_trigger" > <a data-author-id="1" class="author-name" > Mohammad R. K. Mofrad</a> <div id="author-meta-1" class="author-info" data-js-tooltip="tooltip_target"> <p id="authCorresponding-1"> <span class="email">* E-mail:</span> <a href="mailto:mofrad@berkeley.edu">mofrad@berkeley.edu</a></p> <p id="authAffiliations-1"><span class="type">Affiliations</span> Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California 94720, United States of America, Physical Biosciences Division, Lawrence Berkeley National Lab, Berkeley, California 94720, United States of America </p> </div> </li> </ul> </div>
<ul class="date-doi"> <li id="artPubDate">Published: November 10, 2015</li> <li id="artDoi"> <a href="https://doi.org/10.1371/journal.pone.0141287">https://doi.org/10.1371/journal.pone.0141287</a> </li> </ul> </div> </header>
<section class="article-body"> <div class="article-container"> <div class="article-content"> <div class="article-text" id="artText"> <div xmlns:plos="http://plos.org" class="abstract toc-section abstract-type-"><a id="abstract0" name="abstract0" data-toc="abstract0" class="link-target" title="Abstract"></a><h2>Abstract</h2><div class="abstract-content"><a id="article1.front1.article-meta1.abstract1.p1" name="article1.front1.article-meta1.abstract1.p1" class="link-target"></a><p>We introduce a new representation and feature extraction method for 
biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general, with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it to the classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%±0.06% is obtained, outperforming existing family classification methods. In addition, we use the ProtVec representation to distinguish disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich in phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in the Protein Data Bank (PDB) with 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that, when only sequence data for various proteins are provided to this model, accurate information about protein structure can be determined. Importantly, this model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest. 
Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics. The related data are available at the Life Language Processing website: <a href="http://llp.berkeley.edu">http://llp.berkeley.edu</a> and Harvard Dataverse: <a href="http://dx.doi.org/10.7910/DVN/JMFHTN">http://dx.doi.org/10.7910/DVN/JMFHTN</a>.</p> </div></div> <div xmlns:plos="http://plos.org" class="articleinfo"><p><strong>Citation: </strong>Asgari E, Mofrad MRK (2015) Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS ONE 10(11): e0141287. https://doi.org/10.1371/journal.pone.0141287</p><p><strong>Editor: </strong>Firas H. Kobeissy, University of Florida, UNITED STATES </p><p><strong>Received: </strong>July 9, 2015; <strong>Accepted: </strong>October 5, 2015; <strong>Published: </strong> November 10, 2015</p><p><strong>Copyright: </strong> © 2015 Asgari, Mofrad. This is an open access article distributed under the terms of the <a href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License</a>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</p><p><strong>Data Availability: </strong>All relevant data are within the paper and its Supporting Information files. Our web-based tools and trained data are available at the Life Language Processing website: <a href="http://llp.berkeley.edu">http://llp.berkeley.edu</a> and through the Dataverse database: <a href="http://dx.doi.org/10.7910/DVN/JMFHTN">http://dx.doi.org/10.7910/DVN/JMFHTN</a>, and will be regularly updated for calculation/classification of ProtVecs as well as visualization of biological sequences.</p><p><strong>Funding: </strong>Financial support from the National Science Foundation through a CAREER Award (CBET-0955291) is gratefully acknowledged. 
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</p><p><strong>Competing interests: </strong> The authors have declared that no competing interests exist.</p></div> <div xmlns:plos="http://plos.org" id="section1" class="section toc-section"><a id="sec001" name="sec001" data-toc="sec001" class="link-target" title="Introduction"></a><h2>Introduction</h2><a id="article1.body1.sec1.p1" name="article1.body1.sec1.p1" class="link-target"></a><p>Nature uses certain languages to describe biological sequences such as DNA, RNA, and proteins. Much as humans adopt languages to communicate, biological organisms use sophisticated languages to convey information within and between cells. Inspired by this conceptual analogy, we adopt existing methods in natural language processing (NLP) to gain a deeper understanding of the “language of life” with the ultimate goal of discovering functions encoded within biological sequences [<a href="#pone.0141287.ref001" class="ref-tip">1</a>–<a href="#pone.0141287.ref004" class="ref-tip">4</a>].</p> <a id="article1.body1.sec1.p2" name="article1.body1.sec1.p2" class="link-target"></a><p>Feature extraction is an important step in data analysis, machine learning and NLP. It refers to finding an interpretable representation of data for machines that can increase the performance of learning algorithms. Even the most sophisticated algorithms perform poorly if inappropriate features are used, while simple methods can potentially perform well when they are fed with the appropriate features. Feature extraction can be done manually or in an unsupervised fashion. In this paper, we propose an unsupervised data-driven distributed representation for biological sequences. 
This method, called bio-vectors (BioVec) in general and more specifically protein-vectors (ProtVec) for proteins, can be applied to a wide range of problems in bioinformatics, such as protein visualization, protein family classification, structure prediction, domain extraction, and interaction prediction. In this approach, each biological sequence is embedded in an n-dimensional vector that characterizes biophysical and biochemical properties of sequences using neural networks. In the following, we first explain how this method works and how it is trained on 546,790 sequences from the Swiss-Prot database. Subsequently, we analyze the biophysical and biochemical properties of this representation qualitatively and quantitatively. To further evaluate this feature extraction method, we apply it to the classification of 324,018 protein sequences in Swiss-Prot into 7,027 protein families. In the next step, we use this approach for visualization and characterization of two categories of disordered sequences: the DisProt database as well as a database of disordered regions of phenylalanine-glycine nucleoporins (FG-Nups). Finally, we classify these protein families using support vector machine (SVM) classifiers [<a href="#pone.0141287.ref005" class="ref-tip">5</a>]. As a key advantage of the proposed method, the embedding needs to be trained only once and then may be used to encode biological sequences in a given problem. 
The related data and future updates will be available at: <a href="http://llp.berkeley.edu">http://llp.berkeley.edu</a> and Harvard Dataverse: <a href="http://dx.doi.org/10.7910/DVN/JMFHTN">http://dx.doi.org/10.7910/DVN/JMFHTN</a>.</p> <div id="section1" class="section toc-section"><a id="sec002" name="sec002" class="link-target" title="Distributed Representation"></a> <h3>Distributed Representation</h3> <a id="article1.body1.sec1.sec1.p1" name="article1.body1.sec1.sec1.p1" class="link-target"></a><p>Distributed representation has proved to be one of the most successful approaches in machine learning [<a href="#pone.0141287.ref006" class="ref-tip">6</a>–<a href="#pone.0141287.ref010" class="ref-tip">10</a>]. The main idea in this approach is encoding and storing information about an item within a system through establishing its interactions with other members. Distributed representation was originally inspired by the structure of human memory, where the items are stored in a “content-addressable” fashion. Content-based storing allows for efficiently recalling items from partial descriptions. Since the content-addressable items and their properties are stored in close proximity, such a system provides a viable infrastructure to generalize features attributed to an item.</p> <a id="article1.body1.sec1.sec1.p2" name="article1.body1.sec1.sec1.p2" class="link-target"></a><p>Continuous vector representation, as a distributed representation for words, has recently been established in natural language processing (NLP) as an efficient way to represent semantic/syntactic units with many applications. In this model, each word is embedded in a vector in an n-dimensional space. Similar words have close vectors, where similarity is defined in terms of both syntax and semantics. The basic idea behind training such vectors is that the meaning of a word is characterized by its context, i.e. neighboring words. 
Thus, words and their contexts are considered to be positive training samples. Such vectors can be trained using large amounts of textual data in a variety of ways, e.g. neural network architectures like the Skip-gram model [<a href="#pone.0141287.ref010" class="ref-tip">10</a>].</p> <a id="article1.body1.sec1.sec1.p3" name="article1.body1.sec1.sec1.p3" class="link-target"></a><p>Interesting patterns have been observed by training word vectors using Skip-gram in natural language. Words with similar vector representations show multiple degrees of similarity. For instance, <span class="inline-formula"><img src="article/file?type=thumbnail&id=10.1371/journal.pone.0141287.e001" loading="lazy" class="inline-graphic"></span> resembles the closest vector to the word <span class="inline-formula"><img src="article/file?type=thumbnail&id=10.1371/journal.pone.0141287.e002" loading="lazy" class="inline-graphic"></span> [<a href="#pone.0141287.ref011" class="ref-tip">11</a>].</p> <a id="article1.body1.sec1.sec1.p4" name="article1.body1.sec1.sec1.p4" class="link-target"></a><p>In this work, we seek unique patterns in biological sequences to facilitate biophysical and biochemical interpretations. We show how Skip-gram can be used to train a distributed representation for biological sequences over a large set of sequences, and establish physical and chemical interpretations for such representations. We propose this as a general-purpose representation for protein sequences that can be used in a wide range of bioinformatics problems, including protein family classification, protein interaction prediction, structure prediction, motif extraction, protein visualization, and domain identification. 
To illustrate, we specifically tackle visualization and protein family classification problems.</p> </div> <div id="section2" class="section toc-section"><a id="sec003" name="sec003" class="link-target" title="Protein Family Classification"></a> <h3>Protein Family Classification</h3> <a id="article1.body1.sec1.sec2.p1" name="article1.body1.sec1.sec2.p1" class="link-target"></a><p>A protein family is a set of proteins that are evolutionarily related, typically involving similar structures or functions. The large gap between the number of known sequences and the amount of known functional information about sequences has motivated family (function) identification methods based on primary sequences [<a href="#pone.0141287.ref012" class="ref-tip">12</a>–<a href="#pone.0141287.ref014" class="ref-tip">14</a>]. The Protein Family Database (Pfam) is a widely used source for protein families [<a href="#pone.0141287.ref015" class="ref-tip">15</a>]. In Pfam, a family can be classified as a “family”, “domain”, “repeat”, or “motif”. In this study, we utilize ProtVec to classify protein families in Swiss-Prot using the information provided by the Pfam database and we obtain a high classification accuracy.</p> <a id="article1.body1.sec1.sec2.p2" name="article1.body1.sec1.sec2.p2" class="link-target"></a><p>Protein family classification based on the primary structures (sequences) has been performed using classifiers such as support vector machine (SVM) classifiers [<a href="#pone.0141287.ref016" class="ref-tip">16</a>–<a href="#pone.0141287.ref018" class="ref-tip">18</a>]. Besides the primary sequence, the existing methods typically require extensive information for feature extraction, e.g. hydrophobicity, normalized Van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility. 
A previous study on family classification reported accuracies in the range of 69.1–99.6% for 54 protein families [<a href="#pone.0141287.ref016" class="ref-tip">16</a>]. In another study, researchers used motifs from protein interactions for detecting Structural Classification of Proteins (SCOP) [<a href="#pone.0141287.ref019" class="ref-tip">19</a>] families for 368 proteins, and obtained a classification accuracy of 75% at sensitivity of 10% [<a href="#pone.0141287.ref020" class="ref-tip">20</a>]. In contrast, our proposed approach is trained based solely on primary sequence information, yet achieves high accuracy when applied to the classification of protein families.</p> </div> <div id="section3" class="section toc-section"><a id="sec004" name="sec004" class="link-target" title="Disordered Proteins"></a> <h3>Disordered Proteins</h3> <a id="article1.body1.sec1.sec3.p1" name="article1.body1.sec1.sec3.p1" class="link-target"></a><p>Proteins can be fully or partially unstructured, i.e. lacking a secondary or ordered tertiary three-dimensional structure. Due to their abundance and the critical roles they play in cell biology, disordered proteins are considered to be an important class of proteins [<a href="#pone.0141287.ref021" class="ref-tip">21</a>]. 
Several studies have focused on disordered peptides and their functional analysis in recent years [<a href="#pone.0141287.ref022" class="ref-tip">22</a>–<a href="#pone.0141287.ref024" class="ref-tip">24</a>].</p> <a id="article1.body1.sec1.sec3.p2" name="article1.body1.sec1.sec3.p2" class="link-target"></a><p>In the present work, we introduce ProtVec for the visualization and characterization of two categories of disordered proteins: the DisProt database and a database of disordered regions of phenylalanine-glycine nucleoporins (FG-Nups) [<a href="#pone.0141287.ref025" class="ref-tip">25</a>].</p> <a id="article1.body1.sec1.sec3.p3" name="article1.body1.sec1.sec3.p3" class="link-target"></a><p>DisProt is a database of experimentally identified disordered proteins that categorizes disordered and ordered regions of a collection of proteins [<a href="#pone.0141287.ref026" class="ref-tip">26</a>]. DisProt Release 6.02 consists of 694 proteins presenting 1,539 disordered and 95 ordered regions. The FG-Nups dataset is a collection of FG-Nups disordered sequences [<a href="#pone.0141287.ref027" class="ref-tip">27</a>]. Nucleoporins form the nuclear pore complex (NPC), the sole gateway for bidirectional transport of cargo between the cytoplasm and the nucleus in eukaryotic cells [<a href="#pone.0141287.ref028" class="ref-tip">28</a>]. Since FG-Nups are mostly computationally identified, only 10 of the 1,138 disordered sequences exist in Swiss-Prot. A recent study on features of FG-Nups versus DisProt showed biophysical differences between FG-Nups and average DisProt sequences [<a href="#pone.0141287.ref029" class="ref-tip">29</a>].</p> <a id="article1.body1.sec1.sec3.p4" name="article1.body1.sec1.sec3.p4" class="link-target"></a><p>We further propose using protein-vectors for the visualization of biological sequences. 
The simplicity of ProtVec and the biophysical interpretations encoded within it distinguish this method from the previous work [<a href="#pone.0141287.ref030" class="ref-tip">30</a>, <a href="#pone.0141287.ref031" class="ref-tip">31</a>]. As an example, we use ProtVec for the visualization of FG-Nups, DisProt, and structured PDB proteins. This visualization confirms the results obtained in [<a href="#pone.0141287.ref029" class="ref-tip">29</a>] on the biophysical features of FG-Nups and typical disordered proteins. Furthermore, we employ ProtVec to classify FG-Nups versus random PDB sequences as well as DisProt disordered regions versus DisProt ordered regions.</p> </div> </div> <div xmlns:plos="http://plos.org" id="section2" class="section toc-section"><a id="sec005" name="sec005" data-toc="sec005" class="link-target" title="Methods"></a><h2>Methods</h2> <div id="section1" class="section toc-section"><a id="sec006" name="sec006" class="link-target" title="Protein-Space Construction"></a> <h3>Protein-Space Construction</h3> <a id="article1.body1.sec2.sec1.p1" name="article1.body1.sec2.sec1.p1" class="link-target"></a><p>Our goal is to construct a distributed representation of biological sequences. In the training process of word embedding in NLP, a large corpus of sentences should be fed into the training algorithm to ensure sufficient contexts are observed. Similarly, a large corpus is needed to train a distributed representation of biological sequences. We use Swiss-Prot as a rich protein database, which consists of 546,790 manually annotated and reviewed sequences.</p> <a id="article1.body1.sec2.sec1.p2" name="article1.body1.sec2.sec1.p2" class="link-target"></a><p>The next step in training distributed representations is to break the sequences into subsequences (i.e. biological words). 
The simplest and most common technique in bioinformatics to study sequences involves fixed-length overlapping n-grams [<a href="#pone.0141287.ref032" class="ref-tip">32</a>–<a href="#pone.0141287.ref034" class="ref-tip">34</a>]. However, instead of using n-grams directly in feature extraction, we utilize n-gram modeling for training a general purpose distributed representation of sequences. This so-called embedding model needs to be trained only once and may then be adopted in the feature extraction part of specific problems.</p> <a id="article1.body1.sec2.sec1.p3" name="article1.body1.sec2.sec1.p3" class="link-target"></a><p>In n-gram modeling of protein informatics, usually an overlapping window of 3 to 6 residues is used. Instead of taking overlapping windows, we generate 3 lists of shifted non-overlapping words, as shown in <a href="#pone-0141287-g001">Fig 1</a>. Evaluating K-nearest neighbors in a 2-fold cross-validation for different window sizes, embedding vector sizes, and overlapping versus non-overlapping n-grams showed a more consistent embedding training for a window size of 3 and the mentioned splitting.</p> <a class="link-target" id="pone-0141287-g001" name="pone-0141287-g001"></a><div class="figure" data-doi="10.1371/journal.pone.0141287.g001"><div class="img-box"><a title="Click for larger image" href="article/figure/image?size=medium&id=10.1371/journal.pone.0141287.g001" data-doi="10.1371/journal.pone.0141287" data-uri="10.1371/journal.pone.0141287.g001"><img src="article/figure/image?size=inline&id=10.1371/journal.pone.0141287.g001" alt="thumbnail" class="thumbnail" loading="lazy"></a><div class="expand"></div></div><div class="figure-inline-download"> Download: <ul><li><a href="article/figure/powerpoint?id=10.1371/journal.pone.0141287.g001"><div class="definition-label">PPT</div><div class="definition-description">PowerPoint slide</div></a></li><li><a href="article/figure/image?download&size=large&id=10.1371/journal.pone.0141287.g001"><div 
class="definition-label">PNG</div><div class="definition-description">larger image</div></a></li><li><a href="article/figure/image?download&size=original&id=10.1371/journal.pone.0141287.g001"><div class="definition-label">TIFF</div><div class="definition-description">original image</div></a></li></ul></div><div class="figcaption"><span>Fig 1. </span> Protein sequence splitting.</div><p class="caption_target"><a id="article1.body1.sec2.sec1.fig1.caption1.p1" name="article1.body1.sec2.sec1.fig1.caption1.p1" class="link-target"></a><p>In order to prepare the training data, each protein sequence will be represented as three sequences (1, 2, 3) of 3-grams.</p> </p><p class="caption_object"><a href="https://doi.org/10.1371/journal.pone.0141287.g001"> https://doi.org/10.1371/journal.pone.0141287.g001</a></p></div><a id="article1.body1.sec2.sec1.p4" name="article1.body1.sec2.sec1.p4" class="link-target"></a><p>The same procedure is applied on all 546,790 sequences in Swiss-Prot, thus at the end we obtain a corpus consisting of 546,790 × 3 = 1,640,370 sequences of 3-grams (3-gram is a “biological” word consisting of 3 amino acids). The next step is training the embedding based on such data through a Skip-gram neural network. In training word vector representations, Skip-gram attempts to maximize the probability of observed word sequences (contexts). In other words, for a given training sequence of words we would like to find their corresponding n-dimensional vectors maximizing the following average log probability function. Such a constraint allows similar words to assume a similar representation in this space. 
<a name="pone.0141287.e003" id="pone.0141287.e003" class="link-target"></a><span class="equation"><img src="article/file?type=thumbnail&id=10.1371/journal.pone.0141287.e003" loading="lazy" class="inline-graphic"><span class="note">(1)</span></span> where <em>N</em> is the length of the training sequence, 2<em>c</em> is the window size we consider as the context, <em>w</em><sub><em>i</em></sub> is the center of the window, <em>W</em> is the number of words in the dictionary and <em>v</em><sub><em>w</em></sub> and <span class="inline-formula"><img src="article/file?type=thumbnail&id=10.1371/journal.pone.0141287.e004" loading="lazy" class="inline-graphic"></span> are input and output n-dimensional representations of word <em>w</em>, respectively. The probability <em>p</em>(<em>w</em><sub><em>i</em>+<em>j</em></sub>|<em>w</em><sub><em>i</em></sub>) is defined using a softmax function. Hierarchical softmax and negative sampling are efficient approximations of such a softmax function. The implementation we use (Word2Vec) [<a href="#pone.0141287.ref010" class="ref-tip">10</a>] relies on negative sampling, which is considered the state of the art for training word vector representations. 
Negative sampling uses the following objective function in the calculation of the word vectors: <a name="pone.0141287.e005" id="pone.0141287.e005" class="link-target"></a><span class="equation"><img src="article/file?type=thumbnail&id=10.1371/journal.pone.0141287.e005" loading="lazy" class="inline-graphic"><span class="note">(2)</span></span> where <em>D</em> is the set of all word and context pairs (w, c) existing in the training data (positive samples) and <em>D</em>′ is a randomly generated set of incorrect (w, c) pairs (negative samples).</p> <a id="article1.body1.sec2.sec1.p5" name="article1.body1.sec2.sec1.p5" class="link-target"></a><p><em>p</em>(<em>D</em> = 1|<em>w</em>, <em>c</em>; <em>θ</em>) is the probability that (w, c) pair came from the training data and <em>p</em>(<em>D</em> = 0|<em>w</em>, <em>c</em>; <em>θ</em>) is the probability that (<em>w</em>, <em>c</em>) did not come from the training data. The term <em>p</em>(<em>D</em> = 1|<em>c</em>, <em>w</em>; <em>θ</em>) can be defined using a sigmoid function on the word vectors: <a name="pone.0141287.e006" id="pone.0141287.e006" class="link-target"></a><span class="equation"><img src="article/file?type=thumbnail&id=10.1371/journal.pone.0141287.e006" loading="lazy" class="inline-graphic"></span> where the parameters <em>θ</em> are the word vectors we train within the optimization framework: <em>v</em><sub><em>c</em></sub> and <em>v</em><sub><em>w</em></sub> ∈ <em>R</em><sup><em>d</em></sup> are vector representations for the context <em>c</em> and the word <em>w</em> respectively [<a href="#pone.0141287.ref035" class="ref-tip">35</a>]. In <a href="#pone.0141287.e005">Eq (2)</a>, the positive samples maximize the probabilities of the observed (w, c) pairs in the training data, while the negative samples prevent all vectors from having the same value by disallowing some incorrect (w, c) pairs. To train the embedding vectors, we consider a vector size of 100 and a context size of 25. 
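</p>
<p>The corpus preparation described above, in which each sequence yields three shifted lists of non-overlapping 3-grams, can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in this study: the function name <code>split_into_shifted_3grams</code> and the toy sequence are hypothetical, and dropping a trailing fragment shorter than three residues is an assumption.</p>
<pre>
```python
def split_into_shifted_3grams(sequence, n=3):
    """Return n lists of non-overlapping n-grams, one list per start offset."""
    shifted_lists = []
    for offset in range(n):
        # Step through the sequence in strides of n, starting at this offset.
        # Assumption: a trailing fragment shorter than n residues is dropped.
        words = [
            sequence[i:i + n]
            for i in range(offset, len(sequence) - n + 1, n)
        ]
        shifted_lists.append(words)
    return shifted_lists


example = "MAFSAEDVLK"  # hypothetical toy sequence, not taken from Swiss-Prot
for words in split_into_shifted_3grams(example):
    print(" ".join(words))
```
</pre>
<p>Each printed line corresponds to one "sentence" of biological words for the Skip-gram training, which is why a corpus of 546,790 sequences yields 1,640,370 such lists.</p>
<p>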
Thus each 3-gram is represented as a vector of size 100.</p> </div> <div id="section2" class="section toc-section"><a id="sec007" name="sec007" class="link-target" title="Protein-Space Analysis"></a> <h3>Protein-Space Analysis</h3> <a id="article1.body1.sec2.sec2.p1" name="article1.body1.sec2.sec2.p1" class="link-target"></a><p>To qualitatively analyze the distribution of various biophysical and biochemical properties within the training space, we project all 3-gram embeddings from 100-dimensional space to a 2D space using Stochastic Neighbor Embedding [<a href="#pone.0141287.ref036" class="ref-tip">36</a>]. Mass, volume, polarity, hydrophobicity, charge, and van der Waals volume properties were analyzed. The data are adopted from [<a href="#pone.0141287.ref037" class="ref-tip">37</a>]. In addition, to quantitatively measure the continuity of these properties in the protein-space, we calculate the best Lipschitz constant, i.e. the smallest <em>k</em> satisfying: <a name="pone.0141287.e007" id="pone.0141287.e007" class="link-target"></a><span class="equation"><img src="article/file?type=thumbnail&id=10.1371/journal.pone.0141287.e007" loading="lazy" class="inline-graphic"><span class="note">(3)</span></span> where <em>f</em> is the scale of one of the properties of a given 3-gram (e.g., average mass, hydrophobicity, etc.), <em>d</em> is the distance metric, <em>d</em><sub><em>f</em></sub> is the absolute value of score differences, and <em>d</em><sub><em>w</em></sub> is the Euclidean distance between two 3-grams <em>w</em><sub>1</sub> and <em>w</em><sub>2</sub>. 
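</p>
<p>As a rough sketch of this calculation: over a finite set of 3-grams, the best Lipschitz constant is the largest ratio of property difference to embedding distance across all pairs. The snippet assumes that Eq (3) has the usual form <em>d</em><sub><em>f</em></sub>(<em>f</em>(<em>w</em><sub>1</sub>), <em>f</em>(<em>w</em><sub>2</sub>)) ≤ <em>k</em> · <em>d</em><sub><em>w</em></sub>(<em>w</em><sub>1</sub>, <em>w</em><sub>2</sub>); the function name, the two-dimensional toy embeddings, and the scores are hypothetical and serve only as illustration.</p>
<pre>
```python
import itertools
import math


def best_lipschitz_constant(vectors, scores):
    """Largest ratio |f(w1) - f(w2)| / ||w1 - w2|| over all pairs of words."""
    best = 0.0
    for w1, w2 in itertools.combinations(vectors, 2):
        distance = math.dist(vectors[w1], vectors[w2])  # Euclidean distance d_w
        if distance == 0.0:
            continue  # assumption: pairs with identical embeddings are skipped
        ratio = abs(scores[w1] - scores[w2]) / distance  # d_f divided by d_w
        best = max(best, ratio)
    return best


# Hypothetical toy data: 2D embeddings and one property score per 3-gram.
toy_vectors = {"AAA": (0.0, 0.0), "AAG": (3.0, 4.0), "GGG": (6.0, 8.0)}
toy_scores = {"AAA": 1.0, "AAG": 2.0, "GGG": 6.0}
print(best_lipschitz_constant(toy_vectors, toy_scores))
```
</pre>
<p>A smaller constant means that nearby 3-grams have similar property values, which is the sense in which the space is called smooth.</p>
<p>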
The Lipschitz constant is calculated for the aforementioned properties.</p> </div> <div id="section3" class="section toc-section"><a id="sec008" name="sec008" class="link-target" title="Protein Family Classification"></a> <h3>Protein Family Classification</h3> <a id="article1.body1.sec2.sec3.p1" name="article1.body1.sec2.sec3.p1" class="link-target"></a><p>To evaluate the strength of the proposed representation, we set up a classification task on protein families. Family information of 324,018 protein sequences in Swiss-Prot is extracted from the Protein Family (Pfam) database, resulting in a total of 7,027 distinct families for Swiss-Prot sequences. Each sequence is represented as the summation of the vector representations of its overlapping 3-grams. Thus, each sequence is represented as a vector of size 100. For each family, the same number of instances is selected randomly from Swiss-Prot to form the negative examples. Support vector machine classifiers are used to evaluate the strength of ProtVec in the classification of protein families through 10-fold cross-validation. We perform the classification over 7,027 protein families consisting of 324,018 sequences. For the evaluation we report specificity (true negative rate), sensitivity (true positive rate), and the accuracy of family classifications. 
<a name="pone.0141287.e008" id="pone.0141287.e008" class="link-target"></a><span class="equation"><img src="article/file?type=thumbnail&id=10.1371/journal.pone.0141287.e008" loading="lazy" class="inline-graphic"></span> <a name="pone.0141287.e009" id="pone.0141287.e009" class="link-target"></a><span class="equation"><img src="article/file?type=thumbnail&id=10.1371/journal.pone.0141287.e009" loading="lazy" class="inline-graphic"></span> <a name="pone.0141287.e010" id="pone.0141287.e010" class="link-target"></a><span class="equation"><img src="article/file?type=thumbnail&id=10.1371/journal.pone.0141287.e010" loading="lazy" class="inline-graphic"></span></p> </div> <div id="section4" class="section toc-section"><a id="sec009" name="sec009" class="link-target" title="Visualization and Classification of Disordered Proteins"></a> <h3>Visualization and Classification of Disordered Proteins</h3> <a id="article1.body1.sec2.sec4.p1" name="article1.body1.sec2.sec4.p1" class="link-target"></a><p>Two databases of disordered proteins are used for disordered protein prediction: DisProt database (694 sequences) and FG-Nups dataset (1,138 sequences).</p> <div id="section1" class="section toc-section"><a id="sec010" name="sec010" class="link-target" title="FG-Nups Characterization"></a><h4>FG-Nups Characterization.</h4><a id="article1.body1.sec2.sec4.sec1.p1" name="article1.body1.sec2.sec4.sec1.p1" class="link-target"></a><p>To distinguish the characteristics of FG-Nups, a collection of 1,138 FG-Nups and two random sets of 1,138 structured proteins from Protein Data Bank (PDB) [<a href="#pone.0141287.ref038" class="ref-tip">38</a>] are compared. Since PDB sequences on average have a shorter length than disordered proteins, the two sets are selected from PDB in such a way that they have an average length of 900 residues, the same as the average length of the disordered protein dataset. 
For visualization purposes, the ProtVec is reduced from 100 dimensions to a 2D space using Stochastic Neighbor Embedding [<a href="#pone.0141287.ref036" class="ref-tip">36</a>].</p> <a id="article1.body1.sec2.sec4.sec1.p2" name="article1.body1.sec2.sec4.sec1.p2" class="link-target"></a><p>We quantitatively evaluate how ProtVec can be used to distinguish between FG-Nups versus typical PDB sequences using a support vector machine binary classifier. The positive examples were the aforementioned 1,138 disordered FG-Nups proteins and the negative examples (again 1,138 sequences) are selected randomly from PDB with the same average length of disordered sequences (≈ 900 residues). We present each protein sequence as a summation of its ProtVecs of all 3-grams. Since the average length of structured proteins is shorter than FG-Nups, and to avoid trivial cases, the PDB sequences are selected in a way to maintain the same average length.</p> </div> <div id="section2" class="section toc-section"><a id="sec011" name="sec011" class="link-target" title="DisProt Characterization"></a><h4>DisProt Characterization.</h4><a id="article1.body1.sec2.sec4.sec2.p1" name="article1.body1.sec2.sec4.sec2.p1" class="link-target"></a><p>To distinguish the characteristics of DisProt sequences, we use DisProt Release 6.02, consisting of 694 proteins presenting 1539 disordered and 95 ordered regions, and perform the same experiment as for FG-Nups with DisProt sequences.</p> </div> </div> </div> <div xmlns:plos="http://plos.org" id="section3" class="section toc-section"><a id="sec012" name="sec012" data-toc="sec012" class="link-target" title="Results"></a><h2>Results</h2> <div id="section1" class="section toc-section"><a id="sec013" name="sec013" class="link-target" title="Protein-Space Analysis"></a> <h3>Protein-Space Analysis</h3> <a id="article1.body1.sec3.sec1.p1" name="article1.body1.sec3.sec1.p1" class="link-target"></a><p>Although the protein-space is trained based on only the primary 
sequences of proteins, it offers several interesting biochemical and biophysical implications. In order to study these features, we visualized the distribution of different criteria, including mass, volume, polarity, hydrophobicity, charge, and van der Waals volume in this space. To do so, for each 3-gram we conducted qualitative and quantitative analyses as described below.</p> <div id="section1" class="section toc-section"><a id="sec014" name="sec014" class="link-target" title="Qualitative Analysis"></a><h4>Qualitative Analysis.</h4><a id="article1.body1.sec3.sec1.sec1.p1" name="article1.body1.sec3.sec1.sec1.p1" class="link-target"></a><p>In order to visualize the distribution of the aforementioned properties, we projected all 3-gram embeddings from 100-dimensional space to a 2D space using Stochastic Neighbor Embedding (t-SNE) [<a href="#pone.0141287.ref036" class="ref-tip">36</a>]. In the diagrams presented in <a href="#pone-0141287-g002">Fig 2</a>, each point represents a 3-gram and is colored according to its scale in each property. Interestingly, as can be seen in the figure, 3-grams with the same biophysical and biochemical properties were grouped together. 
This observation suggests that the proposed embedding not only encodes protein sequences in an efficient way that proved useful for classification purposes, but also reveals some important physical and chemical patterns in protein sequences.</p> <a class="link-target" id="pone-0141287-g002" name="pone-0141287-g002"></a><div class="figure" data-doi="10.1371/journal.pone.0141287.g002"><div class="img-box"><a title="Click for larger image" href="article/figure/image?size=medium&id=10.1371/journal.pone.0141287.g002" data-doi="10.1371/journal.pone.0141287" data-uri="10.1371/journal.pone.0141287.g002"><img src="article/figure/image?size=inline&id=10.1371/journal.pone.0141287.g002" alt="thumbnail" class="thumbnail" loading="lazy"></a><div class="expand"></div></div><div class="figure-inline-download"> Download: <ul><li><a href="article/figure/powerpoint?id=10.1371/journal.pone.0141287.g002"><div class="definition-label">PPT</div><div class="definition-description">PowerPoint slide</div></a></li><li><a href="article/figure/image?download&size=large&id=10.1371/journal.pone.0141287.g002"><div class="definition-label">PNG</div><div class="definition-description">larger image</div></a></li><li><a href="article/figure/image?download&size=original&id=10.1371/journal.pone.0141287.g002"><div class="definition-label">TIFF</div><div class="definition-description">original image</div></a></li></ul></div><div class="figcaption"><span>Fig 2. </span> Normalized distributions of biochemical and biophysical properties in protein-space.</div><p class="caption_target"><a id="article1.body1.sec3.sec1.sec1.fig1.caption1.p1" name="article1.body1.sec3.sec1.sec1.fig1.caption1.p1" class="link-target"></a><p>In these plots, each point represents a 3-gram (a word of three residues) and the colors indicate the scale for each property. Data points in these plots are projected from a 100-dimensional space to a 2D space using t-SNE. 
As shown, words with similar properties are automatically clustered together, meaning that the properties are smoothly distributed in this space.</p> </p><p class="caption_object"><a href="https://doi.org/10.1371/journal.pone.0141287.g002"> https://doi.org/10.1371/journal.pone.0141287.g002</a></p></div></div> <div id="section2" class="section toc-section"><a id="sec015" name="sec015" class="link-target" title="Quantitative Analysis"></a><h4>Quantitative Analysis.</h4><a id="article1.body1.sec3.sec1.sec2.p1" name="article1.body1.sec3.sec1.sec2.p1" class="link-target"></a><p>Although <a href="#pone-0141287-g002">Fig 2</a> illustrates the smoothness of protein-space with respect to different physical and chemical properties, we required a quantitative approach to measure the continuity of these properties in the protein-space. To do so, we calculated the best Lipschitz constant, i.e. the minimum <em>k</em>, for all 6 properties presented in <a href="#pone-0141287-g002">Fig 2</a>. To evaluate this result we made an artificial space called “scrambled space” by randomly shuffling the labels of 3-grams in the 100-dimensional space. 
<a href="#pone-0141287-t001">Table 1</a> contains the Lipschitz constants for the protein-space and the “scrambled space” with respect to the different properties, together with their ratios.</p> <a class="link-target" id="pone-0141287-t001" name="pone-0141287-t001"></a><div class="figure" data-doi="10.1371/journal.pone.0141287.t001"><div class="img-box"><a title="Click for larger image" href="article/figure/image?size=medium&id=10.1371/journal.pone.0141287.t001" data-doi="10.1371/journal.pone.0141287" data-uri="10.1371/journal.pone.0141287.t001"><img src="article/figure/image?size=inline&id=10.1371/journal.pone.0141287.t001" alt="thumbnail" class="thumbnail" loading="lazy"></a><div class="expand"></div></div><div class="figure-inline-download"> Download: <ul><li><a href="article/figure/powerpoint?id=10.1371/journal.pone.0141287.t001"><div class="definition-label">PPT</div><div class="definition-description">PowerPoint slide</div></a></li><li><a href="article/figure/image?download&size=large&id=10.1371/journal.pone.0141287.t001"><div class="definition-label">PNG</div><div class="definition-description">larger image</div></a></li><li><a href="article/figure/image?download&size=original&id=10.1371/journal.pone.0141287.t001"><div class="definition-label">TIFF</div><div class="definition-description">original image</div></a></li></ul></div><div class="figcaption"><span>Table 1. </span> Using the Lipschitz constant to evaluate the continuity of ProtVec with respect to biophysical and biochemical properties.</div><p class="caption_target"></p><p class="caption_object"><a href="https://doi.org/10.1371/journal.pone.0141287.t001"> https://doi.org/10.1371/journal.pone.0141287.t001</a></p></div><a id="article1.body1.sec3.sec1.sec2.p2" name="article1.body1.sec3.sec1.sec2.p2" class="link-target"></a><p>Normally if <em>k</em> = 1 the function is called a short map, and if 0 ≤ <em>k</em> < 1 the function is called a contraction. 
The results suggest that the protein-space is on average about twice as smooth in terms of physical and chemical properties as a random space. This quantitative result supports our qualitative observation of the space structure in <a href="#pone-0141287-g002">Fig 2</a>, and suggests that our training space encodes 3-grams in an informative manner.</p> </div> </div> <div id="section2" class="section toc-section"><a id="sec016" name="sec016" class="link-target" title="Protein Family Classification"></a> <h3>Protein Family Classification</h3> <a id="article1.body1.sec3.sec2.p1" name="article1.body1.sec3.sec2.p1" class="link-target"></a><p>In order to evaluate the strength of ProtVec, we performed classifications of 7,027 protein families and obtained a weighted average accuracy of 93% ± 0.06%, a more reliable result than those of the existing methods. In contrast to the existing methods, our proposed approach is trained based on primary sequence information alone.</p> <a id="article1.body1.sec3.sec2.p2" name="article1.body1.sec3.sec2.p2" class="link-target"></a><p> <a href="#pone-0141287-t002">Table 2</a> shows the sensitivity, specificity, and accuracy for the most frequent families in Swiss-Prot. These results suggest that structural features of proteins can be accurately predicted from the primary sequence information alone. The results for all 7,027 families can be found in Supplementary Information, see <a href="#pone.0141287.s001">S1 File</a>. The average accuracies for the 1,000 (261,149 sequences), 2,000 (293,957 sequences), 3,000 (308,292 sequences), and 4,000 (316,135 sequences) most frequent families were 94% ± 0.05%, 93% ± 0.05%, 92% ± 0.06%, and 91% ± 0.08%, respectively. To compute the overall accuracy for all 7,027 families, we calculated the weighted average accuracy, because for families with fewer than 10 instances the validation sets are not statistically sufficient, so these families should contribute less to the overall accuracy. 
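</p>
<p>The weighting by family size can be written out as a short sketch. The (family size, accuracy) pairs below are hypothetical values chosen for illustration, not results from this study, and the function name is likewise an assumption.</p>
<pre>
```python
def weighted_average_accuracy(family_results):
    """Average accuracy weighted by the number of sequences in each family."""
    total = sum(count for count, _ in family_results)
    return sum(count * accuracy for count, accuracy in family_results) / total


# Hypothetical (family size, accuracy) pairs for illustration only.
toy_results = [(1000, 0.95), (100, 0.90), (5, 0.60)]
unweighted = sum(accuracy for _, accuracy in toy_results) / len(toy_results)

# The small family pulls the plain mean down far more than the weighted mean.
print(round(unweighted, 4))
print(round(weighted_average_accuracy(toy_results), 4))
```
</pre>
<p>With this weighting, a family with only a handful of sequences has little influence on the overall figure, which matches the motivation given above.</p>
<p>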
The weighted accuracy of all 7,027 families (weighted based on the number of instances) was 93% ± 0.06%.</p> <a class="link-target" id="pone-0141287-t002" name="pone-0141287-t002"></a><div class="figure" data-doi="10.1371/journal.pone.0141287.t002"><div class="img-box"><a title="Click for larger image" href="article/figure/image?size=medium&id=10.1371/journal.pone.0141287.t002" data-doi="10.1371/journal.pone.0141287" data-uri="10.1371/journal.pone.0141287.t002"><img src="article/figure/image?size=inline&id=10.1371/journal.pone.0141287.t002" alt="thumbnail" class="thumbnail" loading="lazy"></a><div class="expand"></div></div><div class="figure-inline-download"> Download: <ul><li><a href="article/figure/powerpoint?id=10.1371/journal.pone.0141287.t002"><div class="definition-label">PPT</div><div class="definition-description">PowerPoint slide</div></a></li><li><a href="article/figure/image?download&size=large&id=10.1371/journal.pone.0141287.t002"><div class="definition-label">PNG</div><div class="definition-description">larger image</div></a></li><li><a href="article/figure/image?download&size=original&id=10.1371/journal.pone.0141287.t002"><div class="definition-label">TIFF</div><div class="definition-description">original image</div></a></li></ul></div><div class="figcaption"><span>Table 2. 
</span> Performance of protein family classification using SVM and ProtVec over some of the most frequent families in Swiss-Prot.</div><p class="caption_target"><a id="article1.body1.sec3.sec2.table-wrap1.caption1.p1" name="article1.body1.sec3.sec2.table-wrap1.caption1.p1" class="link-target"></a><p>Families are sorted with respect to their frequency in Swiss-Prot.</p> </p><p class="caption_object"><a href="https://doi.org/10.1371/journal.pone.0141287.t002"> https://doi.org/10.1371/journal.pone.0141287.t002</a></p></div></div> <div id="section3" class="section toc-section"><a id="sec017" name="sec017" class="link-target" title="Disordered Proteins Visualization and Classification"></a> <h3>Disordered Proteins Visualization and Classification</h3> <a id="article1.body1.sec3.sec3.p1" name="article1.body1.sec3.sec3.p1" class="link-target"></a><p>Due to the functional importance of disordered proteins, predicting unstructured regions of disordered proteins and determining the sequence patterns featured in disordered regions are critical problems in protein bioinformatics. We evaluated the ability of ProtVec to characterize and discern disordered protein sequences from structured sequences.</p> <div id="section1" class="section toc-section"><a id="sec018" name="sec018" class="link-target" title="FG-Nups Characterization"></a><h4>FG-Nups Characterization.</h4><a id="article1.body1.sec3.sec3.sec1.p1" name="article1.body1.sec3.sec3.sec1.p1" class="link-target"></a><p>In this case study, we used the FG-Nups collection of 1,138 disordered proteins containing disordered regions that make up at least one third of the sequence length. 
For comparison purposes, we also collected two sets of structured proteins from Protein Data Bank (PDB).</p> <a id="article1.body1.sec3.sec3.sec1.p2" name="article1.body1.sec3.sec3.sec1.p2" class="link-target"></a><p>In order to visualize each dataset, we reduced the dimensionality of the protein-space using Stochastic Neighbor Embedding [<a href="#pone.0141287.ref036" class="ref-tip">36</a>, <a href="#pone.0141287.ref039" class="ref-tip">39</a>] and then generated the 2D histogram of all overlapping 3-grams occurring in each dataset. As shown in <a href="#pone-0141287-g003">Fig 3</a> (see column (b)), the two random sets from structured proteins had nearly identical patterns. However, the FG-Nups dataset exhibits a substantially different pattern. To amplify the characteristic of disordered sequences we have also examined the histogram of disordered regions of FG-Nups (see <a href="#pone-0141287-g003">Fig 3</a>, column (a)).</p> <a class="link-target" id="pone-0141287-g003" name="pone-0141287-g003"></a><div class="figure" data-doi="10.1371/journal.pone.0141287.g003"><div class="img-box"><a title="Click for larger image" href="article/figure/image?size=medium&id=10.1371/journal.pone.0141287.g003" data-doi="10.1371/journal.pone.0141287" data-uri="10.1371/journal.pone.0141287.g003"><img src="article/figure/image?size=inline&id=10.1371/journal.pone.0141287.g003" alt="thumbnail" class="thumbnail" loading="lazy"></a><div class="expand"></div></div><div class="figure-inline-download"> Download: <ul><li><a href="article/figure/powerpoint?id=10.1371/journal.pone.0141287.g003"><div class="definition-label">PPT</div><div class="definition-description">PowerPoint slide</div></a></li><li><a href="article/figure/image?download&size=large&id=10.1371/journal.pone.0141287.g003"><div class="definition-label">PNG</div><div class="definition-description">larger image</div></a></li><li><a href="article/figure/image?download&size=original&id=10.1371/journal.pone.0141287.g003"><div 
class="definition-label">TIFF</div><div class="definition-description">original image</div></a></li></ul></div><div class="figcaption"><span>Fig 3. </span> Visualization of protein sequences using ProtVec can characterize FG-Nups versus DisProt disordered sequences and structured sequences.</div><p class="caption_target"><a id="article1.body1.sec3.sec3.sec1.fig1.caption1.p1" name="article1.body1.sec3.sec3.sec1.fig1.caption1.p1" class="link-target"></a><p>Column (a) compares the 2D histogram of FG-Nups sequences (at the bottom) with the 2D histogram of FG-Nups disordered regions (on top). Column (b) compares the 2D histograms of two random sets of structured sequences with the same average length as the FG-Nups. Column (c) compares the 2D histogram of DisProt sequences (at the bottom) with the 2D histogram of DisProt disordered regions (on top).</p> </p><p class="caption_object"><a href="https://doi.org/10.1371/journal.pone.0141287.g003"> https://doi.org/10.1371/journal.pone.0141287.g003</a></p></div><a id="article1.body1.sec3.sec3.sec1.p3" name="article1.body1.sec3.sec3.sec1.p3" class="link-target"></a><p>In the next step, we quantitatively evaluated how well ProtVec can distinguish FG-Nups from typical PDB sequences using a support vector machine binary classifier. The positive examples were the aforementioned 1,138 disordered FG-Nups proteins, and the negative examples (again 1,138 sequences) were selected randomly from PDB with the same average length as the disordered sequences (≈ 900 residues). We represented each protein sequence as the summation of the ProtVecs of all its 3-grams. Since structured proteins are on average shorter than FG-Nups, the PDB sequences were selected in such a way as to maintain the same average length and thereby avoid trivial cases. Even so, an accuracy of 99.81% was obtained with high sensitivity and specificity (<a href="#pone-0141287-t003">Table 3</a>). 
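</p>
<p>The sequence representation used in this classification, the sum of the ProtVecs of all overlapping 3-grams, can be sketched as follows. The embedding table is a hypothetical two-dimensional toy (the trained vectors have 100 dimensions), the function name is illustrative, and skipping 3-grams that are absent from the table is an assumption.</p>
<pre>
```python
def sequence_vector(sequence, embeddings, dim):
    """Sum the embedding vectors of all overlapping 3-grams in a sequence."""
    total = [0.0] * dim
    for i in range(len(sequence) - 2):
        word = sequence[i:i + 3]
        # Assumption: 3-grams missing from the embedding table are skipped.
        vector = embeddings.get(word)
        if vector is None:
            continue
        total = [a + b for a, b in zip(total, vector)]
    return total


# Hypothetical 2D embeddings; real ProtVec vectors have 100 dimensions.
toy_embeddings = {"MAF": [1.0, 0.0], "AFS": [0.0, 2.0], "FSA": [1.0, 1.0]}
print(sequence_vector("MAFSA", toy_embeddings, 2))
```
</pre>
<p>The resulting fixed-length vector is what would be passed to the support vector machine, regardless of the length of the original sequence.</p>
<p>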
The distribution of the classified proteins in a 2D space is shown in <a href="#pone-0141287-g004">Fig 4</a>.</p> <a class="link-target" id="pone-0141287-t003" name="pone-0141287-t003"></a><div class="figure" data-doi="10.1371/journal.pone.0141287.t003"><div class="img-box"><a title="Click for larger image" href="article/figure/image?size=medium&id=10.1371/journal.pone.0141287.t003" data-doi="10.1371/journal.pone.0141287" data-uri="10.1371/journal.pone.0141287.t003"><img src="article/figure/image?size=inline&id=10.1371/journal.pone.0141287.t003" alt="thumbnail" class="thumbnail" loading="lazy"></a><div class="expand"></div></div><div class="figure-inline-download"> Download: <ul><li><a href="article/figure/powerpoint?id=10.1371/journal.pone.0141287.t003"><div class="definition-label">PPT</div><div class="definition-description">PowerPoint slide</div></a></li><li><a href="article/figure/image?download&size=large&id=10.1371/journal.pone.0141287.t003"><div class="definition-label">PNG</div><div class="definition-description">larger image</div></a></li><li><a href="article/figure/image?download&size=original&id=10.1371/journal.pone.0141287.t003"><div class="definition-label">TIFF</div><div class="definition-description">original image</div></a></li></ul></div><div class="figcaption"><span>Table 3. 
</span> The performance of FG-Nups disordered protein classification in a 10-fold cross-validation using SVM.</div><p class="caption_target"></p><p class="caption_object"><a href="https://doi.org/10.1371/journal.pone.0141287.t003"> https://doi.org/10.1371/journal.pone.0141287.t003</a></p></div><a class="link-target" id="pone-0141287-g004" name="pone-0141287-g004"></a><div class="figure" data-doi="10.1371/journal.pone.0141287.g004"><div class="img-box"><a title="Click for larger image" href="article/figure/image?size=medium&id=10.1371/journal.pone.0141287.g004" data-doi="10.1371/journal.pone.0141287" data-uri="10.1371/journal.pone.0141287.g004"><img src="article/figure/image?size=inline&id=10.1371/journal.pone.0141287.g004" alt="thumbnail" class="thumbnail" loading="lazy"></a><div class="expand"></div></div><div class="figure-inline-download"> Download: <ul><li><a href="article/figure/powerpoint?id=10.1371/journal.pone.0141287.g004"><div class="definition-label">PPT</div><div class="definition-description">PowerPoint slide</div></a></li><li><a href="article/figure/image?download&size=large&id=10.1371/journal.pone.0141287.g004"><div class="definition-label">PNG</div><div class="definition-description">larger image</div></a></li><li><a href="article/figure/image?download&size=original&id=10.1371/journal.pone.0141287.g004"><div class="definition-label">TIFF</div><div class="definition-description">original image</div></a></li></ul></div><div class="figcaption"><span>Fig 4. 
</span> Classification of FG-Nups versus PDB structured sequences.</div><p class="caption_target"><a id="article1.body1.sec3.sec3.sec1.fig2.caption1.p1" name="article1.body1.sec3.sec3.sec1.fig2.caption1.p1" class="link-target"></a><p>In this figure, each point represents a protein projected into a 2D space.</p> </p><p class="caption_object"><a href="https://doi.org/10.1371/journal.pone.0141287.g004"> https://doi.org/10.1371/journal.pone.0141287.g004</a></p></div></div> <div id="section2" class="section toc-section"><a id="sec019" name="sec019" class="link-target" title="DisProt characterization"></a><h4>DisProt characterization.</h4><a id="article1.body1.sec3.sec3.sec2.p1" name="article1.body1.sec3.sec3.sec2.p1" class="link-target"></a><p>In this part, we used DisProt, which consists of 694 proteins containing 1,539 disordered and 95 ordered regions. We performed the same analysis on the DisProt sequences as we did for FG-Nups (see <a href="#pone-0141287-g003">Fig 3</a> column (c)). Since DisProt is relatively small compared to the FG-Nups dataset, the scales of columns (a) and (b) are not comparable with that of column (c) (see <a href="#pone-0141287-g003">Fig 3</a>). The visualization of disordered regions of DisProt sequences (<a href="#pone-0141287-g003">Fig 3</a> column (c), on top) revealed different characteristics from those of FG-Nups disordered regions (<a href="#pone-0141287-g003">Fig 3</a> column (a), on top). A visual comparison between Figs <a href="#pone-0141287-g003">3</a> and <a href="#pone-0141287-g002">2</a> suggests that the FG-Nups have a significantly higher number of hydrophobic residues and fewer polar residues in their disordered regions than the experimentally identified disordered proteins in DisProt [<a href="#pone.0141287.ref027" class="ref-tip">27</a>, <a href="#pone.0141287.ref029" class="ref-tip">29</a>]. 
Additionally, the DisProt disordered regions and DisProt ordered regions can be distinguished with 100% accuracy using SVM and ProtVec.</p> </div> </div> </div> <div xmlns:plos="http://plos.org" id="section4" class="section toc-section"><a id="sec020" name="sec020" data-toc="sec020" class="link-target" title="Conclusions"></a><h2>Conclusions</h2><a id="article1.body1.sec4.p1" name="article1.body1.sec4.p1" class="link-target"></a><p>An unsupervised data-driven distributed representation, called ProtVec, was proposed for applying machine learning approaches to biological sequences. By training this representation solely on protein sequences, our feature extraction approach was able to capture a diverse range of meaningful physical and chemical properties. We demonstrated that ProtVec can be used as an informative and dense representation for biological sequences in protein family classification, and obtained an average family classification accuracy of 93%.</p> <a id="article1.body1.sec4.p2" name="article1.body1.sec4.p2" class="link-target"></a><p>We further proposed ProtVec as a powerful approach for protein data visualization and showed the utility of this approach by providing an example in the characterization of disordered versus structured protein sequences. Our results suggest that ProtVec can characterize protein sequences in terms of biochemical and biophysical interpretations of the underlying patterns. In addition, this dense representation of sequences can help to discriminate between various categories of sequences, e.g. disordered proteins. Furthermore, we demonstrated that ProtVec was able to identify disordered sequences with an accuracy of nearly 100%. 
The related data is available at: <a href="http://llp.berkeley.edu">http://llp.berkeley.edu</a> and Harvard Dataverse: <a href="http://dx.doi.org/10.7910/DVN/JMFHTN">http://dx.doi.org/10.7910/DVN/JMFHTN</a>.</p> <a id="article1.body1.sec4.p3" name="article1.body1.sec4.p3" class="link-target"></a><p>Another advantage of this method is that embeddings could be trained once and then used to encode biological sequences in any given problem. In general, machine learning approaches in bioinformatics can widely benefit from bio-vectors (ProtVec and GeneVec) representation. This representation can be considered as pre-training for various applications of deep learning in bioinformatics. In particular, ProtVec can be used in protein interaction predictions, structure prediction, and protein data visualization.</p> </div> <div xmlns:plos="http://plos.org" id="section5" class="section toc-section"><a id="sec021" name="sec021" data-toc="sec021" class="link-target" title="Supporting Information"></a><h2>Supporting Information</h2><div class="figshare_widget" doi="10.1371/journal.pone.0141287"></div><div class="supplementary-material"><a name="pone.0141287.s001" id="pone.0141287.s001" class="link-target"></a><h3 class="siTitle title-small"><a href="article/file?type=supplementary&id=10.1371/journal.pone.0141287.s001">S1 File. 
</a>The results of family classification task for all 7,027 families.</h3><p class="siDoi"><a href="https://doi.org/10.1371/journal.pone.0141287.s001">https://doi.org/10.1371/journal.pone.0141287.s001</a></p><a id="article1.body1.sec5.supplementary-material1.caption1.p1" name="article1.body1.sec5.supplementary-material1.caption1.p1" class="link-target"></a><p class="postSiDOI">(XLSX)</p> </div></div> <div xmlns:plos="http://plos.org" class="section toc-section"><a id="ack" name="ack" data-toc="ack" title="Acknowledgments" class="link-target"></a><h2>Acknowledgments</h2> <a id="article1.back1.ack1.p1" name="article1.back1.ack1.p1" class="link-target"></a><p>Fruitful discussions with Kiavash Garakani, Mohammad Soheilypour, Zeinab Jahed, Mohaddeseh Peyro, Hengameh Shams and other members of the Molecular Cell Biomechanics Lab at the University of California Berkeley are gratefully acknowledged.</p> </div><div xmlns:plos="http://plos.org" class="contributions toc-section"><a id="authcontrib" name="authcontrib" data-toc="authcontrib" title="Author Contributions"></a><h2>Author Contributions</h2><p>Conceived and designed the experiments: EA MRKM. Performed the experiments: EA. Analyzed the data: EA MRKM. Contributed reagents/materials/analysis tools: MRKM. Wrote the paper: EA MRKM.</p></div><div xmlns:plos="http://plos.org" class="toc-section"><a id="references" name="references" class="link-target" data-toc="references" title="References"></a><h2>References</h2><ol class="references"><li id="ref1"><span class="order">1. </span><a name="pone.0141287.ref001" id="pone.0141287.ref001" class="link-target"></a> Yandell MD, Majoros WH. Genomics and natural language processing. Nature Reviews Genetics. 2002;3(8):601–610. 
pmid:12154383 <ul class="reflinks"><li><a href="#" data-author="Yandell" data-cit="%0AYandellMD%2C%20MajorosWH.%20Genomics%20and%20natural%20language%20processing.%20Nature%20Reviews%20Genetics.%202002%3B3%288%29%3A601%E2%80%93610.%2012154383" data-title="Genomics%20and%20natural%20language%20processing" target="_new" title="Go to article in CrossRef"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/12154383" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=Genomics+and+natural+language+processing+Yandell+2002" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref2"><span class="order">2. </span><a name="pone.0141287.ref002" id="pone.0141287.ref002" class="link-target"></a> Searls DB. The language of genes. Nature. 2002;420(6912):211–217. pmid:12432405 <ul class="reflinks" data-doi="10.1038/nature01255"><li><a href="https://doi.org/10.1038/nature01255" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/12432405" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=The+language+of+genes+Searls+2002" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref3"><span class="order">3. </span><a name="pone.0141287.ref003" id="pone.0141287.ref003" class="link-target"></a> Motomura K, Fujita T, Tsutsumi M, Kikuzato S, Nakamura M, Otaki JM. Word decoding of protein amino acid sequences with availability analysis: a linguistic approach. PloS one. 2012;7(11):e50039. 
pmid:23185527 <ul class="reflinks" data-doi="10.1371/journal.pone.0050039"><li><a href="https://doi.org/10.1371/journal.pone.0050039" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/23185527" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=Word+decoding+of+protein+amino+acid+sequences+with+availability+analysis%3A+a+linguistic+approach+Motomura+2012" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref4"><span class="order">4. </span><a name="pone.0141287.ref004" id="pone.0141287.ref004" class="link-target"></a> Cai Y, Lux MW, Adam L, Peccoud J. Modeling structure-function relationships in synthetic DNA sequences using attribute grammars. PLoS Comput Biol. 2009;5(10):e1000529. pmid:19816554 <ul class="reflinks" data-doi="10.1371/journal.pcbi.1000529"><li><a href="https://doi.org/10.1371/journal.pcbi.1000529" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/19816554" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=Modeling+structure-function+relationships+in+synthetic+DNA+sequences+using+attribute+grammars+Cai+2009" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref5"><span class="order">5. </span><a name="pone.0141287.ref005" id="pone.0141287.ref005" class="link-target"></a> Suykens JA, Vandewalle J. Least squares support vector machine classifiers. Neural processing letters. 1999;9(3):293–300. 
<ul class="reflinks" data-doi="10.1023/A:1018628609742"><li><a href="https://doi.org/10.1023/A:1018628609742" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://scholar.google.com/scholar?q=Least+squares+support+vector+machine+classifiers+Suykens+1999" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref6"><span class="order">6. </span><a name="pone.0141287.ref006" id="pone.0141287.ref006" class="link-target"></a>Hinton GE. Distributed representations. School of Computer Science at Carnegie Mellon University. 1984;. <ul class="find-nolinks"></ul></li><li id="ref7"><span class="order">7. </span><a name="pone.0141287.ref007" id="pone.0141287.ref007" class="link-target"></a> Lasko TA, Denny JC, Levy MA. Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data. PloS one. 2013;8(6):e66341. pmid:23826094 <ul class="reflinks" data-doi="10.1371/journal.pone.0066341"><li><a href="https://doi.org/10.1371/journal.pone.0066341" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/23826094" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=Computational+phenotype+discovery+using+unsupervised+feature+learning+over+noisy%2C+sparse%2C+and+irregular+clinical+data+Lasko+2013" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref8"><span class="order">8. </span><a name="pone.0141287.ref008" id="pone.0141287.ref008" class="link-target"></a> Xiong HY, Alipanahi B, Lee LJ, Bretschneider H, Merico D, Yuen RK, et al. The human splicing code reveals new insights into the genetic determinants of disease. Science. 2015;347(6218):1254806. 
pmid:25525159 <ul class="reflinks" data-doi="10.1126/science.1254806"><li><a href="https://doi.org/10.1126/science.1254806" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/25525159" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=The+human+splicing+code+reveals+new+insights+into+the+genetic+determinants+of+disease+Xiong+2015" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref9"><span class="order">9. </span><a name="pone.0141287.ref009" id="pone.0141287.ref009" class="link-target"></a> Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P. Natural language processing (almost) from scratch. The Journal of Machine Learning Research. 2011;12:2493–2537. <ul class="reflinks"><li><a href="#" data-author="Collobert" data-cit="%0ACollobertR%2C%20WestonJ%2C%20BottouL%2C%20KarlenM%2C%20KavukcuogluK%2C%20KuksaP.%20Natural%20language%20processing%20%28almost%29%20from%20scratch.%20The%20Journal%20of%20Machine%20Learning%20Research.%202011%3B12%3A2493%E2%80%932537." data-title="Natural%20language%20processing%20%28almost%29%20from%20scratch" target="_new" title="Go to article in CrossRef"> View Article </a></li><li><a href="http://scholar.google.com/scholar?q=Natural+language+processing+%28almost%29+from+scratch+Collobert+2011" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref10"><span class="order">10. </span><a name="pone.0141287.ref010" id="pone.0141287.ref010" class="link-target"></a> Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems; 2013. p. 3111–3119. 
<ul class="reflinks"><li><a href="#" data-author="Mikolov" data-cit="%0AMikolovT%2C%20SutskeverI%2C%20ChenK%2C%20CorradoGS%2C%20DeanJ.%20Distributed%20representations%20of%20words%20and%20phrases%20and%20their%20compositionality.%20In%3A%20Advances%20in%20neural%20information%20processing%20systems%3B%202013.%20p.%203111%E2%80%933119." data-title="Distributed%20representations%20of%20words%20and%20phrases%20and%20their%20compositionality" target="_new" title="Go to article in CrossRef"> View Article </a></li><li><a href="http://scholar.google.com/scholar?q=Distributed+representations+of+words+and+phrases+and+their+compositionality+Mikolov+2013" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref11"><span class="order">11. </span><a name="pone.0141287.ref011" id="pone.0141287.ref011" class="link-target"></a>Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:13013781. 2013;. <ul class="find-nolinks"></ul></li><li id="ref12"><span class="order">12. </span><a name="pone.0141287.ref012" id="pone.0141287.ref012" class="link-target"></a> Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic acids research. 2002;30(7):1575–1584. pmid:11917018 <ul class="reflinks" data-doi="10.1093/nar/30.7.1575"><li><a href="https://doi.org/10.1093/nar/30.7.1575" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/11917018" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=An+efficient+algorithm+for+large-scale+detection+of+protein+families+Enright+2002" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref13"><span class="order">13. 
</span><a name="pone.0141287.ref013" id="pone.0141287.ref013" class="link-target"></a> Bork P, Dandekar T, Diaz-Lazcoz Y, Eisenhaber F, Huynen M, Yuan Y. Predicting function: from genes to genomes and back. Journal of molecular biology. 1998;283(4):707–725. pmid:9790834 <ul class="reflinks" data-doi="10.1006/jmbi.1998.2144"><li><a href="https://doi.org/10.1006/jmbi.1998.2144" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/9790834" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=Predicting+function%3A+from+genes+to+genomes+and+back+Bork+1998" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref14"><span class="order">14. </span><a name="pone.0141287.ref014" id="pone.0141287.ref014" class="link-target"></a> Remmert M, Biegert A, Hauser A, Söding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nature methods. 2012;9(2):173–175. <ul class="reflinks" data-doi="10.1038/nmeth.1818"><li><a href="https://doi.org/10.1038/nmeth.1818" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://scholar.google.com/scholar?q=HHblits%3A+lightning-fast+iterative+protein+sequence+searching+by+HMM-HMM+alignment+Remmert+2012" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref15"><span class="order">15. </span><a name="pone.0141287.ref015" id="pone.0141287.ref015" class="link-target"></a> Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, et al. Pfam: the protein families database. Nucleic acids research. 2013;p. gkt1223. 
<ul class="reflinks"><li><a href="#" data-author="Finn" data-cit="%0AFinnRD%2C%20BatemanA%2C%20ClementsJ%2C%20CoggillP%2C%20EberhardtRY%2C%20EddySR%2C%20et%20al.%20Pfam%3A%20the%20protein%20families%20database.%20Nucleic%20acids%20research.%202013%3Bp.%20gkt1223." data-title="Pfam%3A%20the%20protein%20families%20database" target="_new" title="Go to article in CrossRef"> View Article </a></li><li><a href="http://scholar.google.com/scholar?q=Pfam%3A+the+protein+families+database+Finn+2013" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref16"><span class="order">16. </span><a name="pone.0141287.ref016" id="pone.0141287.ref016" class="link-target"></a> Cai C, Han L, Ji ZL, Chen X, Chen YZ. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic acids research. 2003;31(13):3692–3697. pmid:12824396 <ul class="reflinks" data-doi="10.1093/nar/gkg600"><li><a href="https://doi.org/10.1093/nar/gkg600" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/12824396" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=SVM-Prot%3A+web-based+support+vector+machine+software+for+functional+classification+of+a+protein+from+its+primary+sequence+Cai+2003" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref17"><span class="order">17. </span><a name="pone.0141287.ref017" id="pone.0141287.ref017" class="link-target"></a>Leslie CS, Eskin E, Noble WS. The spectrum kernel: A string kernel for SVM protein classification. In: Pacific symposium on biocomputing. vol. 7. World Scientific; 2002. p. 566–575. <ul class="find-nolinks"></ul></li><li id="ref18"><span class="order">18. 
</span><a name="pone.0141287.ref018" id="pone.0141287.ref018" class="link-target"></a> Huynen M, Snel B, Lathe W, Bork P. Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome research. 2000;10(8):1204–1210. pmid:10958638 <ul class="reflinks" data-doi="10.1101/gr.10.8.1204"><li><a href="https://doi.org/10.1101/gr.10.8.1204" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/10958638" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=Predicting+protein+function+by+genomic+context%3A+quantitative+evaluation+and+qualitative+inferences+Huynen+2000" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref19"><span class="order">19. </span><a name="pone.0141287.ref019" id="pone.0141287.ref019" class="link-target"></a> Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of molecular biology. 1995;247(4):536–540. pmid:7723011 <ul class="reflinks" data-doi="10.1016/S0022-2836(05)80134-2"><li><a href="https://doi.org/10.1016/S0022-2836(05)80134-2" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/7723011" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=SCOP%3A+a+structural+classification+of+proteins+database+for+the+investigation+of+sequences+and+structures+Murzin+1995" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref20"><span class="order">20. 
</span><a name="pone.0141287.ref020" id="pone.0141287.ref020" class="link-target"></a> Aragues R, Sali A, Bonet J, Marti-Renom MA, Oliva B. Characterization of protein hubs by inferring interacting motifs from protein interactions. PloS Computational Biology. 2007;3.9:e178. <ul class="reflinks" data-doi="10.1371/journal.pcbi.0030178"><li><a href="https://doi.org/10.1371/journal.pcbi.0030178" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://scholar.google.com/scholar?q=Characterization+of+protein+hubs+by+inferring+interacting+motifs+from+protein+interactions+Aragues+2007" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref21"><span class="order">21. </span><a name="pone.0141287.ref021" id="pone.0141287.ref021" class="link-target"></a> Dunker AK, Silman I, Uversky VN, Sussman JL. Function and structure of inherently disordered proteins. Current opinion in structural biology. 2008;18(6):756–764. pmid:18952168 <ul class="reflinks" data-doi="10.1016/j.sbi.2008.10.002"><li><a href="https://doi.org/10.1016/j.sbi.2008.10.002" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/18952168" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=Function+and+structure+of+inherently+disordered+proteins+Dunker+2008" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref22"><span class="order">22. </span><a name="pone.0141287.ref022" id="pone.0141287.ref022" class="link-target"></a> Dyson HJ, Wright PE. Intrinsically unstructured proteins and their functions. Nature reviews Molecular cell biology. 2005;6(3):197–208. 
pmid:15738986 <ul class="reflinks" data-doi="10.1038/nrm1589"><li><a href="https://doi.org/10.1038/nrm1589" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/15738986" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=Intrinsically+unstructured+proteins+and+their+functions+Dyson+2005" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref23"><span class="order">23. </span><a name="pone.0141287.ref023" id="pone.0141287.ref023" class="link-target"></a> Sugase K, Dyson HJ, Wright PE. Mechanism of coupled folding and binding of an intrinsically disordered protein. Nature. 2007;447(7147):1021–1025. pmid:17522630 <ul class="reflinks" data-doi="10.1038/nature05858"><li><a href="https://doi.org/10.1038/nature05858" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/17522630" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=Mechanism+of+coupled+folding+and+binding+of+an+intrinsically+disordered+protein+Sugase+2007" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref24"><span class="order">24. </span><a name="pone.0141287.ref024" id="pone.0141287.ref024" class="link-target"></a> He B, Wang K, Liu Y, Xue B, Uversky VN, Dunker AK. Predicting intrinsic disorder in proteins: an overview. Cell research. 2009;19(8):929–949. 
pmid:19597536 <ul class="reflinks" data-doi="10.1038/cr.2009.87"><li><a href="https://doi.org/10.1038/cr.2009.87" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/19597536" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=Predicting+intrinsic+disorder+in+proteins%3A+an+overview+He+2009" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref25"><span class="order">25. </span><a name="pone.0141287.ref025" id="pone.0141287.ref025" class="link-target"></a> Jamali T, Jamali Y, Mehrbod M, Mofrad M. Nuclear pore complex: biochemistry and biophysics of nucleocytoplasmic transport in health and disease. Int Rev Cell Mol Biol. 2011;287:233–286. pmid:21414590 <ul class="reflinks" data-doi="10.1016/B978-0-12-386043-9.00006-2"><li><a href="https://doi.org/10.1016/B978-0-12-386043-9.00006-2" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/21414590" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=Nuclear+pore+complex%3A+biochemistry+and+biophysics+of+nucleocytoplasmic+transport+in+health+and+disease+Jamali+2011" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref26"><span class="order">26. </span><a name="pone.0141287.ref026" id="pone.0141287.ref026" class="link-target"></a> Sickmeier M, Hamilton JA, LeGall T, Vacic V, Cortese MS, Tantos A, et al. DisProt: the database of disordered proteins. Nucleic acids research. 2007;35(suppl 1):D786–D793. 
pmid:17145717 <ul class="reflinks" data-doi="10.1093/nar/gkl893"><li><a href="https://doi.org/10.1093/nar/gkl893" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/17145717" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=DisProt%3A+the+database+of+disordered+proteins+Sickmeier+2007" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref27"><span class="order">27. </span><a name="pone.0141287.ref027" id="pone.0141287.ref027" class="link-target"></a> Ando D, Colvin M, Rexach M, Gopinathan A. Physical motif clustering within intrinsically disordered nucleoporin sequences reveals universal functional features. PloS one. 2013;8(9):e73831. pmid:24066078 <ul class="reflinks" data-doi="10.1371/journal.pone.0073831"><li><a href="https://doi.org/10.1371/journal.pone.0073831" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/24066078" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=Physical+motif+clustering+within+intrinsically+disordered+nucleoporin+sequences+reveals+universal+functional+features+Ando+2013" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref28"><span class="order">28. </span><a name="pone.0141287.ref028" id="pone.0141287.ref028" class="link-target"></a> Azimi M, Mofrad MR. Higher Nucleoporin-Importin<em>β</em> Affinity at the Nuclear Basket Increases Nucleocytoplasmic Import. PloS one. 2013;8(11):e81741. 
pmid:24282617 <ul class="reflinks" data-doi="10.1371/journal.pone.0081741"><li><a href="https://doi.org/10.1371/journal.pone.0081741" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/24282617" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=Higher+Nucleoporin-Importin%CE%B2+Affinity+at+the+Nuclear+Basket+Increases+Nucleocytoplasmic+Import+Azimi+2013" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref29"><span class="order">29. </span><a name="pone.0141287.ref029" id="pone.0141287.ref029" class="link-target"></a>Peyro M, Soheilypour M, Lee BL, Mofrad M. Evolutionary conserved sequence features optimizes nucleoporins behavior for cargo transportation through nuclear pore complex. Scientific Reports. In press 2015. <ul class="find-nolinks"></ul></li><li id="ref30"><span class="order">30. </span><a name="pone.0141287.ref030" id="pone.0141287.ref030" class="link-target"></a> Procter JB, Thompson J, Letunic I, Creevey C, Jossinet F, Barton GJ. Visualization of multiple alignments, phylogenies and gene family evolution. Nature methods. 2010;7:S16–S25. pmid:20195253 <ul class="reflinks" data-doi="10.1038/nmeth.1434"><li><a href="https://doi.org/10.1038/nmeth.1434" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/20195253" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=Visualization+of+multiple+alignments%2C+phylogenies+and+gene+family+evolution+Procter+2010" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref31"><span class="order">31. 
</span><a name="pone.0141287.ref031" id="pone.0141287.ref031" class="link-target"></a> Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream MA, et al. Artemis: sequence visualization and annotation. Bioinformatics. 2000;16(10):944–945. pmid:11120685 <ul class="reflinks" data-doi="10.1093/bioinformatics/16.10.944"><li><a href="https://doi.org/10.1093/bioinformatics/16.10.944" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/11120685" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=Artemis%3A+sequence+visualization+and+annotation+Rutherford+2000" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref32"><span class="order">32. </span><a name="pone.0141287.ref032" id="pone.0141287.ref032" class="link-target"></a>Ganapathiraju M, Weisser D, Rosenfeld R, Carbonell J, Reddy R, Klein-Seetharaman J. Comparative n-gram analysis of whole-genome protein sequences. In: Proceedings of the second international conference on Human Language Technology Research. Morgan Kaufmann Publishers Inc.; 2002. p. 76–81. <ul class="find-nolinks"></ul></li><li id="ref33"><span class="order">33. </span><a name="pone.0141287.ref033" id="pone.0141287.ref033" class="link-target"></a> Srinivasan SM, Vural S, King BR, Guda C. Mining for class-specific motifs in protein sequence classification. BMC bioinformatics. 2013;14(1):96. 
pmid:23496846 <ul class="reflinks" data-doi="10.1186/1471-2105-14-96"><li><a href="https://doi.org/10.1186/1471-2105-14-96" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/23496846" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=Mining+for+class-specific+motifs+in+protein+sequence+classification+Srinivasan+2013" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref34"><span class="order">34. </span><a name="pone.0141287.ref034" id="pone.0141287.ref034" class="link-target"></a> Vries JK, Liu X. Subfamily specific conservation profiles for proteins based on n-gram patterns. BMC bioinformatics. 2008;9(1):72. pmid:18234090 <ul class="reflinks" data-doi="10.1186/1471-2105-9-72"><li><a href="https://doi.org/10.1186/1471-2105-9-72" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/18234090" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=Subfamily+specific+conservation+profiles+for+proteins+based+on+n-gram+patterns+Vries+2008" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref35"><span class="order">35. </span><a name="pone.0141287.ref035" id="pone.0141287.ref035" class="link-target"></a>Goldberg Y, Levy O. word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722. 2014. <ul class="find-nolinks"></ul></li><li id="ref36"><span class="order">36. </span><a name="pone.0141287.ref036" id="pone.0141287.ref036" class="link-target"></a> Van der Maaten L, Hinton G. Visualizing data using t-SNE. 
Journal of Machine Learning Research. 2008;9:2579–2605. <ul class="reflinks"><li><a href="#" data-author="Van%20der%20Maaten" data-cit="%0AVan%20der%20MaatenL%2C%20HintonG.%20Visualizing%20data%20using%20t-SNE.%20Journal%20of%20Machine%20Learning%20Research.%202008%3B9%282579%E2%80%932605%29%3A85." data-title="Visualizing%20data%20using%20t-SNE" target="_new" title="Go to article in CrossRef"> View Article </a></li><li><a href="http://scholar.google.com/scholar?q=Visualizing+data+using+t-SNE+Van+der+Maaten+2008" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref37"><span class="order">37. </span><a name="pone.0141287.ref037" id="pone.0141287.ref037" class="link-target"></a> McGregor E. Proteins and proteomics: A laboratory manual. Journal of Proteome Research. 2004;3(4):694–694. <ul class="reflinks" data-doi="10.1021/pr040022a"><li><a href="https://doi.org/10.1021/pr040022a" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://scholar.google.com/scholar?q=Proteins+and+proteomics%3A+A+laboratory+manual+McGregor+2004" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref38"><span class="order">38. </span><a name="pone.0141287.ref038" id="pone.0141287.ref038" class="link-target"></a> Rose PW, Bi C, Bluhm WF, Christie CH, Dimitropoulos D, Dutta S, et al. The RCSB Protein Data Bank: new resources for research and education. Nucleic acids research. 2013;41(D1):D475–D482. 
pmid:23193259 <ul class="reflinks" data-doi="10.1093/nar/gks1200"><li><a href="https://doi.org/10.1093/nar/gks1200" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/23193259" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=The+RCSB+Protein+Data+Bank%3A+new+resources+for+research+and+education+Rose+2013" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li><li id="ref39"><span class="order">39. </span><a name="pone.0141287.ref039" id="pone.0141287.ref039" class="link-target"></a> Platzer A. Visualization of SNPs with t-SNE. PloS one. 2013;8(2):e56883. pmid:23457633 <ul class="reflinks" data-doi="10.1371/journal.pone.0056883"><li><a href="https://doi.org/10.1371/journal.pone.0056883" data-author="doi-provided" data-cit="doi-provided" data-title="doi-provided" target="_new" title="Go to article"> View Article </a></li><li><a href="http://www.ncbi.nlm.nih.gov/pubmed/23457633" target="_new" title="Go to article in PubMed"> PubMed/NCBI </a></li><li><a href="http://scholar.google.com/scholar?q=Visualization+of+SNPs+with+t-SNE+Platzer+2013" target="_new" title="Go to article in Google Scholar"> Google Scholar </a></li></ul></li></ol></div> <div class="ref-tooltip"> <div class="ref_tooltip-content"> </div> </div> </div> </div> </div> </section> <aside class="article-aside"> <!--[if IE 9]> <style> .dload-xml {margin-top: 38px} </style> <![endif]--> <div class="dload-menu"> <div class="dload-pdf"> <a href="/plosone/article/file?id=10.1371/journal.pone.0141287&type=printable" id="downloadPdf" target="_blank">Download PDF</a> </div> <div data-js-tooltip-hover="trigger" class="dload-hover"> <ul class="dload-xml" data-js-tooltip-hover="target"> <li><a href="/plosone/article/citation?id=10.1371/journal.pone.0141287" 
id="downloadCitation">Citation</a></li> <li><a href="/plosone/article/file?id=10.1371/journal.pone.0141287&type=manuscript" id="downloadXml">XML</a> </li> </ul> </div> </div> <div class="aside-container"> <div class="print-article" id="printArticle" data-js-tooltip-hover="trigger"> <a href="#" onclick="window.print(); return false;" class="preventDefault" id="printBrowser">Print</a> </div> </div> <!-- Crossmark 2.0 widget --> <script src="https://crossmark-cdn.crossref.org/widget/v2.0/widget.js"></script> <a aria-label="Check for updates via CrossMark" data-target="crossmark"> <img alt="Check for updates via CrossMark" width="150" src="https://crossmark-cdn.crossref.org/widget/v2.0/logos/CROSSMARK_BW_horizontal.svg"> </a> <!-- End Crossmark 2.0 widget --> <div class="aside-container collections-aside-container"><!-- React Magic --></div> <div class="subject-areas-container"> <h3>Subject Areas <div id="subjInfo">?</div> <div id="subjInfoText"> <p>For more information about PLOS Subject Areas, click <a href="https://github.com/PLOS/plos-thesaurus/blob/master/README.md" target="_blank" title="Link opens in new window">here</a>.</p> <span 
class="inline-intro">We want your feedback.</span> Do these Subject Areas make sense for this article? Click the target next to the incorrect Subject Area and let us know. Thanks for your help! </div> </h3> <ul id="subjectList"> <li> <a class="taxo-term" title="Search for articles about Protein domains" href="/plosone/search?filterSubjects=Protein+domains&filterJournals=PLoSONE&q=">Protein domains</a> <span class="taxo-flag"> </span> <div class="taxo-tooltip" data-categoryname="Protein domains"><p class="taxo-explain">Is the Subject Area <strong>"Protein domains"</strong> applicable to this article? <button id="noFlag" data-action="remove">Yes</button> <button id="flagIt" value="flagno" data-action="add">No</button></p> <p class="taxo-confirm">Thanks for your feedback.</p> </div> </li> <li> <a class="taxo-term" title="Search for articles about Structural proteins" href="/plosone/search?filterSubjects=Structural+proteins&filterJournals=PLoSONE&q=">Structural proteins</a> <span class="taxo-flag"> </span> <div class="taxo-tooltip" data-categoryname="Structural proteins"><p class="taxo-explain">Is the Subject Area <strong>"Structural proteins"</strong> applicable to this article? <button id="noFlag" data-action="remove">Yes</button> <button id="flagIt" value="flagno" data-action="add">No</button></p> <p class="taxo-confirm">Thanks for your feedback.</p> </div> </li> <li> <a class="taxo-term" title="Search for articles about Biophysics" href="/plosone/search?filterSubjects=Biophysics&filterJournals=PLoSONE&q=">Biophysics</a> <span class="taxo-flag"> </span> <div class="taxo-tooltip" data-categoryname="Biophysics"><p class="taxo-explain">Is the Subject Area <strong>"Biophysics"</strong> applicable to this article? 
<button id="noFlag" data-action="remove">Yes</button> <button id="flagIt" value="flagno" data-action="add">No</button></p> <p class="taxo-confirm">Thanks for your feedback.</p> </div> </li> <li> <a class="taxo-term" title="Search for articles about Protein structure prediction" href="/plosone/search?filterSubjects=Protein+structure+prediction&filterJournals=PLoSONE&q=">Protein structure prediction</a> <span class="taxo-flag"> </span> <div class="taxo-tooltip" data-categoryname="Protein structure prediction"><p class="taxo-explain">Is the Subject Area <strong>"Protein structure prediction"</strong> applicable to this article? <button id="noFlag" data-action="remove">Yes</button> <button id="flagIt" value="flagno" data-action="add">No</button></p> <p class="taxo-confirm">Thanks for your feedback.</p> </div> </li> <li> <a class="taxo-term" title="Search for articles about Protein structure databases" href="/plosone/search?filterSubjects=Protein+structure+databases&filterJournals=PLoSONE&q=">Protein structure databases</a> <span class="taxo-flag"> </span> <div class="taxo-tooltip" data-categoryname="Protein structure databases"><p class="taxo-explain">Is the Subject Area <strong>"Protein structure databases"</strong> applicable to this article? <button id="noFlag" data-action="remove">Yes</button> <button id="flagIt" value="flagno" data-action="add">No</button></p> <p class="taxo-confirm">Thanks for your feedback.</p> </div> </li> <li> <a class="taxo-term" title="Search for articles about Support vector machines" href="/plosone/search?filterSubjects=Support+vector+machines&filterJournals=PLoSONE&q=">Support vector machines</a> <span class="taxo-flag"> </span> <div class="taxo-tooltip" data-categoryname="Support vector machines"><p class="taxo-explain">Is the Subject Area <strong>"Support vector machines"</strong> applicable to this article? 
<button id="noFlag" data-action="remove">Yes</button> <button id="flagIt" value="flagno" data-action="add">No</button></p> <p class="taxo-confirm">Thanks for your feedback.</p> </div> </li> <li> <a class="taxo-term" title="Search for articles about Bioinformatics" href="/plosone/search?filterSubjects=Bioinformatics&filterJournals=PLoSONE&q=">Bioinformatics</a> <span class="taxo-flag"> </span> <div class="taxo-tooltip" data-categoryname="Bioinformatics"><p class="taxo-explain">Is the Subject Area <strong>"Bioinformatics"</strong> applicable to this article? <button id="noFlag" data-action="remove">Yes</button> <button id="flagIt" value="flagno" data-action="add">No</button></p> <p class="taxo-confirm">Thanks for your feedback.</p> </div> </li> <li> <a class="taxo-term" title="Search for articles about Protein structure" href="/plosone/search?filterSubjects=Protein+structure&filterJournals=PLoSONE&q=">Protein structure</a> <span class="taxo-flag"> </span> <div class="taxo-tooltip" data-categoryname="Protein structure"><p class="taxo-explain">Is the Subject Area <strong>"Protein structure"</strong> applicable to this article? 
<button id="noFlag" data-action="remove">Yes</button> <button id="flagIt" value="flagno" data-action="add">No</button></p> <p class="taxo-confirm">Thanks for your feedback.</p> </div> </li> </ul> </div> <div id="subjectErrors"></div> </aside> </div> </main> <footer id="pageftr"> <div class="row"> <div class="block x-small"> <ul class="nav nav-secondary"> <li class="ftr-header"><a href="https://plos.org/publications/journals/">Publications</a></li> <li><a href="/plosbiology/" id="ftr-bio">PLOS Biology</a></li> <li><a href="/climate/" id="ftr-climate">PLOS Climate</a></li> <li><a href="/complexsystems/" id="ftr-complex-systems">PLOS Complex Systems</a></li> <li><a href="/ploscompbiol/" id="ftr-compbio">PLOS Computational Biology</a></li> <li><a href="/digitalhealth/" id="ftr-digitalhealth">PLOS Digital Health</a></li> <li><a href="/plosgenetics/" id="ftr-gen">PLOS Genetics</a></li> <li><a href="/globalpublichealth/" id="ftr-globalpublichealth">PLOS Global Public Health</a></li> </ul> </div> <div class="block x-small"> <ul class="nav nav-secondary"> <li class="ftr-header"> </li> <li><a href="/plosmedicine/" id="ftr-med">PLOS Medicine</a></li> <li><a href="/mentalhealth/" id="ftr-mental-health">PLOS Mental Health</a></li> <li><a href="/plosntds/" id="ftr-ntds">PLOS Neglected Tropical Diseases</a></li> <li><a href="/plosone/" id="ftr-one">PLOS ONE</a></li> <li><a href="/plospathogens/" id="ftr-path">PLOS Pathogens</a></li> <li><a href="/sustainabilitytransformation/" id="ftr-sustainabilitytransformation">PLOS Sustainability and Transformation</a></li> <li><a href="/water/" id="ftr-water">PLOS Water</a></li> </ul> </div> <div class="block xx-small"> <ul class="nav nav-tertiary"> <li> <a href="https://plos.org" id="ftr-home">Home</a> </li> <li> <a href="https://blogs.plos.org" id="ftr-blog">Blogs</a> </li> <li> <a href="https://collections.plos.org/" id="ftr-collections">Collections</a> </li> <li> <a href="mailto:webmaster@plos.org" id="ftr-feedback">Give feedback</a> 
</li> <li> <a href="/plosone/lockss-manifest" id="ftr-lockss">LOCKSS</a> </li> </ul> </div> <div class="block xx-small"> <ul class="nav nav-primary"> <li><a href="https://plos.org/privacy-policy" id="ftr-privacy">Privacy Policy</a></li> <li><a href="https://plos.org/terms-of-use" id="ftr-terms">Terms of Use</a></li> <li><a href="https://plos.org/advertise/" id="ftr-advertise">Advertise</a></li> <li><a href="https://plos.org/media-inquiries" id="ftr-media">Media Inquiries</a></li> <li><a href="https://plos.org/contact" id="ftr-contact">Contact</a></li> </ul> </div> </div> <div class="row"> <p> <img src="/resource/img/logo-plos-footer.png" alt="PLOS" class="logo-footer"/> <span class="footer-non-profit-statement">PLOS is a nonprofit 501(c)(3) corporation, #C2354500, based in San Francisco, California, US</span> </p> <div class="block"> </div> </div> <script src="/resource/js/global.js" type="text/javascript"></script> </footer> <script type="text/javascript"> var ArticleData = { doi: '10.1371/journal.pone.0141287', title: '<article-title xmlns:mml=\"http://www.w3.org/1998/Math/MathML\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics<\/article-title>', date: 'Nov 10, 2015' }; </script> <script src="/resource/js/components/show_onscroll.js" type="text/javascript"></script> <script src="/resource/js/components/pagination.js" type="text/javascript"></script> <script src="/resource/js/vendor/spin.js" type="text/javascript"></script> <script src="/resource/js/pages/article.js" type="text/javascript"></script> <script src="/resource/js/pages/article_references.js" type="text/javascript"></script> <script src="/resource/js/pages/article_sidebar.js" type="text/javascript"></script> <script src="/resource/js/vendor/foundation/foundation.dropdown.js" type="text/javascript"></script> <script src="/resource/js/components/table_open.js" type="text/javascript"></script> <script 
src="/resource/js/components/figshare.js" type="text/javascript"></script> <script src="/resource/js/vendor/jquery.panzoom.min.js" type="text/javascript"></script> <script src="/resource/js/vendor/jquery.mousewheel.js" type="text/javascript"></script> <script src="/resource/js/components/lightbox.js" type="text/javascript"></script> <script src="/resource/js/pages/article_body.js" type="text/javascript"></script> <!-- This file should be loaded before the renderJs, to avoid conflicts with the FigShare, that implements the MathJax also. --> <!-- mathjax configuration options --> <!-- more can be found at http://docs.mathjax.org/en/latest/ --> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ "HTML-CSS": { scale: 100, availableFonts: ["STIX","TeX"], preferredFont: "STIX", webFont: "STIX-Web", linebreaks: { automatic: false } }, jax: ["input/MathML", "output/HTML-CSS"] }); </script> <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=MML_HTMLorMML"></script> <div class="reveal-modal-bg"></div> </body> </html>