<p class="list-title is-inline-block"><a href="">arXiv:2401.00775</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Applications">stat.AP</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Information Retrieval">cs.IR</span> </div> <div class="is-inline-block" style="margin-left: 0.5rem"> <div class="tags has-addons"> <span class="tag is-dark is-size-7">doi</span> <span class="tag is-light is-size-7"><a class="" href="">10.1146/annurev-statistics-040522-022138 <i class="fa fa-external-link" aria-hidden="true"></i></a></span> </div> </div> </div> <p class="title is-5 mathjax"> Recent Advances in Text Analysis </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/stat?searchtype=author&amp;query=Ke%2C+Z+T">Zheng Tracy Ke</a>, <a href="/search/stat?searchtype=author&amp;query=Ji%2C+P">Pengsheng Ji</a>, <a href="/search/stat?searchtype=author&amp;query=Jin%2C+J">Jiashun Jin</a>, <a href="/search/stat?searchtype=author&amp;query=Li%2C+W">Wanshan Li</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="2401.00775v2-abstract-short" style="display: inline;"> Text analysis is an interesting research area in data science and has various applications, such as in artificial intelligence, biomedical research, and engineering. We review popular methods for text analysis, ranging from topic modeling to the recent neural language models. 
In particular, we review Topic-SCORE, a statistical approach to topic modeling, and discuss how to use it to analyze MADSta&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2401.00775v2-abstract-full').style.display = 'inline'; document.getElementById('2401.00775v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2401.00775v2-abstract-full" style="display: none;"> Text analysis is an interesting research area in data science and has various applications, such as in artificial intelligence, biomedical research, and engineering. We review popular methods for text analysis, ranging from topic modeling to the recent neural language models. In particular, we review Topic-SCORE, a statistical approach to topic modeling, and discuss how to use it to analyze MADStat - a dataset on statistical publications that we collected and cleaned. The application of Topic-SCORE and other methods on MADStat leads to interesting findings. For example, $11$ representative topics in statistics are identified. For each journal, the evolution of topic weights over time can be visualized, and these results are used to analyze the trends in statistical research. In particular, we propose a new statistical model for ranking the citation impacts of $11$ topics, and we also build a cross-topic citation graph to illustrate how research results on different topics spread to one another. The results on MADStat provide a data-driven picture of the statistical research in $1975$--$2015$, from a text analysis perspective. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2401.00775v2-abstract-full').style.display = 'none'; document.getElementById('2401.00775v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 7 February, 2024; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 1 January, 2024; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> January 2024. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Journal ref:</span> Annual Review of Statistics and Its Application 2024 11:1 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:2008.03820</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Social and Information Networks">cs.SI</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Statistics Theory">math.ST</span> </div> </div> <p class="title is-5 mathjax"> Spectral Algorithms for Community Detection in Directed Networks </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/stat?searchtype=author&amp;query=Wang%2C+Z">Zhe Wang</a>, <a href="/search/stat?searchtype=author&amp;query=Liang%2C+Y">Yingbin Liang</a>, <a href="/search/stat?searchtype=author&amp;query=Ji%2C+P">Pengsheng Ji</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" 
id="2008.03820v1-abstract-short" style="display: inline;"> Community detection in large social networks is affected by the degree heterogeneity of nodes. The D-SCORE algorithm for directed networks was introduced to reduce this effect by taking the element-wise ratios of the singular vectors of the adjacency matrix before clustering. Meaningful results were obtained for the statistician citation network, but a rigorous analysis of its performance was missing. F&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2008.03820v1-abstract-full').style.display = 'inline'; document.getElementById('2008.03820v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="2008.03820v1-abstract-full" style="display: none;"> Community detection in large social networks is affected by the degree heterogeneity of nodes. The D-SCORE algorithm for directed networks was introduced to reduce this effect by taking the element-wise ratios of the singular vectors of the adjacency matrix before clustering. Meaningful results were obtained for the statistician citation network, but a rigorous analysis of its performance was missing. First, this paper establishes theoretical guarantees for this algorithm and its variants for the directed degree-corrected block model (Directed-DCBM). Second, this paper provides significant improvements to the original D-SCORE algorithms by attaching the nodes outside of the community cores using the information of the original network instead of the singular vectors. 
<a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('2008.03820v1-abstract-full').style.display = 'none'; document.getElementById('2008.03820v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 9 August, 2020; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> August 2020. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">Journal of Machine Learning Research 2020, to appear</span> </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Journal ref:</span> Journal of Machine Learning Research 2020, (153): 1-45 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:1809.10804</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Computation and Language">cs.CL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">cs.LG</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Machine Learning">stat.ML</span> </div> </div> <p class="title is-5 mathjax"> Patient Risk Assessment and Warning Symptom Detection Using Deep Attention-Based Neural Networks </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/stat?searchtype=author&amp;query=Girardi%2C+I">Ivan Girardi</a>, <a href="/search/stat?searchtype=author&amp;query=Ji%2C+P">Pengfei Ji</a>, <a href="/search/stat?searchtype=author&amp;query=Nguyen%2C+A">An-phi Nguyen</a>, <a href="/search/stat?searchtype=author&amp;query=Hollenstein%2C+N">Nora Hollenstein</a>, <a 
href="/search/stat?searchtype=author&amp;query=Ivankay%2C+A">Adam Ivankay</a>, <a href="/search/stat?searchtype=author&amp;query=Kuhn%2C+L">Lorenz Kuhn</a>, <a href="/search/stat?searchtype=author&amp;query=Marchiori%2C+C">Chiara Marchiori</a>, <a href="/search/stat?searchtype=author&amp;query=Zhang%2C+C">Ce Zhang</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="1809.10804v1-abstract-short" style="display: inline;"> We present an operational component of a real-world patient triage system. Given a specific patient presentation, the system is able to assess the level of medical urgency and issue the most appropriate recommendation in terms of best point of care and time to treat. We use an attention-based convolutional neural network architecture trained on 600,000 doctor notes in German. We compare two approa&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('1809.10804v1-abstract-full').style.display = 'inline'; document.getElementById('1809.10804v1-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="1809.10804v1-abstract-full" style="display: none;"> We present an operational component of a real-world patient triage system. Given a specific patient presentation, the system is able to assess the level of medical urgency and issue the most appropriate recommendation in terms of best point of care and time to treat. We use an attention-based convolutional neural network architecture trained on 600,000 doctor notes in German. We compare two approaches, one that uses the full text of the medical notes and one that uses only a selected list of medical entities extracted from the text. These approaches achieve 79% and 66% precision, respectively, but on a confidence threshold of 0.6, precision increases to 85% and 75%, respectively. 
In addition, a method to detect warning symptoms is implemented to render the classification task transparent from a medical perspective. The method is based on the learning of attention scores and a method of automatic validation using the same data. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('1809.10804v1-abstract-full').style.display = 'none'; document.getElementById('1809.10804v1-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 27 September, 2018; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> September 2018. </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">10 pages, 2 figures, EMNLP workshop LOUHI 2018</span> </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:1410.2840</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Applications">stat.AP</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Digital Libraries">cs.DL</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Physics and Society">physics.soc-ph</span> <span class="tag is-small is-grey tooltip is-tooltip-top" data-tooltip="Methodology">stat.ME</span> </div> <div class="is-inline-block" style="margin-left: 0.5rem"> <div class="tags has-addons"> <span class="tag is-dark is-size-7">doi</span> <span class="tag is-light is-size-7"><a class="" href="">10.1214/15-AOAS896 <i class="fa fa-external-link" aria-hidden="true"></i></a></span> </div> </div> </div> <p class="title is-5 mathjax"> Coauthorship and Citation Networks for Statisticians </p> <p 
class="authors"> <span class="search-hit">Authors:</span> <a href="/search/stat?searchtype=author&amp;query=Ji%2C+P">Pengsheng Ji</a>, <a href="/search/stat?searchtype=author&amp;query=Jin%2C+J">Jiashun Jin</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="1410.2840v2-abstract-short" style="display: inline;"> We have collected and cleaned two network data sets: Coauthorship and Citation networks for statisticians. The data sets are based on all research papers published in four of the top journals in statistics from $2003$ to the first half of $2012$. We analyze the data sets from many different perspectives, focusing on (a) centrality, (b) community structures, and (c) productivity, patterns and trend&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('1410.2840v2-abstract-full').style.display = 'inline'; document.getElementById('1410.2840v2-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="1410.2840v2-abstract-full" style="display: none;"> We have collected and cleaned two network data sets: Coauthorship and Citation networks for statisticians. The data sets are based on all research papers published in four of the top journals in statistics from $2003$ to the first half of $2012$. We analyze the data sets from many different perspectives, focusing on (a) centrality, (b) community structures, and (c) productivity, patterns and trends. For (a), we have identified the most prolific/collaborative/highly cited authors. We have also identified a handful of &#34;hot&#34; papers, suggesting &#34;Variable Selection&#34; as one of the &#34;hot&#34; areas. 
For (b), we have identified about $15$ meaningful communities or research groups, including large-size ones such as &#34;Spatial Statistics&#34;, &#34;Large-Scale Multiple Testing&#34;, &#34;Variable Selection&#34; as well as small-size ones such as &#34;Dimensional Reduction&#34;, &#34;Objective Bayes&#34;, &#34;Quantile Regression&#34;, and &#34;Theoretical Machine Learning&#34;. For (c), we find that over the 10-year period, both the average number of papers per author and the fraction of self citations have been decreasing, but the proportion of distant citations has been increasing. These findings suggest that the statistics community has become increasingly collaborative, competitive, and globalized. Our findings shed light on research habits, trends, and topological patterns of statisticians. The data sets provide a fertile ground for future research on or related to social networks of statisticians. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('1410.2840v2-abstract-full').style.display = 'none'; document.getElementById('1410.2840v2-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 2 July, 2015; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 10 October, 2014; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> October 2014. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">MSC Class:</span> 91C20; 62H30; 62P25 </p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Journal ref:</span> Annals of Applied Statistics 2016, 10(4): 1779-1812 </p> </li> <li class="arxiv-result"> <div class="is-marginless"> <p class="list-title is-inline-block"><a href="">arXiv:1404.2961</a> <span>&nbsp;[<a href="">pdf</a>, <a href="">other</a>]&nbsp;</span> </p> <div class="tags is-inline-block"> <span class="tag is-small is-link tooltip is-tooltip-top" data-tooltip="Methodology">stat.ME</span> </div> </div> <p class="title is-5 mathjax"> Rate optimal multiple testing procedure in high-dimensional regression </p> <p class="authors"> <span class="search-hit">Authors:</span> <a href="/search/stat?searchtype=author&amp;query=Ji%2C+P">Pengsheng Ji</a>, <a href="/search/stat?searchtype=author&amp;query=Zhao%2C+Z">Zhigen Zhao</a> </p> <p class="abstract mathjax"> <span class="has-text-black-bis has-text-weight-semibold">Abstract</span>: <span class="abstract-short has-text-grey-dark mathjax" id="1404.2961v4-abstract-short" style="display: inline;"> In high-dimensional regression analysis, when the number of predictors is much larger than the sample size, an important question is how to select the variables that are relevant to the response variable of interest. Variable selection and multiple testing are both tools to address this issue. However, there is little discussion of the connection between these two areas. 
When the signal st&hellip; <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('1404.2961v4-abstract-full').style.display = 'inline'; document.getElementById('1404.2961v4-abstract-short').style.display = 'none';">&#9661; More</a> </span> <span class="abstract-full has-text-grey-dark mathjax" id="1404.2961v4-abstract-full" style="display: none;"> In high-dimensional regression analysis, when the number of predictors is much larger than the sample size, an important question is how to select the variables that are relevant to the response variable of interest. Variable selection and multiple testing are both tools to address this issue. However, there is little discussion of the connection between these two areas. When the signal strength is strong enough that selection consistency is achievable, it seems unnecessary to control the false discovery rate. In this paper, we consider the regime where the signals are both rare and weak, so that selection consistency is not achievable, and propose a method that controls the false discovery rate asymptotically. It is theoretically shown that the false non-discovery rate of the proposed method converges to zero at the optimal rate. Numerical results are provided to demonstrate the advantage of the proposed method. <a class="is-size-7" style="white-space: nowrap;" onclick="document.getElementById('1404.2961v4-abstract-full').style.display = 'none'; document.getElementById('1404.2961v4-abstract-short').style.display = 'inline';">&#9651; Less</a> </span> </p> <p class="is-size-7"><span class="has-text-black-bis has-text-weight-semibold">Submitted</span> 6 January, 2023; <span class="has-text-black-bis has-text-weight-semibold">v1</span> submitted 10 April, 2014; <span class="has-text-black-bis has-text-weight-semibold">originally announced</span> April 2014. 
</p> <p class="comments is-size-7"> <span class="has-text-black-bis has-text-weight-semibold">Comments:</span> <span class="has-text-grey-dark mathjax">26 pages</span> </p> </li> </ol> <div class="is-hidden-tablet"> <!-- feedback for mobile only --> <span class="help" style="display: inline-block;"><a href="">Search v0.5.6 released 