<!DOCTYPE html><html lang=""><head><meta charSet="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/><title>Texthero · Text preprocessing, representation and visualization from zero to hero.</title><meta name="viewport" content="width=device-width, initial-scale=1.0"/><meta name="generator" content="Docusaurus"/><meta name="description" content="Text preprocessing, representation and visualization from zero to hero."/><meta property="og:title" content="Texthero · Text preprocessing, representation and visualization from zero to hero."/><meta property="og:type" content="website"/><meta property="og:url" content="https://texthero.org/"/><meta property="og:description" content="Text preprocessing, representation and visualization from zero to hero."/><meta property="og:image" content="https://texthero.org/img/T.png"/><meta name="twitter:card" content="summary"/><meta name="twitter:image" content="https://texthero.org/img/T.png"/><link rel="shortcut icon" href="/img/favicon.png"/><link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/atom-one-dark.min.css"/><link rel="alternate" type="application/atom+xml" href="https://texthero.org/blog/atom.xml" title="Texthero Blog ATOM Feed"/><link rel="alternate" type="application/rss+xml" href="https://texthero.org/blog/feed.xml" title="Texthero Blog RSS Feed"/><link rel="stylesheet" href="/css/code-block-buttons.css"/><link rel="stylesheet" href="/css/sphinx_basic.css"/><script type="text/javascript" src="https://buttons.github.io/buttons.js"></script><script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.0/clipboard.min.js"></script><script type="text/javascript" src="/js/code-block-buttons.js"></script><script type="text/javascript" src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/10.0.3/highlight.min.js"></script><script type="text/javascript" src="/js/start_highlight.js"></script><script type="text/javascript" 
src="https://www.googletagmanager.com/gtag/js?id=G-0V7XX3QG4C"></script><script type="text/javascript" src="/js/analytics.js"></script><script src="/js/scrollSpy.js"></script><link rel="stylesheet" href="/css/prism.css"/><link rel="stylesheet" href="/css/main.css"/><script src="/js/codetabs.js"></script></head><body><div class="fixedHeaderContainer"><div class="headerWrapper wrapper"><header><a href="/"><h2 class="headerTitle">Texthero</h2></a><div class="navigationWrapper navigationSlider"><nav class="slidingNav"><ul class="nav-site nav-site-internal"><li class=""><a href="/docs/getting-started" target="_self">Getting started</a></li><li class=""><a href="/blog/" target="_self">Tutorial</a></li><li class=""><a href="/docs/api-preprocessing" target="_self">API</a></li><li class=""><a href="https://github.com/jbesomi/texthero" target="_self">GitHub</a></li></ul></nav></div></header></div></div><div class="navPusher"><div><div class="homeContainer"><div class="homeSplashFade"><div class="wrapper homeWrapper"><div class="projectLogo"><img src="/docs/assets/texthero.png"/></div><div class="inner"><div><h2 class="projectTitle"><small>Text preprocessing, representation and visualization from zero to hero.</small></h2><h3>Texthero is a Python package to work with text data <strong>efficiently</strong>. 
<br/>It empowers NLP developers with a tool to quickly understand any text-based dataset and <br/>it provides a solid pipeline to clean and represent text data, from zero to hero.</h3></div><div class="section promoSection"><div class="promoRow"><div class="pluginRowBlock"><div class="pluginWrapper buttonWrapper"><a class="button" href="/docs/getting-started" target="_self">Getting started</a></div><div class="pluginWrapper buttonWrapper"><a class="button" href="/blog" target="_self">Tutorial</a></div><div class="pluginWrapper buttonWrapper"><a class="button" href="/docs/api-preprocessing" target="_self">API</a></div><div class="pluginWrapper buttonWrapper"><a class="button" href="https://github.com/jbesomi/texthero" target="_self">GitHub</a></div></div></div></div><iframe src="https://ghbtns.com/github-btn.html?user=jbesomi&repo=texthero&type=star&count=true&size=large" frameBorder="0" scrolling="0" width="160" height="30" title="GitHub Stars" style="margin:20px"></iframe><div class="home_separator"></div><div class="showcase"><div class="homebox left"><h2 class="projectTitle"><small>Import texthero ... </small></h2><pre><code class="python">import texthero as hero
import pandas as pd</code></pre></div><div class="homebox right"><h2 class="projectTitle"><small>... 
load any text dataset with Pandas</small></h2><pre><code class="python">df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)
df.head(2)</code></pre><table border="1" class="dataframe"><thead><tr><th></th><th>text</th><th>topic</th></tr></thead><tbody><tr><th>0</th><td>Claxton hunting first major medal\n\nBritish h...</td><td>athletics</td></tr><tr><th>1</th><td>O'Sullivan could run in Worlds\n\nSonia O'Sull...</td><td>athletics</td></tr></tbody></table></div><div class="homebox left"><h2 class="projectTitle"><small>Preprocess it ...</small></h2><pre><code class="python">df['text'] = hero.clean(df['text'])</code></pre><table border="1" class="dataframe"><thead><tr><th></th><th>text</th><th>topic</th></tr></thead><tbody><tr><th>0</th><td>claxton hunting first major medal british hurd...</td><td>athletics</td></tr><tr><th>1</th><td>sullivan could run worlds sonia sullivan indic...</td><td>athletics</td></tr></tbody></table><div><span><blockquote> <p>Look at the <a href="/docs/api-preprocessing">preprocessing API</a> for more customization</p> </blockquote> </span></div></div><div class="homebox right"><h2 class="projectTitle"><small>... 
represent it</small></h2><pre><code class="python">df['tfidf'] = (
    hero.tfidf(df['text'], max_features=100)
)
df[["tfidf", "topic"]].head(2)</code></pre><table border="1" class="dataframe"><thead><tr><th></th><th>tfidf</th><th>topic</th></tr></thead><tbody><tr><th>0</th><td>[0.0, 0.13194458247285848, 0.0, 0.0, 0.0, 0.0,...</td><td>athletics</td></tr><tr><th>1</th><td>[0.0, 0.13056235989725676, 0.0, 0.205187581391...</td><td>athletics</td></tr></tbody></table><div><span><blockquote> <p>There are <a href="/docs/api-representation">many other ways</a> to represent the data</p> </blockquote> </span></div></div><div class="homebox left"><h2 class="projectTitle"><small>Reduce dimension and visualize the vector space</small></h2><pre><code class="python">df['pca'] = hero.pca(df['tfidf'])
hero.scatterplot(
    df,
    col='pca',
    color='topic',
    title="PCA BBC Sport news"
)</code></pre><img src="/img/scatterplot_bccsport.svg" alt=""/></div><div class="homebox right"><h2 class="projectTitle"><small>... need more? 
find named entities</small></h2><pre><code class="python">df['named_entities'] = (
    hero.named_entities(df['text'])
)
df[['named_entities', 'topic']].head(2)</code></pre><table border="1" class="dataframe"><thead><tr><th></th><th>named_entities</th><th>topic</th></tr></thead><tbody><tr><th>0</th><td>[(claxton, ORG, 0, 7), (first, ORDINAL, 16, 21...</td><td>athletics</td></tr><tr><th>1</th><td>[(sullivan, ORG, 0, 8), (sonia sullivan, PERSO...</td><td>athletics</td></tr></tbody></table></div><div class="homebox left"><h2 class="projectTitle"><small>Show top words ...</small></h2><pre><code class="python">NUM_TOP_WORDS = 5
hero.top_words(df['text'])[:NUM_TOP_WORDS]</code></pre><table border="1" class="dataframe"><thead><tr><th></th><th>text</th></tr></thead><tbody><tr><th>said</th><td>1338</td></tr><tr><th>first</th><td>790</td></tr><tr><th>england</th><td>749</td></tr><tr><th>game</th><td>681</td></tr><tr><th>one</th><td>671</td></tr></tbody></table></div><div class="homebox right"><h2 class="projectTitle"><small>And much more!</small></h2></div></div></div></div></div></div><div class="mainContainer"><div class="wrapper homeWrapper"><div class="inner"><div></div></div></div></div></div><footer class="nav-footer" id="footer"><section class="copyright">Texthero - MIT license</section></footer></div></body></html>