<!DOCTYPE html> <html lang="en"> <head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <title> MLlib | Apache Spark </title> <meta name="description" content="MLlib is Apache Spark's scalable machine learning library, with APIs in Java, Scala, Python, and R."> <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.0.2/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-EVSTQN3/azprG1Anm3QDgpJLIm9Nao0Yz1ztcQTwFspd3yD65VohhpuuCOmLASjC" crossorigin="anonymous"> <link rel="preconnect" href="https://fonts.googleapis.com"> <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin> <link href="https://fonts.googleapis.com/css2?family=DM+Sans:ital,wght@0,400;0,500;0,700;1,400;1,500;1,700&Courier+Prime:wght@400;700&display=swap" rel="stylesheet"> <link href="/css/custom.css" rel="stylesheet"> <!-- Code highlighter CSS --> <link href="/css/pygments-default.css" rel="stylesheet"> <link rel="icon" href="/favicon.ico" type="image/x-icon"> <!-- Matomo --> <script> var _paq = window._paq = window._paq || []; /* tracker methods like "setCustomDimension" should be called before "trackPageView" */ _paq.push(["disableCookies"]); _paq.push(['trackPageView']); _paq.push(['enableLinkTracking']); (function() { var u="https://analytics.apache.org/"; _paq.push(['setTrackerUrl', u+'matomo.php']); _paq.push(['setSiteId', '40']); var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0]; g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s); })(); </script> <!-- End Matomo Code --> </head> <body class="global"> <nav class="navbar navbar-expand-lg navbar-dark p-0 px-4" style="background: #1D6890;"> <a class="navbar-brand" href="/"> <img src="/images/spark-logo-rev.svg" alt="" width="141" height="72"> </a> <button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbarContent" aria-controls="navbarContent" 
aria-expanded="false" aria-label="Toggle navigation"> <span class="navbar-toggler-icon"></span> </button> <div class="collapse navbar-collapse col-md-12 col-lg-auto pt-4" id="navbarContent"> <ul class="navbar-nav me-auto"> <li class="nav-item"> <a class="nav-link active" aria-current="page" href="/downloads.html">Download</a> </li> <li class="nav-item dropdown"> <a class="nav-link dropdown-toggle" href="#" id="libraries" role="button" data-bs-toggle="dropdown" aria-expanded="false"> Libraries </a> <ul class="dropdown-menu" aria-labelledby="libraries"> <li><a class="dropdown-item" href="/sql/">SQL and DataFrames</a></li> <li><a class="dropdown-item" href="/spark-connect/">Spark Connect</a></li> <li><a class="dropdown-item" href="/streaming/">Spark Streaming</a></li> <li><a class="dropdown-item" href="/pandas-on-spark/">pandas on Spark</a></li> <li><a class="dropdown-item" href="/mllib/">MLlib (machine learning)</a></li> <li><a class="dropdown-item" href="/graphx/">GraphX (graph)</a></li> <li> <hr class="dropdown-divider"> </li> <li><a class="dropdown-item" href="/third-party-projects.html">Third-Party Projects</a></li> </ul> </li> <li class="nav-item dropdown"> <a class="nav-link dropdown-toggle" href="#" id="documentation" role="button" data-bs-toggle="dropdown" aria-expanded="false"> Documentation </a> <ul class="dropdown-menu" aria-labelledby="documentation"> <li><a class="dropdown-item" href="/docs/latest/">Latest Release</a></li> <li><a class="dropdown-item" href="/documentation.html">Older Versions and Other Resources</a></li> <li><a class="dropdown-item" href="/faq.html">Frequently Asked Questions</a></li> </ul> </li> <li class="nav-item"> <a class="nav-link active" aria-current="page" href="/examples.html">Examples</a> </li> <li class="nav-item dropdown"> <a class="nav-link dropdown-toggle" href="#" id="community" role="button" data-bs-toggle="dropdown" aria-expanded="false"> Community </a> <ul class="dropdown-menu" aria-labelledby="community"> <li><a 
class="dropdown-item" href="/community.html">Mailing Lists & Resources</a></li> <li><a class="dropdown-item" href="/contributing.html">Contributing to Spark</a></li> <li><a class="dropdown-item" href="/improvement-proposals.html">Improvement Proposals (SPIP)</a> </li> <li><a class="dropdown-item" href="https://issues.apache.org/jira/browse/SPARK">Issue Tracker</a> </li> <li><a class="dropdown-item" href="/powered-by.html">Powered By</a></li> <li><a class="dropdown-item" href="/committers.html">Project Committers</a></li> <li><a class="dropdown-item" href="/history.html">Project History</a></li> </ul> </li> <li class="nav-item dropdown"> <a class="nav-link dropdown-toggle" href="#" id="developers" role="button" data-bs-toggle="dropdown" aria-expanded="false"> Developers </a> <ul class="dropdown-menu" aria-labelledby="developers"> <li><a class="dropdown-item" href="/developer-tools.html">Useful Developer Tools</a></li> <li><a class="dropdown-item" href="/versioning-policy.html">Versioning Policy</a></li> <li><a class="dropdown-item" href="/release-process.html">Release Process</a></li> <li><a class="dropdown-item" href="/security.html">Security</a></li> </ul> </li> <li class="nav-item dropdown"> <a class="nav-link dropdown-toggle" href="#" id="github" role="button" data-bs-toggle="dropdown" aria-expanded="false"> GitHub </a> <ul class="dropdown-menu" aria-labelledby="github"> <li><a class="dropdown-item" href="https://github.com/apache/spark">spark</a></li> <li><a class="dropdown-item" href="https://github.com/apache/spark-connect-go">spark-connect-go</a></li> <li><a class="dropdown-item" href="https://github.com/apache/spark-docker">spark-docker</a></li> <li><a class="dropdown-item" href="https://github.com/apache/spark-kubernetes-operator">spark-kubernetes-operator</a></li> <li><a class="dropdown-item" href="https://github.com/apache/spark-website">spark-website</a></li> </ul> </li> </ul> <ul class="navbar-nav ml-auto"> <li class="nav-item dropdown"> <a 
class="nav-link dropdown-toggle" href="#" id="apacheFoundation" role="button" data-bs-toggle="dropdown" aria-expanded="false"> Apache Software Foundation </a> <ul class="dropdown-menu" aria-labelledby="apacheFoundation"> <li><a class="dropdown-item" href="https://www.apache.org/">Apache Homepage</a></li> <li><a class="dropdown-item" href="https://www.apache.org/licenses/">License</a></li> <li><a class="dropdown-item" href="https://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> <li><a class="dropdown-item" href="https://www.apache.org/foundation/thanks.html">Thanks</a></li> <li><a class="dropdown-item" href="https://www.apache.org/security/">Security</a></li> <li><a class="dropdown-item" href="https://www.apache.org/events/current-event">Event</a></li> </ul> </li> </ul> </div> </nav> <div class="container"> <div class="row mt-4"> <div class="col-12 col-md-9"> <div class="jumbotron"> <b>MLlib</b> is Apache Spark's scalable machine learning library. </div> <div class="row row-padded"> <div class="col-md-7 col-sm-7"> <h2>Ease of use</h2> <p class="lead"> Usable in Java, Scala, Python, and R. </p> <p> MLlib fits into <a href="/">Spark</a>'s APIs and interoperates with <a href="http://www.numpy.org">NumPy</a> in Python (as of Spark 0.9) and R libraries (as of Spark 1.5). You can use any Hadoop data source (e.g. HDFS, HBase, or local files), making it easy to plug into Hadoop workflows. 
</p> </div> <div class="col-md-5 col-sm-5 col-padded-top col-center"> <div style="margin-top: 15px; text-align: left; display: inline-block;"> <div class="code"> from pyspark.ml.clustering import KMeans<br /> <br /> data = spark.read.format(<span class="string">"libsvm"</span>)\<br /> .load(<span class="string">"hdfs://..."</span>)<br /> <br /> model = <span class="sparkop">KMeans</span>(k=10).fit(data) </div> <div class="caption">Calling MLlib in Python</div> </div> </div> </div> <div class="row row-padded"> <div class="col-md-7 col-sm-7"> <h2>Performance</h2> <p class="lead"> High-quality algorithms, up to 100x faster than MapReduce. </p> <p> Spark excels at iterative computation, enabling MLlib to run fast. At the same time, we care about algorithmic performance: MLlib contains high-quality algorithms that leverage iteration, and can yield better results than the one-pass approximations sometimes used on MapReduce. </p> </div> <div class="col-md-5 col-sm-5 col-padded-top col-center"> <div style="width: 100%; max-width: 272px; display: inline-block; text-align: center;"> <img src="/images/logistic-regression.png" style="width: 100%; max-width: 250px;" /> <div class="caption" style="min-width: 272px;">Logistic regression in Hadoop and Spark</div> </div> </div> </div> <div class="row row-padded" style="margin-bottom: 15px;"> <div class="col-md-7 col-sm-7"> <h2>Runs everywhere</h2> <p class="lead"> Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, against diverse data sources. </p> <p> You can run Spark using its <a href="/docs/latest/spark-standalone.html">standalone cluster mode</a>, on <a href="https://github.com/amplab/spark-ec2">EC2</a>, on <a href="https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html">Hadoop YARN</a>, on <a href="https://mesos.apache.org">Mesos</a>, or on <a href="https://kubernetes.io/">Kubernetes</a>. 
Access data in <a href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html">HDFS</a>, <a href="https://cassandra.apache.org">Apache Cassandra</a>, <a href="https://hbase.apache.org">Apache HBase</a>, <a href="https://hive.apache.org">Apache Hive</a>, and hundreds of other data sources. </p> </div> <div class="col-md-5 col-sm-5 col-padded-top col-center"> <img src="/images/hadoop.jpg" style="width: 100%; max-width: 280px;" /> </div> </div> <div class="row"> <div class="col-md-4 col-padded"> <h3>Algorithms</h3> <p> MLlib contains many algorithms and utilities. </p> <p> ML algorithms include: </p> <ul class="list-narrow"> <li>Classification: logistic regression, naive Bayes,...</li> <li>Regression: generalized linear regression, survival regression,...</li> <li>Decision trees, random forests, and gradient-boosted trees</li> <li>Recommendation: alternating least squares (ALS)</li> <li>Clustering: K-means, Gaussian mixtures (GMMs),...</li> <li>Topic modeling: latent Dirichlet allocation (LDA)</li> <li>Frequent itemsets, association rules, and sequential pattern mining</li> </ul> <p> ML workflow utilities include: </p> <ul class="list-narrow"> <li>Feature transformations: standardization, normalization, hashing,...</li> <li>ML Pipeline construction</li> <li>Model evaluation and hyper-parameter tuning</li> <li>ML persistence: saving and loading models and Pipelines</li> </ul> <p> Other utilities include: </p> <ul class="list-narrow"> <li>Distributed linear algebra: SVD, PCA,...</li> <li>Statistics: summary statistics, hypothesis testing,...</li> </ul> <p>Refer to the <a href="/docs/latest/ml-guide.html">MLlib guide</a> for usage examples.</p> </div> <div class="col-md-4 col-padded"> <h3>Community</h3> <p> MLlib is developed as part of the Apache Spark project. It thus gets tested and updated with each Spark release. 
</p> <p> If you have questions about the library, ask on the <a href="/community.html#mailing-lists">Spark mailing lists</a>. </p> <p> MLlib is still a rapidly growing project and welcomes contributions. If you'd like to submit an algorithm to MLlib, read <a href="/contributing.html">how to contribute to Spark</a> and send us a patch! </p> </div> <div class="col-md-4 col-padded"> <h3>Getting started</h3> <p> To get started with MLlib: </p> <ul class="list-narrow"> <li><a href="/downloads.html">Download Spark</a>. MLlib is included as a module.</li> <li>Read the <a href="/docs/latest/ml-guide.html">MLlib guide</a>, which includes various usage examples.</li> <li>Learn how to <a href="/docs/latest/#launching-on-a-cluster">deploy</a> Spark on a cluster if you'd like to run in distributed mode. You can also run locally on a multicore machine without any setup. </li> </ul> </div> </div> <div class="row"> <div class="col-sm-12 col-center"> <a href="/downloads.html" class="btn btn-cta btn-lg btn-multiline"> Download Apache Spark<br /><span class="small">Includes MLlib</span> </a> </div> </div> </div> <div class="col-12 col-md-3"> <div class="news" style="margin-bottom: 20px;"> <h5>Latest News</h5> <ul class="list-unstyled"> <li><a href="/news/spark-3-4-4-released.html">Spark 3.4.4 released</a> <span class="small">(Oct 27, 2024)</span></li> <li><a href="/news/spark-4.0.0-preview2.html">Preview release of Spark 4.0</a> <span class="small">(Sep 26, 2024)</span></li> <li><a href="/news/spark-3-5-3-released.html">Spark 3.5.3 released</a> <span class="small">(Sep 24, 2024)</span></li> <li><a href="/news/spark-3-5-2-released.html">Spark 3.5.2 released</a> <span class="small">(Aug 10, 2024)</span></li> </ul> <p class="small" style="text-align: right;"><a href="/news/index.html">Archive</a></p> </div> <div style="text-align:center; margin-bottom: 20px;"> <a href="https://www.apache.org/events/current-event.html"> <img src="https://www.apache.org/events/current-event-234x60.png" 
style="max-width: 100%;"/> </a> </div> <div class="hidden-xs hidden-sm"> <a href="/downloads.html" class="btn btn-cta btn-lg d-grid" style="margin-bottom: 30px;"> Download Spark </a> <p style="font-size: 16px; font-weight: 500; color: #555;"> Built-in Libraries: </p> <ul class="list-none"> <li><a href="/sql/">SQL and DataFrames</a></li> <li><a href="/streaming/">Spark Streaming</a></li> <li><a href="/mllib/">MLlib (machine learning)</a></li> <li><a href="/graphx/">GraphX (graph)</a></li> </ul> <a href="/third-party-projects.html">Third-Party Projects</a> </div> </div> </div> <footer class="small"> <hr> Apache Spark, Spark, Apache, the Apache feather logo, and the Apache Spark project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries. See guidance on use of Apache Spark <a href="/trademarks.html">trademarks</a>. All other marks mentioned may be trademarks or registered trademarks of their respective owners. Copyright © 2018 The Apache Software Foundation, Licensed under the <a href="https://www.apache.org/licenses/">Apache License, Version 2.0</a>. </footer> </div> <script src="https://cdn.jsdelivr.net/npm/bootstrap@5.0.2/dist/js/bootstrap.bundle.min.js" integrity="sha384-MrcW6ZMFYlzcLA8Nl+NtUVF0sA7MsXsP1UyJoMp4YLEuNSfAP+JcXn/tWtIaxVXM" crossorigin="anonymous"></script> <script src="https://code.jquery.com/jquery.js"></script> <script src="/js/lang-tabs.js"></script> <script src="/js/downloads.js"></script> </body> </html>