# Blog | RocksDB
wrapper"> <div class="navToggle" id="collection_nav_toggler"> <i></i> </div> <h2> <a href="/blog/">Blog</a> </h2> </div> <div class="navGroups"> <div class="navGroup navGroupActive navGroupCurrent"> <h3><i>+</i><span>All Posts</span></h3> <ul> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2025/02/07/mitigated-bug-update.html">Addressing a Mitigated Misconfig Bug in the RocksDB OSS Repository</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2024/02/20/foreign-function-interface.html">Java Foreign Function Interface</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2023/11/06/java-jni-benchmarks.html">Java API Performance Improvements</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2022/11/09/time-aware-tiered-storage.html">Time-Aware Tiered Storage in RocksDB</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2022/10/31/align-compaction-output-file.html">Reduce Write Amplification by Aligning Compaction Output File Boundaries</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2022/10/07/asynchronous-io-in-rocksdb.html">Asynchronous IO in RocksDB</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2022/10/05/lost-buffered-write-recovery.html">Verifying crash-recovery with lost buffered writes</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2022/07/18/per-key-value-checksum.html">Per Key-Value Checksum</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2021/12/29/ribbon-filter.html">Ribbon Filter</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2021/05/31/dictionary-compression.html">Preset Dictionary Compression</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2021/05/27/rocksdb-secondary-cache.html">RocksDB Secondary Cache</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2021/05/26/online-validation.html">Online Validation</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2021/05/26/integrated-blob-db.html">Integrated BlobDB</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2021/04/12/universal-improvements.html">(Call For Contribution) Make Universal Compaction More Incremental</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2019/08/15/unordered-write.html">Higher write throughput with `unordered_write` feature</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2019/03/08/format-version-4.html">format_version 4</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2018/11/21/delete-range.html">DeleteRange: A New Native RocksDB Operation</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2018/08/23/data-block-hash-index.html">Improving Point-Lookup Using Data Block Hash Index</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2018/08/01/rocksdb-tuning-advisor.html">Rocksdb Tuning Advisor</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2018/02/05/rocksdb-5-10-2-released.html">RocksDB 5.10.2 Released!</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2017/12/19/write-prepared-txn.html">WritePrepared Transactions</a></li> <li 
class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2017/12/18/17-auto-tuned-rate-limiter.html">Auto-tuned Rate Limiter</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2017/09/28/rocksdb-5-8-released.html">RocksDB 5.8 Released!</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2017/08/25/flushwal.html">FlushWAL; less fwrite, faster writes</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2017/08/24/pinnableslice.html">PinnableSlice; less memcpy with point lookups</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2017/07/25/rocksdb-5-6-1-released.html">RocksDB 5.6.1 Released!</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2017/06/29/rocksdb-5-5-1-released.html">RocksDB 5.5.1 Released!</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2017/06/26/17-level-based-changes.html">Level-based Compaction Changes</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2017/05/26/rocksdb-5-4-5-released.html">RocksDB 5.4.5 Released!</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2017/05/14/core-local-stats.html">Core-local Statistics</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2017/05/12/partitioned-index-filter.html">Partitioned Index/Filters</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2017/03/02/rocksdb-5-2-1-released.html">RocksDB 5.2.1 Released!</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2017/02/17/bulkoad-ingest-sst-file.html">Bulkloading by ingesting external SST files</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2017/02/07/rocksdb-5-1-2-released.html">RocksDB 5.1.2 Released!</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2017/01/06/rocksdb-5-0-1-released.html">RocksDB 5.0.1 Released!</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2016/09/28/rocksdb-4-11-2-released.html">RocksDB 4.11.2 Released!</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2016/07/26/rocksdb-4-8-released.html">RocksDB 4.8 Released!</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2016/04/26/rocksdb-4-5-1-released.html">RocksDB 4.5.1 Released!</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2016/03/07/rocksdb-options-file.html">RocksDB Options File</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2016/02/25/rocksdb-ama.html">RocksDB AMA</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2016/02/24/rocksdb-4-2-release.html">RocksDB 4.2 Release!</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2016/01/29/compaction_pri.html">Option of Compaction Priority</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2015/11/16/analysis-file-read-latency-by-level.html">Analysis File Read Latency by Level</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2015/11/10/use-checkpoints-for-efficient-snapshots.html">Use Checkpoints for Efficient Snapshots</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2015/10/27/getthreadlist.html">GetThreadList</a></li> <li 
class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2015/07/23/dynamic-level.html">Dynamic Level Size for Level-Based Compaction</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2015/07/22/rocksdb-is-now-available-in-windows-platform.html">RocksDB is now available in Windows Platform</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2015/07/17/spatial-indexing-in-rocksdb.html">Spatial indexing in RocksDB</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2015/07/15/rocksdb-2015-h2-roadmap.html">RocksDB 2015 H2 roadmap</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2015/06/12/rocksdb-in-osquery.html">RocksDB in osquery</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2015/04/22/integrating-rocksdb-with-mongodb-2.html">Integrating RocksDB with MongoDB</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2015/02/27/write-batch-with-index.html">WriteBatchWithIndex: Utility for Implementing Read-Your-Own-Writes</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2015/02/24/reading-rocksdb-options-from-a-file.html">Reading RocksDB options from a file</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2015/01/16/migrating-from-leveldb-to-rocksdb-2.html">Migrating from LevelDB to RocksDB</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2014/09/15/rocksdb-3-5-release.html">RocksDB 3.5 Release!</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2014/09/12/new-bloom-filter-format.html">New Bloom Filter Format</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2014/09/12/cuckoo.html">Cuckoo Hashing Table Format</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2014/07/29/rocksdb-3-3-release.html">RocksDB 3.3 Release</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2014/06/27/rocksdb-3-2-release.html">RocksDB 3.2 release</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2014/06/27/avoid-expensive-locks-in-get.html">Avoid Expensive Locks in Get()</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2014/06/23/plaintable-a-new-file-format.html">PlainTable — A New File Format</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2014/05/22/rocksdb-3-1-release.html">RocksDB 3.1 release</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2014/05/19/rocksdb-3-0-release.html">RocksDB 3.0 release</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2014/05/14/lock.html">Reducing Lock Contention in RocksDB</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2014/04/21/indexing-sst-files-for-better-lookup-performance.html">Indexing SST Files for Better Lookup Performance</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2014/04/07/rocksdb-2-8-release.html">RocksDB 2.8 release</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2014/04/02/the-1st-rocksdb-local-meetup-held-on-march-27-2014.html">The 1st RocksDB Local Meetup Held on March 27, 2014</a></li> <li class="navListItem"><a class="navItem" 
href="http://rocksdb.org/blog/2014/03/27/how-to-persist-in-memory-rocksdb-database.html">How to persist in-memory RocksDB database?</a></li> <li class="navListItem"><a class="navItem" href="http://rocksdb.org/blog/2014/03/27/how-to-backup-rocksdb.html">How to backup RocksDB?</a></li> </ul> </div> </div> </section> </div> </nav> </div> <script> var docsevent = document.createEvent('Event'); docsevent.initEvent('docs_slide', true, true); document.addEventListener('docs_slide', function (e) { document.body.classList.toggle('docsSliderActive'); }, false); var collectionNav = document.getElementById('collection_nav'); var collectionNavToggler = document.getElementById('collection_nav_toggler'); collectionNavToggler.addEventListener('click', function(e) { collectionNav.classList.toggle('toggleNavActive'); document.dispatchEvent(docsevent); }); var groups = document.getElementsByClassName('navGroup'); for(var i = 0; i < groups.length; i++) { var thisGroup = groups[i]; thisGroup.onclick = function() { for(var j = 0; j < groups.length; j++) { var group = groups[j]; group.classList.remove('navGroupActive'); } this.classList.add('navGroupActive'); } } </script> <div class="mainContainer blogContainer postContainer"> <div id="main_wrap" class="wrapper mainWrapper"> <div class="posts"> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/100037058588280/picture/" alt="" title="" /> </div> <p class="post-authorName">Hui Xiao</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2025/02/07/mitigated-bug-update.html">Addressing a Mitigated Misconfig Bug in the RocksDB OSS Repository</a></h1> <p class="post-meta">Posted February 07, 2025</p> </header> <article class="post-content"> <p>Dear RocksDB Community,</p> <p>We want to share an update about the bug that allowed our bug bounty researcher to update the release note title in August 2024 involving the RocksDB open-source repository on GitHub. This issue was found and responsibly disclosed to us by an external bug bounty researcher through our <a href="https://www.facebook.com/whitehat">Meta Bug Bounty program</a> and quickly mitigated by our teams. We have not seen any evidence of malicious exploitation. Please note that no action is required from our community, as we have taken all necessary steps to remediate the issue.</p> <h2 id="background">Background</h2> <p>RocksDB is a high-performance storage engine library widely used in various large-scale applications. On August 21, 2024, a bug was reported to us by one of our bug bounty researchers. They were able to demonstrate the ability to obtain the GITHUB_TOKEN used in GitHub Actions workflows. This token provides write access to the metadata of the repository, and the researcher used it to change the title of the release note 9.5.2 as proof of concept. The researcher also unsuccessfully attempted to merge a change to the main branch of the repository; however, we had access controls set up to prevent it from going through.</p> <h2 id="key-details">Key Details</h2> <ul> <li><strong>Incident Discovery</strong>: After the bug bounty researcher changed the open source release note title to demonstrate the vulnerability, external users noticed this change and <a href="https://github.com/facebook/rocksdb/issues/12962">notified</a> RocksDB. 
- **No Malicious Abuse**: The investigation confirmed that no code or data was compromised. The change was public and visible on GitHub.
- **Tag Reversion Clarification**: On August 21, a tag named "v9.5.2" was initially published pointing to an incorrect commit. This was unrelated to the bug described here and was promptly corrected by pointing the tag to the correct commit. The release binary remains safe to use, and this correction does not affect the security or integrity of the release.

We've taken the following steps to mitigate and remediate the issue:

- The release note title was corrected.
- The workflow running on the self-hosted runner was disabled immediately.
- It was confirmed that the `GITHUB_TOKEN` expired and is no longer in use.
- The binary tagged for public release was examined to confirm that it was not compromised.
- Action logs were cross-checked to ensure no other actions were taken with the compromised token, other than the release note title change and the failed attempts to merge self-approved pull requests to the main branch.
- We have [scoped down](https://github.com/facebook/rocksdb/pull/12973) the access level of tokens generated for workflows to prevent similar issues. Additionally, we are developing better guidelines for bug bounty researchers to minimize disruption during their research.

Thank you for your continued support and trust in RocksDB.

Sincerely,

The RocksDB Team

## Java Foreign Function Interface (FFI)

*Posted February 20, 2024*

Evolved Binary has been working on several aspects of how the Java API to RocksDB can be improved. The recently introduced FFI features in Java provide significant opportunities for improving the API, and we have investigated them through a prototype implementation.

Java 19 introduced a new [FFI Preview](https://openjdk.org/jeps/424), described as *an API by which Java programs can interoperate with code and data outside of the Java runtime. By efficiently invoking foreign functions (i.e., code outside the JVM), and by safely accessing foreign memory (i.e., memory not managed by the JVM), the API enables Java programs to call native libraries and process native data without the brittleness and danger of JNI*.

If the twin promises of efficiency and safety are realised, then using FFI as the mechanism supporting a future RocksDB API may be of significant benefit:

- Remove the complexity of JNI access to C++ RocksDB
- Improve RocksDB Java API performance
- Reduce the opportunity for coding errors in the RocksDB Java API

Here's what we did. We have:
- created a prototype FFI branch
- updated the RocksDB Java build to use Java 19
- implemented an FFI Preview API version of a core RocksDB feature (`get()`)
- extended the current JMH benchmarks to also cover the new FFI methods; usefully, JNI and FFI can coexist peacefully, so we use the existing RocksDB Java API for the supporting work around the FFI-based `get()` implementation

### Implementation

#### How JNI Works

JNI requires a preprocessing step during build/compilation to generate header files. The methods in these headers are implemented in C++, corresponding `native` methods are declared in Java, and the whole is linked together.

Code in the C++ methods uses what amounts to a JNI library to access Java values and objects, and to create Java objects in response.

#### How FFI Works

FFI provides the facility for Java to call existing native (in our case C++) code from pure Java, without having to generate support files during compilation steps. FFI does offer an external tool (`jextract`) which makes generating common boilerplate easier and less error-prone, but we chose to start prototyping without it, in part to better understand how things really work.

FFI does its job by providing two things:

1. A model for allocating, reading and writing native memory, and native structures within that memory
2. A model for discovering and calling native methods, with parameters consisting of native memory references and/or values

The called C++ is invoked entirely natively; it does not have to access any Java objects to retrieve the data it needs. Existing packages in C++ and other sufficiently low-level languages can therefore be called from Java without having to implement stubs on the C++ side.
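To make these two models concrete, here is a minimal, self-contained downcall example against the Java 19 preview API (`java.lang.foreign`), calling `strlen` from the C runtime. This is our own illustration rather than prototype code, and several of these names changed in later JDKs (for example, `SymbolLookup.lookup` became `find`):

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class DowncallExample {
  public static void main(String[] args) throws Throwable {
    // Discover the native function and describe its C signature
    Linker linker = Linker.nativeLinker();
    MemorySegment strlenAddress =
        linker.defaultLookup().lookup("strlen").orElseThrow();
    MethodHandle strlen = linker.downcallHandle(
        strlenAddress,
        FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

    // Allocate native memory whose lifetime is bounded by the session
    try (MemorySession session = MemorySession.openConfined()) {
      MemorySegment cString = session.allocateUtf8String("RocksDB");
      long length = (long) strlen.invokeExact((Addressable) cString);
      System.out.println(length); // prints 7
    }
  }
}
```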
#### Our Approach

While we could in principle avoid writing any C++, C++ objects and classes are not easily defined in the FFI model. So, to begin with, it is easier to write some very simple `C`-like methods/stubs in C++ which immediately call into the object-oriented core of RocksDB. We define structures with which to pass parameters to, and receive results from, the `C`-like method(s) we implement.

##### C++ Side

The first method we implement is:

```C
extern "C" int rocksdb_ffi_get_pinnable(
    ROCKSDB_NAMESPACE::DB* db, ROCKSDB_NAMESPACE::ReadOptions* read_options,
    ROCKSDB_NAMESPACE::ColumnFamilyHandle* cf, rocksdb_input_slice_t* key,
    rocksdb_pinnable_slice_t* value);
```

Our input structure is:

```C
typedef struct rocksdb_input_slice {
  const char* data;
  size_t size;
} rocksdb_input_slice_t;
```

and our output structure is a pinnable slice (of which more later):

```C
typedef struct rocksdb_pinnable_slice {
  const char* data;
  size_t size;
  ROCKSDB_NAMESPACE::PinnableSlice* pinnable_slice;
  bool is_pinned;
} rocksdb_pinnable_slice_t;
```

##### Java Side

We implement an `FFIMethod` class to advertise a `java.lang.invoke.MethodHandle` for each of our helper stubs:

```java
public static MethodHandle GetPinnable;   // refers to the rocksdb_ffi_get_pinnable method in C++
public static MethodHandle ResetPinnable; // refers to the rocksdb_ffi_reset_pinnable method in C++
```

We also implement an `FFILayout` class to describe each of the passed structures (`rocksdb_input_slice`, `rocksdb_pinnable_slice` and `rocksdb_output_slice`) in Java terms:
<span class="o">};</span> <span class="kd">public</span> <span class="kd">static</span> <span class="kd">class</span> <span class="nc">PinnableSlice</span> <span class="o">{</span> <span class="kd">static</span> <span class="kd">final</span> <span class="nc">GroupLayout</span> <span class="nc">Layout</span> <span class="o">=</span> <span class="o">...</span> <span class="kd">static</span> <span class="kd">final</span> <span class="nc">VarHandle</span> <span class="nc">Data</span> <span class="o">=</span> <span class="o">...</span> <span class="kd">static</span> <span class="kd">final</span> <span class="nc">VarHandle</span> <span class="nc">Size</span> <span class="o">=</span> <span class="o">...</span> <span class="kd">static</span> <span class="kd">final</span> <span class="nc">VarHandle</span> <span class="nc">IsPinned</span> <span class="o">=</span> <span class="o">...</span> <span class="o">};</span> <span class="kd">public</span> <span class="kd">static</span> <span class="kd">class</span> <span class="nc">OutputSlice</span> <span class="o">{</span> <span class="kd">static</span> <span class="kd">final</span> <span class="nc">GroupLayout</span> <span class="nc">Layout</span> <span class="o">=</span> <span class="o">...</span> <span class="kd">static</span> <span class="kd">final</span> <span class="nc">VarHandle</span> <span class="nc">Data</span> <span class="o">=</span> <span class="o">...</span> <span class="kd">static</span> <span class="kd">final</span> <span class="nc">VarHandle</span> <span class="nc">Size</span> <span class="o">=</span> <span class="o">...</span> <span class="o">};</span> </pre></td></tr></tbody></table></code></pre></div></div> <p>The <code class="language-plaintext highlighter-rouge">FFIDB</code> class, which implements the public Java FFI API methods, makes use of <code class="language-plaintext highlighter-rouge">FFIMethod</code> and <code class="language-plaintext highlighter-rouge">FFILayout</code> to make the code for each individual method as idiomatic and efficient as possible. 
The `FFIDB` class, which implements the public Java FFI API methods, makes use of `FFIMethod` and `FFILayout` to keep the code for each individual method as idiomatic and efficient as possible. This class also contains `java.lang.foreign.MemorySession` and `java.lang.foreign.SegmentAllocator` objects, which control the lifetime of native memory sessions and allow us to allocate lifetime-limited native memory that can be written and read by Java and passed to native methods.

At the user level, we then present a method which wraps the details of `FFIMethod` and `FFILayout` use to implement our single, core Java API `get()` method:

```java
public GetPinnableSlice getPinnableSlice(final ReadOptions readOptions,
    final ColumnFamilyHandle columnFamilyHandle, final MemorySegment keySegment,
    final GetParams getParams)
```

The flow of `getPinnableSlice()`, in common with any other core RocksDB FFI API method, becomes (sketched in code below):

1. Allocate `MemorySegment`s for the C++ structures using `Layout`s from `FFILayout`
2. Write to the allocated structures using `VarHandle`s from `FFILayout`
3. Invoke the native method using the `MethodHandle` from `FFIMethod`, with the addresses of the instantiated `MemorySegment`s, or value types, as parameters
4. Read the call result and the output parameter(s), again using `VarHandle`s from `FFILayout` to perform the mapping
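Concretely, the four steps might look like this. This is our simplified reconstruction against the Java 19 preview API, not the prototype's exact code; `dbHandle`, `readOptionsHandle` and `cfHandle` stand for previously obtained native object addresses, and error handling is omitted:

```java
try (MemorySession session = MemorySession.openConfined()) {
  // 1. Allocate native structs using the Layouts from FFILayout
  MemorySegment key = session.allocate(FFILayout.InputSlice.Layout);
  MemorySegment value = session.allocate(FFILayout.PinnableSlice.Layout);

  // 2. Write the key's address and length into the input struct
  FFILayout.InputSlice.Data.set(key, keySegment.address());
  FFILayout.InputSlice.Size.set(key, keySegment.byteSize());

  // 3. Invoke rocksdb_ffi_get_pinnable through its MethodHandle
  int status = (int) FFIMethod.GetPinnable.invokeExact(
      (Addressable) dbHandle, (Addressable) readOptionsHandle,
      (Addressable) cfHandle, (Addressable) key, (Addressable) value);

  // 4. Read the output struct back and wrap the pinned value, zero-copy
  MemoryAddress data = (MemoryAddress) FFILayout.PinnableSlice.Data.get(value);
  long size = (long) FFILayout.PinnableSlice.Size.get(value);
  MemorySegment valueSegment = MemorySegment.ofAddress(data, size, session);
  // ...consume valueSegment, then release the pin via FFIMethod.ResetPinnable...
}
```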
For the `getPinnableSlice()` method, on successful return from an invocation of `rocksdb_ffi_get_pinnable()`, the `PinnableSlice` structure will contain the `data` and `size` fields of a pinnable slice (see below) holding the requested value. A `MemorySegment` referring to the native memory of the pinnable slice is then constructed, and used by the client to retrieve the value in whatever fashion they choose.

#### Pinnable Slices

RocksDB offers core (C++) API methods which use a [`PinnableSlice`](http://rocksdb.org/blog/2017/08/24/pinnableslice.html) to return fetched data values while reducing copies to a minimum. We take advantage of this by basing our central `get()` method(s) on `PinnableSlice`s. Methods mirroring the existing JNI-based API can then be implemented in pure Java by wrapping the core `getPinnableSlice()`.

So we implement:

```java
public record GetPinnableSlice(Status.Code code, Optional<FFIPinnableSlice> pinnableSlice) {}

public GetPinnableSlice getPinnableSlice(
    final ColumnFamilyHandle columnFamilyHandle, final byte[] key)
```

and we wrap that to provide:

```java
public record GetBytes(Status.Code code, byte[] value, long size) {}

public GetBytes get(final ColumnFamilyHandle columnFamilyHandle, final byte[] key)
```
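The wrapping itself can be pure Java. The sketch below is our illustration of the idea rather than the prototype's exact code; it assumes `FFIPinnableSlice` exposes the pinned value as a `MemorySegment` via a hypothetical `segment()` accessor, plus a `reset()` method that makes the second FFI call to unpin the slice:

```java
public GetBytes get(final ColumnFamilyHandle columnFamilyHandle, final byte[] key) {
  final GetPinnableSlice result = getPinnableSlice(columnFamilyHandle, key);
  if (result.pinnableSlice().isEmpty()) {
    return new GetBytes(result.code(), new byte[0], 0); // e.g. Status.Code.NotFound
  }
  final FFIPinnableSlice slice = result.pinnableSlice().get();
  try {
    // Copy the value out of the pinned, RocksDB-owned native memory
    final byte[] value = slice.segment().toArray(ValueLayout.JAVA_BYTE);
    return new GetBytes(result.code(), value, value.length);
  } finally {
    slice.reset(); // release the pin with a second FFI call
  }
}
```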
### Benchmark Results

We extended the existing RocksDB Java JNI benchmarks with new benchmarks based on FFI. A full benchmark run on Ubuntu, including the new benchmarks:

```bash
java --enable-preview --enable-native-access=ALL-UNNAMED -jar target/rocksdbjni-jmh-1.0-SNAPSHOT-benchmarks.jar -p keyCount=100000 -p keySize=128 -p valueSize=4096,65536 -p columnFamilyTestType="no_column_family" -rf csv org.rocksdb.jmh.GetBenchmarks
```

![JNI vs FFI](/static/images/jni-ffi/jmh-result-fixed.png)

#### Discussion

We have plotted the performance (more operations is better) of a selection of benchmarks, extracted from the results CSV with:

```bash
q "select Benchmark,Score from ./plot/jmh-result-fixed.csv where \"Param: keyCount\"=100000 and \"Param: valueSize\"=65536" -d, -H
```

- JNI versions of benchmarks are the previously implemented `jmh` benchmarks measuring the performance of the current RocksDB Java interface.
- FFI versions of benchmarks are equivalent benchmarks (as far as possible) implemented using the FFI mechanisms.

We can see that for all benchmarks which have equivalent FFI and JNI pairs, the JNI version is only very marginally faster; FFI has successfully optimized away most of the extra safety checking of the new invocation mechanism.

Our initial implementation of the FFI benchmarks lagged the JNI benchmarks quite significantly, but we received extremely helpful support from Maurizio Cimadamore of the Panama dev team in optimizing our FFI implementation. We consider the small remaining performance gap to be a feature of the extra bounds checking that FFI retains.

For basic `get()`, the result buffer is allocated by the method, so there is an allocation cost associated with each request. Comparing `ffiGet` with `get`:

- The JNI version is very marginally faster than FFI.

For preallocated `get()`, where the result buffer is supplied to the method, we avoid allocating a fresh result buffer on each call, and the test recycles its result buffers.
The same small difference persists:

- JNI is very marginally faster than FFI.
- `preallocatedGet()` is a lot faster than basic `get()`.

We implemented some methods where the key for the `get()` is randomized, so that any ordering effects can be accounted for. The same differences persisted.

The FFI interface gives us a natural way to expose RocksDB's [pinnable slice](http://rocksdb.org/blog/2017/08/24/pinnableslice.html) mechanism. When we provide a benchmark which accesses the raw `PinnableSlice` API, it is, as expected, the fastest method of any; however, we are not comparing like with like:

- `ffiGetPinnableSlice()` returns a handle to the RocksDB memory containing the slice, and presents that as an FFI `MemorySegment`. No copying of the memory in the segment occurs.

As noted above, we implement the new FFI-based `get()` methods using the new FFI-based `getPinnableSlice()` method and copying out the result, so the `ffiGet` and `ffiPreallocatedGet` benchmarks use this mechanism underneath.

To discover whether using the Java APIs to copy from the pinnable-slice-backed `MemorySegment` was a problem, we implemented a separate `ffiGetOutputSlice()` benchmark which copies the result into a (Java-allocated, native memory) segment on the C++ side.

- `ffiGetOutputSlice()` is faster than `ffiPreallocatedGet()` and is in fact at least as fast as `preallocatedGet()`, which is an almost exact analogue in the JNI world.

So it appears that we can build an FFI-based API with performance equal to the JNI-based one.

Thinking about the (very small, but probably statistically significant) difference between our `ffiGetPinnableSlice()`-based FFI calls and the JNI-based calls, it is reasonable to expect that some of the cost lies in the extra FFI call to C++ that releases the pinned slice as a separate operation. A null FFI method call is extremely fast, but it does take some time.

- We would recommend looking again at the performance of the FFI-based implementation when Panama is released post-Preview in Java 21. At least with Java 20, the performance of our FFI benchmarks is not significantly different from that of the Java 19 version.

#### Copies versus Calls

The second method call over the FFI boundary, to release a pinnable slice, has a cost.
We compared the `ffiGetOutputSlice()` and `ffiGetPinnableSlice()` benchmarks in order to examine this cost. We ran them with a fixed key size (128 bytes), since the key size is likely to be pretty much irrelevant, and varied the value size read from 16 bytes to 16k. We found a performance crossover point between 1k and 4k:

![ffiGetPinnableSlice vs ffiGetOutputSlice](/static/images/jni-ffi/jmh-result-pinnable-vs-output-plot.png)

- `ffiGetOutputSlice()` is faster when the values read are 1k in size or smaller. The cost of an extra copy on the C++ side, from the pinnable slice buffer into the supplied buffer allocated via the Java Foreign Memory API, is less than the cost of the extra call to release a pinnable slice.
- `ffiGetPinnableSlice()` is faster when the values read are 4k in size or larger. Consistent with intuition, its advantage grows with larger read values.

The way the RocksDB API is constructed means that, of the two methods compared, `ffiGetOutputSlice()` will always make exactly one more copy than `ffiGetPinnableSlice()`. The underlying RocksDB C++ API will always copy into its own temporary buffer if it decides that it cannot pin an internal buffer, and that temporary buffer is what gets returned as the pinnable slice. There is a potential optimization in which the temporary buffer is replaced by an output buffer, such as the one supplied by `ffiGetOutputSlice()`; in practice that is a hard fix to hack in, and its effectiveness depends on how often RocksDB fails to pin an internal buffer.

A solution which either filled a buffer *or* returned a pinnable slice would give us the best of both worlds.

### Other Conclusions

#### Build Processing

- It is easier to implement an interface using FFI than JNI. No intermediate build-processing or code-generation steps were needed to implement this prototype.
- For a production version, we would urge using `jextract` to automate the generation of Java API methods from the set of supporting stubs.

#### Safety

- The use of `jextract` will give a similar level of type safety to JNI when crossing the language boundary. We do not believe FFI is significantly more type-safe than JNI for method invocation; neither is it less safe, though.

#### Native Memory
Panama's *Foreign-Memory Access API* appears to us to be the most significant part of the whole project. On the Java side of RocksDB, it gives us a clean mechanism (a `MemorySegment`) for holding RocksDB data (e.g. the result of a `get()` call) pending its forwarding to client code or network buffers.

We have taken advantage of this mechanism to provide the core `FFIDB.getPinnableSlice()` method in our Panama-based API. The rest of our prototype `get()` API, duplicating the existing JNI-based API, is then a *pure Java* library on top of `FFIDB.getPinnableSlice()` and `FFIPinnableSlice.reset()`.

The common standard for foreign memory opens up the possibility of efficient interoperation between RocksDB and Java clients (e.g. Kafka). We think this is really the key to higher-performing, more integrated Java-based systems:

- Data might never be copied into Java memory, or copies could be significantly reduced, as native `MemorySegment`s are handed off between co-operating Java clients of fundamentally native APIs. This extra potential performance can be extremely useful when two or more clients are interoperating; we still need to provide the simplest possible API wrapping these calls (like our prototype `get()`), operating at a similar level to the current Java API.
- Some thought should be given to how this architecture would interact with the cache layer(s) in RocksDB, and whether it can be accommodated within the present RocksDB architecture. How long can third-party applications *pin* pages in the RocksDB cache without disrupting RocksDB's normal behaviour (e.g. compaction)?

### Summary

1. Panama/FFI (in [Preview](https://openjdk.org/jeps/424)) is a highly capable technology for (re)building the RocksDB Java API, although the supported language level of RocksDB and the planned release schedule for Panama mean that it could not replace JNI in production for some time to come.
2. Panama/FFI offers comparable performance to JNI; there is no strong performance argument *for* re-implementing a standalone RocksDB Java API. But the opportunity to provide a natural pinnable-slice-based API gives a lot of flexibility, not least because an efficient API could be built mostly in Java, with only a small underlying layer implementing the pinnable slice interface.
3. Panama/FFI can remove some boilerplate (native method declarations) and allow Java programs to access C libraries without stub code, but calling a C++-based library still requires C stubs. A possible approach would be to use the RocksDB C API as the basis for a rebuilt Java API; this would let us remove all the existing JNI boilerplate and concentrate support effort on the C API. An alternative approach would be to build a robust API based on [Reference Counting](https://github.com/facebook/rocksdb/pull/10736), but using FFI.
4. Panama/FFI really shines as a foreign-memory standard for a Java API, allowing efficient interoperation between RocksDB Java clients and other (Java and native) components of a system. Foreign memory gives us a model for how to return data efficiently from RocksDB: as pinnable slices with their contents presented in `MemorySegment`s. If we focus on designing an API *for native interoperability*, we think this can be highly productive in opening RocksDB to new uses and opportunities in future.

### Appendix

#### Code and Data

The [experimental pull request](https://github.com/facebook/rocksdb/pull/11095/files) contains the source code implemented, together with further data plots and the source CSV files for all data plots.

#### Running

This is an example run; the jmh parameters (after `-p`) can be changed to measure performance with varying key counts, and key and value sizes.

```bash
java --enable-preview --enable-native-access=ALL-UNNAMED -jar target/rocksdbjni-jmh-1.0-SNAPSHOT-benchmarks.jar -p keyCount=100000 -p keySize=128 -p valueSize=4096,65536 -p columnFamilyTestType="no_column_family" -rf csv org.rocksdb.jmh.GetBenchmarks -wi 1 -to 1m -i 1
```

#### Processing

Use [`q`](http://harelba.github.io/q/) to select the CSV output for analysis and graphing. Note that we edited the column headings for easier processing.

```bash
q "select Benchmark,Score,Error from ./plot/jmh-result.csv where keyCount=100000 and valueSize=65536" -d, -H -C readwrite
```

#### Java 19 installation

We followed the instructions to install [Azul](https://docs.azul.com/core/zulu-openjdk/install/debian).
Then select the correct instance of Java locally:

```bash
sudo update-alternatives --config java
sudo update-alternatives --config javac
```

And set `JAVA_HOME` appropriately. In our case, `sudo update-alternatives --config java` listed a few JVMs thus:

```
  0  /usr/lib/jvm/bellsoft-java8-full-amd64/bin/java  20803123  auto mode
  1  /usr/lib/jvm/bellsoft-java8-full-amd64/bin/java  20803123  manual mode
  2  /usr/lib/jvm/java-11-openjdk-amd64/bin/java      1111      manual mode
* 3  /usr/lib/jvm/zulu19/bin/java                     2193001   manual mode
```

For our environment, we set this:

```bash
export JAVA_HOME=/usr/lib/jvm/zulu19
```

The default version of Maven available in the Ubuntu package repositories (3.6.3) is incompatible with Java 19. You will need to install and use a later [Maven](https://maven.apache.org/install.html); we used `3.8.7` successfully.

#### Java 20, 21, 22 and subsequent versions

The FFI API we used was a preview in Java 19, and the interface changed through to Java 22, where it was finalized. Future work with this prototype will need to update the code to use the changed interface.

## RocksDB Java API Performance Improvements

*Posted November 06, 2023*

Evolved Binary has been working on several aspects of how the Java API to RocksDB can be improved.
Two aspects of particular importance are performance and the developer experience.

- We have built some synthetic benchmark code to determine the most efficient methods of transferring data between Java and C++.
- We have used the results of the synthetic benchmarking to guide plans for rationalising the API interfaces.
- We have made some opportunistic performance optimizations/fixes within the Java API which have already yielded noticeable improvements.

### Synthetic JNI API Performance Benchmarks

The synthetic benchmark repository contains tests designed to isolate the Java-to/from-C++ interaction of a canonical data-intensive key/value store implemented in C++, with a Java (JNI) API layered on top.

JNI provides several mechanisms for transferring data between Java buffers and C++ buffers. These mechanisms are not trivial, because the JNI system must ensure that Java memory under the control of the JVM is not moved or garbage-collected while it is being accessed outside the direct control of the JVM.

We set out to determine which of the multiple options for transferring data from C++ to Java, and vice versa, is the most efficient. We used the [Java Microbenchmark Harness (JMH)](https://github.com/openjdk/jmh) to set up repeatable benchmarks to measure all the options.
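For readers unfamiliar with JMH, a benchmark is a small annotated class. The minimal sketch below measures a preallocated-buffer fetch; `KVStore` and its methods are hypothetical stand-ins for the JNI-backed store under test:

```java
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class GetBenchmark {
  @Param({"1024", "65536"}) int valueSize; // JMH runs each parameter combination

  KVStore store; // hypothetical JNI-backed key/value store under test
  byte[] key;
  byte[] value;

  @Setup
  public void setup() {
    store = new KVStore();
    key = "key-0".getBytes();
    value = new byte[valueSize]; // preallocated result buffer, recycled per call
    store.put(key, new byte[valueSize]);
  }

  @Benchmark
  public byte[] preallocatedGet() {
    store.get(key, value); // the JNI call fills the supplied buffer
    return value;          // returned so JMH defeats dead-code elimination
  }
}
```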
This may be a standard Java object which does the job of data access (a <code class="language-plaintext highlighter-rouge">byte[]</code> or a <code class="language-plaintext highlighter-rouge">ByteBuffer</code>) or an object of our own devising which holds references to the value in some form (a <code class="language-plaintext highlighter-rouge">FastBuffer</code> pointing to <code class="language-plaintext highlighter-rouge">sun.misc.Unsafe</code> unsafe memory, for instance).</li> </ul> <h3 id="data-types">Data Types</h3> <p>There are several potential data types for holding data for transfer, and they are unsurprisingly quite connected underneath.</p> <h4 id="byte-array">Byte Array</h4> <p>The simplest data container is a <em>raw</em> array of bytes (<code class="language-plaintext highlighter-rouge">byte[]</code>).</p> <p>There are 3 different mechanisms for transferring data between a <code class="language-plaintext highlighter-rouge">byte[]</code> and C++:</p> <ul> <li>At the C++ side, the method <a href="https://docs.oracle.com/en/java/javase/13/docs/specs/jni/functions.html#getprimitivearraycritical"><code class="language-plaintext highlighter-rouge">JNIEnv.GetPrimitiveArrayCritical()</code></a> allows access to a C++ pointer to the underlying array.</li> <li>The <code class="language-plaintext highlighter-rouge">JNIEnv</code> methods <code class="language-plaintext highlighter-rouge">GetByteArrayElements()</code> and <code class="language-plaintext highlighter-rouge">ReleaseByteArrayElements()</code> fetch references/copies to and from the contents of a byte array, with less concern for critical sections than the <em>critical</em> methods, though they are consequently more likely/certain to result in (extra) copies.</li> <li>The <code class="language-plaintext highlighter-rouge">JNIEnv</code> methods <code class="language-plaintext highlighter-rouge">GetByteArrayRegion()</code> and <code class="language-plaintext highlighter-rouge">SetByteArrayRegion()</code> transfer raw C++ buffer data to and from the contents of a byte array. These must ultimately do some data pinning for the duration of copies; the mechanisms may be similar or different to the <em>critical</em> operations, and therefore performance may differ.</li> </ul> <h4 id="byte-buffer">Byte Buffer</h4> <p>A <code class="language-plaintext highlighter-rouge">ByteBuffer</code> abstracts the contents of a collection of bytes, and was in fact introduced to support a range of higher-performance I/O operations in some circumstances.</p> <p>There are 2 types of byte buffers in Java, <em>indirect</em> and <em>direct</em>. Indirect byte buffers are the standard, and the memory they use is on-heap as with all usual Java objects. In contrast, direct byte buffers are used to wrap off-heap memory which is accessible to direct network I/O.
Either type of <code class="language-plaintext highlighter-rouge">ByteBuffer</code> can be allocated at the Java side, using the <code class="language-plaintext highlighter-rouge">allocate()</code> and <code class="language-plaintext highlighter-rouge">allocateDirect()</code> methods respectively.</p> <p>Direct byte buffers can be created in C++ using the JNI method <a href="https://docs.oracle.com/en/java/javase/13/docs/specs/jni/functions.html#newdirectbytebuffer"><code class="language-plaintext highlighter-rouge">JNIEnv.NewDirectByteBuffer()</code></a> to wrap some native (C++) memory.</p> <p>Direct byte buffers can be accessed in C++ using the <a href="https://docs.oracle.com/en/java/javase/13/docs/specs/jni/functions.html#GetDirectBufferAddress"><code class="language-plaintext highlighter-rouge">JNIEnv.GetDirectBufferAddress()</code></a> method and measured using <a href="https://docs.oracle.com/en/java/javase/13/docs/specs/jni/functions.html#GetDirectBufferCapacity"><code class="language-plaintext highlighter-rouge">JNIEnv.GetDirectBufferCapacity()</code></a>.</p> <h4 id="unsafe-memory">Unsafe Memory</h4> <p>The call <code class="language-plaintext highlighter-rouge">sun.misc.Unsafe.allocateMemory()</code> returns a handle which is (of course) just a pointer to raw memory, and can be used as such on the C++ side. We could turn it into a byte buffer on the C++ side by calling <code class="language-plaintext highlighter-rouge">JNIEnv.NewDirectByteBuffer()</code>, or simply use it as a native C++ buffer at the expected address, assuming we record or remember how much space was allocated.</p> <p>A custom <code class="language-plaintext highlighter-rouge">FastBuffer</code> class provides access to unsafe memory from the Java side.</p> <h4 id="allocation">Allocation</h4> <p>For these benchmarks, allocation has been excluded from the benchmark costs by pre-allocating a quantity of buffers of the appropriate kind as part of the test setup. Each run of the benchmark acquires an existing buffer from a pre-allocated FIFO list, and returns it afterwards.
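</p> <p>As an illustration of these mechanisms, here is a minimal sketch of a JNI <code class="language-plaintext highlighter-rouge">get()</code> which copies a value into a caller-supplied direct <code class="language-plaintext highlighter-rouge">ByteBuffer</code>; the <code class="language-plaintext highlighter-rouge">KVStore</code> class and the method names are hypothetical, not part of any real API:</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <jni.h>

#include <algorithm>
#include <cstring>

#include "kv_store.h"  // hypothetical C++ store: Get(key) returns const std::string*

// Copy a value from the C++ store into a caller-supplied direct ByteBuffer.
extern "C" JNIEXPORT jint JNICALL
Java_example_KVStore_getIntoDirectByteBuffer(JNIEnv* env, jobject,
                                             jlong jstore_handle, jlong jkey,
                                             jobject jbuf) {
  auto* store = reinterpret_cast<KVStore*>(jstore_handle);
  const std::string* value = store->Get(jkey);
  if (value == nullptr) return -1;

  // No pinning is needed: the buffer's backing memory is off-heap.
  void* dst = env->GetDirectBufferAddress(jbuf);
  const jlong capacity = env->GetDirectBufferCapacity(jbuf);
  if (dst == nullptr || capacity < 0) return -1;

  const size_t to_copy = std::min(static_cast<size_t>(capacity), value->size());
  std::memcpy(dst, value->data(), to_copy);  // the only data copy
  return static_cast<jint>(to_copy);
}
</code></pre></div></div> <p>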
A small test has confirmed that the request and return cycle is of insignificant cost compared to the benchmark API call.</p> <h3 id="getjnibenchmark-performance">GetJNIBenchmark Performance</h3> <p>Benchmarks ran for a duration of around 6 hours on an otherwise unloaded VM; the error bars are small, and we can have strong confidence in the values derived and plotted.</p> <p><img src="/static/images/jni-get-benchmarks/fig_1024_1_none_nopoolbig.png" alt="Raw JNI Get large" /></p> <p>Comparing all the benchmarks as the data size grows large, the conclusions we can draw are:</p> <ul> <li>Indirect byte buffers add cost; they are effectively an overhead on plain <code class="language-plaintext highlighter-rouge">byte[]</code> and the JNI-side only allows them to be accessed via their encapsulated <code class="language-plaintext highlighter-rouge">byte[]</code>.</li> <li><code class="language-plaintext highlighter-rouge">SetRegion</code> and <code class="language-plaintext highlighter-rouge">GetCritical</code> mechanisms for copying data into a <code class="language-plaintext highlighter-rouge">byte[]</code> are of very comparable performance; presumably the behaviour behind the scenes of <code class="language-plaintext highlighter-rouge">SetRegion</code> is very similar to that of declaring a critical region, doing a <code class="language-plaintext highlighter-rouge">memcpy()</code> and releasing the critical region.</li> <li><code class="language-plaintext highlighter-rouge">GetElements</code> methods for transferring data from C++ to Java are consistently less efficient than <code class="language-plaintext highlighter-rouge">SetRegion</code> and <code class="language-plaintext highlighter-rouge">GetCritical</code>.</li> <li>Getting into a raw memory buffer, passed as an address (the <code class="language-plaintext highlighter-rouge">handle</code> of an <code class="language-plaintext highlighter-rouge">Unsafe</code> or of a netty <code class="language-plaintext highlighter-rouge">ByteBuf</code>) is of similar cost to the more efficient <code class="language-plaintext highlighter-rouge">byte[]</code> operations.</li> <li>Getting into a direct <code class="language-plaintext highlighter-rouge">nio.ByteBuffer</code> is of similar cost again; while the ByteBuffer is passed over JNI as an ordinary Java object, JNI has a specific method for getting hold of the address of the direct buffer, and using this, the <code class="language-plaintext highlighter-rouge">get()</code> cost with a ByteBuffer is just that of the underlying C++ <code class="language-plaintext highlighter-rouge">memcpy()</code>.</li> </ul> <p>At small(er) data sizes, we can see which other factors are important.</p> <p><img src="/static/images/jni-get-benchmarks/fig_1024_1_none_nopoolsmall.png" alt="Raw JNI Get small" /></p> <ul> <li>Indirect byte buffers are the most significant overhead here. Again, we can conclude that this is due to pure overhead compared to <code class="language-plaintext highlighter-rouge">byte[]</code> operations.</li> <li>At the lowest data sizes, netty <code class="language-plaintext highlighter-rouge">ByteBuf</code>s and unsafe memory are marginally more efficient than <code class="language-plaintext highlighter-rouge">byte[]</code>s or (slightly less efficient) direct <code class="language-plaintext highlighter-rouge">nio.ByteBuffer</code>s. This may be explained by even the small cost of calling the JNI method on the C++ side simply to acquire a direct buffer address.
The margins (nanoseconds) here are extremely small.</li> </ul> <h4 id="post-processing-the-results">Post-processing the results</h4> <p>Our benchmark model for post-processing is to transfer the results into a <code class="language-plaintext highlighter-rouge">byte[]</code>. Where the result is already a <code class="language-plaintext highlighter-rouge">byte[]</code> this may seem like an unfair extra cost, but the aim is to model the least-cost processing step for any kind of result.</p> <ul> <li>Copying into a <code class="language-plaintext highlighter-rouge">byte[]</code> using the bulk methods supported by <code class="language-plaintext highlighter-rouge">byte[]</code> and <code class="language-plaintext highlighter-rouge">nio.ByteBuffer</code> has comparable performance.</li> <li>Accessing the contents of an <code class="language-plaintext highlighter-rouge">Unsafe</code> buffer using the supplied unsafe methods is inefficient. The access is word by word, in Java.</li> <li>Accessing the contents of a netty <code class="language-plaintext highlighter-rouge">ByteBuf</code> is similarly inefficient; again the access is presumably word by word, using normal Java mechanisms.</li> </ul> <p><img src="/static/images/jni-get-benchmarks/fig_1024_1_copyout_nopoolbig.png" alt="Copy out JNI Get" /></p> <h3 id="putjnibenchmark">PutJNIBenchmark</h3> <p>We benchmarked <code class="language-plaintext highlighter-rouge">Put</code> methods in a similar synthetic fashion in less depth, but enough to confirm that the performance profile is similar/symmetrical. As with <code class="language-plaintext highlighter-rouge">get()</code>, using <code class="language-plaintext highlighter-rouge">GetElements</code> is the least performant way of implementing transfers to/from Java objects in C++/JNI, and the other JNI mechanisms do not differ greatly from one another.</p> <h2 id="lessons-from-synthetic-api">Lessons from Synthetic API</h2> <p>Performance analysis shows that for <code class="language-plaintext highlighter-rouge">get()</code>, fetching into an allocated <code class="language-plaintext highlighter-rouge">byte[]</code> is as efficient as any other mechanism, as long as JNI region methods are used for the internal data transfer. Copying out or otherwise using the result on the Java side is straightforward and efficient. Using <code class="language-plaintext highlighter-rouge">byte[]</code> avoids the manual memory management required with direct <code class="language-plaintext highlighter-rouge">nio.ByteBuffer</code>s, and that extra work does not appear to provide any gain. A C++ implementation using the <code class="language-plaintext highlighter-rouge">GetRegion</code> JNI method is probably to be preferred to using <code class="language-plaintext highlighter-rouge">GetCritical</code> because while their performance is equal, <code class="language-plaintext highlighter-rouge">GetRegion</code> is a higher-level/simpler abstraction.</p> <p>Vitally, whatever JNI transfer mechanism is chosen, the buffer allocation mechanism and pattern is crucial to achieving good performance.
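</p> <p>A minimal sketch of that pre-allocate-and-recycle pattern, as used in our benchmark setup, might look like the following; the <code class="language-plaintext highlighter-rouge">BufferPool</code> class is illustrative only:</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <deque>
#include <mutex>
#include <string>

// Illustrative buffer pool: allocate once up front, recycle per call.
// Callers must Release() every buffer they Acquire(), and the pool must
// be sized for the peak number of in-flight buffers.
class BufferPool {
 public:
  BufferPool(size_t count, size_t buffer_size)
      : free_(count, std::string(buffer_size, '\0')) {}

  std::string Acquire() {
    std::lock_guard<std::mutex> guard(mu_);
    std::string buf = std::move(free_.front());
    free_.pop_front();
    return buf;
  }

  void Release(std::string buf) {
    std::lock_guard<std::mutex> guard(mu_);
    free_.push_back(std::move(buf));
  }

 private:
  std::deque<std::string> free_;
  std::mutex mu_;
};
</code></pre></div></div> <p>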
We experimented with making netty’s pooled allocator part of the benchmark, and the difference between <code class="language-plaintext highlighter-rouge">getIntoPooledNettyByteBuf</code>, using the allocator, and <code class="language-plaintext highlighter-rouge">getIntoNettyByteBuf</code>, using the same pre-allocation on setup as every other benchmark, is significant.</p> <p>Equally importantly, transfer of data to or from buffers should where possible be done in bulk, using array copy or buffer copy mechanisms. Thought should perhaps be given to supporting common transformations in the underlying C++ layer.</p> <h2 id="api-recommendations">API Recommendations</h2> <p>Of course there is some noise within the results, but we can agree:</p> <ul> <li>Don’t make copies you don’t need to make</li> <li>Don’t allocate/deallocate when you can avoid it</li> </ul> <p>Translating this into designing an efficient API, we want to:</p> <ul> <li>Support API methods that return results in buffers supplied by the client.</li> <li>Support <code class="language-plaintext highlighter-rouge">byte[]</code>-based APIs as the simplest way of getting data into a usable configuration for a broad range of Java use.</li> <li>Support direct <code class="language-plaintext highlighter-rouge">ByteBuffer</code>s as these can reduce copies when used as part of a chain of <code class="language-plaintext highlighter-rouge">ByteBuffer</code>-based operations. This sort of sophisticated streaming model is most likely to be used by clients where performance is important, and so we decided to support it.</li> <li>Support indirect <code class="language-plaintext highlighter-rouge">ByteBuffer</code>s for a combination of reasons: <ul> <li>API consistency between direct and indirect buffers</li> <li>Simplicity of implementation, as we can wrap <code class="language-plaintext highlighter-rouge">byte[]</code>-oriented methods</li> </ul> </li> <li>Continue to support methods which allocate return buffers per-call, as these are the easiest to use on initial encounter with the RocksDB API.</li> </ul> <p>High performance Java interaction with RocksDB ultimately requires architectural decisions by the client:</p> <ul> <li>Use more complex (client-supplied buffer) API methods where performance matters</li> <li>Don’t allocate/deallocate where you don’t need to <ul> <li>recycle your own buffers where this makes sense</li> <li>or make sure that you are supplying the ultimate destination buffer (your cache, or a target network buffer) as input to RocksDB <code class="language-plaintext highlighter-rouge">get()</code> and <code class="language-plaintext highlighter-rouge">put()</code> calls</li> </ul> </li> </ul> <p>We are currently implementing a number of extra methods consistently across the Java fetch and store APIs to RocksDB in the PR <a href="https://github.com/facebook/rocksdb/pull/11019">Java API consistency between RocksDB.put() , .merge() and Transaction.put() , .merge()</a> according to these principles.</p> <h2 id="optimizations">Optimizations</h2> <h3 id="reduce-copies-within-api-implementation">Reduce Copies within API Implementation</h3> <p>Having analysed JNI performance as described, we reviewed the core of RocksJNI for opportunities to improve the performance.
We noticed one thing in particular: some of the <code class="language-plaintext highlighter-rouge">get()</code> methods of the Java API had not been updated to take advantage of the new <a href="http://rocksdb.org/blog/2017/08/24/pinnableslice.html"><code class="language-plaintext highlighter-rouge">PinnableSlice</code></a> methods.</p> <p>Fixing this turned out to be a straightforward change, which has now been incorporated in the codebase: <a href="https://github.com/facebook/rocksdb/pull/10970">Improve Java API <code class="language-plaintext highlighter-rouge">get()</code> performance by reducing copies</a>.</p> <h4 id="performance-results">Performance Results</h4> <p>Using the JMH performance tests we updated as part of the above PR, we can see a small but consistent improvement in performance for all of the different get method variants that we enhanced in the PR.</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 </pre></td><td class="rouge-code"><pre>java <span class="nt">-jar</span> target/rocksdbjni-jmh-1.0-SNAPSHOT-benchmarks.jar <span class="nt">-p</span> <span class="nv">keyCount</span><span class="o">=</span>1000,50000 <span class="nt">-p</span> <span class="nv">keySize</span><span class="o">=</span>128 <span class="nt">-p</span> <span class="nv">valueSize</span><span class="o">=</span>1024,16384 <span class="nt">-p</span> <span class="nv">columnFamilyTestType</span><span class="o">=</span><span class="s2">"1_column_family"</span> GetBenchmarks.get GetBenchmarks.preallocatedByteBufferGet GetBenchmarks.preallocatedGet </pre></td></tr></tbody></table></code></pre></div></div> <p>The y-axis shows <code class="language-plaintext highlighter-rouge">ops/sec</code> in throughput, so higher is better.</p> <p><img src="/static/images/jni-get-benchmarks/optimization-graph.png" alt="" /></p> <h3 id="analysis">Analysis</h3> <p>Before the invention of the PinnableSlice, the simplest RocksDB (native) API <code class="language-plaintext highlighter-rouge">Get()</code> looked like this:</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 </pre></td><td class="rouge-code"><pre><span class="n">Status</span> <span class="n">Get</span><span class="p">(</span><span class="k">const</span> <span class="n">ReadOptions</span><span class="o">&</span> <span class="n">options</span><span class="p">,</span> <span class="n">ColumnFamilyHandle</span><span class="o">*</span> <span class="n">column_family</span><span class="p">,</span> <span class="k">const</span> <span class="n">Slice</span><span class="o">&</span> <span class="n">key</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">string</span><span class="o">*</span> <span class="n">value</span><span class="p">)</span> </pre></td></tr></tbody></table></code></pre></div></div> <p>After PinnableSlice, the correct way for new code to implement a <code class="language-plaintext highlighter-rouge">get()</code> is like this:</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 </pre></td><td class="rouge-code"><pre><span class="n">Status</span> <span class="n">Get</span><span
class="p">(</span><span class="k">const</span> <span class="n">ReadOptions</span><span class="o">&</span> <span class="n">options</span><span class="p">,</span> <span class="n">ColumnFamilyHandle</span><span class="o">*</span> <span class="n">column_family</span><span class="p">,</span> <span class="k">const</span> <span class="n">Slice</span><span class="o">&</span> <span class="n">key</span><span class="p">,</span> <span class="n">PinnableSlice</span><span class="o">*</span> <span class="n">value</span><span class="p">)</span> </pre></td></tr></tbody></table></code></pre></div></div> <p>But of course RocksDB has to support legacy code, so there is an <code class="language-plaintext highlighter-rouge">inline</code> method in <code class="language-plaintext highlighter-rouge">db.h</code> which re-implements the former using the latter. And RocksJava API implementation seamlessly continues to use the <code class="language-plaintext highlighter-rouge">std::string</code>-based <code class="language-plaintext highlighter-rouge">get()</code></p> <p>Let’s examine what happens when get() is called from Java</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 </pre></td><td class="rouge-code"><pre><span class="n">jint</span> <span class="n">Java_org_rocksdb_RocksDB_get__JJ_3BII_3BIIJ</span><span class="p">(</span> <span class="n">JNIEnv</span><span class="o">*</span> <span class="n">env</span><span class="p">,</span> <span class="n">jobject</span><span class="p">,</span> <span class="n">jlong</span> <span class="n">jdb_handle</span><span class="p">,</span> <span class="n">jlong</span> <span class="n">jropt_handle</span><span class="p">,</span> <span class="n">jbyteArray</span> <span class="n">jkey</span><span class="p">,</span> <span class="n">jint</span> <span class="n">jkey_off</span><span class="p">,</span> <span class="n">jint</span> <span class="n">jkey_len</span><span class="p">,</span> <span class="n">jbyteArray</span> <span class="n">jval</span><span class="p">,</span> <span class="n">jint</span> <span class="n">jval_off</span><span class="p">,</span> <span class="n">jint</span> <span class="n">jval_len</span><span class="p">,</span> <span class="n">jlong</span> <span class="n">jcf_handle</span><span class="p">)</span> </pre></td></tr></tbody></table></code></pre></div></div> <ol> <li>Create an empty <code class="language-plaintext highlighter-rouge">std::string value</code></li> <li>Call <code class="language-plaintext highlighter-rouge">DB::Get()</code> using the <code class="language-plaintext highlighter-rouge">std::string</code> variant</li> <li>Copy the resultant <code class="language-plaintext highlighter-rouge">std::string</code> into Java, using the JNI <code class="language-plaintext highlighter-rouge">SetByteArrayRegion()</code> method</li> </ol> <p>So stage (3) costs us a copy into Java. 
It’s mostly unavoidable that there will be at least the one copy from a C++ buffer into a Java buffer.</p> <p>But what does stage (2) do?</p> <ul> <li>Create a <code class="language-plaintext highlighter-rouge">PinnableSlice(std::string&)</code> which uses the value as the slice’s backing buffer.</li> <li>Call <code class="language-plaintext highlighter-rouge">DB::Get()</code> using the PinnableSlice variant</li> <li>Work out if the slice has pinned data, in which case copy the pinned data into value and release it.</li> <li>...or, if the slice has not pinned data, it is already in value (because we tried, but couldn’t pin anything).</li> </ul> <p>So stage (2) costs us a copy into a <code class="language-plaintext highlighter-rouge">std::string</code>. But! It’s just a naive <code class="language-plaintext highlighter-rouge">std::string</code> that we have copied a large buffer into. And in RocksDB, the buffer is or can be large, so an extra copy is something we need to worry about.</p> <p>Luckily this is easy to fix. In the Java API (JNI) implementation:</p> <ol> <li>Create a PinnableSlice() which uses its own default backing buffer.</li> <li>Call <code class="language-plaintext highlighter-rouge">DB::Get()</code> using the <code class="language-plaintext highlighter-rouge">PinnableSlice</code> variant of the RocksDB API.</li> <li>Work out if the slice has successfully pinned data, in which case copy the pinned data straight into the Java output buffer using the JNI <code class="language-plaintext highlighter-rouge">SetByteArrayRegion()</code> method, then release the pin.</li> <li>...or, if the slice has not pinned data, it is in the pinnable slice’s default backing buffer. All that is left is to copy it straight into the Java output buffer using the JNI <code class="language-plaintext highlighter-rouge">SetByteArrayRegion()</code> method.</li> </ol> <p>In the case where the <code class="language-plaintext highlighter-rouge">PinnableSlice</code> has successfully pinned the data, this saves us the intermediate copy to the <code class="language-plaintext highlighter-rouge">std::string</code>. In the case where it hasn’t, we still have the extra copy, so the observed performance improvement depends on whether the data can be pinned. Luckily, our benchmarking suggests that the pin is happening in a significant number of cases.</p> <p>In discussion with the RocksDB core team, we understand that the core <code class="language-plaintext highlighter-rouge">PinnableSlice</code> optimization is most likely to succeed when pages are loaded from the block cache, rather than when they are in <code class="language-plaintext highlighter-rouge">memtable</code>. And it might be possible to successfully pin in the <code class="language-plaintext highlighter-rouge">memtable</code> as well, with some extra coding effort.
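</p> <p>A matching sketch of the optimized path (again with illustrative names) shows the intermediate <code class="language-plaintext highlighter-rouge">std::string</code> gone:</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <jni.h>

#include "rocksdb/db.h"

// Hypothetical condensation of the optimized JNI get() path.
jint GetPinnable(JNIEnv* env, rocksdb::DB* db,
                 const rocksdb::ReadOptions& read_options,
                 rocksdb::ColumnFamilyHandle* column_family,
                 const rocksdb::Slice& key_slice,
                 jbyteArray jval, jint jval_off) {
  rocksdb::PinnableSlice pinnable;  // uses its own default backing buffer
  rocksdb::Status s =
      db->Get(read_options, column_family, key_slice, &pinnable);
  if (!s.ok()) return -1;
  // A single copy: straight from the pinned (or backing) data into Java.
  const jsize len = static_cast<jsize>(pinnable.size());
  env->SetByteArrayRegion(jval, jval_off, len,
                          reinterpret_cast<const jbyte*>(pinnable.data()));
  pinnable.Reset();  // release the pin
  return len;
}
</code></pre></div></div> <p>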
Pinning in the <code class="language-plaintext highlighter-rouge">memtable</code> as well would likely improve the results for these benchmarks.</p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/100032386042884/picture/" alt="" title="" /> </div> <p class="post-authorName">Jay Zhuang</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2022/11/09/time-aware-tiered-storage.html">Time-Aware Tiered Storage in RocksDB</a></h1> <p class="post-meta">Posted November 09, 2022</p> </header> <article class="post-content"> <h2 id="tldr">TL;DR</h2> <p>Tiered storage is now natively supported in RocksDB with the option <a href="https://github.com/facebook/rocksdb/blob/b0d9776b704af01c2b5385e9d53754e0c8176373/include/rocksdb/advanced_options.h#L910"><code class="language-plaintext highlighter-rouge">last_level_temperature</code></a>, and the time-aware tiered storage feature guarantees that recently written data is kept on hot-tier storage, controlled with the option <a href="https://github.com/facebook/rocksdb/blob/b0d9776b704af01c2b5385e9d53754e0c8176373/include/rocksdb/advanced_options.h#L927"><code class="language-plaintext highlighter-rouge">preclude_last_level_data_seconds</code></a>.</p> <h2 id="background">Background</h2> <p>RocksDB tiered storage assigns a data temperature when creating a new SST, which <a href="https://github.com/facebook/rocksdb/blob/b0d9776b704af01c2b5385e9d53754e0c8176373/include/rocksdb/file_system.h#L162">hints to the file system</a> to put the data on the corresponding storage media, so the data in a single DB instance can be placed on different storage media. Before this feature, the user typically created multiple DB instances for different storage media: for example, one DB instance stores the recent hot data and migrates the data to another, cold DB instance when the data becomes cold. Tracking and migrating the data could be challenging. With the RocksDB tiered storage feature, RocksDB compaction migrates the data from hot storage to cold storage.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/time-aware-tiered-storage/tiered_storage_overview.png" alt="" /></p> <p>Currently, RocksDB supports assigning the last level file temperature. In an LSM tree, the last level data is typically the coldest, as the most recent data is on the higher levels and is gradually compacted to the lower levels. The higher level data is more likely to be read, because:</p> <ol> <li>A RocksDB read always queries from the higher level to the lower level until it finds the data;</li> <li>The high-level data is much more likely to be read and written by compactions.</li> </ol> <h3 id="problem">Problem</h3> <p>Generally in the LSM tree, hotter data is likely on the higher levels as mentioned before, <strong>but it is not always the case</strong>: for a skewed dataset, for example, the recent data could be compacted to the last level first. For universal compaction, a major compaction would compact all data to the last level (the cold tier), including recent data that should be cataloged as hot.
In production, <strong>we found the majority of the compaction load is actually major compaction (more than 80%)</strong>.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/time-aware-tiered-storage/tiered_storage_problem.png" alt="" /></p> <h3 id="goal-and-non-goals">Goal and Non-goals</h3> <p>It’s hard to predict the hot and cold data. The most frequently accessed data should be cataloged as hot data. But it is hard to predict which key is going to be accessed most, and it is also hard to track per-key access history. The time-aware tiered storage feature <strong>focuses only on the use cases where more recent data is more likely to be accessed</strong>, which is the majority of cases, but not all.</p> <h2 id="user-apis">User APIs</h2> <p>Here are the 3 main tiered storage options:</p> <div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 </pre></td><td class="rouge-code"><pre><span class="n">Temperature</span> <span class="n">last_level_temperature</span> <span class="o">=</span> <span class="n">Temperature</span><span class="o">::</span><span class="n">kUnknown</span><span class="p">;</span> <span class="kt">uint64_t</span> <span class="n">preclude_last_level_data_seconds</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="kt">uint64_t</span> <span class="n">preserve_internal_time_seconds</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> </pre></td></tr></tbody></table></code></pre></div></div> <p><a href="https://github.com/facebook/rocksdb/blob/b0d9776b704af01c2b5385e9d53754e0c8176373/include/rocksdb/advanced_options.h#L910"><code class="language-plaintext highlighter-rouge">last_level_temperature</code></a> defines the data temperature for the last level SST files, which is typically kCold or kWarm. RocksDB doesn’t check the option value; instead, it just passes it to the file system API with <a href="https://github.com/facebook/rocksdb/blob/b0d9776b704af01c2b5385e9d53754e0c8176373/include/rocksdb/file_system.h#L162"><code class="language-plaintext highlighter-rouge">FileOptions.temperature</code></a> when creating the last level SST files. For all the other files, that is, non-last-level SST files and non-SST files like manifest files, the temperature is set to kUnknown, which typically maps to hot data.
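</p> <p>For example, a minimal (illustrative) configuration that keeps roughly the last 3 days of writes on the hot tier might look like this:</p> <div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include "rocksdb/db.h"

// Illustrative setup: last-level SSTs are written as kCold, and data
// newer than ~3 days is precluded from the last level (kept hot).
// The DB path is a placeholder.
rocksdb::Status OpenTieredDb(rocksdb::DB** db) {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.last_level_temperature = rocksdb::Temperature::kCold;
  options.preclude_last_level_data_seconds = 3 * 24 * 60 * 60;  // 3 days
  return rocksdb::DB::Open(options, "/path/to/db", db);
}
</code></pre></div></div> <p>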
The user can also get each SST’s temperature information through APIs:</p> <div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 </pre></td><td class="rouge-code"><pre><span class="n">db</span><span class="p">.</span><span class="n">GetLiveFilesStorageInfo</span><span class="p">();</span> <span class="n">db</span><span class="p">.</span><span class="n">GetLiveFilesMetaData</span><span class="p">();</span> <span class="n">db</span><span class="p">.</span><span class="n">GetColumnFamilyMetaData</span><span class="p">();</span> </pre></td></tr></tbody></table></code></pre></div></div> <h3 id="user-metrics">User Metrics</h3> <p>Here are the tiered storage related statistics:</p> <div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 6 7 8 9 10 11 </pre></td><td class="rouge-code"><pre><span class="n">HOT_FILE_READ_BYTES</span><span class="p">,</span> <span class="n">WARM_FILE_READ_BYTES</span><span class="p">,</span> <span class="n">COLD_FILE_READ_BYTES</span><span class="p">,</span> <span class="n">HOT_FILE_READ_COUNT</span><span class="p">,</span> <span class="n">WARM_FILE_READ_COUNT</span><span class="p">,</span> <span class="n">COLD_FILE_READ_COUNT</span><span class="p">,</span> <span class="c1">// Last level and non-last level statistics</span> <span class="n">LAST_LEVEL_READ_BYTES</span><span class="p">,</span> <span class="n">LAST_LEVEL_READ_COUNT</span><span class="p">,</span> <span class="n">NON_LAST_LEVEL_READ_BYTES</span><span class="p">,</span> <span class="n">NON_LAST_LEVEL_READ_COUNT</span><span class="p">,</span> </pre></td></tr></tbody></table></code></pre></div></div> <p>And more details from <code class="language-plaintext highlighter-rouge">IOStats</code>:</p> <div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 6 7 8 9 10 11 12 13 </pre></td><td class="rouge-code"><pre><span class="k">struct</span> <span class="nc">FileIOByTemperature</span> <span class="p">{</span> <span class="c1">// the number of bytes read to Temperature::kHot file</span> <span class="kt">uint64_t</span> <span class="n">hot_file_bytes_read</span><span class="p">;</span> <span class="c1">// the number of bytes read to Temperature::kWarm file</span> <span class="kt">uint64_t</span> <span class="n">warm_file_bytes_read</span><span class="p">;</span> <span class="c1">// the number of bytes read to Temperature::kCold file</span> <span class="kt">uint64_t</span> <span class="n">cold_file_bytes_read</span><span class="p">;</span> <span class="c1">// total number of reads to Temperature::kHot file</span> <span class="kt">uint64_t</span> <span class="n">hot_file_read_count</span><span class="p">;</span> <span class="c1">// total number of reads to Temperature::kWarm file</span> <span class="kt">uint64_t</span> <span class="n">warm_file_read_count</span><span class="p">;</span> <span class="c1">// total number of reads to Temperature::kCold file</span> <span class="kt">uint64_t</span> <span class="n">cold_file_read_count</span><span class="p">;</span> </pre></td></tr></tbody></table></code></pre></div></div> <h2 id="implementation">Implementation</h2> <p>There are 2 main components for this 
feature. One is <strong>time-tracking</strong>, and the other is <strong>per-key based placement compaction</strong>. These 2 components are relatively independent and are linked together during the compaction initialization phase, which gets the sequence number for splitting the hot and cold data. The time-tracking component can even be enabled independently by setting the option <a href="https://github.com/facebook/rocksdb/blob/b0d9776b704af01c2b5385e9d53754e0c8176373/include/rocksdb/advanced_options.h#L950"><code class="language-plaintext highlighter-rouge">preserve_internal_time_seconds</code></a>. The purpose of that is to track time information before migrating existing use cases to the tiered storage feature, to avoid compacting the existing hot data to the cold tier (detailed in the Migration section below).</p> <p>Unlike the user-defined timestamp feature, the time tracking feature doesn’t have accurate time information for each key. It only samples the time information and gives a rough estimation for the key write time. Here is the high-level graph for the implementation:</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/time-aware-tiered-storage/tiered_storage_design.png" alt="" /></p> <h3 id="time-tracking">Time Tracking</h3> <p>Time tracking information is recorded by a <a href="https://github.com/facebook/rocksdb/blob/d9e71fb2c53726d9c5ed73b4ec962a7ed6ef15ec/db/periodic_task_scheduler.cc#L36">periodic task</a> which gets the latest sequence number and the current time and then stores the pair in an in-memory data structure. The interval of the periodic task is determined by taking the user setting <a href="https://github.com/facebook/rocksdb/blob/b0d9776b704af01c2b5385e9d53754e0c8176373/include/rocksdb/advanced_options.h#L950"><code class="language-plaintext highlighter-rouge">preserve_internal_time_seconds</code></a> and dividing it by 100. For example, if 3 days of data should be precluded from the last level, then the interval of the periodic task is about 0.7 hours (3 * 24 / 100 ~= 0.72), which also means only the latest 100 seq->time pairs need to be kept in memory.</p> <p>Currently, the in-memory seq_time_mapping is only used during Flush() and encoded into the SST property. The data is delta-encoded and, again, at most 100 pairs are stored, so the extra data size is pretty small (far less than 1KB per SST) and only non-last-level SSTs need to have that information. Internally, RocksDB also uses the minimum sequence number and SST creation time from the SST metadata to improve the time accuracy. <strong>The sequence number to time information is distributed in each SST</strong>, ranging from the min seqno to max seqno for that SST file, so each SST has its own self-contained time information. This also means there could be redundancy in the time information; for example, if 2 SSTs have overlapping sequence numbers (which is very likely for non-L0 files), the same seq->time pair may exist in both SSTs.
In the future, the time information could also be useful for other potential features, like a better estimate of the oldest timestamp for an SST, which is critical for the RocksDB TTL feature.</p> <h3 id="per-key-placement-compaction">Per-Key Placement Compaction</h3> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/time-aware-tiered-storage/per_key_placement_compaction.png" alt="" /></p> <p>Compared to normal compaction, which only outputs the data to a single level, per-key placement compaction can output data to 2 different levels. As per-key placement compaction is only for the last level compaction, the 2 output levels would <strong>always be the penultimate level, and the last level</strong>. The compaction places each key in its corresponding tier by simply checking the key’s sequence number.</p> <p>At the beginning of the compaction, the compaction job collects all seq->time information from every input SST and merges it together, then uses the current time to get the oldest sequence number that should be put into the non-last level (hot tier). During the last level compaction, as long as the key is newer than the oldest_sequence_number, it will be placed in the penultimate level (hot tier) instead of the last level (cold tier).</p> <p>Note that RocksDB also places the keys that are within a user snapshot in the hot tier; there are a few reasons for that:</p> <ol> <li>It’s reasonable to assume snapshot-protected data are hot data;</li> <li>Avoid mixing data whose sequence number is not zeroed out with old last-level data, which is desirable to reduce the oldest obsolete data time (defined as the time of the oldest SST that has a non-zero sequence number). It also means tombstones are always placed in the hot tier, which is also desirable as they should be pretty small.</li> <li>The original motivation was to avoid moving data from a lower level to a higher level in case the user increases <a href="https://github.com/facebook/rocksdb/blob/b0d9776b704af01c2b5385e9d53754e0c8176373/include/rocksdb/advanced_options.h#L927"><code class="language-plaintext highlighter-rouge">preclude_last_level_data_seconds</code></a>, in which case the snapshot-protected data in the last level would become hot again and need moving to a higher level. It’s not always safe to move data from a lower level to a higher level in the LSM tree, which could cause key conflicts. Later we added a conflict check to allow the data to move up as long as there’s no key conflict, but then the movement is not guaranteed (see Migration for details)</li> </ol> <h3 id="migration">Migration</h3> <p>Once the user enables the feature, it enables both time tracking and per-key placement compaction <strong>at the same time</strong>, so the existing data can still be mismarked as cold data. To have a smooth migration to the feature, the user can enable the time-tracking feature first. For example, if the user plans to set <a href="https://github.com/facebook/rocksdb/blob/b0d9776b704af01c2b5385e9d53754e0c8176373/include/rocksdb/advanced_options.h#L927"><code class="language-plaintext highlighter-rouge">preclude_last_level_data_seconds</code></a> to 3 days, the user can enable time tracking 3 days earlier with <a href="https://github.com/facebook/rocksdb/blob/b0d9776b704af01c2b5385e9d53754e0c8176373/include/rocksdb/advanced_options.h#L950"><code class="language-plaintext highlighter-rouge">preserve_internal_time_seconds</code></a>.
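</p> <p>An illustrative two-phase migration, assuming each phase corresponds to (re)opening the DB with the options shown, might look like this:</p> <div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include "rocksdb/options.h"

// Illustrative two-phase migration sketch; the option names are the
// real ones cited above, the functions are ours.
void ConfigurePhase1(rocksdb::Options& options) {
  // Track write-time information only, for at least 3 days.
  options.preserve_internal_time_seconds = 3 * 24 * 60 * 60;
}

void ConfigurePhase2(rocksdb::Options& options) {
  // At least 3 days later: enable tiering itself.
  options.preclude_last_level_data_seconds = 3 * 24 * 60 * 60;
  options.last_level_temperature = rocksdb::Temperature::kCold;
}
</code></pre></div></div> <p>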
Then, when the tiered storage feature is enabled, RocksDB already has the time information for the last 3 days’ hot data, and per-key placement compaction won’t compact it to the last level.</p> <p>Just preserving the time information won’t prevent the data from compacting to the last level (data which should still be on the hot tier). Once the <a href="https://github.com/facebook/rocksdb/blob/b0d9776b704af01c2b5385e9d53754e0c8176373/include/rocksdb/advanced_options.h#L927"><code class="language-plaintext highlighter-rouge">preclude_last_level_data_seconds</code></a> and <a href="https://github.com/facebook/rocksdb/blob/b0d9776b704af01c2b5385e9d53754e0c8176373/include/rocksdb/advanced_options.h#L910"><code class="language-plaintext highlighter-rouge">last_level_temperature</code></a> features are enabled, some of the last-level data might need to move up. Currently, RocksDB just does a conflict check, so the hot/cold split in this case is not guaranteed.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/time-aware-tiered-storage/compaction_moving_up_conflict.png" alt="" /></p> <h2 id="summary">Summary</h2> <p>The time-aware tiered storage feature guarantees that new data is placed in the hot tier, which <strong>is ideal for the tiering use cases where the most recent data is likely the hot data</strong>. It’s done by tracking write time information and using per-key placement compaction to split the hot/cold data.</p> <p>The tiered storage feature is actively being developed; any suggestions or PRs are welcome.</p> <h2 id="acknowledgements">Acknowledgements</h2> <p>We thank Siying Dong and Andrew Kryczka for brainstorming and reviewing the feature design and implementation. And it was my good fortune to work with the RocksDB team members!</p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/100032386042884/picture/" alt="" title="" /> </div> <p class="post-authorName">Jay Zhuang</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2022/10/31/align-compaction-output-file.html">Reduce Write Amplification by Aligning Compaction Output File Boundaries</a></h1> <p class="post-meta">Posted October 31, 2022</p> </header> <article class="post-content"> <h2 id="tldr">TL;DR</h2> <p>By cutting the compaction output file earlier, and allowing it to grow larger than the target file size in order to align the compaction output files with the next level’s files, RocksDB can <strong>reduce WA (Write Amplification) by more than 10%</strong>. The feature is <strong>enabled by default</strong> after the user upgrades RocksDB to version <code class="language-plaintext highlighter-rouge">7.8.0+</code>.</p> <h2 id="background">Background</h2> <p>RocksDB level compaction picks one file from the source level and compacts it to the next level, which is a typical partial merge compaction algorithm. Compared to a full merge compaction strategy, for example <a href="https://github.com/facebook/rocksdb/wiki/Universal-Compaction">universal compaction</a>, it has the benefits of smaller compaction size, better parallelism, etc. But it also has a larger write amplification (typically 20-30 times user data).
One of the problems is wasted compaction at the beginning and ending:</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/align-compaction-output/file_cut_normal.png" alt="" /></p> <p>In the diagram above, <code class="language-plaintext highlighter-rouge">SST11</code> is selected for the compaction; it overlaps with <code class="language-plaintext highlighter-rouge">SST20</code> to <code class="language-plaintext highlighter-rouge">SST23</code>, so all these files are selected for compaction. But the beginning and ending of the SSTs on Level 2 are wasted, which also means that data will be compacted again when <code class="language-plaintext highlighter-rouge">SST10</code> is compacted down. If the file boundaries are aligned, then the wasted compaction size could be reduced. On average, the wasted compaction is <code class="language-plaintext highlighter-rouge">1</code> file size: <code class="language-plaintext highlighter-rouge">0.5</code> at the beginning, and <code class="language-plaintext highlighter-rouge">0.5</code> at the end. Typically the average compaction fan-out is about 6 (with the default max_bytes_for_level_multiplier = 10), so <code class="language-plaintext highlighter-rouge">1 / (6 + 1) ~= 14%</code> of compaction is wasted.</p> <h2 id="implementation">Implementation</h2> <p>To reduce such wasted compaction, RocksDB now tries to align the compaction output file to the next level’s files, so future compactions will have less wasted compaction. For example, the above case might be cut like this:</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/align-compaction-output/file_cut_align.png" alt="" /></p> <p>The trade-off is that the file won’t be cut exactly after it exceeds target_file_size_base; instead, it will more likely be cut when it’s aligned with a next level file’s boundary, so the file size might be more varied. It could be as small as 50% of <code class="language-plaintext highlighter-rouge">target_file_size</code> or as large as <code class="language-plaintext highlighter-rouge">2x target_file_size</code>. It will only impact non-bottommost-level files, which should be only <code class="language-plaintext highlighter-rouge">~11%</code> of the data. Internally, RocksDB tries to cut the file so its size is close to the <code class="language-plaintext highlighter-rouge">target_file_size</code> setting but also aligned with the next level boundary. When the compaction output file hits a next-level file boundary, either the beginning or ending boundary, it will cut if:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 </pre></td><td class="rouge-code"><pre>current_size > ((5 * min(boundaries_num, 8) + 50) / 100) * target_file_size </pre></td></tr></tbody></table></code></pre></div></div> <p>(<a href="https://github.com/facebook/rocksdb/blob/23fa5b7789d6acd0c211d6bdd41448bbf1513bb6/db/compaction/compaction_outputs.cc#L270-L290">details</a>)</p> <p>The file size is also capped at <code class="language-plaintext highlighter-rouge">2x target_file_size</code>: <a href="https://github.com/facebook/rocksdb/blob/f726d29a8268ae4e2ffeec09172383cff2ab4db9/db/compaction/compaction.cc#L273-L277">details</a>.
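</p> <p>Restated as code (our own paraphrase of the condition above, not the exact RocksDB source), the threshold works out as follows:</p> <div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <algorithm>
#include <cstdint>

// Paraphrase of the boundary-cut condition: the more next-level
// boundaries the output file has crossed, the closer to
// target_file_size it may grow before being cut (50% up to a 90% cap).
bool ShouldCutAtBoundary(uint64_t current_size, uint64_t boundaries_num,
                         uint64_t target_file_size) {
  const uint64_t pct = 5 * std::min<uint64_t>(boundaries_num, 8) + 50;
  return current_size > pct * target_file_size / 100;
}
// e.g. after crossing 4 next-level boundaries, the threshold is
// (5 * 4 + 50)% = 70% of target_file_size.
</code></pre></div></div> <p>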
Another benefit of cutting the file earlier is having more trivial move compactions, where a file is moved from a higher level to a lower level without compacting anything. Based on a compaction simulator test, the trivially moved data increases by 30% (though still less than 1% of compaction data is trivially moved):</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/align-compaction-output/file_cut_trival_move.png" alt="" /></p> <p>Based on the db_bench test, it can save <code class="language-plaintext highlighter-rouge">~12%</code> compaction load; here are the test command and result:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 6 7 8 9 </pre></td><td class="rouge-code"><pre>TEST_TMPDIR=/data/dbbench ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432 # baseline: Flush(GB): cumulative 25.882, interval 7.216 Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds # with this change: Flush(GB): cumulative 25.882, interval 7.753 Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds </pre></td></tr></tbody></table></code></pre></div></div> <p>The feature is enabled by default upon upgrading to RocksDB 7.8 or later versions, as it should have a limited impact on file size while giving a significant write amplification improvement. In the rare case that it needs to be opted out of, set</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 </pre></td><td class="rouge-code"><pre>options.level_compaction_dynamic_file_size = false; </pre></td></tr></tbody></table></code></pre></div></div> <h2 id="other-options-and-benchmark">Other Options and Benchmark</h2> <p>We also tested a few other options, starting with fixed thresholds of 75% and 50% of target_file_size, and then the dynamic threshold explained above while still limiting the file size to be smaller than target_file_size.</p> <ol> <li>Baseline (main branch before <a href="https://github.com/facebook/rocksdb/pull/10655">PR#10655</a>);</li> <li>Fixed Threshold <code class="language-plaintext highlighter-rouge">75%</code>: after 75% of target file size, cut the file whenever it aligns with a low level file boundary;</li> <li>Fixed Threshold <code class="language-plaintext highlighter-rouge">50%</code>: reduce the threshold to 50% of target file size;</li> <li>Dynamic Threshold <code class="language-plaintext highlighter-rouge">(5*boundaries_num + 50)</code> percent of target file size and maxed at 90%;</li> <li>Dynamic Threshold + allow 2x the target file size (chosen option).</li> </ol> <h3 id="test-environment-and-data">Test Environment and Data</h3> <p>To speed up the benchmark, we introduced a compaction simulator within RocksDB (<a href="https://github.com/jay-zhuang/rocksdb/tree/compaction_sim">details</a>), which replaces the physical SSTs with in-memory data (a large bitset) and can test compaction more consistently.
As it’s a simulator, it has its limitations:</p> <ol> <li>it assumes each key-value pair has the same size;</li> <li>no deletion (but it has overwrites);</li> <li>it doesn’t consider data compression;</li> <li>it is single-threaded and finishes all compactions before the next flush (so no write stall).</li> </ol> <p>We use 3 kinds of dataset for the tests:</p> <ol> <li>Random data, with overwrites, evenly distributed;</li> <li>Zipf distribution with alpha = 1.01, moderately skewed;</li> <li>Zipf distribution with alpha = 1.2, highly skewed.</li> </ol> <h4 id="write-amplification">Write Amplification</h4> <p style="display: block; margin-left: auto; margin-right: auto; width: 100%"><img src="/static/images/align-compaction-output/write_amp_compare.png" alt="" /></p> <p>As we can see, all options are better than the baseline. Option 5 (brown) and option 3 (green) have similar WA improvements. (The sudden WA drop at around 40G in the Random dataset is because we enabled <code class="language-plaintext highlighter-rouge">level_compaction_dynamic_level_bytes</code> and the number of levels increased from 3 to 4; the test result is similar without <code class="language-plaintext highlighter-rouge">level_compaction_dynamic_level_bytes</code> enabled.)</p> <h4 id="file-size-distribution-at-the-end-of-test">File Size Distribution at the End of Test</h4> <p>This is the file size distribution at the end of the test, which loads about 100G of data. As this change only impacts the non-bottommost file size, and the majority of the SST files are bottommost, there are no significant differences:</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 100%"><img src="/static/images/align-compaction-output/file_size_compare.png" alt="" /></p> <h4 id="all-compaction-generated-file-sizes">All Compaction Generated File Sizes</h4> <p>The high-level files are much more likely to be compacted, so the sizes of all compaction-generated files show a more significant change:</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 100%"><img src="/static/images/align-compaction-output/compaction_output_file_size_compare.png" alt="" /></p> <p>Overall, option 5 keeps most file sizes close to the target file size, whereas option 3 produces much smaller files. Here are more detailed stats for compaction output file size:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 </pre></td><td class="rouge-code"><pre> base 50p 75p dynamic 2xdynamic count 1.656000e+03 1.960000e+03 1.770000e+03 1.687000e+03 1.705000e+03 mean 3.116062e+07 2.634125e+07 2.917876e+07 3.060135e+07 3.028076e+07 std 7.145242e+06 1.065134e+07 8.800474e+06 7.612939e+06 8.046139e+06 </pre></td></tr></tbody></table></code></pre></div></div> <h2 id="summary">Summary</h2> <p>Allowing more dynamic file sizes and aligning the compaction output files to the next level files’ boundaries improves RocksDB write amplification by more than 10%; this is enabled by default in the <code class="language-plaintext highlighter-rouge">7.8.0</code> release. We picked a simple algorithm to decide when to cut the output file, which can be further improved, for example by estimating output file size with index information. Any suggestions or PRs are welcome.</p> <h2 id="acknowledgements">Acknowledgements</h2> <p>We thank Siying Dong for initiating the file-cutting idea and thank Andrew Kryczka and Mark Callaghan for contributing to the ideas.
And Changyu Bi for the detailed code review.</p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> </div> <div style="padding: 16px; display: inline-block; text-align: center"> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2022/10/07/asynchronous-io-in-rocksdb.html">Asynchronous IO in RocksDB</a></h1> <p class="post-meta">Posted October 07, 2022</p> </header> <article class="post-content"> <h2 id="summary">Summary</h2> <p>RocksDB provides several APIs to read KV pairs from a database, including Get and MultiGet for point lookups and Iterator for sequential scanning. These APIs may result in RocksDB reading blocks from SST files on disk storage. The types of blocks and the frequency with which they are read from storage are workload dependent. Some workloads may have a small working set and thus may be able to cache most of the data required, while others may have large working sets and have to read from disk more often. In the latter case, the latency would be much higher and throughput would be lower than the former. They would also be dependent on the characteristics of the underlying storage media, making it difficult to migrate from one medium to another, for example, local flash to disaggregated flash.</p> <p>One way to mitigate the impact of storage latency is to read asynchronously and in parallel as much as possible, in order to hide IO latency. We have implemented this in RocksDB in Iterators and MultiGet. In Iterators, we prefetch data asynchronously in the background for each file being iterated on, unlike the previous implementation, which does prefetching synchronously and thus blocks the iterator thread. In MultiGet, we determine the set of files that a given batch of keys overlaps, and read the necessary data blocks from those files in parallel using an asynchronous file system API. These optimizations have significantly decreased the overall latency of the RocksDB MultiGet and iteration APIs on slower storage compared to local flash.</p> <p>The optimizations described here are in the internal implementation of Iterator and MultiGet in RocksDB. The user API is still synchronous, so existing code can easily benefit from it. We might consider async user APIs in the future.</p> <h2 id="design">Design</h2> <h3 id="api">API</h3> <p>A new flag in <code class="language-plaintext highlighter-rouge">ReadOptions</code>, <code class="language-plaintext highlighter-rouge">async_io</code>, controls the usage of async IO. This flag, when set, enables async IO in Iterators and MultiGet. For MultiGet, an additional <code class="language-plaintext highlighter-rouge">ReadOptions</code> flag, <code class="language-plaintext highlighter-rouge">optimize_multiget_for_io</code> (defaults to true), controls how aggressively to use async IO. If the flag is not set, files in the same level are read in parallel but not across different levels. If the flag is set, the level restriction is removed and as many files as possible are read in parallel, regardless of level.
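</p> <p>For illustration, opting in from application code is just a matter of setting the flags on <code class="language-plaintext highlighter-rouge">ReadOptions</code>; a minimal sketch, where <code class="language-plaintext highlighter-rouge">db</code> and <code class="language-plaintext highlighter-rouge">start_key</code> are assumed to be supplied by the application:</p> <div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <memory>

#include "rocksdb/db.h"

// Illustrative scan with async IO enabled; db is an open rocksdb::DB*
// and start_key is an application-defined rocksdb::Slice.
void ScanWithAsyncIo(rocksdb::DB* db, const rocksdb::Slice& start_key) {
  rocksdb::ReadOptions read_options;
  read_options.async_io = true;                  // async prefetching in Seek/Next
  read_options.optimize_multiget_for_io = true;  // the default; affects MultiGet only

  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_options));
  for (it->Seek(start_key); it->Valid(); it->Next()) {
    // consume it->key() / it->value()
  }
}
</code></pre></div></div> <p>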
<h3 id="scan">Scan</h3> <p>A RocksDB scan usually involves the allocation of a new iterator, followed by a Seek call with a target key to position the iterator, followed by multiple Next calls to iterate through the keys sequentially. Both the Seek and Next operations present opportunities to read asynchronously, thereby reducing the scan latency.</p> <p>A scan usually involves iterating through keys in multiple entities - the active memtable, sealed and unflushed memtables, every L0 file, and every non-empty non-zero level. The first two are completely in memory and thus not impacted by IO latency. The latter two involve reading from SST files. This means that an increase in IO latency has a multiplier effect, since multiple L0 files and levels have to be iterated on.</p> <p>Some factors, such as the block cache and prefix bloom filters, can reduce the number of files to iterate and the number of reads from those files. Nevertheless, even a few reads from disk can dominate the overall latency. RocksDB uses async IO in both Seek and Next to mitigate the latency impact, as described below.</p> <h4 id="seek">Seek</h4> <p>A RocksDB iterator maintains a collection of child iterators, one for each L0 file and for each non-empty non-zero level. For a Seek operation, every child iterator has to Seek to the target key. This is normally done serially, by doing synchronous reads from SST files when the required data blocks are not in cache. When the async_io option is enabled, RocksDB performs the Seek in two phases: 1) locate the data block required for the Seek in each file/level and issue an async read, and 2) reseek with the same key, which waits for the async read to finish at each level and positions the table iterator. Phase 1 reads multiple blocks in parallel, reducing overall Seek latency.</p> <h4 id="next">Next</h4> <p>For the iterator Next operation, RocksDB tries to reduce the latency due to IO by prefetching data from the file. This prefetching occurs when a data block required by Next is not present in the cache. The reads from the file and the prefetching are managed by the FilePrefetchBuffer, an object that’s created per table iterator (BlockBasedTableIterator). The FilePrefetchBuffer reads the required data block, and an additional amount of data that varies depending on the options provided by the user in ReadOptions and BlockBasedTableOptions. The default behavior is to start prefetching on the third read from a file, with an initial prefetch size of 8KB, doubling on every subsequent read, up to a max of 256KB.</p> <p>While the prefetching described above helps, it is still synchronous and contributes to the iterator latency. When the async_io option is enabled, RocksDB prefetches in the background, i.e., while the iterator is scanning KV pairs. This is accomplished in FilePrefetchBuffer by maintaining two prefetch buffers. The prefetch size is calculated as usual, but it is then split across the two buffers. As the iteration proceeds and data in the first buffer is consumed, the buffer is cleared and an async read is scheduled to prefetch additional data. This read continues in the background while the iterator continues to process data in the second buffer.
At this point, the roles of the two buffers are reversed. This does not completely hide the IO latency, since the iterator would have to wait for an async read to complete after the data in memory has been consumed. However, it does hide some of it by overlapping CPU and IO, and async prefetch can be happening on multiple levels in parallel, further reducing the latency.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/asynchronous-io/scan_async.png" alt="Scan flow" /></p> <h3 id="multiget">MultiGet</h3> <p>The MultiGet API accepts a batch of keys as input. It is a more efficient way of looking up multiple keys compared to a loop of Gets. One way MultiGet is more efficient is by reading multiple data blocks from an SST file in a batch, for keys in the same file. This greatly reduces the latency of the request, compared to a loop of Gets. The MultiRead FileSystem API is used to read a batch of data blocks.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/asynchronous-io/mget_async.png" alt="MultiGet flow" /></p> <p>Even with the MultiRead optimization, subsets of keys in different files still need to be read serially. We can take this one step further and read multiple files in parallel. In order to do this, a few fundamental changes were required in the MultiGet implementation -</p> <ol> <li>Coroutines - A MultiGet involves determining the set of keys in a batch that overlap an SST file, and then calling TableReader::MultiGet to do the actual lookup. The TableReader probes the bloom filter, traverses the index block, looks up the block cache for the necessary data blocks, reads the missing data blocks from the SST file, and then searches for the keys in the data blocks. There is a significant amount of context that’s accumulated at each stage, and it would be rather complex to interleave data block reads by multiple TableReaders. In order to simplify it, we used async IO with C++ coroutines. The TableReader::MultiGet is implemented as a coroutine, and the coroutine is suspended after issuing async reads for missing data blocks. This allows the top-level MultiGet to iterate through the TableReaders for all the keys, before waiting for the reads to finish and resuming the coroutines.</li> <li>Filtering - The downside of using coroutines is the CPU overhead, which is non-trivial. To minimize the overhead, it is desirable to avoid coroutines as much as possible. One scenario in which we can completely avoid the call to a TableReader::MultiGet coroutine is if we know that none of the overlapping keys are actually present in the SST file. This can easily be determined by probing the bloom filter. In the previous implementation, the bloom filter lookup was embedded in TableReader::MultiGet. However, we could easily implement it as a separate step, before calling TableReader::MultiGet.</li> <li>Splitting batches - The default strategy of MultiGet is to lookup keys in one level (or L0 file), before moving on to the next. This limits the amount of IO parallelism we can exploit. For example, the keys in a batch may not be clustered together, and may be scattered over multiple files. Even if they are clustered together in the key space, they may not all be in the same level. In order to optimize for these situations, we determine the subset of keys that are likely to be in a given level, and then split the MultiGet batch into two: the subset in that level, and the remainder.
The batch containing the remainder can then be processed in parallel. The subset of keys likely to be in a level is determined by the filtering step.</li> </ol> <p>Together, these changes enabled two types of latency optimization in MultiGet using async IO - single-level and multi-level. The former reads data blocks in parallel from multiple files in the same LSM level, while the latter reads in parallel from multiple files in multiple levels.</p> <h2 id="results">Results</h2> <p>Command used to generate the database:</p> <p><code class="language-plaintext highlighter-rouge">buck-out/opt/gen/rocks/tools/rocks_db_bench --db=/rocks_db_team/prefix_scan --env_uri=ws://ws.flash.ftw3preprod1 -logtostderr=false -benchmarks="fillseqdeterministic" -key_size=32 -value_size=512 -num=5000000 -num_levels=4 -multiread_batched=true -use_direct_reads=false -adaptive_readahead=true -threads=1 -cache_size=10485760000 -async_io=false -multiread_stride=40000 -disable_auto_compactions=true -compaction_style=1 -bloom_bits=10</code></p> <p>Structure of the database:</p> <p><code class="language-plaintext highlighter-rouge">Level[0]: /000233.sst(size: 24828520 bytes)</code> <code class="language-plaintext highlighter-rouge">Level[0]: /000232.sst(size: 49874113 bytes)</code> <code class="language-plaintext highlighter-rouge">Level[0]: /000231.sst(size: 100243447 bytes)</code> <code class="language-plaintext highlighter-rouge">Level[0]: /000230.sst(size: 201507232 bytes)</code> <code class="language-plaintext highlighter-rouge">Level[1]: /000224.sst - /000229.sst(total size: 405046844 bytes)</code> <code class="language-plaintext highlighter-rouge">Level[2]: /000211.sst - /000223.sst(total size: 814190051 bytes)</code> <code class="language-plaintext highlighter-rouge">Level[3]: /000188.sst - /000210.sst(total size: 1515327216 bytes)</code></p> <h3 id="multiget-1">MultiGet</h3> <p>MultiGet benchmark command:</p> <p><code class="language-plaintext highlighter-rouge">buck-out/opt/gen/rocks/tools/rocks_db_bench -use_existing_db=true --db=/rocks_db_team/prefix_scan -benchmarks="multireadrandom" -key_size=32 -value_size=512 -num=5000000 -batch_size=8 -multiread_batched=true -use_direct_reads=false -duration=60 -ops_between_duration_checks=1 -readonly=true -threads=4 -cache_size=300000000 -async_io=true -multiread_stride=40000 -statistics --env_uri=ws://ws.flash.ftw3preprod1 -logtostderr=false -adaptive_readahead=true -bloom_bits=10</code></p> <h4 id="single-file">Single-file</h4> <p>The default MultiGet implementation of reading from one file at a time had a latency of 1292 micros/op.</p> <p><code class="language-plaintext highlighter-rouge">multireadrandom : 1291.992 micros/op 3095 ops/sec 60.007 seconds 185768 operations; 1.6 MB/s (46768 of 46768 found) </code> <code class="language-plaintext highlighter-rouge">rocksdb.db.multiget.micros P50 : 9664.419795 P95 : 20757.097056 P99 : 29329.444444 P100 : 46162.000000 COUNT : 23221 SUM : 239839394</code></p> <h4 id="single-level">Single-level</h4> <p>MultiGet with async_io=true and optimize_multiget_for_io=false had a latency of 775 micros/op.</p> <p><code class="language-plaintext highlighter-rouge">multireadrandom : 774.587 micros/op 5163 ops/sec 60.009 seconds 309864 operations; 2.7 MB/s (77816 of 77816 found)</code> <code class="language-plaintext highlighter-rouge">rocksdb.db.multiget.micros P50 : 6029.601964 P95 : 10727.467932 P99 : 13986.683940 P100 : 47466.000000 COUNT : 38733 SUM : 239750172</code></p> <h4 id="multi-level">Multi-level</h4> <p>With all
optimizations turned on, MultiGet had the lowest latency of 508 micros/op.</p> <p><code class="language-plaintext highlighter-rouge">multireadrandom : 507.533 micros/op 7881 ops/sec 60.003 seconds 472896 operations; 4.1 MB/s (117536 of 117536 found)</code> <code class="language-plaintext highlighter-rouge">rocksdb.db.multiget.micros P50 : 3923.819467 P95 : 7356.182075 P99 : 10880.728723 P100 : 28511.000000 COUNT : 59112 SUM : 239642721</code></p> <h3 id="scan-1">Scan</h3> <p>Benchmark command:</p> <p><code class="language-plaintext highlighter-rouge">buck-out/opt/gen/rocks/tools/rocks_db_bench -use_existing_db=true --db=/rocks_db_team/prefix_scan -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -batch_size=8 -multiread_batched=true -use_direct_reads=false -duration=60 -ops_between_duration_checks=1 -readonly=true -threads=4 -cache_size=300000000 -async_io=true -multiread_stride=40000 -statistics --env_uri=ws://ws.flash.ftw3preprod1 -logtostderr=false -adaptive_readahead=true -bloom_bits=10 -seek_nexts=65536</code></p> <h3 id="with-async-scan">With async scan</h3> <p><code class="language-plaintext highlighter-rouge">seekrandom : 414442.303 micros/op 9 ops/sec 60.288 seconds 581 operations; 326.2 MB/s (145 of 145 found)</code></p> <h3 id="without-async-scan">Without async scan</h3> <p><code class="language-plaintext highlighter-rouge">seekrandom : 848858.669 micros/op 4 ops/sec 60.529 seconds 284 operations; 158.1 MB/s (74 of 74 found)</code></p> <h2 id="known-limitations">Known Limitations</h2> <p>These optimizations apply only to block-based table SSTs. File system support for the <code class="language-plaintext highlighter-rouge">ReadAsync</code> and <code class="language-plaintext highlighter-rouge">Poll</code> interfaces is required. Currently, it is available only for <code class="language-plaintext highlighter-rouge">PosixFileSystem</code>.</p> <p>The MultiGet async IO optimization has a few additional limitations -</p> <ol> <li>Depends on folly, which introduces a few additional build steps</li> <li>Higher CPU overhead due to coroutines. The CPU overhead of MultiGet may increase 6-15%, with the worst case being a single-threaded MultiGet batch of keys with 1 key/file intersection and 100% cache hit rate. A more realistic case of multiple threads with a few keys (~4) overlapping per file should see ~6% higher CPU util.</li> <li>No parallelization of metadata reads. A metadata read will block the thread.</li> <li>A few other cases are also serial, such as additional block reads for merge operands.</li> </ol> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/568694102/picture/" alt="" title="" /> </div> <p class="post-authorName">Andrew Kryczka</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2022/10/05/lost-buffered-write-recovery.html">Verifying crash-recovery with lost buffered writes</a></h1> <p class="post-meta">Posted October 05, 2022</p> </header> <article class="post-content"> <h2 id="introduction">Introduction</h2> <p>Writes to a RocksDB instance go through multiple layers before they are fully persisted.
Those layers may buffer writes, delaying their persistence. Depending on the layer, buffered writes may be lost in a process or system crash. A process crash loses writes buffered in process memory only. A system crash additionally loses writes buffered in OS memory.</p> <p>The new test coverage introduced in this post verifies there is no hole in the recovered data in either type of crash. A hole would exist if any recovered write were newer than any lost write, as illustrated below. This guarantee is important for many applications, such as those that use the newest recovered write to determine the starting point for replication.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/lost-buffered-write-recovery/happy-cat.png" alt="" /></p> <p style="text-align: center"><em>Valid (no hole) recovery: all recovered writes (1 and 2) are older than all lost writes (3 and 4)</em></p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/lost-buffered-write-recovery/angry-cat.png" alt="" /></p> <p style="text-align: center"><em>Invalid (hole) recovery: a recovered write (4) is newer than a lost write (3)</em></p> <p>The new test coverage assumes all writes use the same options related to buffering/persistence. For example, we do not cover the case of alternating writes with WAL disabled and WAL enabled (<code class="language-plaintext highlighter-rouge">WriteOptions::disableWAL</code>). It also assumes the crash does not have any unexpected consequences like corrupting persisted data.</p> <p>Testing for holes in the recovery is challenging because there are many valid recovery outcomes. Our solution involves tracing all the writes and then verifying the recovery matches a prefix of the trace. This proves there are no holes in the recovery. See the “Extensions for lost buffered writes” subsection below for more details.</p> <p>Testing actual system crashes would be operationally difficult. Our solution simulates a system crash by buffering written but unsynced data in process memory such that it is lost in a process crash. See the “Simulating system crash” subsection below for more details.</p> <h2 id="scenarios-covered">Scenarios covered</h2> <p>We began testing that recovery has no hole in the following new scenarios.
This coverage is included in our internal CI that periodically runs against the latest commit on the main branch.</p> <ol> <li><strong>Process crash with WAL disabled</strong> (<code class="language-plaintext highlighter-rouge">WriteOptions::disableWAL=1</code>), which loses writes since the last memtable flush.</li> <li><strong>System crash with WAL enabled</strong> (<code class="language-plaintext highlighter-rouge">WriteOptions::disableWAL=0</code>), which loses writes since the last memtable flush or WAL sync (<code class="language-plaintext highlighter-rouge">WriteOptions::sync=1</code>, <code class="language-plaintext highlighter-rouge">SyncWAL()</code>, or <code class="language-plaintext highlighter-rouge">FlushWAL(true /* sync */)</code>).</li> <li><strong>Process crash with manual WAL flush</strong> (<code class="language-plaintext highlighter-rouge">DBOptions::manual_wal_flush=1</code>), which loses writes since the last memtable flush or manual WAL flush (<code class="language-plaintext highlighter-rouge">FlushWAL()</code>).</li> <li><strong>System crash with manual WAL flush</strong> (<code class="language-plaintext highlighter-rouge">DBOptions::manual_wal_flush=1</code>), which loses writes since the last memtable flush or synced manual WAL flush (<code class="language-plaintext highlighter-rouge">FlushWAL(true /* sync */)</code>, or <code class="language-plaintext highlighter-rouge">FlushWAL(false /* sync */)</code> followed by WAL sync). A sketch of this write path follows the list.</li> </ol>
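<p>As a concrete illustration of the write paths scenarios 3 and 4 exercise, here is a minimal sketch (it assumes an open <code class="language-plaintext highlighter-rouge">db</code> opened with <code class="language-plaintext highlighter-rouge">DBOptions::manual_wal_flush=1</code>; error handling omitted):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>// Sketch: with manual_wal_flush=1, a write reaches the WAL buffer in
// process memory but not the file system until FlushWAL().
rocksdb::WriteOptions write_options;  // WAL enabled (disableWAL=0)
db-&gt;Put(write_options, "key1", "value1");
// A process crash here loses "key1" (no memtable flush or FlushWAL yet).
db-&gt;FlushWAL(false /* sync */);  // push buffered WAL writes to the OS
// A system crash here can still lose "key1" (flushed but not synced).
db-&gt;FlushWAL(true /* sync */);   // flush and fsync the WAL
// "key1" now survives both process and system crashes.
</code></pre></div></div>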
<h2 id="issues-found">Issues found</h2> <ul> <li><a href="https://github.com/facebook/rocksdb/pull/10185">False detection of corruption after system crash due to race condition with WAL sync and <code class="language-plaintext highlighter-rouge">track_and_verify_wals_in_manifest</code></a></li> <li><a href="https://github.com/facebook/rocksdb/pull/10560">Undetected hole in recovery after system crash due to race condition in WAL sync</a></li> <li><a href="https://github.com/facebook/rocksdb/pull/10573">Recovery failure after system crash due to missing directory sync for critical metadata file</a></li> </ul> <h2 id="solution-details">Solution details</h2> <h3 id="basic-setup">Basic setup</h3> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/lost-buffered-write-recovery/basic-setup.png" alt="" /></p> <p>Our correctness testing framework consists of a stress test program (<code class="language-plaintext highlighter-rouge">db_stress</code>) and a wrapper script (<code class="language-plaintext highlighter-rouge">db_crashtest.py</code>). <code class="language-plaintext highlighter-rouge">db_crashtest.py</code> manages instances of <code class="language-plaintext highlighter-rouge">db_stress</code>, starting them and injecting crashes. <code class="language-plaintext highlighter-rouge">db_stress</code> operates a DB and test oracle (“Latest values file”).</p> <p>At startup, <code class="language-plaintext highlighter-rouge">db_stress</code> verifies the DB using the test oracle, skipping keys that had pending writes when the last crash happened. <code class="language-plaintext highlighter-rouge">db_stress</code> then stresses the DB with random operations, keeping the test oracle up-to-date.</p> <p>As the name “Latest values file” implies, this test oracle only tracks the latest value for each key. As a result, this setup is unable to verify recoveries involving lost buffered writes, where recovering older values is tolerated as long as there is no hole.</p> <h3 id="extensions-for-lost-buffered-writes">Extensions for lost buffered writes</h3> <p>To accommodate lost buffered writes, we extended the test oracle to include two new files: “<code class="language-plaintext highlighter-rouge">verifiedSeqno</code>.state” and “<code class="language-plaintext highlighter-rouge">verifiedSeqno</code>.trace”. <code class="language-plaintext highlighter-rouge">verifiedSeqno</code> is the sequence number of the last successful verification. “<code class="language-plaintext highlighter-rouge">verifiedSeqno</code>.state” is the expected values file at that sequence number, and “<code class="language-plaintext highlighter-rouge">verifiedSeqno</code>.trace” is the trace file of all operations that happened after that sequence number.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/lost-buffered-write-recovery/replay-extension.png" alt="" /></p> <p>When buffered writes may have been lost by the previous <code class="language-plaintext highlighter-rouge">db_stress</code> instance, the current <code class="language-plaintext highlighter-rouge">db_stress</code> instance must reconstruct the latest values file before startup verification. M is the recovery sequence number of the current <code class="language-plaintext highlighter-rouge">db_stress</code> instance and N is the recovery sequence number of the previous <code class="language-plaintext highlighter-rouge">db_stress</code> instance. M is learned from the DB, while N is learned from the filesystem by parsing the “*.{trace,state}” filenames. Then, the latest values file (“LATEST.state”) can be reconstructed by replaying the first M-N traced operations (in “N.trace”) on top of the last instance’s starting point (“N.state”). For example, if the previous instance verified up to N=100 and the current instance recovers to M=120, the first 20 operations in “100.trace” are replayed on top of “100.state”.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/lost-buffered-write-recovery/trace-extension.png" alt="" /></p> <p>When buffered writes may be lost by the current <code class="language-plaintext highlighter-rouge">db_stress</code> instance, we save the current expected values into “M.state” and begin tracing newer operations in “M.trace”.</p> <h3 id="simulating-system-crash">Simulating system crash</h3> <p>When simulating a system crash, we send file writes to a <code class="language-plaintext highlighter-rouge">TestFSWritableFile</code>, which buffers unsynced writes in process memory.
That way, the existing <code class="language-plaintext highlighter-rouge">db_stress</code> process crash mechanism will lose unsynced writes.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/lost-buffered-write-recovery/test-fs-writable-file.png" alt="" /></p> <p><code class="language-plaintext highlighter-rouge">TestFSWritableFile</code> is implemented as follows.</p> <ul> <li><code class="language-plaintext highlighter-rouge">Append()</code> buffers the write in a local <code class="language-plaintext highlighter-rouge">std::string</code> rather than calling <code class="language-plaintext highlighter-rouge">write()</code>.</li> <li><code class="language-plaintext highlighter-rouge">Sync()</code> transfers the local <code class="language-plaintext highlighter-rouge">std::string</code>’s content to <code class="language-plaintext highlighter-rouge">PosixWritableFile::Append()</code>, which will then <code class="language-plaintext highlighter-rouge">write()</code> it to the OS page cache.</li> </ul>
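<p>A rough sketch of that buffering idea (simplified; not the actual <code class="language-plaintext highlighter-rouge">TestFSWritableFile</code> code):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include &lt;string&gt;
#include "rocksdb/env.h"

// Simplified sketch: unsynced writes live only in process memory, so a
// process crash drops exactly the writes a real system crash would lose.
class BufferedWritableFile {
 public:
  explicit BufferedWritableFile(rocksdb::WritableFile* target)
      : target_(target) {}

  // Buffer the data locally instead of write()-ing it to the file.
  rocksdb::Status Append(const rocksdb::Slice&amp; data) {
    unsynced_.append(data.data(), data.size());
    return rocksdb::Status::OK();
  }

  // Hand the buffered bytes to the real file, then sync it.
  rocksdb::Status Sync() {
    rocksdb::Status s = target_-&gt;Append(rocksdb::Slice(unsynced_));
    if (s.ok()) s = target_-&gt;Sync();
    if (s.ok()) unsynced_.clear();
    return s;
  }

 private:
  rocksdb::WritableFile* target_;  // wraps, e.g., a PosixWritableFile
  std::string unsynced_;           // lost on process crash before Sync()
};
</code></pre></div></div>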
<h2 id="next-steps">Next steps</h2> <p>An untested guarantee is that RocksDB recovers all writes that the user explicitly flushed out of the buffers lost in the crash. We may recover more writes than these due to internal flushing of buffers, but never less. Our test oracle needs to be further extended to track the lower bound on the sequence number that is expected to survive a crash.</p> <p>We would also like to make our system crash simulation more realistic. Currently we only drop unsynced regular file data, but we should drop unsynced directory entries as well.</p> <h2 id="acknowledgements">Acknowledgements</h2> <p>Hui Xiao added the manual WAL flush coverage and compatibility with <code class="language-plaintext highlighter-rouge">TransactionDB</code>. Zhichao Cao added the system crash simulation. Several RocksDB team members contributed to this feature’s dependencies.</p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/100078474793041/picture/" alt="" title="" /> </div> <p class="post-authorName">Changyu Bi</p> </div> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/568694102/picture/" alt="" title="" /> </div> <p class="post-authorName">Andrew Kryczka</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2022/07/18/per-key-value-checksum.html">Per Key-Value Checksum</a></h1> <p class="post-meta">Posted July 18, 2022</p> </header> <article class="post-content"> <h2 id="summary">Summary</h2> <p>Silent data corruptions can severely impact RocksDB users. As a key-value library, RocksDB resides at the bottom of the user space software stack for many diverse applications. Returning wrong query results can cause unpredictable consequences for our users, so it must be avoided.</p> <p>To prevent and detect corruption, RocksDB has several consistency checks [1], especially focusing on the storage layer. For example, SST files contain block checksums that are verified during reads, and each SST file has a full file checksum that can be verified when files are transferred.</p> <p>Other sources of corruption, such as those from faulty CPU/memory or heap corruptions, pose risks for which protections are relatively underdeveloped. Meanwhile, recent work [2] suggests one in a thousand machines in our fleet will at some point experience a hardware error that is exposed to an application. Additionally, software bugs can increase the risk of heap corruptions at any time.</p> <p>Hardware/heap corruptions are naturally difficult to detect in the application layer since they can compromise any data or control flow. Some factors we take into account when choosing where to add protection are the volume of data, the importance of the data, the CPU instructions that operate on the data, and the duration it resides in memory. One recently added protection, <code class="language-plaintext highlighter-rouge">detect_filter_construct_corruption</code>, has proven itself useful in preventing corrupt filters from being persisted. We have seen hardware encounter machine-check exceptions a few hours after we detected a corrupt filter.</p> <p>The next way we intend to detect hardware and heap corruptions before they cause queries to return wrong results is through developing a new feature: per key-value checksum. This feature will eventually provide optional end-to-end integrity protection for every key-value pair. RocksDB 7.4 offers substantial coverage of the user write and recovery paths with per key-value checksum protection.</p> <h2 id="user-api">User API</h2> <p>For integrity protection during recovery, no change is required. Recovery is always protected.</p> <p>For user write protection, RocksDB allows the user to specify per key-value protection through <code class="language-plaintext highlighter-rouge">WriteOptions::protection_bytes_per_key</code> or pass in <code class="language-plaintext highlighter-rouge">protection_bytes_per_key</code> to the <code class="language-plaintext highlighter-rouge">WriteBatch</code> constructor when creating a <code class="language-plaintext highlighter-rouge">WriteBatch</code> directly. Currently, only 0 (default, no protection) and 8 bytes per key are supported. This should be fine for write batches as they do not usually contain a huge number of keys. We are working on supporting more settings, as 8 bytes per key might cause considerable memory overhead when the protection is extended to memtable entries.</p>
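<p>In code, opting into the protection looks roughly like the following sketch (based on the API described above; <code class="language-plaintext highlighter-rouge">db</code> is assumed open and error handling is omitted):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

void ProtectedWrites(rocksdb::DB* db) {
  // Opt user writes into 8 bytes of protection info per key-value.
  rocksdb::WriteOptions write_options;
  write_options.protection_bytes_per_key = 8;
  db-&gt;Put(write_options, "key", "value");

  // Or construct a protected WriteBatch directly; the third constructor
  // argument is protection_bytes_per_key.
  rocksdb::WriteBatch batch(0 /* reserved_bytes */, 0 /* max_bytes */,
                            8 /* protection_bytes_per_key */);
  batch.Put("key", "value");
  db-&gt;Write(write_options, &amp;batch);
}
</code></pre></div></div>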
<h2 id="feature-design">Feature Design</h2> <h3 id="data-structures">Data Structures</h3> <h4 id="protection-info">Protection info</h4> <p>For protecting key-value pairs, we chose to use a hashing algorithm, xxh3 [3], for its good efficiency without relying on special hardware. While algorithms like crc32c can guarantee detection of certain patterns of bit flips, xxh3 offers no such guarantees. This is acceptable for us as we do not expect any particular error pattern [4], and even if we did, xxh3 can achieve a collision probability close enough to zero for us by tuning the number of protection bytes per key-value.</p> <p>Key-value pairs have multiple representations in RocksDB: in <a href="https://github.com/facebook/rocksdb/blob/7d0ecab570742c7280628b08ddc03cfd692f484f/db/write_batch.cc#L14-L31">WriteBatch</a>, in memtable <a href="https://github.com/facebook/rocksdb/blob/fc51b7f33adcba7ac725ed0e7fe8b8155aaeaee4/db/memtable.cc#L541-L545">entries</a> and in <a href="https://github.com/facebook/rocksdb/blob/fc51b7f33adcba7ac725ed0e7fe8b8155aaeaee4/table/block_based/block_builder.cc#L21-L27">data blocks</a>. In this post we focus on key-values in write batches and memtables, as in-memory data blocks are not yet protected.</p> <p>Besides the user key and value, RocksDB includes internal metadata in the per key-value checksum calculation. Depending on the representation, internal metadata consists of some combination of sequence number, operation type, and column family ID. Note that since the timestamp (when enabled) is part of the user key, it is protected as well.</p> <p>The protection info consists of the XOR’d result of the xxh3 hashes of all the protected components. This allows us to efficiently transform protection info for different representations. See below for an example converting WriteBatch protection info to memtable protection info.</p> <p>A risk of using XOR is the possibility of swapping corruptions (e.g., the key becomes the value and the value becomes the key). To mitigate this risk, we use an independent seed for hashing each type of component.</p> <p>The following two figures illustrate how protection info in a WriteBatch and in a memtable is calculated from a key-value’s components.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/kv-checksum/ProtInfo-Writebatch.png" alt="" /></p> <p style="text-align: center"><em>Protection info for a key-value in a WriteBatch</em></p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/kv-checksum/ProtInfo-Memtable.png" alt="" /></p> <p style="text-align: center"><em>Protection info for a key-value in a memtable</em></p> <p>The next figure illustrates how protection info for a key-value can be transformed to protect that same key-value in a different representation. Note this is done without recalculating the hash for all the key-value’s components.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/kv-checksum/ProtInfo-Writebatch-to-Memtable.png" alt="" /></p> <p style="text-align: center"><em>Protection info for a key-value in a memtable derived from an existing WriteBatch protection info</em></p> <p>Above, we see two (small) components are hashed: column family ID and sequence number. When a key-value is inserted from a WriteBatch into a memtable, it is assigned a sequence number and drops the column family ID, since each memtable is associated with one column family. Recall that the xxh3 of the column family ID was included in the WriteBatch protection info; it is canceled out by the column family ID xxh3 included in the XOR.</p>
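<p>The XOR transform can be illustrated as follows. This is an illustrative sketch only; the helper, seeds, and hash are stand-ins, not the actual RocksDB ProtectionInfo code:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include &lt;cstdint&gt;
#include &lt;functional&gt;
#include &lt;string&gt;

// Placeholder stand-in for xxh3-with-seed, just to keep the sketch
// self-contained; identical inputs always produce identical outputs.
uint64_t HashWithSeed(const std::string&amp; data, uint64_t seed) {
  return std::hash&lt;std::string&gt;{}(data) ^ (seed * 0x9E3779B97F4A7C15ULL);
}

// Hypothetical independent per-component seeds.
const uint64_t kSeedKey = 1, kSeedValue = 2, kSeedOp = 3, kSeedCf = 4,
               kSeedSeq = 5;

// WriteBatch form: XOR of per-component hashes of key, value, op type,
// and column family ID.
uint64_t WriteBatchProtInfo(const std::string&amp; key, const std::string&amp; value,
                            uint8_t op_type, uint32_t cf_id) {
  return HashWithSeed(key, kSeedKey) ^ HashWithSeed(value, kSeedValue) ^
         HashWithSeed(std::string(1, static_cast&lt;char&gt;(op_type)), kSeedOp) ^
         HashWithSeed(std::to_string(cf_id), kSeedCf);
}

// Memtable form derived from the WriteBatch form: XOR out the column
// family term and XOR in the sequence number term, without rehashing the
// (potentially large) key and value.
uint64_t ToMemtableProtInfo(uint64_t wb_prot, uint32_t cf_id, uint64_t seqno) {
  wb_prot ^= HashWithSeed(std::to_string(cf_id), kSeedCf);   // cancels cf term
  wb_prot ^= HashWithSeed(std::to_string(seqno), kSeedSeq);  // adds seqno term
  return wb_prot;
}
</code></pre></div></div>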
<h4 id="wal-fragment">WAL fragment</h4> <p>The WAL (write-ahead log) persists write batches that correspond to operations in memtables and enables consistent database recovery after restart. RocksDB writes to the WAL in chunks of some <a href="https://github.com/facebook/rocksdb/blob/fc51b7f33adcba7ac725ed0e7fe8b8155aaeaee4/db/log_writer.h#L44">fixed block size</a> for efficiency. It is possible that some write batch does not fit into the space left in the current block and/or is larger than the fixed block size. Thus, serialized write batches (WAL records) are divided into WAL fragments before being written to the WAL. The format of a WAL fragment is shown in the following diagram (there is another legacy format detailed in code <a href="https://github.com/facebook/rocksdb/blob/fc51b7f33adcba7ac725ed0e7fe8b8155aaeaee4/db/log_writer.h#L47-L59">comments</a>). Roughly, the <code class="language-plaintext highlighter-rouge">Type</code> field indicates whether a fragment is at the beginning, middle or end of a record, and is used to group fragments.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/kv-checksum/WAL-fragment.png" alt="" /></p> <p>Note that each fragment is prefixed by a crc32c checksum that is calculated over <code class="language-plaintext highlighter-rouge">Type</code>, <code class="language-plaintext highlighter-rouge">Log #</code> and <code class="language-plaintext highlighter-rouge">Payload</code>. This ensures that RocksDB can detect corruptions that happen to the WAL in the storage layer.</p>
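<p>As a rough sketch, the fields in the diagram correspond to a header like the following (simplified; see the <code class="language-plaintext highlighter-rouge">log_writer.h</code> comments linked above for the authoritative layout):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include &lt;cstdint&gt;

// Simplified sketch of the WAL fragment header described above.
struct WalFragmentHeader {
  uint32_t crc;      // crc32c over type, log number, and payload
  uint16_t size;     // payload size in bytes
  uint8_t type;      // beginning/middle/end-of-record marker for grouping
  uint32_t log_num;  // log file number
};
// The header is followed by `size` bytes of payload, i.e. a slice of a
// serialized WriteBatch (a WAL record).
</code></pre></div></div>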
<h4 id="write-batch">Write batch</h4> <p>As mentioned above, a WAL record is a serialized <code class="language-plaintext highlighter-rouge">WriteBatch</code> that is split into physical fragments during writes to the WAL. During DB recovery, once a WAL record is reconstructed from one or more fragments, it is <a href="https://github.com/facebook/rocksdb/blob/fc51b7f33adcba7ac725ed0e7fe8b8155aaeaee4/db/db_impl/db_impl_open.cc#L1127">copied</a> into the content of a <code class="language-plaintext highlighter-rouge">WriteBatch</code>. The write batch will then be used to restore the memtable states.</p> <p>Besides the recovery path, a write batch is always constructed during user writes. First, RocksDB allows users to construct a write batch directly and pass it to the DB through the <code class="language-plaintext highlighter-rouge">DB::Write()</code> API for execution. Higher-level buffered write APIs like Transaction rely on a write batch to buffer writes prior to executing them. For unbuffered write APIs like <code class="language-plaintext highlighter-rouge">DB::Put()</code>, RocksDB constructs a write batch internally with the input user key and value.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/kv-checksum/Write-batch.png" alt="" /></p> <p>The above diagram shows a rough representation of a write batch in memory. <code class="language-plaintext highlighter-rouge">Contents</code> is the concatenation of serialized user operations in this write batch. Each operation consists of user key, value, op_type and optionally column family ID. With per key-value checksum protection enabled, a vector of ProtectionInfo is stored in the write batch, one for each user operation.</p> <h4 id="memtable-entry">Memtable entry</h4> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/kv-checksum/Memtable-entry.png" alt="" /></p> <p>A memtable entry is similar to write batch content, except that it captures only a single user operation and that it does not contain a column family ID (since each memtable belongs to one column family). User key and value are length-prefixed, and seqno and optype are combined in a fixed 8-byte representation.</p> <h3 id="processes">Processes</h3> <p>In order to protect user writes and recovery, per key-value checksums are covered in the following code paths.</p> <h4 id="writebatch-write">WriteBatch write</h4> <p>Per key-value checksum coverage starts with the user buffers that contain the user key and/or value. When users call DB write APIs (e.g., <code class="language-plaintext highlighter-rouge">DB::Put()</code>), or when users add operations into write batches directly (e.g. <code class="language-plaintext highlighter-rouge">WriteBatch::Put()</code>), RocksDB constructs <code class="language-plaintext highlighter-rouge">ProtectionInfo</code> from the user buffer (e.g. <a href="https://github.com/facebook/rocksdb/blob/96206531bc0bb56d87012921c5458c8a3047a6b3/db/write_batch.cc#L813">here</a>) and <a href="https://github.com/facebook/rocksdb/blob/96206531bc0bb56d87012921c5458c8a3047a6b3/include/rocksdb/write_batch.h#L478">stores</a> the protection information within the corresponding <code class="language-plaintext highlighter-rouge">WriteBatch</code> object as diagrammed below. Then the user key and/or value are copied into the <code class="language-plaintext highlighter-rouge">WriteBatch</code>, thus starting per key-value checksum protection from the user buffer.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/kv-checksum/Writebatch-write.png" alt="" /></p> <h4 id="wal-write">WAL write</h4> <p>Before a <code class="language-plaintext highlighter-rouge">WriteBatch</code> leaves RocksDB to be persisted in a WAL file, it is verified against its <code class="language-plaintext highlighter-rouge">ProtectionInfo</code> to ensure its content is not corrupted. We added <code class="language-plaintext highlighter-rouge">WriteBatch::VerifyChecksum()</code> for this purpose. Once we verify the content of a <code class="language-plaintext highlighter-rouge">WriteBatch</code>, it is then divided into potentially multiple WAL fragments and persisted in the underlying file system. From that point on, the integrity protection is handed off to the per-fragment crc32c checksum that is persisted in the WAL too.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/kv-checksum/WAL-write.png" alt="" /></p> <h4 id="memtable-write">Memtable write</h4> <p>Similar to the WAL write path, <code class="language-plaintext highlighter-rouge">ProtectionInfo</code> is verified before an entry is inserted into a memtable. The difference here is that a memtable entry has its own buffer, and the content of a <code class="language-plaintext highlighter-rouge">WriteBatch</code> is copied into the memtable entry. So the <code class="language-plaintext highlighter-rouge">ProtectionInfo</code> is verified against the memtable entry buffer instead.
The current per key-value checksum protection ends at this verification on the buffer containing a memtable entry, and one piece of future work is to extend the coverage to key-value pairs in memtables.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/kv-checksum/Memtable-write.png" alt="" /></p> <h4 id="wal-read">WAL read</h4> <p>This is for the DB recovery path: WAL fragments are read into memory, concatenated together to form WAL records, and then <code class="language-plaintext highlighter-rouge">WriteBatch</code>es are constructed from WAL records and added to memtables. In RocksDB 7.4, once a <code class="language-plaintext highlighter-rouge">WriteBatch</code> copies its content from a WAL record, <code class="language-plaintext highlighter-rouge">ProtectionInfo</code> is constructed from the <code class="language-plaintext highlighter-rouge">WriteBatch</code> content and per key-value protection starts. However, this copy operation is not protected, nor is the reconstruction of a WAL record from WAL fragments. To provide protection from silent data corruption during these memory copying operations, we added the checksum handshake detailed below in RocksDB 7.5.</p> <p>When a WAL fragment is first read into memory, its crc32c checksum is <a href="https://github.com/facebook/rocksdb/blob/2f13f5f7d09c589d5adebf0cbc42fadf0da0f00e/db/log_reader.cc#L483">verified</a>. The WAL fragment is then appended to the buffer containing a WAL record. RocksDB uses xxh3’s streaming API to calculate the checksum of the WAL record and updates the streaming hash state with the new WAL fragment content whenever it is appended to the WAL record buffer (e.g. <a href="https://github.com/facebook/rocksdb/blob/2f13f5f7d09c589d5adebf0cbc42fadf0da0f00e/db/log_reader.cc#L135">here</a>). After the WAL record is constructed, it is copied into a <code class="language-plaintext highlighter-rouge">WriteBatch</code> and <code class="language-plaintext highlighter-rouge">ProtectionInfo</code> is constructed from the write batch content. Then, the xxh3 checksum of the WAL record is <a href="https://github.com/facebook/rocksdb/blob/2f13f5f7d09c589d5adebf0cbc42fadf0da0f00e/db/write_batch.cc#L3081-L3085">verified</a> against the write batch content to complete the checksum handshake. If the checksum verification succeeds, then we are more confident that <code class="language-plaintext highlighter-rouge">ProtectionInfo</code> is calculated based on uncorrupted data, and the protection coverage continues with the newly constructed <code class="language-plaintext highlighter-rouge">ProtectionInfo</code> along the write code paths mentioned above.</p>
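<p>The streaming-hash side of the handshake can be sketched with the xxHash streaming API (a simplified illustration of the idea, not the actual RocksDB code):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include &lt;cstdint&gt;
#include &lt;string&gt;
#include &lt;vector&gt;
#include "xxhash.h"

// Sketch: hash each WAL fragment as it is appended to the record buffer,
// then verify the digest against the reconstructed WriteBatch content.
uint64_t HashRecordFromFragments(const std::vector&lt;std::string&gt;&amp; fragments,
                                 std::string* record) {
  XXH3_state_t* state = XXH3_createState();
  XXH3_64bits_reset(state);
  for (const std::string&amp; fragment : fragments) {
    // Each fragment's crc32c was already verified when it was read.
    record-&gt;append(fragment);
    XXH3_64bits_update(state, fragment.data(), fragment.size());
  }
  uint64_t digest = XXH3_64bits_digest(state);
  XXH3_freeState(state);
  // The caller copies *record into a WriteBatch, recomputes xxh3 over the
  // batch content, and compares it with `digest` to complete the handshake.
  return digest;
}
</code></pre></div></div>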
<p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/kv-checksum/WAL-read.png" alt="" /></p> <h2 id="future-work">Future work</h2> <p>Future coverage expansion will include memtable key-values, flush, compaction, and user reads.</p> <h2 id="references">References</h2> <p>[1] http://rocksdb.org/blog/2021/05/26/online-validation.html</p> <p>[2] H. D. Dixit, L. Boyle, G. Vunnam, S. Pendharkar, M. Beadon, and S. Sankar, ‘Detecting silent data corruptions in the wild’. arXiv, 2022.</p> <p>[3] https://github.com/Cyan4973/xxHash</p> <p>[4] https://github.com/Cyan4973/xxHash/issues/229#issuecomment-511956403</p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2021/12/29/ribbon-filter.html">Ribbon Filter</a></h1> <p class="post-meta">Posted December 29, 2021</p> </header> <article class="post-content"> <h2 id="summary">Summary</h2> <p>Since version 6.15 last year, RocksDB supports Ribbon filters, a new alternative to Bloom filters that saves space, especially memory, at the cost of more CPU usage, mostly in constructing the filters in the background. Most applications with long-lived data (many hours or longer) will likely benefit from adopting a Ribbon+Bloom hybrid filter policy. Here we explain why and how.</p> <p><a href="https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#ribbon-filter">Ribbon filter on RocksDB wiki</a></p> <p><a href="https://arxiv.org/abs/2103.02515">Ribbon filter paper</a></p> <h2 id="problem--background">Problem & background</h2> <p>Bloom filters play a critical role in optimizing point queries and some range queries in LSM-tree storage systems like RocksDB. Very large DBs can use 10% or more of their RAM for (Bloom) filters, so that (average case) read performance can be very good despite high (worst case) read amplification, <a href="http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-pick-2_23.html">which is useful for lowering write and/or space amplification</a>. Although the <code class="language-plaintext highlighter-rouge">format_version=5</code> Bloom filter in RocksDB is extremely fast, all Bloom filters use around 50% more space than is theoretically possible for a hashed structure configured for the same false positive (FP) rate and number of keys added. What would it take to save that significant share of “wasted” filter memory, and when does it make sense to use such a Bloom alternative?</p> <p>A number of alternatives to Bloom filters were known, especially for static filters (not modified after construction), but all the previously known structures were unsatisfying for SSTs because of some combination of</p> <ul> <li>Not enough space savings for the CPU increase. For example, <a href="https://arxiv.org/abs/1912.08258">Xor filters</a> use 3-4x more CPU than Bloom but only save 15-20% of space. <a href="https://arxiv.org/pdf/1603.04330.pdf">GOV</a> can save around 30% space but requires around 10x more CPU than Bloom.</li> <li>Inconsistent space savings. <a href="https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf">Cuckoo filters</a> and Xor+ filters offer significant space savings for very low FP rates (high bits per key) but little or no savings for higher FP rates (low bits per key). (<a href="https://stratos.seas.harvard.edu/files/stratos/files/monkeykeyvaluestore.pdf">Higher FP rates are considered best for the largest levels of the LSM.</a>) <a href="https://arxiv.org/pdf/2001.10500.pdf">Spatially-coupled Xor filters</a> require a very large number of keys per filter for large space savings.</li> <li>Inflexible configuration. No published alternatives offered the same continuous configurability of Bloom filters, where any FP rate and any fractional bits per key could be chosen.
This flexibility improves memory efficiency with the <code class="language-plaintext highlighter-rouge">optimize_filters_for_memory</code> option that minimizes internal fragmentation on filters.</li> </ul> <h2 id="ribbon-filter-development-and-implementation">Ribbon filter development and implementation</h2> <p>The Ribbon filter came about when I developed a faster, simpler, and more adaptable algorithm for constructing a little-known <a href="https://arxiv.org/pdf/1907.04750.pdf">Xor-based structure from Dietzfelbinger and Walzer</a>. It has very good space usage for the required CPU time (~30% space savings for 3-4x CPU) and, with some engineering, Bloom-like configurability. The complications were manageable for use in RocksDB:</p> <ul> <li>Ribbon space efficiency does not naturally scale to a very large number of keys in a single filter (whole SST file or partition), but with the current 128-bit Ribbon implementation in RocksDB, even 100 million keys in one filter saves 27% space vs. Bloom rather than 30% for 100,000 keys in a filter.</li> <li>More temporary memory is required during construction, ~230 bits per key for 128-bit Ribbon vs. ~75 bits per key for Bloom filters. A quick calculation shows that if you are saving 3 bits per key on the generated filter, you only need about 50 generated filters in memory to offset this temporary memory usage. (Thousands of filters in memory is typical.) Starting in RocksDB version 6.27, this temporary memory can be accounted for under block cache using <code class="language-plaintext highlighter-rouge">BlockBasedTableOptions::reserve_table_builder_memory</code>.</li> <li>Ribbon filter queries use relatively more CPU for lower FP rates (but still O(1) relative to the number of keys added to the filter). This should be OK because lower FP rates are only appropriate when the cost of a false positive is very high (worth extra query time) or memory is not so constrained (can use Bloom instead).</li> </ul> <p>Future: data in <a href="https://arxiv.org/abs/2103.02515">the paper</a> suggests that 32-bit Balanced Ribbon (new name: <a href="https://arxiv.org/pdf/2109.01892.pdf">Bump-Once Ribbon</a>) would improve all of these issues and be better all around (except for code complexity).</p> <h2 id="ribbon-vs-bloom-in-rocksdb-configuration">Ribbon vs. Bloom in RocksDB configuration</h2> <p>Different applications and hardware configurations have different constraints, but we can use hardware costs to examine and better understand the trade-off between Bloom and Ribbon.</p> <h3 id="same-fp-rate-ram-vs-cpu-hardware-cost">Same FP rate, RAM vs. CPU hardware cost</h3> <p>Under ideal conditions where we can adjust our hardware to suit the application, in terms of dollars, how much does it cost to construct, query, and keep in memory a Bloom filter vs. a Ribbon filter? The Ribbon filter costs more for CPU but less for RAM. Importantly, the RAM cost directly depends on how long the filter is kept in memory, which in RocksDB is essentially the lifetime of the filter. (Temporary RAM during construction is so short-lived that it is ignored.) Using some consumer hardware and electricity prices and a predicted balance between construction and queries, we can compute a “break even” duration in memory. To minimize cost, filters with a lifetime shorter than this should be Bloom and filters with a lifetime longer than this should be Ribbon.
(Python code)</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 </pre></td><td class="rouge-code"><pre># Commodity prices based roughly on consumer prices and rough guesses # Upfront cost of a CPU per hardware thread upfront_dollars_per_cpu_thread = 30.0 # CPU average power usage per hardware thread watts_per_cpu_thread = 3.5 # Upfront cost of a GB of RAM upfront_dollars_per_gb_ram = 8.0 # RAM average power usage per GB # https://www.crucial.com/support/articles-faq-memory/how-much-power-does-memory-use watts_per_gb_ram = 0.375 # Estimated price of power per kilowatt-hour, including overheads like conversion losses and cooling dollars_per_kwh = 0.35 # Assume 3 year hardware lifetime hours_per_lifetime = 3 * 365 * 24 seconds_per_lifetime = hours_per_lifetime * 60 * 60 # Number of filter queries per key added in filter construction is heavily dependent on workload. # When replication is in layer above RocksDB, it will be low, likely < 1. When replication is in # storage layer below RocksDB, it will likely be > 1. Using a rough and general guesstimate. key_query_per_construct = 1.0 #================================== # Bloom & Ribbon filter performance typical_bloom_bits_per_key = 10.0 typical_ribbon_bits_per_key = 7.0 # Speeds here are sensitive to many variables, especially query speed because it # is so dependent on memory latency. Using this benchmark here: # for IMPL in 2 3; do # ./filter_bench -impl=$IMPL -quick -m_keys_total_max=200 -use_full_block_reader # done # and "Random filter" queries. 
nanoseconds_per_construct_bloom_key = 32.0 nanoseconds_per_construct_ribbon_key = 140.0 nanoseconds_per_query_bloom_key = 500.0 nanoseconds_per_query_ribbon_key = 600.0 #================================== # Some constants kwh_per_watt_lifetime = hours_per_lifetime / 1000.0 bits_per_gb = 8 * 1024 * 1024 * 1024 #================================== # Crunching the numbers # on CPU for constructing filters dollars_per_cpu_thread_lifetime = upfront_dollars_per_cpu_thread + watts_per_cpu_thread * kwh_per_watt_lifetime * dollars_per_kwh dollars_per_cpu_thread_second = dollars_per_cpu_thread_lifetime / seconds_per_lifetime dollars_per_construct_bloom_key = dollars_per_cpu_thread_second * nanoseconds_per_construct_bloom_key / 10**9 dollars_per_construct_ribbon_key = dollars_per_cpu_thread_second * nanoseconds_per_construct_ribbon_key / 10**9 dollars_per_query_bloom_key = dollars_per_cpu_thread_second * nanoseconds_per_query_bloom_key / 10**9 dollars_per_query_ribbon_key = dollars_per_cpu_thread_second * nanoseconds_per_query_ribbon_key / 10**9 dollars_per_bloom_key_cpu = dollars_per_construct_bloom_key + key_query_per_construct * dollars_per_query_bloom_key dollars_per_ribbon_key_cpu = dollars_per_construct_ribbon_key + key_query_per_construct * dollars_per_query_ribbon_key # on holding filters in RAM dollars_per_gb_ram_lifetime = upfront_dollars_per_gb_ram + watts_per_gb_ram * kwh_per_watt_lifetime * dollars_per_kwh dollars_per_gb_ram_second = dollars_per_gb_ram_lifetime / seconds_per_lifetime dollars_per_bloom_key_in_ram_second = dollars_per_gb_ram_second / bits_per_gb * typical_bloom_bits_per_key dollars_per_ribbon_key_in_ram_second = dollars_per_gb_ram_second / bits_per_gb * typical_ribbon_bits_per_key #================================== # How many seconds does it take for the added cost of constructing a ribbon filter instead # of bloom to be offset by the added cost of holding the bloom filter in memory? break_even_seconds = (dollars_per_ribbon_key_cpu - dollars_per_bloom_key_cpu) / (dollars_per_bloom_key_in_ram_second - dollars_per_ribbon_key_in_ram_second) print(break_even_seconds) # -> 3235.1647730256936 </pre></td></tr></tbody></table></code></pre></div></div> <p>So roughly speaking, filters that live in memory for more than an hour should be Ribbon, and filters that live less than an hour should be Bloom. This is very interesting, but how long do filters live in RocksDB?</p> <p>First let’s consider the average case. Write-heavy RocksDB loads are often backed by flash storage, which has some specified write endurance for its intended lifetime. This can be expressed as <em>device writes per day</em> (DWPD), and supported DWPD is typically < 10.0 even for high end devices (excluding NVRAM). Roughly speaking, the DB would need to be writing at a rate of 20+ DWPD for data to have an average lifetime of less than one hour. Thus, unless you are prematurely burning out your flash or massively under-utilizing available storage, using the Ribbon filter has the better cost profile <em>on average</em>.</p> <h3 id="predictable-lifetime">Predictable lifetime</h3> <p>But we can do even better than optimizing for the average case. LSM levels give us very strong data lifetime hints. Data in L0 might live for minutes or a small number of hours. Data in Lmax might live for days or weeks. So even if Ribbon filters weren’t the best choice on average for a workload, they almost certainly make sense for the larger, longer-lived levels of the LSM. 
As of RocksDB 6.24, you can specify a minimum LSM level for Ribbon filters with <code class="language-plaintext highlighter-rouge">NewRibbonFilterPolicy</code>, and earlier levels will use Bloom filters.</p>
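<p>For example, a minimal sketch (the bits-per-key and level threshold here are illustrative choices, not recommendations):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include "rocksdb/filter_policy.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::BlockBasedTableOptions table_options;
// Bloom-equivalent 10 bits/key FP rate; Bloom filters for levels below 2,
// Ribbon filters for level 2 and beyond.
table_options.filter_policy.reset(rocksdb::NewRibbonFilterPolicy(
    10 /* bloom_equivalent_bits_per_key */, 2 /* bloom_before_level */));

rocksdb::Options options;
options.table_factory.reset(
    rocksdb::NewBlockBasedTableFactory(table_options));
</code></pre></div></div>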
<h3 id="what-to-do-with-saved-memory">What to do with saved memory</h3> <p>The default way to improve overall RocksDB performance with more available memory is to use more space for caching, which improves latency, CPU load, read IOs, etc. With <code class="language-plaintext highlighter-rouge">cache_index_and_filter_blocks=1</code>, savings in filters will automatically make room for caching more data blocks in block cache. With <code class="language-plaintext highlighter-rouge">cache_index_and_filter_blocks=0</code>, consider increasing block cache size.</p> <p>Using the space savings to lower filter FP rates is also an option, but there is less evidence for this commonly improving existing <em>optimized</em> configurations.</p> <h2 id="generic-recommendation">Generic recommendation</h2> <p>If using <code class="language-plaintext highlighter-rouge">NewBloomFilterPolicy(bpk)</code> for a large persistent DB using compression, try using <code class="language-plaintext highlighter-rouge">NewRibbonFilterPolicy(bpk)</code> instead, which will generate Ribbon filters during compaction and Bloom filters for flush, both with the same FP rate as the old setting. Once new SST files are generated under the new policy, this should free up some memory for more caching without much effect on burst or sustained write speed. Both kinds of filters can be read under either policy, so there’s always an option to adjust settings or gracefully roll back to using Bloom filters only (keeping in mind that SST files must be replaced to see the effect of that change).</p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/568694102/picture/" alt="" title="" /> </div> <p class="post-authorName">Andrew Kryczka</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2021/05/31/dictionary-compression.html">Preset Dictionary Compression</a></h1> <p class="post-meta">Posted May 31, 2021</p> </header> <article class="post-content"> <h2 id="summary">Summary</h2> <p>Compression algorithms relying on an adaptive dictionary, such as LZ4, zstd, and zlib, struggle to achieve good compression ratios on small inputs when using the basic compress API. With the basic compress API, the compressor starts with an empty dictionary. With small inputs, not much content gets added to the dictionary during the compression. Combined, these factors mean the dictionary will never have enough contents to achieve great compression ratios.</p> <p>RocksDB groups key-value pairs into data blocks before storing them in files. For use cases that are heavy on random accesses, a smaller data block size is sometimes desirable for reducing I/O and CPU spent reading blocks. However, as explained above, smaller data block size comes with the downside of worse compression ratio when using the basic compress API.</p> <p>Fortunately, zstd and other libraries offer advanced compress APIs that preset the dictionary. A preset dictionary makes it possible for the compressor to start from a useful state instead of from an empty one, making compression immediately effective.</p>
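<p>For concreteness, presetting with zstd looks roughly like the following. This is a minimal sketch of the advanced API; the helper name and compression level are illustrative, and error handling is omitted.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <zstd.h>  // assumes zstd is available

// Sketch: compress one data block starting from a preset dictionary,
// instead of starting from an empty one.
size_t CompressBlockWithDict(const void* block, size_t block_size,
                             const void* dict, size_t dict_size,
                             void* out, size_t out_capacity) {
  ZSTD_CCtx* cctx = ZSTD_createCCtx();
  // The dictionary gives the compressor a useful starting state, which
  // matters most for small inputs such as 4-16KB data blocks.
  size_t n = ZSTD_compress_usingDict(cctx, out, out_capacity,
                                     block, block_size,
                                     dict, dict_size,
                                     /*compressionLevel=*/3);
  ZSTD_freeCCtx(cctx);
  return n;  // compressed size, or a zstd error code
}
</code></pre></div></div>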
<p>RocksDB now optionally takes advantage of these dictionary presetting APIs. The challenges in integrating this feature into the storage engine were more substantial than they might appear on the surface. First, we need to target a preset dictionary to the relevant data. Second, preset dictionaries need to be trained from data samples, which need to be gathered. Third, preset dictionaries need to be persisted since they are needed at decompression time. Fourth, overhead in accessing the preset dictionary must be minimized to prevent regression in critical code paths. Fifth, we need easy-to-use measurement to evaluate candidate use cases and production impact.</p> <p>In production, we have deployed dictionary presetting to save space in multiple RocksDB use cases with data block size 8KB or smaller. We have measured meaningful benefit to compression ratio in use cases with data block size up to 16KB. We have also measured a use case that can save both CPU and space by reducing data block size and turning on dictionary presetting at the same time.</p> <h2 id="feature-design">Feature design</h2> <h4 id="targeting">Targeting</h4> <p>Over time we have considered a few possibilities for the scope of a dictionary.</p> <ul> <li>Subcompaction</li> <li>SST file</li> <li>Column family</li> </ul> <p>The original choice was subcompaction scope. This enabled an approach with minimal buffering overhead because we could collect samples while generating the first output SST file. The dictionary could then be trained and applied to subsequent SST files in the same subcompaction.</p> <p>However, we found a large use case where the proximity of data in the keyspace was more correlated with its similarity than we had predicted. In particular, the approach of training a dictionary on an adjacent file yielded substantially worse ratios than training the dictionary on the same file it would be used to compress. In response to this finding, we changed the preset dictionary scope to per SST file.</p> <p>With this change in approach, we had to face the problem we had hoped to avoid: how can we compress all of an SST file’s data blocks with the same preset dictionary while that dictionary can only be trained after many data blocks have been sampled? The solutions we considered both involved a new overhead. We could read the input more than once and introduce I/O overhead, or we could buffer the uncompressed output file data blocks until a dictionary is trained, introducing memory overhead. We chose to take the hit on memory overhead.</p> <p>Another approach that we considered was associating multiple dictionaries with a column family. For example, in MyRocks there could be a dictionary trained on data from each large table. When compressing a data block, we would look at the table to which its data belongs and pick the corresponding dictionary. However, this approach would introduce many challenges. RocksDB would need to be aware of the key schema to know where the table boundaries are. RocksDB would also need to periodically update the dictionaries to account for changes in data pattern. It would need somewhere to store dictionaries at column family scope. Overall, we thought these challenges made the approach too difficult to pursue.</p> <h4 id="training">Training</h4> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/dictcmp/dictcmp_raw_sampled.png" alt="" /></p> <p align="center"><i> Raw samples mode (`zstd_max_train_bytes == 0`) </i></p> <p>As mentioned earlier, the approach we took is to build the dictionary from buffered uncompressed data blocks. The first row of data blocks in these diagrams illustrates this buffering. The second row illustrates training samples selected from the buffered blocks. In raw samples mode (above), the final dictionary is simply the concatenation of these samples. In zstd training mode (below), by contrast, these samples are passed to the trainer to produce the final dictionary.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/dictcmp/dictcmp_zstd_trained.png" alt="" /></p> <p align="center"><i> zstd training mode (`zstd_max_train_bytes > 0`) </i></p>
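<p>In code, zstd training mode boils down to something like this sketch. The helper and its flat sample layout are simplifying assumptions; RocksDB’s actual sampling works as described above.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <zdict.h>  // zstd's dictionary trainer

#include <string>
#include <vector>

// Train a dictionary from sampled (uncompressed) data blocks.
// `samples` is the concatenated sample bytes; `sample_sizes` gives the
// length of each sample within it, in order.
std::string TrainDictionary(const std::string& samples,
                            const std::vector<size_t>& sample_sizes,
                            size_t max_dict_bytes) {
  std::string dict(max_dict_bytes, '\0');
  size_t n = ZDICT_trainFromBuffer(&dict[0], dict.size(), samples.data(),
                                   sample_sizes.data(),
                                   (unsigned)sample_sizes.size());
  if (ZDICT_isError(n)) {
    return "";  // fall back to compressing without a dictionary
  }
  dict.resize(n);
  return dict;
}
</code></pre></div></div>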
<h4 id="compression-path">Compression path</h4> <p>Once the preset dictionary is generated by the above process, we apply it to the buffered data blocks and write them to the output file. Thereafter, newly generated data blocks are immediately compressed and written out.</p> <p>One optimization here is available to zstd v0.7.0+ users. Instead of deserializing the dictionary on each compress invocation, we can do that work once and reuse it. A <code class="language-plaintext highlighter-rouge">ZSTD_CDict</code> holds this digested dictionary state and is passed to the compress API.</p> <h4 id="persistence">Persistence</h4> <p>When an SST file’s data blocks are compressed using a preset dictionary, that dictionary is stored inside the file for later use in decompression.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/dictcmp/dictcmp_sst_blocks.png" alt="" /></p> <p align="center"><i> SST file layout with the preset dictionary in its own (uncompressed) block </i></p> <h4 id="decompression-path">Decompression path</h4> <p>To decompress, we need to provide both the data block and the dictionary used to compress it. Since dictionaries are just blocks in a file, we access them through block cache. However, this additional load on the block cache can be problematic. It can be alleviated by pinning the dictionaries to avoid going through the LRU locks.</p> <p>An optimization analogous to the digested dictionary exists for certain zstd users (see User API section for details). When enabled, the block cache stores the digested dictionary state for decompression (<code class="language-plaintext highlighter-rouge">ZSTD_DDict</code>) instead of the block contents. In some cases we have seen decompression CPU decrease overall when enabling dictionary, thanks to this optimization.</p> <h4 id="measurement">Measurement</h4> <p>Typically our first step in evaluating a candidate use case is an offline analysis of the data. This gives us a quick idea of whether presetting a dictionary will be beneficial without any code, config, or data changes. Our <code class="language-plaintext highlighter-rouge">sst_dump</code> tool reports what size SST files would have been using specified compression libraries and options. We can select random SST files and compare the size with vs. without dictionary.</p> <p>When that goes well, the next step is to see how it works in a live DB, like a production shadow or canary. There we can observe how it affects application/system metrics.</p> <p>Even after dictionary is enabled, there is the question of how much space was finally saved. We provide a way to A/B test size with vs. without dictionary while running in production. This feature picks a sample of data blocks to compress in multiple ways – one of the outputs is stored, while the other outputs are thrown away after counting their size. Due to API limitations, the stored output always has to be the dictionary-compressed one, so this feature can only be used after enabling dictionary. The sizes with and without dictionary are stored in the SST file as table properties. These properties can be aggregated across all SST files in a DB (and across all DBs in a tier) to learn the final space saving.</p>
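<p>One way such an aggregation might look, using the estimated-size properties described later in the User API section (treat the exact field names as assumptions to verify against your RocksDB version):</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <rocksdb/db.h>

#include <cstdint>
#include <cstdio>

// Sketch: sum the sampled size estimates across all SST files in a DB.
// The estimate fields are only non-zero for files where sampling kicked in.
void PrintCompressionEstimates(rocksdb::DB* db) {
  rocksdb::TablePropertiesCollection props;
  if (!db->GetPropertiesOfAllTables(&props).ok()) {
    return;
  }
  uint64_t data = 0, slow = 0, fast = 0;
  for (const auto& file : props) {
    data += file.second->data_size;
    slow += file.second->slow_compression_estimated_data_size;
    fast += file.second->fast_compression_estimated_data_size;
  }
  std::printf("actual: %llu, slow-compression est.: %llu, "
              "fast-compression est.: %llu\n",
              (unsigned long long)data, (unsigned long long)slow,
              (unsigned long long)fast);
}
</code></pre></div></div>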
<h2 id="user-api">User API</h2> <p>RocksDB allows presetting the compression dictionary for users of LZ4, zstd, and zlib. The most advanced capabilities are available to zstd v1.1.4+ users who statically link (see below). Newer versions of zstd (v1.3.6+) have internal changes to the dictionary trainer and digested dictionary management, which significantly improve memory and CPU efficiency.</p> <p>Run-time settings:</p> <ul> <li><code class="language-plaintext highlighter-rouge">CompressionOptions::max_dict_bytes</code>: Limit on per-SST file dictionary size. Increasing this causes dictionaries to consume more space and memory for the possibility of better data block compression. A typical value we use is 16KB.</li> <li>(<strong>zstd only</strong>) <code class="language-plaintext highlighter-rouge">CompressionOptions::zstd_max_train_bytes</code>: Limit on training data passed to the zstd dictionary trainer. Larger values cause the training to consume more CPU (and take longer) while generating more effective dictionaries. The starting point guidance we received from the zstd team is to set it to 100x <code class="language-plaintext highlighter-rouge">CompressionOptions::max_dict_bytes</code>.</li> <li><code class="language-plaintext highlighter-rouge">CompressionOptions::max_dict_buffer_bytes</code>: Limit on data buffering from which training samples are gathered. By default we buffer up to the target file size per ongoing background job. If this amount of memory is concerning, this option can constrain the buffering with the downside that training samples will cover a smaller portion of the SST file. Work is ongoing to charge this memory usage to block cache so it will not need to be accounted for separately.</li> <li><code class="language-plaintext highlighter-rouge">BlockBasedTableOptions::cache_index_and_filter_blocks</code>: Controls whether metadata blocks including dictionary are accessed through block cache or held in table reader memory (yes, its name is outdated).</li> <li><code class="language-plaintext highlighter-rouge">BlockBasedTableOptions::metadata_cache_options</code>: Controls which metadata blocks are pinned in block cache. Pinning avoids LRU contention at the risk of cold blocks holding memory.</li> <li><code class="language-plaintext highlighter-rouge">ColumnFamilyOptions::sample_for_compression</code>: Controls frequency of measuring extra compressions on data blocks using various libraries with default settings (i.e., without preset dictionary).</li> </ul>
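<p>Putting the run-time settings together, a dictionary-enabled configuration might look like the following sketch. The values are illustrative, not recommendations.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <rocksdb/options.h>

rocksdb::Options MakeDictionaryOptions() {
  rocksdb::Options options;
  options.compression = rocksdb::kZSTD;
  options.compression_opts.max_dict_bytes = 16 * 1024;  // 16KB per-file dict
  // zstd only: ~100x max_dict_bytes of training data, per the guidance above.
  options.compression_opts.zstd_max_train_bytes = 100 * 16 * 1024;
  // Optionally cap the buffered (uncompressed) blocks used for sampling.
  options.compression_opts.max_dict_buffer_bytes = 64ull * 1024 * 1024;
  // Measure extra compressions on roughly 1 in 100 data blocks.
  options.sample_for_compression = 100;
  return options;
}
</code></pre></div></div>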
<p>Compile-time setting:</p> <ul> <li>(<strong>zstd only</strong>) <code class="language-plaintext highlighter-rouge">EXTRA_CXXFLAGS=-DZSTD_STATIC_LINKING_ONLY</code>: Hold digested dictionaries in block cache to save repetitive deserialization overhead. This saves a lot of CPU for read-heavy workloads. This compiler flag is necessary because one of the digested dictionary APIs we use is marked as experimental. We still use it in production, however.</li> </ul> <p>Function:</p> <ul> <li><code class="language-plaintext highlighter-rouge">DB::GetPropertiesOfAllTables()</code>: The properties <code class="language-plaintext highlighter-rouge">kSlowCompressionEstimatedDataSize</code> and <code class="language-plaintext highlighter-rouge">kFastCompressionEstimatedDataSize</code> estimate what the data block size (<code class="language-plaintext highlighter-rouge">kDataSize</code>) would have been if the corresponding compression library had been used. These properties are only present when <code class="language-plaintext highlighter-rouge">ColumnFamilyOptions::sample_for_compression</code> causes one or more samples to be measured, and they become more accurate with higher sampling frequency.</li> </ul> <p>Tool:</p> <ul> <li><code class="language-plaintext highlighter-rouge">sst_dump --command=recompress</code>: Offline analysis tool that reports what the SST file size would have been using the specified compression library and options.</li> </ul> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2021/05/27/rocksdb-secondary-cache.html">RocksDB Secondary Cache</a></h1> <p class="post-meta">Posted May 27, 2021</p> </header> <article class="post-content"> <h2 id="introduction">Introduction</h2> <p>The RocksDB team is implementing support for a block cache on non-volatile media, such as a local flash device or NVM/SCM. It can be viewed as an extension of RocksDB’s current volatile block cache (LRUCache or ClockCache). The non-volatile block cache acts as a second tier cache that contains blocks evicted from the volatile cache. Those blocks are then promoted to the volatile cache as they become hotter due to access.</p> <p>This feature is meant for cases where the DB is located on remote storage or cloud storage. The non-volatile cache is officially referred to in RocksDB as the SecondaryCache.
By maintaining a SecondaryCache that’s an order of magnitude larger than DRAM, fewer reads would be required from remote storage, thus reducing read latency as well as network bandwidth consumption.</p> <p>From the user point of view, the local flash cache will support the following requirements -</p> <ol> <li>Provide a pointer to a secondary cache when opening a DB</li> <li>Be able to share the secondary cache across DBs in the same process</li> <li>Have multiple secondary caches on a host</li> <li>Support persisting the cache across process restarts and reboots by ensuring repeatability of the cache key</li> </ol> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/rocksdb-secondary-cache/arch_diagram.png" alt="Architecture" /></p> <h2 id="design">Design</h2> <p>When designing the API for a SecondaryCache, we had a choice between making it visible to the RocksDB code (table reader) or hiding it behind the RocksDB block cache. There are several advantages of hiding it behind the block cache -</p> <ul> <li>Allows flexibility in insertion of blocks into the secondary cache. A block can be inserted on eviction from the RAM tier, or it could be eagerly inserted.</li> <li>It makes the rest of the RocksDB code less complex by providing a uniform interface regardless of whether a secondary cache is configured or not</li> <li>Makes parallel reads, peeking in the cache for prefetching, failure handling etc. easier</li> <li>Makes it easier to extend to compressed data if needed, and allows other persistent media, such as PM, to be added as an additional tier</li> </ul> <p>We decided to make the secondary cache transparent to the rest of RocksDB code by hiding it behind the block cache. A key issue that we needed to address was the allocation and ownership of memory of the cached items - insertion into the secondary cache may require that memory be allocated by the cache itself. This means that the parts of the cached object that can be transferred to the secondary cache need to be copied out (referred to as <strong>unpacking</strong>), and on a lookup the data stored in the secondary cache needs to be provided to the object constructor (referred to as <strong>packing</strong>). For RocksDB cached objects such as data blocks, index and filter blocks, and compression dictionaries, unpacking involves copying out the raw uncompressed BlockContents of the block, and packing involves constructing the corresponding block/index/filter/dictionary object using the raw uncompressed data.</p> <p>Another alternative we considered was the existing PersistentCache interface. However, we decided to not pursue it and eventually deprecate it for the following reasons -</p> <ul> <li>It is exposed directly to the table reader code, which makes it more difficult to implement different policies such as inclusive/exclusive cache, as well as extending it to more sophisticated admission control policies</li> <li>The interface does not allow for custom memory allocation and object packing/unpacking, so new APIs would have to be defined anyway</li> <li>The current PersistentCache implementation is very simple and does not have any admission control policies</li> </ul> <h2 id="api">API</h2> <p>The interface between RocksDB’s block cache and the secondary cache is designed to allow pluggable implementations. For FB internal usage, we plan to use Cachelib with a wrapper to provide the plug-in implementation and use folly and other fbcode libraries, which cannot be used directly by RocksDB, to efficiently implement the cache operations. The following diagrams show the flow of insertion and lookup of a block.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/rocksdb-secondary-cache/insert_flow.png" alt="Insert flow" /></p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/rocksdb-secondary-cache/lookup_flow.png" alt="Lookup flow" /></p> <p>An item in the secondary cache is referenced by a SecondaryCacheHandle. The handle may not be immediately ready or have a valid value. The caller can call IsReady() to determine if it’s ready, and can call Wait() in order to block until it becomes ready. The caller must call Value() after it becomes ready to determine if the item was successfully read. Value() must return nullptr on failure.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>class SecondaryCacheHandle {
 public:
  virtual ~SecondaryCacheHandle() {}

  // Returns whether the handle is ready or not
  virtual bool IsReady() = 0;

  // Block until handle becomes ready
  virtual void Wait() = 0;

  // Return the value. If nullptr, it means the lookup was unsuccessful
  virtual void* Value() = 0;

  // Return the size of value
  virtual size_t Size() = 0;
};
</code></pre></div></div>
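<p>On the caller’s side, the protocol looks roughly like this hypothetical sketch. The Lookup() method is part of the SecondaryCache interface shown below; the surrounding names are assumptions for illustration.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>// Hypothetical caller-side use of the handle protocol described above.
void LookupAndPromote(SecondaryCache* secondary_cache, const Slice& key,
                      const Cache::CreateCallback& create_cb) {
  std::unique_ptr<SecondaryCacheHandle> handle =
      secondary_cache->Lookup(key, create_cb, /*wait=*/false);
  if (!handle) {
    return;  // not present in the secondary tier
  }
  if (!handle->IsReady()) {
    handle->Wait();  // block until the asynchronous read completes
  }
  void* obj = handle->Value();  // nullptr means the read failed
  if (obj != nullptr) {
    size_t charge = handle->Size();
    // ... promote obj into the volatile LRU tier with this charge ...
    (void)charge;
  }
}
</code></pre></div></div>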
<p>The user of the secondary cache (for example, BlockBasedTableReader indirectly through LRUCache) must implement the callbacks defined in CacheItemHelper, in order to facilitate the unpacking/packing of objects for saving to and restoring from the secondary cache. The CreateCallback must be implemented to construct a cacheable object from the raw data in secondary cache.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>// The SizeCallback takes a void* pointer to the object and returns the size
// of the persistable data. It can be used by the secondary cache to allocate
// memory if needed.
using SizeCallback = size_t (*)(void* obj);

// The SaveToCallback takes a void* object pointer and saves the persistable
// data into a buffer. The secondary cache may decide to not store it in a
// contiguous buffer, in which case this callback will be called multiple
// times with increasing offset
using SaveToCallback = Status (*)(void* from_obj, size_t from_offset,
                                  size_t length, void* out);

// A function pointer type for custom destruction of an entry's
// value. The Cache is responsible for copying and reclaiming space
// for the key, but values are managed by the caller.
using DeleterFn = void (*)(const Slice& key, void* value);

// A struct with pointers to helper functions for spilling items from the
// cache into the secondary cache. May be extended in the future. An
// instance of this struct is expected to outlive the cache.
struct CacheItemHelper {
  SizeCallback size_cb;
  SaveToCallback saveto_cb;
  DeleterFn del_cb;

  CacheItemHelper() : size_cb(nullptr), saveto_cb(nullptr), del_cb(nullptr) {}
  CacheItemHelper(SizeCallback _size_cb, SaveToCallback _saveto_cb,
                  DeleterFn _del_cb)
      : size_cb(_size_cb), saveto_cb(_saveto_cb), del_cb(_del_cb) {}
};

// The CreateCallback is passed by the block cache user to Lookup(). It
// takes in a buffer from the NVM cache and constructs an object using
// it. The callback doesn't have ownership of the buffer and should
// copy the contents into its own buffer.
using CreateCallback = std::function<Status(void* buf, size_t size,
                                            void** out_obj, size_t* charge)>;
</code></pre></div></div>
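<p>As a hypothetical example, for a value type that is just a contiguous byte buffer, the callbacks might be implemented like this (Status, Slice, and CacheItemHelper are the types shown above):</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <cstring>
#include <string>

// A toy cached value: its persistable data is simply its bytes.
struct BufferValue {
  std::string data;
};

size_t BufferSize(void* obj) {
  return static_cast<BufferValue*>(obj)->data.size();
}

Status BufferSaveTo(void* from_obj, size_t from_offset, size_t length,
                    void* out) {
  auto* v = static_cast<BufferValue*>(from_obj);
  // May be called multiple times with increasing offset, per the contract.
  std::memcpy(out, v->data.data() + from_offset, length);
  return Status::OK();
}

void BufferDeleter(const Slice& /*key*/, void* value) {
  delete static_cast<BufferValue*>(value);
}

// Passed alongside Insert() so the entry can later spill to the
// secondary cache.
const CacheItemHelper kBufferHelper(BufferSize, BufferSaveTo, BufferDeleter);
</code></pre></div></div>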
<p>The secondary cache provider must provide a concrete implementation of the SecondaryCache abstract class.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>// SecondaryCache
//
// Cache interface for caching blocks on a secondary tier (which can include
// non-volatile media, or alternate forms of caching such as compressed data)
class SecondaryCache {
 public:
  virtual ~SecondaryCache() {}

  virtual std::string Name() = 0;

  static const std::string Type() { return "SecondaryCache"; }

  // Insert the given value into this cache. The value is not written
  // directly. Rather, the SaveToCallback provided by helper_cb will be
  // used to extract the persistable data in value, which will be written
  // to this tier. The implementation may or may not write it to cache
  // depending on the admission control policy, even if the return status is
  // success.
  virtual Status Insert(const Slice& key, void* value,
                        const Cache::CacheItemHelper* helper) = 0;

  // Lookup the data for the given key in this cache. The create_cb
  // will be used to create the object. The handle returned may not be
  // ready yet, unless wait=true, in which case Lookup() will block until
  // the handle is ready
  virtual std::unique_ptr<SecondaryCacheHandle> Lookup(
      const Slice& key, const Cache::CreateCallback& create_cb,
      bool wait) = 0;

  // At the discretion of the implementation, erase the data associated
  // with key
  virtual void Erase(const Slice& key) = 0;

  // Wait for a collection of handles to become ready. This would be used
  // by MultiGet, for example, to read multiple data blocks in parallel
  virtual void WaitAll(std::vector<SecondaryCacheHandle*> handles) = 0;

  virtual std::string GetPrintableOptions() const = 0;
};
</code></pre></div></div> <p>A SecondaryCache is configured by the user by providing a pointer to it in LRUCacheOptions -</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>struct LRUCacheOptions {
  ...
  // A SecondaryCache instance to use as an additional cache tier
  std::shared_ptr<SecondaryCache> secondary_cache;
  ...
};
</code></pre></div></div>
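<p>A minimal wiring sketch follows; my_secondary_cache stands for a user-provided implementation, and the capacity and shard values are illustrative.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>std::shared_ptr<Cache> MakeTieredBlockCache(
    std::shared_ptr<SecondaryCache> my_secondary_cache) {
  LRUCacheOptions cache_opts(1024 * 1024 * 1024 /*capacity: 1GB*/,
                             6 /*num_shard_bits*/,
                             false /*strict_capacity_limit*/,
                             0.5 /*high_pri_pool_ratio*/);
  cache_opts.secondary_cache = my_secondary_cache;
  return NewLRUCache(cache_opts);
}
</code></pre></div></div> <p>The resulting cache is then used as the volatile block cache (e.g. via <code class="language-plaintext highlighter-rouge">BlockBasedTableOptions::block_cache</code>) as usual.</p>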
<h2 id="current-status">Current Status</h2> <p>The initial RocksDB support for the secondary cache has been merged into the main branch, and will be available in the 6.21 release. This includes providing a way for the user to configure a secondary cache when instantiating RocksDB’s LRU cache (volatile block cache), spilling blocks evicted from the LRU cache to the flash cache, promoting a block read from the SecondaryCache to the LRU cache, and updating tools such as cache_bench and db_bench to specify a flash cache. The relevant PRs are <a href="https://github.com/facebook/rocksdb/pull/8271">#8271</a>, <a href="https://github.com/facebook/rocksdb/pull/8191">#8191</a>, and <a href="https://github.com/facebook/rocksdb/pull/8312">#8312</a>.</p> <p>We prototyped an end-to-end solution, with the above PRs as well as a Cachelib based implementation of the SecondaryCache. We ran a mixgraph benchmark to simulate a realistic read/write workload. The results showed a 15% gain with the local flash cache over no local cache, and a ~25-30% reduction in network reads with a corresponding decrease in cache misses.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/rocksdb-secondary-cache/Mixgraph_throughput.png" alt="Throughput" /></p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/rocksdb-secondary-cache/Mixgraph_hit_rate.png" alt="Hit Rate" /></p> <h2 id="future-work">Future Work</h2> <p>In the short term, we plan to do the following in order to fully integrate the SecondaryCache with RocksDB -</p> <ol> <li>Use DB session ID as the cache key prefix to ensure uniqueness and repeatability</li> <li>Optimize flash cache usage of MultiGet and iterator workloads</li> <li>Stress testing</li> <li>More benchmarking</li> </ol> <p>Longer term, we plan to deploy this in production at Facebook.</p> <h2 id="call-to-action">Call to Action</h2> <p>We are hoping for a community contribution of a secondary cache implementation, which would make this feature usable by the broader RocksDB userbase. If you are interested in contributing, please reach out to us in <a href="https://github.com/facebook/rocksdb/issues/8347">this issue</a>.</p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/9805119/picture/" alt="" title="" /> </div> <p class="post-authorName">Siying Dong</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2021/05/26/online-validation.html">Online Validation</a></h1> <p class="post-meta">Posted May 26, 2021</p> </header> <article class="post-content"> <p>To prevent or mitigate data corruption in RocksDB when software or hardware issues happen, we keep adding online consistency checks and improving existing ones.</p> <p>We improved ColumnFamilyOptions::force_consistency_checks and enabled it by default. The option applies some basic consistency checks to the LSM-tree, e.g., that files in one level are not overlapping. The DB will be frozen from new writes if a violation is detected. Previously, the feature’s check was too limited and didn’t always freeze the DB in a timely manner. Last year, we made the checking stricter so that it can <a href="https://github.com/facebook/rocksdb/pull/6901">catch many more corrupted LSM-tree structures</a>. We also fixed several issues where the checking failure was swallowed without freezing the DB. After making force_consistency_checks more reliable, we changed the default value to on.</p> <p>ColumnFamilyOptions::paranoid_file_checks does some more expensive extra checking when generating a new SST file. Last year, we expanded this feature’s coverage: after every SST file is generated, it is read back key by key to check two things: (1) the keys are in comparator order (a check also available, and enabled by default, during file writes via ColumnFamilyOptions::check_flush_compaction_key_order); (2) the hash of all the KVs matches the one calculated while adding the KVs to the file. These checks detect certain corruptions so we can prevent the corrupt files from being applied to the DB. We suggest users turn it on at least in shadow environments, and consider running it in production too if you can afford the overhead.</p>
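<p>A sketch of opting into these checks (option names as in the public API; the defaults noted in comments can differ by release, so treat them as approximate):</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <rocksdb/options.h>

rocksdb::Options MakeValidatingOptions() {
  rocksdb::Options options;
  // Freeze the DB on LSM-tree inconsistencies (on by default in
  // recent releases).
  options.force_consistency_checks = true;
  // Read back each newly generated SST file to verify key order
  // and content hash.
  options.paranoid_file_checks = true;
  // Verify key order while writing files during flush/compaction
  // (also on by default).
  options.check_flush_compaction_key_order = true;
  return options;
}
</code></pre></div></div>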
<p>A recently added feature checks the count of entries added to the memtable while flushing it into an SST file. It provides some online coverage for memtable corruption caused by either software bugs or hardware issues. The feature will be included in the coming release (6.21) and will be on by default. In the future, we will check more memtable counters, e.g. the number of puts or deletes.</p> <p>We also improved the reporting of online validation errors to improve debuggability. For example, failure to parse a corrupt key now reports details about the corrupt key. Since we did not want to expose key data in logs, error messages, etc. by default, this reporting is opt-in via DBOptions::allow_data_in_errors.</p> <p>More online checking features are planned, some more sophisticated, including key/value checksums and sample-based query validation.</p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <p class="post-authorName">Levi Tamasi</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2021/05/26/integrated-blob-db.html">Integrated BlobDB</a></h1> <p class="post-meta">Posted May 26, 2021</p> </header> <article class="post-content"> <h2 id="background">Background</h2> <p>BlobDB is essentially RocksDB for large-value use cases. The basic idea, which was proposed in the <a href="https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf">WiscKey paper</a>, is key-value separation: by storing large values in dedicated blob files and storing only small pointers to them in the LSM tree, we avoid copying the values over and over again during compaction, thus reducing write amplification. Historically, BlobDB supported only FIFO and TTL based use cases that can tolerate some data loss. In addition, it was incompatible with many widely used RocksDB features, and required users to adopt a custom API. In 2020, we decided to rearchitect BlobDB from the ground up, taking the lessons learned from WiscKey and the original BlobDB but also drawing inspiration and incorporating ideas from other similar systems.
Our goals were to eliminate the above limitations and to create a new integrated version that enables customers to use the well-known RocksDB API, has feature parity with the core of RocksDB, and offers better performance. This new implementation is now available and provides the following improvements over the original:</p> <ul> <li><strong>API.</strong> In contrast with the legacy BlobDB implementation, which had its own <code class="language-plaintext highlighter-rouge">StackableDB</code>-based interface (<code class="language-plaintext highlighter-rouge">rocksdb::blob_db::BlobDB</code>), the new version can be used via the well-known <code class="language-plaintext highlighter-rouge">rocksdb::DB</code> API, and can be configured simply by using a few column family options.</li> <li><strong>Consistency.</strong> With the integrated BlobDB implementation, RocksDB’s consistency guarantees and various write options (like using the WAL or synchronous writes) now apply to blobs as well. Moreover, the new BlobDB keeps track of blob files in the RocksDB MANIFEST.</li> <li><strong>Write performance.</strong> When using the old BlobDB, blobs are extracted and immediately written to blob files by the BlobDB layer <em>in the application thread</em>. This has multiple drawbacks from a performance perspective: first, it requires synchronization; second, it means that expensive operations like compression are performed in the application thread; and finally, it involves flushing the blob file after each blob. The new code takes a completely different approach by <em>offloading blob file building to RocksDB’s background jobs</em>, i.e. flushes and compactions. This means that similarly to SSTs, any given blob file is now written by a single background thread, eliminating the need for locking, flushing, or performing compression in the foreground. Note that this approach is also a better fit for network-based file systems where small writes might be expensive and opens up the possibility of file format optimizations that involve buffering (like dictionary compression).</li> <li><strong>Read performance.</strong> The old code relies on each read (i.e. <code class="language-plaintext highlighter-rouge">Get</code>, <code class="language-plaintext highlighter-rouge">MultiGet</code>, or iterator) taking a snapshot and uses those snapshots when deciding which obsolete blob files can be removed. The new BlobDB improves this by generalizing RocksDB’s Version concept, which historically referred to the set of live SST files at a given point in time, to include the set of live blob files as well. This has performance benefits like <a href="https://rocksdb.org/blog/2014/06/27/avoid-expensive-locks-in-get.html">making the read path mostly lock-free by utilizing thread-local storage</a>. We have also introduced a blob file cache that can be utilized to keep frequently accessed blob files open.</li> <li><strong>Garbage collection.</strong> Key-value separation means that if a key pointing to a blob gets overwritten or deleted, the blob becomes unreferenced garbage. To be able to reclaim this space, BlobDB now has garbage collection capabilities. GC is integrated into the compaction process and works by relocating valid blobs residing in old blob files as they are encountered during compaction. Blob files can be marked obsolete (and eventually deleted in one shot) once they contain nothing but garbage. This is more efficient than the method used by WiscKey, which involves performing a <code class="language-plaintext highlighter-rouge">Get</code> operation to find out whether a blob is still referenced followed by a <code class="language-plaintext highlighter-rouge">Put</code> to update the reference, which in turn results in garbage collection competing and potentially conflicting with the application’s writes.</li> <li><strong>Feature parity with the RocksDB core.</strong> The new BlobDB supports way more features than the original and is near feature parity with vanilla RocksDB. In particular, we support all basic read/write APIs (with the exception of <code class="language-plaintext highlighter-rouge">Merge</code>, which is coming soon), recovery, compression, atomic flush, column families, compaction filters, checkpoints, backup/restore, transactions, per-file checksums, and the SST file manager. In addition, the new BlobDB’s options can be dynamically adjusted using the <code class="language-plaintext highlighter-rouge">SetOptions</code> interface.</li> </ul> <h2 id="api">API</h2> <p>The new BlobDB can be configured (on a per-column family basis if needed) simply by using the following options:</p> <ul> <li><code class="language-plaintext highlighter-rouge">enable_blob_files</code>: set it to <code class="language-plaintext highlighter-rouge">true</code> to enable key-value separation.</li> <li><code class="language-plaintext highlighter-rouge">min_blob_size</code>: values at or above this threshold will be written to blob files during flush or compaction.</li> <li><code class="language-plaintext highlighter-rouge">blob_file_size</code>: the size limit for blob files.</li> <li><code class="language-plaintext highlighter-rouge">blob_compression_type</code>: the compression type to use for blob files. All blobs in the same file are compressed using the same algorithm.</li> <li><code class="language-plaintext highlighter-rouge">enable_blob_garbage_collection</code>: set this to <code class="language-plaintext highlighter-rouge">true</code> to make BlobDB actively relocate valid blobs from the oldest blob files as they are encountered during compaction.</li> <li><code class="language-plaintext highlighter-rouge">blob_garbage_collection_age_cutoff</code>: the threshold that the GC logic uses to determine which blob files should be considered “old.” For example, the default value of 0.25 signals to RocksDB that blobs residing in the oldest 25% of blob files should be relocated by GC. This parameter can be tuned to adjust the trade-off between write amplification and space amplification.</li> </ul> <p>The above options are all dynamically adjustable via the <code class="language-plaintext highlighter-rouge">SetOptions</code> API; changing them will affect subsequent flushes and compactions but not ones that are already in progress.</p>
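<p>An illustrative starting point for these options (the values here are examples, not recommendations):</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <rocksdb/options.h>

rocksdb::Options MakeBlobOptions() {
  rocksdb::Options options;
  options.enable_blob_files = true;
  options.min_blob_size = 4096;                    // values >= 4KB become blobs
  options.blob_file_size = 256 * 1024 * 1024;      // 256MB blob files
  options.blob_compression_type = rocksdb::kZSTD;  // one algorithm per file
  options.enable_blob_garbage_collection = true;
  options.blob_garbage_collection_age_cutoff = 0.25;  // the default
  return options;
}
</code></pre></div></div> <p>Since these are ordinary column family options, a change such as <code class="language-plaintext highlighter-rouge">db->SetOptions({{"enable_blob_garbage_collection", "true"}})</code> takes effect for subsequent flushes and compactions, as noted above.</p>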
<p>In terms of compaction styles, we recommend using leveled compaction with BlobDB. The rationale behind universal compaction in general is to provide lower write amplification at the expense of higher read amplification; however, as we will see later in the Performance section, BlobDB can provide very low write amp and good read performance with leveled compaction. Therefore, there is really no reason to take the hit in read performance that comes with universal compaction.</p> <p>In addition to the above, consider tuning the following non-BlobDB specific options:</p> <ul> <li><code class="language-plaintext highlighter-rouge">write_buffer_size</code>: this is the memtable size. You might want to increase it for large-value workloads to ensure that SST and blob files contain a decent number of keys.</li> <li><code class="language-plaintext highlighter-rouge">target_file_size_base</code>: the target size of SST files. Note that even when using BlobDB, it is important to have an LSM tree with a “nice” shape and multiple levels and files per level to prevent heavy compactions. Since BlobDB extracts and writes large values to blob files, it makes sense to make this parameter significantly smaller than the memtable size. One guideline is to set <code class="language-plaintext highlighter-rouge">blob_file_size</code> to the same value as <code class="language-plaintext highlighter-rouge">write_buffer_size</code> (adjusted for compression if needed) and make <code class="language-plaintext highlighter-rouge">target_file_size_base</code> proportionally smaller based on the ratio of key size to value size.</li> <li><code class="language-plaintext highlighter-rouge">max_bytes_for_level_base</code>: consider setting this to a multiple (e.g. 8x or 10x) of <code class="language-plaintext highlighter-rouge">target_file_size_base</code>.</li> </ul> <p>As mentioned above, the new BlobDB now also supports compaction filters. Key-value separation actually enables an optimization here: if the compaction filter of an application can make a decision about a key-value solely based on the key, it is unnecessary to read the value from the blob file. Applications can take advantage of this optimization by implementing the new <code class="language-plaintext highlighter-rouge">FilterBlobByKey</code> method of the <code class="language-plaintext highlighter-rouge">CompactionFilter</code> interface. This method gets called by RocksDB first whenever it encounters a key-value where the value is stored in a blob file. If this method returns a “final” decision like <code class="language-plaintext highlighter-rouge">kKeep</code>, <code class="language-plaintext highlighter-rouge">kRemove</code>, <code class="language-plaintext highlighter-rouge">kChangeValue</code>, or <code class="language-plaintext highlighter-rouge">kRemoveAndSkipUntil</code>, RocksDB will honor that decision; on the other hand, if the method returns <code class="language-plaintext highlighter-rouge">kUndetermined</code>, RocksDB will read the blob from the blob file and call <code class="language-plaintext highlighter-rouge">FilterV2</code> with the value in the usual fashion.</p>
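<p>A hypothetical sketch of such a filter follows; IsExpired stands for application-specific logic and is an assumption, not part of the RocksDB API.</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <rocksdb/compaction_filter.h>
#include <rocksdb/slice.h>

#include <string>

// Drops expired entries based on the key alone, so their (potentially
// large) blob values never need to be fetched during compaction.
class TtlByKeyFilter : public rocksdb::CompactionFilter {
 public:
  Decision FilterBlobByKey(int /*level*/, const rocksdb::Slice& key,
                           std::string* /*new_value*/,
                           std::string* /*skip_until*/) const override {
    if (IsExpired(key)) {
      return Decision::kRemove;  // final decision, blob never read
    }
    return Decision::kUndetermined;  // fall back to FilterV2 with the value
  }

  const char* Name() const override { return "TtlByKeyFilter"; }

 private:
  bool IsExpired(const rocksdb::Slice& key) const;  // application-specific
};
</code></pre></div></div>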
<h2 id="performance">Performance</h2> <p>We tested the performance of the new BlobDB for six different value sizes between 1 KB and 1 MB using a customized version of our <a href="https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks">standard benchmark suite</a> on a box with an 18-core Skylake DE CPU (running at 1.6 GHz, with hyperthreading enabled), 64 GB RAM, a 512 GB boot SSD, and two 1.88 TB M.2 SSDs in a RAID0 configuration for data. The RocksDB version used was equivalent to 6.18.1, with some benchmarking and statistics-related enhancements. Leveled and universal compaction without key-value separation were used as reference points.
Note that for simplicity, we use “leveled compaction” and “universal compaction” as shorthand for leveled and universal compaction without key-value separation, respectively, and “BlobDB” for BlobDB with leveled compaction.</p> <p>Our benchmarks cycled through six different workloads: two write-only ones (initial load and overwrite), two read/write ones (point lookup/write mix and range scan/write mix), and finally two read-only ones (point lookups and range scans). The first two phases performed a fixed amount of work (see below), while the final four were run for a fixed amount of time, namely 30 minutes each. Each phase other than the first one started with the database state left behind by the previous one. Here’s a brief description of the workloads:</p> <ul> <li><strong>Initial load</strong>: this workload has two distinct stages, a single-threaded random write stage during which compactions are disabled (so all data is flushed to L0, where it remains for the rest of the stage), followed by a full manual compaction. The random writes are performed with load-optimized settings, namely using the vector memtable implementation and with concurrent memtable writes and WAL disabled. This stage was used to populate the database with 1 TB worth of raw values, e.g. 2^30 (~1 billion) 1 KB values or 2^20 (~1 million) 1 MB values.</li> <li><strong>Overwrite</strong>: this is a multi-threaded random write workload using the usual skiplist memtable, with compactions, WAL, and concurrent memtable writes enabled. In our tests, 16 writer threads were used. The total number of writes was set to the same number as in the initial load stage and split up evenly between the writer threads. For instance, for the 1 MB value size, we had 2^20 writes divided up between the 16 threads, resulting in each thread performing 2^16 write operations. At the end of this phase, a “wait for compactions” step was added to prevent this workload from exhibiting artificially low write amp or conversely, the next phase showing inflated write amp.</li> <li><strong>Point lookup/write mix</strong>: a single writer thread performing random writes while N (in our case, 16) threads perform random point lookups. WAL is enabled and all writes are synced.</li> <li><strong>Range scan/write mix</strong>: similar to the above, with one writer thread and N reader threads (where N was again set to 16 in our tests). The reader threads perform random range scans, with 10 <code class="language-plaintext highlighter-rouge">Next</code> calls per <code class="language-plaintext highlighter-rouge">Seek</code>. Again, WAL is enabled, and sync writes are used.</li> <li><strong>Point lookups (read-only)</strong>: N=16 threads perform random point lookups.</li> <li><strong>Range scans (read-only)</strong>: N=16 threads execute random range scans, with 10 <code class="language-plaintext highlighter-rouge">Next</code>s per <code class="language-plaintext highlighter-rouge">Seek</code> like above.</li> </ul> <p>With that out of the way, let’s see how the new BlobDB performs against traditional leveled and universal compaction. In the next few sections, we’ll be looking at write amplification as well as read and write performance. We’ll also briefly compare the write performance of the new BlobDB with the legacy implementation.</p> <h3 id="write-amplification">Write amplification</h3> <p>Reducing write amp is the original motivation for key-value separation. 
Here, we follow RocksDB’s definition of write amplification (as used in compaction statistics and the info log). That is, we define write amp as the total amount of data written by flushes and compactions divided by the amount of data written by flushes, where “data written” includes SST files and blob files as well (if applicable). The following charts show that BlobDB significantly reduces write amplification for all of our (non-read only) workloads.</p> <p>For the initial load, where due to the nature of the workload both leveled and universal already have a low write amp factor of 1.6, BlobDB has a write amp close to the theoretical minimum of 1.0, namely in the 1.0..1.02 range, depending on value size. How is this possible? Well, the trick is that when key-value separation is used, the full compaction step only has to sort the keys but not the values. This results in a write amp that is about <strong>36% lower</strong> than the already low write amp you get with either leveled or universal.</p> <p>In the case of the overwrite workload, BlobDB had a write amp between 1.4 and 1.7 depending on value size. This is around <strong>75-78% lower</strong> than the write amp of leveled compaction (6.1 to 6.8) and <strong>70-77% lower</strong> than universal (5.7 to 6.2); for this workload, there wasn’t a huge difference between the performance of leveled and universal.</p> <p>When it comes to the point lookup/write mix workload, BlobDB had a write amp between 1.4 and 1.8. This is <strong>83-88% lower</strong> than the write amp of leveled compaction, which had values between 10.8 and 12.5. Universal fared much better than leveled under this workload, and had write amp in the 2.2..6.6 range; however, BlobDB still provided significant gains for all value sizes we tested: namely, write amp was <strong>18-77% lower</strong> than that of universal, depending on value size.</p> <p>As for the range scan/write mix workload, BlobDB again had a write amp between 1.4 and 1.8, while leveled had values between 13.6 and 14.9, and universal was between 2.8 and 5.0. In other words, BlobDB’s write amp was <strong>88-90% lower</strong> than that of leveled, and <strong>46-70% lower</strong> than that of universal.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/integrated-blob-db/BlobDB_Benchmarks_Write_Amp.png" alt="Write amplification" /></p> <h3 id="write-performance">Write performance</h3> <p>In terms of write performance, there are other factors to consider besides write amplification. The following charts show some interesting metrics for the two write-only workloads (initial load and overwrite). As discussed earlier, these two workloads perform a fixed amount of work; the two charts in the top row show how long it took BlobDB, leveled, and universal to complete that work. Note that each bar is broken down into two, corresponding to the two stages of each workload (random write and full compaction for initial load, and random write and waiting for compactions for overwrite).</p> <p>For initial load, note that the random write stage takes the same amount of time regardless of which algorithm is used. This is not surprising considering the fact that compactions are disabled during this stage and thus RocksDB is simply writing L0 files (and in BlobDB’s case, blob files) as fast as it can. 
The second stage, on the other hand, is very different: as mentioned above, BlobDB essentially only needs to read, sort, and rewrite the keys during compaction, which can be done much, much faster (with 1 MB values, more than a hundred times faster) than doing the same for large key-values. Due to this, initial load completed <strong>2.3x to 4.7x faster</strong> overall when using BlobDB.</p> <p>As for the overwrite workload, BlobDB performs much better during both stages. The two charts in the bottom row help explain why. In the case of both leveled and universal compaction, compactions can’t keep up with the write rate, which eventually leads to back pressure in the form of write stalls. As shown in the chart below, both leveled and universal stall between ~40% and ~70% of the time; on the other hand, BlobDB is stall-free except for the largest value size tested (1 MB). This naturally leads to higher throughput, namely <strong>2.1x to 3.5x higher</strong> throughput compared to leveled, and <strong>1.6x to 3.0x higher</strong> throughput compared to universal. The overwrite time chart also shows that the catch-up stage that waits for all compactions to finish is much shorter (and in fact, at larger value sizes, negligible) with BlobDB.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/integrated-blob-db/BlobDB_Benchmarks_Write_Perf.png" alt="Write performance" /></p> <h3 id="readwrite-and-read-only-performance">Read/write and read-only performance</h3> <p>The charts below show the read performance (in terms of operations per second) of BlobDB versus leveled and universal compaction under the two read/write workloads and the two read-only workloads. BlobDB meets or exceeds the read performance of leveled compaction, except for workloads involving range scans at the two smallest value sizes tested (1 KB and 4 KB). It also provides better (in some cases, much better) read performance than universal across the board. In particular, BlobDB provides up to <strong>1.4x</strong> higher read performance than leveled (for larger values), and up to <strong>5.6x higher</strong> than universal.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/integrated-blob-db/BlobDB_Benchmarks_RW_RO_Perf.png" alt="Read-write and read-only performance" /></p> <h3 id="comparing-the-two-blobdb-implementations">Comparing the two BlobDB implementations</h3> <p>To compare the write performance of the new BlobDB with the legacy implementation, we ran two versions of the first (single-threaded random write) stage of the initial load benchmark using 1 KB values: one with WAL disabled, and one with WAL enabled. The new implementation completed the load <strong>4.6x faster</strong> than the old one without WAL, and <strong>2.3x faster</strong> with WAL.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/integrated-blob-db/BlobDB_Benchmarks_Legacy_Vs_Integrated.png" alt="Comparing the two BlobDB implementations" /></p> <h2 id="future-work">Future work</h2> <p>There are a few remaining features that are not yet supported by the new BlobDB.
The most important one is <code class="language-plaintext highlighter-rouge">Merge</code> (and the related <code class="language-plaintext highlighter-rouge">GetMergeOperands</code> API); in addition, we don’t currently support the <code class="language-plaintext highlighter-rouge">EventListener</code> interface, the <code class="language-plaintext highlighter-rouge">GetLiveFilesMetaData</code> and <code class="language-plaintext highlighter-rouge">GetColumnFamilyMetaData</code> APIs, secondary instances, and ingestion of blob files. We will continue to work on closing this gap.</p> <p>We also have further plans when it comes to performance. These include optimizing garbage collection, introducing a dedicated cache for blobs, improving iterator and <code class="language-plaintext highlighter-rouge">MultiGet</code> performance, and evolving the blob file format, among others.</p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/9805119/picture/" alt="" title="" /> </div> <p class="post-authorName">Siying Dong</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2021/04/12/universal-improvements.html">(Call For Contribution) Make Universal Compaction More Incremental</a></h1> <p class="post-meta">Posted April 12, 2021</p> </header> <article class="post-content"> <h3 id="motivation">Motivation</h3> <p>Universal Compaction is an important compaction style, but few changes have been made since we made the structure multi-leveled, and the major restriction of always compacting whole sorted runs has not been relaxed. Compared to Leveled Compaction, where we usually compact only several SST files together, in universal compaction we frequently compact GBs of data. This gap causes two issues: 1. it makes it harder to unify universal and leveled compaction; 2. data is periodically fully compacted, and in the meantime space is doubled. To ease the problem, we can break the restriction and do something similar to leveled compaction, bringing universal closer to a unified compaction.</p> <p>We are calling for help with making the following improvements.</p> <h3 id="how-universal-compaction-works">How Universal Compaction Works</h3> <p>In universal compaction, whole levels are compacted together to satisfy one of two conditions (see the <a href="https://github.com/facebook/rocksdb/wiki/Universal-Compaction">wiki page</a> for more details):</p> <ol> <li>total size / bottommost level size > a threshold, or</li> <li>total number of sorted runs (non-0 levels + L0 files) is within a threshold</li> </ol> <p>Condition 1 limits the extra space overhead used for dead data, and condition 2 is for read performance.</p> <p>If condition 1 is triggered, a full compaction is likely. If condition 2 is triggered, RocksDB compacts some sorted runs to bring the number down. It does this using a simple heuristic meant to reduce the writes needed for that purpose over time: it starts by compacting smaller files, but if the total size to compact is similar to or larger than the size of the next level, it takes that level too, and so on (whether this is the best heuristic is another question, and we’ve never seriously looked at it).</p> <h3 id="how-we-can-improve">How We Can Improve?</h3> <p>Let’s start with condition 1. Here we do a full compaction, but it is not necessary.
A simple optimization would be to merge just enough files into the bottommost level (Lmax) to satisfy condition 1. This works if we only need to pick some files from Lmax-1; or, if it is cheaper over time, we can pick some files from other levels too.</p> <p>Then condition 2. After we handle condition 1, there might be holes in some ranges in older levels. These holes might make it possible to fix the LSM-tree for condition 2 by compacting only some subranges. RocksDB can take individual files into consideration and apply a more sophisticated heuristic.</p> <p>This new approach brings universal compaction closer to leveled compaction. The operation for condition 1 is close to how Leveled Compaction triggers Lmax-1 to Lmax compactions, and condition 2 can potentially be implemented as something similar to level picking in Leveled Compaction. In fact, all of these file-picking strategies can coexist in one single compaction style, and there is no fundamental conflict in that.</p> <h3 id="limitation">Limitation</h3> <p>There are a few limitations:</p> <ul> <li>Periodic automatic full compaction is unpleasant but at the same time is pleasant in another way. Some users might use it to reason that everything is periodically collapsed, so dead data is gone and old data is rewritten. We need to make sure periodic compaction continues to provide that.</li> <li>L0 to the first non-L0 level compaction is the first time data is partitioned in the LSM-tree so that incremental compaction by range is possible. We might need to do more of these compactions in order to make incremental compaction possible, which will increase compaction work slightly.</li> <li>Compacting a subset of a level would introduce some extra overhead for unaligned files, just as in leveled compaction. Better SST boundary cutting heuristics can reduce this overhead, but it will still be there.</li> </ul> <p>But I believe the benefits would outweigh the limitations. Reducing temporary space doubling and moving toward unified compaction would be important achievements.</p> <h3 id="interested-in-help">Interested in Help?</h3> <p>Compaction is the core of the LSM-tree, but improvements to it are long overdue. If you are a user of universal compaction and would be able to benefit from those improvements, we will be happy to work with you on speeding up the project and bringing them to RocksDB sooner. Feel free to communicate with us in <a href="https://github.com/facebook/rocksdb/issues/8181">this issue</a>.</p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/100003482360101/picture/" alt="" title="" /> </div> <p class="post-authorName">Maysam Yabandeh</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2019/08/15/unordered-write.html">Higher write throughput with `unordered_write` feature</a></h1> <p class="post-meta">Posted August 15, 2019</p> </header> <article class="post-content"> <p>Since RocksDB 6.3, the <code class="language-plaintext highlighter-rouge">unordered_write=</code>true option together with WritePrepared transactions offers 34-42% higher write throughput compared to vanilla RocksDB.
If the application can handle more relaxed ordering guarantees, the gain in throughput increases to 63-131%.</p> <h3 id="background">Background</h3> <p>Currently the RocksDB API delivers the following powerful guarantees:</p> <ul> <li>Atomic reads: Either all of a write batch is visible to reads or none of it.</li> <li>Read-your-own-writes: When a write thread returns to the user, a subsequent read by the same thread will be able to see its own writes.</li> <li>Immutable snapshots: The reads visible to the snapshot are immutable in the sense that they will not be affected by any in-flight or future writes.</li> </ul> <h3 id="unordered_write"><code class="language-plaintext highlighter-rouge">unordered_write</code></h3> <p>The <code class="language-plaintext highlighter-rouge">unordered_write</code> feature, when turned on, relaxes the default guarantees of RocksDB. While it still provides the read-your-own-writes property, neither atomic reads nor the immutable snapshot properties are provided any longer. However, RocksDB users can still get read-your-own-writes and immutable snapshots when using this feature in conjunction with TransactionDB configured with WritePrepared transactions and <code class="language-plaintext highlighter-rouge">two_write_queues</code>. You can read <a href="https://github.com/facebook/rocksdb/wiki/unordered_write">here</a> to learn about the design of <code class="language-plaintext highlighter-rouge">unordered_write</code> and <a href="https://github.com/facebook/rocksdb/wiki/WritePrepared-Transactions">here</a> to learn more about WritePrepared transactions.</p> <h3 id="how-to-use-it">How to use it?</h3> <p>To get the same guarantees as vanilla RocksDB:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 6 7 8 9 10 11 12 13 </pre></td><td class="rouge-code"><pre>Options db_options; db_options.unordered_write = true; db_options.two_write_queues = true; DB* db; { TransactionDBOptions txn_db_options; txn_db_options.write_policy = TxnDBWritePolicy::WRITE_PREPARED; txn_db_options.skip_concurrency_control = true; TransactionDB* txn_db; TransactionDB::Open(db_options, txn_db_options, kDBPath, &txn_db); db = txn_db; } db->Write(...); </pre></td></tr></tbody></table></code></pre></div></div> <p>To get relaxed guarantees:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 </pre></td><td class="rouge-code"><pre>Options db_options; db_options.unordered_write = true; DB* db; DB::Open(db_options, kDBPath, &db); db->Write(...); </pre></td></tr></tbody></table></code></pre></div></div> <h1 id="benchmarks">Benchmarks</h1> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 </pre></td><td class="rouge-code"><pre>TEST_TMPDIR=/dev/shm/ ~/db_bench --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --transaction_db=true --unordered_write=1 --disable_wal=0
</pre></td></tr></tbody></table></code></pre></div></div> <p>Throughput with <code class="language-plaintext highlighter-rouge">unordered_write</code>=true and using WritePrepared transaction:</p> <ul> <li>WAL: +42%</li> <li>No-WAL: +34% Throughput with <code class="language-plaintext highlighter-rouge">unordered_write</code>=true</li> <li>WAL: +63%</li> <li>NoWAL: +131%</li> </ul> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/100003482360101/picture/" alt="" title="" /> </div> <p class="post-authorName">Maysam Yabandeh</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2019/03/08/format-version-4.html">format_version 4</a></h1> <p class="post-meta">Posted March 08, 2019</p> </header> <article class="post-content"> <p>The data blocks in RocksDB consist of a sequence of key/values pairs sorted by key, where the pairs are grouped into <em>restart intervals</em> specified by <code class="language-plaintext highlighter-rouge">block_restart_interval</code>. Up to RocksDB version 5.14, where the latest and default value of <code class="language-plaintext highlighter-rouge">BlockBasedTableOptions::format_version</code> is 2, the format of index and data blocks are the same: index blocks use the same key format of <<code class="language-plaintext highlighter-rouge">user_key</code>,<code class="language-plaintext highlighter-rouge">seq</code>> and encode pointers to data blocks, <<code class="language-plaintext highlighter-rouge">offset</code>,<code class="language-plaintext highlighter-rouge">size</code>>, to a byte string and use them as values. The only difference is that the index blocks use <code class="language-plaintext highlighter-rouge">index_block_restart_interval</code> for the size of <em>restart intervals</em>. <code class="language-plaintext highlighter-rouge">format_version=</code>3,4 offer more optimized, backward-compatible, yet forward-incompatible format for index blocks.</p> <h3 id="pros">Pros</h3> <p>Using <code class="language-plaintext highlighter-rouge">format_version</code>=4 significantly reduces the index block size, in some cases around 4-5x. This frees more space in block cache, which would result in higher hit rate for data and filter blocks, or offer the same performance with a smaller block cache size.</p> <h3 id="cons">Cons</h3> <p>Being <em>forward-incompatible</em> means that if you enable <code class="language-plaintext highlighter-rouge">format_version=</code>4 you cannot downgrade to a RocksDB version lower than 5.16.</p> <h3 id="how-to-use-it">How to use it?</h3> <ul> <li><code class="language-plaintext highlighter-rouge">BlockBasedTableOptions::format_version</code> = 4</li> <li><code class="language-plaintext highlighter-rouge">BlockBasedTableOptions::index_block_restart_interval</code> = 16</li> </ul> <h3 id="what-is-format_version-3">What is format_version 3?</h3> <p>(Since RocksDB 5.15) In most cases, the sequence number <code class="language-plaintext highlighter-rouge">seq</code> is not necessary for keys in the index blocks. 
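<p>Putting the two options together, here is a minimal sketch of how they might be wired up; the path and surrounding setup are illustrative only and error handling is omitted.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <rocksdb/db.h>
#include <rocksdb/table.h>

// Minimal sketch of enabling format_version 4 (path illustrative).
int main() {
  rocksdb::BlockBasedTableOptions table_options;
  table_options.format_version = 4;
  table_options.index_block_restart_interval = 16;

  rocksdb::Options options;
  options.create_if_missing = true;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/fv4_db", &db);
  if (s.ok()) delete db;
  return 0;
}
</code></pre></div></div>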
<h3 id="what-is-format_version-3">What is format_version 3?</h3> <p>(Since RocksDB 5.15) In most cases, the sequence number <code class="language-plaintext highlighter-rouge">seq</code> is not necessary for keys in the index blocks. In such cases, <code class="language-plaintext highlighter-rouge">format_version</code>=3 skips encoding the sequence number and sets <code class="language-plaintext highlighter-rouge">index_key_is_user_key</code> in TableProperties, which is used by the reader to know how to decode the index block.</p> <h3 id="what-is-format_version-4">What is format_version 4?</h3> <p>(Since RocksDB 5.16) Changes the format of index blocks by delta encoding the index values, which are the block handles. This saves the encoding of <code class="language-plaintext highlighter-rouge">BlockHandle::offset</code> for the non-head index entries in each restart interval. If used, <code class="language-plaintext highlighter-rouge">TableProperties::index_value_is_delta_encoded</code> is set, which is used by the reader to know how to decode the index block. The format of each key is (shared_size, non_shared_size, shared, non_shared). The format of each value, i.e., block handle, is (offset, size) whenever the shared_size is 0, which includes the first entry at each restart point. Otherwise, only the size delta is stored: delta-size = size of this block handle - size of the previous block handle.</p> <p>The index format in <code class="language-plaintext highlighter-rouge">format_version=4</code> would be as follows:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 </pre></td><td class="rouge-code"><pre>restart_point 0: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz) restart_point 1: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz) ... restart_point n-1: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz) where, k is key, v is value, and its encoding is in parenthesis. </pre></td></tr></tbody></table></code></pre></div></div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/1850247869/picture/" alt="" title="" /> </div> <p class="post-authorName">Abhishek Madan</p> </div> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/568694102/picture/" alt="" title="" /> </div> <p class="post-authorName">Andrew Kryczka</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2018/11/21/delete-range.html">DeleteRange: A New Native RocksDB Operation</a></h1> <p class="post-meta">Posted November 21, 2018</p> </header> <article class="post-content"> <h2 id="motivation">Motivation</h2> <h3 id="deletion-patterns-in-lsm">Deletion patterns in LSM</h3> <p>Deleting a range of keys is a common pattern in RocksDB. Most systems built on top of RocksDB have multi-component key schemas, where keys sharing a common prefix are logically related. Here are some examples.</p> <p>MyRocks is a MySQL fork using RocksDB as its storage engine. Each key’s first four bytes identify the table or index to which that key belongs. Thus dropping a table or index involves deleting all the keys with that prefix.</p> <p>Rockssandra is a Cassandra variant that uses RocksDB as its storage engine.
One of its admin tool commands, <code class="language-plaintext highlighter-rouge">nodetool cleanup</code>, removes key-ranges that have been migrated to other nodes in the cluster.</p> <p>Marketplace uses RocksDB to store product data. Its keys begin with a product ID, and it stores various data associated with the product in separate keys. When a product is removed, all of these keys must be deleted.</p> <p>When we decide what to improve, we try to find a use case that’s common across users, since we want to build a generally useful system, not one that has many one-off features for individual users. The range deletion pattern is common, as illustrated above, so from this perspective it’s a good target for optimization.</p> <h3 id="existing-mechanisms-challenges-and-opportunities">Existing mechanisms: challenges and opportunities</h3> <p>The most common pattern we see is scan-and-delete, i.e., advance an iterator through the to-be-deleted range and issue a <code class="language-plaintext highlighter-rouge">Delete</code> for each key. This is slow (it involves read I/O), so it cannot be done on any critical path. Additionally, it creates many tombstones, which slows down iterators and offers no deadline for space reclamation.</p> <p>Another common pattern is using a custom compaction filter that drops keys in the deleted range(s). This deletes the range asynchronously, so it cannot be used in cases where readers must not see keys in deleted ranges. Further, it has the disadvantage of outputting tombstones to all but the bottom level. That’s because compaction cannot detect whether dropping a key would cause an older version at a lower level to reappear.</p> <p>If space reclamation time is important, or it is important that the deleted range not affect iterators, the user can trigger <code class="language-plaintext highlighter-rouge">CompactRange</code> on the deleted range. This can involve arbitrarily long waits in the compaction queue and increases write-amp. By the time it’s finished, however, the range is completely gone from the LSM.</p> <p><code class="language-plaintext highlighter-rouge">DeleteFilesInRange</code> can be used prior to compacting the deleted range as long as snapshot readers do not need to access the deleted keys. It drops files that are completely contained in the deleted range. That saves write-amp because, in <code class="language-plaintext highlighter-rouge">CompactRange</code>, the file data would have to be rewritten several times before it reaches the bottom of the LSM, where tombstones can finally be dropped.</p> <p>In addition to having various drawbacks, the above approaches are quite complicated to reason about and implement. In an ideal world, deleting a range of keys would be (1) simple, i.e., a single API call; (2) synchronous, i.e., when the call finishes, the keys are guaranteed to be wiped from the DB; (3) low latency, so it can be used in critical paths; and (4) a first-class operation with all the guarantees of any other write, like atomicity, crash-recovery, etc.</p>
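<p>That single API call is what <code class="language-plaintext highlighter-rouge">DeleteRange</code> provides. As a minimal usage sketch (the keys are illustrative and <code class="language-plaintext highlighter-rouge">db</code> is assumed to be an already-open handle):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <cassert>
#include <rocksdb/db.h>

// Sketch: atomically delete every key in ["start", "end") with one call.
// `db` is assumed to be an open rocksdb::DB*.
void DropRange(rocksdb::DB* db) {
  rocksdb::WriteOptions write_options;
  rocksdb::Status s = db->DeleteRange(write_options,
                                      db->DefaultColumnFamily(),
                                      "start", "end");
  assert(s.ok());
}
</code></pre></div></div>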
<h2 id="v1-getting-it-to-work">v1: Getting it to work</h2> <h3 id="where-to-persist-them">Where to persist them?</h3> <p>The first place we thought about storing range tombstones is inline with the data blocks. We could not think of a good way to do it, however, since the start of a range tombstone covering a key could be anywhere, making binary search impossible. So, we decided to investigate segregated storage.</p> <p>A second solution we considered is appending to the manifest. This file is append-only, periodically compacted, and stores metadata like the level to which each SST belongs. This is tempting because it leverages an existing file which is maintained in the background and fully read when the DB is opened. However, it conceptually violates the manifest’s purpose, which is to store metadata. It also has no way to detect when a range tombstone no longer covers anything and is droppable. Further, it’d be possible for keys above a range tombstone to disappear when they have their seqnums zeroed upon compaction to the bottommost level.</p> <p>A third candidate is using a separate column family. This has similar problems to the manifest approach. That is, we cannot easily detect when a range tombstone is obsolete, and seqnum zeroing can cause a key to go from above a range tombstone to below it, i.e., to disappear. The upside is that we can reuse logic for memory buffering, consistent reads/writes, etc.</p> <p>The problems with the second and third solutions indicate a need for range tombstones to be aware of flush/compaction. An easy way to achieve this is to put them in the SST files themselves - but not in the data blocks, as explained for the first solution. So, we introduced a separate meta-block for range tombstones. This resolved the question of when range tombstones become obsolete, as the answer is simple: when they’re compacted to the bottom level. We also reused the LSM invariant that newer versions of a key are always at a higher level to prevent the seqnum-zeroing problem. This approach has the side benefit of constraining the range tombstones seen during reads to ones in a similar key-range.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/delrange/delrange_sst_blocks.png" alt="" /></p> <p style="text-align: center"><em>When there are range tombstones in an SST, they are segregated in a separate meta-block</em></p> <p style="display: block; margin-left: auto; margin-right: auto; width: 80%"><img src="/static/images/delrange/delrange_key_schema.png" alt="" /></p> <p style="text-align: center"><em>Logical range tombstones (left) and their corresponding physical key-value representation (right)</em></p> <h3 id="write-path">Write path</h3> <p><code class="language-plaintext highlighter-rouge">WriteBatch</code> stores range tombstones in its buffer, which are logged to the WAL and then applied to a dedicated range tombstone memtable during <code class="language-plaintext highlighter-rouge">Write</code>. Later, in the background, the range tombstone memtable and its corresponding data memtable are flushed together into a single SST with a range tombstone meta-block. SSTs periodically undergo compaction, which rewrites SSTs with point data and range tombstones dropped or merged wherever possible.</p> <p>We chose to use a dedicated memtable for range tombstones. The memtable representation is always skiplist in order to minimize overhead in the usual case, which is a memtable containing zero or a small number of range tombstones. The range tombstones are segregated into a separate memtable for the same reason we segregated range tombstones in SSTs.
That is, we did not know how to interleave range tombstones with point data in a way that would let us find the tombstone covering an arbitrary key.</p> <p style="display: block; margin-left: auto; margin-right: auto; width: 70%"><img src="/static/images/delrange/delrange_write_path.png" alt="" /></p> <p style="text-align: center"><em>Lifetime of point keys and range tombstones in RocksDB</em></p> <p>During flush and compaction, we chose to write out all non-obsolete range tombstones unsorted. Sorting by a single dimension is easy to implement, but doesn’t bring asymptotic improvement to queries over range data. Ideally, we want to store skylines (see “Read Path” subsection below) computed over our ranges so we can binary search. However, a couple of concerns make doing this during flush and compaction unsatisfactory: (1) we need to store multiple skylines, one for each snapshot, which further complicates the range tombstone meta-block encoding; and (2) even if we implement this, the range tombstone memtable still needs to be linearly scanned. Given these concerns, we decided to defer collapsing work to the read side, hoping a good caching strategy could optimize this at some future point.</p> <h3 id="read-path">Read path</h3> <p>In point lookups, we aggregate range tombstones in an unordered vector as we search through the live memtable, immutable memtables, and then SSTs. When a key is found that matches the lookup key, we do a scan through the vector, checking whether the key is deleted.</p> <p>In iterators, we aggregate range tombstones into a skyline as we visit the live memtable, immutable memtables, and SSTs. The skyline is expensive to construct but fast to query for whether a key is covered. The skyline keeps track of the most recent range tombstone found to optimize <code class="language-plaintext highlighter-rouge">Next</code> and <code class="language-plaintext highlighter-rouge">Prev</code>.</p> <table> <tbody> <tr> <td><img src="/static/images/delrange/delrange_uncollapsed.png" alt="" /></td> <td><img src="/static/images/delrange/delrange_collapsed.png" alt="" /></td> </tr> </tbody> </table> <p style="text-align: center"><em>(<a href="https://leetcode.com/problems/the-skyline-problem/description/">Image source: Leetcode</a>) The skyline problem involves taking building location/height data in the unsearchable form of A and converting it to the form of B, which is binary-searchable. With overlapping range tombstones, to achieve efficient searching we need to solve an analogous problem, where the x-axis is the key-space and the y-axis is the sequence number.</em></p> <h3 id="performance-characteristics">Performance characteristics</h3> <p>For the v1 implementation, writes are much faster compared to the scan-and-delete (optionally within a transaction) pattern. <code class="language-plaintext highlighter-rouge">DeleteRange</code> only logs to WAL and applies to memtable. Logging to WAL always <code class="language-plaintext highlighter-rouge">fflush</code>es, and optionally <code class="language-plaintext highlighter-rouge">fsync</code>s or <code class="language-plaintext highlighter-rouge">fdatasync</code>s. Applying to memtable is always an in-memory operation.
Since range tombstones have a dedicated skiplist memtable, the complexity of inserting is O(log(T)), where T is the number of existing buffered range tombstones.</p> <p>Reading in the presence of v1 range tombstones, however, is much slower than reads in a database where scan-and-delete has happened, due to the linear scan over range tombstone memtables/meta-blocks.</p> <p>Iterating in a database with v1 range tombstones is usually slower than in a scan-and-delete database, although the gap lessens as iterations grow longer. When an iterator is first created and seeked, we construct a skyline over its tombstones. This operation is O(T*log(T)) where T is the number of tombstones found across the live memtable, immutable memtables, L0 files, and one file from each of the L1+ levels. However, moving the iterator forwards or backwards is simply a constant-time operation (excluding edge cases, e.g., many range tombstones between consecutive point keys).</p> <h2 id="v2-making-it-fast">v2: Making it fast</h2> <p><code class="language-plaintext highlighter-rouge">DeleteRange</code>’s negative impact on read perf is a barrier to its adoption. The root cause is that range tombstones are not stored or cached in a format that can be searched efficiently. We needed to design DeleteRange so that we could maintain write performance while making read performance competitive with workarounds used in production (e.g., scan-and-delete).</p> <h3 id="representations">Representations</h3> <p>The key idea of the redesign is that, instead of globally collapsing range tombstones, we can locally “fragment” them for each SST file and memtable to guarantee that:</p> <ul> <li>no range tombstones overlap; and</li> <li>range tombstones are ordered by start key.</li> </ul> <p>Combined, these properties make range tombstones binary searchable. The fragmentation happens on the read path, but unlike the previous design, we can easily cache many of these range tombstone fragments.</p> <h3 id="write-path-1">Write path</h3> <p>The write path remains unchanged.</p> <h3 id="read-path-1">Read path</h3> <p>When an SST file is opened, its range tombstones are fragmented and cached. For point lookups, we binary search each file’s fragmented range tombstones for one that covers the lookup key. Unlike the old design, once we find a tombstone, we no longer need to search for the key in lower levels, since we know that any keys on those levels will be covered (though we do still check the current level since there may be keys written after the range tombstone).</p>
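<p>To illustrate why the two fragmentation properties matter, here is a simplified C++ sketch of that point-lookup check. The types are assumptions for illustration, not RocksDB’s internal representation (which, among other things, can track multiple sequence numbers per fragment):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative type: fragments are non-overlapping and sorted by start_key.
struct FragmentedTombstone {
  std::string start_key;  // inclusive
  std::string end_key;    // exclusive
  uint64_t seqnum;        // sequence number of the tombstone
};

// Returns true if `key`, read at snapshot `snapshot_seq`, is covered by a
// tombstone in `fragments`. Because fragments do not overlap and are
// sorted, one binary search suffices. (Real code also compares against the
// key's own sequence number.)
bool IsCovered(const std::vector<FragmentedTombstone>& fragments,
               const std::string& key, uint64_t snapshot_seq) {
  auto it = std::upper_bound(
      fragments.begin(), fragments.end(), key,
      [](const std::string& k, const FragmentedTombstone& t) {
        return k < t.start_key;
      });
  if (it == fragments.begin()) return false;
  --it;  // last fragment whose start_key <= key
  return key < it->end_key && it->seqnum <= snapshot_seq;
}
</code></pre></div></div>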
<p>For range scans, we create iterators over all the fragmented range tombstones and store them in a list, seeking each one to cover the start key of the range scan (if possible), and query each encountered key in this structure as in the old design, advancing range tombstone iterators as necessary. In effect, we implicitly create a skyline. This requires significantly less work on iterator creation, but since each memtable/SST has its own range tombstone iterator, querying range tombstones requires key comparisons (and possibly iterator increments) for several iterators (as opposed to v1, where we had a global collapsed representation of all range tombstones). As a result, very long range scans may become slower than before, but short range scans are an order of magnitude faster, and they are the more common class of range scan.</p> <h2 id="benchmarks">Benchmarks</h2> <p>To understand the performance of this new design, we used <code class="language-plaintext highlighter-rouge">db_bench</code> to compare point lookup, short range scan, and long range scan performance across:</p> <ul> <li>the v1 DeleteRange design,</li> <li>the scan-and-delete workaround, and</li> <li>the v2 DeleteRange design.</li> </ul> <p>In these benchmarks, we used a database with 5 million data keys, and 10000 range tombstones (ignoring those dropped during compaction) that were written at regular intervals after 4.5 million data keys were written. Writing the range tombstones this way ensures that most of them are not compacted away, and that we have more tombstones in higher levels that cover keys in lower levels, which allows the benchmarks to exercise more interesting behavior when reading deleted keys.</p> <p>Point lookup benchmarks read 100000 keys from a database using <code class="language-plaintext highlighter-rouge">readwhilewriting</code>. Range scan benchmarks used <code class="language-plaintext highlighter-rouge">seekrandomwhilewriting</code> and seeked 100000 times, advancing up to 10 keys from the seek position for short range scans and up to 1000 keys for long range scans.</p> <p>The results are summarized in the tables below, averaged over 10 runs (note that the different SHAs for v1 benchmarks are due to a new <code class="language-plaintext highlighter-rouge">db_bench</code> flag that was added in order to compare performance with databases with no tombstones; for brevity, those results are not reported here). Also note that the block cache was large enough to hold the entire DB, so the high throughput is due to limited I/Os and little time spent on decompression. The range tombstone blocks are always pinned uncompressed in memory.
We believe these setup details should not affect relative performance between versions.</p> <h3 id="point-lookups">Point Lookups</h3> <table> <tbody> <tr> <td>Name</td> <td>SHA</td> <td>avg micros/op</td> <td>avg ops/sec</td> </tr> <tr> <td>v1</td> <td>35cd754a6</td> <td>1.3179</td> <td>759,830.90</td> </tr> <tr> <td>scan-del</td> <td>7528130e3</td> <td>0.6036</td> <td>1,667,237.70</td> </tr> <tr> <td>v2</td> <td>7528130e3</td> <td>0.6128</td> <td>1,634,633.40</td> </tr> </tbody> </table> <h3 id="short-range-scans">Short Range Scans</h3> <table> <tbody> <tr> <td>Name</td> <td>SHA</td> <td>avg micros/op</td> <td>avg ops/sec</td> </tr> <tr> <td>v1</td> <td>0ed738fdd</td> <td>6.23</td> <td>176,562.00</td> </tr> <tr> <td>scan-del</td> <td>PR 4677</td> <td>2.6844</td> <td>377,313.00</td> </tr> <tr> <td>v2</td> <td>PR 4677</td> <td>2.8226</td> <td>361,249.70</td> </tr> </tbody> </table> <h3 id="long-range-scans">Long Range Scans</h3> <table> <tbody> <tr> <td>Name</td> <td>SHA</td> <td>avg micros/op</td> <td>avg ops/sec</td> </tr> <tr> <td>v1</td> <td>0ed738fdd</td> <td>52.7066</td> <td>19,074.00</td> </tr> <tr> <td>scan-del</td> <td>PR 4677</td> <td>38.0325</td> <td>26,648.60</td> </tr> <tr> <td>v2</td> <td>PR 4677</td> <td>41.2882</td> <td>24,714.70</td> </tr> </tbody> </table> <h2 id="future-work">Future Work</h2> <p>Note that memtable range tombstones are fragmented on every read; for now this is acceptable, since we expect there to be relatively few range tombstones in memtables (and users can enforce this by keeping track of the number of memtable range deletions and manually flushing after it passes a threshold). In the future, a specialized data structure can be used for storing range tombstones in memory to avoid this work.</p> <p>Another future optimization is to create a new format version that requires range tombstones to be stored in a fragmented form. This would save time when opening SST files, and when <code class="language-plaintext highlighter-rouge">max_open_files</code> is not -1 (i.e., files may be opened several times).</p> <h2 id="acknowledgements">Acknowledgements</h2> <p>Special thanks to Peter Mattis and Nikhil Benesch from Cockroach Labs, who were early users of DeleteRange v1 in production, contributed the cleanest/most efficient v1 aggregation implementation, found and fixed bugs, and provided the initial DeleteRange v2 design and continued help.</p> <p>Thanks to Huachao Huang and Jinpeng Zhang from PingCAP for early DeleteRange v1 adoption, bug reports, and fixes.</p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/100002297362180/picture/" alt="" title="" /> </div> <p class="post-authorName">Fenggang Wu</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2018/08/23/data-block-hash-index.html">Improving Point-Lookup Using Data Block Hash Index</a></h1> <p class="post-meta">Posted August 23, 2018</p> </header> <article class="post-content"> <p>We’ve designed and implemented a <em>data block hash index</em> in RocksDB that has the benefit of both reducing CPU utilization and increasing throughput for point lookup queries, with a reasonable and tunable space overhead.</p> <p>Specifically, we append a compact hash table to the end of the data block for efficient indexing.
It is backward compatible with databases created without this feature. After the hash index feature is turned on, existing data will be gradually converted to the hash index format.</p> <p>Benchmarks with <code class="language-plaintext highlighter-rouge">db_bench</code> show the CPU utilization of one of the main functions in the point lookup code path, <code class="language-plaintext highlighter-rouge">DataBlockIter::Seek()</code>, is reduced by 21.8%, and the overall RocksDB throughput is increased by 10% under purely cached workloads, at an overhead of 4.6% more space. Shadow testing with Facebook production traffic shows good CPU improvements too.</p> <h3 id="how-to-use-it">How to use it</h3> <p>Two new options are added as part of this feature: <code class="language-plaintext highlighter-rouge">BlockBasedTableOptions::data_block_index_type</code> and <code class="language-plaintext highlighter-rouge">BlockBasedTableOptions::data_block_hash_table_util_ratio</code>.</p> <p>The hash index is disabled by default unless <code class="language-plaintext highlighter-rouge">BlockBasedTableOptions::data_block_index_type</code> is set to <code class="language-plaintext highlighter-rouge">data_block_index_type = kDataBlockBinaryAndHash</code>. The hash table utilization ratio is adjustable using <code class="language-plaintext highlighter-rouge">BlockBasedTableOptions::data_block_hash_table_util_ratio</code>, which is valid only if <code class="language-plaintext highlighter-rouge">data_block_index_type = kDataBlockBinaryAndHash</code>.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 </pre></td><td class="rouge-code"><pre>// the definitions can be found in include/rocksdb/table.h // The index type that will be used for the data block. enum DataBlockIndexType : char { kDataBlockBinarySearch = 0, // traditional block type kDataBlockBinaryAndHash = 1, // additional hash index }; // Set to kDataBlockBinaryAndHash to enable hash index DataBlockIndexType data_block_index_type = kDataBlockBinarySearch; // #entries/#buckets. It is valid only when data_block_hash_index_type is // kDataBlockBinaryAndHash. double data_block_hash_table_util_ratio = 0.75; </pre></td></tr></tbody></table></code></pre></div></div> <h3 id="data-block-hash-index-design">Data Block Hash Index Design</h3> <p>The current data block format groups adjacent keys together into restart intervals. One block consists of multiple restart intervals. The byte offset of the beginning of each restart interval, i.e. a restart point, is stored in an array called the restart interval index or binary seek index. RocksDB does a binary search when performing point lookups for keys in data blocks to find the restart interval in which the key may reside. We will use binary seek and binary search interchangeably in this post.</p> <p>Finding the location where the key may reside via binary search requires parsing and comparing multiple keys. Each binary search branch triggers a CPU cache miss, driving up CPU utilization. We have seen that this binary search takes up considerable CPU in production use-cases.</p> <p><img src="/static/images/data-block-hash-index/block-format-binary-seek.png" alt="" /></p> <p>We implemented a hash map at the end of the block to index the keys and reduce the CPU overhead of the binary search. The hash index is just an array of pointers pointing into the binary seek index.</p>
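<p>The lookup flow, sketched below in simplified C++, is: hash the key, then either jump straight to a restart interval or fall back to the binary seek, as described next. The marker values and helper are illustrative assumptions; the post only specifies that two bucket values (254 and 255) are reserved as special flags.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Illustrative marker values (which flag means what is an assumption).
constexpr uint8_t kNoEntry = 255;
constexpr uint8_t kCollision = 254;

// Returns the restart interval to search, -1 to fall back to the
// traditional binary seek (collision), or -2 if the key is absent
// from this block. `buckets` is the hash index appended to the block.
int HashIndexLookup(const std::vector<uint8_t>& buckets,
                    const std::string& user_key) {
  size_t bucket = std::hash<std::string>{}(user_key) % buckets.size();
  if (buckets[bucket] == kNoEntry) return -2;    // key not in this block
  if (buckets[bucket] == kCollision) return -1;  // fall back to binary seek
  return buckets[bucket];  // restart interval that may hold the key
}
</code></pre></div></div>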
<p><img src="/static/images/data-block-hash-index/block-format-hash-index.png" alt="" /></p> <p>Each array element acts as a hash bucket storing the location of a key (or, more precisely, the restart index of the restart interval where the key resides). When multiple keys happen to hash into the same bucket (a hash collision), we simply mark the bucket as “collision”, so that a later query on such a key knows a collision happened and falls back to the traditional binary search to find the location of the key.</p> <p>We define the hash table utilization ratio as #keys/#buckets. If the utilization ratio is 0.5 and there are 100 buckets, 50 keys are stored in the buckets. The lower the util ratio, the fewer hash collisions, and the lower the chance that a point lookup falls back to binary seek (the fall-back ratio) due to a collision. So a small util ratio reduces CPU time further but introduces more space overhead.</p> <p>Space overhead depends on the util ratio. Each bucket is a <code class="language-plaintext highlighter-rouge">uint8_t</code> (i.e. one byte). For a util ratio of 1, the space overhead is one byte per key, and the observed fall-back ratio is ~52%.</p> <p><img src="/static/images/data-block-hash-index/hash-index-data-structure.png" alt="" /></p> <h3 id="things-that-need-attention">Things that Need Attention</h3> <p><strong>Customized Comparator</strong></p> <p>The hash index hashes different keys (keys with different content, i.e. byte sequences) to different hash values. This assumes the comparator does not treat keys with different content as equal.</p> <p>The default bytewise comparator orders keys alphabetically and works well with the hash index, as different keys are never regarded as equal. However, some specially crafted comparators do treat them as equal. For example, say a <code class="language-plaintext highlighter-rouge">StringToIntComparator</code> converts a string into an integer and uses the integer to perform the comparison. The key strings “16” and “0x10” are equal to each other as seen by this <code class="language-plaintext highlighter-rouge">StringToIntComparator</code>, but they probably hash to different values. A later query for one form of the key would not find the existing key stored in the other form.</p> <p>We add a new function member to the comparator interface:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 </pre></td><td class="rouge-code"><pre>virtual bool CanKeysWithDifferentByteContentsBeEqual() const { return true; } </pre></td></tr></tbody></table></code></pre></div></div>
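<p>For illustration, here is a sketch of a bytewise comparator that opts in to the hash index by overriding this function (the class name is illustrative, not an API name):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <string>
#include <rocksdb/comparator.h>
#include <rocksdb/slice.h>

// Sketch: a bytewise comparator that opts in to the data block hash
// index. Since it compares raw bytes, keys with different contents can
// never compare as equal.
class HashIndexFriendlyComparator : public rocksdb::Comparator {
 public:
  const char* Name() const override { return "HashIndexFriendly"; }
  int Compare(const rocksdb::Slice& a,
              const rocksdb::Slice& b) const override {
    return a.compare(b);
  }
  void FindShortestSeparator(std::string*,
                             const rocksdb::Slice&) const override {}
  void FindShortSuccessor(std::string*) const override {}
  // Opting in: byte-wise different keys are never regarded as equal.
  bool CanKeysWithDifferentByteContentsBeEqual() const override {
    return false;
  }
};
</code></pre></div></div>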
<p>Every comparator implementation should override this function and specify the behavior of the comparator. If a comparator can regard different keys as equal, the function returns true, and as a result the hash index feature will not be enabled, and vice versa.</p> <p>NOTE: to use the hash index feature, one should 1) have a comparator that can never treat different keys as equal; and 2) override the <code class="language-plaintext highlighter-rouge">CanKeysWithDifferentByteContentsBeEqual()</code> function to return <code class="language-plaintext highlighter-rouge">false</code>, so the hash index can be enabled.</p> <p><strong>Util Ratio’s Impact on Data Block Cache</strong></p> <p>Adding the hash index to the end of the data block essentially takes up data block cache space, making the effective data block cache size smaller and increasing the data block cache miss ratio. Therefore, a very small util ratio will result in a large data block cache miss ratio, and the extra I/O may drag down the throughput gain achieved by the hash index lookup. Besides, when compression is enabled, a cache miss also incurs data block decompression, which is CPU-consuming. Therefore CPU usage may even increase if too small a util ratio is used. The best util ratio depends on the workload, cache-to-data ratio, disk bandwidth/latency, etc. In our experiments, we found a util ratio of 0.5 ~ 1 to be a good range to explore, bringing both CPU and throughput gains.</p> <h3 id="limitations">Limitations</h3> <p>As we use <code class="language-plaintext highlighter-rouge">uint8_t</code> to store the binary seek index, i.e. the restart interval index, the total number of restart intervals cannot be more than 253 (we reserved 255 and 254 as special flags). For blocks having a larger number of restart intervals, the hash index will not be created and the point lookup will be done by traditional binary seek.</p> <p>Data block hash index only supports point lookup. We do not support range lookup. Range lookup requests fall back to the binary seek.</p> <p>RocksDB supports many types of records, such as <code class="language-plaintext highlighter-rouge">Put</code>, <code class="language-plaintext highlighter-rouge">Delete</code>, <code class="language-plaintext highlighter-rouge">Merge</code>, etc (visit <a href="https://github.com/facebook/rocksdb/wiki/rocksdb-basics">here</a> for more information). Currently we only support <code class="language-plaintext highlighter-rouge">Put</code> and <code class="language-plaintext highlighter-rouge">Delete</code>, but not <code class="language-plaintext highlighter-rouge">Merge</code>. Internally we have a limited set of supported record types:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 </pre></td><td class="rouge-code"><pre>kPutRecord, <=== supported kDeleteRecord, <=== supported kSingleDeleteRecord, <=== supported kTypeBlobIndex, <=== supported </pre></td></tr></tbody></table></code></pre></div></div> <p>For records not supported, the searching process will fall back to the traditional binary seek.</p> <h3 id="evaluation">Evaluation</h3> <p>To evaluate the CPU util reduction and isolate other factors such as disk I/O and block decompression, we first evaluate the hash index in a purely cached workload.
We observe that the CPU utilization of one of the main functions in the point lookup code path, DataBlockIter::Seek(), is reduced by 21.8% and the overall throughput is increased by 10% at an overhead of 4.6% more space.</p> <p>However, a general workload is not always purely cached. So we also evaluate the performance under different cache space pressures. In the following test, we use <code class="language-plaintext highlighter-rouge">db_bench</code> with RocksDB deployed on SSDs. The total DB size is 5~6GB, and it is about 14GB if decompressed. Different block cache sizes are used, ranging from 14GB down to 2GB, with an increasing cache miss ratio.</p> <p>Orange bars represent our hash index performance. We use a hash util ratio of 1.0 in this test. The block size is set to 16KiB with a restart interval of 16.</p> <p><img src="/static/images/data-block-hash-index/perf-throughput.png" alt="" /> <img src="/static/images/data-block-hash-index/perf-cache-miss.png" alt="" /></p> <p>We can see that if the cache size is greater than 8GB, the hash index brings a throughput gain. A cache size greater than 8GB translates to a cache miss ratio smaller than 40%. So if the workload has a cache miss ratio smaller than 40%, the hash index is able to increase throughput.</p> <p>Besides, shadow testing with Facebook production traffic shows good CPU improvements too.</p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2018/08/01/rocksdb-tuning-advisor.html">Rocksdb Tuning Advisor</a></h1> <p class="post-meta">Posted August 01, 2018</p> </header> <article class="post-content"> <p>The performance of Rocksdb is contingent on its tuning. However, because of the complexity of its underlying technology and a large number of configurable parameters, a good configuration is sometimes hard to obtain. The aim of the python command-line tool, Rocksdb Advisor, is to automate the process of suggesting improvements in the configuration based on advice from Rocksdb experts.</p> <h3 id="overview">Overview</h3> <p>Experts share their wisdom as rules comprising conditions and suggestions in the INI format (see <a href="https://github.com/facebook/rocksdb/blob/main/tools/advisor/advisor/rules.ini">rules.ini</a>). Users provide the Rocksdb configuration that they want to improve upon (as the familiar Rocksdb OPTIONS file — <a href="https://github.com/facebook/rocksdb/blob/main/examples/rocksdb_option_file_example.ini">example</a>) and the path of the file which contains Rocksdb logs and statistics. The <a href="https://github.com/facebook/rocksdb/blob/main/tools/advisor/advisor/rule_parser_example.py">Advisor</a> creates appropriate DataSource objects (for Rocksdb <a href="https://github.com/facebook/rocksdb/blob/main/tools/advisor/advisor/db_log_parser.py">logs</a>, <a href="https://github.com/facebook/rocksdb/blob/main/tools/advisor/advisor/db_options_parser.py">options</a>, <a href="https://github.com/facebook/rocksdb/blob/main/tools/advisor/advisor/db_stats_fetcher.py">statistics</a> etc.) and provides them to the <a href="https://github.com/facebook/rocksdb/blob/main/tools/advisor/advisor/rule_parser.py">Rules Engine</a>. The Rules Engine uses the experts’ rules to parse the data sources and trigger the appropriate rules.
The Advisor’s output gives information about which rules were triggered, why they were triggered, and what each of them suggests. Each suggestion provided by a triggered rule advises some action on a Rocksdb configuration option, for example: increase CFOptions.write_buffer_size, set bloom_bits to 2, etc.</p> <h3 id="usage">Usage</h3> <p>An example command to run the tool:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 </pre></td><td class="rouge-code"><pre><span class="nb">cd </span>rocksdb/tools/advisor python3 <span class="nt">-m</span> advisor.rule_parser_example <span class="nt">--rules_spec</span><span class="o">=</span>advisor/rules.ini <span class="nt">--rocksdb_options</span><span class="o">=</span><span class="nb">test</span>/input_files/OPTIONS-000005 <span class="nt">--log_files_path_prefix</span><span class="o">=</span><span class="nb">test</span>/input_files/LOG-0 <span class="nt">--stats_dump_period_sec</span><span class="o">=</span>20 </pre></td></tr></tbody></table></code></pre></div></div> <p>Sample output where a Rocksdb log-based rule has been triggered:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 6 </pre></td><td class="rouge-code"><pre>Rule: stall-too-many-memtables LogCondition: stall-too-many-memtables regex: Stopping writes because we have <span class="se">\d</span>+ immutable memtables <span class="se">\(</span>waiting <span class="k">for </span>flush<span class="se">\)</span>, max_write_buffer_number is <span class="nb">set </span>to <span class="se">\d</span>+ Suggestion: inc-bg-flush option : DBOptions.max_background_flushes action : increase suggested_values : <span class="o">[</span><span class="s1">'2'</span><span class="o">]</span> Suggestion: inc-write-buffer option : CFOptions.max_write_buffer_number action : increase scope: col_fam: <span class="o">{</span><span class="s1">'default'</span><span class="o">}</span> </pre></td></tr></tbody></table></code></pre></div></div> <h3 id="read-more">Read more</h3> <p>For more information, refer to <a href="https://github.com/facebook/rocksdb/tree/main/tools/advisor/README.md">advisor</a>.</p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2018/02/05/rocksdb-5-10-2-released.html">RocksDB 5.10.2 Released!</a></h1> <p class="post-meta">Posted February 05, 2018</p> </header> <article class="post-content"> <h3 id="public-api-change">Public API Change</h3> <ul> <li>When running <code class="language-plaintext highlighter-rouge">make</code> with the environment variable <code class="language-plaintext highlighter-rouge">USE_SSE</code> set and <code class="language-plaintext highlighter-rouge">PORTABLE</code> unset, the build will use all machine features available locally. Previously this combination only compiled in SSE-related features.</li> </ul> <h3 id="new-features">New Features</h3> <ul> <li>CRC32C is now using the 3-way pipelined SSE algorithm <code class="language-plaintext highlighter-rouge">crc32c_3way</code> on supported platforms to improve performance.
The system will choose to use this algorithm on supported platforms automatically whenever possible. If PCLMULQDQ is not supported, it will fall back to the old Fast_CRC32 algorithm.</li> <li>Provide lifetime hints when writing files on Linux. This reduces hardware write-amp on storage devices supporting multiple streams.</li> <li>Add a DB stat, <code class="language-plaintext highlighter-rouge">NUMBER_ITER_SKIP</code>, which returns how many internal keys were skipped during iterations (e.g., due to being tombstones or duplicate versions of a key).</li> <li>Add PerfContext counters, <code class="language-plaintext highlighter-rouge">key_lock_wait_count</code> and <code class="language-plaintext highlighter-rouge">key_lock_wait_time</code>, which measure the number of times transactions wait on key locks and the total amount of time waiting.</li> </ul> <h3 id="bug-fixes">Bug Fixes</h3> <ul> <li>Fix IOError on WAL write not propagating to write group followers.</li> <li>Make iterator invalid on merge error.</li> <li>Fix performance issue in <code class="language-plaintext highlighter-rouge">IngestExternalFile()</code> affecting databases with a large number of SST files.</li> <li>Fix possible corruption to LSM structure when <code class="language-plaintext highlighter-rouge">DeleteFilesInRange()</code> deletes a subset of files spanned by a <code class="language-plaintext highlighter-rouge">DeleteRange()</code> marker.</li> <li>Fix DB::Flush() continuing to wait after the flush finishes under certain conditions.</li> </ul> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/100003482360101/picture/" alt="" title="" /> </div> <p class="post-authorName">Maysam Yabandeh</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2017/12/19/write-prepared-txn.html">WritePrepared Transactions</a></h1> <p class="post-meta">Posted December 19, 2017</p> </header> <article class="post-content"> <p>RocksDB supports both optimistic and pessimistic concurrency controls. The pessimistic transactions make use of locks to provide isolation between the transactions. The default write policy in pessimistic transactions is <em>WriteCommitted</em>, which means that the data is written to the DB, i.e., the memtable, only after the transaction is committed. This policy simplified the implementation but came with some limitations in throughput, transaction size, and variety of supported isolation levels. Below, we explain these in detail and present the other write policies, <em>WritePrepared</em> and <em>WriteUnprepared</em>. We then dive into the design of <em>WritePrepared</em> transactions.</p> <h3 id="writecommitted-pros-and-cons">WriteCommitted, Pros and Cons</h3> <p>With the <em>WriteCommitted</em> write policy, the data is written to the memtable only after the transaction commits. This greatly simplifies the read path as any data that is read by other transactions can be assumed to be committed. This write policy, however, implies that the writes are buffered in memory in the meantime. This makes memory a bottleneck for large transactions. The delay of the commit phase in 2PC (two-phase commit) also becomes noticeable since most of the work, i.e., writing to memtable, is done at the commit phase.
When the commits of multiple transactions are performed serially, as in the 2PC implementation of MySQL, the lengthy commit latency becomes a major contributor to lower throughput. Moreover, this write policy cannot provide weaker isolation levels, such as READ UNCOMMITTED, that could potentially provide higher throughput for some applications.</p> <h3 id="alternatives-writeprepared-and-writeunprepared">Alternatives: <em>WritePrepared</em> and <em>WriteUnprepared</em></h3> <p>To tackle the lengthy commit issue, we should do memtable writes at earlier phases of 2PC so that the commit phase becomes lightweight and fast. 2PC is composed of the write stage, where the transaction’s <code class="language-plaintext highlighter-rouge">::Put</code> is invoked; the prepare phase, where <code class="language-plaintext highlighter-rouge">::Prepare</code> is invoked (upon which the DB promises to commit the transaction if that is later requested); and the commit phase, where <code class="language-plaintext highlighter-rouge">::Commit</code> is invoked and the transaction’s writes become visible to all readers. To make the commit phase lightweight, the memtable write could be done at either the <code class="language-plaintext highlighter-rouge">::Prepare</code> or <code class="language-plaintext highlighter-rouge">::Put</code> stage, resulting in the <em>WritePrepared</em> and <em>WriteUnprepared</em> write policies respectively. The downside is that when another transaction is reading data, it needs a way to tell which data is committed and, if it is, whether it was committed before the transaction’s start, i.e., whether it is in the read snapshot of the transaction. <em>WritePrepared</em> would still have the issue of buffering the data, which makes memory the bottleneck for large transactions. It however provides a good milestone for transitioning from <em>WriteCommitted</em> to the <em>WriteUnprepared</em> write policy. Here we explain the design of the <em>WritePrepared</em> policy. We will cover the changes that make the design also support <em>WriteUnprepared</em> in an upcoming post.</p> <h3 id="writeprepared-in-a-nutshell"><em>WritePrepared</em> in a nutshell</h3> <p>These are the primary design questions that need to be addressed: 1) How do we associate the key/values in the DB with the transactions that wrote them? 2) How do we determine whether a key/value written by transaction Txn_w is in the read snapshot of the reading transaction Txn_r? 3) How do we roll back the data written by aborted transactions?</p> <p>With <em>WritePrepared</em>, a transaction still buffers the writes in a write batch object in memory. When 2PC <code class="language-plaintext highlighter-rouge">::Prepare</code> is called, it writes the in-memory write batch to the WAL (write-ahead log) as well as to the memtable(s) (one memtable per column family). We reuse the existing notion of sequence numbers in RocksDB to tag all the key/values in the same write batch with the same sequence number, <code class="language-plaintext highlighter-rouge">prepare_seq</code>, which is also used as the identifier for the transaction. At commit time, it writes a commit marker to the WAL, whose sequence number, <code class="language-plaintext highlighter-rouge">commit_seq</code>, will be used as the commit timestamp of the transaction. Before releasing the commit sequence number to the readers, it stores a mapping from <code class="language-plaintext highlighter-rouge">prepare_seq</code> to <code class="language-plaintext highlighter-rouge">commit_seq</code> in an in-memory data structure that we call the <em>CommitCache</em>. When a transaction reads values from the DB (tagged with <code class="language-plaintext highlighter-rouge">prepare_seq</code>), it makes use of the <em>CommitCache</em> to determine whether the <code class="language-plaintext highlighter-rouge">commit_seq</code> of the value is in its read snapshot. To roll back an aborted transaction, we restore the state from before the transaction by making another write that cancels out the writes of the aborted transaction.</p>
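<p>The following is a rough C++ sketch of that visibility check, under simplifying assumptions: a plain map stands in for the lock-free <em>CommitCache</em>, and handling of evicted entries and of in-flight prepared data is omitted.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <cstdint>
#include <unordered_map>

// Sketch only: maps prepare_seq -> commit_seq. The real CommitCache is a
// lock-free structure with extra data structures for evicted entries.
bool IsVisible(const std::unordered_map<uint64_t, uint64_t>& commit_cache,
               uint64_t value_prepare_seq, uint64_t snapshot_seq) {
  auto it = commit_cache.find(value_prepare_seq);
  if (it == commit_cache.end()) {
    return false;  // still pending (real code also handles evicted entries)
  }
  // Committed: visible only if it committed at or before the snapshot.
  return it->second <= snapshot_seq;
}
</code></pre></div></div>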
<p>The <em>CommitCache</em> is a lock-free data structure that caches the recent commit entries. Looking up entries in the cache is enough for almost all the transactions that commit in a timely manner. When evicting older entries from the cache, it still maintains some other data structures to cover the corner cases of transactions that take abnormally long to finish. We will cover them in the design details below.</p> <h3 id="benchmark-results">Benchmark Results</h3> <p>Here we present the improvements observed in MyRocks with sysbench and linkbench:</p> <table> <tbody> <tr> <td>benchmark</td> <td>tps</td> <td>p95 latency</td> <td>cpu/query</td> </tr> <tr> <td>insert</td> <td>68%</td> <td></td> <td></td> </tr> <tr> <td>update-noindex</td> <td>30%</td> <td>38%</td> <td></td> </tr> <tr> <td>update-index</td> <td>61%</td> <td>28%</td> <td></td> </tr> <tr> <td>read-write</td> <td>6%</td> <td>3.5%</td> <td></td> </tr> <tr> <td>read-only</td> <td>-1.2%</td> <td>-1.8%</td> <td></td> </tr> <tr> <td>linkbench (overall)</td> <td>1.9%</td> <td></td> <td>0.6%</td> </tr> </tbody> </table> <p>Here are also the detailed results for <a href="https://gist.github.com/maysamyabandeh/bdb868091b2929a6d938615fdcf58424">In-Memory Sysbench</a> and <a href="https://gist.github.com/maysamyabandeh/ff94f378ab48925025c34c47eff99306">SSD Sysbench</a> courtesy of <a href="https://github.com/mdcallag">@mdcallag</a>.</p> <p>Learn more <a href="https://github.com/facebook/rocksdb/wiki/WritePrepared-Transactions">here</a>.</p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/568694102/picture/" alt="" title="" /> </div> <p class="post-authorName">Andrew Kryczka</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2017/12/18/17-auto-tuned-rate-limiter.html">Auto-tuned Rate Limiter</a></h1> <p class="post-meta">Posted December 18, 2017</p> </header> <article class="post-content"> <h3 id="introduction">Introduction</h3> <p>Our rate limiter has been hard to configure since users need to pick a value that is low enough to prevent background I/O spikes, which can impact user-visible read/write latencies. Meanwhile, picking too low a value can cause memtables and L0 files to pile up, eventually leading to write stalls. Tuning the rate limiter has been especially difficult for users whose DB instances have different workloads, or have workloads that vary over time, or commonly both.</p> <p>To address this, in RocksDB 5.9 we released a dynamic rate limiter that adjusts itself over time according to demand for background I/O. It can be enabled simply by passing <code class="language-plaintext highlighter-rouge">auto_tuned=true</code> in the <code class="language-plaintext highlighter-rouge">NewGenericRateLimiter()</code> call. In this case <code class="language-plaintext highlighter-rouge">rate_bytes_per_sec</code> will indicate the upper-bound of the window within which a rate limit will be picked dynamically. The chosen rate limit will be much lower unless absolutely necessary, so setting this to the device’s maximum throughput is a reasonable choice on dedicated hosts.</p>
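<p>As a sketch of how this might look (the numeric arguments are illustrative, not recommendations; the trailing parameters are assumed to match the <code class="language-plaintext highlighter-rouge">NewGenericRateLimiter()</code> defaults):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <rocksdb/options.h>
#include <rocksdb/rate_limiter.h>

// Sketch: build Options with the auto-tuned rate limiter enabled.
rocksdb::Options MakeOptions() {
  rocksdb::Options options;
  options.rate_limiter.reset(rocksdb::NewGenericRateLimiter(
      200 << 20 /* rate_bytes_per_sec: illustrative 200MB/s upper bound */,
      100 * 1000 /* refill_period_us */,
      10 /* fairness */,
      rocksdb::RateLimiter::Mode::kWritesOnly,
      true /* auto_tuned */));
  return options;
}
</code></pre></div></div>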
<h3 id="algorithm">Algorithm</h3> <p>We use a simple multiplicative-increase, multiplicative-decrease algorithm. We measure demand for background I/O as the ratio of intervals where the rate limiter is drained. There are low and high watermarks for this ratio, which will trigger a change in rate limit when breached. The rate limit can move within a window bounded by the user-specified upper-bound, and a lower-bound that we derive internally. Users can expect this lower bound to be 1-2 orders of magnitude less than the provided upper-bound (so don’t provide INT64_MAX as your upper-bound), although it’s subject to change.</p> <h3 id="benchmark-results">Benchmark Results</h3> <p>Data was ingested at 10MB/s and the rate limiter was created with 1000MB/s as its upper bound. The dynamically chosen rate limit hovers around 125MB/s. The other clustering of points at 50MB/s is due to the number of compaction threads being reduced to one when there’s no compaction pressure.</p> <p><img src="/static/images/rate-limiter/write-KBps-series.png" alt="" /></p> <p><img src="/static/images/rate-limiter/auto-tuned-write-KBps-series.png" alt="" /></p> <p>The following graph summarizes the above two time series graphs in CDF form. In particular, notice that the p90 - p100 background write rates are significantly lower with the auto-tuned rate limiter enabled.</p> <p><img src="/static/images/rate-limiter/write-KBps-cdf.png" alt="" /></p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/100003482360101/picture/" alt="" title="" /> </div> <p class="post-authorName">Maysam Yabandeh</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2017/09/28/rocksdb-5-8-released.html">RocksDB 5.8 Released!</a></h1> <p class="post-meta">Posted September 28, 2017</p> </header> <article class="post-content"> <h3 id="public-api-change">Public API Change</h3> <ul> <li>Users of <code class="language-plaintext highlighter-rouge">Statistics::getHistogramString()</code> will see fewer histogram buckets and different bucket endpoints.</li> <li><code class="language-plaintext highlighter-rouge">Slice::compare</code> and BytewiseComparator <code class="language-plaintext highlighter-rouge">Compare</code> no longer accept <code class="language-plaintext highlighter-rouge">Slice</code>s containing nullptr.</li> <li><code class="language-plaintext highlighter-rouge">Transaction::Get</code> and <code class="language-plaintext highlighter-rouge">Transaction::GetForUpdate</code> variants with <code class="language-plaintext highlighter-rouge">PinnableSlice</code> added.</li> </ul> <h3 id="new-features">New Features</h3> <ul> <li>Add Iterator::Refresh(), which allows users to update the iterator state so that they can avoid some initialization costs of recreating iterators.</li> <li>Replace dynamic_cast<> (except in unit tests) so people can choose to build with RTTI off.
With make, release mode is built with -fno-rtti by default, while debug mode is built with RTTI. Users can override this by setting USE_RTTI=0 or 1.</li> <li>Universal compactions including the bottom level can be executed in a dedicated thread pool. This alleviates head-of-line blocking in the compaction queue, which causes write stalls, particularly in multi-instance use cases. Users can enable this feature via <code class="language-plaintext highlighter-rouge">Env::SetBackgroundThreads(N, Env::Priority::BOTTOM)</code>, where <code class="language-plaintext highlighter-rouge">N > 0</code>.</li> <li>Allow the merge operator to be called even with a single merge operand during compactions, by appropriately overriding <code class="language-plaintext highlighter-rouge">MergeOperator::AllowSingleOperand</code>.</li> <li>Add <code class="language-plaintext highlighter-rouge">DB::VerifyChecksum()</code>, which verifies the checksums in all SST files in a running DB.</li> <li>Block-based table support for disabling checksums by setting <code class="language-plaintext highlighter-rouge">BlockBasedTableOptions::checksum = kNoChecksum</code>.</li> </ul> <h3 id="bug-fixes">Bug Fixes</h3> <ul> <li>Fix wrong latencies in <code class="language-plaintext highlighter-rouge">rocksdb.db.get.micros</code>, <code class="language-plaintext highlighter-rouge">rocksdb.db.write.micros</code>, and <code class="language-plaintext highlighter-rouge">rocksdb.sst.read.micros</code>.</li> <li>Fix incorrect dropping of deletions during intra-L0 compaction.</li> <li>Fix transient reappearance of keys covered by range deletions when the memtable prefix bloom filter is enabled.</li> <li>Fix potentially wrong file smallest key when range deletions separated by snapshot are written together.</li> </ul> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/100003482360101/picture/" alt="" title="" /> </div> <p class="post-authorName">Maysam Yabandeh</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2017/08/25/flushwal.html">FlushWAL; less fwrite, faster writes</a></h1> <p class="post-meta">Posted August 25, 2017</p> </header> <article class="post-content"> <p>When <code class="language-plaintext highlighter-rouge">DB::Put</code> is called, the data is written to both the memtable (to be flushed to SST files later) and the WAL (write-ahead log) if it is enabled. In the case of a crash, RocksDB can recover the memtable state to the extent that it is reflected in the WAL. By default RocksDB automatically flushes the WAL from the application memory to the OS buffer after each <code class="language-plaintext highlighter-rouge">::Put</code>. It can, however, be configured to perform the flush only upon an explicit call to <code class="language-plaintext highlighter-rouge">::FlushWAL</code>. Skipping the fwrite call after each <code class="language-plaintext highlighter-rouge">::Put</code> offers a tradeoff between reliability and write latency for the general case. 
As we explain below, some applications such as MyRocks benefit from this API to gain higher write throughput without compromising reliability.</p> <h3 id="how-much-is-the-gain">How much is the gain?</h3> <p>Using the <code class="language-plaintext highlighter-rouge">::FlushWAL</code> API along with setting <code class="language-plaintext highlighter-rouge">DBOptions::concurrent_prepare</code>, MyRocks achieves 40% higher throughput in Sysbench’s <a href="https://github.com/akopytov/sysbench/blob/master/src/lua/oltp_update_non_index.lua">update-nonindex</a> benchmark.</p> <h3 id="write-flush-and-sync">Write, Flush, and Sync</h3> <p>A write to the WAL first goes to an application-side memory buffer. In the next step, the buffer is “flushed” to the OS buffer by calling fwrite. The OS buffer is later “synced” to the persistent storage. The data in the OS buffer, although not persisted yet, will survive an application crash. By default, the flush occurs automatically upon each call to <code class="language-plaintext highlighter-rouge">DB::Put</code> or <code class="language-plaintext highlighter-rouge">DB::Write</code>. The user can additionally request a sync after each write by setting <code class="language-plaintext highlighter-rouge">WriteOptions::sync</code>.</p> <h3 id="flushwal-api">FlushWAL API</h3> <p>The user can turn off the automatic flush of the WAL by setting <code class="language-plaintext highlighter-rouge">DBOptions::manual_wal_flush</code>. In that case, the WAL buffer is flushed when it is either full or <code class="language-plaintext highlighter-rouge">DB::FlushWAL</code> is called by the user. The API also accepts a boolean argument for syncing right after the flush: <code class="language-plaintext highlighter-rouge">::FlushWAL(true)</code>.</p> <h3 id="success-story-myrocks">Success story: MyRocks</h3> <p>Some applications that use RocksDB already have other mechanisms in place to provide reliability. MySQL, for example, uses 2PC (two-phase commit) to write to both the binlog and the storage engine, such as InnoDB or MyRocks. The group commit logic in MySQL allows the 1st phase (Prepare) to be run in parallel but, after a commit group is formed, performs the 2nd phase (Commit) serially. This makes low commit latency in the storage engine essential for achieving high throughput. The commit in MyRocks includes writing to the RocksDB WAL, which, as explained above, by default incurs the latency of flushing new WAL appends to the OS buffer.</p> <p>Since the binlog helps in recovering from some failure scenarios, MySQL can provide reliability without needing a storage-engine WAL flush after each individual commit. 
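</p> <p>In RocksDB terms, the resulting pattern looks roughly like this (a hedged sketch: it only assumes <code class="language-plaintext highlighter-rouge">DBOptions::manual_wal_flush</code> and <code class="language-plaintext highlighter-rouge">DB::FlushWAL(bool sync)</code> as described above; the path and error handling are illustrative):</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <cassert>
#include "rocksdb/db.h"
#include "rocksdb/options.h"

rocksdb::Options options;
options.create_if_missing = true;
options.manual_wal_flush = true;  // no automatic WAL flush on each write

rocksdb::DB* db = nullptr;
rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);
assert(s.ok());

// Writes append only to the in-memory WAL buffer ...
s = db->Put(rocksdb::WriteOptions(), "key", "value");
assert(s.ok());

// ... until the application flushes (and, with true, also syncs) the WAL.
s = db->FlushWAL(true /* sync */);
assert(s.ok());
</code></pre></div></div> <p>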
MyRocks benefits from this property, disables automatic WAL flush in RocksDB, and manually calls <code class="language-plaintext highlighter-rouge">::FlushWAL</code> when requested by MySQL.</p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/100003482360101/picture/" alt="" title="" /> </div> <p class="post-authorName">Maysam Yabandeh</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2017/08/24/pinnableslice.html">PinnableSlice; less memcpy with point lookups</a></h1> <p class="post-meta">Posted August 24, 2017</p> </header> <article class="post-content"> <p>The classic API for <a href="https://github.com/facebook/rocksdb/blob/9e583711144f580390ce21a49a8ceacca338fcd5/include/rocksdb/db.h#L310">DB::Get</a> receives a std::string as argument to which it will copy the value. The memcpy overhead could be non-trivial when the value is large. The <a href="https://github.com/facebook/rocksdb/blob/9e583711144f580390ce21a49a8ceacca338fcd5/include/rocksdb/db.h#L322">new API</a> receives a PinnableSlice instead, which avoids memcpy in most cases.</p> <h3 id="what-is-pinnableslice">What is PinnableSlice?</h3> <p>Similarly to Slice, PinnableSlice refers to some in-memory data, so it does not incur the memcpy cost. To ensure that the data will not be erased while it is being processed by the user, PinnableSlice, as its name suggests, has the data pinned in memory. The pinned data are released when the PinnableSlice object is destructed or when ::Reset is invoked explicitly on it.</p> <h3 id="how-good-is-it">How good is it?</h3> <p>Here are the improvements in throughput for an <a href="https://github.com/facebook/rocksdb/pull/1756#issuecomment-286201693">in-memory benchmark</a>:</p> <ul> <li>value 1k byte: 14%</li> <li>value 10k byte: 34%</li> </ul> <h3 id="any-limitations">Any limitations?</h3> <p>PinnableSlice tries to avoid memcpy as much as possible. The primary gain is when reading large values from the block cache. There are, however, cases where it still has to copy the data into its internal buffer, mainly due to implementation complexity. If there is enough motivation on the application side, the scope of PinnableSlice could be extended to cover such cases too. These include:</p> <ul> <li>Merged values</li> <li>Reads from memtables</li> </ul> <h3 id="how-to-use-it">How to use it?</h3> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 6 </pre></td><td class="rouge-code"><pre><span class="n">PinnableSlice</span> <span class="n">pinnable_val</span><span class="p">;</span> <span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">stopped</span><span class="p">)</span> <span class="p">{</span> <span class="k">auto</span> <span class="n">s</span> <span class="o">=</span> <span class="n">db</span><span class="o">-></span><span class="n">Get</span><span class="p">(</span><span class="n">opt</span><span class="p">,</span> <span class="n">cf</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="o">&</span><span class="n">pinnable_val</span><span class="p">);</span> <span class="c1">// ... 
use it</span> <span class="n">pinnable_val</span><span class="p">.</span><span class="n">Reset</span><span class="p">();</span> <span class="c1">// then release it immediately</span> <span class="p">}</span> </pre></td></tr></tbody></table></code></pre></div></div> <p>You can also <a href="https://github.com/facebook/rocksdb/blob/9e583711144f580390ce21a49a8ceacca338fcd5/include/rocksdb/db.h#L314">initialize the internal buffer</a> of PinnableSlice by passing your own string in the constructor. <a href="https://github.com/facebook/rocksdb/blob/main/examples/simple_example.cc">simple_example.cc</a> demonstrates this with more examples.</p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/100000476362039/picture/" alt="" title="" /> </div> <p class="post-authorName">Yi Wu</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2017/07/25/rocksdb-5-6-1-released.html">RocksDB 5.6.1 Released!</a></h1> <p class="post-meta">Posted July 25, 2017</p> </header> <article class="post-content"> <h3 id="public-api-change">Public API Change</h3> <ul> <li>Scheduling flushes and compactions in the same thread pool is no longer supported by setting <code class="language-plaintext highlighter-rouge">max_background_flushes=0</code>. Instead, users can achieve this by configuring their high-pri thread pool to have zero threads. See https://github.com/facebook/rocksdb/wiki/Thread-Pool for more details.</li> <li>Replace <code class="language-plaintext highlighter-rouge">Options::max_background_flushes</code>, <code class="language-plaintext highlighter-rouge">Options::max_background_compactions</code>, and <code class="language-plaintext highlighter-rouge">Options::base_background_compactions</code> all with <code class="language-plaintext highlighter-rouge">Options::max_background_jobs</code>, which automatically decides how many threads to allocate towards flush/compaction.</li> <li>options.delayed_write_rate by default takes the value of the options.rate_limiter rate.</li> <li>Replace global variable <code class="language-plaintext highlighter-rouge">IOStatsContext iostats_context</code> with <code class="language-plaintext highlighter-rouge">IOStatsContext* get_iostats_context()</code>; replace global variable <code class="language-plaintext highlighter-rouge">PerfContext perf_context</code> with <code class="language-plaintext highlighter-rouge">PerfContext* get_perf_context()</code>.</li> </ul> <h3 id="new-features">New Features</h3> <ul> <li>Change ticker/histogram statistics implementations to use core-local storage. This improves aggregation speed compared to our previous thread-local approach, particularly for applications with many threads. See http://rocksdb.org/blog/2017/05/14/core-local-stats.html for more details.</li> <li>Users can pass a cache object to the write buffer manager, so that they can cap memory usage for memtable and block cache using one single limit.</li> <li>Flush will be triggered when 7/8 of the limit set by write_buffer_manager or db_write_buffer_size is reached, so that the hard threshold is rarely hit. See https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager for more details.</li> <li>Introduce WriteOptions.low_pri. If it is true, low priority writes will be throttled if the compaction is behind. 
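For example (a hedged sketch using the <code class="language-plaintext highlighter-rouge">WriteOptions::low_pri</code> flag just named; <code class="language-plaintext highlighter-rouge">db</code> is an open DB instance): <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>rocksdb::WriteOptions write_options;
write_options.low_pri = true;  // throttle this write if compaction falls behind
rocksdb::Status s = db->Put(write_options, "key", "value");
</code></pre></div></div> 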
See https://github.com/facebook/rocksdb/wiki/Low-Priority-Write for more details.</li> <li><code class="language-plaintext highlighter-rouge">DB::IngestExternalFile()</code> now supports ingesting files into a database containing range deletions.</li> </ul> <h3 id="bug-fixes">Bug Fixes</h3> <ul> <li>No longer ignore the return value of fsync() in flush.</li> </ul> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/1351549072/picture/" alt="" title="" /> </div> <p class="post-authorName">Aaron Gao</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2017/06/29/rocksdb-5-5-1-released.html">RocksDB 5.5.1 Released!</a></h1> <p class="post-meta">Posted June 29, 2017</p> </header> <article class="post-content"> <h3 id="new-features">New Features</h3> <ul> <li>FIFO compaction now supports intra-L0 compaction too, with CompactionOptionsFIFO.allow_compaction=true.</li> <li>Statistics::Reset() to reset user stats.</li> <li>ldb adds option --try_load_options, which will open the DB with its own options file.</li> <li>Introduce WriteBatch::PopSavePoint to pop the most recent save point explicitly.</li> <li>Support dynamically changing the <code class="language-plaintext highlighter-rouge">max_open_files</code> option via SetDBOptions()</li> <li>Added DB::CreateColumnFamilies() and DB::DropColumnFamilies() to bulk create/drop column families.</li> <li>Add debugging function <code class="language-plaintext highlighter-rouge">GetAllKeyVersions</code> to see internal versions of a range of keys.</li> <li>Support file ingestion with universal compaction style</li> <li>Support file ingestion behind with option <code class="language-plaintext highlighter-rouge">allow_ingest_behind</code></li> <li>New option enable_pipelined_write, which may improve write throughput when writing from multiple threads with the WAL enabled.</li> </ul> <h3 id="bug-fixes">Bug Fixes</h3> <ul> <li>Fix the bug that Direct I/O uses direct reads for non-SST files</li> <li>Fix the bug that flush doesn’t respond to the fsync result</li> </ul> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/568694102/picture/" alt="" title="" /> </div> <p class="post-authorName">Andrew Kryczka</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2017/06/26/17-level-based-changes.html">Level-based Compaction Changes</a></h1> <p class="post-meta">Posted June 26, 2017</p> </header> <article class="post-content"> <h3 id="introduction">Introduction</h3> <p>RocksDB provides an option to limit the number of L0 files, which bounds read-amplification. Since L0 files (unlike files at lower levels) can span the entire key-range, a key might be in any file, thus reads need to check them one-by-one. Users often wish to configure a low limit to improve their read latency.</p> <p>However, the mechanism with which we enforce L0’s file count limit may be unappealing. When the limit is reached, RocksDB intentionally delays user writes. This slows down accumulation of files in L0, and frees up resources for compacting files down to lower levels. 
But adding delays will significantly increase user-visible write latency jitter.</p> <p>Also, because L0 files can span the entire key-range, compaction parallelization is limited. Files at L0 or L1 may be locked due to involvement in pending L0->L1 or L1->L2 compactions. We can only schedule a parallel L0->L1 compaction if it does not require any of the locked files, which is typically not the case.</p> <p>To handle these constraints better, we added a new type of compaction, L0->L0. It quickly reduces file count in L0 and can be scheduled even when L1 files are locked, unlike L0->L1. We also changed the L0->L1 picking algorithm to increase opportunities for parallelism.</p> <h3 id="old-l0-l1-picking-logic">Old L0->L1 Picking Logic</h3> <p>Previously, our logic for picking which L0 file to compact was the same as for every other level: pick the largest file in the level. One special property of L0->L1 compaction is that files can overlap in the input level, so those overlapping files must be pulled in as well. For example, a compaction may look like this:</p> <p><img src="/static/images/compaction/full-range.png" alt="full-range.png" /></p> <p>This compaction pulls in every L0 and L1 file. This happens regardless of which L0 file is initially chosen, as each file overlaps with every other file.</p> <p>Users may insert their data less uniformly across the key-range. For example, a database may look like this during L0->L1 compaction:</p> <p><img src="/static/images/compaction/part-range-old.png" alt="part-range-old.png" /></p> <p>Let’s say the third file from the top is the largest, and let’s say the top two files were created after the compaction started. When the compaction is picked, the fourth L0 file and six rightmost L1 files are pulled in due to overlap. Notice this leaves the database in a state where we might not be able to schedule parallel compactions. For example, if the sixth file from the top is the next largest, we can’t compact it because it overlaps with the top two files, which overlap with the locked L0 files.</p> <p>We can now see the high-level problems with this approach more clearly. First, locked files in L0 or L1 prevent us from parallelizing compactions. When locked files block L0->L1 compaction, there is nothing we can do to eliminate L0 files. Second, L0->L1 compactions are relatively slow. As we saw, when keys are uniformly distributed, L0->L1 compacts two entire levels. While this is happening, new files are being flushed to L0, advancing towards the file count limit.</p> <h3 id="new-l0-l0-algorithm">New L0->L0 Algorithm</h3> <p>We introduced compaction within L0 to improve both parallelization and the speed of reducing L0 file count. An L0->L0 compaction may look like this:</p> <p><img src="/static/images/compaction/l1-l2-contend.png" alt="l1-l2-contend.png" /></p> <p>Say the L1->L2 compaction started first. Now L0->L1 is prevented by the locked L1 file. In this case, we compact files within L0. This allows us to start the work for eliminating L0 files earlier. It also lets us do less work since we don’t pull in any L1 files, whereas L0->L1 compaction would’ve pulled in all of them. This lets us quickly reduce L0 file count to keep read-amp low while sustaining large bursts of writes (i.e., fast accumulation of L0 files).</p> <p>The tradeoff is that this increases total compaction work, as we’re now compacting files without contributing towards our eventual goal of moving them towards lower levels. 
Our benchmarks, though, consistently show fewer compaction stalls and improved write throughput. One justification is that L0 file data is highly likely to be in the page cache and/or block cache, since it was recently written and is frequently accessed. So, this type of compaction is relatively cheap compared to compactions at lower levels.</p> <p>This feature is available since RocksDB 5.4.</p> <h3 id="new-l0-l1-picking-logic">New L0->L1 Picking Logic</h3> <p>Recall how the old L0->L1 picking algorithm chose the largest L0 file for compaction. This didn’t fit well with L0->L0 compaction, which operates on a span of files. That span begins at the newest L0 file, and expands towards older files as long as they’re not being compacted. Since the largest file may be anywhere, the old L0->L1 picking logic could arbitrarily prevent us from getting a long span of files. See the second illustration in this post for a scenario where this would happen.</p> <p>So, we changed the L0->L1 picking algorithm to start from the oldest file and expand towards newer files as long as they’re not being compacted. For example:</p> <p><img src="/static/images/compaction/l0-l1-contend.png" alt="l0-l1-contend.png" /></p> <p>Now, L0 files can never become unreachable for L0->L0 due to L0->L1 selecting files in the middle. When longer spans of files are available for L0->L0, we perform less compaction work per deleted L0 file, thus improving efficiency.</p> <p>This feature will be available in RocksDB 5.7.</p> <h3 id="performance-changes">Performance Changes</h3> <p>Mark Callaghan did the most extensive benchmarking of this feature’s impact on MyRocks. See his results <a href="http://smalldatum.blogspot.com/2017/05/innodb-myrocks-and-tokudb-on-insert.html">here</a>. Note the primary change between his March 17 and April 14 builds is that the latter performs L0->L0 compaction.</p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/2419111/picture/" alt="" title="" /> </div> <p class="post-authorName">Sagar Vemuri</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2017/05/26/rocksdb-5-4-5-released.html">RocksDB 5.4.5 Released!</a></h1> <p class="post-meta">Posted May 26, 2017</p> </header> <article class="post-content"> <h3 id="public-api-change">Public API Change</h3> <ul> <li>Support dynamically changing the <code class="language-plaintext highlighter-rouge">stats_dump_period_sec</code> option via SetDBOptions().</li> <li>Added ReadOptions::max_skippable_internal_keys to set a threshold to fail a request as incomplete when too many keys are being skipped while using iterators.</li> <li>DB::Get now accepts PinnableSlice in place of std::string, which avoids the extra memcpy of the value to std::string in most cases. <ul> <li>PinnableSlice releases the pinned resources that contain the value when it is destructed or when ::Reset() is called on it.</li> <li>The old API that accepts std::string, although discouraged, is still supported.</li> </ul> </li> <li>Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction. 
See the Direct IO wiki for details.</li> </ul> <h3 id="new-features">New Features</h3> <ul> <li>Memtable flush can be avoided during checkpoint creation if the total log file size is smaller than a threshold specified by the user.</li> <li>Introduce level-based L0->L0 compactions to reduce file count, so write delays are incurred less often.</li> <li>(Experimental) Partitioned filters, which create an index on the filter partitions. The feature can be enabled by setting partition_filters when using kFullFilter. Currently the feature also requires two-level indexing to be enabled. The number of partitions is the same as for indexes, which is controlled by metadata_block_size.</li> <li>DB::ResetStats() to reset internal stats.</li> <li>Added CompactionEventListener and EventListener::OnFlushBegin interfaces.</li> <li>Added DB::CreateColumnFamilies() and DB::DropColumnFamilies() to bulk create/drop column families.</li> <li>Facility for cross-building RocksJava using Docker.</li> </ul> <h3 id="bug-fixes">Bug Fixes</h3> <ul> <li>Fix WriteBatchWithIndex address use after scope error.</li> <li>Fix WritableFile buffer size in direct IO.</li> <li>Add prefetch to PosixRandomAccessFile in buffered io.</li> <li>Fix PinnableSlice access invalid address when row cache is enabled.</li> <li>Fix huge fallocate calls failing and making XFS unhappy.</li> <li>Fix memory alignment with logical sector size.</li> <li>Fix alignment in ReadaheadRandomAccessFile.</li> <li>Fix bias with read amplification stats (READ_AMP_ESTIMATE_USEFUL_BYTES and READ_AMP_TOTAL_READ_BYTES).</li> <li>Fix a manual / auto compaction data race.</li> <li>Fix CentOS 5 cross-building of RocksJava.</li> <li>Build and link with ZStd when creating the static RocksJava build.</li> <li>Fix snprintf’s usage to be cross-platform.</li> <li>Fix build errors with blob DB.</li> <li>Fix readamp test type inconsistency.</li> </ul> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/568694102/picture/" alt="" title="" /> </div> <p class="post-authorName">Andrew Kryczka</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2017/05/14/core-local-stats.html">Core-local Statistics</a></h1> <p class="post-meta">Posted May 14, 2017</p> </header> <article class="post-content"> <h2 id="origins-global-atomics">Origins: Global Atomics</h2> <p>Until RocksDB 4.12, ticker/histogram statistics were implemented with std::atomic values shared across the entire program. A ticker consists of a single atomic, while a histogram consists of several atomics to represent things like min/max/per-bucket counters. These statistics could be updated by all user/background threads.</p> <p>For concurrent/high-throughput workloads, cache line bouncing of atomics caused high CPU utilization. For example, we have tickers that count block cache hits and misses. Almost every user read increments these tickers a few times. Many concurrent user reads would cause the cache lines containing these atomics to bounce between cores.</p> <h3 id="performance">Performance</h3> <p>Here are perf results for 32 reader threads where most reads (99%+) are served by uncompressed block cache. 
Such a scenario stresses the statistics code heavily.</p> <p>Benchmark command: <code class="language-plaintext highlighter-rouge">TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench -statistics -use_existing_db=true -benchmarks=readrandom -threads=32 -cache_size=1048576000 -num=1000000 -reads=1000000 && perf report -g --children</code></p> <p>Perf snippet for “cycles” event:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 </pre></td><td class="rouge-code"><pre> Children Self Command Shared Object Symbol + 30.33% 30.17% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick + 3.65% 0.98% db_bench db_bench [.] rocksdb::StatisticsImpl::measureTime </pre></td></tr></tbody></table></code></pre></div></div> <p>Perf snippet for “cache-misses” event:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 </pre></td><td class="rouge-code"><pre> Children Self Command Shared Object Symbol + 19.54% 19.50% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick + 3.44% 0.57% db_bench db_bench [.] rocksdb::StatisticsImpl::measureTime </pre></td></tr></tbody></table></code></pre></div></div> <p>The high CPU overhead for updating tickers and histograms corresponds well to the high cache misses.</p> <h2 id="thread-locals-faster-updates">Thread-locals: Faster Updates</h2> <p>Since RocksDB 4.12, ticker/histogram statistics use thread-local storage. Each thread has a local set of atomic values that no other thread can update. This prevents the cache line bouncing problem described above. Even though updates to a given value are always made by the same thread, atomics are still useful to synchronize with aggregations for querying statistics.</p> <p>Implementing this approach involved a couple of challenges. First, each query for a statistic’s global value must aggregate all threads’ local values. This adds some overhead, which may pass unnoticed if statistics are queried infrequently. Second, exited threads’ local values are still needed to provide accurate statistics. We handle this by merging a thread’s local values into process-wide variables upon thread exit.</p> <h3 id="performance-1">Performance</h3> <p>The update benchmark setup is the same as before. CPU overhead improved 7.8x compared to global atomics, corresponding to a 17.8x reduction in cache-misses overhead.</p> <p>Perf snippet for “cycles” event:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 </pre></td><td class="rouge-code"><pre> Children Self Command Shared Object Symbol + 2.96% 0.87% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick + 1.37% 0.10% db_bench db_bench [.] rocksdb::StatisticsImpl::measureTime </pre></td></tr></tbody></table></code></pre></div></div> <p>Perf snippet for “cache-misses” event:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 </pre></td><td class="rouge-code"><pre> Children Self Command Shared Object Symbol + 1.21% 0.65% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick 0.08% 0.00% db_bench db_bench [.] 
rocksdb::StatisticsImpl::measureTime </pre></td></tr></tbody></table></code></pre></div></div> <p>To measure statistics query latency, we ran sysbench with 4K OLTP clients concurrently with one client that queries statistics repeatedly. Times shown are in milliseconds.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 </pre></td><td class="rouge-code"><pre> min: 18.45 avg: 27.91 max: 231.65 95th percentile: 55.82 </pre></td></tr></tbody></table></code></pre></div></div> <h2 id="core-locals-faster-querying">Core-locals: Faster Querying</h2> <p>The thread-local approach works well for applications that call RocksDB from only a few threads, or that poll statistics infrequently. Eventually, though, we found use cases where those assumptions do not hold. For example, one application has per-connection threads and typically runs into performance issues when the connection count grows very high. For debugging such issues, they want high-frequency statistics polling to correlate issues in their application with changes in RocksDB’s state.</p> <p>Once <a href="https://github.com/facebook/rocksdb/pull/2258">PR #2258</a> lands, ticker/histogram statistics will be local to each CPU core. Similarly to thread-local, each core updates only its local values, thus avoiding cache line bouncing. Local values are still atomics to make aggregation possible. With this change, query work depends only on the number of cores, not the number of threads. So, applications with many more threads than cores can no longer impact statistics query latency.</p> <h3 id="performance-2">Performance</h3> <p>The update benchmark setup is the same as before. CPU overhead worsened by ~23% compared to thread-local, while cache performance was unchanged.</p> <p>Perf snippet for “cycles” event:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 </pre></td><td class="rouge-code"><pre> Children Self Command Shared Object Symbol + 2.96% 0.87% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick + 1.37% 0.10% db_bench db_bench [.] rocksdb::StatisticsImpl::measureTime </pre></td></tr></tbody></table></code></pre></div></div> <p>Perf snippet for “cache-misses” event:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 </pre></td><td class="rouge-code"><pre> Children Self Command Shared Object Symbol + 1.21% 0.65% db_bench db_bench [.] rocksdb::StatisticsImpl::recordTick 0.08% 0.00% db_bench db_bench [.] rocksdb::StatisticsImpl::measureTime </pre></td></tr></tbody></table></code></pre></div></div> <p>Query latency is measured the same way as before, with times in milliseconds. 
Average latency improved by 6.3x compared to thread-local.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 </pre></td><td class="rouge-code"><pre> min: 2.47 avg: 4.45 max: 91.13 95th percentile: 7.56 </pre></td></tr></tbody></table></code></pre></div></div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/100003482360101/picture/" alt="" title="" /> </div> <p class="post-authorName">Maysam Yabandeh</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2017/05/12/partitioned-index-filter.html">Partitioned Index/Filters</a></h1> <p class="post-meta">Posted May 12, 2017</p> </header> <article class="post-content"> <p>As the DB/mem ratio gets larger, the memory footprint of filter/index blocks becomes non-trivial. Although <code class="language-plaintext highlighter-rouge">cache_index_and_filter_blocks</code> allows storing only a subset of them in the block cache, their relatively large size negatively affects performance by (i) occupying block cache space that could otherwise be used for caching data, and (ii) increasing the load on disk storage by loading them into the cache after a miss. Here we illustrate these problems in more detail and explain how partitioning index/filters alleviates the overhead.</p> <h3 id="how-large-are-the-indexfilter-blocks">How large are the index/filter blocks?</h3> <p>RocksDB has by default one index/filter block per SST file. The size of the index/filter varies based on the configuration, but for an SST of size 256MB, index/filter blocks of size 0.5MB/5MB are typical, which is much larger than the typical data block size of 4-32KB. That is fine when all index/filters fit perfectly into memory and hence are read once per SST lifetime; not so much when they compete with data blocks for the block cache space and are also likely to be re-read many times from the disk.</p> <h3 id="what-is-the-big-deal-with-large-indexfilter-blocks">What is the big deal with large index/filter blocks?</h3> <p>When index/filter blocks are stored in the block cache, they are effectively competing with data blocks (as well as with each other) for this scarce resource. A filter of size 5MB occupies the space that could otherwise be used to cache 1000s of data blocks (of size 4KB). This would result in more cache misses for data blocks. The large index/filters also kick each other out of the block cache more often and exacerbate their own cache miss rate too. Meanwhile, only a small part of the index/filter block might actually be used during its lifetime in the cache.</p> <p>After an index/filter cache miss, it has to be reloaded from disk, and its large size does not help in reducing the IO cost. While a simple point lookup might need at most a couple of data block reads (of size 4KB), one from each level of the LSM, it might also end up loading multiple megabytes of index/filter blocks. 
If that happens often, the disk spends more time serving index/filter blocks than the actual data blocks.</p> <h2 id="what-is-partitioned-indexfilters">What is partitioned index/filters?</h2> <p>With partitioning, the index/filter of an SST file is partitioned into smaller blocks with an additional top-level index on them. When reading an index/filter, only the top-level index is loaded into memory. The partitioned index/filter then uses the top-level index to load the partitions that are required to perform the index/filter query into the block cache on demand. The top-level index, which has a much smaller memory footprint, can be stored in the heap or block cache depending on the <code class="language-plaintext highlighter-rouge">cache_index_and_filter_blocks</code> setting.</p> <h3 id="success-stories">Success stories</h3> <h4 id="hdd-100tb-db">HDD, 100TB DB</h4> <p>In this example we have a DB of size 86G on HDD and emulate the small memory available to a node with 100TB of data by using direct IO (skipping the OS file cache) and a very small block cache of size 60MB. Partitioning improves throughput by 11x, from 5 op/s to 55 op/s.</p> <h4 id="ssd-linkbench">SSD, Linkbench</h4> <p>In this example we have a DB of size 300G on SSD and emulate the small memory that would be available in the presence of other DBs on the same node by using direct IO (skipping the OS file cache) and block caches of size 6G and 2G. Without partitioning, the linkbench throughput drops from 38k tps to 23k when reducing the block cache size from 6G to 2G. With partitioning, the throughput drops from 38k to only 30k.</p> <p>Learn more <a href="https://github.com/facebook/rocksdb/wiki/Partitioned-Index-Filters">here</a>.</p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/9805119/picture/" alt="" title="" /> </div> <p class="post-authorName">Siying Dong</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2017/03/02/rocksdb-5-2-1-released.html">RocksDB 5.2.1 Released!</a></h1> <p class="post-meta">Posted March 02, 2017</p> </header> <article class="post-content"> <h3 id="public-api-change">Public API Change</h3> <ul> <li>NewLRUCache() will determine the number of shard bits automatically based on capacity, if the user doesn’t pass one. This also impacts the default block cache when the user doesn’t explicitly provide one.</li> <li>Change the default of delayed slowdown value to 16MB/s and further increase the L0 stop condition to 36 files.</li> </ul> <h3 id="new-features">New Features</h3> <ul> <li>Added new overloaded function GetApproximateSizes that allows specifying whether only memtable stats should be computed, without computing SST files’ stats approximations.</li> <li>Added new function GetApproximateMemTableStats that approximates both the number of records and size of memtables.</li> <li>(Experimental) Two-level indexing that partitions the index and creates a 2nd-level index on the partitions. The feature can be enabled by setting kTwoLevelIndexSearch as IndexType and configuring index_per_partition. 
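A hedged configuration sketch (option names follow the release notes above; note that <code class="language-plaintext highlighter-rouge">index_per_partition</code> was superseded by <code class="language-plaintext highlighter-rouge">metadata_block_size</code> in later releases, and the partition count shown is illustrative): <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::BlockBasedTableOptions table_options;
// Partition the index and build a 2nd-level (top-level) index on the partitions.
table_options.index_type =
    rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;
// Granularity of the index partitions (illustrative value).
table_options.index_per_partition = 1024;

rocksdb::Options options;
options.table_factory.reset(
    rocksdb::NewBlockBasedTableFactory(table_options));
</code></pre></div></div> 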
</li> </ul> <h3 id="bug-fixes">Bug Fixes</h3> <ul> <li>RangeSync() should work if ROCKSDB_FALLOCATE_PRESENT is not set</li> <li>Fix wrong results in a data race case in Get()</li> <li>Some fixes related to 2PC.</li> <li>Fix several bugs in Direct I/O support.</li> <li>Fix a regression bug which can cause Seek() to miss some keys if the returned key has been updated many times after the snapshot used by the iterator.</li> </ul> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/642759407/picture/" alt="" title="" /> </div> <p class="post-authorName">Islam AbdelRahman</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2017/02/17/bulkoad-ingest-sst-file.html">Bulkloading by ingesting external SST files</a></h1> <p class="post-meta">Posted February 17, 2017</p> </header> <article class="post-content"> <h2 id="introduction">Introduction</h2> <p>One of the basic operations of RocksDB is writing. Writes happen when the user calls DB::Put, DB::Write, DB::Delete, etc., but what actually happens when you write to RocksDB? Here is a brief description.</p> <ul> <li>The user inserts a new key/value by calling DB::Put() (or DB::Write())</li> <li>We create a new entry for the new key/value in our in-memory structure (memtable / SkipList by default) and assign it a new sequence number.</li> <li>When the memtable exceeds a specific size (64 MB for example), we convert this memtable to an SST file, and put this file in level 0 of our LSM-Tree</li> <li>Later, compaction will kick in and move data from level 0 to level 1, and then from level 1 to level 2, and so on</li> </ul> <p>But what if we could skip these steps and add data directly to the lowest possible level? 
This is what bulk-loading does.</p> <h2 id="bulkloading">Bulkloading</h2> <ul> <li>Write all of our keys and values into an SST file outside of the DB</li> <li>Add the SST file into the LSM directly</li> </ul> <p>This is bulk-loading, and in specific use-cases it allows users to achieve faster data loading and better write-amplification.</p> <p>Doing it is as simple as:</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 </pre></td><td class="rouge-code"><pre><span class="n">Options</span> <span class="n">options</span><span class="p">;</span> <span class="n">SstFileWriter</span> <span class="nf">sst_file_writer</span><span class="p">(</span><span class="n">EnvOptions</span><span class="p">(),</span> <span class="n">options</span><span class="p">,</span> <span class="n">options</span><span class="p">.</span><span class="n">comparator</span><span class="p">);</span> <span class="n">Status</span> <span class="n">s</span> <span class="o">=</span> <span class="n">sst_file_writer</span><span class="p">.</span><span class="n">Open</span><span class="p">(</span><span class="n">file_path</span><span class="p">);</span> <span class="n">assert</span><span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">ok</span><span class="p">());</span> <span class="c1">// Insert rows into the SST file, note that inserted keys must be </span> <span class="c1">// strictly increasing (based on options.comparator)</span> <span class="k">for</span> <span class="p">(...)</span> <span class="p">{</span> <span class="n">s</span> <span class="o">=</span> <span class="n">sst_file_writer</span><span class="p">.</span><span class="n">Add</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="n">value</span><span class="p">);</span> <span class="n">assert</span><span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">ok</span><span class="p">());</span> <span class="p">}</span> <span class="c1">// Ingest the external SST file into the DB</span> <span class="n">s</span> <span class="o">=</span> <span class="n">db_</span><span class="o">-></span><span class="n">IngestExternalFile</span><span class="p">({</span><span class="s">"/home/usr/file1.sst"</span><span class="p">},</span> <span class="n">IngestExternalFileOptions</span><span class="p">());</span> <span class="n">assert</span><span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">ok</span><span class="p">());</span> </pre></td></tr></tbody></table></code></pre></div></div> <p>You can find more details about how to generate SST files and ingest them into RocksDB on this <a href="https://github.com/facebook/rocksdb/wiki/Creating-and-Ingesting-SST-files">wiki page</a>.</p> <h2 id="use-cases">Use cases</h2> <p>There are multiple use cases where bulkloading could be useful, for example:</p> <ul> <li>Generating SST files in offline jobs in Hadoop, then downloading and ingesting the SST files into RocksDB</li> <li>Migrating shards between machines by dumping a key-range into an SST file and loading the file on a different machine</li> <li>Migrating from a different storage engine (InnoDB to RocksDB migration in MyRocks)</li> </ul> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: 
center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/100003482360101/picture/" alt="" title="" /> </div> <p class="post-authorName">Maysam Yabandeh</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2017/02/07/rocksdb-5-1-2-released.html">RocksDB 5.1.2 Released!</a></h1> <p class="post-meta">Posted February 07, 2017</p> </header> <article class="post-content"> <h3 id="public-api-change">Public API Change</h3> <ul> <li>Support dynamically changing the <code class="language-plaintext highlighter-rouge">delete_obsolete_files_period_micros</code> option via SetDBOptions().</li> <li>Added EventListener::OnExternalFileIngested, which will be called when IngestExternalFile() adds a file successfully.</li> <li>BackupEngine::Open and BackupEngineReadOnly::Open now always return error statuses matching those of the backup Env.</li> </ul> <h3 id="bug-fixes">Bug Fixes</h3> <ul> <li>Fix the bug that, if 2PC is enabled, checkpoints may lose some recent transactions.</li> <li>When file copying is needed during checkpoint creation or bulk loading, fsync the file after copying.</li> </ul> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/100000476362039/picture/" alt="" title="" /> </div> <p class="post-authorName">Yi Wu</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2017/01/06/rocksdb-5-0-1-released.html">RocksDB 5.0.1 Released!</a></h1> <p class="post-meta">Posted January 06, 2017</p> </header> <article class="post-content"> <h3 id="public-api-change">Public API Change</h3> <ul> <li>Options::max_bytes_for_level_multiplier is now a double along with all getters and setters.</li> <li>Support dynamically changing the <code class="language-plaintext highlighter-rouge">delayed_write_rate</code> and <code class="language-plaintext highlighter-rouge">max_total_wal_size</code> options via SetDBOptions().</li> <li>Introduce DB::DeleteRange for optimized deletion of large ranges of contiguous keys.</li> <li>Options::allow_concurrent_memtable_write and Options::enable_write_thread_adaptive_yield are now true by default.</li> <li>Remove Tickers::SEQUENCE_NUMBER to avoid confusion if the statistics object is shared among RocksDB instances. Alternatively DB::GetLatestSequenceNumber() can be used to get the same value.</li> <li>Options.level0_stop_writes_trigger default value changes from 24 to 32.</li> <li>New compaction filter API: CompactionFilter::FilterV2(). Allows dropping ranges of keys.</li> <li>Removed flashcache support.</li> <li>DB::AddFile() is deprecated and is replaced with DB::IngestExternalFile(). DB::IngestExternalFile() removes all the restrictions that existed for DB::AddFile.</li> </ul> <h3 id="new-features">New Features</h3> <ul> <li>Add the avoid_flush_during_shutdown option, which speeds up DB shutdown by not flushing unpersisted data (i.e. with disableWAL = true). Unpersisted data will be lost. The option is dynamically changeable via SetDBOptions().</li> <li>Add memtable_insert_with_hint_prefix_extractor option. 
The option is meant to reduce CPU usage for inserting keys into the memtable, if keys can be grouped by prefix and inserts within each prefix are sequential or almost sequential. See include/rocksdb/options.h for more details.</li> <li>Add LuaCompactionFilter in utilities. This allows developers to write compaction filters in Lua. To use this feature, LUA_PATH needs to be set to the root directory of Lua.</li> <li>No longer populate the “LATEST_BACKUP” file in the backup directory, which formerly contained the number of the latest backup. The latest backup can be determined by finding the highest numbered file in the “meta/” subdirectory.</li> </ul> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/9805119/picture/" alt="" title="" /> </div> <p class="post-authorName">Siying Dong</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2016/09/28/rocksdb-4-11-2-released.html">RocksDB 4.11.2 Released!</a></h1> <p class="post-meta">Posted September 28, 2016</p> </header> <article class="post-content"> <p>We abandoned the 4.10.x release candidates and went directly from 4.9 to 4.11.2, to make sure the latest release is stable. In 4.11.2, we fixed several data corruption related bugs introduced in 4.9.0.</p> <h2 id="4112-9152016">4.11.2 (9/15/2016)</h2> <h3 id="bug-fixes">Bug fixes</h3> <ul> <li>Segfault when failing to open an SST file for read-ahead iterators.</li> <li>WAL without data for all CFs is not deleted after recovery.</li> </ul> <div class="read-more"> <a href="http://rocksdb.org/blog/2016/09/28/rocksdb-4-11-2-released.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/100000476362039/picture/" alt="" title="" /> </div> <p class="post-authorName">Yi Wu</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2016/07/26/rocksdb-4-8-released.html">RocksDB 4.8 Released!</a></h1> <p class="post-meta">Posted July 26, 2016</p> </header> <article class="post-content"> <h2 id="480-522016">4.8.0 (5/2/2016)</h2> <h3 id="public-api-change"><a href="https://github.com/facebook/rocksdb/blob/main/HISTORY.md#public-api-change-1"></a>Public API Change</h3> <ul> <li>Allow a preset compression dictionary for improved compression of block-based tables. This is supported for zlib, zstd, and lz4. The compression dictionary’s size is configurable via CompressionOptions::max_dict_bytes.</li> <li>Delete deprecated classes for creating backups (BackupableDB) and restoring from backups (RestoreBackupableDB). Now, BackupEngine should be used for creating backups, and BackupEngineReadOnly should be used for restorations. For more details, see <a href="https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB%3F">https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB%3F</a></li> <li>Expose estimate of per-level compression ratio via DB property: “rocksdb.compression-ratio-at-levelN”.</li> <li>Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will also be called on failure. 
Users can check the creation status via TableFileCreationInfo::status.</li> </ul> <h3 id="new-features"><a href="https://github.com/facebook/rocksdb/blob/main/HISTORY.md#new-features-2"></a>New Features</h3> <ul> <li>Add ReadOptions::readahead_size. If non-zero, NewIterator will create a new table reader which performs reads of the given size.</li> </ul> <p><br /></p> <div class="read-more"> <a href="http://rocksdb.org/blog/2016/07/26/rocksdb-4-8-released.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/9805119/picture/" alt="" title="" /> </div> <p class="post-authorName">Siying Dong</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2016/04/26/rocksdb-4-5-1-released.html">RocksDB 4.5.1 Released!</a></h1> <p class="post-meta">Posted April 26, 2016</p> </header> <article class="post-content"> <h2 id="451-3252016">4.5.1 (3/25/2016)</h2> <h3 id="bug-fixes">Bug Fixes</h3> <ul> <li> Fix failures caused by the destruction order of singleton objects.</li> </ul> <p><br /></p> <h2 id="450-252016">4.5.0 (2/5/2016)</h2> <h3 id="public-api-changes">Public API Changes</h3> <ul> <li>Add a new perf context level between kEnableCount and kEnableTime. Level 2 now does not include timers for mutexes.</li> <li>Statistics of mutex operation durations will not be measured by default. If you want to have them enabled, you need to set Statistics::stats_level_ to kAll.</li> <li>DBOptions::delete_scheduler and NewDeleteScheduler() are removed; please use DBOptions::sst_file_manager and NewSstFileManager() instead</li> </ul> <h3 id="new-features">New Features</h3> <ul> <li>The ldb tool now supports operations on non-default column families.</li> <li>Add kPersistedTier to ReadTier. This option allows Get and MultiGet to read only the persisted data and skip mem-tables if writes were done with disableWAL = true.</li> <li>Add DBOptions::sst_file_manager. Use NewSstFileManager() in include/rocksdb/sst_file_manager.h to create an SstFileManager that can be used to track the total size of SST files and control the SST file deletion rate.</li> </ul> <p><br /></p> <div class="read-more"> <a href="http://rocksdb.org/blog/2016/04/26/rocksdb-4-5-1-released.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/1619020986/picture/" alt="" title="" /> </div> <p class="post-authorName">Yueh-Hsuan Chiang</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2016/03/07/rocksdb-options-file.html">RocksDB Options File</a></h1> <p class="post-meta">Posted March 07, 2016</p> </header> <article class="post-content"> <p>In RocksDB 4.3, we added a new set of features that makes managing RocksDB options easier. 
Specifically:</p> <ul> <li> <p><strong>Persisting Options Automatically</strong>: Each RocksDB database will now automatically persist its current set of options into an INI file on every successful call of DB::Open(), SetOptions(), and CreateColumnFamily() / DropColumnFamily().</p> </li> <li> <p><strong>Load Options from File</strong>: We added <a href="https://github.com/facebook/rocksdb/blob/4.3.fb/include/rocksdb/utilities/options_util.h#L48-L58">LoadLatestOptions() / LoadOptionsFromFile()</a> that enables developers to construct a RocksDB options object from an options file.</p> </li> <li> <p><strong>Sanity Check Options</strong>: We added <a href="https://github.com/facebook/rocksdb/blob/4.3.fb/include/rocksdb/utilities/options_util.h#L64-L77">CheckOptionsCompatibility</a> that performs a compatibility check on two sets of RocksDB options.</p> </li> </ul> <div class="read-more"> <a href="http://rocksdb.org/blog/2016/03/07/rocksdb-options-file.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2016/02/25/rocksdb-ama.html">RocksDB AMA</a></h1> <p class="post-meta">Posted February 25, 2016</p> </header> <article class="post-content"> <p>RocksDB developers are doing a Reddit Ask-Me-Anything now at 10AM – 11AM PDT! We welcome you to stop by and ask any RocksDB related questions, including existing / upcoming features, tuning tips, or database design.</p> <p>Here are some enhancements that we’d like to focus on over the next six months:</p> <ul> <li>2-Phase Commit</li> <li>Lua support in some custom functions</li> <li>Backup and repair tools</li> <li>Direct I/O to bypass OS cache</li> <li>RocksDB Java API</li> </ul> <p><a href="https://www.reddit.com/r/IAmA/comments/47k1si/we_are_rocksdb_developers_ask_us_anything/">https://www.reddit.com/r/IAmA/comments/47k1si/we_are_rocksdb_developers_ask_us_anything/</a></p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/9805119/picture/" alt="" title="" /> </div> <p class="post-authorName">Siying Dong</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2016/02/24/rocksdb-4-2-release.html">RocksDB 4.2 Release!</a></h1> <p class="post-meta">Posted February 24, 2016</p> </header> <article class="post-content"> <p>New RocksDB release - 4.2!</p> <p><strong>New Features</strong></p> <ol> <li> <p>Introduce CreateLoggerFromOptions(); this function creates a Logger for the provided DBOptions.</p> </li> <li> <p>Add GetAggregatedIntProperty(), which returns the sum of the GetIntProperty of all the column families.</p> </li> <li> <p>Add MemoryUtil in rocksdb/utilities/memory.h. It currently offers a way to get the memory usage by type from a list of RocksDB instances.</p>
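<p>A hedged usage sketch (assuming the <code class="language-plaintext highlighter-rouge">MemoryUtil::GetApproximateMemoryUsageByType</code> API; <code class="language-plaintext highlighter-rouge">db</code> and <code class="language-plaintext highlighter-rouge">cache</code> are pre-existing instances, and the exact header name may differ by version):</p> <div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <map>
#include "rocksdb/utilities/memory_util.h"

std::map<rocksdb::MemoryUtil::UsageType, uint64_t> usage_by_type;
rocksdb::Status s = rocksdb::MemoryUtil::GetApproximateMemoryUsageByType(
    {db} /* list of rocksdb instances */,
    {cache.get()} /* caches, each counted once */,
    &usage_by_type);
// e.g. total memtable usage across the given instances:
uint64_t memtable_total =
    usage_by_type[rocksdb::MemoryUtil::kMemTableTotal];
</code></pre></div></div> 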
<div class="read-more"> <a href="http://rocksdb.org/blog/2016/03/07/rocksdb-options-file.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2016/02/25/rocksdb-ama.html">RocksDB AMA</a></h1> <p class="post-meta">Posted February 25, 2016</p> </header> <article class="post-content"> <p>RocksDB developers are doing a Reddit Ask-Me-Anything now at 10AM – 11AM PDT! We welcome you to stop by and ask any RocksDB-related questions, including existing / upcoming features, tuning tips, or database design.</p> <p>Here are some enhancements that we’d like to focus on over the next six months:</p> <ul> <li>2-Phase Commit</li> <li>Lua support in some custom functions</li> <li>Backup and repair tools</li> <li>Direct I/O to bypass OS cache</li> <li>RocksDB Java API</li> </ul> <p><a href="https://www.reddit.com/r/IAmA/comments/47k1si/we_are_rocksdb_developers_ask_us_anything/">https://www.reddit.com/r/IAmA/comments/47k1si/we_are_rocksdb_developers_ask_us_anything/</a></p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/9805119/picture/" alt="" title="" /> </div> <p class="post-authorName">Siying Dong</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2016/02/24/rocksdb-4-2-release.html">RocksDB 4.2 Release!</a></h1> <p class="post-meta">Posted February 24, 2016</p> </header> <article class="post-content"> <p>New RocksDB release - 4.2!</p> <p><strong>New Features</strong></p> <ol> <li> <p>Introduce CreateLoggerFromOptions(), which creates a Logger for the provided DBOptions.</p> </li> <li> <p>Add GetAggregatedIntProperty(), which returns the sum of the GetIntProperty values of all the column families.</p> </li> <li> <p>Add MemoryUtil in rocksdb/utilities/memory.h. It currently offers a way to get the memory usage by type from a list of RocksDB instances.</p> </li> </ol> <div class="read-more"> <a href="http://rocksdb.org/blog/2016/02/24/rocksdb-4-2-release.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/9805119/picture/" alt="" title="" /> </div> <p class="post-authorName">Siying Dong</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2016/01/29/compaction_pri.html">Option of Compaction Priority</a></h1> <p class="post-meta">Posted January 29, 2016</p> </header> <article class="post-content"> <p>The most popular compaction style in RocksDB is level-based compaction, an improved version of LevelDB’s compaction algorithm. Pages 9-16 of these <a href="https://github.com/facebook/rocksdb/blob/gh-pages/talks/2015-09-29-HPTS-Siying-RocksDB.pdf">slides</a> give an illustrated introduction to this compaction style. The basic idea is that data is organized in multiple levels with exponentially increasing target sizes. Except for a special level 0, every level is key-range partitioned into many files. When the size of a level exceeds its target size, we pick one or more of its files and merge them into the next level.</p>
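<p>The option the full post discusses, <code class="language-plaintext highlighter-rouge">compaction_pri</code>, controls which file within a level is picked first for such a merge. Below is a minimal sketch of setting it; this is our own example, assuming a release that ships the CompactionPri enum, and the DB path is made up:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include "rocksdb/db.h"
#include "rocksdb/options.h"

using namespace rocksdb;

int main() {
  Options options;
  options.create_if_missing = true;
  // Level-based compaction is the default style; compaction_pri decides
  // which file in a level gets compacted into the next level first.
  options.compaction_style = kCompactionStyleLevel;
  options.compaction_pri = kOldestSmallestSeqFirst;

  DB* db;
  Status s = DB::Open(options, "/tmp/rocksdb_cpri", &db);
  if (!s.ok()) return 1;
  delete db;
  return 0;
}
</code></pre></div></div>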
<div class="read-more"> <a href="http://rocksdb.org/blog/2016/01/29/compaction_pri.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/9805119/picture/" alt="" title="" /> </div> <p class="post-authorName">Siying Dong</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2015/11/16/analysis-file-read-latency-by-level.html">Analysis File Read Latency by Level</a></h1> <p class="post-meta">Posted November 16, 2015</p> </header> <article class="post-content"> <p>In many use cases of RocksDB, people rely on the OS page cache for caching compressed data. With this approach, verifying the effectiveness of OS page caching is challenging, because the file system is a black box to users.</p> <p>As an example, a user can tune the DB as follows: use level-based compaction, with L1 - L4 sizes of 1GB, 10GB, 100GB and 1TB, and reserve about 20GB of memory as OS page cache, expecting levels 0, 1 and 2 to be mostly cached in memory, leaving only reads from levels 3 and 4 requiring disk I/Os. However, in practice, it’s not easy to verify whether the OS page cache does exactly what we expect. For example, if we end up doing 4 instead of 2 I/Os per query, it’s not easy for users to figure out whether that is because of the efficiency of the OS page cache or because multiple blocks are read for one level. Analysis like this is especially important if users run RocksDB on hard disk drives, because the latency gap between hard drives and memory is much larger than it is for flash-based SSDs.</p> <div class="read-more"> <a href="http://rocksdb.org/blog/2015/11/16/analysis-file-read-latency-by-level.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/100008352697325/picture/" alt="" title="" /> </div> <p class="post-authorName">Venkatesh Radhakrishnan</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2015/11/10/use-checkpoints-for-efficient-snapshots.html">Use Checkpoints for Efficient Snapshots</a></h1> <p class="post-meta">Posted November 10, 2015</p> </header> <article class="post-content"> <p><strong>Checkpoint</strong> is a feature in RocksDB which provides the ability to take a snapshot of a running RocksDB database in a separate directory. A checkpoint can be used as a point-in-time snapshot, which can be opened read-only to query rows as of that point in time, or as a writeable snapshot by opening it read-write. Checkpoints can be used for both full and incremental backups.</p>
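<p>As a minimal sketch of taking a checkpoint with the utility from rocksdb/utilities/checkpoint.h (our own example; the directory names are made up):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/utilities/checkpoint.h"

using namespace rocksdb;

int main() {
  Options options;
  options.create_if_missing = true;
  DB* db;
  Status s = DB::Open(options, "/tmp/rocksdb_src", &db);
  if (!s.ok()) return 1;

  // Bind a Checkpoint object to the running database ...
  Checkpoint* checkpoint;
  s = Checkpoint::Create(db, &checkpoint);
  if (!s.ok()) return 1;

  // ... and materialize the snapshot in a separate directory. The
  // directory must not exist yet; it can later be opened as a normal DB,
  // read-only or read-write.
  s = checkpoint->CreateCheckpoint("/tmp/rocksdb_checkpoint");

  delete checkpoint;
  delete db;
  return s.ok() ? 0 : 1;
}
</code></pre></div></div>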
<div class="read-more"> <a href="http://rocksdb.org/blog/2015/11/10/use-checkpoints-for-efficient-snapshots.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/1619020986/picture/" alt="" title="" /> </div> <p class="post-authorName">Yueh-Hsuan Chiang</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2015/10/27/getthreadlist.html">GetThreadList</a></h1> <p class="post-meta">Posted October 27, 2015</p> </header> <article class="post-content"> <p>We recently added a new API, called <code class="language-plaintext highlighter-rouge">GetThreadList()</code>, that exposes RocksDB background thread activity. With this feature, developers can obtain real-time information about currently running compactions and flushes, such as the input / output size, elapsed time, and the number of bytes written so far. Below is an example output of <code class="language-plaintext highlighter-rouge">GetThreadList</code>. To better illustrate the example, we have put a sample output of <code class="language-plaintext highlighter-rouge">GetThreadList</code> into a table where each column represents a thread status:</p>
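<p>(The table itself is in the full post.) As a sketch of how the API is driven, here is our own example; it assumes the enable_thread_tracking option and the ThreadStatus helpers found in recent headers, and the DB path is made up:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <cstdio>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/env.h"
#include "rocksdb/options.h"
#include "rocksdb/thread_status.h"

using namespace rocksdb;

int main() {
  Options options;
  options.create_if_missing = true;
  // Thread tracking must be switched on for GetThreadList() to report status.
  options.enable_thread_tracking = true;

  DB* db;
  Status s = DB::Open(options, "/tmp/rocksdb_threads", &db);
  if (!s.ok()) return 1;

  // Poll background activity: one ThreadStatus entry per thread.
  std::vector<ThreadStatus> thread_list;
  db->GetEnv()->GetThreadList(&thread_list);
  for (const ThreadStatus& ts : thread_list) {
    std::printf("db=%s cf=%s op=%s\n", ts.db_name.c_str(), ts.cf_name.c_str(),
                ThreadStatus::GetOperationName(ts.operation_type).c_str());
  }

  delete db;
  return 0;
}
</code></pre></div></div>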
<div class="read-more"> <a href="http://rocksdb.org/blog/2015/10/27/getthreadlist.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/9805119/picture/" alt="" title="" /> </div> <p class="post-authorName">Siying Dong</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2015/07/23/dynamic-level.html">Dynamic Level Size for Level-Based Compaction</a></h1> <p class="post-meta">Posted July 23, 2015</p> </header> <article class="post-content"> <p>In this article, we follow up on the first part of an answer to one of the questions in our <a href="https://www.reddit.com/r/IAmA/comments/3de3cv/we_are_rocksdb_engineering_team_ask_us_anything/ct4a8tb">AMA</a>: dynamic level size in level-based compaction.</p> <div class="read-more"> <a href="http://rocksdb.org/blog/2015/07/23/dynamic-level.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <p class="post-authorName">Dmitri Smirnov</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2015/07/22/rocksdb-is-now-available-in-windows-platform.html">RocksDB is now available in Windows Platform</a></h1> <p class="post-meta">Posted July 22, 2015</p> </header> <article class="post-content"> <p>Over the past 6 months we have seen a number of use cases where RocksDB is successfully used by the community and various companies to achieve high throughput and volume in a modern server environment.</p> <p>We at Microsoft Bing could not be left behind. As a result, we are happy to <a href="http://bit.ly/1OmWBT9">announce</a> the availability of the Windows port, created here at Microsoft, which we intend to use as a storage option for one of our key/value data stores.</p> <div class="read-more"> <a href="http://rocksdb.org/blog/2015/07/22/rocksdb-is-now-available-in-windows-platform.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/706165749/picture/" alt="" title="" /> </div> <p class="post-authorName">Igor Canadi</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2015/07/17/spatial-indexing-in-rocksdb.html">Spatial indexing in RocksDB</a></h1> <p class="post-meta">Posted July 17, 2015</p> </header> <article class="post-content"> <p>About a year ago, there was a need to develop a spatial database at Facebook. We needed to store and index Earth’s map data. Before building our own, we looked at the existing spatial databases. They were all very good technologies, but also general purpose. Since we could sacrifice a general-purpose API, we thought we could build a more performant database, as it would be designed specifically for our use case. Furthermore, we decided to build the spatial database on top of RocksDB, because we have a lot of operational experience with running and tuning RocksDB at a large scale.</p> <div class="read-more"> <a href="http://rocksdb.org/blog/2015/07/17/spatial-indexing-in-rocksdb.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/706165749/picture/" alt="" title="" /> </div> <p class="post-authorName">Igor Canadi</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2015/07/15/rocksdb-2015-h2-roadmap.html">RocksDB 2015 H2 roadmap</a></h1> <p class="post-meta">Posted July 15, 2015</p> </header> <article class="post-content"> <p>Every 6 months, the RocksDB team gets together to prioritize the work ahead of us. We just went through this exercise and we wanted to share the results with the community. Here’s what the RocksDB team will be focusing on for the next 6 months:</p> <div class="read-more"> <a href="http://rocksdb.org/blog/2015/07/15/rocksdb-2015-h2-roadmap.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/706165749/picture/" alt="" title="" /> </div> <p class="post-authorName">Igor Canadi</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2015/06/12/rocksdb-in-osquery.html">RocksDB in osquery</a></h1> <p class="post-meta">Posted June 12, 2015</p> </header> <article class="post-content"> <p>Check out <a href="https://code.facebook.com/posts/1411870269134471/how-rocksdb-is-used-in-osquery/">this</a> blog post by <a href="https://www.facebook.com/mike.arpaia">Mike Arpaia</a> and <a href="https://www.facebook.com/treeded">Ted Reed</a> about how osquery leverages RocksDB to build an embedded pub-sub system. This article is a great read and contains insights on how to properly use RocksDB.</p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/706165749/picture/" alt="" title="" /> </div> <p class="post-authorName">Igor Canadi</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2015/04/22/integrating-rocksdb-with-mongodb-2.html">Integrating RocksDB with MongoDB</a></h1> <p class="post-meta">Posted April 22, 2015</p> </header> <article class="post-content"> <p>Over the last couple of years, we have been busy integrating RocksDB with various services here at Facebook that needed to store key-value pairs locally.
We have also seen other companies using RocksDB as the local storage component of their distributed systems.</p> <div class="read-more"> <a href="http://rocksdb.org/blog/2015/04/22/integrating-rocksdb-with-mongodb-2.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/9805119/picture/" alt="" title="" /> </div> <p class="post-authorName">Siying Dong</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2015/02/27/write-batch-with-index.html">WriteBatchWithIndex: Utility for Implementing Read-Your-Own-Writes</a></h1> <p class="post-meta">Posted February 27, 2015</p> </header> <article class="post-content"> <p>RocksDB can be used as the storage engine of a higher-level database. In fact, we are currently plugging RocksDB into MySQL and MongoDB as one of their storage engines. RocksDB can help with guaranteeing some of the ACID properties: durability is guaranteed by RocksDB by design; consistency and isolation need to be enforced by concurrency controls on top of RocksDB; and atomicity can be implemented by committing a transaction’s writes in one write batch to RocksDB at the end.</p> <div class="read-more"> <a href="http://rocksdb.org/blog/2015/02/27/write-batch-with-index.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/8649950/picture/" alt="" title="" /> </div> <p class="post-authorName">Leonidas Galanis</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2015/02/24/reading-rocksdb-options-from-a-file.html">Reading RocksDB options from a file</a></h1> <p class="post-meta">Posted February 24, 2015</p> </header> <article class="post-content"> <p>RocksDB options can be provided to RocksDB using a file or any string. The format is straightforward: <code class="language-plaintext highlighter-rouge">write_buffer_size=1024;max_write_buffer_number=2</code>. Any whitespace around <code class="language-plaintext highlighter-rouge">=</code> and <code class="language-plaintext highlighter-rouge">;</code> is OK. Moreover, options can be nested as necessary. For example, <code class="language-plaintext highlighter-rouge">BlockBasedTableOptions</code> can be nested as follows: <code class="language-plaintext highlighter-rouge">write_buffer_size=1024; max_write_buffer_number=2; block_based_table_factory={block_size=4k};</code>. Similarly, any whitespace around <code class="language-plaintext highlighter-rouge">{</code> or <code class="language-plaintext highlighter-rouge">}</code> is OK. Here is what it looks like in code:</p>
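<p>The full post shows the helper from that era; as a stand-in, here is a minimal sketch using GetColumnFamilyOptionsFromString() from rocksdb/convenience.h as it appears in recent releases (an assumption on our part), parsing the same string shown above:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include <string>

#include "rocksdb/convenience.h"
#include "rocksdb/options.h"

using namespace rocksdb;

int main() {
  ColumnFamilyOptions base;
  ColumnFamilyOptions cf_opts;

  // Parse the option string shown above, including the nested
  // block-based table options.
  Status s = GetColumnFamilyOptionsFromString(
      base,
      "write_buffer_size=1024;max_write_buffer_number=2;"
      "block_based_table_factory={block_size=4k};",
      &cf_opts);
  if (!s.ok()) return 1;
  return cf_opts.write_buffer_size == 1024 ? 0 : 1;
}
</code></pre></div></div>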
<div class="read-more"> <a href="http://rocksdb.org/blog/2015/02/24/reading-rocksdb-options-from-a-file.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/8649950/picture/" alt="" title="" /> </div> <p class="post-authorName">Leonidas Galanis</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2015/01/16/migrating-from-leveldb-to-rocksdb-2.html">Migrating from LevelDB to RocksDB</a></h1> <p class="post-meta">Posted January 16, 2015</p> </header> <article class="post-content"> <p>If you have an existing application that uses LevelDB and would like to migrate to RocksDB, one problem you need to overcome is mapping the options for LevelDB to proper options for RocksDB. As of release 3.9 this can be done automatically by using our option conversion utility found in rocksdb/utilities/leveldb_options.h. What is needed is to first replace <code class="language-plaintext highlighter-rouge">leveldb::Options</code> with <code class="language-plaintext highlighter-rouge">rocksdb::LevelDBOptions</code>. Then, use <code class="language-plaintext highlighter-rouge">rocksdb::ConvertOptions()</code> to convert the <code class="language-plaintext highlighter-rouge">LevelDBOptions</code> struct into appropriate RocksDB options. Here is an example:</p>
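<p>A minimal sketch of that conversion (our own example; the option values and DB path are made up):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include "rocksdb/db.h"
#include "rocksdb/utilities/leveldb_options.h"

using namespace rocksdb;

int main() {
  // Fill in the same fields you were using with leveldb::Options.
  LevelDBOptions leveldb_options;
  leveldb_options.create_if_missing = true;
  leveldb_options.write_buffer_size = 4 * 1024 * 1024;
  leveldb_options.block_size = 4096;

  // Map them onto equivalent RocksDB options.
  Options options = ConvertOptions(leveldb_options);

  DB* db;
  Status s = DB::Open(options, "/tmp/rocksdb_from_leveldb", &db);
  if (!s.ok()) return 1;
  delete db;
  return 0;
}
</code></pre></div></div>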
<div class="read-more"> <a href="http://rocksdb.org/blog/2015/01/16/migrating-from-leveldb-to-rocksdb-2.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/634570164/picture/" alt="" title="" /> </div> <p class="post-authorName">Lei Jin</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2014/09/15/rocksdb-3-5-release.html">RocksDB 3.5 Release!</a></h1> <p class="post-meta">Posted September 15, 2014</p> </header> <article class="post-content"> <p>New RocksDB release - 3.5!</p> <p><strong>New Features</strong></p> <ol> <li> <p>Add include/utilities/write_batch_with_index.h, providing a utility class to query data out of a WriteBatch while building it.</p> </li> <li> <p>New ReadOptions.total_order_seek to force a total-order seek when a block-based table is built with a hash index.</p> </li> </ol> <div class="read-more"> <a href="http://rocksdb.org/blog/2014/09/15/rocksdb-3-5-release.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/100006493823622/picture/" alt="" title="" /> </div> <p class="post-authorName">Feng Zhu</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2014/09/12/new-bloom-filter-format.html">New Bloom Filter Format</a></h1> <p class="post-meta">Posted September 12, 2014</p> </header> <article class="post-content"> <h2 id="introduction">Introduction</h2> <p>In this post, we are introducing “full filter block” — a new bloom filter format for <a href="https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format">block based table</a>. This could bring about a 40% improvement for key queries in an in-memory workload (all data stored in memory, files stored in tmpfs/ramfs; see an <a href="https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks">example</a> workload). The main idea behind it is to generate one big filter that covers all the keys in an SST file, to avoid lots of unnecessary memory lookups.</p> <div class="read-more"> <a href="http://rocksdb.org/blog/2014/09/12/new-bloom-filter-format.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/800837305/picture/" alt="" title="" /> </div> <p class="post-authorName">Radheshyam Balasundaram</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2014/09/12/cuckoo.html">Cuckoo Hashing Table Format</a></h1> <p class="post-meta">Posted September 12, 2014</p> </header> <article class="post-content"> <h2 id="introduction">Introduction</h2> <p>We recently introduced a new <a href="http://en.wikipedia.org/wiki/Cuckoo_hashing">Cuckoo Hashing</a> based SST file format which is optimized for fast point lookups. The new format was built for applications which require very high point lookup rates (~4Mqps) in read-only mode but do not use operations like range scan, merge operator, etc. However, the existing RocksDB file formats were built to support range scans and other operations, and the current best point-lookup rate in RocksDB is 1.2 Mqps, given by the <a href="https://github.com/facebook/rocksdb/wiki/PlainTable-Format">PlainTable format</a>. This prompted a hashing-based file format, which we present here. The new table format uses a cache-friendly version of the Cuckoo Hashing algorithm, with only 1 or 2 memory accesses per lookup.</p> <div class="read-more"> <a href="http://rocksdb.org/blog/2014/09/12/cuckoo.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/1619020986/picture/" alt="" title="" /> </div> <p class="post-authorName">Yueh-Hsuan Chiang</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2014/07/29/rocksdb-3-3-release.html">RocksDB 3.3 Release</a></h1> <p class="post-meta">Posted July 29, 2014</p> </header> <article class="post-content"> <p>Check out the new RocksDB release on <a href="https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.3">GitHub</a>!</p> <p>New Features in RocksDB 3.3:</p> <ul> <li> <p><strong>JSON API prototype</strong>.</p> </li> <li> <p><strong>Performance improvement on HashLinkList</strong>: We addressed a performance outlier in HashLinkList caused by skewed buckets by switching the data in a bucket from a linked list to a skip list.
We also added the parameter threshold_use_skiplist in NewHashLinkListRepFactory().</p> </li> </ul> <div class="read-more"> <a href="http://rocksdb.org/blog/2014/07/29/rocksdb-3-3-release.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/634570164/picture/" alt="" title="" /> </div> <p class="post-authorName">Lei Jin</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2014/06/27/rocksdb-3-2-release.html">RocksDB 3.2 release</a></h1> <p class="post-meta">Posted June 27, 2014</p> </header> <article class="post-content"> <p>Check out the new RocksDB release on <a href="https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.2">GitHub</a>!</p> <p>New Features in RocksDB 3.2:</p> <ul> <li> <p>PlainTable now supports a new key encoding: for keys with the same prefix, the prefix is only written once. It can be enabled through the encoding_type parameter of NewPlainTableFactory().</p> </li> <li> <p>Add AdaptiveTableFactory, which is used to convert a DB from PlainTable to BlockBasedTable, or vice versa. It can be created using NewAdaptiveTableFactory().</p> </li> </ul> <div class="read-more"> <a href="http://rocksdb.org/blog/2014/06/27/rocksdb-3-2-release.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/634570164/picture/" alt="" title="" /> </div> <p class="post-authorName">Lei Jin</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2014/06/27/avoid-expensive-locks-in-get.html">Avoid Expensive Locks in Get()</a></h1> <p class="post-meta">Posted June 27, 2014</p> </header> <article class="post-content"> <p>As promised in the previous <a href="/blog/2014/05/14/lock.html">blog post</a>!</p> <p>RocksDB employs a multiversion concurrency control strategy. Before reading data, it needs to grab the current version, which is encapsulated in a data structure called <a href="https://reviews.facebook.net/rROCKSDB1fdb3f7dc60e96394e3e5b69a46ede5d67fb976c">SuperVersion</a>.</p> <div class="read-more"> <a href="http://rocksdb.org/blog/2014/06/27/avoid-expensive-locks-in-get.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/9805119/picture/" alt="" title="" /> </div> <p class="post-authorName">Siying Dong</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2014/06/23/plaintable-a-new-file-format.html">PlainTable — A New File Format</a></h1> <p class="post-meta">Posted June 23, 2014</p> </header> <article class="post-content"> <p>In this post, we are introducing “PlainTable” – a file format we designed for RocksDB, initially to satisfy a production use case at Facebook.</p> <p>Design goals:</p> <ol> <li>All data stored in memory, in files stored in tmpfs/ramfs.
Support DBs larger than 100GB (may be sharded across multiple RocksDB instances).</li> <li>Optimize for <a href="https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Siying-Prefix-Hash.pdf">prefix hashing</a>.</li> <li>Less than or around 1 microsecond average latency for a single Get() or Seek().</li> <li>Minimize memory consumption.</li> <li>Queries efficiently return empty results.</li> </ol> <div class="read-more"> <a href="http://rocksdb.org/blog/2014/06/23/plaintable-a-new-file-format.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/706165749/picture/" alt="" title="" /> </div> <p class="post-authorName">Igor Canadi</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2014/05/22/rocksdb-3-1-release.html">RocksDB 3.1 release</a></h1> <p class="post-meta">Posted May 22, 2014</p> </header> <article class="post-content"> <p>Check out the new release on <a href="https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.1">GitHub</a>!</p> <p>New features in RocksDB 3.1:</p> <ul> <li> <p><a href="https://github.com/facebook/rocksdb/commit/0b3d03d026a7248e438341264b4c6df339edc1d7">Materialized hash index</a></p> </li> <li> <p><a href="https://github.com/facebook/rocksdb/wiki/FIFO-compaction-style">FIFO compaction style</a></p> </li> </ul> <p>We released 3.1 so soon after 3.0 because one of our internal customers needed the materialized hash index.</p> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/706165749/picture/" alt="" title="" /> </div> <p class="post-authorName">Igor Canadi</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2014/05/19/rocksdb-3-0-release.html">RocksDB 3.0 release</a></h1> <p class="post-meta">Posted May 19, 2014</p> </header> <article class="post-content"> <p>Check out the new RocksDB release on <a href="https://github.com/facebook/rocksdb/releases/tag/3.0.fb">GitHub</a>!</p> <p>New features in RocksDB 3.0:</p> <ul> <li> <p><a href="https://github.com/facebook/rocksdb/wiki/Column-Families">Column Family support</a></p> </li> <li> <p><a href="https://github.com/facebook/rocksdb/commit/0afc8bc29a5800e3212388c327c750d32e31f3d6">Ability to choose a different checksum function</a></p> </li> <li> <p>Deprecated ReadOptions::prefix_seek and ReadOptions::prefix</p> </li> </ul> <div class="read-more"> <a href="http://rocksdb.org/blog/2014/05/19/rocksdb-3-0-release.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/9805119/picture/" alt="" title="" /> </div> <p class="post-authorName">Siying Dong</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2014/05/14/lock.html">Reducing Lock Contention in RocksDB</a></h1> <p class="post-meta">Posted May 14, 2014</p> </header> <article class="post-content"> <p>In this post,
we briefly introduce the recent improvements we made to RocksDB to reduce the cost of lock contention.</p> <p>RocksDB has a simple thread synchronization mechanism (see the <a href="https://github.com/facebook/rocksdb/wiki/Rocksdb-Architecture-Guide">RocksDB Architecture Guide</a> to understand terms used below, like SST tables or mem tables). SST tables are immutable after being written, and mem tables are lock-free data structures supporting a single writer and multiple readers. There is only one major lock, the DB mutex (DBImpl.mutex_), protecting all the meta operations, including:</p> <div class="read-more"> <a href="http://rocksdb.org/blog/2014/05/14/lock.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/634570164/picture/" alt="" title="" /> </div> <p class="post-authorName">Lei Jin</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2014/04/21/indexing-sst-files-for-better-lookup-performance.html">Indexing SST Files for Better Lookup Performance</a></h1> <p class="post-meta">Posted April 21, 2014</p> </header> <article class="post-content"> <p>For a <code class="language-plaintext highlighter-rouge">Get()</code> request, RocksDB goes through the mutable memtable, the list of immutable memtables, and the SST files to look up the target key. SST files are organized in levels.</p> <p>On level 0, files are sorted based on the time they were flushed. Their key ranges (as defined by FileMetaData.smallest and FileMetaData.largest) mostly overlap with each other, so a lookup needs to check every L0 file.</p> <div class="read-more"> <a href="http://rocksdb.org/blog/2014/04/21/indexing-sst-files-for-better-lookup-performance.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/706165749/picture/" alt="" title="" /> </div> <p class="post-authorName">Igor Canadi</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2014/04/07/rocksdb-2-8-release.html">RocksDB 2.8 release</a></h1> <p class="post-meta">Posted April 07, 2014</p> </header> <article class="post-content"> <p>Check out the new RocksDB 2.8 release on <a href="https://github.com/facebook/rocksdb/releases/tag/2.8.fb">GitHub</a>.</p> <p>RocksDB 2.8 is mostly focused on improving performance for in-memory workloads.
We are seeing read QPS as high as 5M (we will write a separate blog post on this).</p> <div class="read-more"> <a href="http://rocksdb.org/blog/2014/04/07/rocksdb-2-8-release.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/100000739847320/picture/" alt="" title="" /> </div> <p class="post-authorName">Xing Jin</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2014/04/02/the-1st-rocksdb-local-meetup-held-on-march-27-2014.html">The 1st RocksDB Local Meetup Held on March 27, 2014</a></h1> <p class="post-meta">Posted April 02, 2014</p> </header> <article class="post-content"> <p>On Mar 27, 2014, the RocksDB team @ Facebook held the 1st RocksDB local meetup at FB HQ (Menlo Park, California). We invited around 80 guests from 20+ local companies, including LinkedIn, Twitter, Dropbox, Square, Pinterest, MapR, Microsoft and IBM. In the end, around 50 guests showed up, a roughly 60% show-up rate.</p> <div class="read-more"> <a href="http://rocksdb.org/blog/2014/04/02/the-1st-rocksdb-local-meetup-held-on-march-27-2014.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/706165749/picture/" alt="" title="" /> </div> <p class="post-authorName">Igor Canadi</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2014/03/27/how-to-persist-in-memory-rocksdb-database.html">How to persist in-memory RocksDB database?</a></h1> <p class="post-meta">Posted March 27, 2014</p> </header> <article class="post-content"> <p>In recent months, we have focused on optimizing RocksDB for in-memory workloads. With growing RAM sizes and strict low-latency requirements, lots of applications decide to keep their entire data in memory. Running an in-memory database with RocksDB is easy – just mount your RocksDB directory on tmpfs or ramfs [1]. Even if the process crashes, RocksDB can recover all of your data from the in-memory filesystem. However, what happens if the machine reboots?</p> <div class="read-more"> <a href="http://rocksdb.org/blog/2014/03/27/how-to-persist-in-memory-rocksdb-database.html" > Read More </a> </div> </article> </div> <div class="post"> <header class="post-header"> <div style="display: flex; align-content: center; align-items: center; justify-content: center"> <div style="padding: 16px; display: inline-block; text-align: center"> <div class="authorPhoto"> <img src="http://graph.facebook.com/706165749/picture/" alt="" title="" /> </div> <p class="post-authorName">Igor Canadi</p> </div> </div> <h1 class="post-title"><a href="http://rocksdb.org/blog/2014/03/27/how-to-backup-rocksdb.html">How to backup RocksDB?</a></h1> <p class="post-meta">Posted March 27, 2014</p> </header> <article class="post-content"> <p>In RocksDB, we have implemented an easy way to back up your DB.
Here is a simple example:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include "rocksdb/db.h"
#include "utilities/backupable_db.h"

using namespace rocksdb;

DB* db;
DB::Open(Options(), "/tmp/rocksdb", &db);
BackupableDB* backupable_db = new BackupableDB(db, BackupableDBOptions("/tmp/rocksdb_backup"));
backupable_db->Put(...); // do your thing
backupable_db->CreateNewBackup();
delete backupable_db; // no need to also delete db
</code></pre></div></div>
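<p>To close the loop, here is a sketch of the matching restore path. It assumes the RestoreBackupableDB class that shipped alongside BackupableDB in the same era of utilities/backupable_db.h:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="rougeHighlight"><code>#include "rocksdb/env.h"
#include "utilities/backupable_db.h"

using namespace rocksdb;

int main() {
  // Restore the most recent backup into the original DB directory;
  // both the DB dir and the WAL dir point at /tmp/rocksdb here.
  RestoreBackupableDB* restore = new RestoreBackupableDB(
      Env::Default(), BackupableDBOptions("/tmp/rocksdb_backup"));
  Status s = restore->RestoreDBFromLatestBackup("/tmp/rocksdb", "/tmp/rocksdb");
  delete restore;
  return s.ok() ? 0 : 1;
}
</code></pre></div></div>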
href="https://twitter.com/rocksdb" target="_blank">Twitter</a> <a class="footerLink" href="https://opensource.fb.com/legal/terms" target="_blank">Terms of Use</a> <a class="footerLink" href="https://opensource.fb.com/legal/privacy" target="_blank">Privacy Policy</a> </div> <div class="footerSection rightAlign"> Copyright © 2022 Meta Platforms, Inc. </div> </div> </div> </div> <script> (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-49459723-1', 'auto'); ga('send', 'pageview'); </script> </div> </body> </html>