<!DOCTYPE html> <html lang="en"> <head> <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start': new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0], j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src= 'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f); })(window,document,'script','dataLayer','GTM-WVF23W3');</script> <meta charset="utf-8"> <meta content="initial-scale=1, minimum-scale=1, width=device-width" name="viewport"> <title>Google SRE - SRE Book for Must Techniques &amp; Practices</title> <meta name="description" content="Explore reliable and scalable systems with &quot;Google SRE - A Comprehensive Guide to Site Reliability Engineering: Essential Techniques and Practices.&quot;"> <meta name="referrer" content="no-referrer" /> <link rel="canonical" href="https://sre.google/sre-book/bibliography/"> <link rel="apple-touch-icon-precomposed" sizes="180x180" href="https://lh3.googleusercontent.com/Yf2DCX8RKda6r4Jml9DLMByS2zQCBFs3kQpvBfN8UgIh4YVWIYSYIQOoTxJriyuM26cT5PDjyEb5aynDQ0Xyz46yHKnfg8JlUbDW"> <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Google+Sans:400|Roboto:400,400italic,500,500italic,700,700italic|Roboto+Mono:400,500,700|Material+Icons"> <link rel="icon" type="image/png" sizes="32x32" href="https://lh3.googleusercontent.com/Yf2DCX8RKda6r4Jml9DLMByS2zQCBFs3kQpvBfN8UgIh4YVWIYSYIQOoTxJriyuM26cT5PDjyEb5aynDQ0Xyz46yHKnfg8JlUbDW"> <link rel="icon" type="image/png" sizes="16x16" href="https://lh3.googleusercontent.com/Yf2DCX8RKda6r4Jml9DLMByS2zQCBFs3kQpvBfN8UgIh4YVWIYSYIQOoTxJriyuM26cT5PDjyEb5aynDQ0Xyz46yHKnfg8JlUbDW"> <link rel="shortcut icon" href="https://lh3.googleusercontent.com/Yf2DCX8RKda6r4Jml9DLMByS2zQCBFs3kQpvBfN8UgIh4YVWIYSYIQOoTxJriyuM26cT5PDjyEb5aynDQ0Xyz46yHKnfg8JlUbDW"> <link href="/sre-book/static/css/index.min.css?cache=6c30b59" rel="stylesheet"> <script> (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ 
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','https://www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-75468017-1', 'auto'); ga('send', 'pageview'); </script> <script type="application/ld+json"> { "@context": "https://schema.org", "@type": "Article", "mainEntityOfPage": { "@type": "WebPage", "@id": "/sre-book/bibliography/" }, "headline": "Bibliography", "description": "[Ada15] Bram Adams, Stephany Bellomo, Christian Bird, Tamara Marshall-Keim, Foutse Khomh, and Kim Moir, 'The Practice and Future of Release Engineering: A Roundtable with Three Release Engineers', IEEE Software, vol. 32, no. 2 (March/April 2015), pp. 42–49.", "publisher": { "@type": "Organization", "name": "Google SRE", "logo": { "@type": "ImageObject", "url": "https://lh3.googleusercontent.com/C3_YVnTdc7xzTDekhsGeZ2hEYUnAlp47Au-9C50vi5r44rfpJAgiycs1g6AFKWqIpw6KVPrZWLse1VUqgOqYht-RxV1iowdB0_IABUd966aDsDWW-65m" } } } </script> <script src="/sre-book/static/js/detect.min.js?cache=4cb778b"></script> </head> <body> <noscript><iframe class="no-script-iframe" src="https://www.googletagmanager.com/ns.html?id=GTM-WVF23W3"></iframe></noscript> <main> <div ng-controller= "HeaderCtrl as headerCtrl"> <div id="curtain" class="menu-closed"></div> <div class="header clearfix"> <a id="burger-menu" class="expand"></a> <h2 class="chapter-title"> Bibliography </h2> </div> <div id="overlay-element" class="expands"> <div class="logo"> <a href="https://www.google.com"><img src="https://lh3.googleusercontent.com/YoVRtLOHMSRYQZ3OhFL8RIamcjFYbmQXX4oAQx02MRqqY9zlKNvsuZpS73khXiOqTH3qrFW27VrERJJIHTjPk-tAh46q8-Fd4w6qlw" alt="Google"></a> </div> <ol id="drop-down" class="dropdown-content hide"> <li><a class="menu-buttons" href="/sre-book/table-of-contents/">Table of Contents</a></li> <li> <a href="/sre-book/foreword/" class="menu-buttons"> Foreword </a> </li> <li> <a 
href="/sre-book/preface/" class="menu-buttons"> Preface </a> </li> <li> <a href="/sre-book/part-I-introduction/" class="menu-buttons"> Part I - Introduction </a> </li> <li> <a href="/sre-book/introduction/" class="menu-buttons"> 1. Introduction </a> </li> <li> <a href="/sre-book/production-environment/" class="menu-buttons"> 2. The Production Environment at Google, from the Viewpoint of an SRE </a> </li> <li> <a href="/sre-book/part-II-principles/" class="menu-buttons"> Part II - Principles </a> </li> <li> <a href="/sre-book/embracing-risk/" class="menu-buttons"> 3. Embracing Risk </a> </li> <li> <a href="/sre-book/service-level-objectives/" class="menu-buttons"> 4. Service Level Objectives </a> </li> <li> <a href="/sre-book/eliminating-toil/" class="menu-buttons"> 5. Eliminating Toil </a> </li> <li> <a href="/sre-book/monitoring-distributed-systems/" class="menu-buttons"> 6. Monitoring Distributed Systems </a> </li> <li> <a href="/sre-book/automation-at-google/" class="menu-buttons"> 7. The Evolution of Automation at Google </a> </li> <li> <a href="/sre-book/release-engineering/" class="menu-buttons"> 8. Release Engineering </a> </li> <li> <a href="/sre-book/simplicity/" class="menu-buttons"> 9. Simplicity </a> </li> <li> <a href="/sre-book/part-III-practices/" class="menu-buttons"> Part III - Practices </a> </li> <li> <a href="/sre-book/practical-alerting/" class="menu-buttons"> 10. Practical Alerting </a> </li> <li> <a href="/sre-book/being-on-call/" class="menu-buttons"> 11. Being On-Call </a> </li> <li> <a href="/sre-book/effective-troubleshooting/" class="menu-buttons"> 12. Effective Troubleshooting </a> </li> <li> <a href="/sre-book/emergency-response/" class="menu-buttons"> 13. Emergency Response </a> </li> <li> <a href="/sre-book/managing-incidents/" class="menu-buttons"> 14. Managing Incidents </a> </li> <li> <a href="/sre-book/postmortem-culture/" class="menu-buttons"> 15. 
Postmortem Culture: Learning from Failure </a> </li> <li> <a href="/sre-book/tracking-outages/" class="menu-buttons"> 16. Tracking Outages </a> </li> <li> <a href="/sre-book/testing-reliability/" class="menu-buttons"> 17. Testing for Reliability </a> </li> <li> <a href="/sre-book/software-engineering-in-sre/" class="menu-buttons"> 18. Software Engineering in SRE </a> </li> <li> <a href="/sre-book/load-balancing-frontend/" class="menu-buttons"> 19. Load Balancing at the Frontend </a> </li> <li> <a href="/sre-book/load-balancing-datacenter/" class="menu-buttons"> 20. Load Balancing in the Datacenter </a> </li> <li> <a href="/sre-book/handling-overload/" class="menu-buttons"> 21. Handling Overload </a> </li> <li> <a href="/sre-book/addressing-cascading-failures/" class="menu-buttons"> 22. Addressing Cascading Failures </a> </li> <li> <a href="/sre-book/managing-critical-state/" class="menu-buttons"> 23. Managing Critical State: Distributed Consensus for Reliability </a> </li> <li> <a href="/sre-book/distributed-periodic-scheduling/" class="menu-buttons"> 24. Distributed Periodic Scheduling with Cron </a> </li> <li> <a href="/sre-book/data-processing-pipelines/" class="menu-buttons"> 25. Data Processing Pipelines </a> </li> <li> <a href="/sre-book/data-integrity/" class="menu-buttons"> 26. Data Integrity: What You Read Is What You Wrote </a> </li> <li> <a href="/sre-book/reliable-product-launches/" class="menu-buttons"> 27. Reliable Product Launches at Scale </a> </li> <li> <a href="/sre-book/part-IV-management/" class="menu-buttons"> Part IV - Management </a> </li> <li> <a href="/sre-book/accelerating-sre-on-call/" class="menu-buttons"> 28. Accelerating SREs to On-Call and Beyond </a> </li> <li> <a href="/sre-book/dealing-with-interrupts/" class="menu-buttons"> 29. Dealing with Interrupts </a> </li> <li> <a href="/sre-book/operational-overload/" class="menu-buttons"> 30. 
Embedding an SRE to Recover from Operational Overload </a> </li> <li> <a href="/sre-book/communication-and-collaboration/" class="menu-buttons"> 31. Communication and Collaboration in SRE </a> </li> <li> <a href="/sre-book/evolving-sre-engagement-model/" class="menu-buttons"> 32. The Evolving SRE Engagement Model </a> </li> <li> <a href="/sre-book/part-V-conclusions/" class="menu-buttons"> Part V - Conclusions </a> </li> <li> <a href="/sre-book/lessons-learned/" class="menu-buttons"> 33. Lessons Learned from Other Industries </a> </li> <li> <a href="/sre-book/conclusion/" class="menu-buttons"> 34. Conclusion </a> </li> <li> <a href="/sre-book/availability-table/" class="menu-buttons"> Appendix A. Availability Table </a> </li> <li> <a href="/sre-book/service-best-practices/" class="menu-buttons"> Appendix B. A Collection of Best Practices for Production Services </a> </li> <li> <a href="/sre-book/incident-document/" class="menu-buttons"> Appendix C. Example Incident State Document </a> </li> <li> <a href="/sre-book/example-postmortem/" class="menu-buttons"> Appendix D. Example Postmortem </a> </li> <li> <a href="/sre-book/launch-checklist/" class="menu-buttons"> Appendix E. Launch Coordination Checklist </a> </li> <li> <a href="/sre-book/production-meeting/" class="menu-buttons"> Appendix F. 
Example Production Meeting Minutes </a> </li> <li class="active"> <a href="/sre-book/bibliography/" class="menu-buttons"> Bibliography </a> </li> </ol> </div> </div> <div id="maia-main"> <div class="content" id="content"> <section data-type="bibliography" class="bibliography" id="bibliography-zVskZ"> <h1 class="heading jumptargets">Bibliography</h1> <ul class="references"> <li> <a class="jumptargets" id="Ada15">[Ada15]</a> Bram Adams, Stephany Bellomo, Christian Bird, Tamara Marshall-Keim, Foutse Khomh, and Kim Moir, <a href="https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=434819" target="_blank" rel="noopener noreferrer">"The Practice and Future of Release Engineering: A Roundtable with Three Release Engineers"</a>, <em>IEEE Software</em>, vol. 32, no. 2 (March/April 2015), pp. 42–49. </li> <li> <a class="jumptargets" id="Agu10">[Agu10]</a> M. K. Aguilera, <a href="https://dl.acm.org/citation.cfm?id=2172342" target="_blank" rel="noopener noreferrer">"Stumbling over Consensus Research: Misunderstandings and Issues"</a>, in <em>Replication</em>, Lecture Notes in Computer Science 5959, 2010. </li> <li> <a class="jumptargets" id="All10">[All10]</a> J. Allspaw and J. Robbins, <em>Web Operations: Keeping the Data on Time</em>: O’Reilly, 2010. </li> <li> <a class="jumptargets" id="All12">[All12]</a> J. Allspaw, <a href="https://codeascraft.com/2012/05/22/blameless-postmortems/" target="_blank" rel="noopener noreferrer">"Blameless PostMortems and a Just Culture"</a>, blog post, 2012. </li> <li> <a class="jumptargets" id="All15">[All15]</a> J. Allspaw, <a href="https://lup.lub.lu.se/student-papers/record/8084520/file/8084521.pdf" target="_blank" rel="noopener noreferrer">"Trade-Offs Under Pressure: Heuristics and Observations of Teams Resolving Internet Service Outages"</a>, MSc thesis, Lund University, 2015. </li> <li> <a class="jumptargets" id="Ana07">[Ana07]</a> S. 
Anantharaju, <a href="https://googleonlinesecurity.blogspot.com/2007/07/automating-web-application-security.html" target="_blank" rel="noopener noreferrer">"Automating web application security testing"</a>, blog post, July 2007. </li> <li> <a class="jumptargets" id="Ana13">[Ana13]</a> R. Ananthanarayanan et al., <a href="https://research.google.com/pubs/pub41318.html" target="_blank" rel="noopener noreferrer">"Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams"</a>, in <em>SIGMOD '13</em>, 2013. </li> <li> <a class="jumptargets" id="And05">[And05]</a> A. Andrieux, K. Czajkowski, A. Dan, et al., <a href="https://www.ogf.org/documents/GFD.107.pdf" target="_blank" rel="noopener noreferrer">"Web Services Agreement Specification (WS-Agreement)"</a>, September 2005. </li> <li> <a class="jumptargets" id="Bai13">[Bai13]</a> P. Bailis and A. Ghodsi, <a href="https://dl.acm.org/citation.cfm?id=2462076" target="_blank" rel="noopener noreferrer">"Eventual Consistency Today: Limitations, Extensions, and Beyond"</a>, in <em>ACM Queue</em>, vol. 11, no. 3, 2013. </li> <li> <a class="jumptargets" id="Bai83">[Bai83]</a> L. Bainbridge, <a href="https://dx.doi.org/10.1016/0005-1098(83)90046-8" target="_blank" rel="noopener noreferrer">"Ironies of Automation"</a>, in <em>Automatica</em>, vol. 19, no. 6, November 1983. </li> <li> <a class="jumptargets" id="Bak11">[Bak11]</a> J. Baker et al., <a href="https://research.google.com/pubs/pub36971.html" target="_blank" rel="noopener noreferrer">"Megastore: Providing Scalable, Highly Available Storage for Interactive Services"</a>, in <em>Proceedings of the Conference on Innovative Data System Research</em>, 2011. </li> <li> <a class="jumptargets" id="Bar11">[Bar11]</a> L. A. Barroso, <a href="https://dl.acm.org/citation.cfm?id=2019527" target="_blank" rel="noopener noreferrer">"Warehouse-Scale Computing: Entering the Teenage Decade"</a>, talk at 38th Annual Symposium on Computer Architecture, video available online, 2011. 
</li> <li> <a class="jumptargets" id="Bar13">[Bar13]</a> L. A. Barroso, J. Clidaras, and U. Hölzle, <a href="https://research.google.com/pubs/pub41606.html" target="_blank" rel="noopener noreferrer"><em>The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition</em></a>, Morgan &amp; Claypool, 2013. </li> <li> <a class="jumptargets" id="Ben12">[Ben12]</a> C. Bennett and A. Tseitlin, <a href="https://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html" target="_blank" rel="noopener noreferrer">"Chaos Monkey Released Into The Wild"</a>, blog post, July 2012. </li> <li> <a class="jumptargets" id="Bla14">[Bla14]</a> M. Bland, <a href="https://martinfowler.com/articles/testing-culture.html" target="_blank" rel="noopener noreferrer">"Goto Fail, Heartbleed, and Unit Testing Culture"</a>, blog post, June 2014. </li> <li> <a class="jumptargets" id="Boc15">[Boc15]</a> L. Bock, <a href="https://www.workrules.net" target="_blank" rel="noopener noreferrer"><em>Work Rules!</em></a>, Twelve Books, 2015. </li> <li> <a class="jumptargets" id="Bol11">[Bol11]</a> W. J. Bolosky, D. Bradshaw, R. B. Haagens, N. P. Kusters, and P. Li, <a href="https://www.usenix.org/legacy/event/nsdi11/tech/full_papers/Bolosky.pdf" target="_blank" rel="noopener noreferrer">"Paxos Replicated State Machines as the Basis of a High-Performance Data Store"</a>, in <em>Proc. NSDI 2011</em>, 2011. </li> <li> <a class="jumptargets" id="Boy13">[Boy13]</a> P. G. Boysen, <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3776518/" target="_blank" rel="noopener noreferrer">"Just Culture: A Foundation for Balanced Accountability and Patient Safety"</a>, in <em>The Ochsner Journal</em>, Fall 2013. </li> <li> <a class="jumptargets" id="Bra15">[Bra15]</a> VM Brasseur, <a href="https://youtu.be/DLn4fZsZsKM?t=29m05s" target="_blank" rel="noopener noreferrer">"Failure: Why it happens &amp; How to benefit from it"</a>, YAPC 2015. 
</li> <li> <a class="jumptargets" id="Bre01">[Bre01]</a> E. Brewer, <a href="https://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=939450" target="_blank" rel="noopener noreferrer">"Lessons From Giant-Scale Services"</a>, in <em>IEEE Internet Computing</em>, vol. 5, no. 4, July / August 2001. </li> <li> <a class="jumptargets" id="Bre12">[Bre12]</a> E. Brewer, <a href="https://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6133253" target="_blank" rel="noopener noreferrer">"CAP Twelve Years Later: How the "Rules" Have Changed"</a>, in <em>Computer</em>, vol. 45, no. 2, February 2012. </li> <li> <a class="jumptargets" id="Bro15">[Bro15]</a> M. Brooker, <a href="https://www.awsarchitectureblog.com/2015/03/backoff.html" target="_blank" rel="noopener noreferrer">"Exponential Backoff and Jitter"</a>, on <em>AWS Architecture Blog</em>, March 2015. </li> <li> <a class="jumptargets" id="Bro95">[Bro95]</a> F. P. Brooks Jr., "No Silver Bullet—Essence and Accidents of Software Engineering", in <em>The Mythical Man-Month</em>, Boston: Addison-Wesley, 1995, pp. 180–186. </li> <li> <a class="jumptargets" id="Bru09">[Bru09]</a> J. Brutlag, <a href="https://googleresearch.blogspot.com/2009/06/speed-matters.html" target="_blank" rel="noopener noreferrer">"Speed Matters"</a>, on <em>Google Research Blog</em>, June 2009. </li> <li> <a class="jumptargets" id="Bul80">[Bul80]</a> G. M. Bull, <em>The Dartmouth Time-sharing System</em>: Ellis Horwood, 1980. </li> <li> <a class="jumptargets" id="Bur99">[Bur99]</a> M. Burgess, <em>Principles of Network and System Administration</em>: Wiley, 1999. </li> <li> <a class="jumptargets" id="Bur06">[Bur06]</a> M. Burrows, <a href="https://research.google.com/archive/chubby.html" target="_blank" rel="noopener noreferrer">"The Chubby Lock Service for Loosely-Coupled Distributed Systems"</a>, in <em>OSDI '06: Seventh Symposium on Operating System Design and Implementation</em>, November 2006. 
</li> <li> <a class="jumptargets" id="Bur16">[Bur16]</a> B. Burns, B. Grant, D. Oppenheimer, E. Brewer, and J. Wilkes, <a href="https://dl.acm.org/citation.cfm?id=2898444" target="_blank" rel="noopener noreferrer">"Borg, Omega, and Kubernetes"</a>, in <em>ACM Queue</em>, vol. 14, no. 1, 2016. </li> <li> <a class="jumptargets" id="Cas99">[Cas99]</a> M. Castro and B. Liskov, <a href="https://www.pmg.lcs.mit.edu/papers/osdi99.pdf" target="_blank" rel="noopener noreferrer">"Practical Byzantine Fault Tolerance"</a>, in <em>Proc. OSDI 1999</em>, 1999. </li> <li> <a class="jumptargets" id="Cha10">[Cha10]</a> C. Chambers, A. Raniwala, F. Perry, S. Adams, R. Henry, R. Bradshaw, and N. Weizenbaum, <a href="https://research.google.com/pubs/pub35650.html" target="_blank" rel="noopener noreferrer">"FlumeJava: Easy, Efficient Data-Parallel Pipelines"</a>, in <em>ACM SIGPLAN Conference on Programming Language Design and Implementation</em>, 2010. </li> <li> <a class="jumptargets" id="Cha96">[Cha96]</a> T. D. Chandra and S. Toueg, <a href="https://dl.acm.org/citation.cfm?id=226647" target="_blank" rel="noopener noreferrer">"Unreliable Failure Detectors for Reliable Distributed Systems"</a>, in <em>J. ACM</em>, 1996. </li> <li> <a class="jumptargets" id="Cha07">[Cha07]</a> T. Chandra, R. Griesemer, and J. Redstone, <a href="https://research.google.com/archive/paxos_made_live.html" target="_blank" rel="noopener noreferrer">"Paxos Made Live—An Engineering Perspective"</a>, in <em>PODC '07: 26th ACM Symposium on Principles of Distributed Computing</em>, 2007. </li> <li> <a class="jumptargets" id="Cha06">[Cha06]</a> F. Chang et al., <a href="https://research.google.com/archive/bigtable.html" target="_blank" rel="noopener noreferrer">"Bigtable: A Distributed Storage System for Structured Data"</a>, in <em>OSDI '06: Seventh Symposium on Operating System Design and Implementation</em>, November 2006. </li> <li> <a class="jumptargets" id="Chr09">[Chr09]</a> G. P. 
Chrousos, <a href="https://www.ncbi.nlm.nih.gov/pubmed/19488073" target="_blank" rel="noopener noreferrer">"Stress and Disorders of the Stress System"</a>, in <em>Nature Reviews Endocrinology</em>, vol. 5, no. 7, 2009. </li> <li> <a class="jumptargets" id="Clos53">[Clos53]</a> C. Clos, <a href="https://dx.doi.org/10.1002/j.1538-7305.1953.tb01433.x" target="_blank" rel="noopener noreferrer">"A Study of Non-Blocking Switching Networks"</a>, in <em>Bell System Technical Journal</em>, vol. 32, no. 2, 1953. </li> <li> <a class="jumptargets" id="Con15">[Con15]</a> C. Contavalli, W. van der Gaast, D. Lawrence, and W. Kumari, <a href="https://tools.ietf.org/html/draft-vandergaast-edns-client-subnet" target="_blank" rel="noopener noreferrer">"Client Subnet in DNS Queries"</a>, <em>IETF Internet-Draft</em>, 2015. </li> <li> <a class="jumptargets" id="Con63">[Con63]</a> M. E. Conway, <a href="https://dl.acm.org/citation.cfm?id=366704" target="_blank" rel="noopener noreferrer">"Design of a Separable Transition-Diagram Compiler"</a>, in <em>Commun. ACM</em> 6, 7 (July 1963), 396–408. </li> <li> <a class="jumptargets" id="Con96">[Con96]</a> P. Conway, <a href="https://www.clir.org/pubs/reports/conway2/index.html" target="_blank" rel="noopener noreferrer">"Preservation in the Digital World"</a>, report published by the Council on Library and Information Resources, 1996. </li> <li> <a class="jumptargets" id="Coo00">[Coo00]</a> R. I. Cook, <a href="https://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf" target="_blank" rel="noopener noreferrer">"How Complex Systems Fail"</a>, in <em>Web Operations</em>: O’Reilly, 2010. </li> <li> <a class="jumptargets" id="Cor12">[Cor12]</a> J. C. Corbett et al., <a href="https://research.google.com/archive/spanner.html" target="_blank" rel="noopener noreferrer">"Spanner: Google’s Globally-Distributed Database"</a>, in <em>OSDI '12: Tenth Symposium on Operating System Design and Implementation</em>, October 2012. 
</li> <li> <a class="jumptargets" id="Cra10">[Cra10]</a> J. Cranmer, <a href="https://quetzalcoatal.blogspot.com/2010/03/visualizing-code-coverage.html" target="_blank" rel="noopener noreferrer">"Visualizing code coverage"</a>, blog post, March 2010. </li> <li> <a class="jumptargets" id="Dea13">[Dea13]</a> J. Dean and L. A. Barroso, <a href="https://research.google.com/pubs/pub40801.html" target="_blank" rel="noopener noreferrer">"The Tail at Scale"</a>, in <em>Communications of the ACM</em>, vol. 56, 2013. </li> <li> <a class="jumptargets" id="Dea04">[Dea04]</a> J. Dean and S. Ghemawat, <a href="https://research.google.com/archive/mapreduce.html" target="_blank" rel="noopener noreferrer">"MapReduce: Simplified Data Processing on Large Clusters"</a>, in <em>OSDI’04: Sixth Symposium on Operating System Design and Implementation</em>, December 2004. </li> <li> <a class="jumptargets" id="Dea07">[Dea07]</a> J. Dean, <a href="https://static.googleusercontent.com/media/research.google.com/en//people/jeff/stanford-295-talk.pdf" target="_blank" rel="noopener noreferrer">"Software Engineering Advice from Building Large-Scale Distributed Systems"</a>, Stanford CS297 class lecture, Spring 2007. </li> <li> <a class="jumptargets" id="Dek02">[Dek02]</a> S. Dekker, <a href="https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.411.4985&amp;rep=rep1&amp;type=pdf" target="_blank" rel="noopener noreferrer">"Reconstructing human contributions to accidents: the new view on error and performance"</a>, in <em>Journal of Safety Research</em>, vol. 33, no. 3, 2002. </li> <li> <a class="jumptargets" id="Dek14">[Dek14]</a> S. Dekker, <em>The Field Guide to Understanding "Human Error"</em>, 3rd edition: Ashgate, 2014. </li> <li> <a class="jumptargets" id="Dic14">[Dic14]</a> C. 
Dickson, <a href="https://usenix.org/conference/ures14west/summit-program/presentation/dickson" target="_blank" rel="noopener noreferrer">"How Embracing Continuous Release Reduced Change Complexity"</a>, presentation at USENIX Release Engineering Summit West 2014, video available online. </li> <li> <a class="jumptargets" id="Dur05">[Dur05]</a> J. Durmer and D. Dinges, <a href="https://www.ncbi.nlm.nih.gov/pubmed/15798944" target="_blank" rel="noopener noreferrer">"Neurocognitive Consequences of Sleep Deprivation"</a>, in <em>Seminars in Neurology</em>, vol. 25, no. 1, 2005. </li> <li> <a class="jumptargets" id="Eis16">[Eis16]</a> D. E. Eisenbud et al., <a href="https://research.google.com/pubs/pub44824.html" target="_blank" rel="noopener noreferrer">"Maglev: A Fast and Reliable Software Network Load Balancer"</a>, in <em>NSDI '16: 13th USENIX Symposium on Networked Systems Design and Implementation</em>, March 2016. </li> <li> <a class="jumptargets" id="Ere03">[Ere03]</a> J. R. Erenkrantz, <a href="https://www.erenkrantz.com/Geeks/Research/Publications/ReleaseManagement.pdf" target="_blank" rel="noopener noreferrer">"Release Management Within Open Source Projects"</a>, in <em>Proceedings of the 3rd Workshop on Open Source Software Engineering</em>, Portland, Oregon, May 2003. </li> <li> <a class="jumptargets" id="Fis85">[Fis85]</a> M. J. Fischer, N. A. Lynch, and M. S. Paterson, <a href="https://dl.acm.org/citation.cfm?id=214121" target="_blank" rel="noopener noreferrer">"Impossibility of Distributed Consensus with One Faulty Process"</a>, <em>J. ACM</em>, 1985. </li> <li> <a class="jumptargets" id="Fit12">[Fit12]</a> B. W. Fitzpatrick and B. Collins-Sussman, <em>Team Geek: A Software Developer’s Guide to Working Well with Others</em>: O’Reilly, 2012. </li> <li> <a class="jumptargets" id="Flo94">[Flo94]</a> S. Floyd and V. 
Jacobson, <a href="https://dl.acm.org/citation.cfm?id=187045" target="_blank" rel="noopener noreferrer">"The Synchronization of Periodic Routing Messages"</a>, in <em>IEEE/ACM Transactions on Networking</em>, vol. 2, issue 2, April 1994, pp. 122–136. </li> <li> <a class="jumptargets" id="For10">[For10]</a> D. Ford et al., <a href="https://research.google.com/pubs/pub36737.html" target="_blank" rel="noopener noreferrer">"Availability in Globally Distributed Storage Systems"</a>, in <em>Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation</em>, 2010. </li> <li> <a class="jumptargets" id="Fox99">[Fox99]</a> A. Fox and E. A. Brewer, <a href="https://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=798396" target="_blank" rel="noopener noreferrer">"Harvest, Yield, and Scalable Tolerant Systems"</a>, in <em>Proceedings of the 7th Workshop on Hot Topics in Operating Systems</em>, Rio Rico, Arizona, March 1999. </li> <li> <a class="jumptargets" id="Fow08">[Fow08]</a> M. Fowler, <a href="https://martinfowler.com/eaaDev/uiArchs.html" target="_blank" rel="noopener noreferrer">"GUI Architectures"</a>, blog post, 2006. </li> <li> <a class="jumptargets" id="Gal78">[Gal78]</a> J. Gall, <em>SYSTEMANTICS: How Systems Really Work and How They Fail</em>, 1st ed., Pocket, 1977. </li> <li> <a class="jumptargets" id="Gal03">[Gal03]</a> J. Gall, <em>The Systems Bible: The Beginner’s Guide to Systems Large and Small</em>, 3rd ed., General Systemantics Press/Liberty, 2003. </li> <li> <a class="jumptargets" id="Gaw09">[Gaw09]</a> A. Gawande, <em>The Checklist Manifesto: How to Get Things Right</em>: Henry Holt and Company, 2009. </li> <li> <a class="jumptargets" id="Ghe03">[Ghe03]</a> S. Ghemawat, H. Gobioff, and S-T. Leung, <a href="https://research.google.com/archive/gfs.html" target="_blank" rel="noopener noreferrer">"The Google File System"</a>, in <em>19th ACM Symposium on Operating Systems Principles</em>, October 2003. 
</li> <li> <a class="jumptargets" id="Gil02">[Gil02]</a> S. Gilbert and N. Lynch, <a href="https://dl.acm.org/citation.cfm?id=564601" target="_blank" rel="noopener noreferrer">"Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services"</a>, in <em>ACM SIGACT News</em>, vol. 33, no. 2, 2002. </li> <li> <a class="jumptargets" id="Gla02">[Gla02]</a> R. Glass, <em>Facts and Fallacies of Software Engineering</em>, Addison-Wesley Professional, 2002. </li> <li> <a class="jumptargets" id="Gol14">[Gol14]</a> W. Golab et al., <a href="https://dl.acm.org/citation.cfm?id=2582994" target="_blank" rel="noopener noreferrer">"Eventually Consistent: Not What You Were Expecting?"</a>, in <em>ACM Queue</em>, vol. 12, no. 1, 2014. </li> <li> <a class="jumptargets" id="Gra09">[Gra09]</a> P. Graham, <a href="https://paulgraham.com/makersschedule.html" target="_blank" rel="noopener noreferrer">"Maker’s Schedule, Manager’s Schedule"</a>, blog post, July 2009. </li> <li> <a class="jumptargets" id="Gup15">[Gup15]</a> A. Gupta and J. Shute, <a href="https://research.google.com/pubs/pub44686.html" target="_blank" rel="noopener noreferrer">"High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads"</a>, in <em>Workshop on Business Intelligence for the Real Time Enterprise</em>, 2015. </li> <li> <a class="jumptargets" id="Ham07">[Ham07]</a> J. Hamilton, <a href="https://www.usenix.org/legacy/event/lisa07/tech/hamilton.html" target="_blank" rel="noopener noreferrer">"On Designing and Deploying Internet-Scale Services"</a>, in <em>Proceedings of the 21st Large Installation System Administration Conference</em>, November 2007. </li> <li> <a class="jumptargets" id="Han94">[Han94]</a> S. Hanks, T. Li, D. Farinacci, and P. Traina, <a href="https://tools.ietf.org/html/rfc1702" target="_blank" rel="noopener noreferrer">"Generic Routing Encapsulation over IPv4 networks"</a>, <em>IETF Informational RFC</em>, 1994. 
</li> <li> <a class="jumptargets" id="Hic11">[Hic11]</a> M. Hickins, <a href="https://blogs.wsj.com/digits/2011/03/01/tape-rescues-google-in-lost-email-scare/" target="_blank" rel="noopener noreferrer">"Tape Rescues Google in Lost Email Scare"</a>, in <em>Digits</em>, <em>Wall Street Journal</em>, 1 March 2011. </li> <li> <a class="jumptargets" id="Hix15a">[Hix15a]</a> D. Hixson, <a href="https://www.usenix.org/publications/login/feb15/capacity-planning" target="_blank" rel="noopener noreferrer">"Capacity Planning"</a>, in <em>;login:</em>, vol. 40, no. 1, February 2015. </li> <li> <a class="jumptargets" id="Hix15b">[Hix15b]</a> D. Hixson, <a href="https://www.usenix.org/publications/login/june15/hixson" target="_blank" rel="noopener noreferrer">"The Systems Engineering Side of Site Reliability Engineering"</a>, in <em>;login:</em>, vol. 40, no. 3, June 2015. </li> <li> <a class="jumptargets" id="Hod13">[Hod13]</a> J. Hodges, <a href="https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/" target="_blank" rel="noopener noreferrer">"Notes on Distributed Systems for Young Bloods"</a>, blog post, 14 January 2013. </li> <li> <a class="jumptargets" id="Hol14">[Hol14]</a> L. Holmwood, <a href="https://fractio.nl/2014/08/26/cardiac-alarms-and-ops/" target="_blank" rel="noopener noreferrer">"Applying Cardiac Alarm Management Techniques to Your On-Call"</a>, blog post, 26 August 2014. </li> <li> <a class="jumptargets" id="Hum06">[Hum06]</a> J. Humble, C. Read, and D. North, "The Deployment Production Line", in <em>Proceedings of the IEEE Agile Conference</em>, July 2006. </li> <li> <a class="jumptargets" id="Hum10">[Hum10]</a> J. Humble and D. Farley, <em>Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation</em>: Addison-Wesley, 2010. </li> <li> <a class="jumptargets" id="Hun10">[Hun10]</a> P. Hunt, M. Konar, F. P. Junqueira, and B. 
Reed, <a href="https://www.usenix.org/legacy/events/atc10/tech/full_papers/Hunt.pdf" target="_blank" rel="noopener noreferrer">"ZooKeeper: Wait-free coordination for Internet-scale systems"</a>, in <em>USENIX ATC</em>, 2010. </li> <li> <a class="jumptargets" id="IAEA12">[IAEA12]</a> International Atomic Energy Agency, <a href="https://www-pub.iaea.org/MTCD/publications/PDF/Pub1534_web.pdf" target="_blank" rel="noopener noreferrer">"Safety of Nuclear Power Plants: Design, SSR-2/1"</a>, 2012. </li> <li> <a class="jumptargets" id="Jai13">[Jai13]</a> S. Jain et al., <a href="https://doi.org/10.1145/2486001.2486019" target="_blank" rel="noopener noreferrer">"B4: Experience with a Globally-Deployed Software Defined WAN"</a>, in <em>SIGCOMM '13</em>. </li> <li> <a class="jumptargets" id="Jon15">[Jon15]</a> C. Jones, T. Underwood, and S. Nukala, <a href="https://www.usenix.org/publications/login/june15/hiring-site-reliability-engineers" target="_blank" rel="noopener noreferrer">"Hiring Site Reliability Engineers"</a>, in <em>;login:</em>, vol. 40, no. 3, June 2015. </li> <li> <a class="jumptargets" id="Jun07">[Jun07]</a> F. Junqueira, Y. Mao, and K. Marzullo, <a href="https://dl.acm.org/citation.cfm?id=1323158" target="_blank" rel="noopener noreferrer">"Classic Paxos vs. Fast Paxos: Caveat Emptor"</a>, in <em>Proc. HotDep '07</em>, 2007. </li> <li> <a class="jumptargets" id="Jun11">[Jun11]</a> F. P. Junqueira, B. C. Reed, and M. Serafini, <a href="https://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5958223&amp;tag=1" target="_blank" rel="noopener noreferrer">"Zab: High-performance broadcast for primary-backup systems"</a>, in <em>Dependable Systems &amp; Networks (DSN), 2011 IEEE/IFIP 41st International Conference</em> on 27 Jun 2011: 245–256. </li> <li> <a class="jumptargets" id="Kah11">[Kah11]</a> D. Kahneman, <em>Thinking, Fast and Slow</em>: Farrar, Straus and Giroux, 2011. </li> <li> <a class="jumptargets" id="Kar97">[Kar97]</a> D. 
Karger et al., <a href="https://dl.acm.org/citation.cfm?id=258660" target="_blank" rel="noopener noreferrer">"Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web"</a>, in <em>Proc. STOC '97</em>, 29th annual ACM symposium on theory of computing, 1997. </li> <li> <a class="jumptargets" id="Kem11">[Kem11]</a> C. Kemper, <a href="https://google-engtools.blogspot.com/2011/08/build-in-cloud-how-build-system-works.html" target="_blank" rel="noopener noreferrer">"Build in the Cloud: How the Build System Works"</a>, <em>Google Engineering Tools</em> blog post, August 2011. </li> <li> <a class="jumptargets" id="Ken12">[Ken12]</a> S. Kendrick, <a href="https://usenix.org/publications/login/october-2012-volume-37-number-5/what-takes-us-down" target="_blank" rel="noopener noreferrer">"What Takes Us Down?"</a>, in <em>;login:</em>, vol. 37, no. 5, October 2012. </li> <li> <a class="jumptargets" id="Kinc09">[Kinc09]</a> Kincaid, Jason. "T-Mobile Sidekick Disaster: Danger’s Servers Crashed, And They Don’t Have A Backup." <em>Techcrunch</em>. n.p., 10 Oct. 2009. Web. 20 Jan. 2015, <a href="https://techcrunch.com/2009/10/10/t-mobile-sidekick-disaster-microsofts-servers-crashed-and-they-dont-have-a-backup" target="_blank" rel="noopener noreferrer"><em class="hyperlink">https://techcrunch.com/2009/10/10/t-mobile-sidekick-disaster-microsofts-servers-crashed-and-they-dont-have-a-backup</em></a>. </li> <li> <a class="jumptargets" id="Kin15">[Kin15]</a> K. Kingsbury, <a href="https://www.aphyr.com/posts/299-the-trouble-with-timestamps" target="_blank" rel="noopener noreferrer">"The trouble with timestamps"</a>, blog post, 2013. </li> <li> <a class="jumptargets" id="Kir08">[Kir08]</a> J. Kirsch and Y. Amir, <a href="https://dl.acm.org/citation.cfm?id=1529979" target="_blank" rel="noopener noreferrer">"Paxos for System Builders: An Overview"</a>, in <em>Proc. LADIS '08</em>, 2008. 
</li> <li> <a class="jumptargets" id="Kla12">[Kla12]</a> R. Klau, <a href="https://library.gv.com/how-google-sets-goals-okrs-a1f69b0b72c7" target="_blank" rel="noopener noreferrer">"How Google Sets Goals: OKRs"</a>, blog post, October 2012. </li> <li> <a class="jumptargets" id="Kle06">[Kle06]</a> D. V. Klein, <a href="https://www.usenix.org/legacy/event/lisa06/tech/klein/klein_html/index.html" target="_blank" rel="noopener noreferrer">"A Forensic Analysis of a Distributed Two-Stage Web-Based Spam Attack"</a>, in <em>Proceedings of the 20th Large Installation System Administration Conference</em>, December 2006. </li> <li> <a class="jumptargets" id="Kle14">[Kle14]</a> D. V. Klein, D. M. Betser, and M. G. Monroe, <a href="https://www.usenix.org/publications/login/october-2014-vol-39-no-5/making-push-green-reality" target="_blank" rel="noopener noreferrer">"Making <em>Push On Green</em> a Reality"</a>, in <em>;login:</em>, vol. 39, no. 5, October 2014. </li> <li> <a class="jumptargets" id="Kra08">[Kra08]</a> T. Krattenmaker, <a href="https://hbr.org/2008/02/make-every-meeting-matter" target="_blank" rel="noopener noreferrer">"Make Every Meeting Matter"</a>, in <em>Harvard Business Review</em>, February 27, 2008. </li> <li> <a class="jumptargets" id="Kre12">[Kre12]</a> J. Kreps, <a href="https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability" target="_blank" rel="noopener noreferrer">"Getting Real About Distributed System Reliability"</a>, blog post, 19 March 2012. </li> <li> <a class="jumptargets" id="Kri12">[Kri12]</a> K. Krishan, <a href="https://dl.acm.org/citation.cfm?id=2366332" target="_blank" rel="noopener noreferrer">"Weathering The Unexpected"</a>, in <em>Communications of the ACM</em>, vol. 55, no. 11, November 2012. </li> <li> <a class="jumptargets" id="Kum15">[Kum15]</a> A. 
Kumar et al., <a href="https://research.google.com/pubs/pub43838.html" target="_blank" rel="noopener noreferrer">"BwE: Flexible, Hierarchical Bandwidth Allocation for WAN Distributed Computing"</a>, in <em>SIGCOMM '15</em>. </li> <li> <a class="jumptargets" id="Lam98">[Lam98]</a> L. Lamport, <a href="https://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf" target="_blank" rel="noopener noreferrer">"The Part-Time Parliament"</a>, in <em>ACM Transactions on Computer Systems 16, 2</em>, May 1998. </li> <li> <a class="jumptargets" id="Lam01">[Lam01]</a> L. Lamport, <a href="https://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf" target="_blank" rel="noopener noreferrer">"Paxos Made Simple"</a>, in <em>ACM SIGACT News</em> 121, December 2001. </li> <li> <a class="jumptargets" id="Lam06">[Lam06]</a> L. Lamport, <a href="https://research.microsoft.com/pubs/64624/tr-2005-112.pdf" target="_blank" rel="noopener noreferrer">"Fast Paxos"</a>, in <em>Distributed Computing</em> 19.2, October 2006. </li> <li> <a class="jumptargets" id="Lim14">[Lim14]</a> T. A. Limoncelli, S. R. Chalup, and C. J. Hogan, <em>The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems, Volume 2</em>: Addison-Wesley, 2014. </li> <li> <a class="jumptargets" id="Loo10">[Loo10]</a> J. Loomis, "How to Make Failure Beautiful: The Art and Science of Postmortems", in <em>Web Operations</em>: O’Reilly, 2010. </li> <li> <a class="jumptargets" id="Lu15">[Lu15]</a> H. Lu et al., <a href="https://sigops.org/sosp/sosp15/current/2015-Monterey/printable/240-lu.pdf" target="_blank" rel="noopener noreferrer">"Existential Consistency: Measuring and Understanding Consistency at Facebook"</a>, in <em>SOSP '15</em>, 2015. </li> <li> <a class="jumptargets" id="Mao08">[Mao08]</a> Y. Mao, F. P. Junqueira, and K. 
Marzullo, <a href="https://www.usenix.org/legacy/events/osdi08/tech/full_papers/mao/mao.pdf" target="_blank" rel="noopener noreferrer">"Mencius: Building Efficient Replicated State Machines for WANs"</a>, in <em>OSDI '08</em>, 2008. </li> <li> <a class="jumptargets" id="Mas43">[Mas43]</a> A. H. Maslow, "A Theory of Human Motivation", in <em>Psychological Review</em> 50(4), 1943. </li> <li> <a class="jumptargets" id="Mau15">[Mau15]</a> B. Maurer, <a href="https://dl.acm.org/citation.cfm?id=2839461" target="_blank" rel="noopener noreferrer">"Fail at Scale"</a>, in <em>ACM Queue</em>, vol. 13, no. 12, 2015. </li> <li> <a class="jumptargets" id="May09">[May09]</a> M. Mayer, <a href="https://googleblog.blogspot.com/2009/01/this-site-may-harm-your-computer-on.html" target="_blank" rel="noopener noreferrer">"<em>This site may harm your computer</em> on every search result?!?!"</a>, blog post, January 2009. </li> <li> <a class="jumptargets" id="McI86">[McI86]</a> M. D. McIlroy, <a href="https://www.cs.dartmouth.edu/~doug/reader.pdf" target="_blank" rel="noopener noreferrer">"A Research Unix Reader: Annotated Excerpts from the Programmer’s Manual, 1971–1986"</a>. </li> <li> <a class="jumptargets" id="McN13">[McN13]</a> D. McNutt, <a href="https://www.usenix.org/conference/ucms13/summit-program/presentation/mcnutt" target="_blank" rel="noopener noreferrer">"Maintaining Consistency in a Massively Parallel Environment"</a>, presentation at USENIX Configuration Management Summit 2013, video available online. </li> <li> <a class="jumptargets" id="McN14a">[McN14a]</a> D. McNutt, <a href="https://www.usenix.org/system/files/login/articles/05_mcnutt.pdf" target="_blank" rel="noopener noreferrer">"Accelerating the Path from Dev to DevOps"</a>, in <em>;login:</em>, vol. 39, no. 2, April 2014. </li> <li> <a class="jumptargets" id="McN14b">[McN14b]</a> D. 
McNutt, <a href="https://www.youtube.com/watch?v=RNMjYV_UsQ8" target="_blank" rel="noopener noreferrer">"The 10 Commandments of Release Engineering"</a>, presentation at 2nd International Workshop on Release Engineering 2014, April 2014. </li> <li> <a class="jumptargets" id="McN14c">[McN14c]</a> D. McNutt, <a href="https://www.usenix.org/conference/lisa14/conference-program/presentation/mcnutt" target="_blank" rel="noopener noreferrer">"Distributing Software in a Massively Parallel Environment"</a>, presentation at USENIX LISA 2014, video available online. </li> <li> <a class="jumptargets" id="Mic03">[Mic03]</a> Microsoft TechNet, "What is SNMP?", last modified March 28, 2003, <a href="https://technet.microsoft.com/en-us/library/cc776379%28v=ws.10%29.aspx" target="_blank" rel="noopener noreferrer"><em class="hyperlink">https://technet.microsoft.com/en-us/library/cc776379%28v=ws.10%29.aspx</em></a>. </li> <li> <a class="jumptargets" id="Mea08">[Mea08]</a> D. Meadows, <em>Thinking in Systems</em>: Chelsea Green, 2008. </li> <li> <a class="jumptargets" id="Men07">[Men07]</a> P. Menage, <a href="https://www.kernel.org/doc/ols/2007/ols2007v2-pages-45-58.pdf" target="_blank" rel="noopener noreferrer">"Adding Generic Process Containers to the Linux Kernel"</a>, in <em>Proc. of Ottawa Linux Symposium</em>, 2007. </li> <li> <a class="jumptargets" id="Mer11">[Mer11]</a> N. Merchant, <a href="https://hbr.org/2011/03/culture-trumps-strategy-every" target="_blank" rel="noopener noreferrer">"Culture Trumps Strategy, Every Time"</a>, in <em>Harvard Business Review</em>, March 22, 2011. </li> <li> <a class="jumptargets" id="Moc87">[Moc87]</a> P. Mockapetris, <a href="https://tools.ietf.org/html/rfc1035" target="_blank" rel="noopener noreferrer">"Domain Names - Implementation and Specification"</a>, <em>IETF Internet Standard</em>, 1987. </li> <li> <a class="jumptargets" id="Mol86">[Mol86]</a> C. 
Moler, "Matrix Computation on Distributed Memory Multiprocessors", in <em>Hypercube Multiprocessors 1986</em>, 1987. </li> <li> <a class="jumptargets" id="Mor12a">[Mor12a]</a> I. Moraru, D. G. Andersen, and M. Kaminsky, <a href="https://www.pdl.cmu.edu/PDL-FTP/associated/CMU-PDL-12-108.pdf" target="_blank" rel="noopener noreferrer">"Egalitarian Paxos"</a>, <em>Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-12-108</em>, 2012. </li> <li> <a class="jumptargets" id="Mor14">[Mor14]</a> I. Moraru, D. G. Andersen, and M. Kaminsky, <a href="https://dl.acm.org/citation.cfm?id=2671001" target="_blank" rel="noopener noreferrer">"Paxos Quorum Leases: Fast Reads Without Sacrificing Writes"</a>, in <em>Proc. SOCC '14</em>, 2014. </li> <li> <a class="jumptargets" id="Mor12b">[Mor12b]</a> J. D. Morgenthaler, M. Gridnev, R. Sauciuc, and S. Bhansali, <a href="https://research.google.com/pubs/pub37755.html" target="_blank" rel="noopener noreferrer">"Searching for Build Debt: Experiences Managing Technical Debt at Google"</a>, in <em>Proceedings of the 3rd Int’l Workshop on Managing Technical Debt</em>, 2012. </li> <li> <a class="jumptargets" id="Nar12">[Nar12]</a> C. Narla and D. Salas, <a href="https://googletesting.blogspot.com/2012/10/hermetic-servers.html" target="_blank" rel="noopener noreferrer">"Hermetic Servers"</a>, blog post, 2012. </li> <li> <a class="jumptargets" id="Nel14">[Nel14]</a> B. Nelson, <a href="https://dl.acm.org/citation.cfm?id=2684442.2597886" target="_blank" rel="noopener noreferrer">"The Data on Diversity"</a>, in <em>Communications of the ACM</em>, vol. 57, 2014. </li> <li> <a class="jumptargets" id="Nic12">[Nic12]</a> K. Nichols and V. Jacobson, <a href="https://dl.acm.org/citation.cfm?id=2209336" target="_blank" rel="noopener noreferrer">"Controlling Queue Delay"</a>, in <em>ACM Queue</em>, vol. 10, no. 5, 2012. </li> <li> <a class="jumptargets" id="Oco12">[Oco12]</a> P. O’Connor and A. 
Kleyner, <em>Practical Reliability Engineering</em>, 5th edition: Wiley, 2012. </li> <li> <a class="jumptargets" id="Ohn88">[Ohn88]</a> T. Ohno, <em>Toyota Production System: Beyond Large-Scale Production</em>: Productivity Press, 1988. </li> <li> <a class="jumptargets" id="Ong14">[Ong14]</a> D. Ongaro and J. Ousterhout, <a href="https://ramcloud.stanford.edu/raft.pdf" target="_blank" rel="noopener noreferrer">"In Search of an Understandable Consensus Algorithm (Extended Version)"</a>. </li> <li> <a class="jumptargets" id="Pen10">[Pen10]</a> D. Peng and F. Dabek, <a href="https://research.google.com/pubs/pub36726.html" target="_blank" rel="noopener noreferrer">"Large-scale Incremental Processing Using Distributed Transactions and Notifications"</a>, in <em>Proc. of the 9th USENIX Symposium on Operating System Design and Implementation</em>, November 2010. </li> <li> <a class="jumptargets" id="Per99">[Per99]</a> C. Perrow, <em>Normal Accidents: Living with High-Risk Technologies</em>, Princeton University Press, 1999. </li> <li> <a class="jumptargets" id="Per07">[Per07]</a> A. R. Perry, <a href="https://research.google.com/pubs/pub32583.html" target="_blank" rel="noopener noreferrer">"Engineering Reliability into Web Sites: Google SRE"</a>, in <em>Proc. of LinuxWorld 2007</em>, 2007. </li> <li> <a class="jumptargets" id="Pik05">[Pik05]</a> R. Pike, S. Dorward, R. Griesemer, S. Quinlan, <a href="https://research.google.com/archive/sawzall.html" target="_blank" rel="noopener noreferrer">"Interpreting the Data: Parallel Analysis with Sawzall"</a>, in <em>Scientific Programming Journal</em> vol. 13, no. 4, 2005. </li> <li> <a class="jumptargets" id="Pot16">[Pot16]</a> R. Potvin and J. Levenberg, <a href="https://dl.acm.org/citation.cfm?id=2963119.2854146" target="_blank" rel="noopener noreferrer">"The Motivation for a Monolithic Codebase: Why Google stores billions of lines of code in a single repository"</a>, in <em>Communications of the ACM</em>, vol. 59, no. 7, 2016. 
Video available on <a href="https://www.youtube.com/watch?v=W71BTkUbdqE" target="_blank" rel="noopener noreferrer">YouTube</a>. </li> <li> <a class="jumptargets" id="Roo04">[Roo04]</a> J. J. Rooney and L. N. Vanden Heuvel, <a href="https://asq.org/quality-progress/2004/07/quality-tools/root-cause-analysis-for-beginners.html" target="_blank" rel="noopener noreferrer">"Root Cause Analysis for Beginners"</a>, in <em>Quality Progress</em>, July 2004. </li> <li> <a class="jumptargets" id="Sai39">[Sai39]</a> A. de Saint Exupéry, <em>Terre des Hommes</em> (Paris: Le Livre de Poche, 1939), in translation by Lewis Galantière as <em>Wind, Sand and Stars</em>. </li> <li> <a class="jumptargets" id="Sam14">[Sam14]</a> R. R. Sambasivan, R. Fonseca, I. Shafer, and G. R. Ganger, <a href="https://pdl.cmu.edu/PDL-FTP/SelfStar/CMU-PDL-14-102_abs.shtml" target="_blank" rel="noopener noreferrer">"So, You Want To Trace Your Distributed System? Key Design Insights from Years of Practical Experience"</a>, Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-14-102, 2014. </li> <li> <a class="jumptargets" id="San11">[San11]</a> N. Santos and A. Schiper, <a href="https://rd.springer.com/chapter/10.1007%2F978-3-642-25959-3_11" target="_blank" rel="noopener noreferrer">"Tuning Paxos for High-Throughput with Batching and Pipelining"</a>, in <em>13th Int’l Conf. on Distributed Computing and Networking</em>, 2012. </li> <li> <a class="jumptargets" id="Sar97">[Sar97]</a> N. B. Sarter, D. D. Woods, and C. E. Billings, "Automation Surprises", in <em>Handbook of Human Factors &amp; Ergonomics</em>, 2nd edition, G. Salvendy (ed.), Wiley, 1997. </li> <li> <a class="jumptargets" id="Sch14">[Sch14]</a> E. Schmidt, J. Rosenberg, and A. Eagle, <a href="https://www.howgoogleworks.net" target="_blank" rel="noopener noreferrer"><em>How Google Works</em></a>: Grand Central Publishing, 2014. </li> <li> <a class="jumptargets" id="Sch15">[Sch15]</a> B. 
Schwartz, <a href="https://www.vividcortex.com/blog/the-factors-that-impact-availability-visualized" target="_blank" rel="noopener noreferrer">"The Factors That Impact Availability, Visualized"</a>, blog post, 21 December 2015. </li> <li> <a class="jumptargets" id="Sch90">[Sch90]</a> F. B. Schneider, <a href="https://dl.acm.org/citation.cfm?id=98167" target="_blank" rel="noopener noreferrer">"Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial"</a>, in <em>ACM Computing Surveys</em>, vol. 22, no. 4, 1990. </li> <li> <a class="jumptargets" id="Sec13">[Sec13]</a> Securities and Exchange Commission, <a href="https://www.sec.gov/litigation/admin/2013/34-70694.pdf" target="_blank" rel="noopener noreferrer">"Order In the Matter of Knight Capital Americas LLC"</a>, file 3-15570, 2013. </li> <li> <a class="jumptargets" id="Sha00">[Sha00]</a> G. Shao, F. Berman, and R. Wolski, <a href="https://www.cs.ucsb.edu/~rich/publications/shao-hcw.pdf" target="_blank" rel="noopener noreferrer">"Master/Slave Computing on the Grid"</a>, in <em>Heterogeneous Computing Workshop</em>, 2000. </li> <li> <a class="jumptargets" id="Shu13">[Shu13]</a> J. Shute et al., <a href="https://research.google.com/pubs/pub41344.html" target="_blank" rel="noopener noreferrer">"F1: A Distributed SQL Database That Scales"</a>, in <em>Proc. VLDB 2013</em>, 2013. </li> <li> <a class="jumptargets" id="Sig10">[Sig10]</a> B. H. Sigelman et al., <a href="https://research.google.com/pubs/pub36356.html" target="_blank" rel="noopener noreferrer">"Dapper, a Large-Scale Distributed Systems Tracing Infrastructure"</a>, Google Technical Report, 2010. </li> <li> <a class="jumptargets" id="Sin15">[Sin15]</a> A. Singh et al., <a href="https://research.google.com/pubs/pub43837.html" target="_blank" rel="noopener noreferrer">"Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network"</a>, in <em>SIGCOMM '15</em>. 
</li> <li> <a class="jumptargets" id="Skel13">[Skel13]</a> M. Skelton, <a href="https://blog.softwareoperability.com/2013/10/16/operability-can-improve-if-developers-write-a-draft-run-book/" target="_blank" rel="noopener noreferrer">"Operability can Improve if Developers Write a Draft Run Book"</a>, blog post, 16 October 2013. </li> <li> <a class="jumptargets" id="Slo11">[Slo11]</a> B. Treynor Sloss, <a href="https://gmailblog.blogspot.com/2011/02/gmail-back-soon-for-everyone.html" target="_blank" rel="noopener noreferrer">"Gmail back soon for everyone"</a>, blog post, 28 February 2011. </li> <li> <a class="jumptargets" id="Tat99">[Tat99]</a> S. Tatham, <a href="https://www.chiark.greenend.org.uk/~sgtatham/bugs.html" target="_blank" rel="noopener noreferrer">"How to Report Bugs Effectively"</a>, 1999. </li> <li> <a class="jumptargets" id="Ver15">[Ver15]</a> A. Verma, L. Pedrosa, M. R. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes, <a href="https://research.google.com/pubs/pub43438.html" target="_blank" rel="noopener noreferrer">"Large-scale cluster management at Google with Borg"</a>, in <em>Proceedings of the European Conference on Computer Systems</em>, 2015. </li> <li> <a class="jumptargets" id="Wal89">[Wal89]</a> D. R. Wallace and R. U. Fujii, <a href="https://www-usr.inf.ufsm.br/~ceretta/papers/fujii89_software_vv.pdf" target="_blank" rel="noopener noreferrer">"Software Verification and Validation: An Overview"</a>, <em>IEEE Software</em>, vol. 6, no. 3 (May 1989), pp. 10, 17. </li> <li> <a class="jumptargets" id="War14">[War14]</a> R. Ward and B. Beyer, <a href="https://www.usenix.org/publications/login/dec14/ward" target="_blank" rel="noopener noreferrer">"BeyondCorp: A New Approach to Enterprise Security"</a>, in <em>;login:</em>, vol. 39, no. 6, December 2014. </li> <li> <a class="jumptargets" id="Whi12">[Whi12]</a> J. A. Whittaker, J. Arbon, and J. Carollo, <em>How Google Tests Software</em>: Addison-Wesley, 2012. 
</li> <li> <a class="jumptargets" id="Woo96">[Woo96]</a> A. Wood, <a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=544240" target="_blank" rel="noopener noreferrer">"Predicting Software Reliability"</a>, in <em>Computer</em>, vol. 29, no. 11, 1996. </li> <li> <a class="jumptargets" id="Wri12a">[Wri12a]</a> H. K. Wright, <a href="https://www.hyrumwright.org/papers/dissertation.pdf" target="_blank" rel="noopener noreferrer">"Release Engineering Processes, Their Faults and Failures"</a>, (section 7.2.2.2) PhD Thesis, University of Texas at Austin, 2012. </li> <li> <a class="jumptargets" id="Wri12b">[Wri12b]</a> H. K. Wright and D. E. Perry, <a href="https://www.hyrumwright.org/papers/icse2012.pdf" target="_blank" rel="noopener noreferrer">"Release Engineering Practices and Pitfalls"</a>, in <em>Proceedings of the 34th International Conference on Software Engineering (ICSE '12)</em>. (IEEE, 2012), pp. 1281–1284. </li> <li> <a class="jumptargets" id="Wri13">[Wri13]</a> H. K. Wright, D. Jasper, M. Klimek, C. Carruth, Z. Wan, <a href="https://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/41342.pdf" target="_blank" rel="noopener noreferrer">"Large-Scale Automated Refactoring Using ClangMR"</a>, in <em>Proceedings of the 29th International Conference on Software Maintenance (ICSM '13)</em>, (IEEE, 2013), pp. 548–551. </li> <li> <a class="jumptargets" id="Yor11">[Yor11]</a> N. York, <a href="https://google-engtools.blogspot.com/2011/06/build-in-cloud-accessing-source-code.html" target="_blank" rel="noopener noreferrer">"Build in the Cloud: Accessing Source Code"</a>, <em>Google Engineering Tools</em> blog post, June 2011. </li> <li> <a class="jumptargets" id="Zoo14">[Zoo14]</a> ZooKeeper Project (Apache Foundation), <a href="https://zookeeper.apache.org/doc/current/recipes.html" target="_blank" rel="noopener noreferrer">"ZooKeeper Recipes and Solutions"</a>, in ZooKeeper 3.4 documentation, 2014. 
</li> </ul> </section> </div> </div> <div class="footer"> <div class="maia-aux"> <div class="previous"> <a href="/sre-book/production-meeting/"> <p class="footer-caption">Previous</p> <p class="chapter-link"> Appendix F - Example Production Meeting Minutes </p> </a> </div> <div class="next"> </div> <p class="footer-link">Copyright © 2017 Google, Inc. Published by O'Reilly Media, Inc. Licensed under <a href="https://creativecommons.org/licenses/by-nc-nd/4.0/" rel="noopener noreferrer" target="_blank">CC BY-NC-ND 4.0</a></p> </div> </div> </main> <script src="//ajax.googleapis.com/ajax/libs/angularjs/1.6.6/angular.min.js"></script> <script src="//ajax.googleapis.com/ajax/libs/angularjs/1.6.6/angular-animate.min.js"></script> <script src="//ajax.googleapis.com/ajax/libs/angularjs/1.6.6/angular-touch.min.js"></script> <script src="/sre-book/static/js/index.min.js?cache=5b7f90b"></script> </body> </html>
