<!DOCTYPE html> <html lang="en"> <head> <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start': new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0], j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src= 'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f); })(window,document,'script','dataLayer','GTM-WVF23W3');</script> <meta charset="utf-8"> <meta content="initial-scale=1, minimum-scale=1, width=device-width" name="viewport"> <title>Google SRE- Incident Document Shakespeare Search Outage</title> <meta name="description" content="Google SRE incident document that details the &quot;Shakespeare Sonnet++ Overload,&quot; with summary, status, command hierarchy, and timeline of the incident."> <meta name="referrer" content="no-referrer" /> <link rel="canonical" href="https://sre.google/sre-book/incident-document/"> <link rel="apple-touch-icon-precomposed" sizes="180x180" href="https://lh3.googleusercontent.com/Yf2DCX8RKda6r4Jml9DLMByS2zQCBFs3kQpvBfN8UgIh4YVWIYSYIQOoTxJriyuM26cT5PDjyEb5aynDQ0Xyz46yHKnfg8JlUbDW"> <link rel="stylesheet" href="//fonts.googleapis.com/css?family=Google+Sans:400|Roboto:400,400italic,500,500italic,700,700italic|Roboto+Mono:400,500,700|Material+Icons"> <link rel="icon" type="image/png" sizes="32x32" href="https://lh3.googleusercontent.com/Yf2DCX8RKda6r4Jml9DLMByS2zQCBFs3kQpvBfN8UgIh4YVWIYSYIQOoTxJriyuM26cT5PDjyEb5aynDQ0Xyz46yHKnfg8JlUbDW"> <link rel="icon" type="image/png" sizes="16x16" href="https://lh3.googleusercontent.com/Yf2DCX8RKda6r4Jml9DLMByS2zQCBFs3kQpvBfN8UgIh4YVWIYSYIQOoTxJriyuM26cT5PDjyEb5aynDQ0Xyz46yHKnfg8JlUbDW"> <link rel="shortcut icon" href="https://lh3.googleusercontent.com/Yf2DCX8RKda6r4Jml9DLMByS2zQCBFs3kQpvBfN8UgIh4YVWIYSYIQOoTxJriyuM26cT5PDjyEb5aynDQ0Xyz46yHKnfg8JlUbDW"> <link href="/sre-book/static/css/index.min.css?cache=6c30b59" rel="stylesheet"> <script> (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ 
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','https://www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-75468017-1', 'auto'); ga('send', 'pageview'); </script> <script type="application/ld+json"> { "@context": "https://schema.org", "@type": "Article", "mainEntityOfPage": { "@type": "WebPage", "@id": "/sre-book/incident-document/" }, "headline": "Example Incident State Document", "description": "Shakespeare search service in cascading failure due to newly discovered sonnet not in search index.", "publisher": { "@type": "Organization", "name": "Google SRE", "logo": { "@type": "ImageObject", "url": "https://lh3.googleusercontent.com/C3_YVnTdc7xzTDekhsGeZ2hEYUnAlp47Au-9C50vi5r44rfpJAgiycs1g6AFKWqIpw6KVPrZWLse1VUqgOqYht-RxV1iowdB0_IABUd966aDsDWW-65m" } } } </script> <script src="/sre-book/static/js/detect.min.js?cache=4cb778b"></script> </head> <body> <noscript><iframe class="no-script-iframe" src="https://www.googletagmanager.com/ns.html?id=GTM-WVF23W3"></iframe></noscript> <main> <div ng-controller= "HeaderCtrl as headerCtrl"> <div id="curtain" class="menu-closed"></div> <div class="header clearfix"> <a id="burger-menu" class="expand"></a> <h2 class="chapter-title"> Appendix C - Example Incident State Document </h2> </div> <div id="overlay-element" class="expands"> <div class="logo"> <a href="https://www.google.com"><img src="https://lh3.googleusercontent.com/YoVRtLOHMSRYQZ3OhFL8RIamcjFYbmQXX4oAQx02MRqqY9zlKNvsuZpS73khXiOqTH3qrFW27VrERJJIHTjPk-tAh46q8-Fd4w6qlw" alt="Google"></a> </div> <ol id="drop-down" class="dropdown-content hide"> <li><a class="menu-buttons" href="/sre-book/table-of-contents/">Table of Contents</a></li> <li> <a href="/sre-book/foreword/" class="menu-buttons"> Foreword </a> </li> <li> <a href="/sre-book/preface/" class="menu-buttons"> Preface </a> </li> <li> <a 
href="/sre-book/part-I-introduction/" class="menu-buttons"> Part I - Introduction </a> </li> <li> <a href="/sre-book/introduction/" class="menu-buttons"> 1. Introduction </a> </li> <li> <a href="/sre-book/production-environment/" class="menu-buttons"> 2. The Production Environment at Google, from the Viewpoint of an SRE </a> </li> <li> <a href="/sre-book/part-II-principles/" class="menu-buttons"> Part II - Principles </a> </li> <li> <a href="/sre-book/embracing-risk/" class="menu-buttons"> 3. Embracing Risk </a> </li> <li> <a href="/sre-book/service-level-objectives/" class="menu-buttons"> 4. Service Level Objectives </a> </li> <li> <a href="/sre-book/eliminating-toil/" class="menu-buttons"> 5. Eliminating Toil </a> </li> <li> <a href="/sre-book/monitoring-distributed-systems/" class="menu-buttons"> 6. Monitoring Distributed Systems </a> </li> <li> <a href="/sre-book/automation-at-google/" class="menu-buttons"> 7. The Evolution of Automation at Google </a> </li> <li> <a href="/sre-book/release-engineering/" class="menu-buttons"> 8. Release Engineering </a> </li> <li> <a href="/sre-book/simplicity/" class="menu-buttons"> 9. Simplicity </a> </li> <li> <a href="/sre-book/part-III-practices/" class="menu-buttons"> Part III - Practices </a> </li> <li> <a href="/sre-book/practical-alerting/" class="menu-buttons"> 10. Practical Alerting </a> </li> <li> <a href="/sre-book/being-on-call/" class="menu-buttons"> 11. Being On-Call </a> </li> <li> <a href="/sre-book/effective-troubleshooting/" class="menu-buttons"> 12. Effective Troubleshooting </a> </li> <li> <a href="/sre-book/emergency-response/" class="menu-buttons"> 13. Emergency Response </a> </li> <li> <a href="/sre-book/managing-incidents/" class="menu-buttons"> 14. Managing Incidents </a> </li> <li> <a href="/sre-book/postmortem-culture/" class="menu-buttons"> 15. Postmortem Culture: Learning from Failure </a> </li> <li> <a href="/sre-book/tracking-outages/" class="menu-buttons"> 16. 
Tracking Outages </a> </li> <li> <a href="/sre-book/testing-reliability/" class="menu-buttons"> 17. Testing for Reliability </a> </li> <li> <a href="/sre-book/software-engineering-in-sre/" class="menu-buttons"> 18. Software Engineering in SRE </a> </li> <li> <a href="/sre-book/load-balancing-frontend/" class="menu-buttons"> 19. Load Balancing at the Frontend </a> </li> <li> <a href="/sre-book/load-balancing-datacenter/" class="menu-buttons"> 20. Load Balancing in the Datacenter </a> </li> <li> <a href="/sre-book/handling-overload/" class="menu-buttons"> 21. Handling Overload </a> </li> <li> <a href="/sre-book/addressing-cascading-failures/" class="menu-buttons"> 22. Addressing Cascading Failures </a> </li> <li> <a href="/sre-book/managing-critical-state/" class="menu-buttons"> 23. Managing Critical State: Distributed Consensus for Reliability </a> </li> <li> <a href="/sre-book/distributed-periodic-scheduling/" class="menu-buttons"> 24. Distributed Periodic Scheduling with Cron </a> </li> <li> <a href="/sre-book/data-processing-pipelines/" class="menu-buttons"> 25. Data Processing Pipelines </a> </li> <li> <a href="/sre-book/data-integrity/" class="menu-buttons"> 26. Data Integrity: What You Read Is What You Wrote </a> </li> <li> <a href="/sre-book/reliable-product-launches/" class="menu-buttons"> 27. Reliable Product Launches at Scale </a> </li> <li> <a href="/sre-book/part-IV-management/" class="menu-buttons"> Part IV - Management </a> </li> <li> <a href="/sre-book/accelerating-sre-on-call/" class="menu-buttons"> 28. Accelerating SREs to On-Call and Beyond </a> </li> <li> <a href="/sre-book/dealing-with-interrupts/" class="menu-buttons"> 29. Dealing with Interrupts </a> </li> <li> <a href="/sre-book/operational-overload/" class="menu-buttons"> 30. Embedding an SRE to Recover from Operational Overload </a> </li> <li> <a href="/sre-book/communication-and-collaboration/" class="menu-buttons"> 31. 
Communication and Collaboration in SRE </a> </li> <li> <a href="/sre-book/evolving-sre-engagement-model/" class="menu-buttons"> 32. The Evolving SRE Engagement Model </a> </li> <li> <a href="/sre-book/part-V-conclusions/" class="menu-buttons"> Part V - Conclusions </a> </li> <li> <a href="/sre-book/lessons-learned/" class="menu-buttons"> 33. Lessons Learned from Other Industries </a> </li> <li> <a href="/sre-book/conclusion/" class="menu-buttons"> 34. Conclusion </a> </li> <li> <a href="/sre-book/availability-table/" class="menu-buttons"> Appendix A. Availability Table </a> </li> <li> <a href="/sre-book/service-best-practices/" class="menu-buttons"> Appendix B. A Collection of Best Practices for Production Services </a> </li> <li class="active"> <a href="/sre-book/incident-document/" class="menu-buttons"> Appendix C. Example Incident State Document </a> </li> <li> <a href="/sre-book/example-postmortem/" class="menu-buttons"> Appendix D. Example Postmortem </a> </li> <li> <a href="/sre-book/launch-checklist/" class="menu-buttons"> Appendix E. Launch Coordination Checklist </a> </li> <li> <a href="/sre-book/production-meeting/" class="menu-buttons"> Appendix F. 
Example Production Meeting Minutes </a> </li> <li> <a href="/sre-book/bibliography/" class="menu-buttons"> Bibliography </a> </li> </ol> </div> </div> <div id="maia-main"> <div class="content" id="content"> <section data-type="appendix" id="appendix_managing-incidents_template"> <h1 class="heading jumptargets">Example Incident State Document</h1> <p><strong>Shakespeare Sonnet++ Overload: 2015-10-21</strong><br> Incident management info: <em>https://incident-management-cheat-sheet</em></p> <p><em>(Communications lead to keep summary updated.)</em><br><strong>Summary</strong>: Shakespeare search service in cascading failure due to newly discovered sonnet not in search index.</p> <p><strong>Status</strong>: active, incident #465</p> <p><strong>Command Post(s)</strong>: <code>#shakespeare</code> on IRC</p> <p><strong>Command Hierarchy</strong> <em>(all responders)</em></p> <ul> <li> <p>Current Incident Commander: jennifer</p> <ul> <li>Operations lead: docbrown</li> <li>Planning lead: jennifer</li> <li>Communications lead: jennifer</li> </ul> </li> <li> <p>Next Incident Commander: <em>to be determined</em></p> </li> </ul> <p><em>(Update at least every four hours and at handoff of Comms Lead role.)</em><br><strong>Detailed Status</strong> (last updated at 2015-10-21 15:28 UTC by jennifer)</p> <p><strong>Exit Criteria:</strong></p> <ul> <li>New sonnet added to Shakespeare search corpus <strong>TODO</strong></li> <li>Within availability (99.99%) and latency (99%ile &lt; 100 ms) SLOs for 30+ minutes <strong>TODO</strong></li> </ul> <p class="pagebreak-before"><strong>TODO list and bugs filed:</strong></p> <ul> <li>Run MapReduce job to reindex Shakespeare corpus <strong>DONE</strong></li> <li>Borrow emergency resources to bring up extra capacity <strong>DONE</strong></li> <li>Enable flux capacitor to balance load between clusters (Bug 5554823) <strong>TODO</strong></li> </ul> <p><strong>Incident timeline</strong> <em>(most recent first: times are in UTC)</em></p> <ul> <li> 
<p>2015-10-21 15:28 UTC jennifer</p> <ul> <li>Increasing serving capacity globally by 2x </li> </ul> </li> <li> <p>2015-10-21 15:21 UTC jennifer</p> <ul> <li>Directing all traffic to USA-2 sacrificial cluster and draining traffic from other clusters so they can recover from cascading failure while spinning up more tasks</li> <li>MapReduce index job complete, awaiting Bigtable replication to all clusters</li> </ul> </li> <li> <p>2015-10-21 15:10 UTC martym</p> <ul> <li>Adding new sonnet to Shakespeare corpus and starting index MapReduce</li> </ul> </li> <li> <p>2015-10-21 15:04 UTC martym</p> <ul> <li>Obtains text of newly discovered sonnet from <em>shakespeare-discuss@</em> mailing list</li> </ul> </li> <li> <p>2015-10-21 15:01 UTC docbrown</p> <ul> <li>Incident declared due to cascading failure</li> </ul> </li> <li> <p>2015-10-21 14:55 UTC docbrown</p> <ul> <li>Pager storm, <code>ManyHttp500s</code> in all clusters</li> </ul> </li> </ul> </section> </div> </div> <div class="footer"> <div class="maia-aux"> <div class="previous"> <a href="/sre-book/service-best-practices/"> <p class="footer-caption">Previous</p> <p class="chapter-link"> Appendix B - A Collection of Best Practices for Production Services </p> </a> </div> <div class="next"> <a href="/sre-book/example-postmortem/"> <p class="footer-caption">Next</p> <p class="chapter-link"> Appendix D - Example Postmortem </p> </a> </div> <p class="footer-link">Copyright © 2017 Google, Inc. Published by O'Reilly Media, Inc. 
Licensed under <a href="https://creativecommons.org/licenses/by-nc-nd/4.0/" rel="noopener noreferrer" target="_blank">CC BY-NC-ND 4.0</a></p> </div> </div> </main> <script src="//ajax.googleapis.com/ajax/libs/angularjs/1.6.6/angular.min.js"></script> <script src="//ajax.googleapis.com/ajax/libs/angularjs/1.6.6/angular-animate.min.js"></script> <script src="//ajax.googleapis.com/ajax/libs/angularjs/1.6.6/angular-touch.min.js"></script> <script src="/sre-book/static/js/index.min.js?cache=5b7f90b"></script> </body> </html>
