<!DOCTYPE html> <html lang="en"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <title>Quattor | Summary of 22nd Quattor workshop (2016-10-04 to 2016-10-06, London)</title> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <link type="application/atom+xml" rel="alternate" href="https://www.quattor.org/feed.xml" title="Quattor" /> <link rel="stylesheet" type="text/css" href="/assets/bootstrap/css/bootstrap.min.css" media="screen" /> <link rel="stylesheet" type="text/css" media="screen" href="/assets/bootstrap/css/pygments.css"/> <link rel="stylesheet" type="text/css" href="/assets/octicons/octicons.css"> <style type="text/css"> @import url(https://fonts.googleapis.com/css?family=Lato:400); body { padding-bottom: 40px; font-family: 'Lato', 'Helvetica', sans-serif; } h1, h2, h3, h4, h5, h6, .h1, .h2, .h3, .h4, .h5, .h6 { font-family: 'Lato', 'Helvetica', sans-serif; } ul, ol { padding-left: 2em; } div.toc { margin-left: -2em; } div.page-content { border-left: 5px solid #eee; } </style> <link rel="publisher" href="https://plus.google.com/106108753304191902209" /> <link rel="icon" href="/assets/img/favicon.ico"/> </head> <body> <div class="navbar navbar-inverse navbar-static-top" role="navigation"> <div class="container"> <div class="navbar-header"> <button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-collapse"> <span class="sr-only">Toggle navigation</span> <span class="icon-bar"></span> <span class="icon-bar"></span> <span class="icon-bar"></span> </button> <a class="navbar-brand" href="/"> <img src="/assets/img/quattor_logo_navbar.png" width="94" height="23" alt="quattor logo"/> </a> </div> <div class="collapse navbar-collapse"> <ul class="nav navbar-nav"> <li><a href="/news/">News</a></li> <li><a href="/documentation/">Documentation</a></li> <li><a href="/contacts/">Contacts</a></li> <li><a href="/downloads/">Downloads</a></li> <li><a href="/meeting/">Meetings</a></li> </ul> </div> 
</div> </div> <div class="container"> <div class="row"> <ul class="breadcrumb"> <li><a href="/">Home</a></li> <li><a href="/meeting/index.html">Meeting</a></li> <li class="active">Summary of 22nd Quattor workshop (2016-10-04 to 2016-10-06, London)</li> </ul> <h1>Summary of 22nd Quattor workshop (2016-10-04 to 2016-10-06, London)</h1> <dl class="dl-horizontal"> <dt>Written</dt><dd class="date">04 October 2016</dd> <dt>Author</dt><dd class="author">Michel Jouvin</dd> </dl> <hr> </div> <div class="row"> <div class="col-md-2"> <div class="toc" id="toc"></div> </div> <div class="col-md-10 page-content"> <h1 id="summary-of-22nd-quattor-workshop">Summary of 22nd Quattor Workshop</h1> <p><a href="https://indico.cern.ch/event/560421/">Agenda</a></p> <h2 id="site-news">Site News</h2> <h3 id="ms">MS</h3> <ul> <li>Some reorganization: software distribution added to the configuration management group under Nathan’s responsibility; Dave is taking over from Nathan, and Gabor is now in charge of Operating System/Linux support (handing over Aquilon broker development to Dave)</li> <li>Main challenge: upgrading the configuration modules and core libraries, many of which are still very old <ul> <li>Currently working on using 16.8 as the basis for RHEL7, with the hope of using it for EL5/6 too</li> </ul> </li> <li>Working on monthly upgrades of the OS (and related tools): using YUM snapshots <ul> <li>Using yumng rather than the yum ncm-spma plugin</li> </ul> </li> <li>Solaris: Solaris 11 support, AII extended to support SPARC OBP rather than PXE/DHCP <ul> <li>Solaris 12: not yet looked at, but no major issue expected</li> </ul> </li> <li>Still running ~500 EL5 systems: hope to get this down to ~200 in the coming months but expect a long tail after that</li> <li>Difficulty staying up to date with all the dependencies required to run configuration module tests <ul> <li>Michel: using template <code class="language-plaintext highlighter-rouge">quattor-development.pan</code> from OS <code class="language-plaintext 
highlighter-rouge">rpms</code> directory may help to deploy or identify required dependencies</li> <li>Jindrich subsequently suggested using an auto-generated meta-package that could be shipped with each version of Quattor.</li> </ul> </li> <li>Maven is also seen as a difficulty, as there is no source RPM build, which makes it difficult to insert MS patches <ul> <li>Currently maintaining an MS branch, having to do a lot of merging</li> <li>Would like to get away from patches but may need to keep some, and would prefer to rebuild the official version under a different minor version after applying the patches</li> </ul> </li> <li>AII should handle partition alignment</li> <li>Any plan to support <code class="language-plaintext highlighter-rouge">systemd-resolved</code> as a replacement for <code class="language-plaintext highlighter-rouge">/etc/resolv.conf</code> for configuring the DNS resolver? <a href="https://github.com/quattor/configuration-modules-core/issues/941">Issue opened</a></li> </ul> <h3 id="ral">RAL</h3> <ul> <li>CentOS 7 work started: pushed by Ceph</li> <li>~20 EL5 machines including all storage clients plus an Oracle DB</li> <li>Still trying to get rid of SCDB but not yet complete</li> <li>Upgrading their Aquilon-managed plant to 16.8 at present</li> <li>More recent releases of Aquilon require EL7, and the Aquilon host remains on EL6, so it cannot move forward</li> <li>Active work on virtualisation through containers: about to release a lot of services in an Apache Mesos cluster <ul> <li>EL6 containers running on EL7 bare metal</li> <li>Virtualisation has not worked for high-throughput services, so moving services to containers running on top of a bare-metal grid appears to be a better approach</li> <li>Developing an Aquilon Personality to Image-based build using packer.io</li> </ul> </li> <li>James suggests building, as part of the release, a plenary template library whose root directory would be the Quattor version number. 
This would make it easy to switch from one version to another and reduce the risk of a fork (as local changes to the plenary version would be lost in the next version)…</li> <li>Using a monthly OS upgrade cycle. Quattor components are upgraded separately. A summer student wrote some code to compare the packages actually installed with those the profile says should be installed. They run this via nagios.</li> </ul> <h3 id="ulb">ULB</h3> <ul> <li>Working on UEFI support in AII, required by the most recently purchased systems (EL7) <ul> <li>Requires at least one additional partition <code class="language-plaintext highlighter-rouge">/boot/uefi</code>: it would be great to add it to <code class="language-plaintext highlighter-rouge">filesystem_layout</code> and be able to define a variable triggering whatever is required in the configuration for UEFI support</li> </ul> </li> <li>Still fighting with Aquilon installation because of MS dependencies: up-to-date documentation is required</li> <li>Interested in EL7 support for grid services <ul> <li>Michel: umd-4 branch was started but never merged… some preliminary work at LAL. 
Will be added to the next version.</li> </ul> </li> <li>Problem with ncm-chkconfig not stopping/removing the configuration of services that are no longer part of the configuration <ul> <li>In fact this is the more general problem of unconfiguring/restoring the previous state when something is removed from the configuration: which state should be restored?</li> <li>Non-trivial but worth a discussion: CAF::History may help but is not necessarily the complete solution</li> <li>James: another variation of the problem is a change in config file names that leaves the old file behind (for example under <code class="language-plaintext highlighter-rouge">conf.d</code>)</li> </ul> </li> </ul> <h3 id="lal">LAL</h3> <ul> <li>Installation on systems with system disks larger than 2 TB: requires a small partition with type <code class="language-plaintext highlighter-rouge">biosboot</code>.</li> <li>Many CentOS7 nodes as part of the OpenStack Cloud deployment</li> <li>WIP: use ncm-ceph for Ceph deployment <ul> <li>Currently Quattor only deploys the packages and the basic configuration</li> </ul> </li> </ul> <h3 id="ghent">Ghent</h3> <ul> <li>Mostly running EL7, plan to remove the last EL6 machines in December.</li> <li>Still do not have a working Aquilon broker after 2 years of trying.</li> <li>Brief discussion of the “protected nodes” feature in AII.</li> </ul> <h2 id="configuration-modules">Configuration Modules</h2> <h3 id="ccm">CCM</h3> <ul> <li><code class="language-plaintext highlighter-rouge">CDB_File</code> to become the default backend. It is significantly more robust than <code class="language-plaintext highlighter-rouge">DB_File</code>, which is prone to unrecoverable corruption if the filesystem is full. Confirmed that the rpm is available on EL5.</li> </ul> <h3 id="ncm-ncd">ncm-ncd</h3> <ul> <li>Proposal to run ncm-ncd in taint mode (-T); it currently runs in <code class="language-plaintext highlighter-rouge">warnings</code> mode (-t)</li> <li>perl-taint-runtime is available from EPEL for RH5+. 
This module gives us the ability to enable -T component by component</li> <li>at present it’s hard to know which components will fail because it’s not recorded in /var/log/ncm/ncd.log. You can see it in ncm-cdispd.log for those components run by cdispd.</li> <li>it’s a run-time check, so running the code as part of the unit tests is not enough to be confident that it is OK</li> <li>yumng currently gives lots of taint warnings due to unlink</li> <li>need to change CCM to produce untainted data, and need a developer guide on how to use CCM correctly to avoid getting tainted data <ul> <li>getTree() is the only supported way of retrieving profile data. It is not recommended to use anything else!</li> <li>what about a quattor critic to look for bad practices, etc? Nice to have… but we should write down what the rules are first.</li> </ul> </li> </ul> <h3 id="ncm-metaconfig">ncm-metaconfig</h3> <p>ncm-metaconfig is seen as a component that can run many things, with the risk of one of them failing and preventing other dependencies from running</p> <ul> <li>One workaround is to define a separate configuration path for the critical things and alias it to metaconfig: this will result in ncm-metaconfig running twice <ul> <li>Requires https://github.com/quattor/ncm-cdispd/issues/41 to be fixed</li> </ul> </li> </ul> <p>Sometimes one would like to run an arbitrary command, but this is not allowed</p> <ul> <li>This was a design choice in metaconfig</li> <li>One possibility is to subclass it and add support for an arbitrary command</li> <li>We could also allow predefined commands (not from the pan code, but something similar to TT files) per service, in particular to validate/test the generated files.</li> </ul> <h3 id="ncm-systemd">ncm-systemd</h3> <p>Support for instances</p> <ul> <li>Instances need to be created from templates</li> <li>To manage all instances as a group, create a target with all the instances</li> <li>This has to be done through <code class="language-plaintext 
highlighter-rouge">/software/components/systemd</code> rather than <code class="language-plaintext highlighter-rouge">/software/components/chkconfig</code></li> </ul> <p>Start/restart of services</p> <ul> <li>Kept the original chkconfig behaviour where the service is started/stopped only if the service was previously stopped/started and the desired state was changed</li> <li>To be checked: is the service restarted if its configuration is changed?</li> </ul> <p>Syntax/Schema</p> <ul> <li>the unit syntax is a bit confusing: for example, when <a href="https://github.com/quattor/configuration-modules-core/blob/master/ncm-systemd/src/main/pan/components/systemd/schema.pan#L265">only</a> is true, systemd only generates the unitfile and does not enable/disable the service; it falls back to /software/components/chkconfig for that.</li> <li>the unit tests provide good examples of how to use the component</li> <li><code class="language-plaintext highlighter-rouge">replace</code> property: <ul> <li>when true, says to ignore any configuration for the service shipped by the system or provided by an RPM.</li> <li>when false, can be used to augment and complement the existing unitfile as shipped by the rpm.</li> </ul> </li> </ul> <h3 id="ncm-spma">ncm-spma</h3> <p>APT plugin available as a <a href="https://github.com/quattor/configuration-modules-core/pull/821">PR</a>.</p> <ul> <li>Used at RAL to manage a few Ubuntu VMs</li> <li>Several comments to address before merging the PR</li> <li>Worth testing: should work for most cases</li> </ul> <h3 id="exceptions">Exceptions</h3> <p>Agreement to move away from <code class="language-plaintext highlighter-rouge">LC::Exception</code> and have a <code class="language-plaintext highlighter-rouge">CAF::Die</code> that will do the proper logging before raising a standard exception through <code class="language-plaintext highlighter-rouge">croak</code>.</p> <h2 id="aquilon">Aquilon</h2> <p>Packaging is critical for adoption</p> <ul> <li>Agreement that 
getting setuptools to work for Aquilon installation/upgrade would be a great step forward</li> <li>Managing upgrades across schema upgrades can be difficult. It would be a good idea to have multiple versions installable on a host to make managing schema changes easier, similar to the way postgres does it. This brings it closer to how things are managed at MS, which should reduce the barrier to making good packages.</li> </ul> <h3 id="tooling-around-aquilon">Tooling around Aquilon</h3> <p>MS developed several related tools that may be useful for the community and may be upstreamed…</p> <ul> <li><code class="language-plaintext highlighter-rouge">aqs</code>: a wrapper over <code class="language-plaintext highlighter-rouge">aq show</code> with short options</li> <li><code class="language-plaintext highlighter-rouge">ckey</code>: utility to put colours on key/value display for easier reading. Reads its input from stdin.</li> <li><code class="language-plaintext highlighter-rouge">getprof</code>: downloads a host or cluster profile from the broker. Output is sent to stdout.</li> <li><code class="language-plaintext highlighter-rouge">rifle</code>: parses a profile and renders it as key/value pairs (similar to CCM but without requiring it). Can unescape keys. Configuration paths can contain <code class="language-plaintext highlighter-rouge">*</code>. Option to display multi-line values on several lines. List indexes (generally not meaningful) can be replaced by <code class="language-plaintext highlighter-rouge">#</code> for easier comparison between profiles. <ul> <li><code class="language-plaintext highlighter-rouge">trifle</code>: C version of <code class="language-plaintext highlighter-rouge">rifle</code> processing only JSON profiles without using any JSON parser (streaming in and out). 
15x to 20x faster.</li> </ul> </li> <li><code class="language-plaintext highlighter-rouge">prdiff</code>: prints the differences between two profiles parsed by <code class="language-plaintext highlighter-rouge">rifle</code>.</li> <li><code class="language-plaintext highlighter-rouge">diffuniq</code>: finds the unique differences in a set of differences. Option to ignore some differences, based on a regexp.</li> <li><code class="language-plaintext highlighter-rouge">diffprof</code>: in a recursive diff, identifies all unique differences and displays the number of occurrences of each. <ul> <li>Needs two directories containing the same profile names</li> <li>Diffs are run in parallel for good performance with a large number of profiles</li> </ul> </li> <li><code class="language-plaintext highlighter-rouge">make diff-complete</code>: a set of tools to display profile differences between the current sandbox and the version without the modifications introduced by the sandbox: walks through the git history to identify the sandbox parent commit, rebuilds the profiles for this version, builds the sandbox and runs <code class="language-plaintext highlighter-rouge">diffprof</code>. Can work with a specific set of profiles (static profiles or a sample of live profiles) or all live profiles. 
<ul> <li>Splitting the compilation of a large number of profiles into several parallel panc instances showed a significant speedup</li> <li>JSON turned out to be much faster for the (Python) tools to parse: JSON is used internally by the tools although XML is still used for the profiles sent to hosts</li> <li>Currently based on a Makefile that provides the glue between the tools and is MS specific… The underlying tools are generic, however.</li> <li>MS plans to use these tools through CI (Jenkins) as part of the <code class="language-plaintext highlighter-rouge">aq publish</code> workflow</li> </ul> </li> </ul> <p>Agreement to open source these tools in an <code class="language-plaintext highlighter-rouge">aquilon-tools</code> repo</p> <ul> <li>Some of the tools could also be used in the SCDB context to replace the current (very basic) <code class="language-plaintext highlighter-rouge">compare_profile</code> or other site-specific tools</li> <li>Should work on improving <code class="language-plaintext highlighter-rouge">aq compare</code> to use these tools internally</li> </ul> <h3 id="atomic-and-test-deployments">Atomic and Test Deployments</h3> <p>Objectives: rock-solid deployments and back-outs</p> <ul> <li>Be confident in the knowledge of the effects of deploying a change: really push it without applying the changes</li> <li>Quickly and 100% reliably remove a change</li> <li>List the changes recently applied to a host</li> <li>Identify which changes caused a particular artefact</li> <li>List all the manual steps required to back out a change in case of emergency (machine no longer manageable by Quattor)</li> </ul> <p>Proposed changes</p> <ul> <li><code class="language-plaintext highlighter-rouge">aq deploy --compile</code>: currently <code class="language-plaintext highlighter-rouge">deploy</code> is followed by <code class="language-plaintext highlighter-rouge">compile</code>, with the risk of interleaving deployments. 
The broker can block any other deployment until the end of the compilation and the notification of the clients <ul> <li>In the notification, add a change id that will be passed to ncm-cdispd and will be used to produce a deployment log for this change id. The change id could be based on the git commit id (possibly with additional data). Need to ensure that this doesn’t lead to recompiling every profile each time.</li> <li>Not clear how to ensure that all the intermediate changes are deployed: currently <code class="language-plaintext highlighter-rouge">ccm-fetch</code> picks the last profile, not the intermediate ones if any. In fact, it is not necessarily desirable to apply all the intermediate changes one by one: often one change introduces problems fixed by the next change/profile.</li> <li>An alternative could be to track the set of change ids associated with a particular deployment</li> </ul> </li> <li><code class="language-plaintext highlighter-rouge">aq undeploy</code>: undeploy a change id. Done by reverting the associated Git changes and producing new profiles, but with a new type of notification (<code class="language-plaintext highlighter-rouge">ccu</code> for undeploy) that instructs <code class="language-plaintext highlighter-rouge">ncm-cdispd</code> to do a rollback rather than a deploy, using the change id deployment log. 
<ul> <li>Would imply undeploying any change after the reverted change id</li> <li>Some actions may be marked as ‘not undoable’</li> <li>Possible to add <code class="language-plaintext highlighter-rouge">--redeploy</code> to redeploy those later changes after undeploying the buggy change id</li> <li>yum changes are difficult to track: not necessarily associated with changes in profiles</li> </ul> </li> </ul> <p>Discussion</p> <ul> <li>Agreement that <code class="language-plaintext highlighter-rouge">aq deploy --compile</code> should be implemented</li> <li>Scepticism about the undeploy idea: the problem is too general to be solvable… <ul> <li>James: concentrate on the file use case. Track file changes (CAF::History) and decide/implement what must be done to undo a file modification action (not trivial, depends on the case)</li> <li>Stijn/Luis: first try to make something of CAF::History. It is currently only used for logging events, but those events could be consumed by <code class="language-plaintext highlighter-rouge">ncm-ncd</code> to maintain a component-agnostic database of modified files, for example.</li> <li>On the other hand, improving the logging of the actions performed, to ease troubleshooting and give a chance to understand what must be done to “undeploy” the change, would be a good step forward</li> <li>CAF::History could be associated with change ids and help to store file backups in diff format rather than as multiple versions of the entire file.</li> </ul> </li> </ul> <p>Test deployments: <code class="language-plaintext highlighter-rouge">--test</code> option to <code class="language-plaintext highlighter-rouge">aq deploy</code></p> <ul> <li>Produces temporary profiles</li> <li>Notification with a <code class="language-plaintext highlighter-rouge">cct</code> event</li> <li><code class="language-plaintext highlighter-rouge">ccm-fetch</code> fetches the profile as a test profile rather than a production one</li> <li><code class="language-plaintext 
highlighter-rouge">ncm-cdispd</code> dispatches components with <code class="language-plaintext highlighter-rouge">--noaction</code> and produces a deployment log that is sent back to the AQ web server and can be viewed with <code class="language-plaintext highlighter-rouge">aq show testing --change id</code>. <ul> <li><code class="language-plaintext highlighter-rouge">ncm-ncd</code> is actually responsible for pushing the deployment log back to the web server, rather than <code class="language-plaintext highlighter-rouge">ncm-cdispd</code>, whose only responsibility is dispatching <code class="language-plaintext highlighter-rouge">ncm-ncd</code></li> </ul> </li> </ul> <p>Suggestion to use JSON as the format of deployment logs to make them machine parsable</p> <ul> <li>Nathan suggests putting the verbose information in the deployment logs and using syslog to produce the file (it will take care of serialization between event sources)</li> <li>A DB on the machine for keeping the history may use another format like sqlite</li> </ul> <h2 id="quattor-building-tools-and-releases">Quattor Building Tools and Releases</h2> <p>MS is interested in getting source RPMs for build reproducibility plus easier integration of any patches</p> <ul> <li>James will try to produce them as part of the next release</li> </ul> <p>Component config.pan: MS needs to patch them to include an MS-specific template at the end</p> <ul> <li>Proposal: include a site-specific template at the end of the standard config.pan</li> <li>Proposal 2: provide a template in the Maven tools that will do the typical part of a <code class="language-plaintext highlighter-rouge">config.pan</code> template, including the inclusion of a site-specific template. 
For most components, the actual <code class="language-plaintext highlighter-rouge">config.pan</code> would be as simple as: <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$(component_config_pan) </code></pre></div> </div> <ul> <li>Proposal 2 would help to get the new feature out and used in all components (and also help with future modifications that must be done in every component).</li> </ul> </li> <li>Issue tracking this: https://github.com/quattor/maven-tools/issues/109</li> </ul> <p>panc: Cal was clear he no longer has the time to review pull requests</p> <ul> <li>Gabor is the main panc expert, let’s rely on him!</li> <li>Ensure that we add unit tests covering proposed changes</li> <li>Release is produced with Maven the same way as the Quattor release <ul> <li>As a demonstration, 10.3 was successfully released by @jrha during the workshop with two bug fixes</li> </ul> </li> </ul> <h2 id="discussion-of-open-issues">Discussion of Open Issues</h2> <p>https on quattor.org: no real solution until proper support for https on custom domains is provided by GitHub</p> <p>Validation of variables by a schema: in fact, this raises the question of what the interface to the template library is</p> <ul> <li>Action: Michel should start documentation on how variables are used in the template library, how they are named, <code class="language-plaintext highlighter-rouge">config.pan</code> as the entry point for features…</li> <li>Need to make progress with adding annotations to the variables in the template library and with processing them to build documentation <ul> <li>MS has a couple of scripts doing some processing that may be used as a starting point: @wdpypere managed to use them during the workshop (initial version)</li> </ul> </li> </ul> <h2 id="conclusions">Conclusions</h2> <p>Next workshop: Annecy</p> <ul> <li>See Doodle: https://doodle.com/poll/rbfyhx5kux8a7tfg</li> </ul> <p><a 
href="https://fosdem.org/2017/">FOSDEM</a> participation: interesting to meet people but quite difficult to get a talk accepted</p> <ul> <li>The DevOps and Config Mgt track has 10x more talks submitted than can fit into the agenda… <ul> <li>Interesting FOSDEM talks are generally not about a specific technology or implementation but more about what was learned from a concrete experience: could we present how we enable a high level of sharing through the releases containing configuration modules and the template library, so that sites do not need to fork and then struggle to integrate further development by the community…</li> </ul> </li> <li>Submission usually before December</li> </ul> </div> </div> <!-- row --> </div> <hr/> <div class="footer container"> <p class="text-muted credit">Copyright © 2011-2024 quattor.org</p> </div> <!-- Le javascript ================================================== --> <!-- Placed at the end of the document so the pages load faster --> <script src="https://code.jquery.com/jquery-1.10.2.min.js"></script> <script src="/assets/bootstrap/js/bootstrap.min.js"></script> <script src="/assets/js/toc.js"></script> <script type="text/javascript"> $(document).ready(function() { $('.toc').toc({'title':'', 'showSpeed':0, 'noBackToTopLinks': true, 'listType': 'ul'}); }); </script> </body> </html>