Invisible XML

<!DOCTYPE html> <!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--> <!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]--> <!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]--> <!--[if gt IE 8]><!--> <html class="no-js" lang="en"> <!--<![endif]--> <head> <meta charset="utf-8" /> <meta http-equiv="X-UA-Compatible" content="IE=edge" /> <title>Invisible XML</title> <meta name="description" content="Norm Tovey-Walsh introduces Invisible XML, a language for describing the implicit structure of data, and a set of technologies for making that structure explicit as XML markup."/> <meta name="viewport" content="width=device-width, initial-scale=1" /> <link rel="icon" type="image/x-icon" href="/static/favicon.ico"/> <link rel="stylesheet" href="/static/CACHE/css/output.7ac6b21eee6a.css" type="text/css"> <link href="https://fonts.googleapis.com/css?family=Lato|Roboto" rel="stylesheet"/> <link rel="stylesheet" type="text/css" href="/static/css/print.css" media="print" /> <script async src="https://www.googletagmanager.com/gtag/js?id=G-6Z87ZDEY5E"></script> <script> window.dataLayer = window.dataLayer || []; function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'G-6Z87ZDEY5E'); </script> <script async='async' src='https://www.googletagservices.com/tag/js/gpt.js'></script> <script> var googletag = googletag || {}; googletag.cmd = googletag.cmd || []; </script> <script> googletag.cmd.push(function() { googletag.defineSlot('/21754636678/xml.com-1', [[160, 600], [120, 240], [300, 250]], 'div-gpt-ad-1550450394815-0').addService(googletag.pubads()); googletag.defineSlot('/21754636678/xml.com-2', [[300, 250]], 'div-gpt-ad-1550513522284-0').addService(googletag.pubads()); googletag.pubads().enableSingleRequest(); googletag.pubads().disableInitialLoad(); googletag.enableServices(); }); </script> <script> //load the apstag.js library !function(a9,a,p,s,t,A,g){if(a[a9])return;function 
q(c,r){a[a9]._Q.push([c,r])}a[a9]={init:function(){q("i",arguments)},fetchBids:function(){q("f",arguments)},setDisplayBids:function(){},targetingKeys:function(){return[]},_Q:[]};A=p.createElement(s);A.async=!0;A.src=t;g=p.getElementsByTagName(s)[0];g.parentNode.insertBefore(A,g)}("apstag",window,document,"script","//c.amazon-adsystem.com/aax2/apstag.js"); //initialize the apstag.js library on the page to allow bidding apstag.init({ pubID: '32676f4f-8458-484f-b742-dcd7ad80a504', //enter your pub ID here as shown above, it must within quotes adServer: 'googletag' }); apstag.fetchBids({ slots: [{ slotID: 'div-gpt-ad-1550450394815-0', //example: 'div-gpt-ad-1475102693815-0' slotName: '21754636678/xml.com-1', //example: '12345/box-1' sizes: [[160,600], [300,250], [120,400]] //example: [[300,250], [300,600]] }, { slotID: 'div-gpt-ad-1550513522284-0', //example: 'div-gpt-ad-1475185990716-0' slotName: '21754636678/xml.com-2', //example: '12345/leaderboard-1' sizes: [[300,250]] //example: [[728,90]] }], timeout: 2e3 }, function(bids) { // set apstag targeting on googletag, then trigger the first DFP request in googletag's disableInitialLoad integration googletag.cmd.push(function(){ apstag.setDisplayBids(); googletag.pubads().refresh(); }); }); </script> </head> <body class="homepage"> <div class="title-bar hide-for-print" data-responsive-toggle="menu" data-hide-for="large"> <button class="menu-icon" type="button" value="Menu" data-toggle></button> <div class="title-bar-title">XML.com</div> </div> <div class="row"> <div class="top-bar hide-for-print" id="menu"> <div class="top-bar-left"> <a href="/"><img src="/static/img/XML_com_logo.svg" alt="XML.com logo"/></a> </div> <div class="top-bar-right"> <ul class="menu vertical medium-horizontal" data-responsive-menu="drilldown medium-dropdown" role="menubar"> <li class=""><a href="/">Home</a></li> <li class=" "> <a href="/articles/">Articles</a> </li> <li class=" "> <a href="/authors/">Authors</a> </li> <li class=" "> <a 
href="/news/">News</a> </li> <li class=" "> <a href="/job-board/">Job Board</a> </li> <li class="has-submenu "> <a href="/about/">About</a> <ul class="submenu menu vertical"> <li class=""> <a href="/about/contribute/">Contribute</a> </li> <li class=""> <a href="/about/style-guide/">Style guide</a> </li> <li class=""> <a href="/about/copyright/">Copyright</a> </li> <li class=""> <a href="/about/contact/">Contact</a> </li> <li class=""> <a href="/about/privacy/">Privacy Policy</a> </li> </ul> </li> <li class="has-form" style="background: transparent;"> <form id="cse-search-box" action="https://google.com/cse"> <input type="hidden" name="cx" value="partner-pub-9264479583913780:3063344556"/> <input type="hidden" name="ie" value="UTF-8" /> <input type="text" placeholder="Search" name="q" title="Google Search"/> <!--<input type="submit" name="sa" value="Search">--> </form> </li> </ul> </div> </div> </div> <div class="row"> <div class="medium-9 columns"> <nav aria-label="You are here:" role="navigation"> <ul class="breadcrumbs"> <li><a href="/">Home</a></li> <li><a href="/articles/">Articles</a></li> <li class="current">Invisible XML</li> </ul> </nav> <div id="content"> <div class="medium-12 columns" role="content"> <ul class="share-buttons hide-for-print"> <li> <a class="button tiny radius facebook" href="http://www.facebook.com/sharer.php?u=/articles/2022/03/01/invisible-xml/" target="_blank"><i class="fa fa-facebook"></i>Share on Facebook</a> </li> <li> <a class="button tiny radius twitter" href="https://twitter.com/share?url=/articles/2022/03/01/invisible-xml/"><i class="fa fa-twitter"></i>Tweet</a> </li> <li> <a class="button tiny radius linkedin" href="http://www.linkedin.com/shareArticle?mini=true&amp;url=/articles/2022/03/01/invisible-xml/" target="_blank"><i class="fa fa-linkedin"></i>LinkedIn</a> </li> <li> <a class="button tiny radius mail" href="mailto:?subject=Invisible XML&amp;body=/articles/2022/03/01/invisible-xml/" target="_blank"><i class="fa 
fa-envelope"></i>Email</a> </li> <li> <a class="button tiny radius print" href="javascript:window.print()" target="_blank"><i class="fa fa-print"></i>Print</a> </li> </ul> <article class="article"> <div class="callout small"> <h1>Invisible XML</h1> <p>March 1, 2022</p> <p> <a href="/authors/norm-tovey-walsh/">Norm Tovey-Walsh</a> </p> <div id="tags" class="hide-for-print"> <a class="fancy radius button small" href="/articles/?tag=invisible XML">invisible XML</a> </div> <div class="summary">Norm Tovey-Walsh introduces Invisible XML, a language for describing the implicit structure of data, and a set of technologies for making that structure explicit as XML markup.</div> </div> <div class="body"> <p>Invisible XML is a language for describing the implicit structure of data, and a set of technologies for making that structure explicit as XML markup. It allows you to write a declarative description of the format of some text and then leverage that format to represent the text as structured information. That sounds a bit abstract, so let’s start with an example:</p> <p>Suppose you have a document, <code class="filename">contacts.txt</code>, where you keep the names, email addresses, and other details of friends and colleagues. It might look something like this:</p> <pre class="programlisting language-none">John Doe
john@example.com
555-1234

Mary Smith
m.smith@estaff.example.com
+1-222-555-2344

Jane Doe
(512) 555-9999

Nancy Jones
nancy@example.org</pre> <p>This is obviously a toy example, but it will work to illustrate a few points. First of all, those of us who are used to working with marked up data (such as XML or JSON) are likely to think of this as “unstructured” data. But that’s not really true. There is structure there, it’s just indicated with whitespace and other informal conventions, rather than with angle brackets or curly braces.</p> <p>The real world is full of data marked up using different conventions. 
We use files like this every day: email headers, ical and vcard files, CSS, Java property files, Windows “ini” configuration files, BibTeX, Markdown, etc. At finer levels of granularity, we find even more examples: ISO 8601 dates and times, XPath expressions, CSS selectors, and so on. And then we have our own ad hoc formats in countless text files: diaries, todo lists, calendars, event planners, etc. There are probably a dozen or so files like this within easy reach wherever you’re reading this. </p> <p>This is fine when the important consumers are human readers looking at the text. But what happens when we want to use that data in some other system? What happens when we want to extract the information out of these documents structured by conventions of spacing and punctuation?</p> <p>A common approach would be to write a script to read the format:</p> <table class="codehighlighttable"><tr><td class="linenos"><div class="linenodiv"><pre> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19</pre></div></td><td class="code"><div class="codehighlight"><pre><span></span><span class="ch">#!/usr/bin/env python3</span> <span class="n">contacts</span> <span class="o">=</span> <span class="p">[]</span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">&#39;contacts.txt&#39;</span><span class="p">,</span> <span class="s1">&#39;r&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">infile</span><span class="p">:</span> <span class="n">contact_id</span> <span class="o">=</span> <span class="mi">0</span> <span class="n">expect_name</span> <span class="o">=</span> <span class="bp">True</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">infile</span><span class="o">.</span><span class="n">readlines</span><span class="p">():</span> <span class="n">line</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span 
class="n">strip</span><span class="p">()</span> <span class="k">print</span><span class="p">(</span><span class="n">contact_id</span><span class="p">,</span> <span class="n">line</span><span class="p">)</span> <span class="k">if</span> <span class="n">line</span> <span class="o">==</span> <span class="s2">&quot;&quot;</span><span class="p">:</span> <span class="n">contact_id</span> <span class="o">+=</span> <span class="mi">1</span> <span class="n">expect_name</span> <span class="o">=</span> <span class="bp">True</span> <span class="k">elif</span> <span class="n">expect_name</span><span class="p">:</span> <span class="n">contacts</span><span class="o">.</span><span class="n">append</span><span class="p">({</span> <span class="s2">&quot;name&quot;</span><span class="p">:</span> <span class="n">line</span> <span class="p">})</span> <span class="n">expect_name</span> <span class="o">=</span> <span class="bp">False</span> <span class="k">elif</span> <span class="s2">&quot;@&quot;</span> <span class="ow">in</span> <span class="n">line</span><span class="p">:</span> <span class="n">contacts</span><span class="p">[</span><span class="n">contact_id</span><span class="p">][</span><span class="s2">&quot;email&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">line</span> <span class="k">else</span><span class="p">:</span> <span class="n">contacts</span><span class="p">[</span><span class="n">contact_id</span><span class="p">][</span><span class="s2">&quot;phone&quot;</span><span class="p">]</span> <span class="o">=</span> <span class="n">line</span> </pre></div> </td></tr></table> <p>That’s horrible, and not just because it’s a bit of sloppy code I banged out in a few short minutes. (Way more of the world’s infrastructure relies on code someone put together for a prototype or a demo than you’d care to think about.) The real problem here is that this tells you nothing about the file format beyond what you can glean from the source code. 
These kinds of procedural or imperative definitions of a file format, even if they’re backed up by some sort of prose description (that you hope is up-to-date with the latest version of the code, even though you know it isn’t), are difficult to understand, difficult to test, and difficult to reason about.</p> <p>Can I put an address in this file? Is it ok if I have two blank lines? If I have a phone number but not an email address, can I just put in a blank line for the email address? For anything bigger than a toy example, these are not easy questions to answer.</p> <p>You would be much better off if you had a declarative description of the format. Declarative descriptions are more accessible, easier to reuse, and generally require less coding.</p> <p>So why, you might ask, don’t we use them all the time? </p> <p>We do. Your XML processor, your JavaScript engine, your JSON tooling, the compilers that built your web browser and your operating system, are all driven by parsers that take declarative descriptions of a system plus some input, construct an abstract representation of that system, and operate on it. Under the hood, our favourite software is making constant use of declarative descriptions. </p> <p>Okay, but why don’t <em>we</em> use them all the time?</p> <p>There are a couple of reasons. Historically, it was expensive. I mean computationally expensive. But <a href="https://en.wikipedia.org/wiki/Moore%27s_law">Moore’s Law</a> has pretty much sorted that out. The other reason, the real reason, is because it’s <em>hard work</em>. Off-the-shelf tools are mostly designed to build incredibly efficient parsers that can analyze huge amounts of data, unambiguously and efficiently, with only a character or two of lookahead. Writing format descriptions for those tools is a specialized skill that few of us have.</p> <p>But it doesn’t actually have to be hard. Or at least, nowhere near <em>that</em> hard. 
What makes it a specialized skill, and what makes it hard, is fitting your declarative description into those very tight requirements: no ambiguity and no (or only a small, fixed amount of) lookahead.</p> <p>Luckily, that’s not the only way to approach the problem. Techniques for parsing that are tolerant of ambiguity and prepared to engage in (more-or-less) arbitrary lookahead have been around for decades. They are not as fast, or as memory efficient, as traditional parsers, but they can be competitive in many cases. And we have Moore’s Law on our side! Those problems are no longer significant in many environments.</p> <p>Invisible XML provides a specific syntax for declarative description that’s easy to use and easy to understand. Combined with a parser that understands Invisible XML, you can apply those descriptions to your data and get structured data out. No (imperative) coding required.</p> <p>Before we dig in deeper, let’s establish a little bit of vocabulary. An Invisible XML document is a <em class="glossterm">grammar</em>. As a term of art in computer programming, a <em class="firstterm">grammar</em> is essentially a set of rules. Each rule describes a <em class="glossterm">symbol</em> in terms of some other symbols.</p> <p>If you’ve ever used regular expressions, you’re familiar with these ideas, even if you didn’t use this exact vocabulary. If you’re matching data in Python or XPath, or any language with regular expressions, and you say that an integer is a string that matches <code class="code">[-+]?[0-9]+</code>, you’ve created a grammar rule: <code class="code">integer: [-+]?[0-9]+</code>.</p> <p>Invisible XML collects these rules together. Each rule has a “left hand side” and a “right hand side”. The left hand side is a single symbol, the one being defined, and the right hand side is a list of one or more symbols that define it. 
A <em class="firstterm">symbol</em> is either the name of something, in which case there must be a further rule that defines it, one that has it as the rule’s left hand side; or it’s something that literally matches characters in your input. With these rules, the processor will work out whether it’s possible to match the whole input string that you gave it by applying these rules in some order.</p> <p>Let’s take a very small example. Suppose we want to match sentences of three-letter words like “see cat sat”. We could write an Invisible XML grammar like this:</p> <pre class="programlisting language-none line-numbers">sentence : word++" " .
word : consonant, vowel, consonant; consonant, vowel, vowel.
vowel : ["aeiouy"] .
consonant : ["bcdfghjklmnpqrstvwxyz"] .</pre> <p>We’ll come back to the specific details about syntax and how to write an Invisible XML grammar in the next article; for now we’ll just do a little hand waving.</p> <p>That grammar has four rules and you can read them like this: </p> <div class="orderedlist"><ol style="list-style: decimal;"> <li><p>A <em>sentence</em> is one or more occurrences of <em>word</em> separated by a single space.</p></li> <li><p>A <em>word</em> is a <em>consonant</em>, followed by a <em>vowel</em>, followed by a <em>consonant</em> <strong class="emphasis">or</strong> a <em>consonant</em> followed by two consecutive <em>vowel</em>s.</p></li> <li><p>A <em>vowel</em> is literally “a”, “e”, “i”, “o”, “u”, or “y”.</p></li><li><p>A <em>consonant</em> is literally any one of the other lowercase English-language letters, and also “y”.</p></li></ol> </div><p>Note that there’s nothing procedural here. There’s no attempt to say how you do anything with a word or any of its constituent parts. The grammar just declaratively describes the format. It’s easy to answer questions about this format. Are numbers allowed? Can words be more or less than three letters long? Is punctuation allowed? No, no, and no, respectively. 
And it’s easy to imagine writing tests to ensure that this grammar does match what you want. </p> <p>Of course, there is software that’s going to process it. The first thing a processor, or parser, is going to do is determine whether the input string you gave it matches the grammar (you’ll sometimes see this written as “checking if the input is a sentence in the grammar”). In order to do that, it has to know where to start, so you need to nominate one of the symbols as the “start symbol”. In Invisible XML, that’s the symbol in your first rule, so “sentence” in this case. </p> <p>We can imagine a parser, given this grammar and the one word sentence “cat”, doing something like this:</p><div class="orderedlist"> <ol style="list-style: decimal;"><li><p>Does the input match <em>sentence</em>? I don’t know, what’s a sentence? </p></li> <li><p>A sentence is one or more <em>word</em>s separated by spaces. Does the input match <em>word</em>? I don’t know, what’s a word? </p></li> <li><p>A word is either:</p><div class="itemizedlist"><ul><li><p>A <em>consonant</em>, followed by <em>vowel</em>, followed by a <em>consonant</em>, or </p></li><li><p>a <em>consonant</em>, followed by a <em>vowel</em>, followed by another <em>vowel</em>. </p></li></ul></div><p>Does the input match that? I don’t know, what’s a consonant? </p></li><li><p>A <em>consonant</em> is one of a set of letters. Ok. Does the first letter match one of those? Yes, “c” is one of those letters. Great. </p></li> <li><p>Does the rest of the input match the rest of the “right hand side”? The next thing is <em>vowel</em>. Does the rest of the input match that? I don’t know, what’s a vowel? </p></li><li><p>A <em>vowel</em> is one of a set of letters. Ok. Does the next letter match one of those? Yes, “a” is one of those letters. Great. </p></li> <li><p>And on it goes, matching symbols against rules until it runs out of symbols or runs out of input. In this case, it will run out of input after it matches the “t”. 
</p></li> <li><p>There’s no input left. If we stop here, are we finished with the right hand side of the rule for the start symbol? Yes? Great. Yes, “cat” is a <em>sentence</em>. </p></li></ol> </div><p>As you can see, the parser uses the rules to replace symbols with what those symbols can be in a kind of recursive process that “bottoms out” when it reaches something that has to match the input. Sometimes you’ll see the class of symbols that can be replaced by other symbols identified as “<em class="firstterm">nonterminals</em>” as distinct from the symbols that literally match against the input, the “<em class="firstterm">terminals</em>”.</p> <p>What happens if it doesn’t match? What happens, for example, if we give this grammar the input “frog”? In that case, the parser will tell you it couldn’t match your grammar (“the input is not a sentence in the grammar”). Something like this:</p> <p>Parse failed. At position 2, found unexpected “r”. Would have permitted: ["aeiouy"].</p> <p>So your grammar also functions as a validator for your input!</p> <p>On a successful parse, the other thing an Invisible XML parser has to do is tell you <em>how</em> it matched the input. This is the part that gives you back the structured information. For our word parser, it’s pretty simple and obvious:</p> <div class="codehighlight"><pre><span></span><span class="nt">&lt;sentence&gt;</span> <span class="nt">&lt;word&gt;</span> <span class="nt">&lt;consonant&gt;</span>c<span class="nt">&lt;/consonant&gt;</span> <span class="nt">&lt;vowel&gt;</span>a<span class="nt">&lt;/vowel&gt;</span> <span class="nt">&lt;consonant&gt;</span>t<span class="nt">&lt;/consonant&gt;</span> <span class="nt">&lt;/word&gt;</span> <span class="nt">&lt;/sentence&gt;</span> </pre></div> <p>Broadly speaking, Invisible XML makes each nonterminal in your grammar into an XML element. 
In fact, it offers you a number of ways to control how the results are constructed, including which nonterminals should be output, which things should be elements and which should be attributes, and even which things to elide altogether. We’ll come back to those topics when we talk about writing grammars in the next article.</p> <p>There are two other things to bear in mind. First, the grammar only validates against its rules. According to this grammar “xeq bei” is a perfectly fine sentence. Second, an input may be ambiguous with respect to a grammar. Consider the sentence “hey bee”. If you parse that with our sentence grammar, you’ll get:</p> <div class="codehighlight"><pre><span></span><span class="nt">&lt;sentence</span> <span class="na">xmlns:ixml=</span><span class="s">&quot;http://invisiblexml.org/NS&quot;</span> <span class="na">ixml:state=</span><span class="s">&quot;ambiguous&quot;</span><span class="nt">&gt;</span> <span class="nt">&lt;word&gt;</span> <span class="nt">&lt;consonant&gt;</span>h<span class="nt">&lt;/consonant&gt;</span> <span class="nt">&lt;vowel&gt;</span>e<span class="nt">&lt;/vowel&gt;</span> <span class="nt">&lt;vowel&gt;</span>y<span class="nt">&lt;/vowel&gt;</span> <span class="nt">&lt;/word&gt;</span> <span class="nt">&lt;word&gt;</span> <span class="nt">&lt;consonant&gt;</span>b<span class="nt">&lt;/consonant&gt;</span> <span class="nt">&lt;vowel&gt;</span>e<span class="nt">&lt;/vowel&gt;</span> <span class="nt">&lt;vowel&gt;</span>e<span class="nt">&lt;/vowel&gt;</span> <span class="nt">&lt;/word&gt;</span> <span class="nt">&lt;/sentence&gt;</span> </pre></div> <p>Or maybe you’ll get:</p> <div class="codehighlight"><pre><span></span><span class="nt">&lt;sentence</span> <span class="na">xmlns:ixml=</span><span class="s">&quot;http://invisiblexml.org/NS&quot;</span> <span class="na">ixml:state=</span><span class="s">&quot;ambiguous&quot;</span><span class="nt">&gt;</span> <span class="nt">&lt;word&gt;</span> <span 
class="nt">&lt;consonant&gt;</span>h<span class="nt">&lt;/consonant&gt;</span> <span class="nt">&lt;vowel&gt;</span>e<span class="nt">&lt;/vowel&gt;</span> <span class="nt">&lt;consonant&gt;</span>y<span class="nt">&lt;/consonant&gt;</span> <span class="nt">&lt;/word&gt;</span> <span class="nt">&lt;word&gt;</span> <span class="nt">&lt;consonant&gt;</span>b<span class="nt">&lt;/consonant&gt;</span> <span class="nt">&lt;vowel&gt;</span>e<span class="nt">&lt;/vowel&gt;</span> <span class="nt">&lt;vowel&gt;</span>e<span class="nt">&lt;/vowel&gt;</span> <span class="nt">&lt;/word&gt;</span> <span class="nt">&lt;/sentence&gt;</span> </pre></div> <p>There are two different ways to parse the input because our grammar says that a “y” can be either a consonant or a vowel. The word “hey” therefore matches both the pattern “consonant-vowel-vowel” and the pattern “consonant-vowel-consonant”. That makes the results ambiguous. The Invisible XML processor is required to tell us that, and it’s required to return one of the valid parses. (Implementations may give you more control than that, but that’s all that’s required for conformance.)</p> <p>You might now be asking yourself, what about the word “gym”? Would that be ambiguous too? It is, in some sense “ambiguous” because it’s either “consonant-vowel-consonant” or “consonant-consonant-consonant”. But it’s not ambiguous in <em>this grammar</em> because “consonant-consonant-consonant” isn’t a possible match.</p> <p>Ambiguity is ok, and sometimes it’s necessary. The ambiguity in this grammar is an inherent property of the English language rule that “y” sometimes represents the sound of a vowel and sometimes it represents the sound of a consonant.</p> <p>That said, the more ambiguity there is, the more possibilities the parser may have to consider when it’s looking for matches. That can make the process slower. 
We’ll look more at ambiguity when we start writing grammars in the next part.</p> <p>Here, finally, is a grammar for the contacts format:</p> <pre class="programlisting language-none line-numbers">contacts: (contact, NL*)+ .
contact: name, NL, (email, NL)?, (phone, NL)? .
name: letter, ~[#a; "@"]* .
email: username, "@", domainname .
phone: ["+0123456789()- "]+ .
-username: (letter; ["+-."])+ .
-domainname: (letter; ["+-."])+ .
-letter: [L] .
-NL: -#a ; -#d, -#a .</pre> <p>It can be read like this:</p> <div class="orderedlist"><ol style="list-style: decimal;"><li><p>A <em>contacts</em> file consists of one or more contact items, each followed by zero or more newlines.</p></li> <li><p>A <em>contact</em> is a name followed by a newline, optionally followed by an email followed by a newline, optionally followed by a phone followed by a newline. </p></li> <li><p>A <em>name</em> is a letter followed by zero or more characters other than newline or “@”. </p></li> <li><p>An <em>email</em> is a username followed by “@” followed by a domainname. </p></li> <li><p>A <em>phone</em> is one or more digits or the “+”, “(”, “)”, “-”, and space punctuation characters. </p></li> <li><p>A <em>username</em> is one or more occurrences of letter or any of the characters “+”, “-”, and “.”. </p></li> <li><p>A <em>domainname</em> is one or more occurrences of letter or any of the characters “+”, “-”, and “.”. </p></li> <li><p>A <em>letter</em> is any character in the Unicode character class “L” (letters). </p></li> <li><p>An <em>NL</em> is a newline or the sequence carriage return followed by newline. 
</p></li></ol></div> <p>If you give that grammar and the contacts file to an Invisible XML processor, it will return:</p> <div class="codehighlight"><pre><span></span><span class="nt">&lt;contacts&gt;</span> <span class="nt">&lt;contact&gt;</span> <span class="nt">&lt;name&gt;</span>John Doe<span class="nt">&lt;/name&gt;</span> <span class="nt">&lt;email&gt;</span>john@example.com<span class="nt">&lt;/email&gt;</span> <span class="nt">&lt;phone&gt;</span>555-1234<span class="nt">&lt;/phone&gt;</span> <span class="nt">&lt;/contact&gt;</span> <span class="nt">&lt;contact&gt;</span> <span class="nt">&lt;name&gt;</span>Mary Smith<span class="nt">&lt;/name&gt;</span> <span class="nt">&lt;email&gt;</span>m.smith@estaff.example.com<span class="nt">&lt;/email&gt;</span> <span class="nt">&lt;phone&gt;</span>+1-222-555-2344<span class="nt">&lt;/phone&gt;</span> <span class="nt">&lt;/contact&gt;</span> <span class="nt">&lt;contact&gt;</span> <span class="nt">&lt;name&gt;</span>Jane Doe<span class="nt">&lt;/name&gt;</span> <span class="nt">&lt;phone&gt;</span>(512) 555-9999<span class="nt">&lt;/phone&gt;</span> <span class="nt">&lt;/contact&gt;</span> <span class="nt">&lt;contact&gt;</span> <span class="nt">&lt;name&gt;</span>Nancy Jones<span class="nt">&lt;/name&gt;</span> <span class="nt">&lt;email&gt;</span>nancy@example.org<span class="nt">&lt;/email&gt;</span> <span class="nt">&lt;/contact&gt;</span> <span class="nt">&lt;/contacts&gt;</span> </pre></div> <p>Or the author of the processor might allow you to ask for the output in JSON instead:</p> <div class="codehighlight"><pre><span></span><span class="p">{</span> <span class="s2">&quot;contacts&quot;</span><span class="o">:</span> <span class="p">{</span> <span class="s2">&quot;contact&quot;</span><span class="o">:</span> <span class="p">[</span> <span class="p">{</span> <span class="s2">&quot;name&quot;</span><span class="o">:</span> <span class="s2">&quot;John Doe&quot;</span><span class="p">,</span> <span 
class="s2">&quot;email&quot;</span><span class="o">:</span> <span class="s2">&quot;john@example.com&quot;</span><span class="p">,</span> <span class="s2">&quot;phone&quot;</span><span class="o">:</span> <span class="s2">&quot;555-1234&quot;</span> <span class="p">},</span> <span class="p">{</span> <span class="s2">&quot;name&quot;</span><span class="o">:</span> <span class="s2">&quot;Mary Smith&quot;</span><span class="p">,</span> <span class="s2">&quot;email&quot;</span><span class="o">:</span> <span class="s2">&quot;m.smith@estaff.example.com&quot;</span><span class="p">,</span> <span class="s2">&quot;phone&quot;</span><span class="o">:</span> <span class="s2">&quot;+1-222-555-2344&quot;</span> <span class="p">},</span> <span class="p">{</span> <span class="s2">&quot;name&quot;</span><span class="o">:</span> <span class="s2">&quot;Jane Doe&quot;</span><span class="p">,</span> <span class="s2">&quot;phone&quot;</span><span class="o">:</span> <span class="s2">&quot;(512) 555-9999&quot;</span> <span class="p">},</span> <span class="p">{</span> <span class="s2">&quot;name&quot;</span><span class="o">:</span> <span class="s2">&quot;Nancy Jones&quot;</span><span class="p">,</span> <span class="s2">&quot;email&quot;</span><span class="o">:</span> <span class="s2">&quot;nancy@example.org&quot;</span> <span class="p">}</span> <span class="p">]</span> <span class="p">}</span> <span class="p">}</span> </pre></div> <p>Or maybe in CSV:</p> <pre class="programlisting language-none">"name","email","phone"
"John Doe","john@example.com","555-1234"
"Mary Smith","m.smith@estaff.example.com","+1-222-555-2344"
"Jane Doe",,"(512) 555-9999"
"Nancy Jones","nancy@example.org",</pre> <p>Every Invisible XML processor begins by creating a basic XML representation of your input data. That’s why it’s called Invisible XML. 
But what’s really going on here is that we’re turning input with implicit structure into explicitly structured data that can be transformed into whatever output format we require. And we’re doing it by describing that format in a compact, understandable, reusable, testable way.</p> <p>Join us next time for a detailed look at the syntax of Invisible XML.</p> </div> <div class="copyright"> Article contents &#169; 2022 Norman Tovey-Walsh </div> </article> </div> </div><!-- content --> </div> <aside id="sidebar" class="medium-3 columns hide-for-print"> <div class="text-center"> <script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script> <!-- 300x250 --> <ins class="adsbygoogle" style="display:inline-block;width:300px;height:250px" data-ad-client="ca-pub-9264479583913780" data-ad-slot="1017007355"></ins> <script> (adsbygoogle = window.adsbygoogle || []).push({}); </script> </div> <div class="text-center"> <script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script> <!-- Google-160x600 --> <ins class="adsbygoogle" style="display:inline-block;width:160px;height:600px" data-ad-client="ca-pub-9264479583913780" data-ad-slot="1629348156"></ins> <script> (adsbygoogle = window.adsbygoogle || []).push({}); </script> </div> </aside> </div> <div class="row column"> <hr class="dotted"/> </div> <footer class="row column"> <p><strong>&#169; Textuality Services, Inc.</strong> except for those articles with named authors or copyright holders. All trademarks and registered trademarks appearing on XML.com are the property of their respective owners.</p> </footer> <script src="/static/CACHE/js/output.d5b0ccff8392.js"></script> </body> </html>