Recently I’ve been looking for an XML validation solution. Yes, there are many out there, but I’ve been looking for the best one. This is what caught my attention:

  1. Document Type Definition (DTD) – XML’s built-in schema language
    • Pros and cons:
      • Not easy for humans to read;
      • Provides only structure validation, no data validation;
      • Non-XML syntax;
      • Doesn’t support namespaces well.
  2. W3C XML Schema (WXS) – an object-oriented XML schema language. WXS also provides a type system for constraining the character data of an XML document. WXS is maintained by the World Wide Web Consortium (W3C) and is a W3C Recommendation (that is, a ratified W3C standard specification).
    • Pros and cons:
      • XML-based syntax;
      • Has data validation;
      • Too restrictive, precise, verbose;
      • Provides very weak support for unordered content;
      • Steep learning curve;
      • Not easy to write.
  3. RELAX NG (RNG) – a pattern-based, user-friendly XML schema language. RNG schemas may also use types to constrain XML character data. RNG is maintained by the Organization for the Advancement of Structured Information Standards (OASIS) and is both an OASIS and an ISO (International Organization for Standardization) standard.
    • Pros and cons:
      • Still grammar based;
      • Uses XML syntax to represent schemas;
      • Supports data-typing;
      • Integrates attributes into content models;
      • Supports XML namespaces;
      • Supports unordered content;
      • Supports context-sensitive content models;
      • Simple syntax that is relatively easy for humans to read;
      • Data types are not part of RELAX NG, but can be specified in a modular fashion.
  4. Schematron – a rules-based XML schema language. Whereas DTD, WXS, and RNG are designed to express the structure of a content model, Schematron is designed to enforce individual rules that are difficult or impossible to express with other schema languages. Schematron is intended to supplement a schema written in a structural schema language such as those mentioned above. Schematron is in the process of becoming an ISO standard.
    • Pros and cons:
      • Allows expressing rules directly, without creating a whole grammar;
      • Very flexible;
      • More expressive even than RELAX NG;
      • Relies almost entirely on XPath query patterns for defining rules and checks;
      • Assertion based;
      • Short learning curve;
      • Trivial to implement on top of XSLT.
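
To make the "supports unordered content" point about RELAX NG more concrete, here is a minimal sketch of a grammar-based schema. The element names (doc, prologue, section) are borrowed from the Schematron example in this post, and the text-only content is a simplifying assumption for illustration, not part of any real schema.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical RELAX NG sketch: "doc" holds one "prologue" and one or
     more "section" children, in any order. -->
<element name="doc" xmlns="http://relaxng.org/ns/structure/1.0">
  <!-- interleave allows its child patterns to appear in any order -->
  <interleave>
    <element name="prologue">
      <!-- placeholder content model: plain text only -->
      <text/>
    </element>
    <oneOrMore>
      <element name="section">
        <text/>
      </element>
    </oneOrMore>
  </interleave>
</element>
```

Note that even this small grammar has to describe the whole content of "doc", whereas a rules-based schema can state only the constraints you care about.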

Considering all the pros and cons, I was really torn between RELAX NG and Schematron. WXS is too restrictive and verbose for my needs, and I wasn’t going to map XML structures to data objects anyway. DTD wasn’t worth considering at all. I needed something simple and flexible, easy to use, and with a short learning curve. The more I look at Schematron, the more I like it. It’s not grammar-based like RELAX NG or WXS; it’s based on XPath, rules, and assertions.

Writing a Schematron schema feels to me like writing unit tests: you make assumptions about the structure of your XML and check them with assertions. It’s a very flexible approach, because you check only the things you actually care about. It requires a slightly different view of the XML validation process, and you’ll need to get used to it. But hey, it’s cool. So, try it.

This is an example from a tutorial on IBM developerWorks:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <title>Technical document schema</title>
    <pattern id="rightdoc" name="Document root">
      <!-- Validates that the root element is named "doc" -->
      <rule context="/">
        <assert test="doc">Root element must be "doc".</assert>
      </rule>
    </pattern>
    <pattern id="extradoc" name="Extraneous docs">
      <rule context="doc">
        <!-- Validates that "doc" appears only as the root element -->
        <assert test="not(ancestor::*)">
        The "doc" element is only allowed at the document root.
        </assert>
      </rule>
    </pattern>
    <pattern id="majelements" name="Major elements">
      <rule context="doc">
        <assert test="prologue">
        <name/> must have a "prologue" child.
        </assert>
        <assert test="section">
        <name/> must have at least one "section" child.
        </assert>
      </rule>
    </pattern>
    <!-- Validating for a certain number of elements -->
    <pattern name="Minimum keywords">
      <rule context="prologue">
        <assert test="count(keyword) > 2">
          At least three keywords are required.
        </assert>
      </rule>
    </pattern>
    <!-- Validating a sequence of elements: following-sibling::*[1] selects the
         element immediately following the context node (title); the next step,
         self::subtitle, ensures that this element is a subtitle. -->
    <pattern name="Title with subtitle">
      <rule context="title">
        <assert test="following-sibling::*[1]/self::subtitle">
          A "title" must be immediately followed by a "subtitle".
        </assert>
      </rule>
    </pattern>
</schema>

 Isn’t this cool?
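
To see what the rules accept, here is a small sample document that should pass every assertion in the schema above. The element names come from the schema itself; the text content and the choice of exactly three keywords are made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical instance document; all text values are invented. -->
<doc>
  <prologue>
    <!-- Every "title" is immediately followed by a "subtitle" -->
    <title>Validating XML with Schematron</title>
    <subtitle>A rules-based approach</subtitle>
    <!-- count(keyword) is 3, which satisfies "count(keyword) > 2" -->
    <keyword>XML</keyword>
    <keyword>validation</keyword>
    <keyword>Schematron</keyword>
  </prologue>
  <section>
    <title>Introduction</title>
    <subtitle>Why rules instead of grammars</subtitle>
  </section>
</doc>
```

Removing one keyword, or dropping a subtitle after a title, should trigger the corresponding assertion message. Anything the schema does not mention, such as extra elements inside "section", is simply not checked.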
