The integrated XML validator

Ever since the World Wide Web Consortium launched the Web Accessibility Initiative, the problems of XML validation and of web accessibility for the vision-impaired have converged. SPIP addressed accessibility very early on, starting with version 1.5 in 2002. On the other hand, it long stayed away from XML given the rarity of conformant pages: the Web abounded in HTML that was not XHTML-compliant, for which browsers developed their own parsers, and XML-based languages such as SVG and XQuery got off to a very slow start.

However, the convergence of accessibility and validation concerns on the one hand, and native SVG implementations in several browsers on the other, allowed SPIP to introduce, with version 1.8.1, an interface to validation tools such as Tidy and the official W3C validator. These tools, as their own home pages acknowledge, suffer from numerous limitations, produce widely varying error messages, and cannot be installed by the average internet user [1].

What’s more, they break down when faced with newer technologies such as Ajax, which modify web pages incrementally and in place. More seriously still, the W3C Document Type Definition files have shortcomings of their own, including the XHTML 1.0 Strict DTD: although the official specification forbids nesting elements such as a, label or form inside themselves, the formal grammar accepts the following constructions as valid: <label for='x'><label... or <form action='.'><div><form... [2].

Faced with these issues, SPIP 1.9.2 takes a radical step by offering an extensible, integrated, incremental and optional validator built on the SAX (Simple API for XML) parser supplied by default with PHP. This range of features makes it as useful to webmasters as to graphic artists and to developers of SPIP extensions, each of whom will naturally apply the tool in a different way.

Validation for webmasters

For webmasters, simply adding the line $xhtml = 'sax'; to the mes_options.php file alters the HTML output produced by a template: each line that contains a single opening tag with at most one attribute is indented in proportion to the number of tags still unclosed at that point in the HTML code. If the page fails the analysis, this work is discarded and the original HTML text is sent to the HTTP client instead. In such cases, any caching obviously takes place after the validator has run; here the validator has merely attempted to reformat the page.

There are a few details related to the treatment of attributes. Their values are systematically delimited with apostrophes, except those that themselves contain an apostrophe, which are delimited with double quotes (and if they contain double quotes as well, those are converted into &quot; codes). Because of a design error in SAX [3], XML entities are converted beforehand into the site’s charset, except for &amp; &lt; &gt; and &quot;, which SAX processes correctly (since that is indispensable). The list of entities and their values is deduced from the DOCTYPE indicated in the template; by default, SPIP uses a predefined collection equivalent to the latin1 DTD, supplemented with &euro;, &oelig; and &OElig;, which are defined in the special symbol DTD.
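
The quoting rule for attribute values can be sketched as follows. This is an illustrative Python rendering of the rule as stated above, not SPIP's own PHP code; the function name quote_attribute is hypothetical.

```python
def quote_attribute(value: str) -> str:
    """Delimit an attribute value following the rule described above.

    Hypothetical helper for illustration only; it is not part of SPIP.
    """
    if "'" not in value:
        # No apostrophe inside the value: delimit it with apostrophes.
        return "'" + value + "'"
    # The value contains an apostrophe: delimit it with double quotes,
    # converting any double quote inside it into &quot;.
    return '"' + value.replace('"', "&quot;") + '"'
```

A value containing only double quotes is left untouched inside apostrophes, since apostrophe delimiters make those double quotes harmless.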

Validation for graphic artists

For template writers, the validator is available through the debugger introduced in SPIP 1.8. This tool is visible only to site administrators when they visit pages on the public site. One of its buttons is labelled Analyse XML and explicitly launches a validation request for the page being viewed. The analysis is first handed to SAX which, even in the absence of a valid DOCTYPE, can still identify:

  • incorrect delimitation of matching opening and closing tags;
  • attributes without values;
  • attributes without delimiters;
  • attributes missing the trailing space before the next attribute;
  • XML entities that are poorly formed (notably any literal &, where XML demands the expanded &amp;).

At the first error found, the SPIP debugger immediately tries to determine which template produced the error (there are several when includes are used) and on which line, and it provides links to the source files (the location cannot always be pinpointed, so a margin of error should be allowed for).
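
This first, DTD-free pass can be imitated with any expat-style parser. The sketch below uses Python's xml.parsers.expat module as a stand-in for the PHP parser that SPIP relies on; the function name first_wellformedness_error is hypothetical, and the mapping from the reported line back to a template is not attempted here.

```python
import xml.parsers.expat
from typing import Optional, Tuple


def first_wellformedness_error(page: str) -> Optional[Tuple[int, str]]:
    """Return (line, message) for the first well-formedness error, or None.

    Illustrative stand-in: SPIP itself uses PHP's parser, not this module.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument tells expat that this is the final chunk.
        parser.Parse(page, True)
    except xml.parsers.expat.ExpatError as err:
        # The parser stops at the first error and reports where it occurred.
        return err.lineno, xml.parsers.expat.ErrorString(err.code)
    return None
```

Like the analysis described above, the parser stops at the first error, so a page with several faults has to be corrected and re-analysed step by step.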

If the page passes this first analysis, the new version of SPIP goes further by taking the page’s DOCTYPE into account. It can be of type PUBLIC or SYSTEM, and in the first case the relevant DTD is cached in order to speed up subsequent analyses. Each DTD can include others, which are loaded at the same time. To avoid redundant calculations while still taking the system of conditional DTD includes into account, SPIP also caches a specific data structure that it builds by reading the definitions of all the elements, attributes and entities relevant to the named DOCTYPE. With the help of this structure, SPIP validates the page; in other words, it verifies that:

  • all of the attribute names used in a tag are acceptable to the DTD;
  • all of the compulsory tag attributes are present;
  • all of the attribute values conform to the format, if any, specified by the DTD;
  • all ID attributes have values composed of letters, numbers, underscores, hyphens and colons;
  • all IDREF or IDREFS attributes actually refer to valid IDs on the page;
  • all XML entities are defined in the DTD;
  • all tag names used are defined in the DTD;
  • all non-empty tags are authorised to be so by the DTD (for example, a non-empty img will be flagged);
  • all empty tags are authorised to be so by the DTD (for example, an empty tr will be flagged);
  • all used tags are contained by a tag that is authorised to be its parent as defined in the DTD;
  • all tags that must appear before another specified sibling tag do so as specified by the DTD (for example, the head appears before the body);
  • all tags that are limited to being used a fixed number of times do so (for example, title may appear only once in the head).

If any of these errors are detected, the validator displays a table of all errors, with the number of times each occurs, links to the offending lines, and suggested corrections deduced automatically from the constructions authorised by the DTD. In the absence of errors, the debugger displays the code formatted as described in the previous section.
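
Two of the checks listed above, the syntax of ID values and the resolution of IDREF and IDREFS attributes, lend themselves to a short sketch. The Python code below is only an illustration under stated assumptions: the function name check_ids and the (type, value, line) triples are hypothetical, and "letters" is taken to mean ASCII letters.

```python
import re
from typing import Iterable, List, Tuple

# Letters, digits, underscores, hyphens and colons, as in the list above
# (assumption: ASCII letters only).
ID_PATTERN = re.compile(r"[A-Za-z0-9_:-]+")


def check_ids(attributes: Iterable[Tuple[str, str, int]]) -> List[str]:
    """Check (DTD type, value, line) triples collected while parsing a page."""
    attributes = list(attributes)
    errors = []
    known_ids = set()
    # First pass: check the syntax of every ID and remember the valid ones.
    for kind, value, line in attributes:
        if kind == "ID":
            if ID_PATTERN.fullmatch(value):
                known_ids.add(value)
            else:
                errors.append(f"line {line}: malformed ID {value!r}")
    # Second pass: every IDREF or IDREFS name must match an ID on the page.
    for kind, value, line in attributes:
        if kind in ("IDREF", "IDREFS"):
            for ref in value.split():
                if ref not in known_ids:
                    errors.append(f"line {line}: unknown ID {ref!r}")
    return errors
```

The references are resolved in a second pass because an IDREF may legitimately point at an ID that appears later in the page.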

Validation for developers

Web pages with restricted access are in particular need of an integrated validator for debugging, since external validators cannot reach them. Scripts in the SPIP editorial space, whether standard or supplied by an extension, can be processed by the XML validator by adding $GLOBALS['transformer_xml'] = 'valider_xml'; to the mes_options.php file. The private space of SPIP 1.9.2 itself was made XHTML 1.0 conformant thanks to this mechanism. Setting this global variable to 'indenter_xml' instead causes the HTML source to be indented automatically if it is well-formed XML, without validating it.

It is equally possible to start, with the mouse, the XML analysis of the result of an Ajax script that is active in the editorial space. Such a script does not return a complete HTML page, but only a fragment, so the integrated validator builds a page with the current DOCTYPE, a header consisting of only a title tag, and a body containing the code fragment. A window is then opened with the results of the analysis, as for a regular HTML page, while the calling window receives the results of the Ajax script as normal. Because of an impractical W3C specification regarding the event model [4], the validator is not launched by a specific mouse button, but by a click made while at least one of the Alt or Meta keys is held down.
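
The way a fragment is turned into a complete page before analysis can be sketched as below. This is a hedged Python illustration, not SPIP's code: the function name wrap_fragment, its parameters and the default title text are all assumptions.

```python
def wrap_fragment(fragment: str, doctype: str, title: str = "fragment") -> str:
    """Build a minimal page around an Ajax fragment so it can be validated.

    Hypothetical helper: a DOCTYPE, a head holding only a title,
    and a body holding the fragment.
    """
    return (
        f"{doctype}\n"
        "<html>\n"
        f"<head><title>{title}</title></head>\n"
        f"<body>{fragment}</body>\n"
        "</html>"
    )
```

Since the wrapper contributes nothing but the mandatory skeleton, any error reported on the resulting page can be attributed to the fragment itself.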

Other than that, the SPIP validator can be applied to any page on the Web. Any SPIP site installed at a URL http://u contains the page http://u/ecrire/valider_xml, which the site’s authors can invoke explicitly to validate pages, including pages outside the site in question. However, this validator applies only to XML documents; in the absence of a DOCTYPE, the XHTML 1.0 Transitional DTD is assumed.

As the preceding remarks imply, the validator can be applied to any DOCTYPE, including those that refer to DTDs hosted on the site itself. Making Web pages accessible means being more rigorous in the use of attributes and tags, so one can easily build one’s own accessibility validation tool by defining a DTD that is less lenient than the one commonly known as XHTML 1.0 Strict. Replacing #IMPLIED with #REQUIRED for attributes considered indispensable is an obvious and easy move. Forcing input, select, textarea and button to be exclusively the children of label tags is considerably more difficult. In addition, the validator accepts any regular expression (in the PCRE sense) as an attribute type; it is then applied to every occurrence of that attribute in the analysed page.
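
Using a regular expression as an attribute type amounts to testing every occurrence of the attribute against it. The Python sketch below illustrates the idea under two assumptions: the whole value must match the expression, and Python's re syntax stands in for PCRE (the two agree on simple patterns like the one shown). The function name check_against_type is hypothetical.

```python
import re
from typing import Iterable, List, Tuple


def check_against_type(pattern: str,
                       occurrences: Iterable[Tuple[str, int]]) -> List[int]:
    """Return the lines whose attribute value does not match the pattern.

    occurrences holds (value, line) pairs for one attribute of the page.
    """
    regex = re.compile(pattern)
    # Assumption: the expression must cover the whole value, not a part of it.
    return [line for value, line in occurrences if not regex.fullmatch(value)]
```

For example, check_against_type(r"[0-9]+(px|%)", [("10px", 3), ("50%", 7), ("wide", 12)]) reports only line 12.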

Finally, it is possible to define one’s own validation rules to be associated with an attribute. The standard types ID, IDREF and IDREFS are implemented by the functions validerAttribut_ID, validerAttribut_IDREF and validerAttribut_IDREFS respectively.

One only needs to introduce a new type S1 ... Sn in the DTD, and to define the associated functions in the mes_options.php file, in order to trigger customised verification. At the end of the analysis, the overloadable function inc_valider_passe2 is called to apply retrospective verifications (it is here that IDREF attributes are checked as referring to IDs that are actually present). This programming interface is still a little rough, and it will be revised in the light of the community’s experience. But for now it allows anyone to quickly implement any new accessibility specifications that are developed.


[1] The W3C validator is available in the form of an archive of over 1MB, consisting principally of a CGI script written in Perl, and requiring several language modules as well as an interface with the HTTP server. Moreover, the syntactical analysis and validation are not performed by this programme, but are delegated to the onsgmls analyser, itself programmed in C++ and therefore requiring its own compilation after preconfiguring its own OpenSP project libraries of over 1MB (some operating systems do, however, already include the compiled executable of 128MB). These difficulties clearly explain why there are so few installations, which are therefore often overwhelmed by requests.

[2] This deficiency is justified in a section of the XHTML recommendation which states: "The HTML 4 Strict DTD forbids the nesting of an ’a’ element within another ’a’ element to any descendant depth. It is not possible to spell out such prohibitions in XML." This statement is not supported by any example or scientific reference. The theories establishing the capabilities of automatic syntactical analysis have been clarified and amply demonstrated since the 1950s, and clearly contradict such statements. W3C, perhaps it’s time to return to school?

[3] When it encounters what it considers to be a lexeme (token), SAX calls an event handler. Because it mistakenly processes attributes as lexemes, it produces the same sequence of calls for the following three texts:

&oelig;&OElig;<a ...

&oelig;<a title='&OElig;' ...

<a title='&oelig;' href='&OElig;' ....

As a result, the event handler for XML entities cannot tell whether the entity that triggered it belongs to the text following the preceding tag, or to one of the attributes of the current tag, and if so to which one. The ambiguity cannot even be resolved by using line and column numbers (or the character offset in the stream), which in every case indicate the location of the preceding tag.

[4] Whereas the most widespread operating system in the world wisely follows, for once, the lessons learnt from Unix, and more precisely from the X Window System, in describing a mouse button as a bitmask, the W3C considered itself wiser still and produced an incompatible and ill-conceived specification that virtually no current browser adheres to.

Author: Mark. Updated: 26/10/12
