Tidy: XHTML 1.0 correction

The pages generated by SPIP can be analysed by the integrated XML validator, available since SPIP 1.9. Since SPIP 1.8, 1.8.1 there has also been an interface with HTML Tidy; it requires a separate installation, but has the advantage of automatically correcting many validation errors, often successfully.

The general principle

Tidy is a tool, external to SPIP, that transforms relatively clean HTML 4 code into valid XHTML 1.0 transitional code. It helps webmasters bring their sites into compliance with the XHTML recommendations, even in forum messages freely composed by site visitors.

Important: Tidy is not a "magic wand": it cannot transform "very dirty" code into compliant code. Faced with certain "errors" in the code, it simply refuses to run. You should therefore expect to correct by hand some of the errors it discovers.

-  Step 1: before anything else, it is highly recommended to make your code as clean and compliant as possible before passing it through Tidy. This is done on several levels:
— first, SPIP generates clean code through its typographical processing, and that code is increasingly compliant with each new revision; the private zone and the SPIP 1.8, 1.8.1 templates are fully XHTML transitional compliant, and in SPIP 1.9 they are fully XHTML strict compliant.

-  Step 2: If Tidy already exists on the server, and the webmaster has activated this option (see further below), then SPIP will pass the generated pages through Tidy, which will then attempt to clean the pages and transform them into fully compliant pages.

-  Step 3: if the process works properly (Tidy hasn’t found any "blocking" errors), then the page displayed is in XHTML 1.0. If, on the other hand, Tidy didn’t manage to process the page (see further below), then the original page is displayed. In that case SPIP displays an administration button signalling the "Tidy error" and creates a file listing the pages where it found errors, which makes it an important diagnostic tool.
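The fallback behaviour of steps 2 and 3 can be pictured with a minimal sketch. This is not SPIP's actual code: the function name serve_page, the $tidy callable and the $failed list are hypothetical names introduced only for illustration.

```php
<?php
// Hypothetical sketch of the fallback logic; not SPIP's real implementation.
// $tidy is any callable returning the cleaned page, or null on a blocking error.
function serve_page(string $page, string $url, callable $tidy, array &$failed): string
{
    $clean = $tidy($page);
    if ($clean === null) {
        // Tidy gave up: remember the URL and show the original page unchanged.
        $failed[] = $url;
        return $page;
    }
    // Tidy succeeded: show the cleaned page.
    return $clean;
}
```

In this picture, the $failed list plays the role of the error file that SPIP maintains for the webmaster.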

We must stress again that Tidy has no "magic" powers, and that its developers never claimed it had: you cannot rely entirely on Tidy, which has well-defined limits. It should be considered as one element in a larger strategy to achieve full compliance:
— first by improving the code generated by SPIP,
— and then using Tidy both as a method to "clean" the code and as a way for the webmaster to improve it, by offering a list of errors to be addressed.

Making your code compliant cannot rely on a purely technical, system-level solution; it requires a personal effort, for which SPIP offers a supplementary monitoring tool.

Installing Tidy

-  Tidy, a PHP extension

Tidy exists as a PHP extension. This is the easiest way to use it, since the tool is then directly accessible to the webmaster. To see whether it is already present, check the ecrire/?exec=info page of your web site, which shows the server configuration and the list of installed PHP extensions.
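If you prefer a quick script to reading the info page, a minimal sketch such as the following reports the same thing, using PHP's standard extension_loaded() function. The variable name $has_tidy is only illustrative.

```php
<?php
// true if the Tidy PHP extension is loaded on this server, false otherwise.
$has_tidy = extension_loaded('tidy');
echo $has_tidy ? "Tidy extension available\n" : "Tidy extension not loaded\n";
```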

N.B.: Any feedback from webmasters is interesting information for the developers — through the spip-dev mailing list. (We need feedback on versions 1 and 2 of Tidy, i.e. on sites running on either PHP 4 or PHP 5, but also on installations using PEAR.)

-  Tidy as an independent programme

It is otherwise possible to use Tidy from the command line (i.e. as a programme independent of PHP, running directly on the server).

This version is particularly practical because:
— there are pre-compiled versions of Tidy available for almost all operating systems,
— it is often possible and easy to install these versions of Tidy on a web host without requiring root access,
— some site administrators have run into incompatibilities during the installation of Tidy as a PHP extension (as appears to be the case with ImageMagick); the command line version does not cause any of these kinds of problems.

Before doing anything else, check that Tidy does not already exist on your server. To do this, insert the following lines into your mes_options.php file:

define('_TIDY_COMMAND', 'tidy');
$xhtml = true;

Then check on your public site whether the pages have been modified (either transformed into XHTML, or displaying the "Tidy error" message). If it doesn’t work, remember to delete those two lines, and try installing Tidy as per the following method (or ask the person in charge of your hosting service to do it for you).
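If you can run a PHP script on the server, a sketch like the one below is another way to find out whether a tidy command responds at all. It assumes that PHP's exec() function is not disabled by your host and relies on Tidy's -v option, which prints its version; the function name tidy_command_works is hypothetical.

```php
<?php
// Return true if the given command answers "-v" with exit status 0.
// Assumes exec() is allowed on this server; many shared hosts disable it.
function tidy_command_works(string $command): bool
{
    $output = [];
    $status = 1;
    // Redirect stderr so that a "not found" message does not leak into the page.
    exec(escapeshellarg($command) . ' -v 2>&1', $output, $status);
    return $status === 0;
}

echo tidy_command_works('tidy') ? "tidy responds\n" : "tidy not found\n";
```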

You can install a pre-compiled version of Tidy that matches your system.
— These versions can be found on the official Tidy web site; there are versions for Linux, various BSD implementations, MacOS X, etc.
— Unzip the downloaded archive, and install the "tidy" file on your web site.
— Check the execution rights of this file on your server (if necessary, chmod it to "777"). If you have SSH access to your server, you can test the programme directly in the terminal. If you don’t have such access, don’t worry: just move on to the next steps, bearing in mind that a failure at this first attempt is not the end of the story.
— Configure the access to this file by specifying the access path such as:

define('_TIDY_COMMAND', '/usr/bin/tidy');
$xhtml = true;

If the path indicated in _TIDY_COMMAND is correct, then Tidy will be triggered when you display the pages for your public site.
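As a precaution, the two lines can be wrapped in a test so that Tidy is only switched on when the file really is executable. This is a sketch, not part of SPIP: the $tidy_path variable is hypothetical and the path is simply the example used above.

```php
<?php
// Hypothetical guard for mes_options.php: enable Tidy only if the binary is usable.
$tidy_path = '/usr/bin/tidy'; // example path; adjust to where you installed Tidy
if (is_executable($tidy_path)) {
    define('_TIDY_COMMAND', $tidy_path);
    $xhtml = true;
}
```

With this guard, a wrong path leaves the site in its normal state instead of triggering Tidy errors.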

Important. The definition of _TIDY_COMMAND must occur in the /ecrire/mes_options.php file and not at the site root in the /mes_fonctions.php file. This is because of the rather particular way the system works (post-processing of the files extracted from the SPIP cache).

The $xhtml = true; line, on the other hand, works just like a "customisation variable"; if you want to run some tests or restrict its operation to just a part of your site, you may define this variable at the file call level, e.g. embedded into article.html if you want Tidy to operate on only the article pages.

Clean up your code...

Once again, you need to understand that Tidy can only make compliant code out of something that is already very clean. With the default templates delivered with SPIP and the code SPIP generates by default, this is not a problem: the code is already very close to fully compliant HTML 4, which Tidy has no trouble transforming into fully compliant XHTML 1.0 transitional code.

When Tidy works properly, you will notice the following in the source code of your pages:
— the "DOCTYPE" for your pages now commences with "XHTML...",
— the code is beautifully indented,
— the final result will pass through W3C validation without any difficulties.
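The first of these signs is easy to check automatically. The sketch below simply looks for the word "XHTML" inside the DOCTYPE declaration of a page; the function name has_xhtml_doctype is hypothetical, and the check says nothing about whether the page actually validates.

```php
<?php
// Return true if the page's DOCTYPE declaration mentions XHTML.
function has_xhtml_doctype(string $html): bool
{
    return preg_match('/<!DOCTYPE[^>]*XHTML/i', $html) === 1;
}
```

You could, for instance, fetch one of your public pages and pass its source to this function after enabling Tidy.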

If the DOCTYPE has not been changed, then Tidy has decided not to correct the page, because it has encountered errors that it has found impossible to correct.

Such errors may have two possible sources: the templates, and the text in the articles (or other editorial user-entered content).

-  Your templates themselves are not compliant; in which case, it’s up to you to fix them. This is generally the most common cause.

The best place to start is to deactivate Tidy (switch the $xhtml variable to false) and run your pages through a validator (the integrated XML validator or the W3C Validator), requesting HTML 4.01 transitional compliance as the minimum.

Once your templates are as close as possible to HTML 4, Tidy won’t have too much trouble generating the corresponding compliant XHTML code. If your pages are completely compliant, then that’s even better! (And this is not an impossible feat, believe us - the default templates delivered with SPIP are already fully compliant).

-  Certain articles contain incorrect code

SPIP openly permits content editors to work using "source code", giving them the opportunity to enter non-compliant code inside their own articles. (For example, within the documentation on www.spip.net, there are some places where Tidy would consider the use of <tt>...</tt> HTML tags as unacceptable.)

In addition, once your templates have been cleaned, you will be able to review and correct the text of certain articles, namely those with HTML code inserted directly into the article text; once again, the code generated by SPIP is itself essentially compliant and will not cause Tidy any serious "blocks".

For this, in addition to the "Tidy error" administration button that appears on the pages in question, SPIP maintains a file called /ecrire/data/w3c-go-home.txt which lists the pages that could not be validated [1]. Once your templates are cleaned up properly (see the previous discussion), the number of failing pages should be limited to only those articles that contain HTML code not acceptable to Tidy.

It is fairly difficult to define precisely what Tidy considers an "insurmountable error". For example, incorrectly closed tags are not necessarily impossible to correct (e.g. switching to italics in one paragraph and switching them off again in another generates non-compliant HTML code that Tidy can still sometimes manage to correct properly). Most often, such errors are hand-typed tags that simply do not exist (e.g. typing <bt> instead of <br>), or HTML tags considered obsolete in HTML 4 (such as <tt> or <blink>), which Tidy simply refuses to handle.
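When hunting for such tags in an article's text, a small scan can save time. The following is only a sketch: the function name find_suspect_tags is hypothetical, and the list of suspect tags is the set of examples quoted above, not Tidy's actual rules.

```php
<?php
// Return the distinct opening-tag names in $html that appear in $suspect.
// Tag names are compared in lower case; closing tags are ignored.
function find_suspect_tags(string $html, array $suspect): array
{
    preg_match_all('/<\s*([a-zA-Z][a-zA-Z0-9]*)/', $html, $matches);
    $used = array_unique(array_map('strtolower', $matches[1]));
    return array_values(array_intersect($used, $suspect));
}

// Illustrative list taken from the examples in the text.
$suspect = ['bt', 'tt', 'blink'];
print_r(find_suspect_tags('<p>Line<bt>next <TT>code</TT></p>', $suspect));
```

The result only points at places worth checking by hand; Tidy remains the judge of what it can or cannot correct.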


We repeat once more, for good measure, that Tidy must not be considered a "miracle worker": it cannot turn dirty code into clean code. Its integration into SPIP is therefore part of a guided approach to cleaning code:

— SPIP itself continues to generate ever cleaner code;

— Tidy then serves as a final touch to clean code that is already largely "compliant" (when it runs, it resolves a handful of incompatibilities that are difficult to manage with a single code base common to HTML and XHTML, for example tags such as <br /> that are self-closing in XHTML but left unclosed in HTML);

— Tidy is then used to identify coding errors in the source of the articles themselves (contributing editors quite frequently insert HTML code "by hand" into the body text of articles).


[1] This file gets its name from the article called W3C go home!, published on uZine, which criticised the dogged efforts of the compliance prophets against... those lazy webmasters who do nothing more than just create Web pages (without checking them); the essential matter, it stated, was to publish correctly without giving yourself too many headaches. If you can manage, at the same time, to make it compliant, then well and good, and that is what SPIP-Tidy is trying to accomplish too.

Author Mark Published : Updated : 26/10/12
