This page documents the experimental HTML5 fork of HTML Tidy available at https://github.com/w3c/tidy-html5.
File bug reports and enhancement requests at https://github.com/w3c/tidy-html5/issues.
The W3C public mailing list for HTML Tidy discussion is html-tidy@w3.org (list archive).
For more information on HTML5:
Validate your HTML documents using the W3C Nu Markup Validator.
Tidy corrects and cleans up HTML content by fixing markup errors. Here are a few examples:
<h2>subheading</h3>
…is converted to:
<h2>subheading</h2>
<p>here is a para <b>bold <i>bold italic</b> bold?</i> normal?
…is converted to:
<p>here is a para <b>bold <i>bold italic</i> bold?</b> normal?
<h1>heading <h2>subheading</h2>
…is converted to:
<h1>heading</h1> <h2>subheading</h2>
…and
<h1><i>italic heading</h1>
…is converted to:
<h1><i>italic heading</i></h1>
<i><h1>heading</h1></i> <p>new paragraph <b>bold text <p>some more bold text
…is converted to:
<h1><i>heading</i></h1> <p>new paragraph <b>bold text</b> <p><b>some more bold text</b>
<h1><hr>heading</h1> <h2>sub<hr>heading</h2>
…is converted to:
<hr> <h1>heading</h1> <h2>sub</h2> <hr> <h2>heading</h2>
<a href="#refs">References<a>
…is converted to:
<a href="#refs">References</a>
<body> <li>1st list item <li>2nd list item
…is converted to:
<body> <ul> <li>1st list item</li> <li>2nd list item</li> </ul>
Tidy inserts quotation marks around all attribute values for you. It can also detect when you have forgotten the closing quotation mark, although this is something you will have to fix yourself.
Tidy has a comprehensive knowledge of the attributes defined in HTML5. That often allows you to spot where you have mis-typed an attribute.
Tidy also spots tags that lack a terminating ">". This is something you then have to fix yourself, as Tidy cannot determine where the ">" was meant to be inserted.
This is the syntax for invoking Tidy from the command line:
tidy [[options] filename]*
Tidy defaults to reading from standard input, so if you run Tidy without
specifying the filename argument, it will just sit there waiting for input
to read.
And Tidy defaults to writing to standard output. So you can pipe output
from Tidy to other programs, as well as pipe output from other programs to
Tidy. You can page through the output from Tidy by piping it to a pager:
tidy file.html | less
To have Tidy write its output to a file instead, either use the
-o filename or -output filename option, or redirect standard output to the
file; for example:
tidy -o output.html index.html
tidy index.html > output.html
Both of those run tidy on the file index.html and write the output to the file output.html, while writing any error messages to standard error.
Tidy defaults to writing its error messages to standard error (that is, to the console where you’re running Tidy). To page through the error messages, along with the output, redirect standard error to standard output, and pipe it to your pager:
tidy index.html 2>&1 | less
To have Tidy write the errors to a file instead, either use the
-f filename or -file filename option, or redirect standard error to a file:
tidy -o output.html -f errs.txt index.html
tidy index.html > output.html 2> errs.txt
Both of those run tidy on the file index.html and write the output to the file output.html, while writing any error messages to the file errs.txt.
Writing the error messages to a file is especially useful if the file you are checking has many errors; reading them from a file instead of the console or pager can make it easier to review them.
You can use the -m or -modify option to modify the contents of the input
file in place; that is, to overwrite those contents with the output from
Tidy. Example:
tidy -f errs.txt -m index.html
That runs tidy on the file index.html, modifying it in place and writing the error messages to the file errs.txt.
Caution: If you use the -m option, you should first save a copy of your file.
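One way to follow that advice is to make the copy in the same step that runs Tidy. The following is a minimal sketch for a POSIX shell, not something Tidy provides: the function name tidy_with_backup and the .bak suffix are assumptions chosen for illustration, and only the -f and -m options described above are used.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tidy): copy a file to <name>.bak,
# then let Tidy modify the original in place.
tidy_with_backup() {
  file=$1
  # Stop without running Tidy if the backup copy cannot be made.
  cp -p -- "$file" "$file.bak" || return 1
  # -f and -m are the options described above; messages go to errs.txt.
  tidy -f errs.txt -m "$file"
}
```

Calling tidy_with_backup index.html leaves the untouched original in index.html.bak, and the function returns whatever exit status Tidy returned.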
To get a list of available options, use:
tidy -help
To get a list of all configuration settings, use:
tidy -help-config
To read the help output a page at a time, pipe it to a pager:
tidy -help | less
tidy -help-config | less
Single-letter options other than -f may be combined; for example:
tidy -f errs.txt -imu foo.html
The most convenient way to configure Tidy is by using a separate config
file. Assuming you have created a Tidy config file named config.txt (the
name doesn't matter), you can instruct Tidy to use it via the command-line
option -config config.txt; for example:
tidy -config config.txt file1.html file2.html
Alternatively, you can name the default config file via the environment variable named HTML_TIDY, the value of which is the absolute path for the config file.
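As a rough illustration of the environment-variable approach, the sketch below writes a small config file and points HTML_TIDY at it. The file location and the two option values are assumptions chosen for the example; the option names themselves appear in the sample config file on this page.

```shell
#!/bin/sh
# Write a small config file; the path and option values are examples only.
config_file="${TMPDIR:-/tmp}/tidy-config.txt"
cat > "$config_file" <<'EOF'
indent: auto
wrap: 72
EOF

# HTML_TIDY must hold the absolute path of the config file.
HTML_TIDY=$config_file
export HTML_TIDY

# Later invocations in this shell session can then omit -config, e.g.:
#   tidy file1.html file2.html
```

Setting the variable in a shell startup file would make the same config file the default for every session.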
You can also set config options on the command line by preceding
the name of the option immediately (no intervening space) with the string
"--"; for example:
tidy --break-before-br true --show-warnings false
You can find documentation for the full set of configuration options on the Quick Reference page.
The following is an example of a Tidy config file.
// sample config file for HTML tidy
indent: auto
indent-spaces: 2
wrap: 72
markup: yes
output-xml: no
input-xml: no
show-warnings: yes
numeric-entities: yes
quote-marks: yes
quote-nbsp: yes
quote-ampersand: no
break-before-br: no
uppercase-tags: no
uppercase-attributes: no
char-encoding: latin1
new-inline-tags: cfif, cfelse, math, mroot, mrow, mi, mn, mo, msqrt, mfrac, msubsup, munderover, munder, mover, mmultiscripts, msup, msub, mtext, mprescripts, mtable, mtr, mtd, mth
new-blocklevel-tags: cfoutput, cfquery
new-empty-tags: cfelse
The experimental HTML5-aware fork of Tidy adds the following new configuration options:
In addition, it adds a new html5 value for the doctype configuration option.
Indenting the source markup of an HTML document makes the markup easier to read. Tidy can indent the markup for an HTML document while recognizing elements whose contents should not be indented. In the example below, Tidy indents the output while preserving the formatting of the <pre> element:
Input:
<html>
<head>
<title>Test document</title>
</head>
<body>
<p>This example shows how Tidy can indent output while preserving
formatting of particular elements.</p>
<pre>This is <em>genuine preformatted</em> text
</pre>
</body>
</html>
Output:
<html>
  <head>
    <title>Test document</title>
  </head>
  <body>
    <p>This example shows how Tidy can indent output while preserving
    formatting of particular elements.</p>
<pre>
This is <em>genuine preformatted</em> text
</pre>
  </body>
</html>
Tidy’s indenting behavior is not perfect and can sometimes cause your
output to be rendered by browsers in a different way than the input.
You can avoid unexpected indenting-related rendering problems by setting
indent: no or indent: auto in a config file.
Tidy is not capable of preserving the original indenting of the markup from the input it receives. That’s because Tidy starts by building a clean parse tree from the input, and that parse tree doesn’t contain any information about the original indenting. Tidy then pretty-prints the parse tree using the current config settings. Trying to preserve the original indenting from the input would interact badly with the repair operations needed to build a clean parse tree, and would considerably complicate the code.
Tidy assumes by default that you want output to be encoded in UTF-8, but it also offers a choice of other character encodings: US ASCII, ISO Latin-1, and the ISO 2022 family of 7-bit encodings.
Tidy doesn't yet recognize the use of the HTML <meta> element for specifying the character encoding.
The full set of HTML character references is defined. Cleaned-up output uses named character references for characters when appropriate. Otherwise, characters outside the normal range are output as numeric character references.
Tidy offers advice on potential accessibility problems for people using non-graphical browsers.
Some tools generate HTML with presentational elements such as <font>,
<nobr>, and <center>. Tidy's -clean option will replace those elements with
CSS style properties.
Some HTML documents rely on the presentational effects of <p> start tags that are not followed by any content. Tidy deletes such <p> tags (as well as any headings that don’t have content). So do not use <p> tags simply for adding vertical whitespace; instead use CSS, or the <br> element. However, note that Tidy won’t discard <p> tags that are followed by a nonbreaking space (that is, the &nbsp; named character reference).
You can teach Tidy about new tags by declaring them in the configuration file; the syntax is:
new-inline-tags: tag1, tag2, tag3
new-empty-tags: tag1, tag2, tag3
new-blocklevel-tags: tag1, tag2, tag3
new-pre-tags: tag1, tag2, tag3
The same tag can be declared as both empty and inline, or as both empty and block-level, so these declarations can be combined to define a new empty inline or empty block-level element. Declaring a tag as both inline and block-level is not advised.
Note that the new tags can only appear where Tidy expects inline or block-level tags respectively. That means you can’t place new tags within the document head or other contexts with restricted content models.
Tidy will gracefully ignore many cases of PHP, ASP, and JSTE instructions within element content and as replacements for attributes, and preserve them as-is in output; for example:
<option <% if rsSchool.Fields("ID").Value
  = session("sessSchoolID")
  then Response.Write("selected") %>
  value='<%=rsSchool.Fields("ID").Value%>'>
  <%=rsSchool.Fields("Name").Value%>
  (<%=rsSchool.Fields("ID").Value%>)
</option>
But note that Tidy may report missing attributes when those are “hidden” within the PHP, ASP, or JSTE code. If you use PHP, ASP, or JSTE code to create a start tag, but place the end tag explicitly in the HTML markup, Tidy won’t be able to match them up, and will delete the end tag. So in that case you are advised to make the start tag explicit and to use PHP, ASP, or JSTE code for just the attributes; for example:
<a href="<%=random.site()%>">do you feel lucky?</a>
Tidy can also get things wrong if the PHP, ASP, or JSTE code includes quotation marks; for example:
value="<%=rsSchool.Fields("ID").Value%>"
Tidy will see the quotation mark preceding ID as ending the attribute value, and proceed to complain about what follows.
Tidy allows you to control whether line wrapping on spaces within PHP, ASP, and JSTE instructions is enabled; see the wrap-php, wrap-asp, and wrap-jste config options.
Tidy can help you to correct well-formedness errors in XML markup. Tidy doesn't yet recognize all XML features, though; for example, it doesn't understand CDATA sections or DTD subsets.
If you want to run Tidy from Perl or another scripting language, you may find it useful to inspect the exit code that Tidy returns: 0 if everything is fine, 1 if there were warnings, and 2 if there were errors. This is an example using Perl:
if (close(TIDY) == 0) {
  my $exitcode = $? >> 8;
  if ($exitcode == 1) {
    printf STDERR "tidy issued warning messages\n";
  } elsif ($exitcode == 2) {
    printf STDERR "tidy issued error messages\n";
  } else {
    die "tidy exited with code: $exitcode\n";
  }
} else {
  printf STDERR "tidy detected no errors\n";
}
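The same check can be made from a POSIX shell script. The following is a minimal sketch rather than part of Tidy itself: the function name describe_tidy_status is hypothetical, and it relies only on the exit codes described above (0, 1, and 2).

```shell
#!/bin/sh
# Hypothetical helper (not part of Tidy): translate a Tidy exit code
# into a short message, using the exit codes documented above.
describe_tidy_status() {
  case "$1" in
    0) echo "tidy detected no errors" ;;
    1) echo "tidy issued warning messages" ;;
    2) echo "tidy issued error messages" ;;
    *) echo "tidy exited with code: $1" ;;
  esac
}
```

A script could run tidy -o output.html -f errs.txt index.html and then immediately pass the value of $? to describe_tidy_status, before any other command overwrites it.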
The source code for the experimental HTML5 fork of Tidy can be found at https://github.com/w3c/tidy-html5.
For Linux/BSD/OSX platforms, you can build and install the tidy
command-line tool from the source code using the following steps.
make -C build/gmake/
make install -C build/gmake/
Note that you will need to run the make install step either as root or with sudo.
For Linux/BSD/OSX platforms, you can build and install the tidylib
shared library (for integrating Tidy into other applications) from the
source code using the following steps.
sh build/gnuauto/setup.sh && ./configure && make
make install
Note that you will need to run make install either as root or with sudo (that is, sudo make install).
Dave Raggett has a list of Acknowledgements for people who made suggestions or reported bugs for the original version of Tidy.