parser and picklist system. Console application needs to be updated to fix
the description, as it shows autobool, and for some reason on the current
system I'm not getting assertion failures.
This PR refactors how picklists and option parsers are implemented in LibTidy,
making is vastly easier to implement new picklists in the future, as well as
modify some of the existing picklists such that they have more logical names.
Picklist arrays are now arrays of structures that include the possible strings
capable of setting a particular option value, and a new parser has been written
to work with these structures.
In addition, several of the existing parsers were removed, as they are now
redundant, and a couple of the remaining parsers were refactored to take
advantage of the new parser.
In effect, this means that:
- New parsers don't have to be written in the majority of cases where new
options are added that exceed yes/no/auto.
- Some of the existing options can have more meaningful names than yes/no/auto,
in a backward compatible way. For example, vertical-spacing "auto" currently
in no way reflects "auto" when used.
According to the MSN documentation 'isalnum(c)' is only valid when c equals
EOF, or is in the range 0 to 255 inclusive. It states the behavior is
undefined outside this range, and in Debug mode triggers an assert dialog.
in #352, but I'm worried that there's some over-reach here.
Currently only implemented as a warning, with no switch to turn it off, which
maintains current behavior other than the warning.
In general, we're treating any string as a complete URL, rather than breaking
URL's into component parts. Thus the `IsURLCodePoint()` check includes a few
other generic characters that strictly speaking aren't valid codepoints, but
are valid as escape characters and delimiters.
When addressing #338, I ran into a similar situation in not having a built-in
method to separate path components (although a simple generalized solution was
good enough in that case).
Thus without introducing a new structure and functions to deconstruct a URL
into scheme, authority, path, parameters, etc., some variation of this patch
will have to be used to address #352.
extension is a file, and so links to TLD's ending with .pl, .au, etc., will
cause accessibility warnings. This fix attempts to distinguish between URI's
that are likely to be files versus links to domains.
- Many, many updates to the public header files.
- tidyenum.h was reorganized substantially in order to better generate
documentation with Doxygen.
- This was also a good time to clean up all of the various enums for languages
and strings. Everything is simple and in a single enum now, other than a
couple of cases (TidyOptionId, for example, doesn't need to be redefined).
- A full and complete audit of the strings meant some opportunities to delete
useless strings.
- Reorganized the order of the strings in language_en.h in order to better
find things when programmers want to make changes. There are a lot fewer
internal "sections" now, and everything has been painstakingly sorted within
the remaining sections.
- Consequently rebased all of the PO's, POT, and other language files.
- Updated several of the READMEs with the newest information.
- Made the READMEs easier to copy into the Doxygen project by changing some of
the code format for compatibility, mainly the use of tildes instead of
backslashes for code blocks.
- Added tidyGetMessageCode() to message API. Despite the huge diff, this is the
only externally-visible change, other than removing some enums (but not their
values!).
- Passing `next` tests on Mac, Linux, Win10.
- tidyDetectedHtmlVersion()
- tidyDetectedXhtml()
- added two new fields to W3C_Doctypes[] in order to simplify this.
- added TY_(HTMLVersionNumberFromCode)() to enable lookup.
- Implement tidyDetectedGenericXml()
- Added a warning message if an XML declaration exists but the document is not
XHTML.
- Remove dead commented code.
- Updated POs and POT. Headers not affected, but translators should check
their translations.
- Testing is clean on Mac OS X, Ubuntu 16.04, and Windows 10.
This is a rather large refactoring of Tidy's messaging system. This was done
mostly to allow non-C libraries that cannot adequately take advantage of
arg_lists a chance to query report filter information for information related
to arguments used in constructing an error message.
Three main goals were in mind for this project:
- Don't change the contents of Tidy's existing output sinks. This will ensure
that changes do no affect console Tidy users, or LibTidy users who use the
output sinks directly. This was accomplished 100% other than some improved
cosmetics in the output. See tidy-html5-tests repository, the `refactor` and
`more_messages_changes` branches for these minor diffs.
- Provide an API that is simple and also extensible without having to write new
error filters all the time. This was accomplished by adding the new message
callback `TidyMessageCallback` that provides callback functions an opaque
object representing the message, and an API to query the message for wanted
details. With this, we should never have to add a new callback routine again,
as additional API can simply be written against the opaque object.
- The API should work the same as the rest of LibTidy's API in that it's
consistent and only uses simple types with wide interoperability with other
languages. Thanks to @gagern who suggested the model for the API in #409.
Although the API uses the "Tidy" way off accessing data via an iterator
rather than an index, this can be easily abstracted in the target language.
There are two *major* API breaking changes:
- Removed TidyReportFilter2
- This was only used by one application in the entire world, and was a hacky
kludge that served its purpose. TidyReportCallback (né TidyReportFilter3)
is much better. If, for some reason, this affects you, I recommend using
TidyReportCallback instead. It's a minor change for your application.
- Renamed TidyReportFilter3 to TidyReportCallback
- This name is much more semantic, and much more sensible in light of
improved callback system. As the name implies, it remains capable of
*only* receiving callbacks for Tidy "reports."
Introducing TidyMessageCallback, and a new message interrogation API.
- As its name implies, it is able to capture (and optionally suppress) *all*
of Tidy's output, including the dialogue messages that never make it to
the existing report filters.
- Provides an opaque `TidyMessage` and an API that can be used to query against
it to find the juicy goodness inside.
- For example, `tidyGetMessageOutput( tmessage )` will return the complete,
localized message.
- Another example, `tidyGetMessageLine( tmessage )` will return the line the
message applies to.
- You can also get information about the individual arguments that make up a
message. By using the `tidyGetMessageArguments( tmessage )` itorator and
`tidyGetNextMessageArgument` you will obtain an opaque `TidyMessageArgument`
which has its own interrogation API. For example:
- tidyGetArgType( tmessage, &iterator );
- tidyGetArgFormat( tmessage, &iterator );
- tidyGetArgValueString( tmessage, &iterator );
- …and so on.
Other major changes include refactoring `messages.c` to use the new message
"object" directly when emitting messages to the console or output sinks. This
allowed replacement of a lot of specialized functions with generalized ones.
Some of this generalizing involved modifications to the `language_xx.h` header
files, and these are all positive improvements even without the above changes.
merge.
- Sort all of the existing options and re-indent per Tidy standards. This is
simply for cosmetic effect.
- Allow the iterator to return all options again, even "internal" options.
Things are too embedded with N_TIDY_OPTIONS, etc., to try to hide them.
- Instead, simply add documentation to LibTidy users that they shouldn't use
internal options.
- Also added `TidyInternalCategory` to `TidyConfigCategory` without adding a
new field to the struct. API users should check for this category before
use.
- Defined a two character macro for `TidyInternalCategory` for use in
`option_defs[]`.
- Changed struct `option_defs[]` to reflect the new category for affected
options.
- Removed string indicating * refers to internal options, since it no longer
applies.
- Regen'd all strings for previous point.
- `tidy.c` now checks for `TidyInternalCategory` everywhere in order to
suppress output.
- Updated strings files to match.
- Inhibit internal options from being output via the iterator. Internals should
never have the chance to be exposed if they shouldn't be use.
- Added tidySetEmacsFile() and TidyGetEmacsFile() to the public API, and use it
instead of secret API to set the filename in the console application.
The end result is that `gnu-emacs-file` (and also `doctype-mode`) officially no
longer exist to CLI users nor to API users, and tidy console behaves properly
by using a published API to set the filename for emacs.
The previous action was to discard the second, while it is the second
table that browsers will render.
This conforms to the principle that the html output by tidy should render
in a browser like the original html.
- TidySlideStyle
- TidyBurstSlides
- Added documentation for TidyEmacsFile, since it's a valid option.
- Because TidyEmacsFile is a valid option, tweaked tidy.c so that it can
be specified in a configuration file without being overwritten by the console
app. Why a user might do this is dumb, but who are we to stop them.
- Consolidated all output string definitions enums into `tidyenum.h`, which
is where they belong, and where they have proper visibility.
- Re-arranged `messages.c/h` with several comments useful to developers.
- Properly added the key lookup functions and the language localization
functions into tidy.h/tidylib.c with proper name-spacing.
- Previous point restored a *lot* of sanity to the #include pollution that's
been introduced in light of these.
- Note that opaque types have been (properly) introduced. Look at the updated
headers for `language.h`. In particular only an opaque structure is passed
outside of LibTidy, and so use TidyLangWindowsName and TidyLangPosixName
to poll these objects.
- Console application updated as a result of this.
- Removed dead code:
- void TY_(UnknownOption)( TidyDocImpl* doc, char c );
- void TY_(UnknownFile)( TidyDocImpl* doc, ctmbstr program, ctmbstr file );
- Redundant strings were removed with the removal of this dead code.
- Several enums were given fixed starting values. YOUR PROGRAMS SHOULD NEVER
depend on enum values. `TidyReportLevel` is an example of such.
- Some enums were removed as a result of this. `TidyReportLevel` now has
matching strings, so the redundant `TidyReportLevelStrings` was removed.
- All of the PO's and language header files were regenerated as a result of
the string cleanup and header cleanup.
- Made the interface to the library version and release date consistent.
- CMakeLists.txt now supports SUPPORT_CONSOLE_APP. The intention is to
be able to remove console-only code from LibTidy (for LibTidy users).
- Updated README/MESSAGES.md, which is *vastly* more simple now.
- No longer break script tags up on two lines if there is content. However
output is still subject to the `--wrap` behavior.
- Previous behavior intact if there is content.
Todo.
- Associate this with a new Tidy option.
Only deals with a successful case.
TODO: Maybe add a warning/error if the trailing surrogate not found, and
maybe consider substituting to avoid invalid utf-8 output.
result[6] is a fixed array of size 6, but in the process
of copying data into it, we clobber the last allocated byte.
Simplify some of the code by not calling redundant functions.
These services can only be used when there is a document loaded, ie a
lexer created. But really should not be calling a Clean and Repair
service with no doc!
According to the specs, each va_copy call should be matched by a va_end call
to ensure proper cleanup. Furthermore, since message filters might iterate
over the list of arguments, we should hand a new copy to each filter.
The problem was, with --hide-endtags yes, a conditional pprint buffer
flush had nothing to flush, thus the indent was not adjusted.
To track down this bug added a lot of MSVC Debug code, but is only
existing if some additional items defined, so has no effect on the release
code.
This, what feels like a good fix, was first reported about 12 years ago by
@OlafvdSpek in SF Bugs 563. Hopefully finally closed.
This is intended to make it very, very easy to update the POT and all of the POs when
changes are made to `language_en.h`. Used without an sha-1 hash, untranslated strings
(i.e., the "source" strings) are updated in the POT/PO's.
However if you specify an --sha=HASH (or -c HASH) option, then the script will use git
to examine the `language_en.h` file from that specified commit, determing the strings
that have changed, and mark all of these strings as `fuzzy` in the POs. This will serve
as a flag to translators that the original has changed. In addition, this `fuzzy` flag
will appear in the headers as "(fuzzy) " in the item comments.
If a translator edits the header directly, he should remove the "(fuzzy )" in the
comment. Then when the PO is rebuilt, the fuzzy flag will be removed automatically.
The reverse is also true; if a translator is working with the PO, he or she should
clear the fuzzy flag and the comment will be adjusted accordingly in the generated
header.
As in the previous case these messages are already valid utf-8 text, and
thus, if output on a byte-by-byte basis, must not use WriteChar, except
for the EOL char.
Of course this output can be to either a user ouput file, if configured,
otherwise stderr.
- Show the language Tidy is using.
- Update the POT and POs with the modified string.
- Regen language_es.h, which uses the string.
Note that the new header uses the new commentless behavior that's still
pending in another branch. In addition the proper c style hints have
been added to all PO's, as their previous absense was a bug.
How this lasted so long in the code is a mystery! But of course it will
only be a read out-of-bounds if testing the first character in the lexer,
and it is a spacey char.
A big thanks to @gaa-cifasis for running ASAN tests on Tidy.
This bug was first openned in 2009 by Christophe Chenon, as bug sf905 but
the patch provided then never made it into the source.
Now appears fixed, 7 years later!
Previous only output the first 8 characters, followed by an elipse if more
than 8. Now return first up to 19 chars. If nore than 19, return first 8,
followed by an elipse, followed by the last 8 characters.
This is in the get_text_string service, which is only used if MSVC and not
NDEBUG.
This is a MUCH SANER approach to what I was trying to do (now that I screwed up enough internals to understand some of them!
At this point there are zero exit state reversions, and zero markup reversions! There are still 21 errout reversions; I'll
annotate and adjust as necessary.
my changes, I'm starting over with this. Comments in the PR thread.
This commit reduces the size of attrdict.c while causing only a single errout
regression that is justified.
Previously Tidy produced different output based on the compilation target, NOT based on
the file encoding and specified options. Every platform was equal except Mac OS. Now unless
the encoding is specifically set to a Mac file type, all encoding assumptions are the same
across platforms.
This reverts commit 61cfcb1555.
This added an inconsistent warning about a missing optional close tag. In
general tidy does not report such optional close tags. See issue #327 for
some discussion on this.
- Includes <p> support
- Matches the description class name in quickref.include.xsl
- Styles <br /> to enforce vertical spacing (in the reference table only).
- documentation/style.css
- Styles <br /> to enforce vertical spacing (in the reference table only).
- documentation/tidy1.xsl.in
- Includes <p> support.
- Better manages line breaks with .sp1 instead of .br.
- src/localize.c
- Legibility to the troublesome `drop-font-tags` description.
This is only if nonested is on, then a <script> tag has not incremented
the nested, so likewise no need to treat an escaped close tag <\/script>
as an end tage to decrement nested.
This is in the GetCDATA function. If the container is script or style and
this option is on, avoid bumping nested.
This addresses issues #65 (1642186) and #280.
All attempts at parsing script data are now abandoned as a bad direction.
That is reordering windows includes per #234
In general the order of includes should be system <headers>,
then local "headers", except perhaps for the ocassional local
"version" or "config" header...
Resolved conflicts in src/pprint.c by reverting to current master, and in
version.txt by increasing the version.
Moved the <windows.h> include above the "streamio.h" include to fix compilation with the latest Windows SDK.
<winnt.h> now has the following struct. In particular the `CR` member of this struct conflicts with a define in streamio.h.
typedef struct _IMAGE_ARM64_RUNTIME_FUNCTION_ENTRY {
DWORD BeginAddress;
union {
DWORD UnwindData;
struct {
DWORD Flag : 2;
DWORD FunctionLength : 11;
DWORD RegF : 3;
DWORD RegI : 4;
DWORD H : 1;
DWORD CR : 2; // This line causes a compile error because CR is redefined in streamio.h
DWORD FrameSize : 9;
} DUMMYSTRUCTNAME;
} DUMMYUNIONNAME;
} IMAGE_ARM64_RUNTIME_FUNCTION_ENTRY, * PIMAGE_ARM64_RUNTIME_FUNCTION_ENTRY;
This was evidenced by an 'assert' failure, that the type was not an 'int'!
And also in the -xml-help output, thus effecting the tidy.1 manual page
for this new feature --vertical-space auto, which produces almost single
line html output.
This 'fix' began in the issue-228 branch - see Issue #231
This is a case where the lexer, in GetTokenfromStream, does NOT eat any
trailing newline after a LEX_STARTTAG: case...
So far have identified pre, script, style as NEEDING this user newline
character for later pprint output. Any others?
html5 allows a naked ampersand unquoted, and now tidy will not issue a
warning. This only deals with a & b, and P&<li>O</li>
More may need to be done for other cases.
This is when setting a String config value through say tidyOptSetValue
using say tidyOptSetValue(tdoc,id,"").
If the length of the new string is zero then do not allocate a 1 byte
buffer, set it to 0, for the option. Any previous buffer has already been
released.
This means API functions like tidyOptSaveSink will not return erroneous
null String values!
This new warning will only be seen if the document remains in HTML5 mode,
where the summary attribute is obsolete. The W3C validator flags this as
an error, and suggests 'Consider describing the structure of the table in
a caption element or in a figure element containing the table; or simplify
the structure of the table so that no description is needed'.
At the same time this patch also restored the old warning if the document
is HTML4--, if the table element lacks a summary attribute. This has been
a tidy warning since the beginning of time, although the W3C validator
does not presently flag this.
In certain circumstances a leading space has to be preverved to allow it
to be used to create a text space node to insert before this element to
preserve the view in a browser.
And added a note asking why is ParseTag called with a hardcoded
IgnoreWhitespace when some effort above has set the mode variable to
MixedContent in certain cases, but need to think about this 2nd change.
Also added some MSVC Debug output when this leading text is used to insert
such a created text node before the element just to be reminded of this
special event.
Such debug is OFF by default, and only added by defining DEBUG_MEMORY. And
is only available for the Debug configuration compiled with MSVC, but this
could be easily extended...
This is particularly for the anchor tag which in html5 mode is parsed in
ParseBlock. That is retain a leading space, in case it needs to be
moved to in front of the block to keep space rendering.
Revert TidyTag_A to HTML5 mode, but allow the table to be modified if the
DOCTYPE given is found to NOT be HTML5, through a service TY_(AdjustTags).
Care is taken to clear any previous hash cached tags.
At present this only effects the anchor tag, but could be applied to
others that need to change their parsing due to an identified DOCTYPE.
The fix for #111 added an end tag for all StartEnd tags, when outputting
HTML5, but there should be some exceptions to this.
Added a new service, isVoidElement(node) for the void elements. Perhaps
this service could be further optimised.
With this fix introduced two new services, FindNodeById and
FindNodeWithId. The former does a total tree search for a TidyTagId.
Maybe there is a way to optimise this search...
Also change the uint badForm from an on/off to a bit field, so could be
extended to other document format errors.
This is a set of kludgy fixes for MathML attribute and entities support.
It is intended that a full HTML5 entity table be added at some time, but
at present ALL entities are accepted as written when within the math
element.
Likewise all attributes are accepted on MathML elements without any check
of their name or value, even if they match attributes outside MathML.
And in the pprinter such entities are written as is from the lexer, using
a new PPrintMathML service added, using the new mode OtherNameSpace.
It is hoped all these fixes will NOT effect tidy outside the math element.
ALL fixes in the set a clearly marked '#130 - MathML attr and entity fix!'
for easy searching, and improving if possible.
As predicted the previous fix had adverse consequences on say script text,
which then lost the indent, and was reverted.
This introduces a new service, nodeIsTextLike, which naturally returns yes
if it is text, but also is an AspTag.
Maybe other text like nodes need to be added.
An immense thanks to Ger Hobbelt who had already done this
in his github.com/GerHobbelt/htmltidy fork.
The two sources have diverges so was not a simple cut
an paste. But again thanks Ger for this.
This is a simple but profound change in pprint.c.
Since leading space is preserved on script code, after tidy indents
the code once, a second run on that tidied file would add more indent
to already indented code. This fix should be carefully checked,
and removed if there are other bad consequences.
Bump the version point to 4 for this change.
the anchor name as a parameter, so it can look in the correct bin.
In the case of FreeAttrs, we have the name already (since we found a name or
id attribute). In the case of FixAnchors, the anchor name could come from
either the name or id attribute, so we call the function separately for each
case, passing the appropriate attribute value.