Ticket #21 (new defect)

Opened 3 years ago

Last modified 3 years ago

Fuzz does not report Expat parsing errors

Reported by: msporny Owned by: msporny
Priority: major Milestone: 1.0
Component: fuzz Version: 0.9
Keywords: Cc:

Description

While testing the site http://www.w3c.es/Personal/Martin/ we noticed that not all of the triples appear. After some research, we found the cause. It is the comment:

<!-- Metadatos sobre la localización personal -->

because it contains the non-ASCII character 'ó'.

Change History

Changed 3 years ago by msporny

Passing the document you noted through the W3C Validator, after adding the "localización" text, outputs the following:


Sorry, I am unable to validate this document because on line 19 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.

The error was: utf8 "\xF3" does not map to Unicode


So, to be technically accurate: the 'ó' was sent as the single byte 0xF3 (its ISO-8859-1 encoding), which here does not form a valid UTF-8 sequence, and UTF-8 is the encoding the document was being served as. That would have raised an Expat XML parser exception in librdfa (which is what Fuzzbot uses for parsing the XHTML+RDFa document).
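
For anyone who wants to reproduce this outside Fuzzbot, here is a minimal sketch. It assumes Python's standard xml.parsers.expat bindings rather than librdfa's C code, and the three-line document is a made-up stand-in for the real page; the helper names are mine.

```python
import xml.parsers.expat

# Hypothetical stand-in for the page: the comment is encoded as ISO-8859-1,
# so the accented letter becomes the single byte 0xF3, yet the document is
# handed to the parser as UTF-8.
comment = "<!-- Metadatos sobre la localizaci\u00f3n personal -->".encode("iso-8859-1")
document = b"<html>\n" + comment + b"\n</html>"


def utf8_error(data):
    """Return the UnicodeDecodeError raised when decoding data as UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None


def expat_error(data):
    """Return the ExpatError raised when parsing data as UTF-8, or None."""
    parser = xml.parsers.expat.ParserCreate("UTF-8")
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as exc:
        return exc
    return None


if __name__ == "__main__":
    print(utf8_error(document))
    print(expat_error(document))
```

Both helpers should return an error for this input, and the Expat error carries the line number of the offending byte.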

When Fuzzbot encounters a parser exception, it runs the entire document through HTML Tidy and re-attempts the parse. It would have failed the second time and given up.
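
The parse, tidy, re-parse flow described above can be sketched roughly as follows. This is not Fuzzbot's actual code: it uses Python's xml.parsers.expat instead of librdfa, and `repair` is a hypothetical callable standing in for the HTML Tidy pass.

```python
import xml.parsers.expat


def is_well_formed(data):
    """Return True if Expat accepts data as UTF-8 XML."""
    parser = xml.parsers.expat.ParserCreate("UTF-8")
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


def parse_with_fallback(data, repair):
    """Try to parse data; on failure, repair it once and try again.

    repair is a hypothetical stand-in for the HTML Tidy pass.
    Returns "ok", "ok-after-repair", or "failed".
    """
    if is_well_formed(data):
        return "ok"
    if is_well_formed(repair(data)):
        return "ok-after-repair"
    # Second failure: give up, as described above.
    return "failed"


if __name__ == "__main__":
    # A repair step that leaves the bad byte in place cannot rescue the parse.
    print(parse_with_fallback(b"<p>localizaci\xf3n</p>", lambda d: d))
```

A structural problem such as an unclosed tag can be rescued by the repair step, but a repair that leaves the invalid byte untouched ends in "failed".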

Once a document has failed at the Expat parser level, it is impossible for me to recover. I could do two things at that point:

1. Display a warning via the Fuzzbot interface that states that the document is malformed and note the location of the bad data.

2. Attempt to ignore the invalid data and continue. This approach would be bad as there is no good algorithm for deleting content from a page... especially if we have to depend on the triples generated from a page that has had its content deleted.
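
To make option 1 concrete, here is a rough sketch of what such a warning could look like. It assumes Python's xml.parsers.expat bindings rather than librdfa, and the function name and message wording are made up for illustration.

```python
import xml.parsers.expat
from xml.parsers.expat import ErrorString


def malformed_warning(data):
    """Return a warning string locating the first Expat error, or None."""
    parser = xml.parsers.expat.ParserCreate("UTF-8")
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as exc:
        # Show the offending line, with undecodable bytes replaced by U+FFFD.
        lines = data.split(b"\n")
        bad_line = lines[exc.lineno - 1].decode("utf-8", errors="replace")
        return "Document is malformed at line %d, offset %d (%s): %s" % (
            exc.lineno, exc.offset, ErrorString(exc.code), bad_line)
    return None


if __name__ == "__main__":
    page = b"<html>\n<!-- localizaci\xf3n -->\n</html>"
    print(malformed_warning(page))
```

As for option 2, silently dropping the byte (for example `b"localizaci\xf3n".decode("utf-8", errors="ignore")`) yields "localizacin", which shows how the page content would be altered without the author ever knowing.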

Do you have any other suggestions? What would you expect or like to see if such an error happens?

Changed 3 years ago by msporny

  • version changed from 0.15 to 0.9
  • summary changed from Fuzzbot does not report Expat parsing errors to Fuzz does not report Expat parsing errors