One problem that can plague tools for building or processing any standardized type of document, whether it be RSS, Atom, Info Bite List, web pages, etc., is the question of what to do with invalid documents. Because so many people learn by imitation, the internet is a fertile breeding ground for incorrectly coded documents. What can we do to encourage developers of publishing tools to generate valid documents, and valid newsfeeds in particular?

Why do people build tools that generate invalid documents? In most cases, it is certainly not intentional. Answers include:

  1. buggy code written by people who do understand the format correctly
    1. but don't have the necessary programming skills
    2. but didn't test their code rigorously enough
  2. incorrect code written by people who don't understand the format correctly because
    1. they didn't understand the documentation because
      1. they didn't read it carefully
      2. it is poorly written
      3. it is incomplete
      4. they read the core documentation, but not other documentation which explains things not spelled out there because
        1. they thought they understood the issues without reading the other documentation
        2. they couldn't find the other documentation
    2. they didn't read the documentation at all because
      1. they thought they understood the specification correctly by looking at existing documents
        1. which were invalid themselves
        2. but they misunderstood the existing documents
        3. but the existing documents didn't contain certain cases which the new tools needed to address
    3. they couldn't find it
  3. correct code that relies on buggy libraries

Clearly there are a multitude of possible explanations for invalid newsfeeds. How do we address the problem? One approach is to accept that invalid feeds will exist and write special-case code to compensate for others' errors. This philosophy seeks to serve the end user by giving them what they want out of every feed, however poorly it may be constructed. Another approach is to reject invalid feeds in the hope that their authors will fix them once they find out that people can't access them. Is one of these approaches better than the other? What other approaches might we consider?
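
To make the "reject" approach concrete, here is a minimal sketch in Python, using only the standard library's `xml.etree.ElementTree` parser. The function name `load_feed_strict` and the two sample feeds are made up for illustration. Note that the sketch only checks that a feed is well-formed XML, which is the first hurdle and not the same as full validity against the RSS or Atom specification.

```python
import xml.etree.ElementTree as ET


def load_feed_strict(text):
    """Parse a feed, refusing anything that is not well-formed XML.

    Hypothetical helper for illustration: it makes no attempt to repair
    or guess at broken markup, it simply reports the parser's error.
    """
    try:
        return ET.fromstring(text)
    except ET.ParseError as err:
        raise ValueError(f"feed rejected: {err}") from err


# Two made-up sample feeds; the second never closes its <title> element.
good = "<rss version='2.0'><channel><title>Example</title></channel></rss>"
bad = "<rss version='2.0'><channel><title>Example</channel></rss>"

print(load_feed_strict(good).tag)

try:
    load_feed_strict(bad)
except ValueError as err:
    print(err)
```

A tool following the "accept" approach would instead catch the parse error and fall back to special-case repair code, which is exactly the extra work the first approach commits every tool author to.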

I read the other day that one of the goals of XML was to combat the invalid documents that had proliferated on the web by requiring strict adherence to certain rules and not compensating for violations of them. I believe this is the best approach. If we begin accepting invalid documents, more invalid documents will be created based on them, and based on the attitude that it is not important to generate valid documents. I myself have never bothered to generate valid HTML pages, because invalid pages work fine, and in some cases, invalid HTML is even necessary to work around bugs and quirks in web browsers. Until the vast majority of internet surfers are using browsers that support standards correctly and in the same way, there is little incentive to bother with generating valid HTML. More importantly, even if all browsers do work the same way and work correctly, if they continue to accept invalid HTML, there will be no incentive to correct HTML errors. Acceptance of invalid newsfeeds leads to more work for everyone and, probably even more than the same problem in HTML, reduces the value of newsfeed formats.

I believe the same problem already exists for RSS, partly because the specification is not rigorous enough, and partly because so much code has already been written to compensate for errors. The average RSS feed is probably closer to valid than the average web page, because most RSS processors almost require the feeds to at least be valid XML (I say almost, because even my own code compensates for some XML errors, though I'm considering dropping support for such errors from upcoming versions of my tools).

Atom may be in better condition, as new as it is. But will it remain so? Considering how long it has taken me to understand certain aspects of the Atom specification, I believe Atom is at risk of running into the same problems. Looking back at my list of causes of invalid feeds, some feeds are built wrong because the documentation is incomplete. This is the biggest problem I see with the Atom specification right now. More of what has been decided needs to find its way into the specification soon to keep developers from going off in the wrong direction. The specification also needs pointers to the other documentation that must be understood in order to comprehend Atom.

Well, enough talk. Here are my recommendations:

For RSS: The problem is probably here to stay. Deal with it. Just make sure your tools generate valid feeds.

For Atom:

  • Write a more complete specification with pointers to any necessary external documentation.
  • Make sure the specification is easy to find no matter where you jump into the Atom world from.
  • Make sure that the specification is easy enough to understand that mere mortals will bother reading and understanding it. If you have to write two versions--one in concise technical language, and one in more natural language--do it. Just be sure the natural one is also rigorously complete.
  • Don't start down the slippery slope of accepting invalid feeds! Constantly promote the philosophy of requiring valid feeds to tool developers. Maybe even have notes in the list of tools supporting Atom to point out tools that do it wrong--shame the developers into fixing their tools if necessary. If that sounds too cruel, then just remove buggy tools from the list and tell the developers they'll be added back in when they fix their bugs. Don't promote buggy tools.