When you retrieve a representation of a resource such as a Web page, you or your software will probably try to determine some things about it. What sort of file is it: HTML, a JPEG image, XML, or maybe plain text? How can you tell?
It's very tempting to try a variety of guesses. If a URI ends in .xml, or if the first characters in the file are <?xml version="1.0"?>, then maybe it's an XML file. Too much Web software makes the mistake of using such heuristics, and too many Web applications rely on them. In fact, the Web Architecture makes clear where metadata like this is supposed to come from. In the case of this example, it's the Content-Type header of the HTTP response, which will have a media type like application/xml or text/plain. The examples below show some of the reasons that doing it "right" matters.
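As a rough illustration of the difference, the sketch below contrasts the tempting heuristic with reading the declared media type. The function names, the example.org URI, and the header values are made up for this illustration; they are not part of any particular browser or library.

```python
def declared_media_type(content_type_header):
    """Return the media type named in a Content-Type header value, or None."""
    if not content_type_header:
        return None
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type or None


def guess_is_xml(uri, body):
    """The tempting heuristic: file extension or a leading XML declaration."""
    return uri.endswith(".xml") or body.lstrip().startswith("<?xml")


# Hypothetical response: the URI and body both "look like" XML,
# but the server has declared the representation to be plain text.
uri = "http://example.org/bugs/broken.xml"
headers = {"Content-Type": "text/plain; charset=utf-8"}
body = '<?xml version="1.0"?> <animals> <dog>Rufus</fish> </animals>'

print(guess_is_xml(uri, body))                       # True
print(declared_media_type(headers["Content-Type"]))  # text/plain
```

The heuristic and the header disagree here, and it is the header that says how the server intends the representation to be interpreted.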
Imagine that you are using a browser to look at entries in a bug reporting system that's being used to examine problems in various XML files. There are two files: both look like XML and both have .xml in their file names, but one of them is not well-formed: if a browser tries to render it as XML you'll get an error. Here is the text of the two files; underneath are links to versions that are served as application/xml and text/plain.
Well-Formed XML:

    <?xml version="1.0"?>
    <animals>
      <dog>Rufus</dog>
      <cat>Kitty</cat>
    </animals>

Broken! This is not well-formed:

    <?xml version="1.0"?>
    <animals>
      <dog>Rufus</fish>
      <cat>Kitty</elephant>
    </animals>
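To see what "not well-formed" means to an XML parser, here is a minimal sketch that feeds both texts to Python's standard xml.etree.ElementTree module. The helper name is_well_formed is an assumption made for this example; any conforming XML parser should reject the second file in a similar way.

```python
import xml.etree.ElementTree as ET

WELL_FORMED = (
    '<?xml version="1.0"?>'
    "<animals><dog>Rufus</dog><cat>Kitty</cat></animals>"
)
BROKEN = (
    '<?xml version="1.0"?>'
    "<animals><dog>Rufus</fish><cat>Kitty</elephant></animals>"
)


def is_well_formed(text):
    """Return True if the text parses as XML, False on a parse error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # Raised for problems such as mismatched start and end tags.
        return False
    return True


print(is_well_formed(WELL_FORMED))  # True
print(is_well_formed(BROKEN))       # False
```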
This demonstration works best in the Firefox browser. Be sure to use features like View Page Info to see the MIME type the browser determined. Note that Internet Explorer 6 does not do the right thing with the broken file that's served as text. Instead of showing the text, it incorrectly guesses that this is XML, tries to parse it, and reports that the file is not well-formed!
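The behavior described above can be sketched as a small dispatcher that chooses how to handle a body purely from its declared media type. This is only an illustration under simplifying assumptions: the handle function and its return values are invented here, and real browsers handle many more types.

```python
import xml.etree.ElementTree as ET

BROKEN = (
    '<?xml version="1.0"?>'
    "<animals><dog>Rufus</fish><cat>Kitty</elephant></animals>"
)


def handle(media_type, body):
    """Decide what to do with a body based only on its declared media type."""
    if media_type == "text/plain":
        # Declared as text: show it verbatim, even if it resembles XML.
        return ("text", body)
    if media_type in ("application/xml", "text/xml"):
        try:
            root = ET.fromstring(body)
        except ET.ParseError as err:
            # Declared as XML but not well-formed: report the error.
            return ("error", str(err))
        return ("xml", root.tag)
    return ("unknown", media_type)


print(handle("text/plain", BROKEN)[0])       # text
print(handle("application/xml", BROKEN)[0])  # error
```

Served as text/plain, the broken file is simply displayed. The parse error appears only when the server says the file is XML, which is the behavior the demonstration expects.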
By the way, if you want to learn more about this issue, the TAG finding Authoritative Metadata gives a really good explanation.