I just unzipped an MS Word document (.DOC) and an OOO text document (.ODT) to look at their respective XML because I was kinda curious. Microsoft has somehow managed to encode their documents so the content doesn’t show up inside what you can extract, but even so at first I thought I could almost guarantee it’s not as bloated as the .ODT, whose content file begins by defining the style of every single section of the document, and then goes on to assign every single word a separate style tag.
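If you want to poke at a zip-based document the same way without unpacking it by hand, something like the following Python sketch should do. It only uses the standard `zipfile` module; the helper names (`member_sizes`, `character_count`, `build_demo_archive`) are mine, and the demo archive is a made-up stand-in, not a real .ODT.

```python
import io
import zipfile


def member_sizes(archive):
    """Return {member name: (compressed bytes, uncompressed bytes)}.

    `archive` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return {info.filename: (info.compress_size, info.file_size)
                for info in zf.infolist()}


def character_count(archive, member):
    """Decode one member as UTF-8 and count its characters."""
    with zipfile.ZipFile(archive) as zf:
        return len(zf.read(member).decode("utf-8"))


def build_demo_archive():
    """Build a tiny stand-in archive in memory (hypothetical content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        span = '<text:span text:style-name="T1_9" >the</text:span>'
        zf.writestr("content.xml", span * 1000)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    demo = build_demo_archive()
    for name, (packed, unpacked) in member_sizes(demo).items():
        print(f"{name}: {packed} bytes zipped, {unpacked} bytes unzipped")
```

For a real file you would pass its path instead of the in-memory demo, for example `member_sizes("chapter.odt")` (a hypothetical filename).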
Sample from the style heading:
<style:style style:name="T1_9" style:family="text" style:parent-style-
Corresponding sample from the actual text of the document:
<text:span text:style-name="T1_9" >the</text:span>
Tx goes up to 195 and _x generally goes up to ~250. Notice that the second excerpt is 50 characters. That’s 47 characters in addition to the word it represents (“the”). I had to find it manually because the document is too large for my word processor’s search function to handle.
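If you’d rather not count characters by eye, here is a tiny Python check of that arithmetic. The excerpt string is copied from the sample above; the variable names are just mine.

```python
# The span exactly as it appears in the content file, and the word it wraps.
excerpt = '<text:span text:style-name="T1_9" >the</text:span>'
word = "the"

# Everything that is not the word itself is markup overhead.
overhead = len(excerpt) - len(word)

print(len(excerpt), overhead)  # prints: 50 47
print(f"{overhead / len(excerpt):.0%} of the excerpt is markup")
```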
The original document is 45,512 characters/12 pages long. The XML content document is 3,178,441 characters/947 pages long. The document is almost entirely without formatting, using mostly defaults (default font; default left-justified, tab-character indented paragraphs; no font styling) with the exception of 53 instances of italics containing 1,639 characters. There is no way a separate style heading for every single word is necessary. Even if every word were to get its own <text:span> tag to define its style separately from the entire rest of the document, there should be only two style tags necessary in the heading, followed by a string of words with <text:span text:style-name="T1_1" ></text:span> or <text:span text:style-name="T1_2" ></text:span> around them. (There are also paragraph tags in the heading. Only one of those should be required since every paragraph has the same formatting.)
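To show what I mean by working with blocks instead of single words, here is a rough Python sketch that collapses adjacent spans sharing a style into one span. It is my own illustration, not anything the word processor does: `merge_adjacent_spans` and the `SPAN` pattern are hypothetical names, and it assumes the spans are not nested, sit on a single line, and are separated only by whitespace. It also ignores ODF’s own whitespace rules.

```python
import re

# Matches one flat span; assumes no nested spans inside it.
SPAN = re.compile(r'<text:span text:style-name="([^"]+)" ?>(.*?)</text:span>')


def merge_adjacent_spans(markup):
    """Collapse runs of adjacent same-style spans into a single span."""
    runs = []  # each entry: [style name, or None for literal text; text]
    pos = 0
    for match in SPAN.finditer(markup):
        gap = markup[pos:match.start()]
        style, text = match.group(1), match.group(2)
        same_style = bool(runs) and runs[-1][0] == style
        if same_style and gap.strip() == "":
            # Same style as the previous span: absorb the whitespace and text.
            runs[-1][1] += gap + text
        else:
            if gap:
                runs.append([None, gap])
            runs.append([style, text])
        pos = match.end()
    if markup[pos:]:
        runs.append([None, markup[pos:]])
    return "".join(
        text if style is None
        else f'<text:span text:style-name="{style}">{text}</text:span>'
        for style, text in runs
    )


sample = (
    '<text:span text:style-name="T1_1" >It</text:span> '
    '<text:span text:style-name="T1_1" >was</text:span> '
    '<text:span text:style-name="T1_2" >not</text:span> '
    '<text:span text:style-name="T1_1" >the</text:span> '
    '<text:span text:style-name="T1_1" >end</text:span>'
)
merged = merge_adjacent_spans(sample)
print(merged)
print(len(sample), "->", len(merged), "characters")
```

With two styles in play, five spans become three here; in a document that is almost all one style, the savings would be far larger.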
If you want a more direct reference as to how big this document ought to be, I rewrote it in HTML 4.01 Strict, with my entire usual story CSS sheet embedded (I usually keep it in an external document and reference it; this sheet contains a lot of information that is not necessary for a single story chapter, including styling for titles, table of contents, lists, images, links, copyright information, etc.). I also updated a lot of the styling elements: there’s a lot of French in this chapter, which I did not italicize at the time; now every phrase not in English is surrounded by <i lang="fr"></i>; I added smart quotes (&ldquo;, &rdquo;, &lsquo;, &rsquo; instead of ", ", ', '); and I ensured the inter-browser persistence of all accented characters by using ampersand codes for them as well (eg: &eacute; for é). I also maintain an indentation hierarchy like any decent coder, and I skip lines between paragraphs for ease of readability of the markup source, which adds to my character count because my word processor’s word count function is whitespace sensitive. All told, this ought to be quite a bit longer than a reasonably well-written SGML content file (keep in mind that program-generated XML has no reason to be indented or ever include a carriage return, and that a lot of the XML formatting information is actually contained in external files that just reference templates stored by the word processor itself). Why OOO needs to be told that every single one of 8,395 words is Default_20_Paragraph_20_Font, rather than approximately 240 blocks of words (accounting for paragraphs and having to start over after italics), I have no idea.
The HTML file is 57,881 characters/22 pages long. Just for comparison, you understand.
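To put those numbers side by side, here is a quick Python sketch that divides each character count by the original text’s. The counts are the ones quoted above; the labels and variable names are just mine.

```python
# Character counts quoted in the text above.
sizes = {
    "original text": 45_512,
    "ODT content file": 3_178_441,
    "hand-written HTML": 57_881,
}

baseline = sizes["original text"]

# How many times larger each version is than the plain text.
ratios = {label: chars / baseline for label, chars in sizes.items()}

for label, chars in sizes.items():
    print(f"{label}: {chars:,} characters, {ratios[label]:.1f}x the original")
```

By that measure the .ODT content file comes out at roughly 70 times the size of the text it holds, while the HTML version, embedded stylesheet and all, is about 1.3 times.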
As it turns out, however, I was wrong about the relative efficiency of the two XML formats: the MS Word format is even more bloated. Originally I’d been testing different documents, a Word document that was only two pages long and the aforementioned Open Office chapter, so I knew comparing filesize wouldn’t get me anywhere with those. After my HTML experiment, however, I decided to save that chapter as .DOC just to compare file sizes.
When I saved that same story chapter as an MS Word document, the filesize jumped from 62.8kb to 189kb. How much more complicated must OOXML be than ODF that it takes up more than twice as much space? Now I really want to see the content file from an MS Word document.
Although it’s not quite the same thing, I saved it as a Microsoft XML file, and it somehow jumped up to 881kb. Is this miles more complicated than the older OOXML (OOO calls the two formats “Microsoft Word 97/2000/XP (.doc)” and “Microsoft Word 2003 (.xml)” respectively) or is that just the result of saving in a non-zipped format? I suspect the latter, since it’s only 901,575 characters/219 pages. It’s still massively more than the HTML, but significantly fewer characters than the smaller .DOC file. There’s a little more information in the style section, and the style tags are written in a slightly more complicated manner that only appears to change the way the tags in the body reference them rather than actually adding any functionality, although I suppose an expert could correct me in how the way OOXML is written makes it work better. I just don’t think it’s anywhere near worth the tradeoff, especially considering the average PC’s memory and processor limitations at the time the format was originally written.
Since I’ve already started down this road, the next step is to save as “Microsoft Word 2007 (.docx).” I would also save it as “DocBook (.xml),” but I seem to be getting errors when I try that. The .DOCX file is only 36.7kb (actually smaller than the .HTML file). When unzipped, it has a few more files in its structure, and I can see how some of these would make it simpler to define complicated formatting. (There is a separate style document and font table, for instance, and the style document is much shorter than previous XML style headings.) The character count of the content file is a mere 864,421 on 334 pages (869,413 characters/335 pages including the style document), the smallest count yet for an XML file. The reason it takes up more pages than .XML in spite of its smaller character count is that it actually has some carriage returns and tab characters which take up a lot of space on the page, but still only one byte in the file. The tags are part of an arcane proprietary system I can’t understand in skimming, but that’s not a problem since the word processor generates them anyway. The unzipped folder is 853kb, so I guess that answers my earlier question: The biggest advantage of XML-based document formats seems to be that they can be zipped to reduce the file size on the disk. I personally find that unacceptable, however, as memory is a far more limiting factor than hard drive space, and I really don’t want to have to load up a massive document and a massive word processor to work with it at the same time when I could just load a relatively tiny document in a minimalist text editor. Also, just like the other versions, every word is surrounded by its own personal tags, even if they all have the exact same tags… except, for some reason, the (just as unformatted as the rest of the document) title. Don’t ask me. Here, have another code sample:
This is part of why I always write in markup (HTML or LaTeX depending on what I’m writing) nowadays. Word processors offer no more flexibility in formatting once you know what you’re doing (admittedly there is a slightly higher learning curve, but for some of the more complicated functions of word processors, only slightly), and the file sizes are unnecessarily immense.
And before anyone tells me that this is part of how MS Word and OOO provide so many formatting options, let me remind you that XML is, like HTML, based on SGML, and I can do just as much with massively fewer tags in HTML. Give me literally any Word document, and I will be able to recreate it in an HTML file literally a tenth the size or smaller. (No, this is not a challenge; if I start actually doing this, people will start sending me book-sized documents, and considering it took two hours to edit that one chapter into HTML, even I don’t have that kind of time to waste. Well, unless you’re paying me, in which case I charge five cents a word to translate documents into markup formats. If that seems like a lot, well, then don’t send me anything with minimal formatting that you could convert just by pasting it into Dreamweaver.)
I understand that a word processor is inherently less flexible than a person going into the markup and tweaking it manually, but there’s got to be a better way to go about it than making separate style tags for every span and a separate span for every word. In fact, HTML front-end programs like Dreamweaver (and even FrontPage back in the day) prove fairly conclusively that it’s possible to have essentially a word processor with massively customizable formatting that works with blocks of text when it’s convenient to do so.
And if you really care that much about disk space, it’s just as easy to make a zipped HTML file as an XML file (and fairly useful to do so, since it can carry a CSS document and any and all images and embeds along with it), and formats for it already exist: .MHT (MIME HTML) and Firefox’s proprietary .MAFF (Mozilla Archive Format File).
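For what it’s worth, zipping an HTML file together with its stylesheet takes only a few lines. Here is a minimal Python sketch using the standard `zipfile` module; `bundle` is my own helper, the file names and contents are invented for the demo, and the result is a plain zip, not a valid .MHT or .MAFF file.

```python
import io
import zipfile


def bundle(files):
    """Zip a {archive path: text} mapping and return the archive as bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in files.items():
            zf.writestr(name, text.encode("utf-8"))
    return buffer.getvalue()


# Hypothetical chapter: repetitive markup, which compresses well.
files = {
    "chapter.html": "<p>It was not the end of the story.</p>\n" * 200,
    "story.css": "p { text-indent: 2em; margin: 0; }\n",
}

archive = bundle(files)
raw_size = sum(len(text.encode("utf-8")) for text in files.values())
print(f"{raw_size} bytes of markup, {len(archive)} bytes zipped")
```

Images and other embeds would go in the same way, as extra entries in the archive.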
WordPress doesn’t like <hr> for some reason.
“SGML: Stupid General Markup Language” by Wholly Crap Productions is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License; permissions beyond the scope of this license may be available at email@example.com. If you believe that work to which you own the copyright is being used in a way that infringes on your copyright, please send an email with credentials stating which pages you find offensive, and they will be taken down until an agreement can be reached, or permanently if no agreement can be reached.