08
Sep
11

SGML: Stupid Generalized Markup Language

I just unzipped an MS Word document (.DOC) and an OOO text document (.ODT) to look at their respective XML because I was kinda curious. Microsoft has somehow managed to encode their documents so the content doesn’t show up inside what you can extract, but even so at first I thought I could almost guarantee it’s not as bloated as the .ODT, whose content file begins by defining the style of every single section of the document, and then goes on to assign every single word a separate style tag.

Sample from the style heading:

<style:style style:name="T1_9" style:family="text" style:parent-style-name="Default_20_Paragraph_20_Font"></style:style>

Corresponding sample from the actual text of the document:

<text:span text:style-name="T1_9" >the</text:span>

For reference, Tx goes up to 195 and _x generally goes up to ~250. Notice that the second excerpt is 50 characters. That’s 47 characters in addition to the word it represents (“the”). I had to find it manually because the document is too large for my word processor’s search function to handle.
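If you'd rather not count by hand, here is a minimal Python sketch of that arithmetic. The only input is the excerpt quoted above; the variable names are mine and purely illustrative.

```python
# The excerpt quoted above, exactly as it appears in the content file.
span = '<text:span text:style-name="T1_9" >the</text:span>'
word = "the"

total = len(span)             # characters in the whole element
overhead = total - len(word)  # characters that are pure markup

print(total, overhead)
```

Multiply that overhead by every word in a chapter and the size of the content file stops being a mystery.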

The original document is 45,512 characters/12 pages long. The XML content document is 3,178,441 characters/947 pages long. The document is almost entirely without formatting, using mostly defaults (default font; default left-justified, tab-character indented paragraphs; no font styling) with the exception of 53 instances of italics containing 1,639 characters. There is no way a separate style heading for every single word is necessary. Even if every word were to get its own <text:span> tag to define its style separately from the entire rest of the document, there should be only two style tags necessary in the heading, followed by a string of words with <text:span text:style-name="T1_1" ></text:span> or <text:span text:style-name="T1_2" ></text:span> around them. (There are also paragraph tags in the heading. Only one of those should be required since every paragraph has the same formatting.)

If you want a more direct reference as to how big this document ought to be, I rewrote it in HTML 4.01 Strict, with my entire usual story CSS sheet embedded. (I usually keep it in an external document and reference it; this sheet contains a lot of information that is not necessary for a single story chapter, including styling for titles, table of contents, lists, images, links, copyright information, etc.) I also updated a lot of the styling elements: there’s a lot of French in this chapter, which I did not italicize at the time, so now every phrase not in English is surrounded by <i lang="fr"></i>; I added smart quotes (&ldquo;, &rdquo;, &lsquo;, &rsquo; instead of ", ", ', '); and I ensured the inter-browser persistence of all accented characters by using ampersand codes for them as well (e.g. &eacute; for é). I also maintain an indentation hierarchy like any decent coder, and I skip lines between paragraphs for ease of readability of the markup source, which adds to my character count because my word processor’s word count function is whitespace sensitive.

All told, this ought to be quite a bit longer than a reasonably well-written SGML content file. (Keep in mind that program-generated XML has no reason to be indented or ever include a carriage return, and that a lot of the XML formatting information is actually contained in external files that just reference templates stored by the word processor itself.) Why OOO needs to be told that every single one of 8,395 words is Default_20_Paragraph_20_Font, rather than approximately 240 blocks of words (accounting for paragraphs and having to start over after italics), I have no idea.

The HTML file is 57,881 characters/22 pages long. Just for comparison, you understand.
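To put those character counts side by side, here is a quick Python sketch. The three numbers are the ones quoted in this post; the variable names are just labels I made up for the calculation.

```python
original_chars = 45_512         # the chapter as typed
odt_content_chars = 3_178_441   # the .ODT's XML content document
html_chars = 57_881             # hand-written HTML with embedded CSS

# How many characters of file each format spends per character of text.
odt_ratio = odt_content_chars / original_chars
html_ratio = html_chars / original_chars

print(f"ODT content: {odt_ratio:.1f}x the text")
print(f"HTML:        {html_ratio:.2f}x the text")
```

That works out to roughly 70 times the text for the ODT content document, against about 1.27 times for the HTML, stylesheet and all.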

As it turns out, however, I was wrong about the relative efficiency of the two XML formats: the MS Word format is even more bloated. Originally I’d been testing different documents, a Word document that was only two pages long and the aforementioned Open Office chapter, so I knew comparing filesize wouldn’t get me anywhere with those. After my HTML experiment, however, I decided to save that chapter as .DOC just to compare file sizes.

When I saved that same story chapter as an MS Word document, the filesize jumped from 62.8kb to 189kb. How much more complicated must the .DOC format be than ODF that it takes up about three times as much space? Now I really want to see the content file from an MS Word document.

Although it’s not quite the same thing, I saved it as a Microsoft XML file, and it somehow jumped up to 881kb. Is this miles more complicated than the older .DOC format (OOO calls the two formats “Microsoft Word 97/2000/XP (.doc)” and “Microsoft Word 2003 (.xml)” respectively) or is that just the result of saving in a non-zipped format? I suspect the latter, since it’s only 901,575 characters/219 pages. It’s still massively more than the HTML, but significantly fewer characters than the smaller .DOC file. There’s a little more information in the style section, and the style tags are written in a slightly more complicated manner that only appears to change the way the tags in the body reference them rather than actually adding any functionality, although I suppose an expert could correct me in how the way Microsoft’s XML is written makes it work better. I just don’t think it’s anywhere near worth the tradeoff, especially considering the average PC’s memory and processor limitations at the time the format was originally written.

Since I’ve already started down this road, the next step is to save as “Microsoft Word 2007 (.docx).” I would also save it as “DocBook (.xml),” but I seem to be getting errors when I try that. The .DOCX file is only 36.7kb (actually smaller than the .HTML file). When unzipped, it has a few more files in its structure, and I can see how some of these would make it simpler to define complicated formatting. (There is a separate style document and font table, for instance, and the style document is much shorter than previous XML style headings.) The character count of the content file is a mere 864,421 on 334 pages (869,413 characters/335 pages including the style document), the smallest count yet for an XML file. The reason it takes up more pages than .XML in spite of its smaller character count is that it actually has some carriage returns and tab characters, which take up a lot of space on the page but still only one byte each in the file. The tags are part of an arcane proprietary system I can’t understand in skimming, but that’s not a problem since the word processor generates them anyway.

The unzipped folder is 853kb, so I guess that answers my earlier question: the biggest advantage of XML-based document formats seems to be that they can be zipped to reduce the file size on the disk. I personally find that unacceptable, however, as memory is a far more limiting factor than hard drive space, and I really don’t want to have to load up a massive document and a massive word processor to work with it at the same time when I could just load a relatively tiny document in a minimalist text editor.

Also, just like the other versions, every word is surrounded by its own personal tags, even if they all have the exact same tags… except, for some reason, the (just as unformatted as the rest of the document) title. Don’t ask me. Here, have another code sample:

<w:r><w:rPr></w:rPr><w:t xml:space="preserve"></w:t></w:r>
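To see why zipping hides so much of this, here is a rough Python sketch. It is not a real document: I'm assuming a stand-in content file made of the same per-word element repeated once for each of the chapter's 8,395 words, and compressing it with the standard `zlib` module (the same DEFLATE method zip archives normally use).

```python
import zlib

# Hypothetical stand-in for a per-word-tagged content file:
# one identical element per word, 8,395 words in the chapter.
run = '<text:span text:style-name="T1_9" >the</text:span>'
markup = (run * 8395).encode("utf-8")

packed = zlib.compress(markup)

# Size before and after compression, in bytes.
print(len(markup), len(packed))
```

A real content file has varying words and style names, so it will not shrink nearly this much, but the principle is the same: the repetition compresses away on disk and comes right back the moment the file is unpacked into memory.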

This is part of why I always write in markup (HTML or LaTeX depending on what I’m writing) nowadays. Word processors offer no more flexibility in formatting once you know what you’re doing (admittedly there is a slightly higher learning curve, but for some of the more complicated functions of word processors, only slightly), and the file sizes are unnecessarily immense.

And before anyone tells me that this is part of how MS Word and OOO provide so many formatting options, let me remind you that XML is, like HTML, based on SGML, and I can do just as much with massively fewer tags in HTML. Give me literally any Word document, and I will be able to recreate it in an HTML file literally a tenth the size or smaller. (No, this is not a challenge; if I start actually doing this, people will start sending me book-sized documents, and considering it took two hours to edit that one chapter into HTML, even I don’t have that kind of time to waste. Well, unless you’re paying me, in which case I charge five cents a word to translate documents into markup formats. If that seems like a lot, well, then don’t send me anything with minimal formatting that you could convert just by pasting it into Dreamweaver.)

I understand that a word processor is inherently less flexible than a person going into the markup and tweaking it manually, but there’s got to be a better way to go about it than making separate style tags for every span and a separate span for every word. In fact, HTML front-end programs like Dreamweaver (and even Frontpage back in the day) prove fairly conclusively that it’s possible to have essentially a word processor with massively customizable formatting that works with blocks of text when it’s convenient to do so.
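For what it's worth, here is a minimal Python sketch of the kind of block-based approach I have in mind. It is not how any actual word processor works; the (style, text) pairs, the function names, and the class names are all hypothetical. The idea is simply to merge neighbouring runs that share a style before writing any tags.

```python
from itertools import groupby
from xml.sax.saxutils import escape


def coalesce(runs):
    """Merge adjacent (style, text) runs that share the same style."""
    merged = []
    for style, group in groupby(runs, key=lambda run: run[0]):
        merged.append((style, "".join(text for _, text in group)))
    return merged


def to_spans(runs):
    """Emit one span per merged run (class names are hypothetical)."""
    return "".join(
        f'<span class="{style}">{escape(text)}</span>'
        for style, text in coalesce(runs)
    )


# Five per-word runs, as a word processor might store them internally.
runs = [
    ("plain", "It "),
    ("plain", "was "),
    ("italic", "très "),
    ("italic", "bien"),
    ("plain", "."),
]
print(to_spans(runs))
```

Five per-word runs collapse into three spans here, and a chapter with one formatting style would collapse into one span per paragraph. I could use independent verification from someone who builds these programs, but it does not look like a hard problem.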

And if you really care that much about disk space, it’s just as easy to make a zipped HTML file as an XML file (and fairly useful to do so, since it can carry a CSS document and any and all images and embeds along with it), and formats for it already exist: .MHT (MIME HTML) and Firefox’s proprietary .MAFF (Mozilla Archive Format File).
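As a rough illustration, here is a Python sketch that bundles an HTML file and its stylesheet into one zip archive using the standard `zipfile` module. The file names and contents are made up, and this is a plain zip, not an attempt to reproduce the actual .MHT or .MAFF layout.

```python
import io
import zipfile

# Hypothetical file names and contents, for illustration only.
files = {
    "index.html": (
        '<!DOCTYPE html><html><head>'
        '<link rel="stylesheet" href="story.css">'
        '</head><body><p>Chapter text goes here.</p></body></html>'
    ),
    "story.css": "p { text-indent: 2em; }",
}

# Write the archive into memory; pass a file path instead to save it to disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for name, text in files.items():
        archive.writestr(name, text)

# Reopen the archive to confirm both files travel together.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
print(names)
```

Images and embeds could be added the same way, one `writestr` (or `write`) call per file.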

WordPress doesn’t like <hr> for some reason.

Creative Commons License

“SGML: Stupid Generalized Markup Language” by Wholly Crap Productions is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License; permissions beyond the scope of this license may be available at houiostesmoiras@gmail.com. If you believe that work to which you own the copyright is being used in a way that infringes on your copyright, please send an email with credentials stating which pages you find offensive, and they will be taken down until an agreement can be reached, or permanently if no agreement can be reached.


4 Responses to “SGML: Stupid Generalized Markup Language”


  1. October 13, 2011 at 6:42 am

    There is the case for backwards compatibility bloating the size, as well as one-off specialized exceptional cases. I think there was a mention (on the blog of a former Microsoft programming head) of how some Microsoft teams attempt to remake a format from the ground up, only to find that all the backwards compatibility and exceptions bloat it back to the same size.

    • October 13, 2011 at 12:41 pm

      Most new formats can’t be used in older programs anyway. (Don’t believe me? Try to open a Word 2003 document in Word 97.) I’m not talking about making the program smaller (although that would be a not-unreasonable goal, say, ten years down the line), but making the document smaller. They apparently already don’t care about backwards compatibility on that score. Neither OpenOffice.Org nor Microsoft really has any excuse (well, beyond getting ISO to approve their format) for not making a simpler XML format for their next edition. (Speaking of which, I actually did read about Microsoft’s attempts to build a new format from the ground up; I don’t know what they wound up doing, but it actually wound up more bloated than any previous format, on which grounds ISO rejected it for approval as a standard. It might be that the reason it was bloated is that they were trying to make it backwards compatible, and if so kudos to them, but the fact remains that they haven’t done that with any other format.)

      Furthermore, the exceptions shouldn’t be much of a problem either. Is the text formatted the same way as the previous text? Then include it in the same label as the previous text. Is it not? Then start a new label. Even if you don’t return to a previously used label with the same formatting, that algorithm (which doesn’t sound hard to program, although I’ll admit that I could use independent verification from someone with experience in the field) would save massive amounts of space on most documents. If there’s a one-off case where you need lots of labels, well, then lots of labels should be added. I’m not saying that I expect a document in which every single word is formatted differently to be small. I’m just saying that a document with only one formatting style, ever, should be able to have only one formatting class. Heck, one class per paragraph would be a big reduction in overhead.

  2. October 13, 2011 at 5:56 pm

    No, no, I mean the backwards compatibility that allows older document files, with their exceptions and their own backwards compatibility, to be used on newer versions of the program.

    • October 14, 2011 at 2:00 pm

      Like I said, I have no problem with the program being bloated. (Well, okay, I do, but I realize it’s unreasonable to expect that to be fixed; even just making sure the program will continue to be compatible with the newest versions of Windows is going to bloat it.) It’s just the document format itself that I think needs to be fixed.

