DOCX is not pure XML

Hi,

So, one more post bashing OOXML from a TDF member? I see it more as a reflection: I had to create a non-binary format (do not even mention binary formats for saving important data) for a university homework assignment, and I realized that I had made the very same mistake. Mine is just easier to explain...

Anyway, I will start with the bashing ;) The example I took is from here.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<xml xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel">
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1"/>
</o:shapelayout>
<v:shapetype id="_x0000_t202" coordsize="21600,21600" o:spt="202" path="m,l,21600r21600,l21600,xe">
<v:stroke joinstyle="miter"/>
<v:path gradientshapeok="t" o:connecttype="rect"/>
</v:shapetype>
<v:shape id="_x0000_s1025" type="#_x0000_t202" style="position:absolute;
margin-left:203.25pt;margin-top:37.5pt;width:96pt;height:55.5pt;z-index:1;
visibility:hidden" fillcolor="#ffffe1" o:insetmode="auto">
<v:fill color2="#ffffe1"/>
<v:shadow on="t" color="black" obscured="t"/>
<v:path o:connecttype="none"/>
<v:textbox style="mso-direction-alt:auto">
<div style="text-align:left"/>
</v:textbox>
<x:ClientData ObjectType="Note">
<x:MoveWithCells/>
<x:SizeWithCells/>
<x:Anchor>
4, 15, 2, 10, 6, 15, 6, 4</x:Anchor>
<x:AutoFill>False</x:AutoFill>
<x:Row>3</x:Row>
<x:Column>3</x:Column>
</x:ClientData>
</v:shape>
</xml>

So, did you find the mistake? What is not XML? It’s this line:

<x:Anchor>4, 15, 2, 10, 6, 15, 6, 4</x:Anchor>

So please keep that code fragment in mind. I will continue with the story of my program, which is a sudoku program (see Wikipedia for Sudoku). What I want to store (you can skip this part, it is not strictly necessary): a sudoku consists of x boxes, where x is also the total number of distinct numbers. Each box of a sudoku has a height h and a width w, with x = w*h. Both w and h have to be natural numbers of at least 2. This also means you have to store the width and the height, because x alone does not determine them (a box with 12 numbers can have a height of 2, 3, 4 or 6). The following shows the final XML file for a sudoku:
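As a small sketch of why the box size has to be stored next to the grid, the following Python snippet lists every width/height pair that fits a given x. The function name `box_dimensions` is made up for this illustration and is not part of the homework program.

```python
def box_dimensions(x):
    """Return every (w, h) pair with w * h == x and both w and h at least 2."""
    # w only needs to run up to x // 2, because h = x // w must be at least 2.
    return [(w, x // w) for w in range(2, x // 2 + 1) if x % w == 0]


print(box_dimensions(12))  # [(2, 6), (3, 4), (4, 3), (6, 2)]
print(box_dimensions(9))   # [(3, 3)]
```

For x = 9 there is only one choice, but for x = 12 a reader of the file could not know the layout without the stored width and height.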

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<sudoku>
<innerWidth>3</innerWidth><innerHeight>3</innerHeight>
<row>6 0 0 0 1 0 5 0 0 </row>
<row>8 0 3 0 0 0 0 0 0 </row>
<row>0 0 0 0 6 0 0 2 0 </row>
<row>0 3 0 1 0 8 0 9 0 </row>
<row>1 0 0 0 9 0 0 0 4 </row>
<row>0 5 0 2 0 3 0 1 0 </row>
<row>0 7 0 0 3 0 0 0 0 </row>
<row>0 0 0 0 0 0 3 0 6 </row>
<row>0 0 4 0 5 0 0 0 9 </row>
</sudoku>

So, do you see the parallel between this file (or one of its row elements) and the docx file? Although it should not be too difficult to get a sudoku out of this, it is not real XML. Why? Because the content of a row is in fact application encoded. So

<row>1 0 0 0 9 0 0 0 4 </row>

should in a perfect world be

<row>
<cell>1</cell>
<cell/>
<cell/>
<cell/>
<cell>9</cell>
<cell/>
<cell/>
<cell/>
<cell>4</cell>
</row>
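To make the difference concrete, here is a minimal Python sketch that reads the nine-value row in both forms with the standard library's `xml.etree.ElementTree`. The helper names `read_flat` and `read_nested` are invented for this example, and treating an empty `<cell/>` as 0 is my assumption about how the two forms map onto each other.

```python
import xml.etree.ElementTree as ET

FLAT = "<row>1 0 0 0 9 0 0 0 4 </row>"
NESTED = (
    "<row><cell>1</cell><cell/><cell/><cell/><cell>9</cell>"
    "<cell/><cell/><cell/><cell>4</cell></row>"
)


def read_flat(xml_text):
    # The XML parser hands back one opaque string; splitting it into
    # numbers is application logic that the XML itself knows nothing about.
    row = ET.fromstring(xml_text)
    return [int(token) for token in row.text.split()]


def read_nested(xml_text):
    # Here the parser already delivers one element per cell.
    # Assumption: an empty <cell/> stands for an empty field, stored as 0.
    row = ET.fromstring(xml_text)
    return [int(cell.text) if cell.text else 0 for cell in row.findall("cell")]


print(read_flat(FLAT))      # [1, 0, 0, 0, 9, 0, 0, 0, 4]
print(read_nested(NESTED))  # [1, 0, 0, 0, 9, 0, 0, 0, 4]
```

Both functions return the same list, but only the first one depends on a private convention (whitespace-separated integers) that no XML tool can check.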

I am quite sure you realized that in the sudoku context this is not a big problem, but what about

 <x:Anchor>4, 15, 2, 10, 6, 15, 6, 4</x:Anchor> 

And this could be 4 points x1,y1,x2,y2,x3,y3,x4,y4, but then it would be a rect and not an anchor. (I really do not know what it is; I just want to give you an idea of how tricky it could be ( means a “Test” property…):   ,  ,  ,  .
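A short Python sketch shows what a generic XML parser actually gives you for this element. The fragment below is cut down from the example at the top of the post, and `read_anchor` is a name I made up. The code only recovers eight integers; what they mean is still not stated anywhere in the file.

```python
import xml.etree.ElementTree as ET

NS = {"x": "urn:schemas-microsoft-com:office:excel"}
FRAGMENT = (
    '<x:ClientData xmlns:x="urn:schemas-microsoft-com:office:excel" '
    'ObjectType="Note">'
    "<x:Anchor>\n4, 15, 2, 10, 6, 15, 6, 4</x:Anchor>"
    "</x:ClientData>"
)


def read_anchor(xml_text):
    root = ET.fromstring(xml_text)
    raw = root.find("x:Anchor", NS).text
    # The parser stops at the text node; the comma-separated list is a
    # second, application-specific format hidden inside the element.
    values = [int(part) for part in raw.split(",")]
    return raw, values


raw, values = read_anchor(FRAGMENT)
print(repr(raw))  # '\n4, 15, 2, 10, 6, 15, 6, 4'
print(values)     # [4, 15, 2, 10, 6, 15, 6, 4]
```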

Just before the end please have a look at this line

<v:shapetype id="_x0000_t202" coordsize="21600,21600" o:spt="202" path="m,l,21600r21600,l21600,xe">

coordsize="21600,21600": OK, let us just hope these are pixels or a distance from some corner of the page...

But you cannot tell me that this is not application encoded. Do you have a clue what it should be? I guess it is a curve (you see the coordsize values here again):

path="m,l,21600r21600,l21600,xe">
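To show how much hidden structure sits inside that one attribute value, here is a rough Python sketch that splits the string into letters and the raw text that follows each of them. Treating every single lowercase letter as its own command is an assumption made only for this example (the real VML grammar is richer), and the sketch deliberately does not try to interpret the commands.

```python
import re

PATH = "m,l,21600r21600,l21600,xe"


def tokenize(path):
    """Split a path string into (letter, raw text up to the next letter) pairs."""
    # Assumption for this sketch: one lowercase letter starts each command.
    return re.findall(r"([a-z])([^a-z]*)", path)


for command, args in tokenize(PATH):
    print(repr(command), repr(args))
# 'm' ','
# 'l' ',21600'
# 'r' '21600,'
# 'l' '21600,'
# 'x' ''
# 'e' ''
```

Even after this step you still need outside knowledge to say what an empty slot next to a comma stands for, or what each letter does. None of that is visible to an XML parser, which only sees one attribute string.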

I have written two articles about Microsoft Office; you may find the first here. As always, your thoughts are very much appreciated :D

2 thoughts on “DOCX is not pure XML”

  1. Something similar also applies to ODF, because it uses a subset of SVG. For example, look at what kind of content the svg:d attribute expects. :) But these really are the cases where the size of the XML would blow up if it were done in a “pure” way.

    1. Hi,
      Personally, I see the problem as follows: you should be able to strip away all newlines and spaces in an XML file (and add them) anywhere (except between “”). I guess that the spaces in the x:Anchor property are needed, and they are not surrounded by “”. The example with the coordsize still applies, as there cannot be much context :)
      BTW: VML is deprecated, SVG is not. Why WRITE deprecated stuff...
      Still, it might be that I am blind (some would call it ignorant). Enlighten me if so /)
