Spreadsheets in XML

In my work with DITA documentation, I use XMLMind’s XML Editor (XXE) – and very good it is too. The professional version comes with an Integrated Spreadsheet Engine, which I have just recently taken a look at. I think it’s rather interesting, particularly how the approach differs from Excel or similar traditional spreadhseets.

We are not talking here about Excel (2007+) using an XML-based file format. In this, the information is structured using elements that relate to spreadsheets – such as <worksheet> and <row> – not to the content domain (invoices, timesheets, product specifications, whatever). That means that the information is not in practice accessible to users (as opposed to tool developers).

What XXE is addressing is the insertion of computed element content into a ‘user level’ XML document. Such a document could be in XHTML or DITA (XHTML is obviously a ‘final’ format; DITA is a ‘source’ format for conversion into various final formats, such as XHTML, PDF, CHM).

The idea is that the XML contains formulas as Processing Instructions. PIs are interpreted by the XXE application, but are ignored by XML processing tools and web browsers. A formula PI generates content (a result value) which is inserted between the PI and the end of the enclosing element. In practice, a formula will provide content for a lowest-level element such as a table entry, or a text phrase. Here is a simple-table entry:

<stentry>
<?xxe-formula formula=’=($[+0,3] * $[+0,4])’?>
150
</stentry>

This is in column 5 of a table, and is multiplying the values in columns 3 and 4, to produce the value 150. XXE provides a sugared syntax for references within a table, which is essentially the same as an Excel formula. So the above formula would actually be written as:

=($C2 * $D2)

with the relative addressing of the rows allowing the formula to be copy-pasted in subsequent rows.

In XML, any element can have an id attribute. For a table element, this allows us to reference a value in a table from outside (from ordinary text elements). For a leaf element, this allows us to reference a value by name. Suppose that invoice_table contains details of an invoice, and cell D7 contains the tax. We could have a formula:

=invoice_table!$D$7

or, better:

=$(tax)

There are, as you would expect, a reasonable number of built-in functions, in the usual categories.

An interesting twist is that a formula can set not just element content, but alternatively an element attribute. The most obvious use of this is to generate id attributes. For example, suppose that I have a table with Product Codes in column A, and Unit Prices in column B. Then, in the B2 cell, I can have this formula setting the id attribute:

=($A2 & “_unitprice”)

(and similarly for the rest of column B). Now, if I want to look up a unit price, I simply construct the id from the product code and the suffix, and access the element directly by id – no need for a VLOOKUP function.

So far, the main difference from Excel is that we are not restricted to using  tabular structures (worksheets, in Excel). We can have values and formulas anywhere in a document structure. For erxample, I could drop a calculated value into an ordinary free-text paragraph, and pick up that value elsewhere in the document.

However, the really different aspect is that the formulas are based on the XPath 1.0 language – the Excel-like syntax is just cosmetic. XPath is a pattern-based query language that treats XML documents as trees of nodes-with-attributes. Here’s an example-based tutorial. This means that a formula can operate on a set of nodes (returned from an XPath expression), without knowing how many there are or where they are in a document.

We can tag values (wherever they are) using an attribute. In DITA, we could use @outputclass (not @id, as this must be unique). For example, I could tag various elements (possibly table cells, possibly not) with @outputclass = ‘cost’, and then sum these using the formula:

=sum(`//*[@outputclass=’cost’]`)

The backquotes encapsulate an XPath expression (as opposed to the XXE formula language). ‘//*’ means “any element anywhere in the document” (which we then filter by the outputclass attribute). Furthermore, XPath can access not only the current document, but also other documents (as individual documents, not as a document collection – a notion supported by XPath 2.0/XQuery).

I’ll discuss an example of all this in the next post.

About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s




November 2010
M T W T F S S
« Oct   Dec »
1234567
891011121314
15161718192021
22232425262728
2930  

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: