Posts Tagged 'XML'

Spreadsheets in XML – Part 2

In the previous post, I was looking at the ‘spreadsheet extensions’ provided by XMLMind’s XML Editor (XXE). This allows XPath-based formulas to be inserted into XML documents, not only in tabular elements, but also in free text.

As an example, I mocked up some invoices. An invoice is a good example of a hybrid document: we want to print it out (or PDF it) as a nicely formatted document; there are calculated and looked-up elements in the manner of a spreadsheet; we want the whole set of invoices to be queryable subsequently, in the manner of a database.

Here’s an invoice, as a DITA document, shown in XXE:

The little green ‘f’ icons represent the formulas, held as XML Processing Instructions. These are ignored in subsequent transformations (to final formats, such as PDF). You double-click an icon to edit the formula.

The first one (before the table) is today’s date: =today(). The ones in the Cost column are simple arithmetic: =($C2 * $D2), etc. Column and row headers can be displayed optionally:

The Product Description and Unit Price formulas are more interesting, since they are lookups in another document, containing the product catalog. Here’s the formula in B2:

The first thing to notice is that we can have multi-line formulas, with ‘let’ definitions preceding the actual formula. (The “…” is really a full file path – I’ve elided it for compactness). The id of the element with the product description is the product code appended with “_desc”. This is then retrieved from the product catalog by matching the id attribute (@id) with the constructed value ($id). (The back-quotes indicate ‘raw’ XPath, rather than XXE formula language).

Here’s the Product catalog (not very extensive!):

The formulas here are used not to calculate visible values, but to construct values for the id attribute. For example, in B2:

Note the id attribute picked from the drop-down list. In Excel terms, this is rather like having a formula that constructs a Range name. It means that the ids for cells in column B and C always follow the product codes in column A. I think this is rather neat.

Back in the invoice, the Total Cost formula sums the values in the Cost column (E) – see the first scrrenshot. We could do this with a table/column reference, but an alternative is to tag the Cost cells with a common attribute value. In DITA, @outputclass allows a kind of informal specialization (we can’t use @id, as this must be unique within a document). Here, we can set @outputclass = ‘cost’. Now, the Total Cost formula sums all elements with this attribute value, wherever they are in the document:


That’s it, in terms of the documents. We can then generate formatted output, as we require.

The database aspect comes if the invoices are put into an XML database, such as XMLMind’s Qizx (Free Engine edition). This provides indexing and querying, using the XQuery language. We can then calculate aggregated values, for example by customer and product. Here’s a simple query to calculate the total invoiced for a given product:

xquery version "1.0";

let $prod := "PR01"    (:edit this:)

let $costs :=
 for $row in //strow
 let $cost := $row/stentry[@outputclass='cost']
 where $row/stentry[1]/text() = $prod
 return $cost
return ($prod, sum($costs))

strow is a simple-table row, stentry is a cell. One could, of course, get a lot fancier, and produce proper date-based reports.

There’s an interesting contrast here with how we would do this in Excel. If each invoice is a separate Workbook, we would need to provide some collation mechanism for the data, to get it into a single source for pivot tables, etc. – either in a single workbook, or in Access. I think that where we have a large number of computationally relatively simple documents, the XML approach is quite attractive.


Spreadsheets in XML

In my work with DITA documentation, I use XMLMind’s XML Editor (XXE) – and very good it is too. The professional version comes with an Integrated Spreadsheet Engine, which I have just recently taken a look at. I think it’s rather interesting, particularly how the approach differs from Excel or similar traditional spreadhseets.

We are not talking here about Excel (2007+) using an XML-based file format. In this, the information is structured using elements that relate to spreadsheets – such as <worksheet> and <row> – not to the content domain (invoices, timesheets, product specifications, whatever). That means that the information is not in practice accessible to users (as opposed to tool developers).

What XXE is addressing is the insertion of computed element content into a ‘user level’ XML document. Such a document could be in XHTML or DITA (XHTML is obviously a ‘final’ format; DITA is a ‘source’ format for conversion into various final formats, such as XHTML, PDF, CHM).

The idea is that the XML contains formulas as Processing Instructions. PIs are interpreted by the XXE application, but are ignored by XML processing tools and web browsers. A formula PI generates content (a result value) which is inserted between the PI and the end of the enclosing element. In practice, a formula will provide content for a lowest-level element such as a table entry, or a text phrase. Here is a simple-table entry:

<?xxe-formula formula=’=($[+0,3] * $[+0,4])’?>

This is in column 5 of a table, and is multiplying the values in columns 3 and 4, to produce the value 150. XXE provides a sugared syntax for references within a table, which is essentially the same as an Excel formula. So the above formula would actually be written as:

=($C2 * $D2)

with the relative addressing of the rows allowing the formula to be copy-pasted in subsequent rows.

In XML, any element can have an id attribute. For a table element, this allows us to reference a value in a table from outside (from ordinary text elements). For a leaf element, this allows us to reference a value by name. Suppose that invoice_table contains details of an invoice, and cell D7 contains the tax. We could have a formula:


or, better:


There are, as you would expect, a reasonable number of built-in functions, in the usual categories.

An interesting twist is that a formula can set not just element content, but alternatively an element attribute. The most obvious use of this is to generate id attributes. For example, suppose that I have a table with Product Codes in column A, and Unit Prices in column B. Then, in the B2 cell, I can have this formula setting the id attribute:

=($A2 & “_unitprice”)

(and similarly for the rest of column B). Now, if I want to look up a unit price, I simply construct the id from the product code and the suffix, and access the element directly by id – no need for a VLOOKUP function.

So far, the main difference from Excel is that we are not restricted to using  tabular structures (worksheets, in Excel). We can have values and formulas anywhere in a document structure. For erxample, I could drop a calculated value into an ordinary free-text paragraph, and pick up that value elsewhere in the document.

However, the really different aspect is that the formulas are based on the XPath 1.0 language – the Excel-like syntax is just cosmetic. XPath is a pattern-based query language that treats XML documents as trees of nodes-with-attributes. Here’s an example-based tutorial. This means that a formula can operate on a set of nodes (returned from an XPath expression), without knowing how many there are or where they are in a document.

We can tag values (wherever they are) using an attribute. In DITA, we could use @outputclass (not @id, as this must be unique). For example, I could tag various elements (possibly table cells, possibly not) with @outputclass = ‘cost’, and then sum these using the formula:


The backquotes encapsulate an XPath expression (as opposed to the XXE formula language). ‘//*’ means “any element anywhere in the document” (which we then filter by the outputclass attribute). Furthermore, XPath can access not only the current document, but also other documents (as individual documents, not as a document collection – a notion supported by XPath 2.0/XQuery).

I’ll discuss an example of all this in the next post.

Excel – XML Workflow – part 2

In the previous post, we were round-tripping data between Job workbooks and Timesheet workbooks. The final link in the cycle was for each Job workbook to update its cumulative hours-booked value, from a single XML file JobHoursList.xml. Here’s an example of that file:

<?xml version='1.0' encoding='ISO-8859-1'?>

A Job workbook needs to find the jobhoursentry for its own jobnumber, and load the associated jobhours value.

I decided to load the file, and search it sequentially, using the MS XML library (I went for Microsoft XML v6.0, which is called MSXML2 – I don’t know whether the version makes any difference).

The Workbook_Open handler calls a Load procedure. The whole file is loaded as a MSXML2.DOMDocument:

    Set doc = New MSXML2.DOMDocument
    With doc
        .async = False
        .validateOnParse = False
    End With
    ok = doc.Load(ActiveWorkbook.Path + "\" + datafile)

The nodes are each of type MSXML2.IXMLDOMNode. The jobhourslist is the document (that is, root) element:

    Set jobhourslist = doc.DocumentElement

Then we iterate through the child (jobhoursentry) nodes:

    For Each jobhoursentry In jobhourslist.ChildNodes
      Set jobnumber = jobhoursentry.FirstChild
      Set jobhours = jobhoursentry.LastChild
      'check jobnumber against the one in this workbook
      'and if a match, update the jobhours cell

Surprisingly straightforward. Incidentally, “DOM” stands for Document Object Model – that is, the node-tree representation of an XML document.

Here’s the VBA module, if you’re interested.

Excel – XML Workflow

Following on from a previous suggestion about workbook initialization from XML data, I’ve had a go at implementing a bit of round-trip workflow, based on a time-reporting scenario.

Time is booked against currently active jobs; this data is totalled for each job, and updates the job specifications accordingly. I’m assuming this works on a regular cycle, say weekly. Here’s an overview:

A Job workbook contains:

  • Job Number
  • Job Name
  • Start Date
  • End Date
  • Total Hours booked.

There’s a job_Map, so that the data can be exported to a corresponding XML data file. This is done on the Workbook_BeforeClose event (arrow A).

The Job*.xml files are imported into an XML database (Qizx/Free Engine). We can then run a simple XQuery over the collection to generate the list of all active jobs – those that have a Start Date, but no End Date. This list is saved as joblist.xml (arrow Q1).

When a new Timesheet is created (from a template), the joblist.xml data is imported, and is mapped to an Excel table, which expands to hold the possibly multiple rows (arrow B).

The hours column is not mapped: that’s where the user enters the time data.

Now, we want to export jobnumber + hours for each timesheet entry. However, we can’t have jobnumber in two maps (input and output). My solution (there might be better ones) is to copy the data to another table on another worksheet.

To get the data across, I’m using the formula:


where Table_tslist is the input table. The #VALUE! errors are because the table is fixed size.

The output table is mapped within timesheet_Map (along with tsname and tsperiod). We can now export the data on the Workbook_BeforeClose event (arrow C). The empty (#VALUE!) rows generate empty elements, which are just ignored.

The Timesheet*.xml files are imported into the database, and another XQuery is run to generate a list of (all) job numbers + total hours booked (arrow Q2).

The final link in the cycle is for a Job workbook to refresh its total hours data, when it is opened (arrow D). This is different from arrow B, in that it’s not a simple Import. Each Job workbook wants only the hours value for its own job number. There might well be various ways of doing this, but I’m doing a programmatic load-document-and-lookup, in VBA. I’ll leave this to the next post.

And it’s all working pretty neatly. Potentially, we could extract and summarize other information from the timesheets, and present this in the Job workbooks – for example, the latest period with booked time.

Workbook Initialization

Interesting post over at Bacon Bits. The idea here is that a workbook application has a single-field XML Map linked to a shared XML file somewhere. This file contains the current version number of the application, and is updated accordingly by the developer. A handler for the Workbook_Open event refreshes the XML map, and then checks the workbook’s own version number (encoded in the handler procedure) against the refreshed value. If these are different, a MsgBox is displayed prompting the user to get the updated version, and the workbook is closed.

That’s neat. Furthermore, you could use this pattern to do all sorts of ‘dynamic initialization’. For example, a timesheet workbook could initialize itself with a set of current jobs, obtained from a ‘master’ XML file.

This file could itself be the result of a query over a ‘database’ of XML documents. For example, we could have an XML document per job (mapped into, and editable via an Excel workbook), with start and end dates. Jobs without an end date are Active, and are picked up by an XQuery which generates the current ‘master’ file. This is then imported by our timesheet workbooks when they open, as above. I think I might have a go at implementing this workflow – more anon.

XML Export from Excel – part 3

In Part 2, we had some VBA code that exported data from financial-report workbooks into XML files. What we can do now is query these files to produce summary tables – rather like views of a pivot table – and include these in a PDF or XHTML report, to provide an ‘executive summary’. (Note: we’re leaving VBA behind here, but stay tuned).

Here’s what our exported XML data looks like:

<?xml version="1.0" standalone="yes"?>
<bureport xmlns:xsi='>

And here’s the kind of report we might generate (this is XHTML):

The idea is that this is a presentable/printable report. We could have a lot more in terms of sections, supporting text, fancy layout, etc.

So how do we get from the XML data to the summary report? Here’s one possibility:

  1. import the XML files into an XML database
  2. run queries over the database to generate summary tables as individual DITA topics
  3. include (content reference) the summary tables from our master-report topic, which has the overall structure and additional content
  4. generate the output report in the desired format(s).

Sounds a bit of a performance – why don’t we just collate all our data in a ‘database’ Excel workbook, knock up a few pivot tables, and copy-paste these into a Word document? Basically, flexibility.  Firstly, a pivot table isn’t a general query mechanism: we might want all sorts of summary information, not just dimension-aggregations. Secondly, the individual summary tables might be included in different combinations in different reports. Also, we want the reports to reference fixed locations, where they can pick up the latest (e.g. last month’s) summary tables automatically.

For the XML database, I’m using XMLMind’s QizX Free Engine. You can use this either from a command-line or via a simple client app. There’s an XQuery query to generate each table (they’re very similar): here‘s the one for the totals by Business Unit (BU).

The summary tables are minimal DITA topics containing a <simpletable>:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE topic ...>
<topic id="FR_BU_Totals_Data">
 <title>BU Totals Data</title>
    <simpletable id="BU_Totals_Data">

The “FR” prefix on the topic id stands for “Financial Report”. Note the id also on the <simpletable>.

Now, in the DITA topic for our master report, we just have a <simpletable> element that con(tent)ref(erence)s the summary table in its separate topic:

<simpletable conref="..."/>

where the “…” is a reference of the form: filepath#topicid/elementid.

Now we just generate the report output in our chosen format. Multiple reports can be built into a larger structure: a set of XHTML pages with appropriate linking, or a multi-section PDF document, both with tables of contents, etc.

Stages #1, #2 and #4 above can all be done by command-line tools, so it should be possible to automate the entire process (including the VBA bit).

I think that deserves a cup of tea…

XML Export from Excel – part 2

In Part 1, I wanted to export the data in a simple workbook as XML, so I can pick it up with XML query and documentation tools (of which more anon).

The code needs to:

    1. construct the output file name from the BU name and period (e.g. Report_Applications_4Q2009.xml)
    2. open a Scripting.TextStream on this (new) file
    3. get the complete XML string from the bureport map
    4. substitute the basic <bureport> element with the full one
    5. write the modified XML string to the TextStream.

      Items 3 and 4 are done by a simple function:

      Function XMLwithNS() As String
          Dim xml As String
          Dim res As XlXmlExportResult
          res = _
          If res = xlXmlExportSuccess Then
              XMLwithNS = Replace(xml, _
                            basicroot, fullroot, 1, 1)
              XMLwithNS = ""
          End If
      End Function

      “NS” in the function name stands for “namespace” – that is, the reference to the XML schema, discussed in Part 1. basicroot and fullroot are the before and after versions of the root element.

      Notice that XMLMap.ExportXML returns a success/failure status; the (ByRef) variable xml gets the actual XML string. The export method will fail if the mapped data is invalid with respect to the map (i.e. schema); we return an empty string in this case.

      Here’s the main routine:

      Sub ExportXMLtoFile()
          Dim bu As String, period As String
          Dim xmlfilename As String
          Dim xml As String
          bu = Range("BU").Value
          period = Range("Period").Value
          xmlfilename = basename + "_" + bu + "_" + _
                                period + ".xml"
          Dim fso As New Scripting.FileSystemObject
          Dim ts As Scripting.TextStream
          ChDir ActiveWorkbook.Path
          MsgBox "Current directory is " & CurDir
          xml = XMLwithNS
          If Len(xml) > 0 Then 'okay
              Set ts = fso.CreateTextFile(xmlfilename)
              ts.Write XMLwithNS
              MsgBox "Exported XML to file " & xmlfilename
          Else 'XML did not validate
              MsgBox "XMl does not validate; export aborted"
          End If
      End Sub

      As mentioned before, we’d probably want to run this as a batch operation from a ‘master’ workbook, which iterated over the data workbooks (maybe looking at File.DateLastModified).

      So, having now got umpteen XML data files, what can we do with them? I’ll discuss this in Part 3.

      July 2018
      M T W T F S S
      « Dec