Posts Tagged 'XQuery'

Analysing Cross-references Between Documents

I have recently been working on a set of about 200 Use Case documents, converting them into DITA - an XML-based structured document format. In the body text of a Use Case, there can be cross-references to other Use cases – so we have a many-to-many relationship between Use Cases (just as we could between any type of document). It’s interesting to analyse this relationship, to see which Use Cases have lots of outgoing references, and which Use Cases are referenced by many other Use Cases.

This kind of totalling is Pivot Table territory. So the question arises of how to contruct the basic data about the cross-references – a table of A-references-B records – and how to get it into Excel.

A key advantage of XML-based documents is that you can import them into an XML database and run queries over them, where the queries are written in XQuery. Such queries can generate output as XML, HTML or plain text. For our purpose, a simple route is to output an HTML table. This can then be imported by Excel via >Get External Data >From Web. The URL for the import points to our XQuery file, within the database server:

http://roy-laptop:18080/qizx/xqs/ucrefs.xq

Here, the database is XMLMind’s Qizx, running in server mode (locally on my laptop, in this case). This URL can be used like any other in a browser interface, with the results of the query being displayed within the browser. In our case, we are going to give the URL to Excel.

Here’s the code of the query:

xquery version "1.0";
declare option output:method "html";

declare variable $ucs := collection("/UC");

declare function local:uc_refs()
{
  for $ucref in $ucs//uc-ref
  let $file := document-uri(root($ucref))
  let $ucreftext := $ucref/text()
  order by $file
  return
  <tr><td>{$file}</td><td>{$ucreftext}</td><td>1</td></tr>
};

let $ucrefs := local:uc_refs()
return
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>UC References</title>
  </head>
  <body>
    <table>
      <thead>
	    <tr><td>Source</td><td>Target</td><td>Token</td></tr>
      </thead>
      <tbody>
	    {$ucrefs}
      </tbody>
    </table>
  </body>
</html>

Variables start with $. The function collection (line 4) scopes the query to a particular directory tree of document files. The expression ‘$ucs//uc-ref’ binds to each occurrence of the <uc-ref> element, anywhere in the collection of documents. $file is the document in which a <uc-ref> is found. The third column of the table (Token) is just the number 1 – this gives me some data for the pivot table to sum.

Now we can do the import into Excel:

Finally, I can insert a pivot table based on this data, subtotalling by either Source or Target, or putting both dimensions on the same axis. I love it when stuff just works…

Out of curiosity, I ran the import with the macro recorder on, to see what the VBA looks like:

Sub Import()
    With ActiveSheet.QueryTables.Add( _
        Connection:= _
            "URL;http://roy-laptop:18080/qizx/xq/ucrefs.xq",
            Destination:=Range ("$A$1"))
        .Name = "ucrefs.xq"
        .FieldNames = True
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .WebSelectionType = xlAllTables
        .WebFormatting = xlWebFormattingNone
        .WebPreFormattedTextToColumns = True
        .WebConsecutiveDelimitersAsOne = True
        .WebSingleBlockTextImport = False
        .WebDisableDateRecognition = False
        .WebDisableRedirections = False
        .Refresh BackgroundQuery:=False
    End With
End Sub

So it’s just creation of a QueryTable, with the URL given as a Connection string. The WebSelectionType property is ‘all tables’, although I thought I clicked the lower of the two yellow arrows in the Import dialog. Clearly, we could wrap this up a bit more: selecting from a list of queries, creating a new Worksheet, etc.

Having established the basic XML –> XQuery –> HTML Table –> Excel/VBA architecture, I think I’ll experiment more with this in due course.

Incidentally, this is my 100th post (Rah! Rah!), having taken almost exactly two years. And most enjoyable it’s been – at least for me :-)

Spreadsheets in XML – Part 2

In the previous post, I was looking at the ‘spreadsheet extensions’ provided by XMLMind’s XML Editor (XXE). This allows XPath-based formulas to be inserted into XML documents, not only in tabular elements, but also in free text.

As an example, I mocked up some invoices. An invoice is a good example of a hybrid document: we want to print it out (or PDF it) as a nicely formatted document; there are calculated and looked-up elements in the manner of a spreadsheet; we want the whole set of invoices to be queryable subsequently, in the manner of a database.

Here’s an invoice, as a DITA document, shown in XXE:

The little green ‘f’ icons represent the formulas, held as XML Processing Instructions. These are ignored in subsequent transformations (to final formats, such as PDF). You double-click an icon to edit the formula.

The first one (before the table) is today’s date: =today(). The ones in the Cost column are simple arithmetic: =($C2 * $D2), etc. Column and row headers can be displayed optionally:

The Product Description and Unit Price formulas are more interesting, since they are lookups in another document, containing the product catalog. Here’s the formula in B2:

The first thing to notice is that we can have multi-line formulas, with ‘let’ definitions preceding the actual formula. (The “…” is really a full file path – I’ve elided it for compactness). The id of the element with the product description is the product code appended with “_desc”. This is then retrieved from the product catalog by matching the id attribute (@id) with the constructed value ($id). (The back-quotes indicate ‘raw’ XPath, rather than XXE formula language).

Here’s the Product catalog (not very extensive!):

The formulas here are used not to calculate visible values, but to construct values for the id attribute. For example, in B2:

Note the id attribute picked from the drop-down list. In Excel terms, this is rather like having a formula that constructs a Range name. It means that the ids for cells in column B and C always follow the product codes in column A. I think this is rather neat.

Back in the invoice, the Total Cost formula sums the values in the Cost column (E) – see the first scrrenshot. We could do this with a table/column reference, but an alternative is to tag the Cost cells with a common attribute value. In DITA, @outputclass allows a kind of informal specialization (we can’t use @id, as this must be unique within a document). Here, we can set @outputclass = ‘cost’. Now, the Total Cost formula sums all elements with this attribute value, wherever they are in the document:

=sum(`//*[@outputclass='cost']`)

That’s it, in terms of the documents. We can then generate formatted output, as we require.

The database aspect comes if the invoices are put into an XML database, such as XMLMind’s Qizx (Free Engine edition). This provides indexing and querying, using the XQuery language. We can then calculate aggregated values, for example by customer and product. Here’s a simple query to calculate the total invoiced for a given product:

xquery version "1.0";

let $prod := "PR01"    (:edit this:)

let $costs :=
 for $row in //strow
 let $cost := $row/stentry[@outputclass='cost']
 where $row/stentry[1]/text() = $prod
 return $cost
return ($prod, sum($costs))

strow is a simple-table row, stentry is a cell. One could, of course, get a lot fancier, and produce proper date-based reports.

There’s an interesting contrast here with how we would do this in Excel. If each invoice is a separate Workbook, we would need to provide some collation mechanism for the data, to get it into a single source for pivot tables, etc. – either in a single workbook, or in Access. I think that where we have a large number of computationally relatively simple documents, the XML approach is quite attractive.

Sub-sequence Cursor

In the previous post, we were partitioning rows on a worksheet into separate workbooks. If you think of this more generally, we’re identifying disjoint sub-sequences of the data records and performing some operation or calculation over each of these sub-sequences. The end of a sub-sequence is defined by a condition – in this case, a change of the key value (end.key <> endnext.key).

By a complete coincidence, with my XML hat on, I recently came across the spec for XQuery 1.1 (this is a draft spec for the next version, so it’s not implemented yet). XQuery is a functional query language for XML documents – roughly equivalent to SQL, but much cleaner and better structured.

Basically, you identify ‘tuple streams’ using the XPath pattern language. A tuple can be a single value, a list of values, or a node-tree, but we can see these as analogous to rows on a worksheet. Normally, our XQuery uses a ‘For Each’-like construct (called a FLWOR – pronounced ‘flower’) to iterate through the tuples one at a time. However in v1.1 they have introduced a Window Clause, which is essentially a For Each Subsequence iteration, with the start and end defined by test conditions. (So ‘window’ means sub-sequence, here).

In addition, you can specify whether the windows are allowed to overlap or not. In a ‘tumbling’ window, start(n+1) must be after end(n). In a ‘sliding’ window, start(n+1) must be after start(n), but could be before end(n). In our data partitioning case, our windows were ‘tumbling’ (and, as it happens, contiguous).

Anyway, it struck me that we could implement something similar for Excel, but in an object-based, rather than functional, style. That is, we could have an object of class SubsequenceCursor, which provides access to successive sub-sequences of our data records. As with all cursor objects, there would be a MoveNext operation, plus properties to access properties of the current sub-sequence. The start and end conditions would be private functions.

Such a cursor object could be used within a loop to derive totals, or other summary information, for each subsequence. If that summary information is written to another worksheet, then we have a very flexible way of doing subtotals, and such like. Since I’m always up for a bit of OO design, I think I’ll have a go.

I’m aware that I never finished the Worksheet Hierarchies stuff, but as always, something new is more interesting :-). I’m also busy with some DITA work, at present.

Excel – XML Workflow

Following on from a previous suggestion about workbook initialization from XML data, I’ve had a go at implementing a bit of round-trip workflow, based on a time-reporting scenario.

Time is booked against currently active jobs; this data is totalled for each job, and updates the job specifications accordingly. I’m assuming this works on a regular cycle, say weekly. Here’s an overview:

A Job workbook contains:

  • Job Number
  • Job Name
  • Start Date
  • End Date
  • Total Hours booked.

There’s a job_Map, so that the data can be exported to a corresponding XML data file. This is done on the Workbook_BeforeClose event (arrow A).

The Job*.xml files are imported into an XML database (Qizx/Free Engine). We can then run a simple XQuery over the collection to generate the list of all active jobs – those that have a Start Date, but no End Date. This list is saved as joblist.xml (arrow Q1).

When a new Timesheet is created (from a template), the joblist.xml data is imported, and is mapped to an Excel table, which expands to hold the possibly multiple rows (arrow B).

The hours column is not mapped: that’s where the user enters the time data.

Now, we want to export jobnumber + hours for each timesheet entry. However, we can’t have jobnumber in two maps (input and output). My solution (there might be better ones) is to copy the data to another table on another worksheet.

To get the data across, I’m using the formula:

=Table_tslist[[#ThisRow],[jobnumber]]

where Table_tslist is the input table. The #VALUE! errors are because the table is fixed size.

The output table is mapped within timesheet_Map (along with tsname and tsperiod). We can now export the data on the Workbook_BeforeClose event (arrow C). The empty (#VALUE!) rows generate empty elements, which are just ignored.

The Timesheet*.xml files are imported into the database, and another XQuery is run to generate a list of (all) job numbers + total hours booked (arrow Q2).

The final link in the cycle is for a Job workbook to refresh its total hours data, when it is opened (arrow D). This is different from arrow B, in that it’s not a simple Import. Each Job workbook wants only the hours value for its own job number. There might well be various ways of doing this, but I’m doing a programmatic load-document-and-lookup, in VBA. I’ll leave this to the next post.

And it’s all working pretty neatly. Potentially, we could extract and summarize other information from the timesheets, and present this in the Job workbooks – for example, the latest period with booked time.

Recursive Functions

I’ve been experimenting with XQuery, which is a ‘pure’ functional language: you just have functions and expressions. You iterate over collections of documents/nodes using a SQL-SELECT-like construct: For-Let-Where-OrderBy-Return (FLWOR). There are also quantified expressions: Some/Every-In-Satisfies.

There are no explicit loops, but you can write a function that is recursive - that is, the function calls itself, until a terminating case is reached (e.g the empty-list case). Each recursive call adds a frame to the call stack of the implementation; when the terminating case is reached, the calls unwind and the frames are popped off the stack, and the result value is constructed. The amount of stack space then limits how long the chain of recursive calls can be (unless the implementation optimizes recursion to iteration – some functional languages do).

Anyway, I wondered how many recursive calls you can make in VBA before you run out stack space (this is in Excel 2007). Here’s a trivial recursive function.

Function Recursive(n As Long) As Long
   If n = 0 Then
      Recursive = 1
   Else
      Recursive = 1 + Recursive(n - 1)
   End If
End Function

This works up to n = 6128. When n = 6129, I get Run-time Error 28 – Out of stack space. That’s a rather higher number than I was expecting. If you change the Long to an Integer, it works up to n = 6285, so the number and types of the arguments makes a slight difference.

It’s unlikely perhaps that you would write a straightforwardly recursive function like this, where a loop would do. However, it’s just conceivable that you might get a chain of object-to-object calls, within a custom data structure, long enough to cause an error.


November 2014
M T W T F S S
« Dec    
 12
3456789
10111213141516
17181920212223
24252627282930

Follow

Get every new post delivered to your Inbox.