Posts Tagged 'DITA'

Analysing Cross-references Between Documents

I have recently been working on a set of about 200 Use Case documents, converting them into DITA – an XML-based structured document format. In the body text of a Use Case, there can be cross-references to other Use cases – so we have a many-to-many relationship between Use Cases (just as we could between any type of document). It’s interesting to analyse this relationship, to see which Use Cases have lots of outgoing references, and which Use Cases are referenced by many other Use Cases.

This kind of totalling is Pivot Table territory. So the question arises of how to contruct the basic data about the cross-references – a table of A-references-B records – and how to get it into Excel.

A key advantage of XML-based documents is that you can import them into an XML database and run queries over them, where the queries are written in XQuery. Such queries can generate output as XML, HTML or plain text. For our purpose, a simple route is to output an HTML table. This can then be imported by Excel via >Get External Data >From Web. The URL for the import points to our XQuery file, within the database server:


Here, the database is XMLMind’s Qizx, running in server mode (locally on my laptop, in this case). This URL can be used like any other in a browser interface, with the results of the query being displayed within the browser. In our case, we are going to give the URL to Excel.

Here’s the code of the query:

xquery version "1.0";
declare option output:method "html";

declare variable $ucs := collection("/UC");

declare function local:uc_refs()
  for $ucref in $ucs//uc-ref
  let $file := document-uri(root($ucref))
  let $ucreftext := $ucref/text()
  order by $file

let $ucrefs := local:uc_refs()
<html xmlns="">
    <title>UC References</title>

Variables start with $. The function collection (line 4) scopes the query to a particular directory tree of document files. The expression ‘$ucs//uc-ref’ binds to each occurrence of the <uc-ref> element, anywhere in the collection of documents. $file is the document in which a <uc-ref> is found. The third column of the table (Token) is just the number 1 – this gives me some data for the pivot table to sum.

Now we can do the import into Excel:

Finally, I can insert a pivot table based on this data, subtotalling by either Source or Target, or putting both dimensions on the same axis. I love it when stuff just works…

Out of curiosity, I ran the import with the macro recorder on, to see what the VBA looks like:

Sub Import()
    With ActiveSheet.QueryTables.Add( _
        Connection:= _
            Destination:=Range ("$A$1"))
        .Name = "ucrefs.xq"
        .FieldNames = True
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .WebSelectionType = xlAllTables
        .WebFormatting = xlWebFormattingNone
        .WebPreFormattedTextToColumns = True
        .WebConsecutiveDelimitersAsOne = True
        .WebSingleBlockTextImport = False
        .WebDisableDateRecognition = False
        .WebDisableRedirections = False
        .Refresh BackgroundQuery:=False
    End With
End Sub

So it’s just creation of a QueryTable, with the URL given as a Connection string. The WebSelectionType property is ‘all tables’, although I thought I clicked the lower of the two yellow arrows in the Import dialog. Clearly, we could wrap this up a bit more: selecting from a list of queries, creating a new Worksheet, etc.

Having established the basic XML –> XQuery –> HTML Table –> Excel/VBA architecture, I think I’ll experiment more with this in due course.

Incidentally, this is my 100th post (Rah! Rah!), having taken almost exactly two years. And most enjoyable it’s been – at least for me 🙂

Spreadsheets in XML – Part 2

In the previous post, I was looking at the ‘spreadsheet extensions’ provided by XMLMind’s XML Editor (XXE). This allows XPath-based formulas to be inserted into XML documents, not only in tabular elements, but also in free text.

As an example, I mocked up some invoices. An invoice is a good example of a hybrid document: we want to print it out (or PDF it) as a nicely formatted document; there are calculated and looked-up elements in the manner of a spreadsheet; we want the whole set of invoices to be queryable subsequently, in the manner of a database.

Here’s an invoice, as a DITA document, shown in XXE:

The little green ‘f’ icons represent the formulas, held as XML Processing Instructions. These are ignored in subsequent transformations (to final formats, such as PDF). You double-click an icon to edit the formula.

The first one (before the table) is today’s date: =today(). The ones in the Cost column are simple arithmetic: =($C2 * $D2), etc. Column and row headers can be displayed optionally:

The Product Description and Unit Price formulas are more interesting, since they are lookups in another document, containing the product catalog. Here’s the formula in B2:

The first thing to notice is that we can have multi-line formulas, with ‘let’ definitions preceding the actual formula. (The “…” is really a full file path – I’ve elided it for compactness). The id of the element with the product description is the product code appended with “_desc”. This is then retrieved from the product catalog by matching the id attribute (@id) with the constructed value ($id). (The back-quotes indicate ‘raw’ XPath, rather than XXE formula language).

Here’s the Product catalog (not very extensive!):

The formulas here are used not to calculate visible values, but to construct values for the id attribute. For example, in B2:

Note the id attribute picked from the drop-down list. In Excel terms, this is rather like having a formula that constructs a Range name. It means that the ids for cells in column B and C always follow the product codes in column A. I think this is rather neat.

Back in the invoice, the Total Cost formula sums the values in the Cost column (E) – see the first scrrenshot. We could do this with a table/column reference, but an alternative is to tag the Cost cells with a common attribute value. In DITA, @outputclass allows a kind of informal specialization (we can’t use @id, as this must be unique within a document). Here, we can set @outputclass = ‘cost’. Now, the Total Cost formula sums all elements with this attribute value, wherever they are in the document:


That’s it, in terms of the documents. We can then generate formatted output, as we require.

The database aspect comes if the invoices are put into an XML database, such as XMLMind’s Qizx (Free Engine edition). This provides indexing and querying, using the XQuery language. We can then calculate aggregated values, for example by customer and product. Here’s a simple query to calculate the total invoiced for a given product:

xquery version "1.0";

let $prod := "PR01"    (:edit this:)

let $costs :=
 for $row in //strow
 let $cost := $row/stentry[@outputclass='cost']
 where $row/stentry[1]/text() = $prod
 return $cost
return ($prod, sum($costs))

strow is a simple-table row, stentry is a cell. One could, of course, get a lot fancier, and produce proper date-based reports.

There’s an interesting contrast here with how we would do this in Excel. If each invoice is a separate Workbook, we would need to provide some collation mechanism for the data, to get it into a single source for pivot tables, etc. – either in a single workbook, or in Access. I think that where we have a large number of computationally relatively simple documents, the XML approach is quite attractive.

Spreadsheets in XML

In my work with DITA documentation, I use XMLMind’s XML Editor (XXE) – and very good it is too. The professional version comes with an Integrated Spreadsheet Engine, which I have just recently taken a look at. I think it’s rather interesting, particularly how the approach differs from Excel or similar traditional spreadhseets.

We are not talking here about Excel (2007+) using an XML-based file format. In this, the information is structured using elements that relate to spreadsheets – such as <worksheet> and <row> – not to the content domain (invoices, timesheets, product specifications, whatever). That means that the information is not in practice accessible to users (as opposed to tool developers).

What XXE is addressing is the insertion of computed element content into a ‘user level’ XML document. Such a document could be in XHTML or DITA (XHTML is obviously a ‘final’ format; DITA is a ‘source’ format for conversion into various final formats, such as XHTML, PDF, CHM).

The idea is that the XML contains formulas as Processing Instructions. PIs are interpreted by the XXE application, but are ignored by XML processing tools and web browsers. A formula PI generates content (a result value) which is inserted between the PI and the end of the enclosing element. In practice, a formula will provide content for a lowest-level element such as a table entry, or a text phrase. Here is a simple-table entry:

<?xxe-formula formula=’=($[+0,3] * $[+0,4])’?>

This is in column 5 of a table, and is multiplying the values in columns 3 and 4, to produce the value 150. XXE provides a sugared syntax for references within a table, which is essentially the same as an Excel formula. So the above formula would actually be written as:

=($C2 * $D2)

with the relative addressing of the rows allowing the formula to be copy-pasted in subsequent rows.

In XML, any element can have an id attribute. For a table element, this allows us to reference a value in a table from outside (from ordinary text elements). For a leaf element, this allows us to reference a value by name. Suppose that invoice_table contains details of an invoice, and cell D7 contains the tax. We could have a formula:


or, better:


There are, as you would expect, a reasonable number of built-in functions, in the usual categories.

An interesting twist is that a formula can set not just element content, but alternatively an element attribute. The most obvious use of this is to generate id attributes. For example, suppose that I have a table with Product Codes in column A, and Unit Prices in column B. Then, in the B2 cell, I can have this formula setting the id attribute:

=($A2 & “_unitprice”)

(and similarly for the rest of column B). Now, if I want to look up a unit price, I simply construct the id from the product code and the suffix, and access the element directly by id – no need for a VLOOKUP function.

So far, the main difference from Excel is that we are not restricted to using  tabular structures (worksheets, in Excel). We can have values and formulas anywhere in a document structure. For erxample, I could drop a calculated value into an ordinary free-text paragraph, and pick up that value elsewhere in the document.

However, the really different aspect is that the formulas are based on the XPath 1.0 language – the Excel-like syntax is just cosmetic. XPath is a pattern-based query language that treats XML documents as trees of nodes-with-attributes. Here’s an example-based tutorial. This means that a formula can operate on a set of nodes (returned from an XPath expression), without knowing how many there are or where they are in a document.

We can tag values (wherever they are) using an attribute. In DITA, we could use @outputclass (not @id, as this must be unique). For example, I could tag various elements (possibly table cells, possibly not) with @outputclass = ‘cost’, and then sum these using the formula:


The backquotes encapsulate an XPath expression (as opposed to the XXE formula language). ‘//*’ means “any element anywhere in the document” (which we then filter by the outputclass attribute). Furthermore, XPath can access not only the current document, but also other documents (as individual documents, not as a document collection – a notion supported by XPath 2.0/XQuery).

I’ll discuss an example of all this in the next post.

XML Export from Excel – part 3

In Part 2, we had some VBA code that exported data from financial-report workbooks into XML files. What we can do now is query these files to produce summary tables – rather like views of a pivot table – and include these in a PDF or XHTML report, to provide an ‘executive summary’. (Note: we’re leaving VBA behind here, but stay tuned).

Here’s what our exported XML data looks like:

<?xml version="1.0" standalone="yes"?>
<bureport xmlns:xsi='>

And here’s the kind of report we might generate (this is XHTML):

The idea is that this is a presentable/printable report. We could have a lot more in terms of sections, supporting text, fancy layout, etc.

So how do we get from the XML data to the summary report? Here’s one possibility:

  1. import the XML files into an XML database
  2. run queries over the database to generate summary tables as individual DITA topics
  3. include (content reference) the summary tables from our master-report topic, which has the overall structure and additional content
  4. generate the output report in the desired format(s).

Sounds a bit of a performance – why don’t we just collate all our data in a ‘database’ Excel workbook, knock up a few pivot tables, and copy-paste these into a Word document? Basically, flexibility.  Firstly, a pivot table isn’t a general query mechanism: we might want all sorts of summary information, not just dimension-aggregations. Secondly, the individual summary tables might be included in different combinations in different reports. Also, we want the reports to reference fixed locations, where they can pick up the latest (e.g. last month’s) summary tables automatically.

For the XML database, I’m using XMLMind’s QizX Free Engine. You can use this either from a command-line or via a simple client app. There’s an XQuery query to generate each table (they’re very similar): here‘s the one for the totals by Business Unit (BU).

The summary tables are minimal DITA topics containing a <simpletable>:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE topic ...>
<topic id="FR_BU_Totals_Data">
 <title>BU Totals Data</title>
    <simpletable id="BU_Totals_Data">

The “FR” prefix on the topic id stands for “Financial Report”. Note the id also on the <simpletable>.

Now, in the DITA topic for our master report, we just have a <simpletable> element that con(tent)ref(erence)s the summary table in its separate topic:

<simpletable conref="..."/>

where the “…” is a reference of the form: filepath#topicid/elementid.

Now we just generate the report output in our chosen format. Multiple reports can be built into a larger structure: a set of XHTML pages with appropriate linking, or a multi-section PDF document, both with tables of contents, etc.

Stages #1, #2 and #4 above can all be done by command-line tools, so it should be possible to automate the entire process (including the VBA bit).

I think that deserves a cup of tea…

Adding HTML Help – 2

I was previously wondering how to link a VBA project to HTML Help (that is, linking to a page in a .chm file). I’m thinking of the Help as pages in a User Manual, not control-level Help (“The Delete button deletes the current item”, etc).

Having got some compiled Help (a CHM file), we need to link into it. The simplest way seems to be to use the Application.Help method. This takes a path to a CHM file, and a numerical key – for example:

Application.Help “C:\…….\myhelp.chm”, 1002

There are a couple of issues with this. The first is that each call to this method (on the same file) opens a new Help Viewer window. I haven’t found a way round this. The second issue is that the numerical keys have to be defined in a header file prior to compilation – something like this:

#define VBA_Creating_a_Widget     1001
#define VBA_Filtering_Widgets        1002

where the names are those of the topic files in the HTML Help project (minus the .html suffix). This mapping is compiled into the help project to produce the CHM file. However, generation from DITA (XML) source goes straight to the compiled CHM, so there’s no chance of adding in a map. (You could decompile the CHM and then add the map before recompiling).

Fortunately, it doesn’t matter, because there’s another way of linking to the Help from VBA. This is to use a more general function, HtmlHelp, available from the Help OCX (HHCtrl.ocx).

You need to declare this function in a public module:

Declare Function HtmlHelp Lib "HHCtrl.ocx" _
   Alias "HtmlHelpA" _
   (ByVal hwndCaller As Long, _
   ByVal pszFile As String, _
   ByVal uCommand As Long, _
   dwData As Any) As Long

You also need a couple of constants (just to give us meaningful names):


The constant declarations  need to be in each module that calls the function.

We can now call the function from a Command button handler, or whatever:

Private Sub HelpCommand_Click()
  Call HtmlHelp (0, Dotslash("myhelp.chm"), _
      HH_DISPLAY_TOPIC, ByVal "VBA_Filtering_Widgets.html")
End Sub

The zero argument is to do with the windowing – the Help opens in a sibling window. Note the ByVal passing of the string (because it’s to an external library). Dotslash is just my shorthand for an absolute path (which is recommended):

Public Function Dotslash(filename As String) As String
  Dotslash = ActiveWorkbook.Path + "\" + filename
End Function

The other mode argument, HH_HELP_CONTEXT, is for the mapped-id lookup, which is what Application.Help does.

This is just scratching the surface. There’s more information on MSDN.

Adding HTML Help

We’re all familiar with HTML Help for applications – usually in the form of .chm files (with the yellow question-mark icon). Typically, constructing these this involves getting to grips with the Microsoft HTML Help Workshop. This includes a tool for building Help projects, and compiling them into .chm files.

This is okay for fine-grained context-sensitive Help, at the level of individual operations and controls. However, if the material is more in the nature of a User Manual, this is probably not the form in which we want to author it. And we will probably want to generate different forms of output, such as general HTML pages and PDF.

A few posts ago, I mentioned producing documentation using the XML-based DITA framework. One of the benefits of this is that content is held in a generic XML-based format, from which outputs of different types can be generated: XHTML, PDF and CHM (and also Java Help). DITA allows us to select and combine various topics (chunks of content) from our User Manual – perhaps the lower-level task or reference topics – and compile these into a .chm file. Apart from the compiler, which is invoked automatically, this does not involve the HTML Help Workshop.

Note on tools: There is a DITA Open Toolkit, but I have been using tools from XMLMind – the free XML Editor (Personal Edition) and the open-source DITA Converter. And very good they are too.

The selection, structuring and ordering of topics is done using a DITA Map file. HTML Help is specified as the output format via options to the DITA Converter.

Having got our .chm file, the next step is to access it from a VBA project. I’d been vaguely aware that you could do this, but had never investigated further. More tomorrow…

Digression on Documentation: DITA

As is the way of things, I’ve been diverted from the delights of VBA, on to some other work. This is a migration of the system definition documentation for a major financial system, from traditional Word documents to an XML-based architecture called DITA. (Somewhat fancifully, the ‘D’ stands for Darwin, which is appropriate in this, his bicentenary year).

The production of technical documentation is undergoing something of a revolution at present. This is due to the maturing of a raft of technologies based on the XML markup language. Broadly speaking, these technologies provide a solution that sits between monolithic documents-as-files (such as Word documents), and relational databases, as complex aggregations of fine-grained information records.

With a Word document, the unit of content is the same as the unit of presentation: a file is edited, and the same file is printed. This applies equally to web pages (HTML), but with the page as the unit. In practice, it is difficult and time-consuming to identify, extract and recombine fragments of documents to produce new deliverables.

With a relational database, the information records can be conjoined, aggregated and filtered in very complex ways, using a query language (SQL). However, databases are not really suited to holding large free-text elements, like a section of a document. Also, there is no notion of hierarchical structuring in query output, in contrast to the hierarchy of chapters, sections and sub-sections that we are familiar with in documents.

The XML-based solutions aim to provide a middle course. Content is created and held in a form that is structured enough to identify, extract and recombine fragments of documents to produce new deliverables. At the same time, the content does not carry information about presentation (either the target format or the details of layout). This is provided by transformations of the content to produce deliverables in different formats, such as Word or PDF for printable documents, or hypertext (XHTML) for web presentation or online Help.

The challenge is to come up with an information model that defines and relates appropriate topics (i.e. basic chunks), in ways that allow querying, selection and combination in flexible ways. There’s a trade-off here between flexibility and chunk size. Too fine-grained and it’s impossible to manage; too coarse-grained and you’re back with monolithic documents. There’s a wider trend towards ‘medium-sized’ information chunks: think of blog posts, like this one, or Wiki pages.

More on this anon…

November 2019
« Dec