Signing Off…

Thanks to all of you who have visited this blog in 2011, and have continued to make comments. However, I need to point out that I’m no longer actively blogging here, so this site is now just for reference. I hope you continue to find it useful.

At the start of the year I started a new  job which has taken me away from Excel, Access and VBA. As some of my later posts indicated (see the tag cloud), I am now in the world of DITA XML documentation, in both a technical and an authoring capacity. My current investigations concern:

  • storing document components in an XML database
  • categorizing document components, on multiple dimensions, using internal meta-data
  • constructing documentation build manifests (DITA maps) using high-level queries (in XQuery)

If that sounds interesting, then contact me via the About page.

Enjoy your coding, in whatever language!

Still here…

Sorry the postings have been a bit thin on the ground since the New Year. Rather unexpectedly some freelance work has developed into a permanent position. I am now in a role which is primarily focused on DITA/XML documentation, for a company that produces a Payments Processing system (Direct Debits and such like). The challenge is to make the documentation scheme fit the highly modular nature of the system, and the core product –> client customization delivery process.

While Excel/VBA will still be in the mix, it won’t be central to what I’m doing. I’ll try to carry on posting as and when, so keep dropping by.

Analysing Cross-references Between Documents

I have recently been working on a set of about 200 Use Case documents, converting them into DITA - an XML-based structured document format. In the body text of a Use Case, there can be cross-references to other Use cases – so we have a many-to-many relationship between Use Cases (just as we could between any type of document). It’s interesting to analyse this relationship, to see which Use Cases have lots of outgoing references, and which Use Cases are referenced by many other Use Cases.

This kind of totalling is Pivot Table territory. So the question arises of how to contruct the basic data about the cross-references – a table of A-references-B records – and how to get it into Excel.

A key advantage of XML-based documents is that you can import them into an XML database and run queries over them, where the queries are written in XQuery. Such queries can generate output as XML, HTML or plain text. For our purpose, a simple route is to output an HTML table. This can then be imported by Excel via >Get External Data >From Web. The URL for the import points to our XQuery file, within the database server:

http://roy-laptop:18080/qizx/xqs/ucrefs.xq

Here, the database is XMLMind’s Qizx, running in server mode (locally on my laptop, in this case). This URL can be used like any other in a browser interface, with the results of the query being displayed within the browser. In our case, we are going to give the URL to Excel.

Here’s the code of the query:

xquery version "1.0";
declare option output:method "html";

declare variable $ucs := collection("/UC");

declare function local:uc_refs()
{
  for $ucref in $ucs//uc-ref
  let $file := document-uri(root($ucref))
  let $ucreftext := $ucref/text()
  order by $file
  return
  <tr><td>{$file}</td><td>{$ucreftext}</td><td>1</td></tr>
};

let $ucrefs := local:uc_refs()
return
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>UC References</title>
  </head>
  <body>
    <table>
      <thead>
	    <tr><td>Source</td><td>Target</td><td>Token</td></tr>
      </thead>
      <tbody>
	    {$ucrefs}
      </tbody>
    </table>
  </body>
</html>

Variables start with $. The function collection (line 4) scopes the query to a particular directory tree of document files. The expression ‘$ucs//uc-ref’ binds to each occurrence of the <uc-ref> element, anywhere in the collection of documents. $file is the document in which a <uc-ref> is found. The third column of the table (Token) is just the number 1 – this gives me some data for the pivot table to sum.

Now we can do the import into Excel:

Finally, I can insert a pivot table based on this data, subtotalling by either Source or Target, or putting both dimensions on the same axis. I love it when stuff just works…

Out of curiosity, I ran the import with the macro recorder on, to see what the VBA looks like:

Sub Import()
    With ActiveSheet.QueryTables.Add( _
        Connection:= _
            "URL;http://roy-laptop:18080/qizx/xq/ucrefs.xq",
            Destination:=Range ("$A$1"))
        .Name = "ucrefs.xq"
        .FieldNames = True
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .WebSelectionType = xlAllTables
        .WebFormatting = xlWebFormattingNone
        .WebPreFormattedTextToColumns = True
        .WebConsecutiveDelimitersAsOne = True
        .WebSingleBlockTextImport = False
        .WebDisableDateRecognition = False
        .WebDisableRedirections = False
        .Refresh BackgroundQuery:=False
    End With
End Sub

So it’s just creation of a QueryTable, with the URL given as a Connection string. The WebSelectionType property is ‘all tables’, although I thought I clicked the lower of the two yellow arrows in the Import dialog. Clearly, we could wrap this up a bit more: selecting from a list of queries, creating a new Worksheet, etc.

Having established the basic XML –> XQuery –> HTML Table –> Excel/VBA architecture, I think I’ll experiment more with this in due course.

Incidentally, this is my 100th post (Rah! Rah!), having taken almost exactly two years. And most enjoyable it’s been – at least for me :-)

An Alternative to VBA in Excel?

Back last year, Gareth Hayter of Slyce Software emailed me about VScript, an alternative to VBA for writing functions and macros in Excel. Unfortunately, for various reasons, I have not been able to look into it in any detail, but it certainly sounds interesting.

VScript is based on Excel-Dna, which is a project to integrate .NET into Excel. The aim is thus to allow you to write functions and macros in C# or VB.Net – so, presumably it is aimed at developers familiar with those languages, and preferring them over VBA. Code is claimed to run considerably faster than VBA (not hard to believe).  VScript provides an IDE integrated with Excel, as with VBA. You can use VScript to create stand-alone XLL add-ins, and digitally sign these.

What about VSTO? Gareth says:

VScript is different from VSTO (Visual Studio Tools for Office) in many ways:

  • VSTO is for professional programmers: It can be complex and confusing to use and requires a lot of time, effort and money to learn.
  • VSTO is expensive: You need to buy Visual Studio® which starts at $799.
  • VSTO is not built into Excel®: It is an external program and works in a very different way from VBA.
  • VSTO projects are complicated to deploy: VSTO is not integrated into Excel, which means that it’s difficult to make a few changes and test them quickly. It requires ClickOnce deployment.
  • VSTO cannot create User-Defined Functions (UDFs): With VSTO, you can’t create functions that you can use in a similar way to SUM() and AVG().

One to watch …

Moving Data Out of Outlook – part 2

Happy and code-full 2011!

In an earlier post, I was considering exporting my Outlook contacts as XHTML pages, and accessing these as a repository of free-format, but queryable, documents. After some time out for Christmas and New Year, not to mention my family’s seasonal cough-fest, I can report back.

I haven’t attempted a comprehensive export of all contact attributes, just the standard ones that I typically use. This still leaves information in the Contact’s body section, but I’ll migrate that manually, as and when.

Here’s an example contact page:

I just grabbed a standard style sheet, so the formatting is not up to much (for example, the cell text middle-aligns vertically, as you can see with “Address”).

In terms of the VBA code, all we need is an adapter class, ContactPage. Instances of this are created for each ContactItem that we export. A ContactPage is responsible for building the XHTML string corresponding to the ContactItem, and writing it out to a text file. Here’s the root procedure:

Const outdir As String = "C:\Temp\Contacts"

Sub ExportContacts()
  Dim allcontacts As Items
  Set allcontacts = Session.Folders("Personal Folders"). _
                             Folders("Contacts").Items

  MsgBox "Exporting " & allcontacts.Count & " Contacts"

  Dim cn As ContactItem  'Outlook class
  Dim cp As ContactPage  'Custom class
  For Each cn In allcontacts
    Set cp = New ContactPage
    Set cp.Contact = cn    'Property Set
    cp.OutputToFile outdir
  Next
End Sub

ContactPage has a function for PageTitle, which is taken from the ContactItem‘s Company or Full Name (in that order of preference). Functions PageId and PageFile are based on this (with a bit of cleaning up to remove characters such as ‘/’).

The sub OutputToFile uses the Microsoft Scripting Runtime library, specifically FileSystemObject and TextStream. You need to load the Scripting library manually, via >Tools >References.

The rest of ContactPage’s features are just functions that generate elements in the XHTML hierarchy. For example:

Public Function XhtmlPage() As String
'The XHTML representation of cp_contact
    XhtmlPage = Header_ & vbCrLf & _
                Html_
End Function

'...

Public Function Html_() As String
    Html_ = "<html id=" & Chr(34) & PageId & Chr(34) '& ...
    Html_ = Html_ & vbCrLf & _
            Head_ & vbCrLf & _
            Body_ & vbCrLf & _
            "</html>"
End Function

Public Function Body_() As String
    Body_ = "<body>" & vbCrLf & _
            "<h1>" & PageTitle & "</h1>" & vbCrLf & _
            Company_table_ & vbCrLf & _
            Person_table_(1) & vbCrLf & _
            Person_table_(2) & vbCrLf & _
            Person_table_(3) & vbCrLf & _
            Categories_ul_ & vbCrLf & _
            "</body>"
End Function

'...

Public Function LastName_td_() As String
    LastName_td_ = "<td class=" & Chr(34) & "LastName" & Chr(34) & ">" & _
     cp_contact.LastName & "</td>"
End Function

The lowest level functions, typically for <td> elements, access the ContactItem‘s properties. As you can see, there are a lot of quotation characters (Chr(34)) in XHTML. The trailing-underscore convention for the function names was so that I could call the functions after the XHTML tags they generate – without the underscore I got a name clash with something existing.

Minor complications where the Contacts folder contains Distribution Lists (class DistListItem), which is a different class from ContactItem – that is, a MAPIFolder.Items is a heterogeneous Collection. This makes the For Each loop throw a type mismatch error, although you can resume execution (>Debug, >Run).

Also a slight issue with duplicate filenames, since the file output overwrites by default. I used Round(Rnd * 1000, 0) to generate a temporary disambiguating suffiix for the filename.

Here’s the Class Module for ContactPage.

 

Diff for Access Tables

I haven’t had time to progress the Outlook Contact Export, so here is a minor digression.

Suppose that I have two Access tables each containing records that represent the same entities – Orders, say. These two sets of equivalent records come from different sources: maybe Orders as received, and Orders as dispatched. I want to cross-check certain fields for equivalent records – say, that order-quantity received and order-quantity dispatched are the same – and flag any mismatches.

Here are two tables:

I want to check the equivalence of columns B and C. There are differences in B/C for the second and fourth records, and an extra record in the second table.

Note that the IDs are counters, but are not necessarily the same, so I can not do a Join of the two tables. The tables are assumed to represent equivalent sets of records – maybe for the same period – so they could be queries on underlying cumulative tables.

It’s not apparent to me how you could do this using queries. If we could join the tables together, then for numerical values I suppose that we could subtract one field from the other, and check for non-zero differences.

However, a more generic solution is to suck the relevant columns into two Recordsets, using a simple SELECT query as the Source, and then iterate through records and fields, comparing values. Differences can be logged in some way:

  • If we expect only rare differences, then we could build up a log String, and display this at the end of the check:
  • If we expect substantial differences, we could base an output Recordset on a temporary table, and write message-records into it.

I’ve adopted the first approach, for simplicity.

Here’s a code module, and the database (2003 .mdb) with test tables. There’s no user interface: RunDiff is run directly from within VBA.

Moving Data Out of Outlook

I’ve felt uneasy for a while that important data – for example, my Contacts – is sitting in Outlook, in an enormous and opaque database file (.pst). Okay, it’s backed up, but if it became ‘corrupted’ and wouldn’t open, then I have no way of recovering the data, that I know of. Why should I have to open Outlook, just to find someone’s website, or phone number? The whole idea of organizing your life within a single application seems terribly outmoded. I just want a mail client that handles message send and receipt; all the information I want to be in the ‘outside world’.

Retrieving data can be pretty painful too. I’ve just clicked on the ‘Activities’ tab of a contact, and it’s taken over 2 minutes to come back with about 20 items. While there is free-text search, there is no proper query access.

Outlook’s data model is baroquely large – a futile attempt to cover all possibilities. I’m looking at the class ContactItem, and it has nearly 150 attributes. I can have NickName, TelexNumber (telex?) and YomiFirstName (something Japanese); but I can only have three email addresses and one webiste URL.

So what’s the alternative? What I’d like is for each item (contact, whatever) to be a web page in a local (private) website. New pages could be created, edited and added as necessary. The information is all fully accessible using basic or more sophisticated editors. Website and mailto links work normally – the latter opening a mail client, as usual.

As it happens I’m using a document repository for XML documents, including those in XHTML. This runs on a Jetty server, at a localhost port. The repository can be browsed as a directory hierarchy, free-text searched, and queried (using XQuery). Because the repository is fully indexed, the search and query are very fast. I can create/edit documents in the repository directly, without the need to export/import, using a WebDAV-enabled editor.

So, to populate the repository initially, I would need to export my existing contacts from Outlook in some way. Exporting items as XML or HTML directly is beyond Outlook (up to and including v2007). So I’m thinking that I could write a custom export to XHTML – nothing complicated, just a simple page with tables. Elements, such as the table cells, would be given standard id attributes to indicate the content type – such as ‘company’, ‘address’; these can then be used in queries, analogously to column names in SQL.

In terms of the design, I’m wondering about a custom class, say ContactAdapter, which would wrap a ContactItem, and build the complete XHTML string, and write it to a file (using the Scripting library for the file access). I’ve not done anything like this before, so we’ll see. I suppose that I could export the data to Excel, and work on it there, but dealing with rows and columns is less direct than working with the actual ContactItem objects in Outlook.

More anon…

Spreadsheets in XML – Part 2

In the previous post, I was looking at the ‘spreadsheet extensions’ provided by XMLMind’s XML Editor (XXE). This allows XPath-based formulas to be inserted into XML documents, not only in tabular elements, but also in free text.

As an example, I mocked up some invoices. An invoice is a good example of a hybrid document: we want to print it out (or PDF it) as a nicely formatted document; there are calculated and looked-up elements in the manner of a spreadsheet; we want the whole set of invoices to be queryable subsequently, in the manner of a database.

Here’s an invoice, as a DITA document, shown in XXE:

The little green ‘f’ icons represent the formulas, held as XML Processing Instructions. These are ignored in subsequent transformations (to final formats, such as PDF). You double-click an icon to edit the formula.

The first one (before the table) is today’s date: =today(). The ones in the Cost column are simple arithmetic: =($C2 * $D2), etc. Column and row headers can be displayed optionally:

The Product Description and Unit Price formulas are more interesting, since they are lookups in another document, containing the product catalog. Here’s the formula in B2:

The first thing to notice is that we can have multi-line formulas, with ‘let’ definitions preceding the actual formula. (The “…” is really a full file path – I’ve elided it for compactness). The id of the element with the product description is the product code appended with “_desc”. This is then retrieved from the product catalog by matching the id attribute (@id) with the constructed value ($id). (The back-quotes indicate ‘raw’ XPath, rather than XXE formula language).

Here’s the Product catalog (not very extensive!):

The formulas here are used not to calculate visible values, but to construct values for the id attribute. For example, in B2:

Note the id attribute picked from the drop-down list. In Excel terms, this is rather like having a formula that constructs a Range name. It means that the ids for cells in column B and C always follow the product codes in column A. I think this is rather neat.

Back in the invoice, the Total Cost formula sums the values in the Cost column (E) – see the first scrrenshot. We could do this with a table/column reference, but an alternative is to tag the Cost cells with a common attribute value. In DITA, @outputclass allows a kind of informal specialization (we can’t use @id, as this must be unique within a document). Here, we can set @outputclass = ‘cost’. Now, the Total Cost formula sums all elements with this attribute value, wherever they are in the document:

=sum(`//*[@outputclass='cost']`)

That’s it, in terms of the documents. We can then generate formatted output, as we require.

The database aspect comes if the invoices are put into an XML database, such as XMLMind’s Qizx (Free Engine edition). This provides indexing and querying, using the XQuery language. We can then calculate aggregated values, for example by customer and product. Here’s a simple query to calculate the total invoiced for a given product:

xquery version "1.0";

let $prod := "PR01"    (:edit this:)

let $costs :=
 for $row in //strow
 let $cost := $row/stentry[@outputclass='cost']
 where $row/stentry[1]/text() = $prod
 return $cost
return ($prod, sum($costs))

strow is a simple-table row, stentry is a cell. One could, of course, get a lot fancier, and produce proper date-based reports.

There’s an interesting contrast here with how we would do this in Excel. If each invoice is a separate Workbook, we would need to provide some collation mechanism for the data, to get it into a single source for pivot tables, etc. – either in a single workbook, or in Access. I think that where we have a large number of computationally relatively simple documents, the XML approach is quite attractive.

Spreadsheets in XML

In my work with DITA documentation, I use XMLMind’s XML Editor (XXE) – and very good it is too. The professional version comes with an Integrated Spreadsheet Engine, which I have just recently taken a look at. I think it’s rather interesting, particularly how the approach differs from Excel or similar traditional spreadhseets.

We are not talking here about Excel (2007+) using an XML-based file format. In this, the information is structured using elements that relate to spreadsheets – such as <worksheet> and <row> – not to the content domain (invoices, timesheets, product specifications, whatever). That means that the information is not in practice accessible to users (as opposed to tool developers).

What XXE is addressing is the insertion of computed element content into a ‘user level’ XML document. Such a document could be in XHTML or DITA (XHTML is obviously a ‘final’ format; DITA is a ‘source’ format for conversion into various final formats, such as XHTML, PDF, CHM).

The idea is that the XML contains formulas as Processing Instructions. PIs are interpreted by the XXE application, but are ignored by XML processing tools and web browsers. A formula PI generates content (a result value) which is inserted between the PI and the end of the enclosing element. In practice, a formula will provide content for a lowest-level element such as a table entry, or a text phrase. Here is a simple-table entry:

<stentry>
<?xxe-formula formula=’=($[+0,3] * $[+0,4])’?>
150
</stentry>

This is in column 5 of a table, and is multiplying the values in columns 3 and 4, to produce the value 150. XXE provides a sugared syntax for references within a table, which is essentially the same as an Excel formula. So the above formula would actually be written as:

=($C2 * $D2)

with the relative addressing of the rows allowing the formula to be copy-pasted in subsequent rows.

In XML, any element can have an id attribute. For a table element, this allows us to reference a value in a table from outside (from ordinary text elements). For a leaf element, this allows us to reference a value by name. Suppose that invoice_table contains details of an invoice, and cell D7 contains the tax. We could have a formula:

=invoice_table!$D$7

or, better:

=$(tax)

There are, as you would expect, a reasonable number of built-in functions, in the usual categories.

An interesting twist is that a formula can set not just element content, but alternatively an element attribute. The most obvious use of this is to generate id attributes. For example, suppose that I have a table with Product Codes in column A, and Unit Prices in column B. Then, in the B2 cell, I can have this formula setting the id attribute:

=($A2 & “_unitprice”)

(and similarly for the rest of column B). Now, if I want to look up a unit price, I simply construct the id from the product code and the suffix, and access the element directly by id – no need for a VLOOKUP function.

So far, the main difference from Excel is that we are not restricted to using  tabular structures (worksheets, in Excel). We can have values and formulas anywhere in a document structure. For erxample, I could drop a calculated value into an ordinary free-text paragraph, and pick up that value elsewhere in the document.

However, the really different aspect is that the formulas are based on the XPath 1.0 language – the Excel-like syntax is just cosmetic. XPath is a pattern-based query language that treats XML documents as trees of nodes-with-attributes. Here’s an example-based tutorial. This means that a formula can operate on a set of nodes (returned from an XPath expression), without knowing how many there are or where they are in a document.

We can tag values (wherever they are) using an attribute. In DITA, we could use @outputclass (not @id, as this must be unique). For example, I could tag various elements (possibly table cells, possibly not) with @outputclass = ‘cost’, and then sum these using the formula:

=sum(`//*[@outputclass='cost']`)

The backquotes encapsulate an XPath expression (as opposed to the XXE formula language). ‘//*’ means “any element anywhere in the document” (which we then filter by the outputclass attribute). Furthermore, XPath can access not only the current document, but also other documents (as individual documents, not as a document collection – a notion supported by XPath 2.0/XQuery).

I’ll discuss an example of all this in the next post.

Code Quality

An interesting essay on Code Quality by Hew Wolff, over at the Pragmatic Programmers, in their online magazine. The basic thesis is that ‘Good Code Tells the Truth’ (that’s the title). It resonated with me because a week or two back I was looking at an Access/VBA application (not one of mine!). The code in this application was, if I might be frank, dire:

  • no indentation in a lot of procedures
  • no splitting of long SQL strings over multiple lines (when I did reformat one, it ran to 8 lines)
  • no useful comments at either module/form or procedure level
  • form controls not renamed (one form had buttons called Command27, Command28, Command66, Command139 – see what I mean?)
  • obscure variable names – for example, in one procedure, U3, D3, d4, SN3, FN3
  • using 0 and -1 for False and True (why?)
  • copy-paste-and-tweak repetition (e.g. to re-initialize a form).

The piece de resistance was a SELECT-INTO query that created a column using this expression:

[id]-[id]-1

So, arithmetic on Ids? As far as I can tell, this converts an Id value into -1 (presumably treated as True), but leaves Null values as Null, since arithmetic expressions involving Nulls equal Null. Is this a common Access programming technique? Even if it is, couldn’t it be wrapped in a function called something like NonNullToTrue (although this could not have argument or result types more specific than Variant).

This might (or might not) be a clever trick, but not considering the possibility of other people finding it confusing indicates of a poor approach to programming. What the essay mentioned earlier is saying, and what I’m agreeing with here, is that software development is as much about communication of a solution as it is about that solution just working.

Next Page »


 

January 2012
M T W T F S S
« Dec    
 1
2345678
9101112131415
16171819202122
23242526272829
3031  

Follow

Get every new post delivered to your Inbox.