Archive for the 'Excel' Category

Exporting XML From Excel Tables – 3: Self-checking Structure

In the previous post, we were looking at how to export a hierarchical structure as XML, from a set of Excel tables, where the rows are hierarchically numbered (1, 1.2, 1.2.3, etc). The approach in our VBA is to build a hierarchical object model from our worksheet tables, and then ask the root of the model for its XML string. Both the building and the string generation proceed recursively down the structure. Note that this approach works for any hierarchical data – in our case it is software test definitions.

Having built a hierarchical object structure from our worksheet data, there are a number of other things we might want to do:

  • check the test definitions for structural validity
  • generate output in other formats, such as plain text (really plain, not XML)

I’ll talk a bit more about the first of these, here.

The basic idea is that of a class invariant – a property (which can be a conjunction of simpler properties) that always applies to any instance of the class. This idea is, of course, part of Bertrand Meyer’s Design By Contract approach.

Writing good invariants is hard, particularly for objects that are essentially reactive in their behavior. However, in our case here, we have a static hierarchical object structure, and there are some fairly obvious relationships between a node in our hierarchy and its child nodes:

  • there is a least one child node
  • the ID of the current node is a prefix of each child node ID (for example, 2.3 –> 2.3.1)
  • the number of the table rows representing the current node = the sum of the rows for the child nodes + 1 row for the current node

These properties help to ensure that the test data worksheets are ‘well formed’ – with no extra or missing rows, and with proper hierarchical nesting. You can see an example of the last property in this screen shot:

worksheet

  • test condition 101 is represented by rows 3:9 = 7 rows
  • test scenario 101.1 is represented by rows 4:6 = 3 rows
  • test scenario 101.2 is represented by rows 7:9 = 3 rows
  • so 7 = 3 + 3 + 1 row for the test condition itself

The Eiffel language has full support for invariants and other assertions, and it would be the job of the class constructor to establish the invariant. With VBA, we have to take a less rigorous approach, and have the invariant as a normal Boolean-returning function, which should return True any time after the Build procedure has been called. In Eiffel, failure of an invariant or other assertion raises an exception; in VBA, we’ll just put up an error MsgBox.

Here’s the invariant function for a test condition (level #2 in the hierarchy):

Function Invariant() As Boolean
    Invariant = _
            tcond_tid.IsCondition And _
            tcond_scenarios.Count > 0 And _
            tcond_scenarios(1).ScenNum = "1" And _
            ChildrenPartitionRows
    If Not Invariant Then MsgBox "Problem with Condition " & Me.Tid

    Dim ts As TScen
    For Each ts In tcond_scenarios
        Invariant = Invariant And IsChild(ts.Tid, Me.Tid)
        If Not Invariant Then MsgBox ts.Tid & " is not a child number of " & Me.Tid
        Invariant = Invariant And ts.Invariant
    Next
End Function

The first part checks:

  • the Condition’s ID is at the right level for a Condition
  • there is at least one child Scenario, and that the numbering starts from 1
  • that the child Scenarios cover all of the Condition’s rows (minus 1)

The second part iterates over the child Scenarios:

  • checking that the child IDs extend the Condition’s ID (there’s an IsChild relationship between the two IDs)
  • recursively evaluating the invariants for the child Scenarios (and thence recursively down the whole structure)

The ChildrenPartitionRows function is a simple iteration over the child Scenarios:

Function ChildrenPartitionRows() As Boolean
        'Do the rows of my children + 1 = my rows?
    Dim childrenrows As Integer
    Dim ts As TScen
    For Each ts In tcond_scens
        childrenrows = childrenrows + ts.Rowcount
    Next
    ChildrenPartitionRows = (childrenrows = tcond_tblrowcount - 1)

End Function

It might seem that we’re making extra work in writing these invariant and helper functions, but it’s a useful exercise in understanding how our worksheet tables represent a hierarchical structure, and what rules this implies for the worksheet/table layout. Further, we can make the checking directly accessible to a user via a Command Button: the handler simply builds the structure and calls the top-level invariant (on the Test Definition):

Private Sub CheckCommand_Click()
        'Build the TDefn, then evaluate its Invariant (and recursively down the structure)

    Dim td As TDefn
    Set td = New TDefn
    td.Build
    If td.Invariant Then
        MsgBox "Check Complete"
    Else
        MsgBox "Some problems were found"
    End If
End Sub

Note that we’re not being very informative about the precise location of an invariant failure. However, it would be simple enough to record an invariant failure on the object concerned, and then interrogate the structure subsequently to find out the ID of that object.

Finally, as an example of ‘plain text’ output (as opposed to XML), we can have objects provide a simple text representation of themselves – say just ID and Description, with vbTab and vbCrLf. This can be displayed in a TextBox (with property MultiLine = True), to give us a simple Table of Contents. Done within the handler above, this would be:

Worksheets("Front").TocText.value = td.Contents

For a simple one-level ToC, the Test Definition would simply iterate over the child Conditions, concatenating their individual Contents strings.

Exporting XML From Excel Tables – 2: Object Model

In the previous post, we were looking at how to export a hierarchical structure as XML, from a set of Excel tables, where the rows are hierarchically numbered (1, 1.2, 1.2.3, etc). The approach in our VBA is to build a hierarchical object model from our worksheet tables, and then ask the root of the model for its XML string. Both the building and the string generation proceed recursively down the structure. Note that this approach works for any hierarchical data – in our case it is software test definitions.

There are six levels in our hierarchy. #1 is represented by an entire workbook; #2 to #5 are represented by subsets of (contiguous) table rows; #6 is represented by a single cell in a table row (plus its associated header information). There is a VBA Class Module for levels #1-5.

Level #6 corresponds to a test parameter – a single cell, plus its associated column header. The parameters don’t have their own class, as they are really just a couple of values (data, header). The level #5 class holds this information as arrays (corresponding to the relevant data row and the header row).

The multi-level numerical ID is complex enough to warrant a VBA Class Module of its own. This allows us to do things like asking an ID what level it’s at, or asking for a new ID for its Nth child.

Each of the five classes representing a test level has attributes (Private) to record:

  • the chunk of worksheet to which an instance maps: the table (ListObject), the row within the table, and the row count (i.e. the size of the table chunk)
  • its ID (TestID object)
  • data that applies to that level, such as a description, category and mode
  • a Collection of child objects, representing the next level down

Each of the five classes has a Build method (Sub), which is called by its parent object (or the user interface, for the top level). The Build method takes as parameters:

  • its parent’s ID
  • the chunk of worksheet to which an instance maps: table (ListObject), row within the table, and row count (i.e. the size of the table chunk)

The Build method:

  • stores its parameters
  • gets and stores its table-row information
  • creates a new ID for itself (derived from its parent’s ID)
  • calls a method to BuildChildren

The BuildChildren method scans through the table chunk, finding the sub-chunks corresponding to each of its children. For each sub-chunk, it creates a child object of the appropriate class, and asks the child to Build itself, given the appropriate parameters (see above). Determining which level a particular row represents is just a case of seeing which of its first four cells have numbers, and which blanks. There are utility functions for doing this.

A couple of observations regarding the ‘object oriented-ness’ of this. Firstly, in a fully OO language, the Build method would probably be the constructor for an object. The is preferable because it removes the possibility of having ‘not yet built’ objects. However, in VBA, Sub Class_Initialize can not take parameters, so we need a separate method called immediately after the object creation (Set … New …). Secondly, in a fully OO language, one would be tempted to to abstract some of the structure and processing into an abstract class – say, TestNode. However, our structure is not arbitrarily deep – we’re not going to have 20 levels. Also, the top and bottom levels are special cases – actually, only levels #2 and #3 are substantially the same.

Having built an object model, generating the XML is then very easy. Each of the five classes representing a test level has an AsXml function. Here’s the top-level one, on Test Definition:

Function AsXml() As String
        'Element <tdefn>

    AsXml = "<tdefn id='" & tdefn_id & "'>" & vbCrLf & _
                DescAsXml() & _
                ConditionsAsXml() & _
            "</tdefn>"

End Function

Function ConditionsAsXml() As String
        'Element <tcond>*

    ConditionsAsXml = ""
    Dim tc As tcond
    For Each tc In tdefn_conds
        ConditionsAsXml = ConditionsAsXml & tc.AsXml & vbCrLf
    Next

End Function

The second function shows the recursion of AsXml to the child level (here, Test Condition). At other levels, there’s a little extra complexity to allow for optionally outputting secondary information, such as Descriptions.

The top-level XML String is then written out to a file via a TextStream (MS Scripting library). This is wrapped in a function XmlOut.

There is a simple ‘user interface’ worksheet, with a Command Button:

Private Sub ExportCommand_Click()
        'Build the TDefn and write its AsXml string to file

    Dim td As TDefn
    Set td = New TDefn
    td.Build
    XmlOut td.AsXml

The same kind of recursion down our hierarchical object structure can be used to make the structure self-checking (up to a point). I’ll talk about this in the next post.

Back, Briefly: Exporting XML From Excel Tables

It’s been a while since I posted anything to this VBA blog, since my main field of work is XML based documentation (see previous posts). However, I recently did an Excel VBA micro-project, which I think is worth sharing. I’ll describe the problem and outline the solution in this post, and add some details in a couple of further posts.

The overall aim is to provide a way of creating and editing software test definitions, while also making those definitions accessible to a test harness for actually executing the tests. (The solution I’m going to describe is, with appropriate modifications, more widely applicable than just for test definitions).

Since the test definitions fall naturally into a tabular form, Excel is an obvious front-end tool. The problem is how to make the data accessible to a testing harness. One way is to use Apache POI, a Java API for accessing Microsoft Office files – and our developer was keen to use this. However, there are benefits to having a plain text – that is, XML – representation of the data:

  • It can be stored (alongside the Excel workbook) in a version-control repository, and different versions differenced, using a plain text or (better) XML-aware diff tool
  • A body of test definitions can be analyzed and reported on using XQuery
  • Any scripting language with an XML library can access the test data

So, the requirement is for a VBA export facility that takes the test definitions in a workbook, and exports it as a single XML file.

Now, if the test definitions were simply a list of tests, it would be very simple – we might be able to use Excel’s built-in XML facilities. However, our test definitions are quite deeply hierarchical:

  1. Test Definition (for a system component or capability)
  2. Test Condition (a functional area to be tested)
  3. Test Scenario (a particular test configuration)
  4. Test (an atomic test)
  5. Test Step (one or more steps within a Test – typically only one)
  6. Test Parameter (property, value and purpose – in, out, environment)

#1 is represented by an entire workbook; #2 to #5 are represented by subsets of (contiguous) table rows; #6 is represented by a single cell in a table row (plus its associated header information).

#2 to #5 have hierarchical, numeric IDs, spread over the first 4 columns of a table. Where (typically) a Test has only a single Step, they are represented by a single row (that is, the Step is implicit). For example:

worksheet

  • 101 is a Condition,represented by rows 3:9
  • 101.2 is a Scenario,represented by rows 7:9
  • 101.2.1 is a Test, represented by row 8

(each Test here has a single implicit Step). The columns on the right-hand half of the table represent the Parameters. The greying-out of ID numbers is just conditional formatting.

The XML looks much like any XML:

XML

The point to note is that we have a six-level hierarchical structure, with each node in the hierarchy mapping on to a chunk of Excel table.

Now, the solution, in outline. We could pass through the tables and rows, spitting out XML strings as we go. However, this approach is rather limited, as there are a number of other things we might want to do:

  • check the test definitions for structural validity
  • generate output in other formats, such as plain text (really plain, not XML)

As earlier posts have illustrated, I’m keen on building object models that ‘sit behind’ the worksheets, and provide the application functionality. This seems to be a classic application for this. Each item in our test definition hierarchy is represented by an object (of the appropriate class), and is built from a chunk of worksheet table.

We start by creating a Test Definition object, and asking it to build itself. It goes through the worksheets and tables, finding the chunks that correspond to Conditions (using the ID level). It creates Test Condition objects, tells them to build themselves from the appropriate chunks, and adds them into a Collection.

Each Test Condition object goes through its chunk, finding the sub-chunks that correspond to Scenarios (using the ID level). It creates Test Scenario objects, tells them to build themselves from the appropriate sub-chunks, and adds them into a Collection. And so on, down the levels…

Having built the object hierarchy, we can do various things with it. Primarily, we want an XML string, which we can write out to a file. We simply ask the top-level Test Definition object for its XML string. This asks each of its Test Conditions for their XML strings, concatenates them successively, and wraps the result in its own Definition-level XML element (<tdefn> – see above).

Each Test Condition asks its Test Scenarios for their XML strings, concatenates them successively, and wraps the result in its own Condition-level XML element (<tcond> – see above). And so on, down the levels… A few other elements, such as descriptions, get mixed in, as we go.

Note that the object hierarchy is built anew, each time we invoke an operation – there’s no persistence other than through the table rows. This might seem wasteful, but there’s really no performance issue.

In the next post, I’ll say a bit more about the actual code involved, and I’ll say something about the self-checking functionality in a third post.

Analysing Cross-references Between Documents

I have recently been working on a set of about 200 Use Case documents, converting them into DITA - an XML-based structured document format. In the body text of a Use Case, there can be cross-references to other Use cases – so we have a many-to-many relationship between Use Cases (just as we could between any type of document). It’s interesting to analyse this relationship, to see which Use Cases have lots of outgoing references, and which Use Cases are referenced by many other Use Cases.

This kind of totalling is Pivot Table territory. So the question arises of how to contruct the basic data about the cross-references – a table of A-references-B records – and how to get it into Excel.

A key advantage of XML-based documents is that you can import them into an XML database and run queries over them, where the queries are written in XQuery. Such queries can generate output as XML, HTML or plain text. For our purpose, a simple route is to output an HTML table. This can then be imported by Excel via >Get External Data >From Web. The URL for the import points to our XQuery file, within the database server:

http://roy-laptop:18080/qizx/xqs/ucrefs.xq

Here, the database is XMLMind’s Qizx, running in server mode (locally on my laptop, in this case). This URL can be used like any other in a browser interface, with the results of the query being displayed within the browser. In our case, we are going to give the URL to Excel.

Here’s the code of the query:

xquery version "1.0";
declare option output:method "html";

declare variable $ucs := collection("/UC");

declare function local:uc_refs()
{
  for $ucref in $ucs//uc-ref
  let $file := document-uri(root($ucref))
  let $ucreftext := $ucref/text()
  order by $file
  return
  <tr><td>{$file}</td><td>{$ucreftext}</td><td>1</td></tr>
};

let $ucrefs := local:uc_refs()
return
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>UC References</title>
  </head>
  <body>
    <table>
      <thead>
	    <tr><td>Source</td><td>Target</td><td>Token</td></tr>
      </thead>
      <tbody>
	    {$ucrefs}
      </tbody>
    </table>
  </body>
</html>

Variables start with $. The function collection (line 4) scopes the query to a particular directory tree of document files. The expression ‘$ucs//uc-ref’ binds to each occurrence of the <uc-ref> element, anywhere in the collection of documents. $file is the document in which a <uc-ref> is found. The third column of the table (Token) is just the number 1 – this gives me some data for the pivot table to sum.

Now we can do the import into Excel:

Finally, I can insert a pivot table based on this data, subtotalling by either Source or Target, or putting both dimensions on the same axis. I love it when stuff just works…

Out of curiosity, I ran the import with the macro recorder on, to see what the VBA looks like:

Sub Import()
    With ActiveSheet.QueryTables.Add( _
        Connection:= _
            "URL;http://roy-laptop:18080/qizx/xq/ucrefs.xq",
            Destination:=Range ("$A$1"))
        .Name = "ucrefs.xq"
        .FieldNames = True
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .WebSelectionType = xlAllTables
        .WebFormatting = xlWebFormattingNone
        .WebPreFormattedTextToColumns = True
        .WebConsecutiveDelimitersAsOne = True
        .WebSingleBlockTextImport = False
        .WebDisableDateRecognition = False
        .WebDisableRedirections = False
        .Refresh BackgroundQuery:=False
    End With
End Sub

So it’s just creation of a QueryTable, with the URL given as a Connection string. The WebSelectionType property is ‘all tables’, although I thought I clicked the lower of the two yellow arrows in the Import dialog. Clearly, we could wrap this up a bit more: selecting from a list of queries, creating a new Worksheet, etc.

Having established the basic XML –> XQuery –> HTML Table –> Excel/VBA architecture, I think I’ll experiment more with this in due course.

Incidentally, this is my 100th post (Rah! Rah!), having taken almost exactly two years. And most enjoyable it’s been – at least for me :-)

An Alternative to VBA in Excel?

Back last year, Gareth Hayter of Slyce Software emailed me about VScript, an alternative to VBA for writing functions and macros in Excel. Unfortunately, for various reasons, I have not been able to look into it in any detail, but it certainly sounds interesting.

VScript is based on Excel-Dna, which is a project to integrate .NET into Excel. The aim is thus to allow you to write functions and macros in C# or VB.Net – so, presumably it is aimed at developers familiar with those languages, and preferring them over VBA. Code is claimed to run considerably faster than VBA (not hard to believe).  VScript provides an IDE integrated with Excel, as with VBA. You can use VScript to create stand-alone XLL add-ins, and digitally sign these.

What about VSTO? Gareth says:

VScript is different from VSTO (Visual Studio Tools for Office) in many ways:

  • VSTO is for professional programmers: It can be complex and confusing to use and requires a lot of time, effort and money to learn.
  • VSTO is expensive: You need to buy Visual Studio® which starts at $799.
  • VSTO is not built into Excel®: It is an external program and works in a very different way from VBA.
  • VSTO projects are complicated to deploy: VSTO is not integrated into Excel, which means that it’s difficult to make a few changes and test them quickly. It requires ClickOnce deployment.
  • VSTO cannot create User-Defined Functions (UDFs): With VSTO, you can’t create functions that you can use in a similar way to SUM() and AVG().

One to watch …

Pivot Table Calculated Column – 2

Thanks to Dick Kusleika and Andy Pope for offering solutions to my pivot table problem: how to have a column that shows the difference between Max and Min aggregations (or other ‘custom’ aggregation).

Dick’s solution is to have an additional column in the source data: effectively calculating Max, Min and MaxMinDiff before aggregation in the pivot table (so the Max and Min are not done in the pivot table).

I have actually added three columns, to break the formula down: the Max and Min columns are just intermediates. The Max and Min values are within Month: since the records are sorted by Month, this can be seen easily.

Here’s the pivot table:

In the pivot table, the Max and Min columns are normal aggregations of the Sales column in the data: they are not derived from the Max and Min columns in the data.

However, the MaxMinDiff column in the pivot table is derived from the Diff column in the data. Since all the values underlying a MaxMinDiff cell have the same value (e.g. Data H2:H6 underlies Pivot Table D3), we can use any of the non-summing aggregation functions: Max, Min, even Average. I changed the lable to MaxMinDiff, because “Max of Diff” looks weird.

Now the formulas in the additional data columns. The Max and Min formulas are single-cell array formulas:

{=MAX([Sales] * ([Month] = SalesTable[[#This Row],[Month]]))}

{=MIN(IF([Month] = SalesTable[[#This Row],[Month]],[Sales],””))}

The MAX formula is multiplying the Sales column by an array of Booleans, where the latter come from testing the Month column against this row’s Month. I do like the table-range addressing, although I don’t see why a formula within SalesTable needs to refer to SalesTable. The resulting array contains Sales figures for a specific month, and zeroes for the other months; this is then MAXed.

The MIN formula is slightly trickier, because we can not leave the false-case zeroes, as these would of course be the minimum value. However, MIN of a number and an empty value is the number, so “” is substituted for false/zero.

The Diff column formula is then just a simple difference:

=SalesTable[[#This Row],[Max]]-SalesTable[[#This Row],[Min]]

(Again, I don’t see why the syntax requires the reference to SalesTable).

This all works fine, provided that Month is the only visible dimension in our pivot table. This is because the data Max and Min values are within Month, as is obvious from the formulas. However, we could equally well want to see MaxMinDiff for values within Region or within Sales Rep. This would necessitate further columns, or changing the formulas to test a different dimension (e.g. Region).

Andy Pope suggested a VBA solution, handling the Worksheet_PivotTableUpdate event:

Private Sub Worksheet_PivotTableUpdate(ByVal Target As PivotTable)

    With Target.DataBodyRange
        With .Columns(.Columns.Count)
            .Offset(-1, 1).Resize(1, 1) = "MaxMinDiff"
            .Offset(, 1).FormulaR1C1 = "=RC[-2]-RC[-1]"
        End With
    End With
End Sub

This certainly inserts a column of formulas, but it is next to, rather than part of the pivot table. Consequently, it does not adapt to changes in the pivot table layout. Also, it requires the Max and Min columns to be in the pivot table (I guess you could live with that).

On a completely different tack, a PivotCache can be based on a Recordset (typically sourced by a SELECT query on some data). I wonder if it’s possible to manipulate such a Recordset to add columns…

Pivot Table Calculated Column?

My excuse is that I’ve got a heavy cold, and my brain is definitely not firing on all cylinders. I’ve found a problem which ought to have a simple Excel (not VBA) solution, but I can’t work out what it is. Dick, Jon, anybody: please put me out of my misery.

I’m currently revising some documentation for a leading financial OLAP framework (ActivePivot), so I’m much involved with pivot tables. I was toying with one of my own, which summarized Sales data (in Excel, not OLAP) – the usual stuff:

(Yes, it’s old data, and I need to change the dates). Simple enough to have Min and Max of Sales:

But suppose that I want to see the difference between Max and Min. Surely there is somewhere to enter the formula ‘=Max – Min’ (or possibly something like ‘=Values.Max – Values.Min’). But where?

It’s not a Calculated Field, since that is essentially a calculated column on the data table. I could, for example, have a field ‘= Sales / Contacts’.

It’s not a Calculated Item, since that aggregates values for dimension members. For example, I could aggregate Jan+Feb+Mar into a Q1 value. I could even do differences, for example, Jul-Dec minus Jan-Jun. However, although the Min and Max Values (fields) appear as column headers, they are not a dimension.

I could have formulas that sit outside the pivot table:

  • using GETPIVOTDATA on individual cells
  • using an array formula to subtract the whole Min array from the whole Max array.

However, these do not, of course, adapt to changes in the pivot table layout, such as filtering member values.

Since an aggregation function could be any function that takes an array of values and returns a single value, we could have more sophisticated derived measures based on the built-in aggregation functions (Max, Average, Stdev, etc).

The OLAP framework mentioned above is of course a much more sophisticated beast. There are ways to calculate additional fields before aggregation in the hypercube (cf. pivot cache), custom aggregation functions within the hypercube, and post-processor functions to calculate additional values. I think the last of these would be the appropriate one here.

Is there a VBA solution? Right now, I haven’t penetrated deep enough into the Byzantine complexities  of the pivot table object model. A quick look indicated that PivotField.function is an enumeration of the built-in types. So there’s no obvious hook for a custom aggregation function.

Now I’m going away to cough a lot…


November 2014
M T W T F S S
« Dec    
 12
3456789
10111213141516
17181920212223
24252627282930

Follow

Get every new post delivered to your Inbox.