Posts Tagged 'pivot'

Analysing Cross-references Between Documents

I have recently been working on a set of about 200 Use Case documents, converting them into DITA - an XML-based structured document format. In the body text of a Use Case, there can be cross-references to other Use cases – so we have a many-to-many relationship between Use Cases (just as we could between any type of document). It’s interesting to analyse this relationship, to see which Use Cases have lots of outgoing references, and which Use Cases are referenced by many other Use Cases.

This kind of totalling is Pivot Table territory. So the question arises of how to contruct the basic data about the cross-references – a table of A-references-B records – and how to get it into Excel.

A key advantage of XML-based documents is that you can import them into an XML database and run queries over them, where the queries are written in XQuery. Such queries can generate output as XML, HTML or plain text. For our purpose, a simple route is to output an HTML table. This can then be imported by Excel via >Get External Data >From Web. The URL for the import points to our XQuery file, within the database server:


Here, the database is XMLMind’s Qizx, running in server mode (locally on my laptop, in this case). This URL can be used like any other in a browser interface, with the results of the query being displayed within the browser. In our case, we are going to give the URL to Excel.

Here’s the code of the query:

xquery version "1.0";
declare option output:method "html";

declare variable $ucs := collection("/UC");

declare function local:uc_refs()
  for $ucref in $ucs//uc-ref
  let $file := document-uri(root($ucref))
  let $ucreftext := $ucref/text()
  order by $file

let $ucrefs := local:uc_refs()
<html xmlns="">
    <title>UC References</title>

Variables start with $. The function collection (line 4) scopes the query to a particular directory tree of document files. The expression ‘$ucs//uc-ref’ binds to each occurrence of the <uc-ref> element, anywhere in the collection of documents. $file is the document in which a <uc-ref> is found. The third column of the table (Token) is just the number 1 – this gives me some data for the pivot table to sum.

Now we can do the import into Excel:

Finally, I can insert a pivot table based on this data, subtotalling by either Source or Target, or putting both dimensions on the same axis. I love it when stuff just works…

Out of curiosity, I ran the import with the macro recorder on, to see what the VBA looks like:

Sub Import()
    With ActiveSheet.QueryTables.Add( _
        Connection:= _
            Destination:=Range ("$A$1"))
        .Name = "ucrefs.xq"
        .FieldNames = True
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .WebSelectionType = xlAllTables
        .WebFormatting = xlWebFormattingNone
        .WebPreFormattedTextToColumns = True
        .WebConsecutiveDelimitersAsOne = True
        .WebSingleBlockTextImport = False
        .WebDisableDateRecognition = False
        .WebDisableRedirections = False
        .Refresh BackgroundQuery:=False
    End With
End Sub

So it’s just creation of a QueryTable, with the URL given as a Connection string. The WebSelectionType property is ‘all tables’, although I thought I clicked the lower of the two yellow arrows in the Import dialog. Clearly, we could wrap this up a bit more: selecting from a list of queries, creating a new Worksheet, etc.

Having established the basic XML –> XQuery –> HTML Table –> Excel/VBA architecture, I think I’ll experiment more with this in due course.

Incidentally, this is my 100th post (Rah! Rah!), having taken almost exactly two years. And most enjoyable it’s been – at least for me :-)

Pivot Table Calculated Column – 2

Thanks to Dick Kusleika and Andy Pope for offering solutions to my pivot table problem: how to have a column that shows the difference between Max and Min aggregations (or other ‘custom’ aggregation).

Dick’s solution is to have an additional column in the source data: effectively calculating Max, Min and MaxMinDiff before aggregation in the pivot table (so the Max and Min are not done in the pivot table).

I have actually added three columns, to break the formula down: the Max and Min columns are just intermediates. The Max and Min values are within Month: since the records are sorted by Month, this can be seen easily.

Here’s the pivot table:

In the pivot table, the Max and Min columns are normal aggregations of the Sales column in the data: they are not derived from the Max and Min columns in the data.

However, the MaxMinDiff column in the pivot table is derived from the Diff column in the data. Since all the values underlying a MaxMinDiff cell have the same value (e.g. Data H2:H6 underlies Pivot Table D3), we can use any of the non-summing aggregation functions: Max, Min, even Average. I changed the lable to MaxMinDiff, because “Max of Diff” looks weird.

Now the formulas in the additional data columns. The Max and Min formulas are single-cell array formulas:

{=MAX([Sales] * ([Month] = SalesTable[[#This Row],[Month]]))}

{=MIN(IF([Month] = SalesTable[[#This Row],[Month]],[Sales],””))}

The MAX formula is multiplying the Sales column by an array of Booleans, where the latter come from testing the Month column against this row’s Month. I do like the table-range addressing, although I don’t see why a formula within SalesTable needs to refer to SalesTable. The resulting array contains Sales figures for a specific month, and zeroes for the other months; this is then MAXed.

The MIN formula is slightly trickier, because we can not leave the false-case zeroes, as these would of course be the minimum value. However, MIN of a number and an empty value is the number, so “” is substituted for false/zero.

The Diff column formula is then just a simple difference:

=SalesTable[[#This Row],[Max]]-SalesTable[[#This Row],[Min]]

(Again, I don’t see why the syntax requires the reference to SalesTable).

This all works fine, provided that Month is the only visible dimension in our pivot table. This is because the data Max and Min values are within Month, as is obvious from the formulas. However, we could equally well want to see MaxMinDiff for values within Region or within Sales Rep. This would necessitate further columns, or changing the formulas to test a different dimension (e.g. Region).

Andy Pope suggested a VBA solution, handling the Worksheet_PivotTableUpdate event:

Private Sub Worksheet_PivotTableUpdate(ByVal Target As PivotTable)

    With Target.DataBodyRange
        With .Columns(.Columns.Count)
            .Offset(-1, 1).Resize(1, 1) = "MaxMinDiff"
            .Offset(, 1).FormulaR1C1 = "=RC[-2]-RC[-1]"
        End With
    End With
End Sub

This certainly inserts a column of formulas, but it is next to, rather than part of the pivot table. Consequently, it does not adapt to changes in the pivot table layout. Also, it requires the Max and Min columns to be in the pivot table (I guess you could live with that).

On a completely different tack, a PivotCache can be based on a Recordset (typically sourced by a SELECT query on some data). I wonder if it’s possible to manipulate such a Recordset to add columns…

Pivot Table Calculated Column?

My excuse is that I’ve got a heavy cold, and my brain is definitely not firing on all cylinders. I’ve found a problem which ought to have a simple Excel (not VBA) solution, but I can’t work out what it is. Dick, Jon, anybody: please put me out of my misery.

I’m currently revising some documentation for a leading financial OLAP framework (ActivePivot), so I’m much involved with pivot tables. I was toying with one of my own, which summarized Sales data (in Excel, not OLAP) – the usual stuff:

(Yes, it’s old data, and I need to change the dates). Simple enough to have Min and Max of Sales:

But suppose that I want to see the difference between Max and Min. Surely there is somewhere to enter the formula ‘=Max – Min’ (or possibly something like ‘=Values.Max – Values.Min’). But where?

It’s not a Calculated Field, since that is essentially a calculated column on the data table. I could, for example, have a field ‘= Sales / Contacts’.

It’s not a Calculated Item, since that aggregates values for dimension members. For example, I could aggregate Jan+Feb+Mar into a Q1 value. I could even do differences, for example, Jul-Dec minus Jan-Jun. However, although the Min and Max Values (fields) appear as column headers, they are not a dimension.

I could have formulas that sit outside the pivot table:

  • using GETPIVOTDATA on individual cells
  • using an array formula to subtract the whole Min array from the whole Max array.

However, these do not, of course, adapt to changes in the pivot table layout, such as filtering member values.

Since an aggregation function could be any function that takes an array of values and returns a single value, we could have more sophisticated derived measures based on the built-in aggregation functions (Max, Average, Stdev, etc).

The OLAP framework mentioned above is of course a much more sophisticated beast. There are ways to calculate additional fields before aggregation in the hypercube (cf. pivot cache), custom aggregation functions within the hypercube, and post-processor functions to calculate additional values. I think the last of these would be the appropriate one here.

Is there a VBA solution? Right now, I haven’t penetrated deep enough into the Byzantine complexities¬† of the pivot table object model. A quick look indicated that PivotField.function is an enumeration of the built-in types. So there’s no obvious hook for a custom aggregation function.

Now I’m going away to cough a lot…

April 2014
« Dec    


Get every new post delivered to your Inbox.