Posts Tagged '2007'

Pivot Table Calculated Column – 2

Thanks to Dick Kusleika and Andy Pope for offering solutions to my pivot table problem: how to have a column that shows the difference between Max and Min aggregations (or other ‘custom’ aggregation).

Dick’s solution is to have an additional column in the source data: effectively calculating Max, Min and MaxMinDiff before aggregation in the pivot table (so the Max and Min are not done in the pivot table).

I have actually added three columns, to break the formula down: the Max and Min columns are just intermediates. The Max and Min values are within Month: since the records are sorted by Month, this can be seen easily.

Here’s the pivot table:

In the pivot table, the Max and Min columns are normal aggregations of the Sales column in the data: they are not derived from the Max and Min columns in the data.

However, the MaxMinDiff column in the pivot table is derived from the Diff column in the data. Since all the values underlying a MaxMinDiff cell have the same value (e.g. Data H2:H6 underlies Pivot Table D3), we can use any of the non-summing aggregation functions: Max, Min, even Average. I changed the lable to MaxMinDiff, because “Max of Diff” looks weird.

Now the formulas in the additional data columns. The Max and Min formulas are single-cell array formulas:

{=MAX([Sales] * ([Month] = SalesTable[[#This Row],[Month]]))}

{=MIN(IF([Month] = SalesTable[[#This Row],[Month]],[Sales],””))}

The MAX formula is multiplying the Sales column by an array of Booleans, where the latter come from testing the Month column against this row’s Month. I do like the table-range addressing, although I don’t see why a formula within SalesTable needs to refer to SalesTable. The resulting array contains Sales figures for a specific month, and zeroes for the other months; this is then MAXed.

The MIN formula is slightly trickier, because we can not leave the false-case zeroes, as these would of course be the minimum value. However, MIN of a number and an empty value is the number, so “” is substituted for false/zero.

The Diff column formula is then just a simple difference:

=SalesTable[[#This Row],[Max]]-SalesTable[[#This Row],[Min]]

(Again, I don’t see why the syntax requires the reference to SalesTable).

This all works fine, provided that Month is the only visible dimension in our pivot table. This is because the data Max and Min values are within Month, as is obvious from the formulas. However, we could equally well want to see MaxMinDiff for values within Region or within Sales Rep. This would necessitate further columns, or changing the formulas to test a different dimension (e.g. Region).

Andy Pope suggested a VBA solution, handling the Worksheet_PivotTableUpdate event:

Private Sub Worksheet_PivotTableUpdate(ByVal Target As PivotTable)

    With Target.DataBodyRange
        With .Columns(.Columns.Count)
            .Offset(-1, 1).Resize(1, 1) = "MaxMinDiff"
            .Offset(, 1).FormulaR1C1 = "=RC[-2]-RC[-1]"
        End With
    End With
End Sub

This certainly inserts a column of formulas, but it is next to, rather than part of the pivot table. Consequently, it does not adapt to changes in the pivot table layout. Also, it requires the Max and Min columns to be in the pivot table (I guess you could live with that).

On a completely different tack, a PivotCache can be based on a Recordset (typically sourced by a SELECT query on some data). I wonder if it’s possible to manipulate such a Recordset to add columns…


‘Database’ Functions in Excel

In Excel there is a category of functions called “Database”. This has always been a confusing term, as now (in 2007+) they relate to Tables (a.k.a. ListObjects, in VBA), or equivalent Ranges. The function names all start with ‘D’ (for database): DCOUNT, DSUM, etc. These are equivalents of the standard aggregation functions COUNT, SUM, etc (by aggregation, I mean a function that takes a set of values, and returns a single value).

What these D-functions allow is selective aggregation of data from a table, given a set of criteria – in other words, the combination of an advanced filter with an aggregation, without the need for the filtered data to reside on a worksheet.

Here’s an example:

The simple case is where the criteria are directly related to the data in the table. So, to count Bob’s records, the criteria range is B2:B3, and the DCOUNT formula is in B5:


The reference to “SalesTable[[#Headers],[Contacts]]” is just because I clicked on D9 – I could have just put “Contacts” (but I hate typing :-)). Note that the blank headers in row 2 are required as part of the criteria range (row 1 is just labels). Unsurpisingly, there are 12 records – one for each month of the year.

However, this kind of subtotalling by ‘dimension member’ (to use the OLAP term) is what pivot tables do. More interesting is when the criteria involve a Boolean-returning formula applied to each record.

For example, suppose that we want to know how many reps had more than average number of contacts (yes, I know it will be about half). So in D7 we have:


and in D3:


The criterion is a formula referring to the first data record, which returns a Boolean. I think of this formula as being filled down through the table records, as if in an additional column (note the absolute/relative addresses).

Even more interesting is when the criteria involve functions on multiple fields in a record. For example, suppose that we are interested in records where Contacts2 is greater than Contacts – these are the guys who are improving.

So in E3, we have:


(again, referring relatively to the first record).

This is fine, but more generally, our Boolean function operates on the entire record. So in F3, we have:


where TestRecord is a pure VBA function:

Public Function TestRecord(rec As Range) As Boolean

    TestRecord = rec.Item(4) < rec.Item(5)

End Function

Clearly, the body of this function can be as complicated as we wish, using the cells in rec. However, it depends on a particular ordering of the columns within the table. It is possible that we would want to use this function on tables that have the Contacts columns in different positions. So, an improved version intersects the named columns with the given rec.

Public Function TestRecord2(rec As Range) As Boolean

    Dim table As ListObject
    Set table = rec.ListObject

    Dim arg1 As Range
    Set arg1 = Intersect(table.ListColumns("Contacts").Range, _

    Dim arg2 As Range
    Set arg2 = Intersect(table.ListColumns("Contacts2").Range, _

    TestRecord2 = arg1.Value &lt; arg2.Value
End Function

Since the supplied Range is the first record in the table, we could simply pass the Table name to the function and derive from that the Range for the first record (but it’s getting late…).

Constraints on the Database functions are:

  • the criteria have to be Ranges (and thus on a Worksheet), not in-code arrays
  • they have to be vertically-oriented, contiguous Ranges (so can not be filled down).

A somewhat more esoteric limitation is that you can not plug a custom aggregation function into the basic D-function mechanism – DWEIRDSUM, perhaps.

Navigating Part Relationships – 2

In the previous post, I introduced the idea of a cursor object that allows you to navigate around a graph of component-part relationships:

Navigation could be:

  • down the ‘contains’ relationships
  • along the ‘used in’ relationships
  • back through the history of visited records.

We’ll also have a Reset operation, which jumps back to the start of the table, and clears the history.

The navigation is done using keyboard shortcuts (but could be done via a form).

The core of the design is a Class Module GraphCursor. This provides our four operations: CursorDown, CursorNextUse, CursorBack and CursorReset. When an instance of this class initializes, it points itself at ListObjects(1) on the ActiveSheet (there is only one sheet, to keep things simple), and does a CursorReset.

A GraphCursor maintains a history of visted components using a linked List class (a simple chain of ListItem objects – nothing to do with ListObject a.k.a. Table).

CursorDown and CursorNextUse use Range.Find with the currently selected cell value. I assume this is pretty efficient – and in any case is neater in code terms than iterating through rows explicitly. The Range for CursorDown is just the first column (Component); the Range for CursorNextUse is the Part columns below the row of the current selection.

Something needs to create and hold on to an instance of GraphCursor – this is a standard module Client. This also provides procedures that are called via the keyboard shortcuts.

Public gc As New GraphCursor

Sub GCDown()
End Sub

'similarly for the other three operations

The keyboard shortcuts are set up on Workbook_Open:

Private Sub Workbook_Open()
    Application.OnKey "^+d", "GCDown"
    Application.OnKey "^+n", "GCNextUse"
    Application.OnKey "^+r", "GCReset"
    Application.OnKey "^+b", "GCBack"

End Sub

Here’s the workbook (Excel 2007).

Since each navigation step is worked out dynamically, we can insert or delete records from our table as we like. This would not be the case for an indexed solution (maintaining a map of Component to row number).

You could argue that each Component-Part relationship should be a separate record – for example, [A, B], [A, C]. This would allow us to associate quantities or other information with each relationship. In this case, we would also need a CursorNextPart operation.

Sub-sequence Iterator 3

In the previous post, we were looking at an iterator object that bound to successive sub-sequences of records in a table, according to a StartCondition and an EndCondition. In that example, the sub-sequences were contiguous, non-overlapping, and so could be summarized to give sub-totals in a very flexible way.

A slightly different example is where the start and end records for each sub-sequence are all intermixed. Here’s a time-ordered sequence of start and finish actions, for some identified entities:

You can see that B and C both start and finish before A finishes.

Suppose that want to find out the duration of each activity. We then need to find matching start and finish records, and calculate the difference between the two dates. In this case, the sub-sequence is the ‘lifetime’ of the entity. Although we are not interested in anything other than the start and finish for an entity, it’s conceivable in other examples that we might be interested in the intermediate values.

In the previous example, the sub-sequences were non-overlapping, so the MoveNext operation started looking for a new StartRow immediately after the old EndRow. Here, the sub-sequences can overlap, so the MoveNext operation starts looking for a new StartRow immediately after the old StartRow . At present, each entity is assumed to have a finish record.

StartCondition simply looks for a ‘start’ record; when it finds one, it records the Id. EndCondition looks for a ‘finish’ record with the current Id.

The client code creates an iterator on the Actions table (see above), moves it through the table, calculates the duration for each entity (finish – start + 1), and writes a summary record:

Here’s the client procedure:

Sub Run()
    Dim ssi As SSI2
    Set ssi = New SSI2
    With ssi
        .WsName = "Actions"    'Worksheet
        .TableName = "ActionTable"
        .KeyName1 = "Id"       'Key column
        .KeyName2 = "Action"   'there are 2 key columns
        .ValName = "Date"      'Value column
    End With

    Worksheets("Summary2").Activate  'for the output

    Dim row As Long
    row = 1
    ssi.MoveNext     'to first sub-sequence
    Do Until ssi.IsAfter
        Range("A" & row).Value = ssi.Label
        Range("B" & row).Value = Summarized(ssi.values)
        row = row + 1
End Sub

Private Function Summarized(values As Variant) As Variant
'pre: IsArray(values)
    Dim last As Long
    last = UBound(values)
    Summarized = values(last) - values(1) + 1
End Function

I think this is quite an interesting example, as it’s not obvious to me how you would do it with formulas, even array ones.

At present there are two different classes for the two examples (non-overlapping and overlapping) – hence SSI2, above. Maybe they should both implement a common interface. In practice, though I think you’d know which flavour you needed. Here’s the workbook with both examples, if you want to try it out.

Sub-sequence Iterator 2

Following on from the previous post, I’ve had a go at implementing a Sub-sequence Iterator (SSI) class, which operates on a named Table ( a 2007 table, a.k.a. ListObject). There are two variants, depending on whether the sub-sequences can overlap or not.

The first case is similar to the earlier data partitioning, where we want to break our table rows into contiguous, non-overlapping sub-sequences, and perform some summarizing operation over the values – for example, subtotalling. Here’s the data table:

The Date column is our ‘key’, by which we work out the sub-sequences. In this case it’s just a weekly incrementing date. The Value column contains the data that we want to summarize. For simplicity, let’s say that we want to summarize by month – although it could be something more interesting.

Our SSI object has a MoveNext operation, called by some client code, which makes it iterate through the sub-sequences. For each sub-sequence, the SSI makes available to the client code:

  • a Label that can be used to identify each sub-sequence
  • an array of Values for the sub-sequence.

The client code can then pass the Values to a function Summarized, which in this case calculates a (sub)total. The client code then writes the Label and the Summarized value to another worksheet:

Here, the Label is just “Month ” prepended to the month number (January -> 1, etc).

Now, you might be wondering why we don’t just add a Month column to our table, and generate a pivot table, aggregating by month.

Firstly, we might want to partition our dates in many different ways: by month, quarter, Mayan Lunar Year, and so on. It would be cumbersome to have to add a column, with appropriate values, for each of these partitionings. Indeed, our table might be linked to some external data source which has only the raw data.

Secondly, a particular partitioning might depend on the data values themselves (for example, a negative value terminates a sub-sequence), or some dynamic value, such as today’s date.

To customize the SSI class for a particular table, we need to write:

  • Function StartCondition(row As Long) As Boolean
  • Function EndCondition(row As Long) As Boolean
  • Function Label() As String

In this example, StartCondition and EndCondition are both looking for changes in Month. When the SSI finds new rows satisfying these conditions, it sets StartRow and EndRow, which then delimit a new array of Values.

I’ll talk about the second variant – where we can have overlapping subsequences – in the next post.

Worksheet Hierarchies – part 2

In Part 1, I was toying with the idea of worksheets that are hierarchically related. In this example, the hierarchy reflects geographical areas: Country, Area, City. Data is held by the leaf-node worksheets, and aggregated by the higher-level worksheets. Using the INDIRECT function means that the aggregation formulas are generic, and so can be used as-is from a template worksheet.

This is clearly pivot table territory, so what’s the difference with this approach?

Firstly, the scenario I have in mind is where you are collating worksheets sent to you in separate workbooks (for example, the London office sends you London.xlsx, containing a single London worksheet). You have already developed a bit of VBA to copy each of the worksheets into a single collated workbook (UK.xlsx, say). However, the incoming worksheets are effectively reports, and might contain information in addition to the raw Sales and Costs figures. So I can’t immediately collate the Sales/Costs information into a single table: if I need to, that’s a subsequent step (see later).

Secondly, each leaf-node worksheet only needs to know about its parent node. For example, London knows it’s in the South area, but not about its place in the overall hierarchy. Similarly, a non-leaf node knows only about its parent and its children. This makes it easy to change the hierarchy – for example, breaking London into Central London and Outer London.

In contrast, if the basic data were in table form, the levels of the hierarchy would need to be explicit, as columns:

If I decided to break London into Central and Outer zones, then I’d need an additional  ‘Zone’ column, with values (a lot ‘n/a’) for each record.

However, the tabular data is clearly useful for generating pivot table summaries – here’s one at the Area level:

or with multiple fields on the Row axis, to get a hierarchical summary:

So, I’m thinking that it would be useful have some VBA to do the second stage of collation, of the leaf-node Sales/Costs data into a single table. This sounds like it will involve a tree walk around the worksheet hierarchy. Perhaps in two passes: once to find out the maximum depth (and thus the columns of the table), and again to copy the data into the table.

Worksheet Hierarchies

Back from the Easter break now…

It occurred to me that a common requirement is for a workbook to reflect a hierarchical structure – for example:

  • divisions/units within an organization
  • geographical areas (country, region, city)
  • product/part breakdown
  • reporting period (year, quarter, month).

An obvious approach is to have a worksheet for each node in the hierarchy, with the leaf nodes holding the data, and the non-leaf nodes aggregating the data using formulas. (Note that I’m only considering single-dimension aggregation here, not hypercubes).

So I thought it would be useful to have a template worksheet that can be copied for each node in a hierarchy, and linked in to the hierarchy using the worksheet names. The aggregation formulas are already in place, and pick up the names of child nodes. And to aid navigation, we can have some hyperlinks to root, parent and child nodes.

My example is the start of a geographical hierarchy:

UK is the root; London and Brighton are in the South region. Each node worksheet has been copied from Template. Here’s the UK sheet:

The child nodes (i.e. the regions) are represented in a table, with the names entered in the Region column. The Sales and Costs columns pick up these names, to get the Sales/Costs values from the child worksheets. So, for example, cells E9 and F9 contain the formulas:

=INDIRECT($D9 & “!Sales”)

=INDIRECT($D9 & “!Costs”)

The UK node’s owns Sales and Costs values are just totals of the relevant columns:



The Sales/Costs cells are named ranges, needless to say.

A leaf node worksheet, such as London, has an empty child table, and literal values for Sales and Costs:

I’ve left the formulas in the table in case this node gets decomposed further, but it is in practice empty.

Here’s an intermediate node, for South region:

There are links to the root node (UK), the parent node (also UK here) and the child nodes. Each of these is a HYPERLINK formula that picks up the relevant node (i.e. worksheet ) name. For example, C2 contains the formula:

=HYPERLINK(“[” & Root & “.xlsm”]” & B2 & “!$A$1”, “link”)

Note that HYPERLINK needs a filename, even if the link is internal to this workbook. By convention, the workbook has the same name as the root node. So ‘Root’ is just a range name for B1 on the UK worksheet. The links all go to A1 on the target sheet. Similarly, G9 contains:

=HYPERLINK(“[” & Root & “.xlsm”]” & D9 & “!$A$1”, “link”)

So, to create a hierarchy node, you need to:

  • copy the Template worksheet
  • set Parent, Name, and Children (if any)
  • set Sales and Costs values for leaf nodes.

No VBA so far, but I think it would be useful to build:

  • a hyperlinked Table of Contents just to the leaf nodes – that is, where the data entry happens
  • hierarchical summaries (i.e. each one on a single worksheet), for individual data categories (e.g. Sales).

More in due course.

July 2018
« Dec