Archive for September, 2010

Pivot Table Calculated Column?

My excuse is that I’ve got a heavy cold, and my brain is definitely not firing on all cylinders. I’ve found a problem which ought to have a simple Excel (not VBA) solution, but I can’t work out what it is. Dick, Jon, anybody: please put me out of my misery.

I’m currently revising some documentation for a leading financial OLAP framework (ActivePivot), so I’m much involved with pivot tables. I was toying with one of my own, which summarized Sales data (in Excel, not OLAP) – the usual stuff:

(Yes, it’s old data, and I need to change the dates). Simple enough to have Min and Max of Sales:

But suppose that I want to see the difference between Max and Min. Surely there is somewhere to enter the formula ‘=Max – Min’ (or possibly something like ‘=Values.Max – Values.Min’). But where?

It’s not a Calculated Field, since that is essentially a calculated column on the data table. I could, for example, have a field ‘= Sales / Contacts’.

It’s not a Calculated Item, since that aggregates values for dimension members. For example, I could aggregate Jan+Feb+Mar into a Q1 value. I could even do differences, for example, Jul-Dec minus Jan-Jun. However, although the Min and Max Values (fields) appear as column headers, they are not a dimension.

I could have formulas that sit outside the pivot table:

  • using GETPIVOTDATA on individual cells
  • using an array formula to subtract the whole Min array from the whole Max array.

However, these do not, of course, adapt to changes in the pivot table layout, such as filtering member values.

Since an aggregation function could be any function that takes an array of values and returns a single value, we could have more sophisticated derived measures based on the built-in aggregation functions (Max, Average, Stdev, etc).

The OLAP framework mentioned above is of course a much more sophisticated beast. There are ways to calculate additional fields before aggregation in the hypercube (cf. pivot cache), custom aggregation functions within the hypercube, and post-processor functions to calculate additional values. I think the last of these would be the appropriate one here.

Is there a VBA solution? Right now, I haven’t penetrated deep enough into the Byzantine complexities  of the pivot table object model. A quick look indicated that PivotField.function is an enumeration of the built-in types. So there’s no obvious hook for a custom aggregation function.

Now I’m going away to cough a lot…

‘Database’ Functions in Excel

In Excel there is a category of functions called “Database”. This has always been a confusing term, as now (in 2007+) they relate to Tables (a.k.a. ListObjects, in VBA), or equivalent Ranges. The function names all start with ‘D’ (for database): DCOUNT, DSUM, etc. These are equivalents of the standard aggregation functions COUNT, SUM, etc (by aggregation, I mean a function that takes a set of values, and returns a single value).

What these D-functions allow is selective aggregation of data from a table, given a set of criteria – in other words, the combination of an advanced filter with an aggregation, without the need for the filtered data to reside on a worksheet.

Here’s an example:

The simple case is where the criteria are directly related to the data in the table. So, to count Bob’s records, the criteria range is B2:B3, and the DCOUNT formula is in B5:

=DCOUNT(Database,SalesTable[[#Headers],[Contacts]],B2:B3)

The reference to “SalesTable[[#Headers],[Contacts]]” is just because I clicked on D9 – I could have just put “Contacts” (but I hate typing :-)). Note that the blank headers in row 2 are required as part of the criteria range (row 1 is just labels). Unsurpisingly, there are 12 records – one for each month of the year.

However, this kind of subtotalling by ‘dimension member’ (to use the OLAP term) is what pivot tables do. More interesting is when the criteria involve a Boolean-returning formula applied to each record.

For example, suppose that we want to know how many reps had more than average number of contacts (yes, I know it will be about half). So in D7 we have:

=AVERAGE(SalesTable[Contacts])

and in D3:

=D10>$D$7

The criterion is a formula referring to the first data record, which returns a Boolean. I think of this formula as being filled down through the table records, as if in an additional column (note the absolute/relative addresses).

Even more interesting is when the criteria involve functions on multiple fields in a record. For example, suppose that we are interested in records where Contacts2 is greater than Contacts - these are the guys who are improving.

So in E3, we have:

=E10>D10

(again, referring relatively to the first record).

This is fine, but more generally, our Boolean function operates on the entire record. So in F3, we have:

=TestRecord(A10:E10)

where TestRecord is a pure VBA function:

Public Function TestRecord(rec As Range) As Boolean

    TestRecord = rec.Item(4) < rec.Item(5)

End Function

Clearly, the body of this function can be as complicated as we wish, using the cells in rec. However, it depends on a particular ordering of the columns within the table. It is possible that we would want to use this function on tables that have the Contacts columns in different positions. So, an improved version intersects the named columns with the given rec.

Public Function TestRecord2(rec As Range) As Boolean

    Dim table As ListObject
    Set table = rec.ListObject

    Dim arg1 As Range
    Set arg1 = Intersect(table.ListColumns("Contacts").Range, _
                             rec)

    Dim arg2 As Range
    Set arg2 = Intersect(table.ListColumns("Contacts2").Range, _
                             rec)

    TestRecord2 = arg1.Value &lt; arg2.Value
End Function

Since the supplied Range is the first record in the table, we could simply pass the Table name to the function and derive from that the Range for the first record (but it’s getting late…).

Constraints on the Database functions are:

  • the criteria have to be Ranges (and thus on a Worksheet), not in-code arrays
  • they have to be vertically-oriented, contiguous Ranges (so can not be filled down).

A somewhat more esoteric limitation is that you can not plug a custom aggregation function into the basic D-function mechanism – DWEIRDSUM, perhaps.

Period Overlap Function

Over at Daily Dose of Excel, Dick provided a solution to the problem of counting active entities (e.g. customers) within a given period. That is, we want to count a customer if their period of activity overlaps with the given period. The key to the solution is SUMPRODUCT.

An alternative solution, which is a bit more work, but is more flexible, is to write a custom VBA function to compare two periods: that is, start1-finish1 versus start2-finish2.

It is then easy to add an extra column:

PFirst and PLast specify the period of interest. The formula in D6 is:

=OVERLAPS(PFirst,PLast,B6,C6)

You might be tempted to write [First] and [Last], rather than B6 and C6. Unfortunately the implicit intersection of formula row with a table column does not work with the custom function – the function gets arrays of values from the whole columns.

However, we could have the OVERLAPS formulas off to the right of the table, say in column H. The formula can then be:

=OVERLAPS(PFirst,PLast,
Table1[[#This Row],[First]],
Table1[[#This Row],[Last]])

(Yes, I should have given the table a meaningful name…)

The formula in D3 is just:

=COUNTIF(Table1[Overlaps],TRUE)

We can also use the OVERLAPS function to provide conditional formatting – here in column A (see screenshot above). The formula is exactly the same as in column D:

=OVERLAPS(PFirst,PLast,B6,C6)

Finally, here’s the code for the function:

Public Function OVERLAPS(start1, finish1, start2, finish2) As Variant
'Does the period start1-finish1 overlap
'with the period start2-finish2?
'pre: start1 <= finish1 and start2 <= finish2 

  If IsDate(start1) And IsDate(finish1) And _
     IsDate(start2) And IsDate(finish2) Then
        OVERLAPS = _
        (start2 >= start1 And start2 <= finish1) Or _
        (finish2 >= start1 And finish2 <= finish1)
  Else
        OVERLAPS = CVErr(2001)
  End If
End Function

The error appears on the worksheet as a #VALUE, rather than a #N/A – I didn’t look into this deeply.


September 2010
M T W T F S S
« Aug   Oct »
 12345
6789101112
13141516171819
20212223242526
27282930  

Follow

Get every new post delivered to your Inbox.