Posts Tagged 'Excel'

Analysing Cross-references Between Documents

I have recently been working on a set of about 200 Use Case documents, converting them into DITA - an XML-based structured document format. In the body text of a Use Case, there can be cross-references to other Use cases – so we have a many-to-many relationship between Use Cases (just as we could between any type of document). It’s interesting to analyse this relationship, to see which Use Cases have lots of outgoing references, and which Use Cases are referenced by many other Use Cases.

This kind of totalling is Pivot Table territory. So the question arises of how to contruct the basic data about the cross-references – a table of A-references-B records – and how to get it into Excel.

A key advantage of XML-based documents is that you can import them into an XML database and run queries over them, where the queries are written in XQuery. Such queries can generate output as XML, HTML or plain text. For our purpose, a simple route is to output an HTML table. This can then be imported by Excel via >Get External Data >From Web. The URL for the import points to our XQuery file, within the database server:

http://roy-laptop:18080/qizx/xqs/ucrefs.xq

Here, the database is XMLMind’s Qizx, running in server mode (locally on my laptop, in this case). This URL can be used like any other in a browser interface, with the results of the query being displayed within the browser. In our case, we are going to give the URL to Excel.

Here’s the code of the query:

xquery version "1.0";
declare option output:method "html";

declare variable $ucs := collection("/UC");

declare function local:uc_refs()
{
  for $ucref in $ucs//uc-ref
  let $file := document-uri(root($ucref))
  let $ucreftext := $ucref/text()
  order by $file
  return
  <tr><td>{$file}</td><td>{$ucreftext}</td><td>1</td></tr>
};

let $ucrefs := local:uc_refs()
return
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>UC References</title>
  </head>
  <body>
    <table>
      <thead>
	    <tr><td>Source</td><td>Target</td><td>Token</td></tr>
      </thead>
      <tbody>
	    {$ucrefs}
      </tbody>
    </table>
  </body>
</html>

Variables start with $. The function collection (line 4) scopes the query to a particular directory tree of document files. The expression ‘$ucs//uc-ref’ binds to each occurrence of the <uc-ref> element, anywhere in the collection of documents. $file is the document in which a <uc-ref> is found. The third column of the table (Token) is just the number 1 – this gives me some data for the pivot table to sum.

Now we can do the import into Excel:

Finally, I can insert a pivot table based on this data, subtotalling by either Source or Target, or putting both dimensions on the same axis. I love it when stuff just works…

Out of curiosity, I ran the import with the macro recorder on, to see what the VBA looks like:

Sub Import()
    With ActiveSheet.QueryTables.Add( _
        Connection:= _
            "URL;http://roy-laptop:18080/qizx/xq/ucrefs.xq",
            Destination:=Range ("$A$1"))
        .Name = "ucrefs.xq"
        .FieldNames = True
        .RowNumbers = False
        .FillAdjacentFormulas = False
        .PreserveFormatting = True
        .RefreshOnFileOpen = False
        .BackgroundQuery = True
        .RefreshStyle = xlInsertDeleteCells
        .SavePassword = False
        .SaveData = True
        .AdjustColumnWidth = True
        .RefreshPeriod = 0
        .WebSelectionType = xlAllTables
        .WebFormatting = xlWebFormattingNone
        .WebPreFormattedTextToColumns = True
        .WebConsecutiveDelimitersAsOne = True
        .WebSingleBlockTextImport = False
        .WebDisableDateRecognition = False
        .WebDisableRedirections = False
        .Refresh BackgroundQuery:=False
    End With
End Sub

So it’s just creation of a QueryTable, with the URL given as a Connection string. The WebSelectionType property is ‘all tables’, although I thought I clicked the lower of the two yellow arrows in the Import dialog. Clearly, we could wrap this up a bit more: selecting from a list of queries, creating a new Worksheet, etc.

Having established the basic XML –> XQuery –> HTML Table –> Excel/VBA architecture, I think I’ll experiment more with this in due course.

Incidentally, this is my 100th post (Rah! Rah!), having taken almost exactly two years. And most enjoyable it’s been – at least for me :-)

An Alternative to VBA in Excel?

Back last year, Gareth Hayter of Slyce Software emailed me about VScript, an alternative to VBA for writing functions and macros in Excel. Unfortunately, for various reasons, I have not been able to look into it in any detail, but it certainly sounds interesting.

VScript is based on Excel-Dna, which is a project to integrate .NET into Excel. The aim is thus to allow you to write functions and macros in C# or VB.Net – so, presumably it is aimed at developers familiar with those languages, and preferring them over VBA. Code is claimed to run considerably faster than VBA (not hard to believe).  VScript provides an IDE integrated with Excel, as with VBA. You can use VScript to create stand-alone XLL add-ins, and digitally sign these.

What about VSTO? Gareth says:

VScript is different from VSTO (Visual Studio Tools for Office) in many ways:

  • VSTO is for professional programmers: It can be complex and confusing to use and requires a lot of time, effort and money to learn.
  • VSTO is expensive: You need to buy Visual Studio® which starts at $799.
  • VSTO is not built into Excel®: It is an external program and works in a very different way from VBA.
  • VSTO projects are complicated to deploy: VSTO is not integrated into Excel, which means that it’s difficult to make a few changes and test them quickly. It requires ClickOnce deployment.
  • VSTO cannot create User-Defined Functions (UDFs): With VSTO, you can’t create functions that you can use in a similar way to SUM() and AVG().

One to watch …

Pivot Table Calculated Column – 2

Thanks to Dick Kusleika and Andy Pope for offering solutions to my pivot table problem: how to have a column that shows the difference between Max and Min aggregations (or other ‘custom’ aggregation).

Dick’s solution is to have an additional column in the source data: effectively calculating Max, Min and MaxMinDiff before aggregation in the pivot table (so the Max and Min are not done in the pivot table).

I have actually added three columns, to break the formula down: the Max and Min columns are just intermediates. The Max and Min values are within Month: since the records are sorted by Month, this can be seen easily.

Here’s the pivot table:

In the pivot table, the Max and Min columns are normal aggregations of the Sales column in the data: they are not derived from the Max and Min columns in the data.

However, the MaxMinDiff column in the pivot table is derived from the Diff column in the data. Since all the values underlying a MaxMinDiff cell have the same value (e.g. Data H2:H6 underlies Pivot Table D3), we can use any of the non-summing aggregation functions: Max, Min, even Average. I changed the lable to MaxMinDiff, because “Max of Diff” looks weird.

Now the formulas in the additional data columns. The Max and Min formulas are single-cell array formulas:

{=MAX([Sales] * ([Month] = SalesTable[[#This Row],[Month]]))}

{=MIN(IF([Month] = SalesTable[[#This Row],[Month]],[Sales],””))}

The MAX formula is multiplying the Sales column by an array of Booleans, where the latter come from testing the Month column against this row’s Month. I do like the table-range addressing, although I don’t see why a formula within SalesTable needs to refer to SalesTable. The resulting array contains Sales figures for a specific month, and zeroes for the other months; this is then MAXed.

The MIN formula is slightly trickier, because we can not leave the false-case zeroes, as these would of course be the minimum value. However, MIN of a number and an empty value is the number, so “” is substituted for false/zero.

The Diff column formula is then just a simple difference:

=SalesTable[[#This Row],[Max]]-SalesTable[[#This Row],[Min]]

(Again, I don’t see why the syntax requires the reference to SalesTable).

This all works fine, provided that Month is the only visible dimension in our pivot table. This is because the data Max and Min values are within Month, as is obvious from the formulas. However, we could equally well want to see MaxMinDiff for values within Region or within Sales Rep. This would necessitate further columns, or changing the formulas to test a different dimension (e.g. Region).

Andy Pope suggested a VBA solution, handling the Worksheet_PivotTableUpdate event:

Private Sub Worksheet_PivotTableUpdate(ByVal Target As PivotTable)

    With Target.DataBodyRange
        With .Columns(.Columns.Count)
            .Offset(-1, 1).Resize(1, 1) = "MaxMinDiff"
            .Offset(, 1).FormulaR1C1 = "=RC[-2]-RC[-1]"
        End With
    End With
End Sub

This certainly inserts a column of formulas, but it is next to, rather than part of the pivot table. Consequently, it does not adapt to changes in the pivot table layout. Also, it requires the Max and Min columns to be in the pivot table (I guess you could live with that).

On a completely different tack, a PivotCache can be based on a Recordset (typically sourced by a SELECT query on some data). I wonder if it’s possible to manipulate such a Recordset to add columns…

Pivot Table Calculated Column?

My excuse is that I’ve got a heavy cold, and my brain is definitely not firing on all cylinders. I’ve found a problem which ought to have a simple Excel (not VBA) solution, but I can’t work out what it is. Dick, Jon, anybody: please put me out of my misery.

I’m currently revising some documentation for a leading financial OLAP framework (ActivePivot), so I’m much involved with pivot tables. I was toying with one of my own, which summarized Sales data (in Excel, not OLAP) – the usual stuff:

(Yes, it’s old data, and I need to change the dates). Simple enough to have Min and Max of Sales:

But suppose that I want to see the difference between Max and Min. Surely there is somewhere to enter the formula ‘=Max – Min’ (or possibly something like ‘=Values.Max – Values.Min’). But where?

It’s not a Calculated Field, since that is essentially a calculated column on the data table. I could, for example, have a field ‘= Sales / Contacts’.

It’s not a Calculated Item, since that aggregates values for dimension members. For example, I could aggregate Jan+Feb+Mar into a Q1 value. I could even do differences, for example, Jul-Dec minus Jan-Jun. However, although the Min and Max Values (fields) appear as column headers, they are not a dimension.

I could have formulas that sit outside the pivot table:

  • using GETPIVOTDATA on individual cells
  • using an array formula to subtract the whole Min array from the whole Max array.

However, these do not, of course, adapt to changes in the pivot table layout, such as filtering member values.

Since an aggregation function could be any function that takes an array of values and returns a single value, we could have more sophisticated derived measures based on the built-in aggregation functions (Max, Average, Stdev, etc).

The OLAP framework mentioned above is of course a much more sophisticated beast. There are ways to calculate additional fields before aggregation in the hypercube (cf. pivot cache), custom aggregation functions within the hypercube, and post-processor functions to calculate additional values. I think the last of these would be the appropriate one here.

Is there a VBA solution? Right now, I haven’t penetrated deep enough into the Byzantine complexities  of the pivot table object model. A quick look indicated that PivotField.function is an enumeration of the built-in types. So there’s no obvious hook for a custom aggregation function.

Now I’m going away to cough a lot…

‘Database’ Functions in Excel

In Excel there is a category of functions called “Database”. This has always been a confusing term, as now (in 2007+) they relate to Tables (a.k.a. ListObjects, in VBA), or equivalent Ranges. The function names all start with ‘D’ (for database): DCOUNT, DSUM, etc. These are equivalents of the standard aggregation functions COUNT, SUM, etc (by aggregation, I mean a function that takes a set of values, and returns a single value).

What these D-functions allow is selective aggregation of data from a table, given a set of criteria – in other words, the combination of an advanced filter with an aggregation, without the need for the filtered data to reside on a worksheet.

Here’s an example:

The simple case is where the criteria are directly related to the data in the table. So, to count Bob’s records, the criteria range is B2:B3, and the DCOUNT formula is in B5:

=DCOUNT(Database,SalesTable[[#Headers],[Contacts]],B2:B3)

The reference to “SalesTable[[#Headers],[Contacts]]” is just because I clicked on D9 – I could have just put “Contacts” (but I hate typing :-)). Note that the blank headers in row 2 are required as part of the criteria range (row 1 is just labels). Unsurpisingly, there are 12 records – one for each month of the year.

However, this kind of subtotalling by ‘dimension member’ (to use the OLAP term) is what pivot tables do. More interesting is when the criteria involve a Boolean-returning formula applied to each record.

For example, suppose that we want to know how many reps had more than average number of contacts (yes, I know it will be about half). So in D7 we have:

=AVERAGE(SalesTable[Contacts])

and in D3:

=D10>$D$7

The criterion is a formula referring to the first data record, which returns a Boolean. I think of this formula as being filled down through the table records, as if in an additional column (note the absolute/relative addresses).

Even more interesting is when the criteria involve functions on multiple fields in a record. For example, suppose that we are interested in records where Contacts2 is greater than Contacts - these are the guys who are improving.

So in E3, we have:

=E10>D10

(again, referring relatively to the first record).

This is fine, but more generally, our Boolean function operates on the entire record. So in F3, we have:

=TestRecord(A10:E10)

where TestRecord is a pure VBA function:

Public Function TestRecord(rec As Range) As Boolean

    TestRecord = rec.Item(4) < rec.Item(5)

End Function

Clearly, the body of this function can be as complicated as we wish, using the cells in rec. However, it depends on a particular ordering of the columns within the table. It is possible that we would want to use this function on tables that have the Contacts columns in different positions. So, an improved version intersects the named columns with the given rec.

Public Function TestRecord2(rec As Range) As Boolean

    Dim table As ListObject
    Set table = rec.ListObject

    Dim arg1 As Range
    Set arg1 = Intersect(table.ListColumns("Contacts").Range, _
                             rec)

    Dim arg2 As Range
    Set arg2 = Intersect(table.ListColumns("Contacts2").Range, _
                             rec)

    TestRecord2 = arg1.Value &lt; arg2.Value
End Function

Since the supplied Range is the first record in the table, we could simply pass the Table name to the function and derive from that the Range for the first record (but it’s getting late…).

Constraints on the Database functions are:

  • the criteria have to be Ranges (and thus on a Worksheet), not in-code arrays
  • they have to be vertically-oriented, contiguous Ranges (so can not be filled down).

A somewhat more esoteric limitation is that you can not plug a custom aggregation function into the basic D-function mechanism – DWEIRDSUM, perhaps.

Period Overlap Function

Over at Daily Dose of Excel, Dick provided a solution to the problem of counting active entities (e.g. customers) within a given period. That is, we want to count a customer if their period of activity overlaps with the given period. The key to the solution is SUMPRODUCT.

An alternative solution, which is a bit more work, but is more flexible, is to write a custom VBA function to compare two periods: that is, start1-finish1 versus start2-finish2.

It is then easy to add an extra column:

PFirst and PLast specify the period of interest. The formula in D6 is:

=OVERLAPS(PFirst,PLast,B6,C6)

You might be tempted to write [First] and [Last], rather than B6 and C6. Unfortunately the implicit intersection of formula row with a table column does not work with the custom function – the function gets arrays of values from the whole columns.

However, we could have the OVERLAPS formulas off to the right of the table, say in column H. The formula can then be:

=OVERLAPS(PFirst,PLast,
Table1[[#This Row],[First]],
Table1[[#This Row],[Last]])

(Yes, I should have given the table a meaningful name…)

The formula in D3 is just:

=COUNTIF(Table1[Overlaps],TRUE)

We can also use the OVERLAPS function to provide conditional formatting – here in column A (see screenshot above). The formula is exactly the same as in column D:

=OVERLAPS(PFirst,PLast,B6,C6)

Finally, here’s the code for the function:

Public Function OVERLAPS(start1, finish1, start2, finish2) As Variant
'Does the period start1-finish1 overlap
'with the period start2-finish2?
'pre: start1 <= finish1 and start2 <= finish2 

  If IsDate(start1) And IsDate(finish1) And _
     IsDate(start2) And IsDate(finish2) Then
        OVERLAPS = _
        (start2 >= start1 And start2 <= finish1) Or _
        (finish2 >= start1 And finish2 <= finish1)
  Else
        OVERLAPS = CVErr(2001)
  End If
End Function

The error appears on the worksheet as a #VALUE, rather than a #N/A – I didn’t look into this deeply.

Intentional Circular References

Over a lunchtime pastie, I was recently browsing my copy of Walkenbach’s Formulas. I found myself looking at the chapter on Intentional Circular References.

Here, JW gives an example of choosing a set of unique random numbers – that is, having chosen a number once, it can not be chosen again. Each random number has an associated COUNTIF formula, which counts occurrences of the associated number; if this is 1 for each number, then we have a solution. In a random-number cell (e.g. A1), we have a formula of the form:

=IF(<no solution>, <new number using RAND>, A1)

These formulas generate candidate solutions at random, until one is found. In >Tools >Options >Calculation (v2003), you need to allow Iteration, and set a sufficient number of Maximum Iterations.

I thought I’d try my own example: allocating resources to a set of tasks (with pre-determined start and finish dates). Here’s the worksheet:

  • The task durations are set out with 1 values, in C6:J9 – each column represents a day, with time increasing from left to right
  • The allocated resources are identified in A6:A9 (range Alloc)
  • The pool of  available resources is in A12:A14 (range Resource)
  • The resource utilization per-day is calculated in C12:J14, using an array formula (see below), plus some conditional formatting
  • The maximum utilization for a resource is calculated in B12:B14 (range Max)
  • The maximum of the maxima is calculated in B16 (range Supermax). If this is 1 – that is, no resource is multi-booked for any day – then we have a solution.

The resource utilization in C12, etc, is calculated using the array formula:

{=SUM(($A12=Alloc)*C$6:C$9)}

That is, the ones in C6:C9 are summed only if the resource in the corresponding row of Alloc matches the one in A12. You can see how the allocations of res2 (in rows 6 and 8 ) are unioned in row 13.

The random-allocation formula in A6 is:

=IF(SuperMax<>1, NewResource, A6)

The reference to Supermax is what creates the circularity.

NewResource is a named formula:

=INDEX(Resource, INT(RAND()*Rescount) + 1)

This generates a random index into the Resource list, whose size is Rescount.

Having found a solution, recalculation just keeps preserving the existing values. To force a new solution, you first have to ‘break’ the existing solution. This is done using the named value Started, in A2. Each of the green cells in C6:J9 has the formula ‘=Started’. Setting Started to zero makes the allocation fail to find a solution (since Supermax can never = 1), leaving Alloc with an arbitrary set of resource ids. Then setting Started = 1 allows a new solution to be found.

I’m not sure how useful this technique would be for realistically sized problems, since it’s basically trying solutions at random. Anyway, here’s the workbook, if you fancy a play.

Navigating Part Relationships – 2

In the previous post, I introduced the idea of a cursor object that allows you to navigate around a graph of component-part relationships:

Navigation could be:

  • down the ‘contains’ relationships
  • along the ‘used in’ relationships
  • back through the history of visited records.

We’ll also have a Reset operation, which jumps back to the start of the table, and clears the history.

The navigation is done using keyboard shortcuts (but could be done via a form).

The core of the design is a Class Module GraphCursor. This provides our four operations: CursorDown, CursorNextUse, CursorBack and CursorReset. When an instance of this class initializes, it points itself at ListObjects(1) on the ActiveSheet (there is only one sheet, to keep things simple), and does a CursorReset.

A GraphCursor maintains a history of visted components using a linked List class (a simple chain of ListItem objects – nothing to do with ListObject a.k.a. Table).

CursorDown and CursorNextUse use Range.Find with the currently selected cell value. I assume this is pretty efficient – and in any case is neater in code terms than iterating through rows explicitly. The Range for CursorDown is just the first column (Component); the Range for CursorNextUse is the Part columns below the row of the current selection.

Something needs to create and hold on to an instance of GraphCursor – this is a standard module Client. This also provides procedures that are called via the keyboard shortcuts.

Public gc As New GraphCursor

Sub GCDown()
    gc.CursorDown
End Sub

'similarly for the other three operations

The keyboard shortcuts are set up on Workbook_Open:

Private Sub Workbook_Open()
    Application.OnKey "^+d", "GCDown"
    Application.OnKey "^+n", "GCNextUse"
    Application.OnKey "^+r", "GCReset"
    Application.OnKey "^+b", "GCBack"

    GCReset
End Sub

Here’s the workbook (Excel 2007).

Since each navigation step is worked out dynamically, we can insert or delete records from our table as we like. This would not be the case for an indexed solution (maintaining a map of Component to row number).

You could argue that each Component-Part relationship should be a separate record – for example, [A, B], [A, C]. This would allow us to associate quantities or other information with each relationship. In this case, we would also need a CursorNextPart operation.

Navigating Part Relationships

This is something I was thinking about before a short holiday break (Strasbourg – very pretty) …

Suppose that we have an inventory of Components, perhaps in a manufacturing or engineering context. For each Component, we record its Parts (that is, sub-components). For example:

Sorry, my creativity did not extend to dreaming up meaningful data – think of car parts, or something.

So, an A is a top-level component, and consists of Bs and Cs (we do not specify quantities – an issue deferred); a B consists of Fs and Gs; E, G and H are elementary components (indicated by underscore). A component, such as B or H, can be used by multiple higher-level components. The restriction to 3 part types per component is just for simplicity.

These relationships form a directed, acyclic graph (there are top-level and bottom-level components, and no component can contain itself directly or indirectly).

Even with a very simple example, like above, it is quite difficult to see what contains what. With a realistically large inventory, say hundreds or thousands of records, it would be next to impossible.

One solution would be to support navigation around the graph of relationships:

  • Drill-down the ‘contains’ relationship: for example, [A contains B, C] >>> [B contains F, G] >>> [F contains H] >>> [H is elementary]
  • Move along the ‘used in’ relationship: for example, [B used in A] >>> [B used in D] >>> [no more uses of B]

Moving along the ‘used in’ relationship for ‘_’ (underscore) takes us through all the elementary components.

When you get the end of a ‘used in’ chain, you could have the option to start again from the beginning (as in textual searches):

In addition, we could have navigation ‘back’ through the sequence of visited Components.

We could offer these operations and the resulting ‘current record’ to a user via a form. However, it seems more lightweight to use keyboard shortcuts.

So, how might we do this? As previous postings have illustrated, I like the idea of a cursor object, which provides move operations of some kind, and access to the ‘current record’. In this case, the cursor position will be indicated visibly to the user by explicitly Selecting the relevant cell (and thus scrolling to the record).

An instance of the cursor object can be created on Workbook_Open, and linked on Class_Initialize to a the table. For simplicity, let’s assume that we have a single worksheet with a single table. As usual, there are advantages to using a v2007 Table (in VBA a ListObject), but you could get it to work in v2003.

More on the design/code in the next posting.

Multi-Area Ranges

We tend to think of a Range as a contiguous set of cells: a single cell, part (or all) of a column, part (or all) of a row, or a rectangular area. However, a Range is one or more or these. The contiguous areas of a Range are held in its Areas collection. If Areas.Count > 1 then we have a multi-area Range.

To construct a multi-area Range from several single-area ranges, you can use the Union operation, which is provided by the Application (so, effectively, global).

I came across an example of this in customizing a chart to show only the most recent N records in the source data. Here’s an accumulating table of monthly records:

I want a clustered column chart, with Months as the category (X) axis, and In/Out as the series.

If I wanted data from all records, I would set the chart’s source data range to be the entire table, including the headers, which Excel interprets to give the series names (that is, In and Out).

If I want my chart to show only the most recent 3 months (say), then I need to specify the source data range as the union of A1:C1 and A6:C8. Here’s the Source Data Range as it appears in the chart configuration:

=’Data 2010’!$A$1:$C$1,’Data 2010’!$A$6:$C$8

Here’s a procedure that constructs and sets the range for a given number of months:

Sub LastNMonths(nmonths As Integer)
'pre: 0 < nmonths
'  <= Range("A1").CurrentRegion.Rows.Count - 1

    Dim ws As Worksheet
    Dim firstrow As Integer, lastrow As Integer

    Set ws = Worksheets("Data 2010")

    lastrow = ws.Range("A1").CurrentRegion.Rows.Count
    firstrow = lastrow - nmonths + 1

    Dim headers As Range
    Set headers = ws.Range("A1:C1")

    Dim rng As Range
    Set rng = ws.Range("A" & firstrow & ":C" & lastrow)

    Set rng = Union(headers, rng)

    Dim cht As Chart
    Set cht = Charts("Chart 2010")
    With cht
        .SetSourceData rng
        .ChartTitle.Text = _
               MonthsTitle(ws, firstrow, lastrow)
    End With
End Sub

MonthsTitle simply constructs the chart title from the first and last months (from column A) – for example, “Report May-10 to Jul-10″.

As I did this initially in v2003, I have not made use of the ‘table-ness’ of the data – hence the Range(“A1″).CurrentRegion.

The procedure was called from a ‘Refresh Chart’ command button on a worksheet, with the nmonths value provided via an InputBox.

Before anyone points this out, let me say that you can do a certain amount of this kind of thing using formulas in the Chart configuration (using OFFSET, etc) – see Walkenbach’s ‘Formulas’ book. However, I think it’s clearer, and certainly more flexible to do it in VBA.


July 2014
M T W T F S S
« Dec    
 123456
78910111213
14151617181920
21222324252627
28293031  

Follow

Get every new post delivered to your Inbox.