Posts Tagged 'Formulas'

Spreadsheets in XML – Part 2

In the previous post, I was looking at the ‘spreadsheet extensions’ provided by XMLMind’s XML Editor (XXE). This allows XPath-based formulas to be inserted into XML documents, not only in tabular elements, but also in free text.

As an example, I mocked up some invoices. An invoice is a good example of a hybrid document: we want to print it out (or PDF it) as a nicely formatted document; there are calculated and looked-up elements in the manner of a spreadsheet; we want the whole set of invoices to be queryable subsequently, in the manner of a database.

Here’s an invoice, as a DITA document, shown in XXE:

The little green ‘f’ icons represent the formulas, held as XML Processing Instructions. These are ignored in subsequent transformations (to final formats, such as PDF). You double-click an icon to edit the formula.

The first one (before the table) is today’s date: =today(). The ones in the Cost column are simple arithmetic: =($C2 * $D2), etc. Column and row headers can be displayed optionally:

The Product Description and Unit Price formulas are more interesting, since they are lookups in another document, containing the product catalog. Here’s the formula in B2:

The first thing to notice is that we can have multi-line formulas, with ‘let’ definitions preceding the actual formula. (The “…” is really a full file path – I’ve elided it for compactness). The id of the element with the product description is the product code appended with “_desc”. This is then retrieved from the product catalog by matching the id attribute (@id) with the constructed value ($id). (The back-quotes indicate ‘raw’ XPath, rather than XXE formula language).

Here’s the Product catalog (not very extensive!):

The formulas here are used not to calculate visible values, but to construct values for the id attribute. For example, in B2:

Note the id attribute picked from the drop-down list. In Excel terms, this is rather like having a formula that constructs a Range name. It means that the ids for cells in column B and C always follow the product codes in column A. I think this is rather neat.

Back in the invoice, the Total Cost formula sums the values in the Cost column (E) – see the first scrrenshot. We could do this with a table/column reference, but an alternative is to tag the Cost cells with a common attribute value. In DITA, @outputclass allows a kind of informal specialization (we can’t use @id, as this must be unique within a document). Here, we can set @outputclass = ‘cost’. Now, the Total Cost formula sums all elements with this attribute value, wherever they are in the document:


That’s it, in terms of the documents. We can then generate formatted output, as we require.

The database aspect comes if the invoices are put into an XML database, such as XMLMind’s Qizx (Free Engine edition). This provides indexing and querying, using the XQuery language. We can then calculate aggregated values, for example by customer and product. Here’s a simple query to calculate the total invoiced for a given product:

xquery version "1.0";

let $prod := "PR01"    (:edit this:)

let $costs :=
 for $row in //strow
 let $cost := $row/stentry[@outputclass='cost']
 where $row/stentry[1]/text() = $prod
 return $cost
return ($prod, sum($costs))

strow is a simple-table row, stentry is a cell. One could, of course, get a lot fancier, and produce proper date-based reports.

There’s an interesting contrast here with how we would do this in Excel. If each invoice is a separate Workbook, we would need to provide some collation mechanism for the data, to get it into a single source for pivot tables, etc. – either in a single workbook, or in Access. I think that where we have a large number of computationally relatively simple documents, the XML approach is quite attractive.

Spreadsheets in XML

In my work with DITA documentation, I use XMLMind’s XML Editor (XXE) – and very good it is too. The professional version comes with an Integrated Spreadsheet Engine, which I have just recently taken a look at. I think it’s rather interesting, particularly how the approach differs from Excel or similar traditional spreadhseets.

We are not talking here about Excel (2007+) using an XML-based file format. In this, the information is structured using elements that relate to spreadsheets – such as <worksheet> and <row> – not to the content domain (invoices, timesheets, product specifications, whatever). That means that the information is not in practice accessible to users (as opposed to tool developers).

What XXE is addressing is the insertion of computed element content into a ‘user level’ XML document. Such a document could be in XHTML or DITA (XHTML is obviously a ‘final’ format; DITA is a ‘source’ format for conversion into various final formats, such as XHTML, PDF, CHM).

The idea is that the XML contains formulas as Processing Instructions. PIs are interpreted by the XXE application, but are ignored by XML processing tools and web browsers. A formula PI generates content (a result value) which is inserted between the PI and the end of the enclosing element. In practice, a formula will provide content for a lowest-level element such as a table entry, or a text phrase. Here is a simple-table entry:

<?xxe-formula formula=’=($[+0,3] * $[+0,4])’?>

This is in column 5 of a table, and is multiplying the values in columns 3 and 4, to produce the value 150. XXE provides a sugared syntax for references within a table, which is essentially the same as an Excel formula. So the above formula would actually be written as:

=($C2 * $D2)

with the relative addressing of the rows allowing the formula to be copy-pasted in subsequent rows.

In XML, any element can have an id attribute. For a table element, this allows us to reference a value in a table from outside (from ordinary text elements). For a leaf element, this allows us to reference a value by name. Suppose that invoice_table contains details of an invoice, and cell D7 contains the tax. We could have a formula:


or, better:


There are, as you would expect, a reasonable number of built-in functions, in the usual categories.

An interesting twist is that a formula can set not just element content, but alternatively an element attribute. The most obvious use of this is to generate id attributes. For example, suppose that I have a table with Product Codes in column A, and Unit Prices in column B. Then, in the B2 cell, I can have this formula setting the id attribute:

=($A2 & “_unitprice”)

(and similarly for the rest of column B). Now, if I want to look up a unit price, I simply construct the id from the product code and the suffix, and access the element directly by id – no need for a VLOOKUP function.

So far, the main difference from Excel is that we are not restricted to using  tabular structures (worksheets, in Excel). We can have values and formulas anywhere in a document structure. For erxample, I could drop a calculated value into an ordinary free-text paragraph, and pick up that value elsewhere in the document.

However, the really different aspect is that the formulas are based on the XPath 1.0 language – the Excel-like syntax is just cosmetic. XPath is a pattern-based query language that treats XML documents as trees of nodes-with-attributes. Here’s an example-based tutorial. This means that a formula can operate on a set of nodes (returned from an XPath expression), without knowing how many there are or where they are in a document.

We can tag values (wherever they are) using an attribute. In DITA, we could use @outputclass (not @id, as this must be unique). For example, I could tag various elements (possibly table cells, possibly not) with @outputclass = ‘cost’, and then sum these using the formula:


The backquotes encapsulate an XPath expression (as opposed to the XXE formula language). ‘//*’ means “any element anywhere in the document” (which we then filter by the outputclass attribute). Furthermore, XPath can access not only the current document, but also other documents (as individual documents, not as a document collection – a notion supported by XPath 2.0/XQuery).

I’ll discuss an example of all this in the next post.

‘Database’ Functions in Excel

In Excel there is a category of functions called “Database”. This has always been a confusing term, as now (in 2007+) they relate to Tables (a.k.a. ListObjects, in VBA), or equivalent Ranges. The function names all start with ‘D’ (for database): DCOUNT, DSUM, etc. These are equivalents of the standard aggregation functions COUNT, SUM, etc (by aggregation, I mean a function that takes a set of values, and returns a single value).

What these D-functions allow is selective aggregation of data from a table, given a set of criteria – in other words, the combination of an advanced filter with an aggregation, without the need for the filtered data to reside on a worksheet.

Here’s an example:

The simple case is where the criteria are directly related to the data in the table. So, to count Bob’s records, the criteria range is B2:B3, and the DCOUNT formula is in B5:


The reference to “SalesTable[[#Headers],[Contacts]]” is just because I clicked on D9 – I could have just put “Contacts” (but I hate typing :-)). Note that the blank headers in row 2 are required as part of the criteria range (row 1 is just labels). Unsurpisingly, there are 12 records – one for each month of the year.

However, this kind of subtotalling by ‘dimension member’ (to use the OLAP term) is what pivot tables do. More interesting is when the criteria involve a Boolean-returning formula applied to each record.

For example, suppose that we want to know how many reps had more than average number of contacts (yes, I know it will be about half). So in D7 we have:


and in D3:


The criterion is a formula referring to the first data record, which returns a Boolean. I think of this formula as being filled down through the table records, as if in an additional column (note the absolute/relative addresses).

Even more interesting is when the criteria involve functions on multiple fields in a record. For example, suppose that we are interested in records where Contacts2 is greater than Contacts - these are the guys who are improving.

So in E3, we have:


(again, referring relatively to the first record).

This is fine, but more generally, our Boolean function operates on the entire record. So in F3, we have:


where TestRecord is a pure VBA function:

Public Function TestRecord(rec As Range) As Boolean

    TestRecord = rec.Item(4) < rec.Item(5)

End Function

Clearly, the body of this function can be as complicated as we wish, using the cells in rec. However, it depends on a particular ordering of the columns within the table. It is possible that we would want to use this function on tables that have the Contacts columns in different positions. So, an improved version intersects the named columns with the given rec.

Public Function TestRecord2(rec As Range) As Boolean

    Dim table As ListObject
    Set table = rec.ListObject

    Dim arg1 As Range
    Set arg1 = Intersect(table.ListColumns("Contacts").Range, _

    Dim arg2 As Range
    Set arg2 = Intersect(table.ListColumns("Contacts2").Range, _

    TestRecord2 = arg1.Value &lt; arg2.Value
End Function

Since the supplied Range is the first record in the table, we could simply pass the Table name to the function and derive from that the Range for the first record (but it’s getting late…).

Constraints on the Database functions are:

  • the criteria have to be Ranges (and thus on a Worksheet), not in-code arrays
  • they have to be vertically-oriented, contiguous Ranges (so can not be filled down).

A somewhat more esoteric limitation is that you can not plug a custom aggregation function into the basic D-function mechanism – DWEIRDSUM, perhaps.

Period Overlap Function

Over at Daily Dose of Excel, Dick provided a solution to the problem of counting active entities (e.g. customers) within a given period. That is, we want to count a customer if their period of activity overlaps with the given period. The key to the solution is SUMPRODUCT.

An alternative solution, which is a bit more work, but is more flexible, is to write a custom VBA function to compare two periods: that is, start1-finish1 versus start2-finish2.

It is then easy to add an extra column:

PFirst and PLast specify the period of interest. The formula in D6 is:


You might be tempted to write [First] and [Last], rather than B6 and C6. Unfortunately the implicit intersection of formula row with a table column does not work with the custom function – the function gets arrays of values from the whole columns.

However, we could have the OVERLAPS formulas off to the right of the table, say in column H. The formula can then be:

Table1[[#This Row],[First]],
Table1[[#This Row],[Last]])

(Yes, I should have given the table a meaningful name…)

The formula in D3 is just:


We can also use the OVERLAPS function to provide conditional formatting – here in column A (see screenshot above). The formula is exactly the same as in column D:


Finally, here’s the code for the function:

Public Function OVERLAPS(start1, finish1, start2, finish2) As Variant
'Does the period start1-finish1 overlap
'with the period start2-finish2?
'pre: start1 <= finish1 and start2 <= finish2 

  If IsDate(start1) And IsDate(finish1) And _
     IsDate(start2) And IsDate(finish2) Then
        OVERLAPS = _
        (start2 >= start1 And start2 <= finish1) Or _
        (finish2 >= start1 And finish2 <= finish1)
        OVERLAPS = CVErr(2001)
  End If
End Function

The error appears on the worksheet as a #VALUE, rather than a #N/A – I didn’t look into this deeply.

Intentional Circular References

Over a lunchtime pastie, I was recently browsing my copy of Walkenbach’s Formulas. I found myself looking at the chapter on Intentional Circular References.

Here, JW gives an example of choosing a set of unique random numbers – that is, having chosen a number once, it can not be chosen again. Each random number has an associated COUNTIF formula, which counts occurrences of the associated number; if this is 1 for each number, then we have a solution. In a random-number cell (e.g. A1), we have a formula of the form:

=IF(<no solution>, <new number using RAND>, A1)

These formulas generate candidate solutions at random, until one is found. In >Tools >Options >Calculation (v2003), you need to allow Iteration, and set a sufficient number of Maximum Iterations.

I thought I’d try my own example: allocating resources to a set of tasks (with pre-determined start and finish dates). Here’s the worksheet:

  • The task durations are set out with 1 values, in C6:J9 – each column represents a day, with time increasing from left to right
  • The allocated resources are identified in A6:A9 (range Alloc)
  • The pool of  available resources is in A12:A14 (range Resource)
  • The resource utilization per-day is calculated in C12:J14, using an array formula (see below), plus some conditional formatting
  • The maximum utilization for a resource is calculated in B12:B14 (range Max)
  • The maximum of the maxima is calculated in B16 (range Supermax). If this is 1 – that is, no resource is multi-booked for any day – then we have a solution.

The resource utilization in C12, etc, is calculated using the array formula:


That is, the ones in C6:C9 are summed only if the resource in the corresponding row of Alloc matches the one in A12. You can see how the allocations of res2 (in rows 6 and 8 ) are unioned in row 13.

The random-allocation formula in A6 is:

=IF(SuperMax<>1, NewResource, A6)

The reference to Supermax is what creates the circularity.

NewResource is a named formula:

=INDEX(Resource, INT(RAND()*Rescount) + 1)

This generates a random index into the Resource list, whose size is Rescount.

Having found a solution, recalculation just keeps preserving the existing values. To force a new solution, you first have to ‘break’ the existing solution. This is done using the named value Started, in A2. Each of the green cells in C6:J9 has the formula ‘=Started’. Setting Started to zero makes the allocation fail to find a solution (since Supermax can never = 1), leaving Alloc with an arbitrary set of resource ids. Then setting Started = 1 allows a new solution to be found.

I’m not sure how useful this technique would be for realistically sized problems, since it’s basically trying solutions at random. Anyway, here’s the workbook, if you fancy a play.

Formula Builder

Suppose that we have a large number of numerical data sets, which we want to examine using aggregation functions, such as SUM, AVERAGE, STDEV, and possibly some custom functions of our own devising. Each of these functions takes an entire dataset and retuns a single result.

Here’s the data:

where ‘data1′, ‘data2′, etc, name the dataset ranges; the column headers are themselves named ‘datasets’. We could, of course, have larger and more numerous datasets.

Now, we could have another worksheet with formulas such as:

=SUM(data1), =SUM(data2), …

= STDEV(data1), = STDEV(data2), …

and so on, for each combination of function and dataset. However, that’s rather a sledgehammer approach.

Alternatively, we could select a function from a drop-down list, select a dataset from another list, and so build a particular formula of interest:

The list of available functions is defined elsewhere, and picked up by data validation in cell B1:

(SUMDIFFS is a sample custom function). Similarly, the data validation on cell D1 picks up the ‘datasets’ list (the data column headers).

The result cell (F1) contains the formula:


where CALC is a public VBA function:

Public Function CALC() As Variant

  Dim func As String
  Dim dataset As String
  Dim formula As String

  func = Range("function").Value
  dataset = Range("dataset").Value
  formula = func & "(" & dataset & ")"

  CALC = Evaluate(formula)
End Function

The formula String does not need an initial ‘=’.

As CALC does not reference its input cells explicitly, it does not get recalculated automatically if the input cells change. So we need to handle Change events that emanate from the two input cells. In the ‘Analysis’ Sheet module:

Private Sub Worksheet_Change(ByVal Target As Range)

  If Target.Address = Range("function").Address Or _
     Target.Address = Range("dataset").Address Then
  End If
End Sub

I’ve always wanted use the ‘Dirty’ method … This is equivalent to re-entering the =CALC() formula, triggering recalculation.

A constraint here is that the functions must form a family that have the same argument signature: here, a single array of values. Other statistical functions take more arguments, and might form other families. Alternatively, a multi-argument built-in function could be wrapped in a custom function that defaulted all but the main ‘data’ argument.

Worksheet Hierarchies

Back from the Easter break now…

It occurred to me that a common requirement is for a workbook to reflect a hierarchical structure – for example:

  • divisions/units within an organization
  • geographical areas (country, region, city)
  • product/part breakdown
  • reporting period (year, quarter, month).

An obvious approach is to have a worksheet for each node in the hierarchy, with the leaf nodes holding the data, and the non-leaf nodes aggregating the data using formulas. (Note that I’m only considering single-dimension aggregation here, not hypercubes).

So I thought it would be useful to have a template worksheet that can be copied for each node in a hierarchy, and linked in to the hierarchy using the worksheet names. The aggregation formulas are already in place, and pick up the names of child nodes. And to aid navigation, we can have some hyperlinks to root, parent and child nodes.

My example is the start of a geographical hierarchy:

UK is the root; London and Brighton are in the South region. Each node worksheet has been copied from Template. Here’s the UK sheet:

The child nodes (i.e. the regions) are represented in a table, with the names entered in the Region column. The Sales and Costs columns pick up these names, to get the Sales/Costs values from the child worksheets. So, for example, cells E9 and F9 contain the formulas:

=INDIRECT($D9 & “!Sales”)

=INDIRECT($D9 & “!Costs”)

The UK node’s owns Sales and Costs values are just totals of the relevant columns:



The Sales/Costs cells are named ranges, needless to say.

A leaf node worksheet, such as London, has an empty child table, and literal values for Sales and Costs:

I’ve left the formulas in the table in case this node gets decomposed further, but it is in practice empty.

Here’s an intermediate node, for South region:

There are links to the root node (UK), the parent node (also UK here) and the child nodes. Each of these is a HYPERLINK formula that picks up the relevant node (i.e. worksheet ) name. For example, C2 contains the formula:

=HYPERLINK(“[" & Root & ".xlsm"]” & B2 & “!$A$1″, “link”)

Note that HYPERLINK needs a filename, even if the link is internal to this workbook. By convention, the workbook has the same name as the root node. So ‘Root’ is just a range name for B1 on the UK worksheet. The links all go to A1 on the target sheet. Similarly, G9 contains:

=HYPERLINK(“[" & Root & ".xlsm"]” & D9 & “!$A$1″, “link”)

So, to create a hierarchy node, you need to:

  • copy the Template worksheet
  • set Parent, Name, and Children (if any)
  • set Sales and Costs values for leaf nodes.

No VBA so far, but I think it would be useful to build:

  • a hyperlinked Table of Contents just to the leaf nodes – that is, where the data entry happens
  • hierarchical summaries (i.e. each one on a single worksheet), for individual data categories (e.g. Sales).

More in due course.

Table Subtotals

The built-in Subtotals facility (now on the Data tab) has never been particularly useful. Inserting subtotal rows into your data rows breaks one of the commandments of Excel: Thou shalt not sully the purity of Thy Data. Not only does the data have to be sorted prior to subtotalling, you can not then re-sort or filter the data. Furthermore, with 2007, you can not use subtotals with Tables, which as I’ve mentioned previously, are rather useful.

So what are the alternatives? One way is to use an array formula as a calculated filter. Here’s an example:

As you can see, it’s an old workbook (2005), which I’ve ‘upgraded’. To get the subtotals by month, you compare the Date column with the two boundary values, and multiply with the Value column. D2 contains the following single-cell array formula:


Obviously, Table1 should have a more meaningful name. As you might already know, we can’t use AND here, as this collapses the tests to a single Boolean value, rather than a column-sized array of Booleans. The trick is to treat the Booleans as 0/1 values and multiply them. So for row 2, we’ll have (1 * 1 * 54).

The formula then Fills Right. We could equally well have the subtotals vertically, and Fill Down.

If you don’t fancy the array formulas, here’s an alternative:

Here, I’ve added a Month column, to make the filtering simpler. B2, for example, contains:

=MONTH(Table2[[#This Row],[Date]])

Each of the subtotals is a DSUM formula, using the cells above as the criteria for the filter. E3 contains:


Note that we need the [#All] accessor, otherwise the formula does not pick up the header row. Note also that the column name is a string, not a name, and is thus in quotes.

The formula Fills Right. A limitation is that the criteria ranges have to be vertical – that is, column label above criterion – so we can’t have the subtotals arranged vertically. However, for presentation purposes, we could use an array function to transpose the range from 3R x 4C to 4R x 3C:


Although we were using the DSUM variant to avoid array formulas, TRANSPOSE is easy enough to understand.

DIY Scenarios

I’ve always thought that the built-in Scenario Manager is a bit feeble. In particular, the constituent values of a scenario should be visible on a worksheet, not squirrelled away in the SM. It’s really quite straightforward to do it yourself, especially with 2007 Tables.

Here’s a very simple example. This is a ‘Model’ worksheet, showing two different scenarios:

There are three values in our scenarios, which are dropped into D2, D3, D4. These cells are named A, B, C respectively. Output (F2) is a suitably complicated formula that uses A, B and C – obviously, there could be lots of other formulas dependent on the scenario values, making up a complex model. B1 is named Scenario.

Each scenario is in a Table on a separate worksheet, with the Tables named Scenario1, Scenario2, etc:

Back on the Model worksheet, D2, D3, D4 each contain the formula:

=INDIRECT(“Scenario” & Scenario &
“[[#This Row],[value]]”)

The [#This Row] accessor requires the formula to be on the same row as the corresponding row in the scenario tables (even though on different worksheets). If this is an issue, then D2:D4 could contain the array formula:

{=INDIRECT(“Scenario” & Scenario & “[value]“)}

since this works anywhere with respect to the scenario tables.

In fact, you don’t really need tables – you could just use named ranges on the scenario worksheets – but it saves (re)defining names manually.

The Set buttons on the scenario worksheets are just a convenience, so you can inspect a scenario and then make it the current one, without having to remember which number it is. Here’s the button code for the Scenario1 button:

Private Sub SetCommand1_Click()
    Worksheets("Model").Range("Scenario").Value = _
                          Right(Me.Name, 1)
End Sub

And similarly for the other buttons. Note that the button names have to be unique within the workbook.

Excel COUNTIFS Function

My wife, Liz, is a Business Intelligence (i.e. mega-database reporting) consultant, and came up with a problem to which I offered to produce an Excel (non-VBA) solution. Since the problem is of quite wide applicability, I’ll describe it here. I also came across an Excel function that was new to me.

The problem is this. Suppose that you have a table of employee records, with Sex and Salary. For entirely proper reasons you want to find out the proportion of females in top n% of salaries. (The n% is of employees, not salary range). However, some employees just outside the n% will have salaries the same as the lowest-paid members of the n%, and so need to be included, otherwise we’d lose the former from the proportion calculation.

Here’s a simple example with 10 employees:

Obviously, with so few records, the percentage of top-earners we want to see has to be large – typically, we’d be looking at 1% to 10%. Here we want to see the female proportion for the top 25% of salaries. (The data is shown here sorted, for clarity, but does not need to be so).

25% takes us ‘half way down’ employee #3, who clearly needs to be included. But we also want to include employee #4, who earns the same. In this top-four, we have 1 female, so the proportion is 25%.

The formulas are as follows:

idcount: = ROWS(SalaryTable)

pcount: = CEILING(idcount * percentile,1)

cutoffsalary: = LARGE(SalaryTable[salary],pcount)

mcount: = COUNTIFS(SalaryTable[sex],”=m”,
SalaryTable[salary],”>=” & cutoffsalary)

fcount: = COUNTIFS(SalaryTable[sex],”=f”,
SalaryTable[salary],”>=” & cutoffsalary)

fpercent: = fcount / (fcount + mcount)

The COUNTIFS function was new in 2007, and is new to me. You can have up to 127 range/criterion pairs. The ranges must have the same dimensions (i.e. column length, here). The criteria are evaluated on a cell-by-cell basis (i.e. by row, here), and ANDed together.

Note also the use of the LARGE function, to get the ith largest salary, for i = pcount.

I was rolling up my sleeves for some array formulas, but it turned out to be quite straightforward.

April 2014
« Dec    


Get every new post delivered to your Inbox.