Archive for January, 2010

XML Export from Excel – part 1

I thought I’d look at using Excel as an information-gathering front-end for XML-based data, which can then feed into production of documentary reports (HTML, PDF). Or put the other way round, a reporting facility for Excel.

Here’s a simple scenario. My company has various Business Units (BUs) which produce and sell various products. Every Quarter, each BU produces a Sales/Costs report as an Excel workbook. Something like this:

The basic data is in the blue ProductTable. The pink cells are simple totals (=SUM(ProductTable[Sales]), etc, using structured table references). Incidentally, I should have the period as 2009Q4, etc, so that values sort lexically (2009Q4, 2010Q1, …).

To export this data as XML, we need to create a map from worksheet cells to XML schema elements. This is what the schema looks like in the Excel XML Source pane:

Each element is mapped to the appropriate cells by dragging-and-dropping. The root element is <bureport>.

Now, we could just click Developer > XML > Export. But there are a couple of reasons why we might want to do it from code. Firstly, each period, we will be receiving a report from each BU, and it would be nice to do the export as a batch operation.

Secondly, the XML map inside Excel does not seem to keep a reference to the schema from which it was created. This means that when the XML is exported, the root element comes out as just <bureport>, whereas we want it to have a reference to the schema:

<bureport xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="bureport.xsd">

where the critical bit is the filename at the end: bureport.xsd. For simplicity, this is assumed to be in the same directory as the bureport XML file.

Maybe I’m missing something here, but it seems odd that Excel internalizes the schema, but doesn’t have an option to associate a schema location with a map, and then generate a root element with that location, as above, in exported files.

Now, to begin with, I’m going to ignore the batch-operation over multiple workbooks, because it should be straightforward. So the exporting code is going to go into the bureport workbook (template) itself, and I’ll just run it directly from each workbook, from the VBA side.

The code needs to:

  • construct the output file name from the BU name and period (e.g. Report_Applications_4Q2009.xml)
  • open a TextStream on this (new) file
  • get the complete XML string from the bureport map
  • substitute the basic <bureport> element with the full one (see above)
  • write the modified XML string to the TextStream.

Code tomorrow.
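In the meantime, here’s a rough sketch of the shape it might take. The map name (“bureport_Map”) and the named cells for the BU and period are assumptions for illustration, not the final code:

Sub ExportBUReport()
    'Sketch only: "bureport_Map", "BUName" and "Period" are assumed names
    Dim xmlText As String
    Dim fname As String
    Dim fso As Object, ts As Object

    'Construct the output file name from the BU name and period
    fname = ThisWorkbook.Path & "\Report_" & Range("BUName").Value & _
            "_" & Range("Period").Value & ".xml"

    'Get the complete XML string from the bureport map
    ThisWorkbook.XmlMaps("bureport_Map").ExportXml Data:=xmlText

    'Substitute the basic root element with one that references the schema
    xmlText = Replace(xmlText, "<bureport>", _
        "<bureport xmlns:xsi=""http://www.w3.org/2001/XMLSchema-instance"" " & _
        "xsi:noNamespaceSchemaLocation=""bureport.xsd"">", Count:=1)

    'Write the modified XML string via a TextStream
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set ts = fso.CreateTextFile(fname, True)
    ts.Write xmlText
    ts.Close
End Sub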

Happy Blogday

It’s now a year since I started this blog, and good fun it’s been too (for me at least). There’s something very satisfying about writing a post, particularly when it concludes and summarizes some piece of work. I’ve also found the blog useful for my own reference, when I’ve forgotten the details of something I did some months ago.

A few numbers: 63 posts, 139 comments. Right now, the blog’s had 15,823 visits, which I’m pretty pleased with. Not much, however, compared with John Walkenbach’s 973,794 visitors (we prostrate ourselves before Your Excel-lency). I’m also pleased with my position (which modesty prevents me from mentioning) on Googling “VBA blog”.

The perennial favorite post is ByRef and ByVal – maybe because there’s some desperate Googling going on with those as search terms. The recent posts on Performance issues seem to have attracted some interest too (thanks for the comments, guys).

What Next for Modern Man? Well, I still have to look at Andy Pope’s Ribbon tool. And I’ve been experimenting with getting XML data out of Excel – this ties in with my interest in DITA/XML documentation – and generally using Excel as a front-end. More anon.

Performance – Functions (addendum)

Following on from an earlier post, I thought that I’d quickly (!) try a different function. Not being a statistician, I wanted something with a simple formula, but that worked on potentially very large arrays: CORREL (correlation between two data sets) is something I can get my head around.

So I set up two columns of 1 million rows: column A containing 1,2,3,4,…; column B containing 2 * A * (1+ nnn), where nnn is a small, random, positive or negative quantity. For example, row 10 contains (10, 21.26344). The column ranges are named. A single cell contains a CORREL formula – this calculates in somewhat less than 1 second.

So I thought I’d try:

correlation = WorksheetFunction.Correl(arrX,arrY)

where arrX, arrY are 1-dimensional arrays obtained from the respective column ranges (whose Values are 2-D, N x 1 arrays). The above call is made between calls to Timer, and with all Application updating switched off. So far so good.
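For what it’s worth, the harness looks something like this (a sketch; the range names “colX” and “colY” are my own, not necessarily the exact code):

Sub TimeCorrel()
    'Sketch of the timing harness; "colX" and "colY" are assumed range names
    Dim vX As Variant, vY As Variant
    Dim arrX() As Variant, arrY() As Variant
    Dim i As Long, n As Long
    Dim t As Double, correlation As Double

    vX = Range("colX").Value          '2-D, N x 1 arrays
    vY = Range("colY").Value
    n = UBound(vX, 1)

    ReDim arrX(1 To n)
    ReDim arrY(1 To n)
    For i = 1 To n                    'flatten to 1-D
        arrX(i) = vX(i, 1)
        arrY(i) = vY(i, 1)
    Next i

    Application.ScreenUpdating = False
    t = Timer
    correlation = WorksheetFunction.Correl(arrX, arrY)
    Debug.Print n; "rows:"; Timer - t; "seconds"
    Application.ScreenUpdating = True
End Sub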

Times for 50000 rows are either 0.031 or 0.047 seconds – precisely those values, never anything in between; the mix seems to be random. For 65536 rows, the two alternative times are 0.047 and 0.063, again precisely, and randomly mixed. That’s pretty odd…

So, whack it up to the full 1 million rows, to get a decent elapsed time? Unfortunately not. In fact, 65537 rows causes a Run-time Error 13 – Type Mismatch on the call to Correl (as above). So something is limited to 16 bits, which is of course the storage for a VBA Integer. But my code does not use Integer anywhere. The actual data values are just Variants, indices are Longs. And the worksheet formula works fine for 1 million rows.

What’s going on? Any suggestions welcome. Given that I can’t get a benchmark for a very large number of rows, there doesn’t seem much point in writing a custom VBA Correl function to compare. Hmmmm.

Performance – Functions

In a post some months ago, there was a discussion of the relative performance of equivalent functions: built-in worksheet versus custom VBA. Dougaj4 reckoned that the VBA function for Transpose is about twice as fast as the WS function. However, in my benchmark, the difference was swamped by Range Access. So I repeated the experiment with an in-memory array, comparing

WorksheetFunction.Transpose(arrin)

with Doug’s double-For-loop implementation. I tried it for a square array of size 1000 to 3000 (so 1 to 9 million elements), with Long values (=i/j).
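For reference, the double-loop version is along these lines (my reconstruction of the general idea, not Doug’s exact code):

Function TransposeArr(arrin As Variant) As Variant
    'Simple double-For-loop transpose of a 2-D array
    Dim r As Long, c As Long
    Dim arrout() As Variant
    ReDim arrout(LBound(arrin, 2) To UBound(arrin, 2), _
                 LBound(arrin, 1) To UBound(arrin, 1))
    For r = LBound(arrin, 1) To UBound(arrin, 1)
        For c = LBound(arrin, 2) To UBound(arrin, 2)
            arrout(c, r) = arrin(r, c)
        Next c
    Next r
    TransposeArr = arrout
End Function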

Results:

  • 1000^2: WS = 0.72, VBA = 0.31
  • 2000^2: WS = 4.08, VBA = 1.27
  • 3000^2: WS = 9.42, VBA = 2.84

Here’s the graph:

So in fact the difference is greater: the custom VBA function is about 3 times faster. I wonder why? The function is pretty simple algorithmically, so it’s not as if there could be wildly different implementations. And there’s no significant initial overhead for the Worksheet function call: its line would pretty much go through (0,0).

I don’t know whether one could expect a similar difference in performance for other functions – say a really meaty statistical function.

Performance – Strings

In the previous post, we looked at the performance hit on accessing worksheet ranges from VBA. As is well known, it is advisable to minimize accesses by reading/writing arrays.

Another area where warnings abound is in operations on large strings. A particular case is building up long strings by concatenation. I expect we’ve all done something like:

mystring = mystring + newpiece

The overhead here seems to be because the old mystring is copied to a new mystring, and then the newpiece appended. The copying time is presumably proportional to Len(mystring).

For ad hoc string building (a range address, or name, or an SQL string in Access) this is not significant. Where problems could occur is when the concatenation is being done inside a high-iteration loop – that is, building up a very large string out of very many pieces.

Since mystring gets longer and longer, each concatenation gets slower and slower. The total time is therefore going to be proportional to n-squared, where n is the number of concatenations (i.e. the number of pieces). This is obviously bad news.

As a simple benchmark, I concatenated a single character string 100,000 times (on my fairly basic laptop), and it took 3.4 seconds; 200,000 times took 12.9 seconds. That’s a factor of 3.8, so roughly 4.
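The benchmark itself was nothing more sophisticated than a loop along these lines:

Dim s As String
Dim i As Long
Dim t As Double

t = Timer
For i = 1 To 100000              'then 200,000
    s = s & "x"                  'each concatenation copies the whole string so far
Next i
Debug.Print Timer - t; "seconds"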

Now, that’s quite a big string to be building – not something you do every day, I’d guess. But you might be building up a string in memory in order to avoid large numbers of file-write operations, which are presumably going to be much slower than one write of a large string.

There are two alternatives. One is to use Mid as a kind of ‘deconstructor’ on the left-hand side of an assignment (strictly the Mid statement, rather than the function – this is a bit odd). This essentially ‘patches’ a substring on top of a base string. The trick here is that you start with an all-spaces base string of the anticipated size of your output string; this can be constructed using the (fast) Space function.

Dim mystring As String, substr As String
Dim i As Long, lensubstr As Long, size As Long

size = 100000                              'anticipated length of the final string
mystring = Space(size)                     'all-spaces base string
i = 1
Do While i <= size
    substr = "xxx"                         'or the return from a function
    lensubstr = Len(substr)
    Mid(mystring, i, lensubstr) = substr   'patch the substring onto the base
    i = i + lensubstr
Loop

Some people advocate using the ‘$’ versions of the string functions – so, Mid$, in this case. These are typed to take and return Strings, rather than Variants, and so are supposed to be more efficient. This does not seem to be the case with my string-building benchmark: the times are the same.

For a 100,000 character string, built one character at a time, it takes 0.05 seconds; for a 200,000 character string, 0.09 seconds. This looks like it might be linear, as you would expect.

Another alternative is to build an array of substrings, and then use the Join function to do a batch-concatenation:

mystring = Join(myarray, "")   'empty delimiter – if omitted, Join inserts a space between elements

This is about half as fast as the Mid solution (0.09, 0.17 seconds), but is still linear. The advantage of this is that it’s easier to retrieve/modify the substrings prior to the final string-build (with the Mid solution you’d need to keep an array of substring offsets).

Being good (!) object-oriented programmers, we should really encapsulate our chosen mechanism in a StringBuilder class (I expect that there are some out there). If the strings are being written to a file, then our class could handle this as well: a kind of wrapper around a TextStream object.
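A minimal sketch of such a class (a class module called, say, StringBuilder), using the Join approach internally – the names and details are just for illustration:

'Class module: StringBuilder (sketch only)
Private mParts() As String
Private mCount As Long

Private Sub Class_Initialize()
    ReDim mParts(1 To 16)
End Sub

Public Sub Append(ByVal s As String)
    mCount = mCount + 1
    If mCount > UBound(mParts) Then          'grow the buffer as needed
        ReDim Preserve mParts(1 To UBound(mParts) * 2)
    End If
    mParts(mCount) = s
End Sub

Public Function ToString() As String
    If mCount = 0 Then Exit Function
    ReDim Preserve mParts(1 To mCount)       'trim to the used portion
    ToString = Join(mParts, "")
End Function

Usage would then be sb.Append piece inside the loop, and mystring = sb.ToString at the end; a file-writing version could wrap a TextStream and flush the buffer periodically.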

Another, less significant, optimization is checking for empty strings. Most people would compare with a literal empty string:

If mystring = ""

Since there is supposed to be an overhead in doing this (many times in a loop), alternatives are:

If Len(mystring) = 0

If mystring = vbNullString

The times for 10 million tests were 0.73, 0.36, 0.72 seconds. So the Len function is twice as fast, but it doesn’t really seem something to lose sleep over.

Finally, I’ve seen recommendations that when searching for a substring inside a larger string, you should use the InStr function, rather than Like “*foo*”. Times for 10 million tests were 1.56 and 2.55 seconds. So not hugely significant, and Like is obviously more flexible.
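That is, the two tests being compared look something like:

If InStr(mystring, "foo") > 0 Then found = True
If mystring Like "*foo*" Then found = True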

Performance – Range Access

I’ve just been writing up some examples relating to VBA performance issues (in Excel 2007). Although a lot of this is well known, I thought it might be useful to report my results. The topics (over three posts) are:

  • moving data between worksheet ranges and VBA
  • operations on Strings
  • worksheet functions versus VBA functions.

For the first topic, I thought I’d compare ways of reading data from one worksheet, making a trivial arithmetic change (* 2), and writing to another worksheet. The size of the data array depends on what your system will stand, but after some experimentation I went for a 100 x 100 cell range.

The ‘straw man’ is the version that does it one cell at a time. Although this seems idiotic, you might not worry for a small set of data (say 100 cells); then someone comes along and uses 10000 cells.

So I get each value from:

Range("input").Item(i).Value

for i = 1 to 10000. And similarly, the Value of an individual cell is set in the output range.

The alternative version reads the data into a VBA array in a single operation:

tempArray = Range("input").Value

Then the array elements are updated inside a (double) loop, and then

outRange.Value = tempArray
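Put together, the whole array version is roughly this (the range names are assumptions):

Dim tempArray As Variant
Dim outRange As Range
Dim r As Long, c As Long

tempArray = Range("input").Value               'one read of the whole range
Set outRange = Range("output")

For r = 1 To UBound(tempArray, 1)
    For c = 1 To UBound(tempArray, 2)
        tempArray(r, c) = tempArray(r, c) * 2  'trivial arithmetic change
    Next c
Next r

outRange.Value = tempArray                     'one write of the whole range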

Surprise, surprise, there’s quite a difference:

Straw Man 22.9 seconds; Array of Hope 0.14 seconds.

That’s a factor of 164. Which suggests that inefficient range access is going to dwarf any other performance issues (of which more tomorrow).

Excel Name Manager Add-in

I love the way that technical blogs are a two-way channel: not only do I write about topics that other people can discover, I also find out about other things in the technical community.

I’ve just had a pingback from Jimmy Peña’s blog (thanks, Jimmy). As is the way, I had a quick browse around and saw this post. This points to Jan Karel Pieterse’s Excel Name Manager add-in. I haven’t done any more than download and point it quickly at the first workbook I could find, but it looks excellent. Something that’s always been lacking (and still is in 2007, despite the slightly improved dialog).

And the fact that it’s free makes it even better. Not because I’m mean (just Scottishly careful), but because it demonstrates the sharing mentality among the technical community which is one of the more encouraging aspects of modern life. Bravo!

