Archive for the 'Non-VBA' Category

An Alternative to VBA in Excel?

Back last year, Gareth Hayter of Slyce Software emailed me about VScript, an alternative to VBA for writing functions and macros in Excel. Unfortunately, for various reasons, I have not been able to look into it in any detail, but it certainly sounds interesting.

VScript is based on Excel-DNA, a project to integrate .NET into Excel. The aim is to let you write functions and macros in C# or VB.NET, so presumably it is aimed at developers who are familiar with those languages and prefer them to VBA. Code is claimed to run considerably faster than VBA (not hard to believe). VScript provides an IDE integrated with Excel, as with VBA. You can use VScript to create stand-alone XLL add-ins, and digitally sign these.

What about VSTO? Gareth says:

VScript is different from VSTO (Visual Studio Tools for Office) in many ways:

  • VSTO is for professional programmers: It can be complex and confusing to use and requires a lot of time, effort and money to learn.
  • VSTO is expensive: You need to buy Visual Studio® which starts at $799.
  • VSTO is not built into Excel®: It is an external program and works in a very different way from VBA.
  • VSTO projects are complicated to deploy: VSTO is not integrated into Excel, which means that it’s difficult to make a few changes and test them quickly. It requires ClickOnce deployment.
  • VSTO cannot create User-Defined Functions (UDFs): With VSTO, you can’t create functions that you can use in a similar way to SUM() and AVG().

One to watch …

Spreadsheets in XML – Part 2

In the previous post, I was looking at the ‘spreadsheet extensions’ provided by XMLMind’s XML Editor (XXE). This allows XPath-based formulas to be inserted into XML documents, not only in tabular elements, but also in free text.

As an example, I mocked up some invoices. An invoice is a good example of a hybrid document: we want to print it out (or PDF it) as a nicely formatted document; there are calculated and looked-up elements in the manner of a spreadsheet; we want the whole set of invoices to be queryable subsequently, in the manner of a database.

Here’s an invoice, as a DITA document, shown in XXE:

The little green ‘f’ icons represent the formulas, held as XML Processing Instructions. These are ignored in subsequent transformations (to final formats, such as PDF). You double-click an icon to edit the formula.

The first one (before the table) is today’s date: =today(). The ones in the Cost column are simple arithmetic: =($C2 * $D2), etc. Column and row headers can be displayed optionally:

The Product Description and Unit Price formulas are more interesting, since they are lookups in another document, containing the product catalog. Here’s the formula in B2:

The first thing to notice is that we can have multi-line formulas, with ‘let’ definitions preceding the actual formula. (The “…” is really a full file path – I’ve elided it for compactness). The id of the element with the product description is the product code appended with “_desc”. This is then retrieved from the product catalog by matching the id attribute (@id) with the constructed value ($id). (The back-quotes indicate ‘raw’ XPath, rather than XXE formula language).

Here’s the Product catalog (not very extensive!):

The formulas here are used not to calculate visible values, but to construct values for the id attribute. For example, in B2:

Note the id attribute picked from the drop-down list. In Excel terms, this is rather like having a formula that constructs a Range name. It means that the ids for cells in column B and C always follow the product codes in column A. I think this is rather neat.

Back in the invoice, the Total Cost formula sums the values in the Cost column (E) – see the first screenshot. We could do this with a table/column reference, but an alternative is to tag the Cost cells with a common attribute value. In DITA, @outputclass allows a kind of informal specialization (we can’t use @id, as this must be unique within a document). Here, we can set @outputclass = ‘cost’. Now, the Total Cost formula sums all elements with this attribute value, wherever they are in the document:

=sum(`//*[@outputclass='cost']`)

That’s it, in terms of the documents. We can then generate formatted output, as we require.

The database aspect comes if the invoices are put into an XML database, such as XMLMind’s Qizx (Free Engine edition). This provides indexing and querying, using the XQuery language. We can then calculate aggregated values, for example by customer and product. Here’s a simple query to calculate the total invoiced for a given product:

xquery version "1.0";

let $prod := "PR01"    (:edit this:)

let $costs :=
 for $row in //strow
 let $cost := $row/stentry[@outputclass='cost']
 where $row/stentry[1]/text() = $prod
 return $cost
return ($prod, sum($costs))

strow is a simple-table row, stentry is a cell. One could, of course, get a lot fancier, and produce proper date-based reports.
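
For readers without an XML database to hand, the same per-product total can be sketched in Python with the standard library. This is only an illustration under assumptions: the element names (strow, stentry) and the @outputclass tag follow the post, but the wrapper element, product codes and amounts below are invented, and ElementTree supports only a small subset of XPath rather than XQuery.

```python
import xml.etree.ElementTree as ET

# Hypothetical invoice data. Element names follow the post; the <invoices>
# wrapper, product codes and amounts are invented for illustration.
INVOICES = """
<invoices>
  <simpletable>
    <strow><stentry>PR01</stentry><stentry outputclass="cost">150</stentry></strow>
    <strow><stentry>PR02</stentry><stentry outputclass="cost">80</stentry></strow>
  </simpletable>
  <simpletable>
    <strow><stentry>PR01</stentry><stentry outputclass="cost">100</stentry></strow>
  </simpletable>
</invoices>
"""


def total_for_product(root, prod):
    """Sum the 'cost' cells of every row whose first cell equals prod."""
    total = 0.0
    for row in root.iter("strow"):
        cells = row.findall("stentry")
        # Mirrors: where $row/stentry[1]/text() = $prod
        if cells and (cells[0].text or "").strip() == prod:
            for cell in row.findall("stentry[@outputclass='cost']"):
                total += float(cell.text)
    return total


root = ET.fromstring(INVOICES.strip())
print("PR01", total_for_product(root, "PR01"))
```

The loop over rows plays the part of the for clause, the first-cell test is the where clause, and the running total stands in for sum($costs).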

There’s an interesting contrast here with how we would do this in Excel. If each invoice is a separate Workbook, we would need to provide some collation mechanism for the data, to get it into a single source for pivot tables, etc. – either in a single workbook, or in Access. I think that where we have a large number of computationally relatively simple documents, the XML approach is quite attractive.

Spreadsheets in XML

In my work with DITA documentation, I use XMLMind’s XML Editor (XXE) – and very good it is too. The professional version comes with an Integrated Spreadsheet Engine, which I have just recently taken a look at. I think it’s rather interesting, particularly how the approach differs from Excel or similar traditional spreadsheets.

We are not talking here about Excel (2007+) using an XML-based file format. In this, the information is structured using elements that relate to spreadsheets – such as <worksheet> and <row> – not to the content domain (invoices, timesheets, product specifications, whatever). That means that the information is not in practice accessible to users (as opposed to tool developers).

What XXE is addressing is the insertion of computed element content into a ‘user level’ XML document. Such a document could be in XHTML or DITA (XHTML is obviously a ‘final’ format; DITA is a ‘source’ format for conversion into various final formats, such as XHTML, PDF, CHM).

The idea is that the XML contains formulas as Processing Instructions. PIs are interpreted by the XXE application, but are ignored by XML processing tools and web browsers. A formula PI generates content (a result value) which is inserted between the PI and the end of the enclosing element. In practice, a formula will provide content for a lowest-level element such as a table entry, or a text phrase. Here is a simple-table entry:

<stentry>
<?xxe-formula formula='=($[+0,3] * $[+0,4])'?>
150
</stentry>

This is in column 5 of a table, and is multiplying the values in columns 3 and 4, to produce the value 150. XXE provides a sugared syntax for references within a table, which is essentially the same as an Excel formula. So the above formula would actually be written as:

=($C2 * $D2)

with the relative addressing of the rows allowing the formula to be copy-pasted in subsequent rows.
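
To see why other tools can ignore the formula, it helps to look at how a generic XML parser treats such an entry. The sketch below uses Python's xml.dom.minidom; the PI text is copied from the example above (with straight quotes), and I am assuming that this is close to what XXE actually writes to the file.

```python
from xml.dom import minidom, Node

# The stentry example from the post, on one line.
# Assumption: the PI layout is as shown in the post.
SNIPPET = "<stentry><?xxe-formula formula='=($[+0,3] * $[+0,4])'?>150</stentry>"

entry = minidom.parseString(SNIPPET).documentElement

# The formula is a separate processing-instruction node...
pis = [n for n in entry.childNodes
       if n.nodeType == Node.PROCESSING_INSTRUCTION_NODE]

# ...while the element's text content is just the computed value.
text = "".join(n.data for n in entry.childNodes
               if n.nodeType == Node.TEXT_NODE)

print(pis[0].target)  # the PI target name
print(pis[0].data)    # the raw formula pseudo-attribute
print(text)           # the result value that downstream tools see
```

A transformation that only reads element text picks up the value 150 and never needs to understand the formula.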

In XML, any element can have an id attribute. For a table element, this allows us to reference a value in a table from outside (from ordinary text elements). For a leaf element, this allows us to reference a value by name. Suppose that invoice_table contains details of an invoice, and cell D7 contains the tax. We could have a formula:

=invoice_table!$D$7

or, better:

=$(tax)

There are, as you would expect, a reasonable number of built-in functions, in the usual categories.

An interesting twist is that a formula can set not just element content, but alternatively an element attribute. The most obvious use of this is to generate id attributes. For example, suppose that I have a table with Product Codes in column A, and Unit Prices in column B. Then, in the B2 cell, I can have this formula setting the id attribute:

=($A2 & "_unitprice")

(and similarly for the rest of column B). Now, if I want to look up a unit price, I simply construct the id from the product code and the suffix, and access the element directly by id – no need for a VLOOKUP function.
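
The construct-an-id-then-look-it-up pattern is easy to try outside XXE. Here is a minimal Python sketch, assuming a two-column catalog table; the product codes and prices are invented, and the loop that sets the ids simply stands in for what the XXE formula does in each column B cell.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: product code in column A, unit price in column B.
CATALOG = """
<simpletable>
  <strow><stentry>PR01</stentry><stentry>12.50</stentry></strow>
  <strow><stentry>PR02</stentry><stentry>8.00</stentry></strow>
</simpletable>
"""

root = ET.fromstring(CATALOG.strip())

# Step 1: what the id-setting formula does for each row,
# i.e. id = product code & "_unitprice".
for row in root.iter("strow"):
    code_cell, price_cell = row.findall("stentry")
    price_cell.set("id", code_cell.text + "_unitprice")


# Step 2: the lookup. Build the id from the code and fetch the element directly.
def unit_price(root, product_code):
    cell = root.find(f".//*[@id='{product_code}_unitprice']")
    return None if cell is None else float(cell.text)


print(unit_price(root, "PR01"))
```

Because the id is derived from the product code, adding or reordering catalog rows does not disturb any lookups that refer to them.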

So far, the main difference from Excel is that we are not restricted to using tabular structures (worksheets, in Excel). We can have values and formulas anywhere in a document structure. For example, I could drop a calculated value into an ordinary free-text paragraph, and pick up that value elsewhere in the document.

However, the really different aspect is that the formulas are based on the XPath 1.0 language – the Excel-like syntax is just cosmetic. XPath is a pattern-based query language that treats XML documents as trees of nodes-with-attributes. Here’s an example-based tutorial. This means that a formula can operate on a set of nodes (returned from an XPath expression), without knowing how many there are or where they are in a document.

We can tag values (wherever they are) using an attribute. In DITA, we could use @outputclass (not @id, as this must be unique). For example, I could tag various elements (possibly table cells, possibly not) with @outputclass = ‘cost’, and then sum these using the formula:

=sum(`//*[@outputclass='cost']`)

The backquotes encapsulate an XPath expression (as opposed to the XXE formula language). ‘//*’ means “any element anywhere in the document” (which we then filter by the outputclass attribute). Furthermore, XPath can access not only the current document, but also other documents (as individual documents, not as a document collection – a notion supported by XPath 2.0/XQuery).
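
The tag-and-sum idea can be tried with any library that understands this kind of path expression. The Python sketch below uses the standard library's ElementTree, which supports only a limited subset of XPath; the document content is invented, with one tagged value in free text and two in table cells.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: 'cost' values both inside and outside a table.
DOC = """
<topic>
  <p>Delivery charge: <ph outputclass="cost">20</ph></p>
  <simpletable>
    <strow><stentry>PR01</stentry><stentry outputclass="cost">150</stentry></strow>
    <strow><stentry>PR02</stentry><stentry outputclass="cost">80</stentry></strow>
  </simpletable>
</topic>
"""

root = ET.fromstring(DOC.strip())

# ".//*" is ElementTree's spelling of "any descendant element";
# the predicate then filters on the outputclass attribute.
tagged = root.findall(".//*[@outputclass='cost']")
total = sum(float(el.text) for el in tagged)

print(len(tagged), total)
```

Note that the code never says how many tagged elements there are or where they sit, which is the point of working with node sets.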

I’ll discuss an example of all this in the next post.

Excel Name Manager Add-in

I love the way that technical blogs are a two-way channel: not only do I write about topics that other people can discover, I also find out about other things in the technical community.

I’ve just had a pingback from Jimmy Peña’s blog (thanks, Jimmy). As is the way, I had a quick browse around and saw this post. This points to Jan Karel Pieterse’s Excel Name Manager add-in. I haven’t done any more than download it and point it quickly at the first workbook I could find, but it looks excellent. A proper name manager is something that Excel has always lacked (and still does in 2007, despite the slightly improved dialog).

And the fact that it’s free makes it even better. Not because I’m mean (just Scottishly careful), but because it demonstrates the sharing mentality among the technical community which is one of the more encouraging aspects of modern life. Bravo!

Excel Value Distribution

I recently wanted to take a long list of values and find the distribution of these. So if my data consists of integers between 1 and 9 (or could be converted into such), then I want to see the count of 1s, the count of 2s, …, the count of 9s. Obviously, this would lend itself to presentation as a column chart:

So how do we get the count values in column C? A quick skim through Walkenbach’s Formulas didn’t reveal the answer, though I might have missed it.

As you might guess, we can use a single-cell array formula. In C2, we want to handle the boolean array {B2=data}, where data is the name of the whole set of data values in column A.

Note, incidentally, that we can’t use named-range-intersection inside an array formula: {value=data}, where value names the column B values, does not work.

So we need to reduce the boolean array {B2=data} to a count of its TRUE values. AND and OR are no help here, because they always produce a single result value, even inside an array formula. The trick is to convert TRUE to 1 and FALSE to 0; then add for disjunction (or, as here, for counting), multiply for conjunction. So the formula we have is:

{=SUM((B2=data)*1)}

(with the braces indicating array-formula entry – Ctrl-Shift-Enter). This then fills down column C. The type conversion is effected by: FALSE * 1 = 0, TRUE * 1 = 1.
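
The same multiply-by-one conversion works in most languages where booleans behave as integers. As a rough Python analogue (the data list is made up, and the function name is mine):

```python
# Hypothetical data: integers between 1 and 9, as in column A.
data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]


def count_equal(data, value):
    # (v == value) is a boolean; * 1 converts it to 0 or 1,
    # just as (B2=data)*1 does in the array formula.
    return sum((v == value) * 1 for v in data)


# One count per value 1..9, like filling the formula down column C.
counts = [count_equal(data, value) for value in range(1, 10)]
print(counts)
```

The list comprehension corresponds to filling down; the generator inside count_equal corresponds to the array formula itself.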

A more general version would allow us to set a value interval for the counts:

The min and max are generated from the interval. In E2 (for example), the formula now has two comparisons:

{=SUM((data>=C2)*(data<=D2))}

Here, the pairwise multiplication of the two boolean arrays effects the type conversion. SUM then operates on a single array of 0s and 1s.
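
A small Python sketch of the interval version, again with invented data and interval bounds, shows multiplication standing in for AND:

```python
# Hypothetical data and (min, max) intervals, as in columns C and D.
data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
intervals = [(1, 3), (4, 6), (7, 9)]


def count_in_interval(data, low, high):
    # Multiplying the two booleans gives 1 only when both are true,
    # so the product acts as a conjunction and SUM counts the matches.
    return sum((v >= low) * (v <= high) for v in data)


counts = [count_in_interval(data, low, high) for low, high in intervals]
print(counts)
```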

The need for this type conversion from booleans to integers is a bit nasty, and comes from the implementation of AND and OR, which flattens arrays into a single set of values. So:

{=AND({TRUE;FALSE},{TRUE;FALSE})}

returns a single FALSE value, not an array {TRUE;FALSE}.

It’s instructive to think about how you might do this in VBA: with a function that takes the data range and returns an array of count values. There are clearly two levels of iteration, the one over data, which is implicit in the array formula, and one over the values/intervals, which is represented by filling the formula down the column.
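
As a language-neutral illustration of those two levels of iteration (in Python rather than VBA, and with a function name, parameters and data that are purely my own), such a function might look like this:

```python
def distribution(data, start, stop, width):
    """Return (low, high, count) for consecutive integer intervals
    of the given width covering start..stop."""
    result = []
    low = start
    while low <= stop:                    # outer loop: one pass per interval
        high = min(low + width - 1, stop)  # (the 'fill down' in the worksheet)
        count = 0
        for v in data:                    # inner loop: over the data values
            if low <= v <= high:          # (implicit in the array formula)
                count += 1
        result.append((low, high, count))
        low += width
    return result


# Hypothetical data, binned into intervals of width 3.
data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
for low, high, count in distribution(data, 1, 9, 3):
    print(low, high, count)
```

With a width of 1 this reduces to the simple count-per-value case.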

Digression on Documentation: DITA

As is the way of things, I’ve been diverted from the delights of VBA, on to some other work. This is a migration of the system definition documentation for a major financial system, from traditional Word documents to an XML-based architecture called DITA. (Somewhat fancifully, the ‘D’ stands for Darwin, which is appropriate in this, his bicentenary year).

The production of technical documentation is undergoing something of a revolution at present. This is due to the maturing of a raft of technologies based on the XML markup language. Broadly speaking, these technologies provide a solution that sits between monolithic documents-as-files (such as Word documents), and relational databases, as complex aggregations of fine-grained information records.

With a Word document, the unit of content is the same as the unit of presentation: a file is edited, and the same file is printed. This applies equally to web pages (HTML), but with the page as the unit. In practice, it is difficult and time-consuming to identify, extract and recombine fragments of documents to produce new deliverables.

With a relational database, the information records can be conjoined, aggregated and filtered in very complex ways, using a query language (SQL). However, databases are not really suited to holding large free-text elements, like a section of a document. Also, there is no notion of hierarchical structuring in query output, in contrast to the hierarchy of chapters, sections and sub-sections that we are familiar with in documents.

The XML-based solutions aim to provide a middle course. Content is created and held in a form that is structured enough to identify, extract and recombine fragments of documents to produce new deliverables. At the same time, the content does not carry information about presentation (either the target format or the details of layout). This is provided by transformations of the content to produce deliverables in different formats, such as Word or PDF for printable documents, or hypertext (XHTML) for web presentation or online Help.

The challenge is to come up with an information model that defines and relates appropriate topics (i.e. basic chunks), in ways that allow querying, selection and combination in flexible ways. There’s a trade-off here between flexibility and chunk size. Too fine-grained and it’s impossible to manage; too coarse-grained and you’re back with monolithic documents. There’s a wider trend towards ‘medium-sized’ information chunks: think of blog posts, like this one, or Wiki pages.

More on this anon…

