In an earlier post, I looked at a closed-bucket hashtable implementation based around Dictionary<TKey, LinkedList<TValue>>. Although this achieved the goals of having fast insertion, deletion and retrieval time, it also had a few shortcomings:

  • The collection is unsorted and cannot be sorted because it is orderless (the only way to guarantee sorting is to insert pre-sorted elements in order)
  • Searching for a value (whose key is unknown) is O(n) due to the use of linked lists as a bucket structure

I decided to revisit the CBHT class. Firstly, I wanted to refactor the code to derive a common base class. Then, extending this class, I wanted to provide an alternative implementation that was sorted (using a 2-tier sort; first by key and then by value) and provided faster searches.

CBHTBase<TKey, TValue>

The base type is a generic abstract class that conceals a lot of repeated functionality that arises from implementing multiple collection interfaces. It also hides the explicit interface implementations (which provide compatibility when the class is used in a less specific capacity). The result is considerably fewer methods and properties to implement in the concrete classes; only the following are abstract:

  • KeyCount
  • Keys
  • this[TKey key]
  • Add(TKey key, TValue value)
  • Clear()
  • ContainsKey(TKey key)
  • GetEnumerator()
  • Remove(TKey key)

SortedCBHT<TKey, TValue>

The sorted CBHT implementation must be based around two sorted types; one to represent the collection of buckets and one to represent the structure of the buckets themselves.

There are two built-in collections in the .NET Framework that are sorted by key; SortedList<TKey, TValue> and SortedDictionary<TKey, TValue>. Both classes are implemented using BSTs and both have similar characteristics. I have opted for SortedList<TKey, TValue> because:

  • It is conceptually closer to the array model for hashtables
  • The Keys and Values properties are indexed and aren’t regenerated when the collection changes
  • In my own tests with large numbers of randomly-generated keys, it outperforms SortedDictionary<TKey, TValue>

For the bucket structure, I have a ready-made solution in the form of the generic BST class I wrote for the last post. It is implicitly sorted by value and searches faster than a linked list.

Class Diagram

Class diagram for the sorted CBHT implementation

Final Words

Each of the two CBHT implementations has its strengths and weaknesses; the sorted version is much faster at searching, but slower at insertion. The unsorted version is faster at insertion and retrieval, but cannot be sorted and is much slower at searching. It is worthwhile to note that tests using large numbers of random keys and values showed that the bottleneck caused by searching in the unsorted version is greater than the performance defecit caused by insertion in the sorted version. Each therefore has its own use in future projects.

Download

SortedCBHT.cs (includes dependant classes)

Although the underlying data structures behind SortedDictionary<TKey, TValue> and SortedList<TKey, TValue> (in the System.Collections.Generic namespace) are binary search trees, there is a surprising absence of a BST-based collection from the .NET Framework. If one only wishes to store a series of unkeyed values, an IDictionary-based collection seems like overkill. The aforementioned classes also have the limitation of being unable to store duplicate keys.

With that in mind, I decided to implement a simple generic binary search tree in C#.

My objectives for this were as follows:

  • O(1) performance for Count and Clear()
  • O(log n) performance for Add(), Remove() and Contains() (assuming unsorted/random insertions)
  • In-order (sorted) iteration with foreach
  • Allow duplicate elements
  • Completely abstracted implementation (i.e. nodes are invisible to the caller, appears to be a sorted collection of elements only)

Since BSTs have a deterministic order with no indexing, the most relevant collection interface to implement is ICollection<T>. In addition to the methods and properties defined in the interface, I also want to provide a reverse iterator and a way of cloning the content (and structure) of the tree. The public face of the class is shown below:

Class diagram for the BST implementation

Note that the inner class, BSTNode, is private. It cannot be accessed from outside the scope of the collection. This is intentional; I want to avoid the clunkiness of the built-in LinkedList<T> class, which exposes its nodes as a public type and returns references to them. Internally, each node holds three references; to the lesser/greater nodes and to the parent node. The latter simplifies the Remove() operation and the traversal algorithm.

GetEnumerator() is hooked up to the InOrder property, which uses an iterator block to traverse the tree in-order (i.e. left subtree, node, right subtree). Since Add() performs an insertion sort, this traversal strategy ensures a sorted iteration. ReverseOrder is provided for use with the foreach statement, while PreOrder (i.e. node, left subtree, right subtree) can be used to clone the structure of the tree – its output can be fed directly into Add() to preserve the distribution of nodes in the tree. Tree traversal is commonly implemented using recursion, but this puts a cap on the maximum number of elements (due to the inevitability of encountering a StackOverflowException). To avoid this, a less elegant (but more memory-friendly) looping algorithm has been used (courtesy of this post by Haroon Saeed).

Remove() takes into account the three possible permutations for removing nodes while maintaining sort order:

  1. A leaf node is simply unlinked.
  2. A node with one child is replaced by its child (removed node’s parent links to the removed node’s child and vice-versa).
  3. For a node with two children, first the successor node is identified (i.e. the element with the lowest value within the “greater-than subtree”). The removed node’s value is replaced with the successor node’s value and then the old successor is removed according to rule #2.

The Count property is an internal field that is incremented/decremented as part of the insertion/deletion algorithms.

Example Usage

// construct and populate a BST of integers (random order)
BST<int> test = new BST<int>(new int[] { 12, 1, 8, 200, 657, 50, 2, 9, 13, 1024, 512, 768 });            

// display the elements by traversing in-order
foreach (int val in test) Console.Write("{0} ", val);
Console.WriteLine();

// alter the tree
test.Add(10);
test.Add(80);
test.Remove(12);
  // test the binary search
Console.WriteLine("'12' is " + (!test.Contains(12) ? "NOT " : "") + "in the collection.");
Console.WriteLine("'10' is " + (!test.Contains(10) ? "NOT " : "") + "in the collection.");

// display the new elements in reverse order
foreach (int val in test.ReverseOrder) Console.Write("{0} ", val);
Console.WriteLine();

Download

BST.cs

Rather disappointingly, there are no keyed collection classes in the .NET Framework that allow duplicate keys. Frequently, there is a need to use a hashtable structure where each key can have multiple values. The applications of such a collection would include:

  • An in-memory representation of a many-to-many relationship
  • A set of values that are grouped into categories
  • A set of values with a two-tier sort (sorted first by key, then value)
  • A set of named fields which support multiple values
  • A mechanism for finding and resolving duplicates

Interfaces

In order to maximise usefulness and compatibility with other .NET types, this new collection class must implement appropriate interfaces. A closed-bucket hashtable can be conceptualised as:

  • A collection of keys (we want to know the number of keys and be able to access the key elements)
  • A collection of values (irrespective of their keys)
  • A dictionary whereby we can access the values using their keys

The .NET Framework considers a dictionary to be a collection of KeyValuePair objects. In thinking of this collection as a dictionary, there are two conflicting concepts: On one hand, we want each pair to consist of a key and its value (with duplicate keys allowed), but on the other, we want each pair to consist of a key and a collection of the values corresponding to that key. Since we will exploit some aspects of each, we will adopt both concepts.

Therefore, we will implement the following interfaces:

  • IDictionary<TKey, ICollection<TValue>>
    This will enable us to treat our collection as both a collection of keys and as a dictionary in the second capacity mentioned above (i.e. each pair represents a key and its values).
  • ICollection<TValue>
    This will enable us to treat our class as a collection of values, irrespective of their keys.
  • IDictionary<TKey, TValue>
    This will enable us to treat our collection as a dictionary in the first capacity mentioned above (i.e. each pair represents a value and its key).

Note that in implementing all 3 of these interfaces, some methods and properties will have conflicting signatures. Where the behaviour of the conflicting members is identical, we will provide a public implementation. Where the behaviour of the conflicting members is intrinsically different (or if unapplicable to the collection’s ultimate use), we will provide an explicit implementation, which will be hidden from the end-user unless the class is used exclusively in the capacity of the owning interface. If both members should legitimately be public-facing, we can create aliases (such as KeyCount to resolve the ambiguity with Count for the values).

The resultant public face for the class will be:

Implementation

Having defined a contract for the behaviour of our collection, we must decide on an implementation. The most traditional implementation of a closed-bucket hashtable is as an array of linked lists. However, because we want to build a versatile, multi-pupose collection, it makes more sense to use a dynamically-sized collection. Additionally, it is convenient to recycle existing functionality where possible, so we will exploit the default generic implementation Dictionary<TKey, TValue>. Thus, our implementation will revolve around:

Dictionary<TKey, LinkedList<TValue>> mInnerDic;

Using this member variable, we can wrap the existing functionality for determining the existence and number of keys, accessing the key elements (as a Dictionary<TKey, TValue>.KeyCollection) and some principal operations such as Clear(), Remove(). Perhaps most importantly, we can wrap the ready-made indexer to retrieve the linked list of values, given a key.

The implementation uses no other instance variables; all other functionality is achieved by manipulating the inner collection ad-hoc.

Performance

Noteworthy methods/properties have the following efficiencies:

  • Add(key, value) – O(1) to O(n)
  • Contains(value) – O(n)
  • ContainsKey(key) – O(1)
  • Count – O(n)
  • KeyCount – O(1)
  • Remove(key) – O(1)
  • RemoveByValue(value) – O(n)

Usage

// initialise the collection
CBHT<string, int> integers = new CBHT<string, int>();
integers["odd"] = new int[] { 1, 3, 5, 7, 9 };
integers["even"] = new int[] { 2, 4, 8, 10 };
integers["prime"] = new int[] { 2, 3, 5, 6, 7 };

// manipulate some values
integers.Add("even", 6);
integers["even"].Remove(10);
integers.Remove(new KeyValuePair<string, int>("prime", 6));
integers.Add("powers-of-2", 32);
integers.Add("powers-of-2", 64);

// print out the values, grouped by category
foreach (string key in integers.Keys) {
    Console.WriteLine(key);
    Console.Write("\t");
    foreach (int value in integers[key]) {
        Console.Write("{0} ", value);
    }
    Console.WriteLine();
}

Download

CBHT.cs

Excel Automation provides a wealth of possibilities when developing reporting & printing features for RAD business apps. By taking advantage of the fact that most of your end-users will be running the Microsoft Office suite, you can quickly add functionality to:

  • Export tabular data from your application into a flexible medium
  • Transform tabular data into chart form
  • Harness spreadsheets as a data entry mechanism for your application
  • Read and write data in the Excel file format(s)

Although Excel Automation can be very powerful, it also has a number of common pitfalls; the least of which can leave unwanted background processes behind and the worst can result in severe performance bottlenecks when exporting data. I’ve collected what I believe to be the best-practice methods for working with Excel, as well as some undocumented (or hard to find) tips and tricks to get the most out of this technology.

Assemblies, Namespaces and Types

In order to begin, you need to have Microsoft Office installed on the development machine, along with the Microsoft Office Primary Interop Assemblies. You’ll need to add references to Microsoft.Office.Core and Microsoft.Office.Interop.Excel to your project.

It’s enormously useful to use a namespace alias to shorten the fully-qualified names of the Excel classes and interfaces to something more readable. I suggest:

using Excel = Microsoft.Office.Interop.Excel;

The PIAs are wrappers for their equivalent COM Type Libraries and thus you tend to work with interfaces rather than classes; with the exception of the root-level Application object, you do not instantiate Excel types using constructors. You may also find that the objects returned through the API implement multiple interfaces (although this tends to feature more in other Office applications, particularly Outlook). A number of methods and properties in the API return unspecified object types, which you must cast into the appropriate interface (the MSDN reference will help you to identify the actual return values).

The types you will use most often in Excel Automation are:

  • Application – controlling the instance of EXCEL.EXE, showing and hiding the application, creating/opening files and toggling performance-boosting options. Instantiated by constructing a new ApplicationClass object.
  • Workbook – the collection of spreadsheets and charts in the file you are working with. Usually used as a stepping stone to a single sheet. Stored in the Application.Workbooks collection and created using Application.Workbooks.Add().
  • Worksheet – provides access to the spreadsheet content, formatting, page setup, printing and saving. Stored in the Workbook.Worksheets collection and created using Workbook.Worksheets.Add().
  • Range – represents a range of cells. Can be thought of as a selection. Cells are not accessed individually, therefore you must use this type to read/write cell values, apply formatting, copy/paste, etc. There are literally dozens of ways to get a Range object, for example via Worksheet.Cells.

Working with Application

As soon as you instantiate ApplicationClass, Excel is loaded into memory and your application connects to its OLE server. Be mindful of this when choosing when to construct that first automation object. Right from the point when you do this, there are some things to be aware of:

  • Excel will keep on running until you properly dispose of the Application object. If you do not do this correctly, the process may linger until your app stops running or even beyond that point. You may end up with multiple instances of Excel too.
  • Unless you instruct it otherwise, the Excel application will be visible to the user, potentially interruptible and will recalculate all formulae on the spreadsheet every time you change a cell’s value. Also, events on Excel types will not be fired unless you request otherwise.
  • Any alterations to the above will persist after your code stops running.

You must, therefore, take care and ensure that you properly initialise before performing any automation functionality and then clean up immediately after you finish. You must take extra care to handle any exceptions that may be thrown during execution and run the clean-up code no matter what happens.

Simple Initialisation and Clean-up Example

using System;
using System.Runtime.InteropServices;
using Excel = Microsoft.Office.Interop.Excel;

public class ExcelDemo : IDisposable {

    private Excel.Application mApplication;

    public ExcelDemo() {
        mApplication = new Excel.ApplicationClass();        // loads the Excel process into memory
        mApplication.EnableEvents = true;                   // allows us to attach event handlers to Excel objects
        mApplication.Visible = false;                       // hides the Excel application
        mApplication.Interactive = false;                   // stops Excel from responding to user input
        mApplication.ScreenUpdating = false;                // boosts performance by preventing Excel from refreshing the GUI state
        mApplication.Calculation
            = Excel.XlCalculation.xlCalculationManual;      // disables automatic calculation of formulae
    }

    ~ExcelDemo() {
        Dispose();
    }

    public void Dispose() {
        if (mApplication != null) {
            // restore interactivity, GUI updates and calculation
            mApplication.Interactive = true;
            mApplication.ScreenUpdating = true;
            mApplication.Calculation = Excel.XlCalculation.xlCalculationAutomatic;

            // exit without saving
            mApplication.ActiveWorkbook.Close(false, Type.Missing, Type.Missing);
            mApplication.Quit();

            // release COM object
            Marshal.FinalReleaseComObject(mApplication);
            mApplication = null;
            GC.Collect();
        }
    }

    public void Go() {
        // automation code goes here
    }
}

In the above example, Excel is launched and prepared for efficient use in the constructor. Automation code can then be run inside the Go() method. When all operations have completed, the class can be disposed and will clean up after itself.

Working with Range

The Range interface is the mechanism that you will use to read/write cell values and apply formatting. There is no way to reference cells separately, other than to operate on a 1×1 Range object.

IMPORTANT: Retrieving Range objects is costly in terms of resources and processor time! You should write automation code that retrieves the minimum number of ranges required to complete the operation and include as many relevant cells in a range as possible. As soon as you begin to read/write non-trivial numbers of cells, you will see that accessing large numbers of small ranges creates a catastrophic bottleneck in your app. And it’s not just retrieving ranges that is costly, just about every operation you can perform on a range (e.g. writing values, merging cells, etc) has a significant overhead. Minimise the number of operations performed on Range objects.

To give you an example of the difference in performance possible if you reduce range retrievals, reuse range variables and minimise range operations, I was able to reduce the export time of a 16,000 row table from over 10 minutes to under 20 seconds.

Retrieving a Range

There are three main starting points to retrieving a range; you can work in absolute references (knowing a particular span of rows/columns), you can get a range that is relative to the position of another range or you can get a range that is relative to the size of another range:

Excel.Worksheet sheet = mApplication.ActiveSheet;

// absolute range (3x10 cells)
Excel.Range absolute = sheet.get_Range("A1:C10", Type.Missing);

// relative to position (3x10 cells, starting at D1)
Excel.Range relativeToPos = absolute.get_Offset(3, 0);

// relative to size (2x2 cells, starting at D1)
Excel.Range relativeToSize = relativeToPos.get_Resize(2, 2);

The Worksheet.Cells property points to a range representing the superset of all cells in the sheet. (Note that every time you access this property, you are retrieving a new Range object and thus incurring overheads!)

In the rare situation where you only want a 1×1 range, you can use the indexer on the Range interface:

IMPORTANT: Like most Office applications, Excel numbers its arrays from 1, rather then zero!

// indexer (top-left cell)
Excel.Range singleCell = (Excel.Range)sheet.Cells[1,1];

Reading and Writing Values

Since we want to minimise the number of ranges and range operations, it’s fortunate that we can read/write multiple cell values in just a single statement using the Range.Value2 property (if you’re wondering why it’s called Value2, there is a Value property which uses native types). For a 1×1 range (which you shouldn’t be using), you can get/set a scalar value. For all multi-cell ranges, cell values come in the form of a 2-dimensional object array:

// getting values
object[,] values = absolute.Value2;
for (int row = 0; row < values.GetLength(0); row++) {
    for (int col = 0; col < values.GetLength(1); col++) {
        Console.Write("{0} ", values[row, col]);
    }
    Console.WriteLine();
}

// setting values
values = new object[,] {
    {1, "Blah"},
    {"3", 4.5}
};
relativeToSize.Value2 = values;

Excel stores cell values as variant types, and will perform parsing/type-interpretation based on the format of each cell, so a string literal containing only an integer will be treated in the same way as a constant integer.

Formatting

As in the Excel application, it is best to set cell formatting before writing values to the spreadsheet – this way, you can coerce values to be interpreted as specific data types, or force values to be interpreted as text. Most basic text formatting is self-explanatory for the Range interface (e.g. Range.Font controls the font style, size and colour, Range.Interior determines the fill, etc), however I will give special mention to some areas which are not so well-documented:

Colours

All colours in Excel use a 32-bit RGBA representation, expressed as an Int32. In contrast, all colours in GDI+ (and hence the rest of the .NET Framework) use the ARGB representation. In order to set cell colours or make comparisons with the System.Drawing.Color class, you will need to write a wrapper:

public struct ExcelColor {
    byte R;
    byte G;
    byte B;
    byte A;

    public ExcelColor(byte r, byte g, byte b, byte a) {
        A = a;
        R = r;
        G = g;
        B = b;
    }

    public static implicit operator System.Drawing.Color(ExcelColor that) {
        return System.Drawing.Color.FromArgb(that.A, that.R, that.G, that.B);
    }

    public static implicit operator ExcelColor(System.Drawing.Color that) {
        return new ExcelColor(that.R, that.G, that.B, that.A);
    }

    public static explicit operator int(ExcelColor that) {
        return BitConverter.ToInt32(new byte[] { that.R, that.G, that.B, that.A }, 0);
    }

    public static explicit operator ExcelColor(int that) {
        byte[] bytes = BitConverter.GetBytes(that);
        return new ExcelColor(bytes[0], bytes[1], bytes[2], 255); // Excel ignores the alpha component
    }
}

You can then convert easily between each colour type:

// GDI+ colour to Excel colour
absolute.Font.Color = (int)(ExcelColor)System.Drawing.Color.Red;

// Excel colour to GDI+ colour
System.Drawing.Color c = (ExcelColor)(int)singleCell.Font.Color;
Console.WriteLine("{0}", c);

Finally, to set a cell’s interior colour to “No Fill”, you must assign the value -4142 to Range.Interior.ColorIndex instead (otherwise the colour will be interpreted as black instead of transparent).

Number Formats

To set the number format (or data type) for a range, you can pass a string value to the Range.NumberFormat property:

  • For integral, decimal and currency values, pass a format string in the same format as used by the Excel application, e.g. #,##0.00 for a number with 2 decimal places and a thousands separator.
  • For date and time values, pass a date format string, e.g. d MMM yyyy for something like 13 Mar 2010.
  • To force the “Text” format (i.e. to force all values to be treated as text), pass the at-sign (@).
  • To reset to the default “General” format, pass General or an empty string.

Other Range Operations

You can use Range.Insert() to insert a new row or column, or to paste-insert a range that has been copied using Range.Copy():

// insert a new row above A15
Excel.Range insertPos = (Excel.Range)sheet.Cells[1, 15];
insertPos.Insert(Excel.XlInsertShiftDirection.xlShiftDown, Type.Missing);

// duplicate the first 15 rows
sheet.get_Range("$1:$15").Copy(Type.Missing);
insertPos.Insert(Excel.XlInsertShiftDirection.xlShiftDown, Type.Missing);

You can use Range.Find() and Range.FindNext() to locate cells that match a particular search term (note that FindNext will loop back to the start of the range):

string firstMatch = null;

// set find options and get the first match
Excel.Range result = sheet.Cells.Find(
    "search term",
    Type.Missing,
    Type.Missing,
    Excel.XlLookAt.xlWhole,
    Type.Missing,
    Excel.XlSearchDirection.xlNext,
    true,
    Type.Missing,
    Type.Missing
);

while (result != null) {
    // get the cell address
    string address = result.get_AddressLocal(true, true, Excel.XlReferenceStyle.xlA1, Type.Missing, Type.Missing);

    // note the address of the first result. if we have returned to the start, break out of the loop
    if (firstMatch == null)
        firstMatch = address;
    else if (firstMatch == address)
        break;

    Console.WriteLine(address);    // show that this is one of the cells that matches the search term

    // advance to the next result
    result = sheet.Cells.FindNext(result);
}

Merged Cells

Some operations in Excel will throw an exception if performed on a range that contains merged cells. You can test the Range.MergeCells property to determine if the range contains merged cells, and if working with more than just a single cell, Range.MergeArea to retrieve a range representing the merged cells. When reading/writing values or applying formatting to a range with merged cells, operate on only the upper-left cell of the merge area.

Working with Worksheet

In addition to the Cells property, Worksheet also allows access to the drawing canvas via the Shapes property, as well as a range of printing/pagination settings. Use Worksheet.Activate() to change the active worksheet.

The worksheet also provides access to these key operations:

  • SaveAs() – to save or export the sheet and/or workbook.
  • PrintOutEx() – to print the sheet.
  • PrintPreview() – to display the print preview dialog (modal) for the sheet.

Among the properties of Worksheet.PageSetup are the header/footer properties (e.g. CenterHeader, LeftFooter, etc) Each of these is a string value, and uses a set of escape codes to apply formatting and insert dynamic content:

  • && – to insert a single ampersand
  • &n (e.g. &9) – to set the font size
  • &[Page] – page number
  • &[Pages] – number of pages
  • &[Date]
  • &[Time]
  • &”Font name”
  • &B – bold
  • &I – italic
  • &U – underline

Final Words

This is by no means a comprehensive guide to Excel Automation, but it should provide you with the tools necessary to get started – and – to avoid falling into those common pitfalls I mentioned. In future posts, I hope to share some real-world applications for Excel Automation. The bulk of my experience in this area comes from building a template-driven reporting system that uses Excel for its presentation. Although I cannot provide much from that system (it is bound by intellectual property), I will be able to offer up some tasty tidbits 🙂