This article is also available as a video, courtesy of Webucator. Click here to jump to the video section below.

 

For many data types, sorting is a logical, deterministic operation that works by comparing elements and determining whether they are less-than, greater-than or equal-to each other. For example, with a list of integers, we can simply use the <, > and == operators to perform the comparisons. With DateTime values, we might first convert them into UTC/GMT, then to their fundamental form, ticks (i.e. the number of 100-nanosecond intervals that have elapsed since DateTime.MinValue), then compare them as we did with integers.

With strings, however, things get more complicated. One must first decide on what basis we perform the comparison: Is it case-sensitive or case-insensitive? Are accented characters considered equal to their unaccented equivalents? If not, how do they compare? Are there language/regional/cultural factors to consider? Are there cases where two consecutive characters are read as if they were just one?

Thankfully, the .NET Framework supports a wide range of string comparisons and therefore several choices when it comes to sorting. The built-in StringComparer implementations are CurrentCulture, CurrentCultureIgnoreCase, InvariantCulture, InvariantCultureIgnoreCase, Ordinal and OrdinalIgnoreCase. Because I develop exclusively within a single locale and language, my preferred implementation is OrdinalIgnoreCase, which equates uppercase and lowercase letters while comparing the rest according to their order in the character set (i.e. digits before letters, with symbols on either side).

The problem with the built-in comparisons is that they operate on (broadly speaking) an alphabetical basis. Even when we desire strings to appear in alphabetical order, we do not always intend for such a literal interpretation; i.e. the following strings are sorted in alphabetical order, but not in the order a human would place them:

  • “1. Introduction”
  • “10. Summary”
  • “11. Notes”
  • “2. Oranges”
  • “3. Apples”
  • “4. Berries”

Similarly, we might have a list of strings which contain both alphabetical and numerical terms, expressed naturally:

  • “3rd man”
  • “Fifty-third page”
  • “First chase”
  • “First place”
  • “Four people”
  • “Second place”

A more sophisticated comparer is required if we are to sort strings of this kind in a logical order.

My Approach

My implementation works by breaking strings into tokens (i.e. words comprised of both letters and numbers, but not whitespace or punctuation) and, where recognised, assigning a numerical value to them. If not recognised, a token is compared alphabetically/lexicographically. I look for the following types of tokens:

  • Digit strings (including those with currency symbols, digit grouping symbols and decimal points), e.g. "123", "$123.00", "100,000"
    – These are converted using the Decimal.TryParse() method.
  • Ordinal digit strings; e.g. "1st", "2nd", "3rd", "100th"
    – After removing the suffix, the same process as above applies.
  • Words which translate directly into a number; e.g. "one", "two", "twenty"
    – These are matched individually to a decimal constant.
  • Ordinal words which translate directly into a number; e.g. "first", "third", "fifteenth"
    – As above.
  • Multiple consecutive words expressing a larger number; e.g. "nine hundred", "twenty-seven", "eight million, one hundred thousand, seven hundred and sixty six"
    – For these, we look ahead to the next token and, if necessary, add or multiply the numerical values together.

How the Lookahead Algorithm Works

This is the algorithm that converts a string into a numerical value that can be used for comparisons:

  1. Split the string using the regular expression \b(.+?)\b
  2. For each token:
    1. Attempt to convert the token to a numerical value (value) and note the type of match;
      No match, digit string, word, ordinal word, the word "and"/"&", the add modifier and the multiply modifier (used later)
    2. If the match was a word or ordinal word (with or without modifiers) then:
      1. Push the numerical value onto a new stack and reset the value to zero
      2. While there are still tokens remaining:
        1. Get the next token
        2. Attempt to convert the token to a numerical value (next) and note the type of match
        3. If there was no match or a digit string was matched, abort
        4. Pop the top value off the stack (temp)
        5. If the multiply modifier was matched:
          1. Continue popping elements off the stack and adding them to temp until the result is non-zero (or the next element on the stack is less than next)
          2. Multiply temp by next
          3. Push temp onto the stack, followed by a zero
        6. If the add modifier was matched:
          1. Add next onto temp
          2. Push temp onto the stack
      3. Pop all elements off the stack, adding them onto value
    3. If the original match was valid, return value – we will use this value to compare the tokens
  3. If no value has been returned by this point, we will compare the tokens as though they were normal strings

In short: If a numerical value is matched, we look ahead until we reach the end or the value is no longer valid. Each successive token is added to the original value, however when a multiplier is found (e.g. “hundred”, “thousand”, etc) we continue accumulating values in the stack until we are ready to perform the multiplication; only then can the result be added on.

How the Comparisons Are Performed

  1. Split each string into tokens
  2. While there are tokens remaining in both strings:
    1. Run the lookahead algorithm (above), consuming tokens from each string
    2. If numerical values were found for both strings, compare numerically
    3. If a numerical value was only found for one string, convert the numerical value into a string and compare it with the other token
    4. If neither string contained a numerical value, compare normally
    5. If the result of the comparison was less-than or greater-than, return the result
  3. If no result was returned (i.e. numerical values were equal), compare the number of tokens consumed from each string and regard the shortest to be lesser
  4. If the result from the previous step was less-than or greater-than, return the result
  5. If no result was returned, compare the original strings normally and return the result

In short: Where numerical values can be obtained from the strings, we will compare these with each other (regardless of how much of the string we consume in doing so; only the value matters). Where numerical values are absent from one string, the other string takes precedence. Once no more numerical values can be obtained, the remainder of each string is compared normally.

Implementation

I implemented this functionality in a C# class named NaturalLanguageStringComparer, which implements both the generic (string) and non-generic (object) versions of the IComparer interface. This is the standard mechanism for comparators in the .NET Framework, which enables the class to be used in:

  • The List<T>.Sort(), ArrayList.Sort() and Array.Sort() methods to perform a one-off sort on a collection
  • When initialising a SortedList<TKey, TValue> or SortedDictionary<TKey, TValue> to keep the elements in the correct order
  • In LINQ queries (the OrderBy() extension method) to order a sequence

You can also call the ParseNatural() method to utilise the numerical conversion functionality directly.

Example

List terms = new List();
terms.Add("1");
terms.Add("fourth inkwell");
terms.Add("10");
terms.Add("2");
terms.Add("one hundred and three");
terms.Add("five");
terms.Add("7th saucepan");
terms.Add("third compass");
terms.Add("eight");
terms.Add("one hundred and thirty six");
terms.Add("zero");
terms.Add("twenty one");
terms.Add("five hundred");
terms.Add("fifth pencil");
terms.Add("22nd bucket");

NaturalLanguageStringComparer nlsc = new NaturalLanguageStringComparer();
terms.Sort(nlsc);

This produces the following result:
zero
1
2
third compass
fourth inkwell
fifth pencil
five
7th saucepan
eight
10
twenty one
22nd bucket
one hundred and three
one hundred and thirty six
five hundred

Download

NaturalLanguageSorting.zip (Visual Studio 2012 solution, includes example)

If you find my code useful, please consider making a donation.

Video

After initially publishing this article, I was contacted by Webucator, a US-based technology training provider. Their course, “C# Solutions from the Web” showcases a range of articles like this one. I was very gratified to hear that they were including a video based on my article. You can watch the video below and, if you find it interesting, visit the link to the aforementioned course.

You may remember quite an old post of mine about the shortcomings of ExtractAssociatedIcon() which, surprisingly, still gets a fair number of hits. In light of some of the feedback i’ve received since the original post, i’ve decided to update the code to include the following fixes and improvements:

Both methods in IconTools now use SHGetFileInfo

As was pointed out by Michael (no surname provided), the native SHGetFileInfo() function (which we were already using to get the icon for existing files) can also be used to get the icon for a file’s type. The additional flag SHGFI_USEFILEATTRIBUTES instructs the API not to attempt to access the file, and simply locates the icon by examining its attributes. It turns out that you can get away with passing a file extension directly to this function.

The more long-winded implementation based on the ExtractIconEx() function is no longer used. This also removes the need to access the Windows Registry to find the DefaultIcon key.

Handles are now released using DestroyIcon

Another commenter on the original post, Cobein, pointed out (quite correctly) that the managed Icon class does not automatically release handles to icons created using the FromHandle() method.

This posed a slight design challenge, in that users could not be expected to do this after they finished using the icon returned by the method in the IconTools class. To get around this, I decided to use the Clone() method to copy the icon, before releasing the native handle and returning the managed copy instead.

The GetIconForFile() method now looks like this:

public static Icon GetIconForFile(string filename, ShellIconSize size) {
   SHFILEINFO shinfo = new SHFILEINFO();
   NativeMethods.SHGetFileInfo(filename, 0, ref shinfo, (uint)Marshal.SizeOf(shinfo), size);

   Icon icon = null;

   if (shinfo.hIcon.ToInt32() != 0) {
      // create the icon from the native handle and make a managed copy of it
      icon = (Icon)Icon.FromHandle(shinfo.hIcon).Clone();

      // release the native handle
      NativeMethods.DestroyIcon(shinfo.hIcon);
   }

   return icon;
}

Example Usage

Usage remains unchanged; if you used the previous version of this code, it will continue to work as before.

Icon smallIcon = IconTools.GetIconForFile(
   @"C:\Windows\System32\notepad.exe",
   ShellIconSize.SmallIcon
);

Icon largeIcon = IconTools.GetIconForExtension(".html", ShellIconSize.LargeIcon);

Download

IconTools.cs

Thank you!

At the time of writing, my ComboTreeBox control is the most popular item on this blog. I never could have imagined that a simple Windows Forms control would attract so much attention on the web! All I can say to those who have read my article and downloaded my code is, “Thank you!” Your continued support gives me the motivation to keep writing new and interesting content.

New Project Page

With this in mind, i’ve decided to move both ComboTreeBox and GroupedComboBox to their own project page here: Drop-Down Controls

This gives them a permanent home and makes it easier for me to manage occasional updates and new features. As before, you can download binaries and source code for the controls, as well as an example application that shows off their potential.

New Enhancements

This reshuffling has given me an opportunity to release a few bugfixes, including a memory leak in the Buffered Paint code.

Normal ComboTreeBox ComboTreeBox with ShowCheckBoxes set

I also took the opportunity to enhance the ComboTreeBox control by adding a ShowCheckBoxes property. In this mode of operation, the normal selection rules are suspended and checkbox glyphs are drawn beside each node. You can then select multiple items, accessible via the CheckedNodes property. This is further aligns the functionality offered by the control with that of the built-in TreeView control.

Enjoy!

Text Object Model Demo

I’ve recently been doing a lot of work with the WinForms RichTextBox control, trying to implement the familiar red wavy underlines associated with spell-checking. After some research, I discovered that the underlying (native) RichEdit control used by the RichTextBox does in fact support this style of underlining. The catch? It is only accessible through the EM_SETCHARFORMAT window message, which requires PInvoke. While it is relatively simple to set this style, there are drawbacks:

  • High overheads associated with marshalling
  • Clunky operation, requires you to send the CHARFORMAT2 structure
  • You can only get/set the style for the selected text

The last point in particular is of most concern to me, and is a serious limitation of the RichTextBox control. Almost all of the rich text functionality is restricted to the text selected in the control. In practice, this means that any non-trivial text manipulation requires you to constantly store the current selection, select the text you wish to manipulate, apply the styles and then restore the old selection. This is slow and extremely inefficient.

I then looked into something called the Text Object Model (or TOM). This comprises a set of COM interfaces which are implemented by the underlying RichEdit control. The functionality exposed by these interfaces is quite similar to that of the Microsoft Word object model, where you have the concept of Document, Range, Selection and so on. Importantly, TOM allows you to operate on ranges of rich text without the need to alter the selection. It also provides access to a range of functionality not otherwise available for the RichTextBox control:

  • Granular selection by character, word, sentence and paragraph
  • Font weights, underline styles, all-caps
  • Tab stops, list styles, full justification of text
  • Find functionality, translation to/from screen coordinates

Using the Text Object Model

The TOM interfaces are:

  • ITextDocument – Represents a top-level document, from which text ranges can be obtained.
  • ITextRange – Represents a range of rich text. Ranges can be moved, resized and styled with ease.
  • ITextSelection – Special text range that represents the selected text in the control.
  • ITextFont – Character formatting options.
  • ITextPara – Paragraph formatting options.

It is relatively easy to obtain an instance of ITextDocument from a RichTextBox control. One simply sends the EM_GETOLEINTERFACE message to the control. There are a number of ways that you can interact with the object that gets returned:

  1. Use the dynamic keyword to access the object’s members – this requires the least programming effort, but introduces the possibility for errors and has a number of overheads.
  2. Define the TOM interfaces in managed code and use COM interop – this allows you to work in a strongly-typed manner, but still has the overheads associated with marshalling. Also, the interfaces themselves are not terribly .NET-friendly, and they lack some of the conveniences that would be familiar to most .NET developers.
  3. Write wrapper classes for each of the TOM interfaces – while this approach requires the most coding, the end result is a fast, efficient way of accessing the TOM functionality. The calling code does not have to use COM interop and the wrapper classes can translate unfriendly constructs into more .NET-friendly ones.

My solution

Needless to say, I settled for the 3rd option. I decided to implement the wrapper classes in a C++/CLI assembly, because this allowed me to access the TOM interfaces using native code, while ultimately exposing a set of managed classes. Well-written native code (that explicitly marshals data to managed code) will always be faster than COM interop, where the runtime has to apply more checks and is unable to make as many assumptions about the types it is operating on.

Some of the benefits offered by my wrappers include:

  • Managed enumerations to replace integer constants and flags
  • Properties to replace Get and Set methods
  • Translation of HRESULT codes into managed exceptions
  • Returning objects instead of pointers
  • Omission of TOM functionality not supported by the RichEdit control (e.g. text shadows, animation)
  • Use of more .NET-friendly types and nomenclature; e.g. IEquatable, ToString, IDataObject, Color and Point

The top-level class in my implementation is TextDocument, which wraps ITextDocument. It includes a static method which creates an instance from a RichTextBox control. Once created, you have full access to the Text Object Model functionality to manipulate the text inside the control.

Usage

Once obtaining a TextDocument instance from a RichTextBox control, working with formatted text is easy:

// create a RichTextBox control in the usual way
RichTextBox rtb = new RichTextBox();

// create a TOM document object (and enable advanced typography on the control)
TextDocument doc = TextDocument.FromRichTextBox(rtb, true);
TextRange range = doc.EntireRange;

// set some text
range.Text = "This is a piece of rich text.";
range.Font.Name = "Calibri";
range.Font.Size = 16f;

// find a word and apply formatting
range.FindText("piece");
range.Font.Bold = true;
range.Font.ForeColor = Color.Red;

// insert a tab
range.Collapse(RangePosition.End);
range.Text = "\t";

// resize range and apply more formatting
range.FindText("rich");
range.MoveEnd(TextUnit.Word, 2);
range.Font.UnderlineStyle = TextUnderlineStyle.Wave;
range.Font.UnderlineColor = TextUnderlineColor.Blue;

// append raw RTF using IDataObject
range.MoveEnd(TextUnit.Story, 1);
range.Collapse(RangePosition.End);
range.SetDataObject(new DataObject(
    DataFormats.Rtf, 
    @"{\rtf1\ansi\deff0\pard \par Here is some \ul more\ul0  rich text.\par}"
));

// set paragraph formatting
doc.EntireRange.Para.Alignment = TextAlignment.Center;

Download

Go to the project page here: TOM Classes for .NET

Final words

This set of managed wrapper classes for the Text Object Model provides a fast and efficient way to manipulate the text in a RichTextBox control. Importantly, it solves the limitation of only being able to apply styles to the selected text in the control. As an added bonus, it also provides access to a wider range of character and paragraph formats, as well as a number of convenience methods for working with ranges of formatted text. I hope you find it useful in your own rich text applications.

Additional resources

Text Object Model on MSDN
Rich Text Format Specification

Hi all,

Just a quick announcement to say that comments are working again.

It seems that the captcha plug-in I was using has been discontinued, including the server that validates the captchas themselves. I’ve replaced the old captcha plug-in with one that uses Google reCAPTCHA instead. This should be familiar to most users from other sites on the web. While I don’t like the clunky nature of most captchas, they are necessary on this blog given the amount of spam I receive.

Anyway, apologies for any inconvenience if you were unable to comment on an article. Everything should be fine now.