This article is also available as a video, courtesy of Webucator. Click here to jump to the video section below.

 

For many data types, sorting is a logical, deterministic operation that works by comparing elements and determining whether they are less-than, greater-than or equal-to each other. For example, with a list of integers, we can simply use the <, > and == operators to perform the comparisons. With DateTime values, we might first convert them into UTC/GMT, then to their fundamental form, ticks (i.e. the number of 100-nanosecond intervals that have elapsed since DateTime.MinValue), then compare them as we did with integers.

With strings, however, things get more complicated. One must first decide on what basis we perform the comparison: Is it case-sensitive or case-insensitive? Are accented characters considered equal to their unaccented equivalents? If not, how do they compare? Are there language/regional/cultural factors to consider? Are there cases where two consecutive characters are read as if they were just one?

Thankfully, the .NET Framework supports a wide range of string comparisons and therefore several choices when it comes to sorting. The built-in StringComparer implementations are CurrentCulture, CurrentCultureIgnoreCase, InvariantCulture, InvariantCultureIgnoreCase, Ordinal and OrdinalIgnoreCase. Because I develop exclusively within a single locale and language, my preferred implementation is OrdinalIgnoreCase, which equates uppercase and lowercase letters while comparing the rest according to their order in the character set (i.e. digits before letters, with symbols on either side).

The problem with the built-in comparisons is that they operate on (broadly speaking) an alphabetical basis. Even when we desire strings to appear in alphabetical order, we do not always intend for such a literal interpretation; i.e. the following strings are sorted in alphabetical order, but not in the order a human would place them:

  • “1. Introduction”
  • “10. Summary”
  • “11. Notes”
  • “2. Oranges”
  • “3. Apples”
  • “4. Berries”

Similarly, we might have a list of strings which contain both alphabetical and numerical terms, expressed naturally:

  • “3rd man”
  • “Fifty-third page”
  • “First chase”
  • “First place”
  • “Four people”
  • “Second place”

A more sophisticated comparer is required if we are to sort strings of this kind in a logical order.

My Approach

My implementation works by breaking strings into tokens (i.e. words comprised of both letters and numbers, but not whitespace or punctuation) and, where recognised, assigning a numerical value to them. If not recognised, a token is compared alphabetically/lexicographically. I look for the following types of tokens:

  • Digit strings (including those with currency symbols, digit grouping symbols and decimal points), e.g. "123", "$123.00", "100,000"
    – These are converted using the Decimal.TryParse() method.
  • Ordinal digit strings; e.g. "1st", "2nd", "3rd", "100th"
    – After removing the suffix, the same process as above applies.
  • Words which translate directly into a number; e.g. "one", "two", "twenty"
    – These are matched individually to a decimal constant.
  • Ordinal words which translate directly into a number; e.g. "first", "third", "fifteenth"
    – As above.
  • Multiple consecutive words expressing a larger number; e.g. "nine hundred", "twenty-seven", "eight million, one hundred thousand, seven hundred and sixty six"
    – For these, we look ahead to the next token and, if necessary, add or multiply the numerical values together.

How the Lookahead Algorithm Works

This is the algorithm that converts a string into a numerical value that can be used for comparisons:

  1. Split the string using the regular expression \b(.+?)\b
  2. For each token:
    1. Attempt to convert the token to a numerical value (value) and note the type of match;
      No match, digit string, word, ordinal word, the word "and"/"&", the add modifier and the multiply modifier (used later)
    2. If the match was a word or ordinal word (with or without modifiers) then:
      1. Push the numerical value onto a new stack and reset the value to zero
      2. While there are still tokens remaining:
        1. Get the next token
        2. Attempt to convert the token to a numerical value (next) and note the type of match
        3. If there was no match or a digit string was matched, abort
        4. Pop the top value off the stack (temp)
        5. If the multiply modifier was matched:
          1. Continue popping elements off the stack and adding them to temp until the result is non-zero (or the next element on the stack is less than next)
          2. Multiply temp by next
          3. Push temp onto the stack, followed by a zero
        6. If the add modifier was matched:
          1. Add next onto temp
          2. Push temp onto the stack
      3. Pop all elements off the stack, adding them onto value
    3. If the original match was valid, return value – we will use this value to compare the tokens
  3. If no value has been returned by this point, we will compare the tokens as though they were normal strings

In short: If a numerical value is matched, we look ahead until we reach the end or the value is no longer valid. Each successive token is added to the original value, however when a multiplier is found (e.g. “hundred”, “thousand”, etc) we continue accumulating values in the stack until we are ready to perform the multiplication; only then can the result be added on.

How the Comparisons Are Performed

  1. Split each string into tokens
  2. While there are tokens remaining in both strings:
    1. Run the lookahead algorithm (above), consuming tokens from each string
    2. If numerical values were found for both strings, compare numerically
    3. If a numerical value was only found for one string, convert the numerical value into a string and compare it with the other token
    4. If neither string contained a numerical value, compare normally
    5. If the result of the comparison was less-than or greater-than, return the result
  3. If no result was returned (i.e. numerical values were equal), compare the number of tokens consumed from each string and regard the shortest to be lesser
  4. If the result from the previous step was less-than or greater-than, return the result
  5. If no result was returned, compare the original strings normally and return the result

In short: Where numerical values can be obtained from the strings, we will compare these with each other (regardless of how much of the string we consume in doing so; only the value matters). Where numerical values are absent from one string, the other string takes precedence. Once no more numerical values can be obtained, the remainder of each string is compared normally.

Implementation

I implemented this functionality in a C# class named NaturalLanguageStringComparer, which implements both the generic (string) and non-generic (object) versions of the IComparer interface. This is the standard mechanism for comparators in the .NET Framework, which enables the class to be used in:

  • The List<T>.Sort(), ArrayList.Sort() and Array.Sort() methods to perform a one-off sort on a collection
  • When initialising a SortedList<TKey, TValue> or SortedDictionary<TKey, TValue> to keep the elements in the correct order
  • In LINQ queries (the OrderBy() extension method) to order a sequence

You can also call the ParseNatural() method to utilise the numerical conversion functionality directly.

Example

List terms = new List();
terms.Add("1");
terms.Add("fourth inkwell");
terms.Add("10");
terms.Add("2");
terms.Add("one hundred and three");
terms.Add("five");
terms.Add("7th saucepan");
terms.Add("third compass");
terms.Add("eight");
terms.Add("one hundred and thirty six");
terms.Add("zero");
terms.Add("twenty one");
terms.Add("five hundred");
terms.Add("fifth pencil");
terms.Add("22nd bucket");

NaturalLanguageStringComparer nlsc = new NaturalLanguageStringComparer();
terms.Sort(nlsc);

This produces the following result:
zero
1
2
third compass
fourth inkwell
fifth pencil
five
7th saucepan
eight
10
twenty one
22nd bucket
one hundred and three
one hundred and thirty six
five hundred

Download

NaturalLanguageSorting.zip (Visual Studio 2012 solution, includes example)

If you find my code useful, please consider making a donation.

Video

After initially publishing this article, I was contacted by Webucator, a US-based technology training provider. Their course, “C# Solutions from the Web” showcases a range of articles like this one. I was very gratified to hear that they were including a video based on my article. You can watch the video below and, if you find it interesting, visit the link to the aforementioned course.

Leave a reply

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> 

required