In an earlier article, I developed a set of managed wrapper classes for the Text Object Model (TOM), the API that underpins the RichEdit/RichTextBox control and provides an object model that is document-centric (rather than selection-centric, like the controls themselves). With it, you can efficiently manipulate rich text and, using the managed classes, do so from within a C# (or other .NET) project.

With Windows 8 came an update to the Text Object Model; TOM 2 includes enhancements such as:

  • Documents containing multiple stories
  • Default character and paragraph formatting
  • East Asian language and UTF-32 support
  • Object properties
  • Build-up/build-down operations for math text
  • Additional character effects and paragraph styles
  • Image insertion
  • Basic support for tables
  • Mechanism for manipulating rich text strings separately from the document

The TOM 2 functionality can also be used on earlier versions of Windows where Office 2013 is installed; in that scenario, the document instance must be obtained from the RichEdit control that ships with Office. This is preferable even on Windows 8, as the Office implementation is generally more complete than the RichEdit control included with the OS (for example, the basic implementation does not include support for math text).

Extending the TOM classes

The TOM 2 functionality is exposed by a set of COM interfaces that extend the original Text Object Model interfaces; for example, ITextDocument2 extends ITextDocument, ITextRange2 extends ITextRange, etc. For my managed wrappers, I decided that each of the existing classes should encapsulate the functionality of both interfaces; attempting to access unsupported functionality will result in an exception. Adding a SupportedVersion property allows the caller to check for feature support and gracefully degrade.

When constructing each object, we call the QueryInterface method on the IUnknown interface of the underlying COM object to see whether the newer TOM interface is implemented; if it is not, the older interface is used. For example:

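// Ask for both interfaces before releasing the original IUnknown reference.
ITextDocument* doc = NULL;
HRESULT hr1 = punk->QueryInterface(__uuidof(ITextDocument), (void**)&doc);

ITextDocument2* doc2 = NULL;
HRESULT hr2 = punk->QueryInterface(__uuidof(ITextDocument2), (void**)&doc2);

punk->Release();

// The original interface is mandatory; fail if it is missing.
if (FAILED(hr1)) {
    if (SUCCEEDED(hr2)) doc2->Release();  // do not leak the TOM 2 reference
    Marshal::ThrowExceptionForHR(hr1);
}

if (SUCCEEDED(hr2)) {
    // use ITextDocument2...
}
else {
    // use ITextDocument...
}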
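
The same probing pattern, together with a SupportedVersion check, can be sketched in portable C++. This is an illustration only, not the wrapper code itself: DocumentV1, DocumentV2, LegacyDocument, ModernDocument and DocumentWrapper are hypothetical stand-ins, and dynamic_pointer_cast plays the role that QueryInterface plays for the real COM interfaces.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <utility>

// Hypothetical stand-ins for ITextDocument / ITextDocument2 (not the real COM declarations).
struct DocumentV1 {
    virtual ~DocumentV1() = default;
    virtual long StoryCount() const = 0;      // illustrative member available in both versions
};

struct DocumentV2 : DocumentV1 {
    virtual long MathProperties() const = 0;  // illustrative member that needs the newer interface
};

// Fake implementations so the sketch can be exercised without a RichEdit control.
struct LegacyDocument : DocumentV1 {
    long StoryCount() const override { return 1; }
};

struct ModernDocument : DocumentV2 {
    long StoryCount() const override { return 3; }
    long MathProperties() const override { return 0; }
};

// One wrapper class covers both interface versions.
class DocumentWrapper {
public:
    explicit DocumentWrapper(std::shared_ptr<DocumentV1> doc)
        : v1_(std::move(doc)),
          v2_(std::dynamic_pointer_cast<DocumentV2>(v1_)) {  // stands in for the second QueryInterface
        assert(v1_ && "the original interface is mandatory");
    }

    // Lets callers test for feature support before using newer members.
    int SupportedVersion() const { return v2_ ? 2 : 1; }

    long StoryCount() const { return v1_->StoryCount(); }

    // Members that need the newer interface throw when it is unavailable.
    long MathProperties() const {
        if (!v2_) throw std::logic_error("this member requires the newer interface");
        return v2_->MathProperties();
    }

private:
    std::shared_ptr<DocumentV1> v1_;  // always set
    std::shared_ptr<DocumentV2> v2_;  // null when only the original interface exists
};
```

A caller would check SupportedVersion() first and fall back to simpler behaviour when it returns 1, rather than catching the exception.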

Most of the new methods and properties are implemented by calling the corresponding method on the COM object and converting between native and managed data types. In practice, this means that:

  • long values are implicitly converted to/from System.Int32
  • Enumerations are cast to the appropriate System.Enum type
  • BSTR values are marshalled to/from System.String
  • IStream objects are created from System.IO.Stream objects (by allocating a block of unmanaged memory and copying the data to it)
  • HRESULT values (other than S_OK and S_FALSE) cause managed exceptions to be thrown
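
The last rule in the list can be sketched in portable C++ as follows. The names HResult, ComError and CheckResult are hypothetical, and the constants merely reproduce the documented numeric values of S_OK, S_FALSE and E_NOTIMPL so that the example compiles without the Windows headers. Treating every non-failure code other than S_OK as "false" is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>

using HResult = std::int32_t;   // stand-in for the Win32 HRESULT type
constexpr HResult kOk = 0;      // value of S_OK
constexpr HResult kFalse = 1;   // value of S_FALSE
constexpr HResult kNotImpl = static_cast<HResult>(0x80004001u);  // value of E_NOTIMPL

// Formats a result code as 0xXXXXXXXX for use in error messages.
inline std::string FormatHResult(HResult hr) {
    char buffer[16];
    int written = std::snprintf(buffer, sizeof buffer, "0x%08X", static_cast<unsigned>(hr));
    assert(written == 10);  // "0x" plus eight hex digits
    (void)written;
    return std::string(buffer);
}

// Hypothetical exception type that carries the failing result code.
class ComError : public std::runtime_error {
public:
    explicit ComError(HResult hr)
        : std::runtime_error("COM call failed with HRESULT " + FormatHResult(hr)), code_(hr) {}
    HResult code() const noexcept { return code_; }

private:
    HResult code_;
};

// Throws for failure codes (severity bit set, so the value is negative);
// otherwise reports whether the call returned S_OK rather than S_FALSE.
inline bool CheckResult(HResult hr) {
    if (hr < 0) throw ComError(hr);
    return hr == kOk;
}
```

Returning a bool keeps the S_OK/S_FALSE distinction available to the caller, which matters for TOM methods that use S_FALSE to signal "nothing happened" rather than an error.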

As with the original managed wrappers, I endeavoured to translate from Win32/COM terminology to .NET nomenclature; for example:

  • Duplicate becomes Clone
  • IsEqual becomes Equals (also implementing IEquatable<T>)
  • Get/Set methods become properties
  • 1-based indexes are translated to 0-based indexes
  • Start/End positions are translated to Start & Length
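
The last two translations are simple arithmetic, sketched below in portable C++. The type and function names (RangeSpan, ToZeroBased, ToOneBased, FromStartEnd, EndOf) are hypothetical, and the sketch assumes that only positive 1-based indexes are passed in and that the End position is exclusive, i.e. it is the position just after the last character in the range.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical managed-style description of a range: start position plus length.
struct RangeSpan {
    long start;
    long length;
};

// Converts a 1-based native index to a 0-based index.
inline long ToZeroBased(long nativeIndex) {
    if (nativeIndex < 1) throw std::out_of_range("native indexes start at 1 in this sketch");
    return nativeIndex - 1;
}

// Converts a 0-based index back to the 1-based form expected by the native API.
inline long ToOneBased(long managedIndex) {
    if (managedIndex < 0) throw std::out_of_range("managed indexes start at 0");
    return managedIndex + 1;
}

// Converts native Start/End positions (End exclusive) to Start and Length.
inline RangeSpan FromStartEnd(long start, long end) {
    if (start < 0 || end < start) throw std::invalid_argument("expected 0 <= start <= end");
    return RangeSpan{start, end - start};
}

// Recovers the native End position from Start and Length.
inline long EndOf(const RangeSpan& span) {
    assert(span.length >= 0);
    return span.start + span.length;
}
```

For example, a native range with Start 5 and End 12 becomes a span with start 5 and length 7, and EndOf returns 12 again when the value has to be passed back to the native API.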

The TOM 2 interfaces include some methods that replace methods from the original interfaces, e.g. GetDuplicate2. For these, the wrapper class provides a single managed method that checks which interface is implemented and then calls the appropriate native method.

Also as before, all TOM classes implement IDisposable to allow unmanaged resources to be released when they are no longer required.

Obtaining a TOM 2 document

An instance of the ITextDocument2 interface must be obtained from a RichEdit control. However, the Windows Forms RichTextBox control loads version 2.0 of the RichEdit control, which does not support TOM 2, even on Windows 8. Basic TOM 2 functionality is provided by version 4.1 of the control, located in MSFTEDIT.DLL (which ships with Windows). Full TOM 2 functionality requires the RichEdit control that ships with Office 2013, the so-called “RichEdit 8” (although its window class name is actually RICHEDIT60W).

By subclassing the RichTextBox control and overriding the CreateParams property, we can force a different version of the RichEdit control to be loaded. This trick involves using the native LoadLibrary function to load the DLL containing the control (either MSFTEDIT.DLL or RICHED20.DLL) and then changing the ClassName property to select the correct version. For example:

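virtual property System::Windows::Forms::CreateParams^ CreateParams {
    System::Windows::Forms::CreateParams^ get() override {
        // Load the Office 2013 RichEdit DLL so that its window class is registered.
        // The path is hard-coded here for illustration only.
        LoadLibraryW(L"C:\\Program Files\\Common Files\\Microsoft Shared\\OFFICE15\\RICHED20.DLL");

        // Start from the base class parameters and swap in the newer window class.
        System::Windows::Forms::CreateParams^ cp = RichTextBox::CreateParams;
        cp->ClassName = "RICHEDIT60W";
        return cp;
    }
}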

In doing so, the platform of the calling process must match that of the DLL being loaded; e.g. a 64-bit application must load the DLL from the 64-bit version of Office.
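
One way to avoid a hard-coded path is to build it from the CommonProgramFiles environment variable, which on Windows normally points at the "Common Files" directory that matches the bitness of the current process. The sketch below is in portable C++ and only builds the candidate path; it does not check that the file exists. The helper names are hypothetical, and this is not how the managed classes locate the DLL.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Reports the pointer width of the current process (32 or 64 on the usual Windows targets).
inline int ProcessBitness() {
    return static_cast<int>(sizeof(void*) * 8);
}

// Builds the Office 2013 RichEdit path beneath a given "Common Files" directory.
// The "Microsoft Shared\OFFICE15\RICHED20.DLL" suffix is taken from the listing above.
inline std::string OfficeRichEditPath(std::string commonFilesDir) {
    assert(!commonFilesDir.empty());
    if (commonFilesDir.back() != '\\') commonFilesDir += '\\';
    return commonFilesDir + "Microsoft Shared\\OFFICE15\\RICHED20.DLL";
}

// Reads the CommonProgramFiles environment variable; returns an empty string if it is not set.
inline std::string CommonFilesFromEnvironment() {
    const char* value = std::getenv("CommonProgramFiles");
    return value ? std::string(value) : std::string();
}
```

A caller would pass the result of CommonFilesFromEnvironment() to OfficeRichEditPath() when it is non-empty, and fall back to MSFTEDIT.DLL if loading the resulting path fails.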

Included with the managed TOM classes is the RichTextBoxEx control, which uses the above trick (as well as some querying of the Windows registry) to load the best available version of the native RichEdit control. With this control, you can obtain a TOM 2 document with very few lines of C# code:

using (RichTextBoxEx rtb = new RichTextBoxEx()) {
    TextDocument doc = TextDocument.FromRichTextBox(rtb);
    Console.WriteLine("Supported version is {0}", doc.SupportedVersion);
}

Math functionality

One of the key areas of functionality offered by TOM 2 concerns math text. The most notable feature is being able to convert a linear-form equation (i.e. consisting of a string of Unicode characters, or other text which can be interpreted by the TOM engine) into a built-up form (i.e. consisting of inline math objects, with full formatting) – and back again. These transformations are applied using the BuildUpMath and Linearize methods. A range of enumeration values/flags control this process. Also included is the ability to “fold” complex alphabetic characters into their plain text equivalents (using the new version of the GetText method).

For example, BuildUpMath transforms the following Unicode string:

f(x)=a_0+∑_(n=1)^∞▒(a_n cos⁡〖nπx/L〗+b_n sin⁡〖nπx/L〗)

…into the following built-up math text:

[Image: the equation above rendered as built-up math]

By request, I have extended the built-in TOM functionality by adding conversions from built-up math text to two common XML-based markup formats:

  • Office MathML (OMML) – Math RTF and OMML are very similar; the conversion parses the RTF and makes the necessary transformations to produce the XML output.
  • W3C MathML (MML) – The conversion leverages the XSL stylesheet included with Microsoft Office which translates OMML into MML.

The MathML code generated for the previous example is:

<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math">
  <mml:mi>f</mml:mi>
  <mml:mfenced separators="|">
    <mml:mrow>
      <mml:mi>x</mml:mi>
    </mml:mrow>
  </mml:mfenced>
  <mml:mo>=</mml:mo>
  <mml:msub>
    <mml:mrow>
      <mml:mi>a</mml:mi>
    </mml:mrow>
    <mml:mrow>
      <mml:mn>0</mml:mn>
    </mml:mrow>
  </mml:msub>
  <mml:mo>+</mml:mo>
  <mml:mrow>
    <mml:msubsup>
      <mml:mo stretchy="false">∑</mml:mo>
      <mml:mrow>
        <mml:mi>n</mml:mi>
        <mml:mo>=</mml:mo>
        <mml:mn>1</mml:mn>
      </mml:mrow>
      <mml:mrow>
        <mml:mi>∞</mml:mi>
      </mml:mrow>
    </mml:msubsup>
    <mml:mrow>
      <mml:mfenced separators="|">
        <mml:mrow>
          <mml:msub>
            <mml:mrow>
              <mml:mi>a</mml:mi>
            </mml:mrow>
            <mml:mrow>
              <mml:mi>n</mml:mi>
            </mml:mrow>
          </mml:msub>
          <mml:mrow>
            <mml:mrow>
              <mml:mi mathvariant="italic">cos</mml:mi>
            </mml:mrow>
            <mml:mo>⁡</mml:mo>
            <mml:mrow>
              <mml:mfrac>
                <mml:mrow>
                  <mml:mi>n</mml:mi>
                  <mml:mi>π</mml:mi>
                  <mml:mi>x</mml:mi>
                </mml:mrow>
                <mml:mrow>
                  <mml:mi>L</mml:mi>
                </mml:mrow>
              </mml:mfrac>
            </mml:mrow>
          </mml:mrow>
          <mml:mo>+</mml:mo>
          <mml:msub>
            <mml:mrow>
              <mml:mi>b</mml:mi>
            </mml:mrow>
            <mml:mrow>
              <mml:mi>n</mml:mi>
            </mml:mrow>
          </mml:msub>
          <mml:mrow>
            <mml:mrow>
              <mml:mi mathvariant="italic">sin</mml:mi>
            </mml:mrow>
            <mml:mo>⁡</mml:mo>
            <mml:mrow>
              <mml:mfrac>
                <mml:mrow>
                  <mml:mi>n</mml:mi>
                  <mml:mi>π</mml:mi>
                  <mml:mi>x</mml:mi>
                </mml:mrow>
                <mml:mrow>
                  <mml:mi>L</mml:mi>
                </mml:mrow>
              </mml:mfrac>
            </mml:mrow>
          </mml:mrow>
        </mml:mrow>
      </mml:mfenced>
    </mml:mrow>
  </mml:mrow>
</mml:math>

Other extensions

My implementation also contains some extension methods which allow TextRange objects to be used like the familiar StringBuilder class; i.e. Append, AppendLine, Insert and Remove.
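
To show what StringBuilder-style editing through a range looks like, here is a minimal sketch in portable C++. It is not the extension methods themselves: TextRangeSketch is a hypothetical class that edits a plain std::string standing in for the document story, the index arguments are assumed to be relative to the start of the range, and AppendLine is assumed to end the line with a carriage return.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical range over a std::string "story"; edits go through to the story.
class TextRangeSketch {
public:
    TextRangeSketch(std::string& story, std::size_t start, std::size_t length)
        : story_(story), start_(start), length_(length) {
        assert(start <= story.size() && length <= story.size() - start);
    }

    // Adds text at the end of the range and grows the range to include it.
    TextRangeSketch& Append(const std::string& text) {
        story_.insert(start_ + length_, text);
        length_ += text.size();
        return *this;
    }

    // Appends text followed by a carriage return (assumed line terminator).
    TextRangeSketch& AppendLine(const std::string& text = std::string()) {
        return Append(text).Append("\r");
    }

    // Inserts text at an offset relative to the start of the range.
    TextRangeSketch& Insert(std::size_t index, const std::string& text) {
        if (index > length_) throw std::out_of_range("index is outside the range");
        story_.insert(start_ + index, text);
        length_ += text.size();
        return *this;
    }

    // Removes count characters starting at an offset relative to the range.
    TextRangeSketch& Remove(std::size_t index, std::size_t count) {
        if (index > length_ || count > length_ - index)
            throw std::out_of_range("span is outside the range");
        story_.erase(start_ + index, count);
        length_ -= count;
        return *this;
    }

    std::string Text() const { return story_.substr(start_, length_); }
    std::size_t Length() const { return length_; }

private:
    std::string& story_;
    std::size_t start_;
    std::size_t length_;
};
```

Because each method returns the range, calls can be chained in the same way as with StringBuilder, e.g. range.Append("world").Insert(0, "hello ").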

Download

You can download the latest version of my TOM Classes for .NET from the project page.
