DOC vs DOCX

Explanation of the alternate processing flow for DOCX files in ePublisher 2011.3 and its implications.

by Ben Allums
March 19, 2012
Solutions

In ePublisher 2011.3, we introduced an alternate processing flow for the Microsoft Word Office Open XML document format (OOXML, referred to as DOCX hereafter). This new approach addresses issues that arose when DOCX files were processed with the previous DOC processing flow.

Let's start with a brief history of the Microsoft Word integration with ePublisher. In 2004-2005, when ePublisher was being designed, the Word adapter leveraged existing code that used Word VBA to convert Word DOC files into ePublisher intermediate files (WIF). The same processing flow has been used with each successive release of Word (2007 and 2010).

However, in Word 2007, Microsoft introduced a new document format named Office Open XML, which uses a *.docx file extension when saved. Unlike the DOC format, the DOCX format is an XML-based open standard. Up until 2011.3, ePublisher continued to use the same VBA processing flow for both the DOC and DOCX formats.

The problem with this approach is that the VBA-based processing flow normalizes all files to DOC format, which in some cases loses formatting information that was present in the DOCX file.

The new DOCX adapter works around these limitations by leaving the original DOCX file in its native format. It uses a combination of DOM manipulation and XSL to produce the ePublisher intermediate files. This preserves formatting information from DOCX files more accurately.
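To make the idea of working on the native file more concrete, here is a minimal sketch in Python. It is not the ePublisher adapter's code. It only relies on two facts about the format: a DOCX file is a ZIP package, and the main document body lives in the `word/document.xml` part using the WordprocessingML namespace. The sample package, the style names, and the function names `build_sample_docx` and `read_runs` are all assumptions made for illustration, and the stripped-down sample is not guaranteed to open in Word.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}
VAL = f"{{{W_NS}}}val"  # fully qualified name of the w:val attribute

# Hypothetical sample content: one styled paragraph and one paragraph
# that mixes an unstyled run with a character-styled run.
SAMPLE_DOCUMENT_XML = f"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
      <w:r><w:t>DOC vs DOCX</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t xml:space="preserve">Plain text, then </w:t></w:r>
      <w:r><w:rPr><w:rStyle w:val="Emphasis"/></w:rPr><w:t>styled text</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def build_sample_docx() -> bytes:
    """Build a stripped-down stand-in package holding only document.xml."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", SAMPLE_DOCUMENT_XML)
    return buffer.getvalue()


def read_runs(docx_bytes: bytes):
    """Return (paragraph style, character style, text) for every run."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))

    runs = []
    for para in root.iterfind("w:body/w:p", NS):
        p_style = para.find("w:pPr/w:pStyle", NS)
        para_style = p_style.get(VAL) if p_style is not None else None
        for run in para.iterfind("w:r", NS):
            r_style = run.find("w:rPr/w:rStyle", NS)
            char_style = r_style.get(VAL) if r_style is not None else None
            text = "".join(t.text or "" for t in run.iterfind("w:t", NS))
            runs.append((para_style, char_style, text))
    return runs


if __name__ == "__main__":
    for entry in read_runs(build_sample_docx()):
        print(entry)
```

Because the paragraph and character style names are read straight from the XML, nothing is converted to DOC along the way, which is the property the DOCX flow depends on.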

While there are some growing pains associated with this new approach, such as maturity issues with the DOCX processing flow, it offers advantages like compatibility with 64-bit binaries, improved speed and memory performance, and potential cross-platform use.

For more information and updates on the DOCX processing flow, refer to DOCX Updates.

The DOCX adapter has a number of natural advantages. Because of the problems with character style runs, the DOC adapter is forever tied to legacy 32-bit code. The DOCX adapter has no such limitation and represents a viable path toward 64-bit binaries. The speed and memory performance of the DOCX implementation are also far superior to those of the DOC implementation, which raises the scalability ceiling for DOCX documents. Finally, because DOCX is an open format that can be read and manipulated without Word, the DOCX adapter could potentially be used across platforms, although there are no current plans to make the needed changes.
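As an illustration of manipulating a DOCX file without Word, the following sketch renames a character style reference in `word/document.xml` and writes a new package, using only the Python standard library. The function name `rename_char_style` and the style names in the usage note are hypothetical, and this is not how ePublisher itself processes files. One caveat: real documents declare many XML namespaces, and `xml.etree.ElementTree` may rewrite their prefixes on output, so a production tool would need extra care to keep the result acceptable to Word.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
VAL = f"{{{W_NS}}}val"  # fully qualified name of the w:val attribute

# Keep the conventional "w" prefix when the tree is serialized again.
ET.register_namespace("w", W_NS)


def rename_char_style(docx_bytes: bytes, old: str, new: str) -> bytes:
    """Return a copy of the package with character style `old` renamed to `new`."""
    source = zipfile.ZipFile(io.BytesIO(docx_bytes))
    output = io.BytesIO()
    with source, zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        for item in source.infolist():
            data = source.read(item.filename)
            if item.filename == "word/document.xml":
                root = ET.fromstring(data)
                # w:rStyle elements hold character style references on runs.
                for r_style in root.iter(f"{{{W_NS}}}rStyle"):
                    if r_style.get(VAL) == old:
                        r_style.set(VAL, new)
                data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
            # Every other part is copied through unchanged.
            target.writestr(item.filename, data)
    return output.getvalue()
```

A caller would read the bytes of an existing file, call something like `rename_char_style(data, "Emphasis", "Strong")`, and write the result to a new file. This runs anywhere Python runs, with no Word installation involved.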