AI/NLP on Microsoft Word Documents

There are great AI/NLP libraries out there, such as NLTK and spaCy, to name just two. However, putting them to work on native Microsoft Word (.doc and .docx) documents isn’t easy.

There are several challenges:

1. AI/NLP needs raw text.

No AI/NLP library or practitioner can understand ".docx" or even ".doc". They need raw text!

Native Documents provides raw text ready for AI/NLP input.

2. AI/NLP needs to communicate the results back to the user.

No AI/NLP library or practitioner knows how to write the results back into the ".docx" document. This is a problem even for hard-core OfficeOpenXML file format hackers!

Native Documents can write AI/NLP results back to the document.

3. AI/NLP needs direct interaction with users.

Many AI/NLP applications need to view and collaboratively edit Microsoft Word ".docx" and ".doc" documents. This is a problem because Microsoft Word cannot be embedded inside a Web application.

Native Documents provides a fully customizable, Web-based Microsoft Word-compatible viewer and collaborative editor.

ND Technology

The capabilities provided by the Native Documents tools include:

The Native Documents Tools are available as an SDK, and as a service.

The service also includes:

At their core, the Native Documents tools are some 800,000 lines of C/C++ code.

SDK

This C/C++ core is platform independent and can be used readily from JavaScript (via WASM), Python, etc. The JavaScript/WASM version is publicly available on npm as @nativedocuments/docx-wasm.

If you would like the SDK to be available in your programming language please let us know via support@nativedocuments.com.

Service

The Native Documents tools are also available as a service. This service provides a rich REST API.

The service can run on any Linux and is generally delivered as Docker containers.

PDF vs DOCX-based AI/NLP architecture

An alternative to a DOCX-based AI/NLP architecture is to convert everything to PDF. It is very easy to convert any ".docx" or ".doc" to PDF using, for example, our docx-wasm npm module, and then to use PDF text extraction tools to retrieve the raw text needed for the AI/NLP analysis process.

There are 3 problems we hear from AI/NLP practitioners who have tried a PDF-based architecture:

  1. The poor quality of the raw text extracted from the PDFs.
  2. Loss of semantic information conveyed by the document’s formatting, that is, headings, numbering, etc. (See Improve your AI/NLP with better “raw text”)
  3. PDF is read-only, so ill-suited to living documents. Converting Microsoft Word documents to PDF takes the document out of the editing process. PDF is a static representation in a process where the AI/NLP expertise ultimately needs to be interactive within the document.

Extract Text

The most fundamental thing any AI/NLP practitioner needs is the raw text. The following sample uses our SDK to extract raw text from a ".docx" or ".doc" file:

const docx = require("@nativedocuments/docx-wasm");

// init docx engine
docx.init({
    // please go to https://developers.nativedocuments.com/
    // to get a dev-id/dev-secret
    // you can also set the credentials in the environment variables
    ND_DEV_ID: "XXXXXXXXXXXXXXXXXXXXXXXXXX",
    ND_DEV_SECRET: "XXXXXXXXXXXXXXXXXXXXXXXXXX",
    ENVIRONMENT: "NODE", // required
    // if set to false the WASM engine will be initialized right now,
    // useful pre-caching (like e.g. for AWS lambda)
    LAZY_INIT: true
}).catch( function(e) {
    console.error(e);
});

async function extractText(document) {
    const api = await docx.engine();
    await api.load(document);
    const raw = await api.exportRawText();
    console.log(raw);
    await api.close();
}

if (process.argv.length>2) {
    extractText(process.argv[2]);
}

extractText.js

Please note that paragraph numbers are exported correctly, something other libraries struggle with:

> npm i @nativedocuments/docx-wasm
> wget https://www.nativedocuments.com/assets/docs/test_drive/numbered_paragraph_sample.docx
> node extractText.js "numbered_paragraph_sample.docx"
1. Numbered Paragraph
1.1. Numbered Paragraph
1.2. Numbered Paragraph
2. Numbered Paragraph
2.1. Numbered Paragraph
2.2. Numbered Paragraph

Annotating a DOCX

Having extracted the raw text, it is easy to feed it into an AI/NLP library. AI/NLP libraries typically report their results as half-open character ranges of the form [character start position, character end position).

Let’s assume you fed your AI/NLP algorithm with raw text and you got back the following result:

{
    "[3..21)": {
        "text": "Comment 1",
        "author": "Native Documents",
        "date": 1561498919747
    },
    "[73..91)": {
        "text": "Comment 2",
        "author": "Native Documents",
        "date": 1561498919747
    }
}
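As a rough illustration of how such a result might be produced, the sketch below uses a plain regular-expression search as a stand-in for a real AI/NLP library and turns every match into a "[start..end)" key. The helper name annotateMatches and its parameters are hypothetical and not part of docx-wasm; only the shape of the returned object follows the example above. It also assumes that offsets are counted as JavaScript string positions in the raw text.

```javascript
// Stand-in for an AI/NLP step: find every match of a pattern in the raw text
// and describe it as a half-open character range "[start..end)".
// annotateMatches is a hypothetical helper, not part of docx-wasm.
function annotateMatches(rawText, pattern, author) {
  const annotations = {};
  let count = 0;
  // matchAll needs a global regex; each match carries its start index
  for (const match of rawText.matchAll(new RegExp(pattern, "g"))) {
    const start = match.index;
    const end = start + match[0].length; // end position is exclusive
    count += 1;
    annotations[`[${start}..${end})`] = {
      text: `Comment ${count}`,
      author: author,
      date: Date.now()
    };
  }
  return annotations;
}

const sampleText = "1. Numbered Paragraph\n1.1. Numbered Paragraph";
console.log(annotateMatches(sampleText, "Numbered Paragraph", "Native Documents"));
```

In a real pipeline the ranges would come from your AI/NLP library (entities, clauses, and so on) rather than from a regular expression.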

The next problem is that of writing the results back into the document.

The following sample code shows how to annotate a ".docx" with comments:

const docx = require("@nativedocuments/docx-wasm");
const fs = require("fs");

// init docx engine
docx.init({
    // please go to https://developers.nativedocuments.com/
    // to get a dev-id/dev-secret
    // you can also set the credentials in the environment variables
    ND_DEV_ID: "XXXXXXXXXXXXXXXXXXXXXXXXXX",
    ND_DEV_SECRET: "XXXXXXXXXXXXXXXXXXXXXXXXXX",
    ENVIRONMENT: "NODE", // required
    // if set to false the WASM engine will be initialized right now,
    // useful pre-caching (like e.g. for AWS lambda)
    LAZY_INIT: true
}).catch( function(e) {
    console.error(e);
});

async function annotateDOCX(document) {
    const api = await docx.engine();
    await api.load(document);
    const raw = await api.exportRawText();
    console.log(raw);
    const buffer = await api.exportDOCX({
        "[3..21)": {
            "text": "Comment 1",
            "author": "Native Documents",
            "date": Date.now()
        },
        "[73..91)": {
            "text": "Comment 2",
            "author": "Native Documents",
            "date": Date.now()
        }
    });
    fs.writeFileSync("out.docx", new Uint8Array(buffer));
    await api.close();
}

annotateDOCX("numbered_paragraph_sample.docx");

annotateDocx.js

Assuming you've already installed docx-wasm and fetched numbered_paragraph_sample.docx:

> node annotateDocx.js

The result is a Word document containing a Word comment for each annotation.
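If a comment ends up on unexpected text, a quick check of the offsets against the raw text can help. The sketch below parses the "[start..end)" keys and reports the text each one covers. previewRanges is a hypothetical helper, not part of docx-wasm, and it assumes the offsets index into the raw text as a JavaScript string.

```javascript
// Parse keys of the form "[start..end)" and report which text each covers.
// previewRanges is a hypothetical helper; it assumes offsets index into the
// raw text as a JavaScript string.
function previewRanges(rawText, annotations) {
  return Object.keys(annotations).map((key) => {
    const parts = /^\[(\d+)\.\.(\d+)\)$/.exec(key);
    if (!parts) {
      throw new Error(`Malformed range key: ${key}`);
    }
    const start = Number(parts[1]);
    const end = Number(parts[2]);
    if (start > end || end > rawText.length) {
      throw new Error(`Range ${key} does not fit text of length ${rawText.length}`);
    }
    // slice() uses the same half-open convention: start inclusive, end exclusive
    return { key: key, covered: rawText.slice(start, end) };
  });
}

const rawSample = "1. Numbered Paragraph\n1.1. Numbered Paragraph";
console.log(previewRanges(rawSample, { "[3..21)": { text: "Comment 1" } }));
```

Running this on the output of exportRawText() before calling exportDOCX() makes off-by-one errors in the ranges easy to spot.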

Improve your AI/NLP with better “raw text”

Being able to extract the plain raw text from a Microsoft Word document is essential. However, Microsoft Word documents have so much more context to offer. So we also provide a JSON-based encoding of the raw text designed for AI/NLP input.

Since most AI/NLP work is centered at the paragraph level we chose to provide these documents to AI/NLP machines as a sequence of annotated paragraphs encoded in JSON. Each paragraph is annotated with complete and fully resolved properties.

Since this endpoint returns the fully resolved properties, you don't have to follow the styles hierarchy to figure out, for example, whether a paragraph is a heading or has an "outline" level. The style resolution is done within our process of preparing the content for AI/NLP analysis and document interaction.

In legaltech applications, for example, this means that you can readily use headings, numbering, indentation, etc. when segmenting a contract document into sections/articles/clauses.

The following example shows you how to extract the outline from a Microsoft Word document:

async function extractOutline(document) {
    const api = await docx.engine();
    await api.load(document);
    const raw = await api.exportRawJSON();
    const outline = raw.filter((par) => par.pap.outlineLvl < 9);
    console.log(JSON.stringify(outline, null, 4));
    await api.close();
}

When trying the above example, be sure to use a ".docx" that contains at least one paragraph with an outline level below 9 (outlineLvl<9); otherwise the result is an empty [].
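The segmentation idea mentioned earlier (splitting a contract into sections at its headings) can be sketched on top of the same pap.outlineLvl test. This is only an illustration under stated assumptions: splitIntoSections is a hypothetical helper, and the "text" property used for the paragraph content is a stand-in, since the actual property names of the raw JSON are not shown here.

```javascript
// Group a flat list of paragraphs into sections, starting a new section at
// every paragraph whose outline level marks it as a heading (outlineLvl < 9).
// The "text" property is a hypothetical stand-in for the paragraph content.
function splitIntoSections(paragraphs) {
  const sections = [];
  // Collects any paragraphs that appear before the first heading
  let current = { heading: null, level: null, body: [] };
  for (const par of paragraphs) {
    const level = par.pap ? par.pap.outlineLvl : undefined;
    if (level < 9) {
      // Close the previous section unless it is the empty initial one
      if (current.heading !== null || current.body.length > 0) {
        sections.push(current);
      }
      current = { heading: par.text, level: level, body: [] };
    } else {
      current.body.push(par.text);
    }
  }
  if (current.heading !== null || current.body.length > 0) {
    sections.push(current);
  }
  return sections;
}

const demoParagraphs = [
  { text: "1. Definitions", pap: { outlineLvl: 0 } },
  { text: "In this agreement ...", pap: {} },
  { text: "2. Term", pap: { outlineLvl: 0 } },
  { text: "This agreement starts on ...", pap: {} }
];
console.log(JSON.stringify(splitIntoSections(demoParagraphs), null, 2));
```

Each section keeps its heading level, so nested clauses could be rebuilt into a tree in a later step if needed.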

The JSON-based raw text format is very verbose and it provides all the information available in a Microsoft Word document. For example, all the paragraph numbering is available:

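async function extractNumberedParagraphs(document) {
    const api = await docx.engine();
    await api.load(document);
    const raw = await api.exportRawJSON();
    // keep paragraphs that belong to a list but are not headings;
    // the parentheses matter: !x<9 would be parsed as (!x)<9, which is always true
    const numberedParagraphs = raw.filter((par) => par.list && !(par.pap.outlineLvl < 9));
    console.log(JSON.stringify(numberedParagraphs));
    await api.close();
}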

The properties are modelled after the OfficeOpenXML specification. If you have questions regarding a specific property, please see our wiki.

Viewing/Editing

Many AI/NLP applications need to display their results within a Web browser. And typically, once the user sees the annotations made by the AI/NLP, the need for editing (and possibly collaborative editing) becomes obvious.

Native Documents provides a fully customizable, Web-based, Microsoft Word-compatible viewer and editor.

You can quickly test the fidelity of the viewer with your own documents using our test site at https://canary.nativedocuments.com.

To test the editor, the easiest way to get started is to clone the GitHub project https://github.com/NativeDocuments/nd-WordFileEditor. Then do the usual npm install/npm start and you can start hacking:

npm install
npm start -nd-dev-id="${ND_DEV_ID}" -nd-dev-secret="${ND_DEV_SECRET}" -nd-service-url="${ND_SERVICE_URL}"


You can now visit http://127.0.0.1:8888 in your browser. By default all requests are "proxied" to ${ND_SERVICE_URL}. For how to install and use an on-premise version, please see further below.

Verify the service is working:

curl -f -L ${ND_SERVICE_URL} && echo "SERVICE IS WORKING"

Loading a Word document

To start interacting with the service, first you upload a document:

curl -X POST -H "X-ND-DEV-SECRET: ${ND_DEV_SECRET}" --data-binary @'sample.docx' ${ND_SERVICE_URL}/v1/DEV${ND_DEV_ID}00000000000000000000000000000000000000000000000000000000/upload

Once the document is uploaded, you can use the returned nid to perform actions on it.

To view the document use:

http://127.0.0.1:8888/edit/${NID}

To edit the document use:

http://127.0.0.1:8888/edit/${NID}?author=${AUTHOR_TOKEN}

Exporting

Once the document is uploaded, you can export it at any time. Any edits will be present (since edits are sync'd continuously to the server).

You can get the PDF:

curl -o out.pdf -X GET -H "X-ND-DEV-SECRET: ${ND_DEV_SECRET}" "${ND_SERVICE_URL}/v1/DEV${ND_DEV_ID}00000000000000000000000000000000000000000000000000000000/document/${NID}/?format=application/pdf"

Or you can get the raw text:

curl -o out.txt -X GET -H "X-ND-DEV-SECRET: ${ND_DEV_SECRET}" "${ND_SERVICE_URL}/v1/DEV${ND_DEV_ID}00000000000000000000000000000000000000000000000000000000/document/${NID}/?format=application/vnd.nativedocuments.raw.text%2Btext"

Or you can get the raw JSON:

curl -o out.json -X GET -H "X-ND-DEV-SECRET: ${ND_DEV_SECRET}" "${ND_SERVICE_URL}/v1/DEV${ND_DEV_ID}00000000000000000000000000000000000000000000000000000000/document/${NID}/?format=application/vnd.nativedocuments.raw.json%2Bjson"

Or you can get the ".docx":

curl -o out.docx -X GET -H "X-ND-DEV-SECRET: ${ND_DEV_SECRET}" "${ND_SERVICE_URL}/v1/DEV${ND_DEV_ID}00000000000000000000000000000000000000000000000000000000/document/${NID}/?format=application/vnd.openxmlformats-officedocument.wordprocessingml.document"
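The four export commands differ only in the format parameter, so when calling the service from code the URL can be assembled by a small helper. The sketch below is a hypothetical illustration, not part of any Native Documents SDK: exportUrl and FORMATS are made-up names, and documentBaseUrl is assumed to be the ".../v1/DEV.../document" prefix from the curl commands above. Note that the "+" in the format names is escaped as %2B, as in those commands, because a literal "+" in a query string is commonly decoded as a space.

```javascript
// Build an export URL for the service. documentBaseUrl is assumed to be the
// ".../v1/DEV.../document" prefix used in the curl commands;
// exportUrl is a hypothetical helper, not part of any Native Documents SDK.
function exportUrl(documentBaseUrl, nid, format) {
  // Escape "+" as %2B so it is not decoded as a space on the server side
  const escaped = format.replace(/\+/g, "%2B");
  return `${documentBaseUrl}/${nid}/?format=${escaped}`;
}

// Format names as they appear in the curl commands
const FORMATS = {
  pdf: "application/pdf",
  rawText: "application/vnd.nativedocuments.raw.text+text",
  rawJson: "application/vnd.nativedocuments.raw.json+json",
  docx: "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
};

// Placeholder host, prefix and nid, for illustration only
console.log(exportUrl("https://service.example/v1/DEVID/document", "NID123", FORMATS.rawText));
```

The resulting URL can then be requested with the X-ND-DEV-SECRET header, exactly as in the curl examples.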

Annotations Service

You can upload the annotations using the ranges endpoint:

curl -X POST -H "Content-Type: application/vnd.nativedocuments.raw.text+text" --data-binary @'ranges.txt' -H "X-ND-DEV-SECRET: ${ND_DEV_SECRET}" "${ND_SERVICE_URL}/v1/DEV${ND_DEV_ID}00000000000000000000000000000000000000000000000000000000/document/${NID}/ranges"

You can also combine the export and ranges operations:

curl -o out.docx -X POST -H "Content-Type: application/vnd.nativedocuments.raw.text+text" --data-binary @'ranges.txt' -H "X-ND-DEV-SECRET: ${ND_DEV_SECRET}" "${ND_SERVICE_URL}/v1/DEV${ND_DEV_ID}00000000000000000000000000000000000000000000000000000000/document/${NID}/?format=application/vnd.openxmlformats-officedocument.wordprocessingml.document"

On premise Docker install

Setting up the viewer/editor locally or on premise is quite simple using Docker. (The service can be run on any Linux. If Docker is not working for you, or if you need an AWS CloudFormation template, please write to support@nativedocuments.com.)

Please download the docker-compose.yaml template from http://downloads.nativedocuments.com/docker/NativeDocumentsServices.zip and unpack it. You need to customize the ".env" file. To get an eval license please visit developers.nativedocuments.com:

ND_WEBAPP_PORT=${ND_WEBAPP_PORT}
ND_DOCKER_REGISTRY=
ND_BUILD_TS=latest
ND_DEPLOY_TAG=DEV
ND_LICENSE_URL=data:application/x-x509-ca-cert;base64,...

Start the service with:

docker-compose up -d

Stop the service with:

docker-compose down

Stop Native Documents and delete the nddata volume:

docker-compose down -v

Stop and clean up:

docker-compose down -v --rmi all

Using the on premise install for development

To use the locally installed docker instance for development instead of https://canary.nativedocuments.com, create an ".env" file inside nd-WordFileEditor:

ND_DEV_ID=${ND_DEV_ID}
ND_DEV_SECRET=${ND_DEV_SECRET}
ND_SERVICE_URL=${ND_SERVICE_URL}

Then simply start the dev server again and ${ND_SERVICE_URL} will be used as the service:

nd-WordFileEditor> npm start
…
[ND] proxy to service at "${ND_SERVICE_URL}"
…
Project is running at http://127.0.0.1:8888/webpack-dev-server/
…
  Asset      Size  Chunks             Chunk Names
 app.js  60.4 KiB     app  [emitted]  app
host.js  6.85 KiB    host  [emitted]  host
…

Deployment

There are many ways to deploy the service so that it can handle large workloads. Please write to support@nativedocuments.com to get help with your deployment scenario (e.g. AWS CloudFormation, Docker, …).

Conclusion

We hope this overview has raised some exciting possibilities for you. If there are other capabilities you’d like to explore, please reach out to us at support@nativedocuments.com.