Extract highlighted words from Word documents (.docx) in Python

I’m working with a set of Word documents in which I have highlighted text (words) in different colors (e.g. yellow, blue, gray), and now I want to extract the highlighted words associated with each color. I’m programming in Python. Here’s what I’ve done so far:

I open the Word document with python-docx and select the <w:r> tags, which contain the runs of text (words) in the document. I used the following code:

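#!/usr/bin/env python2.6
# -*- coding: ascii -*-
from docx import opendocx  # legacy python-docx API

# opendocx returns the parsed XML tree of the document body
document = opendocx('test.docx')
# Select every run (<w:r>) element in the document
words = document.xpath('//w:r', namespaces=document.nsmap)
for word in words:
    print(word)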

Now I’m stuck at the next step: checking whether each run has a <w:highlight> tag, extracting the color code from it and, if it is yellow, printing the text inside the <w:t> tag. I would appreciate it if someone could point me to a way of extracting the words from the parsed file.

Solution

I’ve never used python-docx before, but it helped that I found a snippet online showing what the XML for a highlighted run of text looks like:

  <w:r>
    <w:rPr>
      <w:highlight w:val="yellow"/>
    </w:rPr>
    <w:t>text that is highlighted</w:t>
  </w:r>

From there, coming up with this is relatively straightforward:

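from docx import opendocx  # legacy python-docx API

document = opendocx(r'test.docx')
words = document.xpath('//w:r', namespaces=document.nsmap)

# lxml addresses namespaced tags and attributes as "{namespace-uri}name"
WPML_URI = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
tag_rPr = WPML_URI + 'rPr'
tag_highlight = WPML_URI + 'highlight'
tag_val = WPML_URI + 'val'
tag_t = WPML_URI + 't'  # <w:t> holds the text of the run

for word in words:
    for rPr in word.findall(tag_rPr):
        highlight = rPr.find(tag_highlight)
        # Skip runs whose properties carry no <w:highlight> element
        if highlight is not None and highlight.attrib.get(tag_val) == 'yellow':
            text = word.find(tag_t)
            if text is not None:
                print(text.text)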
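
Since the question asks for the words belonging to each color rather than only yellow, the same idea can be extended to group the text by highlight color. The following is a minimal sketch that uses only the standard library instead of python-docx, on the assumption that a .docx file is a zip archive whose main body lives in word/document.xml. The function names highlights_by_color and highlights_in_docx are made up for this illustration and are not part of any library.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import defaultdict

# WordprocessingML main namespace in the "{uri}name" form ElementTree expects.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def highlights_by_color(document_xml):
    """Map each highlight color to the list of run texts that use it.

    `document_xml` is the content of word/document.xml (bytes or str).
    Hypothetical helper written for this example.
    """
    root = ET.fromstring(document_xml)
    found = defaultdict(list)
    for run in root.iter(W + "r"):
        highlight = run.find(W + "rPr/" + W + "highlight")
        if highlight is None:
            continue  # this run is not highlighted
        color = highlight.get(W + "val")
        # A run may contain several <w:t> elements; join them into one string.
        text = "".join(t.text or "" for t in run.findall(W + "t"))
        if color and text:
            found[color].append(text)
    return dict(found)


def highlights_in_docx(path):
    """Read word/document.xml from a .docx file and group its highlights."""
    with zipfile.ZipFile(path) as archive:
        return highlights_by_color(archive.read("word/document.xml"))
```

Calling highlights_in_docx('test.docx') should then return a dictionary such as {'yellow': [...], 'blue': [...]}. Note that Word may split a single highlighted word or phrase across several runs, so the pieces in each list may need to be joined afterwards.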
