Python-docx: iterate through paragraphs, tables and images while maintaining order

This is my first time posting here. I would like to write a script that takes a docx file as input, selects certain paragraphs (including tables and images), and copies them into another template document in the same order (not appended at the end). The problem is that when I iterate through the elements, my code does not detect images, so I cannot determine where an image sits relative to the text and tables, nor which image it is.
In short, doc1 contains:
text
Image
text
table
Text

What I end up with is:
text
[image missing]
text
table
Text

What I have so far:

I can iterate through paragraphs and tables:

from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import Table, _Cell
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Generate a reference to each paragraph and table child within *parent*,
    in document order. Each returned value is an instance of either Table or
    Paragraph. *parent* would most commonly be a reference to a main
    Document object, but also works for a _Cell object, which itself can
    contain paragraphs and tables.
    """
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    # Only direct children are visited, so content nested inside a
    # paragraph (such as a picture in a run) is not yielded separately.
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)

I can get an ordered list of document images:

from docx.enum.shape import WD_INLINE_SHAPE

# dwo is an already-opened Document object
pictures = []
for pic in dwo.inline_shapes:
    if pic.type == WD_INLINE_SHAPE.PICTURE:
        pictures.append(pic)

I can insert a specific image at the end of a paragraph:

from io import BytesIO
from docx.shared import Inches


def insert_picture(index, paragraph):
    # Find the relationship id that links the inline shape to its image part
    inline = pictures[index]._inline
    rId = inline.xpath('./a:graphic/a:graphicData/pic:pic/pic:blipFill/a:blip/@r:embed')[0]
    image_part = dwo.part.related_parts[rId]
    # Copy the image bytes into a new run appended to the target paragraph
    image_stream = BytesIO(image_part.blob)
    paragraph.add_run().add_picture(image_stream, Inches(6.5))

I use iter_block_items() like this:

# insert_paragraph_after, last_paragraph, paragraphs_with_table and
# tables_to_apppend are defined elsewhere in the script (not shown)
start_copy = False
for block in iter_block_items(document):
    if isinstance(block, Paragraph):
        if block.text == "TEXT FROM WHERE WE STOP COPYING":
            break

    if start_copy:
        if isinstance(block, Paragraph):
            last_paragraph = insert_paragraph_after(last_paragraph, block.text)
        elif isinstance(block, Table):
            paragraphs_with_table.append(last_paragraph)
            tables_to_apppend.append(block._tbl)

    if isinstance(block, Paragraph):
        if block.text == "TEXT FROM WHERE WE START COPYING":
            start_copy = True

Solution

A working implementation of exactly this can be found at the following link:

Extracting paras, tables and images in document order
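
As a rough illustration of why the image goes missing: in the underlying XML a picture is not a sibling of paragraphs and tables but sits inside a paragraph's run, so a loop over the body's direct children only ever sees the paragraph that contains it. The sketch below is not the linked implementation and does not use python-docx; it reads word/document.xml with the standard library only, and iter_body_items is a hypothetical helper name chosen for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by word/document.xml
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
}
W = "{%s}" % NS["w"]
R_EMBED = "{%s}embed" % NS["r"]


def iter_body_items(docx_file):
    """Yield (kind, value) for top-level body content, in document order.

    kind is "paragraph" (value: its text), "image" (value: the relationship
    id of the embedded picture) or "table" (value: None).
    Hypothetical helper for illustration; not part of python-docx.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    body = root.find("w:body", NS)
    for child in body:
        if child.tag == W + "p":
            text = "".join(t.text or "" for t in child.iter(W + "t"))
            if text:
                yield ("paragraph", text)
            # Pictures are nested in the paragraph's runs, so look inside it.
            # Simplification: text is reported before any picture in the
            # same paragraph, and empty paragraphs are skipped.
            for blip in child.iterfind(".//a:blip", NS):
                rid = blip.get(R_EMBED)
                if rid:
                    yield ("image", rid)
        elif child.tag == W + "tbl":
            # Table contents are not expanded in this sketch.
            yield ("table", None)
```

With python-docx the same idea should carry over: for each Paragraph yielded by iter_block_items, querying its underlying XML element for a:blip/@r:embed (the expression insert_picture already uses) would give the relationship ids of the pictures in that paragraph, which can then be resolved through dwo.part.related_parts.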
