Python makes reading Excel files faster

Python makes reading Excel files faster … here is a solution to the problem.

Python makes reading Excel files faster

I made a script that reads an Excel document and then checks if the first line contains “UPDATED”. If so, it writes the entire row to another Excel document with the same tab name.

My Excel document is 23 sheets of 1000 lines each, and now it takes more than 15 minutes to complete this. Is there any way to speed things up?

I’m thinking about multithreading or multiprocessing, but I don’t know which is better.


Update: The fact that my program takes 15 minutes to run is caused by READ-ONLY mode, and when I delete it, it only takes 2 seconds to run the program

import openpyxl
import os
from datetime import datetime

titles = ["Column1", "Column2", "Column3", "Column4", "Column5","Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", " Column16"]

def main():
    oldFilePath= os.getcwd() + "\oldFile.xlsx"
    newFilePath= os.getcwd() + "\newFile.xlsx"

wb = openpyxl.load_workbook(filename=oldFilePath, read_only=True)
    wb2 = openpyxl. Workbook()

sheets = wb.get_sheet_names()
    sheets2 = wb2.get_sheet_names()

#removes all sheets in newFile.xlsx
    for sheet in sheets2:
        temp = wb2.get_sheet_by_name(sheet)
        wb2.remove_sheet(temp)

for tab in sheets:
        print("Sheet: " + str(tab))
        rowCounter = 2

sheet = wb[tab]
        for row in range(sheet.max_row):
            if sheet.cell(row=row + 1, column=1).value == "": #if cell is empty stop reading
                break
            elif sheet.cell(row=row + 1, column=1).value == "UPDATED":
                if tab not in sheets2:
                    sheet2 = wb2.create_sheet(title=tab)
                    sheet2.append(titles)

for x in range(1, 17):
                    sheet2.cell(row=rowCounter, column=x).value = sheet.cell(row=row + 1, column=x).value

rowCounter += 1

sheets2 = wb2.get_sheet_names()

wb2.save(filename=newFilePath)

if __name__ == "__main__":
    startTime = datetime.now()
    main()
    print("Script finished in: " + str(datetime.now() - startTime))

Solution

For such a small workbook, there is no need to use read-only mode, and using it wisely is the cause of your own problems. Each call to ws.cell() forces openpyxl to parse the worksheet again.

So you either stop using read-only mode or use ws.iter_rows(), as I suggested in your previous question.

In general, if you think something is running slow, you should always analyze it instead of just trying something and hoping for the best.

Related Problems and Solutions