Python - Tesseract does not pick up text of different colors

I’m trying to write a program that scrapes the text from a screenshot using Tesseract and Python. Most of the text comes through without problems, but some of it is lighter in color and Tesseract does not pick it up. Here’s an example of the picture I’m using:

I can get the text at the top of the image, but not the 3 options below.

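This is the code I used to grab the text:

import pytesseract

# `screen` is the screenshot image and `stop` is a collection of stop words;
# both are defined elsewhere in the program.
# Tesseract variables are set on the command line with -c NAME=VALUE.
result = pytesseract.image_to_string(
    screen, config="-c load_system_dawg=0 -c load_freq_dawg=0")

print("below is the total value scraped by the tesseract")
print(result)

# Split on blank lines until we have our question and answers
parts = result.split("\n\n")

# The first block is the question; reduce it to a set of non-stop-word terms
question = parts.pop(0).replace("\n", " ")
q_terms = question.split(" ")
q_terms = list(filter(lambda t: t not in stop, q_terms))
q_terms = set(q_terms)

# The remaining blocks hold the answers, one per non-empty line
parts = "\n".join(parts)
parts = parts.split("\n")

answers = list(filter(lambda p: len(p) > 0, parts))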

With plain black text on an uncolored background, the answers array is filled with the 3 options below the question, but not in this case. Is there any way to fix this?

Solution

You are missing a binarization (thresholding) step.

In your case, you can simply convert the image to grayscale and apply a binary threshold to it.
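To make concrete what a binary threshold does, here is a dependency-free sketch on a tiny grid of grayscale values. The function name `binary_threshold` and the sample pixel values are made up for illustration; the rule itself (values above the threshold become white, everything else black) is the standard binary threshold.

```python
def binary_threshold(gray_rows, threshold=177, max_value=255):
    """Map each grayscale value to max_value if it exceeds threshold, else 0."""
    return [
        [max_value if pixel > threshold else 0 for pixel in row]
        for row in gray_rows
    ]


# Hypothetical 2x4 patch: light pixels (200) and dark pixels (30)
# on a mid-grey background (120).
patch = [
    [120, 200, 200, 120],
    [120, 30, 30, 120],
]

for row in binary_threshold(patch):
    print(row)
# prints:
# [0, 255, 255, 0]
# [0, 0, 0, 0]
```

Only the pixels brighter than the threshold survive, so the threshold value decides which text stays separated from its background. If the text you need is darker than the background, a lower threshold or an inverted threshold may work better.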

This is the resulting image with a threshold of 177.

You can learn more about thresholding in the OpenCV Python library documentation.
