Tesseract does not pick up text of different colors
I’m trying to make a program that scrapes the text from a screenshot using Tesseract and Python. I can get some of it without problems, but some of the text is lighter in color and isn’t picked up by Tesseract. Here’s an example of the picture I’m using:
I can get the text at the top of the image, but not the 3 options below it.
This is the code I used to grab the text:
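import pytesseract

# `screen` (the screenshot image) and `stop` (a collection of stop words)
# are defined elsewhere in the program.
# Tesseract variables are passed with "-c"; bare "name=value" arguments
# would be treated as config file names.
result = pytesseract.image_to_string(
    screen, config="-c load_system_dawg=0 -c load_freq_dawg=0")
print("below is the total value scraped by the tesseract")
print(result)
# Split on blank lines: the first block is the question, the rest are answers
parts = result.split("\n\n")
question = parts.pop(0).replace("\n", " ")
# Keep the unique question terms that are not stop words
q_terms = question.split(" ")
q_terms = list(filter(lambda t: t not in stop, q_terms))
q_terms = set(q_terms)
# Flatten the remaining blocks into one non-empty answer per line
parts = "\n".join(parts)
parts = parts.split("\n")
answers = list(filter(lambda p: len(p) > 0, parts))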
When I have plain black text without a colored background, the answers array is filled with the 3 options, but not in this case. Is there any way to fix this?
Solution
You skipped binarization (thresholding).
In your case, applying a binary threshold to a grayscale version of the image is enough.
This is the result image with threshold = 177
Here you can learn more about thresholding with the OpenCV Python library.