How to convert extracted text from PDF to JSON or XML format in Python?
I am using PyPDF2 to extract data from a PDF file and convert it to text.
The content of the PDF file looks like this:
Name : John
Address: 123street , USA
Phone No: 123456
Gender: Male
Name : Jim
Address: 456street , USA
Phone No: 456899
Gender: Male
In Python I use this code:
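import PyPDF2

# Note: these are the pre-3.0 PyPDF2 names. PyPDF2 3.x replaced them with
# PdfReader, len(reader.pages), reader.pages[0] and page.extract_text().
pdf_file = open('C:\\Users\\Desktop\\Sampletest.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)  # first page only
page_content = page.extractText()
page_content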
This is the result I got from page_content:
'Name : John \n \nAddress: 123street , USA \n \nPhone No: 123456\n \nGender: Male \n \n \nName : Jim \n \nAddress: 456street , USA \n \nPhone No: 456899\n \nGender: Male \n \n \n'
How can I convert it to JSON or XML so that I can load the extracted data into a SQL Server database?
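I’ve also tried this method:
import json
data = json.dumps(page_content)  # serializes the string as a JSON string literal
formatj = json.loads(data)       # parses it straight back into the same Python str
print(formatj)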
Output:
Name : John
Address: 123street , USA
Phone No: 123456
Gender: Male
Name : Jim
Address: 456street , USA
Phone No: 456899
Gender: Male
This is the same output I have in the word file, but I don’t think it’s in JSON format.
Solution
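Not pretty, but I think it gets the job done. Calling json.dumps on a plain string only produces a quoted JSON string, so the text has to be parsed into key/value pairs first. Because the keys repeat for each person, the function below returns a list of dictionaries (one per record), which json.dumps then prints with indentation.
import json

def get_data(page_content):
    """Parse 'key: value' lines into a list of records (one dict per person)."""
    records = []
    record = {}
    for line in page_content.splitlines():
        if ':' not in line:
            continue  # skip blank lines and lines without a key/value pair
        # split on the first colon only, so a value may itself contain ':'
        key, value = line.split(':', 1)
        key = key.strip()
        if key in record:
            # a repeated key (e.g. a second "Name") starts a new record
            records.append(record)
            record = {}
        record[key] = value.strip()
    if record:
        records.append(record)
    return records

page_data = get_data(page_content)
json_data = json.dumps(page_data, indent=4)
print(json_data)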
Or, instead of the last 3 lines, just do the following:
print(json.dumps(get_data(page_content), indent=4))
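The question also asks about XML. Here is a minimal sketch using the standard library's xml.etree.ElementTree. It assumes that a new record starts whenever a key repeats. The helper names parse_records and records_to_xml and the tag names people and person are hypothetical choices for illustration, not part of the original answer. The sample string is the page_content value shown in the question.

```python
import xml.etree.ElementTree as ET


def parse_records(text):
    """Group 'key: value' lines into one dict per record."""
    records, record = [], {}
    for line in text.splitlines():
        if ':' not in line:
            continue  # blank or non key/value line
        key, value = (part.strip() for part in line.split(':', 1))
        if key in record:  # assumption: a repeated key starts a new record
            records.append(record)
            record = {}
        record[key] = value
    if record:
        records.append(record)
    return records


def records_to_xml(records, root_tag="people", item_tag="person"):
    """Build an XML string from the records (tag names are hypothetical)."""
    root = ET.Element(root_tag)
    for record in records:
        item = ET.SubElement(root, item_tag)
        for key, value in record.items():
            # XML tag names cannot contain spaces, so "Phone No" becomes "Phone_No"
            child = ET.SubElement(item, key.replace(" ", "_"))
            child.text = value
    return ET.tostring(root, encoding="unicode")


sample = ('Name : John \n \nAddress: 123street , USA \n \nPhone No: 123456\n \n'
          'Gender: Male \n \n \nName : Jim \n \nAddress: 456street , USA \n \n'
          'Phone No: 456899\n \nGender: Male \n \n \n')
xml_data = records_to_xml(parse_records(sample))
print(xml_data)
```

The same list of dictionaries can be passed to json.dumps for JSON output, or its values can be used as parameters for INSERT statements, so the parsing step is shared whichever format is loaded into the database.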