Python - How to convert extracted text from PDF to JSON or XML format in Python?

I am using PyPDF2 to extract data from a PDF file and convert it to text.

The content of the PDF file looks like this:

Name : John 
Address: 123street , USA 
Phone No:  123456
Gender: Male 

Name : Jim 
Address:  456street , USA 
Phone No:  456899
Gender: Male

In Python I use this code:

import PyPDF2
pdf_file = open('C:\\Users\\Desktop\\Sampletest.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
page_content

This is the result I got from page_content:

 'Name : John \n \nAddress: 123street , USA \n \nPhone No:  123456\n \nGender: Male \n \n \nName : Jim \n \nAddress:  456street , USA \n \nPhone No:  456899\n \nGender: Male \n \n \n'

How can I convert it to JSON or XML so that I can load the extracted data into a SQL Server database?

I’ve also tried this:

import json
data = json.dumps(page_content)
formatj = json.loads(data)
print(formatj)

Output:

Name : John 
Address: 123street , USA 
Phone No:  123456
Gender: Male 

Name : Jim 
Address:  456street , USA 
Phone No:  456899
Gender: Male

This is the same output I have in the word file, but I don’t think it’s in JSON format.

Solution

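Not pretty, but I think it gets the job done. You’ll get a list of dictionaries, one per person, which the json module then prints with indentation.

import json

def get_data(page_content):
    records = []
    record = {}
    for line in page_content.splitlines():
        # Skip blank lines and anything that is not a "key: value" pair
        if ':' not in line:
            continue
        # Split on the first colon only, so values may contain ':'
        key, value = line.split(':', 1)
        key = key.strip()
        # A repeated key (e.g. a second "Name") starts a new record
        if key in record:
            records.append(record)
            record = {}
        record[key] = value.strip()
    if record:
        records.append(record)
    return records

page_data = get_data(page_content)
json_data = json.dumps(page_data, indent=4)
print(json_data)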

Or, instead of the last 3 lines, just do the following:

print(json.dumps(get_data(page_content), indent=4))
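
The question also asks about XML. Here is a minimal sketch using the standard library's xml.etree.ElementTree, assuming the parsed data is available as a list of dictionaries, one per person. The names to_tag, records_to_xml, people, and person are made up for illustration, and the sample records are copied from the PDF content shown in the question.

```python
import re
import xml.etree.ElementTree as ET

def to_tag(key):
    # XML element names cannot contain spaces, so "Phone No" becomes "Phone_No".
    # Assumption: keys start with a letter, as they do in the sample data.
    return re.sub(r'\W+', '_', key.strip())

def records_to_xml(records, root_tag='people', item_tag='person'):
    # Build <people><person><Name>...</Name>...</person>...</people>
    root = ET.Element(root_tag)
    for record in records:
        item = ET.SubElement(root, item_tag)
        for key, value in record.items():
            ET.SubElement(item, to_tag(key)).text = value
    # encoding='unicode' returns a str instead of bytes
    return ET.tostring(root, encoding='unicode')

# Sample records taken from the question's PDF content
records = [
    {'Name': 'John', 'Address': '123street , USA', 'Phone No': '123456', 'Gender': 'Male'},
    {'Name': 'Jim', 'Address': '456street , USA', 'Phone No': '456899', 'Gender': 'Male'},
]

xml_data = records_to_xml(records)
print(xml_data)
```

ElementTree escapes special characters such as & and < in the values, which is the main reason to prefer it over building the XML with string concatenation.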
