Java – How to run Hadoop wordcount programs on pdf and doc files?

How do I run the Hadoop wordcount program on pdf and doc files?
When I try to run it on a pdf file, the output shows strange characters.

Solution

PDF and DOC are binary formats. The Hadoop wordcount example reads its input as lines of plain text, so the raw bytes of a PDF or DOC file show up as strange characters in the output. Convert the files to plain text first, using an external tool or library, and then run wordcount on the converted text.

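Free command-line utilities exist for this conversion, for example pdftotext (shipped with Xpdf and Poppler) for PDF files and antiword for DOC files. Apache Tika is a Java library that can extract text from both formats.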
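The conversion step could be scripted roughly as follows. This is only a sketch under stated assumptions: it assumes the pdftotext and antiword executables are installed and on the PATH, and the class name PlainTextConverter and its methods are hypothetical helpers, not part of Hadoop. It runs locally before the job and writes one .txt file per input document.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not part of Hadoop): converts PDF and DOC files to
// plain text by calling external command-line tools before the job runs.
final class PlainTextConverter {

    // Derives the output path, e.g. report.pdf -> <outDir>/report.pdf.txt.
    static Path outputFor(Path input, Path outDir) {
        return outDir.resolve(input.getFileName().toString() + ".txt");
    }

    // Builds the external command for one file based on its extension.
    // Assumption: pdftotext and antiword are installed and on the PATH.
    // Both commands are set up to print the extracted text to standard output.
    static List<String> commandFor(Path input) {
        String name = input.getFileName().toString().toLowerCase(Locale.ROOT);
        if (name.endsWith(".pdf")) {
            // A text-file argument of "-" makes pdftotext write to stdout.
            return Arrays.asList("pdftotext", "-enc", "UTF-8", input.toString(), "-");
        }
        if (name.endsWith(".doc")) {
            return Arrays.asList("antiword", input.toString());
        }
        throw new IllegalArgumentException("Unsupported file type: " + input);
    }

    // Runs the tool and redirects its standard output into the .txt file.
    static Path convert(Path input, Path outDir) throws IOException, InterruptedException {
        Path output = outputFor(input, outDir);
        ProcessBuilder pb = new ProcessBuilder(commandFor(input));
        pb.redirectOutput(output.toFile());
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        int exitCode = pb.start().waitFor();
        if (exitCode != 0) {
            throw new IOException("Conversion failed for " + input + " (exit code " + exitCode + ")");
        }
        return output;
    }

    // Usage: java PlainTextConverter <existing-output-dir> <file> [<file> ...]
    public static void main(String[] args) throws Exception {
        Path outDir = Paths.get(args[0]);
        for (int i = 1; i < args.length; i++) {
            System.out.println("Wrote " + convert(Paths.get(args[i]), outDir));
        }
    }
}
```

The resulting text files can then be copied into HDFS (for example with hdfs dfs -put) and used as the wordcount input directory in place of the original PDF and DOC files.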
