Java – Use MapReduce to search for file-specific words in all other files present in HDFS

Use MapReduce to search for file-specific words in all other files present in HDFS – here is a solution to the problem.

Use MapReduce to search for file-specific words in all other files present in HDFS

I have multiple files with employees’ names, IDs, and skill sets, and another file, “skills.txt”, which contains a list of specific skills. I’m trying to write a Java MapReduce program to find the employees who have the skills mentioned in skills.txt.

For example, suppose there are 3 employee files as follows:
emp1.txt-
Name: Tom
EmpId: 001
Skills: C++, Java, SQL

emp2.txt-
Name: Jerry
EmpId: 002
Skills: C++, PHP, SQL

emp3.txt-
Name: Jack
EmpId: 003
Skills: Java, PHP

skills.txt-
PHP
SQL

Then my result should look like this:
PHP    Jerry-002 ; Jack-003
SQL    Tom-001 ; Jerry-002

All four of these files are in my HDFS.
I’m new to Hadoop and MapReduce. I’ve put a lot of effort into this, but haven’t come up with any proper logic for it. I was able to write a program when there was only one skill and I passed the skill to search for as a parameter to the MapReduce program. But I can’t do that when searching for multiple skills that are listed in a file stored in HDFS alongside the other employee files.

Solution

The solution is to add the skills.txt file to the DistributedCache. In your mapper, read the file in the setup() method:

// The cached skills.txt is available as a local file inside the map task
Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
String skillsFile = cacheFiles[0].toString();
BufferedReader in = new BufferedReader(new FileReader(skillsFile));
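
For a complete picture, here is a minimal sketch of how a mapper and a matching reducer could look. The class names, the field labels “Name:”, “EmpId:” and “Skills:”, and the assumption that each small employee file is read line by line, in order, within a single map task are taken from the example files above or are purely illustrative; they are not part of the original answer.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SkillSearchMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Set<String> skills = new HashSet<String>();
    private String name;
    private String empId;

    @Override
    protected void setup(Context context) throws IOException {
        // Load the cached skills.txt once per map task
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        BufferedReader in = new BufferedReader(new FileReader(cacheFiles[0].toString()));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    skills.add(line.trim());
                }
            }
        } finally {
            in.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumes the three lines of each small employee file arrive in order
        // within one map task, so Name and EmpId can be buffered until the
        // Skills line is seen.
        String line = value.toString().trim();
        if (line.startsWith("Name:")) {
            name = line.substring("Name:".length()).trim();
        } else if (line.startsWith("EmpId:")) {
            empId = line.substring("EmpId:".length()).trim();
        } else if (line.startsWith("Skills:")) {
            for (String skill : line.substring("Skills:".length()).split(",")) {
                String s = skill.trim();
                if (skills.contains(s)) {
                    // Emit (skill, "Name-EmpId") for every matching skill
                    context.write(new Text(s), new Text(name + "-" + empId));
                }
            }
        }
    }
}

A matching reducer (same imports, plus org.apache.hadoop.mapreduce.Reducer) then only has to join all employees that share a skill, which produces the “skill  employee ; employee” lines shown in the expected output:

public class SkillSearchReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text skill, Iterable<Text> employees, Context context)
            throws IOException, InterruptedException {
        StringBuilder joined = new StringBuilder();
        for (Text employee : employees) {
            if (joined.length() > 0) {
                joined.append(" ; ");
            }
            joined.append(employee.toString());
        }
        context.write(skill, new Text(joined.toString()));
    }
}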

During job setup, you must add files to the distributed cache:

DistributedCache.addCacheFile(new URI(skillsFile), job.getConfiguration());
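
To make the wiring concrete, here is a hedged sketch of a driver that puts the pieces together. The class names and the convention of passing the input directory, the output directory, and the skills.txt path as command-line arguments are assumptions for illustration, not taken from the original answer.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SkillSearchDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "skill search");
        job.setJarByClass(SkillSearchDriver.class);
        job.setMapperClass(SkillSearchMapper.class);
        job.setReducerClass(SkillSearchReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Register skills.txt (HDFS path passed as args[2]) so each mapper
        // can read it locally in setup()
        DistributedCache.addCacheFile(new URI(args[2]), job.getConfiguration());
        FileInputFormat.addInputPath(job, new Path(args[0]));   // directory with the emp*.txt files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}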

I hope this will help you on your way.
