Use mapreduce to search for file-specific words in all other files present in HDFS
I have multiple files, each containing an employee's name, ID, and skill set, and another file, "skills.txt", which contains a list of specific skills. I'm trying to write a Java MapReduce program to find the employees who have the skills listed in skills.txt.
For example, suppose there are 3 employee files as follows:
emp1.txt-
Name: Tom
EmpId: 001
Skills: C++, Java, SQL
emp2.txt-
Name: Jerry
EmpId: 002
Skills: C++, PHP, SQL
emp3.txt-
Name: Jack
EmpId: 003
Skills: Java, PHP
skills.txt-
PHP
SQL
Then my result should look like this.
PHP Jerry-002 ; Jack-003
SQL Tom-001 ; Jerry-002
All four of these files are in my HDFS.
I'm new to Hadoop and MapReduce. I've put a lot of effort into this, but haven't come up with proper logic to do it. I was able to write a program when there was only one skill and the skill to search for was passed to the MapReduce program as a parameter. But I can't do that when searching for multiple skills, and those skills live in a file in HDFS alongside the employee files.
Solution
The solution is to add skills.txt to the DistributedCache. In your mapper, read the cached file in the setup() method:
Path[] uris = DistributedCache.getLocalCacheFiles(context.getConfiguration());
String skillsFile = uris[0].toString();
BufferedReader in = new BufferedReader(new FileReader(skillsFile));
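Putting the pieces together, a mapper and reducer along these lines should work. This is only a sketch under a few assumptions: each employee file is small enough to be handled by a single mapper (so the Name: and EmpId: lines are seen before the Skills: line), skills.txt holds one skill per line, and the class and field names here are mine, not from your code. Also note that DistributedCache is deprecated in newer Hadoop releases in favor of Job.addCacheFile() and Context.getCacheFiles(), but the old API shown in your snippet still works:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SkillMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Set<String> skills = new HashSet<String>();
    private String name;   // remembered from the "Name:" line of the current file
    private String empId;  // remembered from the "EmpId:" line

    @Override
    protected void setup(Context context) throws IOException {
        // Load the cached skills.txt: one wanted skill per line.
        Path[] uris = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        BufferedReader in = new BufferedReader(new FileReader(uris[0].toString()));
        String line;
        while ((line = in.readLine()) != null) {
            skills.add(line.trim());
        }
        in.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        if (line.startsWith("Name:")) {
            name = line.substring("Name:".length()).trim();
        } else if (line.startsWith("EmpId:")) {
            empId = line.substring("EmpId:".length()).trim();
        } else if (line.startsWith("Skills:")) {
            // Emit one (skill, Name-EmpId) pair for every wanted skill.
            for (String skill : line.substring("Skills:".length()).split(",")) {
                skill = skill.trim();
                if (skills.contains(skill)) {
                    context.write(new Text(skill), new Text(name + "-" + empId));
                }
            }
        }
    }
}

class SkillReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text skill, Iterable<Text> employees, Context context)
            throws IOException, InterruptedException {
        // Join all employees holding this skill with " ; ", as in the desired output.
        StringBuilder joined = new StringBuilder();
        for (Text emp : employees) {
            if (joined.length() > 0) joined.append(" ; ");
            joined.append(emp.toString());
        }
        context.write(skill, new Text(joined.toString()));
    }
}
```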
During job setup, you must add files to the distributed cache:
DistributedCache.addCacheFile(new URI(skillsFile), job.getConfiguration());
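If it helps to verify the grouping logic before running anything on the cluster, here is a small plain-Java simulation of what the map and reduce phases do together. It has no Hadoop dependencies, and all names in it (SkillJoinSketch, join) are illustrative, not part of any API:

```java
import java.util.*;

public class SkillJoinSketch {

    // "Map" phase: for each employee {name, empId, skillsCsv}, emit
    // (skill, "Name-EmpId") for every skill in the wanted set.
    // "Reduce" phase: group the emitted pairs by skill.
    public static Map<String, List<String>> join(
            List<String[]> employees, Set<String> wantedSkills) {
        Map<String, List<String>> bySkill = new TreeMap<>();
        for (String[] emp : employees) {
            for (String skill : emp[2].split(",")) {
                skill = skill.trim();
                if (wantedSkills.contains(skill)) {
                    bySkill.computeIfAbsent(skill, k -> new ArrayList<>())
                           .add(emp[0] + "-" + emp[1]);
                }
            }
        }
        return bySkill;
    }

    public static void main(String[] args) {
        List<String[]> employees = Arrays.asList(
            new String[]{"Tom",   "001", "C++, Java, SQL"},
            new String[]{"Jerry", "002", "C++, PHP, SQL"},
            new String[]{"Jack",  "003", "Java, PHP"});
        Set<String> wanted = new HashSet<>(Arrays.asList("PHP", "SQL"));
        for (Map.Entry<String, List<String>> e : join(employees, wanted).entrySet()) {
            System.out.println(e.getKey() + "\t" + String.join(" ; ", e.getValue()));
        }
        // prints:
        // PHP	Jerry-002 ; Jack-003
        // SQL	Tom-001 ; Jerry-002
    }
}
```

This mirrors the MapReduce job exactly: the inner loop is what the mapper emits, and the map keyed by skill is what the shuffle hands to the reducer.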
I hope this helps you on your way.