Java - How to iterate JavaRDD using foreach and use spark java to find a specific element from each line

I have these lines in my text file:

Some different lines....

Name : Praveen  
Age : 24  
Contact : 1234567890  
Location : India  

Some different lines....

Name : John  
Contact : 1234567890  
Location : UK  

Some different lines....  

Name : Joe  
Age : 54  
Contact : 1234567890  
Location : US

"Some different lines" stands for unrelated lines of additional information that appear between the records.

Now I need to read the file and extract the information for each person. If a key is missing, it should be read as an empty string (Age is missing for the second person).

JavaRDD<String> data = jsc.textFile("person.report");

List<String> name = data.filter(f -> f.contains("Name")).collect();
List<String> age = data.filter(f -> f.contains("Age")).collect();
List<String> contact = data.filter(f -> f.contains("Contact")).collect();
List<String> location = data.filter(f -> f.contains("Location")).collect();

When I run this and iterate over the four lists in a for loop by index, the age of person 3 is assigned to person 2, because the age list has only two entries.

Solution

First of all, you are collecting everything on the driver. Are you sure that is what you want? It will not work with large data sets.

Basically, your problem is that what you think of as a record does not fit on one line. By default, Spark treats each line as a separate record. Here, however, a record spans several lines (name, age, location, ...). To get around this, you need a different record delimiter. If there is a specific string in the "Some different lines" part, use it and set this property (Scala syntax; in Java, call hadoopConfiguration() on the JavaSparkContext):

sc.hadoopConfiguration.set("textinputformat.record.delimiter","specific string")

Then, in Scala, you can write something like this:

val cols = Seq("Name","Age", "Contact", "Location")
sc.textFile("...")
  .map( _.split("\n"))
  .map(x => cols
       .map( col => x.find(_.startsWith(col)).getOrElse(col+" :") ) )

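All lines for one person are then in the same record, for you to work on at will. If the in-between lines contain no suitable delimiter, note that every record starts with a name line, so you can use "Name :" as the delimiter. The delimiter itself is removed from the records, so in that case the name is simply the first line of each record and has to be read from there.

In Java 8, you can implement the same thing with streams. It is a bit verbose, but since the question is about Java, here it is: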
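The effect of a custom record delimiter can be tried out on an in-memory string, without a Spark cluster. The following is only a sketch: the class and method names (RecordSplitSketch, splitRecords) are hypothetical, and String.split stands in for the Hadoop record reader. It shows two things to keep in mind: the delimiter text is not part of the resulting records, and whatever precedes the first delimiter becomes a chunk of its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Plain-Java illustration only: mimics on a String what a custom
// record delimiter does to the input. It does not use Spark.
class RecordSplitSketch {

    // Splits the text on a literal delimiter. The delimiter is dropped.
    // Blank chunks are skipped, which is a convenience of this sketch.
    static List<String> splitRecords(String text, String delimiter) {
        List<String> records = new ArrayList<>();
        for (String chunk : text.split(Pattern.quote(delimiter))) {
            if (!chunk.trim().isEmpty()) {
                records.add(chunk);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "header\n"
                + "Name : Praveen\nAge : 24\n"
                + "noise\n"
                + "Name : John\nContact : 1234567890\n";
        // The first chunk is the text before the first "Name :";
        // each later chunk starts with the name value itself.
        for (String record : splitRecords(text, "Name :")) {
            System.out.println("[" + record.trim() + "]");
        }
    }
}
```

With "Name :" as the delimiter, a lookup such as startsWith("Name") no longer finds the name line, so the name would have to be taken from the first line of each chunk.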

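// Needs java.util.Arrays, java.util.List and java.util.stream.Collectors.
// Assumes the record delimiter has been set as above, so each element is one multi-line record.
List<String> cols = Arrays.asList("Name", "Age", "Contact", "Location");
sc.textFile("...")
    // split each record into its lines
    .map(record -> Arrays.asList(record.split("\n")))
    // for each key, keep the matching line, or "<key> :" if the key is missing
    .map(lines -> cols.stream()
                      .map(col -> lines.stream()
                                       .filter(line -> line.startsWith(col))
                                       .findAny()
                                       .orElse(col + " :"))
                      .collect(Collectors.toList()));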
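The question asks for an empty string when a key is missing, whereas the snippet above keeps whole lines such as "Age :". As a minimal sketch of the per-record step, assuming each record arrives as one multi-line string and that keys and values are separated by a colon as in the sample file, the values could be pulled out like this. The class and method names (PersonFieldsSketch, parse) are hypothetical, and the code is plain Java so it can be checked without Spark.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch: turns one multi-line record into key -> value pairs.
class PersonFieldsSketch {

    static final String[] KEYS = {"Name", "Age", "Contact", "Location"};

    // Every expected key is present in the result; a key that does not
    // occur in the record keeps the empty string as its value.
    static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String key : KEYS) {
            fields.put(key, "");
        }
        for (String line : record.split("\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "key : value" line
            }
            String key = line.substring(0, colon).trim();
            if (fields.containsKey(key)) {
                fields.put(key, line.substring(colon + 1).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String john = "Name : John  \nContact : 1234567890  \nLocation : UK  \n";
        System.out.println(parse(john));
        // prints {Name=John, Age=, Contact=1234567890, Location=UK}
    }
}
```

Inside a Spark job, a function like this would be applied in the map step to each record string, so the fields of one person always stay together.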
