Java – What is the most effective way to remove stop words from a huge corpus of text?

What is the most effective way to remove stop words from a huge corpus of text?

I would like to know an effective way to remove stop words from a huge corpus of text.
Currently my approach is to convert the stop words into a single regular expression, match each line of text against it, and remove the matches.

For example:

String regex = "\\b(?:a|an|the|was|i)\\b\\s*";
String line = "hi this is regex approach of stop word removal";
// Replaces every stop word, together with any trailing whitespace
String lineWithoutStopword = line.replaceAll(regex, "");

Is there any other effective way to remove stop words from a huge corpus?

Thanks

Solution

With Spark, one way to do this is to tokenize the text into words and then subtract the stop words from the result.

val text = sc.textFile("huge.txt")
val stopWords = sc.textFile("stopwords.txt")
// Split on runs of non-word characters to get individual words
val words = text.flatMap(line => line.split("\\W+"))
val clean = words.subtract(stopWords)
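As a quick sanity check, something like the following prints a sample of the surviving words (take and foreach are standard RDD and collection methods):

clean.take(10).foreach(println)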

If you need to work with very large text files (many GB), it is more efficient to load the stop words into an in-memory set that can be broadcast to each worker, so the filtering happens locally instead of through the shuffle that subtract requires.

The code will look like this:

val stopWords = sc.textFile("stopwords.txt")
val stopWordSet = stopWords.collect.toSet
// Ship the whole stop word set to every worker once
val stopWordSetBC = sc.broadcast(stopWordSet)
val words = text.flatMap(line => line.split("\\W+"))
val clean = words.mapPartitions { iter =>
    // Read the broadcast value once per partition, not once per word
    val stopWordSet = stopWordSetBC.value
    iter.filter(word => !stopWordSet.contains(word))
}

Note that for this to work, the words of the original text must be normalized (for example, lowercased) the same way as the stop word list.
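A minimal sketch of that normalization step, assuming the stop word list is already lowercase and that lowercasing plus the same "\\W+" split is sufficient for this corpus:

// Lowercase each line before splitting so tokens match a lowercase stop word list
val words = text.flatMap(line => line.toLowerCase.split("\\W+")).filter(_.nonEmpty)
val clean = words.mapPartitions { iter =>
    val stopWordSet = stopWordSetBC.value
    iter.filter(word => !stopWordSet.contains(word))
}

The result can then be written back out with clean.saveAsTextFile("clean-words") (the output path here is hypothetical).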
