Java – Apache Spark : Update global variables in workers

I’m curious whether the simple code below can work in a distributed environment (it works fine in a standalone environment)?

import java.util.Iterator;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function2;

public class TestClass {
    private static double[][] testArray = new double[4][];

    public static void main(String[] args) {
        for (int i = 0; i < 4; i++) {
            testArray[i] = new double[10];
        }
        ...
        // sc is the JavaSparkContext created in the elided code above
        JavaRDD<String> testRDD = sc.textFile("testfile", 4).mapPartitionsWithIndex(
            new Function2<Integer, Iterator<String>, Iterator<String>>() {
                @Override
                public Iterator<String> call(Integer ind, Iterator<String> s) {
                    /* Update testArray[ind] from this partition's lines */
                    return s;
                }
            }, true
        );
    ...

If it should work, how would Spark send each worker's portion of testArray back to the master node?

Solution

No, it will not work in a distributed environment.

Variables captured in the closure are serialized and sent to the workers. The data initially set on the driver will be available to the workers, but any updates made on a worker are only visible locally on that worker.

Locally, the variables are in the same memory space, so you will see the updates, but this does not extend to a cluster.
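
To make that concrete, here is a minimal sketch of the same pattern, assuming a static counter updated from inside an RDD closure (the class name, file name, and app name are illustrative, not from the original post):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class StaticUpdateDemo {
    // Driver-side state that the closure below tries to update.
    private static long lineCount = 0;

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("static-update-demo"));

        // foreach runs on the executors; each executor JVM updates its own copy of lineCount.
        sc.textFile("testfile", 4).foreach(line -> lineCount++);

        // In local mode the tasks run in the driver's JVM, so this can print the real count;
        // on a cluster the executors' updates never reach the driver, so this still prints 0.
        System.out.println("lines counted on the driver: " + lineCount);
    }
}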

You need to restructure the calculation as RDD operations whose results you collect back to the driver, instead of relying on side effects on a shared variable.
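
As a minimal sketch of that restructuring (the class name, file name, and per-line update logic are illustrative, not from the original post): each partition returns its partition index together with its computed row, collect() brings those pairs back to the driver, and the driver fills the array itself.

import java.util.Collections;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;

import scala.Tuple2;

public class CollectRowsExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("collect-rows"));
        double[][] testArray = new double[4][];

        JavaRDD<String> lines = sc.textFile("testfile", 4);

        // Each partition emits one (partitionIndex, row) pair as data
        // instead of writing into a shared array.
        List<Tuple2<Integer, double[]>> rows = lines.mapPartitionsWithIndex(
            new Function2<Integer, Iterator<String>, Iterator<Tuple2<Integer, double[]>>>() {
                @Override
                public Iterator<Tuple2<Integer, double[]>> call(Integer ind, Iterator<String> it) {
                    double[] row = new double[10];
                    while (it.hasNext()) {
                        String line = it.next();
                        // ... update row from line (per-line logic elided) ...
                    }
                    return Collections.singletonList(new Tuple2<>(ind, row)).iterator();
                }
            }, true
        ).collect();   // the per-partition results are shipped back to the driver here

        // The driver rebuilds the array from the collected results.
        for (Tuple2<Integer, double[]> r : rows) {
            testArray[r._1()] = r._2();
        }
    }
}

The essential change is that each partition's result travels back to the driver as the return value of an RDD action, not as a side effect on a shared variable.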
