Java – How to use newAPIHadoopRDD in Java?


How to use newAPIHadoopRDD in Java?

I’m trying to read from Cassandra into a JavaRDD. Below is my code:

import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkWCassandra {

    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext("local", "spark Cassandra");
        String keySpace = "retail";
        String inputColumnFamily = "ordercf";

        try {
            // The Job only serves as a carrier for the Hadoop Configuration
            Job job = new Job();
            job.setInputFormatClass(CqlPagingInputFormat.class);

            // Where and what to read: host, RPC port, keyspace/column family, partitioner
            ConfigHelper.setInputInitialAddress(job.getConfiguration(), "localhost");
            ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
            ConfigHelper.setInputColumnFamily(job.getConfiguration(), keySpace, inputColumnFamily);
            ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
        } catch (IOException ex) {
            Logger.getLogger(SparkWCassandra.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}

The next step should be jsc.newAPIHadoopRDD(), but I don’t quite understand what its parameters mean or what I should pass for them.
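
For reference, JavaSparkContext.newAPIHadoopRDD takes the Hadoop Configuration to read with, the InputFormat class, and the key and value classes that the InputFormat emits. With the Cassandra 2.0-era CqlPagingInputFormat configured above, both keys and values are Map<String, ByteBuffer>. A minimal sketch of the call, placed inside the try block after the ConfigHelper calls (mapClass is just a hypothetical local that works around Java’s generics; it needs java.util.Map, java.nio.ByteBuffer, and org.apache.spark.api.java.JavaPairRDD imports):

            // CqlPagingInputFormat emits Map<String, ByteBuffer> for both keys (the
            // partition key columns) and values (the remaining columns), so both
            // class parameters are Map.class, cast to the parameterized type.
            @SuppressWarnings("unchecked")
            Class<Map<String, ByteBuffer>> mapClass =
                    (Class<Map<String, ByteBuffer>>) (Class<?>) Map.class;

            JavaPairRDD<Map<String, ByteBuffer>, Map<String, ByteBuffer>> casRdd =
                    jsc.newAPIHadoopRDD(
                            job.getConfiguration(),      // the Configuration prepared above
                            CqlPagingInputFormat.class,  // the new-API InputFormat to read with
                            mapClass,                    // key class the InputFormat emits
                            mapClass);                   // value class the InputFormat emits
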
The keyspace and tables created in Cassandra are as follows:

CREATE TABLE salecount (
    product_id text,
    sale_count int,
    PRIMARY KEY (product_id)
);

CREATE TABLE ordercf (
    user_id text,
    time timestamp,
    product_id text,
    quantity int,
    PRIMARY KEY (user_id, time)
);

INSERT INTO ordercf (user_id, time, product_id, quantity) VALUES ('bob', 1385983646000, 'iphone', 1);
INSERT INTO ordercf (user_id, time, product_id, quantity) VALUES ('tom', 1385983647000, 'samsung', 4);
INSERT INTO ordercf (user_id, time, product_id, quantity) VALUES ('dora', 1385983648000, 'nokia', 2);
INSERT INTO ordercf (user_id, time, product_id, quantity) VALUES ('charlie', 1385983649000, 'iphone', 2);

Can anyone give an example of how to use newAPIHadoopRDD? Thanks!

Solution

You should really try the official DataStax Spark Cassandra Connector (the spark-cassandra-connector project on GitHub). It lets you access Cassandra directly, without going through the Hadoop API, and it is easier to use and probably faster.
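
As a rough sketch of that route, assuming a 1.x connector exposing the japi Java API and Java 8 lambdas (the host setting and table names mirror the question; the total-quantity computation is only an illustration, and the exact class and method names should be verified against the connector version you use):

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

import com.datastax.spark.connector.japi.CassandraRow;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkConnectorExample {

    public static void main(String[] args) {
        // The connector is configured through Spark properties instead of a Hadoop Job
        SparkConf conf = new SparkConf()
                .setMaster("local")
                .setAppName("spark Cassandra")
                .set("spark.cassandra.connection.host", "localhost");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Read the retail.ordercf table as an RDD of rows with typed column getters
        JavaRDD<CassandraRow> rows = javaFunctions(jsc)
                .cassandraTable("retail", "ordercf");

        // Illustration only: total quantity across all orders in ordercf
        int totalQuantity = rows
                .map(row -> row.getInt("quantity"))
                .reduce(Integer::sum);
        System.out.println("total quantity ordered: " + totalQuantity);

        jsc.stop();
    }
}

Note how the connector replaces the whole Job/InputFormat setup: connection details travel as Spark properties, and rows come back as CassandraRow objects with typed getters instead of Map<String, ByteBuffer> pairs.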
