Apache Spark provides the following distributed storage features:

  • HDFS (Hadoop Distributed File System) support
  • RDD (Resilient Distributed Dataset)

The RDD is the key feature of Apache Spark for distributed RAM-based storage and distributed computing.

Distributed computing on Apache Spark:

  • Java 8 lambda functions are used with RDDs
  • Other libraries and connectors (Couchbase-Spark Connector, MongoDB-Spark Connector,…) are implemented on top of RDDs

Some examples using Apache Spark RDDs are described below.

Turn a list into an RDD to make it distributed data that can then be used in distributed computing:

List<String> list; //with some data inside
JavaRDD<String> rdd = sparkContext.parallelize(list);
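The snippets in this article assume an already-created JavaSparkContext named sparkContext. A minimal self-contained setup might look like the sketch below; the app name and the local[*] master are illustrative choices for running locally.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddSetupExample {
  public static void main(String[] args) {
    // local[*] runs Spark inside this JVM, using all available cores
    SparkConf conf = new SparkConf()
        .setAppName("RddSetupExample") // illustrative name
        .setMaster("local[*]");
    JavaSparkContext sparkContext = new JavaSparkContext(conf);

    List<String> list = Arrays.asList("apache", "spark", "rdd");
    JavaRDD<String> rdd = sparkContext.parallelize(list);

    // count() is an action: it triggers the distributed computation
    System.out.println(rdd.count()); // prints 3

    sparkContext.close();
  }
}
```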

Turn a text file into an RDD of its lines:

//the file path can use schemes like file:, hdfs:,...
JavaRDD<String> rdd = sparkContext.textFile("some_file_path");

Make another RDD from an RDD:

List<String> list; //with some data inside
JavaRDD<String> rdd = sparkContext.parallelize(list);

//java 8 lambda function
JavaRDD<Integer> anotherRdd = rdd.map(item -> {
  return item.length();
});

//short cut for the above map
//java 8 lambda expression
JavaRDD<Integer> anotherRdd = rdd.map(item -> item.length());

Save an RDD in RAM so it can be used again:


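The snippet for this section is missing from the original; a minimal sketch using the RDD cache() method (persist() with an explicit StorageLevel is the more general form) might be:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CacheExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("CacheExample").setMaster("local[*]");
    JavaSparkContext sparkContext = new JavaSparkContext(conf);

    JavaRDD<String> rdd = sparkContext.textFile("some_file_path");

    // cache() marks the RDD to be kept in memory after the first action
    // computes it, so later actions reuse the cached partitions
    rdd.cache();

    long first = rdd.count();  // reads the file and fills the cache
    long again = rdd.count();  // served from RAM, not re-read from the file

    sparkContext.close();
  }
}
```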
Note that a lambda function or expression may not execute on the same thread, in the same process, or even in the same JVM as the driver, so mutating a captured local variable does not work (Java also requires captured locals to be effectively final, so the snippet below does not even compile):

int sumLength = 0;
List<String> list; //with some data inside
JavaRDD<String> rdd = sparkContext.parallelize(list);

//broken: the lambda runs on executors, so any updates to sumLength are lost
rdd.foreach(str -> {
  sumLength += str.length();
});
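A correct way to compute the total length is to let Spark do the aggregation, sketched here with map and reduce (Spark accumulators are another option):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SumLengthExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("SumLengthExample").setMaster("local[*]");
    JavaSparkContext sparkContext = new JavaSparkContext(conf);

    List<String> list = Arrays.asList("apache", "spark", "rdd");
    JavaRDD<String> rdd = sparkContext.parallelize(list);

    // map each string to its length, then reduce the lengths into one sum;
    // the result comes back to the driver, so no shared state is mutated
    int sumLength = rdd.map(String::length).reduce(Integer::sum);

    System.out.println(sumLength); // 6 + 5 + 3 = 14

    sparkContext.close();
  }
}
```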

Create an RDD of pairs:

JavaRDD<String> rdd; //with some data

//make an RDD of pairs; note that mapToPair, not map, returns a JavaPairRDD
JavaPairRDD<String,Integer> pairRdd =
rdd.mapToPair(str -> {
  return new Tuple2<>(str, str.length());
});
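Pair RDDs enable key-based operations such as reduceByKey. A minimal word-count-style sketch (the data and names here are illustrative):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class PairRddExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("PairRddExample").setMaster("local[*]");
    JavaSparkContext sparkContext = new JavaSparkContext(conf);

    List<String> list = Arrays.asList("spark", "rdd", "spark");
    JavaRDD<String> rdd = sparkContext.parallelize(list);

    // count occurrences of each string: pair each word with 1, then sum per key
    JavaPairRDD<String, Integer> counts = rdd
        .mapToPair(str -> new Tuple2<>(str, 1))
        .reduceByKey(Integer::sum);

    // counts now holds ("spark", 2) and ("rdd", 1)
    counts.collect().forEach(System.out::println);

    sparkContext.close();
  }
}
```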

Filter an RDD:

JavaRDD<String> rdd; //with some data
String filter = "some-substring";

//grep command example
JavaRDD<String> grepRdd = rdd.filter(str -> {
  if (str.indexOf(filter) >= 0)
    return true;
  return false;
});

//short cut for the above filter
//java 8 lambda expression
JavaRDD<String> grepRdd = rdd.filter(str -> str.contains(filter));

Many other RDD operations may be found here: