Tuesday, January 6, 2015

Counters and Spark - Think Accumulators

Reading through Spark application code, I often see multiple transformations on an RDD followed by an action, and preceded or followed by a count operation to find out the number of records. That count is usually costly when you have billions of records, and if the pattern shows up repeatedly in your code it starts slowing the system down. To be fair, if the RDD is already sitting in JVM memory the count may not be as expensive as it looks -- so make sure you profile your code.

For example:

val baseRDD = sc.textFile("mysamplefile.txt")
val firstElementRDD = baseRDD.map(elem => elem.split(",")(0))
firstElementRDD.saveAsObjectFile("path_to_object_file")
val count = firstElementRDD.count()   // a second job, just to get the record count
/* Something gets done with this count -- how can we eliminate the extra job? */
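As written, count() kicks off a second job that re-reads and re-splits the file, because nothing tells Spark to keep firstElementRDD around. The aside above about RDDs sitting in JVM memory only holds once you persist the RDD explicitly -- a minimal sketch of that variant (same hypothetical file and output path), which still pays for a second, albeit cheaper, job:

val baseRDD = sc.textFile("mysamplefile.txt")
val firstElementRDD = baseRDD.map(elem => elem.split(",")(0)).cache()   // keep the result in memory

firstElementRDD.saveAsObjectFile("path_to_object_file")   // first job: computes and caches
val count = firstElementRDD.count()                       // second job: served from the cache

Caching trims the cost but does not remove the extra job; the accumulator below does.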

We can declare a simple accumulator to do the same job:

val acc = sc.accumulator(0)
val baseRDD = sc.textFile("mysamplefile.txt")
val firstElementRDD = baseRDD.map { elem =>
  acc += 1                      // count each record as part of the same job
  elem.split(",")(0)
}
firstElementRDD.saveAsObjectFile("path_to_object_file")
val count = acc.value           // read the total after the action has run

This is a very common programming pattern for accumulators but is often overlooked.
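One caveat worth adding as my own note: because the increment happens inside a transformation, the accumulator only gets populated once an action actually triggers the job, and Spark may re-apply those updates if a task is retried or speculatively executed, so treat the total as best-effort rather than exact. A minimal sketch of the timing, using the same hypothetical file and output path:

val acc = sc.accumulator(0)
val firstElementRDD = sc.textFile("mysamplefile.txt")
  .map { elem => acc += 1; elem.split(",")(0) }

println(acc.value)   // still 0 -- transformations are lazy, no job has run yet

firstElementRDD.saveAsObjectFile("path_to_object_file")
println(acc.value)   // now holds the record count, as a side effect of the save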



