Reading Spark application code, I often see several transformations on an RDD followed by an action, sometimes preceded or followed by a count operation to find out the number of records. That count is usually costly when you have billions of records, and if this pattern shows up repeatedly in your code it starts to slow the system down. I agree to a certain extent that the RDD may still sit in JVM memory and the operation might not be as costly as we think -- so make sure you profile your code.
For example:
val baseRDD = sc.textFile("mysamplefile.txt")
val firstElementRDD = baseRDD.map(elem => elem.split(",")(0))
firstElementRDD.saveAsObjectFile("path_to_object_file")
val count = firstElementRDD.count()
/* Something gets done with this count. How can we eliminate this extra action? */
We can declare a simple accumulator to do the same job:
val acc = sc.accumulator(0)
val baseRDD = sc.textFile("mysamplefile.txt")
val firstElementRDD = baseRDD.map { elem =>
  acc.add(1)                      // count each record as it is processed
  elem.split(",")(0)
}
firstElementRDD.saveAsObjectFile("path_to_object_file")
val count = acc.value             // read only after the action has run
This is a very common programming pattern for accumulators but is often overlooked.
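One caveat: Spark only guarantees exactly-once accumulator updates inside actions; updates made inside transformations, as above, can be applied more than once if a task or stage is retried, so treat the count as best-effort. Also note that sc.accumulator was deprecated in Spark 2.0. As a minimal sketch, assuming Spark 2.0+ and the same hypothetical input file, the same idea with the newer longAccumulator API looks like this:
val acc = sc.longAccumulator("recordCount")   // named accumulators also appear in the Spark UI
val baseRDD = sc.textFile("mysamplefile.txt")
val firstElementRDD = baseRDD.map { elem =>
  acc.add(1)                                  // increment once per record processed
  elem.split(",")(0)
}
firstElementRDD.saveAsObjectFile("path_to_object_file")
val count = acc.value                         // read the total once the action has executed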