You have installed Apache Spark (or one of the other distributions like Cloudera, MapR, or Hortonworks) and are excited to start using it.
But wait -- how do you know you installed it correctly? There are two simple tests I want you to run (they take less than a minute, I promise!).
Open the Spark shell (./spark-shell) and run:
sc.parallelize(1 to 1000).count()
What to check:
a. It runs and comes back with res0: Long = 1000 as the result.
b. It distributed the computation across the nodes of the cluster. This is evident from the TaskSetManager log lines that appear while it computes (a programmatic check is sketched just below).
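If you would rather confirm the distribution programmatically than squint at the logs, here is a minimal sketch you can paste into the same shell session. It is just one way to do it: mapPartitions runs once per partition, and java.net.InetAddress reports the hostname of whichever machine executed it. (You will only see more than one host if you launched the shell against a cluster, e.g. with --master, rather than in local mode.)

// Report which hosts actually executed the partitions.
// Distinct hostnames in the output mean the work was distributed.
sc.parallelize(1 to 1000)
  .mapPartitions { iter =>
    Iterator((java.net.InetAddress.getLocalHost.getHostName, iter.size))
  }
  .collect()
  .foreach { case (host, n) => println(s"$host processed $n elements") }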
The second quick test is the ubiquitous Pi computation:
sc.parallelize(1 to 1000).map { _ =>
  val x = Math.random()
  val y = Math.random()
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _) * 4 / 1000
It should come back with res1: Int = 3. Why 3? The sum is roughly 785 (about pi/4 of the 1000 points land inside the quarter circle), and 785 * 4 / 1000 in integer arithmetic truncates to 3. Check the logs above to make sure the computation did not throw any exceptions and was distributed across multiple nodes of the cluster.
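If you want a result that actually looks like Pi, a small variant of the same job does the trick: switch to Double arithmetic for the final division and crank up the sample count (the 100000 below is an arbitrary choice; more samples give a closer estimate).

val n = 100000
// Count the random points that land inside the unit quarter circle.
val inside = sc.parallelize(1 to n).map { _ =>
  val x = Math.random()
  val y = Math.random()
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)
// 4.0 forces Double division, so the estimate is not truncated to an Int.
println(4.0 * inside / n)  // prints something close to 3.14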