You have installed Apache Spark (or one of the distributions that bundle it, such as Cloudera, MapR or Hortonworks) and are excited to start using it.
But wait: how do you know you installed it correctly? There are two simple tests I want you to run (they take less than a minute, I promise!).
Open the Spark shell (./spark-shell from Spark's bin directory) and run
sc.parallelize(1 to 1000).count()
What to check:
a. It runs and returns res0: Long = 1000.
b. It distributed the computation across the nodes of the cluster. You can see this in the TaskSetManager log lines printed while the job runs.
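If the TaskSetManager lines do not show up (newer shells often default to a quieter log level), or you want a more direct check, you can ask each task which machine it ran on. This is a minimal sketch, assuming you are inside spark-shell where sc is already defined; the names rdd and hosts are just illustrative.

```scala
// Assumes a running spark-shell, where `sc` (the SparkContext) is predefined.
// Optional: raise log verbosity so INFO lines such as TaskSetManager's are printed.
sc.setLogLevel("INFO")

// The same small dataset used in the test above.
val rdd = sc.parallelize(1 to 1000)

// How many partitions (and therefore tasks) the job is split into.
println(rdd.partitions.length)

// Each task reports the hostname of the machine it executed on.
val hosts = rdd.map(_ => java.net.InetAddress.getLocalHost.getHostName).distinct().collect()
hosts.foreach(println)
```

On a multi-node cluster you would expect to see more than one hostname printed; in local mode there will be only one. With so little data, seeing a single host does not by itself prove that something is wrong.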
The second quick test is our ubiquitous Pi computation:
sc.parallelize(1 to 1000).map { _ =>
  // pick a random point in the unit square
  val x = Math.random()
  val y = Math.random()
  // count it if it falls inside the quarter circle
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _) * 4 / 1000
It should return res1: Int = 3 (the arithmetic is all on Ints, so the division truncates). Check the logs above to make sure that the computation did not throw any exceptions and was distributed across multiple nodes of the cluster.
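If you would rather see an actual approximation of Pi than a bare 3, a floating-point variant is a small change. This is a sketch under the same assumption (a spark-shell session with sc defined); the sample size n and the names inside and piEstimate are my own choices for illustration, not part of the test above.

```scala
// Assumes a running spark-shell, where `sc` is predefined.
// Hypothetical sample size; larger values give a tighter estimate.
val n = 100000

// Count how many random points in the unit square fall inside the quarter circle.
val inside = sc.parallelize(1 to n).map { _ =>
  val x = Math.random()
  val y = Math.random()
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)

// Multiplying by 4.0 forces floating-point arithmetic, so nothing is truncated.
val piEstimate = 4.0 * inside / n
println(piEstimate)
```

The result varies from run to run because the points are random, but it should land close to 3.14.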