Cluster Computing, MapReduce, Hadoop, Spark
Table of Contents
1 Cluster Computing
2 MapReduce
3 Hadoop
4 Spark
5 Reading beyond the Lectures
- http://spark.apache.org/ Apache Spark is a fast, general engine for large-scale data processing. Applications can be written in Scala, a very powerful and expressive functional programming language (Stratosphere also supports Scala). Job setup is fast, so Spark is well suited to small and medium-sized data sets and to ad-hoc evaluations.
- https://prestodb.io/ According to Facebook, Presto is an interactive query system: a distributed SQL query engine optimized for ad-hoc analysis at interactive speed on petabyte-scale data. Like Spark, all processing happens in memory. Facebook has open-sourced the code, and there is an active Presto community. Unlike Spark or Hadoop, Presto can query several data stores concurrently; all it needs are "connectors" that provide interfaces for metadata, data locations, and data access. This avoids having to move data around just to query it, a requirement that is becoming critical for many IT administrators: plug a data source into Presto and it can be queried interactively in real time. Connectors currently exist for Hadoop/Hive (Apache and Cloudera distributions) and Cassandra, and more could be built for an enterprise's existing data stores. A small query sketch via JDBC follows this list.
- https://databricks.com/spark/getting-started-with-apache-spark
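To make the connector idea concrete, here is a minimal sketch of issuing an interactive Presto query over JDBC from Scala. The coordinator host, catalog, schema, table, and user name are placeholders, and it assumes the Presto JDBC driver is on the classpath.

import java.sql.DriverManager

object PrestoQuerySketch {
  def main(args: Array[String]): Unit = {
    // URL format: jdbc:presto://<coordinator-host>:<port>/<catalog>/<schema>
    // Host, catalog ("hive"), schema ("default"), and table are illustrative placeholders.
    val conn = DriverManager.getConnection(
      "jdbc:presto://coordinator.example.com:8080/hive/default", "analyst", null)
    val stmt = conn.createStatement()
    // Presto accepts ordinary ANSI SQL and delegates data access to the chosen connector.
    val rs = stmt.executeQuery(
      "SELECT page, count(*) AS hits FROM weblogs GROUP BY page ORDER BY hits DESC LIMIT 10")
    while (rs.next()) {
      println(s"${rs.getString("page")}\t${rs.getLong("hits")}")
    }
    rs.close(); stmt.close(); conn.close()
  }
}

The point of the sketch is that the client only speaks SQL over JDBC; which backing store (Hive, Cassandra, ...) answers the query is decided by the connector configured for the catalog.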
4.1 Short Examples of Spark Programs
Spark can also be used for compute-intensive tasks. The following code estimates π by "throwing darts" at a circle: we pick random points in the unit square (from (0, 0) to (1, 1)) and count how many fall inside the unit circle. That fraction should be approximately π / 4, which gives the estimate.
// Run in the Spark shell, where sc (the SparkContext) is predefined.
val NUM_SAMPLES = 100000  // number of random points to sample
val count = sc.parallelize(1 to NUM_SAMPLES).filter { _ =>
  val x = math.random
  val y = math.random
  x*x + y*y < 1           // keep points that fall inside the unit circle
}.count()
println(s"Pi is roughly ${4.0 * count / NUM_SAMPLES}")
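As a second short example, here is a sketch of the classic word count, the canonical MapReduce computation, written against Spark's RDD API. The input path is a placeholder; as above, sc is assumed to be the SparkContext from the Spark shell.

val lines = sc.textFile("hdfs:///data/sample.txt")  // placeholder input path
val counts = lines
  .flatMap(line => line.split("\\s+"))  // map: split each line into words
  .filter(_.nonEmpty)                   // drop empty tokens
  .map(word => (word, 1))               // map: emit (word, 1) pairs
  .reduceByKey(_ + _)                   // reduce: sum the counts for each word
counts.take(10).foreach(println)        // inspect a small sample of the result

This is the same computation that Hadoop MapReduce expresses with a Mapper and a Reducer class; in Spark the map and reduce steps are ordinary function arguments and intermediate data stays in memory.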
5 Reading beyond the Lectures
- https://databricks.com/spark/getting-started-with-apache-spark; see all of their courses at https://academy.databricks.com
- ./dean-ghemawat-mapreduce-osdi04.pdf The original paper introducing MapReduce, the programming model later implemented in open source as Hadoop.
- ./pmNotes-hadoop.html Hadoop and Alternatives