Apache Spark Tutorial (2024)

What is Apache Spark?

Fast, expressive cluster computing system compatible with Apache Hadoop

Spark is a general-purpose platform for large-scale distributed processing of Big Data and for complex parallel computations. Its design was inspired by Hadoop.

Improves efficiency through: in-memory computing primitives and general computation graphs
Improves usability through: rich APIs in Java, Scala, Python, and R, plus an interactive shell

We will look at the core concepts of Apache Spark and understand what lies beneath by learning to implement its core parts.

Let's get started with Apache Spark:

History

Started in 2009 as a graduate research project at UC Berkeley's AMPLab
Became part of the BDAS (Berkeley Data Analytics Stack)
June 2013: Spark accepted into the Apache Incubator (the code had been open source since 2010)
Feb 2014: Spark becomes a top-level Apache project
Has undergone rapid development, 20+ version releases in under 6 years!
Almost 100 organizations are listed on the "Powered by Spark" wiki
Contributors include Databricks, Cloudera, Hortonworks, IBM, and Intel

Apache Spark vs Hadoop

A quote from Reynold Xin, co-founder of Databricks, on the 2014 Daytona GraySort 100 TB benchmark:

"Using Spark on 206 EC2 machines, we sorted 100 TB of data on disk in 23 minutes. In comparison, the previous world record set by Hadoop MapReduce used 2100 machines and took 72 minutes. This means that Spark sorted the same data 3x faster using 10x fewer machines."

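The ratios in the quote can be checked directly from the figures it gives. The short Python sketch below uses only the numbers quoted above. The variable names are illustrative, and the machine-minutes figure is an assumed rough proxy for resources consumed, not something the benchmark itself reports.

```python
# Figures taken from the quote above.
spark_machines, spark_minutes = 206, 23
hadoop_machines, hadoop_minutes = 2100, 72

# How much faster Spark finished, and how many fewer machines it used.
time_ratio = hadoop_minutes / spark_minutes
machine_ratio = hadoop_machines / spark_machines

# Machine-minutes: an assumed, rough proxy for total resources used.
# It ignores any hardware differences between the two clusters.
spark_machine_minutes = spark_machines * spark_minutes
hadoop_machine_minutes = hadoop_machines * hadoop_minutes
resource_ratio = hadoop_machine_minutes / spark_machine_minutes

print(f"time: {time_ratio:.1f}x faster")
print(f"machines: {machine_ratio:.1f}x fewer")
print(f"machine-minutes: {resource_ratio:.1f}x fewer")
```

The first two ratios round to about 3x and 10x, matching the quote. Combined, the Spark run used roughly 32 times fewer machine-minutes.
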
High performance:

Reported speed-ups of between 10x and 100x compared to Hadoop MapReduce, depending on the workload
Takes advantage of in-memory data sets and minimizes disk I/O, supported by a variety of configurable memory and disk storage options

Compatible with Hadoop:

Interfaces with Hadoop-compatible data sources and sinks
The Hadoop Distributed File System (HDFS), HBase, Cassandra, etc.

Spark can run alongside Hadoop on the same cluster and use the same YARN or Mesos resource managers

Spark Components/Stack

Spark provides an API that facilitates distributed processing. It offers operations like Map and Reduce, as provided by Hadoop, but also many more, with much greater flexibility.

Spark SQL: High-speed, low-latency SQL query engine
Spark Streaming: Reliably processes data streams in near real time, with built-in support for commonly used data sources such as network- and file-based streams, Flume, Kafka, and Twitter
Spark MLlib: Machine learning APIs for classification, clustering, linear regression, etc.
Spark GraphX: APIs for graph analytics over data networks, relationships, social media, etc.

Benefits of Using Spark

Spark retains intermediate data in memory wherever possible:

Speeds up processing
Particularly important in data analysis and machine learning applications, which are iterative and interactive in nature
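
To see why keeping intermediate data in memory matters for iterative workloads, here is a minimal pure-Python sketch. It is not Spark's API: `load_and_parse` is a hypothetical stand-in for an expensive step such as reading and parsing input from disk, and the counter only records how often that step runs. In Spark itself, the comparable idea is persisting a data set with `cache()` or `persist()` so that later passes reuse it.

```python
# Pure-Python illustration only; none of these names come from Spark.
load_calls = 0


def load_and_parse():
    """Hypothetical stand-in for an expensive read-and-parse step."""
    global load_calls
    load_calls += 1
    return [float(x) for x in range(1, 101)]


def iterate_without_cache(iterations):
    """Rebuild the intermediate data set on every pass."""
    total = 0.0
    for i in range(iterations):
        data = load_and_parse()  # expensive work repeated each iteration
        total += sum(x * (i + 1) for x in data)
    return total


def iterate_with_cache(iterations):
    """Build the intermediate data set once and keep it in memory."""
    data = load_and_parse()  # expensive work done a single time
    total = 0.0
    for i in range(iterations):
        total += sum(x * (i + 1) for x in data)
    return total


load_calls = 0
result_without_cache = iterate_without_cache(10)
calls_without_cache = load_calls

load_calls = 0
result_with_cache = iterate_with_cache(10)
calls_with_cache = load_calls

print(calls_without_cache, calls_with_cache)
```

Both versions compute the same result, but the uncached version repeats the expensive step once per iteration while the cached version runs it once. The gap widens with the number of iterations, which is typical of machine learning training loops and interactive analysis.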

Provides a flexible and intuitive programming model:

Collections of elements partitioned across cluster nodes
Provide high-level operations that run in parallel
Based on the RDD, Dataset, and DataFrame abstractions
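
The idea of a partitioned collection with high-level parallel operations can be sketched in plain Python. This is an illustration under stated assumptions, not Spark's implementation or API: `partition` and `parallel_map_reduce` are hypothetical helper names, and a local thread pool stands in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(elements, num_partitions):
    """Split a list into num_partitions chunks (round-robin)."""
    return [elements[i::num_partitions] for i in range(num_partitions)]


def parallel_map_reduce(elements, map_func, reduce_func, num_partitions=4):
    """Map every element, then combine the results with reduce_func.

    reduce_func must be associative and commutative, because partitions
    are reduced independently and elements are not kept in input order.
    Assumes elements is non-empty.
    """
    # Drop empty partitions so reduce() always has at least one value.
    chunks = [c for c in partition(elements, num_partitions) if c]

    def process(chunk):
        # Each worker maps and locally reduces its own partition.
        return reduce(reduce_func, map(map_func, chunk))

    # Threads stand in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(process, chunks))

    # Combine the per-partition results into the final answer.
    return reduce(reduce_func, partials)


numbers = list(range(1, 11))
sum_of_squares = parallel_map_reduce(numbers, lambda x: x * x, lambda a, b: a + b)
print(sum_of_squares)  # 385
```

The caller only supplies the per-element function and the combining function. How the data is split and where each piece is processed stays hidden behind the operation, which is the property the programming model above relies on.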

Spark's API has support for multiple languages:

Scala, Java, Python and R
Spark is primarily written in Scala
Scala and Java code can be mixed in the same application, since both run on the JVM

Develop complex multi-phase applications in a single environment