Apache Spark Tutorial (2023)
What is Apache Spark?
Fast, expressive cluster computing system compatible with Apache Hadoop
Spark is a general-purpose platform for large-scale distributed processing. Inspired by Hadoop, it is designed for processing Big Data and performing complex parallel computations
- Improves efficiency through in-memory computing primitives and general computation graphs
- Rich APIs in Java, Scala, Python
- Interactive shell
We will look at the core concepts of Apache Spark and understand what lies beneath by learning to implement its core parts
Let's get started with a brief history of Apache Spark:
- Started in 2009 as a graduate research project at UC Berkeley's AMPLab
- Part of the BDAS (Berkeley Data Analytics Stack)
- June 2013: Spark accepted into the Apache Incubator as an open-source project
- Feb 2014: Spark becomes a top-level Apache project
- Has undergone rapid development, 20+ version releases in under 6 years!
- Almost 100 organizations are listed on the "Powered by Spark" wiki
- Contributors include Databricks, Cloudera, Hortonworks, IBM, and Intel
Apache Spark vs Hadoop
A quote from Reynold Xin, co-founder of Databricks, on the 2014 Daytona GraySort 100 TB benchmark:
"Using Spark on 206 EC2 machines, we sorted 100 TB of data on disk in 23 minutes. In comparison, the previous world record set by Hadoop MapReduce used 2100 machines and took 72 minutes. This means that Spark sorted the same data 3x faster using 10x fewer machines."
High performance:
- Reported speed-ups of 10x to 100x over Hadoop MapReduce, depending on the workload
- Takes advantage of in-memory data sets and minimizes disk I/O, supported by a variety of configurable memory and disk options
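The "3x faster using 10x fewer machines" claim in the benchmark quote above can be checked with a few lines of arithmetic. The sketch below uses only the figures from the quote. The machine-minutes comparison is an added, rough measure that assumes the machines in both clusters are comparable, which the quote does not state.

```python
# Figures taken directly from the benchmark quote.
spark_machines, spark_minutes = 206, 23
hadoop_machines, hadoop_minutes = 2100, 72

# Wall-clock speed-up and cluster-size ratio.
speedup = hadoop_minutes / spark_minutes
machine_ratio = hadoop_machines / spark_machines

# Machine-minutes: a crude measure of total cluster time consumed.
# Assumption: machines in both clusters are roughly comparable.
spark_machine_minutes = spark_machines * spark_minutes
hadoop_machine_minutes = hadoop_machines * hadoop_minutes
resource_ratio = hadoop_machine_minutes / spark_machine_minutes

print(f"speed-up: {speedup:.2f}x")
print(f"machines: {machine_ratio:.2f}x fewer")
print(f"machine-minutes: {resource_ratio:.1f}x fewer")
```

The quote's "3x" and "10x" are these ratios rounded down. Combined, the Spark run used roughly a thirtieth of the machine-minutes.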
Spark works with the Hadoop ecosystem:
- It interfaces with Hadoop-compatible data sources and sinks
- These include the Hadoop Distributed File System (HDFS), HBase, Cassandra, etc.
Spark can run alongside Hadoop on the same cluster and use the same YARN or Mesos management frameworks
Spark provides an API that facilitates distributed processing. It offers operations like Map and Reduce, as provided by Hadoop, but Spark provides many more, with much greater flexibility. Its libraries include:
- Spark SQL: High-speed, low-latency SQL query engine
- Spark Streaming: Reliably processes data streams in real time, with built-in support for commonly used data sources such as network- and file-based streams, Flume, Kafka, Twitter, etc.
- Spark MLlib: APIs for classification, clustering, linear regression, etc.
- Spark GraphX: APIs for graph analytics on data networks, relationships, social media, etc.
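To make the Map and Reduce operations mentioned above concrete, here is a minimal word-count sketch in plain Python. It is a toy model of the idea and does not use Spark's API. The input data and the names `partitions`, `map_partition`, and `merge_counts` are made up for illustration. In PySpark the same job would typically be written with RDD methods such as `flatMap`, `map`, and `reduceByKey`.

```python
from collections import Counter
from functools import reduce

# Hypothetical input, already split into "partitions" (one list per worker).
partitions = [
    ["spark is fast", "spark is expressive"],
    ["hadoop is reliable"],
]

def map_partition(lines):
    """Map phase: count the words within a single partition."""
    return Counter(word for line in lines for word in line.split())

def merge_counts(left, right):
    """Reduce phase: combine two partial counts into one."""
    return left + right

# Map every partition independently, then reduce the partial results.
partial_counts = map(map_partition, partitions)
word_counts = reduce(merge_counts, partial_counts, Counter())

print(word_counts.most_common(2))
```

Because each partition is mapped independently and the merge step is associative, the map phase can be spread over many machines and the partial results combined in any grouping.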
Benefits of Using Spark
Spark retains intermediate data in memory wherever possible:
- Speeds up processing
- Particularly important in data analysis and machine learning applications, which are iterative and interactive in nature
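The benefit of keeping data in memory for iterative work can be sketched with a toy model. The `Dataset` class below is hypothetical and only counts how often the underlying data is "read from disk". It is not Spark's implementation. One difference worth noting: this toy caches eagerly, whereas Spark's `cache()` only marks data for caching and fills the cache when an action first runs.

```python
class Dataset:
    """Toy stand-in for a dataset that is expensive to load (hypothetical)."""

    def __init__(self, values):
        self._values = values
        self._cached = None
        self.loads = 0  # how many times the "disk" was read

    def _load(self):
        self.loads += 1
        return list(self._values)  # pretend this is a slow disk read

    def cache(self):
        """Read the data once and keep it in memory for later passes."""
        if self._cached is None:
            self._cached = self._load()
        return self

    def collect(self):
        """Return the data, from memory if cached, otherwise by reloading."""
        return self._cached if self._cached is not None else self._load()


def iterate_mean(dataset, passes):
    """An iterative job: every pass scans the whole dataset again."""
    estimate = 0.0
    for _ in range(passes):
        data = dataset.collect()
        estimate = (estimate + sum(data) / len(data)) / 2
    return estimate


uncached = Dataset([1, 2, 3, 4])
iterate_mean(uncached, passes=5)

cached = Dataset([1, 2, 3, 4]).cache()
iterate_mean(cached, passes=5)

print(uncached.loads, cached.loads)
```

The uncached dataset is reloaded on every pass, while the cached one is loaded once. That saving grows with the number of iterations, which is why in-memory data matters most for iterative and interactive workloads.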
Provides a flexible and intuitive programming model:
- Collections of elements partitioned across cluster nodes
- High-level operations that run in parallel on those collections
- Based on the RDD, Dataset, and DataFrame abstractions
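The idea of a partitioned collection with operations that run in parallel can be illustrated on a single machine. In this sketch a thread pool stands in for the cluster's worker nodes. The helper names `partition` and `square_all` are assumptions made for the example, and nothing here is Spark's own API.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(items, num_partitions):
    """Split a list into num_partitions roughly equal slices."""
    return [items[i::num_partitions] for i in range(num_partitions)]

def square_all(chunk):
    """The per-partition work: a simple element-wise transformation."""
    return [x * x for x in chunk]

numbers = list(range(1, 11))
chunks = partition(numbers, num_partitions=3)

# Each partition is processed independently; the thread pool plays
# the role of the worker nodes in a real cluster.
with ThreadPoolExecutor(max_workers=3) as pool:
    squared_chunks = list(pool.map(square_all, chunks))

# Combine the per-partition results into a single answer.
total = sum(sum(chunk) for chunk in squared_chunks)
print(total)
```

The caller writes one high-level operation (`square_all`) and never addresses individual partitions. Abstractions such as the RDD take on that bookkeeping, along with distributing the partitions across machines.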
Spark's API has support for multiple languages:
- Scala, Java, Python and R
- Spark is primarily written in Scala
- Scala and Java code can be mixed easily, since both run on the JVM
Develop complex multi-phase applications in a single environment:
- Applications can mix the core API with Spark SQL, Streaming, MLlib, etc.
- This reduces the number of tools that data scientists have to learn