Apache Spark – Data Processing Framework

Apache Spark is a unified analytics engine for large-scale data processing.

It is an open-source framework for performing big data analytics on a distributed computing cluster.

Matei Zaharia at UC Berkeley’s AMPLab started Spark in 2009.

Why Apache Spark?

Data is growing at a tremendous rate and in every form. This creates a need for a framework that not only stores the data but also analyzes it according to business needs. Hadoop is one such framework, but it has several limitations:

  • Hadoop relies on batch processing of data, which leads to high latency.
  • It is tied to the MapReduce programming model.
  • It offers limited programming language API options.
  • It is not a good fit for iterative algorithms such as machine learning algorithms.
  • Pipelining of tasks is not easy.

Apache Spark was developed as a solution to these limitations.

Features of Spark

Spark has several advantages compared to other big data and MapReduce technologies such as Hadoop and Storm.

  1. Spark offers low latency because it reduces disk input and output operations.
  2. Unlike Hadoop, Spark keeps intermediate results in memory rather than writing every intermediate output to disk, which results in shorter execution time.
  3. Spark does not execute tasks immediately; instead, it maintains the chain of operations as job metadata called a DAG (directed acyclic graph). The DAG of transformations is executed only when an action is called on it. This process is called lazy evaluation (see the sketch after this list).
  4. Spark currently supports the following programming languages for developing applications:
    • Scala
    • Java
    • Python
    • R
  5. Spark supports SQL-like queries, streaming data, machine learning and graph processing.
  6. Spark can be integrated with various data sources such as SQL and NoSQL databases, HDFS, the local file system, etc.
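
To illustrate the lazy evaluation described in point 3, here is a minimal Scala sketch (assuming a SparkContext named sc is already available, as in the Spark shell); the input path and word-count logic are hypothetical examples, not taken from this article.

```scala
// Transformations: Spark only records these operations as a DAG; no work runs yet.
val lines  = sc.textFile("data/input.txt")              // hypothetical input path
val words  = lines.flatMap(_.split("\\s+"))             // transformation
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)  // transformations

// Action: only now does Spark schedule the DAG and execute tasks on the executors.
counts.collect().foreach(println)
```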

There are many more benefits of using Spark in comparison to Hadoop.

Spark Architecture

A job is submitted through the Spark Context, and the request goes to the cluster manager. Depending on the job's requirements, the cluster manager works with the worker nodes to create executors, which execute multiple tasks and allocate memory for processing.

Apache Spark has the following components, each of which plays a role in carrying out these operations.

Driver Program (Spark Context) 

It is the central point and the entry point of the Spark shell (Scala, Python and R). The driver program runs the main() function of the application and is where the Spark Context is created. The Spark driver contains several components, including the DAGScheduler, TaskScheduler, SchedulerBackend and BlockManager, which are responsible for translating user code into actual Spark jobs executed on the cluster.
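
As a rough sketch of what a driver program looks like, the snippet below shows a main() method that creates the Spark Context (through a SparkSession, as in recent Spark versions) and stops it when the application finishes; the application name and master URL are placeholder values.

```scala
import org.apache.spark.sql.SparkSession

object MyDriverApp {
  def main(args: Array[String]): Unit = {
    // The driver runs main() and is where the Spark Context is created.
    val spark = SparkSession.builder()
      .appName("MyDriverApp")     // placeholder application name
      .master("local[*]")         // placeholder master URL
      .getOrCreate()
    val sc = spark.sparkContext   // the SparkContext held by the driver

    println(s"Running as application ${sc.applicationId}")

    // The driver's scheduling components translate the user code submitted
    // through this context into jobs, stages and tasks that run on executors.
    spark.stop()
  }
}
```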

Cluster Manager

It is responsible for allocating resources to the Spark job. There are three different types of cluster managers:

  • Hadoop YARN
  • Apache Mesos
  • Standalone

The choice of cluster manager for a Spark application depends on the goals of the application, because each cluster manager provides a different set of scheduling capabilities.
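
As a hedged sketch, the cluster manager is typically selected through the master URL when the application is configured (or passed to spark-submit via --master); the host names and ports below are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// The master URL determines which cluster manager the application uses.
// Placeholder hosts and ports; uncomment the line matching your deployment.
val spark = SparkSession.builder()
  .appName("ClusterManagerDemo")
  // .master("yarn")                       // Hadoop YARN
  // .master("mesos://mesos-master:5050")  // Apache Mesos
  // .master("spark://spark-master:7077")  // Spark standalone
  .master("local[*]")                      // local mode, handy for testing
  .getOrCreate()
```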

Worker Nodes

They act as slaves to the master daemon and host the executor processes that do the actual work.

 

Executor

An executor is a distributed agent responsible for executing tasks. Every Spark application has its own executor processes. Executors usually run for the entire lifetime of a Spark application; this is known as “Static Allocation of Executors”.

  • The executor performs all the data processing.
  • It reads data from and writes data to external sources.
  • It stores computation results in memory, in cache or on hard disk drives.
  • It interacts with the storage systems.
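
For a rough idea of how executors are configured, the sketch below sets the number of executors, their memory and their cores when building the session; the property values are placeholders, and dynamic allocation is disabled to match the static allocation described above.

```scala
import org.apache.spark.sql.SparkSession

// Placeholder values: tune executor count, memory and cores for your cluster.
val spark = SparkSession.builder()
  .appName("ExecutorConfigDemo")
  .master("yarn")                                     // assumes a YARN cluster
  .config("spark.executor.instances", "4")            // number of executors
  .config("spark.executor.memory", "2g")              // memory per executor
  .config("spark.executor.cores", "2")                // cores per executor
  .config("spark.dynamicAllocation.enabled", "false") // static allocation of executors
  .getOrCreate()
```
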
This was an overview of Spark and why it came into existence.

 
