How Spark on AWS helps Big Data
WHAT SPARK IS ABOUT
Apache Spark is an open-source big data processing framework built around speed, ease of use, and sophisticated analytics. It was first developed in 2009 at UC Berkeley’s AMPLab, open sourced in 2010, and later became an Apache project.
Spark has many advantages compared to other Big Data and MapReduce technologies like Hadoop and Storm.
First, Spark gives us a comprehensive, unified framework to manage big data processing requirements across a wide variety of data sets that are diverse in nature.
Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and up to 10 times faster even when running on disk. Spark lets users quickly write applications in Java, Scala, or Python.
FEATURES OF SPARK
Spark takes MapReduce to the next level with less expensive shuffles in the data processing. With capabilities like in-memory data storage and near real-time processing, its performance can be several times faster than other big data technologies.
Spark also supports lazy evaluation of big data queries, which helps optimize the steps in data processing workflows. It also provides a higher-level API to improve developer productivity and a consistent architecture model for big data solutions.
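Spark evaluates queries lazily: transformations are only recorded, and nothing runs until a result is requested, which gives the engine room to optimize the steps. The idea can be sketched in plain Python. This is an analogy only, not Spark code: `LazyDataset` and its methods are hypothetical names made up for this illustration.

```python
from typing import Callable, Iterable, List, Optional


class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark class)."""

    def __init__(self, source: Iterable, steps: Optional[List[Callable]] = None):
        self._source = source
        self._steps = list(steps or [])

    def map(self, fn: Callable) -> "LazyDataset":
        # Transformation: record the step, do not run it yet.
        return LazyDataset(self._source, self._steps + [lambda rows: (fn(r) for r in rows)])

    def filter(self, pred: Callable) -> "LazyDataset":
        # Transformation: record the step, do not run it yet.
        return LazyDataset(self._source, self._steps + [lambda rows: (r for r in rows if pred(r))])

    def collect(self) -> list:
        # Action: chain every recorded step and pull the data through once.
        rows = iter(self._source)
        for step in self._steps:
            rows = step(rows)
        return list(rows)


calls = []


def square(x: int) -> int:
    calls.append(x)  # remember each time real work happens
    return x * x


pipeline = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 1)
calls_before = len(calls)    # still 0: building the pipeline ran nothing
result = pipeline.collect()  # the work happens here

print(calls_before)  # 0
print(result)        # [1, 9, 25]
```

Because the whole chain is known before anything executes, an engine is free to reorder or fuse steps. The toy version above simply streams each row through all steps in a single pass.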
Spark holds intermediate results in memory rather than writing them to disk, which is especially useful when the user needs to work on the same dataset multiple times. Spark operators perform external operations when data does not fit in memory.
Spark will try to store as much data as possible in memory and then spill the rest to disk. It can also store part of a data set in memory and the remaining data on disk. Developers have to look at their data and use cases to estimate the memory requirements.
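Why keeping intermediate results in memory pays off when the same dataset is used several times can be shown with a small plain-Python sketch. It is an analogy under stated assumptions, not Spark code: `ToyDataset` and `expensive` are hypothetical names, and spilling to disk is not modeled. In Spark itself, the corresponding calls are `cache()` and `persist()` on an RDD or DataFrame.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Hypothetical stand-in for a dataset whose rows are produced by `compute`."""

    def __init__(self, compute: Callable[[], List[int]]):
        self._compute = compute
        self._keep_in_memory = False
        self._rows: Optional[List[int]] = None

    def cache(self) -> "ToyDataset":
        # Mark the dataset so the first action keeps its rows in memory.
        self._keep_in_memory = True
        return self

    def _materialize(self) -> List[int]:
        if self._rows is not None:
            return self._rows   # reuse the in-memory copy
        rows = self._compute()  # otherwise recompute from scratch
        if self._keep_in_memory:
            self._rows = rows
        return rows

    def count(self) -> int:
        return len(self._materialize())

    def total(self) -> int:
        return sum(self._materialize())


runs = {"uncached": 0, "cached": 0}


def expensive(label: str) -> List[int]:
    runs[label] += 1  # count how often the work is redone
    return [n * n for n in range(1000)]


uncached = ToyDataset(lambda: expensive("uncached"))
cached = ToyDataset(lambda: expensive("cached")).cache()

for ds in (uncached, cached):
    ds.count()  # first action
    ds.total()  # second action on the same dataset

print(runs)  # {'uncached': 2, 'cached': 1}
```

The uncached dataset redoes the expensive computation for every action, while the cached one pays for it once. The price is the memory held by the cached rows, which is why memory requirements have to be estimated per use case.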
OTHER SPARK FEATURES INCLUDE
– It supports more than Map and Reduce functions.
– It optimizes arbitrary operator graphs.
– It also provides concise and consistent APIs in Scala, Java, and Python.
Besides the core Spark API, additional libraries are part of the Spark ecosystem and provide further capabilities for Big Data analytics.
These libraries include:
Spark Streaming can be used for processing real-time streaming data. It is based on the micro-batch style of computing and processing. It uses the DStream, which is a series of RDDs, to process the real-time data.
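The micro-batch style can be illustrated with a short plain-Python sketch: incoming events are cut into fixed time intervals, and each interval is processed as a small batch, much as a DStream is a series of RDDs. This is an analogy, not the Spark Streaming API. The function names and sample events are hypothetical, and the sketch works on an already collected list, whereas a real stream produces batches as data arrives.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, word)


def micro_batches(events: Iterable[Event], interval: float) -> List[List[str]]:
    """Group events into consecutive fixed-length time windows, oldest first."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for timestamp, word in events:
        buckets[int(timestamp // interval)].append(word)
    if not buckets:
        return []
    first, last = min(buckets), max(buckets)
    # Emit one batch per interval, including intervals with no events.
    return [buckets.get(i, []) for i in range(first, last + 1)]


def word_counts(batch: List[str]) -> Dict[str, int]:
    """Ordinary batch logic, applied to one micro-batch at a time."""
    counts: Dict[str, int] = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts


events = [(0.2, "spark"), (0.7, "aws"), (1.1, "spark"), (1.9, "spark"), (3.4, "aws")]
results = [word_counts(batch) for batch in micro_batches(events, interval=1.0)]

print(results)
# [{'spark': 1, 'aws': 1}, {'spark': 2}, {}, {'aws': 1}]
```

The batch interval is the main tuning knob in this style: a shorter interval lowers latency but means more, smaller batches to schedule.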
Spark SQL provides the capability to expose Spark datasets over the JDBC API and allows running SQL queries on Spark data using traditional BI and visualization tools. Spark SQL lets developers ETL their data from different sources, transform it, and expose it for ad-hoc querying.
MLlib is Spark’s scalable machine learning library, which consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
GraphX is the Spark API for graphs and graph-parallel computation. To support graph computation, GraphX exposes a set of fundamental operators such as subgraph, joinVertices, and aggregateMessages, as well as an optimized variant of the Pregel API. In addition, GraphX includes a collection of graph algorithms and builders to simplify graph analytics tasks.
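The Pregel model mentioned above is easiest to grasp with a small example. The sketch below runs single-source shortest paths in the Pregel style: in each superstep, every vertex that received messages merges them, updates its own value, and sends new messages along its outgoing edges, until no messages remain. It is plain Python written for illustration, not GraphX code. The function name and the sample graph are hypothetical, and it assumes every vertex appears as a key in the adjacency dictionary.

```python
import math
from typing import Dict, List, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # vertex -> [(neighbor, edge weight)]


def pregel_shortest_paths(graph: Graph, source: str, max_supersteps: int = 50) -> Dict[str, float]:
    """Pregel-style shortest paths; assumes every vertex is a key of `graph`."""
    dist = {vertex: math.inf for vertex in graph}
    inbox: Dict[str, List[float]] = {source: [0.0]}  # initial message to the source
    for _ in range(max_supersteps):
        if not inbox:
            break  # no messages in flight: every vertex has halted
        outbox: Dict[str, List[float]] = {}
        for vertex, messages in inbox.items():
            best = min(messages)     # merge the incoming messages
            if best < dist[vertex]:  # vertex program: keep the shorter distance
                dist[vertex] = best
                for neighbor, weight in graph[vertex]:
                    # Send a candidate distance along each outgoing edge.
                    outbox.setdefault(neighbor, []).append(best + weight)
        inbox = outbox               # delivered at the start of the next superstep
    return dist


graph: Graph = {
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("c", 2.0), ("d", 6.0)],
    "c": [("d", 1.0)],
    "d": [],
    "e": [("a", 1.0)],  # not reachable from "a"
}
dist = pregel_shortest_paths(graph, "a")

print(dist)  # {'a': 0.0, 'b': 1.0, 'c': 3.0, 'd': 4.0, 'e': inf}
```

Each vertex only looks at its own value and its incoming messages, which is what makes this style of computation straightforward to run in parallel across a cluster.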