What is Apache Spark? Why is there so much buzz about it? If you are in the Big Data analytics business, should you really care about Spark? We hope this article will help answer some of the questions you may have had about it recently.

Apache Spark is a powerful open-source processing engine for Hadoop data, built around speed, ease of use, and sophisticated analytics. It is primarily a parallel data processing framework that can work alongside Apache Hadoop, making it much easier to develop fast Big Data applications.

Let's walk through the top 10 features of Apache Spark that are getting attention in the Big Data world.

1. Lightning-fast processing:

When it comes to Big Data processing, speed always matters. Users want to process huge amounts of data as fast as possible. Spark enables applications in Hadoop clusters to run up to 100x faster in memory and up to 10x faster on disk. It achieves this by keeping intermediate processing data in memory. Spark uses the concept of a Resilient Distributed Dataset (RDD), which allows it to transparently store data in memory and persist it to disk only when needed.
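The benefit of keeping intermediate data in memory can be illustrated without Spark itself. The following is a minimal plain-Python sketch, not Spark's API: `CachedDataset` and `expensive_parse` are hypothetical names, and the class only imitates the idea of computing an intermediate result once and reusing it in later steps.

```python
from typing import Callable, Iterable, List, Optional


class CachedDataset:
    """Hypothetical stand-in for a dataset whose computed result is kept in memory."""

    def __init__(self, source: Iterable[str], transform: Callable[[str], int]):
        self._source = list(source)
        self._transform = transform
        self._cache: Optional[List[int]] = None
        self.compute_count = 0  # how many times the transform pass actually ran

    def collect(self) -> List[int]:
        # Recompute only if no in-memory copy exists yet.
        if self._cache is None:
            self.compute_count += 1
            self._cache = [self._transform(x) for x in self._source]
        return self._cache


def expensive_parse(line: str) -> int:
    # Placeholder for a costly step such as parsing or joining.
    return len(line.split())


dataset = CachedDataset(["a b c", "d e", "f"], expensive_parse)
total_words = sum(dataset.collect())   # first use computes and caches
longest_line = max(dataset.collect())  # second use reads the in-memory copy
```

Here the transform pass runs only once even though the result is used twice, which is the effect in-memory intermediate data is meant to provide.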

2. Ease of use:

Spark lets users quickly write applications in Java, Scala, and Python, so developers can build and run applications in languages they already know. It also comes with a built-in set of over 80 high-level operators.

3. Supports sophisticated analytics:

In addition to “map” and “reduce” operations, Spark also supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms. Also, a user can combine all these capabilities seamlessly in a single workflow.
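To make the "map" and "reduce" operations concrete, here is a small word-count sketch in plain Python. It uses only the standard library and is not Spark code; the input lines and the `merge` helper are made up for illustration.

```python
from functools import reduce
from typing import Dict, List, Tuple

lines: List[str] = ["spark is fast", "spark is general"]

# "map" step: emit a (word, 1) pair for every word in every line.
pairs: List[Tuple[str, int]] = [(word, 1) for line in lines for word in line.split()]


def merge(counts: Dict[str, int], pair: Tuple[str, int]) -> Dict[str, int]:
    # "reduce" step: fold each pair into a running per-word total.
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts


word_counts = reduce(merge, pairs, {})
```

In a distributed engine the same two steps run in parallel across partitions of the data; this sketch only shows the shape of the computation on one machine.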

4. Real-time stream processing:

Spark can also handle real-time streaming. MapReduce mainly processes data that has already been stored, whereas Spark can manipulate data in real time using Spark Streaming.
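Spark Streaming is commonly described as treating a live stream as a sequence of small batches. The sketch below imitates that idea in plain Python and is not the Spark Streaming API. The function names are hypothetical, and batches are cut by event count for simplicity, whereas a real streaming engine would typically cut them by time interval.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group an incoming event stream into small fixed-size batches."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_counts(events: Iterable[str], batch_size: int) -> Iterator[Counter]:
    """Yield a snapshot of cumulative counts after each batch is processed."""
    totals: Counter = Counter()
    for batch in micro_batches(events, batch_size):
        totals.update(batch)
        yield Counter(totals)  # copy so earlier snapshots stay unchanged


stream = ["click", "view", "click", "view", "click"]
snapshots = list(running_counts(stream, batch_size=2))
```

Each snapshot reflects everything seen so far, so results are available while data is still arriving rather than only after the whole dataset has been stored.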

5. Ability to integrate with Hadoop and existing Hadoop data:

Spark can run on its own. It can also run on Hadoop 2's YARN cluster manager and read any existing Hadoop data, which is one of its biggest advantages. This makes Spark a good target for migrating existing Hadoop applications whose use cases suit it.

6. Runs everywhere:

Spark runs on Hadoop, on Mesos, in standalone mode, and in the cloud. It can access a variety of data sources, including HDFS, Cassandra, HBase, and S3.

7. Spark supports lazy evaluation of Big Data queries:

Spark evaluates Big Data queries lazily: it does not run a step until a result is actually required, which lets it optimize the sequence of steps in a data processing workflow. It also provides a higher-level API to improve developer productivity and a consistent architectural model for Big Data solutions.
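Lazy evaluation can be demonstrated with Python generators. This is a minimal sketch of the general idea, not Spark's implementation: `lazy_map`, `lazy_filter`, and the `log` list are hypothetical names used only to show when work actually happens.

```python
from typing import Callable, Iterable, Iterator, List

log: List[str] = []  # records when work actually happens


def lazy_map(func: Callable[[int], int], data: Iterable[int]) -> Iterator[int]:
    # Generator: the body does not run until a consumer asks for values.
    for item in data:
        log.append(f"map {item}")
        yield func(item)


def lazy_filter(pred: Callable[[int], bool], data: Iterable[int]) -> Iterator[int]:
    for item in data:
        if pred(item):
            yield item


# Building the pipeline is like declaring transformations: no work yet.
pipeline = lazy_filter(lambda x: x > 4, lazy_map(lambda x: x * 2, range(1, 6)))
work_before_action = len(log)

# Consuming the pipeline plays the role of an action and triggers the work.
result = list(pipeline)
```

Because nothing runs until the result is requested, an engine that sees the whole pipeline up front has room to plan and combine steps. Plain generators do not perform that kind of optimization; they only show the deferred execution.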

8. Optimizes arbitrary operator graphs:

Spark offers a general execution model that can optimize arbitrary operator graphs and supports in-memory computing, which lets it process data faster than disk-based engines such as Hadoop MapReduce.

9. Interactive shell:

Spark offers an interactive shell for Scala and Python. This feature is not yet available in Java.

10. Active and expanding community:

Spark is built by a wide set of developers from over 50 companies. The project started in 2009, and more than 450 developers have contributed to it so far. It has an active mailing list and uses JIRA for issue tracking.