Category Archives: big data

HOW SPARK ON AWS HELPS BIG DATA

WHAT SPARK IS ABOUT

Apache Spark is an open-source big data processing framework built around speed, ease of use, and sophisticated analytics. It was first developed in 2009 at UC Berkeley’s AMPLab and open-sourced in 2010 as an Apache project.

Spark has many advantages over other big data and MapReduce technologies such as Hadoop and Storm.

Firstly, Spark gives us a comprehensive, unified framework to manage big data processing requirements across a wide variety of data sets that are diverse in nature.

Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and up to 10 times faster even when running on disk. Spark lets the user quickly write applications in Java, Scala, or Python.

FEATURES OF SPARK

Spark takes MapReduce to the next level with less costly shuffles in data processing. With capabilities like in-memory data storage and real-time processing, its performance can be several times faster than that of other big data technologies.

Spark also supports lazy evaluation of big data queries, which helps optimize the steps in data processing workflows. It also provides a higher-level API to improve developer productivity and a consistent architectural model for big data solutions.

Spark holds intermediate results in memory rather than writing them to disk, which is very useful when the user needs to work on the same dataset multiple times. Spark’s operators perform external operations when data does not fit in memory.

Spark will try to store as much data as possible in memory and then spill the rest to disk. It can also store part of a data set in memory and the remaining data on disk. The developer has to look at their data and use cases to estimate the memory requirements.

OTHER SPARK FEATURES INCLUDE

– It supports more than Map and Reduce functions.
– It optimizes arbitrary operator graphs.
– It also provides concise and consistent APIs in Scala, Java, and Python.
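As a flavour of how such operator chains look in practice, here is a minimal sketch in plain Python (no Spark required) that composes map, filter, and reduce over a small dataset. The function names deliberately mirror Spark’s RDD-style API, and the data is invented for illustration.

```python
# Plain-Python sketch of Spark-style operator chaining: map each word
# to a (word, 1) pair, filter out unwanted records, then reduce the
# pairs into per-word counts. No Spark installation is required.
from functools import reduce

data = ["spark", "hadoop", "spark", "storm", "spark"]

pairs = map(lambda w: (w, 1), data)                # map step
kept = filter(lambda kv: kv[0] != "storm", pairs)  # filter step

def merge(counts, kv):
    word, n = kv
    counts[word] = counts.get(word, 0) + n
    return counts

counts = reduce(merge, kept, {})                   # reduce step
print(counts)  # {'spark': 3, 'hadoop': 1}
```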

SPARK’S ECOSYSTEM

Other than the Spark API, there are additional libraries that are part of the Spark ecosystem and provide additional capabilities in Big Data analytics.

These libraries include:
SPARK STREAMING

Spark Streaming can be used for processing real-time streaming data. It is completely based on the micro-batch style of computing and processing, and it uses the DStream, which is a series of RDDs, to process real-time data.
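To make the micro-batch idea concrete, here is a loose plain-Python sketch (no Spark required; all names are invented): an unbounded stream is sliced into small batches, and the same computation is applied to each slice, much as a DStream applies a job to each RDD in its sequence.

```python
def micro_batches(stream, batch_size):
    """Slice an (in principle unbounded) stream into small batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:           # flush any final partial batch
        yield batch

# Apply the same computation (here, a sum) to every micro-batch.
stream = iter(range(7))
results = [sum(batch) for batch in micro_batches(stream, 3)]
print(results)  # [3, 12, 6]
```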

SPARK SQL

Spark SQL provides the capability to expose Spark datasets over the JDBC API and allows running SQL queries on Spark data using traditional BI and visualization tools. Spark SQL also lets developers ETL their data from different sources, transform it, and expose it for ad-hoc querying.
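The core idea, running SQL over datasets that did not start life in a database, can be sketched with Python’s stdlib sqlite3 module (the table and data are invented; a real deployment would use Spark SQL over JDBC instead):

```python
import sqlite3

# Load some "dataset" rows into an in-memory SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 3), ("bob", 5), ("alice", 2)])

# Ad-hoc aggregation, the kind of query a BI tool would issue.
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('alice', 5), ('bob', 5)]
```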

SPARK MLLIB

MLlib is Spark’s scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
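As a taste of what “regression” means in this list, here is a tiny ordinary-least-squares line fit in plain Python. MLlib’s own regression API is distributed and far richer, so treat this only as an illustration of the underlying idea.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
print(slope, intercept)  # 2.0 0.0
```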

SPARK GRAPHX

GraphX is the Spark API for graphs and graph-parallel computation. To support graph computation, GraphX exposes a set of fundamental operators such as subgraph, joinVertices, and aggregateMessages, as well as an optimized variant of the Pregel API. In addition, GraphX includes a collection of graph algorithms and builders to simplify graph analytics tasks.
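The Pregel style of computation, vertex state updated by messages flowing along edges over repeated supersteps, can be loosely sketched in plain Python (the toy graph and names are invented; GraphX’s real Pregel API is distributed and message-based):

```python
# Single-source shortest hop counts over a toy directed graph,
# iterated to a fixed point the way Pregel repeats supersteps
# until no vertex changes.
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
INF = float("inf")
dist = {v: (0 if v == "a" else INF) for v in edges}

changed = True
while changed:                       # roughly: one pass per superstep
    changed = False
    for src, targets in edges.items():
        for dst in targets:          # "message": dist[src] + 1
            if dist[src] + 1 < dist[dst]:
                dist[dst] = dist[src] + 1
                changed = True

print(dist)  # {'a': 0, 'b': 1, 'c': 1, 'd': 2}
```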

BIG DATA: APACHE SPARK

The open-source technology has been around and popular for a few years, but 2016 was the year Spark went from a prominent technology to a bona fide superstar.

Apache Spark has become so popular because it provides data engineers and data scientists with a powerful, consolidated engine that is both fast (100x faster than Apache Hadoop for large-scale data processing) and easy to use.

In this article, we will discuss some of the key points one encounters when working with Apache Spark.

WHAT SPARK IS ALL ABOUT:

Apache Spark is an open-source big data processing framework built around speed, ease of use, and sophisticated analytics. It is built on top of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computations, including interactive queries and stream processing.

Spark has several advantages compared to other big data and MapReduce technologies like Hadoop and Storm.

Firstly, Spark gives us a comprehensive, unified framework to manage big data processing requirements across a wide variety of data sets that are distinct in nature.

Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster when running on disk.

Spark lets the user quickly write applications in Java, Scala, and Python. It comes with a built-in set of 80 high-level operators, and a user can query the data interactively within the shell. In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning, and graph data processing. A user can use these capabilities standalone or combine them to run in a single data pipeline.

FEATURES OF SPARK

Spark takes MapReduce to the next level with less costly shuffles in data processing. With capabilities like in-memory data storage and real-time processing, performance can be several times faster than with other big data technologies.

Spark holds intermediate results in memory rather than writing them to disk, which is very useful when you need to work on the same dataset multiple times. It can store part of a data set in memory and the remaining data on disk. A user has to look at their data and use cases to assess the memory requirements. With this in-memory data storage, Spark comes with some performance advantages.
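The payoff of reusing an in-memory intermediate result can be sketched in plain Python with a hand-rolled cache (a very loose stand-in for Spark’s persistence; all names here are invented):

```python
calls = {"n": 0}   # counts how often we actually recompute

def expensive_transform(data):
    calls["n"] += 1
    return [x * x for x in data]

cache = {}

def cached_transform(key, data):
    # Compute once, then serve every later request from memory.
    if key not in cache:
        cache[key] = expensive_transform(data)
    return cache[key]

data = [1, 2, 3]
for _ in range(5):                 # five "actions" over one dataset
    result = cached_transform("squares", data)

print(calls["n"], result)  # 1 [1, 4, 9]
```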

Other Spark features include:

  • Supports more than Map and Reduce functions.
  • Provides concise and consistent APIs in Scala, Java, and Python.
  • Offers interactive shell for Scala and Python.

Spark is written in the Scala programming language and runs in a JVM (Java Virtual Machine) environment. It currently supports the following languages for developing Spark applications:

  • Scala
  • Java
  • Python
  • Clojure

SPARK ECOSYSTEM

Other than the Spark core API, there are additional libraries that are part of the Spark ecosystem and provide added capabilities in the Big Data analytics and Machine Learning areas. These libraries include the following.

  • Spark Streaming: It can be used for processing real-time streaming data. This processing is based on the micro-batch style of computing.
  • Spark SQL: It provides the capability to expose Spark datasets over the JDBC API and allows running SQL queries on Spark data using traditional BI and visualization tools.
  • Spark MLlib: MLlib is Spark’s scalable machine learning library, consisting of common learning algorithms and utilities such as regression, clustering, collaborative filtering, and underlying optimization primitives.

FIVE THINGS TO WATCH OUT FOR IN BIG DATA

Big Data: There are a lot of definitions tossed around, but what they have in common is that big data deals with the huge volumes of unstructured data created by business processes.

Big data is like a puzzle. Put it together in a way that works for your organization, and you can help it thrive. In this article, we walk through five things you need to watch in big data.

UNSTRUCTURED DATA GROWTH:

Unstructured data volumes, composed of things like human information from social media, video, audio, and images; machine sensor data; Internet of Things (IoT) data; and business data in various work-document formats, will continue to grow at a breathtaking rate. According to research by Gartner, the IoT, excluding PCs, tablets, and smartphones, will grow to 26 billion connected devices by 2020. Organizations will increasingly seek solutions that can tie structured and unstructured data sources together and draw connected insights from social media and video analytics. This will give greater context to the structured data that most organizations have come to rely on.

YOU WILL NEED A NEW EXPERTISE FOR BIG DATA:

Are you setting up a big data analysis system? Then your biggest hurdle will be finding the right talent: people who know how to work the tools that analyse the data. Big data relies on solid data modelling. Organizations will have to focus on data science and hire statistical modellers, text mining professionals, and people who specialize in sentiment analysis. This may not be the same skill set that today’s analysts, versed in business intelligence tools, readily have.

Another skill you need to have on hand is the ability to wrangle the large amounts of data that must be stored and parsed. You may need to hire a few supercomputer administrators from local universities or research labs.

THE CLOUD WILL PLAY AN IMPORTANT ROLE:

Most of the data sources for big data are outside the firewall and in the cloud. This includes external social media such as Facebook and LinkedIn, as well as internal social media sources such as Chatter. The speed of analysis on larger data sets is a key consideration, and big data analytics requires unique infrastructure, such as Hadoop or SAP’s HANA, that organizations are less likely to have in-house.

There is no requirement for you to invest in infrastructure: it can all be delivered as a service from the cloud. As a result, in many cases it is better to go with a cloud-based big data model, so that you can enjoy the benefits without purchasing unique infrastructure and without needing to hire specialists to manage it.

BIG DATA WILL CHANGE IT OPERATIONS:

Companies that “get” big data are going to apply big data principles and practices to their internal IT operations first and foremost. Big data analytics plays a vital role in identifying IT security threats, which are continually growing and evolving. It also delivers connected intelligence across IT operations domains, generating insights that drive innovation and critical business advantage. This process will rejuvenate the traditional service desk, and the transformation to a big data service desk will bring the business capability to deliver service anywhere.

MORE FOCUS ON SOLUTIONS, NOT JUST TOOLS:

There will be an increasing focus on integrated solutions for big data in 2016 – not just products, services, and tools. Organizations will look to combine and integrate their tools and platforms for information management, analytics, search, and for other applications.

WHAT AWS SERVICES ARE AVAILABLE FOR STORING AND ANALYZING BIG DATA?

The following services are described in order of use, from collecting and processing to storing and analyzing big data:

– Amazon Kinesis Streams
– AWS Lambda
– Amazon Elastic MapReduce (EMR)
– Amazon Machine Learning
– Amazon DynamoDB
– Amazon Redshift
– Amazon Elasticsearch Service
– Amazon QuickSight

In addition, Amazon EC2 instances are also available for self-managed big data applications.

HOW DO YOU UTILIZE AMAZON REDSHIFT FOR THE BIG-DATA PROBLEM?

Redshift is a petabyte-scale data warehouse (it can likewise start at the gigabyte scale) with an ANSI SQL interface. Since you can put as much data as you like into the warehouse and run any sort of SQL you wish against that data, it is a good foundation for building an agile big data analysis framework. Redshift has numerous analytic capabilities, for the most part via window functions: you can calculate averages and medians, as well as percentiles, dense ranks, and so on.
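Window functions are the analytic workhorse here. The same SQL window syntax that Redshift supports also works in SQLite (3.25+), so a small stdlib-only Python sketch can show the shape of a DENSE_RANK query (the table and data are invented; Redshift itself offers many more analytic functions, including percentiles and medians):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100), ("east", 300), ("west", 200)])

# Rank every sale by amount with a window function, largest first.
rows = conn.execute("""
    SELECT region, amount,
           DENSE_RANK() OVER (ORDER BY amount DESC) AS rnk
    FROM sales
    ORDER BY rnk
""").fetchall()
print(rows)  # [('east', 300, 1), ('west', 200, 2), ('east', 100, 3)]
```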

WHICH DWH TOOLS ARE AVAILABLE TO SUPPORT BIG DATA UPLOADS?

There are many DWH and reporting tools that you can connect to Redshift. The most widely recognized ones are Tableau, QlikView, Looker, and YellowFin. If you already have an existing DWH, you may prefer to continue using tools like JasperSoft or Oracle BI.

ROLE OF DATA SCIENTISTS IN BIG DATA

Rising apace with the relatively new technology of big data is the new job title of “data scientist.” While not tied exclusively to big data projects, the data scientist role complements them because of the increased breadth and depth of data being examined, compared to traditional roles.

What does a data scientist do?

The data scientist will be responsible for designing and implementing processes and layouts for the complex, large-scale data sets used for modelling, data mining, and research. The data scientist is also responsible for business case development; planning, coordination, and collaboration with various internal and vendor teams; managing the analysis lifecycle of the project; and interfacing with business sponsors to provide periodic updates.

A data scientist would be responsible for:
⦁ Extracting data relevant for analysis (by coordinating with developers).
⦁ Developing new analytical methods and tools as required.
⦁ Contributing to data mining architectures, modelling standards, reporting, and data analysis methodologies.
⦁ Suggesting best practices for data mining and analysis services.
⦁ Creating data definitions for new databases or changes to existing ones, as needed for analysis.

Big Data:

The term “Big Data”, which has become a buzzword, refers to a massive volume of structured and unstructured data that cannot be processed or analysed using traditional processes or tools. There is no exact definition of how big a dataset should be in order to be considered Big Data.

Big Data is also defined by the three V’s: Volume, Velocity, and Variety.

Volume: Big data implies an enormous volume of data. We currently see growth in data storage, as data is no longer only text but also video, music, and large images on social media channels. It is the granular nature of the data that is unique. It is very common for organizations to have terabytes and even petabytes of storage. As the database grows, the applications and architecture built to support the data need to be re-evaluated quite often. Sometimes the same data is evaluated from multiple angles, and even though the original data is the same, the newfound intelligence creates an explosion of data.

Velocity: Velocity deals with the fast rate at which data is received and perhaps acted upon. The growth of data and the social media explosion have changed how we look at data. The flow of data is massive and continuous. Nowadays, people rely on social media to keep them updated on the latest happenings. Data movement is almost real-time, and the update window has shrunk to fractions of a second.

Variety: Data can be stored in multiple formats. Big data variety refers to unstructured and semi-structured data types such as text, audio, and video. Unstructured data has many of the same requirements as structured data, such as summarization, auditability, and privacy. The real world has data in many formats, and that is the major challenge we need to overcome with Big Data.

The future of Big Data:

The demand for big data talent and technology is exploding day by day. Over the last two years, investment in big data solutions has tripled. As our world continues to become more information-driven year over year, industry analysts predict that the big data market will easily expand by another ten times within the next decade. Big data is already proving its value by allowing companies to operate at a new level of intelligence and sophistication.
