WHAT ARE HADOOP PAIN POINTS?

As we know, Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle a virtually limitless number of concurrent tasks. Hadoop is powerful but, like most systems, it has some sharp edges. This article walks through the main pain points of Hadoop.

Hadoop isn’t a database

From an access and storage perspective, Hadoop is different enough from a database to throw a lot of people off. Databases abstract away the details of on-disk organization, file formats, serialization, partitioning, and optimization for varied access patterns. Topics such as “data modeling” are either treated at the logical layer or assume a relational engine. As an example, most people are not aware of how relational database engines perform the various forms of joins. In Hadoop, by contrast, users typically have to deal with these details themselves.
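To make the join example concrete, here is a minimal sketch of the kind of work a database normally hides: a reduce-side inner join written as explicit map, shuffle, and reduce steps. It runs in a single Python process and only imitates the MapReduce flow. The sample records, field layouts, and function names are all assumptions made for illustration, not part of Hadoop.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample records; in a real job these would be lines read from files in HDFS.
USERS = ["1,alice", "2,bob"]
ORDERS = ["100,1,book", "101,1,pen", "102,2,lamp", "103,3,desk"]


def map_users(line):
    """Emit (join key, tagged record) for a user line of the form 'user_id,name'."""
    user_id, name = line.split(",")
    return user_id, ("user", name)


def map_orders(line):
    """Emit (join key, tagged record) for an order line 'order_id,user_id,item'."""
    order_id, user_id, item = line.split(",")
    return user_id, ("order", (order_id, item))


def shuffle(pairs):
    """Group values by key, imitating what the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, values):
    """Inner join: pair every user record with every order record sharing one key."""
    names = [v for tag, v in values if tag == "user"]
    orders = [v for tag, v in values if tag == "order"]
    return [(key, name, order_id, item)
            for name, (order_id, item) in product(names, orders)]


def run_join(users, orders):
    """Run the map, shuffle, and reduce steps in order and collect the joined rows."""
    pairs = [map_users(u) for u in users] + [map_orders(o) for o in orders]
    rows = []
    for key, values in sorted(shuffle(pairs).items()):
        rows.extend(reduce_join(key, values))
    return rows


if __name__ == "__main__":
    for row in run_join(USERS, ORDERS):
        print(row)
```

The order for user 3 is dropped because no matching user record exists. A relational engine would make that choice, along with the choice of join algorithm, on the user's behalf from a single JOIN clause.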

Hadoop is a distributed system

Deploying, configuring, managing, monitoring, and debugging a single-threaded, single-process system can be tough. A multi-threaded, single-process system is harder. A multi-threaded, multi-process, distributed system is harder still. Hadoop has a ton of moving parts, and while it gets better with each release, it’s still a complex system that requires specialized knowledge. That said, this isn’t dissimilar from other systems. The main stumbling block is that most people don’t have much experience with distributed systems.

Hadoop has a huge ecosystem

A huge number of open-source and commercial products and projects have popped up around Hadoop that interoperate with it in some way. Each of these comes with its own complications. More than a single system, Hadoop is an entire world unto itself.

Hadoop is evolving

In the grand scheme of things, Hadoop is a young system. It’s evolving and changing at an extremely rapid pace. Hence, there are a huge number of things to keep up with if you want to know all the details.

Hadoop tooling is still developing

Many existing tools and similar systems are designed to deal with data that resides in relational databases. While the ecosystem is growing at a tremendous rate, not all of the tools you might expect have been fully updated to support HDFS and Hadoop MapReduce. However, many of the commercial vendors in the ETL, EDW, BI, and analytics spaces are well on their way, and some have already arrived.

Hadoop is still a young technology, and it is clear that many organizations need more resources, expertise, solutions, and tools to ease the difficulties of implementation. Each week brings new market entrants, which are accelerating the rate of Hadoop adoption. Different verticals are also assembling their own sets of tools to meet demands such as integrated security and regulatory compliance. The period of Hadoop experimentation is drawing to a close; the technology is moving beyond the early-adopter phase into a phase of fast adoption, as companies develop best practices and look for standardization and ease of use so that users can gain insights faster.

WHAT’S THE DIFFERENCE BETWEEN HADOOP 1.X AND HADOOP 2.X?

HDFS federation brings important measures of scalability and reliability to Hadoop. YARN, the other major advance in Hadoop 2, brings significant performance improvements for some applications, supports additional processing models, and implements a more flexible execution engine.

YARN is a resource manager that was created by separating the processing engine and resource management capabilities of MapReduce as it was implemented in Hadoop 1. YARN is often called the operating system of Hadoop because it is responsible for managing and monitoring workloads, maintaining a multi-tenant environment, implementing security controls, and managing high availability features of Hadoop.

Like an operating system on a server, YARN is designed to allow multiple, diverse user applications to run on a multi-tenant platform. In Hadoop 1, users could write MapReduce programs in Java; in Python, Ruby, or other scripting languages using Hadoop Streaming; or in Pig, a data transformation language. Regardless of the method used, all of them fundamentally relied on the MapReduce processing model to run.
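As an illustration of the streaming approach, the sketch below shows a word count written as a mapper and a reducer that exchange tab-separated lines, which is the convention Hadoop Streaming uses by default. It is a minimal example under stated assumptions: the function names and the `run_locally` helper are hypothetical, and the sort step only imitates the shuffle that the framework performs between the two phases. In a real job the mapper and reducer would typically be separate scripts that read standard input and write standard output, passed to the streaming jar through its `-mapper` and `-reducer` options.

```python
from itertools import groupby


def mapper(lines):
    """Map step: emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Reduce step: sum the counts for each run of identical keys.

    Assumes the input is sorted by key, as it is when it reaches a reducer.
    """
    parsed = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Simulate map, sort (standing in for the shuffle), and reduce in one process."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["Hadoop is powerful", "Hadoop is evolving"]
    for out in run_locally(sample):
        print(out)
```

Even though the logic is written in Python, the job is still shaped as a map phase followed by a reduce phase, which is the constraint the paragraph above describes for Hadoop 1.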