ALL YOU NEED TO KNOW ABOUT APACHE HADOOP YARN

YARN is one of the key features of Hadoop 2, the second generation of the Apache Software Foundation’s open source distributed processing framework. Originally described by Apache as a redesigned resource manager, YARN is now characterized as a large-scale, distributed operating system for big data applications.

Back in 2012, YARN became a sub-project of the larger Apache Hadoop project. YARN is a software rewrite that decouples MapReduce’s resource management and scheduling capabilities from the data processing component, which enables Hadoop to support more varied processing approaches and a broader array of applications. The original incarnation of Hadoop closely paired the Hadoop Distributed File System (HDFS) with the batch-oriented MapReduce programming framework, which handled resource management and job scheduling on Hadoop systems and supported the parsing and condensing of data sets in parallel.

YARN combines a central resource manager, which reconciles the way applications use Hadoop system resources, with node manager agents that monitor the processing operations of individual cluster nodes. Separating HDFS from MapReduce with YARN makes the Hadoop environment more suitable for operational applications that can’t wait for batch jobs to finish.
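
To make the split between the central resource manager and the per-node agents concrete, here is a minimal toy model in Python. It is only a sketch and not YARN's actual API: the class names, the `reserve` and `allocate` methods, the memory-only accounting, and the first-fit placement rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy per-node agent: tracks free memory and what is reserved on the node."""
    name: str
    free_mb: int
    reserved: list = field(default_factory=list)

    def reserve(self, app_id: str, mb: int) -> None:
        # Record that part of this node's memory now belongs to an application.
        self.free_mb -= mb
        self.reserved.append((app_id, mb))


class ResourceManager:
    """Toy central arbiter: decides which node serves each request."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, app_id: str, mb: int):
        # First-fit placement (an assumption; real schedulers are policy-driven).
        for node in self.nodes:
            if node.free_mb >= mb:
                node.reserve(app_id, mb)
                return node.name
        return None  # no node currently has enough free memory


nodes = [NodeManager("node-1", 4096), NodeManager("node-2", 2048)]
rm = ResourceManager(nodes)

# Three different kinds of workload ask the same cluster for memory.
placements = [
    rm.allocate("batch-job", 3072),
    rm.allocate("interactive-query", 2048),
    rm.allocate("stream-job", 2048),
]
print(placements)
```

The point of the sketch is that the applications never choose nodes themselves: every request goes through one arbiter, which is what lets batch and non-batch workloads share a cluster.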

DEVELOPING YARN APPLICATIONS

YARN provides the capabilities to build custom application frameworks on top of Hadoop, but with them comes new complexity. Building applications for YARN is notably more complex than building traditional MapReduce applications on pre-YARN Hadoop, because the user needs to develop an ApplicationMaster, which the ResourceManager launches when a client request arrives. The ApplicationMaster has many requirements, including implementing a number of protocols to communicate with the ResourceManager (to request resources) and the NodeManagers (to launch containers). For existing MapReduce users, a ready-made MapReduce ApplicationMaster minimizes any new work, so deploying MapReduce jobs takes about the same effort as on pre-YARN Hadoop.
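
The two conversations an ApplicationMaster has to carry on can be sketched as follows. This is a plain-Python simulation, not Hadoop code: the `Toy*` classes and their `register`, `allocate`, `finish`, and `start_container` methods are hypothetical stand-ins that only loosely mirror the idea of one protocol for talking to the ResourceManager and another for talking to a NodeManager.

```python
class ToyResourceManager:
    """Stand-in for the resource side: hands out container grants."""

    def __init__(self, free_containers):
        self.free = free_containers
        self.registered = set()

    def register(self, app_id):
        self.registered.add(app_id)

    def allocate(self, app_id, wanted):
        if app_id not in self.registered:
            raise RuntimeError("ApplicationMaster must register first")
        granted = min(wanted, self.free)  # may grant fewer than requested
        self.free -= granted
        return [f"{app_id}-container-{i}" for i in range(granted)]

    def finish(self, app_id, containers):
        self.free += len(containers)  # grants return to the pool
        self.registered.discard(app_id)


class ToyNodeManager:
    """Stand-in for the node side: starts work inside granted containers."""

    def __init__(self):
        self.running = []

    def start_container(self, container_id, command):
        self.running.append((container_id, command))


class ToyApplicationMaster:
    def __init__(self, app_id, rm, nm):
        self.app_id, self.rm, self.nm = app_id, rm, nm

    def run(self, tasks):
        self.rm.register(self.app_id)                             # talk to the RM
        containers = self.rm.allocate(self.app_id, len(tasks))    # request resources
        for container_id, task in zip(containers, tasks):
            self.nm.start_container(container_id, task)           # talk to the NM
        self.rm.finish(self.app_id, containers)                   # give resources back
        return containers


rm = ToyResourceManager(free_containers=2)
nm = ToyNodeManager()
launched = ToyApplicationMaster("app-1", rm, nm).run(["map-0", "map-1", "map-2"])
print(launched)
```

Even in this toy form, the ApplicationMaster has to handle being granted fewer containers than it asked for, which hints at why writing one from scratch is more work than submitting a MapReduce job.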

A YARN application acquires resources within a cluster, performs its processing, exposes touchpoints for monitoring its progress, and finally releases its resources and does general clean-up when it is complete. A boilerplate implementation of this life cycle is available under a project called Kitten. Kitten is a set of tools and code that simplifies the development of applications in YARN, allowing users to focus on the logic of their application and initially ignore the details of negotiating and operating within the constraints of the various entities in a YARN cluster.
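
The life cycle described above (allocate, process, report progress, release) can be sketched in a few lines of Python. This is an illustration only and is unrelated to Kitten's real interfaces: `run_application`, the notion of "slots", and the event strings are all hypothetical names chosen for the example.

```python
def run_application(tasks, cluster_slots):
    """Walk one toy application through allocate -> process -> report -> release."""
    if not tasks or cluster_slots < 1:
        raise ValueError("need at least one task and one free slot")

    events = []
    held = min(len(tasks), cluster_slots)  # allocate: take what the cluster can spare
    events.append(f"allocated {held}")
    done = 0
    try:
        # Process the tasks in waves no larger than the number of slots held.
        for start in range(0, len(tasks), held):
            for task in tasks[start:start + held]:
                task()
                done += 1
            events.append(f"progress {done}/{len(tasks)}")  # monitoring touchpoint
    finally:
        # Release and clean up, even if a task raised an exception.
        events.append(f"released {held}")
    return events


results = []
tasks = [lambda i=i: results.append(i * i) for i in range(5)]
events = run_application(tasks, cluster_slots=2)
print(events)
```

The `try`/`finally` is the part worth noticing: the release step has to run whether or not the processing succeeded, and that is the kind of housekeeping a boilerplate life-cycle implementation takes off the developer's hands.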

FINAL WORD

Although Hadoop continues to grow in the big data market, it has begun an evolution to address yet-to-be-defined large-scale data workloads. YARN is still under active development and may not be suitable for production environments, but it provides significant advantages over traditional MapReduce. It permits the development of new distributed applications beyond MapReduce and allows them to coexist with one another in the same cluster. YARN, with its new capabilities and new complexities, will soon be coming to a Hadoop cluster near you.