AVAILABILITY OF THE JOBTRACKER MACHINE IN HADOOP/MAP-REDUCE IMPLEMENTATIONS

ABSTRACT
Due to the growing demand for Cloud Computing services, the need for and importance of Distributed Systems cannot be overstated. However, it is difficult to use the traditional Message Passing Interface (MPI) approach to implement synchronization and coordination, and to prevent deadlocks, in distributed systems. This difficulty is lessened by the use of Apache's Hadoop/MapReduce and Zookeeper, which provide Fault Tolerance in a Homogeneously Distributed Hardware/Software environment.

In this thesis, a mathematical model for the availability of the JobTracker in Hadoop/MapReduce using Zookeeper's Leader Election Service is examined. Though the availability is less than what is expected of a k Fault Tolerant system for higher values of the hardware failure rate, this approach makes coordination and synchronization easy, reduces the effect of Crash failures, and provides Fault Tolerance for distributed systems.
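For context, a JobTracker can hand failover to Zookeeper's leader election roughly as sketched below. This is a minimal illustration assuming the Python kazoo client library; the ensemble hosts, znode path, and identifier are placeholder values, not anything prescribed by this thesis.

    # Minimal sketch of JobTracker failover via ZooKeeper leader election,
    # using the kazoo client library. Hosts, paths, and identifiers are
    # illustrative assumptions.
    from kazoo.client import KazooClient

    def serve_as_jobtracker():
        # Placeholder for the work the active (leader) JobTracker performs;
        # kazoo calls this function only once leadership is won.
        print("This node is now the active JobTracker.")

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    # Every standby JobTracker runs the same election; ZooKeeper guarantees
    # that exactly one contender at a time executes the leader function.
    election = zk.Election("/jobtracker/election", identifier="jobtracker-1")
    election.run(serve_as_jobtracker)  # blocks until elected, then runs the callback

    zk.stop()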

The availability model starts with a Markov state diagram for a general case of N Zookeeper servers, followed by the specific cases of 3, 4, and 5 servers. Both software and hardware faults are considered, in addition to the effect of hardware and software repair rates. Comparisons show that the system availability changes with the number of Zookeeper servers, with 3 servers having the highest availability.
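The system of differential equations behind such a Markov model can be solved numerically. The following is a minimal Python sketch for the 3-server case, assuming state i counts failed servers, a single shared repair facility, majority-based availability (the ensemble works while at least 2 of 3 servers are up), and illustrative failure and repair rates lam and mu; it is not the exact parameterization used in this thesis.

    # Sketch of a Markov availability model for a 3-server Zookeeper ensemble.
    from scipy.integrate import solve_ivp

    lam, mu = 0.001, 0.1  # assumed per-server failure and repair rates (per hour)

    def kolmogorov(t, p):
        # Kolmogorov forward equations; p[i] is the probability that
        # exactly i of the 3 Zookeeper servers have failed.
        p0, p1, p2, p3 = p
        return [
            -3 * lam * p0 + mu * p1,
            3 * lam * p0 - (2 * lam + mu) * p1 + mu * p2,
            2 * lam * p1 - (lam + mu) * p2 + mu * p3,
            lam * p2 - mu * p3,
        ]

    # Start with all servers up and integrate over a 1000-hour horizon.
    sol = solve_ivp(kolmogorov, (0, 1000), [1.0, 0.0, 0.0, 0.0], dense_output=True)
    p = sol.sol(1000.0)
    print("A(1000) = P0 + P1 =", p[0] + p[1])  # available while a majority is up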


The model presented in this study can be used to decide how many servers are optimal for maximum availability, and from which vendor they should be purchased. It can also help determine when to use a Zookeeper-coordinated Hadoop cluster to perform critical tasks.


TABLE OF CONTENTS

List of Tables
List of Figures
Abstract

CHAPTER ONE
1  Introduction
1.1 Problem Statement
1.2 Objectives
1.3 Thesis Organization

CHAPTER TWO
2  Cloud Computing and Fault Tolerance
2.1 Cloud Computing
2.2 Types of Clouds
2.3 Virtualization in the Cloud
2.3.1 Advantages of virtualization
2.4 Fault, Error and Failure
2.4.1 Faults Types
2.5 Fault Tolerance
2.5.1 Fault-tolerance Properties
2.5.2 K Fault Tolerant Systems
2.5.3 Hardware Fault Tolerance
2.5.4 Software Fault Tolerance
2.6 Properties of a Fault Tolerant Cloud
2.6.1 Availability
2.6.2 Reliability
2.6.3 Scalability

CHAPTER THREE
3 Hadoop/MapReduce Architecture
3.1 Hadoop/MapReduce
3.2 MapReduce
3.3 Hadoop/MapReduce versus other Systems
3.3.1 Relational Database Management Systems (RDBMS)
3.3.2 Grid Computing
3.3.3 Volunteer Computing
3.4 Features of MapReduce
3.4.1 Automatic Parallelization and Distribution of Work
3.4.2 Fault Tolerance in Hadoop/MapReduce
3.4.3 Cost Efficiency
3.4.4 Simplicity
3.5 Limitations of Hadoop/MapReduce
3.6 Apache's ZooKeeper
3.6.1 ZooKeeper Data Model
3.6.2 Zookeeper Guarantees
3.6.3 Zookeeper Primitives
3.6.4 Zookeeper Fault Tolerance
3.7 Related Work

CHAPTER FOUR
4 Availability Model
4.1 JobTracker Availability Model
4.1.1 Related Work
4.2 Model Assumptions
4.3 Markov Model for a Multi-Host System
4.3.1 The Parameter  s(t)
4.4 Markov Model for a Three-Host (N = 3) Hadoop/MapReduce Cluster Using Zookeeper as Coordinating Service
4.5 Numerical Solution to the System of Differential Equations
4.5.1 Interpretation of Availability plot of the JobTracker
4.6 Discussion of Results
4.6.1 Sensitivity Analysis

CHAPTER FIVE
5  Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
Appendix


Chapter 1

Introduction

The effectiveness of most modern information (data) processing depends on the ability to process huge datasets in parallel to meet stringent time constraints and organizational needs. A major challenge facing organizations today is the ability to organize and process the large volumes of data generated by customers. According to Nielsen Online [1], there are more than 1,733,993,741 internet users. How much data these users generate, and how it is processed, largely determines the success of the organization concerned. Consider the social networking site Facebook: as of August 2011, it had over 750 million active users [2], who spend 700 billion minutes per month on the network. They install over 20 million applications every day and interact with 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) each month. Since April 2010, when social plugins were launched, an average of 10,000 new websites have integrated with Facebook. The amount of data generated at Facebook is estimated as follows [3]:

12 TB of compressed data added per day

800 TB of compressed data scanned per day

25,000 map-reduce jobs per day

65 million files in HDFS

30,000 simultaneous clients to the HDFS NameNode

It was a similar demand to process large datasets at Google that inspired its engineers to introduce MapReduce [4]. At Google, MapReduce is used to build the index for Google Search, to cluster articles for Google News, and to perform statistical machine translation. At Yahoo!, it is used to build the index for Yahoo! Search and for spam detection. And at Facebook, MapReduce is used for data mining, ad optimization, and spam detection [5]. MapReduce is designed to use commodity nodes (it runs on cheaper machines) that can fail at any time. Its performance does not degrade significantly due to.....
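To make the MapReduce programming model concrete, the classic word-count job can be written as a pair of map and reduce functions. The following minimal Python sketch runs the shuffle in memory on one machine; the function names and data flow are illustrative and are not Hadoop's actual API.

    # Self-contained sketch of the MapReduce model via word count; the
    # in-memory shuffle stands in for what Hadoop does across machines.
    from collections import defaultdict

    def map_phase(document):
        # Emit a (word, 1) pair for every word, as a Mapper would.
        for word in document.split():
            yield word.lower(), 1

    def reduce_phase(word, counts):
        # Sum the counts for one key, as a Reducer would.
        return word, sum(counts)

    documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

    # Shuffle: group all intermediate values by key.
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_phase(doc):
            grouped[word].append(count)

    results = dict(reduce_phase(w, c) for w, c in grouped.items())
    print(results)  # e.g. {'the': 3, 'quick': 2, ...}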
