Title page for ETD etd-06282012-165206

Type of Document Master's Thesis
Author Mantha, Pradeep Kumar
Author's Email Address pmanth2@tigers.lsu.edu
URN etd-06282012-165206
Title An Extensible and Scalable Pilot-MapReduce Framework for Data Intensive Applications on Distributed Cyberinfrastructure
Degree Master of Science in Systems Science (M.S.S.S.)
Department Computer Science
Advisory Committee
Advisor Name Title
Jha, Shantenu Committee Chair
Allen, Gabrielle Committee Member
Hall, Randall Committee Member
Keywords
  • MapReduce
  • Distributed Computing
  • Pilot Job and Data
  • Simple API for Grid Applications (SAGA)
  • Genome Sequence Alignment
  • BWA
Date of Defense 2012-05-21
Availability unrestricted
Abstract
The volume and complexity of data that must be analyzed in scientific applications are increasing exponentially. Often, this data is distributed; thus, approaches that analyze data only after localizing it will yield limited returns. Therefore, efficient processing of large distributed datasets is required, ideally without introducing fundamentally new programming models or methods. For example, it is desirable to extend MapReduce, a proven and effective programming model for processing large datasets, to work more effectively on distributed data and on different infrastructure (such as non-Hadoop, general-purpose clusters). We posit that this can be achieved with an effective and efficient runtime environment and without refactoring MapReduce itself. MapReduce on distributed data requires effective distributed coordination of computation (map and reduce) and data, as well as distributed data management (in particular, the transfer of intermediate data units). To address these requirements, we design and implement Pilot-MapReduce (PMR), a flexible, infrastructure-independent runtime environment for MapReduce. PMR is based on Pilot abstractions for both compute (Pilot-Jobs) and data (Pilot-Data): it uses Pilot-Jobs to couple the map phase computation to the nearby source data, and Pilot-Data to move intermediate data to the reduce computation phase using parallel data transfers. We analyze the effectiveness of PMR over applications with different characteristics (e.g., different volumes of intermediate and output data). Our experimental evaluations show that the Pilot abstraction for data movement across multiple clusters is promising and can lower the overall execution time of a MapReduce run. We also investigate the performance of PMR with distributed data using a Word Count and a genome sequencing application over different MapReduce configurations.
By comparing and contrasting the PMR approach with similar capabilities of Seqal and Crossbow, two Hadoop MapReduce-based Next Generation Sequencing (NGS) applications, we find that PMR is a viable tool to support distributed NGS analytics. Our experiments show that PMR provides the desired flexibility in the deployment and configuration of MapReduce runs to address specific application characteristics and achieve optimal performance, both locally and over multiple wide-area clusters.
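The data flow the abstract describes (map near the source data, move intermediate data to the reducers, then reduce) can be illustrated with a small Word Count sketch. This is not PMR code and does not use the Pilot-Job or Pilot-Data APIs: the site names, the sample text, and the functions `map_chunk`, `partition`, and `run_word_count` are all hypothetical, and the "sites" are simulated in a single process.

```python
from collections import Counter, defaultdict

# Hypothetical input: text chunks that live on two separate clusters.
SITE_DATA = {
    "site_a": ["the quick brown fox", "the lazy dog"],
    "site_b": ["the fox jumps", "dog and fox"],
}
NUM_REDUCERS = 2


def map_chunk(chunk):
    """Map step: count the words of one chunk at the site that stores it."""
    return Counter(chunk.split())


def partition(word, num_reducers):
    """Assign a word to a reducer using a simple, reproducible byte sum."""
    return sum(word.encode("utf-8")) % num_reducers


def run_word_count(site_data, num_reducers):
    # Map phase: each site processes only its own chunks (compute near data).
    intermediate = defaultdict(list)  # reducer id -> list of partial counts
    for site, chunks in site_data.items():
        for chunk in chunks:
            partial = map_chunk(chunk)
            # Shuffle: split the intermediate output by destination reducer.
            # In a distributed run, this is the data that must be transferred.
            buckets = defaultdict(Counter)
            for word, count in partial.items():
                buckets[partition(word, num_reducers)][word] += count
            for reducer_id, bucket in buckets.items():
                intermediate[reducer_id].append(bucket)

    # Reduce phase: each reducer merges the partial counts it received.
    totals = Counter()
    for reducer_id in range(num_reducers):
        for bucket in intermediate[reducer_id]:
            totals.update(bucket)
    return dict(totals)


if __name__ == "__main__":
    print(run_word_count(SITE_DATA, NUM_REDUCERS))
```

In this sketch only the per-reducer buckets cross site boundaries, which is why the volume of intermediate data matters for runs that span multiple clusters.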

Filename          Size       Approximate Download Time (Hours:Minutes:Seconds)
                             28.8 Modem  56K Modem  ISDN (64 Kb)  ISDN (128 Kb)  Higher-speed Access
manthathesis.pdf  821.97 Kb  00:03:48    00:01:57   00:01:42      00:00:51       00:00:04
