Title page for ETD etd-05062005-163801

Type of Document Dissertation
Author Yenduri, Sumanth
URN etd-05062005-163801
Title An Empirical Study of Imputation Techniques for Software Data Sets
Degree Doctor of Philosophy (Ph.D.)
Department Computer Science
Advisory Committee
Advisor Name Title
S.S. Iyengar Committee Chair
B.B. Karki Committee Member
Donald Kraft Committee Member
R.C. Ward Committee Member
G. Gu Dean's Representative
Keywords
  • effort estimation
  • clustering techniques
  • missing data techniques
  • missing value analysis
  • effort prediction models
Date of Defense 2005-04-20
Availability unrestricted
Software project effort/cost/time estimation is one of the most active research topics in software engineering, and solutions are in great demand. Accurate effort/cost/time estimates early in the software project life cycle enable project managers to manage and exploit resources efficiently and to meet cost and time constraints. To this day, most companies rely on their historical databases of past project data to predict estimates for future projects. Like other data sets, software project data sets suffer from numerous problems, the most important being missing or incomplete data. Significant amounts of missing or incomplete data are frequently found in the data sets used to build effort/cost/time prediction models in the software industry. The reasons are numerous, and the missingness is inevitable. The traditional approaches used by companies ignore all the missing data and provide estimates based on the remaining complete information. As a result, the estimates themselves are prone to bias.

In this thesis, we investigate the application of a few well-known data imputation techniques (Listwise Deletion, Mean Imputation, 10 variants of Hot-Deck Imputation, and the Full Information Maximum Likelihood approach) to six real-time software project data sets. Using the imputed data sets, we build effort prediction models to evaluate the performance of the techniques. We study the inherent characteristics of software project data sets, such as data set size, missingness mechanism, and pattern of missingness, and provide a generic classification schema for all software project data sets based on these characteristics. We further implement a hybrid methodology for addressing the missing data problem. We perform experimental analyses and compare the impact of these methods on prediction accuracy. We also highlight the conditions to be considered and the measures to be taken when using an imputation technique, and we note the ideal and worst conditions for each method. Finally, we discuss the findings and the appropriateness of each method for imputing software project data sets.
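
As a rough illustration of the two simplest techniques named above, the following Python sketch contrasts listwise deletion with mean imputation. It is not taken from the dissertation: the project records, the column meanings, and the function names are all hypothetical and chosen only to show the idea.

```python
from statistics import mean

# Hypothetical project records: (size_kloc, team_size, effort_person_months).
# None marks a missing value; the numbers are invented for illustration.
projects = [
    (10.0, 4, 20.0),
    (25.0, None, 60.0),
    (None, 6, 45.0),
    (40.0, 8, 90.0),
]

def listwise_deletion(rows):
    """Keep only the rows that have no missing values."""
    return [row for row in rows if None not in row]

def mean_imputation(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    columns = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [
        tuple(col_means[j] if v is None else v for j, v in enumerate(row))
        for row in rows
    ]

complete_cases = listwise_deletion(projects)  # incomplete rows are discarded
imputed = mean_imputation(projects)           # all rows kept, gaps filled
```

In this toy data, listwise deletion discards half of the records, while mean imputation keeps every record at the cost of filling gaps with column averages. This is the kind of trade-off the empirical comparison is concerned with.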

  Filename          Size        Approximate Download Time (Hours:Minutes:Seconds)
                                28.8 Modem   56K Modem   ISDN (64 Kb)   ISDN (128 Kb)   Higher-speed Access
  Yenduri_dis.pdf   545.97 Kb   00:02:31     00:01:17    00:01:08       00:00:34        00:00:02
