

Type of Document Master's Thesis Author Chandrasekaran, Sabrina Author's Email Address sabrina.chandrasekaran@gmail.com URN etd-07082010-190449 Title Implementation and Analysis of a Top-K Retrieval System for Strings Degree Master of Science in Systems Science (M.S.S.S.) Department Computer Science Advisory Committee
Advisor Name Title Shah, Rahul Committee Chair Busch, Konstantin Committee Member Kannan, Rajgopal Committee Member Keywords
- suffix trees
- Top-K retrieval system
Date of Defense 2010-06-30 Availability unrestricted Abstract Given text which is a union of d documents of strings, D = d1, d2,...., dd, the emphasisof this thesis is to provide a practical framework to retrieve the K most relevant documents
for a given pattern P, which comes as a query. This cannot be done directly, as going
through every occurrence of the query pattern may prove to be expensive if the number of
documents that the pattern occurs in is much more than the number of documents (K) that
we require. Some advanced query functionality will be required, as compared to listing the
documents that the pattern occurs in, because a dened notion of "most relevant" must be
provided. Therefore, an index needs to be built before hand on T so that the documents
can be retrieved very quickly. Traditionally, inverted indexes have proven to be effective
in retrieving the Top-K documents. However, inverted indexes have certain disadvantages,
which can be overcome by using other data structures like suffix trees and suffix arrays.
A framework was originally provided by Muthukrishnan [29] that takes advantage of
the number of relevant documents being less than the occurence of the query pattern. He
considered two metrics for relevance:frequency and proximity and provided a framework
that took O(n log n) space. Recently, Hon et al [14] provided a framework that takes O(n)
space to retrieve the Top-K documents with more optimal query times, O(P + K logK)
for arbitrary score functions. In this thesis we study the practicality of this index and
provide added functionalities, based on the index, to retrieve Top-K documents for specific
cases like phrase searching. We also provide functionality to output the K most relevant
documents(according to page rank) when two patterns are given as queries.
Files
Filename Size Approximate Download Time (Hours:Minutes:Seconds)
28.8 Modem 56K Modem ISDN (64 Kb) ISDN (128 Kb) Higher-speed Access ChandrasekaranThesis.pdf 2.34 Mb 00:10:50 00:05:34 00:04:52 00:02:26 00:00:12
If you have questions or technical problems, please Contact LSU-ETD Support.