Univerzální index textových dokumentů

Švantner, Marek

Universal Full-Text Index

diploma thesis (DEFENDED)

Permanent link

http://hdl.handle.net/20.500.11956/8148

Identifiers

Study Information System: 41651

CU Catalogue: 990008660300106986

Referee

Skopal, Tomáš

Faculty / Institute

Faculty of Mathematics and Physics

Discipline

Data Engineering

Department

Department of Software Engineering

Date of defense

5. 2. 2007

Publisher

Univerzita Karlova, Matematicko-fyzikální fakulta

Language

Czech

Grade

Excellent

This diploma thesis deals with the design and implementation of a highly efficient universal index of text documents. "Universal" means, first, that the structure of index records and the data-processing methods can be configured without recompilation and, second, that the index library can also be used for other purposes, for example to build a thesaurus, to represent bibliographic relationships, or to represent a certain class of functions in areas other than documentographic (document retrieval) systems. The implementation is based on a dynamic inverted file, which allows update operations to be performed efficiently without rebuilding the data structure. The thesis also specifically addresses on-line index compression and resilience of the data structure to failures by means of transactional processing. A constant amortized complexity of the structure is derived and then verified experimentally. Further experiments examine the performance of the compression methods and the influence of the data structure's parameters on its performance and space consumption. The thesis includes the author's own implementation of the universal index in C/C++, tested on Linux and Windows XP.

Abstract (English)

This diploma thesis deals with the design and implementation of a highly efficient universal index of text documents. "Universal" means that the structure of the index records and the methods used to process index data can be configured without recompiling the application. It also means that the index library can be used for other purposes, for example to implement a thesaurus, to represent bibliographic relationships, or to represent a specific class of functions in areas other than documentographic (document retrieval) systems. The index is implemented as a dynamic inverted file, which can be updated efficiently without rebuilding the data structure. The thesis also addresses on-line index compression and failure recovery via a transactional log. The amortized complexity of the data structure is shown to be constant, and this result is then verified experimentally. Further experiments address the compression methods and the impact of the data structure's parameters on its performance and space consumption. The thesis includes an implementation of the universal index in C/C++, tested on Linux and Windows XP.
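
To make the idea of an updatable, compressed inverted file more concrete, here is a minimal sketch in Python. It is not the thesis's C/C++ implementation, and every name in it (`DynamicInvertedFile`, `encode_varbyte`, `decode_varbytes`) is hypothetical. It assumes document identifiers arrive in increasing order, stores each posting list as gaps between consecutive identifiers, and compresses the gaps with variable-byte coding as they are appended. The abstract does not say which compression methods the thesis evaluates, so variable-byte coding is only a stand-in. Transactional logging and configurable record structures are left out.

```python
from collections import defaultdict


def encode_varbyte(n):
    """Encode a non-negative integer in 7-bit groups, low-order group first."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)  # final byte has the high bit clear
    return bytes(out)


def decode_varbytes(data):
    """Decode a concatenation of variable-byte integers into a list."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


class DynamicInvertedFile:
    """Hypothetical sketch: term -> compressed, append-only posting list."""

    def __init__(self):
        self._postings = defaultdict(bytearray)  # term -> encoded gaps
        self._last_doc = {}                      # term -> last doc id stored
        self._max_doc_id = -1                    # highest doc id seen so far

    def add_document(self, doc_id, text):
        """Index one document without rebuilding existing posting lists."""
        if doc_id <= self._max_doc_id:
            raise ValueError("document ids must be added in increasing order")
        self._max_doc_id = doc_id
        for term in set(text.lower().split()):
            last = self._last_doc.get(term)
            gap = doc_id if last is None else doc_id - last
            self._postings[term] += encode_varbyte(gap)  # compress on append
            self._last_doc[term] = doc_id

    def lookup(self, term):
        """Return the sorted document ids that contain the term."""
        doc_ids, running = [], 0
        encoded = bytes(self._postings.get(term.lower(), b""))
        for gap in decode_varbytes(encoded):
            running += gap
            doc_ids.append(running)
        return doc_ids
```

Each update touches only the tails of the posting lists for the terms in the new document, which is the property the abstract attributes to a dynamic inverted file. The sketch relies on Python's `bytearray` to keep appends cheap; a disk-based structure such as the one described in the thesis would need its own block allocation strategy to get a comparable guarantee.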
