Vyhledávání v českých strukturovaných datech pomocí stemmingu

Tattermusch, Jan

Searching Czech Structured Data using Stemming

diploma thesis (DEFENDED)

View/Open

Record of the defense proceedings (83.67 KB)

Permanent link

http://hdl.handle.net/20.500.11956/33994

Identifiers

Study Information System: 79173

Referee

Kuboň, Vladislav

Faculty / Institute

Faculty of Mathematics and Physics

Discipline

Software Systems

Department

Institute of Formal and Applied Linguistics

Date of defense

6 September 2010

Publisher

Univerzita Karlova, Matematicko-fyzikální fakulta

Language

Czech

Grade

Excellent

Tato práce implementuje a popisuje komponentu pro fulltextové vyhledávání s podporou českého doplňování diakritiky a stemmingu. Doplňovač diakritiky pracuje na statistickém principu a zohledňuje kontext. Práce obsahuje pět stemmerů připravených k okamžitému použití (dva algoritmické a tři hybridní), jejichž vlastnosti jsou diskutovány. Komponenta je vystavěna nad knihovnou Apache Lucene a poskytuje jednoduché rozhraní pro dotazování a přidávání, mazání a změnu indexovaných dokumentů. Ukládané dokumenty se skládají z pojmenovaných polí s definovanými datovými typy. Komponenta umožňuje definovat kromě běžných fulltextových dotazů také netriviální dotazy s doplňujícími omezeními a ovlivnit vlastní způsob výpočtu skóre výsledků dotazu. Výkon komponenty je dostatečný pro středně vytížené aplikace a orientační výkon je dle měření 50 dotazů za vteřinu nad úložištěm obsahujícím 2,7 milionu dokumentů. Přínos doplňování diakritiky a stemmingu pro kvalitu fulltextového vyhledávání byl měřen pomocí MAP a byl vyhodnocen jako významný.

Abstract (English)

This work describes and implements a component for full-text search with support for Czech diacritics restoration and stemming. Diacritics restoration is based on statistical principles and takes context into account. The work presents five stemmers ready for immediate use (two algorithmic and three hybrid) and discusses their properties. The component is built on top of the Apache Lucene library and provides a simple interface for querying and for inserting, deleting, and updating indexed documents. Stored documents consist of named fields with predefined data types. Besides regular full-text queries, the component also supports non-trivial queries with additional constraints and lets the user customize how the score of query results is computed. The component's performance is sufficient for medium-load applications: measurements indicate approximately 50 queries per second over a repository containing 2.7 million documents. The contribution of diacritics restoration and stemming to the quality of full-text search was measured using MAP and was found to be significant.
