N-gramový jazykový model pro český spellchecker

Richter, Michal

N-gram language model for a Czech spellchecker

bachelor thesis (DEFENDED)


Permanent link

http://hdl.handle.net/20.500.11956/18573

Identifiers

Study Information System: 47572

Referee

Bojar, Ondřej

Faculty / Institute

Faculty of Mathematics and Physics

Discipline

General Computer Science

Department

Institute of Formal and Applied Linguistics

Date of defense

9. 9. 2008

Publisher

Univerzita Karlova, Matematicko-fyzikální fakulta

Language

Czech

Grade

Excellent

The aim of this thesis is to explore the use of n-gram language models for Czech spell checking and to write a spellchecker extension that can find typos which are themselves valid Czech words. A further aim is to write a simple web application that presents the extended spellchecker. The thesis also examines how lemmatization and morphological analysis of words affect the success rate of typo detection. It describes the language modelling methods used, followed by the workflow of the program that performs spell checking with language models. Next comes a description of how the training data for the language models was obtained and an evaluation of the resulting models. Finally, the results achieved for each variant of checking are reported.

Abstract (English)

The aim of this thesis is to explore the possibilities of using n-gram language models for spellchecking Czech texts and to implement a spellchecker extension able to find misspellings that are themselves valid Czech words. A further aim was to implement a simple web application that presents the extended spellchecker. The influence of lemmatization and morphological analysis on the success rate of detecting misspelled words was also examined. The language modelling methods used in the thesis are described first, followed by a description of how the spellchecking program uses language models. The next part explains how the data for training the language models were obtained, and the following part evaluates the resulting models. The final part presents the results achieved for each spellchecking variant.
