Poloautomatická analýza struktury textu

Šenkýř, Michal

Half-automatic recognition of text structure

diploma thesis (DEFENDED)

Permanent link

http://hdl.handle.net/20.500.11956/31044

Identifiers

Study Information System: 62144

Referee

Skopal, Tomáš

Faculty / Institute

Faculty of Mathematics and Physics

Discipline

Software Systems

Department

Department of Applied Mathematics

Date of defense

24. 5. 2010

Publisher

Univerzita Karlova, Matematicko-fyzikální fakulta

Language

Czech

Grade

Very good

Abstract

This thesis describes the design and implementation of an algorithm that, given initial hints from a human user, converts data in HTML documents that were generated from a database but intended for human readers into a structured form suitable for machine processing. The input document is assumed to have some structure (most often a visual one), and the user must label the semantics of several sample items in it. The output is expected to capture the semantic structure of the data in the document. The resulting application consists of an editor part, which provides graphical tools for easily labelling the semantics of sample items, and a server part, which provides tools for the subsequent batch processing of documents. The application was tested on real estate advertising websites, and the results of this testing are analysed at the end of the thesis. The thesis also briefly introduces other existing applications based on a similar principle and compares them.
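
The general idea of generalising from a few user-labelled sample items can be illustrated with a minimal sketch. This is not the algorithm from the thesis, which this record does not specify; it is a simple tag-path heuristic chosen for illustration. The names PathCollector, learn_paths and extract are hypothetical, and the sketch assumes well-formed HTML in which each labelled sample matches the full text of one node.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they never enter the path stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Collects (tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, e.g. ["div.ad", "h2"]
        self.nodes = []  # list of (path tuple, stripped text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Assumption: the document is well formed, so a plain pop suffices.
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((tuple(self.stack), text))


def collect(html):
    """Parse the document and return its (path, text) pairs."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


def learn_paths(html, samples):
    """Map each semantic label to the tag path of its sample text."""
    nodes = collect(html)
    paths = {}
    for label, sample_text in samples.items():
        for path, text in nodes:
            if text == sample_text:
                paths[label] = path
                break
    return paths


def extract(html, paths):
    """Return, per label, every text node that shares the learned path."""
    nodes = collect(html)
    return {
        label: [text for path, text in nodes if path == wanted]
        for label, wanted in paths.items()
    }
```

In this sketch, the user's hint is a mapping such as {"price": "120 000 EUR"} taken from one advertisement. learn_paths records where that text sits in the element tree, and extract then collects every other node at the same position, in the same document or in further documents with the same layout.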
