Poloautomatická analýza struktury textu

Šenkýř, Michal

Half-automatic recognition of text structure

dc.contributor.advisor	Kolman, Petr
dc.creator	Šenkýř, Michal
dc.date.accessioned	2017-04-21T07:49:07Z
dc.date.available	2017-04-21T07:49:07Z
dc.date.issued	2010
dc.identifier.uri	http://hdl.handle.net/20.500.11956/31044
dc.description.abstract	Práce popisuje návrh a implementaci algoritmu, který na základě počáteční lidské nápovědy převádí data v HTML dokumentech vygenerovaných z databáze, avšak určených pro lidské čtení, do strukturovaného tvaru vhodného pro strojové čtení. Na vstupu se předpokládá přítomnost nějaké (nejčastěji grafické) struktury v dokumentu a poskytnutí několika vzorových, sémanticky označených, položek v dokumentu uživatelem. Na výstupu se poté očekává zachycení sémantické struktury dat v dokumentu. Součástí výsledné aplikace je editorová část, která obsahuje grafické nástroje pro snadné označení sémantiky vzorových položek, a serverová část, která obsahuje nástroje pro následné hromadné zpracování dokumentů. Aplikace byla testována na realitních inzertních webech a výsledky tohoto testování byly rozebrány na konci práce. Práce stručně představuje také jiné existující aplikace založené na podobném principu a poskytuje jejich srovnání.	cs_CZ
dc.description.abstract	This thesis describes the design and implementation of an algorithm that, using initial hints from the user, converts data in HTML documents, generated from a database but intended for human reading, into a structured form suitable for machine processing. The input document is assumed to have some structure (usually a visual layout), and the user must provide several sample items in the document, each semantically labelled. The output is expected to capture the semantic structure of the data in the document. The resulting application consists of an editor part, which includes graphical tools for easy labelling of sample items, and a server part, which includes tools for the subsequent batch processing of documents. The application was tested on real-estate advertising websites, and the results of this testing are analysed at the end of the thesis. The thesis also briefly surveys other existing applications based on a similar principle and compares them.	en_US
dc.language	Čeština	cs_CZ
dc.language.iso	cs_CZ
dc.publisher	Univerzita Karlova, Matematicko-fyzikální fakulta	cs_CZ
dc.title	Poloautomatická analýza struktury textu	cs_CZ
dc.type	diplomová práce	cs_CZ
dcterms.created	2010
dcterms.dateAccepted	2010-05-24
dc.description.department	Department of Applied Mathematics	en_US
dc.description.department	Katedra aplikované matematiky	cs_CZ
dc.description.faculty	Faculty of Mathematics and Physics	en_US
dc.description.faculty	Matematicko-fyzikální fakulta	cs_CZ
dc.identifier.repId	62144
dc.title.translated	Half-automatic recognition of text structure	en_US
dc.contributor.referee	Skopal, Tomáš
dc.identifier.aleph	001384569
thesis.degree.name	Mgr.
thesis.degree.level	navazující magisterské	cs_CZ
thesis.degree.discipline	Softwarové systémy	cs_CZ
thesis.degree.discipline	Software Systems	en_US
thesis.degree.program	Informatika	cs_CZ
thesis.degree.program	Computer Science	en_US
uk.thesis.type	diplomová práce	cs_CZ
uk.taxonomy.organization-cs	Matematicko-fyzikální fakulta::Katedra aplikované matematiky	cs_CZ
uk.taxonomy.organization-en	Faculty of Mathematics and Physics::Department of Applied Mathematics	en_US
uk.faculty-name.cs	Matematicko-fyzikální fakulta	cs_CZ
uk.faculty-name.en	Faculty of Mathematics and Physics	en_US
uk.faculty-abbr.cs	MFF	cs_CZ
uk.degree-discipline.cs	Softwarové systémy	cs_CZ
uk.degree-discipline.en	Software Systems	en_US
uk.degree-program.cs	Informatika	cs_CZ
uk.degree-program.en	Computer Science	en_US
thesis.grade.cs	Velmi dobře	cs_CZ
thesis.grade.en	Very good	en_US
uk.abstract.cs	Práce popisuje návrh a implementaci algoritmu, který na základě počáteční lidské nápovědy převádí data v HTML dokumentech vygenerovaných z databáze, avšak určených pro lidské čtení, do strukturovaného tvaru vhodného pro strojové čtení. Na vstupu se předpokládá přítomnost nějaké (nejčastěji grafické) struktury v dokumentu a poskytnutí několika vzorových, sémanticky označených, položek v dokumentu uživatelem. Na výstupu se poté očekává zachycení sémantické struktury dat v dokumentu. Součástí výsledné aplikace je editorová část, která obsahuje grafické nástroje pro snadné označení sémantiky vzorových položek, a serverová část, která obsahuje nástroje pro následné hromadné zpracování dokumentů. Aplikace byla testována na realitních inzertních webech a výsledky tohoto testování byly rozebrány na konci práce. Práce stručně představuje také jiné existující aplikace založené na podobném principu a poskytuje jejich srovnání.	cs_CZ
uk.abstract.en	This thesis describes the design and implementation of an algorithm that, using initial hints from the user, converts data in HTML documents, generated from a database but intended for human reading, into a structured form suitable for machine processing. The input document is assumed to have some structure (usually a visual layout), and the user must provide several sample items in the document, each semantically labelled. The output is expected to capture the semantic structure of the data in the document. The resulting application consists of an editor part, which includes graphical tools for easy labelling of sample items, and a server part, which includes tools for the subsequent batch processing of documents. The application was tested on real-estate advertising websites, and the results of this testing are analysed at the end of the thesis. The thesis also briefly surveys other existing applications based on a similar principle and compares them.	en_US
uk.file-availability	V
uk.publication.place	Praha	cs_CZ
uk.grantor	Univerzita Karlova, Matematicko-fyzikální fakulta, Katedra aplikované matematiky	cs_CZ
dc.identifier.lisID	990013845690106986