Sledování témat v elektronickém zpravodajství

Bílek, Karel

News Topics Tracking

bachelor thesis (DEFENDED)

Permanent link

http://hdl.handle.net/20.500.11956/50239

Identifiers

Study Information System: 62909

Referee

Holan, Tomáš

Faculty / Institute

Faculty of Mathematics and Physics

Discipline

General Computer Science

Department

Institute of Formal and Applied Linguistics

Date of defense

7. 9. 2011

Publisher

Univerzita Karlova, Matematicko-fyzikální fakulta

Language

Czech

Grade

Excellent

Keywords (Czech)

Zpravodajství, články, témata, klíčová slova

Keywords (English)

News, articles, topics, keywords

Abstract (Czech)

V této práci se snažím nalézt definici zpravodajského tématu tak, aby byla detekce těchto témat v textu implementovatelná a kvalita této detekce měřitelná. Popisuji možné metody - "prosté" počítání slov, případně se zavedením stopslov; TF-IDF; dále popisuji problém textové klasifikace, mírně se dotknu text clusteringu. Dále popisuji přístupy, nazvané latent semantic indexing a latent Dirichlet allocation. Také popisuji experimenty s "prostým" počítáním slov, TF-IDF a textovou klasifikací na databázi článků z několika elektronických zdrojů; vznik této databáze v práci popisuji rovněž. Ke způsobu řešení pomocí textové klasifikace uvádím metriku pomocí měření přesnosti a úplnosti; podle těchto metrik měřím několik variant textové klasifikace.

Abstract (English)

In this thesis, I try to find a definition of a news topic that makes topic detection implementable and its quality measurable. I describe several methods: "simple" word counting, optionally with stopwords; TF-IDF; and the text categorization problem. I touch on the subject of text clustering, and I briefly describe the approaches called latent semantic indexing and latent Dirichlet allocation. The thesis includes my experiments with "simple" word counting, TF-IDF, and text categorization on a database of articles from several online news websites; I also describe the creation of this database. Precision and recall are used as metrics for the text categorization approach.