Deduplikační metody v databázích

Vávra, Petr

Deduplication methods in databases

dc.contributor.advisor	Kyjonka, Vladimír
dc.creator	Vávra, Petr
dc.date.accessioned	2017-04-27T03:30:23Z
dc.date.available	2017-04-27T03:30:23Z
dc.date.issued	2010
dc.identifier.uri	http://hdl.handle.net/20.500.11956/34009
dc.description.abstract	V této práci studujeme úlohy odhalování duplicit v databázích v rámci datové kvality. Za duplicity považujeme ty záznamy, které se sice mohou syntakticky lišit, ale které sémanticky představují tentýž objekt reálného světa. Hlavním cílem této práce je shrnout současné deduplikační metody z hlediska jejich nároků, výsledků a využitelnosti v praxi. Detailněji se zaměříme na porovnání dvou kategorií deduplikačních metod - těch, které vyžadují detailní informace o doméně, a těch, které se bez nich naopak dokáží obejít. Praktickou částí této práce je proto implementace vlastní metody z rodiny vzdálenostních metod nevyžadující žádné znalosti, jejíž výsledky porovnáme s výsledky komerčního nástroje používaného v praxi, který naopak využívá detailních znalostí dat, ve kterých jsou hledány duplicity.	cs_CZ
dc.description.abstract	In this work we study the record deduplication problem as a data quality issue. We define duplicates as records that may differ syntactically but semantically represent the same real-world entity. The main goal of this work is to provide an overview of existing deduplication methods with respect to their requirements, results and practical usability. We focus on comparing two groups of record deduplication methods: those that require detailed domain knowledge and those that do not. The practical part of this work is therefore the implementation of our own method, from the family of distance-based methods, which does not use any domain knowledge; we compare its results with those of a commercial tool used in practice, which relies on detailed knowledge of the data being deduplicated.	en_US
dc.language	Čeština	cs_CZ
dc.language.iso	cs_CZ
dc.publisher	Univerzita Karlova, Matematicko-fyzikální fakulta	cs_CZ
dc.subject	Deduplikace	cs_CZ
dc.subject	unifikace	cs_CZ
dc.subject	matching	cs_CZ
dc.subject	kvalita dat	cs_CZ
dc.subject	Deduplication	en_US
dc.subject	unification	en_US
dc.subject	matching	en_US
dc.subject	data quality	en_US
dc.title	Deduplikační metody v databázích	cs_CZ
dc.type	diplomová práce	cs_CZ
dcterms.created	2010
dcterms.dateAccepted	2010-09-06
dc.description.department	Department of Software Engineering	en_US
dc.description.department	Katedra softwarového inženýrství	cs_CZ
dc.description.faculty	Faculty of Mathematics and Physics	en_US
dc.description.faculty	Matematicko-fyzikální fakulta	cs_CZ
dc.identifier.repId	88538
dc.title.translated	Deduplication methods in databases	en_US
dc.contributor.referee	Skopal, Tomáš
dc.identifier.aleph	001389686
thesis.degree.name	Mgr.
thesis.degree.level	navazující magisterské	cs_CZ
thesis.degree.discipline	Software Systems	en_US
thesis.degree.discipline	Softwarové systémy	cs_CZ
thesis.degree.program	Computer Science	en_US
thesis.degree.program	Informatika	cs_CZ
uk.thesis.type	diplomová práce	cs_CZ
uk.taxonomy.organization-cs	Matematicko-fyzikální fakulta::Katedra softwarového inženýrství	cs_CZ
uk.taxonomy.organization-en	Faculty of Mathematics and Physics::Department of Software Engineering	en_US
uk.faculty-name.cs	Matematicko-fyzikální fakulta	cs_CZ
uk.faculty-name.en	Faculty of Mathematics and Physics	en_US
uk.faculty-abbr.cs	MFF	cs_CZ
uk.degree-discipline.cs	Softwarové systémy	cs_CZ
uk.degree-discipline.en	Software Systems	en_US
uk.degree-program.cs	Informatika	cs_CZ
uk.degree-program.en	Computer Science	en_US
thesis.grade.cs	Velmi dobře	cs_CZ
thesis.grade.en	Very good	en_US
uk.abstract.cs	V této práci studujeme úlohy odhalování duplicit v databázích v rámci datové kvality. Za duplicity považujeme ty záznamy, které se sice mohou syntakticky lišit, ale které sémanticky představují tentýž objekt reálného světa. Hlavním cílem této práce je shrnout současné deduplikační metody z hlediska jejich nároků, výsledků a využitelnosti v praxi. Detailněji se zaměříme na porovnání dvou kategorií deduplikačních metod - těch, které vyžadují detailní informace o doméně, a těch, které se bez nich naopak dokáží obejít. Praktickou částí této práce je proto implementace vlastní metody z rodiny vzdálenostních metod nevyžadující žádné znalosti, jejíž výsledky porovnáme s výsledky komerčního nástroje používaného v praxi, který naopak využívá detailních znalostí dat, ve kterých jsou hledány duplicity.	cs_CZ
uk.abstract.en	In this work we study the record deduplication problem as a data quality issue. We define duplicates as records that may differ syntactically but semantically represent the same real-world entity. The main goal of this work is to provide an overview of existing deduplication methods with respect to their requirements, results and practical usability. We focus on comparing two groups of record deduplication methods: those that require detailed domain knowledge and those that do not. The practical part of this work is therefore the implementation of our own method, from the family of distance-based methods, which does not use any domain knowledge; we compare its results with those of a commercial tool used in practice, which relies on detailed knowledge of the data being deduplicated.	en_US
uk.file-availability	V
uk.publication.place	Praha	cs_CZ
uk.grantor	Univerzita Karlova, Matematicko-fyzikální fakulta, Katedra softwarového inženýrství	cs_CZ
dc.identifier.lisID	990013896860106986