Heuristiky pro kompresi špatně formovaného XML

Szabó, Mária

Heuristics for compression of non-well-formed XML

diploma thesis (DEFENDED)

View/Open

Record of the defense proceedings (129.5Kb)

Permanent link

http://hdl.handle.net/20.500.11956/17214

Identifiers

Study Information System: 48622

Referee

Matouš, Václav

Faculty / Institute

Faculty of Mathematics and Physics

Discipline

Software systems

Department

Department of Software Engineering

Date of defense

24. 9. 2008

Publisher

Univerzita Karlova, Matematicko-fyzikální fakulta

Language

Czech

Grade

Very good

Abstract (English)

XBW [9] is a modular application for lossless text compression that supports several compression algorithms. The name XBW combines the abbreviations XML and BWT, because the best results were achieved by combining an XML parser with the Burrows-Wheeler transform. The thesis therefore focuses on improving results in combination with BWT. On files of about 20 MB, built from hundreds of concatenated web pages, we made compression up to 37% faster at the cost of a 5% worse compression ratio. Even with this loss, the compression ratio remains at least 38% better than that of the Rar program. The speed-up was achieved by implementing a new type of parser that uses dictionaries of tags and elements. The thesis also contains a reimplementation of the parser from the original XBW project, rewritten from scratch while keeping the principle of tag and attribute dictionaries. On average, the reimplementation improved the compression ratio by 2% and the running speed of the application by 4%.
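The general idea named in the abstract, a tag dictionary followed by the Burrows-Wheeler transform, can be illustrated with a minimal sketch. This is not the XBW implementation. The function names, the regular expression used to find tags, the use of private-use code points as tag ids, and the naive rotation-sorting BWT are all assumptions made for illustration. The input is assumed to contain neither `"\0"` nor private-use characters.

```python
import re

TAG_BASE = 0xE000  # hypothetical choice: private-use code points stand in for tag ids


def encode_tags(text):
    """Replace every tag-like substring with a one-character dictionary id.

    No nesting or well-formedness is checked, so unclosed tags are fine.
    """
    tags, index, out = [], {}, []
    # With a capturing group, re.split puts the matched tags at odd positions.
    for i, piece in enumerate(re.split(r"(<[^<>]*>)", text)):
        if i % 2 == 1:
            if piece not in index:
                index[piece] = len(tags)
                tags.append(piece)
            out.append(chr(TAG_BASE + index[piece]))
        else:
            out.append(piece)
    return "".join(out), tags


def decode_tags(encoded, tags):
    """Expand dictionary ids back into the original tag strings."""
    return "".join(
        tags[ord(c) - TAG_BASE] if TAG_BASE <= ord(c) < TAG_BASE + len(tags) else c
        for c in encoded
    )


def bwt(s, end="\0"):
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    s += end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)


def inverse_bwt(last, end="\0"):
    """Naive inverse: rebuild the sorted rotation table one column at a time."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return next(r for r in table if r.endswith(end))[:-1]


doc = "<p>one<b>two</p><p>three"  # deliberately not well-formed
encoded, tags = encode_tags(doc)
restored = decode_tags(inverse_bwt(bwt(encoded)), tags)
```

The sketch only shows that the two steps compose losslessly. A practical compressor would use a suffix-array based BWT and follow it with further coding stages, which are omitted here.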
