Classification in data streams with abrupt concept drift in a subset of features

Procházka, Martin

Klasifikace proudu dat s náhlou změnou distribuce v podmnožině charakteristik

dc.contributor.advisor	Lisý, Viliam
dc.creator	Procházka, Martin
dc.date.accessioned	2024-11-29T07:04:19Z
dc.date.available	2024-11-29T07:04:19Z
dc.date.issued	2024
dc.identifier.uri	http://hdl.handle.net/20.500.11956/193091
dc.description.abstract	Detekce malwaru je klíčovým aspektem kybernetické bezpečnosti a představuje řadu výzev, zejména ve scénářích uvažujících proud dat, kde dochází k silné změně distribuce a velkému zpoždění mezi obdržením dat a získáním jejich třídy. Změna distribuce je charakterizována přítomností vysoce indikativních, ale rychle se měnících rysů, jako jsou specifické názvy souborů nebo mutexy. Malware však vykazuje také řadu stabilních rysů, jako jsou typy připojení nebo metody zpeněžení, které zůstávají v čase relativně konzistentní. V této práci formalizujeme tento scénář a dále zkoumáme hypotézu, že adaptivní odstranění silně driftujících podmnožin rysů může mít velký vliv na výkonnost algoritmu. V práci prokážeme, že současné metody opravdu vykazují nedostatky spojené s těmito rysy, zejména potom v krátkých obdobích po příchodu nové distribuce. Abychom ověřili hypotézu o zlepšení výkonnosti prostřednictvím adaptivního odstraňování příznaků, předkládáme dvě řešení: jedno založené na detekci změny distribuce pomocí Hellingerovy vzdálenosti a druhé na inkrementálním algoritmu Gaussian Mixture Model. Oba přístupy vyhodnocujeme na reálných datech a na naší syntetické datové sadě a ukazujeme výrazné zlepšení na syntetických datech a slibné výsledky na reálných datech. Kromě toho uvádíme komplexní vysvětlení technik...	cs_CZ
dc.description.abstract	Malware detection is a crucial aspect of cybersecurity, presenting several challenges, particularly in data stream scenarios that experience strong concept drift and label delay. The concept drift is characterized by the presence of highly influential yet rapidly changing features, such as specific filenames or mutexes, alongside stable features, such as connection types or monetization methods, which remain relatively consistent over time. In this thesis, we formalize this scenario and explore the hypothesis that the adaptive removal of severely drifting subsets of features may have a great impact on algorithm performance. We demonstrate that current methods indeed exhibit shortcomings connected with these features, especially during short periods following the arrival of a new concept. To validate the hypothesis of performance improvement through adaptive feature elimination, we propose two solutions: one based on Hellinger distance concept drift detection and the other on an incremental Gaussian Mixture Model algorithm. We evaluate both approaches using real-life data and our synthetic dataset, showing significant improvements on the synthetic dataset and promising results on real-life data. Additionally, we provide a comprehensive explanation of the techniques employed in the thesis.	en_US
dc.language	English	cs_CZ
dc.language.iso	en_US
dc.publisher	Univerzita Karlova, Matematicko-fyzikální fakulta	cs_CZ
dc.subject	malware detection\|concept drift\|data stream\|concept drift detection\|Gaussian Mixture Models	en_US
dc.subject	detekce škodlivého software\|změna distribuce\|proud dat\|detekce změny distribuce\|Gaussian Mixture Models	cs_CZ
dc.title	Classification in data streams with abrupt concept drift in a subset of features	en_US
dc.type	diplomová práce	cs_CZ
dcterms.created	2024
dcterms.dateAccepted	2024-09-06
dc.description.department	Department of Algebra	en_US
dc.description.department	Katedra algebry	cs_CZ
dc.description.faculty	Matematicko-fyzikální fakulta	cs_CZ
dc.description.faculty	Faculty of Mathematics and Physics	en_US
dc.identifier.repId	266139
dc.title.translated	Klasifikace proudu dat s náhlou změnou distribuce v podmnožině charakteristik	cs_CZ
dc.contributor.referee	Bošanský, Branislav
thesis.degree.name	Mgr.
thesis.degree.level	navazující magisterské	cs_CZ
thesis.degree.discipline	Mathematics for Information Technologies	en_US
thesis.degree.discipline	Matematika pro informační technologie	cs_CZ
thesis.degree.program	Mathematics for Information Technologies	en_US
thesis.degree.program	Matematika pro informační technologie	cs_CZ
uk.thesis.type	diplomová práce	cs_CZ
uk.taxonomy.organization-cs	Matematicko-fyzikální fakulta::Katedra algebry	cs_CZ
uk.taxonomy.organization-en	Faculty of Mathematics and Physics::Department of Algebra	en_US
uk.faculty-name.cs	Matematicko-fyzikální fakulta	cs_CZ
uk.faculty-name.en	Faculty of Mathematics and Physics	en_US
uk.faculty-abbr.cs	MFF	cs_CZ
uk.degree-discipline.cs	Matematika pro informační technologie	cs_CZ
uk.degree-discipline.en	Mathematics for Information Technologies	en_US
uk.degree-program.cs	Matematika pro informační technologie	cs_CZ
uk.degree-program.en	Mathematics for Information Technologies	en_US
thesis.grade.cs	Velmi dobře	cs_CZ
thesis.grade.en	Very good	en_US
uk.abstract.cs	Detekce malwaru je klíčovým aspektem kybernetické bezpečnosti a představuje řadu výzev, zejména ve scénářích uvažujících proud dat, kde dochází k silné změně distribuce a velkému zpoždění mezi obdržením dat a získáním jejich třídy. Změna distribuce je charakterizována přítomností vysoce indikativních, ale rychle se měnících rysů, jako jsou specifické názvy souborů nebo mutexy. Malware však vykazuje také řadu stabilních rysů, jako jsou typy připojení nebo metody zpeněžení, které zůstávají v čase relativně konzistentní. V této práci formalizujeme tento scénář a dále zkoumáme hypotézu, že adaptivní odstranění silně driftujících podmnožin rysů může mít velký vliv na výkonnost algoritmu. V práci prokážeme, že současné metody opravdu vykazují nedostatky spojené s těmito rysy, zejména potom v krátkých obdobích po příchodu nové distribuce. Abychom ověřili hypotézu o zlepšení výkonnosti prostřednictvím adaptivního odstraňování příznaků, předkládáme dvě řešení: jedno založené na detekci změny distribuce pomocí Hellingerovy vzdálenosti a druhé na inkrementálním algoritmu Gaussian Mixture Model. Oba přístupy vyhodnocujeme na reálných datech a na naší syntetické datové sadě a ukazujeme výrazné zlepšení na syntetických datech a slibné výsledky na reálných datech. Kromě toho uvádíme komplexní vysvětlení technik...	cs_CZ
uk.abstract.en	Malware detection is a crucial aspect of cybersecurity, presenting several challenges, particularly in data stream scenarios that experience strong concept drift and label delay. The concept drift is characterized by the presence of highly influential yet rapidly changing features, such as specific filenames or mutexes, alongside stable features, such as connection types or monetization methods, which remain relatively consistent over time. In this thesis, we formalize this scenario and explore the hypothesis that the adaptive removal of severely drifting subsets of features may have a great impact on algorithm performance. We demonstrate that current methods indeed exhibit shortcomings connected with these features, especially during short periods following the arrival of a new concept. To validate the hypothesis of performance improvement through adaptive feature elimination, we propose two solutions: one based on Hellinger distance concept drift detection and the other on an incremental Gaussian Mixture Model algorithm. We evaluate both approaches using real-life data and our synthetic dataset, showing significant improvements on the synthetic dataset and promising results on real-life data. Additionally, we provide a comprehensive explanation of the techniques employed in the thesis.	en_US
uk.file-availability	V
uk.grantor	Univerzita Karlova, Matematicko-fyzikální fakulta, Katedra algebry	cs_CZ
thesis.grade.code	2
uk.publication-place	Praha	cs_CZ
uk.thesis.defenceStatus	O