Document embedding using Transformers

Burian, David

Embedování dokumentů pomocí Transformerů

diploma thesis (DEFENDED)

Permanent link

http://hdl.handle.net/20.500.11956/190630

Identifiers

Study Information System: 250786

Referee

Variš, Dušan

Faculty / Institute

Faculty of Mathematics and Physics

Discipline

Computer Science - Artificial Intelligence

Department

Institute of Formal and Applied Linguistics

Date of defense

10. 6. 2024

Publisher

Univerzita Karlova, Matematicko-fyzikální fakulta

Language

English

Grade

Excellent

Keywords (Czech)

embedding dokumentů|destilování znalostí|SBERT|Paragraph Vector|Longformer

Keywords (English)

document embedding|knowledge distillation|SBERT|Paragraph Vector|Longformer

V této práci představujeme metodu strojového učení modelů embedujících dokumenty, která není náročná na výpočetní zdroje ani nevyžaduje anotovaná trénovací data. S přístupem učitele a studenta, distilujeme kapacitu SBERTa zaznamenat strukturu textu a schopnost Paragraph Vektoru zpracovat dlouhé dokumenty do našeho výsledného embedovacího modelu. Naši metodu testujeme na Longformeru, Transformeru s řídkou attention vrstvou, který je schopný zpracovat dokumenty dlouhé až 4096 tokenů. Prozkoumáme několik ztrátových funkcí, které nutí studenta (Longformera) napodobovat výstupy obou učitelů (SBERTa a Paragraph Vektoru). V experimentech ukazujeme, že i přes omezený kontext SBERTa, je distilace jeho výstupů pro výkon studenta zásadnější. Nicméně student dokáže získat prospěch z obou učitelů. Naše metoda vylepšuje výsledek Longformera na osmi úlohách, které zahrnují predikci citace, detekci plagiarismu i vyhledávání na základě podobnosti dokumentů. Naše metoda se navíc ukazuje jako obzvláště účinná v situacích s málo dotrénovávacími daty, kde námi natrénovaný student překoná i oba učitele. Podobným výkonem odlišně natrénovaných studentů ukazujeme, že naše metoda je robustní vůči různým změnám, a navrhujeme možné oblasti budoucího výzkumu.

Abstract (English)

We develop a method to train a document embedding model with an unlabeled dataset and low computational resources. Using teacher-student training, we distill SBERT's capacity to capture text structure and Paragraph Vector's ability to encode extended context into the resulting embedding model. We test our method on Longformer, a Transformer model with sparse attention that can process up to 4096 tokens. We explore several loss functions for distilling knowledge from the two teachers (SBERT and Paragraph Vector) into our student model (Longformer). Through our experiments, we show that despite SBERT's short maximum context, its distillation is more critical to the student's performance. However, the student model can benefit from both teachers. Our method improves Longformer's performance on eight downstream tasks, including citation prediction, plagiarism detection, and similarity search. Our method performs exceptionally well when little fine-tuning data is available, where the trained student model outperforms both teacher models. By showing consistent performance of differently configured student models, we demonstrate our method's robustness to various changes and suggest areas for future work.
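The two-teacher distillation described in the abstract can be illustrated with a minimal sketch. The abstract does not say which loss functions the thesis uses, so everything below is an assumption made for illustration: the function names, the choice of mean squared error for the SBERT target, the choice of cosine distance for the Paragraph Vector target, and the `pv_weight` parameter are all hypothetical. The sketch also assumes that all three embeddings already share one dimension, whereas a real setup would likely need a projection between them.

```python
import math
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def two_teacher_loss(
    student: Sequence[float],
    sbert_target: Sequence[float],
    pv_target: Sequence[float],
    pv_weight: float = 0.5,
) -> float:
    """Hypothetical combined loss: the student embedding is pulled toward
    both teacher embeddings, with pv_weight scaling the second term."""
    structural = mse(student, sbert_target)        # assumed SBERT term
    contextual = cosine_distance(student, pv_target)  # assumed Paragraph Vector term
    return structural + pv_weight * contextual


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, not real model outputs.
    print(two_teacher_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
```

In this toy call the student matches the first target exactly and is orthogonal to the second, so only the weighted second term contributes to the loss. Lowering `pv_weight` would reflect the abstract's finding that the SBERT signal matters more, though the weighting actually used in the thesis is not stated here.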
