Abstractive Summarization of spoken and written instructions with BERT
1 Introduction
The motivation behind this work is to make the growing amount of user-generated online content more accessible, helping users digest the ever-growing volume of information at their disposal.
However, many creators of online content use a variety of casual language and professional jargon to advertise their content. Summarizing this type of content therefore requires not only extracting important information from the source but also transforming it into a more coherent and structured output. That is why this paper focuses on both extractive and abstractive summarization of narrated instructions in both written and spoken forms.
Problem: language models for summarization of conversational text often face issues with fluency, intelligibility and repetition.
Aim of this paper: use a BERT-based model to summarize spoken language from ASR (speech-to-text) inputs, in order to develop a general tool that can be used across a variety of domains for How2 articles and videos.
2 Prior work
The work on sequence-to-sequence models by Sutskever et al. and Cho et al. opened up new possibilities for neural networks in natural language processing (NLP). From 2014 to 2015, LSTMs became the dominant approach in the industry, achieving state-of-the-art results.
=> Such architectural changes proved successful in tasks such as speech recognition, machine translation, parsing and image captioning.
In 2017, a paper by Vaswani et al. provided a solution to the fixed-length vector problem, enabling neural networks to focus on the important parts of the input for prediction tasks. Attention mechanisms combined with Transformers then became dominant for tasks such as translation and summarization.
=> In abstractive video summarization, models which incorporate variations of LSTMs and deep layered neural networks have become state-of-the-art performers. In addition to textual inputs, recent research in multi-modal summarization incorporates visual and audio modalities into language models to generate summaries of video content.
In this paper, video summarization is approached by extending top-performing single-document text summarization models to a combination of narrated instructional videos, texts and news documents of various styles, lengths and literary attributes.
3 Methodology
3.1 Data Collection
Used datasets: CNN/DailyMail, Wikihow and How2.
n.b. Despite the development of instructional datasets such as Wikihow and How2, advancements in summarization have been limited by the availability of human-annotated transcripts and summaries. To push the boundaries of this research, the authors complemented existing labeled summarization datasets with auto-generated instructional video scripts and human-curated descriptions.
3.2 Preprocessing
Due to the diversity and complexity of the input data, the authors built a pre-processing pipeline for aligning the data to a common format.
=> In order to maintain fluency and coherence in the human-written summaries, the data were cleaned and sentence structures restored. Entity detection was also applied with the open-source library spaCy, on top of the NLTK library, which was used to remove introductions and anonymize the inputs of the summarization model.
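As an illustration, here is a minimal sketch of such a pipeline, assuming spaCy's small English model and NLTK's sentence tokenizer; the cleaning heuristics below are hypothetical stand-ins for the authors' exact rules:

```python
# Minimal, hypothetical sketch of the preprocessing described above:
# sentence restoration with NLTK, entity anonymization with spaCy.
# Requires: pip install spacy nltk && python -m spacy download en_core_web_sm
import re

import nltk
import spacy

nltk.download("punkt", quiet=True)
nlp = spacy.load("en_core_web_sm")

def preprocess(raw_text: str) -> str:
    """Clean a transcript and anonymize person/organization names."""
    # Collapse whitespace left over from ASR output or scraped captions.
    text = re.sub(r"\s+", " ", raw_text).strip()

    # Restore sentence boundaries with NLTK's tokenizer.
    sentences = nltk.sent_tokenize(text)

    # Drop a leading greeting/introduction sentence (a simple heuristic,
    # not the authors' exact rule).
    if sentences:
        first_word = sentences[0].lower().split()[0].strip(",.!")
        if first_word in {"hi", "hello", "welcome"}:
            sentences = sentences[1:]

    # Anonymize named entities so the model does not rely on
    # speaker or channel names.
    anonymized = []
    for sent in sentences:
        doc = nlp(sent)
        for ent in doc.ents:
            if ent.label_ in {"PERSON", "ORG"}:
                sent = sent.replace(ent.text, f"[{ent.label_}]")
        anonymized.append(sent)
    return " ".join(anonymized)
```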
3.3 Summarization model
The BertSum models proposed by Yang Liu and Mirella Lapata in their paper Text Summarization with Pretrained Encoders (2019) form the basic structure of the model used in this paper.
This includes both extractive and abstractive summarization models, which employ a document-level encoder based on BERT. The architecture pairs a pretrained BERT encoder with a randomly initialized Transformer decoder, and uses two different learning rates: a low rate for the encoder and a higher rate for the decoder to enhance learning.
As stated in previous research, the original model contained more than 180 million parameters and used two Adam optimizers, with beta1 = 0.9 and beta2 = 0.999, for the encoder and decoder respectively. In this model, the encoder used a learning rate of 0.002 and the decoder a learning rate of 0.2, to ensure that the encoder was trained with more accurate gradients while the decoder became stable.
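A minimal sketch of this two-optimizer setup in PyTorch, assuming a bert-base encoder and a small Transformer decoder; the warmup schedules used in BertSum are omitted for brevity:

```python
# Illustrative sketch: a pretrained BERT encoder with a low learning
# rate, a freshly initialized Transformer decoder with a higher one,
# each driven by its own Adam optimizer as described above.
import torch
from torch import nn
from transformers import BertModel

encoder = BertModel.from_pretrained("bert-base-uncased")
decoder_layer = nn.TransformerDecoderLayer(d_model=768, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

# Two optimizers with the betas and learning rates quoted in the text.
enc_optim = torch.optim.Adam(encoder.parameters(), lr=2e-3, betas=(0.9, 0.999))
dec_optim = torch.optim.Adam(decoder.parameters(), lr=0.2, betas=(0.9, 0.999))
```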
=> The curriculum learning hypothesis was applied to the training order: the model is first trained on textual scripts and then on video scripts, which present the additional challenges of ad-hoc flow and conversational language. A toy sketch of this staged training is given below.
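In this sketch the datasets and training step are placeholders, not the authors' code; only the stage ordering reflects the idea described above:

```python
# Hypothetical sketch of the curriculum ordering: the model sees
# well-formed written scripts before noisy, conversational ASR scripts.
from typing import Callable, List, Tuple

def train_curriculum(train_step: Callable[[str], None],
                     stages: List[Tuple[str, List[str]]],
                     epochs_per_stage: int = 1) -> None:
    """Run training stage by stage, preserving the curriculum order."""
    for name, dataset in stages:
        print(f"Stage: {name} ({len(dataset)} examples)")
        for _ in range(epochs_per_stage):
            for example in dataset:
                train_step(example)

# Toy usage with stand-in data and a no-op training step.
stages = [
    ("written scripts", ["How to fold a fitted sheet. First, ..."]),
    ("ASR video scripts", ["uh so today im gonna show you how to ..."]),
]
train_curriculum(lambda example: None, stages)
```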
3.4 Scoring of results
Results were scored using ROUGE, the standard metric for abstractive summarization. In addition, the authors added Content F1 scoring, a metric proposed by Carnegie Mellon University that focuses on the relevance of content. Finally, to score passages with no written summaries, human judges were surveyed through an evaluation framework built with Python, Google Forms and Excel spreadsheets.
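For illustration, a sketch of the scoring step using the open-source rouge-score package; the Content F1 function below is a simplified approximation (F1 over content-word overlap), not CMU's exact implementation:

```python
# Requires: pip install rouge-score nltk
import nltk
from rouge_score import rouge_scorer

nltk.download("stopwords", quiet=True)
STOPWORDS = set(nltk.corpus.stopwords.words("english"))

def content_f1(reference: str, candidate: str) -> float:
    """Approximate Content F1: harmonic mean of precision/recall
    over non-stopword token overlap."""
    ref = {w for w in reference.lower().split() if w not in STOPWORDS}
    cand = {w for w in candidate.lower().split() if w not in STOPWORDS}
    if not ref or not cand:
        return 0.0
    overlap = len(ref & cand)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "boil the pasta for ten minutes then drain it"
candidate = "cook the pasta ten minutes and drain"
print(scorer.score(reference, candidate))          # ROUGE-1/2/L scores
print("Content F1:", round(content_f1(reference, candidate), 3))
```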
4 Experiments and results
The BertSum model trained on CNN/DailyMail achieved state-of-the-art scores when applied to samples from those datasets. However, when tested on the How2 test dataset, it performed very poorly, showing a lack of generalization in the model.
The best results on How2 videos were accomplished by leveraging the full set of labeled datasets with an order-preserving configuration.
=> The best ROUGE score obtained in this configuration was comparable to the best results among news documents.
5 Conclusion
Despite employing BERT, the scores obtained did not surpass those reported in other research papers. However, the model did appear to improve the fluency and efficiency of the summaries for users in the How-To domain.
Abstractive summaries appear to be helpful in reducing the effects of speech-to-text errors observed in some video transcripts, especially auto-generated closed captioning.