Abstractive Summarization of spoken and written instructions with BERT

 

1 Introduction

 

The motivation behind this work is to make the growing amount of user-generated online content more accessible, helping users digest more easily the ever-growing information put at their disposal.

 

However, many creators of online content use a mix of casual language and professional jargon to advertise their content. Summarizing this type of content therefore requires not only extracting the important information from the source but also transforming it into a more coherent and structured output. That is why this paper focuses on both extractive and abstractive summarization of narrated instructions in both written and spoken forms.

 

Problem: Language models for summarization of conversational text often face issues with fluency, intelligibility and repetition.

 

Aim of this paper: Use a BERT-based model for summarizing spoken language from ASR (speech-to-text) inputs, in order to develop a general tool that can be used across a variety of domains for How2 articles and videos.

 

2 Prior work

 

The work on sequence-to-sequence models by Sutskever et al. and Cho et al. opened up new possibilities for neural networks in natural language processing (NLP). From 2014 to 2015, LSTMs became the dominant approach in the industry and achieved state-of-the-art results.

=> Such architectures proved successful in tasks such as speech recognition, machine translation, parsing and image captioning.

 

In 2017, a paper by Vaswani et al. provided a solution to the fixed-length vector problem, enabling neural networks to focus on the important parts of the input for prediction tasks. Attention mechanisms with Transformers then became dominant for tasks such as translation and summarization.

=> In abstractive video summarization, models that incorporate variations of LSTMs and deep layered neural networks have become state-of-the-art performers. In addition to textual inputs, recent research in multi-modal summarization incorporates visual and audio modalities into language models to generate summaries of video content.

 

In this paper, video summarization is approached by extending top-performing single-document text summarization models to a combination of narrated instructional videos, texts and news documents of various styles, lengths and literary attributes.

 

3 Methodology

 

3.1 Data Collection 

 

Datasets used:

  • CNN/DailyMail dataset
  • Wikihow dataset
  • How2 Dataset

n.b. Despite the development of instructional datasets such as Wikihow and How2, advances in summarization have been limited by the availability of human-annotated transcripts and summaries. To extend these research boundaries, the authors complemented existing labeled summarization datasets with auto-generated instructional video scripts and human-curated descriptions.
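
As an illustration (not taken from the paper), CNN/DailyMail can be loaded through the Hugging Face datasets library; Wikihow and How2 usually require separate manual downloads:

```python
# Hedged example: loading CNN/DailyMail with the Hugging Face `datasets`
# library. The paper does not state which loaders were used; Wikihow and
# How2 typically have to be downloaded and prepared separately.
from datasets import load_dataset

# Grab a small slice of the training split for quick inspection.
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="train[:1%]")

sample = cnn_dm[0]
print(sample["article"][:200])   # source news article
print(sample["highlights"])      # reference (abstractive) summary
```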

 

3.2 Preprocessing

 

Due to the diversity and complexity of the input data, the authors built a preprocessing pipeline to align the data to a common format.

 

=> To maintain the fluency and coherence of human-written summaries, the data were cleaned and sentence structures restored. Entity detection from the open-source library spaCy was also applied, on top of the NLTK library, to remove introductions and anonymize the inputs of the summarization model.
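
A minimal sketch of that kind of pipeline is shown below, assuming spaCy's small English model and NLTK's sentence tokenizer; the cleaning rules and the "[PERSON]" placeholder are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal preprocessing sketch: sentence splitting with NLTK and
# person-entity anonymization with spaCy. The cleaning rules and the
# "[PERSON]" placeholder are illustrative, not the authors' exact pipeline.
import nltk
import spacy

nltk.download("punkt", quiet=True)
nlp = spacy.load("en_core_web_sm")

def preprocess(raw_text: str) -> str:
    cleaned = []
    for sent in nltk.sent_tokenize(raw_text):
        doc = nlp(sent)
        anonymized = sent
        # Replace detected person names to anonymize the model inputs.
        for ent in doc.ents:
            if ent.label_ == "PERSON":
                anonymized = anonymized.replace(ent.text, "[PERSON]")
        cleaned.append(anonymized.strip())
    return " ".join(cleaned)

print(preprocess("Hi, I'm John Smith. Today I'll show you how to change a tire."))
# -> "Hi, I'm [PERSON]. Today I'll show you how to change a tire."
```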

 

3.3 Summarization model

 

The BertSum models proposed by Yang Liu and Mirella Lapata in their paper Text Summarization with Pretrained Encoders (2019) form the basic structure for the model used in this paper.

 

This includes both extractive and abstractive summarization models, which employ a document-level encoder based on BERT. The architecture pairs a pretrained BERT encoder with a randomly initialized Transformer decoder. It uses two different learning rates: a low rate for the encoder and a separate, higher rate for the decoder to enhance learning.

 

As stated in previous research, the original model contained more than 180 million parameters and used two Adam optimizers with beta1 = 0.9 and beta2 = 0.999 for the encoder and decoder respectively. In this model, the encoder used a learning rate of 0.002 and the decoder a learning rate of 0.2, so that the encoder is trained with more accurate gradients while the decoder becomes stable.
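
The two-optimizer setup can be sketched in PyTorch as below; the tiny encoder and decoder modules are placeholders for the pretrained BERT encoder and randomly initialized Transformer decoder, and only the learning rates and betas come from the description above.

```python
# Sketch of the two-optimizer schedule described above. The small encoder /
# decoder modules are placeholders (the real model uses a pretrained BERT
# encoder and a randomly initialized Transformer decoder); only the learning
# rates (0.002 / 0.2) and betas (0.9, 0.999) come from the text above.
import torch
from torch import nn

class Summarizer(nn.Module):
    def __init__(self, d_model: int = 768):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)

model = Summarizer()

# Low learning rate for the (pretrained) encoder, higher rate for the
# (randomly initialized) decoder, each with its own Adam optimizer.
enc_optim = torch.optim.Adam(model.encoder.parameters(), lr=0.002, betas=(0.9, 0.999))
dec_optim = torch.optim.Adam(model.decoder.parameters(), lr=0.2, betas=(0.9, 0.999))
```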

 

=> The curriculum learning hypothesis is applied to the training order: the model is first trained on written text scripts and then on video scripts, which present the additional challenges of ad-hoc flow and conversational language.
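
A toy sketch of this training order is given below; `train_one_stage` and the example data are placeholders, since the paper's actual training loop is not reproduced here.

```python
# Toy sketch of the curriculum ordering: written scripts first, then the
# noisier conversational video (ASR) scripts. `train_one_stage` and the
# example data are placeholders, not the authors' training code.

def train_one_stage(model, examples):
    """Stand-in for one fine-tuning pass over a dataset."""
    for source, target in examples:
        pass  # forward pass, loss computation and optimizer steps go here

written_scripts = [("How to tie a bowline knot: first form a small loop...", "Tie a bowline knot.")]
video_scripts = [("uh so first what you wanna do is make like a little loop", "Tie a bowline knot.")]

model = None  # placeholder for the BERT-based summarizer
for stage_name, data in [("written scripts", written_scripts), ("video scripts", video_scripts)]:
    print(f"Fine-tuning on {stage_name}: {len(data)} example(s)")
    train_one_stage(model, data)
```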

 

3.4 Scoring of results

 

Results were scored using ROUGE, the standard metric for abstractive summarization. Additionally, we added Content F1 scoring, a metric proposed by Carnegie Mellon University that focuses on the relevance of content. Finally, to score passages with no written summaries, we surveyed human judges with an evaluation framework built using Python, Google Forms and Excel spreadsheets.
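
As an illustration of the primary metric (not the authors' exact evaluation script), ROUGE can be computed with the open-source rouge-score package:

```python
# Illustration of ROUGE scoring with the open-source `rouge-score` package;
# this is not the authors' evaluation code, just a demonstration of the metric.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "cut the onion into thin slices and fry them until golden"
candidate = "slice the onion thinly and fry it until it turns golden"

scores = scorer.score(reference, candidate)
for name, s in scores.items():
    print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")
```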

 

4 Experiments and results


The BertSum model trained on CNN/DailyMail achieved state-of-the-art scores when applied to samples from those datasets. However, when tested on our How2 test dataset, it showed very poor performance and a lack of generalization.

 

The best results on How2 videos were accomplished by leveraging the full set of labeled datasets with an order-preserving configuration.

 

=> The best ROUGE score obtained in this configuration was comparable to the best results among news documents.

 

5 Conclusion

 

Despite employing BERT, the scores obtained did not surpass those reported in other research papers. However, the model did appear to improve the fluency and efficiency of the summaries for users in the How-To domain.

 

Abstractive summaries appear to help reduce the effects of speech-to-text errors observed in some video transcripts, especially auto-generated closed captioning.