Text Summarization

(extractive vs abstractive)

Image by Kelly Sikkema

Text summarization is critical for large enterprises that handle customer support or compliance cases, because case descriptions are often long and highly detailed.

In machine learning, extraction and abstraction are two techniques for text summarization. Large enterprises can use them either to summarize such case descriptions automatically or to recommend alternative text that the case officer can use when summarizing the case report.


Extraction is generally better suited to large enterprises because it builds the summary from keywords already present in the case description. The resulting summary stays closely tied to the case, with less risk of presenting case officers with generated sentences that are irrelevant.

Extraction works on text that has been stripped of punctuation and stop words, since neither contributes to the summary. First, a list of words is generated from the given text with all punctuation removed. Then, depending on the normalization chosen (stemming or lemmatization), a list of tokens is generated with the stop words removed. From this list of tokens, n-grams (contiguous sequences of n words) are generated. The text is split into sentences with a sentence tokenizer, and each sentence is scored by the n-grams it contains. The summary is then assembled from these sentences using the sentence scores and the average sentence score, for example by keeping the sentences that score above the average.
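The steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the stop-word list is a tiny hand-picked sample, the sentence splitter is a naive regular expression, stemming and lemmatization are skipped, and the function names (split_sentences, tokenize, ngrams, summarize) are hypothetical. A sentence is scored by the average frequency of its n-grams across the whole text, and sentences that score at or above the average are kept.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use a fuller one).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "to", "in",
              "and", "or", "for", "on", "it", "this", "that", "with", "as", "by"}


def split_sentences(text):
    """Naive sentence tokenizer: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence, drop punctuation, and remove stop words."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def ngrams(tokens, n):
    """Return the contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def summarize(text, n=1, threshold=1.0):
    """Keep the sentences whose score is at least threshold * average score."""
    sentences = split_sentences(text)
    grams_per_sentence = [ngrams(tokenize(s), n) for s in sentences]

    # Frequency of every n-gram across the whole text.
    freq = Counter(g for grams in grams_per_sentence for g in grams)

    # Sentence score: mean frequency of the n-grams the sentence contains.
    scores = [
        sum(freq[g] for g in grams) / len(grams) if grams else 0.0
        for grams in grams_per_sentence
    ]
    if not scores:
        return ""

    average = sum(scores) / len(scores)
    kept = [s for s, score in zip(sentences, scores) if score >= threshold * average]
    return " ".join(kept)


case_text = (
    "The customer reported a login failure. "
    "The login failure happened after a password reset. "
    "Weather was nice."
)
summary = summarize(case_text)
print(summary)
```

Because every kept sentence is copied verbatim from the input, the summary can only contain wording that already exists in the case description. Raising the threshold parameter makes the summary shorter; passing n=2 scores sentences by bigrams instead of single words.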


Abstraction, on the other hand, generates a summary of the case description that captures the essence of the source text. The generated summary may contain new phrases and sentences that do not appear in the source text. A vocabulary and a word embedding model are built from the source text and are then used to build the text summarization model. The model uses an encoder and a decoder to produce a text output from a text input. A summary generated this way can be longer than one generated by extraction, because it introduces new phrases and sentences rather than reusing the ones already available in the source text.
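To make the vocabulary step concrete, here is a minimal sketch of how source text might be turned into the integer sequences that an encoder consumes and a decoder emits. It covers only this data preparation step, not the embedding model or the encoder-decoder itself. The special tokens, the whitespace tokenization, and the function names (build_vocab, encode, decode) are assumptions made for illustration.

```python
from collections import Counter

# Special tokens commonly used with encoder-decoder models (names are assumptions).
PAD, UNK, START, END = "<pad>", "<unk>", "<start>", "<end>"


def build_vocab(texts, min_count=1):
    """Map each word seen at least min_count times to an integer id."""
    counts = Counter(w for t in texts for w in t.lower().split())
    vocab = {PAD: 0, UNK: 1, START: 2, END: 3}
    # Most frequent words first; ties broken alphabetically so ids are stable.
    for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab


def encode(text, vocab, max_len):
    """Convert text to a fixed-length list of ids (truncated or padded to max_len)."""
    words = text.lower().split()
    ids = [vocab[START]] + [vocab.get(w, vocab[UNK]) for w in words] + [vocab[END]]
    ids = ids[:max_len]  # long inputs are cut off and may lose the end token
    return ids + [vocab[PAD]] * (max_len - len(ids))


def decode(ids, vocab):
    """Convert ids back to text, dropping padding and start/end markers."""
    inverse = {i: w for w, i in vocab.items()}
    words = [inverse[i] for i in ids]
    return " ".join(w for w in words if w not in (PAD, START, END))


vocab = build_vocab(["customer cannot log in", "customer reset password"])
encoded = encode("customer cannot pay", vocab, max_len=8)
print(encoded)
print(decode(encoded, vocab))
```

The word "pay" is not in the vocabulary, so it maps to the unknown token. In a full system, an embedding layer would map each id to a vector, and the decoder would emit ids one at a time. Those ids are then converted back to words, which is how the summary can contain phrases that never appear in the source text.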


In conclusion, both techniques can reduce the time a lawyer or document assessment officer needs to create a summary of a document by up to 20%, especially for lengthy legal documents. If keeping the original sentences, keywords, and phrases is a priority, then extraction is the best option for the enterprise. If the length of the summary is not a constraint, then abstraction can be used to generate a summary with new sentences that differ from the ones in the source text. Such a summary may sound more natural to a human reader, because it is not just a copy of the source text.