Topic based textual summarization

Ayush Kumar
5 min read · Jul 12, 2018


A document can contain information on multiple topics, e.g., sports and politics. A simple summarization method would present this document in a concise format (a single summary). How about a setup that takes the topics present in the document into account and generates a summary tuned to a chosen topic (a topic-based summary)? Such a method was published as “Generating Topic-Oriented Summaries Using Neural Attention” at NAACL 2018.

Authors: Kundan Krishna and Balaji Vasan Srinivasan

Problem at hand?

To generate a summary tuned to a specific topic of interest.

Title: IMF backs Universal Basic Income in India, serves Modi govt a political opportunity

Article: Ahead of Union Budget 2018, the Narendra Modi-led government’s last full-year budget to be presented in February, the International Monetary Fund (IMF) has made a strong case for India adopting a fiscally neutral Universal Basic Income by eliminating both food and fuel subsidies …

- Business: imf claim eliminating energy “ tax subsidies “ would require a increase in fuel taxes and retail fuel prices such as petrol prices and tax of rs400 ($ 6) per tonne on coal consumption …

- Politics: narendra modi-led government ‘s last full-year budget to be presented in february. the international monetary fund has made a strong case for india adopting a fiscally neutral universal basic income by eliminating both food and fuel subsidies …

- Social: universal basic income is a form of social security guaranteed to citizens and transferred directly to their bank accounts and is being debated globally …

Input and Output

Input

  • an input document, a topic of interest

Output

  • a concise summary of the input document tuned to the topic

What already exists?

  • Extractive summarization: Methods that identify the relevant contexts/sentences in the document that are needed to encapsulate its information. These are usually statistical methods leveraging corpus-based counts and overlaps between segments. Two such methods are TextRank and Statistical-GraphicalRank, both variants of the PageRank algorithm (a toy sketch follows this list). Recently, authors have used RNNs (SummaRuNNer) to select important sentences via (sequence) binary classification.
  • Abstractive summarization: With the surge in deep learning based methods, the encoder-decoder setup has swept the floor, and summarization is no exception. One of the recent methods leverages the pointer-generator (PG) network. Earlier methods revolved around template-based approaches.
  • Topical summarization: Approaches involve two steps: 1. identifying the topics of the document; 2. identifying topic-relevant sentences and assigning them higher weightage during summarization.
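To make the extractive idea concrete, here is a minimal TextRank-style sketch (my own illustration, not from the paper): sentences become graph nodes, word overlap becomes the edge weight, and PageRank scores pick the summary sentences. The similarity function and the use of networkx are illustrative choices.

```python
# Minimal TextRank-style extractive summarizer (illustrative sketch).
import networkx as nx

def overlap_similarity(s1, s2):
    """Simple word-overlap similarity between two sentences."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / (len(w1) + len(w2))

def textrank_summary(sentences, top_k=2):
    # Build a sentence graph with overlap-based edge weights.
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            w = overlap_similarity(sentences[i], sentences[j])
            if w > 0:
                graph.add_edge(i, j, weight=w)
    # PageRank over the sentence graph; top-scoring sentences form the summary.
    scores = nx.pagerank(graph, weight="weight")
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [sentences[i] for i in sorted(best)]
```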

Why look for a new method?

By and large, existing methods concentrate on generating a single, generic summary. The methods that do consider topics identify relevant sentences based on sentence-level features.

Directly incorporating such sentence-level statistics into a seq2seq framework is something that still needs to be explored.

A quick overview

The paper advances the pointer-generator based seq2seq (encoder-decoder) framework into a ‘topic aware pointer-generator network’ that takes as input a document and the topic against which the summary is to be generated. A solution to the bottleneck of creating (document, topic, summary) training triplets is also laid out quite comprehensively.

What do they propose?

In contrast to the vanilla PG, the proposed method uses the topic as a cue in addition to word embeddings in the encoder. The authors concatenate a one-hot topic vector with each word embedding to create the input sequence. The non-zero value in the topic vector denotes the amount of bias that should be put towards that topic while generating the summary.
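A minimal sketch of what this concatenation could look like (the topic set, embedding size and bias value here are my own illustrative assumptions, not the paper’s):

```python
import numpy as np

TOPICS = ["business", "politics", "social"]  # illustrative topic set

def topic_aware_inputs(word_embeddings, topic, bias=1.0):
    """Append a one-hot topic vector to every word embedding of the document.

    word_embeddings: (seq_len, emb_dim) array of word embeddings.
    topic: topic of interest for this summary.
    bias: non-zero value controlling how strongly to steer towards the topic.
    """
    one_hot = np.zeros(len(TOPICS), dtype=np.float32)
    one_hot[TOPICS.index(topic)] = bias
    # Repeat the topic vector for every timestep and concatenate along features.
    topic_block = np.tile(one_hot, (word_embeddings.shape[0], 1))
    return np.concatenate([word_embeddings, topic_block], axis=1)

# e.g. a 6-word document with 128-dim embeddings, summarized towards "politics"
enc_inputs = topic_aware_inputs(np.random.randn(6, 128).astype(np.float32), "politics")
print(enc_inputs.shape)  # (6, 131)
```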

This topic conditioning is the main contribution of the paper, along with a mechanism to create a dataset of (raw document, topic, topic-oriented summary) triplets for topic-based summarization.

Let’s dive into the crux!

Vanilla pointer-generator model: At each timestep t, p-gen is calculated, which, when combined with the attention distribution (copying) and the vocabulary distribution (generating), yields the final distribution. The motivation is to include OOV words in the final distribution. Model architecture taken from the pointer-generator paper (https://arxiv.org/pdf/1704.04368.pdf).

Let’s see the architecture:

  • Pointer-generation: A mechanism to decide whether the next token should be generated from the vocabulary or copied from the input. The choice between generating and copying is made probabilistically. The resulting modified probability also handles out-of-vocabulary words, a positive contrast to vanilla seq2seq (see the sketch after this list).
  • Coverage mechanism: To address the issue of repeated tokens in the generated sequence, a coverage vector is proposed and used as an extra input to the attention mechanism. The coverage vector is the sum of the attention distributions over all previous decoder timesteps and is therefore representative of the attention paid so far.
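As a rough implementation sketch of these two ideas (array sizes and variable names are mine, not the paper’s): the vocabulary distribution and the copy (attention) distribution are mixed with p-gen over an extended vocabulary, and a running coverage vector accumulates past attention.

```python
import numpy as np

def final_distribution(p_gen, p_vocab, attention, src_ids, extended_vocab_size):
    """Mix the generate and copy distributions over an extended vocabulary.

    p_vocab:   (vocab_size,) distribution over the fixed vocabulary (generating).
    attention: (src_len,) attention over source tokens at this decoder step (copying).
    src_ids:   extended-vocabulary ids of the source tokens; OOV source words get
               temporary ids beyond the fixed vocabulary, so they can still be copied.
    """
    p_final = np.zeros(extended_vocab_size)
    p_final[: len(p_vocab)] = p_gen * p_vocab            # generate from vocabulary
    for i, idx in enumerate(src_ids):                    # copy from the source
        p_final[idx] += (1.0 - p_gen) * attention[i]
    return p_final

# Coverage vector: running sum of attention distributions from previous decoder steps,
# fed back into the attention computation to discourage re-attending to the same words.
src_len = 5
coverage = np.zeros(src_len)
for attn_t in [np.full(src_len, 0.2), np.array([0.5, 0.2, 0.1, 0.1, 0.1])]:
    coverage += attn_t
```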

On top of the PG network, the authors propose to induce topic awareness by appending the topic as a one-hot feature to the word embeddings while encoding the document.

Understanding the decoding process:

There are a couple of major steps involved in generating the word w at timestep t:

  • Computing attention: This takes as input the encoded states (indexed by i), the current decoder state (s) and the coverage vector (c). Attention determines the relative importance of the words in the input sequence and is computed at each decoding step (the reconstructed equations follow this list).
i represents ith word in input sequence. t represents t-th timestep in decoding sequence
Attention vector for input sequence at t-th timestep of decoding
  • Computing probabilities: The method computes two probabilities, one deciding whether to generate a new word from the vocabulary (p-gen) and the other deciding which word to generate from the vocabulary (p-vocab).
Based on p-gen, the model probabilistically decides whether to generate a new word or copy one from the input sequence. h*t is the attended encoder state (context vector), st is the decoder state, and yt is the last word generated in the summary.
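The equation images did not carry over here; reconstructed from the original pointer-generator paper (See et al., 2017), the attention and p-gen computations are roughly:

$$e_i^t = v^\top \tanh\left(W_h h_i + W_s s_t + w_c c_i^t + b_{attn}\right), \qquad a^t = \text{softmax}(e^t)$$

$$p_{gen} = \sigma\left(w_{h^*}^\top h_t^* + w_s^\top s_t + w_x^\top x_t + b_{ptr}\right)$$

where x_t is the decoder input at timestep t (the previously generated word).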

p-vocab is computed by passing [st, ht*] through a linear transformation followed by a softmax over the vocabulary.

Total probability of a word w being generated in the summary. For OOV words, p-vocab = 0, and hence their probability of being generated comes from (1 - p-gen) and the attention score they receive.
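Again reconstructed from the pointer-generator paper, the vocabulary distribution and the final (combined) distribution are approximately:

$$P_{vocab} = \text{softmax}\left(V'\left(V[s_t, h_t^*] + b\right) + b'\right)$$

$$P(w) = p_{gen} \, P_{vocab}(w) + (1 - p_{gen}) \sum_{i: w_i = w} a_i^t$$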

Objective Function:

Negative log likelihood of the target word is used as the primary loss function:

Negative log likelihood of the target word at timestep t
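The corresponding equation, as in the pointer-generator paper, is simply:

$$\text{loss}_t = -\log P(w_t^*)$$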

Coverage loss to penalize repeated words in generated sequence:

ct is a distribution over the words in the input document, representing the attention each word has received so far.
The coverage loss attempts to avoid repeatedly attending to the same words; i indexes the words in the input document.
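Reconstructed from the pointer-generator paper, the coverage vector and the coverage loss are:

$$c^t = \sum_{t'=0}^{t-1} a^{t'}, \qquad \text{covloss}_t = \sum_i \min\left(a_i^t, c_i^t\right)$$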

With the addition of the coverage mechanism, the modified loss comes out to be:

Final loss function
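Putting the two together (λ is a hyperparameter weighting the coverage term), the final loss at timestep t is:

$$\text{loss}_t = -\log P(w_t^*) + \lambda \sum_i \min\left(a_i^t, c_i^t\right)$$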

Which dataset do they use?

News articles released at the KDD Data Science + Journalism Workshop 2017, tagged with topics like politics, sports, education, etc.

However, the authors apply a set of methods on top of this data to systematically generate the (document, topic, summary) training triplets.

What numbers do they improve on?

  • ROUGE F1 score: The metric evaluates the similarity between two documents based on the overlap of n-grams. The paper reports ROUGE-1, ROUGE-2 and ROUGE-L (a toy computation is sketched after this list).
  • Authors also show the relevance of the generated summaries using human judgment. The task was to pick the better summary from a given pair (one from the proposed method and the other from a baseline).
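As a toy illustration of the metric only (real evaluations use the standard ROUGE toolkit, with stemming and proper tokenization), a unigram ROUGE-1 F1 could be computed like this:

```python
from collections import Counter

def rouge_n_f1(candidate, reference, n=1):
    """Toy ROUGE-N F1: clipped n-gram overlap between candidate and reference summaries."""
    def ngrams(text, n):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())   # clipped counts of shared n-grams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge_n_f1("imf backs universal basic income in india",
                 "imf makes a case for universal basic income in india"))
```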

Sample output from the model

Sample output from the presented method.
*All figures and equations have either been taken directly from the paper(s) or have been adapted as per my understanding. Reference paper(s) include: Get To The Point: Summarization with Pointer-Generator Networks: https://arxiv.org/pdf/1704.04368.pdf

