Summarization (SUMM)

SUMM is the task of automatically generating a gist of a legal case document that captures the critical aspects of the case.

Type of Task Text Generation
Dataset In-Abs (Shukla et al., 2022)
Language English
No. of documents 7,130
Type of summary Abstractive
Evaluation Metric ROUGE-1, ROUGE-2, ROUGE-L, BERT-SCORE

Task Motivation and Description

Going through legal case documents, which often span tens of pages, is a time-consuming and cumbersome activity; a summary of the document can help legal practitioners and make their workflow more efficient.

The task of summarization involves generating a gist (of a legal document) that captures the critical aspects of the case.

Summarization is a standard task in NLP; however, in the case of the legal domain, there are a few additional challenges, such as:

(i) case documents are generally very lengthy, and thus the summaries are long too;

(ii) large-scale summarization datasets are difficult to build since it is expensive to gather expert annotations.

Summarization can be extractive (selecting the most important sentences from the document) or abstractive (generating new text that conveys the gist). In our setting, summarization is an abstractive generation task.

Dataset

We collected Supreme Court of India judgments from the website of the Legal Information Institute of India, which provides free, non-profit access to databases of Indian law. Abstractive summaries (also called headnotes) are available for some of these cases; we include 7,130 such case documents, together with their headnotes/summaries, in the dataset. We reserve 100 randomly selected document-summary pairs for evaluation, and the remaining 7,030 pairs are used for training. The dataset is named In-Abs (Indian legal documents Abstractive summarization) and was released in our previous work (Shukla et al., 2022).

Dataset Format

Each document (JSON) has the following format:

Dict{
  'id': string,              // case identifier
  'num_doc_tokens': int,     // number of words in the full document
  'num_summ_tokens': int,    // number of words in the summary
  'document': List(string),  // sentences of the case document
  'summary': List(string),   // sentences of the summary
}
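
For illustration, a minimal way to load one such document and reconstruct its full text and gold summary is sketched below; the file path is a placeholder, not the actual directory layout of the released dataset.

import json

# Load a single In-Abs document (the path below is a placeholder)
with open("in-abs/train/case_0001.json") as f:
    doc = json.load(f)

print(doc["id"], doc["num_doc_tokens"], doc["num_summ_tokens"])
full_text = " ".join(doc["document"])      # sentences of the case document
gold_summary = " ".join(doc["summary"])    # sentences of the headnote/summary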

Task Evaluation

We use standard metrics for summarization: ROUGE-1, ROUGE-2 and ROUGE-L F1-scores, and BERT-SCORE (Zhang et al., 2020).
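
As a concrete illustration, these metrics can be computed with the rouge-score and bert-score Python packages; this is a minimal sketch under those assumptions, not the exact evaluation script used for the benchmark.

from rouge_score import rouge_scorer
from bert_score import score as bert_score

def evaluate_summary(generated: str, reference: str) -> dict:
    # ROUGE-1/2/L F1 between a generated summary and the gold headnote
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = {name: s.fmeasure for name, s in scorer.score(reference, generated).items()}
    # BERTScore F1 (Zhang et al., 2020); the function expects lists of candidates/references
    _, _, f1 = bert_score([generated], [reference], lang="en")
    scores["bert_score"] = f1.mean().item()
    return scores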

Baseline Models

We apply both extractive and abstractive methods on the dataset.

Extractive Methods

We apply the following extractive techniques:

(i) CaseSummarizer (Legal-specific, Unsupervised)

(ii) DSDR (Open domain, Unsupervised)

(iii) Gist (Legal-specific, Supervised)

(iv) SummaRuNNer (Open domain, Supervised)

To adapt the abstractive gold-standard summaries for these extractive methods, we use the technique suggested by Narayan et al. (2018).
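
The general idea behind such conversions is to derive extractive labels by greedily selecting the document sentences that maximize ROUGE against the abstractive gold summary. The sketch below illustrates this greedy-oracle idea; it is an assumption-level illustration, not necessarily the exact procedure of Narayan et al. (2018).

from rouge_score import rouge_scorer

def greedy_oracle_labels(doc_sentences, gold_summary, max_sentences=20):
    # Greedily pick document sentences whose concatenation maximizes ROUGE-L F1
    # against the abstractive gold summary; the chosen indices act as extractive labels.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    selected, best = [], 0.0
    while len(selected) < max_sentences:
        gains = []
        for i in range(len(doc_sentences)):
            if i in selected:
                continue
            candidate = " ".join(doc_sentences[j] for j in selected + [i])
            gains.append((scorer.score(gold_summary, candidate)["rougeL"].fmeasure, i))
        if not gains:
            break
        f1, idx = max(gains)
        if f1 <= best:  # stop when no remaining sentence improves the score
            break
        best, selected = f1, selected + [idx]
    return sorted(selected)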

Abstractive Methods

We also apply the following abstractive techniques:

(i) BART (Open domain)

(ii) Legal-Pegasus (Legal-specific)

(iii) Legal-LED (Legal-specific)

While Legal-LED can accommodate long documents (a 16,384-token limit), the same is not true for the other models. To overcome this problem, we split the document into equal-sized chunks (each smaller than the model's length limit) and pass each chunk through the model. The chunk-wise summaries are concatenated to form the final summary. To convert the overall gold-standard document summary into chunk-wise summaries, we follow the approach of Gidiotis and Tsoumakas (2020). All the models are fine-tuned on the summarization dataset.
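
A minimal sketch of this chunk-summarize-concatenate strategy using the Hugging Face transformers library is given below; the model name, chunk size, and generation settings are placeholders rather than the exact configuration used in our experiments.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL = "facebook/bart-large-cnn"  # placeholder; Legal-Pegasus / BART are applied analogously
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

def summarize_long_document(text, chunk_tokens=1024, summary_tokens=150):
    # Split the document into equal-sized token chunks below the model's length
    # limit, summarize each chunk, and concatenate the chunk-wise summaries.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunk_texts = [tokenizer.decode(ids[i:i + chunk_tokens])
                   for i in range(0, len(ids), chunk_tokens)]
    pieces = []
    for chunk in chunk_texts:
        inputs = tokenizer(chunk, return_tensors="pt",
                           truncation=True, max_length=chunk_tokens)
        out = model.generate(**inputs, max_length=summary_tokens, num_beams=4)
        pieces.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return " ".join(pieces)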

Results

Algorithm ROUGE-1 ROUGE-2 ROUGE-L BERT-Score
DSDR 0.485 0.222 0.270 0.848
CaseSummarizer 0.454 0.229 0.279 0.843
SummaRuNNer 0.493 0.255 0.274 0.849
Gist 0.471 0.238 0.308 0.842
BART 0.495 0.249 0.330 0.851
Legal-Pegasus 0.488 0.252 0.341 0.851
Legal-LED 0.471 0.235 0.332 0.856