Summarization (SUMM)
SUMM automates the generation of a gist of a legal case document, capturing the critical aspects of the case.
Type of Task | Text Generation |
---|---|
Dataset | In-Abs (Shukla et al., 2022) |
Language | English |
No. of documents | 7,130 |
Type of summary | Abstractive |
Evaluation Metrics | ROUGE-1, ROUGE-2, ROUGE-L F1, BERT-SCORE |
Task Motivation and Description
Going through legal documents, which often span tens of pages, is a time-consuming and cumbersome activity; a summary of each document could help legal practitioners and make their workflow more efficient.
The task of summarization involves generating a gist (of a legal document) that captures the critical aspects of the case.
Summarization is a standard task in NLP; however, in the case of the legal domain, there are a few additional challenges, such as:
(i) case documents are generally very lengthy, and thus the summaries are long too;
(ii) large-scale summarization datasets are difficult to build since it is expensive to gather expert annotations.
Summarization could be extractive (selecting the important sentences) or abstractive (generating the gist). In our setting, summarization is an abstractive generation task.
Dataset
We collected Supreme Court of India judgments from the website of the Legal Information Institute of India, which provides free and non-profit access to databases of Indian law.
Abstractive summaries (also called headnotes) are available for some of these cases; we include 7,130 such case documents, together with their headnotes/summaries, in the dataset.
We reserve 100 randomly-selected document-summary pairs for evaluation, and the remaining 7,030 pairs are used for training.
The dataset is named In-Abs (Indian legal documents Abstractive summarization) and was released in our previous work (Shukla et al., 2022).
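The train/evaluation split described above can be sketched as follows; the helper name and random seed are illustrative, not taken from the paper:

```python
import random

def split_dataset(pairs, n_eval=100, seed=13):
    """Randomly reserve n_eval document-summary pairs for evaluation;
    the remaining pairs form the training set."""
    rng = random.Random(seed)   # fixed seed for a reproducible split
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    return shuffled[n_eval:], shuffled[:n_eval]

# Toy usage with integer ids standing in for document-summary pairs.
train_ids, eval_ids = split_dataset(range(7130))
```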
Dataset Format
Each document (json) has the following format:
```
Dict{
    'id': string              // case identifier
    'num_doc_tokens': int     // number of words in the full document
    'num_summ_tokens': int    // number of words in the summary
    'document': List(string)  // sentences of the case document
    'summary': List(string)   // sentences of the summary
}
```
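A record in this format can be parsed and flattened back into plain text as follows; the toy record's field values are made up for illustration:

```python
import json

# A toy record following the schema above (the field values are made up).
raw = json.dumps({
    "id": "case_0001",
    "num_doc_tokens": 9,
    "num_summ_tokens": 6,
    "document": ["The appellant filed a suit.", "The High Court dismissed it."],
    "summary": ["Suit dismissed by the High Court."],
})

def load_case(raw_json: str) -> dict:
    """Parse one In-Abs record and rebuild plain text from the sentence lists."""
    doc = json.loads(raw_json)
    doc["full_text"] = " ".join(doc["document"])
    doc["gold_summary"] = " ".join(doc["summary"])
    return doc

case = load_case(raw)
```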
Task Evaluation
We use standard summarization metrics: ROUGE-1, ROUGE-2, and ROUGE-L F1-scores, and BERT-SCORE (Zhang et al., 2020).
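For reference, ROUGE-L F1 reduces to a longest-common-subsequence computation over token sequences. A minimal pure-Python sketch (whitespace tokenization, no stemming; the official scorers apply more preprocessing):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference: str, candidate: str) -> float:
    """Sentence-level ROUGE-L F1 on whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)
```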
Baseline Models
We apply both extractive and abstractive methods on the dataset.
Extractive Methods
We apply the following extractive techniques:
(i) CaseSummarizer (Legal-specific, Unsupervised)
(ii) DSDR (Open domain, Unsupervised)
(iii) Gist (Legal-specific, Supervised)
(iv) SummaRuNNer (Open domain, Supervised)
To adapt the abstractive gold-standard summaries for these extractive methods, we use the technique suggested by Narayan et al. (2018).
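Such adaptation schemes typically label sentences greedily: starting from an empty set, repeatedly add the document sentence that most improves the score against the abstractive summary, stopping when no sentence helps. A sketch of this idea, using a simple unigram-F1 stand-in scorer (the actual labelling in the paper uses ROUGE-based scoring):

```python
def unigram_f1(reference: str, candidate: str) -> float:
    """Stand-in scorer: unigram-overlap F1 (the real oracle uses ROUGE)."""
    ref, cand = set(reference.split()), set(candidate.split())
    overlap = len(ref & cand)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)

def greedy_oracle(doc_sents, abstractive_summary, score_fn=unigram_f1, max_sents=3):
    """Greedily select the document sentence that most improves the score
    against the abstractive summary; stop when no sentence adds gain."""
    selected, best = [], 0.0
    while len(selected) < max_sents:
        gains = [(score_fn(abstractive_summary,
                           " ".join(doc_sents[j] for j in sorted(selected + [i]))), i)
                 for i in range(len(doc_sents)) if i not in selected]
        if not gains:
            break
        score, idx = max(gains)
        if score <= best:   # no remaining sentence improves the score
            break
        best, selected = score, selected + [idx]
    return sorted(selected)
```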
Abstractive Methods
We also apply the following abstractive techniques:
(i) BART (Open domain)
(ii) Legal-Pegasus (Legal-specific)
(iii) Legal-LED (Legal-specific)
While Legal-LED can accommodate long inputs (a 16,384-token limit), the same is not true for the other models. To overcome this problem, we split each document into equal-sized chunks (each smaller than the model's input length limit) and pass each chunk through the model; the per-chunk summaries are concatenated to form the final summary. To convert the overall gold-standard summary into chunk-wise summaries, we follow the approach of Gidiotis and Tsoumakas (2020). All the models are fine-tuned on the summarization dataset.
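The chunk-and-concatenate step can be sketched as below; the chunk size and the `summarize_chunk` callable are placeholders for a model's actual input limit and generation call:

```python
def chunk_tokens(tokens, chunk_size):
    """Split a long token sequence into equal-sized chunks that each fit
    within a model's input limit (the last chunk may be shorter)."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def summarize_long_doc(tokens, summarize_chunk, chunk_size=1024):
    """Summarize each chunk independently and concatenate the pieces;
    summarize_chunk stands in for a fine-tuned model's generate call."""
    parts = [summarize_chunk(c) for c in chunk_tokens(tokens, chunk_size)]
    return " ".join(parts)

# Toy usage: a fake "model" that just reports each chunk's length.
demo = summarize_long_doc(list(range(10)), lambda c: f"<{len(c)}>", chunk_size=4)
```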
Results
Algorithm | ROUGE-1 | ROUGE-2 | ROUGE-L | BERT-Score |
---|---|---|---|---|
DSDR | 0.485 | 0.222 | 0.270 | 0.848 |
CaseSummarizer | 0.454 | 0.229 | 0.279 | 0.843 |
SummaRuNNer | 0.493 | 0.255 | 0.274 | 0.849 |
Gist | 0.471 | 0.238 | 0.308 | 0.842 |
BART | 0.495 | 0.249 | 0.330 | 0.851 |
Legal-Pegasus | 0.488 | 0.252 | 0.341 | 0.851 |
Legal-LED | 0.471 | 0.235 | 0.332 | 0.856 |