Court Judgment Prediction with Explanation (CJPE)

CJPE requires, given the facts and other details of a court case, predicting the final outcome, i.e., whether the appeal is granted or denied (Prediction), as well as identifying the salient sentences leading to the decision (Explanation).

Type of Task Binary Text Classification, Explainability
Dataset ILDC (Malik et al., 2021)
Language English
No. of documents 34k
Evaluation Metric macro-F1

Task Motivation and Description

Given that India has a massive backlog of pending cases, it is imperative to develop systems that could aid the justice delivery process. To augment a judge in the judicial decision-making process, we introduce the task of Court Judgment Prediction with Explanation (CJPE). Note that the idea behind this task is not to replace human judges but to aid them. Furthermore, the task requires the system to explain its decision so that it is interpretable for the human using it.

Formally, the task of Court Judgment Prediction with Explanation (CJPE) involves predicting the final judgment (appeal accepted or denied, i.e., a binary outcome of 0 or 1) for a given judgment document (containing the facts and other details of the case) and providing an explanation for the decision.

The explanations, in this case, are in the form of the crucial sentences in the text that lead to the predicted decision.

Dataset

For the CJPE task, we use our previously created Indian Legal Document Corpus (ILDC) (Malik et al., 2021). ILDC is a corpus of 35k legal judgment documents (in English) from the Supreme Court of India. Each document is annotated with the ground truth (the actual decision given by the judge); further, a small subset of the documents is annotated with explanations by legal experts. Ethical considerations were taken into account throughout the dataset creation process; for example, the documents are normalized with respect to named entities to remove biases in the data. Please refer to Malik et al. (2021) for more details about the dataset, annotations, and de-biasing process. We use the same train/dev/test splits as in ILDC.

Dataset Format

Each document (JSON) has the following format:

Dict{
  'id': string  // case identifier
  'text': string  // contents of case
  'label': ClassLabel // final ACCEPT/REJECT decision
  'expert_1': Dict{
    'label': ClassLabel // final decision according to expert_1
    'rank_1': List(string)  // sentences contributing to final decision, divided into ranks from 1 to 5
    'rank_2': List(string)
    'rank_3': List(string)
    'rank_4': List(string)
    'rank_5': List(string)
  }
  'expert_2': Dict{...} // similar to expert_1
  'expert_3': Dict{...} // similar to expert_1
  'expert_4': Dict{...} // similar to expert_1
  'expert_5': Dict{...} // similar to expert_1
}
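
To make the schema concrete, the following minimal Python sketch builds a toy document in this format and walks one expert's explanation ranks. The field values are illustrative placeholders, not actual ILDC content.

    # Toy document matching the schema above (values are illustrative).
    doc = {
        "id": "case_0001",
        "text": "... full judgment text ...",
        "label": 1,  # 1 = appeal ACCEPTed, 0 = appeal REJECTed
        "expert_1": {
            "label": 1,
            "rank_1": ["The High Court order is set aside and the appeal is allowed."],
            "rank_2": [], "rank_3": [], "rank_4": [], "rank_5": [],
        },
    }

    print(doc["id"], "gold label:", doc["label"])

    # Walk the first expert's explanation sentences, most important rank first.
    expert = doc["expert_1"]
    for rank in ("rank_1", "rank_2", "rank_3", "rank_4", "rank_5"):
        for sentence in expert[rank]:
            print(rank, "->", sentence)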

Task Evaluation

The prediction part of the CJPE task is evaluated using standard accuracy and the macro-F1 score. The explanation part is evaluated by comparing machine-generated explanations against expert annotations using ROUGE scores (Lin, 2004), along with related overlap metrics (Jaccard similarity, Overlap-Min/Max, BLEU, METEOR) reported in the results below.
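
As an illustration, the sketch below computes the prediction metrics with scikit-learn and ROUGE with the rouge-score package; the labels and explanation strings are toy values, and this is not the exact evaluation script used for the benchmark.

    from sklearn.metrics import accuracy_score, f1_score
    from rouge_score import rouge_scorer  # pip install rouge-score

    # Prediction metrics (toy gold/predicted labels: 1 = accepted, 0 = denied).
    y_true = [1, 0, 1, 1]
    y_pred = [1, 0, 0, 1]
    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Macro-F1 :", f1_score(y_true, y_pred, average="macro"))

    # Explanation metrics: compare a machine explanation with an expert's.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    gold_expl = "The appellant was denied a fair hearing by the lower court."
    pred_expl = "The lower court denied the appellant a fair hearing."
    print(scorer.score(gold_expl, pred_expl))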

Baseline Models

We experimented with a battery of models for the CJPE task; details are provided in Malik et al. (2021). Here we describe only the baseline model, a competitive Hierarchical Transformer; in particular, we use the XLNet+BiGRU model. Since legal documents are long, each document is divided sequentially into chunks of 512 tokens, and a representation is learned for each chunk using an XLNet model. A BiGRU is then applied on top of these chunk representations, followed by a feed-forward network for the prediction.

For the explanation part, we use the occlusion method. The basic idea behind occlusion is to mask a sentence and observe the change in the prediction probability; this change indicates how salient the sentence is for the prediction. The greater the change in probability, the more salient the sentence.
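
The following is a minimal sketch of the occlusion idea, assuming a placeholder predict_proba function in place of the trained XLNet+BiGRU classifier and a naive regex sentence splitter; it illustrates the technique rather than reproducing the implementation of Malik et al. (2021).

    import re

    def predict_proba(text: str) -> float:
        """Placeholder for the trained classifier: returns a toy
        P(appeal accepted | text) so the sketch runs end to end."""
        # Toy heuristic, NOT the real model.
        return min(1.0, 0.1 + 0.2 * text.lower().count("allowed"))

    def occlusion_saliency(document: str):
        """Score each sentence by how much masking it changes the prediction."""
        sentences = re.split(r"(?<=[.?!])\s+", document)  # naive sentence split
        base_prob = predict_proba(document)
        scores = []
        for i, sent in enumerate(sentences):
            occluded = " ".join(sentences[:i] + sentences[i + 1:])  # mask sentence i
            delta = abs(base_prob - predict_proba(occluded))
            scores.append((delta, sent))
        # Larger probability change => more salient sentence for the decision.
        return sorted(scores, reverse=True)

    doc = ("The appellant was denied a fair hearing. "
           "The High Court order is set aside and the appeal is allowed.")
    for delta, sent in occlusion_saliency(doc):
        print(f"{delta:.2f}  {sent}")

The sentences with the largest probability change can then be compared against the expert-annotated ranks using the explanation metrics above.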

A detailed analysis of the model and a case study are provided in Malik et al. (2021). In general, the baseline model (the best among all the experimented models) achieves a macro-F1 score of about 78% and is far from the average human performance of 94%.

Results

Prediction Results (mP/mR/mF1 = macro precision/recall/F1, Acc. = accuracy; all in %)

Model mP mR mF1 Acc.
XLNet + BiGRU (trained on ILDC-single) 75.11 75.06 75.09 75.06
XLNet + BiGRU + Attn. (trained on ILDC-single) 75.26 75.22 75.25 75.22
XLNet + BiGRU (trained on ILDC-multi) 77.80 77.78 77.79 77.78
XLNet + BiGRU + Attn. (trained on ILDC-multi) 77.32 76.82 77.07 77.78

Explainability of the Best Model

Metric Expert 1 Expert 2 Expert 3 Expert 4 Expert 5
Jaccard sim. 0.333 0.317 0.328 0.324 0.318
Overlap-Min 0.744 0.589 0.810 0.834 0.617
Overlap-Max 0.390 0.414 0.360 0.350 0.401
ROUGE-1 0.444 0.517 0.401 0.391 0.501
ROUGE-2 0.303 0.295 0.296 0.297 0.294
ROUGE-L 0.439 0.407 0.423 0.444 0.407
BLEU 0.160 0.280 0.099 0.093 0.248
METEOR 0.220 0.300 0.180 0.177 0.279