Abstract:
Thousands of cases are registered every year in the Supreme Court and High
Courts of Pakistan, and almost all the judgments of these cases are available
on their respective websites in the form of PDF documents. Since these
documents are easily accessible and contain personal information of the
parties involved, such as petitioner and respondent names, organization names,
and their addresses, anyone can intrude on their privacy and identify
the persons and organizations mentioned in the judgments. These documents
do not follow a proper format and are semi-structured, so extracting personal
information of the parties is a difficult task. Automated anonymization of
courtroom records offers a way to de-identify the personal information in
these documents. However, the semi-structured form of these documents and
the ambiguity of natural language make this a challenging task. This research
focuses on extracting personal information of the parties involved and
anonymizing it in publicly accessible documents. We used BERT-NER to
train a model and extract three entity labels from a dataset containing 213
judgments of the Supreme Court of Pakistan. We created this dataset by
collecting raw judgments from the Supreme Court of Pakistan’s website and
labelled them according to our own annotation guidelines. The labels are
Per (Person), Org (Organization), and Loc (Location); after extraction with
NER, the labelled entities were anonymized by replacing them with generic words.
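
The replacement step described above can be illustrated with a minimal
sketch. It assumes the NER model has already produced character-level spans
tagged Per, Org, or Loc. The placeholder strings, the `anonymize` helper, and
the sample sentence are all hypothetical and are not taken from the paper's
implementation or dataset.

```python
from typing import List, Tuple

# Assumed mapping from entity label to a generic replacement word.
PLACEHOLDERS = {"Per": "[PERSON]", "Org": "[ORGANIZATION]", "Loc": "[LOCATION]"}

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def anonymize(text: str, spans: List[Span]) -> str:
    """Replace each labelled character span with its generic placeholder."""
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # skip a span that overlaps one already replaced
        out.append(text[cursor:start])                     # keep text before the entity
        out.append(PLACEHOLDERS.get(label, "[REDACTED]"))  # generic word for the entity
        cursor = end
    out.append(text[cursor:])                              # keep the remaining text
    return "".join(out)


# Made-up sentence with hand-written spans standing in for NER output.
text = "Ali Khan filed a petition against ABC Bank in Lahore."
spans = [(0, 8, "Per"), (34, 42, "Org"), (46, 52, "Loc")]
print(anonymize(text, spans))
```

Working on character offsets rather than on token strings keeps the
surrounding text byte-for-byte intact, which matters when the anonymized
judgment must remain readable.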