Abstract:
Sign Language Recognition (SLR) plays a pivotal role in enabling the deaf and speech-impaired community to communicate
effectively with the rest of society. Traditional SLR methods deploying wearable sensor-based
modalities restrict signer movements and daily activities due to their invasive nature and therefore
have limited utility. Additionally, appearance-based methods relying on visual features from
RGB video data are computationally expensive due to the need to process high-dimensional
input. These methods are also prone to performance degradation under varying lighting conditions, different camera angles, and complex backgrounds. Furthermore, RGB data contains
a large amount of visual redundancy, which can cause encoders to overlook critical information necessary for accurate sign interpretation, while also raising privacy concerns by capturing identifiable features of the signer. In contrast, pose-based methods, leveraging hand and
body pose information, present a promising alternative. This thesis aims to advance the field of
SLR by developing novel, accurate, efficient, and scalable architectures that leverage a standalone
skeleton/pose-based input modality, pushing the boundaries of existing approaches and addressing key limitations in current methods. It also contributes to the field through the creation of a
local dataset for Pakistani Sign Language (PSL) recognition.
In the preliminary stage, this research addresses the complexity of SLR by introducing a novel
framework that leverages a Graph Convolutional Network (GCN) integrated with bottleneck
layer structures and residual connections. This architecture facilitates efficient spatio-temporal
feature extraction from skeleton joints, improving recognition accuracy while minimizing computational complexity, thereby achieving state-of-the-art (SOTA) performance. Building on this
foundation, an early-fused multi-input architecture along with a novel part-attention mechanism
is developed. The proposed architecture processes joint and bone information separately using
an efficient residual graph convolutional network (ResGCN), generating independent features
for each stream. These features are then fused at an early stage to construct a unified stream,
enabling more comprehensive and robust feature representation, without imposing additional
computational burdens. Moreover, a scalability analysis of the proposed architectures is performed by evaluating them on datasets of increasing size and a growing number of target classes, showing that recognition performance is maintained as both grow.
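To make the bottleneck-and-residual design above concrete, the following PyTorch sketch shows one possible spatial graph convolutional block of this kind. It is a minimal illustration under stated assumptions: the layer layout, the channel reduction ratio of 4, and the single fixed adjacency matrix are placeholders chosen for clarity, not the exact ResGCN configuration used in the thesis.

```python
# Minimal sketch of a residual bottleneck GCN block over skeleton input.
# Assumptions: a 1x1 bottleneck with reduction ratio 4, one fixed normalized
# adjacency matrix A, and tensors shaped (N, C, T, V) = (batch, channels,
# frames, joints). Not the exact thesis architecture.
import torch
import torch.nn as nn

class BottleneckGCNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, A, reduction=4):
        super().__init__()
        self.register_buffer("A", A.clone())                     # (V, V) skeleton adjacency
        mid = out_channels // reduction
        self.down = nn.Conv2d(in_channels, mid, kernel_size=1)   # bottleneck: shrink channels
        self.transform = nn.Conv2d(mid, mid, kernel_size=1)      # per-joint feature transform
        self.up = nn.Conv2d(mid, out_channels, kernel_size=1)    # restore channel width
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Residual connection: identity if shapes match, else a 1x1 projection.
        self.residual = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, kernel_size=1))

    def forward(self, x):                                        # x: (N, C, T, V)
        res = self.residual(x)
        y = self.relu(self.down(x))
        y = self.transform(y)
        y = torch.einsum("nctv,vw->nctw", y, self.A)             # aggregate over graph edges
        return self.relu(self.bn(self.up(y)) + res)

if __name__ == "__main__":
    V = 25                                                       # e.g. 25 body/hand joints
    A = torch.eye(V)                                             # placeholder adjacency
    x = torch.randn(2, 3, 100, V)                                # 2 clips, xyz coords, 100 frames
    print(BottleneckGCNBlock(3, 64, A)(x).shape)                 # torch.Size([2, 64, 100, 25])
```

In the early-fused multi-input variant, bone features can be derived from the same joint coordinates (a bone vector being the difference between a joint and its parent), passed through a parallel stack of such blocks, and concatenated with the joint stream's features at an early layer, so that the remaining network processes a single unified stream.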
Further advancements include the development of a multi-scale efficient graph convolutional
network (MSE-GCN), which employs separable convolution layers arranged at multiple temporal scales
and embedded in a multi-branch network along with an early fusion scheme. In addition, this
research introduces a novel hybrid attention module, termed Spatial Temporal Joint-Part Attention (ST-JPA), designed to selectively focus on the most critical body parts and informative joints
within specific frames of a sign sequence. The multi-scale approach captures both short-range
and long-range dependencies between joints, enabling the model to generate robust feature representations across different temporal and spatial scales. Simultaneously, the use of separable
convolution layers reduces computational complexity, significantly improving efficiency without sacrificing accuracy. The MSE-GCN achieves exceptional accuracy on challenging datasets
while maintaining low inference costs, highlighting the effectiveness of graph-based methods
for SLR. Extensive experiments with progressively larger datasets and an increasing number of
classes have demonstrated the model’s scalability. Its generalizability has been further validated
through testing on cross-dataset and cross-domain tasks, achieving SOTA performance in both scenarios.
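The sketch below illustrates the two ingredients named above in simplified form: a separable temporal convolution (depthwise followed by pointwise, which cuts the parameter count of a dense temporal convolution roughly from C²·k to C·k + C²), arranged in parallel branches with growing dilation to cover multiple temporal scales, plus a stripped-down joint-attention gate standing in for ST-JPA. The kernel size, dilations, branch count, and attention form are assumptions made for illustration, not the published MSE-GCN design.

```python
# Sketch of a multi-scale separable temporal convolution with a simplified
# joint-attention gate. Kernel size 5, dilations (1, 2, 4), and sum-fusion
# of branches are illustrative assumptions, not the exact MSE-GCN layout.
import torch
import torch.nn as nn

class SeparableTemporalConv(nn.Module):
    """Depthwise temporal conv (one filter per channel) + pointwise 1x1 mix."""
    def __init__(self, channels, kernel_size=5, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2                  # keep T unchanged
        self.depthwise = nn.Conv2d(channels, channels, (kernel_size, 1),
                                   padding=(pad, 0), dilation=(dilation, 1),
                                   groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                                        # x: (N, C, T, V)
        return self.pointwise(self.depthwise(x))

class MultiScaleTemporalBlock(nn.Module):
    """Parallel branches with growing dilation: small dilations capture
    short-range motion, large dilations long-range dependencies."""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            SeparableTemporalConv(channels, dilation=d) for d in dilations)

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)        # fuse scales

class JointAttention(nn.Module):
    """Simplified stand-in for ST-JPA: a sigmoid gate scoring each joint in
    each frame (the real module also attends over body parts)."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):
        return x * torch.sigmoid(self.score(x))                  # (N, 1, T, V) gates

if __name__ == "__main__":
    x = torch.randn(2, 64, 100, 25)
    y = JointAttention(64)(MultiScaleTemporalBlock(64)(x))
    print(y.shape)                                               # torch.Size([2, 64, 100, 25])
```

The grouped depthwise convolution is where the savings come from: each channel is filtered independently, and the cheap pointwise convolution handles channel mixing, so widening the temporal receptive field via dilation adds no extra parameters.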
Another significant contribution of this research is the development of a comprehensive skeleton/pose-based PSL dataset comprising 52 commonly used dynamic
Urdu words/signs. The dataset is constructed using non-expert signers, ensuring diversity in
clothing, background, and illumination, and verified against professional signers to ensure accuracy and reliability. Pose-based baseline studies are conducted using K-fold cross-validation
and signer-independent split protocols to establish initial benchmarks. In addition, this research
introduces a novel framework, Efficient-Sign, which leverages efficient graph convolutional networks specifically for Pakistani SLR. This work aims to pave the way for more advanced PSL
recognition and translation studies.
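As a concrete reading of the signer-independent protocol mentioned above, the sketch below holds out whole signers so the model is never tested on a person seen during training. The field names, the 80/20 signer ratio, and the fixed seed are hypothetical choices for illustration, not the dataset's actual split specification.

```python
# Illustrative signer-independent split: entire signers are held out, so the
# test set contains no person seen during training. Field names and the
# 80/20 signer ratio are assumptions, not the thesis's actual protocol.
import random

def signer_independent_split(samples, test_fraction=0.2, seed=0):
    """samples: list of dicts like {'signer': 'S01', 'label': 7, 'pose': ...}."""
    signers = sorted({s["signer"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(signers)
    n_test = max(1, round(len(signers) * test_fraction))
    held_out = set(signers[:n_test])
    train = [s for s in samples if s["signer"] not in held_out]
    test = [s for s in samples if s["signer"] in held_out]
    return train, test

if __name__ == "__main__":
    # Toy corpus: 10 signers x 52 signs -> 2 signers (104 samples) held out.
    data = [{"signer": f"S{i % 10:02d}", "label": i % 52} for i in range(520)]
    train, test = signer_independent_split(data)
    print(len(train), len(test))                                 # 416 104
```

K-fold cross-validation, by contrast, typically lets the same signers appear in training and validation folds, so the two protocols together probe both within-signer and cross-signer generalization.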
In summary, the proposed research contributes to bridging the communication gap for individuals with hearing and speech impairments, both globally and in Pakistan, and advances the field of SLR by providing accurate, scalable, generalizable, and efficient skeleton/pose-based solutions for real-world SLR applications.