Abstract:
Speech and hearing impairment is a condition that limits a person's capacity to communicate
verbally and audibly. Individuals affected by it adopt sign language and other alternative
forms of communication. Even though sign language has become more widely used in recent
years, it is still difficult for non-signers to engage with the individuals who use it.
There has been promising progress in motion and gesture recognition through the combination
of computer vision and deep learning techniques. This study puts forward an approach that
uses deep learning to automate the recognition of American Sign Language, thereby lowering
barriers to effective communication between hard-of-hearing individuals and hearing
communities. Several deep learning techniques have previously been employed for sign
language gesture recognition, with video sequences used as input for extracting spatial and
temporal information. Advancements in word-level sign language recognition (WSLR) can
drastically reduce the need for human translators and enable signers and non-signers to
communicate easily. The majority of methods currently in use rely on additional equipment
such as sensor devices, gloves, or depth cameras, which constrains their ease of use in
real-life situations. Such situations may benefit from
deep learning techniques that are entirely vision-based and non-intrusive. American Sign
Language has its own rules for syntax and grammar, much like any other spoken language. Like
every other language, ASL is a living language that evolves and develops over time. The
majority of ASL users are found in the United States and Canada, and most schools and
institutions across the US accept ASL toward fulfillment of current and "international"
degree requirements. This study uses deep learning methods to recognize American Sign
Language words, using the WLASL (Word-Level American Sign Language) dataset, from which a
subset of 50 classes was chosen. VGG16-LSTM and ConvLSTM based architectures were used,
chosen for their ability to model spatio-temporal features. We observed that VGG16-LSTM
outperformed the ConvLSTM architecture. Both models' performances are examined using
accuracy as the evaluation metric and judged according to how well they perform on test
videos.
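
As an illustration of the approach, the minimal sketch below shows how a VGG16-LSTM
classifier for a 50-class WLASL subset could be assembled with TensorFlow/Keras: a frozen
VGG16 backbone extracts per-frame spatial features and an LSTM aggregates them over time
before a 50-way softmax. The frame count, input resolution, LSTM width, and optimizer are
illustrative assumptions, not the exact configuration used in this study.

    # Minimal VGG16-LSTM sketch for word-level sign recognition (illustrative only).
    import tensorflow as tf
    from tensorflow.keras import layers
    from tensorflow.keras.applications import VGG16

    NUM_FRAMES = 20                # assumed number of frames sampled per video
    FRAME_SIZE = (224, 224, 3)     # VGG16's standard input resolution
    NUM_CLASSES = 50               # 50-class WLASL subset, as in this study

    # Frozen, ImageNet-pretrained VGG16 serves as the per-frame spatial feature extractor.
    backbone = VGG16(weights="imagenet", include_top=False, pooling="avg",
                     input_shape=FRAME_SIZE)
    backbone.trainable = False

    inputs = tf.keras.Input(shape=(NUM_FRAMES,) + FRAME_SIZE)
    x = layers.TimeDistributed(backbone)(inputs)   # apply VGG16 to every frame
    x = layers.LSTM(256)(x)                        # aggregate frame features over time
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",  # integer class labels
                  metrics=["accuracy"])                     # accuracy, as used for evaluation
    model.summary()

A ConvLSTM variant could instead operate directly on the frame sequence using
tf.keras.layers.ConvLSTM2D layers in place of the TimeDistributed VGG16 and LSTM stages.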