Abstract:
Visual impairments affect approximately 2.2 billion individuals worldwide, impacting
people of all ages and genders. These disabilities significantly affect personal lives and
contribute to a substantial global financial burden. According to the World Health Organization
(WHO), adults with visual impairments experience higher rates of unemployment
and are more prone to depression and anxiety. Additionally, the estimated annual loss
in productivity due to visual impairments is approximately US$400 billion in purchasing
power parity.
While numerous efforts are underway to prevent blindness, this study focuses on helping
visually impaired individuals comprehend and visualize their surroundings using
mobile phone cameras. By leveraging several pre-trained deep learning models, including
YOLO, ByteTrack, MiDaS, BLIP, and GPT-3.5, the proposed approach converts video
captured by a mobile phone camera into textual descriptions of the surrounding scene.
The system aims to enhance the independence and quality of life of visually
impaired individuals by providing real-time, accessible information about their environment.
Several recent attempts have been made to address this issue; however,
most of these solutions fall short of fully understanding a scene and delivering voice
output in real time. This study seeks to overcome these limitations by employing advanced
deep learning techniques to provide a more accurate and timely interpretation of
visual information, thereby offering a more effective aid for the visually impaired.
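To make the described pipeline more concrete, the following is a minimal, illustrative sketch of how a video frame might be turned into a scene description by chaining an object detector with an image captioner. The specific packages (ultralytics, transformers), model checkpoints, and the final merging step are assumptions for illustration only and are not the paper's actual implementation.

import cv2
from PIL import Image
from ultralytics import YOLO
from transformers import pipeline

# Hypothetical model choices; the paper's exact checkpoints may differ.
detector = YOLO("yolov8n.pt")                          # object detection (YOLO)
captioner = pipeline("image-to-text",
                     model="Salesforce/blip-image-captioning-base")  # BLIP captioning

cap = cv2.VideoCapture(0)                              # phone/webcam video stream
ret, frame = cap.read()
if ret:
    # Detect objects in the frame and collect their class labels.
    detections = detector(frame)[0]
    labels = [detector.names[int(c)] for c in detections.boxes.cls]

    # Generate a natural-language caption for the same frame.
    rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    caption = captioner(rgb)[0]["generated_text"]

    # In the full system, detections, depth (MiDaS), and the caption would be
    # merged (e.g. via an LLM) into a spoken scene description for the user.
    print(f"Objects: {labels}; scene: {caption}")
cap.release()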