Abstract
Assistive technologies have gained traction in the medical field over the last few decades, and novel approaches have been developed to help people with disabilities communicate effectively. However, little research has been conducted on the other side of the coin, that is, assistive technologies that help people without disabilities understand the language of the disabled. This study describes the early development of a hand alphabet recognition system that aims to accomplish a functioning dactylology conversion from sign language to English print in a live streaming video. Each frame of the video is processed with a segmentation technique that partitions it into different segments (e.g., the pixels of a hand gesture). The dactylology conversion algorithm was implemented in a mobile application in which users can watch a video containing an on-screen sign language interpreter and understand the fingerspelling used as a means of communication by hearing- and speech-impaired people. On a sample dataset of 13 American Sign Language videos, manually collected (N=10) and recorded (N=3), the application was tested for its accuracy in detecting the alphabet in a video (94.16%) and for the correctness of converting the detected alphabet into English print (89.65%). This study adds to the list of existing novel approaches that aim to promote positive social effects and to improve the quality of life both of disabled people and of everyone they socialize with.
Keywords: Dactylology, Assistive Technology, Sign Language, Video Segmentation, Fingerspelling, Hand Alphabet Recognition
Introduction
In a world where millions of people are deaf or mute, dactylology, the science of communicating with the hands and fingers (e.g., the one-handed alphabet, the two-handed alphabet), is one of the communication modalities, if not the only one, that lets people with and without disabilities express and exchange ideas and thoughts. In fact, there are more than 120 distinct sign languages used in various nations, such as American, French, German, Spanish, Filipino, Japanese, Indo-Pakistani, and more. The difficulty of establishing a universal communication modality between disabled people who use different sign languages, and between people with and without disabilities, has been a stirring invitation for technology. For instance, one study [1] developed a convolutional neural network model for American Sign Language alphabet recognition; the classification model was combined with a multi-view augmentation strategy to exploit 3D information from depth images. In another study [2], an artificial neural network was used to design and develop an American Sign Language recognition system with a sensory glove and a three-dimensional motion tracker for extracting gesture data features. These technology-based communication advancements are among the most novel contributions to the vast reaches of healthcare information, intelligent application systems, and communication technology.
Taking a deeper look, there has been widespread development of intelligent systems [3, 4, 5, 6] and studies [7, 8, 9] that seek to establish and understand the use of technology-based assistance for people with disabilities in performing basic communication tasks. Naves, Rocha, and Pino [10] developed an alternative communication system by bringing electromyographic (EMG) signals into the field of Human-Computer Interaction (HCI) to attend to patients severely disabled by amyotrophic lateral sclerosis. HCI and EMG were both incorporated into the EDITH system, a computer software package consisting of communication features designed for a multimedia environment. In the robotics field, a mobile robotic arm was developed by Gushi, Higa, Uehara, and Soken for people with severe disabilities [11]; the robotic arm performs several tasks driven by eye movements, which are detected with an image processing technique. In another field, Garcia [12] developed a speech therapy game application to help aphasic people relearn how to communicate as they did before their stroke. Such technologies have proven to be important and instrumental in promoting positive social effects and in improving the quality of life not only of the disabled but also of their families and relatives. Santos et al. [13] confirmed the positive relationship between people's quality of life and assistive technology (e.g., VISIMP [14]), making it a sought-after invention of our time. These stigmatized and marginalized social groups now have a way to establish their position and promote inclusion within society.
Granted, these novel approaches were developed to help people with disabilities communicate effectively. Notwithstanding, little research has been conducted on the other side of the coin, that is, assistive technologies that help people without disabilities understand the language of the disabled. In fact, most people do not clearly understand sign language. Therefore, aside from the research gap, there is also a communication gap between deaf communities and the public. This study describes the early development of a hand alphabet recognition system that aims to achieve a functioning dactylology conversion from sign language to English print in a live streaming video. Each video frame is processed with a video segmentation technique that partitions it into different segments (e.g., the pixels of a hand gesture). The dactylology conversion algorithm was implemented in a mobile application in which users can watch a video containing an on-screen sign language interpreter and understand the fingerspelling used as a means of communication by hearing- and speech-impaired people. Not only does the mobile application provide a new communication modality, it also raises awareness of how to communicate with deaf and mute people.
Related Works
The growth of multimedia information has led to extensive interest in video indexing of media information and in video retrieval for accessing the acquired information stored in a database. For a video indexing and retrieval system to perform well, a proper video segmentation algorithm must be applied [15]. According to Dhiman and Dhanda [16], video segmentation refers to the process of decomposing video data into meaningful segments that have a strong correlation with the real world. In computer science, there are several schemes for performing video segmentation, and every algorithm has different advantages and disadvantages. Figure 1 shows an example of how a single video frame is segmented to extract the signer.
Beevi and Natarajan [17] proposed a video segmentation algorithm for MPEG-4 encoding systems that produces segmentation results with a low computation load by using baseline, shadow cancellation, and adaptive threshold modes. Similarly, a computationally efficient frame-by-frame technique was proposed by Vora and Raman [18], which clusters visually similar generic object segments in a video, extracted using top-k region proposals, to generate preliminary masks of the foreground object. Alternatively, Hassani et al. [19] used a region merging process over spatial and motion information to implement their proposed time-consistent video segmentation algorithm for real-time applications. Another technique for partitioning video information for further analysis is graph-based hierarchical video segmentation [20]. This method has four main steps: it starts by generating a graph for each k-sized block of frames, followed by the calculation of hierarchical scales; video segmentations are then inferred through a thresholding process; lastly, temporally coherent video segments are obtained by merging two consecutive segmented blocks. Li et al. [21], on the other hand, utilized an algorithm called suboptimal low-rank decomposition (SOLD) to decompose the representation coefficient matrix into sub-matrices of low rank; their efficiency analysis revealed that this method is faster and more effective than HGB and SHGB.
Further, intelligent approaches have likewise been proposed for recognizing hand gestures in a natural manner. Chaudhary, Raheja, Das, and Raheja [22] grouped these approaches into fuzzy logic, genetic algorithms, and artificial neural networks. Verma and Dev [23] used fuzzy-clustering-based finite state machines to recognize hand gestures efficiently. Nolker [24], on the other hand, used a neural network to detect fingertips, which are transformed into the finger joint angles of a hand model; this allowed a full reconstruction of a three-dimensional hand shape with 16 segments and 20 joint angles. Hu, Yu, Li, and Ma [25] extracted a parametric 2D human model to estimate human posture and recognize human activity; in their system, a genetic algorithm was applied to fit the model to the human silhouette.
Another area of related work that needs to be reviewed is hand recognition systems, where the recognized gestures can be used as part of a more intelligent system [26] or for controlling a robot [27]. To produce gesture recognition systems, different approaches have been proposed, from additional hardware such as gloves and color markers to skin-based segmentation for feature extraction [28, 29, 30, 31, 32, 33, 34]. The growing importance of gesture recognition in society [35] has stimulated a revolution among system inventors and given birth to essential applications in numerous areas such as surveillance systems, robotics, HCI, healthcare, and education. Sign language recognition has likewise been a beneficiary, receiving special attention thanks to the advancement of gesture recognition. Nevertheless, the development of gesture recognition systems has yielded many lessons drawn from the drawbacks of existing and completed prototypes. For instance, a neural network classifier can be too time consuming to train (e.g., learning ten words took four days [36]), although progress in computer hardware has made this faster. An orientation histogram method becomes problematic when dealing with similar gestures that produce different histograms, or with different gestures that produce similar histograms [37].
Methods
The main purpose of this study is to translate the hand alphabet of American Sign Language into English print in order to bridge the communication gap between people with and without disabilities when dactylology is used as a communication modality. Toward the realization of this goal, various image and video processing techniques were utilized based on the experimental results of existing studies. First, the video object segmentation strategy of Vora and Raman [18] was the basis of the core processes of the hand gesture recognition. Moreover, the skin detection algorithm of Garcia et al. [38] was loosely followed, particularly the image processing techniques used to enhance the accuracy of the results. In this application, each video frame undergoes several processes: (1) color illumination restoration to recover details of a frame, (2) histogram equalization to redistribute color intensities, (3) lighting correction to readjust dark areas, and (4) noise reduction to remove unwanted pixels. For this to be possible, each frame must first be extracted from the video media file. Afterwards, the extracted frames proceed to the core processes of the algorithm: (1) hand detection and extraction via image segmentation that combines background subtraction with the three-frame-difference method, (2) object tracking based on a motion consistency algorithm [39], and finally, (3) hand alphabet recognition using a convolutional neural network model for the classification process. Figure 2 shows the frame-by-frame segmentation and tracking of a hand gesture to automatically detect and recognize the sign language alphabet. Meanwhile, the system block diagram that shows how the whole process works is illustrated in Figure 3.
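To make the preprocessing stage concrete, the sketch below (Python with OpenCV, the same libraries used later for the classifier) applies histogram equalization, a gamma-based lighting correction, and Gaussian denoising to every frame extracted from a video file. The gamma value, kernel size, and file name are assumptions for illustration, not the exact operations or parameters of the implemented pipeline.

```python
# A minimal preprocessing sketch under assumed parameters, not the exact
# pipeline of this study: histogram equalization, gamma-based lighting
# correction, and denoising applied to each extracted frame.
import cv2
import numpy as np

def preprocess_frame(frame_bgr, gamma=1.5):
    # Histogram equalization on the luminance channel to redistribute intensities.
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    equalized = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)

    # Lighting correction: a gamma curve brightens dark areas.
    table = np.array([(i / 255.0) ** (1.0 / gamma) * 255 for i in range(256)],
                     dtype=np.uint8)
    corrected = cv2.LUT(equalized, table)

    # Noise reduction with a small Gaussian blur to remove unwanted pixels.
    return cv2.GaussianBlur(corrected, (5, 5), 0)

# Frame extraction from the video file, then preprocessing of each frame.
cap = cv2.VideoCapture("signer.mp4")   # hypothetical file name
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = preprocess_frame(frame)
    # ...hand detection, tracking, and recognition follow here...
cap.release()
```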
Hand Detection and Extraction
After the preprocessing of video frames, the first step toward the recognition of hand features is segmentation. A common cue for segmenting body parts such as the hand is skin color [38], since it is invariant to scale and rotation changes [40]. However, the result of the segmentation is affected by illumination conditions; a segmented skin-colored region might not be skin at all but another region with a similar color. Fortunately, a sign language interpreter customarily conveys information using hand movements while keeping the body stationary. Consequently, detecting a moving object in the video sequence will most likely yield the moving hand. Weng, Huang, and Da [41] proposed an interframe difference algorithm for detecting a moving object in a video that combines background subtraction with the three-frame-difference method. The first step is to subtract the previous and next frames from the current frame separately and add the results together to generate a grayscale image. Afterwards, another grayscale image is created by subtracting the background image from the current frame. The final output is a binary image made from the sum of the two previously generated grayscale images, which supports placing a bounding box on a region that contains a consistently moving object.
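A minimal sketch of this interframe-difference idea is given below. It is a simplified OpenCV illustration under assumed threshold and kernel values, not the exact implementation of Weng, Huang, and Da [41].

```python
# Hedged sketch: three-frame differencing combined with background subtraction,
# followed by thresholding and a bounding box around the largest moving region.
import cv2

def moving_hand_box(prev_gray, curr_gray, next_gray, background_gray, thresh=30):
    # Three-frame difference: current frame against its neighbouring frames.
    diff_prev = cv2.absdiff(curr_gray, prev_gray)
    diff_next = cv2.absdiff(curr_gray, next_gray)
    three_frame = cv2.add(diff_prev, diff_next)

    # Background subtraction: current frame against a static background model.
    bg_diff = cv2.absdiff(curr_gray, background_gray)

    # Combine both cues and binarize; a small opening removes speckle noise.
    combined = cv2.add(three_frame, bg_diff)
    _, binary = cv2.threshold(combined, thresh, 255, cv2.THRESH_BINARY)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

    # Bounding box of the largest moving region (assumed to be the hand).
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    return cv2.boundingRect(max(contours, key=cv2.contourArea))
```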
Hand Gesture Tracking
Once the target object has been segmented and its features extracted, a hidden bounding box stays on that object for tracking purposes so that the same process can be performed on the following frames until the end of the video sequence. He, Qiao, Wen, and Li proposed object tracking based on motion consistency (MCT) [39], which serves as the basis of the algorithm used to track the segmented hand. MCT requires that the object to be tracked is known in the first frame. The region segmented during hand detection and extraction determines this "known target object" before moving to the state transition model, which selects the candidate samples in the current video frame. Subsequently, target state prediction estimates the target position, including motion direction and distance, based on motion consistency. The tracking result is then determined by the position factor together with the holistic responses of each candidate.
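The following is a deliberately simplified, hedged stand-in for the motion-consistency idea rather than the MCT algorithm of [39] itself: the next target position is predicted from the previous inter-frame displacement, and the candidate box whose centre best agrees with that prediction is kept as the tracking result. In the full formulation, the holistic responses of the candidates also weigh into the decision; the sketch keeps only the position factor to stay short.

```python
# Simplified motion-consistency tracking step (not the MCT algorithm of [39]).
import numpy as np

def centre(box):
    x, y, w, h = box
    return np.array([x + w / 2.0, y + h / 2.0])

def track_step(prev_prev_box, prev_box, candidate_boxes):
    # Predict where the hand should be, assuming consistent motion
    # (same direction and distance as the previous inter-frame displacement).
    velocity = centre(prev_box) - centre(prev_prev_box)
    predicted = centre(prev_box) + velocity

    # Keep the candidate whose centre is closest to the prediction.
    distances = [np.linalg.norm(centre(c) - predicted) for c in candidate_boxes]
    return candidate_boxes[int(np.argmin(distances))]
```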
Hand Alphabet Recognition
To classify the hand alphabet of the sign language, a convolutional neural network classification model was trained, tested, and evaluated using a combination of technologies: the Python programming language, OpenCV for real-time computer vision, and TensorFlow for machine learning, with PIP packages such as matplotlib, numpy, opencv, and tensorflow. After classification, the recognized sign language alphabet is converted into English print and displayed in the mobile application. To recognize a word and/or to separate a string of converted sign language into actual English words, a spellchecker and autocorrect algorithm compares the translated sign language against an English word dictionary.
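A minimal sketch of such a classifier in TensorFlow/Keras is shown below; the input size, layer configuration, and training call are assumptions for illustration rather than the architecture actually trained in this study. In practice the model would be fed cropped, preprocessed hand regions produced by the detection and tracking stages, and its per-frame predictions would be accumulated into a character string before the spellchecker and autocorrect step is applied.

```python
# Illustrative CNN classifier for hand-alphabet images (assumed architecture).
import tensorflow as tf

NUM_CLASSES = 26  # A-Z; motion-traced letters such as J and Z remain problematic

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 1)),                  # grayscale hand crop
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=10, validation_split=0.1)
```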
Experimental Results
Using the sample dataset of 13 American Sign Language videos, manually collected (N=10) and recorded (N=3), the application was tested for its accuracy in detecting the alphabet in a video and for the correctness of converting the detected alphabet into English print. A detection accuracy of 94.16% and a conversion accuracy of 89.65% were obtained, as shown in Table 1. All of the videos were manually labeled to determine the number of letters produced by the sign language interpreter. A common detection error involves fingerspelled letters that must be traced in the air, such as "Z" and "J". In addition, "K" and "P" use a similar hand shape, with the former palm down and the latter palm forward, which confuses the algorithm. Notwithstanding, the spellchecker and autocorrect algorithm was able to translate the sign language into correct English words, especially for words containing these problematic letters.
Table 1. Per-video detection and conversion accuracy.
Video Index | Number of Frames | Ground-Truth Letters | Detected Letters | Detection Accuracy (%) | Correct Characters | Conversion Accuracy (%) |
---|---|---|---|---|---|---|
1 | 3744 | 145 | 144 | 99.31 | 121 | 84.03 |
2 | 3576 | 124 | 120 | 96.77 | 110 | 91.67 |
3 | 4176 | 169 | 151 | 89.35 | 140 | 92.72 |
4 | 3336 | 139 | 131 | 94.24 | 118 | 90.08 |
5 | 4104 | 171 | 169 | 98.83 | 149 | 88.17 |
6 | 5136 | 214 | 201 | 93.93 | 192 | 95.52 |
7 | 1824 | 76 | 72 | 94.74 | 66 | 91.67 |
8 | 2664 | 111 | 100 | 90.09 | 92 | 92.00 |
9 | 2352 | 98 | 92 | 93.88 | 89 | 96.74 |
10 | 3384 | 141 | 124 | 87.94 | 115 | 92.74 |
11 | 3096 | 129 | 120 | 93.02 | 103 | 85.83 |
12 | 1416 | 45 | 43 | 95.56 | 34 | 79.07 |
13 | 1512 | 56 | 54 | 96.43 | 46 | 85.19 |
Mean: | 3102 | 124 | 117 | 94.16 | 106 | 89.65 |
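For clarity on how the figures in Table 1 relate, detection accuracy is the ratio of detected letters to ground-truth letters, and conversion accuracy is the ratio of correctly converted characters to detected letters. The short sketch below recomputes the first row and the column means reported above.

```python
# Recomputing the Table 1 metrics:
#   detection accuracy  = detected letters / ground-truth letters
#   conversion accuracy = correct characters / detected letters
ground_truth = [145, 124, 169, 139, 171, 214, 76, 111, 98, 141, 129, 45, 56]
detected     = [144, 120, 151, 131, 169, 201, 72, 100, 92, 124, 120, 43, 54]
correct      = [121, 110, 140, 118, 149, 192, 66, 92, 89, 115, 103, 34, 46]

detection_acc  = [100.0 * d / g for d, g in zip(detected, ground_truth)]
conversion_acc = [100.0 * c / d for c, d in zip(correct, detected)]

print(round(detection_acc[0], 2))                          # 99.31 for video 1
print(round(sum(detection_acc) / len(detection_acc), 2))   # 94.16 (mean)
print(round(sum(conversion_acc) / len(conversion_acc), 2)) # 89.65 (mean)
```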
Conclusion and Recommendations
In this study, a mobile application for hand alphabet recognition and dactylology conversion to English print was presented. Grounded in various algorithms and methodologies, the preliminary results of the experimental assessment indicate a very encouraging outcome, with 94.16% detection accuracy and 89.65% conversion accuracy, at least on the supplied dataset. One problem that needs to be addressed is improving the accuracy of the classification model, particularly on letters with the same hand shape and on fingerspelled letters that must be traced in the air. As of this writing, the mobile application is still limited because it is an ongoing project, and this first phase focuses on the conversion of sign language. For future work and the next phase of the project, a classifier will be modelled to detect words in sign language, extending beyond the hand alphabet. This extension of the algorithm is expected to be useful for communicating and enhancing interactions with people with disabilities. Converting sign language to English print and then to speech audio is also feasible and would remove the burden of reading textual information, but for now this remains a recommendation for future authors. Overall, this evaluation study adds to the list of existing novel approaches that promote positive social effects and improve the quality of life both of disabled people and of everyone they socialize with.
Acknowledgements
The authors would like to thank FEU Institute of Technology for funding the conference presentation.