Hand Gestures Detection on Mobile Devices

CSE455 Final Project

Group members: Zihan Lin, Zhengrui Sun, Shaoqi Wang, Apollo Zhu
University of Washington

Problem Description

We plan to implement real-time hand gesture recognition that runs on mobile phones (iOS) without using Apple's built-in gesture APIs: capturing hand motions from the camera and interpreting the meaning associated with each gesture.

The project has various potential applications, including enhancing the mobile gaming experience, enabling hands-free control of devices, and facilitating communication for users with disabilities.

Previous work

We currently detect single gestures only; a pre-trained full-frame classifier is used in place of a detector + classifier combination. The full-frame classifier is a ResNeXt101 model, reported to achieve an F1 score of 95.67% on the gesture classification task.
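As a rough sketch of what such a full-frame classifier looks like, the ResNeXt101 backbone is available in torchvision. The 19-way head (18 gestures plus a "no gesture" class) is our assumption about the label layout, not a confirmed detail of the pre-trained weights:

```python
import torch
import torchvision.models as models

# Minimal sketch of a ResNeXt101 full-frame gesture classifier.
# The 19-way head (18 gestures + an assumed "no gesture" class) is a
# guess at the label layout, not a confirmed detail of the weights.
NUM_CLASSES = 19

model = models.resnext101_32x8d(weights=models.ResNeXt101_32X8D_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Linear(model.fc.in_features, NUM_CLASSES)
model.eval()

# Full-frame inference: the entire image is classified, with no detector stage.
dummy = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed camera frame
with torch.no_grad():
    logits = model(dummy)
pred = logits.argmax(dim=1)
```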

Our approach

We use a CNN trained in PyTorch for classification. Since only Core ML models can run on iOS devices, we use Apple's coremltools converter to turn the trained model into a form suitable for mobile deployment. We also experiment with the iPhone's depth sensor to see whether the depth map improves the accuracy of classification and hand detection.
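A minimal sketch of the conversion step, using the standard coremltools flow of tracing the model and converting the TorchScript module. The normalization constants, input name, and output file name are our assumptions, not values taken from the project:

```python
import torch
import torchvision.models as models
import coremltools as ct

# Sketch of the PyTorch -> Core ML conversion step. The classifier here is an
# untrained stand-in; in practice this would be the trained gesture model.
model = models.resnext101_32x8d()
model.fc = torch.nn.Linear(model.fc.in_features, 19)  # assumed 19-way head
model.eval()

# coremltools converts a traced TorchScript module.
example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

# ImageType lets the iOS app feed camera frames directly. The scale/bias
# values encode standard ImageNet-style normalization and are an assumption
# about the model's preprocessing, as is the output file name.
mlmodel = ct.convert(
    traced,
    inputs=[ct.ImageType(name="image", shape=example.shape,
                         scale=1 / (0.226 * 255.0),
                         bias=[-0.485 / 0.226, -0.456 / 0.226, -0.406 / 0.226])],
    convert_to="neuralnetwork",
)
mlmodel.save("HandGestureClassifier.mlmodel")
```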

Video Demo

Datasets

For training and testing the models, we use HaGRID (HAnd Gesture Recognition Image Dataset): https://arxiv.org/abs/2206.08219.

The dataset contains 552,992 samples divided into 18 gesture classes. Annotations consist of hand bounding boxes with gesture labels, plus markup of the leading hand. According to its authors, the dataset supports building hand gesture recognition (HGR) systems for video conferencing services, home automation, the automotive sector, and services for people with speech and hearing impairments. All 18 gestures are functional, familiar to most people, and suited to triggering actions, which matches our focus on controlling devices. The authors collected the data through crowdsourcing platforms and varied capture conditions to ensure diversity; they also provide baselines for the hand detection and gesture classification tasks.
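For illustration, a hedged sketch of reading one annotation file. The folder and file names and the JSON layout (per-image entries with normalized [x, y, w, h] boxes, gesture labels, and a leading-hand field) follow our reading of the dataset documentation and should be checked against the actual release:

```python
import json

# Assumed layout: one JSON file per gesture class (e.g. "like"), keyed by
# image id, with normalized bounding boxes and per-box gesture labels.
with open("ann_train_val/like.json") as f:
    annotations = json.load(f)

for image_id, ann in list(annotations.items())[:3]:
    for box, label in zip(ann["bboxes"], ann["labels"]):
        x, y, w, h = box  # normalized to [0, 1]; scale by image size for pixels
        print(image_id, label, ann.get("leading_hand"), (x, y, w, h))
```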

Results

We successfully converted the PyTorch model into a Core ML model that can be deployed on mobile devices. The classifier achieves relatively stable classification of hand gestures and displays the results in real time (as shown in the demo).
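Before deploying, the converted model can be sanity-checked from Python, a sketch assuming macOS (coremltools can only run predictions there), the input name "image" from the conversion sketch above, and a placeholder test image:

```python
import coremltools as ct
from PIL import Image

# Sanity check on macOS: run the converted Core ML model directly.
# "image" matches the input name chosen during conversion; "test_frame.jpg"
# is a placeholder file name.
mlmodel = ct.models.MLModel("HandGestureClassifier.mlmodel")
frame = Image.open("test_frame.jpg").resize((224, 224))
out = mlmodel.predict({"image": frame})
print(out)  # dict mapping output names to logits/probabilities
```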


Discussion

Potential improvements:

  1. Add Detector: Instead of a simple full-frame classifier, use a detector + classifier combination to improve runtime performance and prediction accuracy (see the sketch after this list).
  2. Increase Recognition Accuracy: While the current recognition rate is solid, there is room for improvement; further refining the training and model choices could improve the system's ability to correctly recognize hand gestures.
  3. Expand Gesture Library: The system could benefit from recognizing a broader array of gestures. Adding more complex or culturally specific gestures would make the application more versatile and user-friendly.
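A hedged sketch of the detector + classifier pipeline from item 1. A pretrained torchvision Faster R-CNN stands in for a real hand detector (it is a generic object detector, not hand-specific), and `classifier` is assumed to be a gesture model like the one sketched earlier:

```python
import torch
import torchvision

# Generic detector used as a stand-in for a dedicated hand detector.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=torchvision.models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT)
detector.eval()

def classify_gestures(frame, classifier, score_thresh=0.8):
    """Two-stage pipeline sketch. frame: float tensor (3, H, W) in [0, 1]."""
    with torch.no_grad():
        det = detector([frame])[0]  # dict with "boxes", "labels", "scores"
    preds = []
    for box, score in zip(det["boxes"], det["scores"]):
        if score < score_thresh:
            continue
        x1, y1, x2, y2 = box.int().tolist()
        crop = frame[:, y1:y2, x1:x2]  # classify only the detected region
        crop = torch.nn.functional.interpolate(
            crop.unsqueeze(0), size=(224, 224), mode="bilinear")
        with torch.no_grad():
            preds.append(classifier(crop).argmax(dim=1).item())
    return preds
```

Cropping to the detected hand before classifying should make the classifier's job easier than full-frame input, at the cost of running two models per frame.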