Real-time Hand Gesture Recognition for AR Interaction
A sophisticated system combining computer vision and deep learning for hand tracking and gesture recognition in augmented reality
1. Overview
A sophisticated real-time hand gesture recognition system developed during my internship at Hanon Systems, implementing a hybrid architecture that combines classical computer vision with deep learning. The system leverages a depth-sensing camera enhanced by Extended Kalman filtering for precise 3D tracking, and incorporates both MediaPipe-based gesture recognition and optimized ONNX neural network implementations for robust hand detection and pose estimation.

System pipeline showing data flow from camera through Python backend to Unity frontend
2. Hand Detection and Tracking
2.1 MediaPipe Landmark Detection
The system uses Google’s MediaPipe Hands library as the foundation for initial hand detection and landmark extraction:
- Extracts 21 keypoints from each hand in real time
- Provides a skeletal representation of hand pose
- Computationally efficient with high accuracy
- Enables detection of both left and right hands independently
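
As a minimal sketch of the landmark-extraction step described above (the configuration values below are illustrative, not the project's actual settings):

```python
# Minimal sketch of 21-point landmark extraction with MediaPipe Hands.
# Confidence thresholds and hand count are illustrative assumptions.
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def extract_landmarks(frame_bgr, hands):
    """Return a list of (handedness label, 21 landmark tuples) for one BGR frame."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    result = hands.process(rgb)
    detections = []
    if result.multi_hand_landmarks:
        for handedness, hand in zip(result.multi_handedness,
                                     result.multi_hand_landmarks):
            label = handedness.classification[0].label          # "Left" or "Right"
            points = [(lm.x, lm.y, lm.z) for lm in hand.landmark]  # 21 keypoints
            detections.append((label, points))
    return detections

with mp_hands.Hands(max_num_hands=2,
                    min_detection_confidence=0.7,
                    min_tracking_confidence=0.5) as hands:
    cap = cv2.VideoCapture(0)
    ok, frame = cap.read()
    if ok:
        print(extract_landmarks(frame, hands))
    cap.release()
```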

Hand detection with MediaPipe showing 21 landmark points and their connections
2.2 3D Tracking with Extended Kalman Filter
To achieve precise and stable 3D tracking, the system combines MediaPipe landmarks with depth data and applies Extended Kalman Filtering:
- Multi-stage Depth Filtering Pipeline:
  - Spatial filtering to reduce noise
  - Temporal filtering for consistency
  - Kalman filtering for smooth tracking
- Extended Kalman Filter Implementation:
  - 2D state vector (position, velocity) estimation
  - Optimized noise matrices for hand motion
  - 30Hz update rate with dynamic time-step handling
  - Advanced outlier rejection for robust tracking
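
Because a constant-velocity motion model is linear, the per-axis filtering step can be illustrated with a plain Kalman filter. The sketch below uses illustrative noise values and omits the project's actual noise tuning, outlier rejection, and depth pre-filtering:

```python
# Minimal sketch of a constant-velocity Kalman filter for one axis of hand
# position. Noise values are illustrative, not the project's tuned matrices.
import numpy as np

class AxisKalmanFilter:
    def __init__(self, process_var=1e-2, meas_var=1e-3):
        self.x = np.zeros(2)               # state: [position, velocity]
        self.P = np.eye(2)                 # state covariance
        self.Q = process_var * np.eye(2)   # process noise (illustrative)
        self.R = np.array([[meas_var]])    # measurement noise (illustrative)
        self.H = np.array([[1.0, 0.0]])    # we observe position only

    def predict(self, dt):
        F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity model
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + self.Q
        return self.x[0]

    def update(self, z):
        y = z - self.H @ self.x                        # innovation
        S = self.H @ self.P @ self.H.T + self.R        # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)       # Kalman gain
        self.x = self.x + (K @ y).ravel()
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return self.x[0]

# Usage: one filter per axis, updated at ~30 Hz with depth-derived positions.
kf_x = AxisKalmanFilter()
kf_x.predict(dt=1 / 30)
smoothed_x = kf_x.update(z=np.array([0.12]))
```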

Visualization of the multi-stage depth filtering process showing raw depth data being transformed into smooth 3D positions

Extended Kalman Filter implementation flowchart showing the prediction-correction cycle
2.3 ONNX Neural Network Integration
The system incorporates an optimized ONNX-based pipeline to improve performance:
- Two-stage Detection System:
  - Palm Detection (192x192 input)
  - Hand Landmark Detection (224x224 input)
- Performance Optimizations:
  - FP16 quantization reducing model size by 50%
  - Unity Barracuda engine for GPU acceleration
  - Custom tensor preprocessing pipeline
  - Overall inference time <33ms (30+ FPS)
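
A minimal Python-side sketch of the two-stage inference flow using ONNX Runtime; the model file names, tensor layout, and preprocessing here are assumptions, and the Unity frontend runs the equivalent models through Barracuda:

```python
# Minimal sketch of two-stage palm detection + hand landmark inference with
# onnxruntime. File names and preprocessing are hypothetical placeholders.
import numpy as np
import onnxruntime as ort

palm_sess = ort.InferenceSession("palm_detection.onnx")      # hypothetical path
landmark_sess = ort.InferenceSession("hand_landmark.onnx")   # hypothetical path

def preprocess(image_hwc):
    """Normalize a uint8 HWC image to [0, 1] and convert to NCHW float32."""
    img = image_hwc.astype(np.float32) / 255.0
    return img.transpose(2, 0, 1)[None]                      # (1, 3, H, W)

def run_pipeline(frame_192, hand_crop_224):
    # Stage 1: palm detection on a 192x192 frame
    palm_feed = {palm_sess.get_inputs()[0].name: preprocess(frame_192)}
    palm_out = palm_sess.run(None, palm_feed)
    # Stage 2: hand landmark regression on a 224x224 crop around the palm
    lm_feed = {landmark_sess.get_inputs()[0].name: preprocess(hand_crop_224)}
    landmarks = landmark_sess.run(None, lm_feed)
    return palm_out, landmarks
```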

ONNX neural network pipeline with parallel palm detection and hand landmark models
3. Gesture Recognition System
3.1 Static Gesture Recognition
The static gesture recognition system employs geometric analysis of hand landmarks:
- Joint Angle Calculation: Analysis of angles between finger joints
- Finger State Detection: Adaptive thresholds for open/closed/bent states
- Palm Orientation Analysis: Using normal vectors to determine hand orientation
- Real-time Confidence Scoring: Certainty evaluation for each classification
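
A minimal sketch of the joint-angle idea behind finger state detection; the landmark indices follow MediaPipe's 21-point layout, and the open/closed threshold is illustrative:

```python
# Minimal sketch of geometric finger-state analysis: the angle at a joint is
# computed from three consecutive landmarks, then compared to a threshold.
import numpy as np

def joint_angle(a, b, c):
    """Angle in degrees at landmark b, formed by segments b->a and b->c."""
    v1, v2 = np.asarray(a) - np.asarray(b), np.asarray(c) - np.asarray(b)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def finger_is_open(landmarks, mcp, pip, tip, open_threshold=160.0):
    """Treat a finger as open when its PIP joint is nearly straight (illustrative threshold)."""
    return joint_angle(landmarks[mcp], landmarks[pip], landmarks[tip]) > open_threshold

# Example: the index finger uses MCP=5, PIP=6, TIP=8 in MediaPipe's indexing.
```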

Demonstration of supported static gestures: GRAB, OPEN_PALM, PINCH, and POINT
3.2 Dynamic Gesture Recognition
For dynamic gestures, the system uses a Gated Recurrent Unit (GRU) neural network combined with custom motion pattern analysis:
- GRU Neural Network:
  - Input: Sequence of 30 frames (63 features per frame)
  - Architecture: 2 hidden layers with 32 units each
  - Output: Classification with confidence scores for each dynamic gesture
- Motion Pattern Analysis:
  - Velocity component extraction (dx, dy)
  - Horizontal/vertical motion ratio analysis
  - Specialized pattern detectors for SWIPE, CIRCLE, and WAVE gestures
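
A minimal PyTorch sketch of a GRU classifier matching the shape described above (30 frames x 63 features, 2 layers of 32 units); the number of gesture classes is an assumption:

```python
# Minimal sketch of the dynamic-gesture classifier: a 2-layer GRU over
# 30-frame sequences of 63 features (21 landmarks x 3 coordinates).
# num_classes is a hypothetical value, not the project's actual count.
import torch
import torch.nn as nn

class DynamicGestureGRU(nn.Module):
    def __init__(self, num_features=63, hidden_size=32, num_layers=2, num_classes=4):
        super().__init__()
        self.gru = nn.GRU(num_features, hidden_size,
                          num_layers=num_layers, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, x):                     # x: (batch, 30, 63)
        _, h_n = self.gru(x)                  # h_n: (num_layers, batch, hidden)
        logits = self.classifier(h_n[-1])     # last layer's final hidden state
        return torch.softmax(logits, dim=-1)  # per-gesture confidence scores

model = DynamicGestureGRU()
scores = model(torch.randn(1, 30, 63))        # e.g. SWIPE / CIRCLE / WAVE probabilities
```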

GRU-based dynamic gesture recognition pipeline with sequence preprocessing and temporal smoothing

Demonstration of supported dynamic gestures: SWIPE_LEFT, SWIPE_RIGHT, and CIRCLE
4. Unity Integration and AR Interaction
4.1 Real-time Hand Rigging
The Unity frontend provides visualization and interaction capabilities:
- Hand Model: Fully articulated 3D hand with 21 joints
- Inverse Kinematics: Realistic hand movement based on tracking data
- Real-time Physics: Dynamic object interaction with collision detection
- WebSocket Communication: Low-latency data streaming between Python backend and Unity frontend
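
A minimal sketch of the Python-side streaming loop using the asyncio `websockets` library; the message schema, host, and port are assumptions:

```python
# Minimal sketch of streaming tracking packets from the Python backend to a
# Unity WebSocket client at ~30 Hz. Schema, host, and port are hypothetical.
import asyncio
import json
import websockets

async def stream_hand_data(websocket):
    """Send one tracking packet per frame at roughly 30 Hz."""
    while True:
        packet = {
            "landmarks": [[0.0, 0.0, 0.0]] * 21,  # placeholder 3D joint positions
            "gesture": "OPEN_PALM",
            "confidence": 0.98,
        }
        await websocket.send(json.dumps(packet))  # kept well under 1 KB
        await asyncio.sleep(1 / 30)

async def main():
    async with websockets.serve(stream_hand_data, "localhost", 8765):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```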

Real-time hand rigging in Unity. The 3D hand model accurately mirrors the user's hand movements and gestures
4.2 AR Interaction Demo
The system was integrated into an AR environment for demonstration purposes:
- Intuitive Interactions: Users can grab, move, and place virtual objects
- Gesture-Triggered Events: Different gestures trigger different interactions
- Real-time Response: The system maintains 30+ FPS for seamless user experience
Note: The virtual flower arrangement scene shown in the demo was created solely for demonstration purposes. During my internship at Hanon Systems, the actual implementation was focused on enabling automotive technicians to practice precise component placement and assembly procedures for virtual HVAC systems in an automotive manufacturing context.
5. Performance Metrics
| Component | Metric | Value |
|---|---|---|
| Hand Tracking | Tracking Precision | <7.5mm error |
| Static Gestures | Recognition Accuracy | 97% |
| Dynamic Gestures | Recognition Latency | <33ms |
| ONNX Models | Palm Detection Inference | 8-10ms |
| ONNX Models | Landmark Detection Inference | 12-15ms |
| Overall System | Frame Rate | 30+ FPS |
6. Contribution
During my internship at Hanon Systems, I contributed to the HVAC Systems Simulation Team as a Machine Learning Engineer Intern by architecting and implementing a real-time 3D hand tracking and gesture recognition system for augmented reality (AR) applications within Unity. This system enabled over 50 automotive technicians to practice component placement in virtual HVAC systems.
My key contributions included:
- Integrating a depth-sensing camera for capturing 3D data and combining MediaPipe with an Extended Kalman Filter for robust 3D hand tracking, achieving a tracking error of less than 7.5mm relative to ground truth
- Designing and implementing both static gesture recognition (using geometric analysis) and dynamic gesture recognition (using a custom-trained GRU network and motion analysis), achieving 97% accuracy for static gestures and under 30ms latency for dynamic gestures
- Optimizing the pipeline with ONNX, achieving 33% faster inference with 50% smaller model size than MediaPipe
- Establishing real-time communication between a Python backend and the Unity AR frontend using WebSockets, enabling a 30Hz data streaming rate with packets under 1KB
- Implementing the 3D hand model rigging and inverse kinematics for realistic hand movement in the Unity environment
The demonstration scene shown in this portfolio was created after the internship to showcase the capabilities of the system, while the actual implementation at Hanon Systems was focused on HVAC component visualization and interaction.
7. Skills & Technologies Used
- Languages & Frameworks: Python, C++, PyTorch, ONNX, MediaPipe, WebSockets, OpenCV, NumPy
- Machine Learning: 3D Tracking, Landmark Detection, Depth Sensing, Geometric Analysis, Kalman Filtering
- Deep Learning: Recurrent Neural Networks (GRU), Model Optimization (ONNX), GPU Inference (Unity Barracuda)
- Computer Vision: Landmark Detection, Depth Filtering, Motion Analysis, Real-time Processing
- Augmented Reality: Unity Development, 3D Interaction Design, Virtual Object Manipulation
8. Project Repository
- VirtuHand: Main project repository