Real-time Hand Gesture Recognition for AR Interaction

A sophisticated system combining computer vision and deep learning for hand tracking and gesture recognition in augmented reality

1. Overview

A real-time hand gesture recognition system developed during my internship at Hanon Systems. It implements a hybrid architecture that combines classical computer vision with deep learning: a depth-sensing camera and Extended Kalman filtering provide precise 3D tracking, while MediaPipe-based gesture recognition and optimized ONNX neural network models handle robust hand detection and pose estimation.

System Architecture

System pipeline showing data flow from camera through Python backend to Unity frontend

Full Demo

📌 For full quality, watch the video on YouTube


2. Hand Detection and Tracking

2.1 MediaPipe Landmark Detection

The system uses Google’s MediaPipe Hands library as the foundation for initial hand detection and landmark extraction:

  • Extracts 21 keypoints from each hand in real-time
  • Provides a skeletal representation of hand pose
  • Computationally efficient with high accuracy
  • Enables detection of both left and right hands independently
MediaPipe Landmarks

Hand detection with MediaPipe showing 21 landmark points and their connections
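
As a minimal illustration of this step, the sketch below extracts the 21 landmarks per hand with the MediaPipe Hands API; the confidence thresholds shown are placeholder values, not the ones tuned for the project.

```python
import cv2
import mediapipe as mp

# Minimal sketch: extract 21 hand landmarks per frame with MediaPipe Hands.
# Thresholds here are illustrative, not the project's tuned values.
mp_hands = mp.solutions.hands

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=2,
                    min_detection_confidence=0.7,
                    min_tracking_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            for hand, handedness in zip(results.multi_hand_landmarks,
                                        results.multi_handedness):
                label = handedness.classification[0].label   # "Left" or "Right"
                # 21 normalized (x, y, z) keypoints per detected hand.
                keypoints = [(lm.x, lm.y, lm.z) for lm in hand.landmark]
cap.release()
```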

2.2 3D Tracking with Extended Kalman Filter

To achieve precise and stable 3D tracking, the system combines MediaPipe landmarks with depth data and applies Extended Kalman Filtering:

  • Multi-stage Depth Filtering Pipeline:
    • Spatial filtering to reduce noise
    • Temporal filtering for consistency
    • Kalman filtering for smooth tracking
  • Extended Kalman Filter Implementation:
    • 2D state vector (position, velocity) estimation
    • Optimized noise matrices for hand motion
    • 30Hz update rate with dynamic time-step handling
    • Advanced outlier rejection for robust tracking
Depth Filtering

Visualization of the multi-stage depth filtering process showing raw depth data being transformed into smooth 3D positions
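
To make the spatial and temporal stages concrete, here is a minimal sketch of a two-stage depth filter; the median kernel size and smoothing factor are illustrative assumptions, not the parameters used in the project.

```python
import cv2
import numpy as np

class DepthFilter:
    """Illustrative two-stage depth filter: spatial smoothing, then temporal smoothing.
    Kernel size and alpha are placeholder values, not the project's tuned parameters."""

    def __init__(self, alpha=0.6):
        self.alpha = alpha      # temporal smoothing factor (assumption)
        self.prev = None

    def __call__(self, depth_frame):
        # Spatial stage: median filtering suppresses speckle noise in the depth map.
        spatial = cv2.medianBlur(depth_frame.astype(np.float32), 5)
        # Temporal stage: exponential moving average keeps depth consistent across frames.
        if self.prev is None:
            self.prev = spatial
        self.prev = self.alpha * spatial + (1 - self.alpha) * self.prev
        return self.prev
```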

EKF Flowchart

Extended Kalman Filter implementation flowchart showing the prediction-correction cycle
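
The sketch below illustrates the prediction-correction cycle with a simplified, single-axis constant-velocity filter over a [position, velocity] state, including a basic innovation gate for outlier rejection; the noise matrices and gating value are placeholders rather than the project's tuned EKF parameters.

```python
import numpy as np

class ConstantVelocityKF:
    """Simplified per-axis [position, velocity] Kalman filter (illustrative values)."""

    def __init__(self, process_noise=1e-3, measurement_noise=5e-3):
        self.x = np.zeros(2)                       # state: [position, velocity]
        self.P = np.eye(2)                         # state covariance
        self.Q = process_noise * np.eye(2)         # process noise (placeholder)
        self.R = np.array([[measurement_noise]])   # measurement noise (placeholder)
        self.H = np.array([[1.0, 0.0]])            # only position is measured

    def predict(self, dt):
        F = np.array([[1.0, dt],
                      [0.0, 1.0]])                 # constant-velocity motion model
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + self.Q

    def update(self, z, gate=3.0):
        y = z - self.H @ self.x                    # innovation
        S = self.H @ self.P @ self.H.T + self.R    # innovation covariance
        # Simple outlier gate: skip measurements far outside the expected spread.
        if abs(y[0]) > gate * np.sqrt(S[0, 0]):
            return
        K = self.P @ self.H.T @ np.linalg.inv(S)   # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(2) - K @ self.H) @ self.P
```

In the full pipeline, one such filter per coordinate axis would be stepped with the measured frame interval (roughly 33ms at the 30Hz update rate) before each correction.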

2.3 ONNX Neural Network Integration

The system incorporates an optimized ONNX-based pipeline to improve performance:

  • Two-stage Detection System:
    • Palm Detection (192x192 input)
    • Hand Landmark Detection (224x224 input)
  • Performance Optimizations:
    • FP16 quantization reducing model size by 50%
    • Unity Barracuda engine for GPU acceleration
    • Custom tensor preprocessing pipeline
    • Overall inference time <33ms (30+ FPS)
ONNX Pipeline

ONNX neural network pipeline with parallel palm detection and hand landmark models
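
The project runs these models on the GPU through Unity Barracuda; the Python onnxruntime sketch below only illustrates the two-stage flow. Model file names, input layout, and the cropping logic are assumptions for illustration.

```python
import cv2
import numpy as np
import onnxruntime as ort

# Hypothetical model paths; the exported models and their I/O layout may differ.
palm_sess = ort.InferenceSession("palm_detection_fp16.onnx")
lmk_sess = ort.InferenceSession("hand_landmark_fp16.onnx")

def preprocess(image, size):
    """Resize to a square model input and scale pixels to [0, 1] in NCHW layout.
    Depending on the export, the model may instead expect float16 or NHWC tensors."""
    blob = cv2.resize(image, (size, size)).astype(np.float32) / 255.0
    return np.transpose(blob, (2, 0, 1))[np.newaxis]

def infer(frame):
    # Stage 1: palm detection on a 192x192 input.
    palm_out = palm_sess.run(
        None, {palm_sess.get_inputs()[0].name: preprocess(frame, 192)})
    # ... decode palm boxes and crop the hand region from the frame (omitted) ...
    hand_crop = frame  # placeholder for the cropped hand region

    # Stage 2: 21-landmark regression on a 224x224 crop.
    lmk_out = lmk_sess.run(
        None, {lmk_sess.get_inputs()[0].name: preprocess(hand_crop, 224)})
    return palm_out, lmk_out
```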


3. Gesture Recognition System

3.1 Static Gesture Recognition

The static gesture recognition system employs geometric analysis of hand landmarks:

  • Joint Angle Calculation: Analysis of angles between finger joints
  • Finger State Detection: Adaptive thresholds for open/closed/bent states
  • Palm Orientation Analysis: Using normal vectors to determine hand orientation
  • Real-time Confidence Scoring: Certainty evaluation for each classification
Static Gestures

Demonstration of supported static gestures: GRAB, OPEN_PALM, PINCH, and POINT
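
A minimal sketch of the geometric approach: joint angles from landmark triplets, per-finger state thresholds, a palm normal from three landmarks, and a simple confidence score. The thresholds and the gesture mapping are illustrative, not the project's tuned logic.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (degrees) formed by 3D landmarks a-b-c."""
    v1, v2 = np.asarray(a) - np.asarray(b), np.asarray(c) - np.asarray(b)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def finger_state(mcp, pip, tip, open_thresh=160.0, closed_thresh=90.0):
    """Classify a finger as open / bent / closed from the angle at its PIP joint.
    Thresholds are illustrative placeholders."""
    angle = joint_angle(mcp, pip, tip)
    if angle > open_thresh:
        return "open"
    if angle < closed_thresh:
        return "closed"
    return "bent"

def palm_normal(wrist, index_mcp, pinky_mcp):
    """Approximate palm orientation as the unit normal of the wrist/index/pinky plane."""
    v1 = np.asarray(index_mcp) - np.asarray(wrist)
    v2 = np.asarray(pinky_mcp) - np.asarray(wrist)
    n = np.cross(v1, v2)
    return n / (np.linalg.norm(n) + 1e-8)

def classify_static(finger_states):
    """Tiny example mapping finger states to a gesture label with a confidence score."""
    closed = sum(s == "closed" for s in finger_states) / len(finger_states)
    opened = sum(s == "open" for s in finger_states) / len(finger_states)
    if closed >= 0.8:
        return "GRAB", closed
    if opened >= 0.8:
        return "OPEN_PALM", opened
    return "UNKNOWN", max(closed, opened)
```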

3.2 Dynamic Gesture Recognition

For dynamic gestures, the system uses a Gated Recurrent Unit (GRU) neural network combined with custom motion pattern analysis:

  • GRU Neural Network:
    • Input: Sequence of 30 frames (63 features per frame)
    • Architecture: 2 hidden layers with 32 units each
    • Output: Classification with confidence scores for each dynamic gesture
  • Motion Pattern Analysis:
    • Velocity component extraction (dx, dy)
    • Horizontal/vertical motion ratio analysis
    • Specialized pattern detectors for SWIPE, CIRCLE, and WAVE gestures
GRU Architecture

GRU-based dynamic gesture recognition pipeline with sequence preprocessing and temporal smoothing
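
The PyTorch sketch below mirrors the stated dimensions (30-frame sequences of 63 features, two GRU layers of 32 units); the number of gesture classes and the training setup are assumptions.

```python
import torch
import torch.nn as nn

class DynamicGestureGRU(nn.Module):
    """GRU classifier matching the stated dimensions (30 x 63 input, 2 x 32 hidden).
    The number of classes is an illustrative assumption."""

    def __init__(self, num_classes=4, input_size=63, hidden_size=32, num_layers=2):
        super().__init__()
        self.gru = nn.GRU(input_size=input_size,
                          hidden_size=hidden_size,
                          num_layers=num_layers,
                          batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, x):                         # x: (batch, 30, 63) landmark sequences
        out, _ = self.gru(x)
        logits = self.head(out[:, -1])            # use the last time step
        return torch.softmax(logits, dim=-1)      # per-gesture confidence scores

# Example: one 30-frame window of 21 landmarks x 3 coordinates = 63 features.
model = DynamicGestureGRU()
probs = model(torch.randn(1, 30, 63))
```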

Dynamic Gestures

Demonstration of supported dynamic gestures: SWIPE_LEFT, SWIPE_RIGHT, and CIRCLE
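
A minimal sketch of the motion-pattern side, operating on a short wrist trajectory in normalized image coordinates; the thresholds and detector logic are illustrative assumptions rather than the project's implementation.

```python
import numpy as np

def analyze_motion(wrist_track, swipe_thresh=0.15):
    """Illustrative motion-pattern check over a list of (x, y) wrist positions."""
    pts = np.asarray(wrist_track, dtype=np.float32)
    deltas = np.diff(pts, axis=0)                # per-frame velocity components (dx, dy)
    dx, dy = deltas[:, 0].sum(), deltas[:, 1].sum()

    # Horizontal/vertical ratio separates lateral swipes from other motion.
    if abs(dx) > swipe_thresh and abs(dx) > 2 * abs(dy):
        return "SWIPE_RIGHT" if dx > 0 else "SWIPE_LEFT"

    # A wave shows repeated sign changes of the horizontal velocity.
    sign_changes = np.count_nonzero(np.diff(np.sign(deltas[:, 0])) != 0)
    if sign_changes >= 3 and abs(dx) < swipe_thresh:
        return "WAVE"
    return None
```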


4. Unity Integration and AR Interaction

4.1 Real-time Hand Rigging

The Unity frontend provides visualization and interaction capabilities:

  • Hand Model: Fully articulated 3D hand with 21 joints
  • Inverse Kinematics: Realistic hand movement based on tracking data
  • Real-time Physics: Dynamic object interaction with collision detection
  • WebSocket Communication: Low-latency data streaming between Python backend and Unity frontend
Hand Rigging

Real-time hand rigging in Unity. The 3D hand model accurately mirrors the user's hand movements and gestures
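
A minimal sketch of the Python side of this link using the `websockets` library (assumed here; the project's actual server code and packet schema are not reproduced). It streams one compact JSON packet per frame at roughly 30Hz.

```python
import asyncio
import json
import time
import websockets  # assumed dependency for this sketch

async def stream_hand_data(websocket):
    """Send one tracking packet per frame (~30Hz) as compact JSON (<1KB each)."""
    while True:
        packet = {
            "timestamp": time.time(),
            "gesture": "OPEN_PALM",                # placeholder values
            "landmarks": [[0.0, 0.0, 0.0]] * 21,   # 21 filtered 3D joint positions
        }
        await websocket.send(json.dumps(packet))
        await asyncio.sleep(1 / 30)                # 30Hz update rate

async def main():
    async with websockets.serve(stream_hand_data, "localhost", 8765):
        await asyncio.Future()                     # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```

On the Unity side, a WebSocket client deserializes each packet and drives the rigged hand model.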

4.2 AR Interaction Demo

The system was integrated into an AR environment for demonstration purposes:

  • Intuitive Interactions: Users can grab, move, and place virtual objects
  • Gesture-Triggered Events: Different gestures trigger different interactions
  • Real-time Response: The system maintains 30+ FPS for seamless user experience

Note: The virtual flower arrangement scene shown in the demo was created solely for demonstration purposes. During my internship at Hanon Systems, the actual implementation was focused on enabling automotive technicians to practice precise component placement and assembly procedures for virtual HVAC systems in an automotive manufacturing context.


5. Performance Metrics

| Component | Metric | Value |
| --- | --- | --- |
| Hand Tracking | Tracking Precision | <7.5mm error |
| Static Gestures | Recognition Accuracy | 97% |
| Dynamic Gestures | Recognition Latency | <33ms |
| ONNX Models | Palm Detection Inference | 8-10ms |
| ONNX Models | Landmark Detection Inference | 12-15ms |
| Overall System | Frame Rate | 30+ FPS |

6. Contribution

During my internship at Hanon Systems, I contributed to the HVAC Systems Simulation Team as a Machine Learning Engineer Intern by architecting and implementing a real-time 3D hand tracking and gesture recognition system for augmented reality (AR) applications within Unity. This system enabled over 50 automotive technicians to practice component placement in virtual HVAC systems.

My key contributions included:

  • Integrating a depth-sensing camera for capturing 3D data and combining MediaPipe with an Extended Kalman Filter for robust 3D hand tracking, achieving tracking accuracy within 7.5mm of ground truth
  • Designing and implementing both static gesture recognition (using geometric analysis) and dynamic gesture recognition (using a custom-trained GRU network and motion analysis), achieving 97% accuracy for static gestures and under 30ms latency for dynamic gestures
  • Optimizing the pipeline with ONNX, achieving 33% faster inference with 50% smaller model size than MediaPipe
  • Establishing real-time communication between a Python backend and the Unity AR frontend using WebSockets, enabling a 30Hz data streaming rate with packets under 1KB
  • Implementing the 3D hand model rigging and inverse kinematics for realistic hand movement in the Unity environment

The demonstration scene shown in this portfolio was created after the internship to showcase the capabilities of the system, while the actual implementation at Hanon Systems was focused on HVAC component visualization and interaction.


7. Skills & Technologies Used

  • Languages & Frameworks: Python, C++, PyTorch, ONNX, MediaPipe, WebSockets, OpenCV, NumPy
  • Machine Learning: 3D Tracking, Landmark Detection, Depth Sensing, Geometric Analysis, Kalman Filtering
  • Deep Learning: Recurrent Neural Networks (GRU), Model Optimization (ONNX), GPU Inference (Unity Barracuda)
  • Computer Vision: Landmark Detection, Depth Filtering, Motion Analysis, Real-time Processing
  • Augmented Reality: Unity Development, 3D Interaction Design, Virtual Object Manipulation

8. Project Repository