Vision-Language-Action Enhanced Robotic Leaf Manipulation

A novel vision system combining geometric computer vision, deep learning, and Vision-Language-Action models for intelligent leaf manipulation

Overview

A real-time vision system for leaf manipulation that combines geometric computer vision with deep learning and Vision-Language-Action (VLA) models. The hybrid pipeline integrates YOLOv8 for instance segmentation, RAFT-Stereo for depth estimation, and a custom CNN, augmented with LLaVA-1.6-Mistral-7B for intelligent grasp reasoning.

Key achievements:

  • Self-supervised learning that eliminates manual annotation entirely
  • LoRA fine-tuning achieving 88% validation accuracy
  • Confidence-weighted framework dynamically balancing traditional CV, ML, and VLA predictions
  • Custom CUDA kernels and TensorRT acceleration
  • AWS GPU training infrastructure with Docker containerization

Figure: Multi-stage perception pipeline enhanced with Vision-Language-Action integration

Vision-Language-Action (VLA) System

LLaVA Integration and Fine-tuning

Foundation Model Enhancement for Grasp Reasoning

Integrated the LLaVA-1.6-Mistral-7B foundation model with parameter-efficient LoRA fine-tuning (a configuration sketch follows the list):

  • Base Model: LLaVA-1.6-Mistral-7B (CLIP + Vicuna) for vision-language understanding
  • Fine-tuning: LoRA adaptation (rank=8, alpha=32) for leaf grasping tasks
  • Training Infrastructure: AWS GPU acceleration with MLflow experiment tracking
  • Performance: 88.0% validation accuracy through systematic hyperparameter optimization
  • Experiments: 4 systematic configurations with comprehensive evaluation
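
A minimal sketch of how this LoRA setup might look with Hugging Face peft; the target modules, dropout, and checkpoint name are assumptions, not the project's confirmed settings:

from peft import LoraConfig, get_peft_model
from transformers import LlavaNextForConditionalGeneration

# Base model; checkpoint name assumed from the Hugging Face hub
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf"
)

lora_config = LoraConfig(
    r=8,                                  # rank, as reported above
    lora_alpha=32,                        # scaling, as reported above
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    lora_dropout=0.05,                    # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only adapter weights remain trainable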

Hybrid CV-VLA Decision Framework

Dynamic Confidence-Based Integration

The system fuses traditional CV, ML, and VLA predictions, weighting each source dynamically by VLA confidence (a usage sketch follows the list below):

# Confidence-based weighting strategy: weights apply to (CV, ML, VLA) scores
def select_weights(vla_confidence):
    if vla_confidence > 0.8:
        return (0.4, 0.3, 0.3)  # High confidence: balanced three-way fusion
    elif vla_confidence > 0.5:
        return (0.7, 0.2, 0.1)  # Medium confidence: conservative VLA influence
    return (0.9, 0.1, 0.0)      # Low confidence: pure CV fallback

  • High VLA Confidence: Balanced three-way integration
  • Medium Confidence: CV-dominant with VLA assistance
  • Low Confidence: Traditional CV fallback for reliability
  • Adaptive Learning: VLA influence grows with operational experience
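
The selected weights form a convex combination of the three predictors' scores. A small usage sketch with illustrative values (the score variables are not the project's actual API):

# Example per-candidate scores in [0, 1] from the CV, ML, and VLA stages
cv_score, ml_score, vla_score = 0.9, 0.7, 0.8
w_cv, w_ml, w_vla = select_weights(vla_confidence=0.85)
final_score = w_cv * cv_score + w_ml * ml_score + w_vla * vla_score  # 0.81 here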

Multi-Stage Perception Pipeline

Instance Segmentation (YOLOv8)

Fine-tuned on approximately 900 images, achieving 68% mAP@[0.5:0.95], with TensorRT optimization for real-time performance.
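
For reference, fine-tuning and inference with the ultralytics API might look like the following; the dataset config and checkpoint size are placeholders:

from ultralytics import YOLO

# Start from pretrained segmentation weights and fine-tune on the leaf dataset
model = YOLO("yolov8n-seg.pt")
model.train(data="leaf_dataset.yaml", epochs=100, imgsz=640)  # hypothetical config

# Inference on a frame yields per-leaf instance masks
results = model("frame.png")
masks = results[0].masks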

Depth Estimation (RAFT-Stereo)

High-precision depth maps with sub-pixel accuracy (<0.5 px); custom CUDA kernels accelerate the downstream point cloud generation. The disparity-to-depth conversion is sketched below.

Figure: Stereo vision pipeline (RGB input → depth estimation → 3D reconstruction)
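
RAFT-Stereo predicts a dense disparity map, and depth follows from standard stereo geometry. A minimal sketch, assuming a calibrated rig with focal length fx (pixels) and baseline B (meters):

import numpy as np

def disparity_to_depth(disparity, fx, baseline):
    # Z = fx * B / d for pixels with positive disparity; 0 marks invalid pixels
    return np.where(disparity > 0, fx * baseline / disparity, 0.0)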

Hybrid Grasp Point Selection

Traditional CV Pipeline

Pareto optimization for leaf selection with geometric scoring:

  • Clutter Score (35%): Isolation using Signed Distance Fields
  • Distance Score (35%): Camera proximity with exponential falloff
  • Visibility Score (30%): Frame position and completeness

Grasp point selection criteria (combined with the leaf score in the sketch after this list):

  • Flatness Analysis (25%): Surface smoothness via depth gradients
  • Approach Vector Quality (40%): Optimal robot orientation
  • Accessibility (15%): Camera-relative positioning
  • Edge Awareness (20%): Boundary distance analysis
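
A minimal sketch of how these two weighted scores might be computed; helper names are illustrative, and all inputs are assumed normalized to [0, 1]:

# Stage 1: leaf selection score over Pareto-optimal candidates
def leaf_score(clutter, distance, visibility):
    return 0.35 * clutter + 0.35 * distance + 0.30 * visibility

# Stage 2: grasp point score within the selected leaf
def grasp_score(flatness, approach, accessibility, edge):
    return 0.25 * flatness + 0.40 * approach + 0.15 * accessibility + 0.20 * edge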

ML Enhancement with MLflow Tracking

Custom GraspPointCNN with comprehensive experiment management (an architecture sketch follows the list):

  • Self-Supervised Learning: CV-generated training data (100% annotation-free)
  • MLflow Integration: 60+ tracked experiments across attention mechanisms
  • Architecture: 9-channel input with spatial/channel attention
  • Performance: 93.14% validation accuracy, 94.79% F1 score

Figure: GraspPointCNN with attention mechanism for grasp quality prediction
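
A minimal PyTorch sketch of a 9-channel attention CNN in this spirit; layer sizes and the squeeze-and-excitation-style channel attention are assumptions, not the project's confirmed architecture:

import torch.nn as nn

class GraspPointCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Feature extractor over the 9-channel input (e.g., RGB + depth + geometric maps)
        self.features = nn.Sequential(
            nn.Conv2d(9, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        # Channel attention: global pool -> bottleneck MLP -> per-channel gates
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 8), nn.ReLU(),
            nn.Linear(8, 64), nn.Sigmoid(),
        )
        # Head regresses a grasp-quality score in [0, 1]
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        f = self.features(x)
        gates = self.attention(f).unsqueeze(-1).unsqueeze(-1)
        return self.head(f * gates)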

VLA-Enhanced Decision Making

Language-guided grasp reasoning with confidence weighting (an example query follows the list):

  • Prompt Engineering: Structured queries for grasp point evaluation
  • Confidence Scoring: Dynamic assessment of VLA prediction quality
  • Fallback Strategy: Robust degradation to proven CV algorithms
  • Continuous Learning: Adaptation through operational feedback

Figure: Hybrid CV-ML-VLA pipeline, with traditional geometric analysis (left) enhanced by foundation model reasoning (right)
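
A hypothetical example of a structured grasp-evaluation query; the wording is illustrative, and the [INST] ... [/INST] delimiters follow the LLaVA-1.6-Mistral chat format:

# Illustrative structured prompt for grasp point evaluation
prompt = (
    "[INST] <image>\n"
    "A candidate grasp point is marked on the leaf. Rate its quality, "
    "considering surface flatness, distance from the leaf edge, and occlusion. "
    "Answer with a score from 0 to 10 and a one-line justification. [/INST]"
)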

Production Optimization

Custom CUDA Kernel Development

Developed GPU kernels addressing CPU bottlenecks in point cloud generation (the parallelization pattern is sketched below):

  • Performance: 5x speedup (150ms → 30ms) for real-time operation
  • Implementation: Parallelized 1.5M pixel processing with memory optimization
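
As an illustration of the per-pixel parallelization pattern behind such kernels (one thread back-projects one depth pixel into a 3D point), here is a sketch using Numba's CUDA JIT with assumed pinhole intrinsics:

from numba import cuda
import numpy as np

@cuda.jit
def depth_to_points(depth, fx, fy, cx, cy, points):
    # One thread per pixel: back-project (u, v, Z) into the camera frame
    v, u = cuda.grid(2)
    if v < depth.shape[0] and u < depth.shape[1]:
        z = depth[v, u]
        points[v, u, 0] = (u - cx) * z / fx
        points[v, u, 1] = (v - cy) * z / fy
        points[v, u, 2] = z

depth = np.random.rand(1080, 1440).astype(np.float32)   # ~1.5M pixels
points = np.zeros((1080, 1440, 3), dtype=np.float32)
threads = (16, 16)
blocks = ((1080 + 15) // 16, (1440 + 15) // 16)
depth_to_points[blocks, threads](depth, 700.0, 700.0, 720.0, 540.0, points)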

TensorRT & Model Optimization

System-wide acceleration via model compilation (an example export call follows the list):

  • Models: YOLOv8, RAFT-Stereo, GraspPointCNN, and LLaVA components
  • Techniques: FP16 precision, operator fusion, graph optimization
  • Results: 35% throughput improvement (20 → 27 FPS)
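
As one concrete example of the export path, using the ultralytics API for the YOLOv8 stage (the weights filename is a placeholder; RAFT-Stereo and the CNN would go through their own ONNX-to-TensorRT toolchains):

from ultralytics import YOLO

model = YOLO("leaf_yolov8_seg.pt")        # hypothetical fine-tuned weights
model.export(format="engine", half=True)  # TensorRT engine with FP16 precision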

AWS Training Infrastructure

Cloud-based training pipeline for VLA fine-tuning, with the MLflow logging pattern sketched after the list:

  • Infrastructure: g4dn.xlarge instances with Tesla T4 GPUs
  • Cost Efficiency: LoRA fine-tuning reduces computational requirements
  • Scalability: MLflow tracking across distributed experiments
  • Deployment: Docker containerization for environment consistency

Figure: Production-optimized VLA-enhanced grasping system in operation
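
A minimal sketch of the MLflow tracking pattern for a fine-tuning run; the experiment name and logged values are placeholders:

import mlflow

mlflow.set_experiment("llava-lora-leaf-grasping")  # hypothetical experiment name
with mlflow.start_run():
    mlflow.log_params({"lora_rank": 8, "lora_alpha": 32, "lr": 2e-4})
    for epoch, val_acc in enumerate([0.81, 0.85, 0.88]):  # placeholder curve
        mlflow.log_metric("val_accuracy", val_acc, step=epoch)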

Results & Performance

VLA System Performance

Model Performance Metrics:

  • LLaVA-1.6-Mistral-7B: 88.0% validation accuracy with LoRA fine-tuning (AWS GPU + MLflow)
  • GraspPointCNN: 93.14% validation accuracy with spatial attention (self-supervised)
  • Hybrid Integration: 82.66% field success rate with confidence weighting (production deployment)

System Performance Comparison (150 test cases)

Performance Improvements:

  • Overall Success Rate: 78.00% → 82.66% (+4.66% improvement)
  • Feature Alignment: 80.67% → 83.33% (+2.66% improvement)
  • Edge Case Handling: 75.33% → 77.33% (+2.00% improvement)
  • Accuracy: 25.3px → 27.1px (+1.8px)

Production Optimization Results

System Optimizations:

  • VLA Training (AWS): CPU-only → GPU acceleration (3x speedup)
  • Point Cloud Generation: 150ms → 30ms (5x speedup)
  • Inference Throughput: 20 FPS → 27 FPS (35% improvement)
  • Dataset Creation: Manual annotation → self-supervised labeling (manual effort fully eliminated)

Key Contributions

System Development:

  • Complete VLA integration pipeline with LLaVA-1.6-Mistral-7B foundation model
  • LoRA fine-tuning achieving 88% validation accuracy through systematic optimization
  • Hybrid decision framework balancing CV, ML, and VLA predictions with confidence weighting
  • Self-supervised learning eliminating manual annotation requirements
  • AWS GPU training infrastructure with MLflow experiment tracking

Performance Optimization:

  • Custom CUDA kernels for 5x point cloud generation speedup
  • TensorRT model compilation for 35% inference improvement
  • Production Docker deployment achieving 82.66% field success rate

This research was conducted under the supervision of Prof. Abhisesh Silwal and Prof. George A. Kantor.


Skills and Technologies

  • Foundation Models: LLaVA-1.6-Mistral-7B, LoRA Fine-tuning, Vision-Language Integration
  • Languages: Python, C++, CUDA
  • Deep Learning: PyTorch, CNN Architecture, Self-Supervised Learning, Attention Mechanisms
  • Computer Vision: Instance Segmentation, Depth Estimation, Point Cloud Processing, 3D Perception
  • MLOps: MLflow Experiment Tracking, Model Versioning, Hyperparameter Optimization
  • Cloud & Performance: AWS EC2/GPU, Custom CUDA Kernels, TensorRT Optimization
  • Production: Docker Containerization, ROS2 Integration, Real-time Systems

References

[1] Srecharan Selvam, Abhisesh Silwal, George Kantor, "Self-Supervised Learning for Robotic Leaf Manipulation: A Hybrid Geometric-Neural Approach," https://arxiv.org/pdf/2505.0370, under review at ICCV 2025.

[2] Silwal, A., Zhang, X. M., Hadlock, T., Neice, J., Haque, S., Kaundanya, A., Lu, C., Vinatzer, B. A., Kantor, G., & Li, S. (2024). Towards an AI-Driven Cyber-Physical System for Closed-Loop Control of Plant Diseases. Proceedings of the AAAI Symposium Series, 4(1), 432-435.