Weekly Update — April 16, 2026
VLA-Flight
Instruction-Grounded Drone Navigation with Vision-Language-Action and Onboard IMU
Yashas Shashidhara
Week of Apr 7 – 16
Motivation
The Problem
We want drones that can follow natural language instructions like "fly along the crop row and avoid the obstacle."
Current approaches (OpenVLA, AerialVLA) use massive 7B+ parameter language models. They need server-grade GPUs and can't run on the drone itself.
The goal: A model small enough to run onboard a drone in real time, while still understanding language instructions.
Approach
VLA-Flight: What Makes It Different
Replace the giant language model with a tiny, purpose-built architecture.
| | OpenVLA / AerialVLA | VLA-Flight |
| --- | --- | --- |
| Parameters | 7.5B | ~80M |
| Deployment | Server GPU required | Runs on Jetson Orin NX at >30Hz |
How? Swap the 7B LLM for a frozen MiniLM sentence encoder (22M) + a small cross-attention fusion decoder (5M). Use continuous action output instead of binned tokens. Result: 100x smaller, 10x faster.
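A minimal PyTorch sketch of what that fusion stage could look like. Dimensions, token counts, and the projection of MiniLM features are illustrative assumptions; only the overall shape (frozen text features cross-attended by a small decoder into a continuous action chunk) follows the slide.

```python
import torch
import torch.nn as nn

class FusionDecoder(nn.Module):
    """Sketch of the fusion stage (dims and token counts are illustrative):
    learned action queries cross-attend over concatenated vision/IMU/text
    tokens and decode a chunk of continuous actions."""
    def __init__(self, d_model=256, n_heads=4, n_queries=10):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, 4)   # continuous [vx, vy, vz, yaw]

    def forward(self, vision_tok, imu_tok, text_tok):
        ctx = torch.cat([vision_tok, imu_tok, text_tok], dim=1)  # (B, T, d)
        q = self.queries.unsqueeze(0).expand(ctx.size(0), -1, -1)
        fused, _ = self.cross_attn(q, ctx, ctx)
        return self.head(fused)             # (B, n_queries, 4) action chunk

dec = FusionDecoder()
out = dec(torch.randn(2, 49, 256),   # vision tokens (e.g. from EfficientViT)
          torch.randn(2, 8, 256),    # IMU tokens
          torch.randn(2, 16, 256))   # frozen text-encoder tokens, projected
```

Continuous regression heads like this also avoid the detokenization step that binned action tokens require.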
Key Ideas
Five Core Contributions
Dual IMU Fusion
Use IMU data two ways: raw sensor fusion for fine dynamics (wind, drift) + convert bearing to text hints ("forward-right") for coarse navigation intent.
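The bearing-to-text half of that idea can be sketched in a few lines, assuming a body frame with +x forward and +y right and eight 45° sectors (the exact quantization is an assumption):

```python
import math

def bearing_to_hint(dx, dy):
    """Quantize a goal bearing (body frame: +x forward, +y right)
    into one of eight coarse text hints for the language input."""
    ang = math.degrees(math.atan2(dy, dx)) % 360  # clockwise from forward
    labels = ["forward", "forward-right", "right", "back-right",
              "back", "back-left", "left", "forward-left"]
    return labels[int((ang + 22.5) // 45) % 8]
```

Example: a goal offset of (1, 1) maps to "forward-right", which can be appended to the instruction text fed to the frozen sentence encoder.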
Action Chunking + VAE
Predict 10 future actions at once for smooth flight, but only execute 5 before replanning. A VAE prevents averaging when multiple good paths exist.
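The predict-10-execute-5 loop is a standard receding-horizon pattern; a toy driver (function names are hypothetical) looks like this:

```python
def run_chunked(policy, get_obs, send_cmd, n_replans=2, execute=5):
    """Toy receding-horizon driver: the policy predicts a 10-step chunk,
    but only the first `execute` steps are flown before replanning
    from fresh observations."""
    executed = 0
    for _ in range(n_replans):
        chunk = policy(get_obs())       # e.g. 10 x [vx, vy, vz, yaw]
        for action in chunk[:execute]:
            send_cmd(action)
            executed += 1
    return executed

# Dummy policy that always emits a 10-step hover chunk.
sent = []
n = run_chunked(lambda obs: [[0.0, 0.0, 0.0, 0.0]] * 10,
                lambda: None, sent.append)
```

Executing only half the chunk keeps flight smooth between replans while still reacting to new observations every 5 steps.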
Edge Deployment
~80M params total. Runs entirely onboard a Jetson Orin NX 16GB at >30Hz. No server needed.
Safety Module
Zero-parameter geometric correction: velocity clamping, altitude bounds, obstacle repulsion. Adds zero latency.
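A zero-parameter filter of this kind might look like the following sketch; all limits and the repulsion rule are illustrative, not the project's actual bounds:

```python
import numpy as np

def safety_filter(cmd, alt, obstacle_vec=None,
                  v_max=2.0, alt_min=1.0, alt_max=30.0, d_safe=2.0):
    """Hypothetical zero-parameter geometric filter (all limits illustrative):
    clamp speed, enforce altitude bounds, push away from close obstacles."""
    v = np.asarray(cmd, dtype=float).copy()
    speed = np.linalg.norm(v[:3])
    if speed > v_max:                     # velocity clamping
        v[:3] *= v_max / speed
    if alt <= alt_min and v[2] < 0:       # don't descend below the floor (z up)
        v[2] = 0.0
    if alt >= alt_max and v[2] > 0:       # don't climb above the ceiling
        v[2] = 0.0
    if obstacle_vec is not None:          # obstacle_vec points drone -> obstacle
        d = float(np.linalg.norm(obstacle_vec))
        if 1e-6 < d < d_safe:             # inside safety radius: repel
            v[:3] -= (d_safe - d) * np.asarray(obstacle_vec, float) / d
    return v

capped = safety_filter([4.0, 0.0, 0.0, 0.0], alt=10.0)  # speed clamped to v_max
```

Because these are pure arithmetic checks on the already-predicted command, they add no learned parameters and negligible latency.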
3-Stage Training
Stage 1: learn to see (vision + IMU). Stage 2: learn to navigate (full model). Stage 3: specialize for agriculture.
Architecture
How the Model Works
Three inputs go in, safe flight commands come out.
      Camera           IMU            Instruction
        |               |                 |
  EfficientViT     IMU Encoder     MiniLM (frozen)
      (15M)           (2M)             (22M)
        |               |                 |
        +---------------+-----------------+
                        |
         Fusion Decoder — cross-attention (5M)
                        |
                  VAE Bottleneck
                        |
    Action Chunking → [vx, vy, vz, yaw] x 10 steps
                        |
    Safety Module (0 params) → execute 5 safe steps
This Week
What I Worked On
UZH-FPV Dataset
Downloaded (~30-50GB), preprocessed, and ran training end-to-end.
Complete
MUN-FRL Dataset
Manually downloaded (batch script was broken). Fixed preprocessing pipeline.
Preprocessed
NaN Training Bug
Training kept crashing in Stage 2. Diagnosed the cause and applied an initial fix.
Tuning
Blackbird Dataset
MIT hosting is down. Can't download yet.
Blocked
Issue #1
MUN-FRL: Download Script Was Broken
- The batch download script (download_datasets.sh) ran without errors
- But every downloaded .bag file was empty — it saved the HTTP header instead of the actual file
- Files were KB instead of GB — no real sensor data inside
Fix: Manually downloaded each ROS bag file from the MUN-FRL docs site. Verified each file with rosbag info before processing.
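A small sanity check like the following would have caught the empty files immediately (the size threshold is an assumption; the `#ROSBAG` magic bytes are the standard bag-file header):

```python
from pathlib import Path
import tempfile

def looks_valid_bag(path, min_bytes=1_000_000):
    """Heuristic check: a real ROS bag starts with the '#ROSBAG' magic
    and is far larger than an HTML error page or stray HTTP header."""
    p = Path(path)
    if p.stat().st_size < min_bytes:
        return False
    with p.open("rb") as f:
        return f.read(8).startswith(b"#ROSBAG")

# Demo: a tiny HTML error page fails the check.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"<html>404 Not Found</html>")
    fake = f.name
ok = looks_valid_bag(fake)
```

Running a check like this right after each download, before any preprocessing, turns a silent failure into a loud one.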
Issue #2
Preprocessing Script: Wrong ROS Topics
The script had UZH-FPV topic names hardcoded. MUN-FRL uses completely different names, so the script silently found nothing.
| Data | Script Expected | MUN-FRL Actual |
| --- | --- | --- |
| Camera | /cam0/image_raw | /camera_forward/image_raw |
| IMU | /imu | /xsens/imu/data |
| Actions | /control/command | Derived from PPK/GNSS |
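One way to avoid re-breaking on the next dataset is a per-dataset topic config instead of hardcoded names (a hypothetical sketch; topic strings are copied from the table above, the keys are made up):

```python
# Hypothetical per-dataset topic config replacing the hardcoded UZH-FPV names.
TOPIC_MAP = {
    "uzh_fpv": {"camera": "/cam0/image_raw",
                "imu": "/imu",
                "actions": "/control/command"},
    "mun_frl": {"camera": "/camera_forward/image_raw",
                "imu": "/xsens/imu/data",
                "actions": None},  # no command topic: derive from PPK/GNSS
}

def topics_for(dataset):
    """Fail loudly instead of silently reading zero messages."""
    if dataset not in TOPIC_MAP:
        raise KeyError(f"no topic map for {dataset!r}")
    return TOPIC_MAP[dataset]
```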
Issue #2 (continued)
Also Fixed: Message Types + Sync
- Wrong message type: the script expected CompressedImage, but MUN-FRL uses raw sensor_msgs/Image. The script would hang forever waiting for messages that were flying right past it.
- Wrong image format: the script assumed BGR8 color, but MUN-FRL is Mono8 (grayscale). Had to add manual conversion.
- IMU-to-image sync: images come at 20Hz, IMU at 400Hz. Added ApproximateTimeSynchronizer to align them.
- No action labels: MUN-FRL doesn't have control commands in the bag. Derived velocity actions from ground truth positions: v = (pos[t+1] - pos[t]) / dt
Issue #3
Training Crashes with NaN in Stage 2
Stage 1 (perception pretraining) works fine. But ~10 epochs into Stage 2, the loss becomes NaN and training dies.
- Stage 2 unfreezes the Fusion Decoder — this is where vision and IMU signals get combined via cross-attention
- The learning rate (1e-3) was too high for this sensitive layer; attention scores exploded.
- We train in fp16 (half precision) because the Jetson needs it for speed. But fp16's largest finite value is 65,504: the exploding values overflow to infinity, then NaN.
Issue #3 (Fix)
Fix: Lower the Learning Rate
Smaller learning rate = smaller weight updates = numbers stay within fp16 range.
stage2_lr: 1.0e-3 → 2.0e-4
stage3_lr: 1.0e-3 → 3.0e-4
Result: Training no longer crashes. Loss decreases smoothly through Stage 2. But the optimal LR is still unknown — need to sweep to find the sweet spot.
Next Week
Next Steps
Learning Rate Sweep
Try LRs from 1e-4 to 5e-4 to find the best value that trains fast without crashing.
Diversify Instructions
Right now every episode has the same instruction text. The model ignores language if it's always identical. Need unique descriptions per trajectory.
Train on MUN-FRL
Now that preprocessing works, run training on this outdoor dataset for agricultural domain generalization.
Blackbird Dataset
Monitor MIT hosting. Download and preprocess once the site is back up.
Evaluate Checkpoints
Run evaluation on UZH-FPV trained models. Compare trajectory metrics against baselines.
End
Questions?
VLA-Flight — ~80M params, >30Hz on Jetson Orin NX 16GB
Yashas Shashidhara
github.com/Yashas1600