Weekly Update — April 16, 2026

VLA-Flight

Instruction-Grounded Drone Navigation with Vision-Language-Action and Onboard IMU

Yashas Shashidhara Week of Apr 7 – 16
Motivation

The Problem

We want drones that can follow natural language instructions like "fly along the crop row and avoid the obstacle."

Current approaches (OpenVLA, AerialVLA) use massive 7B+ parameter language models. They need server-grade GPUs and can't run on the drone itself.

The goal: A model small enough to run onboard a drone in real time, while still understanding language instructions.
Approach

VLA-Flight: What Makes It Different

Replace the giant language model with a tiny, purpose-built architecture.

OpenVLA / AerialVLA: 7.5B params, server GPU required
vs.
VLA-Flight: ~80M params, runs on a Jetson Orin NX at >30 Hz

How? Swap the 7B LLM for a frozen MiniLM sentence encoder (22M) plus a small cross-attention fusion decoder (5M), and output continuous actions instead of binned tokens. Result: ~100x smaller, ~10x faster.
Key Ideas

Five Core Contributions

Dual IMU Fusion

Use IMU data two ways: raw sensor fusion for fine dynamics (wind, drift) + convert bearing to text hints ("forward-right") for coarse navigation intent.
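The bearing-to-text half of this idea can be sketched as a simple binning function. The 8-way bins and the body-frame convention (x forward, y right) below are illustrative assumptions, not the project's exact scheme:

```python
import math

def bearing_to_hint(dx: float, dy: float) -> str:
    """Map a goal-relative displacement (body frame: x forward, y right)
    to a coarse language hint like "forward-right".
    The 8-way, 45-degree binning here is an assumption."""
    angle = math.degrees(math.atan2(dy, dx)) % 360
    bins = ["forward", "forward-right", "right", "back-right",
            "back", "back-left", "left", "forward-left"]
    # each bin spans 45 degrees, centered on its direction
    idx = int(((angle + 22.5) % 360) // 45)
    return bins[idx]
```

The resulting string can be appended to the instruction text so the frozen sentence encoder sees coarse navigation intent for free.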

Action Chunking + VAE

Predict 10 future actions at once for smooth flight, but only execute 5 before replanning. A VAE prevents averaging when multiple good paths exist.
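The predict-10 / execute-5 loop is a receding-horizon pattern; a minimal sketch, with `predict_chunk` standing in for the policy and trivial placeholder dynamics:

```python
def receding_horizon(predict_chunk, state, n_steps, chunk=10, execute=5):
    """Predict `chunk` actions at once but only execute the first `execute`
    before replanning. `predict_chunk` is a hypothetical stand-in for the
    VLA policy; the additive state update is placeholder dynamics."""
    executed = []
    while len(executed) < n_steps:
        actions = predict_chunk(state)      # list of `chunk` actions
        for a in actions[:execute]:
            state = state + a               # placeholder dynamics
            executed.append(a)
            if len(executed) == n_steps:
                break
    return executed
```

Executing only half the chunk keeps the controller responsive to new observations while the full 10-step prediction keeps the trajectory smooth.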

Edge Deployment

~80M params total. Runs entirely onboard a Jetson Orin NX 16GB at >30Hz. No server needed.

Safety Module

Zero-parameter geometric correction: velocity clamping, altitude bounds, obstacle repulsion. No learned weights, negligible added latency.
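The three corrections can be composed in one pure function. This is a sketch under assumed thresholds and a simple linear repulsion rule; the actual limits and repulsion law are not specified on the slide:

```python
def safety_filter(action, pos, obstacles,
                  v_max=2.0, z_min=0.5, z_max=10.0, repel_radius=1.5):
    """Zero-parameter geometric safety correction (illustrative constants):
    clamp velocity, bound altitude, push away from nearby obstacles."""
    vx, vy, vz, yaw = action
    # 1. per-axis velocity clamping
    vx = max(-v_max, min(v_max, vx))
    vy = max(-v_max, min(v_max, vy))
    vz = max(-v_max, min(v_max, vz))
    # 2. altitude bounds: veto commands that would exit [z_min, z_max]
    x, y, z = pos
    if z <= z_min and vz < 0: vz = 0.0
    if z >= z_max and vz > 0: vz = 0.0
    # 3. obstacle repulsion: add velocity pointing away from close obstacles
    for ox, oy, oz in obstacles:
        dx, dy = x - ox, y - oy
        d = (dx * dx + dy * dy) ** 0.5
        if 1e-6 < d < repel_radius:
            gain = (repel_radius - d) / repel_radius
            vx += gain * dx / d
            vy += gain * dy / d
    return (vx, vy, vz, yaw)
```

Because it is a fixed geometric map with no learned parameters, it runs after the network at effectively no extra cost.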

3-Stage Training

Stage 1: learn to see (vision + IMU). Stage 2: learn to navigate (full model). Stage 3: specialize for agriculture.
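The curriculum amounts to a freeze/unfreeze schedule. A sketch with hypothetical module names (the frozen MiniLM text encoder never trains; which encoders stay unfrozen in Stage 3 is my assumption):

```python
# Which modules train in each stage; names are illustrative stand-ins.
STAGE_TRAINABLE = {
    1: {"vision_encoder", "imu_encoder"},                 # learn to see
    2: {"vision_encoder", "imu_encoder",
        "fusion_decoder", "vae", "action_head"},          # learn to navigate
    3: {"fusion_decoder", "vae", "action_head"},          # agri specialization
}

def set_stage(model_params, stage):
    """Enable gradients only for modules active in this stage.
    `model_params` maps module name -> list of tensors with .requires_grad."""
    for name, params in model_params.items():
        trainable = name in STAGE_TRAINABLE[stage]
        for p in params:
            p.requires_grad = trainable
```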

Architecture

How the Model Works

Three inputs go in, safe flight commands come out.

Camera Image          IMU Sensors         "Fly along the crop row"
      |                    |                        |
EfficientViT (15M)    IMU Encoder (2M)     MiniLM, frozen (22M)
      |                    |                        |
      +--------------------+------------------------+
                           |
          Fusion Decoder: cross-attention (5M)
                           |
                    VAE Bottleneck
                           |
   Action Chunking -> [vx, vy, vz, yaw] x 10 steps
                           |
              Safety Module (0 params)
                           |
                execute 5 safe steps
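At the shape level, the fusion step is learned queries cross-attending over all encoder tokens, one query per future action. A NumPy sketch with invented dimensions (all sizes are assumptions; the VAE bottleneck is omitted for brevity):

```python
import numpy as np

# Shape-level walk-through of the diagram; every dimension is illustrative.
rng = np.random.default_rng(0)

img_tokens  = rng.standard_normal((64, 256))   # EfficientViT patch tokens
imu_feat    = rng.standard_normal((1, 256))    # IMU encoder summary token
text_tokens = rng.standard_normal((12, 256))   # frozen MiniLM embeddings

# Fusion decoder: learned queries cross-attend over all modality tokens
context = np.concatenate([img_tokens, imu_feat, text_tokens], axis=0)
queries = rng.standard_normal((10, 256))       # one query per future action
attn = np.exp(queries @ context.T / np.sqrt(256))
attn /= attn.sum(axis=1, keepdims=True)        # softmax over context tokens
fused = attn @ context                         # (10, 256)

# Action head: each fused token -> [vx, vy, vz, yaw]
W = rng.standard_normal((256, 4)) * 0.01
chunk = fused @ W                              # (10, 4) continuous actions
```

Continuous regression at the head is what replaces the binned action tokens of the larger VLA models.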
This Week

What I Worked On

UZH-FPV Dataset

Downloaded (~30-50GB), preprocessed, and ran training end-to-end.

Complete

MUN-FRL Dataset

Manually downloaded (batch script was broken). Fixed preprocessing pipeline.

Preprocessed

NaN Training Bug

Training kept crashing in Stage 2. Diagnosed and applied initial fix.

Tuning

Blackbird Dataset

MIT hosting is down. Can't download yet.

Blocked
Issue #1

MUN-FRL: Download Script Was Broken

Fix: Manually downloaded each ROS bag file from the MUN-FRL docs site. Verified each file with rosbag info before processing.
Issue #2

Preprocessing Script: Wrong ROS Topics

The script had UZH-FPV topic names hardcoded. MUN-FRL uses completely different names, so the script silently found nothing.

Data      Script Expected       MUN-FRL Actual
Camera    /cam0/image_raw       /camera_forward/image_raw
IMU       /imu                  /xsens/imu/data
Actions   /control/command      Derived from PPK/GNSS
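One way to keep this from silently failing again is a per-dataset topic map instead of hardcoded names. The topic strings below come from the table above; the lookup structure itself is an illustrative sketch, not the actual pipeline code:

```python
# Per-dataset ROS topic maps; strings taken from the comparison table.
TOPICS = {
    "uzh_fpv": {
        "camera": "/cam0/image_raw",
        "imu": "/imu",
        "actions": "/control/command",
    },
    "mun_frl": {
        "camera": "/camera_forward/image_raw",
        "imu": "/xsens/imu/data",
        "actions": None,   # derived from PPK/GNSS ground truth instead
    },
}

def topic_for(dataset: str, stream: str):
    """Raise KeyError loudly (rather than finding nothing) on unknown names."""
    return TOPICS[dataset][stream]
```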
Issue #2 (continued)

Also Fixed: Message Types + Sync

Issue #3

Training Crashes with NaN in Stage 2

Stage 1 (perception pretraining) works fine. But ~10 epochs into Stage 2, the loss becomes NaN and training dies.

Issue #3 (Fix)

Fix: Lower the Learning Rate

A smaller learning rate means smaller weight updates, which keeps activations and gradients within fp16's representable range.
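The failure mode is easy to reproduce in isolation: float16 tops out at 65504, so one oversized intermediate saturates to inf, and the next reduction turns it into the NaN loss we saw. A minimal demonstration (the specific numbers are illustrative):

```python
import numpy as np

# fp16 max is 65504; one large step overflows to inf, then inf - inf -> nan.
x = np.float16(60000.0)
step = np.float16(10000.0)     # the kind of jump a too-large LR produces
y = x + step                   # 70000 exceeds fp16 max -> inf
loss = y - y                   # inf - inf -> nan, killing training
```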

# Stage 2: Fusion decoder unfrozen
stage2_lr: 1.0e-3 → 2.0e-4

# Stage 3: Agricultural domain refinement
stage3_lr: 1.0e-3 → 3.0e-4
Result: Training no longer crashes. Loss decreases smoothly through Stage 2. But the optimal LR is still unknown — need to sweep to find the sweet spot.
Next Week

Next Steps

Learning Rate Sweep

Try LRs from 1×10⁻⁴ to 5×10⁻⁴ to find the best value that trains fast without crashing.
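The planned sweep reduces to a small grid search. A sketch where `train_and_eval` is a hypothetical hook that runs a stage and returns final validation loss:

```python
# Grid spanning the quoted range; `train_and_eval` is a hypothetical hook
# that trains at the given LR and returns final validation loss.
LRS = [1e-4, 2e-4, 3e-4, 4e-4, 5e-4]

def run_sweep(train_and_eval):
    results = {lr: train_and_eval(lr) for lr in LRS}
    best_lr = min(results, key=results.get)   # lowest validation loss wins
    return best_lr, results
```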

Diversify Instructions

Right now every episode has the same instruction text. The model ignores language if it's always identical. Need unique descriptions per trajectory.
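A cheap first step toward unique per-trajectory text is seeded templating. Everything below (templates, vocab, the per-trajectory seed scheme) is a hypothetical sketch, not the dataset's actual annotation plan:

```python
import random

# Hypothetical templating so each episode gets distinct, reproducible text.
TEMPLATES = [
    "fly {direction} along the {feature}",
    "follow the {feature} heading {direction}",
    "move {direction}, keeping the {feature} in view",
]
DIRECTIONS = ["forward", "forward-right", "left", "right"]
FEATURES = ["crop row", "tree line", "field edge"]

def make_instruction(seed: int) -> str:
    rng = random.Random(seed)   # seed per trajectory -> stable, varied text
    return rng.choice(TEMPLATES).format(
        direction=rng.choice(DIRECTIONS),
        feature=rng.choice(FEATURES),
    )
```

Ideally the filled-in direction and feature should match what the trajectory actually does, otherwise the model learns to ignore language for a different reason.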

Train on MUN-FRL

Now that preprocessing works, run training on this outdoor dataset for agricultural domain generalization.

Blackbird Dataset

Monitor MIT hosting. Download and preprocess once the site is back up.

Evaluate Checkpoints

Run evaluation on UZH-FPV trained models. Compare trajectory metrics against baselines.

End

Questions?

VLA-Flight — ~80M params, >30Hz on Jetson Orin NX 16GB

Yashas Shashidhara github.com/Yashas1600