Weekly Update — April 16, 2026

VLA-Flight

Instruction-Grounded Drone Navigation with Vision-Language-Action and Onboard IMU

Yashas Shashidhara Week of Apr 7 – 16
Motivation

The Problem

We want drones that can follow natural language instructions like "fly along the crop row and avoid the obstacle."

Current approaches (OpenVLA, AerialVLA) use massive 7B+ parameter language models. They need server-grade GPUs and can't run on the drone itself.

The goal: A model small enough to run onboard a drone in real time, while still understanding language instructions.
Approach

VLA-Flight: What Makes It Different

Replace the giant language model with a tiny, purpose-built architecture.

OpenVLA / AerialVLA: 7.5B params, server GPU required
vs.
VLA-Flight: ~80M params, runs on a Jetson Orin NX at >30 Hz

How? Swap the 7B LLM for a frozen MiniLM sentence encoder (22M) plus a small cross-attention fusion decoder (5M), and output continuous actions instead of binned tokens. Result: ~100x smaller, ~10x faster.
Key Ideas

Five Core Contributions

Dual IMU Fusion

Use IMU data two ways: raw sensor fusion for fine dynamics (wind, drift) + convert bearing to text hints ("forward-right") for coarse navigation intent.
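The bearing-to-text half of this idea can be sketched as a simple binning function. The 8-way bins and the body-frame convention (x forward, y right) below are illustrative assumptions, not the project's exact scheme:

```python
import math

def bearing_to_hint(dx: float, dy: float) -> str:
    """Map a goal-relative displacement (body frame: x forward, y right)
    to a coarse language hint like "forward-right".
    The 8-way, 45-degree binning here is an assumption."""
    angle = math.degrees(math.atan2(dy, dx)) % 360
    bins = ["forward", "forward-right", "right", "back-right",
            "back", "back-left", "left", "forward-left"]
    # each bin spans 45 degrees, centered on its direction
    idx = int(((angle + 22.5) % 360) // 45)
    return bins[idx]
```

The resulting string can be appended to the instruction text so the frozen sentence encoder sees coarse navigation intent for free.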

Action Chunking + VAE

Predict 10 future actions at once for smooth flight, but only execute 5 before replanning. A VAE prevents averaging when multiple good paths exist.
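The predict-10 / execute-5 loop is a receding-horizon pattern; a minimal sketch, with `predict_chunk` standing in for the policy and trivial placeholder dynamics:

```python
def receding_horizon(predict_chunk, state, n_steps, chunk=10, execute=5):
    """Predict `chunk` actions at once but only execute the first `execute`
    before replanning. `predict_chunk` is a hypothetical stand-in for the
    VLA policy; the additive state update is placeholder dynamics."""
    executed = []
    while len(executed) < n_steps:
        actions = predict_chunk(state)      # list of `chunk` actions
        for a in actions[:execute]:
            state = state + a               # placeholder dynamics
            executed.append(a)
            if len(executed) == n_steps:
                break
    return executed
```

Executing only half the chunk keeps the controller responsive to new observations while the full 10-step prediction keeps the trajectory smooth.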

Edge Deployment

~80M params total. Runs entirely onboard a Jetson Orin NX 16GB at >30Hz. No server needed.

Safety Module

Zero-parameter geometric correction: velocity clamping, altitude bounds, obstacle repulsion. No learned weights, negligible added latency.
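The three corrections can be composed in one pure function. This is a sketch under assumed thresholds and a simple linear repulsion rule; the actual limits and repulsion law are not specified on the slide:

```python
def safety_filter(action, pos, obstacles,
                  v_max=2.0, z_min=0.5, z_max=10.0, repel_radius=1.5):
    """Zero-parameter geometric safety correction (illustrative constants):
    clamp velocity, bound altitude, push away from nearby obstacles."""
    vx, vy, vz, yaw = action
    # 1. per-axis velocity clamping
    vx = max(-v_max, min(v_max, vx))
    vy = max(-v_max, min(v_max, vy))
    vz = max(-v_max, min(v_max, vz))
    # 2. altitude bounds: veto commands that would exit [z_min, z_max]
    x, y, z = pos
    if z <= z_min and vz < 0: vz = 0.0
    if z >= z_max and vz > 0: vz = 0.0
    # 3. obstacle repulsion: add velocity pointing away from close obstacles
    for ox, oy, oz in obstacles:
        dx, dy = x - ox, y - oy
        d = (dx * dx + dy * dy) ** 0.5
        if 1e-6 < d < repel_radius:
            gain = (repel_radius - d) / repel_radius
            vx += gain * dx / d
            vy += gain * dy / d
    return (vx, vy, vz, yaw)
```

Because it is a fixed geometric map with no learned parameters, it runs after the network at effectively no extra cost.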

3-Stage Training

Stage 1: learn to see (vision + IMU). Stage 2: learn to navigate (full model). Stage 3: specialize for agriculture.
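The curriculum amounts to a freeze/unfreeze schedule. A sketch with hypothetical module names (the frozen MiniLM text encoder never trains; which encoders stay unfrozen in Stage 3 is my assumption):

```python
# Which modules train in each stage; names are illustrative stand-ins.
STAGE_TRAINABLE = {
    1: {"vision_encoder", "imu_encoder"},                 # learn to see
    2: {"vision_encoder", "imu_encoder",
        "fusion_decoder", "vae", "action_head"},          # learn to navigate
    3: {"fusion_decoder", "vae", "action_head"},          # agri specialization
}

def set_stage(model_params, stage):
    """Enable gradients only for modules active in this stage.
    `model_params` maps module name -> list of tensors with .requires_grad."""
    for name, params in model_params.items():
        trainable = name in STAGE_TRAINABLE[stage]
        for p in params:
            p.requires_grad = trainable
```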

Architecture

How the Model Works

Three inputs go in, safe flight commands come out.

Camera Image          IMU Sensors         "Fly along the crop row"
      |                    |                        |
EfficientViT (15M)    IMU Encoder (2M)     MiniLM, frozen (22M)
      |                    |                        |
      +--------------------+------------------------+
                           |
          Fusion Decoder: cross-attention (5M)
                           |
                    VAE Bottleneck
                           |
   Action Chunking -> [vx, vy, vz, yaw] x 10 steps
                           |
              Safety Module (0 params)
                           |
                execute 5 safe steps
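At the shape level, the fusion step is learned queries cross-attending over all encoder tokens, one query per future action. A NumPy sketch with invented dimensions (all sizes are assumptions; the VAE bottleneck is omitted for brevity):

```python
import numpy as np

# Shape-level walk-through of the diagram; every dimension is illustrative.
rng = np.random.default_rng(0)

img_tokens  = rng.standard_normal((64, 256))   # EfficientViT patch tokens
imu_feat    = rng.standard_normal((1, 256))    # IMU encoder summary token
text_tokens = rng.standard_normal((12, 256))   # frozen MiniLM embeddings

# Fusion decoder: learned queries cross-attend over all modality tokens
context = np.concatenate([img_tokens, imu_feat, text_tokens], axis=0)
queries = rng.standard_normal((10, 256))       # one query per future action
attn = np.exp(queries @ context.T / np.sqrt(256))
attn /= attn.sum(axis=1, keepdims=True)        # softmax over context tokens
fused = attn @ context                         # (10, 256)

# Action head: each fused token -> [vx, vy, vz, yaw]
W = rng.standard_normal((256, 4)) * 0.01
chunk = fused @ W                              # (10, 4) continuous actions
```

Continuous regression at the head is what replaces the binned action tokens of the larger VLA models.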
This Week

What I Worked On

UZH-FPV Dataset

Downloaded (~30-50GB), preprocessed, and ran training end-to-end.

Complete

MUN-FRL Dataset

Manually downloaded (batch script was broken). Fixed preprocessing pipeline.

Preprocessed

NaN Training Bug

Training kept crashing in Stage 2. Diagnosed and applied initial fix.

Tuning

Blackbird Dataset

MIT hosting is down. Can't download yet.

Blocked
Issue #1

MUN-FRL: Download Script Was Broken

Fix: Manually downloaded each ROS bag file from the MUN-FRL docs site. Verified each file with rosbag info before processing.
Issue #2

Preprocessing Script: Wrong ROS Topics

The script had UZH-FPV topic names hardcoded. MUN-FRL uses completely different names, so the script silently found nothing.

Data      Script Expected       MUN-FRL Actual
Camera    /cam0/image_raw       /camera_forward/image_raw
IMU       /imu                  /xsens/imu/data
Actions   /control/command      Derived from PPK/GNSS
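One way to keep this from silently failing again is a per-dataset topic map instead of hardcoded names. The topic strings below come from the table above; the lookup structure itself is an illustrative sketch, not the actual pipeline code:

```python
# Per-dataset ROS topic maps; strings taken from the comparison table.
TOPICS = {
    "uzh_fpv": {
        "camera": "/cam0/image_raw",
        "imu": "/imu",
        "actions": "/control/command",
    },
    "mun_frl": {
        "camera": "/camera_forward/image_raw",
        "imu": "/xsens/imu/data",
        "actions": None,   # derived from PPK/GNSS ground truth instead
    },
}

def topic_for(dataset: str, stream: str):
    """Raise KeyError loudly (rather than finding nothing) on unknown names."""
    return TOPICS[dataset][stream]
```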
Issue #2 (continued)

Also Fixed: Message Types + Sync

Issue #3

Training Crashes with NaN in Stage 2

Stage 1 (perception pretraining) works fine. But ~10 epochs into Stage 2, the loss becomes NaN and training dies.

Issue #3 (Fix)

Fix: Lower the Learning Rate

A smaller learning rate means smaller weight updates, which keeps activations and gradients within fp16's representable range.
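The failure mode is easy to reproduce in isolation: float16 tops out at 65504, so one oversized intermediate saturates to inf, and the next reduction turns it into the NaN loss we saw. A minimal demonstration (the specific numbers are illustrative):

```python
import numpy as np

# fp16 max is 65504; one large step overflows to inf, then inf - inf -> nan.
x = np.float16(60000.0)
step = np.float16(10000.0)     # the kind of jump a too-large LR produces
y = x + step                   # 70000 exceeds fp16 max -> inf
loss = y - y                   # inf - inf -> nan, killing training
```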

# Stage 2: Fusion decoder unfrozen
stage2_lr: 1.0e-3 → 2.0e-4

# Stage 3: Agricultural domain refinement
stage3_lr: 1.0e-3 → 3.0e-4
Result: Training no longer crashes. Loss decreases smoothly through Stage 2. But the optimal LR is still unknown — need to sweep to find the sweet spot.
Next Week

Next Steps

Learning Rate Sweep

Try LRs from 1×10⁻⁴ to 5×10⁻⁴ to find the best value that trains fast without crashing.
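The planned sweep reduces to a small grid search. A sketch where `train_and_eval` is a hypothetical hook that runs a stage and returns final validation loss:

```python
# Grid spanning the quoted range; `train_and_eval` is a hypothetical hook
# that trains at the given LR and returns final validation loss.
LRS = [1e-4, 2e-4, 3e-4, 4e-4, 5e-4]

def run_sweep(train_and_eval):
    results = {lr: train_and_eval(lr) for lr in LRS}
    best_lr = min(results, key=results.get)   # lowest validation loss wins
    return best_lr, results
```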

Diversify Instructions

Right now every episode has the same instruction text. The model ignores language if it's always identical. Need unique descriptions per trajectory.
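A cheap first step toward unique per-trajectory text is seeded templating. Everything below (templates, vocab, the per-trajectory seed scheme) is a hypothetical sketch, not the dataset's actual annotation plan:

```python
import random

# Hypothetical templating so each episode gets distinct, reproducible text.
TEMPLATES = [
    "fly {direction} along the {feature}",
    "follow the {feature} heading {direction}",
    "move {direction}, keeping the {feature} in view",
]
DIRECTIONS = ["forward", "forward-right", "left", "right"]
FEATURES = ["crop row", "tree line", "field edge"]

def make_instruction(seed: int) -> str:
    rng = random.Random(seed)   # seed per trajectory -> stable, varied text
    return rng.choice(TEMPLATES).format(
        direction=rng.choice(DIRECTIONS),
        feature=rng.choice(FEATURES),
    )
```

Ideally the filled-in direction and feature should match what the trajectory actually does, otherwise the model learns to ignore language for a different reason.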

Train on MUN-FRL

Now that preprocessing works, run training on this outdoor dataset for agricultural domain generalization.

Blackbird Dataset

Monitor MIT hosting. Download and preprocess once the site is back up.

Evaluate Checkpoints

Run evaluation on UZH-FPV trained models. Compare trajectory metrics against baselines.

End

Questions?

VLA-Flight — ~80M params, >30Hz on Jetson Orin NX 16GB

Yashas Shashidhara github.com/Yashas1600