Project 02 · 2025

Jeff.

A small voice-controlled robot. Local LLM, speech-to-text, and object detection on a Jetson Orin Nano. Motion control on Arduino.

At a glance

Role: Solo build - hardware, firmware, software
Year: 2025
Compute: NVIDIA Jetson Orin Nano + Arduino
Software: Python, C++, Ollama, Whisper, Piper, YOLO, OpenCV
Sensing: Camera, microphone, ultrasonic
Status: Working prototype

Demos

Two early runs.

Fig. 01 — Human recognition. Jeff turns and stops as soon as he sees me, but doesn't drive the rest of the way. A code bug in the approach step, since fixed.

Fig. 02 — Early driving failure. YOLO detects me correctly, but the drive functions were mixed up so the motion comes out wrong. Navigation works correctly now.

What it does

What Jeff does.

Jeff is a voice-driven robot. It has four main capabilities:

Push-to-talk voice interaction. Whisper for speech-to-text, Piper for text-to-speech.
Conversational replies from a local LLM hosted by Ollama on the Jetson.
Visual awareness through a camera, with YOLO for object detection.
Object-aware navigation. Commands like "go to the bottle" are converted into small steering and driving steps based on live detections.

Speech, language, and vision all run on-device. There is no cloud API in the loop.

Architecture

System overview.

   ┌─────────┐    ┌─────────┐    ┌──────────────────────┐
   │  Mic    │──▶ │ Whisper │──▶ │                      │
   └─────────┘    └─────────┘    │                      │
                                 │   Jetson controller  │
   ┌─────────┐                   │   ──────────────     │   ┌─────────┐
   │ Camera  │ ────────────────▶ │   Ollama (LLM)       │──▶│ Piper   │──▶ Speaker
   └─────────┘                   │   YOLO detection     │   └─────────┘
                                 │   Command parser     │
                                 │                      │   ┌─────────┐
                                 │                      │──▶│ Serial  │
                                 └──────────────────────┘   └────┬────┘
                                                                 ▼
                                                          ┌─────────────┐
                                                          │   Arduino   │──▶ Motors
                                                          │   + sonar   │──▶ Safety stop
                                                          └─────────────┘

The Jetson runs perception, conversation, and planning. The Arduino runs low-level motion: timed drives, encoder-based turns, stops, speed changes, and the ultrasonic safety cutoff.

Voice loop

Voice loop.

Push-to-talk avoids continuous-inference cost on the Jetson. Whisper transcribes the captured audio locally. The transcript first passes through a direct command parser that catches simple motion phrases like "drive 5 seconds" or "turn 45 degrees" and sends them straight to the Arduino — no LLM in the loop. This shaves the ~10 s LLM latency off any task that doesn't actually need reasoning.
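
A minimal sketch of what such a direct parser could look like, assuming regex matching on the lowercased transcript. The patterns, the function name `parse_direct_command`, and the `(action, value)` return shape are illustrative assumptions, not the code running on Jeff.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns: the real phrase list is not documented here.
_DRIVE = re.compile(r"\bdrive(?:\s+for)?\s+(\d+(?:\.\d+)?)\s*(?:seconds?|secs?|s)\b")
_TURN = re.compile(r"\bturn(?:\s+(left|right))?\s+(\d+(?:\.\d+)?)\s*(?:degrees?|deg)\b")


def parse_direct_command(transcript: str) -> Optional[Tuple[str, float]]:
    """Return (action, value) for a simple motion phrase, else None.

    None means "no match": the transcript should fall through to the LLM.
    """
    text = transcript.lower().strip()

    match = _DRIVE.search(text)
    if match:
        return ("drive", float(match.group(1)))

    match = _TURN.search(text)
    if match:
        degrees = float(match.group(2))
        # Assumed sign convention: negative means left, positive means right.
        if match.group(1) == "left":
            degrees = -degrees
        return ("turn", degrees)

    return None
```

A phrase like "go to the bottle" matches neither pattern, so it returns None and takes the slower path. The sketch only handles digits; a transcript that spells a number out ("five seconds") would also fall through.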

If the parser doesn't match, the transcript goes into a system prompt that instructs the LLM to return two parts: a short spoken reply, and, only if the user asked for motion, a structured command on its own line. Piper synthesises the spoken reply. The command line, if present, is parsed and translated into the Arduino's serial protocol.
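
The two-part reply can be separated with a few lines of string handling. This sketch assumes the command line is marked with a `CMD:` prefix; that marker and the name `split_reply` are hypothetical, since the actual prompt format is not shown here.

```python
from typing import Optional, Tuple

COMMAND_PREFIX = "CMD:"  # hypothetical marker for the structured command line


def split_reply(llm_output: str) -> Tuple[str, Optional[str]]:
    """Split raw LLM output into (spoken_reply, command_or_None)."""
    spoken_lines = []
    command = None
    for line in llm_output.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.upper().startswith(COMMAND_PREFIX):
            # Keep only the first non-empty command; never speak command lines.
            if command is None:
                command = stripped[len(COMMAND_PREFIX):].strip() or None
            continue
        spoken_lines.append(stripped)
    return " ".join(spoken_lines), command
```

The spoken part would go to Piper and the command part, if any, to the step that translates it for the Arduino. A reply with no command line yields `None`, which means no motion.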

Vision

Vision and navigation.

For descriptive prompts the LLM is given a still from the camera. For navigation prompts a YOLO pipeline locates the target object in the frame, computes a bearing and a rough distance, and emits a sequence of small steering and driving commands.
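
One common way to get a bearing and a rough distance from a bounding box is the pinhole camera model. The sketch below assumes a known field of view and a known real-world object height; both are assumptions, and how Jeff actually estimates distance is not specified here.

```python
import math


def bearing_deg(box_center_x: float, frame_width: int, hfov_deg: float) -> float:
    """Horizontal angle to the box centre; 0 is straight ahead, positive is right."""
    focal_px = (frame_width / 2) / math.tan(math.radians(hfov_deg) / 2)
    return math.degrees(math.atan((box_center_x - frame_width / 2) / focal_px))


def rough_distance_m(box_height_px: float, frame_height: int,
                     vfov_deg: float, object_height_m: float) -> float:
    """Distance estimate from apparent height, given an assumed real height."""
    if box_height_px <= 0:
        raise ValueError("box_height_px must be positive")
    focal_px = (frame_height / 2) / math.tan(math.radians(vfov_deg) / 2)
    return object_height_m * focal_px / box_height_px
```

The distance figure is only as good as the assumed object height and breaks down when the box is clipped by the frame edge, which is one reason to treat it as rough and re-measure often.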

The navigation loop re-checks the frame between motion steps so a moving target, or a target the robot loses sight of, is handled without driving Jeff into a wall.
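
The re-check pattern can be sketched as a loop that asks for a fresh detection before every motion step. Everything named here (`next_step`, `navigate`, the thresholds, the `(bearing_deg, distance_m)` detection shape) is a hypothetical illustration of the idea, not the controller's real code.

```python
from typing import Callable, Optional, Tuple

Detection = Tuple[float, float]  # assumed shape: (bearing_deg, distance_m)
Step = Tuple[str, float]


def next_step(det: Detection, aim_tolerance_deg: float = 5.0,
              arrive_m: float = 0.4, step_s: float = 0.5) -> Step:
    """Pick one small action from the latest detection."""
    bearing, distance = det
    if distance <= arrive_m:
        return ("stop", 0.0)
    if abs(bearing) > aim_tolerance_deg:
        return ("turn", bearing)   # steer first
    return ("drive", step_s)       # then creep forward


def navigate(detect: Callable[[], Optional[Detection]],
             send: Callable[[Step], None],
             max_steps: int = 40, max_misses: int = 3) -> str:
    """Re-detect before each step; stop and give up if the target stays lost."""
    misses = 0
    for _ in range(max_steps):
        det = detect()
        if det is None:
            misses += 1
            send(("stop", 0.0))    # never drive blind
            if misses >= max_misses:
                return "lost"
            continue
        misses = 0
        step = next_step(det)
        send(step)
        if step[0] == "stop":
            return "arrived"
    return "timeout"
```

Because each step is short and followed by a new frame, a target that moves simply produces a different bearing on the next pass, and a target that disappears produces a stop rather than a blind drive.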

Safety

Safety architecture.

Two layers. First, a command contract: nothing reaches the motors unless the controller emits an explicit, structured movement command. The LLM can output whatever free text it wants; unless that output contains a valid command line, the robot will not move.
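
A command contract of this kind usually comes down to a strict whitelist check. The sketch below assumes a three-command vocabulary (`DRIVE`, `TURN`, `STOP`) with numeric limits; the names and ranges are hypothetical and chosen only to show the shape of the check.

```python
from typing import Optional, Tuple

# Hypothetical limits: (min, max) for each command's single argument.
LIMITS = {"DRIVE": (0.0, 10.0), "TURN": (-180.0, 180.0)}


def validate_command(line: str) -> Optional[Tuple[str, float]]:
    """Return a normalised (name, value) or None if the line is not allowed."""
    parts = line.strip().upper().split()
    if parts == ["STOP"]:
        return ("STOP", 0.0)
    if len(parts) != 2 or parts[0] not in LIMITS:
        return None
    try:
        value = float(parts[1])
    except ValueError:
        return None
    low, high = LIMITS[parts[0]]
    # The chained comparison is False for NaN, so NaN is rejected too.
    if not (low <= value <= high):
        return None
    return (parts[0], value)
```

Anything that returns `None` is dropped before it gets near the serial link, so free-form text from the model cannot turn into motion by accident.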

Second, the Arduino runs an independent ultrasonic loop. If something is too close, the firmware stops the motors regardless of what the Jetson is asking for.

Hardware

Hardware.

Jetson Orin Nano running the Python controller, Ollama, Whisper, Piper and YOLO. Arduino with an Adafruit Motor Shield driving the motors with encoder feedback and handling the ultrasonic sensor. The two boards talk over USB serial with a simple ASCII protocol.
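
To make "simple ASCII protocol" concrete, here is one possible line format: command name, a space, a fixed-precision value, and a newline. The format and the helper names are assumptions for illustration; Jeff's real wire format is not documented here. An in-memory buffer stands in for the serial port.

```python
import io
from typing import Tuple


def encode_command(name: str, value: float) -> bytes:
    """Encode one command as a newline-terminated ASCII line (assumed format)."""
    return f"{name.upper()} {value:.2f}\n".encode("ascii")


def decode_line(raw: bytes) -> Tuple[str, float]:
    """Inverse of encode_command; a bare name decodes with value 0.0."""
    name, _, value = raw.decode("ascii").strip().partition(" ")
    return name, float(value) if value else 0.0


# Stand-in for the USB serial port: any object with write() behaves the same way.
port = io.BytesIO()
port.write(encode_command("turn", 45))
port.write(encode_command("drive", 5))
```

Newline-terminated text keeps the firmware parser small and makes the link easy to debug by hand from a serial monitor.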

The base is a lasercut MDF plate. Power is split across two batteries: a small LiPo for the motors and Arduino, and a separate larger one for the Jetson, so high-current motor draw never browns out the compute side.

Challenges

What I ran into.

The first surprise was latency. The local LLM is the dominant cost — I'm running Qwen 3.5 at 4B parameters and a single reply takes around ten seconds, which is too slow for navigation. That's why YOLO handles navigation directly and the LLM is reserved for analysis prompts like "what can you see?" or "what room are you in?".

The second was GPU memory. Moving Whisper from CPU to GPU dropped transcription from 2–3 seconds to under one, but the code crashed after a few conversation turns. Whisper was making many small GPU allocations, especially in the always-listening wake-word mode, and reserving and releasing them so often that memory became fragmented and the LLM couldn't find a contiguous ~4 GB block when it needed to run. I tried sandboxing Whisper into a fixed memory budget; it didn't solve it. Push-to-talk avoids the problem in practice.
