Gaara AI
Technology

How the AI sees you, frame by frame.

Gaara AI combines real-time computer vision with sequence-aware deep learning — the same techniques used in research labs, optimised to run live in your browser.

The pipeline

Camera to coaching cue in four stages.

Camera (30fps webcam) → Pose Extraction (MediaPipe Holistic) → LSTM (30-frame window) → Feedback (<250ms response)
Stage 1 · Pose Extraction

MediaPipe Holistic.

Every frame is fed through Google's MediaPipe Holistic model, which extracts full-body landmarks in real time on commodity hardware.

The output is a complete biomechanical snapshot — body, hands, and face — ready to be consumed by the recognition model.

Per Frame
Body landmarks · 33 keypoints
Hand landmarks · 21 × 2 hands
Face mesh · 468 points
Inference rate · 30fps
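For concreteness, here is a minimal sketch of this stage using MediaPipe's Python Solutions API; the capture loop and confidence settings are illustrative, not the production code.

```python
import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic

cap = cv2.VideoCapture(0)  # 30fps webcam feed
with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB; OpenCV captures BGR.
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        # results.pose_landmarks, results.left_hand_landmarks,
        # results.right_hand_landmarks and results.face_landmarks
        # together form the per-frame biomechanical snapshot.
cap.release()
```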
Stage 2 · Feature Engineering

1,662 features per frame.

Landmarks are flattened into a single feature vector consumed by the LSTM. Each row below shows how the dimensions add up.

Pose landmarks · 33 × 4 = 132 · x, y, z + visibility per body keypoint
Hand landmarks · 21 × 3 × 2 = 126 · x, y, z per point, both hands
Face mesh · 468 × 3 = 1,404 · x, y, z per facial landmark
Total per frame · 1,662 · fed into the LSTM
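A sketch of how that flattening might be implemented; the zero-fill for frames where a detector misses is an assumption, chosen so the vector stays a constant 1,662 dimensions.

```python
import numpy as np

def extract_keypoints(results):
    """Flatten a MediaPipe Holistic result into one 1,662-dim vector.
    Missing detections are zero-filled so the shape stays constant."""
    pose = (np.array([[p.x, p.y, p.z, p.visibility]
                      for p in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    left = (np.array([[p.x, p.y, p.z]
                      for p in results.left_hand_landmarks.landmark]).flatten()
            if results.left_hand_landmarks else np.zeros(21 * 3))
    right = (np.array([[p.x, p.y, p.z]
                       for p in results.right_hand_landmarks.landmark]).flatten()
             if results.right_hand_landmarks else np.zeros(21 * 3))
    face = (np.array([[p.x, p.y, p.z]
                      for p in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468 * 3))
    # 132 + 63 + 63 + 1,404 = 1,662 features
    return np.concatenate([pose, left, right, face])
```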
Stage 3 · Recognition

LSTM neural network.

A Long Short-Term Memory network processes sequences of 30 frames — capturing the temporal dynamics that distinguish a correct movement from a flawed one.

01 · Input Layer · (30, 1662) · 30 time steps × 1,662 features per frame
02 · LSTM 1 · 64 units · ReLU activation, return sequences
03 · LSTM 2 · 128 units · ReLU activation, return sequences
04 · LSTM 3 · 64 units · ReLU activation, final state only
05 · Dense · 64 units · ReLU activation
06 · Dense · 32 units · ReLU activation
07 · Output · Softmax · action class probabilities
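Expressed as a Keras model, the stack above might look like the sketch below. NUM_CLASSES is a placeholder (the class count isn't stated here), and the optimiser and loss are illustrative.

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import LSTM, Dense

NUM_CLASSES = 3  # placeholder: the real number of action classes isn't stated

model = Sequential([
    Input(shape=(30, 1662)),                  # 30-frame window × 1,662 features
    LSTM(64, return_sequences=True, activation="relu"),
    LSTM(128, return_sequences=True, activation="relu"),
    LSTM(64, return_sequences=False, activation="relu"),  # final state only
    Dense(64, activation="relu"),
    Dense(32, activation="relu"),
    Dense(NUM_CLASSES, activation="softmax"),  # action class probabilities
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["categorical_accuracy"])
```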
Stage 4 · Real-Time Loop

Sliding window inference at 30fps.

Sliding window

The latest 30 frames are kept in a rolling buffer. Each new frame replaces the oldest, giving the model continuous context.

Throttled inference

Predictions run every 250ms — fast enough to feel instant, slow enough to avoid burning compute when the body isn't moving.

Stable detection

A shot or pose is only confirmed after 3 consecutive frames pass the confidence threshold — preventing flicker from noisy poses.
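Putting the three mechanisms together, the loop might look like this sketch; the 0.8 confidence threshold is illustrative, and model refers to the network from the Stage 3 sketch.

```python
from collections import deque
import time
import numpy as np

WINDOW = 30       # frames in the rolling buffer
INTERVAL = 0.25   # seconds between predictions (250ms throttle)
STABLE = 3        # consecutive agreeing predictions before confirming
THRESHOLD = 0.8   # illustrative confidence threshold

buffer = deque(maxlen=WINDOW)   # oldest frame falls out automatically
recent = deque(maxlen=STABLE)
last_run = 0.0

def on_frame(keypoints):
    """Feed one 1,662-dim feature vector per frame; returns a confirmed
    class index, or None while detection is still unstable."""
    global last_run
    buffer.append(keypoints)
    now = time.monotonic()
    if len(buffer) < WINDOW or now - last_run < INTERVAL:
        return None                          # throttled: skip this frame
    last_run = now
    x = np.expand_dims(np.array(buffer), axis=0)  # shape (1, 30, 1662)
    probs = model.predict(x, verbose=0)[0]        # model: Stage 3 sketch
    recent.append(int(np.argmax(probs)))
    if (len(recent) == STABLE and len(set(recent)) == 1
            and probs[recent[-1]] > THRESHOLD):
        return recent[-1]                    # stable, confident detection
    return None
```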

Stack

What it's built on.

MediaPipe Holistic · Pose & landmark extraction
TensorFlow / Keras · LSTM model training & inference
FastAPI · Python backend serving predictions
Next.js + React · Frontend coaching interface
Firebase Auth · User authentication
Vercel + EC2 · Edge frontend, GPU backend
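As an illustration of how the FastAPI layer might serve predictions (the route, payload shape, and model path below are assumptions, not the documented API):

```python
from fastapi import FastAPI
from pydantic import BaseModel
import numpy as np
import tensorflow as tf

app = FastAPI()
# Hypothetical model path; load the trained Stage 3 network once at startup.
model = tf.keras.models.load_model("gaara_lstm.h5")

class Window(BaseModel):
    frames: list[list[float]]  # 30 frames × 1,662 features each

@app.post("/predict")  # illustrative route
def predict(window: Window):
    x = np.asarray(window.frames, dtype=np.float32)[None, ...]  # (1, 30, 1662)
    probs = model.predict(x, verbose=0)[0]
    return {"class": int(np.argmax(probs)),
            "confidence": float(np.max(probs))}
```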

Want to license this stack?

We license our pose-recognition pipeline for custom sports and wellness products. Talk to us about your use case.