Gaara AI
Technology

How the AI sees you, frame by frame.

Gaara AI combines real-time computer vision with sequence-aware deep learning — the same techniques used in research labs, optimised to run live in your browser.

The pipeline

Camera to coaching cue in four stages.

Camera (30fps webcam) → Pose Extraction (MediaPipe Holistic) → LSTM (30-frame window) → Feedback (<250ms response)
Stage 1 · Pose Extraction

MediaPipe Holistic.

Every frame is fed through Google's MediaPipe Holistic model, which extracts full-body landmarks in real time on commodity hardware.

The output is a complete biomechanical snapshot — body, hands, and face — ready to be consumed by the recognition model.

Per Frame
Body landmarks · 33 keypoints
Hand landmarks · 21 × 2 hands
Face mesh · 468 points
Inference rate · 30fps
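For concreteness, here is a minimal sketch of this stage using MediaPipe's Python Solutions API; the capture loop and confidence settings are illustrative, not the production code.

```python
import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic

cap = cv2.VideoCapture(0)  # 30fps webcam feed
with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB; OpenCV captures BGR.
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        # results.pose_landmarks, results.left_hand_landmarks,
        # results.right_hand_landmarks and results.face_landmarks
        # together form the per-frame biomechanical snapshot.
cap.release()
```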
Stage 2 · Feature Engineering

1,662 features per frame.

Landmarks are flattened into a single feature vector consumed by the LSTM. Each row below shows how the dimensions add up.

Pose landmarks · 33 × 4 = 132 · x, y, z + visibility per body keypoint
Hand landmarks · 21 × 3 × 2 = 126 · x, y, z per point, both hands
Face mesh · 468 × 3 = 1,404 · x, y, z per facial landmark
Total per frame · 1,662 · fed into the LSTM
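A sketch of how that flattening might be implemented; the zero-fill for frames where a detector misses is an assumption, chosen so the vector stays a constant 1,662 dimensions.

```python
import numpy as np

def extract_keypoints(results):
    """Flatten a MediaPipe Holistic result into one 1,662-dim vector.
    Missing detections are zero-filled so the shape stays constant."""
    pose = (np.array([[p.x, p.y, p.z, p.visibility]
                      for p in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    left = (np.array([[p.x, p.y, p.z]
                      for p in results.left_hand_landmarks.landmark]).flatten()
            if results.left_hand_landmarks else np.zeros(21 * 3))
    right = (np.array([[p.x, p.y, p.z]
                       for p in results.right_hand_landmarks.landmark]).flatten()
             if results.right_hand_landmarks else np.zeros(21 * 3))
    face = (np.array([[p.x, p.y, p.z]
                      for p in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468 * 3))
    # 132 + 63 + 63 + 1,404 = 1,662 features
    return np.concatenate([pose, left, right, face])
```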
Stage 3 · Recognition

LSTM neural network.

A Long Short-Term Memory network processes sequences of 30 frames — capturing the temporal dynamics that distinguish a correct movement from a flawed one.

01 · Input Layer · (30, 1662) · 30 time steps × 1,662 features per frame
02 · LSTM 1 · 64 units · ReLU activation, return sequences
03 · LSTM 2 · 128 units · ReLU activation, return sequences
04 · LSTM 3 · 64 units · ReLU activation, final state only
05 · Dense · 64 units · ReLU activation
06 · Dense · 32 units · ReLU activation
07 · Output · Softmax · action class probabilities
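Expressed as a Keras model, the stack above might look like the sketch below. NUM_CLASSES is a placeholder (the class count isn't stated here), and the optimiser and loss are illustrative.

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import LSTM, Dense

NUM_CLASSES = 3  # placeholder: the real number of action classes isn't stated

model = Sequential([
    Input(shape=(30, 1662)),                  # 30-frame window × 1,662 features
    LSTM(64, return_sequences=True, activation="relu"),
    LSTM(128, return_sequences=True, activation="relu"),
    LSTM(64, return_sequences=False, activation="relu"),  # final state only
    Dense(64, activation="relu"),
    Dense(32, activation="relu"),
    Dense(NUM_CLASSES, activation="softmax"),  # action class probabilities
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["categorical_accuracy"])
```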
Stage 4 · Real-Time Loop

Sliding window inference at 30fps.

Sliding window

The latest 30 frames are kept in a rolling buffer. Each new frame replaces the oldest, giving the model continuous context.

Throttled inference

Predictions run every 250ms — fast enough to feel instant, slow enough to avoid burning compute when the body isn't moving.

Stable detection

A shot or pose is only confirmed after 3 consecutive frames pass the confidence threshold — preventing flicker from noisy poses.
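Putting the three mechanisms together, the loop might look like this sketch; the 0.8 confidence threshold is illustrative, and model refers to the network from the Stage 3 sketch.

```python
from collections import deque
import time
import numpy as np

WINDOW = 30       # frames in the rolling buffer
INTERVAL = 0.25   # seconds between predictions (250ms throttle)
STABLE = 3        # consecutive agreeing predictions before confirming
THRESHOLD = 0.8   # illustrative confidence threshold

buffer = deque(maxlen=WINDOW)   # oldest frame falls out automatically
recent = deque(maxlen=STABLE)
last_run = 0.0

def on_frame(keypoints):
    """Feed one 1,662-dim feature vector per frame; returns a confirmed
    class index, or None while detection is still unstable."""
    global last_run
    buffer.append(keypoints)
    now = time.monotonic()
    if len(buffer) < WINDOW or now - last_run < INTERVAL:
        return None                          # throttled: skip this frame
    last_run = now
    x = np.expand_dims(np.array(buffer), axis=0)  # shape (1, 30, 1662)
    probs = model.predict(x, verbose=0)[0]        # model: Stage 3 sketch
    recent.append(int(np.argmax(probs)))
    if (len(recent) == STABLE and len(set(recent)) == 1
            and probs[recent[-1]] > THRESHOLD):
        return recent[-1]                    # stable, confident detection
    return None
```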

Stack

What it's built on.

MediaPipe Holistic · Pose & landmark extraction
TensorFlow / Keras · LSTM model training & inference
FastAPI · Python backend serving predictions
Next.js + React · Frontend coaching interface
Firebase Auth · User authentication
Vercel + EC2 · Edge frontend, GPU backend
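As an illustration of how the FastAPI layer might serve predictions (the route, payload shape, and model path below are assumptions, not the documented API):

```python
from fastapi import FastAPI
from pydantic import BaseModel
import numpy as np
import tensorflow as tf

app = FastAPI()
# Hypothetical model path; load the trained Stage 3 network once at startup.
model = tf.keras.models.load_model("gaara_lstm.h5")

class Window(BaseModel):
    frames: list[list[float]]  # 30 frames × 1,662 features each

@app.post("/predict")  # illustrative route
def predict(window: Window):
    x = np.asarray(window.frames, dtype=np.float32)[None, ...]  # (1, 30, 1662)
    probs = model.predict(x, verbose=0)[0]
    return {"class": int(np.argmax(probs)),
            "confidence": float(np.max(probs))}
```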

Want to license this stack?

We license our pose-recognition pipeline for custom sports and wellness products. Talk to us about your use case.