Real-time pose recognition with MediaPipe and LSTM
When we set out to build Gaara AI, the goal was simple: deliver coach-grade biomechanical feedback in real time, on any device, with no installs. Six months later, we have a pipeline that processes 30 video frames per second through a MediaPipe Holistic model and an LSTM classifier — all in your browser.
MediaPipe Holistic gives us 33 pose landmarks, 21 hand landmarks per hand, and 468 face landmarks per frame. Pose landmarks carry x, y, z, and visibility; hand and face landmarks carry x, y, z — which works out to 1,662 numerical features per frame. The LSTM operates on a sliding window of 30 frames, classifying body movements in under 250ms.
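To make the arithmetic concrete, here is a minimal sketch of how those landmark sets flatten into a fixed-length vector. The counts follow MediaPipe Holistic's documented outputs; the `results` dict shape and the `extract_features` name are illustrative, not our production code. Missing parts (e.g. a hand out of frame) are zero-filled so the vector length never changes:

```python
# Per-frame feature counts from MediaPipe Holistic's landmark sets.
POSE = 33 * 4   # x, y, z, visibility per pose landmark -> 132
FACE = 468 * 3  # x, y, z per face landmark             -> 1404
HAND = 21 * 3   # x, y, z per hand landmark             -> 63 per hand

def extract_features(results):
    """Flatten one frame's landmarks into a fixed-length vector.

    `results` is a stand-in for MediaPipe's output object: a dict
    mapping part name to a list of (x, y, z, ...) tuples, or None
    when that part wasn't detected in the frame.
    """
    def flat(landmarks, per, count):
        # Zero-fill undetected parts so every frame has the same length.
        if landmarks is None:
            return [0.0] * (count * per)
        out = []
        for lm in landmarks:
            out.extend(lm[:per])
        return out

    return (flat(results.get("pose"), 4, 33)
            + flat(results.get("face"), 3, 468)
            + flat(results.get("left_hand"), 3, 21)
            + flat(results.get("right_hand"), 3, 21))

# 132 + 1404 + 63 + 63 = 1662 features per frame
```

Thirty of these vectors form the (30, 1662) window the LSTM consumes.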
The trickiest part wasn't the ML — it was the engineering trade-offs. Run inference too often and the browser stutters. Run it too rarely and feedback feels laggy. We landed on a 250ms cadence with a 3-frame stable-detection guard, which feels instant but doesn't burn CPU when the body isn't moving.
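The stable-detection guard is essentially a debounce on the classifier's output: a label is emitted only after it has been the top prediction for three consecutive inference ticks. A minimal sketch, assuming hypothetical names and thresholds (the class, `min_confidence` value, and labels are illustrative, not our production values):

```python
from collections import deque

class StableDetector:
    """Debounce classifier output: emit a label only after it has been
    the top prediction for `stable_frames` consecutive inference ticks."""

    def __init__(self, stable_frames=3, min_confidence=0.7):
        self.stable_frames = stable_frames
        self.min_confidence = min_confidence
        self.history = deque(maxlen=stable_frames)
        self.current = None  # last label we committed to

    def update(self, label, confidence):
        """Feed one prediction; return the label only on a stable change."""
        if confidence < self.min_confidence:
            # Low-confidence frames reset the streak entirely.
            self.history.clear()
            return None
        self.history.append(label)
        if (len(self.history) == self.stable_frames
                and len(set(self.history)) == 1
                and label != self.current):
            self.current = label
            return label
        return None
```

At a 250ms cadence, three consecutive agreeing ticks adds at most ~500ms before a new label is surfaced, which stays well inside what users perceive as instant.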
Privacy was non-negotiable from day one. Raw video frames never leave the device. Only the 1,662-dimensional feature vector is sent to our API for inference. This kept things simple from a regulatory standpoint and made users far more comfortable practising at home.
If you're building anything similar, our biggest lesson: invest early in the stable-detection logic. The model will produce noise. Filtering it intelligently is what makes the product feel premium.
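One common filtering technique that pairs well with the debounce above is averaging the per-class probability vectors over the last few predictions before taking the argmax, so a single noisy frame can't flip the label. This sketch is illustrative (the function name and window contents are assumptions, not the article's code):

```python
def smooth_probs(window):
    """Average per-class probabilities over a short window of
    predictions, then take the argmax -- a simple way to suppress
    frame-level noise before any debouncing logic runs.

    `window` is a non-empty list of probability vectors, one per
    recent inference tick, all of the same length.
    """
    n = len(window)
    k = len(window[0])
    avg = [sum(p[i] for p in window) / n for i in range(k)]
    best = max(range(k), key=lambda i: avg[i])
    return best, avg[best]
```

The trade-off is a little extra latency for a lot less flicker, which is usually the right call for user-facing feedback.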
