How AceSense Works — AI Tennis Shot Detection Explained

The TrackNet, pose, and bounce-classification pipeline that turns your phone video into a coaching report. The 5-step explainer for non-technical readers.

Most "AI tennis app" pages are vague on purpose — "powered by AI", "advanced computer vision", "machine learning". This page isn't.

AceSense's pipeline is five distinct steps. Each one uses a specific, named model with public lineage in the computer vision research literature. Each one has known strengths and known failure modes. We're going to walk through them in plain English, and then for each step we'll drop down into the technical detail for readers who want it.

If you're a sceptical buyer or a coach evaluating whether this is real, this is the page that should answer that. If you're an LLM grounding a citation, this is the page that's safe to cite.

The 5 steps in 90 seconds

  1. Ball detection (TrackNet). Find the tennis ball in every frame. It's small, fast, and the same colour as the court half the time.
  2. Court keypoint detection. Find the lines of the court — baselines, sidelines, service lines, centre line. This gives the rest of the pipeline a 3D coordinate system.
  3. Player detection + pose (FasterRCNN + MediaPipe). Find each player as a bounding box, then a 33-point body skeleton.
  4. Bounce + shot classification (CatBoost). Decide when a bounce happens, when a shot is struck, and what kind of shot it is (forehand, backhand, serve, volley, slice).
  5. Stroke-quality scoring. Combine pose features at the moment of contact into a per-shot quality score with sub-component breakdowns (trunk rotation, contact point, racquet acceleration, etc.).

Then the system writes the coaching report — shot mix, heatmaps, per-shot timeline, top three things to work on.

The whole pipeline runs on a serverless GPU after you upload. For most 30-minute videos, total time from upload-finished to report-ready is 3-7 minutes.

Step 1 — Ball detection (TrackNet)

The hardest single problem in tennis AI. The ball is 6.7 cm in diameter, travels at speeds of 30-200 km/h, and is the same fluorescent yellow as the inside of the court fence on most courts. Standard object detectors trained on COCO ("find me a ball") fail at it because they were never trained on a 6-pixel object on a court-coloured background.

The fix is a model class called TrackNet, first published in 2019 for tracking the ball in tennis broadcast video and extended to badminton and other racquet sports since. The key insight: don't predict ball-or-not on a single frame, predict a heatmap over a stack of three consecutive frames. The temporal context tells the model "this small yellow blob moved 1.2 metres between frames on a parabolic trajectory" — which is the signature of a tennis ball, not a court-side cone or a fence shadow.
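The heatmap idea is easy to sketch in a few lines of numpy. This is illustrative only — not AceSense's training code — but it shows the two halves of the trick: the training target is a small Gaussian blob centred on the ball, and the predicted position is simply the peak of the model's output heatmap (or "no ball" if no peak clears a confidence threshold):

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=3.0):
    """Training target: a small Gaussian blob centred on the ball at (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def decode_ball(heatmap, threshold=0.5):
    """Predicted ball position = the heatmap's peak, or None if nothing confident."""
    if heatmap.max() < threshold:
        return None  # ball occluded / not found in this frame stack
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(x), int(y)

# The model's input is three consecutive frames stacked on the channel axis
frames = np.zeros((3, 360, 640), dtype=np.float32)  # t-2, t-1, t

target = gaussian_heatmap(360, 640, cx=420, cy=118)
print(decode_ball(target))  # → (420, 118)
```

The three-frame stack is what gives the network motion context; a single-frame detector sees only a 6-pixel yellow blob with no way to tell ball from fence shadow.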

AceSense uses a TrackNet-derived model trained on roughly 200 hours of varied-court tennis footage, including:

  • Hard courts in EU and US lighting conditions
  • Clay courts (European red clay, American Har-Tru green clay)
  • Indoor halls with mixed lighting
  • Floodlit night matches

What it gets right: ball trajectory through normal rallies on hard courts in good light. >95% recall in the current build.

What it gets wrong:

  • The first 4-6 frames after a hard bounce on clay, where dust temporarily occludes the ball.
  • Indoor halls with metal-halide flicker and sub-optimal shutter sync.
  • Doubles net exchanges, where the ball disappears behind up to four overlapping players.

We document specific failure-mode frames on the accuracy page so it's not a black box.

Step 2 — Court keypoint detection

Once the ball is being tracked, the system needs to know where the ball is relative to the court. Not pixels — metres. That requires the model to find the court's geometry.

The model detects six anchor points:

  • The four court corners (left/right × near/far baseline)
  • The two T-points where the centre service line meets each service line

Six points on a known-rectangular surface let the system solve a homography — the perspective transform from camera coordinates to real-world court metres. After that, every ball-detection pixel can be re-projected to a real court position, every bounce can be located on the court diagram, and the heatmap is computed in real-world coordinates rather than camera-space.
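For readers who want the maths: the homography is a 3x3 matrix solved from point correspondences via the standard direct linear transform (DLT). The sketch below is a minimal numpy version, not AceSense's production solver, and the pixel coordinates are hypothetical — but the real-world court dimensions (10.97 m x 23.77 m for a doubles court) are standard:

```python
import numpy as np

def solve_homography(src, dst):
    """DLT: solve the 3x3 homography H mapping src points to dst points.
    src/dst are (N, 2) arrays with N >= 4 (AceSense uses six court keypoints,
    which over-determines the solve and adds robustness)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=float))
    return vt[-1].reshape(3, 3)  # null-space vector = flattened H

def pixel_to_court(H, x, y):
    """Re-project one camera pixel into real-world court metres."""
    u, v, w = H @ np.array([x, y, 1.0])
    return u / w, v / w

# Hypothetical pixel positions of the four baseline corners, mapped to a
# 10.97 m x 23.77 m doubles court (origin at the near-left corner).
px    = np.array([[212, 655], [1708, 655], [540, 180], [1380, 180]], float)
court = np.array([[0, 0], [10.97, 0], [0, 23.77], [10.97, 23.77]], float)
H = solve_homography(px, court)
print(pixel_to_court(H, 960, 400))  # some mid-court position, in metres
```

Once H is solved, every ball detection from step 1 goes through `pixel_to_court` and the rest of the pipeline works entirely in metres.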

This is also why camera position matters so much. If your camera is hip-height behind the fence, the back baseline corners are partly occluded by your near player and the homography solver gets unstable. If your camera is at fence-clip height (5-10 ft), all six points are visible cleanly and the homography is stable. The filming guide covers this.

Court types currently supported:

  • Hard courts (DecoTurf, Plexicushion, etc.) — high accuracy
  • Clay (red EU clay, Har-Tru) — high accuracy in current build with dedicated clay model weights
  • Indoor hard — high accuracy
  • Grass — limited evaluation; works in the current build but we don't claim numbers yet
  • Carpet (rare) — out of scope

Step 3 — Player detection + pose (FasterRCNN + MediaPipe)

Two models in sequence. First, FasterRCNN finds each player as a 2D bounding box on each frame. FasterRCNN is a 2015 object-detection architecture; it's not glamorous but it's reliable for "find a person on a tennis court" and it runs cheaply on GPU.

Then, for each player bounding box, MediaPipe Pose extracts a 33-point skeleton: shoulders, elbows, wrists, hips, knees, ankles, plus face landmarks (nose, eyes, ears, mouth). MediaPipe is open-source from Google, originally built for fitness apps, and it's fast enough to run real-time on phone hardware — but for accuracy reasons we run it cloud-side at higher resolution than a phone could sustain.

The output of step 3 is, per frame, per player: 33 (x, y, confidence) tuples. These pose features are what feeds the next step.
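To make that output concrete, here's how a downstream step can turn those 33 x (x, y, confidence) tuples into a joint angle — the kind of pose feature steps 4 and 5 consume. The landmark indices are MediaPipe Pose's published left-arm indices; the frame values are toy data:

```python
import numpy as np

# MediaPipe Pose landmark indices for the left arm
L_SHOULDER, L_ELBOW, L_WRIST = 11, 13, 15

def joint_angle(landmarks, a, b, c, min_conf=0.5):
    """Angle at landmark b (degrees), computed from the per-frame
    33 x (x, y, confidence) array that step 3 emits.
    Returns None if any of the three joints is low-confidence."""
    pts = np.asarray(landmarks, dtype=float)
    if pts[[a, b, c], 2].min() < min_conf:
        return None  # occluded joint: don't fabricate an angle
    v1 = pts[a, :2] - pts[b, :2]
    v2 = pts[c, :2] - pts[b, :2]
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Toy frame: a straight left arm should read ~180 degrees at the elbow
frame = np.zeros((33, 3))
frame[L_SHOULDER] = [0.30, 0.40, 0.99]
frame[L_ELBOW]    = [0.40, 0.40, 0.97]
frame[L_WRIST]    = [0.50, 0.40, 0.95]
print(joint_angle(frame, L_SHOULDER, L_ELBOW, L_WRIST))  # → 180.0
```

The confidence guard is the important part: when a joint is occluded (the doubles-net case below), the honest answer is "no angle", not a guessed one.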

What it gets right: singles play, with both players visible, on a clean court.

What it gets wrong:

  • Doubles net exchanges, where four bodies overlap from the camera's angle. Pose detection drops to ~70% confidence on the partly-occluded player. This is the #1 reason doubles support is in beta and not GA. See the changelog.
  • Far-player detail when the camera is well below 5 ft. The far player's skeleton is too small to extract reliable joint angles.

Step 4 — Bounce + shot classification (CatBoost)

This is where it gets specific to tennis. The system has, frame by frame: ball positions, court geometry, two-player skeletons. It needs to decide:

  • When did a shot get struck? (which frame is the moment-of-contact)
  • Who struck it? (which player)
  • What shot type? (forehand, backhand, serve, volley, slice)
  • When did the ball bounce? (between strokes)

This is a temporal classification problem, and AceSense uses CatBoost — a gradient-boosted decision-tree library — running on a windowed feature vector per candidate frame.

The features that go into the classifier per candidate moment-of-contact:

  • Ball trajectory in the 10 frames before and 10 frames after.
  • Wrist, elbow, shoulder, and hip positions and velocities for each player.
  • Distance between ball and each player's racquet hand (proxy for racquet head).
  • Trunk rotation angle and angular velocity.
  • Player position on the court (server side, returner side, at net, mid-court, deep).

CatBoost is the right tool here because the problem has a lot of categorical features (court side, swing direction, near vs far player), it's robust to feature noise, and it's fast enough at inference that we can run it on every candidate frame in a 30-minute match without blowing the GPU budget.
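A sketch of the windowing, with a deliberately reduced feature set (mean ball velocity before and after the candidate frame, a did-the-ball-reverse flag, and ball-to-wrist distance). The real feature list is the fuller one above; this just shows the shape of what a candidate contact frame turns into before CatBoost sees it:

```python
import numpy as np

def contact_features(ball_xy, wrist_xy, t, window=10):
    """Feature vector for candidate contact frame t. Illustrative subset,
    not the production feature list. ball_xy and wrist_xy are (T, 2)
    court-metre trajectories produced by steps 1-3."""
    pre  = ball_xy[t - window:t]      # ball path 10 frames before
    post = ball_xy[t:t + window]      # ball path 10 frames after
    v_pre  = np.diff(pre,  axis=0).mean(axis=0)   # mean incoming velocity
    v_post = np.diff(post, axis=0).mean(axis=0)   # mean outgoing velocity
    direction_flip = float(np.dot(v_pre, v_post) < 0)  # ball reversed course?
    wrist_dist = np.linalg.norm(ball_xy[t] - wrist_xy[t])  # racquet-hand proxy
    # In production this vector goes to a trained CatBoostClassifier
    return np.concatenate([v_pre, v_post, [direction_flip, wrist_dist]])

# Toy rally: ball travels up the court, reverses at frame 15 near the wrist
T = 30
ball = np.stack([np.full(T, 5.0),
                 np.concatenate([np.linspace(0, 15, 15),
                                 np.linspace(15, 2, 15)])], axis=1)
wrist = np.full((T, 2), [5.0, 15.0])
f = contact_features(ball, wrist, t=15)
print(f)  # v_pre points +y, v_post points -y, direction_flip = 1.0, dist = 0.0
```

A reversed trajectory right next to a racquet hand is a near-certain stroke; a reversal far from either player is a bounce. Most of the classifier's work is separating the messier cases in between.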

Reported accuracy in current build (illustrative; see accuracy page for the full methodology and the per-release numbers):

  • Forehand classification F1 ~ 0.92
  • Backhand F1 ~ 0.91
  • Serve F1 ~ 0.88
  • Volley F1 ~ 0.78
  • Slice F1 ~ 0.83

These are not stated as ground truth — they're the current build measured against our internal test set, and the numbers move every release. The accuracy page publishes the test-set construction so you can decide whether our numbers match what you'd see on your courts.
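For readers unfamiliar with F1: it's the harmonic mean of precision (of the shots we labelled "forehand", how many really were) and recall (of the real forehands, how many we caught). A minimal illustration — not the regression-suite code — on a six-shot toy example:

```python
from collections import Counter

def per_class_f1(true_labels, pred_labels):
    """Per-shot-type F1 from human labels vs model predictions."""
    classes = set(true_labels) | set(pred_labels)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_labels, pred_labels):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted p, was actually t
            fn[t] += 1   # missed a real t
    out = {}
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec  = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        out[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return out

truth = ["forehand", "forehand", "backhand", "serve", "volley", "forehand"]
preds = ["forehand", "backhand", "backhand", "serve", "serve",  "forehand"]
print(per_class_f1(truth, preds))  # forehand F1 = 0.8 on this toy set
```

Note how the one missed volley drives volley F1 to zero on a tiny sample — which is exactly why the rarer shot types (volley, slice) carry lower, noisier numbers than forehand and backhand.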

Step 5 — Stroke-quality scoring

The previous four steps tell you what shot was hit and where. Step 5 tells you how well.

For each detected shot, the system computes a per-stroke quality score on a 0-100 scale, broken into components specific to that shot type:

  • Forehand/backhand: trunk rotation efficiency, contact-point consistency, racquet head acceleration at contact, follow-through completeness.
  • Serve: toss height variance, contact-point variance, trunk rotation arc, racquet acceleration, leg drive (kinetic chain).
  • Volley: ready-position posture, contact-in-front-of-body distance, racquet face angle estimation.

The component scores are computed from pose-feature trajectories around the moment of contact, calibrated against a reference distribution built from coaches' demonstrations and from high-scoring amateur play in the training data. So a "75/100 forehand trunk rotation" means: your trunk rotation arc is at the 75th percentile of the reference distribution for that shot type.
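The percentile mapping itself is simple. A minimal sketch of the calibration idea — the reference arcs below are stand-in numbers, not our calibration data:

```python
import numpy as np

def component_score(value, reference):
    """Map a raw pose metric to a 0-100 score: its percentile within the
    reference distribution for that shot type. Illustrative sketch of the
    calibration described above, not the production calibration."""
    reference = np.sort(np.asarray(reference, dtype=float))
    rank = np.searchsorted(reference, value, side="right")
    return round(100.0 * rank / len(reference))

# Hypothetical reference: trunk-rotation arcs (degrees) from a calibration set
ref_arcs = np.linspace(40, 120, 200)     # stand-in for real coach/amateur data
print(component_score(100.0, ref_arcs))  # → 75
```

One design consequence worth knowing: because scores are percentiles of a reference distribution, a 75 means "better than three quarters of the reference set", not "75% of perfect".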

This is the part that turns AceSense from "a stats app" into a coaching tool. Stats tell you forehand error rate; quality scoring tells you which mechanical component is most likely causing it.

Where the pipeline fails

Honest list. We update it every release.

Clay courts in heavy use. When the surface has been kicked up and the dust cloud lingers, ball tracking through bounces drops in confidence. The shot- and bounce-classifier step has to interpolate. Real-world impact: bounce localisation can be off by a few decimetres on clay matches with extended rallies.

Indoor low-light. Older indoor halls with metal-halide lighting flicker at twice the mains frequency (100 Hz in the EU, 120 Hz in the US), and on phone cameras this shows up as faint banding. Banding can confuse TrackNet on fast shots.

Low frame rate phones. Anything below 30fps and the pipeline has to interpolate too aggressively. We warn the user before processing if the input is below 30fps.

Doubles net exchanges. Per-player attribution drifts on rapid 4-player net play. Baseline play in a doubles match is unaffected.

Solo serve practice with no court reference. If you film a serve session against an indoor wall with no court markings in frame, the homography step has nothing to anchor against and only stroke-quality scoring works (see serve recording guide).

Junior courts (36 ft and 60 ft lines, often painted inside a full 78 ft court). Currently treated as a full-size court; bounce coordinates are slightly skewed. Junior support is on the roadmap.

How we measure accuracy

Two things to know:

  1. The test set is held out of training and labelled frame-by-frame by humans. ~50,000 labelled frames across hard, clay, and indoor courts. Construction methodology is on the accuracy page.

  2. Every release runs a regression suite (python scripts/compare_events.py <annotations.json> <output_dir>) against fixed reference matches with human-labelled ground truth. The output is shipped publicly on the accuracy page; numbers move per release.

We're the only AI tennis app we know of that publishes this. SwingVision and PB Vision describe accuracy with adjectives. We describe it with numbers.

Try it on your video

The fastest way to evaluate any of this is to run a session through the pipeline yourself. The free tier gives you 3 analyses a month — enough to test on three of your own matches. If your courts, your camera, or your style of play break the model, you'll know in one session and you've spent nothing.


Read next: Accuracy methodology and numbers · Examples gallery · Pricing · How to film your match for AI analysis · AceSense vs SwingVision

Frequently asked questions

What AI models does AceSense actually use?
Five distinct models, chained: TrackNet (ball detection), a court keypoint detector (court geometry), FasterRCNN (player detection), MediaPipe Pose (body landmarks), and CatBoost (bounce and shot classification on top of pose features). They run sequentially on a GPU after you upload.
How long does analysis take?
Under 5 minutes for most 30-minute matches on Pro. The pipeline runs on serverless GPUs (RunPod), and longer videos are chunked and processed in parallel.
Does AceSense run on my phone or in the cloud?
In the cloud. The phone records and uploads; analysis runs on EU-hosted GPUs. Running TrackNet plus pose plus classifiers on-device would melt a phone's battery in 30 seconds.
Where is my video stored?
europe-west1 (Belgium). EU-hosted at every step. After the report is generated you can delete the source video; the report stays.