---
title: "How AI tennis shot detection actually works"
description: "A plain-English walkthrough of the five-stage pipeline that turns a phone-recorded tennis video into a post-match report, including where it fails."
slug: "how-ai-tennis-shot-detection-works"
date: "2025-07-04"
author: "Akshay Sarode"
authorBio: "Founder, AceSense. Building AI tennis tools in Europe."
category: "Methodology"
schema: "BlogPosting"
faq:
  - q: "What is TrackNet and why does AceSense use it?"
    a: "TrackNet is an open-source neural network architecture originally designed to track small, fast-moving objects in sports video, specifically tennis and badminton balls. It outputs a heatmap per frame showing where the ball most likely is. AceSense uses a TrackNet-derived model because a tennis ball at 80+ mph is the hardest object in the video to track, and standard object detectors (YOLO, FasterRCNN) miss it on motion blur. TrackNet was built for exactly this problem."
  - q: "Does AceSense use the same tech as Hawk-Eye?"
    a: "No, and we don't claim to. Hawk-Eye uses 6–10 high-frame-rate calibrated cameras on rigid mounts in a stadium environment, plus 3D triangulation. AceSense uses one phone camera and 2D inference. The two systems solve different problems: Hawk-Eye is for line-call accuracy at the millimetre level under controlled conditions; AceSense is for amateur-player technique and pattern analysis under uncontrolled conditions. We tell you on /accuracy where the limits are."
  - q: "Why pose detection? Why not just track the ball?"
    a: "Because the ball alone can't tell you what shot was hit. A backhand slice and a forehand drive can land in similar spots. The pose, where the player's hips, shoulders, racket arm, and feet are at contact, is what disambiguates. MediaPipe's pose model gives us 33 body keypoints per frame; the bounce/shot classifier uses those keypoints as features alongside the ball trajectory."
  - q: "Where does the pipeline fail most often?"
    a: "Three places, honestly: (1) very dark indoor courts with poor floodlighting, where the ball detector loses contrast; (2) heavily worn clay courts where the lines have faded and court keypoint detection breaks; (3) doubles, where two players on one side of the court can confuse the bounce/shot classifier about which shot belongs to whom. We document failure modes on /accuracy."
  - q: "Is this real-time?"
    a: "No. AceSense is post-match analysis: you upload, the pipeline runs on a GPU in the cloud, and you get a report in minutes. Real-time on-device is a different (harder) problem with a much narrower phone-compatibility list. We picked async because it works on every modern phone and lets us use larger, more accurate models."
---

# How AI tennis shot detection actually works

You film a tennis match on your phone. You upload the video. Five minutes later, an app sends you back a PDF that says you hit 142 forehands, 67 backhands, 14 serves, with a heatmap of where each one bounced and a stroke-quality score per shot. How?

This post is the honest, plain-English walkthrough of that pipeline. No marketing fog. We'll go through the five stages (ball detection, court keypoints, player + pose, bounce/shot classification, and stroke-quality scoring) and explain what each does, what it gets right, and where it fails.

## TL;DR: five stages in a sentence each

1. **Ball detection:** a TrackNet-style neural network finds the tennis ball in every frame, even at motion-blurred 80+ mph.
2. **Court keypoint detection:** a separate model finds the lines and corners of the court, so the system has a 2D coordinate system to map shots into.
3. **Player detection + pose:** FasterRCNN finds the players' bounding boxes; MediaPipe pose extracts 33 body keypoints per player per frame.
4. **Bounce + shot classification:** a CatBoost classifier looks at the ball trajectory plus the player's pose at contact and decides "that was a forehand," "that was a serve," "that was a bounce on the baseline."
5. **Stroke-quality scoring:** the pose features at each detected shot are scored against a reference distribution of well-executed shots.

The rest of this post is the long version of each step, with what each one fails at.

## Stage 1: Ball detection (TrackNet)

If there's one thing tennis video analysis lives or dies on, it's ball tracking. A tennis ball is small (6.7 cm), fast (an amateur serve hits 70–90 mph; a clean forehand drives at 50–70 mph), and motion-blurred to a streak in any phone-camera recording shot at 30 fps.
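To put numbers on that, here is a small worked calculation of how far the ball moves between two consecutive frames. It only uses the speeds, ball size, and frame rate quoted above; the function name is ours, for illustration. Note that the visible blur streak depends on the camera's exposure time, which this sketch does not model.

```python
MPH_TO_MPS = 0.44704      # exact: 1609.344 m per mile / 3600 s per hour
BALL_DIAMETER_M = 0.067   # 6.7 cm, as quoted in the text


def metres_per_frame(speed_mph: float, fps: float) -> float:
    """Distance the ball travels between two consecutive frames."""
    return speed_mph * MPH_TO_MPS / fps


for speed in (50, 70, 90):
    d = metres_per_frame(speed, fps=30)
    print(f"{speed} mph at 30 fps: {d:.2f} m per frame, "
          f"about {d / BALL_DIAMETER_M:.0f} ball diameters")
```

Even at the slow end of that range the ball jumps roughly three quarters of a metre from one frame to the next, which is why a detector that looks at single frames in isolation has so little to work with.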

Standard object detectors (YOLO, FasterRCNN, the things you'd reach for to find a person in a frame) miss the ball most of the time. They were trained on objects with rigid edges and clear feature points. A tennis ball mid-flight is a yellow smudge.

[TrackNet](https://nol.cs.nctu.edu.tw:234/open-source/TrackNet) is the open-source architecture that solved this. It was built specifically for tracking small high-speed objects in sports video. Instead of returning a bounding box, it outputs a probability heatmap over the frame, "the ball is most likely *here*, with this confidence." Crucially, it takes three consecutive frames as input, so it has motion context: the ball isn't just a yellow blob, it's a yellow blob that *moved this way last frame*. That trajectory prior is what makes it work on motion blur.
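As a rough illustration of the heatmap idea, the sketch below groups frames into overlapping windows of three and reads a ball position off a heatmap by taking its most confident pixel. This is a simplified peak-picking step under our own assumptions: the function names and the `min_confidence` threshold are hypothetical, and neither TrackNet nor the production model necessarily post-processes heatmaps this way.

```python
from typing import List, Optional, Tuple

Heatmap = List[List[float]]  # rows of per-pixel ball probabilities in [0, 1]


def frame_triplets(frames: list) -> list:
    """Group frames into overlapping windows of three consecutive frames."""
    return [tuple(frames[i:i + 3]) for i in range(len(frames) - 2)]


def ball_position(heatmap: Heatmap,
                  min_confidence: float = 0.5) -> Optional[Tuple[int, int, float]]:
    """Return (x, y, confidence) of the heatmap peak.

    Returns None when no pixel reaches min_confidence, i.e. the frame is
    treated as "ball not visible" rather than forced to a guess.
    """
    best: Optional[Tuple[int, int, float]] = None
    for y, row in enumerate(heatmap):
        for x, p in enumerate(row):
            if best is None or p > best[2]:
                best = (x, y, p)
    if best is None or best[2] < min_confidence:
        return None
    return best


# Toy 3x3 heatmap: the peak is at column 1, row 1.
toy = [[0.0, 0.1, 0.0],
       [0.0, 0.9, 0.2],
       [0.0, 0.0, 0.0]]
print(ball_position(toy))
```

Returning `None` below the threshold matters downstream: a missing detection can be interpolated from the neighbouring frames, whereas a confident-looking wrong position would corrupt the trajectory.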

AceSense uses a TrackNet-derived model. We don't claim to have invented it; the open-source heritage is real and we link to it. What we did do is retrain it on a much larger dataset of phone-recorded amateur matches. The public TrackNet weights were tuned on broadcast TV footage, which has cleaner contrast, controlled lighting, and a fixed camera. Phone footage is dirtier. The retrained model is what runs in production.

**What it fails at:** very low-contrast scenes (dim indoor courts, late-evening outdoor sessions), and balls partially occluded by the net or a player's body. The "How accurate is tennis ball tracking?" entry in Google's "People also ask" box exists for a reason: [search the question yourself](https://www.google.com/search?q=tennis+ball+tracking+app) and you'll find a community that's been burned by overclaims.

*(See fig. 1: the pipeline overview diagram, showing how the ball heatmap from TrackNet feeds downstream stages.)*

## Stage 2: Court keypoint detection

The ball's position in the frame is meaningless on its own. You need a coordinate system, *where on the court* did this happen. That's what court keypoint detection does.

The system locates the corners and key intersections of the tennis court: baseline corners, service line corners, centre service mark, net posts. Once those points are pinned in the frame, a homography transform converts any pixel coordinate on the court surface into a real-world court coordinate (in metres, relative to the court). Now "the ball bounced at frame 4,213" becomes "the ball bounced 1.2 metres inside the baseline, 0.8 metres from the sideline."
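For readers who want to see the homography step concretely, here is a minimal sketch that fits the transform from four pixel-to-court point pairs and applies it to a pixel. The pixel coordinates are made up for illustration, the function names are ours, and we use only four corners with a hand-written solver; a production system would more likely use more keypoints and a library routine. The court dimensions are the standard singles court (8.23 m by 23.77 m).

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a·x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_homography(pixels: Sequence[Point], court: Sequence[Point]) -> List[float]:
    """Fit the 8 free homography parameters (h33 fixed to 1) from 4 point pairs."""
    a, b = [], []
    for (u, v), (x, y) in zip(pixels, court):
        a.append([u, v, 1, 0, 0, 0, -u * x, -v * x])
        b.append(x)
        a.append([0, 0, 0, u, v, 1, -u * y, -v * y])
        b.append(y)
    return solve_linear(a, b)


def pixel_to_court(h: Sequence[float], u: float, v: float) -> Point:
    """Map a pixel on the court surface to court coordinates in metres."""
    w = h[6] * u + h[7] * v + 1.0
    return ((h[0] * u + h[1] * v + h[2]) / w,
            (h[3] * u + h[4] * v + h[5]) / w)


# Hypothetical pixel positions of the singles-court corners in a 1920x1080 frame:
# near-left, near-right, far-right, far-left.
pixel_corners = [(200, 1000), (1720, 1000), (1220, 300), (700, 300)]
# Matching court coordinates in metres, origin at the near-left corner.
court_corners = [(0.0, 0.0), (8.23, 0.0), (8.23, 23.77), (0.0, 23.77)]

H = fit_homography(pixel_corners, court_corners)
x, y = pixel_to_court(H, 960, 650)
print(f"{x:.2f} m from the left sideline, {y:.2f} m from the near baseline")
```

The mapping is only valid for points on the court plane, which is why it is applied to bounce positions and not to the ball in mid-air.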

Without this, you don't have a heatmap, you don't have line calls, and you don't have any way to say "this serve landed in the deuce box." Court keypoint detection is the unglamorous step that makes the report make sense.

**What it fails at:** courts with faded or covered lines (heavily worn clay, courts with snow patches, courts where the singles sticks are missing on a doubles court being used for singles). It also struggles when the camera is tilted off-axis from the baseline by more than ~30°. If you set the camera up correctly (see [/how-to/film-your-tennis-match](/how-to/film-your-tennis-match)), this stage is rarely the bottleneck.

## Stage 3: Player detection + pose

Now we know where the ball is and where the court is. We need to know where the *players* are and what their bodies are doing.

Player detection runs FasterRCNN, a standard, well-understood object detector, on each frame to find the people. This is the easy part: humans on a tennis court are big, distinct objects against a high-contrast background. FasterRCNN nails it.

The harder part is pose. Once we have a bounding box around each player, we run [MediaPipe's pose model](https://developers.google.com/mediapipe/solutions/vision/pose_landmarker) inside that box to extract 33 body keypoints: head, shoulders, elbows, wrists, hips, knees, ankles, and foot points. MediaPipe is Google's open-source pose framework; it runs fast, it's accurate enough for tennis-scale movements, and it gives us the per-frame skeleton we need for the rest of the pipeline.
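One practical detail of running pose inside a crop: MediaPipe reports landmark positions normalised to the image it was given, so keypoints from a player crop have to be mapped back into full-frame pixels before they can be compared with the ball position. The sketch below shows that bookkeeping with plain tuples. The helper names and the box format are our assumptions, and it does not call MediaPipe itself; the landmark indices are the ones from MediaPipe's 33-point pose topology.

```python
from typing import Dict, List, Tuple

# A few indices from MediaPipe's 33-landmark pose topology.
LANDMARKS: Dict[str, int] = {
    "left_shoulder": 11, "right_shoulder": 12,
    "left_wrist": 15, "right_wrist": 16,
    "left_hip": 23, "right_hip": 24,
}

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), frame pixels


def crop_to_frame(normalised: List[Tuple[float, float]],
                  box: Box) -> List[Tuple[float, float]]:
    """Convert crop-relative landmarks in [0, 1] to full-frame pixel coordinates."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    return [(x0 + nx * w, y0 + ny * h) for nx, ny in normalised]


def tall_enough(box: Box, min_height_px: float = 80.0) -> bool:
    """Skip pose when the player box is too small to give reliable keypoints."""
    return (box[3] - box[1]) >= min_height_px


# Hypothetical player box and a dummy pose with every landmark at the crop centre.
player_box: Box = (100.0, 200.0, 300.0, 600.0)
dummy_pose = [(0.5, 0.5)] * 33

if tall_enough(player_box):
    frame_points = crop_to_frame(dummy_pose, player_box)
    print(frame_points[LANDMARKS["right_wrist"]])
```

The `min_height_px` default mirrors the roughly 80-pixel player height below which pose quality drops off.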

The pose is what separates AI tennis analysis from "AI ball tracking." Without pose, you can tell where shots land but not what *kind* of shot was hit, and you can't say anything about technique. Pose is the signal that makes coaching tips possible.

**What it fails at:** when the player is heavily occluded (the net pole crossing their torso, another player walking through the frame in doubles), or when the camera is so far away that the player is fewer than ~80 pixels tall. Phone cameras at standard fence-mount distance handle this fine; cameras placed in a stadium upper deck do not.

## Stage 4: Bounce + shot classification (CatBoost + pose features)

This is where the pieces come together. We have the ball trajectory (Stage 1), the court coordinates (Stage 2), and the player pose at every frame (Stage 3). The classifier's job is to look at those signals and label every event:

- **Bounce events:** the moment the ball hits the court. Used for line calls and the heatmap.
- **Shot events:** the moment a player makes contact with the ball. Used for shot counts and stroke-quality.
- **Shot type:** forehand, backhand, serve, volley. Each requires a different combination of ball-trajectory features and pose features.

The classifier is [CatBoost](https://catboost.ai/), a gradient-boosted decision tree library. We chose it over a deep neural network for two reasons: it's fast (the entire classification stage runs in seconds on a GPU), and it's interpretable (we can ask "why did you call this a backhand?" and get a feature-importance answer). For a system where we want to publish accuracy methodology and explain failures, interpretability matters.

The features that go in: ball trajectory derivatives (velocity, acceleration, height profile), distance from each player's racket-side wrist to the ball at the candidate contact frame, hip-shoulder rotation angle, foot stance, and a few more. The model was trained on tens of thousands of hand-labelled shots from amateur match footage.
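To make a few of those features concrete, here is a sketch of how they could be computed from per-frame ball positions and pose keypoints. Everything here is illustrative: the function names are ours, the hip-shoulder angle is a 2D image-plane proxy for what is really a 3D rotation, and the production feature set is larger than this.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def finite_diff(values: List[float], fps: float) -> List[float]:
    """First derivative by finite differences, in units per second."""
    return [(b - a) * fps for a, b in zip(values, values[1:])]


def wrist_ball_distance(wrist: Point, ball: Point) -> float:
    """Euclidean distance between the racket-side wrist and the ball."""
    return math.dist(wrist, ball)


def hip_shoulder_separation_deg(l_sh: Point, r_sh: Point,
                                l_hip: Point, r_hip: Point) -> float:
    """Signed angle between shoulder line and hip line, wrapped to [-180, 180)."""
    shoulders = math.atan2(r_sh[1] - l_sh[1], r_sh[0] - l_sh[0])
    hips = math.atan2(r_hip[1] - l_hip[1], r_hip[0] - l_hip[0])
    diff = math.degrees(shoulders - hips)
    return (diff + 180.0) % 360.0 - 180.0


# Ball height (metres) over three consecutive frames at 30 fps.
heights = [0.0, 1.0, 3.0]
velocity = finite_diff(heights, fps=30)        # metres per second
acceleration = finite_diff(velocity, fps=30)   # metres per second squared

features = {
    "ball_vy": velocity[-1],
    "ball_ay": acceleration[-1],
    "wrist_ball_dist": wrist_ball_distance((0.0, 0.0), (3.0, 4.0)),
    "hip_shoulder_deg": hip_shoulder_separation_deg(
        (0.0, 0.0), (1.0, 0.0), (0.0, 0.0), (1.0, 1.0)),
}
print(features)
```

A flat dictionary of named numeric features like this is the kind of tabular input a gradient-boosted tree model consumes, and named features are what make the feature-importance answers readable.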

**What it fails at:** doubles, where two players on one side of the court can confuse "which shot belonged to whom." Low-frame-rate video (sub-30 fps), where the ball-contact frame is genuinely missing. And shots that legitimately fall between categories (the "tweener" or the "between-the-legs return"), which get classified as the closest standard shot type. That is honest but imperfect.

## Stage 5: Stroke-quality scoring

The final stage is the one that produces a coaching tip rather than a stat. For each detected shot, we score the player's pose at contact (and a few frames before and after) against a reference distribution of well-executed shots of the same type.

The score is decomposed by component:
- **Preparation:** racket take-back, hip rotation, weight transfer.
- **Contact:** body position relative to the ball, racket-face angle estimate.
- **Follow-through:** racket finish, rotation completion, balance.

This isn't "AI says you're a 4.2 player." It's "your forehand contact-point distribution is 12 cm late on average compared to the reference; here's what that tends to cause." The score is a discussion starter, not a verdict, and we say so on every report.
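One simple way to compare a player against a reference distribution is a mean offset plus a z-score, sketched below. This is an assumption about the general shape of such a comparison, not a description of the production scorer; the function name and all the sample numbers are invented for illustration.

```python
from statistics import mean, stdev
from typing import Dict, List


def component_report(player_values: List[float],
                     reference_values: List[float]) -> Dict[str, float]:
    """Compare a player's measurements with a reference sample.

    mean_offset is in the same unit as the inputs; z expresses that offset
    in reference standard deviations.
    """
    offset = mean(player_values) - mean(reference_values)
    return {"mean_offset": offset, "z": offset / stdev(reference_values)}


# Hypothetical contact-point timing offsets in centimetres (positive = late).
reference_contact_cm = [-2.0, 0.0, 2.0]
player_contact_cm = [10.0, 12.0, 14.0]

report = component_report(player_contact_cm, reference_contact_cm)
print(f"contact point is {report['mean_offset']:.0f} cm late on average "
      f"(z = {report['z']:.1f})")
```

Reporting the offset in physical units next to the z-score keeps the output in the "12 cm late" form a player can act on, while the z-score says how unusual that is relative to the reference.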

**What it fails at:** non-classical techniques (extreme grip styles, deliberate technical choices that work for the player but score lower against an ATP/WTA-baseline reference). A coach who is intentionally building a player with non-textbook mechanics will see a lower stroke-quality score for shots that are, by their assessment, fine. We document this on the [stroke-quality feature page](/features/stroke-quality) and in the FAQ on every report.

## Where the pipeline fails (the honest section)

If you only read one section, read this. AI tennis tools have a credibility problem because most of them oversell. Here is where AceSense's pipeline fails today, in order of how often we see it:

1. **Indoor courts with insufficient lighting.** The ball detector needs contrast. Domes with bright LED arrays are fine; older indoor halls with yellowing fluorescents are hard.
2. **Heavily worn clay.** The court keypoint model needs visible lines. Faded clay confuses Stage 2 and cascades into bad coordinate mapping for Stages 4 and 5.
3. **Doubles.** Stage 4 (shot classification) was primarily trained on singles. Doubles works for shot detection on the player you're tracking, but the heatmap is calibrated to a singles court layout.
4. **Phone camera placement that violates the setup guide.** If the camera is too low, too high, or too far off-axis, every downstream stage degrades. Most of the bad-report support tickets we get are setup, not algorithm.

The full failure-mode catalogue is on [/accuracy](/accuracy), with example frames.

## How we measure all of this

The GPU pipeline lives in our `acesense-gpu-backend` project, and we run regression tests against hand-annotated match files using `python scripts/compare_events.py <annotations.json> <output_dir> --tolerance 5`. Every release, the script outputs precision/recall/F1 by event type. We publish the current numbers on [/accuracy](/accuracy) and update them per release. No invented numbers, no "industry-leading" without a citation. If you want the numbers, that page has them.
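For a sense of what a tolerance-based comparison involves, here is a minimal sketch of matching predicted events to annotated events and computing precision, recall, and F1. It is not the contents of `compare_events.py`: the function name is ours, the greedy matching strategy is one reasonable choice among several, and we are assuming the tolerance is measured in frames.

```python
from typing import Dict, List


def match_events(truth: List[int], predicted: List[int],
                 tolerance: int = 5) -> Dict[str, float]:
    """Greedy one-to-one matching of event frame indices within ±tolerance."""
    truth = sorted(truth)
    used = set()
    tp = 0
    for p in sorted(predicted):
        best = None
        for i, t in enumerate(truth):
            if i in used or abs(t - p) > tolerance:
                continue
            if best is None or abs(t - p) < abs(truth[best] - p):
                best = i
        if best is not None:
            used.add(best)  # each annotated event can be matched only once
            tp += 1
    fp = len(predicted) - tp   # predictions with no annotated partner
    fn = len(truth) - tp       # annotated events the pipeline missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical bounce frames: hand annotations vs. pipeline output.
annotated = [100, 200, 300]
detected = [103, 198, 400, 410]
print(match_events(annotated, detected, tolerance=5))
```

Running this per event type (bounce, shot, each shot type) gives the kind of per-type breakdown described above.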

## What this all adds up to

Five stages, each doing a job that none of the others can: TrackNet finds the ball, court detection gives us coordinates, FasterRCNN + MediaPipe finds the players and their bodies, CatBoost classifies the events, and the stroke-quality scorer translates pose into a coaching artefact. The output is a PDF that tells you *what happened, where it happened, and how cleanly it was hit*: the three things a club player needs and a phone-camera-only setup can deliver.

If you've made it this far, you probably want to either see how this compares to other apps ([/compare/swingvision](/compare/swingvision) is the honest version) or try the pipeline on your own video. Both work.

---

**Next:** [How accurate is AceSense? Our methodology and benchmarks](/blog/how-accurate-is-acesense) walks through the regression suite and what the current build's numbers actually look like. Or head to [/how-it-works](/how-it-works) for the visual version of this page.