You film a tennis match on your phone. You upload the video. Five minutes later, an app sends you back a PDF that says you hit 142 forehands, 67 backhands, 14 serves, with a heatmap of where each one bounced and a stroke-quality score per shot. How?
This post is the honest, plain-English walkthrough of that pipeline. No marketing fog. We'll go through the five stages — ball detection, court keypoints, player + pose, bounce/shot classification, and stroke-quality scoring — explain what each does, what it gets right, and where it fails.
TL;DR — five stages in a sentence each
- Ball detection — a TrackNet-style neural network finds the tennis ball in every frame, even when an 80+ mph serve has blurred it into a streak.
- Court keypoint detection — a separate model finds the lines and corners of the court, so the system has a 2D coordinate system to map shots into.
- Player detection + pose — Faster R-CNN finds the players' bounding boxes; MediaPipe pose extracts 33 body keypoints per player per frame.
- Bounce + shot classification — a CatBoost classifier looks at the ball trajectory plus the player's pose at contact and decides "that was a forehand," "that was a serve," "that was a bounce on the baseline."
- Stroke-quality scoring — the pose features at each detected shot are scored against a reference distribution of well-executed shots.
The rest of this post is the long version of each step, with what each one fails at.
Stage 1: Ball detection (TrackNet)
If there's one thing tennis video analysis lives or dies on, it's ball tracking. A tennis ball is small (6.7 cm), fast (an amateur serve hits 70–90 mph; a clean forehand drives at 50–70 mph), and motion-blurred to a streak in any phone-camera recording shot at 30 fps.
Standard object detectors — YOLO, Faster R-CNN, the things you'd reach for to find a person in a frame — miss the ball most of the time. They were trained on objects with rigid edges and clear feature points. A tennis ball mid-flight is a yellow smudge.
TrackNet is the open-source architecture that solved this. It was built specifically for tracking small high-speed objects in sports video. Instead of returning a bounding box, it outputs a probability heatmap over the frame — "the ball is most likely here, with this confidence." Crucially, it takes three consecutive frames as input, so it has motion context: the ball isn't just a yellow blob, it's a yellow blob that moved this way last frame. That trajectory prior is what makes it work on motion blur.
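Reading a ball position off a heatmap like that reduces to finding the peak and checking its confidence. A minimal sketch (the function name, array shape, and threshold here are illustrative, not AceSense's actual code):

```python
import numpy as np

def ball_from_heatmap(heatmap, min_conf=0.5):
    """Return (x, y, confidence) of the most likely ball pixel,
    or None if the peak is below the confidence threshold."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    conf = float(heatmap[y, x])
    if conf < min_conf:
        return None  # ball not confidently visible in this frame
    return int(x), int(y), conf

# Toy 480x640 heatmap with a confident peak at (x=320, y=200)
hm = np.zeros((480, 640))
hm[200, 320] = 0.93
print(ball_from_heatmap(hm))  # -> (320, 200, 0.93)
```

In practice the thresholded peaks get smoothed into a trajectory downstream; a single-frame argmax is just the raw signal.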
AceSense uses a TrackNet-derived model. We don't claim to have invented it; the open-source heritage is real and we link to it. What we did do is retrain it on a much larger dataset of phone-recorded amateur matches — the public TrackNet weights were tuned on broadcast TV footage, which has cleaner contrast, controlled lighting, and a fixed camera. Phone footage is dirtier. The retrained model is what runs in production.
What it fails at: very low-contrast scenes (dim indoor courts, late-evening outdoor sessions), and balls partially occluded by the net or a player's body. The "How accurate is tennis ball tracking?" question in Google's "People Also Ask" box exists for a reason — search it yourself and you'll find a community that's been burned by overclaims.
(See fig. 1: the pipeline overview diagram, showing how the ball heatmap from TrackNet feeds downstream stages.)
Stage 2: Court keypoint detection
The ball's position in the frame is meaningless on its own. You need a coordinate system — where on the court did this happen? That's what court keypoint detection does.
The system locates the corners and key intersections of the tennis court — baseline corners, service line corners, centre service mark, net posts. Once those points are pinned in the frame, a homography transform converts any pixel coordinate into a real-world court coordinate (in metres, relative to the court). Now "the ball bounced at frame 4,213" becomes "the ball bounced 1.2 metres inside the baseline, 0.8 metres from the sideline."
Without this, you don't have a heatmap, you don't have line calls, and you don't have any way to say "this serve landed in the deuce box." Court keypoint detection is the unglamorous step that makes the report make sense.
What it fails at: courts with faded or covered lines (heavily worn clay, courts with snow patches, courts where the singles sticks are missing on a doubles court being used for singles). It also struggles when the camera is tilted off-axis from the baseline by more than ~30°. If you set the camera up correctly — see /how-to/film-your-tennis-match — this stage is rarely the bottleneck.
Stage 3: Player detection + pose
Now we know where the ball is and where the court is. We need to know where the players are and what their bodies are doing.
Player detection runs Faster R-CNN — a standard, well-understood object detector — on each frame to find the people. This is the easy part: humans on a tennis court are big, distinct objects against a high-contrast background. Faster R-CNN nails it.
The harder part is pose. Once we have a bounding box around each player, we run MediaPipe's pose model inside that box to extract 33 body keypoints — head, shoulders, elbows, wrists, hips, knees, ankles, and foot points. MediaPipe is Google's open-source pose framework; it runs fast, it's accurate enough for tennis-scale movements, and it gives us the per-frame skeleton we need for the rest of the pipeline.
The pose is what separates AI tennis analysis from "AI ball tracking." Without pose, you can tell where shots land but not what kind of shot was hit, and you can't say anything about technique. Pose is the signal that makes coaching tips possible.
What it fails at: when the player is heavily occluded (the net pole crossing their torso, another player walking through the frame in doubles), or when the camera is so far away that the player is fewer than ~80 pixels tall. Phone cameras at standard fence-mount distance handle this fine; cameras placed in a stadium upper deck do not.
Stage 4: Bounce + shot classification (CatBoost + pose features)
This is where the pieces come together. We have the ball trajectory (Stage 1), the court coordinates (Stage 2), and the player pose at every frame (Stage 3). The classifier's job is to look at those signals and label every event:
- Bounce events: the moment the ball hits the court. Used for line calls and the heatmap.
- Shot events: the moment a player makes contact with the ball. Used for shot counts and stroke-quality.
- Shot type: forehand, backhand, serve, volley. Each requires a different combination of ball-trajectory features and pose features.
The classifier is CatBoost — a gradient-boosted decision tree library. We chose it over a deep neural network for two reasons: it's fast (the entire classification stage runs in seconds on a GPU), and it's interpretable (we can ask "why did you call this a backhand?" and get a feature-importance answer). For a system where we want to publish accuracy methodology and explain failures, interpretability matters.
The features that go in: ball trajectory derivatives (velocity, acceleration, height profile), distance from each player's racket-side wrist to the ball at the candidate contact frame, hip-shoulder rotation angle, foot stance, and a few more. The model was trained on tens of thousands of hand-labelled shots from amateur match footage.
What it fails at: doubles, where two players on one side of the court can confuse "which shot belonged to whom." Low-frame-rate video (sub-30 fps) where the ball-contact frame is genuinely missing. And shots that legitimately exist between categories — the "tweener" or the "between-the-legs return" — get classified as the closest standard shot type, which is honest but imperfect.
Stage 5: Stroke-quality scoring
The final stage is the one that produces a coaching tip rather than a stat. For each detected shot, we score the player's pose at contact (and a few frames before and after) against a reference distribution of well-executed shots of the same type.
The score is decomposed by component:
- Preparation — racket take-back, hip rotation, weight transfer.
- Contact — body position relative to the ball, racket-face angle estimate.
- Follow-through — racket finish, rotation completion, balance.
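One plausible way to score a single component against a reference distribution is a z-score with a Gaussian falloff, mapped onto 0-100. This is a hypothetical sketch, not AceSense's published formula, and the reference mean and standard deviation below are invented:

```python
import math

def component_score(value, ref_mean, ref_std):
    """Score a pose feature 0-100 by its distance from the reference
    mean, measured in standard deviations (Gaussian falloff)."""
    z = (value - ref_mean) / ref_std
    return 100.0 * math.exp(-0.5 * z * z)

# Invented reference: forehand contact 0.35 m in front of the front hip,
# std 0.10 m. A contact 12 cm late scores noticeably lower.
on_time = component_score(0.35, 0.35, 0.10)
late = component_score(0.23, 0.35, 0.10)
print(round(on_time), round(late))  # -> 100 49
```

Whatever the exact mapping, the useful property is the same: the score degrades smoothly with distance from the reference, so a 12 cm deviation produces a number a coach can track over time rather than a pass/fail.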
This isn't "AI says you're a 4.2 player." It's "your forehand contact-point distribution is 12 cm late on average compared to the reference; here's what that tends to cause." The score is a discussion starter, not a verdict — and we say so on every report.
What it fails at: non-classical techniques (extreme grip styles, deliberate technical choices that work for the player but score lower against an ATP/WTA-baseline reference). A coach who is intentionally building a player with non-textbook mechanics will see a lower stroke-quality score for shots that are, by their assessment, fine. We document this on the stroke-quality feature page and in the FAQ on every report.
Where the pipeline fails (the honest section)
If you only read one section, read this. AI tennis tools have a credibility problem because most of them oversell. Here is where AceSense's pipeline fails today, in order of how often we see it:
- Indoor courts with insufficient lighting. The ball detector needs contrast. Domes with bright LED arrays are fine; older indoor halls with yellowing fluorescents are hard.
- Heavily worn clay. The court keypoint model needs visible lines. Faded clay confuses Stage 2 and cascades into bad coordinate mapping for Stages 4 and 5.
- Doubles. Stage 4 (shot classification) was primarily trained on singles. Doubles works for shot detection on the player you're tracking, but the heatmap is calibrated to a singles court layout.
- Phone camera placement that violates the setup guide. If the camera is too low, too high, or too far off-axis, every downstream stage degrades. Most of the bad-report support tickets we get are setup, not algorithm.
The full failure-mode catalogue is on /accuracy, with example frames.
How we measure all of this
The GPU pipeline lives in our acesense-gpu-backend project, and we run regression tests against hand-annotated match files using `python scripts/compare_events.py <annotations.json> <output_dir> --tolerance 5`. Every release, the script outputs precision/recall/F1 by event type. We publish the current numbers on /accuracy and update them per release. No invented numbers, no "industry-leading" without a citation. If you want the numbers, that page has them.
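The core of a tolerance-based event comparison is a greedy match between predicted and annotated event frames. A sketch of that logic (the real compare_events.py may differ; this just shows how precision/recall fall out of a frame tolerance):

```python
def match_events(predicted, annotated, tolerance=5):
    """Greedily match predicted event frames to annotated ones within
    `tolerance` frames, then compute precision, recall and F1."""
    unmatched = sorted(annotated)
    tp = 0
    for p in sorted(predicted):
        hit = next((a for a in unmatched if abs(a - p) <= tolerance), None)
        if hit is not None:
            unmatched.remove(hit)  # each annotation matches at most once
            tp += 1
    fp = len(predicted) - tp
    fn = len(annotated) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 3 annotated bounces; 2 predictions land within 5 frames, 1 is spurious
p, r, f1 = match_events([102, 250, 400], [100, 255, 310], tolerance=5)
print(p, r)  # -> 0.6666666666666666 0.6666666666666666
```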
What this all adds up to
Five stages, each doing a job that none of the others can: TrackNet finds the ball, court detection gives us coordinates, Faster R-CNN + MediaPipe find the players and their bodies, CatBoost classifies the events, and the stroke-quality scorer translates pose into a coaching artefact. The output is a PDF that tells you what happened, where it happened, and how cleanly it was hit — the three things a club player needs and a phone-camera-only setup can deliver.
If you've made it this far, you probably want to either see how this compares to other apps — /compare/swingvision is the honest version — or try the pipeline on your own video. Both work.
Next: "How accurate is AceSense? Our methodology and benchmarks" walks through the regression suite and what the current build's numbers actually look like. Or head to /how-it-works for the visual version of this page.