Tennis Shot Detection That Actually Keeps Up

Automatic forehand, backhand, serve, and volley classification on phone-recorded video. Honest accuracy numbers, real failure modes, no overpromising.

In plain English: shot detection is the part of AceSense that watches your video, finds every moment you hit the ball, and labels the shot — forehand, backhand, serve, volley, overhead. Every report you get out of AceSense is built on top of these labels. The court heatmap, the stroke-quality breakdown, the shot-mix counts — none of it exists without shot detection working underneath.

It's the unglamorous foundation. This page is how it works, how accurate it is, and where it breaks.

What it does, in one paragraph

AceSense's pipeline finds the player using a person-detection model, runs MediaPipe pose to extract the skeleton, tracks the ball with a TrackNet-derived ball detector, and uses a CatBoost classifier to combine pose, ball trajectory, and timing into a per-shot label. For every shot in the match, you get: a timestamp, a shot-type label, the bounce point of the resulting ball trajectory, and the per-component pose score that feeds the stroke-quality feature. The full pipeline is documented at /how-it-works; this page is specifically about the labelling step.

How accurate it is

Honest answer: it depends on the shot, the camera angle, and the player.

In our internal benchmark against hand-annotated club-level matches at NTRP 3.0–4.5 — recorded with a phone behind the baseline, court visible end-to-end, in daylight or under floodlights — we see roughly:

  • Forehand: F1 in the low-90s. The dominant training class; the cleanest pose pattern.
  • Backhand (one- and two-handed): F1 high-80s. Confusion mostly with slice forehands at certain stance angles.
  • Serve: F1 mid-to-high 80s. Confused with overheads at angles where the camera doesn't see the toss clearly.
  • Volley: F1 mid-80s. Less training data; pose is more variable; sometimes labelled as a hard groundstroke at the baseline.
  • Overhead: F1 low-80s. Smallest class; gets pulled toward 'serve' by the model.
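For readers unfamiliar with the metric: F1 is the harmonic mean of precision and recall, computed per class. A tiny worked example with made-up counts (not our benchmark data):

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for one class: 460 correct labels,
# 35 false alarms, 28 missed shots.
score = f1(tp=460, fp=35, fn=28)
print(round(score, 3))  # → 0.936
```

A "low-90s" F1 corresponds to roughly this regime: a handful of false alarms and misses per hundred shots.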

These are not numbers we want to overclaim. The full methodology — how we annotate, how we benchmark, what splits we use — is on the /accuracy page. We publish the regression suite output. We are explicit about where the numbers come from. This is the bar SwingVision and competitors haven't met publicly, and it's the one we hold ourselves to.

For comparison-page context: if you've seen forum complaints like "Am I really serving 130mph" or "my hardest serve only 66 mph" about competing apps, the underlying issue is exactly this — opaque accuracy on the shot-by-shot layer that everything else is built on. Our answer is to make the methodology auditable. See /blog/how-ai-tennis-shot-detection-works for the long-form explainer.

Where it fails

Three failure modes you should know about:

1. Overhead vs serve confusion at low camera angles

If your phone is mounted at hip height instead of net height, the toss for a serve and the contact for an overhead look similar in pose space. The model tends to call overheads serves, because the serve class has more training data. Mount the phone higher (3 feet / 1 m or above) and this confusion mostly disappears.

2. Volleys at the baseline

A hard half-volley at the baseline — taken on the rise, no real net approach — sometimes gets labelled as a forehand or backhand. The pose is genuinely ambiguous. The model is calibrated for net-volleys with the player inside the service line.

3. Player loss in doubles or fast-moving rallies

The pre-step before classification is finding the player. In doubles, when both players cross paths near the net, the model can briefly track the wrong person — meaning you'll get a "shot" labelled for the wrong player, or a missed shot. Singles is the supported path; doubles works in practice but is not benchmarked. Aim for singles for the most reliable counts.

There are smaller failure modes — shadowy courts at sunset, heavily compressed 480p video, players whose kit colour is close to the court surface — that can degrade the player-detection layer underneath classification. These aren't shot-detection bugs per se; they're upstream issues that propagate. If your video falls into the failure modes covered on the accuracy page, the shot-detection numbers above don't apply to your specific upload.

Why this is the right framing for an amateur player

Here's where shot detection earns its keep for an NTRP 3.0–4.5 player: you cannot self-coach without it.

Watch a match of yourself without per-shot labels and you'll come away with vibes. I think I hit more forehands than backhands. I think my backhand was off. I think I served well. Vibes lose tennis matches. Numbers tell you what you actually did.

A typical first AceSense report for a club player surfaces:

  • Shot mix is more imbalanced than you thought. 3:1 forehand-to-backhand is common. The opponent figured this out by game three; you figured it out from the report.
  • Your "weak side" is fine; your "strong side" is the leak. Most players assume their backhand is the problem. The report often shows the forehand placement is what's actually losing points.
  • Your serve isn't your serve. First serves and second serves get separated in the report — most players have a wider gap than they realised.

None of this is visible without shot detection running first. The heatmap, the stroke-quality score, the work-on-this items — all of them are built on the per-shot labels. Get the labels right and the rest of the report is trustworthy. Get the labels wrong and the report is junk.

This is also why we write so much about it: the ball tracking, court heatmap, and stroke quality features all consume shot-detection output. They're downstream of the same labels.

Walkthrough: one rally, end-to-end

You hit a cross-court forehand. Here's what happens:

  1. Frame-level person detection finds you in the video and bounds you.
  2. MediaPipe pose extracts your skeleton — wrists, elbows, shoulders, hips.
  3. TrackNet ball detection is running every frame; just before your contact, it has the ball trajectory.
  4. The contact event is detected by a combination of ball-trajectory inflection and your wrist-position dynamics.
  5. The CatBoost classifier takes the pose features at contact, the ball-trajectory direction in and out, and the timing relative to the previous shot, and outputs a label distribution: 87% forehand, 10% slice forehand, 3% volley.
  6. The label "forehand" goes into the report with its timestamp, its bounce point (next time the ball touches the court — that's the heatmap input), and its pose-quality breakdown (preparation, contact, follow-through — that's the stroke-quality input).

That whole sequence happens in milliseconds of compute time per shot, on a serverless GPU. By the time you've made dinner, the entire match has been processed shot-by-shot.

What it doesn't do

Be clear-eyed: shot detection isn't a coach. It tells you what you hit and roughly how well. It doesn't tell you what you should have hit instead — that's a coach's job, an opponent-aware tactical decision the model doesn't have visibility into. The work-on-this items in the report are based on outcome patterns (where the ball landed, how the technique scored), not on tactical match-context. Treat shot detection as the foundation of the data; treat coaching as a separate skill the report supports but doesn't replace.

Pricing

Shot detection is on every tier including free. There's no premium "more accurate shot detection" — same model, same accuracy, all tiers. The free tier limits how many matches per month you can analyse, not what features you get. Full breakdown at /pricing.


Ready to see it on your own video? Upload a match free and look at the per-shot table in the report. Or read the methodology page first if you want to know the benchmark numbers before you trust them. The labels are only as good as the upstream pipeline — see ball tracking and the court heatmap for the other halves of the puzzle.

Frequently asked questions

Which shot types does AceSense detect?
Forehand, backhand, serve, volley, overhead, and slice (as a forehand or backhand subtype). Forehand and backhand groundstrokes are the most reliably classified — they're the bulk of the training data and the easiest pose pattern. Serves and overheads are the hardest to disambiguate at certain angles; see the failure modes section.
How accurate is shot classification?
On our internal benchmark against hand-annotated club-level matches, the per-shot classification F1 is in the high-80s to low-90s for forehand and backhand, mid-80s for serves and volleys, and lower for overheads (where it's confused with serves at certain camera angles). The full methodology is on the /accuracy page — it's how we measure, not just the final number, that we want you to trust.
What if I hit a shot the model doesn't have a label for?
The model's class set is the major shot types listed above. If you hit a tweener, a fake drop shot off a backhand grip, or some other oddity, the classifier will assign its closest label — usually 'backhand' or 'volley'. The model doesn't make up new categories. For unusual shot types, treat the classification as a hint, not a verdict.
Does it work for left-handed players?
Yes. The pose model is handedness-agnostic — it labels by stroke side relative to the player's body, not relative to the court. Left-handed forehands are detected as forehands. Same accuracy band as right-handed.
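A common way to make pose features handedness-agnostic is to mirror a left-handed player's keypoints before feature extraction, so a left-handed forehand produces the same features as a right-handed one. The sketch below shows that idea; it is an assumption about the approach, not AceSense's actual preprocessing:

```python
def mirror_keypoints(keypoints: dict[str, tuple[float, float]],
                     frame_width: float) -> dict[str, tuple[float, float]]:
    """Mirror pose keypoints horizontally and swap left/right names.
    Illustrative sketch, not AceSense's actual preprocessing."""
    mirrored = {}
    for name, (x, y) in keypoints.items():
        if name.startswith("left_"):
            new_name = "right_" + name[len("left_"):]
        elif name.startswith("right_"):
            new_name = "left_" + name[len("right_"):]
        else:
            new_name = name
        mirrored[new_name] = (frame_width - x, y)  # flip x, keep y
    return mirrored

pose = {"left_wrist": (100.0, 300.0), "right_wrist": (500.0, 310.0)}
print(mirror_keypoints(pose, frame_width=640.0))
# → {'right_wrist': (540.0, 300.0), 'left_wrist': (140.0, 310.0)}
```

After mirroring, one classifier covers both handedness cases without a separate left-handed model.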
Does it work for doubles?
It works for shot detection on whichever player is being tracked, but the player-detection layer can lose the player you care about when both players cross paths near the net. Singles is the supported path; doubles works in practice but is not benchmarked.