Most AI tennis apps describe accuracy with adjectives. "Industry-leading." "Highly accurate." "Pro-grade." Here are numbers.
This is the methodology page in blog form. It explains how we measure AceSense's accuracy, what the current build's numbers actually look like, and — the part nobody else seems to write — where the model fails. If you're a sceptic, this is the page that should either earn your trust or send you somewhere else.
TL;DR — the honest version
- We hand-annotate a held-out set of amateur match videos and run a regression suite against the pipeline output every release.
- The script that does it lives in our GPU backend: python scripts/compare_events.py <annotations.json> <output_dir> --tolerance 5.
- Shot detection (was a shot hit?) is highly accurate. Shot classification (was it a forehand or a backhand?) is slightly less accurate. Ball-speed estimation from monocular phone video has a meaningful error band that we hedge explicitly.
- Failure modes are concentrated in three places: dim indoor courts, heavily worn clay, and doubles.
- The full per-release numbers live on /accuracy. This post is the methodology behind that page.
Why this page exists
There's a Google PAA — "How accurate is tennis ball tracking?" — that surfaces every time someone searches for a tennis AI app. The community has been burned. Look at the r/10s thread "Am I really serving 130 mph?" or the companion r/10s thread "my hardest serve only 66 mph?" — players posting the same complaint about competitor apps: the numbers don't match physical reality. The community's defence is "use it for relative comparisons, not absolute numbers."
That's a fine workaround if you got the app for free. It's not a fine answer if you paid $400 a year. So the bar for AceSense's accuracy page is: publish the methodology, publish the numbers, document the failures, update per release.
The five things we measure
The pipeline (explained in detail here) produces five kinds of output. Each has its own accuracy metric.
1. Shot-event detection
Question: Did a shot happen at this frame?
Metric: Precision, recall, F1, with a tolerance window. A predicted shot at frame N matches a ground-truth shot at frame M if |N − M| ≤ 5 frames (167 ms at 30 fps). The tolerance accounts for the fact that "the moment of contact" is itself a 1–2-frame ambiguity in 30-fps footage.
What we hit (current build): high-90s F1 on clean, well-filmed singles match video. We publish the exact number on /accuracy.
Why it's high: the contact event has very strong visual signals — ball trajectory inflection, racket-arm pose, ball-to-racket distance minimum. The model has a lot to work with.
Where it fails: very fast exchanges at the net (volley exchanges where contact-to-contact is under 250 ms can fool the de-duplication step), and shots where the player is heavily occluded by the net post.
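The matching rule above reduces to a small predicate. This is a minimal sketch, not the production matcher; the names and the 30 fps assumption are ours:

```python
FPS = 30                 # assumed capture frame rate
TOLERANCE_FRAMES = 5     # the ±5-frame window described above

def frames_match(pred_frame, truth_frame, tol=TOLERANCE_FRAMES):
    """A predicted shot at frame N matches a ground-truth shot at
    frame M if |N - M| <= tol."""
    return abs(pred_frame - truth_frame) <= tol

def window_ms(tol=TOLERANCE_FRAMES, fps=FPS):
    """One-sided width of the tolerance window in milliseconds."""
    return tol / fps * 1000.0
```

At 30 fps, five frames is the 167 ms window quoted above; at 60 fps the same five frames would halve the window, which is one reason higher frame rates help.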
2. Shot-type classification
Question: Given that a shot happened, was it a forehand, backhand, serve, or volley?
Metric: Per-class precision/recall/F1. We report a confusion matrix.
What we hit: low-90s F1, with the most confusion between forehand and backhand in unusual stances (open-stance backhand, defensive forehand from the deep ad corner). Serve detection is essentially solved (very high F1) because the pose signature is so distinct. Volley detection is the weakest of the four because volleys are short, fast, and pose-ambiguous.
Why classification is harder than detection: detection just needs "something happened"; classification needs to disambiguate two pose configurations that can look similar in a 2D phone-camera projection.
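Per-class precision/recall/F1 from a confusion matrix is standard, but for concreteness here is a minimal sketch of the computation (function name and input shape are illustrative, not our internal API):

```python
from collections import Counter

def per_class_f1(pairs):
    """pairs: list of (truth_label, predicted_label) for matched shots.
    Returns {label: (precision, recall, f1)} per shot type."""
    pairs = list(pairs)
    labels = {t for t, _ in pairs} | {p for _, p in pairs}
    counts = Counter(pairs)
    scores = {}
    for lab in labels:
        tp = counts[(lab, lab)]
        fp = sum(v for (t, p), v in counts.items() if p == lab and t != lab)
        fn = sum(v for (t, p), v in counts.items() if t == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[lab] = (prec, rec, f1)
    return scores
```

The confusion matrix we publish is just the `counts` table above, rendered per release.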
3. Bounce-event detection
Question: Did the ball bounce at this frame, at this location?
Metric: Frame-tolerance F1 (same 5-frame window) plus a 2D position error in metres on the court coordinate system.
What we hit: F1 in the high-80s to low-90s, with position error typically under 30 cm on hard courts in good lighting.
Why position error matters more than detection F1 for bounces: a bounce detection that's off by 30 cm at the baseline is a different line call than one that's off by 5 cm. We report both numbers. We do not claim Hawk-Eye precision; the Wimbledon line-calling explainer covers why.
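The two bounce metrics can be sketched in a few lines. This is an illustration, not our scoring code; the coordinates and the baseline position are hypothetical:

```python
import math

def bounce_position_error_m(pred_xy, truth_xy):
    """Euclidean error in metres between predicted and annotated bounce
    positions, both expressed in court coordinates."""
    return math.hypot(pred_xy[0] - truth_xy[0], pred_xy[1] - truth_xy[1])

def line_call_flips(pred_y, truth_y, line_y):
    """True if the position error puts the bounce on the other side of a
    line (e.g. the baseline), i.e. the error would change the call."""
    return (pred_y > line_y) != (truth_y > line_y)
```

This is why we report position error alongside F1: only the first number tells you whether `line_call_flips` can fire on a close ball.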
4. Court-keypoint accuracy
Question: Are the four corners and key court intersections correctly located in the frame?
Metric: Pixel error per keypoint, averaged across the keypoint set, normalised by court width in pixels.
What we hit: sub-1% normalised error on hard courts and most clay courts; meaningfully worse on courts with faded lines or occluding fence posts.
Why this matters: every downstream coordinate (heatmap, line call, bounce position) inherits the court-keypoint error. A 2% court error is a 50 cm line-call error. So we measure this carefully and the threshold to even produce a report is "all four corners detected with sub-2% error."
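The normalised-error metric and the report gate can be sketched as follows. A minimal illustration under our definitions above; the function names are ours, not the pipeline's:

```python
import math

def normalised_keypoint_error(pred_pts, truth_pts, court_width_px):
    """Mean per-keypoint pixel error, normalised by court width in pixels."""
    errs = [math.dist(p, t) for p, t in zip(pred_pts, truth_pts)]
    return sum(errs) / len(errs) / court_width_px

def corners_ok(corner_errors_px, court_width_px, max_norm=0.02):
    """The report gate from the text: all four corners detected, each
    under 2% normalised error."""
    return len(corner_errors_px) == 4 and all(
        e / court_width_px < max_norm for e in corner_errors_px)
```

Normalising by court width makes the number comparable across 720p and 4K footage, which is why we report it this way rather than as raw pixels.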
5. Ball-speed estimation
Question: What was the ball's speed at contact (or just after the bounce)?
Metric: Mean absolute percentage error vs a radar reference. The reference set is small — we don't have radar at every shoot — but it's the only honest way to validate speed.
What we hit: roughly 6–10% MAPE on serves filmed under recommended setup conditions; meaningfully worse on groundstrokes (the trajectory is shorter, the inferred 3D component is noisier).
Why we hedge speed estimates in the report: because monocular speed estimation from a phone camera has a structural error band that no amount of model improvement collapses to zero. We have an entire post on this — Why your serve speed reading might not be 130 mph (or 66 mph) — because the community has been mis-sold on this for years.
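MAPE itself is a one-liner; the hard part is the radar reference set, not the arithmetic. A minimal sketch (the example speeds are made up):

```python
def mape(predicted, reference):
    """Mean absolute percentage error of predicted speeds vs radar
    readings, in percent. Lists must be aligned shot-for-shot."""
    assert predicted and len(predicted) == len(reference)
    return 100.0 * sum(abs(p - r) / r
                       for p, r in zip(predicted, reference)) / len(predicted)
```

Note that MAPE divides by the radar value, so the same absolute error counts for more on a slow groundstroke than on a fast serve; that interacts with the shorter, noisier groundstroke trajectories mentioned above.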
How the regression suite actually runs
This is the bit most "we tested it carefully" claims skip. Here's the literal command:
cd acesense-gpu-backend
python scripts/compare_events.py \
games/tennis/data/sample_match_1min_720p.mp4.annotations_update.json \
output/sample_match_1min_720p \
--tolerance 5
It reads a hand-annotated JSON of every event in a video, runs the pipeline output through a temporal-matching algorithm with the specified frame tolerance, and prints precision/recall/F1 per event type plus a confusion matrix for shot types. We run this on every release, against a held-out test set the model never sees in training.
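The temporal-matching step can be sketched as a greedy one-to-one match within the tolerance window. This is an illustrative reduction of what `compare_events.py` does for a single event type, not the script itself:

```python
def match_events(pred_frames, truth_frames, tolerance=5):
    """Greedily match predicted event frames to ground-truth frames.
    Each ground-truth event absorbs at most one prediction within the
    tolerance window. Returns (precision, recall, f1)."""
    truth_left = sorted(truth_frames)
    tp = 0
    for p in sorted(pred_frames):
        for t in truth_left:
            if abs(p - t) <= tolerance:
                truth_left.remove(t)  # consumed: one-to-one matching
                tp += 1
                break
    fp = len(pred_frames) - tp    # predictions with no ground truth
    fn = len(truth_frames) - tp   # ground truth the pipeline missed
    prec = tp / len(pred_frames) if pred_frames else 0.0
    rec = tp / len(truth_frames) if truth_frames else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

One-to-one matching is what keeps a burst of duplicate detections around a single contact from inflating precision.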
The annotation files come from acesense-annotate, our open desktop tool for frame-accurate labelling. Coaches who use it produce .annotations.json files that feed the training set on an opt-in basis; the held-out test set is a separately curated subset that is never used for training.
How we built the test set
A test set is only as good as its representativeness. Ours covers:
- Surfaces: hard, clay, indoor hard. (Grass and carpet under-represented; flagged on /accuracy.)
- Levels: NTRP 3.0–4.5 amateur play (our actual users), with a smaller validation slice on 4.5–5.0 to catch over-fitting to lower-level pose distributions.
- Camera positions: all the recommended setups in /how-to/film-your-tennis-match, plus deliberately bad setups (off-axis, too low, too far) so we can characterise the degradation.
- Lighting conditions: outdoor sunny, outdoor overcast, outdoor floodlit evening, indoor LED-bright, indoor fluorescent-dim. The dim-fluorescent slice is where the ball detector struggles, and we report that separately so a buyer can decide.
- Match formats: singles dominant; small doubles slice with separate metrics.
We are deliberately not using broadcast TV footage in the test set. Broadcast footage is cleaner than what amateurs film. Reporting accuracy on broadcast footage is the apples-to-oranges error every legacy system makes. Our test set is phone footage, and our accuracy claims apply to phone footage.
Where the pipeline fails (in priority order)
Honest section. If you read nothing else, read this:
- Dim indoor halls with low-frequency fluorescent lighting. Ball detector loses contrast on the ball; the trajectory becomes intermittent; downstream events get noisy. Mitigation: use a phone with good low-light performance (recent flagships do better than budget phones here).
- Heavily worn clay courts. Court keypoint detection breaks when lines are faded or partially covered. Mitigation: brush the lines before recording. We're working on a "low-line-confidence" mode that uses a coarser court estimate.
- Doubles. Singles is the supported path. In doubles, two players on one side can confuse the shot-classifier's "which player hit this?" assignment. We're explicit about this on /use-cases/club-players.
- Off-axis camera setups. A camera tilted more than ~30° off the centreline of the baseline degrades court-keypoint detection. The fix is the setup guide.
- Sub-30 fps phone video. Frame rate matters: 30 fps is the floor, 60 fps is materially better, and 24 fps will produce missed contact frames on fast shots.
This list will look different in 12 months. It looks like this today. The /accuracy page has the dated changelog.
What we don't measure (yet)
A few things we would like to publish numbers for and don't yet:
- Spin estimation accuracy. Spin is shown in the report but the validation infrastructure is younger than the rest. We haven't earned the right to publish a tight number.
- Stroke-quality calibration. "How well does our pose-based score correlate with a coach's manual assessment?" requires a coach-labelled test set we're still building.
- Long-form stamina/fatigue effects within a match. Currently treated as out-of-scope.
When we have these, we'll publish them on /accuracy and add them to the regression suite.
What this means for you
If you're an NTRP 3.0–4.5 amateur filming on a recent phone with a reasonable setup on a hard or well-maintained clay court — the modal AceSense user — the pipeline will produce a report whose shot detection, shot type, heatmap, and stroke-quality score you can trust to a meaningful degree. The serve-speed and spin numbers come with an error band; treat them as relative measures over time, not absolute truth.
If you're filming in a dim indoor hall on faded clay during a doubles match, every stage of the pipeline is at the edge of its competence, and the report will reflect that. We'll show you the lower confidence rather than hide it.
The methodology is here. The numbers are on /accuracy. The comparison to alternatives is on /compare/swingvision. If after reading all three you still want to try it on your own video, the free tier is here.
Related reading: How AI tennis shot detection actually works is the technical companion to this page. Why your serve speed reading might not be 130 mph is the deep-dive on Stage 5 of the pipeline.