---
title: "How accurate is AceSense? Our methodology and benchmarks"
description: "Most AI tennis apps describe accuracy with adjectives. Here's how AceSense actually measures shot-detection F1, ball-speed error, and where the model fails."
slug: "how-accurate-is-acesense"
date: "2025-07-13"
author: "Akshay Sarode"
authorBio: "Founder, AceSense. Building AI tennis tools in Europe."
category: "Methodology"
schema: "BlogPosting"
faq:
  - q: "What's the headline accuracy number for AceSense?"
    a: "There isn't a single number, and any app that gives you one is hiding something. Accuracy varies by event type: shot detection (was a shot hit?) is high-90s F1 in the current build; shot classification (was it a forehand or a backhand?) is in the low-90s; ball-speed estimate from monocular video is roughly within 6–10% of radar on serves under good filming conditions, and worse on groundstrokes where the trajectory is shorter. The /accuracy page has the per-release numbers."
  - q: "How do you measure accuracy?"
    a: "We hand-annotate a held-out set of amateur tennis match videos with every shot, every bounce, and every shot-type label. Then we run the pipeline and compare its outputs to the human labels with scripts/compare_events.py, using a tolerance window (typically 5 frames, which is about 167 ms at 30 fps). The script reports precision, recall, and F1 per event type. We run it every release, and we publish the results."
  - q: "Why is your serve speed sometimes off by 10 mph?"
    a: "Because we're estimating speed from a single 2D phone camera, not from a calibrated radar. Speed estimation needs the ball trajectory in 3D, and we infer the third dimension from court keypoints and a physical model. Two things degrade the estimate: a camera placed off-centre from the baseline, and a serve that doesn't take a clean parabolic arc (heavy slice, short kicker). We hedge speed estimates explicitly in the report and we wrote a whole post on it: /blog/serve-speed-reading-explained."
  - q: "Will the accuracy improve over time?"
    a: "Yes, and we publish the changelog on /accuracy so you can see when. The two biggest accuracy levers are (1) more hand-annotated training data: every annotated match a coach uploads via our annotation tool feeds into the next training run; and (2) phone-camera setup guidance for users, which removes a lot of the 'bad input, bad output' failures."
  - q: "Why don't you compare yourself to SwingVision's accuracy?"
    a: "Because SwingVision doesn't publish a methodology page. Their accuracy claims are marketing copy with no test set, no F1 numbers, no failure-mode list. We can't run a fair comparison without a shared test set, and we won't compare on adjectives. If they publish numbers, we'll compare numbers."
---

# How accurate is AceSense? Our methodology and benchmarks

Most AI tennis apps describe accuracy with adjectives. "Industry-leading." "Highly accurate." "Pro-grade." Here are numbers.

This is the methodology page in blog form. It explains how we measure AceSense's accuracy, what the current build's numbers actually look like, and where the model fails (the part nobody else seems to write). If you're a sceptic, this is the page that should either earn your trust or send you somewhere else.

## TL;DR: the honest version

- We hand-annotate a held-out set of amateur match videos and run a regression suite against the pipeline output every release.
- The script that does it lives in our GPU backend: `python scripts/compare_events.py <annotations.json> <output_dir> --tolerance 5`.
- Shot *detection* (was a shot hit?) is highly accurate. Shot *classification* (was it a forehand or backhand?) is slightly lower. Ball *speed* estimation from monocular phone video has a meaningful error band that we hedge explicitly.
- Failure modes are concentrated in three places: dim indoor courts, heavily worn clay, and doubles.
- The full per-release numbers live on [/accuracy](/accuracy). This post is the methodology behind that page.

## Why this page exists

There's a [Google "People Also Ask" question, "How accurate is tennis ball tracking?"](https://www.google.com/search?q=tennis+ball+tracking+app), that surfaces every time someone searches for a tennis AI app. The community has been burned. Look at the [r/10s thread "Am I really serving 130 mph?"](https://www.reddit.com/r/10s/comments/xc2xc0/) or the companion [r/10s thread "my hardest serve only 66 mph?"](https://www.reddit.com/r/10s/comments/17c8ozf/): in both, players post the same complaint about competitor apps, namely that the numbers don't match physical reality. The community's defence is "use it for relative comparisons, not absolute numbers."

That's a fine workaround if you got the app for free. It's not a fine answer if you paid $400 a year. So the bar for AceSense's accuracy page is: publish the methodology, publish the numbers, document the failures, update per release.

## The five things we measure

The pipeline ([explained in detail here](/blog/how-ai-tennis-shot-detection-works)) produces five kinds of output. Each has its own accuracy metric.

### 1. Shot-event detection

**Question:** Did a shot happen at this frame?

**Metric:** Precision, recall, F1, with a tolerance window. A predicted shot at frame N matches a ground-truth shot at frame M if |N − M| ≤ 5 frames (167 ms at 30 fps). The tolerance accounts for the fact that "the moment of contact" is itself a 1–2-frame ambiguity in 30-fps footage.

**What we hit (current build):** high-90s F1 on clean, well-filmed singles match video. We publish the exact number on [/accuracy](/accuracy).

**Why it's high:** the contact event has very strong visual signals: ball trajectory inflection, racket-arm pose, and a minimum in ball-to-racket distance. The model has a lot to work with.

**Where it fails:** very fast exchanges at the net (volley exchanges where contact-to-contact is under 250 ms can fool the de-duplication step), and shots where the player is heavily occluded by the net post.
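
The tolerance-window metric above can be sketched in a few lines. This is a minimal illustration, not the code in `scripts/compare_events.py`: the function and class names are hypothetical, and the greedy nearest-frame matching is an assumption (the real script's matching strategy is not described here).

```python
from dataclasses import dataclass


@dataclass
class MatchResult:
    true_positives: int
    false_positives: int
    false_negatives: int

    @property
    def precision(self) -> float:
        denom = self.true_positives + self.false_positives
        return self.true_positives / denom if denom else 0.0

    @property
    def recall(self) -> float:
        denom = self.true_positives + self.false_negatives
        return self.true_positives / denom if denom else 0.0

    @property
    def f1(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0


def match_events(predicted: list[int], truth: list[int], tolerance: int = 5) -> MatchResult:
    """Greedy one-to-one matching of predicted frames to ground-truth frames.

    Hypothetical helper: each ground-truth frame can be claimed by at most
    one prediction, so duplicate detections count as false positives.
    """
    unmatched_truth = sorted(truth)
    tp = 0
    for frame in sorted(predicted):
        # Ground-truth frames still unclaimed and within the tolerance window.
        candidates = [t for t in unmatched_truth if abs(t - frame) <= tolerance]
        if candidates:
            best = min(candidates, key=lambda t: abs(t - frame))
            unmatched_truth.remove(best)
            tp += 1
    return MatchResult(tp, len(predicted) - tp, len(truth) - tp)


# Made-up frame numbers, for illustration only.
result = match_events([10, 42, 100, 103], [12, 40, 101, 200], tolerance=5)
print(result.precision, result.recall, result.f1)  # 0.75 0.75 0.75
```

In this toy input, the prediction at frame 103 finds no unclaimed ground-truth shot within 5 frames (101 is already taken by 100), so it is a false positive, and the shot at frame 200 is a miss. Greedy matching is simple but not guaranteed optimal when events are dense, which is the same regime as the fast net exchanges mentioned above.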

### 2. Shot-type classification

**Question:** Given that a shot happened, was it a forehand, backhand, serve, or volley?

**Metric:** Per-class precision/recall/F1. We report a confusion matrix.

**What we hit:** low-90s F1, with the most confusion between forehand and backhand in unusual stances (open-stance backhand, defensive forehand from the deep ad corner). Serve detection is essentially solved (very high F1) because the pose signature is so distinct. Volley detection is the weakest of the four because volleys are short, fast, and pose-ambiguous.

**Why classification is harder than detection:** detection just needs "something happened"; classification needs to disambiguate two pose configurations that can look similar in a 2D phone-camera projection.
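
As a rough sketch of how per-class numbers fall out of a confusion matrix: the snippet below assumes shots have already been matched to ground truth, so each item is a `(true_label, predicted_label)` pair. The function name and the sample pairs are made up for illustration and are not AceSense results.

```python
from collections import Counter

LABELS = ["forehand", "backhand", "serve", "volley"]


def per_class_scores(pairs, labels=LABELS):
    """Return {label: (precision, recall, f1)} from (true, predicted) pairs."""
    counts = Counter(pairs)  # the confusion matrix, keyed by (true, predicted)
    scores = {}
    for label in labels:
        tp = counts[(label, label)]
        # Predicted as this label, but the truth was something else.
        fp = sum(n for (t, p), n in counts.items() if p == label and t != label)
        # Truly this label, but predicted as something else.
        fn = sum(n for (t, p), n in counts.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Illustrative pairs only.
pairs = (
    [("forehand", "forehand")] * 3
    + [("forehand", "backhand")]
    + [("backhand", "backhand")] * 2
    + [("serve", "serve")] * 2
    + [("volley", "forehand")]
)
for label, (p, r, f1) in per_class_scores(pairs).items():
    print(f"{label:9s} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Reporting the matrix alongside the per-class scores shows *which* classes get confused with each other, not just how often a class is wrong.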

### 3. Bounce-event detection

**Question:** Did the ball bounce at this frame, at this location?

**Metric:** Frame-tolerance F1 (same 5-frame window) plus a 2D position error in metres on the court coordinate system.

**What we hit:** F1 in the high-80s to low-90s, with position error typically under 30 cm on hard courts in good lighting.

**Why position error matters more than detection F1 for bounces:** a bounce-event detection that's off by 30 cm at the baseline is a different line call than one off by 5 cm. We report both numbers. We do not claim Hawk-Eye precision; the [Wimbledon line-calling explainer](/blog/wimbledon-electronic-line-calling-explained) explains why.
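
A small sketch of the position-error side of this metric, assuming bounces have already been matched within the frame tolerance and that both positions are in court coordinates in metres. The helper names and sample coordinates are hypothetical.

```python
import math
from statistics import mean


def position_errors_m(matched_bounces):
    """Euclidean error in metres for each (predicted_xy, truth_xy) pair."""
    return [math.dist(pred, truth) for pred, truth in matched_bounces]


def call_could_flip(distance_to_line_m, position_error_m):
    """True when the error is at least as large as the ball's distance from the line."""
    return position_error_m >= distance_to_line_m


# Made-up court coordinates in metres, for illustration only.
matched = [
    ((1.0, 2.0), (1.3, 2.4)),    # 0.5 m off
    ((4.0, 11.0), (4.0, 11.1)),  # 0.1 m off
]
errors = position_errors_m(matched)
print(f"mean error: {mean(errors):.2f} m, worst: {max(errors):.2f} m")

# A ball that truly landed 10 cm from the line:
print(call_could_flip(distance_to_line_m=0.10, position_error_m=0.30))  # True
print(call_could_flip(distance_to_line_m=0.10, position_error_m=0.05))  # False
```

The second helper is the reason both numbers are reported: two systems with the same detection F1 can differ in whether a close ball is callable at all.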

### 4. Court-keypoint accuracy

**Question:** Are the four corners and key court intersections correctly located in the frame?

**Metric:** Pixel error per keypoint, averaged across the keypoint set, normalised by court width in pixels.

**What we hit:** sub-1% normalised error on hard courts and most clay courts; meaningfully worse on courts with faded lines or occluding fence posts.

**Why this matters:** every downstream coordinate (heatmap, line call, bounce position) inherits the court-keypoint error. A 2% court error can become a line-call error of roughly 50 cm. So we measure this carefully, and the threshold to even produce a report is "all four corners detected with sub-2% error."
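
The normalised keypoint metric can be sketched as follows. This assumes keypoints are `(x, y)` pixel coordinates listed in the same order for prediction and annotation, and that the court width in pixels is supplied by the caller; how that width is measured in a given frame is not specified here, so treat `court_width_px` as a placeholder.

```python
import math


def keypoint_errors_px(predicted, truth):
    """Pixel distance between each predicted keypoint and its annotation."""
    return [math.dist(p, t) for p, t in zip(predicted, truth)]


def normalised_error(predicted, truth, court_width_px):
    """Mean pixel error across keypoints, as a fraction of court width in pixels."""
    errors = keypoint_errors_px(predicted, truth)
    return (sum(errors) / len(errors)) / court_width_px


# Made-up pixel coordinates for four corners, for illustration only.
predicted = [(103, 204), (900, 200), (50, 603), (954, 600)]
truth = [(100, 200), (900, 200), (50, 600), (950, 600)]

score = normalised_error(predicted, truth, court_width_px=800)
print(f"{score:.3%}")  # 0.375%
```

Normalising by court width makes the number comparable across videos shot at different resolutions and zoom levels, which a raw pixel error is not.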

### 5. Ball-speed estimation

**Question:** What was the ball's speed at contact (or just after the bounce)?

**Metric:** Mean absolute percentage error (MAPE) vs a radar reference. The reference set is small (we don't have radar at every shoot), but it's the only honest way to validate speed.

**What we hit:** roughly 6–10% MAPE on serves filmed under recommended setup conditions; meaningfully worse on groundstrokes (the trajectory is shorter, the inferred 3D component is noisier).

**Why we hedge speed estimates in the report:** because monocular speed estimation from a phone camera has a structural error band that no amount of model improvement collapses to zero. We have an entire post on this, [Why your serve speed reading might not be 130 mph (or 66 mph)](/blog/serve-speed-reading-explained), because the community has been mis-sold on this for years.
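
For reference, MAPE against radar is a one-line formula. The sketch below uses made-up speeds purely to show the arithmetic; the function name is illustrative.

```python
def mape(estimates, references):
    """Mean absolute percentage error of estimates against reference values."""
    if len(estimates) != len(references) or not references:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(e - r) / r for e, r in zip(estimates, references))
    return 100.0 * total / len(references)


# Made-up serve speeds in mph, for illustration only.
estimated = [108, 95, 130]
radar = [100, 100, 125]
print(f"MAPE: {mape(estimated, radar):.1f}%")  # individual errors: 8%, 5%, 4%
```

Because MAPE averages absolute errors, an overestimate on one serve and an underestimate on the next do not cancel out, which is what you want when validating individual readings.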

## How the regression suite actually runs

This is the bit most "we tested it carefully" claims skip. Here's the literal command:

```bash
cd acesense-gpu-backend
python scripts/compare_events.py \
  games/tennis/data/sample_match_1min_720p.mp4.annotations_update.json \
  output/sample_match_1min_720p \
  --tolerance 5
```

It reads a hand-annotated JSON of every event in a video, runs the pipeline output through a temporal-matching algorithm with the specified frame tolerance, and prints precision/recall/F1 per event type plus a confusion matrix for shot types. We run this on every release, against a held-out test set the model never sees in training.

The annotation files come from [acesense-annotate](https://github.com/akshaysarode/acesense-annotate), our open desktop tool for frame-accurate labelling. Coaches who use it produce `.annotations.json` files that feed the training set when they opt in; the held-out test set is a separately curated subset that is never used in training.

## How we built the test set

A test set is only as good as its representativeness. Ours covers:

- **Surfaces:** hard, clay, indoor hard. (Grass and carpet under-represented; flagged on /accuracy.)
- **Levels:** NTRP 3.0–4.5 amateur play (our actual users), with a smaller validation slice on 4.5–5.0 to catch over-fitting to lower-level pose distributions.
- **Camera positions:** all the recommended setups in [/how-to/film-your-tennis-match](/how-to/film-your-tennis-match), plus deliberately bad setups (off-axis, too low, too far) so we can characterise the degradation.
- **Lighting conditions:** outdoor sunny, outdoor overcast, outdoor floodlit evening, indoor LED-bright, indoor fluorescent-dim. The dim-fluorescent slice is where the ball detector struggles, and we report that separately so a buyer can decide.
- **Match formats:** singles dominant; small doubles slice with separate metrics.

We are deliberately *not* using broadcast TV footage in the test set. Broadcast footage is cleaner than what amateurs film. Reporting accuracy on broadcast footage is the apples-to-oranges error every legacy system makes. Our test set is phone footage, and our accuracy claims apply to phone footage.

## Where the pipeline fails (in priority order)

Honest section. If you read nothing else, read this:

1. **Dim indoor halls with low-frequency fluorescent lighting.** Ball detector loses contrast on the ball; the trajectory becomes intermittent; downstream events get noisy. Mitigation: use a phone with good low-light performance (recent flagships do better than budget phones here).
2. **Heavily worn clay courts.** Court keypoint detection breaks when lines are faded or partially covered. Mitigation: brush the lines before recording. We're working on a "low-line-confidence" mode that uses a coarser court estimate.
3. **Doubles.** Singles is the supported path. In doubles, two players on one side can confuse the shot-classifier's "which player hit this?" assignment. We're explicit about this on [/use-cases/club-players](/use-cases/club-players).
4. **Off-axis camera setups.** A camera tilted more than ~30° off the centreline of the baseline degrades court-keypoint detection. The fix is the setup guide.
5. **Sub-30 fps phone video.** Frame-rate matters. 30 fps is the floor; 60 fps is materially better; 24 fps will produce missed contact frames on fast shots.
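
To put the frame-rate point in physical terms, here is a back-of-the-envelope calculation of how far the ball moves between consecutive frames. It is plain arithmetic, not part of the pipeline; the 144 km/h speed is an arbitrary example.

```python
def frame_interval_ms(fps: float) -> float:
    """Time between consecutive frames, in milliseconds."""
    return 1000.0 / fps


def ball_travel_per_frame_m(speed_kmh: float, fps: float) -> float:
    """Distance the ball covers between two frames at a constant speed."""
    return (speed_kmh / 3.6) / fps


for fps in (24, 30, 60):
    print(
        f"{fps} fps: {frame_interval_ms(fps):.1f} ms between frames, "
        f"{ball_travel_per_frame_m(144, fps):.2f} m of ball travel at 144 km/h"
    )
```

At 144 km/h (40 m/s) the ball covers about 1.67 m between frames at 24 fps, 1.33 m at 30 fps, and 0.67 m at 60 fps, so at low frame rates the contact moment can easily fall between two frames.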

This list will look different in 12 months. It looks like this today. The /accuracy page has the dated changelog.

## What we don't measure (yet)

A few things we would like to publish numbers for and don't yet:

- **Spin estimation accuracy.** Spin is shown in the report but the validation infrastructure is younger than the rest. We haven't earned the right to publish a tight number.
- **Stroke-quality calibration.** "How well does our pose-based score correlate with a coach's manual assessment?" requires a coach-labelled test set we're still building.
- **Long-form stamina/fatigue effects within a match.** Currently treated as out-of-scope.

When we have these, we'll publish them on [/accuracy](/accuracy) and add them to the regression suite.

## What this means for you

If you're an NTRP 3.0–4.5 amateur filming on a recent phone with a reasonable setup on a hard or well-maintained clay court (the modal AceSense user), the pipeline will produce a report whose shot detection, shot type, heatmap, and stroke-quality score you can trust to a meaningful degree. The serve-speed and spin numbers come with an error band; treat them as relative measures over time, not absolute truth.

If you're filming in a dim indoor hall on faded clay during a doubles match, every stage of the pipeline is at the edge of its competence, and the report will reflect that. We'll show you the lower confidence rather than hide it.

The methodology is here. The numbers are on [/accuracy](/accuracy). The comparison to alternatives is on [/compare/swingvision](/compare/swingvision). If after reading all three you still want to try it on your own video, [the free tier is here](/).

---

**Related reading:** [How AI tennis shot detection actually works](/blog/how-ai-tennis-shot-detection-works) is the technical companion to this page. [Why your serve speed reading might not be 130 mph](/blog/serve-speed-reading-explained) is the deep-dive on Stage 5 of the pipeline.