AI HTS Classification Accuracy: What the Benchmarks Actually Show

Every vendor claims 90%+ accuracy. Independent benchmarks tell a different story. Here's what we measure, how we measure it, and what 70% means for real-world use.

Headline numbers (2026)

On a 200-item benchmark of novel CBP rulings from 2024-2025 — products the agent hasn't seen before — htsapi.dev achieves:

"Novel" is the important word. Many published HTS benchmarks recycle products the model already saw during training. We test on rulings issued after the agent's knowledge cutoff, so the numbers reflect generalization, not memorization.

How we got from 59% to 70%: the v2 agent

An earlier version of htsapi.dev used a two-stage pipeline: a fine-tuned semantic retrieval model returned top-50 candidate HTS codes, then an LLM picked the best one. That hit 59.1% at 10-digit on the public ATLAS benchmark — already the best published result for an automated system, but bottlenecked by retrieval. Roughly 31% of ATLAS items weren't in the top-50 candidates at all, so no amount of reasoning could recover them.
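The retrieval bottleneck can be made concrete with a little arithmetic. The sketch below uses only the two figures quoted above (the roughly 31% miss rate and the 59.1% v1 accuracy); the variable names are illustrative, and the calculation assumes both figures refer to the same ATLAS run.

```python
# Figures quoted in the paragraph above (the 31% is approximate).
miss_rate = 0.31      # share of ATLAS items whose correct code was not in the top-50
v1_accuracy = 0.591   # v1 end-to-end accuracy at 10 digits on ATLAS

# If the correct code is not among the candidates, the second stage cannot pick it,
# so the share of items that ARE retrievable caps end-to-end accuracy.
ceiling = 1.0 - miss_rate

# Accuracy of the LLM picker on the items where the right code was retrievable.
picker_accuracy = v1_accuracy / ceiling

print(f"retrieval ceiling: {ceiling:.1%}")
print(f"picker accuracy when the answer was retrievable: {picker_accuracy:.1%}")
```

Under these assumptions the v1 pipeline could not exceed about 69% no matter how good the second stage was, and the LLM was already choosing correctly on roughly 86% of the items it had a chance on. That is why the redesign targeted retrieval rather than the picking step.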

The v2 architecture replaces the rigid two-stage pipeline with an agent that has tools. The agent can:

Instead of being constrained by retrieval, the agent investigates: it searches rulings, reads legal notes, verifies that ruling codes still exist in the current schedule, and commits when confident. The result: a roughly 10-point lift on the same products that broke the v1 retrieval ceiling.

Confidence calibration

Accuracy alone is the wrong metric. What matters more: does the agent know when it's right? When it says "high confidence," is it actually more accurate than when it says "medium"? That's calibration.

From the v2 benchmark run:

Confidence   % of calls   10-digit accuracy   4-digit accuracy
High         ~55%         ~85%                ~95%
Medium       ~35%         ~55%                ~70%
Low          ~10%         ~30%                ~50%
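As a consistency check, the per-band figures in the table can be blended back into an overall accuracy. This is a minimal sketch that treats the approximate table values as exact and assumes every call falls into exactly one band; the names are illustrative.

```python
# (share of calls, 10-digit accuracy, 4-digit accuracy), approximate table values
bands = {
    "high":   (0.55, 0.85, 0.95),
    "medium": (0.35, 0.55, 0.70),
    "low":    (0.10, 0.30, 0.50),
}

def blended(column: int) -> float:
    """Weight one accuracy column by each band's share of calls."""
    return sum(row[0] * row[column] for row in bands.values())

overall_10 = blended(1)  # 10-digit accuracy across all calls
overall_4 = blended(2)   # 4-digit accuracy across all calls

print(f"blended 10-digit accuracy: {overall_10:.3f}")
print(f"blended 4-digit accuracy:  {overall_4:.3f}")
```

The blended 10-digit figure comes out at about 0.69, in line with the 70% headline, and the 4-digit figure at about 0.82. Because the table values are rounded, agreement to within a point or so is the most this check can show.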

When the agent says "high," trust it. When it says "medium," verify. When it says "low," it usually returns a clarification question instead of a commit. The confident_to field tells you the deepest level the agent is sure about — sometimes it commits at 6-digit and tells you what would unlock the 10-digit answer.
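One way to act on that guidance in a client is a small routing function. The sketch below is hypothetical: only the confident_to field is named in this article, so the other field names, the integer encoding of confident_to, and the queue names are assumptions for illustration rather than the actual response schema.

```python
from dataclasses import dataclass

@dataclass
class Classification:
    # Hypothetical response shape; only `confident_to` is named in the article.
    code: str           # digits of the committed code (placeholder values below)
    confidence: str     # assumed to be "high", "medium", or "low"
    confident_to: int   # assumed encoding: deepest digit level the agent is sure of

def route(result: Classification) -> str:
    """Pick a review queue from the confidence band and the committed depth."""
    if result.confidence == "low":
        # Low confidence usually arrives with a clarification question.
        return "answer_clarification"
    if result.confidence == "high" and result.confident_to >= 10:
        # Still reviewed by a human, but as a quick confirmation.
        return "fast_track_review"
    # Medium confidence, or a commit shallower than 10 digits.
    return "verify_manually"

print(route(Classification(code="0000000000", confidence="high", confident_to=10)))
print(route(Classification(code="000000", confidence="high", confident_to=6)))
```

Every branch still ends with a human; the routing only decides how much attention each result gets.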

Why not 90%?

Every commercial tool claims high accuracy. Here's why honest numbers are lower:

1. Human experts don't agree either

Customs classifiers agree with each other roughly 85-92% of the time at the 6-digit level. The WCO estimates 1 in 3 customs entries globally is misclassified. Licensed customs broker exam pass rates run from the single digits to ~30% per sitting. Classification is genuinely hard.

2. Self-reported benchmarks use different test sets

Tarifflo claims 88.3% on its own 103-item benchmark — authored by its founder, test set not public. Zonos claims ">90%" on its website but scored 43.7% on an independent 103-item test. The gap between self-reported (90%+) and independent (40-70%) reflects different test sets, different difficulty levels, and marketing inflation.
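Test-set size matters too: on a benchmark of 100 to 200 items, sampling noise alone moves the score by several points. The sketch below computes a standard 95% Wilson score interval. The success counts are inferred from the quoted percentages (91/103 ≈ 88.3%, 45/103 ≈ 43.7%, 140/200 = 70%) and are assumptions, not published counts.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Counts inferred from the percentages quoted in this article (assumption).
for label, k, n in [("88.3% of 103", 91, 103),
                    ("43.7% of 103", 45, 103),
                    ("70% of 200", 140, 200)]:
    lo, hi = wilson_interval(k, n)
    print(f"{label}: {lo:.1%} to {hi:.1%}")
```

For 140 correct out of 200, the interval runs from roughly 63% to 76%. Differences of a few points between tools on small test sets should be read with that width in mind, while the gap between the 40s and the high 80s is far larger than sampling noise can explain.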

3. Some products are structurally hard

The agent still struggles with:

No tool should be used for filing without human review. 70% at 10-digit means 3 in 10 classifications need correction. The value is in first-pass triage and audit defense: the agent narrows ~26,000 possible codes to a high-confidence candidate with CBP ruling citations and GRI rationale. A human reviewer picks or corrects the answer in seconds, not minutes.

What we don't claim

We don't claim "the best HTS classification API." We claim:

How we built it

The agent runs Gemini Flash with function calling. Each classification call:

  1. Reads the product description and decides which 2-3 HTS chapters could plausibly apply
  2. Searches CBP rulings via pgvector semantic similarity over 134,050 indexed rulings
  3. Reads ruling text when search snippets aren't conclusive
  4. Compares headings within a candidate chapter, citing legal notes when ruling out alternatives
  5. Verifies the 10-digit code exists in the current 2026 schedule (rulings often reference obsolete codes)
  6. Commits with rationale — code, confidence, GRI rule cited, ruling number cited
  7. Post-processes the result in Postgres, adding Section 301/232 duties from Census, FTA rates, and legal notes
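Step 5 is the easiest one to illustrate in isolation. The following is a minimal sketch, not the production code: the function name, the in-memory set standing in for the 2026 schedule, the placeholder codes, and the fall-back-to-a-shorter-prefix behavior are all assumptions made for the example.

```python
def verify_code(candidate: str, current_schedule: set[str]) -> dict:
    """Check a 10-digit code against the current schedule.

    If the full code is obsolete, fall back to the deepest prefix (8, 6, or
    4 digits) that still appears in the schedule. The fallback is an assumed
    behavior for illustration.
    """
    digits = candidate.replace(".", "")
    if digits in current_schedule:
        return {"code": digits, "valid": True, "valid_to": len(digits)}
    for level in (8, 6, 4):
        prefix = digits[:level]
        if any(code.startswith(prefix) for code in current_schedule):
            return {"code": prefix, "valid": False, "valid_to": level}
    return {"code": None, "valid": False, "valid_to": 0}

# Placeholder codes only; these are not real HTS lines.
schedule = {"1111222233", "1111224400"}

print(verify_code("1111.22.2233", schedule))  # code still exists
print(verify_code("1111.22.9900", schedule))  # obsolete: only the 6-digit prefix survives
```

A check like this matters because older rulings can cite codes that have since been renumbered or removed, so a code copied from a ruling is not necessarily valid in the current schedule.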

Average classification: ~8 seconds end-to-end including Census duty fetch.

About other benchmarks

ATLAS (arXiv:2509.18400): 259 real CBP rulings, public and reproducible. We previously hit 59.1% on this with the v1 architecture.

Tarifflo benchmark (arXiv:2412.14179): 103 items, authored by Tarifflo's founder, test set not public, claims 88.3%. Not independently reproducible.

We publish our headline numbers on a novel CBP rulings benchmark because:

Try it yourself. The free demo runs the same agent. Describe any product, see the classification with confidence levels, ruling citations, and effective duty rates. No signup needed.
