AI HTS Classification Accuracy: What the Benchmarks Actually Show

Every vendor claims 90%+ accuracy. Independent benchmarks tell a different story. Here's what we measure, how we measure it, and what 70% means for real-world use.

Headline numbers (2026)

On a 200-item benchmark of novel CBP rulings from 2024-2025 — products the agent hasn't seen before — htsapi.dev achieves:

70% exact 10-digit accuracy — the agent returns the exact code CBP assigned
70% at 6-digit (internationally harmonized HS level)
80% at 4-digit heading
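The three levels above can be scored by comparing code prefixes. The following is a minimal sketch, not the benchmark harness itself: the function names and the four example pairs are made up for illustration.

```python
def normalize(code: str) -> str:
    """Strip dots and spaces so '8471.30.0100' becomes '8471300100'."""
    return "".join(ch for ch in code if ch.isdigit())


def accuracy_at(pairs, digits):
    """Share of (predicted, actual) pairs that agree on the first `digits` digits."""
    hits = 0
    for predicted, actual in pairs:
        p, a = normalize(predicted), normalize(actual)
        # A prediction shorter than the requested depth never counts as a hit.
        if len(p) >= digits and p[:digits] == a[:digits]:
            hits += 1
    return hits / len(pairs)


# Illustrative pairs only: (agent output, code CBP assigned).
pairs = [
    ("8471.30.0100", "8471.30.0100"),  # exact match
    ("6109.10.0012", "6109.10.0040"),  # right to 8 digits, wrong statistical suffix
    ("3926.90.9985", "3924.10.4000"),  # right chapter, wrong heading
    ("9403.60.8081", "9403.60.8081"),  # exact match
]

for digits in (10, 6, 4):
    print(f"{digits}-digit accuracy: {accuracy_at(pairs, digits):.0%}")
```

Because each level is a prefix of the next, accuracy can only stay flat or rise as the depth gets shallower.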

"Novel" is the important word. Many published HTS benchmarks recycle products the model already saw during training. We test on rulings issued after the agent's knowledge cutoff, so the numbers reflect generalization, not memorization.

How we got from 59% to 70%: the v2 agent

An earlier version of htsapi.dev used a two-stage pipeline: a fine-tuned semantic retrieval model returned top-50 candidate HTS codes, then an LLM picked the best one. That hit 59.1% at 10-digit on the public ATLAS benchmark — already the best published result for an automated system, but bottlenecked by retrieval. Roughly 31% of ATLAS items weren't in the top-50 candidates at all, so no amount of reasoning could recover them.
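The retrieval ceiling can be made concrete with a little arithmetic. This sketch assumes that an item missing from the top-50 candidates is always scored as wrong, and that the 31% and 59.1% figures describe the same set of ATLAS items. Both are assumptions for illustration.

```python
# Figures quoted in the paragraph above.
missing_from_top50 = 0.31   # share of ATLAS items whose correct code was not retrieved
v1_accuracy = 0.591         # v1 exact 10-digit accuracy on ATLAS

# Assumption: if the correct code was never retrieved, the item cannot be scored as correct.
ceiling = 1.0 - missing_from_top50

# Accuracy of the LLM picker on the items where retrieval did surface the answer.
picker_accuracy = v1_accuracy / ceiling

print(f"best possible v1 accuracy: {ceiling:.0%}")
print(f"picker accuracy when the answer was retrievable: {picker_accuracy:.1%}")
```

Under those assumptions the picker was right roughly 86% of the time whenever the correct code was among its candidates, which is consistent with retrieval, not reasoning, being the bottleneck.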

The v2 architecture replaces the rigid two-stage pipeline with an agent that has tools. The agent can:

search_rulings — semantic search over 134,050 CBP CROSS rulings
get_chapter_headings — list 4-digit headings for a chapter to compare alternatives
get_chapter_notes — read the legal notes that exclude or include specific products
get_codes_under_heading — see all 6/8/10-digit codes under a heading
read_ruling — pull full CBP ruling text when a search snippet isn't enough

Instead of being constrained by retrieval, the agent investigates: it searches rulings, reads legal notes, verifies that ruling codes still exist in the current schedule, and commits when confident. The result is a lift of roughly 10 points in 10-digit accuracy (59% to 70%) on the kind of products that hit the v1 retrieval ceiling.
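The investigate-then-commit loop can be sketched as a small tool dispatcher. This is a hedged illustration, not the production agent: the tool names come from the list above, but the stub bodies, return shapes, the `scripted_model` stand-in for the LLM, and the ruling ID are all invented.

```python
from typing import Callable

# Stub tools. Names match the list above; bodies and return shapes are hypothetical.
def search_rulings(query: str) -> list:
    return [{"ruling": "EXAMPLE-001", "code": "8471300100", "snippet": "portable computer"}]

def get_codes_under_heading(heading: str) -> list:
    return ["8471300100", "8471410150"]

TOOLS: dict = {
    "search_rulings": search_rulings,
    "get_codes_under_heading": get_codes_under_heading,
}

def scripted_model(description: str, history: list) -> dict:
    """Stand-in for the LLM: picks the next action from what has been seen so far."""
    if not history:
        return {"tool": "search_rulings", "args": {"query": description}}
    candidate = history[0]["result"][0]["code"]
    if len(history) == 1:
        # Check that the code cited by the ruling still exists under its heading.
        return {"tool": "get_codes_under_heading", "args": {"heading": candidate[:4]}}
    still_valid = candidate in history[1]["result"]
    return {
        "commit": candidate if still_valid else candidate[:6],
        "confidence": "high" if still_valid else "medium",
    }

def classify(description: str, model: Callable = scripted_model, max_steps: int = 8) -> dict:
    """Run tool calls until the model commits or the step budget runs out."""
    history = []
    for _ in range(max_steps):
        action = model(description, history)
        if "commit" in action:
            return action
        result = TOOLS[action["tool"]](**action["args"])
        history.append({"tool": action["tool"], "result": result})
    return {"commit": None, "confidence": "low"}

print(classify("14-inch laptop computer"))
```

The point of the design is that the model, not a fixed pipeline, decides which tool to call next, so a poor first search result does not cap the final answer.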

Confidence calibration

Accuracy alone is the wrong metric. What matters more: does the agent know when it's right? When it says "high confidence," is it actually more accurate than when it says "medium"? That's calibration.

From the v2 benchmark run:

Confidence	% of calls	10-digit accuracy	4-digit accuracy
High	~55%	~85%	~95%
Medium	~35%	~55%	~70%
Low	~10%	~30%	~50%

When the agent says "high," trust it. When it says "medium," verify. When it says "low," it usually returns a clarification question instead of a commit. The confident_to field tells you the deepest level the agent is sure about — sometimes it commits at 6-digit and tells you what would unlock the 10-digit answer.
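As a sanity check, the table should roughly reproduce the headline numbers when each bucket is weighted by its share of calls. The sketch below does that arithmetic with the approximate values from the table and adds a routing helper. The `route` function and its labels are hypothetical, not part of the API.

```python
# Approximate values copied from the calibration table above.
buckets = {
    "high":   {"share": 0.55, "acc10": 0.85, "acc4": 0.95},
    "medium": {"share": 0.35, "acc10": 0.55, "acc4": 0.70},
    "low":    {"share": 0.10, "acc10": 0.30, "acc4": 0.50},
}

def blended(metric: str) -> float:
    """Overall accuracy implied by the per-bucket figures, weighted by share of calls."""
    return sum(b["share"] * b[metric] for b in buckets.values())

overall_10 = blended("acc10")
overall_4 = blended("acc4")
print(f"implied 10-digit accuracy: {overall_10:.3f}")
print(f"implied 4-digit accuracy:  {overall_4:.3f}")

def route(confidence: str) -> str:
    """Hypothetical review policy following the guidance in the paragraph above."""
    return {"high": "trust", "medium": "verify", "low": "clarify"}[confidence]

print(route("medium"))
```

The weighted figures come out at about 69% and 82%, in line with the 70% and 80% headline numbers given that the table entries are rounded.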

Why not 90%?

Every commercial tool claims high accuracy. Here's why honest numbers are lower:

1. Human experts don't agree either

Customs classifiers agree with each other roughly 85-92% of the time at 6 digits. The WCO estimates that 1 in 3 customs entries globally is misclassified. Pass rates for the customs broker license exam run from the single digits to ~30% per sitting. Classification is genuinely hard.

2. Self-reported benchmarks use different test sets

Tarifflo claims 88.3% on its own 103-item benchmark — authored by its founder, test set not public. Zonos claims ">90%" on its website but scored 43.7% on an independent 103-item test. The gap between self-reported (90%+) and independent (40-70%) reflects different test sets, different difficulty levels, and marketing inflation.

3. Some products are structurally hard

The agent still struggles with:

Chemicals with IUPAC names — "4[N-(2,4-Diamino-6-Pteridinylmethyl)-N-Methylamino] Benzoic Acid"
Function-based classifications — "parts suitable for use with machines of heading 84.71"
Multi-material composites — products requiring GRI 3 essential character analysis
Stale ruling codes — CBP rulings from 2006 may reference 10-digit codes that have since been restructured. The agent verifies via get_codes_under_heading, but occasionally trusts an obsolete code.
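One way to guard against the stale-code case is to fall back to the deepest prefix of the ruling's code that still exists in the current schedule, similar in spirit to what the confident_to field reports. This is a sketch under assumptions: `deepest_valid_prefix` is a hypothetical helper, and `current_codes` is a tiny illustrative set, not a real schedule extract.

```python
def deepest_valid_prefix(ruling_code: str, current_codes: set):
    """Return the longest 10/8/6/4-digit prefix of ruling_code found in the schedule, or None."""
    digits = "".join(ch for ch in ruling_code if ch.isdigit())
    for length in (10, 8, 6, 4):
        prefix = digits[:length]
        # Skip depths the ruling code does not actually reach.
        if len(prefix) < length:
            continue
        if any(code.startswith(prefix) for code in current_codes):
            return prefix
    return None


# Illustrative stand-in for the current schedule.
current_codes = {"8517130000", "8517140050"}

print(deepest_valid_prefix("8517.13.0000", current_codes))  # full code still exists
print(deepest_valid_prefix("8517.12.0050", current_codes))  # only the heading survives
print(deepest_valid_prefix("9999.00.0000", current_codes))  # nothing matches
```

When only a shallow prefix survives, the old ruling still narrows the heading, but the remaining digits have to be re-derived from the current schedule.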

No tool should be used for filing without human review. 70% at 10-digit means 3 in 10 classifications need correction. The value is in first-pass triage and audit defense: the agent narrows ~26,000 possible codes to a high-confidence candidate with CBP ruling citations and GRI rationale. A human reviewer picks or corrects the answer in seconds, not minutes.

What we don't claim

We don't claim "the best HTS classification API." We claim:

The accuracy numbers above are reproducible — run the same products through the API and check.
Every result includes the CBP ruling cited, the GRI rule applied, and confidence calibration.
We test on novel rulings, not training data.
When a CBP ruling exists, the agent finds it. When it doesn't, the agent reasons from the schedule.

How we built it

The agent runs Gemini Flash with function calling. Each classification call:

Reads the product description and decides which 2-3 HTS chapters could plausibly apply
Searches CBP rulings via pgvector semantic similarity over 134,050 indexed rulings
Reads ruling text when search snippets aren't conclusive
Compares headings within a candidate chapter, citing legal notes when ruling out alternatives
Verifies the 10-digit code exists in the current 2026 schedule (rulings often reference obsolete codes)
Commits with rationale — code, confidence, GRI rule cited, ruling number cited
Post-processing in Postgres adds Section 301/232 duties from Census, FTA rates, legal notes

Average classification: ~8 seconds end-to-end including Census duty fetch.
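The ruling search step ranks stored embeddings by similarity to the query embedding. In pgvector the `<=>` operator returns cosine distance; the sketch below reproduces that ranking in memory so it runs without a database. The three-dimensional vectors and ruling IDs are toy values, and the real table layout and embedding model are not described in this article.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity, the quantity pgvector's <=> operator computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def top_k(query_vec, index, k=2):
    """Return the k ruling IDs whose embeddings are closest to the query."""
    ranked = sorted(index, key=lambda row: cosine_distance(query_vec, row["embedding"]))
    return [row["ruling"] for row in ranked[:k]]

# Toy index: real embeddings would have hundreds of dimensions.
index = [
    {"ruling": "EXAMPLE-A", "embedding": [0.9, 0.1, 0.0]},
    {"ruling": "EXAMPLE-B", "embedding": [0.1, 0.9, 0.0]},
    {"ruling": "EXAMPLE-C", "embedding": [0.7, 0.3, 0.1]},
]

print(top_k([1.0, 0.0, 0.0], index))
```

A full sort is fine for a toy list; at 134,050 rulings a database would normally rely on an approximate index rather than scanning every row.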

About other benchmarks

ATLAS (arXiv:2509.18400) — 259 real CBP rulings, public, reproducible. We previously hit 59.1% on this with the v1 architecture.

Tarifflo benchmark (arXiv:2412.14179) — 103 items, authored by Tarifflo's founder, test set not public, claims 88.3%. Not independently reproducible.

We publish our headline numbers on a novel CBP rulings benchmark because:

Real CBP rulings, not LLM-generated descriptions
Issued after model training cutoff (no memorization)
Reproducible — anyone can pull the same rulings from CROSS

Try it yourself. The free demo runs the same agent. Describe any product, see the classification with confidence levels, ruling citations, and effective duty rates. No signup needed.
