AI HTS Classification Accuracy: What the Benchmarks Actually Show
Every vendor claims 90%+ accuracy. Independent benchmarks tell a different story. Here's what we measure, how we measure it, and what 70% means for real-world use.
Headline numbers (2026)
On a 200-item benchmark of novel CBP rulings from 2024-2025 — products the agent hasn't seen before — htsapi.dev achieves:
- 70% exact 10-digit accuracy — the agent returns the exact code CBP assigned
- 70% at 6-digit (internationally harmonized HS level)
- 80% at 4-digit heading
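The three levels above are prefix matches on the same 10-digit code. A minimal sketch of that scoring, assuming predictions and CBP-assigned codes are available as strings (the function names and sample codes are illustrative, not taken from the benchmark harness):

```python
def normalize(code: str) -> str:
    """Strip dots and spaces so '8471.30.0100' becomes '8471300100'."""
    return "".join(ch for ch in code if ch.isdigit())


def accuracy_at(pairs: list[tuple[str, str]], digits: int) -> float:
    """Share of (predicted, expected) pairs whose first `digits` digits match."""
    hits = 0
    for predicted, expected in pairs:
        p, e = normalize(predicted), normalize(expected)
        # A prediction shorter than the requested depth cannot count as a hit.
        if len(p) >= digits and p[:digits] == e[:digits]:
            hits += 1
    return hits / len(pairs)


# Made-up codes for illustration only, not benchmark data.
pairs = [
    ("8471.30.0100", "8471.30.0100"),  # exact 10-digit match
    ("8471.30.0100", "8471.41.0150"),  # same heading, different subheading
    ("6109.10.0012", "6109.10.0040"),  # same 6-digit code, different suffix
    ("3926.90.9985", "4202.92.9100"),  # different chapter
]

for digits in (10, 6, 4):
    print(f"{digits}-digit accuracy: {accuracy_at(pairs, digits):.0%}")
```

Because each level is a prefix of the next, accuracy can only stay flat or rise as the depth gets shallower.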
"Novel" is the important word. Many published HTS benchmarks recycle products the model already saw during training. We test on rulings issued after the agent's knowledge cutoff, so the numbers reflect generalization, not memorization.
How we got from 59% to 70%: the v2 agent
An earlier version of htsapi.dev used a two-stage pipeline: a fine-tuned semantic retrieval model returned the top 50 candidate HTS codes, then an LLM picked the best one. That hit 59.1% at 10-digit on the public ATLAS benchmark, already the best published result for an automated system, but it was bottlenecked by retrieval. Roughly 31% of ATLAS items had no correct code anywhere in the top 50 candidates, so no amount of reasoning could recover them.
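The retrieval ceiling can be worked out from the two figures above. This sketch assumes the LLM in v1 could only choose among the retrieved candidates, so every retrieval miss counts as a wrong answer:

```python
miss_rate = 0.31      # ATLAS items with no correct code in the top 50 (from the text)
v1_accuracy = 0.591   # v1 10-digit accuracy on ATLAS (from the text)

# Best possible accuracy if the LLM always picked correctly from the candidates.
ceiling = 1 - miss_rate

# How often the LLM picked correctly when the right code was actually retrievable.
pick_rate = v1_accuracy / ceiling

print(f"retrieval ceiling: {ceiling:.0%}")
print(f"pick rate when the answer was retrievable: {pick_rate:.1%}")
```

Under that assumption the selection stage was already right in roughly 86% of the cases it could win, which is why the remaining headroom sat in retrieval.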
The v2 architecture replaces the rigid two-stage pipeline with an agent that has tools. The agent can:
- search_rulings: semantic search over 134,050 CBP CROSS rulings
- get_chapter_headings: list 4-digit headings for a chapter to compare alternatives
- get_chapter_notes: read the legal notes that exclude or include specific products
- get_codes_under_heading: see all 6/8/10-digit codes under a heading
- read_ruling: pull full CBP ruling text when a search snippet isn't enough
Instead of being constrained by retrieval, the agent investigates: it searches rulings, reads legal notes, verifies that ruling codes still exist in the current schedule, and commits when confident. The result: a 10-point lift on the same products that broke the v1 retrieval ceiling.
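The general shape of a tool-using loop like this can be sketched as follows. This is not the production agent: the tool bodies are stubs with made-up return values, the action format is hypothetical, and `scripted_policy` stands in for the LLM's decisions.

```python
from typing import Callable


# Stub tools with made-up data; real tools would query CROSS and the schedule.
def search_rulings(query: str) -> list[dict]:
    return [{"ruling": "EXAMPLE-001", "code": "8471.30.0100", "snippet": query}]


def get_codes_under_heading(heading: str) -> list[str]:
    return ["8471.30.0100", "8471.41.0150"] if heading == "8471" else []


TOOLS: dict[str, Callable[..., object]] = {
    "search_rulings": search_rulings,
    "get_codes_under_heading": get_codes_under_heading,
}


def run_agent(description: str, decide: Callable, max_steps: int = 8) -> dict:
    """Ask the policy for an action, run the tool, feed the result back, repeat."""
    history: list[tuple[str, object]] = []
    for _ in range(max_steps):
        action = decide(description, history)
        if action["type"] == "commit":
            return {"code": action["code"], "steps": len(history)}
        result = TOOLS[action["tool"]](**action["args"])
        history.append((action["tool"], result))
    # Step budget exhausted without a commit.
    return {"code": None, "steps": len(history)}


def scripted_policy(description: str, history: list) -> dict:
    """Stand-in for the LLM: search, check the code still exists, then commit."""
    if not history:
        return {"type": "call", "tool": "search_rulings",
                "args": {"query": description}}
    candidate = history[0][1][0]["code"]
    if len(history) == 1:
        return {"type": "call", "tool": "get_codes_under_heading",
                "args": {"heading": candidate[:4]}}
    current_codes = history[1][1]
    return {"type": "commit",
            "code": candidate if candidate in current_codes else None}


result = run_agent("14-inch laptop computer", scripted_policy)
print(result)
```

The point of the structure is that the number and order of tool calls is decided per product rather than fixed in advance, which is what removes the hard top-50 cutoff.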
Confidence calibration
Accuracy alone is the wrong metric. What matters more: does the agent know when it's right? When it says "high confidence," is it actually more accurate than when it says "medium"? That's calibration.
From the v2 benchmark run:
| Confidence | % of calls | 10-digit accuracy | 4-digit accuracy |
|---|---|---|---|
| High | ~55% | ~85% | ~95% |
| Medium | ~35% | ~55% | ~70% |
| Low | ~10% | ~30% | ~50% |
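As a rough consistency check, the bucket figures can be weighted by call share and compared with the headline numbers. The inputs are the approximate (~) values from the table, so the result is only indicative:

```python
# Approximate values copied from the calibration table.
buckets = {
    "high":   {"share": 0.55, "acc10": 0.85, "acc4": 0.95},
    "medium": {"share": 0.35, "acc10": 0.55, "acc4": 0.70},
    "low":    {"share": 0.10, "acc10": 0.30, "acc4": 0.50},
}

# Overall accuracy is the share-weighted average of the per-bucket accuracies.
overall_10 = sum(b["share"] * b["acc10"] for b in buckets.values())
overall_4 = sum(b["share"] * b["acc4"] for b in buckets.values())

print(f"implied 10-digit accuracy: {overall_10:.3f}")
print(f"implied 4-digit accuracy:  {overall_4:.3f}")
```

The weighted averages come out at roughly 69% and 82%, close to the 70% and 80% headline figures given the rounding in the table.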
When the agent says "high," trust it. When it says "medium," verify. When it says "low," it usually returns a clarification question instead of a commit. The confident_to field tells you the deepest level the agent is sure about — sometimes it commits at 6-digit and tells you what would unlock the 10-digit answer.
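On the consuming side, that policy might look like the sketch below. The response shape is an assumption for illustration: only `confident_to` is named in this article, and the `confidence` and `code` keys and the integer digit depth are hypothetical.

```python
def route(result: dict) -> str:
    """Map the agent's stated confidence to a handling decision."""
    confidence = result.get("confidence")
    if confidence == "high":
        return "accept"
    if confidence == "medium":
        return "review"
    # Low or missing confidence: ask for more product detail instead of committing.
    return "clarify"


def trusted_prefix(result: dict) -> str:
    """Return only the digits the agent says it is sure about."""
    digits = "".join(ch for ch in result["code"] if ch.isdigit())
    depth = result.get("confident_to", len(digits))
    return digits[:depth]


# Hypothetical response: committed at 6 digits, unsure about the last four.
example = {"code": "8471.30.0100", "confidence": "medium", "confident_to": 6}
print(route(example), trusted_prefix(example))
```

Keeping the routing decision separate from the prefix truncation lets a caller use the 6-digit result for duty estimates while a human confirms the full code.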
Why not 90%?
Every commercial tool claims high accuracy. Here's why honest numbers are lower:
1. Human experts don't agree either
Customs classifiers agree with each other roughly 85-92% of the time at 6 digits. The WCO estimates 1 in 3 customs entries globally is misclassified. Pass rates on the licensed customs broker exam range from single digits to ~30% per sitting. Classification is genuinely hard.
2. Self-reported benchmarks use different test sets
Tarifflo claims 88.3% on its own 103-item benchmark — authored by its founder, test set not public. Zonos claims ">90%" on its website but scored 43.7% on an independent 103-item test. The gap between self-reported (90%+) and independent (40-70%) reflects different test sets, different difficulty levels, and marketing inflation.
3. Some products are structurally hard
The agent still struggles with:
- Chemicals with IUPAC names — "4[N-(2,4-Diamino-6-Pteridinylmethyl)-N-Methylamino] Benzoic Acid"
- Function-based classifications — "parts suitable for use with machines of heading 84.71"
- Multi-material composites — products requiring GRI 3 essential character analysis
- Stale ruling codes — CBP rulings from 2006 may reference 10-digit codes that have since been restructured. The agent verifies via get_codes_under_heading, but occasionally trusts an obsolete code.
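The stale-code check in the last item can be illustrated with a small sketch. It assumes the current schedule is available as a set of undotted 10-digit strings; the function name and all codes below are made up for the example.

```python
def check_ruling_code(ruling_code: str, current_codes: set[str]) -> dict:
    """Accept a ruling's code only if it still exists in the current schedule."""
    if ruling_code in current_codes:
        return {"status": "current", "code": ruling_code, "candidates": []}
    # Obsolete code: back off to 8, 6, then 4 digits and list what exists there now.
    for digits in (8, 6, 4):
        prefix = ruling_code[:digits]
        siblings = sorted(c for c in current_codes if c.startswith(prefix))
        if siblings:
            return {"status": "obsolete", "code": None, "candidates": siblings}
    return {"status": "obsolete", "code": None, "candidates": []}


# Made-up schedule fragment and ruling code, for illustration only.
current_codes = {"1234560010", "1234560090", "1234900000"}
check = check_ruling_code("1234560050", current_codes)
print(check)
```

An obsolete code yields a list of live candidates under the closest surviving prefix rather than a silent pass-through, which is the failure the bullet above describes.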
What we don't claim
We don't claim "the best HTS classification API." We claim:
- The accuracy numbers above are reproducible — run the same products through the API and check.
- Every result includes the CBP ruling cited, the GRI rule applied, and confidence calibration.
- We test on novel rulings, not training data.
- When a CBP ruling exists, the agent finds it. When it doesn't, the agent reasons from the schedule.
How we built it
The agent runs Gemini Flash with function calling. Each classification call:
- Reads the product description and decides which 2-3 HTS chapters could plausibly apply
- Searches CBP rulings via pgvector semantic similarity over 134,050 indexed rulings
- Reads ruling text when search snippets aren't conclusive
- Compares headings within a candidate chapter, citing legal notes when ruling out alternatives
- Verifies the 10-digit code exists in the current 2026 schedule (rulings often reference obsolete codes)
- Commits with rationale — code, confidence, GRI rule cited, ruling number cited
- Adds Section 301/232 duties from Census, FTA rates, and legal notes in a Postgres post-processing step
Average classification: ~8 seconds end-to-end including Census duty fetch.
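For readers unfamiliar with the ruling search step, here is a pure-Python stand-in for a cosine-distance nearest-neighbour lookup. In pgvector the equivalent ordering uses the `<=>` cosine distance operator; the table layout, identifiers, and toy 3-dimensional vectors below are assumptions for illustration.

```python
import math

# Roughly equivalent SQL, with hypothetical table and column names:
#   SELECT ruling_id FROM rulings ORDER BY embedding <=> %(query)s LIMIT 50;


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 minus cosine similarity, so 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norms


def top_k(query: list[float], rulings: list[dict], k: int) -> list[dict]:
    """Return the k rulings whose embeddings are closest to the query."""
    return sorted(rulings, key=lambda r: cosine_distance(query, r["embedding"]))[:k]


# Toy embeddings; real ones would come from an embedding model.
rulings = [
    {"ruling_id": "A", "embedding": [1.0, 0.0, 0.0]},
    {"ruling_id": "B", "embedding": [0.9, 0.1, 0.0]},
    {"ruling_id": "C", "embedding": [0.0, 1.0, 0.0]},
]

nearest = top_k([1.0, 0.05, 0.0], rulings, k=2)
print([r["ruling_id"] for r in nearest])
```

A full scan like this is only practical for small collections; at the scale of 134,050 rulings a vector index in the database does the same ranking without comparing every row.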
About other benchmarks
ATLAS (arXiv:2509.18400) — 259 real CBP rulings, public, reproducible. We previously hit 59.1% on this with v1 architecture.
Tarifflo benchmark (arXiv:2412.14179) — 103 items, authored by Tarifflo's founder, test set not public, claims 88.3%. Not independently reproducible.
We publish our headline numbers on a novel CBP rulings benchmark because:
- Real CBP rulings, not LLM-generated descriptions
- Issued after model training cutoff (no memorization)
- Reproducible — anyone can pull the same rulings from CROSS
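Selecting a novel test set comes down to a date filter. A minimal sketch, assuming each ruling record carries an issue date; the cutoff date and the ruling identifiers are placeholders, not the actual values used for the benchmark.

```python
from datetime import date


def novel_rulings(rulings: list[dict], cutoff: date) -> list[dict]:
    """Keep only rulings issued strictly after the model's knowledge cutoff."""
    return [r for r in rulings if r["issued"] > cutoff]


# Placeholder records and cutoff, for illustration only.
rulings = [
    {"ruling_id": "OLD-1", "issued": date(2019, 5, 14)},
    {"ruling_id": "NEW-1", "issued": date(2024, 3, 2)},
    {"ruling_id": "NEW-2", "issued": date(2025, 1, 20)},
]
assumed_cutoff = date(2023, 12, 31)

test_set = novel_rulings(rulings, assumed_cutoff)
print([r["ruling_id"] for r in test_set])
```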
Sources
- ATLAS: Benchmarking and Adapting LLMs for Global Trade Classification (arXiv 2509.18400)
- Benchmarking Harmonized Tariff Schedule Classification Models (arXiv 2412.14179, Tarifflo benchmark)
- ATLAS dataset on HuggingFace (259-item test set)
- CBP CROSS — Customs Rulings Online Search System
- WCO General Rules of Interpretation