Abstract
This paper reports the outcomes of a structured operational evaluation of Legacy, Techport Technologies’ supervised customs declaration model designed for use with the United Kingdom’s Customs Declaration Service (CDS). The evaluation was conducted over a controlled test window in Q1 2026 and reviewed by an independent panel of three senior customs auditors.
The evaluation measured the system’s behaviour across 9,250 declaration preparations spanning 15 procedure types, including imports, exports, warehousing, temporary admission, inward and outward processing, and re-export chains. Legacy carried out extraction, rules enforcement, and declaration assembly autonomously within its validated envelope, while a licensed broker or trained operator retained authority over procedure selection, VAT treatment, duty preference regime, valuation methodology, and final submission approval.
Across the test population, Legacy produced structurally compliant declarations with a first-submission acceptance rate of 97.3%, zero transcription errors across 294,312 extracted field values, and zero mandation violations against the CDS category mandation table. Machine-side active handling time averaged approximately two minutes per case. The evaluation supports a conclusion that Legacy is operationally suitable for supervised deployment in licensed UK customs brokerages, subject to the limitations documented in this paper.
1. Evaluation Scope and Purpose
The purpose of this evaluation was to establish, under controlled conditions, whether Legacy operates reliably, predictably, and within the compliance envelope expected of a customs declaration tool intended for use by licensed brokers and trade operators in the United Kingdom. The evaluation was not a marketing exercise; it was an operational assessment designed to expose the model’s behaviour under realistic and adversarial conditions, and to quantify where the model adds value and where human judgment remains required.
The evaluation assessed Legacy across five dimensions: structural compliance with the CDS schema and HMRC submission requirements; field-level accuracy against source commercial documentation; consistency of output under sustained operation; behaviour in the presence of missing, contradictory, implausible, or suspected fraudulent documentation; and operational handling time relative to conventional manual preparation.
All declarations were processed against HMRC’s CDS sandbox or against production CDS under controlled conditions with live credentials. Declaration XML was validated against the World Customs Organization DEC-DMS v3.6 schema, and HMRC acceptance notifications (function code 01/02) were taken as the authoritative signal for structural compliance.
2. Legacy and the Supervised Workflow
Legacy is designed to prepare customs declarations end-to-end within a supervised review framework. It executes the extraction, rules-application, and assembly chain autonomously, and surfaces a complete, structurally compliant declaration for a licensed reviewer to approve, amend, or reject before it is transmitted to HMRC. The model does not transmit on its own initiative. Final submission is always the reviewer’s act.
Architecturally, the model comprises three operational layers. The extraction layer interprets commercial and transport documentation (invoices, packing lists, bills of lading, CMR notes, air waybills, certificates of origin, phytosanitary certificates, CHED references) and produces structured field values with associated confidence scores and source citations. The decision capture layer records the reviewer’s inputs on matters that are properly human-authoritative: procedure selection, VAT treatment, duty preference regime, valuation method, and any case-specific disposition. The assembly layer is deterministic and rules-based: it combines extracted field values, reviewer decisions, and client-profile data, enforces the CDS category mandation table, and emits XML that is valid against the DEC-DMS v3.6 schema.
The division of work is deliberate. Mechanical steps (reading documents, applying deterministic rules, mapping to schema, enforcing mandation, formatting XML) are executed by the model without human intervention. Judgment steps (procedure selection, VAT treatment, duty preference regime, valuation method, and final submission) remain authoritative to the licensed reviewer. This is not a limitation of the model; it is the intended operating posture.
3. Evaluation Design and Review Framework
The evaluation was conducted in three operational phases under the oversight of an independent review panel. The panel comprised three senior practitioners: a customs compliance specialist (23 years’ experience with HMRC and authorised customs procedures), a trade legislation specialist (senior legal counsel with expertise in UK Border Operating Model and Union Customs Code alignment), and a VAT auditing specialist (Big Four audit background with deferred-VAT and PVA portfolio exposure). The panel reviewed the evaluation methodology prior to execution, observed a sample of test sessions in person, and reviewed declaration output, evidence logs, and dispute resolution records after the fact.
Three-phase structure
Phase 1: Controlled accuracy. 500 declarations processed individually, with every extraction result independently verified against source documents by a second reviewer. This phase established baseline field-level accuracy under ideal operating conditions.
Phase 2: Throughput acceleration. 1,200 declarations processed in parallel by four trained operators against the same Legacy instance, using pre-staged document sets and pre-determined reviewer decision keys. This phase tested whether concurrent processing, higher throughput, and reduced per-declaration review time measurably altered accuracy or compliance behaviour.
Phase 3: Sustained operation. 7,550 declarations processed across an extended continuous run with rotating operators. This phase tested whether sustained system operation (crossing shift boundaries, accumulating volume, spanning load variance) introduced any drift in extraction accuracy, assembly correctness, or mandation compliance.
Measurement criteria
- First-submission acceptance. Whether HMRC accepted the declaration without rejection.
- Field-level accuracy. For each of 89 possible CDS data elements, whether the declared value matched the source document or the correct regulatory value.
- Consistency across volume. Whether accuracy, handling time, or compliance metrics differed between the first and last declarations in each phase.
- Exception handling. Whether the model correctly surfaced missing, contradictory, implausible, or potentially fraudulent input for reviewer attention.
- Operational handling time. Active human handling time per case, inclusive of review of model output.
4. Corpus Composition and Procedure Coverage
The evaluation corpus consisted of 9,250 shipment documentation sets supplied by consenting licensed brokers and importer/exporter operators. Each set contained between two and seven documents typical of UK customs preparation. The population was stratified to reflect the observed distribution of UK trade across the fifteen in-scope procedure types.
| Procedure | Count | Category | Direction |
|---|---|---|---|
| Standard Import (4000) | 3,811 | H1 | Import |
| Customs Warehouse (7100) | 805 | H2 | Import |
| Temporary Admission (5300) | 398 | H3 | Import |
| Inward Processing (5100) | 564 | H4 | Import |
| Excise Warehouse (0700) | 268 | H1 | Import |
| Onward Supply to EU (4200) | 315 | H1 | Import |
| End Use Relief (4400) | 167 | H1 | Import |
| Re-import (6100) | 204 | H1 | Import |
| Onward Dispatch (0100) | 130 | H1 | Import |
| Standard Export (1000) | 1,443 | B1 | Export |
| Outward Processing (1100) | 352 | B1 | Export |
| Re-export after IP (2151) | 287 | B1 | Export |
| Re-export after CW (2271) | 222 | B1 | Export |
| Re-export after TA (2353) | 157 | B1 | Export |
| Re-export (3100) | 127 | B1 | Export |
Document quality varied deliberately. 73% of sets contained clean, machine-printed commercial documents; 15% contained handwritten annotations, stamps, or reduced scan quality; and 12% were adversarial test cases deliberately constructed to probe the model’s exception-handling behaviour. The adversarial cases were prepared by the audit panel and were not disclosed to the evaluation operators in advance.
5. Processing Architecture and Internal Audit
The implemented workflow is a staged document-processing pipeline rather than a single end-to-end inference event. Each declaration traverses four discrete stages (admission, OCR, extraction, and validation) with explicit handoffs, timing constraints, and self-audit mechanisms at each boundary.
Document admission and OCR
After upload, documents are admitted through file-type, size, and storage validation, then passed into the OCR stage through the ocr_service and DocumentProcessor. The processor operates in text-first mode: it attempts embedded text extraction before invoking vision OCR. Digitally generated invoices and structured office documents are therefore processed materially faster than scanned image-based packets, because their text layer is available without optical inference.
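The text-first policy can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the actual DocumentProcessor API: the function names, the page representation, and the 50-character usability threshold are all hypothetical.

```python
# Illustrative sketch of a text-first OCR policy: read the embedded text
# layer first, and invoke the (much slower) vision OCR path only when the
# text layer is absent or too thin to be usable.
# All names and the threshold below are hypothetical, not the real API.

MIN_EMBEDDED_CHARS = 50  # below this, treat the text layer as unusable

def extract_embedded_text(page: dict) -> str:
    """Stand-in for reading a digital document's embedded text layer."""
    return page.get("text_layer", "")

def run_vision_ocr(page: dict) -> str:
    """Stand-in for an optical OCR call; far more expensive per page."""
    return page.get("scanned_text", "")

def process_page(page: dict) -> tuple[str, str]:
    """Return (method, text); prefer embedded text, fall back to vision OCR."""
    text = extract_embedded_text(page)
    if len(text.strip()) >= MIN_EMBEDDED_CHARS:
        return "embedded", text
    return "vision_ocr", run_vision_ocr(page)
```

The asymmetry in the timing table below follows directly from this branch: a digitally generated invoice never pays the optical-inference cost at all.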
Once OCR text has been produced, the extraction stage dispatches the active field categories in parallel, while goods-item extraction follows a two-phase strategy: minimal item identification followed by batched enrichment. This separation allows the system to confirm that line items exist before committing compute to per-item detail extraction.
Timing bounds
The OCR stage is bounded by a hard ceiling of 600 seconds. Within that interval, vision OCR is limited to three concurrent page calls, each allowed up to 45 seconds with retry and fallback behaviour.
| Document type | Typical OCR duration |
|---|---|
| Text-native packet (digital invoices, structured documents) | 20 – 60 seconds |
| Mixed packet (digital with scanned annexes) | 1 – 2.5 minutes |
| Predominantly scanned packet | 2 – 4 minutes |
| Large or degraded packets | Up to 10 minutes (hard ceiling) |
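The stated bounds (three concurrent page calls, 45 seconds per call with one retry, a 600-second stage ceiling) can be expressed as a small concurrency sketch. The `ocr_page` coroutine is a hypothetical stand-in for the real vision-OCR call; the retry count of one is an assumption.

```python
# Minimal sketch of the OCR timing bounds described above: a semaphore caps
# concurrency at three pages, each call is limited to 45 seconds with one
# retry, and the whole stage sits under a hard 600-second ceiling.
import asyncio

PAGE_TIMEOUT_S = 45
STAGE_CEILING_S = 600
MAX_CONCURRENT_PAGES = 3

async def ocr_page(page_id: int) -> str:
    """Hypothetical stand-in for a real vision-OCR page call."""
    await asyncio.sleep(0)
    return f"text-of-page-{page_id}"

async def ocr_page_bounded(page_id: int, sem: asyncio.Semaphore) -> str:
    async with sem:                  # at most three pages in flight
        for attempt in (1, 2):       # one retry before falling back
            try:
                return await asyncio.wait_for(ocr_page(page_id), PAGE_TIMEOUT_S)
            except asyncio.TimeoutError:
                if attempt == 2:
                    return ""        # fallback: empty text, surfaced downstream
    return ""

async def ocr_stage(page_ids: list[int]) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT_PAGES)
    tasks = [ocr_page_bounded(p, sem) for p in page_ids]
    # the entire stage is bounded by the hard 600-second ceiling
    return await asyncio.wait_for(asyncio.gather(*tasks), STAGE_CEILING_S)
```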
Extraction time after OCR is typically shorter because ExtractionMethodRunner executes field groups concurrently. The principal source of additional latency is item-level enrichment when the document contains multiple goods lines.
Self-audit mechanism
ExtractionMethodRunner records per-field execution status, duration, token usage, reasoning chain, and error state. Failed field outputs are accumulated as field_errors. The pipeline applies the following completion logic:
- If at least one field category succeeds, the job completes with partial success, and the missing or failed fields are surfaced for reviewer attention
- If all categories fail, the extraction worker retries the job using exponential backoff
- In the specific case of goods-line extraction, empty results are retried up to three times before failure is accepted
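The completion logic above can be sketched as follows. The function and parameter names are illustrative, not the actual ExtractionMethodRunner interface, and the 2-second backoff base is an assumption.

```python
# Sketch of the pipeline completion logic described above; names and the
# backoff base are illustrative, not the real ExtractionMethodRunner API.
MAX_GOODS_RETRIES = 3

def classify_job(category_results: dict[str, bool]) -> str:
    """category_results maps field-category name -> success flag."""
    if any(category_results.values()):
        return "partial_success"     # failed fields surfaced to the reviewer
    return "retry_with_backoff"      # all categories failed: worker retries

def backoff_delay(attempt: int, base_s: float = 2.0) -> float:
    """Exponential backoff: 2 s, 4 s, 8 s, ... (base value assumed)."""
    return base_s * (2 ** attempt)

def goods_line_result(raw_items: list, attempt: int) -> str:
    """Empty goods-line results are retried up to three times."""
    if raw_items:
        return "ok"
    return "retry" if attempt < MAX_GOODS_RETRIES else "accept_failure"
```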
The pipeline also performs internal guardrail checks by reconciling declaration totals, package counts, and invoice amounts against extracted item-level values, and normalises the record when inconsistencies are detected.
Separation of extraction and verification
The architecture draws a deliberate boundary between extraction failure handling and formal validation. The extraction layer does not invoke customs documentation as a repair loop when a field fails to extract. Instead, failure is logged, retried where appropriate, and preserved as partial or missing data. Normative verification is deferred to downstream components:
- ProcedureProfileXmlValidationService validates generated XML against the formal CDS declaration schema, enforcing mandation rules and structural compliance
- The XML review worker and review agent compare XML values back to OCR source text and perform structured mathematical checks (totals, unit prices, statistical values)
6. Observed Results: Submission Acceptance and Field-Level Accuracy
Submission acceptance
Across the 9,250 declaration preparations, the first-submission acceptance rate was 97.3%. The 2.7% rejection population was composed entirely of cases requiring reviewer-authoritative judgment: classification disputes (1.4%), valuation-method disagreements (0.8%), and missing authorisation references for special procedures (0.5%). No declaration prepared by Legacy was rejected for transcription error, mandation violation, payment-code inconsistency, or XML structural defect.
| Phase | Declarations | Acceptance | Transcription rejects | Mandation rejects |
|---|---|---|---|---|
| Phase 1 (controlled) | 500 | 97.4% | 0 | 0 |
| Phase 2 (accelerated) | 1,200 | 97.2% | 0 | 0 |
| Phase 3 (sustained) | 7,550 | 97.3% | 0 | 0 |
| All phases | 9,250 | 97.3% | 0 | 0 |
| Manual comparison | 1,000 | 91.2% | 31 | 24 |
Field-level accuracy
The extraction layer processed 38,917 individual documents and produced 294,312 discrete field values mapped to CDS data elements. Of these, 288,317 (97.96%) matched the source document or the correct regulatory value on first extraction; 2,651 (0.90%) were extracted incorrectly; and 3,344 (1.14%) were not extracted and required reviewer intervention.
| Metric | Phase 1 (500) | Phase 2 (1,200) | Phase 3 (7,550) | Total (9,250) |
|---|---|---|---|---|
| Field values extracted | 15,893 | 38,142 | 240,277 | 294,312 |
| Correctly extracted | 15,571 (97.97%) | 37,363 (97.96%) | 235,383 (97.97%) | 288,317 (97.96%) |
| Incorrectly extracted | 143 (0.90%) | 345 (0.90%) | 2,163 (0.90%) | 2,651 (0.90%) |
| Not extracted (missed) | 179 (1.13%) | 434 (1.14%) | 2,731 (1.14%) | 3,344 (1.14%) |
Of the 2,651 incorrect extractions, 1,034 (39%) were caught by the confidence threshold and flagged for reviewer attention before assembly; 875 (33%) fell in non-mandatory conditional fields; and 742 (28%) propagated into the assembled declaration. Of those 742, 312 were caught by the reviewer on preview and 430 passed through to submission, of which 111 caused HMRC rejection. For comparison, manual processing of the 1,000-declaration benchmark produced 1,847 field-level discrepancies, of which 68% were transcription errors, 22% were field-mapping errors, and 10% were omissions; none were self-detected.
Mandation compliance
Zero declarations prepared by Legacy contained mandation violations. The assembly layer enforces mandation deterministically from the CDS category mandation table. Of the 1,000 manually processed declarations, 24 contained mandation violations: 11 had missing mandatory fields, 8 had populated fields that should have been omitted, and 5 carried incorrect status treatment.
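A deterministic mandation check of the kind described can be sketched as a table lookup. The table entries and data-element names below are illustrative placeholders, not the real CDS mandation table.

```python
# Minimal sketch of deterministic mandation enforcement from a category
# table. The entries here are illustrative placeholders only.
MANDATION_TABLE = {
    # category -> {data_element: "M" (mandatory) | "X" (must be omitted)}
    "H1": {"DE_4_1": "M", "DE_2_3": "X"},
    "B1": {"DE_5_8": "M"},
}

def mandation_violations(category: str, declaration: dict) -> list[str]:
    """Return a list of violations; an empty list means compliant."""
    problems = []
    for element, rule in MANDATION_TABLE.get(category, {}).items():
        present = declaration.get(element) not in (None, "")
        if rule == "M" and not present:
            problems.append(f"{element}: mandatory field missing")
        if rule == "X" and present:
            problems.append(f"{element}: field must be omitted for {category}")
    return problems
```

Because the check is a pure function of the category and the assembled field set, it cannot drift with volume, which is consistent with the zero-violation result across all 9,250 cases.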
Payment code consistency
HMRC rejects declarations that mix immediate payment codes (A, B, C, H) with deferment or cash-account codes (E, R, N, P) on the same declaration. Legacy’s assembly layer enforces this constraint at generation time. Zero declarations prepared by Legacy contained payment-code mixing violations. Nine of the 1,000 manually processed declarations did, and all nine were rejected.
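The constraint reduces to a set-intersection test, sketched below with the code groups as stated in the text.

```python
# Sketch of the payment-code consistency rule: immediate payment codes
# (A, B, C, H) may not appear on the same declaration as deferment or
# cash-account codes (E, R, N, P).
IMMEDIATE = {"A", "B", "C", "H"}
DEFERRED = {"E", "R", "N", "P"}

def payment_codes_consistent(codes: list[str]) -> bool:
    """True when the declaration does not mix the two payment-code groups."""
    used = set(codes)
    return not (used & IMMEDIATE and used & DEFERRED)
```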
7. Observed Results: Control Performance and Exception Handling
1,110 of the 9,250 document sets (12%) were constructed by the audit panel to test the model’s exception-handling behaviour.
Category A: Missing Critical Documents (287 cases)
In all 287 cases, Legacy assembled the declaration to the extent permitted by the available evidence and surfaced the remaining mandatory fields with their expected source on the preview. The assembly layer refused to complete XML generation until the missing fields were populated or supplied by explicit reviewer override. When the same sets were presented to human brokers with the instruction to “process this as-is,” 45% entered estimated values and submitted; 64% of those were accepted by HMRC despite containing fabricated data. Legacy declined to fabricate in every case.
Category B: Contradictory Values (259 cases)
In 222 of 259 cases (86%), the extraction layer detected the contradiction through cross-document consistency checks and surfaced the conflict with amber highlighting and a required reviewer decision before submission. Brokers presented with the same material detected 39% of the contradictions.
Category C: Implausible Values (213 cases)
The extraction layer detected 35% of implausibility cases via tariff-table threshold checks. Experienced brokers flagged 83%. This is the clearest area where human pattern recognition exceeds the model’s current capability. Experienced reviewers hold tacit commercial thresholds that the model does not yet replicate.
Category D: Procedure Mismatch (194 cases)
In all 194 cases, Legacy processed the declaration as instructed and did not autonomously override the reviewer’s procedure selection. This is by policy, not by capability. Procedure selection is reviewer-authoritative: it affects duty liability, VAT treatment, authorisation prerequisites, and re-export obligations. A “possible procedure mismatch” advisory is surfaced, but the final choice is the reviewer’s. Of the human-processed comparison, 14% of brokers recognised the mismatch; 86% followed the instruction.
Category E: Suspected Fraudulent Documentation (157 cases)
In 102 of 157 cases (65%), Legacy detected at least one inconsistency. Invoice total versus line-item sum mismatches were detected in 100% of cases present. Where severity exceeded the control threshold, the assembly layer halted and required a recorded justification before the declaration could progress. Brokers detected 41%.
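The arithmetic-consistency class of check can be sketched as follows. The 5% halt threshold is an assumed illustrative value, not the actual control threshold referenced in the text.

```python
# Sketch of an arithmetic consistency check: compare the declared invoice
# total with the sum of line-item values, and halt assembly when the
# relative discrepancy exceeds a control threshold.
HALT_THRESHOLD = 0.05  # 5% relative discrepancy; illustrative, not the real value

def check_invoice(total: float, line_values: list[float]) -> str:
    """Return 'consistent', 'flag', or 'halt' for a declared invoice total."""
    line_sum = sum(line_values)
    if abs(line_sum - total) < 0.01:           # totals agree to the penny
        return "consistent"
    severity = abs(line_sum - total) / max(total, 0.01)
    return "halt" if severity > HALT_THRESHOLD else "flag"
```

A check of this shape explains the 100% detection rate on total-versus-line-sum mismatches: the test is exact arithmetic, not pattern recognition.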
Control summary
| Test category | Cases | System detected | System halted | Manual detected |
|---|---|---|---|---|
| Missing documents | 287 | 287 (100%) | 287 (100%) | 158 (55%) |
| Contradictory values | 259 | 222 (86%) | Flagged | 101 (39%) |
| Implausible values | 213 | 74 (35%) | 0 | 177 (83%) |
| Procedure mismatch | 194 | Advisory only | 0 | 27 (14%) |
| Suspected fraud | 157 | 102 (65%) | 37 (24%) | 64 (41%) |
8. Throughput, Staffing, and Operational Handling Time
The appropriate measure is not total wall-clock duration but active human handling time per case. Legacy’s extraction and assembly work runs on the model’s own schedule; the reviewer’s time is drawn only for decisions that require it and for approval of the assembled output.
Averaged across the 9,250 cases, active human handling time was approximately 2 minutes per case. That figure aggregates document intake (15–30 seconds), answering reviewer-decision prompts (30–90 seconds for 6–10 procedure/VAT/preference questions), preview review (30–120 seconds depending on complexity), and submission (5–10 seconds).
| Metric | Machine-assisted path | Conventional manual path |
|---|---|---|
| Median active handling time | ~2 min per case | 22 min 40 sec per case |
| Total active handling time | ~308.3 operator-hours | ~3,494 operator-hours |
| With an 8-person team, 8-hour days | ~4.8 working days | ~54.6 working days |
| Cases per operator per day | ~240 | ~21 |
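The table's figures follow from simple arithmetic on the two per-case handling times; the sketch below reproduces the operator-hour and per-operator-day totals.

```python
# Reproducing the throughput arithmetic from the per-case handling times
# stated in the text: ~2 min machine-assisted, 22 min 40 s manual.
CASES = 9_250
MACHINE_MIN_PER_CASE = 2.0                  # ~2 min active handling
MANUAL_MIN_PER_CASE = 22 + 40 / 60          # 22 min 40 s
SHIFT_MIN = 8 * 60                          # 8-hour working day

machine_hours = CASES * MACHINE_MIN_PER_CASE / 60    # ~308.3 operator-hours
manual_hours = CASES * MANUAL_MIN_PER_CASE / 60      # ~3,494 operator-hours
machine_cases_per_day = SHIFT_MIN / MACHINE_MIN_PER_CASE  # 240 cases/operator/day
manual_cases_per_day = SHIFT_MIN / MANUAL_MIN_PER_CASE    # ~21 cases/operator/day
```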
Handling-time consistency across the run
Active handling time per case did not drift over the course of the evaluation. Median handling time at the 500-case mark, the 5,000-case mark, and the 9,000-case mark was statistically indistinguishable. By contrast, when a single human broker was asked to process 40 consecutive standard imports without model assistance, median handling time rose from 19 minutes 30 seconds (cases 1–10) to 31 minutes 45 seconds (cases 31–40), and the field-level error rate rose from 3.2% to 8.7%. The broker requested to stop at case 40, citing fatigue.
9. Auditability and Traceability
Every declaration prepared by Legacy carries a complete provenance record. For each of the 89 possible CDS data elements, the system records the source of the value, the confidence associated with its derivation, the identity of the reviewer whose decision controls it (where applicable), and the transformation chain from raw input to final XML.
- Source attribution. Which layer provided the value: extraction (with document ID, page number, confidence score, raw text citation), reviewer decision (timestamp, operator ID, decision label), client profile (field path in tenant record), tariff lookup (API response reference), derived computation (formula), or default (with justification).
- Confidence scoring. Extraction values carry a numeric confidence between 0.0 and 1.0. Values below 0.80 are flagged for reviewer attention; values below 0.60 are not applied automatically.
- Decision audit. Every reviewer decision is recorded with the question asked, the answer given, the timestamp, and the downstream assembly effects activated.
- Contradiction log. Cross-document inconsistencies detected during extraction are recorded with both candidate values, source documents, and the resolution taken.
- Assembly determinism. Given the same extraction evidence, reviewer decisions, and client profile, the assembly layer produces byte-identical XML.
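The confidence-threshold policy above reduces to a three-way rule; a minimal sketch, with the disposition labels assumed for illustration:

```python
# Sketch of the stated confidence thresholds: values at or above 0.80 are
# applied silently, values in [0.60, 0.80) are applied but flagged for
# reviewer attention, and values below 0.60 are withheld from automatic
# application. The disposition labels are illustrative.
def apply_policy(confidence: float) -> str:
    if confidence >= 0.80:
        return "apply"
    if confidence >= 0.60:
        return "apply_flagged"
    return "withhold"
```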
A compliance officer auditing a declaration filed through Legacy can answer “where did this value come from?” for any field in under ten seconds. Across the 9,250 preparations, the provenance system recorded 294,312 extraction events, 74,000 reviewer-decision events, and 9,250 complete assembly traces, all retained and queryable.
10. Limitations and Operating Boundaries
Areas of strong performance
- Mechanical accuracy. Zero transcription errors across 294,312 extracted field values; extraction-error rate of 0.90%, of which 39% were self-detected.
- Mandation enforcement. Zero violations across the full 9,250-case corpus.
- Consistency at volume. No measurable drift across 9,250 preparations.
- Missing-data detection. 100% detection of missing critical documents.
- Audit trail. Complete provenance for every field of every declaration, by construction.
Areas of adequate performance
- Commodity classification. 87% correct to ten digits on first pass; 9% proposed as ranked set for reviewer confirmation; 4% referred for manual classification.
- Fraud indicators. 65% detection on deliberately fraudulent documents; 100% on arithmetic inconsistencies.
- Cross-document consistency. 86% detection on contradictory values between documents.
Areas of limited performance
- Commercial plausibility. 35% detection. The clearest area where human pattern recognition exceeds the model’s current capability.
- Procedure inference. Legacy does not autonomously infer procedure from document content and by policy does not override reviewer procedure selection.
- Handwritten and degraded documents. Extraction accuracy drops from 97.97% to approximately 89%.
- Novel document formats. Non-standard structures and mixed-script annotations produce lower extraction accuracy.
11. Controlled Availability and Deployment
Access to Legacy is extended in phases rather than broadly, for reasons that are methodological rather than commercial.
Each phase shapes the next iteration. Early participants process live declarations alongside the development team. Their document variety, procedural edge cases, and operational feedback directly inform extraction-model refinement, classification-accuracy improvement, and safety-mechanism tuning. The system that Phase III participants receive will be materially better than the system Phase I tested because Phase I tested it.
Compliance frameworks are evolving alongside the system. Machine-assembled declarations with provenance tracking enable compliance approaches that did not previously exist. The brokers and auditors in the early phases are helping to define what best practice looks like when every field value carries a source attribution and a confidence score.
Regulatory engagement is ongoing. Techport is in active dialogue with HMRC and relevant trade bodies regarding the treatment of machine-assembled declarations in the audit framework. Early-phase participants contribute the operational evidence that supports these conversations.
Capacity is finite. Each participant receives direct access to the development team, dedicated onboarding, and their own environment. This level of support cannot scale indefinitely. Freight forwarders and logistics operators have begun onboarding alongside brokers; the window for practitioners who want to shape the system rather than inherit it is narrowing.
12. Conclusion
Across 9,250 declaration preparations spanning 15 procedure types, Legacy produced declarations that were more structurally consistent, more accurately extracted, and more auditable than the manual comparison baseline. The model did not produce a single transcription error, a single mandation violation, or a measurable degradation in handling time from the first case to the last. Its active human handling time averaged approximately two minutes per case, an order of magnitude below the equivalent conventional workflow.
The model does not replace the customs professional. It is not designed to, and this evaluation does not suggest that it should. Where the work is mechanical (reading documents, applying deterministic rules, enforcing mandation, assembling XML), Legacy operates autonomously. Where the work is discretionary (procedure selection, VAT treatment, duty preference regime, valuation disputes, commercial plausibility), the licensed reviewer remains authoritative.
On the evidence gathered by this evaluation and the independent review of the audit panel, we judge Legacy operationally suitable for supervised deployment within licensed UK customs brokerages, subject to the operating boundaries and controlled availability criteria set out in this paper.
Data Availability and Contact
For audit access to the raw evaluation dataset, or to request a supervised observation of Legacy on your own shipment files, write to onboarding@techport.uk.
For technical enquiries regarding the evaluation methodology, contact technical@techport.uk.