Classifying 7,000 Product Codes from Four Words of Text
Every product that crosses an international border gets an HS code — a Harmonized System code that classifies it for customs. There are roughly 7,000 distinct codes at the 8-digit level. A "lithium-ion battery for mobile phones" gets one code. A "lithium-ion battery for electric vehicles" gets a different one. Getting the code wrong means delays, fines, or seized shipments.
The catch: the product descriptions that accompany these shipments are short, noisy, and often cryptic. The average description in our dataset is four words. Some are a single word. Many are abbreviated, misspelled, or in domain shorthand that no standard NLP model has ever seen.
As part of my M.Tech coursework and independent study, I spent several months experimenting with deep learning approaches to this problem — LSTM, BiLSTM, GRU, CNN, and hybrid architectures. These are my notes on what I tried, what worked, what did not, and what I learned about classifying text when the text barely exists.
The Data
The dataset consists of customs records spanning several years. Each record contains a product description and its corresponding 8-digit HS code. The data was anonymized for this work — exact volumes and date ranges are omitted.
The descriptions are short, noisy, and abbreviated — full of domain shorthand and cryptic codes that the model has to learn to interpret.
Three properties make this dataset challenging:
Extreme class imbalance. 7,000 classes, but the distribution is heavily skewed. A few hundred codes account for the majority of records. Thousands of codes appear fewer than 100 times across the entire dataset. Long-tail classification is the core challenge.
Descriptions are absurdly short. The average description is 3-4 words after cleaning. About 9% of descriptions are a single word. Standard NLP models are designed for sentences and paragraphs. Here, we are trying to classify fragments — often just a product name or an abbreviation.
Multi-label ambiguity. Roughly 10% of descriptions map to multiple HS codes. A product description like "valve" could be classified under several different chapters depending on material, function, or application. The description alone is insufficient to disambiguate.
Preprocessing
Preprocessing turned out to be one of the most impactful stages. The raw descriptions contain unit abbreviations (ctn, kgs, ltr, pcs), measurement values, special characters, and domain-specific shorthand that add noise without semantic value.
The pipeline:
- Lowercase and clean — remove special characters, normalize whitespace
- Domain stopwords — standard English stopwords plus a custom list of trade terms (ctn, bags, kgs, ltr, pcs, nos, etc.) that appear across all categories and carry no discriminative signal
- Digit filtering — remove most numeric values (quantities, weights) while preserving meaningful numbers in product codes
- HS code standardization — pad all codes to 8 digits, filter out the catch-all code used for unclassifiable items
- Deduplication — removing identical description-code pairs significantly reduced the working dataset
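The cleaning steps above can be sketched in a few lines of Python. The stopword list here is a small illustrative subset of the real one, and the exact regex and filtering rules are my reconstruction of the pipeline described, not the original code:

```python
import re

# Illustrative subset of the custom domain stopword list described above
DOMAIN_STOPWORDS = {"ctn", "bags", "kgs", "ltr", "pcs", "nos"}

def clean_description(text):
    """Lowercase, strip special characters, drop domain stopwords and bare numbers."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)           # remove special characters
    tokens = text.split()                               # also normalizes whitespace
    tokens = [t for t in tokens if t not in DOMAIN_STOPWORDS]
    tokens = [t for t in tokens if not t.isdigit()]     # drop pure numeric values
    return " ".join(tokens)

def standardize_hs_code(code):
    """Left-pad an HS code to the full 8 digits."""
    return str(code).zfill(8)
```

A naive `isdigit()` filter like this drops all standalone numbers; the real pipeline needs an extra rule to keep meaningful numbers embedded in product codes.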
After cleaning, I trained a Doc2Vec model on the full corpus; the 100-dimensional word vectors it learns alongside the document vectors encode the semantic relationships between trade terms — learning, for example, that "polyester" and "nylon" are closer together than "polyester" and "steel." These word vectors form the input embedding layer for all the deep learning models.
The Experiments
I tested six architectures, progressively adding complexity. Each model was trained with batch size 128, categorical cross-entropy loss, and sequences padded to a fixed length (50 or 100 tokens). Evaluation used top-1, top-2, and top-3 accuracy — because in practice, suggesting the correct code among the top 3 predictions is often sufficient for a human operator to select the right one.
1. Doc2Vec + Logistic Regression (Baseline)
The simplest approach: average the Doc2Vec word vectors for each description to get a single document vector, then feed it to a logistic regression classifier. 70/30 train-test split.
This baseline is deliberately naive. It ignores word order entirely — "cotton blue shirt" and "shirt blue cotton" produce the same vector. But it establishes the floor. If a deep learning model cannot beat averaged word vectors plus logistic regression, something is wrong.
Strengths: Fast to train. No GPU required. Interpretable. Handles the full 7,000-class problem without memory issues.
Weakness: Ignores word order and context. The averaging operation loses information, especially for short sequences where every word matters.
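The baseline fits in a dozen lines of scikit-learn. Random vectors stand in for the averaged 100-dimensional document vectors, and a 5-class toy label set stands in for the 7,000 real codes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins: in the real pipeline, each row of X is the average of the
# Doc2Vec word vectors for one description, and y is its HS code.
X = rng.normal(size=(600, 100))
y = rng.integers(0, 5, size=600)   # 5 classes here; the real problem has ~7,000

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)   # 70/30 split, as described

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
```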
2. LSTM
The first neural approach: a single LSTM layer with 100 hidden units on top of the Doc2Vec embeddings (frozen, not fine-tuned). The LSTM processes the sequence left-to-right and the final hidden state is passed to a softmax output layer.
Observation: The LSTM improved over the baseline, but the improvement was modest. With descriptions averaging 4 tokens, the LSTM's sequential processing has little context to accumulate. The final hidden state after 4 steps is not dramatically richer than the averaged embedding. The LSTM's advantage — capturing long-range dependencies — is irrelevant when the range is 4 tokens.
Training note: With RMSprop optimizer and 10 epochs over 1,000 steps per epoch, training was stable but convergence was slow. The model's capacity (100 hidden units) was deliberately kept small to avoid overfitting on the short inputs.
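In Keras, this model is three layers. A sketch under the stated hyperparameters (100 units, frozen embeddings, RMSprop); vocabulary size and sequence length are illustrative, and loading the actual Doc2Vec matrix is omitted:

```python
from tensorflow.keras import layers, models

VOCAB, SEQ_LEN, EMB_DIM, NUM_CLASSES = 5000, 50, 100, 7000  # illustrative sizes

inp = layers.Input(shape=(SEQ_LEN,))
# In the real setup, initialize with the Doc2Vec word-vector matrix
# (e.g. via embeddings_initializer); frozen, as described.
x = layers.Embedding(VOCAB, EMB_DIM, trainable=False)(inp)
x = layers.LSTM(100)(x)                      # final hidden state only
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = models.Model(inp, out)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy",
              metrics=["accuracy"])
```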
3. Bidirectional LSTM
The hypothesis: reading the description in both directions should help, since a 4-word description has no clear "beginning" and "end" in the way a sentence does. "Battery lithium mobile" and "mobile lithium battery" should produce similar representations.
Architecture: forward LSTM (64 units) and backward LSTM (64 units), outputs concatenated to produce a 128-dimensional representation. Dropout at 0.5. Adam optimizer.
Observation: A meaningful improvement over the unidirectional LSTM. The bidirectional encoding captured more information from the short sequences because it did not privilege left-to-right order. On descriptions of 3-4 words, direction matters less than full coverage — the BiLSTM gave each token context from both sides.
What surprised me: switching from RMSprop to Adam and from frozen to trainable embeddings (128d) made a larger difference than the bidirectional architecture itself. The embeddings fine-tuned during training adapted to the domain vocabulary in ways the pre-trained Doc2Vec could not.
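A sketch of this configuration — 64 units per direction, dropout 0.5, Adam, and trainable 128-d embeddings — with illustrative vocabulary and sequence sizes:

```python
from tensorflow.keras import layers, models

VOCAB, SEQ_LEN, EMB_DIM, NUM_CLASSES = 5000, 50, 128, 7000  # illustrative sizes

inp = layers.Input(shape=(SEQ_LEN,))
x = layers.Embedding(VOCAB, EMB_DIM)(inp)        # trainable, fine-tuned embeddings
x = layers.Bidirectional(layers.LSTM(64))(x)     # 64 forward + 64 backward -> 128-d
x = layers.Dropout(0.5)(x)
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = models.Model(inp, out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```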
4. LSTM with Parallel CNNs
This was the first hybrid architecture: an LSTM for sequential features, combined with parallel CNN branches for n-gram features.
The idea: the LSTM captures sequential dependencies (word order), while the parallel CNNs with kernel sizes 1, 2, and 3 capture unigram, bigram, and trigram patterns respectively. The merge combines both views. Average pooling (instead of max pooling) retains more information from the short sequences, and a large dense layer (2,048 units with ELU activation and batch normalization) provides capacity for the 7,000-class output.
Observation: This was the first architecture that showed a significant jump. The CNN branches were particularly effective because short product descriptions are essentially collections of n-grams. "Cotton tshirt" is a bigram. "Lithium battery mobile" is a trigram. The parallel CNN captured these patterns directly, without requiring the sequential processing that LSTMs impose.
Key insight: the kernel-size-1 CNN (unigram) was surprisingly important. For single-word descriptions — which make up 9% of the data — it is the only CNN branch that produces useful features. Removing it dropped accuracy on single-word descriptions noticeably.
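A functional-API sketch of this hybrid. The kernel sizes (1, 2, 3), average pooling, and the 2,048-unit ELU dense layer with batch normalization come from the description above; the CNN filter count and other sizes are guesses:

```python
from tensorflow.keras import layers, models

VOCAB, SEQ_LEN, EMB_DIM, NUM_CLASSES = 5000, 50, 128, 7000  # illustrative sizes

inp = layers.Input(shape=(SEQ_LEN,))
emb = layers.Embedding(VOCAB, EMB_DIM)(inp)

# Sequential branch: word-order features
seq = layers.LSTM(100)(emb)

# Parallel CNN branches: unigram, bigram, trigram patterns
convs = []
for k in (1, 2, 3):
    c = layers.Conv1D(64, kernel_size=k, activation="relu")(emb)  # 64 filters: a guess
    c = layers.GlobalAveragePooling1D()(c)   # average pooling keeps more signal
    convs.append(c)

merged = layers.Concatenate()([seq] + convs)
x = layers.Dense(2048)(merged)               # capacity for the 7,000-class output
x = layers.BatchNormalization()(x)
x = layers.Activation("elu")(x)
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = models.Model(inp, out)
```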
5. Bidirectional LSTM + CNN Hybrid
A refinement of the previous approach: replace the unidirectional LSTM with a bidirectional LSTM (2x 50 units), and add a separate CNN path with kernel sizes 1 and 3 (50 filters each, combined to 100). The LSTM and CNN outputs are merged, then processed by a GRU (100 units) before batch normalization and the final softmax.
Architecture: embedding (50d) → bidirectional LSTM ‖ Conv1D(k=1,3) → merge → GRU(100) → BatchNorm → softmax. Optimizer: RMSprop.
Observation: The additional GRU on top of the merged features helped refine the representation, but the improvement over architecture #4 was marginal. With 50-dimensional embeddings (half the size of architecture #4's 128d), this model was lighter and faster to train, but slightly less expressive. The trade-off favored the larger embeddings.
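A sketch following the stated layout (50-d embeddings, BiLSTM 2×50, Conv1D with kernel sizes 1 and 3 at 50 filters each, GRU(100), batch norm, RMSprop). `padding="same"` is my assumption, needed so the RNN and CNN paths can be merged timestep-wise before the GRU:

```python
from tensorflow.keras import layers, models

VOCAB, SEQ_LEN, EMB_DIM, NUM_CLASSES = 5000, 50, 50, 7000  # illustrative sizes

inp = layers.Input(shape=(SEQ_LEN,))
emb = layers.Embedding(VOCAB, EMB_DIM)(inp)   # 50-d embeddings

# Two parallel paths over the embedded sequence
rnn = layers.Bidirectional(layers.LSTM(50, return_sequences=True))(emb)  # 100 channels
c1 = layers.Conv1D(50, 1, padding="same", activation="relu")(emb)
c3 = layers.Conv1D(50, 3, padding="same", activation="relu")(emb)
cnn = layers.Concatenate()([c1, c3])          # 100 channels

merged = layers.Concatenate()([rnn, cnn])     # timestep-wise merge
x = layers.GRU(100)(merged)                   # refine the merged features
x = layers.BatchNormalization()(x)
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = models.Model(inp, out)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
```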
6. TextCNNRNN (TensorFlow)
The final architecture was a TensorFlow implementation combining CNN feature extraction (filter sizes 3, 4, 5 with 128 filters each) with a GRU (100 hidden units, dropout 0.5). Max pooling (pool size 4) reduces the CNN outputs before they are processed by the GRU.
This model was the most complex: three parallel CNN branches with larger filter sizes capture broader n-gram patterns, max pooling selects the strongest activations, and the GRU learns sequential patterns over the pooled features. The GRU uses a dropout wrapper (keep probability 0.5) and produces sequence-length-aware outputs — respecting the actual length of each description rather than the padded length.
Observation: The length-aware output collection was this model's most important feature. Because descriptions vary dramatically in length (1 to 100+ tokens), a model that treats all sequences as padded-to-50 wastes capacity on padding tokens. This architecture tracked the real length and used only the relevant GRU outputs for classification. On short descriptions, this was a meaningful improvement.
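The length-aware trick boils down to picking, for each sequence, the GRU output at its true last timestep rather than at the padded end. A minimal TensorFlow reconstruction of that idea (not the original implementation):

```python
import numpy as np
import tensorflow as tf

def last_relevant_output(outputs, lengths):
    """Select each sequence's final *real* timestep from padded RNN outputs.

    outputs: (batch, max_time, units) tensor; lengths: (batch,) true lengths.
    """
    batch = tf.shape(outputs)[0]
    indices = tf.stack([tf.range(batch), lengths - 1], axis=1)
    return tf.gather_nd(outputs, indices)

# Toy check: batch of 2 sequences padded to length 4, 3 units each
outs = tf.constant(np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3))
lengths = tf.constant([2, 4])      # first sequence has only 2 real tokens
last = last_relevant_output(outs, lengths)
```

For the first sequence, the classifier now sees the state after token 2, not after 48 padding steps.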
What I Learned
Across all experiments, five patterns emerged consistently:
CNNs beat LSTMs on short text. This was the most important finding. For descriptions averaging 4 words, the sequential processing of an LSTM adds little value — there is almost no long-range dependency to capture. CNNs, which extract n-gram patterns through convolution, are a more natural fit. A bigram kernel on a 4-word description sees nearly the entire input. An LSTM needs to process all 4 tokens sequentially to achieve the same coverage. The CNN is faster and, for this data, more effective.
Trainable embeddings outperformed frozen embeddings. Pre-trained Doc2Vec embeddings capture general word relationships, but trade vocabulary is specialized. "Flng" (flange), "ss" (stainless steel), "rf" (raised face) — these abbreviations have no useful representation in a general-purpose embedding. Allowing the embedding layer to fine-tune during training let the model learn domain-specific representations. The jump from frozen to trainable embeddings was one of the largest single improvements.
Embedding dimension matters. 128-dimensional embeddings consistently outperformed 50-dimensional ones. With 7,000 output classes, the model needs enough representational capacity in the input space to distinguish fine-grained categories. 50 dimensions was too compressed — it forced the model to collapse distinctions that the 128-dimensional space could preserve.
Top-3 accuracy is the practical metric. Top-1 accuracy on 7,000 classes with 4-word descriptions is inherently limited — there are many descriptions where the correct code cannot be determined unambiguously from the text alone. But top-3 accuracy was substantially higher across all models. In a production system where a human operator selects from suggested codes, presenting 3 candidates is enough to be useful. Optimizing for top-3 rather than top-1 changes the model selection calculus.
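Top-k accuracy is a one-liner over the softmax outputs. A minimal numpy version with a toy 3-sample, 4-class example:

```python
import numpy as np

def top_k_accuracy(probs, labels, k=3):
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    topk = np.argsort(probs, axis=1)[:, -k:]   # indices of the k largest scores
    return np.mean([labels[i] in topk[i] for i in range(len(labels))])

probs = np.array([
    [0.1, 0.6, 0.2, 0.1],    # true label 2: in top-3 but not top-1
    [0.7, 0.1, 0.1, 0.1],    # true label 0: top-1 hit
    [0.4, 0.3, 0.2, 0.1],    # true label 3: missed even at top-3
])
labels = np.array([2, 0, 3])
```

Here top-1 accuracy is 1/3 while top-3 is 2/3 — the gap the article describes, in miniature.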
The preprocessing pipeline added more value than any architecture change. Custom domain stopwords, digit filtering, and HS code standardization — each of these preprocessing steps improved all models roughly equally. The aggregate effect of preprocessing was larger than the difference between the best and worst model architectures. This is a humbling lesson: before spending a week trying a new architecture, spend a day on better data cleaning.
What I Would Do Differently
Looking back, there are several things I would change:
Hierarchical classification. HS codes are hierarchical. The first 2 digits identify the chapter (e.g., 85 = electrical machinery). The first 4 digits identify the heading. The first 6 identify the subheading. Instead of predicting all 8 digits at once from 7,000 flat classes, a hierarchical approach — first predict the chapter, then the heading within that chapter, then the subheading — would reduce the effective number of classes at each level and could leverage the structure of the code system.
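The hierarchy falls directly out of the code's digits. A small helper illustrating the split (8507.60 is the heading/subheading for lithium-ion accumulators under chapter 85):

```python
def hs_levels(code):
    """Split an 8-digit HS code into its hierarchy levels.

    Returns (chapter, heading, subheading, full_code).
    """
    code = str(code).zfill(8)
    return code[:2], code[:4], code[:6], code

chapter, heading, subheading, full = hs_levels("85076000")
```

A hierarchical classifier would train one model per level, each conditioned on the previous level's prediction, so no single softmax ever faces all 7,000 classes.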
Character-level features. Many descriptions contain abbreviations that are informative at the character level — two-letter material codes, unit shorthands, and industry-specific acronyms. These are character patterns, not word patterns. A character-level CNN or character embeddings concatenated with word embeddings would likely help, especially for the abbreviated descriptions that dominate this dataset.
Attention mechanisms. The Transformer architecture was published last year, and attention mechanisms are showing strong results on classification tasks. For short-text classification, self-attention could help the model learn which words in a description are most important for the code prediction — "lithium" matters more than "battery" for distinguishing battery types. I plan to experiment with attention in the next iteration.
Better handling of multi-label cases. The 10% of descriptions that map to multiple HS codes were treated as single-label (using the first code). A multi-label formulation with binary cross-entropy loss would be more principled and could improve recall for ambiguous descriptions.
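The multi-label formulation only changes the targets and the output layer. A sketch of building multi-hot target vectors (the code vocabulary here is a hypothetical four-code example):

```python
import numpy as np

def multi_hot(code_lists, code_to_index):
    """Build multi-hot target vectors for descriptions mapping to several HS codes."""
    y = np.zeros((len(code_lists), len(code_to_index)), dtype=np.float32)
    for i, codes in enumerate(code_lists):
        for c in codes:
            y[i, code_to_index[c]] = 1.0
    return y

# Toy vocabulary of 4 codes; the second description is ambiguous (two valid codes)
idx = {"84818090": 0, "84818020": 1, "85076000": 2, "85078000": 3}
y = multi_hot([["85076000"], ["84818090", "84818020"]], idx)
# With these targets, the output layer uses sigmoid activations with binary
# cross-entropy instead of softmax with categorical cross-entropy.
```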
Related
- Building NLP Pipelines Before Transformers Were Easy
- Attention Is All You Need — A Practitioner's Guide to the Transformer
- Switching from TensorFlow to PyTorch — A Practical Assessment
HS code classification is a deceptively hard problem. The descriptions are short, noisy, domain-specific, and ambiguous. The class space is enormous. The practical requirements — suggest the right code within the top 3 — are achievable with deep learning, but the architecture matters less than the data preparation and the embedding quality.
The lesson I keep returning to: in NLP, the data and the features dominate the model. A well-preprocessed dataset with trainable embeddings and a simple CNN will outperform a complex LSTM on poorly prepared data, every time.