Building NLP Pipelines Before Transformers Were Easy
For the past year at Halialabs, I have been building named entity recognition systems for production. Not the kind you see in tutorials — not spaCy out of the box on English news text. The kind where your entities are domain-specific, your languages include Malay and Bahasa Indonesia, your training data is noisy, and the model needs to run on modest hardware with sub-second latency.
The Transformer paper changed everything in theory. In practice, in mid-2019, most production NLP still runs on architectures from 2016-2017. BERT exists, but fine-tuning it for NER requires GPU infrastructure that many startups do not have, and deploying it at inference time is expensive. The workhorse of production NER today is the BiLSTM-CRF.
This is a practical account of what building these pipelines actually looks like.
Why NER Is Harder Than It Looks
Named entity recognition appears simple: given a sentence, identify the spans that refer to people, organizations, locations, dates, and so on. The standard framing is sequence labeling — assign a BIO tag (Begin, Inside, Outside) to each token.
The difficulty is in the details. Consider these sentences:
- "Apple released a new product" — Apple is an organization.
- "He ate an apple" — apple is not an entity.
- "Jordan played for the Bulls" — Jordan is a person here, but it is also a country name.
- "Bank Negara Malaysia raised rates" — three tokens, one entity.
Context determines entity type. Boundary detection determines entity span. Both are hard. And in domain-specific text — legal documents, medical records, financial reports — the entities are not just persons and organizations. They are contract clauses, drug names, instrument identifiers, regulatory references. The model has never seen these in pre-training.
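To make the BIO framing concrete, here is a minimal sketch of how a tag sequence decodes into entity spans, using the "Bank Negara Malaysia raised rates" example above. The helper name is mine, not from any library.

```python
# Minimal sketch: decoding a BIO tag sequence into (type, text) entity spans.

def spans_from_bio(tokens, tags):
    """Collect (entity_type, text) spans from parallel token/tag lists."""
    spans, current, ctype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == ctype:
            current.append(token)
        else:  # "O" or an inconsistent I- tag closes any open span
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        spans.append((ctype, " ".join(current)))
    return spans

tokens = ["Bank", "Negara", "Malaysia", "raised", "rates"]
tags   = ["B-ORG", "I-ORG", "I-ORG", "O", "O"]
print(spans_from_bio(tokens, tags))  # [('ORG', 'Bank Negara Malaysia')]
```

Note that three tokens collapse into one entity: getting that boundary right is exactly what the model is scored on.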
The BiLSTM-CRF Architecture
The architecture that works best for production NER right now is the BiLSTM-CRF, introduced by Huang et al. (2015) and refined by Lample et al. (2016). It combines three ideas:
Word embeddings provide the initial representation. We use pre-trained GloVe or FastText embeddings, concatenated with character-level embeddings from a small character CNN or LSTM. The character embeddings capture morphological features — suffixes like "-tion" or "-ing", capitalization patterns, digit patterns. This is critical for unseen words and for languages with rich morphology.
Bidirectional LSTM provides contextual representations. A forward LSTM reads left-to-right, a backward LSTM reads right-to-left, and their hidden states are concatenated at each position. This gives each token a representation that incorporates context from both directions. Unlike a Transformer, the BiLSTM processes the sequence sequentially, but for the sequence lengths we work with (typically under 200 tokens), this is fast enough.
Conditional Random Field (CRF) provides structured prediction. This is the key insight that separates good NER from mediocre NER. A naive approach would classify each token independently — pass the BiLSTM output through a softmax and pick the highest-scoring tag. But this ignores dependencies between tags. An I-PER tag can only follow B-PER or I-PER; a sequence like O followed by I-PER is invalid. The CRF layer learns transition scores between tags and decodes the whole sequence jointly, ensuring the output is a globally coherent tag sequence, not a series of independent predictions.
BiLSTM-CRF: embeddings feed a bidirectional LSTM, whose outputs are decoded by a CRF that enforces valid tag sequences.
The CRF is what makes this architecture production-ready. Without it, we consistently saw 2-4 F1 points lower on entity-level evaluation, mostly from boundary errors — the model would start an entity but fail to end it correctly, or tag adjacent tokens as B-PER followed by I-LOC.
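The decoding side of this can be sketched in a few lines. The following toy Viterbi decode applies a hard transition constraint (I-PER only after B-PER or I-PER, and never at sentence start) on top of per-token emission scores; the scores are made up for illustration, and a real CRF learns soft transition scores during training rather than using hand-written rules.

```python
# Toy Viterbi decode with hard transition constraints, illustrating how a
# CRF layer rules out invalid tag sequences. Emission scores are invented;
# a trained CRF learns transition scores from data instead.
import math

TAGS = ["O", "B-PER", "I-PER"]
NEG_INF = -math.inf

def allowed(prev, cur):
    # I-PER may only follow B-PER or I-PER.
    if cur == "I-PER":
        return prev in ("B-PER", "I-PER")
    return True

def viterbi(emissions):
    """emissions: one {tag: score} dict per token."""
    # I-PER cannot start a sentence.
    best = {t: (emissions[0][t] if t != "I-PER" else NEG_INF) for t in TAGS}
    back = []
    for em in emissions[1:]:
        new_best, ptr = {}, {}
        for cur in TAGS:
            cands = [((best[p] + em[cur]) if allowed(p, cur) else NEG_INF, p)
                     for p in TAGS]
            score, prev = max(cands)
            new_best[cur], ptr[cur] = score, prev
        best = new_best
        back.append(ptr)
    tag = max(best, key=best.get)
    path = [tag]
    for ptr in reversed(back):
        tag = ptr[tag]
        path.append(tag)
    return list(reversed(path))

# Independent argmax would tag the first token I-PER (score 0.9), which is
# invalid; the constrained decode returns a coherent sequence instead.
emissions = [{"O": 0.1, "B-PER": 0.8, "I-PER": 0.9},
             {"O": 0.2, "B-PER": 0.1, "I-PER": 0.7}]
print(viterbi(emissions))  # ['B-PER', 'I-PER']
```

The per-token argmax here would be I-PER, I-PER — exactly the kind of incoherent output the CRF exists to prevent.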
The Feature Engineering That Still Matters
One thing I have learned building these systems: the model architecture is maybe 40% of the performance. The rest is data and features. Here is what actually moves the needle in production NER:
Character embeddings are essential, not optional. For languages like Malay, where many entity names are transliterated or borrowed from English, Arabic, or Tamil, a word-level vocabulary will never cover all entity mentions. Character-level CNNs (typically 3-4 filter widths, 25-50 filters each) capture subword patterns that word embeddings miss. They handle unseen words, capitalization, and morphological patterns. In our experiments, removing character embeddings dropped F1 by 5-8 points on domain-specific entities.
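As a rough intuition for what the character CNN picks up, here is a hand-crafted feature extractor covering the same kinds of signals: suffixes, prefixes, capitalization, and digit patterns. This is illustrative only; the real model learns these patterns from data rather than from hand-written rules, and the feature names are mine.

```python
# Illustrative only: hand-crafted character features approximating the kinds
# of signals a character CNN learns implicitly. The real model learns these
# patterns from data.
def char_features(word):
    return {
        "is_title": word.istitle(),                  # "Malaysia" -> True
        "is_upper": word.isupper(),                  # "BNM" -> True
        "has_digit": any(c.isdigit() for c in word), # "S15A" -> True
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),                # catches "-ion", "-ing"
    }

print(char_features("Regulation"))
```

Features like these would be concatenated with the word embedding; the char CNN simply learns a richer, denser version of the same idea.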
Gazetteers are underrated. A gazetteer is a lookup table — a list of known entities. "Is this token in a list of known company names?" "Is this bigram in a list of known regulatory bodies?" In academic papers, gazetteers are often dismissed as engineering rather than research. In production, they are some of the highest-ROI features you can add. We maintain gazetteers for each entity type and encode gazetteer membership as binary features concatenated with the embedding. This is especially valuable for entities that are rare in the training data but well-known in the domain.
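The mechanics are simple. Here is a sketch of gazetteer membership as a per-token binary feature, using a toy gazetteer of organization names; in the real pipeline these bits are concatenated with the token's embedding before the BiLSTM.

```python
# Sketch: binary gazetteer-membership flags per token. ORG_GAZETTEER is a
# toy lookup table; production gazetteers hold thousands of entries per type.
ORG_GAZETTEER = {("bank", "negara", "malaysia"), ("monetary", "authority")}

def gazetteer_flags(tokens, gazetteer, max_len=3):
    """Mark every token covered by a gazetteer n-gram (case-insensitive)."""
    lower = [t.lower() for t in tokens]
    flags = [0] * len(tokens)
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            if tuple(lower[i:i + n]) in gazetteer:
                for j in range(i, i + n):
                    flags[j] = 1
    return flags

tokens = ["Bank", "Negara", "Malaysia", "raised", "rates"]
print(gazetteer_flags(tokens, ORG_GAZETTEER))  # [1, 1, 1, 0, 0]
```

One such flag vector per entity type gives the model a strong, cheap prior for entities it has rarely or never seen in training.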
Pre-trained embeddings matter more than model size. We experimented with GloVe, FastText, and domain-specific embeddings trained on our own corpora. FastText consistently outperformed GloVe on multilingual text because it handles subword information — the embedding for an unseen word is composed from its character n-grams. For English financial text, GloVe was sufficient. For Southeast Asian languages, FastText was the clear winner.
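The subword mechanism is worth seeing concretely. FastText wraps each word in boundary markers and decomposes it into character n-grams (3 to 6 characters by default); the vector for an unseen word is composed from its n-gram vectors. This sketch shows only the n-gram extraction, with the vectors omitted; FastText also keeps a vector for the whole word itself when it is in vocabulary.

```python
# Sketch of FastText's subword decomposition: a word is wrapped in boundary
# markers and split into character n-grams. The embedding of an unseen word
# is composed from the vectors of these n-grams (vectors omitted here).
def fasttext_ngrams(word, minn=3, maxn=6):
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(fasttext_ngrams("bank", minn=3, maxn=4))
# ['<ba', 'ban', 'ank', 'nk>', '<ban', 'bank', 'ank>']
```

This is why FastText degrades gracefully on transliterated or borrowed names: even a word it has never seen shares most of its n-grams with words it has.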
The Production Pipeline
A production NER system is not just a model. It is a pipeline. Here is what ours looks like:
Production NER is a pipeline, not a model. Preprocessing and post-processing often matter more than model architecture.
Preprocessing is where most bugs live. Tokenization inconsistencies between training and inference are a common source of silent failures. If your training data was tokenized with spaCy but your production pipeline uses regex tokenization, your model will see different inputs at inference time. We standardized on a simple regex tokenizer and never looked back. It is less sophisticated than spaCy's, but it is deterministic and fast.
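A tokenizer in that spirit fits in three lines. The exact pattern below is illustrative, not our production one: words and numbers as one class, each punctuation mark as its own token.

```python
# A minimal deterministic regex tokenizer: runs of word characters, plus
# individual punctuation marks. Illustrative, not the production pattern.
import re

TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Bank Negara Malaysia (BNM) raised rates."))
# ['Bank', 'Negara', 'Malaysia', '(', 'BNM', ')', 'raised', 'rates', '.']
```

The crucial property is that this exact function runs in both the training-data pipeline and the inference service, so the model never sees a token boundary at inference time that it could not have seen in training.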
Post-processing is where most precision is gained. Raw model outputs often contain errors that simple rules can fix. If the model tags "Bank" as B-ORG but misses "Negara" as I-ORG, a post-processing rule that checks for known multi-word entities in a gazetteer can recover the full span. We found that rule-based post-processing added 3-5 F1 points on top of the model's raw output. This is not cheating — it is engineering.
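The "Bank Negara" rule mentioned above can be sketched directly. KNOWN_ORGS is a toy gazetteer; the real rule set is larger, but the shape is the same: when a B-ORG token starts a known multi-word entity, extend the tagged span to cover it.

```python
# Sketch of a post-processing rule: when a B-ORG token starts a known
# multi-word entity, extend the span to the full entity. KNOWN_ORGS is a
# toy gazetteer standing in for the real one.
KNOWN_ORGS = [("Bank", "Negara", "Malaysia")]

def extend_org_spans(tokens, tags):
    tags = list(tags)
    for i, tag in enumerate(tags):
        if tag != "B-ORG":
            continue
        for entity in KNOWN_ORGS:
            if tuple(tokens[i:i + len(entity)]) == entity:
                for j in range(i + 1, i + len(entity)):
                    tags[j] = "I-ORG"
    return tags

tokens = ["Bank", "Negara", "Malaysia", "raised", "rates"]
raw    = ["B-ORG", "O", "O", "O", "O"]  # model found the start, missed the rest
print(extend_org_spans(tokens, raw))
# ['B-ORG', 'I-ORG', 'I-ORG', 'O', 'O']
```

Rules like this only fire when the model has already committed to an entity start, which is why they tend to add precision without hurting recall.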
Entity linking is a separate problem. NER tells you that "BNM" is an organization. Entity linking tells you it refers to Bank Negara Malaysia. For production systems that need to populate a knowledge base or database, entity linking is essential. We use a combination of exact match, fuzzy match, and embedding-based similarity against a curated entity database.
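The cascade looks roughly like this. ENTITY_DB is a toy stand-in for the curated entity database, mapping surface forms and known aliases to canonical names; the fuzzy step here uses stdlib difflib with an assumed 0.8 cutoff, and the embedding-similarity fallback is omitted.

```python
# Sketch of the linking cascade: exact/alias match, then fuzzy match via
# difflib. ENTITY_DB is a toy stand-in for the curated entity database;
# the 0.8 cutoff is an assumption, and the embedding step is omitted.
import difflib

ENTITY_DB = {
    "bank negara malaysia": "Bank Negara Malaysia",
    "bnm": "Bank Negara Malaysia",
}

def link_entity(mention):
    key = mention.lower()
    if key in ENTITY_DB:                      # 1. exact / alias match
        return ENTITY_DB[key]
    close = difflib.get_close_matches(key, ENTITY_DB, n=1, cutoff=0.8)
    if close:                                 # 2. fuzzy match (handles typos)
        return ENTITY_DB[close[0]]
    return None                               # 3. embedding similarity (omitted)

print(link_entity("BNM"))                     # Bank Negara Malaysia
print(link_entity("Bank Negara Malaysa"))     # fuzzy match catches the typo
```

The ordering matters: exact and alias lookups are cheap and unambiguous, so the fuzzy and embedding steps only run on the residue they cannot resolve.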
What BERT Changes — And What It Does Not
BERT was published in October 2018, and it immediately became the most talked-about paper in NLP. The idea is straightforward: pre-train a Transformer encoder on a large corpus using masked language modeling and next-sentence prediction, then fine-tune on downstream tasks. BERT achieves state-of-the-art results on 11 NLP benchmarks, including NER.
For NER specifically, BERT replaces the embedding layer and the BiLSTM. Instead of GloVe + character CNN + BiLSTM, you use BERT's contextualized embeddings directly as input to the CRF. The representations are better because they are pre-trained on massive data, and they are contextual — the embedding for "bank" in "river bank" is different from "bank" in "central bank".
But BERT does not solve everything:
- Inference cost. BERT-base has 110 million parameters. Our BiLSTM-CRF has about 3 million. On CPU, BERT inference is 10-50x slower depending on sequence length. For batch processing, this is manageable. For real-time systems with latency requirements under 100ms, it is a problem.
- Domain adaptation. BERT is pre-trained on English Wikipedia and BooksCorpus. If your domain is Malay legal documents, the pre-trained representations may not help much. Domain-specific pre-training (training BERT on your own corpus) helps, but it requires significant compute.
- Data efficiency is not magic. BERT helps most when you have moderate amounts of labeled data (1,000-10,000 examples). With very little data (under 100 examples), it still struggles. With a lot of data (over 50,000 examples), the gap between BERT and a well-tuned BiLSTM-CRF narrows.
- The CRF still matters. Even with BERT embeddings, adding a CRF layer on top improves results. The Transformer captures context, but the CRF captures tag-sequence constraints. They solve different problems.
My current assessment: BERT will become the default for NER within a year or two, once inference optimization catches up (distillation, quantization, pruning). But the BiLSTM-CRF is not going away for edge cases — low-resource languages, real-time systems, and situations where you need a model that fits in memory on a small device.
Lessons from Production
After building NER systems for multiple domains and languages, here are the patterns I keep coming back to:
Start with rules, add ML incrementally. For a new domain, the fastest path to a working system is a rule-based approach: gazetteers, regex patterns, and deterministic rules. This gives you a baseline that works immediately and generates training data for the ML model. We often run the rule-based system in production for weeks before deploying the ML model, using the rule-based output (with human corrections) as training data.
Evaluation is harder than modeling. Entity-level F1 is the standard metric, but it hides important details. A system with 85% F1 might be excellent for one use case and unusable for another, depending on whether the errors are on common entities or rare ones, and whether precision or recall matters more. We track entity-level F1 broken down by type, frequency, and length, and we maintain a curated test set that represents the distribution of entities our users actually care about.
Active learning is worth the investment. Labeling data is expensive. Active learning — selecting the most informative examples for human annotation — reduces the amount of labeled data you need by 2-5x. We use uncertainty sampling: the model labels a batch of unlabeled data, we send the most uncertain predictions to human annotators, and we retrain. This loop compounds over time.
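The selection step is the simplest part of the loop. Here is a sketch of uncertainty sampling that ranks unlabeled examples by the entropy of the model's predicted distribution; the probabilities are invented for illustration, and in practice they come from the model itself.

```python
# Sketch of uncertainty sampling: rank unlabeled examples by the entropy of
# the predicted distribution and send the top-k to annotators. Probabilities
# here are invented; in practice they come from the model.
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(examples, k=1):
    """examples: list of (text, predicted probability distribution)."""
    ranked = sorted(examples, key=lambda ex: entropy(ex[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

pool = [
    ("BNM raised rates", [0.98, 0.01, 0.01]),   # confident: skip
    ("Jordan signed today", [0.40, 0.35, 0.25]),  # uncertain: annotate first
]
print(select_for_annotation(pool, k=1))  # ['Jordan signed today']
```

For sequence labeling we aggregate per-token uncertainty over the sentence, but the principle is identical: annotator time goes where the model is least sure.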
Multi-language is a different problem. Building NER for English is well-studied. Building NER for Malay, Bahasa Indonesia, or Tamil is a different challenge. Pre-trained embeddings are lower quality. Labeled data is scarce. Tokenization rules are different. We found that multilingual FastText embeddings, combined with transliteration as a preprocessing step, work reasonably well. But the accuracy gap between English and low-resource languages is real and significant.
What Comes Next
The direction is clear: pre-trained Transformers will replace task-specific architectures for most NLP applications. BERT for English, and multilingual BERT or XLM for other languages. The question is not whether, but when — and the answer depends on infrastructure, not algorithms.
What I am watching:
- Model distillation. Can we compress BERT into a model small enough for real-time inference? Early distillation efforts suggest yes, with modest quality loss.
- Domain-specific pre-training. SciBERT and BioBERT show that pre-training on domain text improves downstream performance. I expect every major domain to have its own pre-trained model within two years.
- Few-shot NER. The holy grail is a model that can learn new entity types from a handful of examples. Pre-trained Transformers make this more plausible, but we are not there yet.
- End-to-end systems. The pipeline approach (tokenize, embed, predict, post-process, link) works but is fragile. End-to-end models that go from raw text to structured entities in a single pass would be more robust.
Related
- Attention Is All You Need — A Practitioner's Guide to the Transformer
- From NER Pipelines to LLM Agents: How Production NLP Changed in Seven Years
- Classifying 7,000 Product Codes from Four Words of Text
The BiLSTM-CRF is not glamorous. It is not a Transformer. It does not have a catchy name or a leaderboard-topping paper. But it works, it is fast, it is understandable, and it ships. In production NLP, that still counts for a lot.
For anyone starting an NLP project today: learn the Transformer, but master the pipeline. The model is the smallest part of the system.