From NER Pipelines to LLM Agents: How Production NLP Changed in Seven Years

March 2025

In 2018, I was at NUS studying the Transformer paper and building BiLSTM-CRF NER systems at a startup in Singapore. The state of the art for named entity recognition was a 3-million-parameter model with hand-tuned character embeddings and gazetteer features. Deploying it meant exporting a PyTorch model, writing a Flask API, and SSH-ing into a server.

Seven years later, I am publishing research on multi-agent LLM protocols — systems where multiple language models delegate tasks to each other, negotiate payload formats, and maintain governed sessions. The NER pipeline I built in 2018 would now be a single function call to an LLM.

This is not a history lesson. It is an attempt to connect the dots — to identify what actually changed in production NLP, what stayed the same, and what the arc tells us about where we are going.

Era 1: Feature Engineering (pre-2018)

Before I entered the field, production NLP was dominated by hand-crafted features. Conditional Random Fields for sequence labeling. Bag-of-words and TF-IDF for classification. Regex patterns and rule-based systems for extraction. The skill was in feature engineering — knowing which features to compute, how to combine them, and how to handle the edge cases.

The models were small, fast, and interpretable. They were also brittle. Every new domain required new features. Every new language required new tokenizers, new gazetteers, new rules. The cost was human expertise, and it did not scale.

Era 2: Pre-training (2018-2019)

ELMo and BERT, the latter built on the 2017 Transformer architecture, changed the economics of NLP almost overnight. Instead of engineering features for each task, you could pre-train a language model on a large corpus and fine-tune it on your specific task with a few thousand examples. Transfer learning had arrived for NLP, the way ImageNet pre-training had arrived for computer vision years earlier.

At Halialabs, I lived through this transition. We were building BiLSTM-CRF NER systems with GloVe embeddings and character CNNs. BERT arrived in October 2018 and immediately outperformed everything we had built on English benchmarks. But we could not use it — inference was too slow for our latency requirements, and BERT was not pre-trained on Malay or Bahasa Indonesia.

The lesson: breakthrough papers change what is possible. Deployment constraints determine what actually changes. BERT was state of the art in 2018. We were still deploying BiLSTM-CRFs in production in 2019 because they were faster, smaller, and worked for our languages.

Era 3: Scale (2020-2022)

GPT-3 in 2020 demonstrated that scaling language models, making them much bigger and training them on much more data, produced qualitatively new capabilities. Few-shot learning. Arithmetic from prompts. Code generation. The scaling laws (Kaplan et al.) showed that the underlying improvement was predictable: test loss falls as a power law in model size, dataset size, and training compute.
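To make the power-law claim concrete, here is a minimal sketch of the parameter-count form of the relationship, L(N) = (N_c / N)^alpha. The function names are my own, and the default constants are only of roughly the magnitude reported by Kaplan et al.; treat them as illustrative placeholders, not fitted values for any particular model family.

```python
def power_law_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Loss predicted by L(N) = (n_c / N) ** alpha.

    n_c and alpha are illustrative constants (assumed, not fitted here).
    """
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return (n_c / n_params) ** alpha


def loss_ratio_per_doubling(alpha: float = 0.076) -> float:
    """Factor by which the predicted loss shrinks when N doubles."""
    return 2.0 ** (-alpha)


if __name__ == "__main__":
    # From a 2018-sized NER model up to a large LLM.
    for n in (3e6, 1e8, 1e9, 1e11):
        print(f"{n:.0e} params -> predicted loss {power_law_loss(n):.3f}")
    print(f"each doubling multiplies loss by {loss_ratio_per_doubling():.4f}")
```

The point of the sketch is the shape, not the numbers: with a small exponent, every doubling of model size buys the same modest multiplicative reduction in loss, which is why the gains were predictable and also why they were expensive.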

For practitioners, the implication was profound. The old paradigm — collect labeled data, train a task-specific model, deploy it — started to look expensive and slow compared to prompting a large model. But the large models were expensive to run, non-deterministic, and available only through APIs that regulated industries could not use.

I had moved to a global bank by this point, building data platforms and AI infrastructure. The scaling era was visible in research papers but invisible in enterprise production. The gap between what was possible and what was deployable widened with every new model release.

Era 4: Application (2023-2024)

ChatGPT in November 2022 was the tipping point. Not because it was a technical leap over GPT-3 (it was a GPT-3.5-series model fine-tuned for dialogue with RLHF), but because it made LLM capabilities legible to non-technical stakeholders. Executives who had never heard of language models were suddenly asking about them. Budget appeared. Urgency appeared. Governance did not.

2023 and 2024 were the years of the enterprise pilot. RAG (Retrieval-Augmented Generation) became the dominant pattern for enterprise LLM applications: ground the model in your documents, reduce hallucinations, stay within your data boundaries. Every organization I spoke to was building a "chat with your docs" system. Most were discovering the same things: chunking strategy matters more than the choice of embedding model. Retrieval quality matters more than generation quality. Evaluation is the hardest problem.
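For readers who have not built one of these systems, the sketch below shows the simplest possible chunking strategy: fixed-size word windows with overlap. The function name and the default sizes are assumptions chosen for illustration; real pipelines often split on sentence or section boundaries instead, and the right choice depends on the documents.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap`
    words with the previous window so facts that straddle a boundary
    still appear whole in at least one chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_words(sample, size=4, overlap=2):
        print(chunk)
```

Even in this toy form, the two parameters change what the retriever can find: windows that are too small lose context, windows that are too large dilute the match, and no embedding model fixes either.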

Open-weight models (LLaMA, Mistral, Mixtral) made on-premises deployment of a capable LLM practical for the first time. This was the inflection point for regulated industries: you could now run a capable language model inside your own infrastructure, fine-tune it on your own data, and control every aspect of the deployment.

But governance lagged. Model risk frameworks designed for logistic regression and gradient-boosted trees could not handle non-deterministic models with emergent capabilities trained on unknown data. The gap between what was being deployed and what was being governed was uncomfortable.

Era 5: Agents (2025)

We are now in the agent era. LLMs are no longer just question-answering systems. They are being given tools — the ability to search databases, call APIs, execute code, and delegate to other models. Multi-agent systems, where specialized models collaborate on complex tasks, are moving from research prototypes to early production deployments.

This is what motivated my recent research on LDP (LLM Delegate Protocol). When multiple agents delegate to each other, the protocols they use matter. Current protocols like MCP treat every model as a black box — they expose skill names but not model identity, latency characteristics, or cost profiles. Identity-aware routing, where the system knows which model is behind each agent and routes accordingly, can reduce latency by 12x on tasks that do not need the largest model.
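The idea of identity-aware routing can be sketched in a few lines. This is a hypothetical illustration, not the LDP specification or its API: the `AgentProfile` fields, the `quality_tier` scale, and the `route` function are all names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    """Metadata an identity-aware protocol could expose per agent
    (hypothetical fields, for illustration only)."""
    name: str
    model: str                 # which model sits behind the agent
    quality_tier: int          # higher = more capable (assumed scale)
    p50_latency_ms: float
    cost_per_1k_tokens: float


def route(agents: list[AgentProfile], min_quality_tier: int) -> AgentProfile:
    """Pick the fastest agent that is capable enough for the task,
    breaking latency ties by cost."""
    eligible = [a for a in agents if a.quality_tier >= min_quality_tier]
    if not eligible:
        raise LookupError(f"no agent meets quality tier {min_quality_tier}")
    return min(eligible, key=lambda a: (a.p50_latency_ms, a.cost_per_1k_tokens))


if __name__ == "__main__":
    # Made-up profiles; the numbers are not measurements.
    agents = [
        AgentProfile("extractor", "hypothetical-small", 1, 80.0, 0.10),
        AgentProfile("analyst", "hypothetical-large", 3, 900.0, 1.50),
    ]
    print(route(agents, min_quality_tier=1).name)  # easy task
    print(route(agents, min_quality_tier=3).name)  # hard task
```

A router that only sees skill names cannot make this choice; it has to treat both agents as interchangeable. Exposing identity and latency is what lets easy tasks skip the largest model.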

The agent era introduces new problems: how do you govern a system where one model decides which other model to call? How do you audit a chain of delegations? How do you attribute errors when three models contributed to a single output? These are the problems I expect to define the next few years of production AI.

What Changed

The obvious changes are architectural. In seven years, production NLP went from:

Task-specific models → general-purpose models adapted to tasks
Feature engineering → pre-training + prompting
Small models (3M parameters) → large models (100B+)
Single models → multi-model systems
Structured output → natural language output
Deterministic pipelines → probabilistic generation

Less obvious but equally important: the bottleneck shifted. In 2018, the bottleneck was model quality — the model was not good enough, so you compensated with features and rules. In 2025, the bottleneck is everything around the model — evaluation, governance, retrieval quality, cost optimization, and system architecture.

What Stayed the Same

This is the part that matters more.

Data quality still dominates model quality. In 2018, bad training data produced bad NER models. In 2025, bad retrieval data produces bad RAG outputs. The models changed. The dependency on data quality did not. Every era of NLP has been a contest between better models and worse data, and data keeps winning.

Evaluation is still the hardest problem. In 2018, we argued about entity-level F1 vs. token-level F1 vs. partial match scoring. In 2025, we argue about RAGAS vs. G-Eval vs. human evaluation vs. LLM-as-judge. The metrics changed. The fundamental difficulty — measuring whether a system is good enough for production — did not.
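The 2018 argument is easy to reproduce. The sketch below scores the same prediction two ways over BIO tags: token-level F1 counts each correctly tagged non-O token, while entity-level F1 only credits a span whose boundaries and type match exactly. The helper names are my own, and this is one common definition of each metric, not the only one.

```python
def bio_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO sequence.
    A stray I- tag with no preceding B- is ignored."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def f1(tp, n_pred, n_gold):
    """Harmonic mean of precision (tp / n_pred) and recall (tp / n_gold)."""
    if tp == 0:
        return 0.0
    p, r = tp / n_pred, tp / n_gold
    return 2 * p * r / (p + r)


def token_f1(gold, pred):
    tp = sum(g == p != "O" for g, p in zip(gold, pred))
    return f1(tp, sum(p != "O" for p in pred), sum(g != "O" for g in gold))


def entity_f1(gold, pred):
    g, p = bio_spans(gold), bio_spans(pred)
    return f1(len(g & p), len(p), len(g))


if __name__ == "__main__":
    gold = ["B-ORG", "I-ORG", "I-ORG", "O", "B-PER"]
    pred = ["B-ORG", "I-ORG", "O", "O", "B-PER"]  # truncates the ORG span
    print(f"token-level F1:  {token_f1(gold, pred):.3f}")
    print(f"entity-level F1: {entity_f1(gold, pred):.3f}")
```

On this example the prediction misses one token of a three-token company name. Token-level F1 comes out at about 0.857, entity-level F1 at 0.5. Which number reflects production quality depends on whether a truncated company name is still useful downstream, and that was exactly the argument.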

The pipeline matters more than the model. In 2018, preprocessing and post-processing added 3-5 F1 points on top of the model. In 2025, chunking strategy and retrieval pipeline add more quality than switching embedding models. The model is always the most visible component and never the most impactful one.

Governance is always late. In 2018, model governance for NER was an afterthought. In 2023, model governance for LLMs was an afterthought. The technology moves faster than the frameworks designed to govern it. This has been true for every era, and there is no reason to think it will change.

Production is where learning happens. Research papers describe what is possible. Production systems describe what works. The gap between these two has existed for every era, and bridging it has always been the highest-leverage work.

The Meta-Lesson

Looking back across seven years, the pattern I see is this: each era made the previous era's hard problem trivial and revealed a new hard problem.

Pre-training made feature engineering trivial. Scale made task-specific training trivial. Applications made capability access trivial. Agents are making single-model architectures trivial.

But each era also introduced a new hard problem. Pre-training introduced the compute bottleneck. Scale introduced the governance gap. Applications introduced the evaluation crisis. Agents are introducing the coordination and attribution problem.

The practitioners who thrive are not the ones who master each era's dominant technique. They are the ones who see the new hard problem early — while everyone else is still celebrating that the old hard problem is solved.

In 2018, I trained a 3-million-parameter model to find company names in Malay text. In 2025, I am designing protocols for systems where multiple billion-parameter models negotiate and delegate. The scale is different. The fundamental challenges (data quality, evaluation, governance, systems thinking) are not.

If you are entering the field today: learn the current tools, but invest in the fundamentals. The tools change every two years. The fundamentals have not changed in seven. They probably will not change in the next seven either.