Why We Chose dbt Over BigQuery Dataform

When you're building a data platform on Google Cloud, Dataform looks like the obvious choice. It's native to BigQuery. It's managed by Google. It's integrated into the Cloud Console. If you're making decisions based on vendor alignment alone, the conversation ends here.

We went with dbt. Not because it's trendy. Because when you evaluate transformation frameworks through the lens of security posture, operational governance, team capability, and long-term architectural flexibility, dbt is the stronger choice for enterprise environments — particularly regulated ones.

This isn't a feature comparison matrix. It's the reasoning behind a decision that affected every downstream workflow in our data platform.

The JavaScript Supply Chain Problem

This was the argument that ended the debate internally.

Dataform is built on Node.js. Its transformation runtime executes JavaScript. Its package ecosystem runs through npm. This means every Dataform project inherits the security surface area of the npm dependency tree — and that surface area is enormous.

The npm ecosystem has a well-documented history of supply chain attacks and vulnerability disclosures:

  1. left-pad (2016) — a maintainer unpublished an 11-line package and broke builds across the ecosystem, exposing how deep transitive dependency chains run.
  2. event-stream (2018) — a widely used package was handed to a new maintainer who injected code targeting cryptocurrency wallets.
  3. ua-parser-js (2021) — a hijacked maintainer account shipped versions containing credential-stealing malware.
  4. colors and faker (2022) — a maintainer deliberately sabotaged his own packages, breaking thousands of downstream projects.
  5. node-ipc (2022) — a maintainer added destructive protestware that overwrote files on machines resolving to certain geolocations.

These aren't theoretical risks. They're incidents that happened. And in a regulated financial institution, each one of them triggers a response cycle.

What a JavaScript Vulnerability Means in a Regulated Environment

When a CVE is published against an npm package in your dependency tree, the clock starts. Depending on the severity and your organization's vulnerability management policy, you're typically looking at a 30 to 90-day remediation window — often shorter for critical or high-severity vulnerabilities.

Here's what that remediation cycle actually looks like:

  1. Detection — Your software composition analysis (SCA) tool flags the vulnerability. Snyk, Dependabot, or whatever you're running surfaces the CVE against a transitive dependency three levels deep.
  2. Triage — Security team evaluates severity, exploitability, and blast radius. Is this dependency actually invoked at runtime, or is it a dev dependency? Is the vulnerable code path reachable? In complex npm trees, answering this question alone can take days.
  3. Patch availability — You check if the vulnerable package has released a fix. Often it hasn't. Or the fix exists but hasn't propagated to the intermediate dependency that pulls it in. You're now waiting on a maintainer you don't know, in a timezone you can't predict, with no SLA.
  4. Compatibility testing — If a patch exists, you update the dependency and run your full test suite. But npm dependency resolution is fragile. Bumping one package can cascade version conflicts across the tree. You've now spent a sprint on a security fix that has nothing to do with your actual data models.
  5. Approval and deployment — In a regulated environment, the patched version goes through change management. CAB review, deployment window, post-deployment validation. For a data transformation tool.

Now multiply this by frequency. In 2023 alone, the npm registry disclosed over 7,000 advisories. Not all of them will hit your dependency tree, but a meaningful percentage will. We were looking at 2 to 4 vulnerability remediation cycles per quarter — each one consuming engineering time that should have been spent building data products.

Why dbt Avoids This

dbt is Python-based. The Python packaging ecosystem has its own issues, but the attack surface is structurally smaller for several reasons:

  1. Shallow dependency trees — dbt Core pulls in dozens of packages, not the hundreds or thousands of transitive dependencies typical of a Node.js project.
  2. No install-time code execution for wheels — installing a prebuilt Python wheel doesn't run arbitrary scripts, whereas npm lifecycle scripts (preinstall, postinstall) execute by default.
  3. Flat resolution — pip resolves a single version of each package per environment, so there's no nested tree of duplicated, independently vulnerable copies.
  4. Lower advisory volume in practice — the packages in a typical dbt project generate far fewer CVEs per year than an equivalent npm tree.

For an enterprise security team, this distinction is not academic. It's the difference between a transformation layer that generates constant vulnerability noise and one that sits quietly in the background doing its job.

SQL-First vs. JavaScript-Wrapped SQL

Dataform uses SQLX — a format that embeds SQL inside JavaScript blocks. You write your transformation logic in SQL, but the orchestration, configuration, and any dynamic logic happens in JavaScript.

dbt uses SQL with Jinja templating. You write SQL. You use Jinja for control flow, macros, and configuration. The mental model stays in SQL.

This matters for two reasons:

Team capability. Data engineers and analytics engineers think in SQL. Most of them are comfortable with Python. Very few of them are proficient JavaScript developers. Choosing Dataform means either hiring for JavaScript skill in your data team (hard, expensive, and creates the wrong talent profile) or accepting that your data engineers will write JavaScript they're not confident maintaining.

In practice, we found that Dataform's JavaScript layer created a two-tier system. Senior engineers could use the full power of SQLX — JavaScript functions, custom assertions, dynamic SQL generation using JS. Junior engineers treated the JavaScript layer as boilerplate they copied without understanding. This is exactly the pattern that produces brittle, hard-to-debug pipelines.

Code review and auditability. In regulated environments, data transformation code is auditable. When a regulator asks how a number was computed, you trace it from the dashboard back through the transformation layer to the source. That trace needs to be legible to people who aren't software engineers — risk officers, model validators, compliance analysts.

SQL is legible. A dbt model is a SELECT statement with some Jinja macros. Anyone with basic SQL knowledge can read it, reason about it, and validate that it's doing what the documentation says it's doing. A Dataform model wrapped in JavaScript config blocks, with logic split between .sqlx files and JavaScript includes, is harder to follow for non-developers. The audit trail gets muddier.
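
To make that concrete, here's what a typical dbt model looks like — a minimal sketch with hypothetical model and column names, not code from our platform:

```sql
-- models/marts/fct_daily_payments.sql (hypothetical name)
-- config() sets the materialization; everything else is a SELECT.
{{ config(materialized='table') }}

select
    payment_date,
    currency,
    sum(amount) as total_amount,
    count(*)    as payment_count
from {{ ref('stg_payments') }}  -- ref() resolves the upstream model
group by 1, 2
```

A risk officer who reads SQL can follow this end to end; the only non-SQL pieces are the config call and the ref() lookup.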

Vendor Lock-In and Multi-Warehouse Portability

Google acquired Dataform in December 2020. Since then, it's been integrated into the BigQuery Console as a first-party service. This integration is convenient — but it's also a lock-in vector.

Dataform works with BigQuery. That's it. If your data strategy involves any possibility of multi-cloud, hybrid, or migration scenarios, Dataform is a dead end. You'd be rewriting every model.

dbt supports BigQuery, Snowflake, Redshift, Databricks, Spark, Postgres, and dozens of other warehouses through its adapter system. A dbt model written for BigQuery can be ported to Snowflake with adapter-specific changes to materialization and SQL dialect, but the core logic, the testing framework, the documentation, and the operational patterns transfer directly.
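
As an illustration of how that portability works, dbt ships cross-database macros that compile to the target warehouse's dialect — a sketch assuming dbt-core 1.2 or later, with hypothetical model names:

```sql
-- models/marts/fct_daily_events.sql (hypothetical name)
select
    -- Compiles to the warehouse-appropriate truncation
    -- function on each adapter (BigQuery, Snowflake, Postgres, ...).
    {{ dbt.date_trunc("day", "created_at") }} as created_day,
    count(*) as events
from {{ ref('stg_events') }}
group by 1
```

The same source file compiles to valid SQL on every supported adapter; only warehouse-specific SQL written by hand needs rework in a migration.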

We're a BigQuery shop today. We may be a BigQuery shop in five years. But our data strategy doesn't assume that. And in an industry where cloud strategy shifts happen at the executive level with limited warning, having a transformation layer that isn't welded to a single warehouse is a hedge worth having.

Testing and Data Quality

dbt's testing framework is one of its strongest features. Out of the box, you get schema tests — unique, not_null, accepted_values, relationships. These are declared in YAML alongside the model definition. They run as part of every pipeline execution. They're version-controlled with the code.
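
Those declarations look like this in practice — a sketch with hypothetical model and column names:

```yaml
# models/marts/schema.yml (hypothetical model and columns)
version: 2

models:
  - name: fct_daily_payments
    columns:
      - name: payment_id
        tests:
          - unique
          - not_null
      - name: currency
        tests:
          - accepted_values:
              values: ['EUR', 'USD', 'GBP']
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id
```

Every dbt build compiles each declaration into a SQL query that must return zero failing rows; a failure blocks the pipeline.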

Beyond built-in tests, the dbt ecosystem offers dbt_expectations — a port of Great Expectations patterns to dbt. This gives you distributional tests, regex pattern matching, row-count checks, and statistical assertions. All in SQL. All auditable.
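
For instance — a sketch with hypothetical columns, assuming the dbt_expectations package is installed via packages.yml:

```yaml
# Hypothetical model and columns; the test names are real
# dbt_expectations tests.
models:
  - name: fct_daily_payments
    tests:
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 1
    columns:
      - name: iban
        tests:
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: '^[A-Z]{2}[0-9]{2}[A-Z0-9]{1,30}$'
      - name: amount
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
```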

Dataform has assertions, which serve a similar purpose but are less mature. They're defined inline in SQLX files. The pattern coverage is narrower. And because they're JavaScript-configured, extending them requires JavaScript proficiency.

In a regulated data platform, testing is not optional and not secondary. Every model that feeds a regulatory report, a risk calculation, or a financial disclosure needs tests that prove it's producing correct results. dbt makes this a first-class concern. Dataform treats it as a feature.

Community, Ecosystem, and Package Maturity

dbt has an ecosystem that Dataform can't match. This isn't opinion — it's numbers:

  1. Community — 80,000+ members in the dbt Community Slack, plus an annual conference (Coalesce) and years of accumulated blog posts and talks.
  2. Packages — hundreds of packages on the dbt Package Hub: dbt_utils, dbt_expectations, dbt_audit_helper, and source packages for common SaaS tools.
  3. Hiring — thousands of companies run dbt in production, which means your candidate pool already knows the tool.

Dataform's community is smaller by an order of magnitude. It's primarily used within Google Cloud-native organizations. The package ecosystem is limited. When you need help, your options are Google Cloud support (if you're paying for it) and a handful of community forums.

Ecosystem size isn't vanity. It directly impacts your team's velocity. When a well-maintained package already implements the pattern you need — SCD Type 2 modeling, say, or Salesforce source transformation — your team uses it instead of building from scratch. When a pattern is well-documented across dozens of blog posts and conference talks, your engineers learn it faster. When the community is large enough that your specific problem has been solved before, debugging is faster.

Documentation and Lineage

dbt generates a documentation site from your model definitions. Every model, every column, every test, every source — documented and linked. The lineage graph shows how data flows from source through staging, intermediate, and mart layers. This is generated from code, not maintained separately.

For governance and audit purposes, this is transformative. When a regulator asks "where does this number come from?", you open the dbt docs site, click on the model, and trace lineage upstream. Column-level descriptions, test results, and freshness checks are all visible. The documentation is always current because it's generated from the same code that produces the data.

Dataform provides lineage tracking and a dependency graph. But the documentation capabilities are less developed, and the lineage is confined to BigQuery. In a platform where data flows through Pub/Sub, Dataflow, and BigQuery before reaching dbt, the transformation layer's lineage is one piece of the puzzle — but it's the piece that dbt makes easy to generate and maintain.

CI/CD and Development Workflow

dbt integrates cleanly into standard software engineering workflows:

  1. Plain Git — branches, pull requests, and code review in GitHub or GitLab, with no proprietary repository layer.
  2. Local development — dbt is a CLI, so engineers work in VS Code or any editor with the same tooling they use for other projects.
  3. CI/CD — compile and test on every pull request; "slim CI" builds only changed models and their downstream dependents.
  4. Orchestration — dbt runs anywhere a Python process runs: Airflow, Cloud Composer, GitHub Actions, or cron.

Dataform's development workflow is centered on its web IDE and Google Cloud integration. You can use Git, but the workflow is designed around Dataform's own environment. If your team uses VS Code, has established CI/CD pipelines, and wants to treat data transformation code like any other software project, Dataform's opinionated workflow creates friction.
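
To illustrate the contrast, a dbt pull-request check is ordinary CI configuration — a hedged sketch for GitHub Actions, where the Python version, artifact path, and omitted auth step are assumptions:

```yaml
# .github/workflows/dbt-ci.yml — illustrative sketch, not our pipeline
name: dbt CI
on: pull_request

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install dbt-bigquery
      # (Warehouse auth step omitted — e.g. workload identity federation.)
      # "Slim CI": build only models changed vs. production, plus their
      # downstream dependents, deferring unchanged refs to prod artifacts.
      - run: dbt build --select state:modified+ --defer --state prod-artifacts/
```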

The Acquisition Question

Google acquired Dataform. This brings benefits — BigQuery integration, Google Cloud Console embedding, managed infrastructure. It also brings risk.

Google has a well-documented history of deprecating products. When a managed service is also your transformation framework, the switching cost is high and the exit path is narrow. If Google decides to merge Dataform into a broader BigQuery feature, change the API surface, or deprecate it in favor of something else, your transformation layer moves with it — or breaks.

dbt Labs is an independent company. dbt Core is open source (Apache 2.0). If dbt Labs disappeared tomorrow, the project would continue. Your models are SQL files. Your tests are YAML. Your macros are Jinja. There's no proprietary runtime, no platform dependency, and no single vendor whose strategic decisions can force a rewrite of your transformation layer.

In enterprise architecture, this kind of optionality isn't a nice-to-have. It's risk management.

When Dataform Does Make Sense

This isn't a one-sided argument. Dataform has legitimate strengths:

  1. Zero infrastructure — fully managed, with no deployment pipeline, no runtime to host, and no scheduler to stand up.
  2. Native BigQuery integration — IAM-based access control, Cloud Console embedding, and scheduling out of the box.
  3. No additional licensing cost — it's included with BigQuery, while dbt Cloud is a separate commercial product.
  4. Fast start — a small team can go from nothing to scheduled transformations in an afternoon.

For a small team building an analytics stack on BigQuery with limited security review requirements, Dataform is a reasonable choice. For an enterprise data platform in a regulated industry where security posture, auditability, talent portability, and architectural flexibility are non-negotiable — dbt is the stronger foundation.

The Decision Framework

If you're making this choice today, here's how I'd frame it:

| Criterion | dbt | Dataform |
| --- | --- | --- |
| Security surface area | Python + SQL packages | Node.js + npm ecosystem |
| Vulnerability remediation | Low frequency, manageable | High frequency, complex dep trees |
| Team skill alignment | SQL + Python (data team norm) | SQL + JavaScript (less common) |
| Warehouse portability | Multi-warehouse | BigQuery only |
| Testing maturity | Extensive, ecosystem-backed | Assertions (growing) |
| Community size | 80,000+ Slack, packages hub | Smaller, GCP-focused |
| Vendor risk | Open source, portable | Google-owned, BigQuery-coupled |
| Setup complexity | Requires deployment infra | Managed, zero-setup |
| Cost | Core free, Cloud is licensed | Included in BigQuery |

Conclusion

The choice between dbt and Dataform isn't a tooling decision. It's an architectural decision that affects security posture, team capability, governance readiness, and long-term platform flexibility.

For us, the JavaScript supply chain risk alone was disqualifying. In an environment where every CVE triggers a remediation cycle, where every dependency needs to be auditable, and where security review isn't optional — choosing a transformation framework built on npm is choosing to accept ongoing operational overhead that has nothing to do with transforming data.

dbt gave us a SQL-first transformation layer with a shallow dependency tree, a mature testing framework, multi-warehouse portability, and an ecosystem that accelerates rather than constrains our data team. The setup cost is higher than Dataform. The operational cost is lower. For an enterprise data platform, that's the right trade.


If you're evaluating transformation frameworks for a BigQuery data platform, I'd suggest starting with the security review. Run npm audit on a fresh Dataform project and pip-audit on a fresh dbt project. Count the findings. Estimate the remediation effort. That comparison tells you more than any feature matrix.