I got introduced to DBT when I started at Vantage Data nearly a year ago, and although I had heard of DBT before, I didn't fully understand its capabilities until I got my hands dirty. Looking back, all I can say is WOW! Using DBT together with a platform like the Snowflake Data Warehouse results in a near-perfect integration that enhances the way we transform and analyze data. With DBT, the quality of data improves, the speed at which the models run is remarkable, and the documentation around its use is first-class.
DBT is here, it’s ready and it’s changing the Data Engineering landscape. So, let’s find out a bit more.
What is DBT?
Using DBT, people are discovering that they can build their data pipelines faster, with fewer engineers and with less maintenance. However, what exactly is DBT?
DBT is a data modeling tool that makes life much easier for analysts and engineers. It allows you to write SQL queries without having to worry about managing dependencies yourself, and without continually duplicating sections of queries you've already prepared: you can split your queries into modular, reusable pieces of code.
DBT, like traditional databases, is built on SQL, but it has additional functionality layered on top using templating engines such as Jinja. This lets you retrieve, rearrange, and organize your data using additional logic in your SQL. You can then compile and execute this code using DBT's run command to build just the pieces you need in your transformations. Models can also be swiftly coded, tested, and adjusted without waiting for all of your data to be processed, allowing you to quickly develop new, improved versions.
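As a rough illustration (the model and column names here are hypothetical), a DBT model is just a SELECT statement with a little Jinja sprinkled in, saved as a .sql file in your project:

```sql
-- models/staging/stg_app_orders.sql
-- A hypothetical staging model: plain SQL plus Jinja templating.
-- {{ ref(...) }} tells DBT this model depends on base_app_orders,
-- so DBT can build the dependency graph and run things in order.

with orders as (

    select * from {{ ref('base_app_orders') }}

),

renamed as (

    select
        order_id,
        customer_id,
        cast(ordered_at as date) as order_date,
        status
    from orders

)

select * from renamed
```

When you invoke DBT's run command, the Jinja is compiled away and the resulting SQL is executed against your warehouse.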
Data analysts and data engineers can use DBT to automate data transformation, testing, and deployment. This is especially important because many businesses' reporting data carries increasingly complicated business logic, especially as they scale. DBT keeps track of all changes made to the underlying logic and uses version control to make it simple to trace data and update or modify the pipeline.
You Don’t have to Re-invent the wheel – Reuse your Code
Let's look at the DAG above. Generating diagrams like this is easy with DBT. We can see that the model ‘stg_app_orders’ depends on ‘base_app_payments’ and ‘base_app_orders’. In DBT, when you run the ‘stg_app_orders’ model, the two models it depends on are run first. At the same time, ‘base_app_payments’ and ‘base_app_orders’ are independent of ‘stg_app_orders’ and can be referenced by other models as well. This keeps the code reusable and modular.
You can write a model once and then reference it in subsequent models with DBT. Because you’re utilizing the same logic in all of your models, the code is more dependable.
DBT makes code modular. This not only improves data quality, but also saves engineers a lot of time. DBT also encourages casting and renaming directly at the source, which improves data quality: the smaller data “upkeep” duties are handled once in your base models.
With DBT, other models no longer need to reference raw data directly, which can lead to errors downstream. Instead of using the raw data, they reference the base models. This avoids mistakes such as unintentionally casting your dates to two distinct types of timestamps or giving the same column two different names.
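A minimal sketch of this idea, assuming a hypothetical raw orders table registered as a DBT source (the names and the Snowflake-style cast are illustrative):

```sql
-- models/base/base_app_orders.sql
-- Hypothetical base model: the only place that reads the raw table.
-- Casting and renaming happen exactly once here, so every downstream
-- model sees consistent column names and types.

select
    id                          as order_id,
    user_id                     as customer_id,
    cast(created_at as timestamp) as ordered_at,  -- cast once, consistently
    status
from {{ source('app', 'orders') }}
```

Downstream models then use `{{ ref('base_app_orders') }}` instead of touching the raw table, so a fix to a cast or a rename only ever needs to happen in one file.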
DBT helps with speed
DBT allows you to structure your code into base and intermediate data models, which speeds up the execution of your core data models. Because your models are modular, you only need to build them once before referencing them in subsequent models. You aren't squandering time and resources by re-running the same blocks of code over and over.
When you use the DBT run command, models are also processed in parallel. Models with dependencies aren't run until their upstream models have finished, while models with no unmet dependencies run simultaneously. This improves throughput while reducing run time.
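For reference, the run command also supports node selection, so you can build exactly the slice of the DAG you care about while DBT works out ordering and parallelism for you (the model name here is hypothetical):

```shell
# Run every model in the project, in dependency order, in parallel
dbt run

# Run a single model
dbt run --select stg_app_orders

# Run a model plus everything upstream of it
dbt run --select +stg_app_orders
```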
DBT’s Documentation is First Class
DBT automatically generates documentation in the form of a static website from your project. How good your documentation is and how the consumers can access it is up to you.
DBT allows you to write model descriptions directly in the code, and it makes it simple to record column names and descriptions. This documentation is written in a .yml file that sits alongside the models it describes.
These .yml files can then be used to build a website that contains all of your DBT documentation. DBT makes this extremely simple with its dbt docs command: dbt docs generate builds the static site, and dbt docs serve hosts it locally.
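As a sketch (the model and column names are hypothetical), such a .yml file might look like this:

```yaml
# models/staging/schema.yml
version: 2

models:
  - name: stg_app_orders
    description: "One row per order placed in the app."
    columns:
      - name: order_id
        description: "Primary key for orders."
      - name: order_date
        description: "Date the order was placed."
```

Because the descriptions live next to the models themselves, they are version-controlled and reviewed alongside the SQL they describe.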
Seeds and Tests
Seeds and tests are among the other good features included in DBT. Seeds are CSV files that you add to your DBT project and load into your data warehouse with the dbt seed command. If you have a list of lookup codes that you'll need in your analysis but don't have in your existing data, this is a quick and convenient option.
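For example, a hypothetical lookup file saved as seeds/country_codes.csv in the project:

```
country_code,country_name
AU,Australia
NZ,New Zealand
```

Running dbt seed loads it as a table in the warehouse, after which other models can reference it like any model, with `{{ ref('country_codes') }}`.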
DBT tests allow you to verify that your data sources meet your requirements. Tests can include checking that primary keys are unique or that columns are non-null. You can also check the relationships between primary and foreign keys across tables to ensure that a value doesn't exist in one but not the other. This is extremely useful for organizations that want to adhere to good data practices.
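These checks are declared in the same .yml files as the documentation, using DBT's built-in generic tests (the model and column names below are hypothetical):

```yaml
# models/staging/schema.yml
version: 2

models:
  - name: stg_app_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          # fails if a customer_id here has no match in stg_app_customers
          - relationships:
              to: ref('stg_app_customers')
              field: customer_id
```

Running dbt test then executes each of these as a query against the warehouse and reports any rows that violate the assertion.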
Data teams can use DBT to build ELT pipelines that break down increasingly complicated data extraction, loading, and processing activities, while also enhancing team productivity, sanity, and data ownership.
DBT is available in both a free open source and a premium cloud version, offering your organization feature flexibility as you construct the ideal tech stack. With native connectivity to tools like Snowflake, BigQuery, Redshift, Stitch, and many others, DBT integrates neatly into an existing modern technological stack.
All in all, DBT is a cutting-edge tool that is definitely worth a try, as it may well simplify your data ELT (or ETL) pipeline.
Interested in more? Check out our product, Vantage Point. Vantage Point (VP) is a no-code, click & go business acceleration tool which enables data-driven decisions across your business. It drives interactivity across all parts of your organization by communicating value (KPIs), autogenerating tasks with cutting-edge ML/AI technology, and enabling users to combine VP's ML/AI recommendations with their own analysis. You can finally track the exact ROI impact throughout your entire business with Vantage Point.
Get in touch by following this link
Senior Technology Architect