Data Modeling dbt on Databricks: A Practical Guide

Data teams are increasingly looking for ways to unify data modeling and governance within a single, scalable platform. Databricks has cemented its position in the market with a powerful lakehouse architecture that blends data engineering, data science, and analytics capabilities.

Unlike traditional Data Warehouse software, sold as a finished application by vendors like SAP or Oracle, Databricks is a platform on which you can build a SQL-based warehouse. To achieve this, a team has to choose and adapt a warehousing design pattern and a logical framework in which to build its data model.

The open source data transformation tool dbt (data build tool) has become the industry standard framework for SQL-first transformations and data modeling. It is a perfect fit for building an Enterprise Data Warehouse on Databricks.

Bringing these two together allows analysts and engineers to transform raw data into reliable, production-ready datasets directly on the Databricks platform using best practices from software engineering. With the release of dbt support in Databricks Jobs, you can fully utilize the power of Databricks to run, orchestrate and monitor your dbt workflows directly within the platform.

dbt on Databricks diagram

In this blog post, we demonstrate how to combine dbt and Databricks to create scalable, reliable and performant data transformations based on an example workflow for data warehousing: bringing in data from various source systems and refining information in a Medallion architecture for analysis and reuse. We will walk you through the necessary steps:

  • Creating a skeleton dbt project and bringing it to Databricks via a Git repository
  • Defining the intended stages of data modeling using dbt
  • Setting up a Databricks compute cluster to run the dbt project
  • Configuring a Databricks Job to execute the dbt project stages


Finally, we will take a look at how dbt can help handle changing data points over time and changing data schema.

Project Setup: Local Initialization and Project Transport

Before you can run dbt jobs in Databricks, you’ll need to initialize a dbt project and configure it to connect to your Databricks workspace. The most common workflow sees you initializing the dbt project locally and then deploying it on Databricks using the workspace's Git integration.

Initialize dbt locally

Start by creating your project in a local development environment. Using the dbt-databricks adapter, you can scaffold a new project with:

pip install dbt-core dbt-databricks
dbt init my_dbt_project

1. Create dbt project root

During setup, choose Databricks as your target adapter. You will be asked to provide connection information for your Databricks instance, as well as a Databricks access token. Once the process is finished, dbt creates the standard directory structure (models/, seeds/, tests/, etc.) and a dbt_project.yml configuration file.
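To make the connection reproducible, the adapter stores these settings in a dbt profile. A minimal sketch of such a profiles.yml, assuming the catalog and schema used later in this example; host, http_path and token are placeholders for your own workspace details:

```
# ~/.dbt/profiles.yml (sketch -- values are placeholders)
my_dbt_project:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: nextlytics              # Unity Catalog used in this example
      schema: nextlytics-demo          # target schema for the dbt models
      host: adb-1234567890123456.7.azuredatabricks.net   # your workspace hostname
      http_path: /sql/1.0/warehouses/abc1234567890def    # SQL warehouse or cluster HTTP path
      token: dapiXXXXXXXXXXXXXXXXXXXX                    # Databricks personal access token
      threads: 4
```

In shared setups, the token is usually not stored in plain text but injected via an environment variable, for example with dbt's env_var() function.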

Image_1_dbt_folder_structure

 

 

2. Transport the project to Databricks

Once the project is working locally, the next step is to make it accessible inside Databricks. The recommended approach is to push your dbt project to a Git repository and then link that repository to your Databricks workspace via the Repos integration. Git ensures strict version control of the code base, manages multi-user collaboration and conflict resolution, and can be used to determine exactly which version of the workflow to run in the production environment.
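As a rough sketch (repository URL and branch are placeholders), transporting the project could look like this from the project root:

```
# Push the local dbt project to a remote Git repository (placeholder URL)
git init
git add .
git commit -m "Initial dbt project scaffold"
git remote add origin https://github.com/your-org/my_dbt_project.git
git push -u origin main
```

Afterwards, the repository can be linked under Repos in the Databricks workspace UI.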

 

Image_2_dbt_folder_structure_databricks




Data modeling with dbt

At this point, you’ve got a bare-bones dbt project that can run locally for development and is now available in your Databricks environment for production-ready execution. The next steps are to implement your data model and then create the Databricks Job to orchestrate dbt runs, where your transformations can be scheduled, monitored, and scaled automatically.

Let’s start with the data model: in our use case, we want to implement the commonly used Medallion architecture, a design pattern popularized by Databricks. Here, data is structured in three layers, Bronze (raw data), Silver (cleaned and structured data) and Gold (enriched data), with each step progressively refining and combining data with the goal of creating metrics and extracting business insights for reporting purposes.

To simulate the interconnectedness of data landscapes in modern business use cases, our example draws on multiple data sources as well. One table is extracted from a PostgreSQL database that is connected to the Databricks Unity Catalog via Lakehouse Federation. The other is a Delta table stored directly in our Databricks catalog.

Bronze Layer: Persisting data from all sources

The first step of implementing the Medallion architecture is to clone the raw data into our bronze-level tables. This is done to collect all data in the same system, in the same place. Keep in mind that at this stage, no logical transformations or enrichments of the data are applied.

This is where dbt models come into play: they define the data sources as well as the targets, including any transformations. This is done via the sources.yml file in the /models directory and via .sql files in the dbt models folder structure. This is also where you can hierarchically structure your data model according to your needs.

Image_3_dbt_folder_structure_models

In our case, the two sources are being defined in the sources.yml file like this:

```
version: 2

sources:
  - name: sap_erp_file_export
    database: nextlytics
    schema: nextlytics-demo
    tables:
      - name: raw_planvorgaben

  - name: external_postgres
    database: ext_nly_demo_postgres
    schema: public
    tables:
      - name: raw_maschinendaten
```

This allows dbt to reference those tables by name in the following step and not bother with the technical details of their origin.

For our bronze tables, we just want to clone those sources into delta tables with a specified name. For that, we create a .sql file for each target, with the name of the file specifying the target table to be created.

For example, we want to import all the data from our external postgres table raw_maschinendaten into a table called b_maschinendaten. The prefix b_ signifies that this is in fact a bronze-level table.

Thus we create a file b_maschinendaten.sql with the following contents:

```
{{ config(tags=['bronze']) }}

SELECT * FROM {{ source('external_postgres', 'raw_maschinendaten') }}
```

The dbt-tag “bronze” that is included here will become relevant once we get to orchestrating the jobs.

After applying the same logic to our other data source, we end up with the data models for two bronze level tables, b_maschinendaten and b_planvorgaben.
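For reference, a minimal sketch of the corresponding b_planvorgaben.sql, assuming it uses the same tag-based configuration as the first bronze model:

```
-- b_planvorgaben.sql: bronze clone of the SAP ERP file export (sketch)
{{ config(tags=['bronze']) }}

SELECT * FROM {{ source('sap_erp_file_export', 'raw_planvorgaben') }}
```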

Silver Layer: Refining and cleaning data to reflect business meaning

For our silver layer, we want to combine both of those tables into a single one. Since one of the tables contains information about production machine statistics and the other one holds data for planned machine operations, this data can be joined together neatly using SQL.

The SQL transformation logic is applied directly in the model for our silver layer table s_machine_output.

```
WITH planvorgaben AS (
    SELECT
        MaschinenID,
        Datum,
        Geplante_Produktionszeit_Min,
        Geplante_Gesamtteile
    FROM `nextlytics`.`nextlytics-demo`.`b_planvorgaben`
),
maschinendaten AS (
    SELECT
        maschinenid,
        ts,
        laufzeit_min,
        stillstand_min,
        gesamtteile,
        gutteile,
        ausschuss
    FROM `nextlytics`.`nextlytics-demo`.`b_maschinendaten`
)
SELECT
    p.MaschinenID,
    p.Datum,
    p.Geplante_Produktionszeit_Min,
    p.Geplante_Gesamtteile,
    m.laufzeit_min,
    m.stillstand_min,
    m.gesamtteile,
    m.gutteile,
    m.ausschuss
FROM planvorgaben p
JOIN maschinendaten m
    ON p.MaschinenID = m.maschinenid
    -- Convert the timestamp to date for the join condition
    AND p.Datum = DATE(m.ts)
```

Note that in this example, our Silver layer table still contains German language column names taken more or less directly from the data source. When working with international data teams, applying translations to English is a common best practice for an Enterprise Data Warehouse and should be applied either in Bronze or Silver layer. We will come back to this later to show how dbt can handle such revisions.

Gold Layer: Aggregation and KPI calculation

For our gold layer we want to provide actionable insights for reporting and dashboarding purposes.

Thus we want to calculate essential KPIs derived from the data; in this case we decided to go for OEE (Overall Equipment Effectiveness) to gain insights into the effectiveness of our simulated production.


The model for our gold layer table g_oee_performance thus looks like this:

```
WITH silver AS (
    SELECT
        MaschinenID,
        Datum,
        Geplante_Produktionszeit_Min,
        Geplante_Gesamtteile,
        laufzeit_min,
        stillstand_min,
        gesamtteile,
        gutteile
    FROM {{ ref('s_machine_output') }}
)
SELECT
    MaschinenID,
    Datum,
    CAST(laufzeit_min - stillstand_min AS DOUBLE) / CAST(Geplante_Produktionszeit_Min AS DOUBLE) AS verfuegbarkeit,
    CAST(gutteile AS DOUBLE) / CAST(Geplante_Gesamtteile AS DOUBLE) AS leistung,
    CAST(gutteile AS DOUBLE) / CAST(gesamtteile AS DOUBLE) AS qualitaet,
    -- Calculate OEE as the product of the three factors
    (
        CAST(laufzeit_min - stillstand_min AS DOUBLE) / CAST(Geplante_Produktionszeit_Min AS DOUBLE)
    ) * (
        CAST(gutteile AS DOUBLE) / CAST(Geplante_Gesamtteile AS DOUBLE)
    ) * (
        CAST(gutteile AS DOUBLE) / CAST(gesamtteile AS DOUBLE)
    ) AS OEE
FROM silver
```

This concludes the data modeling processes we want to execute; the next step is to implement the orchestration using Databricks Jobs.

Preparing Databricks Compute

First, some preparatory steps: jobs in Databricks run on Spark clusters. In our case, we need a cluster that fulfills the following requirements:

1. The cluster must have the dbt-databricks Python library installed. This can be done via the “Libraries” tab in the Databricks cluster configuration.

Image_4_databricks_libraries

2. The cluster must have access to a Databricks access token so that dbt can call the Databricks API to create the tables. This can be configured using the cluster's environment variables in combination with the Databricks Secrets management utility.

Image_5_databricks_environment_var
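As a sketch, such an environment variable could reference a secret scope like this (variable, scope and key names are placeholders); the dbt profile can then read it via dbt's env_var() function:

```
# Cluster configuration > Advanced options > Environment variables (sketch)
DBT_ACCESS_TOKEN={{secrets/nextlytics-scope/dbt-access-token}}
```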

 

Configuring and running the Databricks Job

Now that we have the model definitions finished and the cluster prepared, it is time to configure the actual pipeline for the data load and transformation. For this we are going to use the Databricks Jobs interface.

The job we create is in this case split into three subtasks, one for each target layer of the data model, plus one additional task that writes the generated SQL to file output for debugging purposes, in case one of the processing steps fails.

Image_6_databricks_job_tasks

This is where the tags we set in the model definitions come into play. By assigning those logical groups to processing steps, we are able to execute only a subset of the dbt transformations with a single call, as sketched below. Also make sure to select the correct pre-configured cluster for the pipeline execution.
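For illustration, the dbt commands behind the individual job tasks could look like the following sketch, assuming the silver and gold models are tagged analogously to the bronze example:

```
dbt deps                       # resolve package dependencies
dbt run --select tag:bronze    # task 1: build the bronze-level models
dbt run --select tag:silver    # task 2: build the silver-level models
dbt run --select tag:gold      # task 3: build the gold-level models
dbt compile                    # debug task: write the compiled SQL to the target/ folder
```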

Image_7_databricks_job_details

We can now trigger the job via the “Runs” tab. Here we can set up scheduled pipeline executions, check the data lineage and monitor the status of each task/pipeline that has been run.

Image_8_databicks_job_runs_interface

This concludes the basic setup for integrating a dbt data warehousing workflow into Databricks. There are, however, a few more common situations you may find yourself in that we want to examine briefly.

Data change tracking and schema changes

Time keeps marching ever onwards and with it, the data produced and consumed by your business may change, be it due to a version update in one of the source systems, a migration to a different source database, or new business requirements demanding a shift in your established data structure.
Here we want to take a quick look at how a setup such as this might handle schema evolution (the changing of data structures) and data merges.

Consider that for our silver layer transformation, we want to replace the German column names with more descriptive, unified English translations. In this case we can just slightly alter the SELECT statement in the model definition like this:

```
WITH planvorgaben AS (
    SELECT
        MaschinenID,
        Datum,
        Geplante_Produktionszeit_Min,
        Geplante_Gesamtteile
    FROM `nextlytics`.`nextlytics-demo`.`b_planvorgaben`
),
maschinendaten AS (
    SELECT
        maschinenid,
        ts,
        laufzeit_min,
        stillstand_min,
        gesamtteile,
        gutteile,
        ausschuss
    FROM `nextlytics`.`nextlytics-demo`.`b_maschinendaten`
)
SELECT
    p.MaschinenID AS machine_id,
    p.Datum AS prod_date,
    p.Geplante_Produktionszeit_Min AS planned_prod_time_min,
    p.Geplante_Gesamtteile AS planned_prod_amount,
    m.laufzeit_min AS running_time_min,
    m.stillstand_min AS idle_time_min,
    m.gesamtteile AS total_parts,
    m.gutteile AS total_good_parts,
    m.ausschuss AS total_scrap_parts
FROM planvorgaben p
JOIN maschinendaten m
    ON p.MaschinenID = m.maschinenid
    -- Convert the timestamp to date for the join condition
    AND p.Datum = DATE(m.ts)
```

On the next run of the task that updates the Silver layer table, the new column schema will be applied. Now all that is left is to alter the references in the SELECT statement of the gold-level table (see the sketch below) and we're done.
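A sketch of the adjusted gold-layer model, assuming the English column names introduced above (the KPI aliases stay unchanged):

```
-- g_oee_performance.sql after the Silver rename (sketch)
WITH silver AS (
    SELECT
        machine_id,
        prod_date,
        planned_prod_time_min,
        planned_prod_amount,
        running_time_min,
        idle_time_min,
        total_parts,
        total_good_parts
    FROM {{ ref('s_machine_output') }}
)
SELECT
    machine_id,
    prod_date,
    CAST(running_time_min - idle_time_min AS DOUBLE) / CAST(planned_prod_time_min AS DOUBLE) AS verfuegbarkeit,
    CAST(total_good_parts AS DOUBLE) / CAST(planned_prod_amount AS DOUBLE) AS leistung,
    CAST(total_good_parts AS DOUBLE) / CAST(total_parts AS DOUBLE) AS qualitaet,
    -- OEE remains the product of the three factors
    (CAST(running_time_min - idle_time_min AS DOUBLE) / CAST(planned_prod_time_min AS DOUBLE))
      * (CAST(total_good_parts AS DOUBLE) / CAST(planned_prod_amount AS DOUBLE))
      * (CAST(total_good_parts AS DOUBLE) / CAST(total_parts AS DOUBLE)) AS OEE
FROM silver
```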

This works because the models are materialized as tables (materialized='table'), which makes dbt do a full table refresh on every data load.

Of course this strategy, while simple to implement, has significant drawbacks when the amount of data increases or schema changes become more common.

For this, dbt offers the option to use incremental models. This allows users to load only new data that has not already been processed, via an automatic merge procedure. To do this, you need to provide the configuration materialized='incremental' within the model definition.

This opens up a whole new selection of possible approaches. For example, you can implement conditional logic within the model's SELECT statement that only triggers on an incremental load of a previously existing model. This could look something like this, where a datetime value from the data source is compared to the latest value already present in the model table to determine whether new data has arrived between pipeline executions:

```
{{ config(materialized='incremental') }}

SELECT * FROM {{ source('external_postgres', 'raw_maschinendaten') }}

{% if is_incremental() %}
-- Only evaluated on incremental runs: load rows newer than the latest timestamp already in the table
WHERE ts > (SELECT MAX(ts) FROM {{ this }})
{% endif %}
```

These use cases only scratch the surface of what is possible with dbt. There are a myriad of other useful features for handling complex scenarios or performance-intensive data loads that go beyond the scope of this article. Among them are common data warehousing tasks like automatic schema evolution, more granular incremental loading strategies, different merge/deletion scenarios and many more.

From SQL to Enterprise Data Warehouse with dbt: Our Conclusion

Using dbt on Databricks turns a greenfield SQL warehouse into a dependable operating model: a clear Medallion layout, explicit source and model definitions, and reproducible pipelines via Git and Databricks Jobs. Tags let you run targeted subsets, incremental models keep costs in check, and data lineage and monitoring provide transparency from Bronze to Gold layer for business metrics like OEE.

Bottom line: a smooth path from local prototype to production-grade data logistics—versioned, testable, and scalable. If you want to build or modernize your lakehouse with dbt/Databricks, we can help with architecture, implementation, and run-ready operations—practical and to the point.

If you want to know more about how you can use dbt to unlock the full potential of modern data warehousing in Databricks, feel free to contact us!

 


 

FAQ

These are some of the most frequently asked questions about data modeling with dbt on Databricks.

Why use dbt on Databricks?

dbt (data build tool) is a data modeling framework that uses templated SQL to define complex data pipelines and transformations as reusable, traceable code and configuration files. Where Databricks offers a greenfield SQL warehouse platform built on top of the highly scalable Data Lakehouse architecture, dbt can add the routines and conventions needed to operate an Enterprise Data Warehouse with hundreds of data sources and thousands of interconnected data models.
How do I start locally?

Follow any dbt quick start tutorial on the web. In short, you need to follow a few basic steps: pip install dbt-core dbt-databricks → dbt init <project> → choose Databricks adapter, add workspace info + access token; folders like models/, tests/, dbt_project.yml are created.

How do I deploy to the Databricks workspace?

Push to Git and connect a folder in your Databricks workspace for versioning, team collaboration, and reproducible runs.

What does the Medallion architecture data model look like?

Using Medallion architecture to build a logical warehouse model is not an exact science but largely up to project and team convention and understanding. Our examples roughly use the following interpretation: Bronze: clone and persist raw data. Silver: clean/standardize & join. Gold: calculate KPIs (e.g., OEE).

Can I use Data Vault modeling with dbt on Databricks?

Yes, producing a model according to Data Vault specifications is a perfect scenario for using dbt on Databricks. Data Vault logic can be used instead of the Medallion architecture or in synergy with it. Cookie-cutter templates for Data Vault modeling with dbt are available, enabling data engineers to quickly build out Hub, Link and Satellite structures on Databricks.

How do I orchestrate runs?

Use Databricks Jobs: create subtasks per layer, tags to run subsets, schedule/monitor runs, view lineage, and emit SQL artifacts for debugging.

What are the cluster prerequisites?

Install the dbt-databricks Python library; provide an access token via environment variables/Secrets.
How do I handle schema evolution?

Choose full refresh (simple, costly at scale) or incremental models with is_incremental() and timestamp-based merges for performant growth. Incremental changes in combination with a fully versioned data modeling approach like Data Vault are the gold standard of resilient and reproducible warehousing.


Robin Brandt is a consultant for Machine Learning and Data Engineering. With many years of experience in software and data engineering, he has expertise in automation, data transformation and database management - especially in the area of open source solutions. He spends his free time making music or creating spicy dishes.
