Recently, SAP and Databricks announced a strategic partnership that delivers Databricks as a native component of the SAP Business Data Cloud. The seamless communication between the two systems enables data practitioners to develop data science solutions, leverage machine learning capabilities such as experiment tracking and versioning of deployed models, and of course benefit from Delta Lake, the powerful open source storage framework. Bringing Databricks workspaces into the SAP ecosystem can be a shortcut for two topics that have been notoriously difficult before: low-cost but readily accessible mass storage and native tooling to ingest third-party data. This article shows one example of how you can now achieve the latter without any additional licensed integration tools or products.
In a recent article, we provided a bird's-eye view of the SAP Databricks integration and answered some key questions regarding its strategic significance, the architecture, and some potential limitations. Now it's time for a hands-on example. We will develop a simple data application that pulls data from the ServiceNow API as a real-world example and ingests it into a historized table in SAP Datasphere. The process will be scheduled to run at a predetermined interval, leveraging the scheduling capabilities of Databricks, where monitoring and alerts are available out of the box. Let's start with the architectural overview of the application.
Architecture
The application consists of a single Python notebook that orchestrates the data flow. Utilizing notebooks allows for cooperation between engineers and data analysts, and offers a quick and easy way to schedule the entire process. The diagram below outlines the flow of the data, and how each component relates to the others.
The source system is ServiceNow, a cloud-based platform that helps organizations automate and manage digital workflows across IT, HR, customer service, and other business functions. We will query the REST API and retrieve customer support tickets incrementally; only the recently updated entries are returned on each run, which reduces the size of the payload significantly. The destination will be a historized table in Databricks or SAP Datasphere. A historized table tracks changes in dimensional data over time by recording the period during which each record version is considered active. When a change is detected in a record, a new entry is added and the previous version is retired. A typical SCD type 2 historized table consists of two metadata columns (valid_from and valid_to timestamps), natural and surrogate keys, and multiple other attributes that are subject to change.
Such historized tables can be easily created and maintained with dlt (data load tool), an open source Python library that offers lightweight APIs for loading data from multiple sources, storing it in popular database formats or cloud storage, and tracking changes in datasets in an intuitive way. It is just a matter of defining our source and target (the ServiceNow REST API and SAP Datasphere in this case), providing some configuration, and letting the pipeline run. dlt takes care of extracting data from the source incrementally and ingesting it into the target system.
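To illustrate the idea, here is a minimal, self-contained sketch of dlt's SCD2 handling against a local DuckDB database; the table, column names, and sample records are purely illustrative and not part of the actual application.

```python
import dlt


@dlt.resource(
    name="customers",
    # "merge" with the "scd2" strategy keeps history: when a record changes,
    # the old version is retired (its valid-to timestamp is set) and a new
    # version with the current values is inserted.
    write_disposition={"disposition": "merge", "strategy": "scd2"},
)
def customers():
    yield [
        {"customer_id": 1, "name": "ACME", "segment": "enterprise"},
        {"customer_id": 2, "name": "Initech", "segment": "smb"},
    ]


pipeline = dlt.pipeline(pipeline_name="scd2_demo", destination="duckdb", dataset_name="crm")
print(pipeline.run(customers()))
# dlt adds validity columns (by default _dlt_valid_from / _dlt_valid_to)
# that mark the period in which each record version was active.
```

Re-running the pipeline with an updated extract closes the outdated versions and inserts the new ones, so the table keeps its full history.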
Implementation
The application consists of three simple methods that define the source configuration, connect to the destination (SAP Datasphere), and run the dlt pipeline. Sensitive information like credentials can be stored securely in a Databricks secret scope and retrieved within the notebook.
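The article does not reproduce the full notebook, but the following hedged sketch shows how the three building blocks could look. The secret scope and key names, the ServiceNow instance and table, and in particular the destination wiring (dlt's generic sqlalchemy destination with the sqlalchemy-hana dialect as one possible way to reach an Open SQL schema in SAP Datasphere) are assumptions, not the article's exact implementation.

```python
import dlt
import requests


@dlt.resource(
    name="incident",
    # SCD2 merge keeps the full change history of every ticket. Depending on your
    # dlt version, additional key/merge settings may be required for partial loads.
    write_disposition={"disposition": "merge", "strategy": "scd2"},
)
def servicenow_incidents(
    updated_at=dlt.sources.incremental("sys_updated_on", initial_value="1970-01-01 00:00:00"),
):
    """Source configuration: pull only records updated since the last successful run."""
    # dbutils is available inside Databricks notebooks; scope and key names are placeholders.
    instance = dbutils.secrets.get(scope="servicenow", key="instance")
    auth = (
        dbutils.secrets.get(scope="servicenow", key="user"),
        dbutils.secrets.get(scope="servicenow", key="password"),
    )
    url = f"https://{instance}.service-now.com/api/now/table/incident"
    offset, limit = 0, 1000
    while True:
        response = requests.get(
            url,
            auth=auth,
            params={
                "sysparm_query": f"sys_updated_on>={updated_at.last_value}^ORDERBYsys_updated_on",
                "sysparm_limit": limit,
                "sysparm_offset": offset,
            },
            timeout=60,
        )
        response.raise_for_status()
        rows = response.json()["result"]
        if not rows:
            break
        yield rows
        offset += limit


def datasphere_destination():
    """Destination: SAP Datasphere reached via SQLAlchemy and the hdbcli driver (assumed setup)."""
    host = dbutils.secrets.get(scope="datasphere", key="host")
    user = dbutils.secrets.get(scope="datasphere", key="user")
    password = dbutils.secrets.get(scope="datasphere", key="password")
    return dlt.destinations.sqlalchemy(f"hana+hdbcli://{user}:{password}@{host}:443")


def run_pipeline():
    """Wire source and destination together and execute the incremental load."""
    pipeline = dlt.pipeline(
        pipeline_name="servicenow_ingest",
        destination=datasphere_destination(),
        dataset_name="servicenow",
    )
    print(pipeline.run(servicenow_incidents()))


run_pipeline()
```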
Scheduling the process is straightforward and can be achieved through the SAP Databricks UI. It is possible to define complex cron expressions for the interval, receive notification emails on success and failure, and even provide parameters for the execution.
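Scheduling in the UI needs no code at all; for completeness, here is a hedged sketch of what an equivalent job definition could look like with the Databricks Python SDK, assuming a standard workspace. The notebook path, cron expression, parameters, and e-mail addresses are placeholders, and compute settings are omitted (serverless or a cluster specification would be added depending on the workspace).

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # credentials are resolved from the environment / CLI profile

w.jobs.create(
    name="servicenow_to_datasphere",
    tasks=[
        jobs.Task(
            task_key="run_pipeline",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/Shared/servicenow_ingest",
                base_parameters={"full_refresh": "false"},  # optional execution parameters
            ),
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 */2 * * ?",  # every two hours
        timezone_id="Europe/Berlin",
    ),
    email_notifications=jobs.JobEmailNotifications(
        on_success=["data-team@example.com"],
        on_failure=["data-team@example.com"],
    ),
)
```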
The result will be a fully historized table with data from the ServiceNow table API. Every subsequent pipeline execution will only ingest new data and will keep track of changes and updates in existing records based on the defined composite key.
Interoperability in Practice
The presented approach is highly portable and a prime example of what interoperability between systems in an ecosystem of open API specifications and open source tools can achieve. The Python dlt module we use for ELT-style data loading comes with numerous predefined source and destination connectors and harmonizes all data transports to a common standard. Instead of extracting data from the ServiceNow API, we could easily switch to any REST API, relational database, or object storage source; Salesforce, Shopify, Google Analytics, Jira, Asana, and many others are supported data sources. On the other hand, configuring a different destination is just a minor change in the pipeline configuration and opens up various options to work with Databricks and SAP systems. Databricks Unity Catalog is a typical destination for data loads and can easily serve as a persistent, fully replayable record of the changes retrieved from the source. Keep a Delta table like this as the archive layer and build any refined downstream data objects from there.
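As a sketch of that flexibility: switching the target is essentially a different destination factory in the pipeline definition, while source, schedule, and SCD2 settings stay untouched. Catalog and schema names below are assumptions, and the Databricks connection details would come from dlt's configuration or secrets.

```python
import dlt

# Land the raw, replayable change history as Delta tables in Unity Catalog ...
archive_pipeline = dlt.pipeline(
    pipeline_name="servicenow_ingest",
    destination="databricks",
    dataset_name="servicenow_archive",
)

# ... or keep the SAP Datasphere / HANA destination from the sketch above by
# passing that destination factory instead; the rest of the code is unchanged.
```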
SAP systems can be written to directly via SAP HANA ODBC connections. This allows us to use the presented pipeline to ingest data from practically any source system into any SAP system running on a HANA database: S/4HANA, BW, or Datasphere. Better yet, we do not even need access to an SAP Databricks workspace for this to work. Any Databricks workspace on Azure, Google Cloud, or AWS can run this kind of pipeline. And if you really want to push the limits of efficiency, you do not even need Databricks: the Python code can run anywhere, be it in serverless application frameworks, on local servers, or inside Apache Airflow tasks. A beautiful example of interoperability and of the endless possibilities for tailoring automated data processing systems to your specific process requirements or technical preferences.
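To make the "runs anywhere" point concrete, here is a minimal sketch of the same pipeline function scheduled as an Apache Airflow task; the DAG id, schedule, and import path are assumptions (and outside Databricks, credentials would of course come from Airflow connections or another secret store instead of dbutils).

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from servicenow_ingest import run_pipeline  # the function from the sketch above

with DAG(
    dag_id="servicenow_to_hana",
    start_date=datetime(2025, 1, 1),
    schedule="0 */2 * * *",  # every two hours (Airflow 2.4+ "schedule" argument)
    catchup=False,
) as dag:
    PythonOperator(task_id="run_pipeline", python_callable=run_pipeline)
```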
The now imminent future with SAP Databricks and its native Delta Sharing integration into the SAP Business Data Cloud is what really makes this approach compelling for us: instead of writing data through an ODBC bottleneck, SAP Databricks can store any incoming data in its native Delta table format and simply expose it to BDC without any data replication.
How to integrate ServiceNow data with SAP Databricks: Our Conclusion
This hands-on example illustrates a straightforward yet powerful use case of SAP Databricks for building a data extraction and ingestion application. By utilizing dltHub, we can easily ingest information from ServiceNow incrementally into SAP Datasphere or any other HANA database, enabling consistent and auditable change tracking. The integration not only streamlines development and scheduling through an intuitive pro-code interface, but also enhances operational reliability with built-in monitoring and alerting. While our example focused on a simple ServiceNow integration, the same approach scales to complex enterprise scenarios, making it a solid foundation for advanced analytics, machine learning, and (near) real-time data products in the SAP Business Data Cloud.
Do you have questions on this or another topic? Simply get in touch with us - we look forward to exchanging ideas with you!