Databricks Q&A

Proud to be a Databricks Partner | 27G

On May 5th, I had the pleasure of presenting “Data Cleansing using Databricks” (https://www.meetup.com/ohio-north-database-training/events/314363881/).

During the meeting, many good questions were raised. Listed below are the answers to those questions:

 

What are jobs in Databricks?

Jobs are workloads that can be scheduled, managed, and automated to run without manual intervention. A workload can be a notebook, a SQL query, or a pipeline running on a cluster.
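For example, a job can be created programmatically as well as through the UI. The sketch below builds a request payload following the shape of the Databricks Jobs API 2.1 “create job” endpoint; the job name, notebook path, cluster settings, and cron schedule are hypothetical values, not anything from the talk:

```python
import json

# Sketch of a Jobs API 2.1 "create job" payload; all values are hypothetical.
job_spec = {
    "name": "nightly-data-cleansing",
    "tasks": [
        {
            "task_key": "clean_customers",
            "notebook_task": {"notebook_path": "/Repos/etl/clean_customers"},
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    "schedule": {
        # Quartz cron: run every night at 2:00 AM
        "quartz_cron_expression": "0 0 2 * * ?",
        "timezone_id": "America/New_York",
    },
}

# In practice you would POST this JSON to /api/2.1/jobs/create,
# or pass the equivalent settings through the Databricks SDK or CLI.
print(json.dumps(job_spec, indent=2))
```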

 

How do jobs compare to other automation tools?

Feature     | Jobs                         | Workflows                        | Delta Live Tables
------------|------------------------------|----------------------------------|-----------------------------------
Purpose     | Automate & schedule tasks    | Orchestrate multi-step pipelines | Declarative ETL pipelines
Best for    | ETL, ML training, batch jobs | Complex DAGs                     | Data quality + streaming/batch ETL
Compute     | Job clusters or all-purpose  | Job clusters                     | Managed DLT clusters
Complexity  | Low–medium                   | Medium–high                      | Medium

 

What is a “Delta Live Table”?

Delta Live Tables (DLT) is a Databricks feature that makes it much easier to build and run data pipelines, whether batch or streaming. You simply write the transformations in SQL or Python, and DLT takes care of setting up the infrastructure, tracking dependencies, handling errors, and enforcing data quality rules through its built-in expectations.
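As an illustrative sketch of the declarative style (this fragment only runs inside a DLT pipeline, where the `dlt` module and `spark` session are provided; the table names and source path are hypothetical):

```python
import dlt
from pyspark.sql import functions as F

# Bronze: ingest the raw JSON as-is (source path is hypothetical)
@dlt.table(comment="Raw customer records")
def customers_bronze():
    return spark.read.json("/mnt/raw/customers/")

# Silver: an expectation enforces a data quality rule declaratively;
# rows with a null customer_id are dropped and counted in the event log
@dlt.expect_or_drop("valid_id", "customer_id IS NOT NULL")
@dlt.table(comment="Validated customer records")
def customers_silver():
    return dlt.read("customers_bronze").withColumn("loaded_at", F.current_timestamp())
```

Note that there is no scheduling or cluster code here: DLT infers the dependency from `dlt.read("customers_bronze")` and provisions the compute itself.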

 

Can you see what part of the ETL pipeline you're in?

Yes, you can see exactly which part of your ETL pipeline is running. How you do this depends on whether you’re using Jobs, Workflows, or Delta Live Tables (DLT): you get visibility through the Jobs UI, task-level run details, and (for DLT) a full lineage graph and event log.

 

What is the Medallion Architecture in Databricks?

This is a data design pattern for data cleansing represented by three layers:

  1. Bronze (raw data)
  2. Silver (validated)
  3. Gold (enriched)

The purpose is to progressively improve data quality as data moves through the ETL pipeline, preparing it for analytics and Machine Learning.
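As a minimal, plain-Python sketch of the idea (a real pipeline would use Spark DataFrames and Delta tables; the record fields here are hypothetical):

```python
# Bronze: raw records exactly as ingested, including bad rows
bronze = [
    {"id": "1", "amount": "100.50", "region": "OH"},
    {"id": None, "amount": "20.00", "region": "OH"},   # invalid: missing id
    {"id": "3", "amount": "abc", "region": "MI"},      # invalid: bad amount
]

def to_silver(rows):
    """Silver: keep only validated rows, with proper types."""
    out = []
    for r in rows:
        if r["id"] is None:
            continue  # drop rows that fail the not-null rule
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # drop rows whose amount cannot be parsed
        out.append({"id": int(r["id"]), "amount": amount, "region": r["region"]})
    return out

def to_gold(rows):
    """Gold: enriched/aggregated view ready for reporting or ML."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'OH': 100.5}
```

Each layer is a separate, queryable stage, so bad rows are quarantined at Silver rather than silently flowing into the Gold aggregates.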

 

Reference: What is Medallion Architecture? | Databricks — https://www.databricks.com/blog/what-is-medallion-architecture

 

What happens when an error occurs in a notebook used in an ETL pipeline?

Notebooks are one of the most common building blocks for ETL in Databricks. When an error occurs, the pipeline does not silently continue. Databricks stops at that step, logs the failure, and shows exactly what went wrong:

  - The task status becomes “Failed”.
  - The job run stops at that task unless retries are configured.
  - Downstream tasks that depend on it are skipped.
  - The cluster logs the notebook output and the exception.
  - The overall job run is marked as Failed unless other branches succeed.

 

As a best practice, consider using the following to handle errors gracefully inside the notebook:

  - Try/except blocks
  - Custom error messages
  - Return codes
  - dbutils.notebook.exit() with structured output
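A minimal sketch of that pattern (the step name and the JSON fields are my own illustrative choices, not a Databricks convention); in a real notebook you would pass the final string to dbutils.notebook.exit() instead of returning it:

```python
import json

def run_cleaning_step(rows):
    # Hypothetical cleaning step: fails when a null value is present
    if any(r is None for r in rows):
        raise ValueError("null values found")
    return [r.strip() for r in rows]

def run_with_structured_exit(rows):
    """Wrap a step in try/except and build a structured result string,
    suitable for dbutils.notebook.exit() in a real notebook."""
    try:
        cleaned = run_cleaning_step(rows)
        result = {"status": "OK", "rows_out": len(cleaned)}
    except Exception as e:
        result = {"status": "FAILED", "step": "cleaning_step", "error": str(e)}
    # In a notebook: dbutils.notebook.exit(json.dumps(result))
    return json.dumps(result)

print(run_with_structured_exit(["  a ", "b"]))  # {"status": "OK", "rows_out": 2}
print(run_with_structured_exit(["a", None]))    # {"status": "FAILED", ...}
```

The calling job (or a parent notebook using dbutils.notebook.run()) can then parse the exit string and branch on `status` instead of scraping stack traces.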

 

 

How can you log errors from an ETL pipeline?

You can write to a variety of log files, as shown in the examples below:

 

To write to the Databricks File System (DBFS) in Python:

from datetime import datetime

with open("/dbfs/mnt/logs/etl_log.txt", "a") as f:
    f.write("ETL step failed at: " + str(datetime.now()) + "\n")

 

To write to cloud object storage in Python (note that saveAsTextFile() fails if the target path already exists, so use a unique path per run):

spark.sparkContext.parallelize(["error: null values found"]).saveAsTextFile("s3://my-bucket/logs/etl/errors/")

 

To write to a Delta table in Python (e.g., from inside an except block, where e is the caught exception):

from datetime import datetime

spark.createDataFrame(
    [(str(e), "cleaning_step", datetime.now())],
    ["error_message", "step", "timestamp"]
).write.format("delta").mode("append").saveAsTable("etl_error_log")

 

To write small log files using dbutils.fs.put() in Python (with overwrite=False, the call fails if the file already exists):

dbutils.fs.put("/mnt/logs/etl_log.txt", "ETL started\n", overwrite=False)

 

Can I download Databricks to run locally?

No. Databricks is a cloud-based Data Intelligence Platform built on Apache Spark; it runs on AWS, Azure, or Google Cloud rather than on your local machine.

 

Are there any certification programs?

Yes, check out https://www.databricks.com/resources/learn/training/databricks-fundamentals

 

 
