Databricks Q&A
On May 5th, I had the pleasure of presenting “Data Cleansing using Databricks” (https://www.meetup.com/ohio-north-database-training/events/314363881/).
During the meeting, many good questions were raised. Listed below are the answers to these questions:
What are jobs in Databricks?
Jobs are workloads that can be scheduled, managed, and automated without manual intervention. Workloads can be notebooks, SQL queries, or pipelines on a cluster.
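For illustration, here is a minimal sketch of creating a scheduled job through the Jobs REST API (version 2.1). The workspace URL, token, notebook path, and cluster settings are placeholders, not values from the talk:

import requests

# Placeholder workspace URL and personal access token.
host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"

job_spec = {
    "name": "nightly-data-cleansing",
    "tasks": [
        {
            "task_key": "clean_data",
            "notebook_task": {"notebook_path": "/Workspace/etl/clean_data"},
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    # Run every night at 2:00 AM UTC.
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
print(resp.json())  # Returns the new job_id on success.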
How do jobs compare to other automation tools?
Feature | Jobs | Workflows | Delta Live Tables (DLT)
Purpose | Automate & schedule tasks | Orchestrate multi-step pipelines | Declarative ETL pipelines
Best for | ETL, ML training, batch jobs | Complex DAGs | Data quality + streaming/batch ETL
Compute | Job clusters or all-purpose | Job clusters | Managed DLT clusters
Complexity | Low–medium | Medium–high | Medium
What is a “Delta Live Table”?
Delta Live Tables (DLT) is a Databricks feature that makes it much easier to build and run data pipelines, either batch or streaming. Simply write the transformations in SQL or Python, and DLT takes care of setting up the infrastructure, tracking dependencies, handling errors, and enforcing data quality rules through its built‑in expectations.
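As a minimal sketch of what that looks like in Python (the source path, table names, and expectation rule are invented for illustration, and spark is the session Databricks provides in the notebook):

import dlt
from pyspark.sql import functions as F

# Bronze: ingest the raw files as-is.
@dlt.table(comment="Raw orders, loaded as-is.")
def orders_bronze():
    return spark.read.format("json").load("/mnt/raw/orders")

# Silver: DLT drops rows that violate the expectation and records the counts.
@dlt.table(comment="Validated orders.")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def orders_silver():
    return dlt.read("orders_bronze").withColumn("ingested_at", F.current_timestamp())

When the pipeline runs, DLT builds the dependency graph from these definitions and reports how many rows each expectation dropped.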
Can you see what part of the ETL pipeline you're in?
Yes, you can see exactly which part of your ETL pipeline is running. This can be done in various ways depending on whether you’re using Jobs, Workflows, or Delta Live Tables (DLT). You get visibility through the Jobs UI, task‑level run details, and (for DLT) a full lineage graph and event log.
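Outside the UI, you can also poll the Jobs Runs API to see which task a run is currently on. A rough sketch, with a placeholder host, token, and run ID:

import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"

resp = requests.get(
    f"{host}/api/2.1/jobs/runs/get",
    headers={"Authorization": f"Bearer {token}"},
    params={"run_id": 123456},  # Placeholder run ID.
)
run = resp.json()
print(run["state"])  # Overall life-cycle and result state of the run.
for task in run.get("tasks", []):
    # Each task reports its own state, so you can see which step is active.
    print(task["task_key"], task["state"])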
What is the Medallion Architecture in Databricks?
This is a data design pattern for progressively cleansing data, represented by three layers:
- Bronze (raw data)
- Silver (validated)
- Gold (enriched)
The purpose is to improve data quality as data moves through the ETL pipeline, often in preparation for Machine Learning.
Reference: https://www.databricks.com/blog/what-is-medallion-architecture
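As a rough sketch of the three layers (the table and column names are invented for illustration, and spark is the session a Databricks notebook provides):

from pyspark.sql import functions as F

# Bronze: land the raw data without changes.
bronze = spark.read.format("json").load("/mnt/raw/customers")
bronze.write.format("delta").mode("overwrite").saveAsTable("customers_bronze")

# Silver: validate and standardize.
silver = (
    spark.table("customers_bronze")
    .dropDuplicates(["customer_id"])
    .filter(F.col("email").isNotNull())
)
silver.write.format("delta").mode("overwrite").saveAsTable("customers_silver")

# Gold: aggregate/enrich for analytics or ML features.
gold = silver.groupBy("country").agg(F.count("*").alias("customer_count"))
gold.write.format("delta").mode("overwrite").saveAsTable("customers_gold")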
What happens when an error occurs in a notebook used in an ETL pipeline?
Notebooks are one of the most common building blocks for ETL in Databricks. When an error occurs, the run does not continue silently. Databricks stops the pipeline at that step, logs the failure, and shows exactly what went wrong:
- The task status becomes “Failed”.
- The job run stops at that task unless retries are configured.
- Downstream tasks that depend on it are skipped.
- The cluster logs the notebook output and the exception.
- The overall job run is marked as Failed unless other branches succeed.
As a best practice, consider using the following to handle errors gracefully inside the notebook (see the sketch after this list):
- Try/except blocks
- Custom error messages
- Return codes
- dbutils.notebook.exit() with structured output
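A minimal sketch of that pattern, assuming a made-up silver-layer transformation:

import json

try:
    # Hypothetical transformation step; replace with your own logic.
    df = spark.table("customers_bronze").dropna(subset=["customer_id"])
    df.write.format("delta").mode("overwrite").saveAsTable("customers_silver")
    # Return a structured success message to the calling job.
    dbutils.notebook.exit(json.dumps({"status": "OK"}))
except Exception as e:
    # Return a structured failure message instead of letting the task die silently;
    # the caller (or a wrapper notebook) can decide whether to fail the run.
    dbutils.notebook.exit(json.dumps({"status": "FAILED", "error": str(e)}))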
How can you log errors from an ETL pipeline?
You can write errors to a variety of destinations, as shown in the examples below:
To write to the Databricks File System (DBFS) in Python:
from datetime import datetime

with open("/dbfs/mnt/logs/etl_log.txt", "a") as f:
    f.write("ETL step failed at: " + str(datetime.now()) + "\n")
To write to Cloud Object Storage in Python:
spark.sparkContext.parallelize(["error: null values found"]).saveAsTextFile("s3://my-bucket/logs/etl/errors/")
To write to Delta Tables in Python (assuming this runs inside an except block where e is the caught exception):
from datetime import datetime

spark.createDataFrame(
    [(str(e), "cleaning_step", datetime.now())],
    ["error_message", "step", "timestamp"]
).write.format("delta").mode("append").saveAsTable("etl_error_log")
To write small log files using dbutils.fs.put() in Python:
# With overwrite=False, this call fails if the file already exists.
dbutils.fs.put("/mnt/logs/etl_log.txt", "ETL started\n", overwrite=False)
Can I download Databricks to run locally?
No. Databricks is a cloud-based Data Intelligence Platform built on Apache Spark; it is not available as a local download.
Are there any certification programs?
Yes. Databricks offers role-based certifications through Databricks Academy, including the Databricks Certified Data Engineer Associate and Professional, the Machine Learning Associate and Professional, and the Data Analyst Associate. See https://www.databricks.com/learn/certification for the current list.