Databricks Q&A
On May 5th, I had the pleasure of presenting “Data Cleansing using Databricks” (https://www.meetup.com/ohio-north-database-training/events/314363881/).
During the meeting, many good questions were raised. Listed below are the answers to these questions:
What are jobs in Databricks?
Jobs are workloads that can be scheduled, managed, and automated without manual intervention. Workloads can be notebooks, SQL queries, or pipelines on a cluster.
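For illustration, here is a minimal sketch of creating a scheduled job through the Jobs REST API (version 2.1). The workspace URL, token, notebook path, and cluster settings are placeholders, not values from the talk:

import requests

# Placeholder workspace URL and personal access token.
host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"

job_spec = {
    "name": "nightly-data-cleansing",
    "tasks": [
        {
            "task_key": "clean_data",
            "notebook_task": {"notebook_path": "/Workspace/etl/clean_data"},
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    # Run every night at 2:00 AM UTC.
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
print(resp.json())  # Returns the new job_id on success.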
How do jobs compare to other automation tools?
Feature | Jobs | Workflows | Delta Live Tables (DLT)
Purpose | Automate & schedule tasks | Orchestrate multi-step pipelines | Declarative ETL pipelines
Best for | ETL, ML training, batch jobs | Complex DAGs | Data quality + streaming/batch ETL
Compute | Job clusters or all-purpose | Job clusters | Managed DLT clusters
Complexity | Low–medium | Medium–high | Medium
What is a “Delta Live Table”?
Delta Live Tables (DLT) is a Databricks feature that makes it much easier to build and run data pipelines, either batch or streaming. Simply write the transformations in SQL or Python, and DLT takes care of setting up the infrastructure, tracking dependencies, handling errors, and enforcing data quality rules through its built‑in expectations.
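As a minimal sketch of what that looks like in Python (the source path, table names, and expectation rule are invented for illustration, and spark is the session Databricks provides in the notebook):

import dlt
from pyspark.sql import functions as F

# Bronze: ingest the raw files as-is.
@dlt.table(comment="Raw orders, loaded as-is.")
def orders_bronze():
    return spark.read.format("json").load("/mnt/raw/orders")

# Silver: DLT drops rows that violate the expectation and records the counts.
@dlt.table(comment="Validated orders.")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def orders_silver():
    return dlt.read("orders_bronze").withColumn("ingested_at", F.current_timestamp())

When the pipeline runs, DLT builds the dependency graph from these definitions and reports how many rows each expectation dropped.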
Can you see what part of the ETL pipeline you're in?
Yes, you can see exactly which part of your ETL pipeline is running. This can be done in various ways depending on whether you’re using Jobs, Workflows, or Delta Live Tables (DLT). You get visibility through the Jobs UI, task‑level run details, and (for DLT) a full lineage graph and event log.
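Outside the UI, you can also poll the Jobs Runs API to see which task a run is currently on. A rough sketch, with a placeholder host, token, and run ID:

import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"

resp = requests.get(
    f"{host}/api/2.1/jobs/runs/get",
    headers={"Authorization": f"Bearer {token}"},
    params={"run_id": 123456},  # Placeholder run ID.
)
run = resp.json()
print(run["state"])  # Overall life-cycle and result state of the run.
for task in run.get("tasks", []):
    # Each task reports its own state, so you can see which step is active.
    print(task["task_key"], task["state"])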
What is the Medallion Architecture in Databricks?
This is a data design pattern for progressively cleansing data, represented by three layers:
- Bronze (raw data)
- Silver (validated)
- Gold (enriched)
The purpose is to improve data quality as data moves through the ETL pipeline, often in preparation for Machine Learning.
Reference: https://www.databricks.com/blog/what-is-medallion-architecture
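As a rough sketch of the three layers (the table and column names are invented for illustration, and spark is the session a Databricks notebook provides):

from pyspark.sql import functions as F

# Bronze: land the raw data without changes.
bronze = spark.read.format("json").load("/mnt/raw/customers")
bronze.write.format("delta").mode("overwrite").saveAsTable("customers_bronze")

# Silver: validate and standardize.
silver = (
    spark.table("customers_bronze")
    .dropDuplicates(["customer_id"])
    .filter(F.col("email").isNotNull())
)
silver.write.format("delta").mode("overwrite").saveAsTable("customers_silver")

# Gold: aggregate/enrich for analytics or ML features.
gold = silver.groupBy("country").agg(F.count("*").alias("customer_count"))
gold.write.format("delta").mode("overwrite").saveAsTable("customers_gold")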
What happens when an error occurs in a notebook used in an ETL pipeline?
Notebooks are one of the most common building blocks for ETL in Databricks. When an error occurs, the run does not continue silently. Databricks stops the pipeline at that step, logs the failure, and shows exactly what went wrong:
- The task status becomes “Failed”.
- The job run stops at that task unless retries are configured.
- Downstream tasks that depend on it are skipped.
- The cluster logs the notebook output and the exception.
- The overall job run is marked as Failed unless other branches succeed.
As a best practice, consider using the following to handle errors gracefully inside the notebook (see the sketch after this list):
- Try/except blocks
- Custom error messages
- Return codes
- dbutils.notebook.exit() with structured output
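A minimal sketch of that pattern, assuming a made-up silver-layer transformation:

import json

try:
    # Hypothetical transformation step; replace with your own logic.
    df = spark.table("customers_bronze").dropna(subset=["customer_id"])
    df.write.format("delta").mode("overwrite").saveAsTable("customers_silver")
    # Return a structured success message to the calling job.
    dbutils.notebook.exit(json.dumps({"status": "OK"}))
except Exception as e:
    # Return a structured failure message instead of letting the task die silently;
    # the caller (or a wrapper notebook) can decide whether to fail the run.
    dbutils.notebook.exit(json.dumps({"status": "FAILED", "error": str(e)}))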
How can you log errors from an ETL pipeline?
You can write errors to a variety of destinations, as shown in the examples below:
To write to the Databricks File System (DBFS) in Python:
from datetime import datetime

with open("/dbfs/mnt/logs/etl_log.txt", "a") as f:
    f.write("ETL step failed at: " + str(datetime.now()) + "\n")
To write to Cloud Object Storage in Python:
spark.sparkContext.parallelize(["error: null values found"]).saveAsTextFile("s3://my-bucket/logs/etl/errors/")
To write to Delta Tables in Python (assuming this runs inside an except block where e is the caught exception):
from datetime import datetime

spark.createDataFrame(
    [(str(e), "cleaning_step", datetime.now())],
    ["error_message", "step", "timestamp"]
).write.format("delta").mode("append").saveAsTable("etl_error_log")
To write small log files using dbutils.fs.put() in Python:
# With overwrite=False, this call fails if the file already exists.
dbutils.fs.put("/mnt/logs/etl_log.txt", "ETL started\n", overwrite=False)
Can I download Databricks to run locally?
No. Databricks is a cloud-based Data Intelligence Platform built on Apache Spark; it is not available as a local download.
Are there any certification programs?
Yes. Databricks offers role-based certifications through Databricks Academy, including the Databricks Certified Data Engineer Associate and Professional, the Machine Learning Associate and Professional, and the Data Analyst Associate. See https://www.databricks.com/learn/certification for the current list.