Databricks Certified Data Engineer Professional Practice Exams
Last updated on Apr 01, 2025
- Exam Code: Databricks Certified Data Engineer Professional
- Exam Name: Databricks Certified Data Engineer Professional Exam
- Certification Provider: Databricks
- Latest update: Apr 01, 2025
The following code has been migrated to a Databricks notebook from a legacy workload:
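The migrated code itself is not reproduced here. Judging from the answer options (which reference %sh, a Git clone, run.py, and a file move), a plausible reconstruction with hypothetical repository and path names is:

```
%sh
git clone https://github.com/example-org/data-loader.git
python ./data-loader/run.py
mv ./output /dbfs/mnt/new_data
```

Every command in a %sh cell runs as a shell process on the driver node only, so neither the clone, the script, nor the file move is distributed across the cluster.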
The code executes successfully and provides the logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.
Which statement is a possible explanation for this behavior?
- A . %sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.
- B . Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.
- C . %sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.
- D . Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.
- E . %sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.
Which is a key benefit of an end-to-end test?
- A . It closely simulates real world usage of your application.
- B . It pinpoints errors in the building blocks of your application.
- C . It provides testing coverage for all code paths and branches.
- D . It makes it easier to automate your test suite.
Which configuration parameter directly affects the size of a Spark partition upon ingestion of data into Spark?
- A . spark.sql.files.maxPartitionBytes
- B . spark.sql.autoBroadcastJoinThreshold
- C . spark.sql.files.openCostInBytes
- D . spark.sql.adaptive.coalescePartitions.minPartitionNum
- E . spark.sql.adaptive.advisoryPartitionSizeInBytes
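For context, spark.sql.files.maxPartitionBytes (128 MB by default) caps how many bytes of file data are packed into each partition when files are read into Spark. A minimal sketch, assuming a hypothetical Parquet path and run in a Databricks notebook where spark is predefined:

```python
# Lower the per-partition cap from the 128 MB default to 64 MB,
# so the same input files are split into more, smaller partitions.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))

df = spark.read.format("parquet").load("/mnt/raw/events")  # hypothetical path
print(df.rdd.getNumPartitions())  # reflects the new cap
```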
A Databricks SQL query is used to monitor temperature readings from sensors and to create alerts. The trigger condition for the alert is based on:
- A . The maximum temperature across all sensors
- B . The average temperature across all sensors
- C . The minimum temperature across any sensor
- D . The average temperature for at least one sensor
A data team’s Structured Streaming job is configured to calculate running aggregates for item sales to update a downstream marketing dashboard. The marketing team has introduced a new field to track the number of times a promotion code is used for each item. A junior data engineer suggests updating the existing query as follows. Note that the proposed changes are in bold.
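The query itself is not reproduced here. A plausible sketch of its shape, with hypothetical table and column names (the promo-code aggregate being the proposed addition), might be:

```python
from pyspark.sql.functions import count

(spark.readStream.table("items")                      # hypothetical source table
    .groupBy("item_id")
    .agg(count("item_id").alias("total_sales"),
         count("promo_code").alias("promo_uses"))     # the proposed new aggregate
    .writeStream
    .option("checkpointLocation", "/item_agg/_checkpoint")
    .outputMode("complete")
    .toTable("item_agg"))
```

Structured Streaming does not allow the number or type of aggregations to change between restarts from the same checkpoint, so a change like this cannot resume from the existing state.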
Which step must also be completed to put the proposed query into production?
- A . Increase the shuffle partitions to account for additional aggregates
- B . Specify a new checkpoint location
- C . Run REFRESH TABLE delta.`/item_agg`
- D . Remove .option("mergeSchema", "true") from the streaming write
A junior developer working from a personal branch finds that their notebook contains an old version of the logic. Which approach allows them to review the current logic?
- A . Pull changes from the remote Git repository and switch to the branch dev-2.3.9
- B . Make a pull request using the Databricks REST API to update the current branch to dev-2.3.9
- C . Checkout the dev-2.3.9 branch directly and auto-resolve conflicts
- D . Merge all changes back to the main branch in the remote Git repository and clone the repo again
A Delta Lake table representing metadata about content from users has the following schema:
Based on the above schema, which column is a good candidate for partitioning the Delta table?
- A . date
- B . post_id
- C . user_id
- D . post_time
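The schema itself is not reproduced above; the options suggest it includes columns such as user_id, post_id, post_time, and a derived date. A low-cardinality column that is frequently filtered on, like date, is generally the best partition key, whereas high-cardinality identifiers produce many small files. A minimal sketch with hypothetical names:

```python
# Partition the Delta table on the low-cardinality date column.
(spark.table("user_content")            # hypothetical source table
    .write.format("delta")
    .partitionBy("date")
    .saveAsTable("user_content_by_date"))
```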
Two of the most common data locations on Databricks are the DBFS root storage and external object storage mounted with dbutils.fs.mount().
Which of the following statements is correct?
- A . DBFS is a file system protocol that allows users to interact with files stored in object storage using syntax and guarantees similar to Unix file systems.
- B . By default, both the DBFS root and mounted data sources are only accessible to workspace administrators.
- C . The DBFS root is the most secure location to store data, because mounted storage volumes must have full public read and write permissions.
- D . Neither the DBFS root nor mounted storage can be accessed when using %sh in a Databricks notebook.
- E . The DBFS root stores files in ephemeral block volumes attached to the driver, while mounted directories will always persist saved data to external storage between sessions.
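For reference, mounting external object storage and browsing it alongside the DBFS root looks like the following, with hypothetical bucket and mount-point names (dbutils and display are predefined in Databricks notebooks):

```python
# Mount an external bucket under /mnt; once mounted, the path is
# addressable through DBFS, including under /dbfs from %sh via FUSE.
dbutils.fs.mount(
    source="s3a://example-bucket",     # hypothetical bucket
    mount_point="/mnt/example-data",
)
display(dbutils.fs.ls("/mnt/example-data"))
```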
A user wants to use DLT expectations to validate that a derived table, report, contains all records from the source, which are included in the table validation_copy.
The user attempts and fails to accomplish this by adding an expectation to the report table definition.
Which approach would allow using DLT expectations to validate that all expected records are present in this table?
- A . Define a SQL UDF that performs a left outer join on two tables, and check if this returns null values for report key values in a DLT expectation for the report table.
- B . Define a function that performs a left outer join on validation_copy and report, and check against the result in a DLT expectation for the report table
- C . Define a temporary table that performs a left outer join on validation_copy and report, and define an expectation that no report key values are null
- D . Define a view that performs a left outer join on validation_copy and report, and reference this view in DLT expectations for the report table
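A minimal sketch of the view-based approach, assuming the Python DLT API and a hypothetical join key named key:

```python
import dlt
from pyspark.sql.functions import col

@dlt.view
def validation_check():
    # Left join the source copy to the derived report; a null report-side
    # key marks a record that is missing from report.
    v = dlt.read("validation_copy").select(col("key").alias("source_key"))
    r = dlt.read("report").select(col("key").alias("report_key"))
    return v.join(r, v.source_key == r.report_key, "left")

@dlt.table
@dlt.expect_all_or_fail({"all_records_present": "report_key IS NOT NULL"})
def report_completeness():
    return dlt.read("validation_check")
```

Because the expectation is evaluated against the join result rather than against report alone, it can detect records that are absent from report, which a row-level expectation defined directly on report cannot do.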