Databricks-Certified-Professional-Data-Engineer Databricks Exam Lab Questions

Page: 5 / 8

Question 20

An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable:

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.

If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?

Options:

Each write to the orders table will only contain unique records, and only those records without duplicates in the target table will be written.

Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.

Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, these records will be overwritten.

Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, the operation will tail.

Each write to the orders table will run deduplication over the union of new and existing records, ensuring no duplicate records are present.

Question 21

Which statement describes integration testing?

Options:

Validates interactions between subsystems of your application

Requires an automated testing framework

Requires manual intervention

Validates an application use case

Validates behavior of individual elements of your application

Question 22

A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job:

Which statement describes the execution and results of running the above query multiple times?

Options:

Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.

Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.

Each time the job is executed, the target table will be overwritten using the entire history of inserted or updated records, giving the desired result.

Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.

Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table giving the desired result.

Question 23

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

Options:

Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.

Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.

Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB bytes, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.

Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB* 1024*1024/512), and then write to parquet.

Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.