The table is created without the LOCATION clause, which means that it’s a managed table. Managed tables are tables whose metadata and data are managed by Databricks.
When you run DROP TABLE on a managed table, both the metadata and the underlying data files are deleted.
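For illustration, a minimal sketch of this behavior (the table name and columns are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical managed table: no LOCATION clause, so Databricks manages
# both the metadata and the underlying data files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_managed (
        id     INT,
        amount DOUBLE
    )
""")

# Dropping a managed table removes the metadata AND deletes the data files.
spark.sql("DROP TABLE sales_managed")
```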
To restart streaming queries on failure, it is recommended to configure Structured Streaming jobs with the following settings (a hedged example of such a job definition follows the reference link below):
- Retries: Set to Unlimited.
- Maximum concurrent runs: Set to 1. There must be only one instance of each query concurrently active.
- Cluster: Set this to always use a new job cluster and use the latest Spark version (or at least version 2.1). Queries started in Spark 2.1 and above are recoverable after query and Spark version upgrades.
- Notifications: Set this if you want email notification on failures.
- Schedule: Do not set a schedule.
- Timeout: Do not set a timeout. Streaming queries run for an indefinitely long time.
https://docs.databricks.com/structured-streaming/query-recovery.html#configure-structured-streaming-jobs-to-restart-streaming-queries-on-failure
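As a sketch only, here is what a job definition reflecting these recommendations could look like when submitted through the Databricks Jobs API 2.1. The workspace URL, token, job name, notebook path, and cluster settings are all hypothetical, and the field names should be verified against the current Jobs API reference:

```python
import requests

# Hypothetical values -- replace with your workspace URL and token.
DATABRICKS_HOST = "https://<workspace-url>"
TOKEN = "<personal-access-token>"

# Sketch of a Jobs API 2.1 payload reflecting the recommendations above.
job_spec = {
    "name": "structured-streaming-query",
    "max_concurrent_runs": 1,               # only one active instance of the query
    "tasks": [
        {
            "task_key": "run_stream",
            "notebook_task": {"notebook_path": "/Repos/project/streaming_notebook"},
            "new_cluster": {                 # always use a new job cluster
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
            "max_retries": -1,               # -1 = retry indefinitely on failure
        }
    ],
    "email_notifications": {"on_failure": ["data-team@example.com"]},
    # No "schedule" and no "timeout_seconds": streaming queries run indefinitely.
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # returns the new job_id
```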
A data engineer has an MLflow model logged at a given “model_url”. They have registered the model as a Spark UDF using code like the following:
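The original snippet is not reproduced here; below is a minimal sketch of how such a registration typically looks with `mlflow.pyfunc.spark_udf`. The model URI value, input table, feature column names, and result type are assumptions:

```python
import mlflow.pyfunc
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct

spark = SparkSession.builder.getOrCreate()

# Hypothetical model URI standing in for the question's "model_url".
model_url = "models:/my_model/1"

# Wrap the logged MLflow model as a Spark UDF.
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_url, result_type="double")

# Hypothetical usage: apply the UDF to feature columns of a DataFrame.
df = spark.table("features_table")                      # assumed input table
predictions = df.withColumn(
    "prediction",
    predict_udf(struct("feature_1", "feature_2"))       # assumed feature columns
)
```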
In Delta Lake tables, which of the following is the file format for the transaction log?
- Delta
- Parquet
- JSON (Incorrect)
- Hive-specific format
- Both Parquet and JSON (Correct)
Delta Lake builds upon standard data formats. A Delta Lake table is stored on the underlying storage as one or more data files in Parquet format, along with transaction logs in JSON format.
In addition, Databricks automatically creates Parquet checkpoint files every 10 commits to accelerate the resolution of the current table state.
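For illustration, a sketch of inspecting the transaction log of a hypothetical table path from a Databricks notebook (where `dbutils` is available; the table path and the exact file names shown are assumptions):

```python
# List the transaction log of a hypothetical Delta table path.
log_files = dbutils.fs.ls("dbfs:/user/hive/warehouse/sales_managed/_delta_log/")
for f in log_files:
    print(f.name)

# The listing typically mixes JSON commit files with periodic Parquet checkpoints, e.g.:
#   00000000000000000000.json
#   ...
#   00000000000000000010.checkpoint.parquet
#   _last_checkpoint
```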
To perform streaming deduplication, we use the dropDuplicates() function to eliminate duplicate records within each new micro-batch. In addition, we need to ensure that records to be inserted are not already in the target table; we can achieve this using an insert-only merge (see the sketch after the references below).
Reference:
https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.dropDuplicates.html
https://docs.databricks.com/delta/merge.html#data-deduplication-when-writing-into-delta-tables
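A minimal sketch of this pattern, assuming a target Delta table `events`, a streaming source table `events_raw`, key columns `user_id` and `event_time`, and a hypothetical checkpoint path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def upsert_batch(microbatch_df, batch_id):
    # Drop duplicates within the micro-batch on the assumed key columns.
    deduped = microbatch_df.dropDuplicates(["user_id", "event_time"])
    deduped.createOrReplaceTempView("updates")
    # Insert-only merge: rows already present in the target are left untouched.
    deduped.sparkSession.sql("""
        MERGE INTO events AS target
        USING updates AS source
        ON target.user_id = source.user_id
           AND target.event_time = source.event_time
        WHEN NOT MATCHED THEN INSERT *
    """)

(spark.readStream
    .table("events_raw")                                             # assumed streaming source
    .writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/tmp/checkpoints/events_dedup")   # assumed path
    .start())
```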
They want to remove the previous 2 years of data from the table without breaking the append-only requirement of streaming sources.