The table is created without a LOCATION clause, which means it is a managed table. Managed tables are tables whose metadata and underlying data files are both managed by Databricks.

When you run DROP TABLE on a managed table, both the metadata and the underlying data files are deleted.
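A minimal sketch of this behavior, assuming a Databricks notebook where the global spark session is available; the table name and columns are illustrative only:

```python
# Create a managed table: no LOCATION clause, so Databricks manages both the
# metadata and the underlying data files (defaults to Delta on Databricks).
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_managed (
        id     BIGINT,
        amount DOUBLE
    )
""")

# DESCRIBE DETAIL shows the table format and the storage location Databricks chose.
spark.sql("DESCRIBE DETAIL sales_managed").select("format", "location").show(truncate=False)

# Dropping a managed table removes the metadata AND deletes the data files
# stored at that location.
spark.sql("DROP TABLE sales_managed")
```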

Delta Lake builds on standard, open data formats. A Delta Lake table is stored as one or more data files in Parquet format, along with a transaction log made up of commit files in JSON format.

In addition, Databricks automatically writes a Parquet checkpoint file every 10 commits, which consolidates the transaction log and accelerates the computation of the current table state.
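As a hedged illustration of this layout, assuming a Delta table named events and a Databricks notebook where spark, dbutils, and display are available (all names and paths here are illustrative):

```python
# Look up where the (hypothetical) `events` table is stored.
table_path = spark.sql("DESCRIBE DETAIL events").first()["location"]

# Top level: one or more Parquet data files plus the _delta_log directory.
display(dbutils.fs.ls(table_path))
# e.g. part-00000-...-c000.snappy.parquet, _delta_log/

# Transaction log: one JSON file per commit, plus a Parquet checkpoint file
# written every 10 commits (e.g. 00000000000000000010.checkpoint.parquet).
display(dbutils.fs.ls(f"{table_path}/_delta_log"))
```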

To perform streaming deduplication, we use the dropDuplicates() function to eliminate duplicate records within each new micro-batch. In addition, we need to ensure that the records to be inserted are not already present in the target table. We can achieve this with an insert-only merge.
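A minimal sketch of this pattern, assuming a source table bronze_events, a target table silver_events, key columns user_id and event_time, and a recent Databricks Runtime (DataFrame.sparkSession needs Spark 3.3+); all of these names are assumptions for illustration:

```python
from pyspark.sql import DataFrame

def insert_only_merge(micro_batch_df: DataFrame, batch_id: int) -> None:
    # Expose the deduplicated micro-batch to SQL; the MERGE must run on the
    # same session that owns this DataFrame, hence micro_batch_df.sparkSession.
    micro_batch_df.createOrReplaceTempView("updates")
    micro_batch_df.sparkSession.sql("""
        MERGE INTO silver_events AS t
        USING updates AS s
        ON t.user_id = s.user_id AND t.event_time = s.event_time
        WHEN NOT MATCHED THEN INSERT *   -- insert-only: no updates, no deletes
    """)

(spark.readStream.table("bronze_events")
    # dropDuplicates removes duplicates within each micro-batch; the watermark
    # lets Spark also track keys across micro-batches within the stated window.
    .withWatermark("event_time", "30 seconds")
    .dropDuplicates(["user_id", "event_time"])
    .writeStream
    .foreachBatch(insert_only_merge)
    .outputMode("update")
    .option("checkpointLocation", "/tmp/checkpoints/silver_events")  # illustrative path
    .start())
```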

Reference:

https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.dropDuplicates.html

https://docs.databricks.com/delta/merge.html#data-deduplication-when-writing-into-delta-tables

They want to remove the previous 2 years of data from the table without breaking the append-only requirement of streaming sources.