Latest Jun 04, 2024 Databricks-Certified-Data-Engineer-Associate Brain Dump: A Study Guide with Tips & Tricks for passing Exam [Q15-Q37]


Databricks-Certified-Data-Engineer-Associate Question Bank: Free PDF Download Recently Updated Questions

NEW QUESTION # 15
A data engineer has a single-task Job that runs each morning before they begin working. After identifying an upstream data issue, they need to set up another task to run a new notebook prior to the original task.
Which of the following approaches can the data engineer use to set up the new task?

  • A. They can clone the existing task in the existing Job and update it to run the new notebook.
  • B. They can create a new job from scratch and add both tasks to run concurrently.
  • C. They can create a new task in the existing Job and then add it as a dependency of the original task.
  • D. They can clone the existing task to a new Job and then edit it to run the new notebook.
  • E. They can create a new task in the existing Job and then add the original task as a dependency of the new task.

Answer: C
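
Making the new notebook run first means the original task must list the new task as its dependency. As a rough sketch of that relationship, the dictionary below follows the shape of a multi-task Job definition in the Databricks Jobs API, where ordering is expressed through each task's `depends_on` list. The job name, task keys, and notebook paths are hypothetical, and `run_order` is only a helper written for this illustration.

```python
# Hypothetical two-task Job definition, in the shape used by the Databricks Jobs API.
job_settings = {
    "name": "morning_job",
    "tasks": [
        {
            "task_key": "fix_upstream_data",  # the new task
            "notebook_task": {"notebook_path": "/Repos/project/fix_upstream_data"},
        },
        {
            "task_key": "original_task",
            "notebook_task": {"notebook_path": "/Repos/project/original_notebook"},
            # The original task waits for the new task to finish.
            "depends_on": [{"task_key": "fix_upstream_data"}],
        },
    ],
}


def run_order(tasks):
    """Return task keys in an order that respects depends_on."""
    done, order = set(), []
    pending = list(tasks)
    while pending:
        progressed = False
        for task in list(pending):
            deps = {d["task_key"] for d in task.get("depends_on", [])}
            if deps <= done:
                order.append(task["task_key"])
                done.add(task["task_key"])
                pending.remove(task)
                progressed = True
        if not progressed:
            raise ValueError("circular or missing dependency")
    return order


print(run_order(job_settings["tasks"]))
```

The new task appears first in the resulting order because the original task names it in `depends_on`.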


NEW QUESTION # 16
Which of the following tools is used by Auto Loader to process data incrementally?

  • A. Checkpointing
  • B. Data Explorer
  • C. Databricks SQL
  • D. Unity Catalog
  • E. Spark Structured Streaming

Answer: E

Explanation:
Auto Loader provides a Structured Streaming source called cloudFiles that can process new data files as they arrive in cloud storage without any additional setup. Auto Loader uses a scalable key-value store to track ingestion progress and ensure exactly-once semantics. Auto Loader can ingest various file formats and load them into Delta Lake tables. Auto Loader is recommended for incremental data ingestion with Delta Live Tables, which extends the functionality of Structured Streaming and allows you to write declarative Python or SQL code to deploy a production-quality data pipeline. References: What is Auto Loader?, What is Auto Loader? | Databricks on AWS, Solved: How does Auto Loader ingest data? - Databricks - 5629
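
A minimal sketch of the cloudFiles source described above, assuming a Databricks runtime where `spark` is an active SparkSession. The function name, the JSON file format, and both paths are assumptions made for illustration.

```python
def read_new_files(spark, source_path, schema_path):
    """Return a streaming DataFrame that picks up new JSON files with Auto Loader."""
    return (
        spark.readStream
        .format("cloudFiles")                              # Auto Loader's Structured Streaming source
        .option("cloudFiles.format", "json")               # format of the incoming files (assumed)
        .option("cloudFiles.schemaLocation", schema_path)  # where the inferred schema is tracked
        .load(source_path)
    )
```

Because the reader starts from `spark.readStream`, the ingestion is a Structured Streaming query, which is why option E is the answer.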


NEW QUESTION # 17
A data analyst has created a Delta table sales that is used by the entire data analysis team. They want help from the data engineering team to implement a series of tests to ensure the data is clean. However, the data engineering team uses Python for its tests rather than SQL.
Which of the following commands could the data engineering team use to access sales in PySpark?

  • A. SELECT * FROM sales
  • B. spark.sql("sales")
  • C. spark.table("sales")
  • D. spark.delta.table("sales")
  • E. There is no way to share data between PySpark and SQL.

Answer: C

Explanation:
https://spark.apache.org/docs/3.2.1/api/python/reference/api/pyspark.sql.SparkSession.table.html


NEW QUESTION # 18
An engineering manager wants to monitor the performance of a recent project using a Databricks SQL query. For the first week following the project's release, the manager wants the query results to be updated every minute. However, the manager is concerned that the compute resources used for the query will be left running and cost the organization a lot of money beyond the first week of the project's release.
Which of the following approaches can the engineering team use to ensure the query does not cost the organization any money beyond the first week of the project's release?

  • A. They can set the query's refresh schedule to end on a certain date in the query scheduler.
  • B. They cannot ensure the query does not cost the organization money beyond the first week of the project's release.
  • C. They can set the query's refresh schedule to end after a certain number of refreshes.
  • D. They can set a limit to the number of DBUs that are consumed by the SQL Endpoint.
  • E. They can set a limit to the number of individuals that are able to manage the query's refresh schedule.

Answer: A

Explanation:
In Databricks SQL, you can use scheduled query executions to update your dashboards or enable routine alerts. By default, queries do not have a schedule. To set one, use the dropdown pickers to specify the frequency, period, starting time, and time zone. You can also end the schedule on a certain date by selecting the End date checkbox and picking a date from the calendar. This way, the query does not run beyond the first week of the project's release and does not incur any additional cost. Option B is incorrect, as there is a way to ensure the query does not cost the organization money beyond the first week of the project's release. Option C is incorrect, as there is no option to end the schedule after a certain number of refreshes. Option D is incorrect, as setting a limit to the number of DBUs does not stop the query from running. Option E is incorrect, as setting a limit to the number of individuals who can manage the query's refresh schedule does not affect the query's execution or cost. Reference: Schedule a query, Schedule a query - Azure Databricks - Databricks SQL


NEW QUESTION # 19
Which of the following describes a scenario in which a data engineer will want to use a single-node cluster?

  • A. When they are working with SQL within Databricks SQL
  • B. When they are running automated reports to be refreshed as quickly as possible
  • C. When they are manually running reports with a large amount of data
  • D. When they are concerned about the ability to automatically scale with larger data
  • E. When they are working interactively with a small amount of data

Answer: E

Explanation:
A Single Node cluster is a cluster consisting of an Apache Spark driver and no Spark workers. A Single Node cluster supports Spark jobs and all Spark data sources, including Delta Lake. A Standard cluster requires a minimum of one Spark worker to run Spark jobs.


NEW QUESTION # 20
A data engineer is using the following code block as part of a batch ingestion pipeline to read from a composable table:

Which of the following changes needs to be made so this code block will work when the transactions table is a stream source?

  • A. Replace predict with a stream-friendly prediction function
  • B. Replace spark.read with spark.readStream
  • C. Replace "transactions" with the path to the location of the Delta table
  • D. Replace schema(schema) with option ("maxFilesPerTrigger", 1)
  • E. Replace format("delta") with format("stream")

Answer: B

Explanation:
To read from a stream source, the data engineer needs to use the spark.readStream method instead of the spark.read method. The spark.readStream method returns a DataStreamReader object that can be used to specify the details of the input source, such as the format, the schema, the path, and the options. The spark.read method is only suitable for batch processing, not streaming processing. The other changes are not necessary or correct for reading from a stream source. References: Structured Streaming Programming Guide, Read a stream, Databricks Data Sources
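
The code block from the question is not reproduced in this copy, so the following is only a sketch of the batch versus streaming difference, assuming `spark` is an active SparkSession and `transactions` is a registered Delta table. The function name and the `streaming` flag are hypothetical.

```python
def read_transactions(spark, streaming=False):
    """Read the transactions table as a batch or a streaming DataFrame."""
    # spark.read gives a DataFrameReader (batch);
    # spark.readStream gives a DataStreamReader (streaming).
    reader = spark.readStream if streaming else spark.read
    return reader.table("transactions")
```

Only the entry point changes between the two modes; the table name stays the same.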


NEW QUESTION # 21
In which of the following scenarios should a data engineer use the MERGE INTO command instead of the INSERT INTO command?

  • A. When the target table is an external table
  • B. When the target table cannot contain duplicate records
  • C. When the location of the data needs to be changed
  • D. When the source is not a Delta table
  • E. When the source table can be deleted

Answer: B

Explanation:
The MERGE INTO command is used to perform upserts, which are a combination of insertions and updates, based on a source table into a target Delta table [1]. The MERGE INTO command can handle scenarios where the target table cannot contain duplicate records, such as when rows must stay unique on a key column. The MERGE INTO command can match the source and target rows based on a merge condition and perform different actions depending on whether the rows are matched or not. For example, the MERGE INTO command can update the existing target rows with the new source values, insert the new source rows that do not exist in the target table, or delete the target rows that do not exist in the source table [1].
The INSERT INTO command is used to append new rows to an existing table [2]. It does not perform any updates or deletions on the existing target table rows, so rerunning it with overlapping source data produces duplicates. Moving data from one table to another does not require MERGE INTO [2]. INSERT INTO also works when the target table is an external table, such as when the data is stored in Amazon S3 or Azure Blob Storage [3], when the source table can be deleted afterwards, such as a temporary table or a view [4], and when the source is not a Delta table, such as a Parquet, CSV, JSON, or Avro file [5]. None of those scenarios calls for MERGE INTO.
References:
* [1] MERGE INTO | Databricks on AWS
* [2] INSERT INTO | Databricks on AWS
* [3] External tables | Databricks on AWS
* [4] Temporary views | Databricks on AWS
* [5] Data sources | Databricks on AWS
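
The difference between appending and upserting can be illustrated without a cluster. The sketch below is a plain-Python model of the two behaviors, not Databricks code; the function names and sample rows are made up for the example.

```python
def insert_into(target, source):
    """INSERT INTO semantics: append every source row, duplicates included."""
    return target + source


def merge_into(target, source, key):
    """MERGE INTO semantics: update matched rows, insert unmatched ones."""
    merged = {row[key]: row for row in target}
    for row in source:
        merged[row[key]] = row  # matched key: update; unmatched key: insert
    return list(merged.values())


target = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
source = [{"id": 2, "amount": 25}, {"id": 3, "amount": 30}]

print(len(insert_into(target, source)))      # 4 rows, id 2 appears twice
print(merge_into(target, source, key="id"))  # 3 rows, id 2 updated to 25
```

With the append, the row for id 2 is duplicated; with the merge, it is updated in place, which is why MERGE INTO fits a target that cannot contain duplicates.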


NEW QUESTION # 22
Which of the following describes when to use the CREATE STREAMING LIVE TABLE (formerly CREATE INCREMENTAL LIVE TABLE) syntax over the CREATE LIVE TABLE syntax when creating Delta Live Tables (DLT) tables using SQL?

  • A. CREATE STREAMING LIVE TABLE should be used when the previous step in the DLT pipeline is static.
  • B. CREATE STREAMING LIVE TABLE should be used when the subsequent step in the DLT pipeline is static.
  • C. CREATE STREAMING LIVE TABLE should be used when data needs to be processed incrementally.
  • D. CREATE STREAMING LIVE TABLE is redundant for DLT and it does not need to be used.
  • E. CREATE STREAMING LIVE TABLE should be used when data needs to be processed through complicated aggregations.

Answer: C


NEW QUESTION # 23
A data engineer who is new to using Python needs to create a Python function to add two integers together and return the sum.
Which of the following code blocks can the data engineer use to complete this task?

  • A.
  • B.
  • C.
  • D.
  • E.

Answer: B

Explanation:
https://www.w3schools.com/python/python_functions.asp
https://www.geeksforgeeks.org/python-functions/
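
The answer options are not reproduced in this copy. For orientation only, a correct function of the kind the question asks for uses `def`, takes two parameters, and uses `return` rather than `print`. The name `add_integers` is an assumption.

```python
def add_integers(x, y):
    """Add two integers and return the sum."""
    return x + y


print(add_integers(2, 3))  # 5
```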


NEW QUESTION # 24
A data engineer has been using a Databricks SQL dashboard to monitor the cleanliness of the input data to an ELT job. The ELT job has its Databricks SQL query that returns the number of input records containing unexpected NULL values. The data engineer wants their entire team to be notified via a messaging webhook whenever this value reaches 100.
Which of the following approaches can the data engineer use to notify their entire team via a messaging webhook whenever the number of NULL values reaches 100?

  • A. They can set up an Alert with a new email alert destination.
  • B. They can set up an Alert with one-time notifications.
  • C. They can set up an Alert without notifications.
  • D. They can set up an Alert with a new webhook alert destination.
  • E. They can set up an Alert with a custom template.

Answer: D

Explanation:
A webhook alert destination is a way to send notifications to external applications or services via HTTP requests. A data engineer can use a webhook alert destination to notify their entire team via a messaging webhook, such as Slack or Microsoft Teams, whenever the number of NULL values in the input data reaches 100. To set up a webhook alert destination, the data engineer needs to do the following steps:
In the Databricks SQL workspace, navigate to the Settings gear icon and select SQL Admin Console.
Click Alert Destinations and click Add New Alert Destination.
Select Webhook and enter the webhook URL and the optional custom template for the notification message.
Click Create to save the webhook alert destination.
In the Databricks SQL editor, create or open the query that returns the number of input records containing unexpected NULL values.
Click the Create Alert icon above the editor window and configure the alert criteria, such as the value column, the condition, and the threshold.
In the Notification section, select the webhook alert destination that was created earlier and click Create Alert.
Reference: What are Databricks SQL alerts?, Monitor alerts, Monitoring Your Business with Alerts, Using Automation Runbook Webhooks To Alert on Databricks Status Updates.
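
Databricks sends the notification itself once the destination is configured, so no code is required for the steps above. Purely to illustrate what "a notification via an HTTP request" means, the sketch below builds (but does not send) a POST request when a count reaches the threshold. The function name, the URL, and the `{"text": ...}` payload shape are assumptions, not the format Databricks uses.

```python
import json
import urllib.request


def build_alert_request(webhook_url, null_count, threshold=100):
    """Return a POST request for the webhook if the threshold is reached, else None."""
    if null_count < threshold:
        return None
    payload = {"text": f"NULL-value count reached {null_count} (threshold {threshold})"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_alert_request("https://example.com/hook", null_count=100)
print(request.get_method(), request.full_url)
```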


NEW QUESTION # 25
A data engineer runs a statement every day to copy the previous day's sales into the table transactions. Each day's sales are in their own file in the location "/transactions/raw".
Today, the data engineer runs the following command to complete this task:

After running the command today, the data engineer notices that the number of records in table transactions has not changed.
Which of the following describes why the statement might not have copied any new records into the table?

  • A. The previous day's file has already been copied into the table.
  • B. The format of the files to be copied were not included with the FORMAT_OPTIONS keyword.
  • C. The COPY INTO statement requires the table to be refreshed to view the copied rows.
  • D. The PARQUET file format does not support COPY INTO.
  • E. The names of the files to be copied were not included with the FILES keyword.

Answer: A

Explanation:
https://docs.databricks.com/en/ingestion/copy-into/index.html
The COPY INTO SQL command lets you load data from a file location into a Delta table. This is a re-triable and idempotent operation; files in the source location that have already been loaded are skipped. If there are no new records, the only consistent choice is A: no new files were loaded because already loaded files were skipped.


NEW QUESTION # 26
A data analyst has developed a query that runs against a Delta table. They want help from the data engineering team to implement a series of tests to ensure the data returned by the query is clean. However, the data engineering team uses Python for its tests rather than SQL.
Which of the following operations could the data engineering team use to run the query and operate with the results in PySpark?

  • A. spark.sql
  • B. SELECT * FROM sales
  • C. spark.table
  • D. spark.delta.table
  • E. There is no way to share data between PySpark and SQL.

Answer: A

Explanation:
The spark.sql operation allows the data engineering team to run a SQL query and return the result as a PySpark DataFrame. This way, the data engineering team can use the same query that the data analyst has developed and operate with the results in PySpark. For example, the data engineering team can use spark.sql("SELECT * FROM sales") to get a DataFrame of all the records from the sales Delta table, and then apply various tests or transformations using PySpark APIs. The other options are either not PySpark operations (B), not able to run a SQL query (C, which only loads a table by name), not valid operations (D), or simply false (E). Reference: Databricks Documentation - Run SQL queries, Databricks Documentation - Spark SQL and DataFrames.
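
A short sketch of running the analyst's query from Python and testing the result, assuming `spark` is an active SparkSession. The function name and the `amount` column are hypothetical.

```python
def count_null_amounts(spark):
    """Run a SQL query through PySpark and test the result in Python."""
    df = spark.sql("SELECT * FROM sales")       # returns a PySpark DataFrame
    return df.filter("amount IS NULL").count()  # 'amount' is a hypothetical column
```

A Python test could then assert that the returned count is zero.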


NEW QUESTION # 27
A data engineer is maintaining a data pipeline. Upon data ingestion, the data engineer notices that the source data is starting to have a lower level of quality. The data engineer would like to automate the process of monitoring the quality level.
Which of the following tools can the data engineer use to solve this problem?

  • A. Delta Live Tables
  • B. Delta Lake
  • C. Auto Loader
  • D. Data Explorer
  • E. Unity Catalog

Answer: A

Explanation:
Delta Live Tables is a tool that enables data engineers to build and manage reliable data pipelines with minimal code. One of the features of Delta Live Tables is data quality monitoring, which allows data engineers to define quality expectations for their data and automatically check them at every step of the pipeline. Data quality monitoring can help detect and resolve data quality issues, such as missing values, duplicates, outliers, or schema changes. Data quality monitoring can also generate alerts and reports on the quality level of the data, and enable data engineers to troubleshoot and fix problems quickly. Reference: Delta Live Tables Overview, Data Quality Monitoring


NEW QUESTION # 28
A data engineer is working with two tables. Each of these tables is displayed below in its entirety.
The data engineer runs the following query to join these tables together:
Which of the following will be returned by the above query?

  • A. Option B
  • B. Option E
  • C. Option C
  • D. Option A
  • E. Option D

Answer: D

Explanation:
Choice D, the result labeled Option A, is correct because it shows the result of an INNER JOIN between the two tables. An INNER JOIN returns only the rows that have matching values in both tables based on the join condition. In this case, the join condition is ON a.customer_id = c.customer_id, which means that only the rows that have the same customer ID in both tables will be included in the output. The output will have four columns: customer_id, name, account_id, and overdraft_amt. The output will have four rows, corresponding to the four customers who have accounts in the account table.
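
The tables and query from the question are not reproduced in this copy, so the sketch below uses made-up rows to show the INNER JOIN rule in plain Python: only keys present on both sides survive. The sample data and the `inner_join` helper are hypothetical.

```python
customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Ben"},
    {"customer_id": 3, "name": "Cy"},
]
accounts = [
    {"customer_id": 1, "account_id": 101},
    {"customer_id": 3, "account_id": 103},
    {"customer_id": 4, "account_id": 104},
]


def inner_join(left, right, key):
    """Return merged rows for every left/right pair whose key values match."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


for row in inner_join(customers, accounts, "customer_id"):
    print(row)
```

Customer 2 has no account and account 104 has no customer, so neither appears in the joined output.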


NEW QUESTION # 29
A data engineer runs a statement every day to copy the previous day's sales into the table transactions. Each day's sales are in their own file in the location "/transactions/raw".
Today, the data engineer runs the following command to complete this task:

After running the command today, the data engineer notices that the number of records in table transactions has not changed.
Which of the following describes why the statement might not have copied any new records into the table?

  • A. The previous day's file has already been copied into the table.
  • B. The format of the files to be copied were not included with the FORMAT_OPTIONS keyword.
  • C. The COPY INTO statement requires the table to be refreshed to view the copied rows.
  • D. The PARQUET file format does not support COPY INTO.
  • E. The names of the files to be copied were not included with the FILES keyword.

Answer: A

Explanation:
The COPY INTO statement is an idempotent operation, which means that it skips any files that have already been loaded into the target table [1]. This ensures that the data is not duplicated by repeated attempts to load the same file. Therefore, if the previous day's file had already been copied into the table before today's run, the statement finds no new files in the source location and the record count does not change [2]. The FILES and PATTERN keywords are optional filters on which files to load, so omitting them does not prevent a load. Reference: [1] COPY INTO | Databricks on AWS; [2] Get started using COPY INTO to load data | Databricks on AWS
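
The skip-already-loaded behavior can be modeled in a few lines. This is a toy model in plain Python, not how Databricks implements COPY INTO; the class name and file name are made up for the example.

```python
class CopyIntoTarget:
    """Toy model of a table that remembers which source files it has loaded."""

    def __init__(self):
        self.rows = []
        self.loaded_files = set()

    def copy_into(self, files):
        """Load rows from files not seen before; return the number of rows added."""
        added = 0
        for name, rows in files.items():
            if name in self.loaded_files:
                continue  # already loaded: skipped
            self.rows.extend(rows)
            self.loaded_files.add(name)
            added += len(rows)
        return added


table = CopyIntoTarget()
raw = {"2024-06-03.parquet": [("sale", 1), ("sale", 2)]}
print(table.copy_into(raw))  # 2
print(table.copy_into(raw))  # 0
```

The second call adds nothing, which matches the unchanged record count the data engineer observed.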


NEW QUESTION # 30
A data engineer has joined an existing project and they see the following query in the project repository:
CREATE STREAMING LIVE TABLE loyal_customers AS
SELECT customer_id
FROM STREAM(LIVE.customers)
WHERE loyalty_level = 'high';
Which of the following describes why the STREAM function is included in the query?

  • A. The customers table is a reference to a Structured Streaming query on a PySpark DataFrame.
  • B. The data in the customers table has been updated since its last run.
  • C. The customers table is a streaming live table.
  • D. The STREAM function is not needed and will cause an error.
  • E. The table being created is a live table.

Answer: C

Explanation:
The STREAM function is used to process data from a streaming live table or view, which is a table or view that contains data that has been added only since the last pipeline update. Streaming live tables and views are stateful, meaning that they retain the state of the previous pipeline run and only process new data based on the current query. This is useful for incremental processing of streaming or batch data sources. The customers table in the query is a streaming live table, which means that it contains the latest data from the source. The STREAM function enables the query to read the data from the customers table incrementally and create another streaming live table named loyal_customers, which contains the customer IDs of the customers with high loyalty level. References: Difference between LIVE TABLE and STREAMING LIVE TABLE, CREATE STREAMING TABLE, Load data using streaming tables in Databricks SQL.


NEW QUESTION # 31
A data analyst has created a Delta table sales that is used by the entire data analysis team. They want help from the data engineering team to implement a series of tests to ensure the data is clean. However, the data engineering team uses Python for its tests rather than SQL.
Which of the following commands could the data engineering team use to access sales in PySpark?

  • A. SELECT * FROM sales
  • B. spark.sql("sales")
  • C. spark.table("sales")
  • D. spark.delta.table("sales")
  • E. There is no way to share data between PySpark and SQL.

Answer: C

Explanation:
The data engineering team can use the spark.table method to access the Delta table sales in PySpark. This method returns a DataFrame representation of the Delta table, which can be used for further processing or testing. The spark.table method works for any table that is registered in the Hive metastore or the Spark catalog, regardless of the file format [1]. Alternatively, the data engineering team can use the DeltaTable.forPath method to load the Delta table from its path [2]. References: [1] SparkSession | PySpark 3.2.0 documentation; [2] Welcome to Delta Lake's Python documentation page - delta-spark 2.4.0 documentation


NEW QUESTION # 32
Which of the following tools is used by Auto Loader to process data incrementally?

  • A. Checkpointing
  • B. Data Explorer
  • C. Databricks SQL
  • D. Unity Catalog
  • E. Spark Structured Streaming

Answer: E

Explanation:
Auto Loader provides a Structured Streaming source called cloudFiles that can process new data files as they arrive in cloud storage without any additional setup. Auto Loader uses a scalable key-value store to track ingestion progress and ensure exactly-once semantics. Auto Loader can ingest various file formats and load them into Delta Lake tables. Auto Loader is recommended for incremental data ingestion with Delta Live Tables, which extends the functionality of Structured Streaming and allows you to write declarative Python or SQL code to deploy a production-quality data pipeline. Reference: What is Auto Loader?, What is Auto Loader? | Databricks on AWS, Solved: How does Auto Loader ingest data? - Databricks - 5629


NEW QUESTION # 33
Which of the following describes a benefit of creating an external table from Parquet rather than CSV when using a CREATE TABLE AS SELECT statement?

  • A. Parquet files can be partitioned
  • B. Parquet files have the ability to be optimized
  • C. Parquet files will become Delta tables
  • D. CREATE TABLE AS SELECT statements cannot be used on files
  • E. Parquet files have a well-defined schema

Answer: E

Explanation:
https://www.databricks.com/glossary/what-is-parquet
Columnar storage like Apache Parquet is designed to bring efficiency compared to row-based files like CSV. When querying columnar storage, you can skip over the non-relevant data very quickly. As a result, aggregation queries are less time-consuming compared to row-oriented databases.


NEW QUESTION # 34
Which of the following describes a benefit of creating an external table from Parquet rather than CSV when using a CREATE TABLE AS SELECT statement?

  • A. Parquet files can be partitioned
  • B. Parquet files have the ability to be optimized
  • C. Parquet files will become Delta tables
  • D. CREATE TABLE AS SELECT statements cannot be used on files
  • E. Parquet files have a well-defined schema

Answer: E

Explanation:
Option E is the correct answer because Parquet files have a well-defined schema that is embedded within the data itself. This means that the data types and column names of the Parquet files are automatically detected and preserved when creating an external table from them. This also enables the use of SQL and other structured query languages to access and analyze the data. CSV files, on the other hand, do not have a schema embedded in them, and require specifying the schema explicitly or inferring it from the data when creating an external table from them. This can lead to errors or inconsistencies in the data types and column names, and also increase the processing time and complexity.
References: CREATE TABLE AS SELECT, Parquet Files, CSV Files, Parquet vs. CSV
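
The missing schema in CSV is easy to see with Python's standard library alone. In this sketch the sample data and the `schema` mapping are made up: every value read from CSV arrives as a string until the reader supplies the types.

```python
import csv
import io

csv_text = "customer_id,amount\n1,9.99\n2,20.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

print(rows[0])                  # every value is a string
print(type(rows[0]["amount"]))  # <class 'str'>

# The reader has to supply the schema that Parquet would have carried in the file.
schema = {"customer_id": int, "amount": float}
typed = [{col: schema[col](val) for col, val in row.items()} for row in rows]
print(typed[0])
```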


NEW QUESTION # 35
A new data engineering team, team, has been assigned to an ELT project. The new data engineering team will need full privileges on the table sales to fully manage the project.
Which command can be used to grant full permissions on the table to the new data engineering team?

  • A. GRANT SELECT ON TABLE sales TO team;
  • B. GRANT ALL PRIVILEGES ON TABLE team TO sales;
  • C. GRANT ALL PRIVILEGES ON TABLE sales TO team;
  • D. GRANT SELECT CREATE MODIFY ON TABLE sales TO team;

Answer: C

Explanation:
To grant full privileges on a table such as 'sales' to a group like 'team', the correct SQL command in Databricks is:
GRANT ALL PRIVILEGES ON TABLE sales TO team;
This command assigns all available privileges on the table, such as SELECT and MODIFY (which covers inserting, updating, and deleting data), to the specified team. This is typically necessary when a team needs full control over a table to manage and manipulate it as part of a project or ongoing maintenance.
Reference:
Databricks documentation on SQL permissions: SQL Permissions in Databricks


NEW QUESTION # 36
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.
The code block used by the data engineer is below:

If the data engineer only wants the query to execute a micro-batch to process data every 5 seconds, which of the following lines of code should the data engineer use to fill in the blank?

  • A. trigger(processingTime="5 seconds")
  • B. trigger(once="5 seconds")
  • C. trigger()
  • D. trigger("5 seconds")
  • E. trigger(continuous="5 seconds")

Answer: A

Explanation:
The processingTime option specifies a time-based trigger interval for fixed interval micro-batches. This means that the query will execute a micro-batch to process data every 5 seconds, regardless of how much data is available. This option is suitable for near-real time processing workloads that require low latency and consistent processing frequency. The other options either pass invalid arguments (B, C, D) or select the experimental continuous processing mode (E). References: Databricks Documentation - Configure Structured Streaming trigger intervals, Databricks Documentation - Trigger.
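
The code block from the question is not reproduced in this copy. As a sketch of where the trigger call sits in a streaming write, assuming `df` is a streaming DataFrame, the function below starts a query with a 5-second processing-time trigger. The function name, the append output mode, the checkpoint path, and the target table name are assumptions.

```python
def start_five_second_stream(df, checkpoint_path, target_table):
    """Start a streaming write that runs one micro-batch every 5 seconds."""
    return (
        df.writeStream
        .trigger(processingTime="5 seconds")  # fixed-interval micro-batches
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .toTable(target_table)
    )
```

`trigger` accepts its mode only as a keyword argument, which is why the positional form in option D fails.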


NEW QUESTION # 37
......

New Databricks-Certified-Data-Engineer-Associate Exam Dumps with High Passing Rate: https://www.real4prep.com/Databricks-Certified-Data-Engineer-Associate-exam.html

Databricks-Certified-Data-Engineer-Associate Certification Exam Dumps with 102 Practice Test Questions: https://drive.google.com/open?id=1LC6m70sXYuSvuWQfgyL2f7G5QhLn6IfU