Databricks Machine Learning Associate Practice Exams
Last updated on Apr 07, 2025
- Exam Code: Databricks Machine Learning Associate
- Exam Name: Databricks Certified Machine Learning Associate Exam
- Certification Provider: Databricks
- Latest update: Apr 07, 2025
A data scientist uses 3-fold cross-validation and the following hyperparameter grid when optimizing model hyperparameters via grid search for a classification problem:
● Hyperparameter 1: [2, 5, 10]
● Hyperparameter 2: [50, 100]
Which of the following represents the number of machine learning models that can be trained in parallel during this process?
- A . 3
- B . 5
- C . 6
- D . 18
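For reference, a minimal PySpark sketch of this setup (the estimator and the mapping of the two hyperparameters to maxDepth and numTrees are assumptions for illustration); the grid yields 3 × 2 = 6 parameter combinations, and 3-fold cross-validation fits one model per combination per fold, for 18 fits in total:

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Hypothetical classifier; column names are placeholders
rfc = RandomForestClassifier(featuresCol="features", labelCol="label")

# Hyperparameter 1 -> maxDepth, Hyperparameter 2 -> numTrees (assumed mapping)
grid = (ParamGridBuilder()
        .addGrid(rfc.maxDepth, [2, 5, 10])
        .addGrid(rfc.numTrees, [50, 100])
        .build())  # 3 x 2 = 6 combinations

cv = CrossValidator(
    estimator=rfc,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(),
    numFolds=3,        # 6 combinations x 3 folds = 18 model fits overall
    parallelism=4,     # number of fits evaluated concurrently
)
```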
A data scientist is using MLflow to track their machine learning experiment. As a part of each of their MLflow runs, they are performing hyperparameter tuning. The data scientist would like to have one parent run for the tuning process with a child run for each unique combination of hyperparameter values. All parent and child runs are being manually started with mlflow.start_run.
Which of the following approaches can the data scientist use to accomplish this MLflow run organization?
- A . They can turn on Databricks Autologging
- B . They can specify nested=True when starting the child run for each unique combination of hyperparameter values
- C . They can start each child run inside the parent run’s indented code block using mlflow.start_run()
- D . They can start each child run with the same experiment ID as the parent run
- E . They can specify nested=True when starting the parent run for the tuning process
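For context, a minimal sketch of the nested-run pattern referenced in the options (the run names and the tuned parameter are hypothetical):

```python
import mlflow

# One parent run for the whole tuning process
with mlflow.start_run(run_name="hyperparameter_tuning"):
    for max_depth in [2, 5, 10]:
        # One child run per hyperparameter combination; nested=True attaches
        # it to the currently active parent run
        with mlflow.start_run(run_name=f"max_depth={max_depth}", nested=True):
            mlflow.log_param("max_depth", max_depth)
            # ... train the model and log metrics here ...
```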
A health organization is developing a classification model to determine whether or not a patient currently has a specific type of infection. The organization’s leaders want to maximize the number of positive cases identified by the model.
Which of the following classification metrics should be used to evaluate the model?
- A . RMSE
- B . Precision
- C . Area under the receiver operating characteristic (ROC) curve
- D . Accuracy
- E . Recall
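As a quick reference, recall is the fraction of actual positive cases that the model identifies; a minimal scikit-learn sketch with made-up labels:

```python
from sklearn.metrics import recall_score

# Hypothetical ground-truth and predicted labels (1 = infection present)
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]

# recall = true positives / (true positives + false negatives) = 3 / 4
print(recall_score(y_true, y_pred))  # 0.75
```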
A data scientist has written a data cleaning notebook that utilizes the pandas library, but their colleague has suggested that they refactor their notebook to scale with big data.
Which of the following approaches can the data scientist take to spend the least amount of time refactoring their notebook to scale with big data?
- A . They can refactor their notebook to process the data in parallel.
- B . They can refactor their notebook to use the PySpark DataFrame API.
- C . They can refactor their notebook to use the Scala Dataset API.
- D . They can refactor their notebook to use Spark SQL.
- E . They can refactor their notebook to utilize the pandas API on Spark.
A data scientist wants to efficiently tune the hyperparameters of a scikit-learn model. They elect to use the Hyperopt library’s fmin operation to facilitate this process. Unfortunately, the final model is not very accurate. The data scientist suspects that there is an issue with the objective_function being passed as an argument to fmin.
They use the following code block to create the objective_function:
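The notebook's code block is not reproduced in this dump; a minimal sketch of what such an objective_function commonly looks like in this scenario, assuming a RandomForestRegressor scored with cross_val_score as the answer options suggest (the training data here is synthetic and hypothetical):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical training data standing in for the notebook's dataset
X_train, y_train = make_regression(n_samples=200, n_features=5, random_state=0)

def objective_function(params):
    # Build a model from the hyperparameters sampled by Hyperopt
    model = RandomForestRegressor(**params)
    # Average cross-validated R^2 score across the folds
    r2 = cross_val_score(model, X_train, y_train, scoring="r2").mean()
    # Return the score for fmin to evaluate
    return r2
```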
Which of the following changes does the data scientist need to make to their objective_function in order to produce a more accurate model?
- A . Add test set validation process
- B . Add a random_state argument to the RandomForestRegressor operation
- C . Remove the mean operation that is wrapping the cross_val_score operation
- D . Replace the r2 return value with -r2
- E . Replace the fmin operation with the fmax operation
A machine learning engineer is using the following code block to scale the inference of a single-node model on a Spark DataFrame with one million records:
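The referenced code block is not reproduced in this dump; a minimal sketch of the Iterator-of-Series pandas UDF pattern the question describes (the model path and the single feature column are hypothetical simplifications):

```python
from typing import Iterator

import joblib
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def predict_udf(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # The model is loaded once here and then reused for every batch the
    # iterator yields, instead of being reloaded for each batch
    model = joblib.load("/dbfs/models/model.pkl")  # hypothetical path
    for batch in batches:
        yield pd.Series(model.predict(batch.to_frame()))

# predictions = spark_df.withColumn("prediction", predict_udf("feature"))
```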
Assuming the default Spark configuration is in place, which of the following is a benefit of using an Iterator?
- A . The data will be limited to a single executor preventing the model from being loaded multiple times
- B . The model will be limited to a single executor preventing the data from being distributed
- C . The model only needs to be loaded once per executor rather than once per batch during the inference process
- D . The data will be distributed across multiple executors during the inference process
A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.
Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?
- A . import pyspark.pandas as ps
  df = ps.DataFrame(spark_df)
- B . import pyspark.pandas as ps
  df = ps.to_pandas(spark_df)
- C . spark_df.to_pandas()
- D . import pandas as pd
  df = pd.DataFrame(spark_df)
A data scientist has developed a random forest regressor rfr and included it as the final stage in a Spark ML Pipeline pipeline.
They then set up a cross-validation process with pipeline as the estimator in the following code block:
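The code block is not reproduced in this dump; a minimal sketch of the setup described, with hypothetical pipeline stages and column names:

```python
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
rfr = RandomForestRegressor(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, rfr])

param_grid = (ParamGridBuilder()
              .addGrid(rfr.numTrees, [50, 100])
              .build())

# The full pipeline, not just rfr, is passed as the estimator, so every
# stage is refit for each fold and parameter combination
cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=param_grid,
    evaluator=RegressionEvaluator(labelCol="label"),
    numFolds=3,
)
```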
Which of the following is a negative consequence of including pipeline as the estimator in the cross-validation process rather than rfr as the estimator?
- A . The process will have a longer runtime because all stages of pipeline need to be refit or retransformed with each model
- B . The process will leak data from the training set to the test set during the evaluation phase
- C . The process will be unable to parallelize tuning due to the distributed nature of pipeline
- D . The process will leak data prep information from the validation sets to the training sets for each model
A machine learning engineer is trying to scale a machine learning pipeline by distributing its single-node model tuning process. After broadcasting the entire training data onto each core, each core in the cluster can train one model at a time. Because the tuning process is still running slowly, the engineer wants to increase the level of parallelism from 4 cores to 8 cores to speed up the tuning process. Unfortunately, the total memory in the cluster cannot be increased.
In which of the following scenarios will increasing the level of parallelism from 4 to 8 speed up the tuning process?
- A . When the tuning process is randomized
- B . When the entire data can fit on each core
- C . When the model is unable to be parallelized
- D . When the data is particularly long in shape
- E . When the data is particularly wide in shape
The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.
Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?
- A . Logistic regression
- B . Spark ML cannot distribute linear regression training
- C . Iterative optimization
- D . Least-squares method
- E . Singular value decomposition
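For reference, Spark ML's LinearRegression exposes this choice through its solver parameter, where "normal" uses the matrix-decomposition (normal equation) approach and "l-bfgs" uses distributed iterative optimization; a minimal sketch with placeholder column names:

```python
from pyspark.ml.regression import LinearRegression

# "l-bfgs" trains via distributed iterative optimization, which scales to
# datasets with many rows and features; "normal" solves the normal equations,
# and "auto" lets Spark choose between them
lr = LinearRegression(featuresCol="features", labelCol="label", solver="l-bfgs")
```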