Professional-Data-Engineer Exam Dumps - Google Cloud Certified Questions and Answers

Question # 34

You want to migrate an Apache Spark 3 batch job from on-premises to Google Cloud. You need to minimally change the job so that the job reads from Cloud Storage and writes the result to BigQuery. Your job is optimized for Spark, where each executor has 8 vCPU and 16 GB memory, and you want to be able to choose similar settings. You want to minimize installation and management effort to run your job. What should you do?

Options:

Execute the job in a new Dataproc cluster.

Execute as a Dataproc Serverless job.

Execute the job as part of a deployment in a new Google Kubernetes Engine cluster.

Execute the job from a new Compute Engine VM.

Answer:

Explanation:

The key requirements are:

Migrate Spark 3 batch job.

Minimally change the job (reads from GCS, writes to BQ – standard for Spark on GCP).

Job optimized for Spark (specific executor vCPU/memory).

Ability to choose similar executor settings.

Minimize installation and management effort.

Dataproc Serverless ("Execute as a Dataproc Serverless job") is designed for this use case.

Spark 3 Support: Dataproc Serverless supports various Spark runtimes, including Spark 3.

Minimal Changes: Reading from Cloud Storage and writing to BigQuery (using the Spark-BigQuery connector) is standard for Spark jobs on Google Cloud, so few code changes are generally needed.

Customizable Resources: Dataproc Serverless lets you specify vCPU and memory for the driver and executors, so you can match your optimized on-premises settings (e.g., 8 vCPU and 16 GB memory per executor). Check the currently supported values before relying on a specific configuration.

Minimal Installation and Management: This is the core benefit of "serverless." You don't need to provision, manage, or scale clusters. You submit your batch job, and Google Cloud handles the underlying infrastructure. This significantly reduces operational overhead.
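
As a rough illustration of the "Customizable Resources" point, the sketch below builds a `gcloud dataproc batches submit` command line that passes executor settings as Spark properties. It assumes the job is a PySpark script (the question does not say which language the job uses), and the script path, bucket, and region are hypothetical placeholders. The helper name `build_batch_command` is invented for this example. Whether 8 cores and 16g are accepted depends on the limits Dataproc Serverless currently documents.

```python
import shlex


def build_batch_command(main_file, region, executor_cores=8,
                        executor_memory="16g", extra_properties=None):
    """Return an argument list for `gcloud dataproc batches submit pyspark`.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    # Executor sizing is expressed through standard Spark properties.
    properties = {
        "spark.executor.cores": str(executor_cores),
        "spark.executor.memory": executor_memory,
    }
    properties.update(extra_properties or {})

    # gcloud takes --properties as comma-separated key=value pairs.
    props_arg = ",".join(f"{k}={v}" for k, v in sorted(properties.items()))

    return [
        "gcloud", "dataproc", "batches", "submit", "pyspark", main_file,
        f"--region={region}",
        f"--properties={props_arg}",
    ]


if __name__ == "__main__":
    # Placeholder values; replace with your own script location and region.
    cmd = build_batch_command("gs://example-bucket/jobs/etl.py", "us-central1")
    print(shlex.join(cmd))
```

No cluster is created or named anywhere in the command, which is the practical difference from submitting the same job to a standard Dataproc cluster.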

Let's analyze why other options are less suitable:

Compute Engine VM: You would need to install and configure Spark yourself, manage its dependencies, and manage the VM. This is high management effort.

Google Kubernetes Engine cluster: Although you can run Spark on GKE (e.g., using the Spark on Kubernetes operator), you must manage the GKE cluster, Spark deployment configurations, Docker images, and so on. This is also significant management effort, more than Dataproc Serverless.

Dataproc cluster: A standard Dataproc cluster provides more control than serverless but also requires more management (creating, scaling, and managing the cluster lifecycle). Dataproc Serverless is specifically designed to minimize this management for batch jobs. Given the "minimize installation and management effort" requirement, serverless is preferred over a managed cluster if it meets the job's needs.

Reference:

Google Cloud Documentation: Dataproc Serverless > Overview. "Dataproc Serverless for Spark lets you run Spark batch workloads without requiring you to provision and manage your own cluster... Submit your Spark workload to the Dataproc Serverless service. The service will run the workload on a managed compute infrastructure, autoscaling resources as needed."

Google Cloud Documentation: Dataproc Serverless > Submitting Spark batch workloads > Spark batch workload properties. This documentation shows how you can specify properties for driver and executor cores (spark.driver.cores, spark.executor.cores) and memory (spark.driver.memory, spark.executor.memory), allowing you to choose settings similar to your existing optimized job.

Question # 35

Your company operates in three domains: airlines, hotels, and ride-hailing services. Each domain has two teams: analytics and data science, which create data assets in BigQuery with the help of a central data platform team. However, as each domain is evolving rapidly, the central data platform team is becoming a bottleneck. This is causing delays in deriving insights from data, and resulting in stale data when pipelines are not kept up to date. You need to design a data mesh architecture by using Dataplex to eliminate the bottleneck. What should you do?

Options:

1. Create one lake for each team. Inside each lake, create one zone for each domain.

2. Attach each of the BigQuery datasets created by the individual teams as assets to the respective zone.

3. Have the central data platform team manage all zones' data assets.

1. Create one lake for each team. Inside each lake, create one zone for each domain.

2. Attach each of the BigQuery datasets created by the individual teams as assets to the respective zone.

3. Direct each domain to manage their own zone's data assets.

1. Create one lake for each domain. Inside each lake, create one zone for each team.

2. Attach each of the BigQuery datasets created by the individual teams as assets to the respective zone.

3. Direct each domain to manage their own lake's data assets.

1. Create one lake for each domain. Inside each lake, create one zone for each team.

2. Attach each of the BigQuery datasets created by the individual teams as assets to the respective zone.

3. Have the central data platform team manage all lakes' data assets.

Question # 36

You are building a new application that you need to collect data from in a scalable way. Data arrives continuously from the application throughout the day, and you expect to generate approximately 150 GB of JSON data per day by the end of the year. Your requirements are:

Decoupling producer from consumer

Space and cost-efficient storage of the raw ingested data, which is to be stored indefinitely

Near real-time SQL query

Maintain at least 2 years of historical data, which will be queried with SQL

Which pipeline should you use to meet these requirements?

Options:

Create an application that provides an API. Write a tool to poll the API and write data to Cloud Storage as gzipped JSON files.

Create an application that writes to a Cloud SQL database to store the data. Set up periodic exports of the database to write to Cloud Storage and load into BigQuery.

Create an application that publishes events to Cloud Pub/Sub, and create Spark jobs on Cloud Dataproc to convert the JSON data to Avro format, stored on HDFS on Persistent Disk.

Create an application that publishes events to Cloud Pub/Sub, and create a Cloud Dataflow pipeline that transforms the JSON event payloads to Avro, writing the data to Cloud Storage and BigQuery.

Question # 37

Flowlogistic wants to use Google BigQuery as their primary analysis system, but they still have Apache Hadoop and Spark workloads that they cannot move to BigQuery. Flowlogistic does not know how to store the data that is common to both workloads. What should they do?

Options:

Store the common data in BigQuery as partitioned tables.

Store the common data in BigQuery and expose authorized views.

Store the common data encoded as Avro in Google Cloud Storage.

Store the common data in the HDFS storage for a Google Cloud Dataproc cluster.

Question # 38

Flowlogistic’s management has determined that the current Apache Kafka servers cannot handle the data volume for their real-time inventory tracking system. You need to build a new system on Google Cloud Platform (GCP) that will feed the proprietary tracking software. The system must be able to ingest data from a variety of global sources, process and query in real-time, and store the data reliably. Which combination of GCP products should you choose?

Options:

Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage

Cloud Pub/Sub, Cloud Dataflow, and Local SSD

Cloud Pub/Sub, Cloud SQL, and Cloud Storage

Cloud Load Balancing, Cloud Dataflow, and Cloud Storage

Question # 39

You have Cloud Functions written in Node.js that pull messages from Cloud Pub/Sub and send the data to BigQuery. You observe that the message processing rate on the Pub/Sub topic is orders of magnitude higher than anticipated, but there is no error logged in Stackdriver Log Viewer. What are the two most likely causes of this problem? Choose 2 answers.

Options:

Publisher throughput quota is too small.

Total outstanding messages exceed the 10-MB maximum.

Error handling in the subscriber code is not handling run-time errors properly.

The subscriber code cannot keep up with the messages.

The subscriber code does not acknowledge the messages that it pulls.

Question # 40

Which TensorFlow function can you use to configure a categorical column if you don't know all of the possible values for that column?

Options:

categorical_column_with_vocabulary_list

categorical_column_with_hash_bucket

categorical_column_with_unknown_values

sparse_column_with_keys

Question # 41

Which role must be assigned to a service account used by the virtual machines in a Dataproc cluster so they can execute jobs?

Options:

Dataproc Worker

Dataproc Viewer

Dataproc Runner

Dataproc Editor

Question # 42

Flowlogistic’s CEO wants to gain rapid insight into their customer base so his sales team can be better informed in the field. This team is not very technical, so they’ve purchased a visualization tool to simplify the creation of BigQuery reports. However, they’ve been overwhelmed by all the data in the table, and are spending a lot of money on queries trying to find the data they need. You want to solve their problem in the most cost-effective way. What should you do?

Options:

Export the data into a Google Sheet for visualization.

Create an additional table with only the necessary columns.

Create a view on the table to present to the visualization tool.

Create identity and access management (IAM) roles on the appropriate columns, so only they appear in a query.

Question # 43

Flowlogistic is rolling out their real-time inventory tracking system. The tracking devices will all send package-tracking messages, which will now go to a single Google Cloud Pub/Sub topic instead of the Apache Kafka cluster. A subscriber application will then process the messages for real-time reporting and store them in Google BigQuery for historical analysis. You want to ensure the package data can be analyzed over time.

Which approach should you take?

Options:

Attach the timestamp on each message in the Cloud Pub/Sub subscriber application as they are received.

Attach the timestamp and Package ID on the outbound message from each publisher device as they are sent to Cloud Pub/Sub.

Use the NOW() function in BigQuery to record the event’s time.

Use the automatically generated timestamp from Cloud Pub/Sub to order the data.

Exam Code: Professional-Data-Engineer

Exam Name: Google Professional Data Engineer Exam

Last Update: Jun 15, 2025

Questions: 376
