Top Dataproc Interview Questions (2024)
What is Google Cloud Dataproc?
What is the difference between Dataproc, Dataflow, and Dataprep?
Does Dataproc use Hadoop?
How is Dataproc helpful for Hadoop?
What are the gcloud commands to turn down clusters automatically with scheduled deletion in Dataproc?
How to turn down clusters automatically with scheduled deletion using the Dataproc REST API?
How to check that a cluster has been scheduled for deletion in Dataproc?
What is PVM in Dataproc?
Q: What is Google Cloud Dataproc?
Ans:
Google Cloud Dataproc is a managed Spark and Hadoop service for processing huge datasets, such as those used in big data initiatives (batch processing, querying, streaming, and machine learning). Dataproc is part of Google Cloud Platform, Google's public cloud offering. Users can utilise the service to create managed clusters with anywhere from three to hundreds of nodes.
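As a minimal sketch (the cluster name, region, and worker count here are hypothetical), such a cluster can be created with a single gcloud command:
# Create a managed Dataproc cluster with 3 worker nodes
$ gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --num-workers=3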
Q: What is the difference between Dataproc, Dataflow, and Dataprep?
Ans:
Dataproc is the Google Cloud service that provides managed Spark and Hadoop clusters for data science and ML workloads. Dataflow, by contrast, processes data using batch and stream pipelines, creating resources on demand for each new pipeline and removing them when processing completes. Dataprep is UI-driven, scalable on demand, and entirely automated.
Q: Does Dataproc use Hadoop?
Ans:
Dataproc is integrated with Hadoop and the Hadoop Distributed File System (HDFS). When choosing computation and data storage options for Dataproc clusters and jobs, keep in mind Cloud Storage and HDFS: Dataproc uses HDFS on the cluster's disks for storage, and it can also read and write data directly in Cloud Storage via gs:// paths.
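For illustration (the bucket, script, and cluster names are hypothetical), a PySpark job can take its input from Cloud Storage rather than HDFS simply by using a gs:// path:
# Submit a PySpark script stored in Cloud Storage to an existing cluster,
# passing a gs:// input path as the job argument
$ gcloud dataproc jobs submit pyspark gs://my-bucket/wordcount.py \
    --cluster=my-cluster \
    --region=us-central1 \
    -- gs://my-bucket/input.txt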
Q: How is Dataproc helpful for Hadoop?
Ans:
With Dataproc, Google Cloud's managed Hadoop product, you can spin up as many or as few cluster resources as you need in the cloud.
- Jobs can get the resources they need.
- Clusters can be turned down when no jobs are running.
- Turn down clusters automatically with Scheduled Deletion.
- Clusters can be resized if job needs change (see the example after this list).
- Dataproc provides autoscaling, which adjusts the number of workers automatically as load changes.
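As an illustration (the cluster name, region, and worker count are hypothetical), resizing an existing cluster is a single update command:
# Resize the cluster to 5 primary workers; shrinking works the same way
$ gcloud dataproc clusters update my-cluster \
    --region=us-central1 \
    --num-workers=5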
Q: What are the gcloud commands to turn down clusters automatically with scheduled deletion in Dataproc?
Ans:
You can use the Cluster Scheduled Deletion feature when creating a cluster by passing the following scheduled deletion flags to the gcloud dataproc clusters create command:
- --max-idle - The time from when the cluster enters the idle state to when it begins to delete (min value: 5 minutes, max value: 14 days).
- --expiration-time - The time to begin deleting the cluster, in ISO 8601 datetime format (a timestamp generator can help produce the correct format). Min value is 10 minutes from the current time and max value is 14 days from the current time.
- --max-age - The time from when you submit the cluster create request to when the cluster begins to delete (min value: 10 minutes, max value: 14 days).
In your cluster create request, combine the --max-idle flag with either the --expiration-time or --max-age flag; whichever condition is met first will cause the cluster to shut down. The cluster creation command accepts either the --expiration-time or --max-age flag, but not both.
gcloud dataproc clusters create cluster-name \
--region=region \
--max-idle=duration \
--expiration-time=time \
... other flags ...
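For instance (the cluster name, region, and durations are hypothetical), the following creates a cluster that is deleted after 30 idle minutes or 2 days after creation, whichever comes first:
# Delete when idle for 30 minutes, or 2 days after the create request
$ gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --max-idle=30m \
    --max-age=2d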
Q: How to turn down clusters automatically with scheduled deletion using the Dataproc REST API?
Ans:
Users can create a cluster with the Cluster Scheduled Deletion feature by setting ClusterLifecycleConfig fields in a clusters.create or clusters.patch API request:
- idleDeleteTtl - The time from when the cluster enters the idle state to when it begins to delete (min value: 5 minutes, max value: 14 days).
- autoDeleteTime - The time to begin deleting the cluster, as a timestamp in RFC 3339 UTC "Zulu" format, accurate to nanoseconds. Min value is 10 minutes from the current time and max value is 14 days from the current time.
- autoDeleteTtl - The time from when you submit the cluster create request to when the cluster begins to delete (min value: 10 minutes, max value: 14 days).
In your cluster create request, you can combine idleDeleteTtl with either autoDeleteTime or autoDeleteTtl; whichever condition is met first will cause the cluster to shut down. The request accepts either autoDeleteTime or autoDeleteTtl, but not both.
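A minimal sketch of such a request (the project, region, and cluster name are hypothetical, and it assumes gcloud credentials are available for the access token):
# Create a cluster that is deleted after being idle for 30 minutes (1800s)
$ curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{
      "projectId": "my-project",
      "clusterName": "my-cluster",
      "config": {
        "lifecycleConfig": {
          "idleDeleteTtl": "1800s"
        }
      }
    }' \
    "https://dataproc.googleapis.com/v1/projects/my-project/regions/us-central1/clusters"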
Q: How to check that a cluster has been scheduled for deletion in Dataproc?
Ans:
We can check that a cluster has been scheduled for deletion in Dataproc in the following ways:
gcloud Commands
We can use the gcloud dataproc clusters list command:
$ gcloud dataproc clusters list \
    --region=region
...
NAME        WORKER_COUNT ... SCHEDULED_DELETE
cluster-id  number       ... enabled
...
We can use the gcloud dataproc clusters describe command to see the cluster's LifecycleConfig scheduled deletion configurations:
$ gcloud dataproc clusters describe cluster-name \
    --region=region
...
lifecycleConfig:
  autoDeleteTime: '2024-09-28T19:33:48.146Z'
  idleDeleteTtl: 1800s
  idleStartTime: '2024-09-28T18:33:48.146Z'
...
REST API
To check that a cluster has scheduled deletion enabled, perform a clusters.list request.
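A sketch of that request (the project and region are hypothetical); clusters with scheduled deletion enabled include a lifecycleConfig block in the JSON response:
# List clusters in a region and inspect lifecycleConfig in the response
$ curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://dataproc.googleapis.com/v1/projects/my-project/regions/us-central1/clusters"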
Q: What is PVM in Dataproc?
Ans:
In addition to standard Compute Engine VMs used as Dataproc workers (referred to as "primary" workers), Dataproc clusters can use "secondary" workers.
Secondary workers are divided into two categories: preemptible (PVM) and non-preemptible. All secondary workers in a cluster must be of the same kind, either preemptible or non-preemptible; preemptible is the default.
Preemptible workers (preemptible VMs, or PVMs) are reclaimed and removed from the cluster if Google Cloud needs them for other tasks. Although the removal of preemptible workers may affect job stability, you may choose preemptible instances to reduce per-hour compute costs for non-critical data processing or to create very large clusters at a reduced total cost.
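As an illustration (the cluster name, region, and counts are hypothetical), preemptible secondary workers can be requested when the cluster is created:
# 2 primary workers plus 2 preemptible (PVM) secondary workers
$ gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --num-workers=2 \
    --num-secondary-workers=2 \
    --secondary-worker-type=preemptible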