

Top Dataproc Interview Questions (2024)

  1. What is Google Cloud Dataproc?
  2. What is the difference between Dataproc, Dataflow, and Dataprep?
  3. Does Dataproc use Hadoop?
  4. How is Dataproc helpful in Hadoop?
  5. Which gcloud commands turn down clusters automatically with scheduled deletion in Dataproc?
  6. How do you turn down clusters automatically with scheduled deletion using the Dataproc REST API?
  7. How do you check whether a cluster has been scheduled for deletion in Dataproc?
  8. What is PVM in Dataproc?

Q: What is Google Cloud Dataproc?
Ans:

Google Cloud Dataproc is a managed Spark and Hadoop service for processing huge datasets, such as those used in big data initiatives (batch processing, querying, streaming, and machine learning). Dataproc is part of Google Cloud Platform, Google's public cloud offering. Users can utilise the Dataproc service to create managed clusters with anywhere from three to hundreds of nodes.
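
As a quick illustration, a typical cluster lifecycle with the gcloud CLI might look like the sketch below; the cluster name, region, and job arguments are illustrative placeholders, not values from this article.

# Create a small managed cluster (name and region are illustrative).
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --num-workers=2

# Submit a sample Spark job (the SparkPi example ships with standard Dataproc images).
gcloud dataproc jobs submit spark \
    --cluster=example-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000

# Delete the cluster when finished to stop incurring charges.
gcloud dataproc clusters delete example-cluster --region=us-central1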

Q: What is the difference between Dataproc, Dataflow, and Dataprep?
Ans:

Dataproc is a Google Cloud product that provides Spark and Hadoop users with a managed data science/ML service. Dataflow, by contrast, processes data using both batch and stream pipelines, creating resources on demand for each new pipeline and removing them when processing completes. Dataprep differs from both: it is UI-driven, scales on demand, and is entirely automated.

Q: Does Dataproc use Hadoop?
Ans:

Yes. Dataproc is integrated with Hadoop and the Hadoop Distributed File System (HDFS). When choosing computation and data storage options for Dataproc clusters and jobs, keep in mind that there are two storage layers: Dataproc can use HDFS on the cluster's disks for storage, and it can also read from and write to Cloud Storage directly (gs:// paths) via the Cloud Storage connector.
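
For example, a job can reference Cloud Storage paths directly; the bucket and cluster names below are illustrative placeholders.

# Run the Hadoop wordcount example against Cloud Storage input and output
# (hadoop-mapreduce-examples.jar ships with standard Dataproc images).
gcloud dataproc jobs submit hadoop \
    --cluster=example-cluster \
    --region=us-central1 \
    --jar=file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    -- wordcount gs://example-bucket/input gs://example-bucket/output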


Q: How is Dataproc helpful in Hadoop?
Ans:

With Dataproc, Google Cloud's managed Hadoop product, you can spin up as many or as few cluster resources as you need in the cloud:

  1. Jobs can get the resources they need.
  2. Clusters can be turned down when no jobs are running.
  3. Clusters can be turned down automatically with Scheduled Deletion.
  4. Clusters can be resized if job needs change (see the sketch after this list).
  5. Dataproc provides autoscaling, which adds and removes workers flexibly as load changes.
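
As a hedged illustration of item 4, an existing cluster can be resized with a single command; the cluster name, region, and worker count are illustrative placeholders.

# Resize an existing cluster to five primary workers.
gcloud dataproc clusters update example-cluster \
    --region=us-central1 \
    --num-workers=5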

Q: Which gcloud commands turn down clusters automatically with scheduled deletion in Dataproc?
Ans:

You can use the Cluster Scheduled Deletion feature when creating a cluster by passing the following scheduled deletion options to the gcloud dataproc clusters create command.

  1. --max-idle - The duration from when the cluster enters the idle state to when it begins to be deleted (min value: 5 minutes, max value: 14 days).
  2. --expiration-time - The time at which to begin deleting the cluster, in ISO 8601 datetime format (a timestamp generator can help produce a correctly formatted value). The minimum is 10 minutes from the current time and the maximum is 14 days from the current time.
  3. --max-age - The duration from when the cluster create request is submitted to when the cluster begins to be deleted (min value: 10 minutes, max value: 14 days).

In your cluster create request, you can combine the --max-idle flag with either the --expiration-time or the --max-age flag; whichever condition is met first will cause the cluster to be deleted.

The cluster create command accepts either the --expiration-time or the --max-age flag, but not both.

gcloud dataproc clusters create cluster-name \
    --region=region \
    --max-idle=duration \
    --expiration-time=time \
    ... other flags ...
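
A concrete invocation might look like the following; the cluster name, region, and timestamp are illustrative values.

# Delete after 30 idle minutes, or at the given UTC time, whichever comes first.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --max-idle=30m \
    --expiration-time=2024-12-31T23:59:00Z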

Q: How do you turn down clusters automatically with scheduled deletion using the Dataproc REST API?
Ans:

Users can enable Cluster Scheduled Deletion by setting the cluster's LifecycleConfig fields in a clusters.create or clusters.patch API request.

  1. idleDeleteTtl - The duration from when the cluster enters the idle state to when it begins to be deleted (min value: 5 minutes, max value: 14 days).
  2. autoDeleteTime - The time at which to begin deleting the cluster, as an RFC 3339 UTC "Zulu" timestamp, accurate to nanoseconds. The minimum is 10 minutes from the current time and the maximum is 14 days from the current time.
  3. autoDeleteTtl - The duration from when the cluster create request is submitted to when the cluster begins to be deleted (min value: 10 minutes, max value: 14 days).

In your cluster create request, you can combine idleDeleteTtl with either autoDeleteTime or autoDeleteTtl; whichever condition is met first will cause the cluster to be deleted.

The cluster create request accepts either the autoDeleteTime or the autoDeleteTtl field, but not both.
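
As a hedged sketch, the lifecycle settings go under the cluster's config in a clusters.create request body; the project, region, cluster name, and values below are illustrative placeholders.

# Create a cluster with scheduled deletion via the REST API
# (project, region, name, and values are illustrative).
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://dataproc.googleapis.com/v1/projects/example-project/regions/us-central1/clusters" \
    -d '{
      "clusterName": "example-cluster",
      "config": {
        "lifecycleConfig": {
          "idleDeleteTtl": "1800s",
          "autoDeleteTime": "2024-12-31T23:59:00Z"
        }
      }
    }'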

Q: How do you check whether a cluster has been scheduled for deletion in Dataproc?
Ans:

You can check whether a cluster has been scheduled for deletion in the following ways:

  1. gcloud command
    Use the gcloud dataproc clusters list command; the SCHEDULED_DELETE column shows whether the feature is enabled.
    $ gcloud dataproc clusters list \
         --region=region
    
    ...
    NAME         WORKER_COUNT ... SCHEDULED_DELETE
    cluster-id   number       ... enabled
    ...

    Use the gcloud dataproc clusters describe command to see the cluster's lifecycleConfig scheduled deletion settings.

    $ gcloud dataproc clusters describe cluster-name \
        --region=region
    
    ...
    lifecycleConfig:
      autoDeleteTime: '2024-09-28T19:33:48.146Z'
      idleDeleteTtl: 1800s
      idleStartTime: '2024-09-28T18:33:48.146Z'
    ...

  2. REST API
    To check whether a cluster has scheduled deletion enabled, issue a clusters.list request and inspect the lifecycleConfig returned for each cluster, as sketched below.
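
    A hedged example of such a request; the project and region below are illustrative placeholders.

    $ curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
        "https://dataproc.googleapis.com/v1/projects/example-project/regions/us-central1/clusters"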

Q: What is PVM in Dataproc?
Ans:

In addition to conventional Compute Engine VMs used as Dataproc workers (referred to as "primary" workers), Dataproc clusters can use "secondary" workers.

Secondary workers fall into two categories: preemptible (PVM) and non-preemptible. All secondary workers in a cluster must be of the same kind, either preemptible or non-preemptible; preemptible is the default.

[Figure: Dataproc PVM]

The default secondary worker type is the preemptible worker (PVM). PVMs are reclaimed and removed from the cluster if Google Cloud needs their capacity for other tasks. Although the removal of preemptible workers may affect job stability, you may choose preemptible instances to lower per-hour compute costs for non-critical data processing, or to create very large clusters at a lower total cost. A hedged example of requesting them follows.
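
As an illustrative sketch, preemptible secondary workers can be requested at cluster creation time; the cluster name, region, and worker counts are placeholders.

# Create a cluster with two primary workers and two preemptible secondary workers.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --num-workers=2 \
    --num-secondary-workers=2 \
    --secondary-worker-type=preemptible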







