Top AWS Glue Interview Questions (2024)

  1. What is AWS Glue?
  2. What is the AWS Glue Data Catalog?
  3. Which AWS services and open source projects make use of the AWS Glue Data Catalog?
  4. What are AWS Glue crawlers?
  5. How do you trigger a Glue crawler in AWS Glue?
  6. What is the purpose of an AWS Glue job?
  7. Do the AWS Glue APIs return the partition key fields in the order in which they were specified when the table was created?
  8. How to join / merge all rows of an RDD in PySpark / AWS Glue into one single long line?
  9. How to create an AWS Glue job using CLI commands?
  10. How to get the total number of partitions in AWS Glue for a specific range?
  11. When an AWS Glue job times out, how do we retry it?

Q: What is AWS Glue?
Ans:

AWS Glue is a serverless data integration service offered by Amazon as part of Amazon Web Services. It makes it simple to discover, prepare, and combine data for analytics, machine learning, and application development.

Because it is serverless, AWS Glue automatically provisions and manages the compute resources needed to run your extract, transform, and load (ETL) workloads, so there is no infrastructure to set up or maintain.

Q: What is the AWS Glue Data Catalog?
Ans:

The AWS Glue Data Catalog is your persistent metadata store. It is a managed service that lets you store, annotate, and share metadata in the AWS Cloud, in the same way an Apache Hive metastore does. There is one AWS Glue Data Catalog per AWS Region in each AWS account.
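
The catalog can also be queried programmatically. Below is a minimal boto3 sketch; the Region and the database name techgeeknext_db are placeholders:

import boto3

# Create a Glue client in the Region whose Data Catalog you want to query
glue = boto3.client("glue", region_name="us-east-1")

# List the databases in this account's Data Catalog for that Region
for db in glue.get_databases()["DatabaseList"]:
    print(db["Name"])

# List the tables registered under one database (the name is a placeholder)
for table in glue.get_tables(DatabaseName="techgeeknext_db")["TableList"]:
    print(table["Name"])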

Q: Which AWS services and open source projects make use of the AWS Glue Data Catalog?
Ans:

The AWS services and open source projects that make use of the AWS Glue Data Catalog include the following (a short usage sketch follows the list):

  1. AWS Lake Formation
  2. Amazon Athena
  3. Amazon Redshift Spectrum
  4. Amazon EMR
  5. AWS Glue Data Catalog Client for Apache Hive Metastore
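
For example, the Hive metastore client in item 5 lets Apache Spark resolve databases and tables against the Glue Data Catalog instead of a local Hive metastore. A minimal PySpark sketch, assuming a cluster (such as Amazon EMR with Glue catalog integration enabled) that already has the Glue catalog client on its classpath:

from pyspark.sql import SparkSession

# Point Spark's Hive support at the Glue Data Catalog
spark = (
    SparkSession.builder
    .appName("glue-catalog-as-hive-metastore")
    .config(
        "hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    .enableHiveSupport()
    .getOrCreate()
)

# Databases and tables now resolve against the Glue Data Catalog
spark.sql("SHOW DATABASES").show()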

Q: What are AWS Glue crawlers?
Ans:

A crawler connects to your data store, extracts metadata, and creates table definitions in the AWS Glue Data Catalog. All of the crawlers you create are listed on the Crawlers tab of the AWS Glue console.
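
A crawler can also be defined programmatically. A minimal boto3 sketch; the role ARN, database name, and S3 path are placeholders:

import boto3

glue = boto3.client("glue")

# Define a crawler that scans an S3 prefix and writes the inferred
# table definitions into a Data Catalog database
glue.create_crawler(
    Name="TechGeekNextCrawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
    DatabaseName="techgeeknext_db",                         # placeholder
    Targets={"S3Targets": [{"Path": "s3://techgeeknext-bucket/raw/"}]},
)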

Q: How do you trigger a Glue crawler in AWS Glue?
Ans:

The create-trigger command below creates a scheduled trigger named TechGeekNextTrigger in the activated state; it runs the crawler TechGeekNextCrawler every day at 12:00 PM UTC.

aws glue create-trigger --name TechGeekNextTrigger --type SCHEDULED --schedule  "cron(0 12 * * ? *)" --actions CrawlerName=TechGeekNextCrawler --start-on-creation  
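
To run the crawler on demand rather than on a schedule, you can start it directly; a boto3 sketch:

import boto3

glue = boto3.client("glue")

# Start an existing crawler immediately (the CLI equivalent is
# `aws glue start-crawler --name TechGeekNextCrawler`)
glue.start_crawler(Name="TechGeekNextCrawler")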

Q: What is the purpose of an AWS Glue job?
Ans:

In AWS Glue, a job is the business logic that performs the extract, transform, and load (ETL) work. When you start a job, AWS Glue runs a script that extracts data from sources, transforms it, and loads it into targets. You can create jobs in the ETL section of the AWS Glue console.
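
A minimal sketch of such a job script using the AWS Glue PySpark API; the database, table, field, and bucket names are placeholders:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Glue passes the job name (and any custom arguments) on the command line
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read a table that a crawler registered in the Data Catalog
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="techgeeknext_db", table_name="raw_events"
)

# Transform: for example, drop records missing a required field
dyf = dyf.filter(lambda record: record["id"] is not None)

# Load: write the result back to S3 as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://techgeeknext-bucket/curated/"},
    format="parquet",
)

job.commit()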

Q: Do the AWS Glue APIs return the partition key fields in the order in which they were specified when the table was created?
Ans:

Yes, the partition keys are returned in the same order as they were specified when the table was created.
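
This can be checked with boto3: the PartitionKeys list returned by get-table preserves the declaration order. The database and table names below are placeholders:

import boto3

glue = boto3.client("glue")

# PartitionKeys preserves the order in which the keys were declared
table = glue.get_table(DatabaseName="techgeeknext_db", Name="raw_events")["Table"]
print([key["Name"] for key in table["PartitionKeys"]])  # e.g. ['year', 'month', 'day']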

Q: How to join / merge all rows of an RDD in PySpark / AWS Glue into one single long line?
Ans:

Each RDD row can be mapped to one string with map, and the mapped strings can then be combined into a single large string with aggregate:

# Assumes each row is an iterable of strings: join the fields of each
# row with spaces, then concatenate all rows into one string
result = rdd.map(lambda r: " ".join(r) + "\n") \
    .aggregate("", lambda a, b: a + b, lambda a, b: a + b)
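
For example, assuming the rows are lists of strings:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-rows-demo").getOrCreate()
rdd = spark.sparkContext.parallelize([["a", "b"], ["c", "d"]])

result = rdd.map(lambda r: " ".join(r) + "\n") \
    .aggregate("", lambda a, b: a + b, lambda a, b: a + b)
print(result)  # "a b\nc d\n" - one long string, one input row per line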

Q: How to create an AWS Glue job using CLI commands?
Ans:

We can create an AWS Glue job with the create-job command. For a Spark ETL job, the command name must be glueetl (for a Python shell job it would be pythonshell), and ScriptLocation must point to the job script in S3:

aws glue create-job \
--name ${GLUE_JOB_NAME} \
--role ${ROLE_NAME} \
--command "Name=glueetl,ScriptLocation=s3:///" \
--connections Connections=${GLUE_CONN_NAME} \
--default-arguments file://${DEFAULT_ARGUMENT_FILE}
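
The same job can be defined with boto3; the role ARN, script path, and connection name below are placeholders:

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="techgeeknext-etl-job",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder
    Command={
        "Name": "glueetl",  # Spark ETL job; use "pythonshell" for Python shell jobs
        "ScriptLocation": "s3://techgeeknext-bucket/scripts/job.py",  # placeholder
    },
    Connections={"Connections": ["techgeeknext-conn"]},  # placeholder
    DefaultArguments={"--job-language": "python"},
)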

Q: How to get the total number of partitions in AWS Glue for a specific range?
Ans:

By passing a JMESPath --query to the get-partitions command, we can count the partitions it returns; adding an --expression restricts the count to a specific range of partition values (the year key below is a placeholder):

aws glue get-partitions --database-name xx --table-name xx --expression "year >= '2021'" --query 'length(Partitions[])'
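
The same count can be computed with boto3, paginating through get_partitions so tables with many partitions are counted correctly; the database, table, and partition key names are placeholders:

import boto3

glue = boto3.client("glue")

# Sum the partitions across all result pages that match the expression
paginator = glue.get_paginator("get_partitions")
total = sum(
    len(page["Partitions"])
    for page in paginator.paginate(
        DatabaseName="techgeeknext_db",
        TableName="raw_events",
        Expression="year >= '2021'",  # placeholder partition key and range
    )
)
print(total)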

Q: When an AWS Glue job times out, how do we retry it?
Ans:

A job's built-in retry only fires when the job fails, not when it times out. To retry on timeout, you need custom logic: for example, an EventBridge rule that listens for Glue job timeout events and invokes a Lambda function that restarts the job.
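
A minimal sketch of such a Lambda handler, assuming it is invoked by an EventBridge rule that matches "Glue Job State Change" events with state TIMEOUT:

import boto3

glue = boto3.client("glue")

def handler(event, context):
    # "Glue Job State Change" events carry the job name and final state
    # in the "detail" section of the event payload
    detail = event["detail"]
    if detail.get("state") == "TIMEOUT":
        # Start a fresh run of the timed-out job
        glue.start_job_run(JobName=detail["jobName"])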
