Top AWS Glue Interview Questions (2024)

  1. What is AWS Glue?
  2. What is the AWS Glue Data Catalog?
  3. Which AWS services and open source projects make use of the AWS Glue Data Catalog?
  4. What are AWS Glue crawlers?
  5. How do you trigger a Glue crawler in AWS Glue?
  6. What is the purpose of an AWS Glue job?
  7. Do the AWS Glue APIs return the partition key fields in the order in which they were specified when the table was created?
  8. How to join / merge all rows of an RDD in PySpark / AWS Glue into one single long line?
  9. How to create an AWS Glue job using CLI commands?
  10. How to get the total number of partitions in AWS Glue for a specific range?
  11. When an AWS Glue job times out, how do we retry it?

Q: What is AWS Glue?
Ans:

AWS Glue is a serverless data integration service offered by Amazon as part of Amazon Web Services. It makes it simple to discover, prepare, and combine data for analytics, machine learning, and application development.

Because it is serverless, AWS Glue automatically provisions and manages the compute resources needed to run your extract, transform, and load (ETL) workloads, so there is no infrastructure to set up or maintain.

Q: What is the AWS Glue Data Catalog?
Ans:

The AWS Glue Data Catalog is your persistent metadata store. It is a managed service that lets you store, annotate, and share metadata in the AWS Cloud, in the same way an Apache Hive metastore does. There is one AWS Glue Data Catalog per AWS Region in each AWS account.
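
The catalog can also be queried programmatically. Below is a minimal boto3 sketch; the Region and the database name techgeeknext_db are placeholders:

import boto3

# Create a Glue client in the Region whose Data Catalog you want to query
glue = boto3.client("glue", region_name="us-east-1")

# List the databases in this account's Data Catalog for that Region
for db in glue.get_databases()["DatabaseList"]:
    print(db["Name"])

# List the tables registered under one database (the name is a placeholder)
for table in glue.get_tables(DatabaseName="techgeeknext_db")["TableList"]:
    print(table["Name"])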

Q: Which AWS services and open source projects make use of the AWS Glue Data Catalog?
Ans:

The AWS services and open source projects that make use of the AWS Glue Data Catalog include the following (a short usage sketch follows the list):

  1. AWS Lake Formation
  2. Amazon Athena
  3. Amazon Redshift Spectrum
  4. Amazon EMR
  5. AWS Glue Data Catalog Client for Apache Hive Metastore
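
For example, the Hive metastore client in item 5 lets Apache Spark resolve databases and tables against the Glue Data Catalog instead of a local Hive metastore. A minimal PySpark sketch, assuming a cluster (such as Amazon EMR with Glue catalog integration enabled) that already has the Glue catalog client on its classpath:

from pyspark.sql import SparkSession

# Point Spark's Hive support at the Glue Data Catalog
spark = (
    SparkSession.builder
    .appName("glue-catalog-as-hive-metastore")
    .config(
        "hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    .enableHiveSupport()
    .getOrCreate()
)

# Databases and tables now resolve against the Glue Data Catalog
spark.sql("SHOW DATABASES").show()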

Q: What are AWS Glue crawlers?
Ans:

A crawler connects to your data store, extracts metadata, and creates table definitions in the AWS Glue Data Catalog. All of the crawlers you create are listed on the Crawlers tab of the AWS Glue console.
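
A crawler can also be defined programmatically. A minimal boto3 sketch; the role ARN, database name, and S3 path are placeholders:

import boto3

glue = boto3.client("glue")

# Define a crawler that scans an S3 prefix and writes the inferred
# table definitions into a Data Catalog database
glue.create_crawler(
    Name="TechGeekNextCrawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
    DatabaseName="techgeeknext_db",                         # placeholder
    Targets={"S3Targets": [{"Path": "s3://techgeeknext-bucket/raw/"}]},
)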

Q: How do you trigger a Glue crawler in AWS Glue?
Ans:

The create-trigger command below creates a scheduled trigger named TechGeekNextTrigger in the activated state; it runs the crawler TechGeekNextCrawler every day at 12:00 PM UTC.

aws glue create-trigger --name TechGeekNextTrigger --type SCHEDULED --schedule  "cron(0 12 * * ? *)" --actions CrawlerName=TechGeekNextCrawler --start-on-creation  
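
To run the crawler on demand rather than on a schedule, you can start it directly; a boto3 sketch:

import boto3

glue = boto3.client("glue")

# Start an existing crawler immediately (the CLI equivalent is
# `aws glue start-crawler --name TechGeekNextCrawler`)
glue.start_crawler(Name="TechGeekNextCrawler")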

Q: What is the purpose of an AWS Glue job?
Ans:

In AWS Glue, a job is the business logic that performs the extract, transform, and load (ETL) work. When you start a job, AWS Glue runs a script that extracts data from sources, transforms it, and loads it into targets. You can create jobs in the ETL section of the AWS Glue console.
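
A minimal sketch of such a job script using the AWS Glue PySpark API; the database, table, field, and bucket names are placeholders:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Glue passes the job name (and any custom arguments) on the command line
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read a table that a crawler registered in the Data Catalog
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="techgeeknext_db", table_name="raw_events"
)

# Transform: for example, drop records missing a required field
dyf = dyf.filter(lambda record: record["id"] is not None)

# Load: write the result back to S3 as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://techgeeknext-bucket/curated/"},
    format="parquet",
)

job.commit()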

Q: Do the AWS Glue APIs return the partition key fields in the order in which they were specified when the table was created?
Ans:

Yes, the partition keys are returned in the same order as they were specified when the table was created.
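
This can be checked with boto3: the PartitionKeys list returned by get-table preserves the declaration order. The database and table names below are placeholders:

import boto3

glue = boto3.client("glue")

# PartitionKeys preserves the order in which the keys were declared
table = glue.get_table(DatabaseName="techgeeknext_db", Name="raw_events")["Table"]
print([key["Name"] for key in table["PartitionKeys"]])  # e.g. ['year', 'month', 'day']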

Q: How to join / merge all rows of an RDD in PySpark / AWS Glue into one single long line?
Ans:

Each RDD row can be mapped to one string with map, and the mapped strings can then be combined into a single large string with aggregate:

# Assumes each row is an iterable of strings: join the fields of each
# row with spaces, then concatenate all rows into one string
result = rdd.map(lambda r: " ".join(r) + "\n") \
    .aggregate("", lambda a, b: a + b, lambda a, b: a + b)
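
For example, assuming the rows are lists of strings:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-rows-demo").getOrCreate()
rdd = spark.sparkContext.parallelize([["a", "b"], ["c", "d"]])

result = rdd.map(lambda r: " ".join(r) + "\n") \
    .aggregate("", lambda a, b: a + b, lambda a, b: a + b)
print(result)  # "a b\nc d\n" - one long string, one input row per line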

Q: How to create an AWS Glue job using CLI commands?
Ans:

We can create an AWS Glue job with the create-job command. For a Spark ETL job, the command name must be glueetl (for a Python shell job it would be pythonshell), and ScriptLocation must point to the job script in S3:

aws glue create-job \
--name ${GLUE_JOB_NAME} \
--role ${ROLE_NAME} \
--command "Name=glueetl,ScriptLocation=s3:///" \
--connections Connections=${GLUE_CONN_NAME} \
--default-arguments file://${DEFAULT_ARGUMENT_FILE}
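
The same job can be defined with boto3; the role ARN, script path, and connection name below are placeholders:

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="techgeeknext-etl-job",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder
    Command={
        "Name": "glueetl",  # Spark ETL job; use "pythonshell" for Python shell jobs
        "ScriptLocation": "s3://techgeeknext-bucket/scripts/job.py",  # placeholder
    },
    Connections={"Connections": ["techgeeknext-conn"]},  # placeholder
    DefaultArguments={"--job-language": "python"},
)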

Q: How to get the total number of partitions in AWS Glue for a specific range?
Ans:

By passing a JMESPath --query to the get-partitions command, we can count the partitions it returns; adding an --expression restricts the count to a specific range of partition values (the year key below is a placeholder):

aws glue get-partitions --database-name xx --table-name xx --expression "year >= '2021'" --query 'length(Partitions[])'
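
The same count can be computed with boto3, paginating through get_partitions so tables with many partitions are counted correctly; the database, table, and partition key names are placeholders:

import boto3

glue = boto3.client("glue")

# Sum the partitions across all result pages that match the expression
paginator = glue.get_paginator("get_partitions")
total = sum(
    len(page["Partitions"])
    for page in paginator.paginate(
        DatabaseName="techgeeknext_db",
        TableName="raw_events",
        Expression="year >= '2021'",  # placeholder partition key and range
    )
)
print(total)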

Q: When an AWS Glue job times out, how do we retry it?
Ans:

A job's built-in retry only fires when the job fails, not when it times out. To retry on timeout, you need custom logic: for example, an EventBridge rule that listens for Glue job timeout events and invokes a Lambda function that restarts the job.
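
A minimal sketch of such a Lambda handler, assuming it is invoked by an EventBridge rule that matches "Glue Job State Change" events with state TIMEOUT:

import boto3

glue = boto3.client("glue")

def handler(event, context):
    # "Glue Job State Change" events carry the job name and final state
    # in the "detail" section of the event payload
    detail = event["detail"]
    if detail.get("state") == "TIMEOUT":
        # Start a fresh run of the timed-out job
        glue.start_job_run(JobName=detail["jobName"])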
