
New Release DAS-C01 AWS Certified Data Analytics Questions

Question 36

A company collects and transforms data files from third-party providers by using an on-premises SFTP server. The company uses a Python script to transform the data.

The company wants to reduce the overhead of maintaining the SFTP server and storing large amounts of data on premises. However, the company does not want to change the existing upload process for the third-party providers.

Which solution will meet these requirements with the LEAST development effort?

Options:

A.

Deploy the Python script on an Amazon EC2 instance. Install a third-party SFTP server on the EC2 instance. Schedule the script to run periodically on the EC2 instance to perform a data transformation on new files. Copy the transformed files to Amazon S3.

B.

Create an Amazon S3 bucket that includes a separate prefix for each provider. Provide the S3 URL to each provider for its respective prefix. Instruct the providers to use the S3 COPY command to upload data. Configure an AWS Lambda function that transforms the data when new files are uploaded.

C.

Use AWS Transfer Family to create an SFTP server that includes a publicly accessible endpoint. Configure the new server to use Amazon S3 storage. Change the server name to match the name of the on-premises SFTP server. Schedule a Python shell job in AWS Glue to use the existing Python script to run periodically and transform the uploaded files.

D.

Use AWS Transfer Family to create an SFTP server that includes a publicly accessible endpoint. Configure the new server to use Amazon S3 storage. Change the server name to match the name of the on-premises SFTP server. Use AWS Data Pipeline to schedule a transient Amazon EMR cluster with an Apache Spark step to periodically transform the files.
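
The options above revolve around AWS Transfer Family with Amazon S3 storage and a scheduled AWS Glue Python shell job. A minimal boto3 sketch of those two pieces follows; every bucket, role, user, and script name is an assumption for illustration, not part of the question.

import boto3

transfer = boto3.client("transfer")
glue = boto3.client("glue")

# Publicly accessible, service-managed SFTP endpoint that stores files in S3,
# so third-party providers keep uploading over SFTP without process changes.
server = transfer.create_server(
    Protocols=["SFTP"],
    Domain="S3",
    EndpointType="PUBLIC",
    IdentityProviderType="SERVICE_MANAGED",
)

# One SFTP user per provider, mapped to that provider's S3 prefix (assumed names).
transfer.create_user(
    ServerId=server["ServerId"],
    UserName="provider-a",
    Role="arn:aws:iam::123456789012:role/transfer-s3-access",   # assumed role
    HomeDirectory="/provider-uploads-bucket/provider-a",         # assumed bucket/prefix
    SshPublicKeyBody="ssh-rsa AAAA...",                          # provider's public key
)

# Glue Python shell job that wraps the existing transform script (assumed S3 path).
glue.create_job(
    Name="transform-provider-files",
    Role="arn:aws:iam::123456789012:role/glue-job-role",         # assumed role
    Command={
        "Name": "pythonshell",
        "ScriptLocation": "s3://provider-uploads-bucket/scripts/transform.py",
        "PythonVersion": "3.9",
    },
    MaxCapacity=0.0625,   # smallest Python shell capacity
)

# Scheduled trigger that runs the job hourly to transform newly uploaded files.
glue.create_trigger(
    Name="transform-provider-files-hourly",
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",
    Actions=[{"JobName": "transform-provider-files"}],
    StartOnCreation=True,
)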

Question 37

A company is reading data from various customer databases that run on Amazon RDS. The databases contain many inconsistent fields. For example, a customer record field that is place_id in one database is location_id in another database. The company wants to link customer records across different databases, even when many customer record fields do not match exactly.

Which solution will meet these requirements with the LEAST operational overhead?

Options:

A.

Create an Amazon EMR cluster to process and analyze data in the databases. Connect to the Apache Zeppelin notebook, and use the FindMatches transform to find duplicate records in the data.

B.

Create an AWS Glue crawler to crawl the databases. Use the FindMatches transform to find duplicate records in the data. Evaluate and tune the transform by evaluating performance and results of finding matches.

C.

Create an AWS Glue crawler to crawl the data in the databases. Use Amazon SageMaker to construct Apache Spark ML pipelines to find duplicate records in the data.

D.

Create an Amazon EMR cluster to process and analyze data in the databases. Connect to the Apache Zeppelin notebook, and use Apache Spark ML to find duplicate records in the data. Evaluate and tune the model by evaluating performance and results of finding duplicates.
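
Several of the options reference the AWS Glue FindMatches transform. A minimal boto3 sketch of cataloging the RDS tables through an existing JDBC connection and creating a FindMatches ML transform follows; the connection, database, table, column, and role names are assumptions.

import boto3

glue = boto3.client("glue")

# Crawler that catalogs the customer tables behind an existing Glue JDBC connection.
glue.create_crawler(
    Name="customer-db-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",   # assumed role
    DatabaseName="customer_catalog",
    Targets={"JdbcTargets": [{"ConnectionName": "rds-customers", "Path": "customers/%"}]},
)
glue.start_crawler(Name="customer-db-crawler")

# FindMatches ML transform that links customer records even when fields such as
# place_id / location_id do not match exactly.
glue.create_ml_transform(
    Name="customer-record-linkage",
    Role="arn:aws:iam::123456789012:role/glue-ml-role",         # assumed role
    InputRecordTables=[
        {"DatabaseName": "customer_catalog", "TableName": "customers"}
    ],
    Parameters={
        "TransformType": "FIND_MATCHES",
        "FindMatchesParameters": {
            "PrimaryKeyColumnName": "customer_id",   # assumed primary key
            "PrecisionRecallTradeoff": 0.5,
            "AccuracyCostTradeoff": 0.5,
            "EnforceProvidedLabels": False,
        },
    },
    GlueVersion="2.0",
    MaxCapacity=10.0,
)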

Question 38

A human resources company maintains a 10-node Amazon Redshift cluster to run analytics queries on the company’s data. The Amazon Redshift cluster contains a product table and a transactions table, and both tables have a product_sku column. The tables are over 100 GB in size. The majority of queries run on both tables.

Which distribution style should the company use for the two tables to achieve optimal query performance?

Options:

A.

An EVEN distribution style for both tables

B.

A KEY distribution style for both tables

C.

An ALL distribution style for the product table and an EVEN distribution style for the transactions table

D.

An EVEN distribution style for the product table and a KEY distribution style for the transactions table
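
The options contrast Amazon Redshift distribution styles. Below is a minimal sketch, submitted through the Redshift Data API, of declaring KEY distribution on the shared product_sku join column so that rows with the same key are stored on the same slices; the cluster name, database, user, and column definitions are assumptions.

import boto3

rsd = boto3.client("redshift-data")

# KEY distribution on the common join column co-locates matching rows of the
# two tables, so joins on product_sku avoid cross-node data redistribution.
product_ddl = """
CREATE TABLE product (
    product_sku VARCHAR(32),
    product_name VARCHAR(256)
)
DISTSTYLE KEY
DISTKEY (product_sku);
"""

transactions_ddl = """
CREATE TABLE transactions (
    transaction_id BIGINT,
    product_sku VARCHAR(32),
    amount DECIMAL(12, 2)
)
DISTSTYLE KEY
DISTKEY (product_sku);
"""

rsd.batch_execute_statement(
    ClusterIdentifier="analytics-cluster",   # assumed cluster name
    Database="analytics",                    # assumed database
    DbUser="analytics_admin",                # assumed database user
    Sqls=[product_ddl, transactions_ddl],
)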

Question 39

A company hosts an on-premises PostgreSQL database that contains historical data. An internal legacy application uses the database for read-only activities. The company’s business team wants to move the data to a data lake in Amazon S3 as soon as possible and enrich the data for analytics.

The company has set up an AWS Direct Connect connection between its VPC and its on-premises network. A data analytics specialist must design a solution that achieves the business team’s goals with the least operational overhead.

Which solution meets these requirements?

Options:

A.

Upload the data from the on-premises PostgreSQL database to Amazon S3 by using a customized batch upload process. Use the AWS Glue crawler to catalog the data in Amazon S3. Use an AWS Glue job to enrich and store the result in a separate S3 bucket in Apache Parquet format. Use Amazon Athena to query the data.

B.

Create an Amazon RDS for PostgreSQL database and use AWS Database Migration Service (AWS DMS) to migrate the data into Amazon RDS. Use AWS Data Pipeline to copy and enrich the data from the Amazon RDS for PostgreSQL table and move the data to Amazon S3. Use Amazon Athena to query the data.

C.

Configure an AWS Glue crawler to use a JDBC connection to catalog the data in the on-premises database. Use an AWS Glue job to enrich the data and save the result to Amazon S3 in Apache Parquet format. Create an Amazon Redshift cluster and use Amazon Redshift Spectrum to query the data.

D.

Configure an AWS Glue crawler to use a JDBC connection to catalog the data in the on-premises database. Use an AWS Glue job to enrich the data and save the result to Amazon S3 in Apache Parquet format. Use Amazon Athena to query the data.
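
The AWS Glue-based options rely on a JDBC connection, a crawler, and Amazon Athena. A minimal boto3 sketch follows; the connection details, resource names, and the S3 output location are assumptions.

import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

# JDBC connection that reaches the on-premises PostgreSQL database over Direct Connect.
glue.create_connection(
    ConnectionInput={
        "Name": "onprem-postgres",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://10.0.0.10:5432/history",  # assumed host/db
            "USERNAME": "readonly_user",        # assumed credentials
            "PASSWORD": "example-password",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",          # assumed VPC networking
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)

# Crawler that catalogs the on-premises tables through the JDBC connection.
glue.create_crawler(
    Name="onprem-history-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",  # assumed role
    DatabaseName="history_catalog",
    Targets={"JdbcTargets": [{"ConnectionName": "onprem-postgres", "Path": "history/public/%"}]},
)

# After a Glue job writes the enriched result to S3 in Parquet, query it with Athena.
athena.start_query_execution(
    QueryString="SELECT * FROM enriched_history LIMIT 10;",
    QueryExecutionContext={"Database": "history_catalog"},
    ResultConfiguration={"OutputLocation": "s3://analytics-results-bucket/athena/"},
)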
