在 EMR Serverless Console 上,创建application:
输入名称hive-serverless,选择类型 Hive 和最新版本:
其他保持默认,并创建application。
下面提交任务时,我们要将任务日志同时输出到CloudWatch,所以先给EMRServerlessS3RuntimeRole
添加写Cloudwatch Logs的权限:
添加CloudWatchFullAccessV2
:
点击Submit Job
:
输入以下详细信息:
Name | Hive-Serverless-Console |
Runtime role | EMRServerlessS3RuntimeRole |
Initialization Script Location S3 URI | s3://aws-data-analytics-workshops/emr-serverless-workshop/scripts/create_taxi_trip.sql |
Script location S3 URI | s3://aws-data-analytics-workshops/emr-serverless-workshop/scripts/count.sql |
初始化脚本为NewYork Taxi Trip
创建一个hive table:
CREATE EXTERNAL TABLE if not exists `nytaxitrip`(
`vendorid` bigint,
`tpep_pickup_datetime` string,
`tpep_dropoff_datetime` string,
`passenger_count` bigint,
`trip_distance` double,
`ratecodeid` bigint,
`store_and_fwd_flag` string,
`pulocationid` bigint,
`dolocationid` bigint,
`payment_type` bigint,
`fare_amount` double,
`extra` double,
`mta_tax` double,
`tip_amount` double,
`tolls_amount` double,
`improvement_surcharge` double,
`total_amount` double,
`congestion_surcharge` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://aws-data-analytics-workshops/shared_datasets/tripdata/';
Hive 创建表后,运行计数查询来查看该表中的记录数(count.sql内容):
SET hive.cli.print.header=true;
SET hive.query.name=TaxiTrips;
Select count(*) as count from nytaxitrip;
在作业配置下,使用之前cloudformation创建的S3桶名称更新DOC-EXAMPLE_BUCKET:
在其他设置下,将日志上传到S3的/logs
路径以及CloudWatch Logs
:
最后点击Submit job
。
Job成功完成后,转到Athena验证表是否已创建,并查询它以检查数据: