提交 Hive Job

创建Hive application

在 EMR Serverless Console 上,创建application:

image-20231221090349469

输入名称hive-serverless,选择类型 Hive 和最新版本:

image-20231221090413671

其他保持默认,并创建application。

为IAM Role 添加权限

下面提交任务时,我们要将任务日志同时输出到CloudWatch,所以先给EMRServerlessS3RuntimeRole 添加写Cloudwatch Logs的权限:

image-20231221093149454

添加CloudWatchFullAccessV2

image-20231221093236359

创建Job

点击Submit Job:

image-20231221090524340

输入以下详细信息:

Name Hive-Serverless-Console
Runtime role EMRServerlessS3RuntimeRole
Initialization Script Location S3 URI s3://aws-data-analytics-workshops/emr-serverless-workshop/scripts/create_taxi_trip.sql
Script location S3 URI s3://aws-data-analytics-workshops/emr-serverless-workshop/scripts/count.sql

image-20231221090635853

初始化脚本为NewYork Taxi Trip创建一个hive table:

CREATE EXTERNAL TABLE if not exists `nytaxitrip`(
  `vendorid` bigint, 
  `tpep_pickup_datetime` string, 
  `tpep_dropoff_datetime` string, 
  `passenger_count` bigint, 
  `trip_distance` double, 
  `ratecodeid` bigint, 
  `store_and_fwd_flag` string, 
  `pulocationid` bigint, 
  `dolocationid` bigint, 
  `payment_type` bigint, 
  `fare_amount` double, 
  `extra` double, 
  `mta_tax` double, 
  `tip_amount` double, 
  `tolls_amount` double, 
  `improvement_surcharge` double, 
  `total_amount` double, 
  `congestion_surcharge` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://aws-data-analytics-workshops/shared_datasets/tripdata/'; 

Hive 创建表后,运行计数查询来查看该表中的记录数(count.sql内容):

SET hive.cli.print.header=true;
SET hive.query.name=TaxiTrips;

Select count(*) as count from nytaxitrip;

在作业配置下,使用之前cloudformation创建的S3桶名称更新DOC-EXAMPLE_BUCKET

image-20231221091137480

在其他设置下,将日志上传到S3的/logs路径以及CloudWatch Logs

image-20231221091418152

最后点击Submit job

Job成功完成后,转到Athena验证表是否已创建,并查询它以检查数据:

image-20231221093910580