在上一节中,我们创建了S3存储桶,并在input目录上传了tripdata.csv文件。在这一节中,将继续使用此S3存储桶, 并使用EMR集群的Hive进行SQL查询。
切换到Cloud9 IDE, SSH登录到主节点
在命令行中,执行hive
进入交互式界面:
ny_taxi_test
,来指向之前上传的tripdata.csv数据集。运行以下的DDL语句来创建外部表,将<YOUR-BUCKET>
替换为上一步创建的S3桶名称:hive> CREATE EXTERNAL TABLE ny_taxi_test (
vendor_id int,
lpep_pickup_datetime string,
lpep_dropoff_datetime string,
store_and_fwd_flag string,
rate_code_id smallint,
pu_location_id int,
do_location_id int,
passenger_count int,
trip_distance double,
fare_amount double,
mta_tax double,
tip_amount double,
tolls_amount double,
ehail_fee double,
improvement_surcharge double,
total_amount double,
payment_type smallint,
trip_type smallint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION "s3://<YOUR-BUCKET>/input/";
hive> SELECT DISTINCT rate_code_id FROM ny_taxi_test;