在上一节中,我们创建了S3存储桶,并在input目录上传了tripdata.csv文件。在这一节中,将继续使用此S3存储桶, 并使用EMR集群的Hive进行SQL查询。
切换到Cloud9 IDE, SSH登录到主节点
在命令行中,执行hive进入交互式界面:

ny_taxi_test,来指向之前上传的tripdata.csv数据集。运行以下的DDL语句来创建外部表,将<YOUR-BUCKET> 替换为上一步创建的S3桶名称:hive> CREATE EXTERNAL TABLE ny_taxi_test (
              vendor_id int,
              lpep_pickup_datetime string,
              lpep_dropoff_datetime string,
              store_and_fwd_flag string,
              rate_code_id smallint,
              pu_location_id int,
              do_location_id int,
              passenger_count int,
              trip_distance double,
              fare_amount double,
              mta_tax double,
              tip_amount double,
              tolls_amount double,
              ehail_fee double,
              improvement_surcharge double,
              total_amount double,
              payment_type smallint,
              trip_type smallint
       )
       ROW FORMAT DELIMITED
       FIELDS TERMINATED BY ','
       LINES TERMINATED BY '\n'
       STORED AS TEXTFILE
       LOCATION "s3://<YOUR-BUCKET>/input/";
hive> SELECT DISTINCT rate_code_id FROM ny_taxi_test;
