Submitting a Spark Job from the CLI

In the CloudFormation stack outputs, find the role's ARN.


Open Cloud9 and set it as an environment variable in the Cloud9 terminal:

export JOB_ROLE_ARN=<<EMRServerlessS3RuntimeRoleARN>>
export S3_BUCKET=s3://<<YOUR_BUCKET>>
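As an alternative to copying the ARN out of the console, it can be read from the stack outputs with the CLI. A sketch; `get_stack_output` is a hypothetical helper, and the stack name you pass it must match your own CloudFormation stack:

```shell
# Hypothetical helper: print the value of one CloudFormation stack output.
# Usage: get_stack_output <stack-name> <output-key>
get_stack_output() {
  aws cloudformation describe-stacks \
    --stack-name "$1" \
    --query "Stacks[0].Outputs[?OutputKey=='$2'].OutputValue | [0]" \
    --output text
}

# Assumed stack name -- replace with yours:
# export JOB_ROLE_ARN=$(get_stack_output emr-serverless-workshop EMRServerlessS3RuntimeRoleARN)
```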


Next, get the EMR Serverless application ID: go to the EMR Serverless console, click the application, and copy its ID.


Export this ID as an environment variable as well:

export APPLICATION_ID=<<application_id>>
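The ID can also be looked up with the CLI instead of copied from the console. A sketch; `get_application_id` is a hypothetical helper, and the application name you pass it must match the name shown in your console:

```shell
# Hypothetical helper: look up an EMR Serverless application ID by name.
get_application_id() {
  aws emr-serverless list-applications \
    --query "applications[?name=='$1'].id | [0]" \
    --output text
}

# Assumed application name -- replace with yours:
# export APPLICATION_ID=$(get_application_id My_First_Application)
```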

Submit the job with the following command:

aws emr-serverless start-job-run \
    --application-id ${APPLICATION_ID} \
    --execution-role-arn ${JOB_ROLE_ARN} \
    --name "Spark-WordCount-CLI" \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://us-east-1.elasticmapreduce/emr-containers/samples/wordcount/scripts/wordcount.py",
            "entryPointArguments": ["'"$S3_BUCKET"'/wordcount_output/"],
            "sparkSubmitParameters": "--conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=1 --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        }
    }'
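`start-job-run` prints a JSON response that includes the new run's `jobRunId`; capturing it lets you query the run's status from the CLI later. A sketch wrapping the same submission (`JOB_DRIVER_JSON` is a hypothetical variable holding the `sparkSubmit` JSON shown above):

```shell
# Sketch: submit the job and print only the new run's ID.
# --query 'jobRunId' extracts the ID field from the JSON response.
submit_job() {
  aws emr-serverless start-job-run \
    --application-id "$APPLICATION_ID" \
    --execution-role-arn "$JOB_ROLE_ARN" \
    --name "Spark-WordCount-CLI" \
    --job-driver "$1" \
    --query 'jobRunId' \
    --output text
}

# export JOB_RUN_ID=$(submit_job "$JOB_DRIVER_JSON")
```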


Check the status in the EMR console.


The job status will progress through Scheduled -> Pending -> Running -> Success. Once the run finishes, go to S3 to view the job's output.
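The same state transitions can be watched from the CLI with `get-job-run`. A minimal polling sketch, assuming the run ID was captured into `JOB_RUN_ID` when the job was submitted:

```shell
# Print the current state of the job run.
get_job_state() {
  aws emr-serverless get-job-run \
    --application-id "$APPLICATION_ID" \
    --job-run-id "$JOB_RUN_ID" \
    --query 'jobRun.state' \
    --output text
}

# Poll every 10 seconds until the run reaches a terminal state.
wait_for_job() {
  while :; do
    state=$(get_job_state)
    echo "state: $state"
    case "$state" in
      SUCCESS|FAILED|CANCELLED) break ;;
    esac
    sleep 10
  done
}
```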


In the previous section, running the job from the console wrote its results to this same S3 directory; when the job runs again, the files under that directory are automatically cleared first.