
Because of data ingestion and storage costs, CloudWatch logging for the EKS control plane is not enabled by default; instead, job logs can be delivered to CloudWatch and an S3 bucket. This example shows how to specify the CloudWatch monitoring configuration and the S3 log path as part of the job configuration.

aws emr-containers start-job-run \
--virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
--name spark-pi-logging \
--execution-role-arn ${EMR_EKS_EXECUTION_ARN} \
--release-label emr-6.2.0-latest \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "s3://aws-data-analytics-workshops/emr-eks-workshop/scripts/pi.py",
        "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
        }
    }' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.driver.memory":"2G"
         }
      }
    ], 
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "/emr-containers/jobs", 
        "logStreamNamePrefix": "emr-eks-workshop"
      }, 
      "s3MonitoringConfiguration": {
        "logUri": "'"$S3_BUCKET"'/logs/"
      }
    }
}'
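
After submitting the job, you can poll its state before looking for logs. A minimal sketch using describe-job-run, assuming JOB_RUN_ID holds the "id" field returned by the start-job-run call above:

# Check the state of the submitted job run.
# JOB_RUN_ID is a placeholder for the "id" returned by start-job-run.
aws emr-containers describe-job-run \
--virtual-cluster-id ${EMR_EKS_CLUSTER_ID} \
--id ${JOB_RUN_ID} \
--query 'jobRun.state'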

You can go to the S3 bucket you specified to inspect the logs. Your log data is sent to the following Amazon S3 locations:

Controller logs - /logUri/virtual-cluster-id/jobs/job-id/containers/pod-name/(stderr.gz/stdout.gz)

Driver logs - /logUri/virtual-cluster-id/jobs/job-id/containers/spark-application-id/spark-job-id-driver/(stderr.gz/stdout.gz)

Executor logs - /logUri/virtual-cluster-id/jobs/job-id/containers/spark-application-id/executor-pod-name/(stderr.gz/stdout.gz)
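
For example, you can list the objects the job wrote under the configured prefix and filter for the driver. A minimal sketch, assuming $S3_BUCKET is the same s3:// URI used in the job configuration above:

# List all log objects under the configured logUri prefix,
# keeping only the driver's entries.
aws s3 ls ${S3_BUCKET}/logs/ --recursive | grep driver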

Explore the contents of the driver logs and run an S3 Select query on stdout.gz; the query result shows the output of the PySpark Pi job, including the computed value of Pi. The path format should be: s3://xxxx/yyyy/containers/spark-xxxx/spark-xxx-driver/stdout.gz
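
A minimal sketch of such a query with the AWS CLI, where BUCKET and KEY are placeholders for your log bucket and the driver's stdout.gz object key:

# Run an S3 Select query against the gzipped driver stdout.
# BUCKET and KEY are placeholders; fill them in from the listing above.
aws s3api select-object-content \
--bucket ${BUCKET} \
--key ${KEY} \
--expression "SELECT * FROM s3object s" \
--expression-type SQL \
--input-serialization '{"CSV": {}, "CompressionType": "GZIP"}' \
--output-serialization '{"CSV": {}}' \
/dev/stdout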


In the StartJobRun API, logGroupName is the CloudWatch log group name and logStreamNamePrefix is the prefix of the CloudWatch log stream name. You can view and search these logs in the AWS Management Console:

Controller logs - logGroup/logStreamPrefix/virtual-cluster-id/jobs/job-id/containers/pod-name/(stderr/stdout)

Driver logs - logGroup/logStreamPrefix/virtual-cluster-id/jobs/job-id/containers/spark-application-id/spark-job-id-driver/(stderr/stdout)

Executor logs - logGroup/logStreamPrefix/virtual-cluster-id/jobs/job-id/containers/spark-application-id/executor-pod-name/(stderr/stdout)
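
The same streams can also be read from the command line. A minimal sketch, assuming the /emr-containers/jobs log group and the emr-eks-workshop prefix configured above; LOG_STREAM is a placeholder for a full stream name returned by the first command:

# Find the job's log streams under the configured prefix.
aws logs describe-log-streams \
--log-group-name /emr-containers/jobs \
--log-stream-name-prefix emr-eks-workshop \
--query 'logStreams[].logStreamName'

# Fetch events from one stream (LOG_STREAM is a placeholder).
aws logs get-log-events \
--log-group-name /emr-containers/jobs \
--log-stream-name ${LOG_STREAM} \
--limit 50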

From the AWS Console -> Services -> CloudWatch, choose Log groups, select /emr-containers/jobs, and then open Workshop/xxx/jobs/xxx/containers/spark-xxx-driver/stdout to view the driver output.
