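Apache Beam Operators

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam's supported distributed processing back-ends, which include Apache Flink, Apache Spark, and Google Cloud Dataflow.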

Note

When an Apache Beam pipeline runs on the Dataflow service, this operator requires the gcloud command (Google Cloud SDK) to be installed on the Airflow worker: <https://cloud.google.com/sdk/docs/install>.

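Run Python Pipelines in Apache Beam

The py_file argument must be specified for BeamRunPythonPipelineOperator because it contains the pipeline to be executed by Beam. The Python file can be located on GCS, where Airflow must have permission to download it, or on the local filesystem (provide the absolute path to it).

The py_interpreter argument specifies the Python version to use when executing the pipeline; the default is python3. If your Airflow instance runs on Python 2, specify python2 and make sure your py_file is written for Python 2. For best results, use Python 3.

If the py_requirements argument is specified, a temporary Python virtual environment with the specified requirements is created and the pipeline runs inside it.

The py_system_site_packages argument specifies whether all Python packages from your Airflow instance are accessible within the virtual environment. It applies only if the py_requirements argument is specified. Avoid enabling it unless the Dataflow job requires it.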
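To make the effect of py_system_site_packages more concrete, the following sketch creates a throwaway virtual environment with the standard library venv module and reads back the isolation setting it records. This only illustrates the concept and is not the operator's implementation. The helper name create_temp_venv is hypothetical, and package installation is skipped to keep the example fast.

```python
import tempfile
import venv
from pathlib import Path


def create_temp_venv(system_site_packages: bool = False) -> dict:
    """Create a temporary virtual environment and return its pyvenv.cfg settings.

    Hypothetical helper for illustration only; the operator manages its own
    environment and additionally installs the packages from py_requirements.
    """
    with tempfile.TemporaryDirectory(prefix="beam-venv-") as tmp_dir:
        # with_pip=False skips bootstrapping pip, so no packages are installed here.
        builder = venv.EnvBuilder(
            system_site_packages=system_site_packages, with_pip=False
        )
        builder.create(tmp_dir)

        # pyvenv.cfg stores "key = value" lines describing the environment.
        settings = {}
        for line in (Path(tmp_dir) / "pyvenv.cfg").read_text().splitlines():
            key, sep, value = line.partition("=")
            if sep:
                settings[key.strip()] = value.strip()
        return settings


if __name__ == "__main__":
    isolated = create_temp_venv(system_site_packages=False)
    print(isolated["include-system-site-packages"])
```

With system_site_packages=False the environment cannot import packages installed alongside Airflow, which is why the pipeline's dependencies must then be listed explicitly in py_requirements.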

Python Pipelines with DirectRunner

tests/system/apache/beam/example_python.py[source]

start_python_pipeline_local_direct_runner = BeamRunPythonPipelineOperator(
    task_id="start_python_pipeline_local_direct_runner",
    py_file="apache_beam.examples.wordcount",
    py_options=["-m"],
    py_requirements=["apache-beam[gcp]==2.59.0"],
    py_interpreter="python3",
    py_system_site_packages=False,
)

tests/system/apache/beam/example_python.py[source]

start_python_pipeline_direct_runner = BeamRunPythonPipelineOperator(
    task_id="start_python_pipeline_direct_runner",
    py_file=GCS_PYTHON,
    py_options=[],
    pipeline_options={"output": GCS_OUTPUT},
    py_requirements=["apache-beam[gcp]==2.59.0"],
    py_interpreter="python3",
    py_system_site_packages=False,
)

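You can use deferrable mode for this action in order to run the operator asynchronously. When the operator knows it has to wait, it can free up the worker and hand the job of resuming it to the trigger. While it is suspended (deferred), it does not occupy a worker slot, so your cluster wastes far fewer resources on idle operators or sensors.

tests/system/apache/beam/example_python_async.py[source]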

start_python_pipeline_local_direct_runner = BeamRunPythonPipelineOperator(
    task_id="start_python_pipeline_local_direct_runner",
    py_file="apache_beam.examples.wordcount",
    py_options=["-m"],
    py_requirements=["apache-beam[gcp]==2.59.0"],
    py_interpreter="python3",
    py_system_site_packages=False,
    deferrable=True,
)

tests/system/apache/beam/example_python_async.py[source]

start_python_pipeline_direct_runner = BeamRunPythonPipelineOperator(
    task_id="start_python_pipeline_direct_runner",
    py_file=GCS_PYTHON,
    py_options=[],
    pipeline_options={"output": GCS_OUTPUT},
    py_requirements=["apache-beam[gcp]==2.59.0"],
    py_interpreter="python3",
    py_system_site_packages=False,
    deferrable=True,
)

Python Pipelines with DataflowRunner

tests/system/apache/beam/example_python.py[source]

start_python_pipeline_dataflow_runner = BeamRunPythonPipelineOperator(
    task_id="start_python_pipeline_dataflow_runner",
    runner="DataflowRunner",
    py_file=GCS_PYTHON,
    pipeline_options={
        "tempLocation": GCS_TMP,
        "stagingLocation": GCS_STAGING,
        "output": GCS_OUTPUT,
    },
    py_options=[],
    py_requirements=["apache-beam[gcp]==2.59.0"],
    py_interpreter="python3",
    py_system_site_packages=False,
    dataflow_config=DataflowConfiguration(
        job_name="{{task.task_id}}", project_id=GCP_PROJECT_ID, location="us-central1"
    ),
)
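
The pipeline_options dictionary in the example above ends up as command-line flags for the pipeline process. The sketch below shows one plausible conversion so the mapping is easier to picture. It is an illustration under stated assumptions, not the provider's code: the function name options_to_args is hypothetical, the bucket paths are placeholders, and the treatment of booleans, None, and lists is assumed.

```python
def options_to_args(options: dict) -> list[str]:
    """Turn an options dictionary into a list of --key=value style flags.

    Assumed rules for this sketch:
    - None or True  -> bare flag (--key)
    - False         -> flag omitted
    - list          -> one --key=item flag per item
    - anything else -> --key=value
    """
    args = []
    for name, value in options.items():
        if value is None or value is True:
            args.append(f"--{name}")
        elif value is False:
            continue
        elif isinstance(value, list):
            args.extend(f"--{name}={item}" for item in value)
        else:
            args.append(f"--{name}={value}")
    return args


if __name__ == "__main__":
    # Placeholder values; substitute your own bucket paths.
    example = {
        "tempLocation": "gs://example-bucket/tmp",
        "stagingLocation": "gs://example-bucket/staging",
        "output": "gs://example-bucket/output",
    }
    print(options_to_args(example))
```

Keys are passed through unchanged in this sketch, so an option written as tempLocation becomes the flag --tempLocation.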

tests/system/apache/beam/example_python_dataflow.py[source]

start_python_job_dataflow_runner_async = BeamRunPythonPipelineOperator(
    task_id="start_python_job_dataflow_runner_async",
    runner="DataflowRunner",
    py_file=GCS_PYTHON_DATAFLOW_ASYNC,
    pipeline_options={
        "tempLocation": GCS_TMP,
        "stagingLocation": GCS_STAGING,
        "output": GCS_OUTPUT,
    },
    py_options=[],
    py_requirements=["apache-beam[gcp]==2.59.0"],
    py_interpreter="python3",
    py_system_site_packages=False,
    dataflow_config=DataflowConfiguration(
        job_name="{{task.task_id}}",
        project_id=GCP_PROJECT_ID,
        location="us-central1",
        wait_until_finished=False,
    ),
)

wait_for_python_job_dataflow_runner_async_done = DataflowJobStatusSensor(
    task_id="wait-for-python-job-async-done",
    job_id="{{task_instance.xcom_pull('start_python_job_dataflow_runner_async')['dataflow_job_id']}}",
    expected_statuses={DataflowJobStatus.JOB_STATE_DONE},
    project_id=GCP_PROJECT_ID,
    location="us-central1",
)

start_python_job_dataflow_runner_async >> wait_for_python_job_dataflow_runner_async_done

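You can use deferrable mode for this action in order to run the operator asynchronously. When the operator knows it has to wait, it can free up the worker and hand the job of resuming it to the trigger. While it is suspended (deferred), it does not occupy a worker slot, so your cluster wastes far fewer resources on idle operators or sensors.

tests/system/apache/beam/example_python_async.py[source]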

start_python_pipeline_dataflow_runner = BeamRunPythonPipelineOperator(
    task_id="start_python_pipeline_dataflow_runner",
    runner="DataflowRunner",
    py_file=GCS_PYTHON,
    pipeline_options={
        "tempLocation": GCS_TMP,
        "stagingLocation": GCS_STAGING,
        "output": GCS_OUTPUT,
    },
    py_options=[],
    py_requirements=["apache-beam[gcp]==2.59.0"],
    py_interpreter="python3",
    py_system_site_packages=False,
    dataflow_config=DataflowConfiguration(
        job_name="{{task.task_id}}", project_id=GCP_PROJECT_ID, location="us-central1"
    ),
    deferrable=True,
)


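Run Java Pipelines in Apache Beam

For Java pipelines, the jar argument must be specified for BeamRunJavaPipelineOperator because it contains the pipeline to be executed by Apache Beam. The JAR can be located on GCS, where Airflow must have permission to download it, or on the local filesystem (provide the absolute path to it).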

Java Pipelines with DirectRunner

tests/system/apache/beam/example_beam.py[source]

jar_to_local_direct_runner = GCSToLocalFilesystemOperator(
    task_id="jar_to_local_direct_runner",
    bucket=GCS_JAR_DIRECT_RUNNER_BUCKET_NAME,
    object_name=GCS_JAR_DIRECT_RUNNER_OBJECT_NAME,
    filename="/tmp/beam_wordcount_direct_runner_{{ ds_nodash }}.jar",
)

start_java_pipeline_direct_runner = BeamRunJavaPipelineOperator(
    task_id="start_java_pipeline_direct_runner",
    jar="/tmp/beam_wordcount_direct_runner_{{ ds_nodash }}.jar",
    pipeline_options={
        "output": "/tmp/start_java_pipeline_direct_runner",
        "inputFile": GCS_INPUT,
    },
    job_class="org.apache.beam.examples.WordCount",
)

jar_to_local_direct_runner >> start_java_pipeline_direct_runner

Java Pipelines with DataflowRunner

tests/system/apache/beam/example_java_dataflow.py[source]

jar_to_local_dataflow_runner = GCSToLocalFilesystemOperator(
    task_id="jar_to_local_dataflow_runner",
    bucket=GCS_JAR_DATAFLOW_RUNNER_BUCKET_NAME,
    object_name=GCS_JAR_DATAFLOW_RUNNER_OBJECT_NAME,
    filename="/tmp/beam_wordcount_dataflow_runner_{{ ds_nodash }}.jar",
)

start_java_pipeline_dataflow = BeamRunJavaPipelineOperator(
    task_id="start_java_pipeline_dataflow",
    runner="DataflowRunner",
    jar="/tmp/beam_wordcount_dataflow_runner_{{ ds_nodash }}.jar",
    pipeline_options={
        "tempLocation": GCS_TMP,
        "stagingLocation": GCS_STAGING,
        "output": GCS_OUTPUT,
    },
    job_class="org.apache.beam.examples.WordCount",
    dataflow_config={"job_name": "{{task.task_id}}", "location": "us-central1"},
)

jar_to_local_dataflow_runner >> start_java_pipeline_dataflow


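Run Go Pipelines in Apache Beam

The go_file argument must be specified for BeamRunGoPipelineOperator because it contains the pipeline to be executed by Beam. The Go file can be located on GCS, where Airflow must have permission to download it, or on the local filesystem (provide the absolute path to it). When running from the local filesystem, the equivalent command is go run <go_file>. If the file is pulled from a GCS bucket, the module is initialized and its dependencies are installed beforehand with go mod init example.com/main and go mod tidy.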
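
The difference between the local and GCS cases can be sketched as a small function that lists the Go commands for each. This is a simplified illustration, not the operator's code. The function name plan_go_commands is hypothetical, go mod init is the Go command that creates a module, and the base name of the GCS object stands in for wherever the file is actually downloaded.

```python
def plan_go_commands(go_file: str) -> list[list[str]]:
    """Return the sequence of Go commands for a local or GCS-hosted file.

    Hypothetical helper for illustration; it only builds the command lists
    and does not download or execute anything.
    """
    if go_file.lower().startswith("gs://"):
        # Placeholder for the downloaded copy: the object's base name.
        local_name = go_file.rsplit("/", 1)[-1]
        return [
            ["go", "mod", "init", "example.com/main"],  # initialize the module
            ["go", "mod", "tidy"],                      # install dependencies
            ["go", "run", local_name],                  # run the pipeline
        ]
    # A local file is run directly.
    return [["go", "run", go_file]]


if __name__ == "__main__":
    for command in plan_go_commands("gs://example-bucket/wordcount.go"):
        print(" ".join(command))
```

A local go_file is expected to live in a directory that already has its module and dependencies set up, which is why only the GCS case needs the two extra steps.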

Go Pipelines with DirectRunner

tests/system/apache/beam/example_go.py[source]

start_go_pipeline_local_direct_runner = BeamRunGoPipelineOperator(
    task_id="start_go_pipeline_local_direct_runner",
    go_file="files/apache_beam/examples/wordcount.go",
)

tests/system/apache/beam/example_go.py[source]

start_go_pipeline_direct_runner = BeamRunGoPipelineOperator(
    task_id="start_go_pipeline_direct_runner",
    go_file=GCS_GO,
    pipeline_options={"output": GCS_OUTPUT},
)

Go Pipelines with DataflowRunner

tests/system/apache/beam/example_go.py[source]

start_go_pipeline_dataflow_runner = BeamRunGoPipelineOperator(
    task_id="start_go_pipeline_dataflow_runner",
    runner="DataflowRunner",
    go_file=GCS_GO,
    pipeline_options={
        "tempLocation": GCS_TMP,
        "stagingLocation": GCS_STAGING,
        "output": GCS_OUTPUT,
        "WorkerHarnessContainerImage": "apache/beam_go_sdk:latest",
    },
    dataflow_config=DataflowConfiguration(
        job_name="{{task.task_id}}", project_id=GCP_PROJECT_ID, location="us-central1"
    ),
)

tests/system/apache/beam/example_go_dataflow.py[source]

start_go_job_dataflow_runner_async = BeamRunGoPipelineOperator(
    task_id="start_go_job_dataflow_runner_async",
    runner="DataflowRunner",
    go_file=GCS_GO_DATAFLOW_ASYNC,
    pipeline_options={
        "tempLocation": GCS_TMP,
        "stagingLocation": GCS_STAGING,
        "output": GCS_OUTPUT,
        "WorkerHarnessContainerImage": "apache/beam_go_sdk:latest",
    },
    dataflow_config=DataflowConfiguration(
        job_name="{{task.task_id}}",
        project_id=GCP_PROJECT_ID,
        location="us-central1",
        wait_until_finished=False,
    ),
)

wait_for_go_job_dataflow_runner_async_done = DataflowJobStatusSensor(
    task_id="wait-for-go-job-async-done",
    job_id="{{task_instance.xcom_pull('start_go_job_dataflow_runner_async')['dataflow_job_id']}}",
    expected_statuses={DataflowJobStatus.JOB_STATE_DONE},
    project_id=GCP_PROJECT_ID,
    location="us-central1",
)

start_go_job_dataflow_runner_async >> wait_for_go_job_dataflow_runner_async_done
