Python – BigQuery Apache Beam pipeline “hanging” when using DirectRunner

I’m curious if anyone else here has had a similar issue with Python Apache Beam’s direct runner, as described below. (I can’t ship to Cloud Dataflow yet.)

The query I’m executing returns fewer than 18 million rows. If I add a LIMIT to the query (for example, 10000), the pipeline works as expected. The WriteToBleve sink, not included in the snippet, is a custom sink that writes to a bleve index.

I’m using Python SDK 2.2.0, but I’m getting ready to try some Java…

The last log message I see when I run the pipeline is:

WARNING:root:Dataset
my-project:temp_dataset_7708fbe7e7694cd49b8b0de07af2470b does not
exist so we will create it as temporary with location=None

The dataset is properly created and populated, and when I debug into the pipeline, I can see the results of the iteration, but the pipeline itself never seems to reach the write stage.

    options = {
        "project": "my-project",
        "staging_location": "gs://my-project/staging",
        "temp_location": "gs://my-project/temp",
        "runner": "DirectRunner"
    }
    pipeline_options = beam.pipeline.PipelineOptions(flags=[], **options)
    p = beam.Pipeline(options=pipeline_options)
    p | 'Read From Bigquery' >> beam.io.Read(beam.io.BigQuerySource(
        query=self.build_query(),
        use_standard_sql=True,
        validate=True,
        flatten_results=False,
    )) | 'Write to Bleve' >> WriteToBleve()

    result = p.run()
    result.wait_until_finish()

Solution

The direct runner is designed for debugging and testing pipelines locally with small amounts of data. It is not optimized for performance and is not suitable for large amounts of data – this is true for both Python and Java.
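One way to keep direct-runner runs small is to cap the query only when running locally. The `build_debug_query` helper below is a hypothetical sketch, not part of the Beam API; it assumes the base query has no existing LIMIT clause:

```python
# Hypothetical helper (not part of Apache Beam): append a LIMIT clause
# only when debugging locally, so the direct runner never sees the full
# multi-million-row result set.
def build_debug_query(base_query, runner, limit=10000):
    """Cap the row count for local DirectRunner debugging."""
    if runner == "DirectRunner":
        return "{} LIMIT {}".format(base_query, limit)
    return base_query
```

With `runner="DirectRunner"` the query gains a `LIMIT 10000` clause; with any other runner it is passed through unchanged.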

That said, some very important improvements to the Python direct runner are in progress.

I recommend that you try running on Dataflow to see if the performance is still not satisfactory.
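As a minimal sketch of that switch, reusing the options dict from the question: only the `runner` key changes (the project and bucket names are the question’s placeholders):

```python
# Same pipeline options as in the question, but pointed at the managed
# Dataflow service instead of the local direct runner.
options = {
    "project": "my-project",
    "staging_location": "gs://my-project/staging",
    "temp_location": "gs://my-project/temp",
    "runner": "DataflowRunner",  # was "DirectRunner"
}
```

The rest of the pipeline code stays the same; the options are passed to `beam.pipeline.PipelineOptions(flags=[], **options)` exactly as in the question.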

Also, if you can write in Java, I recommend it: its performance is usually orders of magnitude better than Python’s, especially when reading from BigQuery. Reading from BigQuery goes through exporting the data to Avro files, and the performance of the standard Python library for reading Avro files is notoriously bad; unfortunately, there is currently no well-performing and well-maintained alternative.
