Java – Google Cloud Dataflow service account not propagated to workers?

We have multiple Google Cloud Dataflow jobs (written in Java/Kotlin) that can be run in two different ways:

  1. Started from a user's Google Cloud account.
  2. Started from a service account (with the required policies and permissions).

When a Dataflow job is run from a user account, Dataflow gives the workers the default controller service account. It does not propagate the authorized user's credentials to the workers.

When running a Dataflow job from a service account, I would expect the service account set with setGcpCredential to be propagated to the worker VMs that Dataflow uses in the background. The JavaDocs do not mention any of this, but they do say that the credentials are used to authenticate against GCP services.

In most of our Dataflow use cases, we run the Dataflow job in project A while reading data from BigQuery in project B. Therefore, we grant the users reader access to the BigQuery dataset in project B, and we grant the same access to the service account used in the second approach described above. In project A, the same service account also has the BigQuery jobUser and dataViewer roles.

The problem now is that in both cases, it seems we need to give the default controller service account access to the BigQuery dataset used in the Dataflow job. If we don't, we get a permission denied (403) error from BigQuery when the job tries to access the dataset in project B.
For the second approach, I would expect Dataflow to be independent of the default controller service account. My gut feeling is that Dataflow doesn't propagate the service account set in PipelineOptions to the workers.

In general, we provide the project, region, zone, temporary locations (gcpTempLocation, tempLocation, stagingLocation), runner type (in this case, DataflowRunner), and gcpCredential as PipelineOptions.

So, does Google Cloud Dataflow really propagate the provided service account to the workers?

Update

We first tried adding options.setServiceAccount, as suggested by Magda, without adding any IAM permissions. This resulted in the following error in the Dataflow log:

{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : " Current user cannot act as service account [email protected]. Causes: Current user cannot act as service account [email protected]..",
    "reason" : "forbidden"
  } ],
  "message" : " Current user cannot act as service account [email protected].. Causes: Current user cannot act as service account [email protected].",
  "status" : "PERMISSION_DENIED"
}

After that, we tried adding roles/iam.serviceAccountUser to this service account. Unfortunately, this led to the same error. This service account already has the IAM roles Dataflow Worker and BigQuery Job User.
The default Compute Engine controller service account, [email protected], only has the Editor role, and we did not add any other IAM roles or permissions to it.

Solution

I think you need to set the controller service account as well. You can use options.setServiceAccount("hereYourControllerServiceAccount@yourProject.iam.gserviceaccount.com") in the Dataflow pipeline options.

You need to add some additional permissions:

  • For the controller: Dataflow Worker and Storage Object Admin.

  • For the executor: Service Account User.

That's what I found in Google's docs and tried out myself.
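As a rough sketch, the same setting can also be passed as a Beam-style command-line flag instead of calling options.setServiceAccount in code. The class name DataflowArgs, the helper build, and all project, bucket, and account values below are hypothetical placeholders, not taken from the question. The example only assembles the flags and needs no Beam dependency, so it does not launch a job.

```java
import java.util.ArrayList;
import java.util.List;

class DataflowArgs {
    // Builds pipeline flags for a Dataflow job whose workers should run as
    // the given controller service account (all values are caller-supplied).
    static String[] build(String project, String region,
                          String tempLocation, String controllerServiceAccount) {
        List<String> args = new ArrayList<>();
        args.add("--runner=DataflowRunner");
        args.add("--project=" + project);
        args.add("--region=" + region);
        args.add("--tempLocation=" + tempLocation);
        // Flag form of options.setServiceAccount(...)
        args.add("--serviceAccount=" + controllerServiceAccount);
        return args.toArray(new String[0]);
    }

    public static void main(String[] ignored) {
        // Placeholder values, assumed for illustration only.
        String[] args = build(
                "project-a",
                "europe-west1",
                "gs://my-bucket/tmp",
                "my-controller@project-a.iam.gserviceaccount.com");
        // With the Beam SDK on the classpath, these flags could be handed to
        // PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class).
        for (String arg : args) {
            System.out.println(arg);
        }
    }
}
```

The account that launches the job still needs the Service Account User role on the controller service account, as listed above. Otherwise the "cannot act as service account" error from the question is to be expected.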

I think this might give you some insight:

For the BigQuery source and sink to operate properly, the following
two accounts must have access to any BigQuery datasets that your Cloud
Dataflow job reads from or writes to:

- The GCP account you use to execute the Cloud Dataflow job

- The controller service account running the Cloud Dataflow job

For example, if your GCP account is [email protected] and the project
number of the project where you execute the Cloud Dataflow job is
123456789, the following accounts must all be granted access to the
BigQuery Datasets used: [email protected], and
[email protected].
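The rule quoted above can be sketched as a small helper that lists which accounts need access to a dataset. This is only an illustration: the class and method names are hypothetical, and the sample user email is made up. It assumes the default controller is the Compute Engine default service account, whose address follows the pattern PROJECT_NUMBER-compute@developer.gserviceaccount.com, and that a custom controller set via setServiceAccount replaces it.

```java
import java.util.Arrays;
import java.util.List;

class BigQueryAccessCheck {
    // Address of the Compute Engine default service account for a project number.
    static String defaultControllerAccount(long projectNumber) {
        return projectNumber + "-compute@developer.gserviceaccount.com";
    }

    // Returns the two accounts that need access to the BigQuery datasets:
    // the account executing the job and the controller service account.
    // Pass null or "" as customController to fall back to the default one.
    static List<String> accountsNeedingAccess(String executingAccount,
                                              long projectNumber,
                                              String customController) {
        String controller = (customController == null || customController.isEmpty())
                ? defaultControllerAccount(projectNumber)
                : customController;
        return Arrays.asList(executingAccount, controller);
    }

    public static void main(String[] args) {
        // Hypothetical executing account; 123456789 is the project number
        // used in the quoted documentation example.
        for (String account : accountsNeedingAccess("someone@example.com", 123456789L, null)) {
            System.out.println(account);
        }
    }
}
```

In the cross-project setup from the question, both returned accounts would need reader access to the dataset in project B.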

More information: https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#controller_service_account
