Java – Hive Hook has no Spark Hook

I'm working on a project where I have to track the lineage of file transformations.
Suppose a file named SomeTextFile.txt goes through multiple Hive operations and produces the required result in the final stage.

Case 1: Hive operations (Hive actions are applied to the file).

File -> FileAfterAction1 -> FileAfterAction2 -> FinalResultantFile

In this case, I'm using a Hive hook that stores data about each intermediate operation applied to the file, say in a text file. The lineageEngine code then reads that text file and generates the lineage of the final file.
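The reading side of that setup could look roughly like the sketch below. It is only an illustration: the class name `LineageEngine`, the method `trace`, and the `input|operation|output` record format are all assumptions made for the example, not the format the actual hook writes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical lineage reader. The record format "input|operation|output"
// is an assumption for this sketch, not the real hook output.
class LineageEngine {

    /** Walks backwards from finalFile and returns the chain, oldest ancestor first. */
    static List<String> trace(List<String> records, String finalFile) {
        // Map each output file to {input, operation} of the step that produced it.
        Map<String, String[]> producedBy = new HashMap<>();
        for (String line : records) {
            String[] parts = line.split("\\|");
            if (parts.length != 3) {
                continue; // skip blank or malformed lines
            }
            producedBy.put(parts[2].trim(),
                    new String[] {parts[0].trim(), parts[1].trim()});
        }

        LinkedList<String> chain = new LinkedList<>();
        Set<String> seen = new HashSet<>(); // guards against cyclic records
        String current = finalFile;
        while (producedBy.containsKey(current) && seen.add(current)) {
            String[] step = producedBy.get(current);
            chain.addFirst(current + " (via " + step[1] + ")");
            current = step[0];
        }
        chain.addFirst(current); // the original source file
        return new ArrayList<>(chain);
    }
}
```

Given one record per intermediate step, `LineageEngine.trace(records, "FinalResultantFile")` returns the files in order from the original input to the final result, each annotated with the operation that produced it.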

Now, because Spark is involved in the technology stack, clients can also apply Spark operations to files.

Case 2: The same thing happens to the file, but this time through Spark operations.

Question: Is there any way to get the intermediate information, that is, what happened to the file between the start and the end of the transformation?

What I've found online so far is that Spark transformations expose an intermediate graph, but in my case the client will apply Spark operations rather than Spark transformations.
If you have some bandwidth, please weigh in.

Solution

https://issues.apache.org/jira/browse/SPARK-18127

This issue tracks adding hooks and extension points to Spark. The feature is scheduled for Spark 2.2.
