Use Python to write files line by line on Hadoop


I’m working with a file whose lines follow different patterns, so I need to parse each line and write the results to HDFS line by line.

Is there a way to implement this in Python?

Solution

You can access Hadoop’s IOUtils class through sc._gateway.jvm and use it to stream from one HDFS file (or a local file) to another file on HDFS.

# Access Hadoop classes through the JVM gateway exposed by the SparkContext
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration
IOUtils = sc._gateway.jvm.org.apache.hadoop.io.IOUtils

conf = Configuration()
fs = FileSystem.get(conf)

# Open the source file and create the destination file on HDFS
in_stream = fs.open(Path("/user/test/abc.txt"))
out_stream = fs.create(Path("/user/test/a1.txt"))

# Copy the bytes, then close both streams
IOUtils.copyBytes(in_stream, out_stream, conf)
in_stream.close()
out_stream.close()
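Note that copyBytes performs a raw byte copy. If you need the per-line parsing the question asks about, the same pattern works with any file-like text stream; for example, the `hdfs` Python client’s `client.read()` and `client.write()` context managers expose line-oriented streams. A minimal sketch using in-memory streams, with a hypothetical parse rule (swapping two tab-separated fields) standing in for your own logic:

```python
import io

def transform_line(line):
    # Hypothetical per-line parse: swap the two tab-separated fields.
    # Lines that do not match the pattern are dropped.
    parts = line.rstrip("\n").split("\t")
    if len(parts) == 2:
        return parts[1] + "\t" + parts[0] + "\n"
    return None

def copy_line_by_line(src, dst):
    # src and dst are file-like text streams; with the `hdfs` client these
    # would come from client.read(path) and client.write(path) instead.
    for line in src:
        out = transform_line(line)
        if out is not None:
            dst.write(out)

src = io.StringIO("a\tb\nmalformed\nc\td\n")
dst = io.StringIO()
copy_line_by_line(src, dst)
print(dst.getvalue())  # prints "b\ta" and "d\tc" on separate lines
```

Keeping the parse rule in its own function makes it easy to unit-test the line logic without touching HDFS at all.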
