How to connect to HDFS using pyarrow in Python
I have pyarrow installed and want to connect to an HDFS file in a Hadoop cluster. The following line gives me an error:
fs = pa.hdfs.connect(host='...', port=50057, user='...', kerb_ticket='/tmp/krb5cc_0')
This is the error message I received:
ArrowIOError: Unable to load libhdfs
How do I install libhdfs? What other dependencies or settings do I need?
Solution
pyarrow.hdfs.connect(host='default', port=0, user=None, kerb_ticket=None, driver='libhdfs', extra_conf=None)
You must make sure that libhdfs.so is in $HADOOP_HOME/lib/native and that the $ARROW_LIBHDFS_DIR environment variable points to the directory containing it.

For Hadoop:
bash-3.2$ ls $ARROW_LIBHDFS_DIR
examples libhadoop.so.1.0.0 libhdfs.a libnativetask.a
libhadoop.a libhadooppipes.a libhdfs.so libnativetask.so
libhadoop.so libhadooputils.a libhdfs.so.0.0.0 libnativetask.so.1.0.0
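As a minimal sketch (the paths, host, port, and user below are placeholder assumptions you must adapt to your cluster), the environment can be set from Python before connecting. Note that libhdfs also needs a JVM and the Hadoop CLASSPATH; the legacy pyarrow.hdfs client derives the latter by running hadoop classpath --glob when CLASSPATH is unset:

import os
import pyarrow as pa

# Hypothetical locations; point these at your own installation
os.environ['HADOOP_HOME'] = '/opt/hadoop'
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk-amd64'   # libhdfs runs inside a JVM
os.environ['ARROW_LIBHDFS_DIR'] = '/opt/hadoop/lib/native'      # directory containing libhdfs.so
# pyarrow shells out to `hadoop classpath --glob` when CLASSPATH is unset,
# so the hadoop binary must be reachable on PATH
os.environ['PATH'] = os.environ['HADOOP_HOME'] + '/bin:' + os.environ['PATH']

fs = pa.hdfs.connect(host='namenode.example.com', port=50057,
                     user='someuser', kerb_ticket='/tmp/krb5cc_0')
print(fs.ls('/'))  # smoke test: list the HDFS root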
As far as I know, the latest Hadoop release is 3.2.0.
You can use DistributedCache to distribute native shared libraries and symlink the library files. This example, from the Hadoop native-libraries documentation, shows how to distribute a shared library, mylib.so, and load it from a MapReduce task.
First copy the library to HDFS:
bin/hadoop fs -copyFromLocal mylib.so.1 /libraries/mylib.so.1
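Equivalently, here is a hypothetical sketch of the same upload done through the pyarrow client created earlier (the fs handle), using the legacy HadoopFileSystem.upload method:

with open('mylib.so.1', 'rb') as f:
    fs.upload('/libraries/mylib.so.1', f)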
The job launcher should contain the following:
DistributedCache.createSymlink(conf);
DistributedCache.addCacheFile("hdfs://host:port/libraries/mylib.so.1#mylib.so", conf);
The MapReduce task can contain:
System.loadLibrary("mylib.so");
Note: If you downloaded or built the native Hadoop library, you do not need to use DistributedCache to make the library available to your MapReduce tasks.