Java – Forces HDFS globStatus to skip directories for which it does not have permissions

I need to collect many directories from HDFS, each of which contains subdirectories, and I would like to use globStatus to do it. My path pattern looks roughly like this:

"/directory/*/{opt1,opt2}/{opt1,opt2,opt3}*"

Unfortunately, I don't have execute permission on some of the directories matched by * (I can't view their contents), but glob still tries to look inside them, which results in an exception. Is there any way to make glob skip the directories it has no permission for instead of failing completely?

I know there are other ways to achieve the same goal, but as far as I can tell they would be more complicated, and I think they would need more requests to HDFS than a simple glob.

Solution

Answering this question in case anyone else runs into the same problem.

globStatus filtering is done on the client side, in the FileSystem/Globber classes. Behind the scenes, it just issues a series of listStatus calls and filters the return values. Getting the behavior described above requires some custom logic, but because that logic makes the same kind of listStatus calls, it is no less efficient than the globStatus API.
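
The custom logic could look something like the sketch below. It is only an illustration, not Hadoop's actual Globber code: the names TolerantGlob, Lister and AccessDenied are hypothetical, and the pattern is assumed to be already split into one predicate per path component. In real use, the Lister would presumably wrap FileSystem.listStatus(Path) and translate org.apache.hadoop.security.AccessControlException into the stand-in exception. The sketch also assumes every listed entry is a directory, which real code would need to check.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class TolerantGlob {

    /** Hypothetical stand-in for one listing call, e.g. FileSystem.listStatus(Path). */
    public interface Lister {
        /** Returns the names of the children of dir. */
        List<String> list(String dir) throws IOException;
    }

    /** Hypothetical stand-in for Hadoop's AccessControlException. */
    public static class AccessDenied extends IOException {
        public AccessDenied(String message) {
            super(message);
        }
    }

    /**
     * Expands a pattern one path component at a time, starting at root.
     * levels.get(i) decides which child names match the i-th component.
     * Directories that cannot be listed are skipped instead of aborting.
     */
    public static List<String> expand(Lister lister, String root,
                                      List<Predicate<String>> levels) throws IOException {
        List<String> current = new ArrayList<>();
        current.add(root);
        for (Predicate<String> level : levels) {
            List<String> next = new ArrayList<>();
            for (String dir : current) {
                List<String> children;
                try {
                    children = lister.list(dir); // one request per candidate directory
                } catch (AccessDenied e) {
                    continue; // no permission: drop this branch, keep the others
                }
                for (String name : children) {
                    if (level.test(name)) {
                        next.add(dir + "/" + name);
                    }
                }
            }
            current = next;
        }
        return current;
    }
}
```

For the pattern in the question, the three levels would be "any name", "opt1 or opt2", and "starts with opt1, opt2 or opt3". Only access errors are swallowed; any other IOException still propagates, so real failures are not hidden.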
