Force HDFS globStatus to skip directories it does not have permission to access
So I need to collect a lot of directories from HDFS, which themselves contain subdirectories, and I want to be able to use globStatus. My path pattern basically looks like this:
"/directory/*/{opt1,opt2}/{opt1,opt2,opt3}*"
Unfortunately, I don’t have execute permission on some of the directories matched by * (so I can’t view their contents), but glob still tries to look inside them, which results in an exception. Is there any way to make glob skip the directories it doesn’t have permission to access instead of failing completely?
I know there are other ways to achieve the same goal, but as far as I can tell they would be more complicated, and I think they would need more requests to HDFS than a simple glob.
Solution
Answering my own question in case anyone else runs into this problem.
globStatus filtering is done on the client side as part of the FileSystem/Globber classes. Behind the scenes, it’s really just submitting a series of listStatus calls and filtering the return values. Getting the described behavior will require some custom logic, but it’s no less efficient than the globStatus API.
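To make the "custom logic" a bit more concrete, here is a minimal sketch of the idea, not the actual Globber implementation. It expands the pattern one path segment at a time and skips any directory whose listing is denied. Everything named here is an assumption made for illustration: TolerantGlob, Lister and expand are hypothetical, and the sketch deliberately avoids the Hadoop dependency so it can run on its own. Against a real cluster, the Lister would presumably wrap FileSystem.listStatus(Path), and the exception to catch would be org.apache.hadoop.security.AccessControlException rather than the java.nio.file.AccessDeniedException used as a stand-in below. Segment matching uses the JDK glob syntax, which covers * and {a,b} but is not guaranteed to match Hadoop's glob rules in every corner case.

```java
import java.io.IOException;
import java.nio.file.AccessDeniedException;
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class TolerantGlob {

    /**
     * Hypothetical stand-in for a directory listing call. With Hadoop this
     * would likely wrap FileSystem.listStatus(Path) and return the names of
     * the child entries (filtering out plain files at intermediate levels).
     */
    interface Lister {
        List<String> list(String dir) throws IOException;
    }

    /** Expands an absolute glob pattern, skipping directories that cannot be listed. */
    static List<String> expand(String pattern, Lister lister) throws IOException {
        List<String> current = new ArrayList<>();
        current.add(""); // the empty string represents the root "/"

        for (String segment : pattern.split("/")) {
            if (segment.isEmpty()) {
                continue; // leading or doubled slash
            }
            // JDK glob matcher for a single path component.
            PathMatcher matcher =
                    FileSystems.getDefault().getPathMatcher("glob:" + segment);
            List<String> next = new ArrayList<>();

            for (String dir : current) {
                List<String> children;
                try {
                    children = lister.list(dir.isEmpty() ? "/" : dir);
                } catch (AccessDeniedException e) {
                    // Assumption: with Hadoop, catch AccessControlException here.
                    continue; // no permission: drop this branch and carry on
                }
                for (String name : children) {
                    if (matcher.matches(Paths.get(name))) {
                        next.add(dir + "/" + name);
                    }
                }
            }
            current = next;
        }
        return current;
    }
}
```

This issues one listing per candidate directory per level, which is roughly what globStatus does anyway. One simplification worth noting: the sketch also lists the parent for literal segments such as "directory", which a more careful version could skip by appending the segment directly.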