The correct way to read files from a directory using Python 2.6 in the bash shell
I’m trying to read in files for text processing.
The idea is to process them on my virtual machine via a Hadoop pseudo-distributed file system, using the map-reduce code I’m writing. The environment is Ubuntu Linux with Python 2.6 installed. I need to read the files via sys.stdin in the mapper and pass data to the reducer via sys.stdout.
Here is my mapper test code:
#!/usr/bin/env python
import sys
import string
import glob
import os
files = glob.glob(sys.stdin)
for file in files:
    with open(file) as infile:
        txt = infile.read()
        txt = txt.split()
        print(txt)
I’m not sure how glob works with sys.stdin. When I test the pipeline:
[training@localhost data]$ cat test | ./mapper.py
I get the following error:
cat: test: Is a directory
Traceback (most recent call last):
File "./mapper.py", line 8, in <module>
files = glob.glob(sys.stdin)
File "/usr/lib64/python2.6/glob.py", line 16, in glob
return list(iglob(pathname))
File "/usr/lib64/python2.6/glob.py", line 24, in iglob
if not has_magic(pathname):
File "/usr/lib64/python2.6/glob.py", line 78, in has_magic
return magic_check.search(s) is not None
TypeError: expected string or buffer
For now, I just want to read three small .txt
files in one directory.
Thanks!
Solution
I still don’t fully understand what your expected output is (a list or plain
text), but the following will work:
#!/usr/bin/env python
import sys, glob

# Read the directory name from stdin, then open every file inside it.
dir = sys.stdin.read().rstrip('\r\n')
files = glob.glob(dir + '/*')
for file in files:
    with open(file) as infile:
        txt = infile.read()
        txt = txt.split()
        print(txt)
Then execute:
echo "test" | ./mapper.py
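Note that this only matches how you tested locally. Under Hadoop streaming itself, the mapper does not receive a directory name on stdin; Hadoop feeds it the *contents* of the input files, one record per line, and the mapper emits tab-separated key/value pairs on stdout. A minimal sketch of that pattern (the word-count output format here is my assumption, not something you specified):

```python
#!/usr/bin/env python
import sys

def map_line(line):
    # Hadoop-streaming style: emit one tab-separated
    # (word, 1) pair per whitespace-separated token.
    return ['%s\t1' % word for word in line.split()]

if __name__ == '__main__' and not sys.stdin.isatty():
    # Hadoop streaming delivers input records line by line on stdin.
    for line in sys.stdin:
        for pair in map_line(line):
            print(pair)
```
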
My suggestion is to provide the directory name via a command-line argument rather than through standard input as above.
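For example, a command-line-argument version might look like this (the function name `read_tokens` is mine, chosen for illustration):

```python
#!/usr/bin/env python
import glob
import os
import sys

def read_tokens(dirname):
    # Collect the whitespace-split tokens of every file in dirname.
    tokens = []
    for path in sorted(glob.glob(os.path.join(dirname, '*'))):
        with open(path) as infile:
            tokens.extend(infile.read().split())
    return tokens

if __name__ == '__main__' and len(sys.argv) > 1:
    print(read_tokens(sys.argv[1]))
```

You would then run `./mapper.py test` instead of piping the directory name in.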
If you want to adjust the output format, please let me know.
Hope this helps.