Python – The correct way to read files from a directory using Python 2.6 in the bash shell

I’m trying to read in files for text processing.

The idea is to process them on my virtual machine, on a Hadoop pseudo-distributed file system, with the map-reduce code I’m writing. The system is Ubuntu Linux with Python 2.6 installed. I need to read the files through sys.stdin and pass data from the mapper to the reducer through sys.stdout.

Here is my mapper test code:

#!/usr/bin/env python

import sys
import string
import glob
import os

files = glob.glob(sys.stdin)
for file in files:
    with open(file) as infile:
        txt = infile.read()
        txt = txt.split()
    print(txt) 

I’m not sure how glob works with sys.stdin. When I test the script through a pipe:

[training@localhost data]$ cat test | ./mapper.py

I see:

cat: test: Is a directory
Traceback (most recent call last):
  File "./mapper.py", line 8, in <module>
    files = glob.glob(sys.stdin)
  File "/usr/lib64/python2.6/glob.py", line 16, in glob
    return list(iglob(pathname))
  File "/usr/lib64/python2.6/glob.py", line 24, in iglob
    if not has_magic(pathname):
  File "/usr/lib64/python2.6/glob.py", line 78, in has_magic
    return magic_check.search(s) is not None
TypeError: expected string or buffer

For now, I just want to read in three small .txt files from one directory.

Thanks!

Solution

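The error occurs because glob.glob() expects a pattern string, but sys.stdin is a file object. In addition, cat cannot read a directory, so nothing useful reaches the script through the pipe. I still don’t fully understand what output you expect (a list or plain text), but the following will work: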

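#!/usr/bin/env python

import sys, glob

# Read the directory name from standard input and strip the trailing newline.
dirname = sys.stdin.read().rstrip('\r\n')
# Expand the directory into a list of the paths it contains.
files = glob.glob(dirname + '/*')
for path in files:
    with open(path) as infile:
        # Split the file's contents into a list of whitespace-separated words.
        txt = infile.read().split()
    print(txt)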

Then execute:

echo "test" | ./mapper.py
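For comparison, the question mentions reading through sys.stdin and writing to sys.stdout. In that style the mapper usually receives the contents of the files on standard input, not their names. The following is only a rough sketch of that variant: the helper name map_words and the tab-separated "word, 1" output format are assumptions made for illustration, not something the question specifies.

```python
#!/usr/bin/env python
# Sketch only: map_words is a hypothetical helper, and the
# "word<TAB>1" record format is an assumed output convention.
import sys


def map_words(lines):
    """Yield one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield '%s\t1' % word


if __name__ == '__main__':
    # sys.stdin is iterated line by line, so file contents can be piped in.
    for record in map_words(sys.stdin):
        sys.stdout.write(record + '\n')
```

With this variant, the files themselves would be piped in, for example with cat test/* | ./mapper.py, so the script never needs to know the directory name.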

I would suggest passing the directory name as a command-line argument rather than piping it in with echo as shown above.
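A minimal sketch of the command-line variant might look like the following. The function names read_words and main are hypothetical, and skipping subdirectories is an added assumption about what is wanted; the rest mirrors the script above.

```python
#!/usr/bin/env python
# Sketch only: read_words and main are hypothetical helper names.
import glob
import os
import sys


def read_words(dirname):
    """Return (path, words) pairs for each regular file directly in dirname."""
    results = []
    for path in sorted(glob.glob(os.path.join(dirname, '*'))):
        if not os.path.isfile(path):
            continue  # assumption: subdirectories should be skipped
        with open(path) as infile:
            results.append((path, infile.read().split()))
    return results


def main(argv):
    """Print the word list of every file in the directory named by argv[1]."""
    if len(argv) != 2:
        sys.stderr.write('usage: %s DIRECTORY\n' % argv[0])
        return 2
    for path, words in read_words(argv[1]):
        print(words)
    return 0


if __name__ == '__main__':
    main(sys.argv)
```

It would then be run as ./mapper.py test, with no pipe involved.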
If you want to adjust the output format, please let me know.
Hope this helps.
