Python – How to read specific columns in pyspark?

How to read specific columns in pyspark?

I’m new to PySpark. I want to read specific columns from an input file. I know how to do this in pandas:

df=pd.read_csv('file.csv',usecols=[0,1,2])

But is there a similar function in PySpark?

Solution

You can load the file as an RDD with textFile and use map to select specific columns:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("ReadCSV")
sc = SparkContext(conf=conf)
sqlctx = SQLContext(sc)

# Read the file as an RDD of lines, split each line on ";",
# keep only the first and fourth fields, then convert to a DataFrame.
df = sc.textFile("te2.csv") \
    .map(lambda line: line.split(";")) \
    .map(lambda line: (line[0], line[3])) \
    .toDF()
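
On recent Spark versions the same thing is usually done with the DataFrame reader instead of a raw RDD, which is closer in spirit to pandas usecols. Here is a minimal sketch, assuming the same semicolon-delimited file te2.csv with no header row and that you want the first and fourth columns:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadCSV").getOrCreate()

# Let Spark parse the CSV; sep=";" matches the file above,
# and header=False means columns get default names _c0, _c1, ...
df = spark.read.csv("te2.csv", sep=";", header=False)

# Pick columns by position, mimicking pandas usecols=[0, 3].
cols = [df.columns[i] for i in (0, 3)]
df_selected = df.select(*cols)
df_selected.show()

If the file does have a header, pass header=True and select the columns by name instead, e.g. df.select("name", "age").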
