Java – Why doesn’t Flume-NG HDFS sink write to a file when the number of events equals or exceeds batchSize?

I’m trying to configure Flume so that logs roll over every hour or when they reach the default block size for HDFS (64 MB). Here is my current configuration:

imp-agent.channels.imp-ch1.type = memory
imp-agent.channels.imp-ch1.capacity = 40000
imp-agent.channels.imp-ch1.transactionCapacity = 1000

imp-agent.sources.avro-imp-source1.channels = imp-ch1
imp-agent.sources.avro-imp-source1.type = avro
imp-agent.sources.avro-imp-source1.bind = 0.0.0.0
imp-agent.sources.avro-imp-source1.port = 41414

imp-agent.sources.avro-imp-source1.interceptors = host1 timestamp1
imp-agent.sources.avro-imp-source1.interceptors.host1.type = host
imp-agent.sources.avro-imp-source1.interceptors.host1.useIP = false
imp-agent.sources.avro-imp-source1.interceptors.timestamp1.type = timestamp

imp-agent.sinks.hdfs-imp-sink1.channel = imp-ch1
imp-agent.sinks.hdfs-imp-sink1.type = hdfs
imp-agent.sinks.hdfs-imp-sink1.hdfs.path = hdfs://mynamenode:8020/flume/impressions/yr=%Y/mo=%m/d=%d/logger=%{host}s1/
imp-agent.sinks.hdfs-imp-sink1.hdfs.filePrefix = Impr
imp-agent.sinks.hdfs-imp-sink1.hdfs.batchSize = 10
imp-agent.sinks.hdfs-imp-sink1.hdfs.rollInterval = 3600
imp-agent.sinks.hdfs-imp-sink1.hdfs.rollCount = 0
imp-agent.sinks.hdfs-imp-sink1.hdfs.rollSize = 66584576

imp-agent.channels = imp-ch1
imp-agent.sources = avro-imp-source1
imp-agent.sinks = hdfs-imp-sink1

My intention with the above configuration is to write to HDFS in batches of 10 events and to roll the file being written every hour. What I see instead is that, as long as the data stays below 64 MB, everything appears to be held in memory until the file rolls over after one hour. Are there any settings I should adjust to get the behavior I want?
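
For context, events reach the Avro source from a client along these lines. This is only a minimal sketch using Flume's RpcClient API; the hostname, port, and event bodies are placeholders rather than my real producer.

import java.nio.charset.Charset;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class AvroClientSketch {
    public static void main(String[] args) throws EventDeliveryException {
        // Connect to the agent's Avro source (hostname and port are placeholders).
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        try {
            for (int i = 0; i < 10; i++) {
                // Each append() delivers one event to the source; the HDFS sink
                // later drains the channel in groups of hdfs.batchSize events.
                Event event = EventBuilder.withBody("impression " + i, Charset.forName("UTF-8"));
                client.append(event);
            }
        } finally {
            client.close();
        }
    }
}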

Solution

Answering my own question: Flume is in fact writing the data to HDFS in batches. The reported file length simply does not update while the file is open, because the NameNode only counts completed blocks; the block currently being written is not reflected in the listed size until it is closed.
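
A rough way to verify this is to read the file the sink currently has open instead of trusting the length shown by a directory listing: the NameNode only reports completed blocks, but the bytes already flushed into the open block can still be read. Below is a minimal sketch with the Hadoop FileSystem API; the NameNode URI matches the config above, and the path of the open .tmp file is passed as an argument.

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenFileLengthCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://mynamenode:8020"), new Configuration());
        Path path = new Path(args[0]); // the .tmp file the HDFS sink is writing

        // Length reported by the NameNode: lags while a block is open for writing.
        System.out.println("Reported length: " + fs.getFileStatus(path).getLen());

        // Reading the file shows the bytes that have already been flushed to the datanodes.
        long bytesRead = 0;
        InputStream in = fs.open(path);
        try {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                bytesRead += n;
            }
        } finally {
            in.close();
        }
        System.out.println("Bytes readable: " + bytesRead);
    }
}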
