Find/remove duplicates in big data
I have a set of files. The lines should be unique across all files: if file i contains line 1, no other file should contain line 1 (and file i itself should contain only one entry for line 1).
Question:
I need to remove all duplicates from these files. However, the total number of lines runs into the billions, so I can't simply load all the files into memory and delete duplicates there.
Several solutions came to mind:
1- Create a database table with the line as a unique key; inserting every line into the table then removes all duplicates (first sketch below).
2- Use a Redis Set instead of a database (second sketch below).
3- Create a file named after each line, so that once all the files have been created the duplicates naturally disappear (third sketch below).
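Roughly what I mean by option 1, as a minimal sketch (SQLite, the directory names, and the `*.txt` pattern are just placeholders for my real setup):

```python
import sqlite3
from pathlib import Path

INPUT_DIR = Path("input_files")   # placeholder: directory holding the files
DB_PATH = "dedup.db"              # placeholder: on-disk database file

conn = sqlite3.connect(DB_PATH)
# The line itself is the primary key, so inserting the same line twice
# is silently skipped by INSERT OR IGNORE.
conn.execute("CREATE TABLE IF NOT EXISTS lines (line TEXT PRIMARY KEY)")

for path in INPUT_DIR.glob("*.txt"):
    with open(path, encoding="utf-8") as f:
        conn.executemany(
            "INSERT OR IGNORE INTO lines (line) VALUES (?)",
            ((line.rstrip("\n"),) for line in f),
        )
    conn.commit()

# Stream the de-duplicated lines back out without loading them all at once.
with open("deduplicated.txt", "w", encoding="utf-8") as out:
    for (line,) in conn.execute("SELECT line FROM lines"):
        out.write(line + "\n")
conn.close()
```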
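Option 2 would look roughly like this (assuming the redis-py client and a local Redis server; the key name and batch size are arbitrary):

```python
import redis
from pathlib import Path

INPUT_DIR = Path("input_files")   # placeholder path

r = redis.Redis(host="localhost", port=6379)

for path in INPUT_DIR.glob("*.txt"):
    with open(path, encoding="utf-8") as f:
        pipe = r.pipeline(transaction=False)
        for i, line in enumerate(f, 1):
            # SADD ignores members that are already in the set,
            # so the set ends up duplicate-free.
            pipe.sadd("unique_lines", line.rstrip("\n"))
            if i % 10_000 == 0:   # flush in batches so the pipeline stays small
                pipe.execute()
        pipe.execute()

# SSCAN iterates over the set incrementally instead of loading it all at once.
with open("deduplicated.txt", "wb") as out:
    for member in r.sscan_iter("unique_lines"):
        out.write(member + b"\n")
```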
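And option 3, sketched with one caveat: raw lines are not always valid file names, so I would hash each line for the name and keep the original line as the file's content (the hashing is my own addition, and billions of tiny files is exactly the kind of resource cost I'm worried about):

```python
import hashlib
from pathlib import Path

INPUT_DIR = Path("input_files")   # placeholder path
OUT_DIR = Path("line_files")      # one file per unique line
OUT_DIR.mkdir(exist_ok=True)

for path in INPUT_DIR.glob("*.txt"):
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.rstrip("\n")
            # Hash the line to get a filesystem-safe, fixed-length file name.
            name = hashlib.sha256(line.encode("utf-8")).hexdigest()
            target = OUT_DIR / name
            if not target.exists():   # a duplicate just hits an existing file
                target.write_text(line + "\n", encoding="utf-8")
```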
However, every solution I can think of requires more time and resources than I can afford at the moment.
So my questions are:
1- Which of the approaches above seems the most reliable?
2- Is there a better solution/technique that I don’t know about?