MySQL – Find/remove duplicates in big data

I have a set of files. Each file should contain a set of lines that is unique across all files. For example, if file i contains line 1, no other file should contain line 1, and file i itself should contain only one entry for line 1.

Question:

I need to remove all duplicates from these files. However, the total number of lines runs into the billions, so I cannot simply load all the files into memory and delete the duplicates there.

Several solutions came to mind (each is sketched in code after this list):

1- Create a table in the database with the line as a unique key; once every row has been inserted, the table holds exactly one copy of each line.

2- Use a Redis Set instead of a database.

3- Create an empty file named after each line; once all the files have been created, the duplicates have naturally disappeared.
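To make the options concrete, here is a minimal sketch of each one in Python. Every name in it is a placeholder I invented for illustration (the lines table and its schema, the unique_lines Redis key, the seen/ directory), and none of it is tuned for billions of rows:

    import hashlib
    import os

    import mysql.connector  # option 1: pip install mysql-connector-python
    import redis            # option 2: pip install redis

    # Each helper returns True when the line was seen before (a duplicate)
    # and False on its first occurrence.

    def is_dup_mysql(cur, line):
        # Option 1: a table keyed on a hash of the line. Assumed schema:
        #   CREATE TABLE lines (line_hash BINARY(32) PRIMARY KEY,
        #                       line_text LONGTEXT NOT NULL);
        # INSERT IGNORE silently skips rows whose key already exists, so
        # after loading everything the table holds one copy of each line.
        cur.execute(
            "INSERT IGNORE INTO lines (line_hash, line_text) "
            "VALUES (UNHEX(SHA2(%s, 256)), %s)",
            (line, line),
        )
        return cur.rowcount == 0  # 0 affected rows => key already existed

    def is_dup_redis(r, line):
        # Option 2: SADD returns 1 if the member was added (new)
        # and 0 if it was already in the set.
        return r.sadd("unique_lines", line) == 0

    def is_dup_files(line, directory="seen"):
        # Option 3: one empty file per distinct line. Raw lines are not
        # safe file names (length limits, slashes), so hash them first.
        # O_CREAT | O_EXCL makes the create fail atomically if the file
        # already exists.
        os.makedirs(directory, exist_ok=True)
        name = hashlib.sha256(line.encode("utf-8")).hexdigest()
        try:
            os.close(os.open(os.path.join(directory, name),
                             os.O_CREAT | os.O_EXCL))
            return False
        except FileExistsError:
            return True

In options 1 and 3, hashing each line keeps the key or file name at a fixed size regardless of line length, at the cost of trusting that SHA-256 collisions are negligible.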

However, every solution I can think of requires a lot of time and resources that I can’t afford at the moment.

So my questions are:

1- Which of the routes above seems the most reliable?

2- Is there a better solution/technique that I don’t know about?
