Java – Hadoop – Analysis Log Files (Java)

The log file looks like this:

Time stamp,activity,-,User,-,id,-,data

2013-01-08T16:21:35.561+0100,reminder,-,User1234,-,131235467,-,-
2013-01-02T15:57:24.024+0100,order,-,User1234,-,-,-,{items:[{"prd":"131235467","count": 5, "amount": 11.6},{"prd": "13123545", "count": 1, "amount": 55.99}], oid: 5556}
2013-01-08T16:21:35.561+0100,login,-,User45687,-,143435467,-,-
2013-01-08T16:21:35.561+0100,reminder,-,User45687,-,143435467,-,-
2013-01-08T16:21:35.561+0100,order,-,User45687,-,-,-,{items:[{"prd":"1315467","count": 5, "amount": 11.6},{"prd": "133545", "count": 1, "amount": 55.99}], oid: 5556}
...
...

Edit

Specific examples in this log:

User1234 has a reminder with id=131235467, after which he places an order with the following data: {items:[{"prd":"131235467","count": 5, "amount": 11.6},{"prd": "13123545", "count": 1, "amount": 55.99}], oid: 5556}
In this case the reminder id matches the prd of one item in data, so I want to compute count * amount for that item, here 5 * 11.6 = 58, and output it as

User1234     Prdsum: 58
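
The arithmetic for a single order could be sketched as follows. This is only an illustration in plain Java with no Hadoop dependency. The class name `PrdSumSketch`, the method `prdSum`, and the regular expression are assumptions made for this example. The data field is not strict JSON (`items` and `oid` are unquoted), so the sketch pulls out each item with a regex instead of a JSON parser, and it uses `BigDecimal` to avoid floating-point rounding on amounts.

```java
import java.math.BigDecimal;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrdSumSketch {
    // Matches one item such as {"prd":"131235467","count": 5, "amount": 11.6}.
    // Assumes the keys always appear in the order prd, count, amount.
    private static final Pattern ITEM = Pattern.compile(
        "\\{\\s*\"prd\"\\s*:\\s*\"(\\d+)\"\\s*,\\s*\"count\"\\s*:\\s*(\\d+)"
        + "\\s*,\\s*\"amount\"\\s*:\\s*(\\d+(?:\\.\\d+)?)\\s*\\}");

    // Sums count * amount over all items whose prd equals the reminder id.
    public static BigDecimal prdSum(String reminderId, String data) {
        BigDecimal sum = BigDecimal.ZERO;
        Matcher m = ITEM.matcher(data);
        while (m.find()) {
            if (m.group(1).equals(reminderId)) {
                BigDecimal count = new BigDecimal(m.group(2));
                BigDecimal amount = new BigDecimal(m.group(3));
                sum = sum.add(count.multiply(amount));
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        String data = "{items:[{\"prd\":\"131235467\",\"count\": 5, \"amount\": 11.6},"
            + "{\"prd\": \"13123545\", \"count\": 1, \"amount\": 55.99}], oid: 5556}";
        // Only the first item matches the reminder id, so only 5 * 11.6 is counted.
        System.out.println("Prdsum: " + prdSum("131235467", data).stripTrailingZeros().toPlainString());
    }
}
```

If no item's prd matches the reminder id, the method returns zero, which corresponds to the Prdsum: 0 case.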

User45687 also placed an order, but he did not receive a matching reminder, so there is nothing to sum for him

Output:

User45687    Prdsum: 0

Final output of this log:

User1234     Prdsum: 58
User45687    Prdsum: 0

My question is: how do I compare the reminder id with the prd values in data?
The key is the user. Would a custom Writable with value = (id, data) be useful? I need some ideas.

Solution

I recommend computing the raw sum per user in a first Hadoop job, so that at the end of that job you get something like this:

User1234     Prdsum: 58    
User45687    Prdsum: 0

Then run a second Hadoop job (or a standalone job) that compares the various values and generates another report.

Do you need “presence” as part of your first Hadoop job? If so, you will need to keep a HashMap or Hashtable in the mapper or reducer to store the values for all keys (in this case, the users) for comparison, but that is not a good setup, IMHO. It is better to just aggregate in one Hadoop job and compare in another.
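
To make the per-user aggregation step concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class `UserPrdSumSketch` and the method `aggregate` are hypothetical names. The in-memory maps only stand in for the grouping by key that Hadoop's shuffle phase performs, so in a real job a reducer keyed by user would see one user's values at a time instead of holding every user in memory. The sketch also assumes that reminders and orders can be matched regardless of their order in the log.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UserPrdSumSketch {
    // One item of the data field; assumes the key order prd, count, amount.
    private static final Pattern ITEM = Pattern.compile(
        "\\{\\s*\"prd\"\\s*:\\s*\"(\\d+)\"\\s*,\\s*\"count\"\\s*:\\s*(\\d+)"
        + "\\s*,\\s*\"amount\"\\s*:\\s*(\\d+(?:\\.\\d+)?)\\s*\\}");

    public static Map<String, BigDecimal> aggregate(List<String> lines) {
        Map<String, Set<String>> reminderIds = new LinkedHashMap<>();
        Map<String, List<String>> orderData = new LinkedHashMap<>();
        Map<String, BigDecimal> sums = new LinkedHashMap<>();

        // "Map" side: split each line and group it under its user.
        for (String line : lines) {
            // Limit 8 keeps the commas inside the trailing data field intact.
            String[] f = line.split(",", 8);
            if (f.length < 8) continue;
            String activity = f[1];
            String user = f[3];
            sums.putIfAbsent(user, BigDecimal.ZERO); // every user gets a row
            if ("reminder".equals(activity)) {
                reminderIds.computeIfAbsent(user, k -> new HashSet<>()).add(f[5]);
            } else if ("order".equals(activity)) {
                orderData.computeIfAbsent(user, k -> new ArrayList<>()).add(f[7]);
            }
        }

        // "Reduce" side: per user, sum count * amount for items with a reminder.
        for (Map.Entry<String, List<String>> e : orderData.entrySet()) {
            Set<String> ids = reminderIds.getOrDefault(e.getKey(), Collections.emptySet());
            BigDecimal sum = BigDecimal.ZERO;
            for (String data : e.getValue()) {
                Matcher m = ITEM.matcher(data);
                while (m.find()) {
                    if (ids.contains(m.group(1))) {
                        sum = sum.add(new BigDecimal(m.group(2)).multiply(new BigDecimal(m.group(3))));
                    }
                }
            }
            sums.put(e.getKey(), sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2013-01-08T16:21:35.561+0100,reminder,-,User1234,-,131235467,-,-",
            "2013-01-02T15:57:24.024+0100,order,-,User1234,-,-,-,{items:[{\"prd\":\"131235467\",\"count\": 5, \"amount\": 11.6},{\"prd\": \"13123545\", \"count\": 1, \"amount\": 55.99}], oid: 5556}");
        aggregate(lines).forEach((user, sum) ->
            System.out.println(user + "    Prdsum: " + sum.stripTrailingZeros().toPlainString()));
    }
}
```

In an actual Hadoop job, the first loop would roughly correspond to a mapper emitting the user as key and (activity, id, data) as value, and the second loop to the reducer body for one user.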
