Python – Predicts user input using the current datetime, time, and past history

Project

I’m working on a small project where users can create events (e.g. Eat, Sleep, Watch a movie, etc.) and log entries that match these events.

My data model looks like this (the application itself is in Python 3/Django, but I don’t think it’s important here):

# a sample event
event = {
    'id': 1,
    'name': 'Eat',
}

# a sample entry
entry = {
    'event_id': event['id'],
    'user_id': 12,

    # record date of the entry
    'date': '2017-03-16T12:56:32.095465+00:00',

    # user-given tags for the entry
    'tags': ['home', 'delivery'],

    # a user rating for the entry; it can be positive or negative
    'score': 2,

    'comment': 'That was a tasty meal',
}

Users can log as many entries as they want for any number of events, and they can create new events when needed. Data is stored in a relational database.

Now I want to make it easier for users to enter data by recommending relevant events when they open the Add Entry form. Currently, they select the event for their entry from a drop-down list, but I would like to suggest a few relevant events on top of that.

I’m thinking that given the user history (all of the user’s entries), it should be possible to predict likely inputs by identifying patterns in the entries, e.g.:

  • Eat is usually around noon and 7:00 PM every day
  • Sleep is usually after 10:00 PM
  • Movie viewing usually happens after 8:00 PM on Fridays

Ideally, I want a function that, given a user ID and a datetime, uses the user history to return a list of the events most likely to occur:

def get_events(user_id, datetime, max=3):
    # implementation

    # returns a list of up to max events
    return events

So if I take the previous example (with more human dates), I get the following result:

>>> get_events(user_id, 'Friday at 9:00 PM')
['Watch a movie', 'Sleep', 'Eat']

>>> get_events(user_id, 'Friday at 9:00 PM', max=2)
['Watch a movie', 'Sleep']

>>> get_events(user_id, 'Monday at 9:00 PM')
['Sleep', 'Eat', 'Watch a movie']

>>> get_events(user_id, 'Monday at noon')
['Eat']

Of course, in real life, I would pass a real datetime, and I would want event IDs back so that I can fetch the corresponding data from the database.

My question

(Sorry if it takes some time to explain the whole thing)

My practical question is, what are the actual algorithms/tools/libraries needed to achieve this? Is it possible to do it?

My current guess is that I need to use some fancy machine learning stuff, use things like scikit-learn and classifiers, train it with user history, and then let the whole thing do its magic.

I’m not at all familiar with machine learning, and I’m worried I don’t have enough math/science background to start on my own. Can you provide some references to help me understand how to solve this, the algorithms/vocabulary I have to dig into, or some pseudocode?

Solution

I think the k-nearest neighbours (kNN) approach would be a good place to start. In this particular case, the idea is to look for the k entries closest to a given time and count which events occur most frequently among them.

Example

Say you have as input Friday at 9:00 PM. Take the distance of all
events in the database to this date and rank them in ascending order.
For example if we take the distance in minutes for all elements in the
database, an example ranking could be as follows.

('Eat', 34)
('Sleep', 54)
('Eat', 76)
   ...
('Watch a movie', 93)

Next you take the first k = 3 of those and compute how often they
occur,

('Eat', 2)
('Sleep', 1)

so that the function returns ['Eat', 'Sleep'] (in that order).

It is important to choose a good k. A value that is too small will allow unexpected outliers (doing something once at a particular moment) to have a big impact on the outcome. Choosing a k that is too large will cause irrelevant events to be included in the count. One way to mitigate this is to use distance-weighted kNN (see below).

Select the distance function

As mentioned in the comment, using a simple distance between two timestamps may lose some information, such as the day of the week. We can solve this problem by making the distance function d(e1, e2) slightly more complicated. In this case, we can choose it as a trade-off between the time of day and the day of the week, for example

d(e1, e2) = a * |timeOfDay(e1) - timeOfDay(e2)| * (1/1440) + 
            b * |dayOfWeek(e1) - dayOfWeek(e2)| * (1/7)

Both differences are normalized: the time-of-day difference by the number of minutes in a day (1440) and the day-of-week difference by the number of days in a week (7). The parameters a and b can be used to give more weight to one of the two terms. For example, with a = 3 and b = 1, a mismatch in the time of day counts three times as much as a mismatch in the day of the week.
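The distance function above can be sketched in Python as follows. This is a minimal sketch that assumes the entries are compared through their `datetime` values; the names `time_of_day`, `distance`, `MINUTES_PER_DAY` and `DAYS_PER_WEEK` are illustrative and not part of the original answer.

```python
from datetime import datetime

MINUTES_PER_DAY = 1440
DAYS_PER_WEEK = 7


def time_of_day(t):
    """Minutes elapsed since midnight."""
    return t.hour * 60 + t.minute


def distance(t1, t2, a=1.0, b=1.0):
    """d(e1, e2) from the formula above, applied to two datetimes.

    a weights the time-of-day term, b weights the day-of-week term.
    """
    time_term = abs(time_of_day(t1) - time_of_day(t2)) / MINUTES_PER_DAY
    day_term = abs(t1.weekday() - t2.weekday()) / DAYS_PER_WEEK
    return a * time_term + b * day_term


friday_9pm = datetime(2017, 3, 17, 21, 0)
friday_8pm = datetime(2017, 3, 17, 20, 0)
monday_9pm = datetime(2017, 3, 13, 21, 0)

print(distance(friday_9pm, friday_8pm))  # same day, one hour apart
print(distance(friday_9pm, monday_9pm))  # same time, four weekdays apart
```

Note that this follows the formula literally, so 11:50 PM and 12:10 AM are treated as far apart, and so are Sunday and Monday. Wrapping both differences around would be a possible refinement, but it is not part of the formula given here.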

Distance-weighted kNN

You can increase complexity (and hopefully improve performance) by weighting every entry according to its distance from the given point, instead of simply selecting the k closest elements. Let e be the input example and o be an example in the database. Then the weight of o with respect to e is

          1
w_o = ---------
      d(e, o)^2

Points rapidly lose weight as their distance to e increases. In your case, a number of elements must still be selected from the final ranking. This can be done by summing the weights of the entries that belong to the same event, which gives a final ranking of the event types.
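As a rough illustration of the weighting step, the sketch below sums the weights 1 / d(e, o)^2 per event name and returns the highest-scoring names. The function name `weighted_ranking`, its parameters, and the sample distances are made up for the example. The `min_distance` floor is an added assumption to avoid dividing by zero when an entry sits exactly at the query point; the formula itself does not define that case.

```python
from collections import defaultdict


def weighted_ranking(scored, max_events=3, min_distance=1e-9):
    """Rank event names by the sum of 1 / d^2 over their entries.

    scored is a list of (event_name, distance) pairs.
    min_distance is a floor (an assumption, not in the formula) that
    keeps a distance of exactly 0 from causing a division by zero.
    """
    totals = defaultdict(float)
    for name, dist in scored:
        totals[name] += 1.0 / max(dist, min_distance) ** 2
    ranked = sorted(totals, key=totals.get, reverse=True)
    return ranked[:max_events]


# Hypothetical distances, e.g. produced by a normalized distance function.
scored = [
    ('Eat', 0.5),
    ('Sleep', 1.0),
    ('Eat', 2.0),
    ('Watch a movie', 0.4),
]
print(weighted_ranking(scored))
```

Here a single very close 'Watch a movie' entry (weight 6.25) outranks two 'Eat' entries that are further away (4 + 0.25 = 4.25), which is the behaviour the weighting is meant to produce.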

Implementation

The good thing about kNN is that it is very easy to implement. You roughly need the following components.

  • Implementation of the distance function d(e1, e2).
  • A function that ranks all elements in a database based on this function and the given input example.

    def rank(e, db, d):
        """ Rank the examples in db with respect to e using
            distance function d.
        """
        return sorted([(o, d(e, o)) for o in db],
                      key=lambda x: x[1])
    
  • A function that selects some elements from that ranking.
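For the last component, one possible sketch for the unweighted case is shown below: it takes the output of `rank`, keeps the k nearest entries and returns their event IDs ordered by frequency. The function `select`, the toy `db` with a `minute` field, and the `minutes_apart` distance are hypothetical stand-ins chosen to keep the example short; a real implementation would plug in its own entries and distance function.

```python
from collections import Counter


def rank(e, db, d):
    """Rank the examples in db with respect to e using distance function d."""
    return sorted([(o, d(e, o)) for o in db], key=lambda x: x[1])


def select(ranking, k=3, max_events=3):
    """Count event IDs among the k nearest entries, most frequent first."""
    counts = Counter(o['event_id'] for o, _ in ranking[:k])
    return [event_id for event_id, _ in counts.most_common(max_events)]


# Toy data: each entry stores its event ID and a minute of the day.
db = [
    {'event_id': 1, 'minute': 720},
    {'event_id': 1, 'minute': 1140},
    {'event_id': 2, 'minute': 1350},
    {'event_id': 1, 'minute': 1150},
    {'event_id': 3, 'minute': 1260},
]


def minutes_apart(e, o):
    """Toy distance: absolute difference in minutes of the day."""
    return abs(e['minute'] - o['minute'])


query = {'minute': 1145}
print(select(rank(query, db, minutes_apart), k=3))
```

Returning event IDs rather than names matches the wish in the question to look up the corresponding events in the database afterwards.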
