Over the years, the quantity of information available to us has grown into something overwhelming. Every news agency now publishes its content directly on the web, which makes it very accessible, but it doesn’t take long to realize that there is more news than there is time to read it. One solution is information filtering. Unfortunately, news websites don’t provide that service, or at least not a sufficient version of it. When you’re looking for recent news, you can turn to social media, where your circle of friends recommends articles by sharing them, but you’ll sometimes miss more underrated content. Usually you end up either going directly to a press website, which is often overcrowded with posts, or subscribing to its RSS feed, in which case you get bombarded with unwanted news.
An RSS feed is a file that contains the latest posts from a content provider such as CNN. Some providers offer pre-filtering through multiple RSS feeds to choose from, each representing a different category. The issue is that you end up with so many different feeds that it’s hard to keep track. People have tried to fix this with apps that aggregate and organize your feeds, making them easier to maintain. Here is a brief list of RSS readers:
Even though these solutions are really useful, none of them solves the real issue: users are still overwhelmed by a tremendous amount of unwanted information.
In this article I will share a first step towards a solution that provides users with a personalized and contextualized news feed based on craft ai technology. Using craft ai is quite intuitive, but we first have to grasp the full extent of the underlying problem. I’ll walk through my thought process to highlight the challenges of building a recommendation application.
Tl;dr: if all you care about is seeing how craft ai is used, skip directly to the “Creating a user profile” chapter.
The objective of information filtering (IF) is to remove redundant and/or unwanted information from data streams. In the past, information filtering was mostly used by governments as a tool to censor information, but now, in the age of information, it’s all about bringing the right information to the right person. IF is especially used for unstructured or semi-structured data (e.g. documents, e-mail messages), mostly textual. IF systems are designed to handle large amounts of data and are based on a user profile.
Why prefer IF over information retrieval (IR)? Even though the two might appear identical and can easily be mistaken for one another, they actually differ in six major aspects:
- Frequency of use - IR systems are designed for ad-hoc use of a one-time user, to fulfill a one-time information need. Instead, IF systems are designed for long term users with long term information needs, and for repetitive usage.
- Representation of information needs - in IR systems, user needs are expressed as queries. In IF systems, long term user needs are described in user profiles.
- Goal - IR systems select from databases relevant data items that match a query. IF systems screen out irrelevant data from incoming streams of data items, or collect and distribute relevant data items from certain sources, according to a user’s profile.
- Database - IR systems deal with relatively static databases. IF systems deal with dynamic data.
- Type of users - IR systems serve users who are not known to the system; anyone who has access to the system may pose a query. Users of IF systems need to be known to the system; the system has a model of the user, usually kept in form of a user profile.
- Scope of system - IF systems are concerned with social issues like user modeling and privacy, which are most of the time of no concern to IR systems.
As we can see, IF matches our case exactly: we have a news stream flowing from a news company to a user. Moreover, we don’t want the user to have to write queries; we want the feed to fit his preferences, especially since the database is dynamic, which makes it hard to know what to search for.
As we’ve already said, news websites offer categorization, but it isn’t enough anymore: user profiles are in constant evolution and users don’t want to change their settings all the time. We have to find a way to continuously acquire knowledge about users and their preferences.
There are many different ways of acquiring user information. The most popular are:
- User interrogation requires the user to be active and set his preferences for filtering. The system explicitly asks him to input information, for example by filling out a form or choosing among predefined profiles.
- Document space is a method that sits between the explicit and implicit approaches and requires minimal user involvement. The user judges how relevant the documents shown to him are, and only relevant information is shown afterwards.
- Stereotypic inference is another mixed solution. The user only gives information about himself, and the system then tries to match his profile with pre-configured stereotypical user profiles.
- Recording user behavior is an implicit approach. Its objective is to be as unintrusive as possible, for example by recording how a user reacts to a piece of information (did he read it? for how long? etc.).
In our case we aim to have minimal impact on the user’s habits: we don’t want him filling out forms. An implicit approach seems to be the way to go; the only issue is how to get data about news the user doesn’t interact with. A mixed solution combining user behavior recording and document space might be the most efficient, e.g. asking the user to flag posts that don’t interest him.
Now that we know how we are going to record information about our users, what do we plan to do with that data to recommend news posts? This is why a certain branch of information filtering was created: recommender systems. They are user driven and are very popular for presenting information items such as music, television and books.
Recommender systems (RS) have been one of the big challenges of recent years, first driven by the need to recommend new and ever-growing cultural content to users. The Netflix Prize, launched in 2006, is proof of that. Now, with the boom of data mining and machine learning, people are trying to use RS in a wide variety of fields, from tourism to advertisement and even, to some degree, financial investments.
With great promises come great challenges, building an efficient RS isn’t an easy task and we’re going to see why.
Quality of recommendations: trust is the key word here. Users need recommendations they can trust. To achieve that, a recommender system should minimize false positive errors, i.e. products that are recommended (positive) even though the user does not like them (false). Such errors lead to dissatisfied users.
Sparsity: even the most active users provide feedback on only a tiny fraction of the available items. If we build a user/item matrix representing every user’s feedback on every item, we can try to find similar users in it, called neighbors. But since each user only sees a very small subset of items, the matrix is sparse, which leads to the inability to locate good neighbors and, finally, to weak recommendations. Techniques to reduce the sparsity of user-item matrices are therefore needed.
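To make the sparsity problem concrete, here is a tiny illustrative computation on a made-up user/item matrix (the data and function names are invented for the example):

```python
# Toy illustration of the sparsity problem: a user/item feedback matrix
# where None marks items a user has never rated.

def sparsity(matrix):
    """Fraction of missing (unrated) entries in a user-item matrix."""
    cells = [cell for row in matrix for cell in row]
    missing = sum(1 for cell in cells if cell is None)
    return missing / len(cells)

# 4 users x 6 items; each user rated only a couple of items.
ratings = [
    [5, None, None, None, 1, None],
    [None, 4, None, None, None, None],
    [None, None, None, 2, None, None],
    [3, None, None, None, None, 5],
]

print(f"sparsity: {sparsity(ratings):.0%}")  # prints "sparsity: 75%"
```

Even in this tiny example, three quarters of the matrix is empty; with a real catalog of thousands of items, the proportion of missing entries gets far worse.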
Scalability: recommender systems require computation that grows with both the number of users and the number of products. An algorithm that is efficient when the amount of data is limited can turn out to be unable to generate a satisfactory number of recommendations as soon as the amount of data increases.
Synonymy: recommender systems are usually unable to discover the latent association between products that have different names but refer to similar objects.
First Rater Problem: An item cannot be recommended unless a user has rated it before. This problem applies to new items and also to obscure items and is particularly harmful to users with eclectic tastes. It is also known as the Cold Start Problem.
Unusual User Problem: it is also known as the Gray Sheep problem. It refers to individuals with opinions that are “unusual”, meaning that they do not agree or disagree consistently with any group of people. Those individuals would not easily benefit from recommender systems since they would rarely, if ever, receive accurate predictions.
There are three main types of approaches to building a recommender system:
- Collaborative filtering is an approach where users are recommended items based on the past behavior of all other users collectively. One huge advantage of this method is that it does not require understanding the content itself, which is why it’s really popular for movie recommendations. It is based on the assumption that people who agreed in the past will agree in the future. Here is a sample of existing collaborative filtering systems: Last.fm’s music recommendations, Facebook’s friend recommendations, Twitter’s “who to follow” recommendations.
- Content-based recommending approaches recommend items that are similar in content to items the user has liked in the past, or matched to the attributes of the user. A user profile is built and the system matches content to this profile. Here is a sample of existing content-based systems: Pandora Radio, Rotten Tomatoes.
- Hybrid approaches combine both collaborative and content-based methods. Even if it’s getting a bit dated, Netflix is still a good example of a hybrid approach: on their blog they have a two-part article from 2012 that covers the Netflix recommendation system.
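To give a feel for the collaborative filtering approach described above, here is a minimal, self-contained sketch on toy data (all names are invented): we pick the most similar neighbor by cosine similarity over ratings and recommend what he liked that the target user hasn’t seen yet:

```python
from math import sqrt

# Minimal user-based collaborative filtering sketch (illustrative data).
# Each user is a dict of item -> rating.

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    common = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in common)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def recommend(target, others):
    """Recommend the most similar user's items that `target` hasn't rated."""
    best = max(others, key=lambda name: cosine(target, others[name]))
    return sorted(i for i in others[best] if i not in target)

alice = {"item_a": 5, "item_b": 4}
others = {
    "bob":   {"item_a": 5, "item_b": 5, "item_c": 4},  # similar to alice
    "carol": {"item_d": 5, "item_e": 4},               # different taste
}
print(recommend(alice, others))  # prints ['item_c']
```

Note that the method never looks at what the items actually are, only at who rated them, which is exactly why it needs no understanding of the content itself.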
In our case, the preferred method is content-based recommending, as it’s easier to extract information from text than from a video. Especially since craft ai’s objective is to offer personalized learning, meaning it’s possible to learn a profile for every single user; this fixes the issue of unusual users. We could also argue that a hybrid approach could be relevant, maybe as a final product, but for now we’re going to focus on how we can easily create personalized information filtering with craft ai.
Now that we know how we’re going to gather data about our user and which method we’re going to use to recommend new content, how are we going to build his user profile? This is where craft ai intervenes with its learning API. We’re going to use its ability to discover a user’s habits, and we’re going to use this profile to decide which news posts to display.
As a developer, all you have to do is describe your craft ai model and input historical data as it is produced. With that data, the API generates a decision tree. This decision tree is not fixed: as you continue to input data, the tree evolves with it.
In this case, our model is fairly simple for the moment. As an input we want to give it the user’s behavior: has he read that news post or not. Furthermore, since we’re going for a content-based recommendation system, it’s also important to give as an input what type of content it is. We’re not going to work on NLP for the moment, so the information on the content will be the category it has been tagged with by the news agency. As an output, we want craft ai to predict whether the user wants to read a given post or not.
Here is our model :
- Category: of type “enum”, taking its values among different news categories like sports or video games.
- Read: also of type “enum”. It will contain either “interested” or “not interested” depending on the user’s actions.
- Output will be Read: we want the API to predict whether a user is “interested” or “not interested”.
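Written out, this model could look like the configuration sketch below. The exact schema (key names, required fields such as a time quantum) should be checked against the craft ai reference documentation; this is only an illustration of the two properties and the output described above:

```python
# Sketch of the agent configuration described above. The exact craft ai
# schema may differ; this is illustrative only.
configuration = {
    "context": {
        "category": {"type": "enum"},  # news category tagged by the agency
        "read":     {"type": "enum"},  # "interested" / "not interested"
    },
    "output": ["read"],  # the property we ask craft ai to predict
}
print(configuration["output"])
```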
This model is then sent to a newly created craft ai agent. craft ai uses agents to instantiate a virtual representation of the object it has to learn from. In our case we can consider the agent as the virtual image of our user. In other cases, for instance, when using learning on objects such as thermostats, the agent can be imagined as the brain of the object.
The setup is complete. Now we need to retrieve feedback from the user on what he’d like to see in his news feed. To dodge the cold start problem, we’re going to start off by showing him an unfiltered stream of posts. Once we’re satisfied with our learning results, we’ll kick in the prediction phase.
As we said before, we’re trying to use implicit methods to retrieve information about our user. If he clicks on a news post to read it, we conclude that he’s interested. Later on, we could scale that interest by measuring how long he stayed on the page, but for now let’s keep it simple. On the other hand, there is no simple way to implicitly find out that a user is not interested in a news post. One solution could be to take ignored posts into consideration; unfortunately, a post can be ignored for a variety of reasons, from “I don’t have the time” to “I missed it”, so no conclusion can be drawn from that. This is why we’re going to use a “not interested” button, so the user can give us truthful feedback on news he doesn’t want to see.
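These two feedback channels can be turned into timestamped context operations to feed the agent with. The payload shape below is illustrative, not the exact craft ai format:

```python
import time

# Sketch: turning implicit and explicit feedback into timestamped context
# operations. The operation format is illustrative, not the exact craft ai
# payload; check the API reference for the real one.

def operation(category, interested, timestamp=None):
    return {
        "timestamp": int(timestamp if timestamp is not None else time.time()),
        "context": {
            "category": category,
            "read": "interested" if interested else "not interested",
        },
    }

# A click on a post is implicit positive feedback...
ops = [operation("sports", True, 1_500_000_000)]
# ...while the "not interested" button gives explicit negative feedback.
ops.append(operation("politics", False, 1_500_003_600))

print([op["context"]["read"] for op in ops])
# prints ['interested', 'not interested']
```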
After learning a user’s habits, here is the kind of tree that we get:
We already see an issue with the results we’re getting: the choices seem a bit binary, as the user either reads a certain category or not. This behavior does not reflect reality, since a lot of people like to read certain things at certain hours. This is what we call context: in which environment is our user evolving while he’s giving us his feedback?
The majority of existing approaches to recommender systems focus only on providing the most relevant items to users, whereas every action a user makes has a context attached to it. Contextual information such as time and location can heavily influence a user's decision.
Traditionally, a recommender system works as follows:
- R : User × Item → Rating
What we really want is something more similar to :
- R : User × Item × Context → Rating
In our case, we believe that the time of day is a valuable piece of contextual information, so we’re going to add it to our model:
- Category : is the same as our previous model.
- Read : is also the same as our previous model.
- Time of day : a number between 0 and 24 representing the number of hours since midnight.
- Timezone : which will be of the type timezone. It represents the timezone as an offset from UTC.
- Output will still be Read.
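The contextualized model could then be sketched as follows. The “time_of_day” and “timezone” types are assumed here to be built-in craft ai context types generated from the operations’ timestamps, so verify the exact names and flags against the craft ai documentation:

```python
# Sketch of the contextualized model. "time_of_day" and "timezone" are
# assumed to be craft ai types derived from the operations' timestamps;
# names and flags are illustrative.
configuration = {
    "context": {
        "category": {"type": "enum"},
        "time":     {"type": "time_of_day"},  # 0-24, hours since midnight
        "tz":       {"type": "timezone"},     # offset from UTC
        "read":     {"type": "enum"},
    },
    "output": ["read"],  # still the property we want predicted
}
print(sorted(configuration["context"]))
```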
Now that we have our new model, let’s train it. Going through a lot of news manually to record a user’s feedback each time we want to test the model would be tedious, so we decided to create hard-coded fake users that simulate real users’ behaviors. Each of these fake users has his own habits and will read different news categories at different times:
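As an illustration of such a hard-coded user, here is a sketch of what “Sarah” could look like; her habits (sports in the morning, video games in the evening) are invented for the example:

```python
# Sketch of a hard-coded fake user used to train the model.

def sarah_reads(category, hour):
    """Simulated feedback: does Sarah read a post of this category now?"""
    if category == "sports" and 6 <= hour < 12:
        return True  # sports in the morning
    if category == "video games" and 18 <= hour < 23:
        return True  # video games in the evening
    return False

# Simulate one day of posts: one post per category every hour.
feedback = [
    (hour, cat, sarah_reads(cat, hour))
    for hour in range(24)
    for cat in ("sports", "video games", "politics")
]
read_count = sum(1 for _, _, r in feedback if r)
print(read_count)  # prints 11: posts Sarah "read" during the simulated day
```

Replaying several such simulated days as context operations gives the agent enough history to grow a decision tree.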
Here is how the decision tree looks for Sarah after 1 day:
After one day of learning a new user’s habits, the decision tree is not precise enough and doesn’t fit our user’s profile yet.
After 2 days:
On the second day, we can see that the craft ai agent’s decision tree is already more precise.
After 3 days:
After 4 days:
As we can see, it only takes 4 days of normal usage for the craft ai learning API to start fitting the user’s profile.
After 7 days:
After seven days, most of the leaves of the decision tree are green. We haven’t explained it yet, but the gradient color of the tree’s leaves represents the confidence in the decision: simply put, the confidence value shows how confident the agent is in the decision it’s giving. Green represents high confidence, whereas red represents low confidence.
After seven days of active use, we have a robust decision tree of the user’s habits. We say active because non-simulated users might take slightly more time, as you can’t expect a human user to always be this active. Regardless, no configuration from the user was required, just time and regular use of the app. Since we are satisfied with our results, we can now let the craft ai agent filter out uninteresting news on its own: we activate the craft ai learning API’s predictions.
The filtering is really simple with the API’s predictions: we just ask the craft ai agent whether or not we should display the news post, e.g.:
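Here is a runnable sketch of that filtering step. Since the exact client call depends on the craft ai SDK, the decision endpoint is stubbed out with a hypothetical `get_decision` function standing in for the real API call:

```python
# Sketch of the filtering step. `get_decision` stands in for a call to the
# craft ai decision endpoint (the real client API may differ); it is
# stubbed with a trivial rule so the sketch is runnable.

def get_decision(context):
    # Stub: pretend the agent learned "sports in the morning only".
    if context["category"] == "sports" and context["time"] < 12:
        return {"read": "interested", "confidence": 0.9}
    return {"read": "not interested", "confidence": 0.8}

def should_display(post, hour):
    decision = get_decision({"category": post["category"], "time": hour})
    return decision["read"] == "interested"

posts = [
    {"title": "Morning match recap", "category": "sports"},
    {"title": "Election update", "category": "politics"},
]
feed = [p["title"] for p in posts if should_display(p, hour=9)]
print(feed)  # prints ['Morning match recap']
```

Only posts the agent predicts as “interested” make it into the feed; everything else is silently filtered out.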
We have managed to build a basic content-based recommender system. It’s efficient and keeps learning from successful and unsuccessful recommendations. Unfortunately, it’s far from perfect: to create a working model quickly, we dodged a lot of the problems we were facing. The most notable one is proper categorization of documents with the help of NLP; for instance, at the moment we have a sports category but we don’t differentiate tennis from football. Another issue is that, since the filtering is applied instantaneously, once the user has set his profile it will be very hard to change it, as only news that fits this profile will be pushed. There is a crucial need for a discovery feature so that, if a user changes his habits, the decision tree can adapt. This is closely related to the famous machine learning tradeoff between exploration and exploitation, and we could improve our recommender system by using, for instance, bandit mechanisms to try pushing new topics to users.
Information Filtering: Overview of Issues, Research and Systems by Uri Hanani, Bracha Shapira and Peretz Shoval 
Analysis of Recommender Systems’ Algorithms by Emmanouil Vozalis and Konstantinos G. Margaritis 
Recommender Systems by Prem Melville and Vikas Sindhwani 
Context-Aware Recommender Systems by Gediminas Adomavicius and Alexander Tuzhilin