When we were asked to build a recommender system for our marketplace, our first thought was to use an out-of-the-box implementation from Spark MLlib, such as ALS (alternating least squares matrix factorization), or an equivalent off-the-shelf solution.
However, as we explored our data and business needs, we realized that a more tailored solution would be required. Our catalog had upwards of 50 million unique items, and only a small fraction of these products had any view or purchase history. There was a cold-start problem on the user side as well: most users don't view or purchase enough products to reliably generate recommendations with user-item collaborative filtering methods.
In addition to that, we needed recommendations for a widget on the homepage that was previously "hand-curated" by editors who scanned the entire catalog to find the best selection: products that would delight customers and help drive traffic to other parts of the marketplace. Since editors couldn't build selections for every customer, they picked broad market trends and built one assortment to show everyone. To automate curation, we had to find the best deals in the catalog, match deals to the taste of every user, and ideally surface deals early enough to make those products popular. We also wanted to recommend products from diverse categories, both to prevent any one category from dominating sales and to maintain a 'freshness' factor. Here is how we decomposed this problem into four parts and how we came up with a modeling scheme:
1. Finding the best deals from the catalog (Product Pool Selection): We used a forecasting model to predict the lift in sales for a given product based on its historical data and the discounts offered on it. Initially, we used a simple price elasticity model: we calculated an elasticity coefficient from the ratio of quantity change to price change, then multiplied it by the proposed price change to get the predicted quantity. Within a few days, we found that this is better modeled as a multivariate problem; regressing on price alone gives very high variance, since other factors matter in forecasting, such as the product's lifecycle, visibility, and promotions. Based on this finding, we replaced the model with a random forest regression that takes the product's sales time series and some product features into account and predicts the net lift over baseline sales each day. Besides improving prediction accuracy, this gives us the flexibility to add more features in the future, such as the location of product placement, the number of views, and searches for similar products. With this model we were able to filter the best deals in every category, not just by discounted price but by likeability to customers. We have used this information in other areas of our business as well, but that's for later.
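The initial price-elasticity baseline can be sketched in a few lines. This is an illustrative reconstruction from the description above, not the production code; the function names and example figures are invented:

```python
# Minimal sketch of a price-elasticity baseline (illustrative, not Paytm's code).
# Elasticity = (% change in quantity) / (% change in price); the predicted
# quantity then applies that coefficient to a proposed price change.

def elasticity(q_old, q_new, p_old, p_new):
    """Elasticity coefficient from two observed (price, quantity) points."""
    dq = (q_new - q_old) / q_old
    dp = (p_new - p_old) / p_old
    return dq / dp

def predict_quantity(q_base, p_base, p_new, e):
    """Predicted quantity after a price change, given elasticity e."""
    dp = (p_new - p_base) / p_base
    return q_base * (1 + e * dp)

# Example: quantity rose from 100 to 130 units when price dropped from 50 to 40.
e = elasticity(100, 130, 50, 40)      # -> -1.5 (elastic demand)
q = predict_quantity(100, 50, 45, e)  # a 10% discount -> predicted 115 units
```

The weakness mentioned above is visible here: the model sees nothing but price, so any lift driven by lifecycle, visibility, or promotions gets folded into a noisy elasticity estimate, which is why a multivariate random forest replaced it.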
2. Personalization of deals based on user history (Collaborative filtering based item recommendation): Now that we have the best deals, how do we maximize conversion? By tailoring the products to each customer's preferences. We used an item-item similarity matrix here, with a simple spin to capture more data: instead of building it the traditional way, from the co-occurrence of items purchased together, we used the co-occurrence of items browsed and items purchased. Then we looked at every user's last viewed item and recommended the N products with the highest co-occurrence. We added two more improvements to this model:
- Instead of only using the last viewed item, use the last K views, assigning higher weights to more recent views.
- When the co-occurrence values are raw counts, popular products tend to dominate the recommendations. Normalizing with Jaccard similarity, or another normalized similarity measure, corrects this popularity bias. Together, these give us personalized deals for all users.
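The co-occurrence scoring with the Jaccard correction can be sketched as follows. This is a toy reconstruction under assumed data structures (the session sets and item names are invented, and the last-K recency weighting is omitted for brevity):

```python
# Sketch of item-item co-occurrence scoring with a Jaccard correction
# (illustrative toy data; not the production pipeline).
from collections import defaultdict
from itertools import combinations

# Each session mixes browsed and purchased items, per the "simple spin" above.
sessions = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"phone", "headphones"},
    {"case", "charger"},
]

item_count = defaultdict(int)   # number of sessions containing item i
pair_count = defaultdict(int)   # number of sessions containing both i and j
for s in sessions:
    for i in s:
        item_count[i] += 1
    for i, j in combinations(sorted(s), 2):
        pair_count[(i, j)] += 1

def jaccard(i, j):
    """Co-occurrence / union size: damps globally popular items."""
    a, b = sorted((i, j))
    co = pair_count[(a, b)]
    return co / (item_count[i] + item_count[j] - co)

def recommend(last_item, n=2):
    """Top-n items by Jaccard similarity to the user's last viewed item."""
    scores = {j: jaccard(last_item, j) for j in item_count if j != last_item}
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

With raw counts, "case" and "charger" would both score high next to "phone" simply because they appear often; the Jaccard denominator penalizes items that co-occur with everything.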
3. Predicting product categories (category affinity identification): With objective 2 above we have already figured out personalization, so we should be ready to go to town with it. But look closely: what we have built so far is a similar-item recommender that surfaces the best deals among items similar to your last viewed item. We want to delight our users by preemptively seeing what product they would like next. This is a hard problem to solve at the product level, so we decided to solve it at the category level (some 200+ categories). We use over 6,000 features describing users' view and purchase history across verticals such as physical goods, digital goods, travel, entertainment, and bill payments. We pose this as a multiclass classification problem: predict the next category in which a user will make a purchase (not a view!). We have been using the random forest implementation in Spark MLlib for this. Given the huge disparity between classes, we had to write our own resampling strategy before calling the model, as MLlib does not have one implemented. We used cross-validation to find the best parameters and precision/recall as our measure. So far we have seen some very interesting results, with P@1 = 0.45, meaning that in 45% of cases we predict the next category correctly, and this increases as we go to P@10.
Overall, this model now tells Paytm precisely which of roughly 200 categories a user is likely to purchase from next, and we are working on taking this to 2,000 categories.
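A resampling step of the kind described above can be sketched as simple random oversampling. The actual strategy used before calling Spark MLlib is not public, so this is only one plausible shape, with invented data:

```python
# Sketch of a resampling step for imbalanced classes (illustrative;
# the actual strategy used in the pipeline above is not public).
import random
from collections import defaultdict

def oversample(rows, label_of, seed=0):
    """Randomly duplicate minority-class rows until every class
    matches the majority-class count."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for r in rows:
        by_class[label_of(r)].append(r)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for rows_c in by_class.values():
        balanced.extend(rows_c)
        balanced.extend(rng.choice(rows_c) for _ in range(target - len(rows_c)))
    rng.shuffle(balanced)
    return balanced

# Toy example: 3 "mobiles" purchases vs. 1 "travel" purchase.
data = [("u1", "mobiles"), ("u2", "mobiles"), ("u3", "mobiles"), ("u4", "travel")]
balanced = oversample(data, label_of=lambda r: r[1])
# Both classes now contribute 3 rows each.
```

Oversampling the minority classes (rather than undersampling the majority) keeps all observed purchases in the training set, which matters when rare categories already have little signal.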
4. Ensemble! (recommendation assembly): To recap, we now know the most likely purchase categories for each user, the most attractive deals in those categories, and the products most correlated with each user's last product view. We can go to town with this. We mix the probabilities from the three models to generate the sequence of products we show each user. Using this ensemble, we have powered 15 widgets on different parts of the marketplace that were serving similar objectives. We have already seen very promising results, such as a 3.5x lift in CTR and a 2x lift in conversion rate on the widgets where this went live, and though there is still a long way to go, we have been able to improve these algorithms iteratively and drive more transactions.
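The assembly step can be sketched as a multiplicative mix of the three signals. The actual blending weights and normalization are not public, so this is a minimal sketch with invented scores and a simple product-of-signals rule:

```python
# Sketch of the recommendation assembly step (illustrative; the real
# blending scheme is not public). Each product is scored by mixing:
#   P(next category) * predicted deal lift * item-item similarity.

def assemble(candidates, category_prob, deal_score, similarity, n=3):
    """candidates maps product -> category. A small floor keeps a product
    from being zeroed out when one model has no signal for it."""
    floor = 0.01
    scores = {
        p: category_prob.get(c, floor)
           * deal_score.get(p, floor)
           * similarity.get(p, floor)
        for p, c in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Toy inputs from the three models (all values invented).
candidates    = {"phoneA": "mobiles", "bagB": "fashion", "tripC": "travel"}
category_prob = {"mobiles": 0.45, "fashion": 0.30, "travel": 0.10}
deal_score    = {"phoneA": 0.2, "bagB": 0.8, "tripC": 0.5}
similarity    = {"phoneA": 0.6, "bagB": 0.3, "tripC": 0.1}

ranking = assemble(candidates, category_prob, deal_score, similarity)
# phoneA: .45*.2*.6 = .054; bagB: .30*.8*.3 = .072; tripC: .10*.5*.1 = .005
# -> ["bagB", "phoneA", "tripC"]
```

Note how the mix reorders products: "phoneA" wins on category affinity and similarity, but "bagB" ranks first because its deal score is strong enough to outweigh them.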
We also faced many challenges, such as class imbalance, data sparsity, huge differences between categories, and figuring out similar feature sets so we could operationalize this easily. We will do some future blog posts focusing on the challenges we faced and how we solved them.
In summary, always think about the use case and how to scale the solution. When you can decompose the problem into small parts and build solutions like frameworks, you can scale better and repurpose the solution to serve many similar use cases.
Team Profile: The Midgar team is tasked with delivering a highly personalized experience on Paytm. The team is looking for super-talented software engineers, machine learning engineers, and data engineers. Check out our other open positions as well. To learn more, write to us using the contact form.