Author: Sagun B.

At Paytm we focus on our customers above all else. When the customer is happy, we are happy.

The Personalization team looks after user satisfaction by ensuring that we serve each user the right product at the right time and location. For those of you who have read our “Recommendations at Paytm” blog post, you know just how much work goes into building a recommender system that picks the best items from a pool of millions of products for every user, helping them cut through the noise and find the most relevant products.

In this post, we want to reflect on and share our journey in improving customer recommendations. We didn’t tackle the behemoth problem of selecting the ~60 most relevant products from a pool of 70M+ products head on. We broke it down into a few smaller problems. One of them was predicting a customer’s in-market category affinities, i.e. identifying what a customer is going to be most interested in next. When we started, we only predicted user affinities at a higher level of the category tree (some 200+ categories). Since our first model we have gone through multiple iterations, and we now predict affinities at the leaf-level categories (about 1,800 of them). Going from 200 categories to 1,800 was non-trivial, as most categories fall in the long tail. The feature space explodes as the cardinality grows, and so does computation time. The old approach that worked for 200 categories no longer worked, so we had to get more creative. Below we dive into some details of how we solved it.

The Journey

  1. First approach – Hierarchical classification: Our first intuition was to break the problem into smaller problems again. In this case there were obvious sub-problems, because category trees have a natural hierarchy. We decided to solve it through a typical hierarchical classification scheme: predict an affinity distribution over the high-level, tier 1 (T1) categories, and then predict affinities for the leaf-level, tier 4 (T4) categories. We used a simple random forest model whose feature space was an aggregate representation of user views and purchases. The model was built with the random forest implementation in Spark MLlib, which is seriously constrained in the hyperparameters it lets you tune. The final category affinity distribution was then computed using the law of total probability:
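Written out (the notation here is ours: u is the user, and each T4 leaf category hangs under a T1 parent):

```latex
P(c_{T4} \mid u) \;=\; \sum_{c_{T1}} P(c_{T4} \mid c_{T1}, u)\, P(c_{T1} \mid u)
```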

This approach set our baseline at P@1 (precision at 1) = 0.12, meaning that in 12% of cases we predicted the next purchase category correctly.
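The P@1 metric used throughout this post can be computed as follows; this is our minimal sketch, and the function name and toy data are ours, not Paytm's evaluation code:

```python
# Minimal sketch of P@1 (precision at 1): the fraction of users whose
# top-ranked predicted category matches their actual next purchase category.

def precision_at_1(predicted_top1, actual_next):
    hits = sum(1 for p, a in zip(predicted_top1, actual_next) if p == a)
    return hits / len(actual_next)

# Toy example: 3 of 4 top-1 predictions match the next purchase category.
print(precision_at_1(["mobiles", "shoes", "books", "toys"],
                     ["mobiles", "shoes", "books", "grocery"]))  # → 0.75
```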

  2. Iterate – Try a better model: Of course, 12% precision is not acceptable by any standard. So we went back to the drawing board and tried a host of different algorithms to improve our category classifier. We tried a slew of classifiers at each tier, ranging from Naive Bayes and Logistic Regression to Random Forests and Multilayer Perceptrons. However, the precision simply did not budge by enough to warrant pushing any of them to production.
  3. Tangent – Good old collaborative filtering: Having tried many different approaches, it was finally time to attack the problem in a completely different way. One of the pain points of any classification problem is feature engineering: how do we represent a user in a way that a machine can understand? Traditionally this is done with the help of experts who understand the user demographics very well, or, in the absence of experts, through a lot of trial and error. As a quick proof of concept, we tried basic item–item collaborative filtering (CF) to predict a user’s next purchase category. The appeal of this approach is that feature engineering becomes trivial: a user is simply represented by the most recent items they have viewed or purchased. This worked surprisingly well, bumping our P@1 by ~60% and raising it to 0.19. But there was one glaring problem with our CF approach: the curse of sparsity. Very few items have a lot of interactions, and the majority of our items have few to none – our catalog essentially follows a Zipfian distribution. This caused us to predict unknown affinities for most of the T4 categories. To address this, we reduced our CF matrix to incorporate only category–category correlations. That dealt with 90% of the sparsity problem while not sacrificing too much precision – our P@1 dropped to about 0.17. Still not good enough!
  4. Iterate – FFNN: The learning above – that a feature space representing just the most recent interactions improves performance – gave us a new direction to follow. Our next step was to find an embedding for our catalog. In considering approaches, we wanted to ensure we could capture local context (i.e. some form of transactional data). We started with a simple neural embedding derived from a feed-forward neural network (FFNN): the input is the prior and posterior interactions, and the target is the current interaction. Using this method we dealt with the sparsity problem completely, and P@1 held steady at 0.17. We did consider alternative matrix factorization solutions to the sparsity problem, but most of them are incapable of capturing transactional context as mentioned above.
  5. Iterate – When in doubt, (semi) deep learn! Now that we had an embedding and reasonably improved performance, we wanted to see if we could squeeze out a little more. We decided to add a final softmax layer that acts on the embeddings and predicts category affinities. However, the softmax layer is a linear classifier, so we had to ensure our embedding made sense in a linear space. To address this, we redesigned the FFNN (from step 4) to use an identity activation function in the hidden layer, which ensures that both the embeddings and linear superpositions of embeddings are meaningful. Much to our surprise, passing these neural embeddings to the final softmax layer provided a significant boost: our P@1 jumped to 0.30! That was when we decided to ship it. In production this has worked surprisingly well, and we have seen many days when P@1 goes beyond 0.45, which is significant.
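The category–category CF step (3 above) can be illustrated as a simple co-occurrence model. This is our sketch, not Paytm's production code; all function names and the toy data are ours:

```python
import numpy as np

# Hypothetical sketch of category-category collaborative filtering:
# build a co-occurrence matrix from users' category histories, then score
# a user's next-category affinity by summing the rows of the categories
# they interacted with most recently.

def build_cooccurrence(histories, n_cats):
    C = np.zeros((n_cats, n_cats))
    for hist in histories:
        for i in hist:
            for j in hist:
                if i != j:
                    C[i, j] += 1.0
    # Row-normalize so each row is a distribution over co-occurring categories.
    row_sums = C.sum(axis=1, keepdims=True)
    return np.divide(C, row_sums, out=np.zeros_like(C), where=row_sums > 0)

def predict_affinity(recent_cats, C):
    # A user's affinity vector is the sum of their recent categories' rows.
    return C[recent_cats].sum(axis=0)

# Toy data: category 0 co-occurs with category 2 twice and with 1 once,
# so a user who recently interacted with 0 should score highest on 2.
histories = [[0, 2], [0, 2], [0, 1], [1, 2]]
C = build_cooccurrence(histories, n_cats=3)
print(int(predict_affinity([0], C).argmax()))  # → 2
```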

End Game (so far): The Category Classifier

The final category classifier that we arrived at through the journey above looks as follows:


Now, we aren’t audacious enough to call this deep learning, since the third and fourth layers of our network aren’t connected … so we are hoping we can get away with calling it semi-deep learning.
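A rough forward-pass sketch of this architecture, under our reconstruction: an identity-activation (i.e. linear) embedding layer followed by a softmax over the leaf categories. The shapes are illustrative and the weights are random stand-ins, not trained production parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
N_ITEMS, EMB_DIM, N_CATS = 1000, 32, 1800  # illustrative sizes

E = rng.normal(size=(N_ITEMS, EMB_DIM))  # item embedding table
W = rng.normal(size=(EMB_DIM, N_CATS))   # softmax layer weights
b = np.zeros(N_CATS)                     # softmax layer bias

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def category_affinities(recent_item_ids):
    # Identity activation: the user vector is the plain linear superposition
    # of the embeddings of the user's most recent items.
    user_vec = E[recent_item_ids].sum(axis=0)
    return softmax(user_vec @ W + b)

p = category_affinities([3, 17, 256])
print(p.shape)  # → (1800,)
```

The softmax output is a proper affinity distribution over the 1,800 leaf categories, which is what the ranking layer downstream consumes.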

Here are a few results showing the success of our new model:


The evaluation was conducted on a prominent list shown on our homepage. The chart above reports the click-through rate (CTR) gain from our recommendations over expert-curated lists: our previous recommendation engine improved CTR by 1.8x, while our new engine improves it by 2.6x.

Concluding Remarks

Building a model for 1,800 categories is non-trivial, but it surfaces granularity in user preferences and is therefore extremely important for user satisfaction. The first solution is seldom the right solution, but it serves as a good baseline. Once the baseline is in place, iterate, iterate, and iterate again. Lastly, always try new things: approaches that work for smaller multi-class data sets do not necessarily work when the number of classes increases by an order of magnitude.

Going forward we will explore more computationally expensive models. In the current iteration we have forgone a fully connected network with more hidden layers in the interest of keeping training time reasonable.

We work on a plethora of different models and are constantly improving each of them. If you have any questions or suggestions you can reach me at or give me a shout on Twitter, @sagunb_.

We are also hiring machine learning, software, and data engineers to help us build the next-gen personalization platform.