Thursday, May 24, 2018

21 - Personalizing Ecommerce Search using Logit and Gradient Boosting.


You run the P&L for an ecommerce business. Innovation is your raison d'être, bleeding-edge technology the bastion of your business, and data your lifeline.
·      How would it be if you could re-rank your site search results based on the relevance of a page?
·      How would it be if you could provide the top 10 URLs a user is most likely to click for a given query term?
·      How would it be if you could extend this concept and employ personalization at scale, so that search results and their ranking are based on individual preferences?
·      What would this do to your conversions?
Great marketing empowerment, no doubt, but before delving into the details, let's get the 101s in order.
1)    Relevance: The score assigned to a result page based on various factors; pages are ranked by these scores.
2)    Attributes: The factors or independent variables that impact the final rank of a URL.
3)    Algorithm: A fancy term for the ranking process. It can be as simple as a rule of thumb or as involved as a LambdaMART gradient boosting process (pretty impressive, eh?). For non-nerds like myself, it is a formula that spits out the relevance of a page based on the page attributes provided.
4)    Query: The search term input by the user.
On a side note, the term relevance is, to put it mildly, slippery. I spent weeks trying to get my arms around the two major data sets that impact it. Did I say I am not an AI nerd? The two disparate data sets impacting relevance are A: page-related and B: web-log-related. Part A is mostly metadata and indexed information, while Part B is more dynamic.
A: Page Metadata:
a.    Body hits: Occurrences (and positions) of the search term within the document body.
b.    Body length: Length of the document body, e.g. its word count.
c.    Anchors: Number of links whose anchor text contains the search terms.
The good news is that most modern search frameworks like Lucene, and by extension Elasticsearch, Endeca and so on, do this out of the box by indexing existing and new pages on the header, content, anchors, meta tags and so forth.
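To make the page-side features concrete, here is a toy Python sketch of how they could be computed by hand for a single document. The function name and the document fields (`body`, `anchor_texts`) are illustrative assumptions; in practice an engine like Lucene or Elasticsearch derives equivalent signals at index time.

```python
# Toy sketch (not production code) of the page-metadata features described above,
# computed from a document represented as a plain dict. Field names are assumptions.
import re

def extract_page_features(doc: dict, query: str) -> dict:
    terms = query.lower().split()
    body_tokens = re.findall(r"\w+", doc.get("body", "").lower())

    # Body hits: positions (0-based) where any query term occurs in the body.
    body_hits = [i for i, tok in enumerate(body_tokens) if tok in terms]

    # Body length: total number of tokens in the document body.
    body_length = len(body_tokens)

    # Anchors: number of anchor texts (links) containing a query term.
    anchors = sum(
        1 for a in doc.get("anchor_texts", [])
        if any(t in a.lower() for t in terms)
    )

    return {"body_hits": len(body_hits),
            "first_hit_position": body_hits[0] if body_hits else -1,
            "body_length": body_length,
            "anchor_hits": anchors}

# Example:
doc = {"body": "Red running shoes for trail running",
       "anchor_texts": ["running shoes", "home"]}
print(extract_page_features(doc, "running shoes"))
```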
B: Web Log: As the saying goes, data is the new oil, and web logs are gold mines, make no mistake about it, if you know how to read them and effectively employ the insights. For instance, web logs give you the following (this is not an all-encompassing list):
a)   Who has logged in and when? – Session information
b)   What did they search for? – Query term and sub-terms
c)   What results were thrown up? – Search output
d)   What did they click through? – Clicks, misses and skips
e)   How long did users spend on a particular page? – Dwell time
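To make this concrete, here is a minimal Python sketch of loading such signals from a search log, assuming a simplified tab-separated format with one row per search impression. The schema and field names are illustrative assumptions, not any standard log format.

```python
# A minimal sketch, assuming a simplified tab-separated search log with one row
# per impression: session_id, user_id, query, shown_urls (| separated),
# clicked_urls (| separated), dwell_seconds (| separated, aligned with clicks).
import csv
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SearchImpression:
    session_id: str
    user_id: str
    query: str
    shown_urls: List[str]
    clicks: Dict[str, float] = field(default_factory=dict)  # url -> dwell seconds

def load_impressions(path: str) -> List[SearchImpression]:
    impressions = []
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            clicked = row["clicked_urls"].split("|") if row["clicked_urls"] else []
            dwells = ([float(d) for d in row["dwell_seconds"].split("|")]
                      if row["dwell_seconds"] else [])
            impressions.append(SearchImpression(
                session_id=row["session_id"],
                user_id=row["user_id"],
                query=row["query"],
                shown_urls=row["shown_urls"].split("|"),
                clicks=dict(zip(clicked, dwells)),
            ))
    return impressions
```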
The nugget in the whole piece is point d). Imagine a user issuing a query Q and the search engine returning URLs U1, U2, U3, U4 and U5 as the results, in that order. Assume the user clicks U3 and U4 while browsing. The implications are the following:
1)   U1 and U2 are SKIPS and need to be penalized, because they were passed over by the user even though the search engine ranked them 1 and 2.
2)   U5 is a MISS: even though it was visible on the page, the user was satisfied by U3 and U4 and did not bother to check out U5 further below. That is a negative for U5, but not as bad as case (1).
3)   U3 and U4 are CLICKS, i.e. successes, but their relevance to the user's expectations depends on the DWELL TIME on each URL. If the user spent 30 seconds on U3 and only 2 seconds on U4, U3 is clearly more relevant to the user than U4; and if this pattern holds across enough users, U3 should always rank above U4 for that query.
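Here is a short Python sketch of that CLICK / SKIP / MISS labelling, together with a simple dwell-time-based relevance grade. The dwell thresholds (5 and 30 seconds) and the 0–2 grading scale are illustrative assumptions.

```python
# Sketch of CLICK / SKIP / MISS labelling plus a dwell-time-based graded
# relevance (0 = bad, 1 = fair, 2 = good). Thresholds are illustrative only.
from typing import Dict, List, Tuple

def label_impression(shown_urls: List[str],
                     clicks: Dict[str, float]) -> List[Tuple[str, str, int]]:
    """Return (url, label, graded_relevance) for each shown URL."""
    clicked_positions = [i for i, u in enumerate(shown_urls) if u in clicks]
    last_click = max(clicked_positions) if clicked_positions else -1

    labelled = []
    for i, url in enumerate(shown_urls):
        if url in clicks:
            dwell = clicks[url]
            grade = 2 if dwell >= 30 else (1 if dwell >= 5 else 0)
            labelled.append((url, "CLICK", grade))
        elif i < last_click:
            # Ranked above a clicked result but passed over -> strongest negative.
            labelled.append((url, "SKIP", 0))
        else:
            # Below the last click; probably never examined -> weaker negative.
            labelled.append((url, "MISS", 0))
    return labelled

# The example from above: U3 and U4 clicked, so U1/U2 become SKIPs and U5 a MISS.
print(label_impression(["U1", "U2", "U3", "U4", "U5"], {"U3": 30, "U4": 2}))
```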
The Framework: Given the information in the web logs, the following five-part framework will be useful in implementing a web-log-based re-ranking plug-in for your ecommerce search capabilities.
  •  Input Variables: Collate input variables from the web log – at the very minimum they will include user, session, query terms, CLICKS, SKIPS, MISSES, DWELL TIME, ORIGINAL RANK and so on. The aggregation techniques required to turn this information into usable metrics are beyond the scope of this article.
  • Output Variable: The output of any formula or statistical / AI model you employ is the probability of a particular URL being relevant to the query, generally on a three-point scale. In simple Naïve Bayes terms, we are trying to predict P(Relevance = HIGH | the input factors from Step 1).
  • Algorithms: The math behind each of these approaches is certainly beyond this article and, as a confession, beyond the author as well. Sketches of both flavours follow after this list.
    •  Pointwise classification: e.g. Naïve Bayes, Logit – rank URLs by descending probability of the URL's relevance being HIGH and/or MEDIUM.
    •  Pairwise classification: e.g. Gradient Boosting, LambdaMART – a multi-level approach that works on the comparative cost of ranking one URL over another across all pairwise URL combinations for a query.
  •  Re-ranking Efficacy: NDCG (Normalized Discounted Cumulative Gain) measures the information gain obtained by a particular ordering of URLs, given their eventual relevance to the query posed. Simply put, the higher the NDCG, the better the ordering, and that is exactly what the algorithms seek to optimize (a worked example follows below).
  • Personalization & CX: For IT-mature organizations focused on customer centricity, employing search re-ranking and relevance based on individual click history brings a wealth of personalization opportunities and behavioural context to user browsing actions, all of this in real time.
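A minimal sketch of the pointwise approach using scikit-learn's logistic regression (the "Logit" option above): fit P(relevant) per (query, URL) row, then sort candidates by that probability. The feature set and the toy numbers are assumptions, not real data.

```python
# Pointwise re-ranking sketch using logistic regression. Features per
# (query, url) row are assumptions: original_rank, clicks, skips, misses,
# avg_dwell_seconds. Labels: 1 if the URL was judged relevant, else 0.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([
    [1, 120, 10,  5, 35.0],
    [2,  80, 40,  9, 12.0],
    [3,   5, 90, 30,  2.0],
    [4,   2, 70, 60,  1.0],
])
y_train = np.array([1, 1, 0, 0])

model = LogisticRegression().fit(X_train, y_train)

# Re-rank candidate URLs for a new query by descending P(relevant).
candidates = {"U1": [1, 3, 50, 20, 2.0], "U2": [2, 60, 5, 3, 28.0]}
scores = {u: model.predict_proba(np.array([f]))[0, 1] for u, f in candidates.items()}
print(sorted(scores, key=scores.get, reverse=True))
```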
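A similarly minimal sketch of the pairwise / LambdaMART flavour, assuming the LightGBM library is available: its ranker trains on graded labels with a lambdarank objective, and the group argument tells it which rows belong to the same query. All data here is toy data.

```python
# Pairwise / LambdaMART-style sketch using LightGBM's ranker (assumes the
# lightgbm package is installed). Graded labels 0-2 come from click/dwell
# processing; `group` gives the number of candidate URLs per query.
import numpy as np
import lightgbm as lgb

X = np.random.rand(10, 5)                       # 10 candidate URLs, 5 toy features each
y = np.array([2, 1, 0, 0, 1, 2, 0, 1, 0, 0])    # graded relevance per URL
group = [5, 5]                                  # two queries, five candidates each

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50, min_child_samples=1)
ranker.fit(X, y, group=group)

# Higher scores should be ranked first among an unseen query's candidates.
new_candidates = np.random.rand(5, 5)
order = np.argsort(-ranker.predict(new_candidates))
print(order)
```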
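And a small worked example of NDCG itself: the gain from each result is discounted by its position, then normalized by the gain of the best possible ordering, so a ranking that surfaces the most relevant URLs first scores closer to 1.

```python
# NDCG (Normalized Discounted Cumulative Gain) in a few lines. Relevance grades
# reuse the illustrative 0-2 scale from the labelling sketch above.
import math

def dcg(relevances):
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Engine's current order buries the best result (grade 2) in third place:
print(round(ndcg([0, 1, 2, 0, 1]), 3))   # lower score
# A re-ranked order that surfaces it first is ideal:
print(round(ndcg([2, 1, 1, 0, 0]), 3))   # 1.0
```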
