the all-thing | 2010-09-04 16:08:20 -0400
==========================================

Smoothing users' votes
----------------------
Date: March 31, 2009 11:16pm
Author: William Morgan
Labels: stats
URL: http://all-thing.net/smoothing.txt

In a previous post [1] I describe how you can cook up a Bayesian framework that results in IMDB's so-called "true Bayesian estimate", a formula which, on its face, doesn't look particularly Bayesian.

As my astute commenters pointed out, this formula has many simpler interpretations without needing to invoke the B word. For example, it's a linear interpolation between two values:

\define{\that}{\hat{\theta}}
\that=\lambda(v) R + (1-\lambda(v))\tau

where R is our mean vote, \tau is some smoothing target, and \lambda(v) is the smoothing weight. \lambda(v) can be any function, as long as it increases with v, stays between 0 and 1, and is 0 when v is 0. Those constraints give you the right behavior: with no votes, your estimate is \tau exactly; as you add votes, it approaches R, and \lambda(v) controls how fast that happens.

This formulation naturally leads to the following question: if I'm smoothing like this to deal with paucity-of-data issues, what value of \tau should I pick? IMDB uses \tau=C, the global movie mean. Intuitively that makes sense, but is it the right choice?

What's nice about the expression for \that above is that the case we're most interested in is v=0, i.e. when there are no votes. In that case, \that=\tau, because of how I've constrained \lambda(v). So finding the best \tau is equivalent to finding the best \that when v=0.

\define{\risk}{R(\theta, \that)}
\define{\loss}{L(\theta, \that)}
\define{\exp}[1]{E_\theta\left[#1\right]}

Happily, we can answer the question of the best \that analytically, at least if we're happy to imagine that each movie has a "true" value \theta. Given \theta, we can define a loss function \loss that describes how bad we think a particular value of \that is.
But we don't really know what \theta is for any movie (if we did, we wouldn't be bothering with any of this). So we can generalize that a step further and define a risk function \risk=\exp{\loss} quantifying our _expected loss_: the aggregate of the loss function across all possible values of \theta, weighted by the probability of each value.

This gives us the tool we really need to answer the question above: the \that that minimizes our risk is the winner.

In the absence of any specific notions about errors, we'll use the standard loss function for reals, squared-error loss: \loss = (\theta-\that)^2. Then it's just a matter of turning the crank:

\array{\arrayopts{\colalign{right center left}}
  \risk & = & \exp{\loss} \\
        & = & \exp{(\theta-\that)^2} \\
        & = & \exp{\theta^2-2\theta\that + \that^2} \\
        & = & \exp{\theta^2} - 2\that \exp{\theta} + \that^2
}

We can drop that first term, since it doesn't depend on \that and we're only interested in minimizing this as a function of \that. To find the minimum:

\array{\arrayopts{\colalign{right center left}}
  \frac{d}{d\that} \left( -2\that \exp{\theta} + \that^2 \right) & = & 0 \\
  -2 \exp{\theta} + 2\that & = & 0 \\
  \that & = & \exp{\theta}
}

Unsurprisingly, we see that the best estimate of \theta under squared-error loss is the mean of the distribution of \theta. Since we're interested in the case where v=0, this implies that the best value to use for \tau is also the mean.

So IMDB's choice of C makes sense: the mean vote over all your movies is a great estimate of the mean of the distribution of \theta.

A few concluding points:

1. This answer is specific to squared-error loss; if you plug in another loss function, the optimal value for \tau might very well change. And you might actually have a specific model in mind for how "bad" mis-estimates are. Maybe over-estimates are worse than under-estimates, or something like that.

2. The definition of the distribution of \theta is actually completely vague above.
In fact we don't even talk about it; we just use it implicitly in our \exp{\cdot} terms. So you should feel free to plug in (the mean of) whatever distribution you believe most accurately represents your product/movie/whatever. IMDB could arguably do better by plugging in per-category means, or something even fancier.

3. IMDB is actually a particularly bad case because movie opinions are extremely subjective. If you're serious about modeling very subjective things, you should be talking about multinomial models, Dirichlet priors, and the like.

But the take-home message is: in the absence of a specific loss function that you really believe, smoothing towards the mean isn't just intuitive, it's minimizing your risk.

[1] http://all-thing.net/bayesian-average
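
If you'd rather check the argument above numerically than trust the algebra, here is a minimal Python sketch. Everything specific in it is my assumption for illustration, not anything IMDB does: the weight \lambda(v) = v/(v+m) with m=25 is just one function satisfying the constraints on \lambda(v), and the scaled Beta distribution is a made-up stand-in for the unknown distribution of \theta. It scans candidate values of \tau and reports the one with the lowest average squared-error loss, which should land next to the sample mean.

```python
import random


def smoothing_weight(v, m=25.0):
    """One admissible lambda(v): 0 at v=0, increasing in v, always below 1.

    The form v / (v + m) and the constant m are illustrative assumptions.
    """
    return v / (v + m)


def smoothed_estimate(mean_vote, v, tau, m=25.0):
    """theta_hat = lambda(v) * R + (1 - lambda(v)) * tau."""
    lam = smoothing_weight(v, m)
    return lam * mean_vote + (1.0 - lam) * tau


def empirical_risk(theta_hat, thetas):
    """Average squared-error loss of a fixed estimate over sampled thetas."""
    return sum((t - theta_hat) ** 2 for t in thetas) / len(thetas)


# Hypothetical distribution of "true" movie values on a 1-10 scale.
rng = random.Random(0)
thetas = [1.0 + 9.0 * rng.betavariate(5.0, 3.0) for _ in range(2000)]
sample_mean = sum(thetas) / len(thetas)

# With v=0 the estimate is tau itself, so scan tau over 1.00, 1.01, ..., 10.00
# and keep the candidate with the lowest empirical risk.
candidates = [1.0 + 0.01 * i for i in range(901)]
best_tau = min(
    candidates,
    key=lambda tau: empirical_risk(smoothed_estimate(0.0, 0, tau), thetas),
)

print(f"sample mean of theta: {sample_mean:.3f}")
print(f"risk-minimizing tau:  {best_tau:.2f}")
```

Swapping in a different loss inside empirical_risk (absolute error, say, or something asymmetric) is a quick way to see concluding point 1 in action: the winning \tau moves away from the mean.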