<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en">
<head>
  <title>the all-thing</title>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <link rel="stylesheet" href="/static/style.css" type="text/css" />
  <link rel="alternate" type="application/rss+xml" title="the all-thing RSS feed" href="/index.rss" />
  <link rel="alternate" type="text/plain" title="the all-thing in plain text" href="/index.txt" />
  <script type="text/javascript" src="/static/mootools.js"></script>
  <script type="text/javascript" src="http://music.masanjin.net:9292/waxiest.js"></script>
</head>
<body>

<div id="main">
  <div id="header">
    <h1><a  href="/">the all-thing</a></h1>
    
      <p>Showing only posts labeled "stats" (<a  href="/label/stats.rss">rss</a>). <a  href="/index/">See all posts</a>.</p>
    
  </div>
  <div id="content">
    
  <h2><a  href="/smoothing">Smoothing users&#8217; votes</a></h2>
  <div class="byline">
    <a  href="/by/William+Morgan/">William Morgan</a>,
    <span title="11 months ago">March 31, 2009 11:16pm</span>
  </div>
  
    <div class="labels"><span class='label'><a  href="/label/stats/">stats</a></span> </div>
  
  <p class='first'>In a <a href="http://all-thing.net/bayesian-average">previous post</a> I describe how you can cook
up a Bayesian framework that results in IMDB&#8217;s so-called &#8220;true Bayesian
estimate&#8221;, a formula which, on its face, doesn&#8217;t look particularly Bayesian.</p>
<p>As my astute commenters pointed out, this formula has many simpler
interpretations without needing to invoke the B word. For example, it&#8217;s a
linear interpolation between two values: <span title='\define{\that}{\hat{\theta}}' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"></math></span>
<div class='blockmath' title='\that=\lambda(v) R + (1-\lambda(v))\tau'><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow><mo>=</mo><mi>&lambda;</mi><mo stretchy='false'>(</mo><mi>v</mi><mo stretchy='false'>)</mo><mi>R</mi><mo>+</mo><mo stretchy='false'>(</mo><mn>1</mn><mo>&minus;</mo><mi>&lambda;</mi><mo stretchy='false'>(</mo><mi>v</mi><mo stretchy='false'>)</mo><mo stretchy='false'>)</mo><mi>&tau;</mi></math></div>
where <span title='R' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>R</mi></math></span> is our mean vote, <span title='\tau' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&tau;</mi></math></span> is some smoothing target, <span title='\lambda(v)' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&lambda;</mi><mo stretchy='false'>(</mo><mi>v</mi><mo stretchy='false'>)</mo></math></span>
is the smoothing weight.  <span title='\lambda(v)' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&lambda;</mi><mo stretchy='false'>(</mo><mi>v</mi><mo stretchy='false'>)</mo></math></span> can be any function, as long as it
increases with <span title='v' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>v</mi></math></span>, stays between 0 and 1, and is 0 when <span title='v' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>v</mi></math></span> is 0. Those
constraints give you the right behavior: with no votes, your estimate is
<span title='\tau' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&tau;</mi></math></span> exactly; as you add votes, it approaches <span title='R' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>R</mi></math></span>, and <span title='\lambda(v)' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&lambda;</mi><mo stretchy='false'>(</mo><mi>v</mi><mo stretchy='false'>)</mo></math></span>
controls how fast that happens.</p>
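<p>Here&#8217;s a minimal sketch of that interpolation in Python. The function names and the sample numbers are made up for illustration. The weight used here, v / (v + m), is just one choice that satisfies the constraints; plugging it in with the target set to C gives back the IMDB-style formula (Cm + Rv) / (m + v).</p>

```python
def smoothing_weight(v, m):
    """One admissible lambda(v): 0 at v = 0, increasing in v, always below 1.

    Assumes m is positive; m is a tuning knob, not something the data gives you.
    """
    return v / (v + m)


def smoothed_estimate(mean_vote, v, target, m):
    """Linear interpolation between the item's mean vote R and the target tau."""
    lam = smoothing_weight(v, m)
    return lam * mean_vote + (1 - lam) * target


if __name__ == "__main__":
    # Hypothetical numbers: an item averaging 9.0, a target of 6.5, and m = 50.
    for votes in (0, 10, 50, 1000):
        print(votes, round(smoothed_estimate(9.0, votes, 6.5, 50), 3))
```

<p>With no votes the estimate sits on the target, and it drifts toward the item&#8217;s own mean as votes come in; a larger m makes that drift slower.</p>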
<p>This formulation naturally leads to the following question: if I&#8217;m smoothing
like this to deal with paucity-of-data issues, what value of <span title='\tau' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&tau;</mi></math></span> should I
pick?  <span class="caps">IMDB</span> uses <span title='\tau=C' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&tau;</mi><mo>=</mo><mi>C</mi></math></span>, the global movie mean. Intuitively that makes
sense, but is it the right choice?</p>
<p>What&#8217;s nice about the expression for <span title='\that' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow></math></span> above is that the behavior we&#8217;re
most interested in is when <span title='v=0' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>v</mi><mo>=</mo><mn>0</mn></math></span>, i.e. when there are no votes. In that case,
<span title='\that=\tau' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow><mo>=</mo><mi>&tau;</mi></math></span>, because of how I&#8217;ve constrained <span title='\lambda(v)' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&lambda;</mi><mo stretchy='false'>(</mo><mi>v</mi><mo stretchy='false'>)</mo></math></span>. So finding the
best <span title='\tau' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&tau;</mi></math></span> is equivalent to finding the best <span title='\that' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow></math></span> when <span title='v=0' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>v</mi><mo>=</mo><mn>0</mn></math></span>.</p>
<p><span title='\define{\risk}{R(\theta, \that)}
\define{\loss}{L(\theta, \that)}
\define{\exp}[1]{E_\theta\left[#1\right]}' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"></math></span></p>
<p>Happily, we can answer the question of the best <span title='\that' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow></math></span> analytically, at
least if we&#8217;re happy to imagine that there is a &#8220;true&#8221; value of the movie
<span title='\theta' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&theta;</mi></math></span>.</p>
<p>Given <span title='\theta' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&theta;</mi></math></span>, we can define a loss function <span title='\loss' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mi>L</mi><mo stretchy='false'>(</mo><mi>&theta;</mi><mo>,</mo><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow><mo stretchy='false'>)</mo></mrow></math></span> that describes how
bad we think a particular value of <span title='\that' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow></math></span> is. But we don&#8217;t really know what
<span title='\theta' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&theta;</mi></math></span> is for any movie (if we did, we wouldn&#8217;t be bothering with any of
this). So we can generalize that a step further and define a risk function
<span title='\risk=\exp{\loss}' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mi>R</mi><mo stretchy='false'>(</mo><mi>&theta;</mi><mo>,</mo><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow><mo stretchy='false'>)</mo></mrow><mo>=</mo><mrow><msub><mi>E</mi><mi>&theta;</mi></msub><mrow><mo>[</mo><mrow><mi>L</mi><mo stretchy='false'>(</mo><mi>&theta;</mi><mo>,</mo><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow><mo stretchy='false'>)</mo></mrow><mo>]</mo></mrow></mrow></math></span> quantifying our <em>expected loss</em>: the aggregate of the
loss function across all possible values of <span title='\theta' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&theta;</mi></math></span>, weighted by the
probability of each value. This gives us the
tool we really need to answer the question above: the <span title='\that' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow></math></span> that minimizes
our risk is the winner.</p>
<p>In the absence of any specific notions about errors, we&#8217;ll use the standard
loss function for reals, squared-error loss: <span title='\loss = (\theta-\that)^2' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mi>L</mi><mo stretchy='false'>(</mo><mi>&theta;</mi><mo>,</mo><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow><mo stretchy='false'>)</mo></mrow><mo>=</mo><mo stretchy='false'>(</mo><mi>&theta;</mi><mo>&minus;</mo><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow><msup><mo stretchy='false'>)</mo><mn>2</mn></msup></math></span>.
Then it&#8217;s just a matter of churning the crank:
<div class='blockmath' title='\array{\arrayopts{\colalign{right center left}}
\risk &amp; = &amp; \exp{\loss} \\
                 &amp; = &amp; \exp{(\theta-\that)^2} \\
                 &amp; = &amp; \exp{\theta^2-2\theta\that + \that^2} \\
                 &amp; = &amp; \exp{\theta^2} - 2\that \exp{\theta} + \that^2
}'><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><mrow><mtable columnalign='right center left'><mtr><mtd><mrow><mi>R</mi><mo stretchy='false'>(</mo><mi>&theta;</mi><mo>,</mo><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow><mo stretchy='false'>)</mo></mrow></mtd><mtd><mo>=</mo></mtd><mtd><mrow><msub><mi>E</mi><mi>&theta;</mi></msub><mrow><mo>[</mo><mrow><mi>L</mi><mo stretchy='false'>(</mo><mi>&theta;</mi><mo>,</mo><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow><mo stretchy='false'>)</mo></mrow><mo>]</mo></mrow></mrow></mtd></mtr><mtr><mtd></mtd><mtd><mo>=</mo></mtd><mtd><mrow><msub><mi>E</mi><mi>&theta;</mi></msub><mrow><mo>[</mo><mo stretchy='false'>(</mo><mi>&theta;</mi><mo>&minus;</mo><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow><msup><mo stretchy='false'>)</mo><mn>2</mn></msup><mo>]</mo></mrow></mrow></mtd></mtr><mtr><mtd></mtd><mtd><mo>=</mo></mtd><mtd><mrow><msub><mi>E</mi><mi>&theta;</mi></msub><mrow><mo>[</mo><msup><mi>&theta;</mi><mn>2</mn></msup><mo lspace="verythinmathspace" rspace="0em">&minus;</mo><mn>2</mn><mi>&theta;</mi><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow><mo>+</mo><msup><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow><mn>2</mn></msup><mo>]</mo></mrow></mrow></mtd></mtr><mtr><mtd></mtd><mtd><mo>=</mo></mtd><mtd><mrow><msub><mi>E</mi><mi>&theta;</mi></msub><mrow><mo>[</mo><msup><mi>&theta;</mi><mn>2</mn></msup><mo>]</mo></mrow></mrow><mo>&minus;</mo><mn>2</mn><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow><mrow><msub><mi>E</mi><mi>&theta;</mi></msub><mrow><mo>[</mo><mi>&theta;</mi><mo>]</mo></mrow></mrow><mo>+</mo><msup><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow><mn>2</mn></msup></mtd></mtr></mtable></mrow></math></div></p>
<p>We can drop that first term since we&#8217;re only interested in minimizing this as
a function of <span title='\tau' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&tau;</mi></math></span>. To find the minimum:</p>
<p><div class='blockmath' title='\array{\arrayopts{\colalign{right center left}}
\frac{d}{d\that} {-2\that} \exp{\theta} + \that^2 &amp; = &amp; 0 \\
-2 \exp{\theta} + 2\that &amp; = &amp; 0 \\
\that &amp; = &amp; \exp{\theta}
}'><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><mrow><mtable columnalign='right center left'><mtr><mtd><mfrac><mrow><mi>d</mi></mrow><mrow><mi>d</mi><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow></mrow></mfrac><mrow><mo lspace="verythinmathspace" rspace="0em">&minus;</mo><mn>2</mn><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow></mrow><mrow><msub><mi>E</mi><mi>&theta;</mi></msub><mrow><mo>[</mo><mi>&theta;</mi><mo>]</mo></mrow></mrow><mo>+</mo><msup><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow><mn>2</mn></msup></mtd><mtd><mo>=</mo></mtd><mtd><mn>0</mn></mtd></mtr><mtr><mtd><mo lspace="verythinmathspace" rspace="0em">&minus;</mo><mn>2</mn><mrow><msub><mi>E</mi><mi>&theta;</mi></msub><mrow><mo>[</mo><mi>&theta;</mi><mo>]</mo></mrow></mrow><mo>+</mo><mn>2</mn><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow></mtd><mtd><mo>=</mo></mtd><mtd><mn>0</mn></mtd></mtr><mtr><mtd><mrow><mover><mrow><mi>&theta;</mi></mrow><mo>&Hat;</mo></mover></mrow></mtd><mtd><mo>=</mo></mtd><mtd><mrow><msub><mi>E</mi><mi>&theta;</mi></msub><mrow><mo>[</mo><mi>&theta;</mi><mo>]</mo></mrow></mrow></mtd></mtr></mtable></mrow></math></div></p>
<p>Unsurprisingly, we see that the best estimate of <span title='\theta' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&theta;</mi></math></span> under squared-error
loss is the mean of the distribution of <span title='\theta' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&theta;</mi></math></span>.
Since we&#8217;re interested in the case where <span title='v=0' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>v</mi><mo>=</mo><mn>0</mn></math></span>, this implies that the best
value to use for <span title='\tau' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&tau;</mi></math></span> is also the mean.</p>
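<p>If you&#8217;d rather check that numerically than trust the calculus, here is a small brute-force sketch. Everything in it is an assumption made for illustration: the Beta-shaped distribution of &#8220;true&#8221; scores, the 1 to 10 range, and the grid resolution. It approximates the risk by averaging the squared error over a sample and then searches a grid of candidate estimates.</p>

```python
import random


def risk(estimate, thetas):
    """Empirical squared-error risk: the average of (theta - estimate)^2."""
    return sum((t - estimate) ** 2 for t in thetas) / len(thetas)


def best_on_grid(thetas, lo=1.0, hi=10.0, steps=900):
    """Brute-force the estimate with the lowest risk on an evenly spaced grid."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda est: risk(est, thetas))


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical "true" scores, drawn from an arbitrary skewed distribution.
    thetas = [1 + 9 * rng.betavariate(5, 2) for _ in range(500)]
    mean = sum(thetas) / len(thetas)
    print("grid minimizer:", round(best_on_grid(thetas), 3))
    print("sample mean:   ", round(mean, 3))
```

<p>The two printed values should agree to within the grid spacing (0.01 here), whatever distribution you swap in.</p>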
<p>So IMDB&#8217;s choice of <span title='C' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>C</mi></math></span> makes sense: the mean vote over all your movies is a
great estimate of the mean of the distribution of <span title='\theta' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&theta;</mi></math></span>.</p>
<p>A few concluding points:</p>
<ol>
	<li>This answer is specific to squared-error loss; if you plug in another loss
function, the optimal value for <span title='\tau' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&tau;</mi></math></span> might very well change. And you might
actually have a specific model in mind for how &#8220;bad&#8221; mis-estimates are. Maybe
over-estimates are worse than under-estimates, or something like that.</li>
	<li>The definition of the distribution of <span title='\theta' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&theta;</mi></math></span> is actually completely vague
above. In fact we don&#8217;t even talk about it; we just use it implicitly in our
<span title='\exp{\cdot}' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><msub><mi>E</mi><mi>&theta;</mi></msub><mrow><mo>[</mo><mo>&sdot;</mo><mo>]</mo></mrow></mrow></math></span> terms. So you should feel free to plug in (the mean of)
whatever distribution you believe most accurately represents your
product/movie/whatever. <span class="caps">IMDB</span> could arguably do better by plugging in
per-category means, or something even fancier.</li>
	<li><span class="caps">IMDB</span> is actually a particularly bad case because movie opinions are extremely
subjective.  If you&#8217;re serious about modeling very subjective things, you should
be talking about multinomial models, Dirichlet priors, and the like.</li>
</ol>
<p>But the take-home message is: in the absence of a specific loss function that
you really believe, smoothing towards the mean isn&#8217;t just intuitive, it&#8217;s
minimizing your risk.</p>
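<p>To make the first concluding point concrete, here is a sketch comparing squared-error loss with absolute-error loss on a small, made-up set of &#8220;true&#8221; scores. The data and the candidate grid are hypothetical. Absolute-error loss is a swapped-in example, not something derived above; it is the standard case where the risk-minimizing estimate is the median rather than the mean.</p>

```python
import statistics


def average_loss(estimate, thetas, loss):
    """Empirical risk of a single estimate under the given loss function."""
    return sum(loss(t, estimate) for t in thetas) / len(thetas)


def squared(theta, estimate):
    return (theta - estimate) ** 2


def absolute(theta, estimate):
    return abs(theta - estimate)


def best_candidate(thetas, loss, candidates):
    """Pick the candidate estimate with the lowest empirical risk."""
    return min(candidates, key=lambda est: average_loss(est, thetas, loss))


if __name__ == "__main__":
    # Hypothetical skewed scores: mostly middling, one runaway favorite.
    thetas = [5.0, 5.5, 6.0, 6.5, 10.0]
    candidates = [i / 10 for i in range(10, 101)]  # 1.0, 1.1, ..., 10.0
    print("squared-error optimum: ", best_candidate(thetas, squared, candidates))
    print("absolute-error optimum:", best_candidate(thetas, absolute, candidates))
    print("mean, median:", statistics.mean(thetas), statistics.median(thetas))
```

<p>The outlier drags the squared-error optimum upward along with the mean, while the absolute-error optimum stays put at the median. So the smoothing target you want really does depend on the loss you believe in.</p>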
  <div class="comment-link">
    
    <a  href="/smoothing#comments">No comments</a>.
  </div>

  <h2><a  href="/bayesian-average">Understanding the &#8220;Bayesian Average&#8221;</a></h2>
  <div class="byline">
    <a  href="/by/William+Morgan/">William Morgan</a>,
    <span title="12 months ago">March 12, 2009 12:07pm</span>
  </div>
  
    <div class="labels"><span class='label'><a  href="/label/stats/">stats</a></span> </div>
  
  <p class='first'><span class="caps">IMDB</span> rates movies using a score they call the <a href="http://www.imdb.com/chart/top">true
Bayesian estimate</a> (bottom of the page).  I&#8217;m
pretty sure that&#8217;s a made-up term. A couple other sites, like BoardGameGeek,
use the same thing and call it a &#8220;Bayesian average&#8221;. I think that&#8217;s a made-up
term, too, even though there&#8217;s a <a href="http://en.wikipedia.org/wiki/Bayesian_average">Wikipedia article on
it</a>.</p>
<p>Nonetheless, the formula is simple, and it has a nice interpretation. Here it
is:</p>
<p><div class='blockmath' title='\frac{Cm + Rv}{m+v}'><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><mfrac><mrow><mi>C</mi><mi>m</mi><mo>+</mo><mi>R</mi><mi>v</mi></mrow><mrow><mi>m</mi><mo>+</mo><mi>v</mi></mrow></mfrac></math></div></p>
<p>where <span title='C' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>C</mi></math></span> is the mean vote across all movies, <span title='v' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>v</mi></math></span> is the number of votes,
<span title='R' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>R</mi></math></span> is the mean rating for the movie, and <span title='m' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>m</mi></math></span> is the &#8220;minimum number of
votes required to be listed in the top 250 (currently 1300)&#8221;.</p>
<p>The nice interpretation is this: pretend that, in addition to the <span title='v' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>v</mi></math></span> votes
that users give a movie, you&#8217;re also throwing in <span title='m' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>m</mi></math></span> votes of score <span title='C' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>C</mi></math></span>
each. In effect you&#8217;re pushing the scores towards the global average, by <span title='m' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>m</mi></math></span>
votes.</p>
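<p>The pseudo-vote reading is easy to check in a few lines of Python. This is only a sketch: the function names and sample votes are invented, and the padded version needs a whole number for m, which the formula itself doesn&#8217;t require.</p>

```python
def bayesian_average(votes, global_mean, m):
    """(C*m + R*v) / (m + v), written with the vote total R*v = sum(votes)."""
    return (global_mean * m + sum(votes)) / (m + len(votes))


def padded_mean(votes, global_mean, m):
    """Plain mean after adding m extra votes that each score global_mean."""
    padded = list(votes) + [global_mean] * m
    return sum(padded) / len(padded)


if __name__ == "__main__":
    # Hypothetical data: three enthusiastic votes, global mean 6.0, m = 5.
    votes = [10, 9, 10]
    print(bayesian_average(votes, 6.0, 5))
    print(padded_mean(votes, 6.0, 5))
```

<p>Both functions compute the same number, which is the whole point: the formula is an ordinary average over the real votes plus m phantom votes at the global mean.</p>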
<p>Is this arbitrary? Actually, no. It&#8217;s the mean (which, for a Gaussian, is also the mode, i.e. the <span class="caps">MAP</span> estimate) of the posterior
distribution you get when you have a Normal prior with mean <span title='C' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>C</mi></math></span> and precision
<span title='m' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>m</mi></math></span>, and a Normal conditional with variance 1.0.</p>
<p>In other words, you&#8217;re starting with a belief that, in the absence of votes, a
movie/boardgame should be ranked as average, and you&#8217;re assuming that user
votes are normally-distributed around the &#8220;true&#8221; score with variance 1.0.  Then
you&#8217;re looking at the posterior distribution (i.e. the probability distribution
that arises as a result of those assumptions), and you&#8217;re picking the most
likely value from that, which in the case of Gaussians is the mean.</p>
<p>Let&#8217;s see how that works.</p>
<p>To find the posterior distribution, we could work through the math, or we could
just look at the <a href="http://en.wikipedia.org/wiki/Conjugate_prior">Wikipedia article on conjugate
priors</a>. We&#8217;ll see that the
posterior distribution of a Normal, when the prior is also a Normal, is a
Normal with mean</p>
<p><div class='blockmath' title='\frac{\tau_0 \mu_0 + \tau \sum_{i=1}^{n} x_i}{\tau_0 + n\tau}'><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><mfrac><mrow><msub><mi>&tau;</mi><mn>0</mn></msub><msub><mi>&mu;</mi><mn>0</mn></msub><mo>+</mo><mi>&tau;</mi><msubsup><mo lspace="thinmathspace" rspace="thinmathspace">&Sum;</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>n</mi></mrow></msubsup><msub><mi>x</mi><mi>i</mi></msub></mrow><mrow><msub><mi>&tau;</mi><mn>0</mn></msub><mo>+</mo><mi>n</mi><mi>&tau;</mi></mrow></mfrac></math></div></p>
<p>where <span title='\mu_0' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><msub><mi>&mu;</mi><mn>0</mn></msub></math></span> and <span title='\tau_0' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><msub><mi>&tau;</mi><mn>0</mn></msub></math></span> are the mean and precision of the prior,
respectively, <span title='\tau' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&tau;</mi></math></span> is the precision of the vote distribution, and <span title='n' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>n</mi></math></span> is
the number of votes. In the case of <span class="caps">IMDB</span>, we assumed above that <span title='\tau=1' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>&tau;</mi><mo>=</mo><mn>1</mn></math></span>, so
we have</p>
<p><div class='blockmath' title='\frac{\tau_0 \mu_0 + \sum_{i=1}^{n} x_i}{\tau_0 + n}'><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><mfrac><mrow><msub><mi>&tau;</mi><mn>0</mn></msub><msub><mi>&mu;</mi><mn>0</mn></msub><mo>+</mo><msubsup><mo lspace="thinmathspace" rspace="thinmathspace">&Sum;</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>n</mi></mrow></msubsup><msub><mi>x</mi><mi>i</mi></msub></mrow><mrow><msub><mi>&tau;</mi><mn>0</mn></msub><mo>+</mo><mi>n</mi></mrow></mfrac></math></div></p>
<p>Comparing the <span class="caps">IMDB</span> equation to this, we can see that <span title='v' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>v</mi></math></span> above is <span title='n' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>n</mi></math></span> here,
<span title='C' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>C</mi></math></span> above is <span title='\mu_0' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><msub><mi>&mu;</mi><mn>0</mn></msub></math></span> here, <span title='Rv=\frac{1}{v}\left(\sum_{i=1}^v v_i\right)\ v = \sum_{i=1}^v
v_i' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>R</mi><mi>v</mi><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mrow><mi>v</mi></mrow></mfrac><mrow><mo>(</mo><msubsup><mo lspace="thinmathspace" rspace="thinmathspace">&Sum;</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>v</mi></msubsup><msub><mi>v</mi><mi>i</mi></msub><mo>)</mo></mrow><mspace width="mediummathspace"/><mi>v</mi><mo>=</mo><msubsup><mo lspace="thinmathspace" rspace="thinmathspace">&Sum;</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>v</mi></msubsup><msub><mi>v</mi><mi>i</mi></msub></math></span> above is <span title='\sum_{i=1}^{n} x_i' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><msubsup><mo lspace="thinmathspace" rspace="thinmathspace">&Sum;</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>n</mi></mrow></msubsup><msub><mi>x</mi><mi>i</mi></msub></math></span> here, and <span title='m' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>m</mi></math></span> above is the hyperparameter
<span title='\tau_0' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><msub><mi>&tau;</mi><mn>0</mn></msub></math></span>. So we know that even though <span class="caps">IMDB</span> says <span title='m' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>m</mi></math></span> is the &#8220;minimum number
of votes required to be listed in the top 250 list&#8221;, that&#8217;s an arbitrary
decision on their part: it can be anything and the formula still works. <span title='m' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>m</mi></math></span>
is the precision of the prior distribution; as it gets bigger, the prior
distribution gets &#8220;sharper&#8221;, and thus has more of an effect on the posterior
distribution.</p>
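<p>Here is a sketch of that correspondence in Python, assuming the conjugate-prior formula quoted above. The function names, the sample votes, and the global mean of 6.9 are made up for illustration; m = 1300 is the figure IMDB quotes.</p>

```python
def posterior_mean(observations, prior_mean, prior_precision, obs_precision):
    """Posterior mean for a Normal likelihood with known precision and a
    Normal prior on its mean (the conjugate-prior formula quoted above)."""
    n = len(observations)
    numerator = prior_precision * prior_mean + obs_precision * sum(observations)
    return numerator / (prior_precision + n * obs_precision)


def imdb_score(votes, global_mean, m):
    """IMDB's formula (C*m + R*v) / (m + v), with R*v = sum(votes)."""
    return (global_mean * m + sum(votes)) / (m + len(votes))


if __name__ == "__main__":
    # Hypothetical votes; C = 6.9 is an illustrative global mean.
    votes = [8, 9, 7, 10, 8]
    print(posterior_mean(votes, prior_mean=6.9, prior_precision=1300, obs_precision=1.0))
    print(imdb_score(votes, global_mean=6.9, m=1300))
    # A sharper prior (larger m) keeps the score closer to the global mean.
    for m in (1, 10, 100, 1300):
        print(m, round(imdb_score(votes, 6.9, m), 3))
```

<p>With the vote precision fixed at 1, the two functions are the same expression under different names, and the loop shows the prior tightening its grip as m grows.</p>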
<p>Now the assumptions we made to get to this point are almost laughable. If
nothing else, we know that Gaussians are unbounded and continuous, and user
votes on <span class="caps">IMDB</span> are integers in the range 1-10. The interesting take-away
message here is that even though we made a lot of assumptions above that were
laughably wrong, the end result is a reasonable formula with a nice, intuitive
meaning.</p>
  <div class="comment-link">
    
    <a  href="/bayesian-average#comments">13 comments by <b>John Henderson</b>, <b>William Morgan</b>, and two others</a>.
  </div>

  <h2><a  href="/old43">The St. Petersburg Paradox</a></h2>
  <div class="byline">
    <a  href="/by/William+Morgan/">William Morgan</a>,
    <span title="16 months ago">October 21, 2008  3:39pm</span>
  </div>
  
    <div class="labels"><span class='label'><a  href="/label/stats/">stats</a></span> </div>
  
  <p class='first'><a href="http://all-thing.net/2008/09/simpsons-paradox.html">On the topic of numeric
paradoxes</a>, here&#8217;s another
one that drove a lot of work in economic and decision theory: the <a href="http://en.wikipedia.org/wiki/St._Petersburg_paradox">St.
Petersburg paradox</a>.</p>
<p>Here&#8217;s the deal. You&#8217;re offered a chance to play a game wherein you repeatedly
flip a coin until it comes up heads, at which point the game is over. If the
coin comes up heads the first time, you win a dollar. If it takes two flips to
come up heads, you win two dollars. The third time, four dollars. The fourth
time, eight dollars. And so on; the rule is, if you see heads on the <span title='i' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>i</mi></math></span>th flip,
you win <span title='2^{i-1}' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><msup><mn>2</mn><mrow><mi>i</mi><mo>&minus;</mo><mn>1</mn></mrow></msup></math></span> dollars.</p>
<p>How much would you pay to play this game?</p>
<p>The paradox is: the expected value of this game is infinity, so according to
all your pretty formulas, you should immediately pay all your life savings for
a single chance at this game. (Each possible outcome has an expected value of
50 cents, and there are an infinite number of them, and expectation distributes
over summation, so the expected value is an infinite sum of 50 cents, which
works out to be a little thing I like to call infinity dollars.)</p>
<p>Of course that&#8217;s a paradox because it&#8217;s crazy talk to bet more than a few bucks
on such a game. The paradox highlights at least two problems with blithely
using the expected value (EV) as the reward you&#8217;ll get if you play the game:</p>
<ol>
	<li>It assumes that the host of the game actually has infinite funds. The
Wikipedia article has a very striking breakdown of <a href="http://en.wikipedia.org/wiki/St._Petersburg_paradox#Finite_St._Petersburg_lotteries">what happens to the St.
Petersburg paradox when you have finite
funds</a>.
It turns out that even if your backer has access to the entire <span class="caps">GDP</span> of the
world in 2007, the expected value is only $23.77, which is quite a bit short
of infinity dollars.</li>
	<li>It assumes you play the game an infinite number of times. That&#8217;s the only way
you&#8217;ll get the expected value in your pocket. And the St. Petersburg paradox
is a great example of just how quickly your actual take-home degenerates when
subject to real-world constraints like finite repetitions. It turns out that
if you want to average $10 a game, you&#8217;ll have to play about one million times; if
you&#8217;re satisfied with $5 a game, you&#8217;ll still have to play about a thousand times.</li>
</ol>
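<p>The finite-funds number in the first point is easy to check. A rough sketch, assuming the host simply pays out whatever is left in the bank once the promised prize exceeds it, and taking 54.3 trillion dollars as the 2007 world <span class="caps">GDP</span> figure (that figure is my assumption about what the Wikipedia table uses):</p>

```python
def finite_bankroll_ev(bankroll, max_flips=200):
    """Expected value of the game when the host can pay at most `bankroll`.

    Heads on flip i has probability 2**-i and a promised prize of 2**(i-1),
    capped at the bankroll. Terms beyond max_flips are negligibly small.
    """
    ev = 0.0
    for i in range(1, max_flips + 1):
        prize = min(2 ** (i - 1), bankroll)
        ev += 0.5 ** i * prize
    return ev


if __name__ == "__main__":
    print(round(finite_bankroll_ev(54.3e12), 2))
```

<p>Every uncapped outcome contributes 50 cents, and there are only 46 of them before the prize exceeds the bank; the capped tail adds less than a dollar.</p>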
<p>The classical answer to the paradox has been to talk about utility, marginal
utility and things like that; i.e., people with lots of money value more money
less than people without very much money. And recent answers to the paradox,
e.g. <a href="http://en.wikipedia.org/wiki/Cumulative_prospect_theory">cumulative prospect
theory</a>, are along the
lines of modeling how humans perceive risk, which (unsurprisingly) is not
really in line with the actual probabilities.</p>
<p>But it seems to me that these solutions all involve modeling human behavior and
explaining why a human wouldn&#8217;t pay a lot of money to play the game, either
because money means less as it gets bigger or because they mis-value risks. But
the actual paradox is <em>not</em> about human behavior or psychology. It&#8217;s the fact
that the expected value of a game is not a good estimate of the real-world
value of a game, because expected value can make assumptions about infinite
funds and infinite plays, and we don&#8217;t have those.</p>
<p>So <em>my</em> solution to the St. Petersburg paradox is this: drop all events that
have a probability less than some small epsilon, or a value more than some
large, um, inverse epsilon. That neatly solves both of the infinity
assumptions. (In this particular case one bound would do, because the
probabilities drop exponentially as the values rise exponentially, but not in
general.) I&#8217;ll call this the <span class="caps">REV</span>: the realistically expected value.</p>
<p>In this case, if you set the lower probability bound to be .01, and the upper
value bound to be one million, then the <span class="caps">REV</span> of the St. Petersburg paradox is
just about three bucks. (The upper value bound doesn&#8217;t even come into play.)
And that&#8217;s about what I&#8217;d pay to play it.</p>
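<p>Here&#8217;s a small sketch of that calculation. The function name and arguments are just for illustration; the early exit relies on the fact that, in this particular game, probabilities only shrink and prizes only grow as the flip count goes up.</p>

```python
def realistically_expected_value(min_prob=0.01, max_value=1_000_000):
    """REV of the St. Petersburg game: drop outcomes that are too unlikely
    (probability below min_prob) or too large (prize above max_value)."""
    total = 0.0
    flip = 1
    while True:
        prob = 0.5 ** flip        # chance the first heads is on this flip
        prize = 2 ** (flip - 1)   # dollars won in that case
        if prob < min_prob or prize > max_value:
            break                 # every later outcome is dropped too
        total += prob * prize
        flip += 1
    return total


if __name__ == "__main__":
    print(realistically_expected_value())
```

<p>Only the first six outcomes clear the .01 probability bound, and each is worth 50 cents, which is where the three bucks comes from.</p>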
<p>So there you go. Fixed economics for ya.</p>
  <div class="comment-link">
    
    <a  href="/old43#comments">No comments</a>.
  </div>

  <h2><a  href="/old19">A philosophical question</a></h2>
  <div class="byline">
    <a  href="/by/William+Morgan/">William Morgan</a>,
    <span title="17 months ago">October  6, 2008 11:25pm</span>
  </div>
  
    <div class="labels"><span class='label'><a  href="/label/stats/">stats</a></span> </div>
  
  <p class='first'>Is there really a difference between saying, &#8220;I don&#8217;t know anything, a priori, about the parameters of this distribution&#8221;, and using a uniform prior?</p>
<p>What about, &#8220;I don&#8217;t know anything about that value&#8221; versus &#8220;As far as I&#8217;m concerned, every possibility for that value is equally likely&#8221;?</p>
  <div class="comment-link">
    
    <a  href="/old19#comments">Two comments by <b>William</b> and <b>Brendan</b></a>.
  </div>

  <h2><a  href="/bayes-vs-mle">Bayes vs <span class="caps">MLE</span>: an estimation theory fairy tale</a></h2>
  <div class="byline">
    <a  href="/by/William+Morgan/">William Morgan</a>,
    <span title="17 months ago">October  6, 2008 10:33pm</span>
  </div>
  
    <div class="labels"><span class='label'><a  href="/label/stats/">stats</a></span> <span class='label'><a  href="/label/whisper/">whisper</a></span> </div>
  
  <p class='first'>I found a neat little example in one of my introductory stats books about
Bayesian versus maximum-likelihood estimation for the simple problem of
estimating a binomial distribution given only one sample.</p>
<p>I was going to try and show the math but since Blogger is not making it
possible to actually render MathML I&#8217;ll just hand-wave instead.
<em>[Fixed in <a href="http://masanjin.net/whisper/">Whisper</a>. &#8212;ed.]</em></p>
<p>So let&#8217;s say we&#8217;re trying to estimate a binomial distribution parameterized by
<span title='p' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>p</mi></math></span>, and that we&#8217;ve only seen one sample. For example, someone flips a coin
once, and we have to decide what the coin&#8217;s probability of heads is.</p>
<p>The maximum likelihood estimate for <span title='p' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>p</mi></math></span> is easy: if your single sample is a 1,
then <span title='p=1' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>p</mi><mo>=</mo><mn>1</mn></math></span>, and if your sample is 0, <span title='p=0' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>p</mi><mo>=</mo><mn>0</mn></math></span>. (And if you go through the laborious
process of writing the log likelihood, setting the derivative equal to 0, and
solving it, you come up with the general rule of (# of 1&#8217;s) / (# of 1&#8217;s + # of
0&#8217;s), which is kinda what you would expect.)</p>
<p>In the coin case it seems crazy to say, I saw one head, so I&#8217;m going to assume
that the coin <em>always</em> turns up heads, but that&#8217;s because of our prior
knowledge of how coins behave. If we&#8217;re given a black box with a button and two
lights, and you press the button, and one of the lights comes on, then maybe
estimating that that light always comes on when you press the button makes a
little more sense.</p>
<p>Finding the Bayesian estimate is slightly more complicated. Let&#8217;s use a uniform
prior. Our conditional distribution is <span title='f(1|p)=p' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>f</mi><mo stretchy='false'>(</mo><mn>1</mn><mo stretchy='false'>|</mo><mi>p</mi><mo stretchy='false'>)</mo><mo>=</mo><mi>p</mi></math></span> and <span title='f(0|p)=1-p' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>f</mi><mo stretchy='false'>(</mo><mn>0</mn><mo stretchy='false'>|</mo><mi>p</mi><mo stretchy='false'>)</mo><mo>=</mo><mn>1</mn><mo>&minus;</mo><mi>p</mi></math></span>, and if you work
it out, the posterior ends up as <span title='h(p|1)=2p' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>h</mi><mo stretchy='false'>(</mo><mi>p</mi><mo stretchy='false'>|</mo><mn>1</mn><mo stretchy='false'>)</mo><mo>=</mo><mn>2</mn><mi>p</mi></math></span> and <span title='h(p|0)=2(1-p)' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>h</mi><mo stretchy='false'>(</mo><mi>p</mi><mo stretchy='false'>|</mo><mn>0</mn><mo stretchy='false'>)</mo><mo>=</mo><mn>2</mn><mo stretchy='false'>(</mo><mn>1</mn><mo>&minus;</mo><mi>p</mi><mo stretchy='false'>)</mo></math></span>.</p>
<p>Now if we were in the world of classification, we&#8217;d take the <span class="caps">MAP</span> estimate, which
is a fancy way of saying the value with the biggest probability, or the mode of
the distribution. Since we&#8217;re using a uniform prior, that would end up as the
same as the <span class="caps">MLE</span>. But we&#8217;re not. We&#8217;re in the world of real numbers, so we can
take something better: the expected value, or the mean of the distribution.
This is known as the Bayes estimate, and there are some decision-theoretic
reasons for using it, but informally, it makes more sense than using the <span class="caps">MAP</span>
estimate: you can take into account the entire shape of the distribution, not
just the mode.</p>
<p>Using the Bayes estimate, we arrive at <span title='p=2/3' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>p</mi><mo>=</mo><mn>2</mn><mo>/</mo><mn>3</mn></math></span> if the sample was a 1, and
<span title='p=1/3' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>p</mi><mo>=</mo><mn>1</mn><mo>/</mo><mn>3</mn></math></span> if the sample was a zero. So we&#8217;re at a place where Bayesian logic and
frequentist logic arrive at very different answers, <em>even with a uniform
prior</em>.</p>
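<p>If you&#8217;d rather not do the integrals, here&#8217;s a minimal sketch of both estimates. It leans on a standard result that isn&#8217;t derived in this post: with a uniform prior, the posterior mean after seeing k ones in n samples is (k + 1) / (n + 2). The function names are made up.</p>

```python
from fractions import Fraction


def mle_estimate(ones, samples):
    """Maximum likelihood estimate: (# of 1's) / (# of samples)."""
    return Fraction(ones, samples)


def bayes_estimate_uniform_prior(ones, samples):
    """Posterior mean of p under a uniform prior (rule of succession)."""
    return Fraction(ones + 1, samples + 2)


if __name__ == "__main__":
    for sample in (1, 0):
        print(sample, mle_estimate(sample, 1), bayes_estimate_uniform_prior(sample, 1))
```

<p>For a single sample this gives 1 versus 2/3 when the sample is a 1, and 0 versus 1/3 when it is a 0.</p>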
<p>Up till now we&#8217;ve been talking about &#8220;estimation theory&#8221;, i.e. the art of
estimating shit. But estimation theory is basically decision theory in
disguise, where your decision space is the same as your parameter space: you&#8217;re
deciding on a value for <span title='p' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>p</mi></math></span>, given your input data, and your prior knowledge, if
any.</p>
<p>Now what&#8217;s cool about moving to the world of decision theory is that we can
say: if I have to decide on a particular value for <span title='p' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>p</mi></math></span>, how can I minimize my
expected cost, aka my risk? A natural choice for a cost, or loss, function, is
squared error. If the true value is <span title='q' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>q</mi></math></span>, I&#8217;d like to estimate <span title='p' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>p</mi></math></span> in such a way
that <span title='E[(q-p)^2]' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>E</mi><mo stretchy='false'>[</mo><mo stretchy='false'>(</mo><mi>q</mi><mo>&minus;</mo><mi>p</mi><msup><mo stretchy='false'>)</mo><mn>2</mn></msup><mo stretchy='false'>]</mo></math></span> is minimized. So we don&#8217;t have to argue philosophically about
<span class="caps">MLE</span> versus <span class="caps">MAP</span> versus minimax versus Bayes estimates; we can quantify how well
each of them does under this framework.</p>
<p>And it turns out that, if you plot the risk for the <span class="caps">MLE</span> estimate and for the
Bayes estimate under different values of the true value <span title='q' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>q</mi></math></span>, then <span class="caps">MOST</span> of the
time, the Bayes estimate has lower risk than the <span class="caps">MLE</span>. It&#8217;s only when <span title='q' style='white-space: nowrap'><math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mi>q</mi></math></span> is
close to 0 or to 1 that <span class="caps">MLE</span> has lower risk.</p>
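<p>You don&#8217;t need a plot to check this. A quick sketch of the two risk curves for the single-sample case, using the estimates from above (1 or 0 for the <span class="caps">MLE</span>, 2/3 or 1/3 for the Bayes estimate); the function names are made up:</p>

```python
def risk(q, estimate_if_one, estimate_if_zero):
    """Expected squared error when the true value is q and we see one sample.

    The sample is a 1 with probability q and a 0 with probability 1 - q.
    """
    return q * (q - estimate_if_one) ** 2 + (1 - q) * (q - estimate_if_zero) ** 2


def mle_risk(q):
    return risk(q, 1.0, 0.0)


def bayes_risk(q):
    return risk(q, 2.0 / 3.0, 1.0 / 3.0)


if __name__ == "__main__":
    for q in (0.0, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 1.0):
        winner = "Bayes" if bayes_risk(q) < mle_risk(q) else "MLE"
        print(f"q={q:.2f}  MLE={mle_risk(q):.4f}  Bayes={bayes_risk(q):.4f}  {winner}")
```

<p>Setting the two risks equal gives 12q&#178; &#8722; 12q + 1 = 0, so by my arithmetic the Bayes estimate wins for q between roughly 0.09 and 0.91.</p>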
<p>So that&#8217;s pretty cool. It seems like the Bayes estimate must be a superior
estimate.</p>
<p>Of course, I set this whole thing up. Those &#8220;decision-theoretic reasons&#8221; for
choosing the Bayes estimate I mentioned? Well, they&#8217;re theorems that show that
the Bayes estimate minimizes risk. And, in fact, the Bayes estimate of the mean
of the distribution is <em>specific</em> to squared-error loss. If we chose another
loss function, we could come up with a potentially very different Bayes
estimate.</p>
<p>But my intention wasn&#8217;t really to trick you into believing that Bayes estimates
are awesome. (Though they are!) I wanted to show that:</p>
<ol>
	<li>Bayes and classical approaches can come up with very different estimates,
even with a uniform prior.</li>
	<li>If you cast things in decision-theoretic terms, you can make some real
quantitative statements about different ways of estimating.</li>
</ol>
<p>In the decision theory world, you can <em>customize</em> your estimates to minimize
your particular costs in your particular situation. And that&#8217;s an idea that I
think is very, very powerful.</p>
  <div class="comment-link">
    
    <a  href="/bayes-vs-mle#comments">Two comments by <b>Brendan</b> and <b>William</b></a>.
  </div>

  <h2><a  href="/old29">Decision theory and approximate randomization</a></h2>
  <div class="byline">
    <a  href="/by/William+Morgan/">William Morgan</a>,
    <span title="17 months ago">October  5, 2008  8:27pm</span>
  </div>
  
    <div class="labels"><span class='label'><a  href="/label/stats/">stats</a></span> </div>
  
  <p class='first'>In my earlier post about decision theory I alluded to a superior alternative to the classic t-test. That alternative is approximate randomization. It&#8217;s a neat way to do a hypothesis test without having to make any assumptions about the nature of the sampling distribution of the test statistic, in contrast to the assumptions required by Student&#8217;s t-test and its brethren.</p>
<p>Approximate randomization is ideal for comparing the result of a complicated metric run on the output of a complicated system, because you don&#8217;t have to worry about modeling any of that complexity, or, more likely, praying to the central limit theorem while ignoring the issue. Back in my machine translation days, I used it to calculate the significance of the difference between the <span class="caps">BLEU</span> scores of two MT systems. This was pretty much the ideal scenario&#8212;<span class="caps">BLEU</span> is a complicated metric (at least, from the statistical point of view, which is more comfortable with things like the t-statistic, aka &#8220;the difference between the two means divided by some stuff&#8221;), and MT output is the result of something even more complicated. It worked very well, and I even wrote <a href="http://cs.stanford.edu/people/wmorgan/sigtest.pdf">some slides on it</a>.</p>
<p>(In fact, there&#8217;s sometimes an even better reason to use AR over t-tests than just &#8220;it makes fewer assumptions&#8221;: t-tests tend to be overly conservative when their assumptions are violated. So if you&#8217;d be happier with the alternative hypothesis, AR will be more likely to show a difference than a t-test will. There&#8217;s a great chapter on this near the beginning of <a href="http://bayes.bgsu.edu/bcwr/">Bayesian Computation with R</a>, where Monte Carlo techniques are used to show how the test statistic sampling distribution changes under different ways of violating the assumptions.)</p>
<p>Something I&#8217;ve been thinking about a lot recently is how to apply approximate randomization to the Bayesian, decision-theoretic world of hypothesis tests. Unfortunately it&#8217;s not cut and dried. AR gives you a way of directly sampling from the sampling distribution of the test statistic under the null hypothesis. That&#8217;s all you need for classical tests, but in the Bayesian world, you also need to sample from the alternative distribution. For the common &#8220;two-tailed&#8221; case, the null hypothesis is that there&#8217;s no difference, and AR says, just shuffle everything around, because that shouldn&#8217;t make a difference. The alternative hypothesis is that there IS a difference, so I think you would somehow need to do something analogous, but under every possible way of there being a difference. And I&#8217;m not sure what that would really look like.</p>
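<p>For reference, the classical &#8220;shuffle everything around&#8221; procedure can be sketched in a few lines. This is a simplified paired version that assumes each system gets one score per test item and that the statistic is the difference in means; a corpus-level metric like <span class="caps">BLEU</span> would need to be recomputed on each shuffle instead. All names here are made up.</p>

```python
import random


def mean(values):
    return sum(values) / len(values)


def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Paired, two-sided approximate randomization test on a mean difference.

    Returns an estimated p-value: the fraction of random label swaps whose
    statistic is at least as extreme as the one actually observed.
    """
    rng = random.Random(seed)
    observed = abs(mean(scores_a) - mean(scores_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for x, y in zip(scores_a, scores_b):
            # Under the null hypothesis the two labels are interchangeable,
            # so swap each pair with probability 1/2.
            if rng.random() < 0.5:
                x, y = y, x
            shuffled_a.append(x)
            shuffled_b.append(y)
        if abs(mean(shuffled_a) - mean(shuffled_b)) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate away from exactly zero.
    return (at_least_as_extreme + 1) / (trials + 1)


if __name__ == "__main__":
    system_a = [0.31, 0.42, 0.28, 0.39, 0.45, 0.33, 0.37, 0.41]  # made-up scores
    system_b = [0.29, 0.40, 0.27, 0.35, 0.44, 0.30, 0.36, 0.38]
    print(approximate_randomization(system_a, system_b))
```

<p>Note that this only ever samples under the null hypothesis, which is exactly the limitation described above.</p>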
  <div class="comment-link">
    
    <a  href="/old29#comments">One comment by <b>Brendan</b></a>.
  </div>

  <h2><a  href="/old39">Bayesian hypothesis testing and decision theory</a></h2>
  <div class="byline">
    <a  href="/by/William+Morgan/">William Morgan</a>,
    <span title="17 months ago">October  1, 2008  5:55pm</span>
  </div>
  
    <div class="labels"><span class='label'><a  href="/label/stats/">stats</a></span> </div>
  
  <p class='first'><img src="/static/evobayes.jpg" alt="" /></p>
<p>I&#8217;ve been doing a lot of learning at the new job. Not because people here are teaching me stuff, but more because I&#8217;m in a good position to spend a significant portion of my day learning about stuff that will help me do my job. (Which is great, and fun, and further reinforces what I know about myself by now&#8212;I&#8217;m a great self-directed learner and a very poor externally-directed learner.)</p>
<p>One of the things I&#8217;ve learned is that when it comes to statistics, I&#8217;m a Bayesian. And all the crap I learned about things like hypothesis testing and maximum likelihood estimation in my stats classes now seems horribly clunky and old-fashioned to me.</p>
<p>Let&#8217;s take hypothesis testing as an example. In the classical/frequentist world, you pick an arbitrary &#8220;small enough&#8221; probability (typically 5%), find the sampling distribution of your statistic under your null hypothesis, and if the probability of seeing a value at least as extreme as the one you observed is below that threshold, say yea, else say nay.</p>
<p>Here are some things that are wrong/bad with that approach: the 5% threshold is completely arbitrary, the sampling distribution under the alternative hypothesis is not taken into consideration (i.e. you only care about type I errors), and you don&#8217;t have any way to balance the cost of type I vs type II errors. (Never mind the fact that people <span class="caps">ALWAYS</span> just use t-tests and ignore the fact that their datapoints are not actually distributed Normally and with the same means and variances. That, at least, I can tell you how to fix.)</p>
<p>Compare this with the Bayesian decision theory version of hypothesis testing: you assign a cost to the two types of error, calculate the posterior probability under both conditions, based on the observations and incorporating any prior knowledge if you have it, calculate a threshold that minimizes your expected cost, and accept or reject based on that. Doesn&#8217;t that just make more sense?</p>
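<p>Here&#8217;s a bare-bones sketch of that recipe for two simple hypotheses. It assumes you can already compute the likelihood of your observations under each hypothesis; the function names, argument names and numbers are all made up for illustration.</p>

```python
def posterior_alternative(prior_alt, likelihood_null, likelihood_alt):
    """Posterior probability of the alternative hypothesis, via Bayes' rule."""
    numerator = prior_alt * likelihood_alt
    return numerator / (numerator + (1.0 - prior_alt) * likelihood_null)


def decide(prior_alt, likelihood_null, likelihood_alt, cost_type1, cost_type2):
    """Pick the action with the lower expected cost.

    Rejecting the null costs cost_type1 if the null is actually true;
    accepting it costs cost_type2 if the alternative is actually true.
    Rejecting is cheaper exactly when the posterior probability of the
    alternative exceeds cost_type1 / (cost_type1 + cost_type2).
    """
    p_alt = posterior_alternative(prior_alt, likelihood_null, likelihood_alt)
    threshold = cost_type1 / (cost_type1 + cost_type2)
    return "reject null" if p_alt > threshold else "accept null"


if __name__ == "__main__":
    # Same evidence, different costs, different decisions.
    print(decide(0.5, 0.2, 0.8, cost_type1=1, cost_type2=1))
    print(decide(0.5, 0.2, 0.8, cost_type1=9, cost_type2=1))
```

<p>The threshold falls out of the costs instead of being fixed at 5% in advance: make type I errors nine times as expensive and the same evidence no longer justifies rejecting.</p>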
<p>I highly recommend the book <a href="http://bayes.bgsu.edu/bcwr/">Bayesian Computation with R</a>. (Although it doesn&#8217;t actually talk about decision theory!) It has an associated blog: <a href="http://learnbayes.blogspot.com">LearnBayes</a>.</p>
<p>Other things to look at: <a href="http://quasar.as.utexas.edu/stat295.html">William H. Jefferys&#8217;s Stats 295 class materials</a> (especially <a href="http://quasar.as.utexas.edu/courses/stat295.2007/10BayesianHypothesisTesting.pdf">these slides</a>, which I&#8217;m still working my way through), and <a href="http://bayes-rules.blogspot.com/">his blog for the class</a>.</p>
  <div class="comment-link">
    
    <a  href="/old39#comments">No comments</a>.
  </div>

  <h2><a  href="/old21">Simpson&#8217;s Paradox</a></h2>
  <div class="byline">
    <a  href="/by/William+Morgan/">William Morgan</a>,
    <span title="17 months ago">September 24, 2008  5:46pm</span>
  </div>
  
    <div class="labels"><span class='label'><a  href="/label/stats/">stats</a></span> </div>
  
  <p class='first'>I found a really cool visual explanation of Simpson&#8217;s Paradox on the Wikipedier.</p>
<p><img src="/static/430px-Simpsons-vector.svg.png" alt="" /></p>
<p>Informally, Simpson&#8217;s Paradox states that, if you and I are competing, and I do better than you in category A, and I also do better than you in category B, my overall score for both categories combined could actually be <em>worse</em> than yours. The <a href="http://en.wikipedia.org/wiki/Simpson%27s_paradox">Wikipedia article</a> gives a real-life example:</p>
<blockquote>
<p>&#8220;In both 1995 and 1996, [David] Justice had a higher batting average [&#8230;] than [Derek] Jeter; however, when the two years are combined, Jeter shows a higher batting average than Justice.&#8221;</p>
</blockquote>
<p>And there&#8217;s also a famous legal case about Berkeley&#8217;s admission rates for women from the 70&#8217;s, where they were sued because the overall admission rate was lower for women than for men. Turns out that if you break it down by department, most departments actually had a <em>higher</em> admission rate for women.</p>
<p>This all sounds crazy until you stare at the picture above for a while. The slopes of the lines are the percentages. Both solid blue vectors have smaller slopes than their corresponding solid red vectors, but when you add them (shown as dashed lines), the summed blue vector has a bigger slope than the summed red one.</p>
<p>What the picture really makes clear is that a ratio or a percentage is not a complete description of the situation. Knowing a percentage is equivalent to knowing the angle of a vector without knowing its magnitude. You can see from the picture that this isn&#8217;t a weird corner case; there are <em>many</em> choices for the second blue vector that would have the same result.</p>
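<p>The same thing in numbers, treating each season as a (hits, at-bats) vector. The counts below are the ones usually quoted for the Jeter/Justice example; they aren&#8217;t given in the quote above, so treat them as illustrative.</p>

```python
# (hits, at-bats) per season; counts as commonly quoted, not from this post.
records = {
    "Jeter": {"1995": (12, 48), "1996": (183, 582)},
    "Justice": {"1995": (104, 411), "1996": (45, 140)},
}


def batting_average(hits, at_bats):
    """The 'slope' of a (hits, at-bats) vector."""
    return hits / at_bats


def combined(player):
    """Add a player's season vectors component by component."""
    seasons = records[player].values()
    return (sum(h for h, _ in seasons), sum(ab for _, ab in seasons))


if __name__ == "__main__":
    for player in records:
        per_year = {
            year: round(batting_average(*season), 3)
            for year, season in records[player].items()
        }
        print(player, per_year, round(batting_average(*combined(player)), 3))
```

<p>Justice has the steeper vector in each season, but his long vector is the low-average one and Jeter&#8217;s long vector is the high-average one, so the sums come out the other way around.</p>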
<p>It&#8217;s been probably 10-12 years since I learned about Simpson&#8217;s Paradox in some undergrad stats class. Now I finally really understand it.</p>
  <div class="comment-link">
    
    <a  href="/old21#comments">One comment by <b>Brendan</b></a>.
  </div>




  </div>

  <div id="footer" style="margin: 0px;">
    Served up by <a href="http://masanjin.net/whisper/">Whisper</a>. Yes!
  </div>
</div>
</body>
</html>
