Woohoo! Bayes’ Rule! Or should it be Bayes’s? Kruschke goes with Bayes’, so I guess I will, too, but the linguist in me really thinks it ought to be Bayes’s, or at least pronounced that way. I’d certainly say Jonas’s rule, if I knew a guy named Jonas with a rule. Ok, sorry, back to Bayes and Kruschke.
I really like the initial examples he starts off with. For whatever reason, equations are a lot of hard work for me, even though I like them, and it’s hard for me to understand them deeply enough to have an intuitive feel for what they are saying. The rain/clouds example is way more accessible, and the playing card probabilities are a nice enough “toy” example that actual numbers can be calculated, so overall I think he’s done a great job in choosing examples.
The gist of Bayes’ Rule is that it sets up a relationship between conditional probabilities, allowing you to calculate something you want to know from quantities that you already have (or can estimate). Back to the idea of beliefs as probabilities: Bayesian inference boils down to calculating beliefs (the probability of some parameter) given data, which is a conditional probability, like calculating the probability of rain (a parameter) given clouds (data). And the point and magic of Bayes’ Rule is that this can be calculated as a function of other probabilities, which we can often get at more directly.
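Just to have it on the page (this is my own restatement of the standard form of the rule, written in the rain/clouds terms rather than Kruschke’s notation):

$$ p(\mathrm{rain} \mid \mathrm{clouds}) \;=\; \frac{p(\mathrm{clouds} \mid \mathrm{rain}) \, p(\mathrm{rain})}{p(\mathrm{clouds})} $$

That is, a conditional probability that’s hard to get at directly (rain given clouds) gets re-expressed in terms of ones that are easier to come by (clouds given rain, plus the base rates of rain and of clouds).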
Of course, “easier” is relative, and that’s where all the computational stuff starts coming into play. But in this chapter, Kruschke sticks to the coin-flipping and playing-card examples, because the probabilities involved there can be calculated pretty easily.
But again, like in the previous chapter, I feel like sometimes I needed to suspend my questions for a little while to follow how Kruschke lays things out, because it took me a while to get the connection between what he starts off talking about as examples, and the things that I see as valuable. I’m coming to this book from a fair amount of experience with a bunch of NHST tools, even “fancy” ones like mixed-effects models, and there have been times in these initial chapters where I was having trouble seeing the relevance to what I wanted to do, which was to try to understand how Bayesian stats are an alternative to the NHST way of doing things. When he finally gets to talking about belief in a model, or belief in a model parameter, I could start recognizing those things as goals, but it was hard to follow the lead-up. So again — and I’m starting to sound like a broken record — depending on the audience, a bit of back-and-forth re-reading of some of these initial chapters will probably be beneficial, in order to connect his very nice, clear description of concepts to the more complex things that you will inevitably care more about. To put it a little more negatively, at this point in the book, if you’re coming in with any expertise in other statistical methods, you may feel impatient, and it may be hard to work through these chapters without a clear view of how they connect to what you actually want to do. My two cents is that it’s worth the effort to suffer the suspense, though maybe a quick initial read, followed by a more careful return later, would work best. The rest of this chapter is pretty straightforward with examples, but I think this chapter in particular should be re-read after getting farther, because it lays the foundations for the fundamental insights of Bayesian inference (at least, as I understand it so far).
On p. 62, section 4.2.2.1, he gets around to being very explicit about a point I raised earlier regarding confusion between different kinds of probability. He does finally point out that θ, i.e., the parameter that you actually care about testing with your data, is not always a probability, even though it has been in his examples of coin-flipping (it could just as well be a mean or a regression slope, for instance). This is a point that I think an instructor could and should build in earlier, or at least watch out for, so that students don’t get the two completely conflated.
I like that Kruschke revisits the three goals of inference that he set up earlier, now clothing them in Bayesian terms. This is excellent, though again, worthy of re-reading later. I’ve heard noise about Bayes factors not being all that when it comes to model comparison, although that’s what Kruschke presents. “I’ve heard noise” is about as good as I can say now, though, so that will be something to watch out for in the future, maybe.
I’d like to end this post by paraphrasing and trying to re-express a point that Kruschke makes towards the end of the chapter, one that I think the whole first seven chapters or so are building towards. I’m hoping that by paraphrasing it, I can solidify my own understanding.
The general point is to understand why complex computational methods (or at least complex to me) are needed for Bayesian analysis, when Bayes’ Rule seems to be rather simple and elegant. Coming at Bayesian analysis from the outside, I’ve had mixed impressions. One is that it’s about a different philosophical take on how to draw inferences from data, which to me makes a good deal of sense. The other is that there are a lot of tricky computational and mathematical aspects to it, which makes it seem more esoteric or difficult or even problematic for practical use. And I never understood how or why these things were connected. So my understanding from this chapter of Kruschke is that it all comes down to the “evidence,” to use the Bayesian term, that is, the denominator of Bayes’ Rule. To paraphrase Bayes’ Rule in terms of a model parameter (I have a regression coefficient in mind, for example), what we care about is the probability of a model parameter being a particular value given some data (the posterior). This is the whole “updated belief given the data” idea. But to get that, we need a prior belief, a likelihood, and evidence (again, all with the Bayesian sense of these words). The prior belief makes sense conceptually to me, and I can at least imagine how one might come up with such a thing. The likelihood is the probability of the data occurring given the model and parameters, and I can imagine that one can compute that without too much difficulty, basically with a probability density.
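Spelled out (my own rendering of the standard notation, with D for the data and θ for the parameter, so nothing here is specific to Kruschke):

$$ \underbrace{p(\theta \mid D)}_{\text{posterior}} \;=\; \frac{\overbrace{p(D \mid \theta)}^{\text{likelihood}} \;\; \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(D)}_{\text{evidence}}} $$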
So far, I have a pretty decent grip on this, I think. From my NHST training, I’m comfortable thinking about data being the result of grabbing some numbers out of a bag. What we are trying to understand is the properties of “the bag” (i.e., the world, the mind, etc.), but we can’t know everything in the bag, so we grab enough numbers to see if we can learn something about how the contents are structured. The NHST way of doing things is to formulate a distribution of what the numbers would look like if the bag had no structure and everything was just random — the null hypothesis — and then we can say, with this boring imaginary bag, it would be very unlikely (like, say, less than 5% probability) that we would get numbers like the ones we’ve drawn. Therefore, the bag must be structured in some way. The way I understand it, the Bayesian likelihood is like saying “okay, imagine the bag has this property; what would random pulls from the bag look like?” and that’s the probability density of the data given that parameter. While a little more involved than the NHST null hypothesis, because it’s not always just a “null hypothesis bag,” it’s not all that different, and I can imagine being able to come up with these values.
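To make that “imagine the bag has this property” step concrete, here’s a minimal sketch in Python of a coin-flip likelihood. This is my own toy, not from the book, and the function name and data are made up:

```python
import numpy as np

# Toy coin-flip likelihood: the probability of the observed flips
# given a hypothesized coin bias theta ("imagine the bag has this
# property, what would pulls from it look like"). All names are my own.
def likelihood(theta, flips):
    """p(data | theta) for a sequence of flips, 1 = heads, 0 = tails."""
    flips = np.asarray(flips)
    heads = flips.sum()
    tails = len(flips) - heads
    return theta**heads * (1 - theta)**tails

data = [1, 1, 0, 1, 0, 1, 1]        # 5 heads, 2 tails
print(likelihood(0.5, data))         # how likely is this under a fair coin?
print(likelihood(0.7, data))         # ...and under a coin biased toward heads?
```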
So the real trick, computationally at least, seems to be the evidence. The evidence is the probability of the data happening across all possible parameter values. In the majority of cases I can think of that are of practical interest to a researcher, parameter values are continuous, and therefore it’s impossible or impractical to actually calculate this value. That’s where all the “fancy” computational and mathematical techniques like MCMC sampling come into play, as providing ways of estimating this quantity in a reasonable way. And what makes it difficult is that, because there’s a random element to this, sometimes the algorithms and computations can go astray, and that’s where a lot of active work is being done, to refine and improve the techniques for actually calculating the value of the evidence.
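Concretely, that denominator is p(D) = ∫ p(D|θ) p(θ) dθ, an integral over every possible parameter value. In a one-parameter problem like the coin, you can brute-force it on a grid; here’s a minimal sketch (again my own toy, assuming a uniform prior, and not anything from the book). The thing to notice is that the number of grid points explodes once you have many parameters, which is roughly where the fancier machinery comes in:

```python
import numpy as np

# Grid approximation of the evidence p(D) = integral of p(D|theta) p(theta)
# for the toy coin example, assuming a uniform prior over theta.
# Brute force is fine in one dimension; with many parameters the grid
# becomes astronomically large, which is where methods like MCMC come in.
def likelihood(theta, heads, tails):
    return theta**heads * (1 - theta)**tails

theta_grid = np.linspace(0, 1, 1001)      # candidate parameter values
prior = np.ones_like(theta_grid)          # uniform prior (unnormalized)
prior /= prior.sum()                      # normalize over the grid

lik = likelihood(theta_grid, heads=5, tails=2)
evidence = np.sum(lik * prior)            # weighted sum approximates the integral
posterior = lik * prior / evidence        # Bayes' Rule, point by point

print(evidence)                           # the troublesome denominator
print(theta_grid[np.argmax(posterior)])   # where the posterior peaks (~0.71)
```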
At least, this is how I’m seeing things now. All you Bayesians out there, if I’m screwing it up, please enlighten me!
EDIT: I wrote the above the first time I went through the chapter, and I’m leaving it as-is, in case you’re reading the book along with me. But I’m a little off base here. Hopefully I’ll get it sorted out better later. As a sneak preview, there’s not really anything about the “evidence” that’s a problem, but it is an issue of efficient sampling from a large space, so I’m sorta partly right. But this is more or less confirming my general point from above: while this chapter is well-written and (at least to a complete novice) seems to articulate the core of Bayesian inference in an accessible way, my guess is that most people would benefit from re-visiting it, once they’ve gotten their hands dirtier with actually doing Bayesian analysis, in the next few chapters.