5 Methodological Effort and Epistemological Reflection

Chapter 5
Methodological Effort and Epistemological Reflection

5.1 Understanding the Opposition between Bayesians and Frequentists
  5.1.1 Two Conceptions of Probability and Uncertainty
   5.1.1.1 The Concept of Conditional Probability
   5.1.1.2 The Bayesian and Frequentist Assumptions
  5.1.2 Comparison of the Two Approaches on Simple Cases
   5.1.2.1 Simple Case: When the Two Approaches Agree
   5.1.2.2 Benefits of Using an Informative Prior
   5.1.2.3 Case Where the Two Approaches Differ: Non Gaussianity and Few Data
  5.1.3 Numerical Methods to Solve Bayesian Problems
   5.1.3.1 Sampling the Posterior Distribution
   5.1.3.2 Post-Processing MCMCs
  5.1.4 Decision Making and Limitations of the Frequentist Approach
   5.1.4.1 Hypothesis Testing
   5.1.4.2 Pros and Cons of the two Approaches
5.2 Bayesianism, an Alternative to Popper’s Scientific Method
  5.2.1 Bayes Formula Throughout History
   5.2.1.1 The Origins
   5.2.1.2 The Frequentist Winter
   5.2.1.3 The Bayesian Renaissance
  5.2.2 Bayesian and Popperian Epistemologies
   5.2.2.1 The Epistemological Debate at the Beginning of the XX

^{t h}

Century
   5.2.2.2 Popper’s Logic of Scientific Discovery
   5.2.2.3 Verifiability, Holisticity and Parsimony: the Bayesian Alternative
5.3 Relevance for Interstellar Dust Studies
  5.3.1 The Particularities of Interstellar Dust Studies
   5.3.1.1 Complexity of the Physics
   5.3.1.2 Entanglement of the Observations
  5.3.2 The Principles of Hierarchical Bayesian Inference
   5.3.2.1 Non-Hierarchical Bayesian Formalism for SED Modeling
   5.3.2.2 The Introduction of Nuisance Variables
   5.3.2.3 The Role of the Hyperparameters
  5.3.3 Hierarchical Bayesian Models for ISD Studies
   5.3.3.1 Efficiency and Comparison to Other Methods
   5.3.3.2 The Role of the Prior
   5.3.3.3 Other Developments

(...) the Bayesian method is easier to apply and yields the same or better results. (...) the orthodox results are satisfactory only when they agree closely (or exactly) with the Bayesian results. No contrary example has yet been produced. (...) We conclude that orthodox claims of superiority are totally unjustified; today, the original statistical methods of Bayes and Laplace stand in a position of proven superiority in actual performance, that places them beyond the reach of mere ideological or philosophical attacks. It is continued teaching and use of orthodox methods that is in need of justification and defense.

(Edwin T. JAYNES; Jaynes, 1976)

We have stressed earlier that ISD studies were essentially empirical, because of the complexity of their subject. Most of the time, they consist in interpreting data with formulae and models. Yet, comparing observations to models is a very wide methodological topic. It also has epistemological ¹ consequences. All the knowledge we derive about ISD depends on the way it was inferred. The question of the methods we use, and the way we articulate different results to build a comprehensive picture of the ISM, is thus of utmost importance.

5.1 Understanding the Opposition between Bayesians and Frequentists

Historically, two competitive visions of the way empirical data should be quantitatively compared to models have emerged, the (i) Bayesian, and (ii) frequentist methods. We personally follow the Bayesian method and will try to give arguments in favor of its superiority. An efficient way to present the Bayesian approach is to compare it to its alternative, and to show how both methods differ. There is a large literature on the subject. The book of Gelman et al. (2004) is a reference to learn Bayesian concepts and techniques. The posthumous book of Jaynes (2003) is more theoretical, but very enlightening. Otherwise, several reviews have been sources of inspiration for what follows (Jaynes, 1976; Loredo, 1990; Lindley, 2001; Bayarri & Berger, 2004; Wagenmakers et al., 2008; Hogg et al., 2010; Lyons, 2013; VanderPlas, 2014). A good introduction to frequentist methods can be found in the books of Barlow (1989) and Bevington & Robinson (2003).

5.1.1 Two Conceptions of Probability and Uncertainty

Bayesian and frequentist methods differ by: (i) the meaning they attribute to probabilities; and (ii) the quantities they consider random. Their radically different approaches and the various bifurcations the two methods take to address a given problem originate from these sole conceptions.

5.1.1.1 The Concept of Conditional Probability

The concept of conditional probability is central to what follows. As we will see, Bayesian and frequentist approaches differ on this aspect.

The meaning of conditional probabilities. If $A$ and $B$ are two logical propositions, the conditional probability, noted $p (A | B)$ , is the probability of $A$ being true, knowing $B$ is true. To give an astronomical example, let’s assume that:

$A$ is the probability per unit time to observe a SN Ia, when pointing a telescope at a random star; and
$B$ is the probability to observe a binary system, when pointing a telescope at a random star.

Since most stars are LIMS, and that they have a lifetime $⟨ τ ⟩ ≃ 10$ Gyr, we can estimate the probability to observe a SN Ia, knowing we are observing a binary system:

p (SN Ia | binary) ≃ 0.1 {Gyr}^{- 1} .

(5.1)

On the contrary the probability to observe a binary system, knowing we are observing a SN Ia is:

p (binary | SN Ia) = 1,

(5.2)

because SN Ia happen only in binary systems. We see that $p (A | B) \neq p (B | A)$ . In our example, the two quantities do not even have the same units.

All probabilities are conditional. In practice, all probabilities are conditional. In the previous example, we have implicitly assumed that our possibility space, that is the ensemble of cases we can expect out of the experiment we are conducting, contained all the events where we are actually observing a star, when we are pointing our telescope at its coordinates. However, what if there is suddenly a cloud in front of the telescope? We would then need to account for these extra possibilities, which is equivalent to adding conditions. For instance, if we are conducting the same experiment from a ground-based telescope in London, during winter, we will get:

p (SN Ia | binary \land London \land winter) « 0.1 {Gyr}^{- 1},

(5.3)

where the symbol $\land$ denotes the logical “and”.

$\Rightarrow$ All probabilities are conditional and the possible conditions are limited by our own knowledge.

Bayes’ rule. Thomas BAYES derived, in the middle of the XVIII $^{t h}$ century, a formula to reverse the event and the condition, in conditional probabilities. Fig. 5.1 shows a Venn diagram, that is a graphic representation of a possibility space. We have shown an arbitrary event A, in red, and another event B, in blue. The intersection of A and B, numbered (2), can be written $A \land B$ . The probability of A, knowing B, is the probability of A, when B is considered as the new possibility space. In other words:

p (A | B) = \frac{(2)}{(2) + (3)} = \frac{p (A \land B)}{p (B)} .

(5.4)

We therefore have: $p (A \land B) = p (A | B) p (B)$ . By symmetry of A and B, we have $p (A \land B) = p (B \land A)$ , thus $p (A | B) p (B) = p (B | A) p (A)$ , which gives Bayes’ rule:

p (A | B) = \frac{p (B | A) p (A)}{p (B)} .

(5.5)

Figure 5.1: Venn diagram to demonstrate Bayes’ rule. Licensed under CC BY-SA 4.0.

5.1.1.2 The Bayesian and Frequentist Assumptions

We review here the assumptions of both methods, in a general, abstract way. We will illustrate this presentation with concrete examples in Sect. 5.1.2. Let’s consider we are trying to estimate a list of $n$ physical parameters, $\vec{x} = {(x_{i})}_{i = 1, \dots, n}$ , using a set of $m$ observations, $\vec{d} = {(d_{j})}_{j = 1, \dots, m}$ . Let’s also assume that we have a model, $f$ , predicting the values of the observables, ${\vec{d}}_{m o d}$ , for any value of $\vec{x}$ : ${\vec{d}}_{m o d} = f (\vec{x})$ .

The Bayesian viewpoint. The Bayesian approach considers that there is an objective truth, but that our knowledge is only partial and subjective. Bayesians thus assume the following.

Probabilities

quantify the plausibility of a proposition, when we only have incomplete knowledge. With this assumption, probabilities can be assigned to parameters and hypotheses. More precisely, the true value of a parameter is considered fixed and an hypothesis is either true or false, but our knowledge of the value of this parameter, or of the plausibility of this hypothesis, can be described by random variables. Probabilities are therefore used to quantify our partial, subjective knowledge.

The method

then consists in sampling the probability distribution of the parameters, conditional on the data, using Bayes’ rule:

\underset{p o s t e r i o r}{\underset{︸}{p (\vec{x} | \vec{d})}} \propto \underset{l i k e l i h o o d}{\underset{︸}{p (\vec{d} | \vec{x})}} \times \underset{p r i o r}{\underset{︸}{p (\vec{x})}} .

(5.6)

Compared to Eq. ( $5.5$ ), Eq. ( $5.6$ ) misses the denominator, $p (\vec{d})$ . This is because this distribution is independent of our variables, that are the physical parameters. If we were to explicit it, it would be:

p (\vec{d}) = \int p (\vec{d} | \vec{x}) p (\vec{x}) d^{n} \vec{x},

(5.7)

which is simply the normalization factor of $p (\vec{x} | \vec{d})$ . This factor can thus be estimated by numerically normalizing the posterior, hence the proportionality we have used in Eq. ( $5.6$ ). The three remaining terms are the following.

The posterior distribution,

p (\vec{x} | \vec{d})

, is what we are interested in. It is literally the PDF of the physical parameters (what we want), knowing the data (what we have).

The likelihood,

p (\vec{d} | \vec{x})

, is the probability of the data, for a fixed value of the parameters. It can be computed using our model,

f (\vec{x})

. For instance, assuming that our observations are affected by uncorrelated, Gaussian noise, with standard deviations

\vec{σ} = {(σ_{j})}_{j = 1, \dots, m}

, we can write:

p (\vec{d} | \vec{x}) = \frac{1}{{(2 π)}^{m ∕ 2} \prod_{j = 1}^{m} σ_{j}} \exp (- \sum_{j = 1}^{m} \frac{{(f_{j} (\vec{x}) - d_{j})}^{2}}{2 σ_{j}^{2}}) .

(5.8)

The prior distribution,

p (\vec{x})

, is a unique feature of the Bayesian approach. It literally quantifies all our prior knowledge about the values of

\vec{x}

. We will give concrete examples of what that could mean in the following sections.

The result

of the Bayesian approach is the posterior, as it contains all the information we want on the parameters, informed by the observations and our prior knowledge. We can then decide to synthesize this information by, for instance:

quoting moments of the posterior: means, $⟨x_{i} | \vec{d}⟩$ , standard deviation, $σ (x_{i} | \vec{d})$ , etc.;
quoting their correlation coefficient: $ρ (x_{i}, x_{j} | \vec{d})$ , etc.;
testing hypotheses: $P (x_{i} > const | \vec{d})$ , etc.;
and so on.

The frequentist viewpoint. The frequentist approach also considers that there is an objective truth, but it differs with the Bayesian viewpoint by rejecting its subjectivity. Frequentists thus assume the following.

Probabilities: are the limit to infinity of the occurrence frequency of an event, in a sequence of repeated experiments, under identical conditions ². This repeated event can be, for instance, the measure of a quantity. The rejection of the use of probabilities as a quantification of knowledge forbids frequentists to consider parameters or hypotheses as random variables. In other words, physical quantities have a single, true value, and hypotheses are either true or false. Only the data, which are tainted with uncertainties, can be considered as random variables.
The method: then consists in using the likelihood, $p (\vec{d} | \vec{x})$ , to perform several tests, the most well-known being the maximum likelihood. We can see here that frequentists consider the probability distribution of the data, given the physical parameters. The data are thus considered as variables, and the physical parameters, fixed.
The results: consist in describing what values of the physical parameters we would find if we were to repeat the experiment in the same conditions. Frequentists then compute the uncertainties on their estimate of the parameters by simulating new data that could have been obtained in the same conditions. We thus end-up with a distribution of parameter values, that is not a probability distribution. For instance, hypothesis testing can not be performed the Bayesian way, because we dot not have a conditional probability of the parameter. We will see in Sect. 5.1.4.1 that we need to resort to the infamous significance tests, instead.

$\Rightarrow$ Bayesians do not tamper with the data, whereas frequentists account for hypothetical data that have not actually been obtained.

5.1.2 Comparison of the Two Approaches on Simple Cases

We now compare the two approaches on a series of simple examples, in order to demonstrate in which situations the two approaches may differ.

5.1.2.1 Simple Case: When the Two Approaches Agree

Let’s assume we are trying to estimate the flux of a star, $F_{⋆}$ , in a given photometric filter, with the following assumptions.

The true flux of the star is $F_{t r u e} = 42$ (in arbitrary units).
We have performed $m = 3$ repeated measures, $d_{j} = F_{j}$ $(j = 1, \dots, m)$ .
Each flux has Gaussian independent, identically distributed (iid) uncertainties: $σ_{j} = σ_{F} = 14$ $(\forall j = 1, \dots, m)$ .
The physical quantity we want to estimate is simply the value of the flux, $F_{⋆}$ . This corresponds thus to the simplest case, where we have only $n = 1$ parameter and the model is the identity: $F_{m o d} = f (x) = x$ .

This is represented in Fig. 5.2. There is an analytic solution to this simple case:

F_{⋆} ≃ \frac{\sum_{j = 1}^{m} F_{j}}{m} \pm \frac{σ_{F}}{\sqrt{m}} = 53.9 \pm 8.1 .

(5.9)

Figure 5.2: Simulation of the measure of a stellar flux to compare Bayesian and frequentist methods. The blue dots with error bars represent the successive measures and their uncertainties. The magenta line shows the true value. The units are arbitrary. Licensed under CC BY-SA 4.0.

The Bayesian solution. The Bayesian solution is obtained by sampling the posterior distribution in Eq. ( $5.6$ ): $p (F_{⋆} | F_{1}, F_{2}, F_{3})$ .

The likelihood term

is exactly Eq. (

5.8

), in our case, as we have Gaussian iid uncertainties.

The choice of the prior

is the arbitrary part of the Bayesian model. For this first example, let’s assume that we have no idea what the flux should be. We thus take a flat, uninformative prior,

p (F_{⋆}) \propto 1

. This is a subjective choice that has the following consequences.

1.

This particular choice is called an improper prior, meaning it can not be normalized, as:

\int_{- \infty}^{\infty} p (F_{⋆}) d F_{⋆} = \infty

. In practice, we thus need to choose lower and upper bounds,

[F_{-}, F_{+}]

, beyond which the prior is 0. Since we are measuring a positive quantity ³, we can take

F_{-} = 0

as the lower bound. The upper bound could be taken as several times the flux of a 120

M_{⊙}

star at the distance of the source. We thus have:

p (F_{⋆}) = \{\begin{matrix} \frac{1}{F_{+} - F_{-}} & for F_{-} \leq F_{⋆} \leq F_{+} \\ 0 & otherwise. \end{matrix}

(5.10)

2.

This prior is also subjective, as it depends on the choice of the variable. If we decide to study

\ln F_{⋆}

, instead of

F_{⋆}

, it will lead to a different result. We will address this issue in Sect. 5.1.4.1. We can summarize it the following way.

If the choice of the prior matters in the final solution, it means that the weight of evidence brought by the data is weak. It is therefore natural that the way we decide to quantify our prior knowledge is important. This is an aspect of the Bayesian method we need to embrace.
On the opposite, if the weight of evidence brought by the data is large, the choice of the prior will not matter significantly. In other words, if the width of the likelihood is much narrower than a dex, the difference between multiplying by $p (F_{⋆})$ or $p (\ln F_{⋆})$ will be negligible.

The solution is represented in Fig. 5.3.a. We have sampled the posterior using a Markov Chain Monte-Carlo method (MCMC; using the code emcee by Foreman-Mackey et al., 2013). We will come back to MCMC methods in Sect. 5.1.3. The estimated value is $F_{⋆} ≃ 53.9 \pm 8.1$ ( $1 σ$ uncertainty). This is exactly the analytical solution in Eq. ( $5.9$ ). The $95 %$ credible range, that is the range centered on $⟨ F_{⋆} ⟩$ containing a $95 %$ probability, is ${C R}_{95 %} (F_{⋆}) = [37.8, 69.8]$ .

Figure 5.3: Bayesian and frequentist solutions to the problem of Fig. 5.2. Panel (a) shows the likelihood of the individual measures, in red, green and blue. The posterior, which is the product of the three likelihoods, is shown in yellow. We have filled its area corresponding to the $95 %$ credible range. The posterior average, $⟨ F_{⋆} ⟩$ is shown in dark yellow. Panels (b-e) shows random reproduction of the measures. The blue dots with error bars are randomly drawn around the maximum likelihood, $F_{M L}$ , using a bootstrapping method. The estimated value of $F_{⋆}$ resulting from these draws, ${\bar{F}}_{M L}^{(k)}$ , is shown in green. In all panels, the true value, $F_{t r u e}$ , is shown in magenta. Licensed under CC BY-SA 4.0.

The frequentist solution. There are different ways to approach this problem in the frequentist tradition. The most common solution would be to use a Maximum-Likelihood Estimation (MLE) method.

Maximum-likelihood value.: There are numerical tools to compute the MLE of complex models. We have used a Levenberg-Marquardt algorithm (e.g. Markwardt, 2009). The MLE value of $F_{⋆}$ we derive is $F_{M L} = 53.9$ , which is in agreement with the analytical solution in Eq. ( $5.9$ ). It however does not provide uncertainties.
Uncertainty estimates.: A common solution to estimate the uncertainties on $F_{M L}$ would be to perform bootstrapping. Following the frequentist conception of probabilities, we randomly draw new observations around the MLE value, a large number of times. We thus obtain a set of synthetic repeated measures, assuming the population distribution has $F_{M L}$ for mean. For each new set, we derive a new MLE value, ${\bar{F}}_{M L}^{(k)}$ $(k = 1, \dots, 3000)$ . The first four draws are represented in Fig. 5.3.b-e. The standard deviation of this sample is $8.1$ , in agreement with Eq. ( $5.9$ ). We can also compute the $95 %$ confidence interval, from this sample: ${C I}_{95 %} (F_{⋆}) = [37.8, 69.8]$ .

A few remarks. We can see that both methods give the same exact result, which is also consistent with the analytical solution (Eq. $5.9$ ). This is because the assumptions were simple enough to make the two approaches equivalent:

1.: by assuming a flat prior, we removed the effect of this Bayesian peculiarity;
2.: the symmetry and the iid nature of the noise made the sampling of the likelihood as a function of the parameters, or as a function of the data, identical.

Note also our subtle choice of terminology: (i) we talk about credible range in the Bayesian case, as this term designates the quantification of our beliefs; while (ii) we talk about confidence interval in the frequentist case, as it concerns our degree of confidence in the results, if the experiment was repeated a large number of times, assuming the population mean is the maximum likelihood.

5.1.2.2 Benefits of Using an Informative Prior

A first way to find differences between the Bayesian and frequentist approaches is to explore the effect of the prior. To that purpose, let’s keep the same experiment as in Sect. 5.1.2.1, but let’s assume now that the star we are observing belongs to a cluster, and we know its distance.

The Bayesian improvement. Contrary to Sect. 5.1.2.1, where we had to guess a very broad, flat prior, we can now refine this knowledge, based on the expected luminosity function, scaled at the known distance of the cluster. This is represented on Fig. 5.4. The posterior distribution (yellow) is now the product of the likelihood (green) and prior (blue). The frequentist solution has not changed, as it can not account for this kind of information. We can see that the maximum a posteriori is now closer to the true value than the maximum likelihood. This can be understood the following way.

The true flux: has been drawn from the luminosity function, because we have randomly targetted a star in this cluster. This is the definition of the luminosity function, which we happen to have chosen as the prior. This is often noted $F_{t r u e} \sim p (F_{⋆})$ , the $\sim$ symbol meaning “distributed as”.
The observed flux: has then been drawn from a normal law centered on $F_{t r u e}$ with variance $σ_{F}^{2}$ , that can be written: $F_{j} | F_{t r u e} \sim N (F_{t r u e}, σ_{F}^{2})$ . This is equivalent to saying that the observed flux has been drawn from the posterior $p (F_{⋆}) \times N (F_{t r u e}, σ_{F}^{2})$ . Sampling this distribution is thus the best choice we can make, considering the information we have. This is why we get an advantage over the frequentist result.

If we perform several such measures, there will be some Bayesian solutions that will get corrected farther away from the true flux. This is a consequence of stochasticity. On Fig. 5.4, keeping our value of $F_{t r u e}$ , this will be the case if the noise fluctuation, $- δ F$ is negative, that is if the observed flux is lower than $F_{t r u e}$ ( $F_{j} = F_{t r u e} - δ F < F_{t r u e}$ ). However, this deviation will be less important than the correction we would benefit from if the same fluctuation was positive, because the prior would be higher: $p (F_{t r u e} - δ F) > p (F_{t r u e} + δ F)$ (cf. Fig. 5.4). The prior would therefore correct less the likelihood on the left side, in this particular case. Thus, on average, taking into account this informative prior will improve the results.

Figure 5.4: The benefits of using an informative prior. The green curve shows the total likelihood. It is identical to the yellow curve in Fig. 5.3. The prior, which is taken as the luminosity function of the cluster, is represented in blue. The posterior, in yellow, is the product of these two distributions. Licensed under CC BY-SA 4.0.

Accumulation of data. In this case and the previous one (Sect. 5.1.2.1), we had three observations of the same flux. The posterior distribution we sampled was:

p (F_{⋆} | F_{1}, F_{2}, F_{3}) \propto p (F_{⋆}) \times p (F_{1} | F_{⋆}) \times p (F_{2} | F_{⋆}) \times p (F_{3} | F_{⋆}) .

(5.11)

This is because we considered the three measures as part of the same experiment. However, we could have chosen to analyze the data as they were coming. After the first flux, we would have inferred:

p (F_{⋆} | F_{1}) \propto p (F_{⋆}) \times p (F_{1} | F_{⋆}) .

(5.12)

This posterior would have been wider (i.e. more uncertain), as we would have had only one data point. Note that such an inference would have not been possible with a frequentist method, as we would have had one parameter for one constraint (i.e. zero degree of freedom). What is interesting to note is that the analysis of the second measure, can be seen as taking into account the first measure in the prior:

p (F_{⋆} | F_{1}, F_{2}) \propto p (F_{⋆}) p (F_{1} | F_{⋆}) \times p (F_{2} | F_{⋆}) = \underset{new prior}{\underset{︸}{p (F_{⋆} | F_{1})}} \times p (F_{2} | F_{⋆}),

(5.13)

and so on. For the third measure, the new prior would be $p (F_{⋆} | F_{1}, F_{2})$ :

p (F_{⋆} | F_{1}, F_{2}, F_{3}) \propto \underset{new prior}{\underset{︸}{p (F_{⋆} | F_{1}, F_{2})}} \times p (F_{3} | F_{⋆}) .

(5.14)

which is formally equivalent to Eq. ( $5.11$ ), but is a different way of looking at the prior. Notice that the original prior, $p (F_{⋆})$ , appears only once in the product. The more we accumulate data, the less important it becomes.

$\Rightarrow$ The Bayesian approach is an optimal framework to account for the accumulation of knowledge.

5.1.2.3 Case Where the Two Approaches Differ: Non Gaussianity and Few Data

The other reason why the two approaches might differ is because Bayesians sample $p (\vec{x} | \vec{d})$ , whereas frequentists produce a series of tests based on $p (\vec{d} | \vec{x})$ . The difference becomes evident when we consider non-Gaussian errors with small data sets.

Flux with a non-linear detector. Let’s assume that we are measuring again the flux from the same star ( $F_{⋆} = 42$ ), with $m = 3$ repetitions, but that our detector is now highly non-linear. This non-linearity translates into a heavily-skewed split-normal noise distribution (cf. Sect. C.2.2):

p (F_{j} | F_{⋆}) = \{\begin{matrix} \frac{1}{\sqrt{2 π} λ} \exp (- \frac{{(F_{j} - F_{⋆})}^{2}}{2 λ^{2}}) & if F_{j} \leq F_{⋆} \\ \frac{1}{\sqrt{2 π} λ τ} \exp (- \frac{{(F_{j} - F_{⋆})}^{2}}{2 λ^{2} τ^{2}}) & if F_{j} > F_{⋆} . \end{matrix}

(5.15)

with $λ = 0.3$ and $τ = 50$ . This noise distribution is the red curve in Fig. 5.5.a. In practice, this could for example be a very accurate detector suffering from transient effects. The measured value would be systematically higher than the true flux ⁴. We have simulated three such measures in Fig. 5.5.a, in blue. This problem was adapted from example 5 of Jaynes (1976).

Figure 5.5: Flux measures with a non-linear detector. Panel (a) shows the noise distribution, in red, and the three observations in blue. Panel (b) displays the Bayesian posterior in yellow and the frequentist bootstrapping in green. In both panels, the true flux is shown in magenta and the zone of inconsistent solutions is hatched in grey. Licensed under CC BY-SA 4.0.

The solutions. Knowing that the measured flux is always greater than or equal to the true flux, it is obvious that the solution should be lower than the lowest measured flux: $F_{⋆} \leq \min_{j} F_{j} = 47.6$ , in our particular case. This flux range, corresponding to inconsistent values, has been hatched in grey, in Fig. 5.5.

The Bayesian solution: is obtained the same way as before, by sampling the posterior. We again assume a flat prior, and take the likelihood in Eq. ( $5.15$ ). The posterior is shown in yellow, in Fig. 5.5.b. It has zero probability in the range that we qualified as “inconsistent”, and the true value falls in a high probability domain. The mean and standard deviation of the posterior give us $F_{⋆} ≃ 43.8 \pm 3.6$ , with ${C R}_{95 %} (F_{⋆}) = [34.6, 47.9]$ .
The frequentist solution: is obtained the same way as before. The maximum likelihood and the whole bootstrapping sample however falls in the “inconsistent” domain. We get $F_{⋆} ≃ 51.9 \pm 4.0$ , with ${C I}_{95 %} (F_{⋆}) = [47.4, 62.5]$ . We see here that the frequentist solutions fails at inferring the true flux. On top of that, it gives only “inconsistent” solutions. In this particular case, this is because of the asymmetry of the noise, which breaks the symmetry between $p (F_{j} | F_{⋆})$ and $p (F_{⋆} | F_{j})$ that we had in Sect. 5.1.2.1. This can be seen in Fig. 5.5.b. The frequentist solution is the mirror symmetric of the Bayesian posterior for that reason.

When $m$ increases, the frequentist solution gets closer and closer to the true flux. However, a bootstrapping analysis will reject the true flux in $100 %$ of the cases.

The reason of the frequentist failure. The failure of the frequentist approach is a direct consequence of its conception of probability (cf. e.g. VanderPlas, 2014, for a more detailed discussion and more examples). The frequentist method actually succeeds in returning the result it pretends to give: predicting a confidence interval where the solution would fall 95 % of the time, if we repeated the same procedure a large number of times. This is however not equivalent to giving the credible range where the true value of the parameter has a 95 % probability to be (the Bayesian solution). With the frequentist method, we have no guarantee that the true flux will be in the confidence interval, only the solution. We can see that the main issue with the frequentist approach is that it is difficult to interpret, even in a simple problem such as that of Fig. 5.5. “Bayesians address the question everyone is interested in by using assumptions no-one believes, while Frequentists use impeccable logic to deal with an issue of no interest to anyone” (Lyons, 2013). In the previous citation, the “assumption no-one believes” is the subjective choice of the prior, and the “issue of no interest to anyone” is the convoluted way frequentists formulate a problem, to avoid assigning probabilities to parameters.

$\Rightarrow$ Frequentist results can be inconsistent in numerous practical applications, and they never perform better than Bayesian methods.

5.1.3 Numerical Methods to Solve Bayesian Problems

Bayesian problems are convenient to formulate as they consist in laying down all the data, the model, the noise sources and the nuisance variables to build a posterior, using Bayes’ rule. The Bayesian results are also convenient to interpret as they all consist in using the posterior, which gives the true probability of the parameters. However, in between, estimating the average, standard deviation, correlation coefficients of parameters, or testing hypotheses can be challenging, especially if there are a lot of parameters or if the model is complex. Fortunately, several numerical methods have been introduced to make these tasks simpler. Most of these methods are based on Markov Chain Monte-Carlo (MCMC ⁵), which are a class of algorithms for sampling PDFs.

5.1.3.1 Sampling the Posterior Distribution

Markov Chains Monte-Carlo. A MCMC draws samples from the posterior. In other words, it generates a chain of $N$ values of the parameters, ${\vec{x}}_{k}$ ( $k = 1, \dots, N$ ). These parameter values are not uniformly distributed in the parameter space, but their number density is proportional to the posterior PDF. Consequently, there are more points where the probability is high, and almost none where the probability is low. It has several advantages.

Contrary to grid-based sampling or standard Monte-Carlo techniques, we spend most of our computing time estimating our model (which can be costly) where it matters, and not much time where it does not.

From the MCMC, it is very simple and easy to estimate moments of a given parameter, marginalizing over the other ones. For instance, if we have two parameters,

x

and

y

, the average of

x

, marginalizing over

y

would be:

⟨ x ⟩ \equiv \iint x . p (x, y) d x d y ≃ \frac{1}{N} \sum_{k = 1}^{N} x_{k},

(5.16)

where the second equality is simply the average of the sample. We could do the same for the standard deviation, or the correlation coefficient.

It also makes hypothesis testing very easy. For instance, if we want to know the probability that $x > y$ , we literally compute the fraction of points in the MCMC, where $x_{k} > y_{k}$ .

All these operations would have been much more expensive, in terms of computing time, if we had to numerically solve the integral. In particular, computing the normalization of the whole posterior would have been costly. From a technical point of view, a MCMC is a random series of values of a parameter where the value at step $k$ depends only the value at step $k - 1$ . We briefly discuss below the two most used algorithms. A good presentation of these methods can be found in the book of Gelman et al. (2004) or in the Numerical recipes (Press et al., 2007).

The Metropolis-Hastings algorithm. To illustrate this method and the next one, let’s consider again the measure of the flux of our star, with the difference, this time, that we would be observing it through two different photometric filters.

Let’s call $F_{⋆}$ and $G_{⋆}$ the fluxes in these two bands, with uncertainties $σ_{F} = 12$ and $σ_{G} = 10$ .
Let’s assume that the uncertainties in the filters are correlated, with a correlation coefficient, $ρ = 0.8$ , because, for instance, of the way the calibration was performed.

Let’s assume that we make only

m = 1

measure in each band, with observed fluxes

F_{1} = 42

and

G_{1} = 36

. The posterior, assuming a flat prior, is simply a bivariate normal distribution centered on the observed flux, with covariance matrix:

V = (\begin{matrix} σ_{F}^{2} & ρ σ_{F} σ_{G} \\ ρ σ_{F} σ_{G} & σ_{G}^{2} \end{matrix}) .

(5.17)

The posterior is thus, noting $\vec{x} = (F_{⋆}, G_{⋆})$ and $\vec{d} = (F_{1}, G_{1})$ :

p (\vec{x} | \vec{d}) \propto \exp (- \frac{1}{2} {(\vec{x} - \vec{d})}^{T} V^{- 1} (\vec{x} - \vec{d})) .

(5.18)

Contours of this distribution are represented in Fig. 5.6.a-b.

The algorithm proposed by Metropolis et al. (1953) and generalized by Hastings (1970) is the most popular method to sample any PDF. This is a rejection method, similar to what we have discussed for MCRTs, in Sect. 3.1.1.4 (cf. also Appendix C.2.3.1).

A proposal distribution,

p ({\vec{x}}_{k} | {\vec{x}}_{k - 1})

, first needs to be chosen. The choice of this distribution is instrumental in the sampling efficiency: (i) if it is too wide, a lot of draws will be rejected; (ii) if it is too narrow, the sampling steps are going to be small, and more iterations are going to be necessary to sample the posterior. For our present example, we choose a bivariate normal distribution, centered on

{\vec{x}}_{k - 1}

, whose width is the noise of our data,

s_{F} = σ_{F}

and

s_{G} = σ_{G}

p ({\vec{x}}_{k} | {\vec{x}}_{k - 1}) \propto \exp [- \frac{1}{2} {(\frac{F_{⋆, k} - F_{⋆, k - 1}}{s_{F}})}^{2} - \frac{1}{2} {(\frac{G_{⋆, k} - G_{⋆, k - 1}}{s_{G}})}^{2}]

(5.19)

The method

then consists in the following steps, iterated

N

times.

1.

At each iteration,

k

, we draw a new set of parameters,

{\vec{x}}_{k}

, from the proposal distribution:

{\vec{x}}_{k} \sim p ({\vec{x}}_{k} | {\vec{x}}_{k - 1})

2.

With this new value, we compute the acceptance probability, defined as:

α_{k} = \min (1, \frac{p ({\vec{x}}_{k} | \vec{d}) p ({\vec{x}}_{k - 1} | {\vec{x}}_{k})}{p ({\vec{x}}_{k - 1} | \vec{d}) p ({\vec{x}}_{k} | {\vec{x}}_{k - 1})}) .

(5.20)

The $\min$ function is here to make sure we get a result between 0 and 1. In our case, we have chosen a symmetric proposal distribution. Eq. ( $5.19$ ) thus implies that:

\frac{p ({\vec{x}}_{k - 1} | {\vec{x}}_{k})}{p ({\vec{x}}_{k} | {\vec{x}}_{k - 1})} = 1 \Rightarrow α_{k} = \min (1, \frac{p ({\vec{x}}_{k} | \vec{d})}{p ({\vec{x}}_{k - 1} | \vec{d})}) .

(5.21)

We just need to estimate our posterior at one position (assuming we saved the value of $p ({\vec{x}}_{k - 1} | \vec{d})$ , after the previous iteration). In addition, since we need only the ratio of two points in the posterior, we do not need to normalize it. This is the reason why this algorithm is so efficient.

3.

To update

{\vec{x}}_{k}

, we draw a random variable,

𝜃_{k}

, uniformly distributed between 0 and 1.

If $𝜃_{k} \leq α_{k}$ , we accept the new value, ${\vec{x}}_{k}$ .
If $𝜃_{k} > α_{k}$ , we reject the new value and keep the old one, ${\vec{x}}_{k} = {\vec{x}}_{k - 1}$ .

The initial value of the chain has to be a best guess.

The sampling of Eq. ( $5.18$ ) with the Metropolis-Hastings method is shown in Fig. 5.6.a.

Figure 5.6: Markov Chain Monte-Carlo algorithms. Panels (a) and (b) show contours, up to $5 σ$ , of the posterior distribution of Eq. ( $5.18$ ). In both panels, the blue line with the green dots represent the MCMC, for the first 1000 steps, starting from the magenta star. The burn-in phase is highlighted in cyan. Panels (c) and (d) represent the MCMCs of the two parameters, at the start of the chain. We have highlighted the burn-in phase in cyan. Panels (e) and (f) represent the ACFs of both parameters. We have also quoted the integrated autocorrelation times, $t_{i n t}$ . As indicated, left panels demonstrate the Metropolis-Hastings algorithm, while right panels show Gibbs sampling. Licensed under CC BY-SA 4.0.

Gibbs sampling. The number of parameters, $n$ , determines the dimension of the posterior. The higher this number is, the smaller the support of the function is (i.e. the region where the probability is non negligible). Metropolis-Hastings methods therefore will have a high rejection rate, if $n » 1$ , requiring longer chains to ensure convergence. Gibbs sampling (Geman & Geman, 1984) provides an alternative MCMC method, where all draws are accepted. Its drawback is that it requires normalizing, at each iteration, the full conditional distribution:

p (x_{i} | {\vec{x}}_{-}, \vec{d}) \equiv p (x_{i} | x_{1}, \dots, x_{i - 1}, x_{i + 1}, \dots, x_{n}, \vec{d})

(5.22)

that is the posterior fixing all parameters except one. This is only a one dimensional PDF, though, much less computing-intensive than the whole posterior. The method consists, at each iteration $k$ , to cycle through the different parameters, and draw a new value from Eq. ( $5.22$ ):

x_{i, k} \sim p (x_{i, k} | \underset{already updated}{\underset{︸}{x_{1, k}, \dots, x_{i - 1, k}}}, \underset{not yet updated}{\underset{︸}{x_{i + 1, k - 1}, \dots, x_{n, k - 1}}}, \vec{d}) .

(5.23)

Since this distribution has an arbitrary form, random drawing can be achieved numerically using the CDF inversion method (Appendix C.2.3.2). Fig. 5.6.b represents the Gibbs sampling of Eq. ( $5.18$ ). The squared pattern comes from the fact that we alternatively sample each parameter, keeping the other one fixed.

5.1.3.2 Post-Processing MCMCs

Assessing convergence. One of the most crucial questions, when using a MCMC method, is how long a chain do we need to run. To answer that question, we need to estimate if the MCMC has converged toward the stationary posterior. Concretely, it means that we want to make sure the sampling of the posterior is homogeneous, and that the moments and hypothesis testing we will perform will not be biased, because some areas of the parameter space have only been partially explored. The reason why the sampling of the parameter space might be incomplete is linked to the two following factors.

The burn-in

refers to the first drawn values, before the MCMC could find the support of the posterior. This can be seen in Fig. 5.6.a-b (highlighted in cyan). The arbitrary starting value (magenta star) is outside the

5 σ

contour of the PDF. The MCMC thus walks a few steps before finding the probable region. This burn-in phase is also highlighted in cyan, in Fig. 5.6.c-d. This burn-in phase could actually be significantly longer, for several non-exclusive reasons: (i) a larger number of parameters; (ii) a more degenerate model, with several local maxima; (iii) a less lucky choice of initial conditions; or (iv) a very well-constrained model, resulting in a very small support over the whole parameter space. There is no universal method to identify burn-in, it needs to be investigated carefully, most of the time. Running several MCMCs, starting from initial conditions distributed over the whole parameter space, is usually efficient.

The autocorrelation

of the MCMC results from the fact that the parameter value at step

k + 1

depends on step

k

. If several successive iterations stay in the same region of the posterior, this will create a portion of correlated values. The AutoCorrelation Function (ACF; e.g. Sokal, 1996) is an essential tool to determine the correlation length of a MCMC. The ACF,

\hat{ρ}

, of a given parameter, depends on the lag,

τ

, that is the number of steps between two arbitrary iterations:

\hat{ρ} (τ) \equiv \frac{N}{N - τ} \frac{\sum_{k = 1}^{N - τ} (x_{k} - ⟨ x ⟩) (x_{k + τ} - ⟨ x ⟩)}{\sum_{k = 1}^{N} {(x_{k} - ⟨ x ⟩)}^{2}} .

(5.24)

This is the correlation coefficient of the parameter with itself, shifted by $τ$ steps. The ACFs of our example are displayed in Fig. 5.6.e-f. We can see that the ACF starts at 1, for $τ = 0$ . It then drops over a few steps and oscillates around 0. The typical lag after which the ACF has dropped to 0, corresponds to the average number of steps necessary to draw independent values. This typical lag can be quantified, by the integrated autocorrelation time, $t_{i n t}$ ⁶:

t_{i n t} \equiv 1 + 2 \sum_{i = τ}^{N} \hat{ρ} (τ) .

(5.25)

It is represented in Fig. 5.6.e-f. It corresponds roughly to the average number of steps needed to go from one end of the posterior to the other. Different parameters of a given MCMC can in principle have very different $t_{i n t}$ (e.g. Galliano, 2018). To make sure that our posterior is properly sampled, we thus need to let our MCMC run a large number of steps, times $t_{i n t}$ , after burn-in. The effective sample size, $N_{e ff} \equiv N ∕ t_{i n t}$ , quantifies the effective number of steps that can be considered independent. We need $N_{e ff} » 1$ .

With the Metropolis-Hastings algorithm, the integrated autocorrelation time will depend heavily on the choice of the proposal distribution. We have explored the effect of the width of this distribution on $t_{i n t}$ . In Eq. ( $5.19$ ), instead of taking $s_{F} = σ_{F}$ and $s_{G} = σ_{G}$ , we have varied this parameter. Fig. 5.7.a represents the mean rejection rate as a function of $\sqrt{s_{F} s_{G} ∕ (σ_{F} σ_{G})}$ .

When $\sqrt{s_{F} s_{G} ∕ (σ_{F} σ_{G})} « 1$ , the proposal is much narrower than the posterior. The proposed steps are thus very small, and a lot of them are necessary to cross the posterior. This is why $t_{i n t}$ is very large in this case (cf. Fig. 5.7.b).
When $\sqrt{s_{F} s_{G} ∕ (σ_{F} σ_{G})} » 1$ , the proposal distribution is much larger than the posterior. Most proposals therefore falls outside the support of the posterior. They are therefore rejected close to 100 % of the times. This is why $t_{i n t}$ is also very large in this case.

Fig. 5.7.b shows that the only range where $t_{i n t}$ is reasonable is when the width of the proposal distribution is comparable to the width of the posterior.

Figure 5.7: Importance of the choice of the Metropolis-Hastings proposal distribution. Panel (a) represents the Metropolis-Hastings mean rejection rate, varying the width of the proposal distribution (Eq. $5.19$ ), when sampling Eq. ( $5.18$ ). Panel (b) represents the corresponding integrated autocorrelation time for the two parameters. Licensed under CC BY-SA 4.0.

Parameter inference. Numerous quantities can be inferred from a MCMC. We have previously seen that the average, uncertainties, and various tests can be computed using the posterior of the parameters of a source. This becomes even more powerful when we are analyzing a sample of sources. To illustrate this, let’s assume we are now observing $N_{⋆} = 5$ stars, through the same photometric bands as before. Fig. 5.8.b shows the posterior PDF of the two parameters of the five stars. It is important to distinguish the following two types of distributions.

The posterior of individual stars

are represented in Fig. 5.8. The error bars in panel (b) correspond to the mean and standard deviation of the marginal distributions in panels (a) and (c). They represent the uncertainty on the measured fluxes of each individual star. They are the moments of a given parameter, over the whole MCMC. This is what we have focussed on, until now.

The statistic distribution across the sample

is represented in Fig. 5.9. In panel (a), we have shown the distribution of the standard deviation of the sample, at each step

k

in the MCMC:

σ {(F_{⋆})}_{k} \equiv \frac{1}{N_{⋆} - 1} \sum_{i = 1}^{N_{⋆}} {(F_{i, k} - \frac{1}{N_{⋆}} \sum_{j = 1}^{N_{⋆}} F_{j, k})}^{2},

(5.26)

where $F_{i, k}$ is the observed flux of the star $i$ , at the MCMC iteration $k$ . It is how we can quantify the dispersion of the sample. We can thus quote the sample dispersion as (cf. Fig. 5.9.a):

σ (F_{⋆}) ≃ ⟨ σ (F_{⋆}) ⟩ \pm σ [σ (F_{⋆})] .

(5.27)

In our example, we have: $σ (F_{⋆}) ≃ 13.8 \pm 1.6$ and $σ (G_{⋆}) ≃ 6.8 \pm 1.0$ . We can see that these values correspond roughly to the intrinsic scatter between individual stars, in Fig. 5.8.b, but they are larger than the uncertainty on the flux of individual stars. We can do the same for the correlation coefficient, as shown in Fig. 5.9.b: $ρ (F_{⋆}, G_{⋆}) ≃ - 0.63 \pm 0.15$ . Notice that it is negative, because the correlation between stellar fluxes, in Fig. 5.8.b, points toward the lower right corner. However, the correlation between the uncertainties on $F_{⋆}$ and $G_{⋆}$ is positive: the individual ellipses point in the other direction, toward the upper right corner. We have deliberately simulated data with these two opposite correlations to stress the difference between the individual likelihood properties and those of the ensemble. Finally, we could have done the same type of estimate for the mean of the sample: $⟨ F_{⋆} ⟩ ≃ 52.2 \pm 1.5$ and $⟨ G_{⋆} ⟩ ≃ 26.0 \pm 1.0$ .

Figure 5.8: Post-processing of the MCMC of a sample of sources. In panel (b), the contours represent the posterior of five stars, observed through the two photometric bands (fluxes $F_{⋆}$ and $G_{⋆}$ ). The margin plots represent the marginalized distribution of the posterior of each individual star. The error bars in panel (b) are plotted from the mean and standard deviations of the posterior. Licensed under CC BY-SA 4.0.

Figure 5.9: MCMC statistics of a sample of sources. Panel (a) represents the posterior of the standard deviation of the sample in Fig. 5.8. It represents the PDF of the dispersion of the sample, not the width of individual PDFs. Similarly, panel (b) represents the correlation coefficient of the flux distribution. Licensed under CC BY-SA 4.0.

Quantifying the goodness of a fit. It is important, in any kind of model fitting, to be able to assess the quality of the fit. In the frequentist approach, this is done with the chi-squared test, which is limited in its assumptions to normal iid noise, without nuisance parameters. In the Bayesian approach, the same type of test can be done, accounting for the full complexity of the model (non-Gaussian errors, correlations, nuisance parameters, priors). This test is usually achieved by computing posterior predictive $p$ -values (ppp; e.g. Chap. 6 of Gelman et al., 2004). To illustrate how ppps work, let’s consider now that we are observing the same star as before, through four bands (R, I, J, H) and are performing a blackbody Bayesian fit to this SED, varying the temperature, $T_{⋆}$ , and the dilution factor, $Ω_{⋆}$ . This is represented in Fig. 5.10.b. The principle is the following.

1.

We generate a set of replicated data,

{\vec{d}}_{r e p}

, from our posterior:

p ({\vec{d}}_{r e p} | \vec{d}) \equiv \int p ({\vec{d}}_{r e p} | \vec{x}) p (\vec{x} | \vec{d}) d \vec{x} .

(5.28)

If we sampled our posterior with a MCMC, this integral can simply be computed by evaluating our model (the blackbody, in the present case), for values of our drawn parameters: ${\vec{d}}_{r e p} = {\{f ({\vec{x}}_{k})\}}_{k = 1}^{N}$ .

2.

We evaluate the mean and standard deviation of this replicated data set,

⟨{\vec{d}}_{r e p} | \vec{x}⟩

and

σ ({\vec{d}}_{r e p} | \vec{x})

. These quantities are the average and the dispersion of the predicted flux in the different bands. They will serve as position and scale references, when comparing model and observations.

3.

We compute a discrepancy metric,

T (\vec{d} | \vec{x})

. Several choices are possible, but the most common is to adopt a chi-squared equivalent:

T (\vec{d} | \vec{x}) \equiv \sum_{j = 1}^{m} \frac{{[d_{j} - ⟨d_{j} | \vec{x}⟩]}^{2}}{σ {(d_{j} | \vec{x})}^{2}} .

(5.29)

We compute this quantity both for the replicated set, $T ({\vec{d}}_{r e p} | \vec{x})$ , which is the blue distribution in Fig. 5.10.e, and for the observations, $T (\vec{d} | \vec{x})$ , which is the red line in Fig. 5.10.e (it is a single value). To be clear, only the data term in Eq. ( $5.29$ ) changes between $T ({\vec{d}}_{r e p} | \vec{x})$ and $T (\vec{d} | \vec{x})$ . The quantities $⟨d_{j} | \vec{x}⟩$ and $σ (d_{j} | \vec{x})$ are identical in both cases.

4.

The quality of the test is assessed by comparing both quantities. To that purpose, we compute the following probability:

p_{B} \equiv p (T ({\vec{d}}_{r e p} | \vec{x}) \geq T (\vec{d} | \vec{x}) | \vec{d}) .

(5.30)

If the difference between the replicated set and the data is solely due to statistical fluctuations, we should have on average $p_{B} ≃ 50 %$ . The fit is considered bad, at the $95 %$ level, if $p_{B} < 2.5 %$ or $p_{B} > 97.5 %$ .

We have illustrated this test in Fig. 5.10, varying the number of parameters and observational constraints, in order to explore the different possible cases.

A good fit: is shown in Fig. 5.10.b. We have varied both $T_{⋆}$ and $Ω_{⋆}$ to fit the 4 fluxes. Fig. 5.10.e shows that $T (\vec{d} | \vec{x})$ falls in the high probability range of $T ({\vec{d}}_{r e p} | \vec{x})$ . In other words, the average deviation of the replicated data, relative to the reference we have chosen, $⟨{\vec{d}}_{r e p} | \vec{x}⟩$ , is comparable to the deviation of the actual data relative to the same reference. The observations could thus have likely been drawn from our posterior.
A poor fit: is shown in Fig. 5.10.a. We have intentionally fixed the temperature of the fit at $T_{⋆} = 7000$ K, while the true value is $T_{⋆} = 6000$ K. We see that $p_{B} ≃ 1$ , in Fig. 5.10.d. The difference between the observations and the model, can thus not be explained by the scatter of the model. It is the sign of a bad fit. In our case, it is because the model is bad (wrong choice of fixed temperature).
An overfit: is shown in Fig. 5.10.c. This time we fit only two fluxes. With a chi-squared fit, we would have 0 degrees of freedom. We see that $p_{B} ≃ 0$ , in Fig. 5.10.f. The average model therefore gets too close to the observations. The ppp tells us this is very unlikely. It is however not an issue in terms of derived parameters. We can see that the true model (green) is well among the sampled model (blue), and the inferred parameters are consistent with their true values.

There is no issue with fitting even fewer constraints than parameters, with a Bayesian approach. The consequence is that the posterior is going to be very wide along the dimensions corresponding to the poorly constrained parameters. However, the results will be consistent, and the derived probabilities will be meaningful.

$\Rightarrow$ Contrary to the frequentist approach, we can fit Bayesian models with more parameters than data points.

Figure 5.10: Demonstration of the use of posterior predictive p-values. The top three panels show the fit of an observed stellar SED (red error bars) with a blackbody. The true model ( $T_{⋆} = 6000$ K) is shown in green. The blue lines are a subsample of the inferred model (at different steps in the MCMC). In panel (a), we have fixed the temperature to $T_{⋆} = 7000$ K, different from its true value. We thus vary only $Ω_{⋆}$ , resulting in a poor fit. In panel (b), we let both $T_{⋆}$ and $Ω_{⋆}$ vary, in order to get a good fit. In panel (c), we fit only two fluxes, with our two parameters. We are thus overfitting the data. The three bottom panels show the distribution of the discrepancy metric for the replicated data set (in blue), corresponding to the upper panels. We compare it to $T (\vec{d} | \vec{x})$ , in red. The low probability ranges are hatched in green. This example has been generated using emcee (Foreman-Mackey et al., 2013). Licensed under CC BY-SA 4.0.

5.1.4 Decision Making and Limitations of the Frequentist Approach

We now resume our comparison of the Bayesian and frequentist approaches, started in Sect. 5.1.2. We focus more on the interpretation of the results and synthesize the advantages and inconveniences of both sides.

5.1.4.1 Hypothesis Testing

Until now, we have seen how to estimate parameters and their uncertainties. It is however sometimes necessary to be able to make decisions, that is to choose an outcome or its alternative, based on the observational evidence. Hypothesis testing consists in assessing the likeliness of a null hypothesis, noted $H_{0}$ . The alternative hypothesis is usually noted $H_{1}$ , and satisfies the logical equation $H_{1} = \neg H_{0}$ , where the $\neg$ symbol is the logical negation. The priors necessarily obey $p (H_{0}) + p (H_{1}) = 1$ . To illustrate this process let’s go back to our first example, in Sect. 5.1.2. We are observing a star with true flux $F_{t r u e}$ , $m$ times, with an uncertainty $σ_{F}$ on each individual flux measurements, $F_{i}$ . This time, we want to know if $F_{t r u e} \leq F_{t e s t}$ , for a given $F_{t e s t}$ .

Bayesian hypothesis testing. Bayesian hypothesis testing consists in computing the posterior odds of the two complementary hypotheses:

\underset{p o s t e r i o r o d d s}{\underset{︸}{\frac{p (H_{1} | \vec{d})}{p (H_{0} | \vec{d})}}} = \underset{B a y e s f a c t o r}{\underset{︸}{\frac{p (\vec{d} | H_{1})}{p (\vec{d} | H_{0})}}} \times \underset{prior odds}{\underset{︸}{\frac{p (H_{1})}{p (H_{0})}}} .

(5.31)

The posterior odds is the ratio of the posterior probabilities of the two hypotheses. It is literally the odds we would use for gambling (e.g. a posterior odd of 3 corresponds to a 3:1 odd in favor of $H_{1}$ ). The important term in Eq. ( $5.31$ ) is the Bayes factor, usually noted $B F_{10} \equiv p (\vec{d} | H_{1}) ∕ p (\vec{d} | H_{0})$ . It quantifies the weight of evidence, brought by the data, against the null hypothesis. It tells us how much our observations changed the odds we had against $H_{0}$ , prior to collecting the data. Table 5.1 gives a qualitative scale to decide upon Bayes factors. We see that it is a continuous credibility range going from rejection to confidence. The posterior of our present example, assuming a wide flat prior, is:

p (F_{⋆} | F_{1}, \dots, F_{m}) = \frac{1}{\sqrt{2 π} σ_{F} ∕ \sqrt{m}} \exp (- \frac{1}{2} \frac{{(F_{⋆} - ⟨ F ⟩)}^{2}}{σ_{F}^{2} ∕ m}),

(5.32)

where $⟨ F ⟩ = \sum_{i = 1}^{m} F_{i} ∕ m$ . The posterior probability of $H_{0} = (F_{t r u e} \leq F_{t e s t})$ is then simply:

p (H_{0} | F_{1}, \dots, F_{m}) = \int_{- \infty}^{F_{t e s t}} p (F_{⋆} | F_{1}, \dots, F_{m}) d F_{⋆} = \frac{1}{2} [1 + \erf (\frac{1}{\sqrt{2}} \frac{F_{t e s t} - ⟨ F ⟩}{σ_{F} ∕ \sqrt{m}})] .

(5.33)

It is represented in Fig. 5.11.a. This PDF is centered in $⟨ F ⟩$ , since, as usual in the Bayesian approach, it is conditional on the data. Fig. 5.11.a shows the complementary posteriors of $H_{0}$ (red) and $H_{1}$ (blue), which are the incomplete integrals of the PDF. When we vary $F_{t e s t}$ , the ratio of the two posteriors, $B F_{10}$ , changes. Assuming we have chosen a very wide, flat prior, such that $p (H_{1}) ∕ p (H_{0}) ≃ 1$ , the Bayes factor becomes:

B F_{10} ≃ \frac{1}{p (H_{0} | F_{1}, \dots, F_{m})} - 1 .

(5.34)

Fig. 5.11.b represents the evolution of the Bayes factor as a function of the sample size, $m$ . In this particular simulation, $H_{0}$ is false. We see that, when $m$ increases, we accumulate evidence against $H_{0}$ , going through the different levels of Table 5.1. The evidence is decisive around $m ≃ 60$ , here.


Bayes factor, BF $_{10}$	Strength of evidence against H $_{0}$

1–3.2	Barely worth mentioning

3.2–10	Substantial

10–32	Strong

32–100	Very strong

>100	Decisive

Table 5.1: Jeffreys strength of evidence scale. This scale translates the value of a Bayes factor in a qualitative decision (Jeffreys, 1939). Below one, we consider

B F_{01}

instead, and discuss the evidence in favor of

H_{0}

Figure 5.11: Bayesian and frequentist hypothesis testing. Panel (a) represents the calculation of the posterior odds (Eq. $5.31$ ). Notice the distribution is centered on the average of the observed fluxes, $⟨ F ⟩$ , and we consider $F_{t e s t}$ as a variable. We want to assess the null hypothesis, $H_{0} = (F_{t r u e} \leq F_{t e s t})$ . In the particular example we have plotted, $H_{0}$ is false. The Bayes factor is simply the ratio of the blue and red areas. Panel (c) represents NHST for the same problem. Notice, this time, the distribution is centered on $F_{t e s t}$ , which is considered fixed, and the observations, $⟨ F ⟩$ , are considered variable. We have plotted the two-tailed $p$ -value decisions. The alternative to rejection is not acceptance, but the absence of significance. The two right panels illustrate the variation of the Bayes factors and $p$ -values, as a function of the sample size, $m$ . In these particular examples, we have chosen a fixed $F_{t e s t} = 38$ and $F_{t r u e} = 42$ , such that $H_{0}$ is false. The uncertainty is $σ_{F} = 14$ . The more we have data, the more we can constrain the solution. We can see that both Bayesian and frequentist methods conclude the right solution ( $H_{0}$ is false), around the same sample size ( $m ≃ 60 - 70$ in our case). Below this value, both methods are inconclusive. The Bayesian credibility scale is however more continuous than the frequentist NHST, though. The latter tells us our data are useless below $m ≃ 70$ . To avoid stochastic fluctuations, that would make the figure less easy to read, we have randomly drawn 10 000 samples for each value of $m$ and we show the averages. Licensed under CC BY-SA 4.0.

Frequentist hypothesis testing. Instead of computing the credibility of $H_{1}$ against $H_{0}$ , the frequentist approach relies on the potential rejection of $H_{0}$ . It is called Null Hypothesis Significance Test (NHST; e.g. Ortega & Navarrete, 2017). It consists in testing if the observation average, $⟨ F ⟩$ , could have confidently been drawn out of a population centered on $F_{t e s t}$ . In that sense, it is the opposite of the Bayesian case. This is demonstrated in Fig. 5.11.c. We see that the distribution is centered on $F_{t e s t}$ , which is assumed fixed, and the observations, $⟨ F ⟩$ , are assumed variable. Since frequentists can not assign a probability to $F_{t e s t}$ , they estimate a $p$ -value, that is the probability of observing the data at hands, assuming the null hypothesis is true. It is based on the following statistics:

t_{m} \equiv \frac{⟨ F ⟩ - F_{t e s t}}{σ_{F} ∕ \sqrt{m}} .

(5.35)

The $p$ -value of this statistics is then simply:

p_{0} = \frac{1}{2} [1 + \erf (\frac{t_{m}}{\sqrt{2}})] .

(5.36)

Notice it is identical to Eq. ( $5.33$ ), but different from the Bayes factor (Eq. $5.34$ ). It however does not mean the same thing. A significance test at $p_{0} = 0.05$ does not tell us that the probability that the null hypothesis is $5 %$ . It means that the null hypothesis will be rejected $5 %$ of the time ⁷. NHST decision making is represented in Fig. 5.11.c. It shows one of the most common misconceptions about NHST: the absence of rejection of $H_{0}$ does not mean that we can accept it. It just mean that the results are not significant. Accepting $H_{0}$ requires rejecting $H_{1}$ , and vice versa. Fig. 5.11.d shows the effect of sample size on the $p$ -value, using the same example as we have discussed for the Bayesian case. The difference is that the data are not significant until $m ≃ 70$ .

The Jeffreys-Lindley’s paradox. Although the method and the interpretations are different, Bayesian and frequentist tests give consistent results, in numerous applications. There are however particular cases, where both approaches are radically inconsistent. This ascertainment was first noted by Jeffreys (1939) and popularized by Lindley (1957). Lindley (1957) demonstrated the discrepancy on an experiment similar to the example we have been discussing in this section, with the difference that a point null hypothesis is tested: $H_{0} = (F_{t r u e} = F_{t e s t})$ . Lindley (1957) shows that there are particular cases, where the posterior probability of $H_{0}$ is $1 - α$ , and $H_{0}$ is rejected at the $α$ level, at the same time. This “statistical paradox”, known as the Jeffreys-Lindley’s paradox has triggered a vigorous debate, that is still open nowadays (e.g. Robert, 2014). The consensus about the paradox is that there is no paradox. The discrepancy simply arises from the fact that both approaches answer different questions, as we have been illustrating at several occasions in this chapter, and that these different interpretations can sometimes be inconsistent.

The recent controversy about frequentist significance tests. NHST has recently been at the center of an important controversy across all empirical sciences. We have already discussed several of the issues with frequentist significance tests. Let’s summarize them here (e.g. Ortega & Navarrete, 2017).

1.: Frequentist tests are conditional on model parameters and thus consider data that have not actually been observed. The general frequentist approach is difficult to grasp, even for advanced statisticians. It can easily lead to false interpretations.
2.: NHST is prone to overestimates and can state effects even if none exist. If we repeat an experiment a sufficient number of times, we will always end up rejecting $H_{0}$ . This potentially leads to a large number of false positives.
3.: There is a variety of statistics that one can test, and they are not all going to give the same result. In addition, the significance level is subjective and there are no clear rules how to choose the $p$ -value ( $p = 0.05$ , $p = 0.01$ , etc.). There is therefore some subjectivity in the frequentist approach. It is not in the prior, it is in the significance assessment.

Data dredging or $p$ -hacking has come into the spotlight during the last twenty years, although it was known before (e.g. Smith & Ebrahim, 2002; Simmons et al., 2011; Head et al., 2015). It points out that numerous scientific studies could be wrong, and several discoveries could have been false positives. This is particularly important in psychology, medical trials, etc., but could affect any field using $p$ -values. In 2016, the American Statistical association published a “Statement on statistical significance and $p$ -values” (Wasserstein & Lazar, 2016), saying that: “widespread use of ’statistical significance’ (generally interpreted as ’p<0.05’) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process”. They suggested “moving toward a ’post p<0.05’ era”. While some recommendations have been proposed to use $p$ -values in a more controlled way (e.g. Simmons et al., 2011), by deciding the sample size and significance level before starting the experiment, some researchers have suggested abolishing NHST (e.g. Loftus, 1996; Anderson et al., 2000). Several journals have stated that they will no longer publish articles reporting $p$ -values (e.g. Basic & Applied Social Psychology, in 2015, and Political Analysis, in 2018).

$\Rightarrow$ Frequentist $p$ -values are to be used with caution.

5.1.4.2 Pros and Cons of the two Approaches

We finish this section by summarizing the advantages and inconveniences of both approaches. This comparison is synthesized in Table 5.2.


Bayesian approach	Frequentist approach

CON choice of prior is subjective	PRO likelihood is not subjective

PRO can account for non-Gaussian errors, nuisance parameters, complex models & prior information	CON very limited in terms of the type of noise, the complexity of the model & can not deal with nuisance parameters

PRO the posterior makes sense (conditional on the data) & is easy to interpret	CON samples non-observed data, arbitrary choice of estimator & $p$ -value

PRO probabilistic logic $\Rightarrow$ continuum between skepticism & confidence	CON boolean logic $\Rightarrow$ a proposition is either true or false, which leads to false positives

PRO based on a master equation (Bayes’ rule) $\Rightarrow$ easier to learn & teach	CON difficult to learn & teach (collection of ad hoc cooking recipes)

CON heavy computation	PRO fast computation

PRO works well with small samples & heterogeneous data sets	CON does not work well with small samples, can not mix samples & require fixing the sample size and significance level before experimenting

PRO holistic & flexible: can account for all data & theories	CON strict: can account only for data related to a particular experiment

PRO conservative	CON can give ridiculous answers

Table 5.2: Pros and cons of the Bayesian and frequentist methods.

Hypotheses and information that can be taken into account. The two methods diverge on what information can be included in the analysis.

Bayesian

models can account for a maximum amount of information:

all types of noise and uncertainties, whether non-Gaussian, partially or fully correlated (cf. Sect. 5.1.2.3);
nuisance parameters, that are parameters necessary to estimate the model, but whose values are not relevant (we will give an example in Sect. 5.3.3.1);
complex, non-linear models with more parameters than observational constraints (cf. Sect. 5.1.3.2);
any kind of prior information (cf. Sect. 5.1.2.2).

Frequentist

analysis is limited in that way:

complex, correlated noise is difficult to include in the likelihood;
nuisance parameters can not be included and we can not have more parameters than observations;
complex models, with a lot of degeneracies fail frequentist approaches, such as maximum likelihood methods;
no prior information can be included.

In favor of the frequentist approach, we can note that the likelihood is perfectly objective, whereas the choice of the prior is subjective. The subtlety is however that this choice is subjective, as it depends on the knowledge we believe we have prior to the observation, but it is not arbitrary, as a prior can be rationally constructed. In addition, when the strength of evidence is large, the prior becomes unimportant. The prior is important only when the data are very noisy or unconvincing. In that sense, the prior does not induce a bias of confirmation.

Analysis and interpretation. As we have seen throughout Sect. 5.1, the point of view of the two approaches is very different.

Inference: is performed on the posterior, in the Bayesian approach. It gives the probability of the parameters, knowing the data. It makes sense and is easy to interpret. On the contrary, frequentists sample data that have not actually been observed and base their results on arbitrary choices of statistics and estimators. The results are difficult to interpret, as they consist in describing what could be the observations, for a given set of model parameters (cf. Sect. 5.1.4.1).
The underlying logic: is probabilistic, in the Bayesian approach. There is a continuum between skepticism and confidence that makes any data worth taking into account. Bayes factors quantify the strength of evidence brought by these data (cf. Sect. 5.1.4.1). On the contrary, frequentist logic is Platonic, a proposition is either true or false. With real-life uncertainties and stochasticity, this leads to false positives. By refusing to assign probabilities to parameters and hypotheses, frequentists rely on $p$ -values, while those are only one particular tool of the Bayesian analysis (e.g. Sect. 5.1.3.2). Credibility is progressive whereas significance is dichotomic.
Learning and teaching: of the Bayesian method is considerably easier, because it is based on a master equation, Bayes’ rule. All Bayesian problems start with Eq. ( $5.5$ ), which is developed to account for all the details we are modeling. On the contrary, frequentists methods are a collection of ad hoc cooking recipes, whose derivation is often obscure (e.g. R. Fisher’s book, “Statistical methods for research workers”;Fisher, 1925).
Computation: of Bayesian problems is intensive, as sampling the posterior is challenging (cf. Sect. 5.1.3). One of the advantages of frequentist methods is that they are usually fast. Even a Bayesian can use them, for instance, to find good MCMC starting points (e.g. Galliano, 2018, Sect. 4.2.1) or compute quick estimators (e.g. Galliano et al., 2021, Appendix F.2).

Overall applicability. In practice, choosing one approach over the other depends on the situation. There are however a lot of arguments in favor of the Bayesian point of view.

Sample size and data collection: are one of the major issues with frequentist methods. We have seen that NHST was problematic in that aspect, and that the stopping condition of an experiment could bias its significance. It is recommended to use large samples, and decide of the size and significance level, before conducting the measures. Consequently, contrary to Bayesians, frequentists can not (i) analyze partial data sets, as the stopping point could be instrumental in forcing one outcome over another (concept of $p$ -hacking discussed in Sect. 5.1.4.1), or (ii) combine heterogeneous data sets, as the limiting frequency would not have any sense.
Consistency: of the results is also an issue with the frequentist approach, as it can only account for data related to a particular experiment. On the contrary, the Bayesian approach is more flexible. We will even discuss in Sect. 5.3.3.2 that it is potentially holistic. It is at the same time more conservative, as the prior tends to prevents aberrant results, while we have seen that frequentist methods can give ridiculous answers (cf. Sect. 5.1.2.3).

$\Rightarrow$ For all these reasons, the Bayesian approach is more well-suited for most problems encountered in empirical sciences.

5.2 Bayesianism, an Alternative to Popper’s Scientific Method

The Bayesian and frequentist approaches lead to radically different epistemological points of view, that have important consequences on the way we study ISD. We start by briefly brushing the history of the competition between these two systems. We then discuss their consequences on the scientific method.

5.2.1 Bayes Formula Throughout History

The History of the introduction of probability in sciences and the subsequent competition between Bayesians and frequentists is epic. The book of McGrayne (2011) gives an invaluable overview of this controversy, that started two centuries ago.

5.2.1.1 The Origins

The emergence of the concept of probability. In antique societies, randomness was interpreted as the language of the gods. Hacking (2006) argues that the notion of probability emerged around 1660, in western Europe. Before this date, “probable” meant only “worthy of approbation” or “approved by authority”. In a few years, during the Renaissance, there was a transition of the meaning of “probable” from “commonly approved” to “probed by evidence”, what Gaston BACHELARD would have called an epistemological break. The time was ready for the idea. The Thirty Years’ War (1618–1648), which had caused several millions of deaths throughout western Europe, had just ended. It consolidated the division into Catholic and Lutheran states of a continent that had been religiously homogeneous for almost a thousand years. “Probabilism is a token of the loss of certainty that characterizes the Renaissance, and of the readiness, indeed eagerness, of various powers to find a substitute for the older canons of knowledge. Indeed the word ’probabilism’ has been used as a name for the doctrine that certainty is impossible, so that probabilities must be relied on” (Hacking, 2006, page 25). The first book discussing the concept of probability, applied to games of fortune, was published in 1657 by Christiaan HUYGENS (Huygens, 1657). It is however Blaise PASCAL (cf. Fig. 5.12.a) who is considered the pioneer in the use of probability as a quantification of beliefs. His wager ⁸ is known as the first example of decision theory (Pascal, 1670).

Figure 5.12: The probability pioneers. Credit: (a) Wikipedia, public domain; (b) Wikipedia, public domain; (c) Wikipedia, public domain.

The discovery of Bayes. Thomas BAYES (cf. Fig. 5.12.b) was an XVIII $^{t h}$ English Presbyterian minister. Coming from a nonconformist family, he had read the work of Isaac NEWTON, David HUME and Abraham DE MOIVRE (McGrayne, 2011). His interest in game theory led him to imagine a thought experiment.

His thought experiment

was developed between 1746 and 1749. He was trying to infer the position of a ball on a pool table behind him, that he could not see. His idea was to be able to start from a guess and refine it using some information.

1.: His assistant would throw on the table a first ball, whose position is to be inferred.
2.: His assistant would then throw a second ball and tell him if it landed on the left or the right of the first one.
3.: This procedure would be repeated until Bayes could infer the quadrant were the first ball is.

He derived Eq. ( $5.5$ ) to solve this problem.

The essay

presenting his formula (Bayes, 1763) was published after his death by Richard PRICE. Bayes defined the probability of an event as “the ratio between the value at which an expectation depending on the happening of the event ought to be computed, and the value of the thing expected upon its happening”. Richard PRICE added that his formula provides the probability of causes and can thus be applied to prove the existence of God.

The contribution of Laplace. Pierre-Simon LAPLACE (cf. Fig. 5.12.c) was the son of a small estate owner, in Normandie. His father pushed him towards a religious career, that led him to study theology. He however quit at age 21 and moved to Paris, where he met the mathematician Jean LE ROND D’ALEMBERT, who helped him to get a teaching position. Laplace then had a successful scientific and political career (cf. Hahn, 2005, for a complete biography). Among his many other scientific contributions, Laplace is the true pioneer in the development of statistics using Bayes’ rule. Some authors even argue that we should call the approach presented in Sect. 5.1 “Bayesian-Laplacian” rather than simply “Bayesian”. After having read the memoir of Abraham DE MOIVRE, he indeed understood that probabilities could be used to quantify experimental uncertainties. His 1774 memoir on “the probability of causes by events” (Laplace, 1774) contains the first practical application of Bayes’ rule. His rule of succession, giving the probability of an event knowing how many times it happened previously, was applied to give the probability that the Sun will rise again. Laplace rediscovered Bayes’ rule. He was only introduced to Bayes’ essay in 1781, when Richard PRICE came to Paris. Laplace had a Bayesian conception of probabilities: “in substance, probability theory is only common sense reduced to calculation; it makes appreciate with accuracy what just minds can feel by some sort of instinct, without realizing it” (Laplace, 1812).

5.2.1.2 The Frequentist Winter

The rejection of Laplace’s work. The frequentist movement was initiated by British economist John Stuart MILL, only ten years after the death of Laplace. There were several reasons for this reaction (Loredo, 1990; McGrayne, 2011).

The idea that probability should represent a degree of plausibility seemed too vague to provide the foundation for a mathematical theory. People also realized there was no clear way to assign priors.
The computation of Bayesian solutions by hand was crippling.
Laplace was despised in England for his support to Napoléon.

Mill’s disdain for the Bayesian approach was unhinged: “a very slight improvement in the data, by better observations, or by taking into fuller consideration the special circumstances of the case, is of more use than the most elaborate application of the calculus of probabilities founded on the data in their previous state of inferiority. The neglect of this obvious reflection has given rise to misapplications of the calculus of probabilities which have made it the real opprobrium of mathematics.” (Mill, 1843). The early anti-Bayesian movement was led by English statistician Karl PEARSON (cf. Fig. 5.13.a). Pearson developed: (i) the chi-squared test; (ii) the standard-deviation; (iii) the correlation coefficient; (iv) the $p$ -value; (v) the Principal Component Analysis (PCA). His book, “The Grammar of Science” (Pearson, 1892), was very influential, in particular to the young Albert EINSTEIN. Despite these great contributions, Pearson was a social Darwinist and a eugenicist.

Figure 5.13: The frequentist promoters. Credit: (a) Wikipedia, public domain; (b) Bibmath, public domain; (c) Wikipedia, licensed under CC BY-SA 2.0 DE.

The golden age of frequentism (1920-1930). Ronald FISHER (cf. Fig. 5.13.b) followed the way opened by Pearson. He is the most famous representative of the frequentist movement. He developed: (i) the maximum likelihood; (ii) NHST; (iii) the F-distribution and the F-test. His 1925 book, “Statistical Methods for Research Workers” (Fisher, 1925), was widely used in academia and industry. Despite the criticism we can address to the frequentist approach, Fisher’s contributions gave guidelines to rigorously interpret experimental data, that brought consistency to science. Fisher, who was like Pearson a eugenicist, was also paid as a consultant by the “Tobacco Manufacturer’s Standing Committee”. He spoke publicly against a 1950 study showing that tobacco causes lung cancer, by resorting to “correlation does not imply causation” (Fisher, 1957). Besides Pearson and Fisher, Jerzy NEYMAN (cf. Fig. 5.13.c) was also a prominent figure of frequentism at this time. These scientists, also known for their irascibility, made sure that nobody revived the methods of Laplace. McGrayne (2011) estimates that this golden era culminated in the 1920s-1930s.

The Bayesian resistance. Several prominent scientists, who were not intimidated by Fisher and his colleagues, perpetuated the Bayesian approach (McGrayne, 2011). Among them, we can cite the following two.

Harold JEFFREYS: (cf. Fig. 5.14.a) was a British geophysicist and mathematician. In 1926, he performed a Bayesian analysis of earthquake records and inferred that the Earth had a liquid core. This discovery could not have been possible with the frequentist approach, as the data were very scarce. Jeffreys initiated the Bayesian revival and was an early critic of NHST. His book, “Theory of probability” (Jeffreys, 1939), popularized the use of Bayes factors, as we have seen in Sect. 5.1.4.1.
Alan TURING: (cf. Fig. 5.14.b) was the founder of theoretical computer science and artificial intelligence. He also brilliantly put into practice Bayesian techniques during World War II. He secretly worked at Bletchley Park, near London, to decode the communications between German U-boats, that were using the cryptographic Enigma machine. Turing built a mechanical computer, called “The Bomb”, which he used to test combinations. He used Bayesian priors to reduce the number of combinations, looking for frequent German words and meteorological terms. He even developed a unit quantifying the weight of evidence (Bayes factor), named the “ban”, after the city of Banbury where punch cards were printed.

Figure 5.14: The Bayesian resistance. Credit: (a) Encyclopedia Britannica, used for non-commercial, educational purpose; (b) Wikipedia, public domain; (c) Wikipedia, public domain.

5.2.1.3 The Bayesian Renaissance

After World War II. The first computers were built during the war. Bayesian techniques were now becoming feasible. Their use rapidly increased in the early 1950s. McGrayne (2011) note a few of them.

Competitive businesses,: that usually favor pragmatic solutions over ideology, turned naturally to the Bayesian approach. Arthur BAILEY was a pioneer in applying Bayes’ rule to estimate insurance premiums. This was the only way to calculate the probability of catastrophic events, that had never happened before. This is the type of situations were frequentist methods are unusable. Howard RAIFFA and Robert SCHLAIFER taught Bayes’ rule for business and decision-making.
Nuclear security: is another example of a discipline requiring to estimate the probability of rare, unprecedented events. Frequentist studies concluded that nuclear plant incidents were unlikely, but would be catastrophic if they happened. In the 1970s, Norman RASMUSSEN estimated, in a Bayesian way, that it was the opposite: they were likely, but not necessarily catastrophic. The Three Mile Island incident (1979) proved him right. In another area, Bayesian search algorithms were used to find lost nuclear bombs and Russian nuclear submarines.
Epidemiological: studies, using Bayesian techniques, were pioneered by Jerome CORNFIELD, who ridiculed Fisher’s attempt at minimizing the link between smoking and lung cancer.
In academia,: Dennis LINDLEY and Jimmie SAVAGE were actively promoting Bayesian methods and showing the limitations of frequentist techniques, claiming that “Fisher is making Bayesian omelet without breaking Bayesian eggs”.

The great numerical leap forward. In the 1970s, the increasing power of computers opened new horizons to the Bayesian approach. The Metropolis-Hastings algorithm (cf. Sect. 5.1.3; Hastings, 1970) provided a fast, easy-to-implement method, which rendered Bayesian techniques more attractive. Gibbs sampling (cf. Sect. 5.1.3; Geman & Geman, 1984), which can be used to solve complex problems, put Bayes’ rule into the spotlight. We can note the following achievements.

The human genome: was decoded using Bayesian methods (Beaumont & Rannala, 2004).
In astrophysics,: Bayesian techniques were first applied to analyze the neutrino flux from SN1987A (Loredo & Lamb, 1989).

At the same time, Edwin JAYNES (cf. Fig. 5.14.c) was working at solidifying the mathematical foundations of the Bayesian approach. It culminated in his posthumous book, “Probability Theory: The Logic of Science” (Jaynes, 2003).

Bayesian techniques, nowadays. Looking at the contemporary literature, it appears that Bayesians have won over frequentists ⁹. In a lot of cases, this is however only a fashion trend due to the fact that the word “Bayesian” became hip in the 2010s, in astrophysics. There are already misuses of Bayesian methods. This is unavoidable. This might result from the fact that there is still a generation of math teachers and the majority of statistical textbooks ignoring the Bayesian approach. The difference with $p$ -hacking is that Bayesian-hacking is easier to spot, because interpreting posterior distributions is less ambiguous than NHST. The supremacy of Bayesian techniques is ultimately demonstrated by the success of Machine-Learning (ML). ML has Bayesian foundations. It is a collection of probabilistic methods. Using ML can be seen as performing posterior inference, based on the evidence gathered during the training of the neural network.

5.2.2 Bayesian and Popperian Epistemologies

The following epistemological considerations have been expressed in several texts (e.g. Good, 1975; Loredo, 1990; Jaynes, 2003; Hoang, 2020). The book of Jaynes (2003) is probably the most rigorous on the subject, while the book of Hoang (2020) provides an accessible overview.

5.2.2.1 The Epistemological Debate at the Beginning of the XX $^{t h}$ Century

Epistemology treats several aspects of the development of scientific theories, from their imagination to their validation. We will not discuss here how scientists can come up with ground-breaking ideas. We will only focus on how a scientific theory can be experimentally validated.

Scientific positivism. The epistemological point of view of Auguste COMTE (cf. Fig. 5.15.a) had a considerable influence on the XIX $^{t h}$ century epistemology, until the beginning of the XX $^{t h}$ century. Comte was a French philosopher and sociologist, who developed a complex classification of sciences and theorized their role in society. What is interesting for the rest of our discussion is that he was aiming at demarcating sciences from theology and metaphysics. He proposed that we need to renounce to understand the absolute causes (why), to focus on the mathematical laws of nature (how). In that sense, his system, called positivism, is not an empiricism (e.g. Grange, 2002). Comte stressed that a theoretical framework is always necessary to interpret together different experimental facts. His approach is a reconciliation of empiricism and idealism, where both viewpoints are necessary to make scientific discoveries. Positivism is not scientism, either. Comte’s view was, in substance, that science provides a knowledge that is rigorous and certain, but at the same time only partial and relative.

Figure 5.15: Three figures of modern epistemology. Credit: (a) Wikipedia, public domain; (b) Est Republicain, public domain; (c) Wikipedia, not licensed.

Conventionalism and verificationism. At the beginning of the XX $^{t h}$ century, two complementary epistemological points of view were debated.

Conventionalism: was represented, in physics and mathematics, by Henri POINCARÉ (cf. Fig. 5.15.b). This philosophy considers that human intuitions about the physical world are possibly flawed. Some of our scientific principles are only conventions. Poincaré, having worked on Lobachevskian geometries, was taking Euclidean geometry as an example (e.g. Bland, 2011). These conventions should thus be chosen so that they agree with the physical reality.
Verificationism: is another doctrine deriving from scientific positivism (e.g. Okasha, 2001). It postulates that a proposition has a cognitive meaning only if it can be verified by experience.

5.2.2.2 Popper’s Logic of Scientific Discovery

We now review the epistemology developed by Karl POPPER (cf. Fig. 5.15.c), which is the center of our discussion. Popper was an Austrian-born British philosopher who had a significant impact on the modern scientific method ¹⁰. His concepts of falsifiability and reproducibility are still considered as the standards of the scientific method, nowadays. His major book, “The Logic of Scientific Discovery” (Popper, 1959), was originally published in German in 1934, and rewritten in English in 1959. Before starting, it is important to make the distinction between the following two terms.

Deduction: is inferring the truth of a specific case from a general rule. For instance, knowing that SN Ia occur only in binary systems, if we observe a SN Ia, we can deduce that it originates from a binary system.
Induction: is inferring a general conclusion based on individual cases. For instance, if we observe a few SNRs with massive amounts of freshly-produced dust, we can induce that SNe produce massive amounts of dust.

The criticism of induction. Popper’s reflection focusses on the methods of empirical sciences. The foundation of his theory is the rejection of inductive logic, as he deems that it does not provide a suitable criterion of demarcation, that is a criterion to distinguish empirical sciences from mathematics, logic and metaphysics. According to him, the conventionalist and verificationist approaches are not rigorous enough. Popper argues that only a deductivist approach provides a reliable empirical method: “hypotheses can only be empirically tested and only after they have been advanced” (Popper, 1959, Sect. I.1). Deductivism is however not sufficient in Popper’s mind.

Falsifiability. Popper criticizes conventionalists who “evade falsification by using ad hoc modifications of the theory”. For that reason, verifiability is not enough. Empirical theories must be falsifiable, that is they must predict experimental facts that, if empirically refuted, will prove them wrong. His system, which was afterward called falsifiabilism, is the combination of: (i) deductivism; and (ii) modus tollens. The modus tollens is the following logical proposition:

((T \Rightarrow D) \land \neg D) \Rightarrow \neg T .

(5.37)

Put in words, it can be interpreted as: if a theory T predicts an observational fact D, and this fact D happens to be wrong, then we can deduce that the theory T is wrong. This principle is to be strictly applied: “one must not save from falsification a theory if it has failed” (Popper, 1959, Sect. II.4). Let’s take a pseudo-science example to illustrate Popper’s point. Let’s assume that a “ghost expert” pretends a ghost inhabits a given haunted house. The verificationist approach would consist in saying that, to determine if there is really a ghost, we need to go there and see if it shows up. Popper would argue that, if the ghost did not appear, our expert would claim that it was because it was intimidated or it felt our skepticism. Falsifiabilism would dictate to set experimental conditions beforehand by agreeing with the expert: if the ghost does not appear in visible light, in this house, at midnight, on a particular day, then we will deduce that this ghost theory is wrong. The ghost expert would probably not agree with such strict requirements. Popper would thus conclude that ghostology is not an empirical science.

Reproducibility. A difficulty of the empirical method is relating perceptual experiences to concepts. Popper argues that the objectivity of scientific statements lies in the fact that they can be inter-subjectively tested. In other words, if several people, with their own subjectivity, can perform the same empirical tests, they will rationally come to the same conclusion. This requires reproducibility. Only repeatable experiments can be tested by anyone. Reproducibility is also instrumental in avoiding coincidences.

Parsimony (Ockham’s razor). A fundamental requirement of scientific theories is that they should be the simplest possible. Unnecessarily complex theories should be eliminated. This is the principle of parsimony. Popper is aware of that and includes it in his system. This is however not the most convincing point of his epistemology. His idea is that a simple theory is a theory that has a high degree of falsifiability, which he calls “empirical content” (Popper, 1959, Sect. II.5). In other words, according to him, the simplest theories are those that have the highest prior improbability, whereas complex theories tend to have special conditions that help them evade falsification.

Popper’s epistemology is frequentist. It is obvious that Popper’s system has a frequentist frame of mind. It was indeed conceived at the golden age of frequentism (cf. Sect. 5.2.1.2).

The requirement of falsifiability: is reminiscent of NHST (cf. Sect. 5.1.4.1). Frequentists only accept a hypothesis by rejecting (i.e. falsifying) its alternative. Rigorous NHST requires setting up the detailed experimental procedure, the stop condition and the significance of the outcome beforehand. This is exactly what Popper requires to make sure the conditions of falsifiability are not tampered with. This makes it impossible to test several theories with a single experiment, contrary to the Bayesian approach.
Platonic logic: is at the center of Popper’s epistemology. The last third of his book (Popper, 1959) is actually devoted to probabilities. He favors their objective interpretation, in terms of frequency. This is because he does not even conceive the possibility to assign a probability to a theory. This is also the weakness of the frequentist approach (cf. Sect. 5.1).
Repeatability: is the necessary condition to satisfy the assumption that experimental uncertainties are the limiting frequency of the result’s fluctuations. This also makes it impossible to account for sparse or unique constraints.
Accumulation of knowledge: is impossible in this approach, as each individual experiment must be considered independently. It is impossible to account for prior knowledge. The Popperian approach, if it was actually applied by scientists, would even lead to a decrease of knowledge. Indeed, if we indefinitely try to falsify a theory, we will end up by rejecting it, just by luck ¹¹.

5.2.2.3 Verifiability, Holisticity and Parsimony: the Bayesian Alternative

We now discuss how the Bayesian approach provides an alternative to Popper’s epistemology. Jaynes (2003) demonstrates that probabilities, in the Bayesian sense, could be the foundation of a rigorous scientific method. Hoang (2020) even argues that Bayes’ rule is the optimal way to account for experimental data.

Falsifiability and the limits of Platonic logic. The refusal of Popper and frequentists to adopt probabilistic logic is the reason why their decision upon experimental evidence is so convoluted. The application of Platonic logic to the physical reality indeed presents some issues. One of the most famous aporias is “Hempel’s paradox” (Hempel, 1945). It states the following.

Hempel’s paradox:

if a logical proposition is true, its contraposition is necessarily true:

\underset{proposition}{\underset{︸}{(A \Rightarrow B)}} \Leftrightarrow \underset{contraposition}{\underset{︸}{(\neg B \Rightarrow \neg A)}} .

(5.38)

The example taken by Hempel (1945), is “all ravens are black” ( $raven \Rightarrow black$ ). The contraposition is “anything that is not black is not a raven” ( $\neg black \Rightarrow \neg raven$ ). From an experimental point of view, if we want to corroborate that all ravens are black, we can either: (i) find black ravens (i.e. verifying the proposition); or (ii) find anything that is neither black nor a raven, such as a red apple (i.e. verifying the contraposition). The second solution is obviously useless in practice. To take an astrophysical example, finding a quiescent H I cloud would be considered as a confirmation that star formation occurs only in H₂ clouds. This is one of the reasons why Popper requires falsifiability.

A Bayesian solution

to the paradox was proposed by Good (1960, who was Turing’s collaborator at Bletchley Park). First, in probabilistic logic:

(A \Rightarrow B) \Leftrightarrow p (B | A) = 1 .

(5.39)

The difference is that, with probabilities, we can deal with uncertainty, that is $0 < p (B | A) < 1$ . Second, Bayes factors quantify the strength of evidence (cf. Sect. 5.1.4.1), and it is different in the case of a black raven or a red apple. Good (1960) shows that the strength of evidence is negligible in the case of a red apple. The Bayesian solution is thus the most sensible one. Bayes factors are therefore the tool needed to avoid requiring falsifiability. We can adopt a verificationist approach and discuss if our data brought significant evidence.

$\Rightarrow$ Bayesianism does not require falsifiability. Bayes factors provide a way to quantify the strength of evidence brought by any data set.

Parsimony is hard-coded in Bayes’ rule. The principle of parsimony is directly implied by the use of Bayes factors. To illustrate this point, let’s assume we are fitting a two-parameter model to a data set, $\vec{d}$ , and we want to know if we can fix the second parameter (model $M_{1}$ ) or let it free (model $M_{2}$ ). The Bayes factor (Eq. $5.31$ ) is simply:

B F_{21} = \frac{p (\vec{d} | M_{2})}{p (\vec{d} | M_{1})} = \frac{\iint p (\vec{d} | x_{1}, x_{2}) p (x_{1}, x_{2} | M_{2}) d x_{1} d x_{2}}{\int p (\vec{d} | x_{1}) p (x_{1} | M_{1}) d x_{1}} .

(5.40)

Let’s assume that we have a Gaussian model. We can write the product of the prior and the likelihood, in this case:

\underset{l i k e l i h o o d}{\underset{︸}{p (\vec{d} | x_{1})}} \underset{p r i o r}{\underset{︸}{p (x_{1} | M_{1})}} = \underset{l i k e l i h o o d}{\underset{︸}{\frac{1}{{(\sqrt{2 π} σ)}^{m}} \exp (- \frac{χ^{2}}{2})}} \underset{p r i o r}{\underset{︸}{\frac{1}{Δ_{1}}}},

(5.41)

where we have assumed that the prior was flat over $Δ x_{1}$ and that we had $m$ observations with uncertainty $σ$ . Notice that the product in Eq. ( $5.41$ ) is proportional but not equal to the posterior. We indeed have not divided it by $p (\vec{d})$ , as this is the quantity we want to determine. If we approximate the posterior by a rectangle, we obtain the following rough expression, using the notations in Fig. 5.16.a:

p (\vec{d} | x_{1}) p (x_{1} | M_{1}) ≃ \frac{δ_{1}}{Δ_{1}} \exp (- \frac{χ_{m a x}^{2}}{2}) .

(5.42)

Let’s assume that adding parameter $x_{2}$ does not improve the fit. The $χ_{m a x}^{2}$ thus stays the same. This is represented in Fig. 5.16.b. With the same assumptions as in Eq. ( $5.42$ ), we obtain:

p (\vec{d} | x_{1}, x_{2}) p (x_{1}, x_{2} | M_{2}) ≃ \frac{δ_{1} δ_{2}}{Δ_{1} Δ_{2}} \exp (- \frac{χ_{m a x}^{2}}{2}) .

(5.43)

Eq. ( $5.40$ ) thus becomes:

B F_{21} ≃ \frac{δ_{2}}{Δ_{2}} « 1 .

(5.44)

It thus tell us that model $M_{1}$ is more credible. The penalty of adding the extra parameter is $δ_{2} ∕ Δ_{2}$ . More generally, we can encounter the three following situations.

1.: If the fit is indeed better, we will have: $B F_{21} ≃ δ_{2} ∕ Δ_{2} \exp [- (χ_{2}^{2} - χ_{1}^{2}) ∕ 2]$ . Model $M_{2}$ will be more credible only if the increased chi-squared compensates the penalty.
2.: If the fit is not better, but parameter $x_{2}$ is not constrained, we will have $δ_{2} ≃ Δ_{2}$ , and $B F_{21} ≃ 1$ . Both models will be equivalent because we will not have changed our prior knowledge about $x_{2}$ .
3.: If the fit is not better, but parameter $x_{2}$ has a smaller support than its prior, model $M_{2}$ will be less credible than $M_{1}$ . This is the example of Fig. 5.16.

Figure 5.16: Bayes factors and parsimony. Panel (a) represents the likelihood of a one-parameter model, in red. Its width is $δ_{1}$ . The prior is shown in blue and is wider (width $Δ_{1}$ ). The posterior, in green, is the product of the two. Its peak, the green dot, is roughly proportional to $\exp (- χ_{m a x}^{2} ∕ 2) ∕ Δ_{1}$ (Eq. $5.42$ ). Panel (b) represents the same model, adding an extra model parameter, $x_{2}$ . In the case we have represented, the parameter does not improve the fit, so that the chi-squared is similar. The maximum a posteriori is the green dot. It is roughly proportional to $\exp (- χ_{m a x}^{2} ∕ 2) ∕ (Δ_{1} Δ_{2})$ (Eq. $5.43$ ). Licensed under CC BY-SA 4.0.

Comparison of the Bayesian and Popperian methods. We can now compare both approaches. The following arguments are summarized in Table 5.3.

Theory corroboration

requires falsifiability, in the Popperian approach. Experiments need to be reproducible and the detailed procedure needs to be defined before starting acquiring data. The Bayesian approach is more flexible because falsifiability is not required and heterogeneous data sets can be accounted for. For instance, the first detection of gravitational waves (LIGO collaboration et al., 2016) and the first direct image of a black hole (Event Horizon Telescope Collaboration et al., 2019), combined, bring a large weight of evidence in favor of general relativity. On the contrary, a strict Popperian would argue that: (i) both experiments are independent and should not be mixed together; and (ii) these experiments are not falsifiable, as an absence of detection could have been blamed on their complexity. In that sense, the Popperian approach, if it was actually put in practice, would be extremely wasteful and would slow down the progress of science. Fortunately, the majority of scientists do not apply the Popperian method, most of them unconsciously. Our impression is that most scientists apply the Bayesian principles, at least qualitatively. This is probably because our own brain is Bayesian (e.g. Meyniel & Dehaene, 2017).

Reproducibility

is a requirement of the Popperian approach ¹², whereas Bayesians can account for unique, unreproducible data, such as earthquakes, SNe, GRBs, etc. The Bayesian approach will benefit from reproducible experiments, as they will increase the strength of evidence, but it is not a requirement.

Accumulation of knowledge

is natural in Bayesianism, as the prior is there to account for what any previous experience has taught us. This is what we demonstrated in Sect. 5.1.2.2:

\underset{new prior}{\underset{︸}{p (\vec{x})}} = \underset{initial prior}{\underset{︸}{1}} \times \underset{accumulated previous data}{\underset{︸}{p ({\vec{d}}_{1} | \vec{x}) \times \dots \times p ({\vec{d}}_{N} | \vec{x})}} .

(5.45)

Logical decision

is Platonic for Popperians. They see theories either right or wrong, until proven otherwise. The probabilistic logic of Bayesians is more progressive, because it uses a continuous credibility scale and accounts for all previous data. It makes it more conservative. Yet, Epistemological breaks are still possible if new data bring a large strength of evidence in favor of a new theory.

Scientist or lawyer?

The Popperian approach forces scientists to design experiments to test a single theory, whereas Bayesians can naturally compare several theories and use all available data. The latter favors a less doctrinal attitude, less ideological, that should be promoted in science. On the contrary, Popperians have an attitude closer to that of lawyers, as they are forced to defend one particular case, by attacking (falsifying) the alternatives ¹³.


	Bayesian	Popperian

Corroboration	Verifiability & strength of evidence	Requires falsifiability

Logic	Probabilistic: continuity between skepticism & confidence	Platonic: theories are either true or false at a given time

Repeatability	Can account for unique data	Requires reproducibility

Experimental data	Can account for small, heterogeneous data sets	Experimental settings need to be defined beforehand

External data	Holistic approach	No possibility to account for any data outside of the experiment

Parsimony	Bayes factors eliminate unnecessary complex theories	The most falsifiable theories are preferred

Knowledge growth	Prior accounts for past knowledge	Each experiment is independent

Attitude	Universal approach: can test any theory	Partial approach: only one theory can be tested

Application	Most scientists are unconsciously pragmatic Bayesians	Strict Popperians are rare & probably not very successful

Table 5.3: Comparison of Bayesian and Popperian epistemologies.

5.3 Relevance for Interstellar Dust Studies

We now demonstrate what the Bayesian methods, which constitute a true epistemological approach, can bring to ISD studies. In particular, we advocate that hierarchical Bayesian models are an even better application of this approach. We illustrate this point with our own codes.

5.3.1 The Particularities of Interstellar Dust Studies

We have discussed throughout this manuscript the difficulty to constrain dust properties using a variety of observational constraints. For instance, we have seen that there is a degeneracy between small and hot equilibrium grains in the MIR (cf. Sect. 3.1.2.2), or between the effects of the size and charge of small a-C(:H) (cf. Sect. 3.2.1.3). In a sense, we are facing what mathematicians call an ill-posed problem, with the difference that we do not have the luxury of rewriting our equations, because they are determined by the observables. We detail these issues below.

5.3.1.1 Complexity of the Physics

Degeneracy between microscopic and macroscopic properties. When we observe a [C II] $_{158 μ m}$ line, we know it comes from a C $^{+}$ atom, and we can characterize very precisely the physical nature of this atom. On the contrary, if we observe thermal grain emission, we know it can come from a vast diversity of solids, with different sizes, shapes, structures, and composition. Even if we observe a feature, such as the 9.8 $μ m$ silicate band or an aromatic feature, we still have a lot of uncertainty about the physical nature of its carrier. This is the fundamental difference between ISM gas and dust physics. The complexity of gas modeling comes from the difficulty to determine the variation of the environmental conditions within the telescope beam. This difficulty thus comes from our uncertainty about the macroscopic distribution of ISM matter and of the energy sources (stars, AGNs, etc.) in galaxies. We also have the same issue with dust. When studying ISD, we therefore constantly face uncertainties about both the microscopic and macroscopic properties. Assuming we have a model that accounts for variations of both the dust constitution and the spatial distribution of grains relative to the stars, the Bayesian approach is the only way to consistently explore the credible regions of the parameter space, especially if several quantities are degenerate. We will give some examples in Sect. 5.3.3.

Heterogeneity of the empirical constraints. A dust model, such as those discussed in Sect. 2.3, has been constrained from a variety of observables (cf. Sect. 2.2): (i) from different physical processes, over the whole electromagnetic spectrum; (ii) originating from different regions in the MW; (iii) with prior assumptions coming from studies of laboratory analogues and meteorites. When we use such a model to interpret a set of observations, we should in principle account for all the uncertainties that went into using these constraints:

the uncertainties on the observations used to design the model (depletions, extinction curves, etc.);
the uncertainties on the laboratory data (opacity, density, etc.);
the prior probability of the different assumptions.

Obviously, only the Bayesian approach can account for these, especially knowing that these different uncertainties will likely be non-Gaussian and partially correlated. This is however an ambitious task, and it has been done only approximately (e.g. Sect. 4.1.3 of Galliano et al., 2021). This is a direction that future dust studies should take.

5.3.1.2 Entanglement of the Observations

The observables are weakly informative. Another issue with ISD studies is that the observables, taken individually, bring a low weight of evidence. A single broadband flux is virtually useless, but a few fluxes, strategically distributed over the FIR SED, can unlock the dust mass and starlight intensity (cf. Sect. 3.1.2.2). Yet, these different fluxes come from different observation campaigns, with different instruments. If we add that the partially-correlated calibration uncertainties of the different instruments often dominate the error budget, we understand that the Bayesian approach is the only one that can rigorously succeed in this type of analysis. We will discuss the treatment of calibration uncertainties in Sect. 5.3.2.2.

Contaminations are challenging to subtract. The different sources of foreground and background contaminations, that we have discussed in Sect. 3.1.3.2, will become more and more problematic with the increasing sensitivity of detectors. Indeed, probing the diffuse ISM of galaxies requires to observe surface brightnesses similar to the MW foreground. In addition, the CIB is even brighter at submm wavelengths (cf. Fig. 3.24). These two contaminations, the MW and the CIB, have very similar SEDs and a complex, diffuse spatial structure. Accurately separating these different layers therefore requires probabilistic methods, using redundancy on large-scales (e.g. Planck Collaboration et al., 2016a). It requires modeling every component at once. Bayesian and ML methods are the most obvious solutions.

5.3.2 The Principles of Hierarchical Bayesian Inference

We now discuss the formalism of hierarchical Bayesian inference, applied to SED modeling. This method has been presented by Kelly et al. (2012) and Galliano (2018).

5.3.2.1 Non-Hierarchical Bayesian Formalism for SED Modeling

Posterior of a single source. Let’s assume that we are modeling a single observed SED (e.g. one pixel or one galaxy), sampled in $m$ broadband filters, that we have converted to monochromatic luminosities: $L_{ν}^{o b s} (λ_{j})$ ( $j = 1, \dots, m$ ). Let’s assume that these observations are affected by normal iid noise, with standard-deviation $σ_{ν}^{n o i s e} (λ_{j})$ . If we have a SED model, depending on a set of parameters $\vec{x}$ , such that the predicted monochromatic luminosities in the observed bands is $L_{ν}^{m o d} (λ_{j}, \vec{x})$ , we can write that:

L_{ν}^{o b s} (λ_{j}) = L_{ν}^{m o d} (λ_{j}, \vec{x}) + 𝜖 (λ_{j}) σ_{ν}^{n o i s e} (λ_{j}),

(5.46)

where $𝜖 (λ_{j}) \overset{iid}{\sim} N (0, 1)$ . In other words, our observations are the model fluxes plus some random fluctuations distributed with the properties of the noise. The distribution of the parameters, $\vec{x}$ , is what we are looking for. We can rearrange Eq. ( $5.46$ ) to isolate the random variable:

𝜖 (λ_{j}, \vec{x}) = \frac{L_{ν}^{o b s} (λ_{j}) - L_{ν}^{m o d} (λ_{j}, \vec{x})}{σ_{ν}^{n o i s e} (λ_{j})} .

(5.47)

Since we have assumed iid noise, the likelihood of the model is the product of the likelihoods of each individual broadbands:

p ({\vec{L}}_{ν}^{o b s} | \vec{x}) = \prod_{j = 1}^{m} p (𝜖 (λ_{j}, \vec{x})),

(5.48)

were ${\vec{L}}_{ν}^{o b s} \equiv {L_{ν}^{o b s} (λ_{j})}_{j = 1, \dots, m}$ . If we assume a flat prior, the posterior is (Eq. $5.6$ ):

p (\vec{x} | {\vec{L}}_{ν}^{o b s}) \propto \prod_{j = 1}^{m} p (𝜖 (λ_{j}, \vec{x})) .

(5.49)

The difference is that: (i) in Eq. ( $5.48$ ), the parameters, $\vec{x}$ , are assumed fixed, the different $p (𝜖 (λ_{j}, \vec{x}))$ are thus independent; whereas (ii) in Eq. ( $5.49$ ), the observations, ${\vec{L}}_{ν}^{o b s}$ , are assumed fixed, the different terms in the product are now correlated, because each $p (𝜖 (λ_{j}, \vec{x}))$ depends on all the parameters.

Modeling several sources together. If we now model $n$ sources with observed luminosities, ${\vec{L}}_{ν}^{o b s, i}$ ( $i = 1, \dots, n$ ), to infer a set of parameters, ${\vec{x}}_{i}$ , the posterior of the source sample will be:

p ({\vec{x}}_{1}, \dots, {\vec{x}}_{n} | {\vec{L}}_{ν}^{o b s, 1}, \dots, {\vec{L}}_{ν}^{o b s, n}) \propto \prod_{i = 1}^{n} \prod_{j = 1}^{m} p (𝜖_{i} (λ_{j}, {\vec{x}}_{i})) = \prod_{i = 1}^{n} p ({\vec{x}}_{i} | {\vec{L}}_{ν}^{o b s, i}) .

(5.50)

Notice that, in the second equality, the different $p ({\vec{x}}_{i} | {\vec{L}}_{ν}^{o b s, i})$ are independent, as each one depends on a distinct set of parameters, ${\vec{x}}_{i}$ . The sampling of the whole posterior distribution will thus be rigorously equivalent to sampling each individual SED, one by one.

$\Rightarrow$ With a non-hierarchical Bayesian approach, the sources in a sample are independently modeled.

5.3.2.2 The Introduction of Nuisance Variables

Nuisance variables are parameters we need to estimate to properly compare our model to our observations. The particular value of these variables is however not physically meaningful, and we end up marginalizing the posterior over them. The Bayesian framework is particularly well-suited for the treatment of nuisance parameters.

Calibration uncertainties. Calibration errors originate from the uncertainty on the conversion of detector readings to astrophysical flux (typically ADU/s to Jy/pixel). Detectors are calibrated by observing a set of calibrators, that are bright sources with well-known fluxes. The uncertainties in the observations of these calibrators and on the true flux of the calibrators translate into a calibration uncertainty.

Correlation between sources:: the offset resulting from this uncertainty will be the same for every observations made with a given detector. For instance, if an instrument’s calibration factor is $5 %$ higher than what it should be ¹⁴, all high signal-to-noise measures made with this instrument will report a flux higher by $5 %$ than its true value. The calibration uncertainty, for a given broadband filter, will thus be perfectly correlated between all our sources.
Partial correlation between wavelengths:: instruments are often cross-calibrated and use similar calibrators. The correlation procedure will thus induce a partial correlation between different broadband fluxes. An example of this type of correlation is discussed in Appendix A of Galliano et al. (2021).

Introduction into the posterior. To account for calibration uncertainties, we can rewrite Eq. ( $5.47$ ) as:

𝜖 (λ_{j}, \vec{x}, δ_{j}) = \frac{L_{ν}^{o b s} (λ_{j}) - L_{ν}^{m o d} (λ_{j}, \vec{x}) \times (1 + δ_{j})}{σ_{ν}^{n o i s e} (λ_{j})} .

(5.51)

We have now multiplied the model by $(1 + δ_{j})$ , where $δ_{j} \sim N (0, 𝕍_{c a l})$ is a random variable following a centered multivariate normal law ¹⁵ with covariance matrix, $𝕍_{c a l}$ . This random variable, which represents a correction to the calibration factor, is multiplicative: it scales the flux up and down. $𝕍_{c a l}$ contains all the partial correlations between wavelengths (cf. Appendix A of Galliano et al., 2021). The important point to notice is that the $δ_{j}$ do not depend on the individual object (index $i$ ), they are unique for the whole source sample. The posterior of Eq. ( $5.50$ ) now becomes:

p ({\vec{x}}_{1}, \dots, {\vec{x}}_{n}, \vec{δ} | {\vec{L}}_{ν}^{o b s, 1}, \dots, {\vec{L}}_{ν}^{o b s, n}) \propto p (\vec{δ}) \times \prod_{i = 1}^{n} \underset{\prod_{j = 1}^{m} p (𝜖_{i} (λ_{j}, {\vec{x}}_{i}, δ_{j}))}{\underset{︸}{p ({\vec{x}}_{i} | {\vec{L}}_{ν}^{o b s, i}, \vec{δ})}},

(5.52)

where $p (\vec{δ}) = N (0, 𝕍_{c a l})$ is the prior on $δ$ . We can make the following remarks.

1.: The prior on the calibration errors, $\vec{δ}$ , is the calibration uncertainty, $p (\vec{δ})$ , quoted by the different instrument teams (all the information is included in $𝕍_{c a l}$ ). It means that, by sampling Eq. ( $5.52$ ), we will infer values of $\vec{δ}$ that are potentially more accurate than the calibration coefficients provided by the instrument teams. In practice, this is however not the case, because: (i) the calibration sources are usually the brightest and the most well-constrained, it is unlikely to reach the same accuracy observing galaxies; (ii) observations of galaxies suffer from contaminations adding several biases that are difficult to take into account; and (iii) models used to interpret observations of galaxies are not as accurate as models of typical calibrators, such as stars or Uranus.
2.: By sampling Eq. ( $5.52$ ), we are inferring a single value of the $δ_{j}$ factors. This is because the calibration has been done only once, and stays the same. The randomness it introduces is a single draw. It is not a reproducible event. The calibration uncertainty can not be considered as the limiting frequency of a repeated procedure. This is why frequentists could not treat these uncertainties.
3.: The calibration factors in Eq. ( $5.52$ ) link the posteriors of the individual sources together. We have noted that they were independent in Eq. ( $5.50$ ). This is not the case anymore, as they all depend on $\vec{δ}$ . Thus: (i) inferring $\vec{δ}$ is made possible by simultaneously sampling several sources, sharing the same calibration coefficients; (ii) the presence of $\vec{δ}$ , at the same time, improves the fit of individual SEDs.

The posterior of the parameters is, in the end, the marginalization over $\vec{δ}$ of Eq. ( $5.52$ ):

p ({\vec{x}}_{1}, \dots, {\vec{x}}_{n} | {\vec{L}}_{ν}^{obs,1}, \dots, {\vec{L}}_{ν}^{o b s, n}) = \int p ({\vec{x}}_{1}, \dots, {\vec{x}}_{n}, \vec{δ} | {\vec{L}}_{ν}^{obs,1}, \dots, {\vec{L}}_{ν}^{o b s, n}) d^{m} \vec{δ} .

(5.53)

$\Rightarrow$ Calibration errors can be rigorously taken into account as nuisance parameters.

5.3.2.3 The Role of the Hyperparameters

Accounting for the evidence brought by each source. In Sect. 5.1.2.2, we have stressed that, when performing a sequential series of measure, we can use the previous posterior as the new prior. Yet, the posterior in Eq. ( $5.52$ ) does not allow us to do so, as: (i) the posteriors of individual sources all depend on $\vec{δ}$ , they must therefore be sampled at once; and (ii) the parameters, ${\vec{x}}_{i}$ , are not identical, we are not repeating the same measure several times as in Eq. ( $5.14$ ), we are observing different sources. Accounting for the accumulation of evidence is thus not as straightforward as in Eq. ( $5.14$ ). There is however a way to use an informative prior, consistently constrained by the sample. It is the Hierarchical Bayesian (HB) approach. To solve the conundrum that we have just exposed, we can make the following assumptions.

1.: We can assume that the parameters of the different sources are drawn from a common distribution, but this distribution is unknown. For instance, the dust masses in the pixels of a galaxy span only a few dexes. They are not arbitrarily distributed. Their distribution results from the complex physics at play: dynamics, star formation, dust evolution, etc.
2.: We can reasonably approximate this common distribution with a particular functional form, such as a multivariate Gaussian or a Student’s $t$ . Such a distribution is parametrized by its average, $\vec{μ}$ , and its covariance matrix, $𝕍$ . These parameters are called hyperparameters, because they control the distribution of physical parameters. This is why this approach is called hierarchical. There are two layers of modeling: (i) the common distribution of parameters, controlled by a set of hyperparameters; (ii) the SED model controlled by as many sets of parameters as there are sources. The average $\vec{μ}$ will therefore represent the mean of each SED model parameters ( $M_{d u s t}$ , $⟨ U ⟩$ , $q_{A F}$ , etc.; cf. Sect. 3.1.2.2) and, $𝕍$ , their intrinsic scatter and correlations (such as the correlation between $⟨ U ⟩$ and $q_{A F}$ ; cf. Sect. 4.3.2).
3.: This common distribution, controlled by hyperparameters, is treated as the prior of our SED model parameters. We can then infer the values of the hyperparameters when sampling the whole posterior and marginalize over them in the end.

The hierarchical posterior. With the HB approach, the full posterior of our source sample is:

p ({\vec{x}}_{1}, \dots, {\vec{x}}_{n}, \vec{δ}, \vec{μ}, 𝕍 | {\vec{L}}_{ν}^{obs,1}, \dots, {\vec{L}}_{ν}^{o b s, n}) \propto \prod_{i = 1}^{n} \underset{source likelihoods}{\underset{︸}{p ({\vec{L}}_{ν}^{obs,i} | {\vec{x}}_{i}, \vec{δ})}} \times \underset{parameter prior}{\underset{︸}{p ({\vec{x}}_{i} | \vec{μ}, 𝕍)}} \times \underset{hyperparameter prior}{\underset{︸}{p (\vec{μ}) p (𝕍)}} \times \underset{calibration prior}{\underset{︸}{p (\vec{δ})}} .

(5.54)

The hyperprior:

compared to Eq. (

5.52

), we have introduced a new term:

p ({\vec{x}}_{i} | \vec{μ}, 𝕍) \times p (\vec{μ}) p (𝕍)

. This is the hierarchical prior. The term

p ({\vec{x}}_{i} | \vec{μ}, 𝕍)

is what we have previously called the common distribution of parameters. It is the actual prior on the parameters, and it is parametrized by the hyperparameters. The other terms,

p (\vec{μ}) p (𝕍)

are the necessary priors on

\vec{μ}

and

𝕍

, that we can assume rather flat (cf. Sect. 3.2.4 of Galliano, 2018, for more details). The elements of

\vec{μ}

are drawn one by one, using Gibbs sampling (cf. Sect. 5.1.3). Regarding the elements of

𝕍

, we independently draw each standard-deviation and correlation coefficient (cf. Sect. 3.2.4 of Galliano, 2018).

The parameter space

corresponding to Eq. (

5.54

) has dimension (noting

q

the number of SED model parameters):

N_{dim} = \underset{S E D m o d e l p a r a m e t e r s}{\underset{︸}{n \times q}} + \underset{e l e m e n t s o f \vec{μ}}{\underset{︸}{q}} + \underset{d i a g o n a l o f 𝕍}{\underset{︸}{q}} + \underset{number of correlations}{\underset{︸}{q \times (q - 1) ∕ 2}} + \underset{calibration errors}{\underset{︸}{m}} .

(5.55)

For the composite model ( $q = 7$ ; cf. Sect. 3.1.2.2), constrained by $m = 10$ wavelengths, and for $n = 1000$ sources or pixels, we would have to sample a $N_{dim} ≃ 7000$ dimension parameter space.

5.3.3 Hierarchical Bayesian Models for ISD Studies

We now present practical illustrations of SED modeling with the HB approach. The first HB dust SED model was presented by Kelly et al. (2012). It was restrained to single MBB fits. Veneziani et al. (2013) then presented a HB model that could be applied to a combination of MBBs. The HiERarchical Bayesian Inference for dust Emission code (HerBIE; Galliano, 2018), was the first HB model, and to this day the only one to our knowledge, to properly account for full dust models, with: (i) realistic optical properties; (ii) complex size distributions; (iii) rigorous stochastic heating; (iv) mixing of physical conditions; (v) photometric filter and color corrections; and (vi) partially-correlated calibration errors. The following examples have been computed with HerBIE.

5.3.3.1 Efficiency and Comparison to Other Methods

To demonstrate the efficiency of the HB method and the fact that it performs better than its alternatives, we rely on the simulations presented by Galliano (2018). These simulations are simply obtained by randomly drawing SED model parameters from a multivariate distribution, for a sample of sources. This distribution is designed to mimic what we observe in typical star-forming galaxies. The SED model used is the composite approach (cf. Sect. 3.1.2.2), except when we discuss MBBs. Each set of parameters result in an observed SED, that we integrate into the four IRAC, the three PACS and the three SPIRE bands. We add noise and calibration errors. This way we can test fitting methods and assess their efficiency by comparing the inferred and the true values.

Close look at a fit. Fig. 5.19.a-b shows the posterior SED PDF of the faintest and brightest pixels in a simulation, fitted in a HB fashion.

For the faintest pixel: (Fig. 5.19.a), we can see that the PDF is wider, because it is less constrained. Half of the observations (green) are indeed only upper limits. The SED looks however realistic and matches very well its truth (in red). This is one of the advantages of the HB approach: when a source is poorly constrained, its posterior is dominated by the prior, which has been informed by the whole sample. Thus, despite few information on this particular source, we obtain realistic parameters and SED, because the rest of the sample is providing information about the typical shape of an SED in that luminosity range.
For the brightest pixel: (Fig. 5.19.b), the SED is much tighter. There are however spectral domains where the uncertainty can increase. For instance, notice that there is a wider spread around 3 $μ m$ and around 11 $μ m$ . This is because the charge of the PAHs gives the model a degree of freedom in this range (cf. Sect. 3.2.1.3). Yet, the model is poorly constrained. The true SED lies in the tail of the distribution. It is still consistent, though.
The calibration errors: are shown in Fig. 5.19.c. The red dots represent the biases we have introduced into the synthetic observations. These biases are the same for each pixel in the simulation. These errors are treated as nuisance parameters in Eq. ( $5.54$ ). The blue error bars show the inference of these biases. We see that they are most of the time consistent within $1 σ$ . We also see that they are most of the time consistent with zero. This is what we discussed in Sect. 5.3.2.2: typical galaxy observations are not accurate enough to refine the calibration of the instruments. The only outlier is the SPIRE $_{350 μ m}$ point. It is however less than $3 σ$ away from zero.
The posterior: distribution of these two pixels is shown in Fig. 5.18.b. We have represented the PDF of two parameters, $M_{d u s t}$ and $⟨ U ⟩$ , marginalizing over the rest. We see that the faintest pixel has a larger uncertainty, and that both posteriors are close to their true values (red dots). If we now compare these results to the same exact simulation fitted in a standard, non-hierarchical Bayesian way (Fig. 5.18.a), we notice that the PDF of the brightest pixel is very much the same, but the PDF of the faintest pixel is now much wider, covering a large fraction of the parameter space. This is because, as we have noted earlier, sources are individually fitted with a standard Bayesian method, they thus do not benefit from the information provided by the rest of the sample, through the prior.

$\Rightarrow$ In a HB model, the least-constrained sources are more corrected by the prior than the brightest ones.

Figure 5.17: Example of hierarchical Bayesian SED fits. The blue contours in panels (a) and (b) represent the posterior SED of the faintest and brightest pixels in a simulation presented by Galliano (2018). This simulation reproduces typical conditions in star-forming galaxies. The synthetic observations, including noise and calibration errors, are in green. The true model is shown in red, for reference. Panel (c) shows the simulated calibration errors in red, and their inferred posterior values, in blue. Licensed under CC BY-SA 4.0.

Figure 5.18: Posterior distributions with standard and hierarchical Bayesian methods. In each panel, we show the marginalized posterior of the dust mass, $M_{d u s t}$ , and mean starlight intensity, $⟨ U ⟩$ (Eq. $3.38$ ), for the two pixels in Fig. 5.17. Panel (a) corresponds to the case of a non-hierarchical Bayesian fit, and panel (b), to a full hierarchical fit. In both panels, the orange contours represent the faintest pixel and the blue contours, the brightest one. The uncertainty ellipses corresponding to these posteriors are the green ellipses. The true values are the red dots. Licensed under CC BY-SA 4.0.

Comparison between different approaches. We have been discussing the two extreme pixels of our simulation. Let’s now look at the whole source sample and compare several methods. Fig. 5.19 shows the same parameter space as in Fig. 5.18, but for all sources, with different methods.

The least-squares

method, which is frequentist, is shown in green, in Fig. 5.19.a. We note the following points.

1.: The inferred parameters have large individual uncertainties (i.e. big ellipses).
2.: The overall sample is quite scattered, covering three orders of magnitude, while the true values are within a dex.
3.: There is a false negative correlation between $M_{d u s t}$ and $⟨ U ⟩$ .

This false correlation is typical of frequentist methods, but not exclusive. It is the equivalent of the $β - T$ degeneracy we have already discussed in Sect. 3.1.2.1. Because of the way the model is parametrized, if we slightly overestimate the dust mass, we will indeed need to compensate by decreasing $⟨ U ⟩$ , to account for the same observed fluxes, and vice versa. This false correlation is thus induced by the noise.

The non-hierarchical Bayesian

method, in Fig. 5.19.b, provides a more accurate fit of the brightest sources (high

M_{d u s t}

and high

⟨ U ⟩

). The faintest sources are however quite scattered. There is still a false correlation, due to the same reasons as for the least-squares, but it is less significant.

The hierarchical Bayesian

method, in Fig. 5.19.c, on the contrary, provides an unbiased statistical account of the sample. We note the following points.

1.: The uncertainty of individual sources (blue ellipses) is moderate. It is never larger than the scatter of the true sample.
2.: The inferred mean and scatter of the sample properties are very close to their true values. Their comparison is given in Table 5.4.
3.: There is no false correlation. The inferred correlation coefficient is consistent with zero, its true value.

From a general point view, a HB method is efficient at removing the scatter between sources that is due to the noise. In addition, the inferred uncertainties on the parameter of a source are never larger than the intrinsic scatter of the sample, because this scatter is also the width of the prior. For instance, in Fig. 5.19.c, the lowest signal-to-noise sources (lower left side of the distribution) have uncertainties (blue ellipses) similar to the scatter of the true values (red points), because the width of the prior matches closely this distribution. On the opposite, the uncertainties of high signal-to-noise sources (upper right side) are much smaller, they are thus not significantly affected by the prior.

$\Rightarrow$ HB methods are efficient at recovering the true, intrinsic scatter of parameters and their correlations.

Figure 5.19: Comparison of least-squares, standard Bayesian and HB methods. Each panel shows the full simulation of Fig. 5.19, fitted with three different methods. The true values are the red dots. They are identical in the three panels. We have 100 sources. We represent the same parameter space as in Fig. 5.18. The ellipses represent the posterior of: (a) a least-squares fit; (b) the non-hierarchical Bayesian fit of Fig. 5.18.a; and (c) the hierarchical Bayesian fit of Fig. 5.18.c. Licensed under CC BY-SA 4.0.


	HB	True

$⟨ \ln M_{d u s t} ⟩$	$0.053 \pm 0.136$	0

$σ (\ln M_{d u s t})$	$0.477 \pm 0.065$	0.5

$⟨ \ln ⟨ U ⟩ ⟩$	$3.64 \pm 0.29$	3.742

$σ (\ln ⟨ U ⟩)$	$0.65 \pm 0.20$	0.4

$ρ (\ln M_{d u s t}, \ln ⟨ U ⟩)$	$- 0.088 \pm 0.144$	0

Table 5.4: Inferred statistical properties of the HB model in Fig. 5.19.c. These quantities are the inferred moments of the source distribution.

The emissivity-index-temperature degeneracy of MBBs. We have just seen that HB methods are efficient at removing false correlations between inferred properties. We emphasize that HB methods do not systematically erase correlations if there is a true one between the parameters (cf. the tests performed in Sect. 5.1 of Galliano, 2018, with intrinsic positive and negative correlations). This potential can obviously be applied to the infamous $β - T$ correlation discussed in Sect. 3.1.2.1. This false correlation has been amply discussed by Shetty et al. (2009). Kelly et al. (2012) showed, for the first time, that HB methods could be used to solve the degeneracy. Fig. 5.20 shows the results of Galliano (2018) on that matter. We can see that the false $β - T$ negative correlation obtained with a least-squares fit (Fig. 5.20.a) is completely eliminated with a HB method (Fig. 5.20.b).

Figure 5.20: Solving the emissivity-index-temperature degeneracy of MBBs with a HB model. The red dots in both panels represent a MBB simulation of 1000 sources in the $T - β$ plane (Galliano, 2018). Noise and calibration uncertainties have been added to the synthetic SEDs corresponding to these values. They have been fitted with: (a) a least-squares method, in green; and (b) a HB method, in blue. Licensed under CC BY-SA 4.0.

5.3.3.2 The Role of the Prior

We now further develop and illustrate the instrumental role of the hierarchical prior.

Linking the different sources. We have noted in Eq. ( $5.54$ ) that the hierarchical prior was breaking the independency between the different sources in the sample, encountered in the non-hierarchical Bayesian case. This is because the properties of the prior (the hyperparameters) are inferred from the source distribution, and the properties of the individual sources are affected, in return, by this prior. This is illustrated in Fig. 5.21. This figure shows the HB fits of three simulations, varying the median signal-to-noise ratio of the sample. We have represented a different parameter space, this time.

At high signal-to-noise: (Fig. 5.21.a), we see that the uncertainty of individual sources is significantly smaller than the scatter of their properties. The prior thus does not play an important role. This is a case where a non-hierarchical Bayesian fit would give very similar results.
At intermediate signal-to-noise: (Fig. 5.21.b), the brightest sources (on the right side) still have small uncertainties. However, most of the points now have uncertainties comparable to the sample scatter, because the posterior of each individual source starts to be dominated by the prior. We note that the maximum a posteriori (cyan stars) starts to cluster in the center.
At low signal-to-noise: (Fig. 5.21.c), the inferred values of $q_{P A H}$ (green stars) are now almost similar for every source. We are in the case where the uncertainty on each individual source is so large that we can not recover their individual values. We can however give their most likely value, based on the distribution of the sample. This type of result has to be interpreted in a Bayesian way, to be consistent (i.e. performing tests on the MCMC). The fact that the inferred values of $q_{P A H}$ all collapsed on a single point does not mean we would deduce that the points all have the same value. If we were performing some tests, we would realize that they are uncorrelated: if we were randomly drawing parameter values from the posterior, they would be scattered with a distribution similar to the true values. In a sense, we obtain here a result similar to stacking the sources, but in a smarter way, as some parameters are better constrained than others. For instance, in Fig. 5.21.c, we see that we have only access to $⟨ q_{P A H} ⟩$ ( $y$ -axis), but we resolve the $⟨ U ⟩$ of individual sources ( $x$ -axis).

$\Rightarrow$ HB methods are useful when the parameter uncertainty of some sources is comparable to or larger than the scatter of this property over the whole sample.

Figure 5.21: Demonstration of the effect of the prior in a HB model. The red dots in the different panels represent three simulations of 1000 sources, with similar physical properties, similar calibration uncertainties, but different signal-to-noise ratios (S/N). Each simulation has been fitted with our HB code (Galliano, 2018). We represent the marginalized posterior of the PAH mass fraction, $q_{P A H}$ , and mean starlight intensity, $⟨ U ⟩$ (cf. Sect. 3.1.2.2). Licensed under CC BY-SA 4.0.

Holistic prior. From a statistical point of view, it is always preferable to treat all the variables we are interested in as if they were drawn from the same multivariate distribution, however complex it might be. Stein (1956) showed that the usual estimator of the mean ( $\sum_{i} X_{i} ∕ N$ ) of a multivariate normal variable is inadmissible (for more than two variables), that is we could always find a more accurate one. In other words, if we were interested in analyzing together several variables, not necessarily correlated, such as the dust and stellar masses, it would always be more suitable to use estimators that combine all of them. This is known as Stein’s paradox. Although Stein (1956)’s approach was frequentist, this is a general conclusion. From a Bayesian point of view, it means that we should put all the variables we are interested in analyzing, even if they are not SED model parameters, in the prior. In addition, if these external variables happen to be correlated with some SED model parameters, they will help refining their estimates. This is illustrated in Fig. 5.22. We show in both panels a simulation of the correlation between the dust mass, which is a SED model parameter, and the gas, which is not. When performing a regular HB fit, and plotting the correlation as a function of $M_{g a s}$ , we obtain the correlation in Fig. 5.22.a. We see that the agreement with the true values breaks off at low mass (also the lowest signal-to-noise). If we now include $M_{g a s}$ in the prior ¹⁶, we obtain the correlation in Fig. 5.22.b. It provides a much better agreement with the true values. This is because adding $M_{g a s}$ in the prior brought some extra information. The information provided by a non-dusty parameter helped refine the dust SED fit. For instance, imagine that you have no constraints on the dust mass of a source, but you know its gas mass. You could infer its dust mass by taking the mean dustiness of the rest of the sample. This holistic prior does that, in a smarter way, as it accounts for all the correlations in a statistically consistent way.

$\Rightarrow$ The HB approach allows an optimal, holistic treatment of all the quantities of interest, even if they are not related to the dust.

Figure 5.22: The holistic approach: inclusion of external parameters into the prior. Both panels represent the same simulation (red dots) of 300 sources, with physical properties typical of star-forming galaxies, including noise and calibration errors (Galliano, 2018). We have added a parameter that is external to our SED model, the gas mass, $M_{g a s}$ . In panel (a), we show the HB inference of the dust mass as a function of the synthetic observations of the gas mass (cyan error bars). In panel (b), we show the HB inference of the parameters when they are both in the prior (blue ellipses). Licensed under CC BY-SA 4.0.

5.3.3.3 Other Developments

SED modeling is far from the only possible application of HB methods. We briefly discuss below two other models we have developed.

Cosmic dust evolution. The dust evolution model we have discussed in Sect. 4.3 has been fitted to galaxies by Galliano et al. (2021), in a HB way. To be precise, we have used the output of HerBIE, $M_{d u s t}$ , $M_{g a s}$ , $M_{⋆}$ , SFR and metallicity, as observables. We then have modeled the SFH-related parameters in a HB way, and have assumed that the dust efficiencies were common to all galaxies (i.e. we inferred one single value for the whole sample). These common parameters were not in the hierarchical prior, because their value is the same for all galaxies. They were however sampled with the other parameters, in a consistent way. We were successful at recovering dust evolution timescales consistent with the MW at Solar metallicity (cf. Sect. 4.3.1.2). One important improvement would be to treat everything within the same HB model: (i) the SED; (ii) the stellar and gas parameters; and (iii) the dust evolution. This is something we plan to achieve in the near future.

MIR spectral decomposition. The type of MIR spectral decomposition that we have discussed in Sect. 3.2.1.2 could also benefit from the HB approach. Although the model is mostly linear, there are a lot of degeneracies between the uncertainties of adjacent blended bands, such as the 7.6 and 7.8 $μ m$ features. In addition, plateaus and weak features are usually poorly constrained and their intensity can considerably bias the fit at low signal-to-noise. Hu et al. (in prep.) have developed such a HB MIR spectral decomposition tool. Its efficiency has been assessed on simulated data, and it is now being applied to M 82. This model will be valuable to analyze the spectra from the JWST.

1.Epistemology is the philosophical study of the nature, origin and limits of human knowledge. By extension, it is the philosophy of science.

2.It is often considered as the scientific definition of probabilities, while we will show later that the Bayesian definition has more practical applications.

3.Our variable is the true flux. It is positive. Measured fluxes can occasionally be negative because of noise fluctuations.

4.The distribution in Fig. 5.5.a has a very narrow tail on the lower side of $F_{t r u e}$ . It can thus in principle be lower, but this is such a low probability event that, for the clarity of the discussion, we will assume this is unlikely.

5.Numerous authors publish articles claiming to have solved a problem using “MCMC methods”. This is not the best terminology to our mind, especially knowing that MCMCs can be used to sample any distribution, not only a Bayesian posterior. These authors should state instead to have solved a problem in a Bayesian way (what), using a MCMC numerical method (how). The same way, we tell our students to say that they “modeled the photoionization”, rather than they “used Cloudy”.

6.This quantity is problematic to compute. Sokal (1996) and Foreman-Mackey et al. (2013) discuss an algorithm to evaluate it numerically.

7.The acceptance of a wrong hypothesis (false positive) is called “type I error”, whereas the rejection of a correct hypothesis (false negative) is called “type II error”.

8.Pascal’s wager states that it is rational to act as if God existed. If God indeed exists, we will be rewarded, which is a big win. If He doesn’t, we will have only renounced to some material pleasures, which is not a dramatic loss.

9.There are still a few frequentist trolls roaming Wikipedia’s mathematical pages.

10.If journalists are reading these lines, we stress that the peer-review process has nothing to do with the scientific method. It is just a convenient editorial procedure that filters poorly thought-out studies.

11.This could be called “P(opper)-hacking”.

12.Nowadays, scientists call “reproducibility” the action of providing the data and the codes a publication was prepared with. This is a good practice, but this is not Popper’s reproducibility. It should rather be qualified as “open source”.

13.Concerning the topics discussed in Sect. 4.3, we have seen this attitude in a small group of people trying to prove ISM dust is stardust, constantly ignoring the big picture summarized in Table 4.2.

14.In this example, $5 %$ is not the calibration uncertainty, but the error made because of the calibration uncertainty.

15.We could have taken a different distribution, such as a Student’s $t$ (e.g. Eq. 31 of Galliano, 2018).

16.From a technical point of view, including an external parameter in the prior, such as $M_{g a s}$ , can be seen as adding an identity model component: $M_{g a s} = f (M_{g a s})$ . Concretely, it means that, at each iteration, we sample $M_{g a s}$ from its uncertainty distribution, and the distribution of $M_{g a s}$ and its potential correlations with the other parameters inform the prior.

[next] [prev] [prev-tail] [front] [up]


(a) Blaise PASCAL	(b) Thomas BAYES	(c) Pierre-Simon LAPLACE
(1623–1662)	(1702–1761)	(1749–1827)


(a) Karl PEARSON	(b) Ronald FISHER	(c) Jerzy NEYMAN
(1857–1936)	(1890–1962)	(1894–1981)


(a) Harold JEFFREYS	(b) Alan TURING	(c) Edwin T. JAYNES
(1891–1989)	(1912–1954)	(1922–1998)


(a) Auguste COMTE	(b) Henri POINCARÉ	(c) Karl POPPER
(1798–1857)	(1854–1912)	(1902–1994)

Chapter 5Methodological Effort and Epistemological Reflection

5.1 Understanding the Opposition between Bayesians and Frequentists

5.1.1 Two Conceptions of Probability and Uncertainty

5.1.1.1 The Concept of Conditional Probability

5.1.1.2 The Bayesian and Frequentist Assumptions

5.1.2 Comparison of the Two Approaches on Simple Cases

5.1.2.1 Simple Case: When the Two Approaches Agree

5.1.2.2 Benefits of Using an Informative Prior

5.1.2.3 Case Where the Two Approaches Differ: Non Gaussianity and Few Data

5.1.3 Numerical Methods to Solve Bayesian Problems

5.1.3.1 Sampling the Posterior Distribution

5.1.3.2 Post-Processing MCMCs

5.1.4 Decision Making and Limitations of the Frequentist Approach

5.1.4.1 Hypothesis Testing

5.1.4.2 Pros and Cons of the two Approaches

5.2 Bayesianism, an Alternative to Popper’s Scientific Method

5.2.1 Bayes Formula Throughout History

5.2.1.1 The Origins

5.2.1.2 The Frequentist Winter

5.2.1.3 The Bayesian Renaissance

5.2.2 Bayesian and Popperian Epistemologies

5.2.2.1 The Epistemological Debate at the Beginning of the XXth Century

5.2.2.2 Popper’s Logic of Scientific Discovery

5.2.2.3 Verifiability, Holisticity and Parsimony: the Bayesian Alternative

5.3 Relevance for Interstellar Dust Studies

5.3.1 The Particularities of Interstellar Dust Studies

5.3.1.1 Complexity of the Physics

5.3.1.2 Entanglement of the Observations

5.3.2 The Principles of Hierarchical Bayesian Inference

5.3.2.1 Non-Hierarchical Bayesian Formalism for SED Modeling

5.3.2.2 The Introduction of Nuisance Variables

5.3.2.3 The Role of the Hyperparameters

5.3.3 Hierarchical Bayesian Models for ISD Studies

5.3.3.1 Efficiency and Comparison to Other Methods

5.3.3.2 The Role of the Prior

5.3.3.3 Other Developments

Chapter 5
Methodological Effort and Epistemological Reflection

5.2.2.1 The Epistemological Debate at the Beginning of the XX $^{t h}$ Century