On the frequentist properties of Bayesian nonparametric methods

In this paper, I will review the main results on the asymptotic properties of the posterior distribution in nonparametric or large dimensional models. In particular, I will explain how posterior concentration rates can be derived and what we learn from such analyses about the impact of the prior distribution in large dimensional models. These results concern both fully Bayes and empirical Bayes procedures. I will also describe some of the results that have been obtained recently in semi-parametric models, focusing mainly on the Bernstein-von Mises property. Although these results are theoretical in nature, they shed light on some subtle behaviours of the prior models and sharpen our understanding of the family of functionals that can be well estimated under a given prior model.


INTRODUCTION
Some five years ago, it was said at a Bayesian nonparametric workshop that the field was growing so fast that it was no longer possible to keep up with all the developments and new findings. And indeed, Bayesian nonparametrics has grown into a major field of Bayesian statistics, with applications in a large number of areas within biostatistics, physics, economics, social sciences, computational biology, computer vision and language processing. There is now a collection of textbooks on general Bayesian nonparametric models, such as (Dey et al., 1998, Ghosh and Ramamoorthi, 2003, Hjort et al., 2010), and even on specific aspects of Bayesian nonparametrics, see for instance (Rasmussen and Williams, 2006) for machine learning and Gaussian processes.
With the elaboration of more sophisticated models, the need to understand their theoretical properties becomes crucial. Theoretical studies on Bayesian nonparametric or large dimensional models can typically be split into two parts: asymptotic frequentist properties, and probabilistic properties of the random process defining the prior and/or the posterior distribution. In this paper, I will mainly describe the advances that have been obtained on the asymptotic frequentist properties of Bayesian nonparametric procedures.
When opposing Bayesian to frequentist statistics, one is merely opposing methods of validation since, at least from a frequentist viewpoint, there is no single frequentist method but all sorts of different "algorithms", say, and the question is how to evaluate them. Interestingly, Bayesian statistics forms a global approach in that it provides a generic methodology for inference, together with inherent evaluation tools. This coherency sometimes leads (Bayesian) statisticians to question the need for understanding the (asymptotic) frequentist properties of their procedures. I will not enter this dispute; however, I will try along the way to explain why it is helpful to understand the asymptotic frequentist properties of Bayesian procedures, in particular in complex or large dimensional models, where intuition and subjective inputs cannot be fully invoked.
Although, strictly speaking, nonparametric designates infinite dimensional parameters, I will also discuss high dimensional models, since they share common features with nonparametric models.
Bayesian nonparametric modelling was probably initiated by de Finetti's representation of infinite exchangeable sequences, see (de Finetti, 1937), which states that any infinite exchangeable sequence $(Y_i, i \in \mathbb{N})$ has a distribution which can be represented as

$$P(Y_1 \in A_1, \dots, Y_n \in A_n) = \int_{\mathcal{P}} \prod_{i=1}^n P(A_i) \, d\Pi(P) \quad \text{for all } n \ge 1, \qquad (1)$$

so that the de Finetti measure $\Pi$ can be understood as a prior distribution on $P$. Nowadays, more complex structures are modelled and used in practice.
Consider a statistical model associated with a set of observations $Y^n \in \mathcal{Y}^{(n)}$, $Y^n \sim P_\theta$, $\theta \in \Theta$, where $n$ denotes a measure of the information in the data $Y^n$. In the exchangeable model (1) for instance, $\Theta$ designates the set of probabilities on $\mathcal{Y}_1$, or the set of probability densities on $\mathcal{Y}_1$ if we restrict attention to dominated models. In regression or classification models of $Y$ on $X$, $\Theta$ may denote the set of regression functions, or the set of conditional distributions or densities given $X$. Generally speaking, $\Theta$ can have a very complex structure and be high or infinite dimensional. In such cases the influence of the prior is strong and does not entirely vanish asymptotically. It is then interesting to understand the types of implicit assumptions which are made by the choice of a specific prior and also, within a family of priors, which are the hyperparameters whose influence does not disappear as the number of observations increases. In some applications, hyperparameters are determined based on prior knowledge, as in (Yau et al., 2011); in others they are chosen based on the data, as in (van de Wiel et al., 2013); in the latter case the approach is called empirical Bayes. In both cases it is important to assess the influence of these choices. From a theoretical viewpoint, subjective priors and data dependent priors do not present the same difficulties: in Section 3 I describe the asymptotic behaviour of posterior distributions associated with priors that do not depend on the data, while in Section 4 empirical Bayes posteriors are considered.
Before describing theoretical properties of Bayesian nonparametric procedures, I recall in Section 2 the two main categories of Bayesian nonparametric prior models: those based on Dirichlet processes and their extensions, and those based on Gaussian process priors. Section 3 then presents the main results on posterior consistency and posterior concentration rates, Section 4 treats the recent results on empirical Bayes procedures, and Section 5 briefly describes advances in semi-parametric models.

COMMON BAYESIAN NONPARAMETRIC MODELS
We do not intend to cover the whole spectrum of Bayesian nonparametrics; in this section we review two important families of processes that are used in Bayesian nonparametric modelling.

Around the Dirichlet process
The most celebrated process used in prior modelling is the Dirichlet process prior $DP$, introduced by (Ferguson, 1974). The Dirichlet process can be characterized in many ways. It is parameterized by a mass $M > 0$ and a probability measure $G_0$ on a space $\mathcal{X}$. An explicit construction of its distribution, known as the stick-breaking representation, is due to (Sethuraman, 1994) and is given by

$$G = \sum_{j=1}^{\infty} p_j \delta_{\theta_j}, \quad p_j = V_j \prod_{l < j} (1 - V_l), \quad V_j \stackrel{iid}{\sim} \mathrm{Beta}(1, M), \quad \theta_j \stackrel{iid}{\sim} G_0, \qquad (2)$$

mutually independently, where $\delta_{\theta_j}$ stands for the Dirac point mass at $\theta_j$. We write $G \sim DP(M, G_0)$. The Dirichlet process has various other representations which make it a very useful process, see for instance (Ghosh and Ramamoorthi, 2003, Lijoi and Prünster, 2010).
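To make the construction concrete, here is a minimal numerical sketch of the stick-breaking representation (2); the truncation level and the choice $G_0 = N(0,1)$ are illustrative assumptions, not part of the definition.

```python
import numpy as np

def stick_breaking(M, base_sampler, truncation=500, rng=None):
    """Truncated draw from DP(M, G0): returns atoms (theta_j) and weights (p_j)."""
    rng = rng or np.random.default_rng(0)
    v = rng.beta(1.0, M, size=truncation)                      # V_j ~ Beta(1, M)
    p = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # p_j = V_j prod_{l<j}(1 - V_l)
    theta = base_sampler(rng, truncation)                      # theta_j ~ G0
    return theta, p

atoms, weights = stick_breaking(M=2.0, base_sampler=lambda r, k: r.normal(0.0, 1.0, k))
print(weights[:5], weights.sum())  # weights decay geometrically on average; total mass ~ 1
```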
Most often, the Dirichlet process is not used alone in the prior modelling. It is commonly combined with some kernel $f_\theta$ in a mixture model:

$$Y_i \stackrel{iid}{\sim} f_P, \quad f_P(y) = \int f_\theta(y) \, dP(\theta), \quad P \sim DP(M, G_0). \qquad (3)$$

The above type of model is a powerful tool to estimate the density of $Y_i$, but it can also be used for clustering, given the discrete nature of the Dirichlet process. All sorts of variations around the mixture model (3) can be considered. For instance, in (Kyung and Casella, 2010) the authors model the distribution of the random effects in a random effect model using a Dirichlet process. To go beyond exchangeable data, hierarchical Dirichlet processes, dependent Dirichlet processes and infinite hidden Markov models have been constructed, see (Hjort et al., 2010) for descriptions of these extensions. Also, extensions of the Dirichlet process (2) have been constructed based either on the Sethuraman representation or on one of its other representations: normalized completely random measures, the Pólya urn representation, see (Lijoi and Prünster, 2010), or as a special case of Pólya trees, (Lavine, 1992).
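A short sketch of a draw from the mixture model (3) with a Gaussian kernel, using a truncated stick-breaking draw as above; the truncation level, mass, bandwidth and base measure are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
K, M, sigma = 500, 2.0, 0.3                      # truncation, DP mass, kernel bandwidth
v = rng.beta(1.0, M, K)
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
mu = rng.normal(0.0, 1.0, K)                     # atoms from G0 = N(0, 1)
y = np.linspace(-4.0, 4.0, 200)
f = (w[:, None] * norm.pdf(y[None, :], mu[:, None], sigma)).sum(axis=0)  # random density f_P
```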

Around Gaussian processes
Gaussian processes form another class of very popular processes used in prior modelling in Bayesian nonparametrics. Bayesian modelling via Gaussian processes has strong connections with machine learning approaches, as described in (Rasmussen and Williams, 2006). They are used to model curves. Roughly speaking, a zero mean Gaussian process $(W_t, t \in T)$, for some set $T$, can be viewed as a set of random variables on a probability space $(\Omega, \mathcal{B}, P)$ whose finite dimensional marginals follow multivariate Gaussian distributions. It is characterized by a covariance kernel $K(s,t)$, $s, t \in T$. The behaviour of the Gaussian process is therefore driven by the choice of the kernel. The most well known kernels are the squared exponential kernel $K_a(s,t) = e^{-a\|t-s\|^2}$, the Matérn kernel

$$K_{a,\nu}(s,t) = \frac{(a\|s-t\|)^\nu}{2^{\nu-1}\Gamma(\nu)} \mathcal{K}_\nu\big(a\|s-t\|\big), \quad a, \nu > 0,$$

where $\mathcal{K}_\nu$ is the modified Bessel function of the second kind, and the Brownian motion kernel $K(s,t) = s \wedge t$. The first two correspond to stationary Gaussian processes with $T$ a normed space, while Brownian motion is non stationary and sits on $T \subset \mathbb{R}^+$. These three classes of kernels are associated with very different behaviours of the process; the curves $(W_t, t \in T)$ drawn from these distributions have in particular different smoothness properties. The squared exponential kernel leads to infinitely differentiable curves, unlike the Matérn kernel or Brownian motion. A key feature in understanding the behaviour of the Gaussian process associated with a given kernel is its reproducing kernel Hilbert space (RKHS) $\mathbb{H}$. Roughly speaking, the RKHS is a Hilbert space obtained as the closure, in $L_2(\Omega, \mathcal{B}, P)$, of the linear span of the process, identified with the set of functions $t \mapsto E[W_t \sum_i \alpha_i W_{s_i}]$; see (van der Vaart and van Zanten, 2008a) for a review on the subject.
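The different smoothness behaviours can be visualized by simulating paths under each kernel; the following sketch (grid, hyperparameter values and jitter term are arbitrary choices) draws one path per kernel.

```python
import numpy as np
from scipy.special import gamma, kv

def k_sqexp(s, t, a=10.0):
    return np.exp(-a * (s - t) ** 2)

def k_matern(s, t, a=5.0, nu=1.5):
    z = np.maximum(a * np.abs(s - t), 1e-10)     # clip to avoid the singularity of kv at 0
    return z ** nu * kv(nu, z) / (2 ** (nu - 1) * gamma(nu))

def k_bm(s, t):
    return np.minimum(s, t)

rng = np.random.default_rng(1)
t = np.linspace(1e-3, 1.0, 200)
S, T = np.meshgrid(t, t)
paths = {}
for name, k in [("sqexp", k_sqexp), ("matern", k_matern), ("bm", k_bm)]:
    C = k(S, T) + 1e-8 * np.eye(len(t))          # jitter for numerical positive definiteness
    paths[name] = rng.multivariate_normal(np.zeros(len(t)), C)
# sqexp paths are very smooth, Matern paths are rougher, Brownian paths are nowhere differentiable
```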
There are many other ways to construct probabilities on curves, in a similar spirit to Gaussian processes. Indeed Gaussian processes, under weak conditions, can be decomposed as $\sum_j \lambda_j Z_j e_j$, where $(e_j)_j$ forms an orthonormal basis, $\lambda_j > 0$ and $Z_j \stackrel{iid}{\sim} N(0,1)$. Other types of projections on linear spaces can be considered using wavelets, splines or Legendre polynomials, to name but a few. The prior is then typically formed by (1) choosing the dimension of the space from some distribution on $\mathbb{N}$ and (2) given the dimension of the space, drawing the coefficients of the projection on the space from some specific distribution, as in the sketch below.
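A minimal sketch of such a random series prior; the Poisson dimension prior, the cosine basis and the decaying coefficient scales are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
k = rng.poisson(10) + 1                          # (1) dimension k drawn from a prior on N
scales = np.arange(1, k + 1) ** -1.0             # decaying scales encode smoothness
coefs = rng.normal(0.0, scales)                  # (2) coefficients drawn given k
x = np.linspace(0.0, 1.0, 256)
basis = np.array([np.sqrt(2) * np.cos(np.pi * j * x) for j in range(1, k + 1)])
curve = coefs @ basis                            # one draw of the random curve
```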
We now describe the tools which have been developed to study the asymptotic behaviour of the posterior distribution on large or infinite dimensional spaces.

Notation and setup
Hereafter we consider a Bayesian model $(\mathcal{Y}^{(n)}, P_\theta, \theta \in \Theta)$, where $(\Theta, \mathcal{A})$ is the parameter space, possibly infinite dimensional, $\mathcal{A}$ is its $\sigma$-field, and $\Pi$ is a prior probability on $\Theta$. We assume that the model is dominated by some measure $\nu$ on $\mathcal{Y}^{(n)}$ and we write $f_\theta$ for the density of $P_\theta$ with respect to $\nu$ and $\ell_n(\theta) = \log f_\theta(Y^n)$ for the log-likelihood. The posterior distribution can then be represented as

$$\Pi(B \mid Y^n) = \frac{\int_B e^{\ell_n(\theta)} \, d\Pi(\theta)}{\int_\Theta e^{\ell_n(\theta)} \, d\Pi(\theta)} \quad \text{for all } B \in \mathcal{A}. \qquad (4)$$

Hereafter $\theta_0$ denotes the true value of the parameter, as we are now focusing on the frequentist properties of $\Pi(\cdot \mid Y^n)$. For all $\theta \in \Theta$, $E_\theta$ and $V_\theta$ denote respectively expectation and variance with respect to $P_\theta$.

Posterior consistency
Consider a Bayesian model as described in Section 3.1 with a prior probability $\Pi$ on $\Theta$. We say that the posterior distribution is consistent with respect to a loss function $d(\cdot,\cdot)$ on $\Theta_0 \subset \Theta$ if, for all $\theta_0 \in \Theta_0$ and all $\epsilon > 0$,

$$\Pi\big(d(\theta, \theta_0) > \epsilon \mid Y^n\big) \to 0 \quad \text{in } P_{\theta_0}\text{-probability}. \qquad (5)$$

In other words, posterior consistency means that the posterior distribution concentrates around the true parameter $\theta_0$ in terms of the loss $d(\cdot,\cdot)$. Posterior consistency is a minimal requirement, in particular in the context of large dimensional models where it is not possible to construct fully subjective priors. Moreover, even from a subjective Bayes point of view, posterior consistency is important since it is the necessary and sufficient condition for the asymptotic merging of two posterior distributions associated with two different priors as the information in the data, $n$, goes to infinity, see (Diaconis and Freedman, 1986).
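For intuition, consistency can be watched numerically in the simplest conjugate example; the sketch below (Beta-Bernoulli model with a uniform prior; $\theta_0$ and $\epsilon$ are arbitrary choices) shows the posterior mass outside an $\epsilon$-ball around $\theta_0$ vanishing as $n$ grows.

```python
import numpy as np
from scipy.stats import beta

theta0, eps = 0.3, 0.05
rng = np.random.default_rng(3)
for n in (100, 1000, 10000):
    s = rng.binomial(n, theta0)                      # sufficient statistic: number of ones
    post = beta(1 + s, 1 + n - s)                    # uniform prior -> Beta posterior
    mass_out = 1 - (post.cdf(theta0 + eps) - post.cdf(theta0 - eps))
    print(n, mass_out)                               # Pi(|theta - theta0| > eps | Y^n) -> 0
```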
Although not all priors lead to posterior consistency, posterior consistency has been verified for a large number of models and prior distributions. This line of work was initiated by (Schwartz, 1965) in the case of density estimation and extended by (Barron, 1988) to generic models.

Posterior concentration rates
Posterior concentration (or contraction) rates are defined as follows.

Definition 1. The posterior distribution concentrates at rate $\epsilon_n$ at $\theta_0$ if there exists $M > 0$ such that

$$E_{\theta_0}\big[\Pi\big(d(\theta, \theta_0) > M\epsilon_n \mid Y^n\big)\big] \to 0. \qquad (6)$$

Posterior concentration or contraction rates are therefore a more precise version of posterior consistency, since they provide an upper bound on the rate at which the posterior distribution shrinks towards the true parameter $\theta_0$. Typically $\epsilon_n$ depends on characteristics of $\theta_0$ and on properties of the prior distribution $\Pi$.
So why is it interesting to study posterior concentration rates? From a frequentist point of view, (6) typically implies that Bayesian estimates such as the posterior mean or the posterior median have a frequentist risk of order $\epsilon_n$, see (Ghosal et al., 2000). It is also interesting for understanding the behaviour of credible balls with respect to $d(\cdot,\cdot)$, i.e. credible sets defined as

$$C_\alpha = \{\theta : d(\theta, \hat\theta) \le z_\alpha\},$$

where $z_\alpha$ is the $1-\alpha$-th quantile of the posterior distribution of $d(\theta, \hat\theta)$ and $\hat\theta$ is some given estimator, like the posterior mean of $\theta$. Indeed, as explained in (Hoffman et al., 2013), if the posterior concentrates at rate $\epsilon_n$, then the size (radius) of $C_\alpha$ is of order at most $\epsilon_n$ with probability going to 1, while by construction $E_{\theta_0}[\Pi(\theta \in C_\alpha \mid Y^n)] = 1 - \alpha$. Hence the credible region is not necessarily an honest confidence region, but on average it is a confidence region with coverage $1 - \alpha$.
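In practice $z_\alpha$, and hence $C_\alpha$, can be read off posterior draws; a minimal sketch, where the Euclidean loss and the posterior mean as $\hat\theta$ are illustrative choices:

```python
import numpy as np

def credible_ball(draws, alpha=0.05):
    """draws: array (n_draws, dim) of posterior samples; returns (theta_hat, z_alpha)."""
    center = draws.mean(axis=0)                        # theta_hat: posterior mean
    dists = np.linalg.norm(draws - center, axis=1)     # d(theta, theta_hat), Euclidean here
    return center, np.quantile(dists, 1 - alpha)       # z_alpha: (1 - alpha)-quantile
```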
Finally, deriving posterior concentration rates is enlightening about the way the prior distribution acts, which is particularly important in high dimensional models; we now explain how these posterior concentration rates can be derived. We then illustrate in Section 3.3.2, on two families of examples, why the study of posterior concentration rates sheds light on the impact of the prior.

Conditions and results.
Similarly to posterior consistency, posterior concentration rates are obtained by verifying the following types of conditions, see (Ghosal et al., 2000, Ghosal and van der Vaart, 2007a):

• (i) Kullback-Leibler condition: there exists $c > 0$ such that

$$\Pi\big(B_n(\theta_0, \epsilon_n)\big) \ge e^{-c n \epsilon_n^2}, \qquad B_n(\theta_0, \epsilon_n) = \big\{\theta : KL(\theta_0, \theta) \le n\epsilon_n^2, \; V_{\theta_0}(\theta_0, \theta) \le n\epsilon_n^2\big\},$$

where $KL(\theta_0, \theta) = E_{\theta_0}[\ell_n(\theta_0) - \ell_n(\theta)]$ and $V_{\theta_0}(\theta_0, \theta) = V_{\theta_0}[\ell_n(\theta_0) - \ell_n(\theta)]$.

• (ii) Testing condition: there exist $\Theta_n \subset \Theta$ and a sequence of test functions $\phi_n$ such that, for some constant $C$ large enough relative to $c$,

$$E_{\theta_0}[\phi_n] = o(1), \qquad \sup_{\theta \in \Theta_n : d(\theta, \theta_0) > M\epsilon_n} E_\theta[1 - \phi_n] \le e^{-C n \epsilon_n^2}, \qquad \Pi(\Theta_n^c) \le e^{-C n \epsilon_n^2}.$$

Roughly speaking, the argument follows from the following decomposition: write

$$\Pi\big(d(\theta, \theta_0) > M\epsilon_n \mid Y^n\big) = \frac{N_n}{D_n}, \qquad N_n = \int_{\{d(\theta,\theta_0) > M\epsilon_n\}} e^{\ell_n(\theta) - \ell_n(\theta_0)} d\Pi(\theta), \quad D_n = \int_\Theta e^{\ell_n(\theta) - \ell_n(\theta_0)} d\Pi(\theta);$$

since $N_n / D_n \le 1$ and $0 \le \phi_n \le 1$, we can write

$$\Pi\big(d(\theta, \theta_0) > M\epsilon_n \mid Y^n\big) \le \phi_n + \mathbb{1}_{\{D_n < e^{-(c+2)n\epsilon_n^2}\}} + \frac{N_n(1 - \phi_n)}{D_n} \mathbb{1}_{\{D_n \ge e^{-(c+2)n\epsilon_n^2}\}}.$$

The Kullback-Leibler condition allows one to control $D_n$ from below: a now standard lemma based on Chebyshev's inequality applied to $\ell_n(\theta) - \ell_n(\theta_0)$ over $B_n(\theta_0, \epsilon_n)$ shows that $D_n \ge e^{-(c+2)n\epsilon_n^2}$ with probability going to 1. Then, using Fubini's theorem and Markov's inequality,

$$E_{\theta_0}\big[N_n (1 - \phi_n)\big] = \int_{\{d(\theta,\theta_0) > M\epsilon_n\}} E_\theta[1 - \phi_n] \, d\Pi(\theta) \le e^{-Cn\epsilon_n^2} + \Pi(\Theta_n^c) \le 2 e^{-Cn\epsilon_n^2},$$

so that the last term in the decomposition is $o_{P_{\theta_0}}(1)$ as soon as $C > c + 2$. There exist in the literature variations around this decomposition and the above conditions, but the ideas are all along these lines.
Following the frequentist literature, one typically characterizes concentration rates in terms of a few features of the true parameter. For instance, in the case of curve estimation, it is common practice either to assume some smoothness property of the curve, like Hölder, Sobolev or Besov regularity, or some shape constraint such as monotonicity or convexity. The obtained rates tend to be uniform over some functional classes or some collections of functional classes.
There is a growing literature on the field, and large classes of prior distributions and models have been studied using the above approach.
In the context of density estimation for i.i.d. random variables, the renowned Dirichlet process mixture models have been studied by (Ghosal and van der Vaart, 2007b, Kruijer et al., 2010, Scricciolo, 2014, Shen et al., 2013, Canale and de Blasi, 2013), among others, in the case of Gaussian mixtures, and by (Ghosal, 2001, Rousseau, 2010) for Beta mixtures.
Log-linear, log-spline and log-Gaussian process priors have also been considered, by (Ghosal et al., 2000, Rivoirard and Rousseau, 2012b, van der Vaart and van Zanten, 2008b, van der Vaart and van Zanten, 2009) to name but a few. In (van der Vaart and van Zanten, 2008b, van der Vaart and van Zanten, 2009) posterior concentration rates have been derived for general models, when the prior on the unknown curve is constructed using a Gaussian process prior. Their results have recently been extended to a multivariate setup where both anisotropy and dimension reduction are incorporated in the prior model by (Bhattacharya et al., 2014a). Various other sampling models have been studied in the literature following the above approach, such as inhomogeneous and Aalen point processes (Belitser et al., 2012, Donnet et al., 2014a), regression models (de Jonge and van Zanten, 2010) and Gaussian time series (Rousseau et al., 2012), to name but a few.
In (van der Vaart and van Zanten, 2008b), the authors develop a very elegant strategy to verify conditions (i) and (ii) and thus determine posterior concentration rates in the context of Gaussian process priors, whatever the sampling model. Their approach makes use of the reproducing kernel Hilbert space $\mathbb{H}$ (RKHS) associated with a zero-mean Gaussian process $W$, viewed as a Borel map in a Banach space $(\mathbb{B}, \|\cdot\|)$. More precisely, when the losses $KL(\theta_0, \theta)$, $V_{\theta_0}(\theta_0, \theta)$ and $d(\theta_0, \theta)$ can be related (locally bounded, typically) to the norm $\|\theta - \theta_0\|$ of $\mathbb{B}$, then the rate $\epsilon_n$ appearing in (i) and (ii) can be bounded by the solution of

$$\varphi_{\theta_0}(\epsilon_n) \le n\epsilon_n^2, \qquad \varphi_{\theta_0}(\epsilon) = \inf_{h \in \mathbb{H}: \|h - \theta_0\| \le \epsilon} \|h\|_{\mathbb{H}}^2 - \log P\big(\|W\| \le \epsilon\big), \qquad (7)$$

where $\|h\|_{\mathbb{H}}$ is the RKHS norm. They apply their results to the contexts of density estimation, non linear regression, classification and the white noise model. Other families of prior models have been studied in a generic way, i.e. somehow irrespective of the sampling model. For instance, (Arbel et al., 2013) propose general conditions for prior distributions on a sequence parameter $\theta = (\theta_i)_{i \in \mathbb{N}}$, where the conditional distribution of $\theta$ given $k$, $\pi(\cdot \mid k)$, has the form

$$k \sim \pi_k, \qquad \pi(\theta \mid k) = \prod_{i=1}^{k} g(\theta_i) \prod_{i > k} \delta_0(\theta_i). \qquad (8)$$

In other words, under the prior distribution, $\theta$ is truncated according to a distribution $\pi_k$ on $\mathbb{N}$, and given a truncation level $k$, the $k$ non-null components of $\theta$ are independent.
3.3.2. What do the two conditions (i) and (ii) tell us about the impact of the prior distribution?

In the case of Gaussian process prior models, for instance, (7) shows that posterior concentration rates are characterized by the smoothness of the true curve $\theta_0$ and the smoothness of the Gaussian process itself, i.e. by its RKHS. Indeed, the small ball probabilities $\log P(\|W\| \le \epsilon)$ depend on the RKHS, and the smoother the RKHS the larger $-\log P(\|W\| \le \epsilon)$, while $\inf_{h \in \mathbb{H}: \|h - \theta_0\| \le \epsilon} \|h\|_{\mathbb{H}}^2$ indicates how well $\theta_0$ can be approximated by elements of the RKHS $\mathbb{H}$. Hence, if $\theta_0$ is not smooth enough compared to the elements of the RKHS $\mathbb{H}$, the latter term will be large and the posterior distribution will tend to have a large bias, while $-\log P(\|W\| \le \epsilon)$ can be viewed as a measure of variance or spread.
Although (7) gives only an upper bound on the posterior concentration rate, lower bounds have been derived in the literature showing that it is often sharp, see (Castillo, 2008). These results have shown that Gaussian processes are not as flexible as one might have hoped, and that the behaviour of the posterior distribution is highly dependent on the covariance kernel $K(\cdot,\cdot)$, which in turn determines the RKHS $\mathbb{H}$, since its influence does not disappear asymptotically to first order. This is not only true for Gaussian processes. Generally speaking, the two main conditions (i) and (ii) above shed light on key features of the behaviour of posterior distributions. First, the prior model needs to be flexible enough to approximate the true distribution well (in terms of Kullback-Leibler divergence). For instance, consider the problem of density estimation for i.i.d. data, and take a prior model based on location mixtures of Gaussian distributions,

$$f_{P,\sigma}(y) = \int_{\mathbb{R}^d} \varphi_\sigma(y - \mu) \, dP(\mu), \qquad (9)$$

where $\varphi_\sigma$ denotes the density of a Gaussian random variable in $\mathbb{R}^d$ with mean 0 and variance $\sigma^2 I_d$. The posterior concentration rates associated with this type of prior model have been studied by (Ghosal and van der Vaart, 2007b, Kruijer et al., 2010, Scricciolo, 2014, Shen et al., 2013). The prior is constructed by considering a prior on $(P, \sigma)$, where $P$ varies in the set of probability distributions on $\mathbb{R}^d$. A popular choice for the prior on $P$ is the Dirichlet process $DP(M, G)$ with mass $M$ and base measure $G$, as defined in Section 2.1. Smooth densities on $\mathbb{R}$ can be well approximated by mixtures of the form (9). To understand what this means, one constructs finite mixtures of Gaussian densities which approximate $f$ with as small a number of components as possible. Let $f$ be a density which has Hölder(-type) smoothness $\beta$; it is possible to construct a probability density $f_\beta$ such that the continuous mixture $f_{F_\beta,\sigma}$ approximates $f$ to the order $\sigma^\beta$, for all $\beta > 0$, where $F_\beta$ is the distribution associated with $f_\beta$, see (Kruijer et al., 2010, Shen et al., 2013). One then approximates the continuous mixture by a finite mixture, and it can be proved that $f_{F_\beta,\sigma}$ can be approximated to the order $\sigma^\kappa$, for any $\kappa > 0$, by mixtures $f_{P_N,\sigma}$ where $P_N$ has at most $N = O(|\log \sigma| \sigma^{-1})$ supporting points. Controlling $N$ is a crucial step in proving the Kullback-Leibler condition, since it provides an upper bound on the number of constraints on the parameter space that are needed to approximate a density $f$ with smoothness $\beta$ by densities of the form (9). It thus leads to a lower bound on the prior mass of Kullback-Leibler neighbourhoods of $f$. Choosing $\sigma$ of the form $\sigma = n^{-1/(2\beta+1)}(\log n)^q$, $q \in \mathbb{R}$, leads to condition (i) with $\epsilon_n^2 = n^{-2\beta/(2\beta+1)}(\log n)^{2q\beta+1}$. Under the $L_1$ or the Hellinger loss function for $d(\cdot,\cdot)$, the tests in condition (ii) are constructed from the tests of (Schwartz, 1965) or (Birgé, 1983) and are controlled by bounding from above the entropy (i.e. the logarithm of the number of small balls needed to cover the set) of subsets of finite location mixtures of Gaussian distributions with at most $n^{1/(2\beta+1)}(\log n)^{q+1}$ components. Finally, the posterior concentration rate for densities $f$ with smoothness $\beta$ (in a local Hölder sense, as described in (Kruijer et al., 2010) or in (Shen et al., 2013)), and under some exponential type condition on the tails of $f$, is bounded by

$$\epsilon_n = n^{-\beta/(2\beta+1)} (\log n)^{\tau}$$

for some $\tau \ge 0$, which is the minimax rate of convergence over this functional class, up to a $\log n$ term.
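The first approximation step can be checked numerically: convolving a smooth density $f$ with $\varphi_\sigma$ gives an error of order $\sigma^\beta$ (here $\sigma^2$, since plain convolution saturates at second order without the corrected mixands of (Kruijer et al., 2010)); the grid and the test density below are illustrative choices.

```python
import numpy as np
from scipy.stats import norm
from scipy.ndimage import gaussian_filter1d

x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]
f = norm.pdf(x)                                                 # smooth target density
for sigma in (0.4, 0.2, 0.1, 0.05):
    f_sigma = gaussian_filter1d(f, sigma / dx, mode="constant") # f * phi_sigma on the grid
    print(sigma, np.abs(f_sigma - f).max())                     # error shrinks like sigma^2
```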
Although deriving conditions (i) and (ii) is quite informative about the way the prior acts, in the case of these nonparametric mixture models the picture is far from complete. In the case of smooth density estimation, one expects the posterior distribution on the scale $\sigma$ to concentrate on small values. This would mean that a common variation of the prior model (9), namely the location-scale mixture

$$f_P(y) = \int_{\mathbb{R}^d \times \mathbb{R}^+} \varphi_\sigma(y - \mu) \, dP(\mu, \sigma), \qquad (10)$$

might not be the best suited prior model for estimating a smooth density. Indeed, following the above computations one obtains a much too small lower bound on the prior mass of Kullback-Leibler neighbourhoods of $f$, and the resulting posterior concentration rate is suboptimal, see (Canale and de Blasi, 2013). Whether this is an artefact of the proof or a real suboptimality remains an open question. To be able to answer such a question, one needs to characterize fully the neighbourhoods of the true density, obtaining not only a lower bound on their mass but also an upper bound. Given the complexity of the geometry of mixture models, the latter is a much more formidable task than the former. Model (10) is however more commonly used than the location mixture (9), and it is often considered as better behaved. This discrepancy between theory and practice has not yet been resolved.
A second crucial aspect of conditions (i) and (ii) is the existence of tests with second type error bounded by $e^{-Cn\epsilon_n^2}$. This condition restricts the choice of loss functions. In particular, (Hoffman et al., 2013) show that if there exist parameters $\theta$ which are close to $\theta_0$ for some intrinsic loss (for which the tests of condition (ii) can be constructed, such as the $L_2$ loss in the white noise model or the Hellinger distance in density models) but not in terms of $d(\cdot,\cdot)$, then the testing method above will lead to suboptimal bounds.
Interestingly, the prior based on model (9) does not depend on the true smoothness $\beta$ of the density $f_0$, yet the posterior adapts to the unknown smoothness $\beta$ of $f_0$. This is one of the strengths of the Bayesian methodology: by naturally incorporating hierarchical structures in the prior, it often makes it possible to construct posterior distributions having good frequentist properties not only over a functional class, but over a collection of functional classes.

Bayesian nonparametrics: a useful tool to derive adaptive optimal methods
Bayesian methods have become popular in particular because they easily incorporate hierarchical modelling. This is also the case for nonparametric models: for most families of priors studied so far, it has been possible to construct hierarchical versions of them so as to obtain good frequentist properties over collections of functional classes.
If the posterior concentration rate (6), when the true parameter $\theta_0$ is allowed to vary in a class $\Theta_\beta \subset \Theta$, is uniformly bounded by the frequentist minimax estimation rate over the same class under the same loss function, for instance $n^{-\beta/(2\beta+1)}$ for a $\beta$-Hölder ball in the setup of density estimation under the $L_1$ loss, then we say that the posterior concentrates at the minimax rate over $\Theta_\beta$. If, for a collection of classes, for instance $\Theta_\beta$, $\beta \in [\beta_1, \beta_2]$, the posterior concentrates at the minimax rate within each class, then we say that it concentrates at the minimax adaptive rate. Hierarchical modelling of prior distributions naturally leads to minimax adaptive posterior concentration rates.
For instance, in the context of Gaussian process priors, (van der Vaart and van Zanten, 2009) study conditional Gaussian process priors on curves $g$ defined as

$$g(x) = W_{Ax}, \qquad W \sim GP(0, K), \quad A \sim \Pi_A,$$

where $GP(0, K)$ denotes a Gaussian process prior with mean 0 and covariance kernel $K(s,t) = e^{-(s-t)^2}$, and $\Pi_A$ is a probability on $\mathbb{R}^+$. The authors then show that, for various types of sampling models parametrized by the curve $g$, the posterior distribution concentrates around the true curve at a rate which is the minimax optimal estimation rate, up to a $\log n$ term, over a collection of Hölder classes, under a suitable prior $\Pi_A$. The prior does not depend on the supposed smoothness of the true curve, and the posterior therefore leads to minimax adaptive estimators. This construction has been extended, in particular by (Bhattacharya et al., 2014a), to anisotropic multivariate curves.
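A sketch of one draw from this conditional prior; the Gamma hyperprior standing in for $\Pi_A$ and the grid are illustrative assumptions ((van der Vaart and van Zanten, 2009) use a Gamma-type prior on a power of $A$).

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 200)
A = rng.gamma(shape=2.0, scale=2.0)                # random bandwidth A ~ Pi_A
S, T = np.meshgrid(A * x, A * x)
C = np.exp(-(S - T) ** 2) + 1e-8 * np.eye(len(x))  # kernel K evaluated at rescaled inputs
g = rng.multivariate_normal(np.zeros(len(x)), C)   # g(x) = W_{Ax}: one draw of the curve
# large A yields wiggly curves, small A smooth ones; the hyperprior mixes over both
```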
There is now a large range of results on posterior concentration rates for hierarchical nonparametric prior models where adaptive minimax (up to a $\log n$ term, usually) posterior concentration rates have been achieved. For instance, the hierarchical prior construction (8) has been proved to lead to adaptive minimax concentration rates over collections of Sobolev or Besov balls for a variety of models in (Arbel et al., 2013), and for some linear inverse problems in (Ray, 2013, Knapik and Salomond, 2015). The nonparametric location mixture of Gaussian random variables with an inverse Gamma prior on the scale parameter $\sigma$ also leads to adaptive minimax concentration rates over collections of locally Hölder classes, as described above.
In the last few years, Bayesian nonparametric adaptive methods have been studied where adaptation is achieved not only with respect to some smoothness characteristic but also with respect to sparsity in high dimensional models. These include the sequence model, where one observes $n$ independent observations $Y_i = \theta_i + \epsilon_i$, $\epsilon_i \stackrel{iid}{\sim} N(0,1)$, $i \le n$, which we will consider as an illustrative example of the types of phenomena that occur in high dimensional frameworks, keeping in mind that it is simpler than other models like high dimensional regression or high dimensional graphical models. The most natural way to design a sparsity prior in this context is to first select the set $S$ of nonzero coefficients, and then put a prior on $\theta_S = (\theta_i, i \in S)$; a sketch is given below. In (Castillo and van der Vaart, 2012), posterior concentration rates around $\theta_0$ in terms of the $L_q$ losses $\|\theta - \theta_0\|_q$, $1 < q \le 2$, are derived under some conditions on such priors. They show that considering a family of priors on $S$, defined by first choosing the size $|S| = p$ according to a distribution with exponential tails and then randomly selecting $S$ given its size $p$, leads to the minimax adaptive posterior concentration rate $r_n^2 = p \log(n/p)$ under the loss $\|\theta - \theta_0\|_2^2$, uniformly over the set $\ell_0(p) = \{\theta : \|\theta\|_0 \le p\}$, with $\|x\|_0$ denoting the number of nonzero coefficients of $x$.
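A sketch of one draw from such a sparsity prior; the specific size prior with exponentially decaying tails and the Laplace slab are illustrative choices in the spirit of (Castillo and van der Vaart, 2012).

```python
import numpy as np

def sparse_prior_draw(n, c=0.5, rng=None):
    rng = rng or np.random.default_rng(5)
    sizes = np.arange(n + 1)
    logw = -c * sizes * np.log(np.maximum(sizes, 1))   # size prior with exponential-type tails
    w = np.exp(logw - logw.max())
    p = rng.choice(sizes, p=w / w.sum())               # |S| = p
    S = rng.choice(n, size=p, replace=False)           # S uniform among sets of size p
    theta = np.zeros(n)
    theta[S] = rng.laplace(0.0, 1.0, size=p)           # heavy-tailed slab on theta_S
    return theta

theta = sparse_prior_draw(500)
print(int((theta != 0).sum()))                         # realized sparsity level
```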
Although the approach described in Section 3.3 for deriving posterior concentration rates is used, (i) and (ii) are not the only steps in their proof. This is due to the complexity of the parameter space. Before using the usual testing and Kullback-Leibler arguments, the authors first prove that the posterior distribution concentrates on sets of parameters having at most $Mp$ nonzero coefficients, for some large but fixed constant $M$. Then, on this reduced parameter space, they prove posterior concentration rates following steps (i) and (ii). This is common, if not inevitable, in high dimensional models with sparse parameters, when one also needs to learn the sparsity of the parameter. Interestingly, if the prior on $|S|$, or the prior on $\theta_S$ given $S$, has too light tails, then the authors prove that the posterior concentrates at a suboptimal rate for large values of $\|\theta_0\|_2$. In many applications this is not a crucial issue, since large signals are easily detected and the statistical analysis is typically used to detect small signals.
The above family of sparse priors is appealing in a high dimensional but sparse context, but it is difficult to implement and so far has only been implemented for moderately large dimensional models. Alternative priors have been proposed in the literature with posterior distributions that are easier to sample from, but their asymptotic properties had not been studied. Recently, (Bhattacharya et al., 2014b) have proposed a continuous type of shrinkage prior, closer in spirit to the Lasso, which also achieves the optimal minimax adaptive posterior concentration rate, under the constraint that the true signal is not too large: $\|\theta_0\|_2^2 \le p (\log n)^4$, where $p = \|\theta_0\|_0$ is the number of nonzero components of $\theta_0$. (Castillo et al., 2015) have extended the results of (Castillo and van der Vaart, 2012) to the case of high dimensional linear regression, with a prior distribution on the sparsity inducing very sparse models.
Other families of high dimensional models have been considered, in particular sparse matrix and graphical models have been studied by (Banerjee and Ghosal, 2015, Bhattacharya and Dunson, 2011, Pati et al., 2014).

On frequentist coverage of credible regions in large dimensional models
As mentioned above, posterior concentration rates are useful to assess the size of posterior credible bands and, on average, their frequentist coverage, when $1 - \alpha$ is the Bayesian coverage of the credible band $C_\alpha$ (see Section 3.3). This does not imply however that $C_\alpha$ is an honest confidence region in a frequentist sense, i.e. that

$$\inf_{\theta_0 \in \Theta_0} P_{\theta_0}\big(\theta_0 \in C_\alpha\big) \ge 1 - \alpha + o(1). \qquad (12)$$
In parametric regular models, thanks to the Bernstein-von Mises theorem, (12) is valid on compact subsets of $\Theta$ and for standard credible regions like highest posterior density regions, or ellipsoids around the posterior mean or mode. In nonparametric models it is expected not to be satisfied, and the first results on the frequentist coverage of credible regions in infinite dimensional models were negative: (Cox, 1993) and (Freedman, 1999) exhibited negative results in the context of Gaussian models with Gaussian priors, where, almost surely under the prior, the frequentist coverage of an $\ell_2$ credible ball could be arbitrarily close to 0. Despite these results, the picture is not all negative. As said previously, an attractive feature of Bayesian (hierarchical) approaches is that, when properly tuned, they are adaptive procedures, and up to $\log n$ terms are often minimax adaptive over collections of functional classes, say $\Theta_\beta$, $\beta \in A$ (in terms of their posterior concentration rate). Consider, for the sake of simplicity, the white noise model and $\Theta_\beta$ a Hölder ball with regularity $\beta$. If $C_\alpha$ is an $\ell_2$ credible ball for the parameter $\theta$ constructed under a minimax adaptive posterior distribution, then its radius is of order $\epsilon_n(\beta)$, where $\epsilon_n(\beta)$ is the minimax estimation rate over the class $\Theta_\beta$; in other words its size is adaptive minimax. It is known, see for instance (Cai and Low, 2006), that there do not exist honest confidence regions (i.e. satisfying (12)) with adaptive size, unless $A \subset [\beta_1, 2\beta_1]$ for some $\beta_1$. Hence $C_\alpha$ cannot be an honest confidence region.
Can we find a subset $\Theta_0$ of $\Theta$ over which $C_\alpha$ could be considered an honest confidence region?
Recently, in (Szabó et al., 2013), the authors have answered this question in the special case of the white noise model, under the empirical Bayes posterior described in Section 4.0.1 and based on (Szabò et al., 2013). They find a set of well behaved parameters $\Theta_0 \subset \ell_2$, called the polished tail parameter set, over which (12) is verified. Their results rely heavily on the precise structure of the prior and sampling model, but they give some insight into what can be expected in other types of models.
In (Castillo and Nickl, 2013), conditions for deriving a weak nonparametric Bernstein-von Mises theorem are derived in the white noise and density models, which leads to the construction of credible bands with correct asymptotic frequentist coverage. The types of priors considered in (Castillo and Nickl, 2013) are based on expansions of the curve on wavelet bases. The drawback of this approach is that the credible bands are constructed in terms of weighted $L_2$ norms, which are difficult to interpret in practice. An advantage, however, is that from this result it is possible to derive Bernstein-von Mises theorems for smooth functionals of either the signal in the white noise model or the density in the density model. Obtaining refined results such as Bernstein-von Mises theorems for finite dimensional functionals of the parameter is typically easier than for the whole parameter; it is however not a simple task, and positive general results have been obtained only recently and in a still rather restricted framework. This is presented briefly in Section 5.

SUMMARY POINTS
1. Many common Bayesian nonparametric prior models lead to posterior distributions with minimax concentration rates.
2. Using hierarchical priors, it is possible and relatively easy to obtain adaptive procedures. Adaptation may be with respect to some smoothness or some sparsity aspect of the parameter.
3. Posterior contraction rates are related to the size of credible balls or regions, but not so much to their frequentist coverage. Understanding the frequentist coverage of a credible ball (or band) is more involved, and only a few results have been obtained until now. It is becoming, however, an active area of research.
Empirical Bayes is an alternative to hierarchical Bayes; we describe in the following section some recent advances that have been obtained on the properties of empirical Bayes methods.

Empirical Bayes procedures
Traditionally, empirical Bayes designates frequentist methods in the context of multiple experiments, where each experiment is associated with a specific parameter and where these parameters have a common distribution $\Pi$ which is estimated using a frequentist estimator such as the maximum likelihood estimator. This was initiated by Robbins, see for instance (Robbins, 1964). However, the term empirical Bayes is also used for any Bayesian approach where the prior is data-dependent. In this section we focus on the latter, which is widely used in practice since, more often than not, at some stage of the construction of the prior some information coming from the data is used, in a more or less formalized way. It is typically believed that this should be better than an arbitrary choice of the hyperparameters of the prior.
The setup is the following. Consider a family of prior distributions on a parameter $\theta \in \Theta$ indexed by a hyperparameter $\gamma$, $(\Pi(\cdot \mid \gamma), \gamma \in \Gamma)$. A hierarchical approach would consist in constructing a prior on $\gamma \in \Gamma$, while the empirical Bayes approach selects a data dependent value $\hat\gamma$ based on $Y^n$. There are many ways to choose $\hat\gamma$; the two main categories are (a) using moment conditions or similar considerations and (b) using the maximum marginal likelihood estimator. There are other methods that do not quite enter these categories, such as cross-validation or other frequentist methods used to select hyperparameters, but they have not been studied in the Bayesian setup so far.
The most common of the two is (a), although it is the less formalized. For instance, in the case of mixtures of Gaussian random variables, in (Green and Richardson, 2001) and (Richardson and Green, 1997) the authors advocate, based on invariance considerations, the choice of $m_0$ as the midrange of the data and $\tau_0^2$ as the range of the data. Another possibility would be to use a moment relation between the hyperparameters and the distribution of the data. The maximum marginal likelihood estimator $\hat\gamma_{ML}$ of $\gamma$ is defined, when it exists, as

$$\hat\gamma_{ML} = \arg\max_{\gamma \in \Gamma} m_n(\gamma), \qquad m_n(\gamma) = \int_\Theta e^{\ell_n(\theta)} \, d\Pi(\theta \mid \gamma).$$

This approach has been used for instance by (George and Foster, 2000, Cui and George, 2008, Scott and Berger, 2010) in the context of variable selection in regression models, by (Belitser and Levit, 2002, Clyde and George, 2000, Szabò et al., 2013, Knapik et al., 2012) for sequence or white noise models, and by (Liu, 1996) for the mass parameter $M$ in a Dirichlet process mixture prior. What are the asymptotic properties of such empirical Bayes posteriors? Can we derive general methods to study these properties, as is done in Section 3.3 for fully Bayes procedures?
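As an illustration of approach (b), in a conjugate toy version of the sequence model the marginal likelihood is available in closed form and $\hat\gamma_{ML}$ can be computed directly; all modelling choices below (the prior variances $\tau_0^2 i^{-2\alpha-1}$, the true sequence, the noise level) are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize_scalar

n, alpha = 1000, 1.0
rng = np.random.default_rng(6)
i = np.arange(1, n + 1)
theta0 = i ** -1.5                                   # an assumed true sequence
y = theta0 + rng.normal(0.0, 1.0, n) / np.sqrt(n)    # Y_i = theta_i + eps_i / sqrt(n)

def neg_log_marginal(tau0):
    var = tau0 ** 2 * i ** (-2 * alpha - 1) + 1.0 / n    # marginally Y_i ~ N(0, var_i)
    return 0.5 * np.sum(np.log(var) + y ** 2 / var)

res = minimize_scalar(neg_log_marginal, bounds=(1e-3, 1e2), method="bounded")
tau0_ml = res.x    # gamma_ML; the empirical Bayes posterior is Pi(. | Y^n, tau0_ml)
```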
The approach presented in Section 3.3 to prove posterior concentration rates for fully Bayesian posteriors uses Fubini's theorem repeatedly, which cannot be applied in the context of a data dependent prior. Moreover, in infinite dimensional models the prior distributions $\Pi(\cdot \mid \gamma)$, $\gamma \in \Gamma$, are often mutually singular, see for instance (Ghosh and Ramamoorthi, 2003) in the case of Dirichlet processes, or (van der Vaart and van Zanten, 2008b) in the case of Gaussian process priors. However, recently in (Donnet et al., 2014b) the authors derived a general theory to study posterior concentration rates in the context of data dependent priors. Also, building on this theory, together with the results of (Petrone et al., 2014) in the finite dimensional case and of (Knapik et al., 2012, Szabò et al., 2013) in the Gaussian white noise model with Gaussian process priors, we now have a better understanding of the behaviour of the maximum marginal likelihood estimator in infinite dimensional models.

4.0.1. Dealing with data dependence in the prior.

Consider a family of prior distributions $(\Pi(\cdot \mid \gamma), \gamma \in \Gamma)$ and a data dependent value $\hat\gamma$, and denote the empirical Bayes posterior by $\Pi(\cdot \mid Y^n, \hat\gamma)$. In this section we assume that there exists a compact set $\Gamma_0 \subset \Gamma \subset \mathbb{R}^d$, for some $d < +\infty$, such that, with probability going to 1, $\hat\gamma \in \Gamma_0$.
The aim is to find the smallest possible sequence $\epsilon_n$ such that $\Pi(B^c_{\epsilon_n}(\theta_0) \mid Y^n, \hat\gamma) \to 0$ in probability under $P_{\theta_0}$, where $B_{\epsilon_n}(\theta_0) = \{\theta : d(\theta, \theta_0) \le \epsilon_n\}$. To do so, we in fact prove that

$$\sup_{\gamma \in \Gamma_0} \Pi\big(B^c_{\epsilon_n}(\theta_0) \mid Y^n, \gamma\big) \to 0 \quad \text{in } P_{\theta_0}\text{-probability},$$

so that the pre-selection of $\Gamma_0$ is important.
For instance, if $\hat\gamma = \bar{Y}_n$ or some other moment estimator, then under simple ergodicity conditions on $P_{\theta_0}$, $\Gamma_0$ can be chosen of the form $\Gamma_0 = [\mu_0 - \epsilon, \mu_0 + \epsilon]$, where $\mu_0$ is the limit of $\bar{Y}_n$ under $P_{\theta_0}$, either in probability or almost surely.
The second key step in dealing with a data dependent prior is to transfer the data dependence from the prior to the likelihood. To do so, we consider changes of measure $\psi_{\gamma,\gamma'} : \Theta \to \Theta$ such that if $\theta \sim \Pi(\cdot \mid \gamma)$ then $\psi_{\gamma,\gamma'}(\theta) \sim \Pi(\cdot \mid \gamma')$. For instance, in the case of the Dirichlet process mixture of Gaussian densities (9) with base measure $N(m_0, \tau_0^2)$, using the stick-breaking representation of the Dirichlet process we can write $\theta = f_{P,\sigma}$ with

$$P = \sum_{j=1}^{\infty} p_j \delta_{\mu_j}, \qquad \mu_j \stackrel{iid}{\sim} N(m_0, \tau_0^2),$$

and if $\gamma = (m_0, \tau_0^2)$, then for all $\tau_0' > 0$ and $m_0' \in \mathbb{R}$, defining $\gamma' = (m_0', \tau_0'^2)$ and

$$\mu_j' = m_0' + \tau_0' \, \frac{\mu_j - m_0}{\tau_0}, \qquad P' = \sum_{j=1}^{\infty} p_j \delta_{\mu_j'},$$

we have $f_{P',\sigma} \sim \Pi(\cdot \mid \gamma')$; see (Donnet et al., 2014b) for more examples. The third step is a chaining argument, which consists in partitioning $\Gamma_0$ into $N_n$ bins of small enough size $u_n$, with representative points $(\gamma_i)_{i \le N_n}$, so that

$$\sup_{\gamma \in \Gamma_0} \Pi\big(B^c_{\epsilon_n}(\theta_0) \mid Y^n, \gamma\big) \le \max_{i \le N_n} \sup_{|\gamma - \gamma_i| \le u_n} \Pi\big(B^c_{\epsilon_n}(\theta_0) \mid Y^n, \gamma\big).$$

Each term on the right hand side can then be handled by adapting the approach described in Section 3.3: the Kullback-Leibler neighbourhoods in condition (i) are replaced by sets of $\theta$ over which the transferred log-likelihood $\inf_{|\gamma - \gamma_i| \le u_n} \ell_n(\psi_{\gamma_i,\gamma}(\theta))$ stays close to $\ell_n(\theta_0)$ with large enough probability, and the control of the second type error in condition (ii) with respect to $P_\theta$ is replaced by a second type error with respect to the measure with density $q_\theta^{\gamma_i}(y^n) = \sup_{|\gamma - \gamma_i| < u_n} f_{\psi_{\gamma_i,\gamma}(\theta)}(y^n)$. These arguments lead to a set of conditions for deriving posterior concentration rates for empirical Bayes procedures which resemble conditions (i) and (ii) of Section 3.3, see Theorem 1 of (Donnet et al., 2014b). This is applied to Dirichlet process mixtures of Gaussian distributions, log-spline and log-linear prior models for density estimation, and to Dirichlet process mixtures of uniforms in the context of Aalen point processes.
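In this example the change of measure is just an affine map of the atoms; a minimal sketch:

```python
import numpy as np

def remap_atoms(mu, m0, tau0, m0_new, tau0_new):
    """psi_{gamma, gamma'} on atoms: iid N(m0, tau0^2) in, iid N(m0_new, tau0_new^2) out."""
    return m0_new + tau0_new * (np.asarray(mu) - m0) / tau0
```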
The pre-selection of $\Gamma_0$, i.e. the asymptotic behaviour of $\hat\gamma$, is important. However, $\Gamma_0$ need not necessarily be small, especially if the posterior concentration rate associated with a prior $\Pi(\cdot \mid \gamma)$ does not depend on $\gamma$. For instance, in the case of the Dirichlet process mixture of Gaussian distributions with $\hat\gamma = (m_0, \tau_0^2)$, $m_0 = \bar{Y}_n$ and $\tau_0^2 = R$, with $R$ the range of the data, the posterior concentration rate around a density $f_0$ which has $\beta$-Hölder smoothness, as described in Section 3.3, is still of the form $n^{-\beta/(2\beta+1)}$ up to a $\log n$ term, although $\Gamma_0 = [\mu_0 - \epsilon, \mu_0 + \epsilon] \times [a, (\log n)^\kappa]$ for some positive constants $\epsilon, a, \kappa$. If on the contrary the posterior concentration rate depends on $\gamma$, then it is crucial to have $\Gamma_0$ shrink fast enough around the best possible value.

4.0.2. Maximum marginal likelihood estimators.

Maximum marginal likelihood empirical Bayes procedures are typically used in such contexts, with the hope that the data dependent $\hat\gamma_{ML}$ will be close enough to optimal values of $\gamma$. This is not always the case, and subtle phenomena can occur, as has been shown in (Knapik et al., 2012, Szabò et al., 2013). In both papers the authors consider the white noise model (with an inverse operator which, for the sake of simplicity, we pretend is equal to the identity) and a Gaussian process prior on $\theta$:

$$\theta_i \stackrel{ind}{\sim} N(0, \tau_i^2), \quad i \ge 1.$$

To model smooth curves under the prior distribution, it is common practice to let $\tau_i$ decrease to 0 with $i$; the question is how fast. In (Szabò et al., 2013) the authors consider $\tau_i^2 = \tau_0^2 i^{-2\alpha - 1}$, with $\gamma = \tau_0$, and, using an explicit expression of the marginal likelihood $m_n(\gamma)$, show that if the true parameter $\theta_0$ has smoothness $\beta$, i.e. $\theta_{0,i}^2 \le L i^{-2\beta - 1}$, then the empirical Bayes posterior has a suboptimal posterior concentration rate for all $\beta > \alpha + 1/2$, while it achieves the minimax adaptive posterior concentration rate for $\beta < \alpha + 1/2$. Interestingly, in (Knapik et al., 2012) the authors consider $\gamma = \alpha$ and show that the empirical Bayes posterior achieves the minimax concentration rate for all $\beta$ in this case. So why is there such a discrepancy between the two types of maximum marginal likelihood estimators?
In (Rousseau and Szabò, 2015), we describe the asymptotic behaviour of the maximum marginal likelihood estimator $\hat\gamma$ for a model where $(\Theta, \|\cdot\|)$ is a Banach space, under some conditions on the model and the prior. In particular, it is shown that, with probability going to 1,

$$\hat\gamma \in \Big\{\gamma : \epsilon_n(\gamma) \le m_n \inf_{\gamma' \in \Gamma} \epsilon_n(\gamma')\Big\},$$

where $m_n$ is any sequence going to infinity and $\epsilon_n(\gamma)$ is defined by

$$\Pi\big(\|\theta - \theta_0\| \le \epsilon_n(\gamma) \mid \gamma\big) = e^{-n\epsilon_n(\gamma)^2}. \qquad (13)$$

Hence the maximum marginal likelihood estimator is minimizing the rate $\epsilon_n(\gamma)$, and it can be checked that in the setup of (Knapik et al., 2012) the minimizer is optimal and leads to $n^{-\beta/(2\beta+1)}$ up to a $\log n$ term, while in the case of (Szabò et al., 2013) the minimizer can be suboptimal when $\beta > \alpha + 1/2$. Interestingly, the asymptotic behaviour of the maximum marginal likelihood estimator $\hat\gamma$ is driven by the behaviour of $\Pi(\|\theta - \theta_0\| \le \epsilon_n(\gamma) \mid \gamma)$, and not so much by the sampling model. A similar result was obtained in the parametric framework by (Petrone et al., 2014).

SUMMARY POINTS
1. A methodology has been developed to derive posterior concentration rates for data dependent priors. The approach is similar to the theory developed by (Ghosal and van der Vaart, 2007a) in the case of regular priors.
2. This approach can quite easily be applied to moment-type estimators, and good frequentist properties of the empirical Bayes posterior have been obtained in this case.
3. Maximum marginal likelihood estimators have subtle behaviours, which can be apprehended by minimizing the set of candidate rates $\epsilon_n(\gamma)$ defined in (13).

Semi-parametric models and the Bernstein-von Mises theorem
In this section, I will describe some of the latest developments in semi-parametric Bayesian inference, i.e. when the parameter of interest $\psi = \psi(\theta)$ is a finite dimensional functional of an infinite dimensional parameter $\theta$. Semi-parametric models are often considered in a context where $\theta = (\psi, \eta)$, with $\psi \in S \subset \mathbb{R}^d$ and $\eta \in E$ an infinite dimensional parameter. For instance, in the non linear regression model, $\eta$ is the regression function and $\psi$ the variance of the noise, although the parameter of interest is typically $\eta$ in such models. In partially linear models, the regression function of a response variable $Y$ on a covariate vector $X = (X_1, X_2)$ takes the form $x \mapsto \psi^T x_1 + g(x_2)$, where $g$ is non linear and some redundancies may exist between $X_1$ and $X_2$. In survival analysis, the Cox regression model is also a very common semi-parametric model. These are not, however, the only cases of semi-parametric problems, and one might be interested in more general functionals of an infinite dimensional parameter, such as the cumulative distribution function at a given point, the mean of a distribution, the $L_2$ norm of a square-integrable curve, etc. There are many semi-parametric models for which it is possible to estimate $\psi$ at the rate $\sqrt{n}$, see for instance (van der Vaart, 1998) for a general theory of regular semi-parametric models. What would be the Bayesian counterpart of this theory? How can we prove that the marginal posterior distribution of $\psi$ concentrates at the rate $\sqrt{n}$? Can we obtain a more precise description of the marginal posterior distribution of $\psi$? These questions can be answered by studying whether the posterior distribution of $\psi$ satisfies the Bernstein-von Mises theorem (BvM), which says that, asymptotically, the marginal posterior distribution of $\sqrt{n}(\psi - \hat\psi)$ converges (weakly or strongly) to a $N(0, V_0)$ under $P_{\theta_0}$, where $\hat\psi$ is some estimator of $\psi$ such that $\sqrt{n}(\hat\psi - \psi(\theta_0))$ converges in distribution to $N(0, V_0)$ under $P_{\theta_0}$.
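For intuition, the BvM scaling can be checked in a conjugate Gaussian sequence model, where the marginal posterior of a linear functional is exactly Gaussian; the representer $\psi_1$, the prior variances and the true sequence below are illustrative assumptions.

```python
import numpy as np

n, k = 5000, 50
rng = np.random.default_rng(7)
idx = np.arange(1, k + 1)
psi1 = 1.0 / idx                                   # representer of the linear functional
theta0 = idx ** -2.0
y = theta0 + rng.normal(0.0, 1.0, k) / np.sqrt(n)  # Y_i = theta_i + eps_i / sqrt(n)
prior_var = idx ** -3.0                            # theta_i ~ N(0, prior_var_i) a priori
post_var = 1.0 / (n + 1.0 / prior_var)             # conjugate posterior variances
post_mean = post_var * n * y
sd_post = np.sqrt(np.sum(psi1 ** 2 * post_var))    # posterior sd of psi(theta) = <psi1, theta>
sd_freq = np.sqrt(np.sum(psi1 ** 2) / n)           # efficient sd ||psi1|| / sqrt(n)
print(sd_post, sd_freq)                            # nearly equal: the BvM scaling
```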
Such properties have many interesting implications. In particular, they make it possible to construct credible regions for $\psi$ which have correct asymptotic frequentist coverage.
In (Castillo, 2010) and (Bickel and Kleijn, 2012), sufficient conditions are proposed to derive BvM in separated models of the form $\theta = (\psi, \eta)$. In (Rivoirard and Rousseau, 2012a) and in (Castillo and Rousseau, 2013), sufficient conditions for BvM are provided, for linear functionals of the density and for smooth functionals of the parameter in general models, respectively. To explain the main features of these results, I will present the arguments as in (Castillo and Rousseau, 2013).
The conditions are based on the following three ingredients:
• (1) Concentration of the posterior: there exists some shrinking neighbourhood $A_n$ of $\theta_0$ such that $\Pi(A_n \mid Y^n) = 1 + o_{P_{\theta_0}}(1)$.
• (2) Local asymptotic normality of the likelihood (LAN): locally around $\theta_0$, the log-likelihood can be approximated by a quadratic form. Assuming that a neighbourhood of $\theta_0$ can be embedded into a Hilbert space $H$, this local approximation takes the form

$$\ell_n(\theta_0 + h/\sqrt{n}) - \ell_n(\theta_0) = -\frac{\|h\|_L^2}{2} + W_n(h) + o_{P_{\theta_0}}(1),$$

where $\|\cdot\|_L$ is the norm of the Hilbert space and $W_n(\cdot)$ is a linear operator on $H$ such that $W_n(h) \sim N(0, \|h\|_L^2)$ when $h \in H$.
• (3) Smoothness of the functional: on $A_n$, the functional can be linearly approximated: there exists $\psi_1 \in H$ such that $\psi(\theta) - \psi(\theta_0) = \langle \psi_1, \theta - \theta_0 \rangle_L + o(n^{-1/2})$.
Then, under some mild additional conditions, BvM is valid if, for $t \in \mathbb{R}$ with $|t|$ small enough,

$$\frac{\int_{A_n} e^{\ell_n(\theta - t\psi_1/\sqrt{n}) - \ell_n(\theta_0)} \, d\Pi(\theta)}{\int_{A_n} e^{\ell_n(\theta) - \ell_n(\theta_0)} \, d\Pi(\theta)} = 1 + o_{P_{\theta_0}}(1). \qquad (14)$$

Condition (14) is the key condition: roughly speaking, it means that it is possible to construct a change of variable $\theta \to \theta - t\psi_1/\sqrt{n}$, or close enough to it, leaving the prior and $A_n$ almost unchanged. In the cited papers, examples are studied where BvM is valid for families of smooth functionals; however, examples are also provided where it is shown that BvM does not hold. To illustrate this and explain the meaning of (14), let $\theta \in \ell_2$ and consider a prior on $\theta$ constructed as in (8): $k \sim \pi_k$ and, conditionally on $k$, $\theta_i \stackrel{iid}{\sim} g$ for $i \le k$ and $\theta_i = 0$ otherwise. Assume, for the sake of simplicity, that the functional $\psi(\theta)$ is linear and that the LAN norm $\|\cdot\|_L$ is the $L_2$ norm, as in the white noise model, so that $\psi(\theta) = \langle \psi_1, \theta \rangle$ for some $\psi_1 \in \ell_2$. Proving (14) then requires the shift $\theta \to \theta - t\psi_1/\sqrt{n}$ to be nearly compatible with the truncated prior, which can fail when $\psi_1$ is poorly approximated within the support of the prior; see (Rivoirard and Rousseau, 2012a, Castillo and Rousseau, 2013, Castillo, 2012).