An optimal transport approach for seismic tomography: application to 3D full waveform inversion

The use of optimal transport distances has recently yielded significant progress in image processing for pattern recognition, shape identification, and histogram matching. In this study, the use of such a distance is investigated for a seismic tomography problem exploiting the complete waveform: full waveform inversion. In its conventional formulation, this high resolution seismic imaging method is based on the minimization of the L2 distance between predicted and observed data. Application of the method is generally hampered by the local minima of the associated L2 misfit function, which correspond to velocity models matching the data up to one or several phase shifts. Conversely, the optimal transport distance appears as a more suitable tool for comparing oscillatory signals, thanks to its ability to detect shifted patterns. However, its application to full waveform inversion is not straightforward, as mass conservation between the compared data, a crucial assumption for optimal transport, cannot be guaranteed. In this study, the use of a distance based on the Kantorovich–Rubinstein norm is introduced to overcome this difficulty. Its mathematical link with the optimal transport distance is made clear. An efficient numerical strategy for its computation, based on a proximal splitting technique, is introduced. We demonstrate that each iteration of the corresponding algorithm requires solving the Poisson equation, for which fast solvers can be used, relying either on the fast Fourier transform or on multigrid techniques. The development of this numerical method makes possible applications to industrial scale data, involving tens of millions of discrete unknowns. The results we obtain on such large scale synthetic data illustrate the potential of optimal transport for seismic imaging.
Starting from crude initial velocity models, optimal transport based inversion yields significantly better velocity reconstructions than those based on the L2 distance, in both 2D and 3D contexts.

Introduction

Full waveform inversion (FWI) is a high resolution seismic imaging method aiming at reconstructing quantitative estimates of subsurface parameters such as wave propagation velocities, density, attenuation, and anisotropy parameters.
A review of FWI and its applications is provided in (Virieux and Operto 2009). FWI is now routinely used, both in academia and in industry, mainly for the reconstruction of wave propagation velocities (Fichtner et al 2010, Tape et al 2010, Plessix and Perkins 2010, Sirgue et al 2010, Peter et al 2011, Zhu et al 2012, Warner et al 2013, Vigh et al 2014, Borisov and Singh 2015). Its advantage over conventional tomography methods, which are based on the interpretation of selected arrival times only, is its ability to recover higher resolution estimates. As the method involves millions (2D acoustic) to tens of billions (3D elastic) of discrete unknowns, it is based on a local minimization of the misfit, which iteratively updates an initial estimation of the wave velocity. In its conventional formulation, the misfit is computed as the L2 distance between observed and predicted data. This yields a significant difficulty: the corresponding misfit function has multiple local minima. The initial guess of the solution should thus be sufficiently close to the global minimum for the method to converge to the desired estimation. The reason for the presence of these local minima is well understood. Smooth, macro-scale modifications of the velocity structure mainly impact the oscillatory seismic waveform through modifications of the travel-times, resulting in time shifted waveforms (Jannane et al 1989). However, the L2 norm is not an appropriate tool for capturing these time-shifts. Seeing the waveform as a purely oscillatory signal, a local minimum is reached each time the velocity model matches the data up to one or several phase shifts. This phenomenon is referred to as cycle skipping or phase ambiguity in the FWI community. The use of the L2 norm thus requires starting from an initial velocity model kinematically compatible with the data, in the sense that the observed data is matched within half a phase, to prevent cycle skipping.
This strong requirement is of course a difficulty for the application of FWI, as such an initial model may not always be available. This difficulty has prompted numerous investigations attempting to overcome this restriction. Multi-scale frequency approaches were first introduced to mitigate the sensitivity to cycle skipping by inverting the low-frequency content of the data in the first steps of the method, which enlarges the period of the signal (Bunks et al 1995, Pratt 1999). This strategy is limited by the lowest available frequency, which most of the time is not low enough to sufficiently constrain the model and avoid cycle skipping. Image domain techniques have also been proposed (see Symes 2008 and references therein for a review). These methods are based on improving the consistency of the velocity model through the re-focusing of migrated images along a dimension introduced artificially in the imaging condition (time lag, subsurface offset, illumination angle). These methods are able to generate smooth updates of the initial background models. However, the high computational cost related to the repeated construction of image volumes through migration algorithms seems to have precluded their use in 3D configurations up to now. In addition, these methods rely exclusively on reflected waves, as they are based on the construction of reflectivity images. Data-domain techniques represent a third type of strategy to mitigate cycle skipping. These methods are based on the modification of the misfit measurement: cross-correlation (Luo and Schuster 1991), and later on warping techniques (Hale 2013, Ma and Hale 2013), have been proposed to access directly the time shifts between seismograms without travel-time picking. Misfit functions based on the instantaneous phase and envelope of the signals have been investigated in seismology (Fichtner et al 2008, Bozdağ et al 2011).
Recently, deconvolution approaches (Luo and Sava 2011, Warner and Guasch 2014) have also been promoted for FWI, where Wiener filters are used to match observed and predicted data. All these strategies share the common purpose of producing more convex misfit functions, possibly at the expense of the high resolution expected from FWI. Another drawback of these strategies is the necessity to design a suitable penalization function to maximize the energy at zero time shift (or to minimize the energy away from zero time shift), which may require non-trivial parameter tuning. In addition, identification and windowing of the data may be required to robustly estimate the time shifts between traces, which is a difficult task for seismic data acquired at the exploration scale.
From an optimization point of view, the convergence to local minima results from the use of local optimization techniques. As a consequence, since the introduction of FWI, several attempts to apply global optimization schemes to FWI, based on Monte Carlo techniques or genetic algorithms, have been made. The crucial issue for the success of these methods is the design of a suitable subsurface velocity parameterization allowing a drastic reduction of the number of unknowns, which is not straightforward. Indeed, current high performance computing devices may provide the capability to perform such a global model space exploration for problems involving no more than several hundred discrete parameters, which is several orders of magnitude lower than the standard FWI problem size for realistic applications. A recent example of such an attempt is proposed in (Diouane et al 2016), where the model parameterization is based on a discrete cosine transform and a step function transform, and the global optimization method is based on the CMA-ES algorithm (Hansen et al 2006).

Contribution
In a recent study, the use of an optimal transport distance in the framework of FWI has been suggested as a novel data-domain technique (Engquist and Froese 2014). The main motivation relies on the fact that the optimal transport distance, as a global comparison tool, is particularly suited to capture time shifts between signals. This property is emphasized in (Engquist and Froese 2014) on 1D time-shifted Ricker signals: the optimal transport distance is a convex, nearly quadratic function of the time shift.
Building on this idea, we have proposed a methodology for using an optimal transport distance within FWI in realistic 2D configurations. In this context, the seismic data is interpreted as a collection of 2D images, one for each seismic source. The signal recorded by each receiver is gathered in a 2D panel depending on the physical location of the receiver at the surface. This organization of the data has been used for years by geophysicists for analyzing the data, as it allows one to easily identify all the seismic events (direct propagation, refraction, pre- and post-critical reflection, conversion). However, this data representation is rarely accounted for in the inversion algorithm. The L2 distance performs pixel-by-pixel comparisons, while cross-correlation and deconvolution approaches rely on a computation of time-shifts trace by trace. Nonetheless, macro-scale variations of the velocity structure impact the data by shifting the seismic events not only along the time axis but also along the receiver (space) axis. Warping techniques (Hale 2013) appear to us as a first attempt to account for these shifts considering the whole seismogram. Tracking these shifts using a 2D optimal transport distance should make it possible to account for them more robustly, similarly to pattern recognition applications. Applications of this strategy to 2D realistic case studies have confirmed its interest.
The methodology we have proposed is based on a modified dual Kantorovich problem. This formulation allows for the non-conservation of mass between the data, which is a crucial point: the standard optimal transport distance relies on the assumption that the total mass is conserved between the compared images. For FWI applications, the concept of mass used in imaging science is to be understood as the intensity of the signal recorded at a given time by a given receiver. In this context, there is no reason for the total mass of the observed data to be equal to the total mass of the predicted data. For instance, if reflections are missing in the predicted data (due to the absence of the corresponding reflectors in the subsurface model), the predicted data contains less mass than the observed data. Noise also contaminates the observed data in real applications, which in essence contributes unpredictably to its total mass.
In the present study, we propose to further analyze the mathematical foundation of this modified dual strategy. We show that the distance which is used is actually related to the Kantorovich-Rubinstein (KR) norm, which has strong connections with the dual Kantorovich problem, as well as with the L1 distance, as pointed out in (Lellmann et al 2014). We also propose a novel numerical strategy for the computation of the KR norm, making it possible to apply this strategy to realistic size 3D data sets for the first time. The corresponding problem is formulated as a non-smooth convex optimization problem, solved with the simultaneous-direction method of multipliers (SDMM), a proximal splitting strategy (Combettes and Pesquet 2011). We prove that the linear system to be solved at each iteration of SDMM corresponds to the solution of the Poisson equation with homogeneous Neumann boundary conditions, which can be solved in linear complexity through multigrid techniques (Brandt 1977). This numerical strategy is implemented within the framework of time-domain acoustic FWI. An analysis of the KR distance is first performed for the comparison of time-signals, to investigate the dependence of this distance on time shifts, as is done in (Engquist and Froese 2014). The KR distance exhibits a single global minimum in this case. However, it appears not to be a convex function of the time-shift, and it seems non-differentiable at its minimum. This feature confirms the link between the KR and L1 distances. The shapes of the L2 and KR misfit functions are then compared in a more realistic bi-dimensional context. Using the KR distance appears to lift the secondary valleys of the misfit function, reducing the risk of being trapped in a local minimum when using a gradient-based local optimization method. A 2D FWI application to the Marmousi model is presented, to emphasize the interest of the strategy and its robustness to the presence of noise.
Finally, a fully 3D experiment performed on the SEG/EAGE overthrust model is proposed. From these experiments, the optimal transport distance appears as an interesting tool for FWI, as it seems to mitigate efficiently the cycle skipping issue, while the proposed numerical strategy appears feasible for realistic size 3D problems.

Structure of the paper
The study is divided into six sections. In section 2, the mathematical background of optimal transport is quickly reviewed, as well as the link between the KR norm and optimal transport. In section 3, the numerical strategy we set up to compute the KR norm for large scale problems is presented. In section 4, the formulation of the FWI problem using the KR norm is presented. In section 5, we present three different case studies, from 1D to 3D, emphasizing the main properties of FWI based on the KR distance. Concluding remarks and perspectives are given in section 6.

Notations
In what follows, X and Y denote two metric spaces. The spaces of probability distributions on X and Y are denoted by P(X) and P(Y). We use standard discrete notations on a 3D Cartesian grid: N_x, N_y and N_z denote the number of grid points in the x, y and z directions, and the total number of points in the mesh is N = N_x N_y N_z. We also introduce the corresponding sets of indices on the mesh in each direction.

Optimal transport and the KR norm
We start by recalling the standard Monge formulation of the optimal transport problem. Given two probability distributions μ ∈ P(X) and ν ∈ P(Y), and a cost function c(x, y), the optimal transport problem is defined as

  min_T ∫_X c(x, T(x)) dμ(x),  (13)

subject to the constraint

  T#μ = ν.  (2)

The constraint T#μ = ν indicates that the push-forward distribution T#μ of μ by the mapping T is equal to the distribution ν. The optimal transport problem can thus be interpreted as determining the mapping T which transports the distribution μ onto the distribution ν in the sense of equation (2), while minimizing the cost defined in (13), for a given cost function c(x, y). The problem (13) is difficult to solve, in particular because of the constraint (2). A relaxation of this problem has been proposed by (Kantorovich 1942), under the linear programming form

  min_γ ∫_{X×Y} c(x, y) dγ(x, y),  (14)

subject to

  p_X#γ = μ,  p_Y#γ = ν,  (15)

where the operators p_X and p_Y are the projectors on X and Y respectively. The problem (14) generalizes (13) in the sense that, instead of considering a mapping T transporting each particle of the distribution μ to the distribution ν, it considers all pairs (x, y) of the space X×Y and, for each pair, defines how many particles of μ go from x to y. While in the context of the Monge formulation (13) each point x of the space X has only one possible destination in Y, given by T(x), in the context of the Kantorovich formulation (14) the particles at point x can have various destinations in Y, given by γ(x, y) for y ∈ Y. The constraint (15) ensures that the distribution μ is transported onto the distribution ν. The relaxed problem (14) admits a solution under very mild hypotheses, unlike Monge's problem (13). In addition, when (13) admits a solution T, the measure γ = (I, T)#μ is a solution of the relaxed problem (14) (Ambrosio 2003, Pratelli 2007). The dual problem of (14) is formulated as

  max_{φ, ψ} ∫_X φ dμ + ∫_Y ψ dν,  subject to φ(x) + ψ(y) ≤ c(x, y),  (16)

where the spaces of continuous and bounded functions on X (respectively Y) contain the dual variables φ (respectively ψ).
The standard duality result shows that the solution of (14) is equal to the solution of (16) (see for instance (Villani 2003) or (Santambrogio 2015) for a complete proof).
In the particular case where Y = X and the cost function c(x, y) is a distance (which will be the case in the remainder of this study), the dual problem (16) can be simplified into

  max_φ ∫_X φ d(μ − ν),  over functions φ which are 1-Lipschitz for the distance c.  (17)

The duality result in this particular case is known as the KR theorem (Santambrogio 2015, chapter 3.2.1). The modification of the initial Monge problem into the dual form of the Kantorovich problem thus leads to the solution of a linear problem with linear constraints (17).
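For small discrete measures, the dual problem (17) can be solved directly as a linear program. The sketch below is our own illustration, not the numerical method of this study: the function name `kr_distance_1d` and the use of `scipy.optimize.linprog` are our choices, and the pairwise constraints make it practical only for tiny grids.

```python
import numpy as np
from scipy.optimize import linprog

def kr_distance_1d(mu, nu, x):
    """Solve the dual problem (17) on a 1D grid: maximize <phi, mu - nu>
    subject to |phi(x_i) - phi(x_j)| <= |x_i - x_j| for all pairs
    (the 1-Lipschitz constraint for the distance c(x, y) = |x - y|)."""
    n = len(x)
    rows, rhs = [], []
    for i in range(n):
        for j in range(i + 1, n):
            e = np.zeros(n)
            e[i], e[j] = 1.0, -1.0
            d = abs(x[i] - x[j])
            rows.append(e);  rhs.append(d)   # phi_i - phi_j <= |x_i - x_j|
            rows.append(-e); rhs.append(d)   # phi_j - phi_i <= |x_i - x_j|
    # linprog minimizes, so negate the criterion to maximize it.
    res = linprog(-(mu - nu), A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(None, None)] * n, method="highs")
    return -res.fun

# Two unit point masses two grid cells apart: the value equals the
# 1-Wasserstein distance, i.e. unit mass transported over distance 2.
x = np.arange(5, dtype=float)
mu = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
nu = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
print(kr_distance_1d(mu, nu, x))   # -> 2.0
```

Note that if μ and ν had different total masses, this linear program would be unbounded (adding a constant to φ changes the criterion), which motivates the bounded generalization discussed next.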
While, under its primal form (14), the optimal transport problem is defined for probability measures, the dual form is defined for general measures provided they have the same total mass, i.e. the mass is conserved from the mass distribution μ to the mass distribution ν. An illustration of this requirement follows. Consider the case where μ and ν have different total masses: adding a constant C to φ leaves the Lipschitz constraint satisfied while changing the criterion by C times the mass difference, so the criterion in (17) can be made arbitrarily large and the problem is no longer well posed. This simple example suggests a straightforward generalization of the dual Kantorovich problem which remains well posed when the total mass between μ and ν is not conserved. This generalization consists in complementing the 1-Lipschitz constraint with a bound constraint. The problem (17) thus becomes

  max_φ ∫ φ d(μ − ν),  over functions φ which are 1-Lipschitz and satisfy ‖φ‖∞ ≤ λ.  (22)

In this study, we focus on the particular case for which the distance function c(x, y) is the distance associated with the ℓ1 norm on R^d. Interestingly, with this choice of cost function c(x, y), the generalization (22) corresponds to the definition of the KR norm (Bogachev 2007). This norm is defined on the space of Radon measures on Ω, which is the dual space, for the ‖·‖∞ norm, of the space of real valued continuous functions defined on Ω which vanish at infinity, denoted by C_0(Ω). Besides the link with optimal transport, the KR norm can also be interpreted as a generalization of the L1 norm (in a similar sense to the generalization from the total variation to total generalized variation norms), and shares some properties with Meyer's G-norm. These similarities are studied in detail in (Lellmann et al 2014), where the use of the KR norm is proposed as an alternative to the L1 norm in a TV denoising problem. In the remainder of the paper, the space of functions which are 1-Lipschitz for the distance induced by the ℓ1 norm on R^d and with infinity norm bounded by λ will be denoted accordingly.

An efficient numerical strategy for computing the KR norm

Discretization and reduction of the number of constraints
In this section we assume that the dimension d is set to 3. Using the notations introduced in section 1.5, the problem (22) is discretized as a finite-dimensional maximization problem (24) over the values of φ at the N grid points. Computing a numerical approximation of the solution of (24) requires the solution of a convex optimization problem involving O(N²) linear constraints, one pair per couple of grid points. This would lead to an unacceptable computational time for the large scale problems induced by FWI applications. However, a property of the ℓ1 norm on R^d can be used to reduce the number of constraints from O(N²) to O(N). The following proposition can be rephrased in terms of graph theory by saying that the restriction of the ℓ1 distance to the grid Z^d is a geodesic distance on the graph over Z^d in which two nodes a, b ∈ Z^d are joined by an edge if and only if they are adjacent.

Proposition. The two following assertions are equivalent:
(A1) |φ(a) − φ(b)| ≤ ‖a − b‖1 for all pairs of points (a, b) of the grid;
(A2) |φ(a) − φ(b)| ≤ ‖a − b‖1 for all pairs of adjacent points (a, b) of the grid.

The direct implication is obvious. To prove the reciprocal implication, consider a pair of points on the mesh denoted by u and v, and a sequence of points w_q joining u to v such that consecutive points are adjacent on the grid, with monotonically varying coordinates. The key is to see that, for such a sequence of points, the ℓ1 distance decomposes additively: ‖u − v‖1 is the sum of the distances ‖w_{q+1} − w_q‖1 (see figure 1 for an illustration). Now, consider a function φ satisfying (A2). The triangle inequality yields |φ(u) − φ(v)| ≤ Σ_q |φ(w_{q+1}) − φ(w_q)|. As the points w_q are adjacent, the local inequalities described by (A2) bound each term of this sum, and putting these inequalities together with the additive decomposition of the ℓ1 distance shows that φ satisfies (A1).

Using this equivalence, the problem (24) can be rewritten in an equivalent form (32) which only involves P = O(N) constraints. This reduction of the order of the number of constraints gives the possibility to design an efficient numerical strategy to compute the KR norm.
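The effect of the proposition can be checked numerically on a small grid: enforcing the Lipschitz constraints only on adjacent nodes yields the same optimum as enforcing them on all pairs. The following sketch is our own illustration (names and the use of `scipy.optimize.linprog` are hypothetical choices), comparing the two linear programs on a 3×3 grid with unit spacing:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def kr_lp(r, coords, pairs, lam):
    """Maximize <phi, r> subject to |phi_a - phi_b| <= ||a - b||_1 on the
    given pairs of nodes, and |phi| <= lam (bound constraint)."""
    n = len(coords)
    rows, rhs = [], []
    for i, j in pairs:
        d = float(np.abs(coords[i] - coords[j]).sum())
        e = np.zeros(n)
        e[i], e[j] = 1.0, -1.0
        rows.append(e);  rhs.append(d)
        rows.append(-e); rhs.append(d)
    res = linprog(-r, A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(-lam, lam)] * n, method="highs")
    return -res.fun

# 3x3 grid, unit spacing; opposite unit masses at the two opposite corners.
coords = np.array(list(itertools.product(range(3), range(3))), dtype=float)
n = len(coords)
r = np.zeros(n)
r[0], r[-1] = 1.0, -1.0

all_pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
adj_pairs = [(i, j) for i, j in all_pairs
             if np.abs(coords[i] - coords[j]).sum() == 1.0]

full = kr_lp(r, coords, all_pairs, lam=10.0)       # O(N^2) constraints
reduced = kr_lp(r, coords, adj_pairs, lam=10.0)    # O(N) constraints
print(full, reduced)   # identical optima: the l1 distance is geodesic on the grid
```

Both programs return the ℓ1 distance between the two corners (here 4), confirming that the adjacent-node constraints lose nothing.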
Proximal splitting technique for the solution of (32)

Convex optimization problems of the type (33), involving at least one non-differentiable function, here f2(φ), can be efficiently solved through proximal splitting techniques, such as forward-backward algorithms, Douglas-Rachford splitting, or the alternating direction method of multipliers (Combettes and Pesquet 2011). From our numerical experiments, among this class of proximal splitting techniques, the simultaneous-direction method of multipliers (SDMM) was found to achieve the fastest convergence. The method can be briefly sketched as follows.

Algorithm 1. SDMM method for the solution of the problem (33).
Proximal splitting strategies rely on a splitting of the problem in terms of the functions f1(φ) and f2(φ) and on the computation of the proximity operators of these two functions (scaled by a positive factor γ). Proximity operators can be seen as a generalization of convex projection operators: the proximity operator of a function f at a point φ is the minimizer of f(ψ) + ‖ψ − φ‖²/2. For the particular case of the functions f1 and f2, closed-form formulations can be found: the proximity operator of γf1 reduces to a shift of φ by γ(μ − ν) (40), while the proximity operator of f2 reduces to a pointwise projection onto the constraint sets (41). Note that the scaling γ only acts on the proximity operator of γf1. However, the choice of γ should be made with care. Small values of γ can slow down the convergence of the algorithm, while too large values can yield numerical instabilities and hamper the convergence. In practice, we choose γ = 0.9. The closed-form formulations (40) and (41) are inexpensive to compute, with an overall complexity of O(N) operations. However, the SDMM algorithm requires the solution of a linear system involving the matrix I + A^T A, which is the most time-consuming part of the algorithm.
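The two closed-form proximity operators can be sketched in a few lines. This is an illustration under our own sign conventions (minimization of the negated criterion), not a verbatim transcription of formulas (40) and (41); the function names are hypothetical.

```python
import numpy as np

def prox_linear(phi, mu, nu, gamma):
    """Prox of the linear term f1(phi) = -<phi, mu - nu> (minimization form
    of the KR criterion): the minimizer of f1(psi) + ||psi - phi||^2/(2*gamma)
    is simply a shift of phi by gamma * (mu - nu)."""
    return phi + gamma * (mu - nu)

def prox_bound(z, lam):
    """Prox of the indicator of the box {|z_i| <= lam}: the Euclidean
    projection onto the box, i.e. pointwise clipping."""
    return np.clip(z, -lam, lam)

# The prox of a linear term is a constant shift...
phi = np.zeros(4)
mu = np.array([1.0, 0.0, 0.0, 0.0])
nu = np.array([0.0, 0.0, 1.0, 0.0])
shifted = prox_linear(phi, mu, nu, gamma=0.9)   # 0.9 * (mu - nu)

# ...and the prox of an indicator function is a projection, here a clipping,
# which is why proximity operators generalize convex projections.
clipped = prox_bound(np.array([2.0, -3.0, 0.1, 0.0]), lam=1.0)
```

Both operations touch each entry once, consistent with the O(N) cost stated above.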
A Laplacian operator with homogeneous Neumann boundary conditions

Let us inspect in more detail the form of the matrix A^T A. We assume a lexicographic ordering of the discrete unknowns in R^N, mapping the 3D grid indices to a single linear index. With this ordering, the matrices D_x, D_y and D_z may be defined through the Kronecker product ⊗ introduced in section 1.5, applied to a one-dimensional finite-difference matrix and identity matrices. The following theorem proves that the matrix Δ actually corresponds to the second-order finite differences discretization of the 3D Laplacian operator defined on Ω with homogeneous Neumann boundary conditions.
Theorem. The theorem is stated for an arbitrary number of dimensions d ∈ N. In this context, Ω is a compact subset of R^d discretized with N_i points in the direction i, the discretization step h_i being defined accordingly. The matrix Δ corresponds to the second-order finite differences discretization of the Laplacian operator defined on Ω with homogeneous Neumann boundary conditions.
Proof. We start with the case d = 1. In this case, the second-order finite differences discretization of the Laplacian operator with homogeneous Neumann boundary conditions is a tridiagonal matrix of size N_1, and a direct computation shows that it coincides with the product of the one-dimensional finite-difference matrix with its transpose. The general case follows by induction on the dimension, using the mixed-product property of the Kronecker product.

The linear system which has to be solved at each iteration of the SDMM algorithm thus corresponds to a second-order finite-differences discretization of a Poisson problem for φ, where Δ is a Laplacian operator with homogeneous Neumann boundary conditions. The best numerical strategies for the solution of such problems appear to rely either on fast Fourier transform algorithms, with O(N log N) complexity (Swarztrauber 1974), or on multigrid solvers, with O(N) complexity (Brandt 1977). In this study, we use the multigrid method implemented in the MUDPACK library (Adams 1989).
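The structure of A^T A can be verified numerically. The sketch below (our own construction, with unit grid spacing) assembles forward-difference matrices through Kronecker products and checks that A^T A carries the Neumann Laplacian stencil, with the constant vector in its null space:

```python
import numpy as np

def fwd_diff(n):
    """Forward-difference matrix of size (n-1) x n, unit grid spacing."""
    D = np.zeros((n - 1, n))
    for i in range(n - 1):
        D[i, i], D[i, i + 1] = -1.0, 1.0
    return D

n = 5
D = fwd_diff(n)

# 1D case: D^T D has interior rows [-1, 2, -1] and boundary rows [1, -1]
# and [-1, 1], i.e. the Neumann Laplacian; constants lie in its null space.
L1 = D.T @ D
assert np.allclose(L1 @ np.ones(n), 0.0)

# 2D case: A stacks Dx = D (x) I and Dy = I (x) D. By the mixed-product
# property of the Kronecker product, A^T A = (D^T D) (x) I + I (x) (D^T D),
# which is exactly the 2D Neumann Laplacian.
I = np.eye(n)
Dx, Dy = np.kron(D, I), np.kron(I, D)
A = np.vstack([Dx, Dy])
L2d = A.T @ A
assert np.allclose(L2d, np.kron(L1, I) + np.kron(I, L1))
```

The same mixed-product argument extends the identity to d = 3, which is the induction step used in the proof above.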
The combination of the reduction of the number of constraints, using the property of the ℓ1 distance, with the observation that the matrix appearing in the SDMM strategy actually corresponds to a discrete Laplacian operator, for which fast solvers are available, makes the computation of the KR norm feasible for large scale 3D data sets.

Formulation of the FWI problem
The observed data, corresponding to N_s sources and N_r receivers, is denoted by d_obs,s(x_r, t), for s = 1, ..., N_s and r = 1, ..., N_r, with t ∈ [0, T], where T is the recording time. The wave propagation is described by the acoustic wave equation

  (1 / v_P(x)²) ∂²p(x, t)/∂t² − Δp(x, t) = f_s(x, t),  (56)

where v_P(x) denotes the pressure wave (P-wave) velocity, p(x, t) is the pressure wavefield and f_s(x, t) an explosive seismic source, which we assume to be known in this study. For a given P-wave velocity v_P(x) and a given source f_s(x, t), the predicted data, denoted by d_cal,s(x_r, t), is the restriction to the receiver locations of the pressure wavefield p(x, t) solution of (56). Conventional FWI is formulated as the minimization, over the set of pressure wave velocity functions v_P(x), of the L2 distance between d_cal and d_obs (59).
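As a toy counterpart to (56) and the L2 misfit (59), the sketch below propagates a 1D acoustic wavefield with a leapfrog finite-difference scheme and evaluates the misfit between data predicted in a "true" and in an erroneous homogeneous model. This is our own minimal illustration, not the 3D solver used in this study: it is 1D, uses reflecting Dirichlet edges instead of absorbing boundaries, and its parameters are chosen arbitrarily.

```python
import numpy as np

def ricker(t, f0):
    """Ricker source wavelet of dominant frequency f0, delayed by 1/f0."""
    a = (np.pi * f0 * (t - 1.0 / f0)) ** 2
    return (1.0 - 2.0 * a) * np.exp(-a)

def forward_1d(v, nt, dt, dx, isrc, irec, f0=10.0):
    """Leapfrog finite-difference solution of the 1D acoustic wave equation
    (1/v^2) p_tt - p_xx = f; returns the trace recorded at index irec."""
    n = len(v)
    p_prev, p = np.zeros(n), np.zeros(n)
    trace = np.zeros(nt)
    src = ricker(np.arange(nt) * dt, f0)
    for it in range(nt):
        lap = np.zeros(n)   # edges stay zero: reflecting Dirichlet boundaries
        lap[1:-1] = (p[2:] - 2.0 * p[1:-1] + p[:-2]) / dx ** 2
        p_next = 2.0 * p - p_prev + dt ** 2 * v ** 2 * lap
        p_next[isrc] += dt ** 2 * v[isrc] ** 2 * src[it]   # source injection
        p_prev, p = p, p_next
        trace[it] = p[irec]
    return trace

# "Observed" data from a true model, "predicted" data from a wrong one.
nx, nt, dx, dt = 201, 800, 10.0, 1e-3        # CFL number v*dt/dx = 0.2
v_true = np.full(nx, 2000.0)
v_init = np.full(nx, 1800.0)
d_obs = forward_1d(v_true, nt, dt, dx, isrc=20, irec=120)
d_cal = forward_1d(v_init, nt, dt, dx, isrc=20, irec=120)

# Conventional L2 misfit: half the squared difference, integrated in time.
misfit = 0.5 * np.sum((d_cal - d_obs) ** 2) * dt
print(misfit)
```

The slower erroneous model delays the arrival at the receiver, and the misfit is driven by this time shift, which is precisely the mechanism behind cycle skipping.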

Gradient computation
The numerical solution of the FWI problem (59) is computed through a local minimization technique. The quasi-Newton l-BFGS algorithm introduced by (Nocedal 1980) is used for this purpose. This requires the ability to compute the gradient of the misfit function f_KR(v_P). The adjoint-state technique provides an efficient strategy to compute this quantity (Lions 1968, Chavent et al 1974, Plessix 2006), which we briefly present here. For the sake of simplicity, a single seismic source is considered in what follows, and R denotes the extraction operator at the receiver locations, such that the predicted data is Rp.

Consider the Lagrangian function associating the misfit measurement with the wave equation constraint, where the L2 scalar product in the wavefield space is denoted by ⟨·, ·⟩, g is a general distance function measuring the discrepancy between Rp and d_obs, and F(v_P, p) = f_s is a compact notation for the modeling equation (56). For p(v_P) solution of the modeling equation (56), the Lagrangian function reduces to the misfit function f(v_P) associated with the distance g. For the sake of simplicity, the dependence of p on v_P is not written explicitly in the sequel. Taking the derivative of the Lagrangian with respect to v_P yields an expression (65) containing two terms. The adjoint-state variable λ is chosen to cancel the first term of (65), such that the gradient of the misfit function f(v_P) can be computed as the scalar product, in the wavefield space, of λ with the derivative of the wave equation operator with respect to v_P applied to p, avoiding the computation of ∂p/∂v_P, the Jacobian operator of the pressure wavefield. Computing this matrix is prohibitively expensive for large-scale applications. The problem of computing the gradient efficiently is thus brought back to the ability to compute λ. Cancelling the first term of (65) gives the adjoint equation. In the context of the wave equation, the operator F(v_P, p) is linear with respect to p and self-adjoint. For this reason, the adjoint wavefield λ is the solution of the acoustic wave equation (56) backward in time, with a source term depending on the distance function g. Interestingly, this source term is the only quantity involved in the gradient computation which is impacted by the choice of the distance function g (as already noticed, for instance, in Brossier et al 2010, Luo and Sava 2011). In the case where the L2 norm is used, this source term is simply the data residual Rp − d_obs (68).

Using the KR norm to compare the observed and predicted data within a FWI framework thus only requires modifying the definition of the source term of the adjoint equation from (68) to (71). In terms of implementation, this means that the maximization problem (17) has to be solved only once per source and per iteration of the FWI loop. The value of the criterion reached at the maximum yields the misfit function value, while the function φ achieving this maximal value is the source term of the adjoint equation. As a consequence, introducing the KR norm within an existing FWI code has only a limited impact on the code structure. With a slight abuse of language, φ is referred to as the KR residuals in the following.

Normalization
In the following experiments, the seismic data are considered to be defined on the unit square [0, 1] × [0, 1]. This amounts to treating the seismic data as square images, with no preferred dimension along which to perform the comparison. In addition, the parameter λ, which controls the infinity norm of the solution of (32), is set to 1. This is a pragmatic choice. In turn, it requires scaling the observed and predicted data so that their infinity norm is not too far from 1.
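This normalization can be sketched as follows (our own illustration with synthetic random data standing in for a gather): the axes are mapped to [0, 1] and the amplitudes are divided by the maximum absolute value, so that the infinity norm matches the bound λ = 1.

```python
import numpy as np

# A synthetic (time x receiver) gather standing in for seismic data.
rng = np.random.default_rng(1)
gather = rng.normal(size=(400, 168))

# Axes mapped to the unit square regardless of the physical units, so that
# the time and receiver dimensions play symmetric roles in the comparison.
t_axis = np.linspace(0.0, 1.0, gather.shape[0])
r_axis = np.linspace(0.0, 1.0, gather.shape[1])

# Amplitudes scaled so the infinity norm is 1, matching the bound lambda = 1.
gather_scaled = gather / np.max(np.abs(gather))
```

In practice a common scale would be applied to observed and predicted data so that their relative amplitudes are preserved.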

1D case study: sensitivity to time shift
We start the numerical investigation with a schematic 1D experiment similar to the one proposed in (Engquist and Froese 2014). A Ricker time signal serves as observed data, and the predicted data corresponds to this same Ricker signal, shifted in time (figure 2). The L2 and KR distances, as functions of the time shift, are compared. In this particular case, as the energy of the signal is conserved by the time-shift, the bound constraint on the dual variable φ is relaxed by setting λ to an arbitrarily high value. The misfit profiles using the L2 and KR distances are presented in figure 3. The misfit based on the L2 distance presents two local minima, typical of cycle skipping, apart from the global minimum. The misfit based on the KR distance presents a single minimum, which indicates a better robustness to the time shift. However, compared to the 2-Wasserstein distance used in (Engquist and Froese 2014, figure 3), which yields a convex misfit function, the misfit function here appears to be non-differentiable at its minimum and concave away from it. The non-differentiability at the minimum is reminiscent of the L1 norm, which might not be surprising given the strong relation between the KR norm and this norm (Lellmann et al 2014). Note that, compared with the study of (Engquist and Froese 2014), the KR norm does not require separating the signal into its positive and negative parts.
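The L2 side of this experiment is easy to reproduce. The sketch below is our own illustration with arbitrarily chosen parameters (a 5 Hz Ricker wavelet): the L2 misfit between a Ricker signal and its shifted copy is non-monotonic in the shift, with a misfit peak near half a dominant period and secondary valleys near one period, which is the cycle skipping pattern discussed above.

```python
import numpy as np

def ricker(t, f0=5.0, t0=1.0):
    """Ricker wavelet of dominant frequency f0, centered at t0."""
    a = (np.pi * f0 * (t - t0)) ** 2
    return (1.0 - 2.0 * a) * np.exp(-a)

t = np.linspace(0.0, 2.0, 2001)
d_obs = ricker(t)

# L2 misfit as a function of the time shift applied to the predicted signal.
shifts = np.linspace(-0.5, 0.5, 201)
l2 = np.array([0.5 * np.sum((ricker(t - s) - d_obs) ** 2) for s in shifts])

i0 = np.argmin(np.abs(shifts))            # zero shift: the global minimum
i_half = np.argmin(np.abs(shifts - 0.10)) # ~ half a period: misfit peak
i_one = np.argmin(np.abs(shifts - 0.18))  # ~ one period: secondary valley
print(l2[i0], l2[i_half], l2[i_one])
```

Evaluating the KR distance on the same profile requires the dual solver described in section 3 and is not reproduced here.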
Further insight into the KR distance is given in figure 4, where the L2 residuals and the KR residuals are compared for the two signals presented in figure 2. While the L2 residuals simply correspond to the difference between these two signals, the KR residuals appear as an envelope of the L2 residuals, with positive and negative DC components at the beginning and the end of the signal respectively. In addition, the extrema of the KR residuals present an angular shape typical of the L1 norm. This particular shape may again be related to the Lipschitz constraint. The similarities between the KR norm and the L1 norm may call into question the use of standard quasi-Newton solvers dedicated to the minimization of smooth functions, such as the l-BFGS algorithm. However, the numerical experiments presented in the next section demonstrate that, for 2D and 3D FWI applications, this property does not prevent the convergence of the inversion.

2D case study: misfit function comparison in a more realistic configuration
In this experiment, a 2D configuration is considered. A fixed-spread surface acquisition is used, consisting of 168 receivers spaced every 100 m, at 50 m depth, from x = 0 km to x = 16.85 km. A single source is located at x = 8.45 km. Similar to the experiment presented in (Mulder and Plessix 2008) to emphasize the local minima of the L2 misfit function, the velocity model is assumed to vary linearly in depth, such that v_P(z) = v_P,0 + α z. The P-wave velocity is thus parameterized by the velocity at the origin v_P,0 and the vertical velocity gradient α. The reference velocity model is chosen within this two-parameter family, and the L2 and KR misfit maps over (v_P,0, α) are compared. The valley on the right still plays the role of a barrier; however, the height of the barrier is significantly reduced compared with the L2 case. This is an indication of a better behavior of the KR misfit function compared with the L2 misfit function. Contrary to the previous 1D experiment, the misfit function appears more regular. A better insight into the shape of the global minimum is provided in figure 6, where misfit-function profiles along the velocity gradient α and the background velocity v_P,0 respectively are presented. These profiles illustrate well how the secondary valleys of attraction are lifted up for the KR distance. They also show that, even if the KR misfit function is smoother overall, in the vicinity of the global minimum it exhibits similarities with the L1 norm and appears non-differentiable at the global minimum. In terms of resolution power, it should be noted that the width of the global valley of attraction is almost the same for the L2 and KR misfit functions. This is different from what is observed when cross-correlation or deconvolution approaches are used to reduce the sensitivity to the initial model and cycle skipping: in that case the valley of attraction is strongly widened. This reflects a resolution loss of the imaging method as, near the solution, possibly quite different models yield approximately the same misfit.
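The kinematics underlying such a misfit map can be sketched with the classical closed-form traveltime of a diving wave in a linear-gradient medium, T(X) = (2/α) asinh(αX/(2 v_P,0)). The toy below scans a traveltime-only L2 misfit over the (v_P,0, α) plane; the numerical values (true model v_P,0 = 2000 m s⁻¹, α = 0.7 s⁻¹, offsets up to 8 km) are illustrative and are not those of the experiment, and a traveltime misfit does not reproduce the cycle skipping of the waveform misfit.

```python
import numpy as np

def diving_wave_time(offset, v0, alpha):
    """First-arrival time of a diving wave in v(z) = v0 + alpha*z,
    using the classical closed form for a linear velocity gradient."""
    return (2.0 / alpha) * np.arcsinh(alpha * offset / (2.0 * v0))

offsets = np.linspace(100.0, 8000.0, 80)        # receiver offsets (m)
t_obs = diving_wave_time(offsets, 2000.0, 0.7)  # "observed" picks (true model)

# scan the two-parameter model space (v_P,0, alpha)
v0_grid = np.linspace(1500.0, 2500.0, 101)
al_grid = np.linspace(0.4, 1.0, 61)
misfit = np.zeros((al_grid.size, v0_grid.size))
for i, al in enumerate(al_grid):
    for j, v0 in enumerate(v0_grid):
        misfit[i, j] = np.sum((diving_wave_time(offsets, v0, al) - t_obs) ** 2)

# location of the global minimum of the misfit map
i0, j0 = np.unravel_index(np.argmin(misfit), misfit.shape)
```

The global minimum of the map is recovered at the true parameter pair, the valley of attraction being elongated along the trade-off direction between v_P,0 and α.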
The KR misfit function is thus expected to retain the same resolution as the L2 misfit function while relaxing the constraint on the choice of the initial model. In figure 7, the L2 and KR residuals are presented for the velocity model corresponding to v_P,0 = 2250 m s⁻¹, α = 0.9 s⁻¹. Cycle-skipped diving and direct arrivals can be identified on the L2 residuals for far-offset receivers, between t = 3 s and t = 4 s. The corresponding KR residuals appear in this 2D case as a smooth version of the L2 residuals, with a re-balancing of the energy between the mismatched direct and diving waves. Low-frequency components also seem to be introduced in the KR residuals (black and white diffuse energy from left to right of figure 7(c)). In the following experiments, this tendency to smooth and weight approximately equally the different mismatched seismic events is confirmed. The smoothing effect is not surprising: it is directly related to the repeated application of an approximate inverse of the Laplacian operator within the SDMM algorithm (solution of the linear system (54)).

Application to 2D and 3D realistic configurations
In the two following experiments, we consider FWI of synthetic offshore data in 2D (Marmousi model) and 3D (SEG/EAGE overthrust model) configurations. In both cases, the misfit functions f_L2(v_P) and f_KR(v_P) are minimized using the quasi-Newton l-BFGS method (Nocedal 1980) with the memory parameter l set to 20. The FWI implementation interfaces the l-BFGS method provided in the SEISCOPE optimization toolbox. A regularization strategy based on gradient smoothing is used, designed as a Gaussian filter with a correlation length equal to a fraction of the local wavelength (Operto et al 2006). As a surface acquisition is considered in both experiments, a preconditioning of the gradient based simply on a linear (in 2D) or quadratic (in 3D) amplification in depth is also applied.
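A minimal sketch of this regularization and preconditioning step, assuming a Gaussian smoother whose correlation length is a fraction of an average wavelength and a power-law depth amplification; the function name, the use of a mean wavelength instead of a truly local one, and all parameter values are ours, not the paper's.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def precondition_gradient(grad, v, freq, dz, fraction=0.3, power=1):
    """Smooth an FWI gradient with a Gaussian filter whose correlation
    length is a fraction of the (here, average) wavelength, then amplify
    it in depth (power=1: linear, as in 2D; power=2: quadratic, as in 3D).
    All parameter values are illustrative."""
    wavelength = v.mean() / freq               # crude wavelength estimate (m)
    sigma = fraction * wavelength / dz         # filter length in grid points
    smoothed = gaussian_filter(grad, sigma=sigma)
    nz = grad.shape[-1]                        # depth assumed on the last axis
    depth_weight = (np.arange(1, nz + 1) * dz) ** power
    return smoothed * depth_weight             # broadcast over the depth axis

# toy usage on a random 2D "gradient"
rng = np.random.default_rng(0)
g = rng.standard_normal((100, 100))
v = np.full((100, 100), 3000.0)                # m/s
pg = precondition_gradient(g, v, freq=5.0, dz=50.0, power=1)
```

The smoothing suppresses the high-wavenumber components of the gradient, while the depth weighting compensates the geometric amplitude loss of a surface acquisition.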
The KR distance is computed with a fixed number of SDMM iterations. This number is selected such that the criterion maximized during the solution of problem (33) evolves only marginally after this number of iterations. For the 2D Marmousi problem, this number is set to 50, while it is set to 100 for the 3D overthrust problem. The computational overhead associated with the use of the KR distance corresponds to the increase of the gradient computation time. For the 2D Marmousi case, this overhead is 3.8 s for a gradient computation time of 20.6 s in the conventional L2 distance FWI formulation, which represents almost a 19% increase of the computational time. For the 3D overthrust model, the overhead is 180 s for a gradient computation time of 240 s in the conventional L2 distance FWI formulation, which represents a 75% increase of the computational time.
5.4.1. 2D Marmousi case study, sensitivity to noise. The P-wave velocity of the Marmousi 2 benchmark model is presented in figure 8(a). A fixed-spread surface acquisition with 128 sources every 125 m and 168 receivers every 100 m, both at 50 m depth, is considered. The synthetic data is generated using a Ricker source function centered on 5 Hz. The frequency content of the source is low-pass filtered below 3 Hz to mimic realistic seismic data, for which this frequency band is contaminated by noise and therefore unavailable for inversion. Two initial models are considered: the first contains the main features of the exact model, only with smoother interfaces. The second is a strongly smoothed version of the exact model with very weak lateral variations and an underestimated growth of the velocity in depth.
The results obtained using the L2 and KR distances are presented in figures 8(d)-(g). Starting from the first initial model, convergence to a correct estimation of the P-wave velocity model is reached using both the L2 and KR distances. The models presented in figures 8(d) and (e) have been obtained after 100 l-BFGS iterations. A slightly better estimation is obtained using the KR norm in the low-velocity zone at x = 11 km, z = 2.5 km, where a high-velocity artifact can be seen in the L2 estimation. Starting from the second initial model, only the results obtained using the KR distance are meaningful (figure 8(g)). The poor initial approximation of the P-wave velocity is responsible for cycle skipping, and the L2 estimation converges towards a local minimum (figure 8(f)). The estimation obtained with the KR norm is significantly closer to the true model, despite low-velocity artifacts in the shallow part at x = 1.5 km, z = 1 km and in depth at x = 12 km, z = 3.4 km. The KR result is obtained after 439 l-BFGS iterations. The minimization of the L2 misfit function with the same strategy fails after 83 iterations. From this experiment, the KR distance appears to be an appropriate tool to mitigate cycle skipping in FWI.
In practice, the low-frequency seismic data required to initiate multi-scale FWI cycles are significantly contaminated by noise. Therefore, it is important to design strategies that are stable with respect to the interpretation of noisy data. For this reason, an additional inversion is performed using the KR distance, starting from the second initial model, with noisy data. White Gaussian noise is added to the synthetic data. This white noise is band-pass filtered such that it belongs to the frequency band of the data, i.e. between 3 and 12.5 Hz. The signal-to-noise ratio (SNR) is taken equal to 5 (the power of the original data is 5 times higher than the power of the noise in this frequency band), which is representative of the SNR observed on real data for this frequency band. The result of this experiment is presented in figure 9. The P-wave velocity estimation is only slightly degraded by the presence of noise. Not surprisingly, the less illuminated zones on the borders of the model are more affected. This is understandable, as these are the zones where the redundancy of the information is the weakest. High-frequency oscillations are also introduced in the estimation, which could be removed in a second stage through smoothing/denoising strategies. Compared with the first case where no noise is introduced, convergence is observed here after 502 l-BFGS iterations. A better insight into how the method reacts to the addition of noise can be obtained by observing the L2 and KR residuals in the second initial model, with and without noise (figure 10). As already observed, the KR residuals appear smoother than the L2 residuals. More interestingly, the comparison between the KR and L2 residuals without noise shows that the KR distance enhances the weighting of weaker-amplitude seismic events, such as those observed after 3 s for the receivers in the vicinity of the source (located at x = 8 km).
For the KR distance, this different weighting seems to be preserved when noise is introduced. Conversely, these weakly energetic events fall below the noise level in the L2 residuals. The ability to maintain this weighting may explain the weak impact of the noise on the P-wave velocity estimation when the KR norm is employed.
5.4.2. 3D application on the SEG/EAGE overthrust model. In this experiment, the shallow left part of the 3D SEG/EAGE overthrust model is considered (figure 11). A 250 m deep water layer is added on top. The model covers a surface of 10 × 10 km² and is 2.5 km deep. A fixed-spread surface acquisition is used, with 8 × 8 = 64 sources (respectively 97 × 97 = 9409 receivers) regularly spaced every 1.2 km (respectively 100 m) in both the x and y directions, at 50 m depth. The synthetic dataset is generated using a Ricker source band-pass filtered between 3 and 7.5 Hz (figure 12). The spatial discretization leads to a representation of the P-wave velocity model with 201 × 201 × 51 discrete points, i.e. a 50 m grid spacing. Each seismogram thus corresponds to a data cube of 97 × 97 × 1000 ≈ 10⁷ discrete points. The purpose of this experiment is to focus on cycle-skipping problems in a 3D context and to compare the results obtained with the L2 distance and the KR distance. Cycle skipping is mostly observed on diving waves, which sample the shallowest part of the model. For this reason, the initial model is chosen to poorly represent the exact model, especially in its shallow part. Slices of the exact and initial models at constant y = 5 km and constant depths z = 1.5 km and z = 2 km are presented in figure 13. The initial model is an almost constant velocity model around 3000 m s⁻¹, while the velocity of the exact model reaches 3500 m s⁻¹ already at z = 1 km depth. For this reason, the kinematics of the diving waves is not correctly predicted by the initial model. This can be observed in figure 14, where the data associated with the source located at x = 4.8 km, y = 4.8 km, computed in the exact and the initial models, are presented. For each model, three data panels are presented, corresponding to slices in the data volume at constant x = 5 km, constant y = 5 km, and constant t = 3 s.
The data is dominated by the direct arrival propagating in the water layer and the strong reflection coming from the interface between the water layer and the sea bottom. The relative complexity of the signal is related to the source signature: the Butterworth filters applied to the Ricker wavelet yield a complicated wavelet with a large time support (figure 12). As the source and the water layer are considered to be known, the initial model correctly reproduces the direct arrivals. However, a time shift of at least 0.3 s can be observed for the diving waves recorded by the farthest receivers. Conventional FWI using the L2 distance is thus likely to produce inaccurate results in this configuration.
In figure 15, the L2 and KR residuals in the initial model are presented. The three panels correspond to slices in the residual volume at constant x = 5 km, constant y = 5 km, and constant t = 3 s, as for the data in figure 14. As noticed in the previous 2D experiments, the energy of the seismic events in the KR residuals is balanced so that the amplitude of each event is comparable, while the L2 residuals are dominated by short-to-intermediate offset missing events. Interestingly, this balance can be observed in the three panels, which testifies that the solution of the optimal transport problem is computed in the 3D volume without privileging one dimension over the other two. Two distinct workflows for comparing the L2 and the KR distances are followed. The first one simply consists of performing the inversion while giving the l-BFGS optimizer the freedom to perform as many iterations as possible. As for the Marmousi experiment, a Gaussian smoothing of the gradient is used, the correlation length being a fraction of the local wavelength (Operto et al 2006). Following this workflow, the inversion using the L2 distance terminates after 61 iterations, while 229 iterations are performed with the KR distance. Both terminations are related to a linesearch failure: the optimizer is not able to further minimize the misfit function. The second workflow is based on a restarting process. At each stage, 100 l-BFGS iterations are allowed. The initial model for each stage corresponds to the final model of the previous stage. After the first termination on a linesearch failure (instead of meeting the maximum of 100 iterations), the iterations are restarted from the previous model, but the regularization is removed. The process ends when a second linesearch failure is detected.
Following this second workflow, the inversion using the L2 distance terminates the first stage after 61 iterations, and 517 additional iterations are performed (with a restart of l-BFGS every 100 iterations), for a total of 578 iterations. The inversion using the KR distance terminates the first stage after 361 iterations, and 239 additional iterations with no regularization are performed, for a total of 600 iterations. The reason for this second workflow is that we observed that the l-BFGS optimizer can be sensitive to the gradient smoothing. This rather pragmatic modification of the descent direction is not accounted for in the misfit function. Therefore, errors can be introduced in the estimation of the descent direction by the l-BFGS algorithm. This workflow also mimics hierarchical strategies that are often used for real-data applications: in this sense, it provides a more realistic comparison than performing a single optimization as in workflow 1.

Figure 14. Seismograms in the exact (left column) and initial (right column) models. Seismogram cross-section at constant y = 5 km in the exact (a) and initial (b) models. Seismogram cross-section at constant x = 5 km in the exact (c) and initial (d) models. Seismogram cross-section at constant t = 3 s in the exact (e) and initial (f) models.
The results obtained following workflow 1 are presented in figure 16. Obvious signs of cycle skipping are visible in the estimation obtained with the L2 distance. In the constant y sections (figure 16(a)), low-velocity artifacts can be observed at 1 km depth, in zones where the velocity update should be positive. Conversely, the KR distance provides a more reliable result in the shallow part of the model, down to z = 1.5 km (figure 16(b)). This difference between the L2 and KR distances is emphasized by the constant z cross-sections presented in figures 16(c)-(f). At z = 1.5 km, the cross-section of the KR estimation presents the main structures of the exact model (figure 16(d)), contrary to that of the L2 estimation (figure 16(c)). At z = 2 km, the KR estimation still provides some relevant information on the exact model, for instance in the zone 6 km < y < 8 km, 2 km < x < 8 km (figure 16(f)). Conversely, the L2 estimation at this depth is completely cycle skipped, exhibiting low-velocity structures at locations where high-velocity updates would be expected (figure 16(g)). An additional illustration of the cycle skipping in the shallow part of the model using the L2 distance is provided in figure 17. The slices at constant y = 5 km of the estimated models are compared with the exact one, and the zone below z = 1.25 km is shaded. As can be seen, the curved reflector at 1 km depth is replaced with a low-velocity anomaly in the L2 estimation, while its structure begins to emerge in the KR estimation.
The results obtained following workflow 2 are presented in figure 18. The restarting procedure yields better estimations than the single optimization (workflow 1) for both the L2 and KR misfit functions. For both distances, the shallow part down to z = 1.5 km is approximately correctly recovered. However, stronger differences can be seen in depth. The slice at constant z = 2 km reveals that the L2 estimation still suffers from cycle skipping, with a low-velocity anomaly located near x = 4 km, y = 4 km. Conversely, the KR estimation at this depth is in better agreement with the exact model: this low-velocity artifact is replaced with the correct high-velocity structure.
An analysis of the residuals in the final estimations is provided in figures 19 and 20. For workflow 1, as can be seen in the constant x and constant y panels (figures 19(b) and (e)), the residuals associated with the diving waves are considerably reduced for the KR estimation (figures 19(c) and (f)). This observation is confirmed by the residual panel at constant t = 3 s, where the fringes associated with mismatched events are considerably reduced for receivers between x = 2 km, y = 2 km and x = 8 km, y = 8 km (figures 19(h) and (i)). For workflow 2, the difference between the residuals obtained with the L2 distance and the KR distance is less obvious (figure 20). For both estimations, these residuals are considerably reduced compared with those obtained in the final estimations following workflow 1.

Figure 18. Cross-sections of the L2 and KR estimations following workflow 2. Cross-section at constant y = 5 km for the L2 estimation (a), KR estimation (b). Cross-section at constant z = 1.5 km for the L2 estimation (c), KR estimation (d). Cross-section at constant z = 2 km for the L2 estimation (e), KR estimation (f).

Inverse Problems 32 (2016) 115008, L Métivier et al

The uninterpreted part of the data corresponds to strong reflections coming from the deepest part of the model, which is consistent with the inaccurate reconstruction observed in depth. However, one can note again lower-amplitude residuals associated with shallow diving waves using the KR distance, showing that in this case the kinematics of these waves is better reconstructed despite the cycle skipping observed in the initial model. As a final remark on this 3D experiment, one should keep in mind that the chosen initial model is particularly inaccurate: in practice, better estimations can be obtained without too much effort, for instance by accounting for the velocity increase in depth. In this sense, this 3D experiment may not represent a realistic application of FWI, and one could question the relative inaccuracy of the results obtained using both the L2 and KR distances, especially in depth. Better results with both strategies could of course be obtained starting from a more accurate initial model. However, the purpose of this experiment is rather to push the standard FWI strategy based on the L2 distance to its limit, and to observe what the change to the KR distance can bring. In this respect, KR-distance-based FWI appears as a more robust tool for mitigating cycle-skipping issues, feasible for realistic 3D applications.

Figure 19. L2 residuals in the initial model (left column), in the final model obtained using the L2 (middle column) and KR (right column) distances following workflow 1. Cross-section for constant y = 5 km in the initial model (a), in the final model obtained using the L2 (b) and KR (c) distances. Cross-section for constant x = 5 km in the initial model (d), in the final model obtained using the L2 (e) and KR (f) distances. Cross-section for constant t = 3 s in the initial model (g), in the final model obtained using the L2 (h) and KR (i) distances.

Conclusion and perspectives
Large-scale, smooth perturbations of the velocity are mainly responsible for delaying or accelerating the waves propagating within the subsurface, resulting in shifted events between the predicted and observed seismograms. Contrary to the L2 distance, optimal transport distances have the capability to detect shifts of recognizable patterns between images. For this reason, they provide an interesting alternative for the reconstruction of velocity models. In this study, the proposed optimal transport distance is based on the KR norm. The computation of this norm is recast as a non-smooth optimization problem, which is solved efficiently following a proximal splitting technique, namely the SDMM algorithm. Each iteration of this algorithm requires the solution of a Poisson equation with homogeneous Neumann boundary conditions, which is efficiently performed using a multigrid algorithm.
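As an illustration of the FFT-based option mentioned earlier, the sketch below solves the discrete Poisson equation with homogeneous Neumann boundary conditions on a 2D grid using a type-II discrete cosine transform, which diagonalizes the Neumann Laplacian. This is a generic solver sketch, not the implementation used in the study (which relies on a multigrid algorithm in 3D).

```python
import numpy as np
from scipy.fft import dctn, idctn

def poisson_neumann(f, h=1.0):
    """Solve the discrete Poisson equation Lap(u) = f with homogeneous
    Neumann boundary conditions on a regular 2D grid of spacing h, via a
    type-II DCT. f must have zero mean for the problem to be solvable;
    the free additive constant of u is fixed by zeroing its mean mode."""
    n, m = f.shape
    fhat = dctn(f, type=2, norm="ortho")
    # eigenvalues of the 1D second-difference operator with reflecting
    # (Neumann) boundaries, in the DCT-II basis
    kx = 2.0 * (np.cos(np.pi * np.arange(n) / n) - 1.0) / h ** 2
    ky = 2.0 * (np.cos(np.pi * np.arange(m) / m) - 1.0) / h ** 2
    denom = kx[:, None] + ky[None, :]
    denom[0, 0] = 1.0                 # avoid division by zero on the mean mode
    uhat = fhat / denom
    uhat[0, 0] = 0.0                  # fix the free additive constant
    return idctn(uhat, type=2, norm="ortho")

# toy usage: recover a smooth potential from its discrete Laplacian
n, h = 64, 1.0 / 64
x = (np.arange(n) + 0.5) * h                       # cell-centered grid
u_ref = np.cos(np.pi * x)[:, None] * np.cos(np.pi * x)[None, :]
up = np.pad(u_ref, 1, mode="edge")                 # reflecting ghost cells
f = (up[2:, 1:-1] + up[:-2, 1:-1] + up[1:-1, 2:] + up[1:-1, :-2]
     - 4.0 * u_ref) / h ** 2
u = poisson_neumann(f, h)
```

Each SDMM iteration of the KR computation amounts to one such solve, which explains the O(N log N) per-iteration cost of the FFT variant.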
The synthetic experiments confirm the interest of this approach for FWI applications. The KR distance between time-shifted 1D signals presents a single minimum as a function of the time shift. In a 2D time-domain framework with a surface acquisition, the misfit maps using the L2 and KR distances are computed for a velocity model represented as a linear function of depth. The introduction of the KR distance lifts up the secondary valleys, reducing the probability of converging towards a local minimum when using a gradient-based method. A 2D time-domain FWI experiment on the Marmousi model shows a better robustness of the KR distance with respect to the accuracy of the initial model. Interestingly, the method appears robust to the addition of noise to the data. Finally, the 3D experiment on the SEG/EAGE overthrust model demonstrates the possibility of using the KR norm even in this large-scale configuration. The method again appears as a reliable tool to mitigate cycle-skipping issues: starting from a poor initial model, FWI based on the KR distance is able to reconstruct more accurately the part of the model sampled by the diving waves, which are heavily cycle skipped in the initial model. Future work will include the application of this method to 2D and 3D real data in the FWI context, as well as its combination with reflection FWI strategies (Chavent et al 1994, Plessix et al 1999, Xu et al 2012, Zhou et al 2015). These methods seek to improve the FWI reconstruction of the velocity in depth through the alternate reconstruction of the subsurface reflectivity and the smooth velocity background. This makes it possible to account for transmission kernels between the reflectors and the surface acquisition. The use of optimal transport in this context may further improve the velocity reconstruction. Beyond FWI, this work yields interesting perspectives for tomography methods in general. As soon as a model parameter influencing the kinematics of the propagation is reconstructed from the matching of shifted measurements, the use of an optimal transport distance could be advantageous. The KR distance is flexible enough to offer the possibility of comparing non-positive signals in the sense of optimal transport, without requiring mass conservation between the signals.

Figure 20. L2 residuals in the initial model (left column), in the final model obtained using the L2 (middle column) and KR (right column) distances following workflow 2. Cross-section for constant y = 5 km in the initial model (a), in the final model obtained using the L2 (b) and KR (c) distances. Cross-section for constant x = 5 km in the initial model (d), in the final model obtained using the L2 (e) and KR (f) distances. Cross-section for constant t = 3 s in the initial model (g), in the final model obtained using the L2 (h) and KR (i) distances.
The numerical strategy presented in this study also makes it possible to consider large-scale applications: here, the data volume for the 3D application reaches O(10⁷) discrete samples. As the method scales with linear complexity, even larger-scale applications can be considered in the future.