Treatment of deterministic perturbations and stochastic processes within a quality control scheme

Meteorological in situ observational data comes with a variety of errors and uncertainties. Any further use of this data requires a sophisticated quality control to detect, quantify and, where possible, eliminate or at least reduce errors and thereby increase the value of the information. It must be assumed that each observational value Ψobs is contaminated by errors Ψerr, so that the true state Ψtrue is not known. Different kinds of errors can be identified. Each of them has different characteristics and therefore has to be detected through appropriate methods. For years, various methods such as a self-consistency test, clustering and nearest neighbour techniques have been implemented in the complex quality control scheme of the Vienna Enhanced Resolution Analysis (VERA). Former work addressed the elimination and treatment of gross errors. Building on it, the present investigation addresses the determination of stochastic and deterministic perturbations. In a first step we implemented a method that splits up the observational value so that stochastic errors are smoothed out as far as possible while deterministic perturbations are retained. Through controlled experiments in two dimensions, the performance and limitations of the complex quality control scheme have been investigated. The treatment of errors and signals on different scales, and the limits of the usability of this property, are the main focus of the presented investigation. We recommend the method for data quality control within a high resolution model analysing spatially distributed data in highly complex terrain.


Introduction
Meteorological observational in situ data comes with a variety of errors and uncertainties. Any further use of this data requires a sophisticated quality control to detect, quantify and, where possible, eliminate or at least reduce errors and thereby increase the value of the information. Different methodologies for detecting, handling and eliminating different kinds of erroneous data have been developed by, for instance, Gandin (1988) or Haiden et al. (2010). Phillips and Marks (1996), for example, suggest that every model using spatial interpolation should include an uncertainty map for its results, as every interpolation introduces an additional uncertainty to the original input values. Gandin (1988) suggests using a complex quality control able to treat all different kinds of errors individually as they occur. Working with or handling observational data within a data assimilation system of a weather analysis or prediction model requires individual quality control mechanisms according to different priorities.
The Vienna Enhanced Resolution Analysis (VERA) (Mayer and Steinacker, 2012) used in this work is independent of any prognostic model or model first guess field and focuses on the statistical behaviour and the spatial and temporal consistency of observational data to detect and correct errors before the data is brought to a regular grid. Several models on the meso scale the minimization of a cost function and a finite element solver. As Phillips (1986) pointed out, the meteorological noise in the initial data can be reduced by adjusting amplitudes and phases of gravity modes, in this case, to values that are forced by non-linear interaction between Rossby modes. This was an early attempt to suppress meteorological noise in fields generated from observational data.
For a better understanding of the methodology in section 2, a brief description of the error detection and elimination procedure is given. A detailed description of the gross error detection and of the overall complex quality control system can be found in Mayer et al. (2012). This paper focuses on the further development of the complex quality control system within VERA and on the added value that can be gained when it is used in complex terrain. Different types of errors and noise are separated and filtered to obtain a pure signal in meteorological fields. The described method within the VERA quality control scheme aims to preserve not only the synoptic scale, but also small scale orographically induced signals in the data. The accuracy of high resolution prognostic models could therefore be enhanced if the quality controlled data is used within, or as a part of, a data assimilation process. We recommend the proposed scheme for highly complex terrain and a high resolution model. On the global scale and on flat terrain, however, the 4D-VAR data assimilation scheme will be the better choice, as the model first guess field delivers a robust basis for data analysis on flat terrain. A positive impact on the WRF performance in the Alpine region, when using VERA quality controlled data, has been observed recently (pers. comm. Mayer, 2016).
The main goal of this paper is to investigate the performance, uncertainty and limitations of the proposed complex quality control by carrying out controlled experiments in two dimensions over complex terrain. An expansion to three dimensions, as is common for regional models like INCA, WRF or COSMO, could easily be carried out. The more the (wanted) signal is preserved and the more (unwanted) noise is filtered out from the data, the better the performance of the quality control scheme is. Section 2 explains the methodology; the data used and the controlled case studies performed are presented in Section 3, followed by the presentation and discussion of the results in Section 4. Conclusions and outlook finalize the paper in Section 5.

Methodology
Before irregularly distributed data are interpolated to a regular grid, complex quality control should be performed to eliminate or correct errors (Gandin, 1988). According to the methodology of Steinacker et al. (2000), Sperka and Steinacker (2011) and Mayer et al. (2012), it must be assumed that each observational value Ψobs is contaminated by errors Ψerr, so that the true state Ψtrue is not known.
As we normally only have observations available at discrete intervals, at stations at specific distances from each other, we can only derive those scales of the true field which are much larger than the average station distance. We call this resolvable, generally smooth part of the field "synoptic", Ψsyn, and denote the unresolvable rest by the term "sub scale", Ψsub. Concerning sub scale patterns, a downscaling, which is performed in the VERA system by the so called fingerprint technique, can be carried out if additional information is available. Fingerprints Ψfp are fields of high resolution (with regard to the station distance), for example from remote sensing platforms like radar for precipitation, satellite infrared radiometric information for temperatures, or high resolution topographic or land type information for parameters which are correlated to elevation or other topographic and land type features. The strength c_fp of the fingerprint pattern has to be calibrated (weighted) by observations through statistical regression. The more strongly the fingerprint pattern is present in the observational data, the higher the weighting factor c_fp is. Several different fingerprints may be offered to the system. Fingerprints have some similarities to EOFs, but are physically, rather than statistically, determined.
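The calibration of fingerprint weights by regression can be illustrated with a minimal sketch. It assumes an ordinary least-squares fit of the deviations (observation minus a smooth synoptic estimate) onto the fingerprint values at the stations; the function name, the synthetic patterns and the chosen weight of 3 are hypothetical and are not taken from the VERA implementation.

```python
import numpy as np

def fit_fingerprint_weights(deviation, fingerprints):
    """Least-squares weights c such that deviation ~ fingerprints @ c.

    deviation    : (n_stations,) observed value minus smooth (synoptic) estimate
    fingerprints : (n_stations, n_patterns) fingerprint values at the stations
    """
    weights, *_ = np.linalg.lstsq(fingerprints, deviation, rcond=None)
    return weights

rng = np.random.default_rng(0)
n = 200
fp_present = rng.uniform(0.0, 1.0, n)   # hypothetical pattern contained in the data
fp_absent = rng.uniform(0.0, 1.0, n)    # hypothetical pattern not contained in the data
deviation = 3.0 * fp_present + rng.normal(0.0, 0.1, n)

weights = fit_fingerprint_weights(deviation, np.column_stack([fp_present, fp_absent]))
print(weights)  # first weight close to 3, second close to 0
```

Offering several fingerprints at once, as columns of one matrix, lets the regression assign a weight near zero to patterns that the data do not contain.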
It should be noted here that the scale separation between the large (synoptic) scale and the subscale depends on the station density. If the mean station distance is of the order of 100 km, the large scale basically covers extratropical cyclones and anticyclones. If an observational micro-net with a 1 km station distance is available, even convective systems or urban heat islands may become "synoptic" features. Furthermore, it is impossible to separate the meteorological noise from random errors, which we therefore combine to Ψmn + Ψre = Ψnoise. Hence an observational value can be split up into 6 separable parts:

$$\Psi_{obs} = \Psi_{syn} + c_{fp}\,\Psi_{fp} + \Psi_{subsig} + \Psi_{noise} + \Psi_{se} + \Psi_{ge} \qquad (3)$$

Normally, with the exception of Ψge, the amplitude of Ψsyn is larger than the amplitude of the other components of Ψobs.
After the removal of gross errors, a low pass spectral, Gaussian, Laplacian or other adequate spatial filter will therefore create a field which is close to the synoptic component. The problem is that such a filter dampens not only random errors but also the meteorologically relevant smaller scale patterns. What we usually want are both the "clean" synoptic and the sub scale patterns. The difference between the observed value at a station and the filtered value, Ψobs − Ψsyn ("deviation"), represents the basis for the error detection and qualification scheme of VERA. The whole procedure to separate the terms of equation (3) has to be carried out iteratively:

- Iteration I (gross error detection): Many gross errors can be detected when the deviation exceeds certain physical or statistical limits. VERA uses the following criterion: if the deviation at one station exceeds physical limits or (for normally distributed variables) the three-fold long term interquartile range of the same station, the observation is treated as a gross error. To avoid the impact of gross errors on the spatial analysis in the next iteration, observations characterized as gross errors are omitted from the further analysis.
- Iteration II (systematic error / bias correction): If the temporal mean value of the deviations over a long time (e.g. a month) at a station is different from zero, such a mean deviation is characterized as a bias. In the next iteration the data set of observations is corrected with regard to the detected biases.
- Iteration III (fingerprint elimination): To be able to detect deterministic small scale patterns in the field, we need suitable fingerprints as mentioned above. We can offer the analysis system several possible fingerprints, for which the weights are determined by regression. If a pattern is recognized in the data, the weight will be positive; if it is not recognized, the weight will be zero. A negative weight means that the inverse of a given pattern has been recognized. Subtracting the deterministic small scale components, in the form of weighted fingerprints, from the observed value in equation (5) yields equation (6).
- Iteration IV (multivariate small scale signal elimination): If single subscale signals, found by a multivariate approach, are to be kept, the corresponding deviations can be subtracted from the left hand side of equation (6) to obtain equation (7). Alternatively, if it is desired that these sub scale signals are filtered, Ψsubsig can be left on the right hand side as part of the noise (equation 8).
- Iteration V (random error elimination): Now the noise can be eliminated from the field by applying a suitable filter. VERA uses an overlapping spatial Laplace filter (Mayer et al., 2012) to quantify the deviations, which are interpreted as random errors. By subtracting the latter from the left hand side of equation (7) or equation (8), the "clean" deterministic large scale (synoptic) part of the observation can finally be obtained (equation 9 or equation 10, respectively).

The quality checked and corrected "clean" synoptic field and the deterministic subscale patterns can then be recombined from the corresponding parts (equation 11 or equation 12, respectively).

For a simple one dimensional example and for a data set without gross errors and biases, the result of the filter process is shown in Fig. 1. As one can easily recognize, the filter response strongly depends on the scale and the amplitude of the synoptic pattern and on the amplitude of the noise (signal to noise ratio) with regard to the station distance. The VERA scheme published by Steinacker et al. (2011) and Mayer et al. (2012) executes the whole quality control package before calculating the spatial analysis fields. The presented quality control scheme within the analysis process is shown in Fig. 2 and allows small scale deterministic signals in meteorological fields to be conserved.
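Written out term by term, the iteration steps can be summarised as follows. This is a sketch reconstructed from the component definitions and from the equation numbers cited in the text, not a verbatim copy of the original equations. The shorthand symbols are assumptions introduced here: Ψcorr denotes the observation after omission of gross errors and bias correction (taken to be the quantity of equation 5), Ψnoise* = Ψnoise + Ψsubsig denotes the noise term when sub scale signals are filtered, and Ψqc denotes the recombined quality controlled value.

```latex
\begin{align*}
\Psi_{corr} &= \Psi_{syn} + c_{fp}\Psi_{fp} + \Psi_{subsig} + \Psi_{noise} && \text{(5), after Iterations I and II}\\
\Psi_{corr} - c_{fp}\Psi_{fp} &= \Psi_{syn} + \Psi_{subsig} + \Psi_{noise} && \text{(6), Iteration III}\\
\Psi_{corr} - c_{fp}\Psi_{fp} - \Psi_{subsig} &= \Psi_{syn} + \Psi_{noise} && \text{(7), Iteration IV, signals kept}\\
\Psi_{corr} - c_{fp}\Psi_{fp} &= \Psi_{syn} + \Psi_{noise}^{*} && \text{(8), Iteration IV, signals filtered}\\
\Psi_{syn} &= \Psi_{corr} - c_{fp}\Psi_{fp} - \Psi_{subsig} - \Psi_{noise} && \text{(9), Iteration V applied to (7)}\\
\Psi_{syn} &= \Psi_{corr} - c_{fp}\Psi_{fp} - \Psi_{noise}^{*} && \text{(10), Iteration V applied to (8)}\\
\Psi_{qc} &= \Psi_{syn} + c_{fp}\Psi_{fp} + \Psi_{subsig} && \text{(11), recombination after (9)}\\
\Psi_{qc} &= \Psi_{syn} + c_{fp}\Psi_{fp} && \text{(12), recombination after (10)}
\end{align*}
```

With several fingerprints, the single term c_fp Ψfp would be replaced by a sum over all weighted patterns.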

Data
The performance of the presented quality control scheme cannot seriously be verified when solely error afflicted operational in situ data sets are used. For verification purposes the generation of data is proposed. The presented data processing makes it possible to calculate the exact signal to noise ratio and therefore the exact mean and standard deviation of the desired atmospheric information and of the noisy part of the data. If the data are not generated, the statistical terms of the components described in equation (2) are not known a priori. To prove the technical accuracy of the method and to allow a sharp control, it is indispensable to generate the different components of an observational value separately and then analyse them. Therefore control experiments have been performed in which the set of non-dimensional components in equation (2) was generated. Data sets without any gross errors and biases were assumed, because the gross error detection and bias correction procedure is described in detail and extensively tested in Mayer et al. (2012). For simplicity we take just one fingerprint pattern (Ψfp). An exemplary presentation is shown in Fig. 3. Furthermore, subscale signals were not separated from random errors, and hence it is possible to stick with the formulations of equations (8), (10) and (12). Then equation (2) reduces to

$$\Psi_{obs} = \Psi_{syn} + c_{fp}\,\Psi_{fp} + \Psi_{noise} \qquad (13)$$

The synoptic part of the field is analytically generated by a two dimensional, smooth, chess pattern wave system (Eq. 14). The amplitude A of the wave pattern is set arbitrarily to 1, and the wave numbers µx and µy vary for the different experimental settings between 0.005 km⁻¹ for large scale waves and 0.04 km⁻¹ for meso-β scale waves, which corresponds to wavelengths λx and λy of approximately 1250 km and 150 km respectively. For the fingerprint pattern the thermal fingerprint (Steinacker et al., 2006; Bica et al., 2006) has been chosen, which indicates the different heating/cooling pattern induced by lowlands, mountains and water bodies (Fig. 3). In the setting for the discussed examination, the dimensionless values of Ψfp vary between 0 and 1. The weight c_fp of the thermal fingerprint varies for the experimental settings between 1 and 5.

The noise part of the field has been produced by a random generator yielding spatially uncorrelated, Gaussian distributed numbers with a mean of 0 and a standard deviation between 0.2 and 2; it represents the roughest part of the field. By varying the wavelength of the synoptic part, the amplitude of the fingerprint part and the amplitude of the noise part with regard to the amplitude of the synoptic part (signal to noise ratio), we can investigate how well and how effectively the suggested quality control procedure filters and eliminates the noise while retaining the synoptic and fingerprint parts of the field, and whether, or under what conditions, there are limits to its applicability.
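The generation of such a controlled data set can be sketched as follows. Several parts are assumptions made for illustration only: the product-of-sines form of the "chess pattern" wave, the random placeholder standing in for the thermal fingerprint, the 4000 km domain size, and all function and variable names.

```python
import numpy as np

def synthetic_observations(x_km, y_km, fingerprint, mu_x, mu_y, c_fp,
                           noise_std, amplitude=1.0, seed=0):
    """Return (psi_obs, psi_syn, psi_noise) at the given station coordinates."""
    rng = np.random.default_rng(seed)
    # Assumed wave form: a product of sines gives a checkerboard of highs and lows.
    psi_syn = amplitude * np.sin(mu_x * x_km) * np.sin(mu_y * y_km)
    # Spatially uncorrelated Gaussian noise with zero mean.
    psi_noise = rng.normal(0.0, noise_std, size=x_km.shape)
    psi_obs = psi_syn + c_fp * fingerprint + psi_noise
    return psi_obs, psi_syn, psi_noise

rng = np.random.default_rng(1)
n_stations = 1311
x_km = rng.uniform(0.0, 4000.0, n_stations)      # hypothetical station coordinates
y_km = rng.uniform(0.0, 4000.0, n_stations)
fingerprint = rng.uniform(0.0, 1.0, n_stations)  # placeholder for the thermal fingerprint

psi_obs, psi_syn, psi_noise = synthetic_observations(
    x_km, y_km, fingerprint, mu_x=0.005, mu_y=0.005, c_fp=1.0, noise_std=1.0)
```

Because every component is returned separately, the statistics of the wanted signal and of the noise are known exactly, which is the point of working with generated data.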

Test Domain
The density of observation sites is high in Central Europe, whereas in Scandinavia, on the Iberian peninsula and especially over the oceanic areas it is much lower. The mean distance between two adjacent stations in the whole domain is close to 90 km. In Central Europe it is around 30 km and in the data sparse maritime areas several hundred km.
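A mean station distance of this kind can be estimated as the mean nearest-neighbour distance. The following minimal sketch rests on that interpretation, which is an assumption, and on planar coordinates in km; the function name and the regular test grid are hypothetical.

```python
import numpy as np

def mean_nearest_neighbour_distance(x_km, y_km):
    """Mean distance from each station to its closest neighbour (planar coordinates)."""
    coords = np.column_stack([x_km, y_km])
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # ignore the zero distance of a station to itself
    return dist.min(axis=1).mean()

# Regular 10 x 10 grid with 30 km spacing: every nearest neighbour is 30 km away.
gx, gy = np.meshgrid(np.arange(10) * 30.0, np.arange(10) * 30.0)
d_mean = mean_nearest_neighbour_distance(gx.ravel(), gy.ravel())
print(d_mean)
```

For stations given in geographic coordinates, the Euclidean distance would have to be replaced by a great-circle distance.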

Case Studies
To evaluate how well the noisy part of the data is filtered, various case studies with different parameter settings were performed. The settings of these case studies are listed in Tab. 1 and the associated statistics in Tab. 2. The designation of each case study consists of the three parts that build the generated data value, characterized by the capital letters W, N and FP: W stands for the wavenumber, N for the noisy part and FP for the "fingerprint". The numbers directly following the capital letters indicate the applied wavenumber (for W), the standard deviation (for N) or the weight (for FP). Within the quality control scheme the bias correction and gross error correction were switched off. These parts have been extensively tested in previous work (Steinacker et al., 2011; Mayer et al., 2012).

Statistics
For a robust interpretation and evaluation of the filter and its performance and limits, statistical analyses were performed.
- Noise ratio (NR): The NR is computed from STD Ψ(syn+noise), the standard deviation of the input signal before the application of the quality control, and STD Ψnoise, the standard deviation of the noisy part of the initial signal. To calculate the ratio for quality controlled data, the standard deviation of the output signal, i.e. of Ψ(syn+noise) after the initial data was quality controlled, is used in the formula instead. The NR can therefore be described as the power of the noise divided by the power of the signal (Kieser et al., 2005).
- Correlation coefficient (CC): The CC indicates how well two series of data fit together. The squared CC gives the fraction of the variance which is statistically explained by the regression,

$$CC^2 = \frac{\sum_j \left(\hat{y}(x_j) - \bar{y}\right)^2}{\sum_j \left(y_j - \bar{y}\right)^2}$$

where y_j are the observed values, ȳ their mean value and ŷ(x_j) the values predicted by the regression (Wilks, 2006). The correlation coefficient between the initial data Ψ(syn+noise) and the quality controlled data Ψ(syn+noise) is shown in Tab. 3 in column CC. The correlation between the initial Ψ(syn+noise) and the Ψsyn part of the same case study is given in Tab. 2 (column C1), and the correlation between the quality controlled Ψ(syn+noise) and Ψsyn of the same case study in Tab. 3 (column C2).
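The statistics can be computed along the following lines. This is a minimal sketch: the noise ratio is assumed here to be the standard deviation of the noise divided by the standard deviation of the signal, which is one reading of the description above, and the "quality controlled" series is only a stand-in in which the noise is damped by half.

```python
import numpy as np

def noise_ratio(psi_noise, psi_signal):
    # Assumed definition: STD of the noise divided by STD of the signal.
    return np.std(psi_noise) / np.std(psi_signal)

def correlation(a, b):
    # Pearson correlation coefficient between two series.
    return np.corrcoef(a, b)[0, 1]

rng = np.random.default_rng(2)
psi_syn = np.sin(np.linspace(0.0, 4.0 * np.pi, 500))
psi_noise = rng.normal(0.0, 1.0, 500)
psi_in = psi_syn + psi_noise
# Stand-in for quality controlled data: the same field with the noise damped by half.
psi_qc = psi_syn + 0.5 * psi_noise

c1 = correlation(psi_in, psi_syn)   # initial data vs. synoptic part
c2 = correlation(psi_qc, psi_syn)   # quality controlled data vs. synoptic part
cc = correlation(psi_in, psi_qc)    # initial vs. quality controlled data
nr = noise_ratio(psi_noise, psi_in)
print(c1, c2, cc, nr)
```

In this stand-in, damping the noise raises the correlation with the synoptic part from C1 to C2, the behaviour the quality control is meant to produce.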
For the spectral analysis a fast Fourier transformation (FFT) was performed. The purpose is to visualize the different wavelengths and energy spectra of the initial and the quality controlled signal; the spectra are depicted and the performance is discussed in section 4. For the statistical evaluation the noise ratio (NR), the standard deviation (STD) and the correlation coefficients (CC, C1 and C2) were calculated for the original (Tab. 2) and the quality controlled data (Tab. 3). Comparing the statistics before (Tab. 2) and after (Tab. 3) the application of the quality control on Ψ(syn+noise), a significant improvement is apparent from the lower STD of the quality controlled data shown in Tab. 3.

To get an idea of how the quality control affects signals originating from different scales, the initial data Ψ(syn+noise) and the quality controlled data Ψ(syn+noise) were detrended and a window function was applied before the FFT. For the spectral analysis only data after the subtraction of the c_fp Ψfp part was used; the result is presented in the log-log graphs in Fig. 5. Since the observational data, and therefore the quality controlled data, is a mixture of different signals characterized by different wavelengths, an FFT provides an insightful analysis. After the quality control the signals are no longer properly separable, but the FFT gives an idea of the effect the quality control has on the initial data.
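The detrend, window and FFT steps can be sketched for a one dimensional, regularly spaced series. This is a simplification, since the station data of the study are irregularly distributed in two dimensions; the Hann window, the linear detrending and all names are assumptions made for illustration.

```python
import numpy as np

def amplitude_spectrum(values, spacing_km):
    """Amplitude spectrum of a detrended, Hann-windowed series."""
    n = values.size
    positions = np.arange(n)
    # Remove the linear trend so that it does not leak into the low wavenumbers.
    trend = np.polyval(np.polyfit(positions, values, 1), positions)
    windowed = (values - trend) * np.hanning(n)
    spectrum = np.abs(np.fft.rfft(windowed))
    wavenumber = np.fft.rfftfreq(n, d=spacing_km)  # cycles per km
    return wavenumber, spectrum

rng = np.random.default_rng(3)
spacing_km = 30.0
s = np.arange(512) * spacing_km
signal = np.sin(2.0 * np.pi * s / 1280.0)        # wave with 1280 km wavelength
noisy = signal + rng.normal(0.0, 0.2, s.size)

k, spec = amplitude_spectrum(noisy, spacing_km)
# Skip the zero wavenumber when searching for the dominant wavelength.
peak_wavelength_km = 1.0 / k[np.argmax(spec[1:]) + 1]
print(peak_wavelength_km)
```

Plotting the spectrum on log-log axes for the initial and the quality controlled series then shows at which wavelengths energy has been removed.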

Performance
The graphs in Fig. 5 show the spectrum from longer wavelengths on the left to shorter wavelengths on the right: high energetic large scale vortices sit at the left end of the scale, and small eddies, noise and dissipation at the right end. With the preservation of large vortices and the reduction of smaller scale eddies, the performance of the quality control scheme is as anticipated (Stull, 2009).

Limits of the filter
For different simulated atmospheric conditions the filter shows its limits. Tab. 1 lists the conditions of the performed case studies. In case study W001N1FP1, with a long wavelength in the Ψsyn part of the signal and a standard deviation of Ψnoise around 1, the NR is significantly higher after the application of the quality control scheme, whereas the NR has barely improved in case study W001N02FP1, with the same data for Ψsyn but a standard deviation of the meteorological noise Ψnoise of approximately 0.2. For a Ψnoise with STD = 5 the NR shows significantly different ratios in all cases.

In Fig. 6 the values of the different parts of the initial data are plotted in order of the magnitude of Ψsyn. In the formula for the Ψsyn signal (Eq. 14), A is set to 1 for all case studies. Therefore the maximum amplitude should be located around +2 and −2 respectively, depending on the added noisy part Ψnoise. Clearly visible is the damping of the noisy part of the initial data (green) due to the application of the filter (quality control): the fluctuations of the quality controlled data (red) are of smaller amplitude than before the filter treatment. Another impact of the filter treatment, not shown here, is an additional damping of Ψsyn, which is often undesired and only appears if both parts of the initial data lie within a relatively similar range of wavenumbers. The damping becomes stronger the smaller the difference between the wavelengths of Ψsyn and Ψnoise gets. It is not likely to appear in real meteorological conditions, where the synoptic scale signal and the meteorological noise are clearly distinguishable.

The two case studies shown in Fig. 7 show a significant improvement with respect to the reduction of deviation and data variability. In the chart on the left, with an initial data composition of wave number µ = 0.0015 in the Ψsyn part and STD = 0.2 for Ψnoise, the correlation could be enhanced from C1 = 0.95 to C2 = 0.98 for the quality controlled data. The case study on the right, with µ = 0.005 in Ψsyn and STD = 1 for Ψnoise, had a C1 of 0.6 for the initial data, which increased to a C2 of 0.8 for the quality controlled data. As shown in Tab. 2 and Tab. 3, the correlations of Ψsyn with the initial and with the quality controlled Ψ(syn+noise) respectively could be enhanced significantly, which was the aim of changing the routine of the quality control scheme.

Conclusions
A sophisticated data quality control forms the basis for a comprehensive analysis and the subsequent use of measured data for data assimilation and forecasting purposes. The step presented here, within a continuous and long-lasting development process of a complex quality control system, covers only a small part of the comprehensive and extensive field dealing with a broad variety of errors, their detection and their correction. The overall target of different quality control systems is to preserve and represent the current state of the atmosphere as closely to the truth as one can get.
Overall, the quality control scheme is able to reduce the noisy part of an initial data set even if the variation is small. The more the wavenumber of the Ψsyn part differs from that of the Ψnoise part of the initial data fields, the more significant the filtering of the erroneous part of the data will be. If the noisy, erroneous data and the "fingerprint" pattern are of the same scale, the subtraction of the "fingerprint" Ψfp from the observational value Ψobs would not be satisfying, as the subtraction would be vague and not sharp enough to preserve phenomena. Consequently this quality control scheme would not yield its best performance under such conditions. Under real conditions within complex terrain, the so called synoptic signal and the terrain induced modification will be of different scales, and therefore the quality control system is able to manage the separation of the different signals. Even the meteorological noise generally appears on a different scale than the terrain induced signal.
Since the present study is based on generated data, a comprehensive evaluation using observational data is the obvious next step. Furthermore, a detailed performance analysis within different environments in complex terrain will be carried out.
The main focus will lie, on the one hand, on the applicability of the presented complex data quality control system to an area with dense observational data availability and, on the other hand, on determining the opposite limit for a useful analysis in data sparse areas. As in the present paper, further investigations and analyses will be carried out in highly complex terrain environments. The usability of open access observational data from partly private weather stations should be addressed by a data quality control scheme. The analysis of different parameters requires the development of different "fingerprints" and/or the use of their combination to identify various meteorological phenomena. For this purpose an area in the Tropics with highly irregularly distributed in situ observations within a diurnal climate is selected to evaluate the potential of the presented methodology in the given environment. Additionally, the benefit for an analysis of adding small areas where data is collected within a denser observation network should be determined.

Now that the limits of the filter (high resolution analysis, significant difference between the signals) are known, real data can be analysed with this method. The difference here is that the exact noise ratio of real data is not known, but it is reasonable to assume that the signal to noise ratio is higher in situations with significant and strong synoptic gradients, and therefore coherent atmospheric conditions, than in situations where the gradient is weaker and the signal to noise ratio is therefore very low. A comparison with real observational data is the reasonable next step. For the best possible outcome, the same location as shown in Fig. 4 will be used. The temperature and pressure data will be analysed, and the selected case studies should fit the framework of the generated data. Coherent atmospheric conditions like gradient intensive synoptic patterns will be selected. Further evaluation will examine the performance of the presented method on observational networks of different density. It is expected that a denser network does not bring significant information to the performed analysis, but the investigation will point out future perspectives.

Figure 1. One dimensional example (observations along a space coordinate s) of the effect of filtering observational data with and without consideration of small scale patterns (fingerprints). When observational data are filtered directly (dotted curve), much of the deterministic small scale pattern is lost. When filtering the observed data without the fingerprint pattern (dashed curve), we damp only the noise. The sum of the filtered synoptic part and the fingerprint part (continuous curve) results in a pattern where small scale deterministic features stay unfiltered despite the efficient noise filtering.

Figure 2. Process of the quality control scheme. Ψobs is the initial data at irregularly distributed observational station coordinates. Ψana is the analysed value, where possible deterministic, physically explicable patterns (Ψfp) are extracted and weighted with the calculated factor c. Ψsyn+noise (large scale signal and meteorological noise) is the part of the analysed initial data that is unexplained by deterministic, physically explicable patterns. Ψsyn+noise is the quality controlled part of the initial data.
Fig. 4 shows the location of 1311 observational stations within the European domain. The test domain encompasses a large part of Europe and North Africa. The station locations have been taken from an available set of surface weather stations on a particular day.

Geosci. Instrum. Method. Data Syst. Discuss., https://doi.org/10.5194/gi-2017-42. Manuscript under review for journal Geosci. Instrum. Method. Data Syst. Discussion started: 21 December 2017. © Author(s) 2017. CC BY 4.0 License.

Figure 3. Thermal fingerprint of Europe as used in the VERA downscaling procedure. Contour lines are dimensionless and range between 0 and 10 in the operational setting.

Figure 4. Distribution of 1311 observational stations within the European section.

Figure 5. Spectral analysis performed with a fast Fourier transformation (FFT). The green line shows the initial data Ψ(syn+noise) before the quality control, the red line the data Ψ(syn+noise) after the application of the quality control. The figure at the top shows the case study W005N1FP1, whereas at the bottom case study W001N1FP5 is depicted. In both case studies the noise input is exactly the same, whereas the Ψsyn part is of shorter wavelength in the case study on the left.

Figure 6. Distribution of Ψsyn (black solid line) at 1250 observational station coordinates, ordered by the magnitude of Ψsyn. Green: the initial data Ψ(syn+noise) before the filter performance test; red: the data Ψ(syn+noise) after the application of the quality control. The wavelength of Ψsyn is 3600 km; the standard deviation of the Ψnoise part is 2 in the chart on the right and 1 in the chart on the left. Note the different scaling of the two charts.

Figure 7.
If we consider the fact that during a typical shower the temperature drops, humidity rises, wind speed increases, wind direction changes, pressure rises, etc., we can get a more robust estimate of whether the value represents a signal or just a random error when we also consider the spatial structure of the other mentioned parameters. The difference to the fingerprint technique is that, although we can distinguish between signal and error or noise by the multivariate approach, we cannot derive the scale of the phenomenon or the sub scale spatial pattern. Sub scale signals uncovered by a multivariate approach are denoted by Ψsubsig. The residual of Ψsub, which is detectable neither by the fingerprint nor by the multivariate approach, is denoted as meteorological noise Ψmn. The error part of the observations Ψobs, which may be caused by a sensor calibration error, wrong reading, an error introduced during transmission, coding or decoding, etc., can be split up into random errors Ψre, systematic errors (bias) Ψse and gross errors Ψge. Hence it is possible to split up each observational value into a number of parts:

$$\Psi_{obs} = \Psi_{syn} + c_{fp}\,\Psi_{fp} + \Psi_{subsig} + \Psi_{mn} + \Psi_{re} + \Psi_{se} + \Psi_{ge} \qquad (2)$$

Table 1. Conditions and characteristics of initial data components (Ψsyn, Ψnoise, Ψfp) for various case studies. The designation of the case studies consists of the capital letters W, N and FP; the following numbers indicate the applied wavenumber, standard deviation or weight respectively. µx, µy = wavenumber; STD = standard deviation of randomly distributed data (mean = 0); c_fp = weighting factor.

Table 2. Statistical information for parts of the initial data (Ψsyn and Ψnoise) used in case studies before the quality control was applied. NR is the noise ratio between Ψsyn+noise and Ψnoise. C1 is the correlation coefficient between Ψsyn+noise and Ψsyn.

Table 3. Performance of the quality control system. Correlation coefficient (CC), noise ratio (NR), MEAN and standard deviation (STD) of the Ψ(syn+noise) data after the application of the quality control are listed.