Point estimation

In statistics, point estimation involves the use of sample data to calculate a single value (known as a point estimate, since it identifies a point rather than an interval), which serves as a "best guess" or "best estimate" of an unknown quantity, for example, the population mean, the variance of a distribution, or a model parameter (in a parametric model).

Point estimation can be contrasted with interval estimation: interval estimates are typically either confidence intervals, in the case of frequentist inference, or credible intervals, in the case of Bayesian inference. More generally, a point estimator can be contrasted with a set estimator. Examples are given by confidence sets or credible sets. A point estimator can also be contrasted with a distribution estimator. Examples are given by confidence distributions, randomized estimators, and Bayesian posteriors.

Properties of point estimators

Biasedness

The bias of an estimator is the difference between its expected value and the true value of the population parameter being estimated. The closer the expected value of the estimator is to the true parameter value, the smaller the bias. When the expected value equals the true value, the estimator is called unbiased. An unbiased estimator that has minimum variance among all unbiased estimators is called a best unbiased estimator. However, a biased estimator with a small variance may be more useful than an unbiased estimator with a large variance.^[1]

Mathematically speaking, if $T=h(X_{1},\dots ,X_{n})$ is an estimator based on a random sample $X_{1},\dots ,X_{n}$ drawn from a distribution $\mathbb {P} _{\theta }$ , the difference $\mathbb {E} _{\theta }[T]-\theta$ is called the bias of $T$ . The estimator $T$ is called unbiased for the parameter $\theta$ if the bias is zero, irrespective of the value of $\theta$ .^[2] Otherwise, $T$ is called biased. Examples of unbiased estimators are the sample mean and the unbiased sample variance.
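The definition of bias can be illustrated numerically. The following Python sketch is an illustration only, not taken from the cited sources: it approximates $\mathbb {E} _{\theta }[T]-\theta$ by Monte Carlo simulation for two variance estimators applied to standard normal samples. The function names, the sample size and the number of repetitions are arbitrary choices made for this example.

```python
import random

def biased_var(xs):
    """Variance estimate that divides by n (biased)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def unbiased_var(xs):
    """Variance estimate that divides by n - 1 (unbiased)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def estimate_bias(estimator, true_value, n, reps, rng):
    """Monte Carlo approximation of E[T] - theta for standard normal samples."""
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += estimator(sample)
    return total / reps - true_value

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 5
bias_biased = estimate_bias(biased_var, 1.0, n, 20000, rng)
bias_unbiased = estimate_bias(unbiased_var, 1.0, n, 20000, rng)
print(bias_biased, bias_unbiased)
```

For a true variance of 1 and $n=5$ , the divide-by-$n$ estimator has theoretical bias $-1/n=-0.2$ , whereas the simulated bias of the divide-by-$(n-1)$ estimator should be close to zero.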

The concept of (un-)biasedness can be generalized to other metrics than the mean. An unbiased estimator $T$ fulfills^[3] $\mathbb {E} _{\theta }[T]=\mathrm {arg} \min _{\theta '}\mathbb {E} _{\theta }[(T-\theta ')^{2}]=\theta .$ Thus, a more general condition for unbiasedness can be defined by^[4] $\mathrm {arg} \min _{\theta '}\mathbb {E} _{\theta }{\big [}W{\big (}T,\theta '{\big )}{\big ]}=\theta ,$ for some function $W$ . For example, if $W(T,\theta ')=|T-\theta '|$ , then the estimator is called median-unbiased^[5], since the median is a minimiser of the mean absolute error.

Consistency

A point estimator is called consistent if the probability that the estimate is close to the true value tends to 1 as the sample size grows to infinity. If the estimate almost surely gets arbitrarily close to the true value as the sample size grows to infinity, the estimator is called strongly consistent. Intuitively, a consistent estimator is "probably approximately correct", and a strongly consistent estimator is "almost surely approximately correct", once the sample size is large enough. In the special case of an unbiased estimator, consistency already follows (by Chebyshev's inequality) if the variance decreases to zero as the sample size grows to infinity.
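Consistency can be made tangible with a small simulation. The sketch below is an illustration under assumed settings (Uniform(0, 1) data, tolerance 0.05, and the sample sizes shown are all arbitrary choices): it estimates the probability that the sample mean misses the true mean 0.5 by more than the tolerance, for increasing sample sizes.

```python
import random

def miss_probability(n, eps, reps, rng):
    """Monte Carlo estimate of P(|sample mean - 0.5| > eps) for Uniform(0, 1) data."""
    misses = 0
    for _ in range(reps):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) > eps:
            misses += 1
    return misses / reps

rng = random.Random(1)  # fixed seed so the run is reproducible
probs = {n: miss_probability(n, eps=0.05, reps=2000, rng=rng) for n in (10, 100, 1000)}
print(probs)
```

The estimated miss probabilities should shrink toward zero as $n$ grows, which is what consistency of the sample mean asserts.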

Efficiency

Efficiency is a property used to investigate the variance of unbiased estimators. According to the bias-variance decomposition, the variance of an unbiased estimator is equal to its mean squared error (MSE), which is often used as a measure for the approximation error of an estimator. Thus, it is desirable to seek (unbiased) estimators that have minimal variance. If the distribution from which the data is drawn is sufficiently "well-behaved", it can be shown that the variance of "well-behaved" unbiased estimators cannot be smaller than a certain threshold $\sigma _{\mathrm {min} }^{2}$ (that is, the Cramér–Rao bound). An estimator whose variance is exactly $\sigma _{\mathrm {min} }^{2}$ is then called efficient.

More precisely, for a random sample $X_{1},\dots ,X_{n}$ drawn from a distribution $\mathbb {P} _{\theta }$ with parameter $\theta \in \mathbb {R}$ (efficiency can also be defined in a similar manner for parameter vectors) fulfilling the Cramér–Rao regularity conditions, any regular unbiased estimator $T=h(X_{1},\dots ,X_{n})$ of the parameter $\theta$ fulfills the Cramér–Rao bound

$\mathrm {Var} _{\theta }(T)\geq {\frac {1}{I(\theta )}}\qquad {\text{for any }}\theta ,$

where $I(\theta )$ is the Fisher information of $\mathbb {P} _{\theta }$ . If $\mathrm {Var} _{\theta }(T)=1/I(\theta )$ for any value of $\theta$ , then $T$ is called efficient. In fact, it can be shown that $T$ is efficient if and only if $\mathbb {P} _{\theta }$ is an exponential family and $T$ (actually $h$ ) is the natural sufficient statistic.^[6]^[7]
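As a concrete check of the bound, consider $n$ independent Bernoulli($p$) observations. In the sketch below, $I(p)$ is taken to be the Fisher information of the whole sample, $n/(p(1-p))$ , which is an assumption about how the bound above is to be read; the sample mean then attains the bound exactly. The example and its function names are illustrative and not taken from the cited sources.

```python
import random

def cramer_rao_bound(p, n):
    """1 / I(p), with I(p) = n / (p (1 - p)) for n Bernoulli(p) draws."""
    return p * (1.0 - p) / n

def simulated_variance_of_mean(p, n, reps, rng):
    """Monte Carlo variance of the sample mean of n Bernoulli(p) draws."""
    means = []
    for _ in range(reps):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        means.append(successes / n)
    centre = sum(means) / reps
    return sum((m - centre) ** 2 for m in means) / (reps - 1)

rng = random.Random(2)  # fixed seed so the run is reproducible
p, n = 0.3, 50
bound = cramer_rao_bound(p, n)
observed = simulated_variance_of_mean(p, n, 20000, rng)
print(bound, observed)
```

The simulated variance should agree with the bound up to Monte Carlo error, consistent with the sample mean being an efficient estimator in this exponential family.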

Sometimes, the variance of an estimator is compared to the variance of some other estimator (rather than the Cramér–Rao bound): For two unbiased estimators $T_{1}$ and $T_{2}$ of the same parameter $\theta$ , the estimator $T_{2}$ would be called more efficient than $T_{1}$ if $\mathrm {Var} _{\theta }(T_{2})<\mathrm {Var} _{\theta }(T_{1})$ , irrespective of the value of $\theta$ .^[8] One can also say that the most efficient estimators are the ones with the least variability of outcomes. The notion of (relative) efficiency can be extended to biased estimators by saying that an estimator $T_{2}$ is more efficient than an estimator $T_{1}$ (for the same parameter of interest), if the MSE of $T_{2}$ is smaller than the MSE of $T_{1}$ .^[9]
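Relative efficiency can be explored by simulation. The following sketch is an illustration with arbitrarily chosen settings (normal data with unit variance, $n=25$ ): it compares the MSE of the sample mean and the sample median as estimators of the centre of a normal distribution. Both are unbiased here, so the MSE equals the variance.

```python
import random
import statistics

def mse(estimator, true_value, n, reps, rng):
    """Monte Carlo mean squared error of an estimator for N(true_value, 1) samples."""
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(true_value, 1.0) for _ in range(n)]
        total += (estimator(sample) - true_value) ** 2
    return total / reps

rng = random.Random(3)  # fixed seed so the run is reproducible
mse_mean = mse(statistics.fmean, 0.0, 25, 10000, rng)
mse_median = mse(statistics.median, 0.0, 25, 10000, rng)
print(mse_mean, mse_median, mse_median / mse_mean)
```

For normal data the sample mean should show the smaller MSE, so in the terminology above it is more efficient than the sample median; for large $n$ the ratio of the two variances approaches $\pi /2$ .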

Uniform minimum variance and unbiasedness (UMVU)

The concept of efficiency is restricted to estimators and distributions that fulfill the Cramér–Rao regularity conditions. However, there are other settings that do not fit into this framework but in which unbiased estimators with uniformly minimal variance over all parameters $\theta$ still exist. Such estimators are then called (uniformly) minimum-variance unbiased estimators (UMVUE/MVUE).

Given a distribution $\mathbb {P} _{\theta }$ with some unknown parameter $\theta \in \Theta$ , an unbiased estimator ${\hat {\theta }}$ is a UMVUE if for any other unbiased estimator ${\hat {T}}$ it holds that^[10]

$\mathrm {Var} _{\theta }({\hat {\theta }})\leq \mathrm {Var} _{\theta }({\hat {T}}),$

for any $\theta \in \Theta$ .

Sufficiency

The task of a statistician is to interpret the collected data and to draw statistically valid conclusions about the population under investigation. In many cases, however, the raw data are too numerous and too costly to store and are therefore not suitable for this purpose. The statistician would then like to condense the data by computing a few statistics and to base the analysis on these statistics without losing relevant information, that is, to choose statistics that exhaust all the information about the parameter contained in the sample.

Sufficient statistics are defined as follows: Let $X=(X_{1},\dots ,X_{n})$ be a random sample. A statistic $T(X)$ is said to be sufficient for $\theta$ (or for the family of distributions) if the conditional distribution of $X$ given $T$ does not depend on $\theta$ .^[11]
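The definition can be verified exactly in a small case. The sketch below is an illustrative example, not taken from the cited sources: it takes $n$ independent Bernoulli($p$) observations with $T(X)=\sum _{i}X_{i}$ and computes the conditional distribution of $X$ given $T=t$ with exact rational arithmetic for two different values of $p$ .

```python
from fractions import Fraction
from itertools import product

def conditional_distribution(p, n, t):
    """P(X = x | sum(X) = t) for every binary sequence x of length n with sum t."""
    joint = {}
    for x in product((0, 1), repeat=n):
        if sum(x) == t:
            # Joint probability of the sequence x under independent Bernoulli(p) draws.
            joint[x] = p ** t * (1 - p) ** (n - t)
    total = sum(joint.values())
    return {x: prob / total for x, prob in joint.items()}

dist_a = conditional_distribution(Fraction(1, 5), n=4, t=2)
dist_b = conditional_distribution(Fraction(7, 10), n=4, t=2)
print(dist_a == dist_b, set(dist_a.values()))
```

Both calls yield the same distribution: each of the six sequences with two successes has conditional probability 1/6, whatever the value of $p$ . The sum is therefore sufficient for $p$ in this model.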

Asymptotic properties

Often, one is interested in the behaviour of a statistical estimation procedure as the sample size $n$ tends to infinity. In addition to consistency, desirable asymptotic properties of an estimator $T_{n}$ for $\theta$ are:

asymptotic unbiasedness: Roughly speaking, the distribution of $T_{n}-\theta$ approaches a distribution with zero mean as $n$ tends to infinity. More precisely, $T_{n}-\theta$ converges in distribution to a distribution with zero mean as $n$ tends to infinity. If the expectation of $T_{n}^{2}$ does not explode when $n$ goes to infinity, then this is the same as saying that the bias $\mathbb {E} [T_{n}]-\theta$ converges to zero (though it might be non-zero for finite sample sizes). Every consistent estimator is asymptotically unbiased.
asymptotic normality: The distribution of $T_{n}$ approaches a normal distribution as $n$ tends to infinity. That is to say, the rescaled estimator $\sigma _{n}(T_{n}-\mu _{n})$ , for some sequences $(\mu _{n})_{n\in \mathbb {N} }$ and $(\sigma _{n})_{n\in \mathbb {N} }$ , converges in distribution to a standard normal distribution.
asymptotic efficiency: Assuming that $\theta$ is the parameter of a parametric model and that $T_{n}$ is both asymptotically normal and asymptotically unbiased, the distribution of ${\sqrt {n}}(T_{n}-\theta )$ converges, as the sample size $n$ tends to infinity, to a distribution whose variance is the inverse of the Fisher information, that is, the smallest possible variance for a regular unbiased estimator based on one data point.

Estimation methods

Below are some commonly used methods for estimating unknown parameters. The methods vary in their domain of applicability and in the underlying statistical paradigm (frequentist or Bayesian).

Frequentist methods

Maximum likelihood estimation (MLE)

The method of maximum likelihood, due to R.A. Fisher, is arguably the most important general method for estimating the parameters $\theta _{1},\dots ,\theta _{p}$ of a parametric model.^[12] According to this method, the best estimate for the unknown model parameter is the one that maximizes the probability (or probability density) of observing the data that has been observed. This probability as a function of the model parameters is called the likelihood, giving the method its name.^[13]

In mathematical terms, the method works as follows: Let $X=(X_{1},\dots ,X_{n})$ denote a random data sample with joint probability density function or probability mass function $f_{\theta }$ depending on the vector $\theta =(\theta _{1},\dots ,\theta _{p})$ of model parameters. The function $\theta \mapsto f_{\theta }(X)$ is called the likelihood function, often denoted by $L$ . Assuming that the true model parameter lies in a set $\Theta \subset \mathbb {R} ^{p}$ , a maximum likelihood estimator ${\hat {\theta }}_{\mathrm {MLE} }$ fulfills

$L({\hat {\theta }}_{\mathrm {MLE} })=\max _{\theta \in \Theta }L(\theta )=\max _{\theta \in \Theta }f_{\theta }(X).$

In practice, the likelihood function is often differentiable, in which case ${\hat {\theta }}_{\mathrm {MLE} }$ is a solution of the equations^[14]

${\frac {\partial L(\theta )}{\partial \theta _{i}}}=0,\quad i=1,\dots ,p.$

Under certain assumptions on the likelihood, the MLE is strongly consistent, asymptotically efficient and asymptotically normal.
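A minimal worked example, assuming independent observations from an exponential distribution with unknown rate (a model chosen here purely for illustration): the sketch maximizes the log-likelihood, which has the same maximizer as $L$ , and the likelihood equation then has the closed-form solution $n/\sum _{i}x_{i}$ . A coarse grid search serves as a numerical cross-check.

```python
import math
import random

def log_likelihood(rate, xs):
    """Log-likelihood of an Exponential(rate) model for the observations xs."""
    return len(xs) * math.log(rate) - rate * sum(xs)

def mle_rate(xs):
    """Closed-form maximizer, from setting the derivative n / rate - sum(xs) to zero."""
    return len(xs) / sum(xs)

rng = random.Random(4)  # fixed seed so the run is reproducible
true_rate = 2.0
xs = [rng.expovariate(true_rate) for _ in range(5000)]
rate_hat = mle_rate(xs)

# Crude numerical check: no grid point should beat the closed-form solution.
grid = [0.5 + 0.01 * k for k in range(400)]
best_on_grid = max(grid, key=lambda r: log_likelihood(r, xs))
print(rate_hat, best_on_grid)
```

The grid maximizer should lie within one grid step of the closed-form estimate, and both should be close to the rate used to generate the data.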

Method of moments (MoM)

The method of moments is one of the oldest methods of estimation and was first introduced by Karl Pearson in 1895.^[15]^[16]^[17] This method uses the functional relationship between quantities of interest of a distribution (e.g. the variance or model parameters) and the moments of that distribution, leading to a set of equations that are solved for the aforementioned quantities. Motivated by the law of large numbers, the moments of the distribution are estimated by the sample moments, that is, the arithmetic averages of powers of the data values.^[18] The method stands out for its simplicity and wide domain of applicability, but it often does not yield the best estimators and sometimes yields no estimator at all.^[19] There also exist generalizations of the method of moments.

Mathematically speaking, for a distribution $\mathbb {P} _{\theta }$ depending on some vector of parameters $\theta =(\theta _{1},\dots ,\theta _{k})$ , the equations of the first $k$ moments of $X\sim \mathbb {P} _{\theta }$ are established:

${\begin{aligned}&m_{1}=\mathbb {E} [X]=f_{1}(\theta _{1},\dots ,\theta _{k}),\\&\dots \\&m_{k}=\mathbb {E} [X^{k}]=f_{k}(\theta _{1},\dots ,\theta _{k}).\end{aligned}}$

If these equations have a solution, then there exist functions $g_{1},\dots ,g_{k}$ such that the parameters can be expressed by the moments via

${\begin{aligned}&\theta _{1}=g_{1}(m_{1},\dots ,m_{k}),\\&\dots \\&\theta _{k}=g_{k}(m_{1},\dots ,m_{k}).\end{aligned}}$

Given a random sample $X_{1},\dots ,X_{n}$ from $\mathbb {P} _{\theta }$ , the moments are then replaced by the sample moments ${\hat {m}}_{j}={\frac {1}{n}}\sum _{i=1}^{n}X_{i}^{j}$ , $j=1,\dots ,k$ , yielding the moment estimators:

${\begin{aligned}&{\hat {\theta }}_{1}=g_{1}({\hat {m}}_{1},\dots ,{\hat {m}}_{k}),\\&\dots \\&{\hat {\theta }}_{k}=g_{k}({\hat {m}}_{1},\dots ,{\hat {m}}_{k}).\end{aligned}}$

Under relatively mild assumptions on the solution functions $g_{1},\dots ,g_{k}$ and the moments of $\mathbb {P} _{\theta }$ , the moment estimators are strongly consistent and asymptotically normal.
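The steps above can be sketched for a gamma distribution with shape $a$ and scale $s$ , an example chosen here for illustration. Its first two moments are $m_{1}=as$ and $m_{2}=a(a+1)s^{2}$ , so that $m_{2}-m_{1}^{2}=as^{2}$ , which gives $a=m_{1}^{2}/(m_{2}-m_{1}^{2})$ and $s=(m_{2}-m_{1}^{2})/m_{1}$ . The function name and the simulated data are assumptions of this sketch.

```python
import random

def gamma_method_of_moments(xs):
    """Moment estimators (shape, scale) for a gamma distribution.

    Uses m1 = shape * scale and m2 = shape * (shape + 1) * scale**2.
    """
    n = len(xs)
    m1 = sum(xs) / n                 # first sample moment
    m2 = sum(x * x for x in xs) / n  # second sample moment
    central = m2 - m1 ** 2           # estimates shape * scale**2
    return m1 ** 2 / central, central / m1

rng = random.Random(5)  # fixed seed so the run is reproducible
# random.gammavariate takes the shape first and the scale second.
xs = [rng.gammavariate(3.0, 2.0) for _ in range(20000)]
shape_hat, scale_hat = gamma_method_of_moments(xs)
print(shape_hat, scale_hat)
```

With a sample of this size, the two estimates should be close to the shape 3 and scale 2 used to generate the data.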

Lehmann–Scheffé theorem

The Lehmann–Scheffé theorem, although not an estimation method in itself, motivates two methods for constructing UMVU estimators in models where a complete sufficient statistic $T$ is available.^[20]^[21] This applies, for example, to exponential families, where the natural sufficient statistic is always complete.

Let $X=(X_{1},\dots ,X_{n})$ be a vector of random samples from some distribution $\mathbb {P} _{\theta }$ depending on a parameter $\theta$ .

Method 1: For the estimation of $\theta$ , there exists at most one function $g$ that solves the equation $\mathbb {E} _{\theta }[g(T(X))]=\theta ,$ in which case the estimator ${\hat {\theta }}=g(T)$ is a UMVU estimator for $\theta$ .

Method 2: Given any unbiased estimator ${\tilde {\theta }}$ of $\theta$ , the estimator ${\hat {\theta }}(X)=\mathbb {E} _{\theta }[{\tilde {\theta }}(X)\mid T(X)],$ that is, the conditional expectation of ${\tilde {\theta }}$ given $T$ , is the unique UMVU estimator. In some cases, computing the conditional expectation of a suitably chosen unbiased estimator given $T$ might be easier than solving the equation from method 1.
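Method 1 can be illustrated with a standard textbook case that is not part of the text above: for a sample from the uniform distribution on $(0,\theta )$ , the maximum $T=\max _{i}X_{i}$ is a complete sufficient statistic with $\mathbb {E} _{\theta }[T]=n\theta /(n+1)$ , so $g(T)=(n+1)T/n$ solves the equation. The sketch below compares this estimator by simulation with the unbiased moment estimator $2{\bar {X}}$ ; all names and settings are illustrative.

```python
import random

def umvu_estimate(xs):
    """(n + 1) / n times the sample maximum: unbiased function of the complete sufficient statistic."""
    return (len(xs) + 1) / len(xs) * max(xs)

def moment_estimate(xs):
    """Twice the sample mean: unbiased for theta, but not a function of the maximum."""
    return 2.0 * sum(xs) / len(xs)

def mean_and_variance(estimator, theta, n, reps, rng):
    """Monte Carlo mean and variance of an estimator for Uniform(0, theta) samples."""
    values = [estimator([rng.uniform(0.0, theta) for _ in range(n)]) for _ in range(reps)]
    mean = sum(values) / reps
    var = sum((v - mean) ** 2 for v in values) / (reps - 1)
    return mean, var

rng = random.Random(6)  # fixed seed so the run is reproducible
mean_umvu, var_umvu = mean_and_variance(umvu_estimate, 1.0, 10, 20000, rng)
mean_mom, var_mom = mean_and_variance(moment_estimate, 1.0, 10, 20000, rng)
print(mean_umvu, var_umvu, mean_mom, var_mom)
```

Both simulated means should be close to $\theta =1$ , while the variance of the first estimator (theoretically $\theta ^{2}/(n(n+2))$ ) should be clearly below that of the second (theoretically $\theta ^{2}/(3n)$ ).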

Score matching estimation

Score matching is a rather new method for parameter estimation that is less intuitive than MLE, but is more advantageous for complicated distributions with a large number of parameters.^[22]

The data is assumed to be drawn from a distribution with probability density $f(\mathbf {x} ,{\boldsymbol {\theta }})$ , $\mathbf {x} \in \mathbb {R} ^{d}$ , parameterized by ${\boldsymbol {\theta }}$ . The key quantity needed for score matching estimation is the score $\nabla _{\mathbf {x} }\log f(\mathbf {x} ,{\boldsymbol {\theta }})$ . The true parameter (vector) ${\boldsymbol {\theta }}$ is characterized by

${\begin{aligned}{\boldsymbol {\theta }}&=\mathrm {arg} \min _{{\boldsymbol {\theta }}'}{\frac {1}{2}}\int _{\mathbb {R} ^{d}}\lVert \nabla _{\mathbf {y} }\log f(\mathbf {y} ,{\boldsymbol {\theta }}')-\nabla _{\mathbf {y} }\log f(\mathbf {y} ,{\boldsymbol {\theta }})\rVert ^{2}f(\mathbf {y} ,{\boldsymbol {\theta }})\mathrm {d} \mathbf {y} \\&=\mathrm {arg} \min _{{\boldsymbol {\theta }}'}\sum _{i=1}^{d}\int _{\mathbb {R} ^{d}}\left(\partial _{y_{i}}^{2}\log f(\mathbf {y} ,{\boldsymbol {\theta }}')+{\frac {1}{2}}{\big (}\partial _{y_{i}}\log f(\mathbf {y} ,{\boldsymbol {\theta }}'){\big )}^{2}\right)f(\mathbf {y} ,{\boldsymbol {\theta }})\mathrm {d} \mathbf {y} .\end{aligned}}$

In line with the second equality, for observations $\mathbf {x} _{1},\dots ,\mathbf {x} _{n}$ , the score matching estimate ${\hat {\boldsymbol {\theta }}}$ is defined by

${\hat {\boldsymbol {\theta }}}=\mathrm {arg} \min _{{\boldsymbol {\theta }}'}\sum _{j=1}^{n}\sum _{i=1}^{d}\left(\partial _{x_{i}}^{2}\log f(\mathbf {x} _{j},{\boldsymbol {\theta }}')+{\frac {1}{2}}{\big (}\partial _{x_{i}}\log f(\mathbf {x} _{j},{\boldsymbol {\theta }}'){\big )}^{2}\right).$

The main advantage of score matching estimation over maximum likelihood estimation is the following. Typically, the probability density is of the form

$f(\mathbf {x} ,{\boldsymbol {\theta }})={\frac {1}{Z({\boldsymbol {\theta }})}}q(\mathbf {x} ,{\boldsymbol {\theta }}),$

with an unnormalized density $q(\mathbf {x} ,{\boldsymbol {\theta }})$ and a normalization constant $Z({\boldsymbol {\theta }})$ . The computation of the latter is often the bottleneck of maximum likelihood estimation, especially if the dependence on ${\boldsymbol {\theta }}$ is non-analytic and the parameter vector is high-dimensional. Because the score is a derivative of $\log f$ with respect to $\mathbf {x}$ , the term $Z({\boldsymbol {\theta }})$ , which does not depend on $\mathbf {x}$ , drops out of the score matching estimation procedure, thus eliminating this issue.^[22]

Additionally, under mild restrictions on $f(\mathbf {x} ,{\boldsymbol {\theta }})$ , the score matching estimate can be shown to be weakly consistent.
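For a one-dimensional normal density with mean $\mu$ and variance $v$ , the empirical objective can be written out by hand, since $\partial _{x}\log f=-(x-\mu )/v$ and $\partial _{x}^{2}\log f=-1/v$ . The sketch below uses this model only because the minimizer is then available in closed form (the sample mean and the divide-by-$n$ sample variance, which coincide with the MLE here) and can be checked; it is an illustration, not the procedure of the cited reference.

```python
import random

def score_matching_objective(mu, var, xs):
    """Empirical score matching objective for a 1-d normal density.

    Uses d/dx log f = -(x - mu) / var and d^2/dx^2 log f = -1 / var.
    """
    return sum(-1.0 / var + 0.5 * ((x - mu) / var) ** 2 for x in xs)

def score_matching_normal(xs):
    """Closed-form minimizer of the objective above."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var

rng = random.Random(7)  # fixed seed so the run is reproducible
xs = [rng.gauss(1.5, 2.0) for _ in range(5000)]
mu_hat, var_hat = score_matching_normal(xs)
best = score_matching_objective(mu_hat, var_hat, xs)

# Perturbing either parameter should not decrease the objective.
perturbed = [
    score_matching_objective(mu_hat + dm, var_hat * s, xs)
    for dm in (-0.2, 0.0, 0.2)
    for s in (0.8, 1.0, 1.25)
]
print(mu_hat, var_hat, best <= min(perturbed))
```

Note that the objective never evaluates the normalizing constant of the density, which is the point made in the text.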

Least squares estimation (LSE)

The method of least squares is a parameter estimation procedure for regression, where observations $Y_{1},\dots ,Y_{n}$ need to be expressed as functions of the covariates $X_{1},\dots ,X_{n}\in \mathbb {R} ^{p}$ via

$Y_{i}=f_{\beta }(X_{i})+\varepsilon _{i}$

for some function $f_{\beta }$ depending on a parameter vector $\beta \in \mathbb {R} ^{p+1}$ and a random variable $\varepsilon _{i}$ accounting for random noise in the data. The ordinary least squares estimator (OLS) ${\hat {\beta }}_{\mathrm {OLS} }$ assumes that $f_{\beta }$ is a linear function of $\beta$ and estimates the parameter by minimizing the sum of squared residuals, that is, of the squared differences between the observations and the fitted values:

${\hat {\beta }}_{\mathrm {OLS} }\in \mathrm {arg} \min _{\beta \in \mathbb {R} ^{p+1}}\sum _{i=1}^{n}\left(Y_{i}-\beta _{0}-X_{i,1}\beta _{1}-\dots -X_{i,p}\beta _{p}\right)^{2}.$

In fact, if the noise variables $\varepsilon _{i}$ are independent and identically normally distributed with mean zero, then the OLS coincides with the maximum likelihood estimator.

If the design matrix $X\in \mathbb {R} ^{n\times (p+1)}$ , whose $i$ -th row is $(1,X_{i,1},\dots ,X_{i,p})$ , has rank $p+1$ , then the OLS is uniquely given by

${\hat {\beta }}_{\mathrm {OLS} }=(X^{\top }X)^{-1}X^{\top }Y,$

and, provided the noise variables are uncorrelated with mean zero and a common finite variance, it is the best linear unbiased estimator (BLUE) by the Gauss–Markov theorem. If, as above, the noise variables are independent and identically normally distributed, then the OLS is even UMVU for the entire class of unbiased estimators.^[23]
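For a single covariate ($p=1$ ) the minimization can be carried out by hand, giving the slope $\sum _{i}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})/\sum _{i}(x_{i}-{\bar {x}})^{2}$ and the intercept ${\bar {y}}-{\hat {\beta }}_{1}{\bar {x}}$ . The following sketch applies these formulas to simulated data; the data-generating values and function names are assumptions made for the illustration.

```python
import random

def ols_simple(xs, ys):
    """OLS intercept and slope for the model y = b0 + b1 * x + noise."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def residual_sum_of_squares(b0, b1, xs, ys):
    """Sum of squared residuals for a candidate intercept and slope."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

rng = random.Random(8)  # fixed seed so the run is reproducible
xs = [rng.uniform(0.0, 10.0) for _ in range(500)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
b0_hat, b1_hat = ols_simple(xs, ys)
rss_hat = residual_sum_of_squares(b0_hat, b1_hat, xs, ys)
print(b0_hat, b1_hat, rss_hat)
```

Any other choice of intercept or slope should give a larger residual sum of squares than the fitted pair.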

There are multiple variants of least squares estimation:

weighted least squares (WLS) estimation: The residuals are weighted with different weighting factors $w_{i}$ .
generalized least squares (GLS) estimation: The noise variables are allowed to be correlated.
nonlinear least squares estimation: The function $f_{\beta }$ can depend non-linearly on the parameter $\beta$ .
regularized least squares estimation: The parameter is chosen such that the sum of squared residuals plus a regularization term is minimized. Typical choices for the regularization term are the squared Euclidean norm $\lVert \beta \rVert _{2}^{2}$ , leading to Ridge regression, or the 1-norm $\lVert \beta \rVert _{1}$ , leading to Lasso regression.

Bayesian methods

Bayesian estimation methods take into account the statistician's prior belief on the distribution of the parameters, which is modeled as a distribution with density $\pi$ on the parameter space $\Theta$ , called the prior distribution. Given observations $X_{1},\dots ,X_{n}$ , which, for fixed $\theta \in \Theta$ , are assumed to be distributed according to a distribution with density $p_{\theta }$ , the prior distribution is updated according to Bayes' rule:

$p(\theta \mid X_{1},\dots ,X_{n})={\frac {p_{\theta }(X_{1},\dots ,X_{n})\pi (\theta )}{\int _{\Theta }p_{\theta }(X_{1},\dots ,X_{n})\pi (\theta )\mathrm {d} \theta }}.$

The updated distribution of the parameters given the data is called the posterior distribution. Many Bayesian point estimators are the posterior distribution's statistics of central tendency, e.g., its mean, median, or mode.

Bayes estimator

Given a vector of random observations $X=(X_{1},\dots ,X_{n})$ and a loss function $L$ , a Bayes estimator is any estimator ${\hat {\theta }}(X)$ that minimizes the Bayes risk:^[24]

$R_{\pi }({\hat {\theta }})=\int _{\Theta }\mathbb {E} [L(\theta ,{\hat {\theta }}(X))]\pi (\theta )\mathrm {d} \theta =\mathbb {E} \left[\int _{\Theta }L{\big (}\theta ,{\hat {\theta }}(X){\big )}p(\theta \mid X)\mathrm {d} \theta \right].$

The risk is minimized when the integral under the expectation is minimized for each value of the observations. Thus, a Bayes estimator is equivalently given as a minimizer of the posterior loss:

${\hat {\theta }}(X)\in \mathrm {arg} \min _{y\in \Theta }\int _{\Theta }L{\big (}\theta ,y{\big )}p(\theta \mid X)\mathrm {d} \theta .$

The most common choices for $L$ are the following:

$L(\theta ,y)=(\theta -y)^{2}$ : In this case, the posterior loss is minimized by the posterior mean ${\hat {\theta }}(X)=\int _{\Theta }\theta \,p(\theta \mid X)\mathrm {d} \theta$ , as observed by Gauss.^[25]

$L(\theta ,y)=|\theta -y|$ : Here, the posterior loss is minimized by the median of the posterior distribution, as observed by Laplace.^[25]

By Wald's theorem, unique Bayesian estimators are admissible.^[26]^[27] Moreover, under certain assumptions on the prior and $p_{\theta }$ , the posterior mean estimator can be shown to be consistent, asymptotically normal and asymptotically efficient.^[28]
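The link between the loss function and the resulting estimator can be checked numerically. The sketch below makes several assumptions for the sake of illustration: a Bernoulli success probability with a uniform prior, 7 successes and 3 failures, and a posterior approximated on an evenly spaced grid. It computes the posterior mean and median and compares them with the grid points that minimize the posterior loss for the two loss functions listed above.

```python
def grid_posterior(successes, failures, grid_size=401):
    """Posterior over a grid on [0, 1] for a Bernoulli parameter with a uniform prior."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    weights = [t ** successes * (1.0 - t) ** failures for t in thetas]
    total = sum(weights)
    return thetas, [w / total for w in weights]

def posterior_loss(y, thetas, probs, loss):
    """Posterior expected loss of reporting the value y."""
    return sum(loss(t, y) * p for t, p in zip(thetas, probs))

def posterior_mean(thetas, probs):
    return sum(t * p for t, p in zip(thetas, probs))

def posterior_median(thetas, probs):
    """Smallest grid point at which the cumulative posterior mass reaches 1/2."""
    cumulative = 0.0
    for t, p in zip(thetas, probs):
        cumulative += p
        if cumulative >= 0.5:
            return t
    return thetas[-1]

thetas, probs = grid_posterior(successes=7, failures=3)
mean_hat = posterior_mean(thetas, probs)
median_hat = posterior_median(thetas, probs)

# Brute-force minimizers of the posterior loss over the grid.
best_sq = min(thetas, key=lambda y: posterior_loss(y, thetas, probs, lambda t, v: (t - v) ** 2))
best_abs = min(thetas, key=lambda y: posterior_loss(y, thetas, probs, lambda t, v: abs(t - v)))
print(mean_hat, best_sq, median_hat, best_abs)
```

Up to the grid resolution, the minimizer under squared loss should match the posterior mean and the minimizer under absolute loss should match the posterior median. The exact posterior in this setting is a Beta(8, 4) distribution with mean 2/3.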

Maximum a posteriori (MAP)

The maximum a posteriori estimator ${\hat {\theta }}_{\mathrm {MAP} }$ is a maximizer of the posterior distribution:

${\hat {\theta }}_{\mathrm {MAP} }(X)\in \mathrm {arg} \max _{\theta \in \Theta }p(\theta \mid X).$

In a sense, the MAP estimator is a Bayesian analog to the maximum likelihood estimator. In fact, for a uniform prior distribution, the MAP estimator coincides with the MLE.
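The relationship between MAP and MLE can be seen in a conjugate example chosen here for illustration: for a binomial success probability with a Beta($a$, $b$) prior, the posterior is Beta($a+s$, $b+n-s$) after $s$ successes in $n$ trials, and its mode is $(a+s-1)/(a+b+n-2)$ . The function names and the numbers used are assumptions of this sketch.

```python
def beta_binomial_map(successes, trials, a=1.0, b=1.0):
    """MAP estimate of a success probability under a Beta(a, b) prior.

    The posterior is Beta(a + successes, b + trials - successes); its interior
    mode is used, which requires both posterior parameters to exceed 1.
    """
    post_a = a + successes
    post_b = b + trials - successes
    if post_a <= 1.0 or post_b <= 1.0:
        raise ValueError("posterior mode is not in the interior of (0, 1)")
    return (post_a - 1.0) / (post_a + post_b - 2.0)

def binomial_mle(successes, trials):
    """Maximum likelihood estimate of the success probability."""
    return successes / trials

map_uniform = beta_binomial_map(7, 10)                    # Beta(1, 1) is the uniform prior
map_informative = beta_binomial_map(7, 10, a=5.0, b=5.0)  # prior concentrated around 0.5
mle = binomial_mle(7, 10)
print(map_uniform, map_informative, mle)
```

With 7 successes in 10 trials, the uniform prior gives 0.7, matching the MLE, while the Beta(5, 5) prior pulls the estimate toward 0.5, giving 11/18.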

The MAP estimator has good asymptotic properties, even for many difficult problems, on which the maximum-likelihood estimator has difficulties. For regular problems, where the maximum-likelihood estimator is consistent, the maximum-likelihood estimator ultimately agrees with the MAP estimator.^[29]^[26]^[30]

Others

The Minimum Message Length (MML) point estimator is based in Bayesian information theory and is not so directly related to the posterior distribution.

Special cases of Bayesian filters are important:

Several methods of computational statistics have close connections with Bayesian analysis:

particle filter
Markov chain Monte Carlo (MCMC)

Point estimate vs. confidence interval estimate

There are two major types of estimates: point estimates and confidence interval estimates. A point estimate selects a single point in the parameter space that can reasonably be considered the true value of the parameter. A confidence interval estimate instead constructs a family of sets that contain the true (unknown) parameter value with a specified probability. In many problems of statistical inference, the aim is not only to estimate the parameter or to test a hypothesis about it, but also to obtain a lower bound, an upper bound, or both for a real-valued parameter. This requires constructing a confidence interval.

A confidence interval describes how reliable an estimate is. Its upper and lower confidence limits are calculated from the observed data. Suppose a dataset $x_{1},\dots ,x_{n}$ is given, modeled as a realization of random variables $X_{1},\dots ,X_{n}$ . Let $\theta$ be the parameter of interest and $\gamma$ a number between 0 and 1. If there exist sample statistics $L_{n}=g(X_{1},\dots ,X_{n})$ and $U_{n}=h(X_{1},\dots ,X_{n})$ such that $P(L_{n}<\theta <U_{n})=\gamma$ for every value of $\theta$ , then $(l_{n},u_{n})$ , where $l_{n}=g(x_{1},\dots ,x_{n})$ and $u_{n}=h(x_{1},\dots ,x_{n})$ , is called a $100\gamma \%$ confidence interval for $\theta$ . The number $\gamma$ is called the confidence level.^[31] In the common case of a normally distributed sample mean ${\bar {X}}$ and a known standard deviation $\sigma$ , a $100(1-\alpha )\%$ confidence interval for the true mean $\mu$ is given by ${\bar {X}}\pm e$ with $e=z_{1-\alpha /2}\,\sigma /{\sqrt {n}}$ , where $z_{1-\alpha /2}$ is the $(1-\alpha /2)$ -quantile of the standard normal distribution and $n$ is the number of data values. For example, $z_{1-\alpha /2}\approx 1.96$ for 95% confidence.^[32]

Here two limits, $l_{n}$ and $u_{n}$ , are computed from the set of observations, and it is claimed with a certain degree of confidence (measured in probabilistic terms) that the true value of $\theta$ lies between them. The resulting interval $(l_{n},u_{n})$ is expected to include the true value of $\theta$ , and this type of estimation is therefore called confidence interval estimation.^[33] It provides a range of values within which the parameter is expected to lie. Because an interval also conveys the uncertainty of the estimate, it generally gives more information than a point estimate and is often preferred when making inferences. In this sense, point estimation and interval estimation can be regarded as contrasting approaches.
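The known-variance interval for a normal mean can be sketched in a few lines. The code below is an illustration under assumed settings (the values of the mean, standard deviation, sample size and number of repetitions are arbitrary): it computes the interval from the sample mean, a known $\sigma$ and the standard normal quantile, and then checks by simulation how often the interval covers the true mean.

```python
import math
import random
from statistics import NormalDist

def z_confidence_interval(xs, sigma, alpha=0.05):
    """100(1 - alpha)% interval for the mean of normal data when sigma is known."""
    n = len(xs)
    x_bar = sum(xs) / n
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # standard normal quantile
    e = z * sigma / math.sqrt(n)                 # half-width of the interval
    return x_bar - e, x_bar + e

def coverage(mu, sigma, n, reps, rng, alpha=0.05):
    """Fraction of simulated samples whose interval contains the true mean."""
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = z_confidence_interval(xs, sigma, alpha)
        if lower < mu < upper:
            hits += 1
    return hits / reps

rng = random.Random(9)  # fixed seed so the run is reproducible
observed_coverage = coverage(mu=10.0, sigma=2.0, n=30, reps=5000, rng=rng)
print(observed_coverage)
```

The observed coverage should be close to the nominal confidence level of 0.95, which is the frequentist meaning of the probability statement in the definition above.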

Notes

[1] Dekking et al. (2005), p. 305
[2] Dekking et al. (2005), p. 290
[3] Casella & Berger (2002), p. 58
[4] Lehmann, Erich L. (1951). "A General Concept of Unbiasedness". Annals of Mathematical Statistics. 22 (4): 587–592.
[5] Brown, George W. (1947). "On Small-Sample Estimation". Annals of Mathematical Statistics. 18 (4): 582–585.
[6] Lehmann & Casella (1998), pp. 120–121
[7] Casella & Berger (2002), p. 341
[8] Dekking et al. (2005), p. 303
[9] Dekking et al. (2005), p. 305
[10] Lehmann & Casella (1998), p. 85
[11] Sahu et al. (2015), p. 3
[12] Sahu et al. (2015), p. 48
[13] Dodge (2008), p. 334
[14] Sahu et al. (2015), pp. 48–49
[15] Pearson, Karl (1895). "Contributions to the Mathematical Theory of Evolution. II. Skew Variation in Homogeneous Material". Philosophical Transactions of the Royal Society of London A. 186: 343–414.
[16] Sahu et al. (2015), p. 47
[17] Dodge (2008), p. 349
[18] Dodge (2008), pp. 348–349
[19] Shao (2003), p. 207
[20] Lehmann & Casella (1998), pp. 88–89
[21] Shao (2003), p. 162
[22] Hyvärinen, Aapo (2005). "Estimation of Non-Normalized Statistical Models by Score Matching". Journal of Machine Learning Research. 6: 695–709.
[23] Sahu et al. (2015), pp. 56–57
[24] Lehmann & Casella (1998), pp. 5, 225
[25] Dodge, Yadolah, ed. (1987). Statistical data analysis based on the L1-norm and related methods: Papers from the First International Conference held at Neuchâtel, August 31–September 4, 1987. North-Holland Publishing.
[26] Le Cam, Lucien (1986). Asymptotic Methods in Statistical Decision Theory. Springer-Verlag. ISBN 0-387-96307-3.
[27] Lehmann & Casella (1998), p. 323
[28] Lehmann & Casella (1998), pp. 488–490
[29] Ferguson, Thomas S. (1996). A Course in Large Sample Theory. Chapman & Hall. ISBN 0-412-04371-8.
[30] Ferguson, Thomas S. (1982). "An inconsistent maximum likelihood estimate". Journal of the American Statistical Association. 77 (380): 831–834. doi:10.1080/01621459.1982.10477894. JSTOR 2287314.
[31] Dekking et al. (2005), p. 343
[32] Berger, Paul D.; Maurer, Robert E.; Celli, Giovana B. (2019). Experimental Design: With Applications in Management, Engineering, and the Sciences. Springer.
[33] Sahu et al. (2015), p. 131

References

Bickel, Peter J.; Doksum, Kjell A. (2015). Mathematical Statistics: Basic and Selected Topics. Vol. I (2nd ed.). Taylor & Francis. ISBN 978-1-4987-2381-7.
Casella, George; Berger, Roger L. (2002). Statistical Inference (2nd ed.). Duxbury.
Dekking, F. M.; Kraaikamp, C.; Lopuhaä, H. P.; Meester, L. E. (2005). A Modern Introduction to Probability and Statistics. Springer. ISBN 978-1-85233-896-1.
Dodge, Yadolah (2008). The Concise Encyclopedia of Statistics. Springer. ISBN 978-0-387-32833-1.
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. ISBN 978-0-521-59271-0.
Lehmann, Erich L.; Casella, George (1998). Theory of Point Estimation (2nd ed.). New York: Springer. ISBN 0-387-98502-6.
Liese, Friedrich; Miescke, Klaus-J. (2008). Statistical Decision Theory: Estimation, Testing, and Selection. Springer.
Sahu, Pradip Kumar; Pal, Santi Ranjan; Das, Ajit Kumar (2015). Estimation and Inferential Statistics. Springer. ISBN 978-81-322-2514-0.
Shao, Jun (2003). Mathematical Statistics (2nd ed.). Springer. ISBN 978-0-387-95382-3.

