Difference between revisions of "Order Statistics"

From ShotStat
Jump to: navigation, search
(tl;dr)
m (Clarification)
(6 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
==tl;dr==
 
==tl;dr==
  
A typical hunter is interested in the 90% CEP, which we'll call here <math>\hat{R_{90}}</math>: the radius around point of aim expected to contain 90% of shots.
+
A typical hunter is interested in the 90% CEP, which we'll call here <math>R_{90}</math>: the radius around point of aim expected to contain 90% of shots.
  
Using order statistics We will derive "field estimates" that are easy to measure and compute.
+
Using order statistics, we will derive "field estimates" that are easy to measure and compute. ([http://lstange.github.io/mcgs/ Monte Carlo analysis of robustness and comparison with other statistics available here.])
  
Let <math>R_{m:n}</math> be the ''m<sup>th</sup>'' largest radius (i.e., ''m<sup>th</sup>'' "worst") of ''n'' sample shots on a target.
+
Let <math>R_{m:n}</math> be the ''m<sup>th</sup>'' smallest radius (i.e., ''(n-m+1)<sup>th</sup>'' "worst") of ''n'' sample shots on a target.
  
 
* If you can only take one shot then <math>\hat{R_{90}} = 3 R_{1:1}</math>
 
* If you can only take one shot then <math>\hat{R_{90}} = 3 R_{1:1}</math>
 
* If you can only take three shots then <math>\hat{R_{90}} = \sqrt{2} R_{3:3}</math>
 
* If you can only take three shots then <math>\hat{R_{90}} = \sqrt{2} R_{3:3}</math>
* If you take five shots and want to throw away the worst, <math>\hat{R_{90}} \approx 1.6 R_{4:5}</math>.
+
* If you take five shots and throw away the worst, <math>\hat{R_{90}} \approx 1.6 R_{4:5}</math>
 
* With ten shots, <math>\hat{R_{90}} \approx 0.7 (R_{6:10} + R_{9:10})</math>
 
* With ten shots, <math>\hat{R_{90}} \approx 0.7 (R_{6:10} + R_{9:10})</math>
  
Line 58: Line 58:
 
:<math>1-F(kx) = e^{-\frac{k^2 x^2}{2}}</math>
 
:<math>1-F(kx) = e^{-\frac{k^2 x^2}{2}}</math>
  
Integrating over all values of <math>x</math> gives total probability of the next miss radius exceeding <math>R_{90}</math>, which should be equal to 0.1:
+
Integrating over all values of <math>x</math> gives total probability of the next miss radius exceeding <math>R_{90}</math>, which should be equal to 10%:
  
:<math>\int_{0}^{\infty}f(x)\left(1-F(kx)\right) dx = \int_{0}^{\infty}x e^{-\frac{x^2}{2}} e^{-\frac{k^2 x^2}{2}} dx = 0.1</math>
+
:<math>\int_{0}^{\infty}f(x)\left(1-F(kx)\right) dx = \int_{0}^{\infty}x e^{-\frac{x^2}{2}} e^{-\frac{k^2 x^2}{2}} dx = (1 - 90\%)</math>
  
 
This equation is easy to solve on paper, but it will soon get more interesitng so why not warm [http://maxima.sourceforge.net Maxima] up now to find <math>k</math>.
 
This equation is easy to solve on paper, but it will soon get more interesitng so why not warm [http://maxima.sourceforge.net Maxima] up now to find <math>k</math>.
Line 90: Line 90:
 
Integrating over all values of <math>x</math>
 
Integrating over all values of <math>x</math>
  
:<math>\int_{0}^{\infty}f_{3:3}(x) (1-F(kx)) dx = \int_{0}^{\infty}3\left(1-e^{-\frac{x^2}{2}}\right)^{2} x e^{-\frac{x^2}{2}} e^{-\frac{k^2 x^2}{2}} dx = 0.1</math>
+
:<math>\int_{0}^{\infty}f_{3:3}(x) (1-F(kx)) dx = \int_{0}^{\infty}3\left(1-e^{-\frac{x^2}{2}}\right)^{2} x e^{-\frac{x^2}{2}} e^{-\frac{k^2 x^2}{2}} dx = (1 - 90\%)</math>
  
 
Solving for <math>k</math>, we get
 
Solving for <math>k</math>, we get
Line 115: Line 115:
 
:<math>m \approx 0.79681(n+1)-0.39841+\frac{1.16312}{n+1}</math>
 
:<math>m \approx 0.79681(n+1)-0.39841+\frac{1.16312}{n+1}</math>
  
For <math>n=5</math> optimum estimator is <math>m=5</math>, but <math>m=4</math> is not far behind and is less sensitive to outliers.
+
For <math>n=5</math> optimum estimator is <math>m=5</math>, but <math>m=4</math> is not far behind (relative efficiency 96.3%) and is less sensitive to outliers. There is a sharp drop after that, relative efficiency of <math>m=3</math> (median miss radius) is 76.3%.
  
 
Probability density of <math>m</math>th miss radius in a group of <math>n</math> shots is
 
Probability density of <math>m</math>th miss radius in a group of <math>n</math> shots is
Line 127: Line 127:
 
Integrating over all values of <math>x</math>
 
Integrating over all values of <math>x</math>
  
:<math>\int_{0}^{\infty}f_{4:5}(x) (1-F(kx)) dx = \int_{0}^{\infty} 20\left(1-e^{-\frac{x^2}{2}}\right)^{3} e^{-\frac{x^2}{2}} x e^{-\frac{x^2}{2}} x e^{-\frac{k^2 x^2}{2}} dx = 0.1</math>
+
:<math>\int_{0}^{\infty}f_{4:5}(x) (1-F(kx)) dx = \int_{0}^{\infty} 20\left(1-e^{-\frac{x^2}{2}}\right)^{3} e^{-\frac{x^2}{2}} x e^{-\frac{x^2}{2}} x e^{-\frac{k^2 x^2}{2}} dx = (1 - 90\%)</math>
  
 
Solving for <math>k</math>
 
Solving for <math>k</math>
Line 134: Line 134:
 
  find_root(integrate(f45(u)*exp(-k^2*u^2/2),u,0,inf)=0.1, k, 1, 2);
 
  find_root(integrate(f45(u)*exp(-k^2*u^2/2),u,0,inf)=0.1, k, 1, 2);
 
  1.578643529936508
 
  1.578643529936508
 +
 +
Relative efficiency of <math>R_{4:5}</math> compared to <math>R_{5:5}</math>:
 +
 +
f45(x):=20*(1-exp(-x^2/2))^3*x*exp(-x^2);
 +
mean45:integrate(x*f45(x),x,0,inf);
 +
f55(x):=5*(1-exp(-x^2/2))^4*x*exp(-x^2/2);
 +
mean55:integrate(x*f55(x),x,0,inf);
 +
float((integrate(x^2*f55(x),x,0,inf)-mean55^2)*mean45^2/((integrate(x^2*f45(x),x,0,inf)-mean45^2)*mean55^2));
 +
0.9627613473444434
  
 
==Ten Shots==
 
==Ten Shots==
Line 149: Line 158:
 
===Details===
 
===Details===
  
According to M. M. Siddiqui 1964 paper in [[References]], the two optimal order statistics are <math>0.639(n+1)</math> and <math>0.927(n+1)</math>. For <math>n=10</math> it’s 7 and 10, but 6 and 9 work [http://lstange.github.io/mcgs nearly as well]. Not using the worst miss radius means that R<sub>6:10</sub> + R<sub>9:10</sub> is a more robust estimator.
+
According to M. M. Siddiqui 1964 paper in [[References]], the two optimal order statistics are <math>0.639(n+1)</math> and <math>0.927(n+1)</math>. For <math>n=10</math> it’s 7 and 10, but 6 and 9 work nearly as well (relative efficiency 98.7%, details below). Not using the worst miss radius means that R<sub>6:10</sub> + R<sub>9:10</sub> is a more robust estimator.
  
 
Joint probability distribution of 6<sup>th</sup> miss radius <math>x=R_{6:10}</math> and 9<sup>th</sup> miss radius <math>y=R_{9:10}</math> is
 
Joint probability distribution of 6<sup>th</sup> miss radius <math>x=R_{6:10}</math> and 9<sup>th</sup> miss radius <math>y=R_{9:10}</math> is
Line 162: Line 171:
 
Integrating over all values of <math>u</math>
 
Integrating over all values of <math>u</math>
  
:<math>\int_{0}^{\infty}p(u)e^{-\frac{k^2 u^2}{2}}du = 0.1</math>
+
:<math>\int_{0}^{\infty}p(u)e^{-\frac{k^2 u^2}{2}}du = (1 - 90\%)</math>
  
 
This seems to be too hard even for Maxima to solve analytically, so we’ll have to resort to numeric integration using [https://en.wikipedia.org/wiki/Romberg%27s_method Romberg’s method]. Upper limit of 10 instead of <math>\infty</math> in second integration prevents underflow in the tail.
 
This seems to be too hard even for Maxima to solve analytically, so we’ll have to resort to numeric integration using [https://en.wikipedia.org/wiki/Romberg%27s_method Romberg’s method]. Upper limit of 10 instead of <math>\infty</math> in second integration prevents underflow in the tail.
Line 170: Line 179:
 
  find_root(romberg(p(u)*exp(-k^2*u^2/2),u,0,10)=0.1,k,0.6,0.8);
 
  find_root(romberg(p(u)*exp(-k^2*u^2/2),u,0,10)=0.1,k,0.6,0.8);
 
  0.70766872521906
 
  0.70766872521906
 +
 +
Relative efficiency  of <math>R_{6:10}+R_{9:10}</math> compared to <math>R_{7:10}+R_{10:10}</math>:
 +
 +
f69(x,y):=15120*(1-exp(-x^2/2))^5*(exp(-x^2/2)-exp(-y^2/2))^2*exp(-y^2/2)*x*exp(-x^2/2)*y*exp(-y^2/2);
 +
p69(u):=romberg(f69((u-v)/2,(u+v)/2)/2,v,0,u);
 +
mean69:romberg(x*p69(x),x,0,10);
 +
var69:romberg(t^2*p69(t),t,0,10)-mean69^2;
 +
f710(x,y):=2520*(1-exp(-x^2/2))^6*(exp(-x^2/2)-exp(-y^2/2))^2*x*exp(-x^2/2)*y*exp(-y^2/2);
 +
p710(u):=romberg(f710((u-v)/2,(u+v)/2)/2,v,0,u);
 +
mean710:romberg(x*p710(x),x,0,10);
 +
var710:romberg(t^2*p710(t),t,0,7)-mean710^2;
 +
(var710*mean69^2)/(var69*mean710^2);
 +
0.9873309275507542
  
 
==Multiple Groups==
 
==Multiple Groups==

Revision as of 09:31, 8 June 2016

tl;dr

A typical hunter is interested in the 90% CEP, which we'll call here \(R_{90}\): the radius around point of aim expected to contain 90% of shots.

Using order statistics, we will derive "field estimates" that are easy to measure and compute. (Monte Carlo analysis of robustness and comparison with other statistics available here.)

Let \(R_{m:n}\) be the mth smallest radius (i.e., (n-m+1)th "worst") of n sample shots on a target.

  • If you can only take one shot then \(\hat{R_{90}} = 3 R_{1:1}\)
  • If you can only take three shots then \(\hat{R_{90}} = \sqrt{2} R_{3:3}\)
  • If you take five shots and throw away the worst, \(\hat{R_{90}} \approx 1.6 R_{4:5}\)
  • With ten shots, \(\hat{R_{90}} \approx 0.7 (R_{6:10} + R_{9:10})\)

Motivational Example

You borrow a rifle for the hunt. The owner says it is zeroed and you trust him. But you don’t know how well you can shoot this rifle, or how your ammo performs in it. You want to determine how big a target you can hit, or equivalently how far you can hit a target of a given size. For simplicity assume spherical horse in vacuum that the target is round and the distance is small enough that you don’t have to worry about holdovers and can ignore wind.

By the way, this is what professional hunters do when they bring their client to the range the day before the hunt to “zero the rifle”. They know the rifle is zeroed. They want to know how close they need to get this particular client to the game animal for the hunt to be successful.

Metric Choice

One metric particularly useful in practice is \(R_{90}\): the radius of a circle that is expected to contain 90% of all impacts.

This choice is a compromise. \(R_{100}\) is much easier to estimate, but knowing the trivial fact that \(R_{100}=\infty\) does not help much. Given available number of shots, there would not be enough tail events to determine \(R_{99}\) or even \(R_{95}\) with sufficient accuracy. From the other side, \(R_{50}\) aka CEP might be reasonable for military applications, but is too low for civilian use.

In theory it’s easy to convert between different metrics, but in practice it’s better to have a way to get the metric you want directly with the least amount of math to be done in the field.

One nice side effect of choosing \(R_{90}\) is that it allows to sidestep the tricky Fliers_vs._Outliers question. We simply decide to not care about the few worst shots, regardless of the reason. And if fliers are more frequent, they probably aren’t really fliers.

So you go to the range, set up some paper targets at typical engagement range, shoot at them, and look at the results.

One Shot

Summary

Even one shot is better than nothing. If miss radius (the distance between target center and impact) is \(R_{1:1}\), then estimated \(R_{90}\) is

\[\hat{R_{90}} = 3 R_{1:1}\]

In examples below red circle has radius \(R_{1:1}\) and blue circle has radius \(\hat{R_{90}}\).

OrderStatisticsFig1.png

All impact coordinates were pulled from the same standard bivariate normal distribution, but the circles look very different in size, which means that the variance of the estimate is quite high.

Details

Here’s the math behind the magic number 3. Assume miss radiuses of individual shots follow Rayleigh distribution with \(\sigma = 1\). Its probability density function is

\[f(x)=x e^{-\frac{x^2}{2}}\]

and cumulative distribution function is

\[F(x)=1-e^{-\frac{x^2}{2}}\]

\(x\) here is miss radius rather than horizontal coordinate. \(R_{90}\) and \(x\) are measured in the same units and there are no other factors involved, so \(R_{90}\) should be proportional to \(x\) with some yet unknown dimensionless coefficient \(k\). Probability that miss radius of the next shot is greater than \(kx\) (complementary cumulative distribution function) is

\[1-F(kx) = e^{-\frac{k^2 x^2}{2}}\]

Integrating over all values of \(x\) gives total probability of the next miss radius exceeding \(R_{90}\), which should be equal to 10%:

\[\int_{0}^{\infty}f(x)\left(1-F(kx)\right) dx = \int_{0}^{\infty}x e^{-\frac{x^2}{2}} e^{-\frac{k^2 x^2}{2}} dx = (1 - 90\%)\]

This equation is easy to solve on paper, but it will soon get more interesitng so why not warm Maxima up now to find \(k\).

 find_root(integrate(x*exp(-x^2/2)*exp(-k^2*x^2/2), x, 0, inf)=0.1, k, 1, 10);
 3.0

Three Shots

Summary

If worst miss radius in the three-shot group is \(R_{3:3}\), then \(\hat{R_{90}} = \sqrt{2} R_{3:3} \approx 1.4 R_{3:3}\)

In examples below red circle has radius \(R_{3:3}\) and blue circle has radius \(\hat{R_{90}}\).

OrderStatisticsFig3.png

Details

Here’s where \(\sqrt{2}\) comes from. Probability density of \(m\)th miss radius in a group of \(n\) shots is

\[f_{m:n}(x)=\frac{n!}{(m-1)!(n-m)!}\left(F(x)\right)^{m-1}\left(1-F(x)\right)^{n-m}f(x)\]

For \(m=3\) and \(n=3\)

\[f_{3:3}(x)=\frac{3!}{(3-1)!(3-3)!}\left(F(x)\right)^{3-1}\left(1-F(x)\right)^{3-3}f(x)=3\left(F(x)\right)^{2}f(x)=3\left(1-e^{-\frac{x^2}{2}}\right)^{2} x e^{-\frac{x^2}{2}}\]

Integrating over all values of \(x\)

\[\int_{0}^{\infty}f_{3:3}(x) (1-F(kx)) dx = \int_{0}^{\infty}3\left(1-e^{-\frac{x^2}{2}}\right)^{2} x e^{-\frac{x^2}{2}} e^{-\frac{k^2 x^2}{2}} dx = (1 - 90\%)\]

Solving for \(k\), we get

find_root(integrate(3*(1-exp(-x^2/2))^2*x*exp(-x^2/2)*exp(-k^2*x^2/2), x, 0, inf)=0.1, k, 1, 10);
1.414213562373095

Five Shots

Summary

So far we have ignored the issue of outliers (fliers), but with five or more shots we can finally do something about them: look at the second worst miss radius instead of the worst miss radius.

\[\hat{R_{90}} \approx 1.6 R_{4:5}\]

In examples below red circle has radius \(R_{4:5}\) (4th-smallest miss radius in a 5-shot group), and blue circle has radius \(\hat{R_{90}}\).

OrderStatisticsFig5.png

Details

According to formula 3.13 in M. M. Siddiqui 1964 paper in References, optimum unbiased estimator from single order statistic \(R_{m:n}\) is

\[m \approx 0.79681(n+1)-0.39841+\frac{1.16312}{n+1}\]

For \(n=5\) optimum estimator is \(m=5\), but \(m=4\) is not far behind (relative efficiency 96.3%) and is less sensitive to outliers. There is a sharp drop after that, relative efficiency of \(m=3\) (median miss radius) is 76.3%.

Probability density of \(m\)th miss radius in a group of \(n\) shots is

\[f_{m:n}(x)=\frac{n!}{(m-1)!(n-m)!}\left(F(x)\right)^{m-1}\left(1-F(x)\right)^{n-m}f(x)\]

For \(m=4\) and \(n=5\)

\[f_{4:5}(x) = \frac{5!}{(4-1)!(5-4)!}\left(F(x)\right)^{4-1}\left(1-F(x)\right)^{5-4}f(x) = 20\left(F(x)\right)^{3}\left(1-F(x)\right)f(x) = 20\left(1-e^{-\frac{x^2}{2}}\right)^{3} e^{-\frac{x^2}{2}} x e^{-\frac{x^2}{2}}\]

Integrating over all values of \(x\)

\[\int_{0}^{\infty}f_{4:5}(x) (1-F(kx)) dx = \int_{0}^{\infty} 20\left(1-e^{-\frac{x^2}{2}}\right)^{3} e^{-\frac{x^2}{2}} x e^{-\frac{x^2}{2}} x e^{-\frac{k^2 x^2}{2}} dx = (1 - 90\%)\]

Solving for \(k\)

f45(x):=20*(1-exp(-x^2/2))^3*x*exp(-x^2);
find_root(integrate(f45(u)*exp(-k^2*u^2/2),u,0,inf)=0.1, k, 1, 2);
1.578643529936508

Relative efficiency of \(R_{4:5}\) compared to \(R_{5:5}\):

f45(x):=20*(1-exp(-x^2/2))^3*x*exp(-x^2);
mean45:integrate(x*f45(x),x,0,inf);
f55(x):=5*(1-exp(-x^2/2))^4*x*exp(-x^2/2);
mean55:integrate(x*f55(x),x,0,inf);
float((integrate(x^2*f55(x),x,0,inf)-mean55^2)*mean45^2/((integrate(x^2*f45(x),x,0,inf)-mean45^2)*mean55^2));
0.9627613473444434

Ten Shots

Summary

We can simply split ten shots into two five-shot groups and take the average, but adding 6th and 9th miss radiuses works even better:

\[\hat{R_{90}} \approx 0.7 (R_{6:10} + R_{9:10})\]

In examples below the two red circles have radiuses \(R_{6:10}\) and \(R_{9:10}\), and blue circle has radius \(\hat{R_{90}}\).

OrderStatisticsFig10.png

Details

According to M. M. Siddiqui 1964 paper in References, the two optimal order statistics are \(0.639(n+1)\) and \(0.927(n+1)\). For \(n=10\) it’s 7 and 10, but 6 and 9 work nearly as well (relative efficiency 98.7%, details below). Not using the worst miss radius means that R6:10 + R9:10 is a more robust estimator.

Joint probability distribution of 6th miss radius \(x=R_{6:10}\) and 9th miss radius \(y=R_{9:10}\) is

\[f(x,y)=\frac{n!}{(j-1)!(k-j-1)!(n-k)!}\left[F(x)\right]^{j-1}\left(F(y)-F(x)\right)^{k-j-1}\left(1-F(y)\right)^{n-k}f(x)f(y)\] \[=15120 \left[1-e^{-\frac{x^2}{2}}\right]^5 \left(e^{-\frac{x^2}{2}}-e^{-\frac{y^2}{2}}\right)^2 e^{-\frac{y^2}{2}} x e^{-\frac{x^2}{2}} y e^{-\frac{y^2}{2}}, x \le y\]

Since we only care about \(x+y\), we can rotate axes by 45° and integrate over \(x-y\):

\[p(u)=\frac{1}{2}\int_{0}^{u}f\left(\frac{u-v}{2},\frac{u+v}{2}\right)dv\]

Integrating over all values of \(u\)

\[\int_{0}^{\infty}p(u)e^{-\frac{k^2 u^2}{2}}du = (1 - 90\%)\]

This seems to be too hard even for Maxima to solve analytically, so we’ll have to resort to numeric integration using Romberg’s method. Upper limit of 10 instead of \(\infty\) in second integration prevents underflow in the tail.

f(x,y):=15120*(1-exp(-x^2/2))^5*(exp(-x^2/2)-exp(-y^2/2))^2*exp(-y^2/2)*x*exp(-x^2/2)*y*exp(-y^2/2);
p(u):=romberg(f((u-v)/2,(u+v)/2)/2,v,0,u);
find_root(romberg(p(u)*exp(-k^2*u^2/2),u,0,10)=0.1,k,0.6,0.8);
0.70766872521906

Relative efficiency of \(R_{6:10}+R_{9:10}\) compared to \(R_{7:10}+R_{10:10}\):

f69(x,y):=15120*(1-exp(-x^2/2))^5*(exp(-x^2/2)-exp(-y^2/2))^2*exp(-y^2/2)*x*exp(-x^2/2)*y*exp(-y^2/2);
p69(u):=romberg(f69((u-v)/2,(u+v)/2)/2,v,0,u);
mean69:romberg(x*p69(x),x,0,10);
var69:romberg(t^2*p69(t),t,0,10)-mean69^2;
f710(x,y):=2520*(1-exp(-x^2/2))^6*(exp(-x^2/2)-exp(-y^2/2))^2*x*exp(-x^2/2)*y*exp(-y^2/2);
p710(u):=romberg(f710((u-v)/2,(u+v)/2)/2,v,0,u);
mean710:romberg(x*p710(x),x,0,10);
var710:romberg(t^2*p710(t),t,0,7)-mean710^2;
(var710*mean69^2)/(var69*mean710^2);
0.9873309275507542

Multiple Groups

Summary

With more than ten shots, it’s easier to split them into five- or ten-shot groups and take the average. The factors will be somewhat smaller:

\[\hat{R_{90}} \approx 1.4 R_{4:5} \approx 0.67 (R_{6:10} + R_{9:10})\]

Details

With very large number of groups, the factors are just ratios of expected values:

\[\frac{E(R_{90})}{E(R_{4:5})}\]

f45(x):=20*(1-exp(-x^2/2))^3*x*exp(-x^2);
float(sqrt(2*log(10))/integrate(x*f45(x), x, 0, inf));
1.38619009633813

\[\frac{E(R_{90})}{E(R_{6:10})+E(R_{9:10})}\]

f610(x):=1260*(1-exp(-x^2/2))^5*exp(-2*x^2)*x*exp(-x^2/2);
f910(x):=90*(1-exp(-x^2/2))^8*x*exp(-x^2);
float(sqrt(2*log(10))/(integrate(x*f610(x),x,0,inf)+integrate(x*f910(x),x,0,inf)));
0.6702446399176036