DNA-R1B1C7-L Archives

From: "Sandy Paterson" <>
Subject: Re: [R-M222] analysis
Date: Fri, 6 Jan 2012 12:54:49 -0000
References: <5ab7.327af185.3c365f07@aol.com>,<SNT128-W6485989739F5F4A04CE66EBB940@phx.gbl>,<000001cccb7f$007a4540$016ecfc0$@com><SNT128-W65685B15B0EECB820BB181BB940@phx.gbl>
In-Reply-To: <SNT128-W65685B15B0EECB820BB181BB940@phx.gbl>


OK, let's look at the Poisson distribution.

It's normally used when the distribution you are modelling is known to have
a lowish mean, and you want to estimate probabilities of occurrence for the
values 0,1,2,...

A good example is in population studies. Let's suppose you believe that
fathers, on average, have 1.15 sons and given this, you want to work out the
probability that individual fathers have 0 sons, 1 son, 2 sons and so on.

Using the notation below,

P(x) is the probability of having x sons = [(u ^ x).e ^ (-u)]/x!
u = 1.15
e = the Napierian constant (2.718282...)
x! = x(x-1)(x-2)....1 (so 5!=5x4x3x2x1 = 120)

This gives

 x   P(x)

 0   0.316637
 1   0.364132
 2   0.209376
 3   0.080261
 4   0.023075
 5   0.005307
 6   0.001017
 7   0.000167
 8   0.000024
 9   0.000003
10   0.000000

Total 0.999999

The total falls just short of 1 only because the table stops at x = 10 and
each entry is rounded to six decimal places.
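
For anyone who wants to recompute the table, here is a minimal sketch in
Python. The function name poisson_pmf and the constant MEAN_SONS are my own
labels, not anything from the thread; the only input taken from the post is
the assumed mean of 1.15 sons per father.

```python
import math

def poisson_pmf(x, mean):
    """Probability of exactly x events for a Poisson variable with this mean."""
    return mean ** x * math.exp(-mean) / math.factorial(x)

MEAN_SONS = 1.15  # average number of sons per father, as assumed in the post

# Probabilities for 0 to 10 sons, printed to six decimal places.
table = [(x, poisson_pmf(x, MEAN_SONS)) for x in range(11)]
for x, p in table:
    print(f"{x:2d}  {p:.6f}")
print(f"Total  {sum(p for _, p in table):.6f}")
```

Summing the unrounded values avoids accumulating rounding error from the
individual rows.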

Anyway, from this, assuming that the 1.15 is fairly accurate, we can see
that the probability of 'daughtering out' is about 0.32 per generation.

Perhaps of more interest is that the probability of NOT daughtering out in
50 generations is about (1-0.316637)^50, which is roughly 5 x 10^-9, a very
small number indeed. In other words, very, very few male lines survive 50
generations.
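
The size of that number is easy to check. The sketch below simply evaluates
the expression as written in the post, i.e. it assumes the same 1.15 mean and
multiplies the per-generation chance of having at least one son over 50
generations; the variable names are my own.

```python
import math

MEAN_SONS = 1.15
GENERATIONS = 50

# Chance that a father has no sons at all: the Poisson P(0).
p_no_sons = math.exp(-MEAN_SONS)

# The post's expression: at least one son in every one of 50 generations.
p_chain = (1 - p_no_sons) ** GENERATIONS
print(f"P(no sons) = {p_no_sons:.6f}")
print(f"(1 - P(no sons))^{GENERATIONS} = {p_chain:.3e}")
```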

Turning now to what Anatole does, and using the 257 haplotype example that
John posted: over 12 markers, 89 haplotypes were 'on base' after time T, and
over 25 markers, 18 were 'on base'. Using a sum of mutation rates over the
first 12 markers of .022 and a sum of mutation rates of .046 over the first
25 markers, he computes

ln(257/89)/.022 = 48.2
ln(257/18)/.046 = 57.8

and concludes that the two answers are within his error margin, so he's
happy to proceed.
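
The two figures can be reproduced with a few lines of Python. This is only a
sketch of the arithmetic quoted above; the function name
generations_from_base_count is hypothetical, and the counts and rate sums are
the ones given in this message.

```python
import math

def generations_from_base_count(total, on_base, rate_sum):
    """ln(total / on_base) divided by the summed mutation rate."""
    return math.log(total / on_base) / rate_sum

# 12 markers: 89 of 257 on base, summed rate .022
t12 = generations_from_base_count(257, 89, 0.022)
# 25 markers: 18 of 257 on base, summed rate .046
t25 = generations_from_base_count(257, 18, 0.046)

print(f"12 markers: {t12:.1f} generations")
print(f"25 markers: {t25:.1f} generations")
```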

Exactly what this has to do with a Poisson distribution is a mystery to me,
but let's play around a bit. In the 12-marker case, 89 are still on modal
(or on base) after time T. If the probable number of mutations per
generation is .022, how long would it take for only 89 to still be on base?

Well, using continuous mathematics, and a mutation rate of .022 (the rate at
which markers are believed to move off base), we need to solve for T in

e ^ (-.022T) = 89/257

Taking natural logs on both sides of the equation we get

-.022T.ln(e) = ln(89/257)

Since ln(e)=1 and ln(89/257) = -1.06044, we have

-.022T = -1.06044 hence T = -1.06044/(-.022) = 48.2
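
As a side check, the same T can be computed directly, and compared with a
discrete version in which a fraction .022 moves off base once per generation,
so that (1 - .022)^T = 89/257. The discrete variant is my own addition for
comparison and is not something either poster proposed.

```python
import math

RATE = 0.022                  # summed 12-marker mutation rate from the post
FRACTION_ON_BASE = 89 / 257   # share of haplotypes still on base

# Continuous decay: e^(-RATE * T) = FRACTION_ON_BASE
t_continuous = math.log(FRACTION_ON_BASE) / -RATE

# Discrete decay (assumed for comparison): (1 - RATE)^T = FRACTION_ON_BASE
t_discrete = math.log(FRACTION_ON_BASE) / math.log(1 - RATE)

print(f"continuous: {t_continuous:.1f} generations")
print(f"discrete:   {t_discrete:.1f} generations")
```

With a rate this small the two versions differ by only about half a
generation.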

Now this has nothing whatsoever to do with Poisson. It is plain, simple
continuous mathematics with the number 1 reducing at a (continuous) rate of
.022 per generation. It just so happens that Simeon Poisson developed an
approximation using the Napierian number e. But hey, a little bit of Poisson
in your resume looks good. Of course I'm being facetious here, and the real
question is whether we should feel a sense of confidence at the fact that
12 and 25 marker haplotypes produce similar answers.

Equally important is whether one should take comfort when the above
approximations give similar answers to the linear approach.

I guess that's up to the reader, but I'd be interested in Steve Forrest's
thoughts.


Sandy


-----Original Message-----
From:
[mailto:] On Behalf Of Bernard Morgan
Sent: 05 January 2012 20:58
To: dna-r1b1c7
Subject: Re: [R-M222] analysis


P(x), the probability that a Poisson random variable X takes the value x, is
equal to e raised to the power of negative μ, times μ to the power of x,
divided by the factorial of x.

In the paper,
μ, the expected number of occurrences, is equal to:
avg. no. of mutations per marker per generation * number of markers *
number of generations
or
avg. no. of mutations for the haplotype per generation * number of
generations

x, the number of occurrences of an event, is the number of mutations. This
is set to zero for no mutation away from the ancestral haplotype.

The factorial of x=0 is 1
μ to the power of x=0 is 1

Thus P(for 0 mutations) is equal only to
e raised to the negative of (avg. no. of mutations for the haplotype per
generation * number of generations)

We can rewrite the formula for "number of generations" as
t = ln(1/P(0))/k

where P(0) = the number of occurrences of the base (i.e. most common)
haplotype, divided by the total number of haplotypes

Hence:
"number of generations" = (natural log of (total number of haplotypes/number
of occurrences of the base (un-mutated) haplotype)), divided by the avg. no.
of mutations for the haplotype per generation

Using the zero-mutation probability of a Poisson distribution we can gauge
how many generations ago the common ancestor lived, assuming a mutation rate
X per generation.

The author also gives a linear mutation method, where the number of
(uncorrected) generations is equal to:
the number of mutations in all haplotypes from the ancestral (most common)
base haplotype, divided by the total number of haplotypes in the set, divided
by the average mutation rate for the haplotype per generation

i.e. the average Genetic Distance of the haplotype set is divided by the
mutation rate X of the haplotype to give the average number of generations it
would have taken to reach this average Genetic Distance

Therefore t = ln(N/A)/k and t = (n/N)/k

t = number of generations
k = the average mutation rate per haplotype per generation
N = the total number of haplotypes in the set
A = the number of unchanged (identical, not mutated) base haplotypes
n = the total number of mutations in all N haplotypes
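
The two formulas can be put side by side in a short Python sketch. The
function names are my own. N, A and k below are the 12-marker figures quoted
in Sandy's reply; the mutation count n = 270 is purely hypothetical, since no
value of n is given anywhere in the thread.

```python
import math

def t_logarithmic(N, A, k):
    """Generations from the share of haplotypes still on base: ln(N/A)/k."""
    return math.log(N / A) / k

def t_linear(n, N, k):
    """Generations from the average number of mutations per haplotype: (n/N)/k."""
    return n / N / k

N, A, k = 257, 89, 0.022   # 12-marker figures from the thread
n = 270                    # HYPOTHETICAL mutation count, for illustration only

print(f"logarithmic: {t_logarithmic(N, A, k):.1f} generations")
print(f"linear:      {t_linear(n, N, k):.1f} generations")
```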

Now the part I appreciate is that the only questionable variable in the
above equations (the one I have no control over), i.e. the average mutation
rate for the haplotype per generation, plays no part in the comparison
between the Poisson distribution and the linear mutation method. It can be
eliminated by equating the two expressions for t:

ln(N/A) = n/N
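
A quick way to see that the rate drops out is to compute the ratio of the two
estimates for several different values of k. In this sketch the function name
and the list of trial rates are my own, and n = 270 is again a hypothetical
mutation count, not a figure from the thread.

```python
import math

def method_ratio(N, A, n, k):
    """Ratio of the logarithmic estimate to the linear estimate for a given k."""
    t_log = math.log(N / A) / k
    t_lin = n / N / k
    return t_log / t_lin

N, A = 257, 89   # 12-marker counts from the thread
n = 270          # HYPOTHETICAL mutation count, for illustration only

# The ratio equals ln(N/A) / (n/N) whatever rate is plugged in.
ratios = [method_ratio(N, A, n, k) for k in (0.01, 0.022, 0.046, 0.1)]
for r in ratios:
    print(f"{r:.6f}")
```

A ratio close to 1 means the two methods agree, and that judgement can be
made without committing to any particular mutation rate.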

Hence any conflict between the results of the two methods will most likely
stem from incorrect identification of the ancestral base.

(There remains a certain amount of uncertainty that plagues the computation
of the average mutation rate for the haplotype per generation, i.e. the
choice of markers for the haplotype. Some markers can be too fast for the
timeframe, leading to hidden reversed mutations, while other markers are
ultra slow, thereby providing no information for the timeframe in question.)

It also needs to be noted that as the time-frame grows, fewer variables
(smaller haplotypes) need to be used, because more markers mean an increased
probability of variation in the haplotypes and hence a smaller surviving
ancestral base. I have found 25 markers the best for a common ancestor at 30
generations, i.e. 25% of 25 marker haplotypes should be the same. (If the
result is greater than 25% then the common ancestor would be younger.)
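
The 25% figure can be checked against the zero-mutation formula, P(0) =
e^(-k*t). The sketch below assumes the summed rates quoted in Sandy's reply
(.022 for 12 markers, .046 for 25 markers); whether those are exactly the
rates used for the 25% figure is an assumption on my part, and the function
name is hypothetical.

```python
import math

def fraction_on_base(rate_sum, generations):
    """Expected share of haplotypes with no mutations: e^(-rate_sum * generations)."""
    return math.exp(-rate_sum * generations)

GENERATIONS = 30
# Summed mutation rates as quoted in the reply (assumed to apply here too).
for markers, rate_sum in ((12, 0.022), (25, 0.046)):
    share = fraction_on_base(rate_sum, GENERATIONS)
    print(f"{markers} markers: {share:.1%} expected on base")
```

Under these assumed rates the 25-marker share at 30 generations comes out
close to one quarter, while the 12-marker share is roughly one half.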

Of course 25 markers has a wide confidence interval for TMRCA (Time to Most
Recent Common Ancestor); more markers would reduce this confidence interval.
The more important factor in the TMRCA calculation, though, is the accuracy
of the mutation rate calculation, for this is what drives the TMRCA once the
average number of generations away has been correctly identified. However, I
am not concerned about the accuracy of the average number of generations,
assuming it can be achieved.

I know from the historical records that O surnames commonly originate in the
10th century from 9th century ancestors. So as long as we accurately
identify the ancestral base (i.e. obtain comparable results from the Poisson
distribution and confirm them with the linear mutation method or an
alternative), we should have surnames with a similar number of generations
(whatever their validity). (I believe Mac surnames are typically younger,
coming from Mac Mhic O formations.)

Therefore, if the ancestral base (the haplotype for the surname's common
ancestor) can be found for similar aged surnames, and the process of
identifying an ancestral base and double checking it can be repeated,
this will allow us to get back to the haplotype of the originator of the M222
mutation. Then maybe someone with a well calibrated TMRCA method can tell
me the age. What I really would be interested in is determining the
intermediate ancestral bases (common ancestors) from which families of
surnames originate.


R1b1c7 Research and Links:

http://clanmaclochlainn.com/R1b1c7/