- Rule 1: Be Funny
- Rule 2: Writing a spec is like writing code for a brain to execute
- Rule 3: Write as simply as possible
- Rule 4: Review and reread several times
The author doesn't integrate the humor quite as seamlessly as Joel does, yet it's interesting to see how effective the style is, even for someone still learning it. I commend them for making the topic more accessible. It was probably fun to write and was definitely fun to read!
https://www.joelonsoftware.com/2000/10/15/painless-functiona...
This is wrong. Consider how this criterion is also satisfied by 1-100000x, since lim_{x→0} (1-x)/(1-100000x) = 1. But that is clearly not a good first-order approximation of 1-x around 0.
The proper justification for replacing 1-x with e^-x around 0 comes from examining the first two terms of their Taylor expansions, in other words the functions' values at 0 and their first derivatives at 0. Since these match for 1-x and e^-x, they are good first-order approximations of each other around 0.
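Concretely, expanding around 0:

e^-x = 1 - x + x^2/2 - x^3/6 + ...

so e^-x - (1-x) = O(x^2), whereas (1-100000x) - (1-x) = -99999x, which is only O(x). Agreement through the linear term is exactly what a first-order approximation demands, and e^-x provides it while 1-100000x does not.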
https://en.wikipedia.org/wiki/Stirling's_approximation#Speed...
You can build on that to construct good, composable approximations of any of the standard combinatorial functions in log-space (and recover the quantities you want by simply exponentiating back). For example, if you've implemented an approximate ln-fac(n) ~ ln(fac(n)), you immediately get ln-choose, the logarithm of (n!)/(k!(n-k)!), as simply ln-fac(n) - ln-fac(k) - ln-fac(n-k). Fully composable: if ln-fac() is a numerically good approximation, then so is any reasonable sum or difference of such terms.
Or: the log of the binomial distribution PDF is simply ln-choose(n,k) + k*ln(p) + (n-k)*ln(1-p).
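A minimal sketch of that composition in C, using lgamma() from <math.h> as the ln-fac building block (the helper names here are just for illustration):

#include <math.h>
#include <stdio.h>

// ln(n!) via the log-gamma function: ln(n!) = lgamma(n + 1)
static double ln_fac(double n) { return lgamma(n + 1.0); }

// ln C(n,k) = ln-fac(n) - ln-fac(k) - ln-fac(n-k)
static double ln_choose(double n, double k) {
    return ln_fac(n) - ln_fac(k) - ln_fac(n - k);
}

// log of the binomial PMF: ln-choose(n,k) + k*ln(p) + (n-k)*ln(1-p)
static double binom_log_pmf(double n, double k, double p) {
    return ln_choose(n, k) + k * log(p) + (n - k) * log1p(-p);
}

int main(void) {
    // e.g. the probability of exactly 500000 heads in 1000000 fair flips
    double lp = binom_log_pmf(1000000, 500000, 0.5);
    printf("ln P = %f, P = %g\n", lp, exp(lp));
    return 0;
}

Everything stays in log-space until the very end, so no intermediate factorial ever overflows; only the final exp() can underflow, and only when the probability itself is astronomically small.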
If you have n bits, your collision probability reaches roughly 0.5 once you've generated about 2^(n/2) values.
Or put differently: a 128-bit UUID gives you a "good enough" 64-bit distributed autoincrement.
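To put rough numbers on that (using the n^2/(2k) estimate derived below): with k = 2^128 possible UUIDs and n = 2^64 of them generated, the expected number of colliding pairs is (2^64)^2 / (2 * 2^128) = 1/2, so the probability of at least one collision is about 1 - e^(-1/2) ≈ 0.39; you cross 0.5 at roughly 1.18 * 2^64 values.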
* nC2 pairs of objects, each of which has a 1/k chance of matching. Since expectation is linear, and it's also known that E[X] = P(X>=1) + P(X>=2) + ..., we have that P(X>=1) = E[X] - epsilon (since we can assume the probability of two or more collisions is small), and so the probability of a collision is ~n^2/(2k).
* Alternatively, you can use the union bound to see that the probability of at least one match is bounded above by the sum of the probabilities of the individual pair events, so it is <= n^2/(2k), with the quality of the approximation governed by the fact that the probability of multiple events occurring simultaneously is low.
(The two are effectively the same proof, as indicator random variables can be used to show P(union of events) <= E[X], which is effectively a special case of Markov's inequality.)
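A quick numerical check of that approximation (a standalone sketch, not from the thread; the exact probability is the usual product over successive insertions, computed in log-space):

#include <math.h>
#include <stdio.h>

int main(void) {
    const double k = 1e6;  // number of possible values ("buckets")
    for (double n = 100; n <= 8100; n *= 3) {
        // exact: P(collision) = 1 - prod_{i=1..n-1} (1 - i/k)
        double log_no_collision = 0.0;
        for (double i = 1; i < n; i++)
            log_no_collision += log1p(-i / k);
        double exact  = -expm1(log_no_collision);
        double approx = n * (n - 1) / (2.0 * k);  // expected number of colliding pairs
        printf("n=%6.0f  exact=%.6f  n(n-1)/2k=%.6f\n", n, exact, approx);
    }
    return 0;
}

The two columns agree closely while the expected count stays well below 1 and drift apart as multiple collisions become likely, which is exactly the epsilon the first bullet waves away.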
https://github.com/Cyan4973/xxHash/issues/229#issuecomment-5...
How does one write something like this?
I get the interest, and the review process. What I mean is: is this a hobby that someone is simply passionate about, or do some employers allow people to work on side projects like this?
I feel my life is mostly about delivering direct value, and I don't really understand where I went wrong on the path to meaningful knowledge.
Any philosophical help will be very welcome; as you correctly guessed, I'm a bit lost.
How many do you have to generate before a collision becomes a remote possibility?
// Naive multiplication, no accuracy at all
static double ExpectedNBCollisions_Slow ( const double nbH, const double nbBits )
{
    long balls = nbH;
    long double bins = nbBits;
    long double result = 1.0;
    for (long i = 1; i < balls / 2; i++) {
        // take a pair from the front and the end to minimize errors
        result *= ((bins - i) / bins) * ((bins - (nbH - i)) / bins);
    }
    return (double)(nbH * result);
}
// TODO This only works for a low number of collisions
static inline double ExpectedCollisions ( const double balls, const double bins )
{
    return balls - (bins * (1 - pow((bins - 1)/bins, balls)));
}
// Still too inaccurate: https://preshing.com/20110504/hash-collision-probabilities/
static double EstimateNbCollisions_Taylor(const double nbH, const double nbBits)
{
    const long double k = nbH;
    const long double b = nbBits;
    return (double)(k * (1.0 - expl(-0.5 * k * (k - 1.0) / b)));
}
// demerphq: (double(count) * double(count-1)) / pow(2.0,double(sizeof(hashtype) * 8 + 1));
// the very same as our calc. pow 2 vs exp2. Just the high cutoff is missing here.
static double EstimateNbCollisions_Demerphq(const double nbH, const double nbBits)
{
    return (nbH * (nbH - 1)) / pow(2.0, nbBits + 1);
}
// GNU R: qbirthday. rough estimate. FIXME
static double EstimateNbCollisions_R(const double nbH, const double nbBits)
{
    return ceil(exp(((log(nbH) + lgamma(3) + log(-log1p(-0.5)))) / 2));
}
// GNU R: pbirthday. FIXME
/*
static double EstimateNbCollisions_Rp(const double c)
{
return (1 - prod((c:(c-0.5+1))/rep(2, 0.5)));
}
*/
// The previous best calculation, highly prone to inaccuracies with low results (1.0 - 10.0)
// TODO: return also the error.
static double EstimateNbCollisions_previmpl(const double nbH, const double nbBits)
{
    double exp = exp2(nbBits); // 2 ^ bits
    double result = (nbH * (nbH-1)) / (2.0 * exp);
    if (result > nbH)
        result = nbH;
    // improved floating point accuracy
    if (result <= exp || nbBits > 32)
        return result;
    return result - exp;
}
static double EstimateNbCollisions_fwojcik(const double nbH, const int nbBits)
{
    // If the probability that there are 1 or more collisions (p(C >=
    // 1)) is not much higher than the probability of exactly 1
    // collision (p(C == 1)), then the classically-good approximation
    // of the probability of any collisions is also a good estimate
    // for the expected number of collisions.
    //
    // If there are 2**n buckets and 2**(n-r) hashes, then the ratio
    // of p(C >= 1)/p(C == 1) is about 1/(1-2**(n-2r-1)). This uses
    // the new estimator if that ratio is > 1 + 2**-8. That cutoff
    // minimizes the error around the values we care about.
    if (nbBits - 2.0*log2(nbH) >= 8 - 1) {
        return nbH * (nbH - 1) * exp2(-nbBits-1);
    }
    // The probability that any given hash bucket is empty after nbH
    // insertions is:
    //     pE = ((2**nbBits - 1)/(2**nbBits))**nbH
    // so we compute:
    //     ln(pE) = nbH * ln((2**nbBits - 1)/(2**nbBits))
    //            = nbH * ln(1 - 1/2**(nbBits))
    //            = nbH * ln(1 - 2**(-nbBits))
    //            = nbH * ln(1 + -(2**(-nbBits)))
    // This means the probability that any given hash bucket is
    // occupied after nbH insertions is:
    //     pF = 1 - pE
    //        = 1 - exp(ln(pE))
    //        = -(exp(ln(pE)) - 1)
    //        = -expm1(ln(pE))
    // And the expected number of collisions, with m = nbH hashes and
    // n = 2**nbBits buckets, is:
    //     C = m - n + n * pE
    //       = m - n * (1 - pE)
    //       = n * (m/n - 1 + pE)
    //       = n * (m/n - (1 - pE))
    //       = n * (m/n - pF)
    //       = n * (m/n - (-expm1(ln(pE))))
    //       = n * (m/n + expm1(ln(pE)))
    // Since the format of floats/doubles is k*2**n, multiplying by
    // exp2(x) doesn't lose any precision, and this formulation keeps
    // m/n and pF at the same general orders of magnitude, so it tends
    // to have very good precision. At low hash occupancy, pF is too
    // close to m/n for this formula to work well.
    double logpE = (double)nbH * log1p(-exp2(-nbBits));
    double result = exp2(nbBits) * (exp2(-nbBits) * (double)nbH + expm1(logpE));
    return result;
}
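For reference, a minimal driver (assuming the functions above are pasted into one file along with <math.h> and <stdio.h>; the occupancy values are arbitrary) that prints the previous and new estimators side by side:

int main(void) {
    const double nbBits = 32.0;  // 32-bit hash values, i.e. 2^32 possible outputs
    const double counts[] = { 1000.0, 77163.0, 1000000.0, 100000000.0 };
    for (int i = 0; i < 4; i++) {
        double n = counts[i];
        printf("nbH=%12.0f  previmpl=%14.4f  fwojcik=%14.4f\n",
               n,
               EstimateNbCollisions_previmpl(n, nbBits),
               EstimateNbCollisions_fwojcik(n, (int)nbBits));
    }
    return 0;
}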
I mean, in most real-life situations you probably won't know ahead of time which objects you're going to have to hash (or else you could hash them perfectly anyway).