- Rule 1: Be Funny
- Rule 2: Writing a spec is like writing code for a brain to execute
- Rule 3: Write as simply as possible
- Rule 4: Review and reread several times
The author doesn't integrate the humor quite as seamlessly as Joel does, yet it's interesting to see how effective the style is, even for someone still learning it. I commend them for making the topic more accessible. It was probably fun to write and was definitely fun to read!
https://www.joelonsoftware.com/2000/10/15/painless-functiona...
This is wrong. Consider how this criterion is also satisfied by 1-100000x, since lim_{x→0} (1-x)/(1-100000x) = 1. But that is clearly not a good first-order approximation of 1-x around 0.
The proper justification for replacing 1-x with e^-x around 0 comes from examining the first two terms of their Taylor expansions, in other words the functions' values at 0 and their first derivatives at 0. Since these match for 1-x and e^-x, they are good first-order approximations of each other around 0.
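Concretely, expanding around 0:

e^-x = 1 - x + x^2/2 - x^3/6 + ...

so e^-x - (1-x) = O(x^2), whereas (1-100000x) - (1-x) = -99999x, which is only O(x). Agreement through the linear term is exactly what a first-order approximation demands, and e^-x provides it while 1-100000x does not.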
https://en.wikipedia.org/wiki/Stirling's_approximation#Speed...
You can build on that to construct good, composable approximations of any of the standard combinatorial functions in log-space (and recover the quantities you want by simply exponentiating back). For example, if you've implemented an approximate ln-fac(n) ~ ln(fac(n)), you immediately get ln-choose, the logarithm of (n!)/(k!(n-k)!), as simply ln-fac(n) - ln-fac(k) - ln-fac(n-k). Fully composable: if ln-fac() is a numerically good approximation, then so is any reasonable sum or difference of such terms.
Or: the log of the binomial distribution PDF is simply ln-choose(n,k) + k*ln(p) + (n-k)*ln(1-p).
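A minimal sketch of that composition in C, using lgamma() from <math.h> as the ln-fac building block (the helper names here are just for illustration):

#include <math.h>
#include <stdio.h>

// ln(n!) via the log-gamma function: ln(n!) = lgamma(n + 1)
static double ln_fac(double n) { return lgamma(n + 1.0); }

// ln C(n,k) = ln-fac(n) - ln-fac(k) - ln-fac(n-k)
static double ln_choose(double n, double k) {
    return ln_fac(n) - ln_fac(k) - ln_fac(n - k);
}

// log of the binomial PMF: ln-choose(n,k) + k*ln(p) + (n-k)*ln(1-p)
static double binom_log_pmf(double n, double k, double p) {
    return ln_choose(n, k) + k * log(p) + (n - k) * log1p(-p);
}

int main(void) {
    // e.g. the probability of exactly 500000 heads in 1000000 fair flips
    double lp = binom_log_pmf(1000000, 500000, 0.5);
    printf("ln P = %f, P = %g\n", lp, exp(lp));
    return 0;
}

Everything stays in log-space until the very end, so no intermediate factorial ever overflows; only the final exp() can underflow, and only when the probability itself is astronomically small.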
If you have n bits, your collision probability reaches roughly 0.5 once you've generated about 2^(n/2) values.
Or put differently: a 128-bit UUID gives you a "good enough" 64-bit distributed autoincrement.
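To put rough numbers on that (using the n^2/(2k) estimate derived below): with k = 2^128 possible UUIDs and n = 2^64 of them generated, the expected number of colliding pairs is (2^64)^2 / (2 * 2^128) = 1/2, so the probability of at least one collision is about 1 - e^(-1/2) ≈ 0.39; you cross 0.5 at roughly 1.18 * 2^64 values.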
* nC2 pairs of objects, each of which has a 1/k chance of matching. Since expectation is linear, and it's also known that E[X] = P(X>=1) + P(X>=2) + ..., we have that P(X>=1) = E[X] - epsilon (since we can assume the probability of two or more collisions is small), and so the probability of a collision is ~n^2/(2k).
* Alternatively, you can use the union bound to see that the probability of at least one match is bounded above by the sum of the probabilities of the individual pair events, so it is <= n^2/(2k), with the quality of the approximation governed by the fact that the probability of multiple events occurring simultaneously is low.
(The two are effectively the same proof, as indicator random variables can be used to show P(union of events) <= E[X], which is effectively a special case of Markov's inequality.)
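A quick numerical check of that approximation (a standalone sketch, not from the thread; the exact probability is the usual product over successive insertions, computed in log-space):

#include <math.h>
#include <stdio.h>

int main(void) {
    const double k = 1e6;  // number of possible values ("buckets")
    for (double n = 100; n <= 8100; n *= 3) {
        // exact: P(collision) = 1 - prod_{i=1..n-1} (1 - i/k)
        double log_no_collision = 0.0;
        for (double i = 1; i < n; i++)
            log_no_collision += log1p(-i / k);
        double exact  = -expm1(log_no_collision);
        double approx = n * (n - 1) / (2.0 * k);  // expected number of colliding pairs
        printf("n=%6.0f  exact=%.6f  n(n-1)/2k=%.6f\n", n, exact, approx);
    }
    return 0;
}

The two columns agree closely while the expected count stays well below 1 and drift apart as multiple collisions become likely, which is exactly the epsilon the first bullet waves away.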
https://github.com/Cyan4973/xxHash/issues/229#issuecomment-5...
How does one write something like this?
I get the interest, and the review process. What I mean is: is this a hobby that someone is simply passionate about, or do some employers allow people to work on side projects like this?
I feel my life is mostly about delivering direct value, and I don't really understand where I went wrong on the path to meaningful knowledge.
Any philosophical help will be very welcome; as you correctly guessed, I'm a bit lost.
How many do you have to generate before a collision becomes a remote possibility?
// Naive multiplication, no accuracy at all
static double ExpectedNBCollisions_Slow ( const double nbH, const double nbBits )
{
    long balls = nbH;
    long double bins = nbBits;
    long double result = 1.0;
    for (long i = 1; i < balls / 2; i++) {
        // take a pair from the front and the end to minimize errors
        result *= ((bins - i) / bins) * ((bins - (nbH - i)) / bins);
    }
    return (double)(nbH * result);
}
// TODO This only works for a low number of collisions
static inline double ExpectedCollisions ( const double balls, const double bins )
{
    return balls - (bins * (1 - pow((bins - 1)/bins, balls)));
}
// Still too inaccurate: https://preshing.com/20110504/hash-collision-probabilities/
static double EstimateNbCollisions_Taylor(const double nbH, const double nbBits)
{
    const long double k = nbH;
    const long double b = nbBits;
    return (double)(k * (1.0 - expl(-0.5 * k * (k - 1.0) / b)));
}
// demerphq: (double(count) * double(count-1)) / pow(2.0,double(sizeof(hashtype) * 8 + 1));
// the very same as our calc. pow 2 vs exp2. Just the high cutoff is missing here.
static double EstimateNbCollisions_Demerphq(const double nbH, const double nbBits)
{
    return (nbH * (nbH - 1)) / pow(2.0, nbBits + 1);
}
// GNU R: qbirthday. rough estimate. FIXME
static double EstimateNbCollisions_R(const double nbH, const double nbBits)
{
    return ceil(exp(((log(nbH) + lgamma(3) + log(-log1p(-0.5)))) / 2));
}
// GNU R: pbirthday. FIXME
/*
static double EstimateNbCollisions_Rp(const double c)
{
return (1 - prod((c:(c-0.5+1))/rep(2, 0.5)));
}
*/
// The previous best calculation, highly prone to inaccuracies with low results (1.0 - 10.0)
// TODO: return also the error.
static double EstimateNbCollisions_previmpl(const double nbH, const double nbBits)
{
    double exp = exp2(nbBits); // 2 ^ bits
    double result = (nbH * (nbH-1)) / (2.0 * exp);
    if (result > nbH)
        result = nbH;
    // improved floating point accuracy
    if (result <= exp || nbBits > 32)
        return result;
    return result - exp;
}
static double EstimateNbCollisions_fwojcik(const double nbH, const int nbBits)
{
    // If the probability that there are 1 or more collisions (p(C >=
    // 1)) is not much higher than the probability of exactly 1
    // collision (p(C == 1)), then the classically-good approximation
    // of the probability of any collisions is also a good estimate
    // for the expected number of collisions.
    //
    // If there are 2**n buckets and 2**(n-r) hashes, then the ratio
    // of p(C >= 1)/p(C == 1) is about 1/(1-2**(n-2r-1)). This uses
    // the new estimator if that ratio is > 1 + 2**-8. That cutoff
    // minimizes the error around the values we care about.
    if (nbBits - 2.0*log2(nbH) >= 8 - 1) {
        return nbH * (nbH - 1) * exp2(-nbBits-1);
    }
    // The probability that any given hash bucket is empty after nbH
    // insertions is:
    //     pE = ((2**nbBits - 1)/(2**nbBits))**nbH
    // so we compute:
    //     ln(pE) = nbH * ln((2**nbBits - 1)/(2**nbBits))
    //            = nbH * ln(1 - 1/2**(nbBits))
    //            = nbH * ln(1 - 2**(-nbBits))
    //            = nbH * ln(1 + -(2**(-nbBits)))
    // This means the probability that any given hash bucket is
    // occupied after nbH insertions is:
    //     pF = 1 - pE
    //        = 1 - exp(ln(pE))
    //        = -(exp(ln(pE)) - 1)
    //        = -expm1(ln(pE))
    // And the expected number of collisions, with m = nbH hashes and
    // n = 2**nbBits buckets, is:
    //     C = m - n + n * pE
    //       = m - n * (1 - pE)
    //       = n * (m/n - 1 + pE)
    //       = n * (m/n - (1 - pE))
    //       = n * (m/n - pF)
    //       = n * (m/n - (-expm1(ln(pE))))
    //       = n * (m/n + expm1(ln(pE)))
    // Since the format of floats/doubles is k*2**n, multiplying by
    // exp2(x) doesn't lose any precision, and this formulation keeps
    // m/n and pF at the same general orders of magnitude, so it tends
    // to have very good precision. At low hash occupancy, pF is too
    // close to m/n for this formula to work well.
    double logpE = (double)nbH * log1p(-exp2(-nbBits));
    double result = exp2(nbBits) * (exp2(-nbBits) * (double)nbH + expm1(logpE));
    return result;
}
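For reference, a minimal driver (assuming the functions above are pasted into one file along with <math.h> and <stdio.h>; the occupancy values are arbitrary) that prints the previous and new estimators side by side:

int main(void) {
    const double nbBits = 32.0;  // 32-bit hash values, i.e. 2^32 possible outputs
    const double counts[] = { 1000.0, 77163.0, 1000000.0, 100000000.0 };
    for (int i = 0; i < 4; i++) {
        double n = counts[i];
        printf("nbH=%12.0f  previmpl=%14.4f  fwojcik=%14.4f\n",
               n,
               EstimateNbCollisions_previmpl(n, nbBits),
               EstimateNbCollisions_fwojcik(n, (int)nbBits));
    }
    return 0;
}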
I mean, in most real-life situations you probably won't know ahead of time which objects you're going to have to hash (or else you could hash them perfectly anyway).