FRESH

Hacker News

Home

100M-Row Challenge with PHP

176 points by brentroose

by brentroose

8 subcomments

A month ago, I went on a performance quest trying to optimize a PHP script that took 5 days to run. Together with the help of many talented developers, I eventually got it to run in under 30 seconds. This optimization process with so much fun, and so many people pitched in with their ideas; so I eventually decided I wanted to do something more.
That's why I built a performance challenge for the PHP community
The goal of this challenge is to parse 100 million rows of data with PHP, as efficiently as possible. The challenge will run for about two weeks, and at the end there are some prizes for the best entries (amongst the prize is the very sought-after PhpStorm Elephpant, of which we only have a handful left).
I hope people will have fun with it :)

by pxtail

2 subcomments

Side note - I wasn't aware that there is active collectors scene for Elephpants, awesome!
https://elephpant.me/

by semiquaver

4 subcomments

Are they just confused about what characters require escaping in JSON strings or is PHP weirder than I remember?

    {
        "\/blog\/11-million-rows-in-seconds": {
            "2025-01-24": 1,
            "2026-01-24": 2
        },
        "\/blog\/php-enums": {
            "2024-01-24": 1
        }
    }

by Xeoncross

3 subcomments

This is why I jumped from PHP to Go, then why I jumped from Go to Rust.
Go is the most battery-included language I've ever used. Instant compile times means I can run tests bound to ctrl/cmd+s every time I save the file. It's more performant (way less memory, similar CPU time) than C# or Java (and certainly all the scripting languages) and contains a massive stdlib for anything you could want to do. It's what scripting languages should have been. Anyone can read it just like Python.
Rust takes the last 20% I couldn't get in a GC language and removes it. Sure, it's syntax doesn't make sense to an outsider and you end up with 3rd party packages for a lot of things, but can't beat it's performance and safety. Removes a whole lot of tests as those situations just aren't possible.
If Rust scares you use Go. If Go scares you use Rust.

by tzs

1 subcomments

What's a decent time for this?
I was curious what it would take if I approached it the way I do with most CSV transformation tasks that I'm only intending to do once: use Unix command line tools such as cut, sed, sort, and uniq to do the bulk of the work, and then do something in whatever scripting language seems appropriate to put the final output in whatever format is needed.
The first part, using this command [1], produces output lines that look like this:
```
  219,/blog/php-81-before-and-after,2021-06-21
```
and is sorted by URL path and then date.
With 1 million lines that took 9 or 10 seconds (M2 Max Mac Studio). But with 100 million it took 1220 seconds, virtually all of which was sorting.
Turning that into JSON via a shell script [2] was about 15 seconds. (That script is 44% longer than it would have been had JSON allowed a comma after the last element of an array).
So basically 22 minutes. The sorting is the killer with this type of approach, because the input is 7 GB. The output is only 13 MB and the are under 300 pages and the largest page count is under 1000 so building the output up in memory as the unsorted input is scanned and then sorting it would clearly by way way faster.
[1] cut -d / -f 4- | sed -e 's/T..............$//' | sort | uniq -c | sed -e 's/^ *//' -e 's/ /,\//'
[2]
```
  #!/bin/zsh
  echo "{"
  PAGE=none
  while read LINE; do
      COLS=("${(@s/,/)LINE}")
      COUNT=${COLS[1]}
      URL=${COLS[2]}
      DATE=${COLS[3]}
      if [ $URL != $PAGE ]; then
          if [ $PAGE != "none" ]; then
              echo
              echo "    },"
          fi
          PAGE=$URL
          echo "    \"\\$URL\": {"
          FINISHDATE=no
      else
          if [ $FINISHDATE = "yes" ]; then
              echo ","
          fi
      fi
      echo -n "        \"$DATE\": $COUNT"
      FINISHDATE=yes
  done
  echo
  echo "}"
```

by chrismarlow9

1 subcomments

I don't have time to put together a submission but I'm willing to bet you can use this:
https://github.com/kjdev/php-ext-jq
And replicate this command:
jq -R ' [inputs | split(",") | {url: .[0], date: .[1] | split("T")[0]}] | group_by(.url) | map({ (.[0].url): ( map(.date) | group_by(.) | map({(.[0]): length}) | add ) }) | add ' < test-data.csv
And it will be faster than anything you can do in native php
Edit: I'm assuming none of the urls have a comma with this but it's more about offloading it through an extension, even if you custom built it

by Twirrim

1 subcomments

I took a quick look, the dependency on php 8.5 is mildly irritating, even Ubuntu 26.04 isn't lined up to ship with that version, it's on 8.4.11.
You mention in the README that the goal is to run things in a standard environment, but then you're using a near bleeding edge PHP version that people are unlikely to be using?
I thought I'd just quickly spin up a container and take a look out of interest, but now it looks like I'll have to go dig into building my own PHP packages, or compiling my own version from scratch to even begin to look at things?

by tzs

0 subcomment

You should say in the output formatting rules that the pages should be output in the order that the pages are in the input file. Currently it only specifies the order of the visits within the entry for each page.

by tveita

1 subcomments

> Also, the generator will use a seeded randomizer so that, for local development, you work on the same dataset as others
Except that the generator script generates dates relative to time() ?

by Retr0id

1 subcomments

How large is a sample 100M row file in bytes? (I tried to run the generator locally but my php is not bleeding-edge enough)

by csjh

1 subcomments

Obligatory DuckDB solution:
> duckdb -s "COPY (SELECT url[20:] as url, date, count(*) as c FROM read_csv('data.csv', columns = { 'url': 'VARCHAR', 'date': 'DATE' }) GROUP BY url, date) TO 'output.json' (ARRAY)"
Takes about 8 seconds on my M1 Macbook. JSON not in the right format, but that wouldn't dominate the execution time.

by matei88

0 subcomment

It reminds me of a good read about optimizing PHP for 1 billion rows challenge. TLDR; at some point you hit a limit in PHP’s stream layer
https://dev.to/realflowcontrol/processing-one-billion-rows-i...

by poizan42

1 subcomments

> The output should be encoded as a pretty JSON string.
...
> Your parser should store the following output in $outputPath as a JSON file:
```
    {
        "\/blog\/11-million-rows-in-seconds": {
            "2025-01-24": 1,
            "2026-01-24": 2
        },
        "\/blog\/php-enums": {
            "2024-01-24": 1
        }
    }
```
They don't define what exactly "pretty" means, but superflous escapes are not very pretty in my opinion.

by spiderfarmer

0 subcomment

Awesome. I’ll be following this. I’ll probably learn a ton.

by lofaszvanitt

0 subcomment

Submit at the very end, so others wouldn't know you have a better solution.

by wangzhongwang

0 subcomment

[dead]

by tomaytotomato

1 subcomments

Tempted to submit a Java app wrapped in PHP exec() :D