That's why I built a performance challenge for the PHP community
The goal of this challenge is to parse 100 million rows of data with PHP, as efficiently as possible. The challenge will run for about two weeks, and at the end there are some prizes for the best entries (amongst the prize is the very sought-after PhpStorm Elephpant, of which we only have a handful left).
I hope people will have fun with it :)
{
"\/blog\/11-million-rows-in-seconds": {
"2025-01-24": 1,
"2026-01-24": 2
},
"\/blog\/php-enums": {
"2024-01-24": 1
}
}Go is the most battery-included language I've ever used. Instant compile times means I can run tests bound to ctrl/cmd+s every time I save the file. It's more performant (way less memory, similar CPU time) than C# or Java (and certainly all the scripting languages) and contains a massive stdlib for anything you could want to do. It's what scripting languages should have been. Anyone can read it just like Python.
Rust takes the last 20% I couldn't get in a GC language and removes it. Sure, it's syntax doesn't make sense to an outsider and you end up with 3rd party packages for a lot of things, but can't beat it's performance and safety. Removes a whole lot of tests as those situations just aren't possible.
If Rust scares you use Go. If Go scares you use Rust.
I was curious what it would take if I approached it the way I do with most CSV transformation tasks that I'm only intending to do once: use Unix command line tools such as cut, sed, sort, and uniq to do the bulk of the work, and then do something in whatever scripting language seems appropriate to put the final output in whatever format is needed.
The first part, using this command [1], produces output lines that look like this:
219,/blog/php-81-before-and-after,2021-06-21
and is sorted by URL path and then date.With 1 million lines that took 9 or 10 seconds (M2 Max Mac Studio). But with 100 million it took 1220 seconds, virtually all of which was sorting.
Turning that into JSON via a shell script [2] was about 15 seconds. (That script is 44% longer than it would have been had JSON allowed a comma after the last element of an array).
So basically 22 minutes. The sorting is the killer with this type of approach, because the input is 7 GB. The output is only 13 MB and the are under 300 pages and the largest page count is under 1000 so building the output up in memory as the unsorted input is scanned and then sorting it would clearly by way way faster.
[1] cut -d / -f 4- | sed -e 's/T..............$//' | sort | uniq -c | sed -e 's/^ *//' -e 's/ /,\//'
[2]
#!/bin/zsh
echo "{"
PAGE=none
while read LINE; do
COLS=("${(@s/,/)LINE}")
COUNT=${COLS[1]}
URL=${COLS[2]}
DATE=${COLS[3]}
if [ $URL != $PAGE ]; then
if [ $PAGE != "none" ]; then
echo
echo " },"
fi
PAGE=$URL
echo " \"\\$URL\": {"
FINISHDATE=no
else
if [ $FINISHDATE = "yes" ]; then
echo ","
fi
fi
echo -n " \"$DATE\": $COUNT"
FINISHDATE=yes
done
echo
echo "}"https://github.com/kjdev/php-ext-jq
And replicate this command:
jq -R ' [inputs | split(",") | {url: .[0], date: .[1] | split("T")[0]}] | group_by(.url) | map({ (.[0].url): ( map(.date) | group_by(.) | map({(.[0]): length}) | add ) }) | add ' < test-data.csv
And it will be faster than anything you can do in native php
Edit: I'm assuming none of the urls have a comma with this but it's more about offloading it through an extension, even if you custom built it
You mention in the README that the goal is to run things in a standard environment, but then you're using a near bleeding edge PHP version that people are unlikely to be using?
I thought I'd just quickly spin up a container and take a look out of interest, but now it looks like I'll have to go dig into building my own PHP packages, or compiling my own version from scratch to even begin to look at things?
Except that the generator script generates dates relative to time() ?
> duckdb -s "COPY (SELECT url[20:] as url, date, count(*) as c FROM read_csv('data.csv', columns = { 'url': 'VARCHAR', 'date': 'DATE' }) GROUP BY url, date) TO 'output.json' (ARRAY)"
Takes about 8 seconds on my M1 Macbook. JSON not in the right format, but that wouldn't dominate the execution time.
https://dev.to/realflowcontrol/processing-one-billion-rows-i...
...
> Your parser should store the following output in $outputPath as a JSON file:
{
"\/blog\/11-million-rows-in-seconds": {
"2025-01-24": 1,
"2026-01-24": 2
},
"\/blog\/php-enums": {
"2024-01-24": 1
}
}
They don't define what exactly "pretty" means, but superflous escapes are not very pretty in my opinion.