I look at memory profiles of normal apps and often think "what is burning that memory?"
Modern compression works so well, so what's happening? Open your task manager and look through your apps and you might ask yourself this.
For example (let's ignore Chrome, MS Teams and all the other bloat): Sublime Text consumes 200 MB. I have 4 text files open. What is it doing?
Chrome alone took YEARS to implement tab suspend, despite everyone being aware of the issue, and despite add-ons that could already do it.
I bought more RAM just for Chrome...
Memory and storage are not "cheap" anymore. Power may also rise in cost.
Under these conditions, memory usage and binary size are irrefutably relevant^1
To some, this might feel like going backwards in time toward the mainframe era. Another current HN item with over 100 points, "Hold on to your hardware", reflects on how consumer hardware may change as a result
To me, the past was a time of greater software efficiency; arguably this was necessitated by cost. Perhaps higher costs in the present and future could lead to better software quality. But whether today's programmers are up for the challenge is debatable. It's like young people in finance whose only experience is in a world with "zero" interest rates. It's easier to whine about lowering rates than to adapt
With the money and political support available to "AI" companies, the incentive for efficiency of any kind is lacking. Perhaps their "no limits" operations, e.g., their effects on hardware supply, may provide an incentive for others' efficiency
1. As an underpowered-computer user who compiles his own OS and writes his own simple programs, I've always rejected large binary sizes and excessive memory use, even in times of "abundance"
tr -s '[:space:]' '\n' < file.txt | sort | uniq -c | sort -rn
I’d like to know the memory profile of this. The bottleneck is obviously `sort`, which buffers everything in memory. So if we replace that with awk, using a hash map to keep a count of unique words, it’s a much smaller data set in memory:
tr -s '[:space:]' '\n' < file.txt | awk '{c[$0]++} END{for(w in c) print c[w], w}' | sort -rn
I’m guessing this will beat Python and C++?
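One way to test that guess is to run the pipeline under a wrapper process and read the children's peak resident set size afterwards. A rough sketch (the sample file is made up here; note `ru_maxrss` is in kilobytes on Linux but bytes on macOS):

```python
import resource
import subprocess
import tempfile

# Made-up sample input standing in for file.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("the quick brown fox jumps over the lazy dog the fox\n")
    path = f.name

# Run the pipeline in a shell; rusage of terminated children
# (and their waited-for descendants) rolls up into RUSAGE_CHILDREN.
pipeline = f"tr -s '[:space:]' '\\n' < {path} | sort | uniq -c | sort -rn"
out = subprocess.run(pipeline, shell=True, check=True,
                     capture_output=True, text=True)
print(out.stdout)

# Largest peak RSS among the waited-for children, in kB on Linux.
peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"peak child RSS: {peak_kb} kB")
```

Roughly the same figure is what GNU `time -v` reports as "Maximum resident set size", without any Python involved.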
> Peak memory consumption is 1.3 MB. At this point you might want to stop reading and make a guess on how much memory a native code version of the same functionality would use.
I wish I knew the input size when attempting to estimate, but I suppose part of the challenge is also estimating the runtime's startup memory usage too.
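On the startup-memory point, there is a quick way to see the interpreter's own baseline from inside a script (again, `ru_maxrss` is kilobytes on Linux, bytes on macOS):

```python
import resource

# Peak resident set size of this process so far; for a script that
# hasn't done any real work yet, this approximates startup cost.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"interpreter peak RSS so far: {peak_kb} kB")
```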
> Compute the result into a hash table whose keys are string views, not strings
If the file is mmap'd, and the string view points into that, presumably decent performance depends on the page cache having those strings in RAM. Is that included in the memory usage figures?
Nonetheless, it's a nice optimization that the kernel chooses which hash table keys to keep hot.
The other perspective on this is that we sought out languages like Python/Ruby because development cost was high relative to hardware. Hardware is now more expensive, but development costs are lower too.
The takeaway: expect more of a push towards efficiency!
I wonder if frameworks like dotnet or JVM will introduce reference counting as a way to lower the RAM footprint?
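For what it's worth, CPython has done this from the start: every object carries a reference count (with a separate collector for cycles), which is why most Python garbage is reclaimed promptly. A tiny illustration:

```python
import sys

x = [1, 2, 3]
# getrefcount includes one extra reference: its own argument.
before = sys.getrefcount(x)
y = x          # aliasing the list bumps its reference count by one
after = sys.getrefcount(x)
print(before, after)
```

Whether .NET or the JVM would ever take that route is another question: refcounting trades throughput (every assignment touches a counter) for promptness of reclamation.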
import re, operator

def count_words(filename):
    with open(filename, 'rb') as fp:
        data = memoryview(fp.read())
        word_counts = {}
        for match in re.finditer(br'\S+', data):
            word = data[match.start():match.end()]
            try:
                word_counts[word] += 1
            except KeyError:
                word_counts[word] = 1
        word_counts = sorted(word_counts.items(), key=operator.itemgetter(1), reverse=True)
        for word, count in word_counts:
            print(word.tobytes().decode(), count)
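A variant of this that maps the file instead of reading it all up front, so the bytes stay in the page cache rather than a private heap copy. This is my own sketch, not the author's code; each word is copied out as `bytes`, since keeping `memoryview` slices into the mapping as dict keys would stop the mmap from closing cleanly:

```python
import mmap
import re
from collections import Counter

def count_words_mmap(filename):
    counts = Counter()
    with open(filename, 'rb') as fp:
        # Length 0 maps the whole file; ACCESS_READ keeps it read-only.
        with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The re module accepts any bytes-like buffer, mmap included.
            for match in re.finditer(br'\S+', mm):
                counts[match.group()] += 1  # copies the word to bytes
    return counts
```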
We could also use `mmap.mmap`.

- since GC languages became prevalent, and maybe high-level programming in general, coders aren't as economical with their designs. Memory isn't something a coder should worry about, apparently.
- far more people code apps in web languages because they don't know anything else. These are anywhere from 5-10 levels of abstraction away from the metal, and naturally inefficient.
- increasing scope... I can only describe this one by example: web browsers must implement all manner of standards etc., to the point that it's become a mammoth task, especially compared to the 90s. Same for compilers, OSes; heck, even computers themselves were one-man jobs at some point, because things were simpler, since we knew less.
But it's not necessarily an apples-to-apples comparison. It's not unfair to Python because of the runtime overhead; it's unfair because it's a different algorithm with fundamentally different memory characteristics.
A fairer comparison would be to stream the file in C++ as well and maintain internal state for the count. For most people, that would also be the first, naive approach to programming something like this, I think. And it would show what the actual overhead of the Python version is.
I don't know if the implementation is written in a "low-level" way to be more accessible to users of other programming languages, but it can certainly be done more simply leveraging the standard library:
from collections import Counter
import sys
with open(sys.argv[1]) as f:
    words = Counter(word for line in f for word in line.split())
for word, count in words.most_common():
    print(count, word)
At the very least, manually creating a (count, word) list from the dict items and then sorting and reversing it in-place is ignoring common idioms. `sorted` creates a copy already, and it can be passed a sort key and an option to sort in reverse order. A pure dict version could be:

import sys
with open(sys.argv[1]) as f:
    counts = {}
    for line in f:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
stats = sorted(counts.items(), key=lambda item: item[1], reverse=True)
for word, count in stats:
    print(count, word)
(No, of course none of this is going to improve memory consumption meaningfully; maybe it's even worse, although intuitively I expect it to make very little difference either way. But I really feel like if you're going to pay the price for Python, you should get this kind of convenience out of it.)

Anyway, none of this is exactly revelatory. I was hoping we'd see some deeper investigation of what is actually being allocated. (Although I guess really the author's goal is to promote this Pystd project. It does look pretty neat.)
Rust is high-level enough to still be fun for me (tokio gives me most of the concurrency goodies I like), but the memory usage is often like 1/10th or less compared to what I would write in Clojure.
Even though I love me some lisp, pretty much all my Clojure utilities are in Rust land now.
Would the Rust compiler use much more memory than the C++ compiler when compiling a comparable program?
It's certainly a lot better than 1000x, sure, but still surprised me.
Nice post.
(P.S. I'm also Finnish)
import sys
from collections import Counter

stats = Counter(x for l in open(sys.argv[1]) for x in l.split())
The 12 MB OS looks surprisingly mature. We are so conditioned to bloat that with each click I'm surprised how fast it responds. I don't remember ever being surprised by the same thing twice in a row, but here it stays surprising how everything opens in the next frame, even on very poor hardware.
Besides running it in a VM, I read there is a Synergy[3] client for it[4]. I've used crappy PCs as extra screens, and having a dedicated machine for a single application is fun and useful, and it makes old stuff useful. You can run heavy applications on the main computer; it doesn't matter that the extra one can't.
[0] - https://www.youtube.com/watch?v=v3NVKOsWkQs
[1] - https://www.kolibrios.org/en
The ultimate bittersweet revenge would be to run our algorithms inside the RAM owned by these cloud companies. Should be possible using free accounts.
Native to what? How is C++ more native than Python?
delaying comp sci differentiation for a few months
I wonder if assembly-based solutions will come into vogue