Hacker News

Python numbers every programmer should know

426 points by WoodenChair

by thundergolfer

8 subcomments

A lot of people here are commenting that if you have to care about specific latency numbers in Python you should just use another language.
I disagree. A lot of important and large codebases were grown and maintained in Python (Instagram, Dropbox, OpenAI) and it's damn useful to know how to reason your way out of a Python performance problem when you inevitably hit one without dropping out into another language, which is going to be far more complex.
Python is a very useful tool, and knowing these numbers just makes you better at using the tool. The author is a Python Software Foundation Fellow. They're great at using the tool.
In the common case, a performance problem in Python is not the result of hitting the limit of the language but the result of sloppy un-performant code, for example unnecessarily calling a function O(10_000) times in a hot loop.
I wrote up a more focused "Python latency numbers you should know" as a quiz here https://thundergolfer.com/computers-are-fast

by fooker

6 subcomments

Counterintuitively: program in python only if you can get away without knowing these numbers.
When this starts to matter, python stops being the right tool for the job.

by zelphirkalt

2 subcomments

I doubt there is much to gain from knowing how much memory an empty string takes. The article or the listed numbers have a weird fixation on memory usage numbers and concrete time measurements. What is way more important to "every programmer" is time and space complexity, in order to avoid designing unnecessarily slow or memory hungry programs. Under the assumption of using Python, what is the use of knowing that your int takes 28 bytes? In the end you will have to determine whether the program you wrote meets the performance criteria you have, and if it does not, then you need a smarter algorithm or way of dealing with data. It helps very little to know that your 2d-array of 1000x1000 bools is so and so big. What helps is knowing whether it is too much and maybe you should switch to using a large integer and a bitboard approach. Or switch language.

by Aurornis

4 subcomments

A meta-note on the title since it looks like it’s confusing a lot of commenters: The title is a play on Jeff Dean’s famous “Latency Numbers Every Programmer Should Know” from 2012. It isn’t meant to be interpreted literally. There’s a common theme in CS papers and writing to write titles that play upon themes from past papers. Another common example is the “_____ considered harmful” titles.

by willseth

3 subcomments

Every Python programmer should be thinking about far more important things than low level performance minutiae. Great reference but practically irrelevant except in rare cases where optimization is warranted. If your workload grows to the point where this stuff actually matters, great! Until then it’s a distraction.

by f311a

0 subcomment

```
   > Strings
   > The rule of thumb for strings is the core string object takes 41 bytes. Each additional character is 1 byte.
```
That's misleading. There are three types of strings in Python (1, 2 and 4 bytes per character).
https://rushter.com/blog/python-strings-and-memory/
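A rough way to see the three widths on CPython is to compare `sys.getsizeof` for two lengths of a string built from one repeated character, so the fixed header cancels out. This is only a sketch: the helper name `bytes_per_char` is made up here, and the results are CPython implementation details (PEP 393), not language guarantees.

```python
import sys

def bytes_per_char(ch: str, n: int = 100) -> float:
    """Marginal storage cost of one more copy of `ch`, measured via sys.getsizeof."""
    # Subtracting two sizes cancels the fixed per-object header.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n

samples = [
    ("ASCII", "a"),
    ("Latin-1 (a-umlaut)", "\u00e4"),
    ("BMP (Greek omega)", "\u03a9"),
    ("astral (emoji)", "\U0001F600"),
]
for label, ch in samples:
    print(f"{label:20} {bytes_per_char(ch):.0f} byte(s) per character")
```

Note that an umlaut still fits the 1-byte representation because it lies in the Latin-1 range; only the fixed header differs from the pure-ASCII case.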

by riazrizvi

0 subcomment

The titles are oddly worded. For example -
```
  Collection Access and Iteration
  How fast can you get data out of Python’s built-in collections? Here is a dramatic example of how much faster the correct data structure is. item in set or item in dict is 200x faster than item in list for just 1,000 items!
```
It seems to suggest an iteration for x in mylist is 200x slower than for x in myset. It’s the membership test that is much slower. Not the iteration. (Also for x in mydict is an iteration over keys not values, and so isn’t what we think of as an iteration on a dict’s ‘data’).
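To make the distinction concrete, here is a small sketch that times the two operations separately with `timeit`. The 1,000-item size mirrors the quoted example; the `iterate` helper and the choice of a missing key are assumptions of this sketch, not the article's benchmark code.

```python
import timeit

items = list(range(1_000))
as_list, as_set = items, set(items)
missing = -1  # worst case for the list: every element gets compared

def iterate(collection) -> int:
    """Visit every element once; this is O(n) for both lists and sets."""
    count = 0
    for _ in collection:
        count += 1
    return count

for name, coll in [("list", as_list), ("set", as_set)]:
    member = timeit.timeit(lambda: missing in coll, number=10_000)
    loop = timeit.timeit(lambda: iterate(coll), number=1_000)
    print(f"{name}: membership {member:.4f}s, iteration {loop:.4f}s")
```

On a typical run the membership column should differ by a large factor while the iteration column stays in the same ballpark.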
Also the overall title “Python Numbers Every Programmer Should Know” starts with 20 numbers that are merely interesting.
That all said, the formatting is nice and engaging.

by robertclaus

0 subcomment

I liked reading through it from a "is modern Python doing anything obviously wrong?" perspective, but strongly disagree anyone should "know" these numbers. There's like 5-10 primitives in there that everyone should know rough timings for; the rest should be derived with big-O algorithm and data structure knowledge.

by sjducb

2 subcomments

It’s missing the time taken to instantiate a class.
I remember refactoring some code to improve readability, then observing something that was previously a few microseconds take tens of seconds.
The original code created a large list of lists. Each child list had 4 fields, and each field was a different thing: some were ints and one was a string.
I created a new class with the names of each field and helper methods to process the data. The new code created a list of instances of my class. Downstream consumers of the list could look at the class to see what data they were getting. Modern Python developers would use a data class for this.
The new code was very slow. I’d love it if the author measured the time taken to instantiate a class.
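In the meantime, a measurement like this is easy to run locally. The sketch below assumes a record with three ints and one string, loosely modeled on the description above; the class names `Record` and `RecordDC` are hypothetical.

```python
import timeit
from dataclasses import dataclass

class Record:
    def __init__(self, a, b, c, name):
        self.a, self.b, self.c, self.name = a, b, c, name

@dataclass
class RecordDC:
    a: int
    b: int
    c: int
    name: str

# Each candidate builds one "row" holding the same four fields.
candidates = {
    "list": lambda: [1, 2, 3, "x"],
    "class": lambda: Record(1, 2, 3, "x"),
    "dataclass": lambda: RecordDC(1, 2, 3, "x"),
}
runs = 200_000
for label, make in candidates.items():
    seconds = timeit.timeit(make, number=runs)
    print(f"{label:10} {seconds / runs * 1e9:7.1f} ns per object")
```

Instantiating a class is expected to cost more than building a list literal, but a jump from microseconds to tens of seconds would usually point at something else in the refactor, such as extra work done per instance.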

by perrygeo

2 subcomments

> small int (0-256) cached
It's -5 to 256, and these have very tricky behavior for programmers that confuse identity and equality.
```
  >>> a = -5
  >>> b = -5
  >>> a is b
  True
  >>> a = -6
  >>> b = -6
  >>> a is b
  False
```
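Outside the REPL the same experiment can behave differently, because literals compiled in one module may be shared. A sketch that sidesteps this by building the ints at run time (the helper `same_object` is made up for illustration, and the caching itself is a CPython implementation detail):

```python
def same_object(text: str) -> bool:
    """Parse the same text twice and report whether both results are one object."""
    a = int(text)
    b = int(text)
    return a is b

print(same_object("-5"), same_object("256"))  # inside CPython's small-int cache
print(same_object("1000000"))                 # well outside the cache
print(int("1000000") == int("1000000"))       # equality is what you usually want
```

The takeaway is to compare numbers with `==` and keep `is` for singletons such as `None`.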

by boerseth

2 subcomments

That's a long list of numbers that seem oddly specific. Apart from learning that f-strings are way faster than the alternatives, and certain other comparisons, I'm not sure what I would use this for day-to-day.
After skimming over all of them, it seems like most "simple" operations take on the order of 20ns. I will leave with that rule of thumb in mind.

by xnx

0 subcomment

Python programmers don't need to know 85 different obscure performance numbers. Better to really understand ~7 general system performance numbers.

by mikeckennedy

1 subcomments

Author here.
Thanks for the feedback everyone. I appreciate you posting it, @woodenchair, and @aurornis for pointing out the intent of the article.
The idea of the article is NOT to suggest you should shave 0.5ns off by choosing some dramatically different algorithm or that you really need to optimize the heck out of everything.
In fact, I think a lot of what the numbers show is that overthinking the optimizations often isn't worth it (e.g. caching len(coll) into a variable rather than calling it over and over is less useful than it might seem conceptually).
Just write clean Python code. So much of it is way faster than you might have thought.
My goal was only to create a reference to what various operations cost to have a mental model.

by ktpsns

1 subcomments

Nice numbers, and it's always worth knowing an order of magnitude. But these charts are far away from what "every programmer should know".

by ZiiS

0 subcomment

This is a really weird thing to worry about in Python. But it is also misleading: Python ints are arbitrary precision, so they can take up much more storage and arithmetic time depending on their value.
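A quick sketch of the storage side of that point, using `sys.getsizeof` on ints of increasing magnitude. The exact byte counts are CPython-specific and can differ between versions, so none are hard-coded here.

```python
import sys

# Larger magnitudes need more internal "digits", so the object grows.
for value in (1, 2**30, 2**64, 2**1000):
    print(f"{value.bit_length():5d} bits -> {sys.getsizeof(value)} bytes")
```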

by andai

1 subcomments

The one I noticed the most was import openai and import numpy.
They're both about a full second on my old laptop.
I ended up writing my own simple LLM library just so I wouldn't have to import OpenAI anymore for my interactive scripts.
(It's just some wrapper functions around the equivalent of a curl request, which is honestly basically everything I used the OpenAI library for anyway.)
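For anyone who wants to check their own imports, here is a minimal sketch that times a first import from inside Python. The helper name `time_import` is hypothetical, and `colorsys` is only a small standard-library stand-in for heavier packages such as numpy or openai.

```python
import importlib
import sys
import time

def time_import(module_name: str) -> float:
    """Seconds spent importing `module_name`.

    Only the first import in a process is meaningful; a module already in
    sys.modules comes back from the cache almost instantly.
    """
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start

already_cached = "colorsys" in sys.modules
elapsed = time_import("colorsys")
print(f"colorsys: {elapsed * 1000:.3f} ms (cached beforehand: {already_cached})")
```

CPython also has a built-in breakdown per module: `python -X importtime -c "import numpy"` prints cumulative import times to stderr.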

by calmbonsai

0 subcomment

You absolutely do not need to know those absolute numbers--only the relative costs of various operations.
Additionally, regardless of the code you can profile the system to determine where the "hot spots" are and refactor or call-out to more performant (Rust, Go, C) run-times for those workflows where necessary.

by HenriTEL

0 subcomment

The goal of the article is not to know the exact numbers by heart, duh!
Care about orders of magnitude instead. In combination with the speed of hardware (https://gist.github.com/jboner/2841832), you'll have a good understanding of how much overhead is due to the language and which constructs to favor for speed improvements.
Just reading the code should give you a sense of its speed and where it will spend most time. Combined with general timing metrics you can also have a sense of the overhead of 3rd party libraries (pydantic I'm looking at you).
So yeah, I find that list quite useful during code design; it likely reduces time spent profiling slow code in prod.

by cma256

0 subcomment

Great catalogue. On the topic of msgspec, since pydantic is included it may be worth including a bench for de-serializing and serializing from a msgspec struct.

by tgv

1 subcomments

I doubt list and string concatenation operate in constant time, or else they affect another benchmark. E.g., you can concatenate two lists in the same time, regardless of their size, but at the cost of slower access to the second one (or both).
More contentiously: don't fret too much over performance in Python. It's a slow language (except for some external libraries, but that's not the point of the OP).

by sireat

0 subcomment

Interesting information but these are not hard numbers.
Surely the 100-char string information of 141 bytes is not correct as it would only apply to ASCII 100-char strings.
It would be more useful to know the overhead for Unicode strings, presumably UTF-8 encoded. And again I would presume a 100-emoji string would take 441 bytes (just a hypothesis) and a 100-umlaut string would take 241 bytes.

by pvtmert

0 subcomment

There are lots of discussions about the relevance of these numbers for a regular software engineer.
Firstly, I want to start with the fact that the base system is a macOS/M4Pro, hence;
- Memory related access is possibly much faster than a x86 server.
- Disk access is possibly much slower than a x86 server.
*) I took x86 server as the basis as most of the applications run on x86 Linux boxes nowadays, although a good amount of fingerprint is also on other ARM CPUs.
Although it probably does not change the memory footprint much, the libraries loaded and their architecture (ie. being Rosetta or not) will change the overall footprint of the process.
As it was mentioned on one of the sibling comments -> Always inspect/trace your own workflow/performance before making assumptions. It all depends on specific use-cases for higher-level performance optimizations.

by charlieyu1

0 subcomment

Surprised that list comprehensions are only 26% faster than for loops. It used to feel like 4-5x

by jchmbrln

1 subcomments

What would be the explanation for an int taking 28 bytes but a list of 1000 ints taking only 7.87KB?
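One likely explanation, assuming the article measured with `sys.getsizeof`: that call is shallow, so for a list it counts the list header plus one pointer per element, not the int objects themselves. On a 64-bit CPython build that is 56 + 1000 × 8 = 8056 bytes, which is about 7.87 KB. A sketch that shows both the shallow and the "deep" figure:

```python
import struct
import sys

numbers = list(range(1000))
pointer_size = struct.calcsize("P")   # 8 on a 64-bit build
shallow = sys.getsizeof(numbers)      # list header + one pointer per slot
deep = shallow + sum(sys.getsizeof(n) for n in numbers)  # add the ints too

print(f"pointer size: {pointer_size} bytes")
print(f"list only:    {shallow} bytes ({shallow / 1024:.2f} KB)")
print(f"list + ints:  {deep} bytes ({deep / 1024:.2f} KB)")
```

The deep figure is itself an approximation, since small ints are shared objects in CPython and are not owned by any single list.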

by boutell

0 subcomment

Knowing all of these is exactly what a developer shouldn't need to do. Fix "big O" problems in your own code. And be aware of a few exceptionally weird counterintuitive things if it matters on a "big O" level — like "you think this common operation is O(1) but it's actually O(N^2)". If there actually are any of those. And just get stuff done.
I guess you could find yourself in a situation where a 2X speedup is make or break and you're not a week away from needing 4X, etc. But not very often.

by oogali

0 subcomment

It's important to know that these numbers will vary based on what you're measuring, your hardware architecture, and how your particular Python binary was built.
For example, my M4 Max running Python 3.14.2 from Homebrew (built, not poured) takes 19.73MB of RAM to launch the REPL (running `python3` at a prompt).
The same Python version launched on the same system with a single invocation for `time.sleep()`[1] takes 11.70MB.
My Intel Mac running Python 3.14.2 from Homebrew (poured) takes 37.22MB of RAM to launch the REPL and 9.48MB for `time.sleep`.
My number for "how much memory it's using" comes from running `ps auxw | grep python`, taking the value of the resident set size (RSS column), and dividing by 1,024.
1: python3 -c 'from time import sleep; sleep(100)'

by snakepit

0 subcomment

This is helpful. Someone should create a similar benchmark for the BEAM. This is also a good reminder to continue working on snakepit [1] and snakebridge [2]. Plenty remains before they're suitable for prime time.
[1] https://hex.pm/packages/snakepit
[2] https://hex.pm/packages/snakebridge

by CmdrKrool

2 subcomments

I'm confused by this:

  String operations in Python are fast as well. f-strings are the fastest formatting style, while even the slowest style is still measured in just nano-seconds.
  
  Concatenation (+)   39.1 ns (25.6M ops/sec)
  f-string            64.9 ns (15.4M ops/sec)

It says f-strings are fastest but the numbers show concatenation taking less time? I thought it might be a typo but the bars on the graph reflect this too?
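The ranking probably depends on what exactly is being formatted, and I don't know which expressions the article timed. A sketch for checking on your own machine, with made-up inputs (`name`, `count`) and four styles that all produce the same string:

```python
import timeit

name, count = "world", 3

def concat() -> str:
    return "hello " + name + " " + str(count)

def fstring() -> str:
    return f"hello {name} {count}"

def percent() -> str:
    return "hello %s %d" % (name, count)

def dot_format() -> str:
    return "hello {} {}".format(name, count)

runs = 500_000
for fn in (concat, fstring, percent, dot_format):
    ns = timeit.timeit(fn, number=runs) / runs * 1e9
    print(f"{fn.__name__:10} {ns:6.1f} ns")
```

Plain concatenation of two existing strings does no formatting work at all, so it can beat an f-string in a narrow benchmark without contradicting the claim that f-strings are the fastest formatting style.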

by mopsi

0 subcomment

It is always a good idea to have at least a rough understanding of how much operations in your code cost, but sometimes very expensive mistakes end up in non-obvious places.
If I have only plain Python installed and a .py file that I want to test, then what's the easiest way to get a visualization of the call tree (or something similar) and the computational cost of each item?
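With only the standard library, `cProfile` plus `pstats` gets you a text report of per-function cost and callers, though not a graphical tree. A minimal sketch, where `slow_square_sum` and `main` are placeholder functions standing in for the code under test:

```python
import cProfile
import io
import pstats

def slow_square_sum(n: int) -> int:
    return sum(i * i for i in range(n))

def main() -> int:
    return sum(slow_square_sum(2_000) for _ in range(50))

profiler = cProfile.Profile()
profiler.enable()
result = main()
profiler.disable()

buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(10)    # ten most expensive entries by cumulative time
stats.print_callers(5)   # which functions called the hot ones
report = buffer.getvalue()
print(report)
```

For a whole script, `python -m cProfile -s cumulative your_script.py` produces the same kind of table without touching the code.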

by belabartok39

1 subcomments

Hmmmm, there should absolutely be standard deviations for this type of work. Also, what is N number of runs? Does it say somewhere?
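For reference, getting a spread out of the standard library is cheap. The sketch below uses `timeit.repeat` and `statistics`; the `summarize` helper, the repeat counts, and the benchmarked statement are all choices made for this example, not the article's methodology.

```python
import statistics
import timeit

def summarize(stmt: str, setup: str = "pass", repeat: int = 20, number: int = 100_000):
    """Return (mean, stdev, best) in nanoseconds per execution of `stmt`."""
    runs = timeit.repeat(stmt, setup=setup, repeat=repeat, number=number)
    per_op = [r / number * 1e9 for r in runs]
    return statistics.mean(per_op), statistics.stdev(per_op), min(per_op)

mean, stdev, best = summarize("x in s", setup="s = set(range(1000)); x = 999")
print(f"mean {mean:.1f} ns, stdev {stdev:.1f} ns, best {best:.1f} ns over 20 runs")
```

The `timeit` documentation suggests the minimum is often the most useful single figure, since higher values tend to reflect interference from other processes, but reporting the spread alongside it makes that visible.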

by esafak

1 subcomments

The point of the original list was that the numbers were simple enough to memorize: https://gist.github.com/jboner/2841832
Nobody is going to remember any of the numbers on this new list.

by gcanyon

0 subcomment

As someone who most often works in a language that is literally orders of magnitude slower than this (and has done so since CPU speeds were measured in double-digit megahertz), I am crying at the notion that anything here is measured in nanoseconds.

by dr_kretyn

0 subcomment

Initially I thought how efficient strings are... but then I understood how inefficient arithmetic is. Interesting comparison, but exact speed and IO depend on a lot of things, and it's unlikely anyone uses a Mac mini in production, so these numbers definitely aren't representative.

by superlopuh

1 subcomments

I'm surprised that the `isinstance()` comparison is with `type() == type` and not `type() is type`, which I would expect to be faster, since the `==` implementation tends to have an `isinstance` call anyway.

by mwkaufma

2 subcomments

Why? If those micro benchmarks mattered in your domain, you wouldn't be using python.

by nodja

0 subcomment

I think a lot of commenters here are missing the point.
Looking at performance numbers is important regardless of whether it's Python, assembly or HDL. If you don't understand why your code is slow, you can always look at how many cycles things take and learn to understand how code works at a deeper level. As you mature as a programmer things will become obvious, but going through the learning process and having references like these will help you get there sooner. Seeing the performance numbers and asking why some things take much longer (or sometimes why they take the exact same time) is the perfect opportunity to learn.
Early in my python career I had a python script that found duplicate files across my disks, the first iteration of the script was extremely slow, optimizing the script went through several iterations as I learned how to optimize at various levels. None of them required me to use C. I just used caching, learned to enumerate all files on disk fast, and used sets instead of lists. The end result was that doing subsequent runs made my script run in 10 seconds instead of 15 minutes. Maybe implementing in C would make it run in 1 second, but if I had just assumed my script was slow because of python then I would've spent hours doing it in C only to go from 15 minutes to 14 minutes and 51 seconds.
There's an argument to be made that it would be useful to see C numbers next to the python ones, but for the same reason people don't just tell you to just use an FPGA instead of using C, it's also rude to say python is the wrong tool when often it isn't.

by woodruffw

0 subcomment

Great reference overall, but some of these will diverge in practice: 141 bytes for a 100 char string won’t hold for non-ASCII strings for example, and will change if/when the object header overhead changes.

by JBits

0 subcomment

One of the reasons I'm really excited about JAX is that I hope it will allow me to write fast Python code without worrying about these details.

by intalentive

0 subcomment

Great resource. I would also like to see a comparison of variable access times across different scopes.

by lunixbochs

0 subcomment

I'm confused why they repeatedly call a slots class larger than a regular dict class, but don't count the size of the dict
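A sketch of the accounting this comment is asking for, assuming sizes are taken with `sys.getsizeof`: the regular instance's `__dict__` is added explicitly, while the slotted instance has none. The class names are hypothetical and the byte counts vary by CPython version.

```python
import sys

class Regular:
    def __init__(self):
        self.x, self.y = 1, 2

class Slotted:
    __slots__ = ("x", "y")

    def __init__(self):
        self.x, self.y = 1, 2

regular, slotted = Regular(), Slotted()
# getsizeof is shallow, so the per-instance attribute dict must be added by hand.
regular_total = sys.getsizeof(regular) + sys.getsizeof(regular.__dict__)
slotted_total = sys.getsizeof(slotted)  # no per-instance __dict__ to add

print(f"regular: {sys.getsizeof(regular)} bytes alone, {regular_total} bytes with __dict__")
print(f"slotted: {slotted_total} bytes")
```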

by Retr0id

0 subcomment

> Numbers are surprisingly large in Python
Makes me wonder if the cpython devs have ever considered v8-like NaN-boxing or pointer stuffing.

by jiggawatts

0 subcomment

My god, the memory bloat is out of this world compared to platforms like the JVM or .NET, let alone C++ or Rust!

by Y_Y

0 subcomment

int is larger than float, but list of floats is larger than list of ints
Then again, if you're worried about any of the numbers in this article maybe you shouldn't be using Python at all. I joke, but please do at least use Numba or Numpy so you aren't paying huge overheads for making an object of every little datum.

by Redoubts

0 subcomment

> Attribute read (obj.x) 14 ns
note that protobuf attributes are 20-50x worse than this

by rozab

0 subcomment

I wonder why an empty set takes so much more memory than an empty dict

by m3047

0 subcomment

+1 but I didn't see pack / unpack...

by lcnmrn

0 subcomment

LLMs can improve Python code performance. I used it myself on a few projects.

by iamnotsure

0 subcomment

Exactly wrong.

by zbentley

0 subcomment

I have some questions and requests for clarification/suspicious behavior I noticed after reviewing the results and the benchmark code, specifically:
- If slotted attribute reads and regular attribute reads are the same latency, I suspect that either the regular class may not have enough "bells on" (inheritance/metaprogramming/dunder overriding/etc) to defeat simple optimizations that cache away attribute access, thus making it equivalent in speed to slotted classes. I know that over time slotting will become less of a performance boost, but--and this is just my intuition and I may well be wrong--I don't get the impression that we're there yet.
- Similarly "read from @property" seems suspiciously fast to me. Even with descriptor-protocol awareness in the class lookup cache, the overhead of calling a method seems surprisingly similar to the overhead of accessing a field. That might be explained away by the fact that property descriptors' "get" methods are guaranteed to be the simplest and easiest to optimize of all call forms (bound method, guaranteed to never be any parameters), and so the overhead of setting up the stack/frame/args may be substantially minimized...but that would only be true if the property's method body was "return 1" or something very fast. The properties tested for these benchmarks, though, are looking up other fields on the class, so I'd expect them to be a lot slower than field access, not just a little slower (https://github.com/mikeckennedy/python-numbers-everyone-shou...).
- On the topic of "access fields of objects" (properties/dataclasses/slots/MRO/etc.), benchmarks are really hard to interpret--not just these benchmarks, all of them I've seen. That's because there are fundamentally two operations involved: resolving a field to something that produces data for it, and then accessing the data. For example, a @property is in a class's method cache, so resolving "instance.propname" is done at the speed of the methcache. That might be faster than accessing "instance.attribute" (a field, not a @property or other descriptor), depending on the inheritance geometry in play, slots, __getattr[ibute]__ overrides, and so on. On the other hand, accessing the data at "instance.propname" is going to be a lot more expensive for most @properties (because they need to call a function, use an argument stack, and usually perform other attribute lookups/call other functions/manipulate locals, etc); accessing data at "instance.attribute" is going to be fast and constant-time--one or two pointer-chases away at most.
- Nitty: why's pickling under file I/O? Those benchmarks aren't timing pickle functions that perform IO, they're benchmarking the ser/de functionality and thus should be grouped with json/pydantic/friends above.
- Asyncio's no spring chicken, but I think a lot of the benchmarks listed tell a worse story than necessary, because they don't distinguish between coroutines, Tasks, and Futures. Coroutines are cheap to have and call, but Tasks and Futures have a little more overhead when they're used (even fast CFutures) and a lot more overhead to construct since they need a lot more data resources than just a generator function (which is kinda what a raw coroutine desugars to, but that's not as true as most people think it is...another story for another time). Now, "run_until_complete()" and "gather()" initially take their arguments and coerce them into Tasks/Futures--that detection, coercion, and construction takes time and consumes a lot of overhead. That's good to know (since many people are paying that coercion tax unknowingly), but it muddies the boundary between "overhead of waiting for an asyncio operation to complete" and "overhead of starting an asyncio operation". Either calling the lower-level functions that run_until_complete()/gather() use internally, or else separating out benchmarks into ones that pass Futures/Tasks/regular coroutines might be appropriate.
- Benchmarking "asyncio.sleep(0)" as a means of determining the bare-minimum await time of a Python event loop is a bad idea. sleep(0) is very special (more details here: https://news.ycombinator.com/item?id=46056895) and not representative. To benchmark "time it takes for the event loop to spin once and produce a result"/the python equivalent of process.nextTick, it'd be better to use low-level loop methods like "call_soon" or defer completion to a Task and await that.
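As a rough illustration of the last point, here is one way to time an await that is completed by the event loop through `call_soon` instead of relying on `asyncio.sleep(0)`. The helper names `one_loop_turn` and `measure` are made up for this sketch, and the figure it prints includes the cost of creating a Future, so treat it as an upper bound on a single loop turn.

```python
import asyncio
import time

async def one_loop_turn() -> None:
    """Await a Future that the event loop itself completes via call_soon."""
    loop = asyncio.get_running_loop()
    fut = loop.create_future()
    loop.call_soon(fut.set_result, None)  # runs on a later loop iteration
    await fut

async def measure(n: int = 10_000) -> float:
    """Average seconds per awaited loop turn over `n` iterations."""
    start = time.perf_counter()
    for _ in range(n):
        await one_loop_turn()
    return (time.perf_counter() - start) / n

per_turn = asyncio.run(measure())
print(f"about {per_turn * 1e6:.2f} microseconds per awaited loop turn")
```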

by 867-5309

0 subcomment

tfa mentions running benchmark on a multi-core platform, but doesn't mention if benchmark results used multithreading.. a brief look at the code suggests not

by _ZeD_

1 subcomments

Yeah... No. I have 10+ years of Python under my belt and I might have needed this kind of micro-optimization 2 times at most.

by ewuhic

1 subcomments

This is AI slop.