Hacker News

The death of thread per core

149 points by ibobev

by bob1029

7 subcomments

I look at cross core communication as a 100x latency penalty. Everything follows from there. The dependencies in the workload ultimately determine how it should be spread across the cores (or not!). The real elephant in the room is that oftentimes it's much faster to just do the whole job on a single core even if you have 255 others available. Some workloads do not care what kind of clever scheduler you have in hand. If everything constantly depends on the prior action you will never get any uplift.
You see this most obviously (visually) in places like game engines. In Unity, the difference between non-burst and burst-compiled code is very extreme. The difference between single and multi core for the job system is often irrelevant by comparison. If the amount of CPU time being spent on each job isn't high enough, the benefit of multicore evaporates. Sending a job to be run on the fleet has a lot of overhead. It has to be worth that one-time 100x latency cost both ways.
The GPU is the ultimate example of this. There are some workloads that benefit dramatically from the incredible parallelism. Others are entirely infeasible by comparison. This is at the heart of my problem with the current machine learning research paradigm. Some ML techniques are terrible at running on the GPU, but it seems as if we've convinced ourselves that GPU is a prerequisite for any kind of ML work. It all boils down to the latency of the compute. Getting data in and out of a GPU takes an eternity compared to L1. There are other fundamental problems with GPUs (warp divergence) that preclude clever workarounds.
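The break-even argument in this comment can be put into a toy cost model. This is only a sketch: the function names, the "pay the dispatch cost twice per job" rule, and the unit of time (one local operation = 1) are assumptions made for illustration, not measurements.

```python
def serial_time(n_jobs: int, work: float) -> float:
    """Wall time when one core runs every job back to back."""
    return n_jobs * work


def parallel_time(n_jobs: int, work: float, dispatch: float, cores: int) -> float:
    """Wall time for independent jobs spread evenly over `cores`.

    Assumption: each job pays `dispatch` twice (sending the job out and
    bringing the result back), the "both ways" cost from the comment.
    """
    jobs_per_core = -(-n_jobs // cores)  # ceiling division
    return jobs_per_core * (work + 2 * dispatch)


def chained_time(n_jobs: int, work: float, dispatch: float) -> float:
    """Wall time when every job depends on the previous one.

    Nothing overlaps, so farming the chain out only adds dispatch cost.
    """
    return n_jobs * (work + 2 * dispatch)


def speedup(n_jobs: int, work: float, dispatch: float, cores: int) -> float:
    """Serial time divided by parallel time; below 1.0 means a slowdown."""
    return serial_time(n_jobs, work) / parallel_time(n_jobs, work, dispatch, cores)
```

With a dispatch cost of 100 units, tiny jobs (`work=1.0`) come out slower on 256 cores than on one, while jobs of 10,000 units scale almost linearly. A dependent chain never beats `serial_time` in this model, whatever the core count.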

by jandrewrogers

3 subcomments

I've worked on several thread-per-core systems that were purpose-built for extreme dynamic data and load skew. They work beautifully at very high scales on the largest hardware. The mechanics of how you design thread-per-core systems that provide uniform distribution of load without work-stealing or high-touch thread coordination have idiomatic architectures at this point. People have been putting thread-per-core architectures in production for 15+ years now and the designs have evolved dramatically.
The architectures from circa 2010 were a bit rough. While the article has some validity for architectures from 10+ years ago, the state-of-the-art for thread-per-core today looks nothing like those architectures and largely doesn't have the issues raised.
News of thread-per-core's demise has been greatly exaggerated. The benefits have measurably increased in practice as the hardware has evolved, especially for ultra-scale data infrastructure.

by vacuity

1 subcomments

There are no hard rules; use principles flexibly.
That being said, there are some things that are generally true for the long term: use a pinned thread per core, maximize locality (of data and code, wherever relevant), use asynchronous programming if performance is necessary. To incorporate the OP, give control where it's due to each entity (here, the scheduler). Cross-core data movement was never the enemy, but unprincipled cross-core data movement can be. If even distribution of work is important, work-stealing is excellent, as long as it's done carefully. Details like how concurrency is implemented (shared-state, here) or who controls the data are specific to the circumstances.
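One way to read "principled" cross-core data movement is that every key has exactly one owning thread and data only crosses threads through that owner's inbox. The sketch below shows that structure only. `run_sharded` and its parameters are hypothetical names, the threads are not pinned (on Linux, `os.sched_setaffinity` could be called at the top of each worker), and CPython's GIL means this illustrates ownership rather than real parallel speed.

```python
import queue
import threading
import zlib


def run_sharded(events, n_workers=4):
    """Route each (key, amount) event to the single worker that owns its key."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    totals = [dict() for _ in range(n_workers)]  # state private to each worker

    def worker(i):
        while True:
            item = inboxes[i].get()
            if item is None:  # sentinel: no more work for this worker
                return
            key, amount = item
            # Only worker i ever touches totals[i], so no lock is needed here.
            totals[i][key] = totals[i].get(key, 0) + amount

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()

    for key, amount in events:
        owner = zlib.crc32(key.encode()) % n_workers  # stable key -> owner mapping
        inboxes[owner].put((key, amount))

    for q in inboxes:
        q.put(None)
    for t in threads:
        t.join()

    merged = {}
    for shard in totals:
        merged.update(shard)  # shards hold disjoint keys, so nothing is overwritten
    return merged
```

A fixed hash like this gives no protection against load skew; rebalancing ownership or adding careful work-stealing would sit on top of this and is not shown.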

by hunterpayne

1 subcomments

Context switches (when you change the thread running on a specific core) are among the most computationally expensive things computers do. If somehow you can't use a threadpool and some sort of task abstraction, you probably shouldn't be doing anything with multiple threads or asynchronous code.
I have absolutely no idea why anyone would think breaking the thread per core model is better and I seriously question the knowledge of anyone proposing another model without some VERY good explanation. The GP isn't even close to this in any way.

by SteveLauC

0 subcomment

> a task can yield, which, conceptually, creates a new piece of work that gets shoved onto the work queues (which is "resume that task"). You might not think of it as "this task is suspended and will be resumed later" as much as *"this piece of work is done and has spawned a new piece of work."*
Never thought of it that way, but it’s indeed true — a new task does get enqueued in that case. Thanks for the insight!
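The "a yield spawns a new piece of work" view maps directly onto a tiny run loop. This is a minimal sketch using Python generators as tasks; `run` and `counter` are made-up names, and a real executor would add wakers, multiple queues, and so on.

```python
from collections import deque


def run(tasks):
    """Run generator-based tasks to completion on a single FIFO work queue."""
    work = deque(tasks)  # every entry means "resume this task"
    log = []
    while work:
        task = work.popleft()
        try:
            log.append(next(task))  # run the task until its next yield
        except StopIteration:
            continue  # the task finished, so it spawns nothing new
        work.append(task)  # the yield "spawned" a resume-this-task work item
    return log


def counter(name, n):
    """A task that yields n times, handing control back after each step."""
    for i in range(n):
        yield f"{name}{i}"
```

Calling `run([counter("a", 2), counter("b", 3)])` interleaves the two tasks, because each yield puts the task at the back of the queue instead of keeping it suspended somewhere off to the side.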

by adsharma

1 subcomments

Morsel driven parallelism is working great in DuckDB, KuzuDB and now Ladybug (fork of Kuzu after archival).
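For readers who haven't met the term: the idea is to cut the input into small fixed-size chunks ("morsels") and let workers claim the next one whenever they are free, so a slow worker simply ends up taking fewer morsels. The sketch below is a generic illustration and not how DuckDB or Kuzu implement it; `morsel_sum` and its parameters are hypothetical, and CPython threads will not give a real speedup for this workload.

```python
import threading


def morsel_sum(values, n_workers=4, morsel_size=1024):
    """Sum `values` by letting workers pull fixed-size morsels on demand."""
    next_start = 0
    lock = threading.Lock()
    partials = [0] * n_workers  # one slot per worker, so no sharing on the hot path

    def worker(i):
        nonlocal next_start
        while True:
            with lock:  # the dispenser: claim the next unprocessed morsel
                start = next_start
                next_start += morsel_size
            if start >= len(values):
                return  # input exhausted
            partials[i] += sum(values[start:start + morsel_size])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)
```

The only shared state is the dispenser cursor, which is touched once per morsel rather than once per row. That is the trade-off being tuned by `morsel_size`.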

by scrubs

0 subcomment

Async etc. is also a function of dynamic workloads, sometimes exacerbated by the fact that socket/channel A is slow, so while waiting on it you deal with channels B, C, D, ..., which are also slow for various reasons.
Per-core threads and not much else are pretty much required for NYSE, trading, OMS, and I bet things like switches. A web browser might be their polar opposite.

by foota

1 subcomments

An interesting observation:
"At that time, ensuring maximum CPU utilization was not so important, since you’d typically be bound by other things, but things like disk speed has improved dramatically in the last 10 years while CPU speeds have not."

by pjmlp

0 subcomment

Many runtimes and OS APIs let you control which threads run on which cores.
Java, .NET, Delphi, and C++ coroutines all provide mechanisms for supplying our own scheduler, which can then be used to say what goes where.
Maybe the cool languages should look more closely at the ideas in these not-so-cool, our-parents'-ecosystem kinds of languages. There are some interesting ideas there.

by josefrichter

2 subcomments

Isn't this what Erlang/Elixir BEAM is all about?