If you're in a serious multicore setting and you don't take waiting threads off the contended cache line while the other 50 threads take their turn with the underlying contended thing, it's very easy to end up way worse off than if you had just let the scheduler drop those threads into e.g. a futex and wake them up when it's their turn.
I'm not sure how deep you'd need to go into the low-power or embedded or hard-realtime worlds before you'd find a `std::mutex` implementation that doesn't do a modest number of spin-CAS rounds before dropping into the futex, but I'm sure it exists.
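For readers who haven't seen it, the hybrid pattern being described (spin briefly without hammering the cache line, then hand the problem to the scheduler) can be sketched roughly like this. `HybridLock` and `kSpinRounds` are illustrative names, not from any real library, and `std::this_thread::yield()` stands in for the futex wait a real implementation would do:

```cpp
#include <atomic>
#include <thread>

// Sketch of a spin-then-sleep lock. Real futex-backed mutexes park
// waiters in the kernel on the lock word itself; this portable sketch
// just yields instead.
class HybridLock {
public:
    void lock() {
        // Bounded spin phase. Test-and-test-and-set: spin on a plain
        // load so waiters keep the cache line in shared state instead
        // of ping-ponging it with failed CAS writes.
        for (int i = 0; i < kSpinRounds; ++i) {
            if (!locked_.load(std::memory_order_relaxed) &&
                !locked_.exchange(true, std::memory_order_acquire))
                return;  // acquired while spinning
        }
        // Fallback phase: stop burning the core. A futex-backed mutex
        // would sleep in the kernel here until woken by unlock().
        while (locked_.exchange(true, std::memory_order_acquire))
            std::this_thread::yield();
    }
    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    static constexpr int kSpinRounds = 64;  // illustrative bound
    std::atomic<bool> locked_{false};
};
```

The point of the bounded spin is that it wins when the critical section is short and the lock is usually free; the fallback is what saves you in the 50-waiting-threads case above.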
And maybe you work at Optiver and you've got an FPGA interacting with the link-layer and you literally never leave userspace after startup and you've got your own hand-crafted DMA busy-poll situation going because you build your machines with the exact number of cores for the number of threads you need and throw them away when the software changes. There are domains like that. </modest-hyperbole>
The number of "aggressively intermediate" people working at serious companies who roll their own concurrency shit because it's Just Fucking Metal Man is terrifyingly high. And it's dangerous, because if you are someone who needs custom concurrency you know, but if you aren't someone who needs custom concurrency, you often still think you know.
As soon as you yield, things become non-deterministic, so they aren't compatible with hard real-time requirements.
If you have 50 threads trying to access the same thing, then what you have is a software design problem. Most synchronization should be done through SPSC queues, which are easy to make lock-free (or even wait-free) and efficient, as long as you know how to handle backoff on the producing side and idle work on the consuming side.
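The classic wait-free SPSC structure is a ring buffer where the producer owns the head index and the consumer owns the tail, so neither ever writes the other's index. A minimal sketch (the class and member names are illustrative; real implementations also pad the indices onto separate cache lines to avoid false sharing):

```cpp
#include <atomic>
#include <cstddef>

// Minimal single-producer / single-consumer ring buffer. Capacity N
// must be a power of two; one slot is sacrificed so full and empty
// states are distinguishable.
template <typename T, size_t N>
class SpscQueue {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");

public:
    // Producer thread only. Returns false when full: this is where
    // backoff policy lives, on the caller's side.
    bool push(const T& v) {
        size_t head = head_.load(std::memory_order_relaxed);
        size_t next = (head + 1) & (N - 1);
        if (next == tail_.load(std::memory_order_acquire))
            return false;  // full
        buf_[head] = v;
        head_.store(next, std::memory_order_release);  // publish
        return true;
    }

    // Consumer thread only. Returns false when empty: the caller can
    // do idle work or spin here, per the comment above.
    bool pop(T& out) {
        size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;  // empty
        out = buf_[tail];
        tail_.store((tail + 1) & (N - 1), std::memory_order_release);
        return true;
    }

private:
    T buf_[N];
    std::atomic<size_t> head_{0};  // next slot to write (producer-owned)
    std::atomic<size_t> tail_{0};  // next slot to read (consumer-owned)
};
```

Both operations complete in a bounded number of steps regardless of what the other thread is doing, which is what makes this wait-free rather than merely lock-free.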
The Optiver model you're describing is pretty much how I'd build any low-latency application. It doesn't really require special hardware to do these things (you can use io_uring to bypass the kernel context switches for anything). It's also much simpler than what you hint at.
I'm willing to accept that a London-based crypto startup (i.e. LMAX-integrated) could have a use for extreme low-latency, extreme low-variance soft-realtime C++. In fact, in the sub-microsecond p99 regime you probably want to keep multithreading out of it entirely.
Hopefully you can accept that nitpicking the use of well-tested concurrency primitives on a forum full of impressionable up-and-coming hackers is almost certainly going to create downward pressure on sensible engineering choices amongst readers of your comments.
I deeply appreciate you helping to steer this little tire fire of a comment thread I seem to have created off the rocks; I meant well, but it seems to have ended up as an advertisement for insane defaults.
With that said, the parent seems pretty committed. And for all I know, they're actually doing sub-microsecond software HFT or hard-realtime signal processing.