“256 cores by 2013”?

I just saw a tweet that’s worth commenting on:

[Image: screenshot of the tweet]

Almost right, and we have already reached that.

I said something similar to the above, but with two important differences:

  1. I said hardware “threads,” not only hardware “cores” – it was about the amount of hardware parallelism available on a mainstream system.
  2. What I gave was a min/max range estimate of roughly 16 to 256 (the latter being threads) under different sets of assumptions.

So: Was I right about the 2013 estimates?

Yes, pretty much, and in fact we already reached or exceeded that in 2011 and 2012:

  • Lower estimate line: 2011 and 2012 parts like Intel’s Core i7 (Sandy Bridge and Ivy Bridge) deliver roughly the expected lower baseline, offering 8-way to 12-way parallelism = 4-6 cores x 2 hardware threads per core.
  • Upper estimate line: In 2012, the part the article mentioned (then called Larrabee, now known as MIC or Xeon Phi) is delivering 200-way to 256-way parallelism = 50-64 cores x 4 hardware threads per core. Also, in 2011 and 2012, GPUs emerged into more mainstream use for computation (GPGPU), and they likewise offer massive compute parallelism, such as 1,536-way parallelism on a machine with a single NVidia Tesla card.

Yes, mainstream machines do in fact offer examples of both ends of the “16 to 256 way parallelism” range, and machines with higher-end graphics cards go beyond the upper end.

For more on these various kinds of compute cores and threads, see also my article Welcome to the Jungle.

 

Longer answer follows:

Here’s the main part from the article “Design for Manycore Systems” (August 11, 2009). Remember this was written over three years ago – in the Time Before iPad, when Android was under a year old:

How Much Scalability Does Your Application Need?

So how much parallel scalability should you aim to support in the application you’re working on today, assuming that it’s compute-bound already or you can add killer features that are compute-bound and also amenable to parallel execution? The answer is that you want to match your application’s scalability to the amount of hardware parallelism in the target hardware that will be available during your application’s expected production or shelf lifetime. As shown in Figure 4, that equates to the number of hardware threads you expect to have on your end users’ machines.

Figure 4: How much concurrency does your program need in order to exploit given hardware?

Let’s say that YourCurrentApplication 1.0 will ship next year (mid-2010), and you expect that it’ll be another 18 months until you ship the 2.0 release (early 2012) and probably another 18 months after that before most users will have upgraded (mid-2013). Then you’d be interested in judging what will be the likely mainstream hardware target up to mid-2013.

If we stick with "just more of the same" as in Figure 2’s extrapolation, we’d expect aggressive early hardware adopters to be running 16-core machines (possibly double that if they’re aggressive enough to run dual-CPU workstations with two sockets), and we’d likely expect most general mainstream users to have 4-, 8- or maybe a smattering of 16-core machines (accounting for the time for new chips to be adopted in the marketplace). [[Note: I often get lazy and say “core” to mean all hardware parallelism. In context above and below, it’s clear we’re talking about “cores and threads.”]]

But what if the gating factor, parallel-ready software, goes away? Then CPU vendors would be free to take advantage of options like the one-time 16-fold hardware parallelism jump illustrated in Figure 3, and we get an envelope like that shown in Figure 5.

Figure 5: Extrapolation of “more of the same big cores” and “possible one-time switch to 4x smaller cores plus 4x threads per core” (not counting some transistors being used for other things like on-chip GPUs).

First, let’s look at the lower baseline: will “most general mainstream users have [4-16 way parallelism] machines” in 2013? So where are we today, in 2012, for mainstream CPU hardware parallelism? Well, Intel Core i7 parts (e.g., Sandy Bridge, Ivy Bridge) are typically in the 4 to 6 core range – which, with hyperthreading giving two hardware threads per core, means 8 to 12 hardware threads.
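To make that concrete, here is a minimal sketch (not from the original article) of how a C++11 program can discover the hardware thread count at run time and size a compute-bound job to match it. The summing workload and the simple chunking scheme are just assumed stand-ins for real work; the only standard fact relied on is that std::thread::hardware_concurrency() reports the number of hardware threads (cores x threads per core) when the implementation knows it:

```cpp
#include <cstddef>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    // Ask the implementation how many hardware threads are available
    // (cores x threads-per-core); may return 0 if unknown, so fall back to 1.
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 1;
    std::cout << "hardware threads: " << hw << "\n";

    // Toy compute-bound job: sum a large array, partitioned across one
    // worker per hardware thread so the program scales with the machine.
    std::vector<double> data(1 << 22, 1.0);
    std::vector<double> partial(hw, 0.0);
    std::vector<std::thread> workers;
    std::size_t chunk = data.size() / hw;

    for (unsigned i = 0; i < hw; ++i) {
        std::size_t begin = i * chunk;
        std::size_t end = (i + 1 == hw) ? data.size() : begin + chunk;
        workers.emplace_back([&, i, begin, end] {
            partial[i] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0.0);
        });
    }
    for (auto& t : workers) t.join();

    double total = std::accumulate(partial.begin(), partial.end(), 0.0);
    std::cout << "sum = " << total << "\n";
}
```

On the Core i7 parts above, hardware_concurrency() would typically report 8 or 12; on a manycore part it would report far more, and the same code would spread its work accordingly (ignoring memory-bandwidth and load-balancing effects).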

Second, what about the higher potential line for 2013? As noted above:

  • Intel’s Xeon Phi (then Larrabee) is now delivering 50-64 cores x 4 threads = 200 to 256-way parallelism. That’s no surprise, because this article’s upper line was based on exactly the Larrabee data point (see quote below).
  • GPUs already blow the 256 upper bound away – any machine with a two-year-old Tesla has 1,536-way parallelism for programs (including mainstream programs like DVD encoders) that can harness the GPU.

So not only did we already reach the 2013 upper line early, in 2012, but we already exceeded it for applications that can harness the GPU for computation.

As I said in the article:

I don’t believe either the bottom line or the top line is the exact truth, but as long as sufficient parallel-capable software comes along, the truth will probably be somewhere in between, especially if we have processors that offer a mix of large- and small-core chips, or that use some chip real estate to bring GPUs or other devices on-die. That’s more hardware parallelism, and sooner, than most mainstream developers I’ve encountered expect.

Interestingly, though, we already noted two current examples: Sun’s Niagara, and Intel’s Larrabee, already provide double-digit parallelism in mainstream hardware via smaller cores with four or eight hardware threads each. "Manycore" chips, or perhaps more correctly "manythread" chips, are just waiting to enter the mainstream. Intel could have built a nice 100-core part in 2006. The gating factor is the software that can exploit the hardware parallelism; that is, the gating factor is you and me.

8 thoughts on “256 cores by 2013”?

  1. @ various: I was leaving aside some of the higher-end parts (CPUs, GPUs) in part to be conservative and not overestimate. We’ve got plenty of parallelism, and as I mentioned in Welcome to the Jungle the problem is that it’s coming in forms that are increasingly difficult to program in order to extract actual speedups… single-thread was one thing, multi-core parallel was harder, vectorization is harder, and heterogeneous (GPU) is harder again.

  2. The number of GPU threads does not represent the raw processing power of the chip, since most of them are idle to hide memory latencies – so it’s an apples-to-oranges comparison with CPU hardware threads. It’s more relevant to compare the number of CUs on a GPU multiplied by the warp size, which is in the hundreds range.

  3. As I’m sure you know, you are somewhat undercounting GPU parallelism. It’s not unreasonable to hand a Kepler GTX 680 a grid of 2500-8000 threads in order to keep the device highly utilized.

  4. Why no mention of AMD’s Abu Dhabi and Interlagos architectures, which offer 16 actual cores (something Intel does not)?

  5. Shawn: “mainstream.” (“X86 have been more or less dead since 2006.” <– … what? Please be serious.)

    Mr. Sutter: AMD is at 16 cores in customer-facing parts on Best Buy shelves as of a little over a year ago; 16-core Bulldozers are currently $630, or $450 at wholesale. (They aren't price competitive, since Intel has gotten so far ahead, but your prediction still came true ahead of time.)

  6. What about Sun Niagara / SPARC T3/4/5 processors? 16 cores with 8 threads per core. That is some fun stuff.
    X86 have been more or less dead since 2006. Since Intel doesn’t have any competition, it has not improved its CPUs as much. Between 2006 and 2012 Intel doubled its speed on the desktop. From 2007 to 2012 ARM managed to speed up its SoCs/CPUs 17 times.
    Apple’s A6 is as fast per MHz as Intel today. That is amazing.

  7. Have to second the comment on Amdahl’s law: until software improves, having more than a couple of cores is simply a waste of cores, of course.

  8. I was optimistic about multicore development in 2006 but I’ve gotten more pessimistic since then.

    For one thing, the software problem is tough, and the real problem is Amdahl’s law. I built a system that has ten processing stages; seven of those stages parallelize well, two max out around 2-3 cores, and the last is serial. If I went from 4 to 12 cores I might get a last doubling of speed, but improvements would be small from there on out (a rough calculation along these lines is sketched after the comments).

    I also used to ignore the effects of the memory hierarchy. Take AMD’s Fusion, which I was impressed with at first. From a programmer’s perspective the unified memory architecture is great, but it really means Fusion will always be a low end part because you don’t get GPU performance without GPU memory and with GPU memory you won’t get good performance for CPU workloads.

    Memory hierarchy also made the IBM/Sony “Cell” processor a disappointment.

    It’s getting really boring to read about hardware on sites like ArsTechnica, because there just isn’t any news anymore. Back in the 2000s there were real performance improvements in clock rate, IPC, or number of cores. Today it’s just some fantasyland about how they busted their tail to reduce power consumption by 100 times in theory, except there is something FUBAR about the software and it doesn’t really work, because the thing is going to switch on and never switch off, get hot, and burn out your battery in 20 minutes.
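To make the Amdahl’s-law point in comment 8 concrete, here is a small illustrative sketch. The stage weights are assumptions for illustration, not the commenter’s actual numbers: suppose 70% of the total work parallelizes well, 20% scales to at most 3 cores, and 10% is serial.

```cpp
#include <algorithm>
#include <cstdio>

// Illustrative Amdahl's-law model for a pipeline like the one in comment 8.
// Assumed (hypothetical) split of total work:
//   70% parallelizes well, 20% scales to at most 3 cores, 10% is serial.
double speedup(int cores) {
    const double parallel = 0.70;  // scales with core count
    const double limited  = 0.20;  // caps out at 3 cores
    const double serial   = 0.10;  // never speeds up
    double time = parallel / cores
                + limited / std::min(cores, 3)
                + serial;
    return 1.0 / time;
}

int main() {
    const int counts[] = {1, 2, 4, 8, 12, 16, 64};
    for (int cores : counts) {
        std::printf("%3d cores -> speedup %.2fx\n", cores, speedup(cores));
    }
}
```

With these assumed weights, going from 4 to 12 cores takes you from roughly 2.9x to 4.4x, and even 64 cores tops out under 6x – exactly the kind of flattening the commenter describes.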
