C++ AMP keynote is online

Yesterday I had the privilege of talking about some of the work we’ve been doing to support massive parallelism on GPUs in the next version of Visual C++. The video of my talk announcing C++ AMP is now available on Channel 9. (Update: Here’s an alternate link; it seems to be posted twice.)

The first 20 minutes has nothing to do with C++ in particular or any platform in particular, but tries to make the case that the right way to view the “trends” of multicore computing, GPU computing, and cloud computing (HaaS) is that they are not three trends at all, but merely facets of the same single trend — heterogeneous parallel computing.

If they are, then one programming model should be able to address them all. We think we’ve found one.

The main reason we decided to build a new model is that we believe there needs to be a single model with all of the following attributes:

  • C++, not C: It should leverage C++’s power for strong abstraction without sacrificing performance, not just be a dialect of C.
  • Mainstream: It should be programmable by millions of developers, not just by a priesthood. Litmus test: Is the Hello World parallel GPU program a page and a half, or a couple of lines? (See the sketch after this list.)
  • Minimal: It adds just one general-purpose language extension that addresses not only the immediate problem (dealing with cores that can’t support full C++) but many others. With the right general-purpose extension, the rest can be done as just a library.
  • Portable: It allows shipping a single EXE that can use any combination of GPU vendors’ hardware. The initial implementation uses DirectCompute and supports all devices that are DX11 capable; DirectCompute is just an implementation detail of the first release, and the model can (and I expect will) be implemented to directly talk to any interesting hardware.
  • General and future-proof: The initial release will focus on GPU computing, but it’s intended to enable people to write code for the GPU in a way that in the future we can recompile with few or no changes to spread across any and all accessible compute cores, including ones in the cloud.
  • Open: I mentioned that Microsoft intends to make the C++ AMP specification open, and encourages its implementation on other C++ compilers for any hardware or OS target. AMD announced that they will implement C++ AMP in their FSA reference compiler. NVidia also announced support.
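
To make “mainstream” and “minimal” concrete, here’s a sketch of roughly what element-wise array addition looks like in this model. It is a reconstruction, not the talk’s exact code: the header name, the concurrency namespace types (array_view, parallel_for_each, index), and the restrict(direct3d) spelling are the placeholders shown at the announcement and may change before release.

    #include <amp.h>      // placeholder header name from the announcement
    #include <vector>
    using namespace concurrency;

    // Everything here is ordinary C++; only the lambda body is restricted.
    void AddArrays(int n, const std::vector<int>& va,
                   const std::vector<int>& vb, std::vector<int>& vsum) {
        array_view<const int, 1> a(n, va);   // wrap existing host data
        array_view<const int, 1> b(n, vb);
        array_view<int, 1> sum(n, vsum);
        parallel_for_each(sum.extent, [=](index<1> i) restrict(direct3d) {
            sum[i] = a[i] + b[i];            // runs on the accelerator
        });
    }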

We’re really excited about this, and I hope you find the information in the talk to be useful. A prerelease implementation in Visual C++ that runs on Windows will be available later this year. More to come…

31 thoughts on “C++ AMP keynote is online”

  1. @JSawyer: Those are good questions. We’ve considered those things and intend to do some of them. Briefly:

    Tag/version: “direct3d” is a placeholder and we agree it implies the wrong thing because nothing about the model is DX-specific, so we’ll probably rename this tag. Versions are accounted for in the design with a :## syntax (e.g., restrict(direct3d:11.1)) but are not needed in the initial release. Yes, you can use macros for these.
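
    For instance, the spellings just described would look like this (all placeholder syntax, not final):

        void f() restrict(direct3d);        // initial release: unversioned tag
        void g() restrict(direct3d:11.1);   // versioned form reserved in the design

        #define GPU_TARGET direct3d         // the tag is ordinary source text,
        void h() restrict(GPU_TARGET);      // so a macro can abstract it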

    The “restrict” name: Several people internally and externally have suggested renaming ‘restrict’ to ‘target’ because they have a view that this is about target platforms/cores. That’s too narrow a view; this is a general-purpose facility for language _restrictions_ that is integrated into the language and can handle more than just targets (just as lambdas are a general-purpose facility that can handle more than just parallel algorithms).

    Timeframe: We haven’t announced a specific schedule yet, but have said that C++ AMP bits will be available later this year.

    Thanks,

    Herb

  2. Hello! I’ve never done GPGPU, but our financial applications are increasingly computation-intensive and AMP seems like a nice way to go. I’ve watched your presentation, but I have a few concerns.

    – the “restrict” keyword. I know it is a C99 keyword, but is it the most descriptive one? The “restrict” keyword could actually be used not to “restrict” the language features within a {} block, but to “expand” them. Maybe, as suggested by others, “target” and “platform” are more suitable keywords. I’m against keyword inflation, but for the sake of clarity, maybe a new keyword should be considered?

    restrict(direct3d):
    – should the target parameter be used with or without quotes?
    – should it have another (optional) parameter, such as version?
    – is direct3d the most suitable description for DirectCompute or DirectX 11?

    I would much rather see it like this:
    restrict(“OpenCL”, 1.1)
    restrict(“DirectCompute”, 11)
    restrict(“pure”)
    Or with restrict keyword changed to platform / target.

    This way the programmer could define _targetname and _targetversion constants (within ifdef etc.) and use:
    restrict(_targetname, _targetversion)

    _targetname would have to be a string, while _targetversion should be a float (to cover point versions, such as 1.1). If “pure”, the _targetversion could be ignored.
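
    For example (hypothetical syntax throughout; none of this is announced):

        #ifdef USE_OPENCL
        #  define _targetname    "OpenCL"
        #  define _targetversion 1.1
        #else
        #  define _targetname    "DirectCompute"
        #  define _targetversion 11
        #endif

        // ... and then every restricted block would be target-agnostic:
        // restrict(_targetname, _targetversion) { ... }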

    But then again, what about multi-targeting? What if I’d like my code to run on both OpenCL 1.1 and DirectCompute 11?
    Maybe restrict(OpenCL11; DirectCompute11) for a common set of features?
    Or maybe the preprocessor should define _mytargets to contain both the OpenCL 1.1 and DirectCompute 11 targets?

    All in all, the approach seems quite interesting. But when could we start using AMP? What is the time frame for shipping CTP or release versions of AMP?

  3. I have some questions about the MatrixMult example presented. That function has, in its body, a lambda that is restrict(direct3d). The function itself, though, defaults to restrict(cpu). What does that mean? Given that it is implemented in terms of a restrict(direct3d) piece of code, doesn’t that make the whole function restrict(direct3d)? And what happens when I run this code in a box with no GPU?
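
    For reference, the example in question has roughly this shape (a reconstruction from the talk; the header, type, and function names are assumptions, not the talk’s exact code):

        #include <amp.h>     // same placeholder names as the sketch in the post
        #include <vector>
        using namespace concurrency;

        void MatrixMult(std::vector<float>& vC, const std::vector<float>& vA,
                        const std::vector<float>& vB, int M, int N, int W) {
            // The enclosing function is ordinary (cpu-restricted) C++ ...
            array_view<const float, 2> a(M, W, vA), b(W, N, vB);
            array_view<float, 2> c(M, N, vC);
            parallel_for_each(c.extent, [=](index<2> idx) restrict(direct3d) {
                // ... and only this lambda body must satisfy the restriction.
                float sum = 0.0f;
                for (int k = 0; k < W; ++k)
                    sum += a(idx[0], k) * b(k, idx[1]);
                c[idx] = sum;
            });
        }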

  4. All questions of that form involve a scheduler that is responsible for mapping app workload to hardware. Traditionally, app threads are mapped to cores by the OS scheduler, and this includes performing preemption to share a core when there are fewer cores than ready threads. PPL tasks are mapped to cores by the ConcRT scheduler. C++ AMP tasks are mapped to specialized cores by the underlying scheduler, which initially is DirectCompute and can be other schedulers tomorrow.

    The key thing is to have one scheduler for a given piece of hardware, and to avoid having multiple different schedulers that each think they own the same hardware and oversubscribe it. For a multicore example, it would be bad if Microsoft PPL’s ConcRT scheduler and Intel’s OpenMP scheduler both thought they owned the whole machine; it was that way at first, but the latter has now been implemented on top of the former so that they share more nicely. The industry will continue to engineer under the covers to improve scheduling in this kind of way.

  5. First off, I really like this concept, the syntax that goes with it, and pretty much everything about it.

    I do have a couple of points of curiosity, though this may be my naïveté about GPUs: presuming this becomes a mainstream programming paradigm, what’s going to happen when multiple applications try to offload data processing to the GPU at the same time? For example, you’re playing a game of “Generic FPS 2015”, running “Cloud Project @Home” in the background, and your virus scanner kicks in. All of these compete for the resources on the GPU. Will it play nicely?

    From a more local perspective, what happens if, within one program, one thread kicks off some GPU parallel processing, and then some other thread does the same with a different algorithm on a different dataset? Will one computation have to wait for the other to complete entirely before starting, or will it be able to run on parts of the GPU that are “finished” or otherwise unallocated? Or option c): other?

  6. So could this mean a whole new level for PC gaming & graphics?

    Or easier direct-to-metal programming?

  7. @Herb Sutter: Thanks for the explanation. I didn’t realize how weak extern "C" { ... } actually is as a language linkage. From the current draft standard ( http://www.open-std.org/jtc1/sc22/wg21/prot/14882fdis/n3291.pdf ), N3291 § 7.5 does a pretty good job of explaining what extern "language name" { ... } __cannot__ do.

    ASIDE: It would be nice if the draft standard detailed what extern "language name" { ... } could do … especially how extern "language name" { ... } compares to the seemingly more restrictive extern "C" { ... }.

    While restrict( direct3d ) { ... } implies restricting the compiler to a proper subset of C++ that is useful for the current version of the HLSL Compute Shader in the current version of Direct3D, there is no guarantee that this subset will remain proper over future versions of HLSL … it very well could become a set with some features of C++ and some features not in C++ … or blossom into an improper subset or a superset of C++ … who knows … so we might just be restrict-ing in the short term and find ourselves with an odd keyword in the future.

    Moreover, while it's convenient that C99 has restrict as a reserved word, restrict doesn't really get to the heart of the matter. We are asking the compiler's backend to do something other than the default ... instead of producing x86/x64 instructions, we want the compiler to produce HLSL Compute Shader bytecode or Azure library calls or whatever. Words/phrases that immediately come to mind are: target, emit, compile to, produce, build ... or even: language, specification, platform, architecture, ... but not "restrict". I think that something like target( HLSL 5.0, OpenCL 1.0, Azure 1.0, MPI 2.2 ) { ... } ... or emit( HLSL, OpenCL, Azure, MPI ) { ... } with compiler switches for version 5.0 of HLSL language, version 1.0 of the OpenCL spec, version 1.0 of the Azure library, and version 2.2 of the MPI spec/library ... or platform( gpu, cloud ) { ... } with compiler switches for HLSL 5.0 language & OpenCL 1.0 spec for gpu and Azure 1.0 library & MPI 2.2 spec/library for cloud (but defaulting to normal C++0x if I forget to add those compiler switches) ... or some combination of these would get more to the point ... and be a little more future-proof.

    Very Respectfully,
    Joshua Burkholder

    P.S. - Alternatively, you could just lift the restrictions from extern "language name" { ... } that the current draft standard imposes and make any necessary additions ... esp., since you are going to be embracing and extending the C++0x language anyway. Since extern "language name" { ... } was intended for language linkage ... and that seems to be what you are doing here, then you could just add the ability to place extern "language name" { ... } in any scope, to place it on any lambda or member function, and to restrict the C++ language based on the "language name" string within the scope of extern "language name" { ... }. In other words, a lambda could be written as: [=] ( index<2>(1, 2) ) extern "HLSL 5.0" { ... /* restrict C++ to features that produce compiler-backend intermediate forms that are compatible with HLSL 5.0 compiler-backend intermediate forms */ ... } ... or something like that.

  8. The trouble with extern is that it’s not fully integrated into the language. Here are a few examples: 1. It’s important to be able to overload on these qualifiers even when only the qualifier is different. 2. You want to be able to express restrictions on all functions, including member functions and lambdas, not just namespace-scope functions. 3. You also have to be able to put restriction qualifiers on pointers to functions and other associated language features. There are more examples, but those are the kinds of things that are necessary to get to a fully integrated design, where this new feature is an orthogonal feature that works with everything else that’s already there in the language, so that it just works with other language features as expected without surprises/limitations.
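
    A sketch of those three points (hypothetical declarations illustrating the design goals above, not final syntax):

        double dot(const float* a, const float* b, int n) restrict(direct3d);
        double dot(const float* a, const float* b, int n);
                        // 1. overloading purely on the restriction qualifier

        struct solver {
            void step() restrict(direct3d);  // 2. qualifier on a member function
        };

        typedef float (*kernel_fn)(float) restrict(direct3d);
                        // 3. qualifier as part of a pointer-to-function type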

  9. Here is another (potentially naive) way to allow the coordinate and matrix perspectives to coexist (…again limited to an index<2> example, but using unions instead of references…):
    //===================================================================
    #include <iostream>
     
    using namespace std;
     
    template < size_t N >
    struct index;
     
    template <>
    struct index< 2 > {
        union {
            int x;
            int column;
        };
        union {
            int y;
            int row;
        };
        index ( int x, int y ) : x( x ), y( y ) {
            //
        }
    };

    template < size_t N >
    struct matrix_index;

    template <>
    struct matrix_index< 2 > : public index< 2 > {
        matrix_index ( int row, int column ) : index< 2 >( column, row ) {
            //
        }
        matrix_index ( index< 2 > const & i ) : index< 2 >( i ) {
            //
        }
    };
     
    std::ostream & operator << ( std::ostream & os, index< 2 > const & i ) {
        os << "(   x: " << i.x   << ",      y: " << i.y      << " )" << endl;
        os << "[ row: " << i.row << ", column: " << i.column << " )" << endl;
        return os;
    }
     
    int main () {
        //----------------------------------------------------------
        cout << "sizeof( index< 2 > ):        " << sizeof( index< 2 > ) << endl;
        cout << "sizeof( matrix_index< 2 > ): " << sizeof( matrix_index< 2 > ) << endl;
        cout << endl;
        //----------------------------------------------------------
        index< 2 > i( 2, 3 );
        cout << "index< 2 > i( 2, 3 );" << endl;
        cout << "i:" << endl;
        cout << i << endl;
        i.x = 5;
        cout << "i.x = 5;" << endl;
        cout << "i:" << endl;
        cout << i << endl;
        //----------------------------------------------------------
        matrix_index< 2 > mi( 3, 2 );
        cout << "matrix_index< 2 > mi( 3, 2 );" << endl;
        cout << "mi:" << endl;
        cout << mi << endl;
        mi.column = 5;
        cout << "mi.column = 5;" << endl;
        cout << "mi:" << endl;
        cout << mi << endl;
        //----------------------------------------------------------
        // compilation checks:
        index< 2 > i2( i );
        index< 2 > i3( mi );
        matrix_index< 2 > mi2( i );
        matrix_index< 2 > mi3( mi );
        i2 = mi;
        i3 = i;
        mi2 = mi;
        mi3 = i;
        return 0;
    }
    //===================================================================

    At run-time, the code above produces the following:

    sizeof( index< 2 > ):        8
    sizeof( matrix_index< 2 > ): 8

    index< 2 > i( 2, 3 );
    i:
    (   x: 2,      y: 3 )
    [ row: 3, column: 2 ]

    i.x = 5;
    i:
    (   x: 5,      y: 3 )
    [ row: 3, column: 5 ]

    matrix_index< 2 > mi( 3, 2 );
    mi:
    (   x: 2,      y: 3 )
    [ row: 3, column: 2 ]

    mi.column = 5;
    mi:
    (   x: 5,      y: 3 )
    [ row: 3, column: 5 ]

    Very Respectfully,
    Joshua Burkholder

  10. The “multiple schedulers” problem is real and needs to be solved independently. For example, C++ PPL uses the ConcRT runtime scheduler. If OpenMP uses a different scheduler, then unless there are extra knobs to tell them not to use the whole machine, both will think they own the whole machine. One way to solve this is to use a common scheduler — for example, IIRC Intel has updated their OpenMP implementation to use the ConcRT scheduler, which solves the problem. So the solution is to reduce the number of independent schedulers.

    The good news is that the ConcRT and OpenMP and threadpool schedulers only deal with CPU cores today. C++ AMP deals only with GPU cores today. So there’s no conflict at the moment, and as we integrate C++ PPL and C++ AMP we will definitely be sure to make the schedulers play nice (minimum bar) or be fully unified (ideal, and long-term necessity).

  11. They must be built into the compiler. Allowing user-defined restriction qualifiers would be a much bigger language extension, because you’d need a whole new set of language syntax for talking about specific language features within the language, to specify which features are and aren’t available. It’s possible, and nothing closes the door to that as a future path, but it’s best to start with the basics and see how they work out.

  12. @Mike: While we designed this, I did explore unifying this with const- and volatile-qualifiers on member functions. It turns out that they’re not quite the same thing — you could shoehorn mem-fun cv-qualifiers into restriction qualifiers, but fundamentally they’re no different than just plain cv-qualifiers on ordinary parameters, and the only reason they’re tacked on so awkwardly at the end is because you can’t write them on the “this” parameter where they belong. IMO the right way in the future to unify mem-fun cv-qualifiers is by allowing explicit “this” parameters… it’s really a different problem.
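
    A sketch of the difference (the explicit-“this” spelling is hypothetical, not C++0x):

        struct counter {
            int get() const;                  // today: cv-qualifier tacked on at the end
            // int get(const counter* this); // hypothetical: an explicit "this"
            //                               // parameter that carries the qualifier
        };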

  13. @Jesse: Alas, attributes aren’t the answer because they’re not part of the language. For example, you can’t overload on them.

  14. @Lothar: I understand, and a lot of people felt the same about GUIs and OO.

    The short answer is that we have to deal with it if we care about taking advantage of the compute throughput available on mainstream hardware from now on, because this is the way new performance will be delivered in mainstream hardware. Applications that aren’t compute-intensive can ignore it; applications that are will need to cope with heterogeneous computing — even if just locally, but more and more with cloud acceleration when connected (I believe this is the inevitable future in a handful of years).

    I completely understand your feeling that you don’t want to have to learn to deal with this stuff. The dirty little secret is that nobody wanted multicore in the past five years, and nobody really wants heterogeneous computing now… the chip vendors and the software developers, all of us together, would just love the old free-lunch improvements to single-core throughput to go on forever. But just as starting in the mid-2000s we needed to go parallel to get full compute throughput out of mainstream hardware, we will now need to go heterogeneous to get full performance. And although we’d have preferred not to do the work of bringing parallelism into the mainstream, now that we are doing it there are other strong advantages, notably performance/W and scalability.

    I also get that you don’t want something Windows-only. That’s typical of C++ developers (including me), and it’s why we’ve decided to take the path of making this an open spec from the get-go. As of launch day both AMD and NVidia have announced support. More to come…

    But it has to be easier to use than current tools if it’s going to make sense for millions of developers; a priesthood can happily continue to use what’s there now if they want, but we as an industry need to greatly broaden the audience of developers who can do this stuff successfully, as well as make all of us more productive. One aspect of this is minimizing differences from the proven programming models we already know: For C++ in particular, my personal belief is that both parallel computing and heterogeneous computing will fail unless we can make them just about as simple to use as STL algorithms. Fortunately, we now have a proof point that it can be done for parallel algorithms (see C++ PPL), and a proof of concept that the model can be taken to heterogeneous computing with an additional decoration (C++ AMP). I was concerned a few years ago, but now I’m confident that we’re going to get there.

  15. The formatting for the above code sample is fine; however, I forgot to modify the “At run-time, this produces the following” example, so here it is … with proper formatting:
    sizeof( index< 2 > ):        16
    sizeof( matrix_index< 2 > ): 16

    index< 2 > i( 2, 3 );
    i:
    (   x: 2,      y: 3 )
    [ row: 3, column: 2 ]

    i.x = 5;
    i:
    (   x: 5,      y: 3 )
    [ row: 3, column: 5 ]

    matrix_index< 2 > mi( 3, 2 );
    mi:
    (   x: 2,      y: 3 )
    [ row: 3, column: 2 ]

    mi.column = 5;
    mi:
    (   x: 5,      y: 3 )
    [ row: 3, column: 5 ]

    Joshua Burkholder

  16. @Herb Sutter:

    Why restrict( direct3d ) { ... }, vice extern "direct3d" { ... }?

    extern "C" { ... } seems to have worked pretty well. What would extern "direct3d" { ... } not do that you need it to?

    Very Respectfully,
    Joshua Burkholder

  17. @Mike Gibson: Gleaned from just the videos, my understanding of “restrict” is that it tells the compiler to consider language rules for a set of the C++ language (in this case, just a proper subset of C++ … but a superset would also be possible for things like C++ 202x language feature testing). Since function pointers and recursion are not allowed under “restrict( direct3d )”, then “restrict( direct3d )” tells the compiler to consider a proper subset of C++0x where a pointer to a function or a recursive function should produce a compile-time error inside the context that follows (i.e. a compile-time error for just the code in between the “{” and “}” that follows the “restrict( direct3d )”). This restriction allows the front end of the compiler to create an intermediate representation out of the proper subset of C++0x code that the back end can optimize and emit to anything it wants … like HLSL bytecode, or IA64 instructions, or a combination of both. Of course, an additional run-time that handles the interaction between the CPU and the GPU (for HLSL) will also be added during compilation, besides just the usual run-time that sets up the stack and calls “main”.
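
    For instance, under that reading, something like this should produce a compile-time error (hypothetical example and diagnostic):

        int fib(int n) restrict(direct3d) {
            // error: recursion is not allowed in a direct3d-restricted function
            return n < 2 ? n : fib(n - 1) + fib(n - 2);
        }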

    @Herb Sutter: If the above is the case, then I think names like “direct3d” and “cpu” are __bad__ … they are much too hardware-target specific. Hardware targets work well as compiler switches (say for x86, x64, …), vice being put directly in code. I would think that language specifications … and maybe versions … should go in between the parens of “restrict( )”. For instance, “restrict( OpenCL 1.0 )”, “restrict( C++ 2003 )”, “restrict( HLSL 5.0 )”, “restrict( C++ 202x )”, “restrict( ECMAScript 5.0 )” … or whatever. Additionally, I would imagine that the HLSL will add features in the future (i.e. leading to additional bytecode instructions) in Direct3D 12 (or Direct3D 13, or Direct3D 14, or …), so which HLSL version does “restrict( direct3d )” target? Will this be a compiler switch, pragma, or something I can stick in between the parens of “restrict( )”? This seems like the DLL-Hell versioning problem that .NET assemblies tried to solve, but at a language/compiler level.

    Lastly, I am __REALLY__ dissatisfied with the order of the indices in the index<2>, extent<2>, and similar templates … i.e. y, x … instead of the usual x, y. Hint: If you want me to think in terms of rows and columns, then provide an interface (say a “matrix_index” class for a “matrix_view” class) that allows me to think in terms of rows (y’s) and columns (x’s) … but do __not__ force me to use “y” for row and “x” for column … and do __not__ change the generally accepted order of things … i.e. (x, y) … because you are trying to fit multiple perspectives into one view … i.e. (y, x) that also means (row, column). I should be able to have and declare indices using the coordinate perspective … i.e. (x, y) … or using the matrix perspective … i.e. (row, column). The API should support both perspectives … without me having to type a whole bunch. Maybe more template meta-programming should be done? Or maybe just some API changes?

    NOTE: Of course, the (z, y, x) ordering for index and extent needs to be changed as well.

    Very Respectfully,
    Joshua Burkholder

    P.S. – Here is one (potentially naive) way to allow the coordinate and matrix perspectives to coexist (…limited to an index<2> example…):
    //===================================================================
    #include <iostream>

    using namespace std;

    template < size_t N >
    struct index;

    template <>
    struct index< 2 > {
        int x;
        int y;
        int& row;
        int& column;
        index ( int _x, int _y ) : x( _x ), y( _y ), row( y ), column( x ) {
            //
        }
        index ( index const & i ) : x( i.x ), y( i.y ), row( y ), column( x ) {
            //
        }
        index & operator = ( index const & i ) {
            if ( &i != this ) {
                x = i.x;
                y = i.y;
            }
            return *this;
        }
    };

    template < size_t N >
    struct matrix_index : public index< N > {
        matrix_index ( int row, int column ) : index< N >( column, row ) {
            //
        }
        matrix_index ( matrix_index const & mi ) : index< N >( mi ) {
            //
        }
        matrix_index ( index< N > const & i ) : index< N >( i ) {
            //
        }
        matrix_index & operator = ( matrix_index const & mi ) {
            if ( &mi != this ) {
                index< N >::operator =( mi );
            }
            return *this;
        }
        matrix_index & operator = ( index< N > const & i ) {
            if ( &i != this ) {
                index< N >::operator =( i );
            }
            return *this;
        }
    };

    std::ostream & operator << ( std::ostream & os, index< 2 > const & i ) {
        os << "(   x: " << i.x   << ",      y: " << i.y      << " )" << endl;
        os << "[ row: " << i.row << ", column: " << i.column << " )" << endl;
        return os;
    }

    int main () {
        //----------------------------------------------------------
        cout << "sizeof( index< 2 > ):        " << sizeof( index< 2 > ) << endl;
        cout << "sizeof( matrix_index< 2 > ): " << sizeof( matrix_index< 2 > ) << endl;
        cout << endl;
        //----------------------------------------------------------
        index< 2 > i( 2, 3 );
        cout << "index< 2 > i( 2, 3 );" << endl;
        cout << "i:" << endl;
        cout << i << endl;
        i.x = 5;
        cout << "i.x = 5;" << endl;
        cout << "i:" << endl;
        cout << i << endl;
        //----------------------------------------------------------
        matrix_index< 2 > mi( 3, 2 );
        cout << "matrix_index< 2 > mi( 3, 2 );" << endl;
        cout << "mi:" << endl;
        cout << mi << endl;
        mi.column = 5;
        cout << "mi.column = 5;" << endl;
        cout << "mi:" << endl;
        cout << mi << endl;
        //----------------------------------------------------------
        // compilation checks:
        index< 2 > i2( i );
        index< 2 > i3( mi );
        matrix_index< 2 > mi2( i );
        matrix_index< 2 > mi3( mi );
        i2 = mi;
        i3 = i;
        mi2 = mi;
        mi3 = i;
        return 0;
    }
    //===================================================================

    At run-time this produces the following:
    sizeof( index ): 16
    sizeof( matrix_index ): 16

    index i( 2, 3 );
    i:
    ( x: 2, y: 3 )
    [ row: 3, column: 2 ]

    i.x = 5;
    i:
    ( x: 5, y: 3 )
    [ row: 3, column: 5 ]

    matrix_index mi( 3, 2 );
    mi:
    ( x: 2, y: 3 )
    [ row: 3, column: 2 ]

    mi.column = 5;
    mi:
    ( x: 5, y: 3 )
    [ row: 3, column: 5 ]

  18. Well, being a normal application programmer, I can’t think of heterogeneous computing as being useful for mine and most other non-multimedia/gaming applications. The shared-memory requirement is what kills it all for me: shared memory and large virtual memory sets among all threads/tasks; otherwise I could use multiprocessing, giving me even more stability in my application. And for this use case, GCD with the system-wide thread pool (no, it is not like Cilk; it’s much more powerful) looks more productive and performant.

    If you want to convince me, take a compiler or a database or an Excel spreadsheet or a Word document or … as an example for AMP, not a simple particle system. Promoting heterogeneous computing is marketing, just marketing.

    And it is not true that Apple’s lambda functions/C blocks are Objective-C features. They and GCD are pure C level.

    And again: my company and I will never use pure Windows APIs anymore. We make as much money from our Mac OS X port as we make from our Windows app. So please stop the API wars.

  19. It’s wonderful that Microsoft has made it easier to write multi-core, heterogeneous code, but…

    If multiple applications are running, each trying to use all the cores and all the GPUs, things start to run slower. We need an OS-level task scheduler that can be the traffic cop at run-time.

    Now my C# app is trying to use all the cores and my C++ app is trying to use all the cores, and everything runs slower!!

  20. I think much of what was done with concepts would be applicable to restrict: a generalized way of specifying restrictions on the code you write. Then, instead of requiring a new keyword for code you want to use with direct3d, the compiler just checks individual restrictions as it goes and doesn’t allow direct3d when the code doesn’t comply. The restrict keyword would then just use concepts to make the error messages nicer. This would have the side benefit of allowing old code to take advantage of new processing models without having to be updated to include some new keyword.

    Perhaps it’s a good thing that concepts didn’t make it this time around. Perhaps the idea is far more powerful than previously thought.

  21. Yeah, I was thinking about that stuff too. Perhaps something that utilizes C++0x’s new attribute syntax, which would provide better context sensitivity, is the way to go?

  22. It would appear you are caught in some sort of Reality Distortion Field. Here, let me target you with my tractor beam; perhaps I can free you from that treacherous trap.

    Firstly, GCD has absolutely nothing to do with heterogeneous computing. What GCD provides is a hierarchical work-stealing task scheduler and some interesting lock-free/wait-free data structures under the hood to facilitate message passing between worker threads. That’s all it is. It’s modeled after technologies from Cilk and other early concurrency projects, which have been around since the ’90s or earlier, and it doesn’t even do a good job of providing all of the features from Cilk, only the bare necessities. Microsoft’s PPL and Intel’s TBB initiatives, also somewhat based on Cilk, go further than GCD in providing concurrent algorithmic skeletons and other high-level abstractions which can be composed together to solve various concurrency problems. GCD just gives you a bare-bones task scheduler. But in the end, GCD, PPL, TBB, et al. don’t really attempt to tackle heterogeneous computing.

    In fact, Apple came up with a completely different technology for heterogeneous massively parallel computing that I’m sure you’ve heard of: OpenCL. OpenCL takes C99, makes some changes, and provides some abstractions for working with vector data types and image memory buffers on GPUs, but it can also be used for running code on CPUs and perhaps other hardware like FPGAs. And that’s really what heterogeneous computing is about: writing code that can run on different types of devices in a uniform way. OpenCL implements something of a client-server model similar to OpenGL, where the host application written in regular C (or C++/Obj-C) running on the CPU can delegate work to OpenCL compute kernels, written in the OpenCL C99 dialect and compiled separately, that are instantiated and run on various compute devices such as the GPU, or as new threads on the same CPU if you so choose.

    So why not just use OpenCL? Well, OpenCL is modeled after C, which is all fine and good, but for C++ programmers it may leave something to be desired. C++ AMP, from what I’ve gleaned, provides you with all of those higher-level abstractions if you so choose to use them, such as templates, exceptions, class types with RAII, rvalue references, and so on. Hopefully I’m not wrong about this, but C++ AMP also appears to do away with the client-server model of OpenCL and DirectCompute: it unifies the programming environment in a natural way. You write your application code in C++ AMP, which is just a superset of C++0x (hopefully C++11 really soon), and when you want to delegate computation to heterogeneous devices, you instruct a concurrent algorithmic skeleton, which is simply a template library abstraction, to invoke a function or functor object that has been restricted to run on such devices, also written in C++ AMP and within the same source file. You don’t have to write compute kernel code in a different language or compile the kernel source files separately; it’s all handled by the C++ AMP compiler and runtime libraries automatically. It’s really the next logical step to take within the domain of heterogeneous computing. I really hope C++1x adopts whatever emerges and stabilizes out of the C++ AMP effort. I’m really pumped about it, and I’m glad to see it’ll be an open standard, as I primarily work on non-Microsoft platforms.

    As for Apple’s blocks, they are an Objective-C language extension that doesn’t fit nicely into the C++ programming model. C++0x lambda functions are much more expressive. They provide standard C++ object semantics such as constructors, assignment operators, etc. and facilitate fine-grained variable capture capabilities where you can precisely decide what you want to capture within the closure, either by value or by reference, if you so choose. Additionally, C++0x lambda functions have been in the works for a long time, long before Apple released Obj-C blocks with OS X 10.6.

  23. This is awesome. Given that the restrict keyword and writeonly additions are totally new, how willing are you to let the legendary C++ language community figure out how to “do it better”? I can foresee that many folks won’t like how MS has done this stuff.

    I think that the writeonly idea is even more broadly applicable than the restrict stuff.

    What you’re doing with restrict is just adding a way to add more qualifiers, like const and volatile. I see the need to qualify functions/lambdas/etc. in a number of ways, but just adding more and more keywords isn’t the way to go. Perhaps you could extend restrict to cover not just this parallel stuff, but *all* the restrictive qualifiers one can use, like const and volatile. Then the writeonly stuff could just be included directly along with the rest of the restrictions.

    I can envision a system whereby you can create new restriction types based on other restriction types, all built up from a core set of restrictions like const, volatile, writeonly, pure, direct3d, etc.

  24. I’m not impressed – I would have been 5 years ago.

    For general-purpose multithreading I believe that Apple’s GCD is much more helpful in reality. And why couldn’t you use Apple’s lambda-function C extension? Maintaining platform-independent business logic was already hard without using MT; this is going to be a pain in the butt. Please stop these API wars.

Comments are closed.