24 Aug

Offloading & CUDA: Parallelism in C++ #3/3 (also OpenMP, OpenACC, GPU & Coprocessors like Xeon Phi)


Shalom! It is time for the final episode in this three-part series about learning to utilize parallel resources in C++ programming. In the first episode we explored how to extract the best performance from a single CPU core. In the second episode, we learned to utilize all cores within the CPU. But what would be even more efficient than utilizing a processor to the maximum? Well, of course, utilizing more than one processor! So, how exactly could we go about doing that?

How about using multiple identical processors? This is no different than using multiple CPU cores, and we already covered that in episode two. So, scratch that off the list.

How about using multiple computers? This is called cluster computation, or clustering. Clustering is usually, but not always, done at the operating system level with tools such as MPICH. The same program is started on multiple computers, with data collected in one place. It usually does not require any changes to the program’s source code, and that kind of thing is not what I had in mind.

That leaves option three: using multiple different processors.

Within modern PC hardware,
there is often found a very powerful coprocessor. This coprocessor is the graphics processing unit, or GPU for short. The GPU is capable of general-purpose computation, much like the CPU, but with advantages and disadvantages: unlike the CPU, which can run maybe four or sixteen threads, GPUs have hundreds or even thousands. According to the Steam Hardware & Software Survey of May 2017, the most popular gaming graphics card is currently the NVidia GTX 750 Ti. It has 640 CUDA cores. The GTX 1080 Ti, released just a couple of months ago, has 3584 CUDA cores. And guess what? The GNU Compiler Collection (GCC) supports it.

Computing on a device is called “offloading”. Right now, GCC supports offloading to three different platforms: NVidia PTX-enabled graphics cards; Heterogeneous System Architecture, or HSA for short, which is still only used for offloading code to AMD graphics cards; and Intel Many Integrated Core architecture, or MIC for short, which is mainly used for running code on Intel Xeon Phi coprocessors. These are based on the x86 architecture, but are capable of running 256 concurrent threads or more. Sadly I don’t have one. They are quite expensive.

Offloading works like this: You write a single program. Within that program, some parts will be compiled
to be run on the host computer, and some parts will be compiled to be run on the device. Some parts may be compiled for both. The same goes for constant data, and also for variables. The device and the host computer might not share RAM, which means that all arrays and other data must be passed back and forth between these units. Offloading in GCC is done with one of the following two standard extensions: OpenMP and OpenACC. These features are not usually enabled in your Linux distribution. You can enable them by building GCC from source code. Instructions can be found on the GCC website.

Let’s study OpenMP first. Our baseline program is
the vanilla single-threaded, non-SIMD Mandelbrot fractal renderer that I introduced in episode 1. The plan is to have the heavy stuff, the fractal calculation, all happen on the external device. So, Iterate is declared as target code. Anything declared between these two pragmas will be compiled for the target device, not for the host. Then, some reorganization, to put all the target device code in one place, and the host code in another place. Separate those two parts. The offloading itself happens with a single pragma, just as was the case for parallelization in episode 2. Finally we take the resulting iteration counts for each pixel using an STL library call, and convert the iteration counts into colors locally, which is a quick operation.

Let’s review the components of the OpenMP pragma. The word “target” means that the next statement will be run on the device rather than on the host. The word “teams” means that on the device, a league of teams will be created. This means that instead of a single team of threads, there will be multiple groups, or gangs or blocks or teams, whatever you want to call them, and each of them contains multiple threads. “Distribute parallel for” means that the following for-loop will be divided among those threads, so that each iteration will be run exactly once, in unspecified order, but multiple iterations may run concurrently. “Collapse(2)” tells OpenMP that the following two for-loops should be treated as one unit: all combinations of the iterations of the two loops should be distributed across threads. Without the collapse keyword, only the outer for-loop would be distributed, and the inner for-loop would be run consecutively by each thread. The “map” keyword controls the relationship between host memory and device memory. Specifically, we send these variables “to” the device, and once the device is done calculating, we receive the “results” array “from” the device.
When you run a program that uses OpenMP offloading, you use the OMP_DEFAULT_DEVICE environment variable to choose which device to run on. In my case, I want to run it on the NVidia GTX 970, so I select OMP_DEFAULT_DEVICE=1. And the results are! Huh? … Wha? … I don’t understand. Why is it slow? It is definitely running on the GTX 970, according to nvidia-settings. O-kay…! Now it crashed. L-let’s move ahead; maybe there will be
an explanation later.

OpenACC! OpenACC is a different standard by different people that does pretty much the same as OpenMP, except its focus is totally on offloading, whereas OpenMP began from making native parallelism easy. Translating OpenMP pragmas into OpenACC pragmas can be easy or it can be difficult. At its easiest, it is just a change of terminology: what OpenMP calls a team, OpenACC calls a gang, and so on. OpenACC defines a very elaborate memory model involving allocations. It would take me at least half an hour to describe how it all works, but in this example I just copy data in and out. Done.
Let’s check out the performance. It is… eh, well, it’s faster than the OpenMP offloading version, but still much slower than the vanilla version. And it crashed too. Apparently it hit a watchdog timeout. It turns out that the NVidia driver kills the task if the function runs for more than seven seconds.

Now, it is quite surprising that the GPU-accelerated program is slower than a native program, but there may be an explanation. Graphics cards may have hundreds or thousands of threads, but what they have is actually functionally closer to SIMD than to threads. GPU acceleration works best with programs where the exact same calculation is performed on a large set of data. Unlike CPU SIMD, GPU acceleration can cope with situations where the calculations diverge into different branches using if-elses, but the performance will suffer greatly whenever that happens.

To give GPU acceleration the final benefit of the doubt, I went ahead and translated the program into CUDA. This means that instead of GCC, we will use NVidia’s NVCC to compile the program. CUDA is actually just a fancy name for plain old C++, with some extensions. You heard it: CUDA is C++. If you can write C++, you can write CUDA.

Now, the extensions that make CUDA useful are these. Each variable declaration can be prefixed with an optional attribute: constant, device, or shared. These attributes deal with read/write access and lifetime. Each function declaration can be prefixed with an optional attribute: device, host, device host, or global. The entry point between host code and GPU code is called a kernel, and it is indicated by the __global__ keyword. Kernels are invoked using a special syntax that resembles C++ template parameters. The two parameters within these special brackets are the number of blocks, and the number of threads per block. These parameters are actually three-dimensional entities, but I usually treat them as scalars.

Within the kernel, the compiler defines a set of variables that can be read: blockDim, which is the same as the block dimension that was passed in the special bracket parameters; blockIdx, the index of the block currently being executed, which runs from zero to the number of blocks, exclusive; threadIdx, the index of the thread currently being executed, which runs from zero to the thread limit that was passed in the special bracket parameters; and warpSize, the architecture-dependent number of threads per warp. The warp is equivalent to a SIMD unit size.
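To make these pieces concrete, here is a minimal, self-contained CUDA program showing the __global__ attribute, the bracket launch syntax, and the built-in index variables. The kernel is a hypothetical element-squaring one, not the fractal code:

```cuda
#include <cstdio>

// __global__ marks this function as a kernel: an entry point that
// the host can launch on the device.
__global__ void Square(const float* in, float* out, int n)
{
    // Compute this thread's global index from the built-in variables.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

int main()
{
    const int n = 1000;
    float in[n], out[n];
    for (int i = 0; i < n; ++i) in[i] = i;

    // The device cannot see host memory, so allocate GPU buffers...
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    // ...copy the input over...
    cudaMemcpy(d_in, in, n * sizeof(float), cudaMemcpyHostToDevice);

    // ...launch with 128 threads per block, and as many blocks as
    // are needed to cover all n elements...
    int threads = 128, blocks = (n + threads - 1) / threads;
    Square<<<blocks, threads>>>(d_in, d_out, n);

    // ...and copy the results back once the kernel has finished.
    cudaMemcpy(out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out);

    std::printf("out[3] = %g\n", out[3]); // 9
}
```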
Imagine each of these Chinese letters is a single thread. A single block can include a number of threads, but all threads belonging to a single warp always execute exactly the same code, for different data. If your CUDA code contains branches, the GPU will temporarily stop all the other threads in the same warp, because threads belonging to one warp can only execute one sequence of code. Essentially the performance would diminish by a factor equivalent to the warp size. The good news is that unlike in CPU SIMD, the GPU driver can actually migrate threads between warps to utilize them more efficiently, but this is only a best-effort process, and how exactly it works is neither disclosed nor set in stone.

CUDA also defines some vector datatypes. These are vector datatypes in name only; CUDA hardware contains no SIMD instructions, because the hardware itself is a SIMD machine. According to my tests, there is no performance benefit to using them. And CUDA contains a huge library of functions both for the host and the device. And by huge, I mean it is massive. There are something like 150,000 lines of physical code in the header files alone. That number is in the same ballpark as the amount of code in a recent Linux kernel, and that was just the header files. You could spend a year just learning how to use all those libraries.

When converting the fractal renderer into a CUDA program, the first order of business is to convert the Iterate function into a kernel. This is done by adding the __global__ attribute. Now, because this function will be called thousands of times with identical parameters, with only the block and thread indexes differentiating the different calls, we will need a couple more data items to make the parameters useful: namely, the center coordinate in the fractal, the zoom scales, and a pointer to where all the results will be stored. Of course, to invoke the kernel, we will use the bracket syntax introduced at 9:46 in this video. The number of blocks times the number of threads must be at least as big as the results array. I set the number of threads arbitrarily at 128, and the number of blocks is calculated from the need. Now, the GPU code cannot write directly into the host computer’s memory, so we need to allocate memory directly from the graphics card. The cudaMalloc function allocates GPU memory. This memory pointer will be passed to the kernel, and the cudaMemcpy function, invoked here, will copy the data from the GPU memory into the host memory once the GPU code is done executing.

Here we go! Ah. Finally, it’s faster than anything else so far. It dances a little with the OpenMP thread version, but then shakes it off with a huge margin. In the end, this is 17 times faster than the vanilla version, and 2.5 times faster than the thread loop.

Can I make it faster? Well, first of all, I think the transfer between GPU and host memory may be a bottleneck. I should at least try to see what happens if I reduce the array of 32-bit floats into an array of 16-bit ints. I should also try multiple streams. CUDA operates using a stream of operations. A stream is like a pipe from which the graphics card driver pulls tasks that it processes in sequence. Right now my stream looks like this: calculation, transfer, calculation, transfer. Each period of calculation is followed by a period of memory transfer. During the memory transfer, the CUDA cores are actually idle, waiting for work. But it is actually possible to have more than one stream in CUDA. If I start another task in the second stream when the first stream begins transferring data, and vice versa, the calculation tasks on the GPU would be performed back to back, with absolutely no idle time. It would mean perfect 100% GPU utilization. Let’s try that.

So, two streams. To support the two streams, we need all kinds of bookkeeping: a counter for the next stream to be occupied, the number of streams currently processing data, and flags indicating the same. The streams are created with the cudaStreamCreate function. The memory buffers must also be duplicated. The memory copying will now happen with cudaMemcpyAsync, meaning it will proceed asynchronously. All of the remaining changes to the main program are just gymnastics around the asynchronous relationship between the calculations and the processing of the results. The main loop now begins with interpreting the results; that is, converting the array of iteration counts into pixels, rendering the pixels, and updating the inaptly named boolean variable “NeedMoment” that controls whether periodicity checking will be enabled for the next frame. Before we can assume the results are available, the cudaStreamSynchronize function must be called. It waits until all commands in the specified stream are completed.
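A stripped-down sketch of that two-stream pattern, again with a hypothetical squaring kernel rather than the renderer, might look like this:

```cuda
#include <cstdio>

__global__ void Square(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

int main()
{
    const int n = 1 << 20;
    // Async copies want page-locked (pinned) host memory.
    float *in, *out;
    cudaMallocHost(&in,  n * sizeof(float));
    cudaMallocHost(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = i;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // Two streams: while one is busy copying, the other can compute.
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&stream[s]);

    // Split the work in half and queue each half into its own stream:
    // copy in, compute, copy out, all asynchronously.
    int half = n / 2, threads = 128, blocks = (half + threads - 1) / threads;
    for (int s = 0; s < 2; ++s)
    {
        int off = s * half;
        cudaMemcpyAsync(d_in + off, in + off, half * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        Square<<<blocks, threads, 0, stream[s]>>>(d_in + off, d_out + off, half);
        cudaMemcpyAsync(out + off, d_out + off, half * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    // Wait until every queued command in each stream has completed.
    for (int s = 0; s < 2; ++s) cudaStreamSynchronize(stream[s]);

    std::printf("out[5] = %g\n", out[5]); // 25
    for (int s = 0; s < 2; ++s) cudaStreamDestroy(stream[s]);
    cudaFreeHost(in); cudaFreeHost(out);
    cudaFree(d_in);   cudaFree(d_out);
}
```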
The result is a very slight performance improvement. And by very slight, I mean it’s almost no improvement at all. In the end it’s like a single percent faster than the first CUDA version. Totally negligible. As expected, the margin to the CPU versions is greatest when the majority of the iterators run for a long time. Any time the different iteration loops branch at significantly different moments, such as in the very beginning, at the outermost zoom level, the CUDA engine performance suffers, for reasons that were explained at 10:57 in this video.

Finally, I made a version that combines the best techniques from all three episodes. First, there are two CUDA streams. In addition, there are eight native C++ threads, each of them calculating the image using SIMD acceleration. In total, there are ten processing units, each of them responsible for a single picture. Because of this, different frames take wildly different times to calculate. The following chart is actually smoothed data, where the elephant-size variations between the render times of different frames are somewhat evened out. In the end it is about forty times faster than the vanilla version.

Now, I know what some of you are thinking. I know, because I thought the same. Isn’t there overhead in calculating multiple frames in parallel? Would it not be better for cache efficiency to work on a single frame using multiple threads? Cache efficiency notwithstanding, it is actually faster to calculate multiple frames in parallel, as you can see from this chart. The reason is that when you process a single frame using threads, say, by assigning each scanline to the next available thread, there is a situation when the frame is almost complete where some threads are idle, but some are still working. The next frame cannot be started before all threads are finished. And this moment, where some threads are idle, just does not exist when you process multiple frames in parallel. Now you might think: how about if we start assigning threads to the next frame when they become free, while some threads are still calculating the previous frame, instead of waiting until the previous frame is complete? I thought that too. In fact, the chart you see on the screen is the result of that thought process carried to its end. The one I described earlier was even slower.

Oh. This screen again. Does this mean I should say some words, like “conclusion” or something? Ah. Yeah. I know I pulled some lines straight and made some inaccurate generalizations here. For instance, CUDA is not actually C++. NVidia’s NVCC is based on LLVM, and can add the tiny CUDA extensions to many different languages, like Fortran for instance. Essentially the same principle as with OpenMP, but still different in some ways. The purpose of this video series was to give you some ideas and inspiration, to help you look where you didn’t know to look before. There are other people who make comprehensive references and tutorials. In any case, please do check the video description and the links therein for all the information that I could not pack into the video itself, and check out the comments on the video as well. I will also gladly reply to all polite comments, although I do hope you do your research first in case it’s a frequently asked question.

Thank you for all your support. Thanks to the people who translate captions into different languages! Huge thanks to all 35,000-plus people who have hit the subscribe button. You are awesome! I wish everyone was like you. Thanks to the people who give me awesome ideas for future videos and livestreams by asking fascinating questions in the comments. I hope you all have a fantastic day. See you again.


96 Comments

  • Bisqwit says:

    Sorry if the background music in this video is annoying. It might have been better to do altogether without. For some reason it was more difficult to choose anything that works well, compared to most previous videos.

  • danyael031 says:

    Can you recommend a book to learn c ++?

  • DWLooney1 says:

    Really well-made video, great series!

  • Martin Kemp says:

    Thanks for the series, can't wait for the next one.

  • tempodude says:

    I don't even program in C++, Your voice is just too good.
    Also greetings from israel!

  • TheJanhalsema1 says:

    great video, still baffeld by the performance of OpenACC and OpenMP offloading, I wonder what the OpenCL performance would have been like. nVidea is bringing OpenCL 2.0 to their gpus and with the latest OpenCL specification you can write opencl kernels in C++ 🙂

  • RedKugelblitz says:

    Awesome! I will try to implement cuda in some of my projects, and i am glad you shine light in these concepts. If it weren't for you, i probably wouldn't keep programing like i do today. Thank you bisqwit!

  • ProCactus says:

    Just wow !!

  • Las Desventuras de Virus says:

    Wow Bisqwit, that was a really awesome presentation and a nice point of start to program algorithms for GPU.
    I was just thinking in some arithmetic parser to evaluate a function in a set of thousands of points, and I thought GPU programming could help greatly. Now that you exposed how a good GPU algorithm should be done, I'll reconsider.
    If you are interested, I'll send links if I finish it. I'll not optimize it so much at the moment though.

    Again, thanks for your dedication to this material!

  • James Balajan says:

    Excellent and informative series. Thank you for educating us yet again Bisqwit!

  • Svette says:

    Clearest explanation ever! Please keep this up, great series

  • Naftoreiclag says:

    Awesome!

  • SatikCZE says:

    This is like porn for programmers 🙂

  • Radu Motrescu says:

    This is the best programming videos I've seen on youtube. I love your channel, more C++ please! <3

  • DA oliHVAR says:

    great videos man !

  • apu apustaja says:

    Beautiful visualization

  • ᅠKawa says:

    That dramatic rumble when the two stream eight core version is graphed out though.

  • VintageGearFreak says:

    Hope you have a fantastic day too Bisqwit!

  • Nikolay Mihaylov says:

    great video. what about threads + SIMD, without GPU ? did you tried it or I just miss it somehow?

  • David García says:

    Nice series Joel. Always a pleasure to watch your C++ related videos, they are inspiring.

  • Nucleus says:

    The illustrations are really nice work!

  • Karlo says:

    Great video series! It really shows how programming had to switch from old(er) single execution thread thinking, into multiprocessing, parallelism and offloading.
    Would be great to have video how similar/other technologies are applied especially over slow IO, like HD's, Networks etc…

  • fred forcx says:

    An awesome video as always! These three videos bring so many possibilities, thank you. I was already interested in parallelism in c++ and I had found OpenCL. You didn't talk about it so what's your opinion? It does support more platforms than CUDA, am I wrong?

  • Antoine Tamer says:

    I love you and all your family.

  • Timur Kiyui says:

    I've learned so much from this series, thanks so much for what you do 😊

  • robby says:

    Bis, I have no fucking idea what is this, but man, I like you so much

  • Kei Ouji says:

    This was a great video, and I was playing along in code:blocks… I do wish you put in a word or two about OpenCL. It can run on anything; and is easy to use. You can just set it to run on all CPUs and GPUs at the same time with the same code. It does however compile at runtime. But when you want to get the numbers crunched…. it takes the cake.

  • Awn says:

    I haven't even started C++ yet and I still watch your videos. You're amazing.

  • CraftNeui says:

    I really like such kind of videos, especially with subtitles (captions). While I'm not a native speaker either, I'd find these things irritating:

    Starting at about 9:00:

    Subtitles: Now, the extensions that make CUDA useful are:
    Voice: Now, the extensions that make CUDA useful are the following:

    Subtitles: with an optional attribute: Constant, device, or shared.
    Voice: with one of the following optional attribute*s*: Constant, device, or shared.
    (Also the subtitles displays a bit longer, aka. it goes into the next "voice")

    Subtitles: These attributes deal with read/write access and lifetime.
    Voice: The differences between these attributes lie within the access permissions and lifetime.

    Subtitles: with an optional attribute:
    Voice: with one of the following optional attribute*s*:

    Subtitles: Device, host, device host, and global
    Voice: Device, host, device and host, and global

    And starting at about 14:55:

    Subtitles: It would mean perfect 100% GPU utilization.
    Voice: It would mean perfect 100% utilization of the GPU.

    Also, about from 14:50 to 14:59, you're talking a bit "too fast", you begin the next "part" immediately after finishing one "part".

    I don't want to try to make you look bad, there are some more smaller cases that you've added an ed/s at the end of an word, but these were pretty noticeable, at least for me.
    Keep up the good work. I was going to ask about OpenCL, but others already did that.

  • Gustavo Aguilar says:

    Now I'm inspired to use parallelism in my projects hahaha. Ty

  • Supa Koopa Troopa 64 says:

    How does CUDA on C differ from CUDA on C++?

  • StereoBucket says:

    I love good performance graphs. So exciting.
    Nice trilogy, Bisqwit.

  • Jeff Larkin says:

    Great video! A couple of minor comments, since you asked for them. =)

    FWIW: At 7:00, the NVIDIA CUDA equivalent to a league of OpenMP teams is a Grid. Threadblocks are organized into grids.

    I'm glad you revisited your statement that "CUDA is C++," because there's so much more to the platform (C++, Fortran, libraries, etc.).

    There's really no reason that the OpenACC and OpenMP offloading versions should perform dramatically worse than a naive CUDA version for something this simple. If I can make some time, I'd like to look at the code that you posted on github to see if I can determine what went wrong here. Have you tried using the PGI community edition? It's OpenACC support is far more mature than GCC, so it may give you better results. Also, if data movement is this issue, which it likely is, then I'll point you to a code that I frequently use in teaching asynchronous OpenACC, which happens to also use mandelbrot, albeit a much simpler one (https://github.com/NVIDIA-OpenACC-Course/nvidia-openacc-course-sources/tree/master/october-2015/labs/lab4.pipelining/c99).

    Thanks for the great video. I discovered this video first, so now I need to go back and review the other two.

  • ExSiZuGeSe says:

    Awesome Video, thanks a lot!

  • Sizik says:

    How is the performance of SIMD + threads without also using CUDA?

  • TheMustang63100 says:

    Great video series.

  • Alexei Barnes says:

    OpenMP/OpenACC offloading ended up looking really bad here. Are they really just that adverse to this workload? I'd at least hope that they outperform CPU threads in other workloads, maybe neural networks?

  • amicloud says:

    I think that this series is my favorite thing you've uploaded so far! Great explanations and a very interesting topic. More like this please! And I think your accent is getting much better!

  • Emanuele Bonura says:

    Cannot help but reading all your comment replies with your voice. Inspiring video, you sir are a legend

  • Antoine Tamer says:

    Can you explain why are the curves all have this shapes ?

  • Pesticide 天安門大屠殺 says:

    Great video as always, informative and interesting. You can clearly tell that Bisqwit spent a lot of hours to make this video (studying, producing it, recording, editing, etc) but it was not time wasted because the end result is awesome. Keep it up

  • Mike says:

    fantastic video series bisqwit thank you

  • ZaelHaelAway says:

    Hello Bisqwit, Thanks for this fascinating series !, i have been totally gripped since episode one !… These lessons are gold and i will rewatch them for years to come, i would love to learn more about BLAS next ( especially with another beautiful chart ) ! thanks again

  • staviq says:

    What's with the sheep 🙂

    Also, if your program tries to represent 3-dimensional fractal in a 2-dimensional space, wouldn't the performance graph of this program be a 1-dimensional representation of this fractal?

  • Oscar Urdaneta says:

    everytime i watch these videos i feel like im on 7 tabs of acid

  • Andres Hurtado says:

    Awsome Video. The effort you put on the details really shows !!!! Keep you good effort. I'm learing lots from you.

  • imabluerobot says:

    I love programming but I have a long way to go. Right now I am implementing a simple garbage collector. I think this task would take maybe fifteen minutes for Bisqwit but I've spent weeks on it. : Anyway, thanks for the video!

  • rck says:

    Hey Bis, what resources do you recommend for learning about basic multithreading in c/c++? Also which books did you read? Would love to hearing from you!

  • wazawoo says:

    I want to see how you made the speed graph, it's super cool (sorry if you mentioned it somewhere in the series). Great video!

  • KynikossDragonn says:

    My favourite part of this video is when you made a variable named "NeedMoment"

    I just imagine the program saying: "I need a moment here!"

  • Ja Ty says:

    Hi Bisqwit.
    Which GPU did you used for this CUDA tests? I guess some of NVIDIA GPU because CUDA, but which (GTX970)? And what about the AMD GPU, it won't be have a better performance?
    Many thanks for your answer.

  • Alex Meanberg says:

    Gaining a deeper knowledge of how a CPU works is really empowering, even to high level programmers such as myself. Thank you.

  • Penta Penguin says:

    Why OpenMP and OpenAAC did crash ?

  • kxmode says:

    Shalom aleikhem, Bisqwit! 🙂

  • kxmode says:

    glados:~$ … naturally! 🙂

  • wenxiu long says:

    good video!
    Can you tell us what is your coding font? It's looks nice

  • Sami Rantanen says:

    This is now my new favorite programmer's porn channel 🙂 My job involves processing large point clouds.

  • Hex Wrench says:

    I'm impressed with the work that you have done, however, I don't have enough knowledge on this subject to contribute much more than a simple complement.
    Just as an aside question when do programs need to incorporate a make file in the compiling process?

  • accki911 says:

    Do you plan to continue your C tutorials? Unfortunately, K&R is a bit outdated, we definitely need a person with real programming experience to teach us 🙂

  • inabahare says:

    They're Japanese letters though.. Right?

  • Bacon Invader says:

    Please start an audiobook series

  • Rafal says:

    how did you measure the speed of your programs?

  • thmorriss says:

    this is so dope, keep it up bisquit! we love u

  • Tomas Canevaro says:

    Bisqwit would you consider "the c++ programming language" a good book for learning c++?

  • Adam7868 says:

    i really wish this existed when i was making my multithreaded physics calculator(unfortunately it didnt turn out because i couldent figure out how to do multiple calculations at once) but now this exists i might have another go at it

  • II says:

    You're a fucking genius but you can't afford a $2500 GPU? This makes me question our economic system. Also, you should create a Bitcoin donation address.

  • Ori Levy says:

    שלום

  • Tyler says:

    Hey there, long time fan.

    I was taught in university the C way of dealing with memory. I was wondering what your thoughts are on smart pointers and when they should be used in the context of a C++ project. As I understand, they are just a quality of life thing – and I am guessing they account for a little bit of overhead.

    I really enjoyed your parallelism video playlist
    Looking forward to the next video! Thanks

  • Zeb DeOs says:

    Really enjoying your videos (old and new). Thanks for putting this stuff out there for all to learn from!

  • Kinder112 says:

    What program is it, for drawing such nice charts?

  • Timmy Fifty says:

    Could you do a project with cluster computing? Also, tell me if I'm wrong, but it would be cool if a distributed virtual environment for running programs existed such that:

    – Multiple computers participate in running a single program
    – Data is shared between computers over a fast internet connection
    – Computers can be used for other things like webservers
    – Different types of computers can be part of the same network
    – Bonus: A degree of parity is built in so that computers can be taken off the network without any effect

    Do you know of any software that could act as a personal "bot net" for running programs?'

    As always, really great video!!

  • 1wsx10 says:

    why did the rebooting requirement prevent you from doing the test?

  • monotron says:

    What monospace font are you using for the code snippets in this video? e.g. at 5:25. TIA!

  • chomik says:

    Hi, I wonder how did you make CUDA working with gcc 7. For me it says that 5 is the highest version supported.

  • Beau Mancini says:

    Great work, Bisqwit.

  • Don't Read My Profile Picture says:

    chinese letters? looks like kanji to me

  • Dewa Made Mahaputra Wijaya says:

    This guy is freakin genius, wonder if he write a program to mining crypto, he could be billionaire

  • David Parry says:

    Hi. Great video series. Thank you.
    I'd like to start making my programs use the performance boosts that CPU cores and GPUs provide, which everyone seems to have now.
    In your opinion would OpenACC or OpenMP be the better standard to learn, to concentrate on? Or have neither matured enough yet?
    You say OpenACC now has that issue fixed… Am I right in assuming OpenMP is a little more stable now too?
    Thanks for a great set of tutorials… I'm trying to get back into C++ after an age out of it.

  • Fabrício Figueiredo says:

    hi @bisqwit! very nice video!! could u please tell what is that program u used for performance comparison? thank u very much in advance!

  • Amey Shukla says:

    Thank You so much! I love your videos.

  • Alex Shi says:

    woa this is sick

  • VMA says:

    3:30 at the footnote:
    Why didn't you reboot?

  • Marj Hxjcjd says:

    Why is this in Spanish?

  • Darius Duesentrieb says:

    Are CUDA streams comparable to Vulkan queues, or are they something different?

  • sDdnDSiduAe4b - says:

    i think the cuda version works at 30–40% of optimum, and there are ways to fix that:
    1. remove as much double-precision math as you can: REPLACE DOUBLES WITH FLOATS!

    2. more complex and more rewarding: remove thread divergence.
    A warp consists of 32 threads and can only execute the same instruction, so you can view all branches as flags marking the next instruction; the instruction is still executed, its result just ignored if the corresponding condition failed. But in your case the branch is a return from the function (even worse, not an immediate return but a small calculation + return), which is really bad, because that particular thread in the warp sits useless while the others haven't finished. It may very well be that most of the time only 1 thread of the 32 is doing work (1/32 of max performance!).
    There are three solutions, all implying a custom balancer that always stays in a loop:
    1. one which takes an additional task from a pool when the current one is done (you will lose some % on reinitialisation, since the other threads wait on that code; you should probably do several iterations without that check, i.e. a loop within a loop);
    2. one which aborts the current iteration if most of the other threads in the warp are complete (possibly saving an intermediate result to continue from, or used in combination with 1);
    3. one which always makes a fixed number of loop iterations and saves an intermediate result:
    a. an initialisation __global__ function, which creates the intermediate state corresponding to entering the loop;
    b. a __global__ function which does a fixed number of iterations and updates the intermediate state (with a flag marking those that have finished);
    c. a __global__ function which filters out the finished results.

    The third option is the least difficult (as it doesn't use a lot of synchronization primitives, which are very difficult to debug and profile), but it greatly depends on how much memory the intermediate state occupies. In your case the initial input is zero memory (the thread dimension coordinates don't occupy anything) and the output is only 32 (16) bits per thread, which is nothing (and it still travels between device and host, whereas the intermediate state stays only on the device). In terms of device memory usage, and more importantly internal memory bandwidth, this approach degrades hundreds of times; but I think it would be only insignificantly slower in the worst case, while in the best case you would gain a lot of performance.
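    The fixed-iteration idea in option 3 might look roughly like this. (A minimal sketch, assuming a Mandelbrot-style workload for illustration; it is plain CPU-side C++ rather than real CUDA, and the names `IterState`, `iterate_chunk` and `run_fixed_chunks` are my own invention. On a GPU, `iterate_chunk` would be one `__global__` kernel launch per pass.)

    ```cpp
    #include <complex>
    #include <cstdio>
    #include <vector>

    // Per-"thread" intermediate state: the running value, the iteration count,
    // and a flag marking finished work items (option 3's idea).
    struct IterState { std::complex<double> z; int iterations; bool done; };

    // One fixed-size chunk of iterations over every unfinished item.
    // On a GPU this whole function would be a single kernel launch.
    void iterate_chunk(std::vector<IterState>& st,
                       const std::vector<std::complex<double>>& c,
                       int chunk, int max_iter)
    {
        for (std::size_t i = 0; i < st.size(); ++i)
        {
            if (st[i].done) continue;
            for (int k = 0; k < chunk; ++k)
            {
                if (std::norm(st[i].z) > 4.0 || st[i].iterations >= max_iter)
                    { st[i].done = true; break; }     // escaped or capped
                st[i].z = st[i].z * st[i].z + c[i];   // z = z^2 + c
                ++st[i].iterations;
            }
        }
    }

    // Run fixed-size chunks until every item is finished; return the counts.
    std::vector<int> run_fixed_chunks(const std::vector<std::complex<double>>& c,
                                      int chunk, int max_iter)
    {
        // a. initialisation: state corresponding to entering the loop
        std::vector<IterState> st(c.size(), IterState{{0.0, 0.0}, 0, false});
        for (bool all_done = false; !all_done; )
        {
            iterate_chunk(st, c, chunk, max_iter);   // b. fixed-iteration pass
            all_done = true;                         // c. check who is finished
            for (const auto& s : st) if (!s.done) all_done = false;
        }
        std::vector<int> out;
        for (const auto& s : st) out.push_back(s.iterations);
        return out;
    }

    int main()
    {
        // Three work items with very different iteration counts: the first
        // never escapes (capped at 100), the others escape quickly.
        std::vector<std::complex<double>> c = { {0.0, 0.0}, {2.0, 2.0}, {0.5, 0.5} };
        for (int n : run_fixed_chunks(c, /*chunk=*/8, /*max_iter=*/100))
            std::printf("%d\n", n);   // prints 100, 1, 5
    }
    ```

    The point of the pattern is that a diverging thread only wastes at most one chunk's worth of work before the warp is re-packed with unfinished items, at the cost of keeping the intermediate state in device memory between launches.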

  • Gizmoriderful Ye says:

    Just think if programmers nowadays actually used those techniques in their programs and games…

  • High Philosopher says:

    Dude, if only you worked with a hardware architect… You would re-engineer computers to a good extent and probably make a fortune.

  • reza f says:

    Bisqwit many thanks for your incredible explanations. I love how you think about problems, and the professionalism in your explanations and communications.

  • Harold McBroom says:

    That's the GPU that I use: the 750 Ti dual-fan 2 gig video card. My specs are a 990 Fx Extreme3 motherboard with an Fx-6400 (Vishera) processor, 16 gigs of RAM, and a 111 gig SSD hard drive, running under Windows 7, because I refuse to buy another Microsoft product; it seems their products are getting worse with each passing year. Feels like the industry is being sabotaged, along with the gaming industry.

    My first experience with machines was when my dad brought home a TI-99/4A with a speech synthesizer that sounded a whole lot better than Microsoft Sam does today. My first computer was a hand-me-down from my dad: an AT&T 6300 with 640k RAM, a 1200 baud modem, a CGA monitor, and dual floppies. That computer gave my dad many years of service, and when I got it, it gave me around 8 more years. When that machine died, it was as though I lost my best friend, and I remember standing outside on the porch crying. This will probably be the last computer I buy. I loved my first computer, and I guess a part of me loves my last computer, but I didn't feel much about any of the computers in between, as most of them were garbage pre-fabs. The one I have now was custom built and has given me over 10 years with just a few upgrades to the processor and memory, but my hard drive has surprisingly survived longer than I expected, and it's still kicking. It's only a 111 gig SSD; they don't even make them anymore to my knowledge.

  • WikiPeoples says:

    Bisqwit – I have to know… were you formally trained in CS? You must have been, right? I know some self taught programmers and they're great app developers, but they don't understand the extremely low level + hardware stuff like you do. I don't mean to minimize your expertise by pointing out you are formally taught (if you are). Quite the contrary – you strike me as a very high level performer / genius.

  • saultube44 says:

    You didn't allow synchronization between the thread and the NVidia driver or something. Maybe instead of making all of the 10 processing units work independently, try to have them help each other in a coordinated effort, so each can be as efficient as possible, for max efficiency.

  • toni3doom says:

    i have never seen a programmer smarter than you, very good job, i'm proud of having become one of your patrons.

  • Rat Man says:

    Are you Jewish?

  • Piparo says:

    Nice job as always! Regarding the MPICH library, I would like to point out that making parallel programs using MPI usually requires some changes to the source code if you start from a sequential version of the same program (in contrast with OpenMP, where most directives can be added without further adaptation). Sometimes you have to rethink the program structure to distribute the workload efficiently between processors and to minimize communication between processors, since it is more expensive than communication between threads. Most of these practices can be seen within the study of Parallel Algorithms.
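    A tiny illustration of that point (my own sketch, not real MPI: the "ranks" are simulated with a plain loop, and `partitioned_sum` is a hypothetical name): where an OpenMP version of a sum is one `#pragma omp parallel for reduction` over the original loop, an MPI-style version must explicitly decompose the data into per-rank slices and then combine the partial results, which is what an `MPI_Reduce` collective would do across real processes.

    ```cpp
    #include <cstdio>
    #include <numeric>
    #include <vector>

    // Simulated MPI-style decomposition: each "rank" owns a contiguous slice
    // of the data and computes a partial result; the partials are combined at
    // the end. In real MPI each rank would be a separate process with its own
    // copy of its slice, and the combination step would be an MPI_Reduce.
    long partitioned_sum(const std::vector<int>& data, int nranks)
    {
        long total = 0;
        for (int rank = 0; rank < nranks; ++rank)
        {
            std::size_t lo = data.size() * rank / nranks;        // first index
            std::size_t hi = data.size() * (rank + 1) / nranks;  // one past last
            total += std::accumulate(data.begin() + lo, data.begin() + hi, 0L);
        }
        return total;
    }

    int main()
    {
        std::vector<int> data(100);
        std::iota(data.begin(), data.end(), 1);           // 1..100
        std::printf("%ld\n", partitioned_sum(data, 4));   // prints 5050
    }
    ```

    Even in this toy form you can see the restructuring Piparo describes: the slicing arithmetic and the explicit combine step exist only because the data is distributed, and choosing slice boundaries that minimize communication is exactly the part that has no counterpart in the shared-memory OpenMP version.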

  • Bisqwit says:

    Remember to watch the conclusion video in which I address the question about the OpenMP / OpenACC performance, among other things! https://www.youtube.com/watch?v=pCoxpKTmykA

Leave a Reply

Your email address will not be published. Required fields are marked *