If your program operates on large data sets, performing computations that are essentially data parallel, you can harness the power of the GPU for massive speedups. Traditionally, GPUs have been used for graphics rendering, a task that is data parallel in nature: in 3D rendering, large sets of pixels and vertices are mapped to parallel threads. An image is nothing but a matrix (or multidimensional array) of pixels, and a GPU can apply the same instruction to every pixel in parallel, because all the thread processors in a GPU core share the same control unit and therefore execute the same instruction. Many other applications that process large data sets can use the same data-parallel programming model to speed up their computations.
More specifically, the GPU is especially well suited to problems that can be expressed as data-parallel computations (the same program executed on many data elements in parallel) with a high ratio of arithmetic operations to memory operations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control; and because the program runs on many data elements with high arithmetic intensity, memory access latency can be hidden with calculations instead of big data caches.
This is where GPU processing differs from CPU processing. The main difference between a thread processor and a CPU core is that each CPU core can execute different instructions on different data in parallel, because each CPU core has its own control unit. Nvidia calls its thread processors CUDA (Compute Unified Device Architecture) cores; AMD calls them Stream Processors (SPs). On Nvidia GPUs, CUDA cores are grouped under shared control logic (in hardware, a Streaming Multiprocessor), which essentially means that a group of thread processors executes the same instruction in lockstep.
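To make this execution model concrete, here is a minimal sketch of a data-parallel CUDA kernel (the kernel name vecAdd and the index arithmetic are illustrative assumptions, not something from the text above): every thread executes exactly the same code, and the only thing that varies is the element index each thread computes.

```cuda
// Every thread runs this same function; the only thing that
// distinguishes one thread from another is the index it computes.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (i < n)                // guard: the grid may be larger than n
        c[i] = a[i] + b[i];   // one element per thread, all in parallel
}
```

A launch such as vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n) would run n of these threads, grouped into blocks of 256.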
Now think of a computation that involves multiplying two large matrices (the core computation in a neural network). Matrix multiplication is essentially data parallel in nature: the multiplications of individual element pairs can happen in parallel, and the summation of those products for each result cell can also be parallelized, but the summation for a given result cell can begin only after the multiplications for that cell are complete. So you can visualize that there is a lot of data parallelism involved, but some of the compute steps are sequential as well.
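As a rough sketch of how this could map onto GPU threads (a naive kernel with illustrative names; production code would use a tiled kernel or a library such as cuBLAS): one thread is assigned to each result cell, so the cells are computed in parallel, while the multiply-and-accumulate loop inside each thread is the sequential summation described above.

```cuda
// Naive matrix multiply: C = A * B, where A is MxK, B is KxN, C is MxN.
// One thread computes one result cell; the loop over k is the
// sequential summation for that particular cell.
__global__ void matMul(const float *A, const float *B, float *C,
                       int M, int K, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; ++k)                  // sequential per cell
            sum += A[row * K + k] * B[k * N + col];  // parallel across cells
        C[row * N + col] = sum;
    }
}
```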
For your program to take advantage of the GPU's cores, data has to be transferred from CPU RAM to GPU RAM, and once the computation is done, the results have to be transferred back to CPU RAM. A GPU program comprises two parts: a host part that runs on the CPU and one or more kernels that run on the GPU. Typically, the host portion of the program sets up the parameters and data for the computation, while the kernels perform the actual computation.
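Putting the pieces together, here is a hedged sketch of what the host part might look like for the matrix-multiply kernel above (the function name runMatMul and the 16x16 launch configuration are assumptions for illustration; error checking is omitted for brevity): allocate GPU RAM, copy the inputs over, launch the kernel, and copy the results back.

```cuda
#include <cuda_runtime.h>

// Host-side setup for the matMul kernel sketched above.
void runMatMul(const float *hA, const float *hB, float *hC,
               int M, int K, int N) {
    float *dA, *dB, *dC;
    cudaMalloc(&dA, M * K * sizeof(float));   // allocate GPU RAM
    cudaMalloc(&dB, K * N * sizeof(float));
    cudaMalloc(&dC, M * N * sizeof(float));

    // Transfer inputs from CPU RAM to GPU RAM.
    cudaMemcpy(dA, hA, M * K * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, K * N * sizeof(float), cudaMemcpyHostToDevice);

    // Launch the kernel: a 2D grid of 16x16 thread blocks covering C.
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    matMul<<<grid, block>>>(dA, dB, dC, M, K, N);

    // Transfer the result back from GPU RAM to CPU RAM.
    cudaMemcpy(hC, dC, M * N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```

Each cudaMemcpy here is a real cost: for small problems, the transfer time can dominate the computation itself, which is why data is usually kept on the GPU across multiple kernel launches.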