<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="bbPress/1.0.2" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>k-Wave User Forum &#187; Topic: GPU Performance - Tesla C1060</title>
		<link>http://www.k-wave.org/forum/topic/gpu-performance-tesla-c1060</link>
		<description>Support for the k-Wave MATLAB toolbox</description>
		<language>en-US</language>
		<pubDate>Tue, 12 May 2026 22:32:27 +0000</pubDate>
		<generator>http://bbpress.org/?v=1.0.2</generator>
		<textInput>
			<title><![CDATA[Search]]></title>
			<description><![CDATA[Search all topics from these forums.]]></description>
			<name>q</name>
			<link>http://www.k-wave.org/forum/search.php</link>
		</textInput>
		<atom:link href="http://www.k-wave.org/forum/rss/topic/gpu-performance-tesla-c1060" rel="self" type="application/rss+xml" />

		<item>
			<title>DanR on "GPU Performance - Tesla C1060"</title>
			<link>http://www.k-wave.org/forum/topic/gpu-performance-tesla-c1060#post-130</link>
			<pubDate>Fri, 25 Mar 2011 19:28:23 +0000</pubDate>
			<dc:creator>DanR</dc:creator>
			<guid isPermaLink="false">130@http://www.k-wave.org/forum/</guid>
			<description>&#60;p&#62;Brad,&#60;br /&#62;
I am using Matlab 7.11 (R2010b). I will check and see if the multiple cores are actually being used.&#60;/p&#62;
&#60;p&#62;Regarding the times for arrays larger than 600x600, the mistake I made was to use spaces to line up the columns. All times for the larger arrays should be justified under the GPU column, not the CPU column; I did not run CPU times for these arrays.&#60;/p&#62;
&#60;p&#62;Thanks for the tip on the Tesla cards. We do all of our reconstruction work with actual data in single precision for the GPU - that is more than adequate.&#60;br /&#62;
-Dan
&#60;/p&#62;</description>
		</item>
		<item>
			<title>Bradley Treeby on "GPU Performance - Tesla C1060"</title>
			<link>http://www.k-wave.org/forum/topic/gpu-performance-tesla-c1060#post-121</link>
			<pubDate>Wed, 23 Mar 2011 22:13:37 +0000</pubDate>
			<dc:creator>Bradley Treeby</dc:creator>
			<guid isPermaLink="false">121@http://www.k-wave.org/forum/</guid>
			<description>&#60;p&#62;Hi Dan,&#60;/p&#62;
&#60;p&#62;I was just looking at your simulation times again and two things spring to mind. First, what version of MATLAB are you using? It seems that it might not be making use of the extra cores on system 2. Earlier versions of MATLAB (I think before around 2008a) do not include multicore support for parallelisable functions such as the FFT. If this is the case and your CPU clock speed on system 2 is lower than on system 1, that could explain the slightly worse performance.&#60;/p&#62;
&#60;p&#62;Second, I'm wondering if there is a typo or exponent change in your simulation times once you get above 600 x 600 (the time jumps from 984 to 103).&#60;/p&#62;
&#60;p&#62;Regarding single and double precision, you can use single precision on the CPU by setting &#60;code&#62;&#38;#39;DataCast&#38;#39;&#60;/code&#62; to &#60;code&#62;&#38;#39;single&#38;#39;&#60;/code&#62;, and single or double precision on the GPU by setting it to &#60;code&#62;&#38;#39;GPUSingle&#38;#39;&#60;/code&#62; or &#60;code&#62;&#38;#39;GPUDouble&#38;#39;&#60;/code&#62;. Keep in mind that the particular GPU card you are using does not have very good double precision performance (it will be around 8 times slower than single precision). This has been addressed in the newer Tesla cards (C2050, C2070), which have much better double precision performance.&#60;/p&#62;
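For example, a minimal sketch (the kgrid, medium, source, and sensor inputs are assumed to come from the standard k-Wave example setup, and the option strings are as given above):

```matlab
% Run the 2D simulation in single precision on the CPU:
sensor_data = kspaceFirstOrder2D(kgrid, medium, source, sensor, ...
    'DataCast', 'single');

% Or in single precision on the GPU:
sensor_data = kspaceFirstOrder2D(kgrid, medium, source, sensor, ...
    'DataCast', 'GPUSingle');
```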
&#60;p&#62;Brad.
&#60;/p&#62;</description>
		</item>
		<item>
			<title>DanR on "GPU Performance - Tesla C1060"</title>
			<link>http://www.k-wave.org/forum/topic/gpu-performance-tesla-c1060#post-118</link>
			<pubDate>Mon, 21 Mar 2011 14:33:49 +0000</pubDate>
			<dc:creator>DanR</dc:creator>
			<guid isPermaLink="false">118@http://www.k-wave.org/forum/</guid>
			<description>&#60;p&#62;Brad,&#60;br /&#62;
I presume the GPU runs were done in single precision because of the 'DataCast' parameter. The CPU runs were done in (default) double precision, so I guess the CPU results could be a little faster if this parameter were used there as well.&#60;br /&#62;
-Dan
&#60;/p&#62;</description>
		</item>
		<item>
			<title>Bradley Treeby on "GPU Performance - Tesla C1060"</title>
			<link>http://www.k-wave.org/forum/topic/gpu-performance-tesla-c1060#post-116</link>
			<pubDate>Fri, 18 Mar 2011 07:37:15 +0000</pubDate>
			<dc:creator>Bradley Treeby</dc:creator>
			<guid isPermaLink="false">116@http://www.k-wave.org/forum/</guid>
			<description>&#60;p&#62;Hi Dan,&#60;/p&#62;
&#60;p&#62;Thanks for your feedback and comments; it is interesting to see your simulation times and great to see that you are getting good speed-ups using the C1060. Are your CPU and GPU simulations performed in double or single precision?&#60;/p&#62;
&#60;p&#62;The computations are heavily dependent on the FFT so they will be quickest when using grid sizes that are a power of 2, e.g., 128, 256, 512, etc. They will be almost as quick for sizes with small prime factors, but slower otherwise.&#60;/p&#62;
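One quick way to check this in MATLAB (a sketch; the sizes shown are only examples):

```matlab
% Grid sizes whose prime factors are all small FFT quickly;
% sizes with a large prime factor are noticeably slower.
factor(512)   % 2 2 2 2 2 2 2 2 2  (power of two - fastest)
factor(600)   % 2 2 2 3 5 5        (small primes - nearly as fast)
factor(509)   % 509                (prime - slower)
```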
&#60;p&#62;The CPU/GPU break-even point is also dependent on the host system, which must communicate with and transfer data to and from the GPU. You could also try increasing the priority of the MATLAB thread in the Windows Task Manager to see if that gives you a little extra speed (try anything up to high, but don't use real-time if you still want to interact with your computer!).&#60;/p&#62;
&#60;p&#62;If you have any more questions or comments, please let us know.&#60;/p&#62;
&#60;p&#62;Brad.
&#60;/p&#62;</description>
		</item>
		<item>
			<title>DanR on "GPU Performance - Tesla C1060"</title>
			<link>http://www.k-wave.org/forum/topic/gpu-performance-tesla-c1060#post-115</link>
			<pubDate>Thu, 17 Mar 2011 18:55:32 +0000</pubDate>
			<dc:creator>DanR</dc:creator>
			<guid isPermaLink="false">115@http://www.k-wave.org/forum/</guid>
			<description>&#60;p&#62;Here are some preliminary results of running the ultrasound in a homogeneous medium example with and without an Nvidia Tesla C1060 GPU card (240 cores, 4 GB RAM). Below are the actual simulation times (in seconds).&#60;br /&#62;
System 1: XP, dual processor, 2 GB RAM.&#60;br /&#62;
Matrix size  w/o GPU  with GPU&#60;br /&#62;
100x100        5       7&#60;br /&#62;
200x200       25      15&#60;br /&#62;
300x300      101      24&#60;br /&#62;
400x400      243      35&#60;br /&#62;
500x500      515      69&#60;br /&#62;
600x600      908      67&#60;/p&#62;
&#60;p&#62;System 2: Win 7 64-bit, 8 GB RAM, 8 processors&#60;br /&#62;
Matrix size  w/o GPU  with GPU&#60;br /&#62;
100x100        6        8&#60;br /&#62;
200x200       28       19&#60;br /&#62;
300x300       92       28&#60;br /&#62;
400x400      273       40&#60;br /&#62;
500x500      573       81&#60;br /&#62;
600x600      984       74&#60;br /&#62;
700x700        -      103&#60;br /&#62;
800x800        -      160&#60;br /&#62;
900x900        -      161&#60;br /&#62;
1000x1000      -      208&#60;br /&#62;
1200x1200      -      300&#60;/p&#62;
&#60;p&#62;The speed improvements are dramatic at larger matrix sizes, and the GPU shows a significant improvement for matrices larger than 200x200. This is somewhat better than your reported break-even point of about 512x512 elements. I find it interesting that Win7 seems to be slower than XP. Also, both systems showed an unexpected speed improvement going from 500x500 to 600x600 matrices; I apparently hit some optimization sweet spot. Ultimately we wish to use large matrices at fine resolution to generate photoacoustic images, and the Tesla gives me some hope of getting this done in my lifetime.&#60;/p&#62;
&#60;p&#62;Thanks for a great package that is a lot of fun to work with.&#60;br /&#62;
-Dan&#60;br /&#62;
Optosonics
&#60;/p&#62;</description>
		</item>

	</channel>
</rss>
