<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="bbPress/1.0.2" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>k-Wave User Forum &#187; Topic: GPU Performance - Tesla C1060</title>
		<link>http://www.k-wave.org/forum/topic/gpu-performance-tesla-c1060</link>
		<description>Support for the k-Wave MATLAB toolbox</description>
		<language>en-US</language>
		<pubDate>Tue, 12 May 2026 22:32:27 +0000</pubDate>
		<generator>http://bbpress.org/?v=1.0.2</generator>
		<textInput>
			<title><![CDATA[Search]]></title>
			<description><![CDATA[Search all topics from these forums.]]></description>
			<name>q</name>
			<link>http://www.k-wave.org/forum/search.php</link>
		</textInput>
		<atom:link href="http://www.k-wave.org/forum/rss/topic/gpu-performance-tesla-c1060" rel="self" type="application/rss+xml" />

		<item>
			<title>DanR on "GPU Performance - Tesla C1060"</title>
			<link>http://www.k-wave.org/forum/topic/gpu-performance-tesla-c1060#post-130</link>
			<pubDate>Fri, 25 Mar 2011 19:28:23 +0000</pubDate>
			<dc:creator>DanR</dc:creator>
			<guid isPermaLink="false">130@http://www.k-wave.org/forum/</guid>
			<description>&#60;p&#62;Brad,&#60;br /&#62;
I am using Matlab 7.11 (R2010b). I will check and see if the multiple cores are actually being used.&#60;/p&#62;
&#60;p&#62;Regarding the times for arrays larger than 600x600, the mistake I made was to use spaces to line up the columns. All times for the larger arrays should be justified under the GPU column, not the CPU column; I did not run CPU times for these arrays.&#60;/p&#62;
&#60;p&#62;Thanks for the tip on the Tesla cards. We do all of our reconstruction work with actual data in single precision for the GPU - that is more than adequate.&#60;br /&#62;
-Dan
&#60;/p&#62;</description>
		</item>
		<item>
			<title>Bradley Treeby on "GPU Performance - Tesla C1060"</title>
			<link>http://www.k-wave.org/forum/topic/gpu-performance-tesla-c1060#post-121</link>
			<pubDate>Wed, 23 Mar 2011 22:13:37 +0000</pubDate>
			<dc:creator>Bradley Treeby</dc:creator>
			<guid isPermaLink="false">121@http://www.k-wave.org/forum/</guid>
			<description>&#60;p&#62;Hi Dan,&#60;/p&#62;
&#60;p&#62;I was just looking at your simulation times again and two things spring to mind. First, what version of MATLAB are you using? It seems that it might not be making use of the extra cores on system 2. Earlier versions of MATLAB (I think before around 2008a) do not include multicore support for parallelisable functions such as the FFT. If this is the case and your CPU clock speed on system 2 is lower than on system 1, that could explain the slightly worse performance.&#60;/p&#62;
&#60;p&#62;Second, I'm wondering if there is a typo or exponent change in your simulation times once you get above 600 x 600 (the time jumps from 984 to 103).&#60;/p&#62;
&#60;p&#62;Regarding single and double precision, you can use single precision on the CPU by setting &#60;code&#62;&#38;#39;DataCast&#38;#39;&#60;/code&#62; to &#60;code&#62;&#38;#39;single&#38;#39;&#60;/code&#62;, and single or double precision on the GPU by setting it to &#60;code&#62;&#38;#39;GPUSingle&#38;#39;&#60;/code&#62; or &#60;code&#62;&#38;#39;GPUDouble&#38;#39;&#60;/code&#62;. Keep in mind that the particular GPU card you are using does not have very good double precision performance (it will be around 8 times slower than single precision). This has been addressed in the newer Tesla cards (C2050, C2070), which have much better double precision performance.&#60;/p&#62;
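For example, a minimal sketch (the kgrid, medium, source, and sensor inputs are assumed to come from the standard k-Wave example setup, and the option strings are as given above):

```matlab
% Run the 2D simulation in single precision on the CPU:
sensor_data = kspaceFirstOrder2D(kgrid, medium, source, sensor, ...
    'DataCast', 'single');

% Or in single precision on the GPU:
sensor_data = kspaceFirstOrder2D(kgrid, medium, source, sensor, ...
    'DataCast', 'GPUSingle');
```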
&#60;p&#62;Brad.
&#60;/p&#62;</description>
		</item>
		<item>
			<title>DanR on "GPU Performance - Tesla C1060"</title>
			<link>http://www.k-wave.org/forum/topic/gpu-performance-tesla-c1060#post-118</link>
			<pubDate>Mon, 21 Mar 2011 14:33:49 +0000</pubDate>
			<dc:creator>DanR</dc:creator>
			<guid isPermaLink="false">118@http://www.k-wave.org/forum/</guid>
			<description>&#60;p&#62;Brad,&#60;br /&#62;
I presume the GPU runs were done in single precision because of the 'DataCast' parameter. The CPU runs were done in (default) double precision, so I guess the CPU results could be a little faster if this parameter were used there as well.&#60;br /&#62;
-Dan
&#60;/p&#62;</description>
		</item>
		<item>
			<title>Bradley Treeby on "GPU Performance - Tesla C1060"</title>
			<link>http://www.k-wave.org/forum/topic/gpu-performance-tesla-c1060#post-116</link>
			<pubDate>Fri, 18 Mar 2011 07:37:15 +0000</pubDate>
			<dc:creator>Bradley Treeby</dc:creator>
			<guid isPermaLink="false">116@http://www.k-wave.org/forum/</guid>
			<description>&#60;p&#62;Hi Dan,&#60;/p&#62;
&#60;p&#62;Thanks for your feedback and comments; it is interesting to see your simulation times and great to see that you are getting good speed-ups using the C1060. Are your CPU and GPU simulations performed in double or single precision?&#60;/p&#62;
&#60;p&#62;The computations are heavily dependent on the FFT so they will be quickest when using grid sizes that are a power of 2, e.g., 128, 256, 512, etc. They will be almost as quick for sizes with small prime factors, but slower otherwise.&#60;/p&#62;
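One quick way to check this in MATLAB (a sketch; the sizes shown are only examples):

```matlab
% Grid sizes whose prime factors are all small FFT quickly;
% sizes with a large prime factor are noticeably slower.
factor(512)   % 2 2 2 2 2 2 2 2 2  (power of two - fastest)
factor(600)   % 2 2 2 3 5 5        (small primes - nearly as fast)
factor(509)   % 509                (prime - slower)
```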
&#60;p&#62;The CPU/GPU break-even point is also dependent on the host system, which must communicate with and transfer data to and from the GPU. You could also try increasing the priority of the MATLAB thread in the Windows Task Manager to see if that gives you a little extra speed (try anything up to high, but don't use real-time if you still want to interact with your computer!).&#60;/p&#62;
&#60;p&#62;If you have any more questions or comments, please let us know.&#60;/p&#62;
&#60;p&#62;Brad.
&#60;/p&#62;</description>
		</item>
		<item>
			<title>DanR on "GPU Performance - Tesla C1060"</title>
			<link>http://www.k-wave.org/forum/topic/gpu-performance-tesla-c1060#post-115</link>
			<pubDate>Thu, 17 Mar 2011 18:55:32 +0000</pubDate>
			<dc:creator>DanR</dc:creator>
			<guid isPermaLink="false">115@http://www.k-wave.org/forum/</guid>
			<description>&#60;p&#62;Here are some preliminary results of running the ultrasound in a homogeneous medium example with and without an Nvidia Tesla C1060 GPU card (240 cores, 4 GB RAM). Below are the actual simulation times (in seconds).&#60;br /&#62;
System 1: XP, dual processor, 2 GB RAM.&#60;br /&#62;
Matrix size  w/o GPU  with GPU&#60;br /&#62;
100x100        5       7&#60;br /&#62;
200x200       25      15&#60;br /&#62;
300x300      101      24&#60;br /&#62;
400x400      243      35&#60;br /&#62;
500x500      515      69&#60;br /&#62;
600x600      908      67&#60;/p&#62;
&#60;p&#62;System 2: Win 7 64-bit, 8 GB RAM, 8 processors&#60;br /&#62;
Matrix size  w/o GPU  with GPU&#60;br /&#62;
100x100        6        8&#60;br /&#62;
200x200       28       19&#60;br /&#62;
300x300       92       28&#60;br /&#62;
400x400      273       40&#60;br /&#62;
500x500      573       81&#60;br /&#62;
600x600      984       74&#60;br /&#62;
700x700        -      103&#60;br /&#62;
800x800        -      160&#60;br /&#62;
900x900        -      161&#60;br /&#62;
1000x1000      -      208&#60;br /&#62;
1200x1200      -      300&#60;/p&#62;
&#60;p&#62;The speed improvements are dramatic at larger matrix sizes, and the GPU shows a significant improvement for matrices larger than 200x200. This is somewhat better than your reported break-even point of about 512x512 elements. I find it interesting that Win7 seems to be slower than XP. Also, both systems showed an unexpected speed improvement going from 500x500 to 600x600 matrices; I apparently hit some optimization sweet spot. Ultimately we wish to use large matrices at fine resolution to generate photoacoustic images, and the Tesla gives me some hope of getting this done in my lifetime.&#60;/p&#62;
&#60;p&#62;Thanks for a great package that is a lot of fun to work with.&#60;br /&#62;
-Dan&#60;br /&#62;
Optosonics
&#60;/p&#62;</description>
		</item>

	</channel>
</rss>
