			
			 STREAM for DOS v2
			    Dennis Lee
	      Internet E-mail: denlee@ecf.utoronto.ca

This software is free.  The following text describes the DOS version of
STREAM in a question and answer format.

Q: First of all, what is STREAM ?

A: STREAM is a popular memory bandwidth benchmark that has been run on
   everything from personal computers to supercomputers.

Q: What has changed since version 1, and why is this update necessary ?

A: Only one change has been made to the source code.  The memory buffers
   STREAM operates on have been changed from dynamically allocated to
   static global data.  This causes the <memsize> option to be dropped,
   and 9,600,000 bytes of static memory to be used.  A 12 MB computer
   is now required to run this software.
   This change was made because with global buffers, the WATCOM compiler
   generates code that uses 2 fewer registers.  More importantly,
   however, those 2 registers no longer need to be incremented in the
   loop.  The following assembly code shows the difference:

version 1 code
--------------
L19:      fld     st(0)
	  fmul    qword ptr [ebx]
	  add     edx,00000008H
	  add     ebx,00000008H
	  inc     eax
	  fstp    qword ptr -8H[edx]
	  cmp     eax,edi
	  jl      short L19

version 2 code
--------------
L16:      fld     st(0)
	  fmul    qword ptr _buf1[eax]
	  add     eax,00000008H
	  fstp    qword ptr _buf3-8H[eax]
	  cmp     eax,edx
	  jl      short L16

   The second version runs faster on a computer with fast memory.  In
   this case, I consider anything above 60 MB/s fast.  Since the original
   STREAM code used global data, and there is a difference in performance
   between static and dynamic buffers, I felt it necessary to change the
   DOS version of STREAM in this regard.
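
   The difference between the two listings can be sketched in C.  This
   is an illustration only, not the actual STREAM for DOS source: the
   names N, buf1, buf3, scale_dynamic and scale_static are assumptions,
   and the loop is a scale-type kernel, which is what the assembly
   listings appear to show.

```c
#include <assert.h>
#include <stddef.h>

#define N 1000  /* hypothetical small size, for illustration only */

/* Version 2 style: static global buffers whose addresses are fixed at
   link time, so the loop can address both from a single index. */
static double buf1[N];
static double buf3[N];

/* Version 1 style: buffers are reached through pointers (for example,
   obtained from malloc), so each pointer is advanced separately. */
void scale_dynamic(double *dst, const double *src, double s, size_t n)
{
    size_t j;

    assert(dst != NULL && src != NULL);
    for (j = 0; j < n; j++)
        dst[j] = s * src[j];
}

/* Version 2 style: the same loop written against the global arrays. */
void scale_static(double s)
{
    size_t j;

    for (j = 0; j < N; j++)
        buf3[j] = s * buf1[j];
}
```

   Both functions compute the same values.  Whether a given compiler
   produces a shorter loop for the static form depends on the compiler
   and its options.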

Note: The following text is taken from version 1 of this software.

Q: Is the DOS version of STREAM just compiled from the C source available
   on the STREAM website ?

A: No.  Unfortunately a small change is necessary for accurate results
   in DOS.  Specifically, it isn't DOS which causes the problem, but the
   8253 timer chip in the PC and the way DOS compilers use it to supply
   clock/timer functions to C programs.  The result is that C programs
   have a timer function with a resolution of only ~0.06 sec.  So, to get
   STREAM results with ~1% error, a run of about 6 sec would be necessary.
   Since my computer can move ~40 MB/s, I need about 240 MB of RAM to
   run the benchmark unchanged.  I do not have enough RAM for that.
   Fortunately it is possible to cycle through existing RAM with the
   same results as long as in each iteration all caches are completely
   flooded.  Only a few lines need to be added/changed to the original
   source code for this to work.
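
   The arithmetic behind these figures can be sketched in C.  This is
   only an illustration: the function names and the example buffer size
   are assumptions, not part of STREAM for DOS, and 1 MB is taken here
   as 1,000,000 bytes.

```c
#include <assert.h>

/* Run time (seconds) needed so that one timer tick is at most
   `rel_error` of the whole run, e.g. 0.06 / 0.01 = about 6 sec. */
double required_run_time(double timer_resolution, double rel_error)
{
    assert(timer_resolution > 0.0 && rel_error > 0.0);
    return timer_resolution / rel_error;
}

/* Number of full passes over a buffer of `buffer_bytes` needed to move
   at least `bandwidth * seconds` bytes (the quotient, rounded up). */
unsigned long passes_needed(double bandwidth_bytes_per_sec,
                            double seconds, double buffer_bytes)
{
    double total;
    unsigned long n;

    assert(buffer_bytes > 0.0);
    total = bandwidth_bytes_per_sec * seconds;
    n = (unsigned long)(total / buffer_bytes);
    if ((double)n * buffer_bytes < total)
        n++;  /* round up so the run is not shorter than requested */
    return n;
}
```

   With a resolution of 0.06 sec and a 1% target, required_run_time
   gives about 6 sec.  At 40 MB/s that is about 240 MB of traffic, which
   passes_needed turns into 30 passes over a hypothetical 8,000,000-byte
   working set.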

Q: So the DOS version only has a few lines changed from the original ?

A: Not quite.  After changing the few lines and getting accurate results
   for my computer, I had no further plans with the STREAM benchmark.
   A few months passed and new memory technologies were introduced into
   the PC market, including SDRAM and BEDO DRAM.  I was interested in how
   much faster (more bandwidth) such memory is than current EDO DRAM and
   FPM DRAM.  Since I wouldn't have access to such machines for a while,
   I decided it would be easiest to obtain results by offering a compiled
   version of STREAM to early buyers of such systems.  If a copy of
   STREAM was going to be released for DOS anyway, I thought, why not add
   some configuration options so it can handle systems with various amounts
   of L2 cache and system memory.  This would allow results to be gathered
   from more systems including older 386 and 486 based ones.  This led
   to the current DOS version.

Q: How much of the original source ended up replaced/rewritten, and since
   some parts were changed how can you claim the results are comparable
   with those from the original program ?

A: Since the program was short, I started over.  The DOS version
   shares no code with the original except for the 4 loops involved
   in moving data from memory buffer to memory buffer.

   e.g. for (j = 0; j < N; j++)
	  c[j] = a[j];

   All the STREAM results come from timing how long these 4 loops take
   to execute, and since the DOS version has the same loops, results
   from both versions are comparable.  I obtained results for my machine
   and looked at the generated assembly code to make certain.
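
   For reference, the 4 loops of the original STREAM benchmark (Copy,
   Scale, Add and Triad) can be sketched as below.  This is a minimal
   sketch, not the DOS source: the array length N, the function names
   and the mb_per_sec helper are assumptions made for illustration.

```c
#include <assert.h>

#define N 1000  /* hypothetical length; real runs use far larger arrays */

static double a[N], b[N], c[N];

/* Copy: one array read, one array written. */
void stream_copy(void)
{
    int j;
    for (j = 0; j < N; j++)
        c[j] = a[j];
}

/* Scale: one array read, one array written. */
void stream_scale(double scalar)
{
    int j;
    for (j = 0; j < N; j++)
        b[j] = scalar * c[j];
}

/* Add: two arrays read, one array written. */
void stream_add(void)
{
    int j;
    for (j = 0; j < N; j++)
        c[j] = a[j] + b[j];
}

/* Triad: two arrays read, one array written. */
void stream_triad(double scalar)
{
    int j;
    for (j = 0; j < N; j++)
        a[j] = b[j] + scalar * c[j];
}

/* Bandwidth in MB/s (1 MB = 1,000,000 bytes) from bytes moved and the
   measured time of a loop. */
double mb_per_sec(double bytes_moved, double seconds)
{
    assert(seconds > 0.0);
    return bytes_moved / seconds / 1.0e6;
}
```

   A result is then the number of bytes a loop reads and writes divided
   by the time it took, for example mb_per_sec(16000000.0, 0.5) for a
   loop that moved 16,000,000 bytes in half a second.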
