- 15 Apr 2009: If you are around 9am next Monday, you might find the
following interesting in the context of the circuit work we did this
semester:
Please join us for this very interesting colloquium:
Monday, Apr 20th, 2009
9:00am, Middlesex College Room 320
Speaker: Jeremy Johnson
Professor and Department Head, Computer Science
Drexel University
Title: Sparse LU Decomposition using FPGA
Abstract:
This talk reports on an FPGA implementation of sparse LU
decomposition.
The resulting special purpose hardware is geared towards power system
problems - load flow computation - which are typically solved iteratively
using Newton Raphson. The key step in this process, which takes
approximately 85% of the computation time, is the solution of sparse
linear systems arising from the Jacobian matrices that occur in each
iteration of Newton Raphson. Current state-of-the-art software packages,
such as UMFPACK and SuperLU, running on general purpose processors
perform suboptimally on these problems due to poor utilization of the
floating point hardware (typically 1 to 4% efficiency). Our LU
hardware, using a special purpose data path and cache, designed to keep
the floating point hardware busy, achieves an efficiency of 60% and
higher. This improved efficiency provides an order of magnitude speedup
when compared to a software solution using UMFPACK running on general
purpose processors.
Jeremy Johnson
Professor and Department Head, Computer Science
Drexel University
University Crossings, Room 100C
3175 JFK Boulevard
Philadelphia, PA 19104
www.cs.drexel.edu/~jjohnson
.
REFRESHMENTS WILL BE SERVED!
- 7 Apr 2009: The AsmRefCard passed out last Thursday that has the
I/O instructions and memory layout can be found as AsmRefCard.090401
in my CS350 directory on gaul. This is the one that we will be using
for the final exam.
- 3 Apr 2009: For homework 3, there were 3 working solutions:
cost = # transistors * # steps * cycle time
207130400512 = 54848 * 134873 * 28
339743617200 = 56494 * 30069 * 200
529721659200 = 59126 * 29864 * 300
- 1 Apr 2009: For some other instruction sets, consider the wikipedia
articles on: Zero Instruction Set Computers (ZISC),
One Insruction Set Computers (OISC),
Redcode (used for the game
Core War),
and the related
Tierra instruction set used for evolving competing programs in
Tierra.
Also of potential interest is the open source hardware movement which parallels the
open source software movement and includes various CPU designs, computer
designs, robots, home automation, etc.
- 1 Apr 2009: In class a while ago, I mentioned that `compare and swap'
was another synchronization primitive found in more recent CPU designs.
Turns out wikipedia has a nice article on it:
Compare-and-swap .
They also have an article on the
test-and-set primitive
that we are using in homework 6. Oddly enough, the test-and-set article
has a
link to implementing test-and-set in SPARC assembly that uses SPARC's
ldstub instruction which is documented as SPARC's compare-and-swap instruction.
An interesting programming variant on using test-and-set is discussed in
test and
test-and-set. The idea is to first loop just doing a normal read on
the lock value until the lock appears open and then loop on the TestAndSet.
The idea behind this is that it cuts down on the amount of looping on the
more expensive TestAndSet operation -- kinda neat.
- 31 Mar 2009: A recent report of the simulator generating an array
exception was tracked down to the offending program generating an illegal
memory reference. To keep this from happening, I have created a function
(guard) which is invoked everytime the mar is set. guard (inherited from
FirstCpu.java) does the following to any address it is given:
return (short)((x & 0xFFFF) % 2048);
This shouldn't effect properly working programs, but might change some
programs behavior as before the code was haphazard in how it handled
such addresses. The new distribution that implements this is:
CircuitSimulator.090331w.tar.gz which is now available in my CS350 directory
on gaul.
- 31 Mar 2009: There is a new distribution: CircuitSimulator.090331v.tar.gz
in my CS350 directory on gaul.
- It introduces a few new test cases and fixes a problem with the test26.cpu
code. Specifically, when test26.cpu tests to see if count is 2, it has to
do this in a section of code guarded by a lock, otherwise it is possible for
the halt to occur before the locked code has completed the update of sum.
Alternatively, sum could have been updated before count is in the locked code.
- The local cache was expecting aligned addresses when it checked to see
if an address was in cache. In the case of invalidating a write, the address
it was given was not aligned. This has now been fixed ensuring cache
consistency.
- When a flush (from a halt) is on the cpu side of the shared cache
and a read is pending on the ram side of shared cache, an array exception
was generated. This has now been fixed.
- 30 Mar 2009: There is a new distribution: CircuitSimulator.090330u.tar.gz
that fixes a problem in CpuLockSet.java. In previous machines, an unrecognized
opcode was treated as a no-op. In CpuLockSet.java, java was generating a
stack overflow when one was encountered. I have fixed it so that it shouldn't
be repeatedly calling itself now. There is another email I have recieved
about a potential problem in the cache consistency code that I haven't
verified yet -- if I can verify the problem, there will be more posted here
on it.
- 30 Mar 2009: Note: announcement on 27th should have referred
to CircuitSimulator.090327t.tar.gz rather than CircuitSimulator.090324s.tar.gz .
- 30 Mar 2009: 23 of the 25 handins passed the benchmark for h5.
However, against other test data things didn't go so well. The full
set of test data I used for marking homework 5 is in my CS350 directory
on gaul as h5.test_data.tar . The one that broke the most programs
was build_extreme . Run times on the benchmark for programs that passed
all the test data were:
- 1,478,366
- 3,891,892
- 4,475,372
- 27 Mar 2009: CircuitSimulator.090324s.tar.gz is now available on gaul
in my CS350 directory. In addition to the fix described below, I have added
two new commands to ThirdComputer: delayed_debug and early_exit. The idea
is to make debugging easier. The way early_exit works is if you think your
program goes wrong around clock cycle 1000, you can put in your cpu file:
early_exit 1000
run
dump 0
dump 1
dump 2
dump 3
dump 4
and the program will exit as if it encountered a halt just after clock cycle
1000 and then execute the dump commands showing you the contents of everything
at that point.
Sometimes the information from set_debug_flag is useful, but only near
the point where the problem is happening. So, for example, if in the
above you wanted to see the debug dumps for the last 100 clock cycles
just before 1000 on device 0, you can type:
delayed_debug 0 900
early_exit 1000
run
And it would be as if set_debug_flag 0 had been done around clock cycle
900 rather than at clock cycle 0. Note each debug flag can be independently
set to a different starting time if you are so inclined.
As far as I know, the changes in this distribution should not break
anything that was working in the previous distribution. If some
unexpected problem arises, don't forget that the old versions are
still available in my CS350 subdirectory under OldVersions.
- 27 Mar 2009: To fix the potential problem discussed in class yesterday
of a write from one process pending in the shared cache's internal 1 item
queue at the point where a read of the same block from another process
is coming back from ram, I have inserted the following code into
CacheLockSet.java
if (ramResponse[2] == memory_read) {
// to handle case where one process has a write pending in
// cpuRequest and there is a read trying to respond to a
// different cpu in ramRequest and they both involve the
// same block of ram. Note that we know the processes are
// different because a process waits for a read before
// processing the write.
if (cpuRequestInProgress) {
int writeaddr = 0xFFFF & requestedCpuCommand[3];
int writeblock = writeaddr / 4;
int readblock = addr / 4;
if (requestedCpuCommand[2] == memory_write) {
if (writeblock == readblock) {
int offset = writeaddr - (writeblock * 4);
ramResponse[4 + offset] = requestedCpuCommand[4];
}
}
if (requestedCpuCommand[2] == memory_write_and_read) {
if (writeblock == readblock) {
ramResponse[4] = requestedCpuCommand[4];
ramResponse[5] = requestedCpuCommand[5];
ramResponse[6] = requestedCpuCommand[6];
ramResponse[7] = requestedCpuCommand[7];
}
}
}
}
This should only have an impact if you were counting on the notion that
if one process continually reads a location in ram and a different process
writes to that location, then the first process should eventually notice
the change. If the read was being done via a test and set, there was no
problem. This addresses when the read was being done by a load instruction.
I am going to add a few extra debug features before announcing the release
that includes this change (should be before 5pm).
- 25 Mar 2009: The midterm will come back tommorrow (Thursday's class).
Enough people seemed to have problems with it that I will spend some time
going over it. I will start with the two controller problems and talk a
bit about the others as time permits. This will be as close as we come
to reviewing the first 2/3rds of the semester, which will also make up a
substantial portion of the final exam.
- 24 Mar 2009: CircuitSimulator.090324s.tar.gz is now available
in the CS350 directory on gaul. Hopefully it fixes the cache consistency
problem outlined below. MemoryProtocolForThirdComputer was updated to
reflect this code change.
- 24 Mar 2009: With regards to cache consistency, write requests
invalidate the local cache copy. test and set requests invalidate
all copies except ram. test and set requests don't use local copies
but instead always go to ram. This raises some problems if both processes
read a value and then one process writes a change (the other continues to use
the old value). If the value is read by a test and set request, then
this isn't a problem. If it is read by a load, there isn't really any
way of getting it out of the cache except loading three other blocks (to
clear the 3 block LRU cache).
There seems to be the potential for a problem here. My intention is
to update command so that it invalidates cpu local copies on a write message
the same way it does on a test and set message. This would not be done on
the shared cache copy. Note that this is done at the block level and so
would impact performance on other words that were on the same block as
the target address of the write.
- 24 Mar 2009: The CircuitSimulator.090324r.tar.gz release is now available
in my CS350 directory on gaul. It is described below.
- 24 Mar 2009: The CircuitSimulator.090324r release contains a build_h5
script, ThirdComputerTwoProcessorsLRU is renamed ThirdComputer, and
there is a new text file MemoryProtocolForThirdComputer which walks though
how memory requests of various kinds are handled. While constructing this
walkthrough, made some changes to the code:
- I noticed that the code would give an error message if a fetch
instruction was done when the local buffer was not valid. While this probably
would never happen, I removed the test for the local buffer being valid
- The same issue was noticed with test and set commands with local buffer
was not valid and the needless test was commented out again.
- 23 Mar 2009: The homework 6 spec is
now available. Note that it anticipates that the next distribution will
have a build_h6 command based on build_h5 as well as
ThirdComputerTwoProcessorsLRU be renamed to ThirdComputer .
- 23 Mar 2009: The release CircuitSimulator.090323q.tar.gz is now
available on gaul in my CS350 directory. If you are still working on
homework 5, then CircuitSimulator.090318p.tar.gz is the release that
will be used for marking your assignment. If you are ready to start
homework 6, you want to run ThirdComputerTwoProcessorsLRU .
I don't anticipate any structural changes to ThirdComputerTwoProcessorsLRU;
but I do plan on doing a bit more testing before renaming it
ThirdComputer (if you run across a problem with it, let me know).
ThirdComputerTwoProcessors is an earlier version that doesn't have the
3 block local cache on each processor. ThirdComputerFourProcessors
is the 4 processor version that won't be used this semester.
ThirdComputerLockSet is a one processor version that understands the
new instruction set (RID and test and set instructions).
AsmRefCard shows the new instruction set additions also.
Your homework 5 solution should still run on ThirdComputerLockSet,
but things will probably be chaotic on the other versions of ThirdComputer.
The file test26.cpu
is the 2 processor version of test1.cpu . The file
test27.cpu is the 2 processor
version of test1.cpu where processor 0 does all the work and processor
1 just idles. As mentioned earlier, adapting your homework 5 solution
to a 2 processor version where one processor does nothing (like test27.cpu)
gives you a good basis from which to explore potential optimizations.
- 23 Mar 2009: With regards to ThirdComputer, I have discovered that
my current protocols can lead to starvation of one of the CPUs in the
case of 4 CPUs. However, things seem to work ok for 2 CPUs. So this
evening I expect to release a copy of the distribution with a 2 CPU
version of ThirdComputer which will then be the one expected in the
homework 6 spec. I have currently isolated all the references to the
one block local buffer into a class I am calling LocalCache. What I will
be working on for the next couple of hours is LocalCacheThree which
will hopefully give each CPU access to an internal local cache of three
blocks managed with an LRU protocol as discussed in class last week.
A test example that uses 2 CPUs to do the sum memory task of test1.cpu
has been run (and will be discussed in class assuming all goes according
to plan).
- 19 Mar 2009: While parallelizing h5 to make h6 is an interesting task,
it is worth noting that there is a rather simple way to start. That is to
check the processor ID and send three of the processors into an idle loop
while the remaining processor executes as the h5 result did. This is an
important start for two reasons: 1) if it works, you always have something
that can be handed in and 2) any attempt to parallelize needs to be compared
to this base case since the question is not whether or not you can program
various exotic implementations, but rather whether or not you can make an
implementation that is fast (so you would only want to hand in a more
advanced solution if it was actually faster than this base case). Of course,
some parallel programming attempts will be useful to prepare you for the
parallel programming question on the final exam (just as the circuit homeworks
helped prepare for the midterm).
- 19 Mar 2009: In class today, it was pointed out that many h4 reports
indicated rather limited testing. It is important that code work on all
test cases that meet the spec. The implementation of the overflow handing
for h5 is as defined in the FirstCpu.java code for SUB and ADD. If you
are relying on it, you should check that it does what you expect and if it
doesn't modify your code to take that into account (there won't be another
change to overflow handling this semester). Similarly, the spec indicates
that not only should any 16-bit 2's complement value work in your lists,
but your lists should work for all sizes from 2 to 1023 (larger than 1 and
smaller than 1024). For people who are using internal stacks, make sure
you can handle lists of numbers this larger regardless of the order they
appear in without a stack overflow messing up the run. Just because a
piece of code passed the test data I tried on the h4 handins doesn't mean
it will pass the h5 test data -- so be sure and do some significant testing
(fortunately the h4/h5 simulator runs faster than the h3 simulator did,
making testing easier to do).
- 18 Mar 2009: The CircuitSimulator.090318p.tar.gz distribution
is now available in my CS350 directory on gaul.
As mentioned below, it fixes a problem with
assembling the FPCR instruction. Also, the dumps being done at the
end of a run by SecondComputer were sometimes not finding a unit.
This was tracked down to the code in BusRotating.java that was
looping over the items attached to the bus and causing each to be
processed and rotated around the queue. It turned out that when
the CPU threw a HaltException, this left the queue in an inconsistent
state. The fix was to rotate the queue first and then process the
elements of the queue. The AsmRefCard has been updated to include
the new instructions to support homework 6. ThirdComputerLockSet seems
to recognize the new instructions, but has only one processor (test23.cpu)
which still has ID 2.
ThirdComputerFourProcessors has four processors (with IDs 0, 1, 2, and 3)
and seems to go into an infinite loop on test24.cpu (I haven't yet figured
out why).
- 18 Mar 2009: Warning: The FPCR command wasn't entered properly
in the assembler. This will be fixed in the upcoming release. If
you need it earlier, the fix is to change the line:
byteAddr.put("BZ", new Integer( 0x29));
to:
byteAddr.put("FPCR", new Integer( 0x29));
in RamWithCacheInstructions.java (it is a short file). A classic cut
and paste error.
- 18 Mar 2009: Coming out of the discussion of how overflow interacts
with comparison of integers, the question was raised as to whether or not
overflow can be avoided. Alternative solutions to homework 4 are relevant
to homework 5 and 6 and so I don't want to say much about the matter. But,
I will make the following observations:
- It is possible to organize the sorting so that you are never comparing
numbers whose subtraction would generate an overflow.
- It is possible to do the sorting without either addition or subtraction
being involved in the comparison process.
- Both of the above can be accomplished using techniques covered in
textbooks aimed at second year comp sci data structure courses (most likely
the one you used, but certainly among the many in Taylor library). Neither
of these approaches would significantly slow down the benchmark task.
- 17 Mar 2009: The test data used for homework 4 can be found in
the TestingForH4 subdirectory of my CS350 directory on gaul. Note that
other test points will be used for homeworks 5 and 6 although the benchmark
will stay the same. There are 4 build scripts:
- build_h4: builds the benchmark (extract_list.gawk used for verification)
- build_pos: benchmark with all the minus signs removed. (extract_list.gawk
used for verification)
- build_another: a list of 30 items with 3 different values: 32000, -32000,
and 1. (extract_30.gawk used for verification)
- build_zero: a list of 30 items with 3 different values: 32000, -32000,
and 0. (extract_30.gawk used for verification)
- 16 Mar 2009: The following is a list of times for programs that
passed all test data (FirstComputer / SecondComputer):
- 4014204 / 1665186
- 8114299 / 3169575
- 8628898 / 4491690
- 9019084 / 4506939
- 9639211 / 4434249
- 10932535 / 5487522
-
- 13 Mar 2009: From the experiments listed below, the cache system
of the simulator seems to be behaving as described. The o distribution
that I am releasing in my CS350 directory on gaul
CircuitSimulator.090313o.tar.gz contains no changes to the simulator
and just distributes the various test cases referred to below for
people who want to run modified versions of the code to do further
exploration of the performance of different aspects of the coding.
- 13 Mar 2009: Further exploring the n distrubution, it is worth
noting that in yesterday's sample code:
LRW r0 1
ADDI r0 -1
NOP
label loop // note: this loop lives in 1 block
BZ done
ADDI r0 -1
BA loop
HALT
address 50
label done
HALT
Not counting memory accesses, every instruction needs two clock cycles to
be fetched and stored in the IR. LRW needs an 3 clock cycles to be processed.
ADDI needs an additional 2 clock cycles, NOP takes 1 additional clock cycle,
BA takes an additional 2 clock cycles. BZ takes 1 clock cycle if it fails
and 2 clock cycles if it succeeds. Since the loop body executes when BZ
fails, the clock cycle count (when no cache or RAM accesses are needed)
we would have expected was: 3 + 4 + 4 or 11 clock cycles, which was pretty
much what we were seeing.
The next example to consider is
LRW r0 1
ADDI r0 -1
NOP
label loop // note: this loop lives in 1 block
BZ done
ADDI r0 -1
BA step
label step
BA loop
HALT
address 50
label done
HALT
This is still a one block loop, but it now has one additional instruction
(a BA step instruction that should cost an additional 4 clock cycles).
Setting r0 to 100 (test13.cpu) we again get 72 stalls, time is now 1578.
1578 - 93 = 1485 / 99 = 15 which is four more clock cycles per pass
than the previous example as expected. By the way, tests 8, 10, 12, and
13 also had a repeatedMemoryRequest as well as the stalls. The stalls
indicate that the cpu is waiting for a response from a memory request
to cache or ram. The repeatedMemoryRequest indicate that the bus was
busy when the cpu tried to send out a memory request and so it is resending
it on the next clock cycle.
If we now make the example:
LRW r0 1
ADDI r0 -1
NOP
label loop
BZ done
ADDI r0 -1
BA step
NOP
label step
BA loop
HALT
address 50
label done
HALT
we now have a loop body that takes the same number of clock cycles
within the cpu, but the loop body no longer sits in one cpu buffer.
So, each time around the loop, each of the two buffers holding the
loop body have to be brought in from cache. This will let us see
the cost of repeated cache accesses. The example is test14.cpu
(again with r0 being set to 100). We now have 291 stalls and 100
repeatedMemoryRequests. Time is now 1896. 1896 - 1578 has this
example taking 318 clocks longer than the previous one. 318 + 72
stalls + 1 repeated memoryRequest = 391 which matches 291 stalls
plus 100 repeatedMemoryRequests, so all the extra overhead is in
memory accessing. The extra memory handling should be one extra
ram fetch and 2 * 98 cache accesses, so we seem to be getting pretty
good turnaround from cache. Our total cost of a single loop pass
is roughly 1896 - 93 / 99 or 18.21 which is not much more than the
cost of 15 from the previous example.
The next example is:
LRW r0 100
ADDI r0 -1
NOP
label loop
BZ done // address: 4 direct map 1
ADDI r0 -1
BA step1
address 36 // direct map 1
label step1
BA loop
HALT
This is test15.cpu. It is the same as the previous example, except that
the two blocks the loop body is on collide at the same spot in cache,
thus requiring two ram accesses per pass around the loop. We still
get 100 repeated memory requests, but now we have 4428 stalls. The
total time is: 6033. 6033 - 1896 / 99 = 41.78 giving us a rough estimate
of penalty of 2 ram accesses per loop iteration additional going on
(as with Homework 4, delay for Ram was set to 20).
For the next example, consider the code:
LRW r0 100
LRW r1 32
LRW r2 64
ADDI r0 -1
NOP
label loop
BZ done
LRI r3 r1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
LRI r3 r2
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r0 -1
BA loop
HALT
address 50
label done
HALT
that does two data memory reads (LRIs) that collide in the cache with each
other (test16.cpu). This takes a current
time of 13278. If I replace the LRIs with FIs, which should allow
the CPU to keep going with less interference as in:
LRW r0 100
LRW r1 32
LRW r2 64
ADDI r0 -1
NOP
label loop
BZ done
FI r1 // LRI r3 r1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
FI r2 // LRI r3 r2
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r0 -1
BA loop
HALT
address 50
label done
HALT
The total time drops down to 11202 from 13278 (or just over 20 clock cycles
per iteration of the loop -- essentially saving one ram access of overhead
and loosing the other one to overhead of the two FIs at the Bus and Cache).
If I replace the FIs with NOPs,
(test18.cpu), total time drops to 8232. Since the loop goes around twice
and each FI was done twice, (11202 - 8232) / (2 * 99) gives us the overhead
for an FI rather than a NOP as 15 clock cycles whereas the overhead of
LRI over NOP in the same situation is (13278 - 8232) / (2 * 99) is
25.48.
If we now consider the example
LRW r0 100
LRW r1 32
LRW r2 64
ADDI r0 -1
NOP
label loop
BZ done
LRI r3 r1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
NOP
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
LRI r3 r2
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r6 1
ADDI r0 -1
BA loop
HALT
address 50
label done
HALT
where there is a bigger gap between the two LRIs (test19.cpu), we
get a run time of 23439. If we replace the NOP in the middle of
the loop above with FI r2 to prefetch the address needed by the
later LRI, we get a run time of 22251. (23439 - 22251) / 99
gives us a savings of 12 clock cycles per iteration by this maneuver.
If I didn't have the FI instruction (i.e., instead of having a NOP
I just have one fewer instructions -- test21.cpu), the time the
program takes is 18984. So, inserting the FI slowed down the program
by (22251 - 18984) / 99 or 33 clock cycles per iteration, which is
somewhat disconcerting. A check of the assembler encoding dump indicates
that removing the FI instruction making the block one instruction shorter
also move the end of the block from 40 to 39. This meant one less block
needed to be fetched going around the loop. Running from address 8 to
address 40, the loop body was 33 words long and hence didn't fit in
cache, whereas with it one instruction shorter, it now fit in cache
with the two LRI commands bringing into cache addresses that not only
conflicted with each other, but also conflicted with the block holding
address instruction immediately after the second LRI.
The command:
java SecondComputer < test20d.cpu | & grep Cache: | grep another
(looking for one of the debug messages that the command method in the
class CacheNew generates)
shows that there isn't a problem with the cache rejecting a request
because there is another one currently in progress. The command
java SecondComputer < test20d.cpu | & grep Cache: | grep received | wc
does show the Cache processing 1292 requests from either CPU or RAM.
If we replace the two LRIs in test16.cpu with SRI r1 r0 and SRI r2 r0
respectively (test22.cpu), the time it takes is 8847 as opposed to the
13278 clock cycles test16.cpu took indicates that not waiting on writes
makes writes for this example on average (13278 - 8847) / (2 * 99) =
22.378... clock cycles faster than reads from RAM.
- 12 Mar 2009: The CircuitSimulator.090311n.tar.gz distribution is
now available in the usual place. As discussed in class today, SRL
and SRA have now been changed so that they update the condition codes
like SLL. Reflecting this, the AsmRefCard we used in today's class
has been updated. A copy of the current version of it can be found
in this distibution under the name AsmRefCard. I also ran a series of
timing experiments to explore the correctness of the implementation
of the caching algorithms. I started with the program
LRW r0 1
ADDI r0 -1
NOP
label loop // note: this loop lives in 1 block
BZ done
ADDI r0 -1
BA loop
HALT
address 50
label done
HALT
What is interesting about this is that it requires three reads from
RAM and the loop sits on a single block and it is easy to control
how many times around the loop we go. Lets consider the following
experiments:
- With r0 1 (test8.cpu) we
get 72 cpu stalls while waiting for Ram and Cache to retrieve once
each of these three blocks and then send through the flush command.
The total time for the run is 93.
- When I change r0 to 2 (test9.cpu), I get exactly 74 stalls
(quickly counted by: grep stall test9.cpu.result | wc) and time
has increased to 105. Initially, one would like to think that both
runs should see exactly the same number of memory stalls since both
runs are fetching the same number of memory blocks (3). However,
recall that the bus is rotating priority on the different entities
on the bus and so the third fetch and/or the flush might have slightly
different timing when r0 is 2 since they occur at different times.
The number of clock cycles for r0 being 2 is 105. If 2 of these are
full the extra stalls, then 103 - 93 gives us the cost of one time
around the loop being exactly 10 clock cycles. Note that test8.cpu
does not execute the loop body because r0 is decremented before the
first BZ done.
- If we change r0 to 3 (test11.cpu) we get 72 stalls
again. The completion time is 114. 114 - 93 is 21 as the cost of
going around the loop twice or 10.5 for the cost of going around
the loop once. This doesn't quite seem right.
- If we change r0 to 4 (test12.cpu) we again get 72 stalls.
The completion time is 126. 126 - 93 is 33 which is giving us a cost
of 11 to go around the loop once.
- If we
then change r0 to be 100 (test10.cpu) we get 72 stalls again. The
time taken is 1182. 1182 - 93 = 1089. 1089 / 99 gives us 11 as the
cost of the number of times to go around the loop once.
Looking deeper into issues like this is something I will be spending
some time on between now and Friday evening. After that, I will figure
that cache pretty much does whatever the code does and limit new
distributions to: 1) fixing problems that break programs that worked
under the earlier FirstComputer distributions except for matters discussed
like slight changes to the way SUB, SUBC, SRL, and SRA handle condition
codes and the definition of new opcodes for prefetching instructions.;
2) adding more statistics gathering features; and 3) getting ThirdComputer
available for people who want an early start on homework 6.
- 11 Mar 2009: The CircuitSimulator.090311m.tar.gz distribution is
now available in my CS350 directory on gaul. The first change made
was to fix the subtraction overflow problem mentioned below. A version
of the system with just that fix over the k distribution can be accessed
via SecondComputerPriorityBus . The main goal of this release is to
make the Cache behave smarter. Since there would be more parallel activity,
to avoid deadlocks I replaced Bus.java with BusRotating.java for
SecondComputer. The old bus always gave the same device highest priority
and since only one command can be processed on the bus per tick, this
could cause problems. The solution is a new bus that rotates the order
of the devices on each tick (so that every 3 ticks each device has had
an opportunity to have maximum priority). The other changes include
the following:
- The new version of the CPU no longer waits when it sends a write request
(similarly neither the cache nor ram send back acknowledgements of write commands). The write policy is still copy-back on the CPU buffer and write-through
on the cache.
- Because the CPU no longer waits for a write, a problem arises if the
last instruction before a halt is a write. To fix this, halt now sends a
flush command through the cache and on to ram which then comes back to
the cache and finally back to the cpu to ensure that any previous writes
have been completed before the HaltException is thrown.
- The cache now has two stages of processing. In the first stage,
a request comes to the cache. The cache responds if it can. If it
can't then it makes a request to ram and passes the request onto the
second stage -- if the ram doesn't accept the request because it is
currently working on another request then the command stays in the first
stage blocking any subsequent cache requests from the cpu until it can
move on. Once a command is in the second stage, it waits for the ram
to respond and then handles the response accordingly. While it is waiting
for the ram, a new instruction can enter the first stage and if it doesn't
require a ram access, the new instruction can be completely processed.
As indicated above, if the request is a write request, it updates the cache,
is passed onto ram, but does not wait in cache after the ram has indicated
that it is free to start on the write.
The above should mean that write commands cause little or no delay in
the system as long as the commands the immediately follow them don't
try to access ram (but instead have memory requests that can be handled
by either the cpu buffer or the cache).
Needless to say, this is a rather complicated upgrade to the system.
In addition to the test cases in the distribution, it has also been
tested on one of the successful homework 4 handins and worked ok.
By working, I mean that the time is much faster and the resulting ram
is the same as the previous versions. If you encounter any problems
with it, please let me know
soonest. While you are waiting for a response, don't forget that you
can access a version of the system with the old version of the cache
via SecondComputerPriorityBus . If no problems arise from this distribution,
I have no further plans for modifying the architecture of the homework 5
machine. The next computer, ThirdComputer, will be the homework 6 machine
(the multicore version of the homework 5 machine).
- 11 Mar 2009: It has been brought to my attention that the overflow
flag gets set when subtracting zero from a number. Since we are past
the homework 4 due date, if you are using overflow and having troubles
you should incorporate a test for this situation. For the upcoming
homework 5, I have patched SUB and SUBC so they no longer have this
problem. The patch will become available when the m distribution is
released (I am skipping l since it looks too much like 1), probably
tonight. Recall in the orignal release of the simulator there was a
note that overflow and carry flags were untested and unreliable and
wasn't working on them because I didn't expect them to be used in the
assignment. With this patch, SUB looks a bit more robust, but SUBC
still needs more thought as it is a more complicated situation.
- 10 Mar 2009: (8:42 pm) The CircuitSimulator.090309k.tar.gz release
is now available. It fixes known problems
with CircuitSimulator.090309j.tar.gz . In addition to the distributed
tests, it has also been tested on a few homework 4 assignments. Note:
the cache still blocks on a fetch so that no other cache requests can
be processed until the fetch completes. I need to give a bit more
thought to whether or not that can change. Also, I put in a number
of statistics that are now dumped. The cache is unit 3, so don't forget
to do a dump 3 if you want to see read hit and miss stats. Remember all
writes that go to the cache also go to Ram, so there are no hit and miss
stats for write (since they take the same amount of time either way).
I also included a build_h5 script that is a modified build_h4 script
along the lines discussed in the homework 5 spec.
- 10 Mar 2009: (1:09 pm) Bug report received for SecondComputer in
j distribution.
A new version should be available soon.
- 9 Mar 2009: The spec for
homework 5 is now available. To the best of my knowlege, it is complete.
- 9 Mar 2009: No problems have been reported with FirstComputer used
in Homework 4. For Homework 5, the distribution
CircuitSimulator.090309j.tar.gz has been placed in my CS350 directory on gaul.
This distribution implements an 8-entry 4-word line direct mapped cache
and a CPU buffer of one 4-word line. It passes my test routines (as well
as two new ones). The following new files exist:
- test6.cpu: a version of test1.cpu where FD (fetch direct), FPCR
(fetch program counter relative) and FI (fetch indirect) instructions
are tried out. In the homework 4 simulator, if you tried to execute
these instructions (opcodes 0x1E, 0x1F, and 0x29), it would be ignored
as a NOP. If you had labels named FD, FI, or FPCR in your homework 4,
those will no longer be processed properly because they are now viewed
as keywords known to the assembler. Other than this, any program that
ran successfully on FirstComputer should also run on SecondComputer
(but usually with a faster time count).
- test7.cpu: a small program that copies an array from one location
to another.
- SecondComputer.java: This is the simulator that you should use
for your homework 5 solution. It builds a computer with a bus connecting
a CacheWithFetch (delay of 2), RamFourWide (delay of 20, reads 4 words at
a time), and SecondCpu (implements FD, FI, and FPCR instructions).
- SecondComputerWithCache.java : A version of SecondComputer where
Cache and SecondCpuWithBuffer are used. The difference is that
FD, FPCR, and FI are treated as NO-OPs here, but the caching should work.
If a problem develops with SecondComputer.java, lets us determine if it
had to do with the changes involved in implementing FD, FPCR, and FI or
not.
- SecondComputerWithBuffer.java: A version of SecondComputerWithCache
that doesn't use a Cache but just has SecondCpuWithBuffer communicate
directly with RamFourWide. If a problem developes with
SecondComputerWithCache, this allows us to see if the problem arose from
the code used to implement the Cache module.
- SecondComputerWithoutBuffer.java: A version of SecondComputerWithBuffer
that uses SecondCpuWithoutBuffer and RamWithCacheInstructions.
If a problem develops with SecondComputerWithBuffer, this allows us to see
if the problem arose from the code used to implement an internal 4 word
buffer in the cpu and communicate between the ram and cpu in multiple of
4-words. The assembler in RamWithCacheInstructions is aware of the FD, FI,
and FPCR op codes.
- FirstComputer.java: A version of SecondComputerWithoutBuffer that
uses FirstCpu and Ram . The main difference is that the assembler in
Ram does not know about the FD, FI, and FPCR op codes.
- 6 Mar 2009: As mentioned last Thursday, the TA cancelled office
hours this past Monday due to another engagement. He is holding special
office hours today (Friday 06 Mar 2009) from 3:30 to 5:30pm in MC4.
- 2 Mar 2009: The spec for
homework 4 has now been updated to include the slowest speed that will
be acceptable for a solution. First, we note that a sort like bubble sort
or insertion sort would have to do n * (n + 1)/2 data reads and the benchmark
is a list where n is 1023, which would be 523,776 data reads. Then we note
that each data read takes at least 20 clock cycles to read ram and at least
another 20 clock cycles to read the instruction that did the ram read and
we see a minimum of 20,951,040 clock cycles. On the other hand, an n log
n sort algorithm would require a minimum of 1023 * 10 (10 is log base 2
of 1024) data reads or 1023 * 10 * 40 clock cycles (409,200). Of course
coding introduces more instructions than the minimum amount to read the
data (writing, looping, etc.). After doing some coding, it looks like
5 million clock cycles is definitely attainable, so I am setting 15 million
clock cycles on the benchmark as the slowest acceptable solution (note:
faster solutions will be worth more, as with other assignments).
- 27 Feb 2009: There is now a spec for
homework 4. This spec uses the scripts build_h4, random_list.gawk,
and extract_list.gawk to build and test the benchmark run. A new
distribution CircuitSimulator.090227i.tar.gz has been made to include
these scripts. At this point, the only thing missing from the spec
is the slowest time allowed for a working solution. The goal of
this value is to prevent solutions based on O(n**2) sort programs
like bubblesort.
- 27 Feb 2009: CircuitSimulator.090227h.tar.gz is now available
in my CS350 directory on gaul.
It includes a new test program test5.cpu which uses the instructions
NOP, SRI, JA, JI, BM, and BNM. It includes a fix to Cpu.java to correct
the handling of the program counter for the SRI instruction and
to cause JI to save pcp1 rather than pc when it jumps (similar to CALL,
but using a register rather than a fixed address).
This leaves the ADC, SUBC, RCC, WCC, BC, BNC, BO, and BNO instructions
untested. At the moment I don't see any use for them in the assignment,
so I will leave testing them til next week. If you have a use for them,
send me email letting me know how you want to use them and I will check
further into them.
- 26 Feb 2009: Plan for 27 Feb 2009 is to put up a formal writeup
of Homework 4. The `spec so far' note below covers all the key points
already for people who want to work on it tonight.
- 26 Feb 2009: CircuitSimulator.090226g.tar.gz is now available
in my CS350 directory on gaul. The following errors have been fixed:
- CALL was fixed in CPU.java
- LRN encoding was fixed in RAM.java
- LRN sign extenision was fixed in CPU.java
- ADDI was changed from a sourceRegister instruction to a sourceConst
instruction in RAM.java
- ADDI was given sign extension in CPU.java
- RET was fixed in CPU.java
- SRL was fixed in CPU.java (to prevent it sign extending)
- checks were added to RAM.java so if your destination is too far
away for a branch instruction, RAM.java complains.
- test2.cpu was created demonstrating a subroutine that doesn
shifting by arbitrary amounts being called multiple times.
- test3.cpu was created to exercise the AND, OR, ADD, SUB, XOR, NOT,
SLL, SRL, and SRA instructions a bit.
- test4.cpu is test1.cpu rewritten to demonstrate the advantages
of ADDI. (test1.cpu was the sum memory program discussed in class).
- Instructions that I haven't done any testing on yet are:
NOP, ADC, SUBC, RCC, WCC, SRI, JA, JI, BC, BNC, BM, BNM, BO, BNO,
SWI, RTI, and RTIM. Of these, SRI, JA, JI, BM, and BNM are the
only ones that look like they might be useful in the assignment.
- 26 Feb 2009: Homework 4 spec so far: You are writing a
program that can sort 1024 integers in a reasonable amount of time
(putting smallest at the front of the list and largest at the end).
The benchmark program will use a random number generator to create
the list, but you are responsible for non-random lists as well.
On the benchmark version, you are expected to have preformance that
is comparable to an n lg n sort such as quicksort or heapsort rather
than an n ** 2 sort like bubble sort. The total amount of memory
that the computer FirstComputer.java in CircuitSimulator.090226f.tar.gz
(see message below) has is 2048 words. The sequence that you are
to sort will set at location 1024 and up to 2047. At location 1023
will be a length value that tells you how many addresses past 1024
contain valid values to sort. The length value is guarenteed to
be larger than 1 and smaller than 1024. You are trying to minimize
the number of clock cycles it takes FirstComputer.java to run your
sort program. FirstComputer.java implements the SUNY RISC design
discussed in volume 2 of the textbook on pages 75 thru 83. It assumes
each memory access requires 20 clock cycles to process. A HALT instruction
with opcode 0x1D has been added to the SUNY RISC instruction set.
The SWI, RTI, and RTIM instructions were implemented as NO-OPs.
The `sum memory' sample program can be found in the
CircuitSimulator.090226f.tar.gz distribution under the names
test1.cpu, test1.cpu.debug, test1.cpu.1000, test1.cpu.unroll, and
test1.cpu.unroll.1000 .
- 26 Feb 2009: Remember that Homework 4 is due the Monday after
the midterm. The simulator for Homework 4 is FirstComputer.java
and is in CircuitSimulator.090226f.tar.gz in my CS350 directory on
gaul. This is the version we discussed in class. It successfully
ran a few versions of a program to sum the contents of memory.
I will be doing more testing on it tonight. There is an error in
the implementation of the CALL instruction that I will be fixing
tonight and posting a newer version. I will also do more extensive
testing of the instruction set. I currently think that certain boundary
cases for the carry and overflow condition code bits are not being
handled right, but I don't think you will be using them for your assignments,
so fixing them is not a high priority at the moment. If you think you
have a use for them, contact me.
- 26 Feb 2009: In terms of what sort of questions will be on the
midterm, I expect the material up to and including discussions of
controllers and architectures. If one were trying to make up a
practice midterm, one could take the three questions from the
midterm and add to them the architecture question and controller
question from the final that I handed out before the quiz and that
would make a reasonable collection of questions. As you may already
have begun to suppect, my exams tend to be dominated by questions
involving designing something or analyzing/tracing something that
has already been designed.
- 25 Feb 2009: Note: if you are working in a group, then that
means everyone in the group should be handing in the copies of the
same report, solution file, etc., and gets the same mark (with the
possible exception of late penalties). In the case where different
group members hand in different stuff, it is the one who comes first
alphabetically that gets used for marking (since that is the order
I process the files).
- 23 Feb 2009: Regarding the question of where `reset' goes, `reset'
is part of the controller interface that brings the controller back to
the start state.
- 20 Feb 2009: Have started on the marking to H2. Of the 19 distinct
solutions handed in (of the 27 people that handed in solutions), 6 of
them were incorrect. At this point, having looked at a few of them,
I think it mostly relates to people not using the tester I provided
and gotten the flag calculation wrong. If you are wondering if your
h2.cdn has such a problem, you can check by the following instructions:
- copy your h2.cdn to x.cdn
- delete the INPUTS line from x.cdn and everything that comes after it.
- get a copy of short_test from my CS350 directory on gaul.
- append the contents of short_test to the end of x.cdn
- if you used flag[0] instead of flag, you will need to edit the OUTPUTS
line accordingly.
- run x.cdn through the simulator to produce x.result
- run check.gawk on x.result
While more extensive testing than this was done on the homeworks, this
test turns up all the assignments that didn't work. Of the ones that
did work, the corresponding preformance numbers are:
- 2528 * 13 = 32864
- 2246 * 15 = 33690
- 1714 * 40 = 68560
- 2784 * 25 = 69600
- 4186 * 17 = 71162
- 1968 * 49 = 96432
- 3584 * 43 = 154112
- 2668 * 67 = 178756
- 2732 * 67 = 183044
- 2686 * 71 = 190706
- 3130 * 70 = 219100
- 3376 * 104 = 351104
- 3904 * 106 = 413824
While obviously not optimal, a working solution to Homework 2 required only
13 lines of code:
public static void pH2(String in1, String in2, String cw,
String result, String flag, int width) {
String AddResult = pAdd(in1, in2, width);
String NotResult = pWordNot(in1, width);
String AndResult = pWordAnd(in1, in2, width);
String One = pConstant(1,width);
String Zero = pConstant(0,width);
String Bus = pDecoder(cw, 3);
String Result1 = pWideSwitch(Zero, AddResult, iW(Bus,0), width);
String Result2 = pWideSwitch(Result1, NotResult, iW(Bus,1), width);
String Result3 = pWideSwitch(Result2, AndResult, iW(Bus,2), width);
String Result4 = pWideSwitch(Result3, One, iW(Bus,3), width);
String Result5 = pWideSwitch(Result4, in1, iW(Bus,4), result, width);
String Compare = pCompareSigned(in1, in2, width);
pJoin(iW(Compare, 0), flag);
}
Homework 3 is a bit larger than this. Indeed, it tends to be the
most challenging of the assignments during the semester. A number of
people have already indicated they have it working, but in case you are
not among them, the following should help you maximize your results:
- Hand in something on time. That way, if you never get it to work
better, at least you don't lose as much late penalty as you would if you
waited.
- Be sure and use the tester I provided you with.
- Grow the assignment incrementally. Within the context of the spec,
the following are natural groupings of instructions that I will test for.
Being able to get some of these groupings to work is better than getting
none of them to work.
- StoreA and Halt. (this allows you to store a 0 at various locations
in memory and then halt)
- LoadA, StoreA, and Halt. (this allows you to copy values from one
location in memory to another)
- LoadA, StoreA, Not, and Halt. (this allows you to change the value
you read from memory before storing it out)
- LoadA, LoadB, StoreA, Add, and Halt
- LoadA, LoadB, StoreA, And, and Halt
- LoadA, LoadB, StoreA, JumpLt, and Halt
- executes the benchmark program properly
Of course, a working program is expected to do all of the above as well
as execute correctly other test programs.
- 19 Feb 2009: Someone running my H3Tester.java reported getting the
error message:
Looking for a string but did not find one null [58] at line number 4038
This was traced to a problem with the character : in an input file
(my ReadText.java doesn't recognize it as a valid character). As a
quick fix, I have reworked the msg outputs in H3Tester.java that had a :
in them so they no longer have a : in them.
Specifically, the lines in the method test in H3Tester.java on gaul in my CS350 directory
that read:
runSimulator();
for (int i = 0; i < 128; i++) {
msg("simulator output: " + i + ": " + binaryDigits(M[i],16));
}
have been changed to:
runSimulator();
msg("oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo");
msg("The above should correspond to the following runSimulator output");
msg("oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo");
for (int i = 0; i < 128; i++) {
msg("[" + i + "] " + binaryDigits(M[i],16));
}
You will probably find it easier to work with than the broken version, although
the problem should not have interferred with simulating your circuit, it only
prevented the printing of the messages I had included to make it easier to
check your output.
- 19 Feb 2009: For reference to people who didn't catch how Ram
works (in particular those who missed that day):
- rw is read/write .
cs is chip select. The handout from that class that goes through the
protocol is called RAM_PROTOCOL and is in the directory that contains the
source to the simulator. This document explains the format and behavior
of the RAM object in the cdn format. All pRam (in Ram.java) does is
translate its parameters to the corresponding cdn format elements.
- The big picture is that a ram chip reads when rw is 0 and writes when rw is 1.
However, it takes multiple clocks to perform an operation, so it needs to
know when to pay attention to the value on the rw line. The chip select (cs)
line rising to 1 tells the ram to take a new command from the rw line.
The Ram tells the cpu that it has performed the operation by raising the
dack (data acknowledgement) line to 1. In addition to rw, cs, and dack,
there are also the lines that carry information: in (data in, only relevant
when writing to RAM), out (used to report value read from or written to
RAM), and address (used to indicate where in RAM the operation is being
performed).
- The two-dimensional array data is used to pre-initialize RAM. It is 2-d
because each word in RAM is viewed as an array of bits. In past semesters,
this is where the test configuration would be input. This year, we have the
CHANGE_RAM feature that lets us override the pre-initialized contents
of RAM in the test data portion of the cdn file.
- 6 Feb 2009: Homework 3 is now
complete (i.e., the description of the tester for this homework
has been added to the spec).
The java file for the tester can be found in the gaul CS350 directory
under the name: H3Tester.java .
- 5 Feb 2009: As announced in class today, the TA will have special
one time office hours on Friday 6 Feb 2009 from 9:30am to 12:30 (the one
just after noon) in room MC4.
- 4 Feb 2009: Homework 3 now has pretty
much everything except the tester for this circuit, so you should be able
to design the controller and architecture you will need for this assignment.
- 4 Feb 2009: CircuitSimulator.090204e.tar.gz is now available on gaul
in my CS350 directory. The only change from d was to add the command
DUMP_RAM which outputs the contents of RAM. This makes it easier to see
what changes your circuit has made to RAM (for Homework 3 testing).
- 4 Feb 2009: Homework 3 UNDER CONSTRUCTION
- 4 Feb 2009: CircuitSimulator.090204d.tar.gz is now available on gaul
in my CS350 directory. As mentioned in class test_data/Registers.java has
been changed so that the default value of memory_default is now "0" rather
than "U". Other than that the changes are all on the simulator in order
to better support homework 3. In particular the following changes have been
made:
- 30 Jan 2009: People often wonder how small memory can get.
This Stanford University news release reports work that stores
35 bits of information on a single electron. The point being that you can
actually store more than 1 bit per atom.
- 30 Jan 2009: Note: The online
course outline has finally been updated to include the Medical Illness
addendum that was in the hardcopy version handed out at the beginning of
the semester.
- 29 Jan 2009: Since the problem with pAnd was causing problems with
Homework 2, I am pushing the Homework 2 due date to 9 Feb 2009 (a week later
than the currently scheduled 2 Feb 2009). Hope this helps. Thanks to
the person who raised the issue. To anyone else caught up in this, remember
that when you are having problems, it is expected that you will ask questions.
I still expect to bring out the Homework 3 spec early next week, for people
who want to go ahead and get started on it.
- 29 Jan 2009: Found the problem. It turned out that in
BooleanRelayCircuits.java, pAnd, pAnd3, and pAnd4 were incorrectly implemented.
Specifically:
public static String pAnd(String in1, String in2, String out) {
return pNot(pNand(in1, in2, out));
}
which for pAnd("a","b","c") would produce:
NAND a b | c
NOT c | t_0
should have been:
public static String pAnd(String in1, String in2, String out) {
return pNot(pNand(in1, in2), out);
}
which for pAnd("a","b","c") would produce:
NAND a b | t_0
NOT t_0 | c
A similar problem was in pAnd3 and pAnd4. I have made these fixes to
BooleanRelayCircuits.java and now the regressions tests on Components.java
work (as well as the tests on BooleanRelayCircuits.java and Tester09.java).
The new version is CircuitSimulator.090109c.tar.gz
in my CS350 directory on gaul.
- 28 Jan 2009: One of the regression test methods (test9()) in
Components.java doesn't seem to be working any more. This was just
brought to my attention and I hope to have a response Thursday.
The reason this is of interest is that this method tests the
pCompare methods which are relevant to homework 2.
- 28 Jan 2009: By the way, although each group member is expected
to hand in the same report and cdn file, it is still possible for
members of a group to get different marks by handing in later than
other group members (in terms of days of late penalties). It is also
possible for a group member to lose marks relative to other group
members by not following the instructions and having missing files or
wrongly named files.
- 28 Jan 2009: Homework 1 will be handed back at the end of class
Thursday. For those interested, there were 3 groups of 3 and 5 groups of 2,
with the remaining 10 students working individually.
- 27 Jan 2009: One question that arises is what should be doable for
performance on assignment 2. In terms of space, one would expect not much
more transistors would be needed than that of three adders. In terms of
depth, one would expect not much more than an adder. This gives a total
performance cost of roughly three times the homework 1 performance costs
(plus or minus a few thousand). Note: this is just a `back of envelope'
calculation (for more on envelopes,
see:
Programming pearls: the back of the envelope (by Jon Bentley, CACM, 1984)
as well as the wikipedia entry).
- 27 Jan 2009: The assignments are not marked yet, but I expect them
to be ready for Thursday's class. People not following instructions on
the names of the output wires, names of report files, names of figure files,
etc. slowed things down. On the plus side, of the 29 people who handed in
circuits, all passed the test data except 1 (whose report indicated they
were already aware there was a problem). The efficiency results were:
1086 * 11 = 11946
-------------------------- 11 is the shallowest 16-bit adder handed in.
494 * 34 = 16796
-------------------------- 494 is the smallest number of transistors in a
16-bit adder design handed in.
512 * 36 = 18432
904 * 21 = 18984
-------------------------- note that with 16-bit addition, depth less than 32
means less than 2 levels per carry if a ripple
carry design is being used. of course there are
other possible designs besides ripple carry.
568 * 34 = 19312
576 * 36 = 20736
670 * 34 = 22780
704 * 36 = 25344
842 * 33 = 27786
842 * 33 = 27786
-------------------------- 27786 was my optimization of the class design
896 * 33 = 29568
896 * 33 = 29568
896 * 33 = 29568
936 * 33 = 30888
932 * 48 = 44736
1024 * 48 = 49152
------------------------ 49152 is the class design
800 * 68 = 54400
1216 * 48 = 58368
------------------------ the builtin pAdd 1056 * 100 = 105600
1438 * 77 = 110726
- 26 Jan 2009: A few emails have asked for clarification about
how the homework 2 circuit is supposed to work. While the English
description is supposed to help, I view the Java test code in the homework
assignment as the defining specification. However, since Java doesn't
know about the keyword THEN, that specification needed to be cleaned up
a bit in order to compile. In addition to removing the THENs and inserting
a missing parenthesis, I also made sure H2a was only looking at the least
significant 3 bits of cw. With these alterations,
Homework 2 should now be clearer.
- 23 Jan 2009:
a hardware computing project recently mentioned on SlashDot with the intro:
- Have you ever wanted to have one of those big scrolling LED displays, to keep you up to date on e-mails, news, weather, or stock quotes? In this video, we'll show you how to build an array of 120 LEDs, how to control them all from one microcontroller, and how to talk to it from your computer.
- 22 Jan 2009: Regarding getting your marks back on assignments and
exams. Normally I will return your marks at the end of class. If you
prefer to get your marks via encrypted email, not have them brought to
class, or allow others to pick them up, then you should read and
hand in form (postscript version)
( form (pdf version)). Note in order
to process this form you will need to present your ID.
- 20 Jan 2009: With respect to today's class about what adder design
is generally used, according to wikipedia, it is Kogge-Stone Adder which is a particular
variant on carry lookahead. Wikipedia also presents the
Brent Kung adder,
the carry bypass
adder, the
carry look-ahead adder, the carry-select adder, the , and the
the Ling adder. There are a lot of different ways of approaching the
issues in adder design. There is still a continuing research effort on
what is the best way to handle addition in circuit design.
- 19 Jan 2009:
Homework 2 is now available. It should be doable with the
same version of the Circuit Simulator as Homework 1 was done with.
Note that Homework 2 includes an addition circuit as a piece of
a larger circuit. The intent is that Homework 3 will include the
circuit described in Homework 2 in order to create a small CPU.
- 15 Jan 2009: As mentioned in class today, if you place the line:
p("DUMP_ALL");
just after a
p("INPUT_DATA ...");
in the Java code that generates your test data, then the simulator will
dump a list of all the wires in your circuit including their depth and
their values computed from the input listed on the INPUT_DATA line.
- 14 Jan 2009: Trying a few sample solutions to Homework 1. Just
using pAdd from Components, one gets 1056 transistors with a depth of
100 for a cost of 105,600. Following the basic design of
last Thursday's class, one gets 1024 transistors with a depth of 48
for a cost of 49,152 (which is twice as good). Reworking
the boolean equations using 209-style methods, I was able to further
reduce the cost to 27,786 (or just a bit more than 40% less than
Thursday's equations). Trying a more radical restructuring of the
problem, I got the cost down to 25,270. There are other possibilities,
but this should give you a sense of what is doable.
- 13 Jan 2009: As mentioned in class today, if you are worried about
the security of your project files over and above the normal OS security
of our local machines, I recommend using the program ccrypt. In order
to use this for class work or email communications, you need to see me
to arrange for an encryption key. Something like:
> tar cf - h1 | ccrypt -e > h1.cpt
Enter encryption key:
Enter encryption key: (repeat)
> submit cs350 h1.cpt
would then become the handin process. Of course, in order for it to work,
I would need to know what encryption key had been used.
On gaul, ccrypt version 1.7 can be found in /gaul/u0/usr/faculty/webber/E1 .
It's home page is ccrypt.sourceforge.net . Links to this and many related computer security matters can
be found on the
CS413 web page
- 13 Jan 2009: TA office hours announced in today's class also posted
above.
- 13 Jan 2009:
Homework 1 has been updated to include NOT in the list of allowed
fundamental primitives (as discussed in class).
- 12 Jan 2009: My office hours posted (see top of this web page).
- 12 Jan 2009: This is to let you know that the electronic online
handin program (submit) is working on gaul now for our course. Thus
if you have your homework 1 work in a directory called h1, you would
hand it in by typing:
submit cs3350 h1
Note: the Monday deadline is the midnight on the boundary between Monday
and Tuesday.
- 9 Jan 2009: The file CircuitSimulator.090109b.tar.gz is now available
in the usual place on gaul (see earlier announcements). This version
now uses the primitives needed to do homework 1. It also is intended
to produce the correct transistor count and depth calcuations for the
performance numbers for homework 1. In addition to updating
RelayCircuitHelper.java and BooleanRelayCircuits.java as mentioned earlier,
I also changed files Circuit.java, Node.java, and Table.java in the
simulator. The only testing that has been done on this new version of
the simulator is the output of Tester09.java using the new version of
the test_data files. It continued to produce the right results on the
272 test cases and the number of transistors (86) and depth (2) reported
for the first test case was verified as correct. Over the weekend I will
do more testing and on Monday read and respond to any email you send over
the weekend. I advise trying to work with CircuitSimulator.090109b.tar.gz
and send me email if it has an obvious problem. Remember that
CircuitSimulator.090109a.tar.gz is also still available and although it
doesn't calculate performance correctly, it should at least allow you to
run test data against your circuit design (if the newer version has some
major problem). If the material doesn't make
sense from looking at the code in Tester09.java and the addition method
pAdd in Components.java as well as the homework spec and stuff written on
this page, then feel free to
send email which I will respond to on Monday. I also plan on running through
an example of using the system in Tuesday's class.
- 9 Jan 2009: A few observations of note:
- The course outline mentions that you are allowed to work in groups
of up to 3 on the assignments. Since you will be doing similar programming
on the exams, it is important to make sure you are learning how to
build these designs when you are working in groups.
- The course outline also mentions that in order to get past my
spam filtering system, you should include the course number (3350) in
your subject line.
- 9 Jan 2009: A detailed spec for
Homework 1 is now available. Note that with the information
now available, you can create sample homework 1 solutions, but
the CircuitSimulator needs more work in order for it to give you
accurate transistor counts and timing calculations which are
important for you to optimize your design.
- 9 Jan 2009: In /gaul/u0/usr/faculty/webber/CS350,
you can now find CircuitSimulator.090109a.tar.gz . This still uses
relays instead of our new model, but it defines the functions:
pNor3, pNor4, pNand3, pNand4, pOr3, pOr4, pAnd3, and pAnd4, which
were not in the original system (pNot, pAnd, pOr, pNor, pNand were
already there). It also defines a new module Tester09.java.
I then tested it by running:
obelix[43]% pwd
/gaul/u0/usr/faculty/webber/CS350/CircuitSimulator.090109a/test_data
obelix[44]% javac *.java
obelix[45]% java Tester09 > test09.cdn
obelix[46]% mv test09.cdn ..
obelix[47]% cd ..
obelix[48]% javac *.java
Note: ReadText.java uses or overrides a deprecated API. Recompile with "-deprecation" for details.
1 warning
obelix[49]% java Circuit < test09.cdn > test09.result
obelix[50]% gawk -f check.gawk < test09.result
ALL 272 RESULTs MATCHED CORRESPONDING OUTPUTs
Of course, you can't work in /gaul/u0/usr/faculty/webber/CS350/CircuitSimulator.090109a, so you will want to make your own copy (or unpack in your own space
CircuitSimulator.090109a.tar.gz . After unpacking, recompiling the two
java areas (the one immediately under CircuitSimulator.090109a and the one
under CircuitSimulator.090109a/test_data may be useful). Under test_data,
you can run Tester09 which creates a circuit with 13 outputs (one for each
of the Not, Nand, Nor, And, and Or gates that the system now supports) and
tests each one for all possible combinations of 0s and 1s as inputs. This
is done in the method test1() in Tester09.java . After that, it erases the
current circuit and lays down a new circuit that adds two 4-bit numbers
producing a 4-bit result and then tests it on all possible inputs. Once
I run Tester09 and put all the circuit and test data specs into test09.cdn,
I then move it up to the directory where the simulator is. Then I run all
this through the simulator. Because I recorded the intended output for each
test data point, I am then about to use the gawk script check.gawk to read
the result file and verify that the output of the simulator matches the
corresponding messages getting the result that all 272 test cases worked fine.
The file test09.result looks like:
* Test Set 1
Circuit cleared with value U
----------
INPUT 0 0 0 0
OUTPUT 1 1 1 0 0 1 1 0 0 1 1 0 0
* RESULT 1 1 1 0 0 1 1 0 0 1 1 0 0
...
and basically check.gawk is skimming through looking for lines beginning
with OUTPUT and then matching their values to lines beginning with
* RESULT . This set up makes it easier to automate the running of large
sets of test data. Anyway, with CircuitSimulator.090109a.tar.gz, it is
now possible for you to write a H1.java file for your adder design and
test it out using the same features you will have available for H1.
The only difference is that the circuit depth and size calculations
the simulator does are completely wrong. It would also be a bad idea
for you to modify BooleanRelayCircuits.java or RelayCircuitHelper.java
in the test_data directory as both of these files will be significantly
changed in the next release (either today or Monday).
- 9 Jan 2009: Hi. Just wanted to let you know that I
still expect to have the material announced in class Thursday up
today. By the way, the file CircuitSimulator.BkWi06a4.tar.gz
on gaul in /gaul/u0/usr/faculty/webber/CS350 is the old material
from 2006 which I am updating to take into account the performance
characteristics of transistors that we discussed in class Thursday
(this old version instead things that relays are the primitive being
used and that changing the signal on either input of a relay has
the same delay). I will also be putting up a spec for homework 1
which will be about optimizing a 16-bit adder.
- 22 Dec 2008: Created announcements page.
I can be contacted by email at webber@csd.uwo.ca .
Be sure and include course number in subject line of your email.