OFFICE HOURS DURING SEMESTER
Most recent announcements are at top, older ones are further down the page.
[Note: this time I didn't email out project marks, but you should be able to figure out your status from the data below.] Below are the stats on two runs. The group numbers were randomly assigned. To generate the random seed for the benchmark, I rolled dice (treating 6 as 0) to get the base 6 value of 00042313 which I used bc to convert to 5733 in base 10. To qualify as working, had to do better than 30,000,000 clock cycles on this benchmark (which all programs that produced correct output accomplished). For further testing, I used 3 lists of length 5, one in order, one reverse order, and a third all duplicates (oddly enough, the all duplicates list was substantially different in performance than the other two, sometimes better, sometimes worse). Of the 7 programs handed in, 3 were not working: one included this in the report, one wasn't working on the sample provided but didn't notice, and one blew up when given a list of 5 items in order (and report wasn't aware of this). [The marks indicated below do not take into account late penalties.]
g2: on benchmark: 3,278,269; on sample provided: 3,242,207
on 5 items 1 thru 5: 5,949
on 5 items 5 thru 1: 6,031
on 5 copies of number 3: 4,872
was a quicksort implementation (fastest)
mark: 10
g4: on benchmark: 4,135,428; on sample provided: 4,138,166
on 5 items 1 thru 5: 3,795
on 5 items 5 thru 1: 18,925
on 5 copies of number 3: 19,129
was a radixsort implementation (1.25 times slower than fastest)
mark: 9.5
g1: on benchmark: 6,286,415; on sample provided: 6,251,811
on 5 items 1 thru 5: 6,308
on 5 items 5 thru 1: 3,439
on 5 copies of number 3: 2,192
was a heapsort implementation (2 times slower than fastest)
mark: 8.5
g7: on benchmark: 18,317,552; on sample provided: 19,789,094
on 5 items 1 thru 5: 18,898
on 5 items 5 thru 1: 18,653
on 5 copies of number 3: 25,103
was a quicksort implementation (6 times slower than fastest)
mark: 7
g5: on benchmark: wrong output; on sample provided: wrong output (unordered)
wrong output on list of 5 items 1 thru 5
wrong output on list of 5 items 5 thru 1
wrong output on 5 copies of the number 3
was a heapsort; report indicated was aware not working
mark: 4.5
g6: failed on list of 5 items, 1 thru 5; on benchmark: 7,847,706;
on sample provided: 7,840,799
worked on list of 5 items, 5 thru 1
worked on 5 copies of the number 3
was a heapsort; report not aware not working, but unsure
mark: 4 (random testing, but no boundary testing reported)
g3: on benchmark: wrong output; on sample provided: wrong output
(on sample introduced values not in list, e.g., 999 and
lost values that were in list, e.g., -32721)
worked on list of 5 items 1 thru 5
worked on list of 5 items 5 thru 1
worked on 5 copies of the number 3
was a heapsort; report not aware not working
mark: 3.5 (claims extensive testing and scripts to verify
results, so unclear how errors in sample were
missed, report less detailed than others)
During its four-hour Financial Analyst Day presentation today, AMD revealed new elements of its processor roadmap spanning the next couple of years, as well as its plans to scale beyond the current multi-core model. Intel talked about processors with "tens to hundreds of cores" at IDF earlier this year, but AMD believes the core race is just a repeat of the megahertz race and that adding more cores isn't the best way to go about scaling processor performance in the future. Instead, AMD is cooking up what it calls "Accelerated Processing Units" ... (see article for rest).
DEPARTMENT OF COMPUTER SCIENCE
THE UNIVERSITY OF WESTERN ONTARIO
Public Lecture for the Degree of Ph.D.
Qufei Wang
WLRU CPU Cache Replacement Algorithm
DATE: Thursday, December 14, 2006
TIME: 1:30 p.m.
PLACE: Room 316, Middlesex College
THESIS SUPERVISOR: Dr. Hanan Lutfiyya
EXTRA-DEPARTMENTAL
EXAMINER: Dr. Abdallah Shami, Elect. & Computer Engineering
EXTERNAL EXAMINER: Dr. Marin Litou, IBM
THESIS EXAMINERS: Dr. Mark Daley
Dr. Mike Katchabaw
ABSTRACT
A CPU consists of two parts, the CPU cores and the CPU caches. CPU caches
are small memories on the same die as the CPU cores. Recently used
instructions and data are stored in CPU caches. Accessing CPU caches is
fast, but accessing the main memory is very slow. The main memory is so
slow that the CPU is idle for more than 80% of the time waiting for memory
accesses. This problem is known as the memory wall. The memory wall
implies that faster or more CPU cores are of little use if the performance
of CPU caches does not improve.
Generally, larger CPU caches have higher performance but the improvement is
very small. A smarter CPU cache replacement algorithm is of more potential.
The CPU cache replacement algorithm decides which cache content to be
replaced. Currently, Least Recently Used (LRU) replacement and its variants
are most widely used in CPUs.
However, the performance of LRU is not satisfactory for applications of
poor locality, such as network protocols and applications, multimedia and
3D graphics. We found that there is a pattern in the memory references of
these applications that makes LRU fails. Based on this discovery, we
developed a new CPU cache replacement called WLRU. Trace based simulations
show that WLRU has significant improvement over LRU for applications of
poor locality. For example, for web servers, WLRU has 50% fewer L2 cache
misses than LRU. This means WLRU
can immediately improve the performance of web servers by more than 200%.
CPU caches have been intensively studied in the past thirty years. WLRU
has by far the biggest improvement. Our studies also indicate that WLRU is
very close to the theoretical upper limit of cache replacement algorithms.
This means any further improvement in CPU cache performance will have to
come from changes to the software. In
future work, we will investigate how to write OS and software to have
better CPU cache performance.
Security extensions for Darwin can be found at SEDarwin analogous to NSA's SELinux work (see also SELinux SourceForge). NSA has some interesting security notes. I particularly like So Your Boss Bought You A New Laptop ... how do you identify and disable wireless capabilities which addresses Mac OS X, Windows, and Linux systems (you might notice an implicit message about the usefulness of wireless networking here -- also note the issue of laptops with built-in microphones as well as putting metallic tape over infra-red ports).
Incidently, you may notice that many of the recent articles have come from two journals that are aimed at a general professional audience and focus on computer hardware related issues IEEE Computer and IEEE Micro. IEEE Computer covers software as well as hardware (including, over the years, many interesting tutorial articles). Intel also publishes Intel Technology Journal (and its archives)) covering their internal research projects. Similarly IBM Technical Journals including IBM Journal of Research and Development and IBM Systems Journal cover their internal research on many topics in computer science with industrial application. There are also some links that might be of interest in Microsoft's hardware development research groups. Although it is mostly about higher level abstractions, there is some discussion of hardware in Cisco's Internet Protocol Journal (for example, the article Network Processors: Programmable Technology for Building Network Systems ).
halt_op = 0; load_op = 1; // R[ra] = M[c]]; PC = PC+1 loadl_op = 2; // R[ra] = M[M[PC+1]]; PC = PC+2 loadi_op = 3; // R[ra] = M[R[rb]]; PC = PC + 1 store_op = 4; // M[c] = R[ra]; PC = PC + 1 storel_op = 5; // M[M[PC+1]] = R[ra]; PC = PC + 2 storei_op = 6; // M[R[ra]] = R[rb]; PC = PC + 1 jumplte_op = 7; // if (R[ra] <= R[rb]) then PC = c else PC = PC + 1 jumpltel_op = 8; // if (R[ra] <= R[rb]) then PC = M[PC+1] else PC = PC + 2 jumpltei_op = 9; // if (R[ra] <= R[rb]) then PC = R[rc] else PC = PC + 1 add_op = 10; // R[ra] = R[rb] + R[rc]; PC = PC + 1 sub_op = 11; // R[ra] = R[rb] - R[rc]; PC = PC + 1 mult_op = 12; // R[ra] = R[rb] * R[rc]; PC = PC + 1 int div_op = 13; // R[ra] = R[rb] / R[rc]; PC = PC + 1 (note: x / 0 = 0)In comments I have also defined the semantics of each of the allowable 14 op codes. The array R represents the 8 16-bit registers Computer01's cpu has. The array M represents the 4096 16-bit words of ram memory Computer01 has. An instruction in Computer01 machine code is a 16-bit word whose most significant 4 bits hold the opcode (undefined opcodes are treated as halts). So bits 15 thru 12 are opcode bits. The ra field is 3 bits used to specify which register to use (bits 11 thru 9). The rb field is 3 bits used to specify another register for operations involving two or more registers (bits 8 thru 6). The rc field is 3 bits used to specify a third register in situations where three registers are involved, such as Add, Sub, Mult, Div, and JumpLteI (bits 5 thru 3). The c field for the Load and Store commands is bits 0 thru 8 (for a maximum value of 511 if unsigned and 255 is signed). The c field for the JumpLte instruction (which uses the ra and rb fields) is even shorter at bits 5 thru 0 (for a maximum value of 63 unsigned). This means that most of the addresses in Ram can't be accessed by these instructions and is the reason why their is also the LoadL, StoreL, and JumpLteL instructions, which basically use a second 16-bit word to store the address constant and then push the PC past it, thus allowing direct access to anywhere in Ram, but at an increased space cost.
Add r0 r4 r5
If you wanted to do something like:
if r3 <= r5 then
r0 = r4 * r5
else
r0 = r4 / r5
then you might write:
JumpLte r3 r5 then
Div r0 r4 r5
JumpLte r3 r3 done
label then
Mult r0 r4 r5
label done
but this would only work if these instructions were in the first 64 words of
memory, because of the size of the constant field for JumpLte.
For something that would work anywhere, you would need to write:
JumpLteL r3 r5
then
Div r0 r4 r5
JumpLteL r3 r3
done
label then
Mult r0 r4 r5
label done
The first example would take 4 words of memory and the second example would
take 6 words of memory.
java Computer01 < x.ram > x.result
The x.ram file is in the format created by the asm01.gawk program.
The x.result file looks like:
RUNTIME = 5139
STATE = done
PC = 49
0 : 4146
1 : 3030
...
4095 : 3333
This tells us that your program took 5139 `clock cycles' to run. It also
tells us that it exited normally (in STATE done) -- if it halted in the
STATE bad_op_code, then that would probably indicate that you had tried to
execute some data. The halt that was executed was at location 49 in memory.
And the contents of memory at location 0 was 4146 all the way to the location
4095 in memory that contained 3333.
The simulator has the following internal states corresponding to its notion of what happens in a single clock cycle: start, waitingForIR, processInstruction, waitingDataRead, waitingSecondDataRead, and waitingDataWrite. If a data access is in cache, then it spends just one cycle in that state, otherwise it spends much more. The simulator has been configured for a Ram where a data access (read or write) takes 50 clock cycles. The fastest an instruction like ADD can be executed is 3 clock cycles, a start, a waitingForIR, and a processInstruction. If it is not in cache, then it takes 52 clock cycles (as it spends 50 in waitingForIR). If you want to watch this happening, the debug statements that I used for tracking the state machine are commented out at the while (! halt) statement in the method main in Computer01.java (uncommenting them and recompiling gives you an output that shows you where your clock cycles are being spent.
In addition to Ram being 50 times slower than Cache, the file Cache.java also defines the parameters and strategies of the Cache. The Cache is 32 16-bit words. These are grouped into 4 word blocks (so the data path to Ram is 64-bits wide). Thus there are 8 cache lines (blocks). It is fully associative with a least recently used replacement algorithm. All writes go directly to Ram and so take the full 50 cycles. They do not change the used status of the word if it is in cache as well (although if the word is in cache, then it is updated there too).
address 2048 100 48049 38872 ... 6454 53363 address 0This should be interpreted as specifying that there is a list of numbers starting at location 2049 whose length is 100 that are to be sorted in place (i.e., the answer should be in locations 2049 through 2148 and location 2048 should continue to contain the value 100). If you look at the source, you will see that I have seeded the random number generator with 98734. The actual test data that your program will be benchmarked against will use a different seed (randomly chosen) which will not be announced prior to the assignment being due.
p(2*i + 1 + " " + M[2*i + 1]);
and this was then copied and edited to create the even print line of
p(2*i + " " + M[2*i] + 1);
which generated the error, which has now been fixed. Incidently, these
decimal values should be viewed as unsigned
0 23 response 24576where the 0 means a read, 23 is the address being read, and the response back from memory should be 24576 which is the value stored at location 23. An entry for a write looks like:
1 2 55 response 55where the 1 means a write, 2 is the address being written, 55 is the value to be written there, and the response back should be 55. From this data, I will write some gawk scripts to create INPUT_DATA commands for driving your caches and checking to see if the results are right. In this file, all values are decimal.
29 8205
indicating that at location 29 the value 8205 is stored.
In this file, all values are decimal. From this data, I will write a
gawk script that will create a 32-bit wide RAM initialized with the
corresponding binary information.
The writeups were good -- there were only two problems that marred this otherwise outstanding performance; 1) two groups got the input/output wire names wrong, e.g., using in[0] and result[0] rather than x[0] and r[0] or using y[0] rather than y, which was rather easy to compensate for; 2) one group looks like it got the index off by one (meaning the check.gawk wasn't useful in processing the test data results and only spot checking was done). Everything was on time except one person with gaul account problems.
Thus, the final mark distribution ended up being 10, 9.6, 9.2, 8.3, 7, 6. I will be sending email shortly notifying the groups of their marks.
I can be contacted by email at webber@csd.uwo.ca . Be sure and include course number in subject line of your email.