7.2. Project 3¶
7.2.1. Project 3: External Sort¶
You will implement something similar to the External Sort shown in OpenDSA Module 9.6.
Important simplification: You should use a standard Heapsort instead of Replacement Selection.
This makes the runs be a consistent size, which makes it much easier to implement the multiway merge phase.
The records consiste of two 32-bit fields, each storing integer values between 1 and 2,000,000,000.
The first field is the key value (what you sort on) and the second is data.
Note that the act of sorting modifies the input file.
7.2.2. Sorting the File¶
You use a “working memory” of 50,000 bytes.
You need to handle files that are much larger than 50,000 bytes in size.
Files are guaranteed to be multiples of 4096 bytes.
You read data from the file into the working memory, and out again to a second file as you need to.
You can’t put records from the file anywhere else.
You may use a small number (< 5) of integer variables to store values temporarily while working.
You should use
RandomAccessFileto interact with the file.You should use
ByteBufferandIntBufferto interact with the working memory.
7.2.3. Managing Your Memory¶
Within these constraints, you can manage the data within the memory pool any way that you like to do the sort.
You can decide where to read data to, and where to write data from (within the working memory).
You can decide the relative sizes of the heap and the output buffer (within the working memory) during the Heapsort phase.
You can decide the size and number of the buffers for the multiway merge phases (they are all actually different places in working memory).
Read about buffer pools. You don’t want to actually implement your memory as a standard buffer pool, but many of the ideas apply.
While you may NOT put data from the file anywhere other than the working memory, you may have other data structures (arrays or lists, for example) to keep track of what information is where in the working memory.
Example: Track the current position in each run.
7.2.4. Development Stages¶
Suggested development stages:
Get it to work on files that fit into the memory pool (only one run is produced by the Heapsort phase). This is Milestone 1.
Get it to work for one multiway merge pass (the number of runs produced during the Heapsort phase is no more than the number of run buffers available for the multiway merge phase).
Get it to work for more than one multiway merge phase.
7.2.5. Project 3 Special Considerations¶
This project is pretty different from the other three this semester.
Three week lifecycle.
No formal testing requirement.
Only one milestone.
Empirical performance is being checked.
There is a massive time difference between reading 4096 bytes at once vs. a series of 1024 integer (4-byte) reads.
If you don’t get that right, then Web-CAT reference tests will time out.
