Snip!t from collection of Alan Dix

The External Heapsort
http://www.diku.dk/forskning/performance-engineering/Jesper/heapl.../node23.html

The memory model assumed for this discussion is a paged memory. The N elements to be sorted are stored in a memory of M pages, with room for S elements per page. Thus N=MS.

The heapsort algorithms seen so far are not very well suited for external sorting. The expected number of page accesses for the basic heapsort in a memory-constrained environment (e.g. 3 pages of memory) is Θ(N log₂ M), where N is the number of records and M is the number of disk blocks. We would like it to be Θ(M log₂ M), which is typically orders of magnitude faster (example: 8-byte elements and 4 KB pages translate into 512 elements per page).
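
Purely to illustrate the gap, the sketch below plugs the example parameters into the two expressions above; the input size of 10^8 records is an assumed figure, not from the original text.

#include <cmath>
#include <cstdio>

// Illustrative only: plugs the example parameters (8-byte elements,
// 4 KB pages) into the two page-access expressions above, ignoring
// the constant factors hidden by the Theta notation.
int main() {
    const double page_size = 4096.0;          // bytes per page
    const double elem_size = 8.0;             // bytes per element
    const double S = page_size / elem_size;   // 512 elements per page
    const double N = 1e8;                     // assumed input size: 10^8 records
    const double M = N / S;                   // number of disk blocks, N = M*S

    const double basic_heapsort = N * std::log2(M);  // ~N log2 M page accesses
    const double desired        = M * std::log2(M);  // ~M log2 M page accesses

    std::printf("S = %.0f elements/page, M = %.0f pages\n", S, M);
    std::printf("basic heapsort: ~%.3g page accesses\n", basic_heapsort);
    std::printf("desired:        ~%.3g page accesses (factor %.0f fewer)\n",
                desired, basic_heapsort / desired);
}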

NOTE: Be aware that the term 'page access' does not necessarily imply that disk accesses are controlled by the virtual memory subsystem; they may be completely under program control.

One may wonder whether the d-heaps described in chapter 5 can be used efficiently as an external sorting method. If we consider the same situation as above, we see that if we use the number of elements per page, S, as the fanout, and align the heap in external memory so that all siblings fit in one page, the depth of the tree drops by a factor of log₂ S, and the number of page accesses drops accordingly. As observed in the chapter on d-heaps, the cost of a page access should be weighed against the cost of the S comparisons needed to find the maximal child. Still, because the cost of a disk access is normally much higher than the time it takes to search linearly through one page of data, this idea is definitely an improvement over the traditional heap.
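
As a rough sketch of the idea (not the original text's implementation), the sift-down below uses fanout S with a plain 0-based array layout in which the children of node i sit at positions S*i+1 .. S*i+S; the extra offset needed to make each sibling group start exactly on a page boundary is omitted.

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Minimal d-ary max-heap sift-down with fanout d = S (elements per page).
// If sibling groups are page-aligned, finding the maximal child costs one
// page access plus S comparisons, and a full sift-down touches about
// log_S(N) pages instead of about log2(N).
void sift_down(std::vector<long>& heap, std::size_t i, std::size_t S) {
    const std::size_t n = heap.size();
    for (;;) {
        const std::size_t first = S * i + 1;            // first child of node i
        if (first >= n) break;                          // node i is a leaf
        const std::size_t last = std::min(first + S, n);
        std::size_t best = first;                       // linear scan of the siblings
        for (std::size_t c = first + 1; c < last; ++c)
            if (heap[c] > heap[best]) best = c;
        if (heap[best] <= heap[i]) break;               // heap order restored
        std::swap(heap[i], heap[best]);
        i = best;                                       // continue one level down
    }
}

With this layout each level of the sift-down touches one page, so the number of page accesses per sift-down is roughly log_S N = log₂ N / log₂ S, which is where the factor log₂ S comes from.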

The drawback of the idea is twofold. Firstly, it only improves the external cost by a factor of log₂ S, not the desired factor of S; secondly, it increases the internal cost by a factor of S/log₂ S. Thus it actually makes sorting much slower in the situation where all data fits in memory. Considering that gigabyte main memories are now commonplace and that terabyte memories will eventually be affordable, it is reasonable to require good internal performance even of external methods.
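
To make the trade-off concrete for the page size used in the example above (S = 512), the two factors work out roughly as follows; this is only illustrative arithmetic.

#include <cmath>
#include <cstdio>

// Illustrative only: the two trade-off factors for S = 512 elements per page.
int main() {
    const double S = 512.0;
    std::printf("external cost improves by   log2(S)     = %.0f\n", std::log2(S));
    std::printf("desired improvement factor  S           = %.0f\n", S);
    std::printf("internal cost grows by      S / log2(S) ~ %.0f\n", S / std::log2(S));
}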
