eesti teaduste
akadeemia kirjastus
SINCE 1952
Proceeding cover
of the estonian academy of sciences
ISSN 1736-7530 (Electronic)
ISSN 1736-6046 (Print)
Impact Factor (2020): 1.045

Fast iterative circuits and RAM-based mergers to accelerate data sort in software/hardware systems; pp. 323–335

Full article in PDF format |

Valery Sklyarov, Iouliia Skliarova, Artjom Rjabov, Alexander Sudnitson


The paper suggests and describes two architectures for parallel data sort. The first architecture is applicable to large data sets and it combines three stages of data processing: data sorting in hardware (in a Field-Programmable Gate Arrays – FPGA), merging preliminary sorted blocks in hardware (in the FPGA), and merging large subsets received from the FPGA in general-purpose software. Dataexchange between the FPGA anda general-purpose computeris organized throughafast Peripheral Component Interconnect (PCI) expressbus.The second architectureis applicabletosmalldatasetsandit enablessortingtobedoneatthetimeofdata acquisition, soon as the last data item is received, the sorted items can be transferred immediately. The results of experiments clearly demonstrate the advantages of the proposed architectures that permit the reduction of the required hardware resources and increasing throughput compared to the results reported in publications and software functions targeted to data sorting.


1. Knuth, D. E. The Art of Computer Programming. Sorting and Searching, Vol. III. Addison-Wesley, 2011.

2. Sklyarov, V. and Skliarova, I. High-performance implementation of regular and easily scalable sorting networks on an FPGA. Microprocess. Microsyst., 2014, 38(5), 470–484.

3. Mueller, R., Teubner, J., and Alonso, G. Sorting networks on FPGAs. Int. J. Very Large Data Bases, 2012, 21(1), 1–23.

4. Ortiz, J. and Andrews, D. A configurable high-throughput linear sorter system. In Proceedings of the 2010 IEEE International Symposium on Parallel & Distributed Processing. IEEE, 2010, 1–8.

5. Zuluaga, M., Milder, P., and Puschel, M. Computer generation of streaming sorting networks. In Proceedings of the 49th Design Automation Conference. ACM, New York, 2012, 1245–1253.

6. Singh, S. and Greaves, D. J. Kiwi: synthesis of FPGA circuits from parallel programs. In Proceedings of the 16th IEEE International Symposium on Field-Programmable Custom Computing Machines. IEEE, 2008, 3–12.

7. Che, S., Li, J., Sheaffer, J. W., Skadron, K., and Lach, J. Accelerating compute-intensive applications with GPUs and FPGAs. In Proceedings of the 2008 Symposium on Application Specific Processors. IEEE, 2008, 101–107.

8. Chamberlain, R. D. and Ganesan, N. Sorting on architecturally diverse computer systems. In Proceedings of the 3rd International Workshop on High-Performance Reconfigurable Computing Technology and Applications. ACM, New York, 2009, 39–46.

9. Mueller, R. Data Stream Processing on Embedded Devices. Ph.D. thesis, ETH, Zurich, 2010.

10. Koch, D. and Torresen, J. FPGASort: a high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, New York, 2011, 45–54.

11. Sklyarov, V., Skliarova, I., Mihhailov, D., and Sudnitson, A. Implementation in FPGA of address-based data sorting. In Proceedings of the 21st International Conference on Field-Programmable Logic and Applications. IEEE, 2011, 405–410.

12. Kipfer, P. and Westermann, R. GPU Gems 2, Improved GPU Sorting. gpugems2_chapter46.html, 2005 (accessed 08.06.2016).

13. Gapannini, G., Silvestri, F., and Baraglia, R. Sorting on GPU for large scale datasets: a thorough comparison. Inf. Process. Manage, 2012, 48(5), 903–917.

14. Ye, X., Fan, D., Lin, W., Yuan, N., and Ienne, P. High performance comparison-based sorting algorithm on many-core GPUs. In Proceedings of the 2010 IEEE International Symposium on Parallel & Distributed Processing. IEEE, 2010, 1–10.

15. Satish, N., Harris, M., and Garland, M. Designing efficient sorting algorithms for manycore GPUs. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing. IEEE, 2009, 1–10.

16. Cederman, D. and Tsigas, P. A practical quicksort algorithm for graphics processors. In Proceedings of the 16th Annual European Symposium on Algorithms. Springer-Verlag, Berlin, Heidelberg, 2008, 246–258.

17. Grozea, C., Bankovic, Z., and Laskov, P. FPGA vs. multi-core CPUs vs. GPUs: hands-on experience with a sorting application. In Facing the Multicore-Challenge (Keller, R., Kramer, D., and Weiss, J. P., eds). Springer-Verlag, Berlin, Heidelberg, 2010, 105–117.

18. Edahiro, M. Parallelizing fundamental algorithms such as sorting on multi-core processors for EDA acceleration. In Proceedings of the 2009 Asia and South Pacific Design Automation Conference. IEEE, 2009, 230–233.

19. Aj-Haj Baddar, S. W. and Batcher, K. E. Designing Sorting Networks. A New Paradigm. Springer, 2011.

20. Marcelino, R., Neto, H. C., and Cardoso, J. M. P. A comparison of three representative hardware sorting units. In Proceedings of the 35th Annual IEEE Conference on Industrial Electronics. IEEE, 2009, 2805–2810.

21. Batcher, K. E. Sorting networks and their applications. In Proceedings of the AFIPS Spring Joint Computer Conference. ACM, New York, 1968, 307–314.

22. Lacey, S. and Box, R. A fast, easy sort: a novel enhancement makes a bubble sort into one of the fastest sorting routines. Byte, 1991, 16(4), 315–320.

23. Sklyarov, V. and Skliarova, I. Fast regular circuits for network-based parallel data processing. Adv. Electr. Comput. Eng., 2013, 13(4), 47–50.

24. Xilinx, Inc. Zynq-7000 all programmable SoC. Technical Reference Manual. , 2016 (accessed 01.02.2017).

25. Xilinx, Inc. VC707 Evaluation Board for the Virtex-7 FPGA User Guide. , 2016 (accessed 01.02.2017).

26. Digilent, Inc. Nexys4 DDR FPGA Board Reference Manual. nexys4ddr_rm.pdf, 2016 (accessed 08.06.2016).

Back to Issue