DI-UMONS : Dépôt institutionnel de l’université de Mons

Recherche transversale
(titres de publication, de périodique et noms de colloque inclus)
2017-04-14 - Article/Dans un journal avec peer-review - Anglais - 13 page(s)

Escobar Fernando A., Kolar Anthony, Harb Naim , Vinci dos Santos Filipe, Valderrama Carlos , "Scalable shared-memory architecture to solve the Knapsack 0/1 problem" in Microprocessors & Microsystems, 50, 2017, 189-201, 10.1016/j.micpro.2017.04.001

  • Edition : Elsevier Science
  • Codes CREF : Sciences de l'ingénieur (DI2000), Techniques d'imagerie et traitement d'images (DI2770), Technologies de l'information et de la communication (TIC) (DI4730), Semi-conducteurs (DI2512), Electronique et électrotechnique (DI2411), Instrumentation médicale (DI2760), Conception assistée par ordinateur (DI1247), Electronique générale (DI2510), Electricité (DI1230)
  • Unités de recherche UMONS : Electronique et Microélectronique (F109)
  • Instituts UMONS : Institut de Recherche en Technologies de l’Information et Sciences de l’Informatique (InforTech), Institut NUMEDIART pour les Technologies des Arts Numériques (Numédiart)
  • Centres UMONS : Centre de Recherche en Technologie de l’Information (CRTI)
Texte intégral :

Abstract(s) :

(Anglais) Dynamic Programming (DP) is used to solve combinatorial optimization problems and constitutes one of the 13 High Performance Computing (HPC) patterns. DP suffers from irregular, data-dependent memory accesses that deteriorates performance. The Knapsack 0/1 belongs to the simplest DP algorithms which is called Serial Monadic and has been treated in software with cache-efficient algorithms as well as parallel threads, OpenMP or MPI. In this paper we propose a shared memory, parametrizable architecture to compute the DP matrix for the Knapsack 0/1. Our system has a parallel runtime of O( mC / q ) for a knapsack of capacity C with m items and q operators. Using additional off-chip space and DMA transfers it can solve knapsacks of any size. The architecture is implemented on the ZYNQ-7020 System On Chip (SoC) that contains a dual- core ARM plus Artix FPGA fabric. Under such architecture we make use of 64-bit High Performance ports for off-chip transfers and asymmetric 32-bit write/64-bit read BRAMs to minimize data exchange times. We also exploit computation synchronization to minimize BRAM address propagation and reduce routing congestion. We present results for a base system with 70 Processing Elements (PEs) capable of solving problems with a maximum item weight ω max = 1024 . For more complex instances we configure the ar- chitecture with 58 PEs and ω max = 6144 , where a single BRAM is shared among 13 computing units. We thus solve problems with 6 ×bigger weights than previous works, attain a 16 ×speed-up versus an op- timized software on an Intel Xeon E5 and get the highest efficiency per core versus other architectures. We achieve between 2 . 4 −3 . 3 ×acceleration versus previous FPGA solutions.

Identifiants :
  • DOI : 10.1016/j.micpro.2017.04.001

Mots-clés :
  • (Anglais) reconfigurable
  • (Anglais) Big Data
  • (Anglais) High Performance Computing
  • (Anglais) FPGA