DI-UMONS : Dépôt institutionnel de l’université de Mons

Recherche transversale
(titres de publication, de périodique et noms de colloque inclus)
2017-05-01 - Article/Compte-rendu - Anglais - 12 page(s)

Escobar Fernando A., Kolar Anthony, Harb Naim , Vinci dos Santos Filipe, Valderrama Carlos , "Scalable shared-memory architecture to solve the Knapsack 0/1 problem" in Microprocessors & Microsystems, 50, 189-201, http://doi.org/10.1016/j.micpro.2017.04.001

  • Edition : Elsevier Science
  • Codes CREF : Electronique et électrotechnique (DI2411)
  • Unités de recherche UMONS : Electronique et Microélectronique (F109)

Abstract(s) :

(Anglais) Dynamic Programming (DP) is used to solve combinatorial optimization problems and constitutes one of the 13 High Performance Computing (HPC) patterns. DP suffers from irregular, data-dependent memory accesses that deteriorates performance. The Knapsack 0/1 belongs to the simplest DP algorithms which is called Serial Monadic and has been treated in software with cache-efficient algorithms as well as parallel threads, OpenMP or MPI. In this paper we propose a shared memory, parametrizable architecture to compute the DP matrix for the Knapsack 0/1. Our system has a parallel runtime of Θ(mC/q) for a knapsack of capacity C with m items and q operators. Using additional off-chip space and DMA transfers it can solve knapsacks of any size. The architecture is implemented on the ZYNQ-7020 System On Chip (SoC) that contains a dual-core ARM plus Artix FPGA fabric. Under such architecture we make use of 64-bit High Performance ports for off-chip transfers and asymmetric 32-bit write/64-bit read BRAMs to minimize data exchange times. We also exploit computation synchronization to minimize BRAM address propagation and reduce routing congestion. We present results for a base system with 70 Processing Elements (PEs) capable of solving problems with a maximum item weight ωmax=1024ωmax=1024. For more complex instances we configure the architecture with 58 PEs and ωmax=6144,ωmax=6144, where a single BRAM is shared among 13 computing units. We thus solve problems with 6 × bigger weights than previous works, attain a 16 × speed-up versus an optimized software on an Intel Xeon E5 and get the highest efficiency per core versus other architectures. We achieve between View the MathML source2.4−3.3× acceleration versus previous FPGA solutions.

Mots-clés :
  • (Anglais) Shared memory
  • (Anglais) Knapsack 01
  • (Anglais) SoC
  • (Anglais) Dynamic programming
  • (Anglais) FPGA