Prism Refraction Search (PRS) - CUDA Implementation

A Novel Physics-Based Metaheuristic Algorithm

Springer, The Journal of Supercomputing, Published: 04 January 2024

Project Overview

This project implements the Prism Refraction Search (PRS) Algorithm using CUDA C to leverage parallel computation for improved performance. The PRS algorithm, inspired by the principles of light refraction through a prism, is a novel metaheuristic optimization method designed for solving global optimization problems.

This project was submitted as part of the final project for CS326 - High Performance Computing at NIT Surat in 2025.

Objectives

Implement the sequential PRS algorithm based on the original research paper.
Develop a parallel CUDA implementation to enhance computational efficiency.
Compare the performance of sequential and parallel implementations on benchmark optimization problems.

Features

Implementation of the sequential PRS algorithm in C.
Development of a parallelized CUDA version to accelerate computation.
Performance benchmarking on standard optimization problems.
Comparative analysis of execution time and scalability.

Installation and Usage

Clone the project repository and compile source files:

git clone https://github.com/xshthkr/prs-cuda.git
cd prs-cuda
make

Compare naive sequential and CUDA parallel execution:

make compare

CPU-based Prism Refraction Search

1:  Initialize population of N solutions

2:  Initialize prism angle A₀

3:  For t = 1 to MaxIter do

4:      For each solution i in 1 to N do
5:          Calculate deviation δₜ[i] = f(iₜ[i])     // Fitness
6:          If δₜ[i] < BestScore:
7:              BestScore = δₜ[i]
8:              BestSolution = iₜ[i]
9:          End if
10:     End for

11:     Calculate refractive index μₘ:
            μₘ = sin((A₀ + δₜ) / 2 ) / sin(A₀ / 2)   // Refractive index Eq.10

12:     For each solution i in 1 to N do
13:         For each dimension j in 1 to D do
14:             Eₜ[i][j] = δₜ[i] - iₜ[i][j] + Aₜ     // Emergent angle Eq.9
15:             r₁ = random number in [-1, 1]
16:             iₜ₊₁[i][j] = asin(...)               // Incidence angle Eq.11
17:             Ensure iₜ₊₁[i][j] is within bounds
18:         End for
19:     End for

20:     Update prism angle Aₜ₊₁:
            Aₜ₊₁ = Aₜ × ((alpha - t) / MaxIter)      // Prism angle Eq.12

21: End for

22: Return BestSolution, BestScore

Running this on the 4-dimensional Rastrigin function, which has many local minimas, using a population size of 1000, the cpu-based prism refraction search algorithm converged on the global minima in about 50 seconds.

Parameters:
  Dimension: 4
  Max Iterations: 6000
  Population size: 1000
  Alpha: 0.009000
Best solution:
  0.002293
  0.003727
  -0.004454
  0.007831
Best score: 0.019900
Elapsed time: 53.258303 seconds

The algorithm discovered a solution with a score of 0.0199 in 53.26 seconds. The accuracy is impressively high, at the cost of a high runtime.

Lets compare this result with the CUDA-based PRS Optimizer.

CUDA-based Prism Refraction Search

Kernels

prs_init_incident_angles_kernel - one thread per solution to initialize solution components with random numbers in (0,90)
prs_init_emergent_angles_kernel - one thread per solution to initialize components with 0.
prs_calculate_fitness_kernel - one thread per individual in population, fitness of every solution is added to an array.
prs_reduce_fitness_sum_block_kernel - block level reduction of fitness array to find block level fitness sums
prs_reduce_fitness_sum_final_kernel - global level reduction of block level sums to get a final global sum
prs_reduce_fitness_min_block_kernel - block level reduction of fitness to find minimum of block level fitnesses and its index in fitness array
prs_reduce_fitness_min_final_kernel - global level reduction of block level fitness minimums to get final global minimum and index
prs_copy_best_solution_kernel - one thread per dimension to copy best solution
prs_update_emergent_prs_update_incident_kernel - one thread per (individual, dimension) pair to calculate emergent angle components and update corresponding incident angle component

Memory Allocation on GPU and CPU

Data	Location	Notes
`incident_angles` (pop_size × dim)	GPU	Updated every iteration
`emergent_angles` (pop_size × dim)	GPU	Fully device-side
`solution` (dim)	GPU scratch per thread (or shared memory)	Used for converting angles to solutions
`best_solution` (dim)	Host + GPU	Tracked both sides; updated if a better one found
`best_score`	Host + GPU	Same as above
`lowerbound`, `upperbound` (dim)	Copied to GPU	Read-only constants for mapping angles to real values
`delta_t`	GPU (per iteration)	Global sum or average – use atomic or reduction
`prism_angle`	Host + GPU	Updated on host, copied to GPU (or done in kernel if needed)
`params`	CPU constant memory	Accessed frequently, rarely changes

Pseudocode

1:  Kernel to initialize population of N solutions

2:  Initialize prism angle A₀

3:  For t = 1 to MaxIter do

4:      Kernel to calculate delta for every solution

5:      Kernel to reduce delta to average delta_t

6:      Calculate refractive index μₘ:
            μₘ = sin((A₀ + δₜ) / 2 ) / sin(A₀ / 2)   // Refractive index Eq.10

7:      Kernel to calculate emergent angle           // Eq. 9
            and update incident angles               // Eq. 11

8:      Update prism angle Aₜ₊₁:
            Aₜ₊₁ = Aₜ × ((alpha - t) / MaxIter)      // Prism angle Eq.12

9:      Kernel to update BestSolution and BestScore

10: End for

11: Return BestSolution, BestScore

Running this on the same 4-dimensional Rastrigin function, we get the following results.

Number of CUDA devices: 1
Device 0: NVIDIA GeForce MX250
  Total global memory: 4230086656 bytes
  Multiprocessor count: 3
  Max threads per block: 1024
Parameters:
  Dimension: 4
  Max Iterations: 10000
  Population size: 1024
  Alpha: 0.009000
Best solution:
  0.009845
  0.001369
  0.013489
  0.000200
Best score: 0.055679
Elapsed time: 1.451018 seconds

The algorithm discovered a solution with a score of 0.055679 in 1.45 seconds. This is a 36.7 times speedup.

This CUDA-accelerated algorithm gave us a faster and more accurate result compared to the CPU-based implementation. This demonstrates the significant improvements that parallelized algorithms can achieve over their sequential counterparts.

Improvements and Considerations

The prism refraction search algorithm is mainly an exploitation algorithm. The only exploratory work done by the algorithm is when the angles are initialized with random numbers. This algorithm is most suitable to be used as a finer search algorithm to be used after narrowing down the search space with a more efficient exploration algorithm.

Although my CUDA implementation of this algorithm is incredibly faster, my implementation suffers with scalability issues. I am not sure whether this is becuase of the random number generator I have used within my kernels, but to have a higher exploration of the solution search space, the max_iterations needs to be really high.

There are several improvements that can be done in my implementation:

Better memory allocation for improved memory coalesence. Kernels are faster when the threads in a warp access sequential data. Since my array of solutions is a 2D array, I have to flatten it to store in the GPU. There is a memory allocation function for 3D data but the documentations and implementations I have read seem to indicate that in my case padding would be required which would make it inefficient.
Kernels can be designed more elegently. Nvidia's documentation and forums discuss the various techniques of reduction of the sums and minimums of arrays can be implemented with better efficiency.

Conclusion

This CUDA-accelerated algorithm produced significantly faster and more accurate results compared to the CPU-based implementation. In our tests, the GPU version achieved a 40x speedup while handling a larger population size and more iterations. It also found a better solution, demonstrating how parallelization not only boosts performance but can also enhance solution quality. These results underscore the benefits of leveraging GPU computing for population-based metaheuristic algorithms.

Requirements

Linux
Dependencies: GNU math library (math.h)
Nvidia GPU (cuda 6.1 or higher)

Name		Name	Last commit message	Last commit date
Latest commit History 74 Commits
docs		docs
include		include
src		src
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
run.sh		run.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Prism Refraction Search (PRS) - CUDA Implementation

Project Overview

Objectives

Features

Installation and Usage

CPU-based Prism Refraction Search

CUDA-based Prism Refraction Search

Kernels

Memory Allocation on GPU and CPU

Pseudocode

Improvements and Considerations

Conclusion

Requirements

References

About

Uh oh!

Releases

Packages

Languages

License

xshthkr/prs-cuda

Folders and files

Latest commit

History

Repository files navigation

Prism Refraction Search (PRS) - CUDA Implementation

Project Overview

Objectives

Features

Installation and Usage

CPU-based Prism Refraction Search

CUDA-based Prism Refraction Search

Kernels

Memory Allocation on GPU and CPU

Pseudocode

Improvements and Considerations

Conclusion

Requirements

References

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages