Wednesday 12 November 2014

Research Capstone: Parallel Processing

Parallel Processing - Week 14

This week's implementation work focused on preparing the project for testing and performance measurement. The following tables show the performance of the software with different variables applied. The values below are the average processing time over 100 frames.

A* Pathfinding

Number of A* Agents    Sequential    Thread Pool
  0                    0.000122      0.000121
  2                    0.000124      0.000125
  5                    0.00013       0.000129
 10                    0.00014       0.000139
 20                    0.000159      0.000158
 50                    0.000218      0.000217
100                    0.000381      0.000385
The table above shows the performance of the software running a number of A* pathfinding calculations. The following graph is a visual representation of this (Sequential in red, Parallel in blue):
[Image: astar]
Grass Avoidance
The table below shows the performance of the grass avoidance system I discussed last week, with various grass dimensions (200 represents a grid of 200 x 200 = 40,000 grass blades).
Grass Dimension    Sequential    Thread Pool    OpenCL
 10                0.000172      0.000178       0.003674
 25                0.000236      0.000192       0.00396
 50                0.000471      0.000228       0.005042
100                0.001383      0.000365       0.009548
200                0.005084      0.000861       0.02056
300                0.011559      0.002045       0.040698
400                0.02078       0.005295       0.068533
500                0.033063      0.009371       0.107917
[Image: grass]
The chart above matched expectations, with a thread pool of 8 threads performing the calculation at a faster rate than a single thread. However, the chart below, which adds OpenCL (shown in yellow), was completely unexpected: OpenCL is slower than the other processing methods at every stage. I had expected it to start out slower, then gradually catch up with and overtake the other methods.
[Image: grassgpu]
Note: the calculation used for the OpenCL measurements is not yet finished. It is lacking functionality, so the final measurements should in fact be even slower than this, unless I'm doing something REALLY wrong. Sending the data in image format, rather than as a large array, may have been more beneficial for OpenCL.

Parallel Processing - Week 13

This week I began adding a more expensive pathfinding system using the A* algorithm. It is currently working well with a search depth of 10 on a single CPU thread. Higher search depths or larger numbers of AI agents can cause the program to stutter, and the frame rate can become quite irregular at times.
This A* implementation only calculates the next best waypoint towards the final goal when the current waypoint has been reached.
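A minimal sketch of that idea in C++ (the grid layout, unit step costs, Manhattan heuristic and all names are my own assumptions, not the project's code): the full A* search still runs, but the caller only consumes the first step of the resulting path each time a waypoint is reached.

```cpp
#include <climits>
#include <cstdlib>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical sketch of the "next waypoint only" idea: run A* over a small
// grid, then hand back just the first step of the path towards the goal.
struct Node { int x, y; };

static int manhattan(Node a, Node b) {
    return std::abs(a.x - b.x) + std::abs(a.y - b.y);
}

// Returns the next cell to move to, or 'start' itself if no path exists.
Node nextWaypoint(const std::vector<std::vector<int>>& blocked,
                  Node start, Node goal) {
    const int w = (int)blocked[0].size(), h = (int)blocked.size();
    auto idx = [w](Node n) { return n.y * w + n.x; };
    std::vector<int> g(w * h, INT_MAX), parent(w * h, -1);
    using QItem = std::pair<int, int>;                 // (f score, flat index)
    std::priority_queue<QItem, std::vector<QItem>, std::greater<>> open;
    g[idx(start)] = 0;
    open.push({manhattan(start, goal), idx(start)});
    while (!open.empty()) {
        auto [f, cur] = open.top(); open.pop(); (void)f;
        Node n{cur % w, cur / w};
        if (n.x == goal.x && n.y == goal.y) break;     // goal reached
        const int dx[] = {1, -1, 0, 0}, dy[] = {0, 0, 1, -1};
        for (int d = 0; d < 4; ++d) {
            Node m{n.x + dx[d], n.y + dy[d]};
            if (m.x < 0 || m.y < 0 || m.x >= w || m.y >= h) continue;
            if (blocked[m.y][m.x]) continue;           // red node: obstacle
            int ng = g[cur] + 1;                       // unit step cost
            if (ng < g[idx(m)]) {
                g[idx(m)] = ng;
                parent[idx(m)] = cur;
                open.push({ng + manhattan(m, goal), idx(m)});
            }
        }
    }
    if (parent[idx(goal)] == -1 && idx(start) != idx(goal))
        return start;                                  // goal unreachable
    // Walk back from the goal until the node just after 'start'.
    int cur = idx(goal);
    while (parent[cur] != -1 && parent[cur] != idx(start))
        cur = parent[cur];
    return {cur % w, cur / w};
}
```

Recomputing the search per waypoint costs more CPU time overall, but it means the path stays correct when the goal (the cursor, in the demo) keeps moving.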
*VIDEO REMOVED*
In the video demo above I used only one AI agent, which follows the cursor around the map, using the blue and red grid on the ground to navigate around obstacles. The blue nodes represent paths that the AI can travel on, while red nodes are blocked by obstacles (trees in this case).
The next step will be to fix some of the lag spikes which occur (see performance graph) and then move this implementation to OpenCL and the thread pool.

Parallel Processing - Week 12

The subject of this research paper has changed slightly from previous weeks and now focuses entirely on parallel processing. Distributed processing over the network will be implemented as part of the Game Development Capstone but will no longer be part of this research paper, leaving more time for parallel processing research.
Since last week, I have added a visual graph which outputs a value over time. The graph below plots the processing time of a frame, sampled every 0.02 seconds. This gives me a very good visual representation of the difference in average performance between processing methods.
[Image: profiler]
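A minimal sketch of the sampling that could sit behind such a graph (the class and member names, and the fixed 0.02 s default interval, are my assumptions, not the project's code): frame times accumulate each frame, and one sample is written into a ring buffer whenever the interval elapses; the stored history is what gets drawn.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical frame-time sampler: records at most one sample per interval
// into a fixed-size ring buffer, which a renderer could draw as a graph.
class FrameTimeGraph {
public:
    explicit FrameTimeGraph(std::size_t capacity = 256)
        : samples_(capacity, 0.0) {}

    // Called once per frame with the frame's elapsed time in seconds; only
    // records a sample when at least 'interval' seconds have accumulated.
    void tick(double frameTime, double interval = 0.02) {
        accumulated_ += frameTime;
        lastFrame_ = frameTime;
        if (accumulated_ >= interval) {
            samples_[head_] = lastFrame_;
            head_ = (head_ + 1) % samples_.size();
            if (count_ < samples_.size()) ++count_;
            accumulated_ = 0.0;
        }
    }

    // Average over the recorded samples (what the on-screen text could show).
    double average() const {
        if (count_ == 0) return 0.0;
        double sum = 0.0;
        for (std::size_t i = 0; i < count_; ++i) sum += samples_[i];
        return sum / count_;
    }
    std::size_t sampleCount() const { return count_; }

private:
    std::vector<double> samples_;
    std::size_t head_ = 0, count_ = 0;
    double accumulated_ = 0.0, lastFrame_ = 0.0;
};
```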
As processing the random-waypoint AI using OpenCL turned out to be MORE expensive than simply calculating it on the CPU, I decided to add a large bulk calculation that causes a significant drop in CPU performance. I ported an old grass shader to DirectX 11 for this, as I remembered that performing collision detection on the grass used to cause a significant performance loss.
[Image: grass]
The graph below demonstrates the difference between the time taken to process different parts of the program. It is evident that the grass collision detection is once again causing a significant performance loss.
[Image: performance change]
Once this is calculated using OpenCL, comparing its performance against a more complicated AI algorithm in OpenCL may provide an interesting insight into the best uses of OpenCL, rather than basing this research paper entirely on AI algorithms.

Parallel Processing - Week 11

As DirectX 11 removed the font interface that DirectX 10 provided, I created a font renderer to display the program's performance, using the geometry shader and a texture file containing all of the letters in the font (see image below). I would like to upgrade this to a graph of some sort to better demonstrate the average frame rate, but this might have to wait until all of the other features have been implemented.
[Image: font]
The random waypoint AI has now been implemented in OpenCL and is working as intended.
[Image: opencl]
While the still image above can't demonstrate that the OpenCL processing is working, the on-screen text does show that it's active.
From here I will do the same with the distributed processing across the network, based on the WinSock networking implementation from my old summative. After this I will add a more complex AI algorithm, such as A*, which can easily be scaled up (by increasing the navigation grid size) to increase the program's processing times and demonstrate the performance gain or loss of each processing method at different grid sizes.

Parallel Processing - Week 10

Not much progress was made this week on the technical implementation of this research. My thread locks are still not placed properly, and some threads are still stealing jobs from other threads before the jobs can be removed from the job list (see below).
[Image: threadJobs]
This thread's job no longer exists, but the thread still tries to process it.
I have begun reading the book "Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers" by Barry Wilkinson and Michael Allen (2004) to gain a better understanding of other ways these techniques may be implemented, as well as other topics such as load balancing and message-passing methods.

Parallel Processing - Week 08

As the holidays were spent working on personal projects, I did not progress with this research capstone during that period.
The purpose of last week was to implement some of the techniques I have been researching over the last few weeks; in doing this I can find flaws in my design and research improvements in those areas. I began by experimenting with my old threading framework, which used the "WinThread" libraries. After some time I decided to switch to C++11's thread library and look into different methods of implementing thread pools, as mentioned in previous posts.
[Image: thread pool]
This thread pool design was mainly inspired by Intel's "Don't Dread Threads" video series, which discussed using "Data Decomposition" to break parts of the program into smaller tasks which can be added to the thread pool's job list (the vector of std::functions shown in the image above). Once created, the worker threads call the Run() function and are placed in stasis by a mutex until new jobs are added to the job list. At that point a thread will find a job, remove it from the list and process it, before finding a new job or returning to an idle state if no more jobs remain in the job list.
This is currently working really well: the design is easily scalable and works as intended for any number of threads, even just a single thread. The "Functional Decomposition" method could be implemented alongside this, adding the processing of specific components (such as sound or loading screens) to the job list. I have moved this system to my basic random-waypoint chicken framework. The parallel job list is working very well; however, threads are currently stealing jobs from other threads before they can remove their job from the list, as I am not properly locking the job list. This causes the system to crash, as the threads are left with invalid data which cannot be processed.
I have also begun adding my previous OpenCL tests to this framework.
[Image: testKernell]
Each Processing Element (similar to a thread, but on the GPU) within a Compute Unit (a collection of Processing Elements) is assigned a unique ID each time a kernel is run, and this ID is used as the index into each array the kernel accesses. For now this is just a simple function adding two arrays together, but... next week! Great things will happen, and this should also feature some AI functionality.
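As a sketch of what such a kernel can look like (the kernel name and argument list are my assumptions, not the test code in the image), here is the OpenCL C source held in a C++ string, plus a CPU reference of the same per-work-item body that can be used to check the kernel's output later:

```cpp
#include <cstddef>
#include <vector>

// The add-two-arrays kernel described above, as OpenCL C source held in a
// C++ string (as it would be before being compiled by the OpenCL runtime).
// get_global_id(0) gives each Processing Element its unique index, which
// doubles as the array index: there is no loop in the kernel itself.
static const char* kAddKernelSource = R"CLC(
__kernel void add(__global const float* a,
                  __global const float* b,
                  __global float* result)
{
    int i = get_global_id(0);      // one work-item per element
    result[i] = a[i] + b[i];
}
)CLC";

// CPU reference of the same per-work-item body (my addition, not part of
// the original framework), useful for validating the OpenCL results.
std::vector<float> addReference(const std::vector<float>& a,
                                const std::vector<float>& b) {
    std::vector<float> result(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)  // i plays get_global_id(0)
        result[i] = a[i] + b[i];
    return result;
}
```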

Parallel Processing - Week 06

This week was spent creating a basic framework for testing the implementations of this research.
This framework is made from a personal project which has been stripped down to feature just 20 chickens walking around a 10 x 10 grid. Each chicken is given a light: blue lights represent chickens that are currently in an idle state, while white lights are walking chickens.
[Image: chickens]
This will provide me with a visual representation of the AI for future use. I will begin moving this basic functionality to the GPU, then improve it from there.

Parallel Processing - Week 05

This week was spent researching the use of parallel processing in Artificial Intelligence as this will be my means of demonstrating the different techniques.
The most informative explanation I found for implementing AI with OpenCL was to build the artificial intelligence algorithms to be as close as possible to graphics algorithms: like writing shaders, but for AI. This ensures the graphics card is treated in the fashion it was designed for, avoiding branching. According to Alex J. Champandard, a GPU compute unit "must follow the “worst case” branch if only one other compute unit takes that path", so while branching may be more efficient for CPU calculations, this is not necessarily the case on the GPU.
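A small CPU-side illustration of that advice (entirely my own example, in C++ rather than OpenCL): the branchy clamp takes a different path per element, while the branchless version does uniform work per element via min/max, which is the shape that maps well onto GPU work-items executing in lock-step.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Branchy clamp: each element may take a different path, so a GPU would
// have to execute the "worst case" branch for the whole group.
std::vector<float> clampBranchy(const std::vector<float>& v,
                                float lo, float hi) {
    std::vector<float> out(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] < lo)      out[i] = lo;
        else if (v[i] > hi) out[i] = hi;
        else                out[i] = v[i];
    }
    return out;
}

// Branchless clamp: identical result via min/max, uniform work per element.
std::vector<float> clampBranchless(const std::vector<float>& v,
                                   float lo, float hi) {
    std::vector<float> out(v.size());
    for (std::size_t i = 0; i < v.size(); ++i)
        out[i] = std::min(std::max(v[i], lo), hi);
    return out;
}
```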
Data representations should also be treated in a similar fashion to shaders, as the transfer of data to and from the GPU may become a bottleneck in the system regardless of how efficiently OpenCL performs the calculations. These representations include storing data in images or matrices in order to make the most of the functions supplied with OpenCL, such as read_imagef.

Parallel Processing - Week 04

This week was spent researching what areas parallel and distributed processing are commonly used in. Intel's video series "Don’t Dread Threads" was very informative in terms of designing a scalable thread pool for different computer systems.
In "Don't Dread Threads" they state that "Functional Decomposition" (running separate threads for networking, AI and sound) should not be used: it introduces latency, load balances poorly and does not scale across varying core counts.
Instead, use "Data Decomposition": break groups of tasks (such as AI or networking) into smaller tasks, add these to a job list, and let threads in a pool grab tasks whenever they are free. Jobs are given "high penalties" for blocking. This method is easier to load balance and, because the thread pool can be of any size, it scales to any number of cores, even a single thread in a pool.
Intel has a range of software for analysing which parts of code are better suited to a multithreaded approach. The suggested tools are Intel VTune, Intel Parallel Amplifier (which displays "hotspots" that are ideal candidates for parallelism) and Intel Graphics Performance Analyzers (which give a complete overview of system resources: CPU, GPU and bus).

Parallel Processing - Week 03

This week has mainly been spent becoming familiar with the CUDA SDK and OpenCL with the AMD APP (Accelerated Parallel Processing) SDK in order to understand the workflow and what kinds of processes could be offloaded to the GPU.
A great series of tutorials can be found here: https://www.udacity.com/course/cs344 and here: https://www.coursera.org/course/hetero. Both are free online courses with a wide range of exercises.
Next week I will research different design patterns and immutable objects that are used in the games industry.

Parallel Processing - Week 02

Since writing my proposal, the focus of this capstone has slightly shifted to researching Parallel and Distributed Processing in general. While I will still be comparing different parallel processing approaches, such as CUDA, OpenCL and Multi-threading, I also want to compare this to distributing some more expensive calculations across networked clients. This was partially inspired by our networking assignment from the previous year.
[Image: net4]
I was performing all calculations server-side, meaning each client just had to render the scene. The server's calculations were multithreaded (one thread per client), and this worked really well for 2 - 10 players; however, any more caused the performance of the server to drop significantly. Distributing these calculations across the clients would have solved this problem and would definitely benefit our Titan group project.

Parallel Processing - Week 01

The first week of this paper has been spent deciding upon a topic for this capstone project. Ideally this would tie in with our large group project and possibly even Liwaa's Capstone paper. Below are some ideas I was throwing around:
  • Multithreaded AI system (possibly using CUDA / OpenCL)
  • AI for real-world applications, such as simulating a fire escape, traffic or creating aerodynamic shapes using Genetic Algorithms
  • A Deferred Renderer
  • Component system with prefabs, similar to Unity. Would tie in well with a level editor
  • Creating a dynamic navmesh or flow-field navigation, which ties in well with the Titan project.
After a discussion with Liwaa and Tece on Monday, I've decided to go with researching designs for multi-threaded systems. This can be tested by researching various AI implementations from Liwaa's capstone project and can be expanded upon by comparing the performance of multi-threading versus other parallel methods, such as CUDA, OpenCL or even passing specific calculations to other network clients.