Comparative Analysis and Optimization of MD5 and MD6 Hashing Algorithm Implementations in C++, OpenCL, and Verilog



**Prepared by:** Caide Marc Spriestersbach David Samuel Young Dylan Thomas Trowsdale

## Prepared for:

Dr. Simon Winberg EEE4120F: High Performance Embedded Systems Department of Electrical Engineering University of Cape Town

# Declaration

- 1. I know that plagiarism is wrong. Plagiarism is to use another's work and pretend that it is one's own.
- 2. I have used the IEEE convention for citation and referencing. Each contribution to, and quotation in, this report from the work(s) of other people has been attributed, and has been cited and referenced.
- 3. This report is my own work.
- 4. I have not allowed, and will not allow, anyone to copy my work with the intention of passing it off as their own work or part thereof.

Provsdale

Dylan Trowsdale

Caide Spriestersbach

David Young

May 19, 2024

Date

May 19, 2024

Date

May 19, 2024

Date

# Abstract

This project aims to compare the performance and efficiency of MD5 and MD6 hashing algorithm implementations across different programming languages. The languages to be tested are Verilog and C++, with an OpenCL implementation to test parallelisation. The focus is on evaluating execution time across these implementations, with C++ MD5 serving as the golden standard.

The Message Digest 5 (MD5) algorithm is primarily used for verifying the integrity of large file contents to prevent unintentional corruption. However, MD5 cannot be efficiently parallelised due to its sequential structure. In contrast, the MD6 algorithm, based on a Merkle tree-like structure, is highly parallelisable and offers potential performance improvements. Our project will implement both MD5 and MD6 algorithms in software, with sequential and parallelised versions.

The software solution will implement the MD5 hashing algorithm sequentially in C++, providing a baseline for performance comparison. By utilising OpenCL, we will investigate the effects of using the GPU rather than the CPU for MD5. OpenCL also allows us to leverage its high-level parallel framework to overcome the challenges posed by MD5's sequential structure and to harness the parallelism inherent in MD6.

We aim to explore the possible capabilities of a Nexys A7 FPGA development board to implement the MD5 algorithm in dedicated hardware using Verilog. By offloading computation to dedicated hardware, we aim to achieve performance improvements when compared to the software implementations.

Our project goals include the successful implementation of all MD5 and MD6 algorithm implementations, performance evaluation across software implementations, ensuring cross-platform compatibility, verifying accuracy and robustness, and providing comprehensive documentation to facilitate learning and skill development among team members.

This project offers insights into the trade-offs between software and hardware implementations of MD5 and MD6 hashing algorithms, highlighting the potential benefits of parallelisation and dedicated hardware solutions.

# Contents

| Li | st of       | Figures                                                                           | 1        |  |  |
|----|-------------|-----------------------------------------------------------------------------------|----------|--|--|
| 1  | Intr        | roduction                                                                         | <b>2</b> |  |  |
| 2  | Bac         | ckground                                                                          | 4        |  |  |
| 3  | Methodology |                                                                                   |          |  |  |
|    | 3.1         | System Overview                                                                   | 6        |  |  |
|    | 3.2         | Software Implementation                                                           | 6        |  |  |
|    |             | $3.2.1  C++ MD5  \dots  \dots  \dots  \dots  \dots  \dots  \dots  \dots  \dots  $ | 6        |  |  |
|    |             | 3.2.2 Verilog MD5                                                                 | 7        |  |  |
|    |             | 3.2.3 OpenCL MD5 Implementation                                                   | 7        |  |  |
|    |             | 3.2.4 Sequential MD6 C++ $\ldots$                                                 | 8        |  |  |
|    |             | 3.2.5 Parallel MD6 C++                                                            | 9        |  |  |
|    | 3.3         | Hardware Implementation                                                           | 10       |  |  |
|    | 3.4         | Performance Evaluation:                                                           | 10       |  |  |
| 4  | Des         | sign                                                                              | 12       |  |  |
|    | 4.1         | Overview                                                                          | 12       |  |  |
|    | 4.2         | MD5 Algorithm Flowchart                                                           | 12       |  |  |
|    | 4.3         | MD5 in Verilog                                                                    | 14       |  |  |
|    |             | 4.3.1 Module Definition                                                           | 14       |  |  |
|    | 4.4         | MD5 in OpenCL                                                                     | 16       |  |  |
|    |             | 4.4.1 Libraries Used                                                              | 16       |  |  |
|    |             | 4.4.2 OpenCL Resources Management                                                 | 16       |  |  |
|    | 4.5         | OpenCL Kernel                                                                     | 18       |  |  |
|    | 4.6         | MD5 in C++ $\ldots$                                                               | 20       |  |  |
|    |             | 4.6.1 Libraries Used                                                              | 20       |  |  |
|    |             | 4.6.2 MD5 Calculation                                                             | 21       |  |  |
|    | 4.7         | MD6 Implementations                                                               | 23       |  |  |
|    |             | 4.7.1 Libraries Used                                                              | 23       |  |  |
|    |             | 4.7.2 MD6 Sequential Implementation                                               | 24       |  |  |
|    |             | 4.7.3 MD6 Parallelised Implementation                                             | 24       |  |  |
|    | 4.8         | Integration and Host Connectivity                                                 | 24       |  |  |
|    |             | 4.8.1 Host Interface Design                                                       | 24       |  |  |
|    |             | 4.8.2 Data Transfer and Synchronisation                                           | 24       |  |  |

| 5  | Pro             | posed Development Strategy                                                                                                                                                   | <b>25</b> |  |  |  |
|----|-----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------|--|--|--|
| 6  | Pla             | nned Experimentation                                                                                                                                                         | 26        |  |  |  |
|    | 6.1             | Verification and Test Vector                                                                                                                                                 | 26        |  |  |  |
|    | 6.2             | Execution Time, Speedup, and Throughput                                                                                                                                      | 26        |  |  |  |
|    | 6.3             | Experimental Setup                                                                                                                                                           | 26        |  |  |  |
|    | 6.4             | Performance Metrics                                                                                                                                                          | 28        |  |  |  |
|    | 6.5             | Input Size Constraints                                                                                                                                                       | 28        |  |  |  |
|    | 6.6             | Implementations for Testing                                                                                                                                                  | 28        |  |  |  |
| 7  | $\mathbf{Res}$  | ults                                                                                                                                                                         | 29        |  |  |  |
|    | 7.1             | MD5 Implementations                                                                                                                                                          | 29        |  |  |  |
|    |                 | 7.1.1 C++ Implementation $\ldots \ldots \ldots$              | 29        |  |  |  |
|    |                 | 7.1.2 C++ OpenCL Implementation                                                                                                                                              | 30        |  |  |  |
|    |                 | 7.1.3 Verilog Implementation                                                                                                                                                 | 32        |  |  |  |
|    | 7.2             | MD6 Implementations                                                                                                                                                          | 32        |  |  |  |
|    |                 | 7.2.1 Sequential C++ Implementation $\ldots \ldots \ldots$   | 32        |  |  |  |
|    |                 | 7.2.2 Parallelised C++ Implementation $\ldots \ldots \ldots$ | 34        |  |  |  |
|    | 7.3             | Comparison of Implementations                                                                                                                                                | 34        |  |  |  |
|    | 7.4             | Speedup Analysis                                                                                                                                                             | 37        |  |  |  |
|    |                 | 7.4.1 Speedup at 0 Byte Message Size                                                                                                                                         | 37        |  |  |  |
|    |                 | 7.4.2 Speedup at 7.92 MB Message Size                                                                                                                                        | 37        |  |  |  |
|    |                 | 7.4.3 Discussion                                                                                                                                                             | 37        |  |  |  |
| 8  | Acc             | eptance Test Procedure                                                                                                                                                       | 39        |  |  |  |
|    | 8.1             | Acceptance Tests                                                                                                                                                             | 39        |  |  |  |
|    | 8.2             | Acceptance Test Procedure                                                                                                                                                    | 39        |  |  |  |
| 9  | Con             | nclusions                                                                                                                                                                    | <b>42</b> |  |  |  |
| Bi | Bibliography 44 |                                                                                                                                                                              |           |  |  |  |

# List of Figures

| 4.1  | Flowchart of the MD5 Algorithm 13                                                                 |
|------|---------------------------------------------------------------------------------------------------|
| 7.1  | Execution Time of C++ MD5 Implementation                                                          |
| 7.2  | Throughput of C++ MD5 Implementation $\ldots \ldots \ldots \ldots \ldots \ldots \ldots \ldots 31$ |
| 7.3  | Execution Time of C++ OpenCL MD5 Implementation                                                   |
| 7.4  | Throughput of C++ OpenCL MD5 Implementation                                                       |
| 7.5  | Execution Time of Sequential C++ MD6 Implementation                                               |
| 7.6  | Throughput of Sequential C++ MD6 Implementation                                                   |
| 7.7  | Execution Time of Parallelised C++ MD6 Implementation                                             |
| 7.8  | Throughput of Parallelised C++ MD6 Implementation                                                 |
| 7.9  | Comparison of Execution Time Across Implementations                                               |
| 7.10 | Comparison of Throughput Across Implementations                                                   |
| 8.1  | Verilog MD5 accuracy test                                                                         |
| 8.2  | Sequential MD5 accuracy test                                                                      |
| 8.3  | Parallel MD5 accuracy test 41                                                                     |
| 8.4  | MD6 Accuracy and variable length input test 41                                                    |

# Introduction

'Do, or do not... there is no try.'

-Yoda, The Empire Strikes Back

This report investigates the implementation of Message Digest 5 (MD5) and Message Digest 6 (MD6) using various programming languages and physical hardware. The project encompasses software implementations of MD5 in C++, Verilog, and OpenCL, along with both sequential and parallel implementations of MD6 in C++ to demonstrate the benefits of parallelisation. The hardware aspect focuses on implementing the MD5 algorithm on a Nexys A7 development board, utilising a Hardware Description Language (HDL) to program the Field Programmable Gate Array (FPGA) on the board. This approach aims to evaluate the performance gains achievable through dedicated hardware compared to software-only solutions. Performance metrics such as execution time, throughput, and resource utilisation are employed to assess each implementation.

The report begins by outlining the approach to implementing and evaluating the MD5 and MD6 hashing algorithms across various programming languages and hardware platforms. Our primary objectives are to develop both software and hardware implementations, measure their performance, and analyse the trade-offs between these approaches. We detail the design and implementation processes, including the experimental setup, performance metrics, and analytical methods used to interpret the results.

Following the approach outline, a more in-depth description of the design is provided. This includes a flowchart for the MD5 algorithm, a module definition for the Verilog implementation, explanations of the libraries used for C++ and OpenCL, management of OpenCL resources, design of the OpenCL kernel, host interface integration, and data synchronisation and transfer mechanisms.

The proposed development strategy aims to provide context for the supporting tools and framework required if the project were to be developed into a commercial product. This includes the use of the Xilinx Vivado design suite, a Hardware Abstraction Layer (HAL) to standardise the interface between software and hardware, the JetBrains CLion 2024 IDE and GCC compiler for development, and the OpenCL framework for writing parallel programs across heterogeneous platforms.

Planned experimentation elaborates on the methodology, describing how implementations are verified against test vectors, how execution time, speedup, and throughput are measured, the experimental setup, performance metrics used, input size constraints, and the specific implementations tested. The results section presents findings for the MD5 implementations in C++, OpenCL, and Verilog, as well as the sequential and parallel MD6 implementations, comparing their performance. The speedup achieved by each implementation is calculated and discussed.

Finally, the report outlines the acceptance tests and procedures to ensure the proper functioning of all systems and to establish metrics for determining the project's success.

## Background

'Laugh it up, fuzzball.'

-Han Solo, The Empire Strikes Back

Hashing algorithms were initially designed for security. As mentioned by Landge and Mishra: 'The integrity and security of embedded systems have been a major concern as they play a significant role in areas such as education, health care devices, gaming consoles, network security devices, consumer electronics, avionics, car industry, controllers in industrial plants, etc' [1]. It is therefore important to explore various methods of security, one being that a file being downloaded does not contain any malicious content. A simple way to check if the desired file contains unwanted contents is by using a checksum, which is offered by MD5.

Although it is not commonly used in modern hardware architectures, MD5 is still used in Legacy systems, protocols, and for specific projects. The GO-SEC project at Helsinki University of Technology utilised the MD5 hashing algorithm as it required a calculation of roughly 10,000 MD5 rounds [2]. Although hardware acceleration for hash algorithms is not required, in certain applications, chains of thousands of hash algorithm rounds are calculated and hardware acceleration may be required [2].

Another issue where MD5 has proven to be useful today in the modern picture is in resource restrained environments. It is still difficult to provide basic security services like message integrity, secrecy, and authentication in contexts with low resources, including those containing Internet of Things (IoT) devices [3]. Since MD5 is a simple, one way hash, it is an easier way to provide rudimentary security. It often has the best overall performance metrics, particularly clock cycle overhead [3].

Given that MD5 cannot be efficiently parallelised due to its inherent sequential structure, our project also incorporates the MD6 hashing algorithm, which is designed with parallelisation in mind. MD6 uses a bottom-up tree-based mode of operation (Merkle tree-like structure), allowing different parts of the hash computation to be processed concurrently, significantly improving performance on modern multi-core processors and parallel computing platforms [4]. Studies have shown that while MD5 remains useful in certain legacy and resource-constrained environments, MD6 offers superior performance and security for applications requiring high throughput and robustness [4]. By implementing both MD5 and MD6, our project aims to highlight the benefits of parallelisation and dedicated hardware solutions, providing a comprehensive comparison of these two algorithms in various environments.

Our project also serves as a hands-on educational endeavor. MD5 & MD6 are both used as a medium

to explore cryptography, FPGA hardware design, and parallelisation in software and hardware. While not influencing future research or being ground breaking, our project is driven by the desire to expand our skill set and demonstrate our existing capabilities in cryptography, hardware design, and parallel computing.

# Methodology

'In my experience, there is no such thing as luck.'

-Obi-Wan Kenobi, A New Hope

This section aims to outline the systematic approach taken to implement and evaluate the MD5 and MD6 hashing algorithms using different programming languages and hardware. The primary objectives are to develop both software and hardware implementations, measure their performance, and analyse the trade-offs between these approaches. Through rigorous testing, a comprehensive comparison of the efficiency and effectiveness of these implementations will be provided. This section details the design and implementation processes, experimental setup, performance metrics, and the analytical methods used to interpret the results.

## 3.1 System Overview

The system for this project encompasses the development environments, target platforms, and methodologies used to implement and evaluate the MD5 and MD6 hashing algorithms. The primary focus is on creating software and hardware implementations and comparing their performance to understand the trade-offs between these approaches.

## 3.2 Software Implementation

#### 3.2.1 C++ MD5

The MD5 hashing algorithm was first implemented sequentially in C++ to establish a baseline for performance comparison. This implementation serves as the golden standard against which other implementations are measured. The development was carried out using JetBrains CLion 2024 IDE, leveraging its robust features for code development, debugging, and testing.

#### **Proposed Implementation Steps:**

- 1. Understand MD5 Algorithm: The RFC 1321 [5] was studied to understand the steps and specifications of MD5.
- 2. Initial Coding: An initial implementation of the code was created using pseudo-code found in the RFC 1321 [5].

- 3. **Testing and Debugging:** Known hashes were used to debug and confirm the working of the code.
- 4. **Optimisation:** Optimise the code for better performance without altering the algorithm's integrity.

The final two steps are iterative, ensuring the most efficient code is produced. The code was optimised as it is serving as the golden standard.

### 3.2.2 Verilog MD5

To explore software acceleration, the MD5 algorithm was implemented using Verilog in CLion IDE. This implementation investigates the performance benefits of using a Hardware Description Languages (HDL).

#### **Proposed Implementation Steps:**

- 1. Understand MD5 Algorithm: The C++ implementation and RFC 1321 [5] was used to understand the steps of MD5.
- 2. Initial Coding: Write the initial version of the MD5 algorithm in Verilog.
- 3. **Testing and Debugging:** Use simulation tools to validate the design against standard test vectors.
- 4. **Optimisation:** Optimise the code for better performance without altering the algorithm's integrity.
- 5. **Iterative Refinement:** Continuously test, debug, and optimise the implementation to ensure efficiency and correctness.

The initial implementation was successful in CLion IDE, where the Verilog code was able to hash inputs of 440 bits, with a total message size of 512 bits including the padding.

**Future Work:** Our future work will focus on overcoming these challenges by expanding the message size and optimising the Verilog code further. This will also us to test against other software and hardware implementations with message sizes larger than 512 bits.

### 3.2.3 OpenCL MD5 Implementation

To explore parallelisation capabilities, the MD5 hashing algorithm was implemented using OpenCL, a framework for parallel computing across various processing units. This implementation aimed to leverage the computational power of GPUs or CPUs to accelerate the MD5 hashing process.

#### **Proposed Implementation Steps:**

1. Understanding MD5 Algorithm: The C++ implementation and RFC 1321 were referenced to understand the sequential steps of the MD5 algorithm.

- 2. Adapting C++ Code for OpenCL: The initial C++ implementation served as the basis for the OpenCL implementation. Modifications were made to ensure compatibility with OpenCL's parallel execution model.
- 3. **OpenCL Environment Setup:** An OpenCL environment was initialised to manage platforms, devices, contexts, and queues.
- 4. Kernel Development: The MD5 hashing operation was encapsulated into an OpenCL kernel. The kernel code was designed to execute in parallel across multiple processing units.
- 5. Data Transfer and Execution: The input message data was transferred to the OpenCL device memory, and the kernel was executed to perform the MD5 hashing operation in parallel.
- 6. **Performance Evaluation:** The execution time of the OpenCL implementation was measured and compared against the sequential C++ implementation to assess the performance gains achieved through parallelisation.

Given the inherent sequential nature of the MD5 algorithm, attempting to parallelise it using OpenCL posed significant challenges. While OpenCL allows for parallel execution across multiple processing units, the fundamental design of MD5, which relies on the sequential processing of data blocks and chaining variables, limits the potential for parallelisation.

Despite these challenges, an OpenCL implementation was still pursued to explore the limitations and performance implications of parallelising MD5. The results obtained from this experiment provided valuable insights into the feasibility of parallelising inherently sequential algorithms using parallel computing frameworks like OpenCL.

Given the challenges encountered and the fundamental nature of MD5, there are no immediate plans for future work in parallelising MD5 using OpenCL. Instead, the focus will shift towards exploring alternative cryptographic algorithms and optimising hardware implementations to meet performance requirements efficiently.

## 3.2.4 Sequential MD6 C++

The MD6 hashing algorithm was implemented sequentially in C++ to explore its performance and efficiency compared to the MD5 algorithm. MD6 is designed to be highly parallelisable, making it a suitable candidate for hardware acceleration. However, a sequential implementation was first developed to establish a baseline for comparison and to understand the algorithm's sequential processing requirements.

### Proposed Implementation Steps:

- 1. Understand MD6 Algorithm: Similar to the MD5 implementation, a comprehensive understanding of the MD6 algorithm was essential. This involved studying the MD6 specification and identifying the key components and operations involved in the hashing process.
- 2. Sequential Implementation: The MD6 algorithm was translated into C++ code, focusing on implementing the sequential processing steps defined in the MD6 specification. This in-

volved coding the initialisation, message padding, compression function, and finalisation steps sequentially.

3. **Testing and Validation:** After the sequential implementation was completed, extensive testing and validation was conducted to ensure that the algorithm produced correct hash values for input messages. This involved comparing the MD6 hash values generated by the sequential implementation with those generated by established MD6 implementations or reference implementations.

#### Challenges:

The primary challenge encountered during the MD6 sequential implementation was ensuring correctness and accuracy while adhering to the MD6 specification. Additionally, optimising the C++ code for performance without compromising algorithm integrity required careful consideration and testing.

### 3.2.5 Parallel MD6 C++

The parallel MD6 C++ implementation aimed to leverage the multi-threading capabilities of the C++ thread library to parallelise the MD6 hashing algorithm. By distributing the workload across multiple threads, the implementation aimed to improve performance and utilise modern multi-core processors effectively.

#### **Proposed Implementation Steps:**

- 1. Understand MD6 Algorithm: Similar to the sequential MD6 implementation, a thorough understanding of the MD6 algorithm was essential. This involved studying the MD6 specification and identifying opportunities for parallelisation within the hashing process.
- 2. **Parallelisation Strategy:** The MD6 algorithm was analysed to identify independent tasks or portions of the algorithm that could be executed concurrently. These included message block processing, compression function computation, and other non-dependent operations.
- 3. Thread-Based Parallelisation: Using the C++ thread library, the MD6 algorithm was parallelised by spawning multiple threads to handle different tasks simultaneously. Synchronisation mechanisms such as mutexes were employed to coordinate thread execution and ensure correct results.
- 4. **Testing and Validation:** After parallelising the MD6 algorithm, rigorous testing and validation were conducted to verify the correctness and accuracy of the parallel implementation. This involved comparing the parallel MD6 hash values with those generated by the sequential MD6 implementation and established MD6 reference implementations.

#### Challenges:

The primary challenge faced during the parallel MD6 C++ implementation was managing thread synchronisation and ensuring data consistency across multiple threads. Additionally, load balancing and optimal utilisation of available CPU cores required careful consideration to avoid performance bottlenecks.

## 3.3 Hardware Implementation

The MD5 hardware implementation aimed to leverage the parallel processing capabilities of dedicated hardware to accelerate the MD5 hashing algorithm. The process involved designing and implementing the MD5 algorithm in hardware description language (HDL), specifically targeting a FPGA (Field-Programmable Gate Array) development board. The FPGA provides a programmable hardware platform that allows for custom digital circuitry to be implemented and tested. The development board to be used is the Nexys A7-100T by Diligent.

#### **Proposed Implementation Steps:**

- 1. Understand MD5 Algorithm: Similar to the software implementations, a thorough understanding of the MD5 algorithm was essential. This involved studying the MD5 specification and the sequential algorithm implementations to identify the key components and operations involved.
- 2. **HDL Design:** The next step was to translate the MD5 algorithm into hardware description language (HDL), such as Verilog or VHDL.
- 3. FPGA Synthesis and Implementation: Once the HDL design was completed, it needed to be synthesised and implemented on the FPGA development board. This process involved translating the HDL code into a configuration file that could be loaded onto the FPGA. This included synthesis, implementation and the generation of bit streams.
- 4. **Testing and Debugging:** After the implementation on the FPGA, extensive testing and debugging was necessary to ensure that the hardware implementation produced the correct MD5 hash value for an input message. This involved comparing the hardware-generated hashes with those generated by the software implementations for validation.

#### **Challenges:**

The MD5 hardware implementation faced several challenges, primarily due to the sequential nature of the MD5 algorithm. Unlike algorithms that can be easily parallelised, such as MD6, the MD5 algorithm's sequential dependencies posed significant challenges for hardware acceleration. Additionally, translating the complex MD5 algorithm into efficient hardware components required careful design and optimisation.

#### Future Work:

Despite the challenges encountered, the MD5 hardware implementation presents opportunities for future work. Further exploration could focus on alternative hardware architectures or optimisation techniques to overcome the sequential nature of the MD5 algorithm and improve hardware performance. Additionally, research into hardware-accelerated cryptographic algorithms, such as SHA-256 or AES, could provide insights into more efficient hardware implementations for cryptographic hash functions.

## **3.4** Performance Evaluation:

1. Execution Time: Measure the time taken to hash a fixed size of data.

- 2. Throughput: Measure the amount of data processed per execution time.
- 3. **Resource Utilisation:** If time permitted, monitor CPU and memory usage during the hashing process.

For all tests involving MD5 and MD6 software implementations, the execution time was measured using the chrono high-resolution clock. Prior to invoking the hashing function, the clock was initiated, and upon completion of the algorithm's execution, the elapsed time was recorded. This duration was then saved to a file for further analysis. In the case of MD5 OpenCL implementation, where padding needed to be performed outside the kernel, the same methodology was applied using the chrono high-resolution clock. Throughput was calculated by dividing the data size to be processed by the measured execution time, providing insights into the system's processing speed. While resource utilisation metrics such as CPU and memory usage were not measured in this study, future work could explore incorporating these measurements to gain a more comprehensive understanding of system performance.

# Design

'The ability to speak does not make you intelligent.'

-Qui-Gon Jinn, The Phantom Menace

### 4.1 Overview

The design of our project focuses on implementing the MD5 hashing algorithm in various environments: C++, OpenCL, and Verilog. This chapter outlines the structure and functionality of these implementations, detailing the architecture and interaction between components. Additionally, it discusses the MD6 implementations used to illustrate parallelisation speedups.

## 4.2 MD5 Algorithm Flowchart

To better understand the process and flow of the MD5 hashing algorithm, a flowchart is provided. This flowchart illustrates the main steps involved in the MD5 computation, from the initial message preprocessing to the final hash output.

The MD5 algorithm consists of the following main steps:

#### 1. Initialise:

- Convert the input string to a vector of bytes.
- Calculate the bit length of the input.

#### 2. Append Padding Bits:

- Append a single '1' bit to the input.
- Append '0' bits until the length is 448 modulo 512.

#### 3. Append Length:

• Append a 64-bit representation of the original length.

#### 4. Initialise MD Buffer:

• Set A = a0, B = b0, C = c0, D = d0.



Figure 4.1: Flowchart of the MD5 Algorithm

#### 5. Process Message in 16-Word Blocks:

- For each 512-bit chunk of input:
  - (a) Break the chunk into sixteen 32-bit words M[0...15].
  - (b) Initialize hash value for this chunk: AA = A, BB = B, CC = C, DD = D.

#### (c) Main Loop (64 iterations):

- i. For each iteration:
  - Compute F, G, and tempShift using a lookup table.
  - Update AA, BB, CC, and DD.

(d) Add this chunk's hash to the result so far: A += AA, B += BB, C += CC, D += DD.

#### 6. Output:

- Combine A, B, C, and D into a 16-byte array.
- Return the MD5 hash.

The flowchart helps visualize these steps and understand the flow of data through the MD5 algorithm. Each step is crucial in ensuring the integrity and uniqueness of the resulting hash value.

### 4.3 MD5 in Verilog

The Verilog implementation of the MD5 hashing algorithm is designed for hardware acceleration. The main components include the message input, control FSM, and computation logic. The design utilises registers and bitwise operations to achieve efficient processing.

#### 4.3.1 Module Definition

The MD5 module is defined with inputs for the clock signal, reset signal, message block, message length, and start signal. The outputs include the 128-bit digest and a ready signal indicating completion.

```
module md5 (
      input wire clk,
2
      input wire reset,
3
      input wire [0:511] message, // 512-bit message block input
      input wire [63:0] message_len, // Length of the message in bits (up to 512)
5
      input wire start,
                                     // Start the process
6
      output reg [127:0] digest, // Output digest
7
      output reg ready
                                     // Indicates completion
8
9
  );
10
  // Define the constants and rotation amounts
11
  reg [31:0] K [63:0];
  reg [4:0] S [63:0];
13
14
  // MD buffer registers and temporary registers for computations
15
  reg [31:0] A, B, C, D;
16
  reg [31:0] AA, BB, CC, DD;
17
  reg [31:0] F;
18
  integer i, j, idx;
19
20
  // Message schedule array and padded message
21
  reg [31:0] M [15:0];
22
  reg [0:511] padded message;
23
24
  initial begin
25
      // Initialize constants and rotation amounts
26
      K[0] = 32'hd76aa478; K[1] = 32'he8c7b756;
27
      // (omitted for clarity)
28
29
      S[0] = 7; S[1] = 12;
      // (omitted for clarity)
30
  end
31
32
  function [31:0] reverse_bytes(input [31:0] data);
33
```

```
34
       integer k;
       begin
35
            for (k = 0; k < 4; k = k + 1) begin
36
                reverse bytes[k*8+:8] = data[8*(3-k)+:8];
37
           end
38
       end
39
  endfunction
40
41
  // Control FSM and main computation loop
42
  reg [5:0] state;
43
  always @(posedge clk) begin
44
       if (reset) begin
45
           // Initialization
46
           state <= 0;</pre>
47
           ready <= 0;</pre>
48
           A <= 32'h67452301;
49
           B <= 32'hefcdab89;</pre>
50
           C <= 32'h98badcfe;</pre>
51
           D <= 32'h10325476;</pre>
           i <= 0;
53
       end else begin
54
           case (state)
55
                0: begin
56
                     if (start) begin
57
                         // (omitted for clarity)
58
                         state <= 1;</pre>
59
                     end
60
                end
61
                1: begin
62
                     if (i < 64) begin
63
                         // Main computation logic
64
                         // (omitted for clarity)
65
                         i = i + 1;
66
                     end else begin
67
                         // Finalize the digest
68
                         A = A + AA;
69
                         B = B + BB;
70
                         C = C + CC;
71
                         D = D + DD;
72
                         digest[127:96] = reverse_bytes(A);
73
                         digest[95:64] = reverse bytes(B);
74
                         digest[63:32] = reverse bytes(C);
75
```

```
digest[31:0]
                                                   = reverse bytes(D);
76
                              ready <= 1;</pre>
77
                              state <= 0;</pre>
78
                         end
79
                   end
80
              endcase
81
        end
82
   end
83
   endmodule
```

Listing 4.1: Simplified Verilog MD5 Implementation

## 4.4 MD5 in OpenCL

The OpenCL implementation of MD5 leverages GPU acceleration for parallel processing. The design involves initialising OpenCL resources, creating buffers for input and output, and executing the kernel.

## 4.4.1 Libraries Used

- <OpenCL/cl.h>: This library is the OpenCL header file, which provides the necessary functions and data types for using OpenCL. It defines the API for interacting with the OpenCL runtime, including functions for setting up the platform, device, context, command queue, and kernels.
- <iostream>: This library is used for standard input and output stream operations, enabling reading from input devices and writing to output devices, such as the console.
- <fstream>: This library provides functionalities for file input and output operations, allowing the program to read from and write to files.
- <vector>: This library is used to create dynamic arrays that can grow or shrink in size. It is particularly useful for handling collections of data that change in size.
- 'OpenCLError.h': This custom header file is used to manage OpenCL errors. It likely contains functions and macros to check and handle errors that occur during OpenCL API calls.
- <exception>: This library defines standard exceptions and the base class for all exceptions thrown by the C++ standard library, enabling the program to handle runtime errors gracefully.
- <string>: This library is used for manipulating strings, providing a wide range of functionalities for handling text data.

## 4.4.2 OpenCL Resources Management

A class is defined to manage OpenCL resources, including platform, device, context, command queue, and program. The kernel code is read, compiled, and executed on the GPU.

```
// Class to manage OpenCL resources
```

```
2 class OpenCLResources {
  private:
3
      cl platform id platform;
      cl device id device;
5
      cl_context context;
6
      cl command queue queue;
7
      cl_program program;
8
  public:
10
      OpenCLResources() {
11
          cl int err;
12
          // Initialize OpenCL Platform and Device
13
          // (omitted for clarity)
14
          // Create Context and Command Queue
          // (omitted for clarity)
16
          // Read and compile the kernel
17
          // (omitted for clarity)
18
      }
19
20
      ~OpenCLResources() {
21
          // Release OpenCL resources
22
23
          clReleaseProgram(program);
          clReleaseCommandQueue(queue);
24
          clReleaseContext(context);
25
      }
26
27
      // Getters for OpenCL resources
28
      cl platform id getPlatform() const { return platform; }
29
      cl device id getDevice() const { return device; }
30
      cl context getContext() const { return context; }
31
      cl_command_queue getQueue() const { return queue; }
32
      cl_program getProgram() const { return program; }
33
  };
34
35
  // Function to run MD5 hashing and return execution times
36
  double runMD5Hashing(OpenCLResources& resources, const std::vector<char>& message, size t
37
      // Get the length of the message
38
      // (omitted for clarity)
39
      // Create a vector to store execution times
40
      // (omitted for clarity)
41
42
```

```
43
      // Use the compiled program
      cl program program = resources.getProgram();
44
      cl int err;
45
46
      // Create the MD5 kernel
47
      cl_kernel kernel = clCreateKernel(program, "md5_hash", &err);
48
      // Set up data buffers
49
      // (omitted for clarity)
50
      // Execute the kernel and get the execution time
51
      // (omitted for clarity)
52
53
      return executionTime;
54
  }
55
```

Listing 4.2: Simplified OpenCL MD5 Implementation

## 4.5 OpenCL Kernel

The OpenCL kernel for MD5 follows a similar structure to the C++ implementation but is optimised for execution on a GPU.

```
constant uint S[64] = {
           /* omitted for clarity */
2
  };
3
5
    constant uint K[64] = {
           /* omitted for clarity */
6
7
  };
8
    constant uint g_values[64] = {
9
           /* omitted for clarity */
10
  };
11
12
   constant uint a0 = 0 \times 67452301;
13
   constant uint b0 = 0xefcdab89;
14
    constant uint c0 = 0x98badcfe;
15
    constant uint d0 = 0 \times 10325476;
16
17
  static uint F(uint x, uint y, uint z) {
18
       return (x & y) | (~x & z);
19
20
  }
21
```

```
22 static uint G(uint x, uint y, uint z) {
       return (x & z) | (y & ~z);
23
24 }
25
  static uint H(uint x, uint y, uint z) {
26
       return x ^ y ^ z;
27
28
  }
29
  static uint I(uint x, uint y, uint z) {
30
       return y ^ (x | ~z);
31
  }
32
33
    _kernel void md5_hash(__global unsigned char* input, __global unsigned char* output, uint
34
      \hookrightarrow inputSize) {
       uint A = a0;
35
       uint B = b0;
36
       uint C = c0;
37
       uint D = d0;
38
39
       for (uint i = 0; i < inputSize; i += 64) {</pre>
40
           uint M[16];
41
42
           for (int j = 0; j < 16; ++j) {</pre>
               M[j] = (input[i + j*4 + 3] << 24) | (input[i + j*4 + 2] << 16) | (input[i + j*4
43
      \hookrightarrow + 1] << 8) | input[i + j*4];
           }
44
45
           uint AA = A;
46
           uint BB = B;
47
           uint CC = C;
48
           uint DD = D;
49
50
           for (int j = 0; j < 64; j += 4) {
51
               uint tempF[4], g[4], tempShift[4];
52
53
                for (int k = 0; k < 4; ++k) {
54
                    int index = (j + k) >> 4;
55
56
                    if (index == 0) {
57
                         tempF[k] = F(BB, CC, DD);
58
                    } else if (index == 1) {
59
                        tempF[k] = G(BB, CC, DD);
60
                    } else if (index == 2) {
61
```

```
tempF[k] = H(BB, CC, DD);
62
                    } else if (index == 3) {
63
                        tempF[k] = I(BB, CC, DD);
64
                   }
65
66
                    g[k] = g_values[j + k];
67
68
                    tempF[k] = tempF[k] + AA + K[j + k] + M[g[k]];
69
                    tempShift[k] = (tempF[k] << S[j + k]) | (tempF[k] >> (32 - S[j + k]));
70
                    AA = DD;
71
                    DD = CC;
72
                    CC = BB;
73
                    BB += tempShift[k];
74
               }
75
           }
76
77
78
           A += AA;
           B += BB;
79
           C += CC;
80
           D += DD;
81
      }
82
83
       for (int i = 0; i < 4; ++i) {
84
           output[i] = (uchar)(A >> (i * 8));
85
           output[i + 4] = (uchar)(B >> (i * 8));
86
           output[i + 8] = (uchar)(C >> (i * 8));
87
           output[i + 12] = (uchar)(D >> (i * 8));
88
      }
89
  }
90
```

Listing 4.3: Simplified OpenCL Kernel for MD5

## 4.6 MD5 in C++

The C++ implementation of MD5 is designed for CPU execution. It involves defining functions for each MD5 operation, processing message blocks, and generating the final hash.

#### 4.6.1 Libraries Used

- <fstream>: This library provides functionalities for file input and output operations, allowing the program to read from and write to files.
- <iostream>: This library is used for standard input and output stream operations, enabling

reading from input devices and writing to output devices, such as the console.

- <vector>: This library is used to create dynamic arrays that can grow or shrink in size. It is particularly useful for handling collections of data that change in size.
- <chrono>: This library provides functionalities for dealing with time, including measuring time intervals, which is useful for benchmarking the performance of the MD5 implementation.

#### 4.6.2 MD5 Calculation

The MD5 calculation function processes the input message, performs padding, initialises buffers, and processes message blocks.

```
// Define a typedef for a function pointer that takes three uint32 t and returns a uint32 t
  typedef uint32 t (*FuncPtr)(uint32 t, uint32 t, uint32 t);
  // The rotation amounts for each round
  static const constexpr uint32_t S[64] = { /* omitted for clarity */ };
5
  // The constants for each round
  static const constexpr uint32 t K[64] = { /* omitted for clarity */ };
8
  // The g values for each round
  static const constexpr uint32 t g values[64] = { /* omitted for clarity */ };
11
  uint32 t a0 = 0x67452301; // Initial value of 'a'
13
  uint32 t b0 = 0xefcdab89; // Initial value of 'b'
14
  uint32 t c0 = 0x98badcfe; // Initial value of 'c'
  uint32 t d0 = 0x10325476; // Initial value of 'd'
16
17
  // Define the lookup table
18
19
  FuncPtr funcTable[4] = {F, G, H, I};
20
  // Calculate the MD5 hash of the input message
21
  std::array<uint8 t, 16> calculate(const std::string& inputStr) {
22
      std::vector<uint8_t> input(inputStr.begin(), inputStr.end());
23
      uint64 t bitLen = input.size() * 8; // original length in bits
24
25
      // Step 1: Append a single '1' bit
26
27
      input.push back(0x80); // in bits: 10000000
      // Step 2: Append '0' bits until length is 448 modulo 512
28
      while (input.size() % 64 != 56) { // 448 = 512 - 64
29
          input.push_back(0); // in bits: 00000000
30
      }
31
```

```
32
      // Step 3: Append 64-bit representation of original length
      for (int i = 0; i < 8; ++i) { // 64 bits = 8 bytes
33
           input.push_back(bitLen >> (i * 8)); // append 8 bits at a time
34
      }
35
      // Step 4: Initialize MD Buffer
36
      uint32 t A = a0;
37
      uint32 t B = b0;
38
      uint32 t C = c0;
39
      uint32_t D = d0;
40
      // Step 5: Process Message in 16-Word Blocks
41
      for (size_t i = 0; i < input.size(); i += 64) {</pre>
42
           // Break chunk into sixteen 32-bit words
43
           uint32 t M[16];
44
           for (int j = 0; j < 16; ++j) {</pre>
45
               M[j] = (input[i + j*4 + 3] << 24) | (input[i + j*4 + 2] << 16) | (input[i + j*4
46
      \hookrightarrow + 1] << 8) | input[i + j*4];
           }
47
48
           // Initialize hash value for this chunk
49
           uint32 t AA = A;
50
           uint32_t BB = B;
51
           uint32 t CC = C;
           uint32_t DD = D;
53
54
           // Main loop
55
           for (int j = 0; j < 64; j += 4) {
56
               uint32 t tempF[4], g[4], tempShift[4];
57
               // Loop unrolling
58
               for (int k = 0; k < 4; ++k) {
59
                   int index = (j + k) >> 4;
60
                   tempF[k] = funcTable[index](BB, CC, DD);
61
                   g[k] = g_values[j + k];
62
                   tempF[k] = tempF[k] + AA + K[j + k] + M[g[k]];
63
                   tempShift[k] = (tempF[k] << S[j + k]) | (tempF[k] >> (32 - S[j + k]));
64
                   AA = DD;
65
                   DD = CC;
66
                   CC = BB;
67
                   BB += tempShift[k];
68
               }
69
           }
70
71
           // Add this chunk's hash to result so far
72
```

```
A += AA;
73
           B += BB;
74
           C += CC;
75
           D += DD:
76
       }
77
78
       // Step 6: Output
79
       std::array<uint8 t, 16> result;
80
       for (int i = 0; i < 4; ++i) {
81
                        = (uint8 t)(A >> (i * 8));
           result[i]
82
           result[i + 4] = (uint8_t)(B >> (i * 8));
83
           result[i + 8] = (uint8 t)(C >> (i * 8));
84
           result[i + 12] = (uint8 t)(D >> (i * 8));
85
       }
86
87
       return result;
88
89
  }
```

Listing 4.4: Simplified C++ MD5 Implementation

## 4.7 MD6 Implementations

The MD6 algorithm implementations are utilised to illustrate parallelisation speedups. MD6 is designed with inherent parallelisation capabilities, making it suitable for performance comparison. The original MD6 code written by Ronald L. Rivest can be found here.

### 4.7.1 Libraries Used

- <iostream>: This library is used for standard input and output stream operations, enabling reading from input devices and writing to output devices, such as the console.
- <vector>: This library is used to create dynamic arrays that can grow or shrink in size. It is particularly useful for handling collections of data that change in size.
- <chrono>: This library provides functionalities for dealing with time, including measuring time intervals, which is useful for benchmarking the performance of the MD6 implementation.
- <fstream>: This library provides functionalities for file input and output operations, allowing the program to read from and write to files.
- 'md6.h': This custom header file contains the declarations for the MD6 functions and constants used in the implementation.
- <cstring>: This library is used for manipulating C-style strings and arrays, providing functions for copying, comparing, and manipulating strings and memory blocks.

- <thread>: This library is used to create and manage multiple threads of execution, enabling the parallelisation of the MD6 algorithm.
- <mutex>: This library provides synchronisation primitives, such as mutexes, to prevent data races and ensure thread safety when multiple threads access shared resources.
- <cstdlib>: This library includes functions for performing general-purpose functions, such as memory management, random number generation, and system command execution.
- <cinttypes>: This library provides fixed-width integer types and formatting macros, which are useful for ensuring consistent and portable integer representations across different platforms.

#### 4.7.2 MD6 Sequential Implementation

The sequential implementation of MD6 is similar to the MD5 implementation but optimised for serial processing.

#### 4.7.3 MD6 Parallelised Implementation

The parallelised implementation of MD6 leverages multi-threading to achieve significant speedups. The design involves dividing the input into smaller chunks, processing each chunk in parallel, and combining the results.

## 4.8 Integration and Host Connectivity

Since the implementation on an FPGA board was unsuccessful, this section merely discusses how the integration could be done in theory. Integrating the accelerator system with a host involves designing an interface for communication and data transfer. The Verilog module can be integrated into an FPGA, while the OpenCL and C++ implementations can be executed on a host CPU or GPU.

#### 4.8.1 Host Interface Design

The host interface is responsible for sending input data to the accelerator and receiving the computed hash values. For the Verilog implementation, this involves designing input and output ports on the FPGA. These ports would handle the communication between the host and the FPGA, ensuring that data is correctly transmitted and received.

#### 4.8.2 Data Transfer and Synchronisation

Efficient data transfer and synchronisation between the host and the accelerator are crucial for achieving optimal performance. Techniques such as direct memory access (DMA) and double buffering can be employed to minimise latency and maximise throughput. DMA allows for direct data transfer between the host memory and the FPGA without continuous CPU intervention, thereby reducing overhead. Double buffering helps in overlapping computation with data transfer, ensuring that the accelerator is never idle waiting for data, which maximises resource utilisation and overall performance.

# **Proposed Development Strategy**

'When nine hundred years old you reach, look as good you will not.' -Yoda, Return of the Jedi

The discussion so far has shown different implementation possibilities for a message digest. The design has compared MD5 and MD6 in various ways. These hashing algorithms can be applied on software as well as applied on an FPGA to take advantage of dedicated hardware for a highly specific task. The MD5 algorithm has been shown to be inherently sequential which opens the possibility to leverage dedicated hardware to perform the specific task.

The FPGA development environment that was used was the Xilinx Vivado design suite. This environment provides tools for design, synthesis, simulation and implementation for Xilinx FPGAS including the Nexys A7 board. The Vivado design suite offers numerous advantages such as high level synthesis, board support package, Verilog simulation and verification and lastly, FPGA debugging and performance analysis.

A Hardware Abstraction Layer (HAL) is a software layer that provides a consistent and standardised interface between the higher-level software and the underlying hardware system. The main purpose of the HAL is to hide the complexities of the hardware from the software. This allows the software to interact through a set of well defined APIs. The abstraction layer encapsulates the low-level details of the hardware and device-specific configurations.

In the context of a digital accelerator implemented on a FPGA device, a HAL can be beneficial to allow the software to leverage the capabilities of the accelerator without being tightly coupled to specific implementation details. A HAL would therefore be highly beneficial to be developed for a commercial application of the digital accelerator implemented on dedicated hardware such as FPGAs.

For the software implementation of the Message Digest algorithms, JetBrains CLion 2024 IDE was used with C++ development tools and the GCC compiler. As the design has shown, MD6 was developed with the intention of being faster and highly parallelisable. The OpenCL framework is an open standard framework for writing parallel programs that can be executed across heterogeneous platforms such as CPUs, GPUs and FPGAs.

In context of the development and implementation of our digital accelerator, with further development of the MD6 implementation, like the MD5 version, OpenCL can be used to leverage the parallel processing capabilities of these devices to provide performance improvements, performance optimisations, heterogeneous computing and integration with C++.

## **Planned Experimentation**

'Never tell me the odds!'

-Han Solo, The Empire Strikes Back

Performance comparisons will be made by evaluating the different message digest implementations. The evaluation will involve recording the execution time of the programs, calculating the speedup of each program compared to the Golden Measure, and comparing the throughput of each implementation.

## 6.1 Verification and Test Vector

To verify the implementations and ensure their accuracy, a well-known English-language pangram ('The quick brown fox jumps over the lazy dog') will be used as a test vector. A pangram contains all the letters of the alphabet, making it useful for testing hashing and encryption algorithms to ensure alphabetic compatibility. The generated hash values will be compared against known reference values and cross-checked between the different implementations for consistency.

## 6.2 Execution Time, Speedup, and Throughput

To evaluate execution time, speedup, and throughput, the implementations will be tested with strings of varying lengths. The execution time will be measured using the chrono timing package in C++ and the **\$time** system function in Verilog. Speedup will be calculated as the ratio of the execution time of the original sequential program to the execution time of the optimised program. Throughput will be determined by dividing the amount of data processed by the execution time.

## 6.3 Experimental Setup

The experiments will be conducted on an iMac Late 2013 equipped with a Core i5 processor running at 2.7 GHz, paired with 16GB of DDR3 RAM, and running MacOS Sonoma 14.4. OpenCL Version 1.2 will be utilised for testing. The system uses an integrated Intel Iris Pro 5200 GPU featuring base clock speeds of 200MHz, a boost clock of 1150MHz, and a rendering configuration of 40 execution units and 320 shading units.

The following C++ code snippet illustrates how the execution time versus message size tests would be conducted:

```
void runTests() {
      int executions = 100;
       std::vector<double> times(executions, 0); // Vector to store all execution times
3
      // Loop over different input sizes
5
       for (unsigned long long inputSize = 0; inputSize <= pow(2, 23); inputSize += 4 * ceil(</pre>
6
      → pow(2, 23) / 400)) {
           // Generate input string of the required size
7
           std::string inputS(inputSize, 'a'); // Fill the string with 'a'
8
9
           std::cout << "Running " << executions << " executions of MD5 hashing on input size "</pre>
10
      \hookrightarrow << inputSize << "\n";
11
           for (int i = 0; i < executions; ++i) {</pre>
12
               auto start = std::chrono::high_resolution_clock::now();
13
14
               std::array<uint8_t, 16> hash = run_hashing_algorithm(inputS);
15
16
               auto end = std::chrono::high resolution clock::now();
17
               std::chrono::duration<double> diff = end - start;
18
19
               times[i] = diff.count(); // Store execution time in vector
20
           }
21
22
           // Write the execution times to a CSV file
23
           std::ofstream outputFile("execution_times.csv", std::ios_base::app); // Append to
24
      \hookrightarrow the file
           if (inputSize == 0) {
25
               outputFile << "Run Number,Message Size,Execution Time\n"; // Write the headers</pre>
26
           }
27
           for (int i = 0; i < executions; ++i) {</pre>
28
               outputFile << i+1 << "," << inputSize << "," << times[i] << "\n";</pre>
29
           }
30
           outputFile.close();
31
32
           // Clear the vector
33
           times.clear();
34
      }
35
  }
36
```



## 6.4 Performance Metrics

- Execution Time: The time taken to process input messages of varying lengths.
- **Speedup**: Speedup =  $\frac{T_{\text{original}}}{T_{\text{optimised}}}$ 
  - $T_{\rm original}:$  Runtime of the original/non-optimised program.

-  $T_{\text{optimised}}$ : Runtime of the optimised program.

• Throughput: Throughput =  $\frac{\text{Amount of Data Processed}}{\text{Execution Time}}$ 

## 6.5 Input Size Constraints

It is important to note that the Verilog implementation can only handle up to an input size of 440 bits due to padding requirements. This limitation will be considered in the performance evaluation.

## 6.6 Implementations for Testing

- Golden Measure: A sequential C++ implementation of the MD5 message digest algorithm.
- **Verilog**: A Verilog implementation of the MD5 algorithm to simulate potential hardware implementation on FPGAs.
- **OpenCL**: An OpenCL version of the MD5 algorithm for GPU acceleration.
- Sequential MD6 C++: A sequential C++ implementation of the MD6 algorithm to demonstrate the performance of MD6.
- **Parallel MD6 C++**: A parallel C++ implementation of the MD6 algorithm to demonstrate the benefits of parallelisation.

Each implementation will be tested for its execution time and throughput. The experiments will provide insights into the performance and efficiency of each approach, highlighting the advantages and limitations of different implementation strategies.

# Results

'Come on, baby! Do the magic hand thing.'

-Greef Karga, The Mandalorian

This chapter presents the results of testing the various implementations of MD5 and MD6. The implementations include C++, C++ OpenCL, and Verilog for MD5, and sequential and parallelised C++ for MD6. We provide detailed results for different input sizes and discuss the performance and limitations encountered. The performance metrics include execution time and throughput, and we analyse how each implementation scales with increasing input sizes.

## 7.1 MD5 Implementations

#### 7.1.1 C++ Implementation

The C++ implementation of MD5 was tested with input sizes ranging from 0 to approximately 8 MB. The performance metrics recorded include execution time and throughput.



Figure 7.1: Execution Time of C++ MD5 Implementation

The execution time of the C++ MD5 implementation follows a linear increase with respect to the input message size. The error bars indicate the standard deviation of the measurements. The execution time starts near 0 ms for very small message sizes and reaches approximately 70 ms for an 8 MB message size. This linear trend suggests that the implementation scales well with larger input sizes, maintaining consistent performance per unit of data processed.

The throughput of the C++ MD5 implementation reaches approximately 120 MB/s at a message size of 0.16 MB. The throughput rapidly increases from near 0 MB/s for very small message sizes to 120 MB/s at around 0.16 MB. It then remains near 120 MB/s for larger message sizes, although there is significant jitter in throughput between 2 MB and 5 MB message sizes. This jitter could be due to variations in system resource allocation during the testing process.

#### 7.1.2 C++ OpenCL Implementation

The C++ OpenCL implementation of MD5 was tested with input sizes ranging from 0 to approximately 800 KB. Performance issues presented input limitations.

The execution time of the C++ OpenCL MD5 implementation follows an exponentially increasing curve. It starts near 0 s for very small message sizes and reaches just above 40 s for an 800 KB message size. The exponential increase in execution time indicates significant performance bottlenecks in the OpenCL implementation, likely due to inefficiencies in kernel execution or data transfer overheads.

The throughput of the C++ OpenCL MD5 implementation starts near 0 KB/s for very small message sizes. It reaches approximately 450 KB/s for a 20 KB message size and then follows an exponentially



Figure 7.2: Throughput of C++ MD5 Implementation



Figure 7.3: Execution Time of C++ OpenCL MD5 Implementation

decaying curve, eventually reaching around 20 KB/s at an 800 KB message size. The initial peak followed by a rapid decline suggests that the OpenCL implementation is highly sensitive to input size, with larger inputs significantly degrading performance.



Figure 7.4: Throughput of C++ OpenCL MD5 Implementation

# 7.1.3 Verilog Implementation

The Verilog implementation of MD5 was restricted to only operate with a maximum input size of 440 bits. Therefore, only a single input size is useful for testing since MD5 pads any input below 440 bits to 512 bits.

The Verilog implementation took 24 time units to execute with a clock period of 370 ps. This means that the hashing only took 8.88 nanoseconds to complete. This extremely fast execution time demonstrates the efficiency of hardware-based implementations, although the limited input size significantly constrains its practical applicability.

# 7.2 MD6 Implementations

# 7.2.1 Sequential C++ Implementation

The sequential C++ implementation of MD6 was tested with input sizes ranging from 0 to approximately 8 MB. Performance metrics include execution time and throughput.

The execution time of the sequential C++ MD6 implementation follows a linear increase with respect to the input message size. The error bars indicate the standard deviation of the measurements. The execution time starts near 0 ms for very small message sizes and reaches approximately 50 ms for an 8 MB message size. This linear trend is similar to that observed in the C++ MD5 implementation, suggesting efficient scaling with input size.

The throughput of the sequential C++MD6 implementation reaches just above 150 MB/s at a message size of 0.08 MB. The throughput starts near 0 MB/s for very small message sizes and shows a rapid



Figure 7.5: Execution Time of Sequential C++ MD6 Implementation



Figure 7.6: Throughput of Sequential C++ MD6 Implementation

dip to 75 MB/s at around 500 KB, likely due to system resources being utilised by an OS application. This dip indicates that external factors can impact the performance of the sequential implementation.

#### 7.2.2 Parallelised C++ Implementation

The parallelised C++ implementation of MD6 was also tested with input sizes ranging from 0 to approximately 8 MB. The results include execution time and throughput, highlighting the benefits of parallelisation.



Figure 7.7: Execution Time of Parallelised C++ MD6 Implementation

The execution time of the parallelised C++ MD6 implementation follows a linear increase with respect to the input message size. The error bars indicate the standard deviation of the measurements. The execution time starts near 0 ms for very small message sizes and reaches approximately 15 ms for an 8 MB message size. The significantly lower execution time compared to the sequential implementation highlights the effectiveness of parallelisation in reducing processing time.

The throughput of the parallelised C++ MD6 implementation reaches 500 MB/s at a message size of 3 MB. The throughput rapidly grows from near 0 MB/s for very small message sizes to 400 MB/s at approximately 500 KB message size. This high throughput indicates that the parallelised implementation can handle large data volumes efficiently, making it suitable for high-performance applications.

## 7.3 Comparison of Implementations

This section provides a comparative analysis of all the implementations.

Figure 7.9 shows the mean execution time across various input message sizes. Note that the execution times for the OpenCL implementation of MD5 were removed from this plot to enhance visual clarity



Figure 7.8: Throughput of Parallelised C++ MD6 Implementation



Figure 7.9: Comparison of Execution Time Across Implementations

as its execution times reached 40 seconds at a message size of 800 KB.

All implementations start with a very low execution time (<1 ms) due to the small input message size. All implementations follow a linear relationship with respect to the input message size. The

MD6 parallelised implementation has the best execution time, around 20 ms for an 8 MB message size. The MD6 sequential implementation has the second-best execution time, around 50 ms for an 8 MB message size. The C++ MD5 implementation has the slowest execution time, around 70 ms for an 8 MB message size.



Figure 7.10: Comparison of Throughput Across Implementations

Figure 7.10 shows the mean throughput across various input message sizes.

All implementations start with a very low throughput (<10 MB/s) due to the small input message size. The MD6 parallelised implementation has the highest throughput, averaging around 480 MB/s after a 1 MB message size. The MD6 sequential implementation has the second-highest throughput, averaging around 160 MB/s after a 0.16 MB message size. The C++ MD5 implementation has the third-highest throughput, averaging around 110 MB/s after a 0.16 MB message size. The C++ OpenCL implementation has the lowest throughput, averaging around 40 MB/s after a 200 KB message size.

This comparison highlights the benefits and limitations of each implementation. The parallelised MD6 implementation excels in both execution time and throughput, demonstrating the advantages of parallel processing. The sequential implementations of MD5 and MD6 also perform well but are outpaced by their parallelised counterparts. The Verilog implementation, while extremely fast for small input sizes, is limited by its maximum input size constraint. The OpenCL implementation faces significant performance challenges, highlighting the need for optimisation in GPU-based hashing solutions.

# 7.4 Speedup Analysis

This section presents the speedup achieved by the different implementations relative to the C++MD5 implementation, which serves as the golden standard. We analyse speedup for two specific message sizes: 0 bytes and 7.92 MB.

### 7.4.1 Speedup at 0 Byte Message Size

At a message size of 0 bytes, we compare the execution times of the different implementations to determine the relative speedup.

| Implementation       | Execution Time (ms) | Speedup |
|----------------------|---------------------|---------|
| C++ MD5              | 0.00118             | 1.0     |
| C++ OpenCL MD5       | 1.170               | 0.00109 |
| Verilog MD5          | 0.00000888          | 132.883 |
| Sequential C++ MD6   | 0.00299             | 0.395   |
| Parallelised C++ MD6 | 0.113               | 0.0104  |

Table 7.1: Speedup of Implementations at 0 Byte Message Size

### 7.4.2 Speedup at 7.92 MB Message Size

For an 7.92 MB message size, we compare the execution times of the different implementations to determine the relative speedup.

| Implementation       | Execution Time (ms) | Speedup |
|----------------------|---------------------|---------|
| C++ MD5              | 68.325              | 1.0     |
| C++ OpenCL MD5       | N/A                 | N/A     |
| Verilog MD5          | N/A                 | N/A     |
| Sequential $C++MD6$  | 49.919              | 1.369   |
| Parallelised C++ MD6 | 16.429              | 4.159   |

Table 7.2: Speedup of Implementations at 8 MB Message Size

### 7.4.3 Discussion

The speedup analysis reveals significant differences in performance across implementations.

At a message size of 0 bytes, the Verilog implementation demonstrates an extraordinary speedup of approximately 132.883x compared to the C++ MD5 implementation, due to its hardware efficiency. This highlights the potential of hardware-based implementations for small input sizes. However, the applicability of the Verilog implementation is limited by its maximum input size constraint of 440 bits.

The parallelised C++ MD6 implementation, while not as effective at 0 bytes, shows substantial improvement for larger input sizes. At a message size of 7.92 MB, it achieves a speedup of 4.159x compared to the C++ MD5 implementation, highlighting the benefits of parallel processing. This makes it the most efficient implementation for handling large data volumes.

The sequential C++ MD6 implementation also shows an improvement over the C++ MD5 implementation at larger input sizes, achieving a speedup of 1.369x at 7.92 MB. This indicates that even without parallelisation, the MD6 algorithm is more efficient than MD5 when implemented sequentially in C++.

In contrast, the C++ OpenCL MD5 implementation exhibits significant performance issues. At a message size of 0 bytes, it has a speedup of only 0.00109x compared to the C++ MD5 implementation, indicating potential inefficiencies in the OpenCL kernel execution or data transfer overheads. This limits its effectiveness, particularly for smaller input sizes.

Overall, the parallelised C++ MD6 implementation provides the best performance across varying input sizes, demonstrating its suitability for high-performance applications. The Verilog implementation, while extremely fast for small inputs, is constrained by its limited input size capability. The sequential C++ MD6 and C++ MD5 implementations perform reliably, with the sequential MD6 showing an edge at larger input sizes. The C++ OpenCL MD5 implementation, however, requires optimisation to improve its performance.

# Chapter 8

# Acceptance Test Procedure

"This is the Way."

— The Armorer, The Mandalorian

Acceptance testing is a critical phase in the development life cycle that validates whether the system meets the requirements and specifications set out. In this section, we will discuss the alpha testing process conducted for our digital accelerator system designed for the message digest algorithms. The testing process aims to evaluate the system's functionality, performance, and usability in a controlled environment, ensuring its adherence to the defined acceptance criteria before releasing it to external users.

# 8.1 Acceptance Tests

| Test                                                                       | Result     |
|----------------------------------------------------------------------------|------------|
| AT1: Software and hardware implementation of MD5 digest works and          | All passed |
| has been implemented successfully                                          |            |
| AT2: Message digest algorithm produces accurate hash value for pan-        | All passed |
| gram test vector                                                           |            |
| <b>AT3:</b> Message digest algorithm works under different conditions with | All passed |
| small and long length strings                                              |            |
| AT4: Cross-platform compatibility of the software implementations          | All passed |

Table 8.1: Acceptance Testing

To be confirmed : Acceptance testing of working hardware implementation of the message digest algorithms implemented on FPGAs.

# 8.2 Acceptance Test Procedure

#### ATP1:

To test for both AT1 and AT2, test each of the digest implementations with the pangram test vector 'The quick brown fox jumps over the lazy dog' as the input string.

The resulting hash values will be compared against the known correct hash value for the pangram using the respective digest algorithm (MD5 or MD6) (9E107D9D372BB6826BD81D3542A419D6 and

c232a489a17cd514773d489cadd09f8c, respectively ). If the computed hash values match the expected values, it confirms that the implementation is working correctly and producing accurate results.

### ATP1:

To test for AT3, the implementations will be executed with different input string lengths. The test cases will include: Varying string sizes from 0 - 8304912 bytes. For OpenCL only string sizes from 0 - 817908 bytes could be tested. The computed hash values will be verified against the known correct values.

### ATP3:

To test for AT4, the software implementations (sequential C++, parallel C++ and OpenCL) will be compiled and executed on different platforms, such as MacOS, Linux, and Windows.

The test cases will include:

- **Compilation:** Verify that the source code compiles successfully on each platform without any errors or warnings.
- **Execution:** Run the compiled programs on each platform and ensure that they execute without any runtime errors or crashes.
- **Consistency:** Verify that the computed hash values are consistent across different platforms for the same input string and digest algorithm.

Based on the acceptance test results found in Table 8.1, the digital accelerator system for message digest algorithms meets the specified acceptance criteria. The software implementations of MD5 and MD6 digest algorithms are working correctly and producing accurate hash values for different input strings and lengths. The software implementations demonstrate cross-platform compatibility, ensuring consistent behavior across different operating systems.

The following figures illustrate some of the results of the Acceptance Testing Procedure:



Figure 8.1: Verilog MD5 accuracy test

MD5 hash of 'The quick brown fox jumps over the lazy dog': 9e107d9d372bb6826bd81d3542a419d6 Execution time: 4.376e-06 s

Figure 8.2: Sequential MD5 accuracy test

# MD5 hash of 'The quick brown fox jumps over the lazy dog': 9e107d9d372bb6826bd81d3542a419d6 Execution time: 8.62e-05 seconds

Figure 8.3: Parallel MD5 accuracy test

| Running single test for sequential implementation |
|---------------------------------------------------|
| MD6-128 hash matches known hash.                  |
| Execution time: 2.7032e-05s                       |
| MD6-256 hash matches known hash.                  |
| Execution time: 8.149e-06s                        |
| MD6-512 hash matches known hash.                  |
| Execution time: 2.854e-05s                        |
| Running single test for parallel implementation   |
| MD6-128 hash matches known hash.                  |
| Execution time: 0.00019313s                       |
| MD6-256 hash matches known hash.                  |
| Execution time: 0.000110664s                      |
| MD6-512 hash matches known hash.                  |
| Execution time: 0.000101671s                      |

Figure 8.4: MD6 Accuracy and variable length input test

# Chapter 9

# Conclusions

'Sorry lady. I don't understand frog.'

-Din Djarin, The Mandalorian

The objectives of this project were to implement and evaluate MD5 and MD6 hashing algorithms using various programming environments, including C++, OpenCL, and Verilog, to measure performance and explore potential optimisations. The comprehensive analysis provided insights into the efficiency, scalability, and trade-offs of different implementation strategies.

The sequential C++ implementation of MD5 served as the golden standard, providing a baseline for comparison. The results showed that while MD5 is inherently sequential, the parallelised C++ MD6 implementation demonstrated significant performance improvements, achieving a speedup of 4.159x for larger input sizes. This underscores the advantages of parallel processing in modern cryptographic applications.

The Verilog implementation of MD5, though constrained by its maximum input size of 440 bits, showcased the potential of hardware acceleration, achieving an execution time of just 8.88 nanoseconds. This result highlights the efficiency of dedicated hardware solutions for specific tasks, despite the limitations imposed by the sequential nature of MD5.

The OpenCL implementation of MD5 faced substantial performance challenges, primarily due to the sequential dependencies within the algorithm. The performance bottlenecks identified suggest that further optimisation is needed for GPU-based hashing solutions to be effective.

In contrast, the sequential and parallelised implementations of MD6 demonstrated the algorithm's suitability for high-performance applications. The sequential MD6 implementation showed a linear increase in execution time with input size, while the parallelised implementation significantly reduced processing time and increased throughput, achieving a peak throughput of 500 MB/s.

The performance evaluation and speedup analysis highlighted the strengths and weaknesses of each approach. The parallelised MD6 implementation emerged as the most efficient solution, suitable for handling large data volumes and high-performance requirements. The Verilog implementation, though limited in input size, demonstrated the potential of hardware-based acceleration. The sequential implementations of both MD5 and MD6 performed reliably, with MD6 showing an edge for larger inputs. The OpenCL MD5 implementation requires further optimisation to overcome performance issues.

Overall, this project achieved its primary objectives of developing and evaluating different implementations of MD5 and MD6. The results provide valuable insights into the trade-offs between software and hardware solutions, highlighting the potential of parallel processing and hardware acceleration in cryptographic applications. Future work can focus on optimising the OpenCL implementation, exploring alternative cryptographic algorithms, and further refining the hardware implementations to overcome the challenges identified.

The complete source code for all implementations, including C++, OpenCL, and Verilog, is available in the project's GitHub repository: EEE4120F-YODA. The repository includes comprehensive comments explaining the code, ensuring that each section is well-documented and easy to understand. Additionally, the code is written efficiently and neatly, following best practices for coding standards.

# Bibliography

- I. A. Landge and B. K. Mishra, "Hardware based md5 implementation using vhdl for secured embedded and vlsi based designs," in 2016 International Conference on Communication and Electronics Systems (ICCES), pp. 1–6, 2016.
- [2] K. Jarvinen, M. Tommiska, and J. Skytta, "Hardware implementation analysis of the md5 hash algorithm," in *Proceedings of the 38th Annual Hawaii International Conference on System Sciences*, pp. 298a–298a, 2005.
- [3] Y. Su, Y. Gao, O. Kavehei, and D. C. Ranasinghe, "Hash functions and benchmarks for resource constrained passive devices: A preliminary study," in 2019 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops), pp. 1020–1025, 2019.
- [4] R. Rivest, B. Agre, D. Bailey, S. Cheng, C. Crutchfield, Y. Dodis, K. Elliott, F. Khan, J. Krishnamurthy, Y. Lin, L. Reyzin, E. Shen, J. Sukha, D. Sutherland, E. Tromer, and Y. Yin, "The md6 hash function," 11 2008.
- [5] R. L. Rivest, "The MD5 Message-Digest Algorithm." RFC 1321, Apr. 1992.