This week, the writing of the dissertation for jury evaluation was completed. The document can be found here.
This week, the writing of the dissertation continued and further results were extracted for larger images. The results include PSNR and SSIM measurements and can be seen in the tables above.
This week, the writing of the dissertation was started and more results were extracted. The results include two image quality metrics: the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM). Average results for 512x512 images can be seen in the graphs above.
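For reference, the standard definition of PSNR for 8-bit images is given below. The symbols are notation introduced here, not taken from the dissertation: x is the noise-free reference image, y is the denoised image and N is the number of pixels.

```latex
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2,
\qquad
\mathrm{PSNR} = 10 \log_{10}\!\left( \frac{255^2}{\mathrm{MSE}} \right)\ \text{dB}
```

A higher PSNR means the denoised image is closer to the reference. Identical images give MSE = 0, for which the PSNR is unbounded.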
This week, testing of the final BM3D implementation was started and the first results were extracted. Testing is done by adding noise to a set of predefined images and sending each noisy image to the ZYNQ board. On the board, the C program runs on the ARM processor, which in turn works with the BM3D co-processor implemented in the programmable logic. The graph above shows the speedup achieved by the BM3D hardware implementation with respect to execution on a CPU, for increasing image resolutions. The resolutions are numbered as follows: 1: 256x256; 2: 512x512; 3: 1280x720; 4: 1280x1024; 5: 1920x1080; 6: 2048x1536; 7: 2560x1920; 8: 3264x2448.
This week, the final corrections to the algorithm core were made. The fixed-point width of the denoising pipeline was increased to 32 bits to avoid overflow and underflow, but there were still some problems with the resulting image. The cause was that, for an image corrupted by noise, a patch can contain values larger than 255 or smaller than 0 after filtering and inverting the transform. The rounding module did not account for this, so values smaller than 0 wrapped around to large values close to 255. The rounding module was corrected so that any value smaller than 0 is set to 0 and any value larger than 255 is clamped to 255. As can be seen in the image above, the denoising results are now as expected, and initial measurements show that the PSNR matches that of the MATLAB implementation.
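The rounding fix can be illustrated in software with a minimal sketch. The fixed-point format (8 fractional bits) and the names `FRAC_BITS`, `HALF_LSB` and `round_and_clamp` are assumptions made for this example. The actual hardware module is written in Verilog and its format is not stated in this log.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 8                      /* hypothetical number of fractional bits */
#define HALF_LSB  (1 << (FRAC_BITS - 1)) /* the value 0.5 in this fixed-point format */

/* Round a signed 32-bit fixed-point sample to the nearest integer and
 * saturate the result to the 8-bit pixel range [0, 255]. */
uint8_t round_and_clamp(int32_t fixed)
{
    /* Widen to 64 bits so that adding the rounding offset cannot overflow. */
    int64_t biased = (int64_t)fixed + HALF_LSB;

    /* Anything that would round below zero is set to 0. Handling it here
     * also avoids right-shifting a negative value. */
    if (biased < 0)
        return 0;

    int64_t rounded = biased >> FRAC_BITS;
    assert(rounded >= 0);

    /* Anything above the pixel range is clamped to 255. */
    if (rounded > 255)
        return 255;

    return (uint8_t)rounded;
}
```

For comparison, converting -3 directly to an 8-bit unsigned value in C gives 253, which is the same kind of wrap-around that turned dark pixels white before the fix.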
This week, the algorithm core was corrected. Analysis of the output image revealed a cyclic problem in the processing: neighborhoods at even positions in the image were correct, but those at odd positions were processed erroneously. The cause was a counter in the DCT modules that is responsible for transposing the image data. This counter was not being reset between neighborhoods, which led to the described "on/off" behavior. With all the previous problems solved, testing with a clean image (no noise) was successful. Testing with a noisy image was then carried out, but the result was not as expected. The white noise in the more constant image areas is removed well, but in black areas of the image some pixels are switched to white, which is probably an overflow or underflow problem.
This week, the AXI Slave memory access was validated. The system now operates as follows:
1. The CPU writes to the control register to start the denoising task, and then polls the status register for data requests.
2. A data request is either a read from RAM or a write to RAM. For each read, 4 image lines are transferred, as the neighborhoods are processed in a sliding window manner. For each write, the data, positions and weight information for 16 groups (as 16 processors are being used) are transferred.
3. When a data request is pending, the CPU performs it and then writes to the control register so that the system can continue processing.
4. Finally, when the denoising is done, the system writes to the status register. The CPU reads this value and executes the last stage of the algorithm: the aggregation of the data to form the basic estimate image.
After the validation of the memory access, the image produced by the BM3D hardware implementation was still not correct, and this issue will be addressed in the following week.
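The polling protocol in the four steps can be sketched in C as follows. The register layout, the control and status codes, and the names `bm3d_regs_t`, `bm3d_service_once` and `bm3d_run` are all hypothetical, since the real register map is not given in this log. The sketch also assumes that the core clears the request from the status register once it sees the acknowledgement in the control register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register block; on the board this would be mapped
 * to the address of the AXI Slave interface. */
typedef struct {
    volatile uint32_t control;  /* written by the CPU */
    volatile uint32_t status;   /* written by the BM3D core */
} bm3d_regs_t;

/* Hypothetical control and status codes. */
enum { CTRL_START = 1, CTRL_REQUEST_DONE = 2 };
enum { STAT_IDLE = 0, STAT_READ_REQ = 1, STAT_WRITE_REQ = 2, STAT_DONE = 3 };

typedef struct {
    unsigned reads;   /* transfers of 4 image lines served so far */
    unsigned writes;  /* transfers of group results served so far */
} bm3d_stats_t;

/* Serve at most one pending request. Returns 1 once the core reports
 * that denoising is done, 0 otherwise. */
int bm3d_service_once(bm3d_regs_t *regs, bm3d_stats_t *stats)
{
    assert(regs != 0 && stats != 0);

    switch (regs->status) {
    case STAT_READ_REQ:
        /* Placeholder: copy the next 4 image lines to the core here. */
        stats->reads++;
        regs->control = CTRL_REQUEST_DONE;
        return 0;
    case STAT_WRITE_REQ:
        /* Placeholder: collect data, positions and weights of 16 groups here. */
        stats->writes++;
        regs->control = CTRL_REQUEST_DONE;
        return 0;
    case STAT_DONE:
        return 1;
    default:
        return 0;  /* nothing pending, keep polling */
    }
}

/* Start the core and poll until it signals completion. */
void bm3d_run(bm3d_regs_t *regs, bm3d_stats_t *stats)
{
    regs->control = CTRL_START;
    while (!bm3d_service_once(regs, stats)) {
        /* busy-wait: on hardware the core updates regs->status */
    }
}
```

After `bm3d_run` returns, the CPU would carry out the aggregation stage. Keeping the single-request handler separate from the loop makes it possible to exercise the protocol against a plain struct in memory, without the board.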
This week, progress was made in testing the system. The first error found was that the AXI reset signal is active low, which was not accounted for in the master control module, so the system never started. The second problem was related to the AXI Master operation. Reading from RAM to BRAM on the PL was working, but writing back to RAM had some issues. To try to solve this problem, Mino’s code was reviewed carefully, but as no problems were found, the AXI Master code was replaced with the Xilinx IPIF code. Even with these changes, the memory interface did not work, so an AXI Slave approach was adopted and will be tested in the following week.
This week, testing of the system was started. At the time of writing this report, the system is still not functional, as reading the status flag always returns zero. The problem probably lies in the AXI Master interfaces that write data to RAM: a write-done signal, which indicates that data was successfully written to RAM, needs to go high for the status flag to be updated. The main challenge in testing the system is the difficulty of observing the internal state of the BM3D system, which can only be done by modifying the AXI Slave interface to include more registers that store the values of important control wires.
This week, the C code that runs on the ARM CPU was developed. This code operates as follows:
1. The input image is loaded into RAM, so that the BM3D system can access it.
2. The value that signals a start of the algorithm is written into the control register via the AXI Slave port on the system.
3. The CPU waits for the flag that signals new data is available in memory, and then it processes this data by returning each patch of each group to the original position in a buffer image array and the corresponding weight to a buffer weight array.
4. The CPU waits for the flag that signals that the BM3D system has finished processing the entire image, and then acknowledges this by writing to the control register.
5. Finally, the buffer image array is divided element-wise by the buffer weight array to produce the final image, in this case, the basic estimate image.
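Steps 3 and 5 can be sketched as follows. This is a minimal illustration, not the code that runs on the board: the function names, the `float` buffers, the `patch_size` parameter and the `fallback` value for pixels covered by no patch are assumptions. The sketch also assumes that each patch is multiplied by its group weight in software when it is accumulated.

```c
#include <assert.h>
#include <stddef.h>

/* Step 3: add one patch into the buffer image array at its original
 * position (x, y), and add its weight into the buffer weight array. */
void accumulate_patch(float *img_acc, float *w_acc, size_t width, size_t height,
                      const float *patch, size_t patch_size,
                      size_t x, size_t y, float weight)
{
    assert(img_acc != 0 && w_acc != 0 && patch != 0);
    assert(x + patch_size <= width && y + patch_size <= height);

    for (size_t r = 0; r < patch_size; r++) {
        for (size_t c = 0; c < patch_size; c++) {
            size_t idx = (y + r) * width + (x + c);
            img_acc[idx] += weight * patch[r * patch_size + c];
            w_acc[idx]   += weight;
        }
    }
}

/* Step 5: element-wise division of the image buffer by the weight buffer.
 * Pixels that received no patch get the fallback value. */
void normalize_image(float *out, const float *img_acc, const float *w_acc,
                     size_t n, float fallback)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (w_acc[i] > 0.0f) ? img_acc[i] / w_acc[i] : fallback;
}
```

Where patches overlap, each output pixel becomes a weighted average of the estimates that cover it. For example, a pixel covered by a value of 10 with weight 1 and a value of 20 with weight 3 ends up as 70 / 4 = 17.5.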
This week, the control logic for the system was developed. It consists of 9 finite state machines, each with its own module, as shown in the figure above:
1. This module is responsible for the control of the next neighborhood memory, which stores the data for the next image neighborhood to be processed.
2. This module loads the image data from a buffer memory, which holds the image data between the block matching and denoising processes, into the memory inside the denoising pipeline.
3. This module is responsible for the control of the memory that stores the weights of each group (one weight per group).
4. This module is responsible for the control of the memory that stores the positions of each patch of each group (16 x,y positions per group).
5. This module is responsible for controlling the AXI Master interface which writes the position and weight data to RAM.
6. This module is responsible for controlling the AXI Master which writes the noise-free image data to RAM.
7. This module is responsible for controlling the output memory that holds the image data, in order to store only correct data that outputs from the denoising pipeline.
8. This module controls a flag that signals when the processing of a neighborhood has finished.
9. This module is the master control, which is responsible for controlling all the other control modules, as well as the internal control logic of the array of matching processors and the denoising pipeline. It also controls the AXI Master interface which reads image data from RAM to on-system memory.
This week, the AXI Master Verilog code was developed. The code was originally written by Professor Li's PhD student Mino Won, and it was adapted to meet the demands of the BM3D system. The resulting module can access any address in the 4 GB address space of the ZYNQ device, including the DDR3 controller, which is used to access the RAM where the image is stored. The module acts as the master in the AXI transactions, which means that data transfers are initiated by the BM3D module itself instead of the CPU.
This week, research was done on the AXI interface protocol in order to choose the most suitable variant for memory access in the system. AXI4 is a master-slave protocol with three interface types: AXI4-Lite, AXI4 memory mapped and AXI4-Stream. The Lite interface is used when small amounts of data are transferred, and it uses regular registers (typically 32-bit) to store that data. In this system it is used as a slave, so that the CPU can access the control and status registers to start the denoising process and to know when it has finished. For the actual image data transfer, the system needs to fetch data frequently, so an AXI Master is necessary. This master uses the memory mapped interface, allowing data to be transferred directly from RAM to the system memory.
This week, the implementation and simulation of the denoising path were completed. The results of the Verilog testbench are on par with the MATLAB algorithm, with only some pixels being rounded to a value that differs by 1 (for example, 161 instead of 160). This is believed to be due to the limited number of bits in the fixed-point operations of the denoising pipeline, and it will be investigated further. The RTL schematic can be seen in the image above. It consists of a control module, the FIFO that holds the positions of the group, a position decoder to access a memory that contains the part of the image being processed, and the denoising pipeline, which is composed of the DCT and Haar transforms, the hard thresholding block, and the inverse Haar and IDCT.
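The Haar, hard thresholding and inverse Haar stages of the pipeline can be sketched in floating-point C as below. This is an illustration under assumptions, not the hardware design: it assumes, as in the usual BM3D formulation, that the Haar transform runs across the 16 patches of a group, it uses an orthonormal Haar transform, and the threshold value is left to the caller. The DCT stages are omitted, and the names `GROUP_LEN`, `haar_forward`, `haar_inverse` and `hard_threshold` are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_LEN 16  /* one coefficient from each of the 16 patches of a group */

static const float INV_SQRT2 = 0.70710678f;

/* In-place multi-level orthonormal Haar transform of GROUP_LEN samples. */
void haar_forward(float v[GROUP_LEN])
{
    float tmp[GROUP_LEN];
    for (size_t len = GROUP_LEN; len > 1; len /= 2) {
        size_t half = len / 2;
        for (size_t i = 0; i < half; i++) {
            tmp[i]        = (v[2 * i] + v[2 * i + 1]) * INV_SQRT2; /* average */
            tmp[half + i] = (v[2 * i] - v[2 * i + 1]) * INV_SQRT2; /* detail  */
        }
        for (size_t i = 0; i < len; i++)
            v[i] = tmp[i];
    }
}

/* Inverse of haar_forward. */
void haar_inverse(float v[GROUP_LEN])
{
    float tmp[GROUP_LEN];
    for (size_t len = 2; len <= GROUP_LEN; len *= 2) {
        size_t half = len / 2;
        for (size_t i = 0; i < half; i++) {
            tmp[2 * i]     = (v[i] + v[half + i]) * INV_SQRT2;
            tmp[2 * i + 1] = (v[i] - v[half + i]) * INV_SQRT2;
        }
        for (size_t i = 0; i < len; i++)
            v[i] = tmp[i];
    }
}

/* Zero every coefficient whose magnitude is below the threshold.
 * Returns the number of coefficients that were kept. */
size_t hard_threshold(float v[GROUP_LEN], float threshold)
{
    assert(threshold >= 0.0f);
    size_t kept = 0;
    for (size_t i = 0; i < GROUP_LEN; i++) {
        float mag = (v[i] < 0.0f) ? -v[i] : v[i];
        if (mag < threshold)
            v[i] = 0.0f;
        else
            kept++;
    }
    return kept;
}
```

Calling `haar_forward`, `hard_threshold` and `haar_inverse` in sequence on a nearly constant group removes the small detail coefficients and returns a smoothed group. A fixed-point version of the same arithmetic truncates intermediate results, which is consistent with the occasional off-by-one differences reported above.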
The implementation and simulation of the matching processor were completed this week. The Verilog testbench was verified to behave exactly like the MATLAB implementation, i.e., the processor "creates" the same groups as expected. In the image above, an RTL schematic of the matching processor can be seen. It includes three modules: a memory to store the image neighborhood being processed, the L1 norm block that computes the distance between blocks, and a sorter block that selects the sixteen minimum distances.
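A software sketch of the L1 norm block and the sorter is shown below. The group size of sixteen comes from the text. Everything else is an assumption for illustration, including the `match_t` layout, the patch size parameter `p`, and the use of an insertion-based sorter, which may differ from the hardware sorter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>  /* abs */

#define GROUP_SIZE 16  /* the sorter keeps the sixteen minimum distances */

typedef struct {
    uint32_t dist;  /* L1 distance to the reference patch */
    uint16_t x, y;  /* top-left position of the candidate patch */
} match_t;

/* L1 distance (sum of absolute differences) between the p-by-p patches
 * at (ax, ay) and (bx, by) in an 8-bit image with the given row stride. */
uint32_t l1_distance(const uint8_t *img, size_t stride, size_t p,
                     size_t ax, size_t ay, size_t bx, size_t by)
{
    assert(img != 0);
    uint32_t sum = 0;
    for (size_t r = 0; r < p; r++) {
        for (size_t c = 0; c < p; c++) {
            int a = img[(ay + r) * stride + (ax + c)];
            int b = img[(by + r) * stride + (bx + c)];
            sum += (uint32_t)abs(a - b);
        }
    }
    return sum;
}

/* Insert a candidate into a list kept sorted by ascending distance and
 * limited to GROUP_SIZE entries. Returns the new number of entries. */
size_t insert_match(match_t *best, size_t count, match_t cand)
{
    assert(best != 0 && count <= GROUP_SIZE);

    /* List is full and the candidate is no better than the current worst. */
    if (count == GROUP_SIZE && cand.dist >= best[count - 1].dist)
        return count;

    /* Start at the free slot, or at the worst entry if the list is full. */
    size_t i = (count < GROUP_SIZE) ? count : GROUP_SIZE - 1;
    while (i > 0 && best[i - 1].dist > cand.dist) {
        best[i] = best[i - 1];  /* shift larger distances down */
        i--;
    }
    best[i] = cand;
    return (count < GROUP_SIZE) ? count + 1 : GROUP_SIZE;
}
```

Feeding every candidate position of a neighborhood through `l1_distance` and `insert_match` leaves the sixteen closest patches in `best`, ordered from most to least similar.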
The system level architecture was finished, and can be seen in the image above. It consists of an array of matching processors, each one processing a different image coordinate in parallel, and a denoising path (DCT, Haar, HT and Wiener, inverse Haar, IDCT) that performs the collaborative filtering for each group formed in the matching processors. There are also various memory modules as well as control finite state machines (not shown).
This week, work focused on starting the development of the system level architecture. This included the definition of the various blocks necessary for the implementation of the BM3D algorithm, such as the DCT and Haar transforms, the matching processor (which includes an L1 norm calculator block), and others.
This week, work focused on setting up the development environment for future work. This included installing the Vivado Design Suite from Xilinx, reading the Xilinx getting started and hardware debug modules for the ZC706 SoC board, and reviewing the Verilog hardware description language.
The BM3D algorithm was replicated during the first semester as part of the PDI course. However, a correct MATLAB implementation of the algorithm had not yet been finished when the MSc thesis started. Hence, this week the MATLAB implementation was finished, which involved fixing some incorrect parts and adapting others to enable a hardware implementation approach.