Welcome, guest ( Login )

WikiHome » C2H » OptimisingC2HResults

OptimisingC2HResults

Version 17, changed by gcalac@altera.com. 09/22/2006.   Show version history

 

 

Optimising C2H Results

A Design Example

  

  

Contents:

 

Introduction. 3

The Example Used. 3

Initial Experiment (Rev 0) 4

Separating Operations (Rev 1) 5

Reduce Memory Latency (Rev 2) 6

Use of Constants (Rev 3) 8

Combine Address Calculation and Memory Read (Rev 4) 10

Conclusion. 12

Appendix A: Result Data. 13

Appendix B: Other CRC Implementations. 14

Appendix C: Original Source Code Listing. 15

 

Introduction

Altera has recently launched the C2H compiler for Nios II.  This is a productivity tool that automates the building and integration of hardware accelerators for Nios II based systems.  Like any compiler the quality of results produced by C2H depend upon the source code input.  As C2H generates custom acceleration hardware small changes in coding style has a more significant effect for C2H compared with software processing.

 

This document takes the reader through the process of accelerating C code with C2H and optimising the results.  Several iterations are performed that follow the design flow shown below.

At each stage the report file generated by C2H is analysed in order to determine the best approach to optimise the code.

 

A real example is used from start to finish so that the reader can learn how to interpret the report file, apply optimisations and get the best out of the C2H compiler.

 

The example used is taken from the Internet and created by a third party.  This was done to show an example based upon real code rather than something fabricated purely for the purpose of showing great results from the C2H compiler.  Using standard code also shows a real life example pertinent to many applications and prevents the author from having any preconceived ideas about how to apply C2H optimisations.

 

 

 

The Example Used

The example chosen for this exercise was a CRC algorithm which was taken from the book Programming Embedded Systems in C and C++ by Michael Barr.  The CRC algorithm is often used in embedded systems and is therefore a real life example that can be used to learn from.  The CRC compute function is used unmodified as the starting point for this exercise.  However, the main function has been ported to Nios II and has been instrumented with timestamps so that the performance can be determined.  The modified code can be found in Appendix C of this document.  The original code written by Michael Barr can be found via the following link in the examples section (Chapter 6):

http://www.oreilly.com/catalog/embsys/

Initial Experiment (Rev 0)

In this first example the CRC compute function is accelerated unmodified to establish a baseline.  The results tell us about the capabilities of C2H for non optimised code given this particular example.  This code executes on the “Standard” example hardware design shipped with the Nios II Embedded Design Suite.
 

The code segment accelerated is shown below.

At this stage the “analyse all accelerators” option is used for the C2H build.  This avoids generation of HDL, SOPC Builder integration and Quartus II compilation to save time during optimisation iterations.  At this point we are concerned only the CPLI value and there is no point in running the C2H accelerator until the CPLI is as low a possible.  The C2H result file shows the following performance parameters:

 

Loop CPLI:      12

Loop latency:    22

 

In order to optimise the code to suit C2H we need to look at the details of the C2H report.  Unfortunately, in this case the report is a little difficult to interpret due to the number of operations executed on each line of C code within the loop.  The CPLI critical  path section of the report is shown below. 

 

 

The report tells us that the critical variable is remainder.  Because remainder appears on the right hand side of the assignment in both lines of code we know it is used once between states 1 and 11 and once again between states 11 and 21.  remainder appears in the left hand side of the assignment within line 139 meaning that it’s value is updated and ready for use in the last state given in the report, state 21.  From this information we can draw the state looping dependency across two loop iterations as shown below.

We know that the critical variable is remainder and that it becomes available for use in state 21 from the report.  The CPLI value of 12 tells us that the looping dependency must be between state 21 of one iteration and state 9 (21 – 12) of the next iteration.

 

Separating Operations (Rev 1)

The objective of this first modification to the code is to generate a more detailed C2H report rather than improve the CPLI or latency.  As can be seen from the code segment below, intermediate variables were declared and then used so that only a single operation is performed for each line of C code within the loop.

 

 

Before the C2H accelerator is reanalysed the function should be run as software to verify that the same result is produced as the original code and that the modifications have not affected functionality.  When the accelerator is analysed, the C2H result file shows the following performance parameters:

 

Loop CPLI:      12

Loop latency:    22

 

As expected there is no improvement to the performance.  In some cases splitting up operations like this can increase the CPLI and/or latency.  This is because C assignments typically infer a register stage.  One exception to this rule is when operators infer a large amount of logic, require registering to maintain FPGA clocking frequency and therefore occupy multiple C2H state machine cycles such as multipliers and barrel shifters.   The other exception is when operations are trivial and convert to wires. In this case no register stage is added even if the operation in question is the only one for a given assignment.

 

The 2CH report now shows more information about what operations occur within which states of the loop state machine.

 

 

Notice that the critical variable is still remainder and that remainder becomes available in the last state (state 21).  From the report above we can see that the bulk of the looping dependency between states 9 and 21 can attributed to the array based memory access in line 148.  This is highlighted in the diagram below.

 

In order to reduce the CPLI of this design we need to reduce the latency associated with the crcTable array access.

 

Reduce Memory Latency (Rev 2)

The declaration of the crcTable array is shown below.

 

As an array with no initialisation and no specific section attributes or linker settings, crcTable will be placed in the .bss section of the SOPC System memory.  Since a default software project is being used for this exercise the .bss section resides in DDR SDRAM.  Without guidance, the C2H accelerator will be built assuming that a data master accessing crcTable will need access to every memory in the SOPC system.  When scheduling operations the C2H compiler will use the worst case latency for all memories connected to a particular master.

 

An obvious way to improve the CPLI for this design is to ensure that the crcTable array is placed in a low latency memory and that the C2H scheduler takes this in to account.  The “Standard” hardware design used for this example features an on-chip memory called onchip_ram_64_kbytes which is ideal for crcTable as on chip memories typically have a latency of just one clock cycle.  We can make the Nios II linker target this memory by setting an attribute during the declaration of the array as shown below.  Note that when using this attribute the variable declared must be initialised.

This will move the placement of the crcTable array to the on chip memory but will not change the C2H scheduling.  When scheduling operations the C2H compiler will use the worst case latency for all memories connected to a particular master.  By limiting the connections for the master that is used to read crcTable to only the on chip memory we will change the scheduling.  This will result in a lower CPLI as the scheduler will only consider the latency of this memory.  Another benefit of reducing master connections is a reduction of arbitration logic required between C2H and other masters in the SOPC system, which saves logic elements.  Since this is of great benefit we can also apply this technique to the other master in this system associated with the message array.  Pragma’s are used to specify memory connections as shown below.

 

Following a software run to verify that no functional changes have been made the C2H accelerator can be analysed.  The resulting C2H result file shows the following performance parameters:

 

Loop CPLI:      6

Loop latency:    11

 

This shows a significant improvement in CPLI and if we assume that there will be no Avalon stalls the performance has been improved by a factor of 2.  Further examination of the report file shows that remainder is still the critical loop variable and the access of the crcTable has been reduced to 3 states.  The critical path report is shown below:

 

 

The CPLI value of 6 is now due to a looping dependency upon remainder between states 4 and 10 as shown in the following state diagram.

Although the access to the crcTable array is still the most significant part of the critical path it can not be optimised further.  Since all other operations take one or zero states to process it is time to see if operations can be combined within a single assignment statement.

 

Use of Constants (Rev 3)

Referring back to the code listing we can see that by referencing the definition of WIDTH the variable shift_value is effectively a constant.  The comment associated with the WIDTH definition tells us that this only needs modification if the bit width of the CRC standard is changed.

 

As the width of the CRC can only be changed pre-build it is perfectly reasonable to make shift_value an explicit constant.  No functionality will be lost by making this change as the software implementation can also only cope with a 16 bit CRC once built.  As shift_value is now a constant it no longer needs an assignment statement and the constant value can be entered in to the assignment for rshifted_rem.  This change is highlighted in the code segments above and below.

 

The use of a constant in the shift operation for the rshifted_rem assignment has the added benefit of removing a barrel shifter and reducing the number of logic elements required for the accelerator.

 

Following an “Analyse all accelerators” build, the C2H result file shows the following performance parameters:

 

Loop CPLI:      5

Loop latency:    11

 

As expected the CPLI has been reduced by one.  Spotting optimisations at this stage become more difficult and careful study of the C2H report is required to take things further.  The CPLI section of the build report is shown below.

 

 

The CPLI value of 5 is still due to the looping dependency upon remainder but is now between states 5 and 10 as shown in the following state diagram.

 

At first glance of the report it looks as if an optimisation can be realised by combining lines 147 and 148.  However, since line 147 equates to wires, the C2H compiler has not implemented the register stage normally associated with an assignment and the two lines execute within a signal state (state 5).

 

At this point it is worth analysing the critical path line by line to look for lines that obviously cannot be optimised.

 

Line 145:          Although reported this is not part of the critical path because it does not use or affect the variable remainder.

Line 147:          This equates to wires and takes no states to execute so cannot be optimised.

Line 148:          This operations takes one cycle to complete and is not a complex operation that requires registering.  Therefore it is a suitable candidate for combining with another line of code.

Line 151:          Data from memory reads need to be sampled on a clock edge and so always require a new state before the value read can be used.  However, the first state of a memory read where the address is prepared can be combined with other operations without registers.

Line 152:          This equates to wires and takes no states to execute so cannot be optimised.

Line 153:          This operations takes one cycle to complete and is not a complex operation that requires registering.  However, there is no scope to combine it with another assignment as there is no next assignment; so we need to look backwards.  Since line 152 uses no states we need to look back to line 151.  Lines 153 and 151 cannot be combined due to the reasons given for line 151.

 

Combine Address Calculation and Memory Read (Rev 4)

From the analysis performed on the previous report there can be only one option for further optimisation.  That is to combine lines 148 and 151 so that the address calculation is performed in the same state as the address preparation for the memory read.  The code from the previous revision is shown below with the lines in question highlighted.

The address calculation for the memory read can be considered as follows:

 

 

 

The two lines of C code highlighted above are combined as shown highlighted below.

The C2H result file showed the following performance parameters:

 

Loop CPLI:      4

Loop latency:    10

 

This has had the desired effect as can be seen from the CPLI critical path section of the report below.

 

 

Based on the line per line analysis of the report in the previous section there is nothing further that can be done to improve the CPLI for this implementation.  The bottleneck for this algorithm consists of the 3 states required to read the crcTable look-up table plus one state to calculate the new value of remainder.

 

It is now time to run the full C2H build process in order to create the accelerator, integrate with SOPC Builder and compile in Quartus.  Since we are running this example on the “standard” design which uses a Nios II/s processor, there is no data cache that requires flushing.

 

After completing the C2H build we can run the design on a board to confirm that C2H calculates the same CRC and check the performance boost.

 

C2H does in fact give the same CRC value for a given data block.  The performance from the accelerator is summarised in the table below.

 

Original Software Execution Time (ms)

Modified Software with C2H Execution Time (ms)

Performance Gain

45.86

6.56

7 x

 

Conclusion

Throughout this exercise we have seen several methods of improving the CPLI of a C2H based accelerator through simple restructuring of the original C code.  In order to improve the CPLI it is important to know how the C2H compiler converts C code to hardware and how to interpret the C2H report file.

 

The critical path for the CPLI was analysed at each stage and modification made without the need to build the accelerator each time.  This enabled a quick turnaround between iterations.  The total time taken to optimise the original code, make several optimisations and test the final C2H implementation was less than 3 hours including computer processing time.  To deliver performance improvement of 700% with only 3 hours work shows that C2H truly offers productivity gains for accelerating Nios II based systems.

 


Appendix A: Result Data

Upon completion of this exercise all of the code revisions were analysed in full to give a better appreciation of what performance could be gained at each stage of the optimisation process.  Logic utilisation figures were also analysed to determine the “cost” of each accelerator implementation.

 

Code Rev

Execution Time (ms)

Performance Gain*

ALUT Increase

ALUT % Increase**

0

21.98

2 x

675

19 %

1

22.01

2 x

656

19 %

2

9.18

5 x

520

15 %

3

7.87

6 x

494

14 %

4

6.56

7 x

507

15 %

 

*Performance gain is based upon the original web code (rev 0) with no hardware acceleration.  The execution time for this implementation is 45.86 ms.

 

**Increase in resources is relative to the Altera verilog “Standard” example design for the EP2S60ES Nios II board.  There were no cases on increase in resources other than ALUTs.  For reference the resources for the “standard” design are listed below:

ALUTs:            3469

M512s:            2

M4Ks:             141

DSP elements:  8

 

Of course, all of the changes made to the C code during this exercise had the potential to effect the performance of a pure software implementation.  In order to study this the different code revisions were compiled for the Nios II processor and run without acceleration.  The only significant change was between revision 1 and revsion 2.  This was when the crcTable array was moved from SDRAM to low latency on-chip RAM which is an optimisation technique which also applies to software.

 

Code Rev

Execution Time (ms)

0

45.86

1

45.85

2

32.81

3

32.81

4

31.50

 


Appendix B: Other CRC Implementations

The original source code used in this uses a look up table based polynomial implementation which is particularly suited to sequential processors.  It serves as a good example for the basis of understanding the C2H report and for optimising C2H results.  The maximum performance obtained with this approach is 2 bits processed per clock cycle.  This is determined by 1 byte being processed per loop iteration and that loop having a CPLI of 4.

 

An approach better suited to hardware implementations uses shift registers and XOR gates to implement the CRC calculation.  The following diagram shows a serial implementation of the CRC-CCITT algorithm.

 

 

This implementation would process 1 bit per cycle.  Although this delivers lower performance that the look up table and C2H it is easy to see how this can be made to execute in parallel as shown below.

 

 

This implementation can process 8 bits per cycle.  If we knew that the CRC were to be only ever calculated over a number of bytes that is divisible by 4 we could build an implementation that reads 32 bits at a time.  This would process 32 bits per clock cycle.  Alternatively, the accelerator could be built with write enables so that various word sizes can be input.  In this case a DMA may be used to write 32 bit words to the accelerator with Nios II writing the last few bytes in the data size is not divisible by 4.

 

Whilst it is possible to design this type of accelerator using C2H, the coding style required to get the performance would awkward.  One approach may be to design the shifter in RTL and use C2H to to implement the Avalon master that reads the memory system and passes data to the RTL processing engine.


Appendix C: Original Source Code Listing

/**********************************************************************

 *

 * Filename:    crc.c

 *

 * Description: A table-driven implementation of CRC-CCITT checksums.

 *

 * Notes:       Some of the constants in this file are specific to

 *              Arcom's Target188EB hardware.

 *

 *              This code can be easily modified to implement any

 *              "non-reflective" CRC algorithm.  Simply change the

 *              constants POLYNOMIAL, INITIAL_REMAINDER, FINAL_XOR,

 *              and--if an 8 or 32-bit CRC is required--the definition

 *              of type 'width'.

 *

 *

 * Copyright (c) 1998 by Michael Barr.  This software is placed into

 * the public domain and may be used for any purpose.  However, this

 * notice must not be changed or removed and no warranty is either

 * expressed or implied by its publication or distribution.

 **********************************************************************/

 

#include <system.h>

#include <stdio.h>

#include <io.h>

#include <sys/alt_timestamp.h>

#include <stdlib.h>

#include <string.h>

 

#define CRC_DATA_BASE    (EXT_FLASH_BASE + 0x800000)

#define RAM_BUFFER_BASE  EXT_RAM_BASE

 

/*

 * The CRC parameters.  Currently configured for CCITT.

 * Simply modify these to switch to another CRC standard.

 */

#define POLYNOMIAL          0x1021

#define INITIAL_REMAINDER   0xFFFF

#define FINAL_XOR_VALUE     0x0000

 

/*

 * The width of the CRC calculation and result.

 * Modify the typedef for an 8 or 32-bit CRC standard.

 */

typedef unsigned short width;

 

#define WIDTH   (8 * sizeof(width))

#define TOPBIT  (1 << (WIDTH - 1))

 

/*

 * An array containing the pre-computed intermediate result for each

 * possible byte of input.  This is used to speed up the computation.

 */

width  crcTable[256];

 

/**********************************************************************

 *

 * Function:    crcInit()

 *

 * Description: Initialize the CRC lookup table.  This table is used

 *              by crcCompute() to make CRC computation faster.

 *

 * Notes:       The mod-2 binary long division is implemented here.

 *

 * Returns:     None defined.

 *

 **********************************************************************/

void

crcInit(void)

{

    width  remainder;

    width  dividend;

    int    bit;

 

 

    /*

     * Perform binary long division, a bit at a time.

     */

    for (dividend = 0; dividend < 256; dividend++)

    {

        /*

         * Initialize the remainder.

         */

        remainder = dividend << (WIDTH - 8);

 

        /*

         * Shift and XOR with the polynomial.

         */

        for (bit = 0; bit < 8; bit++)

        {

            /*

             * Try to divide the current data bit.

             */

            if (remainder & TOPBIT)

            {

                remainder = (remainder << 1) ^ POLYNOMIAL;

            }

            else

            {

                remainder = remainder << 1;

            }

        }

 

        /*

         * Save the result in the table.

         */

        crcTable[dividend] = remainder;

    }

 

}   /* crcInit() */

 

 

/**********************************************************************

 *

 * Function:    crcCompute()

 *

 * Description: Compute the CRC checksum of a binary message block.

 *

 * Notes:       This function expects that crcInit() has been called

 *              first to initialize the CRC lookup table.

 *

 * Returns:     The CRC of the data.

 *

 **********************************************************************/

width

crcCompute(unsigned char * message, unsigned int nBytes)

{

    unsigned int   offset;

    unsigned char  byte;

    width          remainder = INITIAL_REMAINDER;

 

 

    /*

     * Divide the message by the polynomial, a byte at time.

     */

    for (offset = 0; offset < nBytes; offset++)

    {

        byte = (remainder >> (WIDTH - 8)) ^ message[offset];

        remainder = crcTable[byte] ^ (remainder << 8);

    }

 

    /*

     * The final remainder is the CRC result.

     */

    return (remainder ^ FINAL_XOR_VALUE);

 

}   /* crcCompute() */

 

 

 

 

/**********************************************************************

 *

 * Function:    sevenseg_set_hex()

 *

 * Description: Decode hex number to format for 7-seg display.

 *              Also: sends data to the 7-seg display PIO

 *

 * Notes:

 *

 * Returns:     The 7-seg-decoded values of a hex digit input.

 *

 **********************************************************************/

 

static void sevenseg_set_hex(int hex)

{

    static alt_u8 segments[16] = {

        0x81, 0xCF, 0x92, 0x86, 0xCC, 0xA4, 0xA0, 0x8F, 0x80, 0x84, /* 0-9 */

        0x88, 0xE0, 0xF2, 0xC2, 0xB0, 0xB8 };                       /* a-f */

 

    unsigned int data = segments[hex & 15] | (segments[(hex >> 4) & 15] << 8);

 

    IOWR(SEVEN_SEG_PIO_BASE, 0, data);

}

 

 

 

 

/**********************************************************************

 *

 * Function:    main()

 *

 * Description: Run CRC on flash data block after copying to RAM.

 *              Measure time taken and give visual indication via

 *              seven segment LED

 *

 *

 *

 **********************************************************************/

int