
Artificial Intelligence, Radboud University, Nijmegen, the Netherlands

Baugh-Wooley Multiplication for the RISCV Processor

F.A. Grootjen (ORCID 0000-0002-0548-5831), N.K. Schauer (ORCID 0009-0000-2211-8083)
Abstract

This article describes an efficient way to implement the multiplication instructions for a RISCV processor. Instead of using three predefined IP blocks for signed, unsigned and mixed multiplication, this article presents a novel extension to the Baugh-Wooley multiplication algorithm which reduces area and power consumption by roughly a factor of three.

Keywords: RISCV, CPU, multiplication

1 Introduction

Within a RISC instruction set, the multiplication instruction is a bit of an oddball. While almost all instructions use only one clock cycle, the multiplication instruction uses significant resources, which is noticeable in the implementation’s latency and/or density. Within the RISCV instruction set architecture, the multiplication instruction is not even present in the basic instruction set, allowing low-end implementations to adhere to the basic specs [6]. Of course, most (larger) implementations support the M extension, which does provide multiplication (and division) instructions. Within this M extension, there are unsigned, signed and mixed multiplication variants. At first, it seems strange to have these different versions: for example, when looking at the 32 bit multiplication instructions, the signedness of the operands does not matter for the result. (In this article we will show that this is the case; surprisingly, it is not as straightforward as you would expect.) But this is only true for the lower 32 bits of the result; for the higher 32 bits the signedness of the operands does matter.

Most modern multiplication implementations use some adapted form of long multiplication (see 2.1). These implementations have two distinct phases:

  1. Calculation of partial products

  2. Summation of all partial products

While the first phase is pretty straightforward and easily parallelized, the second phase is more complex. Literature contains many solutions for the second phase that cover the traditional speed/area trade-off: from simple binary adder trees to more sophisticated [3, 4] and area expensive solutions [5, 2].

The goal of this paper is to show how small adaptations to phase 1 make it possible to share phase 2 for all three multiplication forms: signed, unsigned and mixed. Section 2 introduces long multiplication and shows how it works on binary unsigned sequences. Section 3 derives Baugh and Wooley’s adaptation to use signed numbers. Section 4 explains how a similar adaptation can be done for mixed signed numbers. Finally, Section 5 wraps them all together and shows that a single multiplier can be used for all RV32 multiplication instructions.

2 Multiplication

The mathematical definition of multiplication is based on repeated addition:

Definition 1 (Multiplication)

For $a\in\mathbb{N}$ and $b\in\mathbb{Z}$ the product $a\cdot b$ is defined as:

\[
a\cdot b=\underbrace{b+b+\cdots+b}_{\textrm{$a$ times}}=\sum_{i=1}^{a}b
\]

In this definition, the number $a$ is called the multiplier and $b$ is called the multiplicand. The result $a\cdot b$ is called the product. Note that for $a=0$ the product is 0 by definition. Furthermore we can extend this definition to $a\in\mathbb{Z}$ by defining $a\cdot b=-((-a)\cdot b)$ for $a<0$.

On a computer, integers are limited in size. In the RV32 instruction set integers are 32 bits in size. So, multiplication takes two 32 bit integers and produces a 64 bit product. The lower 32 bits of the product are stored by the mul instruction. For the higher 32 bits, the specific instruction depends on the number format (signed, unsigned or mixed).
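As a rough illustration of these four results, the sketch below models them with Python's unbounded integers. The function names mirror the instruction mnemonics, but the helper `to_signed` and the constant `MASK32` are assumptions made for this example, not part of the RISCV specification.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x: int, n: int = 32) -> int:
    """Interpret the low n bits of x as a 2's complement number."""
    x &= (1 << n) - 1
    return x - (1 << n) if (x >> (n - 1)) & 1 else x

def mul(rs1: int, rs2: int) -> int:
    """Lower 32 bits of the product; identical for every signedness."""
    return (rs1 * rs2) & MASK32

def mulh(rs1: int, rs2: int) -> int:
    """Upper 32 bits, both operands signed."""
    return ((to_signed(rs1) * to_signed(rs2)) >> 32) & MASK32

def mulhu(rs1: int, rs2: int) -> int:
    """Upper 32 bits, both operands unsigned."""
    return (((rs1 & MASK32) * (rs2 & MASK32)) >> 32) & MASK32

def mulhsu(rs1: int, rs2: int) -> int:
    """Upper 32 bits, rs1 signed and rs2 unsigned."""
    return ((to_signed(rs1) * (rs2 & MASK32)) >> 32) & MASK32

x, y = 0xFFFFFFFF, 2
print(hex(mul(x, y)), hex(mulh(x, y)), hex(mulhu(x, y)), hex(mulhsu(x, y)))
```

For `rs1 = 0xFFFFFFFF` and `rs2 = 2` the low word is `0xfffffffe` regardless of interpretation, while the high word is `0xffffffff` for `mulh` and `mulhsu` but `0x1` for `mulhu`.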

Integers are stored as binary bit sequences. For unsigned numbers the following format is used:

Definition 2 (Unsigned numbers)

The unsigned number $a\in\mathbb{N}$ is represented as a binary sequence $a_{n-1}a_{n-2}\ldots a_{1}a_{0}$ with:

\[
a=\sum_{i=0}^{n-1}a_{i}\cdot 2^{i}
\]

where $a_{i}\in\{0,1\}$ and $n$ denotes the number of bits. Note that the largest number that can be represented this way is $2^{n}-1$. For 32 bit numbers this is 4294967295.

2.1 Long Multiplication

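Unsigned multiplication of binary sequences can be done using long multiplication, but now in a binary fashion. Table 1 shows the long multiplication of two 4 bit numbers: 1001 (9) and 0101 (5).

\[
\begin{array}{cccccccccl}
 & & & & & 1 & 0 & 0 & 1 & (=9)\\
\times & & & & & 0 & 1 & 0 & 1 & (=5)\\
\hline
 & & & & & 1 & 0 & 0 & 1 & \\
 & & & & 0 & 0 & 0 & 0 & & \\
 & & & 1 & 0 & 0 & 1 & & & \\
+ & & 0 & 0 & 0 & 0 & & & & \\
\hline
 & 0 & 0 & 1 & 0 & 1 & 1 & 0 & 1 & (=45)
\end{array}
\]
Table 1: Long multiplication of two 4 bit numbers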

The long multiplication algorithm for binary sequences is straightforward: starting from the least significant bit of the multiplier, write down the multiplicand only when the bit is 1, otherwise write down zeroes. We repeat this procedure for the other bits (at position $i$), and write down the multiplicand shifted $i$ bits to the left. This way we get $n$ partial products. The final product is the sum of all partial products. It is relatively easy to back up the long multiplication mathematically:

\[
a\cdot b=\left(\sum_{i=0}^{n-1}a_{i}\cdot 2^{i}\right)\cdot b=b\cdot\sum_{i=0}^{n-1}a_{i}\cdot 2^{i}=\sum_{i=0}^{n-1}b\cdot a_{i}\cdot 2^{i}
\]

As you can see, the product is a sum of $n$ partial products. Within the partial product, $2^{i}$ represents the left shift, and since $a_{i}$ is either 0 or 1, the product $b\cdot a_{i}$ is either all zeroes or the multiplicand itself.
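The sum of shifted partial products can be sketched in a few lines of Python. The function name `long_multiply` is hypothetical; the arguments follow the text, with `a` the multiplier and `b` the multiplicand.

```python
def long_multiply(a: int, b: int, n: int) -> int:
    """Unsigned n-bit long multiplication as a sum of n partial products."""
    partial_products = []
    for i in range(n):
        a_i = (a >> i) & 1                       # bit i of the multiplier
        partial_products.append((b * a_i) << i)  # b or 0, shifted i places left
    return sum(partial_products)

# The example of Table 1: multiplier 0101 (5), multiplicand 1001 (9).
print(long_multiply(0b0101, 0b1001, 4))  # 45
```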

2.2 Hardware implementation

Performing the above algorithm in hardware involves 2 tasks:

  1. calculate the $n$ partial products, which can be calculated in parallel

  2. calculate the sum of all partial products, using for example an adder tree

For a partial product we have to calculate $b\cdot a_{i}\cdot 2^{i}$. As stated above, $2^{i}$ is actually a fixed shift (depending on $i$). Since $a_{i}\in\{0,1\}$, the multiplication with $b$ can be done with a vector and operation after replicating $a_{i}$ (sign-extending the single bit) to a vector of length $n$.
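A software analogue of this and-based partial product might look as follows. This is only a sketch: `partial_product` is a hypothetical name, and the conditional expression stands in for the wiring that replicates $a_i$ across $n$ bits in hardware.

```python
def partial_product(a: int, b: int, i: int, n: int) -> int:
    """Partial product i: b and-ed with a_i replicated n times, then shifted."""
    a_i = (a >> i) & 1
    mask = (1 << n) - 1 if a_i else 0   # a_i replicated to an n-bit vector
    return (b & mask) << i              # fixed shift by i positions

# Multiplier 0101 (5), multiplicand 1001 (9), as in Table 1.
rows = [partial_product(0b0101, 0b1001, i, 4) for i in range(4)]
print(rows, sum(rows))
```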

2.3 Example

For 4x4 to 8 bit, the unsigned multiplication scheme looks as follows:

\[
\begin{array}{ccccccccc}
 & & & & & b_3 & b_2 & b_1 & b_0\\
\times & & & & & a_3 & a_2 & a_1 & a_0\\
\hline
 & & & & & a_0b_3 & a_0b_2 & a_0b_1 & a_0b_0\\
 & & & & a_1b_3 & a_1b_2 & a_1b_1 & a_1b_0 & \\
 & & & a_2b_3 & a_2b_2 & a_2b_1 & a_2b_0 & & \\
+ & & a_3b_3 & a_3b_2 & a_3b_1 & a_3b_0 & & &
\end{array}
\]

3 Signed Multiplication

Signed numbers are stored as binary sequences with a twist:

Definition 3 (Signed numbers)

The (2’s complement) signed number $a\in\mathbb{Z}$ is represented as a binary sequence $a_{n-1}a_{n-2}\ldots a_{1}a_{0}$ with:

\[
a=-a_{n-1}\cdot 2^{n-1}+\sum_{i=0}^{n-2}a_{i}\cdot 2^{i}
\]

where $a_{i}\in\{0,1\}$ and $n$ denotes the number of bits. Note that the smallest negative (largest positive) number that can be represented this way is $-2^{n-1}$ ($2^{n-1}-1$ respectively). For 32 bit numbers this is -2147483648 (2147483647 respectively).
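As a small worked instance of Definition 3 (the choice of $n=4$ and of the two bit patterns is only illustrative), the most significant bit contributes with weight $-2^{3}$:

\[
1001_2=-1\cdot 2^{3}+\left(0\cdot 2^{2}+0\cdot 2^{1}+1\cdot 2^{0}\right)=-7,\qquad
0101_2=-0\cdot 2^{3}+\left(1\cdot 2^{2}+0\cdot 2^{1}+1\cdot 2^{0}\right)=5
\]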

This way of denoting 2’s complement signed numbers has the advantage that normal addition (and subtraction) still functions correctly. For signed multiplication however it does not. Baugh and Wooley [1] found a nice way to use unsigned multiplication for signed numbers with only a small number of changes. Let us derive their solution:

\[
\begin{aligned}
a\cdot b&=\left(-a_{n-1}\cdot 2^{n-1}+\sum_{i=0}^{n-2}a_{i}\cdot 2^{i}\right)\cdot\left(-b_{n-1}\cdot 2^{n-1}+\sum_{j=0}^{n-2}b_{j}\cdot 2^{j}\right)\\
&=\underbrace{\vphantom{\sum_{j=0}^{n-2}}a_{n-1}b_{n-1}\cdot 2^{2n-2}}_{A}+\underbrace{\sum_{i=0}^{n-2}\sum_{j=0}^{n-2}a_{i}b_{j}\cdot 2^{i+j}}_{B}-\underbrace{\vphantom{\sum_{j=0}^{n-2}}2^{n-1}\sum_{i=0}^{n-2}a_{i}b_{n-1}\cdot 2^{i}}_{X}-\underbrace{2^{n-1}\sum_{j=0}^{n-2}a_{n-1}b_{j}\cdot 2^{j}}_{Y}
\end{aligned}
\]

Note that the first two terms ($A$ and $B$) are non-negative, so they do not pose a problem. The last two terms are subtracted, so we look for their 2’s complement counterparts, which we can simply add.

Call the first negative term (without the sign) $X$ and the second one $Y$. We will focus on $X$ for now; for $Y$ the derivation is similar. So:

\[
X=2^{n-1}\sum_{i=0}^{n-2}a_{i}b_{n-1}\cdot 2^{i}
\]

Since we are interested in the binary notation of $X$, define:

\[
x_{i}=a_{i}b_{n-1}
\]

Assuming $X$ to be $2n$ bits wide, we can write out its binary notation:

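\[
\begin{array}{l|ccccccccccc}
\text{bit position} & 2n-1 & 2n-2 & 2n-3 & 2n-4 & \cdots & n & n-1 & n-2 & n-3 & \cdots & 0\\
\hline
\text{bit value } X & 0 & 0 & x_{n-2} & x_{n-3} & \cdots & x_{1} & x_{0} & 0 & 0 & \cdots & 0
\end{array}
\]

The 2’s complement of $X$ can be calculated by inverting all bits and adding 1:

\[
\begin{array}{l|ccccccccccc}
\text{bit position} & 2n-1 & 2n-2 & 2n-3 & 2n-4 & \cdots & n & n-1 & n-2 & n-3 & \cdots & 0\\
\hline
\text{bit value } -X & 1 & 1 & \overline{x}_{n-2} & \overline{x}_{n-3} & \cdots & \overline{x}_{1} & \overline{x}_{0} & 1 & 1 & \cdots & 1+1
\end{array}
\]

Note that the bit value row is not final: the addition at bit position 0 will generate a carry that will cascade all the way up to position $n-1$:

\[
\begin{array}{l|ccccccccccc}
\text{bit position} & 2n-1 & 2n-2 & 2n-3 & 2n-4 & \cdots & n & n-1 & n-2 & n-3 & \cdots & 0\\
\hline
\text{bit value } -X & 1 & 1 & \overline{x}_{n-2} & \overline{x}_{n-3} & \cdots & \overline{x}_{1} & \overline{x}_{0}+1 & 0 & 0 & \cdots & 0
\end{array}
\]

Similarly, we will find for $-Y$ (with $y_{j}=a_{n-1}b_{j}$):

\[
\begin{array}{l|ccccccccccc}
\text{bit position} & 2n-1 & 2n-2 & 2n-3 & 2n-4 & \cdots & n & n-1 & n-2 & n-3 & \cdots & 0\\
\hline
\text{bit value } -Y & 1 & 1 & \overline{y}_{n-2} & \overline{y}_{n-3} & \cdots & \overline{y}_{1} & \overline{y}_{0}+1 & 0 & 0 & \cdots & 0
\end{array}
\]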

When adding $-X$ and $-Y$, some bit positions are worth noticing:

\[
\begin{array}{l|ccccc}
\text{bit position} & 2n-1 & 2n-2 & \cdots & n & n-1\\
\hline
\text{bit value } -X & 1 & 1 & \cdots & \overline{x}_{1} & \overline{x}_{0}+1\\
\text{bit value } -Y & 1 & 1 & \cdots & \overline{y}_{1} & \overline{y}_{0}+1\\
\text{bit value } -X-Y & 1 & 0 & \cdots & \overline{x}_{1}+\overline{y}_{1}+1 & \overline{x}_{0}+\overline{y}_{0}
\end{array}
\]

So adding $-X-Y$ can be done by adding the inverted $x$ and $y$ bits, and subsequently adding a 1 at bit positions $2n-1$ and $n$.
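This identity can be checked numerically. The sketch below builds $-X-Y$ modulo $2^{2n}$ from the inverted bits plus the two correction ones and compares it with a direct computation; the function name `neg_x_neg_y` is an assumption made for this example.

```python
def neg_x_neg_y(a: int, b: int, n: int) -> int:
    """(-X - Y) mod 2^(2n) from inverted x and y bits plus two correction ones."""
    a_msb = (a >> (n - 1)) & 1
    b_msb = (b >> (n - 1)) & 1
    total = 0
    for i in range(n - 1):
        x_i = ((a >> i) & 1) & b_msb   # x_i = a_i * b_{n-1}
        y_i = ((b >> i) & 1) & a_msb   # y_i = a_{n-1} * b_i
        total += ((x_i ^ 1) + (y_i ^ 1)) << (n - 1 + i)
    total += (1 << (2 * n - 1)) + (1 << n)   # ones at positions 2n-1 and n
    return total % (1 << (2 * n))

# Direct computation of X and Y for one 4-bit example.
a, b, n = 0b1011, 0b1101, 4
low = (1 << (n - 1)) - 1
X = ((a & low) * (b >> (n - 1))) << (n - 1)
Y = ((b & low) * (a >> (n - 1))) << (n - 1)
print(neg_x_neg_y(a, b, n) == (-X - Y) % (1 << (2 * n)))
```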

3.1 Example

For 4x4 to 8 bit, the signed multiplication scheme looks as follows:

\[
\begin{array}{cccccccccl}
 & & & & & b_3 & b_2 & b_1 & b_0 & \\
\times & & & & & a_3 & a_2 & a_1 & a_0 & \\
\hline
 & & a_3b_3 & & & & & & & A\\
 & & & & & & a_0b_2 & a_0b_1 & a_0b_0 & B\\
 & & & & & a_1b_2 & a_1b_1 & a_1b_0 & & B\\
 & & & & a_2b_2 & a_2b_1 & a_2b_0 & & & B\\
 & & & \overline{a_3b_2} & \overline{a_3b_1} & \overline{a_3b_0} & & & & Y\\
 & & & \overline{a_2b_3} & \overline{a_1b_3} & \overline{a_0b_3} & & & & X\\
+ & 1 & & & 1 & & & & & X,Y
\end{array}
\]

Squeezing things together gives:

\[
\begin{array}{ccccccccc}
 & & & & & b_3 & b_2 & b_1 & b_0\\
\times & & & & & a_3 & a_2 & a_1 & a_0\\
\hline
 & & & & 1 & \overline{a_0b_3} & a_0b_2 & a_0b_1 & a_0b_0\\
 & & & & \overline{a_1b_3} & a_1b_2 & a_1b_1 & a_1b_0 & \\
 & & & \overline{a_2b_3} & a_2b_2 & a_2b_1 & a_2b_0 & & \\
+ & 1 & a_3b_3 & \overline{a_3b_2} & \overline{a_3b_1} & \overline{a_3b_0} & & &
\end{array}
\]
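The squeezed scheme generalises to $n$ bits: every bit product with exactly one index equal to $n-1$ is inverted, and ones are added at positions $2n-1$ and $n$. A minimal Python sketch of that reading follows; the name `baugh_wooley_signed` is hypothetical, and the code models the arithmetic rather than any particular hardware structure.

```python
def baugh_wooley_signed(a: int, b: int, n: int) -> int:
    """2n-bit 2's complement product of the n-bit patterns a and b."""
    def bit(v: int, k: int) -> int:
        return (v >> k) & 1

    total = (1 << (2 * n - 1)) + (1 << n)   # the two correction ones
    for i in range(n):
        for j in range(n):
            p = bit(a, i) & bit(b, j)
            # Invert the products where exactly one index equals n-1.
            if (i == n - 1) != (j == n - 1):
                p ^= 1
            total += p << (i + j)
    return total % (1 << (2 * n))

# 1001 is -7 and 0101 is 5 as 4-bit signed numbers; -35 mod 256 is 221.
print(baugh_wooley_signed(0b1001, 0b0101, 4))  # 221
```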

When comparing this scheme with the unsigned one (see Section 2.3), it is easy to see the similarities. If we only consider the lower $n$ bits of the output, both schemes produce the same output. At first this is not obvious, since the unsigned scheme uses $a_0b_3$ and $a_3b_0$ while the signed scheme has $\overline{a_0b_3}$ and $\overline{a_3b_0}$. Still the sum (not the carry) is the same, see the following table:

\[
\begin{array}{cc|cc||cc|cc}
a_0b_3 & a_3b_0 & \text{carry} & \text{sum} & \overline{a_0b_3} & \overline{a_3b_0} & \text{carry} & \text{sum}\\
\hline
0 & 0 & 0 & 0 & 1 & 1 & 1 & 0\\
0 & 1 & 0 & 1 & 1 & 0 & 0 & 1\\
1 & 0 & 0 & 1 & 0 & 1 & 0 & 1\\
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0
\end{array}
\]

4 Mixed Multiplication

The RISCV instruction set [6] has a special instruction which takes an unsigned multiplier and a signed multiplicand:

MULHSU rd, rs1, rs2

which multiplies signed operand rs1 (multiplicand) and unsigned operand rs2 (multiplier) and stores the upper 32 bits of the product in rd. We will try to derive an extension to Baugh-Wooley multiplication for mixed operands. Let $a$ be the unsigned multiplier and $b$ the signed multiplicand.

\[
\begin{aligned}
a\cdot b&=\left(\sum_{i=0}^{n-1}a_{i}\cdot 2^{i}\right)\cdot\left(-b_{n-1}\cdot 2^{n-1}+\sum_{j=0}^{n-2}b_{j}\cdot 2^{j}\right)\\
&=\underbrace{\sum_{i=0}^{n-1}\sum_{j=0}^{n-2}a_{i}b_{j}\cdot 2^{i+j}}_{B}-\underbrace{\vphantom{\sum_{j=0}^{n-2}}2^{n-1}\sum_{i=0}^{n-1}a_{i}b_{n-1}\cdot 2^{i}}_{X}
\end{aligned}
\]

The first term ($B$) is non-negative, so it doesn’t pose any problem. The second term $X$ is subtracted, so we are going to find its 2’s complement notation so we can add it. Since we are interested in the binary notation of $X$, define:

\[
x_{i}=a_{i}b_{n-1}
\]

Assuming $X$ to be $2n$ bits wide, we can write out its binary notation:

\[
\begin{array}{l|cccccccccc}
\text{bit position} & 2n-1 & 2n-2 & 2n-3 & \cdots & n & n-1 & n-2 & n-3 & \cdots & 0\\
\hline
\text{bit value } X & 0 & x_{n-1} & x_{n-2} & \cdots & x_{1} & x_{0} & 0 & 0 & \cdots & 0
\end{array}
\]

The 2’s complement of $X$ can be calculated by inverting all bits and adding 1:

\[
\begin{array}{l|cccccccccc}
\text{bit position} & 2n-1 & 2n-2 & 2n-3 & \cdots & n & n-1 & n-2 & n-3 & \cdots & 0\\
\hline
\text{bit value } -X & 1 & \overline{x}_{n-1} & \overline{x}_{n-2} & \cdots & \overline{x}_{1} & \overline{x}_{0} & 1 & 1 & \cdots & 1+1
\end{array}
\]

Note that the bit value row is not final: the addition at bit position 0 will generate a carry that will cascade all the way up to position $n-1$:

\[
\begin{array}{l|cccccccccc}
\text{bit position} & 2n-1 & 2n-2 & 2n-3 & \cdots & n & n-1 & n-2 & n-3 & \cdots & 0\\
\hline
\text{bit value } -X & 1 & \overline{x}_{n-1} & \overline{x}_{n-2} & \cdots & \overline{x}_{1} & \overline{x}_{0}+1 & 0 & 0 & \cdots & 0
\end{array}
\]

4.1 Example

For 4x4 to 8 bit, the mixed multiplication scheme looks as follows:

\[
\begin{array}{cccccccccl}
 & & & & & b_3 & b_2 & b_1 & b_0 & \\
\times & & & & & a_3 & a_2 & a_1 & a_0 & \\
\hline
 & & & & & & a_0b_2 & a_0b_1 & a_0b_0 & B\\
 & & & & & a_1b_2 & a_1b_1 & a_1b_0 & & B\\
 & & & & a_2b_2 & a_2b_1 & a_2b_0 & & & B\\
 & & & a_3b_2 & a_3b_1 & a_3b_0 & & & & B\\
 & & \overline{a_3b_3} & \overline{a_2b_3} & \overline{a_1b_3} & \overline{a_0b_3} & & & & X\\
+ & 1 & & & & 1 & & & & X
\end{array}
\]

Squeezing things together gives:

\[
\begin{array}{ccccccccc}
 & & & & & b_3 & b_2 & b_1 & b_0\\
\times & & & & & a_3 & a_2 & a_1 & a_0\\
\hline
 & & & & & \overline{a_0b_3} & a_0b_2 & a_0b_1 & a_0b_0\\
 & & & & \overline{a_1b_3} & a_1b_2 & a_1b_1 & a_1b_0 & \\
 & & & \overline{a_2b_3} & a_2b_2 & a_2b_1 & a_2b_0 & & \\
 & 1 & \overline{a_3b_3} & a_3b_2 & a_3b_1 & a_3b_0 & & & \\
+ & & & & & 1 & & &
\end{array}
\]

Having the last row (adding a 1 to bit position $n-1$) is a bit unfortunate. If we plan to perform all additions with a tree adder, this single line results in an extra (non-parallel) action. However, there is a trick to squeeze the scheme a bit more:

\[
\begin{array}{ccccccccc}
 & & & & & b_3 & b_2 & b_1 & b_0\\
\times & & & & & a_3 & a_2 & a_1 & a_0\\
\hline
 & & & & \overline{a_0b_3} & a_0b_3 & a_0b_2 & a_0b_1 & a_0b_0\\
 & & & & \overline{a_1b_3} & a_1b_2 & a_1b_1 & a_1b_0 & \\
 & & & \overline{a_2b_3} & a_2b_2 & a_2b_1 & a_2b_0 & & \\
+ & 1 & \overline{a_3b_3} & a_3b_2 & a_3b_1 & a_3b_0 & & &
\end{array}
\]

The correctness of this last step is easy to see: for all values of $a_0b_3$, the sum bit of $\overline{a_0b_3}+1$ equals $a_0b_3$ (disregarding the carry). The carry itself equals $\overline{a_0b_3}$.

\[
\begin{array}{cc|cc}
 & & \multicolumn{2}{c}{\overline{a_0b_3}+1}\\
a_0b_3 & \overline{a_0b_3} & \text{carry} & \text{sum}\\
\hline
0 & 1 & 1 & 0\\
1 & 0 & 0 & 1
\end{array}
\]
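The fully squeezed mixed scheme can be modelled for general $n$ in the same way as the signed one. The sketch below is one possible reading, with `baugh_wooley_mixed` a hypothetical name: products $a_ib_{n-1}$ are inverted for $i\geq 1$, $a_0b_{n-1}$ stays as it is, its inverse is added at position $n$, and a one is added at position $2n-1$.

```python
def baugh_wooley_mixed(a: int, b: int, n: int) -> int:
    """2n-bit product of unsigned n-bit a (multiplier) and signed n-bit b."""
    def bit(v: int, k: int) -> int:
        return (v >> k) & 1

    total = 1 << (2 * n - 1)                          # leading correction one
    total += ((bit(a, 0) & bit(b, n - 1)) ^ 1) << n   # carry of inverted a0*b_{n-1} plus 1
    for i in range(n):
        for j in range(n):
            p = bit(a, i) & bit(b, j)
            if j == n - 1 and i != 0:                 # invert a_i*b_{n-1} for i >= 1
                p ^= 1
            total += p << (i + j)
    return total % (1 << (2 * n))

# a = 0101 is 5 (unsigned), b = 1001 is -7 (signed); -35 mod 256 is 221.
print(baugh_wooley_mixed(0b0101, 0b1001, 4))  # 221
```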

5 Merging

Assuming we have binary signals $s$, $u$ and $m$ representing signed, unsigned and mixed multiplication respectively (possibly decoded with a demux from the RISCV multiplication instruction, see Table 2), we can now merge all implementations into a single multiplier. (Note that we do not actually use the $u$ signal but exploit the fact that for unsigned multiplication both $s$ and $m$ are 0.)

\[
\begin{array}{llccc}
\text{instruction} & \text{description} & s & u & m\\
\hline
\texttt{mulh} & \text{signed} & 1 & 0 & 0\\
\texttt{mulhu} & \text{unsigned} & 0 & 1 & 0\\
\texttt{mulhsu} & \text{mixed} & 0 & 0 & 1
\end{array}
\]
Table 2: RV32 multiplication instructions and their demuxed signals

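Essentially there are 3 different types of partial products: the first one, the last one and all in between. We will use $\oplus$ for the bitwise xor operator. Between the control signals, $+$ denotes logical or (at most one of $s$ and $m$ is 1) and juxtaposition denotes logical and. Unmentioned bit positions are considered 0.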

5.1 The First Partial Product

\[
\begin{array}{l|cccccc}
\text{result bit} & s+m\,\overline{a_{0}b_{n-1}} & s\oplus a_{0}b_{n-1} & a_{0}b_{n-2} & \cdots & a_{0}b_{1} & a_{0}b_{0}\\
\hline
\text{bit position} & n & n-1 & n-2 & \cdots & 1 & 0
\end{array}
\]

As you can see, the first partial product is standard (unchanged) for bit positions 0 up to and including $n-2$. Bit position $n-1$ is inverted for signed multiplications, while bit position $n$ is 1 for signed and the inverted value of $a_{0}b_{n-1}$ for mixed multiplications.

5.2 The Intermediate Partial Products

The intermediate partial products (from 1 up to $n-2$) are shifted depending on their (fixed) number.

\[
\begin{array}{l|ccccc}
\text{result bit} & (s+m)\oplus a_{i}b_{n-1} & a_{i}b_{n-2} & \cdots & a_{i}b_{1} & a_{i}b_{0}\\
\hline
\text{bit position} & n-1+i & n-2+i & \cdots & i+1 & i
\end{array}
\]

Apart from being shifted, the bit values are unchanged. Only the most significant bit is inverted when performing a signed or mixed multiplication.

5.3 The Last Partial Product

\[
\begin{array}{l|cccccc}
\text{result bit} & s+m & m\oplus a_{n-1}b_{n-1} & s\oplus a_{n-1}b_{n-2} & \cdots & s\oplus a_{n-1}b_{1} & s\oplus a_{n-1}b_{0}\\
\hline
\text{bit position} & 2n-1 & 2n-2 & 2n-3 & \cdots & n & n-1
\end{array}
\]

For the last partial product we have to invert its bit values when dealing with signed multiplication. The most significant bit however should only be inverted for mixed multiplications. Finally, both in signed and mixed multiplications the answer should be preceded with a 1 (bit position $2n-1$).
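Putting the three partial product types together, a behavioural model of the merged multiplier might look as follows. This is a sketch under the assumption that $+$ between the control signals is a logical or; the function name `merged_multiply` and its argument order are hypothetical, and plain integer addition stands in for the shared adder tree.

```python
def merged_multiply(a: int, b: int, n: int, s: int, m: int) -> int:
    """2n-bit product; s=1: signed, m=1: a unsigned and b signed, s=m=0: unsigned."""
    def bit(v: int, k: int) -> int:
        return (v >> k) & 1

    sm = s | m
    total = 0
    # First partial product (i = 0).
    p = bit(a, 0) & bit(b, n - 1)
    total += (s | (m & (p ^ 1))) << n
    total += (s ^ p) << (n - 1)
    for j in range(n - 1):
        total += (bit(a, 0) & bit(b, j)) << j
    # Intermediate partial products (i = 1 .. n-2).
    for i in range(1, n - 1):
        total += (sm ^ (bit(a, i) & bit(b, n - 1))) << (n - 1 + i)
        for j in range(n - 1):
            total += (bit(a, i) & bit(b, j)) << (i + j)
    # Last partial product (i = n-1).
    total += sm << (2 * n - 1)
    total += (m ^ (bit(a, n - 1) & bit(b, n - 1))) << (2 * n - 2)
    for j in range(n - 1):
        total += (s ^ (bit(a, n - 1) & bit(b, j))) << (n - 1 + j)
    return total % (1 << (2 * n))

# Upper words for a = 2, b = 0xFFFFFFFF in the three modes (signed, unsigned, mixed).
for s, m in ((1, 0), (0, 0), (0, 1)):
    print(hex(merged_multiply(2, 0xFFFFFFFF, 32, s, m) >> 32))
```

Only the handful of xor and or gates on the boundary bits depend on `s` and `m`; the summation itself is the same in all three modes, which is what allows the adder tree to be shared.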

6 Conclusion

We showed that with some small changes to the partial products, all RV32 multiplications can be handled by a single multiplier, reducing the implementation’s complexity by roughly a factor of three. This might be specifically beneficial in multi-core and vector-like implementations.

References

  • [1] Baugh, C., Wooley, B.: A Two’s Complement Parallel Array Multiplication Algorithm. IEEE Transactions on Computers C-22(12), 1045–1047 (1973). https://doi.org/10.1109/T-C.1973.223648
  • [2] Dadda, L.: Some Schemes for Parallel Multipliers. Alta Frequenza 34(5), 349–356 (May 1965)
  • [3] Sakthikumaran, S., Salivahanan, S., Bhaaskaran, V.S.K., Kavinilavu, V., Brindha, B., Vinoth, C.: A Very Fast and Low Power Carry Select Adder Circuit. In: 2011 3rd International Conference on Electronics Computer Technology. vol. 1, pp. 273–276 (2011). https://doi.org/10.1109/ICECTECH.2011.5941604
  • [4] Swathi, P., Kumar, R., Ramana, B.: Low Density and Latency Optimised Multi Operand Binary Adder using Modified Carry Bypass Addition. International Journal of Engineering in Advanced Research Science and Technology 03(01), 285–294 (September 2021)
  • [5] Wallace, C.S.: A Suggestion for a Fast Multiplier. IEEE Transactions on Electronic Computers EC-13(1), 14–17 (February 1964). https://doi.org/10.1109/PGEC.1964.263830
  • [6] Waterman, A., Lee, Y., Patterson, D.A., Asanović, K.: The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Version 2.1. Tech. Rep. UCB/EECS-2016-118, EECS Department, University of California, Berkeley (May 2016), http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-118.html