Supported arithmetic

Rounding directions

Some of the classes of operators presented in the following sections are templated by a rounding direction. This is the direction chosen when converting a real number that cannot be exactly represented in the destination format.

There are eleven directions:

zr

toward zero

aw

away from zero

dn

toward minus infinity (down)

up

toward plus infinity

od

to odd mantissas

ne

to nearest, tie breaking to even mantissas

no

to nearest, tie breaking to odd mantissas

nz

to nearest, tie breaking toward zero

na

to nearest, tie breaking away from zero

nd

to nearest, tie breaking toward minus infinity

nu

to nearest, tie breaking toward plus infinity

The rounding directions mandated by the IEEE-754 standard are ne (default mode, rounding to nearest), zr, dn, up, and na (introduced for decimal arithmetic).

Floating-point operators

This class of operators covers all the formats whose number sets are \(F(p,d) = \{m \cdot 2^e; |m| < 2^p, e \ge d\}\). In particular, IEEE-754 floating-point formats (with subnormal numbers) are part of this class, if we set apart overflow issues. Both parameters p and d select a particular format. The last parameter selects the rounding direction.

float< precision, minimum_exponent, rounding_direction >(...)

Formats with no minimal exponent (and thus no underflow) are also available:

float< precision, rounding_direction >(...)

Having to remember the precision and minimum exponent parameters may be a bit tedious, so an alternate syntax is provided: instead of these two parameters, a name can be given to the float class.

float< name, rounding_direction >(...)

There are four predefined formats:

ieee_32

IEEE-754 single precision

ieee_64

IEEE-754 double precision

ieee_128

IEEE-754 quadruple precision

x86_80

extended precision on x86-like processors

Fixed-point operators

This class of operators covers all the formats whose number sets are \(F(e) = \{m \cdot 2^e\}\). The first parameter selects the weight of the least significant bit. The second parameter selects the rounding direction.

fixed< lsb_weight, rounding_direction >(...)

Rounding to integer is a special case of fixed point rounding of weight 0. A syntactic shortcut is provided.

int< rounding_direction >(...)