The precise semantics of floating-point arithmetic programs depends on the execution platform, including the compiler and the target hardware, even if these are fully compliant with the widely adopted IEEE 754 floating-point standard. Platform dependencies infringe on the highly desirable goal of software portability (which is in fact promised by heterogeneous computing frameworks like OpenCL): the same program run on the same inputs on different platforms can produce different results. Serious doubts on the portability of numeric applications arise when these differences are behavioral, i.e. when they lead to changes in the control flow of a program.

In this research we design an algorithm that takes a numeric procedure and determines an input that may lead to different branching decisions depending on how the arithmetic in the procedure is compiled. Our algorithm proceeds in two phases. The first phase determines a candidate input, such that minor variations of the value of exactly one of the input variables causes different branching decisions. In the second phase, the algorithm efficiently searches though all possible platform parameters such as decisions made by the compiler about evaluation order, and use or non-use of FMA hardware features. The goal is to find two sets of platform parameters that lead to different branching decisions.

Our implementation of the algorithm requires minimal intervention by the user. We illustrate its operation on a diverse set of examples, characteristic of scientific numeric computing, where control flow divergence actually occurs across different execution platforms.

** Experiments **

We test our algorithm on three
procedures: (1) **ray
tracing**, in the raySphere function there is a branching
decision over the D variable (D> 0). The inputs are s, r, which
represent the coordinates of the sphere origin and the ray, and
radiusSq, a scalar value giving the square of the radius of
sphere; (2) **molecular
dynamics**, this open source code calculates the Lennard Jones
potential of a molecular system. The energy is calculated based on
pairwise distances of atoms and their velocity and acceleration. In
the ComputeAccel function a decision branch occurs over (rr ,
rrCut). rr is the pairwise distance between atoms j1 and j2 and rrCut
is the threshold for pairwise distances. The calculation of rr is
directly effected by a density variable of the atom using a face
centered cubic model; (3) **long summation of 32 numbers**, we compare
serial C code which accumulates a value to a register, and a reduction
kernel, written in OpenCL which is the common way to implement long
summation on a parallel architecture. The result of the sum is
compared to zero.

** 1. Generating Unstable Inputs **

To test our algorithm, we first apply transformations to each
procedure: we attach a main function that calls the tested procedure; we
annotate the symbolic variables and volatile expressions by calling
*klee_make_symbolic_with_sort*
and *klee_tag_reorderable*. Following shows the ray tracing code
after transformation.

```
//Dot Product 3-Vectors
float dot3(float *a, float *b){
float r = a[0] * b[0] + a[1] * b[1]
+ a[2] * b[2];
klee_tag_reorderable(&r, 0, 2);
return r;
}
int raySphere(float *r, float *s, float radiusSq) {
float A = dot3(r,r);
float B = -2.0 * dot3(s,r);
float C = dot3(s,s) - radiusSq;
float D = B*B - 4*A*C;
if (D > 0)
return 0;
}
int main() {
float r[3], s[3];
float radiusSq;
klee_make_symbolic_with_sort(&r, sizeof(r), "r", 8, 32);
klee_make_symbolic_with_sort(&s, sizeof(s), "s", 8, 32);
klee_make_symbolic_with_sort(&radiusSq, sizeof(radiusSq), "radiusSq", 0, 32);
return raySphere(r, s, radius);
}
```

For a conditional q in the execution path, there may exist multiple inputs that exhibit control flow instability. In our experiments we split the domian of one input variable into subintervals, and run our algorithm on each of these subintervals. In the table below it shows the experiment results.

Procedure | Variable | Domain | Subinterval length | #Unstable Inputs |

ray tracing | s[0] | [-50, 50] | 0.01 | 408 |

molecular dynamics | r_j1[0] | [0, 1] | 0.1 | 12 |

long summation | a[0] | [0, 1] | 0.01 | 100 |

** 2. Control Flow Divergence **

To show control flow divergence, we took the generated sets of inputs and tested them on six different architectures. We convert the serial c code into OpenCL code so that we can run the same code and compare the results on these heterogeneous architectures. Below is the list of these architectures.

type | Manufacture | Description | Year | FMA? | |

1 | CPU | Intel | E5 2650 | 2012 | * |

2 | CPU | AMD | A8-3850 | 2011 | N |

3 | GPU | NVIDIA | GF108 Quadro 600 | 2010 | N |

4 | GPU | NVIDIA | Tesla C2075 | 2011 | Y |

5 | GPU | NVIDIA | Tesla K20 | 2013 | Y |

6 | GPU | AMD | Radeon HD 65550D | 2011 | N |

** 2.1 Ray Tracing Results **

For the generated 408 sets of inputs. 45 of these produce results on different sides of zero for D on different architectures. In the table below some of these inputs are shown. D1, D2 and D3 show the value of D for D1: machine 1,2, D2: machine 3,4,5 and D3: machine 6.

s[0] | s[1] | s[2] | r[0] | r[1] | r[2] | radiusSq | D1 | D2 | D3 |

-33.3704719543457 | -53 | -52.0185546875 | -33.999900817871 | -54 | -53 | 2.9802322E-8 | 4.54933548 | -3.562728882 | 0.0 |

30.1333904266357 | -41 | -40.0234375 | 30.870059967041 | -42 | -41 | 1.907348633E-6 | 3.874450684 | -7.142879486 | 0.0 |

40.0934867858886 | -53 | -52.0185546875 | 40.8501510620117 | -54 | -53 | 2.9802322E-8 | 7.814369202 | -4.13470459 | 0.0 |

40.073875427246 | -53 | -52.0185546875 | 40.8301696777343 | -53 | -54 | 2.9802322E-8 | -6.409370422 | 7.924546242 | 0.0 |

-27.4816207885742 | -53 | -52.0185546875 | -27.9999465942382 | -54 | -53 | 2.9802322E-8 | 6.105501175 | -6.985953331 | 0.0 |

** 2.2 Molecular Dynamics Results **

Following table shows some of the results. R is the difference in pairwise distances of the atoms j1 and j2 to rrCut (rr-rrCut). R1,R2 shows result for machine 1,2,6 and machine 3,4,5 respectively.

r_j1[0] | r_j1[1] | r_j1[2] | r_j2[0] | r_j2[1] | r_j2[2] | Density | rrCut | R1 | R2 |

0.50000011920928955 | 23.71043968200683594 | 22.29841232299804688 | 3.81469726562E-6 | 3.81469726562E-6 | 3.81469726562E-6 | 0.3 | 2.25 | 0.0 | -2.38418579101562E-7 |

0.5078125 | 0.12553329765796661 | 7.8125E-3 | 23.9375 | 23.75 | 22.25 | 0.3 | 2.25 | -2.38418579101562E-7 | 0.0 |

0.4000244140625 | 0.41066083312034607 | 26.125 | 6.25E-2 | 1.5625E-2 | 6.25E-2 | 0.2 | 1.44 | 5.72204590376657E-8 | -6.19888305131155E-8 |

0.10009765625 | 27.23913192749023438 | 4.8828125E-4 | 27.21875 | 0.125 | 25.9453125 | 0.2 | 1.44 | -6.19888305131155E-8 | 5.72204590376657E-8 |

0.2000005841255188 | 34.19741058349609375 | 1.81899E-12 | 0.18798828125 | 7.4505806E-9 | 33.69966888427734375 | 0.1 | 0.25 | -1.49011611938476E-8 | 0.0 |

0.3125 | 6.25E-2 | 33.701171875 | 0.300048828125 | 2.464281208813E-2 | 6.103515625E-5 | 0.1 | 0.25 | 0.0 | -1.49011611938476E-08 |

** 2.3 Long summation and Reduction Results **

The 32 input numbers have a range between [-1.05,1]. Sum (1,2,3,5,6) shows the reduction results for machine 1,2,3,5,6 and sum 4 shows the reduction results for machine 4. Our code compared the result of the summation to 0. The reduction kernel and serial summation results on the inputs generated produce results on either side of zero.

Sum 1,2,3,5,6 | Sum 4 | Serial Sum |

-4.65661287E-8 | -8.7544322E-8 | 4.49363142E-8 |

-5.96046448E-8 | -8.7544322E-8 | 1.054722816E-7 |

-5.96046448E-8 | -8.7544322E-8 | 9.1502443E-8 |

- Faculty: Miriam Leeser, Thomas Wahl
- Research assistant: Mahsa Bayati, Jaideep Ramachandran, Yijia Gu

- Miriam Leeser, Jaideep Ramachandran, Thomas Wahl, and Devon
Yablonski. OpenCL
floating point software on heterogeneous architectures- portable or
not? In
*Workshop on Numerical Software Verification (NSV)*, 2012. [pdf ] - Miriam Leeser, Saoni Mukherjee, Jaideep Ramachandran, and
Thomas Wahl.
**Make it real: Effective floating-point reasoning via exact arithmetic.**In*Design Automation and Testing in Europe*(DATE), 2014. - Yijia Gu, Thomas Wahl, Mahsa Bayati, and Miriam Leeser.
**Behavioral non-portability in scientific numeric computing.**In*Parallel and Distributed Computing*(EURO-PAR), pages 558--569, 2015. [pdf]