Numerical Optimization Part 01

The objective of this post is to introduce fundamental concepts in computational optimization for design applications. This tutorial was part of the Advanced Topics in Digital Design and Fabrication 2016. The content will be broken into a series, starting from basic notions and moving towards more complex concepts.

Problem Setup

In order to proceed by example we will use a rather trivial scenario which can be verified by conventional geometric modeling. Setting up a simple problem before implementing iterative or otherwise involved processes is quite important, as there is nothing more frustrating than being unable to verify the results with a secondary trusted method.

The goal will be to find the parameter of a point on a curve at a particular distance from the start of the curve. This approach is commonly used for dividing or polygonizing curves by chord-length into equally sized segments. There is a direct geometric solution to this problem: finding the intersection point between the curve and a circle of the desired radius centered at a given point on the curve (see figure below).

[Figure: root finding]

The first and rather most important step in finding the parameter of the point at distance is to model the objective function. We can think of this as follows: (a) we select a parameter and evaluate the curve, (b) we measure the distance between the evaluated point and the start of the curve, and (c) we subtract the distance from the desired one. We have thus converted the problem into a root-finding problem, where all we need to look for is the parameter of the curve at which the objective function yields zero.
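
For reference, here is a minimal sketch of such an objective function in Python. The parametric curve is a stand-in (a plain sine wave) so that the snippet runs on its own; in Rhino the evaluation would query the actual curve object instead, and the names used here (curve_point, objective, TARGET) are illustrative rather than taken from the original code.

```python
import math

TARGET = 100.0  # desired chord distance from the curve start, in mm

def curve_point(t):
    """Stand-in parametric curve (a sine wave). In a CAD context this
    would evaluate the actual curve at parameter t."""
    return (t, 40.0 * math.sin(t / 50.0))

START = curve_point(0.0)  # start point of the curve

def objective(t):
    """Distance from the curve start to the point at parameter t,
    subtracted from the target. The parameter we are looking for
    is a root of this function."""
    x, y = curve_point(t)
    dist = math.hypot(x - START[0], y - START[1])
    return TARGET - dist
```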

[Figure: curve chord length]

Brute Force

It is instructive and useful to start from the most naive approach to finding the desired solution, as this helps us understand some characteristics of the problem in general. We will sample the curve at uniform intervals and generate a long list of points, which we can then search through to find the one whose objective function value is closest to zero. The code below illustrates the idea:

  • We initialize the best solution pair (parameter and value) with some really bad numbers, something really large such that any other solution will be better and will immediately overwrite it.
  • For a fixed number of iterations we divide the curve into equal parametric intervals and evaluate the points along the curve. For each point we compute the objective function. You may think of this as marching forward along the curve taking snapshots.
  • We check if the generated value is lower than the current best and, if indeed it is, we store it as the new current best estimate.
  • For visualization purposes we can also store the evaluated points and, even better, the pairs of parameters and objective function values in the form of an XY graph. This will be very helpful later.

[Figure: brute]
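
A minimal sketch of that loop, reusing the objective defined earlier (the history list used for plotting the XY graph is an illustrative addition):

```python
def brute_force(f, t0, t1, samples=1000):
    """Uniformly sample the domain [t0, t1] and keep the parameter
    whose absolute objective value is closest to zero."""
    best_t, best_y = None, float("inf")   # deliberately bad initial best
    history = []                          # (parameter, value) pairs for the graph
    for i in range(samples + 1):
        t = t0 + (t1 - t0) * i / samples  # march forward along the curve
        y = abs(f(t))                     # the abs() produces the bouncing graph
        history.append((t, y))
        if y < best_y:                    # better than the current best?
            best_t, best_y = t, y
    return best_t, best_y, history
```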

The visuals on screen after running the process for 1000 iterations are quite revealing. We can observe the enormous number of points on screen but, most interestingly, the objective function for the entire domain of the curve. What we can clearly see is a bouncing curve which hits the x-axis three times. This makes absolute sense, as we can verify visually that the circle indeed intersects the curve at three points. The bouncing effect is due to the absolute value used above. In the image above we can verify that the point at parameter 256.7951 is at distance 99.9764mm from the start of the curve, which is 0.0236mm away from the target of 100mm. We call this difference between the target and the current best the residual error.

[Figure: brute force]

It took a full 1000 iterations to reach this level of accuracy. So naturally the question that arises from all this is: how does the number of iterations relate to the accuracy of the result? The table below shows the residual (how far away we are from the exact solution of 100mm) against the number of iterations. Notice that, generally, the larger the number of iterations the better the result (closer to zero). Nevertheless this is not always true, as for example 50 iterations give a better result than 100. There is a hit-or-miss aspect to this approach, as we don't really control the distribution of the points along the curve near the target. The rate at which an iterative process approaches its target is called its convergence. You may think of this as the number of additional iterations we need to take in order to drop one more digit towards zero.

Iterations    Residual (mm)
10            12.0589
50             0.1063
100            0.7407
500            0.1146
1000           0.0236
5000           0.0011
10000          0.0003

Bouncing Walk

While this objective function evaluates really quickly on modern computers, there is something really wasteful about the whole approach of sampling so many points. Think of it this way: we start from the start point of the curve (which we already know is not a very good initial guess) and keep moving forward by taking small steps (equal to the domain size divided by the number of samples). The objective value keeps reducing (follow the graph from left to right), but even when we go past the point at which the value starts increasing again (the bounce point) we don't look back and just keep marching forward!

What if we add a bit of logic to detect when our improving value starts deteriorating? At that point, instead of moving forward, we can switch direction and start moving backwards. But there is a catch: if we keep stepping back and forth at the same pace we will become trapped oscillating about the target value. We can mitigate this by halving the step size every time we need to change direction. Hopefully, after some bounces about the target, we will home in on the solution up to some small tolerance value.

[Figure: bounce]
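
A minimal sketch of the bouncing walk, under the same assumptions as the earlier snippets:

```python
def bouncing_walk(f, t0, t1, tol=1e-7, max_iter=1000):
    """Walk along the domain; whenever the objective value deteriorates,
    reverse direction and halve the step size."""
    dx = 0.1 * (t1 - t0)         # initial step: 10% of the domain (arbitrary)
    t, y_prev = t0, abs(f(t0))
    for _ in range(max_iter):
        t += dx                  # take a step (no domain clamp, for brevity)
        y = abs(f(t))
        if y < tol:              # close enough: report the parameter
            return t, y
        if y > y_prev:           # got worse: bounce back at half the pace
            dx = -dx / 2.0
        y_prev = y
    return t, y_prev
```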

The code above is rather self-explanatory. We introduced a few new variables, namely the stepping factor dx, which is initially chosen arbitrarily as 10% of the curve domain. This can lead to problems, so some care is needed with this selection. In fact we want a step factor as large as we can get away with, such that the number of initial incremental steps seen below is minimized. Once we are close to the solution we can home in really quickly, as we divide the step size by a factor of two for each bounce.

This brings us to the notion of a current value y and a previous value Y (y_prev in the sketch above). We can detect when we have gone past the target (in either direction) by simply comparing the current and previous values of the objective function. If the current value is worse we divide the stepping factor by two and flip its sign, so that the iteration takes us back towards the target. In the image below you can see that the sampling points start condensing towards the intersection point. You can zoom in on the CAD model and observe how both the solution path and the graph trace a zig-zag spiral pattern near the solution.

[Figure: bouncing]

The next important improvement to this approach is the termination condition, whereby we can stop searching once we are comfortable with the residual value. If, say, we are 0.0000001mm away from the target we call it quits and report the parameter found. This tolerance value is important as it is bound by the precision of the real number representation, which can only hold a finite number of decimal places (a 64-bit double holds roughly 16 significant decimal digits, for instance). Generally, it makes more sense to choose this value based on the problem scenario rather than the machine capability. For instance, five decimal places for architectural applications, assuming the units being used are millimetres, is more than enough, since construction tolerances are easily many orders of magnitude larger.

Divide and Conquer

Using the bouncing walk method we achieved a near-perfect result (up to tolerance) within only about 86 iterations. This is a huge improvement compared to the 10,000-plus iterations of uniform sampling. We can observe that the most wasteful aspect of the current method is the initial phase of trying to find the bounce point, about which we can quickly converge to the target by halving the search step. We can in fact improve the above algorithm by keeping just this last behavior.

The algorithm below is known as bisection and the overall strategy as divide and conquer. It is based on a mathematical concept known as the intermediate value theorem, or Bolzano's theorem. It goes like this: if a continuous function f( x ) = y defined on the domain [xmin, xmax] satisfies f( xmin ) * f( xmax ) <= 0, then the function has at least one root (zero-crossing point) in that domain. In other words, if the function is positive at one end of its domain and negative at the other (or vice versa), it must cross the x-axis somewhere in between if it is continuous.

We can thus start from the entire domain of the curve, cut it in half and keep the half that contains the crossing point. We can then iterate by incremental subdivisions until we hit the desired tolerance.

[Figure: bisect]

The code below has only minor changes compared to the bouncing walk. We first establish the lower and upper bounds of the domain (xmin and xmax) and then evaluate the objective function at the boundaries (ymin and ymax). Notice that we got rid of the absolute value; otherwise we would never get negative signs, which are what help us select the correct half, and the process would fail. It is thus important to check before we begin that our original domain is sane.

For every iteration we compute the mid-point of the domain ( xmid = ( xmin + xmax ) / 2.0 ) and its associated objective function value ( ymid = f( xmid ) ). If the crossing point is in the lower half ( ymin * ymid <= 0 ) then we bring the upper bound down to the mid-point. If the crossing point is in the upper half ( ymid * ymax <= 0 ) then we bring the lower bound up to the mid-point. If neither of the above is true then something went wrong (no crossing point detected), so we can just abort the mission with an error.
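
A minimal sketch of the bisection loop (same assumptions as before; note the signed objective, without the absolute value):

```python
def bisection(f, xmin, xmax, tol=1e-7, max_iter=100):
    """Keep halving the bracket that contains the sign change.
    Requires f(xmin) * f(xmax) <= 0 to begin with."""
    ymin, ymax = f(xmin), f(xmax)
    if ymin * ymax > 0:
        raise ValueError("no crossing point in the initial bracket")
    for _ in range(max_iter):
        xmid = (xmin + xmax) / 2.0   # mid-point of the current bracket
        ymid = f(xmid)
        if abs(ymid) < tol:          # residual within tolerance: done
            return xmid
        if ymin * ymid <= 0:         # crossing in the lower half
            xmax, ymax = xmid, ymid  # bring the upper bound down
        else:                        # crossing in the upper half
            xmin, ymin = xmid, ymid  # bring the lower bound up
    return (xmin + xmax) / 2.0
```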

In the snapshot below we can observe the subdivision pattern of the algorithm and its rapid convergence towards the intersection point. The black line is the objective function without the absolute value (hence there is no bouncing about the x-axis). There are still three crossing points on the x-axis, as there are still three intersection points between the curve and the circle.

Note that the fact that the first intersection point was found is rather coincidental. If we stretch the curve back we will most likely find the second intersection point instead. There are subtle aspects such as this which we will discuss later. Nevertheless, the solution was found in only 25 iterations. This is a major improvement, but can we do even better?

[Figure: bolzano]

False Position

The intuition for improving the previous method is based on the observation that selecting the mid-point to split the domain is rather arbitrary. We could pick any random point within the domain and the method would still work fine (and, if we are lucky, perhaps even faster). Therefore the question becomes: is there a more indicative point we can choose to bring us faster to the solution? In fact there is: we can select the point where the line through ( xmin, ymin ) and ( xmax, ymax ) crosses the x-axis. In other words, we perform a linear interpolation between the boundary values to obtain the next guess!
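
A minimal sketch of false position; the only change relative to bisection is how the split point xnew is estimated:

```python
def false_position(f, xmin, xmax, tol=1e-7, max_iter=100):
    """Bracketing as in bisection, but the split point is where the
    straight line through the bracket endpoints crosses the x-axis."""
    ymin, ymax = f(xmin), f(xmax)
    if ymin * ymax > 0:
        raise ValueError("no crossing point in the initial bracket")
    for _ in range(max_iter):
        # linear interpolation instead of the arbitrary mid-point
        xnew = xmin - ymin * (xmax - xmin) / (ymax - ymin)
        ynew = f(xnew)
        if abs(ynew) < tol:
            return xnew
        if ymin * ynew <= 0:         # keep the half containing the crossing
            xmax, ymax = xnew, ynew
        else:
            xmin, ymin = xnew, ynew
    return xnew
```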

Indeed the results seem to have improved! Generally it takes about three quarters to half the number of steps to meet the same tolerance criteria compared to bisection. The only change that took place was the estimation of the new partitioning value xnew, which brings us to the next question: is there a yet better way to estimate a new value and further reduce the number of steps needed to reach a satisfactory solution? Indeed there are several more methods in the same spirit: Ridders' method, a variation of the linear interpolation scheme seen above, and Brent's method, which uses a more complex estimation scheme known as inverse quadratic interpolation. Look them up! In fact Brent's method has such good characteristics that it is the most commonly used (even Rhino uses it for intersections etc.).

[Figure: falsepos]

Newton-Raphson Method

In spirit, all of the previous methods are somewhat similar. We use some initial guesses to improve subsequent estimations and close in on the target. We evaluated the objective function at a number of samples, from way too many down to only three (lower, upper and interim), and fit an approximate function (a linear or quadratic form) to extract the new candidate position.

The Newton-Raphson method similarly takes better and better estimates, but it uses not only direct values of the objective function but also its derivative. The logic is simple: we use the tangent at a point on the objective curve to obtain a better estimate by following the slope downhill. If the derivative is zero (the tangent is horizontal) we must have encountered a special point where there is no better place to move (there is no more downwards direction as we have hit the bottom).

[Figure: newton]

This method is much simpler than the previous ones, except for one minor detail which requires clarification. The computation of the derivative is made using an approximation method known as the finite differences scheme. In spirit, a derivative is the result of looking at what a function does about a point: from which direction it is coming and towards which it is heading. We thus look at both the left and right sides about the particular point and compute the slope dy / dx.

The derivative f'( x ) is the limit of ( f( x ) - f( xo ) ) / ( x - xo ) as x → xo, or alternatively, in delta step form, ( f( x + dx ) - f( x ) ) / dx as dx → 0. The step dx is approximated by a small number, say 0.000001 (computers cannot handle infinitely small numbers anyway), and all we need to do is evaluate the function once more at ( x + dx ) to compute an approximation of the derivative.

This simplistic approach to differentiation is both powerful and dangerous. On the one hand we don't have to go through the trouble of figuring out an analytical form for the derivative of the objective function; in other words, the approach is extremely general and simple. On the other hand we need to be careful when playing with very small numbers on computers, as we may hit the limits of the finite precision of 64-bit double values. Indicatively, addition and, to some extent, multiplication of real numbers are fairly stable, but both subtraction and especially division can produce inaccuracies.

Once the approximate derivative is computed we update the estimate by following Newton's rule ( xn+1 = xn - f( xn ) / f'( xn ) ). In fact this is a first order approximation (using only the first derivative) which can be extended to higher order terms (second, third etc. derivatives) using Taylor expansions. Again, in spirit, we fit a better and better local approximation (line, quadratic, cubic etc.) to estimate the next value.
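
A minimal sketch of Newton-Raphson with a finite-difference derivative, including the bounds clamp mentioned below (the lo/hi domain parameters are illustrative):

```python
def newton_raphson(f, x0, lo, hi, tol=1e-7, dx=1e-6, max_iter=50):
    """Newton-Raphson iteration with a forward finite-difference
    approximation of the derivative, clamped to the domain [lo, hi]."""
    x = x0
    for _ in range(max_iter):
        y = f(x)
        if abs(y) < tol:             # residual within tolerance: done
            return x
        dydx = (f(x + dx) - y) / dx  # finite difference derivative
        if dydx == 0.0:              # horizontal tangent: nowhere to go
            raise ArithmeticError("zero derivative encountered")
        x = x - y / dydx             # Newton's update rule
        x = min(max(x, lo), hi)      # bounds check: stay inside the domain
    return x
```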

Some additional words of caution. In the code above there is one input parameter that allows one to set the initial search value. By experimenting with it you will notice that sometimes the solver gets trapped inside basins of the objective function, and sometimes it overshoots and hits the boundary of the curve's domain (hence the bounds check after the computation of the new estimate value). In conclusion, the Newton-Raphson method is extremely fast (see quadratic convergence) but some care is needed to select a good first guess.

Conclusions

The first part of the optimization tutorial series covered a few critical concepts required to approach computational search methods, through a very simple example.

  • Model Formulation: When setting up automated search processes it is critical to define the problem correctly from the start. It is important to have alternative means to approach and verify the results, and it is super critical to express the problem concisely in the form of an objective function. This may transform the problem into a search for special points of interest, which mathematically becomes zero-crossing/root finding or minimization/maximization of a quantity.
  • Problem Characterisation: This involves understanding the shape of the problem. While the uniform sampling method, for instance, was indeed wasteful, it did provide a clear picture of the problem's behaviour. We could identify the curvature of the problem, the multiplicity of solutions, and how densely the objective function maps points across its domain. We typically have to work parametrically, meaning that we have to deal with similar or slightly varying cases of the same problem typology. It is perfectly reasonable to plot out a few complete maps to guide our approach.
  • Boundary Conditions: We approached the particular problem as implicitly being quasi-bounded. While we did employ the curve's domain as a guideline for bracketing and for restraining Newton's explosions, we generally treated the subject as unconstrained. It is important to clearly understand the exterior boundaries of the problem because this results in a better contextual understanding as well as improved performance of the optimization algorithm. As a concrete example, we may proactively clip the lower and upper bounds of the domain of interest based on prior knowledge: we already knew in advance that starting from the lower bound of the curve's domain is wasteful, as points near the starting point are by definition far too close to the start to reach the target distance. In addition, for bracketing methods we need to make sure that we have sufficient conditions (see the intermediate value theorem logic) before we even start searching.
  • Initial Conditions: Nearly all of the methods presented above perform exceptionally faster if the initial guess is quite close to the target, and they can fail catastrophically if that initial guess is too far away. There is no bullet-proof method for making that initial guess other than incorporating some prior knowledge from the context of the problem. For example, we know that NURBS curves in Rhino are parametrized approximately by arc-length (the domain of the curve approximately maps the travel distance along the curve). For lightly contorted curves we could therefore directly use the target chord-length distance as the initial parameter, based on the intuition that for small spacings chord length and arc length should be pretty close to one another.
  • Success Criteria: We already discussed the notion of error tolerance and how it makes the most sense to choose one based on the contextual properties of the problem, while always keeping in mind the finite precision of numerical values in computation. We also saw that there is no single best method for solving all kinds of problems; each has some key benefits. For instance, the Newton-Raphson method is extremely efficient in comparison to all the other methods but it is also rather unstable if not carefully conditioned. Bracketing methods are more dependable because, as long as the boundary condition is valid, the solution will be found no matter what.

So with those preliminary thoughts we will move on from univariate (one parameter) problems to multivariate optimization problems. In the next instalment we will keep exploring problems that can be formulated in a mathematically smooth manner (continuous and differentiable functions). Later we will explore a more peculiar and exciting class of problems which require combination and reconfiguration of discrete entities. Finally, we will discuss in detail how to express and impose constraints (infeasible regions) within optimization processes.