Sketcher performance - where do those CPU cycles go?

This is the continuation of Sketcher performance - why is Python the culprit?

I ended up understanding how to use the VStudio profiler well enough to draw some conclusions, and Python is not the culprit. (I’ll put the FreeCAD version info and details on the procedure at the end. I’ll say here, though, that the profiling data was obtained with a Windows build, on Win10 Home hosted by a Lenovo laptop, AMD Ryzen 5 3500U @ 2.10 GHz, with 16.0 GB of RAM.)

In short, for sketches with (too) many elements and constraints:

  1. about 58% of the CPU goes into SketchObject::solve(bool), with about 36% into Sketch::solve(void) and 21.5% into Sketch::setupSketch(…) (see Figure 1 below). Most of the CPU goes on Eigen’s account, but there are some small bits that I’ll come back to.
    In any case, Python adds an overhead of about 3% to this branch of the call tree.
  2. about 38% of the CPU goes into some async execution with an anonymous (not assigned to any module) lambda - see Figure 2
    I had some trouble identifying where this comes from. Initially I thought it might be some callbacks passing the results back into Python-land once the “C++ solving” is complete, but I find it far more plausible that it is the
            auto fut = std::async(&System::identifyDependentParametersSparseQR,this,J,jacobianconstraintmap, pdiagnoselist, /*silent=*/true);

            makeSparseQRDecomposition( J, jacobianconstraintmap, SqrJT, rank, R, /*transposed=*/true, /*silent=*/false);

            int paramsNum = SqrJT.rows();
            int constrNum = SqrJT.cols();

            fut.wait(); // wait for the execution of identifyDependentParametersSparseQR to finish

towards the end of the int System::diagnose(Algorithm alg) method in the GCS.cpp file (because, yeah, Eigen again).
I would appreciate, though, an explicit confirmation from those in the know that the first hypothesis (passing data back into Python-land via async callbacks) is rubbish and I can forget about it.

Figure 1 - the SketchObject::solve(bool) CPU budget
sketcher-solve-cpu.png
Figure 2 - the async call into some lambda
sketcher-async-cpu.png
The FreeCAD version details

OS: Windows 10 Version 2009
Word size of FreeCAD: 64-bit
Version: 0.21.0.31291 (Git)
Build type: Release
Branch: fallen-fruits
Hash: d763f5a64aef76c94653e7f43e854d3c962210d3
Python 3.8.10, Qt 5.15.2, Coin 4.0.1, Vtk 8.2.0, OCC 7.6.3
Locale: English/Australia (en_AU)
Installed mods: 
  * CurvedShapes 1.0.4
  * freecad.gears 1.0.0
  * lattice2 1.0.0
  * sheetmetal 0.2.59

The profiling procedure

  • Build the Release configuration following the instructions
  • download the solverLoad-2oo.py (4.13 KB) file and place it in some directory
  • open the solution in VisualStudio and set the FreeCADMainCmd as the solution’s ‘Startup project’
  • edit the FreeCADMainCmd project’s properties and, in the Debugging section, set the full path to the solverLoad-2oo.py as the value for the “Command Arguments” property
  • in the VisualStudio Debug menu, choose the “Performance Profiler (Alt-F2)”
  • select the “CPU Usage” checkbox then press the “Start” button at the bottom of the page
  • wait for the execution to finish (in my config, took about 1m40secs) and the report page to finish analyzing the data
  • click “Open details” link (under the plot) and set the “current view” to “Module”
  • Right click on the list of modules and, on the popup menu, pick “Load all symbols”. Wait for the view to settle
  • switch the current view to “Call tree”
  • explore at your leisure the contributions of various functions to CPU consumption

That really sounds like it.

Yup. That lambda adds concurrency for the identification of parameters (showing which curves and parts of curves are constrained and which are not) via a QR decomposition. The nice part of it is that it runs totally in parallel with another QR decomposition of the same size for identifying rank and redundant/conflicting constraints (if you have a CPU monitor you will see another core at 100% during QR computation).

I am very well aware that QR decomposition is costly. When we switched from Eigen’s dense QR decomposition to the sparse version, we interacted with Gaël from Eigen to solve some bugs in Eigen. He said that QR is especially costly and that, if we could get away with another type of decomposition, such as LU, which is also rank-revealing, we would get quite some improvement. The thing is that nobody has managed to obtain information about which constraints are conflicting/redundant, or which parameters are constrained and which ones are not, from an LU decomposition.

All in all, your observations sound right to me: Python cost is negligible. The lion’s share of the cost is QR decomposition using Eigen (the share will probably get higher and higher as the sketch is bigger in terms of elements and constraints).

Thanks for looking into it. It is reassuring to see that nothing seems off. If you still want to look deeper into other bits, please post your findings.

I saw it, on Windows. Does it happen on Linux/Mac too? (I don’t have Linux/Mac machines at the moment to give it a try)
Not so long ago, it didn’t; I saw heaps of thread-pool projects addressing the shortfall, but I also saw signs that their use can no longer be justified by the absence of thread pooling in std::async.
E.g. V1 of this thread-pool library (May 2021) bears in its “Motivation”:

… currently only the MSVC implementation[6] of std::async uses a thread pool, while GCC and Clang do not.

but that is entirely gone for v3

Oh, I’d have to dig deep for this one - it’s been… what?.. 30 years since my courses in linear algebra. I’m afraid I won’t have time for this chunk of research in the near future.

O(N^3) in the number of constraints, if my gut feels it right.


I don’t think looking deeper into the current state will bring notable improvements (>15% reduction in CPU).
Anyway, I’ll post another comment with the things that can be done on the current state of the source code.

Adrian

PS Oh, God, what have I gotten into? I swear all this started with buying my first 3D printer a wee less than 3mo ago

Low hanging fruits (hanging so low they’re more like fallen fruits):

  • Constraint::redirectParams(MAP_pD_pD redirectionmap) takes its input argument by value; the operation costs 4.2% of the total CPU consumption. Passing the argument as const MAP_pD_pD& will likely eliminate that cost
    PerfReport-FC-deshuffle.png
  • various Constraint classes use deeply inefficient expressions (and unnecessarily sacrifice performance for source code readability?)
    E.g. the use of pow(x, 2) instead of an ‘inline double square(double x) { return x*x; }’ (in ConstraintAngleViaPoint, ConstraintPointOnEllipse, ConstraintPointOnHyperbola), or computing the length of a segment only to use its squared value (why take a sqrt just to square it back? - e.g. ConstraintAngleViaPoint::grad(double *param)).
    Optimizing them might gain back at least half of the 3% CPU they currently cost.
    PerfReport-FC-grad.png

One major thing that may contribute to performance: bringing OpenMP into the picture (at least on the Windows port; my GPU gathers spiderwebs while I’m running FreeCAD).
Eigen can theoretically make use of it if present (IDK if it actually uses it in the algos FC is using).
OpenCascade should be able to use it more effectively (boolean operations between parts, maybe?)

Has anyone looked into using OpenMP inside FC? Are there some relevant threads I can use to bring myself up to date on the matter?
(I’m annoyed by my GPU’s lack of activity, the lazy b*tard.)

Passing the map by const & drops the execution time by about 10% (from 1m38s to 1m28s - based on a statistical set of one lonely sample :large grin:. And under the conditions of executing it under the Profiler, with a script designed to put unreasonable load on the Sketcher, and with uncontrolled external factors, like YT music playing or not playing in a browser).

Objectively, tho’, SubSystem::redirectParams() is no longer noticeable against the allocated CPU ticks.

Do you want me to send a pull/merge request, or can someone just change the signature of Constraint::redirectParams(MAP_pD_pD redirectionmap) to take a (const &) and be done with it? (I’d prefer the latter - less work for me :grinning:)
PerfReport-FC-deshuffle-fixed.png

Passing CI:
https://github.com/FreeCAD/FreeCAD/pull/8070

Thanks!

Many times it is possible to reconcile the two.

About pow(x, 2), I wonder whether it is really optimized in the library. I would have to check the assembly code produced. If it is not optimised, an inline function as you indicate can be used without loss of readability.

If it is just a function “.length()”, a squaredLength() function could be created and used. That would not be any less readable.

I think tests were done in the past with disappointing results. But this might have changed. Checking again would be good.

I cannot recall (and I am in a rush to search for them now).

I appreciate you being annoyed by your GPU’s lack of activity. It may get you creative in tackling it :mrgreen:

Feel free to investigate further or maybe you get into providing some PRs :wink:

At least on my setup (Windows, VC++), it is not - the implementation sticks to the C++ reference and promotes to the highest-precision floating-point type of the two args (form 7).

I searched, but the only relevant mentions of OpenMP in the first 3 pages of results popped up in re the FEM wb.

I might get the constraints’ grads into the “aesthetically pleasing for perfectionists” state next week, just to have a go at playing with PRs in the FC project.

Then, who knows? Depends on what annoys me the most - the do-nothing GPU or the lack of openmp libs in the VS/VC++ land - so, yeah, it may happen.

Actually, I might try a bit more than just optimizing the expressions (DeepSOIC, if he’s still around, may want to chime in)

  1. I’ll try to switch the older code to DeriVector2 (yeah, I’d probably prefer a more formal automatic differentiation library being plugged in, but let’s not jump that far away… for now :mrgreen: )
  2. I’ll try to redefine the “constraint errors” and their gradients as “square distance errors” wherever they are currently defined in terms of “distance errors”. Two reasons:
       • less CPU-intensive to compute
       • the sqrt function has an infinite derivative at 0 (zero). In an iterative, gradient-based approach to solving an equation system, this is likely to throw the correction steps outta whack (the DoF/dimension closest to its constrained position shouting loudest “me, me, adjust me first, I have the highest gradient” - when, in fact, the other DoFs farther away actually need more attention).

Look, I can do nothing about the complexity of a QR decomposition, but if the problem can be formulated in terms that improve the stability of the algo that uses QR, fewer such expensive steps will be necessary. (Think of chi-squared minimization methods: they don’t try to minimize “sqrt(chi-squared)” - maybe they have their reasons :mrgreen:)

  3. I’ll try to express “angle errors” in their specific constraints in terms of “square distance errors” too. More precisely, redefine the errors in terms of the difference between the expected vs actual dot(v0, v1) and cross(v0, v1) (where v0, v1 are vectors derived from the specific geometry of the constraint - oh boy, swampy ground here; I hope all the angle constraints have vector elements one can use). Rationale:

     1. eliminate the “impedance mismatch” between large linear dimensions and angles in the (-pi, pi) range, which leads to fudge factors that don’t quite work
     2. angles (and their errors) have a bad habit of being periodic - which makes the (hyper-)surface of the restrictions show local minima/maxima into which an iterative algo can slide and get stuck. By contrast, dimensions (or the squares thereof) are better behaved
     3. less CPU-expensive to compute - just multiplications and additions (and maaaaybe a single sin/cos evaluation at the beginning, if the constraint is actually defined as an angle; but a good number of times they are “tangent to” and “normal to”, which have trivial sin/cos-es)

I’m not 100% sure what you are up to, but be careful with it.

  1. if the gradient of the error function zeroes out where the error zeroes out - that is bad. It causes trouble with convergence, solver precision and DoF counting. A “classic” case of such a situation, which can be created with just the UI, is constraining a point to an intersection of tangent things (point on circle + point on line + tangent between line and circle).
  2. scales of error functions should be comparable. At the moment they are not (some error functions are effectively millimeters, while others are radians). This causes trouble with sketches of large geometric size (a few meters and up). It’d be great if you could fix it (I began fixing it; that led me to try and rewrite the solver architecture completely, which I did half-way and then abandoned).



:wink: you are not alone :wink: :wink: though, it took me far more than three months to plough my dirty hands into freecad code

okay, i see you already know quite a bit about this =) , I hope you’ll actually fix it.

Oh, wow, then I can’t do it properly, at least not in those cases. Totally different from the error-minimization algos based on Hermitians.
Thank you, I’ll start with those first, see what happens - if it goes sour at least I won’t waste heaps of effort.

I might have some ideas that can work just at the level of the constraints, not touching the solver.

  1. If I go with “angle => distance” for the errors, I’ll probably derive some distance from the elements in the constraint and return the length of the “error chord” (a circle with the derived distance as radius; use the error angle, return the chord length)
  2. if I go with “angle => sq. distance”, I might use (differences in) the areas of some triangles formed with the segments involved in the constraints. E.g. AngleViaPoint - 1 point, 2 segments, one angle: define some triangles; if one picks/builds those triangles well, a zero error (segment supports crossing in the point and having the prescribed angle) should map to those areas becoming zero and those triangles degenerating.

I think this is approximately what i did in the thread you mentioned. Maybe it is still worth trying.

“Approximately” is a good qualifier :slight_smile:

The first part of the ConstraintAngleViaPoint::error()

    double ang=*angle();
    DeriVector2 n1 = crv1->CalculateNormal(poa);
    DeriVector2 n2 = crv2->CalculateNormal(poa);

    //rotate n1 by angle
    DeriVector2 n1r (n1.x*cos(ang) - n1.y*sin(ang), n1.x*sin(ang) + n1.y*cos(ang) );

Now, if n1 were the non-normalized normal vector, one would have some idea of the size. Problem is, only some curves return a non-normalized normal (Line is fine; the Circle not quite, but the modulus of its normal is still in distance units), while others do not (the conics return the normalized/dimensionless normal, and BSpline doesn’t return anything of value in 0.21).

But it doesn’t matter anyway, the next line wipes out any size/distance information.

    double err = atan2(-n2.x*n1r.y+n2.y*n1r.x, n2.x*n1r.x + n2.y*n1r.y);
    //essentially, the function is equivalent to atan2(n2)-(atan2(n1)+angle). The only difference is behavior when normals are zero (the intended result is also zero in this case).
    return scale * err;

(with the note that scale seems to be 1.0 all the time, maybe added for future extensions)

The possible (non solver-intrusive) approaches that I see:

  1. adjust the geometry code to always provide normals with their dimensional size (and continue using the ConstraintAngleViaPoint constraint); or
  2. write specific constraints that deal with the special types of curves and special type of constraints (tangent on point, normal on point, whatever) and know how to deal with the specific of the two curves (and get rid of ConstraintAngleViaPoint constraint, as being not specific enough); or
  3. ask every Curve class to implement one more method (say, getRepresentativeDistance() const) that ConstraintAngleViaPoint can use to transform the error angle into an error distance

Does anyone see other approaches?

Solution 1 is “vague” - the requirement of “sized normals” can’t be enforced/tested by the ConstraintAngleViaPoint code, and can lead to a fall-back into the same bad habits if new types of curves get added.

Solution 2 is the most C++-ish approach, but may lead to an explosion of specialized constraint classes (not that there aren’t enough already).

Solution 3 is sorta hack-ish, but is the most conservative with regard to new/modified code. So I’ll go down that path in my fork and see how it works.

I can only double down on the word of caution. Not that I am trying to discourage you - au contraire, I would love to see improvements!! But the solver is rather sensitive to gradients and errors. The solver is about two things: making things go to the right position (errors and gradients affect convergence speed), and having meaningful diagnosis (rank, redundant/conflicting constraints, restricted parameters). If gradients are zero (where there is still a dependency), the latter breaks down.

While my algebra is passable, my infinitesimal calculus capabilities are worse than they should be. DeepSOIC has always been a good reference for me there. I am very happy to see him around here and certainly he can be of much more use than me (should he have time for it).

That said, and my own limitations exposed: (2) leads to an explosion of classes, one for each pair of curves. It was actually the way things were before DeepSOIC invented AngleViaPoint. I would rather avoid (2), if only because it would lead to a substantial amount of code to maintain (and force new code to be created every time a new curve is implemented).

I am not sure what DeepSOIC may think of it (and certainly he knows better than me).

I am unsure if correct scaling of the vectors (getRepresentativeDistance()) is always necessary.

If it is always necessary, then (1) could work, and the effort is in ensuring future developers understand the need to provide correct scaling (so it is mostly a documentation effort, aside from the correction effort). In the end, (3) will also depend on the future developer understanding what he or she has at hand.

If this scaling is not always necessary, then (3) does not look hackish to me; rather, it looks like the way to go. It provides the flexibility of supplying the right scaling along with the ability to provide only a vector (such as a normalised vector), and the function to get the scaling shows the intent (some documentation effort may be needed, as well as probably harmonisation of what kind of vectors should be provided, e.g. normalised).

For sure, feel free to experiment and let us know :slight_smile:

I know that (but please don’t ask me how, I’ll tell you further down anyway :mrgreen:)

I imagine I could try plugging in some other solvers, but none that I know of could perform the redundant/conflicting constraint detection (and I know too little of the current one to touch it). So this will remain just in my imagination (I’d be a fool to disappoint myself by trying and failing :mrgreen:, especially since I don’t have enough time to dig as deep as my perfectionist side would like).

Raising my hat to both of you - I’ve spent enough time looking into the constraints code and on the forum today to know that respect for the knowledge and the work is due.

Gut feeling (mine): the solver would converge faster with specialized constraints (be it only because many of the tangent/normal constraints would deal with nicer angles). I can relate, tho’, to the increase in code complexity, in the context of not enough time/heads/hands to do it and do it coherently.


As for now, I’m unsure either. I implemented the code to return a measure of the curve size (it doesn’t need to be very exact; it’s a dirty hack anyway) and the code that consumes it (ConstraintAngleViaPoint and ConstraintL2LAngle). I got myself a Python script to build a sketch that uses both but still leaves 3 DoF for me, so I can shake the geometry around.

Functionally, mixed results:

  • on usual sizes (80-100 mm), it works well. But then again, so do the unmodified constraints
  • on sizes of 800 meters (yes, meters) - I had to scale the “representative size” down by 0.01 to make messages like
App.getDocument('Unnamed').getObject('Sketch').movePoint(0,2,App.Vector(-237772.343750,783827.062500,0),0)
20:00:18  Drag point: Not able to move point with the id and type: (0, 2)

less frequent. They still happen now and then. And occasionally the solver fails to converge for a drag gesture - grab another point or curve to drag from, and things fall back into place (until the next time it happens)

(see? Told you I’ll tell you how I got to know :mrgreen:)

Performance-wise (number of iterations until it converges)? I don’t know yet. I’ll need to identify a place where I can insert a “log entry” or a “cout <<” or something to let me know the iterations-to-convergence count (and maybe the quality of the solution).
I’ll have a look tomorrow to identify such a place (any hint will be appreciated)

It may be worth it, but then again it may not.

Why it may not: starting from DeepSOIC’s post on “errors/gradients of similar magnitudes”, I fooled around with dragging circle arcs of different sizes. At large sizes (800 m), the dragging becomes less fluent, sorta choppy (tho’ not at the level shown in DeepSOIC’s post - nothing like converging to different configurations - but dragging is still a bit jumpier than when the sketch has sizes in the 10-100 mm range). And the ArcOfCircle is constrained by ConstraintCurveValue (for start/stop angles), which relies on the curve’s Value(…) method, and the latter provides exact values for the gradient.

This is to say: even with exact (i.e. not normalized) gradients, when dealing with large sizes the solver takes more iterations to converge - I imagine there could be a (distance-type) size limit of the sketch above which the solver will start failing to converge.

Question - is the success condition of the iterative solving absolute or relative to the sketch size? Does the solver stop once an absolute precision has been reached or is the precision relative to sketch size?

Forget it, it’s a hack. Good for experimentation purposes, but that’s about it.


But of course. Tomorrow is another day to fool around (what else is there to do? All the shops are closed and the dam’d OCCT creates geometries with errors :mrgreen: ), but I doubt there’ll be other days dedicated to fooling.

It is hypot(err1, err2, …) < convergence (convergence = 1e-10 by default; err1, err2 and so on are the error functions of the constraints). Some solvers have slightly different formulations of this condition (I think one of them tests the maximum absolute error over all constraints), but that doesn’t really change the meaning.

have a look here, for example:
https://github.com/DeepSOIC/FreeCAD-ellipse/blob/ConstraintSolver1/src/Mod/ConstraintSolver/App/DogLeg.cpp
It is my reimplementation of Sketcher’s DogLeg solver, with slight twists due to the architecture change, but also with quite a few more comments.

Ummm… so… if you’re using Sketcher** for landscaping (100 m order of magnitude), construction or heavy machinery (10 m oom), machinery (1 m oom) or plastic 3D-printables (0.1-0.01 m), you can rest assured that Sketcher will do its best in (attempting to) compute everything with 1 Å precision? Is my understanding correct?

If I’m correct, then I’m no longer surprised by the Sketcher GUI getting jumpy when dragging large arcs-of-circles, or occasionally failing to converge.


For the academically inclined, an oldie-but-goldie on the perils of FP arithmetic: What Every Computer Scientist Should Know About Floating-Point Arithmetic - David Goldberg

For the experimentalists in what hypot(err1, err2, …) may do to you and how much trust one can put in it: the attached is my fooling around with the uncertain precision of Sketcher-themed computations of hypot (umm… almost), done by two means, and the difference between the two results as the CPU sees it. Personally, when it comes to hypot/square, I see anything beyond double(1e-8) relative precision as questionable.


** Sketcher as a geometry engine is pretty decent - it can save a lot of formula scribbling on paper to get an answer. Or it is so when its solver doesn’t fail to converge :mrgreen: (the other rant I feel growing inside me is the use of “always distances, never displacements” - like when you set your “X-dist” constraint for your geometry to a positive value no matter whether it’s on the left or the right relative to the Y axis. Then you live with the sword of Damocles above your head, praying a drag on a remotely related DoF won’t slide the solver into the other local minimum, where it decides the best way to solve that constraint is by mirroring it across the Y axis)
Dummy.cpp (2.59 KB)

Yes, that is unfortunately the truth, as far as I know. You can enable “advanced solver control” in the preferences, where this tolerance value can be manually adjusted, but there are no instructions or recommendations for users on doing so.


This is an absolute comparison; relative precision is mostly irrelevant.

BTW, there are other interesting precision-related problems in FreeCAD. For example, geometric primitives are by default created with a tolerance of Precision::Confusion, which is 1e-7 mm. This is used to test whether geometry is coincident or intersecting (e.g. in boolean operations), and can cause trouble for very big things (space-elevator-scale models are impossible unless you alter this tolerance; see this piece of discussion, for example).


X-dist and Y-dist flipping in particular was actually fixed some time ago, by me (I think). The minus sign is automatically removed by swapping the points when the constraint is created.