[gecode-users] Cloning problems

Thu Feb 12 14:20:37 CET 2009

Hi,

[Sorry, this is a looong message...]

we're hunting a serious bug that occurs during space cloning in 2.2.0. 
The bug occurs very rarely, but we have a testcase that triggers this 
behavior.

We have many constraints in the problem instance and the solver should 
post as many propagators as possible. We have a custom branching for 
this, which posts one propagator at a time in commit(), while the 
alternative is not to post the propagator (i.e. a no-op). Because we're 
only looking for the first solution, in the case of a failure we no 
longer need the path back to the root in the recomputation tree, so we 
decided to use our own simple search engine for this. The standard DFS 
search engine exhibits exactly the same behavior (both with 
recomputation on and off), and we don't see any problems with our search 
engine.
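
For readers who want the setup spelled out, here is a minimal toy sketch of the branching and search described above. It does not use the Gecode API and is not our actual code: ToySpace, ToyConstraint, toy_commit and first_solution are hypothetical names, the "propagator" is just an interval bound on a single float variable, and "cloning" is a plain struct copy. It only shows the shape of the search: alternative 0 posts the constraint, alternative 1 is a no-op, and on failure only the last copy is needed, not the path back to the root.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a space: one float variable with bounds [lo, hi].
struct ToySpace {
    double lo, hi;
    std::vector<std::size_t> posted;  // indices of constraints posted so far
    bool failed() const { return lo > hi; }
};

// Toy stand-in for a propagator: restricts the variable to [lo, hi].
struct ToyConstraint {
    double lo, hi;
};

// Alternative 0 posts constraint i; alternative 1 is a no-op.
void toy_commit(ToySpace& s, const std::vector<ToyConstraint>& cs,
                std::size_t i, unsigned alt) {
    if (alt == 1) return;
    s.lo = std::max(s.lo, cs[i].lo);
    s.hi = std::min(s.hi, cs[i].hi);
    s.posted.push_back(i);
}

// First-solution search: keep one backup copy per step, nothing more.
ToySpace first_solution(ToySpace cur, const std::vector<ToyConstraint>& cs) {
    assert(!cur.failed());
    for (std::size_t i = 0; i < cs.size(); ++i) {
        ToySpace backup = cur;          // "clone" before committing
        toy_commit(cur, cs, i, 0);      // try to post the constraint
        if (cur.failed()) {
            cur = backup;               // restore the copy ...
            toy_commit(cur, cs, i, 1);  // ... and take the no-op instead
        }
    }
    return cur;
}
```

With a root domain of [0, 10] and the constraints [2, 8], [9, 12], [3, 5], this sketch posts the first and third constraint and skips the second, because posting it would empty the domain.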

Everything seems to work for the vast majority of the test cases, but 
there are a few instances that cause problems, probably during cloning
(although they could also be caused by some earlier bad subscribe or
unsubscribe). From our point of view, there is nothing wrong or special
about the instances. The crashes occur at the same location both on 
Linux and Windows, in both release and debug builds. Changing memory 
management (e.g. never deleting Spaces in the search engine) can cause 
the crash to occur at slightly different places (e.g. some propagation 
during status() after clone() finishes).

One particular case we're looking at now crashes at core.icc:2270, where
f[0] is a bad pointer (0xfeeefeee on Windows). We're not sure how this
can happen. We know that in this case n==2 at core.icc:2255, so idx[0]
is already a bad pointer at core.icc:2252. This is also what Valgrind
reports on Linux (bad read of size 4).

When we were trying to debug the other cases, we found that the
subscription list of a variable in the cloned space contained an actor
link that was probably copied incorrectly: it looked like a plain
ActorLink, like Space::a_actors, and its address was totally different
from the addresses of the other actors (it probably belongs to the
original space object). When we tried to find out when this actor link
entered the list, we ended up in VarImp<VIC>::update again.

We're (of course:-)) using FloatVars in the model, and we eliminated all 
other kinds of variables and propagators. In our case, pc_max==1 and 
free_bits==0.

We find it difficult to understand what is happening during cloning. We
would appreciate it if someone explained the basic idea. We only have
FloatVars, propagators and one branching (no advisors or other types of
actors).

We know how VarImp<VIC>::resize works, that's easy. In
VarImp<VIC>::enter, we can't see why you do "--idx[0];", as the first
iteration of the for loop overwrites it (as long as pc>0, of course).
Maybe it is just an optimization. As for VarImp<VIC>::update, we can
only guess. We suspect that a) the original x->idx[0] is destroyed
somewhere, so it needs to be restored from a memcpy backup at idx[0],
and b) ActorLink::_prev is probably used to map old actors to new ones
(thus the "->prev()"). We did not dig deep enough to be sure though, so
we'd welcome some guidance here.
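
To make guess b) concrete, here is a minimal sketch of the generic forwarding-pointer technique for cloning an object graph. This is not Gecode's code and we don't claim it matches what VarImp<VIC>::update does: ToyActor, ToyVar, ToySpace, the fwd field and clone_space are all hypothetical. The idea is that each original actor temporarily stores a pointer to its copy, so that subscription lists can be translated from old actors to new ones.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical actor with a forwarding pointer that is only valid
// while a clone is in progress.
struct ToyActor {
    int id;
    ToyActor* fwd;
    explicit ToyActor(int i) : id(i), fwd(nullptr) {}
};

// Hypothetical variable holding its subscribed actors.
struct ToyVar {
    std::vector<ToyActor*> subs;
};

struct ToySpace {
    std::vector<std::unique_ptr<ToyActor>> actors;
    std::vector<ToyVar> vars;
};

ToySpace clone_space(ToySpace& orig) {
    ToySpace copy;
    // Pass 1: copy every actor and leave a forwarding pointer behind.
    for (auto& a : orig.actors) {
        copy.actors.push_back(std::make_unique<ToyActor>(a->id));
        a->fwd = copy.actors.back().get();
    }
    // Pass 2: copy subscriptions, translating old actors to new ones.
    for (const ToyVar& v : orig.vars) {
        ToyVar nv;
        for (ToyActor* a : v.subs) {
            // A subscription to an actor that was not copied in pass 1
            // has no forwarding pointer and cannot be translated.
            assert(a->fwd != nullptr);
            nv.subs.push_back(a->fwd);
        }
        copy.vars.push_back(nv);
    }
    // Pass 3: reset the forwarding pointers in the original.
    for (auto& a : orig.actors) a->fwd = nullptr;
    return copy;
}
```

In a scheme like this, a subscription entry that is stale or was never registered as an actor of the space would not be translated properly, and the copy could end up holding a pointer into the original. Whether something similar explains what we see is only a guess.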

Cheers,
Filip