[gecode-users] Cloning problems

Filip Konvička filip.konvicka at logis.cz
Fri Feb 13 16:18:59 CET 2009


Ha! One of our custom propagators (not a "standard" floatvar propagator) 
returned __ES_SUBSUMED directly instead of calling the macro. So it is 
exactly like you said.

Thanks very much for support!

As a side effect, we now have a nice description of what happens during 
cloning that might be useful in the future.

Frankly, we're also glad that it was our fault in the end. We've been 
tracing the cloning process in the debugger in both C++ and assembler 
for quite some time, so we feel even more comfortable about Gecode now 
that we've seen the internals.

Thanks again,
Filip

> Hi,
> 
> we just discussed the very same idea. It seems that during cloning, an 
> actor is present in the variable subscription list, but not in the 
> Space::a_actors list, so we thought that maybe an actor disappeared 
> without unsubscribing. We're using ES_SUBSUMED(this, home) everywhere in 
> floatvar propagators to indicate entailment, I'll check whether that is 
> done everywhere.
> 
> Thanks,
> Filip
> 
>> Hi Mikael and I just had one more thing you could check: the behaviour could
>> also be due to a propagator that says subsumption but does not delete all of
>> its dependencies via cancel.
>>
>> Christian
>>
>> --
>> Christian Schulte, www.ict.kth.se/~cschulte/
>>
>>
>> -----Original Message-----
>> From: Christian Schulte [mailto:cschulte at kth.se] 
>> Sent: Friday, February 13, 2009 11:12 AM
>> To: 'Filip Konvička'; 'users at gecode.org'
>> Cc: 'Luboš Moric'
>> Subject: RE: [gecode-users] Cloning problems
>>
>> Hi,
>>
>> that sounds ugly ;-) Before I start with rambling about cloning, a tip:
>> Gecode supports unary branchings (with a single alternatiuve only). So you
>> could have a unary branching and don't have to have a mock-up empty branch.
>> The search engines even optimize that case. If you are looking for an
>> example, check the assign function in gecode/int.hh for assigning variables.
>>
>> Okay, cloning it is. So, there are two stages: copying and updating. During
>> copying you first create a new space and copy all actors. While copying all
>> actors you copy each variable and its domain used by an actor and collect a
>> list of all these variables (actually a list for each variable type). The
>> only thing that you do not do is copying the dependency array of a variable
>> (the array of pointers that point to propagators that depend on a variable).
>> Then in updating, you update the dependency arrays (and also deal with
>> advisors).
>>
>> Updating a dependency array just copies for each old entry the new entry
>> (stored via a forwarding pointer in each actor) and restore some information
>> for variables that has been stored in the copy of the dependency array.
>> After that you cleanup the space, the actors, and you are done.
>>
>> In VarImp<VIC>::enter, the --idx[0] makes room for an additional entry. Then
>> you keep moving actors to a free entry and by this create a new free entry
>> for the propagation condition i. When the loop is done, you know that there
>> is a free entry for propagation condition pc and that's where the propagator
>> is entered.
>>
>> Your guess about x->idx[0] in VarImp<VIC>::update is correct: this actually
>> stores the forwarding pointer for a variable implementation. Also your
>> suspicion about ActorLink::prev is correct: there the forwarding pointer for
>> actors is stored.
>>
>> Let me come up with some idea where the trouble could be:
>>  - you forget to update a variable or view at some place. Either in your
>> branching, propagator, or model. 
>>    Then you can see an invalid pointer in the dependency array.
>>    You might want to change the default constructor of VarBase in
>> gecode/kernel/var.icc and the default constructor
>>    of VarViewBase in gecode/kernel/view.icc so that they set varimp to NULL
>> (actually, I just did that yesterday in the trunk as well). Then you will be
>> sure that when you forget to update, you'll see a NULL pointer.
>>  - you commit a heinous crime in that you change the dependencies of a
>> variable with cancel/subscribe during 
>>    cloning. That is illegal and will for sure break everything. Note that
>> creating a propagator during
>>    cloning will commit that crime, too.
>>
>> It is very very unlikely (but not impossible, of course) that the cloning in
>> Gecode has a snatch but, as said, very unlikely given that its used by
>> anything else.
>>
>> Hope that helps. Please do not hesitate to ask some more
>> Christian
>>
>> --
>> Christian Schulte, www.ict.kth.se/~cschulte/
>>
>>
>> -----Original Message-----
>> From: users-bounces at gecode.org [mailto:users-bounces at gecode.org] On Behalf
>> Of Filip Konvicka
>> Sent: Thursday, February 12, 2009 2:21 PM
>> To: users at gecode.org
>> Cc: Luboš Moric
>> Subject: [gecode-users] Cloning problems
>>
>> Hi,
>>
>> [Sorry, this is a looong message...]
>>
>> we're hunting a serious bug that occurs during space cloning in 2.2.0. 
>> The bug occurs very rarely, but we have a testcase that triggers this 
>> behavior.
>>
>> We have many constraints in the problem instance and the solver should 
>> post as many propagators as possible. We have a custom branching for 
>> this, which posts one propagator at a time in commit(), while the 
>> alternative is not to post the propagator (i.e. a no-op). Because we're 
>> only looking for the first solution, in the case of a failure we no 
>> longer need the path back to the root in the recomputation tree, so we 
>> decided to use our own simple search engine for this. The standard DFS 
>> search engine exhibits exactly the same behavior (both with 
>> recomputation on and off), and we don't see any problems with our search 
>> engine.
>>
>> Everything seems to work for the vast majority of the test cases, but 
>> there are a few instances that cause problems (probably) during cloning 
>> (can be probably also be caused by some earlier bad subscibe or 
>> unsubscribe). From our point of view, there is nothing wrong or special 
>> about the instances. The crashes occur at the same location both on 
>> Linux and Windows, in both release and debug builds. Changing memory 
>> management (e.g. never deleting Spaces in the search engine) can cause 
>> the crash to occur at slightly different places (e.g. some propagation 
>> during status() after clone() finishes).
>>
>> One particular case we're looking at now crashes at core.icc:2270, where 
>> f[0] is a bad pointer (0xfeeefeee at Windows). We're not sure how this 
>> can happen - we know that in this case n==2 at core.icc:2255, so idx[0] 
>> is bad pointer at core.icc:2252. This is also what Valgrind says on 
>> Linux (bad read of size 4).
>>
>> When we were trying to debug the other cases, we found out that the 
>> subscription list in a variable in the cloned space contained an actor 
>> link that was probably copied incorrectly as it seemed as a pure 
>> ActorLink like Space::a_actors, having a totally different address than 
>> the rest of the actors (probably belonging to the original space 
>> object). When we tried to find out when this actor link entered the 
>> list, we ended up in VarImp<VIC>::update again.
>>
>> We're (of course:-)) using FloatVars in the model, and we eliminated all 
>> other kinds of variables and propagators. In our case, pc_max==1 and 
>> free_bits==0.
>>
>> We find it difficult to understand what is happening during cloning. We 
>> would appreciate if someone explaned the basic idea. We only have 
>> floatvars, propagators and one branching (no advisors or other types of 
>> actors/branchings/advisors).
>>
>> We know how VarImp<VIC>::resize works, that's easy. In 
>> VarImp<VIC>::enter, we can't see why you do "--idx[0];" as the first 
>> iteration of the for cycle overwrites it (as long as pc>0, of course). 
>> May be just optimization of course. As for VarImp<VIC>::update, we only 
>> guess...we suspect that a) the original x->idx[0] is destroyed somewhere 
>> so it needs to get restored from a memcpy backup at idx[0], b) 
>> ActorLink::_prev is probably used to map old actors to new ones (thus 
>> the "->prev()". We did not dig deep enough to be sure though, so we'd 
>> welcome some guidance here.
>>
>> Cheers,
>> Filip
>>
>>
>> _______________________________________________
>> Gecode users mailing list
>> users at gecode.org
>> https://www.gecode.org/mailman/listinfo/gecode-users
>>
>>   
> 
> 





More information about the gecode-users mailing list