Re: Re: Re: Why is document removal so slow?


Re: Re: Re: Why is document removal so slow?

Peter Gardfjäll

Hello everybody,

I have written on this subject previously, namely about the poor
performance of document removal from large collections and, consequently,
the need for a batch removal method that removes several documents in a
single invocation.

My idea was that such an approach would speed things up considerably as
index files would only be written to disk once for the entire set of
removed documents.

Has any work been done on such a method? If not, would it be difficult to
provide such a method (e.g. in the org.exist.collections.Collection API)?
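
To make the request concrete, here is roughly the kind of method I have in
mind. The removeDocuments() signature is purely hypothetical and the
interface below is a simplified stand-in, not the actual eXist Collection
API:

// Hypothetical sketch only: removeDocuments() does not exist in the current
// org.exist.collections.Collection API, and the interface below is a
// simplified stand-in so the example is self-contained.
import java.util.List;

interface BatchCollection {

    // roughly what exists today: one call per document,
    // each one rewriting the collection's index entries
    void removeDocument(String docName);

    // proposed addition: remove a whole set of documents and flush the
    // text, value and structural indexes to disk only once at the end
    void removeDocuments(List<String> docNames);
}

class BatchRemovalExample {

    static void cleanUp(BatchCollection col) {
        List<String> docs = List.of("a.xml", "b.xml", "c.xml");

        // today: one full index load/save cycle per document
        for (String name : docs) {
            col.removeDocument(name);
        }

        // proposed: a single index load/save cycle for the whole batch
        col.removeDocuments(docs);
    }
}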

I would also like to know whether there really is anything to gain from such
a batch removal method. Maybe my theory about disk writes being the big
performance issue is wrong; the slow removal performance could, for example,
be inherent in the underlying index structures and algorithms.

cheers, Peter

> One way to improve performance would be to "collect" all document removals
> for a collection and remove those documents in a batch. As the index files
> are organized by collection, we would only need a single update of the text
> index, value index and structural index for the whole collection, thus
> reducing the number of write operations significantly.
>
> Maybe this could be done by additional API methods that accept a set of
> documents instead of a single one?
>
> Wolfgang




Re: Re: Re: Why is document removal so slow?

Michael Beddow-2
Peter Gardfjäll wrote

>
> [...]
> Has any work been done on such a method?
>

Not sure if there has been any progress on the special batch method
approach, BUT

a) the general speedup of indexing in the latest CVS is reflected in a
reduction in large collection deletion times

b) the way the recovery code currently works means that a huge amount of
journalling data has to be accumulated during deletion of a large
collection. As I've reported here before, this sometimes produces vast
journal files, which don't obey the configured size limits.

So for the moment, it's one step forward and half a step back on this
particular thing, but I know that Wolfgang is giving a lot of attention to
the journalling/recovery/transaction issues, so I would expect an improvement
here soon.

Michael Beddow




RE: Re: Re: Why is document removal so slow?

Chris Marasti-Georg
In reply to this post by Peter Gardfjäll
> On Wednesday, October 19, 2005, Michael Beddow wrote:
> [...]

One good model to look at is the Eclipse resource model.  While I hate the fact that I can't turn off the history feature (at least I haven't found a way), it has one really neat feature that speeds up batch resource operations: a WorkspaceModifyOperation (or something to that effect).  What it does is tell the workbench not to notify anyone of resource changes until the whole operation is through; only then do all of the resource change listeners get their crack at things.

How this could work for eXist:
In Java code, you could implement basically the same thing: an action whose "run" method makes db modifications, with the indexes not being updated until the run method has ended.
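
Just to make that concrete, here is a rough sketch of the pattern. None of
these types exist in eXist (or Eclipse); they are made-up stand-ins to
illustrate the deferred-index idea:

// Illustrative only: Database and DeferredIndexAction are invented names,
// not eXist (or Eclipse) classes. The point is the pattern: queue the
// modifications, then update the indexes once at the end.
import java.util.List;

interface Database {
    void removeDocument(String name);   // queued; indexes are not touched yet
    void flushIndexes();                // single index update for everything queued
}

abstract class DeferredIndexAction {

    // subclasses put their db modifications here
    protected abstract void run(Database db) throws Exception;

    // template method: run the modifications, then update the indexes once
    public final void execute(Database db) throws Exception {
        run(db);
        db.flushIndexes();
    }
}

class RemoveManyAction extends DeferredIndexAction {

    private final List<String> names;

    RemoveManyAction(List<String> names) {
        this.names = names;
    }

    @Override
    protected void run(Database db) {
        for (String name : names) {
            db.removeDocument(name);    // no index write happens here
        }
    }
}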

In XQuery, perhaps a util:db-modify() type function that could take a bit of code, or a function name (not sure if that's possible with the current implementation), which would allow XUpdates, storing, etc. and not update the index until after it had exited.

Thoughts?  Would this actually speed stuff up?

Chris Marasti-Georg



Re: Re: Re: Why is document removal so slow?

wolfgangmm
In reply to this post by Peter Gardfjäll
Hi,

removing a document is mostly done in the method
NativeBroker.removeDocument(Txn transaction, DocumentImpl document,
boolean freeDocId). If you analyze this method, you will find that
most of the time is spent updating the structural and fulltext
indexes. Removing the document data itself is fast: the method just
removes all the data pages in dom.dbx occupied by the document.
Updating and saving the indexes takes much longer. This is because
eXist organizes all indexes by collection, so removing the index entries
for a single document means that the index entries for the whole
collection have to be loaded and saved back. I thus expect a
batch-removal method to be much faster.
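
A toy model of where the time goes (this is not eXist code, just an
illustration of the per-removal cost): every single-document removal
triggers a full save of the collection's index entries, while a batch
removal saves them once.

// Toy cost model, not actual eXist code: the index entries for a collection
// are stored as one unit, so each single-document removal re-saves the
// entire collection's entries, while a batch removal saves them only once.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CollectionIndexModel {

    private final Map<String, String> entries = new HashMap<>();
    int diskWrites = 0;

    void addDocument(String name) {
        entries.put(name, "index data for " + name);
    }

    // current behaviour: one full index save per removed document
    void removeDocument(String name) {
        entries.remove(name);
        saveWholeIndex();
    }

    // batch behaviour: one full index save for the whole set
    void removeDocuments(List<String> names) {
        names.forEach(entries::remove);
        saveWholeIndex();
    }

    private void saveWholeIndex() {
        diskWrites++;   // stands in for rewriting the collection's index entries
    }
}

// Removing 1000 documents one by one costs 1000 index saves;
// removing them via removeDocuments() costs a single save.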

Wolfgang

On 10/19/05, Peter Gardfjäll <[hidden email]> wrote:

> [...]
> My idea was that such an approach would speed things up considerably as
> index files would only be written to disk once for the entire set of
> removed documents.
>
> Has any work been done on such a method? If not, would it be difficult to
> provide such a method (e.g. in the org.exist.collections.Collection API)?
> [...]



Re: Re: Re: Why is document removal so slow?

wolfgangmm
In reply to this post by Michael Beddow-2
> b) the way the recovery code currently works means that a huge amount of
> journalling data has to be accumulated during deletion of a large
> collection. As I've reported here before, this sometimes produces vast
> journal files, which don't obey the configured size limits. So for the
> moment, it's one step forward and half a step back on this particular thing,
> but I know that Wolfgang is giving a lot of attention to the
> journalling/recovery/transaction issues, so I would expect an improvement
> here soon.

I'm a bit conflicted here. Removing a collection is done within a
single transactional context. This ensures that, in case of a system
failure, either the entire collection has been removed or nothing has:
if the system fails and you restart the database, the collection will be
exactly as before. But in order to be able to restore the already
removed parts of the collection, the journal needs to store a copy of
every page, which means the journal log grows very quickly beyond
the configured limits. Since the operation is running within a single
transaction, we can't simply truncate the journal as long as the
transaction has not yet completed. In general, the journal cannot be
replaced while a transaction is running.

It would certainly be possible to break the single transaction into
many smaller transactions when removing a large collection. However,
this would violate transactional integrity. On the other hand, that
might be a price worth paying for the specific case of removing a huge
collection, perhaps controlled by a configuration option?
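
Something along these lines; purely a sketch, with made-up
TransactionManager/Txn stand-ins and a batchSize parameter that would
correspond to the (currently non-existent) configuration option:

// Hypothetical sketch only: TransactionManager and Txn are simplified
// stand-ins, and the batchSize configuration option does not exist yet.
import java.util.List;

interface TransactionManager {
    Txn begin();
}

interface Txn {
    void removeDocument(String name);
    void commit();   // once committed, the journal space used so far can be reclaimed
}

class CollectionRemover {

    // batchSize <= 0: everything in one transaction (current, fully
    // recoverable behaviour). batchSize > 0: split the removal into several
    // smaller transactions, trading all-or-nothing recovery for a bounded journal.
    static void removeAll(TransactionManager mgr, List<String> docs, int batchSize) {
        int chunk = batchSize > 0 ? batchSize : docs.size();
        for (int i = 0; i < docs.size(); i += chunk) {
            Txn txn = mgr.begin();
            for (String name : docs.subList(i, Math.min(i + chunk, docs.size()))) {
                txn.removeDocument(name);
            }
            txn.commit();
        }
    }
}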

Wolfgang

