Managing large number of documents


Managing large number of documents

Toralf Kirsten
Hi all,
I have seen an earlier thread on this topic (called "Max number of files
in a collection") initiated by José María. But, to my knowledge, no
solution or further evaluation material came out of it.

Currently, I also have to import large XML documents from the
bioinformatics domain. The files range from about 980 MB to about
8 GB. Each file should be associated with a collection and, if
necessary, could be split into many small documents.

The import of such a large document terminates with an OutOfMemoryError.
As a further test case I wrote a Java SAX parser to split a file (the
small one, 980 MB) and load each resulting document into the
XML-DB. But the server crashes with an OutOfMemoryError. The crash is
not recoverable, so I have to re-install the server.
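The splitter described above might look roughly like this. This is a hypothetical sketch, not the original code: the class name, the fixed record depth, and the simplified serialization (which ignores namespaces and entity escaping) are all my own assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: stream an XML file with SAX and cut it into one
// small document per record element found at a fixed depth (root = 1).
class RecordSplitter extends DefaultHandler {
    private final int splitDepth;                 // depth where records start
    private final List<String> fragments = new ArrayList<>();
    private StringBuilder current;                // non-null while inside a record
    private int depth = 0;

    RecordSplitter(int splitDepth) { this.splitDepth = splitDepth; }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        depth++;
        if (depth == splitDepth) current = new StringBuilder();
        if (current != null) {
            current.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++)
                current.append(' ').append(atts.getQName(i))
                       .append("=\"").append(atts.getValue(i)).append('"');
            current.append('>');
        }
    }

    @Override
    public void characters(char[] ch, int start, int len) {
        if (current != null) current.append(ch, start, len);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (current != null) {
            current.append("</").append(qName).append('>');
            if (depth == splitDepth) {            // record closed: flush fragment
                fragments.add(current.toString());
                current = null;
            }
        }
        depth--;
    }

    static List<String> split(String xml, int splitDepth) throws Exception {
        RecordSplitter handler = new RecordSplitter(splitDepth);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.fragments;
    }
}
```

Since SAX never builds the whole tree in memory, the splitter itself stays flat in memory use; in the scenario above it is the server-side storage, not the parse, that runs out of heap.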
To avoid too many small documents, I grouped the documents from the
split process into larger documents, which are then inserted into the
XML-DB. That means the original XML file is first split w.r.t. a
specified node level, and secondly, n resulting temporary documents are
grouped together to build one resource within the XML-DB. The group
factor n is currently 100. Unfortunately, I could not load all resource
documents built in this way. I played a little with the group factor,
but my program always terminates with the error

Server returned HTTP response code: 500 for URL:
http://<my_ip_address>:8081/exist/xmlrpc

and the import process stops. The eXist server outputs

...
11 Oct 2005 13:47:30,002 [P1-9] DEBUG (DOMFile.java [add]:220) -
Creating overflow page
11 Oct 2005 13:47:30,003 [P1-9] DEBUG (DOMFile.java [<init>]:2929) -
Creating overflow page
11 Oct 2005 13:47:30,007 [P1-9] DEBUG (DOMFile.java [add]:220) -
Creating overflow page
11 Oct 2005 13:47:30,008 [P1-9] DEBUG (DOMFile.java [<init>]:2929) -
Creating overflow page
11 Oct 2005 13:47:31,822 [P1-9] DEBUG (LRUCache.java [resize]:220) -
Growing cache from 3687 to 8901
11 Oct 2005 13:47:31,861 [P1-9] DEBUG (LRDCache.java [cleanup]:148) -
totalReferences = 640001; maxReferences = 640000
11 Oct 2005 13:47:32,617 [P1-9] DEBUG (Collection.java [store]:780) -
document stored.
11 Oct 2005 13:47:32,792 [P1-9] DEBUG (RpcConnection.java [parse]:1162)
- parsing /db/Affy_HG_U133Plus2/d7a9432d.xml took 4945ms.
11 Oct 2005 13:47:37,613 [P1-9] WARN  (ServletHandler.java [handle]:574)
- Error for /exist/xmlrpc
java.lang.OutOfMemoryError
11 Oct 2005 13:47:50,905 [Thread-2] INFO  (BTree.java
[printStatistics]:1759) - words.dbx INDEX 5530 / 5341 / 81118182 / 1105245
11 Oct 2005 13:47:50,906 [Thread-2] INFO  (BFile.java
[printStatistics]:403) - words.dbx DATA 8901 / 7296 / 7098537 / 851075
11 Oct 2005 13:47:51,894 [Thread-2] INFO  (NativeBroker.java
[sync]:2806) - Memory: 128064K total; 128064K max; 30406K free
11 Oct 2005 13:47:51,895 [Thread-2] INFO  (BTree.java
[printStatistics]:1759) - collections.dbx INDEX 64 / 39 / 103196 / 1
11 Oct 2005 13:47:51,895 [Thread-2] INFO  (BFile.java
[printStatistics]:403) - collections.dbx DATA 64 / 64 / 26227 / 160
11 Oct 2005 13:47:51,895 [Thread-2] INFO  (BTree.java
[printStatistics]:1759) - elements.dbx INDEX 64 / 1 / 145220 / 0
11 Oct 2005 13:47:51,896 [Thread-2] INFO  (BFile.java
[printStatistics]:403) - elements.dbx DATA 156 / 155 / 289202 / 1353
11 Oct 2005 13:47:51,896 [Thread-2] INFO  (BTree.java
[printStatistics]:1759) - values.dbx INDEX 64 / 1 / 0 / 0
11 Oct 2005 13:47:51,896 [Thread-2] INFO  (BFile.java
[printStatistics]:403) - values.dbx DATA 64 / 0 / 0 / 0
11 Oct 2005 13:47:51,897 [Thread-2] INFO  (BTree.java
[printStatistics]:1759) - dom.dbx INDEX 1093 / 1093 / 1978061 / 8807
11 Oct 2005 13:47:51,897 [Thread-2] INFO  (DOMFile.java
[printStatistics]:1169) - dom.dbx DATA 256 / 255 / 24488414 / 2
11 Oct 2005 13:47:51,897 [Thread-2] INFO  (BTree.java
[printStatistics]:1759) - values-by-qname.dbx INDEX 64 / 1 / 0 / 0
11 Oct 2005 13:47:51,898 [Thread-2] INFO  (BFile.java
[printStatistics]:403) - values-by-qname.dbx DATA 64 / 0 / 0 / 0


The number of resources inserted before the server crashes is 6589; the
number of split documents inserted is 658833. Around 100 resources
could not be loaded.
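The split-and-group step described above could be sketched like this. It is a hypothetical illustration: the `<group>` wrapper element and all names are my own, and the real program presumably streams fragments to the server rather than holding them all in memory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the grouping step: pack n split fragments into
// one wrapper document each, so fewer (but larger) resources are stored.
class FragmentGrouper {
    static List<String> group(List<String> fragments, int n) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < fragments.size(); i += n) {
            // "<group>" is an assumed wrapper root, not from the original post
            StringBuilder sb = new StringBuilder("<group>");
            int end = Math.min(i + n, fragments.size());
            for (String fragment : fragments.subList(i, end))
                sb.append(fragment);
            sb.append("</group>");
            docs.add(sb.toString());
        }
        return docs;
    }
}
```

With the numbers reported above (658833 split documents, group factor 100), this yields roughly 6589 stored resources, which matches the count at the time of the crash.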

I'm using the latest available snapshot (25.9.2005) with mostly
standard configuration settings. Apart from the port, I only changed
the Xmx parameter from 256 MB to 512 MB within ./bin/server.sh.
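For reference, the change amounts to something like the following in ./bin/server.sh. The variable name is an assumption and varies between eXist versions, so check the script itself.

```shell
# Hypothetical sketch: raise the JVM heap limit for the eXist server.
# The JAVA_OPTIONS variable name is an assumption; your server.sh may
# pass -Xmx to java directly instead.
JAVA_OPTIONS="-Xms128m -Xmx512m"   # was -Xmx256m
```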

Is there anything more I can do or try?

Thanks, Toralf


-------------------------------------------------------
This SF.Net email is sponsored by:
Power Architecture Resource Center: Free content, downloads, discussions,
and more. http://solutions.newsforge.com/ibmarch.tmpl
_______________________________________________
Exist-open mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/exist-open

Re: Managing large number of documents

Leif-Jöran Olsson
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Toralf Kirsten skrev:
> Hi all,

>
> The number of resources inserted before the server crashes is 6589; the
> number of split documents inserted is 658833. Around 100 resources
> could not be loaded.
>
> I'm using the latest available snapshot (25.9.2005) with mostly
> standard configuration settings. Apart from the port, I only changed
> the Xmx parameter from 256 MB to 512 MB within ./bin/server.sh.
>
> Is there anything more I can do or try?

Hi!
If you have more memory to spare, give some more of it to eXist.

Leif-Jöran
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFDS9pShcIn5aVXOPIRAu1pAKCEZqhKn0pSi7lPSeu5uPjo/Fc/iwCgsVC8
PrprI1eBOjQVdRySJ2Hp+SM=
=ttP4
-----END PGP SIGNATURE-----



Re: Managing large number of documents

wolfgangmm
In reply to this post by Toralf Kirsten
Hi,

the current CVS version should solve some of the memory problems. At
least, you should see more moderate memory consumption. I think it is
worth a try.

Wolfgang

On 10/11/05, Toralf Kirsten <[hidden email]> wrote:

> Hi all,
> I have seen an earlier thread on this topic (called "Max number of files
> in a collection") initiated by José María. But, to my knowledge, no
> solution or further evaluation material came out of it.
>
> Currently, I also have to import large XML documents from the
> bioinformatics domain. The files range from about 980 MB to about
> 8 GB. Each file should be associated with a collection and, if
> necessary, could be split into many small documents.
>
> The import of such a large document terminates with an OutOfMemoryError.

