I'm trying to see if I can use the content extraction module for
developing a search interface for a bunch of PDF files. I've tried the
demo app at http://localhost:8080/exist/apps/demo/cex-demo.html, but
this seems to produce inaccurate results. Basically, if a result is
found in the 'page' field of an index on a PDF file, it seems that all
pages of that PDF file are returned. I'm testing with eXist-develop,
revision d9ecd33 on Windows, with Oracle JDK 1.8.0_73.

<field name="title" store="yes">Indexing</field>
<field name="para" store="yes">This is the first paragraph.</field>
<field name="para" store="yes">And a second paragraph.</field>

I would expect the query ft:search('/db/apps/test.xml', 'para:second')
to return only the second <field>. Yet, it appears that whenever a match
is found in a field, *all* fields with the same name are returned:

<search score="4.7551346" uri="/db/apps/test.xml">
<field name="para">This is the first paragraph.</field>
<field name="para">And a <exist:match xmlns:exist="http://exist.sourceforge.net/NS/exist">second</exist:match> paragraph.</field>

Of course, the matching <field>s can be identified by their embedded
<exist:match> element, but I would rather expect that only matching
<field>s are returned in the first place.
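
As a stopgap, the filtering on <exist:match> could look roughly like the
sketch below. It is only an illustration: the $hit variable is a
hypothetical stand-in holding a hand-copied version of the output above
(with a closing tag added so that it is well-formed), where in practice
it would be the <search> element returned by ft:search().

```xquery
xquery version "3.0";

declare namespace exist = "http://exist.sourceforge.net/NS/exist";

(: Sample hit copied from the output above; in a real query this would
   come from ft:search('/db/apps/test.xml', 'para:second') instead. :)
let $hit :=
    <search score="4.7551346" uri="/db/apps/test.xml">
        <field name="para">This is the first paragraph.</field>
        <field name="para">And a <exist:match>second</exist:match> paragraph.</field>
    </search>
(: Keep only the fields that contain at least one match marker. :)
return $hit/field[exist:match]
```

This should leave only the second "para" field, but it means every
caller has to post-process the result, which is what I was hoping to
avoid.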
Is this a bug, or am I misunderstanding how ft:search() works?