Lucene - Problem with queries in XML


Immanuel Normann
Hi,

Lucene queries in XML behave strangely in my setup, and I have no clue why.

My collection.xconf for the collection /db/howto/lucene/data is the following:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <fulltext default="none" attributes="false"/>
        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <analyzer id="ws" class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
            <text xmlns:tei="http://www.tei-c.org/ns/1.0" qname="tei:l"/>
        </lucene>
    </index>
</collection>



In this collection I am trying to run a full-text search on the file jedermann.xml. It is a TEI file, hence the namespace declaration xmlns:tei in collection.xconf and the default element namespace in my test-search.xql, which begins with:

declare default element namespace "http://www.tei-c.org/ns/1.0";
declare variable $jedermann := doc("/db/howto/lucene/data/jedermann.xml");

... (: my test search :)

Term search works as expected:

For $jedermann//l[ft:query(.,'Gott')] as well as for $jedermann//l[ft:query(.,<query><term>Gott</term></query>)] I get 3 result elements.

However, a wildcard search does not yield the same result as a plain Lucene query and as an XML Lucene query:

$jedermann//l[ft:query(.,'Go*t')] returns 3 result elements - again as expected, BUT
$jedermann//l[ft:query(.,<query><wildcard>Go*t</wildcard></query>)] returns nothing.

The same happens with regex search:

$jedermann//l[ft:query(.,'/Go.t/')] returns 3 result elements - again as expected, BUT
$jedermann//l[ft:query(.,<query><regex>Go.t</regex></query>)] returns nothing.

To get to the bottom of this mystery I tried to replicate it with hamlet.xml from /db/apps/demo/data/.
However, I could not: there, plain and XML Lucene queries behave the same way. Now I wonder how this can be. The only difference I could identify (apart from the content, of course) is the default namespace (TEI) in my setup versus no namespace in the hamlet setup. Perhaps the namespace does not propagate properly to the XML Lucene query? Any ideas?

Cheers,
Immanuel

------------------------------------------------------------------------------

_______________________________________________
Exist-open mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/exist-open

Re: Lucene - Problem with queries in XML

Jens Østergaard Petersen-2
Hi Immanuel,

I tried to replicate this with the Hamlet in the Shakespeare app (also in the TEI namespace), but I got consistent results.

#

xquery version "3.0";

declare default element namespace "http://www.tei-c.org/ns/1.0";

let $hamlet := doc("/db/apps/shakespeare/data/ham.xml")
return
<results>
    <result n="1">{count($hamlet//sp[ft:query(., 'cannon')])}</result>
    <result n="2">{count($hamlet//sp[ft:query(., <query><term>cannon</term></query>)])}</result>
    <result n="3">{count($hamlet//sp[ft:query(., 'ca?non')])}</result>
    <result n="4">{count($hamlet//sp[ft:query(., <query><wildcard>ca?non</wildcard></query>)])}</result>
    <result n="5">{count($hamlet//sp[ft:query(., '/can.on/')])}</result>
    <result n="6">{count($hamlet//sp[ft:query(., <query><regex>can.on</regex></query>)])}</result>
</results>

#

Five hits every time.

Cheers,

Jens

Re: Lucene - Problem with queries in XML

Immanuel Normann
Hi Jens,

I am afraid I was not quite clear in explaining my attempt to solve the problem:
I observed exactly what you did with the Hamlet example, but that is not the point!
The inconsistency was only observable in my setup, with a different XML document.

So the question is: how can it be that full-text search works consistently on one text (e.g. hamlet.xml), but inconsistently on another? By "consistent full-text search" I mean that a plain query and the equivalent XML query with Lucene must yield identical results (see http://exist-db.org/exist/apps/doc/lucene.xml#D2.2.5.9).

More precisely, the question is: how can full-text search be inconsistent at all?

It definitely shouldn't depend on the content. But the fact that I observed consistent search results with Hamlet (as you did) and inconsistent results with my jedermann.xml at least indicates a dependency on the content and/or on the collection.xconf.

Therefore I looked for the principal difference between the hamlet setup and my jedermann setup (by "setup" I mean the XML files themselves together with their collection.xconf files). The only principal difference I could identify is that my jedermann setup involves a namespace whereas the hamlet setup doesn't. That is why I concluded that "perhaps the namespace does not propagate properly to the XML Lucene query".

I hope this explanation makes the problem more comprehensible. In summary: independent of the XML content and the collection.xconf, plain and XML Lucene queries should yield the same result. But I came across a setup that violates this consistency principle. The hamlet example is just one that conforms to it.

Cheers
Immanuel





Re: Lucene - Problem with queries in XML

Immanuel Normann
Well, I was too quick with my last email, and I have to apologize for not fully appreciating your attempt to solve the problem. So let me add this:

Actually, you did more than just replicate my hamlet setup, as you also added a namespace!
It would be helpful, though, to know where exactly you added the namespace. In your email it is only visible in the XQuery. I assume you also added the namespace in hamlet.xml. Maybe the point is where (if at all) you added the namespace in the collection.xconf? Maybe I did this in the wrong place (see my collection.xconf in my initial mail).

Regardless of this, it should never be the case, for any setup, that a plain query yields results whereas the corresponding XML query doesn't.

Best regards
Immanuel



Re: Lucene - Problem with queries in XML

Jens Østergaard Petersen-2
In reply to this post by Immanuel Normann


On 2 Oct 2015 at 11:53:09, Immanuel Normann ([hidden email]) wrote:

> Hi Jens,
>
> I am afraid I was not quite clear in the explanation of my attempt to solve the problem:
> I observed exactly the same as you did with the Hamlet example, but that is not the point!
> The inconsistency was only observable in my setting with a different XML-document.

No, you were working with the Hamlet doc in the Demo app, which is Jon Bosak's old one, without a namespace. I was working with the TEI document in the Shakespeare Demo app, which is in the TEI namespace.

> So the point is how can it be that I can do full text search consistently within one text (e.g. hamlet.xml), but only inconsistently in another one? Where "consistent full text search" means plain query and XML query with Lucene must yield identical results. (s. http://exist-db.org/exist/apps/doc/lucene.xml#D2.2.5.9).

Have you tried reindexing the document?
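
Reindexing can be triggered from XQuery. A minimal sketch, assuming the collection path from the original post and a user with sufficient permissions on that collection:

```xquery
xquery version "3.0";

(: Rebuild the indexes (including Lucene) for the collection from the
   original post, using the definitions in its collection.xconf.
   The path is taken from the first message; adjust it to your setup. :)
xmldb:reindex("/db/howto/lucene/data")
```

The call returns a boolean indicating whether reindexing succeeded; rerunning the plain and XML queries afterwards shows whether a stale index was involved.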

> Even more precisely the point is: how can it be at all that full text search can be inconsistent?

I think we would need to have a look at your jedermann.xml.

Best,

Jens

Re: Lucene - Problem with queries in XML

Immanuel Normann
You are right! I will bundle my jedermann.xml, or something comparable, so that the problem can be replicated.
... coming soon.

Immanuel

Re: Lucene - Problem with queries in XML

Immanuel Normann
Hi Jens,

I think I have spotted the culprit now! And indeed it is an inconsistency, though a minor one.
You can easily replicate it with your Shakespeare example:

xquery version "3.0";

declare default element namespace "http://www.tei-c.org/ns/1.0";

let $hamlet := doc("/db/apps/shakespeare/data/ham.xml")

return
<results>
    <result n="1">{count($hamlet//sp[ft:query(., 'Hamlet')])}</result>
    <result n="2">{count($hamlet//sp[ft:query(., <query><term>Hamlet</term></query>)])}</result>
    <result n="3">{count($hamlet//sp[ft:query(., 'Ham?et')])}</result>
    <result n="4">{count($hamlet//sp[ft:query(., <query><wildcard>Ham?et</wildcard></query>)])}</result>
    <result n="5">{count($hamlet//sp[ft:query(., '/Ham.et/')])}</result>
    <result n="6">{count($hamlet//sp[ft:query(., <query><regex>Ham.et</regex></query>)])}</result>
</results>

results in

<results xmlns="http://www.tei-c.org/ns/1.0">
<result n="1">425</result>
<result n="2">425</result>
<result n="3">425</result>
<result n="4">0</result>
<result n="5">425</result>
<result n="6">0</result>
</results>


To be or not to be - with upper case - that is the question ;-)

Best
Immanuel
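
A possible workaround, sketched under an assumption this thread does not confirm: if the analyzer lower-cases terms at index time, and the XML wildcard and regex elements are matched against those terms without being analyzed themselves, then lower-casing the pattern before building the XML query should bring results 4 and 6 in line with the others. This would be consistent with the lower-case ca?non and can.on patterns earlier in the thread, which behaved the same in both syntaxes. The variable names $wildcard and $regex are illustrative only.

```xquery
xquery version "3.0";

declare default element namespace "http://www.tei-c.org/ns/1.0";

let $hamlet := doc("/db/apps/shakespeare/data/ham.xml")
(: Assumption: indexed terms are lower-cased by the analyzer, while the
   XML wildcard/regex patterns are used as written. Lower-case them first. :)
let $wildcard := lower-case('Ham?et')
let $regex := lower-case('Ham.et')
return
<results>
    <result n="4">{count($hamlet//sp[ft:query(., <query><wildcard>{$wildcard}</wildcard></query>)])}</result>
    <result n="6">{count($hamlet//sp[ft:query(., <query><regex>{$regex}</regex></query>)])}</result>
</results>
```

Lower-casing a whole regular expression is only safe for simple patterns like this one; character-class escapes such as \S would change meaning.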

2015-10-02 12:37 GMT+02:00 Immanuel Normann <[hidden email]>:
You are right! I will bundle my jedermann.xml or anything comparable such that it can replicated.
 ... coming soon.

Immanuel

2015-10-02 12:29 GMT+02:00 Jens Østergaard Petersen <[hidden email]>:


On 2 Oct 2015 at 11:53:09, Immanuel Normann ([hidden email]) wrote:

Hi Jens,

I am afraid I was not quite clear in the explanation of my attempt to solve the problem:
I observed exactly the same as you did with the Hamlet example, but that is not the point!
The inconsistency was only observable in my setting with a different XML-document.

No, you were working with the Hamlet doc in the Demo app, which is Jon Bosak's old one, without a namespace. I was working with the TEI document in the Shakespeare Demo app, which is in the TEI namespace.

So the point is: how can it be that I can do full text search consistently within one text (e.g. hamlet.xml), but only inconsistently in another one? Here "consistent full text search" means that a plain query and an XML query with Lucene must yield identical results (see http://exist-db.org/exist/apps/doc/lucene.xml#D2.2.5.9).

Have you tried reindexing the document?

Even more precisely the point is: how can it be at all that full text search can be inconsistent?

I think we would need to have a look at your jedermann.xml.

Best,

Jens

It definitely shouldn't depend on the content. But the fact that I observed consistent search with Hamlet (as you did), but inconsistent search on my jedermann.xml, at least indicates a dependency on the content and/or on the collection.xconf.

Therefore I investigated the principal difference between the hamlet setting and my jedermann setting (by setting I mean the XML files themselves together with the collection.xconf files). The only principal difference I could identify between these two settings is that my jedermann setting involves a namespace whereas the hamlet setting doesn't. That's why I came to the conclusion that "perhaps the namespace does not propagate properly to the XML Lucene query".

I hope this explanation makes the problem more comprehensible. In summary it is this: independent of the XML content and the collection.xconf, plain and XML Lucene queries should yield the same result. But I came across a witness setting that violates this consistency principle. The hamlet example is just a witness that conforms to this consistency principle.

Cheers
Immanuel





2015-10-01 15:14 GMT+02:00 Jens Østergaard Petersen <[hidden email]>:
Hi Immanuel,

I tried to replicate this with the Hamlet in the Shakespeare app (also in the TEI namespace), but I got consistent results.

#

xquery version "3.0";

declare default element namespace "http://www.tei-c.org/ns/1.0";

let $hamlet := doc("/db/apps/shakespeare/data/ham.xml")
return
<results>
    <result n="1">{count($hamlet//sp[ft:query(., 'cannon')])}</result>
    <result n="2">{count($hamlet//sp[ft:query(., <query><term>cannon</term></query>)])}</result>
    <result n="3">{count($hamlet//sp[ft:query(., 'ca?non')])}</result>
    <result n="4">{count($hamlet//sp[ft:query(., <query><wildcard>ca?non</wildcard></query>)])}</result>
    <result n="5">{count($hamlet//sp[ft:query(., '/can.on/')])}</result>
    <result n="6">{count($hamlet//sp[ft:query(., <query><regex>can.on</regex></query>)])}</result>
</results>

#

Five hits every time.

Cheers,

Jens

On 1 Oct 2015 at 13:00:01, Immanuel Normann ([hidden email]) wrote:

Hi,

Lucene queries in XML behave strangely in my setting and I have no clue why.

My collection.xconf for the collection /db/howto/lucene/data is the following

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index>
        <fulltext default="none" attributes="false"/>
        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <analyzer id="ws" class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
            <text xmlns:tei="http://www.tei-c.org/ns/1.0" qname="tei:l"/>
        </lucene>
    </index>
</collection>



In this collection I am trying to make full text search in a file jedermann.xml. It is a TEI-file, hence the namespace xmlns:tei in collection.xconf and the default namespace in my test-search.xql which begins with:

declare default element namespace "http://www.tei-c.org/ns/1.0";
declare variable $jedermann := doc("/db/howto/lucene/data/jedermann.xml");

... (: my test search :)

Term search works as expected:

For $jedermann//l[ft:query(.,'Gott')] as well as for $jedermann//l[ft:query(.,<query><term>Gott</term></query>)] I get 3 result elements.

However, wildcard search does not yield the same result in a plain Lucene query versus XML Lucene query:

$jedermann//l[ft:query(.,'Go*t')] returns 3 result elements - again as expected, BUT
$jedermann//l[ft:query(.,<query><wildcard>Go*t</wildcard></query>)] returns nothing.

The same observation for regex search:

$jedermann//l[ft:query(.,'/Go.t/')] returns 3 result elements - again as expected, BUT
$jedermann//l[ft:query(.,<query><regex>Go.t</regex></query>)] returns nothing.

In order to solve this mystery I tried to replicate it with hamlet.xml from /db/apps/demo/data/.
However, I couldn't do so: there, plain and XML Lucene queries behave the same way. Now I wonder how this can be. The only difference I could identify (apart from the content, of course) is the existence of the default namespace (TEI) in my setting versus no such namespace in the hamlet setting. Perhaps the namespace does not propagate properly to the XML Lucene query?! Any idea?

Cheers,
Immanuel
------------------------------------------------------------------------------
_______________________________________________
Exist-open mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/exist-open

Re: Lucene - Problem with queries in XML

Jens Østergaard Petersen-2
Hi Immanuel,

Yeah, you're right. I now remember that I came across this problem before, <https://github.com/eXistSolutions/sarit/blob/master/modules/app.xql#L1776>, but I did not think about it, only “fixed” it - and forgot all about it.

It looks to me like a bug and I think you should report it.

To think, or not to think, that is the question!

Jens

Re: Lucene - Problem with queries in XML

Immanuel Normann


2015-10-02 15:23 GMT+02:00 Jens Østergaard Petersen <[hidden email]>:
Hi Immanuel,

Yeah, you're right. I now remember that I came across this problem before, <https://github.com/eXistSolutions/sarit/blob/master/modules/app.xql#L1776>, but I did not think about it, only “fixed” it - and forgot all about it.

It looks to me like a bug and I think you should report it.

Did it right now: https://github.com/eXist-db/exist/issues/804

Cheers
Immanuel