Analyzer for ft:query

Claudius Teodorescu
Hi,


I have an index whose terms contain capital letters, built with a custom analyzer that does not lowercase the input.

When I search with ft:query for a string containing capital letters, like "tasmAt*", I am not able to retrieve the correct result.

I guess that ft:query uses the StandardAnalyzer somewhere, which lowercases its input.

I reproduced this in a unit test I have for Lucene indexing and searching.

Does anyone know how to supply the same analyzer for searching as was used for indexing?


Thanks,
Claudius

Re: Analyzer for ft:query

Claudius Teodorescu
Hi,


I saw that ft:query accepts some options for the query, such as:
<options>
    <default-operator>and|or</default-operator>
    <phrase-slop>number</phrase-slop>
    <leading-wildcard>yes|no</leading-wildcard>
    <filter-rewrite>yes|no</filter-rewrite>
</options>

This is why I thought there could be another option called "analyzer", since Lucene allows an analyzer to be specified for querying.


Claudius

Re: Analyzer for ft:query

Claudius Teodorescu
So, more extensive investigation and testing showed me that eXist already uses the same analyzer for querying.

Claudius

Re: Analyzer for ft:query

Claudius Teodorescu
This post was updated on .
Hi,


I finally managed to solve this entirely for eXist 3.0.

The use case is as follows: we have documents both in Devanagari and in transliteration of Devanagari, and we have chosen to index them in a pivot format called SLP1 (https://en.wikipedia.org/wiki/SLP1). For example, khalu and खलु are both indexed as Kalu.
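
To make the pivot idea concrete, here is a minimal sketch in Java of how both spellings can end up as the same index term. It is only an illustration: the class name `Slp1Sketch` and its tiny lookup tables are assumptions made up for this example and cover just the characters in "khalu" and "खलु"; the real TranscodingAnalyzer is certainly more complete.

```java
import java.util.Map;

// Toy transcoder to SLP1. The tables below are illustrative assumptions and
// cover only the characters needed for "khalu" and its Devanagari spelling.
class Slp1Sketch {
    // Devanagari consonants (each carries an inherent "a") mapped to SLP1.
    private static final Map<Character, String> CONSONANTS =
            Map.of('\u0916', "K",   // DEVANAGARI LETTER KHA
                   '\u0932', "l");  // DEVANAGARI LETTER LA
    // Devanagari dependent vowel signs mapped to SLP1.
    private static final Map<Character, String> VOWEL_SIGNS =
            Map.of('\u0941', "u");  // DEVANAGARI VOWEL SIGN U
    private static final char VIRAMA = '\u094D';

    static String fromDevanagari(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String consonant = CONSONANTS.get(c);
            if (consonant == null) {
                out.append(c); // characters outside the toy table pass through
                continue;
            }
            out.append(consonant);
            char next = i + 1 < input.length() ? input.charAt(i + 1) : 0;
            if (VOWEL_SIGNS.containsKey(next)) {
                out.append(VOWEL_SIGNS.get(next)); // explicit vowel replaces the inherent "a"
                i++;
            } else if (next == VIRAMA) {
                i++; // virama suppresses the inherent vowel
            } else {
                out.append('a'); // inherent vowel
            }
        }
        return out.toString();
    }

    static String fromRoman(String input) {
        // In SLP1 the aspirated "kh" is written as the single capital letter "K".
        return input.replace("kh", "K");
    }

    public static void main(String[] args) {
        System.out.println(fromRoman("khalu"));                    // Kalu
        System.out.println(fromDevanagari("\u0916\u0932\u0941"));  // Kalu
    }
}
```

The capital "K" is what makes case significant in the index: it is a different SLP1 letter from "k", not a capitalized variant of it.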

This approach allows query strings to be written either in Devanagari or in transliteration of Devanagari, and results to be returned in both as well.

As one can see, the index terms contain capital letters, so eXist must NOT lowercase the analyzed query string. On the other hand, query strings containing wildcards are not passed through the custom analyzer used for indexing.
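
The effect of lowercasing a wildcard term can be shown with a small self-contained sketch. It does not use Lucene at all: `WildcardCaseSketch` and `prefixSearch` are hypothetical names, and the "index" is just a list of terms. As far as I understand, the classic Lucene query parser in the 4.x line lowercases wildcard and prefix terms by default, which is what the `lowercaseExpandedTerms` flag imitates here.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Stand-in for a prefix search against case-sensitive index terms.
// Only a single trailing "*" is handled; this is not Lucene's wildcard logic.
class WildcardCaseSketch {
    static List<String> prefixSearch(List<String> indexTerms, String query,
                                     boolean lowercaseExpandedTerms) {
        // Strip the trailing wildcard to obtain the prefix to match.
        String prefix = query.endsWith("*")
                ? query.substring(0, query.length() - 1)
                : query;
        if (lowercaseExpandedTerms) {
            // Imitates a parser that lowercases wildcard terms before searching.
            prefix = prefix.toLowerCase(Locale.ROOT);
        }
        final String effectivePrefix = prefix;
        return indexTerms.stream()
                .filter(term -> term.startsWith(effectivePrefix))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> indexTerms = List.of("tasmAt", "Kalu");
        System.out.println(prefixSearch(indexTerms, "tasmAt*", true));   // []
        System.out.println(prefixSearch(indexTerms, "tasmAt*", false));  // [tasmAt]
    }
}
```

With lowercasing switched on, "tasmAt*" becomes "tasmat*" and no longer matches the indexed term "tasmAt".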

I fixed this by adding two extra options for queries, namely:
<set-lowercase-expanded-terms>no</set-lowercase-expanded-terms>
<query-parser-classname>org.apache.lucene.queryparser.analyzing.AnalyzingQueryParser</query-parser-classname>

I intend to keep the first option as it is, and to add the class name of the query parser to the index configuration file, as below:
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <lucene>
            <analyzer class="de.unihd.hra.libs.java.luceneTranscodingAnalyzer.TranscodingAnalyzer"/>
            <query-parser class="org.apache.lucene.queryparser.analyzing.AnalyzingQueryParser" />
            <text qname="tei:p"/>
        </lucene>
    </index>
</collection>

My question is: dare I hope that such improvements would be included in the eXist codebase? Are they of interest to more people?

Thanks,
Claudius


Re: Analyzer for ft:query

Joe Wicentowski
Hi Claudius,

Great work!  I don't have an immediate use case for your improvements,
but I think anything that enriches the analysis and parsing pipelines
for eXist's Lucene-based full text index - particularly for
multilingual applications - would be a great addition.

Two questions:

1. Would you consider renaming your <query-parser-classname> as
<query-parser>?  I notice that <analyzer> isn't called
<analyzer-classname>, though these elements both take a @class
attribute.

2. Do you see any opportunities for enhancing the range index with a
similar feature?  They're both based on Lucene.  We've talked about
cross-pollination of other features across the full text and range
indexes[1], so I thought I'd raise this.

Joe

[1] https://github.com/eXist-db/exist/pull/1233

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Exist-open mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/exist-open

Re: Analyzer for ft:query

Claudius Teodorescu
Hi,


Thanks, Joe.

1. I renamed it immediately after I sent the message, so we did have the same thought.

2. Can you sketch a use case for this suggestion?


Claudius

Re: Analyzer for ft:query

Joe Wicentowski
> 1. I renamed it immediately after I sent the message, so we did have the
> same thought.

Ha, great!

> 2. Can you sketch a use case for this suggestion?

Not really, I'm afraid - I was just throwing the idea out there.  Feel
free to disregard it if it doesn't make any sense.


Re: Analyzer for ft:query

Adam Retter
In reply to this post by Claudius Teodorescu
> My question is: dare I hope that such improvements would be included
> in the eXist codebase? Are they of interest to more people?

I am happy to review a PR. Please do include tests though as it helps
me understand the problem that you are solving (sadly I am not a
languages expert).

Cheers Adam.

--
Adam Retter

eXist Developer
{ United Kingdom }
[hidden email]
irc://irc.freenode.net/existdb


Re: Analyzer for ft:query

Pietro Liuzzo
In reply to this post by Claudius Teodorescu
I would be very glad to be able to use this for Ethiopic as well, and to search transliteration and fidel script at the same time! Is it already possible?

--
Pietro Maria Liuzzo
cel (DE): +49 (0) 176 61 000 606
Skype: pietro.liuzzo (Quingentole)


Re: Analyzer for ft:query

Claudius Teodorescu
Hi, Pietro,

Do you have a sort of "pivotal format" for Ethiopic, as there is SLP1 (https://en.wikipedia.org/wiki/SLP1) for Devanagari and Roman transliteration?

SLP1 is an ASCII transliteration scheme, which I found to be a very clever idea. It can stand in the middle so that one can have, for instance, transcoding from Devanagari to Roman transliteration or to Telugu or to Bengali.

If you have a suitable analyzer that indexes the documents in a convenient format, you can simply reference it in the collection.xconf file, and you will have the index.

For searching, once the two ft:query options are available in eXist, you will be able to use them. Until then, if you want, I can send you a modified Java class that implements them once eXist is rebuilt.

For an example of such an analyzer, see https://github.com/claudius108/lucene-transcoding-analyzer.


Claudius

Re: Analyzer for ft:query

Pierrick Brihaye
In reply to this post by Pietro Liuzzo
Hello,

On 02/03/2017 at 14:22, Pietro Liuzzo wrote:

> I would be very glad to be able to use this for Ethiopic as well, and to search
> transliteration and fidel script at the same time! Is it already possible?

I wrote such a thing years ago for Arabic
(http://www.nongnu.org/aramorph/). Searches could be made in Arabic as
well as in Latin script (using fixed transliteration rules).

Same thing for results...

Cheers,

p.b.





Re: Analyzer for ft:query

Claudius Teodorescu
Very nice work, Pierrick. This shows that the approach is valuable.

Re: Analyzer for ft:query

Claudius Teodorescu
Hi, Adam,

The changes consist of only a few lines of code, adding a new option for ft:query(). I have seen that there are no unit tests for the other ft:query() options.

The changes are:
1. Add the line
public static final String OPTION_SET_LOWERCASE_EXPANDED_TERMS = "set-lowercase-expanded-terms";
after the line
public static final String DEFAULT_OPERATOR_OR = "or";

2. Add the lines
        option = options.getProperty(OPTION_SET_LOWERCASE_EXPANDED_TERMS);
        if (option != null) {
            if (option.equalsIgnoreCase("yes"))
                parser.setLowercaseExpandedTerms(true);
            else
                parser.setLowercaseExpandedTerms(false);
        }
after the lines
                parser.setMultiTermRewriteMethod(MultiTermQuery.CONSTANT_SCORE_BOOLEAN_QUERY_REWRITE);
        }
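
For readers who want to try the logic of these two changes in isolation, here is a self-contained sketch. It is not the eXist source: `LowercaseOptionSketch`, `StubParser` and `applyOptions` are hypothetical names, and `StubParser` only stands in for the Lucene query parser so that the example compiles without Lucene on the classpath. The stub's default of `true` is an assumption about the parser's default behaviour.

```java
import java.util.Properties;

// Isolated version of the option handling described above.
class LowercaseOptionSketch {
    static final String OPTION_SET_LOWERCASE_EXPANDED_TERMS = "set-lowercase-expanded-terms";

    // Hypothetical stand-in for the Lucene query parser; only the one setter is modelled.
    static class StubParser {
        boolean lowercaseExpandedTerms = true; // assumed default: lowercase wildcard terms

        void setLowercaseExpandedTerms(boolean value) {
            lowercaseExpandedTerms = value;
        }
    }

    // Reads the option and configures the parser; an absent option keeps the default.
    static void applyOptions(Properties options, StubParser parser) {
        String option = options.getProperty(OPTION_SET_LOWERCASE_EXPANDED_TERMS);
        if (option != null) {
            // "yes" (in any case) enables lowercasing, any other value disables it.
            parser.setLowercaseExpandedTerms(option.equalsIgnoreCase("yes"));
        }
    }

    public static void main(String[] args) {
        Properties options = new Properties();
        options.setProperty(OPTION_SET_LOWERCASE_EXPANDED_TERMS, "no");
        StubParser parser = new StubParser();
        applyOptions(options, parser);
        System.out.println(parser.lowercaseExpandedTerms); // false
    }
}
```

Collapsing the if/else into a single call keeps the same behaviour as the lines quoted above.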

The second option I mentioned above is not needed, since the class name of the query parser can be set in the index configuration file, as below:
<parser class="org.apache.lucene.queryparser.analyzing.AnalyzingQueryParser"/>

The only problem is that eXide reports that the <parser> element should not be there.


Claudius