whitespace lost in content extraction

ron.vandenbranden
Hi,

I'm trying the content extraction module (eXist-3.0rc1) to create a
searchable index for PDF files. This seems to work great mostly, but I
have a problem concerning whitespace.

Basically, the content:get-metadata-and-content() function seems to
suppress newlines. For example, if a PDF document contains the following
text (for clarity's sake, I've marked space characters with '_'):

     test_
     whitespace
     extraction

When extracted with content:get-metadata-and-content(), this results in
the following HTML structure:

   <div class="page">
     <p>test whitespaceextraction</p>
   </div>


This skews searchability, since 'whitespace' and 'extraction' will be
indexed as one single word 'whitespaceextraction'.

Yet, when testing this with the standalone Tika app, the newline is
honoured:

   <div class="page">
   <p>test
   whitespace
   extraction</p>
   </div>

If that content is indexed in eXist (with appropriate whitespace
settings in conf.xml), that will result in 3 indexed keywords: 'test',
'whitespace', and 'extraction'.
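
As a rough illustration of the difference in token counts, here is a minimal sketch that uses only standard XQuery functions. The two strings are copied from the results shown above; splitting on runs of whitespace is an assumption that only approximates what an index analyzer does, and is not the tokenizer eXist actually uses:

```xquery
xquery version "3.0";

(: Text values taken from the two extraction results shown above;
   &#10; is a newline character. Splitting on whitespace is an
   assumption of this sketch, not eXist's actual index analyzer. :)
let $collapsed := "test whitespaceextraction"
let $preserved := "test&#10;whitespace&#10;extraction"
for $text in ($collapsed, $preserved)
return
    (: count the whitespace-separated tokens in each variant :)
    count(tokenize(normalize-space($text), "\s+"))
```

This evaluates to the sequence (2, 3): two tokens when the newline is dropped, three when it is preserved.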

I assumed that the serialization of the content extraction functions
would be governed by the whitespace settings in conf.xml. Yet the following
settings have no effect on the content extracted with
content:get-metadata-and-content():

   <indexer caseSensitive="yes"
            index-depth="5"
            preserve-whitespace-mixed-content="yes"
            suppress-whitespace="none">

I've tried to force indentation with:

   declare option exist:serialize "indent=yes";

...but this has no effect either: it only applies to the final
serialization, not the content extraction function.

Is there any way to influence whitespace treatment in the content
extraction functions (any way to hard-code settings or properties
somewhere perhaps)? Any help much appreciated!

Kind regards,

Ron

------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Exist-open mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/exist-open

testwhitespaceextraction.pdf (11K) Download Attachment

Re: whitespace lost in content extraction

Dmitriy Shabanov
Hi,

Will you be able to build eXist from source?

On Thu, Apr 14, 2016 at 3:05 PM, ron.vandenbranden <[hidden email]> wrote:
> Hi,
>
> I'm trying the content extraction module (eXist-3.0rc1) to create a searchable index for PDF files. This seems to work great mostly, but I have a problem concerning whitespace.

--
Dmitriy Shabanov

Re: whitespace lost in content extraction

ron.vandenbranden
Hi Dmitriy,

On 14/04/2016 23:03, Dmitriy Shabanov wrote:
>
> Will you be able to build eXist from source?
>

Yes, sure, no problem at all.

Best,

Ron

Re: whitespace lost in content extraction

ron.vandenbranden
No clue what I should change, though. Do you have a suggestion?

Best,

Ron

On 14/04/2016 23:25, ron.vandenbranden wrote:

> Hi Dmitriy,
>
> On 14/04/2016 23:03, Dmitriy Shabanov wrote:
>>
>> Will you be able to build eXist from source?
>>
>
> Yes, sure, no problem at all.
>
> Best,
>
> Ron

Re: whitespace lost in content extraction

Dmitriy Shabanov

On Fri, Apr 15, 2016 at 1:29 AM, ron.vandenbranden <[hidden email]> wrote:
> No clue what I should change, though. Do you have a suggestion?

test this branch
https://github.com/shabanovd/exist/tree/bugfix/content_extractor_ws

--
Dmitriy Shabanov

Re: whitespace lost in content extraction

ron.vandenbranden
Thanks, Dmitriy,

I've given it a try, but the build failed during the compilation of the
range index module:

compile-src:
      [echo] Compiling sources 'index-range'
     [javac] Compiling 13 source files to F:\devtools\eXist\exist-ws\exist\extensions\indexes\range\build\classes
     [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.7
     [javac] F:\devtools\eXist\exist-ws\exist\extensions\indexes\range\src\org\exist\indexing\range\RangeIndexConfig.java:102: error: cannot find symbol
     [javac]                 if (type != Type.ITEM) {
     [javac]                             ^
     [javac]   symbol:   variable Type
     [javac]   location: class RangeIndexConfig
     [javac] F:\devtools\eXist\exist-ws\exist\extensions\indexes\range\src\org\exist\indexing\range\RangeIndexConfig.java:107: error: cannot find symbol
     [javac]         return Type.ITEM;
     [javac]                ^
     [javac]   symbol:   variable Type
     [javac]   location: class RangeIndexConfig
     [javac] Note: Some input files use or override a deprecated API.
     [javac] Note: Recompile with -Xlint:deprecation for details.
     [javac] Note: F:\devtools\eXist\exist-ws\exist\extensions\indexes\range\src\org\exist\indexing\range\RangeIndexWorker.java uses unchecked or unsafe operations.
     [javac] Note: Recompile with -Xlint:unchecked for details.
     [javac] 2 errors
     [javac] 1 warning

This was with oracle-jdk1.8.0_73 (on Windows 7 Professional, 64 bit).
Since the "bootstrap" warning seemed to suggest that the source should
be compiled with Java 1.7, I've tried again with oracle-jdk1.7.0_51.

Though this made the "bootstrap" warning disappear, the rest of the
errors remained.

Should this branch be built with another specific Java version?

Best,

Ron

On 15/04/2016 11:20, Dmitriy Shabanov wrote:
> test this branch
> https://github.com/shabanovd/exist/tree/bugfix/content_extractor_ws
>

Re: whitespace lost in content extraction

Dmitriy Shabanov
Very strange: that branch differs from the latest develop branch by one commit.

I built it locally, and it compiled successfully with Java 1.8.

On Sat, Apr 16, 2016 at 12:21 AM, ron.vandenbranden <[hidden email]> wrote:
> Should this branch be built with another specific Java version?

--
Dmitriy Shabanov

Re: whitespace lost in content extraction

Jens Østergaard Petersen-2
Hi,

It builds fine for me (Mac OS X, Java 1.8.0_73-b02), and

import module namespace content="http://exist-db.org/xquery/contentextraction"
at "java:org.exist.contentextraction.xquery.ContentExtractionModule";

(: load the stored PDF as a binary document and extract its XHTML :)
let $path := '/db/test/testwhitespaceextraction.pdf'
let $binary := util:binary-doc($path)
return
    (: the module is bound to the prefix "content" in the import above :)
    content:get-metadata-and-content($binary)

returns

<div class="page">
<p/>
<p>test whitespace extraction</p>
<p/>
</div>

which looks right to me.

Jens

On 16 April 2016 at 00:03:59, Dmitriy Shabanov ([hidden email]) wrote:

> Very strange: that branch differs from the latest develop branch by one commit.
>
> I built it locally, and it compiled successfully with Java 1.8.

Re: whitespace lost in content extraction

ron.vandenbranden
Hi,

Sorry, I was mistaken: Git noob that I am, I had checked out Dmitriy's develop branch and forgot to switch to 'bugfix/content_extractor_ws'.

After switching to that branch, the build went flawlessly, and I can confirm it fixes the whitespace issues in content extraction.

Many thanks (I hope it gets merged into the eXist code soon)!

Best,

Ron

On 17/04/2016 10:03, Jens Østergaard Petersen wrote:
> <p>test whitespace extraction</p>
>
> which looks right to me.

Re: whitespace lost in content extraction

ron.vandenbranden
In reply to this post by Jens Østergaard Petersen-2
Ok,

Still one remark: this fix does not seem to apply to content:stream-content().

If I adapt Jens' code snippet to the following query:

  import module namespace content="http://exist-db.org/xquery/contentextraction"
    at "java:org.exist.contentextraction.xquery.ContentExtractionModule";
  declare namespace xhtml="http://www.w3.org/1999/xhtml";

  (: callback for content:stream-content(): returns each matched element unchanged :)
  declare function local:index-callback($root as element(), $path as xs:anyURI, $page as xs:integer?) {
    $root
  };

  let $path := '/db/test/testwhitespaceextraction.pdf'
  let $binary := util:binary-doc($path)
  let $callback := local:index-callback#3
  let $namespaces :=
    <namespaces><namespace prefix="xhtml" uri="http://www.w3.org/1999/xhtml"/></namespaces>
  let $stream :=
    content:stream-content($binary, "//xhtml:div", $callback, $namespaces, $path)
  let $non-stream := content:get-metadata-and-content($binary)//xhtml:div[@class='page']
  return <test>
    <stream>{$stream}</stream>
    <non-stream>{$non-stream}</non-stream>
  </test>

This returns the following result:
  <test>
    <stream>
      <div xmlns="http://www.w3.org/1999/xhtml" class="page">
        <p/>
        <p>test whitespaceextraction</p>
        <p/>
      </div>
    </stream>
    <non-stream>
      <div xmlns="http://www.w3.org/1999/xhtml" class="page">
        <p/>
        <p>test whitespace extraction</p>
        <p/>
      </div>
    </non-stream>
  </test>

As you can see, the whitespace differs between the two variants, and content:stream-content() still incorrectly suppresses whitespace.

Best,

Ron

Re: whitespace lost in content extraction

Dmitriy Shabanov
OK, I fixed that too. I wonder, should there be a parameter to switch WS handling on and off?

On Mon, Apr 18, 2016 at 5:21 PM, ron.vandenbranden <[hidden email]> wrote:
> Still one remark: this fix does not seem to apply to content:stream-content().

--
Dmitriy Shabanov

Re: whitespace lost in content extraction

ron.vandenbranden
Ok, perfect, thanks! It works now.

On 18/04/2016 19:52, Dmitriy Shabanov wrote:
> OK, I fixed that too. I wonder, should there be a parameter to switch
> WS handling on and off?

I don't know if that makes sense code-wise, but from a user perspective
wouldn't it be most consistent if content extraction followed the
whitespace settings (preserve-whitespace-mixed-content,
suppress-whitespace) defined for the indexers in conf.xml?

Best,

Ron

Re: whitespace lost in content extraction

Dmitriy Shabanov
On Tue, Apr 19, 2016 at 11:03 AM, ron.vandenbranden <[hidden email]> wrote:
> I don't know if that makes sense code-wise, but from a user perspective wouldn't it be most consistent if content extraction followed the whitespace settings (preserve-whitespace-mixed-content, suppress-whitespace) defined for the indexers in conf.xml?

Those settings serve a different purpose; mixing things up will surely cause problems later.

--
Dmitriy Shabanov
