util:parse doesn't accept BOM

Chris Tomlinson-2-2
Hi,

I’m not sure if this is considered an error or not. I have several XML documents that fail to parse with “Content not allowed in prolog.” (err EXXQDY0002). The cause seems to be a BOM at the start of the files: they appear to be UTF-8 with BOM. The files also have XML version 1.0 declarations (with encoding UTF-8).

oXygenXML 16 has no complaints opening the files.

I have seen a parser get confused by the BOM - declaring the file to be SGML and reporting the explicit XML declaration as a second declaration.

However, in this case it seems that util:parse is not ignoring the BOM and is instead reporting it as disallowed content.

Thanks,
Chris


------------------------------------------------------------------------------
Don't Limit Your Business. Reach for the Cloud.
GigeNET's Cloud Solutions provide you with the tools and support that
you need to offload your IT needs and focus on growing your business.
Configured For All Businesses. Start Your Cloud Today.
https://www.gigenetcloud.com/
_______________________________________________
Exist-open mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/exist-open

Re: util:parse doesn't accept BOM

Dannes Wessels-3
Hi,

BOMs are in general an issue for XML parsers (in Java?). Most of the time parsing fails.

A solution is to wrap the (file) input stream in this class: https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/BOMInputStream.html

The issue is finding all the entry points and making the handling consistent across the whole database.
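
As a rough illustration of the stream-wrapping idea, here is a minimal sketch that uses only the JDK rather than Commons IO. It is a stand-in for what BOMInputStream does for the UTF-8 case, not that class's implementation; the class name `Utf8BomSkipper` and its methods are hypothetical, and other BOM variants (UTF-16, UTF-32) are deliberately not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of eXist-db or Commons IO.
final class Utf8BomSkipper {

    // The three bytes that encode U+FEFF in UTF-8.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Returns a stream positioned just after a leading UTF-8 BOM, if there is one.
    static InputStream skipBom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int n = 0;
        // Read up to three bytes; a single read() may return fewer than requested.
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == UTF8_BOM.length && Arrays.equals(head, UTF8_BOM);
        if (!hasBom && n > 0) {
            // Not a BOM: push the bytes back so the caller sees the original content.
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    // Convenience wrapper for in-memory data.
    static byte[] strip(byte[] data) {
        try (InputStream in = skipBom(new ByteArrayInputStream(data));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int r;
            while ((r = in.read(buf)) != -1) {
                out.write(buf, 0, r);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        System.out.println(new String(strip(withBom), StandardCharsets.UTF_8));
    }
}
```

The wrapped stream can then be handed to whatever parser is in use; where exactly such a wrapper would have to be applied inside eXist is the open question raised above.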

regards dannes


--
www.exist-db.org

> On 13 Jul 2015, at 18:09, Chris Tomlinson <[hidden email]> wrote:
>
> Hi,
>
> I’m not sure if this is considered an error or not. I have several XML documents that fail to parse with “Content not allowed in prolog.” (err EXXQDY0002). The cause seems to be the occurrence of BOM in the files. They appear as UTF-8 with BOM. The files also have XML version 1.0 declarations (with encoding UTF-8).

Re: util:parse doesn't accept BOM

Chris Tomlinson-2-2
Hello Dannes,

So are you saying that eXist-db should add use of BOMInputStream? That seems helpful to me.

As I mentioned, the Java application oXygenXML 16 simply ignores an initial BOM, which seems like good behavior.

If eXist-db ignored BOM when storing/parsing documents it seems to me that would allow for storing documents with and without BOM in a transparent manner.

Thanks,
Chris

> On Jul 13, 2015, at 11:24 AM, Dannes Wessels <[hidden email]> wrote:
>
> Hi,
>
> BOMs are in general an issue for XML parsers (in Java?). Most of the time parsing fails.

Re: util:parse doesn't accept BOM

Dannes Wessels-3
Oxygen probably uses this same stream wrapper. I think I remember once seeing it in an Oxygen stack trace.
--
www.exist-db.org

> On 13 Jul 2015, at 18:42, Chris Tomlinson <[hidden email]> wrote:
>
> As I mentioned the Java application, oXygenXML 16, simply ignores an initial BOM which seems a good behavior.


Re: util:parse doesn't accept BOM

Adam Retter
In reply to this post by Dannes Wessels-3
Indeed the answer from Xerces (the XML parser used in eXist) seems to
be this - http://mail-archives.apache.org/mod_mbox/xerces-j-users/200401.mbox/%3C4010DC5D.3070109@...%3E

We would need to make quite a few changes in eXist, basically any
calls from any API to org.exist.collections.Collection#validate... and
org.exist.collections.Collection#store...

Not particularly complex, but it would probably take half a day to a day of effort.

Perhaps open an issue (feature request), with the key points from these threads?

On 13 July 2015 at 17:24, Dannes Wessels <[hidden email]> wrote:

> Hi,
>
> BOMs are in general an issue for XML parsers (in Java?). Most of the time parsing fails.

--
Adam Retter

eXist Developer
{ United Kingdom }
[hidden email]
irc://irc.freenode.net/existdb


Re: util:parse doesn't accept BOM

nsincaglia
In reply to this post by Dannes Wessels-3
I know this thread is about a year old, but I just had the pleasure of experiencing this issue. The cause of this problem is really difficult to determine. When I opened the files in Oxygen and eXide, the problem was completely undetectable. I could open and store the files fine, but util:parse() would fail when I used it on the original file.

I was finally able to figure out what the issue was by opening the XML files in jEdit and looking at the file properties. It listed the character encoding as UTF-8Y, which apparently is another name for UTF-8 with Byte Order Mark (BOM).

Will eXist-db v3.0 be able to handle this issue? If not, is there a way to detect the BOM and remove it?
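
On the "detect and remove" part of the question, a minimal sketch in Java follows. It assumes the document has already been decoded into a string, in which case a BOM shows up as a single leading U+FEFF character; the class and method names are hypothetical, and the JDK's built-in parser is used here only as a convenient stand-in, not as eXist's actual parsing path.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not an eXist-db API.
final class BomAwareParse {

    // A decoded BOM is the single character U+FEFF.
    static final char BOM = '\uFEFF';

    // True if the decoded text starts with a BOM character.
    static boolean startsWithBom(String xml) {
        return !xml.isEmpty() && xml.charAt(0) == BOM;
    }

    // Drops one leading BOM character; any other input is returned unchanged.
    static String stripBom(String xml) {
        return startsWithBom(xml) ? xml.substring(1) : xml;
    }

    // Parses the text after removing a leading BOM, so the XML declaration comes first.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(stripBom(xml))));
    }

    public static void main(String[] args) throws Exception {
        String xml = BOM + "<?xml version=\"1.0\" encoding=\"UTF-8\"?><doc/>";
        System.out.println("BOM present: " + startsWithBom(xml));
        System.out.println("Root element: " + parse(xml).getDocumentElement().getNodeName());
    }
}
```

The same check could presumably be applied to a string before it is passed to util:parse(), but that is an assumption about how the function receives its input.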


Re: util:parse doesn't accept BOM

Jens Østergaard Petersen-2

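It should be possible; see
<http://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java>.

Jens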


On 12 July 2016 at 03:59:05, nsincaglia ([hidden email]) wrote:

I know this thread is about year old but I just had the pleasure of
experiencing this issue. This is a really difficult problem to determine the
cause. When I opened the files in Oxygen and eXide, the problem was
completely undetectable. I could open and store the files fine but
util:parse() would fail when I used it on the original file.


Re: util:parse doesn't accept BOM

Adam Retter
Without some work, this is not supported by the Xerces2 XML parser
that we use. Can you open an issue with all relevant info and links
please?
Personally, I think it would be more interesting to switch parsers,
perhaps to something like Aalto XML.

On 12 July 2016 at 08:39, Jens Østergaard Petersen <[hidden email]> wrote:

> It should be possible; see
> <http://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java>.
>
> Jens

Re: util:parse doesn't accept BOM

Jens Østergaard Petersen-2
Adam, could you tell if Aalto XML (first time I’ve heard of it …) would accept <x xml:id="1"/> as well-formed? Xerces holds on to the original XML 1.0 definition of NCName, which was quite quickly superseded in XML 1.0 by the definition made in XML 1.1. This is especially annoying in that xml:id values beginning with digits are held to be malformed in eXist, though they are well-formed according to XML 1.0.

Jens

On 12 July 2016 at 10:00:19, Adam Retter ([hidden email]) wrote:

Without some work, this is not supported by the Xerces2 XML parser
that we use. Can you open an issue with all relevant info and links
please?
Personally I think it would be more interesting to switch parser,
perhaps to something like Aalto XML


Re: util:parse doesn't accept BOM

Dannes Wessels-3
In reply to this post by Adam Retter

On 12 Jul 2016, at 10:00 , Adam Retter <[hidden email]> wrote:

Aalto XML

My idea is to have a selectable parser setup. The current validation code is highly dependent on Xerces, so I guess we would need it anyway.

But some kind of per-collection parser configuration would probably work.

regards

Dannes

eXist-db Native XML Database



Re: util:parse doesn't accept BOM

Henry S. Thompson-2
In reply to this post by Adam Retter
Adam Retter writes:

> Personally I think it would be more interesting to switch parser,
> perhaps to something like Aalto XML

For speed and conformance, including full 5th edition character set
improvements, I strongly recommend ltxml2 [1], the library behind
rxp [2].

ht

[1] https://www.ltg.ed.ac.uk/software/ltxml2/
[2] https://www.ltg.ed.ac.uk/software/rxp/
--
       Henry S. Thompson, School of Informatics, University of Edinburgh
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
                Fax: (44) 131 650-4587, e-mail: [hidden email]
                       URL: http://www.ltg.ed.ac.uk/~ht/
 [mail from me _always_ has a .sig like this -- mail without it is forged spam]


Re: util:parse doesn't accept BOM

Adam Retter
In reply to this post by Jens Østergaard Petersen-2
I just tested that XML with Aalto and yes it parsed it just fine.

On 12 July 2016 at 17:12, Jens Østergaard Petersen <[hidden email]> wrote:

> Adam, could you tell if Aalto XML (first time I’ve heard of it …) would
> prove wellformed <x xml:id="1"/>? Xerces holds on to the original XML 1.0
> definition of NCName  which was quite quickly superseded in XML 1.0 by the
> definition made in XML 1.1. This is especially annoying in that xml:id’s
> beginning with digits are held to be malformed in eXist, though they are
> wellformed according to XML 1.0.
>
> Jens

Re: util:parse doesn't accept BOM

Adam Retter
In reply to this post by Henry S. Thompson-2
Henry,

Whilst that looks like a very interesting project, unfortunately there
are two restrictions which would stop us from using it with eXist:

1) It is GPL licensed

2) It is written in C, and eXist is in Java

On 13 July 2016 at 08:31, Henry S. Thompson <[hidden email]> wrote:

> Adam Retter writes:
>
>> Personally I think it would be more interesting to switch parser,
>> perhaps to something like Aalto XML
>
> For speed and conformance, including full 5th edition character set
> improvements, I strongly recommend ltxml2 [1], the library behind
> rxp [2].
>
> ht
>
> [1] https://www.ltg.ed.ac.uk/software/ltxml2/
> [2] https://www.ltg.ed.ac.uk/software/rxp/

Re: util:parse doesn't accept BOM

Henry S. Thompson-2
Adam Retter writes:

> Henry,
>
> Whilst that looks like a very interesting project, unfortunately there
> are two restrictions which would stop us from using it with eXist:
>
> 1) It is GPL licensed

Wasn't sure about that.

> 2) It is written in C, and eXist is in Java

Indeed -- shimming is possible, but complicated, I understand.

ht

Re: util:parse doesn't accept BOM

Adam Retter
> Wasn't sure about that.
>
>> 2) It is written in C, and eXist is in Java
>

Actually, adding the necessary JNI code is probably the easiest bit.
The hard part is ensuring cross-platform support and tooling so that
everyone can compile the C code as well as the Java code. Not to
mention then creating platform-specific distributions of eXist (at the
moment Java lets us forget about all of that).

--
Adam Retter

eXist Developer
{ United Kingdom }
[hidden email]
irc://irc.freenode.net/existdb


Re: util:parse doesn't accept BOM

Adam Retter
In reply to this post by Adam Retter
Jens,

I looked into this a bit further. Whilst I would love to get Aalto
into eXist, the problem is that Aalto doesn't support validation, and
at the moment eXist offers optional XML Schema and DTD validation of
documents on store.

However, there is a middle ground, which is Woodstox. Whilst not as
fast as Aalto, it beats the pants off Xerces. Woodstox also
implements the SAX API, so it is basically a drop-in replacement for
Xerces. It is also meant to be more conformant than Xerces, and
it offers native support for RelaxNG as well as XML Schema
and DTD.

I basically can't see any downside to switching from Xerces to
Woodstox. In addition, it is almost zero work for us to do it if we
stick with the SAX API. Woodstox also offers a StAX API, which should
again improve performance over the SAX API, but switching eXist over
from SAX to StAX would be some more work, so we could put a pin in
that for the future.
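
To illustrate why sticking with the SAX API keeps a parser swap cheap, here is a minimal sketch written only against JAXP/SAX. The class name SaxElementCounter is made up for this example and is not eXist code. JAXP chooses the concrete parser through its usual factory lookup, so the handler code stays the same whichever implementation is found; whether a given Woodstox release registers itself for that lookup in an eXist installation is an assumption that would need checking.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of eXist.
final class SaxElementCounter {

    // Counts element start tags using whatever SAX parser JAXP discovers.
    static int countElements(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        SAXParser parser = factory.newSAXParser();

        final int[] count = {0};
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements("<a><b/><b/></a>"));
    }
}
```

Nothing in this code names a parser implementation, which is the property that would make a Xerces replacement mostly a configuration change.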

Cheers Adam.

On 13 July 2016 at 11:59, Adam Retter <[hidden email]> wrote:

> I just tested that XML with Aalto and yes it parsed it just fine.
>
> On 12 July 2016 at 17:12, Jens Østergaard Petersen <[hidden email]> wrote:
>> Adam, could you tell if Aalto XML (first time I’ve heard of it …) would
>> prove wellformed <x xml:id="1"/>? Xerces holds on to the original XML 1.0
>> definition of NCName  which was quite quickly superseded in XML 1.0 by the
>> definition made in XML 1.1. This is especially annoying in that xml:id’s
>> beginning with digits are held to be malformed in eXist, though they are
>> wellformed according to XML 1.0.
>>
>> Jens
>>
>> On 12 July 2016 at 10:00:19, Adam Retter ([hidden email]) wrote:
>>
>> Without some work, this is not supported by the Xerces2 XML parser
>> that we use. Can you open an issue with all relevant info and links
>> please?
>> Personally I think it would be more interesting to switch parser,
>> perhaps to something like Aalto XML
>>
>> On 12 July 2016 at 08:39, Jens Østergaard Petersen <[hidden email]>
>> wrote:
>>> It should be possible; see
>>>
>>> <http://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java>.
>>>
>>> Jens
>>>
>>>
>>> On 12 July 2016 at 03:59:05, nsincaglia ([hidden email]) wrote:
>>>
>>> I know this thread is about a year old, but I just had the pleasure of
>>> experiencing this issue. The cause is really difficult to determine.
>>> When I opened the files in Oxygen and eXide, the problem was
>>> completely undetectable. I could open and store the files fine, but
>>> util:parse() would fail when I used it on the original file.
>>>
>>> I finally figured out what the issue was by opening the XML files
>>> in jEdit and looking at the file properties. It listed the character
>>> encoding as UTF-8Y, which apparently is another name for UTF-8 with
>>> Byte Order Mark (BOM).
>>>
>>> Will eXist-db v3.0 be able to handle this issue? If not, is there a way to
>>> detect the BOM and remove it?
>>>

--
Adam Retter

eXist Developer
{ United Kingdom }
[hidden email]
irc://irc.freenode.net/existdb


Re: util:parse doesn't accept BOM

nsincaglia
I wanted to reopen this discussion with the community because I came across this issue again yesterday. A couple of things concern me about it:

1). It is very difficult to diagnose. Even though I had experienced this issue in the past, it still took me longer than I wished to finally realize what the issue was.
2). After I have figured out that the file contained a BOM, I must then convince the organization that sent me the offending XMLs that they need to fix this issue so that I can process their XMLs. They can easily turn around and tell me "We don't see the issue. The XML is valid according to our tools. Plus, no one else has complained about it, so it must be a personal problem".

The last time we discussed this topic, it was suggested that maybe eXist-db should consider something other than Apache Xerces2. This seems pretty extreme for such a minor issue.

So, I wanted to suggest another possible solution and get some opinions on the viability of the suggestion.

Would it be possible to develop a Java module that one could install that could check the first bytes of a retrieved XML resource and remove the BOM if it exists and do nothing if the BOM is not detected?

This way we would still have a path forward if the party who sent us the XML with a BOM refuses to address the issue.

Having a "bom-removal" Java module would enable us to drop in a function call after retrieving the XML resource but before the parse-xml function call, where an error is thrown if a BOM exists. Something like this:

(: bom-removal() is the proposed function; it does not exist yet :)
let $resource := ft-client:retrieve-resource($handle, $file-path)
let $resource-decoded := util:base64-decode($resource)
let $resource-decoded-bom-removed := bom-removal($resource-decoded)
return parse-xml($resource-decoded-bom-removed)

This would not be an all-encompassing solution, but I was thinking that it could be a solution that one could use in situations where they don't control or can't influence the XML creation process.
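
The core of such a module (check the first bytes, drop the BOM if it is there, otherwise do nothing) might look roughly like the Java sketch below. The class and method names are made up for illustration and are not an existing eXist-db API, and only the UTF-8 BOM (bytes EF BB BF) is handled; how this would be exposed as an XQuery function is left open.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; the name and signature are illustrative only.
final class BomRemoval {

    // The UTF-8 encoding of U+FEFF, i.e. the UTF-8 byte order mark.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Returns the data without a leading UTF-8 BOM.
    // If no BOM is present, the input array is returned unchanged.
    static byte[] stripUtf8Bom(byte[] data) {
        if (data.length >= UTF8_BOM.length
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2]) {
            return Arrays.copyOfRange(data, UTF8_BOM.length, data.length);
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] withBom = "\uFEFF<a/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(stripUtf8Bom(withBom), StandardCharsets.UTF_8));
    }
}
```

Working on the raw bytes, before any decoding to a string, keeps the check independent of how the resource was retrieved.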

Thoughts?

Nick


Re: util:parse doesn't accept BOM

Michael Westbay-2
Hi Nick,

I've had problems with strange things in HTML files while parsing in the past. My solution has been to wrap all HTTP requests in another module and optionally pass in a pre-processing function.

I start off with a commonly used function that just replaces something like a Japanese double-width space with a normal space:

(: Replace full-width (ideographic) spaces, U+3000, with ordinary spaces :)
declare variable $pages:REPLACE-FULL-SPACES := function($body) {
  replace($body, '&#x3000;', ' ')
};

Then my fetch function that takes an optional pre-process callback:

(: $replace-callback, if supplied, is applied to the response body before parsing :)
declare function pages:fetch($url as xs:string, $cookie as xs:string?, $replace-callback as function(*)?) as item()? {
  let $encoding := pages:get-encoding($url)
  let $options := pages:get-options($encoding)
  let $headers := if (string-length($cookie) gt 0) then <headers><header name="Cookie">{$cookie}</header></headers> else ()
  let $raw-page :=
      let $request := <http:request method="GET" href="{$url}" override-media-type="html">{
        if (string-length($cookie) gt 0) then
        <http:header name="Cookie" value="{$cookie}"/>
        else ()
      }
        <http:header name="Connection" value="close"/>
      </http:request>
      let $responses := http:send-request($request)
      return if ($responses[1]/@status eq '200') then
        let $body := util:binary-to-string($responses[2],$encoding)
        (: Fix for all kinds of specific calamities :)
        let $body := if (exists($replace-callback)) then $replace-callback($body) else $body
        return <httpc:response statusCode="200">
          <httpc:headers>{$responses[1]/httpc:headers}</httpc:headers>
          <httpc:body type="xml" mimetype="text/xml">{
            util:parse-html($body,$options)
          }
          </httpc:body>
        </httpc:response>
        else
           ()
  return $raw-page
};

And I can call it from my main module like:

let $page := pages:fetch($url, (), $pages:REPLACE-FULL-SPACES)

The gist of the matter is, first fetch the URL, getting the body as a string. Run a replacement function against the body while it is a string, replacing (or removing) anything that may cause problems in parsing. Then parse the result into XML.
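
The same fetch, clean up, then parse idea can be sketched outside XQuery as well. The Java below is only an illustration under assumptions: the class name, the STRIP_BOM callback and the parse method are invented for this example, and the standard JAXP DOM parser stands in for util:parse-html. The callback here removes a leading U+FEFF (the BOM as it appears once the bytes have been decoded to a string), which is the clean-up step relevant to this thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.function.UnaryOperator;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class; names are illustrative only.
final class PreprocessThenParse {

    // Example clean-up callback: drop a leading U+FEFF if present.
    static final UnaryOperator<String> STRIP_BOM =
            body -> body.startsWith("\uFEFF") ? body.substring(1) : body;

    // Applies the optional callback to the string body, then parses the result.
    static Document parse(String body, UnaryOperator<String> fixup)
            throws ParserConfigurationException, SAXException, IOException {
        String cleaned = (fixup != null) ? fixup.apply(body) : body;
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(cleaned)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("\uFEFF<?xml version=\"1.0\"?><root/>", STRIP_BOM);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Passing the clean-up step as a function keeps the fetching and parsing code generic, so each data source can bring its own fix.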

The above is for eXist 2.1dev, with my patches to NekoHTML and util:parse-html so that NekoHTML-specific optional parameters can be passed. But the basic process should work.

Hope this helps.




--
Michael Westbay
Writer/System Administrator
http://www.japanesebaseball.com/


Re: util:parse doesn't accept BOM

Dannes Wessels-3
In reply to this post by nsincaglia
Hi,

Apache's BOMInputStream (from Commons IO) does exactly what you need: it strips the additional bytes when they are present. To repair a file, you could run code like the following:

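import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.commons.io.IOUtils;
import org.apache.commons.io.input.BOMInputStream;

File in = new File("in.xml");
File out = new File("out.xml");

// BOMInputStream skips a leading UTF-8 BOM, if there is one
try (InputStream is = new BOMInputStream(new FileInputStream(in));
     OutputStream os = new FileOutputStream(out)) {
    IOUtils.copy(is, os);
}

Regards,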

Dannes


Re: util:parse doesn't accept BOM

Dannes Wessels-3
Better, in Java 8 style and with fewer dependencies:

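import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.commons.io.input.BOMInputStream;

Path in = Paths.get("in.xml");
Path out = Paths.get("out.xml");

try (InputStream is = new BOMInputStream(Files.newInputStream(in))) {
    Files.copy(is, out); // fails if out.xml already exists
}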
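
If pulling in Commons IO is not an option, the same stream wrapping can be approximated with the JDK alone. The following is a minimal sketch, not a replacement for BOMInputStream: the class and method names are made up, and it only recognises the UTF-8 BOM. It peeks at up to three bytes through a PushbackInputStream and pushes them back when they are not a BOM.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical JDK-only helper; names are illustrative only.
final class BomSkippingStreams {

    // Wraps the stream so that a leading UTF-8 BOM (EF BB BF) is skipped.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];

        // Read up to three bytes; a single read() call may return fewer.
        int n = 0;
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break; // end of stream reached early
            }
            n += r;
        }

        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        // Not a BOM: put back whatever was read so the caller sees it all.
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "\uFEFF<a/>".getBytes(StandardCharsets.UTF_8);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(raw)), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}
```

The wrapped stream can be handed to Files.copy in the same way as the BOMInputStream in the snippet above.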


