Encoding conversion from ANSI/ASCII to UTF-8

Jon
Hello,

I have been trying to convert a batch of XML files from ANSI to UTF-8 in eXist 2.2.
The XML files are stored in the DB without their XML declarations. I run an XQuery (3.0) script that calls transform:transform to apply an XSLT (2.0) stylesheet to every file and saves each result at its current location.

In the XSLT, I specify the following configuration for the output:
<xsl:output method="xml" version="1.0" indent="yes" encoding="utf-8" omit-xml-declaration="no"/>.
I tried multiple variations of this <xsl:output> and it looks like no matter what, this line is ignored during the transformation.

Attached is my XSL transform: trigger.xsl <http://exist.2174344.n4.nabble.com/file/n4667391/trigger.xsl>

If anyone has an idea, that would be wonderful.

Thank you,
Jonathan

Re: Encoding conversion from ANSI/ASCII to UTF-8

Joe Wicentowski
Hi Jon,

Can you go into a little more detail about how you are determining that the output from eXist is not in the encoding you are expecting?

Also, how are you ingesting the ANSI-encoded documents/data into eXist?

Joe

> On Apr 17, 2015, at 10:31 AM, Jon <[hidden email]> wrote:
> --
> View this message in context: http://exist.2174344.n4.nabble.com/Encoding-conversion-from-ANSI-ASCII-to-UTF-8-tp4667391.html
> Sent from the exist-open mailing list archive at Nabble.com.
>
> ------------------------------------------------------------------------------
> BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
> Develop your own process in accordance with the BPMN 2 standard
> Learn Process modeling best practices with Bonita BPM through live exercises
> http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual- event?utm_
> source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
> _______________________________________________
> Exist-open mailing list
> [hidden email]
> https://lists.sourceforge.net/lists/listinfo/exist-open

Re: Encoding conversion from ANSI/ASCII to UTF-8

Jon
Hi Joe,

Thank you for your response.
I actually check the output encoding on this site: http://i-tools.org/charset.
Oddly, Notepad++ reports those same documents as UTF-8 without BOM. I am wondering whether the lack of a BOM results in the wrong encoding for the document.

The ANSI documents were actually created via a CMS and uploaded to eXist. Originally, documents created with that CMS would have UTF-8 encoding but the latest update of the CMS somehow broke this and the resulting files have a mix of ANSI/UTF-8 encodings.

Jonathan

Re: Encoding conversion from ANSI/ASCII to UTF-8

Joe Wicentowski
Hi Jon,

Sorry, I'm still not clear: how are you uploading these documents into eXist?  There are various methods for doing so (see http://exist-db.org/exist/apps/doc/uploading-files.xml), and knowing exactly how is helpful for people here in understanding where problems might be creeping in.

If I recall correctly, eXist expects the documents you upload to be UTF-8 encoded and sans BOM.  I'd suggest ensuring all of your documents are BOM-less UTF-8 before you upload them into eXist.  (Others - please correct me if I'm wrong here.) 

Joe
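
The two properties mentioned above (valid UTF-8, no BOM) can be checked on a file before it is uploaded. The following is only a minimal sketch in Python, not something eXist itself provides; the function name `inspect_bytes` is made up for this illustration.

```python
import codecs

def inspect_bytes(data: bytes) -> dict:
    """Report basic encoding properties of raw file content."""
    # A UTF-8 BOM is the three bytes EF BB BF at the very start of the file.
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        data.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    # Pure ASCII content is byte-for-byte identical in ASCII and UTF-8,
    # so such a file needs no conversion at all.
    ascii_only = all(b < 0x80 for b in data)
    return {"has_bom": has_bom, "valid_utf8": valid_utf8, "ascii_only": ascii_only}
```

A file that comes back with `valid_utf8` true and `has_bom` false already matches what is described above. A file with `valid_utf8` false contains bytes from some other encoding and has to be converted first.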

On Mon, Apr 20, 2015 at 9:06 AM, Jon <[hidden email]> wrote:

Re: Encoding conversion from ANSI/ASCII to UTF-8

hendrickst
Jonathan and I ended up posting about the same issue (he's the vendor we're using). ;-)

Regarding your question, I believe they're now using XQuery to create and update new files. Prior to February of this year they were using direct API calls via Java.

The goal was always to have UTF-8 encoded files. An upgrade to our 3rd party tool was put into place in February, and apparently the core team had not communicated that the core encoding had switched. We now have a fix in place for files going forward (as of this past Sunday), but we have files from roughly a two-month period that have the incorrect encoding. That's where we're currently stuck.

~Trevor Hendricks

Re: Encoding conversion from ANSI/ASCII to UTF-8

Joe Wicentowski
Hi Trevor, 

Ah, I see your earlier post now too: http://markmail.org/message/thnnwsraueiehbjs.

So the only place you have these possibly ANSI- or mal-encoded files is in eXist?  And you're trying to spit these files all back out as UTF-8?

I'd suggest doing a backup (http://exist-db.org/exist/apps/doc/backup.xml) to get all of the database contents back onto the filesystem, where you can run other utilities for sniffing and fixing encoding.  It seems to me that operating directly on the filesystem might help simplify the task.  Undoing encoding problems is very hairy.  Good luck!

Joe
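
Once the files are on the filesystem, the sniff-and-fix step could look roughly like the sketch below. It is an illustration under two assumptions that the thread does not confirm: that "ANSI" means Windows-1252 (`cp1252`), and that any file which is not valid UTF-8 is in that code page. The function name `to_utf8` is hypothetical.

```python
import codecs
from typing import Tuple

# Assumption: "ANSI" in this thread means Windows-1252. Adjust if it does not.
ASSUMED_LEGACY_ENCODING = "cp1252"

def to_utf8(data: bytes, legacy: str = ASSUMED_LEGACY_ENCODING) -> Tuple[bytes, str]:
    """Return (utf8_bytes, detected), where detected is 'utf-8' or the legacy codec name."""
    # Drop a leading UTF-8 BOM so the result is BOM-less.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    try:
        # Content that already decodes strictly as UTF-8 is returned unchanged.
        data.decode("utf-8")
        return data, "utf-8"
    except UnicodeDecodeError:
        # Otherwise treat the bytes as the assumed legacy code page.
        # Bytes that are undefined in that code page become U+FFFD.
        text = data.decode(legacy, errors="replace")
        return text.encode("utf-8"), legacy
```

Trying UTF-8 first and falling back only on failure is what makes this usable on a mixed set of files, because text in a single-byte code page with accented characters rarely happens to form valid UTF-8 sequences. It does not repair content that was double-encoded, so it is worth spot-checking a few converted files before reloading them.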


On Tue, Apr 21, 2015 at 2:32 PM, hendrickst <[hidden email]> wrote:
--
View this message in context: http://exist.2174344.n4.nabble.com/Encoding-conversion-from-ANSI-ASCII-to-UTF-8-tp4667391p4667406.html

Re: Encoding conversion from ANSI/ASCII to UTF-8

hendrickst

You have it correct: only in eXist. We’re using XForms for the modifications.

To make things worse, the same collection contains both UTF-8 and ANSI documents.

I’ve searched for filesystem tools without much luck. The only candidate so far is UTFCast, but the demo doesn’t allow you to export (even a few lines or a few files would be great).

We’ll keep searching, though. Thanks!
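
For a collection that mixes the two kinds of documents, a first step that needs no dedicated tool is to sort the exported files into those that are already valid UTF-8 and those that are not. The sketch below is a read-only illustration in Python; the function name `classify_tree` and the `*.xml` pattern are assumptions made for the example.

```python
from pathlib import Path

def classify_tree(root: str, pattern: str = "*.xml") -> dict:
    """Map each matching file under root to 'utf-8' or 'not-utf-8' without modifying it."""
    report = {}
    for path in sorted(Path(root).rglob(pattern)):
        try:
            # Strict decoding fails on any byte sequence that is not valid UTF-8.
            path.read_bytes().decode("utf-8")
            report[str(path)] = "utf-8"
        except UnicodeDecodeError:
            report[str(path)] = "not-utf-8"
    return report
```

Because nothing is written, this can be run safely against a backup directory to see how many files from the affected period actually need conversion.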

 

 

 

 

From: Joe Wicentowski [mailto:[hidden email]]
Sent: Tuesday, April 21, 2015 2:49 PM
To: Hendricks Trevor
Cc: [hidden email]
Subject: Re: [Exist-open] Encoding conversion from ANSI/ASCII to UTF-8

 
