Charset Problem

pr0gr4mm3r's picture

He has: 1,502 posts

Joined: Sep 2006

I have a client posting MS Word documents exported as HTML to my web server. Their smart quotes and other special characters are showing up as ?'s and boxes. I'm pretty sure it's a charset problem somewhere, but I don't know how to fix it. Any advice would be appreciated. :)

decibel.places's picture

He has: 1,494 posts

Joined: Jun 2008

Yeah, charsets can be tricky. I've done some work with Spanish and Hebrew content.

Have you tried UTF-8? I think it is the most complete and flexible encoding for most languages and special characters.
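For anyone wanting to try the UTF-8 suggestion by hand, here is a minimal sketch in Python. It assumes the exported file is encoded as Windows-1252 and declares that in a `<meta>` tag, which is common for Word exports but is not confirmed in this thread. The names `to_utf8` and `META_CHARSET` are made up for illustration.

```python
import re

# Matches the charset value inside a <meta ... charset=...> declaration.
# Assumption: the export declares its charset this way.
META_CHARSET = re.compile(r'(<meta[^>]*charset\s*=\s*["\']?)([\w-]+)', re.IGNORECASE)

def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from the assumed source encoding and re-encode as UTF-8."""
    # Raises UnicodeDecodeError if the bytes are not valid in source_encoding.
    text = raw.decode(source_encoding)
    # Point the meta declaration at the new encoding; nothing else changes.
    text = META_CHARSET.sub(lambda m: m.group(1) + "utf-8", text)
    return text.encode("utf-8")

sample = (b'<meta http-equiv=Content-Type content="text/html; '
          b'charset=windows-1252"><p>\x93Hi\x94</p>')
converted = to_utf8(sample)
print(converted.decode("utf-8"))
```

The markup itself is left alone; only the byte encoding and the declared charset change. If the server also sends a charset in its HTTP `Content-Type` header, that value would need to agree with the file.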

I'm looking at Word 2003 options, in the Save tab there is an option to "save smart tags as XML properties in Web Pages" which could have some effect... Also, in the save dialog, there are options to save the doc as a full HTML web page or filtered HTML...

Or, have you tried saving the entire Word doc as XML? It will become OpenXML, which can be viewed by Word as well as browsers.

Or, how do you import the Word/HTML docs? Into an RTE like TinyMCE or FCK? What filter(s) is the RTE using? Is this on a CMS like Drupal?

pr0gr4mm3r's picture

He has: 1,502 posts

Joined: Sep 2006

decibel.places wrote: "Or, how do you import the Word/HTML docs, into a RTE like TinyMCE or FCK? what filter(s) is the RTE using? Is this on a CMS like Drupal?"

Nope, the pages are saved and uploaded to the site as-is.

I did some further digging, and found that MS Word uses a proprietary charset which could be causing the problem. The only part I don't understand is that the client says it worked fine on his previous hosting space, but not on mine.
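To illustrate the kind of mismatch being described, here is a small Python sketch. It assumes the charset in question is Windows-1252, where curly double quotes are stored as the single bytes 0x93 and 0x94. The same bytes read under a different charset come out as replacement characters or invisible control codes, which would match the "?'s and boxes" symptom.

```python
# Bytes Word typically writes for curly double quotes in Windows-1252.
raw = b"\x93quoted\x94"

as_cp1252 = raw.decode("cp1252")                  # the intended text
as_utf8 = raw.decode("utf-8", errors="replace")   # what a UTF-8 reader sees
as_latin1 = raw.decode("latin-1")                 # a strict ISO-8859-1 reading

print(repr(as_cp1252))
print(repr(as_utf8))
print(repr(as_latin1))
```

One possible explanation for the difference between hosts, not confirmed in this thread, is that the two servers send different charset values in the HTTP `Content-Type` header. Browsers generally give that header priority over the `<meta>` tag inside the page.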

I will see if saving it in filtered HTML will do anything. Thanks for the tip.

He has: 629 posts

Joined: May 2007

Hello pr?gr?mm?r: (Sorry, couldn't resist :) )
You may find HTML Tidy useful. It has an option to clean up all the proprietary MS Word stuff.

You can download it from sourceforge[1], use an online version[2], the version that comes with the Firefox HTML Validator[3], or get one of many HTML editors that have it built in.

[1] http://tidy.sourceforge.net/
[2] http://infohound.net/tidy/
[3] http://users.skynet.be/mgueury/mozilla/

Cordially, David
--
delete from internet where user_agent="MSIE" and version < 8;

pr0gr4mm3r's picture

He has: 1,502 posts

Joined: Sep 2006

Nice, Tidy does the trick. Thanks!!

Is there an easy way to stop it from reformatting the HTML? I just want the charset issue fixed.

He has: 629 posts

Joined: May 2007

An easy way to stop Tidy reformatting text? I don't know for sure. There are several formatting options that seem to interact; it took me a while to get Tidy to format how I like. I imagine that setting all formatting options to "off" may work, but I have not tried it:

Tidy formatting options.
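If the goal is only to fix the characters and leave the markup byte-for-byte alone, one alternative that does not involve Tidy at all is to replace every non-ASCII character with a numeric character reference. The sketch below assumes the input is Windows-1252; `to_char_refs` is a made-up name for illustration, not part of any tool mentioned here.

```python
def to_char_refs(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Replace every non-ASCII character with a numeric character reference."""
    # Assumption: the file really is in source_encoding; decode() raises otherwise.
    text = raw.decode(source_encoding)
    # ASCII bytes pass through untouched, so tags and whitespace stay as they were.
    return text.encode("ascii", errors="xmlcharrefreplace")

sample = b"<p>It\x92s \x93fine\x94 \x96 really</p>"
print(to_char_refs(sample).decode("ascii"))
```

The result is pure ASCII, so it should display the same whatever charset the server or the page declares. The trade-off is that references are not interpreted inside `<script>` or `<style>` blocks, so any non-ASCII text there would need separate handling.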

Cordially, David
