Charset Problem
I have a client posting MS Word documents exported as HTML to my web server. Their smart quotes and other special characters are showing up as ?'s and boxes. I'm pretty sure it's a charset problem somewhere, but I don't know how to fix it. Any advice would be appreciated.
decibel.places posted this at 04:04 — 29th September 2008.
He has: 1,494 posts
Joined: Jun 2008
Yeah, charsets can be tricky. I've done some work with Spanish and Hebrew content.
Have you tried UTF-8? I think it is the most complete and flexible charset for most languages and special characters.
I'm looking at Word 2003 options, in the Save tab there is an option to "save smart tags as XML properties in Web Pages" which could have some effect... Also, in the save dialog, there are options to save the doc as a full HTML web page or filtered HTML...
Or, have you tried saving the entire Word doc as XML? It will become OpenXML, which can be viewed by Word as well as browsers.
Or, how do you import the Word/HTML docs? Into an RTE like TinyMCE or FCKeditor? What filter(s) is the RTE using? Is this on a CMS like Drupal?
pr0gr4mm3r posted this at 14:05 — 29th September 2008.
He has: 1,502 posts
Joined: Sep 2006
Nope, the pages are saved and uploaded to the site as-is.
I did some further digging and found that MS Word uses a proprietary charset, which could be causing the problem. The only part I don't understand is that the client says it worked fine on his previous hosting space, but not on mine.
I will see if saving it in filtered HTML will do anything. Thanks for the tip.
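For anyone hitting the same symptom, here is a minimal Python sketch of what is probably going on. It assumes the Word export is encoded as Windows-1252 (a common default for Word's HTML export, but not confirmed in this thread) and that the page is being read as UTF-8. One possible explanation for the difference between hosts is that the two servers send different charset values in the Content-Type header, which takes precedence over the meta tag in the file; that is a guess. The sample string and the function name transcode_to_utf8 are made up for illustration.

```python
import re

# Sample text with smart quotes, an en dash and a curly apostrophe.
SMART_QUOTED = "\u201cHello\u201d \u2013 it\u2019s"


def transcode_to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode exported HTML as UTF-8 and update its charset declaration.

    source_encoding is an assumption; check the meta tag in the export.
    """
    text = raw.decode(source_encoding)
    # Keep the declaration in the file consistent with the new encoding.
    text = re.sub(r"charset=windows-1252", "charset=utf-8", text, flags=re.I)
    return text.encode("utf-8")


# The bytes as Word would write them under the Windows-1252 assumption.
word_bytes = SMART_QUOTED.encode("cp1252")

# Reading those bytes as UTF-8 fails: each smart character becomes U+FFFD,
# which browsers typically draw as a question mark or a box.
misread = word_bytes.decode("utf-8", errors="replace")

# Transcoding first gives bytes that are valid UTF-8.
fixed = transcode_to_utf8(word_bytes)
```

After transcoding, the file and its meta tag agree on UTF-8, so the page should display correctly as long as the server does not send a conflicting charset header.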
webwiz posted this at 03:12 — 30th September 2008.
He has: 629 posts
Joined: May 2007
Hello pr?gr?mm?r: (Sorry, couldn't resist.)
You may find HTML Tidy useful. It has an option to clean up all the proprietary MS Word stuff.
You can download it from sourceforge[1], use an online version[2], the version that comes with the Firefox HTML Validator[3], or get one of many HTML editors that have it built in.
[1] http://tidy.sourceforge.net/
[2] http://infohound.net/tidy/
[3] http://users.skynet.be/mgueury/mozilla/
Cordially, David
--
delete from internet where user_agent="MSIE" and version < 8;
pr0gr4mm3r posted this at 13:40 — 30th September 2008.
He has: 1,502 posts
Joined: Sep 2006
Nice, tidy does the trick. Thanks!!
Is there an easy way to stop it from reformatting the HTML? I just want the charset issue fixed.
webwiz posted this at 19:15 — 1st October 2008.
He has: 629 posts
Joined: May 2007
An easy way to stop Tidy reformatting text? I don't know for sure. There are several options for formatting that seem to interact - it took me a while to get Tidy to format how I like. I imagine that setting all formatting options to "off" may work, but have not tried doing it:
Tidy formatting options.
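As a rough, untested starting point for the "turn the formatting off" idea, the Python sketch below builds a Tidy command line that converts the encoding while disabling line wrapping and indentation. The option names (input-encoding, output-encoding, wrap, indent, tidy-mark) are standard Tidy configuration options; the win1252 input encoding is an assumption about the Word export, and the helper names build_tidy_command and run_tidy are invented for this example. Tidy will likely still normalize the markup to some degree (closing tags, quoting attributes), so this reduces reformatting rather than eliminating it.

```python
import shutil
import subprocess


def build_tidy_command(path: str, source_encoding: str = "win1252") -> list:
    """Return a tidy command that transcodes to UTF-8 with minimal layout changes."""
    return [
        "tidy",
        "-quiet",                          # suppress nonessential output
        "-modify",                         # rewrite the input file in place
        "--input-encoding", source_encoding,
        "--output-encoding", "utf8",
        "--wrap", "0",                     # 0 disables line wrapping
        "--indent", "no",                  # do not re-indent the markup
        "--tidy-mark", "no",               # do not add a generator meta tag
        path,
    ]


def run_tidy(path: str) -> int:
    """Run tidy on one file; requires the tidy executable on PATH."""
    if shutil.which("tidy") is None:
        raise FileNotFoundError("tidy executable not found on PATH")
    result = subprocess.run(build_tidy_command(path), capture_output=True, text=True)
    # Tidy exits with 1 for warnings and 2 for errors; only errors are fatal here.
    if result.returncode >= 2:
        raise RuntimeError(result.stderr)
    return result.returncode
```

Since -modify overwrites the file, try it on a copy first, for example run_tidy("copy-of-page.html").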
Cordially, David