Copy Paste From Word Docs And Validation...

Roo's picture

She has: 840 posts

Joined: Apr 1999

Arrgggh!

I have a client that always sends me content in a Word doc. It's sooooooo time consuming to paste the text into a text editor and then have to edit it so that things will validate.

Mostly it's apostrophes and quote marks.

I'm putting the content into a CMS, and I always need to edit everything first.

Does anyone know of a program or a utility that will strip the crap from .docs?

Roo

Renegade's picture

He has: 3,022 posts

Joined: Oct 2002

Do you have Dreamweaver? There is a function in there which will do it for you.

Alternatively, you could try Save As text only and then open the file in Notepad.

Or just ask your client to give you content in .txt format? Explain to them that it is actually costing them more because all the crap has to be taken out.

demonhale's picture

He has: 3,278 posts

Joined: May 2005

I think HTML Tidy could help here. What I also sometimes do is save to RTF, then open that in a text program and copy from there, although I lose some text effects.

Abhishek Reddy's picture

He has: 3,348 posts

Joined: Jul 2001

Save as HTML and run the file through Textism's Word HTML Cleaner. Note that I last used it around 2003, so unless the script has been updated too, it might not work quite as well with files from newer Word versions.

While it could save you a bit of time, the output will still require a lot of work, unfortunately. MS Office programs seem to have a unique knack for producing markup that's readable by neither human nor machine.

Roo's picture

She has: 840 posts

Joined: Apr 1999

No, I don't have Dreamweaver. I use Edit Pad, which is just a plain text editor.

I will try saving as html and running through that Word Cleaner. (Gasp! I shudder to think of the mess that will look like!)

I suppose I could also try tidying with HTML Kit. I also have NoteTab Lite, CONText and TextPad. I really haven't used them, so I'll have to see if they offer any cleanup.

Roo

demonhale's picture

He has: 3,278 posts

Joined: May 2005

HTML Kit has HTML Tidy built into it... so you could use that if you have it...

He has: 688 posts

Joined: Feb 2001

I use Edit Pad for stuff like that all the time. Just copy from Word and paste it into Edit Pad. Edit Pad should strip out all the non-ASCII crap and leave you with just text only, or at least it does in my situations. If it doesn't clean stuff up for you it may be in the settings but all I ever do is copy and paste and I never end up with all those little squares and odd characters.
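If you'd rather script that cleanup than depend on how a particular editor handles pasted text, here is a minimal Python sketch of the same idea. It is not what Edit Pad does internally; the function name `to_plain_ascii` and the list of characters are my own assumptions, covering the usual Word offenders (curly quotes, dashes, ellipsis, non-breaking space).

```python
# Hypothetical helper: swap Word's typographic punctuation for plain ASCII.
# The character list is an assumption; extend it as needed for your documents.
WORD_PUNCTUATION = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
    "\u00a0": " ",    # non-breaking space
}

# str.maketrans accepts a dict of single characters mapped to replacement strings.
_TABLE = str.maketrans(WORD_PUNCTUATION)


def to_plain_ascii(text: str) -> str:
    """Return text with the mapped characters replaced; everything else is untouched."""
    return text.translate(_TABLE)
```

Paste the Word text into a file, run it through `to_plain_ascii`, and the quotes and apostrophes should come out as plain `'` and `"`. Characters not in the table (accented letters, for instance) are left alone rather than stripped.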

But it sounds like you also don't like the repetitive task of adding back in the HTML tags like paragraphs. If you have a lot of paragraphs and you want to avoid pasting <p>'s and </p>'s over and over again, remember that you can highlight line breaks and use the find and replace function.

a) Find the end of any paragraph and put your cursor down.
b) Drag down to just before the first letter of the next paragraph, which should be two "returns" on a word processor.
c) I realize that there's no actual text highlighted, but copy what you highlighted anyway.
d) CTRL+F will open up your Find function with your multi-line blank selection already in the Find part (even if you can't see it because there's no text).
e) Paste that same selection that you copied a moment ago into the Replace area. This is important because it's the easiest way to get the multiple lines into the Replace area, since using Return won't work there.
f) In the Replace area, add </p> to the top line and <p> to the bottom line.
g) Click Replace All and every paragraph break from your original Word document should now have HTML tags, no matter how big your document.
h) You'll just need to add one last <p> to the very top of your page and one final </p> to the very bottom and you're completely done.
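
For anyone who would rather script the find-and-replace steps above, here is a rough Python sketch of the same trick. It assumes paragraphs are separated by at least one blank line (the two "returns"), and the function name `wrap_paragraphs` is made up for illustration. It does not escape characters like `<` or `&` in the text, so treat it as a starting point only.

```python
import re


def wrap_paragraphs(text: str) -> str:
    """Wrap each blank-line-separated paragraph in <p>...</p> tags."""
    # Normalize Windows line endings so blank lines are easy to detect,
    # and trim leading/trailing whitespace so no empty paragraphs appear.
    text = text.replace("\r\n", "\n").strip()
    if not text:
        return ""
    # A paragraph break is a newline, optional whitespace, then another newline.
    paragraphs = re.split(r"\n\s*\n", text)
    # Adding the opening and closing tag to every paragraph also covers
    # the very first <p> and the very last </p> from step h).
    return "\n\n".join(f"<p>{p.strip()}</p>" for p in paragraphs)
```

Calling `wrap_paragraphs("First.\n\nSecond.")` gives two tagged paragraphs separated by a blank line, which you can then paste into the CMS.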
