catdoc ported to Windows

Recently I had to automatically extract text from a bunch of Word documents under Windows. I liked the look of catdoc, but didn’t see a native Win32 port around. The source code looked very close to compiling under MinGW, so I made the few minor changes necessary and got it working (catdoc, catppt, and xls2csv): native Win32 executables, support for long filenames, etc.

Basically all I did was:

Nothing special, and it’s not perfect. But here is a zip of the compiled binaries and (GPL-licensed) source code, just for you:

15 September 2009 by Ben

12 comments (oldest first)

Ernesto 4 Mar 2010, 02:11

Excellent !

You’ve really fixed some bugs in the 16-bit catdoc. I have two questions:

1) What happens with docx files? 2) I have some problems with Spanish; what do I need to do?


Ben 4 Mar 2010, 15:08

Hi Ernesto, I don’t think catdoc supports docx at all, so you’ll have to use another tool for that. Unfortunately I can’t help you with the Spanish issue. -Ben

Corporal Max Sterling 13 Apr 2010, 08:42

you are awesome dude, this totally saved me a huge headache, as C++ is my weakest language. thank you!!

Ted 6 Aug 2010, 08:45

I can’t figure out why I can’t get it to work. Whenever I try to run catdoc for Windows I get the following message:

Cannot load charset cp1251 – file not found

I successfully ran the build.bat file. Not sure what else I need to do to get it to work.

Ben 6 Aug 2010, 13:17

Hi Ted, have you unzipped the entire zip file into a directory? It includes the pre-built catdoc.exe as well as the charsets directory. Make sure cp1251.txt exists in the “charsets” directory inside the folder where you’ve put catdoc.

Ted 7 Aug 2010, 07:11

Thanks Ben for the quick reply. I unzipped the file and moved the contents to the web directory where it will be used. Inside the catdoc folder I have a charsets, compat and src folder. Inside the charsets folder I see the cp1251.txt file.

When I ran the build.bat batch file I received the following messages:

C:\Inetpub\wwwroot\kb\catdoc\src>gcc -DCATDOC_VERSION=\"0.94.2\" -O2 charsets.c substmap.c fileutil.c confutil.c numutils.c ole.c catdoc.c writer.c analyze.c rtfread.c reader.c ../compat/glob.c -I../compat -o ../catdoc

C:\Inetpub\wwwroot\kb\catdoc\src>gcc -DCATDOC_VERSION=\"0.94.2\" -O2 charsets.c substmap.c fileutil.c confutil.c numutils.c ole.c xls2csv.c sheet.c xlsparse.c ../compat/glob.c -I../compat -o ../xls2csv

C:\Inetpub\wwwroot\kb\catdoc\src>gcc -DCATDOC_VERSION=\"0.94.2\" -O2 charsets.c substmap.c fileutil.c confutil.c numutils.c ole.c catppt.c pptparse.c ../compat/glob.c -I../compat -o ../catppt

Once that completed I ran the following command at the DOS command prompt to test it:

C:/Inetpub/wwwroot/kb/catdoc/catdoc -w "filetoextract.doc"

That’s when it returns the following message:

Cannot load charset cp1251 – file not found

Ben Hoyt 7 Aug 2010, 13:52

Hi Ted, I’m not sure what the problem is there if it built fine. However, did you try the catdoc.exe that comes pre-built in the .zip file?

Markos 21 Aug 2010, 06:46

I have the exact same problem. I also have the workaround.

You see, for catdoc to work you need to run it from its own directory: the “working directory” needs to be the one where catdoc is installed.

So unless Ben surprises us with a more “portable” solution that would let us put the catdoc directory in the PATH, we have to manually “cd pathofcatdoc && catdoc somefile”.

Perhaps an environment variable could be added that tells catdoc and the other programs where to look for the charset files.

Hope that helps! :-)
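
The workaround above can also be scripted so that the caller does not have to change directory by hand. The following is only a minimal Python sketch, not part of the port: the function names `catdoc_command` and `extract_text` are made up for illustration, and it assumes the executable is called catdoc.exe and sits directly in the install directory next to the charsets folder. The document path is made absolute first, because a relative path would stop resolving once the working directory changes.

```python
import os
import subprocess


def catdoc_command(catdoc_dir, doc_path, extra_args=()):
    """Return (argv, cwd) for running catdoc from its own directory.

    Hypothetical helper: assumes catdoc.exe lives directly in catdoc_dir.
    """
    cwd = os.path.abspath(catdoc_dir)
    exe = os.path.join(cwd, "catdoc.exe")
    # Resolve the document path now, before the working directory changes.
    argv = [exe, *extra_args, os.path.abspath(doc_path)]
    return argv, cwd


def extract_text(catdoc_dir, doc_path, extra_args=()):
    """Run catdoc with its install directory as the working directory.

    Returns raw bytes, since the output encoding depends on catdoc's options.
    """
    argv, cwd = catdoc_command(catdoc_dir, doc_path, extra_args)
    result = subprocess.run(argv, cwd=cwd, capture_output=True, check=True)
    return result.stdout
```

With the paths from this thread, a call might look like `extract_text(r"C:\Inetpub\wwwroot\kb\catdoc", "filetoextract.doc", ["-w"])`.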

elef 5 Nov 2010, 06:24

Thanks for this port.

It’s just what I needed and I got it to work quite easily, but I have an issue: some accented letters (öüóőúáűéí) are getting corrupted. The weird thing is, part of the file is fine and other parts have wrong characters. Was catdoc not designed to handle non-ASCII characters in the first place? Or is this some anomaly of the Windows port? I can supply some sample files if that helps.

elef 5 Nov 2010, 06:34

Note: I tried various output encodings (although I’d much prefer UTF-8) and I also tried the -u option. Nothing seems to help. E.g. this: “Németország (7 szövetségi tartomány)” turns into this: “Nйmetorszбg (7 szцvetsйgi tartomбny)” while the same characters are conserved fine in other parts of the same document.
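
The corrupted sample above is consistent with bytes in a Central European code page being decoded as Cyrillic cp1251, the charset named in the “Cannot load charset cp1251” message earlier in this thread. The short Python check below reproduces the exact garbling under the assumption that the text was stored as cp1250; that source encoding is a guess and is not confirmed anywhere in the thread. The function name `misdecode` is made up for illustration.

```python
def misdecode(text, actual="cp1250", assumed="cp1251"):
    """Encode text in one code page and decode it as another.

    The default code pages are assumptions chosen to match the sample
    in the comment above, not something catdoc reports.
    """
    return text.encode(actual).decode(assumed)


original = "Németország (7 szövetségi tartomány)"
print(misdecode(original))  # prints: Nйmetorszбg (7 szцvetsйgi tartomбny)
```

If this is what is happening, the parts of the document that come out fine may be stored as Unicode, while the 8-bit parts are read with the wrong source charset. catdoc’s -s option selects the source charset, so something like -s cp1250 might be worth trying, provided a matching charset file is present in the charsets directory.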

Pavel 14 Jan 2012, 22:15

Hi, thanks for this compilation. I have 64-bit Windows 7, and the 16-bit catdoc won’t run on it. I also have problems with language-specific characters, as others have mentioned before. Nevertheless, thank you once again.

Pavel 14 Jan 2012, 22:20

Someone mentioned needing docx-to-txt conversion. I’m not sure whether a tool exists for that, but it should be possible to build one in one environment or another, because a docx file is just a zipped bunch of XML files. I will probably build my own, specifically in PHP.
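
To illustrate the “zipped bunch of XML files” point, here is a rough sketch in Python (rather than the PHP mentioned above) that pulls the body text out of a docx file. It is a simplification under stated assumptions: it only reads word/document.xml, joins the text runs of each paragraph, and ignores tabs, line breaks, headers, footers and footnotes. The function name `docx_to_text` is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(source):
    """Return the body text of a .docx file, one line per paragraph.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling `docx_to_text("report.docx")` on a real file should return its paragraphs separated by newlines; documents with text boxes or other nested content would need extra handling.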