catdoc ported to Windows
Recently I had to automatically extract text from a bunch of Word documents under Windows. I liked the looks of catdoc, but didn’t see a native Win32 port around. The source code looked so very close to compiling under MinGW, so I made the few minor changes necessary and got it working (catdoc, catppt, and xls2csv). Native Win32 executables, support for long filenames, etc.
Basically all I did was:
- Add a glob function from the BSD-licensed unixem library.
- Change a few of the #ifdef __MSDOS__ to #if defined(__MSDOS__) || defined(_WIN32).
- Make one or two other minor changes to fileutil.c, including the exe_dir() function.
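For the curious, the preprocessor change looks roughly like the sketch below. The helper dir_from_path is hypothetical and written only for illustration; it is not catdoc's actual exe_dir(), just the kind of path handling that has to treat Win32 the same way as MS-DOS.

```c
#include <stddef.h>
#include <string.h>

/* Same condition as in the port: treat native Win32 like MS-DOS. */
#if defined(__MSDOS__) || defined(_WIN32)
#define IS_DIR_SEP(c) ((c) == '\\' || (c) == '/')
#else
#define IS_DIR_SEP(c) ((c) == '/')
#endif

/* Hypothetical helper (NOT catdoc's real exe_dir): copy the directory
 * part of `path` into `out`.
 * Returns 0 on success, -1 if `out` is too small. */
int dir_from_path(const char *path, char *out, size_t outsize)
{
    size_t len = strlen(path);

    /* Walk back to the last directory separator. */
    while (len > 0 && !IS_DIR_SEP(path[len - 1]))
        len--;
    /* Drop the separator itself, but keep a lone root separator. */
    if (len > 1)
        len--;
    /* No directory component at all: fall back to the current directory. */
    if (len == 0) {
        path = ".";
        len = 1;
    }
    if (len + 1 > outsize)
        return -1;
    memcpy(out, path, len);
    out[len] = '\0';
    return 0;
}
```

On a Win32 build both kinds of slash count as separators, so a path using either form is split at the last one; elsewhere only the forward slash is recognised.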
Nothing special, and it’s not perfect. But here is a zip of the compiled binaries and (GPL-licensed) source code, just for you:
15 September 2009 by Ben 12 comments
12 comments (oldest first)
Hi Ernesto, I don’t think catdoc supports docx at all, so you’ll have to use another tool for that. Unfortunately I can’t help you with the Spanish issue. -Ben
you are awesome dude, this totally saved me a huge headache, as C++ is my weakest language. thank you!!
I can’t figure out why I can’t get it to work. Whenever I try to run catdoc for Windows I get the following message:
Cannot load charset cp1251 – file not found
I successfully ran the build.bat file. Not sure what else I need to do to get it to work.
Hi Ted, have you unzipped the entire zip file into a directory? It includes the pre-built catdoc.exe as well as the charsets directory. Make sure cp1251.txt exists in the “charsets” dir inside wherever you’ve got catdoc.
Thanks Ben for the quick reply. I unzipped the file and moved them to the web directory where it will be used. Inside the catdoc folder I have a charsets, compat and src folder. Inside the charsets folder I see the cp1251.txt file.
When I ran the build.bat batch file I received the following messages:
C:\Inetpub\wwwroot\kb\catdoc\src>gcc -DCATDOC_VERSION=\"0.94.2\" -O2 charsets.c substmap.c fileutil.c confutil.c numutils.c ole.c catdoc.c writer.c analyze.c rtfread.c reader.c ../compat/glob.c -I../compat -o ../catdoc
C:\Inetpub\wwwroot\kb\catdoc\src>gcc -DCATDOC_VERSION=\"0.94.2\" -O2 charsets.c substmap.c fileutil.c confutil.c numutils.c ole.c xls2csv.c sheet.c xlsparse.c ../compat/glob.c -I../compat -o ../xls2csv
C:\Inetpub\wwwroot\kb\catdoc\src>gcc -DCATDOC_VERSION=\"0.94.2\" -O2 charsets.c substmap.c fileutil.c confutil.c numutils.c ole.c catppt.c pptparse.c ../compat/glob.c -I../compat -o ../catppt
Once that completed I ran the following command at the DOS command prompt to test it:
C:/Inetpub/wwwroot/kb/catdoc/catdoc -w "filetoextract.doc"
That’s when it returns the following message:
Cannot load charset cp1251 – file not found
Hi Ted, I’m not sure what the problem is there if it built fine. However, did you try the catdoc.exe that comes pre-built with the .zip file?
I have the exact same problem. I also have the workaround.
You see for catdoc to work, you need to run it from its own directory: The “working directory” needs to be the one where catdoc is installed.
So unless Ben surprises us with a more “portable” solution that would let us put the catdoc directory in the PATH, we have to manually “cd pathofcatdoc && catdoc somefile”.
Perhaps an environment variable could be added that tells catdoc and the other tools where to look for the charset files.
Hope that helps! :-)
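The environment-variable idea in the comment above could be sketched roughly as follows. This is only an illustration: the variable name CATDOC_CHARSETS_DIR and both functions are hypothetical, and nothing here is in the catdoc source. It assumes the fallback is a "charsets" folder next to the executable, as in the zip layout described earlier.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variable name; catdoc itself does not define it. */
#define CHARSETS_ENV "CATDOC_CHARSETS_DIR"

/* Build the path of a charset file into `out`.
 * If `override_dir` is non-NULL and non-empty it is used as the charset
 * directory; otherwise fall back to "<exe_dir>/charsets".
 * Returns 0 on success, -1 if the result does not fit in `out`. */
int charset_file_path(const char *override_dir, const char *exe_dir,
                      const char *charset, char *out, size_t outsize)
{
    int n;

    if (override_dir != NULL && strlen(override_dir) > 0)
        n = snprintf(out, outsize, "%s/%s.txt", override_dir, charset);
    else
        n = snprintf(out, outsize, "%s/charsets/%s.txt", exe_dir, charset);

    /* snprintf returns the length it wanted to write; detect truncation. */
    return (n < 0 || (size_t)n >= outsize) ? -1 : 0;
}

/* Convenience wrapper that reads the override from the environment. */
int charset_file_path_env(const char *exe_dir, const char *charset,
                          char *out, size_t outsize)
{
    return charset_file_path(getenv(CHARSETS_ENV), exe_dir, charset,
                             out, outsize);
}
```

With something like this, the lookup would no longer depend on the working directory: either the variable points at the charset files, or they are found relative to the executable's own location.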
Thanks for this port.
It’s just what I needed and I got it to work quite easily, but I have an issue: some accented letters (öüóőúáűéí) are getting corrupted. The weird thing is, part of the file is fine and other parts have wrong characters. Was catdoc not designed to handle non-ASCII characters in the first place? Or is this some anomaly of the Windows port? I can supply some sample files if that helps.
Note: I tried various output encodings (although I’d much prefer UTF-8) and I also tried the -u option. Nothing seems to help. E.g. this: “Németország (7 szövetségi tartomány)” turns into this: “Nйmetorszбg (7 szцvetsйgi tartomбny)” while the same characters are conserved fine in other parts of the same document.
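The substitutions in that example (é to й, á to б, ö to ц) are what you would expect if 8-bit Latin text were decoded as cp1251, the charset named in the error message earlier in this thread. In cp1250 and cp1252, é is byte 0xE9, and in cp1251 byte 0xE9 is й. The sketch below shows the mapping for the upper byte range only; the function names are made up for illustration. If this is the cause, setting the source charset explicitly may help, but check catdoc's own documentation for the exact option.

```c
#include <stdio.h>

/* cp1252: bytes 0xA0..0xFF decode to the same-numbered code point. */
unsigned decode_cp1252_high(unsigned char b)
{
    return b;
}

/* cp1251: bytes 0xC0..0xFF decode to Cyrillic U+0410..U+044F.
 * Only valid for b >= 0xC0. */
unsigned decode_cp1251_high(unsigned char b)
{
    return 0x0410u + (unsigned)(b - 0xC0);
}

/* Print how one byte (>= 0xC0) reads under each charset. */
void show_byte(unsigned char b)
{
    printf("byte 0x%02X: cp1252 U+%04X, cp1251 U+%04X\n",
           (unsigned)b, decode_cp1252_high(b), decode_cp1251_high(b));
}
```

Calling show_byte(0xE9) pairs U+00E9 (é) with U+0439 (й), which matches “Német” turning into “Nйmet”.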
Hi, thanks for this compilation. I have W7 64bit and the 16bit catdoc won’t run. I also have problems with language-specific characters, as others mentioned before. Nevertheless, thank you once again.
Someone mentioned a docx-to-txt requirement. I am not sure if there is a tool to do that, but it should be possible to build one in one environment or another, because a docx file is a zipped bunch of XML files. I will probably build my own, specifically in PHP.
Excellent !
You really fixed some bugs in the 16bit catdoc. I have a couple of questions:
1) What happens with docx files? 2) I have some problems with Spanish; what do I need to do?
Thanks