Filtering and Filetypes


Indexing Word and Excel documents using catdoc

Word Files

catdoc is a text-extraction program. It does not try to preserve Word character formatting; its goal is simply to extract the document text.
You will need to download and install catdoc version 0.90 or later from http://www.45.free.net/~vitus/software/catdoc/
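
Before editing any WebGlimpse files, it can help to confirm that the converter is actually installed. The following is a minimal sketch, assuming catdoc is expected on the PATH (the filter entries on this page use /webglimpsehome instead, so adjust the lookup if your copy lives there). The function name require_tool is hypothetical.

```shell
#!/bin/sh
# require_tool NAME: print the resolved path and succeed if NAME is an
# available command; otherwise print a message to stderr and fail.
# (Hypothetical helper, not part of WebGlimpse or catdoc.)
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        command -v "$1"
    else
        echo "missing: $1 (install catdoc 0.90 or later)" >&2
        return 1
    fi
}
```

A typical call would be: require_tool catdoc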

Next, add the following filter entries to the file .glimpse_filters:

*.doc	/webglimpsehome/catdoc <
*.DOC	/webglimpsehome/catdoc <
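
If you maintain several archives, the entries can be added from a script instead of by hand. The sketch below appends an entry only when the identical line is not already present, so it is safe to run more than once. The function name add_filter is hypothetical, and the location of .glimpse_filters is passed in as an argument because it depends on your installation.

```shell
#!/bin/sh
# add_filter FILE PATTERN COMMAND: append "PATTERN<TAB>COMMAND" to FILE
# unless that exact line is already there.
# (Hypothetical helper; FILE is wherever your .glimpse_filters lives.)
add_filter() {
    file=$1
    line=$(printf '%s\t%s' "$2" "$3")
    touch "$file"
    # -x matches whole lines only, -F treats "*" and "." literally.
    if ! grep -qxF -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}
```

For example: add_filter ./.glimpse_filters '*.doc' '/webglimpsehome/catdoc <'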

Edit the wgreindex file in each archive that needs to access non-ASCII files. Change both glimpseindex command lines to add the -z option, as follows:

/bin/cat /home/WWW/proj/test/.wg_toindex | /usr/local/bin/glimpseindex -n -H /home/WWW/proj/test -o -t -h -X -U -f -C -F -z > /dev/null

/bin/cat /home/WWW/proj/test/.wg_toindex | /usr/local/bin/glimpseindex -n -H /home/WWW/proj/test -o -t -h -X -U -f -C -F -z
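
The same edit can be sketched as a small filter for sites with many archives. It assumes each glimpseindex line carries the -F flag, as in the commands shown here, and inserts -z right after it. Lines that already have -z, or that lack -F, are printed unchanged. The function name add_z_option is hypothetical.

```shell
#!/bin/sh
# add_z_option FILE: print FILE with -z inserted after -F on every
# glimpseindex line that does not already carry -z.
# (Hypothetical helper; assumes -F is present as in the lines above.)
add_z_option() {
    awk '
        /glimpseindex/ && $0 !~ /(^| )-z( |$)/ {
            # Try "-F" followed by more text first, then "-F" at end of line.
            if (sub(/ -F /, " -F -z ") == 0)
                sub(/ -F$/, " -F -z")
        }
        { print }
    ' "$1"
}
```

To apply it while keeping a backup, one option is: cp wgreindex wgreindex.bak, then add_z_option wgreindex.bak > wgreindex. Writing into the existing file this way leaves its permissions as they were.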

Excel Files

xls2csv is a program that converts an Excel spreadsheet into a comma-separated value (CSV) file. It is included in the catdoc package described above, so if you have already downloaded and installed catdoc, you can go straight to adding the filter information below. xls2csv extracts the data and omits formatting information and formulas.

Just as you did with the Word filter above, you will need to modify the file .glimpse_filters to include these filters:

*.xls	/webglimpsehome/xls2csv <
*.XLS	/webglimpsehome/xls2csv <
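
Before reindexing, you may want to run a filter by hand and look at its output. The sketch below assumes the file name is appended to the command part of the entry, which is what the trailing "<" suggests, so the converter receives the file on standard input. The function name run_filter and the file sample.xls are hypothetical.

```shell
#!/bin/sh
# run_filter FILTER_COMMAND FILE: run the command part of a filter entry
# (everything after the pattern, including the trailing "<") on FILE.
# (Hypothetical helper for manual testing.)
run_filter() {
    # Appending the quoted file name to "command <" yields
    # "command < file", so FILE is fed to the converter on stdin.
    sh -c "$1 \"\$1\"" sh "$2"
}
```

For example: run_filter '/webglimpsehome/xls2csv <' sample.xls | head
If the output is readable comma-separated text, the entry should be usable as written.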

Edit the wgreindex file in each archive that needs to access non-ASCII files. Change both glimpseindex command lines to add the -z option, as follows. If you already made this change for Word files, you do not need to repeat it.

/bin/cat /home/WWW/proj/test/.wg_toindex | /usr/local/bin/glimpseindex -n -H /home/WWW/proj/test -o -t -h -X -U -f -C -F -z > /dev/null

/bin/cat /home/WWW/proj/test/.wg_toindex | /usr/local/bin/glimpseindex -n -H /home/WWW/proj/test -o -t -h -X -U -f -C -F -z
