
Re: Cache pdftotext results for glimpse? (2)




 * Forgot to add: For such a script to work, I would need access to the name of
 * the file presently being indexed. Is there a way I can request this name to
 * be passed on to the script? Maybe something along the lines of
 * 
 * *.pdf   pdftotextfilter.sh %1 <
 * 
 * in .glimpse_filters?

Don't know, but you could calculate an MD5 of stdin and use that as an
index into your cache.
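
Roughly what I have in mind, as an untested sketch rather than anything
Glimpse ships: spool stdin to a temporary file, take its MD5, and only run
the converter when no cached text exists under that key. CACHE_DIR,
convert_pdf and cached_filter are names I made up for illustration, and I am
assuming md5sum, mktemp and a pdftotext that accepts "-" as the output file
are available on your system.

```shell
#!/bin/sh
# Hypothetical caching filter: PDF on stdin, plain text on stdout.
# CACHE_DIR is an assumed location, not a Glimpse setting.
CACHE_DIR="${CACHE_DIR:-/tmp/pdftotext-cache}"

# Assumed converter call: pdftotext <pdf-file> -  (text goes to stdout).
convert_pdf() {
    pdftotext "$1" -
}

cached_filter() {
    mkdir -p "$CACHE_DIR" || return 1
    input=$(mktemp) || return 1
    cat > "$input"                      # spool stdin so it can be read twice
    key=$(md5sum < "$input" | cut -d' ' -f1)
    cached="$CACHE_DIR/$key.txt"
    if [ ! -f "$cached" ]; then
        # Cache miss: convert once, then move the result into place.
        if convert_pdf "$input" > "$cached.$$"; then
            mv "$cached.$$" "$cached"
        else
            rm -f "$cached.$$" "$input"
            return 1
        fi
    fi
    rm -f "$input"
    cat "$cached"                       # cache hit (or freshly converted text)
}
```

Adding a final line that just calls cached_filter would turn this into a
filter script. Because the key is the content hash, the script never needs
the file name, and a renamed or duplicated PDF reuses the same cache entry.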

I solved the problem by regularly preconverting new/changed pdf, doc
etc., only indexing text/html, and recognising 'preconversions' in the
CGI frontend. A bonus is that you can then present both the pdf and
the txt file in your results in case people don't have a PDF reader
installed in their browser.
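
The preconversion step can be sketched along these lines. This is not my
actual script, just an illustration that assumes each foo.pdf gets a foo.txt
next to it; preconvert_tree and convert_pdf are made-up names, and the
pdftotext call is the same assumption as in any filter script.

```shell
#!/bin/sh
# Hypothetical preconversion pass, meant to be run regularly (e.g. from cron).

# Assumed converter call: pdftotext <pdf-file> -  (text goes to stdout).
convert_pdf() {
    pdftotext "$1" -
}

# Convert every PDF under directory $1 whose .txt twin is missing or older.
# File names containing newlines are not handled by this simple loop.
preconvert_tree() {
    find "$1" -type f -name '*.pdf' | while IFS= read -r pdf; do
        txt="${pdf%.pdf}.txt"
        # find -newer prints the PDF only if it was modified after the text.
        if [ ! -f "$txt" ] || [ -n "$(find "$pdf" -newer "$txt")" ]; then
            convert_pdf "$pdf" > "$txt.tmp" && mv "$txt.tmp" "$txt"
        fi
    done
}
```

Calling preconvert_tree on the document root before each indexing run means
the indexer only ever sees text, and unchanged PDFs are skipped.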

Thanks for the URL below BTW.

Lee Wilmot
RIPE NCC

 * 
 * With many thanks for your help,
 * 
 * David.
 * 
 * PS: I found the patch explained in
 *   http://www-2.cs.cmu.edu/~dst/Adobe/Gallery/xpdf-generic-patch.html
 * very useful for indexing my PDF files. In my view, indexing is not copying a
 * file. Of course, it's up to individuals to decide whether they want to
 * ignore the "do not copy" flag. In any case, your users might appreciate a
 * link :)
 * 
 * 
 * ---------------------------------------------------------------------------
 * Dr David Philip Kreil                   ("`-''-/").___..--''"`-._
 * Research Fellow                          `6_ 6  )   `-.  (     ).`-.__.`)
 * University of Cambridge                  (_Y_.)'  ._   )  `._ `. ``-..-'
 * ++44 1223 764107, fax 333992           _..`--'_..-_/  /--'_.' ,'
 * www.inference.phy.cam.ac.uk/dpk20     (il),-''  (li),'  ((!.-'
 * 
 *