Re: Cache pdftotext results for glimpse? (2)
* Forgot to add: For such a script to work, I would need access to the name of
* the file presently being indexed. Is there a way I can request this name to
* be passed on to the script? Maybe something along the lines of
*
* *.pdf pdftotextfilter.sh %1 <
*
* in .glimpse_filters?
Don't know, but you could calculate an MD5 of stdin and use that as an
index into your cache.
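
The MD5 idea could be sketched roughly as follows. This is only an illustration, not a tested Glimpse filter: the function names, the cache location and the use of Python are all assumptions of mine, and it assumes your pdftotext accepts "-" as the output file to mean stdout. The filter never needs the file name, because the cache key is derived from the bytes arriving on stdin.

```python
import hashlib
import os
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable


def pdftotext_convert(pdf_bytes: bytes) -> bytes:
    """Convert PDF bytes to text by running the pdftotext CLI on a temp file."""
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(pdf_bytes)
        tmp_path = tmp.name
    try:
        # Assumption: "-" as the text file makes pdftotext write to stdout.
        result = subprocess.run(
            ["pdftotext", tmp_path, "-"], stdout=subprocess.PIPE, check=True
        )
        return result.stdout
    finally:
        os.unlink(tmp_path)


def cached_convert(
    data: bytes,
    cache_dir: Path,
    convert: Callable[[bytes], bytes] = pdftotext_convert,
) -> bytes:
    """Return the converted text for data, reusing a cached result if present."""
    key = hashlib.md5(data).hexdigest()  # cache index derived from the content
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / (key + ".txt")
    if cached.exists():
        return cached.read_bytes()
    text = convert(data)
    # Write to a per-process temp name first, then rename into place, so a
    # concurrent reader never sees a half-written cache entry.
    tmp = cache_dir / f"{key}.{os.getpid()}.tmp"
    tmp.write_bytes(text)
    os.replace(tmp, cached)
    return text


def run_filter(cache_dir: Path = Path("/tmp/glimpse-pdf-cache")) -> None:
    """Read a PDF from stdin and write its (possibly cached) text to stdout."""
    sys.stdout.buffer.write(cached_convert(sys.stdin.buffer.read(), cache_dir))
```

A wrapper script registered in .glimpse_filters would then just call run_filter(). Since the key depends only on the content, a renamed or duplicated PDF hits the same cache entry, and a changed PDF gets a fresh conversion automatically.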
I solved the problem by regularly preconverting new or changed PDF, DOC,
etc. files, indexing only text/HTML, and recognising 'preconversions' in
the CGI frontend. A bonus is that you can then present both the PDF and
the txt file in your results, in case people don't have a PDF reader
installed in their browser.
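
For illustration, the preconversion approach might look something like the sketch below. It is not the script I actually run: the "report.pdf" -> "report.pdf.txt" naming convention, the function names and the pluggable convert callback are all hypothetical choices made for the example. A periodic job would call preconvert(), and the CGI frontend would call original_for() on each hit to find the original document to offer alongside the text.

```python
import os
from pathlib import Path
from typing import Callable, List, Optional

# Assumed set of formats that get preconverted; extend as needed.
SOURCE_SUFFIXES = {".pdf", ".doc"}


def text_path_for(source: Path) -> Path:
    """Hypothetical naming convention: report.pdf -> report.pdf.txt."""
    return source.with_name(source.name + ".txt")


def needs_conversion(source: Path) -> bool:
    """True if the text copy is missing or older than the source file."""
    target = text_path_for(source)
    return not target.exists() or target.stat().st_mtime < source.stat().st_mtime


def preconvert(root: Path, convert: Callable[[Path], str]) -> List[Path]:
    """Convert every new or changed source under root; return what was converted."""
    converted = []
    # sorted() materialises the listing before any text files are written.
    for source in sorted(root.rglob("*")):
        if (
            source.is_file()
            and source.suffix.lower() in SOURCE_SUFFIXES
            and needs_conversion(source)
        ):
            text_path_for(source).write_text(convert(source), encoding="utf-8")
            converted.append(source)
    return converted


def original_for(hit: Path) -> Optional[Path]:
    """For a search hit on a preconverted text file, return the original document."""
    if hit.suffix == ".txt":
        candidate = hit.with_suffix("")  # report.pdf.txt -> report.pdf
        if candidate.suffix.lower() in SOURCE_SUFFIXES and candidate.exists():
            return candidate
    return None
```

Only the .txt and HTML files are then handed to the indexer, so no filter has to run at index time at all.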
Thanks for the URL below BTW.
Lee Wilmot
RIPE NCC
*
* With many thanks for your help,
*
* David.
*
* PS: I found the patch explained in
* http://www-2.cs.cmu.edu/~dst/Adobe/Gallery/xpdf-generic-patch.html
* very useful for indexing my PDF files. In my view, indexing is not copying a
* file. Of course, it's up to individuals to decide whether they want to ignore
* the "do not copy" flag. In any case, your users might appreciate a link :)
*
*
* ---------------------------------------------------------------------------
* Dr David Philip Kreil ("`-''-/").___..--''"`-._
* Research Fellow `6_ 6 ) `-. ( ).`-.__.`)
* University of Cambridge (_Y_.)' ._ ) `._ `. ``-..-'
* ++44 1223 764107, fax 333992 _..`--'_..-_/ /--'_.' ,'
* www.inference.phy.cam.ac.uk/dpk20 (il),-'' (li),' ((!.-'
*
*