Re: Cache pdftotext results for glimpse? (2)
Dear Lee,
> * Forgot to add: For such a script to work, I would need access to the name of
> * the file presently being indexed. Is there a way I can request this name to
> * be passed on to the script? Maybe something along the lines of
> *
> * *.pdf pdftotextfilter.sh %1 <
> *
> * in .glimpse_filters?
>
> Don't know, but you could calculate an MD5 of stdin and use that as an
> index into your cache.
Can you recommend a tool for this?
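
For reference, a rough sketch of how the MD5 idea could look in Python. This
is only an illustration under assumptions: the cache directory and the
function names are made up here, and it assumes pdftotext is on the PATH and
that the filter receives the PDF on stdin. Since pdftotext wants a file name,
the input is first copied to a temporary file.

```python
import hashlib
import os
import subprocess
import tempfile

# Hypothetical cache location; any writable directory would do.
CACHE_DIR = os.path.expanduser("~/.pdftotext_cache")


def pdf_to_text(pdf_bytes):
    """Convert PDF bytes to text by running pdftotext on a temporary file."""
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(pdf_bytes)
        tmp_path = tmp.name
    try:
        # "-" as the output file name makes pdftotext write to stdout.
        result = subprocess.run(["pdftotext", tmp_path, "-"],
                                stdout=subprocess.PIPE, check=True)
        return result.stdout
    finally:
        os.unlink(tmp_path)


def cached_text(pdf_bytes, cache_dir=CACHE_DIR, convert=pdf_to_text):
    """Return the text for pdf_bytes, converting only on a cache miss."""
    # The MD5 of the input serves as the cache key, so no file name is needed.
    key = hashlib.md5(pdf_bytes).hexdigest()
    path = os.path.join(cache_dir, key + ".txt")
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return fh.read()
    text = convert(pdf_bytes)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as fh:
        fh.write(text)
    return text
```

A filter script built on this would read sys.stdin.buffer, pass the bytes to
cached_text(), and write the result to sys.stdout.buffer. From the shell, the
md5sum utility computes the same digest.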
> I solved the problem by regularly preconverting new/changed pdf, doc
> etc., only indexing text/html, and recognising 'preconversions' in the
> CGI frontend. A bonus is that you can then present both the pdf and
> the txt file in your results in case people don't have a PDF reader
> installed in their browser.
Yes, that's what I do now, but that basically means maintaining a "text-only
mirror". I've started writing scripts to do that, and my first hack works
fine, but I worry about how it will scale up.
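
To illustrate the kind of incremental mirroring meant here (this is not the
actual script, just a minimal sketch): walk the document tree and reconvert
only those PDFs whose text copy is missing or older than the source. The
function names and directory layout are assumptions, and pdftotext is
assumed to be on the PATH.

```python
import os
import subprocess


def needs_update(src, dst):
    """True if dst is missing or older than src."""
    return not os.path.exists(dst) or os.path.getmtime(dst) < os.path.getmtime(src)


def run_pdftotext(src, dst):
    """Default converter: pdftotext <PDF-file> <text-file>."""
    subprocess.run(["pdftotext", src, dst], check=True)


def mirror_pdfs(src_root, dst_root, convert=run_pdftotext):
    """Convert new or changed PDFs under src_root into a parallel text tree.

    Returns the relative paths of the PDFs that were (re)converted.
    """
    converted = []
    for dirpath, _dirs, files in os.walk(src_root):
        for name in files:
            if not name.lower().endswith(".pdf"):
                continue
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            # Same relative path in the mirror, with .pdf replaced by .txt.
            dst = os.path.join(dst_root, rel[:-4] + ".txt")
            if needs_update(src, dst):
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                convert(src, dst)
                converted.append(rel)
    return converted
```

Because unchanged files are skipped, each run costs one directory walk plus
the conversions that are actually needed, which is the part that matters for
scaling. Removing text files whose PDF has been deleted is not handled here.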
> Thanks for the URL below BTW.
:)
Many thanks and
best regards,
David.
---------------------------------------------------------------------------
Dr David Philip Kreil
Research Fellow
University of Cambridge
++44 1223 764107, fax 333992
www.inference.phy.cam.ac.uk/dpk20