Mar. 24th, 2021

Almost done downloading 534 GB of files from IAPSOP. Just 95 GB to go. Elapsed time is about 70 hours so far. It would really be nice to get higher than 1.8 MB/s.

Periodicals, monographs, courses, the whole ball of wax is nearly ready. The periodicals appear to come with a descriptive index page, but the monographs do not. So how am I going to find anything in this pile of bits?

I do have an idea.

What if I OCR'd all the PDFs? Tesseract might produce OCR text I can then search. Hopefully my computer can crunch through everything in a useful length of time.

My current candidate to do the above:
https://pypi.org/project/ocrmypdf/#description
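I haven't scripted anything yet, but a minimal sketch of the batch job might look like this. The paths are made up, and I've split out a command-builder so the loop is easy to dry-run; ocrmypdf's --skip-text option leaves pages that already contain text alone.

```python
# Sketch: batch-OCR every PDF under a directory with ocrmypdf.
# Paths are hypothetical; dry_run=True just collects the commands.
import subprocess
from pathlib import Path

def build_ocr_command(src: Path, dst: Path) -> list[str]:
    """Return the ocrmypdf invocation for one file."""
    # --skip-text skips pages that already have a text layer
    return ["ocrmypdf", "--skip-text", str(src), str(dst)]

def ocr_tree(src_dir: str, dst_dir: str, dry_run: bool = True) -> list[list[str]]:
    """Walk src_dir for PDFs and OCR each one into dst_dir."""
    cmds = []
    for pdf in sorted(Path(src_dir).rglob("*.pdf")):
        cmd = build_ocr_command(pdf, Path(dst_dir) / pdf.name)
        cmds.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)  # actually run ocrmypdf
    return cmds
```

With 40k+ files this would want to run overnight at least, so collecting the commands first and feeding them to something parallel seems wiser than a serial loop.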


I was getting ahead of myself! It looks like the files have all been processed already. They are searchable!
Now for a decent desktop search engine.

Update: Found the tool I'm going to start testing for search.
ripgrep-all should be able to find text across a pile of PDFs.
Hopefully it can get through 500 GB of them quickly.

https://github.com/phiresky/ripgrep-all
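ripgrep-all (the rga binary) passes through ripgrep's usual flags, so a thin wrapper for my kind of lookup could be sketched like this. The archive path is made up; -i and --files-with-matches are standard ripgrep flags that rga forwards.

```python
# Sketch: shell out to ripgrep-all to list PDFs containing a term.
import subprocess

def rga_command(term: str, root: str) -> list[str]:
    # -i: case-insensitive; --files-with-matches: print file names only
    return ["rga", "-i", "--files-with-matches", term, root]

def search(term: str, root: str) -> list[str]:
    """Return the paths of files under root that mention term."""
    out = subprocess.run(rga_command(term, root), capture_output=True, text=True)
    return out.stdout.splitlines()
```

Example (assuming rga is installed and the archive lives at /archive): search("Merlin", "/archive").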

UPDATE: The archives are downloaded! Total download time was about 86 hours for 543.325 GB. Time to take a peek.

And ripgrep-all probably won't work out, so I am now running Recoll: https://www.lesbonscomptes.com/recoll/

Very cool stuff. Recoll indexed all the files, which took a little while, but less than an hour. Now I can search all 40k+ files for any term and get instant results. The ability to create more complex queries than Google allows might come in handy, too. With the user interface set to display search results as a table, browsing is super easy: click for a preview of the found text, or open the PDF if it looks interesting.
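Recoll also ships a command-line client, recollq, which takes the same query language as the GUI, so those more complex queries can be scripted. A rough sketch (I haven't verified every flag; -n limiting the result count and the query syntax like "merlin OR astrology" are my understanding from the Recoll docs):

```python
# Sketch: run a Recoll query-language search via the recollq CLI.
import subprocess

def recollq_command(query: str, max_results: int = 20) -> list[str]:
    # -n limits how many results come back; the query string uses
    # Recoll's query language, e.g. 'merlin OR astrology'
    return ["recollq", "-n", str(max_results), query]

def run_query(query: str) -> list[str]:
    """Return recollq's output lines for a query (needs an existing index)."""
    out = subprocess.run(recollq_command(query), capture_output=True, text=True)
    return out.stdout.splitlines()
```

That would make it easy to batch up a list of search terms and dump the hits to a file.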

As a test I searched for Merlin. I wouldn't have guessed from the documents' titles that I'd find anything about Merlin in them. Searching for "astrology" is fun, too.