Batch scanning

sui.generis · Aug 17, 2012

I'd love to build up my Pleco reader library with content from my print library so I've always got something handy to work on. The send to reader functionality is nice for short entries of 1 or 2 paragraphs (perhaps my camera's resolution is higher than the norm, but that's about the most I can fit on screen if I've taken a shot of a single page), but getting in a full page's content, an article, or some other larger block is tough. And send to reader doesn't seem to let you send different selections to the same file. Nothing wrong with how it's designed; it just isn't what I'm looking for at the moment.

One solution that would work well for me would be support for image-based PDFs in the reader; I think I read some other discussion of that. For the record, I'd be happy to pay for a PDF add-on. I could then do batch scanning in an app designed for it. With the ability to read image PDFs, you could avoid any extra scanning complexity (dealing with multiple columns, correctly sorting images, text, and other languages on the same page, cropping or optimizing) and I could OCR particular words on demand from my PDFs.

Is that something that could happen? Thanks.

mikelove · Aug 18, 2012

sui.generis said:
One solution that would work well for me would be support for image-based PDFs in the reader; I think I read some other discussion of that. For the record, I'd be happy to pay for a PDF add-on. I could then do batch scanning in an app designed for it. With the ability to read image PDFs, you could avoid any extra scanning complexity (dealing with multiple columns, correctly sorting images, text, and other languages on the same page, cropping or optimizing) and I could OCR particular words on demand from my PDFs.

That's possible, but since we don't support it on iOS either at the moment we'd like to get it actually released on that first; it's much much easier to do this on iOS because there's built-in support for PDF viewing (takes only a tiny bit more effort than opening a regular image file), whereas Android lacks that support and would have to read PDFs via a third-party library.

In general, though, if you've got PDFs on your desktop anyway you might be better off OCRing them on your desktop - the OCR engines on desktops are a) more specifically calibrated to this type of work and b) more powerful / accurate on account of having an order of magnitude more memory and CPU to play around with (not to mention being able to be left running in the background while you do other things).

sui.generis · Aug 18, 2012

mikelove said:
That's possible, but since we don't support it on iOS either at the moment we'd like to get it actually released on that first; it's much much easier to do this on iOS because there's built-in support for PDF viewing (takes only a tiny bit more effort than opening a regular image file), whereas Android lacks that support and would have to read PDFs via a third-party library.

It's kinda interesting since Chrome (desktop) supports PDFs (so well that I don't even install Adobe Reader anymore). Waiting might be to your advantage: you'd avoid the hassle of licensing until Google includes it too, and I'm betting they will. Not really to my advantage, but c'est la vie.

radioman · Aug 18, 2012

mikelove said:
In general, though, if you've got PDFs on your desktop anyway you might be better off OCRing them on your desktop - the OCR engines on desktops are a) more specifically calibrated to this type of work and b) more powerful / accurate on account of having an order of magnitude more memory and CPU to play around with (not to mention being able to be left running in the background while you do other things).

Maybe it's just me, but I find Pleco quite effective on-the-fly. I have also found the effectiveness of the desktop solution to be contingent on the quality and type of material being OCRed. Unless it is clean, straight text, or unless substantial time is spent going through, redacting, and proofreading, I get mixed results (which translates to poor results, because I'm still going to have to fiddle around with the poorly processed areas after the fact).

I remain convinced that this "sequential picture OCR-on-the-fly" concept is a good idea. If that functionality can be brought to a raw un-OCRed PDF, then that would be remarkable.

mikelove · Aug 19, 2012

sui.generis said:
It's kinda interesting since Chrome (desktop) supports PDFs (so well that I don't even install Adobe Reader anymore). Waiting might be to your advantage: you'd avoid the hassle of licensing until Google includes it too, and I'm betting they will. Not really to my advantage, but c'est la vie.

There are open-source libraries too; they're just a bit on the slow side.

radioman said:
I remain convinced that this "sequential picture OCR-on-the-fly" concept is a good idea. If that functionality can be brought to a raw un-OCRed PDF, then that would be remarkable.

We've got PDFs working on iOS like any other image files now - really just a matter of telling the system to load the PDF into an image buffer instead of the JPEG or whatever (takes a bit more RAM, but it works). And we could do likewise on Android if we had a fast library to decode PDFs into images - the only ones we've found so far are either very slow, GPLed (meaning we can't use them), or commercially licensed.

ckatt · Aug 25, 2012

Any suggestions for desktop OCR software?

radioman · Aug 25, 2012

ckatt said:
Any suggestions for desktop OCR software?

1) Adobe Acrobat 9 or above.
2) OCR tools (available on Mac; not sure about other platforms)

But... if Pleco can do this on-the-fly PDF function, I will likely not be OCRing any documents in the future. I'll just let Pleco handle it in real time.
