After a brief introduction to Regular Expressions last time it’s time to actually scan the 1985 BBC Sound Effects Catalogue and turn it into a text file.
Happily, I had already photographed this whole thing in early 2022. That taught me that the OCR task was not at all easy and I put it to one side. Eventually I realised that pulling the catalogue numbers alone was possible and very useful.
Contents
Scandals
As you saw in the previous post, the OCR scan vandalised the text. I forget which tool I used for that example, but there are various ways and means to do this and the results are only part of the problems you’ll have doing a big job. In general you get better outputs when you zoom in on a smaller area though. Whatever, I needed to quickly scan 358 pages and create a monster text file and of the inevitable scrambled mess be damned.
Power Down
My first thought was to use the Microsoft Power Automate application. It’s very easy to cycle through a folder of images, OCR them for text and then append it all to one big document ready for searching. Unfortunately, it doesn’t seem to be working properly at the moment – throwing an error about memory. I won’t go into here, but I’ve tried a lot of things and it it’s a bust. For now.
Lens Not Good, Man
I then went round a few options but ended up on my phone. Not doom scrolling or whatever I’m supposed to be doing on it, but using apps. Microsoft moved their Lens app to become mobile only in the past couple of year and basically that’s their free solution. It’s alright and works fine but I needed to batch OCR all the page images and I couldn’t make that work.
Adobe (S)can
Adobe have their own app with similar functionality called Scan. It’s free! Unless you want to get large batches done. Even then it’s limited to 50 at a time. Well, there’s a 7-day trial so after loading all the images onto my phone I selected batches of 50 at a time and opened then in Adobe Scan. 7 or 8 batches later (the eighth was a small one) I had created 8 Word document files full of this kind of thing:
Not As Good As Your Word
That’s a Word document exported from Adobe Scan and opened in Word on my desktop. And, you’re probably thinking that it looks pretty good! And it does. Why can’t that be turned into a digital version of the catalogue??
Briefly, how it looks up there is built on underlying formatting which has none of the apparent order and line-by-line coherence that you see. It’s like a jigsaw puzzle of elements which come together to make the picture look right. If you take an individual piece though, it’s not a line or even a column. There are all kinds of fragmetary chunks of text that include lines and columns in a random patterns.
So, there’s no real value in that as it stands. Tantalising as it looks, and searchable as it is, it does not convert to anything else I can use.
Mark My Words
Never mind though. All I had to do was copy all the text from all the documents and paste them into a simple .txt file. The next step required a slightly more nerdy application. Instead of MS Notepad, I used Notepad++, a free code editor and notepad.
Notepad++ has a couple of key features which made the job of extracting the catalogue numbers a cinch. Firstly it has regex searches. As covered in the previous post, the regex search pattern below will find all the catalogue numbers.
((EC)|(NH))\d+[ABCDEFGHJKLMNPQRSTUVWXYZ]\d*
That’s useful, but alone is not enough. I need to find ’em all and the select and copy the matches. Notepad++ has a cool feature called Mark. As well as Find and Replace, Mark can search for the text you want. It then selects – or, marks – that text so that you can copy it. Or, delete, cut etc.
So, with that done it’s only the catalogue numbers in my clipboard and a total of 11,434 sift through. I’m ready to paste to Excel and really start sorting out these numbers!
Excellent
Pasting into Excel, the next job is to remove duplicates using the tool on the Data tab. That removed 9,714 duplicates leaving a total of 1,720 unique values. Is that corect? Is that the number of EC, ECS and NHS 7″ records in the catalogue? You’ll have to wait to next time to see how that went.