After a brief introduction to Regular Expressions last time it’s time to actually scan the 1985 BBC Sound Effects Catalogue and turn it into a text file.

I’ve cleaned it up a bit since I got it, but this is how it came to me

Happily, I had already photographed this whole thing in early 2022. That taught me that the OCR task was not at all easy and I put it to one side. Eventually I realised that pulling the catalogue numbers alone was possible and very useful.

Scandals

As you saw in the previous post, the OCR scan vandalised the text. I forget which tool I used for that example, but there are various ways and means to do this and the results are only part of the problems you’ll have doing a big job. In general you get better outputs when you zoom in on a smaller area though. Whatever, I needed to quickly scan 358 pages and create a monster text file and of the inevitable scrambled mess be damned.

Power Down

My first thought was to use the Microsoft Power Automate application. It’s very easy to cycle through a folder of images, OCR them for text and then append it all to one big document ready for searching. Unfortunately, it doesn’t seem to be working properly at the moment – throwing an error about memory. I won’t go into here, but I’ve tried a lot of things and it it’s a bust. For now.

Lens Not Good, Man

I then went round a few options but ended up on my phone. Not doom scrolling or whatever I’m supposed to be doing on it, but using apps. Microsoft moved their Lens app to become mobile only in the past couple of year and basically that’s their free solution. It’s alright and works fine but I needed to batch OCR all the page images and I couldn’t make that work.

Adobe (S)can

Adobe have their own app with similar functionality called Scan. It’s free! Unless you want to get large batches done. Even then it’s limited to 50 at a time. Well, there’s a 7-day trial so after loading all the images onto my phone I selected batches of 50 at a time and opened then in Adobe Scan. 7 or 8 batches later (the eighth was a small one) I had created 8 Word document files full of this kind of thing:

Not As Good As Your Word

That’s a Word document exported from Adobe Scan and opened in Word on my desktop. And, you’re probably thinking that it looks pretty good! And it does. Why can’t that be turned into a digital version of the catalogue??

Briefly, how it looks up there is built on underlying formatting which has none of the apparent order and line-by-line coherence that you see. It’s like a jigsaw puzzle of elements which come together to make the picture look right. If you take an individual piece though, it’s not a line or even a column. There are all kinds of fragmetary chunks of text that include lines and columns in a random patterns.

So, there’s no real value in that as it stands. Tantalising as it looks, and searchable as it is, it does not convert to anything else I can use.

Mark My Words

Never mind though. All I had to do was copy all the text from all the documents and paste them into a simple .txt file. The next step required a slightly more nerdy application. Instead of MS Notepad, I used Notepad++, a free code editor and notepad.

Notepad++ has a couple of key features which made the job of extracting the catalogue numbers a cinch. Firstly it has regex searches. As covered in the previous post, the regex search pattern below will find all the catalogue numbers.

((EC)|(NH))\d+[ABCDEFGHJKLMNPQRSTUVWXYZ]\d*

That’s useful, but alone is not enough. I need to find ’em all and the select and copy the matches. Notepad++ has a cool feature called Mark. As well as Find and Replace, Mark can search for the text you want. It then selects – or, marks – that text so that you can copy it. Or, delete, cut etc.

Notepad++ finds and marks all the catalogue numbers.

So, with that done it’s only the catalogue numbers in my clipboard and a total of 11,434 sift through. I’m ready to paste to Excel and really start sorting out these numbers!

Excellent

Pasting into Excel, the next job is to remove duplicates using the tool on the Data tab. That removed 9,714 duplicates leaving a total of 1,720 unique values. Is that corect? Is that the number of EC, ECS and NHS 7″ records in the catalogue? You’ll have to wait to next time to see how that went.

Leave a comment

Your email address will not be published. Required fields are marked *