{"id":2220,"date":"2023-10-23T18:38:56","date_gmt":"2023-10-23T18:38:56","guid":{"rendered":"http:\/\/www.bbcrecords.co.uk\/wp\/?p=2220"},"modified":"2023-10-23T18:38:56","modified_gmt":"2023-10-23T18:38:56","slug":"sfx-discography-11-catatlogue-number-extraction-pt-2-simple-scanning","status":"publish","type":"post","link":"http:\/\/www.bbcrecords.co.uk\/wp\/sfx-discography-11-catatlogue-number-extraction-pt-2-simple-scanning\/","title":{"rendered":"SFX Discography 11 &#8211; Catalogue Number Extraction pt. 2 &#8211; Simple Scanning"},"content":{"rendered":"\n<p>After a brief introduction to Regular Expressions last time, it&#8217;s time to actually scan the 1985 BBC Sound Effects Catalogue and turn it into a text file.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1920\" height=\"2560\" src=\"http:\/\/www.bbcrecords.co.uk\/wp\/wp-content\/uploads\/2023\/10\/IMG_3392-scaled.jpg\" alt=\"\" class=\"wp-image-2228\" srcset=\"http:\/\/www.bbcrecords.co.uk\/wp\/wp-content\/uploads\/2023\/10\/IMG_3392-scaled.jpg 1920w, http:\/\/www.bbcrecords.co.uk\/wp\/wp-content\/uploads\/2023\/10\/IMG_3392-225x300.jpg 225w\" sizes=\"auto, (max-width: 1920px) 100vw, 1920px\" \/><figcaption class=\"wp-element-caption\">I&#8217;ve cleaned it up a bit since I got it, but this is how it came to me<\/figcaption><\/figure>\n\n\n\n<p>Happily, I had already photographed this whole thing in early 2022. That taught me that the OCR task was not at all easy and I put it to one side. Eventually I realised that pulling the catalogue numbers alone was possible and very useful.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Scandals <\/h2>\n\n\n\n<p>As you saw in the previous post, the OCR scan vandalised the text. I forget which tool I used for that example, but there are various ways and means to do this and the results are only part of the problems you&#8217;ll have doing a big job. 
In general you get better outputs when you zoom in on a smaller area though. Whatever, I needed to quickly scan 358 pages and create a monster text file, and the inevitable scrambled mess be damned.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Power Down<\/h3>\n\n\n\n<p>My first thought was to use the Microsoft Power Automate application. It&#8217;s very easy to cycle through a folder of images, OCR them for text and then append it all to one big document ready for searching. Unfortunately, it doesn&#8217;t seem to be working properly at the moment &#8211; throwing an error about memory. I won&#8217;t go into it here, but I&#8217;ve tried a lot of things and it&#8217;s a bust. For now.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Lens Not Good, Man<\/h3>\n\n\n\n<p>I then went round a few options but ended up on my phone. Not doom scrolling or whatever I&#8217;m supposed to be doing on it, but using apps. Microsoft moved their Lens app to become mobile only in the past couple of years and basically that&#8217;s their free solution. It&#8217;s alright and works fine but I needed to batch OCR all the page images and I couldn&#8217;t make that work. <\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Adobe (S)can<\/h3>\n\n\n\n<p>Adobe have their own app with similar functionality called Scan. It&#8217;s free! Unless you want to get large batches done. Even then it&#8217;s limited to 50 at a time. Well, there&#8217;s a 7-day trial so after loading all the images onto my phone I selected batches of 50 at a time and opened them in Adobe Scan. 
7 or 8 batches later (the eighth was a small one) I had created 8 Word document files full of this kind of thing: <\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"643\" height=\"1024\" src=\"http:\/\/www.bbcrecords.co.uk\/wp\/wp-content\/uploads\/2023\/10\/page-142-as-a-doc-643x1024.jpg\" alt=\"\" class=\"wp-image-2217\" srcset=\"http:\/\/www.bbcrecords.co.uk\/wp\/wp-content\/uploads\/2023\/10\/page-142-as-a-doc-643x1024.jpg 643w, http:\/\/www.bbcrecords.co.uk\/wp\/wp-content\/uploads\/2023\/10\/page-142-as-a-doc-188x300.jpg 188w, http:\/\/www.bbcrecords.co.uk\/wp\/wp-content\/uploads\/2023\/10\/page-142-as-a-doc.jpg 702w\" sizes=\"auto, (max-width: 643px) 100vw, 643px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Not As Good As Your Word<\/h2>\n\n\n\n<p>That&#8217;s a Word document exported from Adobe Scan and opened in Word on my desktop. And you&#8217;re probably thinking that it looks pretty good! And it does. Why can&#8217;t that be turned into a digital version of the catalogue?<\/p>\n\n\n\n<p>Briefly, how it looks up there is built on underlying formatting which has none of the apparent order and line-by-line coherence that you see. It&#8217;s like a jigsaw puzzle of elements which come together to make the picture look right. If you take an individual piece though, it&#8217;s not a line or even a column. There are all kinds of fragmentary chunks of text that include lines and columns in random patterns. <\/p>\n\n\n\n<p>So, there&#8217;s no real value in that as it stands. Tantalising as it looks, and searchable as it is, it does not convert to anything else I can use.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Mark My Words<\/h2>\n\n\n\n<p>Never mind though. All I had to do was copy all the text from all the documents and paste them into a simple .txt file. The next step required a slightly more nerdy application. 
Instead of MS Notepad, I used <a href=\"https:\/\/notepad-plus-plus.org\/\" data-type=\"link\" data-id=\"https:\/\/notepad-plus-plus.org\/\">Notepad++, a free code editor and notepad<\/a>.<\/p>\n\n\n\n<p>Notepad++ has a couple of key features which made the job of extracting the catalogue numbers a cinch. Firstly it has regex searches. As covered in the previous post, the regex search pattern below will find all the catalogue numbers.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>((EC)|(NH))\\d+&#91;ABCDEFGHJKLMNPQRSTUVWXYZ]\\d*<\/code><\/pre>\n\n\n\n<p>That&#8217;s useful, but alone is not enough. I need to find &#8217;em all and then select and copy the matches. Notepad++ has a cool feature called Mark. As well as Find and Replace, Mark can search for the text you want. It then selects &#8211; or, marks &#8211; that text so that you can copy it. Or, delete, cut etc.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"695\" src=\"http:\/\/www.bbcrecords.co.uk\/wp\/wp-content\/uploads\/2023\/10\/Screenshot-2023-10-23-165025-1024x695.png\" alt=\"\" class=\"wp-image-2232\" style=\"aspect-ratio:1.4733812949640288;width:633px;height:auto\" srcset=\"http:\/\/www.bbcrecords.co.uk\/wp\/wp-content\/uploads\/2023\/10\/Screenshot-2023-10-23-165025-1024x695.png 1024w, http:\/\/www.bbcrecords.co.uk\/wp\/wp-content\/uploads\/2023\/10\/Screenshot-2023-10-23-165025-300x204.png 300w, http:\/\/www.bbcrecords.co.uk\/wp\/wp-content\/uploads\/2023\/10\/Screenshot-2023-10-23-165025-768x521.png 768w, http:\/\/www.bbcrecords.co.uk\/wp\/wp-content\/uploads\/2023\/10\/Screenshot-2023-10-23-165025.png 1492w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Notepad++ finds and marks all the catalogue numbers.<\/figcaption><\/figure>\n\n\n\n<p>So, with that done it&#8217;s only the catalogue numbers in my clipboard and a total of 11,434 to sift through. 
I&#8217;m ready to paste to Excel and really start sorting out these numbers! <\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Excellent<\/h2>\n\n\n\n<p>Pasting into Excel, the next job is to remove duplicates using the tool on the Data tab. That removed 9,714 duplicates leaving a total of 1,720 unique values. Is that correct? Is that the number of EC, ECS and NHS 7&#8243; records in the catalogue? You&#8217;ll have to wait until next time to see how that went.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>After a brief introduction to Regular Expressions last time it&#8217;s time to actually scan the 1985 BBC Sound Effects Catalogue and turn it into a text file. Happily, I had already photographed this whole thing in early 2022. That taught me that the OCR task was not at all easy and I put it to &hellip; <\/p>\n<p class=\"link-more\"><a href=\"http:\/\/www.bbcrecords.co.uk\/wp\/sfx-discography-11-catatlogue-number-extraction-pt-2-simple-scanning\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;SFX Discography 11 &#8211; Catalogue Number Extraction pt. 
2 &#8211; Simple Scanning&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":2232,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[56],"tags":[],"class_list":["post-2220","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-sfx-discography","entry"],"_links":{"self":[{"href":"http:\/\/www.bbcrecords.co.uk\/wp\/wp-json\/wp\/v2\/posts\/2220","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.bbcrecords.co.uk\/wp\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.bbcrecords.co.uk\/wp\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.bbcrecords.co.uk\/wp\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.bbcrecords.co.uk\/wp\/wp-json\/wp\/v2\/comments?post=2220"}],"version-history":[{"count":8,"href":"http:\/\/www.bbcrecords.co.uk\/wp\/wp-json\/wp\/v2\/posts\/2220\/revisions"}],"predecessor-version":[{"id":2237,"href":"http:\/\/www.bbcrecords.co.uk\/wp\/wp-json\/wp\/v2\/posts\/2220\/revisions\/2237"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/www.bbcrecords.co.uk\/wp\/wp-json\/wp\/v2\/media\/2232"}],"wp:attachment":[{"href":"http:\/\/www.bbcrecords.co.uk\/wp\/wp-json\/wp\/v2\/media?parent=2220"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.bbcrecords.co.uk\/wp\/wp-json\/wp\/v2\/categories?post=2220"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.bbcrecords.co.uk\/wp\/wp-json\/wp\/v2\/tags?post=2220"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}