Extracting body text from a PDF, storing it in an array, and searching it for strings, v2

This AppleScript searches the body text of a specified PDF for a list of synonyms and returns the page numbers on which any of them appear.

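I wrote it to build an index from a PDF. It first extracts the body text from the target PDF, page by page, and builds a text search cache.

Each term is searched for in this text cache first. Only when the cache yields no hit does the script run a string search against the PDF itself.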

In my environment (MacBook Pro Retina 2012), running this script against the 483-page "AppleScript最新リファレンス" takes about 4.66 seconds.

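The cache is built once per call, so to benefit from it you should pass the index's synonym list in a single batch rather than calling the handler once per word.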
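The cache lookup can be pictured in plain AppleScript, without the Cocoa bridge. The following is a minimal sketch under stated assumptions, not the script's actual code: `pageTexts` is a hypothetical stand-in for the per-page text cache, `synonymList` and `hitPages` are illustrative names, and the sample strings are made up.

```applescript
-- Hypothetical stand-in for the per-page text cache: one string per page
set pageTexts to {"Made by Piyomaru Software", "No match on this page", "ぴよまるソフトウェアの本"}

-- Synonyms that should all map to the same index entry
set synonymList to {"Piyomaru Software", "ぴよまるソフトウェア"}

set hitPages to {}
repeat with i from 1 to (count of pageTexts)
	set pageText to item i of pageTexts
	repeat with aWord in synonymList
		-- Case-sensitive partial match against the cached page text
		considering case
			if pageText contains (contents of aWord) then
				-- Record each page only once, even if several synonyms hit
				if hitPages does not contain i then set end of hitPages to i
			end if
		end considering
	end repeat
end repeat

hitPages
--> {1, 3}
```

Because every synonym is checked against the same cached strings, the cost of reading the PDF is paid once no matter how many words are in the list.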

AppleScript name: Extracting body text from a PDF, storing it in an array, and searching it for strings, v2

-- Created 2017-06-18 by Takaaki Naganoya
-- 2017 Piyomaru Software
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"
use framework "Quartz"
use bPlus : script "BridgePlus"

-- Words to search for
set sList to {"Piyomaru Software", "ぴよまるソフトウェア"} -- considering case

set thePath to POSIX path of (choose file of type {"com.adobe.pdf"})

set aRes to findWordListInPDFContents(thePath, sList) of me
--> {1, 3, 4, 71, 72, 75, 95, 96, 97, 98, 420, 429, 479, 483}

-- Get the list of pages on which any word in the list appears in the PDF body text (for index building)
on findWordListInPDFContents(thePOSIXPath as string, sList as list)
	script spdPDF
		property textCache : missing value
		property aList : {}
	end script
	
	-- Read the PDF text page by page in advance and build a text cache for searching
	set anNSURL to (current application's |NSURL|'s fileURLWithPath:thePOSIXPath)
	set theDoc to current application's PDFDocument's alloc()'s initWithURL:anNSURL
	set theCount to theDoc's pageCount() as integer
	
	set (textCache of spdPDF) to current application's NSMutableArray's new()
	
	repeat with i from 0 to (theCount - 1)
		set aPage to (theDoc's pageAtIndex:i)
		set tmpStr to (aPage's |string|())
		-- Store a 1-based page number alongside the page text
		((textCache of spdPDF)'s addObject:{pageIndex:i + 1, pageString:tmpStr})
	end repeat
	
	-- Keyword search, mainly against the text cache
	repeat with s in sList
		
		-- ❶ Extract pages by partial match
		set bRes to ((my filterRecListByLabel1((textCache of spdPDF), "pageString contains '" & s & "'"))'s pageIndex) as list
		
		-- ❷ Fallback when the per-page text search in ❶ finds nothing (e.g. the text spans a page break)
		if bRes = {} then
			set theSels to (theDoc's findString:s withOptions:0)
			repeat with aSel in theSels
				-- label() returns the page label, so the coercion below assumes numeric labels
				set thePage to (aSel's pages()'s objectAtIndex:0)'s label()
				set curPage to (thePage as integer)
				if curPage is not in bRes then
					set the end of bRes to curPage
				end if
			end repeat
		end if
		
		set the end of (aList of spdPDF) to bRes
		
	end repeat
	
	-- 2D list to 1D list conversion (Flatten)
	load framework
	set bList to (current application's SMSForder's arrayByFlattening:(aList of spdPDF)) as list
	
	-- Uniquify
	set cList to uniquifyList(bList) of me
	
	-- Sort 1D List
	set anArray to current application's NSArray's arrayWithArray:cList
	set sortRes1 to (anArray's sortedArrayUsingSelector:"compare:") as list
	
	set (textCache of spdPDF) to "" -- Purge
	set (aList of spdPDF) to {} -- Purge
	
	return sortRes1
end findWordListInPDFContents

-- Filter a list of records with a predicate on the given attribute label
on filterRecListByLabel1(aRecList as list, aPredicate as string)
	set aArray to current application's NSArray's arrayWithArray:aRecList
	set aPredicate to current application's NSPredicate's predicateWithFormat:aPredicate
	set filteredArray to aArray's filteredArrayUsingPredicate:aPredicate
	return filteredArray
end filterRecListByLabel1

on uniquifyList(aList as list)
	set aArray to current application's NSArray's arrayWithArray:aList
	set bArray to aArray's valueForKeyPath:"@distinctUnionOfObjects.self"
	return bArray as list
end uniquifyList