Extract body text from a PDF, store it in an array, and search it for strings

AppleScript name: Extract body text from a PDF, store it in an array, and search it for strings

-- Created 2017-06-18 by Takaaki Naganoya
-- 2017 Piyomaru Software
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"
use framework "Quartz"

property textCache : missing value
property aList : {}

-- Keywords to search for
set sList to {"notification", "Cocoa"} -- considering case (both searches below are case-sensitive)

set thePath to POSIX path of (choose file of type {"com.adobe.pdf"})

-- Read the PDF's text page by page in advance and build a text cache for searching
set anNSURL to (current application's |NSURL|'s fileURLWithPath:thePath)
set theDoc to current application's PDFDocument's alloc()'s initWithURL:anNSURL
set theCount to theDoc's pageCount() as integer

set textCache to current application's NSMutableArray's new()

repeat with i from 0 to (theCount - 1)
    set aPage to (theDoc's pageAtIndex:i)
    set tmpStr to (aPage's |string|())
    (textCache's addObject:{pageIndex:i + 1, pageString:tmpStr})
end repeat

-- Keyword search, run mainly against the text cache
repeat with s in sList

    -- ❶ Extract the pages whose text contains the keyword as a substring
    set bRes to ((my filterRecListByLabel1(textCache, "pageString contains '" & s & "'"))'s pageIndex) as list

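    -- ❷ Fallback when the per-page text search in ❶ finds nothing (for example, when the match spans two pages)
    if bRes = {} then
        set bRes to {}
        set theSels to (theDoc's findString:s withOptions:0)
        repeat with aSel in theSels
            -- Use the 1-based page index so the result matches the cache in ❶;
            -- a page label is not guaranteed to be numeric (e.g. "iv")
            set thePage to (aSel's pages()'s objectAtIndex:0)
            set curPage to ((theDoc's indexForPage:thePage) as integer) + 1
            if curPage is not in bRes then
                set the end of bRes to curPage
            end if
        end repeat
    end if

    set the end of aList to bRes

end repeat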

return aList

-- Extract the records stored in a list by the value of the specified attribute label
on filterRecListByLabel1(aRecList as list, aPredicate as string)
    set aArray to current application's NSArray's arrayWithArray:aRecList
    set aPredicate to current application's NSPredicate's predicateWithFormat:aPredicate
    set filteredArray to aArray's filteredArrayUsingPredicate:aPredicate
    return filteredArray
end filterRecListByLabel1
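
The search in ❶ builds its predicate by splicing the keyword into the format string between single quotes, so a keyword that itself contains a single quote would probably yield an invalid format string. The following is a minimal sketch of one possible alternative, not part of the original script: it passes the keyword as a %@ argument through predicateWithFormat:argumentArray: and optionally adds the [c] modifier for a case-insensitive match. The handler name pagesContainingKeyword, its ignoringCase parameter, and the sampleCache data are hypothetical and exist only for illustration.

```applescript
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"

-- Hypothetical sample data in the same shape as the script's text cache
set sampleCache to {{pageIndex:1, pageString:"Cocoa basics"}, {pageIndex:2, pageString:"The user's notification settings"}}

-- A keyword containing a single quote is passed as data, not spliced into the format string
return my pagesContainingKeyword(sampleCache, "user's", false)

-- Hypothetical helper (not in the original script):
-- returns the pageIndex values of the records whose pageString contains aKeyword
on pagesContainingKeyword(aRecList, aKeyword, ignoringCase)
    set anArray to current application's NSArray's arrayWithArray:aRecList
    if ignoringCase then
        set aFormat to "pageString CONTAINS[c] %@" -- [c] makes the comparison case-insensitive
    else
        set aFormat to "pageString CONTAINS %@"
    end if
    -- The keyword is substituted for %@ as a string constant
    set aPredicate to current application's NSPredicate's predicateWithFormat:aFormat argumentArray:{aKeyword}
    set filteredArray to anArray's filteredArrayUsingPredicate:aPredicate
    return (filteredArray's valueForKey:"pageIndex") as list
end pagesContainingKeyword
```

If a handler like this were called from the main loop, the loop variable would need to be dereferenced first (contents of s), because repeat with s in sList yields a reference to the list item rather than a plain string.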