Normalizing Unicode Characters

This AppleScript normalizes a string into each of the Unicode normalization forms NFD, NFKD, NFC, and NFKC, then hex-dumps its UTF-8 bytes so the result can be checked.

If you normalize an NSString to NFD, NFKD, NFC, or NFKC and then cast it to an AppleScript string with "as string", the normalized form is preserved.
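The claim above can be checked with a small round trip. This is only a sketch: the variable names are illustrative, and the input is built from a code point (an assumption made so that the script file's own normalization cannot affect the result).

```applescript
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"

-- Build a precomposed "が" (U+304C) from its code point.
set srcText to character id 12364
set srcStr to current application's NSString's stringWithString:srcText

-- NFD splits it into "か" (U+304B) plus the combining mark U+3099.
set nfdStr to srcStr's decomposedStringWithCanonicalMapping()
set nfdLength to (nfdStr's |length|()) as integer

-- Cast to an AppleScript string, then bridge back to an NSString.
set asText to nfdStr as string
set backStr to current application's NSString's stringWithString:asText
set backLength to (backStr's |length|()) as integer

-- If the cast preserved the normalization form, both lengths match.
log {nfdLength, backLength}
return (nfdLength = backLength)
```

Comparing lengths is a cheap proxy. Passing backStr to the hexDumpString handler from this article would show the actual bytes.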

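When you import data created on another OS, strings can arrive in a different normalization form. In one real case, text extracted from a PDF and processed as-is failed to match a string that looked identical. Explicitly normalizing the text with the routines used in this script before casting it to an AppleScript string made it work without problems.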
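A minimal sketch of that workaround: normalize both sides to the same form before comparing. The handler name equalAfterNFC is hypothetical, and choosing NFC as the common form is an assumption; any one form works as long as both strings use it.

```applescript
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"

-- Hypothetical helper: compare two texts after normalizing both to NFC.
on equalAfterNFC(textA, textB)
    set strA to (current application's NSString's stringWithString:textA)'s precomposedStringWithCanonicalMapping()
    set strB to (current application's NSString's stringWithString:textB)'s precomposedStringWithCanonicalMapping()
    return (strA's isEqualToString:strB) as boolean
end equalAfterNFC

set composedText to character id 12364 -- "が" as a single code point (U+304C)
set decomposedText to character id {12363, 12441} -- "か" (U+304B) + U+3099

-- Literal comparison of the two NSStrings, without normalization.
set rawA to current application's NSString's stringWithString:composedText
set rawB to current application's NSString's stringWithString:decomposedText
set rawEqual to (rawA's isEqualToString:rawB) as boolean

-- Comparison after bringing both sides to NFC.
set normalizedEqual to equalAfterNFC(composedText, decomposedText)

log {rawEqual, normalizedEqual}
```

The two inputs render identically, yet isEqualToString: compares them literally, so only the normalized comparison is expected to succeed.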

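When something goes wrong, I first hex-dump the string to see what it actually contains. Strings that look the same on screen but that a program cannot treat as the same data do come up from time to time.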
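As a lighter-weight complement to the hex dump, the code points of a text can be listed with AppleScript's built-in id property. This is a sketch, not part of the original script, and the handler name codePointsOf is hypothetical.

```applescript
-- Hypothetical helper: list the Unicode code points of a text.
on codePointsOf(aText)
    set ids to id of aText
    -- A text with a single code point yields a bare integer, so wrap it.
    if class of ids is not list then set ids to {ids}
    return ids
end codePointsOf

set composedText to character id 12364 -- "が" (U+304C)
set decomposedText to character id {12363, 12441} -- "か" (U+304B) + U+3099

log codePointsOf(composedText)
log codePointsOf(decomposedText)
```

A 12441 (U+3099, the combining voiced sound mark) in the output is a sign that the text is in a decomposed form.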

AppleScript name: Normalizing Unicode Characters

-- Created 2015-09-30 by Takaaki Naganoya
-- 2015 Piyomaru Software
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"

--Reference:
--http://akisute.com/2010/05/utf-8-normalize.html
--http://nomenclator.la.coocan.jp/unicode/normalization.htm

set a to "がぎぐげご"
set aStr to current application's NSString's stringWithString:a
log hexDumpString(aStr)
--> {"E3", "81", "8C", "E3", "81", "8E", "E3", "81", "90", "E3", "81", "92", "E3", "81", "94"}

--NFD
set aNFD to aStr's decomposedStringWithCanonicalMapping()
--> (NSString) "がぎぐげご"
log hexDumpString(aNFD)
--> {"E3", "81", "8B", "E3", "82", "99", "E3", "81", "8D", "E3", "82", "99", "E3", "81", "8F", "E3", "82", "99", "E3", "81", "91", "E3", "82", "99", "E3", "81", "93", "E3", "82", "99"}

--NFKD
set aNFKD to aStr's decomposedStringWithCompatibilityMapping()
--> (NSString) "がぎぐげご"
log hexDumpString(aNFKD)
--> {"E3", "81", "8B", "E3", "82", "99", "E3", "81", "8D", "E3", "82", "99", "E3", "81", "8F", "E3", "82", "99", "E3", "81", "91", "E3", "82", "99", "E3", "81", "93", "E3", "82", "99"}

--NFC
set aNFC to aStr's precomposedStringWithCanonicalMapping()
--> (NSString) "がぎぐげご"
log hexDumpString(aNFC)
--> {"E3", "81", "8C", "E3", "81", "8E", "E3", "81", "90", "E3", "81", "92", "E3", "81", "94"}

--NFKC
set aNFKC to aStr's precomposedStringWithCompatibilityMapping()
--> (NSString) "がぎぐげご"
log hexDumpString(aNFKC)
--> {"E3", "81", "8C", "E3", "81", "8E", "E3", "81", "90", "E3", "81", "92", "E3", "81", "94"}

--Hex-dump the UTF-8 bytes of an NSString as a list of two-digit hex strings
on hexDumpString(theNSString)
    set theNSData to theNSString's dataUsingEncoding:(current application's NSUTF8StringEncoding)
    --Note: this relies on the "<e3818ce3 818ee381 ...>" layout of NSData's description,
    --which is debugging output rather than a documented contract
    set theString to (theNSData's |description|()'s uppercaseString())
    
    --Remove the "<" and ">" characters at the head and tail
    set tLength to (theString's |length|()) - 2
    set aRange to current application's NSMakeRange(1, tLength)
    set theString2 to theString's substringWithRange:aRange
    
    --Remove the space characters
    set bString to theString2's stringByReplacingOccurrencesOfString:" " withString:""
    
    --Split into one list item per byte
    set aResList to splitString(bString, 2)
    --> e.g. for "あいうえお": {"E3", "81", "82", "E3", "81", "84", "E3", "81", "86", "E3", "81", "88", "E3", "81", "8A"}
    
    return aResList
    
end hexDumpString

--Split an NSString into chunks of aNum characters and return them as a list
on splitString(aText, aNum)
    
    set aStr to current application's NSString's stringWithString:aText
    --Return a list even when the whole text fits in a single chunk
    if (aStr's |length|()) ≤ aNum then return {aStr as string}
    
    set anArray to current application's NSMutableArray's new()
    set mStr to current application's NSMutableString's stringWithString:aStr
    
    --Range covering the first aNum characters
    set aRange to current application's NSMakeRange(0, aNum)
    
    --Take chunks off the front until nothing is left
    repeat while (mStr's |length|()) > 0
        if (mStr's |length|()) < aNum then
            --The final chunk is shorter than aNum
            anArray's addObject:(current application's NSString's stringWithString:mStr)
            mStr's deleteCharactersInRange:(current application's NSMakeRange(0, mStr's |length|()))
        else
            anArray's addObject:(mStr's substringWithRange:aRange)
            mStr's deleteCharactersInRange:aRange
        end if
    end repeat
    
    return (current application's NSArray's arrayWithArray:anArray) as list
    
end splitString
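
With "がぎぐげご" the canonical and compatibility forms give identical bytes, so the listing above does not show where NFC and NFKC differ. The sketch below uses half-width katakana, an input chosen here for illustration and not taken from the original script, to make the difference visible.

```applescript
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"

-- Half-width "ｶﾞ": U+FF76 followed by the half-width voiced sound mark U+FF9E.
set halfText to character id {65398, 65438}
set halfStr to current application's NSString's stringWithString:halfText

-- NFC applies only canonical mappings, so the half-width pair is left as it is.
set nfcStr to halfStr's precomposedStringWithCanonicalMapping()
-- NFKC also applies compatibility mappings, folding the pair into "ガ" (U+30AC).
set nfkcStr to halfStr's precomposedStringWithCompatibilityMapping()

set nfcLength to (nfcStr's |length|()) as integer
set nfkcLength to (nfkcStr's |length|()) as integer
set foldedToFullWidth to (nfkcStr's isEqualToString:(character id 12460)) as boolean

log {nfcLength, nfkcLength, foldedToFullWidth}
```

Because the compatibility forms fold visually distinct characters together, they suit matching and searching. The canonical forms are the safer choice when the original appearance must be kept.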