An AppleScript that normalizes a UTF-8 string into each of the NFD, NFKD, NFC, and NFKC normalization forms and hex-dumps each result so it can be checked.
If you normalize a string to NFD, NFKD, NFC, or NFKC while it is still an NSString and then cast it to an AppleScript string with "as string", the normalized form is preserved.
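The round trip described above can be checked with a small sketch like the following. The handler name utf16Length is a hypothetical helper introduced for this illustration, not part of the original script. It counts UTF-16 code units after the text has been cast to an AppleScript string and wrapped in an NSString again.

```applescript
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"

-- Hypothetical helper: number of UTF-16 code units in a piece of text
on utf16Length(someText)
	set s to current application's NSString's stringWithString:someText
	return (s's |length|()) as integer
end utf16Length

set src to current application's NSString's stringWithString:"が"

-- Normalize as NSString, then cast to AppleScript text
set nfdText to (src's decomposedStringWithCanonicalMapping()) as string
set nfcText to (src's precomposedStringWithCanonicalMapping()) as string

-- If the normalized form survives the cast, the decomposed text should
-- report 2 code units (base kana + combining mark) and the precomposed text 1
log utf16Length(nfdText)
log utf16Length(nfcText)
```

Counting code units is only a quick sanity check. The hex dump in the script itself shows the actual bytes.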
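When you import data created on another OS, even plain strings can arrive in a different normalization form. In one real case, I extracted text from a PDF and processed it as-is, and strings that looked identical failed to match. Explicitly running the text through the normalization used in this script before casting it to an AppleScript string made it work without problems.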
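A minimal sketch of that workaround: convert both sides to the same form before comparing. The handler name equalAfterNFC is hypothetical, and the choice of NFC as the common form is an assumption made for this example. Any one of the four forms should work as long as both sides use the same one.

```applescript
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"

-- Hypothetical helper: compare two texts after converting both to NFC
on equalAfterNFC(textA, textB)
	set a to (current application's NSString's stringWithString:textA)'s precomposedStringWithCanonicalMapping()
	set b to (current application's NSString's stringWithString:textB)'s precomposedStringWithCanonicalMapping()
	-- isEqualToString: is a literal comparison, so both sides must share one form
	return (a's isEqualToString:b) as boolean
end equalAfterNFC

-- Simulate "same text, different normalization form"
set composed to "が"
set decomposed to ((current application's NSString's stringWithString:composed)'s decomposedStringWithCanonicalMapping()) as string

log equalAfterNFC(composed, decomposed) -- should log true
```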
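When something goes wrong, I first hex-dump the string to check what it actually contains. It happens from time to time that text which looks identical on screen cannot be recognized as the same data by the program.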
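As a complementary check, the sketch below lists UTF-16 code units instead of UTF-8 bytes. It is not the author's method. The handler names hex4 and codeUnitDump are hypothetical, and the sketch assumes that characterAtIndex: is bridged to an AppleScript integer.

```applescript
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"

-- Hypothetical helper: format an integer in the range 0..65535 as four hex digits
on hex4(n)
	set digits to "0123456789ABCDEF"
	set out to ""
	repeat 4 times
		set out to (character ((n mod 16) + 1) of digits) & out
		set n to n div 16
	end repeat
	return out
end hex4

-- Hypothetical helper: list the UTF-16 code units of a text as hex strings
on codeUnitDump(someText)
	set s to current application's NSString's stringWithString:someText
	set resList to {}
	repeat with i from 0 to ((s's |length|()) - 1)
		-- Assumption: characterAtIndex: returns the code unit as an integer
		set end of resList to hex4((s's characterAtIndex:i) as integer)
	end repeat
	return resList
end codeUnitDump

log codeUnitDump("が") -- a precomposed "が" should appear as the single unit 304C
```

A decomposed string shows up here as extra 3099 entries (the combining voiced sound mark), which makes the difference between the forms easy to spot.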
AppleScript name: Normalize Unicode characters

-- Created 2015-09-30 by Takaaki Naganoya
-- 2015 Piyomaru Software
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"
--Reference:
--http://akisute.com/2010/05/utf-8-normalize.html
--http://nomenclator.la.coocan.jp/unicode/normalization.htm

set a to "がぎぐげご"
set aStr to current application's NSString's stringWithString:a
log hexDumpString(aStr)
--> {"E3", "81", "8C", "E3", "81", "8E", "E3", "81", "90", "E3", "81", "92", "E3", "81", "94"}

--NFD
set aNFD to aStr's decomposedStringWithCanonicalMapping()
--> (NSString) "がぎぐげご"
log hexDumpString(aNFD)
--> {"E3", "81", "8B", "E3", "82", "99", "E3", "81", "8D", "E3", "82", "99", "E3", "81", "8F", "E3", "82", "99", "E3", "81", "91", "E3", "82", "99", "E3", "81", "93", "E3", "82", "99"}

--NFKD
set aNFKD to aStr's decomposedStringWithCompatibilityMapping()
--> (NSString) "がぎぐげご"
log hexDumpString(aNFKD)
--> {"E3", "81", "8B", "E3", "82", "99", "E3", "81", "8D", "E3", "82", "99", "E3", "81", "8F", "E3", "82", "99", "E3", "81", "91", "E3", "82", "99", "E3", "81", "93", "E3", "82", "99"}

--NFC
set aNFC to aStr's precomposedStringWithCanonicalMapping()
--> (NSString) "がぎぐげご"
log hexDumpString(aNFC)
--> {"E3", "81", "8C", "E3", "81", "8E", "E3", "81", "90", "E3", "81", "92", "E3", "81", "94"}

--NFKC
set aNFKC to aStr's precomposedStringWithCompatibilityMapping()
--> (NSString) "がぎぐげご"
log hexDumpString(aNFKC)
--> {"E3", "81", "8C", "E3", "81", "8E", "E3", "81", "90", "E3", "81", "92", "E3", "81", "94"}

--Hex-dump an NSString (as UTF-8 bytes)
on hexDumpString(theNSString)
	set theNSData to theNSString's dataUsingEncoding:(current application's NSUTF8StringEncoding)
	--Note: this relies on |description|() returning the form "<e3818c e3818e ...>"; the format may differ depending on the OS version
	set theString to (theNSData's |description|()'s uppercaseString())
	
	--Remove "<" ">" characters in head and tail
	set tLength to (theString's |length|()) - 2
	set aRange to current application's NSMakeRange(1, tLength)
	set theString2 to theString's substringWithRange:aRange
	
	--Replace Space Characters
	set aString to current application's NSString's stringWithString:theString2
	set bString to aString's stringByReplacingOccurrencesOfString:" " withString:""
	
	set aResList to splitString(bString, 2)
	--> {"E3", "81", "82", "E3", "81", "84", "E3", "81", "86", "E3", "81", "88", "E3", "81", "8A"}
	
	return aResList
end hexDumpString

--Split NSString in specified aNum characters
on splitString(aText, aNum)
	set aStr to current application's NSString's stringWithString:aText
	if aStr's |length|() ≤ aNum then return aText
	
	set anArray to current application's NSMutableArray's new()
	set mStr to current application's NSMutableString's stringWithString:aStr
	
	set aRange to current application's NSMakeRange(0, aNum)
	
	repeat while (mStr's |length|()) > 0
		if (mStr's |length|()) < aNum then
			--Fewer than aNum characters remain: take the rest as the last chunk
			anArray's addObject:(current application's NSString's stringWithString:mStr)
			mStr's deleteCharactersInRange:(current application's NSMakeRange(0, mStr's |length|()))
		else
			anArray's addObject:(mStr's substringWithRange:aRange)
			mStr's deleteCharactersInRange:aRange
		end if
	end repeat
	
	return (current application's NSArray's arrayWithArray:anArray) as list
end splitString