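This AppleScript converts HTML to plain text.
The part that guesses the text encoding normally lives in a library, but for this post I have expanded the library into the script itself.
For the encoding auto-detection, I read the source of an old version of CotEditor and wrote the first half, a "first match wins" pass over the long-established text encodings. The second half, a majority vote among the less common encodings combined with garbled-text (mojibake) detection, is my own original processing.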
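The "first match wins" idea can be sketched compactly as a loop over candidate encodings. This is only a minimal sketch under stated assumptions, not the script's actual code: the handler name `decodeWithFirstValidEncoding`, the variable names, and the sample data are hypothetical, and the candidate order simply mirrors the EUC-JP, UTF-8, Shift JIS order used further down.

```applescript
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"

-- Hypothetical helper (not part of the original script):
-- try each candidate encoding in order; the first one that decodes wins.
on decodeWithFirstValidEncoding(aNSData, encodingList)
	repeat with anEncoding in encodingList
		set aStr to (current application's NSString's alloc()'s initWithData:aNSData encoding:(contents of anEncoding))
		if aStr is not missing value then return (aStr as text)
	end repeat
	return missing value -- no candidate could decode the data
end decodeWithFirstValidEncoding

-- Assumed candidate order: EUC-JP, UTF-8, Shift JIS
set candidateEncodings to {current application's NSJapaneseEUCStringEncoding, current application's NSUTF8StringEncoding, current application's NSShiftJISStringEncoding}

-- Sample bytes for illustration only (plain ASCII encoded as UTF-8)
set sampleData to (current application's NSString's stringWithString:"plain ASCII sample")'s dataUsingEncoding:(current application's NSUTF8StringEncoding)
set decodedText to decodeWithFirstValidEncoding(sampleData, candidateEncodings)
```

The order of the list matters: an earlier encoding that happens to accept the bytes hides every later candidate, which is why the less common encodings get a separate voting step.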
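This auto-detection is designed for meaningful Japanese text, so it could misjudge the encoding of text like the lettering on a sushi-restaurant teacup, where single-kanji fish names are simply listed one after another (I actually made a text file of fish names and loaded it, and there was no particular problem).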
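One way to probe this concern is to ask the linguistic tagger directly what language it sees in a bare list of fish kanji. The following is a hedged sketch rather than part of the original script: the sample string is hypothetical, and it uses the `dominantLanguageForString:` class method of NSLinguisticTagger, which assumes macOS 10.13 or later, instead of the tagger setup in the script's own `specifyLanguageOfText` handler.

```applescript
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"

-- Hypothetical test string: single-kanji fish names with no kana and no grammar
set fishNames to "鮪 鯛 鰹 鯖 鰯 鮭 鱈 鰤 鯵 鰻"

-- dominantLanguageForString: returns a language code, or missing value if undetermined
set detectedLang to current application's NSLinguisticTagger's dominantLanguageForString:fishNames
if detectedLang is missing value then
	set langCode to "(undetermined)"
else
	set langCode to detectedLang as text
end if

-- Anything other than "ja" here means a language-based shortcut would not fire for this text
return langCode
```

The result depends on the OS version and its language models, so it is best treated as an experiment to run locally rather than a fixed answer.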
That said, UTF-8 alone seems to be enough for most things these days, so you are unlikely to run into text that nasty.
AppleScript name: Convert HTML to plain text (text-encoding auto-detection library expanded inline)
-- Created 2017-09-08 by Takaaki Naganoya
-- 2017 Piyomaru Software
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"
use framework "AppKit" -- initWithHTML:documentAttributes: is an AppKit addition to NSAttributedString
--use jLib : script "japaneseTextEncodingDetector"

property NSString : a reference to current application's NSString
property NSMutableArray : a reference to current application's NSMutableArray
property NSAttributedString : a reference to current application's NSAttributedString
property NSUnicodeStringEncoding : a reference to current application's NSUnicodeStringEncoding

set aFile to choose file
set aRes to readJapanesTextFileWithGuessingEncoding(POSIX path of aFile) of me
if aRes = false then return ""
set aPlainText to HTMLDecode(aRes) of me

--Convert an HTML string to plain text via NSAttributedString
on HTMLDecode(HTMLString)
	set theString to current application's NSString's stringWithString:HTMLString
	set theData to theString's dataUsingEncoding:(NSUnicodeStringEncoding)
	set attStr to NSAttributedString's alloc()'s initWithHTML:theData documentAttributes:(missing value)
	return (attStr's |string|()) as string
end HTMLDecode

--Read Japanese text with detecting its text encoding
on readJapanesTextFileWithGuessingEncoding(aPOSIXpath as string)
	--ISO2022JP check
	set aNSData to current application's NSData's dataWithContentsOfFile:aPOSIXpath
	if aNSData is missing value then return false --the file could not be read
	set aDataLength to aNSData's |length|()
	if aDataLength > 1024 then set aDataLength to 1024
	
	--0x1B check: look for an ESC byte in the first 1024 bytes
	set anNSString to current application's NSString's stringWithString:(character id 27) -- 0x1B
	set theData to anNSString's dataUsingEncoding:(current application's NSUTF8StringEncoding)
	set theRange to aNSData's rangeOfData:theData options:0 range:(current application's NSMakeRange(0, aDataLength))
	
	--found 0x1B in aNSData
	if |length| of theRange = 1 and location of theRange < aDataLength then
		set aStr to (current application's NSString's alloc()'s initWithData:aNSData encoding:(current application's NSISO2022JPStringEncoding)) --21
		if aStr is not equal to missing value then return (aStr as text) -- ISO2022JP
	end if
	
	--EUC
	set resValue to (current application's NSString's alloc()'s initWithData:aNSData encoding:(current application's NSJapaneseEUCStringEncoding))
	if resValue is not equal to missing value then return (resValue as text)
	
	--UTF-8
	set resValue to (current application's NSString's alloc()'s initWithData:aNSData encoding:(current application's NSUTF8StringEncoding))
	if resValue is not equal to missing value then return (resValue as text)
	
	--Shift JIS
	set resValue to (current application's NSString's alloc()'s initWithData:aNSData encoding:(current application's NSShiftJISStringEncoding))
	if resValue is not equal to missing value then return (resValue as text)
	
	--Take a majority vote
	--UTF-16BE/LE/plain Unicode are decided by majority vote
	--UTF-16BE
	set resValue1 to (current application's NSString's alloc()'s initWithData:aNSData encoding:(current application's NSUTF16BigEndianStringEncoding)) as text
	set sample1 to getTextSample(resValue1) of me
	set lang1 to specifyLanguageOfText(sample1) of me
	set para1 to length of (paragraphs of sample1)
	set words1 to length of (words of sample1)
	
	--UTF-16LE
	set resValue2 to (current application's NSString's alloc()'s initWithData:aNSData encoding:(current application's NSUTF16LittleEndianStringEncoding)) as text
	set sample2 to getTextSample(resValue2) of me
	set lang2 to specifyLanguageOfText(sample2) of me
	set para2 to length of (paragraphs of sample2)
	set words2 to length of (words of sample2)
	
	--Plain Unicode (no explicit byte order)
	set resValue3 to (current application's NSString's alloc()'s initWithData:aNSData encoding:(current application's NSUnicodeStringEncoding)) as text
	set sample3 to getTextSample(resValue3) of me
	set lang3 to specifyLanguageOfText(sample3) of me
	set para3 to length of (paragraphs of sample3)
	set words3 to length of (words of sample3)
	
	--If the characters and grammar look like Japanese, return that decoding
	if lang1 = "ja" then return resValue1
	if lang2 = "ja" then return resValue2
	if lang3 = "ja" then return resValue3
	
	--Garbled text is hard to recognize as Japanese "words" and has few paragraphs (often just 1), so exclude it by these conditions
	if para1 is not equal to 1 then
		if (words1 ≤ words2) or (words1 ≤ words3) then
			return resValue1
		end if
	end if
	
	if para2 is not equal to 1 then
		if (words2 ≤ words1) or (words2 ≤ words3) then
			return resValue2
		end if
	end if
	
	if para3 is not equal to 1 then
		if (words3 ≤ words1) or (words3 ≤ words2) then
			return resValue3
		end if
	end if
	
	return false --failed to determine the text encoding
end readJapanesTextFileWithGuessingEncoding

--Return the language code that NSLinguisticTagger assigns to the start of the text
on specifyLanguageOfText(aStr)
	set aNSstring to current application's NSString's stringWithString:aStr
	set tagSchemes to current application's NSArray's arrayWithObjects:(current application's NSLinguisticTagSchemeLanguage)
	set tagger to current application's NSLinguisticTagger's alloc()'s initWithTagSchemes:tagSchemes options:0
	tagger's setString:aNSstring
	set aLanguage to tagger's tagAtIndex:0 |scheme|:(current application's NSLinguisticTagSchemeLanguage) tokenRange:(missing value) sentenceRange:(missing value)
	return aLanguage as text
end specifyLanguageOfText

--Return at most the first 1024 characters of the text
on getTextSample(aText)
	set aLen to length of aText
	if aLen < 1024 then
		set bLen to aLen
	else
		set bLen to 1024
	end if
	return (text 1 thru bLen of aText)
end getTextSample
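To try the HTML-to-text step on its own, without the file dialog or the encoding detection, the `HTMLDecode` approach can be exercised with an inline string. This is a small usage sketch: the sample HTML and the variable names are made up for illustration, the handler is repeated only so the snippet runs by itself, and the `missing value` guard is an addition that is not in the script above.

```applescript
use AppleScript version "2.4"
use scripting additions
use framework "Foundation"
use framework "AppKit" -- initWithHTML:documentAttributes: comes from AppKit

-- Same approach as the script's HTMLDecode handler, repeated so this snippet is self-contained
on HTMLDecode(HTMLString)
	set theString to current application's NSString's stringWithString:HTMLString
	set theData to theString's dataUsingEncoding:(current application's NSUnicodeStringEncoding)
	set attStr to current application's NSAttributedString's alloc()'s initWithHTML:theData documentAttributes:(missing value)
	if attStr is missing value then return "" -- added guard: the HTML could not be parsed
	return (attStr's |string|()) as text
end HTMLDecode

-- Hypothetical sample input
set sampleHTML to "<html><body><h1>Title</h1><p>Hello, <b>world</b> &amp; friends.</p></body></html>"
set plainText to HTMLDecode(sampleHTML)
```

Tags are dropped and character entities such as `&amp;` are resolved by the HTML importer; how block elements turn into line breaks is up to the importer, so the exact whitespace of the result may vary.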