KB-UTF8Encoding And BOM

2007-10-31 12:39 AM

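// Requires: using System; using System.IO; using System.Text; using System.Xml;
static void TestXMLWriter()
{
    // Write the document, indented, into an in-memory stream as UTF-8
    MemoryStream ms = new MemoryStream();
    XmlTextWriter xtw = new XmlTextWriter(ms, Encoding.UTF8);
    xtw.Formatting = Formatting.Indented;
    XmlDocument xd = new XmlDocument();
    xd.LoadXml("<Group><User>Jeffrey</User></Group>");
    xd.Save(xtw);
    xtw.Flush();
    xtw.Close();

    // Decode the written bytes back into a string and parse it again
    string xml = Encoding.UTF8.GetString(ms.ToArray());
    Console.WriteLine(xml);
    // Throws: Data at the root level is invalid. Line 1, Position 1.
    xd.LoadXml(xml);
}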

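Setting XmlTextWriter's Formatting to Formatting.Indented lays the XML out in a nicely indented form, so I wrote the code above. But it has a problem: the byte array comes straight out of XmlTextWriter, yet once it is converted to a string and handed to XmlDocument.LoadXml(), the call fails with "Data at the root level is invalid. Line 1, Position 1."!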

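The culprit is the BOM!! That is the Byte Order Mark. Many people will have noticed an option like "Save UTF-8 with BOM" in their text editors. The BOM is a few bytes placed at the very start of a text file to help applications identify how the text is encoded. (It is mostly used on Windows, and plenty of developers who have been tripped up by those few bytes regard it as a lingering Microsoft toxin.) There is a good article on it here.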

Byte order mark	Description
EF BB BF	UTF-8
FF FE	UTF-16, little endian
FE FF	UTF-16, big endian
FF FE 00 00	UTF-32, little endian
00 00 FE FF	UTF-32, big endian
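
The values in the table can be checked against what .NET itself reports through Encoding.GetPreamble(). The following is only a minimal sketch: the class name BomDemo and the helper PreambleHex are made up for this illustration, and the last line assumes a UTF8Encoding constructed with false, which returns an empty preamble.

```csharp
using System;
using System.Text;

static class BomDemo
{
    // Hypothetical helper: formats an encoding's preamble (BOM) bytes
    // as a hex string such as "EF BB BF"; empty string when there is no BOM.
    public static string PreambleHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble()).Replace("-", " ");
    }

    static void Main()
    {
        Console.WriteLine(PreambleHex(Encoding.UTF8));             // EF BB BF
        Console.WriteLine(PreambleHex(Encoding.Unicode));          // FF FE
        Console.WriteLine(PreambleHex(Encoding.BigEndianUnicode)); // FE FF
        Console.WriteLine(PreambleHex(Encoding.UTF32));            // FF FE 00 00
        Console.WriteLine(PreambleHex(new UTF8Encoding(false)));   // (empty line)
    }
}
```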

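In the earlier code, XmlTextWriter adds a BOM when it writes to the MemoryStream, because Encoding.UTF8 is set up to emit one. If we wrote the byte array to a file and read it back with StreamReader or XmlDocument.Load, the BOM would be used to detect the encoding and then discarded, so it would never show up in the content string. When we convert the bytes directly with Encoding.GetString, however, the BOM is carried into the string, and LoadXml() fails to parse it. We could find a way to trim it off, but I think the better approach is the encoderShouldEmitUTF8Identifier parameter of the UTF8Encoding constructor: set it to false and the output carries no BOM. So the code should be changed to: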

XmlTextWriter xtw = new XmlTextWriter(ms, new System.Text.UTF8Encoding(false));
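
To see the three options side by side, here is a rough sketch that runs the same serialization with and without the BOM. The names BomRoundTrip and WriteXml are hypothetical, introduced only for this example. It assumes the behavior described above: Encoding.GetString keeps the BOM as a leading U+FEFF character, while StreamReader consumes it during encoding detection.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class BomRoundTrip
{
    // Hypothetical helper: serializes a small document through XmlTextWriter
    // with the given encoding and returns the raw bytes that were written.
    public static byte[] WriteXml(Encoding encoding)
    {
        MemoryStream ms = new MemoryStream();
        XmlTextWriter xtw = new XmlTextWriter(ms, encoding);
        xtw.Formatting = Formatting.Indented;
        XmlDocument xd = new XmlDocument();
        xd.LoadXml("<Group><User>Jeffrey</User></Group>");
        xd.Save(xtw);
        xtw.Close();
        return ms.ToArray(); // MemoryStream.ToArray still works after Close
    }

    static void Main()
    {
        // Option 1 (the fix above): no BOM is written, so GetString is safe.
        string clean = Encoding.UTF8.GetString(WriteXml(new UTF8Encoding(false)));
        new XmlDocument().LoadXml(clean);

        // Option 2: keep Encoding.UTF8 and trim the leading U+FEFF by hand.
        byte[] withBom = WriteXml(Encoding.UTF8);
        string raw = Encoding.UTF8.GetString(withBom);
        Console.WriteLine(raw[0] == '\uFEFF');
        new XmlDocument().LoadXml(raw.TrimStart('\uFEFF'));

        // Option 3: let StreamReader detect and discard the BOM while decoding.
        using (StreamReader reader = new StreamReader(new MemoryStream(withBom)))
        {
            new XmlDocument().LoadXml(reader.ReadToEnd());
        }
    }
}
```

Passing new UTF8Encoding(false) keeps the fix at the point where the bytes are produced, so whatever consumes them later needs no special handling.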

That's all, folks.
