CultureInfo and Chinese String Sorting
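A colleague reported a problem: a piece of code uses LINQ's OrderBy to sort a list of Chinese names, and it produces different results on different hosts: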
string[] pool =
    "一 二 三 四 五 六 七 到 底 排 成 啥 順 序".Split(' ');
Response.Write(string.Join(",",
    pool.OrderBy(o => o).ToArray()));
Response.End();
Generally, we expect the characters to be sorted by stroke count, in the following order:
一,七,二,三,五,六,四,成,序,到,底,排,啥,順
On some hosts, however, we got this result instead:
一,七,三,二,五,六,到,啥,四,序,底,成,排,順
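I have plenty of experience with database collation affecting the sort order of SQL queries, but only a vague impression of how CultureInfo affects string sorting in .NET. As the saying goes, "sooner or later you have to pay your dues", and the time had come! I wrote a test program to verify: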
<%@ Page Language="C#" %>
<%@ Import Namespace="System.Xml.Linq" %>
<%@ Import Namespace="System.Linq" %>
<%@ Import Namespace="System.Collections.Generic" %>
<%@ Import Namespace="System.Globalization" %>
<%@ Import Namespace="System.Threading" %>
<script runat="server">
    static string[] pool =
        "一 二 三 四 五 六 七 到 底 排 成 啥 順 序".Split(' ');
    void Test(CultureInfo ci, string charset)
    {
        // OrderBy(o => o) uses the default string comparer,
        // which follows the current thread's culture
        Thread.CurrentThread.CurrentCulture = ci;
        Response.Write("<hr />Culture=" + ci.DisplayName);
        var ordered = pool.OrderBy(o => o).ToArray();
        // Line 1: the sorted characters
        Response.Write("<br />" + string.Join(",", ordered));
        // Line 2: the sort key bytes of each character under this culture
        Response.Write("<br />" + string.Join(",",
            ordered.Select(
                o => BitConverter.ToString(ci.CompareInfo.GetSortKey(o).KeyData)
            ).ToArray()));
        // Line 3: the bytes of each character in the given encoding
        Encoding enc = Encoding.GetEncoding(charset);
        Response.Write("<br />" + string.Join(",",
            ordered.Select(o => BitConverter.ToString(enc.GetBytes(o))).ToArray()
        ));
    }
    void Page_Load(object sender, EventArgs e)
    {
        Test(new CultureInfo("zh-tw"), "big5");
        Test(new CultureInfo("en-us"), "utf-8");
    }
</script>
Culture=Chinese (Taiwan)
一,七,二,三,五,六,四,成,序,到,底,排,啥,順
80-02-01-01-01-01-00,80-0A-01-01-01-01-00,80-13-01-01-01-01-00,80-2F-01-01-01-01-00,80-7E-01-01-01-01-00,80-8F-01-01-01-01-00,81-3C-01-01-01-01-00,82-20-01-01-01-01-00,83-5B-01-01-01-01-00,85-04-01-01-01-01-00,85-6E-01-01-01-01-00,90-0D-01-01-01-01-00,91-95-01-01-01-01-00,95-BB-01-01-01-01-00
A4-40,A4-43,A4-47,A4-54,A4-AD,A4-BB,A5-7C,A6-A8,A7-C7,A8-EC,A9-B3,B1-C6,D4-A3,B6-B6
Culture=English (United States)
一,七,三,二,五,六,到,啥,四,序,底,成,排,順
9E-02-01-01-01-01-00,9E-09-01-01-01-01-00,9E-11-01-01-01-01-00,9E-A1-01-01-01-01-00,9E-AC-01-01-01-01-00,A1-97-01-01-01-01-00,A2-5E-01-01-01-01-00,A5-A4-01-01-01-01-00,A7-1E-01-01-01-01-00,AE-F0-01-01-01-01-00,AE-F6-01-01-01-01-00,B2-79-01-01-01-01-00,B3-FD-01-01-01-01-00,E9-07-01-01-01-01-00
E4-B8-80,E4-B8-83,E4-B8-89,E4-BA-8C,E4-BA-94,E5-85-AD,E5-88-B0,E5-95-A5,E5-9B-9B,E5-BA-8F,E5-BA-95,E6-88-90,E6-8E-92,E9-A0-86
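The second line of each result is the sort key, and its bytes increase from left to right in both runs. Comparing two sort keys is documented to give the same ordering as comparing the strings themselves with that culture's CompareInfo. Here is a minimal sketch that checks this, assuming a console program rather than the ASP.NET page. The `SortKeyDemo` class and the `KeysAgreeWithCompare` helper are names made up for illustration.

```csharp
using System;
using System.Globalization;

static class SortKeyDemo
{
    // Returns true when the two sort keys order a and b the same way
    // as a direct string comparison under the culture ci.
    public static bool KeysAgreeWithCompare(CultureInfo ci, string a, string b)
    {
        CompareInfo cmp = ci.CompareInfo;
        SortKey ka = cmp.GetSortKey(a);
        SortKey kb = cmp.GetSortKey(b);
        return Math.Sign(SortKey.Compare(ka, kb)) == Math.Sign(cmp.Compare(a, b));
    }

    static void Main()
    {
        string[] pool = "一 二 三 四 五 六 七 到 底 排 成 啥 順 序".Split(' ');
        // Assumption: both cultures are available on the host running this.
        foreach (string name in new[] { "zh-TW", "en-US" })
        {
            CultureInfo ci = new CultureInfo(name);
            bool allAgree = true;
            // Check every pair of characters from the test list
            foreach (string a in pool)
                foreach (string b in pool)
                    allAgree &= KeysAgreeWithCompare(ci, a, b);
            Console.WriteLine(name + ": sort keys agree with Compare = " + allAgree);
        }
    }
}
```

This only shows that the key bytes and the comparison results go together. It does not explain how the key bytes are derived for each culture.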
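I don't know exactly how the SortKey is determined, but the test results suggest the following. With CultureInfo set to zh-tw, the order follows the Big5 encoding almost exactly, which gives the stroke-count order we wanted. The one exception in this output is 啥 (D4-A3), which sorts ahead of 順 (B6-B6): it has fewer strokes but sits in Big5's less-common character block, so the ordering is really by stroke count rather than by Big5 bytes. With en-us (or InvariantCulture), the order of these characters matches the UTF-8 byte order, which is the same as Unicode code point order. If a program requires a particular sort order, it is safer to set Thread.CurrentThread.CurrentCulture explicitly before sorting or comparing.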
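Setting the thread culture also changes how numbers and dates are formatted on that thread. An alternative is to pass a culture-specific comparer to OrderBy and leave the thread culture alone. Here is a minimal sketch, assuming a console program. The `SortDemo` class and its two methods are illustrative names, not part of the original code.

```csharp
using System;
using System.Globalization;
using System.Linq;

static class SortDemo
{
    static readonly string[] Pool =
        "一 二 三 四 五 六 七 到 底 排 成 啥 順 序".Split(' ');

    // Sorts with the rules of the named culture; the thread culture is untouched.
    // Assumption: the named culture is available on the host.
    public static string SortWith(string cultureName)
    {
        StringComparer comparer =
            StringComparer.Create(new CultureInfo(cultureName), false);
        return string.Join(",", Pool.OrderBy(o => o, comparer).ToArray());
    }

    // Sorts by raw UTF-16 code unit values, independent of any culture.
    public static string SortOrdinal()
    {
        return string.Join(",",
            Pool.OrderBy(o => o, StringComparer.Ordinal).ToArray());
    }

    static void Main()
    {
        Console.WriteLine("zh-TW:   " + SortWith("zh-TW"));
        Console.WriteLine("en-US:   " + SortWith("en-US"));
        // For this list the ordinal order is code point order:
        // 一,七,三,二,五,六,到,啥,四,序,底,成,排,順
        Console.WriteLine("Ordinal: " + SortOrdinal());
    }
}
```

StringComparer.Ordinal gives the same result on every host, so it suits cases where consistency matters more than a linguistically meaningful order. The exact zh-TW order depends on the collation data of the runtime and operating system, so it is worth checking on the target host.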