According to the MSDN documentation:

If Count is less than the capacity, this method approaches an O(1) operation. If the capacity must be increased to accommodate the new element, this method becomes an O(n) operation, where n is Count.

As long as the number of items does not exceed the Dictionary's Capacity (which can be specified at construction time), adding an item is an O(1) operation: the time taken by each call does not depend on how much data the Dictionary already holds and is, in theory, constant.
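
As a quick illustration of the capacity effect described above, here is a minimal sketch of my own (not part of the original case; it assumes the same using directives as the full program below, and the exact timings will vary by machine):

static void CapacityDemo()
{
    const int N = 1000000;
    var sw = Stopwatch.StartNew();
    var presized = new Dictionary<int, int>(N);   // capacity reserved up front, no resizing needed
    for (var i = 0; i < N; i++) presized.Add(i, i);
    Console.WriteLine("pre-sized:  {0:n0}ms", sw.ElapsedMilliseconds);

    sw.Restart();
    var growing = new Dictionary<int, int>();     // starts small; Add() occasionally hits the O(n) resize path
    for (var i = 0; i < N; i++) growing.Add(i, i);
    Console.WriteLine("auto-grown: {0:n0}ms", sw.ElapsedMilliseconds);
}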

A reader, Cb Atm, provided a curious case that breaks the O(1) rule. I have simplified and rewritten it as follows:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;
 
namespace DictKey
{
    class Program2
    {
        const int CAPACITY = 20000;
        public struct Index
        {
            public string StkNo;
            public byte TypeCode;
            public int Idx;
        }
 
        static Random rnd = new Random(32767);
        static List<Index> stocks =
            Enumerable.Range(0, CAPACITY).Select(
                o => new Index()
                {
                    Idx = o,
                    StkNo = rnd.Next(1000, 1011).ToString(),
                    TypeCode = (byte)rnd.Next(0, 10)
                }).ToList();
 
        static void Test1()
        {
            var dictStruct = new Dictionary<Index, int>(CAPACITY);
            Action<int> test = (count) =>
            {
                Stopwatch sw = new Stopwatch();
                dictStruct.Clear();
                sw.Start();
                for (var i = 0; i < count; i++)
                {
                    dictStruct.Add(stocks[i], i);
                }
                sw.Stop();
                Console.WriteLine("{0:n0} items: {1:n0}ms",
                    dictStruct.Count, sw.ElapsedMilliseconds);
            };
            test(10000);
            test(CAPACITY);
        }
 
        static void Main(string[] args)
        {
            Test1();
            Console.Read();
        }
    }
}

The program declares an Index struct as the Dictionary's TKey type. The test generates 20,000 records with random values (StkNo is restricted to 1000-1010, TypeCode to 0-9, and Idx is a unique sequential number), then measures Dictionary<Index, int>.Add() for 10,000 items and for 20,000 items. By the O(1) theory, the latter should take roughly twice as long as the former. In practice it does not:

10,000 items: 1,993ms
20,000 items: 7,869ms

Inserting 20,000 items takes roughly four times as long as inserting 10,000. Why?

I made a small change, removing the range restriction on the random StkNo by changing it to StkNo = rnd.Next().ToString() (widening it to the full non-negative int range), and got a completely different result!

10,000 items: 6ms
20,000 items: 5ms

Besides running hundreds of times faster, inserting 10,000 items and inserting 20,000 items now take essentially the same time. (This means the cost of Add() itself is no longer the bottleneck; the difference is small enough to ignore.)

I then tried StkNo = rnd.Next(1000, 1100).ToString(), limiting StkNo to the range 1000-1099. The result:

10,000 items: 247ms
20,000 items: 877ms

This runs 8-9 times faster than the original version, and inserting 20,000 items takes about 3.5 times as long as inserting 10,000.

My understanding of hash table internals was limited; the results clearly show that the diversity of StkNo has an enormous impact on performance, but I didn't know why. Some searching turned up the fact that when a struct is used as TKey, key comparison involves GetHashCode(), but no amount of theory beats the authority and clarity of the source code. So I used JustDecompile to lift the veil on Dictionary<TKey, TValue>.Add():

        private void Insert(TKey key, TValue value, bool add)
        {
            int num;
            if (key == null)
            {
                ThrowHelper.ThrowArgumentNullException(ExceptionArgument.key);
            }
            if (this.buckets == null)
            {
                this.Initialize(0);
            }
            int hashCode = this.comparer.GetHashCode(key) & 2147483647;
            int length = hashCode % (int)this.buckets.Length;
            int num1 = 0;
            for (int i = this.buckets[length]; i >= 0; i = this.entries[i].next)
            {
                if (this.entries[i].hashCode == hashCode && 
                     this.comparer.Equals(this.entries[i].key, key))
                {
                    if (add)
                    {
                        ThrowHelper.ThrowArgumentException(
                              ExceptionResource.Argument_AddingDuplicate);
                    }
                    this.entries[i].@value = value;
                    Dictionary<TKey, TValue> dictionary = this;
                    dictionary.version = dictionary.version + 1;
                    return;
                }
                num1++;
            }

Following this logic, Insert() (Add() calls Insert(key, value, true) behind the scenes) first obtains the key parameter's hashCode via GetHashCode(), then applies the bucket assignment rule (hashCode modulo the number of buckets; keys with the same remainder go into the same bucket) and compares the key against each existing key in that bucket. If an existing entry has the same hashCode and the two keys are also equal according to Equals(), the add parameter decides whether to throw a duplicate-key exception or to replace the existing value.

From this we can conclude:

  1. hashCodes can repeat, and several different hashCodes can map to the same bucket.
  2. Every comparison first uses the hashCode to locate the key's bucket, then walks that bucket's entries one by one to check whether an identical key already exists.
  3. The key-equality check is this.entries[i].hashCode == hashCode && this.comparer.Equals(this.entries[i].key, key); the second-stage Equals() comparison runs only when the hashCodes match.

We can therefore infer that the more often hashCodes collide, the more keys pile up in a small number of buckets, the more extra Equals() calls each Add() must perform, and the worse performance becomes.
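
To make the cost concrete, here is a rough back-of-the-envelope sketch of my own (not from the original case). If N items are spread evenly over K distinct hash codes, the i-th Add() must scan a chain of roughly i / K existing entries, so the total number of comparisons grows on the order of N^2 / (2K). With K = 11 this predicts about a 4x jump when going from 10,000 to 20,000 items, and roughly a 9x gap between K = 11 and K = 100, which lines up with the timings observed above:

static void CollisionEstimate()
{
    // Rough model: N items spread evenly over K distinct hash codes,
    // so each Add() scans a chain of about i / K entries sharing its bucket.
    foreach (var k in new[] { 11, 100, 20000 })
    {
        foreach (var n in new[] { 10000, 20000 })
        {
            long comparisons = (long)n * n / (2 * k);   // ~ N^2 / (2K)
            Console.WriteLine("K={0,6:n0}  N={1,6:n0}  ~{2,12:n0} comparisons", k, n, comparisons);
        }
    }
}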

So, is inserting 10,000 items taking nearly 2 seconds because StkNo varies so little that the hash codes are massively duplicated? The following program checks:

static void Test2()
{
    var hashCodeStats = new Dictionary<int, int>();
    for (var i = 0; i < 20000; i++)
    {
        var hashCode = stocks[i].GetHashCode();
        if (hashCodeStats.ContainsKey(hashCode))
            hashCodeStats[hashCode]++;
        else
            hashCodeStats.Add(hashCode, 1);
    }
    Console.WriteLine("{0} distinct HashCodes", hashCodeStats.Count);
}

Test results:

StkNo = rnd.Next(1000, 1011).ToString(): the 20,000 records yield only 11 distinct HashCodes
StkNo = rnd.Next(1000, 1100).ToString(): the 20,000 records yield only 100 distinct HashCodes
StkNo = rnd.Next().ToString(): the 20,000 records yield 20,000 distinct HashCodes

Even though each of the 20,000 records has a unique Idx and a varying TypeCode, GetHashCode() appears to depend only on StkNo. The following program confirms this:

static void Test3()
{
    var idx1 = new Index()
    {
        StkNo = "AAAA",
        Idx = 1,
        TypeCode = 2
    };
    var idx2 = new Index()
    {
        StkNo = "AAAA",
        Idx = 100,
        TypeCode = 200
    };
    var idx3 = new Index()
    {
        StkNo = "AAAB",
        Idx = 1,
        TypeCode = 2
    };
    Console.WriteLine("{0} vs {1} vs {2}",
        idx1.GetHashCode(), 
        idx2.GetHashCode(), 
        idx3.GetHashCode());
}

The output is 80968112 vs 80968112 vs -651627219: whenever StkNo is the same, GetHashCode() is the same. (This is consistent with the known behavior of the default ValueType.GetHashCode() in the .NET Framework, which for a struct containing reference-type fields typically derives the hash from the first field only, here StkNo.)

Based on the analysis above, there are two ways to improve insert performance: 1) use the unique Idx value as the key; 2) fix the massive duplication of Index.GetHashCode() values.

The first approach changes Dictionary<Index, int> to Dictionary<int, int>:

static void Test1A()
{
    var dictStruct = new Dictionary<int, int>(CAPACITY);
    Action<int> test = (count) =>
    {
        Stopwatch sw = new Stopwatch();
        dictStruct.Clear();
        sw.Start();
        for (var i = 0; i < count; i++)
        {
            dictStruct.Add(stocks[i].Idx, i);
        }
        sw.Stop();
        Console.WriteLine("{0:n0} items: {1:n0}ms {2:n0}ticks",
            dictStruct.Count,
            sw.ElapsedMilliseconds, 
            sw.ElapsedTicks);
    };
    for (var i = 0; i < 5; i++)
    {
        test(10000);
        test(CAPACITY);
    }
}

After this change, both the 10,000-item and the 20,000-item runs report 0ms; the code is now too fast to measure in milliseconds, so we compare ElapsedTicks instead. Running the test five times and discarding the first warm-up round, inserting 20,000 items takes roughly twice as long as inserting 10,000.

10,000 items: 1ms 3,471ticks
20,000 items: 0ms 2,248ticks
10,000 items: 0ms 545ticks
20,000 items: 0ms 1,127ticks
10,000 items: 0ms 561ticks
20,000 items: 0ms 1,140ticks
10,000 items: 0ms 540ticks
20,000 items: 0ms 1,082ticks
10,000 items: 0ms 573ticks
20,000 items: 0ms 1,113ticks

However, switching to Dictionary<int, int> has a side effect: after retrieving the key number you need an extra lookup to get back to the Index value, so surrounding code must be rewritten. The second approach is to override Index's GetHashCode() so that it returns a more distinctive hash value (the unique sequential integer Idx is an excellent choice), as follows:

public struct Index
{
    public string StkNo;
    public byte TypeCode;
    public int Idx;
    // Return the unique sequential Idx as the hash, so collisions disappear
    public override int GetHashCode()
    {
        return Idx;
    }
}
static void Test1()
{
    Action<int> test = (count) =>
    {
        var dictStruct = new Dictionary<Index, int>(CAPACITY);
        Stopwatch sw = new Stopwatch();
        dictStruct.Clear();
        sw.Start();
        for (var i = 0; i < count; i++)
        {
            dictStruct.Add(stocks[i], i);
        }
        sw.Stop();
        Console.WriteLine("{0:n0} items: {1:n0}ms {2:n0}ticks",
            dictStruct.Count,
            sw.ElapsedMilliseconds,
            sw.ElapsedTicks);
    };
    for (var i = 0; i < 5; i++)
    {
        test(10000);
        test(CAPACITY);
    }
}

The test results are below. With the custom GetHashCode(), the tick counts are about twice those of the Dictionary<int, int> version, but execution time is now under 1ms, and inserting 20,000 items takes roughly 2.5 times as long as inserting 10,000:

10,000 items: 3ms 11,730ticks
20,000 items: 0ms 2,292ticks
10,000 items: 0ms 1,471ticks
20,000 items: 0ms 2,035ticks
10,000 items: 0ms 1,089ticks
20,000 items: 0ms 2,749ticks
10,000 items: 0ms 1,064ticks
20,000 items: 0ms 2,247ticks
10,000 items: 0ms 1,078ticks
20,000 items: 0ms 2,677ticks

[Conclusion] When a struct is used as a Dictionary TKey and its GetHashCode() values are heavily duplicated, Add() performance suffers badly. Consider overriding GetHashCode() to make the values more distinctive, or switch to a simple key type such as string or int for the best performance.
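
One closing note of my own (not covered in the original case): when no single field is unique enough by itself, a common pattern is to combine the fields in GetHashCode() and to implement IEquatable<T> at the same time, so that key comparisons skip the reflection-based ValueType.Equals(). A minimal sketch:

public struct Index : IEquatable<Index>
{
    public string StkNo;
    public byte TypeCode;
    public int Idx;

    public override int GetHashCode()
    {
        // Combine every field so the hash still varies when StkNo repeats
        unchecked
        {
            int hash = 17;
            hash = hash * 31 + (StkNo == null ? 0 : StkNo.GetHashCode());
            hash = hash * 31 + TypeCode;
            hash = hash * 31 + Idx;
            return hash;
        }
    }

    public bool Equals(Index other)
    {
        return Idx == other.Idx && TypeCode == other.TypeCode && StkNo == other.StkNo;
    }

    public override bool Equals(object obj)
    {
        return obj is Index && Equals((Index)obj);
    }
}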


Comments

# by 丫土伯

Thanks for sharing!!
