IT story

C # 문자열의 GetHashCode ()는 어떻게 구현됩니까?

hot-time 2020. 12. 30. 19:15
반응형

C # 문자열의 GetHashCode ()는 어떻게 구현됩니까?


나는 그것이 성능에 영향을 미칠 것이라고 생각하기 때문에 궁금합니다. 전체 문자열을 고려합니까? 그렇다면 긴 문자열에서 느려질 것입니다. 문자열의 일부만 고려하면 성능이 저하됩니다 (예 : 문자열의 시작 부분 만 고려하면 HashSet에 동일한 문자열이 대부분 포함되어 있으면 성능이 저하됩니다.


이와 같은 질문이 있으면 참조 소스 소스 코드 를 확보하십시오 . 디 컴파일러에서 볼 수있는 것보다 더 많은 것이 있습니다. 원하는 .NET 대상과 일치하는 것을 선택하십시오.이 방법은 버전간에 크게 변경되었습니다. 여기서는 Source.NET 4.5 \ 4.6.0.0 \ net \ clr \ src \ BCL \ System \ String.cs \ 604718 \ String.cs에서 가져온 .NET 4.5 버전을 재현하겠습니다.

        public override int GetHashCode() { 

#if FEATURE_RANDOMIZED_STRING_HASHING
            if(HashHelpers.s_UseRandomizedStringHashing)
            { 
                return InternalMarvin32HashString(this, this.Length, 0);
            } 
#endif // FEATURE_RANDOMIZED_STRING_HASHING 

            unsafe { 
                fixed (char *src = this) {
                    Contract.Assert(src[this.Length] == '\0', "src[this.Length] == '\\0'");
                    Contract.Assert( ((int)src)%4 == 0, "Managed string should start at 4 bytes boundary");

#if WIN32
                    int hash1 = (5381<<16) + 5381; 
#else 
                    int hash1 = 5381;
#endif 
                    int hash2 = hash1;

#if WIN32
                    // 32 bit machines. 
                    int* pint = (int *)src;
                    int len = this.Length; 
                    while (len > 2) 
                    {
                        hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0]; 
                        hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ pint[1];
                        pint += 2;
                        len  -= 4;
                    } 

                    if (len > 0) 
                    { 
                        hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ pint[0];
                    } 
#else
                    int     c;
                    char *s = src;
                    while ((c = s[0]) != 0) { 
                        hash1 = ((hash1 << 5) + hash1) ^ c;
                        c = s[1]; 
                        if (c == 0) 
                            break;
                        hash2 = ((hash2 << 5) + hash2) ^ c; 
                        s += 2;
                    }
#endif
#if DEBUG 
                    // We want to ensure we can change our hash function daily.
                    // This is perfectly fine as long as you don't persist the 
                    // value from GetHashCode to disk or count on String A 
                    // hashing before string B.  Those are bugs in your code.
                    hash1 ^= ThisAssembly.DailyBuildNumber; 
#endif
                    return hash1 + (hash2 * 1566083941);
                }
            } 
        }

This is possibly more than you bargained for, I'll annotate the code a bit:

  • The #if conditional compilation directives adapt this code to different .NET targets. The FEATURE_XX identifiers are defined elsewhere and turn features off whole sale throughout the .NET source code. WIN32 is defined when the target is the 32-bit version of the framework, the 64-bit version of mscorlib.dll is built separately and stored in a different subdirectory of the GAC.
  • The s_UseRandomizedStringHashing variable enables a secure version of the hashing algorithm, designed to keep programmers out of trouble that do something unwise like using GetHashCode() to generate hashes for things like passwords or encryption. It is enabled by an entry in the app.exe.config file
  • The fixed statement keeps indexing the string cheap, avoids the bounds checking done by the regular indexer
  • The first Assert ensures that the string is zero-terminated as it should be, required to allow the optimization in the loop
  • The second Assert ensures that the string is aligned to an address that's a multiple of 4 as it should be, required to keep the loop performant
  • The loop is unrolled by hand, consuming 4 characters per loop for the 32-bit version. The cast to int* is a trick to store 2 characters (2 x 16 bits) in a int (32-bits). The extra statements after the loop deal with a string whose length is not a multiple of 4. Note that the zero terminator may or may not be included in the hash, it won't be if the length is even. It looks at all the characters in the string, answering your question
  • The 64-bit version of the loop is done differently, hand-unrolled by 2. Note that it terminates early on an embedded zero, so doesn't look at all the characters. Otherwise very uncommon. That's pretty odd, I can only guess that this has something to do with strings potentially being very large. But can't think of a practical example
  • The debug code at the end ensures that no code in the framework ever takes a dependency on the hash code being reproducible between runs.
  • The hash algorithm is pretty standard. The value 1566083941 is a magic number, a prime that is common in a Mersenne twister.

Examining the source code (courtesy of ILSpy), we can see that it does iterate over the length of the string.

// string
[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail), SecuritySafeCritical]
public unsafe override int GetHashCode()
{
    IntPtr arg_0F_0;
    IntPtr expr_06 = arg_0F_0 = this;
    if (expr_06 != 0)
    {
        arg_0F_0 = (IntPtr)((int)expr_06 + RuntimeHelpers.OffsetToStringData);
    }
    char* ptr = arg_0F_0;
    int num = 352654597;
    int num2 = num;
    int* ptr2 = (int*)ptr;
    for (int i = this.Length; i > 0; i -= 4)
    {
        num = ((num << 5) + num + (num >> 27) ^ *ptr2);
        if (i <= 2)
        {
            break;
        }
        num2 = ((num2 << 5) + num2 + (num2 >> 27) ^ ptr2[(IntPtr)4 / 4]);
        ptr2 += (IntPtr)8 / 4;
    }
    return num + num2 * 1566083941;
}

ReferenceURL : https://stackoverflow.com/questions/15174477/how-is-gethashcode-of-c-sharp-string-implemented

반응형