一千萬個為什麽

搜索

如何獲得stripped模式的校驗和

I have a 64 bit number (but only the 42 low order bits are used) and need to computer the sum of the 4 bits at n, n+m, n+m*2 and n+m*3 (note: anything that can produce a sum >4 is invalid) for some fixed m and every value of n that places all the bits in the number

作為使用 m = 3 並給出16位數的例子

0010 1011 0110 0001

我需要計算

2, 3, 1, 2, 3, 0, 3

有沒有人有任何(酷)的想法來做到這一點?我很好,有點混蛋。


我目前的想法是使輸入的位移副本對齊要進行求和的值,然後構建一個邏輯樹來執行一個4x 1位加法器。

v1 = In;
v2 = In<<3;
v3 = In<<6;
v4 = In<<9;

a1 = v1 ^ v2;
a2 = v1 & v2;
b1 = v3 ^ v4;
b2 = v3 & v4;
c2 = a1 & b1;
d2 = a2 ^ b2;

o1 = a1 ^ b1;
o2 = c2 ^ d2;
o4 = a2 & b2;

這確實會導致3個不同整數的結果分散,但是很好。

edit: as it happens I need the histogram of the sums so doing a bit-count of o4, o2&o1, o2 and o1 gives me what I want.


第二個解決方案使用完美的散列函數

arr = [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4];

for(int i = 0; i < N; i++)
{
   out[i] = arr[(In & 0b1001001001) % 30]; 
   In >>= 1;
}

這是通過註意到4個選擇的比特只能采用16個模式,並且(通過猜測和檢查)他們可以使用模式30被散列到0-15中。從那裏,一個計算值的表格給出所需的總和。因為它只發生在4個步驟中的3個我需要這樣工作。


附:

正確的王牌快速。快速勝過清楚。我希望能夠運行數百萬次。

最佳答案

也許我很瘋狂,但我很開心:D 這個解決方案基於數據並行性的使用,並偽造一個矢量cpu而沒有實際使用SSE內在函數或類似的東西。

unsigned short out[64];
const unsigned long long mask      = 0x0249024902490249ul;
const unsigned long long shiftmask = 0x0001000100010001ul;

unsigned long long t = (unsigned short)(in >> 38) | (unsigned long long)(unsigned short)(in >> 39) > 40) > 41) << 48;
t &= mask;
*((unsigned long long*)(out + 38)) = (t & shiftmask) + (t >> 3 & shiftmask) + (t >> 6 & shiftmask) + (t >> 9 & shiftmask);

[... snipsnap ...]

t = (unsigned short)(in >> 2) | (unsigned long long)(unsigned short)(in >> 3) > 4) > 5) << 48;
t &= mask;
*((unsigned long long*)(out + 2)) = (t & shiftmask) + (t >> 3 & shiftmask) + (t >> 6 & shiftmask) + (t >> 9 & shiftmask);

t = (unsigned short)in | (unsigned long long)(unsigned short)(in >> 1) << 16;
t &= mask;
*((unsigned int*)out) = (unsigned int)((t & shiftmask) + (t >> 3 & shiftmask) + (t >> 6 & shiftmask) + (t >> 9 & shiftmask));

By reordering the computations, we can further reduce the execution time significantly, since it drastically reduces the amount of times that something is loaded into the QWORD. A few other optimizations are quite obvious and rather minor, but sum up to another interesting speedup.
unsigned short out[64];
const unsigned long long Xmask = 0x249024902490249ull;
const unsigned long long Ymask = 0x7000700070007u;

unsigned long long x = (in >> 14 & 0xFFFFu) | (in >> 20 & 0xFFFFu) > 26 & 0xFFFFu) > 32) << 48;
unsigned long long y;
y = x & Xmask;
y += y >> 6;
y += y >> 3;
y &= Ymask;
out[32] = (unsigned short)(y >> 48);
out[26] = (unsigned short)(y >> 32);
out[20] = (unsigned short)(y >> 16);
out[14] = (unsigned short)(y      );

x >>= 1;
y = x & Xmask;
y += y >> 6;
y += y >> 3;
y &= Ymask;
out[33] = (unsigned short)(y >> 48);
out[27] = (unsigned short)(y >> 32);
out[21] = (unsigned short)(y >> 16);
out[15] = (unsigned short)(y      );

[snisnap]

x >>= 1;
y = x & Xmask;
y += y >> 6;
y += y >> 3;
y &= Ymask;
out[37] = (unsigned short)(y >> 48);
out[31] = (unsigned short)(y >> 32);
out[25] = (unsigned short)(y >> 16);
out[19] = (unsigned short)(y      );

x >>= 1;
x &= 0xFFFF000000000000ul;
x |= (in & 0xFFFFu) | (in >> 5 & 0xFFFFu) > 10 & 0xFFFFu) << 32;
y = x & Xmask;
y += y >> 6;
y += y >> 3;
y &= Ymask;
out[38] = (unsigned short)(y >> 48);
out[10] = (unsigned short)(y >> 32);
out[ 5] = (unsigned short)(y >> 16);
out[ 0] = (unsigned short)(y      );

[snipsnap]

x >>= 1;
y = x & Xmask;
y += y >> 6;
y += y >> 3;
y &= Ymask;
out[ 9] = (unsigned short)(y >> 16);
out[ 4] = (unsigned short)(y      );

在我的電腦上編譯為64位二進制文​​件的本地c ++執行5000萬次執行的運行時間(所有輸出經過驗證匹配^^):
基於陣列的解決方案:〜5700 ms
樸素的硬編碼解決方案:〜4200毫秒
第一個解決方案:〜2400 ms
第二種解決方案:〜1600毫秒

轉載註明原文: 如何獲得stripped模式的校驗和