From NuclearCat's homepage
- Wonderful CRC guide
- Eric Dumazet's suggestion for getting the correct source IP when a UDP socket listens on 0.0.0.0
- C bitwise operations
- Hashes, trees, etc.
- pahole, a wonderful tool for finding padding holes in structures
- pfunct --size to see the size of functions
- strip --remove-section=.comment --remove-section=.note
- gcc -Os -s
- When linking against libraries, compile with gcc -ffunction-sections -fdata-sections and link with -Wl,--gc-sections so unused sections are garbage-collected
- posix_memalign(void **memptr, size_t alignment, size_t size)
- __attribute__ ((aligned(32)))
- -fmudflap -lmudflap (pointer-debugging instrumentation, not a performance option)
- Integer overflow can be trapped at runtime by compiling with -ftrapv
- Sequential memory access is almost 10 times faster than random access (no surprise for DB developers, I hope). In your programs, this means preferring arrays to linked lists. In your code, it means avoiding indirect calls. C++ developers: avoid virtual functions and use templates instead.
- Choose data structures as small as possible. There is a huge speed penalty for data elements that are larger than a cache line. If they are larger than a memory page, it gets much worse.
- Also note: cache alignment, plus issues such as false sharing and data straddling cache lines, etc.
- Pre-fetching data (reading it into the cache before it is actually used) can hide much of the cost of main memory access. C developers can force a pre-fetch by touching the variable before it is actually needed, or by issuing the appropriate prefetch instruction.
- Use gcc builtins; they are safe for cross-platform code. Also make use of gcc's profiling power.
- Use structures to keep together variables that are used together. This will help you get them on the same cache line.
- Don't forget about structure alignment.
- L3 cache is used in high-end servers. A large L3 cache, say 32M, can give you tremendous speed-ups, much better than what a faster CPU will give you, which is why servers with large L3 caches are so expensive. Ulrich gave an example of Oracle fitting entirely into the L3 cache, which would make it 10 times faster and justify the cost. I burst out laughing, because to me fitting the DB into 32M of memory is ridiculous - we need 12G for the SGA! But the rest of the audience took this seriously, so maybe I missed something.
- Multi-threading. Generally a very bad idea for shared, written data: if a piece of memory is in CPU #1's L1 cache, CPU #2 can't use it until #1 releases it, and then #2 reads it from main memory while invalidating #1's cache line, because #2 changed the data. So if your threads write to shared memory, your expensive L1 cache is wasted. This part reminded me of the way cache fusion works in RAC.
- Tip: avoid branching; mispredicted branches kill performance on modern superscalar processors.
- Value wrap-around: in some cases use & (max_value - 1) instead of %, which works when max_value is a power of two.
- Prefetch instructions: issue them 6-10 instructions before use if the data is likely in L2, and 20-25 instructions ahead if the data is likely in main memory.
- prefetchnta will load data directly into L1. Use it for any kind of data access that would otherwise pollute the L2 cache.
- A cache line is usually 32 bytes; the coherency line size on newer architectures is 64 bytes.
- posix_memalign(void **memptr, size_t alignment, size_t size) with alignment = 64 for cache-friendly allocation (64 bytes is the usual L1 cache line on x86)
- Prefer M-ary trees to binary or red/black trees: many keys per node are more cache-friendly.