TrackFM: Far-out Compiler Support for a Far Memory World
Introduction
TrackFM is a compiler-based far memory system which enables programmers to transparently upgrade their applications to use remote memory, extending the capacity of available main memory for those programs. TrackFM does this by using modern compiler analysis and transformation techniques to support far/remote memory. To optimize the data movement costs between server and client, TrackFM introduces novel compiler analysis and transformation passes (in the form of LLVM passes).
Along with the compiler techniques, TrackFM extends the library-based far memory solution, AIFM, to reuse it’s runtime and local/remote object management mechanims.
TrackFM Design and Implementation
TrackFM has 2 components:
- Compiler: Takes the unmodified application source code and uses the modern/novel compiler analysis and transformation passes to generate the transformed. It also injects required TrackFM runtime dependencies. TrackFM compiler employs two main analysis/transformation optimization for minimal remote access overheads:
- Guard Checking: TracFM uses guards to ensure that an object is present in local memory before accessing it.
- Loop Chunking: Loop chunking is used to reduce the overhead introduced into the program where guards are present in a loop body.
- Runtime: TrackFM reuses AIFM’s runtime to create and manage remote objects, i.e., how to fetch/evict objects from remote server into the local memory.
Results
- TrackFM performs around 2.7x better than FastSwap but slightly worse than AIFM around 0.9x. However, TrackFM enables far memory transparently unlike AIFM.
- Compilatoin cost for TrackFM is 6x compared to standard LLVM and the generated code size is increased by about 2.4x.
Discussion
- Important to note that the “Normalized performance” on the graph between FastSwap and AIFM is in comparison to a machine with local memory
- It is not optimal to check every time, essentially turning one instruction into 14. THe challenge is knowing when NOT to inject it into the code
- AIFM gives you prefetching (you get it for free since TrackFM extends AIFM)
- Remote cost is where the thing is actually remote and you need to go fetch it. Overhead is drowned out over the network because the network is so expensive
- When comparing to overhead, you’re only talking about the remote case, which is what you’re optimizing for
- Q: Is this a synthetic case? A: This is like a minor fault, where you’re faulting in a page, not because it’s over the network, but because it’s not mapped in. It’s in memory, just not a page table for it
- Q: In the graph of TrackFM, AIFM, and Fastswap. X-axis was [% of 31GB] for the local memory, is that the percentage free or used? A: It can be interpreted as the percentage constrained as it’s only able to use 16GB (50%). Essentially artificially constraining the application to only use this amount of memory
- Q: So that’s what the tuning would be good for? A: Yes, the assumption is the person modifying the code understands the infrastructure. So someone using a remote array would use a different object size from someone using the map, but if the compiler is doing it for you, it doesn’t know
- Memory constrained to 1Gb out of 12Gb
- Q: What’s zipfy/zipf/zipfian? A: Basically a power law distribution, There are very few hot keys, and very many keys that aren’t accessed very often, essentially a long tail.
- Q: The higher the skew, the longer the tail? A: Dr.Hale: If you have higher skewness, (accessing one object over and over again, perfect for Fastswap), but for TackFM, it’s not very great, high spatial locality, high guards/faults, essentially lookups in a hash table (random access)
- Q Opportunities for hardware to help questions were asked from reviewers a lot (conclusions and thoughts slide)? A: Yes, they’re building hardware for architected pointer features to be combined with compiler-driven techniques. Thought about building on top of the cherry(?) extensions The compiler approach gives more flexibility.
- TrackFM only works for sequential data structures currently, anything that’s a linked data structure. It’ll work but it’ll pay the cost of the guards.
- Q: Was there data that showed the performance if you didn’t do any loop chunking? A: Yes, it was pretty bad
Conclusion
In conclusion, TrackFM is a compiler/runtime-based solution that enables programmers to write application that can use far memory (with large memory capacity) transparently. Using TrackFM, programmers can expect performance on-par with the existing state-of-the-art library-based solutions (within 10%) along with 2x the programmer transparency. A lot of this is possible because of advancements made in the far memory hardware (RDMA), which provides low latency remote memory accesses, enabling the solutions like TrackFM.
References
- Paper: https://dl.acm.org/doi/10.1145/3617232.3624856
- Lightning Talk Video: https://www.youtube.com/watch?v=e6c0MhP2CJQ
- Repo: https://github.com/compiler-disagg/TrackFM
- Fastswap (2020) paper: https://dl.acm.org/doi/pdf/10.1145/3342195.3387522
- Fastswap slides: https://2020.eurosys.org/wp-content/uploads/2020/04/slides/133_amaro_slides.pdf
- AIFM (2020) paper: https://www.usenix.org/system/files/osdi20-ruan.pdf
- AIFM slides: https://www.usenix.org/sites/default/files/conference/protected-files/osdi20_slides_ruan.pdf
- AIFM RDMA Project: https://github.com/fengqingthu/MIT_6.5810_Adding_RDMA_To_AIFM/blob/master/65810_Final_Project.pdf