The ECE 4/599 Course Blog

Disaggregated Memory for Expansion and Sharing in Blade Servers

by Gabriel Rodgers (leader), Sami Aljabery (blogger), Noah Bean (scribe)

Introduction

This paper examines a current trend in server architecture: compute capacity is growing faster than memory capacity, and this imbalance is accompanied by rising power consumption and data center costs. The paper re-evaluates the traditional co-location of compute and memory by introducing a "memory blade", a module for blade servers that enables disaggregated memory across a system. The memory blade supports memory-capacity expansion, improving performance and reducing power costs. In particular, memory blades serve as the building blocks for two system proposals: page-swapping remote memory (PS) and fine-grained remote memory access (FGRA). PS is implemented in the virtualization layer, while FGRA is implemented in the coherence hardware; both handle transparent memory expansion and sharing across systems.
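
To make the disaggregation idea concrete, here is a toy sketch (our own illustration, not code from the paper) of a shared memory blade lending capacity to compute blades on demand; the class names, sizes, and API are invented for illustration.

```python
# Toy model of disaggregated memory: each compute blade keeps its local
# DRAM, and a shared memory blade lends extra capacity on demand.
# All names and sizes here are illustrative assumptions.

class MemoryBlade:
    """Shared pool of remote DRAM, carved up among client servers."""
    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.allocations = {}          # server_id -> GB lent out

    def free_gb(self):
        return self.capacity_gb - sum(self.allocations.values())

    def lend(self, server_id, gb):
        """Grant remote capacity if the pool has room."""
        if gb > self.free_gb():
            raise MemoryError("memory blade exhausted")
        self.allocations[server_id] = self.allocations.get(server_id, 0) + gb

blade = MemoryBlade(capacity_gb=256)
blade.lend("compute-blade-0", 32)    # server 0 expands beyond local DRAM
blade.lend("compute-blade-1", 64)
print(blade.free_gb())               # 160 GB still available to share
```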

Implementation

The implementation starts with the memory blade, which is composed of DRAM modules that connect to the servers over a high-speed interface such as PCIe. The memory blade's design also includes custom hardware so that both the PS and FGRA systems can be implemented and tested. PS does not use the custom hardware; instead, it leverages virtual memory, with the hypervisor swapping pages between local and remote memory. Because the hypervisor manages page accesses in the software stack, the design remains transparent to the OS and applications. Unlike PS, FGRA does not rely on software and the VMM alone: it takes full advantage of the memory blade's custom hardware, extending the coherence protocol and using a custom memory controller to access remote memory at fine granularity, which avoids the overhead of swapping whole pages. Both systems are put through trace-based simulations and evaluated on benchmarks to compare the performance of a memory-blade system against a traditional one.
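
The sketch below is a minimal illustration of the PS idea under stated assumptions, not the paper's implementation: a hypervisor-level handler keeps hot pages in local DRAM and, on a miss, swaps the remote page in while spilling an LRU victim out to the memory blade. The page size, eviction policy, and data structures are all our assumptions.

```python
# Minimal sketch of PS-style page swapping: local DRAM holds a fixed
# number of frames; touching a remotely held page triggers a swap so
# that later accesses hit local memory. Illustrative assumptions only.

from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size

class PageSwapper:
    def __init__(self, local_frames):
        self.local = OrderedDict()     # page_id -> data, kept in LRU order
        self.remote = {}               # pages spilled to the memory blade
        self.local_frames = local_frames

    def access(self, page_id):
        if page_id in self.local:
            self.local.move_to_end(page_id)       # fast path: local hit
            return self.local[page_id]
        # "Page fault": fetch the page from the memory blade ...
        data = self.remote.pop(page_id, bytearray(PAGE_SIZE))
        # ... and, if local DRAM is full, swap an LRU victim out.
        if len(self.local) >= self.local_frames:
            victim_id, victim_data = self.local.popitem(last=False)
            self.remote[victim_id] = victim_data  # victim goes remote
        self.local[page_id] = data
        return data

swapper = PageSwapper(local_frames=2)
for page in (1, 2, 3, 1):
    swapper.access(page)
print(sorted(swapper.local))   # [1, 3] resident locally
print(sorted(swapper.remote))  # [2] spilled to the memory blade

# FGRA, by contrast, would satisfy the miss directly over the coherence
# fabric at cache-block granularity instead of swapping a whole page.
```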

Results

The simulations use benchmarks such as zeusmp, perl, gcc, bwaves, spec4p, nutch4p, tpchmix, mcf, pgbench, indexer, and specjbb, with the harmonic mean (Hmean) reported across them. Figure 5 details the capacity-expansion results: PS outperforms FGRA in speedups under both M-app-75% and M-median provisioning. These graphs show FGRA and PS relative to the baseline, with speedups ranging from 4X to 320X. In Figure 7(a), PS handles the imbalance between VM memory demands and local capacity, allowing up to a 68% reduction in processor count. In terms of cost, PS is slightly cheaper since it does not rely on custom hardware, and Figure 7(b) shows that disaggregated memory improves performance-per-dollar by up to 87%. FGRA has some drawbacks because it does not handle swapping the way PS does; Figure 8 addresses potential hardware extensions such as page migration.
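
To clarify what the performance-per-dollar metric means, here is a small worked example; the throughput and cost values below are made-up placeholders chosen only so the formula reproduces a roughly 87% improvement, and they are not the paper's measured numbers.

```python
# Hedged illustration of the performance-per-dollar metric. Only the
# formula (performance / system cost) comes from the metric itself;
# the input numbers are invented placeholders.

def perf_per_dollar(throughput, cost):
    return throughput / cost

baseline   = perf_per_dollar(throughput=1.00, cost=1.00)
with_blade = perf_per_dollar(throughput=2.15, cost=1.15)  # hypothetical

improvement = (with_blade / baseline - 1) * 100
print(f"performance-per-dollar improvement: {improvement:.0f}%")  # ~87%
```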

Weaknesses

One of the main weaknesses discussed in the paper is that the memory blade introduces several new failure modes, which could be mitigated by adding redundancy to the memory controller. Another noted weakness is that while PS works without any additional hardware, FGRA would require hardware modifications to be feasible, including support for page migration. In addition, both PS and FGRA struggle under heavier loads, which could be addressed by improving the high-speed interface. This is also a concern when considering local-versus-remote latency tradeoffs; an upgrade from PCIe to CXL could be worth exploring.
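
To see why the interface matters, here is a back-of-the-envelope model (our assumption, not the paper's) treating effective memory latency as a hit-rate-weighted mix of local and remote access times; the nanosecond figures are illustrative guesses, not measurements.

```python
# Effective memory latency under a local/remote split: the further the
# local hit rate falls, the more the remote link's latency dominates.
# All latency numbers below are illustrative assumptions.

def effective_latency_ns(local_hit_rate, local_ns, remote_ns):
    return local_hit_rate * local_ns + (1 - local_hit_rate) * remote_ns

local_ns = 100                     # assumed local DRAM access time
for remote_ns in (2000, 800):      # e.g. PCIe-era link vs. a faster CXL-like link
    for hit in (0.99, 0.95):
        print(f"hit={hit:.2f} remote={remote_ns}ns -> "
              f"{effective_latency_ns(hit, local_ns, remote_ns):.0f}ns")
```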

Class Discussion

Practical Considerations

Sources:

https://dl.acm.org/doi/10.1145/1555754.1555789

https://chatgpt.com/

Note: ChatGPT was used to improve clarity on the class discussion section.