The CS/ECE 4/599 Course Blog

(MC)^2: Lazy MemCopy at the Memory Controller

by Paul Suvrojyoti (Leader / Presenter / Blogger), Kabir Vidyarthi (Presenter), Derek Werbowy (Presenter), Carlos Alvarado-Lopez (Scribe), William Davis (Blogger)

Introduction

Modern software systems rely heavily on memory copying to provide isolation, simplify synchronization, and support common operations such as serialization, I/O buffering, and snapshot creation. Although memcpy appears simple, it imposes a significant performance cost: a large share of CPU cycles is spent stalled on cache misses and DRAM accesses, and in many cases only a small portion of the copied data is ever used. As processor speeds continue to outpace improvements in memory latency, the inefficiency of eager, byte-by-byte copying becomes a major bottleneck for the entire system.

The (MC)² lazy memcpy architecture addresses these problems by rethinking where and when data movement occurs. Instead of immediately copying data at the CPU, (MC)² shifts copy management into the memory controller and delays the data transfer until the data is needed. By tracking copy intent and resolving it only on demand, (MC)² aims to eliminate redundant memory traffic, reduce cache pollution, and substantially lower the stall time associated with traditional memory copy operations.

Overview

(MC)² reduces the high cost of memory copying by moving copy management into the memory controller and making copies lazy. Instead of eagerly copying data with memcpy, the system records copy intent in a Copy Tracking Table (CTT) and delays the actual movement of bytes until the data is needed. Copies are made only when the destination is read or the source is overwritten. In this way (MC)² avoids unnecessary memory traffic and CPU stalls, which are a major bottleneck in modern systems because of cache misses and long DRAM latencies.

The memory controller is extended with hardware structures, including the CTT and a bounce pending queue (BPQ), to transparently intercept copy requests and route memory accesses to the correct location. This allows destination reads to be serviced directly from the source buffer and source writes to trigger on-demand copying, all while preserving memory consistency and cache coherence. By operating below the cache hierarchy and working on physical addresses, (MC)² provides fine-grained, cacheline-level copy virtualization without requiring operating-system or application-level changes.
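The routing rules described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the class `LazyCopyModel`, its method names, and the dictionary standing in for the CTT are all hypothetical, memory is modeled as whole 64-byte lines, the BPQ is not modeled, and destination reads are simply redirected to the source rather than written back.

```python
LINE = 64  # assumed cacheline size in bytes


class LazyCopyModel:
    """Toy model of lazy copying at cacheline granularity (illustrative only)."""

    def __init__(self, num_lines):
        self.mem = [bytes(LINE) for _ in range(num_lines)]  # simulated DRAM lines
        self.ctt = {}         # hypothetical CTT: destination line -> source line
        self.lines_moved = 0  # cachelines physically copied so far

    def _flush_dependents(self, line):
        # Materialize every pending copy whose source is `line`,
        # because `line` is about to change.
        for dst in [d for d, s in self.ctt.items() if s == line]:
            self.mem[dst] = self.mem[line]
            del self.ctt[dst]
            self.lines_moved += 1

    def lazy_copy(self, dst, src, nlines):
        # Record copy intent only; no data moves here.
        # Overlapping ranges are not handled (as with memcpy).
        for i in range(nlines):
            d, s = dst + i, src + i
            self._flush_dependents(d)   # d's old contents may still be needed
            root = self.ctt.get(s, s)   # follow a chain back to the original source
            if root != d:
                self.ctt[d] = root

    def read(self, line):
        # A tracked destination is served from its source line.
        return self.mem[self.ctt.get(line, line)]

    def write(self, line, data):
        self._flush_dependents(line)  # a source write triggers an on-demand copy
        self.ctt.pop(line, None)      # a full-line destination write cancels the pending copy
        self.mem[line] = data


m = LazyCopyModel(8)
m.write(0, b"A" * LINE)
m.lazy_copy(4, 0, 1)          # nothing is copied yet
print(m.lines_moved)          # 0
print(m.read(4) == b"A" * LINE)  # True, served from line 0
m.write(0, b"Z" * LINE)       # overwriting the source forces the copy
print(m.lines_moved)          # 1
```

In this model a copy that is never read and whose source is never modified costs no data movement at all, which is the case the lazy approach is meant to exploit.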

(MC)² further improves performance by handling chains of copies, merging adjacent regions, and completing copies in the background when tracking resources become saturated. Evaluations across microbenchmarks and real applications such as Protobuf, MongoDB, MVCC databases, and fork-based snapshots showed significant reductions in copy-induced stalls and memory bandwidth consumption, yielding substantial speedups for the small and partially used buffers that dominate real-world workloads.
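As a rough illustration of what merging adjacent regions could look like, the sketch below coalesces tracking entries whose destination and source ranges are both contiguous. The `(dst, src, length)` tuple layout and the function name `merge_adjacent` are assumptions made for this example; the source does not describe the actual entry format or merge policy.

```python
def merge_adjacent(entries):
    """Coalesce (dst, src, length) entries that are contiguous in both
    destination and source. The entry layout is hypothetical."""
    merged = []
    for dst, src, length in sorted(entries):  # sort by destination address
        if merged:
            pdst, psrc, plen = merged[-1]
            # Merge only if this entry continues the previous one on both sides.
            if pdst + plen == dst and psrc + plen == src:
                merged[-1] = (pdst, psrc, plen + length)
                continue
        merged.append((dst, src, length))
    return merged


entries = [(0x2040, 0x1040, 64), (0x2000, 0x1000, 64), (0x3000, 0x5000, 64)]
for dst, src, length in merge_adjacent(entries):
    print(hex(dst), hex(src), length)
# 0x2000 0x1000 128
# 0x3000 0x5000 64
```

Merging in this way lets one entry describe a larger region, which matters when the tracking table has a fixed number of slots.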

Hardware Function and Changes

For MCLAZY, the CPU pushes copy management into the memory controller. To support this, an SRAM-based CTT (copy tracking table) and a small queue called the BPQ (bounce pending queue) were added. CTT functions:

Software Function and Changes:

(MC)² also makes some changes on the software side. It introduces a clean wrapper interface for lazy copying, “memcpy_lazy(dest, src, size)”, which programs can call with arbitrary sizes and alignments. The paper states that the hardware mechanism is most efficient when it tracks 64-byte, cacheline-sized chunks. The software therefore provides a wrapper that preserves normal copy semantics but internally chooses between

For example, suppose the program calls memcpy_lazy(dest=…03, src=…00, size=200) and the cacheline size is 64 B.

The wrapper will:
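One plausible way to work out the split for this call is sketched below. It assumes the wrapper copies an unaligned head and tail eagerly and hands only whole, destination-aligned cachelines to the lazy mechanism; that policy, the function name `split_copy`, and the address `0x1003` (chosen only so that the destination sits 3 bytes into a cacheline) are assumptions for illustration, not details confirmed by the source.

```python
LINE = 64  # cacheline size in bytes, as in the example above


def split_copy(dest, size, line=LINE):
    """Split a copy into (head, middle, tail) byte counts.

    head:   bytes up to the next cacheline boundary of dest (eager copy)
    middle: whole cachelines (candidate for lazy copy)
    tail:   leftover bytes after the last whole cacheline (eager copy)
    """
    head = min((-dest) % line, size)
    middle = ((size - head) // line) * line
    tail = size - head - middle
    return head, middle, tail


# Hypothetical destination address whose offset within the cacheline is 3.
print(split_copy(0x1003, 200))  # (61, 128, 11)
```

Under these assumptions, 61 bytes bring the destination to a cacheline boundary, two full cachelines (128 bytes) can be tracked lazily, and the remaining 11 bytes are copied directly.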

Results:

In performance evaluations, (MC)² lazy memcpy outperforms many of the existing systems. The only other system that comes close to it in terms of latency is zIO. For uncached source buffers, the lazy copy system achieves up to 11 times lower latency than conventional memcpy for medium and large copy sizes over 1 KB, where DRAM access dominates execution time. When the source is already in the cache, traditional memcpy can be slightly faster for smaller copy sizes because of the added memory controller logic in the (MC)² lazy memcpy system. For a different type of access pattern, where it reads copied data

Key Results:

Class Discussion

The system presents a valuable solution but at what cost?

Why hasn’t this been adopted yet?

Is this a hardware vulnerability?

What is the actual cost?

Conclusion

In summary, (MC)² lazy memcpy shows that a large portion of system time is wasted copying data that is never fully used, with CPUs stalled waiting on memory rather than doing useful work. By moving copy management into the memory controller and making copies lazy, the system only has to track copy intent and moves data only when it is needed. Hardware support through the copy tracking table and bounce pending queue enables this to happen in a transparent fashion.

References:

Generative AI Disclosure