What is ECC memory?
For servers in businesses and data centers, it’s mission-critical to minimize errors in data, and that’s the purpose of ECC (Error Correcting Code) memory.
ECC is a method of detecting and then correcting single-bit memory errors. A single-bit memory error occurs when one bit in memory holds a different value from the one that was written, and even a single such error can corrupt data or crash a server.
There are two types of single-bit memory errors: hard errors and soft errors. Hard errors are caused by physical factors, such as excessive temperature variation, voltage stress, or physical stress placed on the memory cells.
Soft errors occur when data is written or read differently than originally intended. Their causes range from variations in voltage on the motherboard to cosmic rays or radioactive decay that can cause bits in memory to flip. Since each bit holds its value as an electrical charge, this type of interference can alter that charge, causing an error. In servers, errors can occur in several places: in the storage drive, in the CPU core, across a network connection, and in various types of memory.
For workstations and servers where errors, data corruption, and system failure must be avoided at all costs, such as in the financial sector, ECC memory is often the memory of choice.
Here’s how ECC memory works. In computing, data is received and transmitted as bits, the smallest unit of data in a computer, and each bit is either a one or a zero.
When bits are grouped together, they form binary “words,” which are the units of data that are addressed and moved between memory and the CPU. For example, 10110001 is an 8-bit word.
With ECC memory, there is an extra bit known as a parity bit. Appending it to the example word gives 101100010, where the final 0 is the parity bit and is used to identify memory errors. In an even-parity scheme, the parity bit is chosen so that the total number of 1s, parity bit included, is even; the data 10110001 already contains four 1s, so its parity bit is 0. An error-free word therefore always has even parity, and a word that reads back with an odd number of 1s must contain an error. However, parity has two limitations: it can only detect an odd number of errors (1, 3, 5, etc.) and lets an even number of errors (2, 4, 6, etc.) pass unnoticed. Parity also can’t correct errors; it can only detect them. That’s where ECC memory comes into play.
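A minimal Python sketch of the even-parity check described above, assuming each word is held as a string of "0"/"1" characters. The function names (`even_parity_bit`, `parity_ok`, `flip`) are hypothetical and only for illustration. It reproduces the 101100010 example and shows that one flipped bit is caught while two flipped bits slip through.

```python
# Sketch: even parity on one 8-bit word (illustrative names, not a real API).

def even_parity_bit(data: str) -> str:
    """Return the bit that makes the total count of 1s even."""
    return "1" if data.count("1") % 2 else "0"


def parity_ok(word: str) -> bool:
    """A stored word (data + parity bit) passes if it holds an even number of 1s."""
    return word.count("1") % 2 == 0


def flip(word: str, *positions: int) -> str:
    """Flip the bits at the given 0-based positions to simulate memory errors."""
    bits = list(word)
    for i in positions:
        bits[i] = "1" if bits[i] == "0" else "0"
    return "".join(bits)


data = "10110001"
stored = data + even_parity_bit(data)
print(stored)                          # 101100010
print(parity_ok(stored))               # True: no error
print(parity_ok(flip(stored, 2)))      # False: one flipped bit is detected
print(parity_ok(flip(stored, 2, 3)))   # True: two flipped bits go unnoticed
```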
ECC memory goes further by storing several check bits (commonly 8 check bits for every 64 bits of data) alongside the data. When data is written to memory, an ECC code is computed from it and stored at the same time. When the data is read, a new code is computed from the data as read and compared with the stored code. If the two don’t match, the pattern of mismatched check bits, called the syndrome, identifies which bit is in error, and that bit is immediately corrected. Syndrome tables map each possible syndrome to the bit it points to, giving a mathematical way of identifying these bit errors and then correcting them.
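The check-and-syndrome process can be sketched in Python with a textbook Hamming(12,8) code, in which 4 check bits protect 8 data bits. This code is an assumption chosen for readability, not the code any particular ECC module uses, and every name below (`encode`, `decode`, `SYNDROME_TABLE`, and so on) is hypothetical. The syndrome table is built by injecting each possible single-bit error into a valid codeword and recording the syndrome it produces, so correcting a word becomes a table lookup. The last lines flip the third data bit of the example word 10110001 and correct it.

```python
# Sketch only: a textbook Hamming(12,8) code, assumed for illustration.
# Real ECC modules use their own (often SECDED) codes over wider words.

N = 12                                     # codeword positions 1..12
PARITY_POSITIONS = (1, 2, 4, 8)            # check bits sit at powers of two
DATA_POSITIONS = (3, 5, 6, 7, 9, 10, 11, 12)


def syndrome(code: list[int]) -> int:
    """Recompute each check; set bit k of the result if check 2**k fails."""
    s = 0
    for k, p in enumerate(PARITY_POSITIONS):
        # Check bit p covers every position whose index shares bit p.
        if sum(code[i] for i in range(1, N + 1) if i & p) % 2:
            s |= 1 << k
    return s


def encode(data: str) -> list[int]:
    """Place 8 data bits, then set each check bit so its group has even parity."""
    code = [0] * (N + 1)                   # index 0 is unused
    for pos, bit in zip(DATA_POSITIONS, data):
        code[pos] = int(bit)
    for p in PARITY_POSITIONS:
        code[p] = sum(code[i] for i in range(1, N + 1) if i & p and i != p) % 2
    return code


def build_syndrome_table() -> dict[int, int]:
    """Inject every possible single-bit error and record the syndrome it causes."""
    table = {}
    for pos in range(1, N + 1):
        word = [0] * (N + 1)               # all zeros is a valid codeword
        word[pos] = 1                      # simulate a flip at this position
        table[syndrome(word)] = pos
    return table


SYNDROME_TABLE = build_syndrome_table()


def decode(code: list[int]) -> tuple[str, int]:
    """Correct up to one flipped bit; return (data bits, corrected position or 0)."""
    code = code[:]                         # leave the caller's word untouched
    s = syndrome(code)
    fixed = 0
    if s:
        if s not in SYNDROME_TABLE:
            raise ValueError("uncorrectable: more than one bit is in error")
        fixed = SYNDROME_TABLE[s]
        code[fixed] ^= 1                   # flip the bit the syndrome points at
    return "".join(str(code[p]) for p in DATA_POSITIONS), fixed


word = encode("10110001")
word[DATA_POSITIONS[2]] ^= 1               # flip the third data bit (1 -> 0)
print(decode(word))                        # ('10110001', 6)
```

With this layout the table is trivial (each syndrome equals the position it names), but building it by injecting errors works for any linear code. A two-bit error can produce a syndrome that is in the table and be miscorrected, which is why server ECC typically adds one more check bit (SECDED) so that two-bit errors are detected rather than miscorrected.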
As data is processed, ECC memory continually checks it in this way, detecting and correcting single-bit memory errors as they occur.
In mission-critical industries, such as the financial sector, ECC memory can make a massive difference. Imagine you’re editing a client’s confidential account information and then exchanging this data with other financial institutions. While that data sits in your server’s memory before it’s sent, say a single bit gets flipped by some type of electrical interference.
The binary code that the other financial institution would receive could then be 100100010, which communicates different information than you originally intended: it’s an error. The third digit has been flipped from a 1 to a 0 by the electrical interference, so the first eight bits now contain only three 1s. The word no longer has even parity, which means the confidential data has been corrupted, and errors like this can also cause a system crash. However, if ECC memory is installed, it will detect the error and correct it by changing the third digit back to a 1, restoring the original code.
By detecting and correcting single-bit errors, ECC server memory helps preserve the integrity of your data, prevent data corruption, and prevent system crashes and failures.
Use Crucial ECC memory in mission-critical servers and workstations.