February 9, 2011

UnbeliEvable MeMory FailurEs

Over the last few days a program I was working on has been showing very inconsistent, almost nonsensical errors. Finding the cause was very hard, and things took an unexpected turn.

Of UTF8 Errors

At first some XML files were reported as not properly UTF-8 encoded, a fault in the program that we tried to fix. The program does some automatic conversion, so I revised that code and made some changes. But things kept getting worse. Also, restarting the server twice and reading the same files (with no changes in between) led to different errors.

Of Strange Uppercase Letters 

As we kept looking in the logs, the XML processing errors increased and started to involve XML tags as well. An open tag was not properly closed. Or tags would not match. Or:

Parse Error: Opening and ending tag mismatch: tweetId line 0 and twEetId

The closing tag has an uppercase E. Now, these files are not written by a human; they are generated by the program. Many 'e' characters got turned uppercase, causing trouble. Then some 'm' characters. Since there was some code to cope with ASCII characters in UTF-8 files, I thought that code had gone bonkers. So I replaced it with some totally different code. Same effect.

Strings in a String List That Change By Themselves

Finally, today we had a file name error. The program loads a list of files into a TStringList, then reads some of the files. One of the files was missing. Well, no: the original file was there, but one of the file names in the TStringList had a letter turned to uppercase. Since the storage is on a Linux box, file names are case sensitive. This was completely unbelievable, but it made an earlier suspicion (hardware problems?) seem plausible.
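
To make a suspicion like this concrete, one can compare an expected name with the corrupted one byte by byte and XOR them: if every difference is the same single bit, a hardware fault becomes a far more credible suspect than a logic bug. The following is a minimal Python sketch, not code from the program. The helper names `bit_diffs` and `looks_like_single_bit_fault` are hypothetical, the strings are assumed to be pure ASCII, and the sample values are the tag names from the parse error above, since the real file names are not shown here.

```python
def bit_diffs(expected, actual):
    """Return (offset, xor_mask) for each byte that differs between
    two equal-length ASCII strings."""
    a = expected.encode("ascii")  # assumption: names are plain ASCII
    b = actual.encode("ascii")
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return [(i, x ^ y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


def looks_like_single_bit_fault(expected, actual):
    """True if all differing bytes differ in one and the same single bit."""
    masks = {mask for _, mask in bit_diffs(expected, actual)}
    return len(masks) == 1 and bin(masks.pop()).count("1") == 1


# Tag names taken from the parse error quoted earlier in the post.
for offset, mask in bit_diffs("tweetId", "twEetId"):
    print(f"offset {offset}: xor mask 0x{mask:02X}")  # prints: offset 2: xor mask 0x20
print(looks_like_single_bit_fault("tweetId", "twEetId"))  # prints: True
```

A software bug that mangles case would usually hit every occurrence of a letter, or a whole string; a lone 0x20 difference at an arbitrary offset points elsewhere.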

We ran a MemTest and got droves of errors. We asked the web farm support people to fix the hardware... and within a couple of hours we had the RAM replaced and the application running nicely with no errors (well, only a few minor errors that are in the code!).

It's "a Bit in the RAM", Can You Believe It?

I was a little shocked. I had witnessed disk drive failures. I had witnessed RAM errors (writing to an area would cause access violations and crash programs). But I would never have thought that a hardware error in one of the bits of some bytes of a memory block would turn some letters uppercase! That's a hardware-based algorithm, very fast... but unfortunately not very reliable.
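
The effect is easy to reproduce in software. Lowercase and uppercase ASCII letters differ only in the bit with value 32 (0x20), so a memory cell that cannot hold a 1 in that position returns 'E' when 'e' was written. The sketch below is an illustration in Python under that assumption (a bit stuck at 0). The function name `through_stuck_bit` and the choice of the damaged offset are made up for the example; the actual failure pattern of the replaced RAM is not known.

```python
CASE_BIT = 0x20  # the only bit that differs between 'e' (0x65) and 'E' (0x45)


def through_stuck_bit(data, bad_offsets):
    """Return data as it would read back if the bytes at bad_offsets
    were stored in cells whose 0x20 bit is stuck at 0 (an assumption)."""
    return bytes(
        (b & ~CASE_BIT & 0xFF) if i in bad_offsets else b
        for i, b in enumerate(data)
    )


tag = b"tweetId"
corrupted = through_stuck_bit(tag, {2})  # pretend offset 2 sits on the bad cell
print(corrupted.decode("ascii"))  # prints: twEetId
```

Non-letter bytes that pass through the same cell are damaged too: a space (0x20) reads back as a NUL byte, so uppercase letters are only the most visible symptom.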

Were someone to tell me letters in his program were magically turning uppercase, I'd probably have told him to start a career in something other than programming. But since I witnessed this, I realize there is no limit to the creativity of hardware failures!

PS. No, no error on this server, I wrote uppercase letters in the title on purpose, as you probably guessed!



UnbeliEvable MeMory FailurEs 

Pretty failure!
Many years ago, after a lightning strike, a dot matrix 
printer started adding a bit to every letter. Thus A 
became B, B became C and so on. It was very funny to 
read the reports ;-))
Comment by Tiago on February 9, 22:02

Not that incredible.

The difference between ASCII uppercase and lowercase codes 
is 32 (A=65, a=97):  32 == 00100000

A permanently "off" bit 5 (counting from 0) in a byte is 
guaranteed to transform a lowercase ASCII char to an 
uppercase char if it passes thru that byte.

What is more odd is that this hardware fault only 
manifested itself in this relatively benign fashion.  
Any/all string chars would be transformed by this 
byte, not to mention that potentially any pointer 
values would be rendered invalid and numerical values 
corrupted.

The fact that you didn't get other such spurious 
heisenbugs is what is bizarre to me.
Comment by Jolyon Smith [http://www.deltics.co.nz/blog] on February 9, 23:27

The craziest things can happen when memory gets
corrupted. Since the bits were changing randomly,
uppercase letters are just a little annoyance;
all kinds of data corruption could have happened.

You're lucky ;-)

Best regards,

Comment by Fabricio Araujo on February 10, 02:18

You should have had a RAM issue also while typing the 
last sentence, the one before the PS ;-)
Comment by Luigi D. Sandon on February 10, 08:04

Luigi, I added that sentence at the end and my "dyslexic" 
fingers made a good mess.

All others, yes we did get other errors. For example, 
when sending email the program would crash. I checked 
that code dozens of times, and tested it on other 
computers where it worked fine. Also, we had a couple of 
"tar.gz" files we couldn't reopen on a different 
computer. We thought of a file transmission problem, at 
first. You don't tend to think of bits changing in RAM 
as the actual problem.
Comment by Marco Cantu [http://www.marcocantu.com] on February 10, 08:19

It reminds me of a machine that would blue-screen 
every couple of days.

When installing the machine, I ran Memtest-86 on it, 
and it came out fine.

At first, I thought it was a driver problem, so a 
reinstall with only WHQL drivers was the first step.

The problem frequency went down to every couple of 

Further research showed that it would occur more 
frequently with more memory in use. It took a couple 
of months to find that pattern.

So finally, after about 6 months, I ran the Memtest-86 
again: one of the memory modules turned out to be bad: 
1 bit in 1 byte was wrong.

Comment by Jeroen Pluimers [http://wiert.wordpress.com] on February 10, 08:38

We had weird situations with Firebird databases
getting corrupted. While in the VERY early ages of
Firebird it might have been engine bugs, nowadays it
has nearly always been hardware failures, and not
only disk failures, RAM being a very sensitive
part in these environments. I know various places where
replacing the RAM stopped the database corruption. ;-)
Comment by Thomas Steinmaurer [http://www.upscene.com] on February 10, 09:07
