April 19, 2007

A Tale of Indy Sockets and a Two Characters Terminator

I've had a very bad experience with a 2 char separator in a stream-based socket. Here is the story.

There is a reason I haven't blogged over the last two days. Besides having to prepare my Delphi 2007 seminar (taking place tomorrow morning), I had to fix a very nasty bug that took me almost 4 days to chase down.

With a company I work for and am a partner in, we've built a complex web-based architecture in which multiple server-side applications share the workload and communicate over sockets. For this reason, we've created an XML-based communication architecture: a request is sent in the form of an XML string, and a response is received in the form of XML data in a stream. This worked for years, whether running multiple server applications on the same server or distributing them over a few boxes transparently, as the inter-process communication is socket-based.

Now one of the programs built with this technology is my newsgroup front end, dev.newswhat.com. On this site, we experienced intermittent problems due to our choice of socket message terminator. We originally picked character #27, a symbol almost never used. And it is not used by itself, but it can be the second character of an escaped UTF-8 sequence. So with some newsgroup messages containing odd characters, the socket transmission would end too soon and the remaining portion of the data would be sent after the following request, losing synchronization.

That's why we decided to make things safer by going for a two-character (or two-byte) separator, simply doubling the special character we were using. Tests were quite successful and the newswhat.com site didn't run into the same problem any more. Great. Last week, however, we deployed a multi-server system and moved a couple of large applications to a second server. Or we tried to, as they would quite soon stop working and simply keep listening for more data on the sockets.

After a lot of debugging (on the actual Linux servers, as everything works locally) I found the problem. The Indy code responsible for the communication reads the incoming data in chunks and looks for the terminator characters in each buffer. However, if the two characters of the terminator are split across two buffers, it will keep reading and never get to the end of the stream. We were using Indy 9 (latest available version), so the problem might have been fixed in version 10. The method in question is:

function TIdTCPConnection.ReadLn(ATerminator: string = LF;
  const ATimeout: Integer = IdTimeoutDefault; AMaxLineLength: Integer = -1): string;

and this is the line that looks for the terminator in the current buffer (after the LSize position, already scanned in the past) and can be fooled by a multi-byte terminator:

LTermPos := MemoryPos(ATerminator, PChar(InputBuffer.Memory) + LSize, LInputBufferSize - LSize);

Now the odd thing is that if the two programs are on the same computer, the buffer is filled very fast and a split terminator is unlikely. When the program runs remotely, the buffer read at each iteration is much smaller and (reading thousands of documents as we do) there is a much higher chance of an error.
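
To make the failure mode concrete, here is a minimal console sketch of it. This is not Indy's code: FindFrom is a hypothetical stand-in for MemoryPos, the buffer is a plain AnsiString, and Scanned plays the role of LSize. It assumes the first read ends exactly on the first terminator byte.

```pascal
program SplitTerminatorDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper (not part of Indy): 1-based position of Pattern in
// Data at or after StartAt, or 0 if it is not found.
function FindFrom(const Pattern, Data: AnsiString; StartAt: Integer): Integer;
var
  I, J: Integer;
  Match: Boolean;
begin
  Result := 0;
  if (Pattern = '') or (StartAt < 1) then
    Exit;
  for I := StartAt to Length(Data) - Length(Pattern) + 1 do
  begin
    Match := True;
    for J := 1 to Length(Pattern) do
      if Data[I + J - 1] <> Pattern[J] then
      begin
        Match := False;
        Break;
      end;
    if Match then
    begin
      Result := I;
      Exit;
    end;
  end;
end;

const
  Terminator: AnsiString = #27#27;
var
  Buffer: AnsiString;  // everything received so far
  Scanned: Integer;    // bytes already searched (LSize in the post)
  StartAt: Integer;
begin
  // First read: the chunk ends with only the first terminator byte.
  Buffer := '<doc/>'#27;
  WriteLn('after chunk 1: ', FindFrom(Terminator, Buffer, 1));
  Scanned := Length(Buffer);

  // Second read delivers the second terminator byte.
  Buffer := Buffer + #27;

  // Naive resume: start right after the bytes already scanned.
  WriteLn('naive resume:  ', FindFrom(Terminator, Buffer, Scanned + 1));

  // Safer resume: back up by Length(Terminator) - 1 bytes first.
  StartAt := Scanned + 1 - (Length(Terminator) - 1);
  if StartAt < 1 then
    StartAt := 1;
  WriteLn('with overlap:  ', FindFrom(Terminator, Buffer, StartAt));
end.
```

The three lines print 0, 0 and 7: the terminator is complete in the buffer after the second read, but a search that resumes past the scanned bytes never sees its first half.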





 

6 Comments

So no matter what terminator you choose, you might run
into problems later. So why use a terminator? Send the
length of the string first. (I guess you may end up
having to buffer some data, though).
Comment by Nicholas Sherlock [http://www.sherlocksoftware.org] on April 19, 15:00
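
The length-prefix idea in the comment above can be sketched roughly as follows. The frame layout (a 4-byte big-endian size followed by the payload) and the names EncodeFrame and TryExtractFrame are assumptions made for this illustration, not something taken from the post or from Indy. Messages are assumed to be well below 2 GB, and the receiver buffers data until a whole frame has arrived.

```pascal
program LengthPrefixDemo;
{$APPTYPE CONSOLE}

// Hypothetical frame layout: 4-byte big-endian payload size, then payload.
function EncodeFrame(const Payload: AnsiString): AnsiString;
var
  N: Integer;
begin
  N := Length(Payload);
  SetLength(Result, 4);
  Result[1] := AnsiChar((N shr 24) and $FF);
  Result[2] := AnsiChar((N shr 16) and $FF);
  Result[3] := AnsiChar((N shr 8) and $FF);
  Result[4] := AnsiChar(N and $FF);
  Result := Result + Payload;
end;

// Tries to pull one complete message off the front of Buffer. Returns False
// and leaves Buffer untouched while the header or payload is incomplete.
function TryExtractFrame(var Buffer: AnsiString; out Payload: AnsiString): Boolean;
var
  N: Integer;
begin
  Result := False;
  Payload := '';
  if Length(Buffer) < 4 then
    Exit;  // the size header itself may arrive fragmented
  N := (Ord(Buffer[1]) shl 24) or (Ord(Buffer[2]) shl 16) or
       (Ord(Buffer[3]) shl 8) or Ord(Buffer[4]);
  if (N < 0) or (Length(Buffer) - 4 < N) then
    Exit;  // payload not fully received yet (or size out of range)
  Payload := Copy(Buffer, 5, N);
  Delete(Buffer, 1, 4 + N);
  Result := True;
end;

var
  Wire, Buffer, Msg: AnsiString;
  I: Integer;
begin
  // The first payload contains the old terminator bytes; that is harmless here.
  Wire := EncodeFrame('<a>'#27#27'</a>') + EncodeFrame('<b/>');
  Buffer := '';
  // Deliver the stream in 3-byte pieces to mimic small remote reads.
  I := 1;
  while I <= Length(Wire) do
  begin
    Buffer := Buffer + Copy(Wire, I, 3);
    Inc(I, 3);
    while TryExtractFrame(Buffer, Msg) do
      WriteLn('message of ', Length(Msg), ' bytes');
  end;
end.
```

Fed in 3-byte pieces, the sketch reports one message of 9 bytes and one of 4 bytes, regardless of where the chunk boundaries fall.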

.."And it is not used by itself, but it can be the
second character of an escaped UTF-8 sequence."

??? I thought UTF-8 bytes in multi-byte sequences
always had the top bit set!

Also, is your #27 dec, hex or octal?

- Roddy
Comment by Roddy Pratt [] on April 19, 15:03

Using terminators is not reliable for sure.
The best solution should be

 [PREFIX] [SIZE]  [DATA...DATA]

Comment by Petar H [http://bigspeed.net] on April 19, 17:58

 Marco,

A possible solution to that problem is to read an
extra character at the end of buffer. 
More generally, if the buffer is B bytes long and the
terminating string is T bytes long, then always read B
+ (T - 1) bytes at a time, then search for the whole
string.
The next buffer you read will be read starting at the
previous address + B, meaning you will actually read
(T-1) bytes twice for each buffer you read.

But this will never miss a terminating string, and the
logic is quite simple to implement and maintain...

Philippe


Comment by filofel on April 20, 12:33

Meh.

This is TCP newb goof #1 (the Packet Fallacy), also 
memorialized in The Lame List as #20:

http://tangentsoft.net/wskfaq/articles/lame-list.html

You are *required* to assemble the received stream 
before parsing it.  Always has been, always will be.  
This is just as true when using a leading message 
length header, which can easily arrive fragmented.

Using a delimited message isn't all bad but multi-
octet delimiters solve nothing.  When using a message 
delimiter protocol you are supposed to escape any 
occurrences of the delimiter octet value that may 
occur in the payload.

Character encoding is an irrelevant abstraction at 
the message framing level.  The TCP stream is defined 
as a stream of octets and you impose a structure upon 
it at the application level.

These are simply the facts of life.
Comment by VBer and proud of it on April 22, 16:56
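
The escaping approach described in the comment above might look something like this. The escape byte (#16) and the two-byte replacement codes are arbitrary choices for the sketch, and EscapePayload and UnescapePayload are hypothetical names. The point is only that, once escaped, the payload can no longer contain the delimiter byte, so a single-byte delimiter is safe and cannot be split across reads.

```pascal
program EscapeDemo;
{$APPTYPE CONSOLE}

const
  Delim: AnsiChar = #27;  // frame delimiter, as in the original post
  Esc: AnsiChar = #16;    // escape byte (an arbitrary choice for this sketch)

// Replaces every delimiter byte with Esc+'D' and every escape byte with Esc+'E'.
function EscapePayload(const Payload: AnsiString): AnsiString;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(Payload) do
    if Payload[I] = Delim then
      Result := Result + Esc + 'D'
    else if Payload[I] = Esc then
      Result := Result + Esc + 'E'
    else
      Result := Result + Payload[I];
end;

// Reverses EscapePayload on the receiving side.
function UnescapePayload(const Encoded: AnsiString): AnsiString;
var
  I: Integer;
begin
  Result := '';
  I := 1;
  while I <= Length(Encoded) do
  begin
    if (Encoded[I] = Esc) and (I < Length(Encoded)) then
    begin
      Inc(I);  // look at the code that follows the escape byte
      if Encoded[I] = 'D' then
        Result := Result + Delim
      else
        Result := Result + Esc;
    end
    else
      Result := Result + Encoded[I];
    Inc(I);
  end;
end;

var
  Original, Encoded: AnsiString;
  I, Hits: Integer;
begin
  Original := 'caf'#27'e'#16'!';
  Encoded := EscapePayload(Original);
  // Count delimiter bytes that survived escaping (there should be none).
  Hits := 0;
  for I := 1 to Length(Encoded) do
    if Encoded[I] = Delim then
      Inc(Hits);
  WriteLn('delimiter bytes left in payload: ', Hits);
  WriteLn('round trip intact: ', UnescapePayload(Encoded) = Original);
end.
```

The sender would then transmit EscapePayload(Message) followed by a single Delim byte, and the receiver would unescape everything up to the first Delim it sees.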

I encountered a similar problem with Indy 9, which I
solved by encoding the transferred text with
MIME64 ... as this makes the string longer, I first
compress it using Zlib.

So ...

1. First, compress the string with Zlib.

2. Encode the compressed string with MIME64.

The encoded string ( Zlib+MIME64 ) is usually not
longer than the original string ( in my particular
case, it is very often shorter ), and you can use a
single character terminator without worry.
Comment by Lluis Olle on August 18, 22:12

