Skip to content
This repository has been archived by the owner on Jan 15, 2021. It is now read-only.

ThaliCore: Data corruption causes SSL to crash #1963

Open
lesn1kk opened this issue Sep 22, 2017 · 8 comments
Open

ThaliCore: Data corruption causes SSL to crash #1963

lesn1kk opened this issue Sep 22, 2017 · 8 comments
Assignees

Comments

@lesn1kk
Copy link
Member

lesn1kk commented Sep 22, 2017

Reporter by Enrico:

It looks like a data corruption, probably caused by a bug in ThaliCore. When the replication starts, the SSL handshaking completes successfully and the first device starts sending the attachment to the second device in chucks of 16384 bytes each. The second device receives the first few chunks and decrypts them without errors. Then, usually it's the 5th chunk, a chunk arrives with a length of 16394 byes that is clearly wrong, since only 16384 were sent. The first 16032 bytes of the chunk are OK, but the last 352 bytes are different from the last 352 bytes sent, plus there are 8 extra bytes. At that point the SSL routine, that decrypts the chunk, returns an error. It doesn't look like it's a bug in SSL or in JXcore, I guess it's a bug in the transmission/reception of the chunk.

@enricogior
Copy link
Member

The first problem is caused by the buffer size used on the receiving side:
https://github.com/thaliproject/thali-ios/blob/master/ThaliCore/SocketConnection/VirtualSocket.swift#L32
The chunks send from the sender are 16384 bytes, setting the same size on the receiver makes the SSL error disappear.
After that fix, all received chunks have the expected size and SSL can successfully decrypt them.

There is still a problem with replicating large attachments: after the first 9 chunks (total of 144 KBytes) has been send and successfully received, the 10th chunk is sent but never received.
At the same time, on the receiver side, the MPCF notifies that the other peer has been lost.
The connection will finally timeout, but no other connection error is reported.

@enricogior
Copy link
Member

Upon further investigation, it turned out that the increased buffer size didn't "fix" the SSL error, it simply changed the way the underneath bug shows up.
I confirm it's not an SSL bug, it's a ThaliCore bug that causes a data corruption that in turn causes the SSL error or a silent failure or a timeout error.

On the sender side, the node layer calls VirtualSocket.writeDataToOutputStream passing the data to transmit to the other peer.
When the data is small, the transmission succeeds because all the data is passed and trasmitted at once.
When the data is a large file, the transmission requires multiple calls to
VirtualSocket.writeDataToOutputStream and it may happen that, after a few chunks of data have been written to the output stream, the stream internal buffer (managed by the MPCF) gets full and therefore
strongOuputStream.hasSpaceAvailable is false and we have to wait before writing more data.

In that case, the chunk is appended to pendingDataToWrite, but pendingDataToWrite
is declared as nullable and never allocated, so the current chunk is actually lost and never transmitted.
Depending on how many chunks of data get lost, the receiving peer may end up with the SSL error when trying to decrypt a corrupted block of data

@enricogior
Copy link
Member

enricogior commented Sep 25, 2017

Fixing the pendingDataToWrite bug was not enough. There are other bugs still causing data corruption.
VirtualSocket.writeDataToOutputStream doesn't check if there is pendingDataToWrite data before writing the input data, so we may end up transmitted all the chunks but not in the correct order.
There is also thread unsafe code.

@enricogior
Copy link
Member

I've implemented a fix that has improved things but not fixed all the issues.
Now attachments can be successfully replicated between devices at reported speed between ~130 MBPS and ~300 MBPS, but there is still a bug that from time to time causes a data corruption and the replication to fail.
I have an idea of what it may be but not yet confirmed.

@yaronyg
Copy link
Member

yaronyg commented Sep 26, 2017

Something smells very wrong about those numbers. By MBPS do you mean Megabytes or Megabits? Because even on Wifi the iPhone shouldn't get much about 250 Mbps (as in bits, not Bytes). And on MPCF I wouldn't personally expect much more than 10 Mbps (although that is just a theory, I don't know the actual limit). But I would be worried about any tests showing 130 MBPS - 300 MBPS.

@enricogior
Copy link
Member

@yaronyg those are the numbers shown by ThaliTestApp:
https://github.com/thaliproject/ThaliTestApp/blob/master_mlesnic_additional_attachments/www/js/index.js#L97-L102
The calculation may be buggy but I assume it is consistent when comparing transfer speed (in bytes per second) of Native mode vs. WiFi mode, and the numbers are similar. There was the suspicion that the MPCF was using BT instead of WiFi direct, but it doesn't seem that is the case.

@lesn1kk
Copy link
Member Author

lesn1kk commented Sep 27, 2017

I provided fix for calculating transfer rate. Also, keep in my mind that this is based on each phone's timestamps, so you need to make sure their clocks are synchronised, otherwise you could even get negative numbers.

@enricogior
Copy link
Member

There was a serious bug in ThaliCore that has been fixed in this branch
https://github.com/thaliproject/thali-ios/commits/enrico-1963

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants