When you want to copy files from one machine to another, you might
think about using scp to copy them. You might think about using
rsync. If, however, you’re trying to copy a large amount of data
between two machines, here’s a better, quicker way to do it: netcat.
On the receiving machine, run:
# cd /dest/dir && nc -l -p 12345 | tar -xf -
On the sending machine you can now run:
# cd /src/dir && tar -cf - . | nc -q 0 remote-server 12345
You should find that everything works nicely, and a lot quicker. If
bandwidth is more constrained than CPU, you can add “z” or “j” to the
tar options (“tar -czf -” on the sending side, “tar -xzf -” on the
receiving side, and so on) to compress the data before it goes over the
network; see the example below. If you’re on gigabit, I wouldn’t bother
with the compression. If the transfer dies, you’ll have to start from
the beginning, but then you might find you can get away with using
rsync if you’ve copied enough. It’s also worth pointing out that the
receiving netcat will die as soon as the connection closes, so you’ll
need to restart it if you want to copy the data again using this
method.
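For example, a gzip-compressed variant of the above would look something like this (a sketch; the directory names and port are the same placeholders as before):
On the receiving machine:
# cd /dest/dir && nc -l -p 12345 | tar -xzf -
On the sending machine:
# cd /src/dir && tar -czf - . | nc -q 0 remote-server 12345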
It’s worth pointing out that this does not have the security that scp or
rsync-over-ssh has, so make sure you trust the end points and everything
in between if you don’t want anyone else to see the data.
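If you do want encryption in transit, one alternative (a sketch, and slower than plain netcat) is to run the same tar pipe over ssh instead:
# cd /src/dir && tar -cf - . | ssh remote-server "cd /dest/dir && tar -xf -"
You keep the single-stream behaviour, but pay the usual ssh overhead.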
Why not use scp? Because it’s incredibly slow in comparison. God knows
what scp is doing, but it doesn’t copy data at wire speed. It isn’t the
encryption and decryption, because that would just use CPU, and when
I’ve done this it hasn’t been CPU bound. I can only assume that the scp
process has a lot of handshaking and ssh protocol overhead.
Why not rsync? Rsync doesn’t really buy you that much on the first copy.
It’s only the subsequent runs where rsync really shines. However, rsync
requires the source to send a complete file list to the destination
before it starts copying any data. If you’ve got a filesystem with a
large number of files, that’s an awfully large overhead, especially as
the destination host has to hold it in memory.
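If you do end up topping up an interrupted netcat copy with rsync, a sketch of the follow-up run (the paths are placeholders; -a and --partial are standard rsync options) would be:
# rsync -a --partial /src/dir/ remote-server:/dest/dir/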
on said:
On the sending side it would be
… tar -cf - …
instead of
… tar -xf - …
(copy-past-o?)
Regards and thanks for the tip
— t
on said:
What about error resilience? If you apply gzip compression, there will at least be some data integrity check, right? I don’t know how big the risk is that data gets damaged in transport, but scp at least has the data integrity guarantees of ssh communication.
Perhaps this is not important for your application?
on said:
Instead of netcat you could use the excellent mbuffer tool. Small and neat, and very useful when read and write speeds at either end vary, as well as when the transfer speed itself fluctuates.
Also, I found that on some systems netcat seems to level off at 150-200 MB/s; mbuffer seems to be able to go far beyond that.
http://www.maier-komor.de/mbuffer.html
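A sketch of the same tar pipe using mbuffer in place of netcat, assuming mbuffer’s -I (listen on a port) and -O (connect to host:port) options:
On the receiving machine:
# cd /dest/dir && mbuffer -I 12345 | tar -xf -
On the sending machine:
# cd /src/dir && tar -cf - . | mbuffer -O remote-server:12345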
on said:
TCP will provide some error resilience, but I agree with @ulrik, using -z would detect more errors.
Also, GNU tar has a -C flag so you don’t have to cd.
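For example, a sketch of the same transfer using GNU tar’s -C instead of cd:
On the receiving machine:
# nc -l -p 12345 | tar -xf - -C /dest/dir
On the sending machine:
# tar -cf - -C /src/dir . | nc -q 0 remote-server 12345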
on said:
I have noticed that NFS is faster than scp (by about a factor of 4 or 5 on my setup). Maybe there is a bug in ssh that makes it slow.
on said:
scp has encryption overhead and usually does not compress, so try it with the -C flag.
Also check out pv, which can visualise progress nicely.
You can restart the transfer if you first create the tarball and then use dd with seek/skip accordingly.
Finally, check out the sendfile package.
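A sketch of the pv and dd suggestions (pv is assumed to be installed; the tarball name and the 4096 MiB offset are made-up examples):
On the sending machine, with a progress display:
# cd /src/dir && tar -cf - . | pv | nc -q 0 remote-server 12345
To resume a pre-built tarball from a known offset, on the receiving machine:
# nc -l -p 12345 | dd of=backup.tar bs=1M seek=4096 conv=notrunc
and on the sending machine:
# dd if=backup.tar bs=1M skip=4096 | nc -q 0 remote-server 12345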
on said:
@ulrik: to provide resilience, just send the file across 5 times 😛
I find it remarkable that tarpipe+ssh is usually much faster than scp.
on said:
Same idea, only with ssh. I believe the slowness you’re experiencing with scp is negotiation before and after every file. Piping with tar eliminates the lag, as the network only sees one big stream.
——-
from http://www.mcnabbs.org/andrew/linux/shellhacks/
(cd /orig/dir; tar cf - .) | (cd /targ/dir; tar xpf -)
Copying with a cp -R will mangle stuff sometimes. If you want to completely copy a tree, this command is the best way to do it. Additional tar options, like --same-owner, will let you keep things even more intact. Another way to do this is: “tar cfC - /orig/dir . | tar xpfC - /targ/dir”. They should both do exactly the same thing, though the original example is perhaps a little easier to remember.
tar cf - . | ssh remotehost "cd /targ/dir; tar xf -"
This is logically exactly the same as the tar pipe above, except that this time ssh allows our pipe to cross system boundaries.
on said:
I like jimcooncat’s suggestion for two reasons: (1) tar handles everything (permissions, symlinks, hardlinks, device nodes, everything; although I’m suddenly not sure about extended attributes and ACLs) and (2) it’s mostly the same for local and remote copies, just put ssh and quotes around one of the pipeline elements.
I like to use && to separate the cd from the tar, just to make sure that if you mistype the directory name, the command will fail instead of copying the wrong thing into the wrong place.
on said:
I’d almost always compress data before sending it over the wire. If gzip is too slow, use lzop. It is designed for speed. As noted by another commenter, you can also gain the security of ssh by piping tar through ssh.
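A sketch with lzop in the pipe (assuming lzop is installed on both ends and used as a stdin/stdout filter, much like gzip):
On the receiving machine:
# cd /dest/dir && nc -l -p 12345 | lzop -d | tar -xf -
On the sending machine:
# cd /src/dir && tar -cf - . | lzop | nc -q 0 remote-server 12345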
on said:
Quoting rsync(1):
Beginning with rsync 3.0.0, the recursive algorithm used is now an incremental scan that uses much less memory than before and begins the transfer after the scanning of the first few directories have been completed.
on said:
Actually, the performance problems you’re seeing when copying large batches of data over SCP are almost certainly caused by the way that OpenSSH manages send and receive buffers. Static buffers are allocated because that reduces code complexity, which makes OpenSSH more resilient against odd conditions and reduces the number of corner cases that could be used to attack its security.
There is a set of patches maintained against OpenSSH called hpn-ssh which adds code to use dynamic buffer sizes (as well as a cipher=no option, which essentially sends everything in cleartext for a minor but measurable speed gain).
The latest version of the patch set, hpn13, adds the MT-AES-CTR cipher mode, which is a multithreaded AES implementation. On a fast multicore machine, this cipher should let the sender keep up with even a very fast network link.
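Even without the hpn-ssh patches, it can be worth experimenting with a cheaper cipher in stock OpenSSH (a sketch; “bigfile” is a placeholder and the available ciphers depend on your OpenSSH version):
# scp -c aes128-ctr bigfile remote-server:/dest/dir/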
on said:
@Jon, laughing my head off… 🙂
on said:
Thanks to Carsten for pointing out mbuffer.
on said:
Well, netcat seems to be a good option. The only concern is data security of the kind rsync or scp provides, so it is very necessary that both the sending and receiving ends are trustworthy. scp is dead slow for copying large amounts of data, but right now we are using it because of its good data security.