Gosync is a library inspired by zsync and rsync. The intent is that it's easier to build upon than the zsync/rsync codebases. By writing it in Go, it's easier to create in a way that's cross-platform, can take advantage of multiple CPUs, and comes with built-in benchmarks, code documentation and unit tests.

There are many areas that benefit from the use of multiple threads & connections:

- Making use of multiple HTTP connections, to avoid the bandwidth-latency product limiting transfer rates to remote servers.
- Making use of multiple CPUs while doing the comparisons.

Gosync includes a command-line tool that's intended as a starting point for building the functionality that you want. Zsync modified rsync to allow it to be used against a "dumb" HTTP server, but we can go further:

- Arrange files by hashed path and checksum: if a file hasn't changed, you can serve the existing version (NB: this works well with s3 sync).
- Split the checksum blocks from the data: serve checksum blocks securely over HTTPS, while allowing caching of the data over HTTP.
- Split up the data files: improve the fallback when there's a non-HTTP 1.1 proxy between the client and server.

The command-line tools are fleshed out enough for testing comparison behaviour and profiling it against real input files. NB: The command-line tools are not in a state for use in production!

There is a basic HTTP Blocksource, which currently supports fixed-size blocks (no compression on the source), but which should be able to use multiple TCP connections to increase transfer speed where latency is a bottleneck.

Work needs to be done to add support for a BlockSourceResolver that can deal with compressed blocks. To optimize transmitted size, blocks should be gzipped. Some changes and refactoring need to happen where there are assumptions about a source block being of 'blocksize', in order to gzip source blocks (this involves writing out a version of the file that's compressed in block increments).

After that, there needs to be some cleanup of the CLI command code, which is pretty verbose and duplicates a lot. Being able to support multiple source files would also be good, as would version numbers in the files, followed by the implementation of more features to make it potentially usable.

Some performance numbers: on an 8 MB file with few matches, I'm hitting about 16 MB/s with 4 threads. I think that we're mostly CPU bound, and should scale reasonably well with more processors. The budget for an 8 MB/s byte-by-byte comparison on a single thread is 120 ns per byte. When there are very similar files, the speed is far higher (since the weak checksum match is significantly cheaper). The 32-bit Rollsum hash produces far fewer false positives than the 16-bit one, with the same 4-byte hash overhead.

Generating a gosync file for a 22 GB file (not on an SSD) took me around 2m31s ~= 145 MB/s sustained checksum generation. The resulting file was around 50 MB and does not compress well (which makes sense, since it's hashes with a hopefully near-random distribution). Checksum generation easily hits 100 MB/s on my workstation, satisfying the idea that you should be able to build 12 GB payloads in ~1m, and is more likely to be bottlenecked by disk throughput / seek time than the CPU.