How can I efficiently generate and validate file checksums?

2638
Aaron Rubinstein

I would like to be able to collect and verify checksums for large-scale collections of files, typically nested within a complex directory hierarchy.

Does every file need a checksum? Are there ways to take advantage of the existing directory structure to, say, validate only a node in the file tree rather than necessarily every file within it?

12
As noted in the answers, it is important to distinguish the kinds of threats you are mitigating and to checksum accordingly. [A previous answer I wrote on the Libraries & Information Science Stack Exchange](http://libraries.stackexchange.com/a/615/438) may be of interest, although it is mostly about HDFS. Andy Jackson 11 years ago 0

6 answers to the question

13
db48x

The most efficient way to use checksums is to make the computer do it all. Use a filesystem such as ZFS which checksums (actually it uses hashes, which are stronger than a checksum) all data when it's written, and verifies them every time the data is read. Of course, the downside is that ZFS doesn't know when deleting or overwriting a file is a mistake and when it's normal operation, but because ZFS uses copy-on-write semantics for everything, you can use its snapshotting feature to mitigate the risk.

ZFS can also automatically restore data that fails a hash check by using any redundancy you've set up, whether raid5-style parity, drive mirrors or duplicate copies (add the copies=N property to any ZFS filesystem and it'll store N copies of any data you write). It also stores the hashes in a Merkle tree, where the hash value of a file depends on the hashes of the blocks, the hash of a directory entry depends on the hash values of the files and directories it contains, the hash of a filesystem depends on the hash of the root directory, etc.
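To illustrate that Merkle-tree idea outside of ZFS, here is a minimal Python sketch that hashes each file and derives each directory's hash from the hashes of its entries, so a single value at any node covers the whole subtree beneath it. The paths and function names are placeholders, not part of any existing tool.

```python
import hashlib
from pathlib import Path

def hash_file(path: Path, algo: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash one file's contents in fixed-size chunks to keep memory use flat."""
    h = hashlib.new(algo)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def hash_tree(node: Path, algo: str = "sha256") -> str:
    """Merkle-style directory hash: combine the sorted names and hashes of the
    entries, so verifying one directory value implicitly checks its subtree."""
    if node.is_file():
        return hash_file(node, algo)
    h = hashlib.new(algo)
    for entry in sorted(node.iterdir(), key=lambda p: p.name):
        h.update(entry.name.encode("utf-8") + b"\0" + hash_tree(entry, algo).encode("ascii"))
    return h.hexdigest()

print(hash_tree(Path("/path/to/collection")))  # hypothetical path
```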

Regardless of what solution you end up with, you'll invariably find that the process is limited by the speed of your disks, not by the speed of your CPU.

Also, don't forget to take into account the BER of your disks. They are, after all, mere plates of spinning rust. A consumer-level drive has an error rate of 1 incorrectly-read bit for every 10^14 bits read, which works out to 1 bit out of every 11 terabytes you read. If you have an 11 terabyte data set and you compute the hash of every file in it, you will have computed one of those checksums incorrectly and permanently damaged one block of one of the files in the data set. ZFS, however, knows the hash of every block it wrote to every disk in your pool, and therefore knows which block was lost. It can then use the redundancy (parity, mirrors or extra copies) in your pool to rewrite the data in that block with the correct values. These safety features also apply when you use zfs send or receive to copy data from your primary system to the backups.
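Spelled out, the arithmetic behind that 11-terabyte figure looks like this (the 10^14-bit error rate is the assumption stated above):

```python
# Rough unrecoverable-read-error arithmetic for a consumer drive,
# assuming the 1-bad-bit-per-1e14-bits-read rate quoted above.
bits_per_error = 1e14
bytes_per_error = bits_per_error / 8        # 1.25e13 bytes
tib_per_error = bytes_per_error / 2**40     # ~11.4 TiB read per expected bad bit
print(f"~{tib_per_error:.1f} TiB read per expected incorrectly-read bit")
```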

Ben brings up a good point in the comments however. ZFS doesn't expose any of the hash values that it computes to the user, so data that enters or leaves a ZFS system should be accompanied by hashes. I like the way the Internet Archive does this with an xml file that accompanies every item in the archive. See https://ia801605.us.archive.org/13/items/fakebook_the-firehouse-jazz-band-fake-book/fakebook_the-firehouse-jazz-band-fake-book_files.xml as an example.
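For a rough idea of what such a sidecar manifest could look like, here is a Python sketch that writes per-file sizes and hashes to an XML file. It is only loosely modelled on the Internet Archive's `_files.xml`; the element names and paths are placeholders, not the Archive's actual schema.

```python
import hashlib
from pathlib import Path
from xml.etree import ElementTree as ET

def write_manifest(root: Path, out_file: Path) -> None:
    """Write an XML sidecar listing every file with its size, MD5 and SHA-1."""
    files_el = ET.Element("files")
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        data = path.read_bytes()  # fine for modest files; stream for very large ones
        f_el = ET.SubElement(files_el, "file", name=str(path.relative_to(root)))
        ET.SubElement(f_el, "size").text = str(len(data))
        ET.SubElement(f_el, "md5").text = hashlib.md5(data).hexdigest()
        ET.SubElement(f_el, "sha1").text = hashlib.sha1(data).hexdigest()
    ET.ElementTree(files_el).write(out_file, encoding="utf-8", xml_declaration=True)

write_manifest(Path("/path/to/item"), Path("/path/to/item_files.xml"))  # hypothetical paths
```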

You beat me to it. I was also going to suggest a hash-based system. Hash each file, hash the file hashes (plus the subdirectory hashes) for a directory hash, and so on. It's a trade-off between CPU/IO and the probability of error. A checksum/CRC is cheap, but the probability of error grows with scale. So do regular hashes, which start with a much lower probability of error. The Diamond Z 11 years ago 1
Even if you run a filesystem such as ZFS (Btrfs also has similar functionality, but it is still under heavy development and is not currently considered ready for production use), you will need to perform a periodic scrub operation to ensure that the data is read and verified against the checksums or hashes. Just computing checksums and then doing nothing with them until you need to access the data is potentially worse than useless. a CVn 11 years ago 3
Running a user-space md5sum over roughly 150 GB of data on my home PC took about 40 minutes of wall-clock time, purely I/O-bound. Scaling that up 100-fold, that's 15 TB checked in under three days *on consumer hardware.* I would certainly consider that feasible even for a large archive, with a properly chosen interval. a CVn 11 years ago 1
@BenFino-Radin ZFS has the advantage that if a file is corrupted in storage, it cannot be *read* from the filesystem at all. Errors are also logged. So with periodic review of the system logs, plus periodic scrubs, if you can read a file you can be confident it has not changed *outside of the filesystem's intended mode of use*. **Of course**, this does not protect against malicious or buggy software corrupting file contents through the facilities provided by the system. *For that* you need some kind of separate checksumming or hashing. I would say the two go together. a CVn 11 years ago 1
Yes, that's a good point. My last scrub repaired 2 kilobytes of corrupted data. That's four blocks scattered across five disks! The longer you go between reads of a particular piece of data, the higher the chance that you accumulate enough errors in a single file that it cannot be recovered. 11 years ago 1
Right, that's an excellent point @MichaelKjörling. I didn't really go into how I define efficiency. It certainly could become an obstacle if you want to verify files frequently, say several times a week, once you start creeping toward the 10 TB scale. 11 years ago 0
I'm with @BenFino-Radin here regarding ZFS. It is certainly convenient, and I can see it being useful in combination with more portable preservation solutions, but depending on it alone ties you to one filesystem, and that definitely makes me uncomfortable. 11 years ago 0
ZFS computes checksums for blocks, not files or bitstreams, doesn't it? While ZFS solves the computation problem, it would seem to be less human-auditable, and it doesn't produce fixity data that is portable independently of the filesystem, something that is essential for archives. 11 years ago 3
6
Danubian Sailor

I would generate a checksum for each file. Checksums are very small, and generating a checksum for a whole directory would require you to process every file anyway (unless you mean a directory checksum made only from the directory entries; I would make those as well, to ensure no data is deleted).

Assume you have one checksum for the whole archive. You know the data is corrupted, but you don't know whether it is only one file and, more importantly, which one. Having separate checksums gives you more flexibility: you can detect the single file that is corrupted and replace it with the copy from another backup (which can, in turn, have a different file corrupted).

In that way your data is more likely to survive.
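A minimal sketch of that per-file approach in Python: a manifest with one "<sha256>  <relative path>" line per file (the same shape as sha256sum output), generated once and then re-verified so that a failure points at the exact file to restore. The paths and manifest format here are illustrative, not a standard.

```python
import hashlib
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Hash one file in chunks so memory use stays flat for large files."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def generate_manifest(root: Path, manifest: Path) -> None:
    """One line per file: '<sha256>  <relative path>'."""
    with manifest.open("w", encoding="utf-8") as out:
        for path in sorted(p for p in root.rglob("*") if p.is_file()):
            out.write(f"{file_sha256(path)}  {path.relative_to(root)}\n")

def verify_manifest(root: Path, manifest: Path) -> list:
    """Return the relative paths whose current hash no longer matches the manifest."""
    bad = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        expected, rel = line.split("  ", 1)
        target = root / rel
        if not target.is_file() or file_sha256(target) != expected:
            bad.append(rel)
    return bad
```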

That certainly makes sense. I'm just wondering what strategies exist for handling the computationally expensive feat of generating and verifying hundreds of thousands of checksums. 11 years ago 0
4
Christian Pietsch

Maybe this is a good time to bring up BagIt. This is a very simple yet powerful file packaging format intended for archiving, long term preservation, and transfer of digital objects. Users include the Library of Congress and the California Digital Library.

A BagIt tool (they exist in several programming languages) puts your files into a certain directory structure and does the checksumming/hashing for you. That is all.

PS: Of course, BagIt tools can also verify bags against the included checksums/hashes, and you can add some metadata to bags. But that's as complex as bags get.
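For example, with the bagit-python library maintained by the Library of Congress, creating and later validating a bag looks roughly like this (the directory path and metadata are placeholders; check the library's documentation for the full set of options):

```python
import bagit  # pip install bagit

# Turn an existing directory into a bag: the files move into data/ and
# per-file checksum manifests are written alongside them.
bag = bagit.make_bag("/path/to/collection",          # hypothetical directory
                     {"Contact-Name": "Jane Doe"},   # optional bag-info metadata
                     checksums=["sha256"])

# Later: reopen the bag and verify every payload file against its manifests.
bag = bagit.Bag("/path/to/collection")
bag.validate()  # raises bagit.BagValidationError on any checksum mismatch
```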

1
a CVn

This answer is a combination of that of @lechlukasz and @db48x, also incorporating some points made in comments as well as some of my own thoughts.

The simple path forward is a combined file-system and separate-metadata approach.

By using a file system that does on-the-fly data hashing and validation, such as ZFS or Btrfs (do note that although great advances have been made, Btrfs is not considered ready for production use at this time), you can be reasonably sure that if the data can be read off the disk without the operating system erroring out, then the data read was written to disk in the way intended by the file system. By running periodic "scrub" operations, all data is read and verified against the file system's idea of what it should be.

However, that only protects against on-disk corruption (unreadable blocks, outright hardware write errors, invalid writes that corrupt parts of the data directly on the block device, etc.). It does not protect against a software bug, incorrect user operation, or malicious software which works through the intended operating system facilities for working with files, assuming that those facilities are free of such bugs.

To protect against the latter, you need another layer of protection. Checksumming or hashing data from a user application's perspective will help protect against many of the above-mentioned risks, but needs to be performed separately (either as a built-in process action in the software, or as a completely separate process).

With today's hardware and what's practical for storing large amounts of data (spinning platter hard disks as opposed to solid-state disks/SSDs), even complex hashing algorithms such as SHA1 will be largely I/O-bound -- that is, the speed at which the data is hashed will be a function of the storage system's read speed, rather than the ability of the computer's processor to calculate the hash. I did an experiment with running a user-space MD5 hashing process over approximately 150 GB of data on what in 2012 was a mid-tier consumer PC, and it finished after exercising the disk basically without interruption for about 40 minutes. Scaling those figures up 100-fold, you'd get the MD5 hashes of a 15 TB collection in about three days' time on that same hardware. By increasing the read transfer rate (which can be accomplished e.g. using RAID; RAID 0, for example, is striping without redundancy, commonly used to achieve higher read/write performance, possibly in combination with RAID 1 to form RAID 10), the time to completion can be lowered for the same amount of data.
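Spelling out that scaling estimate (the 150 GB and 40-minute figures come from the paragraph above; the throughput is the derived, approximate quantity):

```python
# ~150 GB hashed in ~40 minutes on a 2012 consumer PC (figures from above).
gb_hashed = 150
minutes = 40
throughput_mb_s = gb_hashed * 1000 / (minutes * 60)   # ~62 MB/s, I/O-bound
hours_for_15_tb = 15 * 1_000_000 / throughput_mb_s / 3600
print(f"~{throughput_mb_s:.0f} MB/s, ~{hours_for_15_tb / 24:.1f} days for 15 TB")
```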

By combining the two, you get the best of both worlds: the file system gives you assurance that what you received when reading the file is what was actually written, and a separate fixity-checking process can run over the entire collection ensuring that the data stored still matches what was ingested into the archive. Any inconsistency between the two (file system says the file is OK, fixity checking says it's not) will indicate a file that has been modified outside of the archive's intended mode of operation but from within the operating system's facilities, prompting a restore from a secondary copy (backup).

The fixity check can thus run at a longer time interval, which becomes essential for very large archives, but any online accesses are still guaranteed to not be corrupted on the hardware if the reads succeed.

In principle, the archive software could rely on the file system to report inconsistencies as read errors, and perform a separate fixity check in the background as the user is working with the file, displaying an appropriate message should that check indicate that the file does not match what was ingested into the archive. Using a block-hashing file system, such a scheme would have minimal impact on perceived performance while still providing assurance that the content is correct.

1
mjuarez

I've gone through the answers, and even though I like the idea of relying on ZFS to handle the data-layer errors, there's still the problem of the files getting changed, either by mistake or maliciously. ZFS won't protect you in that case, and like somebody else mentioned, it won't give you a user-viewable "hash" to store somewhere else for external validation.

There's a Linux application called Tripwire that was used extensively for monitoring system executables, to validate that they hadn't been changed after an attack. That project is apparently now abandoned, but there's a new one called AIDE (Advanced Intrusion Detection Environment), recommended over on ServerFault:

https://serverfault.com/questions/62539/tripwire-and-alternatives

Once installed, it runs every x minutes (user-configurable) and checks all the folders you specify for changes to the files. It needs to run once to calculate all the file hashes; after that, it checks all the hashes against the current files and makes sure they're still the same. You can specify which type of hash or combination of hashes to use (I wouldn't recommend anything weaker than SHA-256), which file attributes to include (contents, size, modified timestamps, etc.), the frequency at which it checks, how/where to store the hash database, and so on.
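As a simplified illustration of that baseline-then-compare idea (a Python sketch of the concept, not AIDE itself; the attribute set and JSON storage are arbitrary choices made for the example):

```python
import hashlib
import json
from pathlib import Path

def snapshot(root: Path) -> dict:
    """Record size, modification time and SHA-256 for every file under root."""
    db = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        st = path.stat()
        db[str(path.relative_to(root))] = {
            "size": st.st_size,
            "mtime": int(st.st_mtime),
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        }
    return db

root = Path("/path/to/archive")  # hypothetical path

# Step 1 (run once): store the baseline.
Path("baseline.json").write_text(json.dumps(snapshot(root)))

# Step 2 (run on a schedule, e.g. from cron): compare against the baseline.
baseline = json.loads(Path("baseline.json").read_text())
current = snapshot(root)
changed = [f for f, attrs in current.items() if baseline.get(f) != attrs]
missing = [f for f in baseline if f not in current]
print("changed:", changed, "missing:", missing)
```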

Some might consider this overkill, but depending on the OP's requirements, it might give him more peace of mind that the data he's storing will stay the same past a certain point in time.

0
John Lovejoy

The National Archives of Australia has developed [Checksum Checker](http://checksumchecker.sourceforge.net/), which is freely available under the GPLv3.

It reads a checksum and algorithm from a database, then recalculates the checksum for the file, compares the two values and reports if there is an error. It supports MD5, SHA1, SHA2, SHA256 and SHA512 algorithms.
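That database-driven workflow, sketched in Python (not the tool's actual code; the SQLite table layout here is an assumption made up for the example):

```python
import hashlib
import sqlite3
from pathlib import Path

# Assumed schema: checksums(path TEXT, algorithm TEXT, digest TEXT)
def check_against_db(db_path: str, root: Path) -> list:
    """Recompute each file's digest using the algorithm recorded in the
    database and return the paths whose values no longer match."""
    errors = []
    con = sqlite3.connect(db_path)
    for rel, algorithm, expected in con.execute(
            "SELECT path, algorithm, digest FROM checksums"):
        h = hashlib.new(algorithm.lower().replace("-", ""))  # e.g. 'SHA-256' -> 'sha256'
        with (root / rel).open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        if h.hexdigest() != expected:
            errors.append(rel)
    con.close()
    return errors

print(check_against_db("checksums.db", Path("/path/to/collection")))  # hypothetical paths
```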

Other software in their digital repository, [DPR](http://dpr.sourceforge.net/), generates the initial checksum (as well as performing all the other processing activities).
