Bash parameter expansion: best practices for speed?

runlevel0

I'm just wondering whether anyone knows of any best practices, or whether there is any documentation, on this topic:

I have a script that searches (greps) through log files. To make my point, I'll use ls. So, say I run ls to list a number of files in a directory:

/var/log/remote/serverX.domain.local/ps/ps2.log.2014-mm-dd.gz

Here mm and dd are the month and day numbers, and serverX stands for a whole range of servers (for the example I'm using 4, 5, 9 and 10; these are real servers).

I ran ls under time, first using a list of parameters in braces, and then changed it to an asterisk to see the difference. Of course, I did not expect the asterisk to perform better.

emartinez@serverlog:~$ time ls /var/log/remote/server{4,5,9,10}.domain.local/ps/ps2.log.2014-10-0{1,2}.gz
/var/log/remote/server10.domain.local/ps/ps2.log.2014-10-01.gz
...
/var/log/remote/server5.domain.local/ps/ps2.log.2014-10-02.gz

real 0m0.004s
user 0m0.010s
sys 0m0.000s

Then I replace the last brace list with an asterisk:

time ls /var/log/remote/server{4,5,9,10}.domain.local/ps/ps2.log.2014-10-0*.gz

And I get the following timings:

real 0m0.028s
user 0m0.020s
sys 0m0.020s

That is a big difference, even though there are only 2 possible matches, since the only available dates are October 01 and 02.

I ran the test again, but this time I replaced the month with a list that matches the available files:

ps2.log.2014--0.gz : real 0m0.010s
ps2.log.2014--0*.gz : real 0m0.168s

That is a big difference for just one asterisk! It makes sense that it is slower, but are there any benchmarks for how much slower, and are there best practices laid out anywhere?

2

2 answers

2
rici

It may seem like prefix-* should be easy to turn into, for example, prefix-1 prefix-2, since we're used to seeing directory listings sorted. But it turns out that very few filesystems can actually produce sorted filename listings, and furthermore that there is no standard API for asking for sorted filename listings.

If a program -- such as ls, or, for that matter, bash -- needs a list of filenames, it needs to read the whole directory listing, which will be produced in some random order (often the order is related to creation time; sometimes it's based on a hash of the filename; but in pretty well no case is it a simple alphabetic order). So in order to resolve prefix-*, you need to read the entire directory and check every filename against the pattern. Since the most costly part of that procedure is reading the directory, it makes little difference how complex the pattern is or how many filenames match the pattern.

In summary, pathname expansion ("resolving globs") is going to be slow in a large directory. That's a reason to avoid large directories, rather than a reason to avoid globs.
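To get a feel for this on your own machine, here is a rough sketch, assuming bash and a scratch directory created with mktemp. The filler-N names and the count of 2000 are made up for illustration, and the timings will vary with the filesystem and cache state.

```shell
#!/usr/bin/env bash
# Scratch directory with many unrelated entries and two matching ones.
# The "filler-N" names and the count of 2000 are arbitrary assumptions.
tmp=$(mktemp -d)
( cd "$tmp" && touch filler-{1..2000} prefix-1 prefix-2 )

# Pathname expansion: the shell reads the directory to find the matches.
glob_matches=( "$tmp"/prefix-* )
echo "glob matched ${#glob_matches[@]} names"

# ":" is a no-op builtin, so "time" measures only the expansion itself.
time : "$tmp"/prefix-*       # has to look at all 2002 entries
time : "$tmp"/prefix-{1,2}   # plain string operation, no directory access

rm -rf "$tmp"
```

Raising the filler count should widen the gap between the two timings, while changing the pattern itself should matter much less.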

But there's another important datapoint: prefix-{1,2} is not pathname expansion. It's "brace expansion", and it's an extension to the POSIX shell standard (although almost all shells implement it). There are a number of differences between brace expansion and pathname expansion, but one important and relevant difference is that brace expansion does not depend on the existence of files. Brace expansion is a simple string operation.

Consequently, prefix-{1,2} will always expand to prefix-1 prefix-2, regardless of whether those files exist or not. That means it can be expanded without reading the directory and without calling stat on any file. Clearly, that's going to be fast. But there's a downside: there's no way to tell whether the result corresponds to real files.

Consider the following simple example:

$ mkdir test && cd test
$ touch file1 file2 file4
$ ls file*
file1 file2 file4
$ ls file[1234]
file1 file2 file4
$ ls file{1,2,3,4}
ls: cannot access file3: No such file or directory
file1 file2 file4

Final point: Pathname expansion is done by the shell, not by ls. With pathname expansion, we could just as well use echo:

$ echo file*
file1 file2 file4
$ echo file[1234]
file1 file2 file4

And echo will produce the list somewhat faster, because all echo needs to do is print its arguments, while ls (which receives the same arguments) has to stat each argument in order to verify that it is a file. That stat -- which is not a cheap call -- is entirely redundant in the case of a pathname expansion, because the shell has already used the directory listing in order to filter the file list and therefore every filename passed to ls is known to exist. (Unless the glob didn't match any files at all.)

In addition, echo is a bash built-in, so it can be invoked without creating a child process.
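Since the shell does the matching, the result can also be captured directly in an array without running ls at all. The following is a sketch, assuming bash; nullglob is a bash option that makes a pattern with no matches expand to nothing instead of being passed through literally, which covers the no-match case mentioned above. The scratch directory and the nothing-* pattern are only illustrative.

```shell
#!/usr/bin/env bash
# Recreate the file1/file2/file4 example in a scratch directory.
tmp=$(mktemp -d)
touch "$tmp"/file1 "$tmp"/file2 "$tmp"/file4

shopt -s nullglob                 # unmatched patterns expand to nothing
present=( "$tmp"/file[1234] )     # filled in by the shell, no ls involved
absent=( "$tmp"/nothing-* )       # hypothetical pattern that matches nothing
shopt -u nullglob                 # back to the default

echo "present: ${#present[@]}, absent: ${#absent[@]}"
rm -rf "$tmp"
```

Every name in present is known to have existed when the directory was read, so no further check is needed before using it.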

In the case of brace expansion, though, echo does not produce the same result:

$ echo file{1,2,3,4}
file1 file2 file3 file4

So we could use ls, redirecting its error output to the bit bucket:

$ ls file{1,2,3,4} 2>/dev/null
file1 file2 file4

and in this case, the stat calls are not redundant because the shell never validated the filenames.
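If the only goal is to drop the names that do not exist, the same check can be done with the shell's own file test instead of ls. A minimal sketch, assuming bash and the file1/file2/file4 layout from the example above, recreated here in a scratch directory:

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
touch "$tmp"/file1 "$tmp"/file2 "$tmp"/file4

# Brace expansion yields four candidate names; keep only those that exist.
existing=()
for f in "$tmp"/file{1,2,3,4}; do
  if [[ -e $f ]]; then
    existing+=( "${f##*/}" )      # store the name without the directory part
  fi
done

echo "${existing[*]}"
rm -rf "$tmp"
```

Each -e test still has to ask the filesystem about one name, so this saves the child process and the error messages, not the lookups themselves.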

Unless your directories are really huge, none of this will make much difference and the glob will be a lot easier to write. If your directories are really huge, you should consider splitting them into smaller sub-directories.

For example, instead of paths like:

/var/log/remote/serverX.domain.local/ps/ps2.log.2014-mm-dd.gz 

you could use:

/var/log/remote/serverX/domain.local/ps/ps2.log.2014-mm-dd.gz

And if you are keeping the logs forever, you might want to extract the year to avoid infinitely increasing directory size:

/var/log/remote/2014/serverX/domain.local/ps/ps2.log.2014-mm-dd.gz

(2014 is deliberately repeated.)
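As a sketch of how an existing flat path could be mapped onto that layout, here is a small bash function. The name shard_path is hypothetical, and the regular expression assumes host directories of the form serverN.domain and file names ending in yyyy-mm-dd.gz, as in the paths above.

```shell
#!/usr/bin/env bash
# shard_path is a hypothetical helper: it prints the sharded equivalent of a
# flat log path, or returns 1 if the path does not have the assumed shape.
shard_path() {
  local flat=$1
  # Groups: 1 = root, 2 = serverN, 3 = domain, 4 = rest of path, 5 = year
  local re='^(.*)/(server[0-9]+)\.([^/]+)/(.*\.([0-9]{4})-[0-9]{2}-[0-9]{2}\.gz)$'
  if [[ $flat =~ $re ]]; then
    printf '%s/%s/%s/%s/%s\n' \
      "${BASH_REMATCH[1]}" "${BASH_REMATCH[5]}" \
      "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}" "${BASH_REMATCH[4]}"
  else
    return 1
  fi
}

shard_path /var/log/remote/server4.domain.local/ps/ps2.log.2014-10-01.gz
# prints /var/log/remote/2014/server4/domain.local/ps/ps2.log.2014-10-01.gz
```

The function only computes the new name; actually moving files (for example with mkdir -p and mv) is left out on purpose.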

Sharding the directories will usually be a big win because it provides a mechanism to optimize globbing. As mentioned above, the shell cannot optimize

/var/log/remote/server[2357].domain.local/ps/ps2.log.2014-10-*.gz

but it can optimize

/var/log/remote/server[2357]/domain.local/ps/ps2.log.2014-10-*.gz

In the second case, server[2357] only needs to be matched against the directory names, and once that is done, ps2.log.2014-10-*-gz only needs to be matched against the filenames in the matched directories.
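The sharded layout can be tried out in a scratch tree. This is only a sketch, assuming bash; the server numbers 2, 3 and 4 and the two dates are invented for the demonstration.

```shell
#!/usr/bin/env bash
# Build a small sharded tree: <root>/serverN/domain.local/ps/<logs>
root=$(mktemp -d)
for s in 2 3 4; do
  d=$root/server$s/domain.local/ps
  mkdir -p "$d"
  touch "$d"/ps2.log.2014-10-0{1,2}.gz
done

# server[23] is matched against the entries of $root only; the file pattern
# is then matched inside the two selected ps directories.
hits=( "$root"/server[23]/domain.local/ps/ps2.log.2014-10-*.gz )
printf '%s\n' "${hits[@]#"$root"/}"   # print paths relative to the root

rm -rf "$root"
```

server4 is never descended into, which is the saving the sharded layout is meant to provide.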

Thanks a lot, mate! Awesome read. Unfortunately I can't upvote you, as my rep on this forum is only 6. But once again, thanks a lot! (runlevel0, 9 years ago)
1
Dennis

Shell expansion is always performed in a particular order; brace expansion is performed first, file name expansion is performed last.

Thus, a command like

echo {1..3}*

first gets expanded to

echo 1* 2* 3* 

then, file name expansion is performed for 1*, 2* and 3*. Each expansion involves going through all file names in the directory and comparing them against the pattern.

As the number of words and/or the number of files in the directory grows, this becomes gradually slower. Even in an empty directory,

shopt -s nullglob     # print nothing for non-matching words
echo {1..1000000}*    # prints nothing
shopt -u nullglob     # back to the default

takes almost five seconds on my machine. This is not at all surprising if you consider that file name expansion is performed one million times...
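The multiplication of patterns can be made visible by quoting the asterisk, which stops pathname expansion and leaves only the brace expansion. The following is a sketch, assuming bash; the range is reduced to 20000 so that it runs quickly, and the scratch directory is empty on purpose.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)                  # empty scratch directory

# With the star quoted, only brace expansion runs: one word per number.
patterns=( "$tmp"/{1..20000}'*' )
echo "brace expansion produced ${#patterns[@]} patterns"

# With the star unquoted, each of those words is globbed separately.
shopt -s nullglob
time : "$tmp"/{1..20000}*         # 20000 pathname expansions, no matches
shopt -u nullglob

rm -rf "$tmp"
```

The time reported should grow roughly with the size of the range, since every generated word triggers its own pass over the directory.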

A much faster alternative is to avoid combining both types of shell expansion whenever possible.

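The command

echo [1-9]* # also prints nothing

searches for the same file names, but it uses a single pattern: a bracket expression matches exactly one character, and every number from 1 to 1000000 starts with a digit from 1 to 9. This takes 33 milliseconds on my machine.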

Using square brackets instead of curly brackets has additional benefits:

$ touch 13
$ echo {1..20}*
13 13
$ echo [1-9]*
13

The first approach found the file twice, since it matches the patterns 1* and 13*. This doesn't happen with "pure" file name expansion.
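The duplicate can be reproduced in a scratch directory. This is a sketch, assuming bash; because a bracket expression matches a single character, [1-9]* is used here as the one pattern that covers every name beginning with a number from 1 to 20.

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
touch "$tmp"/13

shopt -s nullglob
from_braces=( "$tmp"/{1..20}* )   # 20 separate globs; 1* and 13* both hit "13"
from_glob=( "$tmp"/[1-9]* )       # one glob; each file can appear only once
shopt -u nullglob

echo "brace + glob: ${#from_braces[@]} words, single glob: ${#from_glob[@]} word"
rm -rf "$tmp"
```

Any script that loops over the first array would process the file 13 twice, which is easy to miss when the directory is large.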

Thank you very much too! As I said above, I don't have enough rep to upvote you. Both answers are extremely insightful and useful. (runlevel0, 9 years ago)
