It may seem like prefix-* should be easy to turn into, for example, prefix-1 prefix-2, since we're used to seeing directory listings sorted. But it turns out that very few filesystems can actually produce sorted filename listings, and furthermore that there is no standard API for asking for sorted filename listings.
If a program -- such as ls, or, for that matter, bash -- needs a list of filenames, it needs to read the whole directory listing, which will be produced in some arbitrary order (often the order is related to creation time; sometimes it's based on a hash of the filename; but in almost no case is it a simple alphabetic order). So in order to resolve prefix-*, you need to read the entire directory and check every filename against the pattern. Since the most costly part of that procedure is reading the directory, it makes little difference how complex the pattern is or how many filenames match the pattern.
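The scan described above can be sketched in shell. This is only an illustration of the idea, not how bash implements globbing internally: the function name scan_and_match is hypothetical, and the sketch relies on ls -f, which POSIX defines as listing entries in directory order without sorting.

```shell
# Illustrative sketch only; not bash's actual implementation.
# scan_and_match DIR PATTERN: read every entry in DIR, in whatever order the
# filesystem returns it, and print the names that match PATTERN.
scan_and_match() {
    dir=$1
    pattern=$2
    # ls -f does not sort, so this is the raw directory order.
    # (Parsing ls output breaks on names containing newlines; fine for a demo.)
    ls -f -- "$dir" | while IFS= read -r name; do
        case $name in
            $pattern) printf '%s\n' "$name" ;;
        esac
    done
}

# Demo in a scratch directory (assumes mktemp is available).
demo=$(mktemp -d)
touch "$demo/prefix-2" "$demo/other" "$demo/prefix-1"
# The shell sorts the matches only after the full scan; sort stands in for that.
matched=$(scan_and_match "$demo" 'prefix-*' | sort)
printf '%s\n' "$matched"
rm -rf "$demo"
```

Every entry is read and tested, including the one that does not match, so the cost grows with the size of the directory rather than with the number of matches.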
In summary, pathname expansion ("resolving globs") is going to be slow in a large directory. That's a reason to avoid large directories, rather than a reason to avoid globs.
But there's another important datapoint: prefix-{1,2} is not pathname expansion. It's "brace expansion" and it's an extension to the POSIX shell standard (although almost all shells implement it). There are a number of differences between brace expansion and pathname expansion, but one important and relevant difference is that brace expansion does not depend on the existence of files. Brace expansion is a simple string operation.
Consequently, prefix-{1,2} will always expand to prefix-1 prefix-2, regardless of whether those files exist or not. That means it can be expanded without reading the directory and without calling stat on any file. Clearly, that's going to be fast. But there's a downside: there's no way to tell whether the result corresponds to real files.
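A quick way to see the difference is to run both expansions in an empty directory. The sketch below invokes bash explicitly, since brace expansion is not part of POSIX sh, and it assumes bash's default options (nullglob and failglob unset); the variable names are only for illustration.

```shell
# Run each expansion under bash explicitly, since brace expansion is not POSIX.
# Assumes bash's default options (nullglob and failglob unset).
scratch=$(mktemp -d)    # empty directory: no prefix-* files exist
braces=$(cd "$scratch" && bash -c 'echo prefix-{1,2}')
glob=$(cd "$scratch" && bash -c 'echo prefix-*')
printf 'brace expansion:    %s\n' "$braces"
printf 'pathname expansion: %s\n' "$glob"
rmdir "$scratch"
```

The brace expansion yields prefix-1 prefix-2 even though neither file exists, while the unmatched glob is passed through as the literal string prefix-*.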
Consider the following simple example:
    $ mkdir test && cd test
    $ touch file1 file2 file4
    $ ls file*
    file1 file2 file4
    $ ls file[1234]
    file1 file2 file4
    $ ls file{1,2,3,4}
    ls: cannot access file3: No such file or directory
    file1 file2 file4
Final point: Pathname expansion is done by the shell, not by ls. With pathname expansion, we could just as well use echo:
    $ echo file*
    file1 file2 file4
    $ echo file[1234]
    file1 file2 file4
And echo will produce the list somewhat faster, because all echo needs to do is print its arguments, while ls (which receives the same arguments) has to stat each argument in order to verify that it is a file. That stat -- which is not a cheap call -- is entirely redundant in the case of a pathname expansion, because the shell has already used the directory listing in order to filter the file list and therefore every filename passed to ls is known to exist. (Unless the glob didn't match any files at all, in which case the shell by default passes the pattern through unchanged.)
In addition, echo is a bash built-in, so it can be invoked without creating a child process.
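One way to check which commands are built-ins is command -v, which prints just the name for a built-in and a full path for an external program. This is a small sketch for a non-interactive shell; in an interactive session an alias for ls may be reported instead.

```shell
# command -v prints just the name for a builtin and a full path for an
# external program that would have to be run in a child process.
echo_kind=$(command -v echo)
ls_kind=$(command -v ls)
printf 'echo -> %s\n' "$echo_kind"
printf 'ls   -> %s\n' "$ls_kind"
```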
In the case of brace expansion, though, echo does not produce the same result:
    $ echo file{1,2,3,4}
    file1 file2 file3 file4
So we could use ls, redirecting its error output to the bit bucket:
    $ ls file{1,2,3,4} 2>/dev/null
    file1 file2 file4
and in this case, the stat calls are not redundant because the shell never validated the filenames.
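The same filtering can be done without starting ls at all, by testing each candidate name in the shell. This is an alternative sketch rather than the approach shown above; existing_only is a hypothetical helper name, and each test still costs one stat per name.

```shell
# existing_only NAME...: print each argument that names an existing file.
# Each test -e costs a stat call, just as ls would.
existing_only() {
    for name in "$@"; do
        if [ -e "$name" ]; then
            printf '%s\n' "$name"
        fi
    done
}

# Demo mirroring the example above: file3 is deliberately missing.
demo=$(mktemp -d)
touch "$demo/file1" "$demo/file2" "$demo/file4"
found=$(cd "$demo" && existing_only file1 file2 file3 file4)
printf '%s\n' "$found"
rm -rf "$demo"
```

In bash, the argument list could equally be written as existing_only file{1,2,3,4}; the names are spelled out here so the sketch also runs in shells without brace expansion.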
Unless your directories are really huge, none of this will make much difference and the glob will be a lot easier to write. If your directories are really huge, you should consider splitting them into smaller sub-directories.
For example, instead of paths like:
/var/log/remote/serverX.domain.local/ps/ps2.log.2014-mm-dd.gz
you could use:
/var/log/remote/serverX/domain.local/ps/ps2.log.2014-mm-dd-gz
And if you are keeping the logs forever, you might want to extract the year to avoid infinitely increasing directory size:
/var/log/remote/2014/serverX/domain.local/ps/ps2.log.2014-mm-dd-gz
(2014 is deliberately repeated.)
Sharding the directories will usually be a big win because it provides a mechanism to optimize globbing. As mentioned above, the shell cannot optimize
/var/log/remote/server[2357].domain.local/ps/ps2.log.2014-10-*-gz
but it can optimize
/var/log/remote/server[2357]/domain.local/ps/ps2.log.2014-10-*-gz
In the second case, server[2357] only needs to be matched against the directory names, and once that is done, ps2.log.2014-10-*-gz only needs to be matched against the filenames in the matched directories.
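The effect can be tried on a small scratch tree. The sketch below is only a demonstration under made-up names (three servers, two log files each), not a benchmark; it shows the sharded pattern being resolved one path component at a time.

```shell
# Build a small sharded tree in a scratch directory (all names are made up
# for the demo; only servers 2 and 3 fall inside the [2357] class).
root=$(mktemp -d)
for n in 1 2 3; do
    mkdir -p "$root/server$n/domain.local/ps"
    touch "$root/server$n/domain.local/ps/ps2.log.2014-10-01-gz" \
          "$root/server$n/domain.local/ps/ps2.log.2014-11-01-gz"
done

# The shell matches server[2357] against the three entries in $root, then
# reads only the ps directories under the servers that matched.
set -- "$root"/server[2357]/domain.local/ps/ps2.log.2014-10-*-gz
match_count=$#
for path in "$@"; do
    printf '%s\n' "${path#"$root"/}"    # print paths relative to the tree root
done
rm -rf "$root"
```

Only the October logs under server2 and server3 are listed; the ps directory under server1 is never read, which is the saving that sharding buys.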