Finding Things

Objectives

Questions
- How can we find files in complex folders?
- How can we find lines in files?
Keypoints
- grep and find can be used to find files
- grep can also be used to search in files

Instructor note

Demo/teaching: 15 min
Exercise: 15 min

Being able to find files and folders again is a critical skill when working on UNIX systems.

Note

First we will demonstrate a couple of commands and later we will use these in an exercise.

If you want to type-along with the instructors, you can download and extract the example like this:

cd
wget https://gitlab.sigma2.no/training/tutorials/unix-for-hpc/-/raw/master/content/episodes/finding-things/finding-things.tar.gz --no-check-certificate 
tar xzvf finding-things.tar.gz

Finding files in folders with `find`

You should now have a folder named finding-things in your current directory containing a bunch of files and folders with random names and different file types.

Now we will try to find a couple of different files in this mess. Let’s start with a file called output.txt which is in one of the subfolders. Our tool is fittingly called find and is available on all Unix and Linux machines. The general syntax is quite simple:

$ find STARTING_POINT OPTIONS

but if we look into its manual page using man find, we see that there are many more parameters and options available.

Based on file name

In our case, we know the folder in which to look (the ‘STARTING_POINT’) and the exact name, so the command becomes:

$ find finding-things -name output.txt

After pressing enter we will see the following result

finding-things/godcjrbv/output.txt

which means that find successfully found the file with name output.txt in folder finding-things/godcjrbv.

If, for some reason, we are not sure about the capitalization of our file (was it ‘output.txt’, ‘Output.txt’, ‘OUTPUT.txt’ or something else), we can use -iname instead. This way we ignore the case of the search term:

$ find finding-things -iname output.txt

The result of this command is in our case:

finding-things/godcjrbv/output.txt
finding-things/aivxievn/feipoogd/OUTPUT.txt

so we see that there were actually two files with the same name, but in different subfolders and with different capitalisation. Using the -iname option ensured we found them both.

Let’s imagine that we forgot the name of a data file we want to use but we remember it is a .csv file. We could look through each folder separately or try out all the names with a .csv suffix that we can think of. Instead we can use find in combination with wildcard characters. The two most useful wildcards are the asterisk (*) and the question mark (?). The asterisk matches zero or more characters while the question mark matches exactly one character.

So if we want to list all .csv files, we can use the asterisk like this:

$ find finding-things -iname '*.csv'

The output from this command is

finding-things/aivxievn/qbafmtbq/data10.csv
finding-things/aivxievn/qbafmtbq/data1.csv
finding-things/aivxievn/qbafmtbq/data11.csv
finding-things/aivxievn/qbafmtbq/data2.csv
finding-things/aivxievn/qbafmtbq/data14.csv
finding-things/aivxievn/qbafmtbq/data13.csv
finding-things/aivxievn/qbafmtbq/data4.csv
finding-things/aivxievn/qbafmtbq/data3.csv
finding-things/aivxievn/feipoogd/data.csv

We use -iname here to also match .CSV files (we don’t want to miss them unintentionally). The quotation marks make sure that the *.csv is interpreted by find and not the shell (bash) itself.
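To see what the quotation marks change, here is a minimal sketch that works in a scratch folder instead of finding-things. The file names (top.csv, data1.csv, data12.CSV, notes.txt) are made up for this illustration.

```shell
# Scratch folder with hypothetical file names, so nothing real is touched.
demo=$(mktemp -d)
mkdir -p "$demo/sub"
touch "$demo/top.csv" "$demo/sub/data1.csv" "$demo/sub/data12.CSV" "$demo/sub/notes.txt"

# Quoted pattern: find receives *.csv unchanged and tests it against every file name.
quoted=$(cd "$demo" && find . -iname '*.csv')
echo "$quoted"

# Unquoted pattern: the shell expands *.csv first. In the scratch folder it
# becomes top.csv, so find only looks for files with exactly that name.
unquoted=$(cd "$demo" && find . -iname *.csv)
echo "$unquoted"

# The question mark stands for exactly one character:
# data1.csv matches, data12.CSV does not.
single=$(cd "$demo" && find . -iname 'data?.csv')
echo "$single"

rm -r "$demo"
```

The quoted search lists all three .csv files, while the unquoted one reports only ./top.csv. If the shell had expanded the pattern to several names, find would have stopped with an error instead.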

Based on other attributes

find can not only be used to look for files based on their name but also on other properties like file size or access/modification date. This can be useful to display only files which have been changed or created in the last minutes or hours.

For example, let’s create a new empty file with

$ touch finding-things/new_file.txt

This file is now much newer than the other files in the folder and we can find it with either the -amin or the -mmin option, depending on whether we are looking for access or modification time, respectively.

$ find finding-things -mmin -10

resulting in the output

finding-things
finding-things/experiment1
finding-things/experiment1/asdeswer
finding-things/new_file.txt
finding-things/sdenohww
finding-things/pexhhtec
finding-things/godcjrbv
finding-things/aivxievn
finding-things/aivxievn/qbafmtbq
finding-things/aivxievn/feipoogd

We will get all files or folders that have been modified (in our case, have been created) less than 10 min ago. Your output might look different depending on when you created the finding-things folder. By contrast -mmin +10 would show all files that have been modified more than 10 min ago.
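The difference between -mmin -10 and -mmin +10 can be tried out in a scratch folder, as in the following sketch. The file names fresh.txt and old.txt are made up, and the -type f test is an addition that is not used above: it restricts the result to regular files, so the folders themselves are not listed.

```shell
# Scratch folder with two hypothetical files.
demo=$(mktemp -d)
touch "$demo/fresh.txt"
# Set the modification time of the second file to 1 January 2020, 00:00.
touch -t 202001010000 "$demo/old.txt"

# Regular files modified less than 10 minutes ago.
recent=$(find "$demo" -type f -mmin -10)
# Regular files modified more than 10 minutes ago.
older=$(find "$demo" -type f -mmin +10)

echo "less than 10 min ago: $recent"
echo "more than 10 min ago: $older"

rm -r "$demo"
```

Only fresh.txt is reported by the first search and only old.txt by the second.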

Another property that is sometimes useful for differentiating files is their size. If we want to see all files in our finding-things directory larger than 100 kB we can use

$ find finding-things -size +100k

which gives the result:

finding-things/aivxievn/qbafmtbq/data10.csv
finding-things/aivxievn/qbafmtbq/data1.csv
finding-things/aivxievn/qbafmtbq/plot.jpg
finding-things/aivxievn/qbafmtbq/data11.csv
finding-things/aivxievn/qbafmtbq/data2.csv
finding-things/aivxievn/qbafmtbq/data14.csv
finding-things/aivxievn/qbafmtbq/data13.csv
finding-things/aivxievn/qbafmtbq/data4.csv
finding-things/aivxievn/qbafmtbq/data3.csv

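Other units are possible as well, for example +2G (larger than 2 GB). Be careful with upper limits: GNU find rounds every file size up to the next whole unit before comparing, so -size -1M matches only empty files. To list files smaller than about 1 MB, use -size -1024k instead.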
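How find treats sizes is easiest to check with files of known size. The sketch below assumes GNU find (other implementations may compare sizes differently) and uses made-up file names in a scratch folder.

```shell
# Scratch folder with three hypothetical files of known size.
demo=$(mktemp -d)
: > "$demo/empty.dat"                        # 0 bytes
head -c 500 /dev/zero > "$demo/small.dat"    # 500 bytes
head -c 300000 /dev/zero > "$demo/big.dat"   # 300000 bytes, about 293 kB

# Larger than 100 kB: only big.dat.
big=$(find "$demo" -type f -size +100k)
echo "$big"

# GNU find rounds sizes up to whole units before comparing, so even the
# 500-byte file counts as 1 MB here and only the empty file is "below 1M".
below_1M=$(find "$demo" -type f -size -1M)
echo "$below_1M"

# Counting in kB instead gives the intended result: all three files.
below_1024k=$(find "$demo" -type f -size -1024k)
echo "$below_1024k"

rm -r "$demo"
```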

Combining attributes

To unlock find’s full potential, it is possible to combine different attributes to search for files very precisely. If we, for example, want to find all .csv files with a size greater than 200 kB we can use:

$ find finding-things -name '*.csv' -size +200k

There was one file with those attributes, namely

finding-things/aivxievn/qbafmtbq/data13.csv
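Tests that are written next to each other must all be true for a file to be listed. The following sketch shows this in a scratch folder with made-up file names. It also uses two things not covered above, -type f (regular files only) and an exclamation mark, which negates the test that follows it.

```shell
# Scratch folder with three hypothetical files.
demo=$(mktemp -d)
head -c 300000 /dev/zero > "$demo/large.csv"   # about 293 kB
head -c 100 /dev/zero > "$demo/small.csv"      # 100 bytes
head -c 300000 /dev/zero > "$demo/large.jpg"   # about 293 kB

# Regular file AND name ends in .csv AND larger than 200 kB.
large_csv=$(find "$demo" -type f -name '*.csv' -size +200k)
echo "$large_csv"

# Regular file AND name does NOT end in .csv AND larger than 200 kB.
large_other=$(find "$demo" -type f ! -name '*.csv' -size +200k)
echo "$large_other"

rm -r "$demo"
```

The first search reports only large.csv and the second only large.jpg.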

Finding lines in a file with `grep`

While find is useful for finding files based on their names and other parameters, grep lets us find things within files. Basic usage follows the syntax grep whatToFind fileToSearch (there are many options for more clever things, see the man page with man grep).

We got a list of genes from a colleague that we want to analyse for a project. First let’s find the list with find and have a look at its layout with head.

$ find finding-things -name 'genelist.tsv'

From that command we found the path to this file, namely

finding-things/genelist.tsv

We use this path to look at the first 5 lines in the file:

$ head -n5 finding-things/genelist.tsv

which results in

Entry	Gene names	Organism
Q14914	PTGR1 LTB4DH	Homo sapiens (Human)
Q6GMI9	uxs1 zgc:91980	Danio rerio (Zebrafish) (Brachydanio rerio)
O75452	RDH16 RODH4 SDR9C8	Homo sapiens (Human)
O23530	SNC1 BAL At4g16890 dl4475c FCAALL.51	Arabidopsis thaliana (Mouse-ear cress)

We see that the file contains genes organised in three columns, the last describing the organism. Assuming we first want to list all genes from rats, we can use grep like this:

$ grep rat finding-things/genelist.tsv

This returns one result:

O70351	Hsd17b10 Erab	rattus norvegicus (rat)

But from our colleague's email we know that there should be two rat genes, so we try the `-i` option to ignore capitalization:

$ grep -i rat finding-things/genelist.tsv

This time we get both genes as a result:

O08699	Hpgd Pgdh1	Rattus norvegicus (Rat)
O70351	Hsd17b10 Erab	rattus norvegicus (rat)
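The effect of -i can be reproduced with a small file that we write ourselves. The sketch below copies three lines from genelist.tsv into a scratch file called genes.tsv (a made-up name) and uses the -c option, which is not used above, to print the number of matching lines instead of the lines themselves.

```shell
# Scratch file with three tab-separated lines modelled on genelist.tsv.
demo=$(mktemp -d)
printf 'O08699\tHpgd Pgdh1\tRattus norvegicus (Rat)\n'     > "$demo/genes.tsv"
printf 'O70351\tHsd17b10 Erab\trattus norvegicus (rat)\n' >> "$demo/genes.tsv"
printf 'Q14914\tPTGR1 LTB4DH\tHomo sapiens (Human)\n'     >> "$demo/genes.tsv"

# Case-sensitive search: only the line written in lower case matches.
sensitive=$(grep -c rat "$demo/genes.tsv")
# Case-insensitive search: "Rattus" and "Rat" match as well.
insensitive=$(grep -c -i rat "$demo/genes.tsv")

echo "without -i: $sensitive matching line(s)"
echo "with -i:    $insensitive matching line(s)"

rm -r "$demo"
```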

We remembered these genes from a conversation with another colleague. Fortunately we wrote down some notes. But where is the file with the notes? It must be in our folder. So let's look for the gene name in all files in the folder 'finding-things' using the `-r` option of `grep`:

$ grep -r Hsd17b10 finding-things -C 1 -n

As expected, this returns the gene in ‘genelist.tsv’ but indeed also the note file we were looking for (in addition to another file where this string occurs). -n shows the line number and -C 1 provides us with some context by displaying one line before and after the line containing our search term.
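The same options can be tried on a small scratch folder, as in the sketch below. The folder layout and the file contents are made up. The last command uses -l, an option not used above, which prints only the names of the files that contain the search term.

```shell
# Scratch folder with two hypothetical text files.
demo=$(mktemp -d)
mkdir -p "$demo/notes"
printf 'Meeting notes\nLook at Hsd17b10 next week\nOrder new pipettes\n' > "$demo/notes/todo.txt"
printf 'Nothing relevant here\n' > "$demo/other.txt"

# -r searches all files below the folder, -n adds line numbers and
# -C 1 prints one line of context before and after each match.
context=$(grep -r -n -C 1 Hsd17b10 "$demo")
echo "$context"

# -l lists only the files that contain a match.
files=$(grep -r -l Hsd17b10 "$demo")
echo "$files"

rm -r "$demo"
```

In the first output, the matching line is marked with colons around its line number, while the context lines use hyphens.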

Exercise

Exercise (15 min)

We not only forgot where we saved the output file output.txt, but also the figure we plotted. How can you find the file graph.jpg? How do we look for this file only in the folder experiment1 and its subfolders?
Now we want to find a plot we saved somewhere but we don’t remember which file format we used (.jpg, .bmp, .png or something else). How could you find a file with the name ‘plot’ but unknown suffix?
Let’s assume that we can’t remember if our output file is of type .jpg or .JPG but we know that it is larger than 500 kB. How can we find it?
The notes file also contained the gene ‘AtFDH1’ that might be interesting. From which organism is that gene in the ‘genelist.tsv’ file?
How can we find the file containing the sequence of gene ‘Hsd17b10’? It is saved somewhere in our folder but not properly labeled.

Solution

If we want to look in all subfolders of our base folder finding-things, we can use:
```
$ find finding-things -name graph.jpg
```
If we only want to look in the subfolder experiment1, we can modify the starting point of the search.
```
$ find finding-things/experiment1 -name graph.jpg
```
One possibility to look for a file with an unknown suffix is to use a wildcard:
```
$ find finding-things -name 'plot.*'
```
Don’t forget the quotation marks around plot.*.
You can combine multiple search options. First, remember to use -iname if you want to search for files without taking capitalization into account. Second, the -size parameter lets you find files based on their size.
```
$ find finding-things -iname '*.jpg' -size +500k
```
Use grep to search for our gene name in the right file:
```
$ grep AtFDH1 finding-things/genelist.tsv
```
Use the -r option to recursively look for the search phrase in all files in the specified folder, in our case the finding-things folder.
```
$ grep -r Hsd17b10 finding-things
```

Keypoints

It is easy to lose the overview in large folders with many subfolders and files.
Both grep and find enable us to find files (again).
Both commands offer many options to find exactly what you are looking for; check out their man pages with man find and man grep.