How to download and extract files from the internet?
How can we find files in complex folders?
How can we find lines in files?
find can be used to find files
grep can also be used to search in files
Demo/teaching: 15 min
Exercise: 15 min
A critical skill is being able to find files and folders (again).
To get the files and folder structure for the following exercises we will use a combination of two often used tools:
wget URL will download whatever is specified by URL and save it in the current folder.
tar is used for packing and unpacking archives with file suffixes like
.tar.gz or .tar.bz2. It is infamous for its confusing parameters.
$ wget https://gitlab.sigma2.no/training/tutorials/unix-for-hpc/-/raw/master/content/downloads/finding.tar.gz
$ tar xzvf finding.tar.gz
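The letters in xzvf each have a meaning: x extracts, z handles gzip compression, v lists files as they are processed and f says that the archive name follows. The following is a minimal sketch of packing, listing and unpacking that needs no download. The folder demo and the file a.txt are made-up names, and everything happens in a temporary directory.

```shell
# Work in a throwaway directory so nothing in your real folders is touched.
workdir=$(mktemp -d)
mkdir -p "$workdir/demo"
echo "hello" > "$workdir/demo/a.txt"

# c = create, z = gzip compression, f = the archive file name follows.
# -C changes into the given directory before packing.
tar czf "$workdir/demo.tar.gz" -C "$workdir" demo

# t = list the contents without extracting anything.
listing=$(tar tzf "$workdir/demo.tar.gz")
echo "$listing"

# x = extract; here into a separate folder so the result is easy to inspect.
mkdir "$workdir/extracted"
tar xzf "$workdir/demo.tar.gz" -C "$workdir/extracted"
cat "$workdir/extracted/demo/a.txt"
```

Listing an archive with t before extracting it is a cheap way to check what it will create in your current folder.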
Finding files in folders with find
You should now have a folder named
test_data in your current directory containing
a bunch of files and folders with random names and different file types.
Now we will try to find a couple of different files in this mess.
Let’s start with a file called
output.txt, which has to be in one of the subfolders.
Our tool is fittingly called
find and is available on all Unix and Linux systems.
The general syntax is quite simple
$ find STARTING_POINT OPTIONS
but if we look into its help file using
man find, we see that there are
tons of parameters and options available.
Based on file name
In our case, we know the folder in which to look (the ‘STARTING_POINT’) and the exact name, so the command becomes:
$ find test_data -name output.txt
If, for some reason, we are not sure about the capitalization of our file (was
it ‘output.txt’, ‘Output.txt’, ‘OUTPUT.txt’ or something else), we can use
-iname instead. This way we ignore the case of the search term:
$ find test_data -iname output.txt
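To see the difference between -name and -iname without relying on the contents of test_data, here is a small sketch on a scratch directory. The folder names run1 and run2 are invented for the illustration.

```shell
scratch=$(mktemp -d)
mkdir -p "$scratch/run1" "$scratch/run2"
touch "$scratch/run1/output.txt" "$scratch/run2/Output.TXT"

# -name is case sensitive: only the all-lowercase file matches.
exact=$(find "$scratch" -name output.txt | wc -l)

# -iname ignores case: both files match.
any_case=$(find "$scratch" -iname output.txt | wc -l)

echo "exact: $exact, ignoring case: $any_case"
```

With these two files, -name should report one match and -iname two.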
Let’s imagine that we forgot the name of a data file we want to use but we
remember it is a
.csv file. We could look through each folder separately or
try out all the names with a
.csv suffix that we can think of.
Instead we can use find in combination with wildcard characters. The two most
useful wildcards are the asterisk (*) and the question mark (?).
The asterisk matches zero or more characters while the question mark represents
any single character.
So if we want to list all
.csv files, we can use the asterisk like this:
$ find test_data -iname '*.csv'
We use -iname here to also match
.CSV files (we don’t want to miss them
unintentionally). The quotation marks make sure that the
*.csv is interpreted
by find and not by the shell (bash) itself.
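The effect of the quotation marks is easiest to see when the current folder itself contains a matching file. The sketch below uses invented file names (top.csv, data1.csv, data10.csv) in a scratch directory and also shows the question mark wildcard.

```shell
scratch=$(mktemp -d)
mkdir "$scratch/sub"
touch "$scratch/top.csv" "$scratch/sub/data1.csv" "$scratch/sub/data10.csv"
cd "$scratch" || exit 1

# Unquoted: the shell expands *.csv to top.csv before find starts,
# so find only ever searches for the literal name top.csv.
unquoted=$(find . -name *.csv | wc -l)

# Quoted: find receives the pattern itself and applies it in every folder.
quoted=$(find . -name '*.csv' | wc -l)

# ? stands for exactly one character, so data10.csv is not matched.
single=$(find . -name 'data?.csv' | wc -l)

echo "unquoted: $unquoted, quoted: $quoted, single character: $single"
```

Here the unquoted search silently misses the files in sub. If the current folder contained several .csv files, the unquoted command would instead fail with an error, because find would receive extra arguments it cannot interpret.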
Based on other attributes
find can not only be used to look for files based on their name but also on
other properties like file size or access/modification date.
This can be useful to display only files which have been changed or created in
the last minutes or hours.
For example, let’s create a new empty file with
$ touch test_data/new_file.txt
This file is now much newer than the other files in the folder and we can find
it with either the -amin or the
-mmin option, depending on whether
we are looking for access or modification time, respectively.
$ find test_data -mmin -10
We will get all files or folders that have been modified (in our case, have
been created) less than 10 min ago. By contrast
-mmin +10 would show all
files that have been modified more than 10 min ago.
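The behaviour of -mmin with a minus or plus sign can be tried on a scratch directory. This sketch assumes a find that supports -mmin (GNU and BSD find do). The file names are invented, and -type f is added only to keep the folder itself out of the results.

```shell
scratch=$(mktemp -d)

# Give one file a modification time far in the past (1 January 2020, 00:00).
touch -t 202001010000 "$scratch/old_file.txt"
# A plain touch stamps the file with the current time.
touch "$scratch/new_file.txt"

# -type f restricts the results to regular files, leaving out the folder itself.
recent=$(find "$scratch" -type f -mmin -10)
older=$(find "$scratch" -type f -mmin +10)

echo "modified in the last 10 minutes: $recent"
echo "modified more than 10 minutes ago: $older"
```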
Another property that is sometimes useful for differentiating files is their
size. If we want to see all files in our
test_data directory larger than 100
kB we can use
$ find test_data -size +100k
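Other possible size arguments are, for example,
+2G (larger than 2 GB).
Be careful with upper limits such as -1M: GNU find rounds file sizes up to whole units before comparing, so -1M matches only empty files. To list files smaller than 1 MB, use -1024k instead.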
To unlock find’s full potential, it is possible to combine different attributes
to search for files very precisely. If we, for example, want to find all
.csv files with a size greater than 200 kB we can use:
$ find test_data -name '*.csv' -size +200k
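When several tests are given, find lists a file only if all of them are true. A minimal sketch with invented files of known size (created with dd) illustrates this.

```shell
scratch=$(mktemp -d)

# dd writes 'count' blocks of 'bs' bytes; /dev/zero supplies the filler bytes.
dd if=/dev/zero of="$scratch/big.csv"   bs=1024 count=300 2>/dev/null
dd if=/dev/zero of="$scratch/small.csv" bs=1024 count=50  2>/dev/null
dd if=/dev/zero of="$scratch/big.dat"   bs=1024 count=300 2>/dev/null

# Both tests must be true for a file to be listed (an implicit AND).
matches=$(find "$scratch" -name '*.csv' -size +200k)
echo "$matches"
```

Only big.csv passes both tests: small.csv fails the size test and big.dat fails the name test.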
Finding lines in a file with grep
find is useful for finding files based on their names and other
attributes; grep lets us find things within files.
Basic usage (there are a lot of options for more clever things, see the manual page with
man grep) uses the syntax
grep whatToFind fileToSearch.
We got a list of genes from a colleague that we want to analyse for a project.
First let’s find the list with
find and have a look at its layout with head:
$ find test_data -name 'genelist.tsv'
$ head -n5 test_data/genelist.tsv
We see that the file contains genes organised in three columns, the last describing the organism. Assuming we first want to list all genes from rats, we can use grep like this:
$ grep rat test_data/genelist.tsv
This returns one result. But from our colleague’s email we know that there
should be two rat genes, so we try the
-i option to ignore capitalization:
$ grep -i rat test_data/genelist.tsv
This time we get both genes as results.
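The same effect can be reproduced on a small made-up table. The file genes.tsv and its three rows below are purely hypothetical and are not the contents of genelist.tsv. The -c option is used to count matching lines instead of printing them.

```shell
scratch=$(mktemp -d)
# Hypothetical three-column table, tab separated; not the real genelist.tsv.
printf 'geneA\tdescription1\trat\n'   >  "$scratch/genes.tsv"
printf 'geneB\tdescription2\tRat\n'   >> "$scratch/genes.tsv"
printf 'geneC\tdescription3\tmouse\n' >> "$scratch/genes.tsv"

# -c prints the number of matching lines instead of the lines themselves.
case_sensitive=$(grep -c rat "$scratch/genes.tsv")
case_insensitive=$(grep -ci rat "$scratch/genes.tsv")
echo "rat: $case_sensitive, rat (any case): $case_insensitive"
```

Keep in mind that grep matches parts of words too, so rat would also match a line containing ‘rate’. The -w option restricts matches to whole words.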
We remembered these genes from a conversation with another colleague.
Fortunately we wrote down some notes. But where is the file with the notes? It
must be in our folder. So let’s look for the gene name in all files in the
folder ‘test_data’ using the
-r option of grep:
$ grep -r Hsd17b10 test_data -C 1 -n
As expected, this returns the gene in ‘genelist.tsv’ but also the note file we were looking for.
-n shows the line number and
-C 1 provides us with some
context by displaying one line before and after the line containing our search term.
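If only the names of the matching files are of interest, the -l option can be combined with -r. The sketch below uses invented files and the placeholder search term geneA, and then shows -n and -C 1 on a single file.

```shell
scratch=$(mktemp -d)
mkdir "$scratch/notes"
printf 'first line\nremember geneA\nlast line\n' > "$scratch/notes/meeting.txt"
printf 'geneA\tdescription1\trat\n'              > "$scratch/genes.tsv"
printf 'nothing relevant here\n'                 > "$scratch/other.txt"

# -l prints only the names of files with at least one match.
files=$(grep -rl geneA "$scratch" | sort)
echo "$files"

# -n adds line numbers; -C 1 adds one line of context around each match.
grep -n -C 1 geneA "$scratch/notes/meeting.txt"
```

In the output of the last command, the matching line is marked with a colon after its line number, and context lines are marked with a hyphen.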
Exercise (15 min)
We not only forgot where we saved the output file
output.txt but also the figure we plotted. How can you find the file
graph.jpg? How do we look for this file only in the folder
experiment1 and its subfolders?
Now we want to find a plot we saved somewhere but we don’t remember which file format we used (
.png or something else). How could you find a file with the name ‘plot’ but unknown suffix?
Let’s assume that we can’t remember whether our output file is of type
.jpg or .JPG, but we know that it is larger than 500 kB. How can we find it?
The notes file also contained the gene ‘AtFDH1’ that might be interesting. From which organism is that gene in the ‘genelist.tsv’ file?
How can we find the file containing the sequence of gene ‘Hsd17b10’? It is saved somewhere in our folder but not properly labeled.
If we want to look in all subfolders of our base folder
test_data, we can use:
$ find test_data -name graph.jpg
If we only want to look in the subfolder
experiment1, we can modify the starting point of the search.
$ find test_data/experiment1 -name graph.jpg
One possibility to look for a file with an unknown suffix is to use a wildcard:
$ find test_data -name 'plot.*'
Don’t forget the quotation marks around the search pattern.
You can combine multiple search options. First, remember to use
-iname if you want to search for files without taking capitalization into account. Second, the
-size parameter lets you find files based on their size.
$ find test_data -iname '*.jpg' -size +500k
Use grep to search for our gene name in the right file:
$ grep AtFDH1 test_data/genelist.tsv
Use the -r option to recursively look for the search phrase in all files in the specified folder, in our case the test_data folder:
$ grep -r Hsd17b10 test_data