Shell Basics
created by Jesse Zhang for EE372 Lab hour 1
Starting with the Terminal, log into Stanford’s Corn server.
ssh SUNETID@corn.stanford.edu
You can view the contents of your current directory using the ls
command. Create a new directory using
mkdir EE372Lab1
and change your working directory to this new directory using
cd EE372Lab1
You can move up a directory using
cd ..
Download the examples we’re using for this tutorial using the command
wget data-science-sequencing.github.io/assets/labhour1/lab1examples.zip
Unzip the zipped files
unzip lab1examples.zip
You can remove a file with
rm lab1examples.zip
If the Terminal asks whether you are sure you want to remove the file, type "y". You can use the mv
command to move files. For example, the following command moves the lab1example_GSM file up one directory.
mv lab1example_GSM ../
awk
is a powerful tool that lets you easily do some initial processing on a file of interest. We will go through a few examples below.
If you want to split each line of a file based on a substring “R1”, you can do
awk -F "R1" '{print $0"\t"$1"\t"$2}' lab1example_SRR
The “-F” option indicates what substring you want to split on. This will split “SRR1234” into “SR” and “234”, and it will split “helloR1worldR1!!” into “hello”, “world”, and “!!”. By default, awk splits using white space. The portion of the command in the single quotes tells awk how to process each line of the file. print $0
will print out the entire line, print $1
will print out the first part of the line (after splitting), and print $2
will print the second part of the line. Used this way on the file lab1example_SRR, awk will print each entire line followed by a tab, followed by "SR", followed by a number. We can also print out the row index using "NR":
awk '{print NR}' lab1example_SRR
The last number printed here will be the number of lines in the file (1632). We can also print out the length of each row:
awk '{print length($0)}' lab1example_SRR
which will be 10 in this case for all 1632 rows.
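To see the -F splitting in action on a single line, you can pipe a sample string into awk with echo (the string here is just an illustrative example, not a line from the lab file):

```shell
# Split the sample line "SRR1234" on the substring "R1":
# $0 is the whole line, $1 and $2 are the pieces on either side of "R1".
echo "SRR1234" | awk -F "R1" '{print $0"\t"$1"\t"$2}'
# prints: SRR1234	SR	234
```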
The lab1example_fastq file contains 1000 reads from a sequencing experiment. Every 4 lines correspond to 1 read (you can learn more about fastq files using this tutorial). We can use awk to extract only the sequences with
awk '{if (NR % 4 == 2) print $0}' lab1example_fastq > reads
The “> reads” here saves the output of the command to the left of the “>” in a file called “reads”.
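As a quick self-contained illustration of ">" (and its cousin ">>", which appends instead of overwriting), using a throwaway file name, "notes", made up for this example:

```shell
echo "first line" > notes    # ">" creates (or overwrites) the file "notes"
echo "second line" >> notes  # ">>" appends to the file instead of overwriting it
cat notes
# prints:
# first line
# second line
```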
Another powerful tool is sed
, and we’ll use only its substitution feature here. We can replace every “A” with “a” in the newly made “reads” file with
sed 's/A/a/g' reads
The “s” indicates substitution, and the “g” indicates global (every “A” will be substituted with “a”). Similarly, we can delete all “A”s with
sed 's/A//g' reads
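Note that sed only prints the edited text to the Terminal; the input file itself is left unchanged. To keep the result, redirect the output to a new file. A self-contained sketch (the file names demo and demo_lower are made up for this example):

```shell
printf 'ACGT\nAATT\n' > demo      # make a tiny two-line input file
sed 's/A/a/g' demo > demo_lower   # save the substituted output to a new file
cat demo_lower
# prints:
# aCGT
# aaTT
```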
There are a few ways to view a file. cat
prints the entire content of a file into the Terminal. If you are handling big files, you may not want this. head
allows you to print only the first 10 lines of a file. You can use
head -n 20 reads
to print the first 20 lines of the “reads” file.
You can use cat
and “>” to easily create a new file. Typing
cat > temp
followed by 1, 11, 2, 27, 14 (hitting return after typing each number) and then Ctrl-D to signal the end of input will create a file called "temp" with these 5 numbers occupying 5 lines.
sort temp
will sort the lines of this file lexicographically, i.e. as strings (1, 11, 14, 2, 27), which is different from numeric order. You will need to use
sort -n temp
to sort the file numerically (1, 2, 11, 14, 27). We showed above how you can use awk
to count the number of lines in a file. We can also count the number of lines in the “reads” file using
wc -l reads
To look at the number of unique reads, we must first sort
the reads file, save the result as an intermediate file (e.g. sort reads > reads_sorted
), and then use the command
uniq reads_sorted
We can save the output of this command as another file “reads_uniq” and count the number of reads in this file with wc -l
. You should get 997 as your answer (i.e. there are 997 unique reads out of 1000 total reads).
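The sort step matters because uniq only collapses adjacent duplicate lines. A small self-contained illustration (demo is a made-up file name):

```shell
printf 'b\na\nb\n' > demo
uniq demo          # prints b, a, b -- the duplicate b's are not adjacent
sort demo | uniq   # prints a, b -- sorting first makes duplicates adjacent
```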
The last tool we will talk about is grep
, which allows you to select rows that contain a particular string. For example,
grep "ATACAA" reads
will select the lines in “reads” that contain “ATACAA” in them.
grep -v "A" reads
will select the lines in “reads” that do NOT contain “A” in them. There should only be 2 lines that satisfy this condition.
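A self-contained example of both forms of grep on a tiny file (demo is a made-up file name, and the three sequences in it are invented for this example):

```shell
printf 'ATACAAG\nCCGGTT\nTATACAA\n' > demo
grep "ATACAA" demo   # lines containing ATACAA: ATACAAG and TATACAA
grep -v "A" demo     # lines with no A at all: CCGGTT
```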
The last thing we’ll talk about is the concept of piping. Rather than saving intermediate files (like we did with “reads_sorted” above), we can use the pipe operator |
to stream the output of one tool as the input to the next tool. For example,
cat reads | grep "ATACAA" | grep -v "TTT" | wc -l
will first print all the lines of “reads”, select lines with the “ATACAA” string, then select lines that do not have the “TTT” string before counting how many lines there are. This command answers the question: “Of the 1000 reads in the reads file, how many of them contain the subsequence ATACAA but not the subsequence TTT?”
We can also rewrite the earlier sequence of commands that counted the number of unique reads in the "reads" file as
sort reads | uniq | wc -l
The pipe operator works with awk
and sed
too, of course. If we started with the fastq file rather than the reads file, we can still count the number of unique reads using
awk '{if (NR % 4 == 2) print $0}' lab1example_fastq | sort | uniq | wc -l
A LOT more can be done with the command line tools discussed here. You can find out more about these and other tools here.