DISCLAIMER Follow these instructions at your own risk. Though they have been tested by the authors, said authors and the Laboratory for Genomics and Bioinformatics take no responsibility for any damage or loss of data as a result of their use. All opinions expressed are those of the authors and not necessarily those of the University of Oklahoma Health Sciences Center. Use of this tutorial acknowledges your agreement with these terms.
tutorial: setting up BLAST on OS X
This tutorial will guide you through every step necessary to set up and run BLAST searches locally on your Mac using OS X. It assumes absolutely no knowledge of unix commands; any commands you need will be explained as we go.
note: This document is meant to gently take your hand and guide you through this process, explaining as much as possible. As such, it is a bit long and will perhaps be boring for seasoned *nix users. Don't be intimidated by its length; once you've done this a few times, the entire process takes only two or three minutes. If you'd like something less verbose, see the README files in the BLAST distribution.
To start, open a Darwin command-line interface (aka terminal). This can be found under
opening a terminal
![]() Figure 02. - A newly opened Terminal window
For those of you new to Unix-based systems, this is known as the command-line or terminal window, and it is where most of the work gets done. When you use the terminal, you will mostly type commands with the keyboard rather than clicking on things with your mouse. The last line in your window will differ from the one in Figure 2. This is because "dyer-g3-jo" is the name of my computer and "posidian" is my user name; when you open this window you will see your own machine name and user name in their place. Remember, when I describe one of the many commands that we'll use during this tutorial, you can look at the associated figure to see that command typed and how the computer responds to it.
A quick note on how to enter commands you see in this tutorial: Each command is listed with a number in front of it. This number is not to be typed; it exists only so that we can refer to "command 7" or "commands 9-12". Also, be sure to press the "return" or "enter" key (whichever your keyboard has) after each command. This is labeled as "<return>". Don't actually type <return>; just remember to hit the return key. Finally, be sure to enter spaces between words in a command when you see them in this tutorial, such as between the words "mkdir" and "blast" in command 1 below.
first commands
![]() Figure 03. - Terminal window showing output of commands 1-3.
We are going to learn three unix commands now: "mkdir", "cd", and "pwd". "mkdir" stands for "make directory", "cd" for "change directory", and "pwd" for "print working directory". Our first command will create a directory called "blast".
1 mkdir blast <return>
We then need to change into (move into) that directory.
2 cd blast <return>
Then we'll print out our current folder position.
3 pwd <return>
making directories
![]() Figure 04. - commands 4-7
Use the mkdir command again, this time to make two directories at once: "programs" and "databases". The programs directory is where we'll later put the set of BLAST programs, and the databases directory is where we'll put DNA and protein sequence files to BLAST against.
4 mkdir programs databases <return>
The "ls" command "lists" the contents of the directory you are currently in. We'll use it now to see the two folders we just created.
5 ls <return>
Now change into the programs directory.
6 cd programs <return>
Then let's see where we are.
7 pwd <return>
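Commands 1-7 can also be rehearsed together before you run them for real. The sketch below is only an illustration: the scratch directory made with "mktemp" is an assumption of mine (it is not part of the tutorial), used so that nothing is created in your home directory while you experiment.

```shell
# Work in a throwaway directory (an assumption for this sketch), not your home directory.
scratch=$(mktemp -d)
cd "$scratch"

mkdir blast               # command 1: make the blast directory
cd blast                  # command 2: move into it
mkdir programs databases  # command 4: make two directories at once
listing=$(ls)             # command 5: capture what ls reports
echo "$listing"
cd programs               # command 6: move into programs
here=$(pwd)               # command 7: where are we now?
echo "$here"
```

The path printed at the end should finish with "blast/programs", just as it does in the figure, only under the scratch directory instead of your home directory.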
connecting via FTP
![]() Figure 05. - Connecting to NCBI's FTP server (commands 8-10)
This command connects us to NCBI's FTP server:
8 ftp ftp.ncbi.nih.gov <return>
After a moment, a lot of information will flood the screen. When it stops, you will be prompted to enter a Name or User Name. Use the name "anonymous".
9 anonymous <return>
It will then ask for your password, and you should use your email address. While you type your password it will be invisible. This is just a security measure; your keyboard isn't broken!
10 whomever@wherever.com <return>
This will bring you to an ftp command prompt (ftp>) as in Figure 5.
listing available BLAST packages
FTP commands are much like UNIX commands.
downloading BLAST
Now, for OS X, we want to download the file that starts with "blast-" (not "netblast-") and has powerpc-macosx
in the name. The rest of the file name has version numbers that may change, so you will have to look
for the name yourself and modify the command below if necessary. The aptly-named "get" command is how you
download files with ftp.
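Picking the right file out of a long listing is easy to get wrong, so here is a small sketch of the naming rule just described. The three file names and their version numbers are hypothetical, invented only for illustration; the real listing on the server will differ.

```shell
# Hypothetical file names, for illustration only; real names and versions will differ.
listing="blast-2.2.10-ia32-linux.tar.gz
blast-2.2.10-powerpc-macosx.tar.gz
netblast-2.2.10-powerpc-macosx.tar.gz"

# Keep names that start with "blast-" (which excludes "netblast-")
# and that contain "powerpc-macosx".
wanted=$(printf '%s\n' "$listing" | grep '^blast-' | grep 'powerpc-macosx')
echo "$wanted"
```

Whatever name matches in the real listing is the one you would type after "get" at the ftp> prompt.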
looking for test sequences
We have downloaded BLAST, but we also need to download some sequences to BLAST against and
test our BLAST installation. In this tutorial we will download the E. coli genome and use
it for our tests.
downloading the Escherichia coli K12 genome
Let's get the NC_000913.fna file first.
The ".fna" extension stands for Fasta Nucleic Acid, which lets you know that this is a nucleic acid sequence
in FASTA format. (You will have to wait a moment between these commands while each sequence downloads).
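If you have never seen FASTA format, the sketch below builds a tiny file so you can see its shape. The file name "example.fna" and both sequences are made-up placeholders, not real E. coli data; the only thing to take away is that every record starts with a ">" header line followed by one or more lines of sequence.

```shell
# A made-up two-record FASTA file; the sequences are placeholders, not real data.
scratch=$(mktemp -d)
cat > "$scratch/example.fna" <<'EOF'
>seq1 hypothetical first record
ATGCGTACGTTAGC
GGCTAAGCTT
>seq2 hypothetical second record
TTGACCGTAAGG
EOF

# Each record begins with a ">" header line, so counting those lines counts the records.
records=$(grep -c '^>' "$scratch/example.fna")
echo "records: $records"
```

The same "grep -c" check should work on a downloaded file once you are back at the normal command prompt.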
clearing your screen
After quitting ftp we are back in our programs directory. Let's clear the junk off our screen
by typing the "clear" command. This makes everything look nicer, and you can still scroll
back up and look at your previous commands if you wish.
de-compressing BLAST
The set of programs in the BLAST distribution that we downloaded is compressed together into
one larger file. This makes suites of programs easier to download and saves on storage space.
We can uncompress them using the "tar" command with the following options:
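As a rough illustration of how "tar" unpacks an archive, here is a sketch that assumes the download is a gzip-compressed tar file (a name ending in ".tar.gz"). It builds its own small stand-in archive so that it runs anywhere; "blast-demo" is a made-up name, and the "-xzf" options shown are a common combination rather than necessarily the exact ones this tutorial used.

```shell
# Build a small stand-in archive so the sketch runs anywhere; "blast-demo" is a made-up name.
scratch=$(mktemp -d)
cd "$scratch"
mkdir blast-demo
echo "placeholder" > blast-demo/README
tar -czf blast-demo.tar.gz blast-demo   # c = create, z = gzip, f = archive file name
rm -r blast-demo

# Unpack it: x = extract, z = gunzip on the fly, f = read from the named file.
tar -xzf blast-demo.tar.gz
ls blast-demo
```

With the real download you would replace "blast-demo.tar.gz" with the name of the file you fetched.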
creating the NCBI initialization file
We need to create a file that BLAST needs to work properly. It must be in our home directory, called
".ncbirc", and have very specific contents.
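Here is a hedged sketch of what such a file conventionally looks like for the older NCBI BLAST releases: an "[NCBI]" section header followed by a "Data=" line pointing at the "data" directory that ships with BLAST. Treat both the layout and the path as assumptions to check against the README in your own distribution; the "data_dir" location below is hypothetical. The sketch writes to a scratch file rather than to your real ~/.ncbirc so that nothing is overwritten.

```shell
# Write to a scratch file first; copy it to ~/.ncbirc only after checking the path.
scratch=$(mktemp -d)
rcfile="$scratch/ncbirc-example"

# Hypothetical location of BLAST's "data" directory; adjust it to your own install.
data_dir="$HOME/blast/programs/data"

# The unquoted EOF lets the shell fill in $data_dir inside the file.
cat > "$rcfile" <<EOF
[NCBI]
Data=$data_dir
EOF
cat "$rcfile"
```

Once the path matches your setup, the scratch file could be copied into place with "cp" under the name ".ncbirc" in your home directory.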
preparing to run formatdb
BLAST is set up, and we are now ready to format the E. coli sequence files so that we can perform
searches against them. There is a program called formatdb that you must run on any text files
that you wish to BLAST against.
setting the PATH environment variable
Before we can run formatdb, OS X needs to know where the BLAST executables are. We can make it
remember by adding the path to the executables to the PATH environment variable. If that means
nothing to you, don't worry about it. It is UNIX stuff that is beyond the scope of this tutorial.
Fortunately, doing it is easy:
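A minimal sketch of the idea, assuming a Bourne-style shell such as bash, zsh, or sh (tcsh, the default on some older OS X releases, uses a different syntax). The directory and the "demo-tool" program are stand-ins created just for this demonstration; on your machine the directory would be wherever the BLAST executables were unpacked.

```shell
# Stand-in for the directory holding the BLAST executables (hypothetical, made for this demo).
bindir=$(mktemp -d)
printf '#!/bin/sh\necho "hello from a program found via PATH"\n' > "$bindir/demo-tool"
chmod +x "$bindir/demo-tool"

# Append the directory to PATH for the current shell session (Bourne-style syntax).
PATH="$PATH:$bindir"
export PATH

# The shell can now find the program by name alone.
found=$(command -v demo-tool)
echo "$found"
demo-tool
```

A change made this way lasts only until the terminal window is closed; to keep it, the same PATH line is usually added to your shell's startup file.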
running formatdb
Since we are creating a database of the amino acid sequences, and another of the nucleic acid
sequences, we will run formatdb twice.
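The two runs differ mainly in one option. In the sketch below, "-i" names the input file and "-p" says whether it holds protein ("T") or nucleotide ("F") sequences. The name "databases/ecoli.faa" is the one used later in this tutorial; "databases/ecoli.fna" is my assumed name for the nucleotide file, so substitute whatever your file is actually called. The sketch only prints the two commands for review, which means it runs even where formatdb is not installed.

```shell
# databases/ecoli.faa is the name used later in this tutorial;
# databases/ecoli.fna is an assumed name for the nucleotide file.
protein_db="databases/ecoli.faa"
nucleotide_db="databases/ecoli.fna"

# -i names the input file; -p T marks protein sequences, -p F nucleotide sequences.
protein_cmd="formatdb -i $protein_db -p T"
nucleotide_cmd="formatdb -i $nucleotide_db -p F"

# Print the commands for review instead of running them.
echo "$protein_cmd"
echo "$nucleotide_cmd"
```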
creating an example file
We now have the BLAST programs set up and databases formatted and ready to use. What we need now
is a sequence file to test our setup with. We are going to pull one record out of the genome files and use it
as our query; since it came from the database, it will certainly at least match itself. There are several ways of doing this, but a
quick and easy way is to use the "tail" command to view the last few lines of the genome sequence,
count how many of those last lines are one record, and then use the "tail" command again but
redirect the output to a file. Here we go:
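The three steps look roughly like this on a small stand-in file. Everything in "demo.faa" is made up for illustration (the real file is far longer), and the line counts 5 and 3 are specific to this toy file; you will need to count the lines of the last record in your own file.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Made-up protein FASTA file standing in for the real sequence file.
cat > demo.faa <<'EOF'
>prot1 hypothetical protein
MKTAYIAKQR
>prot2 hypothetical protein
MSDNELLKQA
GHWTRPLVEE
EOF

# Step 1: look at the last few lines of the file.
tail -n 5 demo.faa

# Step 2: by eye, the last record is the final 3 lines
# (one ">" header line plus two lines of sequence).

# Step 3: redirect just those lines into a new file.
tail -n 3 demo.faa > testseq.faa
cat testseq.faa
```

The first line of the new file should be a ">" header; if it is not, the count in step 3 was off by a line or two.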
moving back to the BLAST directory
You should now have the testseq.faa file in the original blast directory. If you have followed
this tutorial to this point you can go there with this command:
first BLAST
Let's BLAST the file with the program "blastall". Since we added the programs to your PATH variable,
you can run BLAST anytime by using the "blastall" command. You must use options to tell blastall where
the database and query files you want to use are. Since we are in the same directory as our query file, we
can just give the name of the file; otherwise we would have to use both the path to the file and the name of the
file. So, to BLAST our amino acid test sequence against the amino acid database, we do:
BLAST options
That was fast, wasn't it? Great, you say, but what did we just do? The "-p" option tells BLAST which program to use. Since we wanted to
compare a protein sequence against a protein database, we use "blastp". Here is the full list of available programs:
The "-e" option is used if you want to filter the results that are returned by E value. Using "-e .001" makes blastall report only hits whose E values are 0.001 (1e-3) or better. Finally, the "-m" option always takes a number as its argument and changes the way that blastall displays its results. Values of 8 and 9 return short, tabular rows. Experiment with different values, or omit "-m" to see the traditional BLAST output. For example:
Clear the screen yet again.
50 clear <return>
Perform the BLAST. This time, we'll save (redirect) the output in a file.
51 blastall -p blastp -i testseq.faa -d databases/ecoli.faa -e .001 > results.txt <return>
Look at the file listing in a detailed manner.
52 ls -l <return>
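To get a feel for what an E-value cutoff does, here is a sketch that applies the same 0.001 threshold to a few rows laid out like the tabular ("-m 8") output, which has twelve tab-separated columns with the E value in column 11. The three rows are invented for illustration and are not real BLAST results, and the "awk" filter only mimics the effect of "-e .001" after the fact; it is not how blastall works internally.

```shell
# Hypothetical rows in the 12-column tabular layout (-m 8); all values are invented.
# Columns: query, subject, %identity, length, mismatches, gaps,
#          qstart, qend, sstart, send, evalue, bitscore
rows=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  q1 hitA 100.00 250 0 0 1 250 1 250 1e-140 498 \
  q1 hitB 31.20 93 60 2 10 100 15 105 0.004 38.5 \
  q1 hitC 28.00 50 33 1 5 54 200 249 2.1 27.3)

# Keep only rows whose E value (column 11) is 0.001 or smaller, mimicking "-e .001".
kept=$(printf '%s\n' "$rows" | awk -F '\t' '$11 + 0 <= 0.001')
echo "$kept"
```

Only the "hitA" row survives; the other two have E values above the cutoff.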
graduation
As seen in Figure 20, that command doesn't produce any output. That is because we used the ">"
redirector again, which redirects the output to a file that you name. You can view that file with any text
editor or by using the unix "cat", "more", or "less" commands followed by the filename.
No longer will you rely on NCBI's webserver when it slows to a crawl during the day from overuse! Go forth,
and BLAST locally.