Microarray Data - Laboratorio di Evoluzione Microbica e Molecolare

Bioinformatics Laboratory
Lecture #2
Dr. Marco Fondi
Contact: [email protected]
www.unifi.it/dblemm/ – tel. 0552288308
Dip.to di Biologia Evoluzionistica
Laboratorio di Evoluzione Microbica e Molecolare, Università di Firenze
Lecture #2
a) Web resources for bioinformatics
b) BLAST (Basic Local Alignment Search Tool)
?
Wet-Lab experiments
DATA
Bibliographic Databases
Taxonomic Databases
WEB Databases
Nucleotide Databases
Genomic Databases
Protein Databases
Microarray Databases
Knowledge bases = Biological databases
The starting point of any analysis, bioinformatic or otherwise.
Melanie
Sequence Data/Genome Data
…atgctggactgagtaatcct…
…MQYYLERRSQMPGYTRYMML…
Gene Prediction
(ORF finding)
Protein Structure
Taxonomy
Metabolic pathways
information
Expression profiles
(Microarray Data)
DataBase overview
EMBL-EBI
GenBank
PDB (Protein DataBank) database
JGI Database
Sequence in FASTA Format
>gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald
GGCAGCCAGGCCATGAGATCTTAGGCCATGTCGAGACGTGACGTGGTCCTTACCAATGTTACTGTTGTCC
AGCTACGGCGGGACCGATGCCCATGCCCATGCCCATGCCCATGTCCATGCCCATGCCCTGTGATCAGACC
ACCTCCACCCAAGCTTGAGGATCCACCACCCACGGTTGAAGAACAGCCACCGCCACCGCCGCCGCCACCT
CCACCTCCACCACCACCTCCTCCTCCTCCTCCACCCCAGATAGAGCCAGACAAGTTTGAAGAGGCTCCCC
CTCCCCCTCCCCCTCCTCCTCCTCCTCCCCCTCCCCCTCCTCCACCACTCCAAAAGCCAGCTAGAGAGCT
GACAGTGGGTATCAATGGATTTGGACGCATTGGTCGTCTGGTGCTGCGAGTCTGCATGGAGAAGGGCATT
AGGGTGGTAGCAGTGAATGACCCATTCATTGATCCAGAATACATGGTTTACATGTTCAAATATGACTCCA
CACATGGTAGATACAAAGGAAACGTGGAACATAAGAATGGACAACTAGTTGTGGACAACCTTGAGATCAA
CACGTACCAGTGCAAAGACCCTAAAGAAATCCCCTGGAGCTCTATAGGGAATCCCTACGTGGTGGAGTGT
ACAGGCGTCTATCTGTCCATCGAGGCAGCTTCGGCACATATTTCATCTGGTGCCAGGCGTGTGGTGGTCA
CTGCACCCTCCCCCGATGCACCCATGTTTGTCATGGGAGTGAACGAGAAGGACTATAACCCTGGCTCTAT
GACCATTGTCAGCAATGCATCCTGTACCACCAACTGCCTGGCTCCTCTCGCCAAGGTTATTCATGAAAAC
TTCGGGATCGTGGAAGGGCTAATGACCACAGTCCATTCCTACACAGCCACTCAGAAGACAGTGGATGGGC
CATCAAAGAAGGACTGGCGAGGTGGCCGCGGCGCTCACCAAAACATCATCCCATCGTCCACTGGGGCTGC
CAAGGCTGTAGGCAAAGTCATCCCAGAGCTCAAAGGGAAGCTAACAGGAATGGCATTCCGGGTGCCAACC
CCAAACGTGTCAGTTGTGGACCTGACCTGCCGCCTGGCCAAGCCTGCTTCTTACTCGGCTATCACGGAGG
CTGTGAAAGCTGCAGCCAAGGGACCTTTGGCTGGCATCCTTGCTTACACAGAGGACCAGGTGGTCTCCAC
GGACTTTAACGGCAATCCCCATTCTTCCATCTTTGATGCTAAGGCTGGAATTGCCCTCAATGACAACTTC
GTGAAGCTTGTTGCCTGGTACGACAACGAATATGGCTACAGTAACCGAGTGGTCGACCTCCTCCGCTACA
TGTTTAGCCGAGAGAAGTAACACAAAAGGCCCCTCCTTGCTCCCCTGCGCACCTCGCGTTCCTGACTTCG
GCTTCCACTCAAAGGCGCCGCCACCGGGTCAACAATGAAATAAAAACGAGAATGCGC
FASTA Definition Line
>gi|193425|gb|M60978.1|MUSGAPDS
gi number: 193425
Database identifier: gb
Accession number: M60978.1
Locus Name: MUSGAPDS
Database Identifiers
gb    GenBank
emb   EMBL
dbj   DDBJ
sp    SWISS-PROT
pdb   Protein Databank
pir   PIR
ref   RefSeq
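The fields of the definition line can be pulled apart programmatically. The following is a minimal sketch, assuming the legacy `gi|number|db|accession|locus` layout shown on the slide; the function name `parse_ncbi_defline` and the `DB_TAGS` table are illustrative names, not part of any NCBI tool.

```python
# Database tags as listed on the slide (illustrative lookup table).
DB_TAGS = {
    "gb": "GenBank", "emb": "EMBL", "dbj": "DDBJ", "sp": "SWISS-PROT",
    "pdb": "Protein Databank", "pir": "PIR", "ref": "RefSeq",
}

def parse_ncbi_defline(defline):
    """Split a legacy 'gi|...' FASTA definition line into its fields."""
    header = defline.lstrip(">")
    # The identifier block is the first whitespace-delimited token;
    # everything after it is the free-text description.
    ident, _, description = header.partition(" ")
    parts = ident.split("|")
    # Assumed layout: gi | gi number | db tag | accession.version | locus
    if len(parts) < 5 or parts[0] != "gi":
        raise ValueError("not a gi|number|db|accession|locus definition line")
    return {
        "gi": parts[1],
        "database": DB_TAGS.get(parts[2], parts[2]),
        "accession": parts[3],
        "locus": parts[4],
        "description": description,
    }

record = parse_ncbi_defline(
    ">gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform"
)
print(record["gi"], record["database"], record["accession"], record["locus"])
```

Definition lines downloaded today may not carry the `gi|` prefix, so the sketch raises an error rather than guessing when the layout differs.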
“Text” search -> DB
Sequence in FASTA Format -> BLAST -> DB (sequence similarity search)
DNA molecule
Sequence in FASTA format:
>Chromosome (TITLE)
ATCATTATTGATCCTGATCGGTTAGCAT
CGTATTTCCTTACCGGGACCCCATGATC
GATACAGTAAACCTTAGGATGATTATTG
ATGCTGATCGGTTAGCATCGTATTTCCT
TACCGGGACCCCATGATCGATACAGTA
AACCTTAGGTGATTATTGATCCTGATCG
GTTAGCATCGTATTTCCTTACCGGGACC
CCATGATCGATACAGTAATAATTAGGAT
GATTATTGATCCTGATCGGTTAGCATCG
TATTTCCTTACCGGGACCCCATGATCGA
TACAGTAAACCTTAGGATGATTATTGAT
CCTGATCGGTTAGCATCGTATTTCCTTA
CCGGGACCCCATGATCGATACAGTAAA
CCTTAGATGATTATTGATCCTGATCGGT
ATGCATCGTATTTCCTTACCGGGACCCC
ATGATCGATACAGTAAACCTTAGGTTGA
ATCGTATTTCCTTACCGGGACCCCATGA
TCGATACAGTAAACCTTAGGTAGCATCG
TATTTCCTTACCGGGACCCCATGATCGA
ATGAGTAAACCTTAGGTAGCATTGAATT
TCCTTACCGGGACCCCATGATCGATACA
GTAAACCTTAGG…..
ORF Finder @ NCBI:
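To make the idea of ORF finding concrete, here is a minimal sketch that scans all six reading frames for stretches running from an ATG start codon to the first in-frame stop codon. It is not the algorithm used by NCBI ORF Finder; the function names, the `min_codons` parameter, and the toy sequence are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end, orf) for each ATG...stop ORF.

    start and end are 0-based offsets on the scanned strand; min_codons
    counts the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            # Walk the frame codon by codon.
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs

# Toy sequence (made up for illustration): one short ORF on the + strand.
for orf in find_orfs("CCATGAAATTTTAGGG"):
    print(orf)
```

Real gene prediction also has to deal with alternative start codons, minimum length thresholds tuned to the organism, and (in eukaryotes) introns, none of which this sketch handles.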
I have a gene (a sequence): which metabolic process is it involved in?
Given a metabolic process, which genes are involved?
Metabolic pathways information @ KEGG
Apoptosis in Homo sapiens
Apoptosis in Monodelphis domestica
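Both questions (gene to pathways, pathway to genes) can also be asked programmatically. The sketch below only builds query URLs and assumes the KEGG REST `link` operation at `https://rest.kegg.jp`; the helper names are hypothetical, and the identifiers used (`hsa04210` for apoptosis in Homo sapiens, `mdo` as the organism code for Monodelphis domestica, `hsa:836` as an example human gene) are assumptions that should be checked against KEGG before use.

```python
KEGG_REST = "https://rest.kegg.jp"  # assumed base URL of the KEGG REST service

def pathways_for_gene_url(gene_id):
    """URL asking: in which pathways is this gene involved?"""
    return f"{KEGG_REST}/link/pathway/{gene_id}"

def genes_in_pathway_url(organism, pathway_id):
    """URL asking: which genes of this organism belong to this pathway?"""
    return f"{KEGG_REST}/link/{organism}/{pathway_id}"

# Example identifiers are assumptions, see the text above.
print(pathways_for_gene_url("hsa:836"))
print(genes_in_pathway_url("hsa", "hsa04210"))
print(genes_in_pathway_url("mdo", "mdo04210"))

# To run a query, the URL can be fetched with the standard library, e.g.:
#   from urllib.request import urlopen
#   text = urlopen(genes_in_pathway_url("hsa", "hsa04210")).read().decode()
```

Keeping URL construction separate from the network call makes it easy to inspect a query in the browser before scripting it.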
Every protein has its own 3D structure
Amino acid sequence
NLKTEWPELVGKSVEE
AKKVILQDKPEAQIIVL
PVGTIVTMEYRIDRVR
LFVDKLDNIAEVPRVG
Folding!
Protein Structure in the WEB
Known structures
Structure predictions
If prediction = true
Protein structure prediction
Protein structure @ NCBI
Drug design
Protein-protein docking
Evolution
Proteomics
Functional assignment
Expression profiles (Microarray Data)
Array Analysis
Hierarchical Clustering
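As an illustration of what hierarchical clustering does with expression profiles, here is a small pure-Python sketch of average-linkage agglomerative clustering using 1 minus the Pearson correlation as distance. The choice of linkage and distance is an assumption (the slides do not specify either), and the three toy profiles are made up.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def hierarchical_clustering(profiles):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    # Pairwise distances between genes: 0 = same shape, 2 = opposite shape.
    dist = {frozenset((a, b)): 1 - pearson(profiles[a], profiles[b])
            for a, b in combinations(profiles, 2)}

    def linkage(c1, c2):
        # Average linkage: mean distance over all cross-cluster gene pairs.
        return sum(dist[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        # Merge the two closest clusters and record the step.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Toy expression profiles over four conditions (invented values).
toy_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 9.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
merges = hierarchical_clustering(toy_profiles)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

The merge history is what a dendrogram draws: genes with similar profile shapes join early at small distances, dissimilar ones join late.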
Gene Expression @ NCBI
Expression profile:
Protein-protein interactions
Functional assignment
Proteomics
NCBI (http://www.ncbi.nlm.nih.gov/)
• Entrez interface to databases
– Medline/OMIM
– Genbank/Genpept/Structures
• BLAST server(s)
– Five-plus flavors of blast
• Draft Human Genome
• Much, much more…
INTEGRATION!!!
Things to know and remember about
using web server-based tools
• You are using someone else's computer
• You are (probably) using a restricted subset of the available options
• Very useful for preliminary, “quick” analyses. For more accurate and complex analyses it is preferable to run databases and software locally
• Practice and (intelligent!!!) mistakes are the best way to learn
Sequence Comparison
BLAST
Basic Local Alignment Search Tool
Why compare sequences?
To identify which other organisms possess the gene under study (the query) (e.g. antibiotic production, drug targets)
For a preliminary functional assignment
(hypothetical protein, putative function)
Functional assignment
AACGT
TTGCC
TATAG
Sequence comparison
(BLAST)
protein X – unknown function
Sequence
database
Similar sequences
Transfer of the information
about function
protein X – function A
protein 1 – function A
protein 2 – function A
protein 3 – function A
protein 4 – function A
protein 5 – function A
protein 6 – function A
protein 7 – function A
protein 8 – function A
Sequence in FASTA
Format
QUERY
BLAST
DB
List of sequences similar to the query
BLAST in the web @NCBI
Using Basic BLAST Methods
• Example: MASH-1 protein sequence
from mouse
• Can I find similar proteins in Human?
Input Query
Choose Database
Submitting Your Query
• Input query sequence
– FASTA
– Raw
– Accession/ ID
• Choose Database
– Many available; varies with program
– For complete list follow the link to:
Finds Conserved Domains
Limit results with
entrez query
E-Value cut off
Submitting Your Query
• CD Search
– Finds conserved domains in query
sequence
– Compares to patterns and profiles of CDs
• Limit by entrez query
– Restricts results to single organism etc.
• E-value cut off
– Restricts results to ones falling below
defined e-value
– Default = 10
– Will revisit concept of e-value
Filtering
Matrix
Gap Penalties
Submitting Your Query
• Low complexity filtering
– Low complexity sequence can lead to
spurious alignments
– Filtering “hides” these regions
– On by default
– SEG (proteins) or DUST (nucleic acids)
– Should turn it off in some cases… what if
your entire sequence gets filtered?
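The effect of low complexity filtering can be illustrated with a simple sliding-window entropy mask. This is not the SEG or DUST algorithm; it is a rough sketch under assumed parameters (`window=12`, `threshold=2.2` bits, both chosen arbitrarily) that replaces compositionally biased stretches with `X`.

```python
from collections import Counter
from math import log2

def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a string."""
    n = len(window)
    return -sum(c / n * log2(c / n) for c in Counter(window).values())

def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Mask every window whose entropy falls below the threshold."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < threshold:
            masked[i:i + window] = mask_char * window
    return "".join(masked)

# Toy protein (made up): a proline-rich run between two ordinary stretches.
toy = "MKTAYIAKQR" + "P" * 15 + "LVEDGHWCNS"
print(mask_low_complexity(toy))

# The case raised on the slide: a sequence that is low complexity throughout
# is masked entirely, leaving nothing to seed an alignment.
print(mask_low_complexity("A" * 16))
```

The second call shows why filtering sometimes has to be switched off: a fully masked query cannot produce any hit.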
Submitting Your Query
• Choice of scoring matrix
– Different ones available
– BLOSUM matrices based on observed
frequencies of a.a. substitutions
– Each tailored to different levels of
sequence divergence and length
– BLOSUM 62 = default
– Shown to be best at detecting most protein
similarities… don’t usually need to change
– Follow link for detailed information
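A scoring matrix is used by summing one value per aligned residue pair. The sketch below scores an ungapped alignment with a handful of BLOSUM62 entries typed in by hand; the subset, the function names, and the toy peptides are illustrative, and a real analysis would load the full matrix from a library or from NCBI rather than rely on this excerpt.

```python
# A few BLOSUM62 entries (excerpt only; the full matrix is 20 x 20).
BLOSUM62_SUBSET = {
    ("W", "W"): 11, ("I", "I"): 4, ("L", "L"): 4, ("D", "D"): 6,
    ("I", "L"): 2, ("L", "V"): 1, ("D", "E"): 2,
}

def pair_score(a, b):
    """Look up a residue pair; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]  # KeyError if the pair is not in the excerpt

def ungapped_score(seq1, seq2):
    """Sum of pair scores for two aligned sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))

print(ungapped_score("WILD", "WILD"))  # identical residues
print(ungapped_score("WILD", "WLVE"))  # conservative substitutions
```

Conservative substitutions such as I/L or D/E still score positively, and rare residues such as W score much higher on identity than common ones.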
Submitting Your Query
• Gap Penalties
– Accounts for insertions and deletions in
different sequences
– Scores are penalized for gaps to prevent
aberrant alignments
– Opening penalty is high; extension penalty
is lower
– Defaults may change depending on matrix
choice
– Rarely need to change default value
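The difference between opening and extending a gap can be shown with a one-line affine cost function. This sketch assumes the convention that a gap of k residues costs opening + k × extension, and uses 11 and 1 as values, which are believed to be the usual blastp defaults with BLOSUM62; both the convention and the numbers should be checked against the BLAST version in use.

```python
def gap_cost(length, gap_open=11, gap_extend=1):
    """Affine gap penalty: one opening charge plus a per-residue charge."""
    if length < 1:
        raise ValueError("a gap has at least one residue")
    return gap_open + length * gap_extend

# One gap of three residues versus three separate single-residue gaps.
print(gap_cost(3))      # single long gap
print(3 * gap_cost(1))  # three short gaps
```

Because opening is expensive and extension is cheap, the alignment favors a few long gaps over many scattered ones, which is closer to how insertions and deletions tend to occur.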
Protein Words
Query: GTQITVEDLFYNIATRRKALKN
Word size = 3 (default)
Word size can only be 2 or 3
Make a lookup table of words:
GTQ
TQI
QIT
ITV
TVE
VED
EDL
DLF
...
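The lookup table of words can be sketched in a few lines: every overlapping word of the query is stored together with the position where it starts. This is a simplification, since BLAST for proteins also adds similar "neighborhood" words that score above a threshold; the function name `build_word_table` is illustrative.

```python
from collections import defaultdict

def build_word_table(query, word_size=3):
    """Map each overlapping word of the query to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i)
    return dict(table)

table = build_word_table("GTQITVEDLFYNIATRRKALKN")
print(list(table)[:8])  # first words, in query order
print(table["TVE"])     # where the word TVE starts in the query
```

With the table in place, each word read from a database sequence can be checked against the query in a single dictionary lookup.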
Query: GTQITVEDLFYNIATRRKALKN
Words: GTQ, TQI, QIT, ITV, TVE, VED, EDL, DLF, ...
Match! The word GTQ is found in the DB; extend the alignment in both directions from the match.
DB
TVEDLFRRLKIAGTQEDLRRT
GGHPYTTFWWYQLMERGTQ
GRTHPYTTTWWEWHHRGTQ
GRTHPYTTTWWEWHHRGTQ
GRTHPYTTTWWEWHHRGTQ
GRTHPYTTTWWEWHHRGTQ
Query: GTQITVEDLFYNIATRRKALKN
TVEDLFRRLKIAGTQEDLRRT    Score
GGHPYTTFWWYQLMERGTQ      Score
GRTHPYTTTWWEWHHRGTQ      Score
GRTHPYTTTWWEWHHRGTQ      Score
GRTHPYTTTWWEWHHRGTQ      Score
…..                      …..
GRTHPYTTTWWEWHHRGTQ      Score
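The word match, extend, and score steps can be put together in a toy seed-and-extend routine. This is a sketch of the idea only: it uses exact word matches, a flat +1/-1 score instead of a substitution matrix, ungapped extension, and a simple drop-off rule (`x_drop`) to stop extending. All names and parameter values are assumptions made for illustration.

```python
def seed_and_extend(query, subject, word_size=3, match=1, mismatch=-1, x_drop=2):
    """Return the best ungapped hit as (score, q_start, s_start, length), or None."""
    # Lookup table: query word -> start positions in the query.
    words = {}
    for i in range(len(query) - word_size + 1):
        words.setdefault(query[i:i + word_size], []).append(i)

    def extend(q, s, step):
        """Extend from (q, s) in one direction; return (best gain, residues kept)."""
        score = best = best_len = n = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            score += match if query[q] == subject[s] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score > x_drop:
                break  # score dropped too far below the best seen so far
            q += step
            s += step
        return best, best_len

    best_hit = None
    for s in range(len(subject) - word_size + 1):
        for q in words.get(subject[s:s + word_size], []):
            right, r_len = extend(q + word_size, s + word_size, +1)
            left, l_len = extend(q - 1, s - 1, -1)
            hit = (word_size * match + left + right,
                   q - l_len, s - l_len, word_size + l_len + r_len)
            if best_hit is None or hit[0] > best_hit[0]:
                best_hit = hit
    return best_hit

query = "GTQITVEDLFYNIATRRKALKN"
for subject in ("TVEDLFRRLKIAGTQEDLRRT", "GGHPYTTFWWYQLMERGTQ"):
    print(subject, seed_and_extend(query, subject))
```

On the slide's example the first database sequence shares the stretch TVEDLF with the query, so its seed grows into a longer, higher-scoring hit, while the second sequence only matches the single word GTQ.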
E-values
Bit Scores
Click for more info
Take note
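The relationship between the two numbers reported for each hit can be sketched with the standard formula E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the total database length. The sketch below is a simplification that ignores the effective-length corrections BLAST applies; the function names and the example sizes are illustrative.

```python
def expected_hits(bit_score, query_len, db_len):
    """E-value: number of hits with at least this bit score expected by chance."""
    return query_len * db_len * 2.0 ** (-bit_score)

def passes_cutoff(bit_score, query_len, db_len, e_value_cutoff=10.0):
    """True if the hit would be reported under the given E-value cut off."""
    return expected_hits(bit_score, query_len, db_len) <= e_value_cutoff

# Made-up sizes: a 1024-residue query against a 1024-residue database.
print(expected_hits(20, 1024, 1024))
print(expected_hits(21, 1024, 1024))  # one more bit halves the E-value
print(passes_cutoff(20, 1024, 1024))
```

The same bit score gives a larger E-value in a larger database, which is why E-values from searches against different databases are not directly comparable while bit scores are.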
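Basic BLAST programs and databases
Program   Query                                         Database
blastn    Nucleotide sequence                           Nucleotide DB
blastp    Protein sequence                              Protein DB
blastx    Nucleotide sequence, translated in 6 frames   Protein DB
tblastn   Protein sequence                              Nucleotide DB, translated in 6 frames
tblastx   Nucleotide sequence, translated in 6 frames   Nucleotide DB, translated in 6 frames
A translated query or translated DB contains amino acid sequences.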
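Choosing a program comes down to the type of the query and the type of the database. The following small sketch encodes the five basic programs as a lookup table; the dictionary layout and the helper name `programs_for` are illustrative.

```python
# For each program: type of the query, type of the database, and which
# side (if any) is translated in six frames before comparison.
BLAST_PROGRAMS = {
    "blastn":  {"query": "nucleotide", "db": "nucleotide", "translated": ()},
    "blastp":  {"query": "protein",    "db": "protein",    "translated": ()},
    "blastx":  {"query": "nucleotide", "db": "protein",    "translated": ("query",)},
    "tblastn": {"query": "protein",    "db": "nucleotide", "translated": ("db",)},
    "tblastx": {"query": "nucleotide", "db": "nucleotide", "translated": ("query", "db")},
}

def programs_for(query_type, db_type):
    """List the programs that accept this query type and database type."""
    return [name for name, p in BLAST_PROGRAMS.items()
            if p["query"] == query_type and p["db"] == db_type]

print(programs_for("nucleotide", "protein"))     # DNA query, protein database
print(programs_for("nucleotide", "nucleotide"))  # two options: direct or translated
```

For a nucleotide query against a nucleotide database there are two choices: blastn compares the DNA directly, while tblastx compares the six-frame translations of both sides.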
