So, using the sequences of a homologous
gene across several species,
our aim is to reconstruct the phylogenetic
tree of the corresponding species.
For this, we have to compare
sequences and compute distances
between these sequences. We
saw last week how we were
able to measure the similarity
between sequences, and we can
use this similarity as a measure
of distance between sequences.
So we will compare pairs of
sequences, measure the similarity
and store the resulting distance
value in what we
could call a matrix or an array.
Before going further, let's be
more explicit about these
two terms: they are not equivalent,
but some people mix them up.
The matrix is a mathematical object,
it's something you manipulate
when you do linear algebra,
it's a mathematical concept.
An array is a computer
science concept.
An array is a data structure.
It is true that a matrix may be
implemented as an array in a
program or an algorithm, but not
all arrays are matrices,
so be careful when you use the
term matrix or its counterpart
in computer science, array:
they are not completely equivalent.
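To make the distinction concrete, here is a minimal Python sketch. The variable names and the helper `could_be_matrix` are hypothetical, chosen only for illustration. A nested list is an array in the data-structure sense; only when it is rectangular and holds numbers can it stand in for a matrix.

```python
from numbers import Number

def could_be_matrix(array):
    """Return True if a list of lists is rectangular and purely numeric."""
    if not array:
        return False
    width = len(array[0])
    return all(
        len(row) == width and all(isinstance(x, Number) for x in row)
        for row in array
    )

# A rectangular array of numbers: it can represent a matrix.
distance_table = [[0, 2], [2, 0]]

# Still an array, but ragged and made of strings: not a matrix.
sequence_groups = [["ACGT", "ACGA"], ["TTGA"]]

print(could_be_matrix(distance_table))   # True
print(could_be_matrix(sequence_groups))  # False
```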
So from now on we will speak of
arrays because we will speak
of an algorithm filling an array
with distance values.
How do we proceed? We will not give
the details of the algorithm here,
but its input
is a set of genomic sequences.
This set may be large, so it will
be stored in a file.
What is a file in computer science?
Well, it's not the same as a
file in the real world, let's say.
It is a series of pieces of information
stored sequentially.
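Reading such a file sequentially could look like the sketch below. It assumes a layout with one sequence per line, which is only an assumption made for the example (real genomic data often comes in formats such as FASTA). The function name `read_sequences` is hypothetical, and `io.StringIO` stands in for a real file so that the example is self-contained.

```python
import io

def read_sequences(handle):
    """Yield sequences one at a time, in file order, skipping blank lines."""
    for line in handle:
        sequence = line.strip()
        if sequence:
            yield sequence

# Stand-in for a file on disk; with a real file you would use open(path).
toy_file = io.StringIO("ACGTAC\nACGTTC\nTCGTTA\n")

sequences = list(read_sequences(toy_file))
print(sequences)  # ['ACGTAC', 'ACGTTC', 'TCGTTA']
```

Because the function is a generator, it hands out one sequence at a time instead of loading the whole file into memory.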
So the idea is this: we may have
a large file of several tens or
hundreds of thousands of sequences.
What we do is read a sequence
from the file; here we have
a very small file of three sequences.
We read a sequence in the file
and we compute the distances
between this sequence and all the
other sequences of the file,
which we have to read too, of course.
So with the first one, of course,
the distance is zero; with
the second one, we fill in the
second column of the array,
and likewise for the third sequence.
The first row is for the first
sequence. Now we repeat the process with
the next sequence: we read it,
we compare it with the other
sequences of the file.
Of course, when we compare the
sequence with itself we have zero.
We do the same for the last one,
and you understand the process:
what we obtain
is a complete array.
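The filling process just described can be sketched as follows. The lecture does not specify the distance here, so this example uses the Hamming distance (the number of differing positions between two sequences of equal length) purely as a stand-in. The three sequences and the function names are made up for illustration.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def distance_array(sequences):
    """Fill an n x n array: row i holds the distances from sequence i."""
    n = len(sequences)
    array = [[0] * n for _ in range(n)]
    for i, current in enumerate(sequences):
        # Compare the current sequence with every sequence, itself included.
        for j, other in enumerate(sequences):
            array[i][j] = hamming_distance(current, other)
    return array

sequences = ["ACGTAC", "ACGTTC", "TCGTTA"]
for row in distance_array(sequences):
    print(row)
# [0, 1, 3]
# [1, 0, 2]
# [3, 2, 0]
```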
Some people, again, may call it a
matrix, and they would say that
the matrix is symmetric: these
values are equal, and these, and these.
Why? Remember that
it is a distance.
The distance between a first
sequence and a second sequence
is equal to the distance between the
second sequence and the first one.
So it's logical to find a
symmetric matrix, and we have zeros
on the diagonal because of the
fact that the distance between
a sequence and itself is zero.
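These two properties suggest a possible shortcut, which is not something the lecture prescribes: compute each pair only once and mirror the value, leaving the diagonal at zero. The sketch below again uses the Hamming distance as an assumed stand-in, and all function names are hypothetical.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def symmetric_distance_array(sequences):
    """Compute only the pairs with j > i, then mirror them across the diagonal."""
    n = len(sequences)
    array = [[0] * n for _ in range(n)]  # the diagonal stays at zero
    for i in range(n):
        for j in range(i + 1, n):
            d = hamming_distance(sequences[i], sequences[j])
            array[i][j] = d
            array[j][i] = d  # d(i, j) == d(j, i)
    return array

def is_symmetric_with_zero_diagonal(array):
    """Check the two properties: zeros on the diagonal and symmetry."""
    n = len(array)
    zero_diagonal = all(array[i][i] == 0 for i in range(n))
    symmetric = all(
        array[i][j] == array[j][i] for i in range(n) for j in range(n)
    )
    return zero_diagonal and symmetric

result = symmetric_distance_array(["ACGTAC", "ACGTTC", "TCGTTA"])
print(is_symmetric_with_zero_diagonal(result))  # True
```

For n sequences this computes n(n-1)/2 distances instead of n squared, which matters when the file holds many sequences.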
OK. This will be the basis for
our algorithm: we will work on
such an array and combine
rows and columns in order
to group species
accordingly in a tree.