WEBVTT
00:00:00.550 --> 00:00:08.670
So, using the sequences of a homologous
gene across several species,
00:00:08.670 --> 00:00:18.780
our aim is to reconstruct the phylogenetic
tree of the corresponding species.
00:00:20.000 --> 00:00:26.220
For this, we have to compare
sequences and compute distances
00:00:26.220 --> 00:00:32.400
between these sequences. We
saw last week how we were
00:00:32.430 --> 00:00:37.860
able to measure the similarity
between sequences and we can
00:00:37.860 --> 00:00:42.970
use this similarity as a measure
of distance between sequences.
00:00:44.010 --> 00:00:52.240
So we will compare pairs of
sequences, measure the similarity
00:00:52.460 --> 00:00:58.420
and store the value of the distance,
or similarity, in what we
00:00:58.420 --> 00:01:00.870
could call a matrix or an array.
00:01:01.700 --> 00:01:07.770
Before going further, let's clarify
the use of these
00:01:07.770 --> 00:01:14.080
two terms. They are not equivalent,
but some people mix them up.
00:01:15.520 --> 00:01:20.920
A matrix is a mathematical object;
it's something you manipulate
00:01:20.920 --> 00:01:26.500
when you do linear algebra,
it's a mathematical concept.
00:01:26.890 --> 00:01:30.780
An array is a computer
science concept.
00:01:31.210 --> 00:01:33.080
An array is a data structure.
00:01:34.070 --> 00:01:38.600
It is true that a matrix may be
implemented as an array in a
00:01:38.600 --> 00:01:45.250
program or an algorithm, but not
all arrays are matrices,
00:01:45.600 --> 00:01:51.040
so be careful when you use the
term matrix or its counterpart in
00:01:51.410 --> 00:01:55.370
computer science, array, which is
not completely equivalent.
00:01:55.720 --> 00:02:00.930
So from now on we will speak of
arrays because we will speak
00:02:00.930 --> 00:02:06.800
of an algorithm that fills an array
with the distance values.
00:02:08.790 --> 00:02:16.050
How do we proceed? The input of
our algorithm, we will not give
00:02:16.050 --> 00:02:20.350
the details of the algorithm here
but the input of our algorithm
00:02:20.410 --> 00:02:23.900
is a set of genomic sequences.
00:02:25.230 --> 00:02:31.000
This set may be large, so it will
be stored and made available in a file.
00:02:32.010 --> 00:02:34.040
What is a file in computer science?
00:02:34.070 --> 00:02:37.960
Well, it's not the same as a
file in the real world, let's say.
00:02:40.070 --> 00:02:44.090
It is a series of information items
stored here sequentially.
00:02:45.440 --> 00:02:51.530
So the idea is this: we may have
a large file of several tens or
00:02:51.770 --> 00:02:54.630
hundreds of thousands of sequences.
00:02:55.850 --> 00:03:00.830
What we do is read a sequence
from the file. Here we have
00:03:01.340 --> 00:03:05.060
a very small file of three sequences.
00:03:05.690 --> 00:03:09.120
We read a sequence in the file
and we compute the distances
00:03:09.120 --> 00:03:13.280
between this sequence and all the
other sequences of the file,
00:03:13.550 --> 00:03:14.850
we have to read them of course.
00:03:15.800 --> 00:03:20.490
So with the first one, of course,
the distance is zero; with
00:03:20.490 --> 00:03:25.790
the second one, we fill in the
second column of the array.
00:03:26.810 --> 00:03:32.340
The first row is for the first
sequence. Then comes the third sequence.
00:03:32.760 --> 00:03:38.640
Now we repeat the process with
the next sequence: we read it,
00:03:38.980 --> 00:03:42.950
we compare it with the other
sequences of the file.
00:03:43.550 --> 00:03:47.030
Of course when we compare the
sequence with itself we have zero
00:03:47.900 --> 00:03:51.460
and we do the same for the last one.
You understand the process,
00:03:51.690 --> 00:03:54.750
so that what we obtain
is a complete array.
00:03:56.470 --> 00:04:00.450
Some people again may call it a
matrix and they would say that
00:04:00.730 --> 00:04:08.460
the matrix is symmetric: of course these
values are equal, and these, and these.
00:04:08.770 --> 00:04:12.460
Why? Remember that
it is a distance.
00:04:13.120 --> 00:04:16.620
The distance between a first
sequence and a second sequence
00:04:16.650 --> 00:04:19.740
is equal to the distance between the
second sequence and the first one.
00:04:20.050 --> 00:04:25.200
So it's logical to find a
symmetric matrix and we have zero
00:04:25.200 --> 00:04:28.800
on the diagonal because of the
fact that the distance between
00:04:29.180 --> 00:04:31.030
a sequence and itself is zero.
00:04:31.610 --> 00:04:37.170
OK. This will be the basis for
our algorithm: we will work on
00:04:37.170 --> 00:04:44.070
such an array and combine
rows and columns in order
00:04:44.070 --> 00:04:47.870
to group species
accordingly in a tree.
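To make the filling of the array a bit more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the lecture does not say which distance is used, so the Levenshtein edit distance stands in for it, and the file layout (one sequence per line), the three toy sequences, and the names edit_distance, read_sequences and distance_array are all hypothetical.

```python
import io


def edit_distance(a, b):
    """Levenshtein distance, used here as a stand-in for the course's measure."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def read_sequences(handle):
    """Read one sequence per non-empty line (assumed, hypothetical layout)."""
    return [line.strip() for line in handle if line.strip()]


def distance_array(sequences):
    """Fill an n x n array (list of lists) with pairwise distances."""
    n = len(sequences)
    dist = [[0] * n for _ in range(n)]  # the diagonal stays at zero
    for i in range(n):
        for j in range(i + 1, n):
            d = edit_distance(sequences[i], sequences[j])
            dist[i][j] = d  # distance(first, second) ...
            dist[j][i] = d  # ... equals distance(second, first)
    return dist


# A very small "file" of three made-up sequences, held in memory.
toy_file = io.StringIO("ACGT\nACGA\nTCGA\n")
seqs = read_sequences(toy_file)
array = distance_array(seqs)
for row in array:
    print(row)
```

Note that this sketch computes each pair only once and copies the value to the mirrored cell, which relies on the symmetry mentioned above; comparing every sequence with all the others, as described in the lecture, would give the same array.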