WEBVTT

1
00:00:05.509 --> 00:00:09.952
Welcome to dimensional Data Modeling Day 2 lab Today is gonna be a hands-on

2
00:00:09.952 --> 00:00:10.269
exercises

3
00:00:10.269 --> 00:00:14.626
that builds on the day one lab where we built a cumulative table that tracks all

4
00:00:14.626 --> 00:00:14.898
of

5
00:00:14.898 --> 00:00:19.606
the NBA players' seasons and their stats throughout the seasons This time around

6
00:00:19.606 --> 00:00:19.969
we

7
00:00:19.969 --> 00:00:23.964
are going to be taking those data sets and converting them into slowly changing

8
00:00:23.964 --> 00:00:24.250
dimension

9
00:00:24.250 --> 00:00:29.065
type 2 because remember slowly changing dimension type 2 is the gold standard of

10
00:00:29.065 --> 00:00:29.408
all

11
00:00:29.408 --> 00:00:33.291
the slowly changing dimension types It's the only one that is purely item potent

12
00:00:33.291 --> 00:00:33.569
so

13
00:00:33.569 --> 00:00:38.273
we're gonna be using postcress in today's lab If you don't have the stuff

14
00:00:38.273 --> 00:00:38.609
installed

15
00:00:38.728 --> 00:00:43.023
make sure to clone the repo in the description below You're gonna need Docker

16
00:00:43.023 --> 00:00:43.329
installed

17
00:00:43.329 --> 00:00:47.075
and you're gonna need to have that set up before The lab to get the most out of

18
00:00:47.075 --> 00:00:47.283
this

19
00:00:47.283 --> 00:00:51.726
because this is hands on exercises and we learn data engineering by doing here

20
00:00:51.726 --> 00:00:52.043
And

21
00:00:52.043 --> 00:00:55.996
yeah if you want to learn how to do this type of modeling with the hottest

22
00:00:55.996 --> 00:00:56.243
technologies

23
00:00:56.243 --> 00:01:01.064
like iceberg and Trino I highly recommend checking out joining the Data Expert

24
00:01:01.064 --> 00:01:01.435
Academy

25
00:01:01.435 --> 00:01:05.289
in the description below First we need to like probably drop and recreate the

26
00:01:05.289 --> 00:01:05.564
players'

27
00:01:05.564 --> 00:01:09.002
table I made a mistake in the last lab where like I didn't add the columns that

28
00:01:09.002 --> 00:01:09.204
I

29
00:01:09.204 --> 00:01:11.963
wanted for this lab and I forgot that these labs like build on top of each other

30
00:01:12.558 --> 00:01:16.950
So anyways I'll just go over this real quickly to kind of go over what what I

31
00:01:16.950 --> 00:01:17.209
needed

32
00:01:17.209 --> 00:01:22.216
to add so we needed this is active column in the players' table which just says

33
00:01:22.216 --> 00:01:22.528
whether

34
00:01:22.528 --> 00:01:28.791
or not the player is active in the current season we need that and then also um

35
00:01:28.791 --> 00:01:29.159
uh

36
00:01:29.159 --> 00:01:33.182
we want to all this this query does is it like does this crazy thing where it

37
00:01:33.182 --> 00:01:33.418
generates

38
00:01:33.418 --> 00:01:36.459
a series of all the seasons and then it

39
00:01:37.168 --> 00:01:40.049
creates that same uh like

40
00:01:41.000 --> 00:01:46.689
Uh array table that we had from the previous session What we're gonna do is

41
00:01:47.909 --> 00:01:49.879
I already ran all that So

42
00:01:50.668 --> 00:01:53.439
what we're gonna do is we're going to uh

43
00:01:54.838 --> 00:01:58.346
We're going to create a table We're gonna create a table create a table We'll

44
00:01:58.346 --> 00:01:58.579
call

45
00:01:58.579 --> 00:02:00.400
this uh we'll call this player's SCD

46
00:02:01.019 --> 00:02:05.650
So the main things that I I I care about two dimensions in that other table

47
00:02:05.650 --> 00:02:05.939
right

48
00:02:06.019 --> 00:02:09.081
So let's let's go ahead and select some things from this table real quick

49
00:02:09.081 --> 00:02:09.300
scoring

50
00:02:09.300 --> 00:02:10.808
class and is active

51
00:02:11.419 --> 00:02:14.899
from players where current season equals 2022

52
00:02:15.879 --> 00:02:20.642
So you'll see in this table we have uh a scoring class that's put player name

53
00:02:20.642 --> 00:02:20.939
here

54
00:02:22.588 --> 00:02:28.414
And you'll see with this we have access to the player name the scoring class and

55
00:02:28.414 --> 00:02:28.778
whether

56
00:02:28.778 --> 00:02:30.528
or not they were active that season

57
00:02:31.058 --> 00:02:35.889
and so what we wanna do is we essentially wanna create

58
00:02:36.419 --> 00:02:37.300
um

59
00:02:38.800 --> 00:02:44.850
An SCD table that models like from what year to what year they were either like

60
00:02:44.850 --> 00:02:45.229
we

61
00:02:45.349 --> 00:02:49.775
we wanna like we wanna see we wanna track changes in two columns because that's

62
00:02:49.775 --> 00:02:50.069
one

63
00:02:50.069 --> 00:02:53.894
of the things that's cool about SCD tables is that like you can actually track

64
00:02:53.894 --> 00:02:54.149
multiple

65
00:02:54.149 --> 00:02:59.044
columns at once and that's what we're going to be doing uh today so what we

66
00:02:59.044 --> 00:02:59.349
wanna

67
00:02:59.349 --> 00:03:03.288
do right is let's create our players SCD table

68
00:03:03.909 --> 00:03:07.763
uh what we're gonna have in here is we're gonna have a player name and this is

69
00:03:07.763 --> 00:03:07.990
gonna

70
00:03:07.990 --> 00:03:08.538
be a text

71
00:03:09.308 --> 00:03:13.288
And then we're gonna have a scoring class uh which is a scoring class

72
00:03:13.860 --> 00:03:15.939
We have is active which is a boolean

73
00:03:16.679 --> 00:03:20.069
And then we have a current season uh which is an integer

74
00:03:20.800 --> 00:03:26.538
And then our um primary key here is gonna be player name uh current season

75
00:03:27.569 --> 00:03:32.330
It's gonna be our uh oh yeah we have a couple more things here though right So

76
00:03:32.330 --> 00:03:32.610
this

77
00:03:32.610 --> 00:03:39.028
is because we're trying to do a type 2 so we need to add two more um

78
00:03:39.719 --> 00:03:44.519
columns here right which is gonna be start season which I call an integer and

79
00:03:44.519 --> 00:03:44.838
end

80
00:03:44.838 --> 00:03:46.159
season which is an integer

81
00:03:46.808 --> 00:03:53.401
So you're gonna have both the start season and the end season as well so this is

82
00:03:53.401 --> 00:03:53.788
now

83
00:03:54.199 --> 00:03:59.137
essentially a properly modeled SCD right We have our two these are our columns

84
00:03:59.137 --> 00:03:59.490
we're

85
00:03:59.490 --> 00:04:03.314
tracking and then this is gonna be our current season this is usually gonna be

86
00:04:03.314 --> 00:04:03.569
the

87
00:04:03.569 --> 00:04:09.120
last column like in uh like the big data world this current season you could

88
00:04:09.120 --> 00:04:09.490
think

89
00:04:09.490 --> 00:04:14.151
of as like the date partition of the of the data table so you can kind of like

90
00:04:14.151 --> 00:04:14.409
think

91
00:04:14.409 --> 00:04:18.447
about it that way And then you have the start season and end season and then

92
00:04:18.447 --> 00:04:18.699
these

93
00:04:18.699 --> 00:04:22.412
are the two columns we wanna track So one of the things I want to show here is

94
00:04:22.412 --> 00:04:22.619
like

95
00:04:22.619 --> 00:04:26.309
we wanna see how Michael Jordan kind of like cause he like retired and then he

96
00:04:26.309 --> 00:04:26.540
came

97
00:04:26.540 --> 00:04:31.976
back right and we wanna like see how his records change over time and we're

98
00:04:31.976 --> 00:04:32.338
we're

99
00:04:32.420 --> 00:04:36.019
we're gonna do this first What we're gonna do is we're gonna write the query

100
00:04:36.019 --> 00:04:36.259
that

101
00:04:36.259 --> 00:04:42.071
looks at all of history and creates one SCD record from all of history That's

102
00:04:42.071 --> 00:04:42.459
what

103
00:04:42.459 --> 00:04:49.061
we're gonna do first And then after that we will I'll show how to then take this

104
00:04:49.061 --> 00:04:49.449
query

105
00:04:49.899 --> 00:04:55.749
uh take an existing SCD and then build on top of it incrementally because you

106
00:04:55.749 --> 00:04:56.139
can

107
00:04:56.139 --> 00:05:01.966
do it either way So let's go ahead and uh run this code here run this create

108
00:05:01.966 --> 00:05:02.309
table

109
00:05:02.540 --> 00:05:03.939
It already exists Great

110
00:05:04.678 --> 00:05:06.309
OK so

111
00:05:07.470 --> 00:05:12.304
Here's a good question like how are we gonna do this Like how do we create um

112
00:05:12.304 --> 00:05:12.588
like

113
00:05:12.588 --> 00:05:17.575
so players like if you ran that other big nasty query I sent you like players is

114
00:05:17.575 --> 00:05:17.869
gonna

115
00:05:17.869 --> 00:05:21.962
have data all the way back to 1996 as well right You'll see we have all sorts of

116
00:05:21.962 --> 00:05:22.189
data

117
00:05:22.189 --> 00:05:23.379
in this players' table

118
00:05:23.988 --> 00:05:24.480
So

119
00:05:25.160 --> 00:05:31.147
how do we create a player's table with no filter here So what we wanna do right

120
00:05:31.147 --> 00:05:31.500
is

121
00:05:31.709 --> 00:05:34.420
we have a players' scoring class

122
00:05:35.069 --> 00:05:36.009
is active

123
00:05:36.699 --> 00:05:38.569
What we wanna do is we wanna

124
00:05:39.278 --> 00:05:46.149
calculate the streak of like how many seasons or how long that they were in this

125
00:05:46.149 --> 00:05:46.579
uh

126
00:05:47.399 --> 00:05:49.399
Like in a current dimension

127
00:05:50.220 --> 00:05:55.809
And how we do that is we look at like what was the dimension before

128
00:05:56.379 --> 00:05:59.369
That's how we can see how this works So

129
00:06:00.389 --> 00:06:05.310
And how do we do that Like well we use window functions right So let's go ahead

130
00:06:05.310 --> 00:06:05.600
and

131
00:06:05.600 --> 00:06:09.100
look at lag right scoring class 1

132
00:06:10.439 --> 00:06:14.984
we need to uh partition this though right So we're gonna say we gotta put over

133
00:06:14.984 --> 00:06:15.269
we

134
00:06:15.269 --> 00:06:19.524
say partition by uh and then in our partition here we want a partition by player

135
00:06:19.524 --> 00:06:19.790
name

136
00:06:20.199 --> 00:06:24.973
and then we have an order by or order by here is gonna be current season so we

137
00:06:24.973 --> 00:06:25.238
can

138
00:06:25.238 --> 00:06:33.333
see who like and this is gonna be as previous scoring Class right And let's put

139
00:06:33.333 --> 00:06:33.838
is

140
00:06:33.838 --> 00:06:39.726
active up here just to keep things going So then we wanna just essentially copy

141
00:06:39.726 --> 00:06:40.119
this

142
00:06:40.119 --> 00:06:41.528
again for is active

143
00:06:42.199 --> 00:06:45.720
and then this will be previous is active So this will be

144
00:06:46.358 --> 00:06:51.064
depending on what the previous season was that will be what we get out of this

145
00:06:51.064 --> 00:06:51.358
So

146
00:06:51.358 --> 00:06:55.325
if we run this you'll see uh oh we need current season in here as well put

147
00:06:55.325 --> 00:06:55.559
current

148
00:06:55.559 --> 00:06:57.379
season here so

149
00:06:59.670 --> 00:07:05.619
You'll see OK like here is AC Green 1996 he was bad and um

150
00:07:06.769 --> 00:07:11.697
is active is true right His previous um you'll see his previous scoring class

151
00:07:11.697 --> 00:07:12.048
and

152
00:07:12.048 --> 00:07:14.209
previous is active These are both null

153
00:07:15.040 --> 00:07:20.216
Um so I you see that this like when it's a line like this that means no If it's

154
00:07:20.216 --> 00:07:20.488
empty

155
00:07:20.488 --> 00:07:21.420
that means false

156
00:07:22.439 --> 00:07:28.595
So um that makes sense because he didn't play in 1995 or there's no data for

157
00:07:28.595 --> 00:07:28.980
1995

158
00:07:29.000 --> 00:07:33.162
so that makes sense but you'll see how you can go forward and forward and

159
00:07:33.162 --> 00:07:33.439
forward

160
00:07:33.439 --> 00:07:34.428
here and like

161
00:07:35.000 --> 00:07:38.839
he essentially was just bad the whole time right And so you can see like other

162
00:07:38.839 --> 00:07:39.079
players

163
00:07:39.079 --> 00:07:42.608
like they're like they're pretty consistent right But this might not always be

164
00:07:42.608 --> 00:07:42.879
the

165
00:07:42.879 --> 00:07:48.734
case right where uh so what we're gonna do is we're going to create a CTE here

166
00:07:48.734 --> 00:07:49.079
we're

167
00:07:49.079 --> 00:07:51.649
gonna call this um we'll call this with previous

168
00:07:55.579 --> 00:08:00.896
All right And then this is a select Star from with previous So then in this case

169
00:08:00.896 --> 00:08:01.209
what

170
00:08:01.209 --> 00:08:07.040
we can do is we can create an indicator of whether or not it changed

171
00:08:07.730 --> 00:08:11.879
So what we can say is like case when scoring class

172
00:08:12.579 --> 00:08:14.488
does not equal previous

173
00:08:15.100 --> 00:08:18.259
scoring class then one else 0 end

174
00:08:19.230 --> 00:08:21.139
This is um I can say

175
00:08:22.569 --> 00:08:25.358
Put a then add an L and an end

176
00:08:25.889 --> 00:08:30.783
We're gonna call this we're gonna call this as and this is gonna be change uh

177
00:08:30.783 --> 00:08:31.088
scoring

178
00:08:31.088 --> 00:08:32.090
class change indicator

179
00:08:37.038 --> 00:08:42.598
And we need to have essentially a similar similar bucket here where we have case

180
00:08:42.598 --> 00:08:42.969
when

181
00:08:42.969 --> 00:08:44.080
is active

182
00:08:44.840 --> 00:08:48.000
uh does not equal previous is active

183
00:08:48.989 --> 00:08:50.538
We're gonna say is active

184
00:08:52.099 --> 00:08:56.917
Change indicator So let's let's go ahead and just run all of this and I think

185
00:08:56.917 --> 00:08:57.219
this

186
00:08:57.219 --> 00:08:58.649
will make a little bit more sense

187
00:08:59.500 --> 00:08:59.969
So

188
00:09:00.658 --> 00:09:04.149
we have our um players here and

189
00:09:04.969 --> 00:09:09.115
See how we have these change indicators If we sort here do we have OK so see

190
00:09:09.115 --> 00:09:09.359
here's

191
00:09:09.359 --> 00:09:11.609
here's someone um here uh

192
00:09:13.129 --> 00:09:15.389
He was um so Aaron Brooks

193
00:09:16.119 --> 00:09:20.709
You see how like he went from bad to average and then he went from average to

194
00:09:20.709 --> 00:09:20.979
good

195
00:09:20.979 --> 00:09:26.483
wow so it looks like he was really ramping up here from 2008 to 2009 right So it

196
00:09:26.483 --> 00:09:26.788
looks

197
00:09:26.788 --> 00:09:31.533
like he changed two times like in like back to back seasons right And then oh

198
00:09:31.533 --> 00:09:31.830
yeah

199
00:09:31.830 --> 00:09:34.519
and then but then then he went then he kind of faltered right So you see how we

200
00:09:34.519 --> 00:09:34.668
went

201
00:09:34.668 --> 00:09:35.658
from good to average

202
00:09:36.389 --> 00:09:40.110
so these are the kind of changes that we're looking for right We're trying to

203
00:09:40.110 --> 00:09:40.359
see

204
00:09:40.359 --> 00:09:42.389
like how um

205
00:09:42.979 --> 00:09:48.698
Um how people like are changing and like how they are going about their changes

206
00:09:48.698 --> 00:09:49.080
So

207
00:09:49.080 --> 00:09:55.907
this is this is great So what we wanna do here right is we we essentially want

208
00:09:55.907 --> 00:09:56.308
to

209
00:09:57.599 --> 00:10:05.128
Create a um a streak to identify when like for every change that they have So

210
00:10:05.128 --> 00:10:05.599
imagine

211
00:10:05.599 --> 00:10:10.343
the first the first time they enter they get zero and then the the next time

212
00:10:10.343 --> 00:10:10.639
they

213
00:10:10.639 --> 00:10:14.858
make a change they get 1 the next time they make a change they get 2 and so on

214
00:10:14.858 --> 00:10:15.080
and

215
00:10:15.080 --> 00:10:21.028
so forth So this is um we're gonna put this we're gonna say like make a new um

216
00:10:21.028 --> 00:10:21.359
CTE

217
00:10:21.359 --> 00:10:22.599
here with indicators

218
00:10:24.700 --> 00:10:26.469
So this is now

219
00:10:28.070 --> 00:10:33.506
This is great So what we want to do is there's two ways to go about doing this

220
00:10:33.506 --> 00:10:33.808
right

221
00:10:33.928 --> 00:10:39.686
I um like really we wanna track these things together right And I think that

222
00:10:39.686 --> 00:10:40.070
that's

223
00:10:40.070 --> 00:10:44.981
probably something we want to do here um like these two like it gets complicated

224
00:10:44.981 --> 00:10:45.308
if

225
00:10:45.308 --> 00:10:49.299
we have two indicators like this in this pattern so what we actually want to do

226
00:10:49.299 --> 00:10:49.548
is

227
00:10:49.548 --> 00:10:51.379
combine these into

228
00:10:52.009 --> 00:10:56.639
Um into one single indicator and that's pretty easy So what we're going to do is

229
00:10:56.639 --> 00:10:56.928
we're

230
00:10:56.928 --> 00:10:59.928
going to paste this up here this case when

231
00:11:00.538 --> 00:11:04.086
And then we're gonna just call this instead of scoring class change indicator

232
00:11:04.086 --> 00:11:04.359
we're

233
00:11:04.359 --> 00:11:08.742
gonna call this change indicator which is where if there's a difference in

234
00:11:08.742 --> 00:11:09.080
scoring

235
00:11:09.080 --> 00:11:14.247
class or is active so like if they go from good to bad or bad to good or from

236
00:11:14.247 --> 00:11:14.519
active

237
00:11:14.519 --> 00:11:19.269
to inactive then this will indicate that they they did that right

238
00:11:19.840 --> 00:11:25.996
So now what we wanna do is we wanna create um let's let's look at this so we say

239
00:11:25.996 --> 00:11:26.320
with

240
00:11:26.320 --> 00:11:27.279
indicators

241
00:11:29.928 --> 00:11:32.639
So what we can do here is

242
00:11:33.369 --> 00:11:39.234
we want to essentially sum these up so that we can keep track of the streaks for

243
00:11:39.234 --> 00:11:39.580
players

244
00:11:39.729 --> 00:11:45.391
right So in this case what we can say is you can say um you can say row number

245
00:11:45.391 --> 00:11:45.690
no

246
00:11:46.070 --> 00:11:47.769
we want some sum

247
00:11:48.279 --> 00:11:51.269
uh in this case we're gonna say uh change indicator

248
00:11:52.500 --> 00:11:56.268
Uh but this is actually gonna be uh another window function In this case we're

249
00:11:56.268 --> 00:11:56.519
gonna

250
00:11:56.519 --> 00:11:57.229
say over

251
00:11:57.989 --> 00:12:02.930
And then again we're gonna say uh partition by and then in this case our

252
00:12:02.930 --> 00:12:03.259
partition

253
00:12:03.259 --> 00:12:05.979
by here is still gonna be a player name

254
00:12:07.788 --> 00:12:12.418
And then our order by here is going to be a current season

255
00:12:16.048 --> 00:12:19.269
And then this is gonna be as streak identifier

256
00:12:19.918 --> 00:12:24.323
which is probably like what is Zach doing here right So let's go ahead and see

257
00:12:24.323 --> 00:12:24.599
what's

258
00:12:24.599 --> 00:12:29.983
going on here with uh a certain player right So here's AC Green who like seems

259
00:12:29.983 --> 00:12:30.320
like

260
00:12:30.320 --> 00:12:35.721
he was active and bad for like a long time right So let's go look at Streak

261
00:12:35.721 --> 00:12:36.038
identifier

262
00:12:36.119 --> 00:12:41.503
You see how Streak identifier OK so he did change here though So you see um he

263
00:12:41.503 --> 00:12:41.820
went

264
00:12:41.918 --> 00:12:44.320
he became inactive at this point

265
00:12:44.840 --> 00:12:49.675
So you see how his identifier changed to one and then it looks like he was just

266
00:12:49.675 --> 00:12:49.960
inactive

267
00:12:49.960 --> 00:12:51.820
for like the rest of the data set

268
00:12:52.440 --> 00:12:58.427
so um and you see how like that's all in the same that's all the same value here

269
00:12:58.427 --> 00:12:58.759
right

270
00:12:58.879 --> 00:13:03.250
so you can see exactly when he became inactive So

271
00:13:03.798 --> 00:13:09.135
this is exactly what we're looking for um when we are trying to build out these

272
00:13:09.135 --> 00:13:09.469
SCDs

273
00:13:09.469 --> 00:13:13.308
is we are looking for like how long they stay the same value

274
00:13:13.869 --> 00:13:16.369
And you can kind of see how different players kind of

275
00:13:16.979 --> 00:13:21.122
progress and change their values over time and uh some players are only active

276
00:13:21.122 --> 00:13:21.418
for

277
00:13:21.418 --> 00:13:28.715
like one year so what we can do is from this uh data right now we have one more

278
00:13:28.715 --> 00:13:29.099
window

279
00:13:29.099 --> 00:13:34.623
function here so we're gonna call this um uh with streaks as And this is gonna

280
00:13:34.623 --> 00:13:34.969
give

281
00:13:34.969 --> 00:13:42.169
us our streak identifier So now what we wanna do is we essentially can aggregate

282
00:13:42.169 --> 00:13:42.649
on

283
00:13:42.649 --> 00:13:48.898
this identifier right Because if we do a min and a max on this identifier for

284
00:13:48.898 --> 00:13:49.288
the

285
00:13:49.288 --> 00:13:49.769
season

286
00:13:50.580 --> 00:13:56.715
That will collapse all of these rows right because we uh we know that all those

287
00:13:56.715 --> 00:13:57.099
values

288
00:13:57.099 --> 00:14:01.292
are gonna be the same because we just prove that out with uh how we are doing

289
00:14:01.292 --> 00:14:01.538
this

290
00:14:01.538 --> 00:14:04.250
with the lag and the lead and all that stuff

291
00:14:04.820 --> 00:14:09.769
So in that case right what we can say is um we can say select uh we're gonna say

292
00:14:09.769 --> 00:14:10.029
player

293
00:14:10.029 --> 00:14:12.529
name and then we're gonna say um

294
00:14:13.250 --> 00:14:14.408
streak or we're gonna say

295
00:14:15.418 --> 00:14:16.869
from with streaks

296
00:14:17.950 --> 00:14:22.275
So in this case we want to do a player name and then we can say streak

297
00:14:22.275 --> 00:14:22.529
identifier

298
00:14:23.048 --> 00:14:27.759
And we want to do a couple we probably have a couple more here We have is active

299
00:14:28.288 --> 00:14:31.658
and then we have um uh scoring class

300
00:14:32.500 --> 00:14:36.998
And then then it gets a little bit interesting because then we can say max we

301
00:14:36.998 --> 00:14:37.279
can

302
00:14:37.279 --> 00:14:41.229
say max current season as um end season

303
00:14:41.798 --> 00:14:43.519
and min current season

304
00:14:45.250 --> 00:14:49.778
As start season oh well let's put this you might usually you put the start

305
00:14:49.778 --> 00:14:50.080
before

306
00:14:50.080 --> 00:14:51.369
the um

307
00:14:53.048 --> 00:14:55.558
For the end there right and then

308
00:14:56.129 --> 00:15:01.927
this is now uh this is a regular aggregation right So in this case we can say uh

309
00:15:01.927 --> 00:15:02.250
group

310
00:15:02.250 --> 00:15:07.558
by player name streak identifier is active and scoring class right

311
00:15:08.369 --> 00:15:15.979
So now this will collapse all those records right So now you'll see we have um

312
00:15:16.700 --> 00:15:21.408
These players who are in here and they are like

313
00:15:21.940 --> 00:15:28.608
some of these are for a long time so this 2003 to 2022 this guy right and then

314
00:15:28.608 --> 00:15:28.979
um

315
00:15:28.979 --> 00:15:33.859
some of these people change a lot right Like this guy changed 10 times wow

316
00:15:34.460 --> 00:15:40.929
Like um so what this is doing right is you can see with these people you can uh

317
00:15:40.929 --> 00:15:41.288
like

318
00:15:41.288 --> 00:15:44.860
a lot of times like when people do this pattern they actually don't you you

319
00:15:44.860 --> 00:15:45.099
group

320
00:15:45.099 --> 00:15:49.674
by streak identifier but you don't include it in the data set right And then

321
00:15:49.674 --> 00:15:49.979
what

322
00:15:49.979 --> 00:15:54.527
this does is well let's put an order by in this to kind of really make this pop

323
00:15:54.527 --> 00:15:54.779
so

324
00:15:54.779 --> 00:15:59.502
we can say order by player name I think that will really uh make this pop a

325
00:15:59.502 --> 00:15:59.779
little

326
00:15:59.779 --> 00:16:02.859
bit So remember we had that AC green guy right

327
00:16:03.509 --> 00:16:08.291
Where he had just that one change where he went from active to inactive right So

328
00:16:08.291 --> 00:16:08.590
you'll

329
00:16:08.590 --> 00:16:17.349
see here how he was active from 1996 to 2000 and then he changed uh in 2001

330
00:16:17.869 --> 00:16:22.538
to inactive right And then from 2001 to 2022 he was inactive

331
00:16:23.109 --> 00:16:23.590
So

332
00:16:25.070 --> 00:16:32.636
What I'm trying to illustrate here is you see how this is essentially it Like we

333
00:16:32.636 --> 00:16:33.109
essentially

334
00:16:33.109 --> 00:16:39.607
have what we're looking for now This is our uh our SCD sort of table Um one of

335
00:16:39.607 --> 00:16:39.969
the

336
00:16:39.969 --> 00:16:48.082
things that I think we should do uh for the later part of this lab is we want to

337
00:16:48.082 --> 00:16:48.509
filter

338
00:16:48.509 --> 00:16:52.646
this down just a little bit And what I'm going to say here is I'm gonna say

339
00:16:52.646 --> 00:16:52.889
where

340
00:16:52.889 --> 00:16:59.549
current season is less than or equal to 2021 just so that we can use 2022 in the

341
00:16:59.549 --> 00:16:59.918
incremental

342
00:17:00.168 --> 00:17:02.359
build of this So

343
00:17:02.928 --> 00:17:07.742
what we can do is uh let's let's run this code again and you'll see uh now now

344
00:17:07.742 --> 00:17:08.009
this

345
00:17:08.009 --> 00:17:09.140
goes to 2021

346
00:17:09.930 --> 00:17:14.458
exactly what we are expecting and um you'll see OK here's AC Green with the

347
00:17:14.458 --> 00:17:14.759
ordering

348
00:17:15.049 --> 00:17:19.961
and so so this is so zero is their first streak and one is their second streak

349
00:17:19.961 --> 00:17:20.250
And

350
00:17:20.259 --> 00:17:23.911
so you'll see like OK this this makes sense Let's find someone who like is a

351
00:17:23.911 --> 00:17:24.140
little

352
00:17:24.140 --> 00:17:29.030
bit more interesting like uh this Aaron Brooks guy right um it's actually like

353
00:17:29.030 --> 00:17:29.380
put

354
00:17:29.380 --> 00:17:33.767
streak identifier in there as well like the order by so it's not so freaking

355
00:17:33.767 --> 00:17:34.059
ugly

356
00:17:34.059 --> 00:17:38.241
so you'll see this now actually explains the entire history for people so like

357
00:17:38.241 --> 00:17:38.539
this

358
00:17:38.539 --> 00:17:43.728
guy started out as bad and active than average and active and Good and active

359
00:17:43.728 --> 00:17:44.074
like

360
00:17:44.074 --> 00:17:47.863
he just like was like one year at a time bang bang bang bang bang right Well he

361
00:17:47.863 --> 00:17:48.074
changed

362
00:17:48.074 --> 00:17:52.483
a lot right Like this guy OK then he did have he did have a um it looks like he

363
00:17:52.483 --> 00:17:52.703
went

364
00:17:52.703 --> 00:17:57.871
inactive for a year here right And then uh then here he finally had two years

365
00:17:57.871 --> 00:17:58.193
where

366
00:17:58.193 --> 00:17:59.943
he was at least consistent This guy like

367
00:18:00.795 --> 00:18:03.821
wouldn't even say he's like slowly changing right He's changing almost every

368
00:18:03.821 --> 00:18:04.074
season

369
00:18:04.500 --> 00:18:08.311
So like but it depends on who the player is right And like so the streak

370
00:18:08.311 --> 00:18:08.549
identifier

371
00:18:08.549 --> 00:18:12.998
is just like the first value and then each one after that is um every time it's

372
00:18:12.998 --> 00:18:13.259
changed

373
00:18:13.509 --> 00:18:17.531
and so that you can get the continuous values that you're looking for for

374
00:18:17.531 --> 00:18:17.818
different

375
00:18:17.818 --> 00:18:20.989
for different players and like over what time frame that it exists

376
00:18:21.750 --> 00:18:28.115
So um that's kind of the idea behind how uh these this streak identifier works

377
00:18:28.115 --> 00:18:28.539
So

378
00:18:29.828 --> 00:18:34.140
What we can do here is the last thing we have in here is this current season

379
00:18:34.650 --> 00:18:39.098
which actually we wanna just hard code this is gonna be 2021 as current season

380
00:18:39.809 --> 00:18:44.191
because this is uh this is like imagine if this was in like airflow or like some

381
00:18:44.191 --> 00:18:44.449
sort

382
00:18:44.449 --> 00:18:50.000
of pipeline that this is uh a parameter that you would inject right this guy

383
00:18:50.559 --> 00:18:51.180
this guy

384
00:18:51.729 --> 00:18:56.962
so um this essentially is the query that we're looking for um I think we say

385
00:18:56.962 --> 00:18:57.289
insert

386
00:18:57.289 --> 00:18:58.368
into players

387
00:18:59.920 --> 00:19:00.588
The city

388
00:19:01.689 --> 00:19:01.699
Um

389
00:19:03.809 --> 00:19:05.670
Oh what did I did I

390
00:19:06.309 --> 00:19:08.969
did I mess something up here There's like a uh there's like a

391
00:19:11.209 --> 00:19:14.118
What are the columns in this OK so

392
00:19:15.838 --> 00:19:20.602
00 you're right Streak identifier is not in there and it's like score OK scoring

393
00:19:20.602 --> 00:19:20.920
classes

394
00:19:20.920 --> 00:19:21.358
first

395
00:19:23.479 --> 00:19:24.858
And then is active

396
00:19:27.989 --> 00:19:28.719
And then

397
00:19:29.729 --> 00:19:32.809
Start season end season current season OK that looks right

398
00:19:34.160 --> 00:19:40.069
OK so now this is our kind of way of so the main thing I'm trying to show with

399
00:19:40.069 --> 00:19:40.380
this

400
00:19:40.380 --> 00:19:44.338
is this is a way of processing all of history

401
00:19:45.318 --> 00:19:45.989
And

402
00:19:46.939 --> 00:19:47.348
Oh

403
00:19:47.910 --> 00:19:48.858
it already exists

404
00:19:49.689 --> 00:19:50.229
Let's see

405
00:19:53.959 --> 00:19:55.390
Players a CD

406
00:20:00.880 --> 00:20:05.454
What 00 because there's a duplicate in the primary key because the primary oh I

407
00:20:05.454 --> 00:20:05.759
messed

408
00:20:05.759 --> 00:20:10.065
up I actually messed up the primary key is not that It's um the primary key is

409
00:20:10.065 --> 00:20:10.318
all

410
00:20:10.318 --> 00:20:14.529
three It's the start season end season and the player name It's all three of

411
00:20:14.529 --> 00:20:14.809
those

412
00:20:15.279 --> 00:20:18.750
Small little mistake here so drop table players SCD

413
00:20:21.430 --> 00:20:22.489
And then um

414
00:20:23.420 --> 00:20:28.755
In this case primary key here is gonna be player uh name start season I think

415
00:20:28.755 --> 00:20:29.088
it's

416
00:20:29.088 --> 00:20:32.543
actually just player name and start season because yeah you don't have to put

417
00:20:32.543 --> 00:20:32.789
end

418
00:20:32.789 --> 00:20:34.160
season in there like you could

419
00:20:34.848 --> 00:20:38.124
potentially put it in there but like they they should only have one value at the

420
00:20:38.124 --> 00:20:38.328
start

421
00:20:38.328 --> 00:20:41.689
of each season or at this like for that value So I'm pretty sure that's actually

422
00:20:41.689 --> 00:20:41.900
the

423
00:20:41.900 --> 00:20:44.088
primary key not current season it's start season

424
00:20:45.078 --> 00:20:49.381
So you can go ahead and create that table one more time and now this should

425
00:20:49.381 --> 00:20:49.650
actually

426
00:20:49.650 --> 00:20:50.078
run

427
00:20:50.699 --> 00:20:54.694
OK now it runs pretty great So now we can just say like uh select star from

428
00:20:54.694 --> 00:20:54.930
players

429
00:20:54.930 --> 00:20:55.650
SCD

430
00:20:57.009 --> 00:21:01.141
And this should OK great This gives us the same data right exact same data that

431
00:21:01.141 --> 00:21:01.400
we

432
00:21:01.400 --> 00:21:09.370
had before And so um this is um awesome So this is essentially how you can

433
00:21:09.370 --> 00:21:09.868
create

434
00:21:09.868 --> 00:21:12.390
an SCD table uh from

435
00:21:13.420 --> 00:21:17.778
Uh your whatever data you have like your daily data or like because you can

436
00:21:17.778 --> 00:21:18.068
think

437
00:21:18.068 --> 00:21:22.661
of the season here you could change that to date or year or month or whatever

438
00:21:22.661 --> 00:21:22.949
and

439
00:21:22.949 --> 00:21:26.425
like you can have the same exact values right That could be a big thing that can

440
00:21:26.425 --> 00:21:26.630
make

441
00:21:26.630 --> 00:21:31.804
a big difference here um so that's a big part of like SCD stuff that I highly

442
00:21:31.804 --> 00:21:32.108
recommend

443
00:21:32.108 --> 00:21:37.309
like checking out is just like this is um this could be very powerful uh

444
00:21:38.670 --> 00:21:44.091
There are some things about this uh pattern that I don't like uh and let's um

445
00:21:44.091 --> 00:21:44.430
let's

446
00:21:44.430 --> 00:21:48.231
kind of go over those before we go into the incremental build just so that you

447
00:21:48.231 --> 00:21:48.469
can

448
00:21:48.469 --> 00:21:51.289
kind of see the difference between the two So

449
00:21:52.068 --> 00:21:57.422
there's gonna be some expensive parts of this query Um the the big expensive

450
00:21:57.422 --> 00:21:57.805
parts

451
00:21:57.805 --> 00:22:02.134
of this query are these windows right Like look how many times we have to slice

452
00:22:02.134 --> 00:22:02.404
this

453
00:22:02.404 --> 00:22:08.636
up right So we have 12 right and then we have 3 So you essentially have to do 22

454
00:22:08.636 --> 00:22:08.963
window

455
00:22:08.963 --> 00:22:13.502
functions on top of each other and then an aggregation right And this is the

456
00:22:13.502 --> 00:22:13.805
only

457
00:22:13.805 --> 00:22:18.098
time we aggregate so you actually do window functions on the entire data set

458
00:22:18.098 --> 00:22:18.404
right

459
00:22:18.443 --> 00:22:22.789
Like you never crunch the data down anywhere Right it's only at the very end

460
00:22:22.789 --> 00:22:23.078
when

461
00:22:23.078 --> 00:22:27.475
you do this big group buy that you end up actually shrinking the data volume at

462
00:22:27.475 --> 00:22:27.750
all

463
00:22:28.078 --> 00:22:32.513
and that is it's good because like uh you have all the data and like you can you

464
00:22:32.513 --> 00:22:32.759
you

465
00:22:32.759 --> 00:22:36.306
have the full history which is very powerful and now you have this beautiful

466
00:22:36.306 --> 00:22:36.559
little

467
00:22:36.559 --> 00:22:41.489
SCD table that like really explains and really like really makes things pop

468
00:22:41.489 --> 00:22:41.868
right

469
00:22:42.160 --> 00:22:45.199
It really shows how people like play and do their thing

470
00:22:45.739 --> 00:22:52.279
But like it um this this only works like this query really only works a lot like

471
00:22:52.989 --> 00:22:58.921
Given the scale but uh this query like also is powerful So like when I worked on

472
00:22:58.921 --> 00:22:59.269
unit

473
00:22:59.269 --> 00:23:03.806
economics at Airbnb this essentially was the query we worked with and uh we did

474
00:23:03.806 --> 00:23:04.108
this

475
00:23:04.108 --> 00:23:07.438
for all the different line items and all the different uh uh things in that

476
00:23:07.438 --> 00:23:07.660
table

477
00:23:07.989 --> 00:23:12.714
and it worked great Spark was able to crunch this pretty well because it's um

478
00:23:12.714 --> 00:23:13.029
like

479
00:23:13.769 --> 00:23:17.652
It it like these window functions are actually not that expensive like you if

480
00:23:17.652 --> 00:23:17.930
you

481
00:23:17.930 --> 00:23:22.220
just partition things up and like slice and dice it so like and then you can

482
00:23:22.220 --> 00:23:22.489
essentially

483
00:23:22.489 --> 00:23:26.463
just regenerate all of history every day right So then you can change this like

484
00:23:26.463 --> 00:23:26.729
uh

485
00:23:26.729 --> 00:23:31.735
like the next day right so like you would just change it to like 2022 2023 2024

486
00:23:31.735 --> 00:23:32.029
right

487
00:23:32.049 --> 00:23:35.438
And then then you don't even have to think about the incremental right even

488
00:23:35.438 --> 00:23:35.680
though

489
00:23:35.680 --> 00:23:36.130
like

490
00:23:36.759 --> 00:23:42.471
The data engineer in me like who cares a lot about efficiency is like I hate

491
00:23:42.471 --> 00:23:42.828
this

492
00:23:42.900 --> 00:23:46.680
Why am I scanning all of history every day I don't like this at all and that's a

493
00:23:46.680 --> 00:23:46.890
big

494
00:23:46.890 --> 00:23:51.464
thing that I think y'all should definitely be looking into is like how that is

495
00:23:51.464 --> 00:23:51.769
um

496
00:23:52.289 --> 00:23:57.918
impacting stuff is uh like and but most of the time like this can be enough and

497
00:23:57.918 --> 00:23:58.250
this

498
00:23:58.338 --> 00:24:00.420
this is gonna work for you um

499
00:24:01.209 --> 00:24:05.257
But on the flip side I also love teaching about the other ways that things can

500
00:24:05.257 --> 00:24:05.509
be

501
00:24:05.509 --> 00:24:10.609
done because this query here versus the query we're about to do this query here

502
00:24:10.609 --> 00:24:10.949
is

503
00:24:10.949 --> 00:24:15.561
gonna be a lot more prone to out of memory exceptions and SKU and other

504
00:24:15.561 --> 00:24:15.868
interesting

505
00:24:15.868 --> 00:24:21.554
problems like that that like um like for example if you have like uh like most

506
00:24:21.554 --> 00:24:21.910
of

507
00:24:21.910 --> 00:24:26.319
this is a slowly changing dimension But you saw how we had that one guy who was

508
00:24:26.319 --> 00:24:26.578
like

509
00:24:26.578 --> 00:24:32.421
changing every single year So imagine if you have some people or some dimensions

510
00:24:32.421 --> 00:24:32.838
in

511
00:24:32.838 --> 00:24:40.820
your data set that are not um that are not uh as slowly changing as other ones

512
00:24:40.820 --> 00:24:41.289
then

513
00:24:41.289 --> 00:24:46.231
the problem here is that you get a lot of streaks for that specific user and

514
00:24:46.231 --> 00:24:46.539
then

515
00:24:46.539 --> 00:24:50.811
that blows up the cardinality of this right And that kind of like like slows it

516
00:24:50.811 --> 00:24:51.078
down

517
00:24:51.078 --> 00:24:51.828
and that's where like

518
00:24:52.519 --> 00:24:56.477
When you're doing this like they should all have roughly the same amount of

519
00:24:56.477 --> 00:24:56.759
slowly

520
00:24:56.759 --> 00:24:59.578
changing but obviously that it will never be the case but that's what you're

521
00:24:59.578 --> 00:24:59.779
kind

522
00:24:59.779 --> 00:25:01.439
of hoping for And

523
00:25:02.059 --> 00:25:05.900
but like that's where I like these queries I like this query I think this query

524
00:25:05.900 --> 00:25:06.140
works

525
00:25:06.140 --> 00:25:11.674
great I ran this query in production at scale so like don't worry too much about

526
00:25:11.674 --> 00:25:12.019
that

527
00:25:12.019 --> 00:25:16.303
because that's one of the other things I wanna like really uh advise y'all with

528
00:25:16.303 --> 00:25:16.588
is

529
00:25:17.459 --> 00:25:21.939
when you're working with dimensions you can do like crazy stuff like this like

530
00:25:21.939 --> 00:25:22.259
you

531
00:25:22.259 --> 00:25:25.949
can do like I'm just gonna scan all the history like you can do that because

532
00:25:25.949 --> 00:25:26.180
dimensional

533
00:25:26.180 --> 00:25:29.918
data really isn't that big like dimensional data even even have Facebook when

534
00:25:29.918 --> 00:25:30.205
it's

535
00:25:30.205 --> 00:25:34.592
like billions it still isn't that big compared to like where the fact data goes

536
00:25:34.592 --> 00:25:34.884
right

537
00:25:34.925 --> 00:25:39.125
The fact data is just like can be billions and trillions and just really really

538
00:25:39.125 --> 00:25:39.405
really

539
00:25:39.405 --> 00:25:44.430
large So like um you know that was one big thing that was different between

540
00:25:44.430 --> 00:25:44.765
Airbnb

541
00:25:44.765 --> 00:25:48.034
and Facebook though is that this query probably wouldn't fly at Facebook because

542
00:25:48.034 --> 00:25:48.286
they

543
00:25:48.286 --> 00:25:52.690
have billions of users whereas at Airbnb there's like millions of users and so

544
00:25:52.690 --> 00:25:53.005
you

545
00:25:53.005 --> 00:25:57.182
have like a kind of a there is a line there of scale When you add like 1 or 2

546
00:25:57.182 --> 00:25:57.391
more

547
00:25:57.391 --> 00:26:00.984
zeros where you can't just like throw everything into a window function like

548
00:26:00.984 --> 00:26:01.260
this

549
00:26:01.260 --> 00:26:05.132
and call it a day so that is definitely something that you want to be

550
00:26:05.132 --> 00:26:05.391
considering

551
00:26:05.391 --> 00:26:08.694
as you kind of like are going through your career and going through all sorts of

552
00:26:08.694 --> 00:26:08.901
stuff

553
00:26:08.901 --> 00:26:15.049
so um yeah let's go ahead and get into this next uh kind of way of doing things

554
00:26:15.049 --> 00:26:15.391
right

555
00:26:15.641 --> 00:26:20.406
So what we wanna do is so we have players STD so what we're gonna say is we're

556
00:26:20.406 --> 00:26:20.671
gonna

557
00:26:20.671 --> 00:26:23.432
say with um yesterday data

558
00:26:24.449 --> 00:26:29.456
As in this case we're we're gonna say we're gonna say what I call this last

559
00:26:29.456 --> 00:26:29.769
season

560
00:26:29.769 --> 00:26:31.318
last season SCD

561
00:26:31.890 --> 00:26:37.209
uh select star from players SCD where current season equals 2021

562
00:26:39.170 --> 00:26:42.598
And then we have um this season data

563
00:26:45.630 --> 00:26:49.259
As and in this case we have select star from players

564
00:26:50.299 --> 00:26:51.449
Where current season

565
00:26:52.029 --> 00:26:53.029
goes 2022

566
00:26:53.739 --> 00:26:59.774
So what we're trying to do here is see if things have changed right So the thing

567
00:26:59.774 --> 00:27:00.130
is

568
00:27:00.140 --> 00:27:04.237
is like some of the SCD records we don't have to care about and we know that

569
00:27:04.237 --> 00:27:04.479
they're

570
00:27:04.479 --> 00:27:09.288
never gonna change because they like like they're already completed they don't

571
00:27:09.288 --> 00:27:09.689
have

572
00:27:09.689 --> 00:27:15.289
that current record to them So the thing is is like so this last season SCD what

573
00:27:15.289 --> 00:27:15.618
we

574
00:27:15.618 --> 00:27:20.493
wanna do is we have one more thing in here and that's gonna be end season equals

575
00:27:20.493 --> 00:27:20.779
2021

576
00:27:21.299 --> 00:27:24.170
Cause then we want um then we have historical

577
00:27:24.739 --> 00:27:25.828
SCD as

578
00:27:26.420 --> 00:27:29.088
and then historical SCD is gonna be

579
00:27:29.779 --> 00:27:30.549
this guy

580
00:27:33.078 --> 00:27:38.896
But this is where end season is less than 2021 So because in this case like if

581
00:27:38.896 --> 00:27:39.239
someone

582
00:27:39.239 --> 00:27:44.038
has already like for example Michael Jordan he like retired in 1997 and then he

583
00:27:44.038 --> 00:27:44.358
came

584
00:27:44.358 --> 00:27:49.091
back in 2001 those records are never gonna change even if something happens in

585
00:27:49.091 --> 00:27:49.430
2023

586
00:27:49.479 --> 00:27:52.211
there's nothing that can change those records because they have already started

587
00:27:52.211 --> 00:27:52.439
and

588
00:27:52.439 --> 00:27:58.073
ended So that's what this historical SCD is doing And then we have last season

589
00:27:58.073 --> 00:27:58.449
SCD

590
00:27:58.699 --> 00:28:02.201
So let's go ahead and just like look at that select star from last season SCD

591
00:28:02.201 --> 00:28:02.420
first

592
00:28:02.420 --> 00:28:06.574
just to kind of like get a grapple of what's going on here and you'll see that

593
00:28:06.574 --> 00:28:06.818
like

594
00:28:06.818 --> 00:28:10.692
we're gonna only have one record for everyone here right You have this 2021

595
00:28:10.692 --> 00:28:10.969
record

596
00:28:10.969 --> 00:28:16.662
and like uh it's gonna be uh where all of these are gonna be like these are the

597
00:28:16.662 --> 00:28:16.979
current

598
00:28:16.979 --> 00:28:22.948
records like assuming that we have all the data up to 2021 And then uh like 2022

599
00:28:22.948 --> 00:28:23.299
could

600
00:28:23.299 --> 00:28:28.559
um for for the records that don't change they could just increase one and then

601
00:28:28.559 --> 00:28:28.910
for

602
00:28:28.910 --> 00:28:33.147
the records that do change we need to add a new record that's essentially how

603
00:28:33.147 --> 00:28:33.430
this

604
00:28:33.430 --> 00:28:39.058
is gonna work right So how are we gonna like look at this I think that there's

605
00:28:39.058 --> 00:28:39.390
gonna

606
00:28:39.390 --> 00:28:43.358
be a couple different ways that you could think about doing this right So

607
00:28:43.949 --> 00:28:50.258
one is let's look at uh like So if we join here right so let's say from last

608
00:28:50.258 --> 00:28:50.608
season

609
00:28:50.608 --> 00:28:56.779
SCD let's call this LS um join this season data TS

610
00:28:58.380 --> 00:28:58.390
Right

611
00:28:59.130 --> 00:29:04.870
um this is actually gonna be a left joint uh because uh there might be uh

612
00:29:04.870 --> 00:29:05.229
actually

613
00:29:05.868 --> 00:29:07.969
Yeah yeah it'll be a left joint right because

614
00:29:08.539 --> 00:29:11.973
well wait it's the other way around because this season can have new data

615
00:29:11.973 --> 00:29:12.219
because

616
00:29:12.219 --> 00:29:16.456
you can have the new players who didn't come in right it's actually this way

617
00:29:16.456 --> 00:29:16.739
last

618
00:29:16.739 --> 00:29:20.909
season SCD and this is actually just uh yeah and then this is a left joint

619
00:29:20.909 --> 00:29:21.170
because

620
00:29:21.459 --> 00:29:25.864
you can have those new players who don't have any records yet and then this is

621
00:29:25.864 --> 00:29:26.140
gonna

622
00:29:26.140 --> 00:29:30.459
be on LS.player name equals TS.player name

623
00:29:31.400 --> 00:29:32.868
So this is going to give us

624
00:29:34.078 --> 00:29:39.677
OK so in this case what we care about here is we want LS.player name and then we

625
00:29:39.677 --> 00:29:39.989
want

626
00:29:39.989 --> 00:29:40.309
um

627
00:29:41.430 --> 00:29:42.489
Um Ls

628
00:29:43.630 --> 00:29:46.229
let's change this This is TS and this is Ellis

629
00:29:47.000 --> 00:29:51.219
There we go this is so this is TS.player name and then we have um LS

630
00:29:52.420 --> 00:29:56.449
Uh um scoring class LS is active

631
00:29:57.890 --> 00:30:03.585
And then we have TS scoring class TS is active right So these are gonna be uh

632
00:30:03.585 --> 00:30:03.920
the

633
00:30:03.920 --> 00:30:08.088
main things we're gonna be looking at So you'll see in this case like

634
00:30:09.019 --> 00:30:13.197
These players are mostly gonna be the same but some of these are not right So

635
00:30:13.197 --> 00:30:13.459
like

636
00:30:13.459 --> 00:30:17.049
this Aaron Henry guy you see how he is not the same

637
00:30:17.699 --> 00:30:24.528
His record changed um So in this case we have um like we can have the changed

638
00:30:24.528 --> 00:30:24.930
records

639
00:30:24.930 --> 00:30:29.978
and then we have the new records right So that's like that's gonna be

640
00:30:29.978 --> 00:30:30.338
essentially

641
00:30:30.338 --> 00:30:33.342
the way to think about this right So we have the changed records and the new

642
00:30:33.342 --> 00:30:33.529
records

643
00:30:33.900 --> 00:30:39.052
where so in this case let's go ahead and add in here we have this is gonna be

644
00:30:39.052 --> 00:30:39.338
let's

645
00:30:39.338 --> 00:30:44.098
call this um uh unchanged records call that first

646
00:30:48.088 --> 00:30:49.779
OK so in this case

647
00:30:50.469 --> 00:30:54.539
uh what we want here is there's a nice wear condition here

648
00:30:55.068 --> 00:30:58.517
and the so there's gonna be two things here like first off this is this is not a

649
00:30:58.517 --> 00:30:58.709
left

650
00:30:58.709 --> 00:31:02.247
joint now this is just a regular joint because we'll have the new we'll have the

651
00:31:02.247 --> 00:31:02.469
new

652
00:31:02.469 --> 00:31:05.784
and changed records we'll have new and changed and then unchanged records and

653
00:31:05.784 --> 00:31:06.039
then

654
00:31:06.039 --> 00:31:08.469
those will be our kind of all of our things going together

655
00:31:08.989 --> 00:31:13.878
So in this case uh we just have a we class here where it's like where TS scoring

656
00:31:13.878 --> 00:31:14.150
class

657
00:31:14.150 --> 00:31:20.259
equals um LS scoring class and TS is active equals LS is active

658
00:31:21.160 --> 00:31:25.939
Right And then in this case what we have is we have um

659
00:31:26.729 --> 00:31:30.729
We wanna we wanna essentially keep all of the the records that we need here so

660
00:31:30.729 --> 00:31:30.979
we

661
00:31:31.029 --> 00:31:32.189
we have TS

662
00:31:32.789 --> 00:31:39.164
player name and then we have uh TS uh scoring class TS is active and then we

663
00:31:39.164 --> 00:31:39.539
want

664
00:31:39.539 --> 00:31:46.018
to have uh then we have the LS.start season and then in this case we have TS

665
00:31:46.018 --> 00:31:46.400
current

666
00:31:46.400 --> 00:31:46.828
season

667
00:31:47.348 --> 00:31:52.617
and this is gonna be as end season So what this is gonna do is for the records

668
00:31:52.617 --> 00:31:52.910
that

669
00:31:52.910 --> 00:31:55.269
don't change we're just gonna

670
00:31:55.900 --> 00:31:57.059
Increase it by one

671
00:31:57.989 --> 00:31:58.000
So

672
00:31:58.578 --> 00:32:01.640
uh like let's go ahead and just look at that We're gonna say select Star from

673
00:32:02.568 --> 00:32:07.759
Unchanged records so this is gonna essentially expand it out one so you'll see

674
00:32:07.759 --> 00:32:08.130
uh

675
00:32:08.130 --> 00:32:14.458
we have like uh AC AC Green right And see now he's gonna go from 2021 to 2022

676
00:32:14.458 --> 00:32:14.809
and

677
00:32:14.809 --> 00:32:20.183
so that's great like he's gonna be uh like gonna be in there and that will be

678
00:32:20.183 --> 00:32:20.500
awesome

679
00:32:20.848 --> 00:32:27.270
So then we have uh we have one more in here where we have the we have the new

680
00:32:27.270 --> 00:32:27.608
and

681
00:32:27.608 --> 00:32:33.112
changed records that's gonna be the other kind of uh The other pieces of this

682
00:32:33.112 --> 00:32:33.479
puzzle

683
00:32:33.479 --> 00:32:36.949
that are gonna be kind of an interesting piece right where so

684
00:32:38.098 --> 00:32:44.464
Um we have unchanged records and then we have uh new records right Or a new this

685
00:32:44.464 --> 00:32:44.838
is

686
00:32:44.838 --> 00:32:50.029
new and changed records right because those can go together So in the changed

687
00:32:50.029 --> 00:32:50.400
records

688
00:32:50.400 --> 00:32:54.589
category the problem is is like you actually end up having two records right You

689
00:32:54.589 --> 00:32:54.868
have

690
00:32:54.868 --> 00:33:01.585
the record for um uh the change and then you also have the record for the um the

691
00:33:01.585 --> 00:33:01.959
like

692
00:33:01.959 --> 00:33:04.996
the the closed record the one that like just finished and then you have the

693
00:33:04.996 --> 00:33:05.199
record

694
00:33:05.199 --> 00:33:05.799
for the change

695
00:33:06.368 --> 00:33:09.078
So in this case what we want to do is

696
00:33:09.650 --> 00:33:14.519
we want this in here right And then this becomes a left joint here

697
00:33:15.750 --> 00:33:16.959
And then uh

698
00:33:18.229 --> 00:33:21.539
This this we clause is interesting right

699
00:33:22.229 --> 00:33:28.179
So it's gonna be um or so I'll show you how this works This one's so you have

700
00:33:28.179 --> 00:33:28.509
this

701
00:33:28.509 --> 00:33:34.372
as an uh it's like like they either one of them doesn't equal either the scoring

702
00:33:34.372 --> 00:33:34.739
classes

703
00:33:34.739 --> 00:33:39.639
don't equal or the is actives don't equal but then there's another one that's

704
00:33:39.639 --> 00:33:39.989
like

705
00:33:39.989 --> 00:33:47.549
or um LS is active is null right Or we're gonna say LS.player name is null

706
00:33:48.358 --> 00:33:53.705
And that will give us the um the new the new and changed records that we can

707
00:33:53.705 --> 00:33:54.019
then

708
00:33:54.019 --> 00:34:00.424
work with So in this case right um one of the things that we can look at right

709
00:34:00.424 --> 00:34:00.779
is

710
00:34:00.779 --> 00:34:07.942
uh this this one is a little bit trickier so the you have the You you you you

711
00:34:07.942 --> 00:34:08.340
really

712
00:34:08.340 --> 00:34:13.439
want to use like an array of struck here kind of thing right Because you have to

713
00:34:13.439 --> 00:34:13.739
essentially

714
00:34:13.739 --> 00:34:18.426
take one record and make it two records because uh there there was a change that

715
00:34:18.426 --> 00:34:18.719
happened

716
00:34:19.059 --> 00:34:24.901
and so you have to have both records in in in there Uh I I hope that makes sense

717
00:34:24.901 --> 00:34:25.208
So

718
00:34:25.219 --> 00:34:25.728
um

719
00:34:26.260 --> 00:34:31.342
in that case like what we wanna do is we want to create a row right in this case

720
00:34:31.342 --> 00:34:31.610
which

721
00:34:31.610 --> 00:34:36.835
is gonna be um wait is this gonna be I think I made a mistake here I think this

722
00:34:36.835 --> 00:34:37.110
has

723
00:34:37.110 --> 00:34:41.025
just changed because then new new is like the last one you have like one more

724
00:34:41.025 --> 00:34:41.269
that

725
00:34:41.269 --> 00:34:45.636
gives you new which is just gonna be where uh where players know and then you

726
00:34:45.636 --> 00:34:45.909
just

727
00:34:45.909 --> 00:34:46.349
pull it in

728
00:34:47.208 --> 00:34:51.453
Yeah OK so actually this is just changed records so and all these go together I

729
00:34:51.453 --> 00:34:51.719
know

730
00:34:51.719 --> 00:34:55.987
this is like kind of a lot but it'll it'll make sense So in this case we're

731
00:34:55.987 --> 00:34:56.239
gonna

732
00:34:56.239 --> 00:34:57.668
put an array here

733
00:34:58.289 --> 00:35:03.777
where we need to unnest this array essentially where what we have is we have a

734
00:35:03.777 --> 00:35:04.119
row

735
00:35:04.119 --> 00:35:05.530
here which is gonna be

736
00:35:06.849 --> 00:35:11.140
These like well because you have the um the old record and then the new record

737
00:35:11.140 --> 00:35:11.409
they

738
00:35:11.409 --> 00:35:16.579
kind of like go together um and I think we actually have to create a type here

739
00:35:17.090 --> 00:35:21.173
because uh postre is weird in in Spark and other places you don't have to worry

740
00:35:21.173 --> 00:35:21.429
about

741
00:35:21.429 --> 00:35:25.619
this because Spark can just handle it but we need to have one that has uh

742
00:35:26.510 --> 00:35:31.010
these values It has like the SCD type so I think we'll go ahead and do that up

743
00:35:31.010 --> 00:35:31.260
here

744
00:35:31.260 --> 00:35:32.938
so create type SCD type

745
00:35:33.648 --> 00:35:36.539
And then this type has uh we have

746
00:35:37.269 --> 00:35:42.869
scoring class scoring class and then we have is active boolean

747
00:35:44.050 --> 00:35:48.079
And we have start season integer and end season

748
00:35:49.378 --> 00:35:51.519
Integer I think that's all we need

749
00:35:52.090 --> 00:35:57.227
is that and then cause then yeah and then the player name is gonna still be here

750
00:35:57.227 --> 00:35:57.530
because

751
00:35:57.530 --> 00:36:02.166
we join on player name so that will be fine so we don't need player name and

752
00:36:02.166 --> 00:36:02.438
then

753
00:36:02.438 --> 00:36:04.159
so we'll create this type here

754
00:36:06.188 --> 00:36:13.713
OK so now in the row here right we need to put these in here So we have the the

755
00:36:13.713 --> 00:36:14.090
old

756
00:36:14.090 --> 00:36:18.178
record which is just gonna be all of the old record stuff So we have LS scoring

757
00:36:18.178 --> 00:36:18.418
class

758
00:36:18.628 --> 00:36:23.179
LS is active LS.start season LS end season

759
00:36:24.550 --> 00:36:28.398
And then then we have uh and then we need to cast this as an SCD type

760
00:36:29.510 --> 00:36:31.099
Then we have another row here

761
00:36:31.958 --> 00:36:37.159
same same idea but these are now all uh we have TS scoring class TS

762
00:36:37.829 --> 00:36:38.469
So um

763
00:36:39.659 --> 00:36:43.000
These don't exist yet right Because they are um

764
00:36:43.809 --> 00:36:50.086
Uh this is gonna just be current season and this is TS uh current season right

765
00:36:50.086 --> 00:36:50.478
because

766
00:36:50.478 --> 00:36:55.703
the new record is just one value it's just like the current season 2022 and then

767
00:36:55.703 --> 00:36:56.030
so

768
00:36:56.030 --> 00:36:57.898
this is an SCD type as well

769
00:36:58.628 --> 00:37:03.823
and so this is gonna give us our kind of values that we're looking for and the

770
00:37:03.823 --> 00:37:04.128
good

771
00:37:04.128 --> 00:37:08.973
news is here we can get rid of all of this right because we uh this is not what

772
00:37:08.973 --> 00:37:09.228
we

773
00:37:09.228 --> 00:37:13.869
need and then What you can do in this is like we can do like an on nest

774
00:37:16.769 --> 00:37:20.039
So this is gonna give us an un-nest which then we have as

775
00:37:20.570 --> 00:37:21.820
uh this will be

776
00:37:22.579 --> 00:37:26.016
Let's like look at like what this this looks like because this is gonna be a lot

777
00:37:26.016 --> 00:37:26.219
of

778
00:37:26.219 --> 00:37:29.360
changes here So let's go ahead and look at like what this query actually gives

779
00:37:29.360 --> 00:37:29.570
us

780
00:37:29.739 --> 00:37:30.030
Uh

781
00:37:30.849 --> 00:37:33.119
so comment this out and then we can uh run this

782
00:37:34.699 --> 00:37:38.889
OK so you'll see we have our changed records and uh

783
00:37:39.570 --> 00:37:41.599
It now has um

784
00:37:42.610 --> 00:37:45.840
Is that right cause we have changed records where

785
00:37:46.929 --> 00:37:49.878
Yeah and then this is not there So this is gonna be our

786
00:37:52.849 --> 00:37:57.769
OK there we go so now we have our changed records and it has like um

787
00:37:59.699 --> 00:38:04.397
There should be 2 what Oh yeah cause cause it's not like oh yeah there we go So

788
00:38:04.397 --> 00:38:04.659
you

789
00:38:04.659 --> 00:38:09.055
see how now we have 2 records we have the old record and then the new record for

790
00:38:09.055 --> 00:38:09.300
things

791
00:38:09.300 --> 00:38:13.809
that changed out So what we can do is we can actually uh

792
00:38:14.500 --> 00:38:15.918
um we probably want to just

793
00:38:17.199 --> 00:38:21.474
Flatten like or we wanna like get get it out of the type out of the struct so we

794
00:38:21.474 --> 00:38:21.699
can

795
00:38:21.699 --> 00:38:24.579
say like um unnested changed records

796
00:38:28.378 --> 00:38:32.812
And then in here we can just say select star from change records Let me change

797
00:38:32.812 --> 00:38:33.090
it

798
00:38:33.168 --> 00:38:38.949
this is as um records let's call this as records and we have a player name and

799
00:38:38.949 --> 00:38:39.289
then

800
00:38:39.289 --> 00:38:42.648
we have uh records and then this is an SCD type though

801
00:38:43.530 --> 00:38:49.787
So now what we can do is we should be able to do in this case we can say start

802
00:38:49.787 --> 00:38:50.099
season

803
00:38:50.269 --> 00:38:55.519
so we can just essentially flatten this guy out and we can say um scoring class

804
00:38:59.728 --> 00:39:01.559
Here we go scoring class

805
00:39:02.128 --> 00:39:03.800
and then this is uh

806
00:39:04.840 --> 00:39:05.918
It is active

807
00:39:08.539 --> 00:39:10.239
Then we have start season

808
00:39:12.199 --> 00:39:13.449
First season and then end season

809
00:39:15.208 --> 00:39:15.739
Why is it

810
00:39:18.438 --> 00:39:21.860
Isn't that what it yeah start season end season that is

811
00:39:23.239 --> 00:39:23.250
Like

812
00:39:24.269 --> 00:39:25.550
OK there we go It's just like it

813
00:39:26.929 --> 00:39:31.207
OK so now we have this is our unnested change records and I know that this has

814
00:39:31.207 --> 00:39:31.458
like

815
00:39:31.458 --> 00:39:36.625
been a long query in the making right now but you'll see now we have this broken

816
00:39:36.625 --> 00:39:36.929
out

817
00:39:36.929 --> 00:39:43.128
to where these are all the players who had um any sort of uh

818
00:39:43.958 --> 00:39:47.453
Any sort of change here right That's gonna be our last like so you see how now

819
00:39:47.453 --> 00:39:47.659
we

820
00:39:47.659 --> 00:39:51.659
have two records for all these these are all the players who changed so we have

821
00:39:51.659 --> 00:39:51.909
one

822
00:39:51.909 --> 00:39:57.030
last one here which is our new records right New records is easy though like

823
00:39:57.679 --> 00:40:02.438
so I'm just gonna uh what we wanna do here is we're gonna say select star from

824
00:40:02.438 --> 00:40:02.719
uh

825
00:40:02.719 --> 00:40:10.616
this season data TS we're gonna say left join uh last season SCD on TS player

826
00:40:10.616 --> 00:40:11.110
name

827
00:40:11.110 --> 00:40:12.280
equals

828
00:40:12.829 --> 00:40:14.000
I'll put this to Ellis

829
00:40:15.340 --> 00:40:21.239
LS.player name and this is gonna be where LS.player name is Noel

830
00:40:22.369 --> 00:40:27.239
And then what we want here is for TS we don't say TS.player name TS

831
00:40:28.398 --> 00:40:29.668
Scoring class

832
00:40:30.239 --> 00:40:32.070
TS is active

833
00:40:32.780 --> 00:40:34.349
then TS current season

834
00:40:35.409 --> 00:40:42.473
As uh it's gonna be start season TS current season as end season OK so now the

835
00:40:42.473 --> 00:40:42.889
now

836
00:40:42.889 --> 00:40:47.444
we have everything ready to go We just have to now just essentially union all of

837
00:40:47.444 --> 00:40:47.728
these

838
00:40:47.728 --> 00:40:52.171
together so we have uh so we're gonna say in this case we're gonna say select

839
00:40:52.171 --> 00:40:52.449
star

840
00:40:52.449 --> 00:40:52.840
from

841
00:40:53.489 --> 00:40:59.695
historical SCD union all and then we have select star from um unnested change

842
00:40:59.695 --> 00:41:00.139
direction

843
00:41:00.449 --> 00:41:03.090
union all selectar from

844
00:41:04.469 --> 00:41:04.478
Uh

845
00:41:05.099 --> 00:41:05.989
new records

846
00:41:06.809 --> 00:41:07.469
I think

847
00:41:09.079 --> 00:41:11.320
You all must have the same number of queries OK

848
00:41:12.159 --> 00:41:16.349
same number of columns OK what did we miss here 12345

849
00:41:17.369 --> 00:41:19.909
And this has 12345

850
00:41:21.619 --> 00:41:25.856
And then oh there's yeah we have unnested change records new records and then

851
00:41:25.856 --> 00:41:26.159
historical

852
00:41:26.159 --> 00:41:31.521
SCD oh because historical SCD has um the that current season in it that's the

853
00:41:31.521 --> 00:41:31.878
problem

854
00:41:31.878 --> 00:41:36.903
has that like last value right So what we can do is uh but we don't actually

855
00:41:36.903 --> 00:41:37.199
ever

856
00:41:37.199 --> 00:41:42.070
use current season in that table so what we can do here is we can just

857
00:41:42.639 --> 00:41:44.000
pop in player name

858
00:41:44.829 --> 00:41:47.898
Um scoring class is active

859
00:41:48.688 --> 00:41:50.809
start season end season

860
00:41:51.878 --> 00:41:52.869
So now this should work

861
00:41:55.610 --> 00:42:01.461
There we go OK so now we have uh uh this is the way to do it incrementally So

862
00:42:01.461 --> 00:42:01.769
now

863
00:42:01.769 --> 00:42:06.323
like the good thing I know this query probably looks really insane but it's

864
00:42:06.323 --> 00:42:06.648
actually

865
00:42:06.648 --> 00:42:12.332
um it's processing a lot less data because we're only processing um the

866
00:42:12.332 --> 00:42:12.769
compacted

867
00:42:12.769 --> 00:42:18.033
uh 2021 records and then the 2022 records So this is probably processing like

868
00:42:18.033 --> 00:42:18.409
about

869
00:42:18.409 --> 00:42:23.114
20 times less data So even though it's like the query is a little bit more

870
00:42:23.114 --> 00:42:23.409
complicated

871
00:42:23.409 --> 00:42:24.719
because you have to think about all the

872
00:42:25.389 --> 00:42:29.626
Different ways that things can change It's also a very powerful query where you

873
00:42:29.626 --> 00:42:29.929
can

874
00:42:29.929 --> 00:42:30.378
um

875
00:42:31.179 --> 00:42:34.968
Where everything kind of comes together and it like is pretty pretty pretty

876
00:42:34.968 --> 00:42:35.260
freaking

877
00:42:35.260 --> 00:42:41.171
cool how this stuff like ends up working so um I think let me let me run this

878
00:42:41.171 --> 00:42:41.500
one

879
00:42:41.500 --> 00:42:45.062
more time OK so we have this and then it has all the all the players and what

880
00:42:45.062 --> 00:42:45.260
they're

881
00:42:45.260 --> 00:42:46.889
doing and so

882
00:42:47.489 --> 00:42:53.202
This is the idea behind how you can kind of incrementally add more and more uh

883
00:42:53.202 --> 00:42:53.559
players

884
00:42:53.559 --> 00:42:54.898
in here and

885
00:42:55.438 --> 00:43:00.118
that's kind of the idea so this is like I think I did it in a slightly different

886
00:43:00.118 --> 00:43:00.378
way

887
00:43:00.559 --> 00:43:05.262
in the um in the last lap but like I like this way of doing it because it's very

888
00:43:05.262 --> 00:43:05.510
clear

889
00:43:05.878 --> 00:43:07.559
of like the um

890
00:43:08.349 --> 00:43:11.588
The the way that this works Oh actually there's one more union here actually

891
00:43:11.588 --> 00:43:11.820
right

892
00:43:11.860 --> 00:43:16.739
There's the uh um select Star from there's the unchanged Records right

893
00:43:18.010 --> 00:43:19.489
And then union all

894
00:43:20.378 --> 00:43:23.530
yeah that's that's actually what it is because you get you get all of this

895
00:43:24.699 --> 00:43:26.809
there we go and then that like is

896
00:43:27.340 --> 00:43:28.530
the records that we're looking for

897
00:43:29.539 --> 00:43:32.639
Yeah and so um this is like

898
00:43:33.628 --> 00:43:37.266
A lot right And this is one of the reasons why like when I was when I was

899
00:43:37.266 --> 00:43:37.469
working

900
00:43:37.469 --> 00:43:40.506
at Airbnb and looking at unit economics and everything and I'm like dude I don't

901
00:43:40.506 --> 00:43:40.708
know

902
00:43:40.708 --> 00:43:43.947
about that I don't know about writing this query right This query is freaking

903
00:43:43.947 --> 00:43:44.179
gnarly

904
00:43:44.429 --> 00:43:48.079
And like there's so many like edges that can happen here and different things

905
00:43:48.079 --> 00:43:48.340
that

906
00:43:48.340 --> 00:43:53.167
like you want to be aware of right So like some some like like for example

907
00:43:53.167 --> 00:43:53.469
there's

908
00:43:53.469 --> 00:43:57.819
gonna be a couple other assumptions here that you want to really um think about

909
00:43:57.819 --> 00:43:58.110
right

910
00:43:58.269 --> 00:44:03.762
for these queries to make them work the right way And uh like where these

911
00:44:03.762 --> 00:44:04.128
problems

912
00:44:04.128 --> 00:44:08.157
can happen a lot where one of my big assumptions in this query that like we

913
00:44:08.157 --> 00:44:08.409
didn't

914
00:44:08.409 --> 00:44:12.421
even really talk about is the fact that I assume that is active and scoring

915
00:44:12.421 --> 00:44:12.688
class

916
00:44:12.688 --> 00:44:14.099
are essentially never null

917
00:44:14.929 --> 00:44:18.886
right Because if they're if they ever get nulled or nullified or whatever then

918
00:44:18.886 --> 00:44:19.168
it

919
00:44:19.168 --> 00:44:23.846
kind of breaks this pattern it breaks a lot of this because like then like Uh

920
00:44:23.846 --> 00:44:24.139
for

921
00:44:24.139 --> 00:44:30.365
example Noel right here right Uh here's one Noel you know Noel doesn't equal

922
00:44:30.365 --> 00:44:30.809
null

923
00:44:30.918 --> 00:44:34.190
right And like this it would it would get filtered out here right Whenever you

924
00:44:34.190 --> 00:44:34.409
do

925
00:44:34.409 --> 00:44:37.779
a comparison like that where it's like if something doesn't equal Noel or Noel

926
00:44:37.779 --> 00:44:38.019
doesn't

927
00:44:38.019 --> 00:44:38.648
equal null

928
00:44:39.168 --> 00:44:41.731
that's where there's like that other queer there's like another one there's like

929
00:44:41.731 --> 00:44:41.929
that

930
00:44:41.929 --> 00:44:46.773
is distinct from and that's where like you can actually handle Noel in these

931
00:44:46.773 --> 00:44:47.119
cases

932
00:44:47.128 --> 00:44:51.587
but like I never use those I always end up using like the hard equal signs and

933
00:44:51.587 --> 00:44:51.849
does

934
00:44:51.849 --> 00:44:56.105
not equal signs So but like that's a big thing to remember about your query

935
00:44:56.105 --> 00:44:56.389
patterns

936
00:44:56.389 --> 00:45:01.826
is that they can be like that right You can have um you can have all sorts of

937
00:45:01.826 --> 00:45:02.128
different

938
00:45:02.128 --> 00:45:06.571
things that can happen that like can mess things up and that's why it's

939
00:45:06.571 --> 00:45:06.889
important

940
00:45:06.889 --> 00:45:11.098
to have your um assumptions checked and your quality assumptions checked when

941
00:45:11.098 --> 00:45:11.449
you're

942
00:45:11.449 --> 00:45:16.586
building out these queries And so in this case this query is going to work

943
00:45:16.586 --> 00:45:16.929
better

944
00:45:17.199 --> 00:45:21.506
in in a lot of cases because it's it's querying a lot less data And so this

945
00:45:21.506 --> 00:45:21.760
query

946
00:45:21.760 --> 00:45:26.211
will work way faster but it also has a sequential problem right Where because

947
00:45:26.211 --> 00:45:26.530
now

948
00:45:26.530 --> 00:45:29.917
we are depending on yesterday's data because you see how we have like historical

949
00:45:29.917 --> 00:45:30.159
and

950
00:45:30.159 --> 00:45:34.263
then current and that it makes it a little bit harder to like backfill and that

951
00:45:34.263 --> 00:45:34.519
part

952
00:45:34.519 --> 00:45:35.519
of it's a little bit annoying

