As planned, I finished the Term 1 section this weekend. The last remaining part introduced SQL as a tool for accessing large amounts of data. It covered the basic statements needed to query data from a database: SELECT, FROM, WHERE, LIMIT, OR, AND, LIKE and ORDER BY. The SELECT statement is used to pick the columns you are interested in, and the FROM statement specifies the table to read from. For example, SELECT channel FROM web_events selects the channel column from the web_events table.
Since tables can contain a large number of rows, it often makes sense to limit the output of a SELECT statement. This is where the LIMIT statement comes in handy:
LIMIT 5 restricts the output to 5 rows.
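As a quick sketch, the statements so far can be tried against an in-memory SQLite database from Python. The table and its contents below simply mirror the web_events example and are made up for illustration:

```python
import sqlite3

# Build a small in-memory table resembling the web_events example
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_events (id INTEGER, channel TEXT)")
conn.executemany(
    "INSERT INTO web_events VALUES (?, ?)",
    [(1, "facebook"), (2, "direct"), (3, "organic"),
     (4, "facebook"), (5, "adwords"), (6, "twitter"), (7, "banner")],
)

# SELECT picks the columns, FROM names the table, LIMIT caps the row count
rows = conn.execute("SELECT channel FROM web_events LIMIT 5").fetchall()
print(rows)  # at most 5 rows come back
```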
The WHERE statement can be used as a filter: only rows that fulfill the criteria specified in the WHERE clause are returned.
The order of the statements matters as well. The WHERE statement has to follow the FROM statement; reversing the order results in an error message.
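The filtering behaviour can be sketched the same way, again with a made-up table for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_events (id INTEGER, channel TEXT)")
conn.executemany("INSERT INTO web_events VALUES (?, ?)",
                 [(1, "facebook"), (2, "direct"), (3, "facebook")])

# WHERE follows FROM; only rows matching the condition are returned
rows = conn.execute(
    "SELECT id, channel FROM web_events WHERE channel = 'facebook'"
).fetchall()
print(rows)  # [(1, 'facebook'), (3, 'facebook')]
```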
The LIKE statement is similar to an equality check in a WHERE clause, but as the name suggests, it also returns matches that are merely "like" the keyword rather than exact matches. To take our Facebook example: a WHERE channel = 'face' condition would return an empty result, since 'face' does not exactly match the value 'facebook'. The LIKE statement, however, can return all rows that contain the string 'face'. But to get the right result we need to add a wildcard operator as well: the '%' marks the position at which any characters before or after the specified pattern should be ignored. In our example this would look like this:
All characters after 'face' are ignored, so all rows with a Facebook entry are returned.
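The difference between the exact match and the wildcard match can be sketched like this (table contents again invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_events (id INTEGER, channel TEXT)")
conn.executemany("INSERT INTO web_events VALUES (?, ?)",
                 [(1, "facebook"), (2, "direct"), (3, "twitter")])

# An exact comparison on 'face' misses the 'facebook' row ...
exact = conn.execute(
    "SELECT id FROM web_events WHERE channel = 'face'").fetchall()

# ... while LIKE with the % wildcard ignores whatever follows 'face'
like = conn.execute(
    "SELECT id FROM web_events WHERE channel LIKE 'face%'").fetchall()

print(exact)  # []
print(like)   # [(1,)]
```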
Term 1 Project – Explore Weather Data
The task was to pull weather data for the city nearest to you from a database with a SQL statement and then compare those temperatures with the global averages. The global average data could also be queried with a SQL statement and downloaded as a CSV file. To pass the project, you had to provide a line graph comparing the city closest to where you live with the global average over time (the measurements start in the 1750s). The graph should also show a moving average for each year to smooth out the yearly variance.
Overall, the tasks did not seem too difficult, and my feeling is that this project was designed not to overwhelm learners with a challenging project right at the beginning. The focus lies more on getting familiar with the way projects are reviewed, the project rubric, and so on.
The tool used to solve this challenge was up to you; Python, Excel, R – all were valid options. I chose R. I had just gotten started with it (I completed a class on Udemy last week), so I wanted to put my newly acquired R knowledge to work right away. It probably took me a lot longer to complete the project in R than it would have in Excel, but I think it was worth it.
I was a little concerned about calculating the moving average, since I was not sure how to do this on a data frame, but it turned out that there is already a package that does all the work for you: TTR. With its SMA function you just specify the measure and the interval, and the moving average is calculated. The rest was pretty straightforward. The only complication was that I had the data for Hamburg (the nearest city) and the global average in two different data frames, so plotting both lines in one graph was a challenge I had to solve. I found out that the data for each geom_line function can be specified individually, so I could plot the final line graph like this:

ggplot(NULL, aes(x = as.Date(ISOdate(year, 1, 1)))) +
  geom_line(data = df_1, aes(y = temp), color = "red") +
  geom_line(data = df_2, aes(y = temp), color = "blue")
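For readers without R at hand, the simple moving average that TTR's SMA computes can be sketched in plain Python. The window length and the sample temperatures below are made up for illustration:

```python
# Illustrative simple moving average, analogous to TTR::SMA(x, n) in R.
def sma(values, n):
    """Return the n-period simple moving average; the first n-1
    entries are None because the window is not yet full."""
    out = [None] * len(values)
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= n:
            running -= values[i - n]  # drop the value leaving the window
        if i >= n - 1:
            out[i] = running / n
    return out

# Hypothetical yearly average temperatures
temps = [8.0, 9.0, 7.5, 8.5, 9.5, 10.0]
print(sma(temps, 3))
```

Each entry is the mean of the current value and the two preceding years, which smooths out yearly variance just as the project asks.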
The result looks like this:
I will continue with the next section – Introduction to Python – tomorrow.