MM CW 10 – What policymakers know about girls’ and women’s issues

Context:

Policymakers in 5 different countries (Colombia, India, Indonesia, Kenya, and Senegal) were asked to estimate indicators related to girls’ and women’s issues. Examples are the share of women in the labour force, the early marriage rate for women, etc.

Original Viz:

How the viz could be improved:

  • It is not explained what the diamonds stand for (they stand for one answer of a policymaker).
  • A summary of the answers which shows how far off the different policymakers were for each topic would facilitate the interpretation of each chart.
  • The gridlines are not necessary. They compete with the most recent available data line for attention.

My makeover:

This time I created two versions of the makeover. My first viz received some critical feedback, which I will come back to further down. First, my thoughts on the first version:

MMCW10_Dash (1)

The key message I want to convey:

  • I wanted to highlight the topics with the biggest discrepancy between estimate and actual measure. I achieved this by calculating the average of all answers of all policymakers per country and comparing it to the correct answer right next to it.

The design choices I made:

  • I opted for a diverging bar chart. This is normally a good choice when comparing 2 measures over multiple topics.
  • I also added an overview box to the dashboard. The arrows show if the policymakers on average have overestimated or underestimated an issue. The colors signal if the average of the policymakers’ answers lies within the 20% range of the correct answer.

The feedback I got for this viz during the viz review live meeting was quite mixed. Major criticisms were:

  • Too many colors
  • The arrows are confusing since they display two measures at the same time: overestimation/underestimation with the direction of the arrows, and within 20% range / not within 20% range with the color palette. This might be confusing because either preattentive attribute (color and shape) could represent either measure.
  • The comparison between the correct answer and estimate could be facilitated by using the correct answer as a target rather than showing it as an individual bar.

Taking this feedback into account, I created a new version of the viz:

MMCW10_Dash (2)

Now, the correct answers are no longer separate bars but target lines on the estimate bars. Also, I got rid of the arrows and replaced them with bar charts which show the difference between the average estimate and the correct answer. The bars are colored depending on whether the estimate falls within the 20% range of the correct answer or not.
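The calculation behind the difference bars can be sketched in a few lines of Python. This is only a minimal sketch: the numbers are made up, the function name summarize is hypothetical, and I am assuming that "within the 20% range" means within plus or minus 20% of the correct value (relative, not percentage points).

```python
from statistics import mean

# Made-up estimates (in %) from individual policymakers for one indicator.
estimates = [30.0, 45.0, 50.0, 35.0]
correct_answer = 25.0  # made-up "most recent available data" value


def summarize(estimates, correct_answer, tolerance=0.20):
    """Return the average estimate, its gap to the correct answer, and
    whether the average lies within +/- tolerance (relative) of it."""
    avg = mean(estimates)
    gap = avg - correct_answer  # positive = overestimated, negative = underestimated
    within = abs(gap) <= tolerance * correct_answer
    return avg, gap, within


avg, gap, within = summarize(estimates, correct_answer)
print(avg, gap, within)  # 40.0 15.0 False
```

The sign of the gap carries the over/underestimation and the boolean carries the color, so each visual attribute encodes exactly one measure.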

I got rid of some colors as well. The background color now is in a more subtle eggshell tone and the countries, which are used as filters, are colored in dark grey rather than in pink.

 

DAND – Python Introduction continued

I finished the control flow and functions lessons of the Python section today. The lessons are definitely not designed for complete programming newbies; they are rather a quick refresher on the most important concepts needed for the program. The quizzes are also a little bit trickier than you would expect from a normal introductory Python course. For me, this fits my profile quite well since I already have some knowledge of key concepts like loops, if conditions, etc., but I definitely have a lot more to learn.

In the control flow lecture, if statements, for and while loops were topics which were already familiar to me. Hence, I will not go into more detail here. New to me were Break & Continue, Zip and enumerate and list comprehensions.

Break & Continue:

Break stops the loop when a condition is met.

y = ["a", "b", "c", "d"]

for e in y:
    if e == "c":
        break  # leave the loop as soon as "c" is reached
    print(e)

This will output:

a
b

c and d will not be printed since the break statement runs when e == "c".

The continue statement works similarly. If a certain condition is met, the rest of the current iteration is skipped but, unlike with the break statement, the loop continues with the next iteration.

for e in y:
    if e == "c":
        continue  # skip only this iteration
    print(e)

The output is a, b, d (one per line). c is not printed, but the loop continues after the c iteration. Therefore, the last item of the list, d, is printed as well.

Enumerate and zip

Enumerate is a built-in function which returns, for each item in the loop, the index of its position together with the item itself.

names = ["Dennis", "Hans", "Rick"]

for i, name in enumerate(names):
    print(i, name)

Output: 0 Dennis, 1 Hans, 2 Rick (one pair per line)

The zip function is used to combine multiple lists.

names = ["Dennis", "Hans", "Rick"]
age = [22, 44, 55]

personalinfo = list(zip(names, age))
print(personalinfo)

Output:

[('Dennis', 22), ('Hans', 44), ('Rick', 55)]

It returns a list of tuples which combines the information of both lists, names and age.

To see the combined elements, the zip object has to be turned into a list. Otherwise, the print statement will just show the zip object itself. Another option is to loop over the zip object.

names = ["Dennis", "Hans", "Rick"]
age = [22, 44, 55]

personalinfo = zip(names, age)  # zip object is not turned into a list!

for info in personalinfo:
    print(info)

The output is the 3 different tuples:

('Dennis', 22)
('Hans', 44)
('Rick', 55)

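It is also possible to unzip a list of tuples with the asterisk operator: *

name, age = zip(*personalinfo)

This undoes the zipping and stores the elements in the variables name and age, each as a tuple (not a list). Note that personalinfo has to be the list of tuples here; a zip object that has already been looped over is exhausted and yields nothing.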
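A small sketch of why the zip object matters here: it can only be consumed once, so a second pass yields nothing, and unzipping returns tuples. The variable names first_pass and second_pass are only for illustration.

```python
names = ["Dennis", "Hans", "Rick"]
ages = [22, 44, 55]

pairs = zip(names, ages)
first_pass = list(pairs)   # consumes the zip object
second_pass = list(pairs)  # the zip object is now exhausted

# Unzipping works on the stored list of tuples and returns tuples.
unzipped_names, unzipped_ages = zip(*first_pass)

print(first_pass)      # [('Dennis', 22), ('Hans', 44), ('Rick', 55)]
print(second_pass)     # []
print(unzipped_names)  # ('Dennis', 'Hans', 'Rick')
print(unzipped_ages)   # (22, 44, 55)
```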

List comprehensions

This is a concise way of writing a for loop in one line. The structure is the following:

newlist = [x+2 for x in range(5)]

print (newlist)

Output: [2, 3, 4, 5, 6] # For every x in the range 0-4 (5 is non inclusive) 2 is added to x.

The general structure of the list comprehension is that first it is specified what is done to the element which is iterated over: x+2

Then it is specified how many iterations should occur: for x in range(5)

As the last step, the arguments are embedded in [] brackets to create the list. This has the big advantage that no empty list has to be created first to which then elements have to be appended.
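For comparison, here is the same result built with a regular for loop, which needs the empty list and the append call that the comprehension saves. The name newlist_loop is just for this sketch.

```python
# Classic loop: create an empty list first, then append to it.
newlist_loop = []
for x in range(5):
    newlist_loop.append(x + 2)

# List comprehension: same result in one line.
newlist = [x + 2 for x in range(5)]

print(newlist_loop)             # [2, 3, 4, 5, 6]
print(newlist_loop == newlist)  # True
```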

This basic structure can vary a bit when taking if statements into consideration.

newlistif = [x+2 for x in range(5) if x%2==0]
print (newlistif)

Output: [2,4,6]

Here, the if statement is added at the very end, behind the range argument. Now, +2 is only added to even numbers.

However, if an if-else statement is needed, the order changes.

newlistifelse = [x+2 if x%2==0 else x+1 for x in range(5)]
print (newlistifelse)

Output: [2, 2, 4, 4, 6]

Now the if statement is located directly behind the expression which specifies the action (x+2). This makes sense since it is also the expression the if statement refers to. If the condition is true, 2 will be added to x. The if statement is then followed by the else statement, which in turn is followed by the action that belongs to it (x+1).
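Writing both variants out as ordinary loops makes the difference visible: a trailing if filters elements out, while if-else picks one of two actions for every element. The names filtered and mapped are only used in this sketch.

```python
# Trailing "if": odd numbers are dropped entirely.
filtered = []
for x in range(5):
    if x % 2 == 0:
        filtered.append(x + 2)

# "if ... else": every number is kept, only the action differs.
mapped = []
for x in range(5):
    if x % 2 == 0:
        mapped.append(x + 2)
    else:
        mapped.append(x + 1)

print(filtered)  # [2, 4, 6]
print(mapped)    # [2, 2, 4, 4, 6]
```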

Functions:

I won’t go into much detail regarding the general functioning of functions since this was already a familiar concept to me. Functions work as follows:

Multiple inputs can be passed into a function, and inside the function’s body it is defined how these inputs are handled. In my example function my_function, the 2 inputs a and b are passed into the function and then multiplied with each other. The return statement returns the result of the function.

def my_function(a, b):
    return a * b

result = my_function(5, 2)
print(result)

Output: 10

New to me was the lambda function. This is a quick way of defining short functions. The prior function can be rewritten as a lambda function in the following way:

resultlambda= lambda a,b: a*b
print (resultlambda(2,3))

Output: 6

As you can see, the function can be written and assigned to a variable in a single line instead of several.
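Lambda functions are often passed directly to another function instead of being stored in a variable. The following is a small sketch of that pattern using the built-in sorted function; the people list reuses the names and ages from the zip example, and sorting by age is my own illustration, not something from the lecture.

```python
people = [("Dennis", 22), ("Hans", 44), ("Rick", 55)]

# The lambda tells sorted() to compare the tuples by their second element (the age).
oldest_first = sorted(people, key=lambda person: person[1], reverse=True)

print(oldest_first)  # [('Rick', 55), ('Hans', 44), ('Dennis', 22)]
```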

Iterators and generators were the last concepts introduced in the functions lecture. This, I have to admit, I did not quite get. Generators are functions which create iterators. They are an alternative to lists: a list stores all of its elements at once, whereas a generator produces its elements one at a time as they are requested. The yield statement is what makes a function a generator. It takes the place of the return statement and yields the next element each time the iterator is advanced.

def generate(n):
    for i in range(n):
        yield i

var1 = generate(5)

for var in var1:
    print(var)

Output: 0, 1, 2, 3, 4 (one per line)
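What helped me picture this is stepping through a generator by hand with the built-in next function. A minimal sketch, with the generator function repeated so that the block runs on its own:

```python
def generate(n):
    for i in range(n):
        yield i  # execution pauses here until the next value is requested


gen = generate(3)

first = next(gen)      # runs the function up to the first yield
second = next(gen)     # resumes and runs up to the second yield
remaining = list(gen)  # collects whatever is left
leftover = list(gen)   # the generator is now exhausted

print(first, second, remaining, leftover)  # 0 1 [2] []
```

Nothing is computed until a value is asked for, and once the generator is used up it stays empty.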

MM CW 9 – World Economic Freedom

Hello folks,

the topic of this week’s makeover is the Economic Freedom Index. This index measures the economic freedom of almost every country in the world. Since its creation in 1970, new countries have been added almost every year. The latest data in this data source is from 2015.

What I dislike about the viz – In this case not much.

  • Overall this is a clean and straightforward visualization. The index rating is split into 4 groups: “most free”, “second quartile”, “third quartile” and “least free”. The countries on the world map are colored according to this color legend.
  • This is an explorative visualization, meaning insights have to be discovered on your own. However, it would have been nice to have some key findings pointed out. E.g. which country has improved the most, lost the most index points, etc. The additional table on the left only provides information about the ranking and the increase/decrease in points relative to the last measured index.
  • The countries in the table are not colored according to their group (most free, etc.). To find out about the group you would have to cross-check on the map.

My makeover:

MMCW9

The key message I want to convey

  • I want to show which countries have gained and lost the most index points since the first measurement (1970). This one took me way longer than expected and I changed the message multiple times. My first table calculation, which showed the difference between the index value measured in 2015 and the value in the first year of measurement, didn’t return meaningful results: Georgia and Mauritius came out as the countries with the largest improvement. This is because some countries were not included in 1970 and were only added in the following years. For these countries the calculation is their 2015 value minus 0 (the missing 1970 value), which obviously shows a larger gain in index points than for countries which were already included in the index in 1970. Considering this, I created a set which comprises only the countries that were already included in the 1970 index. This shows a completely different picture. Two South American countries took very different turns. Chile, once far below Venezuela in economic freedom, has taken a big upsurge and overtaken Venezuela. Venezuela, on the other hand, was well above average in economic freedom as of 1970, but as of 2015 it lags far behind all other countries.
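The pitfall with the missing 1970 values can be sketched in Python. The index values below are made up purely for illustration (they are not the real data), and change_since is a hypothetical helper; the point is only that a country without a start value has to be excluded rather than treated as 0.

```python
# Made-up index values; NOT the real Economic Freedom data.
index = {
    "Chile": {1970: 4.0, 2015: 7.5},
    "Venezuela": {1970: 7.0, 2015: 3.0},
    "Georgia": {2015: 8.0},  # not part of the index in 1970
}


def change_since(index, start, end):
    """Change in index points, only for countries measured in both years."""
    return {
        country: values[end] - values[start]
        for country, values in index.items()
        if start in values and end in values
    }


# Naive version: a missing start year silently counts as 0.
naive = {c: v.get(2015, 0) - v.get(1970, 0) for c, v in index.items()}
changes = change_since(index, 1970, 2015)

print(naive)    # {'Chile': 3.5, 'Venezuela': -4.0, 'Georgia': 8.0}
print(changes)  # {'Chile': 3.5, 'Venezuela': -4.0}
```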

The design choices I made

  • I opted for a jitter plot. Using the random function, I placed the data points for the countries along the x-axis within a range from 0 to 1 for each year. The y-axis shows the summary index of economic freedom.
  • I highlighted the countries Venezuela and Chile with bright colors and increased the bubble size.
  • I added an average line for each year.
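The jitter idea itself is independent of the tool. Here is a rough Python sketch, assuming each point gets a random horizontal offset between 0 and 1 inside its year column; the observations are made up and the function name jitter is hypothetical.

```python
import random

random.seed(42)  # fixed seed only so that the sketch is reproducible

# Made-up (year, index value) observations.
observations = [(1970, 6.0), (1970, 4.5), (2015, 7.5), (2015, 3.0)]


def jitter(observations):
    """Give every point an x position of year + random offset in [0, 1)."""
    return [(year + random.random(), value) for year, value in observations]


points = jitter(observations)
for x, y in points:
    print(round(x, 2), y)
```

The y values stay untouched; only the x position is spread out so that points of the same year do not sit on top of each other.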

MM CW 8 – Where do our Drugs and Medicine come from?

This week’s makeover is based on the following viz:

 

What I do not like about the viz:

  • The bubbles block the view. The map in the background is hardly visible and seems redundant.
  • The points do not correspond to the countries they represent on the map. E.g. the African countries are scattered somewhere in the ocean, and it looks like South Africa lies above Egypt and like they are neighboring countries, which is clearly not the case.
  • The bubble sizes make comparisons difficult. Is the bubble size of Belgium really bigger than the bubble size of France? It is really hard to tell, even though the numbers state that Belgium exports $4 Billion more than France.
  • A lot of bubbles are the same size. There are a lot of countries displayed which do not seem to play a major part in the market of Drugs and Medicine. These countries clutter the visualization and do not add additional insights.
  • The color scale is hard to distinguish. The color difference between the >10% and the 5%-10% bins is barely noticeable. The same goes for the colors of the lowest 2 bins.

My Makeover

MM8Dash (2)

The key message I want to convey

  • The Drugs and Medicine exports are dominated by very few countries. 
  • 5 countries export more Drugs and Medicine than the rest of the world.

The design choices I made

  • I opted for a simple stacked bar chart. This does the trick to show export volumes by country in relative comparison to each other.
  • I split the countries into two groups. By grouping the countries, a second layer of comparison is enabled. Now the group of top 5 exporting countries and the rest of the world can be compared along the Y-axis. Visualizing these 2 groups next to each other amplifies the huge difference in market share by country.
  • I chose a discrete color scale for the top 5 countries and colored all other countries uniformly. This puts an additional emphasis on the biggest exporting countries.
  • I added a reference line. The reference line at the end of the top 5 group makes it easier to compare the two groups. The white space which extends from the end of the bar chart to the reference line shows the additional export volume of the top 5 group.
  • The two big bubbles to the right of the stacked bar charts give perspective. To give a feeling for the total amount exported by the 2 groups, I added the bubbles which display the total value of Drugs and Medicine exports in billion $. They are not size mapped.
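The grouping behind the two bars boils down to sorting by export value and summing the first five against the rest. A minimal sketch with made-up numbers and placeholder country names (not the real export figures); split_top_n is a hypothetical helper.

```python
# Made-up export values in billion $ with placeholder country names.
exports = {"A": 50, "B": 40, "C": 30, "D": 20, "E": 10, "F": 5, "G": 4, "H": 3}


def split_top_n(exports, n=5):
    """Return the summed exports of the top n countries and of all others."""
    ranked = sorted(exports.items(), key=lambda item: item[1], reverse=True)
    top_total = sum(value for _, value in ranked[:n])
    rest_total = sum(value for _, value in ranked[n:])
    return top_total, rest_total


top_total, rest_total = split_top_n(exports)
print(top_total, rest_total, top_total - rest_total)  # 150 12 138
```

The last number is the gap that the white space up to the reference line represents.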

DAND – Term 1 finished!

As planned, I finished the Term 1 section this weekend. The last remaining section introduced SQL as a tool to access large amounts of data. It taught the basic statements needed to query data from a database: SELECT, FROM, WHERE, LIMIT, OR, AND, LIKE and ORDER BY. The SELECT statement is used to select the columns you are interested in. With the FROM statement, you specify the table from the database. For example, with SELECT channel FROM web_events you would select the channel column from the web_events table.

SQLStatement

Since tables can contain a large number of rows, all of which are returned by a plain SELECT statement, it can make sense to limit the output. This is where the LIMIT statement comes in handy:

LIMKITPNG

LIMIT 5 limits the output to 5 rows.

The WHERE Statement can be used as a filter. When using the WHERE statement only rows which fulfill the criteria specified in the WHERE Statement will be returned:

WHERE

The order of the statement is important as well. The WHERE statement has to follow the FROM statement. A reverse order would result in an error message.

The LIKE operator is used inside a WHERE statement but, as the name says, it also returns matches which are “like” the keyword without matching it 100%. To take our Facebook example: a WHERE channel = 'face' statement would return an empty output, since the given criterion does not exactly match the value 'facebook' stored in the rows. A WHERE channel LIKE statement, however, can return all rows which contain the string 'face'. To get the right result we need to add a wildcard as well. The '%' wildcard stands for any characters before or after the specified input. In our example this would look like this:

LIKE

All characters after the ‘face’ will be ignored. Therefore, all rows with a Facebook entry will be returned.
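The statements from this section can be tried out without a database server by using Python's built-in sqlite3 module. This is only a sketch: the web_events table and channel column come from the course example, but the id column and the four rows are made up, and channels is a hypothetical helper.

```python
import sqlite3

# In-memory database with a tiny, made-up web_events table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_events (id INTEGER, channel TEXT)")
conn.executemany(
    "INSERT INTO web_events VALUES (?, ?)",
    [(1, "facebook"), (2, "direct"), (3, "facebook"), (4, "twitter")],
)


def channels(query):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(query)]


limited = channels("SELECT channel FROM web_events ORDER BY id LIMIT 2")
exact = channels("SELECT channel FROM web_events WHERE channel = 'face'")
like = channels("SELECT channel FROM web_events WHERE channel LIKE 'face%'")

print(limited)  # ['facebook', 'direct']
print(exact)    # []
print(like)     # ['facebook', 'facebook']
```

Note how WHERE comes after FROM in every query, and how the exact comparison finds nothing while LIKE with the % wildcard matches both facebook rows.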

Term 1 Project – Explore Weather Data

The task was to pull the weather data of the closest city from a database with a SQL statement and then compare these temperatures with the global average data. The global average data could also be queried with a SQL statement and then downloaded as a csv file. To pass the project, a line graph had to be provided which compares the city close to where you live with the global average over time (the measurements start in the 1750s). Also, the graph should show the running average rather than the raw yearly values, to smooth out the yearly variance.

Overall, the tasks did not seem too difficult, and my feeling is that this project was designed to not overwhelm the learners right at the beginning with a challenging project. The focus lies more on getting familiar with the way projects are reviewed, the project rubric, etc.

You could choose the tool for this challenge yourself. Python, Excel, R – all are valid options. I chose R. I just got started with R (I completed a class on Udemy last week), so I wanted to prove my newly acquired R knowledge right away. It probably took me a lot longer to complete the project with R than it would have in Excel, but I think it was worth it.

I was a little bit concerned about calculating the moving average since I was not sure how to do this in a data frame, but it turned out that there is already a package which does all the work for you. It is called TTR. With its SMA function you just specify the measure and the interval, and the moving average is calculated. The rest was pretty straightforward. The only complication was that I had the data for Hamburg (the nearest city) and the global average in two different data frames, so plotting both series in one graph was a challenge I had to solve. I found out that the data for each geom layer can be specified individually, so I could plot the final graph like this:

ggplot(NULL, aes(x = as.Date(ISOdate(year, 1, 1)))) + geom_point(data = df_1, aes(y = df_1$temp), color = "red") + geom_point(data = df_2, aes(y = df_2$temp), color = "blue")

The result looks like this:

Rplot04
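For anyone who wants to see what a simple moving average does under the hood, here is a plain Python sketch. It is not the TTR implementation; simple_moving_average is a hypothetical function, the temperatures are made up, and I am assuming that years without enough history simply get no value (None).

```python
def simple_moving_average(values, window):
    """Average of the current value and the (window - 1) values before it."""
    if window < 1:
        raise ValueError("window must be at least 1")
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            result.append(sum(chunk) / window)
    return result


temps = [8.0, 9.0, 10.0, 11.0, 12.0]  # made-up yearly temperatures
smoothed = simple_moving_average(temps, 3)
print(smoothed)  # [None, None, 9.0, 10.0, 11.0]
```

A larger window gives a smoother line but leaves more years at the start without a value.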

I will continue with the next section, Introduction to Python, tomorrow.

Cheers!

Data Analysts Nanodegree has started!

Yesterday, the content for the Data Analyst Nano Degree (DAND) was released. I could not wait to get started so I already dug into the material. The Syllabus for Term1 is the following:

The first part, Welcome to Term1, consists of 3 video lectures and a final capstone project which needs to be passed to complete this section. 

 

WelcometoND

Welcome to the Nanodegree Program! is the name of the first video series in the Welcome to Term1 section. It provides an overview of the course syllabus, an introduction of the different teachers, study tips, and encouragement to join the available study communities (Slack and the forums), in which you can ask for help from peers who are part of the same cohort. Udacity also provides its own career portal, which is very different from the other MOOC platforms I know. Here, you can get advice on improving your CV, LinkedIn page, etc. Udacity also fosters collaborations with possible employers and offers a career profile page which can be activated if you are actively looking for a new job.

LifeOfDataAnalyst

The second video series is called the Life of a Data Analyst. In this section, there are 3 videos about women who work in the field of data analytics: 2 data scientists, one working for Hire and one for Facebook, and one data analytics manager who works for Summit Schools. Her job is to create dashboards which show the performance of the different students. In this section, you can already see that Udacity is truly trying to close the gap between studying a new skill and really getting the chance to apply this knowledge in a new job. By giving context and showing the work of data analysts in their day jobs, a first link between the skill set required for the job and the syllabus is established.

I haven’t started yet with the next section. It will be about SQL. So after the introductory part, this will be the first section where data analytics skills are taught. I hope to finish this section by this week.

SQL

Overall, the first impression is good. The quality of the content is very high. The videos are very professional and engaging. Since the cost for the program is quite high (499€), I expected a noticeable difference to the other programs which are available on platforms like Coursera or Udemy. And at first glance, the higher quality is definitely visible, especially in the additional support provided for the students (interviews with professionals, career portal). There is also a mentor assigned to you who checks in with you regularly. I will write more about the quality of the video lectures in my next post.

See you then!

MM CW 7 – Winter Olympics Fever

For the start of the Winter Olympics in South Korea, a data viz was picked which shows all medals earned since the first Winter Olympics in 1924 by country and medal (bronze, silver, gold).

Overall, I find this to be an appealing viz. It looks orderly and clean. The message is clear and it is easy to find out how many and which medals a country won. The small KPI summary in each box also gives a nice overview of the total medals won and the split of medal ranks.

Points which could be improved

  • The subheaders are difficult to read. The subheader’s font size should be increased.
  • Comparisons are difficult. The countries are sorted by total medals won so it is easy to compare by this measure but besides this, a comparison between countries, for example by bronze medals, is difficult.
  • The viz stretches out very far. It is easy to get lost in the viz since you have to scroll down quite far to reach the bottom. It would be an advantage to have an overview which does not require scrolling.
  • Looking for a specific country is difficult. The large number of displayed countries makes it tedious to look for a specific country. A filter or highlighter could solve this issue.

My Makeover

WinterOlympicsDash

The key message I want to convey:

  • Since this is an explorative viz, there is not one conclusion to draw but multiple insights which can be explored. However, the focus lies on the medal count split by medal rank.

The design choices I made:

  • I chose a scatter plot to show the count of medals won by all countries over the years. I took advantage of the random function to distribute the scatter plot dots randomly within the year column.
  • The rows are split by medal rank. Each row shows the medals won by rank (bronze, silver, gold). The color palette is explained in the title, so no additional color legend is needed.
  • A country can be highlighted via the drop-down box. This facilitates the comparison between the countries. The selection of a country in the drop down menu results in an increase of the circle size for this country.
  • An overall summary of medals won is displayed on top of the scatter plot.