Category Archives: Data

Project step by step: Has the U.S. multiple birth rate hit a record high?

 

After reading “Data at Work”, I knew I wanted to try the sequence of steps the book proposes for solving a problem, since it lays out a very logical path for answering my own questions.

I found this book extremely useful for understanding and solving good problems. You can read my review and my notes about the book. Although the book is oriented entirely toward solving problems with Excel 2016, the steps involved (and the way of thinking) are valid for any data visualization problem.

Collecting the data

This is perhaps the most thankless task, because good and reliable data sources are hard to find.

Assessing data availability

After a thorough search of different government sources on the internet, I found several usable datasets in “Vital Statistics of the United States, 1980. Volume I, Natality” and the subsequent reports published for each year until 2014.

Adjusting the data

Fortunately, the data was already normalized, so in this case, there is no need to make any adjustment.

Exploring the data

As a first step, I created a few charts to examine and understand the data. The two graphs below are very simple, but they clearly show trends and proportions.

Distribution of the information

In this visualization, color shows the distribution of the information: how many multiple births happened by year and category (total, twins, triplets, quadruplets, and quintuplets).

 

[Image: us_birth_availability-2]

Evolution along time

The rise in multiple birth rates has been associated with expanded use of fertility therapies such as ovulation-inducing drugs and assisted reproductive technologies (ART). Older maternal age at childbearing also contributes to more multiple births because of elevated FSH (follicle-stimulating hormone) as women age.

[Image: us_multiplebirthcomparison3x]

Final conclusion

The answer to the initial question, “Has the U.S. multiple birth rate hit a record high?”, is yes: the multiple birth rate has grown continuously over the last three decades.

Notes:

Project step by step: In which states is the wage gap between men and women smallest?

At the beginning of each visualization project, the most important thing, along with the question to answer, is to look for a reliable data source. Data can come from many origins with different degrees of quality and statistical reliability, ranging from official sources to social networks.

In this case, the data and the question arrived almost at the same time: while reviewing the fantastic website http://statusofwomendata.org, I started wondering in which states the wage difference by gender and race is smallest.

Collecting the data

Although I found part of the data at http://statusofwomendata.org, the underlying source is the IWPR analysis of American Community Survey data (Integrated Public Use Microdata Series, Version 5.0).

 

Assessing data availability

Even once the source is established as reliable and accurate, that does not rule out missing values.

I could list the states where information is lacking, but it is better to create a graph that makes the missing data visible.
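
As a rough illustration of that idea, the sketch below counts and draws missing values in base R. The `wages` data frame, its column names and its numbers are invented for the example; they are not the IWPR data.

```r
# Hypothetical toy data: median earnings by state and group (NA = not reported).
wages <- data.frame(
  state    = c("Alabama", "Alaska", "Arizona"),
  white    = c(35000, 45000, 40000),
  hispanic = c(25000, NA, 28000),
  asian    = c(NA, NA, 42000),
  stringsAsFactors = FALSE
)

# Logical matrix: TRUE where a value is missing.
is_missing <- is.na(as.matrix(wages[, -1]))
rownames(is_missing) <- wages$state

# How many values are missing per state and per group.
missing_by_state <- rowSums(is_missing)
missing_by_group <- colSums(is_missing)

# Availability grid: grey cells are missing, blue cells are present.
image(x = seq_len(ncol(is_missing)), y = seq_len(nrow(is_missing)),
      z = t(is_missing * 1), col = c("steelblue", "grey85"),
      axes = FALSE, xlab = "", ylab = "")
axis(1, at = seq_len(ncol(is_missing)), labels = colnames(is_missing))
axis(2, at = seq_len(nrow(is_missing)), labels = rownames(is_missing), las = 2)
```

The two count vectors give a quick numeric summary, while the grid shows at a glance which state and group combinations have no data.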

[Heatmap with all the information]

Adjusting the data

Considering the question we want to answer, we will work with percentages instead of absolute values. For that reason, we will take the median for men as the point of comparison for women of different ethnic origins.

[Heatmap with the remaining information]
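
A minimal sketch of that adjustment, assuming the values being compared are median annual earnings. The figures and group names below are purely illustrative and are not taken from the dataset.

```r
# Hypothetical median annual earnings for one state (illustrative numbers only).
men_median   <- 50000
women_median <- c(white = 40000, black = 32500, hispanic = 27500)

# Women's earnings as a percentage of the men's median.
ratio_pct <- women_median / men_median * 100

# Wage gap in percentage points: how far each group is from parity (100%).
gap_pct <- 100 - ratio_pct
```

Working with `ratio_pct` or `gap_pct` makes states comparable even when their absolute earnings levels differ.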

Exploring the data

To understand the data, creating a few graphs, nothing complex, just something simple and plain, is usually a terrific tool. I would like to present two graphs: the first shows the trends in salaries by gender without any separation by ethnic group; the second shows the trends in salaries for women and men separated by ethnic group.

The purpose of both graphs is to serve as a working tool rather than as finished information: they give insight into the data and help in understanding the trends behind the plain numbers.

Some loose ideas about the data:
– There is a gender gap of around 20%.

[Graph 1: Trends for women and men]

[Graph 2: Trends for women and men separated by ethnic group]

Answering the question

The general idea is to create a rich chart that presents the information to the reader and leaves some room to analyze it.

 

 

Ideas:

Creating an SVG Map: Women in Parliament

Nathan Yau’s book, Visualize This, is a book for learning by reading and doing. The theory and concepts are stated clearly and colloquially, as if it were a great master class, and the examples are explained step by step, detailing the reason behind each line of code.

I liked several graphs, their practical application and the way they allow you to visualize the data. I have already included another example (inspired by the book) corresponding to the use of a heatmap.

In this case, I take as an example the practical case shown in the section Map Countries from Chapter 8, Visualizing Spatial Relationships, about how to color countries using an SVG map.

1 SVG File for World

SVG files are XML files, so they can easily be edited with a text editor. The XML tells the browser what to show, such as the shapes and their colors, so you can change a color directly in the file.
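
Because the file is plain text, the same kind of edit can also be scripted. The sketch below uses a tiny hand-written SVG, not the world map file, and changes one fill color with a text substitution. The class name `.ar` and the colors are only examples.

```r
# A tiny hand-written SVG used only for illustration (not the world map file).
svg <- c(
  '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="50">',
  '  <style type="text/css">',
  '    .ar { fill: #e0e0e0 }',
  '  </style>',
  '  <rect class="ar" width="100" height="50"/>',
  '</svg>'
)

# Replace the six-digit hex color of the .ar rule; other lines are untouched.
svg_new <- sub("(\\.ar \\{ fill: )#[0-9A-Fa-f]{6}", "\\1#2A87BC", svg)
```

Writing `svg_new` to a file with `writeLines()` and opening it in a browser would show the rectangle in the new color.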

2 Coloring the countries

I used the designer.colors function in R (in the fields package) to make a linear scale of 256 ‘new’ colors.

library(fields)  # provides designer.colors

Ncolors <- 256
ColRamp <- designer.colors(n=Ncolors, col=c("#CCEBC5", 
"#A8DDB5", "#7BCCC4", "#4EB3D3", "#08589E", "#08589E"))

3 Generate hexadecimal colors

#Look up the country code for each country name.
#Rows without a match keep their current Country_Code.
for (i in 1:nrow(wParliament2016)) {
  for (j in 1:nrow(Countries)) {
    wParliament2016$Country_Code[i] <- ifelse(
      wParliament2016$Country_Name[i] == Countries$Name[j],
      Countries$Code[j], wParliament2016$Country_Code[i])
  }#for j
}#for i

wParliament2016[is.na(wParliament2016)] <- 0

#Set up the vector that will save the CSS code per country
CSS <- rep("", nrow(wParliament2016))

#Divide the range of the 2014 values into Ncolors bins
Bins <- seq(max(wParliament2016$X2014, na.rm=TRUE), 
     min(wParliament2016$X2014, na.rm=TRUE), length=Ncolors)

#Spot check: color of the bin closest to the value in row 100
ColRamp[which.min(abs(Bins - wParliament2016$X2014[100]))]

#Loop through all countries. Assign a color.
#Save the CSS text in a vector
for (i in 1:nrow(wParliament2016)) {
  #Find which bin is closest to the country's 2014 value
  ColorCode <- ifelse(!is.na(wParliament2016$X2014[i]),
    ColRamp[which.min(abs(Bins - wParliament2016$X2014[i]))], "white")
  #Country_Code is the alpha-2 country code
  CSS[i] <- paste(".", tolower(wParliament2016$Country_Code[i]),
    " { fill: ", ColorCode, " }", sep="")
}

write(CSS, "output2.txt", sep="\n")
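
The same two steps (code lookup and nearest-bin coloring) can be sketched in a more compact, vectorized form. This is only an illustration on toy data: the `parliament` data frame and its values are invented, and base R's `colorRampPalette` stands in for `designer.colors`, so the colors will not necessarily match the ones produced above.

```r
# Toy stand-ins for the data frames above (values are illustrative only).
Countries <- data.frame(Name = c("Argentina", "Andorra", "Angola"),
                        Code = c("AR", "AD", "AO"),
                        stringsAsFactors = FALSE)
parliament <- data.frame(Country_Name = c("Andorra", "Argentina", "Angola"),
                         X2014 = c(50, 20, NA),
                         stringsAsFactors = FALSE)

# match() gives, for each name, its row in Countries (NA when absent).
idx <- match(parliament$Country_Name, Countries$Name)
parliament$Country_Code <- Countries$Code[idx]

# A small ramp and bins running from max to min, as in the code above.
Ncolors <- 5
ColRamp <- colorRampPalette(c("#CCEBC5", "#08589E"))(Ncolors)
Bins <- seq(max(parliament$X2014, na.rm = TRUE),
            min(parliament$X2014, na.rm = TRUE), length.out = Ncolors)

# Index of the nearest bin for each value; NA values stay NA.
nearest <- sapply(parliament$X2014, function(v) {
  if (is.na(v)) NA_integer_ else which.min(abs(Bins - v))
})

# Countries without a value fall back to white.
ColorCode <- ifelse(is.na(nearest), "white", ColRamp[nearest])

CSS <- paste0(".", tolower(parliament$Country_Code),
              " { fill: ", ColorCode, " }")
```

Using `match()` avoids the nested loop, which matters little for a few hundred countries but keeps the lookup to one line.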

4 Edit SVG file

The next step is to edit the SVG file for the map, changing the fill attribute of each country to the hexadecimal color that corresponds to its level of women’s representation.

.af { fill: #67C7D3 }
.al { fill: #86CFBD }
.dz { fill: #4BB0D2 }
.as { fill: #C8EAC2 }
.ad { fill: #1A71AE }
.ao { fill: #217BB4 }
.ag { fill: #B9E4BA }
.ar { fill: #2A87BC }
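
Pasting the rules by hand works fine. As an alternative, the step could be scripted along these lines, assuming the map file already contains a `<style>` block. The SVG below is a hypothetical five-line stand-in, not the real map.

```r
# Hypothetical minimal SVG with an empty <style> block (stand-in for the map).
svg <- c('<svg xmlns="http://www.w3.org/2000/svg">',
         '<style type="text/css">',
         '</style>',
         '<path class="af" d="M0 0h10v10z"/>',
         '</svg>')

# Two of the generated rules, as they appear in output2.txt.
css <- c(".af { fill: #67C7D3 }", ".al { fill: #86CFBD }")

# Insert the rules just before the first closing </style> tag.
close_at <- grep("</style>", svg, fixed = TRUE)[1]
svg_out <- append(svg, css, after = close_at - 1)
```

Placing the new rules at the end of the style block lets them override earlier rules with the same selector.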

5 Edit the SVG map using Inkscape

Finally, the last step is to edit the resulting map using Inkscape (or Illustrator) to add a title, the source of the data and extra information to the image.

6 Final result

[Image: screenshot-2016-12-05-13-37-12]

Creating a Heatmap

 

A heatmap is basically a table that has colors in place of numbers. Colors correspond to the level of the measurement. Each column can be a different metric like above, or it can be all the same like this one. It’s useful for finding highs and lows and sometimes, patterns.

From Nathan Yau | Visualize This

To visualize trends within large sets of data, it is useful to create a heat map, which uses color instead of numbers to display highs and lows.

It is true that some accuracy is lost without the numbers, but a broad view of the trends is gained in exchange.

The colors used within the table belong to a spectrum based on each value’s distance from the statistical mean. That way, darker colors intuitively mean one thing and lighter colors another, which makes it quick to evaluate patterns, maximums and minimums.
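
One way to make "distance from the mean" concrete is to standardize each column before coloring, which is roughly what `heatmap(..., scale = "column")` does internally. The sketch below is an assumption-laden illustration on a made-up matrix: the nine-step palette and the equal-width binning via `cut()` are choices made for the example.

```r
# Made-up matrix: 3 rows, 2 metrics on very different scales.
m <- matrix(c(1, 2, 3, 10, 20, 60), nrow = 3,
            dimnames = list(c("A", "B", "C"), c("x", "y")))

# Column-wise z-scores: (value - column mean) / column standard deviation.
z <- scale(m)

# Nine colors from light to dark, and nine equal-width bins over all z-scores.
pal <- colorRampPalette(c("#F7FBFF", "#08306B"))(9)
bin <- cut(as.vector(z), breaks = 9, labels = FALSE)

# One color per cell: lighter is below the column mean, darker is above.
cell_colors <- matrix(pal[bin], nrow = nrow(z), dimnames = dimnames(m))
```

After scaling, both columns contribute comparably to the color range even though `y` has much larger raw values than `x`.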

Intro

As I read the book “Visualize This” by Nathan Yau, I kept thinking about which projects could implement the ideas presented. One of the graphics that came back to my mind again and again was the heatmap.

Code

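library(RColorBrewer)

#Read the data and sort the rows by the Title column
america <- read.csv("AmericaCupData.csv", sep=",")
america <- america[order(america$Title, decreasing = FALSE),]

#Use the team names as row labels and keep columns 2 to 17
row.names(america) <- america$Team
america <- america[,2:17]
#heatmap() needs a numeric matrix
america_matrix <- data.matrix(america)

#No dendrograms (Rowv, Colv = NA); values are scaled by column
america_heatmap <- heatmap(america_matrix, Rowv=NA, 
Colv=NA, col = brewer.pal(9, "Blues"), scale="column", 
margins=c(5,10))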

Result

[Image: copaamerica]