Visualize this

Like many people who become interested in Big Data and start by searching the Internet, I discovered that some blogs are a reference on the subject. One of them is FlowingData, and after getting to know it and browsing it many times, I can only agree with its word-of-mouth reputation.

The blog has been around for years and remains relevant, partly because of its tutorials and its simple way of explaining how to visualize concepts and ideas, and partly because, despite its long track record (or precisely because of it), it stays up to date and publishes posts about current events and how to analyze them using data visualization techniques.

The blog’s author, Nathan Yau, has published two books on the subject. The first is “Visualize This”, a foundation stone for beginners (think of knowledge as a building: we have to start somewhere). The book helps you organize whatever knowledge you have picked up from blogs, articles and papers, because it presents in a simple, intuitive way the basic concepts needed to understand what big data is, and why and how to visualize it. Once those foundations are laid, it offers simple exercises worked out step by step: the best way to learn is by doing.

The first three chapters (Chapter 1: Telling Stories with Data, Chapter 2: Handling Data, and Chapter 3: Choosing Tools to Visualize Data) introduce how to tell stories with data, how to handle data so that it turns into information, and how to choose tools to visualize it. In each of these chapters the aim is to show the reader the variety of tools and ways of working that exist today and to give a general overview.

The following chapters are more practical. Chapter 4: Visualizing Patterns over Time shows how to visualize information over time, since data changes as events unfold. It also points out that the type of chart to use depends on the kind of data you have (discrete or continuous).

Chapter 5: Visualizing Proportions is about data grouped by categories, subcategories and population. This chapter shows how to represent the individual categories and, at the same time, how each one relates to the others. We see data as parts of a whole, and how to represent the information when proportions vary over time.

The most important idea in this chapter is that the visualization should represent the proportions accurately.

Chapter 6: Visualizing Relationships looks at relationships in the data: the similarities between groups, within groups, and even within subgroups. Looking for relationships in your data can be challenging (an elegant adjective for laborious and difficult), but it is highly recommended because the data tells its own story through relationships and interactions. As the author explains (and I totally agree), playing with data means exploring it, and perhaps during the process you find something interesting. When that happens, you can explain to your readers what you found. After all, in those cases it is the data that chooses to tell a story, instead of being forced to fit a preconceived idea.

Chapter 7 is about how to spot groups within a population and across multiple criteria, and how to spot outliers (values far above or below the median) using common sense.

This is simple when you compare across a single variable, but you need more tools when the dataset has many variables for each object being compared.

Chapter 8 is about maps, and what can I write about maps that has not been written before? They are an excellent way to visualize information because they are more than intuitive: everyone is familiar with maps, so finding a way to show information on them means taking one more step on well-known ground.

I really enjoyed this chapter because the results achieved using R at the beginning, and later Python and SVG, are amazing. Add a few touches of Illustrator (or Inkscape) and the final results are outstanding and professional.

Chapter 9 closes the book with many recommendations. The most valuable one is to remember that you are designing and presenting the information for other people, not for yourself: it’s your job and responsibility to set the stage.

 

Chapter 3

What software should I use to visualize my data? There are many options: some are out-of-the-box, click-and-drag tools, while others require a little programming.

Out-of-the-box Visualization

Copy and paste some data or load a CSV file and you’re done. Select the graph and voilà!

 

  • Microsoft Excel | Google Spreadsheets
  • Many Eyes
  • Tableau Software: offers many interactive visualization tools and does a good job with data management. There are two versions, one free and one paid. The free version offers a reduced set of graphs, and the data used to create each graph is public. The paid version lets you keep the information private and offers the complete set of tools and graphs.
  • Trade-offs
    • You can customize some things, but flexibility is limited: there is only a small variety of options to choose from.

Programming

Even though it requires a considerable amount of effort and time to get started, once you reach a certain point you can do whatever you need with your data. Some of the tools you could choose:

  • Python / PHP
  • HTML / JavaScript and CSS
  • Trade-offs
    • It is like learning to speak a new language, with all the work, effort and time that involves.

Illustration

If you are an engineer, well, you are out of your comfort zone, and this is another thing you need to learn. Nevertheless, you should be reasonably comfortable with at least some of the best-known illustration tools, because you gain a lot of control over the information you present to the public, and if you present polished data graphics, people can clearly see the story you are telling.

  • Illustrator: Adobe Illustrator is the industry standard. Every graphic that goes to print at The New York Times was created with it. You can do whatever you need to do in graphics terms. The downside is that it is expensive.
  • Inkscape: the free alternative, very similar to Illustrator.
  • Trade-offs: these are tools for illustration and graphics. They were not created for data manipulation, but they are a necessary complement to your presentation work.

  

Chapter 4

How do you visualize time series data? Time data is everywhere. It is simply natural to have data over time.

Temporal data could be categorized as discrete or continuous. Knowing which category your data belongs to can help you decide how to visualize it.

In the discrete case, values come from specific points or blocks of time, and there is a finite number of possible values. For example, people take a test, and that’s it. Their scores don’t change afterward.

In the continuous case, the data is constantly changing, like temperature: it can be measured at any time of day, and it keeps changing.

Discrete points in time

  • Bar graph: Simple but useful graph.
  • Stacked bar chart:
  • Points using a scatterplot: each dot has an x- and a y-coordinate, which together represent one value. This kind of graph is usually used to visualize nontemporal data. For temporal data, time is represented on the horizontal axis, and values or measurements are represented on the vertical axis. The value axis of a scatterplot doesn’t always have to start at zero, but it is good practice.

Continuous data

Use a continuous line.

  • Smoothing and estimation: LOESS (locally weighted scatterplot smoothing) enables you to fit a curve to your data. LOESS starts at the beginning of the data and takes small slices. At each slice it estimates a low-degree polynomial for just the data in that slice. LOESS moves along the data, fitting a bunch of tiny curves, and together they form a single curve.
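
To make the LOESS description concrete, here is a minimal sketch in plain Python. It is a simplification under stated assumptions, not the book's code: the names `tricube` and `loess`, the `frac` parameter, the tricube weighting and the choice of a straight line (degree 1) for each slice are all illustrative.

```python
def tricube(u):
    """Tricube weight: 1 at u = 0, falling to 0 once |u| reaches 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def loess(xs, ys, frac=0.5):
    """Smooth ys over xs by fitting a weighted straight line around each x.

    frac is the (assumed) share of the points used for each local slice.
    """
    n = len(xs)
    k = max(2, min(n, int(round(frac * n))))  # points per local slice
    smoothed = []
    for x0 in xs:
        # The k nearest neighbours of x0 define the width of the slice.
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1.0  # guard against a zero-width slice
        w = [tricube((x - x0) / h) for x in xs]
        # Weighted least squares for a straight line through the slice.
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx else 0.0
        # Evaluate the local line at x0: one "tiny curve" per point.
        smoothed.append(my + slope * (x0 - mx))
    return smoothed


if __name__ == "__main__":
    # Hypothetical noisy measurements, invented for illustration only.
    days = list(range(10))
    values = [3.1, 2.8, 3.6, 4.0, 3.7, 4.5, 5.2, 4.9, 5.6, 6.0]
    print(loess(days, values, frac=0.5))
```

A larger `frac` uses wider slices and gives a smoother curve, while a smaller one follows the data more closely.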

Chapter 5

What we look for when visualizing proportions: the maximum, the minimum and the overall distribution.

Parts of a Whole:

This is proportion in its simplest form: a set of parts that add up to 100 percent.

  • Pie: Simple and old school (introduced in 1801 by William Playfair). Main recommendation: don’t put too many wedges in one pie.
  • Donut chart: Almost the same as a pie chart, but with a hole in the middle. That space is usually used for a label or some other content.
  • Stacked bar chart: to show data over time, or to show data by categories.
  • Hierarchy and rectangles: for tree-structured data.
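
Whichever chart you pick, the first step is usually to turn raw counts into shares of the whole. A minimal sketch in plain Python follows. The function name `to_percentages` and the sample counts are hypothetical and not taken from the book.

```python
def to_percentages(counts):
    """Convert raw category counts into percentages of the whole."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must add up to more than zero")
    return {name: 100 * value / total for name, value in counts.items()}


if __name__ == "__main__":
    # Hypothetical survey answers, invented for illustration only.
    answers = {"Excel": 40, "R": 35, "Python": 25}
    for name, share in to_percentages(answers).items():
        print(f"{name}: {share:.1f}%")
```

Checking that the shares add up to 100 before drawing is a cheap way to catch a missing category.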

Proportions over time:

  • What happens if you have a set of proportions over time? Most commonly those proportions vary, and there are different ways to show that:
    • Stacked continuous: the charts for each time period are “stacked” one on top of the other.
    • Point-by-point: very similar to the stacked continuous graph, but each line represents one category and its variation over time. The result is a graph that is perhaps easier to read than the previous one.

Chapter 6

This chapter is about visualizing relationships between variables. Throughout the chapter we’ll see three different concepts: correlation, distribution and comparison.

  • Correlation: when one thing tends to change in a certain way as another thing changes. In all cases, but especially in those involving correlation, the graph is important, but the interpretation of the results is even more important.
    • Relationship between two variables: We use a scatterplot to find it.
    • Relationship among several variables: We use a scatterplot matrix, especially useful during exploration phases. It’s also possible to create a scatterplot matrix with fitted LOESS curves.
    • Bubbles: even though scatterplots are the workhorse for correlation, you can use bubble charts to add a third variable to the same graphic: the area of each bubble, in addition to its x-axis and y-axis position.
  • Distribution: We’ll see graphs that visualize all of your data, in order to see the full distribution.
    • Distribution bars, or histogram
    • Density plots
  • Comparison: Sometimes it’s useful to compare multiple distributions rather than just the mean, median and mode. In those cases a histogram matrix is useful. At this point the book presents different cases, but in the end the most important idea in this section is: refine your graph to avoid interpretation problems for your readers. You need to do your best to explain the data and take extra care in telling the story.
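
A scatterplot shows a correlation visually, and a single number can accompany it. The sketch below computes the Pearson correlation coefficient in plain Python. This is a standard statistic added here for illustration, not a method the review attributes to the book, and the name `pearson_r` and the sample values are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: +1 rising together, -1 opposite, 0 no linear link."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / (sx * sy)


if __name__ == "__main__":
    # Hypothetical paired measurements, invented for illustration only.
    hours_studied = [1, 2, 3, 4, 5, 6]
    test_scores = [52, 55, 61, 60, 68, 72]
    print(round(pearson_r(hours_studied, test_scores), 3))
```

As the chapter stresses, the number is only a starting point: it measures linear association and says nothing about cause, so the interpretation still has to come from knowing the data.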

Chapter 7

This chapter is about how to spot groups within a population and across multiple criteria, and how to spot outliers using common sense.

With a lot of common sense, the author explains what happens if you want to compare the square footage of two houses: it’s easy because it’s one single variable. But what happens when you also want to compare the number of bathrooms, floors, and perhaps more variables? It gets tricky, and that is why we look for ways to compare across multiple variables.

Comparing across multiple variables

  • Showing it all at once: instead of numbers, you can use colors to indicate values, which makes it easy to find the high and low values.
    • Heatmap: shows several groups of variables at once, with color indicating how high or low each value is. Remember that a heatmap lets you see all your data at once, but the focus is on individual points.
    • Chernoff faces: you can use faces to show multivariate data, so that each unit is seen as a whole instead of split up across several metrics. However, this method is a little nerdy and can be confusing for the general public.
    • Star chart: you can use an abstract object and modify its shape to match the data values. The center is the minimum for each variable, and the ends represent the maximum. It is possible to represent several units on a single chart, but it becomes useless in a hurry, which makes for a poorly told story.
  • Running in parallel: to identify groups, or variables that could be related.
    • Parallel coordinates: One line per unit, and after connecting the dots you can look for common trends across multiple units. The scales are relative: each axis spans the minimum to the maximum of its variable. Because of the number of variables and lines, this graphic can be a little confusing, so as good practice the last step should be editing the graph in Illustrator (or similar) to add colors, labels, blurbs and text in order to obtain a clear result.
  • Reducing dimensions: multidimensional scaling places the entities with the most similar variables close together. Nevertheless, this graph is really abstract and perhaps not suited to a general audience.
  • Searching for outliers: All the previous cases in this chapter were about how units of data belong to certain groups. In this section the focus is on units that don’t belong to any group. These points are called outliers. Sometimes they are the most interesting part of your story, and sometimes they are just typos with a missing zero. The point is that you don’t want to build a graph on the premise of an outlier, because in the end the resulting graph doesn’t make any sense.
  • You can use specific functions, but nothing is better than common sense, basic plots and knowledge of the data you are handling. Once you find the outliers, you can use different colors, add pointers or use thicker borders to highlight them in a graph (if that is your intention, of course; otherwise, if an outlier doesn’t add any relevant information, you can remove it).
    • You can also use a box plot, which shows the quartiles of a distribution. A box plot can automatically highlight points that lie more than 1.5 times the interquartile range above the upper quartile or below the lower quartile.
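
The box plot rule can be sketched in a few lines of plain Python. This assumes the common convention of flagging points that lie more than 1.5 times the interquartile range beyond the quartiles. The function name `iqr_outliers`, the `k` parameter and the sample data are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _median, q3 = quantiles(values, n=4)  # lower quartile, median, upper
    iqr = q3 - q1                             # interquartile range
    low, high = q1 - k * iqr, q3 + k * iqr    # the box plot "fences"
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    # Hypothetical readings; the last one might be a typo with an extra zero.
    readings = [10, 12, 11, 13, 12, 11, 10, 120]
    print(iqr_outliers(readings))
```

A flagged value is only a candidate: as noted above, common sense and knowledge of the data decide whether it is a typo or the most interesting part of the story.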

Chapter 8: Maps

This chapter reviews maps using R, Python and SVG. Using maps is almost the same as using statistical graphics, except that instead of x- and y-coordinates you are dealing with latitude and longitude.

It also gets quite interesting when we introduce time. One map can represent a slice of time, so several maps represent several slices of time.

  • Specific locations: just map a list of locations based on latitude and longitude.
    • Map with dots: a map of specific points.
    • Map with lines: to connect the dots on your map.
    • Scaled points: a map with points that applies the principles of the bubble plot.
  • Regions: to represent not only single locations but counties, states and countries as entire regions.
    • Color by data: choropleth maps are the most common way to map regional data. Based on some metric, regions are colored following a color scale that you define. Variations in colors, categories and symbols let you tell the complete story. You can also annotate your maps to highlight specific regions or features, and aggregate to zoom in on countries.
  • Incorporating time: to visualize the data over space and time.
    • Small multiples: one map for each slice of time.
    • Take the difference: it is not always necessary to create multiple maps to show changes. Sometimes it makes more sense to visualize the actual difference in a single map, which highlights changes instead of single slices in time. It is especially useful to add a legend, source and title if the graphic is for a wider audience.
  • Animation: One of the most obvious ways to visualize changes over space and time is to animate your data. Instead of showing slices in time with individual maps, it is possible to show the changes as they happen on a single interactive map.
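
The "color by data" idea behind a choropleth comes down to sorting each region's value into a class and giving each class a color. The sketch below shows one way to do that in plain Python. The function name `color_for`, the break points, the hex colors and the regional rates are all hypothetical, and drawing the map itself (for example by editing an SVG file) is left out.

```python
from bisect import bisect_right


def color_for(value, breaks, colors):
    """Pick the color of the class that value falls into.

    breaks must be sorted ascending; colors needs one more entry than breaks.
    """
    if len(colors) != len(breaks) + 1:
        raise ValueError("need exactly one more color than breaks")
    return colors[bisect_right(breaks, value)]


if __name__ == "__main__":
    # Hypothetical color scale from light to dark and made-up regional rates.
    breaks = [2, 4, 6, 8]
    colors = ["#f1eef6", "#bdc9e1", "#74a9cf", "#2b8cbe", "#045a8d"]
    rates = {"North": 1.5, "Coast": 4.2, "Valley": 9.7}
    for region, rate in rates.items():
        print(region, color_for(rate, breaks, colors))
```

The same function also works for a difference map: pass the change between two slices of time as `value` and choose breaks around zero.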

Chapter 9: Design with a purpose

How you design your graphics affects how readers interpret the underlying data.

Visualization is about communicating data, so it is necessary to take the time to learn what each graphic is built on.

Important highlights:

Know your data. After all, how can you explain the interesting points in a dataset when you don’t know the data?

Learn about numbers and metrics.

Figure out where they came from and how they were estimated, and see if they even make sense.

Take the time (and it will certainly take time) to get to know your data and learn the context of the numbers.

Punch some numbers in R to understand what each metric represents.

After you learn all you can about the data, you are ready to design your graphics: if you know the data, the visual storytelling will come naturally.

Prepare your readers:

The objective of a data designer is to communicate what you know to your audience. Assume that your readers receive the graph without any context, so accompanying it with labels, titles and colors is vital.

Conclusion

It is a book worth reading: short but not too short, focused on presenting concepts and on explaining in a practical way how to create graphics based on data. Well organized, well laid out, and with good graphics, it is an excellent start for anyone interested in data visualization.
