by Jiaqi Li, DLINQ Intern, Computer Science & Geology Joint Major
Efforts powered by human curiosity and an innate need to understand the world underlie, for lack of a better term, all scientific exploration. That is precisely why science is, or should be, a universal language, and why communicating scientific understanding and exploration to a broad audience is crucial. Perhaps that is why data visualization hooked me: it embodies both the scientific element and the power of storytelling.
Three major parts constitute a science, as David Donoho outlines in his paper 50 Years of Data Science, namely:
- Intellectual content
- Organization in an understandable form
- Reliance upon the test of experience as the ultimate standard of validity
The combination of these three elements in service of data draws me further into data science. Despite the buzz surrounding the term data, and despite my walking around claiming to be interested in data analysis, it was not a word I had examined before this summer. Wikipedia explains that data is “a set of values of qualitative or quantitative variables”: facts and information with identifiable qualities. In other words, as I interpret it, data is a representation of reality, and data science is therefore the science that interprets reality by learning from data, through its exploration, cleaning, transformation, computation, modelling, visualization and representation. This interpretation lets me strip away the buzz and the ballyhoo and focus on why data matters to me. It matters because, like all members of the natural and social community, I am constantly striving to make sense of my surroundings and my place in them.
Data visualization, however, is only one branch of data science. It strikes me as particularly important because it focuses on communication: visualization makes data accessible and interpretable to the general public. It is crucial that scientific understanding be communicated in an engaging and inviting way. Connecting back to my motive for exploring data, such communication is an invitation to share the scientific community’s understanding of external reality with a larger community. In his TED talk The Best Stats You’ve Ever Seen, for example, Hans Rosling shows how data can convey powerful, informative and fun messages to the public in a succinct, rich and dynamic way. To put it simply, data visualization is a form of storytelling.
My explorations this summer followed two interrelated branches: studying the theories, skills, structure and related fields of data science, and developing skills in data visualization. To study data science, I have been following the Data Science specialization on Coursera, a collection of nine courses and one capstone project published by Johns Hopkins University. I find it helpful to follow a prepared and vetted syllabus with provided materials, as the online course gives me a structure and organization of the discipline that I could not discover with such precision on my own. Additionally, the extended materials and examples in the course are great resources for grounding my understanding of data science in real-life instances.
In contrast, my progress in developing data visualization skills has been rather emergent, spontaneous and random. Following a path I designed for myself, I studied notebooks on Kaggle (an online dataset platform for data scientists), explored and marveled at visualizations by organizations and data scientists online, attempted to visualize random datasets, studied Python packages, and completed guided projects on Coursera. Throughout this process, I experimented with Excel, R and various Python packages as visualization tools. I focused on Python because I was already familiar with the language and it offers a wide range of packages for different types of visualization.
I started by refamiliarizing myself with Python and learning the pandas package, a tool for storing and manipulating data in a DataFrame, a two-dimensional structure that holds tabular data in rows and columns. Following online documentation and tutorials, I compiled a reference list of pandas functions and methods, and later did the same for Matplotlib’s plotting functions. These steps helped me learn these essential packages, and the lists proved useful as references while I explored datasets.
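A minimal sketch of the kind of pandas operations on my reference list, using a small hand-made DataFrame (the county names and numbers here are illustrative, not from the real dataset):

```python
import pandas as pd

# Build a small DataFrame by hand: columns are variables, rows are observations.
df = pd.DataFrame({
    "county": ["Addison", "Chittenden", "Addison"],
    "date": ["2020-07-01", "2020-07-01", "2020-07-02"],
    "cases": [12, 130, 14],
})

print(df.head())                         # peek at the first rows
print(df.dtypes)                         # column types
print(df["cases"].sum())                 # column arithmetic
subset = df[df["county"] == "Addison"]   # boolean-mask filtering
print(subset)
```

Collecting small idioms like these in one place meant I rarely had to re-read the full documentation mid-exploration.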
The first dataset I explored was the real-time U.S. County-Level Data, a public dataset of Covid-19 case counts by county, published by the New York Times.
At the time of this exploration, June – July 2020, Coronavirus Disease was on the front page of every news outlet, and I was checking the number of new cases in Vermont every day. Exploring this dataset allowed me to interpret the pandemic visually.
Instead of answering any specific question, I followed my intuition, rearranged the dataset and attempted various types of plots. The dataset is quite large and is updated every day. It became clear to me that the most valuable information comes from visualizing trends and changes over time and across space. For example, the number of values (counties) under each index (date) increases as the virus spreads, and eventually stabilizes once the virus has reached every county. Depending on the scale of the research question, the dataset can be sectioned to extract part of the information, to focus on one state or one day, for example.
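Sectioning the data comes down to boolean filtering in pandas. The sketch below uses a tiny stand-in table with the same column shape as the NYT file (date, county, state, cases); the numbers are made up so the example runs offline:

```python
import pandas as pd

# Toy stand-in for the NYT us-counties.csv file (illustrative values only).
df = pd.DataFrame({
    "date":   ["2020-07-03", "2020-07-03", "2020-07-04", "2020-07-04"],
    "county": ["Addison", "Cook", "Addison", "Cook"],
    "state":  ["Vermont", "Illinois", "Vermont", "Illinois"],
    "cases":  [62, 90000, 64, 90500],
})

vermont = df[df["state"] == "Vermont"]    # focus on one state
july4   = df[df["date"] == "2020-07-04"]  # focus on one day
print(len(vermont), len(july4))
```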
I extracted the data for Vermont counties to generate line plots showing the development of Covid-19 cases over time, with the number of cases on the y-axis and the date on the x-axis. The color-coded plots for the Vermont counties can be split up (the plot on the left) or drawn on the same axes to allow comparison.
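The combined version of that plot can be sketched with Matplotlib as follows; the two counties and their case counts here are hypothetical placeholders for the real extracted data:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical cumulative case counts for two Vermont counties.
data = pd.DataFrame({
    "date": pd.to_datetime(["2020-06-01", "2020-06-15", "2020-07-01"] * 2),
    "county": ["Addison"] * 3 + ["Chittenden"] * 3,
    "cases": [40, 55, 62, 700, 850, 960],
})

fig, ax = plt.subplots()
for county, grp in data.groupby("county"):
    ax.plot(grp["date"], grp["cases"], label=county)  # one color-coded line per county
ax.set_xlabel("date")
ax.set_ylabel("cumulative cases")
ax.legend()
fig.savefig("vermont_cases.png")
```

Splitting the counties into separate panels is the same loop with `plt.subplots` given multiple axes.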
A pie chart is an effective way to see how Covid-19 cases on a given date (2020-07-04 in the example below) are distributed among the states.
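A pie chart of that distribution takes only a few lines in Matplotlib; the state totals below are illustrative, not the real 2020-07-04 figures:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical case totals by state on a single date.
states = ["New York", "California", "Texas", "Vermont"]
cases = [400000, 250000, 200000, 1250]

fig, ax = plt.subplots()
ax.pie(cases, labels=states, autopct="%1.1f%%")  # slice size = share of the total
ax.set_title("Covid-19 cases by state, 2020-07-04")
fig.savefig("state_pie.png")
```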
It is also possible to visualize change geographically and temporally at the same time through animation, using Plotly Express.
After getting bored with this dataset, and prompted by a random comment from a friend, I started exploring the Preview of Fall20 Courses data provided by the College before fall 2020 registration. That exploration involved a few “courageous” attempts with no presentable results.
Finally, I delved into Top 50 Spotify Songs – 2019 on Kaggle.
Each record in the raw file is a comma-separated row; the entry for “Señorita” by Shawn Mendes, for example, reads: Señorita, Shawn Mendes, canadian pop, 117, 55, 76, -6, 8, 75, 191, 4, 3, 79.
This dataset covers the 50 most-streamed songs on Spotify worldwide in 2019. It is very different from the U.S. County-Level dataset: much smaller, with far more variables (attributes of a song) per value (song). Visualizing it is less about changes over time or across space, and more about comparing and contrasting the distribution of one variable across values, or of different variables of the same value.
After cleaning the original dataset by removing unnecessary entries and renaming columns and rows, I wondered how different variables of a song (e.g. “energy” and “popularity”) correlate, so I created some scatter plots, such as the one below (the song labels above the dots overlap, so I included only a few of them).
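A labeled scatter plot of this kind can be sketched as follows; the energy and popularity values for “Señorita” match the sample row above, while the other two songs’ values are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# A few rows in the shape of the Top 50 dataset (mostly illustrative values).
songs = pd.DataFrame({
    "track": ["Señorita", "bad guy", "Ransom"],
    "energy": [55, 43, 64],
    "popularity": [79, 95, 92],
})

fig, ax = plt.subplots()
ax.scatter(songs["energy"], songs["popularity"])
for _, row in songs.iterrows():
    # Label each dot with the track name (labels can overlap on dense plots).
    ax.annotate(row["track"], (row["energy"], row["popularity"]))
ax.set_xlabel("energy")
ax.set_ylabel("popularity")
fig.savefig("energy_vs_popularity.png")
```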
There did not seem to be a strong correlation between these two variables, but I was not satisfied with this plot since I could not draw convincing conclusions from it. A correlation matrix rendered as a heatmap, in contrast, conveys both the quantified correlations and a visual signal (the color scale) for each pair of variables.
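I built my heatmap with Seaborn, but the same idea can be sketched in plain Matplotlib: compute the pairwise correlations with pandas, color the matrix, and print each coefficient in its cell. The four columns and their random values below stand in for the real Spotify attributes:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical numeric columns standing in for the Spotify attributes.
df = pd.DataFrame(rng.integers(0, 100, size=(50, 4)),
                  columns=["energy", "loudness", "valence", "popularity"])

corr = df.corr()  # pairwise Pearson correlations, values in [-1, 1]
fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
for i in range(len(corr)):  # annotate each cell with its coefficient
    for j in range(len(corr)):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
fig.colorbar(im)
fig.savefig("correlation_heatmap.png")
```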
The correlation heatmap produced clear and interpretable results. Still, I wondered how the relationship between two variables could be visualized other than with scatter plots. Violin plots show the distribution of a dataset by combining features of box plots and kernel density plots. Using the Seaborn package, I plotted a violin of the energy variable for each loudness level for comparison. The graph shows a general (but not absolute) trend of energy increasing with loudness for this collection of songs; within each individual loudness level, however, energy still varies widely across the spectrum.
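Seaborn’s `violinplot` is a thin layer over the same idea available in Matplotlib’s `Axes.violinplot`. A sketch of one violin per loudness level, using synthetic energy scores whose means rise with loudness (the dB labels and distributions are hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical energy scores for three loudness levels, each group a bit
# more energetic on average than the last.
groups = [rng.normal(loc=50 + 10 * i, scale=8, size=40) for i in range(3)]

fig, ax = plt.subplots()
ax.violinplot(groups, showmedians=True)  # one violin per loudness level
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(["-7 dB", "-6 dB", "-5 dB"])
ax.set_xlabel("loudness")
ax.set_ylabel("energy")
fig.savefig("energy_by_loudness.png")
```

The wide body of each violin shows exactly the within-level spread that a bar of medians would hide.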
Investigating this dataset also led me to dig into the concepts of univariate, bivariate and categorical data. A histogram with a kernel density estimate overlaid is an effective way to plot a univariate distribution.
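To see what the kernel density estimate actually computes, it can be written out by hand: a small Gaussian bump centered on each observation, summed and normalized. The energy values below are synthetic, and the bandwidth uses Silverman’s rule of thumb:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
energy = rng.normal(loc=64, scale=12, size=50)  # hypothetical energy scores

fig, ax = plt.subplots()
ax.hist(energy, bins=10, density=True, alpha=0.5)  # normalized histogram

# Gaussian kernel density estimate laid over the histogram.
xs = np.linspace(energy.min() - 10, energy.max() + 10, 200)
bandwidth = 1.06 * energy.std() * len(energy) ** (-1 / 5)  # Silverman's rule
kde = np.exp(-0.5 * ((xs[:, None] - energy) / bandwidth) ** 2).sum(axis=1)
kde /= len(energy) * bandwidth * np.sqrt(2 * np.pi)  # normalize to a density
ax.plot(xs, kde)
ax.set_xlabel("energy")
fig.savefig("energy_distribution.png")
```

In practice Seaborn draws both layers in one call, but the manual version makes the statistical concept concrete.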
I learned that data should be presented differently according to its type, so I explored the various plots the Seaborn package offers for visualizing categorical data. After going through the tutorial, I considered how to connect those plots to the Spotify data I had. It occurred to me that a treemap would be an appropriate format for displaying hierarchical data such as the number of songs in each genre.
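The data behind such a treemap is just a count per category, which pandas computes directly; a rendering library (squarify or Plotly, for example) would then size one rectangle per genre. The genre labels below are illustrative, in the shape of the dataset’s genre column:

```python
import pandas as pd

# Hypothetical genre labels in the shape of the Spotify dataset's genre column.
genres = pd.Series(["dance pop", "pop", "dance pop", "latin", "pop", "dance pop"])

# A treemap's rectangle areas are proportional to these counts.
counts = genres.value_counts()
print(counts)
```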
While investigating this dataset, I kept asking myself: what do I want to learn, and what can I learn from it? It occurred to me that I would have been better able to come up with research questions if I had deeper knowledge of theoretical frameworks such as statistics. This became more apparent as I came across statistical concepts such as categorical data, univariate and bivariate data, and kernel density estimation. The exploration taught me about various types of plots and the visualization tools that generate them, but I could not use them properly without a statistical understanding of the data. Attempts at visualization without proper inspection of the original dataset can also be risky; as Bob Cole points out, “a poorly constructed visualization can be misleading and perhaps even do harm.” So it is essential that I keep learning and make sure I remain true to the original datasets.
My messy strategy of learning as I explored datasets was fun but unstructured. It kept me motivated throughout the process, since I could immediately apply what I learned from tutorials, documentation and courses to the dataset I was exploring, connecting specific types of plots to my understanding of the data. But it was also inefficient at times: I read extensive documentation for multiple visualization packages just to find the right plot, and returned to certain tutorials repeatedly because I had not committed particular methods and skills to memory.
I have also noticed that the style and format of my existing plots still need a lot of improvement. Both the line plots and the pie chart from the Covid-19 dataset, for example, are cluttered with information and cumbersome to read. I need more practice to better understand how different types of visualization suit different types of data. Additionally, my plots have no personal style; I did not spend much effort or time configuring their aesthetic elements to best present the data. Zadie Smith, in her recent book Intimations, writes that “a style is a means of insisting on something.” If I continue to put effort into data visualization, having a personal style is a way of claiming and insisting on my own digital identity, which could be a crucial message conveying how the data is relevant to me.
I foresee multiple paths for future applications of this personal project. Most recently, I was able to join a project with the STEM Innovation group to visualize traffic-flow data in buildings such as dining halls to help maintain proper social distancing. I have also become more motivated to get involved in data science, the field I would like to pursue in combination with my Computer Science major.