Forbes
Kalev Leetaru, Contributor

Why Data Visualization Is Equal Parts Data Art And Data Science

One of the most powerful ways through which we convey the results of data science is visualization, from simple Excel graphs through advanced displays like network diagrams and bespoke visuals. What most outside the data science community don’t realize is just how much artistry is involved in the creation of some of those visualizations, from the impact of color schemes on perception in geographic mapping to the layout algorithms and data filtering used in network visualizations. Given the rising use of networks to understand everything from social media to semantic graphs, just how much of an impact do our layout algorithms and filtering decisions have on the final images we see?

Network visualizations are at once beautiful and informative, helping us make sense of the macro through micro patterns in the vast connected ecosystems that define the world around us. Yet, like any form of data visualization, network visualization does not capture the sum total reality of our data so much as it constructs one possible reality.

When we think of scientific visualization, we think that the images we see present to us the one single “truth” of a dataset, without realizing that any given dataset can tell many different stories depending on the questions we ask of it and the filters we apply to answer those questions.

The myriad possible filters we apply to a graph to reduce its dimensionality, the layout algorithm that places the nodes in space, the clustering algorithms like modularity that group nodes by “similarity,” the definition of “similarity” that we hand to those clustering algorithms, the node sizing algorithms like PageRank, the color scheme and the random seeds used by many algorithms that ensure each run yields a very different image: all of these conspire to ensure that a single dataset can yield a nearly infinite number of possible visualizations.

How does this process play out in a real world visualization task?

In April 2016 my open data GDELT Project began recording the list of hyperlinks found in the body of each worldwide online news article it monitors. Not all news articles contain links, but many link to external websites such as the homepages of organizations being mentioned in the article or other news outlets from which specific story elements were sourced. These external sources of information provide powerful insights into which websites each news outlet considers worthy of mentioning, in much the same way that the references in an academic paper offer insights into the works believed most relevant and reputable by each field.

As of last month, GDELT’s link database had recorded more than 1.78 billion outlinks from more than 304 million articles. Collapsing each URL to its root domain and connecting each news outlet with the unique list of external domains it has linked to over the last three years and the number of days it published at least one article linking to that domain, the final dataset is composed of just over 30 million distinct pairings of news outlets and external websites, including links to other news outlets.
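The collapsing step described above can be sketched roughly as follows. This is only an illustration, not GDELT's actual pipeline: the record layout `(outlet, date, url)`, the function names and the sample URLs are all assumptions, and the root-domain rule is deliberately naive (it keeps the last two hostname labels, which mishandles suffixes like `.co.uk`).

```python
from collections import defaultdict
from urllib.parse import urlparse


def root_domain(url):
    """Naive collapse of a URL to its root domain (hypothetical helper).

    Assumption: keeping the last two hostname labels is good enough for a
    sketch; real data would need a public-suffix list.
    """
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def build_pairs(records):
    """Map each (outlet, linked domain) pair to the number of distinct days
    on which the outlet published at least one article linking to it."""
    days_seen = defaultdict(set)
    for outlet, date, url in records:
        domain = root_domain(url)
        if domain:
            days_seen[(outlet, domain)].add(date)
    return {pair: len(days) for pair, days in days_seen.items()}


# Made-up sample records: (outlet, publication date, outlink URL).
records = [
    ("cnn.com", "2018-01-01", "https://www.nytimes.com/2018/01/01/a.html"),
    ("cnn.com", "2018-01-01", "https://nytimes.com/2018/01/01/b.html"),
    ("cnn.com", "2018-01-02", "https://www.nytimes.com/c.html"),
    ("cnn.com", "2018-01-02", "https://www.un.org/report.pdf"),
]
pairs = build_pairs(records)
```

Counting distinct days rather than raw links means that two links to the same domain on the same day contribute only once to the edge strength.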

This link dataset is a classic network graph that can be readily visualized using off the shelf visualization packages like the open source Gephi.

However, its size and density mean that the graph must be filtered to reduce it down to a specific subset of greatest methodological interest, while the edge count must be lowered to reduce the graph to its most “significant” edges.

Instead of focusing on which sites a given news outlet links to, a far more interesting question in light of the current interest in combating “fake news” is to restrict the analysis to only links between news outlets and to compile a list of the top news outlets that link to a given other news outlet. In other words, for a news outlet like CNN, what are the top other news outlets around the world that link most heavily to CNN as a source in their own reporting? Much like academic citation networks convey authority, the linking behavior of news outlets can similarly serve as a proxy for “news authoritativeness.”

Thus, the 30 million edge graph was methodologically inverted and only edges connecting news outlets were retained. A link from a CNN article to a UN report would be discarded, but a link from a CNN article to a New York Times article would be preserved.
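One way to picture this inversion is the short sketch below. It assumes the pairs are held in a dictionary keyed by `(source outlet, linked domain)` and that a set of known news outlet domains is available; both structures and the function name are hypothetical.

```python
def invert_news_edges(pairs, news_outlets):
    """Keep only outlet-to-outlet edges and index them by the linked-to outlet.

    pairs: {(source_outlet, target_domain): days_linked}
    news_outlets: set of domains known to be news outlets (assumed input)
    Returns {target_outlet: {source_outlet: days_linked}}.
    """
    inlinks = {}
    for (source, target), days in pairs.items():
        if target in news_outlets:
            inlinks.setdefault(target, {})[source] = days
    return inlinks


# Made-up example: the CNN -> UN edge is discarded, news-to-news edges survive.
pairs = {
    ("cnn.com", "nytimes.com"): 2,
    ("cnn.com", "un.org"): 1,
    ("bbc.co.uk", "nytimes.com"): 5,
}
news_outlets = {"cnn.com", "nytimes.com", "bbc.co.uk"}
inlinks = invert_news_edges(pairs, news_outlets)
```

After the inversion, looking up an outlet such as `nytimes.com` returns the outlets that cite it rather than the sites it cites.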

As an initial visualization, the top 30 news outlets linking to each news outlet on at least 30 days were extracted, and a random subset of 10,000 edges was used to form a new graph. The nodes were positioned using the OpenOrd layout algorithm and colored by Blondel et al.’s modularity, with coloration selected by Gephi’s built-in palette generator.
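The filtering and sampling just described might look something like the sketch below. The function names, the tie-breaking rule and the sample data are assumptions for illustration; only the thresholds (top 30, at least 30 days, 10,000 random edges) come from the text.

```python
import random


def top_inlinkers(inlinks, k=30, min_days=30):
    """For each outlet, keep its k strongest inlinking outlets that linked to
    it on at least min_days days. Returns (source, target, days) edges."""
    edges = []
    for target, sources in inlinks.items():
        eligible = [(s, d) for s, d in sources.items() if d >= min_days]
        # Strongest first; ties broken alphabetically (an arbitrary choice).
        eligible.sort(key=lambda sd: (-sd[1], sd[0]))
        edges.extend((s, target, d) for s, d in eligible[:k])
    return edges


def sample_edges(edges, n=10_000, seed=None):
    """Return a random subset of at most n edges."""
    if len(edges) <= n:
        return list(edges)
    return random.Random(seed).sample(edges, n)


# Made-up example with a small k so the cutoff is visible.
inlinks = {"nytimes.com": {"cnn.com": 45, "bbc.co.uk": 200, "example.org": 12}}
edges = top_inlinkers(inlinks, k=2, min_days=30)
subset = sample_edges(edges, n=1, seed=0)
```

Because the subset is random, a different seed selects different edges, which is itself one more source of variation in the final image.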

The final image is seen below.

GDELT GKG 2016-2018 Outlink Graph (Top 30 / Random 10,000) White Background

OpenOrd, like many layout algorithms, utilizes random seeds, meaning it will yield a slightly different result each time it is run. This is an important distinction that is lost to many unfamiliar with network visualizations: there is no single “truth” to the visualized structure of a graph. Every rendering of a graph will present it in a slightly different way. Researchers frequently run a layout algorithm multiple times until they find a presentation that either looks the “best” or does the best job of visually separating the clusters of greatest importance to their analysis.
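The role of the random seed can be seen in a toy example. The sketch below is not OpenOrd; it is a deliberately simplified force-directed layout written only to show that the same graph lands in different positions when the seed changes. The function name, the constants and the four-node graph are all assumptions.

```python
import math
import random


def toy_force_layout(nodes, edges, seed, iterations=50):
    """Toy force-directed layout (NOT OpenOrd): random start, then relax."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    k = 1.0 / math.sqrt(len(nodes))  # preferred distance between nodes
    step = 0.1                       # maximum move per iteration, decays below

    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        pairs = [(a, b, False) for i, a in enumerate(nodes) for b in nodes[i + 1:]]
        pairs += [(a, b, True) for a, b in edges]
        for a, b, is_edge in pairs:
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-9
            # Every pair of nodes repels; linked nodes also attract.
            force = -(dist * dist / k) if is_edge else (k * k / dist)
            disp[a][0] += dx / dist * force
            disp[a][1] += dy / dist * force
            disp[b][0] -= dx / dist * force
            disp[b][1] -= dy / dist * force
        for n in nodes:
            length = math.hypot(disp[n][0], disp[n][1]) or 1e-9
            move = min(length, step)
            pos[n][0] += disp[n][0] / length * move
            pos[n][1] += disp[n][1] / length * move
        step *= 0.95
    return {n: (p[0], p[1]) for n, p in pos.items()}


nodes = ["cnn.com", "nytimes.com", "bbc.co.uk", "reuters.com"]
edges = [("cnn.com", "nytimes.com"), ("bbc.co.uk", "nytimes.com"),
         ("reuters.com", "bbc.co.uk")]
run_a = toy_force_layout(nodes, edges, seed=1)
run_b = toy_force_layout(nodes, edges, seed=1)  # same seed: identical layout
run_c = toy_force_layout(nodes, edges, seed=2)  # new seed: different layout
```

Fixing the seed makes a rendering reproducible, but it does not make that rendering any more "true" than the ones other seeds would have produced.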

What happens if we change the background color from white to black, while otherwise leaving the graph exactly as-is?

GDELT GKG 2016-2018 Outlink Graph (Top 30 / Random 10,000) Black Background

Despite the graph being exactly the same, the darker background subtly changes our perception of the graph, making the spaces between nodes clearer and drawing a sharper contrast between clusters. Our eye is also drawn more naturally to the graph’s diffuse structure.

As the two images make clear, even the selection of the background color of a graph can have an impact on our perception of it.

What if we adjust the thickness of each connection based on edge strength? In other words, the line between two outlets that were linked on 200 different days would be thicker than the line between two outlets linked on just the minimum 30 days.
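A simple way to express such a mapping is a clamped linear scale from days linked to line width, as in the sketch below. The width range and the 200-day ceiling are assumed values chosen for illustration, not settings taken from the renderings here.

```python
def edge_width(days, min_days=30, max_days=200, min_width=0.5, max_width=5.0):
    """Map an edge's strength (days linked) to a line width.

    Values outside [min_days, max_days] are clamped so that a handful of
    extremely strong edges cannot dominate the image.
    """
    clamped = max(min_days, min(days, max_days))
    fraction = (clamped - min_days) / (max_days - min_days)
    return min_width + fraction * (max_width - min_width)


thin = edge_width(30)     # weakest retained edge gets the minimum width
thick = edge_width(200)   # a 200-day edge gets the maximum width
middle = edge_width(115)  # halfway between the two
```

A logarithmic or rank-based scale would be an equally defensible choice and would emphasize different edges, which is exactly the kind of decision that shapes what the viewer sees.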

GDELT GKG 2016-2018 Outlink Graph (Top 30 / Random 10,000) With Edge Weighting

Rather than a diffuse mess of lines we begin to see macro level structure. The lower center orange cluster becomes especially vivid.

What if we instead filter the graph to retain the top 30 news outlets linking to each outlet where the linked-to domain is also in the top 30 list of each of those inlinking domains? In other words, filtering to each outlet’s top 30 reciprocal edges. In addition, instead of displaying 10,000 randomly selected edges, the top 30,000 strongest edges are retained.
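One reading of this reciprocal filter is that an edge from outlet A to outlet B is kept only when A is among B's top inlinkers and B is also among A's top inlinkers. The sketch below implements that interpretation; the function names, tie-breaking and sample data are assumptions.

```python
def top_k_sources(inlinks, k):
    """For each outlet, the set of its k strongest inlinking outlets."""
    return {
        target: set(sorted(sources, key=lambda s: (-sources[s], s))[:k])
        for target, sources in inlinks.items()
    }


def reciprocal_edges(inlinks, k=30, max_edges=30_000):
    """Keep source -> target edges where each outlet is in the other's
    top-k inlinker list, then retain the max_edges strongest of them."""
    top = top_k_sources(inlinks, k)
    edges = [
        (source, target, inlinks[target][source])
        for target, sources in top.items()
        for source in sources
        if target in top.get(source, set())
    ]
    # Strongest edges first; ties broken by name for a stable order.
    edges.sort(key=lambda e: (-e[2], e[0], e[1]))
    return edges[:max_edges]


# Made-up example: "a" and "b" link to each other, "c" only links out.
inlinks = {"a": {"b": 50, "c": 10}, "b": {"a": 40}, "c": {}}
edges = reciprocal_edges(inlinks, k=2)
```

In this small example the one-way link from "c" to "a" disappears, which hints at why the reciprocal graph looks so much more centralized than the random sample.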

GDELT GKG 2016-2018 Outlink Graph (Top 30 Reciprocal / Top 30,000)

This graph shows a far more centralized structure, with a center core of tightly connected outlets around which the rest of the media ecosystem revolves.

This paints a very different picture of our global media structure, from the earlier diffuse dense collective to a galaxy-like mass of small clusters orbiting around a central core of international stature outlets. Much of this comes from our use of strongest edges rather than random edges, reminding us of the critical impact our sampling decisions have on the final structure we see.

Adjusting the thickness of each edge based on edge strength makes the central core less prominent and instead emphasizes the isolated nature of the myriad smaller clusters around the periphery.

GDELT GKG 2016-2018 Outlink Graph (Top 30 Reciprocal / Top 30,000) With Edge Weighting

How much of an impact does the layout algorithm have on our understanding of the structure of a graph?

Here we reduce the graph to the top five inlinking outlets for each news outlet and display the top 50,000 strongest connections using the same OpenOrd algorithm used in all of the above graphs.

GDELT GKG 2016-2018 Outlink Graph (Top 5 / Top 50,000) Using OpenOrd

The result is a very diffuse structure like the earlier renderings, with multiple cores, a complex interconnected region on the left and numerous other clusters.

In contrast, the image below shows the results of running the exact same graph through the Force Atlas 2 algorithm instead. This image looks like an entirely different graph, with the entire network extending outwards from a central core.

GDELT GKG 2016-2018 Outlink Graph (Top 5 / Top 50,000) Using Force Atlas 2

Here’s another comparison, this time limiting to just those news outlets indexed in Google News circa mid-2017 and similarly limiting to the top 5 inlink domains by outlet and displaying the top 50,000 strongest connections.

The OpenOrd layout shows a diffuse structure.

GDELT GKG 2016-2018 Outlink Graph (Google News / Top 5 / Top 50,000) Using OpenOrd

The Force Atlas 2 layout, on the other hand, once again centralizes the graph structure.

GDELT GKG 2016-2018 Outlink Graph (Google News / Top 5 / Top 50,000) Using Force Atlas 2

Sometimes the centralized perspective of Force Atlas 2 can be helpful in drawing attention to the centralized clustering of a graph.

Here is the same graph as above, reduced from the top 50,000 strongest connections down to the top 10,000.

The OpenOrd layout is predictably fairly diffuse, though it helpfully captures the graph’s dual center.

GDELT GKG 2016-2018 Outlink Graph (Google News / Top 5 / Top 10,000) Using OpenOrd

The Force Atlas 2 version collapses this center but makes it more apparent that the entire graph revolves around a complex center.

GDELT GKG 2016-2018 Outlink Graph (Google News / Top 5 / Top 10,000) Using Force Atlas 2

Graphs are most commonly displayed as edge visualizations in which the connections between nodes are the focal point of the image. This tends to be the norm in domains where it is the connectivity structure that is of greatest interest. In other domains it is the nodes themselves that are of primary importance, with the graph structure used only to position them in space according to their relatedness.

The image below shows a traditional OpenOrd layout of the graph of news outlets based in the United States, using their top 10 reciprocal edges and limiting to the top 30,000 strongest edges.

GDELT GKG 2016-2018 Outlink Graph (US News Outlets / Top 10 Reciprocal / Top 30,000) Using Edge Weights And OpenOrd With Edges

The image below shows the exact same graph, but with the edges hidden to show only the nodes.

GDELT GKG 2016-2018 Outlink Graph (US News Outlets / Top 10 Reciprocal / Top 30,000) Using Edge Weights And OpenOrd With Nodes Only

This actually makes many of the peripheral clusters clearer and draws the macro structure of the graph into starker focus. For high density graphs like this, node visualizations can often make it easier to understand graph structure without the burden of tens of thousands of distracting spaghetti lines crisscrossing the image.

Putting this all together, we see just how much of an impact our algorithmic and methodological decisions have on the final visual representation we receive of a given dataset. Every visualization on this page displays the very same dataset but offers different perspectives by filtering it in different ways and using different algorithmic and visual selections.

In the end, perhaps the biggest takeaway is the reminder that the incredible imagery that emerges from our vast datasets is equal parts data art and data science, constructing rather than reflecting reality.
