The constant (over)flow of data and trying to make it useful

Author: Stefan Freitag

Published: 2023-08-08

It should come as no surprise to anyone that we are currently living in the age of data. Data is constantly generated by everyday items such as our phones, computers and even some newer home appliances. This data is not stationary, but constantly flowing through the ether between interconnected devices and endpoints. But data is not only responsible for making our daily lives easier (or harder, depending on one's viewpoint); it also holds huge potential for the humanities. Data can make research more in-depth and can help to improve the conclusions drawn, but bad data may also dilute said conclusions or even lead to false ones.

All of this raises the questions: How can data be made useful? How can we find the proverbial needle in the haystack of data? And how do technological advancements change the way we view, generate and use data? At the DHCH event of 2023, Peter Fornaro from the University of Basel tackled these questions in his keynote, which is briefly summarised in the following write-up.

Not all data is good data

First of all, it is important to keep in mind that data by itself is not a thing with inherent meaning or value. Its existence does not mean that it necessarily leads to any meaningful conclusions. Data by itself, without any work put into it, is little more than a pile of undefined measurements or observations and thus requires work to make sense of the chaos.

It should also be noted that, when we say "data", we mostly mean digital data. But in reality, data usually starts out analogue and must be digitized in order to be used by computers. This happens by translating the analogue data into digital code, which consists of a finite set of characters. In that way, digital data represents an analogue measurement.
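To make the idea of digitization a little more tangible, here is a minimal sketch in Python. It is not taken from the keynote; the function names, the value range and the number of levels are all assumptions chosen purely for illustration. Each analogue reading is mapped to the nearest of a small, finite set of integer codes, and the codes can afterwards be turned back into approximate values.

```python
def quantize(samples, levels=5, lo=0.0, hi=4.0):
    """Map each analogue reading to one of `levels` integer codes.

    `levels`, `lo` and `hi` are illustrative assumptions, not values
    from the keynote. Readings outside [lo, hi] are clamped to the range.
    """
    step = (hi - lo) / (levels - 1)  # distance between two neighbouring codes
    codes = []
    for value in samples:
        clamped = min(max(value, lo), hi)
        codes.append(round((clamped - lo) / step))
    return codes


def reconstruct(codes, levels=5, lo=0.0, hi=4.0):
    """Turn integer codes back into approximate analogue values."""
    step = (hi - lo) / (levels - 1)
    return [lo + code * step for code in codes]


# Hypothetical analogue readings, e.g. from a sensor.
readings = [0.2, 1.7, 3.9, 5.0]
codes = quantize(readings)
approximation = reconstruct(codes)
print(codes)
print(approximation)
```

The reconstructed values are close to the original readings but not identical to them, which is exactly the sense in which digital data only represents an analogue measurement.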

In order to conduct scholarly research with data, it is also essential that said data is openly available, easily accessible and, perhaps most importantly, interoperable. One can have huge amounts of data, but cooperating with other researchers becomes considerably more complex when this data is not compatible with other systems or is challenging to transfer.

A supercomputer in the palm of our hand

Especially since the emergence of the smartphone, small computers have permeated society. These little rectangles can do things computers of the old days could not even dream of. The typical smartphone today is orders of magnitude faster than the computer that put men on the moon in 1969.

Another important factor is the inception and worldwide spread of the internet. Today, there are more than 5 billion internet users worldwide and most people have access to the internet. The sheer amount of data on the internet is so incomprehensibly large that even Google has not indexed most of its contents.

So with tiny supercomputers in our pockets and worldwide interconnections between them, it is not surprising at all that the amount of data generated has skyrocketed in the last couple of years. This data, as shown above, is not inherently meaningful and may instead add noise, which makes it even harder to find relevant data. On the upside, these developments also mean unprecedented potential for the dissemination, access and processing of digital content.

AI and machine learning

The concept of artificial intelligence is very old. Machine learning, on the other hand, is a rather new development that is very well suited for analyzing data. Examples such as the language model behind ChatGPT or the new AI-assisted "Generative Fill" function in Adobe's Photoshop show how much potential AI currently has. In the case of ChatGPT, the AI acts like a chatbot: the user explains what they need it to do, and the AI goes ahead and does just that, to the best of its abilities of course.

Such technologies will develop fast and will lead to the emergence of even more efficient and sophisticated algorithms, which in turn will increase the amount of data. From the perspective of the humanities, it is therefore necessary not to ignore those developments, but instead to come up with relevant research questions with regard to AI.

The downsides of all of this

All these developments hold great potential when it comes to research, but the homogenization of data and the increasing interoperability of systems also make those systems more vulnerable to cyber attacks. Each and every system on the web is constantly under attack, and malicious hackers come up with new methods of attack every single day. Thus, it is essential to also focus on the security of the data and the systems.

When speaking of the security of data, it is also paramount not to forget about the long-term storage of data. Technological obsolescence might lead to huge amounts of data that can no longer be accessed or interpreted because the documentation of its file format is lost to time or the hardware required to access the data is no longer in production. It is therefore necessary to develop and use open-source file formats and to find ways of storing data long-term that best enable access in a hundred years or even more. One example of this is GitHub's Arctic Code Vault, in which a snapshot of its active public repositories is preserved.
