Challenges in data collection
The analytical system we operate relies on collecting large volumes of data from various conventional and unconventional sources. The raw information must then be structured so that the system can properly use each data point. One of the unique qualities of our system is that it does not shy away from distilling useful data from unstructured and difficult-to-process sources.
The resulting dataset is organized as a 'social network in reverse'. It is person-centric: each legal and natural person gets a unique ID, and any data obtained is accumulated within that person's dedicated cluster. Relationships and links between persons in the database (whether discovered from external sources or 'mined' by applying rules introduced into the system) are recorded as separate layers. Whereas in a social network the data subjects (users) create the links themselves (likes, shares, disclosed relationships), in our system such links are, obviously, introduced without the will of the data subject.
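As a rough illustration of the structure described above, here is a minimal Python sketch. All names are hypothetical; the real system's schema is not public.

```python
from dataclasses import dataclass, field

# Minimal sketch of the person-centric model: each legal or natural person
# gets a unique ID with a cluster of accumulated data points, and links
# between persons are stored as separate layers keyed by link type.

@dataclass
class Person:
    person_id: str
    data_points: list = field(default_factory=list)  # accumulated raw facts

@dataclass
class Dataset:
    persons: dict = field(default_factory=dict)  # person_id -> Person
    layers: dict = field(default_factory=dict)   # layer name -> list of (id, id) pairs

    def add_fact(self, person_id, fact):
        person = self.persons.setdefault(person_id, Person(person_id))
        person.data_points.append(fact)

    def add_link(self, layer, id_a, id_b):
        # unlike in a social network, the link is created by the system,
        # not by the data subjects themselves
        self.layers.setdefault(layer, []).append((id_a, id_b))
```

The point of keeping links in separate layers is that each discovery method (external source vs. rule-based mining) can be stored, audited and revised independently of the per-person clusters.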
Timely collection is essential
When our data provider's team started putting together the dataset we operate now, they made a discovery that may seem obvious to many: while the overall volume of data on the internet certainly grows, some very useful data disappears, never to be found again.
This is especially true for sensitive personal and business data. Some publications are taken off the internet for legal reasons; websites and blogs set up to support one side in a corporate conflict cease to be maintained once the conflict is over; other data is closed off by government intervention or the application of 'right to be forgotten' laws, etc.
A similar problem exists with online databases: many of them display only the current state of a record, leaving historical information obtainable only through a physical search. Once the record changes, the previous version is no longer accessible through scraping.
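One mitigation, sketched below under our own assumptions (the text does not describe the provider's actual mechanism), is to store every scrape of a record as a dated snapshot, so that superseded versions survive after the live record changes:

```python
import hashlib
from datetime import datetime, timezone

# Hypothetical sketch: keep a dated snapshot of each scraped record so that
# earlier versions remain available even though the register itself shows
# only the current state.

class RecordHistory:
    def __init__(self):
        self.snapshots = []  # (timestamp, content hash, content)

    def store(self, content):
        digest = hashlib.sha256(content.encode()).hexdigest()
        # skip the snapshot if the record has not changed since the last scrape
        if self.snapshots and self.snapshots[-1][1] == digest:
            return False
        self.snapshots.append((datetime.now(timezone.utc), digest, content))
        return True
```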
The Wayback Machine, a very useful tool for retrieving historical data, is not of universal help. Many obscure websites and blogs are simply left out, as they do not fit the Wayback Machine's order of priorities.
This means that timely collection of data is essential, and the only way to achieve it is continuous monitoring of key websites and other sources.
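A monitoring loop of the kind implied here might look like the following sketch, where fetch() stands in for a real HTTP client; the actual infrastructure is, of course, more involved:

```python
# Hypothetical sketch of continuous monitoring: each key source is re-fetched
# periodically, and any page that is new or has changed is queued for
# collection before it can disappear from the web.

def poll_sources(sources, fetch, seen, queue):
    """One monitoring round; meant to be run on a schedule."""
    for url in sources:
        content = fetch(url)
        if seen.get(url) != content:      # new page or changed content
            seen[url] = content
            queue.append((url, content))  # hand off to the collection pipeline
    return queue
```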
Picking only relevant data
With so much data available for collection on the internet, it is important to limit the collection effort to data that supports the purpose of the dataset and does not unnecessarily clutter it.
When dealing with corporate registers, this issue does not arise, as all the data there is potentially relevant. However, when it comes to social media, online publications and other 'soft' sources, selecting the right pieces of information is key.
An important element of the system is creating and constantly refining smart algorithms that automatically determine whether a particular page is worth scraping (based on keywords and many other parameters).
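As a toy illustration of keyword-based triage (the real algorithms rely on many more parameters and are not public; the keywords and weights below are invented), a page could be scored against a weighted keyword list:

```python
# Hypothetical keyword triage: a page is judged worth scraping if its
# weighted keyword score clears a threshold.

KEYWORDS = {"shareholder": 3, "beneficiary": 3, "decree": 2, "court": 2}

def is_worth_scraping(text, threshold=4):
    # normalize tokens and sum the weights of any known keywords
    words = (w.strip(".,;:()") for w in text.lower().split())
    score = sum(KEYWORDS.get(w, 0) for w in words)
    return score >= threshold
```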
While it is relatively easy to automatically sort information obtained from structured sources, entering data derived from media sources still requires human analyst effort.
To make the analyst's job easier, the system pre-formats the text after the source is scraped, saving time on its analysis. The analyst then works on the specially arranged text and manually enters the data into the system.
If a large enough number of similarly structured documents is discovered (e.g. presidential decrees on granting citizenship, containing lists of persons and very similar legal language, published on a government website), the IT team quickly puts together a scraper and the data is entered automatically.
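For instance, a quick scraper for a run of similarly formatted list entries might be little more than a regular expression. The line format below is invented for illustration; a real decree would need its own pattern:

```python
import re

# Hypothetical scraper for similarly structured documents, e.g. decrees
# listing persons in near-identical language. Invented line format:
#   "1. John Smith, born 1975"
LINE = re.compile(
    r"^\d+\.\s+(?P<name>[A-Z][a-zA-Z]+(?: [A-Z][a-zA-Z]+)+), born (?P<year>\d{4})$"
)

def extract_persons(text):
    persons = []
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if match:
            persons.append({"name": match.group("name"),
                            "born": int(match.group("year"))})
    return persons
```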
Even seemingly structured sources, such as corporate registers, often present a challenge for data collection.
Every country, and every official body within a country, seems to have its own IT contractors tasked with creating and maintaining online datasets. A user-friendly, easy-to-access online register is the exception, not the rule. On top of this, state agencies frequently change the format of their databases (sadly, without improving their quality).
Our overall impression is that although much official data is open to the public by virtue of law, many governments still cling to it and use all sorts of tricks to complicate access.
It thus falls to the IT team to devise customized algorithms for collecting data from each particular official online source, and to revise that software from time to time as the source changes its format.
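One way to organize such per-source collectors, sketched here under our own assumptions, is a registry that pairs each source with its parser and an expected format marker, so that a silent format change fails loudly instead of producing garbage:

```python
# Hypothetical per-source collector registry: each official register gets its
# own parser, and an expected format marker flags when the agency changes
# the layout and the parser needs human review.

PARSERS = {}  # source name -> (expected format marker, parser function)

def register(source, marker, parser):
    PARSERS[source] = (marker, parser)

def collect(source, raw):
    marker, parser = PARSERS[source]
    if marker not in raw:
        # the register changed its format; stop rather than mis-parse
        raise ValueError(f"{source}: unexpected format, parser needs review")
    return parser(raw)
```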
(Note: Privatepol Data Analytics Inc. is not itself engaged in data collection; it uses the dataset provided by its contractor.)