Using data from unverified sources

When compiling a dataset on a business community, would you include unverified information derived from low-quality sources such as blogs or forums? It depends on the purpose of the database.

The commercial data providers we know of are likely to refrain from collecting such data at all. The first reason is that, liability disclaimers aside, they all advertise the quality of their data to customers, and unverified, potentially faulty records would contaminate the overall quality of the dataset.

The second reason is managing the data provider's legal exposure. Such systems are, by definition, far more open to third-party review than ours; they are widely used for KYC checks and journalistic work, and publicizing a faulty statement, or basing a commercial decision on one, could expose the data company to legal repercussions.

A different approach is warranted if the data is to be used for investigations. In the investigatory scenario, no information should be discarded simply because it was never printed in a reputable publication. A statement from a dodgy source may later link together pieces of data that would otherwise remain unrelated. It may even be the only initial link between them, with further in-depth research then confirming, through more authoritative sources, that the link is valid.

Example. A social network post asserts that Swiss lawyer X is fronting for Mr. A, a PEP. Nothing in the media or official records (courts, corporate registers, etc.) substantiates that assertion so far. We record it anyway, and a link is stored in the system, albeit with a high 'potential to be untrue' rating. Within a year, previously unavailable records imported into the dataset show that an LLC owned by an offshore company where Mr. X is a director purchased a house from Mr. A, around the time it was reported that criminal proceedings were commenced against Mr. A. 'Two coincidences' in this case are certainly a clue, if not a pattern.
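
To make the example concrete, here is a minimal sketch in Python of how such a provisional link could be stored and later re-scored. The structures, the numeric 'potential to be untrue' field and the deliberately naive scoring rule are illustrative assumptions, not our production schema.

    # Hypothetical sketch: a link between two subjects is retained even when it
    # rests on a single low-quality source, and is re-scored as corroborating
    # evidence accrues.
    link = {
        "from": "Lawyer X",
        "to": "Mr. A",
        "claim": "X is fronting for A",
        "evidence": [
            {"source": "social network post", "potential_untrue": 0.8},
        ],
    }

    def combined_untrue(link: dict) -> float:
        """Naive combination rule (independence assumed for illustration):
        the link is untrue only if every supporting item is untrue."""
        p = 1.0
        for item in link["evidence"]:
            p *= item["potential_untrue"]
        return p

    print(f"{combined_untrue(link):.2f}")  # 0.80 -- a weak lead, but kept

    # A year later: newly imported records tie Mr. X's offshore structure to Mr. A.
    link["evidence"].append(
        {"source": "imported property purchase record", "potential_untrue": 0.1}
    )
    print(f"{combined_untrue(link):.2f}")  # 0.08 -- the same lead now looks far stronger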

That is why, contrary to the marketing message of many business data providers, we do not insist on all our data having first-class provenance. To enrich our system, we use data from both authoritative sources and those of unknown quality (or, rather, known low quality). The main function of the system is to generate leads and clues for further research, not to provide verified statements of fact. (When we do analytical work for clients on the basis of the system, our user agreement fully reflects that.)

Data providers for our system routinely derive data from:

  • publications of any 'leaks', including document images
  • documents from civil and criminal proceedings, not officially published
  • pdf files of business and legal documents appearing on the internet
  • posts on social media, including document images
  • forum posts and comments triggering a keyword search for names of people (including known nicknames), entities or businesses (a minimal sketch of such a search follows this list)
  • blogs, including anonymous 'compromat leaking' blogs
  • private reports and opinions from expert sources
  • etc. etc. etc.
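
As an aside on the forum and comment sources mentioned above, here is a minimal sketch of the kind of keyword search that could trigger collection. The watchlist entries, names and nicknames are invented for illustration; the matching logic actually used is more involved.

    import re

    # Hypothetical watchlist: names, known nicknames and entity names whose
    # appearance in a forum post or comment should flag it for collection.
    WATCHLIST = ["John Smith", "J. Smith", "smithy77", "Acme Holdings Ltd"]

    # One case-insensitive pattern with word boundaries, so that "smithy77"
    # matches but "blacksmithy770" does not.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in WATCHLIST) + r")\b",
        re.IGNORECASE,
    )

    def matching_terms(post_text: str) -> list[str]:
        """Return the watchlist terms found in a forum post, if any."""
        return pattern.findall(post_text)

    post = "User smithy77 claims Acme Holdings Ltd is owned by a nominee."
    print(matching_terms(post))  # ['smithy77', 'Acme Holdings Ltd']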

There are many reasons why some valuable data appears only in a low-quality source and never makes it elsewhere. Mainstream media are sensitive to the risk of libel, defamation and breach-of-privacy lawsuits; anonymous, low-quality platforms are not. Social media also tolerate 'leaks' and 'compromat' with ease, because their operators rely on the 'platform' exemption providing immunity from liability for hosting user-generated content.

The million-dollar question is how to deal properly with the risk that a certain data element may be untrue, so that it is not assigned the same degree of trust as, for example, data obtained directly from state-run corporate registers. We have developed a set of data attributes, and ways to process them, to address this issue. The multi-dimensional architecture of the dataset allows each data point to carry properties such as source, time and, in this case, potential to be untrue. Algorithms solving analytical tasks within the system are formulated to take due account of such attributes.
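
For illustration only, here is a minimal sketch of a data point carrying such attributes and of an algorithm discounting it accordingly. The attribute names (such as potential_untrue) and the weighting rule are assumptions made for the example, not our actual implementation.

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class DataPoint:
        """A single assertion in the dataset, carrying provenance attributes."""
        statement: str            # e.g. "X is a director of OffshoreCo"
        source: str               # e.g. "state corporate register", "anonymous blog"
        recorded_on: date         # when the assertion entered the system
        potential_untrue: float   # 0.0 = fully verified, 1.0 = entirely unverified

    def trust_weight(point: DataPoint) -> float:
        """Hypothetical weighting: analytical algorithms discount an assertion
        in proportion to its 'potential to be untrue' rating."""
        return 1.0 - point.potential_untrue

    # A verified register entry versus an unverified social-media claim.
    register_entry = DataPoint("X is a director of OffshoreCo", "state corporate register",
                               date(2020, 3, 1), potential_untrue=0.05)
    blog_claim = DataPoint("X is fronting for Mr. A", "anonymous blog",
                           date(2019, 6, 15), potential_untrue=0.8)

    for p in (register_entry, blog_claim):
        print(f"{p.statement!r}: weight {trust_weight(p):.2f}")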

'Compromat': a popular practice in Russia, Ukraine and other countries of the region is to publicize actual or fabricated leaks compromising a person or a business. Their veracity is by definition disputable, and publication in first-class media is impossible for legal reasons. 'Compromat' leaks are published on dedicated compromat websites, within social networks, blogs and internet forums. From time to time such publications make it into tabloid print media.