Big Data – Good?

Big data can be used for good and it can be used for evil. Some recent public research illustrates the former, but there are doubts about some private uses.

It is not generally realised that Statistics New Zealand has a large research database – the Integrated Data Infrastructure (IDI) – containing microdata about people and households from a range of government agencies, SNZ surveys including the 2013 Census, and non-government organisations. It contains over 166 billion facts – and is continually growing.

Before you panic – you are probably in there – SNZ has very strict privacy controls to limit the abuse of the database. It is hard to think of any further controls without making it useless for researchers, although ultimately its security relies on their integrity and that of those who work at SNZ. (One such control is that the data is ‘de-identified’: personal identifying information, such as names and addresses, is removed, and identifiers, such as IRD and NHI numbers, are encrypted.)

That it can be a useful research tool is illustrated by an analytical paper recently released by the Treasury: Using IDI Data to Estimate Fiscal Impacts of Better Social Sector Performance. To simplify, the study follows the IDI data of children born in 1993 through to their early twenties, exploring the association of various explanatory characteristics (such as gender, ethnicity, region, contact with CYF, family welfare history, caregiver details) with education, health, welfare and corrections outcomes as young adults.

You will not be surprised that the research found that those who were assessed at risk – who had contact with Child, Youth and Family (CYF); who had a caregiver who had a corrections sentence; whose family was supported by the benefit; who had a hospital event – were more likely to have had lower outcomes than those who were not.

But be careful. About 44 percent of those deemed at risk were classified as ‘on track’ when they were 21 – into level 4 education, on good wages or self-employed. Admittedly this is lower than the 84 percent of the others who were on track. But the implication is that one can succeed despite these handicaps and one could fail without them.

What that means is that one has to be very careful in labeling a child as doomed because of some association with these risk factors, or in claiming that there is no potential problem if there is no association. Indeed about half of those who are deemed ‘failures’ are not associated with the risk factors.
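The ‘about half’ figure is simple arithmetic once the size of the at-risk group is known. The sketch below uses the two on-track rates quoted above (44 and 84 percent). The at-risk share of the cohort is not given in this column, so the shares tried here are assumptions for illustration, not figures from the Treasury paper.

```python
# On-track rates quoted in the column.
ON_TRACK_AT_RISK = 0.44
ON_TRACK_OTHERS = 0.84


def share_of_failures_without_risk(at_risk_share):
    """Fraction of the not-on-track group that had no risk factors.

    at_risk_share is a hypothetical input: the proportion of the
    cohort classified as at risk.
    """
    failing_at_risk = at_risk_share * (1 - ON_TRACK_AT_RISK)
    failing_others = (1 - at_risk_share) * (1 - ON_TRACK_OTHERS)
    return failing_others / (failing_at_risk + failing_others)


if __name__ == "__main__":
    for share in (0.10, 0.20, 0.30):  # assumed at-risk shares
        result = share_of_failures_without_risk(share)
        print(f"at-risk share {share:.0%}: "
              f"{result:.0%} of failures had no risk factors")
```

On these rates the two groups contribute equal numbers of ‘failures’ when the at-risk share is 2/9, or about 22 percent of the cohort. A smaller at-risk group pushes the share of failures without risk factors above half.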

Perhaps that is because there is some other factor which is important and not in the database – say IQ, or how loving and supportive the family is, or they were born without a Fetal Alcohol Spectrum Disorder, or ....

(As a researcher I am always delighted by being surprised. One result is that children in Auckland, Wellington and Christchurch seem to be less at risk than those elsewhere. There is a tendency to see urban centres as wicked compared with the rural arcadia. Apparently not, but we do not know why.)

So we learn that big data can be only as good as what is available. It allows us to test hypotheses and quantify effects but only about what is measured.

Moreover, we need to be very cautious about the ability to predict from the research – any research. This was amusingly illustrated by (American academic journalist) Sue Halperin who, in preparation for reviewing a couple of books on the private use of big data, checked her own status. She found she was classified as a ‘gay male’ by one database and considered single and not in a relationship by another because she liked the webpage for an organisation founded by the man with whom she has been in a relationship for thirty years.

What is happening here is that the algorithms (the numerical methods which aggregate the data) are making point predictions. But each estimate is subject to a margin of error. I suppose big data addicts will claim that as they get more data they will get more precise.

In which case they have misunderstood any statistics they were taught. Briefly, a statistical estimate can be ‘inconsistent’ (among other things), which means larger samples may not bring the estimate any closer to the true value. (One of the problems is that anyone can run today’s statistical packages without the slightest understanding of the statistical method, the underpinning methodology or the ethical foundations. A degree in computer methods does not make one a statistician.)
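A toy simulation shows why more data need not help. It is a sketch with made-up numbers, not a description of any real data miner: a population in which 10 percent have some trait, and a collector that is three times as likely to record people who have it. All three constants are assumptions.

```python
import random

TRUE_RATE = 0.10        # assumed share of the population with the trait
RECORD_IF_TRAIT = 0.9   # assumed chance a person with the trait is recorded
RECORD_IF_NOT = 0.3     # assumed chance a person without it is recorded


def naive_estimate(population_size, rng):
    """Share of recorded people with the trait, ignoring selection."""
    recorded = 0
    recorded_with_trait = 0
    for _ in range(population_size):
        has_trait = rng.random() < TRUE_RATE
        keep_prob = RECORD_IF_TRAIT if has_trait else RECORD_IF_NOT
        if rng.random() < keep_prob:
            recorded += 1
            recorded_with_trait += has_trait
    return recorded_with_trait / recorded


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for n in (1_000, 10_000, 100_000, 1_000_000):
        print(n, round(naive_estimate(n, rng), 3))
```

As the population grows the estimate settles down, but on 0.09/0.36 = 0.25 rather than the true 0.10. The extra data shrinks the noise and leaves the bias untouched.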

But who cares? Apparently not the firms paying for the big data result, often to target you for advertising purposes. Supposing they are after gay males – an invitation to Sue Halperin is just collateral damage.

So one must worry about the files that they build up on you. Earlier I said I was comfortable with the rigorous protections in the IDI database, but it claims to have 166 billion facts. Does that mean it has more than 35,000 facts per New Zealander? (Are there 35,000 facts about me or you?)
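The 35,000 figure is a simple division. A quick check, assuming a population of roughly 4.7 million (a round figure assumed here, not one quoted by SNZ):

```python
FACTS_IN_IDI = 166e9          # "over 166 billion facts", as quoted above
ASSUMED_POPULATION = 4.7e6    # assumption: rough New Zealand population


def facts_per_person(facts, population):
    """Average number of recorded facts per person."""
    return facts / population


if __name__ == "__main__":
    print(f"{facts_per_person(FACTS_IN_IDI, ASSUMED_POPULATION):,.0f}")
```

This is only an average; it says nothing about how the facts are spread across individuals.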

Even so the conclusions that are reached from them may be wrong. I wonder just how many competent statisticians the proposed Ministry for Vulnerable Children has. They obviously had little counsel in setting up the ministry, because the conclusion that not all ‘vulnerable children’ are vulnerable, and that almost as many children not classified as vulnerable are in fact vulnerable, seems to have been ignored.

Commercial big data miners pose a threat to our privacy and perhaps to our liberties. Moreover there are government databases which pose a similar threat – offshore ones which are not protected as rigorously as SNZ’s IDI.

Reading the Halperin review I concluded that my best defence is to bugger up the system. I keep getting invitations to put my face on the web by organisations who use face recognition algorithms. I don’t, but if they are insistent I shall send them an image of Donald Trump. I learned too that those ‘free’ personality tests on the web are sometimes put up by commercial big data mining firms to get more information on you for their algorithms so they can better target you. Not my thing, but if I do one I propose to respond at least three times, with the extra ones being fake. (Further obscuring tactics can be reported in the comments below.)

(A documented example is Cambridge Analytica, a firm Halperin worries about, which has been using quizzes it has put on Facebook to build up psychological profiles. Apparently its results were sold to the Republican Party and are considered one of the reasons Trump was more effective at targeting supporters than Clinton.)

And yet, and yet. I welcome the use of big data for research purposes. As usual for new developments the good and the bad get thoroughly mixed up. The bad damages the usefulness of the good and takes a long time to control.

 

The Data Futures Partnership is inviting New Zealanders to discuss how they feel about these developments. The relevant website if you would like to give an opinion is here.