Big data can be used for good and it can be used for evil. Some recent public research illustrates the former, but there are doubts about some private uses.

It is not generally realised that Statistics New Zealand has a large research database – the Integrated Data Infrastructure (IDI) – containing microdata about people and households from a range of government agencies, SNZ surveys including the 2013 Census, and non-government organisations. It contains over 166 billion facts – and is continually growing.

Before you panic – you are probably in there – SNZ has very strict privacy controls to limit the abuse of the database. It is hard to think of any further controls without making it useless for researchers, although ultimately its security relies on their integrity and that of those who work at SNZ. (One control is that the data is ‘de-identified’: personal identifying information, such as names and addresses, is removed, and identifiers, such as IRD and NHI numbers, are encrypted.)

That it can be a useful research tool is illustrated by an analytical paper recently released by the Treasury: Using IDI Data to Estimate Fiscal Impacts of Better Social Sector Performance. To simplify, the study follows the IDI data of children born in 1993 through to their early twenties, exploring the association of various explanatory characteristics (such as gender, ethnicity, region, contact with CYF, family welfare history, caregiver details) with education, health, welfare and corrections outcomes as young adults.

You will not be surprised that the research found that those who were assessed at risk – had contact with Child, Youth and Family; had a caregiver who had a corrections sentence; whose family was supported by the benefit; who had a hospital event – were more likely to have had lower outcomes than those who were not.

But be careful. About 44 percent of those deemed at risk were classified as ‘on track’ when they were 21 – into level 4 education, on good wages or self-employed. Admittedly this is lower than the 84 percent of the others who were on track. But the implication is that one can succeed despite these handicaps and one could fail without them.

What that means is that one has to be very careful in labeling a child as doomed because of some association with these risk factors, or claiming that there is no potential problem if there is no association. Indeed about half of those who are deemed ‘failures’ are not associated with the risk factors.

Perhaps that is because there is some other factor which is important and not in the database – say IQ, or how loving and supportive the family is, or they were born without a Fetal Alcohol Spectrum Disorder, or ....
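The arithmetic behind those proportions can be sketched in a few lines. This is a minimal illustration, not the Treasury's method: the function name and the at-risk share are assumptions made for the example, and only the 44 and 84 percent on-track rates come from the figures quoted above. It suggests that ‘about half of the failures are not associated with the risk factors’ is what one would expect if roughly two in nine children were deemed at risk.

```python
def unflagged_share_of_failures(at_risk_share, on_track_at_risk=0.44, on_track_others=0.84):
    """Share of young adults not 'on track' who were never deemed at risk."""
    # Not-on-track children who had been flagged as at risk
    failures_flagged = at_risk_share * (1 - on_track_at_risk)
    # Not-on-track children who had not been flagged
    failures_unflagged = (1 - at_risk_share) * (1 - on_track_others)
    return failures_unflagged / (failures_flagged + failures_unflagged)


# Hypothetical at-risk share (about 22 percent); an assumption, not a figure from the paper
assumed_at_risk_share = 0.16 / 0.72
print(round(unflagged_share_of_failures(assumed_at_risk_share), 2))
```

A larger assumed at-risk share pushes the unflagged share of failures below a half, a smaller one pushes it above, so the label misses many children either way.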

(As a researcher I am always delighted by being surprised. One result is that children in Auckland, Wellington and Christchurch seem to be less at risk than those elsewhere. There is a tendency to see urban centres as wicked compared with the rural arcadia. Apparently not, but we do not know why.)

So we learn that big data can be only as good as what is available. It allows us to test hypotheses and quantify effects but only about what is measured.

Moreover, we need to be very cautious about the ability to predict from the research – any research. This was amusingly illustrated by (American academic journalist) Sue Halperin who, in preparation for reviewing a couple of books on the private use of big data, checked her own status. She found she was classified as a ‘gay male’ by one database and considered single and not in a relationship by another because she liked the webpage for an organisation founded by the man with whom she has been in a relationship for thirty years.

What is happening here is that the algorithms (the numerical methods which aggregate the data) are making point predictions. But each estimate is subject to a margin of error. I suppose big data addicts will claim that as they get more data they will get more precise.

In which case they have misunderstood any statistics they were taught. Briefly, a statistical estimate can be ‘inconsistent’ (among other things), which means larger samples may not ensure greater precision. (One of the problems is that anyone can run today’s statistical packages without the slightest understanding of the statistical method, the underpinning methodology or the ethical foundations. A degree in computer methods does not make one a statistician.)
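A small simulation can illustrate the point about inconsistency. It is a sketch under stated assumptions, not anyone's actual algorithm: the function name, the true rate of 10 percent and the systematic over-recording of 5 percentage points are all hypothetical. The estimate settles down as the sample grows, but it settles on the wrong number.

```python
import random


def estimated_rate(n, true_rate=0.10, systematic_error=0.05, seed=1):
    """Estimate a rate from n records drawn from a source that over-records it."""
    rng = random.Random(seed)
    # Hypothetical flaw: the data source captures the trait more often than it truly occurs
    recorded_rate = true_rate + systematic_error
    hits = sum(rng.random() < recorded_rate for _ in range(n))
    return hits / n


# The gap from the true rate (0.10) does not shrink towards zero as n grows
for n in (100, 10_000, 200_000):
    error = estimated_rate(n) - 0.10
    print(n, round(error, 3))
```

More records reduce the random scatter around the recorded rate, but they do nothing about the systematic error built into the source.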

But who cares? Apparently not the firms paying for the big data result, often to target you for advertising purposes. Supposing they are after gay males – an invitation to Sue Halperin is just collateral damage.

So one must worry about the files that they build up on you. Earlier I said I was comfortable with the rigorous protections in the IDI database, but it claims to have 166 billion facts. Does that mean it has more than 35,000 facts per New Zealander? (Are there 35,000 facts about me or you?)
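The per-person figure is a simple division, sketched here. The population of about 4.7 million is an assumption for illustration, not a number from the IDI documentation, and the variable names are hypothetical.

```python
facts_in_idi = 166e9        # figure quoted for the IDI
assumed_population = 4.7e6  # assumption: approximate New Zealand population

facts_per_person = facts_in_idi / assumed_population
print(round(facts_per_person))
```

The average says nothing about how the facts are spread: the database also covers households and people no longer in the country, so any individual's count could be far above or below it.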

Even so, the conclusions that are reached from them may be wrong. I wonder just how many competent statisticians the proposed Ministry for Vulnerable Children has. They obviously had little counsel in setting up the ministry, because the conclusion that not all ‘vulnerable children’ are vulnerable, and that almost as many children not classified as vulnerable are in fact vulnerable, seems to have been ignored.

Commercial big data miners pose a threat to our privacy and perhaps to our liberties. Moreover there are government databases which pose a similar threat – offshore ones which are not protected as rigorously as SNZ’s IDI.

Reading the Halperin review I concluded that my best defence is to bugger up the system. I keep getting invitations to put my face on the web by organisations who use face recognition algorithms. I don’t, but if they are insistent I shall send them an image of Donald Trump. I learned too that those ‘free’ personality tests on the web are sometimes put up by commercial big data mining firms to get more information on you for their algorithms so they can better target you. Not my thing, but if I do I propose to respond at least three times, with the extra ones being fake. (Further obscuring tactics can be reported in the comments below.)

(A documented example is Cambridge Analytica, a firm Halperin worries about, which has been using quizzes it has put on Facebook to build up psychological profiles. Apparently its results were sold to the Republican Party and are considered one of the reasons Trump was more effective at targeting supporters than Clinton.)

And yet, and yet. I welcome the use of big data for research purposes. As usual for new developments the good and the bad get thoroughly mixed up. The bad damages the usefulness of the good and takes a long time to control.


The Data Futures Partnership is inviting New Zealanders to discuss how they feel about these developments. The relevant website if you would like to give an opinion is here.

Comments (5)

by Katharine Moody on February 06, 2017

For those that are game - this is one (hilarious) example of an app analysing your Facebook 'likes' to determine whether Donald Trump would like you or not;

But be aware, as Brian says, that it's a classic data miner type app.



by Rich on February 07, 2017

As somebody who works a bit in this field, it's very important to realise limitations:

- at one end of the scale, if the question is "Should I show this user advertisement A or B?" then any probability over 0.5 is going to have a better (as in selling more stuff) outcome than random.

- but if you want to know whether a new drug works, you need rather better numbers.

Some people have got terribly excited over using data crunching to inform party policies. I'm hugely sceptical of this, partly because there are far too many unknowns and partly because a party is much better to use its values as a compass than polling reports - e.g. "the numbers all point to our core voters being keen on bigoted fucktards - let's recruit a few as candidates".


by Colin Fleming on February 07, 2017

Note that misuse of this data is not limited to uses stemming from a lack of integrity in those with access to it; a major risk for any large data collection is breaches. Lest anyone think that governments are immune to this, see the OPM breach in the US, which included data on all US federal employees, and included all the data from their security interviews. This means that it is literally all the data that the US Government thought might make their employees vulnerable to blackmail.

Admittedly that appears to have been perpetrated by a sophisticated state actor, but there are plenty of cases when it happens due to sheer incompetence, or just simple mistakes (see the AU Red Cross Blood Service leak or a large leak of Indian patient Pathology reports). 

Computer security is incredibly hard, and as systems get more complex and more data passes through more hands, it becomes essentially impossible to guarantee the safety of data online. Security professionals trying to keep data safe can essentially never ever make a single mistake, and an attacker only has to get lucky once. As private companies and governments increasingly use contractors for many of their services, including those which require access to their data, this becomes even more difficult to enforce.

If you'd like to get an idea of the scale of the problem, check out Have I been pwned?, a website dedicated to notifying users when their data has been compromised. They have data on 2,056 million user accounts which have had their details leaked.

I'm increasingly of the opinion that most data should never be collected, and if it is required for some reason there should be extremely tight restrictions on how it can be used and how long it can be stored. Our Government is in many ways pretty advanced, but systems like RealMe, while undoubtedly convenient, really scare me. Unfortunately it's getting harder and harder to do anything without them. I sure hope the Government is paying attention to data security.

by Colin Fleming on February 07, 2017

One further point on the integrity argument, too. The nature of stored data is that, absent any restrictions, it will be stored forever. This means that you're not just trusting the integrity of whoever is in charge right now, but also whoever might be in charge in the future. Many people in the US who were fine with large-scale NSA data collection under Obama are much less convinced now that Trump is in charge.

It might sound a bit unhinged to worry about that in New Zealand, but in a post-Snowden post-Trump world, things that would have sounded like paranoid fantasy in 2013 are now the new normal.

by Dennis Horne on February 08, 2017

Data. Stored forever but irretrievable in just a few years. 
