Machine learning & data science: what to worry about in the near future

Henry Kissinger recently opined about machine learning. OK, he used the ridiculously overblown phrase “AI” rather than “machine learning” but the latter is what he seemed to be talking about. I’m not a fan of the old reptile, but it is a reasonably thoughtful piece of gaseous bloviation from a politician. Hopefully whoever wrote it for him was well compensated.

There are obvious misapprehensions here; for example, noticing that chess programs are pretty good. You’d expect them to be good by now; we’ve been doing computer chess since 1950.

To put this in perspective: steel-belted radial tires and transistor radios were invented 3 years after computer chess, and we’re pretty good at those as well. It is very much worth noting the first important computer chess paper (Shannon of course) had this sentence in it:

The reality is, computer chess largely hasn’t been a useful wedge in attacking problems of greater significance. Kissinger also mentioned AlphaGo, a recent achievement, but one which isn’t conceptually much different from TD-Gammon, which was done in the 1990s.

Despite all the marketing hype coming out of Mountain View, there really hasn’t been much in the way of conceptual breakthroughs in machine learning since the 1990s. Improvements in neural networks have caused excitement, and the ability of deep learning to work more efficiently on images is an improvement in capabilities.

Stuff like gradient boosting machines has also been a considerable technical improvement in usable machine learning. These don’t really count as big conceptual breakthroughs; just normal improvements for a field of engineering that has poor theoretical substructure. As for actual “AI”: almost nobody is really working on this.

Nonetheless, there has been progress in machine learning and data science. I’m betting on some of the improvements having a significant impact on society, particularly now that the information on these techniques is out there and commodified in reasonably decent software packages.

Most of these things have not been spoken about by government policymaker types like Kissinger, and are virtually never mentioned in dopey “news” articles on the subject, mostly because nobody bothers asking people who do this for a living.

I’d say most of these things haven’t quite reached the danger point for ordinary people who do not live in totalitarian societies, though national security agency type organizations and megacorps are already using these techniques or could be if they weren’t staffed with dimwits. There are also areas which we are still very bad at, which are to a certain extent keeping us safe.

The real dangers out there are pretty pedestrian looking, but people don’t think through the implications. I keep using the example, but numskull politicians were harping on the dangers of Nanotech about 15 years ago, and nothing came of that either. There were obvious dangerous trends happening in the corporeal world 15 years ago which had nothing to do with nanotech.

The obesity rate was an obvious problem back then, whether from chemicals in the environment, the food supply, or the various cocktails of mind-altering pharmaceuticals that fat people need to get through the day. The US was undergoing a completely uncommented upon and vast demographic, industrial, and economic shift.

Also, there was an enormous real estate bubble brewing. I almost think numskull politicians talk about bullshit like nanotech to avoid talking about real problems. Similarly, politicians and marketers prefer talking about “AI” to issues in data science which may cause real problems in society.

The biggest issue we face has a real-world example most people have seen by now. There exist various systems for road toll collection. To replace toll-takers, people are encouraged to get radio tags for their cars like “ezpass.”

Not everyone will have one of these, so the government’s choices are to continue to employ toll-takers, removing most of the benefit of having such tools, or to use an image recognition system to read license plates and send people a bill. The technology which underlies this system is pretty much what we’re up against as a society.

As should be obvious: not many workers were replaced. Arguably none were, though uneducated toll takers were somewhat replaced by software engineers. The real danger we face from this system isn’t job replacement; it is Orwellian dystopia.

Here is a list of obvious dangers in “data science” I’m flagging over the next 10-20 years as worth worrying about as a society.

1) Face recognition software (and to a lesser extent voice recognition) is getting quite good.

Viola-Jones (a form of boosted machine) is great at picking out faces, and sticking them in classifiers that label them has become routine.

Shitbirds like Facebook also have one of the greatest self-owned labeled data sets in the world, and are capable of much evil with it. Governments potentially have very good data sets also. It isn’t quite at the level where we can all be instantly recognized, like, say with those spooky automobile license plate readers, but it’s probably not far away either. Plate readers are a much simpler problem; one theoretically mostly solved in the 90s when Yann LeCun and Leon Bottou developed convolutional nets for ATM machines.

2) Machine learning and statistics on large data is getting quite respectable.

For quite a while I didn’t care that Facebook, Google and the advertisers had all my data because it was too expensive to process it down into something useful enough to say anything about me. That’s no longer true.

Once you manage to beat the data cleaning problems, you can make sense of lots of disparate data. Even unsophisticated old school stuff like Eclat (frequent itemset mining) is pretty helpful, and various implementations of this sort of thing are efficient enough to be dangerous.
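To give a feel for how little machinery this takes, here is a minimal sketch of the Eclat idea in plain Python: store, for each item, the set of transaction ids it appears in, then intersect those sets to count how often items occur together. The function name `eclat`, the `min_support` parameter and the toy baskets are all made up for illustration; real implementations are far more careful about memory and ordering.

```python
def eclat(transactions, min_support):
    """Return {frozenset(itemset): support count} for every frequent itemset."""
    # Vertical layout: item -> set of ids of the transactions containing it.
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # candidates: (item, tidset) pairs that are frequent together with prefix.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[frozenset(itemset)] = len(tids)
            # Only pair with later items so each itemset is generated once.
            narrower = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_support:
                    narrower.append((other, shared))
            if narrower:
                extend(itemset, narrower)

    start = sorted(
        ((item, tids) for item, tids in tidsets.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend(frozenset(), start)
    return frequent


# Hypothetical shopping baskets, purely illustrative.
baskets = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"phone", "charger"},
    {"case", "charger"},
    {"phone", "case", "charger"},
]
result = eclat(baskets, min_support=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Swap the baskets for "sites visited" or "places a phone was seen" and the same few lines start telling you which things go together for which people.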

3) Community detection. 

This is an interesting bag of ideas that has grown powerful over the years. Interestingly I’m not sure there is a good book on the subject, and it seems virtually unknown among practitioners who do not specialize in it. A lot of it is “just” graph theory or un/semi-supervised learning of various kinds.
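As a rough illustration of the "just graph theory" flavor, here is a sketch of label propagation, one of the simplest community detection schemes: every node starts with its own label and repeatedly adopts the most common label among its neighbors. The graph, the function name and the tie-breaking rule (largest label wins) are assumptions made so the example is deterministic; real implementations usually visit nodes in random order and break ties randomly.

```python
from collections import Counter, defaultdict


def label_propagation(adjacency, max_rounds=100):
    """Assign a community label to each node of an undirected graph."""
    labels = {node: node for node in adjacency}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adjacency):
            counts = Counter(labels[nb] for nb in adjacency[node])
            if not counts:
                continue  # isolated node keeps its own label
            best = max(counts.values())
            # Deterministic tie-break (an assumption of this sketch):
            # among the most common neighbor labels, take the largest.
            new_label = max(lab for lab, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


# Toy graph: two triangles (0,1,2) and (3,4,5) joined by the single edge 2-3.
graph = {
    0: [1, 2],
    1: [0, 2],
    2: [0, 1, 3],
    3: [2, 4, 5],
    4: [3, 5],
    5: [3, 4],
}
labels = label_propagation(graph)

communities = defaultdict(list)
for node, label in labels.items():
    communities[label].append(node)
print(sorted(communities.values()))
```

Nobody told the algorithm who belongs with whom; the two groups fall out of the "who talks to whom" structure alone, which is the whole point.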

4) Human/computer interfaces are getting better.

Very often a machine learning algorithm is more like a filter that sends vastly smaller lists of problems for human analysts to solve. Palantir originated to do stuff like this, and while very little stuff on human-computer interfaces is open source, the software is pretty good at this point.

5) Labels are becoming ubiquitous. 

Most people do supervised learning, which … requires labels for supervision. Unfortunately, with various kinds of cookies out there, people using nerd dildos for everything, networked GPS, IoT, radio tags, and so on, there are labels for all kinds of things that didn’t exist before.

I’m guessing as of now or very soon, you won’t need to be a government agency to track individuals in truly Orwellian ways based on the trash data in your various devices; you’ll just need a few tens of millions of dollars worth of online ad company. Pretty soon this will be offered as a service.

Ignorance of these topics is keeping us safe

1) Database software is crap. 

Databases are … OK for some purposes; they’re nowhere near their theoretical capabilities in solving these kinds of problems. Database researchers are, oddly enough, generally not interested in solving real data problems. So you get mediocre crap like Postgres; bleeding-edge designs from the 1980s. You have total horse shit like Spark, laughably insane things like Hive, and … sort of OK designs like vegetables… These will keep database engineers and administrators employed for decades to come, and prevent the solution of all kinds of important problems.

There are people and companies out there that know what they’re doing. One to watch is 1010data: people who understand basic computing facts, like “latency.”

Hopefully, they will be badly managed by their new owners. The engineering team is probably the best to beat this challenge. The problem with databases is multifold: getting at the data you need is important. Keeping it close to learning algorithms is also important. None of these things are done well by any existing publicly available database engines.

Most of what exists in terms of database technology is suitable for billing systems, not data science. Usually, people build custom tools to solve specific problems, like the high-frequency trader guys who built custom data tee-offs and backtesting frameworks instead of buying a more general tool like Kx. This is fine by me; perpetual employment. Lots of companies do have big data storage, but most of them still can’t get at their data in any useful way.

If you’ve ever seen these things, and actually did know what you were doing, even at the level of 1970s DBA, you would laugh hysterically. Still, enough spergs have built pieces of Kx type things that eventually someone will get it right.


2) Database metadata is hard to deal with. 

One of the most difficult problems for any data scientist is the data preparation phase. There’s much to be said about the preparation of data, but one of the most important tasks in preparing data for analysis is joining data gathered in different databases. The very simple example is the data from the ad server and the data from the sales database not talking to each other. So, when I click around Amazon and buy something, the imbecile ad-server will continue to serve me ads on the thing that Amazon knows it has already sold me.

This is a trivial example: one that Amazon could solve in principle, but in practice, it is difficult and hairy enough that it isn’t worth the money for Amazon to fix this (I have a hack which fixes the ad serving problem, but it doesn’t solve the general problem).
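To make the shape of the problem concrete, here is a toy sketch of the ad-server versus sales-database mismatch. This is not the hack mentioned above; every name and record in it is hypothetical. The filtering itself is a one-line anti-join. The hard and expensive part is the `cookie_to_customer` table, which in real life has to be stitched together from two systems that were never designed to share keys.

```python
def ads_to_serve(candidate_ads, purchases, cookie_to_customer):
    """Drop candidate ads for products the same person has already bought.

    candidate_ads:      list of (cookie_id, product) from the ad server
    purchases:          list of (customer_id, product) from the sales database
    cookie_to_customer: dict mapping ad-server cookie ids to sales customer ids
    """
    bought = set(purchases)
    kept = []
    for cookie_id, product in candidate_ads:
        customer_id = cookie_to_customer.get(cookie_id)
        # If the two databases cannot be linked for this cookie, the ad server
        # has no way to know about the purchase and serves the ad anyway.
        if customer_id is not None and (customer_id, product) in bought:
            continue
        kept.append((cookie_id, product))
    return kept


# Hypothetical data: cookie "c2" was never linked to a customer record.
candidate_ads = [("c1", "blender"), ("c1", "toaster"), ("c2", "blender")]
purchases = [("cust-17", "blender"), ("cust-42", "blender")]
cookie_to_customer = {"c1": "cust-17"}

served = ads_to_serve(candidate_ads, purchases, cookie_to_customer)
print(served)
```

The unlinked cookie still gets the blender ad even though that person may well be `cust-42`; that missing row in the mapping is the whole metadata problem in miniature.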

This is a pervasive problem, and it’s a huge, huge thing preventing more data being used against the average individual. If “AI” were really a thing, this is where it would be applied. This is actually a place where machine learning potentially could be used, but I think there are several reasons it won’t be, and this will remain a big impediment to tracking and privacy invasions in 20 years.

FWIW, back to my ezpass license plate photographer thing: it sticks a billing system together with at least two government databases per state that something like ezpass works in. Unless they all used the same system (possible), it was a clever thing which hits this bullet point.

3) The most commonly used forms of machine learning require many examples.

People have been concentrating on Deep Learning, which almost inherently requires many, many examples. This is good for the privacy minded; most data science teams are too dumb to use techniques that don’t require a lot of examples.

These techniques exist; some of them have for a long time. For the sake of this discussion, I’ll call these “sort of like Bayesian,” which isn’t strictly true, but which will shut people up. I think it’s great the average sperglord is spending all his time on Deep Learning which is 0.2% shinier, assuming you have Google’s data sets. If a company like Google had techniques that required only a few examples, they’d actually be even more dangerous.

4) Most people can only do supervised learning. 

(For that matter, non-batch learning terrifies most “data scientists,” just like Kalman filters terrify statisticians even though they are the same damn thing as linear regression). There is some work on stuff like reinforcement learning being mentioned in the funny papers.
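The Kalman filter remark can be checked in a few lines, at least in the simplest special case. The sketch below assumes a single unknown slope in y = theta * x + noise, a state that never changes (no process noise), and a nearly flat Gaussian prior; the function names and numbers are invented for illustration. Under those assumptions, feeding points one at a time through the Kalman update lands on the same answer as the batch least-squares formula (with the tiny ridge term the prior contributes).

```python
def kalman_regression(xs, ys, noise_var=1.0, prior_mean=0.0, prior_var=1e6):
    """Estimate theta in y = theta * x + noise, one observation at a time."""
    theta, p = prior_mean, prior_var
    for x, y in zip(xs, ys):
        innovation = y - x * theta        # prediction error for this point
        s = x * p * x + noise_var         # innovation variance
        gain = p * x / s                  # Kalman gain
        theta = theta + gain * innovation
        # Same as (1 - gain * x) * p, written this way to avoid cancellation.
        p = p * noise_var / s
    return theta, p


def batch_regression(xs, ys, noise_var=1.0, prior_var=1e6):
    """Closed-form posterior mean for the same model with a zero-mean prior."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + noise_var / prior_var)


# Made-up data lying roughly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

theta_online, _ = kalman_regression(xs, ys)
theta_batch = batch_regression(xs, ys)
print(theta_online, theta_batch)
```

The online version never needs the whole data set in memory, which is exactly the property that makes non-batch methods useful on streams of observations.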

I guess reinforcement learning is interesting, but it is not really all that useful for anything practical. The real interesting stuff is semi-supervised, unsupervised, online and weak learning. Of course, all of these things are actually hard, in that they mostly do not exist as prepackaged tools in R you can use in a simple recipe. So, the fact that most domain “experts” are actually kind of shit at machine learning is keeping us safe.