Hack This: How to Consult Google's Machine Learning Oracle


Machine learning and its artificial intelligence parent are probably most often regarded by regular-ass people as kind of opaque and esoteric subjects. Or even just tech buzzwords, which is a shame because it doesn't have to be like that. These things are just tools and as tools they can be employed for extremely complex, inscrutable-seeming tasks found in fields like neuroprosthetics or machine perception, or they can be used for everyday things like classifying spam.

In other words, machine learning doesn't have to be brain surgery, though it can be useful for that. At the same time, getting into something like Google's TensorFlow open-source machine learning library is pretty daunting. Fortunately, within its folio of cloud services Google offers an extremely accessible machine learning platform known as the Prediction API. It's been called a machine learning "black box" because it hides many of the inner workings of its algorithms, offering instead a clean and very simple interface for creating machine learning models from training data and then using those models to make predictions from new data.
You don't even really need to know any code to get going with Prediction, but being able to use the API programmatically greatly increases its power. So, in the guide below, I'm going to explain how to use Prediction mostly just by using Google's browser-based API explorer, but where appropriate, I'll tell you where using code would be useful and where you would start with that.


Machine learning, in the crudest high-level sense, is taking data and then using that data to create mathematical models of some phenomenon—and then using those models to say useful things about new data. The more data we can feed to a model, the more we can "train" it, the less fuzzy its predictions become.
If I have two data points for a spam classification algorithm, one not-spam email and one spam email, my model isn't going to have very much to say. It will basically be making blind guesses. But with 10 million emails, it's going to start to figure out what is special about the spam emails to make them spam, i.e. which features of an email are important in determining its spamminess. The model will eventually be able to classify spam with very little error. Machine learning depends on quantity and quality of data.


We can ask Prediction to predict two very general things: which category something belongs to, and what number goes with something.


We can ask Google Cloud "what is this?" We give Google some choices and we tell Google about some observations that have been made about those choices. Then, Google takes all of that and makes a model. We can then give Google some new observations and ask it what those observations are of. Google will return its best guess, and tell us how sure it is of that guess.
So, imagine some data like this:
butterfly, wings, 1 inch, yellow, orange
bird, wings, 5 inches, blue
plane, wings, 300 feet, silver
butterfly, wings, 1.5 inches, orange, red
dog, tail, 24 inches, brown
I want to use that data to be able to predict whether some new animal/thing is a butterfly, bird, plane, or dog.
So, I ask it to do that. And to ask the model a question, I need to provide it with some observations about the thing I'm asking about. With the features below, I'll ask Google: what is this?
wings, 3 inches, black, brown
And it will make a prediction. But it won't be very good because we haven't provided very much data.
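To make the "what is this?" idea concrete, here is a toy sketch in Python. It is not how Google's models work internally (Prediction hides that); it's just the crudest thing I could think of that returns a guess and a confidence. The function name and the voting scheme are my own assumptions: every training row votes for its label once per feature it shares with the new observation.

```python
from collections import Counter

# The five training rows from the example above: (label, features).
TRAINING_ROWS = [
    ("butterfly", ["wings", "1 inch", "yellow", "orange"]),
    ("bird", ["wings", "5 inches", "blue"]),
    ("plane", ["wings", "300 feet", "silver"]),
    ("butterfly", ["wings", "1.5 inches", "orange", "red"]),
    ("dog", ["tail", "24 inches", "brown"]),
]


def guess_label(rows, observed):
    """Toy classifier (hypothetical, for illustration only).

    Each row votes for its label once per feature it shares with the
    observation. Returns the winning label and its share of all votes.
    """
    votes = Counter()
    for label, features in rows:
        votes[label] += len(set(features) & set(observed))
    total = sum(votes.values())
    if total == 0:
        return None, 0.0
    label, count = votes.most_common(1)[0]
    return label, count / total


label, confidence = guess_label(
    TRAINING_ROWS, ["wings", "3 inches", "black", "brown"]
)
print(label, confidence)
```

With so few rows, the guess is decided almost entirely by the fact that "butterfly" shows up twice, which is exactly why more data matters.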


Google can also give us numbers. This is a different sort of model—a regression model. Say that we take the bank balances of a variety of different people, and we know three things about those people: occupation, gender, age. We fill out a spreadsheet where the data looks something like this (but with a lot more entries):
$2100, student, male, 28
$10,000, lawyer, male, 55
$7005, engineer, female, 33
We feed that into the API and Google will make a model that will predict the balance of someone new, with these properties:
bartender, female, 40
And Google will spit back an actual number. A new number, not a classification. Not a choice among options. That's huge.
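One practical wrinkle with rows like these: a balance written as "$10,000" has a comma in it, which will confuse anything expecting comma-separated columns. Here is a small sketch of tidying such a row before upload. The helper name is made up, and I'm assuming the value column should end up as a bare number in the first position.

```python
def parse_balance_row(raw_row):
    """Turn 'balance, occupation, gender, age' into clean typed fields.

    Hypothetical helper: splits from the right so that a thousands
    separator inside the balance is not mistaken for a column break.
    """
    balance, occupation, gender, age = raw_row.rsplit(",", 3)
    amount = float(balance.replace("$", "").replace(",", "").strip())
    return [amount, occupation.strip(), gender.strip(), int(age)]


for raw in ["$2100, student, male, 28",
            "$10,000, lawyer, male, 55",
            "$7005, engineer, female, 33"]:
    print(parse_balance_row(raw))
```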


Prediction is one of many APIs Google offers as part of its cloud platform. These are all basically gateways or interfaces that we can use to access different services, such as Google Maps, Google Translate, or YouTube. We'd normally think of accessing YouTube via, well, YouTube, but there is also a YouTube API where we can access YouTube videos, comments, analytics, and the rest of it as data in a sort of raw form. You can even imagine the actual YouTube site as being only one of many possible implementations of that data. That's a pretty good way of thinking about APIs, generally: an underlying interface offering some useful service that can be implemented in any number of different ways.
To use the Prediction API, you first need to register a Google Cloud Platform project here. Then, you need to enable billing for the project. Using Google Prediction is free until a certain threshold of use is met, and then it's not. There's pretty much no way you're going to hit that threshold here, but Google still needs you to enable billing.
OK, next you need to enable the Prediction and the Google Cloud Storage APIs on the project you just created. Do that here.


Assuming you're square with Google per the above instructions, we can actually get to the machine learning. We'll need some data in this general format.
item, feature 1, feature 2, feature 3, feature 4 ...
The "item" is the thing that the machine learning model is actually learning about. In this row of data, it's learning that some entity existed that had these four characteristics, or features. Given a lot of rows with a lot of features, it will get better and better at saying what sort of entities new collections of features correspond to. You could also think of the "item" here as a label that we assign to a certain collection of features.
We need to actually find some data now. There's a load of sample datasets out there, many of which are archived at the University of California, Irvine's Machine Learning Repository. Note that they're not all already in the right format. In many cases, the label is at the end of the row, not the beginning. This isn't too hard to fix programmatically but it's a bit beyond the scope of the current Hack This.
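For the curious, the label-moving fix can be sketched in a few lines of Python. This assumes the data is already comma-separated and that the label really is the last field of every row; the function name is mine.

```python
import csv
import io


def label_first(csv_text):
    """Move the last field of every CSV row to the front of that row."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        if not row:
            continue  # skip blank lines
        writer.writerow([row[-1]] + row[:-1])
    return out.getvalue()


# Placeholder rows, not real data: three features, then a label.
print(label_first("1,2,3,yes\n4,5,6,no\n"))
```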
I did find one that's just about perfect. It has to do with fuel consumption given different sorts of automobile features. A tiny sampling looks like this:
18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
18.0 8 318.0 150.0 3436. 11.0 70 1 "plymouth satellite"
The first column is the actual miles per gallon, while the following columns correspond to cylinders, displacement, horsepower, weight, acceleration, model year, origin, and model name. In that order.
Given the 398 entries in the dataset, we should be able to predict the fuel efficiency of a new car based on some or all of these features. You can look at it here and then you should download it as a text file (.txt).
First, open up the file you just downloaded in a text editor. It doesn't matter which one really, just so long as it has a find-and-replace function. You'll see a file full of neat columns that are separated by spaces and tabs rather than commas. We need commas, alas. I managed to fix this in about a minute just by finding runs of several spaces and replacing each run with a comma. You'll need to find and replace a tab for the last column. A big part of using datasets winds up being cleaning and formatting them, but this was pretty easy.
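If you'd rather not do the find-and-replace by hand, the same cleanup can be sketched in Python. I'm assuming every row looks like the sample above: fields separated by spaces or tabs, with the model name wrapped in double quotes. shlex.split respects those quotes, so the spaces inside a name survive.

```python
import csv
import io
import shlex


def whitespace_to_csv(text):
    """Convert space/tab-separated rows (with quoted names) to CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        fields = shlex.split(line)  # splits on whitespace, keeps "quoted names" whole
        if fields:
            writer.writerow(fields)
    return out.getvalue()


sample = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"\n'
    '15.0   8   350.0      165.0      3693.      11.5   70  1\t"buick skylark 320"\n'
)
print(whitespace_to_csv(sample))
```

To run it on the real thing, read the downloaded file into a string, pass it through the function, and write the result back out.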
Now that we have the dataset on our computer, we need to upload it to Google. This is why we enabled Google Cloud Storage above. First, head over to the Cloud Storage browser, which you'll find here. Once you're at the Storage page, you're going to create a "bucket," which is pretty much what it sounds like: a virtual receptacle that you can chuck files of almost any type into. And you have to give your bucket a name, which has to be unique across the whole of Google Cloud, so you might have to get creative. I managed to snag "fuel-efficiency," sorry.
Once you have a bucket, go ahead and upload your .txt file to that bucket. Now you have a unique location within cloud storage that you can refer back to. It's referenced as "bucketname/filename." Simple enough.


Next, we get to actually make a machine learning model. For machine learning laypeople like us, this is what's cool about the Prediction API—we can almost completely outsource the actual guts and gears of machine learning to Google.
To actually access the API itself, we're going to use Google's API Explorer. This is a browser-based interface that we can use for interacting with APIs without writing actual code. All we have to do is fill out some stuff in a form and the Explorer will put it together into a proper API request and send it without us having to really deal with anything. This is handy, but it's also pretty limiting.
To get there, navigate from the Google cloud console (the general interface within which you've been doing all of this stuff) to the API manager and then click on the Prediction API within the list. You'll get a page that looks like this one:
Click on the "Try this" link and you'll be directed to the API Explorer.
You'll next see a list of services. Pick the "insert" one, which will direct you to a page that looks like this:
Give it the name of your project (which you created in the beginning), and then click in the "request body" field. It'll give you a dropdown. We need to make up an id for the model we're about to create and then we need to tell it where our data is. For me, it looks like this:
Click execute, and you should get a "200" reply, indicating that the request didn't have any errors. It will also give you a URL for your new model. This is the "selfLink."
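Under the hood, the form is just assembling a small JSON request body. Here is a sketch of what that body looks like, built in Python. The field names "id" and "storageDataLocation" are what I remember from the API reference, so treat them as an assumption and check the dropdown in the Explorer; the model id and filename below are made-up placeholders.

```python
import json


def build_insert_body(model_id, bucket, filename):
    """Assemble the request body for creating a model (assumed field names)."""
    return {
        "id": model_id,  # any name you make up for the model
        "storageDataLocation": f"{bucket}/{filename}",  # "bucketname/filename"
    }


# "fuel-model" and "auto-mpg.txt" are hypothetical; the bucket is the one from above.
body = build_insert_body("fuel-model", "fuel-efficiency", "auto-mpg.txt")
print(json.dumps(body, indent=2))
```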


And now the moment we've all been waiting for. To make a prediction, we're going to use the same API Explorer functionality. Head back to the page listing all of the Prediction API services, and now instead of picking insert, pick predict.
So, go ahead and give it your project name again and then the model ID you created in the last step. From the dropdown in the request body field, pick input and then, from the new dropdown, pick csvInstance. Maybe you can guess what it wants: comma-separated values. These values describe some new car that we want Google to predict the fuel efficiency of. I'm going to do this for my own vehicle because it's probably easier than trying to make some data up.
This is what I'm feeding it:
Here's what Google predicted:
"outputValue": "19.181282"
Which is a bit low, but I also fudged my figures a bit.
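Note that the predicted value comes back as a string inside a JSON reply, so if you want to do arithmetic with it you have to convert it first. A minimal sketch, assuming the reply contains the "outputValue" field shown above (the function name is mine, and a real reply has other fields that I'm ignoring here):

```python
import json


def predicted_value(response_text):
    """Pull the regression output from a JSON reply and return it as a float."""
    response = json.loads(response_text)
    return float(response["outputValue"])


mpg = predicted_value('{"outputValue": "19.181282"}')
print(mpg)
```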


Using the Prediction API via code rather than the API Explorer is a pretty simple matter. In Python, making a prediction based on an existing model (what we just did) would look like this (the actual data is from some other project, so don't worry about it):
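data = '11.1,1.0,2.0,19.1,98,4,2,2.5,37,2.0,4.0,1.0,2.0,670'
# `api` is assumed to be an already-authenticated Prediction API client object.
prediction = api.trainedmodels().predict(
    project='your project id here',
    id='your model id here',
    body={'input': {'csvInstance': data}},
).execute()  # the request is only sent when execute() is called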
Easy enough, right? The tricky part actually has to do with user authentication, which is necessary because using this API could potentially cost someone money if some usage limits were hit (limits that are well beyond what we've just done). When money is involved with Google's cloud services, you have to use authentication. This is easy in the API Explorer, but doing it in code is something I have a hard enough time explaining to myself, let alone to a bunch of strangers.


I kind of think of the Prediction API as inspiration to go forth and really learn the nuts and bolts of machine learning (or just to think of cool machine learning ideas), but this could have all kinds of out-of-the-box applications for anything from weird art projects to analyzing website traffic. I'm using it to analyze data from sound files recorded in different background environments. Eventually, I want a tool that can take ambient sound and make predictions about where it's from. Prediction makes this easy.
As a final note, some of this can be tricky and you might break things once or twice. Maybe you try and give it a file with the wrong formatting or holes where some data should be. In dealing with huge datasets, this is potentially a huge chore. In a lot of cases, Excel or Google Sheets can help with this part, but expect some trial and error, generally. Predicting the future is worth it.
Courtesy of Motherboard.Vice: http://motherboard.vice.com/read/hack-this-how-to-consult-googles-machine-learning-oracle-2

