Machine Learning over JSON
May 20, 2015
Supervised machine learning is the problem of approximating functions X -> Y from many example (x, y) pairs. Now, the vast majority of supervised learning algorithms assume that X is p-dimensional Euclidean space. As I'll argue, Euclidean space is a poor model of many real datasets, and a JSON document is often a much better fit for X.
Categorical Features
An immediate but somewhat pedantic example of a non-Euclidean space X comes from categorical data. Most models--for example, linear or logistic regression, SVMs, neural nets, and many if not most implementations of tree-based models--do not operate directly on categorical features. Instead, somewhere along the way you need to turn each categorical feature into one or more numeric ones.
One popular way of making categorical features numeric, often called one-hot encoding, is to add one boolean feature for each level of the categorical variable, with names like "Car is a Ford?", "Car is a Toyota?", "Car is a BMW?", and so on. Another popular way is to replace each category with some chosen number: for example, you might declare that "Monday" = 1, "Tuesday" = 2, and so on. This second approach makes the most sense when the categories have some natural notion of order.
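To make these two encodings concrete, here is a minimal sketch in Python. The use of pandas, and the column names, are my own choices for illustration, not anything prescribed by a particular tool or dataset.

import pandas as pd

# A toy dataset with one unordered and one ordered categorical column.
df = pd.DataFrame({
    "car": ["Ford", "Toyota", "BMW", "Ford"],
    "day": ["Monday", "Tuesday", "Monday", "Wednesday"],
})

# One-hot encoding: one boolean column per level ("Car is a Ford?", ...).
one_hot = pd.get_dummies(df["car"], prefix="car")

# Ordinal encoding: map each ordered category to a chosen number.
day_order = {"Monday": 1, "Tuesday": 2, "Wednesday": 3}
ordinal = df["day"].map(day_order).rename("day_number")

numeric = pd.concat([one_hot, ordinal], axis=1)
print(numeric)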
Data is Hierarchical
Categorical features don't immediately fit into the mold of Euclidean space, but that's nowhere near as serious a violation as the fact that many datasets are hierarchical. Let me give you an example: The General Social Survey asks randomly chosen individuals a number of questions about their lives. As is standard, they publish the data as a large matrix, with each row corresponding to a surveyed individual and each column corresponding to a question on the survey. Here are some of the column definitions, excerpted from their codebook:
Column Name | Explanation
------------+----------------------------
SBSEX1      | SEX OF R 1ST SIBLING
SBSEX2      | SEX OF R 2ND SIBLING
SBSEX3      | SEX OF R 3RD SIBLING
SBSEX4      | SEX OF R 4TH SIBLING
SBSEX5      | SEX OF R 5TH SIBLING
SBSEX6      | SEX OF R 6TH SIBLING
SBSEX7      | SEX OF R 7TH SIBLING
SBSEX8      | SEX OF R 8TH SIBLING
SBSEX9      | SEX OF R 9TH SIBLING
SBYRBRN1    | BIRTH YEAR OF R 1ST SIBLING
SBYRBRN2    | BIRTH YEAR OF R 2ND SIBLING
SBYRBRN3    | BIRTH YEAR OF R 3RD SIBLING
SBYRBRN4    | BIRTH YEAR OF R 4TH SIBLING
SBYRBRN5    | BIRTH YEAR OF R 5TH SIBLING
SBYRBRN6    | BIRTH YEAR OF R 6TH SIBLING
SBYRBRN7    | BIRTH YEAR OF R 7TH SIBLING
SBYRBRN8    | BIRTH YEAR OF R 8TH SIBLING
SBYRBRN9    | BIRTH YEAR OF R 9TH SIBLING
SBREL1      | R RELATION TO R 1ST SIBLING
SBREL2      | R RELATION TO R 2ND SIBLING
SBREL3      | R RELATION TO R 3RD SIBLING
SBREL4      | R RELATION TO R 4TH SIBLING
SBREL5      | R RELATION TO R 5TH SIBLING
SBREL6      | R RELATION TO R 6TH SIBLING
SBREL7      | R RELATION TO R 7TH SIBLING
SBREL8      | R RELATION TO R 8TH SIBLING
SBREL9      | R RELATION TO R 9TH SIBLING
SBALIVE1    | R 1ST SIBLING ALIVE?
SBALIVE2    | R 2ND SIBLING ALIVE?
SBALIVE3    | R 3RD SIBLING ALIVE?
SBALIVE4    | R 4TH SIBLING ALIVE?
SBALIVE5    | R 5TH SIBLING ALIVE?
SBALIVE6    | R 6TH SIBLING ALIVE?
SBALIVE7    | R 7TH SIBLING ALIVE?
SBALIVE8    | R 8TH SIBLING ALIVE?
SBALIVE9    | R 9TH SIBLING ALIVE?
(Above, "R" is short for "RESPONDENT").
As you can see, some of the columns have names like "SBSEX1" and "SBYRBRN9" (the sex of the respondent's 1st sibling and the birth year of the respondent's 9th sibling, respectively). For the vast majority of respondents, the value under SBSEX9 is missing, simply because very few people have nine or more siblings. Unsurprisingly, if SBSEX9 is missing, then so is SBYRBRN9: if you don't have a 9th sibling, that sibling has neither a sex nor a birth year.
An alternative representation of a person and his or her siblings would look something like this:
{ "name": "John Smith", "gender": "male", "occupation": "student", "age": 18, "siblings": [ { "sex": "female", "birth_year": 1991, "relation": "biological", "alive": true }, { "sex": "male", "birth_year": 1989, "relation": "step", "alive": true } ] }
Isn't this representation so much more natural than the General Social Survey's "vector representation"? In this representation, John has a list of siblings, each of whom has attributes of his or her own. To figure out how many siblings John has, you just find the length of his siblings list, instead of seeing which columns of his vector are null. In fact, John is now allowed to have arbitrarily many siblings, elegantly bypassing the limit of nine in the vector representation.
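Here is a small sketch of that difference in Python (my own illustration, not code from the survey or this post); the GSS-style row is just a dict with one key per potential sibling column:

import json

# GSS-style "vector" row: one column per potential sibling, mostly missing.
row = {f"SBSEX{i}": None for i in range(1, 10)}
row["SBSEX1"], row["SBSEX2"] = "female", "male"

# Counting siblings means checking which columns are non-null...
n_siblings_vector = sum(1 for i in range(1, 10) if row[f"SBSEX{i}"] is not None)

# ...whereas in the JSON representation it's just the length of a list.
person = json.loads("""
{
  "name": "John Smith",
  "siblings": [
    {"sex": "female", "birth_year": 1991},
    {"sex": "male", "birth_year": 1989}
  ]
}
""")
n_siblings_json = len(person["siblings"])

print(n_siblings_vector, n_siblings_json)  # both print 2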
The cleaner representation above is called JSON, an extremely popular data format among programmers, though substantially less well known in statistics and many other scientific communities. JSON is strictly more expressive than the vector representation. For example, if you have a vector of values like
V1  |V2  |V3  |...|V100
1.23|4.56|-7.8|...|9.10
then you can represent it in JSON as
{ "V1": 1.23, "V2": 4.56, "V3": -7.8, ... "V100": 9.10 }
Missingness is Chunky
JSON captures data missingness realistically and elegantly. As an example, let's consider a person and some of their social media profiles. A JSON representation might look like:
{ "name": "Jane Doe", "age": 14, "profiles": { "facebook": { "friends": [...], "likes": [...], ..., }, "twitter": { "followers": [...], "last_tweet": "we <3 u justin!", ..., } } ..., }
Jane has an account with Facebook and one with Twitter. If she didn't have a Twitter account, she'd be missing all of the Twitter data at once. This is my point: real data has missingness, and it tends to be "chunky", with lots of variables missing together because they come from the same place. In the General Social Survey example, for instance, if a respondent didn't have a ninth sibling, then that sibling's sex, birth year, relation, and alive status were all missing at once.
Not only does this missingness occur in chunks, but the missingness is also expected: not everyone has a Twitter account or a Facebook account or a LinkedIn account, and that's okay. Not everyone uses every feature of an app. Not everyone has a credit report. Not everyone has a ninth sibling. In fact, calling these features "missing" is something of a misnomer, since this "missingness" is a perfectly ordinary state of affairs, often even the typical one.
Machine Learning over JSON
So how can we approximate functions whose domain is a JSON document? How can we do machine learning over JSON?
With existing machine learning algorithms, you have to coerce the JSON document into the vector representation yourself, through featurization. Essentially, you do this by flattening and imputing. You generate features like the number of Facebook friends, the number of likes in the last month, whether the string "justin" appears in the last tweet, and so on.
Many of these features are missing, and in that case you set the feature value to something fixed, perhaps 0, -1, or the feature's mean. You probably also include some boolean variables indicating whether a particular data source is present or not, like whether or not a user has a Twitter account. Missingness is usually informative!
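As a concrete illustration, here is what such a featurizer might look like in plain Python. This is my own sketch, not code from this post; the feature names and imputation constants are arbitrary choices:

# Flatten the JSON document, impute fixed values for missing pieces,
# and record missingness indicators explicitly.
def featurize(person):
    profiles = person.get("profiles", {})
    facebook = profiles.get("facebook")
    twitter = profiles.get("twitter")

    return {
        "age": person.get("age", -1),          # impute -1 when age is unknown
        "has_facebook": facebook is not None,  # missingness indicator
        "num_facebook_friends": len(facebook.get("friends", [])) if facebook else 0,
        "has_twitter": twitter is not None,    # missingness indicator
        "num_twitter_followers": len(twitter.get("followers", [])) if twitter else 0,
        "last_tweet_mentions_justin": (
            "justin" in twitter.get("last_tweet", "").lower() if twitter else False
        ),
    }

jane = {
    "name": "Jane Doe",
    "age": 14,
    "profiles": {
        "facebook": {"friends": ["a", "b"], "likes": ["c"]},
        "twitter": {"followers": ["d"], "last_tweet": "we <3 u justin!"},
    },
}
print(featurize(jane))

Each flat dict produced this way can then be handed, row by row, to any standard vector-based learner.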
This featurization approach to making predictions on JSON documents is fine. It works, it's easy to understand, and you can implement it with existing machine learning tools. But I wonder: could we do better? Could we design a machine learning model specifically for making predictions from JSON documents? Would it be easier to use? Could it outperform machine learning models designed to operate on numeric vectors?