Forget privacy: you're terrible at targeting anyway

https://apenwarr.ca/log/20190201
I don't mind letting your programs look through my personal data if I get something useful in exchange. But that's usually not what happens.

A former colleague of mine once told me: "Everyone loves collecting data, but nobody loves analyzing it later." That may sound shocking, but anyone who has worked in data collection and analysis has run into it. It all starts with a brilliant idea: we'll log every click a person makes on every page of the app! We'll track how long they hesitate over a particular choice! How often they hit the back button! How many seconds of our intro video they watch before bailing! How many times they share our post on social networks!

And they track it all. Tracking is the easy part: add some logging, dump it into a database, done.

And then what? Well, then all of it has to be analyzed. And as someone who has analyzed a lot of datasets about a lot of things, I can tell you: an analyst's job is hard and mostly thankless (except for the salary).

The problem is that there's almost no way to confirm you're right (and the very definition of "right" is fuzzy, as we'll see below). You can almost never draw simple conclusions, only complicated ones, and complicated conclusions are error-prone. What analysts don't tell you is how many wrong charts (and therefore wrong conclusions) get produced on the way to the right ones. Or the ones we think are right. A good chart is so convincing that whether it's correct almost doesn't matter, as long as your goal is just to convince someone. Perhaps that's why newspapers, magazines, and lobbyists publish so many misleading charts.

But let's set aside the problem of errors for now. Let's make the wildly unrealistic assumption that we're great at analyzing all this stuff. What next?

Well, then we'll get rich on targeted advertising and personalized recommendation algorithms. After all, that's exactly what everyone else is doing!

Or not?

With personalized recommendations, things are surprisingly bad. Today the very first recommendation will be an article with a clickbait headline about movie stars, or about what Trump has or hasn't done in the last 6 hours. Or not an article but a video or a documentary. I don't want to read or watch it, but sometimes it sucks me in anyway, and then welcome to the recommendation apocalypse: the algorithm now thinks I like reading about Trump, and Trump is everywhere. Never give an AI positive feedback.

This, by the way, is the dirty secret of machine learning advocates: almost everything ML produces can be had far more cheaply with the dumbest hand-coded heuristics, because ML is mostly trained on examples of what people did while following those same dumb heuristics. There's no magic here. If you use ML to train a computer to screen resumes, it will recommend interviewing men with white-sounding names, just like your HR department already does. If you ask a computer what video a person wants to watch, it will recommend some political propaganda garbage, because 90% of people will actually watch it 50% of the time, unable to help themselves, and that's a pretty good success rate.

To be fair, there are a few examples of ML doing a great job at things traditional algorithms do poorly, like image processing or winning at strategy games. That's great, but odds are your favorite ML application is an expensive substitute for a dumb heuristic.

Someone who works on web search told me they already have an algorithm guaranteed to maximize the click-to-view ratio for any search: just return a page of links to porn. And someone else pointed out that you can flip this around and build a porn detector: any link with a high click-through rate, regardless of the query, most likely leads to porn.
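To see how that flipped heuristic would work, here's a toy sketch; the queries, URLs, and click counts are all invented, and this is obviously not anyone's production detector:

```python
# A toy version of the heuristic described above: flag any URL whose
# click-through rate stays high no matter which query it appears under.
# All queries, URLs, and numbers here are invented for illustration.
from collections import defaultdict

# (query, url, clicks, impressions) -- fake search log entries
log = [
    ("cheap flights", "travel.example/deals",  40, 1000),
    ("cheap flights", "nsfw.example/page",    310, 1000),
    ("tax forms",     "gov.example/forms",    120, 1000),
    ("tax forms",     "nsfw.example/page",    295, 1000),
]

ctrs = defaultdict(list)
for query, url, clicks, views in log:
    ctrs[url].append(clicks / views)

for url, rates in ctrs.items():
    # High CTR regardless of the query: the "probably porn" signal.
    if min(rates) > 0.25:
        print("suspiciously clickable:", url)
```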

The problem is that respectable-looking businesses can't just keep serving you links to porn; that's "not safe for work." So the job of most modern recommendation algorithms is to return something as close to porn as possible while remaining "safe for work." In other words: celebrities (ideally attractive, or at least controversial), politicians, or both. The algorithms creep as close to that line as they can, because that's the local maximum of profitability. Occasionally they cross it, apologize or pay a token fine, and then everything goes back to normal.

That upsets me, but fine, it's just math. And maybe human nature. And capitalism. Whatever; I may not like it, but I can understand it.

My complaint is that none of the above has anything to do with collecting my personal information.

The hottest recommendations have nothing to do with me.


Let's be honest: the best targeted ad is the one a search engine shows me when it gives me exactly what I was searching for. Everyone wins: I find what I was looking for, the seller gets help selling their product, and the search engine gets paid for brokering the deal. I don't know anyone who complains about that kind of advertising. It's good advertising.

And it, too, has nothing to do with my personal information!

Google has been selling search-based contextual ads for more than ten years, since before they even started asking me to log in. Even today, anyone can use a search engine without logging into an account, and it will still serve ads based on their search queries. Great business.

Another kind of ad works well on me too. I sometimes play games; I use Steam, and sometimes I browse games there and keep an eye on the ones I plan to buy. When those games go on sale, Steam sends me a notification email, and sometimes I then buy them. Everyone comes out ahead: I get the game I wanted (at a discount!), the game developer gets paid, and Steam gets paid for brokering. And if I wanted to, I could turn those emails off, but I don't want to, because it's good advertising.

But notice that nobody had to build a profile of me for this. Steam has my account; I told it which games I want, and it sold them to me. That isn't profile-building, it's just remembering a list I provided myself.

Amazon shows me notifications suggesting I might want to re-order consumable items I've bought in the past. That's useful too, and it also requires no profile beyond remembering the transactions they already record. Again, everyone wins.

Amazon also recommends products similar to ones I've bought or looked at. That's useful maybe 20% of the time. If I just bought a computer monitor, and you know I did because I bought it from you, you can stop trying to sell me monitors. But a few days after an electronics purchase they also offer me USB cables, which is probably right. So fine: targeting that helps 20% of the time beats targeting that helps 0% of the time. Credit to Amazon for building a profile of me that's actually useful, though it's only a profile of what I did on their own site, and they don't share it with anyone. That doesn't feel like an invasion of privacy. Nobody is surprised that Amazon remembers what I bought from them or what I looked at.

Things get worse when sellers decide I might want something merely because I visited their site and looked at it. Then their advertising partners chase me around the web trying to sell it to me. They keep doing it even after I've bought the thing. The irony is that this happens precisely because of half-hearted attempts to protect my privacy. The seller doesn't share my transaction data with its advertising partners (odds are they'd be in legal trouble if they did), so the partner doesn't know I already bought the item. All it knows, via the partner's tracker installed on the seller's site, is that I looked at the item, so it keeps advertising it to me just in case.

OK, now we're getting to the interesting part. The advertiser has a tracker that it places on various sites to follow me around. It doesn't know what I bought, but it knows what I looked at, possibly over a long time and across many sites.

Using this information, its diligently trained AI draws conclusions about what else I might want to see, based on...

Based on what, exactly? People like me? Things my Facebook friends look at? Some incomprehensible matrix formula that works 10% better?

Probably not. More likely it just guesses my gender, age, income bracket, and marital status. Then, if I'm a guy, it sells me cars and gadgets, and if I'm a girl, fashion. Not because all guys love cars and gadgets, but because some unimaginative human got into the loop and said "sell my cars mostly to men" and "sell my clothes mostly to women." Maybe the AI infers the wrong demographics (I know Google gets mine wrong), but it hardly matters, because being mostly right beats being 0% right, and advertisers mostly get the demographic targeting they asked for, which beats targeting with 0% efficiency.
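If that sounds too dumb to be real, here's roughly what such human-in-the-loop "targeting" boils down to; a sketch where the rules, campaign names, and profile fields are all invented, not any real ad platform's logic:

```python
# A sketch of the unimaginative hand-written rules described above:
# a guessed demographic profile goes in, a hard-coded campaign comes out.
def pick_campaign(profile: dict) -> str:
    if profile.get("gender") == "male":
        return "cars-and-gadgets"
    if profile.get("gender") == "female":
        return "fashion"
    return "untargeted"

# The guess may simply be wrong, but "mostly right" still beats
# "0% right", which is why this keeps getting deployed.
print(pick_campaign({"gender": "male", "age": 34}))  # -> cars-and-gadgets
```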

How do we know it works this way? Well, we can infer it from how badly advertising actually performs. Anyone can name, within a few seconds, something they wanted to buy that the Algorithm failed to offer them, while the Outbrain ad platform earns bags of money selling car insurance links to people who don't own a car. It might as well be a 1990s late-night TV commercial, where my demographic profile could be deduced from the mere fact that I was still awake.

You track me everywhere, log my every action forever, set yourselves up to have your database stolen, live in dread that some new EU law will destroy your business... and all for this?

Statistical astrology


Of course, it's not actually quite that simple. It isn't one company tracking me on every site I visit. There are truckloads of these companies, and they all track me on every site I visit. Some of them don't even sell ads; they just track, and then sell the tracking data to advertisers, who supposedly use it to improve their targeting.

What an ecosystem. Take news sites. Why do they load so slowly? Trackers. Not the ads: the trackers. There are usually only a couple of ads, and they don't take that long to load. But there are piles of trackers, because each one pays the publisher a little for the right to track every pageview. If you're a giant publisher teetering on the edge of bankruptcy, and you already have 25 trackers on your site, and a 26th tracking company calls offering $50K a year to add theirs too, will you really say no? Your page already barely loads; slowing it down by another 1/25th changes nothing, but $50K changes something.

("Ad blockers" remove annoying ads, but they also speed up the web, mostly by removing trackers. It's a damn shame: trackers don't have to slow down page loads, but they do, because their developers invariably turn out to be idiots, each shipping thousands of lines of Javascript to do what could be done in two. But that's another conversation.)

Then ad sellers and ad networks buy the tracking data from the trackers. The more tracking data they have, the better they can target ads, right? Well, maybe.

And here's the funny part: each tracker has some of the data about you, but never all of it, because no tracker is on every website. Meanwhile, matching up a person's activity across different trackers is hard, because none of them wants to give away its secret sauce. So every ad seller does its best to cross-reference all the data from all the trackers it buys, and mostly fails. Suppose there are 25 trackers, each tracking a million users, with presumably heavy overlap between them. In a rational world you'd guess the data describes a few million distinct individuals. But in a crazy world where the overlap can't be proven, it might be 25 million users! The more trackers your ad network buys from, the more information you get! Probably! So the targeting improves! Maybe! So you should buy ads on our network, not on that other network with less data! Apparently!
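The double-counting is easy to simulate. Here's a sketch with invented, scaled-down numbers (200,000 real users instead of a few million, so it runs quickly):

```python
# A simulation of the double-counting described above: 25 trackers each
# observe 100,000 users drawn from the same 200,000-person population,
# but with no shared identifier the naive total looks ~12x bigger than
# reality. All numbers are invented.
import random

random.seed(1)
population = 200_000   # true number of distinct users (unknown to the buyer)
trackers = 25
seen_each = 100_000    # users observed by each tracker

panels = [set(random.sample(range(population), seen_each))
          for _ in range(trackers)]

naive_total = sum(len(p) for p in panels)   # what the sales pitch implies
true_total = len(set().union(*panels))      # what the data actually covers

print(f"naive:  {naive_total:,} 'users'")        # 2,500,000
print(f"actual: {true_total:,} distinct users")  # ~200,000
```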

And none of it works. They're still trying to sell car insurance to me, a subway rider.

And it's not just advertising


Much of targeted advertising obviously doesn't work, as anyone would notice if they ever stopped and looked at it carefully. But too many people have an incentive to believe otherwise. And if you care about your privacy, the bottom line is this: they keep collecting your personal information whether the method works or not.

What about content recommendation algorithms? Do they work?

Obviously not. Have you ever tried them? No, seriously.

OK, that's not entirely fair. Some things do work. Pandora's music recommendations work surprisingly well, but in a completely non-obvious way. The obvious way would be to take the lists of songs your users listen to, shovel them into an ML training set, and use the result to generate playlists for new users based on... uh... their profile? Except they don't have a profile; they just joined. Based on the first few songs they pick manually, then? Maybe, but they probably started with either a hugely popular song, which tells you nothing, or a very obscure song chosen to test the breadth of your catalog, which tells you even less.

I'm pretty sure Mixcloud works the obvious way. After each mix, the service tries to find the "most similar" mix to play next. That usually turns out to be the exact same mix uploaded by someone else: the most similar mix to any given mix is, of course, itself, so that's what you get. Great job, machine learning, keep it up.
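Here's that failure mode in miniature; a sketch with invented data, not Mixcloud's actual code:

```python
# Nearest-neighbor search over track lists, with no duplicate detection,
# happily hands you a re-upload of the mix you just finished.
mixes = {
    "mix-1 (uploader A)": {"track1", "track2", "track3"},
    "mix-1 (uploader B)": {"track1", "track2", "track3"},  # same mix, re-uploaded
    "mix-2":              {"track3", "track4", "track5"},
}

def most_similar(current: str, catalog: dict) -> str:
    # Jaccard similarity over track sets. We exclude only the exact key,
    # not duplicate *content*, which is exactly the bug being described.
    cur = catalog[current]
    candidates = {k: v for k, v in catalog.items() if k != current}
    return max(candidates,
               key=lambda k: len(cur & candidates[k]) / len(cur | candidates[k]))

print(most_similar("mix-1 (uploader A)", mixes))  # -> "mix-1 (uploader B)"
```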

Which brings us to the "random song, thumbs up / thumbs down" system that everyone uses. And everyone does it badly, except Pandora. Why? Apparently because Pandora spent ages manually encoding the musical characteristics of songs, and wrote "real algorithms" (not ML) that try to produce playlists with the right combinations of those characteristics.

In that sense, Pandora isn't really pure ML. It often produces a playlist you like after just one or two thumbs up or down, because you're navigating a multidimensional graph of songs that humans laboriously built, not a giant matrix of averaged playlists taken from average people who weren't even trying to build playlists. Pandora has failed at plenty of things (notably "being available in Canada"), but their music recommendations work great.
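Here's a minimal sketch of that kind of hand-built feature-space navigation; the songs, the feature axes, and all the numbers are invented, and this is an illustration of the general idea, not Pandora's actual algorithm:

```python
# Songs as hand-coded feature vectors; one or two thumbs are enough to
# move the listener through feature space. No stored user profile,
# no collaborative filtering.
import math

songs = {
    "Song A": (0.9, 0.2, 0.1),   # made-up axes, e.g. (distortion, tempo, vocals)
    "Song B": (0.8, 0.3, 0.2),
    "Song C": (0.1, 0.9, 0.8),
    "Song D": (0.2, 0.8, 0.7),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def next_song(liked, disliked, candidates):
    # Score candidates by similarity to thumbed-up songs minus similarity
    # to thumbed-down ones; a couple of ratings already moves you.
    def score(name):
        feats = songs[name]
        return (sum(cosine(feats, songs[s]) for s in liked)
                - sum(cosine(feats, songs[s]) for s in disliked))
    return max(candidates, key=score)

print(next_song(liked=["Song A"], disliked=["Song C"],
                candidates=["Song B", "Song D"]))  # -> Song B
```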

There's just one catch. If Pandora can give you a good playlist based on your first couple of ratings, then it apparently isn't building a profile of you. And it doesn't need your personal information.

Netflix


And while I'm at it, let me say a little about Netflix: a strange case of a product that started with a very good recommendation algorithm and then deliberately made it worse.

Once upon a time there was the Netflix Prize: $1 million promised to the team best able to predict, from the ratings a person had already given, what ratings they would give to other movies, more accurately than Netflix itself could. Not too surprisingly, it led to a privacy fiasco when it turned out the published datasets could be de-anonymized. That is exactly where long-term storage of people's personal information leads.

Netflix believed its business depended on a good recommendation algorithm. And theirs was already pretty good: I remember using Netflix 10 years ago and getting recommendations for movies I would never have found on my own yet ended up loving. That hasn't happened to me on Netflix in a very, very long time.

The story goes like this: Netflix started as a DVD-by-mail service. Mailing DVDs is slow, so it was essential that at least one of the discs arriving each week be interesting enough to entertain you on Friday night. After too many Fridays in a row with bad movies, you'd probably unsubscribe. A good recommendation system was the key to success. I suspect some very interesting math went into it too, to guarantee the service could rent out a high percentage of the discs actually sitting in the warehouse: it was impractical to stock a truckload of copies of the latest blockbuster that would be popular this month and wanted by nobody the next.

But eventually Netflix moved online, and the cost of a bad recommendation dropped dramatically: you just stop watching and switch to another movie. Moreover, it became perfectly fine for lots of people to watch the same blockbuster. Better than fine, in fact, because then it can be cached at the ISP, and caches work best when everyone is boring and average.

Worse, Netflix noticed a pattern: the more movies people watch per week, the less likely they are to cancel. Which makes sense: the more time you spend on Netflix, the more you "need" it. And when new users are trying out a near-flat-fee service, high retention means faster growth.

I understood all this when I learned the word "satisficing" [a blend of "satisfy" and "suffice" / transl. note]: settling not for the best option but for a good-enough one. Netflix today isn't looking for the best movie for you; it just finds one that's good enough. Given a choice between an award-winning film you'll love with 80% probability and hate with 20% probability, and a mainstream film that's 0% special but that you 99% won't spit at, it will recommend the second one every time. Outliers are bad for business.
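Here's a back-of-the-envelope version of that trade-off. The payoff numbers are invented; the only assumption doing the work is that one hated movie costs more retention than one loved movie gains:

```python
# Expected retention impact of a recommendation, with made-up payoffs.
# If hating a movie risks cancellation (loss_hate > gain_love), the
# bland pick wins every time.
def expected_retention(p_love, p_ok, p_hate,
                       gain_love=1.0, gain_ok=0.2, loss_hate=5.0):
    return p_love * gain_love + p_ok * gain_ok - p_hate * loss_hate

award_winner = expected_retention(p_love=0.80, p_ok=0.00, p_hate=0.20)
mainstream   = expected_retention(p_love=0.00, p_ok=0.99, p_hate=0.01)

print(f"award winner: {award_winner:+.3f}")  # 0.8*1.0 - 0.2*5.0 = -0.200
print(f"mainstream:   {mainstream:+.3f}")    # 0.99*0.2 - 0.01*5.0 = +0.148
```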

The punchline is that you don't need a risky, privacy-violating user profile to recommend a mainstream movie. Such movies are engineered to be inoffensive to almost everyone. My Netflix home screen no longer says "recommended for you"; it says "new releases," then "trending now," then "watch it again."

Netflix did pay out the $1 million, as promised, for the winning recommendation algorithm, which was even better than their own. But instead of using it, they threw it away.

Some expensive A/B testers determined that this is what makes me watch the most hours of mindless TV per day. Their profits keep growing. And they don't even need to invade my privacy to do it.

And who am I to say that they are wrong?
