Data Crunch conference in Budapest (October 29-31)
This year I attended the Data Crunch conference in Budapest, which covers data analytics and data engineering. Speakers from LinkedIn, Uber, GitHub and many smaller companies are invited to share their experience or talk about tools for working with data. Just as interesting to me is talking to the other participants to understand how our Russian reality differs from Europe and the USA.
My highlights:
- Full Stack Data Science - two talks were devoted to roughly the same topic I have written about before: make a DS / DA a person who can solve problems from beginning to end. Do not divide the work by "function"; divide it by "topic". That is, working with data should not be split between those who prepare, process, analyze, build models and visualize. Instead, "topics" are split between specialists who can each do the full cycle.
- From zero to hero - the speakers talked about how they built their DS department from scratch. As usual, sensible common-sense ideas work:
  - 2 DS as the minimum team size.
  - 2 data engineers to support them.
  - A product owner who communicates with the business.
  - Build a good ecosystem. Speakers usually advocate for open source, and almost every talk mentions Hadoop. The catch is that on the project I work on, as on those of many readers, Hadoop is not needed, because the data volume is too small to benefit from it. In general, my attitude to open source is: try it, study it, but if your company has already bought something, it can be more profitable to keep living in the proprietary ecosystem than to rush to other technologies and then have to "glue" them together or learn them from scratch.
  - Test what you are doing: run A / B tests and evaluate the results. Oddly enough, not everyone follows such simple advice in practice.
- Democratisation of data at Uber - I have already written a separate article about this.
- AI ethics - the talk discussed that many tasks have several fundamentally different optima. Roughly speaking, you may have an "effective" solution and an "ethical" solution, and the problem is that each is maximized under different conditions. Neither mathematics nor algorithms provide the right answer; it is up to people to decide what they want from their machines. As an example, the speaker said that an algorithm for assessing the risk of reoffending tends to give higher risk scores to Black Americans. This risk score is used in decisions on early release. The dilemma is that socially unacceptable "discrimination" against Black people is set against an objectively unacceptable subsequent rise in crime from those who were wrongly released early, and you cannot satisfy both in the same algorithm. Which is interesting, by the way.
- ML and information warfare - the speaker described how he analyzed texts and the links between them and found suspicious activity on Facebook before Trump's election. He argues that someone was deliberately and massively "shifting" the agenda, so that the language used by conservative groups became more racist. He investigated this by analyzing the vocabulary used in neo-Nazi groups and comparing it with the language of conservatives. He found that the two vocabularies began to converge before Trump's election, although nothing of the kind had been observed earlier. In general, he hinted that Putin is to blame :)
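The vocabulary comparison from the last item can be sketched roughly as follows. This is only an illustration and not the speaker's actual method: it assumes simple bag-of-words counts and cosine similarity, and all function names and the input layout (a dict mapping a period to a list of texts) are made up for the example.

```python
import math
import re
from collections import Counter


def term_frequencies(texts):
    """Count lowercase word occurrences across a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vocabulary is treated as sharing nothing
    return dot / (norm_a * norm_b)


def similarity_by_period(group_a, group_b):
    """Compare two groups' vocabularies for every period present in both.

    group_a, group_b: dict mapping a period label to a list of texts
    (a hypothetical layout chosen for this sketch).
    """
    return {
        period: cosine_similarity(
            term_frequencies(group_a[period]),
            term_frequencies(group_b[period]),
        )
        for period in sorted(group_a.keys() & group_b.keys())
    }
```

A similarity that rises from one period to the next would correspond to the "vocabularies converging" observation. A real analysis would also need stop-word handling, normalization for group size, and a baseline to compare against.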
From conversations with people at the conference:
- R vs Python. People live with both tools: R is usually preferred by people with backgrounds in science and math, and Python by people with backgrounds in development. R is most often used for exploratory analysis, Python for pipelines. Models are written in both. I personally have experience with production models in R, for example.
- A / B tests - regularly evaluating your own actions and choosing solutions based on A / B tests is still a rare practice (out of a dozen groups I spoke with, only 1 runs A / B tests). People do not want to spend effort on A / B tests: they say they already know the answer, or the CEO "sees" how it should be done ...
- Everyone has communication problems - with managers, with clients, within the company, etc. Improving communications is a growth point for almost all teams.
- The main work in machine learning is not choosing the coolest model but feature engineering and data preparation. Neither Google nor Facebook has any "secret" models; the effectiveness of their algorithms more likely comes from how the data is processed and prepared for those models. This is generally good news, because it means that publicly available xgboost or regression is a cutting-edge algorithm for most tasks.
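To make the A / B testing point from the list above more concrete, here is a minimal sketch of evaluating a test result. It assumes the metric is a conversion rate and uses a two-sided two-proportion z-test, which is one common choice and not something any of the speakers prescribed. The function name and the numbers in the usage lines are invented for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    conv_a, conv_b: number of conversions in variants A and B
    n_a, n_b:       number of users shown variants A and B
    Returns (z, p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in the data; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: 100 of 1000 users convert on A, 150 of 1000 on B
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be noise. In practice the sample size and the stopping rule should be fixed before the test starts, otherwise the p-value is not trustworthy.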