sen77 November 2, 2018 at 20:07

A Better Way to Code

Original author: Mike Bostock (@mbostock)

Translation

From a translator:
I am neither a professional programmer nor a professional translator, but the tool described in this article, from the creator of the popular D3.js library, made a strong impression on me.

I was surprised to find that on Habr, and indeed on the Russian-speaking Internet, this tool has been unjustly ignored for more than a year. So I decided that I was simply obliged to contribute to the development of the art of programming, in JavaScript in particular.

Meet d3.express, an integrated research environment.
(Since January 31, 2018, d3.express has been called Observable and lives at beta.observablehq.com.)

If you have ever struggled to debug your own code or to understand someone else's, you are not alone. This article is for you.

For the last eight years I have been developing tools for visualizing information. The most successful result of these efforts is the D3.js library. However, the danger of spending so long building tools is that you forget why you are building them: the tool becomes an end in itself, and the value of using it fades into the background.

The purpose of a visualization tool is to build visualizations. But what is the purpose of visualization? Per Ben Shneiderman:

“The purpose of visualization is insight, not pictures.”

Visualization is a means to an end: a means to insight. It is a way to think, to understand, to discover, and to communicate something about the world. If we see visualization only as the task of finding a visual encoding, we ignore many other tasks: finding relevant data, cleaning it, transforming it into efficient structures, statistical analysis, modeling, explaining our findings ...

These tasks are often solved with code. Alas, programming is damn hard! The name itself implies impenetrability. The word “code” derives from machine code: low-level instructions executed by a processor. Code has become friendlier since then, but there is still a long way to go.

Ghost in the Shell (1995)

As a prime example, here is a Bash command to generate a choropleth map of California population density. It only produces simplified geometry; a few more commands are needed to get SVG.

This is not machine code. From the machine's point of view, this is very high-level programming. On the other hand, it can hardly be called human language: strange punctuation, cryptic abbreviations, levels of nesting. And two languages: JavaScript is awkwardly embedded in Bash.

Bret Victor gives this concise definition of programming:

Programming is the blind manipulation of symbols

By “blind,” he means that we cannot see the results of our manipulations. We can edit a program, re-run it, and see the result. But programs are complex and dynamic, and we have no way to observe the effects of our edits directly and immediately.

By “symbols,” he means that we do not manipulate the output of our program directly, but instead work with abstractions. These abstractions can be powerful, but they can also be hard to control. In Donald Norman's terms, these are the Gulf of Evaluation and the Gulf of Execution.

But clearly some code is easier to read than other code. One symptom of inhuman code is “spaghetti”: code without structure or modularity, where to understand one part of the program you must understand the whole program. This is often caused by shared mutable state. When a piece of state is modified by several parts of the program, it is very hard to reason about its value.

Indeed, how do we know what a program does? If we cannot track all of its runtime state in our heads, reading the code is not enough. We use logs, debuggers, and tests, but these tools are limited. A debugger, for example, can only show a few values at a single moment in time. We still find it enormously difficult to understand code, and it can feel like a miracle when something works at all.

Despite these problems, we still write code for countless applications, more than ever. Why? Are we masochists? (Maybe.) Are we unable to change? (Partially.) Is there really no better solution?

In general, and this is the critical qualifier, no. Code is often the best tool in our arsenal because it is the most general tool we have; code has almost unlimited expressiveness. Alternatives to code, as well as higher-level programming interfaces and languages, do well in specific domains. But these alternatives sacrifice generality for greater efficiency within those domains.

If you cannot constrain the domain, you are unlikely to find a viable replacement for code. There is no universal replacement, at least as long as people mainly think and communicate through language. And it is very hard to constrain the domain of science. Science is fundamental: studying the world, deriving meaning from empirical observations, modeling systems, computing quantities.

A tool meant to facilitate discovery must be able to express novel thoughts. Just as we do not write prose from phrase templates, we cannot be limited to chart templates for visualization or to a fixed list of formulas for statistical analysis. We need more than configuration. We need to compose primitives into designs of our own.

If our goal is to help people gain insight from observation, we must consider the problem of how people write code. What Victor said about mathematics applies equally to code:

The power to understand and predict the quantities of our world should not be restricted to those with a freakish knack for manipulating abstract symbols

Improving our ability to program is not only about making workflows more convenient or efficient. It is about helping people better understand their world.

Introducing Observable

If we cannot get rid of code, can we at least make it easier for people, with our sausage fingers and limited brains?

To explore this question, I am building an integrated research environment called Observable. It is for exploring data, for understanding systems and algorithms, for teaching and sharing programming techniques, and for sharing interactive visual explanations. To make visualization easier and, in turn, to make discovery easier, we must first make programming easier.

I can't pretend to make programming easy. The ideas we want to express, explore, and explain can be irreducibly complex. But by reducing the cognitive load of programming, we can make the analysis of quantitative phenomena accessible to a wider audience.

1. Reactivity

The first principle of Observable is reactivity. Instead of issuing commands to modify shared state, each piece of state in a reactive program defines how it is computed, and the runtime manages its evaluation and propagates derived state. If you write formulas in Excel, you are doing reactive programming.

Here is a simple Observable notebook to illustrate reactive programming. It is a bit like the developer console in a browser, except that our work is saved automatically, so we can come back to it later or share it with others. And it is reactive.

In imperative programming, c = a + b sets c equal to a + b. This is a value assignment. If a or b changes, c keeps its previous value until we perform a new assignment to c. In reactive programming, c = a + b is a variable definition. It means that c is always equal to a + b, even if a or b changes. The runtime keeps c up to date.
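The difference between assignment and definition can be sketched in plain JavaScript. The toy Runtime class below is a hypothetical illustration, not Observable's implementation: it simply recomputes a cell from its definition each time the value is requested, whereas a real runtime would presumably cache values and push updates.

```javascript
// A toy reactive runtime (hypothetical; not Observable's actual code).
class Runtime {
  constructor() {
    this.cells = new Map(); // name -> { inputs, compute }
  }

  // Define (or redefine) a cell in terms of the cells it depends on.
  define(name, inputs, compute) {
    this.cells.set(name, { inputs, compute });
  }

  // Compute the current value of a cell from its definition.
  value(name) {
    const cell = this.cells.get(name);
    if (!cell) throw new Error(`${name} is not defined`);
    const args = cell.inputs.map((input) => this.value(input));
    return cell.compute(...args);
  }
}

const runtime = new Runtime();
runtime.define("a", [], () => 1);
runtime.define("b", [], () => 2);
runtime.define("c", ["a", "b"], (a, b) => a + b);

const before = runtime.value("c");  // computed from a = 1, b = 2
runtime.define("a", [], () => 10);  // redefine a; c is never reassigned
const after = runtime.value("c");   // c follows a automatically
```

Note that c is defined once and never touched again; only its inputs change.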

As programmers, we now only care about the current state. The runtime manages state changes. This may seem like a small thing here, but in larger programs it lifts a significant burden.

A research environment should do more than add a few numbers, so let's try working with data. To load the data, several years of Apple stock prices, we will use d3.csv. It uses the Fetch API to download a file from GitHub and then parses it into an array of objects.

Both require('d3') and the data request are asynchronous. That would be a hassle in imperative code, but here we barely notice: cells that reference d3 are not evaluated until it has loaded.

Reactivity means that we can write most asynchronous code as if it were synchronous.
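One way to picture this is a set of cells whose values are promises, where a dependent cell simply waits for its inputs. The sketch below is an assumption-laden illustration in plain JavaScript: the define helper, the cell names, and the delay stand-in for a slow load are all hypothetical, and cells must be defined after their inputs in this simplified version.

```javascript
// Each cell's value is stored as a promise (hypothetical sketch).
const cells = new Map();

// A cell is computed only once all of its inputs have resolved.
function define(name, inputs, compute) {
  const ready = Promise.all(inputs.map((input) => cells.get(input)));
  cells.set(name, ready.then((values) => compute(...values)));
}

// Stand-in for a slow operation such as loading a library or fetching a file.
const delay = (ms, value) =>
  new Promise((resolve) => setTimeout(resolve, ms, value));

define("lib", [], () => delay(20, { double: (x) => x * 2 }));
define("data", [], () => delay(10, [1, 2, 3]));

// The body of this cell reads like synchronous code: no callbacks, no await.
define("doubled", ["lib", "data"], (lib, data) => data.map((d) => lib.double(d)));
```

The cell that uses lib and data never mentions promises; the waiting is handled entirely by define.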

What does the data look like? Let's see:

d3.csv is conservative and does not infer data types such as numbers, so all fields are strings. We need more precise types. We can convert the “close” field to a number by applying the unary plus operator (+) to it, and we immediately see the effect: the purple string becomes a green number.

Parsing the date takes a little more work, since JavaScript does not natively support this format.

Suppose we have a parseTime function that parses a string and returns a Date object. What happens if we call it?

Oops! It threw an error. But the error is both local and temporary: other cells are not affected, and it will disappear once we define parseTime. So notebooks in Observable are not only reactive, they are also structured. Global errors no longer exist.

When we define parseTime, we again see the effect: the data is reloaded, parsed, and displayed. We are still manipulating abstract symbols, but at least we are doing it less blindly.
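The type conversion described above might look roughly like this. The original cells did not survive in this copy, so everything here is an assumption for illustration: the date format ("24-Apr-07"), the hand-written parseTime, and the two made-up sample rows. In the notebook itself a library parser would presumably be used instead.

```javascript
const MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];

// Hypothetical parser for strings like "24-Apr-07" (assumed format).
function parseTime(string) {
  const [day, month, year] = string.split("-");
  const monthIndex = MONTHS.indexOf(month);
  if (monthIndex < 0) throw new Error(`cannot parse date: ${string}`);
  return new Date(2000 + Number(year), monthIndex, Number(day));
}

// Made-up rows shaped the way a CSV parser returns them: all strings.
const raw = [
  { date: "24-Apr-07", close: "93.24" },
  { date: "25-Apr-07", close: "95.35" },
];

// Unary plus coerces the string to a number; parseTime builds a Date.
const data = raw.map((d) => ({ date: parseTime(d.date), close: +d.close }));
```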

Now we can query the data, say, to compute the time range:

Oops, we forgot to give the data a name! Let's fix it:

Here we find another “human” feature: cells can be written in any order.
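Order independence follows naturally if a runtime evaluates cells by dependency rather than by position. A minimal sketch of that idea, assuming each cell declares the names it references (the cell list and the evaluationOrder helper are hypothetical), is a depth-first topological sort:

```javascript
// Cells in "document order": extent refers to data, which is defined later.
const cells = [
  { name: "extent", inputs: ["data"] },
  { name: "data", inputs: ["parseTime"] },
  { name: "parseTime", inputs: [] },
];

// Depth-first topological sort: every cell comes after all of its inputs.
function evaluationOrder(cells) {
  const byName = new Map(cells.map((cell) => [cell.name, cell]));
  const state = new Map(); // name -> "visiting" | "done"
  const order = [];

  function visit(name) {
    if (state.get(name) === "done") return;
    if (state.get(name) === "visiting") {
      throw new Error(`circular definition: ${name}`);
    }
    const cell = byName.get(name);
    if (!cell) throw new Error(`${name} is not defined`);
    state.set(name, "visiting");
    cell.inputs.forEach((input) => visit(input));
    state.set(name, "done");
    order.push(name);
  }

  cells.forEach((cell) => visit(cell.name));
  return order;
}

const order = evaluationOrder(cells);
```

However the cells are arranged on the page, parseTime is evaluated before data, and data before extent.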

Visual output

As in the developer's console, the result of executing a cell in Observable is visible immediately below the code. But unlike the console, Observable cells can display graphical user interfaces! Let's visualize our data on a simple line chart.

First we determine the size of the chart: width, height and margin.

Now the scales: time for x and linear for y.
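A scale is a function from data values to pixel positions. The helper below is a hand-rolled stand-in written for illustration, not the D3 scale API, and the dimensions, margins, and domains are assumed values rather than those of the original notebook.

```javascript
// Returns a function that maps [d0, d1] linearly onto [r0, r1].
function linearScale([d0, d1], [r0, r1]) {
  return (value) => r0 + ((value - d0) / (d1 - d0)) * (r1 - r0);
}

// Assumed chart dimensions.
const width = 640;
const height = 400;
const margin = { top: 20, right: 30, bottom: 30, left: 40 };

// Dates coerce to milliseconds in arithmetic, so the same helper
// can serve as a simple time scale for x.
const x = linearScale(
  [new Date(2007, 0, 1), new Date(2008, 0, 1)],
  [margin.left, width - margin.right]
);

// The y range is flipped because SVG's y coordinate grows downward.
const y = linearScale([0, 200], [height - margin.bottom, margin.top]);
```

Because x and y depend only on the data domain and the chart dimensions, redefining any of those would be enough to reposition everything drawn with them.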

And finally the SVG element. Since this definition is more complicated than our previous cells, we can use curly braces ({ and }) to define it as a block rather than an expression:

DOM.svg is a convenience method that calls document.createElementNS. It returns a new SVG node with the specified width and height. Let's extend it to use d3-selection for DOM manipulation:

I can't show the code and the chart at the same time because of limited screen space, so let's first watch the chart as it is built. This gives a sense of the visual feedback you get as you define the three main components of the chart: the x-axis, the y-axis, and the line.

This animation was made by typing each line of code in order (except for the return statement, which is needed to see anything at all):

This is a simple chart, but the program's topology is already becoming more complicated. Here is the directed acyclic graph of references, itself made in Observable using GraphViz:

Node 93 is the SVG element. A few observations. It is now very easy to make this chart responsive: width, height and margin are constants, but if they were dynamic, the axes and the chart itself would update automatically. Similarly, it is easy to make the chart dynamic by redefining data. We will see this soon with an example of streaming data.

But let's take a closer look at reactive code. In imperative programming, the definition of a variable tends to be spread throughout the code rather than stated in one place. For example, we might construct the x-scale right after the page loads but defer setting its domain until the data arrives.

Such fragmented definitions get interleaved with other code and make a program harder to follow. Reactive definitions, by contrast, also favor reuse: self-contained, stateless definitions are easier to copy or import into other documents.

You can create any kind of DOM (HTML, canvas, WebGL) and use any library.

Here is a chart made using Vega-Lite:

Animation

How about canvas? Say we need a globe. We can load the borders of the world's countries and apply an orthographic projection.

(A mesh, if you are curious, is the combined borders represented as a polyline. I use the mesh method because this dataset contains polygons, and a mesh is a little faster and prettier to render.)

A powerful feature of reactive programming is that we can quickly replace a static definition, such as a fixed orthographic projection, with a dynamic one, such as a rotating projection. The runtime will redraw the canvas whenever the projection changes.

Observable's dynamic variables are implemented as generators: functions that return multiple values. For example, a generator with a while-true loop produces an infinite stream of values. The runtime pulls a new value from each active generator up to sixty times per second.

(looks better at 60 FPS)
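A generator with a while-true loop can be sketched in a few lines of standard JavaScript. The rotation function and its step parameter are illustrative names, and the for loop merely stands in for a runtime that would pull one value per animation frame.

```javascript
// Yields an ever-advancing rotation angle, wrapping around at 360 degrees.
function* rotation(step = 1) {
  let angle = 0;
  while (true) {
    yield angle;
    angle = (angle + step) % 360;
  }
}

// Stand-in for the runtime: pull one value per "frame".
const angles = rotation(90);
const frames = [];
for (let i = 0; i < 5; i++) {
  frames.push(angles.next().value);
}
```

The generator never finishes on its own; it produces a value only when asked, which is why an infinite loop is harmless here.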

Our canvas definition creates a new canvas every time it runs. This may be acceptable, but we can get better performance by reusing the canvas. The previous value of the variable is available as this.

Oops! By reusing the old canvas, we smear our globe:

The glitch is easily fixed by clearing the canvas before redrawing.

So, at the cost of a little extra complexity, we can improve performance, and the resulting animation has only minor overhead compared to vanilla JavaScript.

Interaction

If generators are good for scripted animation, how about interaction? Generators to the rescue again! Only now our generators are asynchronous, returning promises that resolve whenever there is new input.

To make the rotation interactive, let's first define a range input. Then we connect it to a generator that yields the current value of the input whenever it changes. We could implement this generator by hand, but there is a convenient built-in method called Generators.input.
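To give a feel for what implementing such a generator by hand might involve, here is a minimal sketch using a standard async generator. It is an assumption about the general shape, not the source of Generators.input; the inputValues name is hypothetical, and a bare EventTarget with a value property stands in for a real range input so the code can run outside a browser.

```javascript
// Yields element.value immediately, then again after each "input" event.
async function* inputValues(element) {
  let notify = null;  // resolver for the promise we are waiting on, if any
  let dirty = false;  // true if an event arrived since the last yield
  const listener = () => {
    dirty = true;
    if (notify) {
      notify();
      notify = null;
    }
  };
  element.addEventListener("input", listener);
  try {
    while (true) {
      yield element.value;
      // Wait for the next event unless one already arrived in the meantime.
      if (!dirty) await new Promise((resolve) => (notify = resolve));
      dirty = false;
    }
  } finally {
    // Runs when the consumer stops iterating (e.g. via return()).
    element.removeEventListener("input", listener);
  }
}

// Stand-in for <input type="range">: any EventTarget with a value property.
const range = Object.assign(new EventTarget(), { value: 0 });
```

A consumer could then write `for await (const angle of inputValues(range)) { ... }` and receive a fresh value each time the input changes.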

Now we substitute the value as the longitude for the rotation of the interactive globe:

This is a concise definition of a user interface. But we can shorten it further by folding the definitions of range and angle into a single cell using Observable's viewof operator. It displays the input to the user, while exposing the current value to the code.

The ability to display arbitrary DOM and expose arbitrary values to code makes interfaces in Observable very, well ... expressive. You are not limited to sliders and drop-down menus. Here is a Cubehelix color picker, implemented as a table of sliders, one for each color channel.

When you drag a slider, the value of the corresponding output is updated, and the generator then yields the current color.

We can create any graphical interface we want. And we can design sensible programming interfaces that expose their values to code. This lets you quickly build powerful interfaces for exploring data. Here is a histogram showing the performance of several hundred stocks over a five-year period. (I shortened the code, but it looks like this.)

In other environments, a histogram like this can be a visual dead end. You can look at it, but to inspect the underlying values you need to query the data separately in code. In Observable, we can quickly augment visualizations and expose selections interactively. Then we can see the data under the cursor by direct manipulation.

This uses the default object inspector, but you can do anything interactively, for example live totals, real-time statistics, or even linked visualizations.

To show that this is not magic, here is the code that adapts d3-brush to Observable. On a brush event, we compute the new filtered data, set it as the value of the SVG node, and dispatch an input event.

Animated Transitions

By default, reactions are instantaneous: when a variable's value changes, the runtime recomputes the derived variables and immediately updates the display. But such immediacy is not always required, and it is sometimes useful to animate transitions to preserve object constancy. Here, for example, we can follow the bars as they are re-sorted:

Understanding the implementation of this chart requires familiarity with D3, namely data joins with keys and staggered transitions. But even if this code is opaque, it hopefully demonstrates that today's open-source libraries are easy to use in Observable.

2. Visibility

Visual output helps us better perceive the current state of a program. Interactive programming helps us explore the program's behavior more thoroughly by poking at it: changing, deleting, and reordering code and observing what happens.

For example, by commenting out the forces in the chart below, we better understand their contribution to the overall layout.

(Watch on YouTube how I play with this.)

You may have seen similar toys; for example, Steve Haroz has a great sandbox for d3-force. Here you do not need to build a user interface to play; it comes free with interactive programming!

Algorithm Visualization

A more explicit approach to studying a program's behavior is to augment the code to expose its internal state. Generators help here too. We can take a normal function, like this one that sums an array of numbers:

And turn it into a generator that yields the local state during execution, in addition to the normal return value at the end:
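The original code for these two cells is not preserved in this copy, so the following is a reconstruction under assumptions: the names sum, sumSteps, and trace, and the shape of the yielded state, are illustrative choices rather than the author's.

```javascript
// An ordinary function: only the final result is visible from outside.
function sum(numbers) {
  let total = 0;
  for (const n of numbers) total += n;
  return total;
}

// The same algorithm as a generator: it yields its local state after
// every step and still returns the final total at the end.
function* sumSteps(numbers) {
  let total = 0;
  for (let i = 0; i < numbers.length; i++) {
    total += numbers[i];
    yield { index: i, total };
  }
  return total;
}

// Collects every intermediate state plus the final return value.
function trace(generator) {
  const states = [];
  let step = generator.next();
  while (!step.done) {
    states.push(step.value);
    step = generator.next();
  }
  return { states, result: step.value };
}

const { states, result } = trace(sumSteps([1, 2, 3]));
```

The algorithm itself contains no drawing or logging code; whatever consumes the generator decides how to inspect or visualize each yielded state.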

Then, to understand the behavior, we can visualize or inspect the internal state. This approach cleanly separates the implementation of the algorithm from its study, rather than embedding visualization code directly inside the algorithm.

As an example, let's look at D3's hierarchical circle-packing layout.

We have a set of circles that we want to pack into as little space as possible without overlapping, like a huddle of penguins in Antarctica. Our task is to place circles one at a time until all circles are placed.

Since we want the circles packed as tightly as possible, it is fairly obvious that every circle we place must touch at least one (actually two) of the circles we have already placed. But if we pick an existing circle at random as the tangent circle, we will waste a lot of time trying to put the new circle in the middle of the pack, where it will overlap other circles. Ideally, we only consider circles on the outside of the pack. But how can we efficiently determine which circles are on the outside?

Wang's algorithm maintains a “front chain,” shown here in red, which represents these outer circles. When placing a new circle, it selects the circle in the front chain that is closest to the origin. The new circle is placed tangent to this circle and its neighbor in the front chain.

If this placement does not overlap any other circle in the front chain, the algorithm moves on to the next circle. If it does overlap, we cut the front chain between the tangent circles and the overlapping circle, and the overlapping circle becomes the new tangent circle. We repeat this process until there is no overlap.

I find this animation mesmerizing. If you look closely, you will see brief moments when a large circle is squeezed out of the pack as the front chain is cut. This is not only pleasing to the eye, but was also extremely useful for finding a long-standing bug in D3's implementation, where, very rarely, it would cut the wrong side of the front chain and circles would overlap.

After we have packed our circles, we need to compute the enclosing circle of the pack so that circle packing can be repeated up the hierarchy. The usual way to do this is to scan the front chain for the circle that is farthest from the origin. This is a pretty decent guess, but it is not exact. Fortunately, there is a simple extension of Welzl's algorithm for computing the smallest enclosing circle in linear time.

To see how Welzl's algorithm works, suppose we already know the enclosing circle for some of the circles and want to include a new circle. If the new circle is inside the current enclosing circle, we can move on to the next circle. If the new circle is outside the enclosing circle, we must expand the enclosing circle.

When a circle is outside the enclosing circle (left), it must touch the new enclosing circle (right).

However, we know something about this new circle: it is the only circle outside the enclosing circle, and therefore it must touch the new enclosing circle. And if we know how to find one tangent circle for the enclosing circle, we can find the others recursively!

There is a bit of geometry involved, which I am glossing over, of course. We also need to compute the boundary cases for the recursion: the enclosing circles for one, two, or three tangent circles. (This is the problem of Apollonius.) Geometry also dictates that there cannot be more than three tangent circles (three already determine the enclosing circle), so we know that our recursive approach will eventually terminate.
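The recursive structure is easier to see in a simplified setting. The sketch below applies Welzl's algorithm to points rather than circles, which is a deliberate simplification and not D3's implementation: the boundary cases become a single point, a diameter, and a circumcircle instead of the Apollonius constructions. All function names are illustrative, and collinear boundary triples are not handled.

```javascript
const EPSILON = 1e-9;
const distance = (a, b) => Math.hypot(a.x - b.x, a.y - b.y);
const contains = (circle, p) => distance(circle, p) <= circle.r + EPSILON;

// Boundary cases: the circle determined by 0, 1, 2, or 3 boundary points.
function circleFrom(boundary) {
  if (boundary.length === 0) return null;
  const [a, b, c] = boundary;
  if (boundary.length === 1) return { x: a.x, y: a.y, r: 0 };
  if (boundary.length === 2) {
    return { x: (a.x + b.x) / 2, y: (a.y + b.y) / 2, r: distance(a, b) / 2 };
  }
  // Circumcircle of three points (assumes they are not collinear).
  const d = 2 * (a.x * (b.y - c.y) + b.x * (c.y - a.y) + c.x * (a.y - b.y));
  const a2 = a.x * a.x + a.y * a.y;
  const b2 = b.x * b.x + b.y * b.y;
  const c2 = c.x * c.x + c.y * c.y;
  const x = (a2 * (b.y - c.y) + b2 * (c.y - a.y) + c2 * (a.y - b.y)) / d;
  const y = (a2 * (c.x - b.x) + b2 * (a.x - c.x) + c2 * (b.x - a.x)) / d;
  return { x, y, r: distance({ x, y }, a) };
}

// Smallest circle enclosing points[0..n) that has `boundary` on its edge.
function welzl(points, n, boundary) {
  if (n === 0 || boundary.length === 3) return circleFrom(boundary);
  const p = points[n - 1];
  const circle = welzl(points, n - 1, boundary);
  if (circle && contains(circle, p)) return circle; // p already fits
  return welzl(points, n - 1, [...boundary, p]);    // p must lie on the edge
}

// Shuffle first: the expected linear running time relies on random order.
function enclose(points) {
  const shuffled = points.slice();
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  return welzl(shuffled, shuffled.length, []);
}

const corners = [
  { x: 0, y: 0 }, { x: 2, y: 0 }, { x: 2, y: 2 }, { x: 0, y: 2 }, { x: 1, y: 0.5 },
];
const enclosing = enclose(corners);
```

The two return statements in welzl mirror the two cases in the text: either the new point is already inside, or it must lie on the boundary of the expanded circle.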

Here is a more complete view of the recursive algorithm, showing the stack:

On the left is the top level, where the set of tangent circles is empty. The algorithm recurses each time a new circle falls outside the enclosing circle. During recursion, this new circle is added to the set of tangent circles. So, from left to right, there are zero, one, two, and three tangent circles, filled in black.

In addition to explaining how the algorithm works, this animation gives a sense of how much time the algorithm spends at different levels of recursion. Since it processes the circles in random order, the enclosing circle expands rapidly to approximate the final answer. But whenever it recurses, it must re-check all previous circles to make sure that they fit inside the new enclosing circle.

3. Reusability

One way to write less code is to reuse it. The 440,000 or so packages published on npm testify to the popularity of this approach.

But libraries are an example of active reuse: they must be intentionally designed to be reusable. And that is a significant burden. It is hard enough to design an effective general abstraction! (Ask any open-source maintainer.) Implementing one-off code, as is usually the case in D3 examples, is easier because you only need to consider a specific task, not an abstract class of tasks.

I am exploring whether we can have better passive reuse in Observable: by leveraging the structure of reactive documents, we can easily repurpose code even if it was not specifically designed to be reusable.

For starters, notebooks can be treated as de facto libraries. Say that in one notebook I implement a custom color scale:

In another notebook I can import this color scale and use it.

Importing is also useful if you have created many notebooks to explore ideas and want to combine them into one.

More interestingly, Observable lets you rebind variable definitions on import. Here I define a stream of real-time data over a WebSocket. (Again, the details of this code are not critical; for simplicity you can imagine an imaginary library ...)

This dataset has the same shape as the data in our earlier line chart: an array of objects with a time and a value. Can we reuse that chart? Yes! The with clause lets you inject local variables into imported variables, replacing the original definitions.

We can not only inject our dynamic data into a previously static chart, but also adjust the axes if necessary. Here we change the domains of the x and y axes to suit the data. The x-axis now shows the last sixty seconds or so.

With new definitions of x and y added to the with clause, the chart now glides smoothly at 60 FPS, and the y-axis does not jump around distractingly:
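The effect of replacing definitions on import can be imitated in plain JavaScript. This is a hypothetical model, not Observable's import syntax or runtime: a "notebook" is represented as a table of definitions, and the importWith helper, the cell names, and the sample values are all made up for illustration.

```javascript
// A "notebook" modeled as a table of definitions: name -> { inputs, compute }.
const chartNotebook = {
  data: {
    inputs: [],
    compute: () => [{ time: 0, value: 1 }, { time: 1, value: 3 }],
  },
  domain: {
    inputs: ["data"],
    compute: (data) => {
      const values = data.map((d) => d.value);
      return [Math.min(...values), Math.max(...values)];
    },
  },
  summary: {
    inputs: ["data", "domain"],
    compute: (data, domain) =>
      `${data.length} points, y from ${domain[0]} to ${domain[1]}`,
  },
};

// Evaluate `name` from `notebook`, letting `overrides` replace definitions.
function importWith(notebook, name, overrides = {}) {
  const definitions = { ...notebook, ...overrides };
  const value = (key) => {
    const cell = definitions[key];
    if (!cell) throw new Error(`${key} is not defined`);
    return cell.compute(...cell.inputs.map((input) => value(input)));
  };
  return value(name);
}

const original = importWith(chartNotebook, "summary");

// Inject different data; domain and summary are recomputed from it.
const streamed = importWith(chartNotebook, "summary", {
  data: {
    inputs: [],
    compute: () => [
      { time: 10, value: 5 }, { time: 11, value: 8 }, { time: 12, value: 2 },
    ],
  },
});
```

Only data is replaced, yet everything derived from it follows, even though the notebook was never written with substitution in mind.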

4. Portability

Observable notebooks run in a browser, not in a desktop application or in the cloud. There is a server for saving your scripts, but all computation and rendering happen locally in the client. What does it mean to have a research environment on the web?

A web-based environment embraces web standards, including vanilla JavaScript and the DOM. It works with open-source code, whether snippets you find on the web or libraries published on npm. This minimizes the specialized knowledge needed to be productive in a new environment.

There is new syntax in Observable for reactivity, but I have tried to keep it as small and familiar as possible, for example by using generators. These are the four forms of variable definition:

The standard library is also minimal and not platform-specific. You should be able to bring your existing code and knowledge into Observable and, conversely, take your code and knowledge out of Observable.

A web-first environment lets your code run everywhere, because it runs in a browser. There is nothing to install. It becomes easier for others to reproduce and verify your analysis. In addition, your code for exploration can transition gracefully into code for explanation. You do not need to start from scratch when you want to share your ideas.

It is great that journalists and scientists share data and code. But code on GitHub is not always easy to run: you need to reproduce the required environment, operating system, applications, packages, and so on. If your code already runs in a browser, then it runs in any other browser. That is the beauty of the web!

Creating more portable code for analysis can also affect how we communicate. I will quote Victor again:

An active reader asks questions, considers alternatives, questions assumptions, and even questions the trustworthiness of the author. An active reader tries to generalize specific examples, and devise specific examples for generalities. An active reader doesn't passively sponge up information, but uses the author's argument as a springboard for critical thought and deep understanding.

PS

If you would like to help me develop Observable, that's great! Please get in touch. You can find my email address on my GitHub profile, or contact me on Twitter.

beta.observablehq.com

Thanks for reading!
