4p4 June 29, 2015 at 04:23

Maximum overload - JavaScript adventures in the C ++ world

Tutorial

How to properly expand the capabilities of a programming language using operator overloading.

The creators and maintainers of programming languages are often asked to add new features to the language. The most common answer that can be heard from them is:

“Why, because what you offer can be made available using the means of the language.”

Operator overloading appeared in C ++ at the request of physicists and mathematicians who wanted to conveniently operate with home-made data types, large numbers, matrices.

Although physicists and mathematicians liked this feature, programmers, including C ++ creators, never really liked operator overloading. Too complicated, a lot of implicit things, so the overload of operators entrenched the opinion of something harmful and used in rare cases.

Today I’ll try to show why it is so difficult and how to use overload correctly by creating one new type called var whose behavior will be as close as possible to a similar type in JavaScript.

That is, we will try to create a class that will be able to contain either a number, or a string, or an array, or an object. A type that can be initialized with language literals. A type that is correctly converted where necessary.

First, declare the class itself:

struct var {
};

(Why a struct and not a class? The only difference is that in the struct all members are public by default. To make the code easier to read, there will be a struct.)

Let's try to put a numeric value and a string in var:

struct var {
	char *str;
	double num;
};

Now we need to write constructors. They are called when you write:

var i = 100;
var s = "hello";
struct var {
	char *str;
	double num;
	var (double initial) {
		num = initial;
	}
	var (char *initial) {
		str = initial;
	}
}

Okay, now for everything to come to life, we need to display the value on the screen:

var i = 100, s = "hello";
log(i);
log(s);

How to achieve this?

void log(var x) {
	....
	а тут то что написать?
}

How do we know which of the two contents is used in this instance of var?

Clearly, you need to add an internal type. But how to do that? It is logical to use enum:

enum varType { varNum, varStr };

Change the class definition:

struct var {
	varType type;
	char *str;
	double num;
	var (double initial); 
	var (char *initial);
};

Now in the designers you need to assign a type:

var::var (double initial) {
	type = varNum;
	num = initial;
}
var::var (char *initial) {
	type = varStr;
	str = initial;
}

Well, now you can return to log ():

void log(var x) {
	if (x.type == varNum) printf("%f\n", x.num);
	if (x.type == varStr) printf("%s\n", x.str);
}

And now we need to block the assignment operator:

void var::operator = (double initial) {
	type = varNum;
	num = initial;
}
void var::operator = (char *initial) {
	type = varStr;
	str = initial;
}

Now you can write:

var a = 10, b = "hello";

Interestingly, the assignment operator turned out to be a complete copy of the constructor. Maybe it's worth reusing? So let's do it. Everywhere in the “assignment constructor”, you can simply call the “assignment operator”.

At the moment, here is our full working code:

#include 
enum varType { varNum, varStr };
struct var {
	varType type;
	char *str;
	double num;
	var (double initial); 
	var (char *initial);
	void operator = (double initial);
	void operator = (char *initial);
};
var::var (double initial) {
	(*this) = initial;
}
var::var (char *initial) {
	(*this) = initial;
}
void var::operator = (double initial) {
	type = varNum;
	num = initial;
}
void var::operator = (char *initial) {
	type = varStr;
	str = initial;
}
void log(var x) {
	if (x.type == varNum) printf("%f\n", x.num);
	if (x.type == varStr) printf("%s\n", x.str);
}
int main() {
	var x = 100, s = "hello";
	log(x);
	log(s);
}

But what if we just write:

int main() {
	var y;
}

We are abused by the compiler! We cannot declare a variable without initializing it. Mess, what's the matter? And the fact that all of our designers require an initial value.

We need an "empty" constructor, it is the default constructor, default constructor. But what will the variable be equal if it is not equal to anything else? It is still unknown whether it will be a number or a string, or something else.

To do this, the concept of "empty value" is introduced, known as null or undefined.

enum varType { varNull, varNum, varStr };
var::var() {
	type = varNull;
}

Now you can simply declare variables without thinking about the type.

var a, b, c;

And to assign values in the code:

a = 1; b = "foo";

But we still cannot write:

a = b;

We need the assignment operator var = var:

void var::operator= (var src) {
	type = src.type;
	num = src.num;
	str = src.str;
}

Upon assignment, the type will change! And "a" will become a string.

Let's try to move on. Temporarily forget that our numbers and lines are unfinished. Let's try to deal with the array.

First we need a new type in enum:

enum varType { varNull, varNum, varStr, varArr };

Now a pointer to the element buffer, and size:

struct var {
	...
	int size;
	var *arr;
	...
}

Now overload the element access operator:

struct var {
	...
	var operator [](int i);
	...
}

Such an operator is called a “subscript operator” or an index operator.

Our goal: to store elements of type var in the array. That is, we are talking about recursion.

By the way, with the same operator, we have to access the individual characters in the string, and the properties of the object. But in the case of an object, there will be a line at the input. After all, the key is a string value:

var operator [](char *key);

No, that’s not good. We need not a pointer to a character buffer, but a string, let's do this:

struct var {
	...
	var operator [](var key);
	...
}

Then, when everything works, we can write:

x[1]

x["foo"]

The compiler will convert to var! Why? After all, we already have constructors from literals of numbers and strings.

It will be possible to write like this:

y = "foo";
x[y];

By the way, literal, (literal) is a "literal value", that is, the value that you typed directly in the code. For example, the assignment “int a = b;” is the assignment by name, and “int a = 123;” is the literal assignment, the assignment is literal, “by literal” 123.

One thing is not clear, how does var become an array? Suppose we create a variable "a", and how to say that it is an array?

var a ???;

JavaScript uses several methods:

var a = new Array;
var a = [];

Let's try both:

var newArray() {
	var R;
	R.type = varArr;
	R.size = 0;
	R.arr = new var [10];
	return R;
}

So far, in order to focus on more essential things, we will pretend that 10 elements are all that we need.

Now an interesting point, try to do something like:

var a = [];

You cannot use [] in C ++, but you can use any identifier, that is, a name. For example Array.

var a = Array;

How to do it? To do this, apply the "syntactic type", like this:

enum varSyntax { Array };

Wherever we mention the word "Array", the compiler will realize that the type "varSyntax" is needed. But the compiler chooses by type the type of function, constructor or operator to use.

struct var {
	...
	var (varSyntax initial) {
		if (initial == Array) {
			type = varArr;
			size = 0;
			arr = new var[10];
		}
	}
	...
}
var a = Array;

Of course, where the constructor is, there is the assignment, we immediately recall, and write the assignment operator of type varSyntax.

void var::operator=(varSyntax initial) {
	...
}

In the following code, first “a” is initialized by the var (varSyntax) constructor, and then “b” is initialized by an empty constructor and assigned by the “var operator = (varSyntax)” operator.

var a = Array, b;
b = Array;

Since the constructor and the assignment through "=" always go in pairs, it is logical to apply the same trick, and reuse the code from the assignment in the constructor.

struct var {
	...
	var (varSyntax initial) {
		(*this) = initial;
	}
	operator= (var Syntax);
	...
};
void var::operator= (varSyntax initial) {
	if (initial == Array) {
		type = varArr;
		size = 0;
		arr = new var*[10];
	}
//	else if (initial == Object) {
//		...
//	}
}

Somewhere, there, we can create empty objects. But it is later.

Well, it's time to try:

int main() {
	var a = Array;
	a[0] = 100;
	log(a[0]);
}

error: conversion from 'int' to 'var' is ambiguous
	a[0] = 100.0;

Wow, that's the thing, we declared operator [] from var. For some reason, the compiler expects an int. If you change var [0] to var [1] then everything will compile. What?

int main() {
	var a = Array;
	a[1] = 100;
	log(a[1]);
}

So, with one, it compiles ...

Only this code will not do anything yet, because we haven't written operator [] yet.

I must write! Probably something like this:

var var::operator [](var key) {
	return arr[key];
}

error: no viable overloaded operator[] for type 'var *'
        return arr[i];
               ~~~^~

Oh compiler, what else is wrong?

It turns out that index access to the pointer requires an int, but the compiler does not yet know how to turn var into an int.

Well, you can define an int operator, there is such a thing in C ++! But it’s better where you can’t create a new operator, don’t create it (long history), so let's do this:

struct var {
	...
	int toInt() {
		return num;
	}
	...
}
var var::operator[] (var i) {
	return arr[i.toInt()];
}

It compiles, but does not display anything after launch, what’s the matter?

But how, in general, can it work? How can one read and write the contents of an element through the same operator?

After all, both lines should work:

a[1] = 100;
log(a[1]);

In one record, in another reading. It turns out that operator = should return a reference to the element. Pay attention to the & symbol, in this case it is:

var& var::operator[] (var i) {
	return arr[i.toInt()];
}

But, although "a [1]" worked, "a [0]" continues to swear. Why all the same?

The fact is that 0 can be considered both a number and a pointer, while var has two constructors, one for a number (double) and one for a pointer (char *). Because of this, it seems to be completely normal code, when using 0 as a literal, it suddenly produces compilation errors. This is one of the most sophisticated tortures of C ++ and the ambiguous call series.

But in general, the compiler first of all considers zero integer, that is, int.

Fortunately, just teaching our var to initialize from int. As usual, we immediately write the constructor and operator =.

var::var (int initial) {
	(*this) = (double) initial;
}
void var::operator = (int initial) {
	(*this) = (double) initial;
}

Here, in order to reuse the code, both calls to operator = (double) are simply redirected.

So, what happened at the moment:

#include 
enum varType { varNull, varNum, varStr, varArr };
enum varSyntax { Array };
struct var {
	varType type;
	char *str;
	double num;
	var ();
	var (double initial); 
	var (int initial); 
	var (char *initial);
	void operator = (double initial);
	void operator = (int initial);
	void operator = (char *initial);
	var *arr;
	int size;
	var &operator [](var i);
	var (varSyntax initial) {
		(*this) = initial;
	}
	void operator= (varSyntax initial);
	void operator= (var src) {
		type = src.type;
		num = src.num;
		str = src.str;
		arr = src.arr;
	}
	int toInt() {
		return num;
	}
};
var::var() {
	type = varNull;	
}
var::var (double initial) {
	(*this) = initial;
}
var::var (int initial) {
	(*this) = (double)initial;
}
var::var (char *initial) {
	(*this) = initial;
}
void var::operator = (double initial) {
	type = varNum;
	num = initial;
}
void var::operator = (int initial) {
	(*this) = (double) initial;
}
void var::operator = (char *initial) {
	type = varStr;
	str = initial;
}
void log(var x) {
	if (x.type == varNum) printf("%f\n", x.num);
	if (x.type == varStr) printf("%s\n", x.str);
}
void var::operator= (varSyntax initial) {
	if (initial == Array) {
		type = varArr;
		size = 0;
		arr = new var[10];
	}
}
var &var::operator[] (var i) {
	return arr[i.toInt()];
}
int main() {
	var x = 100, s = "hello";
	var a = Array;
	a[0] = 200;
	log(a[0]);
	log(x);
	log(s);
}

By the way, what if we want to display an array on the screen?

void log(var x) {
	if (x.type == varNum) printf("%f\n", x.num);
	if (x.type == varStr) printf("%s\n", x.str);
	if (x.type == varArr) printf("[Array]\n");
}

So far, only so.

But I want more.

First, you need to make a self-tuning array length:

var &var::operator[] (var i) {
	int pos = i.toInt();
	if (pos >= size) size = pos+1;
	return arr[pos];
}

And you need to do push () - adding one element to the end:

var var::push(var item) {
	if (type != varArr) {
		var nil;
		return nil;
	}
	(*this)[size] = item;
	size++;
	return item;
}

Since we are working with a pointer, it is not out of place to check the type. In the process of preparing this article, this is precisely where the program fell. Well, we do not check the size yet, we are engaged in global design, but we will return to this issue.

Now you can rewrite the log () function so that it displays the entire array:

void log(var x) {
	if (x.type == varNum) printf("%f ", x.num);
	if (x.type == varStr) printf("%s ", x.str);
	if (x.type == varArr) {
		printf("[");
		for (int i = 0; i < x.size; i++) log(x[i]);
		printf("]");
	}
}

What a minimum of work was needed, what life-giving recursion does!

int main() {
	var a = Array;
	a[0]=100;
	a.push(200);
	log(a[0]);
	log(a[1]);
	log(a);
}

Data output after startup:

100.000000 200.000000 [100.000000 200.000000]

Well, wonderful, we have some basic polymorphism.

You can even already place an array in an array, and mixed up with strings and numbers.

int main() {
	var a = Array;
	a.push(100);
	a.push("foo");
	a[2] = Array;
	a[2][0] = 200;
	a[2][1] = "bar";
	log(a);
}

[100.000000 foo [200.000000 bar ]]

I wonder what will happen if we try to write like this:

var a = Array;
var b = a.push(Array);
b.push(200);
b.push("foo");
log(a);

And here is what:

[[]]

Why did this happen?

Check in this simple way:

printf("%\n", a.arr[0].size);
printf("%\n", b.size);

Logically, we should see the same number: 2.

But actually a.arr [0] .size == 0!

The thing is that a [0] and b are two DIFFERENT variables, two different instances. At the moment when a.push () function was assigned via return, their fields coincided, that is, size, arr were identical, but after b.push () b.size increased, and a [0] did not increase. size.

This is a brain problem, which is even difficult to describe in words, and perhaps the reader is completely confused while reading the last lines, called “pass by reference”.

In C ++, usually, passing by reference is called when & is before the argument, but this is a special case. Generally, this means that changing the copy changes the original.

Let's see how to solve such a problem. At first, everything that was connected with the array, put into a separate class, so historically happened that I called it lst. Especially do not go into his device, so, grasp the general essence:

class lst {
	typedef var** P;
	P p;
	int capacity, size;
	void zeroInit();
public:
	lst();
	~lst();
	int length();
	void resize(int newsize);
	var pop();
	void push(const var &a);
	var& operator [](int i);
	void delIns(int pos, int delCount, var *item, int insCount);
};

Let me explain that this is a small class for storing a list of pointers with the ability to dynamically resize, and additional commands push/pop/delIns.

This is all we need to ensure that our arrays closely match JavaScript Array.

Now let's forget how “var” was arranged before, and try to enter “lst” in it correctly:

struct Ref {
	int uses;
	void *data;
	Ref () {
		uses = 1;
	}
};
struct var {
	varType type;
	union {
		double num;
		Ref* ref;
	};
...
};

Firstly, we combined num and ref, because all the same, at the same time, we do not need these properties. We save memory.

Secondly, instead of the direct value of everything connected with the array, we will have a link with a counter inside. This is called reference counting.

In the same link, we will then store Object.

Note that the counter is immediately set to 1.

Whenever the link count is programmed, two main methods are written immediately, the “connector” and the “disconnector”.

The first is done "ref = src.ref, ref-> uses ++", usually it is called copy, link, attach, or, in fact, reference.

void var::copy(const var &a) {
	// к этому месту экземпляр должен быть пуст.
	type = a.type;
	if (type == varNum || type == varBool) num = a.num;
	else {
		if (a.type == varNull) { return; }
		ref = a.ref;
		if (ref) ref->uses++;
	}
}

Secondly, the reverse process occurs, the usage counter decreases and if it becomes zero, then the original memory is freed.

Usually it is called unlink, unreference, detach. I used to call it unref ().

void var::unref() {
	if (type == varNum || type == varNull || type == varBool) return;
	else if (type == varStr) {
		ref->uses--;
		if (ref->uses == 0) {
			delete (chr*)ref->data, delete ref;
		}
	}
	else if (type == varArr) {
		ref->uses--;
		if (ref->uses == 0) {
			deleteLst();
		}
	}
	else if (type == varObj) {
		ref->uses--;
		if (ref->uses == 0) {
			deleteObj();
		}
	}
	type = varNull;
	ref = 0;
}

data in the Ref structure is of type void *, that is, just a pointer, and will store a link to the actual instance of the array (lst) or object (obj). At the word object, we are talking about the object in which we will store key / value pairs in accordance with JavaScript [Object object].

Basically, reference counting is a type of garbage collector.

Usually, the words “garbage collector” (GC) mean an interval collector that runs on a timer, but technically, link counting is the simplest garbage collector, even according to Wikipedia’s classification.

And, as you can see, it is not so simple, the brain can be broken at a time.

Just so that the reader does not get confused, I’ll repeat all over again:

We make the var class and in it we encapsulate either double, or lst (for an array), or chr (for strings), or keyval (for objects).

Here is our string class:

struct chr {
	int size;
	wchar_t *s;
	chr ();
	~chr();
	void set(double i);
	void set(wchar_t *a, int length = -1);
	void setUtf(char *a, int length = -1);
	void setAscii(char *a, int length = -1);
	char * getAscii();
	char * getUtf();
	wchar_t operator [](int i);
	double toNumber ();
	int intToStr(int i, char *s);
	void dblToStr (double d, char *s);
	int cmp (const chr &other);
	int find (int start, wchar_t *c, int subsize);
	chr substr(int pos, int count = -1);
	int _strcount(const chr &substring);
	void _cpto(int from, const chr &dest, int to, int count);
	chr clone();
	void replace(chr &A, chr &B, chr &dest);
};

And here is the class for the objects:

struct keyval {
	var keys, vals;
	keyval ();
	void set(var key, var val);
	var &get(var key);
};

There is already complete recursion and polymorphism, look, keyval uses arrays in the form of var. To become part of var. And it works!

One of the most important features of using reference counting is that if you want to change the object, you must understand that everyone who refers to it will also receive the changed object.

For instance:

void f(var t) {
	t += "world";
	log(t);
}
var s = "hello";
f(s);
log(s);

Conclusion:

world
world

When passing s to f () instead of copying all the characters in the string, only one pointer is copied and one counter is incremented.

But after changing the string t, the string s will also change. What we need in the case of arrays, but not in the case of strings! This is called pass-by-reference.

When we need the variable passed through reference counting to be changed separately from its source, we must call the detach / unref / unlink function before each change.

This is how strings work in Delphi, for example. This is called the term copy-on-write.

It is believed that this is a bad decision. But how to refuse copy-on-write, but keep pass-by-reference and copy-pointer-and-increment possible (reference counting)?

The answer has become the standard in modern programming: instead of changing a variable, make it immutable! This is called immutability.

According to the principle of immutability, lines in JavaScript are set only once, and after that they cannot be changed. All functions of working with strings that change something, return new lines. This greatly facilitates the hard work of carefully arranging all copy / unref, pointer checks, and other work with memory.

Here, all of a sudden, I am forced to interrupt, because the article exceeded 20K characters that are comfortable for the reader. But still need to reload about 20 operators! Even operator, (comma). Combine objects and arrays, write JSON.parse, implement comparisons of strings and Boolean values, write a constructor for Boolean, come up with and implement a notation to initialize the values of arrays and objects, solve the problem of multi-argument log (...), figure out what to do with undefined, typeof correctly implement replace / slice etc. And all this without a single template, only an overload of operators and functions.

So, if you are interested, we will continue soon.

For the most curious, a link to the library repository:

github.com/exebook/jslike

Tags:

Maximum overload - JavaScript adventures in the C ++ world

Also popular now: