EasyCodeIt Progress: Parsing statements

TheDcoder

Hello everyone,

I am using these blog-like posts to document the current state of progress and what I am working on. Hopefully they will receive regular updates as I make progress. Feel free to post in this thread to offer comments, suggestions, insights, questions, etc.

Currently I am working on parsing statements, having finished parsing expressions.

The issue I am tackling at the moment is dealing with whitespace between tokens and safely traversing the token array.

As ECI is written in C, there are no high-level programming features like objects which can hold a dynamic state, nor safeguards to prevent the program from crashing when reading beyond the end of an array. So I need to come up with some method to safely and conveniently access tokens and skip whitespace when needed.

I have thought about writing a peek function which will automatically raise an error when asked for a token outside the valid range, thanks to the somewhat smart error system in the parser which is implemented via setjmp/longjmp (basically a beefed-up version of "GoTo" which can jump across functions).
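For readers unfamiliar with setjmp, here is a minimal sketch of the general technique. It is not ECI's actual error system: the names `demo_raise`, `demo_parse`, `demo_parse_digit`, `parse_error_point` and `parse_error_msg` are hypothetical and exist only for illustration.

```c
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical names, not taken from the ECI source. */
jmp_buf parse_error_point;
const char *parse_error_msg = NULL;

/* Record a message and jump back to the setjmp call site; never returns. */
void demo_raise(const char *msg) {
	parse_error_msg = msg;
	longjmp(parse_error_point, 1);
}

/* A nested helper can bail out without returning an error code. */
int demo_parse_digit(char c) {
	if (c < '0' || c > '9') demo_raise("expected a digit");
	return c - '0';
}

/* Entry point: returns the digit's value, or -1 if any helper raised. */
int demo_parse(char c) {
	parse_error_msg = NULL;
	if (setjmp(parse_error_point) != 0) {
		/* Execution resumes here when demo_raise calls longjmp. */
		return -1;
	}
	return demo_parse_digit(c);
}
```

The point of the pattern is that only `demo_parse` checks for failure; every helper below it can simply call `demo_raise` and stop worrying about propagating the error upwards.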

Here is the current code I have for parsing statements:

struct Statement statement_get(struct Token *token, struct Token **next) {
	struct Statement statement;
	struct Token *next_token = NULL;
	
	bool function = false, declaration = false; // Initialize both, "function" is read later even when no declarator was found
	if (token->type == TOK_WORD && kwd_is_declarator(token->keyword)) {
		function = token->keyword == KWD_FUNC;
		declaration = true;
	}
	
	if (declaration) {
		statement.type = SMT_DECLARATION;
		statement.declaration = malloc(sizeof *statement.declaration);
		if (statement.declaration == NULL) raise_mem("parsing declaration statement");
		
		statement.declaration->is_function = function;
		if (function) {
			// ...
		} else {
			// Variable Declaration
			statement.declaration->scope = SCO_AUTO;
			statement.declaration->is_static = false;
			statement.declaration->is_constant = false;
			statement.declaration->name = NULL;
			statement.declaration->initializer = NULL;
			
			// Metadata
			do {
				if (!token->info) /* Not a keyword*/ break;
				enum Keyword kwd = *(enum Keyword *)(token->info);
				if (!kwd_is_declarator(kwd)) break;
				switch (kwd) {
					case KWD_GLOBAL:
						statement.declaration->scope = SCO_GLOBAL;
						break;
					case KWD_LOCAL:
						statement.declaration->scope = SCO_LOCAL;
						break;
					case KWD_STATIC:
						statement.declaration->is_static = true;
						break;
					case KWD_CONST:
						statement.declaration->is_constant = true;
						break;
				}
			} while (TOK_WORD == (++token)->type);
			
			// Name
			if (token->type != TOK_VARIABLE) raise_unexpected_token("a variable", token);
			
			statement.declaration->name = malloc(token->data_len + 1);
			if (!statement.declaration->name) raise_mem("storing variable name");
			strncpy(statement.declaration->name, token->data, token->data_len);
			statement.declaration->name[token->data_len] = '\0'; // strncpy does not terminate the string here
			
			// Initializer
			if (token[1].type != TOK_OPERATOR) goto next;
			if (token[1].op_info.sym != OPR_EQU) /* ... */;
			
			// ... parse expression and store it as initializer
			
		}
	} else {
		statement.type = SMT_EXPRESSION;
		statement.expression = malloc(sizeof *statement.expression);
		if (!statement.expression) raise_mem("parsing expression statement");
		size_t token_count = 0;
		while (true) {
			if ((token[token_count].type == TOK_WHITESPACE && token[token_count].newline) || token[token_count].type == TOK_EOF) break;
			++token_count;
		}
		*statement.expression = expression_get(token, token_count);
		next_token = token + token_count + 1;
	}
	
	// Set the next token
	next: *next = next_token ? next_token : token + 1;
	return statement;
}

As shown in the code, the token array is being accessed directly, which is a safety hazard. I have implemented a basic safeguard in the form of dummy padding tokens at both the start and end of the array, but it only protects accesses 1 step beyond the valid range, so another solution is needed.

I'll post updates here on what strategy I am going to use. Thanks for reading my technical rambling 🙂

seadoggie

TheDcoder In C, there are no high-level programming features like objects

😲 I need to learn more about C. I don't understand how the low level world could work without objects

TheDcoder

seadoggie AutoIt also doesn't have objects in the traditional sense, and I think you already know how AutoIt works... with variables and functions 😉

If you want to get started with C, I recommend picking up a good book. My personal recommendation is C Programming: A Modern Approach.

This book is a modern take on C while being technically accurate about the presented information. There are many other books, but they have factual inaccuracies and might miss important stuff. With C it is important that you get the basics right, otherwise you will be left confused by the syntax and operations.

This is the reason why many find "pointers" to be hard. I read the book and I had no trouble understanding them, it's all about understanding the basics 😁

seadoggie

TheDcoder Sure, AutoIt doesn't have objects, but I'd also never attempt to create something as complicated as a parser in AutoIt 😃 I use AutoIt for a lot of Excel/Outlook/browser automation and not too much else. I try to use the right tool for the job. AutoIt seems like a saw: sometimes it works, but sometimes you really need that power tool.

Thanks for the book recommendation! I'll try to cram it in between school, work, anime, and gaming (picked up Albion Online and I might be hooked on MMOs now 😅)

argumentum

seadoggie ...gaming (picked up Albion Online...

I play Team Fortress 2 😀

TheDcoder

seadoggie Sure, AutoIt doesn't have objects, but I'd also never attempt to create something so complicated as a parser in AutoIt 😃

True, but AutoIt is well capable of creating advanced programs. Object-oriented and procedural programming are different paradigms, so different trains of thought are used when programming. By the way, I had the opposite problem with OOP: I couldn't see the point of objects and would just use variables and functions when I began writing JavaScript 😁

Best of luck with C, but don't force yourself to learn it, especially as it seems you already have a lot going on.

argumentum

seadoggie ...gaming (picked up Albion Online...

I play Team Fortress 2

I am not big on multiplayer, but I recently picked up Rocket League when they made it free-to-play and then promptly dropped it. My current gaming jam is Just Cause 4 (single-player mode), it's a fun game 🙂

seadoggie

Back to the topic (lol!)... I was thinking about the issue you mentioned... could you use a function to get the next item in the tokens array and return an invalid value (set an error... how does it work in C?) if there isn't another item?

function token_get(token, index){
    if(token.end < index) then return error
    return token[index]
}

Sorry for the indentation... can't get the code to indent properly, even spaces get stripped 😅
Edit: Ahh, triple ` make code blocks!

TheDcoder

seadoggie could you use a function to get the next item in the tokens array

That's basically what I have in my mind too.

seadoggie return an invalid value (set an error... how does it work in C?)

You can return an invalid value (basically NULL if you are dealing with pointers), but that would mean more error checking, which is what I am trying to avoid. Luckily C has setjmp and longjmp, which can return us to a common point to handle all errors. This is more or less analogous to "exceptions" in OOP languages. This is already implemented, see the raise_error function in parse.c 🙂
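To make the idea concrete, here is a rough sketch of a bounds-checked token accessor that jumps to a handler instead of returning NULL. Everything in it is an assumption made for illustration: `DemoToken`, `DemoCursor`, `demo_peek` and `demo_count_words` are hypothetical names and do not match the real structures or functions in ECI.

```c
#include <setjmp.h>
#include <stddef.h>

/* All names below are hypothetical stand-ins, not ECI's real definitions. */
enum DemoTokenType { DEMO_TOK_WORD, DEMO_TOK_WHITESPACE, DEMO_TOK_EOF };
struct DemoToken { enum DemoTokenType type; };

struct DemoCursor {
	const struct DemoToken *tokens; /* first valid token */
	size_t count;                   /* number of valid tokens */
	size_t pos;                     /* index of the current token, always <= count */
	jmp_buf on_error;               /* where to land on an out-of-range access */
};

/* Return the token `offset` places ahead, or jump to the handler if it is out of range. */
const struct DemoToken *demo_peek(struct DemoCursor *cur, size_t offset) {
	if (offset >= cur->count - cur->pos) longjmp(cur->on_error, 1);
	return &cur->tokens[cur->pos + offset];
}

/* Count word tokens up to EOF; returns -1 if the array ends without an EOF token. */
int demo_count_words(const struct DemoToken *tokens, size_t count) {
	struct DemoCursor cur = { .tokens = tokens, .count = count, .pos = 0 };
	int words = 0;
	if (setjmp(cur.on_error) != 0) return -1; /* reached via longjmp in demo_peek */
	for (;;) {
		const struct DemoToken *tok = demo_peek(&cur, 0);
		if (tok->type == DEMO_TOK_EOF) return words;
		if (tok->type == DEMO_TOK_WORD) ++words;
		++cur.pos;
	}
}
```

The loop in `demo_count_words` never checks the array bounds itself; a missing EOF token ends up in the single error branch at the top instead of reading past the end of the array.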

TheDcoder

For the time being, I have come to the conclusion that the best approach to deal with whitespace is to just remove it from the array, so I modified the token_list_to_array function to do just that:

struct Token *token_list_to_array(struct TokenList *list, bool pad, bool strip_ws) {
	size_t token_count = list->length;
	if (strip_ws) {
		struct TokenListNode *node = list->head;
		for (size_t i = 0; i < list->length; ++i) {
			if (node->token->type == TOK_WHITESPACE && !node->token->newline) --token_count; // "tokens" is not allocated yet, so inspect the list node
			node = node->next;
		}
	}
	
	struct Token *tokens = malloc(sizeof(struct Token) * (token_count + (pad ? 2 : 0)));
	if (!tokens) return NULL;
	if (pad) /* Reserve first element for padding */ ++tokens;
	
	struct TokenListNode *node = list->head;
	
	for (size_t i = 0; i < token_count; ++i) {
		if (node->token->type == TOK_WHITESPACE && !node->token->newline) {
			--i; // No increment in the next iteration
			goto next_node;
		}
		tokens[i] = *node->token;
		next_node: node = node->next;
	}
	
	if (pad) {
		// Apply padding
		tokens[token_count] = (struct Token){
			.type = TOK_EOF,
			.data = list->tail->token->data + list->tail->token->data_len,
			.data_len = 0,
		};
		*--tokens = (struct Token){
			.type = TOK_EOF,
			.data = list->head->token->data,
			.data_len = 0,
		};
	}
	
	return tokens;
}

Hopefully this doesn't come back to bite me in the future 😅

TheDcoder

Hey everyone, apologies for not giving any updates for the entire month of March, I was busy with a work project and I also had to deal with taxes.

I recently started working on ECI again and I am currently working on the statement parser. I found that I was frequently in need of "dynamic arrays" to store an indeterminate amount of information, so to tackle the problem at the root instead of using a one-off solution, I decided to make a new library in C for just that (I didn't like the other libraries I could find).

Here it is: https://github.com/TheDcoder/dynarr

The code is very simple and is basically a wrapper around realloc.
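To give a rough idea of what "a wrapper around realloc" can look like, here is a generic sketch of a growable array of ints. It is not the dynarr API: `DemoIntArray`, `demo_array_push` and `demo_array_free` are hypothetical names, and the doubling growth strategy is an assumption chosen for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable int array; NOT the dynarr API. */
struct DemoIntArray {
	int *items;
	size_t length;   /* number of elements in use */
	size_t capacity; /* number of elements allocated */
};

/* Append a value, growing the buffer when it is full; returns false on failure. */
bool demo_array_push(struct DemoIntArray *arr, int value) {
	if (arr->length == arr->capacity) {
		/* Refuse to grow if doubling would overflow the byte count. */
		if (arr->capacity > SIZE_MAX / (2 * sizeof *arr->items)) return false;
		size_t new_capacity = arr->capacity ? arr->capacity * 2 : 8;
		int *grown = realloc(arr->items, new_capacity * sizeof *arr->items);
		if (!grown) return false; /* the old buffer is still valid */
		arr->items = grown;
		arr->capacity = new_capacity;
	}
	arr->items[arr->length++] = value;
	return true;
}

/* Release the buffer and reset the array to its empty state. */
void demo_array_free(struct DemoIntArray *arr) {
	free(arr->items);
	*arr = (struct DemoIntArray){0};
}
```

A zero-initialized `struct DemoIntArray` is a valid empty array, because realloc with a NULL pointer behaves like malloc.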

Fun Fact: I like pronouncing the name as "Dinner", though I guess one can also call it "Diner" depending on how they pronounce the "y" 🙂

Expect more updates in the coming days!