Lexer
Tokenizer: splits raw input into a token stream for the parser.
Header of the lexer module.
- Author
jguillem
Defines
-
TOK(node)
Convenience accessor: get t_token* from a t_list node.
Enums
-
enum t_token_type
Token types for the shell’s lexer.
These types represent the different kinds of tokens that can be identified in the shell’s input. They include words, operators, and special tokens like EOF and errors.
Values:
-
enumerator TOK_WORD
A word token ex: “ls” or “echo”
-
enumerator TOK_PIPE
A pipe token ex: “|”
-
enumerator TOK_AND
A logical AND token ex: “&&”
-
enumerator TOK_OR
A logical OR token ex: “||”
-
enumerator TOK_SEMICOLON
A semicolon token ex: “;”
-
enumerator TOK_AMPERSAND
An ampersand token ex: “&”
-
enumerator TOK_NEWLINE
A newline token (end of command)
-
enumerator TOK_REDIR_IN
A redirect input token ex: “<”
-
enumerator TOK_REDIR_OUT
A redirect output token ex: “>”
-
enumerator TOK_REDIR_APPEND
A redirect append token ex: “>>”
-
enumerator TOK_HEREDOC
A heredoc token ex: “<<”
-
enumerator TOK_HEREDOC_STRIP
A heredoc token that strips leading tabs ex: “<<-”
-
enumerator TOK_REDIR_DUP_IN
A duplicate redirect input token ex: “<&”
-
enumerator TOK_REDIR_DUP_OUT
A duplicate redirect output token ex: “>&”
-
enumerator TOK_LPAREN
A left parenthesis token ex: “(”
-
enumerator TOK_RPAREN
A right parenthesis token ex: “)”
-
enumerator TOK_EOF
An end-of-file token
-
enumerator TOK_ERROR
An error token (invalid syntax)
Functions
-
t_list *lexer_tokenize(const char *input)
Tokenize an input string into a list of tokens.
This function breaks the input string down into individual tokens based on the shell’s syntax rules. It skips spaces and recognizes the type of each token before adding it to the list. The list is heap-allocated and must be freed by the caller (see lexer_free_tokens).
- Parameters:
input – The input string to tokenize (as read from readline).
- Returns:
A pointer to a t_list whose nodes each hold a t_token* as content, or NULL on error.
-
int lexer_check_quotes(const char *input, char *unclosed_quote)
Check for unclosed quotes.
Sets *unclosed_quote to the unclosed quote character (' or ") or to 0 if the quotes are balanced.
- Parameters:
input – The input string to check.
unclosed_quote – A pointer to a char where the unclosed quote will be stored.
- Returns:
1 if an open quote is found, 0 if balanced.
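The contract above can be illustrated with a minimal sketch. The function name check_quotes_sketch is hypothetical, and the logic is an assumption about how such a check could work (a single pass that tracks the currently open quote); the real lexer_check_quotes may differ, for example in how it treats backslash escapes.

```c
#include <stddef.h>

/* Hypothetical re-implementation for illustration only. */
int check_quotes_sketch(const char *input, char *unclosed_quote)
{
    char open = 0; /* 0 means "not inside a quoted span" */

    for (size_t i = 0; input[i] != '\0'; i++)
    {
        if (open == 0 && (input[i] == '\'' || input[i] == '"'))
            open = input[i]; /* entering a quoted span */
        else if (open != 0 && input[i] == open)
            open = 0;        /* the matching quote closes the span */
    }
    *unclosed_quote = open;  /* quote char still open, or 0 if balanced */
    return (open != 0);
}
```

In this sketch a quote of the other kind inside an open span is treated as ordinary text, so an input such as "it's" (in double quotes) counts as balanced.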
-
void lexer_free_tokens(t_list *tokens)
Free an entire token list (tokens + strings + nodes).
This function frees all memory associated with a list of tokens.
- Parameters:
tokens – The list of tokens to free.
-
t_list *token_new(t_token_type type, char *value, int io_number)
Create a new token wrapped in a t_list node.
Part of the token helpers, which create and free tokens, check for operators, and read operators and words from the input string. They are used internally by the lexer and by the unit tests that verify the lexer implementation.
- Parameters:
type – The type of the token to create.
value – The raw string value of the token (with quotes preserved).
io_number – The file descriptor number for redirection tokens (set to -1 if not applicable).
- Returns:
A new t_list node containing the created token, or NULL on allocation failure.
-
int is_operator(char c)
Check whether a character can begin an operator.
The character must be one of “&|><;{}\n”.
- Parameters:
c – The character to check.
- Returns:
1 if the character can begin an operator, 0 otherwise.
-
int is_operator_start(const char *line)
Check whether a string begins with an operator.
Handles the leading digits of a redirection (the file descriptor), then calls is_operator.
- Parameters:
line – The string to test.
- Returns:
1 if the string begins with an operator, 0 otherwise.
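A minimal sketch of the two checks described above follows. The `_sketch` names are hypothetical, the character set is copied from the description of is_operator, and the rule that digits only count when followed by “<” or “>” is an assumption about what handling the digits means; the real functions may behave differently.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical stand-in for is_operator. */
int is_operator_sketch(char c)
{
    /* strchr also matches the terminating '\0', so guard against it. */
    return (c != '\0' && strchr("&|><;{}\n", c) != NULL);
}

/* Hypothetical stand-in for is_operator_start. */
int is_operator_start_sketch(const char *line)
{
    size_t i = 0;

    /* Skip a possible IO number such as the "2" in "2>err". */
    while (isdigit((unsigned char)line[i]))
        i++;
    /* Assumption: digits only introduce an operator before a redirection. */
    if (i > 0)
        return (line[i] == '<' || line[i] == '>');
    return is_operator_sketch(line[0]);
}
```

Under these assumptions "2>err" starts an operator while "2files" is left to be read as a word.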
-
t_list *read_operator(const char **line)
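Read the operator at the current position and tokenize it.
Selects the appropriate function to build the token.
- Parameters:
line – Address of the string positioned at the operator.
- Returns:
A t_list* node containing the operator token.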
-
t_list *read_word(const char **line)
-
struct t_operator
- #include <lexer.h>
Struct describing a shell operator.
This struct pairs the literal string of an operator with the token type it produces.
Public Members
-
const char *literal
The literal string of the operator (e.g., “|”, “&&”)
-
t_token_type type
The token type corresponding to this operator
-
struct t_token
- #include <lexer.h>
Token data node.
Stored in a t_list* returned by lexer_tokenize (each node->content is a t_token*). No *next field — traversal is via the t_list wrapper.
Public Members
-
t_token_type type
Type of the token (word, operator, etc.)
-
char *value
Raw token string (quotes preserved for expander)
-
int io_number
File descriptor number for redirection tokens (-1 if none)
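Because t_token has no next field, walking a token list goes through the t_list wrapper. The sketch below shows one way to do that. The t_list layout (content, next) and the expansion of TOK(node) are assumptions modeled on a libft-style list, the enum is reduced to two values, and count_tokens_of_type is a hypothetical helper that is not part of the module.

```c
#include <stddef.h>

/* Reduced enum for illustration; the real t_token_type has more values. */
typedef enum e_token_type { TOK_WORD, TOK_PIPE } t_token_type;

typedef struct s_token
{
    t_token_type type;  /* kind of token */
    char *value;        /* raw token string */
    int io_number;      /* -1 if none */
} t_token;

/* Assumed libft-style list node. */
typedef struct s_list
{
    void *content;
    struct s_list *next;
} t_list;

/* Assumed expansion of the TOK(node) accessor. */
#define TOK(node) ((t_token *)(node)->content)

/* Hypothetical helper: count the tokens of a given type in a list. */
size_t count_tokens_of_type(const t_list *tokens, t_token_type type)
{
    size_t n = 0;

    for (const t_list *node = tokens; node != NULL; node = node->next)
        if (TOK(node)->type == type)
            n++;
    return n;
}
```

The loop advances with node->next and reads the token through TOK(node), which keeps the token struct itself free of any list bookkeeping.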
lexer.c — Main Entry Point
Main file of the lexer module.
- Author
jguillem
Functions
-
t_list *lexer_tokenize(const char *input)
Main function of the lexer module: constructs a linked list of tokens from an input string.
It skips spaces and recognizes the type of each token before adding it to the list. The list is heap-allocated and must be freed by the caller.
- Parameters:
input – The input string, as read from readline.
- Returns:
A t_list* whose nodes each hold a t_token* as content.
lexer_words.c — Word Tokens
File that handles the word tokens of the command line.
- Author
jguillem
lexer_operator.c — Operator Tokens
File that handles the operator tokens of the command line.
- Author
jguillem
Functions
-
int is_operator(char c)
Check whether a character can begin an operator.
The character must be one of “&|><;{}\n”.
- Parameters:
c – The character to check.
- Returns:
1 if the character can begin an operator, 0 otherwise.
-
int is_operator_start(const char *line)
Check whether a string begins with an operator.
Handles the leading digits of a redirection (the file descriptor), then calls is_operator.
- Parameters:
line – The string to test.
- Returns:
1 if the string begins with an operator, 0 otherwise.
-
static int extract_io_number(const char **line)
Extract the file descriptor that precedes the operator.
- Parameters:
line – Address of the string positioned at the operator.
- Returns:
The file descriptor number, as an int.
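As a rough illustration of the extraction step, the sketch below parses leading decimal digits and advances the caller's pointer past them. The name extract_io_number_sketch is hypothetical, and returning -1 when no digits are present is an assumption that mirrors the documented io_number convention; the real static function may handle that case differently. Overflow of very long digit runs is not handled here.

```c
#include <ctype.h>

/* Hypothetical stand-in for the static extract_io_number. */
int extract_io_number_sketch(const char **line)
{
    int fd = 0;

    /* Assumption: no leading digit means "no descriptor given". */
    if (!isdigit((unsigned char)**line))
        return -1;
    while (isdigit((unsigned char)**line))
    {
        fd = fd * 10 + (**line - '0'); /* accumulate the decimal value */
        (*line)++;                     /* consume the digit */
    }
    return fd;
}
```

After the call, *line points at the operator itself, which is why the parameter is a pointer to the string pointer.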
-
static t_list *create_operator_token(const char **line, int io_number, const char *literal, t_token_type type)
Create a new t_list token node.
- Parameters:
line – Address of the string to tokenize
io_number – file descriptor for redirection
literal – representation of the operator
type – A t_token_type
- Returns:
A t_list token node
-
t_list *read_operator(const char **line)
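Read the operator at the current position and tokenize it.
Selects the appropriate function to build the token.
- Parameters:
line – Address of the string positioned at the operator.
- Returns:
A t_list* node containing the operator token.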
Variables
-
static const t_operator operators[] = {{";", TOK_SEMICOLON}, {"\n", TOK_NEWLINE}, {"(", TOK_LPAREN}, {")", TOK_RPAREN}, {"||", TOK_OR}, {"|", TOK_PIPE}, {"&&", TOK_AND}, {"&", TOK_AMPERSAND}, {"<<-", TOK_HEREDOC_STRIP}, {"<<", TOK_HEREDOC}, {"<&", TOK_REDIR_DUP_IN}, {"<", TOK_REDIR_IN}, {">>", TOK_REDIR_APPEND}, {">&", TOK_REDIR_DUP_OUT}, {">", TOK_REDIR_OUT}, {NULL, 0}}
Array of t_operator structs.
Maps each operator's literal representation to its token type.
Longer operators must be listed before shorter ones that share a prefix (e.g. && before &).
This allows the strncmp prefix comparison to select the longest match.
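The ordering rule can be seen in a small lookup sketch. The table and the struct fields are copied from the documentation above; the enum order follows the listed enumerators, the struct tag s_operator is taken from the description, and match_operator is a hypothetical helper written for illustration, not the module's read_operator.

```c
#include <stddef.h>
#include <string.h>

typedef enum e_token_type
{
    TOK_WORD, TOK_PIPE, TOK_AND, TOK_OR, TOK_SEMICOLON, TOK_AMPERSAND,
    TOK_NEWLINE, TOK_REDIR_IN, TOK_REDIR_OUT, TOK_REDIR_APPEND,
    TOK_HEREDOC, TOK_HEREDOC_STRIP, TOK_REDIR_DUP_IN, TOK_REDIR_DUP_OUT,
    TOK_LPAREN, TOK_RPAREN, TOK_EOF, TOK_ERROR
} t_token_type;

typedef struct s_operator
{
    const char *literal;
    t_token_type type;
} t_operator;

/* Longer literals come before shorter ones sharing the same prefix. */
static const t_operator operators[] = {
    {";", TOK_SEMICOLON}, {"\n", TOK_NEWLINE}, {"(", TOK_LPAREN},
    {")", TOK_RPAREN}, {"||", TOK_OR}, {"|", TOK_PIPE}, {"&&", TOK_AND},
    {"&", TOK_AMPERSAND}, {"<<-", TOK_HEREDOC_STRIP}, {"<<", TOK_HEREDOC},
    {"<&", TOK_REDIR_DUP_IN}, {"<", TOK_REDIR_IN}, {">>", TOK_REDIR_APPEND},
    {">&", TOK_REDIR_DUP_OUT}, {">", TOK_REDIR_OUT}, {NULL, 0}};

/* Hypothetical helper: return the first table entry that prefixes line. */
const t_operator *match_operator(const char *line)
{
    for (size_t i = 0; operators[i].literal != NULL; i++)
    {
        size_t len = strlen(operators[i].literal);

        /* First hit wins, so table order decides between "&&" and "&". */
        if (strncmp(line, operators[i].literal, len) == 0)
            return &operators[i];
    }
    return NULL; /* line does not start with a known operator */
}
```

If "&" were listed before "&&", the input "&& ls" would match the single ampersand first and the second "&" would be tokenized separately, which is the failure the ordering prevents.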
token.c — Token Helpers
Functions
-
t_list *token_new(t_token_type type, char *value, int io_number)
Create a new token wrapped in a t_list node.
Part of the token helpers, which create and free tokens, check for operators, and read operators and words from the input string. They are used internally by the lexer and by the unit tests that verify the lexer implementation.
- Parameters:
type – The type of the token to create.
value – The raw string value of the token (with quotes preserved).
io_number – The file descriptor number for redirection tokens (set to -1 if not applicable).
- Returns:
A new t_list node containing the created token, or NULL on allocation failure.
-
void token_free(t_token *token)
-
void lexer_free_tokens(t_list *tokens)
Free an entire token list (tokens + strings + nodes).
This function frees all memory associated with a list of tokens.
- Parameters:
tokens – The list of tokens to free.
lexer_display.c — Debug Display
Display and JSON serialization of token lists.
- Author
pulgamecanica