= CAST
http://cast.rubyforge.org
== Description
CAST parses C code into an abstract syntax tree (AST), lets you break it, then vomit it out as code. The parser does C99.
This fork supports Ruby 1.9.3, gemspec, and requires Hoe. The Rubyforge page above is documentation for the original version, but most things should be the same.
== Installation
gem install csquare-cast
== Library Overview
Everything is in the module C.
- There's the parser (Parser).
- There's the tree (Node and its subclasses).
- That's it.
=== Usage
You call Parser#parse, and it gives you a tree of Node objects. Watch:
require 'cast/cast'
# create a parser
parser = C::Parser.new
# (optional) set some settings...
parser.pos.filename = "toy.c" # used for error messages
parser.type_names << 'LinkedList' # treat these words as types
# gimme a tree
ugly_c_code = open("toy.c"){|f| f.read}
tree = parser.parse(ugly_c_code)
# what's the tree look like?
p tree
If there's a parse error, #parse raises a ParseError (which has a nice error message in #message).
== The Parser
Here's a quiz: what does "a * b;" do?
I bet you said "why you l4m3r n00b, that's a statement that multiplies a by b and throws away the answer -- now go take your meaningless snippetage to your computing 101 class and let me finish hurting this JavaTM programmer." Well, you'd be both mean and wrong. It was, of course, a trick question. I didn't say if any of a and b are types! If only a is a type, it's actually a declaration. And if b is a type, it's a syntax error.
So, the parser's gonna need to know which identifiers are type names. This is one of the bits of state that a Parser keeps. Here's the complete list (um, of two):
- #type_names -- a Set of Strings.
- #pos -- the Node::Pos this parser will start parsing at.
A Node::Pos has three read-write atts: #filename, #line_num, #col_num. Default is nil, 1, 0.
== The Nodes
There are a couple of Node classes:
The bold ones are abstract.
The last 2 (the NodeLists) represent lists of nodes. Methodwise, they try to behave like normal ruby ::Arrays. Implementationwise, a NodeChain is a doubly linked list, whereas a NodeArray is an array. NodeChains may be more efficient when adding things at the beginning of a LARGE list.
=== Attributes
Each Node object has:
- #parent -- return the parent in the tree (a Node or nil).
- #pos, #pos= -- the position in the source file (a Node::Pos).
- #to_s -- return the code for the tree (a String).
- #inspect -- return a pretty string for inspection. Try it.
- #match?(str), #=~(str) -- return true iff str parses as a Node equal to this one.
- #detach -- remove this node from the tree (parent becomes nil) and return it.
- #detached?, #attached? -- return true if parent is nil or non-nil respectively.
- #replace_with(node) -- replace this node with node in the tree.
- #swap_with(node) -- exchange this node with node in their trees.
- #insert_prev(*nodes), #insert_next(*nodes) -- insert nodes before this node in the parent list. Parent must be a NodeList! Useful for adding statements before a node in a block, for example.
- #Foo? -- (where Foo is a module name) return self.is_a?(Foo).
The #Foo? method is a convienience for a common need. Example:
## make a tree
ast = C::Parser.new.parse(code_string)
## print all global variables
ast.entities.each do |node|
node.Declaration? or next
node.declarators.each do |decl|
unless decl.type.Function?
puts "#{decl.name}: #{decl.type}"
end
end
end
If you're a duck-typing purist, then sorry for the cardiac arrest
you're now experiencing. CAST does pay attention to the
class of Node objects for quite a few things. This is the
cleanest way to distinguish, e.g., an Add from a
Subtract (which both have the same methods but represent
very different things). It just seems impractical (and unnecessary)
to allow duck typing in this situation.
The #=~ method lets you do:
if declarator.type =~ 'const int *'
puts "Ooh, a const int pointer!"
end
This is not the same as
declarator.type.to_s == 'const int *'
That'd require you to guess how to_s formats its strings (most notably, the whitespace).
=== Fields and children
Each concrete Node class has a member for each bit of
important C stuff it pertains to. I know you peeked at the big list
below, so you know the kind of thing I mean.
But these aren't defined as attrs as you normally do in
Ruby -- they're fields. If a node has a field
foo, it means there's a setter #foo= and getter
#foo. (A field foo? means the setter is
#foo= and the getter is #foo?.) Some fields are
even more special: child fields. These form the
tree structure, and always have a Node or nil
value.
Why divulge these bizarre internal secrets? Because these
Node methods are defined in terms of fields and children:
- #==, #eql? -- Equality is checked recursively. That is, all fields (including children) must be equal.
- #dup, #clone -- Children are copied recursively (other fields and attrs as normal).
Then there's the tree-twiddling methods, which only ever
yield/return/affect (non-nil) children.
- #next, #prev -- return the next/prev sibling.
- #list_next, #list_prev -- like #next/#prev, but also requires the parent to be NodeList. I'll be honest; I don't remember why I added these methods. They may well suddenly disappear.
- #each, #reverse_each -- Yield all (non-nil) children. Node includes Enumerable, so you get the free chocolates too.
- #depth_first, #reverse_depth_first -- Walk the tree in that order, yielding two args (event, node) at each node. event is :down on the way down, :up on the way up. If the block throws :prune, it won't descend any further.
- #preorder, #reverse_preorder, #postorder, #reverse_postorder -- Walk the tree depth first, yielding nodes in the given order. For the preorders, if the block throws :prune, it won't descend any further.
- #node_after(child), #node_before(child) -- return the node before/after child (same as child.next).
- #remove_node(child) -- remove child from this node (same as child.detach).
- #replace_node(child, new_child) -- replace child with yeah you guessed it (same as child.replace_with(newchild)).
If you're walking the tree looking for nodes to move around, don't
forget that modifying the tree during traversal is a criminal
offence.
And now, the episode you've been waiting for: THE FIELD LIST! (Cue
music and dim lights.)
Notes about the table:
- If no default is listed, it is false if the field name
ends in a '?', nil otherwise.
- nil is always allowed for a child field.
- There are some conventions we follow religiously to help you:
- Field names that end in '?' are always true-or-false.
- NodeList-valued fields always default to an
empty NodeArray or NodeChain, so you can
tack things on there with <<, without worrying
about needing to create the list first.
- A field is Node-valued if and only if it is
a child field.
- The rows go yellow, green, yellow, green, ... .
|
Class | Field | Child? | Type or possible values | Default | Comments |
TranslationUnit | entities | Y | NodeList | NodeChain[] |
He lives at the root of a parsed file.
|
Declaration | storage | | :typedef, :extern, :static, :auto, :register | |
There are also methods to query the storage more humanely:
- #typedef? -- true iff storage == :typedef
- #extern? -- true iff storage == :extern
- #static? -- true iff storage == :static
- #auto? -- true iff storage == :auto
- #register? -- true iff storage == :register
|
type | Y | DirectType | |
declarators | Y | NodeList | NodeArray[] |
inline? | | true, false | |
Declarator | indirect_type | Y | IndirectType | |
What on earth is a "declarator", you ask? Consider "int i,
*ip;". This is a Declaration with two
Declarators:
Declaration
type: Int
declarators:
- Declarator
name: "i"
- Declarator
indirect_type: Pointer
name: "ip"
The indirect_type of the ip
Declarator is a Pointer to nil.
"'Pointer to nil' my foot -- I want the type of the stupid
variable!" Here:
-
#type -- return the type, the whole type, and
nothing but the type. This is a clone; modifying it won't
modify the tree.
So calling #type on the ip Declarator
gives:
Pointer
type: Int
|
name | | String | |
init | Y | Expression | |
num_bits | Y | Integer | |
FunctionDef | storage | | :extern, :static | |
Just like Declaration, there's also:
- #extern? -- return true iff storage == :extern
- #static? -- return true iff storage == :static
There's also a pseudo-field:
- #prototype? -- same as !no_prototype?
- #prototype=(val) -- same as no_prototype = !val
no_prototype? means that no prototype was given. That means parameter types weren't given in the parens, but in the "old-style" declaration list. Example:
int main(argc, argv)
int argc;
char **argv;
{
return 0;
}
|
int main(int argc, char **argv) {
return 0;
}
|
No prototype.
|
Prototype.
|
Everyone tells you to use prototypes. That's because no type
checking is done when calling a function declared without a
prototype.
|
inline? | | true, false | |
type | Y | Type | |
name | | String | |
def | Y | Block | Block.new |
no_prototype? | | true, false | |
Parameter | register? | | true, false | |
Used in Functions.
|
type | Y | Type | |
name | | String | |
Enumerator | name | | String | |
Used in Enums.
|
val | Y | Expression | |
MemberInit | member | Y | NodeList of Member-or-Expression | |
Used in CompoundLiterals.
|
init | Y | Expression | |
Member | name | | String | |
Used in MemberInits.
|
Block | labels | Y | NodeList of Label | NodeArray[] |
|
stmts | Y | NodeList of Statement | NodeArray[] |
If | labels | Y | NodeList of Label | NodeArray[] |
|
cond | Y | Expression | |
then | Y | Statement | |
else | Y | Statement | |
Switch | labels | Y | NodeList of Label | NodeArray[] |
|
cond | Y | Expression | |
stmt | Y | Statement | |
While | labels | Y | NodeList of Label | NodeArray[] |
do? means it's a do-while loop.
|
do? | | true, false | |
cond | Y | Expression | |
stmt | Y | Statement | |
For | labels | Y | NodeList of Label | NodeArray[] |
|
init | Y | Expression or Declaration | |
cond | Y | Expression | |
iter | Y | Expression | |
stmt | Y | Statement | |
Goto | labels | Y | NodeList of Label | NodeArray[] |
|
target | | String | |
Continue | labels | Y | NodeList of Label | NodeArray[] |
|
Break | labels | Y | NodeList of Label | NodeArray[] |
|
Return | labels | Y | NodeList of Label | NodeArray[] |
|
expr | Y | Expression | |
ExpressionStatement | labels | Y | NodeList of Label | NodeArray[] |
|
expr | Y | Expression | |
PlainLabel | name | | String | |
|
Default | | | | |
|
Case | expr | Y | Expression | |
|
Comma | exprs | Y | NodeList of Expression | |
|
Conditional | cond | Y | Expression | |
|
then | Y | Expression | |
else | Y | Expression | |
Variable | name | | String | |
|
Index | expr | Y | Expression | |
|
index | Y | Expression | |
Call | expr | Y | Expression | |
|
args | Y | NodeList of Expression-or-Type | |
Dot | expr | Y | Expression | |
|
member | Y | String | |
Arrow | expr | Y | Expression | |
|
member | Y | String | |
PostInc | expr | Y | Expression | |
|
PostDec | expr | Y | Expression | |
|
Cast | type | Y | Type | |
|
expr | Y | Expression | |
Address | expr | Y | Expression | |
|
Dereference | expr | Y | Expression | |
|
Sizeof | expr | Y | Type or Expression | |
|
Positive | expr | Y | Expression | |
|
Negative | expr | Y | Expression | |
|
PreInc | expr | Y | Expression | |
|
PreDec | expr | Y | Expression | |
|
BitNot | expr | Y | Expression | |
|
Not | expr | Y | Expression | |
|
Add | expr1 | Y | Expression | |
|
expr2 | Y | Expression | |
Subtract | expr1 | Y | Expression | |
|
expr2 | Y | Expression | |
Multiply | expr1 | Y | Expression | |
|
expr2 | Y | Expression | |
Divide | expr1 | Y | Expression | |
|
expr2 | Y | Expression | |
Mod | expr1 | Y | Expression | |
|
expr2 | Y | Expression | |
Equal | expr1 | Y | Expression | |
|
expr2 | Y | Expression | |
NotEqual | expr1 | Y | Expression | |
|
expr2 | Y | Expression | |
Less | expr1 | Y | Expression | |
|
expr2 | Y | Expression | |
More | expr1 | Y | Expression | |
|
expr2 | Y | Expression | |
LessOrEqual | expr1 | Y | Expression | |
|
expr2 | Y | Expression | |
MoreOrEqual | expr1 | Y | Expression | |
|
expr2 | Y | Expression | |
BitAnd | expr1 | Y | Expression | |
|
expr2 | Y | Expression | |
BitOr | expr1 | Y | Expression | |
|
expr2 | Y | Expression | |
BitXor | expr1 | Y | Expression | |
|
expr2 | Y | Expression | |
ShiftLeft | expr1 | Y | Expression | |
|
expr2 | Y | Expression | |
ShiftRight | expr1 | Y | Expression | |
|
expr2 | Y | Expression | |
And | expr1 | Y | Expression | |
|
expr2 | Y | Expression | |
Or | expr1 | Y | Expression | |
|
expr2 | Y | Expression | |
Assign | lval | Y | Expression | |
|
rval | Y | Expression | |
MultiplyAssign | lval | Y | Expression | |
|
rval | Y | Expression | |
DivideAssign | lval | Y | Expression | |
|
rval | Y | Expression | |
ModAssign | lval | Y | Expression | |
|
rval | Y | Expression | |
AddAssign | lval | Y | Expression | |
|
rval | Y | Expression | |
SubtractAssign | lval | Y | Expression | |
|
rval | Y | Expression | |
ShiftLeftAssign | lval | Y | Expression | |
|
rval | Y | Expression | |
ShiftRightAssign | lval | Y | Expression | |
|
rval | Y | Expression | |
BitAndAssign | lval | Y | Expression | |
|
rval | Y | Expression | |
BitXorAssign | lval | Y | Expression | |
|
rval | Y | Expression | |
BitOrAssign | lval | Y | Expression | |
|
rval | Y | Expression | |
StringLiteral | val | | String | |
The String in val is the literal string entered. "\n"
isn't converted to a newline, for instance.
|
CharLiteral | val | | String | |
The String in val is the literal string entered. '\n'
isn't converted to a newline, for instance.
|
CompoundLiteral | type | Y | Type | |
Here's an example. (struct S){1, .x = 2, .y [3] .z =
4} parses as:
CompoundLiteral
type: Struct
name: "S"
member_inits:
- MemberInit
init: IntLiteral
val: 1
- MemberInit
member:
- Member
name: "x"
init: IntLiteral
val: 2
- MemberInit
member:
- Member
name: "y"
- IntLiteral
val: 3
- Member
name: "z"
init: IntLiteral
val: 4
"That's legal syntax!?" Yep. Look it up.
|
member_inits | Y | NodeList of MemberInit | NodeArray[] |
IntLiteral | val | | Integer | |
Also:
- #dec? -- return true iff format == :dec
- #hex? -- return true iff format == :hex
- #oct? -- return true iff format == :oct
|
format | | :dec, :hex, :oct | :dec |
FloatLiteral | val | | Float | |
|
Pointer | const? | | true, false | |
|
restrict? | | true, false | |
volatile? | | true, false | |
type | Y | Type | |
Array | const? | | true, false | |
|
restrict? | | true, false | |
volatile? | | true, false | |
type | Y | Type | |
length | Y | Expression | |
Function | const? | | true, false | |
|
restrict? | | true, false | |
volatile? | | true, false | |
type | Y | Type | |
params | Y | NodeList of Parameter | NodeArray[] |
var_args? | | true, false | |
Struct | const? | | true, false | |
|
restrict? | | true, false | |
volatile? | | true, false | |
name | | String | |
members | Y | NodeList of Member | NodeArray[] |
Union | const? | | true, false | |
|
restrict? | | true, false | |
volatile? | | true, false | |
name | | String | |
members | Y | NodeList of Member | NodeArray[] |
Enum | const? | | true, false | |
|
restrict? | | true, false | |
volatile? | | true, false | |
name | | String | |
members | Y | NodeList of Enumerator | |
CustomType | const? | | true, false | |
This is for typedef'd names.
|
restrict? | | true, false | |
volatile? | | true, false | |
name | | String | |
Void | const? | | true, false | |
const void!? Yes, think about: const void *.
|
restrict? | | true, false | |
volatile? | | true, false | |
Int | const? | | true, false | |
longness sounds silly, so here are some less silly
methods:
- #short? -- return true iff longness == -1
- #plain? -- return true iff longness == 0
- #long? -- return true iff longness == 1
- #long_long? -- return true iff longness == 2
Oh, and look, a pseudo-field:
- #signed? -- same as !unsigned?
- #signed=(val) -- same as unsigned = !val
|
restrict? | | true, false | |
volatile? | | true, false | |
longness | | -1, 0, 1, 2 | 0 |
unsigned? | | true, false | |
Float | const? | | true, false | |
Less silly-sounding longness substitutes:
- #plain? -- return true iff longness == 0
- #double? -- return true iff longness == 1
- #long_double? -- return true iff longness == 2
|
restrict? | | true, false | |
volatile? | | true, false | |
longness | | 0, 1, 2 | 0 |
Char | const? | | true, false | |
Also:
- #signed? -- return true iff signed == true
- #unsigned? -- return true iff signed == false
- #plain? -- return true iff signed == nil
Yes, C99 says that char, signed char, and
unsigned char are 3 distinct types (unlike with
int -- go figure). Like Martian chalk and Venusian
cheese: completely different, but you can fit 'em each in one
byte. Mmm, Martian chalk...
|
restrict? | | true, false | |
volatile? | | true, false | |
signed | | true, false, nil | |
Bool | const? | | true, false | |
This is the rarely seen _Bool type.
|
restrict? | | true, false | |
volatile? | | true, false | |
Complex | const? | | true, false | |
This is the rarely seen _Complex type.
- #plain? -- return true iff longness == 0
- #double? -- return true iff longness == 1
- #long_double? -- return true iff longness == 2
|
restrict? | | true, false | |
volatile? | | true, false | |
longness | | 0, 1, 2 | 0 |
Imaginary | const? | | true, false | |
This is the rarely seen _Imaginary type.
- #plain? -- return true iff longness == 0
- #double? -- return true iff longness == 1
- #long_double? -- return true iff longness == 2
|
restrict? | | true, false | |
volatile? | | true, false | |
longness | | 0, 1, 2 | 0 |
BlockExpression | block | Y | Block | Block.new |
Only if the block_expressions extension is enabled.
See "Extensions" section below.
|
=== Node Construction
Wanna make a Node? Take your pick:
- #new(field1, field2, ...) -- fields are in the order
listed above.
- #new(:field1 => val, :field2 => val, ...) -- field order
doesn't matter this way.
- #new_at(pos, *args) -- set the pos to that
given, and pass args to #new.
They're for losers, though. What you really want to do is make
Nodes by parsing C code. Each class -- even the abstract
classes like Statement -- has a .parse method:
function_def = C::FunctionDef.parse <<EOS
void frobnicate(int karma) {
use_waffle_iron();
}
stmt = C::Statement.parse('while (not_looking) paint_car();')
Need to tell it to treat WaffleIron as a type name? All
those parse methods use C.default_parser:
C.default_parser.type_names << 'WaffleIron'
type = C::Type.parse('WaffleIron')
Alternatively, you could've given parse your own parser:
parser = C::Parser.new
parser.type_names << 'WaffleIron'
type = C::Type.parse('WaffleIron', parser)
In fact, there's also C.parse(str, parser=nil), which is an
alias for C::TranslationUnit.parse(str, parser).
ast = C.parse(STDIN)
Yes, all that talk in the intro about doing parser =
C::Parser.new; parser.parse(...) was actually all a charade to
make you type more. I so own you.
== Extensions
CAST has developed extensions! To the C99 grammar, I mean.
-
Types are allowed as function arguments. This lets you parse macros like va_arg().
-
Blocks in parentheses are allowed as expressions ((a gcc extension)[http://gcc.gnu.org/onlinedocs/gcc-4.2.1/gcc/Statement-Exprs.html#Statement-Exprs]). You need to call
#enable_block_expressions on the parser first. They pop out as BlockExpression nodes.
C.default_parser.enable_block_expressions
node = C.parse 'char *brag(void) { return ({"I'm tricky!";}); }'
node.entities[0].def.stmts.last.expr.class # => C::BlockExpression
== Open Issues
- Is it okay to bastardize the =~ operator like that?
- Should binary operators have a list of expressions rather than
just 2? That'd allow to_s to format the strings better in some
cases and make it consistent with Comma.
- At the moment CAST chokes on preprocessor #-lines. Ruby makes it
trivial to filter these out before passing the string to
Parser#parse, and this makes it obvious when you forget
to run a file through the preprocessor (which you need to do to
get type names at least once). Is this stupid? Ideally we should
probably have a builtin preprocessor and use that.
{Vote now}[mailto:george.ogata@gmail.com]
== To Do
- Stop embarrasing yourself and write the parser in C.
- Make it -wd friendly.
- Fix the "TODO" bits in c.y. These are for rarely used C
constructs, like the declaration int arr[*];.
- Add a comment node. Might make it useful for doc extraction.
Anyone want this? Not all comments will be caught, though.
Comments can appear between any two tokens, and I don't really
want to add comment fields to every node. They'd probably just
appear between toplevel entities (in
TranslationUnit#entities) and between statements (in
Block#stmts).
- Make it rdoc-able.
If any of these affect you greatly, {kick me}[mailto:george.ogata@gmail.com] to make it happen faster.
== Contact
I'm not really sure what people are going to try to use this for. If
there's some functionality you think would make a good addition, or
think I've made a mess of this poor puppy, give me a yell.
You can spam me at george.ogata@gmail.com. It'd help if you prefixed the subject with "[cast] " so I can easily
distinguish CAST spam from fake Rolex spam.
== License
Ruby License