Huge News!Announcing our $40M Series B led by Abstract Ventures.Learn More →

csquare-cast

Package Overview

Dependencies

Advanced tools

Install Socket

Detect and block malicious and high-risk dependencies

Install

csquare-cast

0.2.2
Rubygems

Version published: 13 years ago

Maintainers: 1

Created: 13 years ago

Source

= CAST

http://cast.rubyforge.org

== Description

CAST parses C code into an abstract syntax tree (AST), lets you break it, then vomit it out as code. The parser does C99.

This fork supports Ruby 1.9.3, gemspec, and requires Hoe. The Rubyforge page above is documentation for the original version, but most things should be the same.

== Installation

gem install csquare-cast

== Library Overview

Everything is in the module C.

There's the parser (Parser).
There's the tree (Node and its subclasses).
That's it.

=== Usage

You call Parser#parse, and it gives you a tree of Node objects. Watch:

require 'cast/cast'

# create a parser
parser = C::Parser.new

# (optional) set some settings...
parser.pos.filename = "toy.c"      # used for error messages
parser.type_names << 'LinkedList'  # treat these words as types

# gimme a tree
ugly_c_code = open("toy.c"){|f| f.read}
tree = parser.parse(ugly_c_code)

# what's the tree look like?
p tree

If there's a parse error, #parse raises a ParseError (which has a nice error message in #message).

== The Parser

Here's a quiz: what does "a * b;" do?

I bet you said "why you l4m3r n00b, that's a statement that multiplies a by b and throws away the answer -- now go take your meaningless snippetage to your computing 101 class and let me finish hurting this Java^TM programmer." Well, you'd be both mean and wrong. It was, of course, a trick question. I didn't say if any of a and b are types! If only a is a type, it's actually a declaration. And if b is a type, it's a syntax error.

So, the parser's gonna need to know which identifiers are type names. This is one of the bits of state that a Parser keeps. Here's the complete list (um, of two):

#type_names -- a Set of Strings.
#pos -- the Node::Pos this parser will start parsing at.

A Node::Pos has three read-write atts: #filename, #line_num, #col_num. Default is nil, 1, 0.

== The Nodes

There are a couple of Node classes:

Node
- TranslationUnit
- Declaration
- Declarator
- FunctionDef
- Parameter
- Enumerator
- MemberInit
- Member
- Statement
  - Block
  - If
  - Switch
  - While
  - For
  - Goto
  - Continue
  - Break
  - Return
  - ExpressionStatement
- Label
  - PlainLabel
  - Default
  - Case

Node
- Expression
  - Comma
  - Conditional
  - Variable
  - UnaryExpression
    - PostfixExpression
      - Index
      - Call
      - Dot
      - Arrow
      - PostInc
      - PostDec
    - PrefixExpression
      - Cast
      - Address
      - Dereference
      - Sizeof
      - Plus
      - Minus
      - PreInc
      - PreDec
      - BitNot
      - Not

Node
- Expression
  - BinaryExpression
    - Add
    - Subtract
    - Multiply
    - Divide
    - Mod
    - Equal
    - NotEqual
    - Less
    - More
    - LessOrEqual
    - MoreOrEqual
    - BitAnd
    - BitOr
    - BitXor
    - ShiftLeft
    - ShiftRight
    - And
    - Or

Node
- Expression
  - AssignmentExpression
    - Assign
    - MultiplyAssign
    - DivideAssign
    - ModAssign
    - AddAssign
    - SubtractAssign
    - ShiftLeftAssign
    - ShiftRightAssign
    - BitAndAssign
    - BitXorAssign
    - BitOrAssign
  - Literal
    - StringLiteral
    - CharLiteral
    - CompoundLiteral
    - IntLiteral
    - FloatLiteral

Node
- Type
  - IndirectType
    - Pointer
    - Array
    - Function
  - DirectType
    - Struct
    - Union
    - Enum
    - CustomType
    - PrimitiveType
      - Void
      - Int
      - Float
      - Char
      - Bool
      - Complex
      - Imaginary
- NodeList
  - NodeArray
  - NodeChain

The bold ones are abstract.

The last 2 (the NodeLists) represent lists of nodes. Methodwise, they try to behave like normal ruby ::Arrays. Implementationwise, a NodeChain is a doubly linked list, whereas a NodeArray is an array. NodeChains may be more efficient when adding things at the beginning of a LARGE list.

=== Attributes

Each Node object has:

#parent -- return the parent in the tree (a Node or nil).
#pos, #pos= -- the position in the source file (a Node::Pos).
#to_s -- return the code for the tree (a String).
#inspect -- return a pretty string for inspection. Try it.
#match?(str), #=~(str) -- return true iff str parses as a Node equal to this one.
#detach -- remove this node from the tree (parent becomes nil) and return it.
#detached?, #attached? -- return true if parent is nil or non-nil respectively.
#replace_with(node) -- replace this node with node in the tree.
#swap_with(node) -- exchange this node with node in their trees.
#insert_prev(*nodes), #insert_next(*nodes) -- insert nodes before this node in the parent list. Parent must be a NodeList! Useful for adding statements before a node in a block, for example.
#Foo? -- (where Foo is a module name) return self.is_a?(Foo).

The #Foo? method is a convienience for a common need. Example:

## make a tree
ast = C::Parser.new.parse(code_string)

## print all global variables
ast.entities.each do |node|
  node.Declaration? or next
  node.declarators.each do |decl|
    unless decl.type.Function?
      puts "#{decl.name}: #{decl.type}"
    end
  end
end

If you're a duck-typing purist, then sorry for the cardiac arrest you're now experiencing. CAST does pay attention to the class of Node objects for quite a few things. This is the cleanest way to distinguish, e.g., an Add from a Subtract (which both have the same methods but represent very different things). It just seems impractical (and unnecessary) to allow duck typing in this situation.

The #=~ method lets you do:

if declarator.type =~ 'const int *'
  puts "Ooh, a const int pointer!"
end

This is not the same as

declarator.type.to_s == 'const int *'

That'd require you to guess how to_s formats its strings (most notably, the whitespace).

=== Fields and children

Each concrete Node class has a member for each bit of important C stuff it pertains to. I know you peeked at the big list below, so you know the kind of thing I mean.

But these aren't defined as attrs as you normally do in Ruby -- they're fields. If a node has a field foo, it means there's a setter #foo= and getter #foo. (A field foo? means the setter is #foo= and the getter is #foo?.) Some fields are even more special: child fields. These form the tree structure, and always have a Node or nil value.

Why divulge these bizarre internal secrets? Because these Node methods are defined in terms of fields and children:

#==, #eql? -- Equality is checked recursively. That is, all fields (including children) must be equal.
#dup, #clone -- Children are copied recursively (other fields and attrs as normal).

Then there's the tree-twiddling methods, which only ever yield/return/affect (non-nil) children.

#next, #prev -- return the next/prev sibling.
#list_next, #list_prev -- like #next/#prev, but also requires the parent to be NodeList. I'll be honest; I don't remember why I added these methods. They may well suddenly disappear.
#each, #reverse_each -- Yield all (non-nil) children. Node includes Enumerable, so you get the free chocolates too.
#depth_first, #reverse_depth_first -- Walk the tree in that order, yielding two args (event, node) at each node. event is :down on the way down, :up on the way up. If the block throws :prune, it won't descend any further.
#preorder, #reverse_preorder, #postorder, #reverse_postorder -- Walk the tree depth first, yielding nodes in the given order. For the preorders, if the block throws :prune, it won't descend any further.
#node_after(child), #node_before(child) -- return the node before/after child (same as child.next).
#remove_node(child) -- remove child from this node (same as child.detach).
#replace_node(child, new_child) -- replace child with yeah you guessed it (same as child.replace_with(newchild)).

If you're walking the tree looking for nodes to move around, don't forget that modifying the tree during traversal is a criminal offence.

And now, the episode you've been waiting for: THE FIELD LIST! (Cue music and dim lights.)

Notes about the table:

If no default is listed, it is false if the field name ends in a '?', nil otherwise.
nil is always allowed for a child field.
There are some conventions we follow religiously to help you:
- Field names that end in '?' are always true-or-false.
- NodeList-valued fields always default to an empty NodeArray or NodeChain, so you can tack things on there with <<, without worrying about needing to create the list first.
- A field is Node-valued if and only if it is a child field.
- The rows go yellow, green, yellow, green, ... .

Class Field Child? Type or possible values Default Comments

TranslationUnit entities Y NodeList NodeChain[] He lives at the root of a parsed file.

Declaration

storage

:typedef, :extern, :static, :auto, :register

There are also methods to query the storage more humanely:

#typedef? -- true iff storage == :typedef
#extern? -- true iff storage == :extern
#static? -- true iff storage == :static
#auto? -- true iff storage == :auto
#register? -- true iff storage == :register

type Y DirectType

declarators Y NodeList NodeArray[]

inline? true, false

Declarator

indirect_type

IndirectType

What on earth is a "declarator", you ask? Consider "int i, *ip;". This is a Declaration with two Declarators:

    Declaration
        type: Int
        declarators:
            - Declarator
                name: "i"
            - Declarator
                indirect_type: Pointer
                name: "ip"

The indirect_type of the ip Declarator is a Pointer to nil. "'Pointer to nil' my foot -- I want the type of the stupid variable!" Here:

#type -- return the type, the whole type, and nothing but the type. This is a clone; modifying it won't modify the tree.

So calling #type on the ip Declarator gives:

    Pointer
        type: Int

name String

init Y Expression

num_bits Y Integer

FunctionDef

storage

:extern, :static

Just like Declaration, there's also:

#extern? -- return true iff storage == :extern
#static? -- return true iff storage == :static

There's also a pseudo-field:

#prototype? -- same as !no_prototype?
#prototype=(val) -- same as no_prototype = !val

no_prototype? means that no prototype was given. That means parameter types weren't given in the parens, but in the "old-style" declaration list. Example:

int main(argc, argv) int argc; char **argv; { return 0; }	int main(int argc, char **argv) { return 0; }
No prototype.	Prototype.

Everyone tells you to use prototypes. That's because no type checking is done when calling a function declared without a prototype.

inline? true, false

type Y Type

name String

def Y Block Block.new

no_prototype? true, false

Parameter register? true, false Used in Functions.

type Y Type

name String

Enumerator name String Used in Enums.

val Y Expression

MemberInit member Y NodeList of Member-or-Expression Used in CompoundLiterals.

init Y Expression

Member name String Used in MemberInits.

Block labels Y NodeList of Label NodeArray[]

stmts Y NodeList of Statement NodeArray[]

If labels Y NodeList of Label NodeArray[]

cond Y Expression

then Y Statement

else Y Statement

Switch labels Y NodeList of Label NodeArray[]

cond Y Expression

stmt Y Statement

While labels Y NodeList of Label NodeArray[] do? means it's a do-while loop.

do? true, false

cond Y Expression

stmt Y Statement

For labels Y NodeList of Label NodeArray[]

init Y Expression or Declaration

cond Y Expression

iter Y Expression

stmt Y Statement

Goto labels Y NodeList of Label NodeArray[]

target String

Continue labels Y NodeList of Label NodeArray[]

Break labels Y NodeList of Label NodeArray[]

Return labels Y NodeList of Label NodeArray[]

expr Y Expression

ExpressionStatement labels Y NodeList of Label NodeArray[]

expr Y Expression

PlainLabel name String

Default

Case expr Y Expression

Comma exprs Y NodeList of Expression

Conditional cond Y Expression

then Y Expression

else Y Expression

Variable name String

Index expr Y Expression

index Y Expression

Call expr Y Expression

args Y NodeList of Expression-or-Type

Dot expr Y Expression

member Y String

Arrow expr Y Expression

member Y String

PostInc expr Y Expression

PostDec expr Y Expression

Cast type Y Type

expr Y Expression

Address expr Y Expression

Dereference expr Y Expression

Sizeof expr Y Type or Expression

Positive expr Y Expression

Negative expr Y Expression

PreInc expr Y Expression

PreDec expr Y Expression

BitNot expr Y Expression

Not expr Y Expression

Add expr1 Y Expression

expr2 Y Expression

Subtract expr1 Y Expression

expr2 Y Expression

Multiply expr1 Y Expression

expr2 Y Expression

Divide expr1 Y Expression

expr2 Y Expression

Mod expr1 Y Expression

expr2 Y Expression

Equal expr1 Y Expression

expr2 Y Expression

NotEqual expr1 Y Expression

expr2 Y Expression

Less expr1 Y Expression

expr2 Y Expression

More expr1 Y Expression

expr2 Y Expression

LessOrEqual expr1 Y Expression

expr2 Y Expression

MoreOrEqual expr1 Y Expression

expr2 Y Expression

BitAnd expr1 Y Expression

expr2 Y Expression

BitOr expr1 Y Expression

expr2 Y Expression

BitXor expr1 Y Expression

expr2 Y Expression

ShiftLeft expr1 Y Expression

expr2 Y Expression

ShiftRight expr1 Y Expression

expr2 Y Expression

And expr1 Y Expression

expr2 Y Expression

Or expr1 Y Expression

expr2 Y Expression

Assign lval Y Expression

rval Y Expression

MultiplyAssign lval Y Expression

rval Y Expression

DivideAssign lval Y Expression

rval Y Expression

ModAssign lval Y Expression

rval Y Expression

AddAssign lval Y Expression

rval Y Expression

SubtractAssign lval Y Expression

rval Y Expression

ShiftLeftAssign lval Y Expression

rval Y Expression

ShiftRightAssign lval Y Expression

rval Y Expression

BitAndAssign lval Y Expression

rval Y Expression

BitXorAssign lval Y Expression

rval Y Expression

BitOrAssign lval Y Expression

rval Y Expression

StringLiteral val String The String in val is the literal string entered. "\n" isn't converted to a newline, for instance.

CharLiteral val String The String in val is the literal string entered. '\n' isn't converted to a newline, for instance.

CompoundLiteral

type

Type

Here's an example. (struct S){1, .x = 2, .y [3] .z = 4} parses as:

CompoundLiteral
    type: Struct
        name: "S"
    member_inits:
        - MemberInit
            init: IntLiteral
                val: 1
        - MemberInit
            member:
                - Member
                    name: "x"
            init: IntLiteral
                val: 2
        - MemberInit
            member:
                - Member
                    name: "y"
                - IntLiteral
                    val: 3
                - Member
                    name: "z"
            init: IntLiteral
                val: 4

"That's legal syntax!?" Yep. Look it up.

member_inits Y NodeList of MemberInit NodeArray[]

IntLiteral

val

Integer

Also:

#dec? -- return true iff format == :dec
#hex? -- return true iff format == :hex
#oct? -- return true iff format == :oct

format :dec, :hex, :oct :dec

FloatLiteral val Float

Pointer const? true, false

restrict? true, false

volatile? true, false

type Y Type

Array const? true, false

restrict? true, false

volatile? true, false

type Y Type

length Y Expression

Function const? true, false

restrict? true, false

volatile? true, false

type Y Type

params Y NodeList of Parameter NodeArray[]

var_args? true, false

Struct const? true, false

restrict? true, false

volatile? true, false

name String

members Y NodeList of Member NodeArray[]

Union const? true, false

restrict? true, false

volatile? true, false

name String

members Y NodeList of Member NodeArray[]

Enum const? true, false

restrict? true, false

volatile? true, false

name String

members Y NodeList of Enumerator

CustomType const? true, false This is for typedef'd names.

restrict? true, false

volatile? true, false

name String

Void const? true, false const void!? Yes, think about: const void *.

restrict? true, false

volatile? true, false

Int

const?

true, false

longness sounds silly, so here are some less silly methods:

#short? -- return true iff longness == -1
#plain? -- return true iff longness == 0
#long? -- return true iff longness == 1
#long_long? -- return true iff longness == 2

Oh, and look, a pseudo-field:

#signed? -- same as !unsigned?
#signed=(val) -- same as unsigned = !val

restrict? true, false

volatile? true, false

longness -1, 0, 1, 2 0

unsigned? true, false

Float

const?

true, false

Less silly-sounding longness substitutes:

#plain? -- return true iff longness == 0
#double? -- return true iff longness == 1
#long_double? -- return true iff longness == 2

restrict? true, false

volatile? true, false

longness 0, 1, 2 0

Char

const?

true, false

Also:

#signed? -- return true iff signed == true
#unsigned? -- return true iff signed == false
#plain? -- return true iff signed == nil

Yes, C99 says that char, signed char, and unsigned char are 3 distinct types (unlike with int -- go figure). Like Martian chalk and Venusian cheese: completely different, but you can fit 'em each in one byte. Mmm, Martian chalk...

restrict? true, false

volatile? true, false

signed true, false, nil

Bool const? true, false This is the rarely seen _Bool type.

restrict? true, false

volatile? true, false

Complex

const?

true, false

This is the rarely seen _Complex type.

#plain? -- return true iff longness == 0
#double? -- return true iff longness == 1
#long_double? -- return true iff longness == 2

restrict? true, false

volatile? true, false

longness 0, 1, 2 0

Imaginary

const?

true, false

This is the rarely seen _Imaginary type.

#plain? -- return true iff longness == 0
#double? -- return true iff longness == 1
#long_double? -- return true iff longness == 2

restrict? true, false

volatile? true, false

longness 0, 1, 2 0

BlockExpression block Y Block Block.new Only if the block_expressions extension is enabled. See "Extensions" section below.

=== Node Construction

Wanna make a Node? Take your pick:

#new(field1, field2, ...) -- fields are in the order listed above.
#new(:field1 => val, :field2 => val, ...) -- field order doesn't matter this way.
#new_at(pos, *args) -- set the pos to that given, and pass args to #new.

They're for losers, though. What you really want to do is make Nodes by parsing C code. Each class -- even the abstract classes like Statement -- has a .parse method:

function_def = C::FunctionDef.parse <<EOS
  void frobnicate(int karma) {
    use_waffle_iron();
  }
stmt = C::Statement.parse('while (not_looking) paint_car();')

Need to tell it to treat WaffleIron as a type name? All those parse methods use C.default_parser:

C.default_parser.type_names << 'WaffleIron'
type = C::Type.parse('WaffleIron')

Alternatively, you could've given parse your own parser:

parser = C::Parser.new
parser.type_names << 'WaffleIron'
type = C::Type.parse('WaffleIron', parser)

In fact, there's also C.parse(str, parser=nil), which is an alias for C::TranslationUnit.parse(str, parser).

ast = C.parse(STDIN)

Yes, all that talk in the intro about doing parser = C::Parser.new; parser.parse(...) was actually all a charade to make you type more. I so own you.

== Extensions

CAST has developed extensions! To the C99 grammar, I mean.

Types are allowed as function arguments. This lets you parse macros like va_arg().
Blocks in parentheses are allowed as expressions ((a gcc extension)[http://gcc.gnu.org/onlinedocs/gcc-4.2.1/gcc/Statement-Exprs.html#Statement-Exprs]). You need to call #enable_block_expressions on the parser first. They pop out as BlockExpression nodes.

C.default_parser.enable_block_expressions node = C.parse 'char *brag(void) { return ({"I'm tricky!";}); }' node.entities[0].def.stmts.last.expr.class # => C::BlockExpression

== Open Issues

Is it okay to bastardize the =~ operator like that?
Should binary operators have a list of expressions rather than just 2? That'd allow to_s to format the strings better in some cases and make it consistent with Comma.
At the moment CAST chokes on preprocessor #-lines. Ruby makes it trivial to filter these out before passing the string to Parser#parse, and this makes it obvious when you forget to run a file through the preprocessor (which you need to do to get type names at least once). Is this stupid? Ideally we should probably have a builtin preprocessor and use that.

{Vote now}[mailto:george.ogata@gmail.com]

== To Do

Stop embarrasing yourself and write the parser in C.
Make it -wd friendly.
Fix the "TODO" bits in c.y. These are for rarely used C constructs, like the declaration int arr[*];.
Add a comment node. Might make it useful for doc extraction. Anyone want this? Not all comments will be caught, though. Comments can appear between any two tokens, and I don't really want to add comment fields to every node. They'd probably just appear between toplevel entities (in TranslationUnit#entities) and between statements (in Block#stmts).
Make it rdoc-able.

If any of these affect you greatly, {kick me}[mailto:george.ogata@gmail.com] to make it happen faster.

== Contact

I'm not really sure what people are going to try to use this for. If there's some functionality you think would make a good addition, or think I've made a mess of this poor puppy, give me a yell.

You can spam me at george.ogata@gmail.com. It'd help if you prefixed the subject with "[cast] " so I can easily distinguish CAST spam from fake Rolex spam.

== License

Ruby License

FAQs

What is csquare-cast?

Is csquare-cast well maintained?

Package last updated on 09 May 2012

Did you know?

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

csquare-cast

Related posts

Data Theft Repackaged: A Case Study in Malicious Wrapper Packages on npm

Malicious npm Package Typosquats Popular TypeScript ESLint Plugin, Exfiltrates Data and Enables Remote Exploitation