github.com/t7a/tada
tada (TAble DAta) is a package that enables test-driven data pipelines in pure Go.
DISCLAIMER: still under development. API subject to breaking changes until v1.
If you want to use tada despite the disclaimer, congratulations, you are an alpha tester! Please DM your feedback to me on the Gophers Slack or create an issue.
tada combines concepts from pandas, spreadsheets, R, Apache Spark, and SQL. Its most common use cases are cleaning, aggregating, transforming, and analyzing data.
Some notable features of tada:
The key data types are Series, DataFrames, and groupings of each. A Series is analogous to one column of a spreadsheet, and a DataFrame is analogous to a whole spreadsheet. Printing either data type will render an ASCII table.
Both Series and DataFrames have one or more "label levels". On printing, these appear as the leftmost columns in a table, and typically have values that help identify ("label") specific rows. They are analogous to the "index" concept in pandas.
For more detail and implementation notes, see this doc.
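To make the idea of a label level concrete, here is a minimal sketch in plain Go. It does not use tada and is not how tada implements Series; the `labeledSeries` type and its `lookup` and `render` methods are hypothetical names invented for illustration. It only shows how a column of labels can sit alongside a column of values, identify rows, and appear as the leftmost column when printed.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// labeledSeries is a hypothetical toy type for illustration only;
// it is not tada's Series.
type labeledSeries struct {
	name   string    // column name
	labels []string  // a single label level: one label per row
	values []float64 // one value per row
}

// newLabeledSeries checks that every row has both a label and a value.
func newLabeledSeries(name string, labels []string, values []float64) (labeledSeries, error) {
	if len(labels) != len(values) {
		return labeledSeries{}, errors.New("labels and values must have the same length")
	}
	return labeledSeries{name: name, labels: labels, values: values}, nil
}

// lookup returns the value of the first row whose label matches.
func (s labeledSeries) lookup(label string) (float64, bool) {
	for i, l := range s.labels {
		if l == label {
			return s.values[i], true
		}
	}
	return 0, false
}

// render draws the series as a small ASCII table with the labels
// in the leftmost column.
func (s labeledSeries) render() string {
	var b strings.Builder
	fmt.Fprintf(&b, "| %-8s | %8s |\n", "label", s.name)
	for i, v := range s.values {
		fmt.Fprintf(&b, "| %-8s | %8g |\n", s.labels[i], v)
	}
	return b.String()
}

func main() {
	s, err := newLabeledSeries("score", []string{"foo", "bar", "baz"}, []float64{1, 2, 3})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Print(s.render())
	if v, ok := s.lookup("bar"); ok {
		fmt.Println("bar ->", v)
	}
}
```

A real DataFrame would hold several value columns and possibly several label levels; the sketch keeps one of each to stay short.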
Logo: @egonelbre, licensed under CC0
You start with a CSV. Like most real-world data, it is messy. This one is missing a score in the first row. And because scores must range between 0 and 10, the scores of -100 and 1000 in the second and third rows must also be erroneous:
var data = `name, score
joe doe,
john doe, -100
jane doe, 1000
john doe, 5
jane doe, 8
john doe, 7
jane doe, 10`
You want to write and validate a function that discards erroneous data, groups by the name column, and returns the mean of the groups.
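Before writing the tada version, it can help to work out the expected result by hand. The following sketch uses only the Go standard library, not tada, and the function name `meanValidScores` is made up for this illustration. It applies the same three rules: drop rows with a missing score, drop scores outside 0 to 10, then average per name.

```go
package main

import (
	"encoding/csv"
	"errors"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

const data = `name, score
joe doe,
john doe, -100
jane doe, 1000
john doe, 5
jane doe, 8
john doe, 7
jane doe, 10`

// meanValidScores is a hypothetical helper, not part of tada.
// It returns the mean of the valid scores for each name.
func meanValidScores(raw string) (map[string]float64, error) {
	r := csv.NewReader(strings.NewReader(raw))
	r.TrimLeadingSpace = true // the sample has a space after each comma
	rows, err := r.ReadAll()
	if err != nil {
		return nil, err
	}
	if len(rows) < 2 {
		return nil, errors.New("no data rows")
	}
	sums := map[string]float64{}
	counts := map[string]int{}
	for _, row := range rows[1:] { // skip the header row
		name, field := row[0], strings.TrimSpace(row[1])
		if field == "" {
			continue // missing score: drop the row
		}
		v, err := strconv.ParseFloat(field, 64)
		if err != nil {
			return nil, err
		}
		if v < 0 || v > 10 {
			continue // out of range: drop the row
		}
		sums[name] += v
		counts[name]++
	}
	means := make(map[string]float64, len(sums))
	for name, sum := range sums {
		means[name] = sum / float64(counts[name])
	}
	return means, nil
}

func main() {
	means, err := meanValidScores(data)
	if err != nil {
		fmt.Println(err)
		return
	}
	names := make([]string, 0, len(means))
	for name := range means {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s, %v\n", name, means[name])
	}
	// prints:
	// jane doe, 9
	// john doe, 6
}
```

Only four rows survive the cleaning: jane doe keeps 8 and 10 (mean 9), john doe keeps 5 and 7 (mean 6), and joe doe disappears because his only row has no score.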
First you write a test. You can test in two ways:
func TestDataPipeline(t *testing.T) {
	want := `name, mean_score
jane doe, 9
john doe, 6`

	df, _ := tada.ReadCSV(strings.NewReader(data))
	ret := sampleDataPipeline(df)
	eq, diffs, _ := ret.EqualsCSV(true, strings.NewReader(want))
	if !eq {
		t.Errorf("sampleDataPipeline(): got %v, want %v, has diffs: \n%v", ret, want, diffs)
	}
}
func Test_sampleDataPipelineTyped(t *testing.T) {
	type output struct {
		Name      []string  `tada:"name"`
		MeanScore []float64 `tada:"mean_score"`
	}
	// john doe's valid scores are 5 and 7, so his mean is 6.
	want := output{
		Name:      []string{"jane doe", "john doe"},
		MeanScore: []float64{9, 6},
	}

	df, _ := tada.ReadCSV(strings.NewReader(data))
	out := sampleDataPipeline(df)
	var got output
	out.Struct(&got)
	if !reflect.DeepEqual(got, want) {
		t.Errorf("sampleDataPipelineTyped(): got %v, want %v", got, want)
	}
}
Then you write the data pipeline:
func sampleDataPipeline(df *tada.DataFrame) *tada.DataFrame {
	// Fail fast if the expected columns are missing.
	err := df.HasCols("name", "score")
	if err != nil {
		log.Fatal(err)
	}
	// Discard rows with missing values, then treat score as a float.
	df.InPlace().DropNull()
	df.Cast(map[string]tada.DType{"score": tada.Float64})
	// Keep only scores in the valid range of 0 to 10.
	validScore := func(v interface{}) bool { return v.(float64) >= 0 && v.(float64) <= 10 }
	df.InPlace().Filter(map[string]tada.FilterFn{"score": validScore})
	// Sort by name, then average the scores within each name.
	df.InPlace().Sort(tada.Sorter{Name: "name", DType: tada.String})
	ret := df.GroupBy("name").Mean("score")
	if ret.Err() != nil {
		log.Fatal(ret.Err())
	}
	return ret
}
More examples
s := tada.NewSeries([]float64{1, 2, 3})

s := tada.NewSeries([]float64{1, 2, 3}, []string{"foo", "bar", "baz"})

df := tada.NewDataFrame([]interface{}{
	[]string{"a"},
	[]float64{100},
}).SetColNames([]string{"foo", "bar"})

f, err := os.Open("foo.csv")
// ... handle err
defer f.Close()
df, err := tada.ReadCSV(f)
// ... handle err
InPlace()
.Cast() it to tada.Float64, tada.String, or tada.DateTime, respectively.