An overview of some software I developed. Most scripts include built-in help.
Run them with Python in your terminal:
$ python scriptname.py -h
art.py : A script to test the significance of differences in recall, precision and F-score between two machine learning systems, using approximate randomization testing. Should be suitable for, among others, TiMBL, term extraction and MBT experiments. Depends on confusionmatrix.py, combinations.py, and optionally on scipy (www.scipy.org). Example files (+readme.txt) to test the script are in this sample folder.
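The idea behind approximate randomization can be sketched as follows. This is not taken from art.py: the function names, the use of plain accuracy as the metric, and the add-one smoothing of the p-value are assumptions made for illustration.

```python
import random

def accuracy(gold, predicted):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def approximate_randomization(gold, system_a, system_b, metric=accuracy,
                              shuffles=1000, seed=0):
    """Estimate a two-sided p-value for the metric difference of two systems."""
    rng = random.Random(seed)
    observed = abs(metric(gold, system_a) - metric(gold, system_b))
    at_least_as_extreme = 0
    for _ in range(shuffles):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(system_a, system_b):
            # Swap the two outputs for this instance with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        difference = abs(metric(gold, shuffled_a) - metric(gold, shuffled_b))
        if difference >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing (an assumption here) keeps the estimate above zero.
    return (at_least_as_extreme + 1) / (shuffles + 1)
```

A small p-value suggests that the observed difference is unlikely to arise if the two systems were interchangeable. Any other metric with the same signature (for example an F-score function) can be passed as `metric`.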
linkvalidator.py : A script that checks whether the links on a web page still work.
transcode.py : A script to convert a file from one encoding into another, for example from latin1 to utf8.
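At its core such a conversion is a decode followed by an encode. A minimal sketch, not taken from transcode.py; the function name `transcode_bytes` is made up for illustration, and it works on raw bytes rather than on files:

```python
def transcode_bytes(data, source_encoding="latin1", target_encoding="utf8"):
    """Re-encode raw bytes from one character encoding into another."""
    text = data.decode(source_encoding)
    return text.encode(target_encoding)

# "café" is 4 bytes in latin1; in utf8 the é takes 2 bytes.
converted = transcode_bytes(b"caf\xe9")
print(converted)  # b'caf\xc3\xa9'
```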
zipf.py : A script to create a Zipf curve from a text file. Assumes that the text file is UTF-8 encoded.
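The data behind a Zipf curve is a rank/frequency table. A rough sketch of how such a table could be built, not taken from zipf.py; the function name, the lowercasing and the whitespace tokenization are assumptions:

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, frequency, token) tuples, most frequent token first."""
    counts = Counter(text.lower().split())
    return [(rank, frequency, token)
            for rank, (token, frequency) in enumerate(counts.most_common(), start=1)]

table = rank_frequency("the cat and the dog and the bird")
for rank, frequency, token in table:
    print(rank, frequency, token)
```

Plotting rank against frequency on logarithmic axes gives the curve. For a file, the text can be read with `open(path, encoding="utf-8").read()`.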
groups.py : A script to find (cognitive) groups in a list of numbers. Example: 1, 2, 15, 16, 150, 1200, 1205, 1205 will be grouped into [1, 2], [15, 16], [150], [1200, 1205, 1205].
reline.py : A script to convert the line endings in a file from Windows to Mac OS X style.
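One simple heuristic that happens to reproduce the grouping in this example is to start a new group whenever the jump to the next number is large relative to that number. This is only a guess at a possible approach, not the method used in groups.py; the function name and the threshold of 0.6 are assumptions, and the sketch expects positive numbers.

```python
def group_numbers(numbers, threshold=0.6):
    """Group positive numbers; a large relative jump starts a new group."""
    groups = []
    for value in sorted(numbers):
        if groups:
            previous = groups[-1][-1]
            # Relative gap: the jump from the previous value as a share
            # of the current value.
            if value == previous or (value - previous) / value <= threshold:
                groups[-1].append(value)
                continue
        groups.append([value])
    return groups

print(group_numbers([1, 2, 15, 16, 150, 1200, 1205, 1205]))
# [[1, 2], [15, 16], [150], [1200, 1205, 1205]]
```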
tokenizer.py : A tokenizer. Optimized for English biomedical texts.
pixel.py : A script to convert pixels into sizes for different resolutions.
abbreviations.py : Script that links abbreviations to their full definitions.
confusionmatrix.py : Script that reconstructs the confusion matrix from a TiMBL output file.
binarize.py : Script to create binary instance files from TiMBL instance files. Can also be used to create SVMLight formatted files.
nfold.py : A script that splits an instance file into n parts. Can be used for MBT and TiMBL style files.
selftrain.py : A script that can be used to do simple self-learning. Some sample data [testdata.tgz].
weightedoverlap.py : A Python module that calculates Information Gain (IG) and Gain Ratio (GR) from TiMBL instance files. Useful if you want to use these weighting methods without running TiMBL.
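For reference, the two measures can be sketched like this. The code is not taken from weightedoverlap.py; the function names are made up, and it assumes that each instance is a sequence of feature values with the class label as the last item.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(instances, feature_index):
    """Class entropy minus the expected class entropy after splitting on a feature."""
    labels = [instance[-1] for instance in instances]
    labels_by_value = defaultdict(list)
    for instance in instances:
        labels_by_value[instance[feature_index]].append(instance[-1])
    remainder = sum(len(subset) / len(instances) * entropy(subset)
                    for subset in labels_by_value.values())
    return entropy(labels) - remainder

def gain_ratio(instances, feature_index):
    """Information gain divided by the entropy of the feature values (split info)."""
    split_info = entropy([instance[feature_index] for instance in instances])
    if split_info == 0:
        return 0.0  # the feature has a single value and cannot split the data
    return information_gain(instances, feature_index) / split_info
```

Gain Ratio normalizes Information Gain so that features with many distinct values are not favoured merely for having many values.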
visual.0.1.4.tgz : A package to visualize (TiMBL) instances. The visualization is only approximate. Only for Mac OS X, and may only work with Python 2.5. Can be used to create images like this:
 
Example of visual
lottery.py : A script to experience how the lottery is a complete waste of money. Compute your losses. You need combinations.py to run the script.
------------------------
Python modules that may be of interest (not always documented)
timbl.py : A module that contains a function to call TiMBL. Does not parse the TiMBL output.
multitimbl.py : Running TiMBL on multiple CPUs. Depends on timbl.py.
nspheres.py : A module to compute the intersection point between n-dimensional spheres. In the figure below, this is shown in 2 dimensions. We are looking for the coordinates of point 3 while knowing its distance from point 1 (namely 0.8) and point 2 (namely 0.6).
Example of nspheres.py
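The 2-dimensional case can be worked out directly. The sketch below is not taken from nspheres.py; the function name is made up, and the positions of point 1 at (0, 0) and point 2 at (1, 0) are assumptions, since only the two distances are given here.

```python
import math

def circle_intersections(center1, radius1, center2, radius2):
    """Return the intersection points of two circles (an empty list if none)."""
    x1, y1 = center1
    x2, y2 = center2
    d = math.hypot(x2 - x1, y2 - y1)
    if d == 0 or d > radius1 + radius2 or d < abs(radius1 - radius2):
        return []
    # Distance from center1, along the line between the centers,
    # to the point nearest the intersections.
    a = (d * d + radius1 * radius1 - radius2 * radius2) / (2 * d)
    # Half the length of the chord that connects the two intersections.
    h = math.sqrt(max(radius1 * radius1 - a * a, 0.0))
    foot_x = x1 + a * (x2 - x1) / d
    foot_y = y1 + a * (y2 - y1) / d
    offset_x = -h * (y2 - y1) / d
    offset_y = h * (x2 - x1) / d
    return [(foot_x + offset_x, foot_y + offset_y),
            (foot_x - offset_x, foot_y - offset_y)]

# Assumed positions: point 1 at (0, 0), point 2 at (1, 0).
print(circle_intersections((0.0, 0.0), 0.8, (1.0, 0.0), 0.6))
```

Under these assumed positions, the candidates for point 3 are approximately (0.64, 0.48) and its mirror image (0.64, -0.48). In two dimensions, two distances leave two candidates; a third known point would be needed to pick one.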
combinations.py : A script to compute all unique subsets of length r of a set of n elements. Can be used if you want to combine every token from a sentence with every other token.
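For comparison, the Python standard library offers the same operation through `itertools.combinations`. The sketch below uses that function, not combinations.py, and the helper name `token_pairs` is made up for illustration:

```python
from itertools import combinations

def token_pairs(sentence):
    """Combine every token in a sentence with every other token (r = 2)."""
    tokens = sentence.split()
    return list(combinations(tokens, 2))

print(token_pairs("the cat sat"))
# [('the', 'cat'), ('the', 'sat'), ('cat', 'sat')]
```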
gdep.py : A module that keeps GDep (GENIA Dependency parser (K. Sagae)) readily available so you don't have to start it up every time you want to parse.
highlight.py : Module that creates rich text files (RTF). Can be used to add color to words in texts.
Example:
Example of highlight.py
levenshtein.py : Module that computes the Levenshtein distance.
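The Levenshtein distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A standard dynamic-programming sketch, not taken from levenshtein.py:

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of
    # source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,       # deletion
                current[j - 1] + 1,    # insertion
                previous[j - 1] + (source_char != target_char),  # substitution or match
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```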
tessels.tgz : Module that can be used to create pentagons that tessellate the plane (NodeBox needed). Three of the 14 types can be created:
Example of type6 Example of type8 Example of type9
The final goal of the pentagons is to create a pattern like this:

Example of tessellation
------------------------
Non-Python scripts
Some of these scripts include built-in help. Run them in your terminal:
$ ./script -h
pysearch : A bash script that looks for patterns in Python files. Can be used to search all Python files in a directory tree for e.g. "def create".
Imager-1.1.1.dmg : Mac OS X software to conduct visual/audio experiments. The program shows images together with sound and lets the subject choose between various options. You should add your own audio (.wav) and image (.jpg) files. For more info see MANUAL.txt.
------------------------
Python : An introduction to the Python Programming Language for linguists (Dutch).
------------------------
All scripts on this page were written by Vincent Van Asch and are available under the GPL, except for tokenizer.py, which is branched off from the MBSP package; for its license, see LICENSE in that package.
Last updated: 8th of February 2012