code and data

Nowadays you can find most of the code I write on GitHub. This page includes some highlights.

Bibsearch

David Vilar and I wrote a tool for keyword-based search and retrieval of official BibTeX files across the entire ACL Anthology and other official databases. It can also search the arXiv (bibsearch arxiv), download PDFs (bibsearch open), and generate BibTeX files from LaTeX source (bibsearch bib). Check out the source or install it via pip (pip3 install bibsearch). For more information on usage and features, run bibsearch man.

Lexically constrained decoding

Information on how to use lexically constrained decoding (see our NAACL 2018 paper) in Sockeye can be found here. In summer 2020, I ported it to fairseq.

SacreBLEU

SacreBLEU is a convenience tool for managing references for standard MT bakeoffs like WMT and IWSLT. It also aims to make it easier to compute comparable BLEU scores across research papers. You can install it via pip:

pip3 install sacrebleu

Check out the code at its official repo.

Bitext Workshop

I recorded Peter Brown and Bob Mercer’s talks and the subsequent Q&A session at the 2013 EMNLP workshop Twenty Years of Bitext. I then had them transcribed, cleaned them up, and annotated them, as a service to posterity. Peter and Bob delivered exactly the sort of talk you might have hoped for, one that was both reminiscent and humorous. It was truly a historic event.

Picture of PowerPoint slide entitled 'Oh yes, everything's right on schedule, Fred'

Fisher Callhome Spanish Translation Dataset

Picture of a Spanish speech translation lattice

We collected ASR output (using Kaldi) and human translations (using Amazon’s Mechanical Turk) for the Fisher Spanish and CALLHOME Spanish datasets, which together provide a four-way parallel dataset (among acoustic input, transcripts, ASR output in various forms, and English translations) for research in the translation of Spanish conversational speech. The dataset is available through the LDC.

Stack decoder visualizer

I wrote a jQuery stack decoder to help visualize word-based MT for our MT class. You can play with the live online demo or get the code from GitHub.

Syntactic feature extraction

You can find data (including the grammar) and code for extracting TSG feature sets on GitHub. This includes a version of Mark Johnson’s exhaustive CKY parser, modified to parse with grammars whose rules intermingle terminals and nonterminals and extended with a number of other convenient command-line options. You can find my version of his parser, with some minor changes and improvements, on GitHub.

Bayesian tree substitution grammar learning

Picture of a parse tree with TSG annotations

The code for the experiments in our 2009 paper on inferring tree substitution grammars is available on GitHub. It is small, modular, and well documented, and despite its being written in Perl, I have been told that it is easy to understand. It includes a patch to Mark Johnson’s CKY parser that allows it to be used with TSGs.

Reranking feature extraction

Charniak and Johnson’s reranking code (from their 2005 ACL paper) extracts a large set of syntactic features from parse trees. An impediment to using their features is that the extraction code is integrated into their reranking framework, which requires fairly specialized file formats. I modified their extract-spfeatures program to enable the extraction of their feature set from a single parse tree in standard bracketed format.

It is available on GitHub.
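
For readers unfamiliar with the input format, here is a minimal Python sketch of what "a single parse tree in standard bracketed format" looks like and how simple features can be read off it. It is not Charniak and Johnson’s feature set and not part of extract-spfeatures: the function names parse_bracketed and rule_counts are hypothetical, and the only features counted are local context-free rules, chosen purely for illustration. It also assumes every bracket has an explicit label, so trees wrapped in an unlabeled outer pair of parentheses would need that wrapper stripped first.

```python
from collections import Counter


def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Children are either nested tuples (nonterminals) or plain strings (words).
    Hypothetical helper for illustration; assumes every bracket has a label.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1                      # consume the node label
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])   # a terminal (word)
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def rule_counts(tree):
    """Count the local rules (parent -> children) used in the tree."""
    counts = Counter()

    def visit(node):
        label, children = node
        rhs = " ".join(c if isinstance(c, str) else c[0] for c in children)
        counts[label + " -> " + rhs] += 1
        for child in children:
            if not isinstance(child, str):
                visit(child)

    visit(tree)
    return counts


tree = parse_bracketed("(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
features = rule_counts(tree)
for rule, count in sorted(features.items()):
    print(count, rule)
```

The real feature set is far richer than rule counts, but the input has the same shape: one tree, one string.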