This package includes two scripts and related files for extracting segments of audio from a corpus of audio data and transcripts. One is a Python script; the other is a script for the acoustic analysis software package Praat (written in the Praat scripting language, with the .praat extension). The scripts are intended for use in a Linux (or Mac OS X) shell.

Directory Contents: This directory contains the following: 

     get_corpus_data.py
     get_sounds_sounds.praat
     foo.txt (a sample file containing lines of audio transcript data)
     bar/ (directory for sample audio files corresponding to the transcript data in foo.txt: See below for access)
     sph2pipe (a tool released by the Linguistic Data Consortium for converting audio files in the NIST .sph format to .wav)
     AddFileNameToLine.sh (a Bash shell script for appending the name of a file to the lines it contains)
     README (this file)

Input: get_corpus_data.py takes STDIN input and three arguments: 

	- a directory containing audio files. This must be an absolute path 
	  if the directory is not inside the directory from which the script is run;
	- the format of the input audio (wav, sph or mp3); 
	- the format of output audio (wav, sph or mp3). 

Using foo.txt and bar/ as examples (and assuming both are in the same directory as the script files), get_corpus_data.py is invoked as follows: 

$ cat foo.txt | python get_corpus_data.py bar/ wav wav 
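As an illustration only, the three arguments described above could be validated along the following lines. This is a minimal sketch, not the code in get_corpus_data.py: the names build_parser, audio_dir, in_format and out_format are hypothetical, and the use of argparse is an assumption.

```python
import argparse

# The three audio formats named in this README.
FORMATS = ("wav", "sph", "mp3")


def build_parser():
    """Build a parser for the three positional arguments (hypothetical names)."""
    parser = argparse.ArgumentParser(
        description="Extract audio segments for transcript lines read from STDIN."
    )
    parser.add_argument("audio_dir", help="directory containing the audio files")
    parser.add_argument("in_format", choices=FORMATS, help="format of the input audio")
    parser.add_argument("out_format", choices=FORMATS, help="format of the output audio")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.audio_dir, args.in_format, args.out_format)
```

With this sketch, an unsupported format such as "ogg" is rejected with a usage message before any transcript line is read.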

The input file (foo.txt in the example) should contain lines of data from an audio transcript corpus. Each line must be prefixed with the name of the transcript file it is taken from, and must include a time stamp and speaker metadata: 

$ cat foo.txt 
SBC_005.A:CA.B:CA.txt:58.04 59.49 B(CA:F:38): like a giant octopus,
SBC_005.A:CA.B:CA.txt:214.59 216.29 A(CA:F:38): or history or something like that,
SBC_005.A:CA.B:CA.txt:141.84 142.74 A(CA:M:33): .. I] didn't like the book,
SBC_005.A:CA.B:CA.txt:284.94 286.04 A(CA:M:33): (H) And it was like that.
SBC_006.A:CA.B:CA.txt:307.40 309.55 B(CA:F:34): well my next free day's like October fourth.
SBC_006.A:CA.B:CA.txt:763.06 764.36 A(CA:F:30): She's like talking about RO=M,
SBC_007.A:MT.B:MT.txt:1110.48 1113.05 A(MT:F:27): ... He goes [back] to school like the secon=d.
SBC_007.A:MT.B:MT.txt:1173.50 1177.97 A(MT:F:28): ... like if she had to go shopping or something maybe you could go with her,

The script uses a regular expression to identify, for each line, the source file name, the time stamp (e.g. 58.04 59.49), the tag and metadata for the speaker, and the datum itself. For example, the first line of foo.txt is broken down as follows: 

Source file: SBC_005.A:CA.B:CA.txt
Time stamp: 58.04-59.49 
Speaker: B(CA:F:38)
Datum: like a giant octopus,

[Note that in the sample the file name includes the name or tag of each speaker and their accent or dialect. The speaker in the first line of foo.txt is B, speaking with a Californian accent, female and 38 years old.]
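The breakdown above can be sketched in Python as follows. The pattern is an assumption written to fit the lines shown in foo.txt; it is not the regular expression used in get_corpus_data.py, and the names LINE_RE and parse_line are hypothetical. It expects a speaker tag of the form TAG(...) on every line, so lines without one are not matched.

```python
import re

# Assumed line layout: <file>.txt:<start> <end> <TAG(meta)>: <datum>
LINE_RE = re.compile(
    r"^(?P<source>.+?\.txt):"              # source transcript file name
    r"(?P<start>\d+(?:\.\d+)?)\s+"         # start time in seconds
    r"(?P<end>\d+(?:\.\d+)?)\s+"           # end time in seconds
    r"(?P<speaker>[^\s(]+\([^)]*\)):\s*"   # speaker tag plus metadata in parentheses
    r"(?P<datum>.*)$"                      # the transcribed text
)


def parse_line(line):
    """Return the fields of one data line as a dict, or None if it does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    fields = match.groupdict()
    fields["start"] = float(fields["start"])
    fields["end"] = float(fields["end"])
    return fields


if __name__ == "__main__":
    sample = "SBC_005.A:CA.B:CA.txt:58.04 59.49 B(CA:F:38): like a giant octopus,"
    print(parse_line(sample))
```

The speaker group is matched up to its closing parenthesis because the metadata itself contains colons, so splitting the line on ":" alone would cut the speaker tag apart.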

The script then copies the audio file corresponding to each line's source file to the local directory and calls get_corpus.sounds.praat, the Praat script. The Praat script extracts the segment of audio indicated by the time stamps from the source audio file, aligns it with the transcript data for that segment and saves the pair as a Praat "collection," which can then be analyzed in Praat for whatever purposes the user desires. 
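To make the extraction step concrete, here is a rough Python equivalent that cuts a segment out of a WAV file by time stamp. It is a stand-in for illustration only: the package performs this step in Praat, not with Python's wave module, and this sketch neither aligns the transcript nor produces a Praat collection. The function name extract_segment is hypothetical.

```python
import wave


def extract_segment(src_path, dst_path, start_s, end_s):
    """Copy the audio between start_s and end_s (in seconds) to a new WAV file.

    Returns the number of frames written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        total = src.getnframes()
        rate = src.getframerate()
        # Convert the time stamps to frame indices, clamped to the file length.
        first = min(max(int(round(start_s * rate)), 0), total)
        last = min(max(int(round(end_s * rate)), first), total)
        src.setpos(first)
        frames = src.readframes(last - first)
    with wave.open(dst_path, "wb") as dst:
        # Reuse the source parameters; the frame count is corrected on close.
        dst.setparams(params)
        dst.writeframes(frames)
    return last - first
```

For the first line of foo.txt, the call would take 58.04 and 59.49 as start_s and end_s, with the source and destination paths chosen by the caller.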

The sample data are from the Santa Barbara corpus, a freely available audio corpus with transcripts: 

Du Bois, John W., Wallace L. Chafe, Charles Meyer, Sandra A. Thompson, Robert Englebretson, and Nii Martey. 2000-2005. Santa Barbara corpus of spoken American English, Parts 1-4. Philadelphia: Linguistic Data Consortium.
http://www.linguistics.ucsb.edu/research/santa-barbara-corpus

In order to use the sample files, please download the following into bar/:

http://www.linguistics.ucsb.edu/corpusmedia/SBC005.wav
http://www.linguistics.ucsb.edu/corpusmedia/SBC006.wav
http://www.linguistics.ucsb.edu/corpusmedia/SBC007.wav

NOTE: For get_corpus_data.py to function properly, each transcript file in the corpus must correspond to an audio file, and the base name of the audio file (its name without the extension) must be a prefix of the name of the transcript file, either a proper prefix or the transcript's entire base name. For example: 

	SBC_005.A:CA.B:CA.txt
	SBC_005.wav

	or

	SBC_005.txt
	SBC_005.wav
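The naming rule above can be expressed as a short lookup. This is a sketch of the rule as stated, not the matching logic in get_corpus_data.py; the function name find_audio_for is hypothetical, and preferring the longest matching base name is an added assumption for the case where several audio files qualify.

```python
import os


def find_audio_for(transcript_name, audio_names):
    """Return the audio file whose base name is a prefix of transcript_name.

    If several audio files qualify, the one with the longest base name wins
    (an assumption). Returns None if no audio file qualifies.
    """
    best = None
    best_len = -1
    for name in audio_names:
        stem, _ext = os.path.splitext(name)  # "SBC_005.wav" -> "SBC_005"
        if transcript_name.startswith(stem) and len(stem) > best_len:
            best, best_len = name, len(stem)
    return best


if __name__ == "__main__":
    audio = ["SBC_005.wav", "SBC_006.wav"]
    print(find_audio_for("SBC_005.A:CA.B:CA.txt", audio))
    print(find_audio_for("SBC_005.txt", audio))
```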

Each line of a data file (a list of lines of data from the corpus) must be prefixed with the name of the transcript file from which it is taken. For example, if the first five lines of SBC_005.txt look like the following: 

	$ head -5 SBC_005.txt
	0.00 3.46 DARRYL:         But,
	3.46 6.71                 .. but to try and .. and talk me out of believing in Murphy's Law,
	6.71 9.07                 by offering a miracle as a replacement,
	9.07 10.07                that doesn't d- work.
	10.07 10.27               (TSK) (H)

then the file must be modified to look like the following: 

	SBC_005.txt:0.00 3.46 DARRYL:         But,
	SBC_005.txt:3.46 6.71                 .. but to try and .. and talk me out of believing in Murphy's Law,
	SBC_005.txt:6.71 9.07                 by offering a miracle as a replacement,
	SBC_005.txt:9.07 10.07                that doesn't d- work.
	SBC_005.txt:10.07 10.27               (TSK) (H)

This can be done by running AddFileNameToLine.sh:

	$ chmod +x AddFileNameToLine.sh
	$ ./AddFileNameToLine.sh 
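For reference, the transformation itself is small. The following Python sketch prefixes each line with a file name and a colon, as in the example above. It is not a translation of AddFileNameToLine.sh, whose arguments and behaviour are not documented here, and the function name prefix_lines is hypothetical.

```python
def prefix_lines(file_name, lines):
    """Return the given lines, each prefixed with 'file_name:'."""
    return [f"{file_name}:{line}" for line in lines]


if __name__ == "__main__":
    sample = [
        "0.00 3.46 DARRYL:         But,",
        "9.07 10.07                that doesn't d- work.",
    ]
    for line in prefix_lines("SBC_005.txt", sample):
        print(line)
```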






