Version 2.1.0
mxTextTools is a collection of high-speed string manipulation routines and new Python objects for dealing with common text processing tasks.
One of the major features of this package is the integrated Tagging Engine which allows accessing the speed of compiled C programs while maintaining the portability of Python. The Tagging Engine uses byte code "programs" written in the form of Python tuples. These programs are then translated into an internal binary form which gets processed by a very fast virtual machine designed specifically for scanning text data.
As a result, the Tagging Engine allows parsing text at higher speeds than e.g. regular expression packages while still maintaining the flexibility of programming the parser in Python. Callbacks and user-defined matching functions extend this approach far beyond what you could do with other common text processing methods.
Two other major features are the search and character set objects provided by the package. Both are implemented in C to give you maximum performance on all supported platforms.
A note about the word 'tagging': This originated from what is done in HTML to mark some text with certain extra information. The Tagging Engine extends this notion to assigning Python objects to text substrings. Every substring marked in this way carries a 'tag' (the object) which can be used to do all kinds of useful things.
If you are looking for more tutorial style documentation of mxTextTools, there's a new book by David Mertz about Text Processing with Python which covers mxTextTools and other text oriented tools at great length.
The Tagging Engine is a low-level virtual machine (VM) which executes text search specific byte codes. This byte code is passed to the engine in the form of Tag Tables which define the "program" to execute in terms of commands and command arguments.
The Tagging Engine is capable of handling 8-bit text and Unicode (with some minor exceptions). Even combinations of the two string formats are accepted, but should be avoided for performance reasons in production code.
Marking certain parts of a text should not involve storing hundreds of small strings. This is why the Tagging Engine uses a specially formatted list of tuples to return the results.
A Tag List is a list of tuples marking certain slices of a text. The tuples always have the format
(object, left_index, right_index, subtags)
	  with the meaning: object contains information
	  about the slice [left_index:right_index] in
	  some text. subtags is either another Tag List
	  created by recursively invoking the Tagging Engine or
	  None. 
	
	  Note: Only the commands Table and
	  TableInList create new Tag Lists and make them
	  available via subtags and then only if the
	  Tagging Engine was not called with None as
	  value for the taglist. All other commands set
	  this tuple entry to None. This is important to
	  know if you want to analyze a generated Tag List, since it
	  may require recursing into the subtags Tag List
	  if that entry is not None.
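Analysing a Tag List therefore usually means recursing into the subtags entries. The following is a minimal sketch in plain Python (Python 3 syntax, no mxTextTools import needed); the helper name iter_tags and the hand-written taglist are illustrative assumptions and not part of the package.

```python
def iter_tags(taglist, depth=0):
    """Yield (depth, tagobj, left, right) for every entry, recursing into subtags."""
    for tagobj, left, right, subtags in taglist:
        yield depth, tagobj, left, right
        if subtags is not None:
            # subtags is itself a Tag List in the same four-tuple format
            yield from iter_tags(subtags, depth + 1)


text = "foo = bar"

# Hand-written Tag List in the (object, left_index, right_index, subtags) format
taglist = [
    ('assignment', 0, 9, [
        ('name', 0, 3, None),
        ('value', 6, 9, None),
    ]),
]

# Resolve every tagged slice back to the text it marks
found = [(depth, tagobj, text[left:right])
         for depth, tagobj, left, right in iter_tags(taglist)]
```

Because the tuples only store indices, the marked substrings are created on demand via text[left:right], which is the point of the format.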
	
To create such taglists, you have to define a Tag Table and let the Tagging Engine use it to mark the text. Tag Tables are defined using standard Python tuples containing other tuples in a specific format:
tag_table = (('lowercase',AllIn,a2z,+1,+2),
	     ('upper',AllIn,A2Z,+1),
	     (None,AllIn,white+newline,+1),
	     (None,AllNotIn,alpha+white+newline,+1),
	     (None,EOF,Here,-4)) # EOF 
	The tuples contained in the table use a very simple format:
(tagobj, command+flags, command_argument [,jump_no_match] [,jump_match=+1])
Starting with version 2.1.0 of mxTextTools, the Tagging Engine no longer uses these tuples directly, but instead compiles the Tag Table definitions into special TagTable objects. These objects are then processed by the Tagging Engine.
	  Even though the tag() Tagging Engine API
	  compiles Tag Table definitions into the TagTable object
	  on-the-fly, you can also compile the definitions yourself
	  and then pass the TagTable object directly to
	  tag().
	
To simplify writing Tag Table definitions, the Tag Table compiler also allows using string jump targets instead of jump offsets in the tuples:
tag_table = (
             'start',
             ('lowercase',AllIn,a2z,+1,'skip'),
	     ('upper',AllIn,A2Z,'skip'),
             'skip',
	     (None,AllIn,white+newline,+1),
	     (None,AllNotIn,alpha+white+newline,+1),
	     (None,EOF,Here,'start')) # EOF 
	These strings can be used as jump targets for jump_no_match (jne) and jump_match (je) when compiling the definition using TagTable() or UnicodeTagTable(); they are replaced with the numeric relative offsets at compile time.
	  The Tagging Engine has a new command JumpTarget
	  for this purpose which is implemented as a no-operation (NOP)
	  command.
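To illustrate what replacing string targets with relative offsets means, here is a rough sketch in plain Python. It is not the package's compiler: the function name resolve_jump_targets is hypothetical, the commands are represented by placeholder strings instead of the real integer constants, and it assumes that a label keeps its own slot in the table (in line with JumpTarget being a NOP command).

```python
def resolve_jump_targets(definition):
    """Return a copy of definition with string jump targets turned into offsets."""
    # Assumption: a bare string is a label and occupies one table slot.
    labels = {entry: index
              for index, entry in enumerate(definition)
              if isinstance(entry, str)}
    resolved = []
    for index, entry in enumerate(definition):
        if isinstance(entry, str):
            resolved.append(entry)          # labels stay in place as no-ops
            continue
        # Only positions 3 and 4 are jumps; position 2 may itself be a string.
        head, jumps = entry[:3], entry[3:]
        jumps = tuple(labels[jump] - index if isinstance(jump, str) else jump
                      for jump in jumps)
        resolved.append(head + jumps)
    return tuple(resolved)


# Placeholder strings stand in for the command constants and character sets.
definition = (
    'start',
    ('lowercase', 'AllIn', 'abc', +1, 'skip'),
    ('upper', 'AllIn', 'ABC', 'skip'),
    'skip',
    (None, 'AllIn', ' \n', +1),
    (None, 'AllNotIn', 'abcABC \n', +1),
    (None, 'EOF', 'Here', 'start'),
)
resolved = resolve_jump_targets(definition)
```

With the labels counted as entries, 'skip' becomes +2 in the first tuple and +1 in the second, and 'start' in the last tuple becomes -6.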
	
Starting with version 2.1.0 of mxTextTools, the Tagging Engine uses compiled TagTable instances for performing the scanning. These TagTables are Python objects which can be created explicitly using a tag table definition in the form of a tuple or a list (the latter are not cacheable, so it's usually better to transform the list into a tuple before passing it to the TagTable constructor).
	  The TagTable() constructor will "compile" and
	  check the tag table definition. It then stores the table in
	  an internal data structure which allows fast access from
	  within the Tagging Engine. The compiler also takes care of
	  any needed conversions such as Unicode to string or
	  vice-versa.
	
	  There are generally two different kinds of compiled
	  TagTables: one for scanning 8-bit strings and one for
	  Unicode. tag() will complain if you try to scan
	  strings with a UnicodeTagTable or Unicode with a string
	  TagTable.
	
 
	  Note that tag() can take TagTables and tuples
	  as tag table input. If given a tuple, it will automatically
	  compile the tuple into a TagTable needed for the requested
	  type of text (string or Unicode).
	
 
	  The TagTable() constructor caches compiled
	  TagTables if they are defined by a tuple and declared as
	  cacheable. In that case, the compiled TagTable will be stored
	  in a dictionary addressed by the definition tuple's
	  id() and be reused if the same compilation is
	  requested again at some later point. The cache dictionary is
	  exposed to the user as the tagtable_cache
	  dictionary. It has a hard limit of 100 entries, but can also
	  be managed by user routines to lower this limit.
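The caching scheme can be pictured with the following sketch. It is plain Python and only mimics the idea: compile_cached and fake_compile are hypothetical names, the module-level dictionary merely stands in for the package's tagtable_cache, and clearing the cache when the limit is reached is an assumption, since the text above does not say what happens at that point.

```python
TAGTABLE_CACHE_LIMIT = 100      # the hard limit quoted above
tagtable_cache = {}             # stand-in for the package's cache dictionary


def compile_cached(definition, compile_func, cacheable=True):
    """Return compile_func(definition), reusing results keyed by id()."""
    if not (cacheable and isinstance(definition, tuple)):
        return compile_func(definition)     # lists are never cached
    key = id(definition)
    if key in tagtable_cache:
        return tagtable_cache[key]
    if len(tagtable_cache) >= TAGTABLE_CACHE_LIMIT:
        tagtable_cache.clear()              # assumption: behaviour at the limit
    compiled = compile_func(definition)
    tagtable_cache[key] = compiled
    return compiled


calls = []

def fake_compile(definition):
    """Hypothetical compiler that only records how often it is invoked."""
    calls.append(definition)
    return ('compiled', len(definition))


definition = (('word', 'AllIn', 'abc'),)
first = compile_cached(definition, fake_compile)
second = compile_cached(definition, fake_compile)   # served from the cache
```

Since the key is the tuple's id(), a cache like this is only reliable while the definition tuple itself stays alive, for instance as a module-level constant.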
	
The Tagging Engine reads the Tag Table starting at the top entry. While performing the command actions (see below for details) it moves a read-head over the characters of the text. The engine stops when a command fails to match and no alternative is given or when it reaches a non-existing entry, e.g. by jumping beyond the end of the table.
Tag Table entries are processed as follows:
	  If the command matched, say the slice
	  text[l:r], the default action is to append
	  (tagobj,l,r,subtags) to the taglist (this
	  behaviour can be modified by using special
	  flags; if you use None as tagobj,
	  no tuple is appended) and to continue matching with the
	  table entry that is reached by adding
	  jump_match to the current position (think of
	  them as relative jump offsets). 
	  The head position of the engine stays where the command left
	  it (over index r), e.g. for
	  (None,AllIn,'A') right after the last 'A'
	  matched.
	
	  In case the command does not match, the
	  engine either continues at the table entry reached after
	  skipping jump_no_match entries, or if this
	  value is not given, terminates matching the current
	  table and returns not matched. The head position is
	  always restored to the position it was in before the
	  non-matching command was executed, enabling
	  backtracking.
	
	  The format of the command_argument is dependent
	  on the command. It can be a string, a set, a search object,
	  a tuple or some other wild animal from Python land. See the
	  command section below for details.
	
A table matches a string if and only if the Tagging Engine reaches a table index that lies beyond the end of the table. The engine then returns matched ok. Jumping beyond the start of the table (to a negative table index) causes the table to return with result failed to match.
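The processing rules above can be followed step by step with a tiny emulation written in plain Python. This is only a sketch under stated assumptions: run_table is a hypothetical name, just three commands are covered, the commands are strings rather than the real integer constants, and the character set variables are local stand-ins for the package's constants.

```python
import string

ALLIN, ALLNOTIN, EOF = 'AllIn', 'AllNotIn', 'EOF'   # stand-ins for the constants


def run_table(text, table):
    """Return (matched, taglist, head) for a small subset of commands."""
    taglist, head, index = [], 0, 0
    while 0 <= index < len(table):
        entry = table[index]
        tagobj, command, arg = entry[:3]
        jump_no_match = entry[3] if len(entry) > 3 else None
        jump_match = entry[4] if len(entry) > 4 else 1
        start = head
        if command == ALLIN:
            while head < len(text) and text[head] in arg:
                head += 1
            matched = head > start
        elif command == ALLNOTIN:
            while head < len(text) and text[head] not in arg:
                head += 1
            matched = head > start
        elif command == EOF:
            matched = head >= len(text)
        else:
            raise ValueError('unsupported command: %r' % (command,))
        if matched:
            if tagobj is not None:          # None as tagobj appends nothing
                taglist.append((tagobj, start, head, None))
            index += jump_match
        else:
            head = start                    # restore the head position
            if jump_no_match is None:
                return False, taglist, head
            index += jump_no_match
    # Leaving the table at the end means success, at the start means failure.
    return index >= len(table), taglist, head


a2z, A2Z = string.ascii_lowercase, string.ascii_uppercase
alpha, white, newline = a2z + A2Z, ' \t\v', '\r\n'

table = (('lowercase', ALLIN, a2z, +1, +2),
         ('upper', ALLIN, A2Z, +1),
         (None, ALLIN, white + newline, +1),
         (None, ALLNOTIN, alpha + white + newline, +1),
         (None, EOF, None, -4))

matched, taglist, head = run_table("hello WORLD 42", table)
```

The last entry jumps back four entries until the end of the text is reached; only then does the default jump_match of +1 leave the table and report a match.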
Starting with version 2.1.0, the Tagging Engine supports carrying along an optional context object. The object can be used for storing data specific to the tagging procedure, error information, etc.
	  You can access the context object by using a Python function
	  as tag object which is then called with the context object
	  as last argument if CallTag is used as command
	  flag or in matching functions which are called as a result
	  of the Call or CallArg commands.
	
	  To remain backward compatible, the context object is only
	  provided as last argument if given to the tag()
	  function.
	
New commands which make use of the context object at a lower level will eventually appear in the Tagging Engine in future releases.
	  The commands and constants used here are integers defined in
	  Constants/TagTables.py and imported into the
	  package's root module. For the purpose of explaining the
	  taken actions we assume that the tagging engine was called
	  with tag(text,table,start=0,stop=len(text)). The
	  current head position is indicated by x.
	
| Command | Matching Argument | Action | 
| Fail | Here | Causes the engine to fail matching at the current head position. | 
| Jump | To | Causes the engine to perform a relative jump by jump_no_match entries. | 
| AllIn | string | Matches all characters found in text[x:stop] up to the first that is not included in string. At least one character must match. | 
| AllNotIn | string | Matches all characters found in text[x:stop] up to the first that is included in string. At least one character must match. | 
| AllInSet | set | Matches all characters found in text[x:stop] up to the first that is not included in the string set. At least one character must match. Note: String sets only work with 8-bit text. Use AllInCharSet if you plan to use the tag table with 8-bit and Unicode text. | 
| AllInCharSet | CharSet object | Matches all characters found in text[x:stop] up to the first that is not included in the CharSet. At least one character must match. | 
| Is | character | Matches iff text[x] == character. | 
| IsNot | character | Matches iff text[x] != character. | 
| IsIn | string | Matches iff text[x] is in string. | 
| IsNotIn | string | Matches iff text[x] is not in string. | 
| IsInSet | set | Matches iff text[x] is in set. Note: String sets only work with 8-bit text. Use IsInCharSet if you plan to use the tag table with 8-bit and Unicode text. | 
| IsInCharSet | CharSet object | Matches iff text[x] is contained in the CharSet. | 
| Word | string | Matches iff text[x:x+len(string)] == string. | 
| WordStart | string | Matches all characters up to the first occurrence of string in text[x:stop]. If string is not found, the command does not match and the head position remains unchanged. Otherwise, the head stays on the first character of string in the found occurrence. At least one character must match. | 
| WordEnd | string | Matches all characters up to the first occurrence of string in text[x:stop]. If string is not found, the command does not match and the head position remains unchanged. Otherwise, the head stays on the last character of string in the found occurrence. | 
| sWordStart | TextSearch object | Same as WordStart except that the TextSearch object is used to perform the necessary action (which can be much faster) and zero matching characters are allowed. | 
| sWordEnd | TextSearch object | Same as WordEnd except that the TextSearch object is used to perform the necessary action (which can be much faster). | 
| sFindWord | TextSearch object | Uses the TextSearch object to find the given substring. If found, the tagobj is assigned only to the slice of the substring. The characters leading up to it are ignored. The head position is adjusted to right after the substring -- just like for sWordEnd. | 
| Call | function | Calls the matching function(text,x,stop) or function(text,x,stop,context) if a context object was provided to the tag() function call. The function must return the index  The entry is considered to be matching, iff  | 
| CallArg | (function,[arg0,...]) | Same as Call except that function(text,x,stop[,arg0,...]) or function(text,x,stop[,arg0,...],context) (if a context object is used) is being called. The command argument must be a tuple. | 
| Table | tagtable or ThisTable | Matches iff tagtable matches text[x:stop]. This calls the engine recursively. In case of success the head position is adjusted to point right after the match and the returned taglist is made available in the subtags field of this table's taglist entry. You may pass the special constant ThisTable instead of a Tag Table to refer to the current table. | 
| SubTable | tagtable or ThisTable | Same as Table except that the subtable reuses this table's tag list for its tag list. The subtags entry is set to None. You may pass the special constant ThisTable instead of a Tag Table to refer to the current table. | 
| TableInList | (list_of_tables,index) | Same as Table except that the matching table to be used is read from the list_of_tables at position index whenever this command is executed. This makes self-referencing tables possible which would otherwise not be possible (since Tag Tables are immutable tuples). Note that it can also introduce circular references, so be warned! | 
| SubTableInList | (list_of_tables,index) | Same as TableInList except that the subtable reuses this table's tag list. The subtags entry is set to None. | 
| EOF | Here | Matches iff the head position is beyond stop. The match recorded by the Tagging Engine is text[stop:stop]. | 
| Skip | offset | Always matches and moves the head position to x + offset. | 
| Move | position | Always matches and moves the head position to slice[position]. Negative indices move the head to slice[len(slice)+position+1], e.g. (None,Move,-1) moves to EOF. slice refers to the current text slice being worked on by the Tagging Engine. | 
| JumpTarget | Target String | Always matches, does not move the head position. This command is only used internally by the Tag Table compiler, but can also be used for writing Tag Table definitions, e.g. to follow the path the Tagging Engine takes through a Tag Table definition. | 
| Loop | count | Remains undocumented for this release. | 
| LoopControl | Break/Reset | Remains undocumented for this release. | 
The following flags can be added to the command integers above:
CallTag
		Instead of appending (tagobj,l,r,subtags)
		to the taglist upon successful matching, call
		tagobj(taglist,text,l,r,subtags) or
		tagobj(taglist,text,l,r,subtags,context)
		if a context object was passed to the
		tag() function.
AppendMatch
		Instead of appending (tagobj,l,r,subtags)
		to the taglist upon successful matching, append the
		match found as string.
		Note that this will produce non-standard taglists!
		It is useful in combination with join()
		though and can be used to implement smart split()
		replacement algorithms.
AppendToTagobj
		Instead of appending (tagobj,l,r,subtags)
		to the taglist upon successful matching, call
		tagobj.append((None,l,r,subtags)).
AppendTagobj
		Instead of appending (tagobj,l,r,subtags)
		to the taglist upon successful matching, append
		tagobj itself.
		Note that this can cause the taglist to have a non-standard format, i.e. functions relying on the standard format could fail.
		This flag is mainly intended to build
		join-lists usable by the
		join()-function (see below).
LookAhead
		If this flag is set, the current position of the head is reset to
		l (the left position of
		the match) after a successful match.
		This is useful to implement lookahead strategies.
		Using the flag has no effect on the way the tagobj itself is treated, i.e. it will still be processed in the usual way.
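The different ways of recording a match can be summarised in a short plain-Python sketch. The function record_match is hypothetical, and the strings in the flags set are illustrative stand-ins for the integer flag constants, which the real engine adds to the command integer instead.

```python
def record_match(flags, tagobj, taglist, text, l, r, subtags=None, context=None):
    """Record one successful match text[l:r] and return the new head position."""
    if 'CallTag' in flags:
        args = (taglist, text, l, r, subtags)
        if context is not None:             # context goes last, only if given
            args += (context,)
        tagobj(*args)
    elif 'AppendMatch' in flags:
        taglist.append(text[l:r])           # non-standard: a plain string
    elif 'AppendToTagobj' in flags:
        tagobj.append((None, l, r, subtags))
    elif 'AppendTagobj' in flags:
        taglist.append(tagobj)              # non-standard: the object itself
    elif tagobj is not None:
        taglist.append((tagobj, l, r, subtags))     # default behaviour
    # Resetting the head to l leaves the handling of tagobj untouched.
    return l if 'LookAhead' in flags else r


text = 'hello world'
taglist = []
collector = []

head_default = record_match(set(), 'word', taglist, text, 0, 5)
head_match = record_match({'AppendMatch'}, 'word', taglist, text, 6, 11)
record_match({'AppendToTagobj'}, collector, taglist, text, 0, 5)
head_peek = record_match({'LookAhead'}, None, taglist, text, 0, 5)
```

After these calls taglist mixes a standard tuple with a plain string, which is exactly the non-standard format the notes above warn about.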
Some additional constants that can be used as argument or relative jump position:
	  Internally, the Tag Table is used as program for a state
	  machine which is coded in C and accessible through the
	  package as tag() function along with the
	  constants used for the commands (e.g. AllIn, AllNotIn,
	  etc.). Note that in computer science one normally
	  differentiates between finite state machines, pushdown
	  automata and Turing machines. The Tagging Engine offers all
	  these levels of complexity depending on which techniques you
	  use, yet the basic structure of the engine is best compared
	  to a finite state machine.
	
Tip: if you are getting an error 'call of a non-function' while writing a table definition, you probably have a missing ',' somewhere in the tuple !
Writing these Tag Tables by hand is not always easy. However, since Tag Tables can easily be generated using Python code, it is possible to write tools which convert meta-languages into Tag Tables which then run on all platforms supported by mxTextTools at nearly C speeds.
Mike C. Fletcher has written a nice tool for generating Tag Tables using an EBNF notation. You may want to check out his SimpleParse add-on for mxTextTools.
Recently, Tony J. Ibbs has also started to work in this direction. His meta-language for mxTextTools aims at simplifying the task of writing Tag Table tuples.
More references to third party extensions or applications built on top of mxTextTools can be found in the Add-ons Section.
The package includes a nearly complete Python emulation of the Tagging Engine in the Examples subdirectory (pytag.py). Though it is unsupported, it might still be of some use since it has a builtin debugger that will let you step through the Tag Tables as they are executed. See the source for further details.
As an alternative you can build a version of the Tagging Engine that provides lots of debugging output. See mxTextTools/Setup for explanations on how to do this. When enabled, the module will create several .log files containing the debug information of various parts of the implementation whenever the Python interpreter is run with the debug flag enabled (python -d). These files should give a fairly good insight into the workings of the Tagging Engine (though it still isn't as elegant as it could be).
	  Note that the debug version of the module is almost as fast
	  as the regular build, so you might as well do regular work
	  with it.
    
    
     
	  The TextSearch object is immutable and usable for one search
	  string per object only. However, once created, the
	  TextSearch objects can be applied to as many text strings as
	  you like -- much like compiled regular expressions. Matching
	  is done exact (doing translations on-the-fly if supported by
	  the search algorithm).
	 
	  Furthermore, the TextSearch objects can be pickled and
	  implement the copy protocol as defined by the copy
	  module. Comparisons and hashing are not implemented (the
	  objects are stored by id in dictionaries).
	 
	  Depending on the search algorithm, TextSearch objects can
	  search in 8-bit strings and/or Unicode. Searching in memory
	  buffers is currently not supported. Accordingly, the search
	  string itself may also be an 8-bit string or Unicode.
	 
	      In older versions of mxTextTools there were two separate
	      constructors for search objects:  
	      Note: The FastSearch algorithm is *not* included
	      in the public release of mxTextTools.
	     
	     
		   
		      Not included in the public release of
		      mxTextTools.  
		   
		   
		  This function supports keyword arguments.
	       
	       
	       
	     
	      To provide some help for reflection and pickling the
	      TextSearch object gives (read-only) access to these
	      attributes.
	     
	     
	      The TextSearch object has the following methods:
	     
	     
	       
	       
	     
	      Note that translating the text before doing the search
	      often results in a better performance. Use
	       
	  The CharSet object is an immutable object which can be used
	  for character set based string operations like text
	  matching, searching, splitting etc.
	 
	  CharSet objects can be pickled and implement the copy
	  protocol as defined by the copy module as well as the
	  'in'-protocol, so that  
	  The objects support both 8-bit strings and UCS-2 Unicode in
	  both the character set definition and the various methods.
	  Mixing of the supported types is also allowed.  Memory
	  buffers are currently not supported.
	 
	     
		   
		  The constructor supports the re-module syntax for
		  defining character sets: "a-e" maps to "abcde" (the
		  backslash can be used to escape the special meaning
		  of "-", e.g.  r"a\-e" maps to "a-e") and "^a-e" maps
		  to the set containing all but the characters
		  "abcde".
		 
		  Note that the special meaning of "^" only applies if
		  it appears as first character in a CharSet
		  definition. If you want to create a CharSet with the
		  single character "^", then you'll have to use the
		  escaped form: r"\^". The non-escape form "^" would
		  result in a CharSet matching all characters.
		 
		  To add the backslash character to a CharSet you have
		  to escape with itself: r"\\".
		 
		  Watch out for the Python quoting semantics in these
		  explanations: the small r in front of some of these
		  strings makes them raw Python string literals, which
		  means that no interpretation of backslashes is
		  applied: r"\\" == "\\\\" and r"a\-e" == "a\\-e".
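The definition syntax can be made concrete with a small plain-Python parser. This is only a sketch of the rules stated above and not the package's implementation; parse_charset and contains are hypothetical helper names.

```python
def parse_charset(definition):
    """Return (negated, chars) for a re-style character set definition."""
    # "^" only has its special meaning as the very first character.
    negated = definition.startswith('^')
    body = definition[1:] if negated else definition

    # Pass 1: resolve backslash escapes into (character, was_escaped) pairs.
    tokens = []
    i = 0
    while i < len(body):
        if body[i] == '\\' and i + 1 < len(body):
            tokens.append((body[i + 1], True))
            i += 2
        else:
            tokens.append((body[i], False))
            i += 1

    # Pass 2: expand "a-e" ranges; an escaped "-" is an ordinary character.
    chars = set()
    i = 0
    while i < len(tokens):
        char = tokens[i][0]
        if i + 2 < len(tokens) and tokens[i + 1] == ('-', False):
            last = tokens[i + 2][0]
            chars.update(chr(code) for code in range(ord(char), ord(last) + 1))
            i += 3
        else:
            chars.add(char)
            i += 1
    return negated, chars


def contains(parsed, char):
    """Membership test that honours the negation flag."""
    negated, chars = parsed
    return (char in chars) != negated
```

For example, parse_charset("^") yields a negated empty set, which is why the unescaped form matches all characters, while parse_charset(r"\^") contains just the caret.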
	       
	     
	      To provide some help for reflection and pickling the
	      CharSet object gives (read-only) access to these
	      attributes.
	     
	     
	      The CharSet object has these methods:
	     
	     
	       
		   
	       
		   
	       
	       
	       
		   
	     
	  These functions are defined in the package:
	 
	 
		   
		  Returns a tuple  
		  In case of a non match (success == 0), it points to
		  the error location in text.  If you provide a tag
		  list it will be used for the processing. 
		 
		  Passing  
		   
		  This function supports keyword arguments.
	       
	       
		  The format expected as joinlist is similar to
		  a tag list: it is a sequence of tuples
		   
		  The optional argument sep is a separator to be used
		  in joining the slices together, it defaults to the
		  empty string (unlike string.join). start and stop
		  allow to define the slice of joinlist the function
		  will work in.
		
		 
		  Important Note: The syntax used for negative
		  slices is different than the Python standard: -1
		  corresponds to the first character *after* the string,
		  e.g. ('Example',0,-1) gives 'Example' and not 'Exampl',
		  like in Python. To avoid confusion, don't use negative
		  indices. 
		 
		  This function can handle mixed 8-bit string /
		  Unicode input. Coercion is always towards Unicode.
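The join list semantics, including the unusual negative indices, can be sketched in plain Python as follows. The helper name join_slices is hypothetical; accepting plain strings as entries and applying the len+index+1 rule to every negative index (not only -1) are assumptions made for this illustration.

```python
def join_slices(joinlist, sep=''):
    """Join the slices named by (string, left, right) entries of a join list."""
    parts = []
    for entry in joinlist:
        if isinstance(entry, str):
            parts.append(entry)         # assumption: plain strings are used as is
            continue
        string, left, right = entry[:3]
        # -1 refers to the position *after* the string, unlike Python slicing.
        if left < 0:
            left += len(string) + 1
        if right < 0:
            right += len(string) + 1
        parts.append(string[left:right])
    return sep.join(parts)


whole = join_slices([('Example', 0, -1)])
mixed = join_slices([('Example text', 0, 7), ' ', ('more words', 5, -1)])
dashed = join_slices([('abc', 0, 1), ('abc', 2, 3)], sep='-')
```

Only indices are stored until the final join, so no intermediate substrings need to be kept around while the list is being built.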
	       
	       
	       
		  A few restrictions apply, though:
		 
		  If one of these conditions is not met, a ValueError
		  is raised.  
		 
		  This function can handle mixed 8-bit string /
		  Unicode input. Coercion is always towards Unicode.
	       
	       
		  Note that the translation string used is generated
		  at import time. Locale settings will only have an
		  effect if set prior to importing the package. 
		 
		  This function is almost twice as fast as the one in
		  the string module. 
		 
		  This function can handle mixed 8-bit string /
		  Unicode input. Coercion is always towards Unicode.
	       
	       
		  This function can handle mixed 8-bit string /
		  Unicode input. Coercion is always towards Unicode.
	       
	       
		  This function can handle 8-bit string or Unicode
		  input.
	       
	       
		  This function can handle mixed 8-bit string /
		  Unicode input. Coercion is always towards Unicode.
	       
	       
		  replacements must be a list of tuples (replacement,
		  left, right).  The replacement string is then used
		  to replace the slice text[left:right].
		 
		  Note that the replacements do not affect one another
		  with respect to indexing: indices always refer to the
		  original text string.
		
		 
		  Replacements may not overlap. Otherwise a ValueError
		  is raised.
		 
		  This function can handle mixed 8-bit string /
		  Unicode input. Coercion is always towards Unicode.
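The indexing rule for slice replacements is easiest to see in code. The following plain-Python sketch uses a hypothetical helper apply_replacements; sorting the tuples by their left index is an assumption of the sketch, as the text above does not state whether a particular order is required.

```python
def apply_replacements(text, replacements):
    """Replace text[left:right] by replacement for each (replacement, left, right)."""
    parts = []
    position = 0
    # Assumption: process the slices in ascending order of their left index.
    for replacement, left, right in sorted(replacements, key=lambda item: item[1]):
        if left < position or right < left:
            raise ValueError('replacements overlap or are malformed')
        parts.append(text[position:left])   # untouched text before the slice
        parts.append(replacement)
        position = right
    parts.append(text[position:])           # untouched remainder
    return ''.join(parts)


result = apply_replacements('Hello World', [('Goodbye', 0, 5), ('Moon', 6, 11)])
```

Both index pairs refer to the original 'Hello World', even though the first replacement changes the length of the text.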
	       
	       
		  This function can handle 8-bit string and
		  Unicode input.
	       
	       
		  This function can handle mixed 8-bit string /
		  Unicode input. Coercion is always towards Unicode.
		 
	       
		  This function can handle mixed 8-bit string /
		  Unicode input. Coercion is always towards Unicode.
		   
	       
		  This is a special case of string.split() that has
		  been optimized for single character splitting
		  running 40% faster. 
		 
		  This function can handle mixed 8-bit string /
		  Unicode input. Coercion is always towards Unicode.
	       
	       
		  If the character is not found, the second string is
		  empty. nth may also be negative: the search is then
		  done from the right and the first string is empty in
		  case the character is not found.  
		 
		  The splitting character itself is not included in
		  the two substrings. 
		 
		  This function can handle mixed 8-bit string /
		  Unicode input. Coercion is always towards Unicode.
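The behaviour for positive and negative nth can be sketched in plain Python. The name split_at_nth is hypothetical, the default nth=1 is an assumption, and raising ValueError for nth=0 is a choice made for this sketch only.

```python
def split_at_nth(text, char, nth=1):
    """Split text at the nth occurrence of char into two substrings."""
    if nth > 0:
        position = -1
        for _ in range(nth):
            position = text.find(char, position + 1)
            if position < 0:
                return text, ''             # not found: second string is empty
    elif nth < 0:
        position = len(text)
        for _ in range(-nth):
            position = text.rfind(char, 0, position)
            if position < 0:
                return '', text             # not found: first string is empty
    else:
        raise ValueError('nth must not be 0')   # assumption of this sketch
    # The splitting character itself is dropped.
    return text[:position], text[position + 1:]


from_left = split_at_nth('a.b.c', '.')
from_right = split_at_nth('a.b.c', '.', -1)
```

Searching from the right is handy for things like separating a file name from its last extension.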
	       
	       
		  If no suffix is found to be matching, None is
		  returned.  An empty suffix ('') matches the
		  end-of-string. 
		 
		  The optional 256 char translate string is used to
		  translate the text prior to comparing it with the
		  given suffixes. It uses the same format as the
		  search object translate strings. If not given, no
		  translation is performed and the match done exact.
		  On-the-fly translation is not supported for Unicode
		  input.
		 
		  This function can handle either 8-bit strings or
		  Unicode. Mixing these input types is not supported.
	       
	       
		  If no prefix is found to be matching, None is
		  returned. An empty prefix ('') matches the
		  start-of-string. 
		 
		  The optional 256 char translate string is used to
		  translate the text prior to comparing it with the
		  given prefixes. It uses the same format as the
		  search object translate strings. If not given, no
		  translation is performed and the match done exact.
		  On-the-fly translation is not supported for Unicode
		  input.
		 
		  This function can handle either 8-bit strings or
		  Unicode. Mixing these input types is not supported.
	       
	       
		  The following combinations are considered to be
		  line-ends: '\r', '\r\n', '\n'; they may be used in
		  any combination.  The line-end indicators are
		  removed from the strings prior to adding them to the
		  list.
		 
		  This function allows dealing with text files from
		  Macs, PCs and Unix origins in a portable way.
 		 
		  This function can handle 8-bit string and
		  Unicode input.
	       
	       
		  Line ends are treated just like for splitlines() in
		  a portable way.  
 		 
		  This function can handle 8-bit string and
		  Unicode input.
	       
	       
		  This function is just here for completeness. It
		  works in the same way as string.split(text).  Note
		  that CharSet().split() gives you much more control
		  over how splitting is performed. whitespace is
		  defined as given below (see Constants).
 		 
		  This function can handle 8-bit string and
		  Unicode input.
	       
	       
		  Unicode input is not supported.
	       
	       
		  Unicode input is not supported.
	       
	       
	       
		  Returns a character set for string: a bit encoded
		  version of the characters occurring in string.
		 
		  If logic is 0, then all characters not in
		  string will be in the set. 
		 
		  Unicode input is not supported.
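A bit encoded character set can be pictured as a bitmap with one bit per 8-bit character code. The sketch below is plain Python 3; the names make_bitset and in_bitset are hypothetical, and the layout (32 bytes, byte index code >> 3, bit code & 7) is an assumption chosen for illustration, not a statement about the package's internal format.

```python
def make_bitset(string, logic=1):
    """Return a 32-byte bitmap with one bit set per character in string."""
    bits = bytearray(32)                    # 256 bits, one per 8-bit code
    for char in string:
        code = ord(char)
        if code > 255:
            raise ValueError('only 8-bit characters are supported')
        bits[code >> 3] |= 1 << (code & 7)
    if not logic:
        # logic == 0: the set holds all characters *not* in string
        bits = bytearray(byte ^ 0xFF for byte in bits)
    return bytes(bits)


def in_bitset(char, bitset):
    """Test membership with one index operation and one bit mask."""
    code = ord(char)
    return bool(bitset[code >> 3] & (1 << (code & 7)))


vowels = make_bitset('aeiou')
non_vowels = make_bitset('aeiou', 0)
```

The appeal of this encoding is that membership tests take constant time regardless of how many characters the set contains.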
	       
	       
		  Same as  
		  Unicode input is not supported.
	       
	       
		  Find the first occurrence of any character from set
		  in  
		  Unicode input is not supported.
	       
	       
		  Strip all characters in text[start:stop] appearing
		  in set.  mode indicates where to strip (<0: left;
		  =0: left and right; >0: right). set must be a
		  string obtained with  
		  Unicode input is not supported.
	       
	       
		  Split text[start:stop] into substrings using set,
		  omitting the splitting parts and empty
		  substrings.  
		  Unicode input is not supported.
	       
	       
		  Split text[start:stop] into substrings using set, so
		  that every second entry consists only of characters
		  in set.  
		  Unicode input is not supported.
	       
	     
	  TextTools.py also defines some other functions, but
	  these are left undocumented since they may disappear in future
	  releases.
	 
     
	  The package exports these constants. They are defined in
	  Constants/Sets.
	 
	  Note that Unicode defines many more characters in the
	  following categories. The character sets defined here are
	  restricted to ASCII (and parts of Latin-1) only.
	 
	 
	       
	       
	       
	       
	       
	       
	       
	       
	       
	       
	       
	       
	       
	       
	       
	       
	       
	       
	     
	  The Examples/ subdirectory of the package contains a
	  few examples of how tables can be written and used. Here is a
	  non-trivial example for parsing HTML (well, most of it):
	 
	  I hope this doesn't scare you away :-) ... it's
	  fast as hell.
    
    
     
      Entries enclosed in brackets are packages (i.e. they are
      directories that include a __init__.py file). Ones with
      slashes are just ordinary subdirectories that are not accessible
      via  
      The package TextTools imports everything needed from the other
      components. It is sometimes also handy to do a  
      Examples/ contains a few demos of what the Tag Tables
      can do.
     
    
    
    
     
      Mike C. Fletcher is working on a Tag Table generator called SimpleParse.
      It works as a parser-generating front end to the Tagging Engine
      and converts an EBNF style grammar into a Tag Table directly
      useable with the  
      Tony J. Ibbs has started to work on a meta-language
      for mxTextTools. It aims at simplifying the task of writing
      Tag Table tuples using a Python style syntax. It also gets rid
      of the annoying jump offset calculations.
     
      Andrew Dalke has started work on a parser generator called Martel built
      upon mxTextTools which takes a regular expression grammar for a
      format and turns the resultant parsed tree into a set of
      callback events emulating the XML/SAX API. The results look very
      promising!
     
	  eGenix.com is providing commercial support for this
	  package. If you are interested in receiving information
	  about this service please see the eGenix.com
	  Support Conditions.
     
	  © 1997-2000, Copyright by Marc-André Lemburg;
	  All Rights Reserved.  mailto: mal@lemburg.com
	 
	  © 2000-2001, Copyright by eGenix.com Software GmbH,
	  Langenfeld, Germany; All Rights Reserved.  mailto: info@egenix.com
	 
	  This software is covered by the eGenix.com Public
	  License Agreement. The text of the license is also
	  included as file "LICENSE" in the package's main directory.
	 
	   By downloading, copying, installing or otherwise using
	  the software, you agree to be bound by the terms and
	  conditions of the eGenix.com Public License
	  Agreement. 
     Things that still need to be done:
	 Things that changed from 2.0.3 to 2.1.0:
	 
	  Version 2.1.0 introduces full Unicode support to mxTextTools
	  and the Tagging Engine. As a result, a few things had to be
	  restructured and modified. Hopefully, the new design
	  decisions will provide more room for future enhancements.
	 
	  The new version is expected to behave nearly 100% backward
	  compatible to previous versions. If needed, aliases or
	  factory functions were provided to maintain interface
	  compatibility.
	 Things that changed from 2.0.2 to 2.0.3:
	 Things that changed from 2.0.0 to 2.0.2:
	 Things that changed from 1.1.1 to 2.0.0:
	 Things that changed from 1.1.0 to 1.1.1:
	 Things that changed from 1.0.2 to 1.1.0:
	 Things that changed from 1.0.1 to 1.0.2:
	 Things that changed from 1.0.0 to 1.0.1:
	 Things that changed from the really old TagIt module version 0.7 to mxTextTools
	1.0.0:
	 
     
    TextSearch Object
    
	
TextSearch Object Constructors
	
	    
Earlier versions provided two separate constructors: BMS() for
	      Boyer-Moore and FS() for the (unpublished)
	      FastSearch algorithm. With 2.1.0 the interface was
	      changed to merge these two constructors into one having
	      the algorithm type as parameter.
	    
	      
		    TextSearch(match,translate=None,algorithm=default_algorithm)
		  algorithm defines the algorithm to
		  use. Possible values are:
		
		  
algorithm defaults to BOYERMOORE (or
		  FASTSEARCH if available) for 8-bit match strings and
		  TRIVIAL for Unicode match strings.
		translate is an optional
		  translate-string like the one used in the module
		  're', i.e. a 256 character string mapping the
		  ordinals of the base character set to new
		  characters. It is supported by the BOYERMOORE and
		  the FASTSEARCH algorithms only.  
		
		    BMS(match[,translate])
		    FS(match[,translate])
TextSearch Object Instance Variables
	
	    
	      
		    match
		    translate
		    algorithm
TextSearch Object Instance Methods
	
	    
	      
		    search(text,[start=0,stop=len(text)])
		Search for the match string in text, looking only at the
		slice [start:stop], and return
		the slice (l,r) where the substring was
		found, or (start,start) if it was not
		found.
		    find(text,[start=0,stop=len(text)])
		Search for the match string in text, looking only at the
		slice [start:stop], and return
		the index where the substring was found, or
		-1 if it was not found. This interface is
		compatible with string.find.
		    findall(text,start=0,stop=len(text))
		Same as search(), but return a list of
		all non-overlapping slices (l,r) where
		the match string can be found in text.
		string.translate() to do that efficiently.
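
The return conventions of search(), find() and findall() can be illustrated with a minimal pure-Python sketch. It does not use mxTextTools at all: the helper names search_slice, find_index and findall_slices are hypothetical and only mimic the documented results on top of str.find. The LOWER table at the end is likewise an assumption, showing one way to lower-case ASCII text with translate() before a case-insensitive lookup.

```python
import string


def search_slice(match, text, start=0, stop=None):
    """Mimic search(): slice (l, r) of the first hit, else (start, start)."""
    stop = len(text) if stop is None else stop
    index = text.find(match, start, stop)
    if index < 0:
        return (start, start)
    return (index, index + len(match))


def find_index(match, text, start=0, stop=None):
    """Mimic find(): index of the first hit, else -1."""
    stop = len(text) if stop is None else stop
    return text.find(match, start, stop)


def findall_slices(match, text, start=0, stop=None):
    """Mimic findall(): list of all non-overlapping slices (l, r)."""
    if not match:
        raise ValueError("this sketch expects a non-empty match string")
    stop = len(text) if stop is None else stop
    slices = []
    while True:
        index = text.find(match, start, stop)
        if index < 0:
            return slices
        slices.append((index, index + len(match)))
        # Continue behind the hit so that slices never overlap.
        start = index + len(match)


# Hypothetical translate table: maps ASCII upper case to lower case.
LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

sample = "Spam and SPAM and spam"
case_sensitive = findall_slices("spam", sample)
case_insensitive = findall_slices("spam", sample.translate(LOWER))
```

Note that a failed search() is signalled by an empty slice (start,start) rather than by an exception, so callers would compare l and r instead of testing for None.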
	CharSet Object
    
	
c in charset works as
	  expected. Comparisons and hashing are not implemented (the
	  objects are stored by id in dictionaries).
	CharSet Object Constructor
	
	    
	      
		    CharSet(definition)
		  definition may be an 8-bit string or
		  Unicode. 
		CharSet Object Instance Variables
	
	    
	      
		    definition
CharSet Object Instance Methods
	
	    
	      
		    contains(char)
		  
		    search(text[, direction=1, start=0, stop=len(text)])
		  Searches text[start:stop] for the first
		character included in the character set. Returns
		None if no such character is found or the
		index position of the found character.
		direction defines the search direction:
		  a positive value searches forward starting from
		  text[start], while a negative value
		  searches backwards from text[stop-1].
	      
	      
		    match(text[, direction=1, start=0, stop=len(text)])
		  Looks for the longest match of characters in
		  text[start:stop] which appear in the
		character set. Returns the length of this match as an
		integer.
		direction defines the match direction:
		  a positive value searches forward starting from
		  text[start] giving a prefix match,
		  while a negative value searches backwards from
		  text[stop-1] giving a suffix match.
	      
	      
		    split(text[, start=0, stop=len(text)])
		Splits text[start:stop] into a list of
		substrings using the character set definition,
		omitting the splitting parts and empty substrings.
	      
		    splitx(text[, start=0, stop=len(text)])
		Splits text[start:stop] into a list of
		substrings using the character set definition, such
		that every second entry consists only of characters in
		the set.
	      
		    strip(text[, where=0, start=0, stop=len(text)])
		Strips all characters in text[start:stop]
		appearing in the character set.  
		where indicates where to strip (<0:
		  left; =0: left and right; >0: right).
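
The difference between the CharSet methods is easiest to see side by side. The following is a rough pure-Python sketch, assuming a plain Python set of characters in place of a CharSet object; the cs_* helper names are hypothetical and the real C implementation may treat edge cases (for example text that starts with a set character) differently.

```python
from itertools import groupby


def cs_search(chars, text, direction=1):
    """Index of the first (or, backwards, last) character in chars, or None."""
    if direction > 0:
        indices = range(len(text))
    else:
        indices = range(len(text) - 1, -1, -1)
    for i in indices:
        if text[i] in chars:
            return i
    return None


def cs_match(chars, text, direction=1):
    """Length of the longest prefix (or suffix) made up of characters in chars."""
    sequence = text if direction > 0 else reversed(text)
    length = 0
    for ch in sequence:
        if ch not in chars:
            break
        length += 1
    return length


def cs_split(chars, text):
    """Substrings between runs of set characters; empty strings are omitted."""
    runs = groupby(text, key=lambda ch: ch in chars)
    return ["".join(run) for in_set, run in runs if not in_set]


def cs_splitx(chars, text):
    """Like cs_split, but keep the separators as every second entry."""
    parts = ["".join(run) for _, run in groupby(text, key=lambda ch: ch in chars)]
    # Assumption: pad with '' so that separators always sit at odd indices.
    if text and text[0] in chars:
        parts.insert(0, "")
    return parts


def cs_strip(chars, text, where=0):
    """Strip set characters: where < 0 left, where == 0 both, where > 0 right."""
    left, right = 0, len(text)
    if where <= 0:
        left = cs_match(chars, text, 1)
    if where >= 0:
        right -= cs_match(chars, text[left:], -1)
    return text[left:right]


separators = set(" ,")
fields = cs_split(separators, "a, b,c")
fields_with_separators = cs_splitx(separators, "a, b,c")
```

Here fields only holds the payload substrings, while fields_with_separators alternates payload and separator runs, so joining it gives back the original text.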
	      Functions
    
	
	    
	      
		    tag(text,tagtable,sliceleft=0,sliceright=len(text),taglist=[],context=None)
		  text may be an 8-bit string or
		  Unicode. tagtable must be either a Tag
		  Table definition (a tuple of tuples) or a compiled
		  TagTable() object matching the text
		  string type. Tag Table definitions are automatically
		  compiled into TagTable() objects by this
		  function.
		Returns a tuple (success, taglist,
		  nextindex), where nextindex indicates the
		  next index to be processed after the last character
		  matched by the Tag Table.
		Passing None as taglist results in no
		  tag list being created at all. 
		context is an optional extension to the
		  Tagging Engine introduced in version 2.1.0 of
		  mxTextTools. If given, it is made available to the
		  Tagging Engine during the scan and can be used for
		  e.g. CallTag.
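
As an illustration of how a (success, taglist, nextindex) result might be consumed, here is a small sketch that walks a tag list recursively. It assumes the default tag list layout in which every entry is a tuple (tagobject, left, right, subtags); the dump_taglist helper is hypothetical, and the sample result is written by hand for this example rather than produced by the Tagging Engine.

```python
def dump_taglist(text, taglist, depth=0):
    """Return one indented line per tag, showing the tagged substring."""
    lines = []
    for tagobject, left, right, subtags in taglist:
        lines.append("%s%s: %r" % ("  " * depth, tagobject, text[left:right]))
        if subtags:
            # Nested tables contribute their own tag list as subtags.
            lines.extend(dump_taglist(text, subtags, depth + 1))
    return lines


text = '<a href="x">'

# Hand-written stand-in for the tuple returned by tag(text, some_table).
success, taglist, nextindex = (
    1,
    [("htmltag", 0, 12, [
        ("tagname", 1, 2, None),
        ("tagattr", 3, 11, [
            ("name", 3, 7, None),
            ("value", 9, 10, None),
        ]),
    ])],
    12,
)

lines = dump_taglist(text, taglist) if success else []
for line in lines:
    print(line)
```

Checking success first and comparing nextindex with len(text) is a simple way to detect input that was only partially matched.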
		
		    join(joinlist[,sep='',start=0,stop=len(joinlist)])
		  Entries in joinlist may be tuples
		  (string,l,r[,...]) (the resulting
		  string will then include the slice
		  string[l:r]) or strings (which are
		  copied as a whole). Extra entries in the tuple are
		  ignored. 
		
		    cmp(a,b)
		    joinlist(text,list[,start=0,stop=len(text)])
		Produces a joinlist suitable for passing to
		join() from a list of tuples
		(replacement,l,r,...) in such a way that all
		slices text[l:r] are replaced by the given
		replacement. 
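
The interplay of joinlist() and join() can be sketched in pure Python as follows. This is not the C implementation: make_joinlist and join_entries are hypothetical stand-ins, and the sketch assumes that the replacement tuples are sorted by position and do not overlap.

```python
def make_joinlist(text, replacements):
    """Build a join list replacing each text[l:r] by its replacement.

    Assumes replacements is sorted ascending and free of overlaps.
    """
    result = []
    position = 0
    for entry in replacements:
        replacement, left, right = entry[:3]  # extra tuple entries are ignored
        if left > position:
            # Unchanged text is referenced by slice instead of being copied.
            result.append((text, position, left))
        result.append(replacement)
        position = right
    if position < len(text):
        result.append((text, position, len(text)))
    return result


def join_entries(joinlist, sep=""):
    """Concatenate plain strings and (string, l, r, ...) slices."""
    parts = []
    for entry in joinlist:
        if isinstance(entry, tuple):
            string, left, right = entry[:3]
            parts.append(string[left:right])
        else:
            parts.append(entry)
    return sep.join(parts)


text = "Hello World"
parts = make_joinlist(text, [("Goodbye", 0, 5)])
rewritten = join_entries(parts)
```

Keeping unchanged regions as (string,l,r) references means the original text is only sliced once, at join time, which is presumably the reason for the two-step design.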
		
		  
		    upper(string)
		    lower(string)
		    is_whitespace(text,start=0,stop=len(text))
		    replace(text,what,with,start=0,stop=len(text))
		    multireplace(text,replacements,start=0,stop=len(text))
		    find(text,what,start=0,stop=len(text))
		    findall(text,what,start=0,stop=len(text))
		Returns a list of slices (left,right) meaning that
		what can be found at text[left:right].
		
		    collapse(text,separator=' ')
		    charsplit(text,char,start=0,stop=len(text))
		    splitat(text,char,nth=1,start=0,stop=len(text))
		    suffix(text,suffixes,start=0,stop=len(text)[,translate])
		    prefix(text,prefixes,start=0,stop=len(text)[,translate])
		    splitlines(text)
		    countlines(text)
		    splitwords(text)
		    str2hex(text)
		    hex2str(hex)
		    isascii(text)
		    set(string[,logic=1])
		    invset(string)
		Same as set(string,0).  
		
		    setfind(text,set[,start=0,stop=len(text)])
		  text[start:stop]. set
		  must be a string obtained from set().
		
		    setstrip(text,set[,start=0,stop=len(text),mode=0])
		  set().
		
		    setsplit(text,set[,start=0,stop=len(text)])
		  set must be a string
		  obtained from set().
		
		    setsplitx(text,set[,start=0,stop=len(text)])
		  set must be a string obtained
		  from set().
		Constants
    
	
	    
	      
		    a2z
		    A2Z
		    a2z
		    umlaute
		    Umlaute
		    alpha
		    a2z
		    german_alpha
		    number
		    alphanumeric
		    white
		    newline
		    formfeed
		    whitespace
		    any
		    *_charset
		    *_set
		    tagtable_cache
		    BOYERMOORE, FASTSEARCH, TRIVIAL
	Examples of Use
    
	
    from simpleparse.stt.TextTools import *
    error = '***syntax error'			# error tag obj
    tagname_set = set(alpha+'-'+number)
    tagattrname_set = set(alpha+'-'+number)
    tagvalue_set = set('"\'> ',0)
    white_set = set(' \r\n\t')
    tagattr = (
	   # name
	   ('name',AllInSet,tagattrname_set),
	   # with value ?
	   (None,Is,'=',MatchOk),
	   # skip junk
	   (None,AllInSet,white_set,+1),
	   # unquoted value
	   ('value',AllInSet,tagvalue_set,+1,MatchOk),
	   # double quoted value
	   (None,Is,'"',+5),
	     ('value',AllNotIn,'"',+1,+2),
	     ('value',Skip,0),
	     (None,Is,'"'),
	     (None,Jump,To,MatchOk),
	   # single quoted value
	   (None,Is,'\''),
	     ('value',AllNotIn,'\'',+1,+2),
	     ('value',Skip,0),
	     (None,Is,'\'')
	   )
    valuetable = (
	# ignore whitespace + '='
	(None,AllInSet,set(' \r\n\t='),+1),
	# unquoted value
	('value',AllInSet,tagvalue_set,+1,MatchOk),
	# double quoted value
	(None,Is,'"',+5),
	 ('value',AllNotIn,'"',+1,+2),
	 ('value',Skip,0),
	 (None,Is,'"'),
	 (None,Jump,To,MatchOk),
	# single quoted value
	(None,Is,'\''),
	 ('value',AllNotIn,'\'',+1,+2),
	 ('value',Skip,0),
	 (None,Is,'\'')
	)
    allattrs = (# look for attributes
	       (None,AllInSet,white_set,+4),
	        (None,Is,'>',+1,MatchOk),
	        ('tagattr',Table,tagattr),
	        (None,Jump,To,-3),
	       (None,Is,'>',+1,MatchOk),
	       # handle incorrect attributes
	       (error,AllNotIn,'> \r\n\t'),
	       (None,Jump,To,-6)
	       )
    htmltag = ((None,Is,'<'),
	       # is this a closing tag ?
	       ('closetag',Is,'/',+1),
	       # a comment ?
	       ('comment',Is,'!',+8),
		(None,Word,'--',+4),
		('text',sWordStart,BMS('-->'),+1),
		(None,Skip,3),
		(None,Jump,To,MatchOk),
		# a SGML-Tag ?
		('other',AllNotIn,'>',+1),
		(None,Is,'>'),
		    (None,Jump,To,MatchOk),
		   # XMP-Tag ?
		   ('tagname',Word,'XMP',+5),
		    (None,Is,'>'),
		    ('text',WordStart,'</XMP>'),
		    (None,Skip,len('</XMP>')),
		    (None,Jump,To,MatchOk),
		   # get the tag name
		   ('tagname',AllInSet,tagname_set),
		   # look for attributes
		   (None,AllInSet,white_set,+4),
		    (None,Is,'>',+1,MatchOk),
		    ('tagattr',Table,tagattr),
		    (None,Jump,To,-3),
		   (None,Is,'>',+1,MatchOk),
		   # handle incorrect attributes
		   (error,AllNotIn,'> \n\r\t'),
		   (None,Jump,To,-6)
		  )
    htmltable = (# HTML-Tag
		 ('htmltag',Table,htmltag,+1,+4),
		 # not HTML, but still using this syntax: error or inside XMP-tag !
		 (error,Is,'<',+3),
		  (error,AllNotIn,'>',+1),
		  (error,Is,'>'),
		 # normal text
		 ('text',AllNotIn,'<',+1),
		 # end of file
		 ('eof',EOF,Here,-5),
		)
      
	
	Package Structure
    
    
[TextTools]
       [Constants]
              Sets.py
              TagTables.py
       Doc/
       [Examples]
              HTML.py
              Loop.py
              Python.py
              RTF.py
              RegExp.py
              Tim.py
              Words.py
              altRTF.py
              pytag.py
       [mxTextTools]
              test.py
       TextTools.py
    
    import.
    from
      simpleparse.stt.TextTools.Constants.TagTables import *.
    Optional Add-Ons for mxTextTools
    
    
tag() function.
    Support
    
	
Copyright & License
    
	
History & Future