First, Emscripten. Now, cheerp.
Ragel State Machine
C is 343 times faster than Python!
Rewriting a Python library function in C drops execution time from 110 microseconds to 320 nanoseconds. That’s a respectable optimization.
Out-of-order parsing
More and more, I think parsing needs to be modernized. Parsing was the hot subject in the 1960s, and may very well have been synonymous with computer science up through the early 1970s, but has stayed in that form ever since.
We can do better, and we have to relax the restrictions imposed by thinking of parsed texts as strings.
Yes, Tomita was a big step (http://pdf.aminer.org/000/211/233/recognition_of_context_free_and_stack_languages.pdf), but that was 30 years ago (!!) and a pretty small step, in hindsight.
High-polish use of subprocess.Popen
Python has a pretty decent facility to launch and operate a child process, subprocess.Popen. However, as with many “scripting systems”, it’s easy to write something that mostly works but is rough around the edges and not all that robust, because sub-processes don’t all run in 100 milliseconds without errors.
First off, avoid the use of subprocess.call. It waits for the process to terminate before returning, which means that if your subprocess hangs, your Python program will hang.
Second, if you’re using Python 2.7 on POSIX, use subprocess32, which is a backport of subprocess from Python 3.
Third, stop using os.popen in favor of subprocess.Popen. It’s a little more complicated, but worth it.
Fourth, keep in mind that Popen.communicate() also blocks until the process terminates, so don’t use it either. Also, communicate() doesn’t seem to handle large amounts of output on some systems (reports of “no more than 65535 bytes of output due to Linux pipe implementation”).
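One way to avoid blocking forever on a hung child is to poll it against a deadline. The following is only a minimal sketch of that idea: run_with_deadline, its parameters and the 50 ms poll interval are hypothetical names and values chosen for illustration, not part of the subprocess module.

```python
import subprocess
import sys
import time


def run_with_deadline(args, seconds, poll_interval=0.05):
    """Run args; return the exit code, or None if the child was killed.

    Hypothetical helper: polls instead of blocking in call()/communicate().
    """
    proc = subprocess.Popen(args)
    deadline = time.time() + seconds
    while proc.poll() is None:          # None means "still running"
        if time.time() > deadline:
            proc.kill()                 # give up on the child
            proc.wait()                 # reap it so no zombie is left behind
            return None
        time.sleep(poll_interval)       # real code could do other work here
    return proc.returncode


if __name__ == '__main__':
    # A child that would otherwise hang for 30 seconds is cut off after 1.
    print(run_with_deadline(
        [sys.executable, '-c', 'import time; time.sleep(30)'], 1))
```

The sleep in the loop is where a real program would do its own work while the child runs.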
Reading stdout
Now, on to actual details. Let’s call dir on Windows and number each line in the output:
ldir.py
    from __future__ import print_function
    import subprocess
    import sys

    # dir is a cmd.exe built-in, so it has to be run through the shell
    proc = subprocess.Popen(args=['dir'] + sys.argv[1:],
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            shell=True)

    linenum = 1
    while True:
        line = proc.stdout.readline()
        if len(line) == 0:
            # an empty read means the pipe was closed
            break
        print("%d: %s" % (linenum, line), end='')
        linenum += 1
We are merging stderr and stdout together in this example (stderr=subprocess.STDOUT). If we run this on C:\Windows\System32 like so
ldir.py /s C:\Windows\System32
we’ll start seeing output like this
    1: Volume in drive C is OSDisk
    2: Volume Serial Number is 062F-8F58
    3:
    4: Directory of c:\Windows\System32
    5:
    6: 04/23/2014 06:09 PM <DIR> .
    7: 04/23/2014 06:09 PM <DIR> ..
    8: 04/12/2011 12:38 AM <DIR> 0409
    9: 01/14/2014 11:21 AM <DIR> 1033
    10: 06/10/2009 02:16 PM 2,151 12520437.cpx
    11: 06/10/2009 02:16 PM 2,233 12520850.cpx
    12: 02/14/2013 09:34 PM 131,584 aaclient.dll
    13: 11/20/2010 08:24 PM 3,727,872 accessibilitycpl.dll
And since this is under our control, we can pipe to more, we can control-C to stop it, and so on.
There are still complications, mostly around buffering. The default for Popen in Python 2 is to not buffer data (bufsize=0), but that only affects the reader; the source process can still buffer its own output. You can trick programs into thinking they are writing to a console, which usually means that their output will be unbuffered or line-buffered. You can use the low-level pty module directly (on Unix) or something higher-level like pexpect:
- Unix: http://pexpect.sourceforge.net/pexpect.html
- Windows: https://bitbucket.org/mherrmann_at/wexpect
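Using the pty module directly might look roughly like the following on Unix. This is a minimal sketch, not a replacement for pexpect: run_with_pty is a hypothetical helper name, it only collects output (no input is sent to the child), and it assumes a platform where the pty module is available.

```python
import os
import pty          # Unix only
import subprocess
import sys


def run_with_pty(args):
    """Run args with stdout/stderr attached to a pseudo-terminal.

    Hypothetical helper: returns everything the child wrote, as bytes.
    """
    master_fd, slave_fd = pty.openpty()
    proc = subprocess.Popen(args, stdout=slave_fd, stderr=slave_fd,
                            close_fds=True)
    os.close(slave_fd)              # the child keeps its own copy
    chunks = []
    while True:
        try:
            data = os.read(master_fd, 1024)
        except OSError:
            # Linux reports EIO once the child side has been closed
            break
        if not data:
            break
        chunks.append(data)
    os.close(master_fd)
    proc.wait()
    return b''.join(chunks)


if __name__ == '__main__':
    # The child believes it is writing to a terminal.
    print(run_with_pty([sys.executable, '-c',
                        'import sys; print(sys.stdout.isatty())']))
```

Because the child sees a terminal, most programs switch to line-buffered or unbuffered output without any changes on their side. Note that a terminal also translates newlines, so the captured lines typically end in \r\n.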
Of course, not all processes write lines. You can use a more general approach by reading bytes from the stdout pipe. The previous program modified to read 128 bytes at a time looks like this:
    while True:
        # read fixed-size chunks instead of lines
        line = proc.stdout.read(128)
        if len(line) == 0:
            break
        print("<%d>: %s" % (linenum, line), end='')
        linenum += 1
and produces this output (with the numbers changed to stand out more):
    <1>: Volume in drive C is OSDisk
    Volume Serial Number is 062F-8F58

    Directory of c:\Windows\System32

    04/23/2014 06:09 PM <DIR<2>: > .
    04/23/2014 06:09 PM <DIR> ..
    04/12/2011 12:38 AM <DIR> 0409
    01/14/2014 11:21 AM <DIR><3>: 1033
    06/10/2009 02:16 PM 2,151 12520437.cpx
    06/10/2009 02:16 PM 2,233 12520850.cpx
    02/14/201<4>: 3 09:34 PM 131,584 aaclient.dll
    11/20/2010 08:24 PM 3,727,872 accessibilitycpl.dll
And of course this would work for programs that are reading and writing octet streams, not just text.
Reading stdout and stderr
Sometimes you want to read from stderr and stdout independently, because you need to react to output on stderr. You can’t just call read or readline, because it could block waiting for input on a handle.
On Unix systems, you can call select on the stdout and stderr handles, because select works on any file descriptor, including pipes. On Windows, select only works on sockets, so you need threads and a queue, with one blocking read per handle. Since this works on Unix as well, we can use it for both.
    import Queue
    import threading

    io_q = Queue.Queue(5)  # somewhat arbitrary, readers block when queue is full

    def read_from_stream(identifier, stream):
        for line in stream:
            io_q.put((identifier, line))
        if not stream.closed:
            stream.close()

    threading.Thread(target=read_from_stream, name='stdout-stream',
                     args=('STDOUT', proc.stdout)).start()
    threading.Thread(target=read_from_stream, name='stderr-stream',
                     args=('STDERR', proc.stderr)).start()

    while True:
        try:
            item = io_q.get(False)  # non-blocking get
        except Queue.Empty:
            if proc.poll() is not None:
                break
        else:
            identifier, line = item
            print(identifier + ':', line, end='')
This works well, but has a flaw – it is basically busy-waiting, burning CPU while waiting for input to come in. We’re doing this because we don’t want to block at the reader level – consider that in a more complex situation, we might want to do processing while waiting for input to come in. There’s also a race condition here, in that we could check the queue, it could be empty, then a reader could put something in the queue while we are checking proc.poll(), and then we could miss that item.
We could do something like this, which is not clean, but works:
    import Queue
    import threading

    io_q = Queue.Queue(5)

    def read_from_stream(identifier, stream):
        if not stream:
            print('%s does not exist' % identifier)
            io_q.put(('EXIT', identifier))
            return
        for line in stream:
            io_q.put((identifier, line))
        if not stream.closed:
            stream.close()
        print('%s is done' % identifier)
        io_q.put(('EXIT', identifier))  # tell the main loop this reader is finished

    active = 2
    threading.Thread(target=read_from_stream, name='stdout-stream',
                     args=('STDOUT', proc.stdout)).start()
    threading.Thread(target=read_from_stream, name='stderr-stream',
                     args=('STDERR', proc.stderr)).start()

    while True:
        try:
            item = io_q.get(True, 1)  # block for up to one second
        except Queue.Empty:
            if proc.poll() is not None:
                break
        else:
            identifier, line = item
            if identifier == 'EXIT':
                active -= 1
                if active == 0:
                    break
            else:
                print(identifier + ':', line, end='')

    proc.wait()
    print(proc.returncode)
Now there’s no busy-waiting, and we exit instantly. This is also a lot of scaffolding to write for each time we use subprocess.Popen(). One answer would be to wrap this up into a helper class, or rather a set of helper classes.
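A helper class along those lines could be sketched as follows. This is only one possible shape, under assumptions: StreamReader and its methods are hypothetical names, the class always pipes both stdout and stderr, and it blocks in the queue rather than doing other work between reads.

```python
import subprocess
import sys
import threading

try:
    import queue                # Python 3
except ImportError:
    import Queue as queue       # Python 2


class StreamReader(object):
    """Hypothetical helper: iterate over (identifier, line) pairs
    coming from a child's stdout and stderr."""

    _EXIT = object()            # sentinel posted by each reader thread

    def __init__(self, args, **popen_kwargs):
        self.proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                                     stderr=subprocess.PIPE, **popen_kwargs)
        self._queue = queue.Queue(5)
        self._streams = {'STDOUT': self.proc.stdout,
                         'STDERR': self.proc.stderr}
        for identifier, stream in self._streams.items():
            worker = threading.Thread(target=self._pump,
                                      args=(identifier, stream))
            worker.daemon = True
            worker.start()

    def _pump(self, identifier, stream):
        # One blocking reader per handle, as in the code above.
        for line in stream:
            self._queue.put((identifier, line))
        stream.close()
        self._queue.put((self._EXIT, identifier))

    def __iter__(self):
        active = len(self._streams)
        while active:
            identifier, payload = self._queue.get()
            if identifier is self._EXIT:
                active -= 1     # that reader has hit end-of-file
            else:
                yield identifier, payload
        self.proc.wait()        # both pipes are closed; collect the exit code


if __name__ == '__main__':
    reader = StreamReader([sys.executable, '-c',
                           "import sys; print('out'); sys.stderr.write('err\\n')"])
    for identifier, line in reader:
        print(identifier, line)
    print(reader.proc.returncode)
```

Since every reader thread always posts its sentinel, the consumer can block on the queue without a timeout and still terminate once both pipes are closed.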
stdin and stdout and stderr
There are two cases here:
- Feeding a pipe that takes input and returns output.
- Running an interactive process
For the former, you could just have a file or pseudo-file feed the Popen process instead of subprocess.PIPE. For the latter, you definitely need to trick your Popen process into thinking that it’s writing to a TTY, otherwise the buffering will kill you.
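The first case, feeding the child from a file, might look like this. It is a minimal sketch: feed_file is a hypothetical helper name, and it reads all of the child's output at once, which is fine for small outputs but gives up the incremental reading shown earlier.

```python
import os
import subprocess
import sys
import tempfile


def feed_file(args, path):
    """Run args with the file at path as its stdin; return (exit code, output).

    Hypothetical helper, for illustration only.
    """
    with open(path, 'rb') as source:
        # Passing an open file as stdin lets the OS feed the child directly,
        # so the parent never has to write into a pipe.
        proc = subprocess.Popen(args, stdin=source, stdout=subprocess.PIPE)
        output = proc.stdout.read()
        proc.stdout.close()
        proc.wait()
    return proc.returncode, output


if __name__ == '__main__':
    handle, path = tempfile.mkstemp()
    os.write(handle, b'hello\n')
    os.close(handle)
    child = [sys.executable, '-c',
             'import sys; sys.stdout.write(sys.stdin.read().upper())']
    print(feed_file(child, path))
    os.remove(path)
```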
TBD
Reference
http://pymotw.com/2/subprocess/
http://sharats.me/the-ever-useful-and-neat-subprocess-module.html
http://pexpect.readthedocs.org/en/latest/FAQ.html#whynotpipe
Python 2.7 end-of-life extended to 2020
Guido van Rossum evidently announced at PyCon 2014 that Python 2.7 would be supported through 2020 (the previous cut-off date was 2015).
http://www.i-programmer.info/news/216-python/7179-python-27-to-be-maintained-until-2020.html
A Hacker News thread started by intimating that this was partly in response to Red Hat needing long-term support for the version of Python in RHEL 7, and that version of Python will almost certainly be Python 2.7.
https://news.ycombinator.com/item?id=7581434
I doubt that it was anything more than a very minor contributing factor.
Assignment operator could not be generated
What does this warning mean, and how do you fix it?
warning C4512: '<some type>' : assignment operator could not be generated
The compiler will auto-generate some class members for you
- default constructor (if no other constructor is explicitly declared)
- destructor
- copy constructor (if no move constructor or move assignment operator is explicitly declared)
- copy assignment operator (if no move constructor or move assignment operator is explicitly declared)
C++11 added two new auto-generated class members (and deprecated generating the copy constructor and copy assignment operator when a destructor is explicitly declared):
- move constructor (if no copy constructor, copy assignment operator, move assignment operator or destructor is explicitly declared)
- move assignment operator (if no copy constructor, copy assignment operator, move constructor or destructor is explicitly declared)
Compiler-generated functions are public and non-virtual. As a reminder, here are the signatures of all of these functions:
    class Object {
        Object();                                 // default constructor
        Object(const Object& other);              // copy constructor
        Object(Object&& other);                   // move constructor
        Object& operator=(const Object& other);   // copy assignment operator
        Object& operator=(Object&& other);        // move assignment operator
        ~Object();                                // destructor
    };
So, what if you can’t actually create a meaningful copy assignment operator? For example, if you have const data, you can’t assign to it. Remember that the auto-generated copy assignment operator just generates assignment operator code for each member of the class, recursively, and you can’t assign to const int, you can only construct it.
    struct ConstantOne {
        ConstantOne() : value(1) {}
        const int value;
    };

    int main(int /*argc*/, char ** /*argv*/) {
        ConstantOne b;
        return 0;
    }
This will give you a warning when you compile, because the auto-generated assignment operator is technically illegal, and so the compiler won’t generate it. It’s a warning, because your code probably doesn’t need an assignment operator. For Visual C++, you’ll see something like this:
warning C4512: 'ConstantOne' : assignment operator could not be generated
You have several options. The easiest is just to declare an assignment operator without a body. As long as you never actually try to use the assignment operator, you’ll be fine. And, the contract for this object says that assignment would be illegal anyway, so you’ll get a valid-to-you compile error if you accidentally try to use it.
    struct ConstantOne {
        ConstantOne() : value(1) {}
        const int value;
    private:
        // declared but never defined
        ConstantOne& operator=(const ConstantOne&);
    };

    int main(int /*argc*/, char ** /*argv*/) {
        ConstantOne b;
        ConstantOne c;
        c = b;  // error: operator= is private
        return 0;
    }
The standard practice is to make these private, to reinforce that they are not meant to be used. If you compile code that uses the assignment operator, as main does here, you’ll get a compile-time error.
error C2248: 'ConstantOne::operator =' : cannot access private member declared in class 'ConstantOne'
And in C++11, there’s even a keyword to add here to declare that it indeed should not be allowed:
    struct ConstantOne {
        ConstantOne() : value(1) {}
        const int value;
        ConstantOne& operator=(const ConstantOne&) = delete;
    };
Note that you don’t need the trickery of making it private, and you get a nicer compile-time error if you try to use the assignment operator.
This happens in big real-world projects quite often. In fact, it happens often enough that C++11 added the = delete syntax for it. Visual Studio 2013 and up, GCC 4.7 and up, and Clang 2.9 and up support = delete and = default.
Now, there is another approach to the assignment operator when you have const data – generate an assignment operator that can write to const data with const_cast. You’re almost always doing the wrong thing if you do this, but sometimes it has to be done anyway. It looks horrible, and it is horrible. But maybe it’s necessary.
    struct ConstantOne {
        ConstantOne() : value(1) {}
        const int value;
        ConstantOne& operator=(const ConstantOne& that) {
            if (this != &that) {
                // cast away const in order to overwrite the member
                *const_cast<int*>(&this->value) = that.value;
            }
            return *this;
        }
    };

    int main(int /*argc*/, char ** /*argv*/) {
        ConstantOne b;
        ConstantOne c;
        c = b;
        return 0;
    }
The reason this is horrible is that you are violating the contract: you’re altering a constant value on the left-hand side of the assignment. Formally, writing to an object that was declared const is undefined behavior in C++, so the compiler is allowed to assume the value never changes. Depending on your circumstances, it may still be what you have to do.
Git code review
http://fabiensanglard.net/git_code_review/index.php
Stony Brook Algorithm repository
The descent to C
An explanation of the C programming language for programmers used to Python, Ruby etc.
http://www.chiark.greenend.org.uk/~sgtatham/cdescent/