Maybe useful as a foundation for tabular reporting? http://www.matts411.com/post/grid/
Github project page at: https://github.com/mmurph211/Grid
MSR, the International Working Conference on Mining Software Repositories, has as its 2013 challenge the mining of Stack Overflow data.
I wonder what I could find…
Mock battle between types and tests, where each side actually programmed using the other side’s magic bullet.
Microsoft Research paper
Good lecture: Effective ML lecture by Yaron Minsky
Note to self – learn ML.
RFC 2822 – Internet Message Format (which obsoleted/replaced RFC 822, but consider them the same thing). It specifies the format of an Internet Message, which to the rest of us is an email message.
One piece of RFC 2822 specifies the format of email addresses. Someone wrote a Perl regex to validate email addresses according to RFC 822. It is nightmarish.
I’ll reproduce 10% of it here.
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?: \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:( ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0 31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\ ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+ (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
Yes, the full regex is 82 lines long. And it still doesn’t handle the full spec: it does not handle comments in email addresses (comments can be nested, and a regex expresses a regular grammar, which can’t count and so can’t handle nesting). This was not written by hand, of course; it was generated by Perl code that combines multiple simpler regular expressions.
By comparison, the grammar for email addresses, including the bits of the foundational grammar, is about 30 lines, and readable. The actual email address grammar is 17 lines of ABNF, and there look to be around 15 lines of base elements (like “comment” and “FWS”, folding white space).
So why do people use regular expressions? Because, with current tools, it’s faster to write them, and they execute faster too. The “time to write” factor is so severe that you see people using ad-hoc (and broken) regular expression parsing even in areas where a real parser would be ideal.
This is also an example of getting into trouble in the first place by not having a grammar: email addresses evolved ad hoc, and grammars were retrofitted to them. Still, the point stands: the goal should be to make high-level, high-powered parsing techniques really easy to use and apply.
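To make the “regexes can’t count” point concrete, here is a minimal sketch in Python (all names and the pattern are my own illustration, not the RFC 2822 grammar): a plain loop with a depth counter strips nested comments, something no regular expression can do, and a deliberately simple regex then checks the remainder.

```python
import re

# An ad-hoc pattern in the spirit of the validators discussed above:
# it handles the common case but is nowhere near the full RFC grammar.
SIMPLE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def strip_comments(s):
    """Remove RFC-2822-style (possibly nested) parenthesized comments.

    Needs a depth counter, which a regular expression cannot keep."""
    out, depth = [], 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        elif depth == 0:
            out.append(ch)
    return "".join(out)

def looks_like_address(s):
    return bool(SIMPLE.match(strip_comments(s).strip()))

print(looks_like_address("bob(work(office))@example.com"))  # True
print(looks_like_address("not an address"))                 # False
```

The division of labor is the point: the counting lives in ordinary code (or, better, in a real parser generated from the grammar), and the regex is left with only the regular part of the job.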
Visualizing bug occurrence and bug-fix rates, both for better tracking (and thus projection of effort and time) and to help improve the development process.
I liked the visualization techniques, especially the visual effort estimation, which can show strong tendencies in one direction (over-estimating, under-estimating, etc.).
This paper tries to work out prediction models for bug fix times and bug fix effort, seeing if attributes of a bug report can be used to build a recommender system.
This paper investigates a possible link between technical quality of software and the rate at which defects can be resolved. Intuitively, we have talked about metrics like “small classes, low coupling, low duplication” as being components of good code, but how true is this?
Predicting bugs by fine-grained churn in code; exact changes at the statement level, rather than gross file-level or commit-level churn.
This paper uses mining of a repository to answer questions about the evolution of production code and test code. It introduces a Change History View visualizing commit-behavior over time, a Growth History View showing relative growth of code over time, and a Test Coverage Evolution View showing test coverage of a system over time.
This paper uses repository mining to find cross-cutting concerns, pieces of functionality not captured into separate modules. Files that are changed together can indicate interdependent code.
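The core of that co-change idea is simple enough to sketch. This is my own toy illustration, not the paper’s method: given the set of files touched by each commit, count how often each pair of files changes together; pairs with high counts are candidates for a shared cross-cutting concern.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: one set of file paths per commit, as mined
# from a version-control log.
commits = [
    {"ui/view.py", "logging/log.py"},
    {"core/model.py", "logging/log.py"},
    {"ui/view.py", "logging/log.py", "core/model.py"},
]

# Count how often each pair of files appears in the same commit.
pair_counts = Counter()
for files in commits:
    for pair in combinations(sorted(files), 2):
        pair_counts[pair] += 1

for pair, n in pair_counts.most_common():
    print(n, pair)
```

In this made-up log, `logging/log.py` co-changes with both other files, which is exactly the signature you’d expect from scattered logging code. A real analysis would normalize by how often each file changes on its own.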
This paper introduces a way to measure interface and implementation stability of libraries.
This paper investigates the lifespan of code smells, finding that, at least in the investigated open source systems, engineers are aware of code smells but not very concerned with their impact. At least in this paper, code smells are considered to be synonymous with anti-patterns at the programming level.
Revision repositories usually provide information about differences as insertions and deletions, making it very hard to track the evolution of individual source lines. This paper develops a method to infer source changes from diffs.
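A rough sense of what such inference involves (this is my own minimal sketch, not the paper’s algorithm): Python’s `difflib.SequenceMatcher` reports `replace` blocks distinct from pure inserts and deletes, so an equal-sized `replace` block can be read as lines modified in place rather than as a delete plus an insert.

```python
import difflib

old = ["def f(x):", "    return x + 1", "", "print(f(1))"]
new = ["def f(x):", "    return x + 2", "", "print(f(1))"]

matcher = difflib.SequenceMatcher(a=old, b=new)
mapping = []  # (old_line_no, new_line_no, kind)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag == "equal":
        for k in range(i2 - i1):
            mapping.append((i1 + k, j1 + k, "unchanged"))
    elif tag == "replace" and (i2 - i1) == (j2 - j1):
        # Same number of lines on both sides: treat as in-place edits,
        # keeping the identity of each line across the revision.
        for k in range(i2 - i1):
            mapping.append((i1 + k, j1 + k, "modified"))

print(mapping)
```

Chaining such mappings across successive revisions is what lets you follow a single source line through a repository’s history; the hard cases (moved lines, unequal replace blocks) are where the paper’s contribution lies.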
This is a thoughtful review and projection of where software development might go, given that many work modes are migrating partially or in whole to the web.
However, one piece missing here is a more thorough exploration of rich web clients, which really don’t exist at the moment. Why don’t they exist? Because they are currently hard to write. However, they are higher-performing than lightweight clients, and always will be. So how can we make rich clients easier to write?
An incomplete list, but these are papers I found useful or interesting.
2011 – Automated Evaluation of Syntax Error Recovery, Maartje de Jonge and Eelco Visser. The paper develops an automated method for parsers to recover when syntax errors are encountered.
2011 – An Algorithm for Layout Preservation in Refactoring Transformations, Maartje de Jonge, Eelco Visser. The paper develops an algorithm for preserving source layout even while manipulating the program as an abstract syntax tree, allowing for more comprehensive refactoring techniques.
2011 – Growing a Language Environment with Editor Libraries, Sebastian Erdweg, Lennart C. L. Kats, Tilman Rendel, Christian Kastner, Klaus Ostermann, Eelco Visser. The paper introduces the idea of extending IDEs through editor libraries, allowing syntactic and semantic features such as syntax coloring or reference following for arbitrary languages.
2011 – Integrated Language Definition Testing: Enabling Test-Driven Language Development, Lennart C. L. Kats, Rob Vermaas, Eelco Visser. The paper promotes the idea and use of a language-parametric testing language (LPTL) to do reusable and generic tests of language definitions. Basically unit tests and test-driven development for language design and implementation.
2010 – Pure and Declarative Syntax Definition: Paradise Lost and Regained, Lennart C. L. Kats, Eelco Visser, Guido Wachsmuth. The paper advocates for scannerless parsing, for ambiguity, for grammars written as naturally as possible, and thus for generalized parsing to be used.
2010 – The Spoofax Language Workbench, Lennart C. L. Kats, Eelco Visser. Spoofax is a language workbench based on Eclipse for developing DSLs, providing a comprehensive environment that integrates syntax definition, program transformation, code generation, and declarative specification of IDE components.
2010 – The Spoofax Language Workbench: Rules for Declarative Specification of Languages and IDEs, Lennart C. L. Kats, Eelco Visser. Implementation details of Spoofax.