Parsing source code - unique identifiers for different languages?

I'm building an application that receives source code as input and analyzes several aspects of the code. It can accept code from many common languages, e.g. C/C++, C#, Java, Python, PHP, Pascal, SQL, and more (however many languages are unsupported, e.g. Ada, Cobol, Fortran). Once the language is known, my application knows what to do (I have different handlers for different languages).

Currently I ask the user to select the programming language the code is written in, and this is error-prone: although users know their programming languages, a small percentage of them occasionally click the wrong option out of carelessness, and that breaks the system (i.e. my analysis fails).

It seems to me like there should be a way to figure out (in most cases) what the language is, from the input text itself. Several notes:

I'm receiving pure text and not file names, so I can't use the extension as a hint.

The user is not required to input complete source files and may also input code snippets (i.e. the include/import part may be missing).

It's clear to me that no algorithm I choose will be 100% foolproof, certainly not for very short inputs (e.g. ones that could be accepted by both Python and Ruby). In those cases I will still need the user's assistance, but I would like to minimize user involvement in the process in order to minimize mistakes.

Examples:

If the text contains "x->y()", I may know for sure it's C++ (?)

If the text contains "public static void main", I may know for sure it's Java (?)

If the text contains "for x := y to z do begin", I may know for sure it's Pascal (?)

My question:

Are you familiar with any standard library/method for figuring out automatically what the language of an input source code is?

What are the unique code "tokens" with which I could reliably differentiate one language from another?

I'm writing my code in Python but I believe the question to be language agnostic.

Thanks

14 answers

#1

Vim has an autodetect filetype feature. If you download the Vim source code you will find a /vim/runtime/filetype.vim file.

For each language it checks the extension of the file and also, for some of them (most common), it has a function that can get the filetype from the source code. You can check that out. The code is pretty easy to understand and there are some very useful comments there.

#2

Build a generic tokenizer and then run a Bayesian filter over the tokens. Use the existing "user checks a box" system to train it.
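
A minimal sketch of this idea in Python, assuming a deliberately crude tokenizer and a multinomial naive Bayes model with Laplace smoothing. The names `tokenize` and `NaiveBayesLanguageGuesser` are made up for illustration, and the (language, source) pairs passed to `train` stand in for whatever the existing checkbox system has recorded:

```python
import math
import re
from collections import Counter, defaultdict

# Crude language-agnostic tokenizer: identifiers, runs of punctuation, numbers.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[^\sA-Za-z_0-9]+|\d+")


def tokenize(source):
    return TOKEN_RE.findall(source)


class NaiveBayesLanguageGuesser:
    def __init__(self):
        self.token_counts = defaultdict(Counter)  # language -> token -> count
        self.sample_counts = Counter()            # language -> training samples
        self.vocabulary = set()

    def train(self, language, source):
        """Record one sample whose language the user confirmed."""
        tokens = tokenize(source)
        self.token_counts[language].update(tokens)
        self.sample_counts[language] += 1
        self.vocabulary.update(tokens)

    def guess(self, source):
        """Return the language with the highest log-probability."""
        tokens = tokenize(source)
        total_samples = sum(self.sample_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for language, counts in self.token_counts.items():
            total_tokens = sum(counts.values())
            score = math.log(self.sample_counts[language] / total_samples)
            for token in tokens:
                # Laplace smoothing so unseen tokens do not zero the score.
                score += math.log((counts[token] + 1) / (total_tokens + vocab_size))
            scores[language] = score
        return max(scores, key=scores.get)


guesser = NaiveBayesLanguageGuesser()
guesser.train("python", "def f(x):\n    return x + 1\n")
guesser.train("python", "for i in range(10):\n    print(i)")
guesser.train("pascal", "for x := 1 to 10 do begin writeln(x); end;")
print(guesser.guess("for y := 2 to 5 do begin end;"))
```

With only three training samples this is a toy; the point is that every correctly labelled submission can be fed back through `train`.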

#3

Here is a simple way to do it. Just run the parser for every language on the input. Whichever language gets the farthest without encountering any errors (or has the fewest errors) wins.

This technique has the following advantages:

You already have most of the code necessary to do this.

The analysis can be done in parallel on multi-core machines.

Most languages can be eliminated very quickly.

This technique is very robust. Languages that might appear very similar under a fuzzy analysis (Bayesian, for example) would likely produce many errors when the actual parser is run.

If a program is parsed correctly in two different languages, then there was never any hope of distinguishing them in the first place.

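
The "fewest errors wins" selection could be sketched as follows. This is only an illustration under stated assumptions: `python_errors` uses the real `ast.parse` from the standard library, but `c_like_errors` is a toy stand-in for a real C-family parser, and the `PARSERS` registry is hypothetical. In the actual application each entry would wrap one of the existing language handlers.

```python
import ast


def python_errors(source):
    """Return 0 if the text parses as Python, 1 otherwise."""
    try:
        ast.parse(source)
        return 0
    except (SyntaxError, ValueError):
        return 1


def c_like_errors(source):
    """Toy stand-in for a real C-family parser: counts suspicious lines."""
    errors = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "//")):
            continue
        if not stripped.endswith((";", "{", "}")):
            errors += 1
    return errors


# Hypothetical registry: language name -> function returning an error count.
PARSERS = {"python": python_errors, "c-like": c_like_errors}


def guess_by_parsing(source, parsers=PARSERS):
    """Return every language tied for the fewest parse errors."""
    errors = {language: count(source) for language, count in parsers.items()}
    fewest = min(errors.values())
    return sorted(lang for lang, n in errors.items() if n == fewest)


print(guess_by_parsing("def f(x):\n    return x\n"))
print(guess_by_parsing("int main() {\n    return 0;\n}\n"))
print(guess_by_parsing("x = 1;"))
```

Returning a list rather than a single name makes the last advantage explicit: when more than one language is tied, the snippet is genuinely ambiguous and the user has to be asked.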

#4

I think the problem is impossible. The best you can do is to come up with some probability that a program is in a particular language, and even then I would guess producing a solid probability is very hard. Problems that come to mind at once:

use of features like the C pre-processor can effectively mask the underlying language altogether

looking for keywords is not sufficient as the keywords can be used in other languages as identifiers

looking for actual language constructs requires you to parse the code, but to do that you need to know the language

what do you do about malformed code?

Those seem enough problems to solve to be going on with.

#5

One program I know of that can even distinguish several different languages within the same file is ohcount. You might get some ideas there, although I don't really know how they do it.

In general you can look for distinctive patterns:

Operators might be an indicator, such as := for Pascal/Modula/Oberon, => or the whole of LINQ in C#

Keywords would be another one as probably no two languages have the same set of keywords

Casing rules for identifiers, assuming the piece of code was written in conformance with best practices. Probably a very weak rule

Standard library functions or types. Especially for languages that usually rely heavily on them, such as PHP, you might just use a long list of standard library functions.

You may create a set of rules, each of which indicates a possible set of languages if it matches. Intersecting the resulting lists will hopefully get you only one language.

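
A small sketch of the rule-intersection idea, assuming plain regular expressions over the raw text (so it ignores comments and strings). The `RULES` table and `ALL_LANGUAGES` set are illustrative and far from complete; they only use language/pattern pairings mentioned in this thread.

```python
import re

ALL_LANGUAGES = {"C", "C++", "PHP", "Perl", "Pascal", "Modula-2", "Oberon", "VB"}

# Each rule: (pattern, set of languages the match is compatible with).
RULES = [
    (re.compile(r"->"), {"C", "C++", "PHP", "Perl"}),
    (re.compile(r"\$\w+"), {"PHP", "Perl"}),
    (re.compile(r":="), {"Pascal", "Modula-2", "Oberon"}),
    (re.compile(r"\bEnd Sub\b"), {"VB"}),
]


def candidate_languages(source):
    """Intersect the language sets of every rule that matches the text."""
    candidates = set(ALL_LANGUAGES)
    for pattern, languages in RULES:
        if pattern.search(source):
            candidates &= languages
    return candidates


print(sorted(candidate_languages("$x->y();")))
print(sorted(candidate_languages("for x := y to z do begin")))
```

An empty result means two matching rules contradicted each other, which is a hint that a match came from a comment or string, or that the rule table is wrong.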

The problem with this approach however, is that you need to do tokenizing and compare tokens (otherwise you can't really know what operators are or whether something you found was inside a comment or string). Tokenizing rules are different for each language as well, though; just splitting everything at whitespace and punctuation will probably not yield a very useful sequence of tokens. You can try several different tokenizing rules (each of which would indicate a certain set of languages as well) and have your rules match to a specified tokenization. For example, trying to find a single-quoted string (for trying out Pascal) in a VB snippet with one comment will probably fail, but another tokenizer might have more luck.

But since you want to perform analysis anyway, you probably have parsers for the languages you support, so you can just try running the snippet through each parser and take the result as an indicator of which language it is (as suggested by OregonGhost as well).

#6

Some thoughts:

$x->y() would be valid in PHP, so ensure that there's no $ symbol if you think C++ (though I think you can store function pointers in a C struct, so this could also be C).

public static void main is Java if it is cased properly - write Main and it's C#. This gets complicated if you take case-insensitive languages like many scripting languages or Pascal into account. The [] attribute syntax in C# on the other hand seems to be rather unique.

You can also try to use the keywords of a language - for example, Option Strict or End Sub are typical for VB and the like, while yield is likely C# and initialization/implementation are Object Pascal / Delphi.

If your application is analyzing the source code anyway, you could try to throw your analysis code at it for every language, and if it fails really badly, it was the wrong language :)

#7

My approach would be:

Create a list of strings or regexes (with and without case sensitivity), where each element is assigned the list of languages it is an indicator for:

class => C++, C#, Java

interface => C#, Java

implements => Java

[attribute] => C#

procedure => Pascal, Modula

create table / insert / ... => SQL

etc. Then parse the file line-by-line, match each element of the list, and count the hits.

The language with the most hits wins ;)

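
The indicator list above translates almost directly into Python. This is a sketch rather than a finished detector: the exact regexes (in particular the one for `[attribute]`) are my own guesses at what each indicator should match, and the names `INDICATORS`, `count_hits` and `guess_language` are hypothetical.

```python
import re
from collections import Counter

# (regex, languages it hints at). re.I marks the case-insensitive indicators.
INDICATORS = [
    (re.compile(r"\bclass\b"), ["C++", "C#", "Java"]),
    (re.compile(r"\binterface\b"), ["C#", "Java"]),
    (re.compile(r"\bimplements\b"), ["Java"]),
    (re.compile(r"^\s*\[\w+(\(.*\))?\]\s*$"), ["C#"]),
    (re.compile(r"\bprocedure\b", re.I), ["Pascal", "Modula"]),
    (re.compile(r"\bcreate\s+table\b|\binsert\s+into\b", re.I), ["SQL"]),
]


def count_hits(source):
    """Go through the text line by line and count hits per language."""
    hits = Counter()
    for line in source.splitlines():
        for pattern, languages in INDICATORS:
            if pattern.search(line):
                hits.update(languages)
    return hits


def guess_language(source):
    """Return the language with the most hits, or None if nothing matched."""
    hits = count_hits(source)
    if not hits:
        return None
    return hits.most_common(1)[0][0]


print(guess_language("public class Foo implements Bar {\n}"))
print(guess_language("CREATE TABLE t (id INT);"))
```

Returning `None` when nothing matches leaves room to fall back to asking the user.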

#8

How about word frequency analysis (with a twist)? Parse the source code and categorise it much like a spam filter does. This way when a code snippet is entered into your app which cannot be 100% identified you can have it show the closest matches which the user can pick from - this can then be fed into your database.

#9

Here's an idea for you. For each of your N languages, find some files in the language, something like 10-20 per language would be enough, each one not too short. Concatenate all files in one language together. Call this lang1.txt. GZip it to lang1.txt.gz. You will have a set of N langX.txt and langX.txt.gz files.

Now, take the file in question and append it to each of the langX.txt files, producing langXapp.txt and the corresponding gzipped langXapp.txt.gz. For each X, find the difference between the sizes of langXapp.txt.gz and langX.txt.gz. The smallest difference will correspond to the language of your file.

Disclaimer: this will work reasonably well only for longer files. Also, it's not very efficient. But on the plus side, you don't need to know anything about the languages; it's completely automatic. It can also detect natural languages and tell French from Chinese, just in case you need it :) But the main reason is that I just think it's an interesting thing to try :)
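
The compression trick can be tried in memory without writing any .gz files. The sketch below assumes `zlib` (the same DEFLATE algorithm gzip uses) and a hypothetical `corpora` dictionary holding the concatenated sample text per language; the two tiny corpora here are placeholders for the 10-20 real files per language suggested above.

```python
import zlib


def compressed_size(text):
    """Size in bytes of the DEFLATE-compressed UTF-8 text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def guess_by_compression(snippet, corpora):
    """Pick the language whose corpus grows least when the snippet is appended."""
    growth = {}
    for language, corpus in corpora.items():
        before = compressed_size(corpus)
        after = compressed_size(corpus + "\n" + snippet)
        growth[language] = after - before
    return min(growth, key=growth.get)


# Placeholder corpora; real ones would be much larger.
corpora = {
    "python": "def add(a, b):\n    return a + b\n\n"
              "for i in range(10):\n    print(add(i, i))\n",
    "pascal": "program Demo;\nvar i: integer;\nbegin\n"
              "  for i := 1 to 10 do\n    writeln(i);\nend.\n",
}

print(guess_by_compression("for i in range(10):\n    print(add(i, i))\n", corpora))
print(guess_by_compression("  for i := 1 to 10 do\n    writeln(i);\n", corpora))
```

One caveat to keep in mind: DEFLATE only looks back through a 32 KB window, so with large corpora only the tail of each langX.txt influences the result.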

#10

The most bulletproof but also most work-intensive way is to write a parser for each language and run them in sequence to see which one accepts the code. This won't work well if the code has syntax errors, though, and you will most probably have to deal with such code, because people do make mistakes. One fast way to implement this is to get a common compiler for every language you support, run each one, and check how many errors it produces.

Heuristics work up to a certain point, and the more languages you support, the less help you get from them. But for the first few versions they are a good start, mostly because they are fast to implement and work well enough in most cases. You could check for specific keywords, frequently used function/class names from the API, some language constructs, etc. The best way is to count how many of these language-specific items a file contains for each possible language. Counting, rather than relying on a single match, helps cope with syntax errors, user-defined functions with names like this() in languages that have no such keyword, and text inside comments and string literals.

Anyhow, you will most likely fail sometimes, so some mechanism for the user to override the language choice is still necessary.

#11

I think you never should rely on one single feature, since the absence in a fragment (e.g. somebody systematically using WHILE instead of for) might confuse you.

Also try to stay away from global identifiers like "IMPORT" or "MODULE" or "UNIT" or INITIALIZATION/FINALIZATION, since they might not always exist, be optional in complete sources, and totally absent in fragments.

Dialects and similar languages (e.g. Modula2 and Pascal) are dangerous too.

I would create simple lexers for a bunch of languages that keep track of key tokens, and then simply calculate the ratio of key tokens to "other" identifiers. Give each token a weight, since some might be a key indicator for disambiguating between dialects or versions.

Note that this is also a convenient way to let users plug in "known" keywords to increase the detection ratio, e.g. by providing identifiers of runtime library routines or types.

#12

Very interesting question. I don't know whether it is possible to distinguish languages by code snippets, but here are some ideas:

One simple way is to watch out for single quotes: in some languages they wrap a single character, whereas in others they can contain a whole string

A unary asterisk or a unary ampersand operator is a fairly certain indication that it's one of C/C++/C#.

Pascal is the only language (of the ones given) to use two characters for assignments :=. Pascal has many unique keywords, too (begin, sub, end, ...)

The class initialization with a function could be a nice hint for Java.

Functions that do not belong to a class eliminate Java (there is no max(), for example)

Naming of basic types (bool vs boolean)

Which reminds me: C++ can look very different across projects (#define boolean int), so you can never guarantee that you found the correct language.

If you run the source code through a hashing algorithm and it looks the same, you're most likely analyzing Perl

Indentation is a good hint for Python

You could use functions provided by the languages themselves - like token_get_all() for PHP - or third-party tools - like pychecker for python - to check the syntax

Summing it up: This project would make an interesting research paper (IMHO) and if you want it to work well, be prepared to put a lot of effort into it.

#13

There is no way of making this foolproof, but I would personally start with operators, since they are in most cases "set in stone" (I can't say this holds true for every language, since I know only a limited set). This would narrow it down quite considerably, but not nearly enough. For instance, "->" is used in many languages (at least C, C++ and Perl).

I would go for something like this:

Create a list of features for each language. These could be operators or commenting style (since most languages use some sort of easily detectable character or character combination).

For instance: some languages have lines that start with the character "#"; these include C, C++ and Perl. Do any languages other than the first two use #include and #define in their vocabulary? If you detect this character at the beginning of a line, the language is probably one of those. If the character is in the middle of the line, the language is most likely Perl.

Also, if you find the pattern := this would narrow it down to some likely languages.

Etc.

I would have a two-dimensional table with languages and patterns found and after analysis I would simply count which language had most "hits". If I wanted it to be really clever I would give each feature a weight which would signify how likely or unlikely it is that this feature is included in a snippet of this language. For instance if you can find a snippet that starts with /* and ends with */ it is more than likely that this is either C or C++.

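
A weighted version of the languages-by-patterns table might look like the sketch below. The feature patterns come from the examples in this answer, but the weights are invented purely for illustration and would need tuning against real snippets; `FEATURES` and `rank_languages` are hypothetical names.

```python
import re

# (pattern, {language: weight}); the weights are made up for illustration.
FEATURES = [
    (re.compile(r"^\s*#\s*(?:include|define)\b", re.M), {"C": 3.0, "C++": 3.0}),
    (re.compile(r"/\*.*?\*/", re.S), {"C": 1.0, "C++": 1.0}),
    (re.compile(r"->"), {"C": 1.0, "C++": 1.0, "Perl": 1.0}),
    (re.compile(r":="), {"Pascal": 2.0}),
    (re.compile(r"\bclass\b"), {"C++": 1.5}),
]


def rank_languages(source):
    """Return (language, score) pairs, best guess first."""
    scores = {}
    for pattern, weights in FEATURES:
        hits = sum(1 for _ in pattern.finditer(source))
        for language, weight in weights.items():
            if hits:
                scores[language] = scores.get(language, 0.0) + hits * weight
    # Sort by descending score, then by name so ties are deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


snippet = "#include <stdio.h>\nclass Foo { };\nFoo *p; p->bar();"
for language, score in rank_languages(snippet):
    print(language, score)
```

The full ranking, not just the winner, is returned so that the remaining languages can be offered to the user in order of likelihood.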

The problem with keywords is that someone might use them as normal variable names or even inside comments. They can be used as a decider (e.g. the word "class" is much more likely in C++ than in C, all else being equal), but you can't rely on them.

After the analysis I would offer the most likely language as the default choice, with the remaining languages listed in order of likelihood and also selectable. The user can then accept your guess by simply clicking a button, or switch it easily.

#14

In answer to 2: if there's a "#!" and the name of an interpreter at the very beginning, then you definitely know which language it is. (Can't believe this wasn't mentioned by anyone else.)

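
Checking for a "#!" line is cheap enough to run before any other heuristic. A minimal sketch, assuming a small hand-written `INTERPRETERS` table (hypothetical and incomplete) and handling the common `/usr/bin/env <interpreter>` form:

```python
import re

# Hypothetical mapping from interpreter name to language; extend as needed.
INTERPRETERS = {
    "python": "Python",
    "perl": "Perl",
    "ruby": "Ruby",
    "php": "PHP",
    "sh": "Shell",
    "bash": "Shell",
}

SHEBANG_RE = re.compile(r"^#!\s*(\S+)(?:\s+(\S+))?")


def language_from_shebang(source):
    """Return the language named by a leading '#!' line, or None."""
    first_line = source.split("\n", 1)[0]
    match = SHEBANG_RE.match(first_line)
    if not match:
        return None
    program = match.group(1).rsplit("/", 1)[-1]
    if program == "env" and match.group(2):
        # "#!/usr/bin/env python3": the interpreter is the next word.
        program = match.group(2).rsplit("/", 1)[-1]
    # Drop a trailing version such as "3" or "3.11".
    name = re.sub(r"[\d.]+$", "", program)
    return INTERPRETERS.get(name)


print(language_from_shebang("#!/usr/bin/env python3\nprint(1)"))
print(language_from_shebang("#!/usr/bin/perl -w\nprint 1;"))
print(language_from_shebang("print(1)"))
```

Since snippets rarely include the first line of a file, a `None` result here just means the other heuristics have to take over.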
