Code to identify the programming language of a text file

Time: 2022-09-13 09:53:33

I'm supposed to write code which, when given a text file (source code) as input, will output which programming language it is. This is the most basic definition of the problem. More constraints follow:


  • I must write this in C++.

  • A wide variety of languages should be recognized - HTML, PHP, Perl, Ruby, C, C++, Java, C#...

  • The number of false positives (wrong recognitions) should be low - it is better to output "unknown" than a wrong result. ("unknown" will appear in the list of probabilities, e.g. as unknown: 100%; see below.)

  • The output should be a list of probabilities for each language the code knows, so if it knows C, Java, and Perl, the output could be, for example: C: 70%, Java: 50%, Perl: 30% (note that the probabilities need not sum to 100%).

  • It should have a good accuracy/speed ratio (speed is favored a bit more).

It would be very nice if the code could be written in a way that makes adding new languages fairly easy, involving just adding "settings/data" for that particular language. I can use anything available - a heuristic, a neural network, black magic. Anything. I'm even allowed to use existing solutions, but: the solution must be free, open source, and allow commercial usage. It must come in the form of easily integrable source code or a static library - no DLLs. However, I prefer writing my own code or just using fragments of another solution; I'm fed up with integrating other people's code. Last note: maybe some of you will suggest FANN (Fast Artificial Neural Network Library) - this is the only thing I cannot use, since it is what we use ALREADY and want to replace.


Now the question is: how would you handle such a task? What would you do? Any suggestions on how to implement this or what to use?


EDIT: based on the comments and answers, I must emphasize some things I forgot: speed is crucial, since this will process thousands of files and is supposed to answer fast - looking at a thousand files should produce answers for all of them in a few seconds at most (the files will of course be small, a few kB each). So trying to compile each one is out of the question. The thing is that I really want probabilities for each language - I want to know that a file is likely to be C or C++ but that the chance it is a bash script is very low. Due to code obfuscation, comments, etc., I think that aiming for 100% accuracy is a bad idea and is in fact not the goal here.


10 Solutions

#1


11  

You have a document classification problem. I suggest you read about naive Bayes classifiers and support vector machines. The articles link to libraries that implement these algorithms, many of which have C++ interfaces.


#2


7  

One simple solution I can think of is to identify the keywords used in different languages. Each identified word scores +1. Then calculate ratio = identified_words / total_words. The language with the highest score wins. Of course there are problems, such as comments, etc., but I think this is a very simple solution that should work in most cases.


#3


3  

I'm sorry, but if you have to parse thousands of files, then your best bet is to look at the file extension. Don't over-engineer a simple problem, or put burdensome requirements on a simple task.


It sounds like you have thousands of source code files and no idea what programming language they were written in. What kind of programming environment do you work in? (Ruling out the possibility of an artificial homework requirement.) I mean, one of the basics of software engineering I can always rely on is that C++ code files have the .cpp extension, Java code files have the .java extension, C code files have the .c extension, etc. Is your company playing fast and loose with these standards? If so, I would be really worried.


#4


2  

If you know that the source files will conform to standards, file extensions are unique to just about every language. I assume that you've already considered this and ruled it out based on some other information.


If you can't use file extensions, the best approach would be to find the features that differ most between languages and use those to determine the file type. For example, for-loop syntax won't vary much between languages, but package/include statements will. If a file imports java.util.*, you know it's a Java file.


#5


2  

As dmckee suggested, you might want to have a look at the Unix file program, whose source is available. The heuristics used by this utility could be a great source of inspiration, and since it is written in C, I guess it qualifies for C++. :) You do not get confidence percentages directly, though; maybe they are used internally?


#6


1  

Take a look at NEdit. It has a syntax-highlighting recognition system, under Syntax Highlighting -> Recognition Patterns. You can browse sample recognition patterns here, or download the program and check out the standard ones.


Here's a description of the highlighting system.


#7


1  

Since the list of languages is known up front, you know the syntax/grammar of each of them. Hence you can, for example, write a function to extract reserved words from the provided source code.


Build a binary tree containing all reserved words for all languages that you support, then walk that tree using the reserved words extracted in the previous step.


If in the end only one possibility is left, that is your language. If you reach the end of the program too soon, then, from where you stopped, you can analyse your position in the tree to work out which languages are still possibilities.


#8


0  

You could try to think about the differences between languages and model them with a binary tree, like "is feature X found?" - if yes, proceed in one direction; if not, proceed in the other.


By constructing this search tree efficiently, you could end up with rather fast code.


#9


0  

This one is not fast and may not satisfy your requirements, but it's just an idea. It should be easy to implement and should give a 100% accurate result.


You could try to compile/execute the input text with different compilers/interpreters (open source or free) and check for errors behind the scenes.


#10


0  

The Sequitur algorithm infers context-free grammars from sequences of terminal symbols. Perhaps you could use that to compare against a set of known production rules for each language.

