воскресенье, 24 августа 2008 г.

Around the CAPTCHA

This article is incomplete. Thus it may contain mistakes. We will answer your questions if you have any. NetworkSecurityResearch@gmail.com



AOL CAPTCHA hacked. Download
eBay Partner Network CAPTCHA hacked.Download

Server project needs MATLAB 2007a Compiler Runtime (MCR) installed.

Abstract

This article is dedicated to the present-day CAPTCHAs, CAPTCHA-recognition approaches and possible ways to improve the CAPTCHA.



Introduction


CAPTCHA - «Completely Automated Public Turing test to tell Computers and Humans Apart» [1]. Any problem that can be easily solved by a human, but cannot be solved by a computer (or which requires excessive computation), can be used as a Turing test. To protect against auto registration, mass mailing in forums and guestbooks, a text-based CAPTCHA was chosen (we will reference it as text-CAPTCHA below). Initially text-CAPTCHA was used in 1997 by AltaVista search system to protect against automated URLs (Uniformed Resource Locator) submitting. An example of a such CAPTCHA is shown on Figure 1. Since then, this type of protection was adopted by other resources: free mail services, blogs, social networks, etc.




Figure 1. An example of text-CAPTCHA by AltaVista (taken from [2]).


First text-CAPTCHAs were developed to be unrecognizable by contepmorary optical character recognition programs (OCRs). It could be done simply by adding noises and distorting the characters. Since then there were text-CAPTCHA engines for forums, guestbooks and catalogues. But, there were not so many such engines, and text-CAPTCHA was default for certain engine types. This resulted in the first wave of text-CAPTCHA recognition systems, and automated spam programs were designed using these techniques. These programs could send spam messages to forums, guestbooks and auto register in the catalogues. The examples of text-CAPTCHAs that could be solved by one of such programs are shown on Figure 2. The capabilities of this program are not limitied to the text-CAPTCHAs shown.



Figure 2. Examples of text-CAPTCHAs


The second wave of text-CAPTCHA recognition systems is related to the decrease of the SMTP spam efficiency. We have managed to contact a real spammer on one of their forums, and he revealed that overall profits from SMTP spam activities had decreased by 43% in the past 3 years. This is why spammers are looking for new ideas to increase the efficiency of their systems all the time, and an obvious way to do it is to send spam from well known web services like Yahoo!, Hotmail, AOL, Gmail, that offer free mailboxes. Spammers are ready to pay significant amounts of money for text-CAPTCHA recognition systems for these web services, but only recently these services started to pay attention to automated recognition. For example, a month has passed after the publication of the automated text-CAPTCHA recognition at http://network-security-research.blogspot.com before Yahoo! updated its CAPTCHA to the new one. Publications of such kind don't help spammers to increase the spam efficiency, but rather bring an attention of a given web service to the vulnerabilities in its CAPTCHA system, forcing a vendor to update it.


Despite the fact that popular web services try to make their CAPTCHAs as resistant to segmentation attacks as possible, there was not a single CAPTCHA presented that couldn't be solved automatically, as shown in [3]. This paper describes a way to automatically recognize text-CAPTCHA by Microsoft, which was designed especially to be segmentation-resistant [4]. Moreover, text-CAPTCHAs that are in use now (at the moment of writing this article, 07/16/08) on web services like Gmail, Yahoo, Hotmail (Figure 3) can also be recognized automatically. We came to this conclusion after testing an utility which has been provided to us by the same person who has given the information on the SMTP spam efficiency. This utility is an automated recognition tool for such text-CAPTCHAs. The results are these: Gmail - 10%, Yahoo! - 10%, Hotmail - 35%. Average time needed for recognition of one text-CAPTCHA was 1 second.




Figure 3. Examples of Gmail, Yahoo! and Hotmail text-CAPTCHAs


Below we will describe methods used for recognition of text-CAPTCHAs shown on Figure 2, and also provide sources of information for creating text-CAPTCHA recognition systems shown on Figure 3.


Approaches to text-CAPTCHA recognition

First approach to text-CAPTCHA recognition is to use human resources to solve CAPTCHAs. The prices vary from $2,5 to $4 for 1000 CAPTCHAs. Pros are obvious: universality, high recognition ratio; cons are slow recognition speed (~20 seconds), high price ($2500 - $4000 for 10 millions of messages, given that it is possible to send 10 messages per day from one registered mailbox without reentering text-CAPTCHA) and unstable amount of solved text-CAPTCHAs. Note that newest such portal we know about appeared just a few days ago ( www.decaptcher.com ). It means that these services are widespread and it’s necessary to find a method to fight with them.


The second approach is connected with machine learning and the image recognition task. textCAPTCHA recognition consists of three stages. These stages are: preprocessing, segmentation and recognition. Preprocessing is telling background and probably noise from characters apart. Segmentation is an extraction of stand-alone symbols from the textCAPTCHA. Recognition works on a stand-alone symbols. These three stages are interconnected. The segmentator efficiency depends on the efficiency of the preprocessing. And the efficiency of the whole system depends on the efficiency of the segmentator and the recognizer. This can be shown by the following:


SYS = Seg*Rec^N (1)

where SYS is a relative frequency of the correctly recognized textCAPTCHAs, Seg is the relative frequency of the correctly segmented text-CAPTCHAs, Rec is the relative frequency of a correctly recognized symbol, N is a number of symbols on the text CAPTCHA. We should use the "relative frequency" term because we deal with limited size sets and don't know the probability distribution function. We suppose that the relative frequency insignificantly differs from the probability due to the Bernully theorem which says that with the large size of set the relative frequency is approximately equal to the probability. Consider the following formula:


(2)

Na/N is a relative frequency. Na – is a number of successes in N tests, e > 0, p – is the possibility of the event. We will say that relative frequency is valid if


(3)

This inequality is the Chebyshev inequality:


(4)

As it was said in [5] the recognition of stand-alone symbols is the simple task for a machine. Moreover, the accuracy of machine recognition supersedes the human recognition. The major problem in the recognition of text-CAPTCHA is the segmentation of the CAPTCHA. That's the reason why CAPTCHA developers usually try to complicate the segmentation as much as possible. Later in this article we will show few different approaches for recognition of simple text-CAPTCHAs. Some of these CAPTCHAs has been already described in different articles about text-CAPTCHAs recognition[3],[6-9], however the number of the simple text-CAPTCHAs doesn't become smaller. We will use MATLAB for preprocessing and segmentation.


Preprocessing, segmentation, recognition

Let’s consider text-CAPTCHA of the AOL portal. The portal provides free-mail services. As the protection from the automated registrations of e-mails the portal uses the text-CAPTCHAs showed in the following figure.



Figure 4. http://www.aol.com text-CAPTCHA

The preprocessing procedure for this CAPTCHA is a deletion the background with subsequent transformation the image into binary form. These actions could be done easily because the symbols are always in a gray-scale. According to experiments we made most CAPTCHAs have color scale lower than RGB(117,117,117). The following code deletes background.


%input: img – input image
%output: img_out – image after preprocessing
sz = size(img);

img_out = zeros(sz(1),sz(2));

if(isrgb(img))

for i=1:sz(1)
for j=1:sz(2)

if(img(i,j,1)<=117 & img(i,j,2)<=117 & img(i,j,3)<=117)

img_out(i,j)=0;

else

img_out(i,j)=1;

end
end
end
else

for i=1:sz(1)
for j=1:sz(2)

if(img(i,j)<=117)

img_out(i,j)=0;

else

img_out(i,j)=1;

end
end
end
end

After the preprocessing we’ve got the image in a binary form. Several such images are shown in the following figure.



Figure 5. text-CAPTCHA after the preprocessing

The next stage is the segmentation. We will use the following algorithm.

1) Scan all image pixels, assigning preliminary labels to nonzero pixels and recording label equivalences in a union-find table.
2) Resolve the equivalence classes using the union-find algorithm described in [10].
3) Relabel the pixels based on the resolved equivalence classes.


This algorithm is realized in MATLAB in function bwlabeln. bwlabel function could be used as well.


This function returns matrix of labels L. The L contains labels of connected components in a binary image. The connecting value in this example is 8, labels for background will be 0, labels for connected components will be from 1 to n, where n is integer number > 0. In the next step we search objects in the matrix those, whose pixels number is more than 30 (this number was received experimentally), because some images after preprocessing contain noise(fig.6).



Figure 6. Noise after preprocessing

Selected objects are symbols we’re looking for. We’ll set size 28x28 to every object. The examples of obtained symbols are shown on the following figure.



Figure 7. Symbols obtained after the segmentation

The code for the segmentation:



%Input: img_bw – the binary image
%Output: vec_out – the matrix of symbols in binary form
[img_labled num]=bwlabeln(~img_bw,8);

vec_out = zeros(784,1);

k1=0;

for ii = 1:num

[x y]=find(img_labled(:,:)==ii);

sz = size(x);

if(sz(1)>30)

k1=k1+1;

matr = zeros(sz(1),2);

for j=1:sz(1)

matr(j,1) = x(j);

matr(j,2) = y(j);

end
%transform coordinates of the symbol into the symbol image
%change symbol size to 28x28
%transform the image into the binary form

k=0;

for i=1:28
for j=1:28

k=k+1;

vec_out(k,k1) = simbol(i,j);

end
end


end

end


The next stage is the counting of correctly segmented text-CAPTCHAs. The correctly segmented text-CAPTCHA is the CAPTCHA with the extracted number of symbols equal to the number of symbols on the text-CAPTCHA when these extracted symbols could be recognized correctly. The set we’re using for counting doesn’t contain CAPTCHAs used to create preprocessing and segmentation units. Improper preprocessing and not correct segmentation could be seen on the following figure:



Figure 8. Incorrect preprocessing and segmentation

The efficiency of the whole system( preprocessing, segmentation and recognition) must be not less than 0.1 in order to approve this method of segmentation.
Thus we should make the following assumptions:
1) The relative frequency of the correct CAPTCHAs segmentation will be equal to the ratio of correctly segmented CAPTCHAs to the all CAPTCHAs
2) The relative frequency of correctly recognized symbols will be equal to the relative frequency of correct recognition of English letters by Feature-based SVM described in [11]. It’s equal to 0.9256. We made this assumption because number of classes for English alphabet letters is 26, and the number of classes in a text-CAPTCHA is usually less. This is so because in most cases the CAPTCHA uses not all letters. Also we’re able to collect larger training set.

By taking into account these assumptions we have the following


Seg = SYS/RecN, (5)
Seg = 0.1/0.9256^8 = 0.19

From 400 text-CAPTCHAs the system incorrectly segmented 88, thus the relative frequency if correct segmentation is 0,78. It’s bigger than estimated minimum so we can propose that the efficiency of our system will be about SYS = 0,78*0,9256^8 = 0.42


The next stage is the data preparation for the symbol recognition algorithm, which in our case is the Polynomial support vector machine ( Polynomial SVM ). Polynomial SVM was chosen due to its simplicity and high efficiency. So we need new set with text-CAPTCHAs wasn’t involved in either creation of preprocessing algorithm or segmentation algorithm. With the help of these algorithms we extract symbols from the set. After the extraction we must sort symbols manually according to their classes. For this text-CAPTCHA we have 26 classes (23456789bcfghjklmnpqrstvwx). The learning set contains 4361 examples. Then the learning set is divided into two sets: the training set and the testing set. In our example we have 90% CAPTCHAs from learning set in the training set, and 10% in the testing set. The less number of errors was made by the machine with the polynomial degree 5. The symbol recognition relative frequency was 0.904. This value is less than our theoretical one. In order to improve it we need to enlarge our learning set. But we will not do it because the expected efficiency of our system SYS = 0,78*0,904^8 = 0.35 is still larger than a minimum 0.1.


The final stage is the testing of our recognition system. So we need set of text-CAPTCHAs weren’t involved in previous stages. We used 100 CAPTCHAs, 37 of them were correctly recognized by our system. Thus the relative frequency is 0.37, which corresponds to our theoretical relative frequency.


Yahoo.com

In January 19 2008 Network Security Research presented the CAPTCHA recognition engine for the Yahoo! Portal. The following will describe both preprocessing and segmentation algorithms used in this engine. The text-CAPTCHA itself is shown on the following figure



Figure 9. text-CAPTCHA used by the Yahoo! at the 19-th of January 2008

This CAPTCHA has no background so the preprocessing algorithm will only turn the image from the gray scale into the black-and-white with the help of im2bw function using the default parameters. For the segmentation of the image we will use the fuzzy k-means clusterization algorithm. This algorithm is realized in fcm MATLAB function. The description of the algorithm could be found in [12]. In order to use this function for the segmentation we must turn the image into the coordinate matrix. That’s done by the following code:



%input: img_in – input image
%output: cord – coordinate matrix
img = im2bw(img_in);

count = find(img==0);

sz = size(count);

if(isempty(sz)) return; end

cord = zeros(sz(1),2);

sz = size(img);

k=0;

for i=1:sz(1)
for j=1:sz(2)

if(img(i,j)==0)

k=k+1;
cord(k,1)=i;
cord(k,2)=j;

end
end
end

The number of clusters our image will be divided into must be passed1 as a parameter to this function. In our case the cluster is a symbol. The CAPTCHA uses different number of symbols(from 4 to 6). In order to determine the number of symbols the following approach was used.
We calculated horizontal brightness histogram for every text-CAPTCHA and then used these values as the learning set for the SVM. The relative frequency at the testing set was 0.256. The horizontal brightness histogram calculation code is:



%input: img – black-and-white image
%output: hist - horizontal brightness histogram
hist=zeros(sz(2),1);

for i=1:sz(2)

k = 0;

for j=1:sz(1)

k = k +img(j,i);

end

hist(i) = double(k)/double(sz(1));

end

Then we empirically select optimal parameters for the fcm function. These parameters are: exponential weight, maximum number of iterations in the clusterization algorithm, minimal allowable improvement value of target function in a one iteration of the algorithm. The clusterization example is presented in the figure 10.



Figure 10. Clusterized text-CAPTCHA of Yahoo!

The clusterization code:



%Input: cord – coordinate matrix,
num_cluster – clusters number
%Output: the massive of black-and-white images of the symbol

options(1)=2; % exponential weight value
options(2)=200; % maximum number of clusterization algorithm
iterationsoptions(3)=0.0003; % minimal allowable improvement value of target function in a %one iteration of the algorithm

options(4)=0;

index_cell = cell(1,num_cluster);

[center, U, obj_fcn] = fcm(cord, num_cluster ,options);

maxU = max(U);

left = maxU/1.05;

for i=1:num_cluster

index_cell{i} = find(U(i, :)>=left & U(i, :)<= maxU );%

end


%the received clusters are sorted due to their centers
%turn coordinates of the cluster into black-and-white image

After the clusterization we have got symbols with noise (fig.11). We need to get rid of noise, so we used 2d median filter. Then we extracted the object with maximum number of pixels with the help of algorithm used for segmentation of AOL text-CAPTCHA.



Figure 11. Symbols from the Yahoo! text-CAPTCHA after the clusterization

After the filtration symbols are set to 28x28 size. The result is shown on the 12th figure.



Figure 12. Symbols from the Yahoo! text-CAPTCHA after the filtration

As a result we have theoretical recognition efficiency – 0,394. The testing of the recognition system on the testing set revealed the recognition relative frequency equal to 0.35. Such difference between theoretical and practical values can be explained by fact, that we didn’t take into account some segmentation errors(fig. 13).



Figure 13. Incorrectly segmented Yahoo! text-CAPTCHA

Currently available text-CAPTCHAs of Yahoo!, Hotmail, Gmail portals

Figure 3 shows text-CAPTCHAs of these portals. None of the described above methods will handle this type of CAPTCHAs. And it’s difficult to find some heuristic segmentation algorithm, so the ideal solution of the problem would be the system with combined recognition and segmentation. Yann LeCun proposed such system and it’s designed to recognize handwritten texts. The example of work of the system is available at http://yann.lecun.com/exdb/lenet/index.html
At this site you can find works dedicated to the development of convolution networks and applying them for handwritten text recognition [13-16]. Also note works dedicated to slant-normalization in handwritten text recognition [17-19].


As we noted before the recognition systems for the Yahoo!, Hotmail, Gmail portals were created and worked. These systems used the generalized approach to design both segmentator and symbol recognizer.


CAPTCHA improvement: Evolutionary CAPTCHA

Let’s discuss the possible ways the CAPTCHA could be improved.
We consider the situation when attacker knows everything about the algorithm the program uses to generate the CAPTCHA except some inside randomness of the program. Most CAPTCHA-solving engines we know exploit the vulnerabilities of the exact CAPTCHA. It means that if the CAPTCHA is generated “in simple way” in 15% of times, the attacker could target this set. The example of CAPTCHAs generated “in simple way” could be the CAPTCHAs with not interconnected characters or just interconnected couples of characters. The attacker could count on that and make recognition engine specifically for these types of CAPTCHAs. In other hand average recognition engine solves small percent of all CAPTCHAs. It solves less CAPTCHAs than humans do. So obviously in every engine there are plenty of CAPTCHAs the exact recognizer can’t solve. Let’s make an assumption that these not-solved CAPTCHAs belong to one or few types of this CAPTCHA. These types could be not predetermined by a developer: both solved and unsolved CAPTCHAs could use same font type, characters number, etc. The question is: is it possible to predetermine the exact set in all sets of the CAPTCHA, which will be vulnerable to attacks of the recognizer? Even more important question is: when the attack has already started is it possible to determine such set in order to exclude it from all possible generated CAPTCHAs?


Network Security Research proposes the new technology in CAPTCHA generation. The main idea is to make the CAPTCHA available to dynamically determine the sets suit to the recognizer and avoid such areas. The notion about reusing unrecognized CAPTCHAs was mentioned not once. However it’s useful to note that CAPTCHAs are not 100% solvable by humans. For example as we’ve already mentioned, AOL avoided using some characters (0, 1, a, d, e, i, o, u, y, z) in order to prevent frequent human fails. Thus it looks like if the system is reusing unrecognized CAPTCHAs then its positive fails rate grows. So it’s useful to be able to determine unrecognized CAPTCHAs types instead of using unrecognized CAPTCHAs themselves. That is so because humans don’t recognize the exact CAPTCHAs, when the CAPTCHA recognition engines don’t recognize some types of CAPTCHA.


CAPTCHA generation engine uses lots of parameters so we need a method to dynamically determine the delimiters of each parameter. We propose the using of evolutionary algorithm as a method based on natural evolution and inheritance. This method allows us:
1) To handle encoded parameters instead of parameters themselves
2) To make a solution search considering not the only point, but population
3) To use results of our system as a fitness function
4) Using the probabilistic rules instead of deterministic ones.


The remaining part of the paper will be dedicated to the description of evolutionary CAPTCHA.
Some number of CAPTCHAs consequently generated by the engine is population. This number is called the population size. The population size must be as small as possible in order to avoid evolutionary engine speed falling. But it also must be large enough to represent possible states of the system, i.e. we must avoid the situation when all CAPTCHAs in a population are presented by CAPTCHAs showed to recognition engine.
So, the first population is generated. Fitness function will be calculated depending on answers to the CAPTCHA test. We consider CAPTCHAs as solutions to the automated recognition problem, and the fitness function value tells us how good this solution is. If the answer for the CAPTCHA showed at figure 14 is “URGN7PNC”, then the fitness function returns the value 0.



Figure 14. CAPTCHA

This is so because this solution is useless – the CAPTCHA was recognized. If the answer differs from “URGN7PNC” then fitness function returns value greater than zero. We suggest using number of unrecognized characters as a fitness function value. So, for example, fitness function value for this CAPTCHA and the answer “00000000” will be 8, and for answer “0000000C” it will be 7.


The evolutionary system allows solutions with high fitness function values to reproduce with more probability.
The mating pool is the set of solutions (CAPTCHAs) gathered to reproduce. Firstly the solutions for mating pool are selected randomly, but solutions with greater value of fitness function have more chances to be chosen. The method is called “the roulette method”, because it looks like we’re using the roulette (randomness) where the sectors (solutions) with greater values of fitness function are wider. In order to improve the chances of good solutions to be chosen we subsequently use the “tournament method”. The number of tournaments affects the speed of the system so it has to be chosen carefully. The tournament is when we consequently pick N competitors from the population using the roulette method and then take one with the best fitness value to the mating pool. In this case N is the number of tournaments.


So we collected CAPTCHAs into mating pool to reproduce. The size of mating pool we chose equal to the size of population. To understand the term “to reproduce” we should consider how the CAPTCHA is represented in the system. The solution(CAPTCHA) is made from chromosome. The chromosome is the number of parameters along with description of its borders. The parameter is the gene – the minimal meaning unit of the algorithm.


Children inherit parents genes. The number of genes picked from every parent is a random value. The process is called “crossing over”. There is a term – number of crossing over points. This number indicates the amount of points which split parents genes into the intervals before crossing over. See the following figure showing the crossing over with two crossing over points 1 and 3.



Figure 15 Crossing over (two points, 1 and 3)

After the crossing over on every gene we make the mutation with some probability. The mutation is made to avoid the system staying in local minimums. The probability of mutation is chosen empirically. And in our case it’s 0,001.


There is a chance that after the crossing over we got the solutions worse than any of parents. That could be so because we don’t know the influence of genes on each other. It is so called epistasis.


In order to avoid the decreasing of best value of fitness function in next generation we can save the number of best solution and include it in next population. In our case we save number of best parents equal to 10% of population. We could either replace the worse of children by them or random children.


Theoretically children average value of fitness function is greater than parents one. So theoretically the recognition engine success percentage should fall with every new population when we’re using evolutionary CAPTCHA. Let’s check if this assumption is correct.


In order to check it we developed the text-CAPTCHA engine. Also we developed CAPTCHA recognition engine, which accuracy is 30%. The population size was 60. The results showed that the accuracy of recognizer is constantly falling. The foolowing figure shows first five populations of one of the tests.



Figure 16 Test results

Thus our dynamically self reorganize and literally fights with the recognizer. Another words using probabilistic model the system complicates the CAPTCHA for recognition engine (fig. 17).



Figure 17 CAPTCHAs before the evolutionary changes and after them

Of course there is still much to improve, but obviously the concept is working and worth the attention.


Conclusion

At the present moment text-CAPTCHA is still effective because there is no universal algorithm for text recognition. However the recognition engine could be made for the exact CAPTCHA engine. There are two strategies of defense: 1) make changes in CAPTCHA as soon as they broke it (or even better as often as it’s possible) 2) use innovative approaches(our one for example) in CAPTCHAs engines to either increase the amount of attacker work needed to design the new attack-method or make the attack almost impossible due to its complicity.



REFERENCES
1. CAPTCHA http://en.wikipedia.org/wiki/CAPTCHA
2. Henry S. Baird, Kris Popat: Human Interactive Proofs and Document Image Analysis. Document Analysis Systems 2002: 507-518, pp 3-4.
3. Jeff Yan, Ahmad Salah El Ahmad “A Low-cost Attack on a Microsoft CAPTCHA”
4. Patrice Y. Simard, Richard Szeliski, Josh Benaloh, Julien Couvreur, and Iulian Calinov “Using Character Recognition and Segmentation to Tell Computer from Humans”
5. Kumar Chellapilla, Kevin Larson, Patrice Simard, Mary Czerwinski “Computers beat Humans at Single Character Recognition in Reading based Human Interaction Proofs (HIPs)”
6. Greg Mori, Jitendra Malik “Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA”
7. Kumar Chellapilla, Patrice Y. Simard “Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)”
8. Per-Ola Kristiansson “Defeating a simple CAPTCHA using Optical Character Recognition”
9. Jeff Yan, Ahmad Salah El Ahmad “Breaking Visual CAPTCHAs with Naïve Pattern Recognition Algorithms”
10. Robert Sedgewick, Algorithms in C, 3rd ed., Addison-Wesley, 1998, pp. 11-20.
11. HeroSvm 2.1 http://www.cenparmi.concordia.ca/~jdong/HeroSvm.html Error rates of different methods on NIST lowercase database
12. Zimmermann H.-J. Fuzzy Set Theory - and Its Applications.3rd ed.- Kluwer Academic Publishers, 1996.- 435p.
13. Y. LeCun, L. Bottou, Y. Bengio and P. Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 86(11):2278-2324, November 1998
14. Y. LeCun, P. Haffner, L. Bottou and Y. Bengio: Object Recognition with Gradient-Based Learning, in Forsyth, D. (Eds), Feature Grouping, Springer, 1999
15. Y. LeCun, P. Haffner, L. Bottou and Y. Bengio: Gradient-Based Learning for Object Detection, Segmentation and Recognition, AT&T Labs, 1999
16. Y. LeCun, L. Bottou, Y. Bengio and P. Haffner: Gradient-Based Learning Applied to Document Recognition, Intelligent Signal Processing, 306-351, IEEE Press, 2001
17. JOSÉ J. DE OLIVEIRA JÚNIOR��, JOÃO M. DE CARVALHO��, CINTHIA O. DE A. FREITAS_, ROBERT SABOURIN “Evaluating NN and HMM Classi_ers for Handwritten Word Recognition”
18. Alceu de S. Britto Jr1,4, Robert Sabourin2, Edouard Lethelier1, Flávio Bortolozzi1,
Ching Y. Suen “Slant normalization of handwritten numeral strings”
19. E. Kavallieratou*, N. Fakotakis, G. Kokkinakis “Slant estimation algorithm for OCR systems”

пятница, 29 февраля 2008 г.

Yahoo! has changed their CAPTCHA

In our previous post we've claimed that Yahoo! CAPTCHA is broken.

Since then many poorly educated journalists (mostly Russians) wrote fiction stories about this research.

From these stories we knew that:

  1. John Wane blackmailed Yahoo!;

  2. Network Security Research is a part of some criminal network;

  3. we’re going to sell millions of Yahoo! accounts.

None of that is true. All we wanted to say was that Yahoo! CAPTCHA recognition complexity was overestimated.

More than a month has passed since we had released the code. And something happened inside Yahoo!. At last, they have changed their CAPTCHA. The released version of the code does not work anymore.

But worry not! We have some interesting stuff to release. Stay tuned to Network Security Research and AI.




среда, 16 января 2008 г.

Yahoo! CAPTCHA is broken

Hello! We're security researchers from Russia. Our fields of interest: security research, OS security, machine learning. This blog is dedicated to our security research.

Few months ago we received information that yahoo CAPTCHA recognition system exists in the wild with the recognition rate about 30%. So we decided to conduct few experiments. We explored yahoo CAPTCHA and designed a similar system with even better recognition rate (about 35%). The vendor was notified. The vendor didn't reply. In this article we’ll present you our own research.



Many internet resources that specialize in CAPTCHA recognition claim that yahoo CAPTCHA is very difficult for machine recognition.



http://sam.zoy.org/pwntcha/ - “A very good captcha, but not always human-solvable”


http://www.lafdc.com/captcha/ - “Very difficult”


http://captcha.ru/articles/visual/ - “Very good”



However, that’s not right. Your CAPTCHA has vulnerability we’ll discuss later. It’s not necessary to achieve high degree of accuracy when designing automated recognition software. The accuracy of 15% is enough when attacker is able to run 100 000 tries per day, taking into the consideration the price of not automated recognition – one cent per one CAPTCHA.


The implementation of yahoo CAPTCHA recognition engine is here . It consists of two projects (client and server).
First project (server) needs MATLAB 2007a Compiler Runtime (MCR) installed. It waits for a connection and receives CAPTCHA, after that it sends recognized CAPTCHA text string back to client.
Client reads jpg-files in test1 directory and sends them one by one to the server located on the same machine.


If you have any questions or propositions, please contact us. We’re open to discussions.

NetworkSecurityResearch@gmail.com




-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: PGP 8.1

mQGiBEeN+VYRBADQgAFM+oFlMCVyM93B9Mrm7GltHQ1cyf+AIzy3tq29l5eq4zkZ
Kx+QdSvz+chRWjYAjgUS9zgSwzoQa1ZRsTd9o4mhL6hukYxp7IU/s7T8gg2wUkYF
3MHJctgrkj2K+Ux64jy7ONw2XrDcRUjd0Q8fVHS7hRzYhi6v6fuXCWi/+wCg/xuj
n7zGVhf4zN1zKQmdzb7PLjcD/3qhMtsZrzB7Hcf+LnU6WLdqT+Vm6d+QUjAKVxAE
o4cXtqZIfLc3HTF1WdzUb0JsSS8HiOmy3Xi3ckr8DRnwa+jU+nwel5p9E2sn67GV
Izl+lDLBjM6DAnja/MA40Ddu/kOVgysKmgrlHShm+ybji5bgMaP3q7zPG3GwbO/m
p0W7BADIbDDsM3snTJLtEkK3j3lkE2WIMDaMZy5jGADfdnM/7N26S7ONDJuG7I5s
71BNjHwlOHbuW4HtZsSzUa9DgPMCOT8p/mii+OpLOsnyFKWs1fssG3wyMSu7FXdL
uivaXH2DxCDNDqOG48XXz8d1hx2CWqFxB2Im8uFzVYmT6tC2CrQ9TmV0d29yayBT
ZWN1cml0eSBSZXNlYXJjaCA8TmV0d29ya1NlY3VyaXR5UmVzZWFyY2hAZ21haWwu
Y29tPokAXQQQEQIAHQUCR435VgcLCQgHAwIKAhkBBRsDAAAABR4BAAAAAAoJEBTq
o++UHsXIrYkAoIUtLN0TBrVFwEM0hR3Oksc6PNVtAJ9lKMsN8+Lwrd5s+EN6Vuc3
nQ/i0LkEDQRHjflWEBAA+RigfloGYXpDkJXcBWyHhuxh7M1FHw7Y4KN5xsncegus
5D/jRpS2MEpT13wCFkiAtRXlKZmpnwd00//jocWWIE6YZbjYDe4QXau2FxxR2FDK
IldDKb6V6FYrOHhcC9v4TE3V46pGzPvOF+gqnRRh44SpT9GDhKh5tu+Pp0NGCMbM
HXdXJDhK4sTw6I4TZ5dOkhNh9tvrJQ4X/faY98h8ebByHTh1+/bBc8SDESYrQ2DD
4+jWCv2hKCYLrqmus2UPogBTAaB81qujEh76DyrOH3SET8rzF/OkQOnX0ne2Qi0C
NsEmy2henXyYCQqNfi3t5F159dSST5sYjvwqp0t8MvZCV7cIfwgXcqK61qlC8wXo
+VMROU+28W65Szgg2gGnVqMU6Y9AVfPQB8bLQ6mUrfdMZIZJ+AyDvWXpF9Sh01D4
9Vlf3HZSTz09jdvOmeFXklnN/biudE/F/Ha8g8VHMGHOfMlm/xX5u/2RXscBqtNb
no2gpXI61Brwv0YAWCvl9Ij9WE5J280gtJ3kkQc2azNsOA1FHQ98iLMcfFstjvbz
ySPAQ/ClWxiNjrtVjLhdONM0/XwXV0OjHRhs3jMhLLUq/zzhsSlAGBGNfISnCnLW
hsQDGcgHKXrKlQzZlp+r0ApQmwJG0wg9ZqRdQZ+cfL2JSyIZJrqrol7DVes91hcA
AgIQALg1uiEsWMaa9mpNDH4uBSa+tFbz/mEqoV6G8w6LK+XdzbohQQa1cQgr5J44
PPrhe/1uS9llrt4WtUsChFFNJqMM3sURLE7c7uWdVVabKQoXOY+c0wiuMmJ+O5WZ
BaKJLNUvgZBTqsBFs2dttF2AjJCR5u2Yg+yVjeKzbvhJtEwIs14DYIAkS6768E2K
eJfoJUls7Cr5GKXNoFeV3vYQK56bNOsbeObeIbuZvarbelrJIFspLBt1wYqZH9zt
gek/Oz/GJvZwxGbPQhTTtLKMrzzbWWMudaEOX2/HkQFnOyp/57AoZVYK8z2/6H5N
sdIukFynx4BIMgxjhVffp73rshzsDpuSd2GJWzOoxa0hvw4y7DiVQ/VZPC4bFAzQ
lYyEfi0M6JRjthalrVAmWPraXtZoctTK+B3/Jb8rr2W4twJCWl6xP7vsgDM/yTmG
QDMNVM62YKH/R1K0/avqTso7RXrL3I1z5idQ3tjGQj+OEJIjkvDGeq5aqGbkXSTr
XhO7hPspNBUa86kRS20g3OGrMLxVZIDKeRaPrZFaVEQ3+48b5qcmKA6i7SAgYtf9
MskaRTPknB1G/LNS2+Wh1j1MBvmicfdxuD+W//KURlhtJEtFb+5uSqdt5/148dGU
pd/EMBLbbyVQX0mtbtmTEBMmRgw2OLcwTE9+b531OWKc0VCZiQBMBBgRAgAMBQJH
jflWBRsMAAAAAAoJEBTqo++UHsXI99gAoKHO03H+us50GVQaYpBsdMA00V9cAJ4s
YHrXFcQNqu1okd2Rx9iADQSWMA==
=pp5+
-----END PGP PUBLIC KEY BLOCK-----