WO2016118286A1 - Optimizing multi-class image classification using patch features - Google Patents

Optimizing multi-class image classification using patch features

Info

Publication number: WO2016118286A1
Authority: WO (WIPO/PCT)
Prior art keywords: individual, patches, patch, images, clusters
Application number: PCT/US2015/067554
Other languages: French (fr)
Inventors: Ishan MISRA, Jin Li, Xian-Sheng Hua
Original assignee: Microsoft Technology Licensing, LLC
Priority date: (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Microsoft Technology Licensing, LLC
Priority to EP15834702.1A (EP3248143B1)
Priority to CN201580073396.5A (CN107209860B)
Publication of WO2016118286A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211 Selection of the most significant subset of features
    • G06F 18/2113 Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G06F 18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G06F 18/23 Clustering techniques
    • G06F 18/285 Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G06V 10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V 10/763 Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • Computer vision may include object recognition, object categorization, object class detection, image classification, etc.
  • Object recognition may describe finding a particular object (e.g., a handbag of a particular make, a face of a particular person, etc.).
  • Object categorization and object class detection may describe finding objects that belong in a particular category or class (e.g., faces, shoes, cars, etc.).
  • Image classification may describe assigning an entire image to a particular category or class (e.g., location recognition, texture classification, etc.).
  • Computerized object recognition, detection, and/or classification using images is challenging because objects in the real world vary greatly in visual appearance. For instance, objects associated with a single label (e.g., cat, dog, car, house, etc.) exhibit diversity in color, shape, size, viewpoint, lighting, etc.
  • Some current object detection, recognition, and/or classification methods include training classifiers based on supervised, or labeled, data. Such methods are not scalable. Other current object detection, recognition, and/or classification methods leverage localized image features (e.g., Histogram of Oriented Gradients (HOG)) to learn common-sense knowledge (e.g., an eye is part of a person) or specific sub-labels of generic labels (e.g., a generic label of horse includes sub-labels of brown horse, riding horse, etc.). However, using localized image features (e.g., HOG) is computationally intensive. Accordingly, current techniques for object detection, recognition, and/or classification are not scalable and are computationally intensive.
  • This disclosure describes techniques for optimizing multi-class image classification by leveraging patch-based features extracted from weakly supervised images.
  • the techniques described herein leverage patch-based features to optimize the multi-class image classification by improving accuracy in using classifiers to classify incoming images and reducing the amount of computational resources used for training classifiers.
  • the systems and methods describe learning classifiers from weakly supervised images available on the Internet.
  • the systems described herein may receive a corpus of images associated with a set of labels. Each image in the corpus of images may be associated with at least one label in the set of labels.
  • the system may extract one or more patches from individual images in the corpus of images.
  • the system may extract patch-based features from the one or more patches and patch representations from individual patches of the one or more patches.
  • the system may arrange the patches into clusters based at least in part on the patch-based features.
  • the system may determine similarity values representative of a similarity between individual patches. At least some of the individual patches may be removed from individual clusters based at least in part on the similarity values.
  • the system may extract patch-based features based at least in part on patches remaining in refined clusters.
  • the system may train classifiers based at least in part on the patch-based features.
  • the systems and methods further describe applying the classifiers to classify new images.
  • a user may input an image into the trained system described herein.
  • the system may extract patches from the image and extract features from the image.
  • the system may apply a classifier to the extracted features to classify the new image.
  • the system may output a result to the user.
  • the result may include classification of the image determined by applying the classifier to the features extracted from the image.
  • FIG. 1 is a diagram showing an example system for training classifiers from images and applying the trained classifiers to classify new images.
  • FIG. 2 is a diagram showing additional components of the example system for training classifiers from weakly supervised images and applying the trained classifiers to classify new images.
  • FIG. 3 illustrates an example process for training classifiers from patch-based features.
  • FIG. 4 illustrates an example process for determining whether a label is learnable based at least in part on filtering a corpus of images.
  • FIG. 5 illustrates an example process for filtering a corpus of images.
  • FIG. 6 illustrates another example process for filtering a corpus of images.
  • FIG. 7 illustrates an example process for determining similarity values.
  • FIG. 8 illustrates an example process for removing patches from clusters of patches.
  • FIG. 9 illustrates an example process for diversity selection of particular patches for training.
  • FIG. 10 illustrates a diagram showing an example system for classifying a new image.
  • FIG. 11 illustrates an example process for classifying a new image.
  • Computer vision object (e.g., people, animals, landmarks, etc.), texture, and/or scene classification in images may be useful for several applications including photo and/or video recognition, image searching, product related searching, etc.
  • Current classification methods include training classifiers based on supervised, or labeled, data. Such methods are not scalable or extendable.
  • current classification methods leverage localized image features (e.g., HOG) to learn common-sense knowledge (e.g., eye is part of a person) or specific sub-labels of generic labels (e.g., a generic label of horse includes sub-labels of brown horse, riding horse, etc.).
  • current data-mining techniques require substantial investments of computer resources and are not scalable and/or extendable.
  • Techniques described herein optimize multi-class image classification by leveraging patch-based features extracted from weakly supervised images.
  • the systems and methods described herein may be useful for training classifiers and classifying images using the classifiers.
  • Such classification may be leveraged for several applications including object recognition (e.g., finding a particular object such as a handbag of a particular make, a face of a particular person, etc.), object categorization or class detection (e.g., finding objects that belong in a particular category or class), and/or image classification (e.g., assigning an entire image to a particular category or class).
  • the systems and methods describe learning classifiers from weakly supervised images available on the Internet.
  • the system described herein may receive a corpus of images associated with a set of labels. Each image in the corpus of images may be associated with at least one label in the set of labels.
  • the system may extract one or more patches from individual images in the corpus of images.
  • a patch may represent regions or parts of an image. Patches may be representative of an object or a portion of an object in an image and may be discriminative such that they may be detected in multiple images with high recall and precision. In at least some examples, patches may be discriminative such that they may be detected in a number of images associated with a same label more frequently than they may be detected in images associated with various, different labels.
  • the system may extract patch-based features from the individual images.
  • Patch-based features are image-level features that describe or represent an image. Patch-based features may represent a patch distribution over a patch dictionary as described below.
  • Patch-based features for an individual image are based at least in part on patches that are extracted from the individual image.
  • a plurality of patches is extracted from an individual image and the patch-based features may be based on the plurality of patches extracted from the individual image.
  • only a single patch is extracted from an image and the patch-based features may be based on the single patch. Patch-based features enable the systems described herein to train classifiers using less data, therefore increasing efficiency and reducing computational resources consumed for training.
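  • As a concrete illustration of a patch distribution over a patch dictionary, the hedged sketch below builds a k-means codebook from patch descriptors and represents an image as a normalized histogram of its patches over that codebook. The codebook construction, descriptor dimensionality, and dictionary size are illustrative assumptions rather than the specific construction recited in the application.
```python
import numpy as np
from sklearn.cluster import KMeans

def build_patch_dictionary(patch_descriptors, dictionary_size=256, seed=0):
    """Cluster patch descriptors into a codebook (the 'patch dictionary')."""
    kmeans = KMeans(n_clusters=dictionary_size, random_state=seed, n_init=10)
    kmeans.fit(patch_descriptors)
    return kmeans

def patch_based_feature(image_patch_descriptors, dictionary):
    """Image-level feature: normalized histogram of the image's patches
    over the dictionary entries (a patch distribution)."""
    assignments = dictionary.predict(image_patch_descriptors)
    hist = np.bincount(assignments, minlength=dictionary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy usage: 1,000 random 64-d patch descriptors, then a feature for one image.
rng = np.random.default_rng(0)
all_patches = rng.normal(size=(1000, 64))
dictionary = build_patch_dictionary(all_patches, dictionary_size=32)
image_feature = patch_based_feature(all_patches[:50], dictionary)  # 32-d image feature
```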
  • the system may extract patch representations from the individual patches.
  • Patch representations describe features extracted from individual patches.
  • Patch representations may represent patch-level features and may be used for refining the clusters, as described below.
  • the system may arrange individual patches of the one or more patches into clusters based at least in part on patch-based features. Individual clusters correspond to individual labels of the set of labels.
  • the clusters may be refined based at least in part on the patch-based features.
  • the system may determine similarity values based at least in part on the patch representations. The similarity values may be representative of similarity between individual patches in same and/or different clusters.
  • the system may process the clusters to remove at least some of the individual patches based at least in part on the similarity values. Based at least in part on the patches that remain after processing the clusters, the system may extract patch-based features from the patches in the refined clusters.
  • the system may leverage the patch-based features extracted from the refined clusters of patches to train classifiers.
  • a user may input an image into the trained system described herein.
  • the system may extract patches and features from the image.
  • the system may apply a classifier to the extracted features to classify the input image.
  • the system may output a result to the user.
  • the result may include classification of the image determined by applying the classifier to the features extracted from the image.
  • the environment described below constitutes but one example and is not intended to limit application of the system described below to any one particular operating environment. Other environments may be used without departing from the spirit and scope of the claimed subject matter.
  • the various types of processing described herein may be implemented in any number of environments including, but not limited to, stand-alone computing systems, network environments (e.g., local area networks or wide area networks), peer-to-peer network environments, distributed-computing (e.g., cloud-computing) environments, etc.
  • FIG. 1 is a diagram showing an example system 100 for training classifiers from images and applying the trained classifiers to classify new images.
  • the example operating environment 100 may include a service provider 102, one or more network(s) 104, one or more users 106, and one or more user devices 108 associated with the one or more users 106.
  • the functionality described herein can be performed, at least in part, by one or more hardware logic components such as accelerators.
  • an accelerator can represent a hybrid device, such as one from ZYLEX or ALTERA that includes a CPU core embedded in an FPGA fabric.
  • the service provider 102 may include one or more server(s) 110, which may include one or more processing unit(s) 112 and computer-readable media 114.
  • Executable instructions stored on computer-readable media 114 can include, for example, an input module 116, a training module 118, and a classifying module 120, and other modules, programs, or applications that are loadable and executable by processing unit(s) 112 for classifying images.
  • the one or more server(s) 110 may include devices.
  • the service provider 102 may be any entity, server(s), platform, etc., that may learn classifiers from weakly supervised images and apply the learned classifiers for classifying new images.
  • the service provider 102 may receive a corpus of images associated with a set of labels and may extract patches from individual images in the corpus.
  • the service provider 102 may extract features from the patches and images for training a classifier.
  • the service provider 102 may leverage the classifier to classify new images input by users 106.
  • the network(s) 104 may be any type of network known in the art, such as the Internet.
  • the users 106 may communicatively couple to the network(s) 104 in any manner, such as by a global or local wired or wireless connection (e.g., local area network (LAN), intranet, etc.).
  • the network(s) 104 may facilitate communication between the server(s) 110 and the user devices 108 associated with the users 106.
  • the users 106 may operate corresponding user devices 108 to perform various functions associated with the user devices 108, which may include one or more processing unit(s) 112, computer-readable storage media 114, and a display.
  • Executable instructions stored on computer-readable media 114 can include, for example, the input module 116, the training module 118, and the classifying module 120, and other modules, programs, or applications that are loadable and executable by processing unit(s) 112 for classifying images.
  • the users 106 may utilize the user devices 108 to communicate with other users 106 via the one or more network(s) 104.
  • User device(s) 108 can represent a diverse variety of device types and are not limited to any particular type of device. Examples of device(s) 108 can include but are not limited to stationary computers, mobile computers, embedded computers, or combinations thereof.
  • Example stationary computers can include desktop computers, work stations, personal computers, thin clients, terminals, game consoles, personal video recorders (PVRs), set-top boxes, or the like.
  • Example mobile computers can include laptop computers, tablet computers, wearable computers, implanted computing devices, telecommunication devices, automotive computers, personal data assistants (PDAs), portable gaming devices, media players, cameras, or the like.
  • Example embedded computers can include network enabled televisions, integrated components for inclusion in a computing device, appliances, microcontrollers, digital signal processors, or any other sort of processing device, or the like.
  • the service provider 102 may include one or more server(s) 110, which may include devices.
  • Examples support scenarios where device(s) that may be included in the one or more server(s) 110 can include one or more computing devices that operate in a cluster or other grouped configuration to share resources, balance load, increase performance, provide fail-over support or redundancy, or for other purposes.
  • Device(s) included in the one or more server(s) 110 can represent, but are not limited to, desktop computers, server computers, web-server computers, personal computers, mobile computers, laptop computers, tablet computers, wearable computers, implanted computing devices, telecommunication devices, automotive computers, network enabled televisions, thin clients, terminals, personal data assistants (PDAs), game consoles, gaming devices, work stations, media players, personal video recorders (PVRs), set-top boxes, cameras, integrated components for inclusion in a computing device, appliances, or any other sort of computing device.
  • device(s) that may be included in the one or more server(s) 110 and/or user device(s) 108 can include any type of computing device having one or more processing unit(s) 112 operably connected to computer-readable media 114 such as via a bus, which in some instances can include one or more of a system bus, a data bus, an address bus, a PCI bus, a Mini-PCI bus, and any variety of local, peripheral, and/or independent buses.
  • Executable instructions stored on computer-readable media 114 can include, for example, the input module 116, the training module 118, and the classifying module 120, and other modules, programs, or applications that are loadable and executable by processing unit(s) 112.
  • an accelerator can represent a hybrid device, such as one from ZyXEL® or Altera® that includes a CPU core embedded in an FPGA fabric.
  • Device(s) that may be included in the one or more server(s) 110 and/or user device(s) 108 can further include one or more input/output (I/O) interface(s) coupled to the bus to allow device(s) to communicate with other devices such as user input peripheral devices (e.g., a keyboard, a mouse, a pen, a game controller, a voice input device, a touch input device, gestural input device, and the like) and/or output peripheral devices (e.g., a display, a printer, audio speakers, a haptic output, and the like).
  • Devices that may be included in the one or more server(s) 110 can also include one or more network interfaces coupled to the bus to enable communications between computing device and other networked devices such as user device(s) 108.
  • Such network interface(s) can include one or more network interface controllers (NICs) or other types of transceiver devices to send and receive communications over a network.
  • some components are omitted from the illustrated system.
  • Processing unit(s) 112 can represent, for example, a CPU-type processing unit, a GPU-type processing unit, a field-programmable gate array (FPGA), another class of digital signal processor (DSP), or other hardware logic components that may, in some instances, be driven by a CPU.
  • illustrative types of hardware logic components that can be used include Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.
  • processing unit(s) 112 may execute one or more modules and/or processes to cause the server(s) 110 to perform a variety of functions, as set forth above and explained in further detail in the following disclosure. Additionally, each of the processing unit(s) 112 may possess its own local memory, which also may store program modules, program data, and/or one or more operating systems.
  • the computer-readable media 114 of the server(s) 110 and/or user device(s) 108 may include components that facilitate interaction between the service provider 102 and the users 106.
  • the computer-readable media 114 may include the input module 116, the training module 118, and the classifying module 120, as described above.
  • the modules (116, 118, and 120) can be implemented as computer-readable instructions, various data structures, and so forth via at least one processing unit(s) 112 to configure a device to execute instructions and to perform operations implementing training classifiers from images and leveraging the classifiers to classify new images. Functionality to perform these operations may be included in multiple devices or a single device.
  • the computer-readable media 114 may include computer storage media and/or communication media.
  • Computer storage media can include volatile memory, nonvolatile memory, and/or other persistent and/or auxiliary computer storage media, removable and non-removable computer storage media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data.
  • Computer memory is an example of computer storage media.
  • computer storage media includes tangible and/or physical forms of media included in a device and/or hardware component that is part of a device or external to a device, including but not limited to random-access memory (RAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), phase change memory (PRAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory, compact disc read-only memory (CD-ROM), digital versatile disks (DVDs), optical cards or other optical storage media, miniature hard drives, memory cards, magnetic cassettes, magnetic tape, magnetic disk storage, magnetic cards or other magnetic storage devices or media, solid-state memory devices, storage arrays, network attached storage, storage area networks, hosted computer storage or any other storage memory, storage device, and/or storage medium that can be used to store and maintain information for access by a computing device.
  • communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transmission mechanism.
  • modulated data signal means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • Such signals or carrier waves, etc. can be propagated on wired media such as a wired network or direct-wired connection, and/or wireless media such as acoustic, RF, infrared and other wireless media.
  • computer storage media does not include communication media. That is, computer storage media does not include communication media consisting solely of a modulated data signal, a carrier wave, or a propagated signal, per se.
  • FIG. 2 is a diagram showing additional components of the example system 200 for training classifiers from weakly supervised images and applying the trained classifiers to classify new images.
  • the system 200 may include the input module 116, the training module 118, and the classifying module 120.
  • the input module 116 may receive images and, in some examples, may remove at least some of the images using a filtering process described below.
  • the input module 116 may include additional components or modules, such as a receiving module 202 and a filtering module 204.
  • the receiving module 202 may receive the plurality of images based at least in part on sending a query.
  • a query may be a query for a single label or a plurality of labels.
  • a query may be a textual query, image query, etc.
  • the query may include words used to identify a label (e.g., "orca whale") and related words and/or phrases (e.g., "killer whale," "blackfish," etc.).
  • a user 106 may include optional modifiers to the query.
  • the input module 116 may send the query to one or more search engines, social-networking services, blogging services, and/or other websites or web services.
  • the receiving module 202 may receive the plurality of images based at least in part on sending the query.
  • the receiving module 202 may receive weakly supervised images.
  • Weakly supervised images may include images associated with a label.
  • the label may or may not correctly identify the subject matter of the image.
  • the label may identify the image or individual objects in the image, but the system described herein may not be able to determine which subject (e.g., the image or an individual object in the image) the label identifies.
  • supervised images may be labeled with a certainty above a predetermined threshold and unsupervised images may not be labeled at all.
  • the techniques described herein may be applied to various types of multimedia data (e.g., videos, animations, etc.) and, in such examples, the receiving module 202 may receive various types of multimedia data items.
  • the weakly supervised images may be available on the Internet.
  • weakly supervised images may be extracted from data available on the Internet in search engines, social-networking services, blogging services, data sources, and/or other websites or web services.
  • search engines include Bing®, Google®, Yahoo! Search®, Ask®, etc.
  • social-networking services include Facebook®, Twitter®, Instagram®, MySpace®, Flickr®, YouTube®, etc.
  • blogging services include WordPress®, Blogger®, Squarespace®, Windows Live Spaces®, WeiBo®, etc.
  • data sources include ImageNet (maintained by Stanford University), open video annotation project (maintained by Harvard University), etc.
  • the weakly supervised images may be accessible by the public (e.g., data stored in search engines, public Twitter® pictures, public Facebook® pictures, etc.). However, in other examples, the weakly supervised images may be private (e.g., private Facebook® pictures, private YouTube® videos, etc.) and may not be viewed by the public. In such examples (i.e., when the weakly supervised images are private), the systems and methods described herein may not proceed without first obtaining permission from the authors of the weakly supervised images to access the image.
  • a user 106 may be provided with notice that the systems and methods herein are collecting personally identifiable information (PII). Additionally, prior to initiating PII data collection, users 106 may have an opportunity to opt-in or opt-out of the PII data collection. For example, a user 106 may opt-in to the PII data collection by taking affirmative action indicating that he or she consents to the PII data collection. Alternatively, a user 106 may be presented with an option to opt-out of the PII data collection. An opt-out option may require an affirmative action to opt-out of the PII data collection, and in the absence of affirmative user action to opt-out, PII data collection may be impliedly permitted.
  • labels correspond to queries.
  • Labels may correspond to a descriptive term for a particular entity (e.g., animal, plant, attraction, etc.). Queries are textual terms or phrases that may be used to collect the corpus of images from search engines, social networks, etc.
  • a label corresponds to a particular query, but in some examples, a label may correspond to more than one query.
  • the label "orca whale" may correspond to queries such as "orca whale," "killer whale," and/or "blackfish."
  • the plurality of images returned to the receiving module 202 may be noisy. Accordingly, the filtering module 204 may filter one or more images from the plurality of images to mitigate the noise in the images used for training classifiers. In additional or alternative examples, the receiving module 202 may receive new images for classifying by the trained classifiers.
  • the training module 118 may train classifiers from weakly supervised images.
  • the training module 118 may include additional components or modules for training the classifiers.
  • the training module 118 may include an extraction module 206, which includes a patch extracting module 208 and feature extracting module 210, a clustering module 212, a refining module 214, and a learning module 216.
  • the extraction module 206 may include a patch extracting module 208 and a feature extracting module 210.
  • the patch extracting module 208 may access a plurality of images from the receiving module 202 and extract one or more patches from individual images of the plurality of images.
  • patches may represent regions or parts of an image. Individual patches may correspond to an object or a portion of an object in an image. In some examples, there may be multiple patches in an individual image.
  • the feature extracting module 210 may extract global features and patch-based features. Additionally, the feature extracting module 210 may extract patch representations from the patches. Leveraging global features and patch-based features improves accuracy in recognizing and classifying objects in images. The patch representations may be leveraged for refining the patches, as described below.
  • Global feature extraction may describe the process of identifying interesting portions or shapes of images and extracting those features for additional processing.
  • the process of identifying interesting portions or shapes of images may occur via common multimedia feature extraction techniques such as SIFT (scale-invariant feature transform), deep neural networks (DNN) feature extractor, etc.
  • multimedia feature extraction may describe turning an image into a high dimensional feature vector. For example, all information provided may be organized as a single vector, which is commonly referred to as a feature vector.
  • each image in the corpus of images may have a corresponding feature vector based on a suitable set of features.
  • Global features may include visual features, textual features, etc. Visual features may range from simple visual features, such as edges and/or corners, to more complex visual features, such as objects. Textual features include tags, classes, and/or metadata associated with the images.
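  • To make "turning an image into a high dimensional feature vector" concrete, the hedged sketch below concatenates a per-channel color histogram with a coarse gradient-orientation histogram. These particular features are stand-ins chosen for brevity; the application instead names techniques such as SIFT or DNN feature extractors.
```python
import numpy as np

def global_feature_vector(image_rgb, color_bins=8, orientation_bins=9):
    """Concatenate a per-channel color histogram with a gradient-orientation
    histogram into one feature vector for the whole image."""
    img = image_rgb.astype(float) / 255.0
    # Color histogram per channel.
    color_hist = np.concatenate([
        np.histogram(img[..., c], bins=color_bins, range=(0, 1), density=True)[0]
        for c in range(img.shape[-1])
    ])
    # Gradient-orientation histogram on the grayscale image (HOG-like, no cells).
    gray = img.mean(axis=-1)
    gy, gx = np.gradient(gray)
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx)  # angles in [-pi, pi]
    orient_hist, _ = np.histogram(orientation, bins=orientation_bins,
                                  range=(-np.pi, np.pi), weights=magnitude)
    orient_hist = orient_hist / max(orient_hist.sum(), 1e-8)
    return np.concatenate([color_hist, orient_hist])

# Toy usage with a random 64x64 RGB image.
feature = global_feature_vector(np.random.randint(0, 256, (64, 64, 3)))
```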
  • Patch-based feature extraction may describe extracting image-level features based at least in part on patches extracted from an image.
  • the patch-based features may be based at least in part on patches in refined clusters of patches, as described below.
  • patch-based features are similar to mid-layer representations in DNNs.
  • Patch-based features may represent a patch distribution over the patch dictionary, described below. Patch-based features enable the systems described herein to train classifiers using less data, therefore increasing efficiency and reducing computational resources consumed for training.
  • Various models that linearly transform a feature space associated with the images may be used to extract patch-based features, such as latent Dirichlet allocation (LDA), Support Vector Machines (SVM), etc.
  • the feature extracting module 210 may also extract patch representations.
  • Patch representations describe features extracted from individual patches. As described above, patch representations may represent patch-level features and may be used for refining the clusters. Various models may be used to extract patch representations, such as but not limited to, LDA representations of HOG, etc.
  • the clustering module 212 may arrange the patches in clusters based on the patch-based features. In at least some examples, to increase the speed of processing the images for training classifiers, the clustering module 212 may arrange the individual patches into a plurality of clusters based at least in part on the patch-based features, as described above. Patches may be placed in a same cluster based at least in part on over-clustering the LDA representation of the patches associated with an image to generate the clusters. Aspect ratio may be implicitly captured by the patch-based features. In some examples, each cluster may represent a particular label. In other examples, each cluster may represent various views of a particular label. In additional or alternative examples, the clustering module 212 may use different methods of vector quantization including K-Means clustering to arrange the clusters of patches.
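  • A minimal sketch of the over-clustering step, assuming patch representations are already available as fixed-length vectors: scikit-learn's K-Means is used as the vector-quantization method and the number of clusters is deliberately set much larger than the number of labels. The clusters-per-label ratio and the use of K-Means here are illustrative choices.
```python
import numpy as np
from sklearn.cluster import KMeans

def over_cluster_patches(patch_representations, n_labels, clusters_per_label=20, seed=0):
    """Arrange patch representations into many small clusters (over-clustering).

    Returns the fitted model and the cluster id assigned to each patch."""
    n_clusters = n_labels * clusters_per_label
    kmeans = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    cluster_ids = kmeans.fit_predict(patch_representations)
    return kmeans, cluster_ids

# Toy usage: 2,000 patch representations, 5 labels -> 100 clusters.
rng = np.random.default_rng(1)
reps = rng.normal(size=(2000, 128))
model, cluster_ids = over_cluster_patches(reps, n_labels=5)
```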
  • the refining module 214 may remove patches from individual clusters based at least in part on similarity values that are representative of a similarity between individual patches.
  • the refining module 214 may determine the similarity values, as described below.
  • the similarity values may be used to determine entropy values and the entropy values may be used for processing the patches via diversity selection, as described below.
  • Entropy values may represent certainty measures.
  • One or more patches may be removed from individual clusters based at least in part on the entropy values and diversity selection. Following the removal of patches from the individual clusters, the remaining patches may have lower entropy values and/or more diversity than the patches in the pre-processed clusters.
  • the resulting clusters may be refined clusters of patches used for training classifiers to classify images.
  • the learning module 216 may leverage one or more learning algorithms to train classifiers for one or more labels associated with the refined clusters of patches.
  • the feature extracting module 210 may extract patch-based features from the patches in the refined clusters of patches.
  • the classifiers may be trained based at least in part on the extracted patch-based features and, in at least some examples, global features. For example, learning algorithms such as fast rank, Stochastic Gradient Descent (SGD), SVMs, boosting, etc., may be applied to learn a classifier for identifying particular labels of the one or more labels.
  • classifiers for all of the labels may be trained at the same time using multi-label learning techniques, such as multiclass SVM or SGD.
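  • The sketch below makes the training step concrete by fitting a multiclass linear model on image-level features with scikit-learn's SGDClassifier (hinge loss, i.e., a linear SVM trained by stochastic gradient descent), which trains classifiers for all labels at once. The synthetic features, labels, and hyperparameters are placeholders; the application lists SGD, SVMs, boosting, and fast rank as candidate learning algorithms.
```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Placeholder data: 500 images, 160-d patch-based (+ global) features, 5 labels.
rng = np.random.default_rng(2)
features = rng.normal(size=(500, 160))
labels = rng.integers(0, 5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0)

# Hinge loss gives a linear-SVM objective; multiclass is handled one-vs-rest.
classifier = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, random_state=0)
classifier.fit(X_train, y_train)
print("held-out accuracy:", classifier.score(X_test, y_test))
```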
  • the training described above may be applied to new labels as new labels are received and the new classifiers may be added to the classifier(s) 218.
  • the classifying module 120 may store the classifier(s) 218.
  • the classifying module 120 may receive patches and patch-based features extracted from new images and may apply the classifier(s) 218 to the patch-based features.
  • the classifying module 120 may output results including labels identifying and/or classifying images. In at least some examples, the output results may include confidence scores corresponding to each label.
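  • One possible shape for the classifying module's output, assuming a scikit-learn-style classifier such as the one sketched above: every label is scored against the extracted features and returned with a confidence. The softmax over raw decision scores is an illustrative way to produce confidences, not necessarily the scoring used by the classifying module 120.
```python
import numpy as np

def classify_image(classifier, image_feature, label_names):
    """Return (label, confidence) pairs sorted by confidence, highest first."""
    scores = classifier.decision_function(image_feature.reshape(1, -1))[0]
    # Softmax turns raw decision scores into pseudo-confidences in [0, 1].
    exp_scores = np.exp(scores - scores.max())
    confidences = exp_scores / exp_scores.sum()
    return sorted(zip(label_names, confidences), key=lambda pair: -pair[1])

# Usage with the classifier and feature matrix from the previous sketch:
# results = classify_image(classifier, features[0],
#                          ["cat", "dog", "car", "house", "horse"])
```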
  • FIGS. 3-5 describe example processes for training classifiers from weakly supervised images.
  • the example processes are described in the context of the environment of FIGS. 1 and 2 but are not limited to those environments.
  • the processes are illustrated as logical flow graphs, each operation of which represents an operation in the illustrated or another sequence of operations that may be implemented in hardware, software, or a combination thereof.
  • the operations represent computer-executable instructions stored on one or more computer-readable media 114 that, when executed by one or more processors 112, configure a computing device to perform the recited operations.
  • computer-executable instructions include routines, programs, objects, components, data structures, and the like that configure a computing device to perform particular functions or implement particular abstract data types.
  • the computer-readable media 114 may include hard drives, floppy diskettes, optical disks, CD-ROMs, DVDs, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, flash memory, magnetic or optical cards, solid-state memory devices, or other types of storage media suitable for storing electronic instructions, as described above.
  • FIG. 3 illustrates an example process 300 for training classifiers from patch- based features.
  • Block 302 illustrates sending a query.
  • training classifiers may begin with the input module 116 sending a query, as described above.
  • Block 304 illustrates receiving a corpus of images associated with the query. Based at least in part on sending the query, images relating to the query may be returned to the receiving module 202 from the one or more search engines, social-networking services, blogging services, and/or other websites or web services, as described above. Additional queries associated with individual labels of a set of labels may be sent to the one or more search engines, social-networking services, blogging services, and/or other websites or web services as described above, and corresponding images may be returned and added to the corpus of images for training classifier(s) 218.
  • the corpus may be noisy and may include images that are unrelated to the queries, are of low quality, etc. Accordingly, the corpus of images may be refined.
  • the filtering module 204 may filter individual images from the corpus of images, as described below in FIGS. 4-6.
  • Block 306 illustrates accessing a corpus of images.
  • the extraction module 206 may access the corpus of images from the input module 116 for processing.
  • the corpus of images may be filtered before proceeding with processing the individual images from the corpus of images. Example processes for filtering are described in FIGS. 4-6.
  • Block 308 illustrates extracting patches from individual images.
  • patches may represent regions or parts of an image. Individual patches may correspond to an object or a portion of an object in an image. In some examples, there may be multiple patches in each image.
  • the patch extraction module 208 may leverage edge detection to extract patches that correspond to objects or portions of objects in images. In at least one example, the patch extraction module 208 may use structured edge detection and/or fast edge detection (e.g., via structured random forests, etc.). In other examples, the patch extraction module 208 may extract patches based at least in part on detecting edges using intensity, color gradients, classifiers, etc.
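  • The sketch below shows one simple, gradient-based way to propose patches, in the spirit of the edge-detection-driven extraction described above: sliding windows are scored by mean gradient magnitude and the strongest ones are kept. The window size, stride, and scoring rule are assumptions; the application also names structured and fast edge detection (e.g., via structured random forests).
```python
import numpy as np

def propose_patches(gray_image, window=32, stride=16, top_k=20):
    """Score sliding windows by mean gradient magnitude and return the
    top-k window boxes as (row, col, size) tuples."""
    gy, gx = np.gradient(gray_image.astype(float))
    magnitude = np.hypot(gx, gy)
    boxes = []
    for r in range(0, gray_image.shape[0] - window + 1, stride):
        for c in range(0, gray_image.shape[1] - window + 1, stride):
            score = magnitude[r:r + window, c:c + window].mean()
            boxes.append((score, r, c))
    boxes.sort(reverse=True)  # strongest edge responses first
    return [(r, c, window) for _, r, c in boxes[:top_k]]

# Toy usage on a random grayscale image.
patches = propose_patches(np.random.rand(128, 128))
```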
  • Block 310 illustrates extracting features.
  • the feature extracting module 210 may extract global features and/or patch-based features from the individual images and may extract patch representations from the patches.
  • the global features may represent contextual information extracted from individual images.
  • the patch-based features may represent distinguishing features of the patches associated with individual images. Patch representations may represent distinguishing features of a particular patch.
  • Block 312 illustrates arranging the patches into clusters.
  • the clustering module 212 may arrange the individual patches into a plurality of clusters based at least in part on the patch-based features, as described above. For each cluster, the clustering module 212 may determine a canonical size. The clustering module 212 may predetermine and cache the parameters used for the LDA. The predetermined canonical size may be leveraged for determining similarity values, as described below.
  • Block 314 illustrates determining similarity values for the patches.
  • the refining module 214 may remove at least some of the individual patches from individual clusters of the plurality of clusters based at least in part on similarity values.
  • the refining module 214 may determine similarity values that are representative of a similarity between two individual patches; the determination may be based at least in part on the patch representations.
  • the refining module 214 may leverage HOG for the LDA features.
  • the refining module 214 may determine similarity values by standardizing the patch representations (e.g., LDA HOG) extracted from a first individual patch of the individual patches and a second individual patch of the individual patches to a predetermined canonical size.
  • the patch representations may be standardized by zero padding the patch representations extracted from the first individual patch and the second individual patch.
  • the first individual patch is part of a particular cluster of the plurality of clusters associated with a label and the second individual patch is part of a different cluster of the plurality of clusters associated with a different label of the plurality of labels. That is, in some examples, similarity values may be determined for patches in different clusters via intercluster comparisons.
  • the first individual patch and the second individual patch are part of a same cluster of the plurality of clusters, the same cluster associated with a same label of the plurality of labels. That is, in some examples, similarity values may be determined for patches in the same cluster via intracluster comparisons.
  • the refining module 214 may compute a dot product based at least in part on the standardized patch representations of the first individual patch and the second individual patch. In at least one example, weight vectors derived from the LDA feature extraction of the patches may be used for computing the dot product. In other examples, the refining module 214 may approximate the dot product by a Euclidean distance comparison. Leveraging the Euclidean distance enables the refining module 214 to use an index (e.g., k-dimensional tree) for nearest neighbor determinations for identifying patches that have low entropy values and high diversity, as described below. In some examples, the patches in the index may be stored and new patches provided during training and/or classifying may be compared to patches in the index for quickly and efficiently determining similarity (e.g., calculating similarity values) between the patches.
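  • A hedged sketch of the similarity computation described above: patch representations are zero-padded to a canonical size, compared with a dot product, and indexed in a k-dimensional tree (scikit-learn's KDTree) over L2-normalized vectors so that Euclidean nearest neighbors approximate dot-product/cosine similarity. The canonical size and the normalization step are assumptions.
```python
import numpy as np
from sklearn.neighbors import KDTree

def standardize(rep, canonical_size):
    """Zero-pad (or truncate) a patch representation to the canonical length."""
    out = np.zeros(canonical_size)
    n = min(len(rep), canonical_size)
    out[:n] = rep[:n]
    return out

def similarity(rep_a, rep_b, canonical_size=256):
    """Dot product of two standardized patch representations."""
    return float(np.dot(standardize(rep_a, canonical_size),
                        standardize(rep_b, canonical_size)))

def build_patch_index(representations, canonical_size=256):
    """KDTree over L2-normalized, standardized representations; Euclidean
    distance between unit vectors is monotone in cosine similarity."""
    reps = np.stack([standardize(r, canonical_size) for r in representations])
    reps /= np.linalg.norm(reps, axis=1, keepdims=True) + 1e-8
    return KDTree(reps), reps

# Toy usage: index 100 random patch representations, query neighbors of the first.
rng = np.random.default_rng(3)
raw = [rng.normal(size=rng.integers(100, 256)) for _ in range(100)]
tree, reps = build_patch_index(raw)
dist, idx = tree.query(reps[:1], k=5)  # nearest neighbors of the first patch
```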
  • Block 316 illustrates removing individual patches from the clusters.
  • the refining module 214 may remove at least some of the individual patches from individual clusters of the plurality of clusters based at least in part on the similarity values. In at least some examples, the refining module 214 may remove at least some of the individual patches based at least in part on entropy values and diversity selection. To determine whether a particular patch has a high entropy value or a low entropy value, the refining module 214 may access a plurality of individual patches in a particular cluster of the plurality of clusters. The particular cluster may be associated with a label of the plurality of labels.
  • the refining module 214 may process the individual patches to determine top nearest neighbors, as described above.
  • the individual patches may be iteratively processed. As the individual patches are processed, a predetermined number of top nearest neighbors may be selected for training the classifier(s) 218 (and those patches that are not selected are removed from the clusters).
  • specific data structures may be leveraged that increase the speed in which nearest neighbors may be determined.
  • the specific data structures may incorporate a cosine similarity metric that may be approximated by Euclidean distance. Accordingly, nearest neighbor determination may be accelerated by leveraging an index (e.g., k-dimensional tree) for all of the patches and approximating nearest neighbors using the index.
  • an index e.g., k-dimensional tree
  • the refining module 214 may determine an entropy value for each of the individual patches based at least in part on determining labels associated with the nearest neighbors within a cluster. The refining module 214 may leverage the nearest neighbor determinations to generate distributions for labels that may be representative of entropy values for individual patches. If a particular individual patch and a nearest neighbor patch are associated with a same label, the refining module 214 may assign a low entropy value (e.g., close to 0) based at least in part on a low entropy distribution. The low entropy value (e.g., close to 0) may indicate that the particular individual patch and the nearest neighbor patch are highly representative of the label.
  • If a particular individual patch and a nearest neighbor patch are associated with different labels, the refining module 214 may assign a high entropy value (e.g., close to 1) based at least in part on a high entropy distribution.
  • the high entropy value (e.g., close to 1) may indicate that the particular individual patch and the nearest neighbor patch are not representative of a same label.
  • the refining module 214 may remove all individual patches with entropy values above a predetermined threshold to ensure the training data is highly representative of the label.
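  • The sketch below illustrates the entropy-based pruning under assumed inputs (each patch's label and its nearest-neighbor indices, e.g., from the KDTree above): the label distribution over a patch's neighbors is converted to a normalized entropy, close to 0 when the neighbors share the patch's label and close to 1 when they do not, and patches above a threshold are removed.
```python
import numpy as np
from collections import Counter

def neighbor_label_entropy(neighbor_labels):
    """Normalized entropy of the label distribution among a patch's neighbors."""
    counts = np.array(list(Counter(neighbor_labels).values()), dtype=float)
    probs = counts / counts.sum()
    entropy = -np.sum(probs * np.log(probs))
    max_entropy = np.log(len(neighbor_labels)) if len(neighbor_labels) > 1 else 1.0
    return entropy / max_entropy  # value in [0, 1]

def prune_by_entropy(patch_labels, neighbor_indices, threshold=0.5):
    """Keep indices of patches whose neighbor-label entropy is below threshold."""
    kept = []
    for i, neighbors in enumerate(neighbor_indices):
        e = neighbor_label_entropy([patch_labels[j] for j in neighbors])
        if e <= threshold:
            kept.append(i)
    return kept

# Toy usage: 6 patches, their labels, and 3 nearest neighbors each.
labels = ["cat", "cat", "cat", "dog", "dog", "cat"]
neighbors = [[1, 2, 5], [0, 2, 5], [0, 1, 5], [4, 0, 1], [3, 0, 2], [0, 1, 2]]
print(prune_by_entropy(labels, neighbors, threshold=0.3))
```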
  • the refining module 214 may also remove patches that reduce the diversity of the patches.
  • the resulting patches may be arranged in a dictionary that is diverse and has a number of patches below a predetermined threshold. Patches may be diverse if the patches are representative of various portions of an object and/or various views of an object identified by the label.
  • the dictionary may be stored and new patches may be added to the dictionary over time. The dictionary of patches may be used to generate patch representations.
  • the refining module 214 may perform diversity selection by ordering individual patches in the dictionary based at least in part on the entropy value associated with each of the individual patches. Then, in a plurality of iterations, the refining module 214 may process the ordered individual patches by determining nearest neighbor patches for each individual patch of the ordered individual patches. The refining module 214 may select a particular patch if the particular patch has a threshold number of nearest neighbors with entropy values below a predetermined value. The refining module 214 may remove nearest neighbor patches to the particular patch following each iteration.
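  • A minimal sketch of the diversity-selection loop described above, with assumed data structures: patches are visited in increasing entropy order, a patch is selected only if enough of its nearest neighbors have entropy below a cutoff, and the selected patch's neighbors are then suppressed so later selections cover different parts or views.
```python
def diversity_select(entropies, neighbor_indices, min_good_neighbors=2,
                     entropy_cutoff=0.3, max_selected=100):
    """Greedy selection of diverse, low-entropy patches.

    entropies[i] is patch i's entropy; neighbor_indices[i] lists its nearest
    neighbors. Returns the indices of the selected dictionary patches."""
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    suppressed = set()
    selected = []
    for i in order:
        if i in suppressed:
            continue
        good = sum(1 for j in neighbor_indices[i] if entropies[j] <= entropy_cutoff)
        if good >= min_good_neighbors:
            selected.append(i)
            suppressed.update(neighbor_indices[i])  # drop neighbors for diversity
        if len(selected) >= max_selected:
            break
    return selected

# Toy usage with the entropy/neighbor shapes from the previous sketch.
entropies = [0.0, 0.1, 0.05, 0.6, 0.58, 0.2]
neighbors = [[1, 2, 5], [0, 2, 5], [0, 1, 5], [4, 0, 1], [3, 0, 2], [0, 1, 2]]
print(diversity_select(entropies, neighbors, min_good_neighbors=2))
```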
  • the refining module 214 may further refine the remaining patches for efficiency. For instance, suppose the patches are associated with a predetermined number of labels (e.g., L); the refining module 214 may group the patches from each label into clusters (e.g., P1, . . ., PL). In at least one example, the individual patches selected for processing in each cluster (e.g., P1, . . ., PL) may be ordered based on a corresponding entropy value and grouped into sub-clusters.
  • a final group of patches (e.g., F) for training the classifier may be iteratively selected to maximize the efficiency and accuracy of classification.
  • the recognition and/or classification performance may be measured using an example algorithm or algorithms similar to it; one hypothetical sketch of such a selection follows.
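  • The sketch below is only a hypothetical reading of that iterative selection, not the algorithm recited in the application: sub-clusters of patch features are added greedily to the final group F whenever doing so improves the cross-validated accuracy of a placeholder linear classifier.
```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

def select_final_group(sub_cluster_features, train_labels, max_groups=10):
    """Greedily add sub-clusters of patch features to the final group F while the
    cross-validated accuracy of a linear classifier keeps improving.

    sub_cluster_features: list of (n_images, d) arrays, one feature block per
    candidate sub-cluster; selected blocks are concatenated column-wise."""
    selected, best_score = [], 0.0
    remaining = list(range(len(sub_cluster_features)))
    while remaining and len(selected) < max_groups:
        best_candidate, best_candidate_score = None, best_score
        for idx in remaining:
            blocks = [sub_cluster_features[i] for i in selected + [idx]]
            X = np.hstack(blocks)
            clf = SGDClassifier(loss="hinge", max_iter=1000, random_state=0)
            score = cross_val_score(clf, X, train_labels, cv=3).mean()
            if score > best_candidate_score:
                best_candidate, best_candidate_score = idx, score
        if best_candidate is None:  # no candidate improves performance
            break
        selected.append(best_candidate)
        remaining.remove(best_candidate)
        best_score = best_candidate_score
    return selected, best_score

# Toy usage: 5 candidate sub-clusters, two of which carry label information.
rng = np.random.default_rng(4)
labels = rng.integers(0, 3, size=90)
blocks = [rng.normal(size=(90, 4)) + (labels[:, None] if i < 2 else 0) for i in range(5)]
selected, score = select_final_group(blocks, labels)
```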
  • Block 318 illustrates training a classifier.
  • the learning module 216 may train one or more classifiers 218 for the plurality of labels based at least in part on patches in the refined plurality of clusters.
  • the classifiers 218 may be trained based at least in part on patch-based features extracted from the patches in the refined clusters and, in at least some examples, global features. For example, learning algorithms such as fast rank, SGD, SVM, boosting, etc., may be applied to learn a classifier for identifying particular labels of the one or more labels.
  • classifiers for all of the labels may be trained at the same time using multi-label learning techniques, such as multiclass SVM or SGD.
  • the training described above may be applied to new labels as new labels are received and the new classifiers may be added to the classifier(s) 218.
  • FIG. 4 illustrates an example process 400 for determining whether a label is learnable based at least in part on filtering a corpus of images.
  • Block 402 illustrates sending a query, as described above.
  • Block 404 illustrates receiving a corpus of images associated with the query, as described above.
  • Block 406 illustrates filtering the corpus of images.
  • the corpus of images may be noisy and may include images that are unrelated to the queries, are of low quality, etc. Accordingly, the corpus of images may be refined.
  • the filtering module 204 may filter individual images from the corpus of images, as described in FIGS. 5-6 below. In addition to the processes described below, the filtering module 204 may apply specific filters to remove specifically identified images from the corpus of images. For instance, the filtering module 204 may remove cartoon images, images with human faces covering a predetermined portion of the image, images with low gradient intensity, etc.
  • Block 408 illustrates determining whether a label is learnable. If removing images from the corpus results in a number of images below a predetermined threshold, the filtering module 204 may determine that the label is not learnable and may turn to human intervention, as illustrated in Block 410. Conversely, if removing images from the corpus results in a number of images above a predetermined threshold, the filtering module 204 may determine that the label is learnable and may proceed with training classifier(s) 218 as illustrated in Block 412. An example process of training classifier(s) 218 is described in FIG. 3, above.
  • FIG. 5 illustrates an example process 500 for filtering a corpus of images.
  • Block 502 illustrates determining nearest neighbors for each image in the corpus of images.
  • the filtering module 204 may arrange each of the images in the corpus of images into a k-dimensional tree for facilitating nearest neighbor lookup.
  • the filtering module 204 may determine a predetermined number of nearest neighbors.
  • the filtering module 204 may leverage global features extracted from individual images for determining the nearest neighbors.
  • the filtering module 204 may determine how many times a particular individual image appears in the neighborhood of any individual image. If the particular individual image appears below a predetermined number of times, the particular individual image may be removed from the corpus of images.
  • Block 504 illustrates arranging individual images into clusters.
  • the filtering module 204 may cluster the individual images into clusters corresponding to individual labels of the plurality of labels.
  • the filtering module 204 may use single linkage clustering and may arrange individual images within a predetermined distance into a predetermined number of clusters.
  • Block 506 illustrates determining entropy values for each individual image in the cluster.
  • the filtering module 204 may process the clusters to determine nearest neighbors of an image. For each image in a particular cluster, the filtering module 204 may determine the nearest neighbors of an image in other clusters. The filtering module 204 may determine entropy values based at least in part on comparing the nearest neighbors to one another. If nearest neighbors to a particular cluster are stable (e.g., low entropy value), the particular cluster is likely stable and representative and/or distinctive of a label. However, if nearest neighbors to a particular cluster are unstable (e.g., high entropy value), the particular cluster is likely unstable and not representative or distinctive of a label.
  • Block 508 illustrates removing at least some individual images.
  • the filtering module 204 may remove individual images having entropy values above a predetermined threshold.
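  • The sketch below realizes the neighbor-frequency part of this filtering under simplifying assumptions: a k-dimensional tree over global image features is queried for each image's neighbors, and images that appear in too few other images' neighbor lists are dropped. The cluster-level entropy check could be layered on top using the same entropy computation sketched earlier for patches; the neighbor count and appearance threshold here are assumptions.
```python
import numpy as np
from sklearn.neighbors import KDTree

def filter_by_neighbor_frequency(global_features, k=10, min_appearances=2):
    """Drop images that appear in fewer than `min_appearances` other images'
    k-nearest-neighbor lists (computed on global image features)."""
    tree = KDTree(global_features)
    _, neighbors = tree.query(global_features, k=k + 1)  # first hit is the image itself
    appearances = np.zeros(len(global_features), dtype=int)
    for row in neighbors[:, 1:]:
        appearances[row] += 1
    return np.where(appearances >= min_appearances)[0]

# Toy usage: 200 images with 64-d global features.
rng = np.random.default_rng(5)
features = rng.normal(size=(200, 64))
kept_indices = filter_by_neighbor_frequency(features)
```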
  • FIG. 6 illustrates another example process 600 for filtering a corpus of images.
  • Block 602 illustrates collecting negative images.
  • a negative image is an image that is known to be excluded from training data associated with a label.
  • the receiving module 202 may perform two or more queries. At least one query may be a query for a particular label as described above (e.g., CenturyLink Field). Additional queries may include queries for individual words that make up a particular label having two or more words (e.g., CenturyLink, Field). An initial query of the additional queries may include a first word of the two or more words (e.g., CenturyLink). Each additional query of the additional queries may include each additional word of the two or more words (e.g., Field). The receiving module 202 may receive results from the two or more queries. The results returned for at least the second query may represent the negative images. In other examples, the receiving module 202 may leverage a knowledge graph (e.g., Satori, etc.) for collecting negative images.
  • Block 604 illustrates comparing images to negative images.
  • the filtering module 204 may compare individual images returned as a result of the first query to the individual images returned in the additional queries to determine similarity values as described above.
  • Block 606 illustrates removing individual images from the corpus of images based on similarity values.
  • the filtering module 204 may remove individual images with similarity values above a predetermined threshold. That is, if individual images are too similar to negative images, the individual images may be removed from the corpus.
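One way to picture the FIG. 6 filter is as a cosine-similarity comparison between each candidate image and the pool of negative images, dropping candidates that sit too close to any negative. The feature choice, the threshold, and the function name below are assumptions for illustration.

```python
import numpy as np

def remove_near_negatives(candidate_feats, negative_feats, threshold=0.9):
    """Return a boolean mask that is False for candidates too similar to a negative image."""
    c = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    n = negative_feats / np.linalg.norm(negative_feats, axis=1, keepdims=True)
    similarity = c @ n.T                 # cosine similarity, shape (n_candidates, n_negatives)
    return similarity.max(axis=1) <= threshold
```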
  • FIG. 7 illustrates an example process 700 for determining similarity values.
  • the refining module 214 may determine similarity values representative of a similarity between the individual patches. The similarity values may be determined based at least in part on the patch representations. In at least one example, the refining module 214 may leverage LDA representations of HOG as the patch representations.
  • Block 702 illustrates standardizing patch representations extracted from individual patches.
  • the refining module 214 may arrange a plurality of patches into clusters based on an aspect ratio of the patches.
  • the refining module 214 may determine similarity values by standardizing patch representations (e.g., LDA HOG) extracted from a first individual patch of the individual patches and a second individual patch of the individual patches to a predetermined canonical size.
  • the patch representations may be standardized by zero padding the patch representations extracted from the first individual patch and the second individual patch.
  • Block 704 illustrates computing a dot product based on standardized patch representations.
  • the refining module 214 may compute a dot product based at least in part on the standardized values of the first individual patch and the second individual patch.
  • weight vectors derived from the LDA feature extraction may be used for computing the dot product.
  • the refining module 214 may approximate the dot product by a Euclidean distance comparison. Leveraging the Euclidean distance enables the refining module 214 to use a k-dimensional tree for nearest neighbor determinations for identifying patches that have low entropy values and high diversity, as described below.
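Concretely, the FIG. 7 comparison can be sketched as: zero-pad two patch representations to one canonical length, take their dot product, and (after L2 normalization) fall back on Euclidean distance when a k-dimensional tree is used. The canonical size below and the flat-vector layout are assumptions, not values from this disclosure.

```python
import numpy as np

CANONICAL_SIZE = 4096  # assumed canonical length for the padded patch representations

def standardize(rep, size=CANONICAL_SIZE):
    """Zero-pad (or truncate) a patch representation to the canonical size."""
    rep = np.ravel(rep)[:size]
    out = np.zeros(size, dtype=float)
    out[:rep.shape[0]] = rep
    return out

def patch_similarity(rep_a, rep_b):
    """Dot product of the standardized representations."""
    return float(standardize(rep_a) @ standardize(rep_b))

def patch_distance(rep_a, rep_b):
    """Euclidean proxy: on L2-normalized vectors, smaller distance means larger dot product."""
    a, b = standardize(rep_a), standardize(rep_b)
    a = a / (np.linalg.norm(a) + 1e-12)
    b = b / (np.linalg.norm(b) + 1e-12)
    return float(np.linalg.norm(a - b))
```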
  • FIG. 8 illustrates an example process 800 for removing patches from clusters of patches.
  • the refining module 214 may remove at least some of the individual patches from individual clusters of the plurality of clusters based at least in part on similarity values. In at least some examples, the refining module 214 may remove at least some of the individual patches based at least in part on entropy values and diversity selection.
  • Block 802 illustrates accessing the plurality of individual patches in a particular cluster.
  • the refining module 214 may access a plurality of individual patches in a particular cluster of the plurality of clusters.
  • the particular cluster may be associated with a label of the plurality of labels.
  • Block 804 illustrates determining nearest neighbors for each individual patch.
  • the refining module 214 may process the individual patches to determine top nearest neighbors, as described above.
  • the individual patches may be iteratively processed. As the individual patches are processed, a predetermined number of top nearest neighbors may be selected for training the classifiers 218.
  • specific data structures may be leveraged that increase the speed at which nearest neighbors may be determined.
  • the specific data structures may incorporate a cosine similarity metric that may be approximated by Euclidean distance. Accordingly, nearest neighbor determination may be accelerated by leveraging a k-dimensional tree for all of the patches and approximating nearest neighbors using the k-dimensional tree.
  • Block 806 illustrates determining an entropy value based on nearest neighbors for each individual patch.
  • the refining module 214 may determine an entropy value for each of the individual patches based at least in part on determining the nearest neighbors within a cluster. If a particular individual patch and a nearest neighbor patch are associated with a same label, the refining module 214 may assign a low entropy value (e.g., close to 0). The low entropy value (e.g., close to 0) may indicate that the particular individual patch and the nearest neighbor patch are highly representative of the label.
  • the refining module 214 may assign a high entropy value (e.g., close to 1), indicating that the particular individual patch and the nearest neighbor patch are not representative of a same label.
  • Block 808 illustrates removing individual patches from the clusters of patches.
  • the refining module 214 may remove individual patches based at least in part on entropy values and/or diversity selection.
  • the refining module 214 may remove individual patches with entropy values above a predetermined threshold to ensure the training data is highly representative of the label.
  • the refining module 214 may also remove patches that reduce the diversity of the patches. Patches may be diverse if the patches are representative of various portions of an object and/or various views of an object identified by the label.
  • the refining module 214 may perform diversity selection by ordering individual patches based at least in part on the entropy value associated with each of the individual patches.
  • the refining module 214 may process the ordered individual patches by determining nearest neighbor patches for each individual patch of the ordered individual patches.
  • the refining module 214 may remove nearest neighbor patches from the cluster following each iteration.
  • the refining module 214 may select a particular patch if the particular patch had a number of nearest neighbors above a predetermined threshold with entropy values below a predetermined threshold.
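Putting the FIG. 8 steps together, a rough sketch of the per-cluster pruning might look as follows. The same-label fraction is used here as a simple stand-in for the entropy value described above, and the neighbor count and threshold are assumed values.

```python
import numpy as np
from scipy.spatial import cKDTree

def prune_patches(patch_reps, patch_labels, k=20, entropy_threshold=0.5):
    """Drop patches whose nearest neighbors mostly carry other labels."""
    patch_labels = np.asarray(patch_labels)
    reps = patch_reps / np.linalg.norm(patch_reps, axis=1, keepdims=True)
    tree = cKDTree(reps)                           # Euclidean NN approximates cosine NN here
    _, neighbors = tree.query(reps, k=k + 1)
    neighbors = neighbors[:, 1:]                   # ignore the self-match

    keep = []
    for i, nn in enumerate(neighbors):
        same_fraction = np.mean(patch_labels[nn] == patch_labels[i])
        score = 1.0 - same_fraction                # near 0: representative, near 1: not
        if score <= entropy_threshold:
            keep.append(i)
    return np.asarray(keep, dtype=int)
```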
  • FIG. 9 illustrates an example process 900 for diversity selection of particular patches for training the classifier(s) 218.
  • the refining module 214 may also remove patches that reduce the diversity of the patches. Patches may be diverse if the patches are representative of various portions of an object and/or various views of an object identified by the label.
  • Block 902 illustrates ordering individual patches based on entropy values.
  • the refining module 214 may perform diversity selection by ordering individual patches based at least in part on the entropy value associated with each of the individual patches.
  • Block 904 illustrates processing individual patches.
  • the refining module 214 may process the ordered individual patches by determining nearest neighbor patches for each individual patch of the ordered individual patches.
  • Block 906 illustrates removing nearest neighbors for each individual patch.
  • the refining module 214 may remove nearest neighbor patches from the cluster following each iteration.
  • Block 908 illustrates determining particular patches having a number of nearest neighbors above a predetermined threshold with entropy values below a predetermined threshold.
  • the refining module 214 may determine that particular patches have a number of nearest neighbors above a predetermined threshold with entropy values below a predetermined threshold.
  • Block 910 illustrates selecting particular patches for training the classifier(s) 218.
  • the refining module 214 may select a particular patch if the particular patch had a number of nearest neighbors above a predetermined threshold with entropy values below a predetermined threshold. Based at least in part on the refining module 214 removing individual patches with entropy values above a predetermined threshold and removing individual patches to maximize the diversity of the remaining patches, the refining module 214 may further refine the remaining patches for efficiency.
  • the individual patches selected for processing in each cluster may be ordered based on a corresponding entropy value and grouped into sub-clusters.
  • a final group of patches for training the classifier may be iteratively selected to maximize efficiency and accuracy of classification.
  • the feature extracting module 210 may extract patch-based features from the final group of patches (e.g., refined cluster of patches) for use in training the classifiers.
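The diversity selection of FIG. 9 can be sketched as a greedy sweep: visit patches from lowest to highest entropy, keep the current patch, and discard its near neighbors so the survivors cover different parts and views of the object. The parameter values and helper names below are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def diversity_select(patch_reps, entropy_scores, k=10, max_selected=200):
    """Greedily pick low-entropy patches while suppressing their near duplicates."""
    reps = patch_reps / np.linalg.norm(patch_reps, axis=1, keepdims=True)
    tree = cKDTree(reps)
    order = np.argsort(entropy_scores)             # most representative patches first

    removed, selected = set(), []
    for idx in order:
        idx = int(idx)
        if idx in removed:
            continue
        selected.append(idx)
        if len(selected) >= max_selected:
            break
        _, nn = tree.query(reps[idx], k=k + 1)     # this patch plus its nearest neighbors
        removed.update(int(j) for j in np.atleast_1d(nn) if int(j) != idx)
    return selected
```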
  • FIG. 10 illustrates a diagram showing an example system 1000 for classifying a new image.
  • the system 1000 may include the input module 116, training module 118, and classifying module 120.
  • the input module 116 may include the receiving module 202.
  • the receiving module 202 may receive a new image 1002 for classifying.
  • the user(s) 106 may input one or more images into the receiving module 202 via one of the user devices 108.
  • a user 106 may select an image stored on his or her user device 108 for input into the input module 116.
  • a user 106 may take a photo or video via his or her user device 108 and input the image into the input module 116.
  • the receiving module 202 may send the new image 1002 to the extraction module 206 stored in the training module 118.
  • the patch extraction module 208 that is stored in the extraction module 206 may extract patches from the new image 1002, as described above.
  • the patch extracting module 208 may send the patches 1004 to the feature extracting module 210 for extracting patch-based features from the image 1002, based at least in part on the patches 1004, as described above.
  • the feature extracting module 210 may send the patch-based features 1006 to the classifying module for classifying by the classifier(s) 218.
  • the classifying module 120 may apply the classifier(s) 218 to the patch-based features 1006 for classification.
  • the classifying module 120 may send the classified result 1008 to the user(s) 106.
  • the classified result 1008 may include a confidence score.
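End to end, the FIG. 10 flow for a new image reduces to: extract patches, build the patch-based feature, score it with the trained classifier(s) 218, and return the best label with a confidence value. The sketch assumes a scikit-learn-style classifier and stand-in helpers for the extraction modules; none of these names come from a published API.

```python
def classify_new_image(image, classifier, extract_patches, patch_based_features):
    """Mirror of the FIG. 10 pipeline using injected helper functions."""
    patches = extract_patches(image)                    # patch extracting module 208
    features = patch_based_features(image, patches)     # feature extracting module 210
    scores = classifier.decision_function([features])[0]
    best = scores.argmax()
    return classifier.classes_[best], float(scores[best])   # classified result 1008
```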
  • the example process 1100 is described in the context of the environment of FIGS. 1, 2, and 10 but is not limited to those environments.
  • the process 1100 is illustrated as a logical flow graph, each operation of which represents an operation in the illustrated or another sequence of operations that may be implemented in hardware, software, or a combination thereof.
  • the operations represent computer-executable instructions stored on one or more computer-readable media 114 that, when executed by one or more processors 112, configure a computing device to perform the recited operations.
  • computer-executable instructions include routines, programs, objects, components, data structures, and the like that configure a computing device to perform particular functions or implement particular abstract data types.
  • the computer-readable media 114 may include hard drives, floppy diskettes, optical disks, CD-ROMs, DVDs, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, flash memory, magnetic or optical cards, solid-state memory devices, or other types of storage media suitable for storing electronic instructions, as described above.
  • FIG. 11 illustrates an example process 1100 for classifying a new image 1002.
  • Block 1102 illustrates receiving input.
  • the receiving module 202 may receive a new image 1002 to be classified.
  • the user(s) 106 may input one or more images into the receiving module 202 via one of the user devices 108.
  • Block 1104 illustrates extracting patches 1004.
  • the patch extraction module 208 may extract patches 1004 from the new image 1002, as described above.
  • Block 1106 illustrates extracting features 1006.
  • the patch extracting module 208 may send the patches 1004 to the feature extracting module 210 for extracting patch-based features 1006 from the image 1002, based at least in part on the extracted patches 1004, as described above.
  • Block 1108 illustrates applying a classifier 218.
  • the feature extracting module 210 may send the patch-based features 1006 to the classifying module for classifying by the classifier(s) 218.
  • the classifying module 120 may apply the classifier(s) 218 to the patch-based features 1006 for classification.
  • Block 1110 illustrates outputting the result 1008.
  • the classifying module 120 may send the classified result 1008 to the user(s) 106.
  • a computer-implemented method comprising: accessing a corpus of images, wherein individual images of the corpus are associated with at least one label of a plurality of labels; extracting one or more patches from the individual images; extracting patch-based features from the one or more patches; extracting patch representations from individual patches of the one or more patches; arranging the individual patches into a plurality of clusters based at least in part on the patch-based features, wherein individual clusters of the plurality of clusters correspond to individual labels of the plurality of labels; determining similarity values representative of a similarity between ones of the individual patches, the determining based at least in part on patch representations; removing at least some of the individual patches from individual clusters of the plurality of clusters based at least in part on the similarity values; and training a classifier for the plurality of labels based at least in part on patch-based features extracted from the individual clusters.
  • latent Dirichlet allocation (LDA)
  • a computer-implemented method as paragraph C recites, wherein the first individual patch is part of a particular cluster of the plurality of clusters associated with the at least one label of the plurality of labels and the second individual patch is part of a different cluster of the plurality of clusters associated with a different label of the plurality of labels.
  • a computer-implemented method as paragraph C recites, wherein the first individual patch and the second individual patch are part of a same cluster of the plurality of clusters, the same cluster associated with a same label of the plurality of labels.
  • removing at least some of the individual patches from the individual clusters comprises: accessing a plurality of individual patches in a particular cluster of the plurality of clusters; determining nearest neighbors of individual patches of the plurality of individual patches based at least in part on the similarity values; determining entropy values for the individual patches based at least in part on determining the nearest neighbors of the individual patches; and removing at least some individual patches with entropy values above a predetermined threshold.
  • H One or more computer-readable media encoded with instructions that, when executed by a processor, configure a computer to perform a method as any of paragraphs A-G recites.
  • a device comprising one or more processors and one or more computer-readable media encoded with instructions that, when executed by the one or more processors, configure a computer to perform a computer-implemented method as recited in any of paragraphs A-G.
  • a system comprising: computer-readable media storing one or more modules; a processing unit operably coupled to the computer-readable media, the processing unit adapted to execute the one or more modules, the one or more modules comprising: a patch extracting module to access a plurality of images and extract one or more patches from individual images of the plurality of images; a feature extracting module to extract patch-based features from the one or more patches and patch representations from individual patches of the one or more patches; a clustering module to arrange the individual patches into a plurality of clusters based at least in part on the patch-based features; a refining module to remove at least some of the individual patches from individual clusters of the plurality of clusters based at least in part on entropy values and diversity selection; and a learning module to train a classifier for at least one label based at least in part on the individual clusters.
  • a system as paragraph J recites, further comprising a receiving module to receive the plurality of images based at least in part on a query of the at least one label.
  • a system as paragraph J or K recites, further comprising a filtering module to remove at least some of the individual images based at least in part on: the at least some of the individual images having entropy values above a predetermined threshold; or the at least some of the individual images and negative images having image similarity values above a predetermined threshold.
  • N A system as paragraph M recites, wherein the learning module trains the classifier for the at least one label based at least in part on the global features and the patch-based features.
  • the refining module further determines similarity values representative of similarities between individual patches of the one or more patches, the determining comprising: standardizing patch representations extracted from a first individual patch of the individual patches and a second individual patch of the individual patches to a predetermined canonical size; and computing a dot product based at least in part on the standardized patch representations of the first individual patch and the second individual patch.
  • the refining module removes the at least some of the individual patches from the individual clusters of the plurality of clusters based at least in part on: accessing a plurality of individual patches in a particular cluster of the plurality of clusters; determining nearest neighbors of individual patches of the plurality of patches based at least in part on the similarity values; determining entropy values based at least in part on determining the nearest neighbors to individual patches; filtering at least some of the individual patches with entropy values above a predetermined threshold, remaining individual patches of the plurality of individual patches comprising filtered patches; determining nearest neighbor patches for the filtered patches via a plurality of iterations; removing nearest neighbor patches for the filtered patches in each iteration of the plurality of iterations; determining that a particular filtered patch of the filtered patches had a number of nearest neighbors below a predetermined threshold with entropy values below a predetermined threshold; and removing the particular filtered patch.
  • R A system as any of paragraphs J-Q recite, further comprising a receiving module to receive a new image for classifying by the classifier.
  • T One or more computer-readable media as paragraph S recites, wherein training the classifier comprises: extracting new patch-based features from remaining individual patches of the individual clusters; and training the classifier based at least in part on the new patch-based features.
  • the acts further comprise, prior to extracting the one or more patches from the multimedia weakly supervised data items, filtering the plurality of weakly supervised images, the filtering including: collecting negative images; comparing the individual weakly supervised images with the negative images; and removing one or more of the individual weakly supervised images from the plurality of images based at least in part on the one or more of the individual weakly supervised images and the negative images having similarity values above a predetermined threshold.
  • W A device comprising one or more processors and one or more computer readable media as recited in any of paragraphs S-V.
  • a system comprising: computer-readable media; one or more processors; and one or more modules on the computer-readable media and executable by the one or more processors to perform operations comprising: accessing a plurality of weakly supervised images; extracting one or more patches from individual weakly supervised images of the plurality of weakly supervised images; extracting patch-based features from the one or more patches; extracting patch representations from the one or more patches; arranging individual patches into a plurality of clusters based at least in part on the patch-based features; removing at least some of the individual patches from individual clusters of the plurality of clusters based at least in part on similarity values representative of similarity between ones of the individual patches; and training a classifier for at least one label based at least in part on the plurality of clusters.
  • training the classifier comprises: extracting new patch-based features from remaining individual patches of the individual clusters; and training the classifier based at least in part on the new patch-based features.
  • a system as paragraph X or Y recites, wherein the operations further comprise, prior to extracting the one or more patches from the multimedia weakly supervised data items, filtering the plurality of weakly supervised images, the filtering including: determining nearest neighbors for each individual weakly supervised image of the plurality of weakly supervised images; arranging one or more individual weakly supervised images within a predetermined distance into image clusters; determining an entropy value for each individual weakly supervised image in an individual image cluster of the image clusters, wherein determining an entropy value for each individual weakly supervised image comprises determining a similarity value representing a similarity between each individual weakly supervised image in a particular image cluster and each individual weakly supervised image in one or more other image clusters; and removing at least some of the individual weakly supervised images when the entropy value is above a predetermined threshold.
  • AA A system as any of paragraphs X-Z recite, wherein the operations further comprise, prior to extracting the one or more patches from the multimedia weakly supervised data items, filtering the plurality of weakly supervised images, the filtering including: collecting negative images; comparing the individual weakly supervised images with the negative images; and removing one or more of the individual weakly supervised images from the plurality of images based at least in part on the one or more of the individual weakly supervised images and the negative images having similarity values above a predetermined threshold.
  • Conditional language such as, among others, "can," "could," "might" or "may," unless specifically stated otherwise, is understood within the context to present that certain examples include, while other examples do not necessarily include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that certain features, elements and/or steps are in any way required for one or more examples or that one or more examples necessarily include logic for deciding, with or without user input or prompting, whether certain features, elements and/or steps are included or are to be performed in any particular example.
  • Conjunctive language such as the phrase "at least one of X, Y or Z,” unless specifically stated otherwise, is to be understood to present that an item, term, etc. may be either X, Y, or Z, or a combination thereof.

Abstract

Optimizing multi-class image classification by leveraging patch-based features extracted from weakly supervised images to train classifiers is described. A corpus of images associated with a set of labels may be received. One or more patches may be extracted from individual images in the corpus. Patch-based features may be extracted from the one or more patches and patch representations may be extracted from individual patches of the one or more patches. The patches may be arranged into clusters based at least in part on the patch-based features. At least some of the individual patches may be removed from individual clusters based at least in part on determined similarity values that are representative of similarity between the individual patches. The system may train classifiers based in part on patch-based features extracted from patches in the refined clusters. The classifiers may be used to accurately and efficiently classify new images.

Description

OPTIMIZING MULTI-CLASS IMAGE CLASSIFICATION USING PATCH FEATURES
BACKGROUND
[0001] Computer vision may include object recognition, object categorization, object class detection, image classification, etc. Object recognition may describe finding a particular object (e.g., a handbag of a particular make, a face of a particular person, etc.). Object categorization and object class detection may describe finding objects that belong in a particular category or class (e.g., faces, shoes, cars, etc.). Image classification may describe assigning an entire image to a particular category or class (e.g., location recognition, texture classification, etc.). Computerized object recognition, detection, and/or classification using images is challenging because objects in the real world vary greatly in visual appearance. For instance, objects associated with a single label (e.g., cat, dog, car, house, etc.) exhibit diversity in color, shape, size, viewpoint, lighting, etc.
[0002] Some current object detection, recognition, and/or classification methods include training classifiers based on supervised, or labeled, data. Such methods are not scalable. Others of the current object detection, recognition, and/or classification methods leverage localized image features (e.g., Histogram of Oriented Gradients (HOG)) to learn common- sense knowledge (e.g., eye is part of a person) or specific sub-labels of generic labels (e.g., a generic label of horse includes sub-labels of brown horse, riding horse, etc.). However, using localized image features (e.g., HOG) is computationally intensive. Accordingly, current techniques for object detection, recognition, and/or classification are not scalable and are computationally intensive.
SUMMARY
[0003] This disclosure describes techniques for optimizing multi-class image classification by leveraging patch-based features extracted from weakly supervised images. The techniques described herein leverage patch-based features to optimize the multi-class image classification by improving accuracy in using classifiers to classify incoming images and reducing the amount of computational resources used for training classifiers.
[0004] The systems and methods describe learning classifiers from weakly supervised images available on the Internet. In at least some examples, the systems described herein may receive a corpus of images associated with a set of labels. Each image in the corpus of images may be associated with at least one label in the set of labels. The system may extract one or more patches from individual images in the corpus of images. The system may extract patch-based features from the one or more patches and patch representations from individual patches of the one or more patches. The system may arrange the patches into clusters based at least in part on the patch-based features. Moreover, the system may determine similarity values representative of a similarity between individual patches. At least some of the individual patches may be removed from individual clusters based at least in part on the similarity values. The system may extract patch-based features based at least in part on patches remaining in refined clusters. The system may train classifiers based at least in part on the patch-based features.
[0005] The systems and methods further describe applying the classifiers to classify new images. In at least one example, a user may input an image into the trained system described herein. The system may extract patches from the image and extract features from the image. The system may apply a classifier to the extracted features to classify the new image. Additionally, the system may output a result to the user. The result may include classification of the image determined by applying the classifier to the features extracted from the image.
[0006] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] The Detailed Description is set forth with reference to the accompanying figures, in which the left-most digit of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in the same or different figures indicates similar or identical items or features.
[0008] FIG. 1 is a diagram showing an example system for training classifiers from images and applying the trained classifiers to classify new images.
[0009] FIG. 2 is a diagram showing additional components of the example system for training classifiers from weakly supervised images and applying the trained classifiers to classify new images.
[0010] FIG. 3 illustrates an example process for training classifiers from patch-based features.
[0011] FIG. 4 illustrates an example process for determining whether a label is learnable based at least in part on filtering a corpus of images.
[0012] FIG. 5 illustrates an example process for filtering a corpus of images.
[0013] FIG. 6 illustrates another example process for filtering a corpus of images.
[0014] FIG. 7 illustrates an example process for determining similarity values.
[0015] FIG. 8 illustrates an example process for removing patches from clusters of patches.
[0016] FIG. 9 illustrates an example process for diversity selection of particular patches for training.
[0017] FIG. 10 illustrates a diagram showing an example system for classifying a new image.
[0018] FIG. 11 illustrates an example process for classifying a new image.
DETAILED DESCRIPTION
[0019] Computer vision object (e.g., people, animals, landmarks, etc.), texture, and/or scene classification in images (e.g., photo, video, etc.) may be useful for several applications including photo and/or video recognition, image searching, product related searching, etc. Current classification methods include training classifiers based on supervised, or labeled, data. Such methods are not scalable or extendable. Moreover, current classification methods leverage localized image features (e.g., HOG) to learn common-sense knowledge (e.g., eye is part of a person) or specific sub-labels of generic labels (e.g., a generic label of horse includes sub-labels of brown horse, riding horse, etc.). However, using localized image features (e.g., HOG) is computationally intensive. That is, current data-mining techniques require substantial investments of computer resources and are not scalable and/or extendable.
[0020] Techniques described herein optimize multi-class image classification by leveraging patch-based features extracted from weakly supervised images. The systems and methods described herein may be useful for training classifiers and classifying images using the classifiers. Such classification may be leveraged for several applications including object recognition (e.g., finding a particular object such as a handbag of a particular make, a face of a particular person, etc.), object categorization or class detection (e.g., finding objects that belong in a particular category or class), and/or image classification (e.g., assigning an entire image to a particular category or class). For instance, such classification may be useful for photo and/or video recognition, image searching, product related searching, etc. The techniques described herein leverage patch-based features to optimize the multi-class image classification by reducing the amount of computational resources used for training classifiers. Additionally, using patch-based features improves efficiency and accuracy in using the classifiers to classify incoming images.
[0021] The systems and methods describe learning classifiers from weakly supervised images available on the Internet. In at least some examples, the system described herein may receive a corpus of images associated with a set of labels. Each image in the corpus of images may be associated with at least one label in the set of labels. The system may extract one or more patches from individual images in the corpus of images. A patch may represent regions or parts of an image. Patches may be representative of an object or a portion of an object in an image and may be discriminative such that they may be detected in multiple images with high recall and precision. In at least some examples, patches may be discriminative such that they may be detected in a number of images associated with a same label more frequently than they may be detected in images associated with various, different labels.
[0022] The system may extract patch-based features from the individual images. Patch-based features are image-level features that describe or represent an image. Patch-based features may represent a patch distribution over a patch dictionary as described below. Patch-based features for an individual image are based at least in part on patches that are extracted from the individual image. In some examples, a plurality of patches is extracted from an individual image and the patch-based features may be based on the plurality of patches extracted from the individual image. In other examples, only a single patch is extracted from an image and the patch-based features may be based on the single patch. Patch-based features enable the systems described herein to train classifiers using less data, therefore increasing efficiency and reducing computational resources consumed for training.
[0023] The system may extract patch representations from the individual patches. Patch representations describe features extracted from individual patches. Patch representations may represent patch-level features and may be used for refining the clusters, as described below.
[0024] The system may arrange individual patches of the one or more patches into clusters based at least in part on patch-based features. Individual clusters correspond to individual labels of the set of labels. The clusters may be refined based at least in part on the patch-based features. The system may determine similarity values based at least in part on the patch representations. The similarity values may be representative of similarity between individual patches in same and/or different clusters. The system may process the clusters to remove at least some of the individual patches based at least in part on the similarity values. Based at least in part on the patches that remain after processing the clusters, the system may extract patch-based features from the patches in the refined clusters. The system may leverage the patch-based features extracted from the refined clusters of patches to train classifiers.
[0025] The systems and methods herein further describe applying the classifiers to classify new images. In at least one example, a user may input an image into the trained system described herein. The system may extract patches and features from the image. The system may apply a classifier to the extracted features to classify the input image. Additionally, the system may output a result to the user. The result may include classification of the image determined by applying the classifier to the features extracted from the image.
Illustrative Environment
[0026] The environment described below constitutes but one example and is not intended to limit application of the system described below to any one particular operating environment. Other environments may be used without departing from the spirit and scope of the claimed subject matter. The various types of processing described herein may be implemented in any number of environments including, but not limited to, stand-alone computing systems, network environments (e.g., local area networks or wide area networks), peer-to-peer network environments, distributed-computing (e.g., cloud-computing) environments, etc.
[0027] FIG. 1 is a diagram showing an example system 100 for training classifiers from images and applying the trained classifiers to classify new images. More particularly, the example operating environment 100 may include a service provider 102, one or more network(s) 104, one or more users 106, and one or more user devices 108 associated with the one or more users 106. Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components such as accelerators. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc. For example, an accelerator can represent a hybrid device, such as one from ZYLEX or ALTERA that includes a CPU core embedded in an FPGA fabric.
[0028] As shown, the service provider 102 may include one or more server(s) 110, which may include one or more processing unit(s) 112 and computer-readable media 114. Executable instructions stored on computer-readable media 114 can include, for example, an input module 116, a training module 118, and a classifying module 120, and other modules, programs, or applications that are loadable and executable by processing unit(s) 112 for classifying images. The one or more server(s) 110 may include devices. The service provider 102 may be any entity, server(s), platform, etc., that may learn classifiers from weakly supervised images and apply the learned classifiers for classifying new images. The service provider 102 may receive a corpus of images associated with a set of labels and may extract patches from individual images in the corpus. The service provider 102 may extract features from the patches and images for training a classifier. The service provider 102 may leverage the classifier to classify new images input by users 106.
[0029] In some examples, the network(s) 104 may be any type of network known in the art, such as the Internet. Moreover, the users 106 may communicatively couple to the network(s) 104 in any manner, such as by a global or local wired or wireless connection (e.g., local area network (LAN), intranet, etc.). The network(s) 104 may facilitate communication between the server(s) 110 and the user devices 108 associated with the users 106.
[0030] In some examples, the users 106 may operate corresponding user devices 108 to perform various functions associated with the user devices 108, which may include one or more processing unit(s) 112, computer-readable storage media 114, and a display. Executable instructions stored on computer-readable media 114 can include, for example, the input module 116, the training module 118, and the classifying module 120, and other modules, programs, or applications that are loadable and executable by processing unit(s) 112 for classifying images. Furthermore, the users 106 may utilize the user devices 108 to communicate with other users 106 via the one or more network(s) 104.
[0031] User device(s) 108 can represent a diverse variety of device types and are not limited to any particular type of device. Examples of device(s) 108 can include but are not limited to stationary computers, mobile computers, embedded computers, or combinations thereof. Example stationary computers can include desktop computers, work stations, personal computers, thin clients, terminals, game consoles, personal video recorders (PVRs), set-top boxes, or the like. Example mobile computers can include laptop computers, tablet computers, wearable computers, implanted computing devices, telecommunication devices, automotive computers, personal data assistants (PDAs), portable gaming devices, media players, cameras, or the like. Example embedded computers can include network enabled televisions, integrated components for inclusion in a computing device, appliances, microcontrollers, digital signal processors, or any other sort of processing device, or the like.
[0032] As described above, the service provider 102 may include one or more server(s) 110, which may include devices. Examples support scenarios where device(s) that may be included in the one or more server(s) 110 can include one or more computing devices that operate in a cluster or other clustered configuration to share resources, balance load, increase performance, provide fail-over support or redundancy, or for other purposes. Device(s) included in the one or more server(s) 110 can represent, but are not limited to, desktop computers, server computers, web-server computers, personal computers, mobile computers, laptop computers, tablet computers, wearable computers, implanted computing devices, telecommunication devices, automotive computers, network enabled televisions, thin clients, terminals, personal data assistants (PDAs), game consoles, gaming devices, work stations, media players, personal video recorders (PVRs), set-top boxes, cameras, integrated components for inclusion in a computing device, appliances, or any other sort of computing device.
[0033] As described above, device(s) that may be included in the one or more server(s) 110 and/or user device(s) 108 can include any type of computing device having one or more processing unit(s) 112 operably connected to computer-readable media 114 such as via a bus, which in some instances can include one or more of a system bus, a data bus, an address bus, a PCI bus, a Mini-PCI bus, and any variety of local, peripheral, and/or independent buses. Executable instructions stored on computer-readable media 114 can include, for example, the input module 116, the training module 118, and the classifying module 120, and other modules, programs, or applications that are loadable and executable by processing unit(s) 112. Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components such as accelerators. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc. For example, an accelerator can represent a hybrid device, such as one from ZyXEL® or Altera® that includes a CPU core embedded in an FPGA fabric.
[0034] Device(s) that may be included in the one or more server(s) 110 and/or user device(s) 108 can further include one or more input/output (I/O) interface(s) coupled to the bus to allow device(s) to communicate with other devices such as user input peripheral devices (e.g., a keyboard, a mouse, a pen, a game controller, a voice input device, a touch input device, gestural input device, and the like) and/or output peripheral devices (e.g., a display, a printer, audio speakers, a haptic output, and the like). Devices that may be included in the one or more server(s) 110 can also include one or more network interfaces coupled to the bus to enable communications between the computing device and other networked devices such as user device(s) 108. Such network interface(s) can include one or more network interface controllers (NICs) or other types of transceiver devices to send and receive communications over a network. For simplicity, some components are omitted from the illustrated system.
[0035] Processing unit(s) 112 can represent, for example, a CPU-type processing unit, a GPU-type processing unit, a field-programmable gate array (FPGA), another class of digital signal processor (DSP), or other hardware logic components that may, in some instances, be driven by a CPU. For example, and without limitation, illustrative types of hardware logic components that can be used include Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc. In various examples, the processing unit(s) 112 may execute one or more modules and/or processes to cause the server(s) 110 to perform a variety of functions, as set forth above and explained in further detail in the following disclosure. Additionally, each of the processing unit(s) 112 may possess its own local memory, which also may store program modules, program data, and/or one or more operating systems.
[0036] In at least one configuration, the computer-readable media 114 of the server(s) 110 and/or user device(s) 108 may include components that facilitate interaction between the service provider 102 and the users 106. For example, the computer-readable media 114 may include the input module 116, the training module 118, and the classifying module 120, as described above. The modules (116, 118, and 120) can be implemented as computer-readable instructions, various data structures, and so forth via at least one processing unit(s) 112 to configure a device to execute instructions and to perform operations implementing training classifiers from images and leveraging the classifiers to classify new images. Functionality to perform these operations may be included in multiple devices or a single device.
[0037] Depending on the exact configuration and type of the server(s) 110 and/or the user devices 108, the computer-readable media 114 may include computer storage media and/or communication media. Computer storage media can include volatile memory, nonvolatile memory, and/or other persistent and/or auxiliary computer storage media, removable and non-removable computer storage media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data. Computer memory is an example of computer storage media. Thus, computer storage media includes tangible and/or physical forms of media included in a device and/or hardware component that is part of a device or external to a device, including but not limited to random-access memory (RAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), phase change memory (PRAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory, compact disc read-only memory (CD-ROM), digital versatile disks (DVDs), optical cards or other optical storage media, miniature hard drives, memory cards, magnetic cassettes, magnetic tape, magnetic disk storage, magnetic cards or other magnetic storage devices or media, solid-state memory devices, storage arrays, network attached storage, storage area networks, hosted computer storage or any other storage memory, storage device, and/or storage medium that can be used to store and maintain information for access by a computing device.
[0038] In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transmission mechanism. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. Such signals or carrier waves, etc. can be propagated on wired media such as a wired network or direct-wired connection, and/or wireless media such as acoustic, RF, infrared and other wireless media. As defined herein, computer storage media does not include communication media. That is, computer storage media does not include communication media consisting solely of a modulated data signal, a carrier wave, or a propagated signal, per se.
Training Classifiers
[0039] FIG. 2 is a diagram showing additional components of the example system 200 for training classifiers from weakly supervised images and applying the trained classifiers to classify new images. As shown in FIGS. 1 and 2, the system 200 may include the input module 116, the training module 118, and the classifying module 120.
[0040] The input module 116 may receive images and, in some examples, may remove at least some of the images using a filtering process described below. The input module 116 may include additional components or modules, such as a receiving module 202 and a filtering module 204.
[0041] In at least one example, the receiving module 202 may receive the plurality of images based at least in part on sending a query. A query may be a query for a single label or a plurality of labels. A query may be a textual query, image query, etc. For example, the query may include words used to identify a label (e.g., "orca whale") and related words and/or phrases (e.g., "killer whale," "blackfish," etc.). In at least one example, a user 106 may include optional modifiers to the query. For example, if a user wishes to use "jaguar" as a query, a user may modify the query "jaguar" to include "animal." In such examples, the resulting corpus of images may include jaguar animals but may exclude Jaguar® cars. The input module 116 may send the query to one or more search engines, social-networking services, blogging services, and/or other websites or web services. The receiving module 202 may receive the plurality of images based at least in part on sending the query.
[0042] In at least one example, the receiving module 202 may receive weakly supervised images. Weakly supervised images may include images associated with a label. However, the label may or may not correctly identify the subject matter of the image. Additionally, the label may identify the image or individual objects in the image, but the system described herein may not be able to determine which subject (e.g., the image or an individual object in the image) the label identifies. In contrast, supervised images may be labeled with a certainty above a predetermined threshold and unsupervised images may not be labeled at all. In additional or alternative examples, the techniques described herein may be applied to various types of multimedia data (e.g., videos, animations, etc.) and, in such examples, the receiving module 202 may receive various types of multimedia data items.
[0043] The weakly supervised images may be available on the Internet. For example, for any query associated with a label, weakly supervised images may be extracted from data available on the Internet in search engines, social-networking services, blogging services, data sources, and/or other websites or web services. Examples of search engines include Bing®, Google®, Yahoo! Search®, Ask®, etc. Examples of social-networking services include Facebook®, Twitter®, Instagram®, MySpace®, Flickr®, YouTube®, etc. Examples of blogging services include WordPress®, Blogger®, Squarespace®, Windows Live Spaces®, WeiBo®, etc. Examples of data sources include ImageNet (maintained by Stanford University), open video annotation project (maintained by Harvard University), etc.
[0044] In some examples, the weakly supervised images may be accessible by the public (e.g., data stored in search engines, public Twitter® pictures, public Facebook® pictures, etc.). However, in other examples, the weakly supervised images may be private (e.g., private Facebook® pictures, private YouTube® videos, etc.) and may not be viewed by the public. In such examples (i.e., when the weakly supervised images are private), the systems and methods described herein may not proceed without first obtaining permission from the authors of the weakly supervised images to access the image.
[0045] In the examples where the weakly supervised images are private or include personally identifiable information (PII) that identify or can be used to identify, contact, or locate a person to whom such images pertain, a user 106 may be provided with notice that the systems and methods herein are collecting PII. Additionally, prior to initiating PII data collection, users 106 may have an opportunity to opt-in or opt-out of the PII data collection. For example, a user 106 may opt-in to the PII data collection by taking affirmative action indicating that he or she consents to the PII data collection. Alternatively, a user 106 may be presented with an option to opt-out of the PII data collection. An opt-out option may require an affirmative action to opt-out of the PII data collection, and in the absence of affirmative user action to opt-out, PII data collection may be impliedly permitted.
[0046] As described above, labels correspond to queries. Labels may correspond to a descriptive term for a particular entity (e.g., animal, plant, attraction, etc.). Queries are textual terms or phrases that may be used to collect the corpus of images from search engines, social networks, etc. Typically, a label corresponds to a particular query, but in some examples, a label may correspond to more than one query. For example, in such examples, the label "orca whale" may correspond to queries such as "orca whale," "killer whale," and/or "blackfish."
[0047] The plurality of images returned to the receiving module 202 may be noisy. Accordingly, the filtering module 204 may filter one or more images from the plurality of images to mitigate the noise in the images used for training classifiers. In additional or alternative examples, the receiving module 202 may receive new images for classifying by the trained classifiers.
[0048] The training module 118 may train classifiers from weakly supervised images. The training module 118 may include additional components or modules for training the classifiers. In at least one example, the training module 118 may include an extraction module 206, which includes a patch extracting module 208 and feature extracting module 210, a clustering module 212, a refining module 214, and a learning module 216.
[0049] As described above, the extraction module 206 may include a patch extracting module 208 and a feature extracting module 210. The patch extracting module 208 may access a plurality of images from the receiving module 202 and extract one or more patches from individual images of the plurality of images. As described above, patches may represent regions or parts of an image. Individual patches may correspond to an object or a portion of an object in an image. In some examples, there may be multiple patches in an individual image.
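As a concrete illustration of what the patch extracting module 208 produces, the sketch below cuts dense, fixed-size windows out of an image; the window size, stride, and the sliding-window strategy itself are assumptions, since this disclosure does not prescribe a particular extraction scheme.

```python
def extract_patches(image, patch_size=64, stride=32):
    """Return a list of square patches cut from an (H, W[, C]) image array on a regular grid."""
    patches = []
    height, width = image.shape[:2]
    for y in range(0, height - patch_size + 1, stride):
        for x in range(0, width - patch_size + 1, stride):
            patches.append(image[y:y + patch_size, x:x + patch_size])
    return patches
```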
[0050] The feature extracting module 210 may extract global features and patch-based features. Additionally, the feature extracting module 210 may extract patch representations from the patches. Leveraging global features and patch-based features improves accuracy in recognizing and classifying objects in images. The patch representations may be leveraged for refining the patches, as described below.
[0051] Global feature extraction may describe the process of identifying interesting portions or shapes of images and extracting those features for additional processing. The process of identifying interesting portions or shapes of images may occur via common multimedia feature extraction techniques such as SIFT (scale-invariant feature transform), deep neural networks (DNN) feature extractor, etc. In at least one example, multimedia feature extraction may describe turning an image into a high dimensional feature vector. For example, all information provided may be organized as a single vector, which is commonly referred to as a feature vector. In at least one example, each image in the corpus of images may have a corresponding feature vector based on a suitable set of features. Global features may include visual features, textual features, etc. Visual features may range from simple visual features, such as edges and/or corners, to more complex visual features, such as objects. Textual features include tags, classes, and/or metadata associated with the images.
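For a self-contained illustration of a global feature, the sketch below uses a joint color histogram as the image-level feature vector; the SIFT or DNN features mentioned above would slot into the pipeline the same way, producing one fixed-length vector per image. The bin count is an assumed value.

```python
import numpy as np

def global_feature(image, bins=8):
    """Normalized joint color histogram for an (H, W, 3) uint8 image."""
    hist, _ = np.histogramdd(
        image.reshape(-1, 3).astype(float),
        bins=(bins, bins, bins),
        range=((0, 256), (0, 256), (0, 256)),
    )
    vec = hist.ravel()
    return vec / max(vec.sum(), 1.0)
```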
[0052] Patch-based feature extraction may describe extracting image-level features based at least in part on patches extracted from an image. In at least one example, the patch-based features may be based at least in part on patches in refined clusters of patches, as described below. In some examples, patch-based features are similar to mid-layer representations in DNNs. Patch-based features may represent a patch distribution over the patch dictionary, described below. Patch-based features enable the systems described herein to train classifiers using less data, therefore increasing efficiency and reducing computational resources consumed for training. Various models that linearly transform a feature space associated with the images may be used to extract patch-based features, such as latent Dirichlet allocation (LDA), Support Vector Machines (SVM), etc.
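A minimal sketch of the patch-distribution idea: assign each patch representation to its nearest entry in a patch dictionary (for example, the cluster centers produced by the clustering step below) and describe the image by the normalized histogram of assignments. The dictionary and the nearest-center assignment rule are assumptions about one plausible realization, not the only one this disclosure covers.

```python
import numpy as np

def patch_based_feature(patch_reps, dictionary):
    """Histogram of nearest-dictionary-entry assignments, normalized to sum to 1."""
    # patch_reps: (n_patches, d); dictionary: (n_words, d)
    dists = np.linalg.norm(patch_reps[:, None, :] - dictionary[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)
    hist = np.bincount(assignments, minlength=dictionary.shape[0]).astype(float)
    return hist / max(hist.sum(), 1.0)
```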
[0053] The feature extracting module 210 may also extract patch representations. Patch representations describe features extracted from individual patches. As described above, patch representations may represent patch-level features and may be used for refining the clusters. Various models may be used to extract patch representations, such as, but not limited to, LDA representations of HOG, etc.
[0054] The clustering module 212 may arrange the patches in clusters based on the patch-based features. In at least some examples, to increase the speed of processing the images for training classifiers, the clustering module 212 may arrange the individual patches into a plurality of clusters based at least in part on the patch-based features, as described above. Patches may be placed in a same cluster based at least in part on over-clustering the LDA representation of the patches associated with an image to generate the clusters. Aspect ratio may be implicitly captured by the patch-based features. In some examples, each cluster may represent a particular label. In other examples, each cluster may represent various views associated with a particular label. In additional or alternative examples, the clustering module 212 may use different methods of vector quantization, including K-Means clustering, to arrange the clusters of patches.
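The sketch below illustrates over-clustering with k-means as one possible vector quantization; the cluster count per label and the function name cluster_patches are placeholder choices, not disclosed parameters.

```python
# Illustrative sketch: over-cluster the patch representations for a label so
# that each label contributes several tight clusters.
import numpy as np
from sklearn.cluster import KMeans

def cluster_patches(patch_representations, clusters_per_label=50, seed=0):
    """patch_representations: (n_patches, d) array for a single label."""
    n_clusters = min(clusters_per_label, len(patch_representations))
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    cluster_ids = kmeans.fit_predict(patch_representations)
    return cluster_ids, kmeans.cluster_centers_
```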
[0055] The refining module 214 may remove patches from individual clusters based at least in part on similarity values that are representative of a similarity between individual patches. The refining module 214 may determine the similarity values, as described below. The similarity values may be used to determine entropy values and the entropy values may be used for processing the patches via diversity selection, as described below. Entropy values may represent certainty measures. One or more patches may be removed from individual clusters based at least in part on the entropy values and diversity selection. Following the removal of patches from the individual clusters, the remaining patches may have lower entropy values and/or more diversity than the patches in the pre-processed clusters. The resulting clusters may be refined clusters of patches used for training classifiers to classify images.
[0056] The learning module 216 may leverage one or more learning algorithms to train classifiers for one or more labels associated with the refined clusters of patches. The feature extracting module 210 may extract patch-based features from the patches in the refined clusters of patches. The classifiers may be trained based at least in part on the extracted patch-based features and, in at least some examples, global features. For example, learning algorithms such as fast rank, Stochastic Gradient Descent (SGD), SVMs, boosting, etc., may be applied to learn a classifier for identifying particular labels of the one or more labels. In at least some examples, classifiers for all of the labels may be trained at the same time using multi-label learning techniques, such as multiclass SVM or SGD. In other examples, the training described above may be applied to new labels as new labels are received and the new classifiers may be added to the classifier(s) 218.
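A minimal training sketch follows, using a linear SVM from scikit-learn as a stand-in for the SVM/SGD learners mentioned above; the feature layout and helper names are assumptions for illustration.

```python
# Illustrative sketch: train a multi-class classifier on image-level feature
# vectors (patch-based features, optionally concatenated with global features).
# LinearSVC is a stand-in for the SVM/SGD learners mentioned above.
import numpy as np
from sklearn.svm import LinearSVC

def train_classifier(image_features, image_labels):
    """image_features: (n_images, d); image_labels: (n_images,) label ids."""
    classifier = LinearSVC(C=1.0)
    classifier.fit(image_features, image_labels)
    return classifier

def classify(classifier, image_features):
    """Assumes more than two labels so decision_function is (n_images, n_labels)."""
    scores = classifier.decision_function(image_features)    # per-label confidence scores
    return classifier.classes_[np.argmax(scores, axis=1)], scores
```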
[0057] The classifying module 120 may store the classifier(s) 218. The classifying module 120 may receive patches and patch-based features extracted from new images and may apply the classifier(s) 218 to the patch-based features. The classifying module 120 may output results including labels identifying and/or classifying images. In at least some examples, the output results may include confidence scores corresponding to each label.
Example Processes
[0058] FIGS. 3-5 describe example processes for training classifiers from weakly supervised images. The example processes are described in the context of the environment of FIGS. 1 and 2 but are not limited to those environments. The processes are illustrated as logical flow graphs, each operation of which represents an operation in the illustrated or another sequence of operations that may be implemented in hardware, software, or a combination thereof. In the context of software, the operations represent computer-executable instructions stored on one or more computer-readable media 114 that, when executed by one or more processors 112, configure a computing device to perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures, and the like that configure a computing device to perform particular functions or implement particular abstract data types.
[0059] The computer-readable media 114 may include hard drives, floppy diskettes, optical disks, CD-ROMs, DVDs, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, flash memory, magnetic or optical cards, solid-state memory devices, or other types of storage media suitable for storing electronic instructions, as described above. Finally, the order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement the process.
[0060] FIG. 3 illustrates an example process 300 for training classifiers from patch- based features.
[0061] Block 302 illustrates sending a query. In at least some examples, training classifiers may begin with the input module 116 sending a query, as described above.
[0062] Block 304 illustrates receiving a corpus of images associated with the query. Based at least in part on sending the query, images relating to the query may be returned to the receiving module 202 from the one or more search engines, social-networking services, blogging services, and/or other websites or web services, as described above. Additional queries associated with individual labels of a set of labels may be sent to the one or more search engines, social-networking services, blogging services, and/or other websites or web services as described above, and corresponding images may be returned and added to the corpus of images for training classifier(s) 218. In some examples, the corpus may be noisy and may include images that are unrelated to the queries, are of low quality, etc. Accordingly, the corpus of images may be refined. In at least some examples, the filtering module 204 may filter individual images from the corpus of images, as described below in FIGS. 4-6.
[0063] Block 306 illustrates accessing a corpus of images. The extraction module 206 may access the corpus of images from the input module 116 for processing. In at least some embodiments, the corpus of images may be filtered before proceeding with processing the individual images from the corpus of images. Example processes for filtering are described in FIGS. 4-6.
[0064] Block 308 illustrates extracting patches from individual images. As described above, patches may represent regions or parts of an image. Individual patches may correspond to an object or a portion of an object in an image. In some examples, there may be multiple patches in each image. The patch extraction module 208 may leverage edge detection to extract patches that correspond to objects or portions of objects in images. In at least one example, the patch extraction module 208 may use structured edge detection and/or fast edge detection (e.g., via structured random forests, etc.). In other examples, the patch extraction module 208 may extract patches based at least in part on detecting edges using intensity, color gradients, classifiers, etc.
[0065] Block 310 illustrates extracting features. As described above, the feature extracting module 210 may extract global features and/or patch-based features from the individual images and may extract patch representations from the patches. The global features may represent contextual information extracted from individual images. The patch-based features may represent distinguishing features of the patches associated with individual images. Patch representations may represent distinguishing features of a particular patch.
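As a rough illustration of edge-driven patch extraction, the sketch below proposes patches wherever a sliding window contains enough edge pixels. The Canny detector, window size, stride, and edge-fraction threshold are simplifying assumptions standing in for the structured or fast edge detection described in paragraph [0064]; they are not the disclosed detector.

```python
# Illustrative sketch: propose patches where edge density is high. Canny edges
# and a sliding window are a simplified stand-in for structured/fast edge
# detection.
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import canny

def extract_patches(image_rgb, window=64, stride=32, min_edge_fraction=0.05):
    """Return a list of (window, window, 3) sub-images with enough edge support."""
    edges = canny(rgb2gray(image_rgb), sigma=2.0)
    patches = []
    h, w = edges.shape
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            box = edges[top:top + window, left:left + window]
            if box.mean() >= min_edge_fraction:       # enough edge pixels in the window
                patches.append(image_rgb[top:top + window, left:left + window])
    return patches
```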
[0066] Block 312 illustrates arranging the patches into clusters. In at least some examples, to increase the speed of processing the images for training classifiers, the clustering module 212 may arrange the individual patches into a plurality of clusters based at least in part on the patch-based features, as described above. For each cluster, the clustering module 212 may determine a canonical size. The clustering module 212 may predetermine and cache the Σ for the LDA. The predetermined canonical size may be leveraged for determining similarity values, as described below.
[0067] Block 314 illustrates determining similarity values for the patches. The refining module 214 may remove at least some of the individual patches from individual clusters of the plurality of clusters based at least in part on similarity values. The refining module 214 may determine similarity values that are representative of a similarity between two individual patches; the determining may be based at least in part on the patch representations. In at least one example, the refining module 214 may leverage HOG for the LDA features. The refining module 214 may determine similarity values by standardizing the patch representations (e.g., LDA HOG) extracted from a first individual patch of the individual patches and a second individual patch of the individual patches to a predetermined canonical size. In at least one example, the patch representations (e.g., LDA HOG) may be standardized by zero padding the patch representations extracted from the first individual patch and the second individual patch. In some examples, the first individual patch is part of a particular cluster of the plurality of patches associated with a label and the second individual patch is part of a different cluster of the plurality of patches associated with a different label of the plurality of labels. That is, in some examples, similarity values may be determined for patches in different clusters via intercluster comparisons. In other examples, the first individual patch and the second individual patch are part of a same cluster of the plurality of clusters, the same cluster associated with a same label of the plurality of labels. That is, in some examples, similarity values may be determined for patches in the same cluster via intracluster comparisons.
[0068] The refining module 214 may compute a dot product based at least in part on the standardized patch representations of the first individual patch and the second individual patch. In at least one example, weight vectors derived from the LDA feature extraction of the patches may be used for computing the dot product. In other examples, the refining module 214 may approximate the dot product by a Euclidean distance comparison. Leveraging the Euclidean distance enables the refining module 214 to use an index (e.g., k-dimensional tree) for nearest neighbor determinations for identifying patches that have low entropy values and high diversity, as described below. In some examples, the patches in the index may be stored and new patches provided during training and/or classifying may be compared to patches in the index for quickly and efficiently determining similarity (e.g., calculating similarity values) between the patches.
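A compact sketch of the similarity computation described above follows; the canonical size value and the function names are placeholders chosen for illustration.

```python
# Illustrative sketch: compare two patch representations by zero-padding them
# to a shared canonical size and taking a dot product; a Euclidean distance on
# the same padded vectors can stand in for the dot product when an index such
# as a k-dimensional tree is used.
import numpy as np

def pad_to_canonical(vector, canonical_size):
    padded = np.zeros(canonical_size, dtype=np.float64)
    n = min(len(vector), canonical_size)
    padded[:n] = vector[:n]                       # zero padding to the canonical size
    return padded

def similarity(rep_a, rep_b, canonical_size=2048):
    a = pad_to_canonical(np.ravel(rep_a), canonical_size)
    b = pad_to_canonical(np.ravel(rep_b), canonical_size)
    return float(np.dot(a, b))

def euclidean_distance(rep_a, rep_b, canonical_size=2048):
    a = pad_to_canonical(np.ravel(rep_a), canonical_size)
    b = pad_to_canonical(np.ravel(rep_b), canonical_size)
    return float(np.linalg.norm(a - b))           # usable with k-d tree indexes
```

Because the padded vectors have a fixed length, the Euclidean form can be used directly with a k-dimensional tree index.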
[0069] Block 316 illustrates removing individual patches from the clusters. As described above, the refining module 214 may remove at least some of the individual patches from individual clusters of the plurality of clusters based at least in part on the similarity values. In at least some examples, the refining module 214 may remove at least some of the individual patches based at least in part on entropy values and diversity selection. To determine whether a particular patch has a high entropy value or a low entropy value, the refining module 214 may access a plurality of individual patches in a particular cluster of the plurality of clusters. The particular cluster may be associated with a label of the plurality of labels. The refining module 214 may process the individual patches to determine top nearest neighbors, as described above. In at least one example, the individual patches may be iteratively processed. As the individual patches are processed, a predetermined number of top nearest neighbors may be selected for training the classifier(s) 218 (and those patches that are not selected are removed from the clusters). In some examples, specific data structures may be leveraged that increase the speed at which nearest neighbors may be determined. In at least one example, the specific data structures may incorporate a cosine similarity metric that may be approximated by Euclidean distance. Accordingly, nearest neighbor determination may be accelerated by leveraging an index (e.g., k-dimensional tree) for all of the patches and approximating nearest neighbors using the index.
[0070] The refining module 214 may determine an entropy value for each of the individual patches based at least in part on determining labels associated with the nearest neighbors within a cluster. The refining module 214 may leverage the nearest neighbor determinations to generate distributions for labels that may be representative of entropy values for individual patches. If a particular individual patch and a nearest neighbor patch are associated with a same label, the refining module 214 may assign a low entropy value (e.g., close to 0) based at least in part on a low entropy distribution. The low entropy value (e.g., close to 0) may indicate that the particular individual patch and the nearest neighbor patch are highly representative of the label. Conversely, if the particular individual patch and the nearest neighbor patch are associated with different labels, the refining module 214 may assign a high entropy value (e.g., close to 1) based at least in part on a high entropy distribution. The high entropy value (e.g., close to 1) may indicate that the particular individual patch and the nearest neighbor patch are not representative of a same label. The refining module 214 may remove all individual patches with entropy values above a predetermined threshold to ensure the training data is highly representative of the label.
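The sketch below estimates an entropy value for each patch from the label distribution of its nearest neighbors, mirroring the description above; the neighbor count k, the use of base-2 entropy, and the function name are assumptions for illustration.

```python
# Illustrative sketch: per-patch entropy from the label distribution of its
# nearest neighbors (a patch surrounded by same-label neighbors gets entropy 0).
import numpy as np
from scipy.spatial import cKDTree

def patch_entropies(patch_reps, patch_labels, k=10):
    """patch_reps: (n, d) padded representations; patch_labels: (n,) label ids.
    Assumes several patches are present."""
    labels = np.asarray(patch_labels)
    k = min(k, len(patch_reps) - 1)                  # cannot exceed available neighbors
    index = cKDTree(patch_reps)
    _, neighbors = index.query(patch_reps, k=k + 1)  # first hit is the patch itself
    entropies = np.empty(len(patch_reps))
    for i, row in enumerate(neighbors):
        neighbor_labels = labels[row[1:]]
        _, counts = np.unique(neighbor_labels, return_counts=True)
        p = counts / counts.sum()
        entropies[i] = -(p * np.log2(p)).sum()       # 0 when all neighbors share a label
    return entropies
```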
[0071] The refining module 214 may also remove patches that reduce the diversity of the patches. The resulting patches may be arranged in a dictionary that is diverse and has a number of patches below a predetermined threshold. Patches may be diverse if the patches are representative of various portions of an object and/or various views of an object identified by the label. In some examples, the dictionary may be stored and new patches may be added to the dictionary over time. The dictionary of patches may be used to generate patch representations.
[0072] The refining module 214 may perform diversity selection by ordering individual patches in the dictionary based at least in part on the entropy value associated with each of the individual patches. Then, in a plurality of iterations, the refining module 214 may process the ordered individual patches by determining nearest neighbor patches for each individual patch of the ordered individual patches. The refining module 214 may select a particular patch if the particular patch has a threshold number of nearest neighbors with entropy values below a predetermined value. The refining module 214 may remove nearest neighbor patches to the particular patch following each iteration.
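The following sketch mirrors the diversity selection described above in simplified form: patches are visited in order of increasing entropy, a visited patch is kept if it has not already been suppressed, and its nearest neighbors are then suppressed. The entropy cutoff and neighbor count are placeholders, and the acceptance test is simplified relative to the threshold-count test described above.

```python
# Illustrative sketch of diversity selection: greedily keep low-entropy patches
# and suppress their near-duplicates so later picks cover other regions/views.
import numpy as np
from scipy.spatial import cKDTree

def diverse_selection(patch_reps, entropies, k=10, max_entropy=0.5):
    index = cKDTree(patch_reps)
    order = np.argsort(entropies)                    # low-entropy patches first
    suppressed = np.zeros(len(patch_reps), dtype=bool)
    selected = []
    for i in order:
        if suppressed[i] or entropies[i] > max_entropy:
            continue
        selected.append(i)
        _, neighbors = index.query(patch_reps[i], k=min(k, len(patch_reps)))
        suppressed[np.atleast_1d(neighbors)] = True  # remove near-duplicate patches
    return selected
```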
[0073] Based at least in part on the refining module 214 removing individual patches with entropy values above a predetermined threshold and removing individual patches to maximize the diversity of the individual patches, the refining module 214 may further refine the remaining patches for efficiency. For instance, suppose the patches are associated with a predetermined number of labels (e.g., L); the refining module 214 may group the patches from each label into clusters (e.g., P_1, . . . , P_L). In at least one example, the individual patches selected for processing in each cluster (e.g., P_1, . . . , P_L) may be ordered based on a corresponding entropy value and grouped into sub-clusters. A final group of patches (e.g., F) for training the classifier may be iteratively selected to maximize the efficiency and accuracy of classification. The recognition and/or classification performance (e.g., m_PV) may be measured using the following example algorithm or algorithms similar to the example algorithm below.
counters s_1, . . . , s_L indicating which subset of P_i is being processed
For t = 1 . . . T (iterations)
    s_t = argmax_i m_PV(F ∪ P_i)
    F = F ∪ P_(s_t)
[Figure imgf000020_0001]
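For illustration, the sketch below implements a greedy selection in the spirit of the example algorithm above. Here, evaluate_performance is a hypothetical callable standing in for m_PV (for example, validation accuracy of a quickly trained classifier), and the sub-cluster data layout is an assumption; neither is prescribed by this disclosure.

```python
# Illustrative sketch: greedily grow the final training group F by repeatedly
# adding the sub-cluster whose addition most improves a performance measure.
def select_final_group(sub_clusters_per_label, evaluate_performance, iterations):
    """sub_clusters_per_label: list of lists; element [i][s] is sub-cluster s of label i.
    evaluate_performance: hypothetical callable scoring a candidate patch group."""
    counters = [0] * len(sub_clusters_per_label)       # next sub-cluster per label
    final_group = []
    for _ in range(iterations):
        best_label, best_score = None, float("-inf")
        for i, subs in enumerate(sub_clusters_per_label):
            if counters[i] >= len(subs):
                continue                               # label i has no sub-clusters left
            score = evaluate_performance(final_group + list(subs[counters[i]]))
            if score > best_score:
                best_label, best_score = i, score
        if best_label is None:
            break
        final_group.extend(sub_clusters_per_label[best_label][counters[best_label]])
        counters[best_label] += 1                      # advance to the next sub-cluster
    return final_group
```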
[0074] Block 318 illustrates training a classifier. The learning module 216 may train one or more classifiers 218 for the plurality of labels based at least in part on patches in the refined plurality of clusters. The classifiers 218 may be trained based at least in part on patch-based features extracted from the patches in the refined clusters and, in at least some examples, global features. For example, learning algorithms such as fast rank, SGD, SVM, boosting, etc., may be applied to learn a classifier for identifying particular labels of the one or more labels. In at least some examples, classifiers for all of the labels may be trained at the same time using multi-label learning techniques, such as multiclass SVM or SGD. In other examples, the training described above may be applied to new labels as new labels are received and the new classifiers may be added to the classifier(s) 218.
[0075] FIG. 4 illustrates an example process 400 for determining whether a label is learnable based at least in part on filtering a corpus of images.
[0076] Block 402 illustrates sending a query, as described above. Block 404 illustrates receiving a corpus of images associated with the query, as described above.
[0077] Block 406 illustrates filtering the corpus of images. In some examples, the corpus of images may be noisy and may include images that are unrelated to the queries, are of low quality, etc. Accordingly, the corpus of images may be refined. In at least some examples, the filtering module 204 may filter individual images from the corpus of images, as described in FIGS. 5-6 below. In addition to the processes described below, the filtering module 204 may apply specific filters to remove specifically identified images from the corpus of images. For instance, the filtering module 204 may remove cartoon images, images with human faces covering a predetermined portion of the image, images with low gradient intensity, etc.
[0078] Block 408 illustrates determining whether a label is learnable. If removing images from the corpus results in a number of images below a predetermined threshold, the filtering module 204 may determine that the label is not learnable and may turn to human intervention, as illustrated in Block 410. Conversely, if removing images from the corpus results in a number of images above a predetermined threshold, the filtering module 204 may determine that the label is learnable and may proceed with training classifier(s) 218 as illustrated in Block 412. An example process of training classifier(s) 218 is described in FIG. 3, above.
[0079] FIG. 5 illustrates an example process 500 for filtering a corpus of images.
[0080] Block 502 illustrates determining nearest neighbors for each image in the corpus of images. For each label of the plurality of labels, the filtering module 204 may arrange each of the images in the corpus of images into a k-dimensional tree for facilitating nearest neighbor lookup. For each image, the filtering module 204 may determine a predetermined number of nearest neighbors. The filtering module 204 may leverage global features extracted from individual images for determining the nearest neighbors. The filtering module 204 may determine how many times a particular individual image appears in the neighborhood of any individual image. If the particular individual image appears fewer than a predetermined number of times, the particular individual image may be removed from the corpus of images.
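A minimal sketch of this neighborhood-count filter follows; the neighbor count and the minimum-appearance threshold are placeholder values, and the function name is an assumption.

```python
# Illustrative sketch: count how often each image appears among the nearest
# neighbors of other images (using global features) and drop rarely-appearing
# images as likely outliers.
import numpy as np
from scipy.spatial import cKDTree

def filter_by_neighborhood(global_features, k=10, min_appearances=2):
    """global_features: (n_images, d). Returns indices of images to keep."""
    k = min(k, len(global_features) - 1)
    index = cKDTree(global_features)
    _, neighbors = index.query(global_features, k=k + 1)   # first hit is the image itself
    appearances = np.bincount(neighbors[:, 1:].ravel(), minlength=len(global_features))
    keep = appearances >= min_appearances
    return np.flatnonzero(keep)
```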
[0081] Block 504 illustrates arranging individual images into clusters. The filtering module 204 may cluster the individual images into clusters corresponding to individual labels of the plurality of labels. The filtering module 204 may use single linkage clustering and may arrange individual images within a predetermined distance into a predetermined number of clusters.
[0082] Block 506 illustrates determining entropy values for each individual image in the cluster. The filtering module 204 may process the clusters to determine nearest neighbors of an image. For each image in a particular cluster, the filtering module 204 may determine the nearest neighbors of an image in other clusters. The filtering module 204 may determine entropy values based at least in part on comparing the nearest neighbors to one another. If nearest neighbors to a particular cluster are stable (e.g., low entropy value), the particular cluster is likely stable and representative and/or distinctive of a label. However, if nearest neighbors to a particular cluster are unstable (e.g., high entropy value), the particular cluster is likely unstable and not representative or distinctive of a label.
[0083] Block 508 illustrates removing at least some individual images. The filtering module 204 may remove individual images having entropy values above a predetermined threshold.
[0084] FIG. 6 illustrates another example process 600 for filtering a corpus of images.
[0085] Block 602 illustrates collecting negative images. A negative image is an image that is known to be excluded from training data associated with a label. In at least some examples, the receiving module 202 may perform two or more queries. At least one query may be a query for a particular label as described above (e.g., CenturyLink Field). Additional queries may include queries for individual words that make up a particular label having two or more words (e.g., CenturyLink, Field). An initial query of the additional queries may include a first word of the two or more words (e.g., CenturyLink). Each additional query of the additional queries may include each additional word of the two or more words (e.g., Field). The receiving module 202 may receive results from the two or more queries. The results returned for at least the second query may represent the negative images. In other examples, the receiving module 202 may leverage a knowledge graph (e.g., Satori, etc.) for collecting negative images.
[0086] Block 604 illustrates comparing images to negative images. The filtering module 204 may compare individual images returned as a result of the first query to the individual images returned in the additional queries to determine similarity values as described above.
[0087] Block 606 illustrates removing individual images from the corpus of images based on similarity values. The filtering module 204 may remove individual images with similarity values above a predetermined threshold. That is, if individual images are too similar to negative images, the individual images may be removed from the corpus.
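The sketch below illustrates one way the negative-image comparison might be carried out, using cosine similarity between global feature vectors; the similarity measure, the threshold, and the function name are assumptions made for illustration.

```python
# Illustrative sketch: compare each candidate image to the collected negative
# images and remove candidates that are too similar to any negative image.
import numpy as np

def remove_near_negatives(candidate_features, negative_features, threshold=0.9):
    """Both inputs are (n, d) global feature matrices. Returns indices to keep."""
    def normalize(x):
        return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-8)
    sims = normalize(candidate_features) @ normalize(negative_features).T
    keep = sims.max(axis=1) < threshold            # drop candidates near any negative
    return np.flatnonzero(keep)
```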
[0088] FIG. 7 illustrates an example process 700 for determining similarity values. As described above, the refining module 214 may determine similarity values representative of a similarity between the individual patches. The similarity values may be determined based at least in part on the patch representations. In at least one example, the refining module 214 may leverage HOG for the LDA features.
[0089] Block 702 illustrates standardizing patch representations extracted from individual patches. In at least some examples, to increase the speed of processing the images for training classifiers, the refining module 214 may arrange a plurality of patches into clusters based on an aspect ratio of the patches. The refining module 214 may determine similarity values by standardizing patch representations (e.g., LDA HOG) extracted from a first individual patch of the individual patches and a second individual patch of the individual patches to a predetermined canonical size. In at least one example, the patch representations (e.g., LDA HOG) may be standardized by zero padding the patch representations extracted from the first individual patch and the second individual patch.
[0090] Block 704 illustrates computing a dot product based on standardized patch representations. Based at least in part on standardizing the patch representations, the refining module 214 may compute a dot product based at least in part on the standardized values of the first individual patch and the second individual patch. In at least one example, weight vectors derived from the LDA feature extraction may be used for computing the dot product. In other examples, the refining module 214 may approximate the dot product by a Euclidean distance comparison. Leveraging the Euclidean distance enables the refining module 214 to use a k-dimensional tree for nearest neighbor determinations for identifying patches that have low entropy values and high diversity, as described below.
[0091] FIG. 8 illustrates an example process 800 for removing patches from clusters of patches. As described above, the refining module 214 may remove at least some of the individual patches from individual clusters of the plurality of clusters based at least in part on similarity values. In at least some examples, the refining module 214 may remove at least some of the individual patches based at least in part on entropy values and diversity selection.
[0092] Block 802 illustrates accessing the plurality of individual patches in a particular cluster. To determine whether a particular patch has a high entropy value or a low entropy value, the refining module 214 may access a plurality of individual patches in a particular cluster of the plurality of clusters. The particular cluster may be associated with a label of the plurality of labels.
[0093] Block 804 illustrates determining nearest neighbors for each individual patch. The refining module 214 may process the individual patches to determine top nearest neighbors, as described above. In at least one example, the individual patches may be iteratively processed. As the individual patches are processed, a predetermined number of top nearest neighbors may be selected for training the classifier(s) 218. In some examples, specific data structures may be leveraged that increase the speed at which nearest neighbors may be determined. In at least one example, the specific data structures may incorporate a cosine similarity metric that may be approximated by Euclidean distance. Accordingly, nearest neighbor determination may be accelerated by leveraging a k-dimensional tree for all of the patches and approximating nearest neighbors using the k-dimensional tree.
[0094] Block 806 illustrates determining an entropy value based on nearest neighbors for each individual patch. The refining module 214 may determine an entropy value for each of the individual patches based at least in part on determining the nearest neighbors within a cluster. If a particular individual patch and a nearest neighbor patch are associated with a same label, the refining module 214 may assign a low entropy value (e.g., close to 0). The low entropy value (e.g., close to 0) may indicate that the particular individual patch and the nearest neighbor patch are highly representative of the label. Conversely, if the particular individual patch and the nearest neighbor patch are associated with different labels, the refining module 214 may assign a high entropy value (e.g., close to 1), indicating that the particular individual patch and the nearest neighbor patch are not representative of a same label.
[0095] Block 808 illustrates removing individual patches from the clusters of patches. The refining module 214 may remove individual patches based at least in part on entropy values and/or diversity selection. The refining module 214 may remove individual patches with entropy values above a predetermined threshold to ensure the training data is highly representative of the label. The refining module 214 may also remove patches that reduce the diversity of the patches. Patches may be diverse if the patches are representative of various portions of an object and/or various views of an object identified by the label. The refining module 214 may perform diversity selection by ordering individual patches based at least in part on the entropy value associated with each of the individual patches. Then, in a plurality of iterations, the refining module 214 may process the ordered individual patches by determining nearest neighbor patches for each individual patch of the ordered individual patches. The refining module 214 may remove nearest neighbor patches from the cluster following each iteration. The refining module 214 may select a particular patch if the particular patch had a number of nearest neighbors above a predetermined threshold with entropy values below a predetermined threshold.
[0096] FIG. 9 illustrates an example process 900 for diversity selection of particular patches for training the classifier(s) 218. As described above, the refining module 214 may also remove patches that reduce the diversity of the patches. Patches may be diverse if the patches are representative of various portions of an object and/or various views of an object identified by the label.
[0097] Block 902 illustrates ordering individual patches based on entropy values. The refining module 214 may perform diversity selection by ordering individual patches based at least in part on the entropy value associated with each of the individual patches.
[0098] Block 904 illustrates processing individual patches. In a plurality of iterations, the refining module 214 may process the ordered individual patches by determining nearest neighbor patches for each individual patch of the ordered individual patches.
[0099] Block 906 illustrates removing nearest neighbors for each individual patch. The refining module 214 may remove nearest neighbor patches from the cluster following each iteration.
[0100] Block 908 illustrates determining particular patches having a number of nearest neighbors above a predetermined threshold with entropy values below a predetermined threshold. The refining module 214 may determine that particular patches have a number of nearest neighbors above a predetermined threshold with entropy values below a predetermined threshold.
[0101] Block 910 illustrates selecting particular patches for training the classifier(s) 218. The refining module 214 may select a particular patch if the particular patch had a number of nearest neighbors above a predetermined threshold with entropy values below a predetermined threshold. Based at least in part on the refining module 214 removing individual patches with entropy values above a predetermined threshold and removing individual patches to maximize the diversity of the individual patches, the refining module 214 may further refine the remaining patches for efficiency. In at least one example, the individual patches selected for processing in each cluster may be ordered based on a corresponding entropy value and grouped into sub-clusters. A final group of patches for training the classifier may be iteratively selected to maximize efficiency and accuracy of classification. The feature extracting module 210 may extract patch-based features from the final group of patches (e.g., refined cluster of patches) for use in training the classifiers.
Applying the Classifiers
[0102] FIG. 10 illustrates a diagram showing an example system 1000 for classifying a new image. As shown in FIG. 10, the system 1000 may include the input module 116, training module 118, and classifying module 120. [0103] The input module 116 may include the receiving module 202. The receiving module 202 may receive a new image 1002 for classifying. The user(s) 106 may input one or more images into the receiving module 202 via one of the user devices 108. For example, in at least one example, a user 106 may select an image stored on his or her user device 108 for input into the input module 116. In another example, a user 106 may take a photo or video via his or her user device 108 and input the image into the input module 116.
[0104] The receiving module 202 may send the new image 1002 to the extraction module 206 stored in the training module 118. The patch extraction module 208 that is stored in the extraction module 206 may extract patches from the new image 1002, as described above. The patch extracting module 208 may send the patches 1004 to the feature extracting module 210 for extracting patch-based features from the image 1002, based at least in part on the patches 1004, as described above. The feature extracting module 210 may send the patch-based features 1006 to the classifying module 120 for classifying by the classifier(s) 218. The classifying module 120 may apply the classifier(s) 218 to the patch-based features 1006 for classification. The classifying module 120 may send the classified result 1008 to the user(s) 106. In at least one example, the classified result 1008 may include a confidence score.
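Tying the pieces together, the following sketch classifies a new image using the hypothetical helpers from the earlier sketches (extract_patches and patch_based_feature) together with a trained scikit-learn-style classifier; describe_patch is a hypothetical callable that maps a patch to its representation vector, and none of these names are prescribed by this disclosure.

```python
# Illustrative end-to-end sketch: classify a new image by extracting patches,
# building the patch-based feature vector, and applying the trained classifier.
# Reuses extract_patches and patch_based_feature from the earlier sketches.
import numpy as np

def classify_new_image(image_rgb, describe_patch, dictionary, classifier):
    """describe_patch: hypothetical callable mapping a patch to its descriptor.
    Assumes at least one patch is extracted and more than two labels exist."""
    patches = extract_patches(image_rgb)                      # see earlier sketch
    descriptors = np.vstack([describe_patch(p) for p in patches])
    feature = patch_based_feature(descriptors, dictionary)    # see earlier sketch
    scores = classifier.decision_function(feature.reshape(1, -1))[0]
    best = int(np.argmax(scores))
    return classifier.classes_[best], float(scores[best])     # label and confidence score
```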
Example Processes
[0105] The example process 1100 is described in the context of the environment of FIGS. 1, 2, and 10 but is not limited to those environments. The process 1100 is illustrated as a logical flow graph, each operation of which represents an operation in the illustrated or another sequence of operations that may be implemented in hardware, software, or a combination thereof. In the context of software, the operations represent computer-executable instructions stored on one or more computer-readable media 114 that, when executed by one or more processors 112, configure a computing device to perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures, and the like that configure a computing device to perform particular functions or implement particular abstract data types.
[0106] The computer-readable media 114 may include hard drives, floppy diskettes, optical disks, CD-ROMs, DVDs, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, flash memory, magnetic or optical cards, solid-state memory devices, or other types of storage media suitable for storing electronic instructions, as described above. Finally, the order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement the process.
[0107] FIG. 11 illustrates an example process 1100 for classifying a new image 1002.
[0108] Block 1102 illustrates receiving input. The receiving module 202 may receive a new image 1002 to be classified. As described above, the user(s) 106 may input one or more images into the receiving module 202 via one of the user devices 108.
[0109] Block 1104 illustrates extracting patches 1004. The patch extraction module 208 may extract patches 1004 from the new image 1002, as described above.
[0110] Block 1106 illustrates extracting features 1006. The patch extracting module 208 may send the patches 1004 to the feature extracting module 210 for extracting patch-based features 1006 from the image 1002, based at least in part on the extracted patches 1004, as described above.
[0111] Block 1108 illustrates applying a classifier 218. The feature extracting module 210 may send the patch-based features 1006 to the classifying module 120 for classifying by the classifier(s) 218. The classifying module 120 may apply the classifier(s) 218 to the patch-based features 1006 for classification.
[0112] Block 1110 illustrates outputting the result 1008. The classifying module 120 may send the classified result 1008 to the user(s) 106.
[0113] A. A computer-implemented method comprising: accessing a corpus of images, wherein individual images of the corpus are associated with at least one label of a plurality of labels; extracting one or more patches from the individual images; extracting patch-based features from the one or more patches; extracting patch representations from individual patches of the one or more patches; arranging the individual patches into a plurality of clusters based at least in part on the patch-based features, wherein individual clusters of the plurality of clusters correspond to individual labels of the plurality of labels; determining similarity values representative of a similarity between ones of the individual patches, the determining based at least in part on patch representations; removing at least some of the individual patches from individual clusters of the plurality of clusters based at least in part on the similarity values; and training a classifier for the plurality of labels based at least in part on patch-based features extracted from the individual clusters.
[0114] B. A computer-implemented method as paragraph A recites, wherein extracting patch representations from the individual patches comprises extracting features from the individual patches via latent Dirichlet allocation (LDA). [0115] C. A computer-implemented method as paragraph B recites, wherein determining the similarity values representative of the similarity between the individual patches comprises: standardizing patch representations extracted from a first individual patch of the individual patches and a second individual patch of the individual patches to a predetermined canonical size; and computing a dot product based at least in part on the standardized patch representations of the first individual patch and the second individual patch.
[0116] D. A computer-implemented method as paragraph C recites, wherein the first individual patch is part of a particular cluster of the plurality of patches associated with the at least one label of the plurality of labels and the second individual patch is part of a different cluster of the plurality of patches associated with a different label of the plurality of labels.
[0117] E. A computer-implemented method as paragraph C recites, wherein the first individual patch and the second individual patch are part of a same cluster of the plurality of clusters, the same cluster associated with a same label of the plurality of labels.
[0118] F. A computer-implemented method as any of paragraphs A-E recite, wherein removing at least some of the individual patches from the individual clusters comprises: accessing a plurality of individual patches in a particular cluster of the plurality of clusters; determining nearest neighbors of individual patches of the plurality of individual patches based at least in part on the similarity values; determining entropy values for the individual patches based at least in part on determining the nearest neighbors of the individual patches; and removing at least some individual patches with entropy values above a predetermined threshold.
[0119] G. A computer-implemented method as paragraph F recites, further comprising: ordering the individual patches based at least in part on the entropy values associated with the individual patches; processing the ordered individual patches via a plurality of iterations, the processing including determining nearest neighbor patches for each of the ordered individual patches; removing nearest neighbor patches for each of the ordered individual patches in each iteration of the plurality of iterations; determining that a particular individual patch of the individual patches had a number of nearest neighbors above a predetermined threshold with entropy values below a predetermined threshold; and selecting the particular individual patch for training the classifier. [0120] H. One or more computer-readable media encoded with instructions that, when executed by a processor, configure a computer to perform a method as any of paragraphs A-G recites.
[0121] I. A device comprising one or more processors and one or more computer- readable media encoded with instructions that, when executed by the one or more processors, configure a computer to perform a computer-implemented method as recited in any of paragraphs A-G.
[0122] J. A system comprising: computer-readable media storing one or more modules; a processing unit operably coupled to the computer-readable media, the processing unit adapted to execute the one or more modules, the one or more modules comprising: a patch extracting module to access a plurality of images and extract one or more patches from individual images of the plurality of images; a feature extracting module to extract patch-based features from the one or more patches and patch representations from individual patches of the one or more patches; a clustering module to arrange the individual patches into a plurality of clusters based at least in part on the patch-based features; a refining module to remove at least some of the individual patches from individual clusters of the plurality of clusters based at least in part on entropy values and diversity selection; and a learning module to train a classifier for at least one label based at least in part on the individual clusters.
[0123] K. A system as paragraph J recites, further comprising a receiving module to receive the plurality of images based at least in part on a query of the at least one label.
[0124] L. A system as paragraph J or K recites, further comprising a filtering module to remove at least some of the individual images based at least in part on: the at least some of the individual images having entropy values above a predetermined threshold; or the at least some of the individual images and negative images having image similarity values above a predetermined threshold.
[0125] M. A system as any of paragraphs J-L recite, wherein the feature extracting module further extracts global features from the individual images, the global features representing contextual information about the individual images.
[0126] N. A system as paragraph M recites, wherein the learning module trains the classifier for the at least one label based at least in part on the global features and the patch-based features.
[0127] O. A system as any of paragraphs J-N recite, wherein the refining module further determines similarity values representative of similarities between individual patches of the one or more patches, the determining comprising: standardizing patch representations extracted from a first individual patch of the individual patches and a second individual patch of the individual patches to a predetermined canonical size; and computing a dot product based at least in part on the standardized patch representations of the first individual patch and the second individual patch.
[0128] P. A system as any of paragraphs J-O recite, wherein the refining module removes the at least some of the individual patches from the individual clusters of the plurality of clusters based at least in part on: accessing a plurality of individual patches in a particular cluster of the plurality of clusters; determining nearest neighbors of individual patches of the plurality of patches based at least in part on the similarity values; determining entropy values based at least in part on determining the nearest neighbors to individual patches; filtering at least some of the individual patches with entropy values above a predetermined threshold, remaining individual patches of the plurality of individual patches comprising filtered patches; determining nearest neighbor patches for the filtered patches via a plurality of iterations; removing nearest neighbor patches for the filtered patches in each iteration of the plurality of iterations; determining that a particular filtered patch of the filtered patches had a number of nearest neighbors below a predetermined threshold with entropy values below a predetermined threshold; and removing the particular filtered patch.
[0129] Q. A system as any of paragraphs J-P recite, further comprising a classifying module to store the classifier for the at least one label.
[0130] R. A system as any of paragraphs J-Q recite, further comprising a receiving module to receive a new image for classifying by the classifier.
[0131] S. One or more computer-readable media encoded with instructions that, when executed by a processor, configure a computer to perform acts comprising: accessing a plurality of weakly supervised images; extracting one or more patches from individual weakly supervised images of the plurality of weakly supervised images; extracting patch-based features from the one or more patches; extracting patch representations from the one or more patches; arranging individual patches into a plurality of clusters based at least in part on the patch-based features; removing at least some of the individual patches from individual clusters of the plurality of clusters based at least in part on similarity values representative of similarity between ones of the individual patches; and training a classifier for at least one label based at least in part on the plurality of clusters. [0132] T. One or more computer-readable media as paragraph S recites, wherein training the classifier comprises: extracting new patch-based features from remaining individual patches of the individual clusters; and training the classifier based at least in part on the new patch-based features.
[0133] U. One or more computer-readable media as paragraph S or T recites, wherein the acts further comprise, prior to extracting the one or more patches from the multimedia weakly supervised data items, filtering the plurality of weakly supervised images, the filtering including: determining nearest neighbors for each individual weakly supervised image of the plurality of weakly supervised images; arranging one or more individual weakly supervised images within a predetermined distance into image clusters; determining an entropy value for each individual weakly supervised image in an individual image cluster of the image clusters, wherein determining an entropy value for each individual weakly supervised image comprises determining a similarity value representing a similarity between each individual weakly supervised image in a particular image cluster and each individual weakly supervised image in one or more other image clusters; and removing at least some of the individual weakly supervised images when the entropy value is above a predetermined threshold.
[0134] V. One or more computer-readable media as any of paragraphs S-U recite, wherein the acts further comprise, prior to extracting the one or more patches from the multimedia weakly supervised data items, filtering the plurality of weakly supervised images, the filtering including: collecting negative images; comparing the individual weakly supervised images with the negative images; and removing one or more of the individual weakly supervised images from the plurality of images based at least in part on the one or more of the individual weakly supervised images and the negative images having similarity values above a predetermined threshold.
[0135] W. A device comprising one or more processors and one or more computer readable media as recited in any of paragraphs S-V.
[0136] X. A system comprising: computer-readable media; one or more processors; and one or more modules on the computer-readable media and executable by the one or more processors to perform operations comprising: accessing a plurality of weakly supervised images; extracting one or more patches from individual weakly supervised images of the plurality of weakly supervised images; extracting patch-based features from the one or more patches; extracting patch representations from the one or more patches; arranging individual patches into a plurality of clusters based at least in part on the patch-based features; removing at least some of the individual patches from individual clusters of the plurality of clusters based at least in part on similarity values representative of similarity between ones of the individual patches; and training a classifier for at least one label based at least in part on the plurality of clusters.
[0137] Y. A system as paragraph X recites, wherein training the classifier comprises: extracting new patch-based features from remaining individual patches of the individual clusters; and training the classifier based at least in part on the new patch-based features.
[0138] Z. A system as paragraph X or Y recites, wherein the operations further comprise, prior to extracting the one or more patches from the multimedia weakly supervised data items, filtering the plurality of weakly supervised images, the filtering including: determining nearest neighbors for each individual weakly supervised image of the plurality of weakly supervised images; arranging one or more individual weakly supervised images within a predetermined distance into image clusters; determining an entropy value for each individual weakly supervised image in an individual image cluster of the image clusters, wherein determining an entropy value for each individual weakly supervised image comprises determining a similarity value representing a similarity between each individual weakly supervised image in a particular image cluster and each individual weakly supervised image in one or more other image clusters; and removing at least some of the individual weakly supervised images when the entropy value is above a predetermined threshold.
[0139] AA. A system as any of paragraphs X-Z recite, wherein the operations further comprise, prior to extracting the one or more patches from the multimedia weakly supervised data items, filtering the plurality of weakly supervised images, the filtering including: collecting negative images; comparing the individual weakly supervised images with the negative images; and removing one or more of the individual weakly supervised images from the plurality of images based at least in part on the one or more of the individual weakly supervised images and the negative images having similarity values above a predetermined threshold. Conclusion
[0140] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are described as illustrative forms of implementing the claims.
[0141] Conditional language such as, among others, "can," "could," "might" or "may," unless specifically stated otherwise, are understood within the context to present that certain examples include, while other examples do not necessarily include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that certain features, elements and/or steps are in any way required for one or more examples or that one or more examples necessarily include logic for deciding, with or without user input or prompting, whether certain features, elements and/or steps are included or are to be performed in any particular example. Conjunctive language such as the phrase "at least one of X, Y or Z," unless specifically stated otherwise, is to be understood to present that an item, term, etc. may be either X, Y, or Z, or a combination thereof.

Claims

1. A system for reducing computational resources utilized for training a classifier to identify labels associated with a plurality of images, the system comprising:
computer-readable media storing one or more modules;
a processing unit operably coupled to the computer-readable media, the processing unit adapted to execute the one or more modules, the one or more modules comprising:
a patch extracting module to access the plurality of images and extract one or more patches from individual images of the plurality of images;
a feature extracting module to extract patch-based features from the one or more patches and patch representations from individual patches of the one or more patches;
a clustering module to arrange the individual patches into a plurality of clusters based at least in part on the patch-based features;
a refining module to remove at least some of the individual patches from individual clusters of the plurality of clusters based at least in part on entropy values and diversity selection; and
a learning module to train the classifier for at least one label of the labels based at least in part on the individual clusters.
2. A system as claim 1 recites, further comprising a receiving module to receive the plurality of images based at least in part on a query of the at least one label.
3. A system as either claim 1 or claim 2 recites, further comprising a filtering module to remove at least some of the individual images based at least in part on:
the at least some of the individual images having entropy values above a predetermined threshold; or
the at least some of the individual images and negative images having image similarity values above a predetermined threshold.
4. A system as any one of claims 1-3 recites, wherein the feature extracting module further extracts global features from the individual images, the global features representing contextual information about the individual images.
5. A system as claim 4 recites, wherein the learning module trains the classifier for the at least one label based at least in part on the global features and the patch-based features.
6. A system as any one of claims 1-5 recites, wherein the refining module further determines similarity values representative of similarities between individual patches of the one or more patches, the determining comprising:
standardizing patch representations extracted from a first individual patch of the individual patches and a second individual patch of the individual patches to a predetermined canonical size; and
computing a dot product based at least in part on the standardized patch representations of the first individual patch and the second individual patch.
7. A system as claim 6 recites, wherein the refining module removes the at least some of the individual patches from the individual clusters of the plurality of clusters based at least in part on:
accessing a plurality of individual patches in a particular cluster of the plurality of clusters;
determining nearest neighbors of individual patches of the plurality of patches based at least in part on the similarity values;
determining entropy values based at least in part on determining the nearest neighbors to individual patches;
filtering at least some of the individual patches with entropy values above a predetermined threshold, remaining individual patches of the plurality of individual patches comprising filtered patches;
determining nearest neighbor patches for the filtered patches via a plurality of iterations;
removing nearest neighbor patches for the filtered patches in each iteration of the plurality of iterations;
determining that a particular filtered patch of the filtered patches had a number of nearest neighbors below a predetermined threshold with entropy values below a predetermined threshold; and
removing the particular filtered patch.
8. A system as any one of claims 1-7 recites, further comprising a classifying module to store the classifier for the at least one label.
9. A computer-implemented method for reducing computational resources utilized for training a classifier to identify a plurality of labels associated with a corpus of images, the computer-implemented method comprising: accessing the corpus of images, wherein individual images of the corpus are associated with at least one label of the plurality of labels;
extracting one or more patches from the individual images;
extracting patch-based features from the one or more patches;
extracting patch representations from individual patches of the one or more patches;
arranging the individual patches into a plurality of clusters based at least in part on the patch-based features, wherein individual clusters of the plurality of clusters correspond to individual labels of the plurality of labels;
determining similarity values representative of a similarity between ones of the individual patches, the determining based at least in part on patch representations;
removing at least some of the individual patches from individual clusters of the plurality of clusters based at least in part on the similarity values; and
training the classifier for the plurality of labels based at least in part on patch-based features extracted from the individual clusters.
10. A computer-implemented method as claim 9 recites, wherein determining the similarity values representative of the similarity between the individual patches comprises:
standardizing patch representations extracted from a first individual patch of the individual patches and a second individual patch of the individual patches to a predetermined canonical size; and
computing a dot product based at least in part on the standardized patch representations of the first individual patch and the second individual patch.
11. A computer-implemented method as claim 10 recites, wherein:
the first individual patch is part of a particular cluster of the plurality of clusters associated with the at least one label of the plurality of labels and the second individual patch is part of a different cluster of the plurality of clusters associated with a different label of the plurality of labels; or
the first individual patch and the second individual patch are part of a same cluster of the plurality of clusters, the same cluster associated with a same label of the plurality of labels.
12. A computer-implemented method as any one of claims 9-11 recites, wherein removing at least some of the individual patches from the individual clusters comprises:
accessing a plurality of individual patches in a particular cluster of the plurality of clusters;
determining nearest neighbors of individual patches of the plurality of individual patches based at least in part on the similarity values;
determining entropy values for the individual patches based at least in part on determining the nearest neighbors of the individual patches; and
removing at least some individual patches with entropy values above a predetermined threshold.
13. A computer-implemented method as claim 12 recites, further comprising:
ordering the individual patches based at least in part on the entropy values associated with the individual patches;
processing the ordered individual patches via a plurality of iterations, the processing including determining nearest neighbor patches for each of the ordered individual patches;
removing nearest neighbor patches for each of the ordered individual patches in each iteration of the plurality of iterations;
determining that a particular individual patch of the individual patches had a number of nearest neighbors above a predetermined threshold with entropy values below a predetermined threshold; and
selecting the particular individual patch for training the classifier.
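For illustration only: a Python sketch of the claim-13 style selection, i.e. ordering patches by entropy, iteratively suppressing each surviving patch's nearest neighbors, and selecting patches that have many low-entropy nearest neighbors. The thresholds, iteration count, and the reuse of precomputed neighbor lists are assumptions; the claims leave these unspecified.

```python
import numpy as np

def select_representative_patches(entropies, neighbors, iterations=3,
                                  min_neighbors=5, entropy_threshold=0.5):
    """Claim-13 style selection sketch.

    entropies : per-patch entropy values (lower = purer neighborhood)
    neighbors : per-patch list of nearest-neighbor indices, most similar first
    Returns the indices of patches selected for training the classifier.
    """
    order = np.argsort(entropies)      # process patches from lowest to highest entropy
    suppressed = set()
    for _ in range(iterations):
        # A fuller implementation might recompute neighbors among surviving patches
        # in each iteration; this sketch reuses the precomputed neighbor lists.
        for i in order:
            if int(i) in suppressed:
                continue
            # Suppress this patch's nearest neighbors so near-duplicates are not reselected.
            suppressed.update(int(j) for j in neighbors[i] if j != i)

    selected = []
    for i in order:
        if int(i) in suppressed:
            continue
        low_entropy_nn = [j for j in neighbors[i] if entropies[j] <= entropy_threshold]
        # Mirroring claim 13: select patches whose number of low-entropy nearest
        # neighbors exceeds the threshold.
        if len(low_entropy_nn) >= min_neighbors and entropies[i] <= entropy_threshold:
            selected.append(int(i))
    return selected
```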
14. A computer-readable medium having thereon computer-executable instructions that, when executed, configure a computer to perform a method as any one of claims 9-13 recites.
15. A device comprising:
one or more processors; and
a computer-readable medium having thereon computer-executable instructions that, when executed by the one or more processors, configure the device to perform a method as any one of claims 9-13 recites.
PCT/US2015/067554 2015-01-22 2015-12-28 Optimizing multi-class image classification using patch features WO2016118286A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP15834702.1A EP3248143B1 (en) 2015-01-22 2015-12-28 Reducing computational resources utilized for training an image-based classifier
CN201580073396.5A CN107209860B (en) 2015-01-22 2015-12-28 Method, system, and computer storage medium for processing weakly supervised images

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US14/602,494 US10013637B2 (en) 2015-01-22 2015-01-22 Optimizing multi-class image classification using patch features
US14/602,494 2015-01-22

Publications (1)

Publication Number Publication Date
WO2016118286A1 true WO2016118286A1 (en) 2016-07-28

Family

ID=55358101

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/067554 WO2016118286A1 (en) 2015-01-22 2015-12-28 Optimizing multi-class image classification using patch features

Country Status (4)

Country Link
US (1) US10013637B2 (en)
EP (1) EP3248143B1 (en)
CN (1) CN107209860B (en)
WO (1) WO2016118286A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215275A (en) * 2020-09-30 2021-01-12 佛山科学技术学院 Image processing system and method suitable for K-means algorithm, and recording medium
US11403328B2 (en) 2019-03-08 2022-08-02 International Business Machines Corporation Linking and processing different knowledge graphs

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10127439B2 (en) * 2015-01-15 2018-11-13 Samsung Electronics Co., Ltd. Object recognition method and apparatus
US9842390B2 (en) * 2015-02-06 2017-12-12 International Business Machines Corporation Automatic ground truth generation for medical image collections
US9530082B2 (en) * 2015-04-24 2016-12-27 Facebook, Inc. Objectionable content detector
US9704054B1 (en) * 2015-09-30 2017-07-11 Amazon Technologies, Inc. Cluster-trained machine learning for image processing
CN107908175B (en) * 2017-11-08 2020-06-23 国网电力科学研究院武汉南瑞有限责任公司 On-site intelligent operation and maintenance system for power system
CN108198147B (en) * 2018-01-02 2021-09-14 昆明理工大学 Multi-source image fusion denoising method based on discriminant dictionary learning
JP2020052509A (en) * 2018-09-25 2020-04-02 富士ゼロックス株式会社 Information processing apparatus, program and information processing system
CN109543536B (en) * 2018-10-23 2020-11-10 北京市商汤科技开发有限公司 Image identification method and device, electronic equipment and storage medium
CN109815788A (en) * 2018-12-11 2019-05-28 平安科技(深圳)有限公司 A kind of picture clustering method, device, storage medium and terminal device
US11257222B2 (en) 2019-03-05 2022-02-22 International Business Machines Corporation Iterative approach for weakly-supervised action localization
CN113841156A (en) * 2019-05-27 2021-12-24 西门子股份公司 Control method and device based on image recognition
CN110533067A (en) * 2019-07-22 2019-12-03 杭州电子科技大学 The end-to-end Weakly supervised object detection method that frame based on deep learning returns
US11341358B2 (en) * 2019-09-30 2022-05-24 International Business Machines Corporation Multiclassification approach for enhancing natural language classifiers
CN110689092B (en) * 2019-10-18 2022-06-14 大连海事大学 Sole pattern image depth clustering method based on data guidance
CN111444969B (en) * 2020-03-30 2022-02-01 西安交通大学 Weakly supervised IVOCT image abnormal region detection method
KR102548246B1 (en) * 2020-11-09 2023-06-28 주식회사 코난테크놀로지 Object detection data set composition method using image entropy and data processing device performing the same
WO2022177928A1 (en) * 2021-02-16 2022-08-25 Carnegie Mellon University System and method for reducing false positives in object detection frameworks
CN114049508B (en) * 2022-01-12 2022-04-01 成都无糖信息技术有限公司 Fraud website identification method and system based on picture clustering and manual research and judgment
CN114359610B (en) * 2022-02-25 2023-04-07 北京百度网讯科技有限公司 Entity classification method, device, equipment and storage medium
CN115439688B (en) * 2022-09-01 2023-06-16 哈尔滨工业大学 Weak supervision object detection method based on surrounding area sensing and association
CN116563953B (en) * 2023-07-07 2023-10-20 中国科学技术大学 Bottom-up weak supervision time sequence action detection method, system, equipment and medium
CN116580254B (en) * 2023-07-12 2023-10-20 菲特(天津)检测技术有限公司 Sample label classification method and system and electronic equipment

Family Cites Families (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6327581B1 (en) 1998-04-06 2001-12-04 Microsoft Corporation Methods and apparatus for building a support vector machine classifier
US6915025B2 (en) 2001-11-27 2005-07-05 Microsoft Corporation Automatic image orientation detection based on classification of low-level image features
US6762769B2 (en) 2002-01-23 2004-07-13 Microsoft Corporation System and method for real-time texture synthesis using patch-based sampling
US7386527B2 (en) 2002-12-06 2008-06-10 Kofax, Inc. Effective multi-class support vector machine classification
US7124149B2 (en) 2002-12-13 2006-10-17 International Business Machines Corporation Method and apparatus for content representation and retrieval in concept model space
US7164798B2 (en) 2003-02-18 2007-01-16 Microsoft Corporation Learning-based automatic commercial content detection
US7490071B2 (en) 2003-08-29 2009-02-10 Oracle Corporation Support vector machines processing system
US20050289089A1 (en) 2004-06-28 2005-12-29 Naoki Abe Methods for multi-class cost-sensitive learning
US7827179B2 (en) * 2005-09-02 2010-11-02 Nec Corporation Data clustering system, data clustering method, and data clustering program
US7949186B2 (en) * 2006-03-15 2011-05-24 Massachusetts Institute Of Technology Pyramid match kernel and related techniques
US7873583B2 (en) 2007-01-19 2011-01-18 Microsoft Corporation Combining resilient classifiers
US7983486B2 (en) 2007-08-29 2011-07-19 Seiko Epson Corporation Method and apparatus for automatic image categorization using image texture
US8086549B2 (en) 2007-11-09 2011-12-27 Microsoft Corporation Multi-label active learning
US20090290802A1 (en) 2008-05-22 2009-11-26 Microsoft Corporation Concurrent multiple-instance learning for image categorization
US8140450B2 (en) 2009-03-27 2012-03-20 Mitsubishi Electric Research Laboratories, Inc. Active learning method for multi-class classifiers
US8296248B2 (en) * 2009-06-30 2012-10-23 Mitsubishi Electric Research Laboratories, Inc. Method for clustering samples with weakly supervised kernel mean shift matrices
US8386574B2 (en) 2009-10-29 2013-02-26 Xerox Corporation Multi-modality classification for one-class classification in social networks
US8605956B2 (en) 2009-11-18 2013-12-10 Google Inc. Automatically mining person models of celebrities for visual search applications
US8447139B2 (en) 2010-04-13 2013-05-21 International Business Machines Corporation Object recognition using Haar features and histograms of oriented gradients
US9396545B2 (en) * 2010-06-10 2016-07-19 Autodesk, Inc. Segmentation of ground-based laser scanning points from urban environment
US8645380B2 (en) 2010-11-05 2014-02-04 Microsoft Corporation Optimized KD-tree for scalable search
US8798393B2 (en) * 2010-12-01 2014-08-05 Google Inc. Removing illumination variation from images
US9870376B2 (en) 2011-04-01 2018-01-16 Excalibur Ip, Llc Method and system for concept summarization
US9036925B2 (en) 2011-04-14 2015-05-19 Qualcomm Incorporated Robust feature matching for visual search
US9239848B2 (en) * 2012-02-06 2016-01-19 Microsoft Technology Licensing, Llc System and method for semantically annotating images
KR101912748B1 (en) 2012-02-28 2018-10-30 한국전자통신연구원 Scalable Feature Descriptor Extraction and Matching method and system
ITMI20121210A1 (en) 2012-07-11 2014-01-12 Rai Radiotelevisione Italiana A METHOD AND AN APPARATUS FOR THE EXTRACTION OF DESCRIPTORS FROM VIDEO CONTENT, PREFERABLY FOR SEARCH AND RETRIEVAL PURPOSE
US9224071B2 (en) 2012-11-19 2015-12-29 Microsoft Technology Licensing, Llc Unsupervised object class discovery via bottom up multiple class learning
WO2014130571A1 (en) 2013-02-19 2014-08-28 The Regents Of The University Of California Methods of decoding speech from the brain and systems for practicing the same
US9020248B2 (en) 2013-02-22 2015-04-28 Nec Laboratories America, Inc. Window dependent feature regions and strict spatial layout for object detection
US9317781B2 (en) 2013-03-14 2016-04-19 Microsoft Technology Licensing, Llc Multiple cluster instance learning for image classification
US9158995B2 (en) 2013-03-14 2015-10-13 Xerox Corporation Data driven localization using task-dependent representations
CN103246893B (en) 2013-03-20 2016-08-24 西交利物浦大学 The ECOC coding specification method of stochastic subspace based on rejection
CN103268607B (en) * 2013-05-15 2016-10-12 电子科技大学 A kind of common object detection method under weak supervision condition
CN103839080A (en) * 2014-03-25 2014-06-04 上海交通大学 Video streaming anomalous event detecting method based on measure query entropy
US9875301B2 (en) 2014-04-30 2018-01-23 Microsoft Technology Licensing, Llc Learning multimedia semantics from large-scale unstructured data
US10325220B2 (en) 2014-11-17 2019-06-18 Oath Inc. System and method for large-scale multi-label learning using incomplete label assignments
CN104463249B (en) * 2014-12-09 2018-02-02 西北工业大学 A kind of remote sensing images airfield detection method based on Weakly supervised learning framework
US9785866B2 (en) 2015-01-22 2017-10-10 Microsoft Technology Licensing, Llc Optimizing multi-class multimedia data classification using negative data

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
BHARATH HARIHARAN ET AL: "Discriminative Decorrelation for Clustering and Classification", 7 October 2012, COMPUTER VISION ECCV 2012, SPRINGER BERLIN HEIDELBERG, BERLIN, HEIDELBERG, PAGE(S) 459 - 472, ISBN: 978-3-642-33764-2, XP047018760 *
FEIFEI CHEN ET AL: "Action recognition through discovering distinctive action parts", JOURNAL OF THE OPTICAL SOCIETY OF AMERICA A, vol. 32, no. 2, 8 January 2015 (2015-01-08), US, pages 173 - 185, XP055258538, ISSN: 1084-7529, DOI: 10.1364/JOSAA.32.000173 *
JUNEJA MAYANK ET AL: "Blocks That Shout: Distinctive Parts for Scene Classification", IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION. PROCEEDINGS, IEEE COMPUTER SOCIETY, US, 23 June 2013 (2013-06-23), pages 923 - 930, XP032493050, ISSN: 1063-6919, [retrieved on 20131002], DOI: 10.1109/CVPR.2013.124 *
MISRA ISHAN ET AL: "Data-driven exemplar model selection", IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION, IEEE, 24 March 2014 (2014-03-24), pages 339 - 346, XP032609926, DOI: 10.1109/WACV.2014.6836080 *
SICRE RONAN ET AL: "Discovering and Aligning Discriminative Mid-level Features for Image Classification", INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, IEEE COMPUTER SOCIETY, US, 24 August 2014 (2014-08-24), pages 1975 - 1980, XP032697962, ISSN: 1051-4651, [retrieved on 20141204], DOI: 10.1109/ICPR.2014.345 *
XIAOZHI CHEN ET AL: "Learning a compact latent representation of the Bag-of-Parts model", 2014 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP 2014) : PARIS, FRANCE, 27 - 30 OCTOBER 2014, 1 October 2014 (2014-10-01), Piscataway, NJ, pages 5926 - 5930, XP055258715, ISBN: 978-1-4799-5751-4, DOI: 10.1109/ICIP.2014.7026197 *


Also Published As

Publication number Publication date
CN107209860A (en) 2017-09-26
US20160217344A1 (en) 2016-07-28
EP3248143A1 (en) 2017-11-29
US10013637B2 (en) 2018-07-03
EP3248143B1 (en) 2023-04-05
CN107209860B (en) 2021-07-16

Similar Documents

Publication Publication Date Title
US10013637B2 (en) Optimizing multi-class image classification using patch features
CN107209861B (en) Optimizing multi-category multimedia data classification using negative data
US10438091B2 (en) Method and apparatus for recognizing image content
US9875301B2 (en) Learning multimedia semantics from large-scale unstructured data
US10482146B2 (en) Systems and methods for automatic customization of content filtering
Sun et al. Chinese herbal medicine image recognition and retrieval by convolutional neural network
Moran et al. Sparse kernel learning for image annotation
US20160188633A1 (en) A method and apparatus for tracking microblog messages for relevancy to an entity identifiable by an associated text and an image
WO2013160192A1 (en) Method for binary classification of a query image
US10943098B2 (en) Automated and unsupervised curation of image datasets
US10489681B2 (en) Method of clustering digital images, corresponding system, apparatus and computer program product
CN111080551B (en) Multi-label image complement method based on depth convolution feature and semantic neighbor
Cheung et al. An analytic system for user gender identification through user shared images
JP6017277B2 (en) Program, apparatus and method for calculating similarity between contents represented by set of feature vectors
Lahrache et al. Bag‐of‐features for image memorability evaluation
JP5833499B2 (en) Retrieval device and program for retrieving content expressed by high-dimensional feature vector set with high accuracy
Das et al. Content based image recognition by information fusion with multiview features
JP6090927B2 (en) Video section setting device and program
JP6283308B2 (en) Image dictionary construction method, image representation method, apparatus, and program
Wu et al. Social Attribute Annotation for Personal Photo Collection
Li et al. Beyond Bag-of-Words: combining generative and discriminative models for scene categorization

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 15834702

Country of ref document: EP

Kind code of ref document: A1

REEP Request for entry into the european phase

Ref document number: 2015834702

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE