
By Anthony Matarazzo
Recently, I had the opportunity to study consumers from the sales clerk’s
side of the counter. I put on my trusty uniform, issued only in sparkling white
with the emblems of various billion-dollar businesses on it, and went to work.
I easily mastered the cash register, but I was more interested in the people. I
found that the general public contains a lot of different people: smelly,
stinky, ugly, beautiful, intelligent, stupid, deaf, blind, repugnant,
brilliant, impatient, impatient, impatient, impatient. Ahh, that’s about the
only problem I can solve for these rude rats: impatience, or rather, I can
reduce the time they are in the store. I propose that by using optical
recognition during the sales process, multiple objects can be rung up at once,
saving precious time for the consumer.
Impatience: the customers would sigh, tap their feet, cross their arms, and
bounce in place. Even when the time out of the door was less than a minute,
there were many visible signs that improvements were needed. I found that some
customers would leave the store when five or more people were in line. What was
the holdup? Me? I was running the cash drawer at top speed, “click a ding
ding”. But I did not let any of their sly remarks hurt my feelings (sob).
Instead, I focused on what the customers actually did while they were in the
store; that is, their physical actions.
The customers would line up in single file, most times one by one, placing
their items on the counter. I noticed almost immediately that each customer
typically had fewer than five items of various shapes, colors, and sizes.
Drinks stood upright (with some spills), bags lay flat, and other items lay
with their largest side down. From a visual perspective, the objects were
clustered nearer the customer than the clerk. Most of the time a distinct space
existed between the items, a natural artifact of the scene. At times objects
were stacked one on top of another, like two boxes of candy or two bags of
peanuts.
In the current system, I would pick each item up, locate the UPC code, jiggle
it in front of the bar code reader, and place it back on the counter after a
Churrrrrp. The chirp means that the cash register read the UPC. That’s changing
hands too many times: three hand changes (customer to counter to me to
counter), not to mention the hand, elbow, and arm extension I had to perform to
pick up the item.
In addition, approximately one out of every ten items had to be rescanned.
Ahh, it is the plastic around the UPC that had to be flattened out, or the
color of the packaging was too reflective, or the UPC was printed directly on
the plastic seam of the bag. Sometimes I had to enter the UPC manually. In this
case a good thirty seconds was spent. The end result was that time was lost to
the consumer. While as individuals we may not think that an extra thirty
seconds is a big deal, to a convenience store consumer in a line it obviously
matters. With my optical recognition system, specifically designed for the
retail store transaction model, customers will wait less in line and leave the
store sooner, and that will attract repeat business.
Optical Cash Registers are just around the corner. It is no
lie that when bar codes came out, many people would only go to the stores that
had them so they could get out faster. Now that we have all grown accustomed to
this speed of checkout, as consumers we are still impatient. It is as if the
goods are ours as soon as we place our hands on them and paying for them is
simply a byproduct of owning them. An optical cash register can reduce the
sales cycle time because it can scan the items while they are on the counter
and scan multiple objects at once.
Scanning the items while they are on the counter will save the clerk a lot of
physical action. The clerk will not have to pick the item up, find the UPC,
wave it in front of the scanner, and finally place it back on the counter or in
a bag. This is a savings of three to thirty seconds per item. Scanning
multiple objects at once will multiply the time savings. So if a customer has
four items, they will be rung up in ONE second rather than thirty seconds or
even a minute.
But how would one solve this problem? I propose that by matching colors,
general shape, and selected patterns from a multi-camera feed, a highly
efficient object recognition system can be developed for the retail market. A
technical flowchart of the process is shown below.
The Main Loop of the process will first determine whether any items are in the
view. This can be accomplished using a single camera view. The detection will
be achieved simply by comparing the view to a static background that was set
earlier, or by requiring the cashier to press a button when all items are in
view. For durability, the system should also take into account hands being
placed in the view and perhaps other objects, like a sheet of paper. When
erroneous items are present, they should not be considered part of a
transaction. This could be accomplished easily by the cashier, or automatically
by treating the item as part of the static background. One could fold an object
into the background when an amount of time has passed and the object was not
used in a sale, or include various items related to the counter or desk in the
recognition database. It will not be a lot of items, as the sales clerk’s life
is extremely limited.
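Here is a minimal sketch of how the static background comparison might look. It assumes grayscale frames stored as flat byte arrays. The names `Frame` and `itemsInView` and both threshold values are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical grayscale frame: one byte per pixel, stored row by row.
struct Frame {
    std::size_t width;
    std::size_t height;
    std::vector<std::uint8_t> pixels;
};

// Returns true when enough pixels differ from the stored background.
// pixelThreshold and minChangedFraction are illustrative tuning values.
bool itemsInView(const Frame& background, const Frame& current,
                 int pixelThreshold = 30, double minChangedFraction = 0.01) {
    assert(background.width == current.width);
    assert(background.height == current.height);
    assert(background.pixels.size() == current.pixels.size());
    if (current.pixels.empty()) return false;

    std::size_t changed = 0;
    for (std::size_t i = 0; i < current.pixels.size(); ++i) {
        const int diff = std::abs(static_cast<int>(current.pixels[i]) -
                                  static_cast<int>(background.pixels[i]));
        if (diff > pixelThreshold) ++changed;
    }
    // Compare the share of changed pixels against the minimum fraction.
    return static_cast<double>(changed) / current.pixels.size() >= minChangedFraction;
}
```

A real counter would need the thresholds tuned for lighting changes. The same difference test could also decide when a forgotten object should be folded back into the background.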
When a start of sale has been confirmed, all the cameras should be queried for
their ImageProcessor object. The ImageProcessor object contains the image
itself as well as the functions needed to operate on it. Color correction
should be performed at this stage if needed. For example, it may be easier for
the recognition routines to handle only a specific color depth. I am also sure
that, because of the constraints on the environment, preparing the image in
some way could speed up the recognition process and perhaps the color matching
routines. Candidates include erasing image artifacts and specular light
reflections.
Next, the system has to determine where each of the objects is in the different
camera views. It has to match each image with its counterparts from the other
cameras. Remember that the background has been erased, so each object will be
on a black background. Then the system must identify each of the items in these
areas for each of the cameras. This is accomplished by the ItemMatcher object.
Occlusion errors can also be found at this stage. For example, say that camera
one has a view of two boxes and one bottle, but camera two shows only one box
and one bottle. At this point, the occlusion flag should be set on the items.
It is also possible that, given the different camera views, the system could
resolve the occlusion. In this case the system should continue.
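The occlusion check in the two-camera example can be sketched as a simple count comparison. This assumes each camera has already reported how many separate regions it found on the black background. `CameraView`, `OcclusionReport`, and `checkOcclusion` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-camera result: how many separate regions were found
// once the background was erased.
struct CameraView {
    int cameraId;
    std::size_t regionCount;
};

struct OcclusionReport {
    bool occluded;              // true when the camera views disagree
    std::size_t expectedItems;  // largest count seen by any camera
};

// Flags a possible occlusion when cameras report different item counts.
OcclusionReport checkOcclusion(const std::vector<CameraView>& views) {
    assert(!views.empty());
    std::size_t lowest = views.front().regionCount;
    std::size_t highest = lowest;
    for (const CameraView& v : views) {
        lowest = std::min(lowest, v.regionCount);
        highest = std::max(highest, v.regionCount);
    }
    return OcclusionReport{lowest != highest, highest};
}
```

With camera one seeing three items and camera two seeing two, the report is flagged and three items are expected. Matching counts do not prove that nothing is hidden, so this is only a first test.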
Once each item has been resolved for each of its views, the system should start
a recognition thread for each item. I recommend a thread pool with a limit on
the maximum number of threads.
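One way to read the pooled recommendation is a fixed set of workers that each claim the next unprocessed item. The sketch below uses the standard `std::thread` and `std::atomic` facilities. `recognizeAll` and its `recognize` callback are hypothetical stand-ins for the real recognition routine.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs recognize(i) once for every item index, using at most maxThreads
// worker threads. recognize stands in for the real recognition routine.
void recognizeAll(std::size_t itemCount, std::size_t maxThreads,
                  const std::function<void(std::size_t)>& recognize) {
    assert(maxThreads > 0);
    std::atomic<std::size_t> next{0};

    auto worker = [&]() {
        // Each worker claims the next unprocessed item until none remain.
        for (std::size_t i = next.fetch_add(1); i < itemCount;
             i = next.fetch_add(1)) {
            recognize(i);
        }
    };

    std::vector<std::thread> pool;
    const std::size_t workers = std::min(itemCount, maxThreads);
    for (std::size_t t = 0; t < workers; ++t) pool.emplace_back(worker);
    for (std::thread& th : pool) th.join();  // wait for every item to finish
}
```

Since customers typically bring fewer than five items, a small limit such as two to four threads would probably be enough. That number is a guess until the system is profiled.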
Once the recognition threads have completed, the result should be checked to
ensure all items have been found. If not, the cashier should be told to move
the items around. The cashier might have to move them so that more space is
between them, or unstack them so that one item does not occlude another. Just
as a punishment to the consumer, we could embarrass them with a naughty buzz.
It will eventually become common custom for customers to yell at one another
rather than at the poor sales clerk. “Hey, fix your occlusion!”
The recognition thread, figure C, will do most of the
complex work for the system. There will be one thread per item. The 3D image
generation occurs at this level. Its functional correctness is critical to the
success of the system. It generates a series of X, Y, Z points from the camera
views.
3D image generation from a digital source is relatively new. Many current
research studies are looking for the best method. However, most of these
studies deal with environments that are unconstrained; that is, the
environment’s background varies. For this project, many facts about the
environment are known and can be controlled, which will have a direct effect on
performance. For example, just having a known background will automatically
isolate the specific objects we want to recognize, so the system will only work
with items that need to be scanned. Secondly, since the camera angles of our
setup will be known, finding and matching up significant points will be much
faster. I am investigating the article “3D Object Modeling and Recognition
Using Local Affine-Invariant Image” by Fred Rothganger, Svetlana Lazebnik,
Cordelia Schmid and Jean Ponce for possible modifications to suit this fixed
environment, perhaps using a weighted triangle reduction as a preprocess for
the recognition engine.
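To show why known camera geometry helps, here is a sketch of the simplest case: two parallel, calibrated cameras, where depth follows from the standard relation Z = f * B / d (focal length times baseline over disparity). The parallel setup is my assumption for illustration and is not the method from the article above. `StereoRig` and `triangulate` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical calibration for two parallel cameras mounted side by side.
struct StereoRig {
    double focalLengthPx;  // focal length, in pixels
    double baselineMm;     // distance between the two camera centers
    double centerX;        // principal point, in pixels
    double centerY;
};

struct Point3D { double x, y, z; };

// Triangulates one matched point seen at (xLeft, y) in the left image and
// (xRight, y) in the right image. Results are in millimeters, measured
// from the left camera.
Point3D triangulate(const StereoRig& rig, double xLeft, double xRight, double y) {
    const double disparity = xLeft - xRight;
    assert(disparity > 0.0);  // the point must lie in front of both cameras
    const double z = rig.focalLengthPx * rig.baselineMm / disparity;
    const double x = (xLeft - rig.centerX) * z / rig.focalLengthPx;
    const double yOut = (y - rig.centerY) * z / rig.focalLengthPx;
    return Point3D{x, yOut, z};
}
```

With the rig fixed above the counter, the calibration values are measured once at installation, so each matched point costs only a few multiplications.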
Next, the GeneralShape object will determine the object’s form. This is
achieved by analyzing the points gathered from the 3D image. The analysis will
determine sub-shape characteristics of the product. A typical characteristic
might be that the item is narrow and closed at the top while it has a large
base at the bottom, which suggests that it might be a bottle. Or perhaps it is
box shaped, which might lead to the conclusion that it is a box of candy. Most
of the analysis routines will use the distance between two or more points to
determine the existence of these characteristics, which is why the 3D image is
so important.
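The narrow-top, wide-base test could be sketched as below. It assumes the 3D points use z as height above the counter and compares the width of the top quarter with that of the bottom quarter. The `Shape` values, the quarter-height bands, and the 0.5 ratio are illustrative choices.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Point3D { double x, y, z; };  // z is height above the counter (assumed)

enum class Shape { Bottle, Box };

// Horizontal extent (along x) of the points whose height lies in [zLo, zHi].
double widthInBand(const std::vector<Point3D>& pts, double zLo, double zHi) {
    bool found = false;
    double lo = 0.0, hi = 0.0;
    for (const Point3D& p : pts) {
        if (p.z < zLo || p.z > zHi) continue;
        if (!found) { lo = hi = p.x; found = true; }
        lo = std::min(lo, p.x);
        hi = std::max(hi, p.x);
    }
    return found ? hi - lo : 0.0;
}

// Compares the width of the top quarter with the width of the bottom quarter.
Shape classify(const std::vector<Point3D>& pts) {
    assert(!pts.empty());
    double zMin = pts.front().z, zMax = zMin;
    for (const Point3D& p : pts) {
        zMin = std::min(zMin, p.z);
        zMax = std::max(zMax, p.z);
    }
    const double height = zMax - zMin;
    const double bottom = widthInBand(pts, zMin, zMin + 0.25 * height);
    const double top = widthInBand(pts, zMax - 0.25 * height, zMax);
    // A narrow top over a wide base suggests a bottle (0.5 is illustrative).
    return (top < 0.5 * bottom) ? Shape::Bottle : Shape::Box;
}
```

Only a handful of points is needed for this kind of test. A real GeneralShape object would add more characteristics and more shape categories than these two.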
Knowing the general shape of an object will greatly reduce
the amount of data that has to be searched. As well, some objects have
properties that will allow the Emblem object to perform more efficiently. For
example, on most bottles, the label is near the upper half of the bottle.
The Emblem object will scan the 3D Image object to find unique characteristics
incorporated into the package design. This could be the company logo, lettering
type, a UPC box, a specific gradient pattern on the item, or other types of
visual distinctions. Since multiple types of emblems will be stored in the
database for each item, the chance of a match will be greater even when the
object’s back is facing the camera.
The last criterion gathered from the 3D image will be a color sample. The color
sample is a standard deviation of n * n pixels taken across the entire area of
the object. Using the overall color of the object, or the top two colors, will
also reduce the amount of data that has to be matched. For example, a bottle
that is mostly black and red would probably be a Coca-Cola product.
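As a sketch of the color sample, the routine below computes the mean and standard deviation of one n * n block from a single color channel. It assumes the channel is a flat, row-by-row byte array. `ColorSample` and `sampleBlock` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ColorSample {
    double mean;
    double stddev;
};

// Statistics for the n * n block whose top-left corner is (x0, y0) in a
// single-channel image stored row by row with the given width.
ColorSample sampleBlock(const std::vector<std::uint8_t>& channel, std::size_t width,
                        std::size_t x0, std::size_t y0, std::size_t n) {
    assert(n > 0 && width > 0);
    assert(x0 + n <= width);
    assert((y0 + n) * width <= channel.size());

    double sum = 0.0, sumSq = 0.0;
    for (std::size_t y = y0; y < y0 + n; ++y) {
        for (std::size_t x = x0; x < x0 + n; ++x) {
            const double v = channel[y * width + x];
            sum += v;
            sumSq += v * v;
        }
    }
    const double count = static_cast<double>(n * n);
    const double mean = sum / count;
    // Clamp at zero to guard against tiny negative rounding errors.
    const double variance = std::max(0.0, sumSq / count - mean * mean);
    return ColorSample{mean, std::sqrt(variance)};
}
```

Running this over a grid of blocks for each of the red, green, and blue channels would give a compact signature that can be stored with the product record.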
The final stage of the recognition thread will be to search the database. The
product information will be stored in a relational database such as MSDE, but
efficient key searching must be maintained. The database will contain the size,
general shape, overall colors, color sampling, and emblem samples. If the
developer so chooses, a specialized structure may be used. I recommend using
compiled SQC or direct ODBC calls for efficiency. The primary key should be the
general shape, followed by overall color and size. I believe this will return a
small dataset that can then be matched for likeness using the emblem and color
samples. The likeness compare is processor intensive, as every record returned
in the dataset has to be compared and graded for statistical likeness. The top
one will be returned.
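The key search followed by likeness grading might look like the sketch below. An in-memory vector stands in for the database. Emblem comparison is left out, and likeness is graded on color samples alone. The record layout, the size tolerance, and all names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical product row; a real system would read these from the database.
struct ProductRecord {
    std::string name;
    std::string shape;      // general shape key, e.g. "Bottle"
    std::string mainColor;  // overall color key
    double heightMm;        // size key
    std::vector<double> colorSamples;
};

struct Observation {
    std::string shape;
    std::string mainColor;
    double heightMm;
    std::vector<double> colorSamples;
};

// Sum of absolute differences: a smaller value means more alike.
double likenessDistance(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) d += std::fabs(a[i] - b[i]);
    return d;
}

// Returns the index of the most alike record, or -1 when no key matches.
int bestMatch(const std::vector<ProductRecord>& db, const Observation& obs,
              double sizeToleranceMm = 15.0) {
    int best = -1;
    double bestDistance = 0.0;
    for (std::size_t i = 0; i < db.size(); ++i) {
        const ProductRecord& r = db[i];
        // Key search: shape, overall color, then size within a tolerance.
        if (r.shape != obs.shape || r.mainColor != obs.mainColor) continue;
        if (std::fabs(r.heightMm - obs.heightMm) > sizeToleranceMm) continue;
        // Likeness grading on the records that survive the key search.
        const double d = likenessDistance(r.colorSamples, obs.colorSamples);
        if (best < 0 || d < bestDistance) {
            best = static_cast<int>(i);
            bestDistance = d;
        }
    }
    return best;
}
```

The point of the ordering is that the expensive grading only runs on the few records that pass the cheap key tests.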
After the recognition thread completes, the database should be searched for
specials on the item (see figure D). This information could be stored along
with the item’s description in the database. If the flag is set for a special
two-for deal, the price should be updated to reflect this.
Next the total should appear on the cashier’s screen and customer’s screen. The
appropriate functions for entering the amount received from the customer should
be readily available as described in the interface requirements section of this
document.
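The special-price update could be sketched as follows. It assumes the special means a fixed price for each pair of the same item, with any odd item charged at the regular price. `PricedItem` and `lineTotalCents` are hypothetical names, and amounts are kept in cents to avoid rounding.

```cpp
#include <cassert>

// Hypothetical pricing fields stored with the item's description.
struct PricedItem {
    long unitPriceCents;    // regular price for one item
    bool twoForSpecial;     // flag set when the special applies
    long twoForPriceCents;  // price for a pair under the special
};

// Total for one kind of item: pairs get the special price when the flag is
// set, and any leftover single item is charged at the regular price.
long lineTotalCents(const PricedItem& item, int quantity) {
    assert(quantity >= 0);
    if (!item.twoForSpecial) return item.unitPriceCents * quantity;
    const int pairs = quantity / 2;
    const int singles = quantity % 2;
    return pairs * item.twoForPriceCents + singles * item.unitPriceCents;
}
```

Because every recognized item is counted, the quantity passed in reflects what is actually on the counter.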
One important feature of this system is that the inventory for specific items
will be more accurate. Currently, when the customer chooses products based on a
two for one deal or a special price, the cashier most times scans only one of
the items while placing the register in a special sale mode. With the new
system, the inventory for both specific items will be updated rather than just
the one.
So now with this model the customer’s experience would be:
In conclusion, general shape, emblem, and general color are the criteria that
would be used to cross-reference a database, much as a UPC code is used today.
The key matching will not be done on an exact basis but on “likeness”, using a
statistical grading system. The system of course will have to exist in software
first. Then I would use a system profiler to decide which components can be
built in hardware. I believe NVIDIA or Intel may be able to place some of the
image recognition algorithms in hardware; after all, they are really just image
processing routines. Remember, it must run in real time, and with the process I
described above, it can be a reality.
1. Trace edges of the image and find each distinct object in the view. This
would make a great object in C++. It will be a set of routines that, given a
static background, will locate the differences from that background. Separate
each pixel of the sub-image into a corresponding color bitmap buffer. Place
each item into an STL vector or a container of the programmer’s choice. Inherit
a COM model that permits ATL collections for ease of interoperability. The
object must also be able to run as a thread. For ease of use I will call it
ImageProcessor. It must also decide from the light, shadow, and edges whether
two objects are on top of one another, and flag this with a property inside the
STL vector.
2. Write an object that gathers information from a digital camera. We will call
it Camera. Provide object connections between ImageProcessor and Camera; that
is, the image data must be in its own object. This image data must be
compatible throughout the system. This is described in 3.
3. Write an object that can hold an image and perform color balancing on it. It
must also be able to perform color balancing on certain sections of the image,
including odd shapes, without affecting the other parts. It should support
saturation, contrast, sharpness, edge tracing, blurring, and most of the
convolution filters. There may be a third-party tool for this, but I would
prefer owning the code. For simplicity I will call the object Image. The object
must also be able to find the edges of the item and rotate the image so that
the item is facing a normalized, that is predetermined, direction. A set of
routines must be written that decide which part of the item is the top and
which is the bottom. The object must also synchronize the image with its
counterpart image from the other camera. One could instead solve the problem
with the bucket solution by placing north, east, south, and west facing images
in the database.
4. Write an object that will start threads that read a camera. The read from
the camera will report whether anything is in the view. This will be the main
process. It will loop until an object is located. After an object has been
located, all of the other camera objects will be queried for their
ImageProcessor object, so there should be one ImageProcessor object for each
camera.
5. Now that each ImageProcessor object has been retrieved, it can be shipped to
an object that locates each view of an item across the ImageProcessor objects.
That’s multiple pictures of each customer item. It could be called
ObjectMatcher. This could easily be achieved using location information
relative to a known reference point. The ObjectMatcher must also take into
account that some objects may be on top of one another, using the information
gathered from the Image object (see above).
6. Combine these images together to create a three-dimensional map of the
object, or a one-eighty view. This will be the specialized image object used by
the recognition routines. It will use the Image object’s methods to normalize
the color set of the given object. Rotate the images, and size all three using
high quality filters. It will contain information that describes the size of
the item, the overall color of the item (what are the top two colors?), and the
standard deviation of a given area. The sizing can be accomplished by placing a
measurement picture during setup or, better still, by making a grid part of the
counter top so that it is self-calibrating. This object will be called 3DImage.
One thing, however: I know algorithms exist that perform this function already.
The point is that I do not believe we need an enormous amount of points to
decide the GeneralShape described below.
7. The 3DImage object will be shipped first to the GeneralShape object. This
object will decide if the item is a Bag, Bottle, Soda Bottle, Candy Box, Candy
Bag, Cookies, or Other. Using these specific terms, instead of the basic 3D
primitives or a combination thereof, should inform the developer that the
GeneralShape object will be used to describe a higher-level object, a real
world one. Thinking of it in this way will greatly improve the searching
efficiency of the system. The GeneralShape object will have a property that
reports the real world object. Someone will have to run the inventory to
determine the general shapes of items in convenience stores.
8. Next the 3DImage object will be shipped to the EmblemObject, which will
identify product lettering as well as company logos. I believe that a set of
routines can be developed that arrive at the same emblem location each time a
scan is done on an image. That means internally an emblem may not be the actual
company logo, but rather an identifying mark or series of marks that make the
object most unique in color. Perhaps the search set could be reduced by
reducing the emblem to a shape type.
9. Perform color sampling of the 3DImage over an N sized grid. This information
will be used as key matches.
10. A database will exist that contains the size, general shape, and several
associated emblem marks. Scan the database and find the most likely match given
these criteria. Special image compares should be done not for exactness but for
likeness. Each search will reduce the dataset, but statistics on the matching
color sample in step 9 must be kept so that the top matching item is picked. So
the key set will be GeneralShape, EmblemObject, and ColorSamples.
11. Perform this function for each item in the field of view.
12. Use the standard cash register sales technique for reporting funds to
collect, change, etc.
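The self-calibrating sizing mentioned in step 6 could be sketched as a simple scale factor taken from the counter-top grid. This assumes the item is measured in roughly the same plane as the grid and ignores perspective. `GridCalibration` and `itemSizeMm` are hypothetical names.

```cpp
#include <cassert>

// Scale derived from a counter-top grid whose square size is known.
struct GridCalibration {
    double squareSizeMm;  // physical size of one grid square
    double squareSizePx;  // measured size of that square in the image

    double mmPerPixel() const {
        assert(squareSizePx > 0.0);
        return squareSizeMm / squareSizePx;
    }
};

// Converts an item's extent in pixels to millimeters using the grid scale.
double itemSizeMm(const GridCalibration& cal, double itemExtentPx) {
    return itemExtentPx * cal.mmPerPixel();
}
```

Because the grid is always in view, the scale can be measured again on every sale, which is what makes the setup self-calibrating.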

a. Why Three D?
b. What advantages does a system like this have over the current system?
c. Cash Register Functions
d. The existing Cash Register and hardware requirements for the new system.
e. Displaying a map that informs the cashier of occluded items.
f. Running an inventory to show the computer system the objects it will be
working with.
    i. Inventory will be better tracked because the actual item is rung up.
    Sometimes when the cashier has two or more items, they will press the two
    for one and scan only one of the items. At most times the items are
    different inventory items. For example, Coke products.
g. Building the database.
h. Cost of system development
i. Competitors and existing products
j. Integration with current systems.
k. How would this product make money and pay for itself?
l. Rollout and installation requirements
m. Training using onsite and multimedia course.
n. Manager functions for inventory. Historical and statistical based ordering
methods that adapt over the lifetime of the system’s use. Since the inventory
is tracked better, the system can also report which items the manager will
probably have to order.
o. Safe management
p. Oil sales
q. Credit Card and Debit Card processing
r. Feasibility.
s. What resources are needed to complete the project?